Bioinformatics Advance Access originally published online on March 12, 2008
Bioinformatics 2008 24(9):1217-1220; doi:10.1093/bioinformatics/btn092

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

PlasmoGF: an integrated system for comparative genomics and phylogenetic analysis of Plasmodium gene families

Xiang Xu 1,†, Jinyu Wu 2,†, Jian Xiao 1, Yi Tan 1, Qiyu Bao 2,*, Fangqing Zhao 3,* and Xiaokun Li 1,4,*

1School of Pharmaceutical Science/Zhejiang Provincial Key Laboratory of Biotechnology Pharmaceutical Engineering, Wenzhou Medical College, Wenzhou 325035, 2Institute of Biomedical Informatics/Zhejiang Provincial Key Laboratory of Medical Genetics, Wenzhou Medical College, Wenzhou 325000, China, 3Department of Biochemistry and Molecular Biology, Pennsylvania State University, Pennsylvania 16802, USA and 4National Engineering Research Center for Gene Medicine, Jinan University, Guangzhou 510632, PR China

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 GENE-FAMILY CLASSIFICATION
 3 DATABASE CONSTRUCTION AND...
 4 DATABASE CONTENT AND...
 5 COMPARATIVE GENOMICS AND...
 6 PERSPECTIVES
 ACKNOWLEDGEMENTS
 REFERENCES
 

Summary: Malaria, one of the world's most common diseases, is caused by intracellular protozoan parasites of the genus Plasmodium. With the recent availability of several malaria parasite genomes, we established an integrated system named PlasmoGF for comparative genomics and phylogenetic analysis of Plasmodium gene families. Gene families were clustered using the Markov Cluster algorithm implemented in the TribeMCL program and can be searched by keyword, gene-family information, domain composition, Gene Ontology term and BLAST. Moreover, a number of bioinformatics tools were implemented to facilitate the analysis of these putative Plasmodium gene families, including gene retrieval, annotation, sequence alignment, phylogeny construction and visualization. The current version of PlasmoGF contains 8980 gene families derived from six malaria parasite genomes: Plasmodium falciparum, P. berghei, P. knowlesi, P. chabaudi, P. vivax and P. yoelii. Such a highly integrated system should be of great interest to researchers working on malaria parasite phylogenomics.

Availability: PlasmoGF is freely available at http://bioinformatics.zj.cn/pgf/

Contact: xiaokunli@163.net; baoqy@genomics.org.cn; fuz3@psu.edu


    1 INTRODUCTION
Malaria is one of the world's most common diseases; it infects more than half a billion people and is responsible for over one million deaths annually (Snow et al., 2005). The disease is caused by intracellular protozoan parasites of the genus Plasmodium. A better understanding of parasite biology will facilitate efforts to combat malaria. Comparative genomics and phylogenetic analysis have great potential to help us understand the sequence-structure-function correlations of Plasmodium genes and thus to design new drugs against this disease. One of the major findings of early comparative genomics was that the human parasite Plasmodium falciparum shows remarkable synteny with the malaria parasites of rodents, indicating that most features of the different Plasmodium species have been conserved over a long evolutionary process (Kooij et al., 2006). The differences among the malaria parasite genomes were found in the subtelomeric regions, where many multigene families involved in antigenic variation are located (Carlton et al., 2002). Comparative analysis of transcription-associated proteins in P. falciparum revealed that its genome contains a limited number of detectable transcription factors, most of which appear to be species-specific (Coulson et al., 2004).

Since the breakthrough in genomic sequencing technologies, several human-infectious and model malaria parasite genomes have become available, representing one of the richest data sets for a single eukaryotic pathogen. These genome sequences provide new opportunities to gain insight into the roles of different Plasmodium gene families and into their adaptation to parasitic niches in eukaryotic hosts. A useful resource named PlasmoDB has been developed and is continuously updated, providing rapid and convenient access to malaria parasite genes and genomes (Stoeckert et al., 2006). However, there is no comprehensive database that focuses on Plasmodium gene families and also provides online bioinformatics tools to analyze them. Moreover, as the number of malaria parasite genomes increases, it becomes necessary to explore the evolutionary mechanisms of their gene families in order to better understand malaria parasite evolution. Therefore, PlasmoGF, an integrated system for comparative genomics and phylogenetic analysis of Plasmodium gene families, was constructed to help users perform such analyses more efficiently. The web interface of PlasmoGF gives users several easy ways to access gene families and their detailed annotation information. In addition, a number of bioinformatics tools were implemented to facilitate comparative genomics and phylogenetic analysis of Plasmodium genomes.


    2 GENE-FAMILY CLASSIFICATION
A protein family can be defined as a group of molecules that share significant sequence similarity and a common evolutionary history. Clustering proteins into families is usually based on BLAST similarities, as in the algorithms implemented in BLASTClust (from the NCBI BLAST suite) and the cd-hit program (Li and Godzik, 2006). In recent years, the TribeMCL program has proved effective at classifying divergent proteins (Enright et al., 2003) and has been widely used in related studies (Chen et al., 2005; Conte et al., 2007; Lee et al., 2004; Zhang et al., 2007). One of the most important features of this program is that it uses a novel clustering algorithm (Markov Clustering) to overcome obstacles that commonly arise during clustering, such as multi-domain proteins, protein fragments and promiscuous domains in alignments (Enright et al., 2003). To cluster all Plasmodium proteins into families, predicted protein sequences of the malaria parasite genomes were first retrieved from PlasmoDB release 5.4. Then, an all-against-all sequence comparison was carried out using the BLAST program (Altschul et al., 1997). Finally, putative protein families were generated using the TribeMCL program with a stringent inflation value of 5.0.
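The expansion and inflation steps at the heart of Markov Clustering can be illustrated with a minimal sketch. This is not the TribeMCL implementation: the function name `mcl`, the self-loop weighting, the convergence threshold and the protein identifiers are all assumptions made for illustration, and the toy scores stand in for real all-against-all BLAST results. Only the inflation value of 5.0 is taken from the text.

```python
def mcl(hits, inflation=5.0, max_iter=100, tol=1e-9):
    """Toy Markov clustering over (protein_a, protein_b, score) similarity hits."""
    names = sorted({p for a, b, _ in hits for p in (a, b)})
    index = {name: i for i, name in enumerate(names)}
    n = len(names)

    # Symmetric similarity matrix; keep the best score seen for each pair.
    m = [[0.0] * n for _ in range(n)]
    for a, b, score in hits:
        i, j = index[a], index[b]
        m[i][j] = m[j][i] = max(m[i][j], float(score))
    # Assumption: self-loops are weighted like the strongest neighbour.
    for i in range(n):
        m[i][i] = max(m[i]) or 1.0

    def normalise(mat):
        # Rescale so that every column sums to 1 (column-stochastic matrix).
        totals = [sum(mat[r][c] for r in range(n)) for c in range(n)]
        return [[mat[r][c] / totals[c] for c in range(n)] for r in range(n)]

    m = normalise(m)
    for _ in range(max_iter):
        # Expansion: square the matrix (two-step random walks).
        expanded = [[sum(m[r][k] * m[k][c] for k in range(n)) for c in range(n)]
                    for r in range(n)]
        # Inflation: element-wise power, then re-normalise the columns.
        inflated = normalise([[v ** inflation for v in row] for row in expanded])
        delta = max(abs(inflated[r][c] - m[r][c])
                    for r in range(n) for c in range(n))
        m = inflated
        if delta < tol:
            break

    # Rows with a non-zero diagonal act as attractors; their non-zero
    # columns are the members of one cluster.
    clusters = {frozenset(names[c] for c in range(n) if m[r][c] > 1e-6)
                for r in range(n) if m[r][r] > 1e-6}
    return sorted(sorted(c) for c in clusters)


# Hypothetical proteins: one group of three mutual hits, one separate pair.
hits = [
    ("Pf_g1", "Pb_g1", 100), ("Pf_g1", "Py_g1", 100), ("Pb_g1", "Py_g1", 100),
    ("Pf_g2", "Pv_g2", 100),
]
families = mcl(hits)
print(families)
```

In this toy input the two groups share no hits, so they are reported as two separate families. Higher inflation values generally yield smaller, tighter clusters, which is why a value of 5.0 is described above as stringent.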


    3 DATABASE CONSTRUCTION AND CONTENT
PlasmoGF was developed with the integrated pipeline we previously built for ArchaeaTF (Wu et al., 2008), which relies on open-source software such as MySQL, PHP, Apache and Perl. In brief, the processed data are stored in a MySQL database. PHP is used to connect to the database and generate dynamic HTML pages, and Apache serves as the web server. Programs for manipulating and presenting malaria parasite data are written in Perl with BioPerl modules. The whole system runs on the Linux operating system.


    4 DATABASE CONTENT AND DATA RETRIEVAL
Eight completely sequenced malaria parasite genomes are available. At present, PlasmoGF holds 8980 gene families clustered from six of them: P. falciparum, P. berghei, P. knowlesi, P. chabaudi, P. vivax and P. yoelii. Summary information on the Plasmodium gene families with respect to family size is shown in Figure 1. The other two genomes, P. gallinaceum and P. reichenowi, are not included in the current release because their annotation data are unavailable; they will be incorporated into PlasmoGF in later updates.


Fig. 1. Distribution of Plasmodium family size with respect to the number of genes or gene families. X-axis indicates the number of genes in each gene family. Y-axis indicates the number of genes or gene families corresponding to various family sizes, respectively.
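The two quantities plotted in Figure 1 (families per family size, and genes per family size) can be derived from a family-to-members mapping, as in the following sketch. The function name `size_distribution` and the family and gene identifiers are hypothetical; the actual PlasmoGF data are not reproduced here.

```python
from collections import Counter


def size_distribution(families):
    """Map each family size to (number of families, number of genes)."""
    family_counts = Counter(len(members) for members in families.values())
    return {size: (count, size * count)
            for size, count in sorted(family_counts.items())}


# Hypothetical clustering result: family ID -> member gene IDs.
families = {
    "F1": ["g1", "g2", "g3"],
    "F2": ["g4", "g5"],
    "F3": ["g6", "g7"],
    "F4": ["g8"],
}
distribution = size_distribution(families)
for size, (n_families, n_genes) in distribution.items():
    print(size, n_families, n_genes)
```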

 
On the web interface, users can interact with PlasmoGF in several ways (Fig. 2). They can download all clustered Plasmodium gene families in FASTA format. More importantly, they can retrieve specific gene families of interest through the query system, searching (1) by protein identifier or by keyword from the protein annotation in PlasmoDB, (2) by cluster information, such as cluster ID and cluster size, (3) by a specific combination of Pfam functional domains and (4) by Gene Ontology term. All of these searches can be combined with the logical operators AND, OR and NOT. Additionally, the ViroBLAST program (Deng et al., 2007), which produces output that makes BLAST results easy to parse and navigate, is implemented to let users search gene families by sequence similarity. All search results are shown in table format and can be added to a personal work-set for further operations, such as deletion, modification or download.
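Combining search criteria with AND, OR and NOT amounts to set operations on the family IDs returned by each individual query. The sketch below illustrates the idea only: the record layout, the helper names and the placeholder identifiers (`DOMAIN_A`, `GO_X` and so on) are assumptions and do not reflect the PlasmoGF schema or its PHP/MySQL implementation.

```python
# Hypothetical family records; all field names and values are placeholders.
RECORDS = {
    "F1": {"description": "putative kinase", "size": 4,
           "domains": {"DOMAIN_A"}, "go": {"GO_X"}},
    "F2": {"description": "hypothetical protein", "size": 2,
           "domains": {"DOMAIN_A", "DOMAIN_B"}, "go": {"GO_Y"}},
    "F3": {"description": "kinase-like protein", "size": 1,
           "domains": {"DOMAIN_B"}, "go": {"GO_X"}},
}


def by_keyword(records, word):
    """Families whose description contains the keyword (case-insensitive)."""
    return {fid for fid, rec in records.items()
            if word.lower() in rec["description"].lower()}


def by_min_size(records, minimum):
    """Families with at least `minimum` member genes."""
    return {fid for fid, rec in records.items() if rec["size"] >= minimum}


def by_domain(records, domain):
    """Families annotated with the given domain."""
    return {fid for fid, rec in records.items() if domain in rec["domains"]}


def by_go(records, term):
    """Families annotated with the given Gene Ontology term."""
    return {fid for fid, rec in records.items() if term in rec["go"]}


# AND is set intersection, OR is set union, NOT is set difference.
kinase_and_large = by_keyword(RECORDS, "kinase") & by_min_size(RECORDS, 2)
domain_a_or_go_x = by_domain(RECORDS, "DOMAIN_A") | by_go(RECORDS, "GO_X")
domain_b_not_go_x = by_domain(RECORDS, "DOMAIN_B") - by_go(RECORDS, "GO_X")
print(sorted(kinase_and_large), sorted(domain_a_or_go_x), sorted(domain_b_not_go_x))
```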


Fig. 2. A snapshot of the PlasmoGF. Users can retrieve the data through the query system or BLAST tool. Search results can be saved into a personal work-set to perform further analysis, such as sequence retrieval, multiple sequence alignment, and phylogeny construction. The integrated annotation for each gene is also well organized for users.

 
Detailed annotation information can be obtained by clicking on any protein identifier in the search results (Fig. 2). This annotation information includes (1) basic information, such as ID, function description, sequence length, molecular weight, isoelectric point and cross-links to the PlasmoDB database, (2) genomic structure information, such as chromosomal location, orientation, number of exons and transmembrane domains, (3) Gene Ontology information, such as cellular component, biological process and molecular function, (4) domain organization assigned using Pfam database release 1.69 (Bateman et al., 2004), (5) sequence homologs in several important databases, such as PDB (collected by November 20, 2007), Swiss-Prot release 52.0 and RefSeq release 22 and (6) sequence information, such as nucleotide and protein sequences.


    5 COMPARATIVE GENOMICS AND PHYLOGENETIC ANALYSIS TOOLS
To facilitate the analysis of these putative Plasmodium gene families, a number of bioinformatics tools for comparative genomics and phylogenetic analysis were integrated into the PlasmoGF web interface (Fig. 2). The ClustalW program (Thompson et al., 1994), with many customizable parameters, was implemented to allow users to perform multiple sequence alignments of a specific Plasmodium gene family. It works in two basic modes. In the first, users retrieve data from a primary query and add them to the work-set; by default, ClustalW aligns the sequences in the work-set. Alternatively, users can input their own sequence data (nucleotide or protein) in FASTA format for multiple sequence alignment. For easy display and manipulation of the alignment, the Jalview program (Clamp et al., 2004), which runs as a Java applet, is implemented to provide many related functions, such as coloring amino acids according to their biochemical properties. Moreover, the QuickTree program (Howe et al., 2002), which is based on the neighbor-joining algorithm, is implemented to allow users to construct a phylogenetic tree from the alignment. The tree is visualized using the ATV program, which also runs as a Java applet (Clamp et al., 2004).
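Neighbor-joining operates on a matrix of pairwise distances derived from the multiple sequence alignment. The sketch below computes such a matrix using the uncorrected p-distance (the fraction of differing sites). This measure is an assumption chosen for simplicity, since the distance correction applied by QuickTree is not described here, and the function names and sequences are hypothetical.

```python
def p_distance(seq_a, seq_b):
    """Fraction of differing sites, skipping columns where either row has a gap."""
    compared = differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differing += 1
    return differing / compared if compared else 0.0


def distance_matrix(alignment):
    """Pairwise p-distances for an alignment given as {name: aligned sequence}."""
    names = list(alignment)
    return {(x, y): p_distance(alignment[x], alignment[y])
            for x in names for y in names}


# Hypothetical aligned protein fragments of equal length.
alignment = {
    "seq1": "MKT-LLV",
    "seq2": "MKTALLV",
    "seq3": "MRTALIV",
}
matrix = distance_matrix(alignment)
for (x, y), d in matrix.items():
    if x < y:
        print(x, y, round(d, 3))
```

A tree-building program would then repeatedly join the pair of sequences or clusters that minimizes the neighbor-joining criterion computed from this matrix.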


    6 PERSPECTIVES
PlasmoGF is an integrated system containing putative gene families identified from malaria parasite genomes, together with their detailed annotation information. Equipped with the ViroBLAST, ClustalW and QuickTree programs, it is a useful platform that enables users to perform comparative genomics and phylogenetic analysis of each gene family. Continuing efforts will be made to establish its place in the post-genomics era. An important way to increase its usefulness is to respond promptly to users' questions, comments and suggestions. To facilitate this process, a dedicated page on the website collects and displays user feedback. Furthermore, when new genomes are fully sequenced and annotated, their genomic data and corresponding classified gene families will be incorporated into PlasmoGF. In future releases of PlasmoGF, we will build in more phylogenetic analysis tools and add graphical visualization of comparison results. Finally, we expect PlasmoGF to serve as a valuable resource for obtaining new insights into Plasmodium genomes.


    ACKNOWLEDGEMENTS
This work was supported by the National Natural Science Foundation of China (30600768), the Program of New Century Excellent Talents in University (Li XK) and Zhejiang Provincial Program for the Cultivation of High-level Innovative Health talents (Li XK).

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Jonathan Wren

†The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.

Received on November 23, 2007; revised on March 3, 2008; accepted on March 4, 2008

    REFERENCES

    Altschul SF, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res (1997) 25:3389–3402.

    Bateman A, et al. The Pfam protein families database. Nucleic Acids Res (2004) 32:D138–D141.

    Carlton JM, et al. Genome sequence and comparative analysis of the model rodent malaria parasite Plasmodium yoelii yoelii. Nature (2002) 419:512–519.

    Chen Y, et al. SPD–a web-based secreted protein database. Nucleic Acids Res (2005) 33:D169–D173.

    Clamp M, et al. The Jalview Java alignment editor. Bioinformatics (2004) 20:426–427.

    Conte MG, et al. GreenPhylDB: a database for plant comparative genomics. Nucleic Acids Res (2007).

    Coulson RM, et al. Comparative genomics of transcriptional control in the human malaria parasite Plasmodium falciparum. Genome Res (2004) 14:1548–1554.

    Deng W, et al. ViroBLAST: a stand-alone BLAST web server for flexible queries of multiple databases and user's datasets. Bioinformatics (2007) 23:2334–2336.

    Enright AJ, et al. Protein families and TRIBES in genome sequence space. Nucleic Acids Res (2003) 31:4632–4638.

    Howe K, et al. QuickTree: building huge Neighbour-Joining trees of protein sequences. Bioinformatics (2002) 18:1546–1547.

    Kooij TW, et al. Plasmodium post-genomics: better the bug you know? Nat Rev Microbiol (2006) 4:344–357.

    Lee DA, et al. EyeSite: a semi-automated database of protein families in the eye. Nucleic Acids Res (2004) 32:D148–D152.

    Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics (2006) 22:1658–1659.

    Snow RW, et al. The global distribution of clinical episodes of Plasmodium falciparum malaria. Nature (2005) 434:214–217.

    Stoeckert CJ Jr, et al. PlasmoDB v5: new looks, new genomes. Trends Parasitol (2006) 22:543–546.

    Thompson JD, et al. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res (1994) 22:4673–4680.

    Wu J, et al. ArchaeaTF: an integrated database of putative transcription factors in Archaea. Genomics (2008) 91:102–107.

    Zhang W, et al. SynDB: a Synapse protein DataBase based on synapse ontology. Nucleic Acids Res (2007) 35:D737–D741.

