Bioinformatics Advance Access originally published online on March 12, 2008
Bioinformatics 2008 24(9):1217-1220; doi:10.1093/bioinformatics/btn092
PlasmoGF: an integrated system for comparative genomics and phylogenetic analysis of Plasmodium gene families


1School of Pharmaceutical Science/Zhejiang Provincial Key Laboratory of Biotechnology Pharmaceutical Engineering, Wenzhou Medical College, Wenzhou 325035, 2Institute of Biomedical Informatics/Zhejiang Provincial Key Laboratory of Medical Genetics, Wenzhou Medical College, Wenzhou 325000, China, 3Department of Biochemistry and Molecular Biology, Pennsylvania State University, Pennsylvania 16802, USA and 4National Engineering Research Center for Gene Medicine, Jinan University, Guangzhou 510632, PR China
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Summary: Malaria, one of the world's most common diseases, is caused by intracellular protozoan parasites of the genus Plasmodium. With the recent availability of several malaria parasite genomes, we established an integrated system named PlasmoGF for comparative genomics and phylogenetic analysis of Plasmodium gene families. Gene families were clustered using the Markov Cluster algorithm implemented in the TribeMCL program and can be searched by keyword, gene-family information, domain composition, Gene Ontology term and BLAST. In addition, a number of bioinformatics tools were implemented to facilitate the analysis of these putative Plasmodium gene families, including gene retrieval, annotation, sequence alignment, phylogeny construction and visualization. The current version of PlasmoGF contains 8980 gene families derived from six malaria parasite genomes: Plasmodium falciparum, P. berghei, P. knowlesi, P. chabaudi, P. vivax and P. yoelii. Such a highly integrated system should be of great interest to researchers working on malaria parasite phylogenomics.
Availability: PlasmoGF is freely available at http://bioinformatics.zj.cn/pgf/
Contact: xiaokunli{at}163.net; baoqy{at}genomics.org.cn; fuz3{at}psu.edu
| 1 INTRODUCTION |
|---|
Malaria is one of the world's most common diseases: it infects more than half a billion people and is responsible for over one million deaths annually (Snow et al., 2005). The disease is caused by intracellular protozoan parasites of the genus Plasmodium. A better understanding of parasite biology will support efforts to combat malaria. Comparative genomics and phylogenetic analysis have great potential to help us understand the sequence-structure-function correlations of Plasmodium genes and thus to design new drugs against this disease. One of the major findings of early comparative genomics was that the human parasite Plasmodium falciparum shows remarkable synteny with the malaria parasites of rodents, indicating that the majority of features of the different Plasmodium species have been conserved over a long evolutionary process (Kooij et al., 2006). The differences among the malaria parasite genomes were found in the subtelomeric regions, where many multigene families involved in antigenic variation are located (Carlton et al., 2002). Comparative analysis of transcription-associated proteins in P. falciparum revealed that its genome contains a limited number of detectable transcription factors, most of which appear to be species-specific (Coulson et al., 2004).
Since the breakthrough in genomic sequencing technologies, several human-infectious and model malaria parasite genomes have become available, making this one of the richest data sets for a single group of eukaryotic pathogens. These genome sequences provide new opportunities to gain insight into the roles of different Plasmodium gene families and into their adaptation to parasitic niches in eukaryotic hosts. A useful resource named PlasmoDB has been developed and is continuously updated, providing rapid and convenient access to malaria parasite genes and genomes (Stoeckert et al., 2006). However, there is no comprehensive database that focuses on Plasmodium gene families and also provides online bioinformatics tools to analyze them. Moreover, as the number of malaria parasite genomes increases, it becomes necessary to explore the evolutionary mechanisms of their gene families in order to better understand malaria parasite evolution. We therefore constructed PlasmoGF, an integrated system for comparative genomics and phylogenetic analysis of Plasmodium gene families, to help users perform such analyses more efficiently. The web interface of PlasmoGF gives users several easy ways to access gene families and their detailed annotation information. In addition, a number of bioinformatics tools were implemented to facilitate comparative genomics and phylogenetic analysis of Plasmodium genomes.
| 2 GENE-FAMILY CLASSIFICATION |
|---|
Protein families can be defined as groups of molecules that share significant sequence similarity and a common evolutionary history. Proteins are usually clustered into families on the basis of sequence similarity, using algorithms such as those implemented in BLASTClust (from the NCBI BLAST suite) and the cd-hit program (Li and Godzik, 2006). In recent years, the TribeMCL program has proved effective at classifying divergent proteins (Enright et al., 2003) and has been widely used in related studies (Chen et al., 2005; Conte et al., 2007; Lee et al., 2004; Zhang et al., 2007). One of its most important features is that it uses the Markov Clustering algorithm to overcome common obstacles in the clustering process, such as multi-domain proteins, protein fragments and promiscuous domains (Enright et al., 2003). To cluster all Plasmodium proteins into families, predicted protein sequences of the malaria parasite genomes were first retrieved from PlasmoDB release 5.4. An all-against-all sequence comparison was then carried out using the BLAST program (Altschul et al., 1997). Finally, putative protein families were generated using the TribeMCL program with a stringent inflation value of 5.0.
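To make the role of the inflation parameter concrete, the following is a minimal sketch of the Markov Clustering idea in plain Python. It is not the TribeMCL code: the function names, the binary hit matrix and the convergence settings are assumptions made for illustration, and a real run would typically weight edges by BLAST scores or E-values rather than by presence or absence of a hit. The sketch alternates expansion (squaring the column-stochastic matrix) with inflation (raising entries to a power and renormalizing) and then reads clusters from the rows that still carry non-zero flow.

```python
def normalize_columns(matrix):
    """Scale every column so that its entries sum to 1."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] for j in range(n)] for i in range(n)]


def markov_cluster(similarity, inflation=5.0, max_iter=100, tol=1e-9):
    """Cluster nodes of a symmetric similarity matrix (illustrative MCL sketch).

    similarity: square list of lists; entry [i][j] > 0 means proteins i and j
    produced a hit in the all-against-all comparison (hypothetical input).
    """
    n = len(similarity)
    m = [[float(x) for x in row] for row in similarity]
    for i in range(n):
        m[i][i] = 1.0  # self-loops keep every column sum positive
    m = normalize_columns(m)

    for _ in range(max_iter):
        # Expansion: one step of flow spreading (matrix squared).
        expanded = [
            [sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]
        # Inflation: strengthen strong flows, weaken weak ones, renormalize.
        inflated = normalize_columns(
            [[x ** inflation for x in row] for row in expanded]
        )
        change = max(
            abs(inflated[i][j] - m[i][j]) for i in range(n) for j in range(n)
        )
        m = inflated
        if change < tol:
            break

    # Rows that still hold flow act as attractors; their non-zero columns
    # are the members of one cluster.
    clusters = {frozenset(j for j in range(n) if m[i][j] > 1e-6) for i in range(n)}
    clusters.discard(frozenset())
    return sorted(sorted(c) for c in clusters)
```

A higher inflation value sharpens the contrast between strong and weak flows, which generally yields smaller, tighter families; this is the sense in which an inflation of 5.0 is a stringent setting.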
| 3 DATABASE CONSTRUCTION AND CONTENT |
|---|
PlasmoGF was developed using our previous integrated pipeline for ArchaeaTF (Wu et al., 2008), which is built on several open-source software packages, including MySQL, PHP, Apache and Perl. In brief, the processed data are stored in a MySQL database. PHP is used to connect to the database and produce dynamic HTML pages. Apache serves as the web server. Programs for manipulating and presenting the malaria parasite data are written in Perl with BioPerl modules. All of these components run on the Linux operating system.
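The actual PlasmoGF schema is not described here, so the following is only a hypothetical sketch of how proteins and their family assignments could be stored and queried relationally. It uses Python's built-in sqlite3 module as a stand-in for MySQL; the table name, column names and demo records are all invented for illustration.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical protein table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE protein (
            protein_id  TEXT PRIMARY KEY,
            species     TEXT NOT NULL,
            description TEXT,
            family_id   INTEGER NOT NULL
        )
        """
    )
    # Made-up records; these are not real PlasmoDB identifiers.
    conn.executemany(
        "INSERT INTO protein VALUES (?, ?, ?, ?)",
        [
            ("demo_protein_a", "P. falciparum", "hypothetical protein", 1),
            ("demo_protein_b", "P. vivax", "hypothetical protein", 1),
            ("demo_protein_c", "P. yoelii", "hypothetical protein", 2),
        ],
    )
    return conn


def proteins_in_family(conn, family_id):
    """Return the protein identifiers assigned to one family."""
    rows = conn.execute(
        "SELECT protein_id FROM protein WHERE family_id = ? ORDER BY protein_id",
        (family_id,),
    )
    return [row[0] for row in rows]


def family_sizes(conn):
    """Return (family_id, member_count) pairs, ordered by family_id."""
    rows = conn.execute(
        "SELECT family_id, COUNT(*) FROM protein "
        "GROUP BY family_id ORDER BY family_id"
    )
    return list(rows)
```

In a layout like this, a dynamic page only needs parameterized queries such as the two above to list the members of a family or to summarize family sizes.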
| 4 DATABASE CONTENT AND DATA RETRIEVAL |
|---|
Eight completely sequenced malaria parasite genomes are currently available. At present, PlasmoGF contains 8980 gene families clustered from six of these genomes: P. falciparum, P. berghei, P. knowlesi, P. chabaudi, P. vivax and P. yoelii. Summary information on Plasmodium gene families with respect to family size is shown in Figure 1. The other two malaria parasite genomes, P. gallinaceum and P. reichenowi, are not included in the current release because their annotation data are unavailable; they will be incorporated into PlasmoGF in later updates.
On the web interface, users can interact with PlasmoGF in several ways (Fig. 2). They can download the entire set of clustered Plasmodium gene families in FASTA format. More importantly, they can retrieve specific gene families of interest through the query system, searching (1) by protein identifier or by keyword associated with the protein annotation in PlasmoDB, (2) by cluster information, such as cluster ID and cluster size, (3) by a specific combination of Pfam functional domains and (4) by Gene Ontology term. All of these searches can be combined with the logic operators AND, OR and NOT. Additionally, the ViroBLAST program (Deng et al., 2007), which produces output that makes BLAST results easy to parse and navigate, is implemented to allow users to search gene families by sequence similarity. All search results are shown in table format and can be added to a personal work-set for further operations, such as deletion, modification or download.
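How AND, OR and NOT combine individual searches can be illustrated with set operations. The sketch below is an assumption about how such a query system could work, not a description of the PlasmoGF implementation; the index layout, field names and toy family records are hypothetical.

```python
# Hypothetical toy index: family ID -> searchable attributes.
TOY_INDEX = {
    "family_1": {"keyword": {"kinase"}, "domain": {"domain_A"}, "go": {"term_X"}},
    "family_2": {"keyword": {"kinase"}, "domain": {"domain_B"}, "go": {"term_Y"}},
    "family_3": {"keyword": {"protease"}, "domain": {"domain_A"}, "go": {"term_X"}},
}


def families_matching(index, field, value):
    """Return the set of family IDs whose `field` contains `value`."""
    return {fam for fam, attrs in index.items() if value in attrs.get(field, set())}


def evaluate(index, query):
    """Evaluate a nested query tuple against the index.

    Supported forms:
      ("term", field, value)
      ("AND", q1, q2), ("OR", q1, q2), ("NOT", q)
    """
    op = query[0]
    if op == "term":
        return families_matching(index, query[1], query[2])
    if op == "AND":
        return evaluate(index, query[1]) & evaluate(index, query[2])
    if op == "OR":
        return evaluate(index, query[1]) | evaluate(index, query[2])
    if op == "NOT":
        # NOT is taken relative to all families in the index.
        return set(index) - evaluate(index, query[1])
    raise ValueError(f"unknown operator: {op!r}")
```

For example, `evaluate(TOY_INDEX, ("AND", ("term", "keyword", "kinase"), ("NOT", ("term", "domain", "domain_B"))))` selects the kinase-annotated families that lack `domain_B`.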
Detailed annotation information can be obtained by clicking on any protein identifier in the search results (Fig. 2). This annotation information includes (1) basic information, such as ID, function description, sequence length, molecular weight, isoelectric point and cross-links to the PlasmoDB database, (2) genomic structure information, such as chromosomal location, orientation, number of exons and transmembrane domains, (3) Gene Ontology information, such as cellular component, biological process and molecular function, (4) domain organization assigned using Pfam database release 1.69 (Bateman et al., 2004), (5) sequence homologs in several important databases, such as PDB (collected by November 20, 2007), Swiss-Prot release 52.0 and RefSeq release 22 and (6) sequence information, such as nucleotide and protein sequences.
| 5 COMPARATIVE GENOMICS AND PHYLOGENETIC ANALYSIS TOOLS |
|---|
To facilitate the analysis of these putative Plasmodium gene families, a number of bioinformatics tools for comparative genomics and phylogenetic analysis were integrated into the PlasmoGF web interface (Fig. 2). The ClustalW program (Thompson et al., 1994), with many customizable parameters, allows users to perform multiple sequence alignments of a specific Plasmodium gene family. It works in two basic modes. In the first, users retrieve data from a primary query and add them to the work-set; by default, ClustalW aligns the sequences in the work-set. In the second, users input their own sequence data (nucleotide or protein) in FASTA format for alignment. For easy display and manipulation of the alignment, the Java-applet-based Jalview program (Clamp et al., 2004) is implemented to provide related functions, such as coloring amino acids according to their biochemical properties. Moreover, the QuickTree program (Howe et al., 2002), which is based on the neighbor-joining algorithm, allows users to construct a phylogenetic tree from the alignment. The tree is visualized using the Java-applet-based ATV program (Clamp et al., 2004).
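The neighbor-joining step behind tree construction can be sketched as follows. This is not QuickTree's code; the function name and the five-taxon distance matrix in the usage note are illustrative assumptions. For a distance matrix d over n taxa, the algorithm joins the pair (i, j) that minimizes Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k), then assigns branch lengths from the two taxa to their new parent node. The sketch performs only this first join; a full implementation would replace the pair with the new node and repeat.

```python
def nj_first_join(names, dist):
    """Pick the first pair joined by neighbor-joining and its branch lengths.

    names: list of taxon labels.
    dist:  symmetric distance matrix (list of lists) in the same order.
    Returns (name_i, name_j, length_i, length_j).
    """
    n = len(names)
    if n < 3:
        raise ValueError("neighbor-joining needs at least three taxa")

    totals = [sum(row) for row in dist]  # sum of distances from each taxon

    best = None
    for i in range(n):
        for j in range(i + 1, n):
            # Q criterion: small values mark pairs that are close to each
            # other but far from everything else.
            q = (n - 2) * dist[i][j] - totals[i] - totals[j]
            if best is None or q < best[0]:
                best = (q, i, j)

    _, i, j = best
    # Branch lengths from i and j to the newly created internal node.
    length_i = 0.5 * dist[i][j] + (totals[i] - totals[j]) / (2 * (n - 2))
    length_j = dist[i][j] - length_i
    return names[i], names[j], length_i, length_j
```

With taxa `a` to `e` and the distance rows `[0, 5, 9, 9, 8]`, `[5, 0, 10, 10, 9]`, `[9, 10, 0, 8, 7]`, `[9, 10, 8, 0, 3]` and `[8, 9, 7, 3, 0]`, the function joins `a` and `b` first, with branch lengths 2.0 and 3.0.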
| 6 PERSPECTIVES |
|---|
PlasmoGF is an integrated system containing putative gene families identified from malaria parasite genomes, together with their detailed annotation information. Equipped with the ViroBLAST, ClustalW and QuickTree programs, it has become a useful platform that enables users to perform comparative genomics and phylogenetic analysis of each gene family. We will continue working to establish its niche in the post-genomic era. An important way to increase its usefulness is to respond promptly to users' questions, comments and suggestions. To facilitate this, a dedicated page on the website collects and displays user feedback. Furthermore, when new genomes are fully sequenced and annotated, their genomic data and the corresponding classified gene families will be incorporated into PlasmoGF. In future releases of PlasmoGF, we will add more phylogenetic analysis tools and incorporate graphical visualization of comparison results. Finally, we expect PlasmoGF to serve as a valuable resource for obtaining new insights into Plasmodium genomes.
| ACKNOWLEDGEMENTS |
|---|
This work was supported by the National Natural Science Foundation of China (30600768), the Program of New Century Excellent Talents in University (Li XK) and Zhejiang Provincial Program for the Cultivation of High-level Innovative Health talents (Li XK).
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Jonathan Wren
The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.
Received on November 23, 2007; revised on March 3, 2008; accepted on March 4, 2008
| REFERENCES |
|---|
Altschul SF, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res (1997) 25:3389–3402.
Bateman A, et al. The Pfam protein families database. Nucleic Acids Res (2004) 32:D138–D141.
Carlton JM, et al. Genome sequence and comparative analysis of the model rodent malaria parasite Plasmodium yoelii yoelii. Nature (2002) 419:512–519.
Chen Y, et al. SPD–a web-based secreted protein database. Nucleic Acids Res (2005) 33:D169–D173.
Clamp M, et al. The Jalview Java alignment editor. Bioinformatics (2004) 20:426–427.
Conte MG, et al. GreenPhylDB: a database for plant comparative genomics. Nucleic Acids Res (2007).
Coulson RM, et al. Comparative genomics of transcriptional control in the human malaria parasite Plasmodium falciparum. Genome Res (2004) 14:1548–1554.
Deng W, et al. ViroBLAST: a stand-alone BLAST web server for flexible queries of multiple databases and user's datasets. Bioinformatics (2007) 23:2334–2336.
Enright AJ, et al. Protein families and TRIBES in genome sequence space. Nucleic Acids Res (2003) 31:4632–4638.
Howe K, et al. QuickTree: building huge Neighbour-Joining trees of protein sequences. Bioinformatics (2002) 18:1546–1547.
Kooij TW, et al. Plasmodium post-genomics: better the bug you know? Nat Rev Microbiol (2006) 4:344–357.
Lee DA, et al. EyeSite: a semi-automated database of protein families in the eye. Nucleic Acids Res (2004) 32:D148–D152.
Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics (2006) 22:1658–1659.
Snow RW, et al. The global distribution of clinical episodes of Plasmodium falciparum malaria. Nature (2005) 434:214–217.
Stoeckert CJ Jr, et al. PlasmoDB v5: new looks, new genomes. Trends Parasitol (2006) 22:543–546.
Thompson JD, et al. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res (1994) 22:4673–4680.
Wu J, et al. ArchaeaTF: an integrated database of putative transcription factors in Archaea. Genomics (2008) 91:102–107.
Zhang W, et al. SynDB: a Synapse protein DataBase based on synapse ontology. Nucleic Acids Res (2007) 35:D737–D741.