Evolution and Phylogenetics
Phylogenetic reconstruction from non-genomic data
1 School of Knowledge Science, Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Nomi Ishikawa 923-1292, Japan
2 Algorithms, Bioinformatics, Complexity and Formal Methods Research Group, Technical University of Catalonia E-08034 Barcelona, Spain
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Recent results related to horizontal gene transfer suggest that phylogenetic reconstruction cannot be determined conclusively from sequence data, resulting in a shift from approaches based on polymorphism information in DNA or protein sequence to studies aimed at understanding the evolution of complete biological processes. The increasing amount of available information on metabolic pathways for several species makes it of greater relevance to understand the similarities and differences among such pathways. These similarities can then be used to infer phylogenetic trees not based exclusively in sequence data, therefore avoiding the previously mentioned problems.
Results: In this article, we present a method to assess the structural similarity of metabolic pathways for several organisms. Our algorithms work by using one of the three possible enzyme similarity measures (hierarchical, information content, gene ontology), and one of the two clustering methods (neighbor-joining, unweighted pair group method with arithmetic mean), to produce a phylogenetic tree both in Newick and graphic format. The web server implementing our algorithms is optimized to answer queries in linear time.
Availability: The software is available for free public use on a web server, at the address http://www.jaist.ac.jp/~clemente/cgi-bin/phylo.pl. It is available on demand in source code form for research use to educational institutions, non-profit research institutes, government research laboratories and individuals, for non-exclusive use, without the right of the licensee to further redistribute the source code.
Contact: valiente{at}lsi.upc.edu; clemente{at}jaist.ac.jp
| 1 INTRODUCTION |
|---|
|
|
|---|
As our understanding of horizontal gene transfer (HGT) gradually changes with the publication of new results, it becomes clear that phylogenetic reconstruction cannot be conclusively inferred from sequence information alone. For instance, cases of HGT among species in different phylogenetic domains have been reported (Hall et al., 2005; Kondo et al., 2002), raising questions on whether robust phylogenies can be obtained exclusively from genomic data.
Current knowledge on metabolic pathways, as contained in KEGG (Kanehisa et al., 2006, http://www.genome.jp/kegg/) or MetaCyc (Caspi et al., 2006), is mature enough to effectively determine the evolutionary history of complex biological processes for a set of species by establishing the similarity among the different pathways. Comparison and alignment of pathways have been a research topic of great interest in the last few years (Dandekar et al., 1999; Tohsato et al., 2000; Pinter et al., 2005), as well as reconstruction of phylogenetic trees from similarities among pathways (Forst and Schulten, 1999, 2001; Heymans and Singh, 2003; Clemente et al., 2005).
Co-analysis of phylogeny and metabolic pathways can, from a more general perspective, provide valuable insights into the problem of explaining the appearance and development of complex networks of interacting proteins and chemical molecules (Rison and Thornton, 2002; Ebenhöh et al., 2004). Several theories have been proposed to explain the evolution of such networks [see Lazcano and Miller (1999) for a review], and although current research seems to support the so-called patchwork evolution model (Lazcano and Miller, 1996), it is still unclear whether other biological mechanisms play a significant role in the emergence of metabolic pathways.
In this article, we present a method to establish the structural similarity among a set of pathways for several organisms. Based on the similarity matrix so obtained, we then construct a phylogenetic tree by clustering the set of organisms (Section 2).
Section 3 describes the implementation of our method as a web server, which allows the user to select a set of organisms and pathways, as well as several parameters to control the behaviour of the algorithms. The resulting phylogenetic tree, with branch lengths, is shown both in Newick format and as a graphical image.
In Section 4, we present results for a series of phylogenetic reconstruction experiments that provide additional proof of the validity of our method. Also, we suggest possible uses of the web tool to answer biologically relevant questions.
| 2 APPROACH |
|---|
|
|
|---|
We perform a pseudo-alignment of metabolic pathways from different organisms in order to assess their structural similarity. The pseudo-alignment, which involves both a graph representation of each metabolic pathway and a similarity measure between individual reactions, enzymes and compounds present in the pathway, is a mapping of each reaction in one pathway to the most similar reaction in the other pathway, and similarly for compounds and enzymes. A pseudo-alignment differs from a full alignment in the mapping being not necessarily one-to one.
Metabolic pathways are represented as directed hypergraphs, with the compounds and enzymes as nodes and the reactions activated by the enzymes as hyperarcs (Deville et al., 2003). For instance, the directed hypergraph for the Citric Acid Cycle pathway in the bacterium Escherichia coli consists of 35 nodes and 18 hyperarcs.
The similarity of two metabolic pathways P = (R) and Q = (S), where R, S are sets of enzymatic reactions, is given by
|
| (1) |
|
| (2) |
is a weight parameter with 0
1. The
parameter establishes the relative weight of compound similarity to enzyme similarity in the assessment of enzymatic reaction similarity. Similarity of compounds is 1 for identical compounds and 0 for distinct compounds. Preliminary results based on a more complex similarity measure for compounds have not shown a significant increase in performance, and therefore have not been included.
In order to assess similarity of enzymes, we used three different enzyme similarity measures: hierarchical (Tohsato et al., 2000), information content (Tohsato et al., 2000; Pinter et al., 2005) and gene ontology (Clemente et al., 2005). While the first two measures are directly based on the Enzyme Commision hierarchical classification of enzymes (number of common EC code digits for hierarchical similarity and size of the subtree rooted at the least common ancestor for information content similarity), the gene ontology similarity measure relies on the normalized shortest path distance in the function ontology between the GO concepts representing the enzymes.
In general, our method calculates the similarity of any two sets of objects A and B (compounds, enzymes, reactions) as the normalized sum (divided by |A
B|) of three components: mapping elements belonging exclusively to A to elements of B (A\B
B\A), mapping elements belonging exclusively to B to elements of A (B\A
A\B) and the mapping of common elements to common elements (A
B
B
A). If such mappings are based on normalized distances (as it is the case in our approach), the last term can simply be replaced by |A
B|, the cardinal of the intersection. Our algorithm works in
time, with |E| being the total number of enzymes and |C|the total number of compounds.
Figure 1 presents an example of the pseudo-alignment of the TCA cycle for three organisms. Each reaction in the metabolic pathway is pseudo-aligned with the most similar reaction in the same pathway for a different organism. For instance, reaction R00351
[GenBank]
in Archaeoglobus fulgidus (metabolites: acetyl-CoA, oxaloacetate, citrate; enzymes: citrate synthase) could be aligned with reactions R00362
[GenBank]
(acetate, oxaloacetate, citrate; citrate lyase) or R01323
[GenBank]
[acetate, acetyl-CoA, (3s)-citryl-CoA; citrate CoA-transferase] in Clostridium perfringens. By applying Equation (2), we get sim(R00351, R00362) = (1
)/2 and sim(R00351, R01323) = (4
)/20. The relative weight
can then select whether we want to emphasize more on the enzymatic similarity or the compound similarity between the reactions. In this experiment,
was set to 0.5 and therefore, reaction cpe:R00362 was chosen as the most similar to afu:R00351.
|
| 3 IMPLEMENTATION AND WEB SERVER |
|---|
|
|
|---|
We have implemented the previously described method in a web service that can be used to reconstruct the phylogeny of a set of organisms from the observed similarity of their common metabolic pathways. Information about organisms and pathways is periodically retrieved from KEGG through a Perl script, and then updated into a local SQLite database that contains information on all organisms, pathways, reactions, enzymes and chemical compounds. Since the computation of pathway similarity using our method is computationally expensive, we precalculate and store distances among all enzymes, compounds and reactions in order to provide results in linear time. Currently, our database stores information for 13 organisms, 25 metabolic pathways and >1000 enzymes, compounds and reactions, as well as precalculated distances for 180 000 enzyme pairs and over 1 000 000 reaction pairs.
The user can select among a set of organisms and pathways, three different enzyme similarity measures (hierarchical, information content and gene ontology), two clustering methods (neighbor-joining, unweighted pair group method with arithmetic mean), the choice of full organism names or abbreviations in the output, and the value of the
parameter (relative weight of compounds and enzymes in assessing the similarity of enzymatic reactions). The interface is designed in a way such that, when selecting an organism, all the pathways not present in that organism are disabled, and vice versa. This is to ensure that the alignment of pathways is done in a meaningful way, i.e. we are not aligning against an empty set of reactions.
Once a query is submitted, we calculate the phylogenetic tree for the selected organisms, pathways and parameters, as previously described (Section 2). Results are presented both in Newick and graphic format, including branch lengths. A postscript file of the produced tree can be downloaded by clicking on the displayed image.
| 4 RESULTS AND DISCUSSION |
|---|
|
|
|---|
Previous results on phylogenetic reconstruction from a single metabolic pathway of the set of 73 organisms studied in Heymans and Singh (2003), already showed that our method can outperform previous approaches (Clemente et al., 2005).
We have extended those results by performing a series of additional phylogenetic reconstruction experiments. All experiments presented in this article were conducted with the following parameters: hierarchical enzyme similarity, UPGMA clustering and
= 0.5. This particular choice of parameters was made for simplicity: compounds and enzymes are given the same weight in pathway alignment, and hierarchical enzyme similarity is faster to compute than information content similarity and gene ontology similarity. Previous results (Clemente et al., 2005) show that reconstruction of phylogenetic relationships from metabolic pathways using any of these three enzyme similarity measures produces a similar clustering of related organisms for any value of the
parameter.
The first experiment consisted in the phylogenetic reconstruction of a set of organisms based on the comparison of all of their common metabolic pathways. We have chosen eight organisms, A.fulgidus (afu), C.perfringens (cpe), Haemophilus influenzae (hin), Listeria innocua (lin), Methanocaldococcus jannaschii (mja), Mus musculus (mmu), Neisseria meningitidis (nme) and Rattus norvegicus (rno). As can be seen in Figure 2, the phylogenetic reconstruction based on their 53 common metabolic pathways produced a phylogeny which is very close to the NCBI taxonomy (Wheeler et al., 2000), which is used as a gold standard in this paper.
|
In a second experiment, we have performed the incremental phylogenetic reconstruction of the same set of 8 organisms by adding their 53 common metabolic pathways one by one. The addition of a metabolic pathway does not always have a consequence on the resulting phylogeny and, as illustrated in Figure 3, incremental phylogenetic reconstruction produces only 11, instead of 53, different phylogenies for the set of eight organisms. Figure 3 also shows that second cousins (Shasha et al., 2004) similarity to the NCBI taxonomy increases gradually when adding metabolic pathways until reaching a highest similarity value.
|
A third series of experiments consisted in removing outlier metabolic pathways. We have repeated the incremental phylogenetic reconstruction experiment on the same set of 8 organisms, discarding some of the 53 common metabolic pathways according to one of three criteria in turn (Fig. 4).
- Discard those metabolic pathways where the distance of an organism to its closest neighbour is more than a certain threshold, for a percentage of all organisms. The intuition behind this criteria is to remove pathways where the organisms are too dissimilar, since the mapping among reactions is probably not meaningful (hence the high distance). Best results were obtained for a distance threshold of 0.4 and 10% of the organisms, where the remaining 25 metabolic pathways produced a top second cousins similarity score of 0.714, improving those presented in Figure 3. Notice though how the similarity curve now reaches a maximum after adding 7 pathways, and drops to the previous 0.571 once we add >10 metabolic routes.
- Discard those metabolic pathways where the distance of an organism to its closest neighbour is equal to zero, for a percentage of all organisms. For a threshold of 20% of the organisms, the remaining 37 metabolic pathways produce similar results to the previous criteria.
- Discard those metabolic pathways where either the number of reactions is less than a certain threshold, or one organism has significantly less reactions than the organism with more reactions in the metabolic pathway. A reaction threshold of 0.3 reduces the set to 19 metabolic pathways, with similar results to previous methods.
|
These results show that we can produce phylogenetic trees reasonably close to the correct taxonomy (Fig. 5 presents the best phylogeny obtained so far), provided that enough metabolic pathways are available for our method to use, and that noisy ones are removed by some criteria. We have also shown in previous works that the use of alternative clustering methods (Casasnovas et al., 2006) as well as parameter-tuning (Clemente et al., 2005) can significantly improve performance.
|
In the fourth experiment, we performed the incremental phylogenetic reconstruction of a set of 10 related organisms, 8 photosynthetic bacteria [Anabaena (ana), Gloeobacter violaceus (gvi), Prochlorococcus marinus marinus (pma), Prochlorococcus marinus pastoris (pmm), Prochlorococcus marinus (pmt), Synechocystis (syn), Synechococcus (syw), Thermosynechococcus elongatus (tel)] and two photosynthetic eukaryotes containing chloroplasts [Arabidopsis thaliana (ath), Cyanidioschyzon merolae (cme)] by adding 61 of their 68 common metabolic pathways one by one. (The remaining seven common pathways have no annotated reactions in KEGG.) Chloroplasts descended from cyanobacteria (Martin et al., 2002), and despite HGT of many ancestral cyanobacterial genes to the plant nuclear genome, a selective set of metabolic pathways is maintained in chloroplasts (Wang et al., 2006). As a result, these eight photosynthetic bacteria share a large number of metabolic pathways with the two photosynthetic eukaryotes, which makes it particularly hard to distinguish them by comparison of their metabolic pathways alone. However, as can be seen in Figure 6, the incremental phylogenetic reconstruction based on 61 of their 68 common metabolic pathways produced a phylogeny that separates the photosynthetic eukaryotes from the photosynthetic bacteria, although it is not able to separate the Chroococcales (Synechocystis, Synechococcus, T.elongatus) from the Prochlorales (P.marinus marinus, P.marinus pastoris, P.marinus), the Nostocales (Anabaena), and the Gloeobacterales (G.violaceus).
|
A different set of questions of biological relevance can also be answered using our web server. For instance, experimentation of novel disease treatments on humans is costly, can imply risks for the subject, and might raise ethical issues. The alternative use of model organisms is a common practice in biology to avoid these problems. Identifying appropriate model organisms for a specific experiment can be of great use for the biologist. Our web server can provide this functionality by simply choosing the candidate model organisms and the set of pathways involved in the experiments.
As an example of such use, we performed a set of queries to find a suitable model organism in a hypothetical experiment related to pyruvate kinase deficiency in humans, which is usually related to haemolytic anaemias (Baronciani and Beutler, 1993; Zanella and Bianchi, 2000). Given a set of candidate organisms (E.coli, Caenorhabditis elegans, R.norvegicus, Drosophila melanogaster, plus Homo sapiens), we started by constructing a query with all their common pathways. The result, as seen in Figure 7 (left), is a phylogenetic tree with one cluster for animals, further divided into vertebrate (R.norvegicus, H.sapiens) and non-vertebrate (C.elegans and D.melanogaster), and a second cluster with the bacterium E.coli. This tree corresponds with the NCBI taxonomy for this set of organisms. We then considered those KEGG pathways that contain pyruvate kinase and are common to the candidate organisms, namely: glycolysis, pyruvate metabolism and purine metabolism. Figure 7 (right) presents the produced tree in this case, placing D.melanogaster closer to H.sapiens than R.norvegicus. Although we could not find conclusive confirmation in the literature, this result suggests that organisms with a most similar global metabolism might not necessarily be the best option as model organisms for specific biological experiments.
|
| 5 CONCLUSION |
|---|
|
|
|---|
In this article we have tried to answer the following question:
Is it possible to reconstruct robust phylogenetic trees from non-genomic data, such as metabolic pathways?
We argued that indeed this is the case, by presenting a method to assess the structural similarity of metabolic pathways for several organisms. Based on this similarity, and by using one out of three possible enzyme similarity measures (hierarchical, information content, and gene ontology), and one of two clustering methods (neighbor-joining and UPGMA), we were able to reconstruct phylogenies that are very similar to the NCBI taxonomy. We showed how the use of a large number of pathways, as well as criteria to discard those pathways which might not contribute with valuable information, can further improve the quality of the produced phylogenetic trees.
We also introduced a web service that can be used to answer different questions of biological relevance. Results are returned both in Newick and graphic format.
| Acknowledgments |
|---|
The authors wish to thank Shizuka Uchida for his valuable advice and comments on this article. The authors would also like to acknowledge with thanks the anonymous reviewers, whose comments and criticism have led to a substantial improvement of this article. The research described in this paper was supported by the Institute for Bioinformatics Research and Development (BIRD), Japan Science and Technology Agency (JST), Japan, and by the Spanish CICYT, project GRAMMARS (TIN2004-07925-C03-01). Funding to pay the Open Access publication charges for this article was provided by the Institute for Bioinformatics Research and Development (BIRD), Japan Science and Technology Agency (JST), Japan.
| REFERENCES |
|---|
|
|
|---|
Baronciani, L. and Beutler, E. (1993) Analysis of pyruvate kinase-deficiency mutations that produce nonspherocytic hemolytic anemia. Proc. Natl Acad. Sci. USA, 90, 9, 43244327
Casasnovas, J., Clemente, J.C., Miró-Julià, J., Rosselló, F., Satou, K., Valiente, G. (2006) Fuzzy clustering improves phylogenetic relationships reconstruction from metabolic pathways. Proceedings of the 11th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (in press).
Caspi, R., et al. (2006) MetaCyc: a multiorganism database of metabolic pathways and enzymes. Nucleic Acids Res, . 34, D511D516
Clemente, J., Satou, K., Valiente, G. (2005) Reconstruction of phylogenetic relationships from metabolic pathways based on the enzyme hierarchy and the gene ontology. Genome Inform, . 16, 4555.
Dandekar, T., Schuster, S., Snel, B., Huynen, M., Bork, P. (1999) Pathway alignment: application to the comparative alignment of glycolytic enzymes. Biochem. J, . 343, 115124.
Deville, Y., Gilbert, D., Van Helden, J., Wodak, S.J. (2003) An overview of data models for the analysis of biochemical pathways. Briefings in Bioinformatics, 4, 246259
Ebenhöh, O., Handorf, T., Heinrich, R. (2004) Structural analysis of expanding networks. Genome Inform, . 15, 3545.
Forst, C.V. and Schulten, K. (1999) Evolution of metabolisms: A new method for the comparison of metabolic pathways using genomic information. J. Comput. Biol, . 6, 343360[CrossRef][Web of Science][Medline].
Forst, C.V. and Schulten, K. (2001) Phylogenetic analysis of metabolic pathways. J. Mol. Evol, . 52, 471489[Web of Science][Medline].
Hall, C., Brachat, S., Dietrich, F.S. (2005) Contribution of horizontal gene transfer to the evolution of Saccharomyces cerevisiae. Eukaryot. Cell, 4, 11021115
Heymans, M. and Singh, A.K. (2003) Deriving phylogenetic trees from the similarity analysis of metabolic pathways. Bioinformatics, 19, i138i146[Abstract].
Kanehisa, M., et al. (2006) From genomics to chemical genomics: New developments in KEGG. Nucleic Acids Res, . 34, D354D357
Kondo, N., Nikoh, N., Ijichi, N., Shimada, M., Fukatsu, T. (2002) Genome fragment of Wolbachia endosymbiont transferred to X chromosome of host insect. Proc. Natl Acad. Sci. USA, 99, 1428014285
Lazcano, A. and Miller, S.L. (1996) The origin and early evolution of life: Prebiotic chemistry, the pre-RNA world, and time. Cell, 85, 793798[CrossRef][Web of Science][Medline].
Lazcano, A. and Miller, S.L. (1999) On the origin of metabolic pathways. J. Mol. Evol, . 49, 424431[CrossRef][Web of Science][Medline].
Martin, W., et al. (2002) Evolutionary analysis of Arabidopsis, cyanobacterial, and chloroplast genomes reveals plastid phylogeny and thousands of cyanobacterial genes in the nucleus. Proc. Natl Acad. Sci. USA, 99, 1224612251
Pinter, R.Y., Rokhlenko, O., Yeger-Loterm, E., Ziv-Ukelson, M. (2005) Alignment of metabolic pathways. Bioinformatics, 21, 34013408
Rison, S.C. and Thornton, J.M. (2002) Pathway evolution, structurally speaking. Curr. Opin. Struc. Biol, . 12, 374382[CrossRef][Web of Science][Medline].
Shasha, D., Wang, J.T.-L., Zhang, K. (2004) Unordered tree mining with applications to phylogeny. Proceedings of the 20th International Conference on Data Engineering IEEE Computer Society Press, pp. 708719.
Tohsato, Y., Matsuda, H., Hashimoto, A. (2000) A multiple alignment algorithm for metabolic pathway analysis using enzyme hierarchy. Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology, pp. 376383.
Wang, Z., et al. (2006) Exploring photosynthesis evolution by comparative analysis of metabolic networks between chloroplasts and photosynthetic bacteria. BMC Genomics, 7, 100[CrossRef][Medline].
Wheeler, D.L., et al. (2000) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res, . 28, 1014
Zanella, A. and Bianchi, P. (2000) Red cell pyruvate kinase deficiency: From genetics to clinical manifestations. Best Pract. Res. Cl. Ha, . 13, 5781.
This article has been cited by other articles:
![]() |
A. Mazurie, D. Bonchev, B. Schwikowski, and G. A. Buck Phylogenetic distances are encoded in networks of interacting pathways Bioinformatics, November 15, 2008; 24(22): 2579 - 2585. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||









