Bioinformatics Advance Access originally published online on September 16, 2004
Bioinformatics 2005 21(3):390-392; doi:10.1093/bioinformatics/bti020
Bioinformatics vol. 21 issue 3 © Oxford University Press 2005; all rights reserved.
Clann: investigating phylogenetic information through supertree analyses
Bioinformatics and Pharmacogenomics Laboratory, Department of Biology, National University of Ireland Maynooth, Co. Kildare, Ireland
*To whom correspondence should be addressed.
| Abstract |
|---|
Summary: Clann has been developed in order to provide methods of investigating phylogenetic information through the application of supertrees.
Availability: Clann has been precompiled for Linux, Apple Macintosh and Windows operating systems and is available from http://bioinf.may.ie/software/clann. Source code is available on request from the authors.
Supplementary information: Clann has been written in the C programming language. Source code is available on request.
Contact: chris.creevey{at}may.ie
The aim of constructing phylogenetic supertrees is to combine the information contained in source trees with partially overlapping leaf-sets. Supertree methods can combine the information from trees with no taxa in common, as long as additional trees exist that overlap both. A growing number of methods for supertree construction exist (see Bininda-Emonds et al., 2002 for a review), and there is a need for a tool that permits the exploration of the congruence across the input data and the quality of the hypotheses that are derived from the data. In this manuscript we report one such software product. Some desirable properties of supertrees have been described elsewhere (Wilkinson et al., 2004); however, no method is guaranteed to have all these properties. As a result, we need to explore the data and trees using a variety of methods, each with different properties. This amounts to a sensitivity analysis to examine which hypotheses of relationships are most frequently supported by the different methods and are therefore more likely to be the correct relationships.
At present there are four supertree methods implemented in Clann: Matrix Representation using Parsimony (MRP); Most Similar Supertree (MSSA) (Creevey et al., 2004); Maximum Quartet Fit (QFIT) and Maximum Splits Fit (SFIT). With MRP, the Baum and Ragan coding scheme, which is additive and binary, is used to create a matrix from the set of source trees (Baum, 1992; Ragan, 1992). This matrix consists of rows representing each taxon and columns representing each internal branch from each source tree. Each internal branch of a source tree divides the taxa into two groups (those descended from the branch versus those ancestral to it). Scoring the taxa with either 1 or 0 according to the group in which they are found represents the hypothesis of relationships defined by each internal branch. If a taxon is not present in a source tree, it is scored with a ?. Parsimony analyses are then used to reconstruct the supertree from these data. The parsimony step must be carried out by PAUP* (Swofford, 2002), as Clann writes a NEXUS-formatted file containing the MRP coding scheme and commands for PAUP* to carry out the analysis.
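The coding step described above can be illustrated with a short Python sketch. It is not Clann's implementation (Clann is written in C); it assumes rooted source trees written as nested tuples of taxon names, and the function names `leaves`, `clades` and `mrp_matrix` are hypothetical. Each internal node below the root contributes one column: taxa inside the clade score 1, other taxa in that source tree score 0, and taxa absent from the tree score ?.

```python
def leaves(tree):
    """Return the set of taxon names in a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def clades(tree):
    """Yield the leaf set below every internal node except the root."""
    if isinstance(tree, str):
        return
    for child in tree:
        if not isinstance(child, str):
            yield frozenset(leaves(child))
            yield from clades(child)


def mrp_matrix(source_trees):
    """Build a Baum/Ragan-style matrix: one row per taxon, one column per clade."""
    all_taxa = sorted(set().union(*(leaves(t) for t in source_trees)))
    rows = {taxon: [] for taxon in all_taxa}
    for tree in source_trees:
        present = leaves(tree)
        for clade in clades(tree):
            for taxon in all_taxa:
                if taxon not in present:
                    rows[taxon].append("?")   # taxon missing from this source tree
                elif taxon in clade:
                    rows[taxon].append("1")   # descended from the branch
                else:
                    rows[taxon].append("0")   # in the tree, but outside the clade
    return {taxon: "".join(cells) for taxon, cells in rows.items()}


# Two illustrative source trees with partially overlapping leaf-sets.
trees = [((("A", "B"), "C"), "D"), (("B", "C"), "E")]
for taxon, row in mrp_matrix(trees).items():
    print(taxon, row)
```

The resulting rows would then be written to a NEXUS file for the parsimony step, which this sketch does not attempt.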
The MSSA scoring method compares each source tree separately to the supertree by comparing the path length distance matrix (Steel and Penny, 1993) derived from a source tree to another distance matrix derived from a pruned supertree. The differences between the matrices are scored and the sum of the scores from all the comparisons is calculated. The user can choose to impose several weighting schemes on this score to adjust for the influence of differential tree size. The weighted or un-weighted sum is the score assigned to the supertree. This sum is used as an optimality criterion to determine the supertree that best fits the set of source trees. This method is related to the average consensus method (Lapointe and Cucumel, 1997) with branch lengths set to unity and as such is also related to MRP (Lapointe et al., 2003).
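A minimal Python sketch of this kind of scoring follows. It rests on several assumptions that the text above does not fix: trees are nested tuples of taxon names, path length is counted in edges with a two-way root suppressed, the difference between matrices is taken as the sum of absolute differences, and no weighting scheme is applied. The names `prune`, `path_lengths` and `mssa_score` are hypothetical, and the supertree is assumed to contain every taxon in the source trees.

```python
from itertools import combinations


def leaves(tree):
    """Return the set of taxon names in a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def prune(tree, keep):
    """Restrict a tree to the taxa in `keep`, collapsing single-child nodes."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kept = [p for p in (prune(child, keep) for child in tree) if p is not None]
    if not kept:
        return None
    return kept[0] if len(kept) == 1 else tuple(kept)


def path_lengths(tree):
    """Map each unordered taxon pair to the number of edges between them."""
    dist = {}

    def walk(node, is_root=False):
        if isinstance(node, str):
            return {node: 0}
        child_depths = [
            {leaf: depth + 1 for leaf, depth in walk(child).items()}
            for child in node
        ]
        # A root with two children is an artefact of rooting: treat its two
        # edges as one so that differently rooted trees compare equal.
        root_fix = 1 if is_root and len(node) == 2 else 0
        for left, right in combinations(child_depths, 2):
            for a, da in left.items():
                for b, db in right.items():
                    dist[frozenset((a, b))] = da + db - root_fix
        merged = {}
        for depths in child_depths:
            merged.update(depths)
        return merged

    walk(tree, is_root=True)
    return dist


def mssa_score(supertree, source_trees):
    """Unweighted sum of path-length differences over all source trees."""
    total = 0
    for source in source_trees:
        d_source = path_lengths(source)
        d_super = path_lengths(prune(supertree, leaves(source)))
        total += sum(abs(d_source[pair] - d_super[pair]) for pair in d_source)
    return total


supertree = ((("A", "B"), "C"), ("D", "E"))
sources = [(("A", "B"), ("C", "D")), (("A", "C"), ("B", "E"))]
print(mssa_score(supertree, sources))
```

Under this sketch a lower score indicates a better fit, and a supertree that agrees with every pruned source tree scores zero.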
With both the QFIT and SFIT methods, each source tree is individually compared to a proposed supertree by determining all the quartets (relationships between any four taxa) (QFIT) or splits (components) (SFIT) for both the source tree and the appropriately pruned supertree. A score is then calculated, defined as the number of quartets or splits that are shared between the pruned supertree and the source tree. The sum of the scores calculated for all the source trees is used as an optimality criterion to determine the optimal supertree (the supertree that shares the most quartets or splits with the set of source trees).
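The splits variant of this comparison can be sketched in Python as follows. This is an illustration rather than Clann's code: trees are assumed to be nested tuples of taxon names, only non-trivial splits (at least two taxa on each side) are counted, and the names `splits` and `sfit_score` are hypothetical. A quartet-based score would follow the same pattern with quartets in place of splits.

```python
def leaves(tree):
    """Return the set of taxon names in a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def prune(tree, keep):
    """Restrict a tree to the taxa in `keep`, collapsing single-child nodes."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kept = [p for p in (prune(child, keep) for child in tree) if p is not None]
    if not kept:
        return None
    return kept[0] if len(kept) == 1 else tuple(kept)


def splits(tree):
    """Return the non-trivial splits of a tree.

    Each split is stored as the side that does NOT contain a fixed anchor
    taxon, so the same bipartition always gets the same representation.
    """
    taxa = leaves(tree)
    anchor = min(taxa)
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(walk(child) for child in node))
        side = taxa - clade if anchor in clade else clade
        if len(side) >= 2 and len(taxa) - len(side) >= 2:
            found.add(frozenset(side))
        return clade

    walk(tree)
    return found


def sfit_score(supertree, source_trees):
    """Count splits shared between each source tree and the pruned supertree."""
    total = 0
    for source in source_trees:
        pruned = prune(supertree, leaves(source))
        total += len(splits(source) & splits(pruned))
    return total


supertree = ((("A", "B"), "C"), ("D", "E"))
sources = [(("A", "B"), ("C", "D")), (("A", "C"), ("B", "E"))]
print(sfit_score(supertree, sources))
```

Here a higher score indicates a better fit, in contrast to a distance-based criterion.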
For each of the optimality criteria, several different methods of searching tree-space and analysing the underlying phylogenetic information are implemented in Clann. These methods include complete exhaustive searches of tree-space, heuristic methods of searching tree-space (though not for MRP), methods of bootstrapping the trees to examine the underlying support for any hypothesis and methods for determining whether any phylogenetic signal present in the data is better than would be expected from random data.
Two heuristic algorithms for searching supertree-space are implemented in Clann. They are nearest neighbour interchange (NNI) and sub-tree pruning and re-grafting (SPR) as described and implemented in PAUP* (Swofford, 2002).
Bootstrapping is a statistical technique for empirically estimating the variability in an estimate. It assumes that the samples are independent and identically distributed (Efron, 1979). In a phylogenetic context, bootstrapping allows the estimation of support for a phylogeny. This can be extended to the supertree context as implemented in Clann, by considering the source trees as one possible set of trees that could have been used in the analysis. Choosing a slightly different set of source trees may result in a different optimal supertree. In order to estimate the likely nature of the universe of optimal supertrees, the source trees may be bootstrapped. For each bootstrap replicate, the source trees are sampled with replacement until a new dataset is created with the same number of source trees as the original dataset. This means that some source trees may be represented in the dataset more than once, while others may not be represented at all. For each replicate, the supertree that best represents this (bootstrapped) set of source trees, according to the chosen optimality criterion, is determined. Repeating this procedure a large number of times gives an indication of how much support there is for the clades in a supertree (Purvis, 1995). If a taxon is not represented during any bootstrap replicate (owing to the initial low occurrence of the taxon), the software will alert the user to the unsuitability of the data for bootstrapping and refuse to continue.
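The resampling step can be sketched in Python as below. This is only an illustration of sampling source trees with replacement and of the missing-taxon check; the supertree search applied to each replicate is deliberately left out. Trees are assumed to be nested tuples of taxon names, and `bootstrap_replicates` and its parameters are hypothetical names.

```python
import random


def leaves(tree):
    """Return the set of taxon names in a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def bootstrap_replicates(source_trees, n_reps, seed=0):
    """Yield `n_reps` resampled source-tree sets of the original size.

    Raises ValueError if a replicate loses a taxon entirely, mirroring the
    refusal to continue described in the text.
    """
    rng = random.Random(seed)
    all_taxa = set().union(*(leaves(t) for t in source_trees))
    for _ in range(n_reps):
        # Sample with replacement: some trees may repeat, others drop out.
        sample = [rng.choice(source_trees) for _ in source_trees]
        missing = all_taxa - set().union(*(leaves(t) for t in sample))
        if missing:
            raise ValueError(f"taxa lost in bootstrap replicate: {sorted(missing)}")
        yield sample


trees = [
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
    (("A", "B"), ("C", "D")),
]
for replicate in bootstrap_replicates(trees, n_reps=3):
    print(replicate)
```

Each yielded replicate would be passed to a search under the chosen optimality criterion, and clade frequencies across the resulting supertrees would give the support values.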
A randomization method to test the null hypothesis that the phylogenetic signal in the source trees is no better than random is also implemented in Clann. This test has been implemented for all the supertree methods except MRP, for which a normal Permutation Tail Probability (PTP) test (Archie, 1989; Faith and Cranston, 1991) is available. We have called this method the YAPTP (Yet Another Permutation Tail Probability) test (Creevey et al., 2004). For each repetition of the test, each source tree is replaced with a randomly chosen topology for the same leaf-set. This removes any congruent phylogenetic signal between source trees, while leaving unaltered the numbers and sizes of source trees, the frequency with which any particular taxon is found across the source trees and the frequency of co-occurrence of any group of taxa within source trees. A search of tree-space can then be carried out and the score of the best supertree recorded. The user can repeat this test as many times as required, and the distribution of the resulting scores can be compared to the score of the real data (or the distribution of scores from bootstrapping) to assess whether the real data contain a signal that is better than random. Permutation tests of this kind are extremely forgiving in nature. Passing them may, however, be considered a minimal requirement for any dataset to be considered for further analysis.
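The randomization step, in which each source tree is replaced by a random topology on the same leaf-set, can be sketched in Python as follows. The text does not say how Clann draws its random topologies; this sketch assumes rooted binary trees written as nested tuples and builds each one by attaching taxa one at a time to a uniformly chosen node, which samples rooted binary topologies uniformly. The names `random_topology`, `insert_at` and `randomise_source_trees` are hypothetical.

```python
import random


def leaves(tree):
    """Return the set of taxon names in a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def count_nodes(tree):
    """Count all nodes (internal and leaf) in a nested-tuple tree."""
    if isinstance(tree, str):
        return 1
    return 1 + sum(count_nodes(child) for child in tree)


def insert_at(tree, index, leaf):
    """Attach `leaf` as sister to the node with the given preorder index."""
    if index == 0:
        return (tree, leaf)
    index -= 1
    children = []
    for child in tree:
        size = count_nodes(child)
        if 0 <= index < size:
            children.append(insert_at(child, index, leaf))
        else:
            children.append(child)
        index -= size
    return tuple(children)


def random_topology(taxa, rng):
    """Build a random rooted binary tree on `taxa` by stepwise insertion."""
    ordered = sorted(taxa)
    tree = ordered[0]
    for leaf in ordered[1:]:
        tree = insert_at(tree, rng.randrange(count_nodes(tree)), leaf)
    return tree


def randomise_source_trees(source_trees, rng):
    """Replace every source tree by a random topology on the same leaf-set."""
    return [random_topology(leaves(tree), rng) for tree in source_trees]


rng = random.Random(42)
sources = [(("A", "B"), ("C", "D")), ((("A", "C"), "E"), "B")]
for tree in randomise_source_trees(sources, rng):
    print(tree)
```

Because only the topologies change, the number of source trees, their sizes and the taxa they contain are preserved, as the test requires. Scoring the best supertree for each randomized set and comparing the scores with that of the real data is left to the chosen optimality criterion.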
While both bootstrapping and the YAPTP test provide means of assessing the results of the supertree analysis, such assessments must be regarded within the context of what the supertree analysis was trying to achieve and the methods used to achieve it. For instance, was the goal of the analysis to reconstruct a phylogeny, to test for tree-likeness of the data, to assess the support for particular clades or to reconstruct a historical timeline? Then, how do the methods chosen to carry out these analyses affect the interpretation of the results? Clann provides a necessary tool to help achieve these and other goals in a supertree context.
| Acknowledgments |
|---|
The authors would like to thank Dr Mark Wilkinson for reading the manuscript, the four anonymous referees for their helpful comments and the many users who contributed suggestions and reported bugs in earlier versions of the software.
Received on March 11, 2004; revised on September 3, 2004; accepted on September 6, 2004
| References |
|---|
Archie, J.W. (1989) A randomisation test for phylogenetic information in systematic data. Syst. Zool., 38, 239-252.
Baum, B.R. (1992) Combining trees as a way of combining data sets for phylogenetic inference, and the desirability of combining gene trees. Taxon, 41, 3-10.
Bininda-Emonds, O.R.P., Gittleman, J.L., Steel, M. (2002) The (super)tree of life: procedures, problems and prospects. Ann. Rev. Ecol. Syst., 33, 265-289.
Creevey, C.J., Fitzpatrick, D.A., Philip, G.K., Kinsella, R.J., O'Connell, M.J., Pentony, M.M., Travers, S.A., Wilkinson, M., McInerney, J.O. (2004) Does a tree-like phylogeny only exist at the tips in the prokaryotes? Proc. R. Soc. Lond. B. Biol. Sci., (in press).
Efron, B. (1979) Bootstrap methods: another look at the jackknife. Ann. Stat., 7, 1-26.
Faith, D.P. and Cranston, P.S. (1991) Could a cladogram this short have arisen by chance alone? On permutation tests for cladistic structure. Cladistics, 7, 1-28.
Lapointe, F.-J. and Cucumel, G. (1997) The average consensus procedure: combination of weighted trees containing identical or overlapping sets of taxa. Syst. Biol., 46, 306-312.
Lapointe, F.-J., Wilkinson, M., Bryant, D. (2003) Matrix representations with parsimony or with distances: two sides of the same coin? Syst. Biol., 52, 865-868.
Purvis, A. (1995) A composite estimate of primate phylogeny. Philos. Trans. R. Soc. Lond. B. Biol. Sci., 348, 405-421.
Ragan, M.A. (1992) Matrix representation in reconstructing phylogenetic-relationships among the eukaryotes. Biosystems, 28, 47-55.
Steel, M. and Penny, D. (1993) Distributions of tree comparison metrics - some new results. Syst. Biol., 42, 126-141.
Swofford, D.L. (2002) PAUP*. Phylogenetic Analysis Using Parsimony (*And Other Methods). Version 4. Sunderland, MA: Sinauer Associates.
Wilkinson, M., Thorley, J.L., Pisani, D., Lapointe, F.-J., McInerney, J. (2004) Some desiderata for liberal supertrees. In Bininda-Emonds, O.R.P. (ed.), Phylogenetic Supertrees: Combining Information to Reveal the Tree of Life. Dordrecht: Kluwer Academic.