Bioinformatics Advance Access originally published online on December 1, 2005
Bioinformatics 2006 22(3):363-364; doi:10.1093/bioinformatics/bti798

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

apTreeshape: statistical analysis of phylogenetic tree shape

Nicolas Bortolussi, Eric Durand, Michael Blum and Olivier François*

Team of Mathematical Biology (TIMB), TIMC, Faculty of Medicine F38706 La Tronche, France

*To whom correspondence should be addressed.


    ABSTRACT

Summary: apTreeshape is an R package dedicated to the simulation and analysis of phylogenetic tree topologies using statistical imbalance measures. It is a companion library to the R package ‘ape’ and provides additional functions for reading, plotting and manipulating phylogenetic trees and for connecting to public phylogenetic tree databases. One strength of the package is that it includes appropriate corrections of classical shape statistics as well as new tests based on the statistical theory of likelihood ratios.

Availability: http://cran.r-project.org

Contact: Olivier.Francois{at}imag.fr


    1 INTRODUCTION
The understanding of macroevolutionary processes, such as speciation or extinction, is a major issue in evolutionary biology. It is widely acknowledged that such processes leave their fingerprint on the phylogenetic trees that we reconstruct from extant taxa.

The recent explosion of phylogenetic data has generated a large body of modern analytical methods that rely on stochastic models of tree structure. These methods fall into two classes: temporal and topological. Temporal methods focus on the estimation of diversification rates (Nee, 2001). Topological methods are based on statistical measures of tree imbalance (Mooers and Heard, 1997; Aldous, 2001). Most of them assume null models of tree structure, among which Yule's process (1924) is the most popular.

In this article, we describe the computer package apTreeshape, which is dedicated to the simulation and analysis of phylogenetic tree topologies using statistical indices. It is programmed in the R language (R Development Core Team, 2005) and complements the library ‘ape’ of Paradis et al. (2004), which essentially covers temporal methods. It also provides additional functions for reading, plotting and manipulating phylogenetic trees, and offers immediate web access to public phylogenetic tree databases such as TreeBASE and Pandit (Whelan et al., 2003).

Beyond the software facilities for data analysis and graphical display offered by the R language, apTreeshape includes important corrections to classical shape statistics. One strength of the package is that it presents new tests based on the statistical theory of likelihoods, which provide optimal power for testing null models of macroevolution.


    2 CONTENTS
The functions contained in apTreeshape can be classified into four categories: basic topological manipulation, web-access, simulation and statistical testing.

The basic objects handled by the package are cladograms, i.e. binary trees for which branch lengths have been ignored. They can be read from files in the Newick/Nexus format or converted from objects of the ‘ape’ package. These objects are stored in a class called ‘treeshape’. Objects of class ‘treeshape’ have a dendrogram-like data structure, and they are plotted using methods for dendrograms. Basic topological manipulations are allowed, such as pruning or cutting from a specified internal node. Pruning returns the ancestral part of a tree, while cutting extracts a subtree rooted at a specific node. Subtrees corresponding to a subset of taxa can be extracted from a whole tree as well.

The package apTreeshape has been designed to perform large-scale studies of tree shape from phylogeny databases. For instance, it contains specific functions for accessing TreeBASE and Pandit through R. As an example, the following instructions download the trees with ID numbers 705, 706 and 709 from Pandit and convert them into objects of class ‘treeshape’. Basic summaries and plots can then be obtained directly.

trees<-dbtrees("pandit", c(705,706,709))

summary(trees[[2]]);plot(trees[[2]])

Although apTreeshape deals with fully resolved trees, any phylogeny can be downloaded and converted into a binary tree by resolving its polytomies with a random simulation method.

Simulation methods and Monte Carlo estimates of P-values are central to apTreeshape. The function rtreeshape enables sampling trees from the most usual stochastic models of trees: the equal rate Markov (ERM) and proportional to distinguishable arrangements models (PDA). In the ERM each branch has an equal probability of splitting, whereas the PDA model has the property that all trees are equally likely (Mooers and Heard, 1997). Note that the topology of the ERM model is shared by other models such as the Hey, Moran or coalescent models for which branch lengths can be simulated using the R base package without difficulties. In addition, we implemented the biased-speciation model used by Kirkpatrick and Slatkin (1993), and a universal random generator for branching Markov processes. Solving polytomies makes use of one of the ERM, PDA or biased-speciation models locally.
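The two null models can be illustrated with a short base R sketch. It is not the code behind rtreeshape; the function names (log_ntrees, split_prob, sim_clade_sizes) are hypothetical and introduced only for illustration. The sketch relies on two standard facts: under the ERM the number of tips sent to one side of the root of an n-tip clade is uniform on 1, ..., n-1, and under the PDA it is proportional to the number of labelled trees compatible with that split. Each simulated tree is summarized by the sizes of the clades below its internal nodes.

```r
# Log of the number of rooted binary labelled trees on k tips, (2k-3)!!,
# written with factorials as (2k-2)! / (2^(k-1) (k-1)!)
log_ntrees <- function(k) {
  lfactorial(2 * k - 2) - (k - 1) * log(2) - lfactorial(k - 1)
}

# Probability that the root of an n-tip clade sends i tips to the left,
# for i = 1, ..., n-1 (hypothetical helper, not an apTreeshape function)
split_prob <- function(n, model = c("erm", "pda")) {
  model <- match.arg(model)
  i <- seq_len(n - 1)
  if (model == "erm") {
    return(rep(1 / (n - 1), n - 1))  # uniform split under the ERM
  }
  # PDA: choose(n, i) * T(i) * T(n - i) / (2 * T(n)), computed on the log scale
  exp(lchoose(n, i) + log_ntrees(i) + log_ntrees(n - i) - log(2) - log_ntrees(n))
}

# Sizes of the clades below every internal node of one simulated tree
# (root first); a tree with n tips has n - 1 internal nodes
sim_clade_sizes <- function(n, model = "erm") {
  if (n < 2) return(integer(0))
  left <- sample.int(n - 1, 1, prob = split_prob(n, model))
  c(n, sim_clade_sizes(left, model), sim_clade_sizes(n - left, model))
}

set.seed(42)
erm_sizes <- sim_clade_sizes(20, "erm")
pda_sizes <- sim_clade_sizes(20, "pda")
```

The recursion is adequate for trees of a few hundred tips; an iterative version would be preferable for much larger trees.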

The core of apTreeshape consists of statistical testing procedures for the ERM and PDA null hypotheses. We implemented classical shape measures such as Sackin's and Colless' imbalance measures. We introduced standardized measures with means and variances computed under the ERM and PDA models. The use of standardized measures can reduce size effects when comparing trees of different sizes. The standardizations were computed using recent results on tree structures from theoretical computer science. In addition, we implemented a graphical test described in Aldous (2001), which attempts to fit Beta-splitting processes, a family that contains both the ERM and PDA as special cases.
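For readers unfamiliar with these measures, the following base R sketch computes the raw (unstandardized) indices on toy trees. It assumes the usual definitions: the Colless index sums, over internal nodes, the absolute difference between the numbers of tips in the two daughter clades, and the Sackin index sums the depths of all tips. The nested-list tree encoding and the function name tree_stats are hypothetical conveniences and do not reflect the ‘treeshape’ class.

```r
# A tip is a character string; an internal node is a list of two subtrees.
# Returns the number of tips and the raw Colless and Sackin indices.
tree_stats <- function(node) {
  if (!is.list(node)) {
    return(c(tips = 1, colless = 0, sackin = 0))
  }
  l <- tree_stats(node[[1]])
  r <- tree_stats(node[[2]])
  n <- l[["tips"]] + r[["tips"]]
  c(tips = n,
    # imbalance at this node plus imbalance accumulated below it
    colless = l[["colless"]] + r[["colless"]] + abs(l[["tips"]] - r[["tips"]]),
    # every tip below this node is one step deeper than in its subtree
    sackin = l[["sackin"]] + r[["sackin"]] + n)
}

caterpillar <- list(list(list("a", "b"), "c"), "d")   # fully unbalanced
balanced    <- list(list("a", "b"), list("c", "d"))   # fully balanced

tree_stats(caterpillar)  # tips = 4, colless = 3, sackin = 9
tree_stats(balanced)     # tips = 4, colless = 0, sackin = 8
```

Both indices grow with tree size, which is why standardized versions are needed before trees with different numbers of tips are compared.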

As an improvement over the existing literature on tree balance, we used the theory of likelihood ratios in order to provide a test statistic with maximal power for rejecting the ERM against the PDA model. The shape statistic can be computed as

s = \sum_{i=1}^{n-1} \log(N_i - 1)    (1)

where n is the number of taxa, and N_i is the size of the clade that descends from the i-th ancestor in the tree. Mathematical formulae for likelihoods were found in Semple and Steel (2003), and asymptotic properties of s have been established earlier by Fill (1996).
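As a worked illustration, the statistic can be evaluated by hand on two four-taxon trees. The sketch below assumes that Equation (1) has the form s = sum over the n-1 ancestors of log(N_i - 1); the function name shape_statistic_sketch is hypothetical and is not the package's shape.statistic.

```r
# Shape statistic from the clade sizes N_i of the n - 1 internal nodes,
# assuming s = sum(log(N_i - 1))
shape_statistic_sketch <- function(clade_sizes) {
  stopifnot(all(clade_sizes >= 2))  # every ancestor has at least two descendants
  sum(log(clade_sizes - 1))
}

# Four taxa, caterpillar tree: clades of size 4, 3 and 2
s_caterpillar <- shape_statistic_sketch(c(4, 3, 2))  # log(3) + log(2) + log(1) = log(6)

# Four taxa, balanced tree: clades of size 4, 2 and 2
s_balanced <- shape_statistic_sketch(c(4, 2, 2))     # log(3) + 0 + 0 = log(3)
```

Under this form, the unbalanced tree receives the larger value (log 6 against log 3), so large values of s point away from the ERM and towards the PDA model.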


    3 EXAMPLES
In this section, we illustrate the use of apTreeshape with two examples: the HIV-1 phylogeny and a large-scale study of tree imbalance obtained by screening the Pandit database.

Tests based on Colless' indices are more conservative than tests based on likelihood ratios. This is illustrated by the HIV-1 phylogeny (data from ‘ape’, tree with 193 tips) published in Rambaut et al. (2001). The authors attempted to date the most recent common ancestor of the HIV-1 viruses assuming a coalescent tree, whose topological structure is identical to that of the ERM model. Using a test based on standardized Colless' indices, the ERM hypothesis was not rejected against the alternative that the tree is less balanced (Colless index = 992, P-value = 0.1). However, the departure from the ERM model (and hence from the coalescent) is strongly supported by the likelihood ratio test (standardized s = 3.48, P-value = 0.25 × 10^-4). These results were obtained with the following instructions:

colless.test(tree<-hivtree.treeshape, alternative="greater")

likelihood.test(tree, model="yule", alternative="greater")
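The Monte Carlo reasoning behind a one-sided test of this kind can be sketched in base R. This is an illustration under stated assumptions, not the implementation of colless.test: the helper names sim_colless_erm and mc_pvalue are hypothetical, the raw Colless index is used rather than its standardized version, and ERM trees are generated by drawing each split uniformly. The observed value (992) and the number of tips (193) are those quoted above.

```r
# Raw Colless index of one random ERM tree with n tips, accumulated
# while the tree is generated (each split is uniform on 1, ..., n-1)
sim_colless_erm <- function(n) {
  if (n < 2) return(0)
  left <- sample.int(n - 1, 1)
  abs(n - 2 * left) + sim_colless_erm(left) + sim_colless_erm(n - left)
}

# One-sided Monte Carlo P-value: proportion of simulated indices at least
# as large as the observed one, with the usual +1 correction
mc_pvalue <- function(observed, n, n_sim = 500) {
  sims <- replicate(n_sim, sim_colless_erm(n))
  (1 + sum(sims >= observed)) / (1 + n_sim)
}

set.seed(1)
p <- mc_pvalue(observed = 992, n = 193)
```

The estimate fluctuates with the seed and with n_sim; increasing n_sim reduces the Monte Carlo error at the cost of longer run time.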

The next script connects to Pandit via the Internet, and downloads resolved trees with ID numbers in the range 100–300. Then the histogram of shape statistics s is plotted using the PDA normalization.

trees<-dbtrees(db="pandit", 100:300, quiet=T)

s.statistic<-sapply(trees, FUN=shape.statistic, norm="pda")

hist(s.statistic, prob=T)

The results are displayed in Figure 1. We obtain a clear departure from the PDA model. Nevertheless, the empirical distribution of the indices is bell-shaped [shifted to the left of the standard N(0,1)], with a standard deviation (SD = 1.34) close to the value predicted by the PDA model (SD = 1).


Fig. 1 Histogram of shape statistics s obtained after PDA standardization (196 trees collected from Pandit). The histogram displays a departure from the PDA model [shift to the left from the standard N(0,1)].

 

    4 CONCLUSION
The R programming language has proved to be a powerful tool for bioinformatics. We contributed to R in order to improve the analysis of phylogenetic data. The package apTreeshape integrates recent developments in the statistical theory of imbalance measures, which warrant the optimality of some testing procedures. This package competes with another program called SymmeTREE (Chan and Moore, 2005), which covers the same range of applications (temporal and topological analyses of trees). In this comparison, apTreeshape benefits from the extended power of R for performing all types of data analysis, and from its facilities for connecting to public databases. This should make the resource attractive to R users.


    FOOTNOTES
 
Associate Editor: Keith A Crandall

Received on September 14, 2005; revised on November 21, 2005; accepted on November 22, 2005

    REFERENCES

    Aldous, D.J. (2001) Stochastic models and descriptive statistics for phylogenetic trees, from Yule to today. Stat. Sci., 16, 23–34.

    Chan, K.M.A. and Moore, B.R. (2005) SymmeTREE: whole-tree analysis of differential diversification rates. Bioinformatics, 21, 1709–1710.

    Fill, J.A. (1996) On the distribution of binary search trees under the random permutation model. Rand. Struct. Algor., 8, 1–25.

    Kirkpatrick, M. and Slatkin, M. (1993) Searching for evolutionary patterns in the shape of a phylogenetic tree. Evolution, 47, 1171–1181.

    Mooers, A.O. and Heard, S.B. (1997) Inferring evolutionary process from phylogenetic tree shape. Q. Rev. Biol., 72, 31–54.

    Nee, S. (2001) Inferring speciation rates from phylogenies. Evolution, 55, 661–668.

    Paradis, E., et al. (2004) APE: analyses of phylogenetics and evolution in R language. Bioinformatics, 20, 289–290.

    R Development Core Team. (2005) A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

    Rambaut, A., et al. (2001) Human immunodeficiency virus phylogeny and the origin of HIV-1. Nature, 410, 1047–1048.

    Semple, C. and Steel, M. (2003) Phylogenetics. Oxford University Press, Oxford.

    Whelan, S., et al. (2003) Pandit: a database of protein and associated nucleotide domains with inferred trees. Bioinformatics, 19, 1556–1563.

