Bioinformatics Advance Access originally published online on December 1, 2005
Bioinformatics 2006 22(3):363-364; doi:10.1093/bioinformatics/bti798
apTreeshape: statistical analysis of phylogenetic tree shape
Team of Mathematical Biology (TIMB), TIMC, Faculty of Medicine F38706 La Tronche, France
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Summary: apTreeshape is an R package dedicated to the simulation and analysis of phylogenetic tree topologies using statistical imbalance measures. It is a companion library of the R package ape, and it provides additional functions for reading, plotting and manipulating phylogenetic trees and for connecting to public phylogenetic tree databases. One strength of the package is that it includes appropriate corrections of classical shape statistics as well as new tests based on the statistical theory of likelihood ratios.
Availability: http://cran.r-project.org
Contact: Olivier.Francois{at}imag.fr
| 1 INTRODUCTION |
|---|
The understanding of macroevolutionary processes, such as speciation or extinction, is a major issue in evolutionary biology. It is widely acknowledged that such processes leave their fingerprint on the phylogenetic trees that we reconstruct from extant taxa.
The recent explosion of phylogenetic data has generated a large body of analytical methods that rely on stochastic models of tree structure. These methods fall into two classes: temporal and topological. Temporal methods focus on the estimation of diversification rates (Nee, 2001). Topological methods are based on statistical measures of tree imbalance (Mooers and Heard, 1997; Aldous, 2001). Most of them assume null models of tree structure, among which Yule's process (1924) is the most popular.
In this article, we describe the computer package apTreeshape, which is dedicated to the simulation and analysis of phylogenetic tree topologies using statistical indices. It is programmed in the R language (R Development Core Team, 2005), and complements the library ape of Paradis et al. (2004), which mainly covers temporal methods. It also provides additional functions for reading, plotting and manipulating phylogenetic trees, and offers immediate web access to public phylogenetic tree databases, such as TreeBASE and Pandit (Whelan et al., 2003).
Beyond the software facilities for data analysis and graphical display offered by the R language, apTreeshape includes important corrections to classical shape statistics. One strength of the package is that it presents new tests based on the statistical theory of likelihood ratios, which provide optimal power for testing null models of macroevolution.
| 2 CONTENTS |
|---|
The functions contained in apTreeshape can be classified into four categories: basic topological manipulation, web-access, simulation and statistical testing.
The basic objects handled by the package are cladograms, i.e. binary trees for which branch lengths have been ignored. They can be read from files in the Newick/Nexus format or converted from objects of the ape package. These objects are stored in a class called treeshape. Objects of class treeshape have a dendrogram-like data structure, and they are plotted using methods for dendrograms. Basic topological manipulations are allowed, such as pruning or cutting from a specified internal node. Pruning returns the ancestral part of a tree, while cutting extracts a subtree rooted at a specific node. Subtrees corresponding to a subset of taxa can be extracted from a whole tree as well.
The package apTreeshape has been designed to perform large-scale studies of tree shape from phylogeny databases. For instance, it contains specific functions for accessing TreeBASE and Pandit through R. As an example, the following instructions download the trees with ID numbers 705, 706 and 709 from Pandit, and convert them into objects of class treeshape. Basic summaries can then be obtained very easily.
trees<-dbtrees("pandit", c(705,706,709))
summary(trees[[2]]);plot(trees[[2]])
Although apTreeshape deals with fully resolved trees, any phylogeny can be downloaded and converted into a binary tree by resolving polytomies with a random simulation method.
Simulation methods and Monte Carlo estimates of P-values are central to apTreeshape. The function rtreeshape enables sampling trees from the most usual stochastic models of trees: the equal rate Markov (ERM) model and the proportional to distinguishable arrangements (PDA) model. In the ERM model, each branch has an equal probability of splitting, whereas the PDA model has the property that all trees are equally likely (Mooers and Heard, 1997). Note that the topology of the ERM model is shared by other models, such as the Hey, Moran or coalescent models, for which branch lengths can be simulated using the R base package without difficulty. In addition, we implemented the biased-speciation model used by Kirkpatrick and Slatkin (1993), and a universal random generator for branching Markov processes. Polytomies are resolved locally using one of the ERM, PDA or biased-speciation models.
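The difference between the two null models can be made concrete with a small base-R sketch that samples only the clade sizes of a tree rather than a full treeshape object. It is not the rtreeshape implementation, and the function names (log_ntrees, sample_left, sim_splits) are hypothetical. It assumes the standard facts that, at a node subtending n tips, the size of the left clade is uniform on 1, ..., n-1 under the ERM model, and proportional to the number of labelled trees compatible with the split under the PDA model.

```r
# Log of the number of rooted binary labelled trees on k tips, (2k-3)!!,
# written as (2k-2)! / (2^(k-1) * (k-1)!).
log_ntrees <- function(k) lgamma(2 * k - 1) - (k - 1) * log(2) - lgamma(k)

# Sample the size of the left clade at a node subtending n >= 2 tips.
sample_left <- function(n, model) {
  i <- seq_len(n - 1)
  if (model == "erm") {
    w <- rep(1, n - 1)                      # uniform split (assumed ERM property)
  } else {
    lw <- lchoose(n, i) + log_ntrees(i) + log_ntrees(n - i)
    w <- exp(lw - max(lw))                  # PDA weights, rescaled for stability
  }
  i[sample.int(n - 1, 1, prob = w)]
}

# Return an (n-1) x 2 matrix with the (left, right) clade sizes
# of every internal node of one simulated tree.
sim_splits <- function(n, model = c("erm", "pda")) {
  model <- match.arg(model)
  stopifnot(n >= 2)
  splits <- matrix(0, nrow = n - 1, ncol = 2,
                   dimnames = list(NULL, c("left", "right")))
  todo <- n                                 # clade sizes still to be split
  k <- 0
  while (length(todo) > 0) {
    m <- todo[1]
    todo <- todo[-1]
    if (m < 2) next                         # a single tip cannot be split
    l <- sample_left(m, model)
    k <- k + 1
    splits[k, ] <- c(l, m - l)
    todo <- c(todo, l, m - l)
  }
  splits
}

set.seed(1)
sim_splits(6, "erm")
sim_splits(6, "pda")
```

Repeating the simulation many times and comparing, for example, the size of the smaller root clade under each model shows that PDA trees tend to be markedly less balanced than ERM trees.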
The core of apTreeshape consists of statistical testing procedures for the ERM and PDA null hypotheses. We implemented classical shape measures such as Sackin's and Colless' imbalance measures. We introduced standardized measures with means and variances computed under the ERM and PDA models. The use of standardized measures can reduce size effects when comparing trees of different sizes. The standardizations were computed using recent results on tree structures from theoretical computer science. In addition, we implemented a graphical test described in Aldous (2001), which attempts to fit Beta-splitting processes, a family that contains both the ERM and PDA models as special cases.
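The two classical indices can be illustrated with a single recursion over a tree. The sketch below is base R only and does not use the package: it represents a cladogram as a nested list (a representation chosen here for illustration, not the treeshape class), and the names shape_indices and expected_sackin_erm are hypothetical. Colless' index sums the absolute difference between the two clade sizes over internal nodes; Sackin's index sums the depths of the tips, which equals the sum of clade sizes over internal nodes. The ERM expectation of Sackin's index used for centring is the classical value 2n(1/2 + ... + 1/n).

```r
# Recursively compute the number of tips, Colless' index and Sackin's index
# of a binary tree stored as a nested list (leaves are non-list elements).
shape_indices <- function(node) {
  if (!is.list(node)) return(c(tips = 1, colless = 0, sackin = 0))
  stopifnot(length(node) == 2)              # binary trees only
  l <- shape_indices(node[[1]])
  r <- shape_indices(node[[2]])
  n <- l[["tips"]] + r[["tips"]]
  c(tips    = n,
    colless = l[["colless"]] + r[["colless"]] + abs(l[["tips"]] - r[["tips"]]),
    sackin  = l[["sackin"]] + r[["sackin"]] + n)
}

# Expected Sackin index of an ERM tree with n >= 2 tips.
expected_sackin_erm <- function(n) {
  stopifnot(n >= 2)
  2 * n * sum(1 / (2:n))
}

caterpillar <- list(list(list("a", "b"), "c"), "d")   # (((a,b),c),d)
balanced    <- list(list("a", "b"), list("c", "d"))   # ((a,b),(c,d))

shape_indices(caterpillar)   # tips = 4, colless = 3, sackin = 9
shape_indices(balanced)      # tips = 4, colless = 0, sackin = 8
expected_sackin_erm(4)       # 26/3

# Centred Sackin index; dividing by n is an assumption about the scaling.
(shape_indices(caterpillar)[["sackin"]] - expected_sackin_erm(4)) / 4
```

Centring by the null expectation is what makes trees of different sizes comparable; the exact standardizations used by the package may differ from the scaling assumed here.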
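As an improvement over the existing literature on tree balance, we used the theory of likelihood ratios to provide a test statistic with maximal power for rejecting the ERM model against the PDA model. The shape statistic can be computed as
$$ s = \sum_{i} \log(N_i - 1) \qquad (1) $$
where the sum runs over all internal nodes i of the tree and N_i denotes the number of tips descending from node i.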
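Why a sum of logarithms of clade sizes arises from the ERM versus PDA likelihood ratio can be checked with a short derivation. The sketch below assumes rooted binary labelled trees T with n tips, writes N_i for the number of tips below internal node i, and uses the standard expressions for the probability of a labelled tree under each model; the notation is introduced here for illustration and is not taken from the package.

```latex
\begin{align*}
P_{\mathrm{ERM}}(T) &= \frac{2^{\,n-1}}{n!} \prod_{i} \frac{1}{N_i - 1}, &
P_{\mathrm{PDA}}(T) &= \frac{1}{(2n-3)!!}, \\
\log \frac{P_{\mathrm{PDA}}(T)}{P_{\mathrm{ERM}}(T)}
  &= \sum_{i} \log(N_i - 1) + c(n), &
c(n) &= \log \frac{n!}{2^{\,n-1}\,(2n-3)!!}.
\end{align*}
```

Since c(n) depends only on n, the likelihood ratio orders trees of a given size by the sum of log(N_i - 1), and large values point towards the PDA model. For n = 4, the caterpillar tree gives log 1 + log 2 + log 3 = log 6, whereas the balanced tree gives log 1 + log 1 + log 3 = log 3.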
| 3 EXAMPLES |
|---|
In this section, we illustrate the use of apTreeshape from two examples: the HIV-1 phylogeny and a large-scale study of tree imbalance obtained from the screening of the Pandit database.
Tests based on Colless' indices are more conservative than tests based on likelihood ratios. This is illustrated by the HIV-1 phylogeny published in Rambaut et al. (2001) (data from ape; tree with 193 tips). The authors attempted to date the most recent common ancestor of the HIV-1 viruses assuming a coalescent tree, whose topological structure is identical to that of the ERM model. Using a test based on standardized Colless indices, the ERM model was not rejected against the alternative of a less balanced tree (Colless index = 992, P-value = 0.1). However, the departure from the ERM model (and hence from the coalescent) is strongly supported by the likelihood ratio test (standardized s = 3.48, P-value = 0.25 × 10⁻⁴). These results were obtained with the following instructions:
data(hivtree.treeshape)
tree <- hivtree.treeshape
colless.test(tree, alternative="greater")
likelihood.test(tree, model="yule", alternative="greater")
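The logic of a one-sided Monte Carlo test of the ERM hypothesis can be sketched in base R. This is an illustration under stated assumptions, not the code behind likelihood.test: the function names are hypothetical, the statistic is assumed to be the sum of log(N_i - 1) over internal clade sizes N_i, and the P-value is estimated by simulating ERM trees (uniform split at each node) rather than by any standardization the package may use.

```r
# Clade sizes (number of tips below each internal node) of one ERM tree.
erm_clade_sizes <- function(n) {
  sizes <- numeric(0)
  todo <- n                                 # clade sizes still to be split
  while (length(todo) > 0) {
    m <- todo[1]
    todo <- todo[-1]
    if (m < 2) next
    sizes <- c(sizes, m)
    l <- sample.int(m - 1, 1)               # uniform split under the ERM model
    todo <- c(todo, l, m - l)
  }
  sizes
}

# Shape statistic computed from the clade sizes of the internal nodes.
shape_stat <- function(sizes) sum(log(sizes - 1))

# Monte Carlo P-value for the alternative "less balanced than ERM".
mc_pvalue <- function(observed, n, n_mc = 500) {
  null <- replicate(n_mc, shape_stat(erm_clade_sizes(n)))
  (1 + sum(null >= observed)) / (n_mc + 1)
}

set.seed(1)
caterpillar_sizes <- 2:30                   # fully unbalanced tree, 30 tips
mc_pvalue(shape_stat(caterpillar_sizes), n = 30)
```

Adding one to both the numerator and the denominator keeps the estimated P-value strictly positive, so its smallest possible value is 1/(n_mc + 1).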
The next script connects to Pandit via the Internet and downloads resolved trees with ID numbers in the range 100–300. The histogram of the shape statistic s is then plotted using the PDA normalization.
trees<-dbtrees(db="pandit", 100:300, quiet=T)
s.statistic<-sapply(trees, FUN=shape.statistic, norm="pda")
hist(s.statistic,prob=T)
The results are displayed in Figure 1. We observe a clear departure from the PDA model. Nevertheless, the empirical distribution of the indices is bell-shaped [shifted to the left relative to the standard N(0,1)], with a standard deviation (SD = 1.34) close to the value predicted by the PDA model (SD = 1).
| 4 CONCLUSION |
|---|
The R programming language has proved to be a powerful tool for bioinformatics. We contributed to R in order to improve the analysis of phylogenetic data. The package apTreeshape integrates recent developments in the statistical theory of imbalance measures, which warrant the optimality of some testing procedures. It competes with another program, SymmeTREE (Chan and Moore, 2005), which covers the same range of applications (temporal and topological analyses of trees). In this comparison, apTreeshape benefits from the extensive capabilities of R for all types of data analysis (and from its facilities for connecting to public databases). This should make this resource attractive to R users.
| FOOTNOTES |
|---|
Associate Editor: Keith A Crandall
Received on September 14, 2005; revised on November 21, 2005; accepted on November 22, 2005
| REFERENCES |
|---|
Aldous, D.J. (2001) Stochastic models and descriptive statistics for phylogenetic trees, from Yule to today. Stat. Sci., 16, 23–34.
Chan, K.M.A. and Moore, B.R. (2005) SymmeTREE: whole-tree analysis of differential diversification rates. Bioinformatics, 21, 1709–1710.
Fill, J.A. (1996) On the distribution of binary search trees under the random permutation model. Rand. Struct. Algor., 8, 1–25.
Kirkpatrick, M. and Slatkin, M. (1993) Searching for evolutionary patterns in the shape of a phylogenetic tree. Evolution, 47, 1171–1181.
Mooers, A.O. and Heard, S.B. (1997) Inferring evolutionary process from phylogenetic tree shape. Q. Rev. Biol., 72, 31–54.
Nee, S. (2001) Inferring speciation rates from phylogenies. Evolution, 55, 661–668.
Paradis, E., et al. (2004) APE: analyses of phylogenetics and evolution in R language. Bioinformatics, 20, 289–290.
R Development Core Team. (2005) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Rambaut, A., et al. (2001) Human immunodeficiency virus phylogeny and the origin of HIV-1. Nature, 410, 1047–1048.
Semple, C. and Steel, M. (2003) Phylogenetics. Oxford University Press, Oxford.
Whelan, S., et al. (2003) Pandit: a database of protein and associated nucleotide domains with inferred trees. Bioinformatics, 19, 1556–1563.