Bioinformatics Advance Access originally published online on August 23, 2007
Bioinformatics 2007 23(21):2952-2953; doi:10.1093/bioinformatics/btm410
CTree: comparison of clusters between phylogenetic trees made easy
Faculty of Life Science, University of Manchester, Michael Smith Building, Oxford Road, Manchester, M13 9PT, UK
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Summary: CTree has been designed for the quantification of clusters within viral phylogenetic tree topologies. Clusters are stored as individual data structures from which statistical data, such as the Subtype Diversity Ratio (SDR), Subtype Diversity Variance (SDV) and pairwise distances can be extracted. This simplifies the quantification of tree topologies in relation to inter- and intra-cluster diversity. Here the novel features incorporated within CTree, including the implementation of a heuristic algorithm for identifying clusters, are outlined along with the more usual features found within general tree viewing software.
Availability: CTree is available as an executable jar file from: http://www.manchester.ac.uk/bioinformatics/ctree
Contact: john.archer{at}postgrad.manchester.ac.uk
| 1 INTRODUCTION |
|---|
|
|
|---|
There are many programs available for viewing phylogenetic trees. A comprehensive list of such programs, maintained by Joe Felsenstein can be found at http://evolution.genetics.washington.edu/phylip/software.html. Few, however, are specialized in the quantification of clustering present on such trees. Quantifying clusters is biologically important in relation to both vaccine design and epidemiology (Archer and Robertson, 2007; Nickle et al., 2003). Within CTree clusters of strains can be treated as unique data structures—allowing for easy quantification and comparison. Clusters can be manually populated by the user or alternatively by use of a novel heuristic clustering algorithm. Use of such an algorithm removes the subjectivity of the individual when allocating strains to individual clusters, e.g. on random trees or on trees with no previously identified clusters. CTree also incorporates other useful features for phylogenetic analysis that are rarely included within current tree-viewing software. The input for CTree is a tree string in NEWICK format (e.g. phylip output .ph or.phb). Trees can be saved as a pdf, in NEWICK format or as binary files that will maintain any edits made to them. Implementation was done using the Java SDK and so CTree is platform independent.
| 2 NOVEL FEATURES |
|---|
|
|
|---|
2.1 Heuristically defining clusters
If the number of taxa present on the tree is less than 125, a heuristic algorithm can be used to allocate individual strains to clusters on the tree topology. This feature is useful when no previous clusters have been published for a particular dataset, as well as for finding clusters on random trees in a systematic manner (such trees can be used as a control).
2.2 Manually defining clusters
The user can manually assign strains to clusters. This allows for the comparison between previously identified clusters on different trees, e.g. trees representing HIV-1 groups M and O.
2.3 SDR and SDV
These two statistics are used to quantify the degree of clustering present on tree topology. The SDR, is defined as the ratio of the mean intra-cluster pairwise distance to the mean inter-cluster pairwise distance (Rambaut et al., 2001). Low intra-pairwise distances relative to inter-pairwise distances imply the presence of more defined clusters. The SDV is a measure of the variation within the ratio of the mean intra-cluster pairwise distance to the mean inter-cluster pairwise distance calculated for each cluster on the tree (Archer and Robertson, 2007). The lower the SDV the more symmetrical the clusters present. Together these two statistics quantify the presence of clustering within tree topologies.
2.4 Working with random trees
CTree provides a random birth model of exponential population growth for the generation of random trees. The user can generate a single random tree or specify a number of random trees to be generated. Terminal nodes from trees generated can be randomly sampled. Clustering analysis can be preformed on the trees with the end result being a null distribution for the various statistics (including SDR and SDV).
2.5 Finding the center of the tree (COT)
COT is the point on a tree with the smallest average distance from each of the strains on the tree. As well as having implications for vaccine design (Nickle et al., 2003), it is a useful reference point on a tree.
|
| 3 THE HEURISTIC CLUSTERING ALGORITHM |
|---|
|
|
|---|
The algorithm is based on finding the cluster set with the minimum SDR value (Rambaut et al., 2001). The runtime is O(n3) where n is the total number of vertices within the tree. There are two phases:
3.1 The explore phase
Initially m cluster sets for the input tree are created during m iterations of steps 1–6: m is dependent on the number of incrementations used for the threshold distance (t) in step 2.
Step 1: each strain on the tree is designated as a potential central cluster strain––each representing a potential cluster.
Step 2: potential clusters are individually populated by adding strains falling within t from the central cluster strain.
Step 3: potential clusters are sorted (largest first). All strains within the largest potential cluster are removed from all other potential clusters. The remaining potential clusters are resorted. This step is repeated until there are no duplicated strains between clusters.
Step 4: strains belonging to each potential cluster are checked to see whether or not they are more suited to a different potential cluster. Suitability is based on an individual member's proximity to its current potential central cluster strain and its proximity to central cluster strain's representing other potential clusters.
Step 5: the SDR value for the remaining populated potential clusters is calculated and stored.
Step 6: t is incremented by a predefined amount.
3.2 The selection phase
In the selection phase, the potential cluster set that produced the lowest SDR value is selected as the cluster set to represent the input tree.
| 4 STANDARD FEATURES |
|---|
|
|
|---|
CTree provides the user with many standard tree-viewing features including: (i) re-rooting the tree, (ii) obtaining statistical information such as pairwise distances between strains, (iii) swapping the order of sibling strains (iv) manually removing strains from the tree, (v) removing strains randomly/non-randomly from the tree, (vi) an improved search interface that allows the user to color strains-based on search criteria, (vii) basic coloring of the tree, (viii) loading multiple trees from a file containing more than one tree and conveniently scrolling through them, (ix) allowing the user to obtain lists of strains within a user-specified proximity to each other and (x) allowing the user to define the distance covered by the scale bar associated with the tree.
| 5 TYPICAL USAGE |
|---|
|
|
|---|
With the range of tree-viewing features available within CTree the user can perform all of the functionalities that would be required to create a publishable phylogenetic tree. The novel features of CTree do not complicate this process. A typical usage is illustrated in the study of the clusters present within HIV-1 group M and O data (Archer and Robertson, 2007). Here CTree was used to manually divide multiple trees from each of the group M and O datasets into clusters. SDR and SDV values calculated from these clusters where then compared to each other as well as to heuristically defined clusters present on 1000 randomly generated trees. The end result was the production of a definitive model of HIV-1 subtype emergence and the highlighting of the misleading nature of the group M subtype classification system when used at the center of the pandemic.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
We are very grateful to Andy Rambaut for his insightful comments and his advice concerning the generation of random trees. We would also like to thank John Pinney for helpful discussions. J.A. is supported by BBSRC studentship.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Martin Bishop
Received on June 1, 2007; revised on July 12, 2007; accepted on August 8, 2007
| REFERENCES |
|---|
|
|
|---|
Archer J, Robertson DL. Understanding the diversification of HIV-1 groups M and O. AIDS (2007) 21:1693–700.[CrossRef][Web of Science][Medline]
Nickle DC, et al. Consensus and ancestral state HIV vaccines. Science (2003) 299:1515–1517.
Rambaut A, et al. Phylogeny and the origin of HIV-1. Nature (2001) 410:1047–1048.[CrossRef][Medline]
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
