Skip Navigation


Bioinformatics Advance Access originally published online on May 12, 2008
Bioinformatics 2008 24(13):1540-1541; doi:10.1093/bioinformatics/btn230
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
24/13/1540    most recent
btn230v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Wehe, A.
Right arrow Articles by Eulenstein, O.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Wehe, A.
Right arrow Articles by Eulenstein, O.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

DupTree: a program for large-scale phylogenetic analyses using gene tree parsimony

André Wehe 1, Mukul S. Bansal 1, J. Gordon Burleigh 2 and Oliver Eulenstein 1,*

1Department of Computer Science, Iowa State University, Ames, IA 50011 and 2NESCent, Durham, NC 27705, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 FEATURES
 3 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

Summary: DupTree is a new software program for inferring rooted species trees from collections of gene trees using the gene tree parsimony approach. The program implements a novel algorithm that significantly improves upon the run time of standard search heuristics for gene tree parsimony, and enables the first truly genome-scale phylogenetic analyses. In addition, DupTree allows users to examine alternate rootings and to weight the reconciliation costs for gene trees. DupTree is an open source project written in C++.

Availability: DupTree for Mac OS X, Windows, and Linux along with a sample dataset and an on-line manual are available at http://genome.cs.iastate.edu/CBL/DupTree

Contact: oeulenst{at}cs.iastate.edu


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 FEATURES
 3 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
With rapidly growing amounts of genomic sequence data, there is renewed interest in inferring species trees from gene trees with a history of duplication and loss (e.g. Sanderson and McMahon, 2007). There are several methods for inferring species trees from collections of potentially paralogous gene trees, including uninode coding (Simmons et al., 2000) and gene tree parsimony (GTP) (Maddison, 1997; Slowinski and Page, 1999). GTP takes a collection of rooted, binary gene trees and seeks a rooted, binary species tree with the minimum reconciliation cost for the corresponding taxa. The reconciliation cost can be the overall number of gene duplications, or gene duplications and losses, induced in the gene trees by the species tree. With incomplete gene sampling, it is difficult to distinguish gene loss from the absence of sequence data. Therefore, following Page and Charleston (1997), we define the reconciliation cost strictly in terms of gene duplications. The reconciliation cost for a collection of gene trees and a single species tree is linear time computable (Zhang, 1997). However, finding the minimum reconciliation cost across all possible species trees is an intrinsically difficult problem (Ma et al., 1998). Therefore, heuristics are commonly used to estimate a solution for GTP.

GTP has been successfully applied to phylogenetic inference in snakes (Slowinski et al., 1997), vertebrates (Page, 2000; Page and Cotton, 2002) and plants (Sanderson and McMahon, 2007). Yet the run time performance of previous GTP implementations is not suitable for large-scale studies. Currently, there are two programs, Gtp (Sanderson and McMahon, 2007) and GeneTree (Page, 1998), for the GTP approach. Gtp performs an exhaustive search of all possible species trees, and therefore, it is only suitable for analyses with very few taxa. GeneTree implements standard local search heuristics for gene tree parsimony based on the rooted Subtree Pruning and Regrafting (rSPR) tree edit operation (Semple and Steel, 2003). Due to the high time complexity of the underlying algorithm, GeneTree typically cannot infer large species trees. Furthermore, the GTP approach requires rooted input gene trees, and rooting gene trees can be difficult if there is a history of gene duplication and loss.

We introduce DupTree, a program that enables the first truly genome-scale phylogenetic analyses based on GTP, and also allows unrooted input gene trees. DupTree implements the same standard rSPR based heuristics that are found in GeneTree, but solves the local rSPR search problem much more efficiently than GeneTree using the algorithm from Bansal et al. (2007). Due to some random choices in the heuristic, the resulting species trees from DupTree and GeneTree may differ slightly. In practice, we noticed little or no difference. Our efficiency study (Table 1) demonstrates the enormous speed-up of DupTree in comparison to GeneTree. For the study, we first generated sets of 20 random gene trees, each having the same set of taxa. Next, we conducted 6 runs, each with a different number of taxa in the 20 random gene trees (50, 100, 200, 400, 1000, 2000). Table 1 compares the resulting run times for GeneTree and DupTree. Additionally, we have run DupTree on datasets with (i) 3978 gene trees containing a total of 624 taxa (Bansal et al., 2007) and (ii) 18 896 gene trees containing a total of 136 taxa. Neither of these datasets could be run with GeneTree. DupTree also implements novel features that can significantly enhance the quality of phylogenetic analyses. These features include methods to (i) examine alternate gene tree rootings, (ii) weight the reconciliation cost for each of the gene trees and (iii) define topological constraints for the resulting species tree.


View this table:
[in this window]
[in a new window]

 
Table 1. Runtime comparison between GeneTree and DupTree

 
In DupTree, the reconciliation cost is defined strictly in terms of gene duplications. Although there are highly efficient algorithms for search heuristics for the gene duplication objective (Bansal and Eulenstein, 2007; Bansal et al., 2007), such algorithms are unknown for objectives based on duplication loss and deep coalescence. Still, small studies under these objectives can be performed using heuristics implemented in GeneTree (Page, 2000).


    2 FEATURES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 FEATURES
 3 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
The input for a DupTree analysis is a single file containing a list of the gene trees, all in Newick format. The output, in either Newick or NEXUS format, is the best species tree found during the search. DupTree begins its heuristic rSPR tree search from an initial species tree, which can either be (i) generated randomly, (ii) built using an effective rooted stepwise addition algorithm or (iii) supplied by the user. To adjust thoroughness of the tree search, DupTree implements three versions of local rSPR search heurisitcs, including standard queue-based searching. DupTree also provides two effective search methods to identify gene tree rootings that minimize the reconciliation cost. In the most comprehensive method, DupTree can examine the reconciliation cost of every possible rooting of each gene tree (Sanderson and McMahon, 2007). Alternately, DupTree can examine the reconciliation cost of every possible rooting of each gene tree only after it reaches a local optima in the rSPR search. If rootings are found that decrease the reconciliation cost, DupTree will continue with the tree search using the new rootings. DupTree also implements weighted GTP analyses, in which the reconciliation cost of each gene tree is multiplied by a ‘weight’ between 0 and 1, and the species tree is inferred based on the overall weighted reconciliation cost.

DupTree is designed to interact smoothly with complementary phylogenetic tools. For example, there is a frequent interest in determining where gene duplications and losses may have occurred on a species tree. The species tree from DupTree can be used, together with the given gene trees, as input for existing programs that can locate duplications and losses, such as COMPONENT (Page, 1995) and GeneTree (Page, 2000).


    3 CONCLUSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 FEATURES
 3 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
With DupTree, it is now possible to perform GTP analyses on large-scale genomic datasets across many taxa. Thus, DupTree can facilitate the incorporation of new genomic data into phylogenetic inference. Future work will include implementing a fast tree search based on ‘tree bisection and reconnection (TBR)’ (Bansal and Eulenstein, 2007), a parallel version of the rSPR tree search, as well as options for incorporating gene losses into the reconciliation cost.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 FEATURES
 3 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
We thank Mike Sanderson for providing extensive comments and testing of DupTree, and the reviewers for their suggestions.

Funding: National Science Foundation (EF-0334832 to A.W., M.S.B. and O.E., and EF-0423641 to J.G.B.).

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Martin Bishop

Received on November 5, 2007; revised on May 7, 2008; accepted on May 8, 2008

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 FEATURES
 3 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Bansal MS, Eulenstein O. An {Omega}(n2 / log n) speed-up of TBR heuristics for the gene-duplication problem. In: Lecture Notes in Computer Science—Giancario R, Hannenhalli S, eds. (2007) Vol. 4645. Proceedings of the 7th International Workshop on Algorithms in Bioinformatics (WABI'07). Philadelphia, PA, USA: Springer. 124–135.[CrossRef]

    Bansal MS, et al. Heuristics for the gene-duplication problem: a {theta}(n) speed-up for the local search. In: Lecture Notes in Computer Science—Speed TP, Huang H, eds. (2007) Vol. 4453. Proceedings of the 11th Annual International Conference on Research in Computational Molecular Biology (RECOMB'07). Oakland, CA, USA: Springer. 238–252.[CrossRef]

    Ma B, et al. On reconcstructing species trees from gene trees in term of duplications and losses. (1998) Proceedings of the Second Annual International Conference on Research in Computational Molecular Biology. New York, NY, USA: ACM. 182–191.

    Maddison WP. Gene trees in species trees. Syst. Biol. (1997) 46:523–536.[Abstract/Free Full Text]

    Page RDM. (1995) (last accessed date June 3, 2008). Program COMPONENT. Available at http://taxonomy.zoology.gla.ac.uk/rod/cpw.html.

    Page RDM. GeneTree: comparing gene and species phylogenies using reconciled trees. Bioinformatics (1998) 14:819–820.[Abstract/Free Full Text]

    Page RDM. Extracting species trees from complex gene trees: reconciled trees and vertebrate phylogeny. Mol. Phylogenet. Evol. (2000) 14:89–106.[CrossRef][Web of Science][Medline]

    Page,R.D.M. and Charleston MA. Reconciled trees and incongruent gene and species trees. In: Mathematical Hierarchies in Biology—Mirkin B, et al, eds. (1997) Vol. 37. Rhode Island, USA: American Mathematical Society, Providence. 57–70.

    Page,R.D.M. and Cotton J. Vertebrate phylogenomics: reconciled trees and gene duplications. Altman RB, et al, eds. (2002) Proceedings of the 7th Pacific Symposium on Biocomputing (PSB'02): Lihue, Hawaii, USA. 536–547.

    Sanderson MJ, McMahon MM. Inferring angiosperm phylogeny from EST data with widespread gene duplication. BMC Evol. Biol. (2007) 7((Suppl. 1)):S3.

    Semple C, Steel M. Phylogenetics (2003) New York, NY, USA: Oxford University Press.

    Simmons MP, et al. Phylogenetic reconstruction using duplicate genes. Mol. Biol. Evol. (2000) 17:469–473.[Abstract/Free Full Text]

    Slowinski J, Page RDM. How should species phylogenies be inferred from sequence data? Syst. Biol. (1999) 105:147–158.

    Slowinski JB, et al. Inferring species trees from gene trees: a phylogenetic analysis of the elapidae (serpentes) based on the amino acid sequences of venom proteins. Mol. Phylogenet. Evol. (1997) 8:349–362.[CrossRef][Web of Science][Medline]

    Zhang L. On a Mirkin-Muchnik-Smith conjecture for comparing molecular phylogenies. J. Comput. Biol. (1997) 4:177–187.[Web of Science][Medline]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Syst BiolHome page
R. C. Thomson and H. B. Shaffer
Sparse Supermatrices for Phylogenetic Inference: Taxonomy, Alignment, Rogue Taxa, and the Phylogeny of Living Turtles
Syst Biol, November 11, 2009; (2009) syp075v1.
[Abstract] [Full Text] [PDF]


Home page
Syst BiolHome page
B. C. O'Meara
New Heuristic Methods for Joint Species Delimitation and Species Tree Inference
Syst Biol, November 10, 2009; (2009) syp077v1.
[Abstract] [Full Text] [PDF]


Home page
Syst BiolHome page
B. Frajman, F. Eggens, and B. Oxelman
Hybrid Origins and Homoploid Reticulate Evolution within Heliosperma (Sileneae, Caryophyllaceae)--A Multigene Phylogenetic Approach with Relative Dating
Syst Biol, July 3, 2009; (2009) syp030v1.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
24/13/1540    most recent
btn230v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Wehe, A.
Right arrow Articles by Eulenstein, O.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Wehe, A.
Right arrow Articles by Eulenstein, O.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?