Bioinformatics Advance Access originally published online on January 28, 2008
Bioinformatics 2008 24(5):715-716; doi:10.1093/bioinformatics/btm619

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Phyutility: a phyloinformatics tool for trees, alignments and molecular data

Stephen A. Smith 1,* and Casey W. Dunn 2

1Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT, 06520 and 2Department of Ecology and Evolutionary Biology, Brown University, Providence, RI, 02912, USA

*To whom correspondence should be addressed.


    ABSTRACT

Summary: Phyutility provides a set of phyloinformatics tools for summarizing and manipulating phylogenetic trees, manipulating molecular data and retrieving data from NCBI. Its simple command-line interface allows for easy integration into scripted analyses, and an integrated database allows it to handle large datasets.

Availability: Phyutility, including source code, documentation, examples, and executables, is available at http://code.google.com/p/phyutility

Contact: stephen.smith@yale.edu


    1 INTRODUCTION
 
There are now many tools available for phylogenetic inference, but software for the assembly of molecular datasets and the analysis of resulting trees is far more limited. These restrictions are increasingly apparent as many new phylogenetic studies rely heavily on phyloinformatics, which largely consists of connecting existing tools into analysis pipelines to automate sophisticated analyses of dynamic datasets. Surprisingly, there are few or no scriptable programs available for some simple tasks such as rerooting multiple phylogenetic trees. Here we present Phyutility, a command-line program written in Java that integrates many frequently needed dataset assembly and tree manipulation tasks into a single package and implements several new metrics. The tree analysis functionalities focus largely on summarizing topological variation within a set of trees. Furthermore, Phyutility automates several simple and important phylogenetic tree, molecular sequence and alignment manipulations that to date have been complicated and time-consuming to implement. The simple command-line interface allows for easy integration with other programs in phyloinformatic pipelines, plugging several large holes that remain between the feature sets of existing software tools. The documentation distributed with the source code and executables provides further detail on command usage and implementation, and describes other features not listed here.


    2 TREE MANIPULATIONS AND SUMMARIES
 
Multiple tree file formats are supported, including Newick and Nexus (with or without taxon translation tables). This reduces the need to preprocess files prior to analysis and allows Phyutility to serve as a convenient tree file format converter in support of other programs. Phyutility can also thin trees (i.e. retain only every n-th tree), making the output tree file more manageable when computer memory is limiting. This is often essential when preprocessing a large posterior distribution of trees for further analysis.
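The thinning step can be illustrated with a minimal Python sketch. This is not Phyutility's implementation or command syntax; it assumes the trees are already available as one Newick string per item, and the function name `thin_trees` and the choice to keep the first tree of every block of n are assumptions made for illustration.

```python
from itertools import islice


def thin_trees(trees, n):
    """Return an iterator over every n-th tree, starting with the first.

    `trees` can be any iterable (for example, an open file with one Newick
    tree per line), so the full tree set never has to sit in memory.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    return islice(trees, 0, None, n)


# Toy input: five Newick trees, thinned to every second tree.
newick_lines = [
    "((A,B),C);",
    "((A,C),B);",
    "((B,C),A);",
    "((A,B),C);",
    "((A,C),B);",
]
thinned = list(thin_trees(newick_lines, 2))
```

Because the function consumes its input lazily, the same call works on a file handle for a large posterior sample without loading every tree first.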

Phyutility can root, re-root or unroot entire treesets with a single command. This allows one, for example, to root all the unrooted trees in a posterior distribution of trees for use in a comparative analysis that requires rooted trees. In addition to rooting, trees can be pruned of either tips or clades (which are designated by the most recent common ancestor of two or more taxa). To date, no other software tools perform this type of tree editing on multiple trees and multiple file types.

Phyutility can perform traditional consensus tree analyses. Most other programs that generate consensus trees are critically limited because they impose restrictions on taxon name length, unnecessarily require sequence alignments or other auxiliary data, or have complicated user interfaces that make automated analyses needlessly cumbersome or even impossible. Along with traditional clade frequencies provided by consensus tree methods, Phyutility can calculate leaf stability indices for phylogenetic trees based on the measurements described in Thorley and Wilkinson (1999). Previously, this was only available in the MacOS 9 program RadCon (Thorley and Page, 2000), which is limited by the number and size of trees.

Phyutility can also calculate the frequency of all bipartitions found in a single specified exemplar tree from across a set of trees with the same taxa. This allows one, for instance, to easily label each clade in the most likely tree with the posterior probability or bootstrap support value. Until now, this computationally simple task was laborious to complete and was fully implemented in few programs (but see Sukumaran, 2007).
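The bipartition calculation can be sketched in a few lines of Python. This is an illustrative reading of the idea, not Phyutility's code: the parser below is a deliberately small one that assumes simple Newick strings (plain names, optional branch lengths, no quoted labels or comments), and all function names are hypothetical. Each non-trivial bipartition is stored as the side that does not contain the alphabetically first taxon, so rooting and the order of subtrees do not matter.

```python
def clades(newick):
    """Return the tip-name set below every internal node of a Newick string."""
    stack, found = [], set()
    name, reading_name = "", True
    for ch in newick.strip():
        if ch == "(":
            stack.append(set())
            name, reading_name = "", True
        elif ch in ",);":
            if name and stack:
                stack[-1].add(name)
            name = ""
            if ch == ")":
                tips = frozenset(stack.pop())
                found.add(tips)
                if stack:
                    stack[-1].update(tips)
                reading_name = False  # ignore any label or length after ')'
            else:
                reading_name = True
        elif ch == ":":
            reading_name = False      # ignore the branch length
        elif reading_name and not ch.isspace():
            name += ch
    return found


def bipartitions(newick):
    """Return the non-trivial bipartitions of a tree, one side per split."""
    all_clades = clades(newick)
    taxa = max(all_clades, key=len)   # the root clade holds every tip
    anchor = min(taxa)
    splits = set()
    for clade in all_clades:
        side = clade if anchor not in clade else taxa - clade
        if 1 < len(side) < len(taxa) - 1:  # skip trivial splits
            splits.add(frozenset(side))
    return splits


def bipartition_frequencies(exemplar, trees):
    """Map each bipartition of `exemplar` to its frequency across `trees`."""
    per_tree = [bipartitions(tree) for tree in trees]
    return {
        split: sum(split in found for found in per_tree) / len(per_tree)
        for split in bipartitions(exemplar)
    }


exemplar = "((A,B),(C,(D,E)));"
tree_set = [
    "((A,B),(C,(D,E)));",
    "((A,B),((C,D),E));",
    "(((A,C),B),(D,E));",
    "((B,A),((D,C),E));",
]
freqs = bipartition_frequencies(exemplar, tree_set)
```

In this toy set the split separating C, D and E from A and B is found in three of the four trees, and the split grouping D with E in two of them, so the values that would label the exemplar tree are 0.75 and 0.5.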

We implement a new metric in Phyutility called ‘branch attachment frequency’ (BAF). BAF helps to visualize the alternative positions of a particular lineage across a set of trees, which is particularly informative for taxa whose position is poorly resolved. BAF will indicate whether the lineage in question is attaching at many branches, each with low frequency, or is found at a small number of positions. The resulting node labels are not an indication of clade support, but instead show the frequency with which the lineage in question attaches along the stem of the minimal clade containing all daughter taxa of the stem. This most recent common ancestor approach accommodates topological variation within the treeset, as not all clades in the specified tree will necessarily be found in every tree in the set. BAF conveys far more information about the placement of a lineage than does the frequency of a single position, as inferred for instance from clade support values on a consensus or most likely tree. BAF can help to guide future taxon sampling by indicating which branches are most relevant to resolving the position of a particular lineage of interest.


    3 SEQUENCE DATA MANIPULATION
 
Phyutility can manipulate molecular sequence data and alignments in several ways. Most Phyutility sequence analyses accept and produce files in Fasta or Nexus format. Phyutility can concatenate alignments across multiple Fasta or Nexus files that may or may not have completely overlapping taxa, a frequent operation prior to producing phylogenetic trees. Another common task is parsing NCBI GenBank Fasta files. Phyutility can parse many GenBank Fasta entries in one or multiple files. The names of the sequences in the output file are determined by input options, which greatly facilitates downstream analyses.
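Concatenation with partially overlapping taxa can be sketched as follows. This is a minimal Python illustration, not Phyutility's implementation: it assumes each alignment has already been read into a dictionary of taxon name to aligned sequence, and padding absent taxa with "?" is an assumption (the source does not say which character Phyutility writes).

```python
def concatenate(alignments, missing="?"):
    """Join alignments end to end, padding taxa absent from a block.

    `alignments` is a list of dicts mapping taxon name -> aligned sequence.
    """
    taxa = sorted(set().union(*alignments))
    parts = {taxon: [] for taxon in taxa}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within an alignment must have equal length")
        width = lengths.pop()
        for taxon in taxa:
            # A taxon missing from this block gets a run of missing-data symbols.
            parts[taxon].append(aln.get(taxon, missing * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


# Toy data: taxon A lacks gene 2 and taxon C lacks gene 1.
gene1 = {"A": "ACGT", "B": "AC-T"}
gene2 = {"B": "GG", "C": "GA"}
combined = concatenate([gene1, gene2])
```

Every taxon in the result has the same total length, which is what downstream tree-building programs expect of a concatenated matrix.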

Many researchers edit alignments by eye. With increasingly powerful multiple sequence alignment algorithms such as MUSCLE (Edgar, 2004) and DIALIGN (Morgenstern, 2004), it is possible to standardize the editing of alignments by removing sites based on the percentage of missing data per site (Castresana, 2000). This is essential when performing a meta-analysis. Phyutility can trim alignments of sites with gaps based on the percentage of missing data designated by the user.
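The trimming rule can be made concrete with a short Python sketch. It is an illustration under stated assumptions rather than Phyutility's code: the alignment is a dictionary of equal-length sequences, "-" and "?" both count as missing data, and the parameter name `max_missing` (the largest tolerated proportion of missing data per site) is hypothetical.

```python
def trim_alignment(alignment, max_missing, missing="-?"):
    """Drop columns whose proportion of missing data exceeds `max_missing`.

    `alignment` maps taxon name -> aligned sequence (all the same length);
    `max_missing` is a proportion between 0 and 1.
    """
    names = list(alignment)
    columns = zip(*(alignment[name] for name in names))
    kept = [
        col for col in columns
        if sum(ch in missing for ch in col) / len(col) <= max_missing
    ]
    return {
        name: "".join(col[i] for col in kept)
        for i, name in enumerate(names)
    }


aln = {
    "A": "AC-GT",
    "B": "A--GT",
    "C": "ACTG-",
    "D": "A--G?",
}
# Site 3 is missing in three of four sequences (0.75) and is removed.
trimmed = trim_alignment(aln, max_missing=0.5)
# With no missing data tolerated, only the two complete sites remain.
strict = trim_alignment(aln, max_missing=0.0)
```

Applying one fixed threshold to every alignment in a study is what makes the editing reproducible, in contrast to trimming by eye.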


    4 INTERFACES TO BIOINFORMATICS TOOLS AND DATABASES
 
Phyutility makes use of two major Java bioinformatics libraries: JADE (part of the PEBLS project; Smith, 2006) and JEBL, the Java Evolutionary Biology Library (http://sourceforge.net/projects/jebl). These libraries cannot be used as standalone programs, and Phyutility acts as a convenient interface to their functionality.

The size of typical treefiles has increased dramatically in recent years, even in routine analyses. Whether large because of many taxa, many trees or both, these files present a practical challenge for many tasks in phyloinformatics. To deal with the inherent memory problems associated with these files, Phyutility employs an integrated database called Derby (http://db.apache.org/derby/). If the number of trees being read for a task exceeds the memory capacity, Derby is engaged and the trees are stored in a temporary database, where disk space is the only limit. Certain analyses, such as consensus building, cannot use the database in this way because of technical limitations; for these, the user should thin the tree files first (tree thinning does employ the database).

Phyutility also acts as an interface to NCBI's search and fetch functions. Currently, the Phyutility search function returns the number of hits for the search term as well as the gi numbers of the sequences matching the search term. Phyutility can also fetch sequences from NCBI databases using the gi numbers. Phyutility provides several considerable improvements over the existing web interface. First, the user can designate a maximum length of sequence to retrieve, which is particularly useful when trying to avoid genomic sequences. Second, the user has considerable control over the output of the retrieved sequence names. The user can form names using any of the following elements: gi number, gb number, taxon id, organism name, defline and sequence length. The user can also supply a custom separator between the elements. These two major functions can be useful for more than just simple searching and fetching. For example, if the user has a Fasta file with the names of each sequence containing gi numbers, Phyutility may be used to search and retrieve missing, non-overlapping sequences from GenBank that may be appended to the original file. This is especially useful when keeping large, mined datasets up to date.
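The naming and update logic described here can be sketched in Python without touching the network. Nothing below is Phyutility's API: the field keys ("gi", "organism", "length"), the function names and the identifier values are all made up for illustration, and the actual search and fetch against NCBI is deliberately left out.

```python
def format_name(record, fields, separator="_"):
    """Build a sequence name from the chosen record fields, in order."""
    return separator.join(str(record[field]) for field in fields)


def missing_ids(local_names, remote_ids, separator="_", position=0):
    """Return the ids from a search that are absent from the local names.

    Assumes each local name holds its gi number at `position` once the
    name is split on `separator`.
    """
    have = {name.split(separator)[position] for name in local_names}
    return [gi for gi in remote_ids if gi not in have]


# Placeholder record; the identifier and organism are not real entries.
record = {"gi": "1001", "organism": "Genus_species", "length": 642}
name = format_name(record, ["gi", "organism", "length"], separator="|")

# Names already in a local Fasta file versus ids returned by a new search.
local = ["1001|Genus_species|642", "1003|Genus_other|655"]
to_fetch = missing_ids(local, ["1001", "1002", "1003", "1004"], separator="|")
```

Only the ids left in `to_fetch` would need to be retrieved and appended, which is the sense in which a mined dataset can be kept up to date incrementally.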


    5 APPLICATIONS
 
Phyutility is currently used to perform data manipulation and analyses in the collaborative project Tolkin (Beaman et al., 2006).


    ACKNOWLEDGEMENTS
 
We appreciate valuable feedback from Michael Donoghue, David Tank, Kellie Heckman, Erem Kazancioglu and two anonymous reviewers. Much is owed to the many early testers of Phyutility. Thanks to Brian Moore for suggesting the name Phyutility. SAS was partially supported by NSF Cyberinfrastructure for Phylogenetic Research (CIPRES) grant EF-0331654.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Keith Crandall

Received on October 10, 2007; revised on November 19, 2007; accepted on December 10, 2007

    REFERENCES

    Beaman R, et al. TOLKIN v.1.0. (2006) www.tolkin.org.

    Castresana J. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol. Biol. Evol. (2000) 17:540–552.

    Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. (2004) 32:1792–1797.

    Morgenstern B. DIALIGN: multiple DNA and protein sequence alignment at BiBiServ. Nucleic Acids Res. (2004) 32:W33–W36.

    Ronquist F, Huelsenbeck JP. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics (2003) 19:1572–1574.

    Smith SA. JADE 1.0: Java component of the PEBLS evolutionary biology libraries. (2006) http://code.google.com/p/pebls.

    Sukumaran J. bootscore: A Bootstrap Tree Scoring Utility. Version 3.0. (2007) http://sourceforge.net/projects/bootscore.

    Thorley JL, Page RD. RadCon: phylogenetic tree comparison and consensus. Bioinformatics (2000) 16:486–487.

    Thorley JL, Wilkinson M. Testing the phylogenetic stability of early tetrapods. J. Theor. Biol. (1999) 200:343–344.

