Bioinformatics Advance Access published online on January 12, 2005

Bioinformatics, doi:10.1093/bioinformatics/bti244
All versions of this article: 21/9/1876 (most recent), bti244v1

Bioinformatics © Oxford University Press 2004; all rights reserved.
Received January 15, 2004
Revised December 2, 2004
Accepted December 17, 2004

Article

On the quality of tree-based protein classification

Betty Lazareva-Ulitsky 1*, Karen Diemer 1, and Paul D. Thomas 1

1 Computational Biology Department, Applied Biosystems, 850 Lincoln Centre Dr., Foster City, CA 94404 USA

* To whom correspondence should be addressed.
Betty Lazareva-Ulitsky, E-mail: betty.lazareva{at}fc.celera.com


Abstract

Motivation: Phylogenetic analysis of protein sequences is widely used in protein function classification and in the delineation of subfamilies within larger families. In addition, the recent increase in the number of protein sequence entries annotated with controlled vocabulary terms describing function (such as the Gene Ontology) suggests that it may be possible to overlay these terms onto phylogenetic trees to automatically locate functional divergence events in protein family evolution. Phylogenetic analysis of large data sets requires fast algorithms, yet even "fast", approximate distance matrix-based phylogenetic algorithms are slow on large data sets, since they involve calculating Maximum Likelihood (ML) estimates of pairwise evolutionary distances. There have been many attempts to classify protein sequences at the family and subfamily level without reconstructing phylogenetic trees, using instead hierarchical clustering with simpler distance measures, which also produces trees, or dendrograms. How can these trees be compared in their ability to accurately classify protein sequences?

Results: Given a "reference classification" or "group membership labels" for a set of related protein sequences, as well as a tree describing their relationships (e.g. a phylogenetic tree), we propose a method for dividing the tree into mono- or para-phyletic groups so as to optimize the correspondence between the reference groups and the tree-derived groups. We call the achieved optimal correspondence the "accuracy of a tree-based classification"; it measures the ability of a tree to separate proteins of similar function into mono- or para-phyletic groups. We apply this measure to compare classical NJ and UPGMA phylogenetic trees with the trees obtained from hierarchical clustering using different protein similarity measures. Our preliminary analysis on a set of expert-curated protein families and alignments suggests that there is no uniformly superior algorithm, and that simple protein similarity measures combined with hierarchical clustering produce trees with reasonable, and often the most accurate, tree-based classification. We used our measure to help design TIPS, a tree-building algorithm based on agglomerative clustering with a similarity measure derived from profile scoring. TIPS is comparable with phylogenetic algorithms in terms of classification accuracy and is much faster on large protein families. Due to its time scalability and acceptable accuracy, TIPS is being used in the large-scale PANTHER protein classification project. The trees produced by different algorithms for different protein families can be viewed at http://panther.appliedbiosystems.com/pub/tree_quality/trees.jsp. For every tree and every level of classification granularity we provide the optimal tree-based classification along with the reference classification.
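The idea of cutting a tree so that the resulting groups best match a reference classification can be illustrated with a minimal sketch. It is not the authors' measure or script: it assumes a rooted binary tree written as nested tuples, considers monophyletic groups (clades) only, and scores a cut by purity, i.e. the fraction of sequences that carry the majority reference label of their clade. The names `tree_accuracy`, `leaves`, and the parameter `k` (the number of clades, standing in for a level of classification granularity) are hypothetical.

```python
from collections import Counter

NEG = float("-inf")


def leaves(node):
    """Return the leaf names under a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return [node]
    left, right = node
    return leaves(left) + leaves(right)


def _best_counts(node, labels):
    """scores[j] = max number of correctly grouped leaves when the subtree
    is cut into exactly j + 1 clades (purity criterion, an assumption)."""
    if isinstance(node, str):
        return [1]
    left, right = node
    ls, rs = _best_counts(left, labels), _best_counts(right, labels)
    # Keeping the whole subtree as one clade: count its majority label.
    majority = Counter(labels[x] for x in leaves(node)).most_common(1)[0][1]
    scores = [majority] + [NEG] * (len(ls) + len(rs) - 1)
    # Otherwise split the clades between the two child subtrees.
    for i, a in enumerate(ls):
        for j, b in enumerate(rs):
            idx = i + j + 1  # (i + 1) + (j + 1) clades in total
            scores[idx] = max(scores[idx], a + b)
    return scores


def tree_accuracy(tree, labels, k):
    """Best achievable purity over all cuts of the tree into k clades."""
    scores = _best_counts(tree, labels)
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and the number of leaves")
    return scores[k - 1] / len(scores)


# Toy example: group B is paraphyletic with respect to group A.
tree = ((("a1", "a2"), "b1"), ("b2", "b3"))
labels = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "b3": "B"}
for k in (1, 2, 3):
    print(k, tree_accuracy(tree, labels, k))
```

In the toy tree, two clades cannot separate the reference groups because group B is paraphyletic; three clades are needed for a perfect match. A method that also admits paraphyletic groups, as described above, could recover both reference groups with only two tree-derived groups.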

Availability: The script that evaluates the accuracy of tree-based classification is available at http://panther.appliedbiosystems.com/pub/tree_quality/index.jsp.






