Skip Navigation


Bioinformatics Advance Access originally published online on February 1, 2008
Bioinformatics 2008 24(6):783-790; doi:10.1093/bioinformatics/btn028
This Article
Right arrow Full Text Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
24/6/783    most recent
btn028v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Shah, A. R.
Right arrow Articles by Webb-Robertson, B.-J.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Shah, A. R.
Right arrow Articles by Webb-Robertson, B.-J.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

SVM-HUSTLE—an iterative semi-supervised machine learning approach for pairwise protein remote homology detection

Anuj R. Shah 1,*, Christopher S. Oehmen 2 and Bobbie-Jo Webb-Robertson 2

1Scientific Data Management and 2Computational Biology and Bioinformatics, Pacific Northwest National Laboratory, Richland, WA, USA

*To whom correspondence should be addressed.


   Abstract

Motivation: As the amount of biological sequence data continues to grow exponentially we face the increasing challenge of assigning function to this enormous molecular ‘parts list’. The most popular approaches to this challenge make use of the simplifying assumption that similar functional molecules, or proteins, sometimes have similar composition, or sequence. However, these algorithms often fail to identify remote homologs (proteins with similar function but dissimilar sequence) which often are a significant fraction of the total homolog collection for a given sequence. We introduce a Support Vector Machine (SVM)-based tool to detect homology using semi-supervised iterative learning (SVM-HUSTLE) that identifies significantly more remote homologs than current state-of-the-art sequence or cluster-based methods. As opposed to building profiles or position specific scoring matrices, SVM-HUSTLE builds an SVM classifier for a query sequence by training on a collection of representative high-confidence training sets, recruits additional sequences and assigns a statistical measure of homology between a pair of sequences. SVM-HUSTLE combines principles of semi-supervised learning theory with statistical sampling to create many concurrent classifiers to iteratively detect and refine, on-the-fly, patterns indicating homology.

Results: When compared against existing methods for identifying protein homologs (BLAST, PSI-BLAST, COMPASS, PROF_SIM, RANKPROP and their variants) on two different benchmark datasets SVM-HUSTLE significantly outperforms each of the above methods using the most stringent ROC1 statistic with P-values less than 1e-20. SVM-HUSTLE also yields results comparable to HHSearch but at a substantially reduced computational cost since we do not require the construction of HMMs.

Availability: The software executable to run SVM-HUSTLE can be downloaded from http://www.sysbio.org/sysbio/networkbio/svm_hustle

Contact: anuj.shah{at}pnl.gov

Associate Editor: Burkhard Rost


Received on August 14, 2007; revised on January 15, 2008; accepted on January 16, 2008

Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?




Disclaimer: Please note that abstracts for content published before 1996 were created through digital scanning and may therefore not exactly replicate the text of the original print issues. All efforts have been made to ensure accuracy, but the Publisher will not be held responsible for any remaining inaccuracies. If you require any further clarification, please contact our Customer Services Department.