Bioinformatics Advance Access published online on May 8, 2007

Bioinformatics, doi:10.1093/bioinformatics/btm247

© The Author (2007). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Fast Model-based Protein Homology Detection without Alignment

Sepp Hochreiter a,*, Martin Heusel b and Klaus Obermayer b

a Institute of Bioinformatics, Johannes Kepler Universität Linz, 4040 Linz, Austria
b Department of Electrical Engineering and Computer Science, Technische Universität Berlin and Bernstein Center for Computational Neuroscience, 10587 Berlin, Germany

*To whom correspondence should be addressed. Prof. Sepp Hochreiter, E-mail: hochreit{at}bioinf.jku.at


Abstract

Motivation: As more genomes are sequenced, the demand for fast gene classification techniques is increasing. To analyze a newly sequenced genome, the genes are first identified and translated into amino acid sequences, which are then classified into structural or functional classes. The best performing protein classification methods are based on protein homology detection using sequence alignment methods. Alignment methods have recently been enhanced by discriminative methods like support vector machines as well as by position-specific scoring matrices (PSSMs) obtained from PSI-BLAST.

However, alignment methods are time consuming if a new sequence must be compared to many known sequences; the same holds for support vector machines. Constructing a PSSM for the new sequence is even more time consuming. The best performing methods would take about 25 days on current computers to classify the sequences of a new genome (20,000 genes) as belonging to just one specific class, yet there are hundreds of classes.
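As a rough back-of-the-envelope check of the figures above, the sketch below converts the quoted 25 days for 20,000 genes into a per-sequence cost and scales it to several classes. The class count of 100 is only an illustrative stand-in for "hundreds", and the calculation assumes that classes are processed one after another with no shared computation; neither assumption comes from the article.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_sequence_seconds(total_days, n_sequences):
    """Average time spent on one sequence for a single class."""
    return total_days * SECONDS_PER_DAY / n_sequences


def total_days_for_classes(days_per_class, n_classes):
    """Total time if every class is processed sequentially (an assumption)."""
    return days_per_class * n_classes


# Figures quoted in the abstract: about 25 days for 20,000 genes and one class.
seconds_each = per_sequence_seconds(25, 20_000)

# Hypothetical class count, chosen only to illustrate "hundreds of classes".
days_all = total_days_for_classes(25, 100)

print(f"{seconds_each:.0f} s per sequence, {days_all} days for 100 classes")
```

Under these assumptions, each sequence costs on the order of minutes per class, which is why the per-class cost dominates once many classes are involved.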

Another shortcoming of alignment algorithms is that they do not build a model of the positive class but instead measure the mutual distance between sequences or profiles. Multiple alignment and hidden Markov models are the only popular classification methods that build a model of the positive class, but they show low classification performance. The advantage of a model is that it can be analyzed for chemical properties common to the class members, yielding new insights into protein function and structure.

We propose a fast model-based recurrent neural network for protein homology detection, the "Long Short-Term Memory" (LSTM). LSTM automatically extracts indicative patterns for the positive class but, in contrast to profile methods, it also extracts negative patterns and uses correlations between all detected patterns for classification. LSTM can automatically extract useful local and global sequence statistics such as hydrophobicity, polarity, volume and polarizability and combine them with a pattern. These properties make LSTM complementary to alignment-based approaches, as it does not use predefined similarity measures like BLOSUM or PAM matrices.
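To make the idea of alignment-free classification concrete, the following is a minimal sketch of a recurrent classifier that reads a protein one residue at a time and outputs a score for the positive class. It is not the authors' implementation: it uses a common LSTM cell with input, forget and output gates, which may differ from the variant in the article, and the one-hot encoding, hidden size, weights and example sequence are all hypothetical. The weights are random and untrained, so the score itself carries no meaning.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot(residue):
    """Encode one residue as a 20-dimensional indicator vector."""
    return [1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_params(n_in, n_hidden, seed=0):
    """Random, untrained weights (hypothetical; a real model would be trained)."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    params = {g: {"W": matrix(n_hidden, n_in + n_hidden), "b": [0.0] * n_hidden}
              for g in ("i", "f", "o", "g")}
    params["out"] = {"w": [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)], "b": 0.0}
    return params


def affine(W, b, v):
    """Compute W v + b for a list-of-lists matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_score(sequence, params):
    """Single pass over the sequence; the cost grows linearly with its length."""
    n_hidden = len(params["out"]["w"])
    h = [0.0] * n_hidden  # hidden state
    c = [0.0] * n_hidden  # memory cell state
    for residue in sequence:
        v = one_hot(residue) + h  # concatenate input and previous hidden state
        i = [sigmoid(a) for a in affine(params["i"]["W"], params["i"]["b"], v)]
        f = [sigmoid(a) for a in affine(params["f"]["W"], params["f"]["b"], v)]
        o = [sigmoid(a) for a in affine(params["o"]["W"], params["o"]["b"], v)]
        g = [math.tanh(a) for a in affine(params["g"]["W"], params["g"]["b"], v)]
        c = [ft * ct + it * gt for ft, ct, it, gt in zip(f, c, i, g)]
        h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    logit = sum(w * x for w, x in zip(params["out"]["w"], h)) + params["out"]["b"]
    return sigmoid(logit)  # value in (0, 1), read as a positive-class score


params = init_params(n_in=20, n_hidden=8)
score = lstm_score("MKTAYIAKQR", params)  # made-up example sequence
```

The point of the sketch is structural: classifying a new sequence needs only one pass through a fixed model, with no comparison against a database of known sequences and no substitution matrix.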

Results: We have applied LSTM to a well-known benchmark for remote protein homology detection, where a protein must be classified as belonging to a SCOP superfamily. LSTM reaches state-of-the-art classification performance but is considerably faster for classification than other approaches with comparable classification performance. LSTM is five orders of magnitude faster than methods which perform slightly better in classification and two orders of magnitude faster than the fastest SVM-based approaches (which, however, have lower classification performance than LSTM). Only PSI-BLAST and HMM-based methods show time complexity comparable to that of LSTM, but they cannot compete with LSTM in classification performance.

To test the modeling capabilities of LSTM, we applied LSTM to PROSITE classes and interpreted the extracted patterns. In 8 out of 15 classes, LSTM automatically extracted the PROSITE motif. In the remaining 7 classes, LSTM generated alternative motifs that give better classification results on average than the PROSITE motifs.
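For readers unfamiliar with PROSITE motifs, the sketch below shows how a pattern written in PROSITE syntax can be turned into a regular expression and matched against a sequence. It is a simplified illustration only: the helper `prosite_to_regex` is hypothetical, it does not handle every corner of the syntax (for example anchors inside brackets), and the pattern used, the N-glycosylation site `N-{P}-[ST]-{P}`, is a well-known PROSITE entry chosen for illustration and is not claimed to be one of the 15 classes studied. The test sequences are made up.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into Python regex syntax."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # {P} means "any residue except P"
        element = element.replace("{", "[^").replace("}", "]")
        # x(2,4) means "between two and four arbitrary residues"
        element = element.replace("(", "{").replace(")", "}")
        # lower-case x stands for any residue
        element = element.replace("x", ".")
        # < and > anchor the pattern to the N- and C-terminus
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)


regex = prosite_to_regex("N-{P}-[ST]-{P}")
hit = re.search(regex, "AANGSAA")   # contains N, non-P, S, non-P
miss = re.search(regex, "AANPSAA")  # P directly after N breaks the motif
```

A motif of this kind gives a yes/no decision per sequence, which is what makes it a natural baseline for comparing the patterns a trained model has extracted.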

Availability: The LSTM algorithm is available from http://www.bioinf.jku.at/software/LSTM_protein/.

Associate Editor: Dr. Limsoon Wong


Received on December 22, 2006; revised on April 18, 2007; accepted on May 1, 2007
