Bioinformatics Advance Access originally published online on November 29, 2005
Bioinformatics 2006 22(3):285-290; doi:10.1093/bioinformatics/bti801
© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Application of latent semantic analysis to protein remote homology detection

Qi-wen Dong *, Xiao-long Wang and Lei Lin

School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China

*To whom correspondence should be addressed.

Motivation: Remote homology detection between protein sequences is a central problem in computational biology. Discriminative methods such as the support vector machine (SVM) are among the most effective approaches. Many SVM-based methods focus on finding useful representations of protein sequences, using either explicit feature vector representations or kernel functions. Such representations may suffer from the peaking phenomenon seen in many machine-learning methods, because the feature sets are usually very large and noisy data may be introduced. Based on these observations, this research focuses on feature extraction and efficient representation of protein vectors for SVM protein classification.

Results: In this study, latent semantic analysis (LSA), an efficient feature extraction technique from natural language processing, is introduced to protein remote homology detection. Several basic building blocks of protein sequences are investigated as the ‘words’ of a ‘protein sequence language’, including N-grams, patterns and motifs. Each protein sequence is treated as a ‘document’ represented as a bag of words. A word-document matrix is constructed first, and LSA is then performed on this matrix to produce latent semantic representation vectors of the protein sequences, which removes noise and yields a compact description of each sequence. The latent semantic representation vectors are then evaluated with an SVM. The method is tested on the SCOP 1.53 database. The results show that the LSA model significantly improves the performance of remote homology detection in comparison with the basic formalisms. Furthermore, the performance of this method is comparable with that of complex kernel methods such as SVM-LA and better than that of other sequence-based methods such as PSI-BLAST and SVM-pairwise.
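The pipeline described above (words, word-document matrix, LSA, latent vectors) can be illustrated with a minimal sketch. It is not the authors' released code: it assumes raw N-gram counts with no weighting, uses NumPy's dense SVD, and picks the N-gram length and the latent rank arbitrarily. The toy sequences and all function names are hypothetical, and the final SVM step is omitted.

```python
from itertools import product

import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def ngram_vocabulary(n):
    """Map every possible amino-acid N-gram ('word') to a row index."""
    return {"".join(p): i for i, p in enumerate(product(AMINO_ACIDS, repeat=n))}


def word_document_matrix(sequences, n):
    """Rows are N-gram words, columns are sequences ('documents'), entries are raw counts."""
    vocab = ngram_vocabulary(n)
    matrix = np.zeros((len(vocab), len(sequences)))
    for j, seq in enumerate(sequences):
        for k in range(len(seq) - n + 1):
            idx = vocab.get(seq[k:k + n])
            if idx is not None:  # skip N-grams containing non-standard residues
                matrix[idx, j] += 1
    return matrix


def lsa_fit(matrix, rank):
    """Truncated SVD: keep the top `rank` singular triplets of the matrix."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    u_r, s_r = u[:, :rank], s[:rank]
    doc_vectors = vt[:rank].T  # one latent vector per training sequence
    return u_r, s_r, doc_vectors


def lsa_project(u_r, s_r, column):
    """Fold a count vector into the latent space: d^T U_r S_r^{-1}."""
    return (column @ u_r) / s_r


# Hypothetical toy sequences, invented for illustration only.
sequences = [
    "MKVLAAGIVGLKVLAAG",
    "MKVLSAGIVGLKVLSAG",
    "GSHMPEPTIDEWWYQ",
    "GSHMPEPTIDEWFYQ",
]
matrix = word_document_matrix(sequences, n=2)      # assumed N-gram length
u_r, s_r, doc_vectors = lsa_fit(matrix, rank=2)    # assumed latent rank
projected = lsa_project(u_r, s_r, matrix[:, 0])
```

In this sketch, `doc_vectors` holds the low-dimensional representation of each training sequence, and `lsa_project` maps an unseen sequence's count vector into the same space. These vectors, rather than the full count columns, would then be passed to an SVM classifier.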

Availability: The source code is freely available at http://www.insun.hit.edu.cn/news/view.asp?id=413 or upon request from the authors.

Contact: qwdong{at}insun.hit.edu.cn


Received on July 17, 2005; revised on November 6, 2005; accepted on November 24, 2005


This article has been cited by other articles:

S. Hochreiter, M. Heusel, and K. Obermayer
Fast model-based protein homology detection without alignment
Bioinformatics, July 15, 2007; 23(14): 1728-1736.

T. Lingner and P. Meinicke
Remote homology detection based on oligomer distances
Bioinformatics, September 15, 2006; 22(18): 2224-2231.


