Bioinformatics Advance Access published online on November 29, 2005
Bioinformatics, doi:10.1093/bioinformatics/bti801
1 School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
* To whom correspondence should be addressed.
Motivation: Remote homology detection between protein sequences is a central problem in computational biology. Discriminative methods such as the Support Vector Machine (SVM) are among the most effective approaches. Many SVM-based methods focus on finding useful representations of protein sequences, using either explicit feature vector representations or kernel functions. Such representations may suffer from the peaking phenomenon seen in many machine learning methods, because the feature vectors are usually very high-dimensional and noisy data may be introduced. Based on these observations, this research focuses on feature extraction and efficient representation of protein vectors for SVM protein classification.
Results: In this study, a latent semantic analysis model, an efficient feature extraction technique from natural language processing, is introduced into protein remote homology detection. Several basic building blocks of protein sequences are investigated as the "words" of the "protein sequence language", including N-grams, patterns and motifs. Each protein sequence is treated as a "document" composed of a bag of words. The word-document matrix is constructed first. Latent semantic analysis is then performed on the matrix to produce latent semantic representation vectors of the protein sequences, which removes noise and yields a compact description of each sequence. The latent semantic representation vectors are then evaluated with an SVM. The method is tested on the SCOP 1.53 database. The results show that the latent semantic analysis model significantly improves the performance of remote homology detection in comparison with the basic formalisms. Furthermore, the performance of this method is comparable with that of complex kernel methods such as SVM-LA and better than that of other sequence-based methods such as PSI-BLAST and SVM-pairwise.
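The steps named in the Results (words, word-document matrix, latent semantic analysis) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: it assumes overlapping 3-grams as the words, raw counts as matrix entries, a plain truncated SVD for the latent semantic analysis, and three made-up toy sequences. The function names, the value of n and the rank k are all hypothetical choices, and the SVM evaluation step is left out.

```python
from collections import Counter

import numpy as np


def ngrams(seq, n=3):
    """Return all overlapping n-grams ("words") of a sequence."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def word_document_matrix(seqs, n=3):
    """Build a count matrix: rows are words, columns are documents."""
    counts = [Counter(ngrams(s, n)) for s in seqs]
    vocab = sorted(set().union(*counts))
    matrix = np.array([[c[w] for c in counts] for w in vocab], dtype=float)
    return vocab, matrix


def lsa_vectors(matrix, k):
    """Truncated SVD: return one k-dimensional vector per document."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    k = min(k, len(s))
    # Columns of diag(s_k) @ Vt_k are the documents in the latent space;
    # transpose so that each row is one document.
    return (s[:k, None] * vt[:k]).T


# Toy sequences invented for illustration (not taken from SCOP).
toy_seqs = ["MKVLAAGIVK", "MKVLSAGIVR", "GSHMTTPLEQ"]
vocab, wd = word_document_matrix(toy_seqs, n=3)
docs = lsa_vectors(wd, k=2)
```

Each row of `docs` is a reduced representation of one sequence; in a setup like the one the abstract describes, such vectors would then be passed to an SVM classifier in place of the full word-count vectors.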
Availability: The source code is freely available at http://www.insun.hit.edu.cn/news/view.asp?id=413 or upon request from the authors.
Received July 17, 2005
Revised November 6, 2005
Accepted November 24, 2005
Application of latent semantic analysis to protein remote homology detection
Qi-wen Dong 1 *, Xiao-long Wang 1 and Lei Lin 1
Qi-wen Dong, E-mail: qwdong{at}insun.hit.edu.cn
