Bioinformatics Advance Access published online on March 9, 2006
Bioinformatics, doi:10.1093/bioinformatics/btl088
1 Instituto de Matemática e Estatística, Universidade de São Paulo, Rua do Matão 1010 CEP 05508-090, São Paulo, Brazil
* To whom correspondence should be addressed.
Motivation: A central problem in genomics is to determine the function of a protein using the information contained in its amino acid sequence. Variable Length Markov Chains (VLMC) are a promising class of models that can effectively classify proteins into families, and they can be estimated in linear time and space.
Results: We introduce a new algorithm, called Sparse Probabilistic Suffix Trees (SPST), that identifies equivalences between the contexts of a VLMC. We show that, in many cases, identifying these equivalences can improve the classification rate of the classical Probabilistic Suffix Trees (PST) algorithm. We also show that better classification can be achieved by identifying representative fingerprints in the amino acid chains; this variation of the SPST algorithm is called F-SPST.
Availability: The SPST algorithm can be freely downloaded from http://www.ime.usp.br/~leonardi/spst/.
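As a rough illustration of the variable-length context idea mentioned in the abstract, the following is a minimal sketch of a plain variable-length Markov model for sequence scoring. It is not the SPST or F-SPST algorithm and does not identify equivalences between contexts; the function names (`train_vlmc`, `next_symbol_prob`, `log_likelihood`), the parameters (`max_depth`, `min_count`, `pseudo`), and the simple count threshold used to keep contexts are all assumptions made for this example.

```python
import math
from collections import defaultdict


def train_vlmc(sequences, max_depth=3, min_count=2):
    """Count next-symbol occurrences for every context of length 0..max_depth.

    Hypothetical helper: contexts are kept only if they were observed at
    least `min_count` times (the empty context is always kept).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i, symbol in enumerate(seq):
            for k in range(min(i, max_depth) + 1):
                context = seq[i - k:i]  # the k symbols preceding position i
                counts[context][symbol] += 1
    return {
        ctx: dict(nxt)
        for ctx, nxt in counts.items()
        if ctx == "" or sum(nxt.values()) >= min_count
    }


def next_symbol_prob(model, history, symbol,
                     max_depth=3, alphabet_size=20, pseudo=0.01):
    """Probability of `symbol` given the longest suffix of `history` in the model.

    Additive smoothing with `pseudo` avoids zero probabilities; the values
    used here are illustrative assumptions.
    """
    for k in range(min(len(history), max_depth), -1, -1):
        context = history[len(history) - k:]
        if context in model:
            nxt = model[context]
            total = sum(nxt.values())
            return (nxt.get(symbol, 0) + pseudo) / (total + pseudo * alphabet_size)
    # Model has no contexts at all (e.g. trained on nothing): uniform fallback.
    return 1.0 / alphabet_size


def log_likelihood(model, seq, **kwargs):
    """Sum of log-probabilities of each symbol given its preceding context."""
    return sum(
        math.log(next_symbol_prob(model, seq[:i], seq[i], **kwargs))
        for i in range(len(seq))
    )


if __name__ == "__main__":
    family = ["ACACACAC", "ACACAC"]  # toy training "family", not real data
    model = train_vlmc(family)
    print(log_likelihood(model, "ACAC"))
    print(log_likelihood(model, "CCAA"))
```

In a classification setting, one such model would be trained per family and a query sequence assigned to the family whose model gives it the highest (length-normalized) log-likelihood; that decision rule is likewise an assumption of this sketch.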
Received October 26, 2005
Revised February 19, 2006
Accepted March 6, 2006
Article
A generalization of the PST algorithm: modeling the sparse nature of protein sequences
Florencia G. Leonardi 1 *
Florencia G. Leonardi, E-mail: leonardi{at}ime.usp.br
Abstract
Associate Editor: Thomas Lengauer
This article has been cited by other articles:
H. Li, X. Dai, and X. Zhao. A nearest neighbor approach for automated transporter prediction and categorization from protein sequences. Bioinformatics, May 1, 2008; 24(9): 1129-1136.
