Bioinformatics Advance Access originally published online on March 9, 2006
Bioinformatics 2006 22(11):1302-1307; doi:10.1093/bioinformatics/btl088
A generalization of the PST algorithm: modeling the sparse nature of protein sequences
Instituto de Matemática e Estatística, Universidade de São Paulo, Rua do Matão 1010, CEP 05508-090, São Paulo, Brazil
Motivation: A central problem in genomics is to determine the function of a protein using the information contained in its amino acid sequence. Variable-length Markov chains (VLMCs) are a promising class of models: they can effectively classify proteins into families, and they can be estimated in linear time and space.
Results: We introduce a new algorithm, called Sparse Probabilistic Suffix Trees (SPST), that identifies equivalences between the contexts of a VLMC. We show that, in many cases, identifying these equivalences improves the classification rate of the classical Probabilistic Suffix Trees (PST) algorithm. We also show that better classification can be achieved by identifying representative fingerprints in the amino acid chains; this variant of the SPST algorithm is called F-SPST.
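As an illustration only, the idea of treating several contexts as equivalent can be sketched in a few lines of Python. This is not the authors' implementation: the toy model, its probabilities, and the names `TOY_MODEL`, `matches`, `next_symbol_probs` and `log_likelihood` are all hypothetical, and the sketch covers only how an already-estimated model might score a sequence, not how contexts or equivalences are estimated.

```python
import math

# Hypothetical toy model over the three-letter alphabet {A, L, V}.
# Each context is a tuple of sets of symbols describing the positions
# immediately before the symbol to predict (oldest position first).
# Singleton sets give a classical PST-style context; a larger set merges
# several contexts into one equivalence class sharing a distribution.
TOY_MODEL = {
    (): {"A": 0.5, "L": 0.3, "V": 0.2},
    (frozenset("LV"),): {"A": 0.7, "L": 0.2, "V": 0.1},
    (frozenset("A"), frozenset("LV")): {"A": 0.1, "L": 0.6, "V": 0.3},
}


def matches(context, history):
    """Return True if the end of `history` fits `context` position by position."""
    if len(context) > len(history):
        return False
    tail = history[len(history) - len(context):]
    return all(symbol in allowed for symbol, allowed in zip(tail, context))


def next_symbol_probs(model, history):
    """Pick the longest matching context and return its next-symbol distribution."""
    # The empty context always matches, so the candidate list is never empty.
    # This toy model has no two matching contexts of the same length.
    candidates = [context for context in model if matches(context, history)]
    return model[max(candidates, key=len)]


def log_likelihood(model, sequence):
    """Sum the log-probabilities of each symbol given the symbols before it."""
    return sum(
        math.log(next_symbol_probs(model, sequence[:i])[symbol])
        for i, symbol in enumerate(sequence)
    )
```

In this toy model the histories `"AL"` and `"AV"` select the same context, so they share one next-symbol distribution. A classifier along these lines would hold one such model per protein family and assign a sequence to the family whose model gives it the highest (length-normalized) log-likelihood; that decision rule is likewise an assumption of this sketch.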
Availability: The SPST algorithm can be freely downloaded from the site http://www.ime.usp.br/~leonardi/spst/
Contact: leonardi{at}ime.usp.br
Supplementary information: Supplementary data are available at Bioinformatics online.
Received on October 26, 2005; revised on February 19, 2006; accepted on March 6, 2006
This article has been cited by other articles:
H. Li, X. Dai, and X. Zhao. A nearest neighbor approach for automated transporter prediction and categorization from protein sequences. Bioinformatics, May 1, 2008; 24(9): 1129-1136.