Bioinformatics Vol. 17 no. 1 2001
Pages 23-43
© 2001 Oxford University Press
Original Paper
Variations on probabilistic suffix trees: statistical modeling and prediction of protein families
1 School of Computer Science and Engineering, Hebrew University, Jerusalem 91904, Israel
2 Department of Structural Biology, Fairchild Bldg. D-109, Stanford University, CA, 94305, USA
Received on November 1, 1999; revised on June 7, 2000; accepted on June 7, 2000
Motivation: We present a method for modeling protein families by means of probabilistic suffix trees (PSTs). The method is based on identifying significant patterns in a set of related protein sequences. The patterns can be of arbitrary length, the input sequences do not need to be aligned, and domain boundaries do not need to be delineated. The method is automatic and can be applied with surprising success without assuming any preliminary biological information. Basic biological considerations, such as amino acid background probabilities and amino acid substitution probabilities, can be incorporated to improve performance.
Results: The PST can serve as a predictive tool for protein sequence classification and for detecting conserved patterns (possibly functionally or structurally important) within protein sequences. The method was tested on the Pfam database of protein families with more than satisfactory performance. Exhaustive evaluations show that the PST model detects many more related sequences than pairwise methods such as Gapped-BLAST, and is almost as sensitive as a hidden Markov model trained from a multiple alignment of the input sequences, while being much faster.
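To make the general idea concrete, the following is a minimal sketch of a variable-length context model of the kind a PST represents: it predicts each residue from the longest previously seen suffix of the preceding residues and scores a query sequence by its average log-probability. It is an illustration only, not the authors' algorithm. The function names (`train`, `next_prob`, `score`), the parameters (`max_len`, `min_count`, `pseudo`) and the simple count threshold and pseudo-count smoothing are all assumptions introduced here.

```python
import math
from collections import defaultdict

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def train(sequences, max_len=3, min_count=2):
    """Count next-symbol occurrences after every context of length 0..max_len.

    Hypothetical simplification: a context is kept if it was seen at least
    min_count times; no alignment of the input sequences is needed.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i, sym in enumerate(seq):
            for k in range(min(max_len, i) + 1):
                counts[seq[i - k:i]][sym] += 1
    model = {ctx: dict(nxt) for ctx, nxt in counts.items()
             if ctx == "" or sum(nxt.values()) >= min_count}
    model.setdefault("", {})  # the empty context is always available
    return model


def next_prob(model, history, sym, max_len=3, pseudo=0.01):
    """P(sym | longest suffix of history stored in the model), smoothed."""
    for k in range(min(max_len, len(history)), -1, -1):
        ctx = history[len(history) - k:]
        if ctx in model:
            nxt = model[ctx]
            total = sum(nxt.values())
            return (nxt.get(sym, 0) + pseudo) / (total + pseudo * len(ALPHABET))
    raise AssertionError("unreachable: the empty context is always present")


def score(model, seq, max_len=3):
    """Average log-probability per residue of seq under the model."""
    if not seq:
        return 0.0
    logp = sum(math.log(next_prob(model, seq[:i], s, max_len))
               for i, s in enumerate(seq))
    return logp / len(seq)


# Toy, made-up "family" used only to exercise the functions.
family = ["ACDEFGACDEFG", "ACDEFGHACDEF"]
model = train(family)
member_score = score(model, "ACDEFG")
unrelated_score = score(model, "WWYYWW")
```

In this toy setting a sequence that shares short patterns with the training set receives a higher per-residue score than an unrelated one, which is the sense in which such a model can be used for classification.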
Availability: The programs are available upon request from the authors.
Contact: jill{at}cs.huji.ac.il; golan{at}cs.cornell.edu
* To whom correspondence should be addressed.
3 Address starting from January 2001: Department of Computer Science, Cornell University, Ithaca, NY 14853, USA.