Bioinformatics Vol. 16 no. 3 2000
Pages 233-244
© 2000 Oxford University Press
Fast probabilistic analysis of sequence function using scoring matrices
1 Department of Biochemistry, Stanford
University School of Medicine, Stanford, CA, USA
2 Department of Computer Science, Rutgers
University, Piscataway, NJ, USA
Received on April 6, 1999
; revised on August 4, 1999
; accepted on September 10, 1999
Motivation: We present techniques for increasing the speed of sequence analysis using scoring matrices. Our techniques are based on calculating, for a given scoring matrix, the quantile function, which assigns a probability, or p, value to each segmental score. Our techniques also permit the user to specify a p threshold to indicate the desired trade-off between sensitivity and speed for a particular sequence analysis. The resulting increase in speed should allow scoring matrices to be used more widely in large-scale sequencing and annotation projects.
Results: We develop three techniques for increasing the speed of sequence analysis: probability filtering, lookahead scoring, and permuted lookahead scoring. In probability filtering, we compute the score threshold that corresponds to the user-specified p threshold. We use the score threshold to limit the number of segments that are retained in the search process. In lookahead scoring, we test intermediate scores to determine whether they will possibly exceed the score threshold. In permuted lookahead scoring, we score each segment in a particular order designed to maximize the likelihood of early termination. Our two lookahead scoring techniques reduce substantially the number of residues that must be examined. The fraction of residues examined ranges from 62 to 6%, depending on the p threshold chosen by the user. These techniques permit sequence analysis with scoring matrices at speeds that are several times faster than existing programs. On a database of 12 177 alignment blocks, our techniques permit sequence analysis at a speed of 225 residues/s for a p threshold of 10-6, and 541 residues/s for a p threshold of 10-20.
In order to compute the quantile function, we may use either an independence assumption or a Markov assumption. We measure the effect of first- and second-order Markov assumptions and find that they tend to raise the p value of segments, when compared with the independence assumption, by average ratios of 1.30 and 1.69, respectively. We also compare our technique with the empirical 99.5th percentile scores compiled in the BLOCKSPLUS database, and find that they correspond on average to a p value of 1.5 x 10-5.
Availability: The techniques described above are implemented in a software package called EMATRIX. This package is available from the authors for free academic use or for licensed commercial use. The EMATRIX set of programs is also available on the Internet at http://motif.stanford.edu/ematrix.
*To whom correspondence should be addressed.
3Present address: Genentech, Inc., 1 DNA Way, South San Francisco, CA 94080, USA. E-mail: twu@gene.com.
![]()
CiteULike
Connotea
Del.icio.us What's this?
This article has been cited by other articles:
![]() |
J. Korhonen, P. Martinmaki, C. Pizzi, P. Rastas, and E. Ukkonen MOODS: fast search for position weight matrix matches in DNA sequences Bioinformatics, December 1, 2009; 25(23): 3181 - 3182. [Abstract] [Full Text] [PDF] |
||||
![]() |
U. J. Pape, S. Rahmann, and M. Vingron Natural similarity measures between position frequency matrices with an application to clustering Bioinformatics, February 1, 2008; 24(3): 350 - 357. [Abstract] [Full Text] [PDF] |
||||
![]() |
X. Huang and D. L. Brutlag Dynamic use of multiple parameter sets in sequence alignment Nucleic Acids Res., January 28, 2007; 35(2): 678 - 686. [Abstract] [Full Text] [PDF] |
||||
![]() |
V. Freschi and A. Bogliolo Using sequence compression to speedup probabilistic profile matching Bioinformatics, May 15, 2005; 21(10): 2225 - 2229. [Abstract] [Full Text] [PDF] |
||||
![]() |
Y. Barash, G. Elidan, T. Kaplan, and N. Friedman CIS: compound importance sampling method for protein-DNA binding site p-value estimation Bioinformatics, March 1, 2005; 21(5): 596 - 600. [Abstract] [Full Text] [PDF] |
||||
![]() |
Q. J. Su, L. Lu, S. Saxonov, and D. L. Brutlag eBLOCKs: enumerating conserved protein blocks to achieve maximal sensitivity and specificity Nucleic Acids Res., January 1, 2005; 33(suppl_1): D178 - D182. [Abstract] [Full Text] [PDF] |
||||
![]() |
Z. Qin, M. Shen, and S. N. Cohen Identification and Characterization of a pSLA2 Plasmid Locus Required for Linear DNA Replication and Circular Plasmid Stable Inheritance in Streptomyces lividans J. Bacteriol., November 15, 2003; 185(22): 6575 - 6582. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. P. Bennett, L. Lu, and D. L. Brutlag 3MATRIX and 3MOTIF: a protein structure visualization system for conserved sequence motifs Nucleic Acids Res., July 1, 2003; 31(13): 3328 - 3332. [Abstract] [Full Text] [PDF] |
||||
![]() |
Y. Pouliot, J. Gao, Q. J. Su, G. G. Liu, and X. B. Ling DIAN: A Novel Algorithm for Genome Ontological Classification Genome Res., October 1, 2001; 11(10): 1766 - 1779. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. Y. Huang and D. L. Brutlag The EMOTIF database Nucleic Acids Res., January 1, 2001; 29(1): 202 - 204. [Abstract] [Full Text] [PDF] |
||||



