Skip Navigation

This Article
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow FREE Full Text (Screen PDF)
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (18)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Wu, T. D.
Right arrow Articles by Brutlag, D. L.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Wu, T. D.
Right arrow Articles by Brutlag, D. L.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

Bioinformatics Vol. 16 no. 3 2000
Pages 233-244
© 2000 Oxford University Press

Fast probabilistic analysis of sequence function using scoring matrices

Thomas D. Wu 1,3,*, Craig G. Nevill-Manning 2 and Douglas L. Brutlag 1

1 Department of Biochemistry, Stanford University School of Medicine, Stanford, CA, USA
2 Department of Computer Science, Rutgers University, Piscataway, NJ, USA

Received on April 6, 1999 ; revised on August 4, 1999 ; accepted on September 10, 1999

Motivation: We present techniques for increasing the speed of sequence analysis using scoring matrices. Our techniques are based on calculating, for a given scoring matrix, the quantile function, which assigns a probability, or p, value to each segmental score. Our techniques also permit the user to specify a p threshold to indicate the desired trade-off between sensitivity and speed for a particular sequence analysis. The resulting increase in speed should allow scoring matrices to be used more widely in large-scale sequencing and annotation projects.

Results: We develop three techniques for increasing the speed of sequence analysis: probability filtering, lookahead scoring, and permuted lookahead scoring. In probability filtering, we compute the score threshold that corresponds to the user-specified p threshold. We use the score threshold to limit the number of segments that are retained in the search process. In lookahead scoring, we test intermediate scores to determine whether they will possibly exceed the score threshold. In permuted lookahead scoring, we score each segment in a particular order designed to maximize the likelihood of early termination. Our two lookahead scoring techniques reduce substantially the number of residues that must be examined. The fraction of residues examined ranges from 62 to 6%, depending on the p threshold chosen by the user. These techniques permit sequence analysis with scoring matrices at speeds that are several times faster than existing programs. On a database of 12 177 alignment blocks, our techniques permit sequence analysis at a speed of 225 residues/s for a p threshold of 10-6, and 541 residues/s for a p threshold of 10-20.

In order to compute the quantile function, we may use either an independence assumption or a Markov assumption. We measure the effect of first- and second-order Markov assumptions and find that they tend to raise the p value of segments, when compared with the independence assumption, by average ratios of 1.30 and 1.69, respectively. We also compare our technique with the empirical 99.5th percentile scores compiled in the BLOCKSPLUS database, and find that they correspond on average to a p value of 1.5 x 10-5.

Availability: The techniques described above are implemented in a software package called EMATRIX. This package is available from the authors for free academic use or for licensed commercial use. The EMATRIX set of programs is also available on the Internet at http://motif.stanford.edu/ematrix.

*To whom correspondence should be addressed.

3Present address: Genentech, Inc., 1 DNA Way, South San Francisco, CA 94080, USA. E-mail: twu@gene.com.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
J. Korhonen, P. Martinmaki, C. Pizzi, P. Rastas, and E. Ukkonen
MOODS: fast search for position weight matrix matches in DNA sequences
Bioinformatics, December 1, 2009; 25(23): 3181 - 3182.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
U. J. Pape, S. Rahmann, and M. Vingron
Natural similarity measures between position frequency matrices with an application to clustering
Bioinformatics, February 1, 2008; 24(3): 350 - 357.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
X. Huang and D. L. Brutlag
Dynamic use of multiple parameter sets in sequence alignment
Nucleic Acids Res., January 28, 2007; 35(2): 678 - 686.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
V. Freschi and A. Bogliolo
Using sequence compression to speedup probabilistic profile matching
Bioinformatics, May 15, 2005; 21(10): 2225 - 2229.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
Y. Barash, G. Elidan, T. Kaplan, and N. Friedman
CIS: compound importance sampling method for protein-DNA binding site p-value estimation
Bioinformatics, March 1, 2005; 21(5): 596 - 600.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
Q. J. Su, L. Lu, S. Saxonov, and D. L. Brutlag
eBLOCKs: enumerating conserved protein blocks to achieve maximal sensitivity and specificity
Nucleic Acids Res., January 1, 2005; 33(suppl_1): D178 - D182.
[Abstract] [Full Text] [PDF]


Home page
J. Bacteriol.Home page
Z. Qin, M. Shen, and S. N. Cohen
Identification and Characterization of a pSLA2 Plasmid Locus Required for Linear DNA Replication and Circular Plasmid Stable Inheritance in Streptomyces lividans
J. Bacteriol., November 15, 2003; 185(22): 6575 - 6582.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
S. P. Bennett, L. Lu, and D. L. Brutlag
3MATRIX and 3MOTIF: a protein structure visualization system for conserved sequence motifs
Nucleic Acids Res., July 1, 2003; 31(13): 3328 - 3332.
[Abstract] [Full Text] [PDF]


Home page
Genome ResHome page
Y. Pouliot, J. Gao, Q. J. Su, G. G. Liu, and X. B. Ling
DIAN: A Novel Algorithm for Genome Ontological Classification
Genome Res., October 1, 2001; 11(10): 1766 - 1779.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
J. Y. Huang and D. L. Brutlag
The EMOTIF database
Nucleic Acids Res., January 1, 2001; 29(1): 202 - 204.
[Abstract] [Full Text] [PDF]



Disclaimer: Please note that abstracts for content published before 1996 were created through digital scanning and may therefore not exactly replicate the text of the original print issues. All efforts have been made to ensure accuracy, but the Publisher will not be held responsible for any remaining inaccuracies. If you require any further clarification, please contact our Customer Services Department.