Bioinformatics, Vol 14, 279-284, Copyright © 1998 by Oxford University Press
R Spang and M Vingron
MOTIVATION: Database search programs such as FASTA, BLAST or a rigorous
Smith-Waterman algorithm produce lists of database entries, which are
assumed to be related to the query. The computation of statistical
significance of similarity scores is well established for single pairs of
sequences and using purely random models. However, the multi-trial context
of a database search poses new problems. The credibility of a certain score
obtained in a database search decreases with the amount of data that is
compared. To improve p-value computation for database search experiments,
statistical properties of the databases, such as the distribution of
sequence length and effects induced by frequently repeated sequence
patterns, need to be taken into account.

RESULTS: We investigated the
SWISS-PROT protein database Release 31.0 by running extensive simulations of
database searches. A discrepancy is observed between the theoretical
predictions and the empirical distribution. To correct for this, we
evaluate the statistical significance of scores in the context of a
database search by a contrasting semi-random model. This model enhances
purely random models by one additional parameter reflecting individual
statistical properties of real databases. We call this parameter the
effective size of the database.

CONTACT: r.spang@dkfz-heidelberg.de; m.vingron@dkfz-heidelberg.de
ARTICLES
Statistics of large-scale sequence searching
Deutsches Krebsforschungszentrum (DKFZ), Theoretische Bioinformatik, Im Neuenheimer Feld 280, D-69120 Heidelberg, Germany.
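
The multi-trial effect described in the abstract can be illustrated with a minimal Python sketch. The abstract gives no formulas, so everything below is an assumption rather than the authors' method: the pairwise p-value uses the standard extreme-value (Karlin-Altschul style) approximation for a purely random model, the database-level p-value treats the comparisons as independent trials, and the function names, the parameters `lam` and `K`, and the reading of "effective size" as a replacement trial count are hypothetical choices made for illustration only.

```python
import math


def pairwise_p_value(score, lam, K, m, n):
    """P(S >= score) for one random sequence pair of lengths m and n.

    Assumption: extreme-value approximation 1 - exp(-K*m*n*exp(-lam*score)).
    lam and K are scale parameters of the random model (hypothetical inputs).
    """
    expected_hits = K * m * n * math.exp(-lam * score)
    return -math.expm1(-expected_hits)  # numerically stable 1 - exp(-E)


def database_p_value(p_single, n_trials):
    """Probability that at least one of n_trials comparisons reaches the score.

    Assumption: the trials are independent, so p_db = 1 - (1 - p_single)**n_trials.
    n_trials may be the raw number of entries or, purely as an illustration,
    an "effective size" that differs from it.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must lie in [0, 1]")
    if n_trials < 0:
        raise ValueError("n_trials must be non-negative")
    if p_single == 1.0:
        return 1.0 if n_trials > 0 else 0.0
    # log1p/expm1 keep precision when p_single is very small
    return -math.expm1(n_trials * math.log1p(-p_single))


if __name__ == "__main__":
    # Illustrative values only; none of them are taken from the paper.
    p = pairwise_p_value(score=70, lam=0.27, K=0.04, m=300, n=350)
    for size in (1, 10_000, 50_000):
        print(size, database_p_value(p, size))
```

Under these assumptions, the same pairwise score becomes less significant as the number of trials grows, which is the effect the abstract attributes to the amount of data compared. Swapping the raw entry count for a different trial count shows how a single extra parameter could shift the reported significance.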