Bioinformatics, Vol 15, 111-121, Copyright © 1999 by Oxford University Press
C Miller, J Gurd and A Brass
MOTIVATION: Word-matching algorithms such as BLAST are routinely used for
sequence comparison. These algorithms typically use areas of matching words
to seed alignments which are then used to assess the degree of sequence
similarity. In this paper, we show that by formally separating the
word-matching and sequence-alignment processes, and using information about
word frequencies to generate alignments and similarity scores, we can
create a new sequence-comparison algorithm which is both fast and
sensitive. The formal split between word searching and alignment allows
users to select an appropriate alignment method without affecting the
underlying similarity search. The algorithm has been used to develop
software for identifying entries in DNA sequence databases which are
contaminated with vector sequence.

RESULTS: We present three algorithms,
RAPID, PHAT and SPLAT, which together allow vector contaminations to be
found and assessed extremely rapidly. RAPID is a word search algorithm
which uses probabilities to modify the significance attached to different
words; PHAT and SPLAT are alignment algorithms. An initial implementation
has been shown to be approximately an order of magnitude faster than BLAST.
The formal split between word searching and alignment not only offers
considerable gains in performance, but also allows alignment generation to
be viewed as a user interface problem, allowing the most useful output
method to be selected without affecting the underlying similarity search.
Receiver Operator Characteristic (ROC) analysis of an artificial test set
allows the optimal score threshold for identifying vector contamination to
be determined. ROC curves were also used to determine the optimum word size
(nine) for finding vector contamination. An analysis of the entire
expressed sequence tag (EST) subset of EMBL found a contamination rate of
0.27%. A more detailed analysis of the 50 000 ESTs in est10.dat (an EST
subset of EMBL) found an error rate of 0.86%, principally due to two
large-scale projects.

AVAILABILITY: A Web page for the software exists at
http://bioinf.man.ac.uk/rapid, or it can be downloaded from
ftp://ftp.bioinf.man.ac.uk/RAPID

CONTACT: crispin@cs.man.ac.uk
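
The idea of a word search that uses probabilities to change the significance of different words can be illustrated with a minimal sketch. This is not the published RAPID code. The function names, the negative-log weighting and the `floor` fallback are all assumptions made for illustration. The word size of four keeps the example readable; the abstract reports nine as the optimum for finding vector contamination.

```python
import math
from collections import Counter


def words(seq, k):
    """Yield every overlapping word of length k in seq."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def word_probabilities(sequences, k):
    """Estimate each word's probability from its frequency in a background set."""
    counts = Counter(w for s in sequences for w in words(s, k))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def weighted_score(query, target, probs, k, floor=1e-6):
    """Sum -log(p) over the distinct words shared by query and target.

    Rare words add more to the score than common ones. `floor` is a
    hypothetical fallback probability for words absent from the background.
    """
    shared = set(words(query, k)) & set(words(target, k))
    return sum(-math.log(probs.get(w, floor)) for w in shared)


# Toy background in which AAAA is common and ACGT is rare.
background = ["AAAAAAAAAA", "AAAAACGTAC"]
probs = word_probabilities(background, 4)

rare_hit = weighted_score("TTACGTTT", "GGACGTGG", probs, 4)    # shares only ACGT
common_hit = weighted_score("TTAAAATT", "GGAAAAGG", probs, 4)  # shares only AAAA
no_hit = weighted_score("CCCCCCCC", "GGGGGGGG", probs, 4)      # shares nothing
```

In this sketch a match on the rare word scores higher than a match on the common word, and a pair with no shared words scores zero. Because the scoring step produces no alignment, any alignment method could in principle be run afterwards on the high-scoring pairs, in the spirit of the split the abstract describes.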
ARTICLES
A RAPID algorithm for sequence database comparisons: application to the identification of vector contamination in the EMBL databases
School of Biological Sciences, 2.205 The Stopford Building, University of Manchester, Oxford Road, Manchester M13 9PT, UK.