Bioinformatics Advance Access originally published online on October 6, 2007
Bioinformatics 2007 23(21):2949-2951; doi:10.1093/bioinformatics/btm479
Improved BLAST searches using longer words for protein seeding
Department of Health and Human Services, National Center for Biotechnology Information, National Institutes of Health
*To whom correspondence should be addressed.
Abstract
Motivation: The blastp and tblastn modules of BLAST are widely used methods for searching protein queries against protein and nucleotide databases, respectively. One heuristic used in BLAST is to consider only database sequences that contain a high-scoring match of length at most 5 to the query. We implemented the capability to use words of length 6 or 7. We demonstrate an improved trade-off between running time and retrieval accuracy, controlled by the score threshold used for short word matches. For example, the running time can be reduced by 20-30% while achieving ROC (receiver operator characteristic) scores similar to those obtained with current default parameters.
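The seeding heuristic described above can be illustrated with a minimal sketch. It is not the NCBI implementation: the function names, the toy scoring scheme (+5 for a match, -1 for a mismatch, standing in for a real substitution matrix such as BLOSUM62), and the example sequences are all assumptions made for illustration. The sketch checks only whether a database sequence contains at least one word of length w whose score against some query word reaches the threshold; production BLAST uses precomputed lookup tables rather than this direct comparison.

```python
def toy_score(a: str, b: str) -> int:
    """Hypothetical residue score: +5 for identity, -1 otherwise (not BLOSUM62)."""
    return 5 if a == b else -1


def word_score(u: str, v: str) -> int:
    """Sum of per-position scores for two words of equal length."""
    return sum(toy_score(a, b) for a, b in zip(u, v))


def words(seq: str, w: int) -> list[str]:
    """All overlapping words of length w in seq."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def has_seed(query: str, subject: str, w: int, threshold: int) -> bool:
    """True if some length-w subject word scores >= threshold against some query word."""
    query_words = words(query, w)
    return any(
        word_score(q_word, s_word) >= threshold
        for s_word in words(subject, w)
        for q_word in query_words
    )


# Illustrative sequences (assumed, not from the paper).
query = "MKVLAAGIW"
subject = "QQKVLAAGQQ"

# Both sequences share the exact 6-mer "KVLAAG", so a seed is found at w=6.
print(has_seed(query, subject, w=6, threshold=30))
# No exact 7-mer is shared, so requiring a perfect score at w=7 finds no seed.
print(has_seed(query, subject, w=7, threshold=35))
# Lowering the threshold admits the near match "KVLAAGI" vs "KVLAAGQ".
print(has_seed(query, subject, w=7, threshold=29))
```

In this toy setting, longer words with a lower per-word threshold still admit near matches while rejecting sequences that share only short identical stretches, which is the kind of trade-off the score threshold controls.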
Availability: The option to use long words is in the NCBI C and C++ toolkit code for BLAST, starting with version 2.2.16 of blastall. A Linux executable used to produce the results herein is available at: ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/protein_longwords
Contact: richa{at}helix.nih.gov
Associate Editor: Thomas Lengauer
Received on August 8, 2007; revised on September 13, 2007; accepted on September 19, 2007