Skip Navigation

This Article
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow FREE Full Text (Screen PDF)
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (27)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Buhler, J.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Buhler, J.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

Bioinformatics Vol. 17 no. 5 2001
Pages 419-428
© 2001 Oxford University Press

Efficient large-scale sequence comparison by locality-sensitive hashing

Jeremy Buhler

Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195-2350, USA

Received on July 4, 2000 ; revised on December 8, 2000 ; accepted on December 12, 2000

Motivation: Comparison of multimegabase genomic DNA sequences is a popular technique for finding and annotating conserved genome features. Performing such comparisons entails finding many short local alignments between sequences up to tens of megabases in length. To process such long sequences efficiently, existing algorithms find alignments by expanding around short runs of matching bases with no substitutions or other differences. Unfortunately, exact matches that are short enough to occur often in significant alignments also occur frequently by chance in the background sequence. Thus, these algorithms must trade off between efficiency and sensitivity to features without long exact matches.

Results: We introduce a new algorithm, LSH-ALL-PAIRS, to find ungapped local alignments in genomic sequence with up to a specified fraction of substitutions. The length and substitution rate of these alignments can be chosen so that they appear frequently in significant similarities yet still remain rare in the background sequence. The algorithm finds ungapped alignments efficiently using a randomized search technique, locality-sensitive hashing. We have found LSH-ALL-PAIRSto be both efficient and sensitive for finding local similarities with as little as 63% identity in mammalian genomic sequences up to tens of megabases in length

Availability: Contact the author at the address below.

Contact: jbuhler{at}cs.washington.edu

Supplementary information: The sequences and local alignment data described in this work are available at http://bio.cs.washington.edu/jbuhler-bioinformatics-2001/.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Genome ResHome page
H. Li, J. Ruan, and R. Durbin
Mapping short DNA sequencing reads and calling variants using mapping quality scores
Genome Res., November 1, 2008; 18(11): 1851 - 1858.
[Abstract] [Full Text] [PDF]


Home page
Journal of Information ScienceHome page
J.-I. Won, S. Park, J.-H. Yoon, and S.-W. Kim
An efficient approach for sequence matching in large DNA databases
Journal of Information Science, February 1, 2006; 32(1): 88 - 104.
[Abstract] [PDF]


Home page
Nucleic Acids ResHome page
S. McGinnis and T. L. Madden
BLAST: at the core of a powerful and diverse set of sequence analysis tools
Nucleic Acids Res., July 1, 2004; 32(suppl_2): W20 - W25.
[Abstract] [Full Text] [PDF]



Disclaimer: Please note that abstracts for content published before 1996 were created through digital scanning and may therefore not exactly replicate the text of the original print issues. All efforts have been made to ensure accuracy, but the Publisher will not be held responsible for any remaining inaccuracies. If you require any further clarification, please contact our Customer Services Department.