Bioinformatics Vol. 17 no. 5 2001
Pages 419-428
© 2001 Oxford University Press
Efficient large-scale sequence comparison by locality-sensitive hashing
Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195-2350, USA
Received on July 4, 2000
; revised on December 8, 2000
; accepted on December 12, 2000
Motivation: Comparison of multimegabase genomic DNA sequences is a popular technique for finding and annotating conserved genome features. Performing such comparisons entails finding many short local alignments between sequences up to tens of megabases in length. To process such long sequences efficiently, existing algorithms find alignments by expanding around short runs of matching bases with no substitutions or other differences. Unfortunately, exact matches that are short enough to occur often in significant alignments also occur frequently by chance in the background sequence. Thus, these algorithms must trade off between efficiency and sensitivity to features without long exact matches.
Results: We introduce a new algorithm, LSH-ALL-PAIRS, to find ungapped local alignments in genomic sequence with up to a specified fraction of substitutions. The length and substitution rate of these alignments can be chosen so that they appear frequently in significant similarities yet still remain rare in the background sequence. The algorithm finds ungapped alignments efficiently using a randomized search technique, locality-sensitive hashing. We have found LSH-ALL-PAIRSto be both efficient and sensitive for finding local similarities with as little as 63% identity in mammalian genomic sequences up to tens of megabases in length
Availability: Contact the author at the address below.
Contact: jbuhler{at}cs.washington.edu
Supplementary information: The sequences and local alignment data described in this work are available at http://bio.cs.washington.edu/jbuhler-bioinformatics-2001/.
![]()
CiteULike
Connotea
Del.icio.us What's this?
This article has been cited by other articles:
![]() |
H. Li, J. Ruan, and R. Durbin Mapping short DNA sequencing reads and calling variants using mapping quality scores Genome Res., November 1, 2008; 18(11): 1851 - 1858. [Abstract] [Full Text] [PDF] |
||||
![]() |
J.-I. Won, S. Park, J.-H. Yoon, and S.-W. Kim An efficient approach for sequence matching in large DNA databases Journal of Information Science, February 1, 2006; 32(1): 88 - 104. [Abstract] [PDF] |
||||
![]() |
S. McGinnis and T. L. Madden BLAST: at the core of a powerful and diverse set of sequence analysis tools Nucleic Acids Res., July 1, 2004; 32(suppl_2): W20 - W25. [Abstract] [Full Text] [PDF] |
||||


