Bioinformatics Advance Access originally published online on July 15, 2004
Bioinformatics 2004 20(18):3363-3369; doi:10.1093/bioinformatics/bth408
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Bioinformatics vol. 20 issue 18 © Oxford University Press 2004; all rights reserved.
Reducing storage requirements for biological sequence comparison
Institute for Physical Science and Technology, University of Maryland, College Park, MD 20742-2431, USA
Received on April 16, 2004; revised on July 5, 2004; accepted on July 7, 2004
Advance Access Publication July 15, 2004
Motivation: Comparison of nucleic acid and protein sequences is a fundamental tool of modern bioinformatics. A dominant method of such string matching is the seed-and-extend approach, in which occurrences of short subsequences called seeds are used to search for potentially longer matches in a large database of sequences. Each such potential match is then checked to see if it extends beyond the seed. To be effective, the seed-and-extend approach needs to catalogue seeds from virtually every substring in the database of search strings. Projects such as mammalian genome assemblies and large-scale protein matching, however, have such large sequence databases that the resulting list of seeds cannot be stored in RAM on a single computer. This significantly slows the matching process.
Results: We present a simple and elegant method in which only a small fraction of seeds, called minimizers, needs to be stored. Using minimizers can speed up string-matching computations by a large factor while missing only a small fraction of the matches found using all seeds.
Contact: yorke{at}ipst.umd.edu, bhunt{at}ipst.umd.edu