Skip Navigation

This Article
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow FREE Full Text (Screen PDF)
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (8)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Giladi, E.
Right arrow Articles by Volkmuth, W.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Giladi, E.
Right arrow Articles by Volkmuth, W.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

Bioinformatics Vol. 18 no. 6 2002
Pages 873-877
© 2002 Oxford University Press

SST: an algorithm for finding near-exact sequence matches in time proportional to the logarithm of the database size

Eldar Giladi 1,*, Michael G. Walker 1, James Z. Wang 2 and Wayne Volkmuth 1

1 Incyte Pharmaceuticals, 3174 Porter Drive, Palo Alto, CA~94304, USA
2 Department of Computer Science, Pennsylvania State University, University Park, PA 16802, USA

Received on January 11, 2001 ; revised on January 7, 2002 ; accepted on January 29, 2002

Motivation: Searches for near exact sequence matches are performed frequently in large-scale sequencing projects and in comparative genomics. The time and cost of performing these large-scale sequence-similarity searches is prohibitive using even the fastest of the extant algorithms. Faster algorithms are desired.

Results: We have developed an algorithm, called SST (Sequence Search Tree), that searches a database of DNA sequences for near-exact matches, in time proportional to the logarithm of the database size n. In SST, we partition each sequence into fragments of fixed length called ‘windows’ using multiple offsets. Each window is mapped into a vector of dimension 4k which contains the frequency of occurrence of its component k-tuples, with k a parameter typically in the range 4–6. Then we create a tree-structured index of the windows in vector space, with tree-structured vector quantization (TSVQ). We identify the nearest neighbors of a query sequence by partitioning the query into windows and searching the tree-structured index for nearest-neighbor windows in the database. When the tree is balanced this yields an O(logn) complexity for the search. This complexity was observed in our computations. SST is most effective for applications in which the target sequences show a high degree of similarity to the query sequence, such as assembling shotgun sequences or matching ESTs to genomic sequence. The algorithm is also an effective filtration method. Specifically, it can be used as a preprocessing step for other search methods to reduce the complexity of searching one large database against another. For the problem of identifying overlapping fragments in the assembly of 120 000 fragments from a 1.5 megabase genomic sequence, SST is 15 times faster than BLAST when we consider both building and searching the tree. For searching alone (i.e. after building the tree index), SST 27 times faster than BLAST.

Availability: Request from the authors.

Contact: egiladi{at}incyte.com mwalker{at}incyte.com

* To whom correspondence should be addressed.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
A. Morgulis, G. Coulouris, Y. Raytselis, T. L. Madden, R. Agarwala, and A. A. Schaffer
Database indexing for production MegaBLAST searches
Bioinformatics, August 15, 2008; 24(16): 1757 - 1764.
[Abstract] [PDF]


Home page
BioinformaticsHome page
T. W. Lam, W. K. Sung, S. L. Tam, C. K. Wong, and S. M. Yiu
Compressed indexing and local alignment of DNA
Bioinformatics, March 15, 2008; 24(6): 791 - 797.
[Abstract] [Full Text] [PDF]



Disclaimer:
Please note that abstracts for content published before 1996 were created through digital scanning and may therefore not exactly replicate the text of the original print issues. All efforts have been made to ensure accuracy, but the Publisher will not be held responsible for any remaining inaccuracies. If you require any further clarification, please contact our Customer Services Department.