Bioinformatics Vol. 17 no. 90001 2001
Pages S5-S12
© 2001 Oxford University Press
An efficient algorithm for finding short approximate non-tandem repeats
1 Wilhelm-Schickard-Institut
für Informatik,
Universität
Tübingen, Sand 13,
Tübingen, 72076, Germany
2 Department of Computer Science and
Engineering, College of Engineering, University of California,
Riverside, USA
Received on February 6, 2001
; revised on April 2, 2001
; accepted on April 2, 2001
We study the problem of approximate non-tandem repeat
extraction. Given a
long subject string S of length N over a finite
alphabet
and a threshold D, we would like to
find all short substrings of S of length P that repeat
with at most D differences, i.e., insertions,
deletions, and mismatches. We give a careful theoretical
characterization of the set of seeds (i.e., some
maximal exact repeats) required by the algorithm, and prove a
sublinear bound on their expected numbers. Using this result, we
present a sub-quadratic algorithm for finding all short
(i.e., of length O(log N)) approximate
repeats. The running time of our algorithm is
O(DN3pow(
)-1log N), where
= D/P and pow(
) is an increasing, concave
function that is 0 when
= 0 and about
0.9 for DNA and protein sequences.
Contact: adebiyi{at}informatik.uni-tuebingen.de
Throughout the paper we only consider
non-tandem repeats.
![]()
CiteULike
Connotea
Del.icio.us What's this?
This article has been cited by other articles:
![]() |
J. Reneker, C.-R. Shyu, P. Zeng, J. C. Polacco, and W. Gassmann ACMES: fast multiple-genome searches for short repeat sequences with concurrent cross-species information retrieval Nucleic Acids Res., July 1, 2004; 32(suppl_2): W649 - W653. [Abstract] [Full Text] [PDF] |
||||
