Skip Navigation


Bioinformatics Advance Access originally published online on September 5, 2007
Bioinformatics 2007 23(22):2969-2977; doi:10.1093/bioinformatics/btm422
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/22/2969    most recent
btm422v2
btm422v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Ilie, L.
Right arrow Articles by Ilie, S.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Ilie, L.
Right arrow Articles by Ilie, S.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Multiple spaced seeds for homology search

Lucian Ilie 1,* and Silvana Ilie 2

1Department of Computer Science, University of Western Ontario, N6A 5B7, London, Ontario, Canada and 2Numerical Analysis, Centre for Mathematical Sciences, Lund University, Box 118, SE-221 00 Lund, Sweden

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 SPACED SEEDS
 3 OVERLAP COMPLEXITY
 4 SENSITIVITY OF LOW-OVERLAP...
 5 A POLYNOMIAL-TIME ALGORITHM
 6 BETTER MULTIPLE SEEDS
 7 CONCLUSION AND FURTHER...
 Appendix
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: Homology search finds similar segments between two biological sequences, such as DNA or protein sequences. The introduction of optimal spaced seeds in PatternHunter has increased both the sensitivity and the speed of homology search, and it has been adopted by many alignment programs such as BLAST. With the further improvement provided by multiple spaced seeds in PatternHunterII, Smith–Waterman sensitivity is approached at BLASTn speed. However, computing optimal multiple spaced seeds was proved to be NP-hard and current heuristic algorithms are all very slow (exponential).

Results: We give a simple algorithm which computes good multiple seeds in polynomial time. Due to a completely different approach, the difference with respect to the previous methods is dramatic. The multiple spaced seed of PatternHunterII, with 16 weight 11 seeds, was computed in 12 days. It takes us 17 s to find a better one. Our approach changes the way of looking at multiple spaced seeds.

Contact: ilie{at}csd.uwo.ca


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 SPACED SEEDS
 3 OVERLAP COMPLEXITY
 4 SENSITIVITY OF LOW-OVERLAP...
 5 A POLYNOMIAL-TIME ALGORITHM
 6 BETTER MULTIPLE SEEDS
 7 CONCLUSION AND FURTHER...
 Appendix
 ACKNOWLEDGEMENTS
 REFERENCES
 
Homology search finds similar segments between two biological sequences, such as DNA or protein sequences. A significant fraction of computing power in the world is dedicated to performing such tasks. The increase in genomic data is quickly outgrowing computer advances and hence better mathematical solutions are required. As the classical dynamic programming techniques of (Needleman and Wunsch, 1970; Smith and Waterman, 1981) became overwhelmed by the task, popular programs such as FASTA (Lipman and Pearson, 1985) and BLAST (Altschul et al., 1990) used heuristic algorithms. BLAST used a filtration technique in which positions with short consecutive matches, or hits, were identified first and then extended into local alignments. Speed was traded for sensitivity since longer initial matches missed many local alignments, hence decreasing sensitivity, whereas short initial matches produced too many hits, thus decreasing speed.

A breakthrough came with PatternHunter (Ma et al., 2002) where the hits were no longer required to consist of consecutive matches. More precisely, PatternHunter looks for runs of 18 consecutive nucleotides in each sequence such that only those specified by 1's in the string 111*1**1*1**11*111 are required to match. Such a string is called a spaced seed and the number of 1's in it is its weight. Using this notion, BLAST required a hit according to a consecutive seed such as 11111111111.

The filtration principle has been used before in approximate string matching (Burkhardt and Kärkäinen, 2001; Karp and Rabin, 1987; Pevzner and Waterman, 1995) but the important novelty of PatternHunter was the use of optimal spaced seeds, i.e. spaced seeds that have optimal sensitivity. Impressively, the approach of PatternHunter increases both the speed and sensitivity. The idea has been adopted since by the new versions of BLAST, MegaBLAST, BLASTZ and other software programs (Brejova et al., 2004; Kisman et al., 2005; Noé and Kucherov, 2005).

As noticed in Ma et al. (2002), multiple spaced seeds—sets of seeds that hit whenever one of the components does so—are better, and with their introduction in PatternHunterII, (Li et al., 2004), Smith–Waterman sensitivity (Smith and Waterman, 1981) is approached whereas the speed is that of BLASTn.

Quite a few papers have been written about spaced seeds, evaluating the advantages of spaced seeds over consecutive ones (Buhler et al., 2003; Choi and Zhang, 2004; Keich et al., 2004; Li et al., 2006), showing that the relevant computational problems are NP-hard (Li et al., 2004, 2006), giving exact (exponential) algorithms for computing sensitivity (Buhler et al., 2003; Choi and Zhang, 2004; Choi et al., 2004; Keich et al., 2004; Li et al., 2004), polynomial time approximation schemes (Li et al., 2006) or heuristic algorithms (Choi et al., 2004; Ilie and Ilie, 2007; Kong, 2007; Li et al., 2004; Perparata et al., 2005; Yang et al., 2004), adapting the seeds for more specific biological tasks (Brejova et al., 2004; Kucherov et al., 2004; Noé and Kucherov, 2005; Sun and Buhler, 2004), or building models to understand the mechanism that makes spaced seeds powerful (Buhler et al., 2003; Preparata et al., 2005; Sun and Buhler, 2004).

Finding optimal (multiple) spaced seeds is NP-hard but even finding good ones is very difficult. Exhaustive search involves two exponential-time steps: (i) there are exponentially many seeds to be tried and (ii) computing the sensitivity of each takes exponential time as well. Several approaches (Buhler et al., 2003; Keich et al., 2004; Li et al., 2004) tried to deal with the latter exponential by approximating the sensitivity. For the former, the number of seeds to be considered has been reduced by various heuristics (Choi et al., 2004; Kong, 2007; Preparata et al., 2005; Yang. et al., 2004) but it remained exponential.

The approach here is based on the overlaps between the hits of a multiple seed. A new measure, overlap complexity, is introduced and shown to be experimentally well correlated with sensitivity. Since the new measure is computable in (low) polynomial time, we shall use overlap complexity instead of sensitivity and this takes care of the exponential in (ii). A similar approach has been introduced in Ilie and Ilie (2007) for single seeds. Also, Yang et al. (2004) and Kong (2007) contain some other measures well correlated with sensitivity for multiple seeds. However, we take care also of the exponential at (i), i.e. the exponential number of candidate seeds. We give a simple algorithm which improves quickly the overlap complexity of an initial multiple seed, thus providing a good multiple seed in polynomial time.

We provide some results showing the good correlation between overlap complexity and sensitivity for single seeds. Our polynomial-time algorithm produces single seeds of sensitivity very close to optimal. For the multiple seed case such comparison cannot be made since no optimal multiple seeds are known. We shall compare our multiple seeds with previous ones and show them to have better sensitivity while our algorithm is much faster. The most important test is to compare against the multiple seed implemented in PatternHunterII, which contains 16 weight 11 seeds. While it took (Li et al., 2004) 12 days to compute this multiple seed, we obtain a better multiple seed in 17 s. The dramatic improvement is due to a completely different approach. As discussed in the last section, our approach allows looking at multiple seeds in a totally different way.

A number of problems remain to be investigated such as proving guarantees about the correlation between overlap complexity and sensitivity, approximation ratio and exact running time of our heuristic algorithm for approximating the overlap complexity. However, such problems may be mostly of theoretical interest as in practice our algorithms produce very good multiple seeds in very short time.

The article is organized as follows. The next section formally introduces multiple spaced seed and all concepts needed later. Our new measure is introduced in Section 3. Section 4 shows good correlation between overlap complexity and sensitivity. Our polynomial-time algorithm for computing good multiple seeds is given in Section 5. In Section 6, we compute better seeds than all previous ones. We conclude with a brief discussion in Section 7. More seeds whose sensitivity is discussed in the text are provided in the Appendix.

The content of the article can be read in several ways, according to the goal of the reader. First, we computed a number of multiple spaced seeds that are ready to be used. No understanding of our algorithm is necessary for that purpose. Second, our algorithm is simple and explained in detail for the reader interested in producing a more efficient implementation and/or modifying the algorithm in order to solve different problems, such as computing more specialized seeds. Finally, we provide explanation of the intuitive ideas behind our algorithm in order to provide the interested reader with in-depth understanding of our approach.


    2 SPACED SEEDS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 SPACED SEEDS
 3 OVERLAP COMPLEXITY
 4 SENSITIVITY OF LOW-OVERLAP...
 5 A POLYNOMIAL-TIME ALGORITHM
 6 BETTER MULTIPLE SEEDS
 7 CONCLUSION AND FURTHER...
 Appendix
 ACKNOWLEDGEMENTS
 REFERENCES
 
We start with some basic definitions. An alphabet is a finite non-empty set, denoted by A. The set of finite strings over A is denoted by A*. For a string x isin A*, the length of x is denoted by |x|. For 1 ≤ i ≤ |x|, the ith letter of x is denoted by x[i]. If u = xy, for some x,y isin A*, then x (y, respectively) is called a prefix (suffix, respectively) of u. For two strings u and v, an overlap between u and v is any string that is both a suffix of u and a prefix of v.

A spaced seed is any1 string over the alphabet {1,*}; 1 stands for a ‘match’ and * for a ‘don't care’ position. For a seed s, the length of s is {ell} = |s| and the weight, w, of s is the number of 1's in s. A multiple spaced seed S is any finite non-empty set of spaced seeds.

The quality of the spaced seeds is given by their sensitivity, which, intuitively, is a measure of their ability to detect similar segments between biological sequences (Ma et al., 2002). This is done as follows. Given two DNA sequences and a seed s, we say that s simultaneously matches (hits) the two sequences at given positions if each 1 in s corresponds to a match between the corresponding nucleotides in the two sequences; see Figure 1 for an example using PatternHunter's seed. Such a match is then extended using classical methods to a local alignment.


Figure 1
View larger version (5K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. An example of a hit using PatternHunter's spaced seed. All 1's in the seed (the last row) must correspond to matches between the sequences. The spaced seed s hits the Bernoulli sequence R (ending) at the third position from the right.

 
However, in order to be able to compare spaced seeds and ultimately compute good ones, we need a precise mathematical setting. The above process will therefore be reformulated as follows (Keich et al., 2004; Ma et al., 2002). Assume there are two DNA sequences S1 and S2 such that the events that they are identical at any given position are jointly independent and each event is of probability p, called the similarity level. The sequence of equalities/inequalities between the two DNA sequences translates then into a sequence R of 1's (corresponding to matches) and 0's (corresponding to mismatches) that appear with probability p and 1 – p, respectively. Therefore, given an (infinite) Bernoulli random sequence R and a seed s, we say that s hits R (ending) at position k if aligning the end of s with position k of R causes all 1's in s to align with 1's; in R; see Figure 1.

We are now in the position to give a rigorous definition for sensitivity of a spaced seed. The sensitivity of a seed s is the probability that s hits R at or before position n (Keich et al., 2004; Ma et al., 2002). Note that the sensitivity depends on both the similarity level p and the length of the random region n.

An intuitive explanation of the reason for which seeds have different sensitivities follows. Recall that the sensitivity of a seed is the probability of hitting a random region of a given length. For two spaced seeds of the same weight, the expected number of hits is the same but their sensitivities need not be the same. This happens as the hits of one seed may appear more clustered. A good intuitive example is searching for the strings aaa and abc in a random text. For each occurrence of aaa, the chance of having another one sharing two letters with it is 1/26 whereas starting afresh would require (1/26)3. Therefore, the occurrences of abc are more evenly distributed and it is more likely to see first an abc in the text.

A multiple spaced seed hits a sequence R if and only if one of its seeds hits R. The sensitivity of a multiple spaced seed S is defined similarly, i.e. the probability that at least one seed of S hits R at or before position n.

In the light of the trade-off between search speed and sensitivity, it makes sense to consider only multiple seeds in which all seeds have the same weight (they may have different lengths).


    3 OVERLAP COMPLEXITY
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 SPACED SEEDS
 3 OVERLAP COMPLEXITY
 4 SENSITIVITY OF LOW-OVERLAP...
 5 A POLYNOMIAL-TIME ALGORITHM
 6 BETTER MULTIPLE SEEDS
 7 CONCLUSION AND FURTHER...
 Appendix
 ACKNOWLEDGEMENTS
 REFERENCES
 
We introduce in this section our complexity measure, the overlap complexity, which will turn out to be well correlated with sensitivity but much easier to compute. Therefore, it will replace sensitivity in our computations. Before introducing it, we give some intuitive explanation why overlapping hits of a seed are undesirable.

The hits of a seed can overlap but overlapping hits will detect a single local alignment. An example showing such a situation is shown in Figure 2.


Figure 2
View larger version (4K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. An example showing the intuition behind overlap complexity; a local alignment is detected by one hit of a good seed whereas a bad seed ‘wastes’ three hits to detect the same alignment.

 
Therefore, the sensitivity of a seed is inversely proportional with the number of overlapping hits, since the expected number of hits is the same. Thus, good seeds should have a low number of overlapping hits. The definite proof that (non-uniformly) spaced seeds are better than consecutive seeds, due to Li et al. (2006), involves estimating the expected number of non-overlapping hits. However, computing this number in general is as difficult as computing sensitivity. Therefore, we look here for simpler ways to detect low numbers of overlapping hits.

We shall define a measure that is independent of the similarity level p. Consider two seeds s1 and s2 and denote by {sigma}[i] the number of pairs of 1's aligned together when a copy of s2 shifted by i positions is aligned against s1. The shift i takes values from 1 – |s2| to |s1| – 1, where a negative shift means s2 starts first. Precisely, if we denote


Formula

then


Formula

The overlap complexity for two seeds is defined as


Formula

An example is shown in Figure 3. Note that the measure is symmetric, that is, OC(s1, s2) = OC(s2, s1), for any seeds s1 and s2.


Figure 3
View larger version (5K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. An example of the overlap complexity of two seeds: Figure 3.

 
The definition of the overlap complexity deserves a few comments. Note that the ‘importance’ (we should say ‘weight’ but that would be confused with the weight of the seeds) of the number of pairs of 1's aligned together for each shift doubles with each pair of 1's. While this may look as a reasonably natural definition, there is a good intuitive reason behind it. For a shift i, denote by {sigma}*[i] the number of 1's aligned against *'s and by {sigma}**[i] the number of pairs of *'s aligned together. For our example, these arrays are given below:


Formula

The number of overlapping hits (with shift i) is then proportional to p{sigma}*[i](p2+(1-p)2){sigma}**[i], where p is the similarity level; p{sigma}*[i] is the probability that {sigma}*[i] *'s take value 1 and (p2 +(1-p)2){sigma}**[i] is the probability that {sigma}**[i] pairs of *'s take the same value, 0 or 1. Choosing p = 0.5 makes this quantity equal to 2{sigma}*[i] – {sigma}**[i]. It is reasonable to assume that a fixed-size window is considered when evaluating the overlapping hits, i.e. the sum {sigma}[i] + {sigma}*[i] + {sigma}**[i] is assumed to be a constant, say c; in our example c = 13. Then, the number of overlapping hits is proportional to 2{sigma}*[i] – {sigma}**[i] = 2{sigma}[i]-c = Formula . Since c is constant, this is proportional with 2{sigma}[i] which gives our definition of overlap complexity.

It is important to mention that the freedom to conveniently choose the value p = 0.5 is due to the fact that, even if the optimal seed may change with p (see Table 1 subsequently), the sensitivity changes very little.


View this table:
[in this window]
[in a new window]

 
Table 1. The top sensitivity seeds from Choi et al. (2004); a ‘-’ means not in top 20

 
For a multiple seed S = {s1, s2, ..., sk}, the overlap complexity is defined by:


Formula

Note that the overlap complexity is invariant with respect to the order of the seeds and reversal (assuming all seeds are reversed simultaneously). This is expected of any measure well correlated with sensitivity.


    4 SENSITIVITY OF LOW-OVERLAP SEEDS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 SPACED SEEDS
 3 OVERLAP COMPLEXITY
 4 SENSITIVITY OF LOW-OVERLAP...
 5 A POLYNOMIAL-TIME ALGORITHM
 6 BETTER MULTIPLE SEEDS
 7 CONCLUSION AND FURTHER...
 Appendix
 ACKNOWLEDGEMENTS
 REFERENCES
 
We show here that the overlap complexity is, experimentally, well correlated with sensitivity for single seeds. We consider in Table 1 the top sensitivity seeds of Choi et al. (2004) (i.e. seeds with highest sensitivity among those with a given weight); their sensitivity ranks for similarity levels 65%, 70%, ..., 90% are given in columns 2, 3, ..., 7, respectively. As mentioned earlier, the top sensitivity seed may change with the similarity level p. For instance, the first line for weight 11 corresponds to PatternHunter's; seed which is the best for similarity levels 65 and 70%, second best for 75, 80 and 85%, and only third best for 90%. However, the differences between the sensitivities of these top seeds for any of the similarity levels considered is very small, a crucial observation for our approach, which is independent of similarity level.

The last column of Table 1 gives the overlap complexity rank. In all cases, at least one top sensitivity seed is on top of the overlap complexity ranking. Note that the seeds with the lowest overlap complexity are on top of the overlap complexity ranking.

The opposite is shown in Table 2 where the highest sensitivity of the seeds with lowest overlap complexity is shown. (There may be several seeds with lowest overlap complexity.) Almost all differences are zero. The correlation between the two measures is remarkable.


View this table:
[in this window]
[in a new window]

 
Table 2. The sensitivity of the top overlap complexity seeds for weights 9–18, similarity 70%, and length of random region 64

 
We cannot make the same comparison for multiple spaced seeds since there are no optimal multiple spaced seeds known.


    5 A POLYNOMIAL-TIME ALGORITHM
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 SPACED SEEDS
 3 OVERLAP COMPLEXITY
 4 SENSITIVITY OF LOW-OVERLAP...
 5 A POLYNOMIAL-TIME ALGORITHM
 6 BETTER MULTIPLE SEEDS
 7 CONCLUSION AND FURTHER...
 Appendix
 ACKNOWLEDGEMENTS
 REFERENCES
 
The exact algorithms for computing sensitivity are all exponential, see (Buhler et al., 2003; Choi and Zhang, 2004; Keich et al., 2004), which is expected since the problem is NP-hard (Li et al., 2006). The one of Choi and Zhang (2004) runs in time Formula , for seeds of length {ell} and weight w. The other two have running times Formula for Keich et al. (2004) and Formula for Buhler et al. (2003).

For multiple seeds, Li et al. (2004) gave a dynamic programming algorithm that runs in time Formula Formula , where k is the number of seeds, {ell}i's are the lengths of the seeds and L = max1 ≤ i ≤ k {ell}i.

Therefore, finding optimal seeds by trying all seeds of a given weight (and length) and selecting the best is computationally very expensive. In fact, it has been shown by Li et al. (2004) to be NP-hard for an arbitrary distribution.

Some heuristic algorithms for computing good multiple seeds are presented in Yang et al. (2004) and Kong (2007). As with our approach, they find some measures that are well correlated with similarity but they still need to consider exponentially many seeds. We shall compare our seeds with theirs in the next section.

The heuristic algorithm we derive from our overlap complexity is very simple: compute the seed with the lowest overlap complexity. This produces very good multiple seeds but we need to consider exponentially many candidates. To reduce the complexity of this step, we shall start with a fixed seed and repeatedly modify it to improve its overlap complexity. Each improvement consists of swapping a 1 with a * as long as the overlap complexity improves. Moreover, we greedily choose a swap that produces the greatest improvement. The number of such swaps in each seed will be bounded by the weight of the seeds.

We shall say that 1 flipped is * and vice versa. For a seed s and two positions i,j, we denote by flip (s, i, j) the seed obtained from s by flipping the letters in positions i and j. For instance, flip(1*11*11,3,5) = 1**1111. With this notation, the algorithm MULTIPLESEEDS is described in Figure 4. Remarkably, PatternHunter's seed is obtained by performing only four swaps in the algorithm MULTIPLESEEDS(11, 18); see Figure 5. This can be done by hand!


Figure 4
View larger version (28K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4. The MULTIPLESEEDS algorithm which, given the weight and lengths of the seeds, computes a multiple seed with low overlap complexity and, therefore, high sensitivity.

 

Figure 5
View larger version (5K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5. Intermediate seeds computed by MULTIPLESEEDS(11, 18) to find PatternHunter's seed 111*1**1*1**11*111. The flipped positions are given in the right column.

 
Let us discuss briefly our choice of the initial seeds in step 6. These are consecutive seeds and have very low sensitivity. One would imagine that starting from different seeds, e.g. random, would produce better results. Somewhat unexpectedly this does not seem to be the case and we preferred to keep a simple, deterministic and ultimately reliable choice.

Concerning the swapping technique, it is trickier to give a good intuitive explanation. First, such swaps may change very much the overlaps and thus have the potential of improving the overlap complexity. Second, any seed can be transformed into any other seed with the same length and weight using few such swaps and therefore we may potentially reach those seeds with very low overlap complexity we are looking for. Finally, and most convincingly, it works very well in practice.

It is possible that the swaps can be improved, or done differently. For instance, we did not perform more than one swap at the time as that would slow down the algorithm, even if it would remain polynomial.

Note that our whole approach with overlap complexity works within fixed length for seeds. When given only a fixed weight and number of seeds, a problem we need to solve is finding a good length set of the seeds. Trying all possible lengths is impractical. We came up with a simple but efficient choice, see steps 1–5 in the algorithm. Essentially, we set half of the lengths equal to 25 and the other half ‘equally’ spaced between Formula and 25. (The code in lines 1–5 makes our choice precise.) The number 25 depends on the computer. Our tests were performed on a laptop with only 512 MB of RAM which prevented us from computing the sensitivity of longer seeds. We believe that the addition of longer seeds to some of our multiple seeds would increase the sensitivity but this needs to be tested.

Our choice of seed lengths turns out to be very good as we shall see below. However, for one seed we need to consider all lengths in an interval. In a few cases below we shall do the same for two seeds.

We have shown in the previous section good correlation between overlap complexity and sensitivity but now we compute an approximation of the overlap complexity. Still, the seeds we obtain have high sensitivity as shown in Table 3. We give also the time required for computing each seed.


View this table:
[in this window]
[in a new window]

 
Table 3. The sensitivity of the single spaced seeds computed by MULTIPLESEEDS compared to the optimal sensitivity for weights 9–18, similarity 70%, and length of random region 64

 
Let us consider the time complexity of the MULTIPLESEEDS algorithm. Computing the lengths and initial seeds in steps 1–6 takes Formula (kw) time. To perform a swap, all possibilities for the triple (r, i, j) in step 9 are considered, that is, Formula . For each, we compute the new overlap complexity in Formula time. (This is because the overlap complexity of two seeds is computed in time the product of their lengths and here we need only to update the pairs containing the seed sr.) If we set L = max1 ≤ i ≤ k {ell}i, then the total time complexity of the MULTIPLESEEDS algorithm is Formula . If we assume that, in practice, k is bounded and L is linear in w, then it becomes Formula .

It may be useful to briefly summarize the steps of our approach to constructing multiple spaced seeds. Finding the optimal multiple spaced seed for a given weight and number of seeds involves two exponential stages: (i) there are exponentially many candidate seeds and (ii) computing the sensitivity of each requires exponential time as well. The exponential at (i) hides in fact two exponentials: (a) there are exponentially many lengths sets and (b) for each length set, there are exponentially many multiple seeds. First, we guess the length set (steps 1–5 of MULTIPLESEEDS); this takes care of the exponential at (a). Second, we start with some fixed values for the seeds (step 6), and this eliminates the exponential at (b). Finally, we repeatedly modify (polynomially many times) this multiple seed using overlap complexity that is computable in polynomial time as well. This way the exponential at (ii) is avoided. Instead of testing candidates we directly build a multiple spaced seed as required. This is totally different from previous approaches.


    6 BETTER MULTIPLE SEEDS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 SPACED SEEDS
 3 OVERLAP COMPLEXITY
 4 SENSITIVITY OF LOW-OVERLAP...
 5 A POLYNOMIAL-TIME ALGORITHM
 6 BETTER MULTIPLE SEEDS
 7 CONCLUSION AND FURTHER...
 Appendix
 ACKNOWLEDGEMENTS
 REFERENCES
 
We compare in this section our multiple seeds with the ones computed by other approaches. In Table 4, we compare our seeds with the best of Yang et al. (2004) and Kong (2007). We picked the best multiple seed of Yang et al. (2004) and compared it with ours for several similarity levels. The ones of Kong (2007) were computed for a specific similarity level and we give the sensitivity for those. Our seeds are better for all levels. Note that our method for choosing the length set in steps 1–5 of the algorithm MULTIPLESEEDS worked well even for two or three seeds. Only in the second last line the lengths given by it would produce a multiple seed of slightly lower sensitivity and we had to use an interval of lengths. This is still very fast.


View this table:
[in this window]
[in a new window]

 
Table 4. Comparing the seeds computed by MULTIPLESEEDS with previous multiple seeds; first group (lines 1–5): best of Yang et al. (2004), 8 seeds, weight 12; second group (lines 6 and 7): best of Kong (2007), 3 seeds of weight 9; third group (lines 8 and 9): best of Kong (2007), 2 seeds of weight 11

 
The most difficult test is comparing with the multiple seeds of Li et al. (2004), the sensitivities of which were kindly provided by the authors (Li, 2007; Ma, 2007). The multiple seed of 16 weight 11 seeds in Li et al. (2004)—which is implemented in the best homology search software, PatternHunterII—took 12 days to compute greedily, i.e. assuming the first i seeds are known, the (i + 1)th seed is selected by exhaustive search in a length interval so that it maximizes the sensitivity of all i + 1 seeds. Remarkably, MULTIPLESEEDS computes a better multiple seed in 17 s! It is shown in Table 5 and the comparison with the one of Li et al. (2004) is provided in columns two and three of Table 6. The last column of Table 6 contains the sensitivity (significantly higher) of a multiple seed consisting of 32 weight 11 seeds which we computed in <3 min. The multiple seed itself is given in the Appendix.


View this table:
[in this window]
[in a new window]

 
Table 5. Our multiple seed with 16 weight 11 seeds

 

View this table:
[in this window]
[in a new window]

 
Table 6. The sensitivity of our multiple seed of weight 11 from Table 5 compared to that of Li et al. (2004) for length of random region 64

 
We computed then, for the same weight 11, any number of seeds between 1 and 16 and compared their sensitivity for similarity level 70% with those of Li et al. (2004) in Table 7. We give also the time required by each computation. The multiple seeds are given in the Appendix. Note that the sensitivity of our multiple seeds with 13, 14 and 15 seeds are higher than the sensitivity of the ones of Li et al. (2004) with an extra seed, i.e. 14, 15 and 16 seeds, respectively.


View this table:
[in this window]
[in a new window]

 
Table 7. The sensitivity of our multiple seeds with i seeds, 1 ≤ i ≤ 16, of weight 11, compared to that of Li et al. (2004) for similarity 70% and length of random region 64

 
We should mention that the implementation of MULTIPLESEEDS is straightforward, and we used the dynamic programming algorithm of Li et al. (2004) for computing sensitivity. The running times can probably be improved but our focus is on fast algorithms and not on efficient implementation.


    7 CONCLUSION AND FURTHER RESEARCH
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 SPACED SEEDS
 3 OVERLAP COMPLEXITY
 4 SENSITIVITY OF LOW-OVERLAP...
 5 A POLYNOMIAL-TIME ALGORITHM
 6 BETTER MULTIPLE SEEDS
 7 CONCLUSION AND FURTHER...
 Appendix
 ACKNOWLEDGEMENTS
 REFERENCES
 
The introduction of optimal spaced seeds in Ma et al. (2002) followed by multiple spaced seeds in Li et al. (2004) revolutionized homology search. It is therefore important to compute good multiple spaced seeds fast. The optimal ones are hard to compute and research has been done for finding faster ways to compute less than optimal but still good seeds. Our approach is much faster and produces better multiple seeds than the existing ones. This was shown by comparing our results with the best previous ones.

We believe that the dramatic improvement brought by our approach allows looking at multiple seeds in a different way, beyond the improvement in homology search simply due to better multiple seeds. So far, as computing good multiple spaced seeds was a very time-consuming task, the seeds were first computed and then hard-coded in the homology search software. With our approach testing of many seeds for a given purpose becomes possible. Also, the swapping technique we used for fast improvement of overlap complexity may be useful for fast improvement of other, specific, properties as well.

While our experimental results are very good, the theory to support them needs development. Problems include proving guarantees for the correlation between overlap complexity and sensitivity, finding bounds on the approximation ratio of our heuristic algorithm and approximating the number of swaps needed. (The bound we set for the number of swaps in the algorithm was never reached in practice.) On one hand, these theoretical questions are not easy to solve and they are not essential for the practical aspect of our study; we simply build better multiple spaced seeds than all previous ones using a much faster algorithm. On the other hand, they may bring new ideas to further improve our approach.

From practical point of view, the best way of using the overlap complexity is an open problem and should be further investigated. Also the way the lengths are computed could be improved. As mentioned, this is computer dependent in our case. We plan to make more experiments on a computer with a larger RAM. However, all these improvement, important as they might be, are most likely to be incremental, nowhere near the dramatic improvement presented here.


    Appendix
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 SPACED SEEDS
 3 OVERLAP COMPLEXITY
 4 SENSITIVITY OF LOW-OVERLAP...
 5 A POLYNOMIAL-TIME ALGORITHM
 6 BETTER MULTIPLE SEEDS
 7 CONCLUSION AND FURTHER...
 Appendix
 ACKNOWLEDGEMENTS
 REFERENCES
 
The multiple seed used in Table 4 for comparison with the one of Yang et al. (2004):



Formula

The multiple seeds used in Table 4 for comparison with the ones of Kong (2007); they are given in the same order in which the sensitivities are given in Table 4:



Formula

Our multiple seed of 32 weight 11 seeds the sensitivity of which is given in the last column of Table 6; the time needed to compute all 32 seeds was 171 s:



Formula

The seeds used to obtain the sensitivities in Table 7; the first is PatternHunter's seeds; the set of 16 seeds is given in Table 5:



Formula



Formula


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 SPACED SEEDS
 3 OVERLAP COMPLEXITY
 4 SENSITIVITY OF LOW-OVERLAP...
 5 A POLYNOMIAL-TIME ALGORITHM
 6 BETTER MULTIPLE SEEDS
 7 CONCLUSION AND FURTHER...
 Appendix
 ACKNOWLEDGEMENTS
 REFERENCES
 
We thank Ming Li and Bin Ma for kindly providing the sensitivities of their seeds in Li et al. (2004).

We also thank the anonymous referees for careful reading and useful suggestions that helped improving the presentation.

S.I. was supported by a postdoctoral fellowship from NSERC and L.I. was partially supported by an NSERC grant.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: John Quackenbush

1From biological point of view only strings starting and ending with 1 are spaced seeds. The seeds we shall eventually compute satisfy this condition. Back

Received on May 14, 2007; revised on July 19, 2007; accepted on August 11, 2007

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 SPACED SEEDS
 3 OVERLAP COMPLEXITY
 4 SENSITIVITY OF LOW-OVERLAP...
 5 A POLYNOMIAL-TIME ALGORITHM
 6 BETTER MULTIPLE SEEDS
 7 CONCLUSION AND FURTHER...
 Appendix
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Altschul SF, et al. Basic local alignment search tool. J. Mol. Biol. (1990) 215:403–410.[CrossRef][Web of Science][Medline]

    Altschul SF, et al. Gapped Blast and Psi-Blast: a new generation of protein database search programs. Nucleic Acids Res. (1997) 25:3389–3402.[Abstract/Free Full Text]

    Brejova B, et al. Optimal spaced seeds for homologous coding regions. J. Bioinform. Comput. Biol. (2004) 1:595–610.[CrossRef][Medline]

    Buhler J, et al. Designing seeds for similarity search in genomic DNA. In: Proceedings of RECOMB'03 (2003) New York: ACM Press. 67–75.

    Burkhardt S, Kärkkäinen J. Better filtering with gapped q-grams. In: Proceedings of CPM'01, Lecture Notes in Computer Science. 2089 (2001) Berlin: Springer. 73–85.

    Choi KP, Zhang L. Sensitivity analysis and efficient method for identifying optimal spaced seeds. J. Comput. Syst. Sci. (2004) 68:22–40.[CrossRef]

    Choi KP, et al. Good spaced seeds for homology search. Bioinformatics (2004) 20:1053–1059.[Abstract/Free Full Text]

    Ilie L, Ilie S. Long spaced seeds for finding similarities between biological sequences. In: Proceedings of BIOCOMP'07 (2007) to appear.

    Karp R, Rabin MO. Efficient randomized pattern-matching algorithms. IBM J. Res. Dev. (1987) 31:249–260.

    Keich U, et al. On spaced seeds for similarity search. Discrete Appl. Math. (2004) 3:253–263.

    Kisman D, et al. tPatternHunter: gapped, fast and sensitive translated homology search. Bioinformatics (2005) 21:542–544.[Abstract/Free Full Text]

    Kong Y. Generalized correlation functions and their applications in selection of optimal multiple spaced seeds for homology search. In: J. Comput. Biol. (2007) 14:238–254.[CrossRef][Web of Science][Medline]

    Kucherov G, et al. Estimating seed sensitivity on homogeneous alignments. In: Proceedings IEEE 4th Symposium on Bioinformatics and Bioengineering Tiwan (2004) Taiwan. 387–394.

    Li M. personal communication.

    Li M, et al. Pattern-HunterII: highly sensitive and fast homology search. J. Bioinform. Comput. Biol. (2004) 2:417–440.[CrossRef][Medline]

    Li M, et al. Superiority and complexity of spaced seeds. In: Proceedings of SODA'06 (2006) SIAM. 444–453.

    Lipman DJ, Pearson WR. Rapid and sensitive protein similarity searches. Science (1985) 227:1435–1441.[Abstract/Free Full Text]

    Ma B. personal communication.

    Ma B, et al. PatternHunter: faster and more sensitive homology search. Bioinformatics (2002) 18:440–445.[Abstract/Free Full Text]

    Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. (1970) 48:443–453.[CrossRef][Web of Science][Medline]

    Ning Z, et al. SSAHA: a fast search method for large DNA databases. Genome Res. (2001) 11:1725–1729.[Abstract/Free Full Text]

    Noé L, Kucherov G. Yass: enhancing the sensitivity of DNA similarity search. Nucleic Acids Res. (2005) 33:540–543.[CrossRef]

    Pevzner P, Waterman MS. Multiple filtration and approximate pattern matching. Algorithmica (1995) 13:135–154.[CrossRef][Web of Science]

    Preparata FP, et al. Quick, practical selection of effective seeds for homology search. J. Comput. Biol. (2005) 12:137–1152.

    Smith TF, Waterman MS. Identification of common molecular subsequences. J. Mol. Biol. (1981) 147:195–197.[CrossRef][Web of Science][Medline]

    Sun Y, Buhler J. Designing multiple simultaneous seeds for DNA similarity search. In: Proceedings of RECOMB'04 (2004) New York: ACM Press. 76–85.

    Xu J, et al. Optimizing multiple spaced seeds for homology search. In: Proceedings of CPM'04. Lecture Notes in Computer Science 3109, Springer, Berlin (2004) 47–58.

    Yang I.-H, et al. Efficient methods for generating optimal single and multiple spaced seeds. In: Proceedings of IEEE 4th Symposium on Bioinformatics and Bioengineering, Taiwan (2004) Taiwan. 411–418.

    Zhang Z, et al. A greedy algorithm for aligning DNA sequences. J. Comput. Biol. (2000) 7:203–214.[CrossRef][Web of Science][Medline]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
L. Ilie and S. Ilie
Fast computation of neighbor seeds
Bioinformatics, March 15, 2009; 25(6): 822 - 823.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/22/2969    most recent
btm422v2
btm422v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Ilie, L.
Right arrow Articles by Ilie, S.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Ilie, L.
Right arrow Articles by Ilie, S.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?