Skip Navigation


Bioinformatics Advance Access originally published online on November 25, 2004
Bioinformatics 2005 21(8):1316-1324; doi:10.1093/bioinformatics/bti155
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/8/1316    most recent
bti155v2
bti155v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (18)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Yamada, T.
Right arrow Articles by Morishita, S.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Yamada, T.
Right arrow Articles by Morishita, S.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© Published by Oxford University Press 2004.

Accelerated off-target search algorithm for siRNA

Tomoyuki Yamada * and Shinichi Morishita

Department of Computational Biology, Graduate School of Frontier Sciences, University of Tokyo Japan

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 1 INTRODUCTION
 2 siRNA SEQUENCE DESIGN...
 3 NON-REDUNDANT SEQUENCE SET
 4 SELECTION OF siRNA...
 5 ALGORITHMS FOR COMPUTING...
 6 EXPERIMENTAL RESULTS
 7 DISCUSSION
 REFERENCES
 

Motivation: Designing highly effective short interfering RNA (siRNA) sequences with maximum target-specificity for mammalian RNA interference (RNAi) is one of the hottest topics in molecular biology. The relationship between siRNA sequences and RNAi activity has been studied extensively to establish rules for selecting highly effective sequences. However, there is a pressing need to compute siRNA sequences that minimize off-target silencing effects efficiently and to match any non-targeted sequences with mismatches.

Results: The enumeration of potential cross-hybridization candidates is non-trivial, because siRNA sequences are short, ca. 19 nt in length, and at least three mismatches with non-targets are required. With at least three mismatches, there are typically four or five contiguous matches, so that a BLAST search frequently overlooks off-target candidates. By contrast, existing accurate approaches are expensive to execute; thus we need to develop an accurate, efficient algorithm that uses seed hashing, the pigeonhole principle, and combinatorics to identify mismatch patterns. Tests show that our method can list potential cross-hybridization candidates for any siRNA sequence of selected human gene rapidly, outperforming traditional methods by orders of magnitude in terms of computational performance.

Availability: http://design.RNAi.jp

Contact: yamada{at}cb.k.u-tokyo.ac.jp


    1 INTRODUCTION
 TOP
 Abstract
 1 INTRODUCTION
 2 siRNA SEQUENCE DESIGN...
 3 NON-REDUNDANT SEQUENCE SET
 4 SELECTION OF siRNA...
 5 ALGORITHMS FOR COMPUTING...
 6 EXPERIMENTAL RESULTS
 7 DISCUSSION
 REFERENCES
 
The use of short interfering RNA (siRNA) is an emerging powerful method for gene silencing (Elbashir et al., 2001). siRNA is a short RNA duplex of between 10 and 25 nucleotides (nt), although 21 nt are frequently used because the length is effective. After transfection into cells, siRNA targets mRNA and causes the sequence-specific degradation of the mRNA. As siRNA duplexes have two-nucleotide overhangs at their 3' ends, the remaining 19 nt subsequence matches the target mRNA sequence to silence its behavior. Therefore, the key to designing siRNA is knowing how to select the 19 nt sequence from a target mRNA sequence.

Recently, novel guidelines for selecting highly effective siRNA have been reported (Schwarz et al., 2003; Chalk et al., 2004; Reynolds et al., 2004; Ui-Tei et al., 2004). These rules are the result of extensive studies of the relationship between siRNA sequences and RNAi activity. Thus, combining these rules would enable us to design effective siRNA sequences with a high probability. Afterwards, it is important to minimize the risk of off-target gene silencing caused by unintended hybridization of the designed effective siRNA sequence.

The minimization of off-target silencing effects calls for the selection of siRNA sequences that are guaranteed to have some mismatches to all unrelated sequences. We here use a rigorous specificity measure called the mismatch tolerance, the minimum number of mismatches between the siRNA sequence and any non-targeted sequence. For instance, an siRNA sequence of mismatch tolerance 3 does not match any off-target candidates with fewer than three mismatches. A higher mismatch tolerance of an siRNA sequence indicates its high specificity in the presence of some mismatches.

Exact mismatch-tolerance is costly to calculate, because it demands searching the entire sequence database to check. Moreover, siRNA sequences are short, ca. 19 nt, and they should contain at least three mismatches with non-targets. Consequently, there are typically 3–6 contiguous matches between an siRNA sequence and its potential off-targets (Fig. 1). Although many existing tools (Chalk et al., 2004; Levenkova et al., 2004; Wang and Mu, 2004) for designing siRNA sequences use BLAST search (Altschul et al., 1990), BLAST frequently overlooks off-target candidates because it initially looks for contiguous matches that are at least 7 nt long. Figure 1 illustrates a typical example.



View larger version (11K):
[in this window]
[in a new window]
 
Fig. 1 Alignment between a 19-nt-long siRNA sequence and a potential off-target sequence with three mismatches: Querying the upper short sequence, a substring in NM_000014 [GenBank] , into the RefSeq database by using BLAST fails to find the lower off-target sequence, a substring in NM_002827 [GenBank] . The word size parameter of BLAST was set to the minimum value, 7, and the expected value to 1000.

 
1.1 Previous work
One might consider using the Smith–Waterman algorithm (Smith and Waterman, 1981) which is extraordinarily expensive to execute. Special purpose hardware that device the Smith–Waterman algorithm or other pattern matching algorithms are inherently highly priced.

Several methods for designing gene-specific oligos for DNA microarrays (Li and Stormo, 2001; Kaderali and Schliep, 2002; Rahmann, 2002; Yamada and Morishita, 2004) might be relevant. Rahmann (2002) proposed a method for designing oligomers from human EST sequences. This method uses a suffix array to rapidly identify the longest consecutive match between the designed oligomer and database. However, these methods are not efficient for designing 19 nt-long siRNA because, when at least three of the 19 nt must be mismatches, the longest common subsequence becomes too short, typically no more than 6 nt. Previously, we proposed an approximation algorithm for estimating a tight lower bound of the minimum number of mismatches between an oligomer and its potential off-target sequences (Yamada and Morishita, 2004). The algorithm is intended to process oligomers of around 50 nt in length as input and is not designed for handling short oligomers of about 20 nt.

The approximate string matching problem (Ukkonen, 1992) is most relevant. The problem is to enumerate all substrings in a sequence that are within a (Hamming or edit) distance d of the target string. There are no other substrings within the Hamming distance d – 1 if and only if the mismatch tolerance of the focal string is at least d. Because of this correspondence, it is worth utilizing ideas developed for solving the approximate string matching problem.

However computing this value is very difficult, especially in large genomic sequences or sequence databases of size |G|, if we need an siRNA sequence E. Calculation of the exact mismatch tolerance using Abrahamson's algorithm (1987) requires computation time. Even if the maximum number of mismatches is d, the computation time is still (Amir et al., 2000). These two methods are designed to scan G for individual query E.

However, in reality, the number of queries could be huge, for instance, the number of all 19 nt sequences in all the possible mRNA sequences. Thus, it is effective to preprocess G once to generate the lookup table that stores locations of short strings. According to this idea, a variety of filtering procedures have been proposed for reducing candidate regions to search in the whole sequence. Baeza-Yates and Perleberg (1992) proposed a simple filter. Its basic idea is that division of a pattern into d + 1 pieces must yield one perfect match piece in candidate substrings. However, for larger values of d, shorter pieces must be considered, which broadens the search space and deteriorates the overall performance.

Sung and Lee (2003) and Zheng et al. (2003) generalized this idea to allow non-overlapping short seeds (q-grams, in their word) {with a couple of mismatches} to search candidate hit positions and then extend each hit to determine the Hamming distance. Although their methods are able to list all the qualified oligomers, they still demand expensive calculation even for processing mid-scale genomic sequences of length tens of millions base pairs.

1.2 Our result
These problems motivated us to improve Sung and Lee's approach. Our novel ideas are the use of overlapping seeds instead of non-overlapping ones and the careful elimination of redundant seeds to search. Tests demonstrate that our method can list the potential cross-hybridization candidates for any siRNA sequence of a selected human gene promptly, outperforming traditional methods (Wu and Manber, 1992; Sung and Lee, 2003; Zheng et al., 2003) by orders of magnitude in terms of computational performance.

Snove and Holen (2004) also remark that many commonly used siRNAs risk off-target silencing, which is verified by utilizing Interagon's special purpose hardware named Pattern Matching Chip that implements pattern matching algorithms. According to their paper (Snove and Holen, 2004) the hardware is able to process 60 million siRNA sequences of length 21 nt in 10 h, though the number of processing units on the hardware is not mentioned explicitly in the paper. Our software can achieve a similar performance if it is executed in parallel on ten inexpensive PCs with Xeon CPU of clock rate 3.2 GHz and with 2 GB main memory.


    2 siRNA SEQUENCE DESIGN PROBLEM
 TOP
 Abstract
 1 INTRODUCTION
 2 siRNA SEQUENCE DESIGN...
 3 NON-REDUNDANT SEQUENCE SET
 4 SELECTION OF siRNA...
 5 ALGORITHMS FOR COMPUTING...
 6 EXPERIMENTAL RESULTS
 7 DISCUSSION
 REFERENCES
 
Let us introduce the notion of mismatch tolerance. Let E be a consecutive substring in the input sequence G, such that E occurs only once in G. Intuitively, the mismatch tolerance of E with respect to G is the minimum number of mismatches that allows E to match a similar subsequence in G. A greater mismatch tolerance implies a lower probability of making E match non-targeted sequences, even in the presence of some mismatches.

DEFINITION 2.1.
We use A,T,G,C and N as letters for sequences. Let G denote an input sequence of length |G|. For instance, one can concatenate EST sequences into a single sequence by inserting spaces (e.g. 100 Ns) between them. Let E denote a sequence of length |E| that consists of the letters A, T, G and C. The subsequence of G that starts from l and ends at r is denoted by G[l,r]. Given two sequences, S and T, of the same length n, the Hamming distance dH(S,T) is defined as the number of positions at which the two bases differ, i.e.

We define the mismatch tolerance of E with respect to G, which is denoted by mt(E,G). If the frequency of E in G exceeds 1, the unique location of E in G cannot be specified; therefore, we set mt(E,G) = 0. Otherwise, mt(E,G) is the minimum Hamming distance between E and any sub-sequence of length |E| in G that is not identical to E, that is:

Note that E should not be compared with itself in G.

EXAMPLE 2.1.
Let G denote

and let E be the first four letters in G, ATGC. Observe that mt(E,G) = 2.

DEFINITION 2.2.
Given a non-negative minimum threshold k(≤|E|) for the mismatch tolerance of siRNA sequence E in G, the mismatch-tolerant oligo design problem checks whether mt(E,G) > k and returns any off-target sequence F such that the Hamming distance dH(E,F) ≤ k.

The use of siRNA sequences of mismatch tolerance 3 or more would be an effective criterion to avoid off-target activity. In addition, we should consider the observation that as few as 11 contiguous matches between a target siRNA sequence and an unrelated mRNA might cause off-target silencing (Jackson et al., 2003). The risk probability of the cross-reaction with off-targets would be reduced by selecting such an siRNA sequence E that the length of a longest common factor (Rahmann, 2002) shared between E and any other string of the same length is smaller. During the computation of longest common factors, substrings that overlap with E trivially share contiguous matches with E and are left out of consideration.

DEFINITION 2.3.
Let G be an input sequence and let E be a substring of G. Let lcf(E,G) denote a longest common factor (contiguous string) shared between E and any other string of G that is of length |E| and does not overlap with E. |lcf(E,G)| denotes the length of a longest common factor.

EXAMPLE 2.2.
Let G and E denote the same sequences in the running example. There exists no longest common factors of length three, and lcf(E,G) is AT or GC.


    3 NON-REDUNDANT SEQUENCE SET
 TOP
 Abstract
 1 INTRODUCTION
 2 siRNA SEQUENCE DESIGN...
 3 NON-REDUNDANT SEQUENCE SET
 4 SELECTION OF siRNA...
 5 ALGORITHMS FOR COMPUTING...
 6 EXPERIMENTAL RESULTS
 7 DISCUSSION
 REFERENCES
 
Another major obstacle is the generation of a non-redundant sequence set of genes for checking the target specificity of siRNA sequences. Traditional non-redundant sequence datasets, such as UniGene or RefSeq, cannot be used for this purpose, because alternative splice variants in these datasets share common exons that bring about duplication. One siRNA oligo may hybridize to any of the redundant exons, calling for duplicate elimination to yield one representative exon. In addition, it is also necessary to consider siRNA sequences that target the junction connecting two exons of a particular alternative splice variant. Therefore, non-redundant sequences over exon–exon junctions together with duplicate-free exons ought to comprise the non-redundant sequence set of genes for checking target specificity. The procedure for building the dataset can be found in Naito et al. (2004).


    4 SELECTION OF siRNA SEQUENCES
 TOP
 Abstract
 1 INTRODUCTION
 2 siRNA SEQUENCE DESIGN...
 3 NON-REDUNDANT SEQUENCE SET
 4 SELECTION OF siRNA...
 5 ALGORITHMS FOR COMPUTING...
 6 EXPERIMENTAL RESULTS
 7 DISCUSSION
 REFERENCES
 
We here consider how to accelerate the selection of highly effective siRNA sequences with a higher mismatch tolerance and with a smaller length of a longest common factor. Existing guidelines for effective siRNA (Schwarz et al., 2003; Chalk et al., 2004; Reynolds et al., 2004; Ui-Tei et al., 2004) investigate short nucleotide strings of 19 or 21 nt and are therefore not computationally costly to execute. On the other hand, the calculation of the length of a longest common factor and the mismatch tolerance for an individual short nucleotide string is much more expensive, because it needs to examine the whole input sequence. To be precise, computing a longest common factor is less time-consuming than calculating the mismatch tolerance, and hence, the former step should be done before the latter. Thus, the following steps will efficiently reduce the search space and select desired siRNA sequences.

  1. Given an input sequence G and a subsequence M of G in which siRNA sequences are designed.
  2. Scan the input sequence M and select highly effective siRNA sequence candidates of length 19 nt according to existing guidelines (Schwarz et al., 2003; Chalk et al., 2004; Reynolds et al., 2004; Ui-Tei et al., 2004).
  3. For each candidate E in M, calculate the length of a longest common factor, |lcf(E,G)|, and reserve E for the next step if |lcf(E,G)| is no more than a specified threshold.
  4. For each remaining candidate E in M, compute the mismatch tolerance, mt(E,G), and output E as a qualified siRNA sequence if mt(E,G) is no less than a specified threshold.

Although the effect of mismatches upon off-target silencing is not fully understood to date, a couple of guidelines for avoiding off-target silencing have been reported (Brummelkamp et al., 2002; Miller et al., 2003; Jackson et al., 2003). Thus, the thresholds for the mismatch tolerance and the length of a longest common factor need to be set carefully in order to retain a sufficient number of qualified siRNA sequence candidates that would be consistent with these guidelines.

Jackson et al. (2003) showed that as few as 11 contiguous matches between siRNA and an unrelated mRNA might cause off-target silencing. However, we found that there is no 19 nt sequence such that |lcf| < 11, where |lcf| denotes the length of a longest common factor, though there are an ample number of 19 nt sequences such that mt = 3 and |lcf| ≤ 14, or mt = 2 and |lcf| ≤ 13, where mt is the mismatch tolerance (Fig. 2). To be precise, among 174.8 million 19 nt sequences in the non-redundant sequence set, 18.9 million (10.83%) meet the two requirements on mt and |lcf|.



View larger version (20K):
[in this window]
[in a new window]
 
Fig. 2 Number of 19 nt sequences in the non-redundant sequence set of nucleotide sequences in RefSeq and Unique UniGene. The horizontal axis represents the length of a longest common factor in a 19 nt sequence, and the vertical axis represents the number of 19 nt sequences for individual mismatch tolerances ranging from 1 to 4.

 
Other recent studies have shown that even a single mismatch in the center of an siRNA can abolish silencing (Brummelkamp et al., 2002; Miller et al., 2003). Thus, an siRNA sequence and its potential off-target are expected to have a mismatch in the center; however, this property is hard to guarantee, because there is no 19 nt sequence such that |lcf| < 11, and the number of 19 nt sequences such that |lcf| ≤ 12 is 0.86 million (0.49%). Our analysis brings out the difficulty of selecting siRNA sequences that are consistent with existing, stringent guidelines for avoiding off-target silencing.


    5 ALGORITHMS FOR COMPUTING MISMATCH TOLERANCE
 TOP
 Abstract
 1 INTRODUCTION
 2 siRNA SEQUENCE DESIGN...
 3 NON-REDUNDANT SEQUENCE SET
 4 SELECTION OF siRNA...
 5 ALGORITHMS FOR COMPUTING...
 6 EXPERIMENTAL RESULTS
 7 DISCUSSION
 REFERENCES
 
To solve the mismatch-tolerant oligo design problem and to decide whether mt(E,G) > k, the Smith–Waterman (SW) algorithm (Smith and Waterman, 1981) can be perfectly used. However, the simple application of the SW algorithm demands testing all |G| – 18 sequences of 19 nt in length, which is extremely costly to perform. We instead present several algorithms that utilize seed sequences and consist of the following three major steps.

  1. Lookup table creation phase: Build a lookup table that stores the locations of all the short seed sequences in G. This step is performed only once for G.
  2. Search phase: Search the lookup table for the positions of E's seed sequences on G. The search trial of one seed effectively eliminates positions where E cannot hybridize with k or less mismatches, and the search reports all the other potential locations to check. The seed length must be defined properly in each algorithm, so as not to overlook potential off-target sequences.
  3. Check phase: For individual seeds that hit G, try to align E with G and count the number of mismatches in the alignment.
The overall performance is dominated by two factors; the total number of seed search trials in Step 2 and the total number of occurrences of seeds that appear in G (seed hits, for short) in Step 3. In general, there is a trade-off between these two major factors; that is, the increase in one factor often decreases the other. Starting with a naive algorithm, we will now improve a suite of algorithms step by step.

5.1 Naive algorithm
The idea of a naive algorithm is to look for a contiguous common string shared between the target siRNA sequence and its potential off-target sequences (Wu and Manber, 1992; Lipson et al., 2002). The major issue is the choice of the minimum length of contiguous strings to search. For instance, Figure 3 illustrates two alignments with four mismatches. In the upper alignment, the length of the contiguous string shared between the two strings is 8, while the value gets to be 3 in the lower alignment.



View larger version (23K):
[in this window]
[in a new window]
 
Fig. 3 Alignments between a 19 nt-long siRNA sequence and potential off-target sequences with four mismatches. Boxes indicate longest common matches.

 
In general, in order to return any off-target sequences F such that the Hamming distance dH(E,F) ≤ k, we utilize the property that E and F must share a contiguous string of length {lceil}(|E| – k)/(k + 1){rciel}.

EXAMPLE 5.1.
When E is a 19 nt sequence siRNA and k = 4, as illustrated in the lower alignment of Figure 3, the seed length is 3.

Let l denote the seed length {lceil}(|E| – k)/(k + 1){rciel}. The number of seed search trials is |E| + 1 – l. In order to have an estimate of the number of seed hits, let us assume that all seeds are uniformly distributed in G. The number of positions where one seed occurs in G is expected to be |G|/4l. The multiplication of this figure by the number of seed search trials gives the expected number of seed hits, (|G|/4l) · (|E| + 1 – l).

EXAMPLE 5.2.
When |G| = 1.5 x 108, |E| = 19 and k = 4, the seed length is 3, the number of seed search trials is 16, and the expected number of seed hits is 39,843,750.

5.2 BYP method
The naive algorithm searches all |E| + 1 – l seeds of length l. We here present an idea of reducing the number of seed search trials, and Figure 4 illustrates the key idea. In the figure, we see that, if the number of mismatches is 4, at least one of the five seeds enclosed in boxes must contain no mismatches.



View larger version (24K):
[in this window]
[in a new window]
 
Fig. 4 Key idea of BYP method.

 
Baeza-Yates and Perleberg (1992) generalized this idea and presented an algorithm called the BYP method that uses the following property.

PROPOSITION 5.1.
Let E and F be sequences of the same length, such that the Hamming distance between E and F is k. E can be divided into disjoint substrings such that (k + 1) substrings are of length {lfloor}|E|/(k + 1){rfloor}, and at least one of them exactly matches F with no mismatches.

EXAMPLE 5.3.
For instance, when |E| = |F| = 19 and k = 4, five disjoint substrings of length 3 can be extracted, and at least one of them must match F exactly, as shown in Figure 4.

According to this proposition, using seeds of length {lfloor}|E|/(k + 1){rfloor}, which is denoted by l in this subsection, allows us to find a seed that fully matches a potential off-target sequence, such as F in the proposition. Note that {lfloor}|E|/(k + 1){rfloor} equals {lceil}(|E| k)/(k + 1){rciel}, the seed length of the naive algorithm in the previous subsection. The number of seed search trials is (k + 1), and the expected number of seed hits under the assumption of the uniform seed distribution is |G|/4l · (k + 1).

EXAMPLE 5.4.
When |G| = 1.5 x 108, |E|=19 and k = 4, the number of seed search trials is 5, and the expected number of seed hits is 11,718,750.

5.3 Partially matching seeds
The algorithms presented in the previous subsections require that at least one seed in E exactly matches the off-target sequence F. As we have seen, the seeds are likely to be short, which makes it difficult to reduce the expected number of seed hits. Note that if we allow seeds to partially match off-target sequences with at most s (≤k) mismatches, we can use longer seeds. Sung and Lee, 2003 utilize this property for probe selection from large genomes. We first introduce their idea, and we will improve it later in this section, to achieve one order-of-magnitude speedup.

For instance, Figure 5 presents two alignments. In each alignment, at least one of the seeds of length 9 contains at most two mismatches. Proposition 5.2 generalizes this property.



View larger version (23K):
[in this window]
[in a new window]
 
Fig. 5 Partially matching seeds.

 
PROPOSITION 5.2.
If E is divided into {lceil}(k + 1)/(s + 1){rciel} disjoint seeds of length {lfloor}|E|/{lceil}(k + 1)/(s + 1){rciel}{rfloor}, at least one of these substrings must match F with at most s mismatches.

PROOF
Otherwise, each disjoint seed contains at least s + 1 mismatches, and hence the number of mismatches, k, must be greater than or equal to

which is no less than k + 1 and is therefore a contradiction.

Again, let l denote the seed length {lfloor}|E|/{lceil}(k + 1)/(s + 1){rciel}{rfloor}. Allowing at most s mismatches is effective for extending the seed length, and this should reduce the expected number of seed hits. However, this relaxation increases the number of seed search trials to look for the positions where a seed of length l may partially match with at most s mismatches. More precisely, we generate all strings such that i (≤s) letters among l positions in the seed are replaced with three other letters, and use these strings as additional seeds, the total number of which is:

Subsequently, we check if each seed exactly matches an off-target sequence, although some probes may fail to match any sequence. As there are {lceil}(k + 1)/(s + 1){rciel} seeds in E, the total number of seeds to check is

because we replace one nucleotide with three other ones. When a large number of mismatches are allowed, the number of seed search trials tends to dominate the overall performance. Conversely, as the seed length l increases, we can decrease the expected number of seed hits; that is,

EXAMPLE 5.5.
Suppose that |G| = 1.5 x 108, |E| = 19 and k = 4. The following table shows the number of seed search trials, expected number of seed hits, and seed length for s = 1,2,3.Observe that allowing at most two mismatches minimizes the expected number of seed hits. The comparison with the BYP method in the previous subsection shows a significant reduction in the expected number of seed hits while allowing a slight increase in the number of seed search trials, demonstrating the effectiveness of using partially matching seeds.

5.4 Elimination of redundant seed search trials
We are now in a position to present our novel ideas for accelerating the approach of using partially matching seeds. The following example shows the basic idea that the number of seed search trials can be further reduced.

EXAMPLE 5.6.
Let us consider designing partially matching seeds when |E| = 19, k = 2 and s = 1. In the previous running examples, we have set k to 4, but in this example, we set k to 2, for simplicity. Note that the seed length l equals 9 (= {lfloor}19/{lceil}(2 + 1)/(1 + 1){rciel}{rfloor}). The analysis in the previous subsection showed that the number of seed search trials is

However, Figure 6 shows that some seeds are redundant. Boxes denote substrings of E, where long boxes represent seeds of length 9 and short boxes are of length 1. Each box displays the number of mismatches. The upper 10 rows enumerate all 10 possible patterns of seed search trials. The reduction in the number can be achieved by selecting the seeds with fewer mismatches which are colored gray in the figure. Observe that the three patterns at the bottom are sufficient to cover all the patterns; therefore, the number of seed search trials required is



View larger version (17K):
[in this window]
[in a new window]
 
Fig. 6 The upper 10 rows show all the possible patterns, while the bottom three patterns are sufficient as seeds to search

 

In general, the following property presents the number of seed search trials.

THEOREM 5.1.
Let m denote {lceil}(k + 1)/(s + 1){rciel}, and let ri(i = 1,...,m) be the number of mismatches in the i-th seed. Let rm+1 denote the number of mismatches in the rest of the m seeds. The number of all the mismatches is bounded by k; namely, . Then, one of the following inequalities holds:

Let l be the seed length. The following number of seed search trials are sufficient to cover all of the patterns.

PROOF
Otherwise, all the following inequalities hold:

Then,

contradicting the assumption that .

Next, if ri ≤ s, the i-th seed can contain at most s mismatches, and hence the number of seed search trials using this seed is .

If we do not have to search off-targets for some s mismatched seeds using this subtle ingenuity, we can markedly decrease the search space size, as described in the above example.

5.5. Overlapping seeds achieve significant speedup
So far, we have assumed that seeds do not overlap. However, using non-overlapping seeds imposes severe restrictions. For instance, when |E| = 19, only one non-overlapping seed with a length greater than 9 can be used. Here, we consider how to handle overlapping seeds. In a subsequent section, the experimental results show that overlapping seeds beat non-overlapping ones by an order of magnitude in terms of execution time.

EXAMPLE 5.7.
Figure 7 shows the case in which the seed length l exceeds {lfloor}|E|/2{rfloor}. Each box shows the number of mismatches within; we should select the seeds with fewer mismatches, which are colored gray. Although the upper 10 seeds enumerate all of the patterns, the lower six probes are sufficient to search off-targets.



View larger version (24K):
[in this window]
[in a new window]
 
Fig. 7 Reducing the number of seed search trials for overlapping seeds

 

One may be concerned that the number of seed search trials would explode when all the mismatches are squeezed into the central box, which is illustrated in the third line from the bottom in Figure 7. In reality, however, the number can be kept to be moderately low when |E| = 19. For instance, if seeds of length 11 are used, the central box has three nucleotides. Selecting two positions of mismatches from the central three letters and replacing each of the two with other three nucleotides yields the following number of seed search trials:

We developed a program that enumerated all the patterns and reduced the number of seed search trials for overlapping seeds, though one might be able to design a simple formula for this task. We here present a simple formula to count the number of seed search trials for each pattern.

PROPOSITION 5.3.
Suppose that two overlapping seeds of length l share 2l – |E| letters in the central interval. Let mC denote the number of mismatches in the central 2l – |E| letters. Let mL(mR) be the number of mismatches in the left (right) box excluding the central box. The minimum number of seed search trials for this overlapping seed is

Summing the numbers of seed search trails for all the patterns yields the total number. We can reduce the search space size by using overlapping seeds in this way.


    6 EXPERIMENTAL RESULTS
 TOP
 Abstract
 1 INTRODUCTION
 2 siRNA SEQUENCE DESIGN...
 3 NON-REDUNDANT SEQUENCE SET
 4 SELECTION OF siRNA...
 5 ALGORITHMS FOR COMPUTING...
 6 EXPERIMENTAL RESULTS
 7 DISCUSSION
 REFERENCES
 
We compared the naive algorithm, the BYP method and non-overlapping, partially matching seeds with our algorithms: non-overlapping, partially matching seeds with a reduced number of seed search trials; and overlapping seeds with a reduced number of seed search trials. We assume that |G| = 5 x 107, |E| = 19 and k = 3 ~ 4. In non-overlapping, partially matching seeds, s mismatches are allowed at most, which minimizes the expected number of seed hits; namely, s = 1 for seed length = 4 ~ 6, s = 2 for seed length = 7 ~ 9 and s = 4 for seed length ≥10.

Figure 8 compares seed length, the number of seed search trials and expected number of seed hits. Note that in most cases, the expected number of seed hits decreases monotonically as the seed length increases, while the number of seed search trials increases monotonically. Both the naive algorithm and the BYP method can use seeds with a maximum length of 3, so the corresponding line graphs terminate at seed length 3. Therefore, it is impossible to reduce the expected number of seed hits for these two methods by increasing the seed length. By contrast, our algorithms can extend the seed length by partially matching seeds, which decreases the expected number of seed hits, but increases the number of seed search trials. One might wonder why both measures of non-overlapping, partially matching seeds jump when the seed length is 7 and 10. This is because the maximum number of mismatches in one seed s increases when the seed length l gets to be 7 and 10; namely, s = 2 when l = 7 and s = 4 when l = 10.



View larger version (27K):
[in this window]
[in a new window]
 
Fig. 8 Comparing the performance of the naive algorithm, BYP method and our algorithms using non-overlapping/overlapping, partially matching seeds.

 
If both measures are evaluated equally, partially matching seeds of length 13 or 14 are the best choice. In addition, note that overlapping, partially matching seeds always outperform non-overlapping ones with respect to both measures. Thus, from Figure 8, it is recommended that overlapping, partially matching seeds be used.

We verified this by implementing our algorithms in C++ and testing them on the Microsoft Windows XP platform. We used a Dell workstation precision 650 with a Xeon CPU, a clock rate of 3.2 GHz, and 2 GB of main memory. The non-redundant sequence set described in the previous section is used as G. The other parameter settings are k = 3 (s = 1 ~ 3) and k = 4 (s = 1 ~ 4). We solved the mismatch-tolerant oligo design problem for all substrings of length 19 in an mRNA of length 1891 (accession number: NM_003380 [GenBank] ). Figure 9 compares the performance of overlapping and non-overlapping seeds in terms of the average execution time for calculating the mismatch tolerance of one 19 nt sequence. When k = 3, execution time does not improve by non-overlapping, partially matching seeds and reduced number of seed search trials. However the advantage of reducing the number of seed search trials is evident when k = 4 with seeds of length no more than 9. Note that the overlapping, partially matching seeds, reduced number of seed search trials is faster than the others for the same seed length. The performance of non-overlapping seeds slows down when the seed length becomes 10 because the maximum number of mismatches in one seed increases from 2 to 4. Using seeds of length 12 or 13 minimizes the execution time. In this setting, overlapping seeds result in an execution speed that is an order of magnitude faster than that of non-overlapping seeds. Furthermore, it takes less than 10 ms to handle one string on average when k = 3. We also compared the execution time consumed by our algorithm with SSEARCH which uses the Smith–Waterman algorithm (Smith and Waterman, 1981). SSEARCH needs about 13 s for the average execution time for each substring of length 19 in NM_003380 [GenBank] . For k = 3, SSEARCH is three orders of magnitude slower than our algorithm.



View larger version (28K):
[in this window]
[in a new window]
 
Fig. 9 The average execution time of calculating the mismatch tolerance of one 19 nt sequence by using non-overlapping/overlapping, partially matching seeds of length ranging from 7 to 17. The left figure shows the comparison of performance when k = 3, and the right figure is when k = 4.

 

    7 DISCUSSION
 TOP
 Abstract
 1 INTRODUCTION
 2 siRNA SEQUENCE DESIGN...
 3 NON-REDUNDANT SEQUENCE SET
 4 SELECTION OF siRNA...
 5 ALGORITHMS FOR COMPUTING...
 6 EXPERIMENTAL RESULTS
 7 DISCUSSION
 REFERENCES
 
We studied efficient ways to search potential off-targets for siRNA sequences. To process typical siRNA sequences for human genes, the analysis shows that our method using overlapping, partially matching seeds outperforms existing methods by orders of magnitude, in terms of execution time, number of seed search trials and number of seed hits. For better performance, one can devise a straightforward data parallelization by dividing the target dataset of sequences into disjoint, equal-sized subsets and by executing off-target searches into individual subsets. Obviously, the speedup is almost linear in the number of divisions.

Our software for searching potential off-targets is available as a part of the web-based siRNA design software siDirect (Naito et al., 2004). The tool might be beneficial to a broad class of users who search a large set of sequences for specific short sequences of about 20 nucleotides in length, such as primers, oligo-probes, long SAGE tags or miRNA targets. The revision of the software to meet such general requirements is in progress.


View this table:
[in this window]
[in a new window]
 
 


    Acknowledgments
 
We are grateful to Kaoru Saigo, Kumiko Ui-Tei and Yuki Naito for their valuable input.

This work was partly supported by grants from Scientific Research on Priority Areas (Grant#12209003) and Leading Project for Biosimulation, the Ministry of Education, Culture, Sports, Science and Technology of Japan to S.M.

Received on June 26, 2004; revised on November 4, 2004; accepted on November 12, 2004

    REFERENCES
 TOP
 Abstract
 1 INTRODUCTION
 2 siRNA SEQUENCE DESIGN...
 3 NON-REDUNDANT SEQUENCE SET
 4 SELECTION OF siRNA...
 5 ALGORITHMS FOR COMPUTING...
 6 EXPERIMENTAL RESULTS
 7 DISCUSSION
 REFERENCES
 

    Abrahamson, K. (1987) Generalized string matching. SIAM J. Comput., 16, 1039–1051[CrossRef].

    Amir, A., Lewenstein, M., Porat, E. (2000) Faster algorithms for string matching with k mismatches. 794–803 In proceedings, 11th ACM - SIAM Symposium on Discrete Algorithms (SODA).

    Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410[CrossRef][ISI][Medline].

    Baeza-Yates, R. and Perleberg, C. (1992) Fast and practical approximate string matching. CPM, 185–192.

    Brummelkamp, T.R., Bernards, R., Agami, R. (2002) Stable suppression of tumorigenicity by virus-mediated RNA interference. Cancer Cell., 2, 243–247[CrossRef][ISI][Medline].

    Chalk, A.M., Wahlestedt, C., Sonnhammer, E.L. (2004) Improved and automated prediction of effective siRNA. Biochem. Biophys. Res. Commun., 319, 264–274[CrossRef][ISI][Medline].

    Elbashir, S.M., Harborth, J., Lendeckel, W., Yalcin, A., Weber, K., Tuschl, T. (2001) Duplexes of 21-nucleotide RNAs mediate RNA interference in cultured mammalian cells. Nature, 411, 494–498[CrossRef][Medline].

    Jackson, A.L., Bartz, S.R., Schelter, J., Kobayashi, S.V., Burchard, J., Mao, M., Li, B., Cavet, G., Linsley, P.S. (2003) Expression profiling reveals off-target gene regulation by RNAi. Nature Biotechnol., 21, 635–637[CrossRef][ISI][Medline].

    Kaderali, L. and Schliep, A. (2002) Selecting signature oligonucleotides to identify organisms using DNA arrays. Bioinformatics, 18, 1340–1349[Abstract/Free Full Text].

    Levenkova, N., Gu, Q., Rux, J.J. (2004) Gene specific siRNA selector. Bioinformatics, 20, 430–432[Abstract/Free Full Text].

    Li, F. and Stormo, G. (2001) Selection of optimal DNA oligos for gene expression arrays. Bioinformatics, 17, 1067–1076[Abstract/Free Full Text].

    Lipson, D., Webb, P., Yakhini, Z. (2002) Designing specific oligonucleotide probes for the entire S. cerevisiae Transcriptome. WABI, 491–505.

    Miller, V.M., Xia, H., Marrs, G.L., Gouvioun, C.M., Lee, G., Davidson, B.L., Paulson, H.L. (2003) Allele-specific silencing of dominant disease genes. Proc. Natl Acad. Sci., 100, 7195–7200[Abstract/Free Full Text].

    Naito, Y., Yamada, T., Ui-Tei, K., Morishita, S., Saigo, K. (2004) siDirect: highly effective, target-specific siRNA design software for mammalian RNA interference. Nucleic Acids Res., 32, W124–W129[Abstract/Free Full Text].

    Rahmann, S. (2002) Rapid large-scale oligonucleotide selection for microarrays. Proceedings of the First IEEE Computer Society Bioinformatics Conference (CSB'02) In proceedings of the CSB2002 IEEE Computer Society.

    Reynolds, A., Leake, D., Boese, Q., Scaringe, S., Marchall, W.S., Khvorova, A. (2004) Rational siRNA design for RNA interference. Nat. Biotechnol., 22, , pp. 326–330[CrossRef][ISI][Medline].

    Schwarz, D.S., Hutvagner, G., Du, T., Xu, Z., Aronin, N., Zamore, P.D. (2003) Asymmetry in the assembly of the RNAi enzyme complex. Cell, 115, 199–208[CrossRef][ISI][Medline].

    Smith, T.F. and Waterman, M.S. (1981) Identification of common molecular subsequences. J. Mol. Biol., 147, 195–197[CrossRef][ISI][Medline].

    Snove, O. and Holen, T. (2004) Many commonly used siRNAs risk off-target activity. Biochem. Biophys. Res. Commun., 319, 256–263[CrossRef][ISI][Medline].

    Sung, W.-K. and Lee, W.-H. Fast and accurate probe selection algorithm for large genomes. IEEE Computer Society Bioinformatics Conference (CSB) In Proceedings of the CSB2003 IEEE Computer Society, pp. 65–74.

    Ui-Tei, K., Naito, Y., Takahashi, F., Haraguchi, T., Ohki-Hamazaki, H., Juni, A., Ueda, R., Saigo, K. (2004) Guidelines for the selection of highly effective siRNA sequences for mammalian and chick RNA interference. Nucleic Acids Res., 32, 936–948[Abstract/Free Full Text].

    Ukkonen, E. (1992) Approximate string-matching with q-grams and maximal matches. Theoret. Comput. Sci., 92, 191–211[CrossRef].

    Wang, L. and Mu, F.Y. (2004) A web-based design center for vector-based siRNA and siRNA cassette. Bioinformatics, 20, 1818–1820[Abstract/Free Full Text].

    Wu, S. and Manber, U. (1992) Fast text searching allowing errors. Commun. ACM, 35, 83–91.

    Yamada, T. and Morishita, S. (2004) Computing highly specific and noise-tolerant oligomers efficiently. J. Bioinfo. Comp. Biol., 2, 21–46.

    Zheng, J., Close, T.J., Jiang, T., Lonardi, S. (2003) Efficient selection of unique and popular oligos for large EST databases. Proceedings of the 14th Annual Symposium on Combinatorial Pattern Matching (CPM2003) Springer, pp. 273–283.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Nucleic Acids ResHome page
Y.-K. Park, S.-M. Park, Y.-C. Choi, D. Lee, M. Won, and Y. J. Kim
AsiDesigner: exon-based siRNA design server considering alternative splicing
Nucleic Acids Res., July 1, 2008; 36(suppl_2): W97 - W103.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
P. R. Clark, J. S. Pober, and M. S. Kluger
Knockdown of TNFR1 by the sense strand of an ICAM-1 siRNA: dissection of an off-target effect
Nucleic Acids Res., March 27, 2008; 36(4): 1081 - 1097.
[Abstract] [Full Text] [PDF]


Home page
Brief Funct Genomic ProteomicHome page
A. Jagannath and M. Wood
RNA interference based gene therapy for neurological disease
Brief Funct Genomic Proteomic, May 3, 2007; (2007) elm005v1.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
Y. Naito, K. Ui-Tei, T. Nishikawa, Y. Takebe, and K. Saigo
siVirus: web-based antiviral siRNA design software for highly divergent viral sequences.
Nucleic Acids Res., July 1, 2006; 34(Web Server issue): W448 - W450.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
T. Yamada, H. Soma, and S. Morishita
PrimerStation: a highly specific multiplex genomic PCR primer design server for the human genome.
Nucleic Acids Res., July 1, 2006; 34(Web Server issue): W665 - W669.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
Y. Naito, T. Yamada, T. Matsumiya, K. Ui-Tei, K. Saigo, and S. Morishita
dsCheck: highly sensitive off-target search software for double-stranded RNA-mediated RNA interference
Nucleic Acids Res., July 1, 2005; 33(suppl_2): W589 - W591.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/8/1316    most recent
bti155v2
bti155v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (18)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Yamada, T.