Skip Navigation


Bioinformatics Advance Access originally published online on May 18, 2006
Bioinformatics 2006 22(15):1823-1831; doi:10.1093/bioinformatics/btl194
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary data
Right arrowOA All Versions of this Article:
22/15/1823    most recent
btl194v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (1)
Google Scholar
Right arrow Articles by Busch, A.
Right arrow Articles by Backofen, R.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Busch, A.
Right arrow Articles by Backofen, R.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© 2006 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (
http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commerical use, distribution, and reproduction in any medium, provided the original work is properly cited.

INFO-RNA—a fast approach to inverse RNA folding

Anke Busch and Rolf Backofen *

Albert-Ludwigs-University Freiburg, Institute of Computer Science, Chair of Bioinformatics Georges-Koehler-Allee 106, 79110 Freiburg, Germany

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 THE INFO-RNA APPROACH
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 

Motivation: The structure of RNA molecules is often crucial for their function. Therefore, secondary structure prediction has gained much interest. Here, we consider the inverse RNA folding problem, which means designing RNA sequences that fold into a given structure.

Results: We introduce a new algorithm for the inverse folding problem (INFO-RNA) that consists of two parts; a dynamic programming method for good initial sequences and a following improved stochastic local search that uses an effective neighbor selection method. During the initialization, we design a sequence that among all sequences adopts the given structure with the lowest possible energy. For the selection of neighbors during the search, we use a kind of look-ahead of one selection step applying an additional energy-based criterion. Afterwards, the pre-ordered neighbors are tested using the actual optimization criterion of minimizing the structure distance between the target structure and the mfe structure of the considered neighbor.

We compared our algorithm to RNAinverse and RNA-SSD for artificial and biological test sets. Using INFO-RNA, we performed better than RNAinverse and in most cases, we gained better results than RNA-SSD, the probably best inverse RNA folding tool on the market.

Availability: www.bioinf.uni-freiburg.de?Subpages/software.html

Contact: backofen{at}informatik.uni-freiburg.de

Supplementary information: Supplementary data are available on Bioinformatics online.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 THE INFO-RNA APPROACH
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 
RNAs are involved in translation (tRNA, rRNA), splicing (snRNA), processing of other RNAs (snoRNA, RNAseP) and regulatory processes (miRNA, siRNA) (Hüttenhofer et al., 2002). Furthermore, parts of mRNAs can adopt structures that regulate their own translation [SECIS (Hüttenhofer and Böck, 1998; Liu et al., 1998), IRE (Addess et al., 1997)]. The function of RNA molecules often depends on both the primary sequence and the secondary structure. Since prediction or experimental determination of three-dimensional RNA structures remain difficult, much work focuses on problems associated with its secondary structure, which can be described as a set of paired positions of the RNA sequence. These positions are assigned to complementary bases according to Watson and Crick (A and U, C and G). In some cases, other pairings (e.g. G and U) can be found. The problem of predicting the secondary structure of an RNA is called the RNA folding problem. Existing computational approaches are based on a thermodynamic model that gives a free energy value for each secondary structure (Zuker, 1994). The structure with the lowest energy [called the minimum free energy (mfe) structure] is expected to be the most stable one.

In this paper, we consider the inverse RNA folding problem, which is the design of RNA sequences that fold into a desired structure. This design is applicable to ribozymes and riboswitches (Knight, 2003; Winkler et al., 2004; Cech, 2004), which may be used as drugs in research and medicine. Furthermore, the inverse RNA folding can be applied to the design of noncoding RNAs, which are involved in a large variety of processes, e.g. gene regulation, chromosome replication and RNA modification (Storz, 2002).

Given an RNA secondary structure, we aim at finding an RNA sequence that is going to adopt this structure. Straight forwardly testing each sequence, whether its mfe structure is the searched one, is impossible since the number of sequences grows exponentially in the size of the structure (Hofacker, 1994). Thus, different heuristic local search strategies, which do not analyze the complete solution space, were used by existing programs dealing with inverse RNA folding (Hofacker et al., 1994; Andronescu et al., 2004; Dirks et al., 2004). One approach to inverse folding is implemented in RNAinverse, which is included in the Vienna RNA Package (Hofacker et al., 1994). There, the strategy of adaptive walk is used and local optima are found according to two different criteria, namely a structural distance between the mfe structure of the designed sequence and the target structure (mfe-mode) and the probability of folding into the target structure (p-mode). A second algorithm is called RNA-SSD (RNA Secondary Structure Designer) and was developed by Andronescu et al. (2004). It is based on a recursive stochastic local search that also tries to minimize a structure distance.

We present a new algorithm INFO-RNA for the INverse FOlding of RNA. It consists of two steps; a new design method for good initial sequences and a following improved stochastic local search that uses an effective neighbor selection method. Concerning the initialization step, we found out that a good choice is to use a sequence that among all sequences adopts the given structure with the lowest possible energy. We present a dynamic programming approach to solve this problem. Here, multi-branched loops (short: multiloops) are especially complicated to handle. For the selection of neighbors during the local search, we deviated from the arbitrary order used in RNAinverse and RNA-SSD. Using a kind of look-ahead of one selection step, we first order the set of neighbors using an energy-based criterion, which is much faster to calculate than the actual optimization criterion of minimizing the structure distance between the target structure and the mfe structure of the considered neighbor. Afterwards, the neighbors are tested in the calculated order using the actual optimization criterion. We tested INFO-RNA on artificial as well as on real data and compared the results to the ones of RNA-SSD and RNAinverse.


    2 THE INFO-RNA APPROACH
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 THE INFO-RNA APPROACH
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 
The general problem of inverse RNA folding can be described as follows. Find an RNA sequence S = S1 ...Sn of length n that folds into a given secondary structure T, where Si isin B = {A, C, G, U} for 1 ≤ i ≤ n. T can be described as a set of pairs (i1, i2), where 1 ≤ i1 < i2 ≤ n and positions i1 and i2 are paired. In the following, all regarded secondary structures are pseudoknot-free, where a structure T is called pseudoknot-free if for every 2 base pairs (bp) (i1, i2) and (j1, j2) in T holds i1 < i2 < j1 < j2 or i1 < j1 < j2 < i2. We have to analyze a search space of an exponentially high number of valid RNA sequences. These are sequences that can form the base pairs required for the target structure regardless of energy. Therefore, it is not possible to find a globally optimal solution by testing all candidate sequences and thus, local search methods are widely used to address the inverse folding problem. Consequently, the resulting local optima are not guaranteed to be globally optimal but are optimal among all their sequence neighbors. The sequence neighbors of a sequence S are all sequences Si that differ from S in one unbound position or in two positions which have to pair concerning structure T (Fig. 1).

Except on the search strategy itself, the performance of the local search depends on the quality of the initializing sequence. Often, it is chosen at random. In the following, we are going to introduce a new method to create an excellent initializing sequence and describe the local search strategy we used.

2.1 The initializing step
The initializing step of INFO-RNA uses the technique of dynamic programming. This method was successfully applied to RNA secondary structure prediction, e.g. by Zuker and Stiegler (1981), and related problems. Similar to Zuker and Stiegler, we use free energies of structural elements [stacks, bulge- (BL), interior- (IL), hairpin- (HL), multiloops (ML)]. They depend on the size of the loop, the closing base pairs, and on the free bases inside the loops and adjacent to the closing pairs. Since each pair belongs to two elements, neighbored elements in a structure are linked and base pairs cannot be handled independently. Additionally, their energy fraction depends on directly adjacent free bases. Free bases that are not adjacent to a base pair do not give any energy fraction. The free energy value of a pseudoknot-free structure is calculated by adding up all partial energies of its elements.

Given a target structure T, we find a sequence S that among all sequences adopts T with the lowest possible energy. Formally, this means that we find a sequence S resulting from Formula where e(S', T) represents the free energy of sequence S' folded into structure T. For solving this problem, our dynamic programming algorithm needs linear time depending on the structure size. It starts with small substructures and enlarges them gradually by 1 bp. Thus, the algorithm starts at the closing pair of a hairpin loop, subsequently fixes it to pair assignments out of the set of valid pairs BP = {A-U,C-G,G-C,U-A,G-U,U-G}, and assigns the unbound positions of the loop such that they provide the lowest possible energy value for this small substructure under the condition that the closing pair is fixed. This is stored for all six possible assignments of the pair. Afterwards, the next pair to the HL-closing one is fixed. The energy can be calculated by the sum of the energy of the hairpin loop including the closing pair and the stacking energy of the current pair and the closing one of the HL. To find the best energy value, we have to minimize this sum over all possible assignments of the base pair closing the HL. This is demonstrated in equation 1 exemplarily, where e(.) represents the mfe. It refers to Figure 2A.

Formula (1)

We define a substructure Formula as structural part of T that is closed by pair (i1, i2) and has a connected backbone (Fig. 2B). Formula is defined to be its mfe under the condition that the sequence positions of the closing pair of the substructure Formula are fixed to a base pair assignment (a1, a2). Furthermore, we define a structural element Formula as part of structure T that consists of only two neighboring pairs (i1, i2) and (i1 + k, i2l). These ones close the element. It does not have a connected backbone (Fig. 2C). In case of a ML, the structural element is closed by >2 bp and the definition is applied analogously. We define Formula as the mfe of the structural element that is closed by base pairs (i1, i2) and (i1 + k, i2 l) whose sequence positions are fixed to assignments (a1, a2) and (b1, b2), respectively. Equation (2) formalizes the example of Equation (1).

Formula 2(2)

The energy value of any element depends on the assignment of the base pairs and, in case of a loop, on the unpaired bases adjacent to a stem and the loop size. The minimal energy of a substructure T(i1, i2) can be evaluated by adding the minimum energy of the one pair smaller substructure Formula 2 and the energy of the structural element Formula 2. Therefore, an already analyzed smaller substructure can be seen as black box, except for its closing pair. We calculate the lowest possible energies for substructures gradually by adding the next pair to a smaller substructure.

Of course, the question arises in which order the base pairs should be handled. For that purpose, we define an order {precedes} in which base pairs are analyzed. (i1, i2) {precedes} (j1, j2) means that base pair (i1, i2) is analyzed prior to base pair (j1, j2). The actual order in which the base pairs of the target structure are examined is defined as follows.

Formula 3(3)
Relating to the example of Figure 2A, the order of all pairs is the following: (28, 32) {precedes} (27, 33) {precedes} (26, 35) {precedes} (25, 36) {precedes} (16, 21) {precedes} (15, 22) {precedes} (14, 23) {precedes} (6, 10) {precedes} (5, 11) {precedes} (4, 12) {precedes} (2, 38) {precedes} (1, 39).

All pairs that are part of the structural element that is closed by the current pair and that are smaller than the current one (concerning the order) are denoted as predecessors of the current pair. Since closing pairs of HLs do not depend on any other pair, they have no predecessor. The closing pair of a ML has as many predecessors as stems originate from the loop. All other pairs have exactly one predecessor. Table 1 shows the predecessors of Figure 2A.

Having set the order of all pairs, a dynamic programming matrix D is filled with minimal free energies. Each row in D represents a base pair of the target structure while each column stands for a possible assignment of the pairs. Thus, D has as many rows as pairs are in our desired structure and six columns, which represent the assignments A-U, U-A, C-G, G-C, G-U and U-G. In the following, pairs are no longer represented by their pairing positions, e.g. (i1, i2), but only by their numbers in D, e.g. i. The values in the matrix D(i,a) give the mfe of a substructure ending at base pair i (represented by the row) that is assigned to a isin BP (given by the column). Every substructure starts at one or more base pairs that do not have any predecessors.

Before giving a detailed description of the algorithm, we have to define some variables and notations that are used in the following equations. Now, Formula 3 represents the structural element of T between base pairs i and j where i and j are row numbers in D. The free energy of the structural element between base pairs i and j assigned to a and b, respectively, is given by Formula 3. Further definitions are shown in Table 2.

During our dynamic programming approach, the fields in the matrix are filled row by row. Depending on the kind of the pair, i.e. on the number of predecessors, the values are calculated as follows.

(A) If base pair i has exactly one predecessor, i.e. it is a closing pair of a bulge loop, of an interior loop, or of a stack,

Formula 3
where Formula 3 gives the energy of the structural element between pairs i 1 and i assigned to a and b. Besides on a and b, this energy value depends on the assignment of the free bases directly adjacent to i – 1 and i. Thus, both mentioned effects can be seen here: the dependency of the base pairs to each other and the dependency to the adjacent free bases.

(B) If base pair i has more than one predecessor, i.e. it is a closing pair of a multiloop, {forall} a isin BP:

Formula 3
where the minimum is taken over all possible assignments of all predecessor base pairs a1, ... , as and of all free bases b1, ... , bf adjacent to them. Straight forwardly implemented, this evaluation can be exponential in the number of stems originating from the ML and the number of adjacent free bases since the energy fraction of a free base adjacent to two stems depends on the assignments of both. But since usually only MLs with a low number of stems occur, even the naive solution is usable in practice. However, this complexity can be reduced to linear time for all MLs by introducing a further dynamic programming matrix M for each ML. It calculates the best free energy of the substructure closed by the closing pair of the ML dynamically. The evaluation of the ML starts with the first stem-ending pair according to the order of the pairs. The order is defined as given in Equation (3), but the definition of the predecessors is renewed. Now, pair i is predecessor in an ML to pair j iff i {precedes} j and it does not exist any pair k in the ML such that i {precedes} k {precedes} j.

Matrix M includes a row for each stem-ending base pair except the one that is closing the ML. The matrix has to be re-calculated for each possible assignment of the closing pair of the loop since this pair is fixed and a kind of predecessor for the first stem-ending pair. In each step, the best energy of the part of the ML that includes the current base pair j and all pairs i with i {precedes} j is evaluated. To this aim, all assignments of the previous base pair as well as of the stem-adjacent free base(s) between the current and the previous pair have to be taken into account. This has to be done, since the energy fraction of a free base depends on the assignment of all adjacent base pairs. Thus, we have to differentiate between the two cases given in Equations (4) and (5). There, i represents the base pair and the associated assignment is denoted as a.

The energy fraction of a free base assigned to t and adjacent left (l) or right (r) to base pair j assigned to b is given by the single base-stacking energy Formula 3 and Formula 3, respectively.

(I) If there is only one free base between the current base pair i and its predecessor ip,

Formula 4(4)
where b denotes the assignment of base pair ip, t represents the assignment of the free base between ip and i.

(II) If there are more than one bases between the current base pair i and its predecessor ip,

Formula 5(5)
where b denotes the assignment of base pair ip, tp and tc represent the assignments of the free bases adjacent to ip and to i, respectively.

The calculation of the first base pair in the ML (represented by the first row of M) works analogously. Here, the closing pair of the ML acts as its predecessor with a fixed assignment. Finally, the entry for the closing pair of the ML in D is evaluated analogously to Equations (4) and (5) depending on the entries of matrix M.

(C) If base pair i has no predecessor, i.e. it is a closing pair of a hairpin loop, {forall} a isin BP:

Formula 5
where the minimum is taken over all possible assignments of all free bases b1, ... , bH in the HL.

Having filled the complete matrix D, we finally aim at finding the sequence that adopts the given structure with the lowest possible energy. For that purpose, we choose the smallest energy value of the last row of D. It gives the mfe a sequence can have, when folding into the target structure. To find the sequence that provides this energy, we are going back through matrix D along the path of the best predecessor assignments. For this reason, we store traceback pointers during the computation of D. Finally, all free bases that are not directly adjacent to a base pair and thus not giving any energy value are chosen arbitrarily.

Using this dynamic programming algorithm, we obtain a sequence that among all sequences adopts the target structure with the lowest possible energy. There is no other sequence that has lower energy when folding into this structure. Nevertheless, the sequence is not guaranteed to fold into it since actually this sequence can have even less energy when folding into another structure. Therefore, the resulting sequence is processed further in a second step.

Complexity. Filling matrices D and M, and generating the traceback are the things to do during the initialization. Since D has six entries per row and at most n/2 rows, it consists of at most 3n values. Hence, we have to check what time is needed per entry. For that purpose, we differentiate between the three kinds of entries in D depending on the corresponding base pair (A,B,C). During the calculation of values corresponding to pairs having exactly one predecessor (type A), the minima are taken over all assignments of the predecessor and the adjacent free bases. Thus, at most 6 * 44 steps are needed. As already mentioned, straight forwardly, the complexity for entries representing a closing pair of a ML (type B) is exponentially high in the number of stems that form the loop and the number of free bases adjacent to them. Since usually only MLs with a low number of stems occur, even the naive solution is usable in practice. Nevertheless, this complexity can be reduced by using a separate dynamic programming matrix M for the evaluation of MLs. By doing this, all closing pairs of all MLs can be analyzed in at most linear time overall. Last but not least the complexity of type C entries has to be considered. Here, the entry corresponds to a closing pair of a HL. Thus, only in case of a tetra-loop, the assignment of all bases in the loop is important. For larger HLs, only the assignment of the free bases adjacent to the closing pair are taken into account. Therefore, the calculation of fields of type C needs at most 44 steps.

Thus, each entry in D of type A or C can be calculated in constant time and hence, evaluating all of them needs O(n) time. Furthermore, the calculation of all entries for pairs closing a ML can also be done in O(n) time. Consequently, the whole matrix D can be evaluated in linear time as well as the generation of the traceback.

2.2 The local search step
After generating the start sequence, local optima are found by mutating iteratively. For that purpose, neighbored sequences are tested to check whether they provide a better value according to an objective function. In INFO-RNA, we use the objective function of minimizing the structure distance between the mfe structure of the designed sequence and the target structure (mfe-mode) as used in RNAinverse. This structure distance is defined as the number of differentially paired or unpaired bases. Furthermore, we also optimize small substructures first and proceed them to larger ones as done by Hofacker et al. (1994), since running the optimization directly on the full-length sequence would take too much time. The idea is that a substructure, which is optimal for a subsequence, will appear in the full sequence with enhanced probability even if this is not assured.

In INFO-RNA, the local search is a stochastic local search (SLS) (Hoss, 1998). This strategy has a lot in common with the widely used search strategy of adaptive walk (AW), which moves to the first found neighbor of the sequence that has an mfe structure with a lower structure distance to the target structure than the current one. But whereas the AW often gets stuck in local optima (sequences which are better than all their neighbors but not necessarily the globally best solution), the stochastic local search is allowed to move to worse neighbors with a fixed probability pw to overcome local optima. A tested neighbor is retained if its mfe structure has a smaller distance to the target structure than the current one. Otherwise, it is kept with probability pw. We set pw to 0.1 since this has turned out to be the best value during our experiments. The search terminates after a fixed number of steps. We set this number to 10 times the length of the structure. As moves to worse neighbors are allowed, the last sequence is not necessarily the best one found. Thus, the best found sequence is stored during the search and finally given.

During the search, not all sequence neighbors are candidates for mutation. Only positions that do not pair correctly and positions adjacent to those are tested. While in RNAinverse these neighbors are tested in arbitrary order, during INFO-RNA, the order can be chosen depending on a look-ahead of one selection step. Thereto, the energy of each candidate sequence folded into the target structure is calculated. Then, the resulting energy difference to the current one is evaluated. Let e(S, T) be the energy of sequence S folded into the wanted structure T. Let S' be a neighbor of S and e(S', T) the energy of S' when folding into T. Then, the energy difference is given by e(S, T) – e(S', T). The higher the difference is, the earlier the neighbor is examined according to the actual optimization criterion. Using INFO-RNA, the order of testing the neighbors can be chosen depending on energy as described above (NE-mode) or arbitrarily (NA-mode) as done in RNAinverse. This pre-selection step can be done easily, since all structural elements contribute additively to the energy of the whole structure and thus, only structural elements that are closed by the mutated pair or include a mutated free base have to be re-evaluated. After fixing the test order of the neighbors, they are evaluated concerning the actual optimization criterion of minimizing the structure distance of the mfe structure of the sequence and the target structure (mfe-mode). To evaluate the folding energy, we use functions from the Vienna RNA Package.


    3 RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 THE INFO-RNA APPROACH
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 
We evaluated the performance of INFO-RNA and compared it with two other inverse RNA folding algorithms: RNAinverse from the Vienna RNA Package (Hofacker et al., 1994) and RNA-SSD (Andronescu et al., 2004). For that purpose, we chose several artificially generated RNA structures as well as some test sets containing real biological data. To make our results comparable, we chose some biological test sets similar to that of Andronescu et al. (2004). We used an executable of RNA-SSD, kindly provided by the authors, and repeated the tests they had done, since first, RNA-SSD was improved since its publication and second, we used faster PCs.

When using RNAinverse, maximizing the probability of folding into the target structure (p-mode) often gives better results than minimizing the structure distance between the mfe structure of the designed sequence and the target structure (mfe-mode) but the former works only for short structures up to a size of ~200 since it operates on the whole structure. Thus, we always chose the mode of RNAinverse that gives a result at all and if both modes gained a solution, the better one was chosen. We call a run successful, if the mfe structure of the final sequence is the target one. Otherwise, it is denoted as unsuccessful.

All computations were done on PCs with 3 GHz Intel Pentium 4 processors and 2 GB RAM. Since all tested algorithms are non-deterministic, we performed multiple runs on each problem instance. In the following, all given runtimes ET denote expected times required for finding a solution. They are given in CPU seconds and calculated in the same way as done by Andronescu et al. (2004) by ES + (1/fs – 1) EU, where ES and EU represent the average time needed for successful and unsuccessful runs, respectively. The fraction of successful runs is given by fS. A problem arises if fS is 0. Then the expected time ET is set to infinity, which means that a solution will never be found. In the following tables, we indicate these cases by a –. During our tests, INFO-RNA will be used in mfe-mode in combination with the NE-mode, if nothing further is mentioned.

3.1 Artificial test sets
Our first three test sets consist of artifically generated structures. For that purpose, we generated RNA structures with some user-given structural features, e.g. the overall size of the structure, loop sizes and the length of the stems. For all sizes, minimal and maximal values are fixed. A structure generator chooses values among valid sizes as well as structural elements at random. The values we used are summarized in the Supplementary Data.

We generated two test sets of 300 structures each. Test set Ia consists of short structures up to a length of 200, while test set Ib includes larger structures of sizes between 300 and 700. Even if test sets Ia and Ib also include multiloop structures, we additionally analyzed two small but complex ones in test set Ic. Structure Ic-1 turned out to be hard to design because it includes a stem just consisting of only 1 bp. None of the tested programs managed it to design a sequence folding into this structure. Structure Ic-2 differs from Ic-1 just in the challenging stem, which consists of 3 bp here (Fig. 3). Our results show that this slight difference made the structure much easier to design.

Using the artificial test sets Ia–c, we analyzed success and speed of INFO-RNA, RNA-SSD and RNAinverse. Please note that due to the large sizes of structures included in test set Ib the p-mode of RNAinverse was not applied to this test set. For test set Ia, we examined each structure 100 times with each algorithm and tested for how many structures the respective algorithm succeeded in all 100 cases. The same was done for test set Ib, but here, each structure was examined only 10 times. Table 3 summarizes the results. INFO-RNA was always successful for all 300 structures of Ia as well as Ib. For small structures (Ia), RNA-SSD and RNAinverse performed only a little worse. RNA-SSD needed twice as long and RNAinverse 400 times as long as INFO-RNA did. For test set Ib, which includes larger structures, RNA-SSD also performed only a little worse than INFO-RNA but was much slower.

The structures of Ic were examined 100 times each. The resulting succession rates and expected times are given in Table 4. Since for structure Ic-1 all algorithms failed in all cases, no times are given. But all three algorithms designed sequences whose mfe structures are in a small structure distance to the target one. In Table 4, the succession rates in parentheses give the fraction of sequences whose mfe structures have a distance of two to the target one. For structure Ic-2, the fraction of successful runs differs among all three algorithms. While INFO-RNA failed in only one run (out of 100), RNA-SSD did not succeed in 38 cases. Furthermore, the latter is >3000 times slower than INFO-RNA. It can be summarized that, for test set Ic, INFO-RNA has a better succession rate than the other two algorithms and is much faster.

3.2 Biological test sets
Computationally predicted structures for known RNA sequences. In order to test the performance for real biological data, we used two further test sets. These ones consist of structures that are computationally predicted for known RNA sequences. All structures were predicted by RNAfold from the Vienna RNA Package (Hofacker et al., 1994), the same procedure that is used to evaluate the foldings during INFO-RNA, RNA-SSD and RNAinverse. Thus, it is guaranteed that at least one sequence exists, whose mfe structure is the analyzed one.

The first test set of computationally predicted structures consists of 24 structures of 260–1475 bases also analyzed by Andronescu et al. (2004). They created a set of ribosomal RNA sequences obtained from the Ribosomal Database Project (Cole et al., 2003) and predicted their mfe structures. We refer to this as test set IIa. Since Andronescu et al. have already shown that RNA-SSD performs better than RNAinverse when analyzing structures of IIa, we restricted our tests to a comparison of INFO-RNA and RNA-SSD. For each structure, we performed between 10 and 50 runs per algorithm similar to Andronescu et al. (2004). As both algorithms are successful here, we turned our attention to the comparison of speed. For that purpose, we applied INFO-RNA in a slightly different way. If it did not succeed within 300 CPU seconds, INFO-RNA was aborted and, thus, terminated unsuccessfully. Afterwards, the neighbor-testing-mode is changed for the next runs (from the energy-dependent NE-mode to the arbitrary NA-mode or back), as it seems that the current strategy is not successful for the given structure. The new mode is retained till it fails. Thus, we always applied the mode with less unsuccessful terminations. If both modes led to the same number of failures, NE-mode was chosen. This strategy of testing is obvious, since users of the program will change the parameters as well, if the algorithm has failed with their current parameter values. Since RNA-SSD does not include these modes, the strategy was only applied in case of INFO-RNA. Generally, it can be said that INFO-RNA performs much faster than RNA-SSD did in (Andronescu et al., 2004). But we repeated all tests with a newer version of RNA-SSD on our PCs. For test set IIa, the results are comparable for INFO-RNA and RNA-SSD. However, INFO-RNA failed for only one structure, while RNA-SSD did for three. Detailed results can in be seen in Supplementary Data.

The second test set of computationally predicted structures consists of 308 structures of 220–1975 bases. They are the mfe structures of all annotated eukaryotic rRNA gene sequences from release 9.27 of the Ribosomal Database Project (RDP-II) (Cole et al., 2005). We refer to this as test set IIb. The whole set of eukaryotic sequences from RDP-II was chosen since the performance of INFO-RNA is to be tested on more than some exemplary sequences chosen in (Andronescu et al., 2004). For INFO-RNA and RNA-SSD, we performed 25 runs for each structure, and because of the longer runtime only 10 times for RNAinverse. Furthermore, runs of RNAinverse were terminated unsuccessfully if no solution was found after 3600 CPU seconds. To determine the success of the algorithms for classes of structures in a certain size range, we divided test set IIb into three subsets according to the size of the structures. The results are shown in Table 5. INFO-RNA performs best and fastest for all three subsets of IIb and all runs for each structure were successful.

Structures from the biological literature. Finally, we analyzed the performance of INFO-RNA and RNA-SSD on a test set containing structures published in the literature. This set is chosen identical to test set C of Andronescu et al. (2004). We refer to it as test set III. Pseudoknots were removed by disregarding pairs in pseudoknots. Results are given in Table 6. We did not examine the performance of RNAinverse since Andronescu et al. have already done this. To analyze success and speed of INFO-RNA and RNA-SSD, we examined 100 runs per structure for each algorithm. The succession rates and expected computing times demonstrate the excellent performance of INFO-RNA. In all but one case, it was faster than RNA-SSD. Furthermore, it succeeded for more structures than RNA-SSD and unsuccessful runs terminated with better approximate solutions.

3.3 Stability of the designed sequences
Another important item for the validation of INFO-RNA is the question of the stability of the designed sequences. We analyzed the stability of some arbitrarily chosen structures of test sets IIa+b. The selected biological sequences underlying test set IIa were chosen according to Andronescu et al. to assure comparability. For each, we compared the stability of its mfe fold to that of the designed sequences when folding into the predicted structure. For that purpose, we used the partion function option of RNAfold of the Vienna RNA Package (Hofacker et al., 1994). For each designed sequence S as well as for the biological sequences Sb, we computed the probability P(T|S) of the final sequence folding into the target structure T. The designed sequences were sorted according to their stability. The best, the median and the worst ones as well as the results for the biological sequences are given in Table 7. Sequences designed by INFO-RNA are much more stable than the biological ones and the sequences obtained by Andronescu et al. (2004).

In a second step, we analyzed arbitrarily chosen sequences underlying test set IIb (a small, a medium and a long one) and evaluated the stability of their mfe folds to that of the designed sequences when folding into the predicted structure. Results are also given in Table 7. Again, sequences designed by INFO-RNA have a much higher stability than the biological ones. Furthermore, Table 7 shows the excellent quality of all results of test set IIb. In all cases, the best and even the median designed sequences have a higher stability than the biological ones. For most structures, the best designed sequence was >1000 times more stable than the biological one.

All these results show that in INFO-RNA, it is not necessary to optimize the stability additionally. It suffices to minimize the structure distance of the mfe structures to the target one to design highly stable structures.


    4 DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 THE INFO-RNA APPROACH
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 
We have introduced a very fast and successful new approach of the inverse RNA folding problem, called INFO-RNA. In general, it outperforms existing tools. It consists of two major steps: a new initialization method and a following advanced stochastic local search that uses an effective neighbor selection method. The former is implemented by a dynamic programming approach, which finds a sequence that among all sequences adopts the target structure with the lowest possible energy. It is done in linear time depending on the structure size. We have shown that this initial sequence is an excellent starting point for the subsequent local search, which is short but powerful. Only few local search steps and less time are needed to generate a good sequence that folds into the target structure. This is due to an energy-based pre-ordering of the set of neighbors which can be calculated much faster than the actual optimization criterion of minimizing the structure distance. Since all computational approaches for (inverse) RNA folding are based on a thermodynamic model, the sequences designed by INFO-RNA are not guaranteed to fold into the target structure in a cell.

To test the performance of INFO-RNA, we analyzed several test sets of artifically generated as well as biological RNA structures and compared the results with RNA-SSD and RNAinverse. In general, INFO-RNA outperforms RNA-SSD and it performs substantially better than RNAinverse. However, it should be noted that RNAinverse was also designed to produce random samples from the sequence space. Obviously, INFO-RNA cannot be used to produce samples since the initializing sequence is rather fixed apart from a random variation in unbound bases located in loop regions. We performed some initial experiments to investigate the performance of INFO-RNA using a random initializing sequence. The improved stochastic local search alone is still able to produce comparable results, although INFO-RNA looses much of its speed. Thus, for the sampling task INFO-RNA should be used without the initialization method.

Since G–C base pairs are energetically most favorable, the initializiation sequences of INFO-RNA have a high GC content. This GC content is subsequently reduced by the local search but the final sequences are still enriched in G's and C's which might explain the high stability of the designed sequences. In future, it is desirable to introduce sequence constraints in INFO-RNA to reduce the GC content.


Figure 1
View larger version (29K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1 Exemplary sequence (and structure) and all its sequence neighbors that can adopt this structure. The parenthesis notation is used for the representation of the RNA secondary structure. An opening ‘(’and a closing parenthesis‘)’ stand for a base pair, a dot ‘.’ represents an unbound position.

 


Figure 2
View larger version (11K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2 (A) RNA secondary structure. (B) Substructure T(26,35) having a connected backbone. (C) Structural element Formula 5 without a connected backbone.

 


View this table:
[in this window]
[in a new window]

 
Table 1 Base pairs and their predecessors of the structure of Figure 2A

 


View this table:
[in this window]
[in a new window]

 
Table 2 Definition of variables

 


Figure 3
View larger version (7K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3 Special ML structures of Ic differing in the part given in gray.

 


View this table:
[in this window]
[in a new window]

 
Table 3 Results for test sets Ia and Ib

 


View this table:
[in this window]
[in a new window]

 
Table 4 Results for test set Ic

 


View this table:
[in this window]
[in a new window]

 
Table 5 Results for test set IIb.

 


View this table:
[in this window]
[in a new window]

 
Table 6 Results for test set III.

 


View this table:
[in this window]
[in a new window]

 
Table 7 The stability of test set IIa and IIb.

 

    Acknowledgments
 
The authors would like to thank Michael Hiller and Sebastian Will for reading the manuscript and for their helpful comments. Funding to pay the Open Access publication charges was provided by the Albert-Ludwigs-University Freiburg.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Martin Bishop

Received on March 14, 2006; revised on April 28, 2006; accepted on May 15, 2006

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 THE INFO-RNA APPROACH
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 

    Addess, K.J., et al. (1997) Structure and dynamics of the iron responsive element RNA: implications for binding of the RNA by iron regulatory binding proteins. J. Mol. Biol, . 274, 72–83[CrossRef][ISI][Medline].

    Andronescu, M., et al. (2004) A new algorithm for RNA secondary structure design. J. Mol. Biol, . 336, 607–624[CrossRef][ISI][Medline].

    Antal, M., et al. (2000) molecular characterization at the RNA and gene levels of U3 snoRNA from a unicellular green alga, Chlamydomonas reinhardtii. Nucleic Acids Res, . 28, 2959–2968[Abstract/Free Full Text].

    Cech, T.R. (2004) RNA finds a simpler way. Nature, 428, 263–264[CrossRef][Medline].

    Cole, J.R., et al. (2005) The Ribosomal Database Project (RDP-II): sequences and tools for high-throughput rRNA analysis. Nucleic Acids Res, . 33, D294–D296[Abstract/Free Full Text].

    Cole, J.R., et al. (2003) The Ribosomal Database Project (RDP-II): previewing a new autoaligner that allows regular updates and the new prokaryotic taxonomy. Nucleic Acids Res, . 31, 442–443[Abstract/Free Full Text].

    Dirks, R.M., et al. (2004) Paradigms for computational nucleic acid design. Nucleic Acids Res, . 32, 1392–403[Abstract/Free Full Text].

    Fedor, M.J. (2000) Structure and function of the hairpin ribozyme. J. Mol. Biol, . 297, 269–291[CrossRef][ISI][Medline].

    Haas, E.S., et al. (1996) Comparative analysis of ribonuclease P RNA structure in Archaea. Nucleic Acids Res, . 24, 1252–259[Abstract/Free Full Text].

    Hofacker, I.L., et al. (1994) Fast folding and comparison of RNA secondary structures. Monatshefte Chemie, 125, 167–188[CrossRef].

    Hofacker, I.L. (1994) The rules of the evolutionary game for RNA: a statistical characterization of the sequence to structure mapping in RNA.

    Hoos, H.H. (1998) Stochastic Local Search—Methods,Models,Applications. PhD thesis, Darmstadt University of Technology, Darmstadt, Germany,.

    Hüttenhofer, A. and Böck, A. (1998) RNA structures involved in selenoprotein synthesis. In Simons, R.W. and Grunberg-Manago, M. (Eds.). RNA Structure and Function, , Cold Spring Habor Cold Spring Habor Laboratory Press, pp. 603–639.

    Hüttenhofer, A., et al. (2002) RNomics: identification and function of small, non-messenger RNAs. Curr. Opin. Chem. Biol, . 6, 835–843[CrossRef][ISI][Medline].

    Knight, J. (2003) Gene regulation: switched on to RNA. Nature, 425, 232–233[CrossRef][Medline].

    Lafontaine, D.A., et al. (2001) Structure, folding and activity of the VS ribozyme: importance of the 2-3-6 helical junction. EMBO J, . 20, 1415–1424[CrossRef][ISI][Medline].

    Liu, Z., et al. (1998) The nature of the minimal ‘selenocysteine insertion sequence’ (SECIS) in Escherichia coli. Nucleic Acids Res, . 26, 896–902[Abstract/Free Full Text].

    Mackie, G.A. (1992) Secondary structure of the mRNA for ribosomal protein S20. Implications for cleavage by ribonuclease E. J. Biol. Chem, . 267, 1054–1061[Abstract/Free Full Text].

    Mobley, E.M. and Pan, T. (1999) Design and isolation of ribozyme-substrate pairs using RNase P-based ribozymes containing altered substrate binding sites. Nucleic Acids Res, . 27, 4298–4304[Abstract/Free Full Text].

    Pitulle, C., et al. (1998) Comparative structure analysis of vertebrate ribonuclease P RNA. Nucleic Acids Res, . 26, 3333–3339[Abstract/Free Full Text].

    Storz, G. (2002) An expanding universe of noncoding RNAs. Science, 296, 1260–1263[Abstract/Free Full Text].

    Sun, L., et al. (2002) A selected ribozyme catalyzing diverse dipeptide synthesis. Chem. Biol, . 9, 619–628[CrossRef][ISI][Medline].

    Swisher, J., et al. (2001) Visualizing the solventinaccessible core of a group II intron ribozyme. EMBO J, . 20, 2051–2061[CrossRef][ISI][Medline].

    Szymanski, M., et al. (2002) 5s Ribosomal RNA Database. Nucleic Acids Res, . 30, 176–178[Abstract/Free Full Text].

    Winkler, W.C., et al. (2004) Control of gene expression by a natural metabolite-responsive ribozyme. Nature, 428, 281–286[CrossRef][Medline].

    Zuker, M. and Stiegler, P. (1981) Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res, . 9, 133–148[Abstract/Free Full Text].

    Zuker, M. (1994) Prediction of RNA secondary structure by energy minimization. Methods Mol. Biol, . 25, 267–294[Medline].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Nucleic Acids ResHome page
A. Busch and R. Backofen
INFO-RNA--a server for fast inverse RNA folding satisfying sequence constraints
Nucleic Acids Res., July 13, 2007; 35(suppl_2): W310 - W313.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary data
Right arrowOA All Versions of this Article:
22/15/1823    most recent
btl194v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (1)
Google Scholar
Right arrow Articles by Busch, A.
Right arrow Articles by Backofen, R.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Busch, A.
Right arrow Articles by Backofen, R.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?