Bioinformatics Advance Access originally published online on October 21, 2008
Bioinformatics 2008 24(24):2849-2856; doi:10.1093/bioinformatics/btn544
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
IntaRNA: efficient prediction of bacterial sRNA targets incorporating target site accessibility and seed regions
Bioinformatics Group, Albert-Ludwigs-University Freiburg, Georges-Koehler-Allee 106, Freiburg D-79110, Germany
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: During the last few years, several new small regulatory RNAs (sRNAs) have been discovered in bacteria. Most of them act as post-transcriptional regulators by base pairing to a target mRNA, causing translational repression or activation, or mRNA degradation. Numerous sRNAs have already been identified, but the number of experimentally verified targets is considerably lower. Consequently, computational target prediction is in great demand. Many existing target prediction programs neglect the accessibility of target sites and the existence of a seed, while other approaches are either specialized to certain types of RNAs or too slow for genome-wide searches.
Results: We introduce INTARNA, a new general and fast approach to the prediction of RNA–RNA interactions incorporating accessibility of target sites as well as the existence of a user-definable seed. We successfully applied INTARNA to the prediction of bacterial sRNA targets and determined the exact locations of the interactions with a higher accuracy than competing programs.
Availability: http://www.bioinf.uni-freiburg.de/Software/
Contact: IntaRNA{at}informatik.uni-freiburg.de
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
|
|
|---|
Starting with the discovery of microRNAs (miRNAs) and the advent of genome-wide transcriptomics, it has become obvious that RNA plays a large variety of important, often regulatory, roles in living organism that extends far beyond being a mere intermediate in protein biosynthesis (Storz, 2002). Several of these non-protein coding RNAs (ncRNAs) regulate gene expression post-transcriptionally through base pairing to a target mRNA, like eukaryotic miRNAs and small interfering RNAs (siRNAs) (Bartel, 2004; Hannon, 2002; Zamore and Haley, 2005) as well as bacterial small regulatory RNAs (sRNAs) (Gottesman, 2005).
Typically, the size of bacterial sRNAs ranges from 50 nt to 250 nt (Vogel and Wagner, 2007). The post-transcriptional interaction with their target mRNA causes translational repression or activation, mRNA degradation or changes in mRNA stability (Storz et al., 2004). They have been found to be crucial in the bacterial stress response and in bacterial virulence (Gottesman, 2005). Please note that we use the term sRNAs for small, regulatory RNAs in bacteria as done predominantly in the literature (e.g. Storz and Haas, 2007).
Since base pairing to a target mRNA is the main regulatory mechanism of sRNAs (Vogel and Wagner, 2007), the hybridization energy is a widely used criterion to predict RNA–RNA interactions (Rehmsmeier et al., 2004; Tjaden et al., 2006). For miRNAs and siRNAs, it was shown that the accessibility of the target sites has also an important influence (Ameres et al., 2007; Kertesz et al., 2007; Kretschmer-Kazemi Far and Sczakiel, 2003; Long et al., 2007; Luo and Chang, 2004; Shao et al., 2007). It can be assumed that the same holds for sRNAs. Furthermore, perfect Watson–Crick pairing of seven or eight consecutive bases (typically at positions 2–8) at the 5' end of animal miRNAs (the seed region) is often sufficient for effective regulation (Bentwich, 2005; Brennecke et al., 2005; Doench and Sharp, 2004). There is not much known about possible seed regions in bacterial sRNAs. A previous work about sRNA target prediction suggested that the interaction of sRNA and mRNA also starts with a stretch of bases that are unpaired in the sRNA and the mRNA and that form at least a minimal number of consecutive base pairs (Tjaden et al., 2006). A very recent study in Salmonella confirms this assumption by showing that the conserved 5'-end of the RybB sRNA recognizes many omp mRNA targets by short seed pairings (Mika et al., 2008).
In Escherichia coli (E.coli), there are >70 validated sRNAs, but only about 20 with a known cellular function since the experimental identification and verification of sRNA targets has been lagging behind (Vogel and Wagner, 2007). Thus, there is a high demand for computational target predictions for regulatory sRNAs. So far, there is no tool that integrates both accessibility of the target regions in the mRNA and in the sRNA and the existence of an arbitrary seed in a general approach.
This article presents INTARNA, a new approach to the prediction of interacting RNAs (INTARNA), where a combined energy score of the interaction is calculated as the sum of the free energy of hybridization and the free energy required for making the interaction sites accessible. In addition, the existence (but not the exact location) of a seed is enforced. The length of the seed can be set by the user. We present two variants: a complete approach, which is O(n2m2) in time and O(nm) in space, when restricting the size of internal loops and where n and m are the lengths of the interacting RNA sequences (n>m), and a heuristic simplification of the complete approach, which has a time complexity of
and a space complexity of O(nm), where
and L is the size of the sequence window in which both the target mRNA and the sRNA are folded. We successfully applied INTARNA to the prediction of sRNA targets. Additionally, INTARNA predicted the exact location of the RNA–RNA interactions.
Related work: to date, most existing target prediction approaches are based on one of the two following basic ideas. The first major group determines a common structure for ncRNA and targeted mRNA by concatenating the two RNA sequences and memorizing the linkage location. The new single sequence is folded by an usual RNA-folding algorithm, e.g. the algorithm of Zuker and Stiegler (1981), with slightly different parameters for the loop including the linkage location. The most prominent tools using this idea are PAIRFOLD (Andronescu et al., 2005) and RNACOFOLD (Bernhart et al., 2006b). They have a space complexity of O((n+m)2) and a time complexity of O((n+m)3) when restricting the size of interior loops. The main problem with these tools is that they can only predict interactions where the common structure is pseudoknot-free. In contrast, many interactions in living cells are located in loop regions and would represent a pseudoknot in the context of concatenated sequences.
The second major group of target prediction tools neglects intra-molecular binding in both RNA molecules. Algorithms based on this idea find the energetically most favorable hybridization of two RNA sequences. The most popular tools incorporating this idea are RNAHYBRID (Kruger and Rehmsmeier, 2006; Rehmsmeier et al., 2004), RNADUPLEX and RNAPLEX (Tafer and Hofacker, 2008), and DINAMELT (Dimitrov and Zuker, 2004; Markham and Zuker, 2005). RNAHYBRID is primarily tailored for predicting potential miRNA binding sites in large target RNAs. This method uses a modification of the classical secondary structure prediction algorithm of Zuker and Stiegler (1981) that neglects multiloops. Furthermore, the loop size is restricted to a fixed value to reduce complexity. In principle, RNADUPLEX and RNAPLEX incorporate the same ideas as RNAHYBRID, but RNAPLEX uses a simplified energy scoring of loops and a length penalty to favor short stable interactions. By doing so, RNAPLEX performs 10–27 times faster than RNAHYBRID (Tafer and Hofacker, 2008).
There is a variety of additional tools that are specially designed to search for miRNA target sites [for a review see Bentwich (2005) and Yoon and De Micheli (2006)]. In contrast, there has been little investigation so far regarding the computational prediction of mRNA targets of bacterial sRNAs. RNAPLEX is also suitable for longer queries like sRNAs because it integrates a nucleotide penalty. It was recently applied to the prediction of sRNA targets (Tafer and Hofacker, 2008). Tjaden et al. (2006) developed a tool named TARGETRNA that predicts the targets of bacterial sRNAs (neglecting intra-molecular base pairs) and outputs them in a ranked list.
While most of the aforementioned approaches use the free energy of the hybridized duplex to predict the potential target site, in general, the free energy of the entire duplex is a poor predictor for that aim (Rajewsky, 2006). Several authors have shown that the secondary structure of the target mRNA (Ameres et al., 2007; Kertesz et al., 2007; Kretschmer-Kazemi Far and Sczakiel, 2003; Long et al., 2007; Luo and Chang, 2004; Shao et al., 2007) and the ncRNA (Koberle et al., 2006) has a strong effect on target recognition. To the best of our knowledge, there are only two tools that incorporate the secondary structure of the mRNA (Kertesz et al., 2007; Mückstein et al., 2008). RNAUP (Mückstein et al., 2008) calculates the thermodynamics of RNA–RNA interactions as the sum of two energy contributions: the energy needed to open the binding sites and make them accessible, and the hybridization energy. It has a space complexity of O(n2+nw3) and a time complexity of O(n3+nw5), when restricting the interior loop sizes to a fixed value and limiting the size of interaction to w. In contrast to our approach, RNAUP does not use any seed condition. PITA, developed by Kertesz et al. (2007) for miRNA target prediction, starts with a genome-wide search for initial seed regions and tries to extend these sites in one direction. This is typical for a large proportion of miRNAs, but is unlikely to hold for other ncRNAs.
Herein, we present the algorithmic details of INTARNA, our general approach to the prediction of RNA–RNA interactions, including target site accessibility and user-definable seeds.
| 2 METHODS |
|---|
|
|
|---|
We are given two potentially interacting sequences S1 and S2 of lengths n and m, respectively. For every single RNA sequence Sk, a target site is a pair of positions [x,y] that define an interval with x being the first and y being the last included position.
Hybridization energy: the first component determining the quality of an RNA–RNA interaction between target site [i,k] in S1 and target site [j,l] in S2 is the hybridization energy Ehybrid(i,j,k,l). Its calculation is based on the energy model of RNAHYBRID. The energy parameters used are from Mathews et al. (1999). With H(i,j), we denote the hybridization energy of the best interaction of subsequences S
...S
and S
...S
, where the left-most positions of both subsequences i and j form a base pair. H(i,j) can be calculated using a restricted variant of the algorithm of Zuker and Stiegler (1981) discarding multiloop structures. This algorithm has a time complexity of O(nm) when restricting the internal loop length. Using the RNAHYBRID convention to number the first RNA 5'
3' and the second in the reverse direction, we get the following basic recursion for H(i,j):
|
| (1) |
Accessibility: the second component contributing to the quality of an RNA–RNA interaction is the accessibility of the target sites in each sequence, which is the energy required to make them single stranded. It is defined as the difference between the energy of the ensemble of all structures and the energy of the ensemble of structures, where the target site [i,k] is single stranded. It is denoted by ED(i,k) and calculated using a partition function approach (McCaskill, 1990). Let
be the set of all structures (called ensemble) that can be formed by a sequence S. Then
|
|
is the partition function of
, E(Q) is the free energy of sequence S folded into the secondary structure Q and Eens(
) denotes the ensemble energy of the set of structures
. Let 
be the set of all structures of S that have nucleotides Si, Si+1,...,Sk unpaired. Then, |
|
Next, we combined both energy contributions. The extended hybridization energy of a specific interaction of two target sites [i,k] and [j,l] is now defined by summing up the ED-values and the hybridization energy. For calculating the ED-values, we must know the first and the last interacting base in both sequences. Hence, the basic recursion for calculating the extended hybridization energy requires a 4D array C(i,j,k,l). Note that the basic assumption in C(i,j,k,l) is that both (i,j) and (k,l) form a base pair. Thus, we have
|
|
|
|
2.1 Reducing the space complexity
The space complexity can be improved by calculating all interactions for a common interaction start in one step. This leads to a 2D matrix Ck,l(i,j), which is basically the slice of C(i,j,k,l) for fixed k,l. The hybridization that starts at base pair (k,l), is elongated to the left and ends with base pair (i,j) (Fig. 1).
|
Considering the recursion for Ck,l(i,j), note that the ED-values are already included. Since ED-values are not additive, we have to subtract the old ED-values before adding the new ones. The idea of the recursion for Ck,l(i,j) is illustrated in Figure 2.
|
Formally, we get the following recursion:
|
| (2) |
|
| (3) |
k and j
l, i.e.
|
| (4) |
|
| (5) |
2.2 Incorporation of seed features
According to the findings on seed regions presented in Section 1, we introduce seed features that define their properties:
- P: the number of bases perfectly paired in the seed region
- bmax, b
and b
: the maximal number of bases not hybridized in the seed region of both RNAs, the mRNA and the sRNA, respectively.
|
| (6) |
P–1 and l–q+1
P–1 ensure that P–1 base pairs are possible in [p,k] and [q,l], respectively. Let lm=k–i+1 and ls=l–j+1 be the lengths of intervals [i,k] and [j,l]. Then, seed(i,j,k,l;P) is only valid if lm–P
b
, ls-P
b
and lm+ls–2P
bmax. These three conditions assure compliance with the seed features.
While seed(i,j,k,l;P) finds the minimal free energy for two fixed intervals [i,k] and [j,l], all valid intervals have to be analyzed to find the optimal seed region. This is done during the calculation of a second 2D array C
(i,j), which contains energy scores of interactions with a seed region. The interpretation of C
(i,j) is given as in Figure 3. This leads to the following recursion:
|
| (7) |
|
Note that this extension of the algorithm does not increase its complexity. The final values are stored in C(i,j) by replacing Ck,l(i,j) with C
(i,j) in Equation (5).
2.3 Reducing the space and time complexity
Although the INTARNA algorithm presented above has a space complexity of O(nm) like RNAHYBRID, its time complexity of O(n2m2) is still impractical for a genome-wide search. Hence, we want to achieve also the same time complexity as RNAHYBRID. In the following, the idea is described for the case without seeds to simplify the presentation. However, seeds are integrated in the same way as described above.
Before introducing the version of INTARNA with reduced space and time complexity, we summarize the full version of the algorithm to clarify the differences between them. For this purpose, Equations (2), (3) and (4) are combined into the following single equation:
|
| (8) |
To reduce both time and space complexity, we use a heuristic simplification that is inspired by the sparsification technique. Here, the basic idea is that the matrix C(i,j,k,l) is sparse in the sense that many entries will have the same values. This is due to the fact that many right hybridization ends will not be used in the next recursion steps. Our idea is to store the Ck,l(i,j) values only for one starting point (k,l). Thus, we use a matrix che(i,j) that stores for every (i,j) the right hybridization end (k,l), which yields the best extended hybridization energy until the left end (i,j) [see Equation (10)]. Given che(i,j), we can directly use a 2D matrix C'(i,j) and do not need Ck,l(i,j) and C(i,j) any longer. Denoting with che1(i,j) the first component k of the pair (k,l)=che(i,j), and with che2(i,j) the second component l, we get the following recursion:
|
| (9) |
|
| (10) |
Compared with Equation (8), Equation (9) computes the minimum over only two instead of four variables. Since these two variables p and q are restricted by the maximal loop size, INTARNA has a space complexity of O(nm) and a time complexity of
, where
and L is the size of the sequence window in which both mRNA and sRNA are folded. All ED-values are calculated in O(nL3) time by integrating RNAPLFOLD into INTARNA via the Vienna RNA library (Hofacker et al., 1994).
Suboptimal hybridizations: INTARNA can predict multiple potential target sites per sRNA. The computation of suboptimal hybridizations is implemented by multiple traceback. Since target sites at different locations within the mRNA are especially of interest, an interaction is accepted as suboptimal if it does not overlap with any other interaction predicted thus far. The desired number of suboptimal hybridizations are computed iteratively from the matrix Cseed'.
2.4 Functional analysis of sRNA target sites
In prokaryotes, the interaction between ribosome and mRNA is promoted by the Shine–Dalgarno (SD) sequence, a sequence motif typically 4–5 nt in length and located around 5–8 nt upstream of the start codon. The SD sequence is bound by a complementary motif of the 3'-tail of the 16S ribosomal RNA (rRNA). The translation is usually regulated by blocking access to this initiation site (Kozak, 2005).
The majority of bacterial sRNAs act as antisense regulators on trans-encoded mRNAs. Often, the base pairing occurs at the ribosome binding site (RBS), which leads to blockage of ribosome entry and thus to translation inhibition. In contrast, some sRNAs activate translation of their target mRNAs. Thereby, the sRNA binding results in melting of an inhibitory structure that sequesters the RBS (Vogel and Wagner, 2007).
Here, we study the consequences of sRNA binding to the target mRNA regarding the accessibility of the SD sequence, and thus the translational regulation. First, we predict the SD sequence location for every studied gene by simulating the hybridization between the mRNA and the single stranded 16S rRNA 3'-tail. The SD sequence is located by the position of the minimum free energy hybridization with a free energy below a significance threshold (Starmer et al., 2006). Then, we calculate the probability that the SD sequence is unpaired before (PU
) and after (PU
) the hybridization of the sRNA and the mRNA. The change in this probability,
PUSD, is defined as
|
|

is the set of all structures of the sequence S having Si...Sj and Sk...Slunpaired.
PUSD>0 suggests translational activation by the sRNA–mRNA interaction, whereas
PUSD<0 suggests translational repression. The higher the absolute value, the higher is the expected regulatory outcome. However, a special case arises if the mRNA target site overlaps with or is in close vicinity to the SD sequence. Then, the RBS is blocked and translation inactivation is expected. This measurement of single strandness has the advantage that it is based on base pair probabilities. Thus, it accounts for all possible secondary structures within the thermodynamic ensemble. The concept has been previously applied for searching binding motifs of RNA-binding proteins (Hiller et al., 2006). | 3 RESULTS |
|---|
|
|
|---|
In order to assess the performance of the INTARNA algorithm as presented in Section 2.3, we used the program to predict targets of bacterial regulatory sRNAs. The test set consisted of 10 biochemically mapped sRNA–mRNA interactions from E.coli and eight interactions from Salmonella typhimurium (denoted Salmonella in the following) that were previously published. For each sRNA, we predicted interactions for all genes of the respective genome. The genome sequences were downloaded from GenBank database of the National Center for Biotechnology Information (NCBI) (Benson et al., 2008). Since the majority of the known sRNAs bind their target gene in close proximity to the RBS, we defined a subsequence of 150-nt upstream and 50-nt downstream of the first base of the start codon as the (putative) target region. We obtained 4294 target regions from the E.coli genome (GenBank accession number NC_000913 [GenBank] ) and 4425 target regions from the Salmonella genome (GenBank accession number NC_003197 [GenBank] ).
The seed features and other parameters used for the target prediction were chosen according to the known interactions from our test set. They included a seed of at least eight paired nucleotides length and no restriction on the interaction length. All interactions except OxyS-fhlA have a continuous hybridization pattern. Amongst these examples, the Spot42-galK interaction is the longest one with a length of 75 nt.
We compared the results with several state-of-the-art methods for the prediction of RNA–RNA interactions, namely TARGETRNA, RNAHYBRID, RNAPLEX and RNAUP. Although RNAHYBRID is primarily designed for the prediction of miRNA target sites, it has been used occasionally for prediction related to sRNAs (see, e.g. Sharma et al., 2007; Urban and Vogel, 2008). Therefore, it has been included in our comparison for the sake of completeness using the default parameters. For TARGETRNA, we used the web application (Tjaden, 2008) with default parameters, except that the search was focused on our target regions. RNAPLEX was used with a penalty of 0.3 kcal/mol per nucleotide as suggested by Tafer and Hofacker (2008). We used RNAUP including the probability of unpaired regions in both RNAs (parameter -b) (Mückstein et al., 2008) and set the maximal length of interaction to 80, which is slightly longer than the maximal interaction length in our dataset.
3.1 Accuracy of predicted sRNA–mRNA interactions
In a first experiment, we assessed whether INTARNA is able to predict precisely the interaction between each sRNA and its mRNA target. Therefore, we computed the sensitivity and the positive predictive value (PPV) for each sRNA–target pair, where
and
. These measures have been used in the past to compare different RNA secondary structure prediction methods (see e.g. Do et al., 2006).
As shown in Table 1, INTARNA outperforms existing methods in the accuracy of the predicted interactions. TARGETRNA achieves the second best sensitivity and the third best PPV, but reports only 12 out of 18 interactions due to its cutoff. The program RNAHYBRID tends to maximize the length of hybridization, which leads to high sensitivity, but very low PPV. Thus, the program is more appropriate to predict interactions between short RNAs (like miRNAs) and long RNAs. To overcome this problem, RNAPLEX introduced a length penalty, which significantly increased its PPV compared with RNAHYBRID.
|
RNAUP achieves the third best sensitivity and the second best PPV. Among all programs compared, it has an overall accuracy closest to INTARNA.
INTARNA and RNAUP are the only programs whose optimal solution locates every sRNA binding site correctly, except for the interaction GcvB-gltI. For this interaction, both INTARNA and RNAUP predict optimal hybridizations, which do not share a base pair with the experimentally verified location. However, INTARNA, RNAHYBRID and RNAPLEX also give suboptimal solutions, see Supplementary Table 1. When these predictions are additionally taken into account, the average sensitivity/PPV of RNAHYBRID and RNAPLEX improve to 0.774/0.251 and 0.736/0.761, respectively. INTARNA achieves, on average, both a sensitivity and a PPV greater than 0.8 when the first suboptimal prediction for GcvB-gltI is included.
To study the influence of seed features on INTARNA's prediction quality, we repeated the experiments neglecting them (Supplementary Table 2). In this case, the averaged values of sensitivity and PPV are 0.699 and 0.728, respectively, which is below the accuracy of INTARNA with seed features and RNAUP. The difference to the latter, which uses a similar energy model, can be explained by the heuristic of INTARNA.
Altogether, the results demonstrate that RNAUP and INTARNA, which both incorporate the accessibility of binding sites, perform better in the prediction of sRNA–mRNA interactions than the other programs neglecting the accessibility. Furthermore, the quality of INTARNA's predictions is substantially improved when seed features are additionally taken into account.
3.2 Performance on prediction of sRNA targets
In a second experiment, we compared INTARNA and the existing methods with respect to the ability of finding sRNA targets. We applied every program to our test set and for each sRNA searched all target regions for potential target sites. The resulting list of target candidates for each sRNA was sorted by the computed energy score. All programs except TARGETRNA and INTARNA give an interaction for each putative target region. TARGETRNA reports at most 100 putative interaction sites per sRNA. INTARNA returns interactions that have both a seed with specified features and an energy score below 0.0 kcal/mol. For each method, we calculated the sensitivity and specificity. For our test set, there are 18 true interactions. Each of the nine sRNAs in E.coli may interact with any of the 4294 target regions, and each sRNA in Salmonella may interact with any of the 4425 target regions. Consequently, there are 47 496 potential interactions, of which 47 478 are considered non-interactions. A similar approach to evaluate the performance on prediction of sRNA targets has been used by Tjaden et al. (2006).
The ROC curves in Figure 4a illustrate the performance of different target prediction methods on our test set. We generated each ROC curve by calculating sensitivity and specificity while varying the number of computed interactions that were taken into account for each sRNA. The plot shows that INTARNA and RNAUP are the methods performing best on prediction of sRNA targets. Both RNAHYBRID and RNAPLEX achieve a low sensitivity suggesting that these programs are suitable only to a limited degree for genome-wide sRNA target searches. Taking in consideration that TARGETRNA limits the number of reported putative interaction sites, it achieves a fairly high sensitivity at a low false positive rate, although only an alignment-like algorithm based on base pairing potential is used. However, it can be assumed that the program will perform worse on interactions that show lower sequence complementarity, but underlie more complex duplex formation rules. The curves show that INTARNA and RNAUP have a similar performance on predicting sRNA targets and perform best among all studied programs. However, there is a clear difference in the practical applicability of both programs (Fig. 4b). On an Intel Xeon 5160 (3.0 GHz) with 7.8 GB available RAM, a GcvB target search in all Salmonella target regions allowing a maximal interaction length of 80 nt takes 21 h and requires 33 MB RAM with INTARNA. The same search with RNAUP needs 95 h and 840 MB RAM. An increase of the maximal interaction length to 140 nt raises INTARNA's runtime to 29 h with unchanged memory usage, whereas RNAUP now requires 207 h and 4.3 GB RAM. Without a restriction on the interaction length, INTARNA takes 36 h and requires again 33 MB RAM. Since RNAUP requires a restriction, we limited the interaction size to the length of the sRNA. This causes exhaustion of the complete memory and, as a consequence, a crash of RNAUP. The dramatic increase of RNAUP's resource requirements results from its higher asymptotic complexity and impairs its applicability on normal work stations with limited available memory.
|
Furthermore, it should be noted that the calculation of sensitivity and specificity is conservative, since we rely only on biochemically mapped interactions from literature. For instance, there are several unpublished experimentally verified targets among the top-ranked predictions of INTARNA (Jörg Vogel lab, personal communication) that have not been taken into consideration when evaluating the performance.
3.3 Prediction of the type of regulation
For all interactions of our test set, we also analyzed the regulatory outcome of the sRNA binding to its target mRNA. Many sRNAs regulate the translation of their target by changing the accessibility of the SD sequence where translation initiation occurs. Therefore, we studied the change
PUSD in the probability that the SD sequence is unpaired as a consequence of sRNA–target interaction. We predicted SD sequence locations within a region of 35-nt up- and downstream of the first base of the start codon for each gene following the approach of Starmer et al. (2006). Then, we determined whether the sRNA binding site predicted by INTARNA is at or close to the SD sequence. In this case, the SD sequence is inaccessible for ribosome binding. Otherwise, we calculated
PUSD for the gene. Supplementary Table 3 shows the results for all interactions of our test set.
In 11 out of 18 examples, our method successfully predicted the type of translational regulation by the sRNA. For three of the remaining seven interactions, the SD sequence could either not be located (GcvB-STM4351) or was located at an incorrect position (MicF-ompF and OxyS-fhlA). Another sRNA, IstR, blocks translation of its mRNA target tisAB by binding 100-nt upstream of the start codon without inducing structural changes at the RBS. Instead, IstR blocks a ribosome standby site that is essential for translation initiation (Unoson and Wagner, 2007). The remaining three interactions all involve the sRNA GcvB. Its targets argT, livJ and gltI are bound upstream of the ribosome binding site, and the inhibitory activity cannot be directly explained by competition with ribosome binding. At least for the last example, translational repression by a simple interference model or by masking a ribosome standby site is unlikely (Sharma et al., 2007). Consequently, the regulation cannot be predicted by our model.
| 4 CONCLUSIONS |
|---|
|
|
|---|
Although numerous regulatory ncRNAs have already been identified, the number of experimentally verified targets is much smaller. Consequently, computational target prediction is in great demand to restrain the list of putative targets.
In this article, we presented INTARNA, a new method for the prediction of interactions between two RNAs based on minimization of an extended hybridization energy. Our algorithm accounts for two important features that influence the strength of RNA–RNA interactions: the accessibility of the interaction sites and the existence of a user-specified seed. In contrast to previous methods for the prediction of RNA–RNA interactions, both features are integrated in a general approach for arbitrary RNAs. Although INTARNA was applied to predict targets for bacterial sRNAs in this work, the program can readily be used to find other RNA–RNA interactions as well.
The INTARNA target predictions for bacterial sRNA were compared to results of several state-of-the-art methods for the prediction of RNA–RNA interactions. INTARNA outperforms existing methods in the accuracy of the predicted interaction, and our method performs as well as the best existing program on finding putative sRNA targets, while the required CPU time and memory decrease drastically. Overall, the results show that our method is well suited both for general searches for putative target sites and the prediction of accurate RNA–RNA interactions. The comparison with RNAHYBRID, whose hybridization energy model is the basis of our more sophisticated extended hybridization energy, shows how the incorporation of the free energy required for making both interaction sites single stranded and the existence of a seed can improve the prediction quality. In addition, we were also able to successfully predict the regulatory effect on translation initiation for a number of sRNAs.
The interaction between the OxyS sRNA and the fhlA mRNA is the only one in our test set with a discontinuous hybridization pattern. In fact, the two RNAs form kissing hairpins at two sites (Argaman and Altuvia, 2000). We are aware of two approaches that can be used to predict interactions consisting of such independent substructures (Aksay et al., 2007; Alkan et al., 2006; Pervouchine, 2004). However, these algorithms are rather expensive with a time complexity of O(n3m3). Further extensions of INTARNA could incorporate some basic ideas of those approaches to allow for prediction of more complex interactions with multiple sites.
Many bacterial sRNA genes and their mRNA interaction sites are conserved in closely related bacteria (see e.g. Delihas, 2003; Udekwu et al., 2005). It can be assumed that the sRNA–target interaction mechanisms are conserved as well. Therefore, the performance on finding sRNA targets may be further improved by restricting the prediction to orthologous genes. A scoring scheme that accounts for interactions conserved between related species may be a promising extension of our combined energy score.
| ACKNOWLEDGMENTS |
|---|
We thank Sven Siebert for his initial work and fruitful discussions, Cynthia M. Sharma and Jörg Vogel for assistance with the biological data and Michael Beckstette for discussions on statistical analysis. We also thank the anonymous referees for their helpful comments.
Funding: German Federal Ministry of Education and Research (BMBF grant 0313921 FRISYS); German Research Foundation (DFG grant BA 2168/2-1 SPP 1258).
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Ivo Hofacker
Received on June 16, 2008; revised on October 14, 2008; accepted on October 17, 2008
| REFERENCES |
|---|
|
|
|---|
Aksay C, et al. taveRNA: a web suite for RNA algorithms and applications. Nucleic Acids Res. (2007) 35:W325–W329.
Alkan C, et al. RNA-RNA interaction prediction and antisense RNA target search. J. Comput. Biol. (2006) 13:267–282.[CrossRef][Web of Science][Medline]
Ameres SL, et al. Molecular basis for target RNA recognition and cleavage by human RISC. Cell (2007) 130:101–112.[Medline]
Andronescu M, et al. Secondary structure prediction of interacting RNA molecules. J. Mol. Biol. (2005) 345:987–1001.[CrossRef][Web of Science][Medline]
Argaman L, Altuvia S. fhla repression by OxyS RNA: kissing complex formation at two sites results in a stable antisense-target RNA complex. J. Mol. Biol. (2000) 300:1101–1112.[CrossRef][Web of Science][Medline]
Bartel DP. MicroRNAs: genomics, biogenesis, mechanism, and function. Cell (2004) 116:281–297.[CrossRef][Web of Science][Medline]
Benson DA, et al. GenBank. Nucleic Acids Res. (2008) 36:D25–D30.
Bentwich I. Prediction and validation of microRNAs and their targets. FEBS Lett. (2005) 579:5904–5910.[CrossRef][Web of Science][Medline]
Bernhart SH, et al. Local RNA base pairing probabilities in large sequences. Bioinformatics (2006a) 22:614–615.
Bernhart SH, et al. Partition function and base pairing probabilities of RNA heterodimers. Algorithms Mol. Biol. (2006b) 1:3.[CrossRef][Medline]
Bompfünewerer AF, et al. Variations on RNA folding and alignment: lessons from Benasque. J. Math. Biol. (2008) 56:129–144.[CrossRef][Web of Science][Medline]
Bossi L, Figueroa-Bossi N. A small RNA downregulates LamB maltoporin in Salmonella. Mol. Microbiol. (2007) 65:799–810.[CrossRef][Web of Science][Medline]
Brennecke J, et al. Principles of microRNA-target recognition. PLoS Biol. (2005) 3:e85.[CrossRef][Medline]
Chen S, et al. MicC, a second small-RNA regulator of Omp protein expression in Escherichia coli. J. Bacteriol. (2004) 186:6689–6697.
Delihas N. Annotation and evolutionary relationships of a small regulatory RNA gene micF and its target ompF in Yersinia species. BMC Microbiol. (2003) 3:13.[Medline]
Dimitrov RA, Zuker M. Prediction of hybridization and melting for double-stranded nucleic acids. Biophys. J. (2004) 87:215–226.[CrossRef][Web of Science][Medline]
Do CB, et al. CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics (2006) 22:e90–e98.
Doench JG, Sharp PA. Specificity of microRNA target selection in translational repression. Genes Dev. (2004) 18:504–511.
Geissmann TA, Touati D. Hfq, a new chaperoning role: binding to messenger RNA determines access for small RNA regulator. EMBO J. (2004) 23:396–405.[CrossRef][Web of Science][Medline]
Gottesman S. Micros for microbes: non-coding regulatory RNAs in bacteria. Trends Genet. (2005) 21:399–404.[CrossRef][Web of Science][Medline]
Hannon GJ. RNA interference. Nature (2002) 418:244–251.[CrossRef][Medline]
Hiller M, et al. Using RNA secondary structures to guide sequence motif finding towards single-stranded regions. Nucleic Acids Res. (2006) 34:e117.
Hofacker IL, et al. Fast folding and comparison of RNA secondary structures. Monatsh. Chem. (1994) 125:167–188.[CrossRef]
Kawamoto H, et al. Base-pairing requirement for RNA silencing by a bacterial small RNA and acceleration of duplex formation by Hfq. Mol. Microbiol. (2006) 61:1013–1022.[CrossRef][Web of Science][Medline]
Kertesz M, et al. The role of site accessibility in microRNA target recognition. Nat. Genet. (2007) 39:1278–1284.[CrossRef][Web of Science][Medline]
Koberle C, et al. Selecting effective siRNAs based on guide RNA structure. Nat. Protoc. (2006) 1:1832–1839.
Kozak M. Regulation of translation via mRNA structure in prokaryotes and eukaryotes. Gene (2005) 361:13–37.[CrossRef][Web of Science][Medline]
Kretschmer-Kazemi Far R, Sczakiel G. The activity of siRNA in mammalian cells is related to structural target accessibility: a comparison with antisense oligonucleotides. Nucleic Acids Res. (2003) 31:4417–4424.
Kruger J, Rehmsmeier M. RNAhybrid: microRNA target prediction easy, fast and flexible. Nucleic Acids Res. (2006) 34:W451–W454.
Long D, et al. Potent effect of target structure on microRNA function. Nat. Struct. Mol. Biol. (2007) 14:287–294.[CrossRef][Web of Science][Medline]
Luo KQ, Chang DC. The gene-silencing efficiency of siRNA is strongly dependent on the local structure of mRNA at the targeted region. Biochem. Biophys. Res. Commun. (2004) 318:303–310.[CrossRef][Web of Science][Medline]
Markham NR, Zuker M. DINAMelt web server for nucleic acid melting prediction. Nucleic Acids Res. (2005) 33:W577–W581.
Masse E, Gottesman S. A small RNA regulates the expression of genes involved in iron metabolism in Escherichia coli. Proc. Natl Acad. Sci. USA (2002) 99:4620–4625.
Mathews D, et al. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol. (1999) 288:911–940.[CrossRef][Web of Science][Medline]
McCaskill JS. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers (1990) 29:1105–1119.[CrossRef][Web of Science][Medline]
Mika F, et al. Seed pairing and non-RBS target sites facilitate globalompmRNA regulation by a bacterial small RNA. (2008) inpress.
Møller T, et al. Spot 42 RNA mediates discoordinate expression of the E. coli galactose operon. Genes Dev. (2002) 16:1696–1706.
Mückstein U, et al. Translational control by RNA-RNA interaction: Improved computation of RNA-RNA binding thermodynamics. In: Bioinformatics Research and Development, Vol. 13 of Communications in Computer and Information Science.—Elloumi M, et al, eds. (2008) Berlin, Heidelberg: Springer. 114–127.
Pervouchine DD. IRIS: intermolecular RNA interaction search. Genome Inform. (2004) 15:92–101.[Medline]
Rajewsky N. microRNA target predictions in animals. Nat. Genet. (2006) (Suppl. 38):S8–S13.
Rehmsmeier M, et al. Fast and effective prediction of microRNA/target duplexes. RNA (2004) 10:1507–1517.
Repoila F, et al. Small non-coding RNAs, co-ordinators of adaptation processes in Escherichia coli: the RpoS paradigm. Mol. Microbiol. (2003) 48:855–861.[CrossRef][Web of Science][Medline]
Schmidt M, et al. Secondary structures of Escherichia coli antisense micF RNA, the 5'-end of the target ompF mRNA, and the RNA/RNA duplex. Biochemistry (1995) 34:3621–3631.[CrossRef][Web of Science][Medline]
Shao Y, et al. Effect of target secondary structure on RNAi efficiency. RNA (2007) 13:1631–1640.
Sharma CM, et al. A small RNA regulates multiple ABC transporter mRNAs by targeting C/A-rich elements inside and upstream of ribosome-binding sites. Genes Dev. (2007) 21:2804–2817.
Starmer J, et al. Predicting Shine-Dalgarno sequence locations exposes genome annotation errors. PLoS Comput. Biol. (2006) 2:e57.[CrossRef][Medline]
Storz G. An expanding universe of noncoding RNAs. Science (2002) 296:1260–1263.
Storz G, Haas D. A guide to small RNAs in microorganisms. Curr. Opin. Microbiol. (2007) 10:93–95.[CrossRef][Web of Science]
Storz G, et al. Controlling mRNA stability and translation with small, noncoding RNAs. Curr. Opin. Microbiol. (2004) 7:140–144.[CrossRef][Web of Science][Medline]
Tafer H, Hofacker IL. RNAplex: a fast tool for RNA-RNA interaction search. Bioinformatics (2008) [Epub ahead of print; doi:10.1093/bioinformatics/btn193; April 23, 2008].
Tjaden B. TargetRNA: a tool for predicting targets of small RNA action in bacteria. Nucleic Acids Res. (2008) 36:W109–W113.
Tjaden B, et al. Target prediction for small, noncoding RNAs in bacteria. Nucleic Acids Res. (2006) 34:2791–2802.
Udekwu KI, et al. Hfq-dependent regulation of OmpA synthesis is mediated by an antisense RNA. Genes Dev. (2005) 19:2355–2366.
Unoson C, Wagner EG. Dealing with stable structures at ribosome binding sites: bacterial translation and ribosome standby. RNA Biol. (2007) 4:113–117.[Medline]
Urban JH, Vogel J. Two seemingly homologous noncoding RNAs act hierarchically to activate glmS mRNA translation. PLoS Biol. (2008) 6:e64.[CrossRef][Medline]
Vogel J, Wagner EGH. Target identification of small noncoding RNAs in bacteria. Curr. Opin. Microbiol. (2007) 10:262–270.[CrossRef][Web of Science][Medline]
Vogel J, et al. The small RNA IstR inhibits synthesis of an SOS-induced toxic peptide. Curr. Biol. (2004) 14:2271–2276.[CrossRef][Web of Science][Medline]
Yoon S, De Micheli G. Computational identification of microRNAs and their targets. Birth Defects Res. C Embryo Today (2006) 78:118–128.[CrossRef][Medline]
Zamore PD, Haley B. Ribo-gnome: the big world of small RNAs. Science (2005) 309:1519–1524.
Zuker M, Stiegler P. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res. (1981) 9:133–148.
This article has been cited by other articles:
![]() |
F. W. D. Huang, J. Qin, C. M. Reidys, and P. F. Stadler Partition function and base pairing probabilities for RNA-RNA interaction prediction Bioinformatics, October 15, 2009; 25(20): 2646 - 2654. [Abstract] [Full Text] [PDF] |
||||
![]() |
H. Chitsaz, R. Salari, S. C. Sahinalp, and R. Backofen A partition function algorithm for interacting nucleic acid strands Bioinformatics, June 15, 2009; 25(12): i365 - i373. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||



ED(i,p)+ED(p,k), we need to substract ED(p,k) and ED(q,l), and add ED(i,k) and ED(j,l) to get the final result of Ck,l(i,j).



(i,j) matrix and its relation to the other matrices. Note that both Ck,l(p,q) and seed(i,j,p,q;5) consistently assume that (p,q) is a pair. Here, a seed of 5 bp and one unpaired base is shown.



