Bioinformatics Advance Access originally published online on December 4, 2007
Bioinformatics 2008 24(3):367-373; doi:10.1093/bioinformatics/btm591
Rfold: an exact algorithm for computing local base pairing probabilities
1Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2-42 Aomi, Koto-ku, Tokyo 135-0064 and 2Department of Computational Biology, Faculty of Frontier Science, The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba 277-8561, Japan
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Base pairing probability matrices have been frequently used for the analyses of structural RNA sequences. Recently, there has been a growing need for computing these probabilities for long DNA sequences by constraining the maximal span of base pairs to a limited value. However, none of the existing programs can exactly compute the base pairing probabilities associated with the energy model of secondary structures under such a constraint.
Results: We present an algorithm that exactly computes the base pairing probabilities associated with the energy model under the constraint on the maximal span W of base pairs. The complexity of our algorithm is given by
in time and
in memory, where N is the sequence length. We show that our algorithm has a higher sensitivity to the true base pairs as compared to that of RNAplfold. We also present an algorithm that predicts a mutually consistent set of local secondary structures by maximizing the expected accuracy function. The comparison of the local secondary structure predictions with those of RNALfold indicates that our algorithm is more accurate. Our algorithms are implemented in the software named Rfold.
Availability: The C++ source code of the Rfold software and the test dataset used in this study are available at http://www.ncrna.org/software/Rfold/
Contact: kiryu-h{at}aist.go.jp
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
|
|
|---|
Recently, it has been revealed that a substantial number of non-protein-coding RNA (ncRNA) transcripts in eukaryotic cells play various roles in regulating cellular functions (Carninci et al., 2005; Dunham et al., 2004; Kin et al., 2007; Okazaki et al., 2002). In order to investigate such transcripts, several studies have used the base pairing probability (BPP) matrices in their analyses such as secondary structure prediction (Do et al., 2006; Kiryu et al., 2007a; Knudsen and Hein 2003), structural alignment (Hofacker et al., 2004a; Kiryu et al., 2007b; Tabei et al., 2006), finding conserved motifs (Hamada et al., 2006), and clustering of ncRNA candidates (Will et al., 2007). For a pair of sequence positions (i, j), the BPP p(i, j) represents the probability that the positions i and j form a base pair and it is defined by the following formula:
|
| (1) |
denotes a secondary structure candidate of sequence x; E(
, x), the secondary structure free energy that is computed using the energy parameters collected by the Turner group (Mathews et al., 1999); R, the gas constant; T, the temperature; Z(x), the partition function;
, the set of all the possible secondary structures. The matrix that tabulates the BPPs for all the possible pairs {(i, j)|1
i < j
N} is called the BPP matrix, and it is usually calculated by using McCaskill's algorithm (McCaskill, 1990). Since the BPPs compactly represent the complex energy balance of secondary structures, they are very useful in searching for novel ncRNAs from genomic scale sequences. However, the computation of BPPs by using McCaskill's algorithm requires
involve only the structures that conform to the constraint.
Previously, two algorithms have been proposed for the computation of BPPs under such a constraint (Bernhart et al., 2006; Hofacker et al., 2004b). In Reference (Hofacker et al., 2004b), an exact method that computes BPPs with
time and
memory has been introduced. However, their algorithm strongly depends on a less realistic score model, based on the maximal circular matching problem, that cannot take into account base pair stacking. The other algorithm (Bernhart et al., 2006) is based on the complete energy model, and has the same complexity as that of the algorithm by Hofacker et al. However, it is an approximate algorithm that returns the averaged BPPs that are calculated for all the subsequences of fixed length with the use of a sliding window method.
In this article, we present an algorithm that exactly computes the BPPs of Equation (1) under the maximal span constraint. The complexity of our algorithm is the same as that of previous algorithms. We show that our algorithm has a higher sensitivity to true base pairs than the averaged probabilities obtained by RNAplfold. We also present an algorithm that predicts a mutually consistent set of local structures by maximizing the expected accuracy function. The comparison of the local secondary structure predictions with those of RNALfold implies that our algorithm is more accurate. Our algorithms are implemented in a software called Rfold.
| 2 SYSTEMS AND METHODS |
|---|
|
|
|---|
2.1 The model
We consider the problem of finding local secondary structures in a long DNA sequence. We assume that a greater part of the sequence is intergenic and only small portions possess stable structures. Because no general statistical bias is observed in the external loops of RNA genes, we do not distinguish the external loops from the intergenic regions. These non-structured parts of the sequence are referred to as the outer regions and the complementary regions that are enclosed by any base pair are referred to as the inner regions. The outer and inner regions are unambiguously determined only by the base pair annotations to the sequence. The maximal size of an inner region is given by the maximal span, which is denoted by W throughout this article.
Although our algorithm computes the BPPs of the full energy model, it is illustrated by using a simpler model in order to avoid complicating the notation. We will describe how the algorithm presented here can be applied to the energy model in a subsequent section. The model is a variant of stochastic context free grammars (SCFGs) and has three non-terminal states O, P and I. The transformation rules of the SCFG are defined as follows.
|
|
represents the null terminal symbol and a and a1,2 represent the terminal symbols that correspond to the nucleotide characters. The O state emits the outer bases and is referred to as the outer state. P and I states emit paired and unpaired bases in the inner regions and are collectively referred to as the inner states. The P state has only one transition that emits base pairs. We consider the P state separately from the I state in order to ensure that each of the parse subtrees that parse the inner regions has a P state as the root node. We note that although this simplified grammar is ambiguous in parsing the secondary structures of inner regions, the subsequent analyses are not affected by the detailed grammar of the inner states and equally apply to the unambiguous grammar of Rfold, which is described in the supplementary material.
Figure 1 shows an example of a secondary structure. In this figure, the three base pairs are represented by the semicircles and the disk regions correspond to the two inner regions in the sequence. The arrows schematically show the parse tree that parses this structure: the solid arrows represent parts of the parse tree that emit the outer bases and the broken arrows represent the bifurcation transition O
OP. The traceback matrix that is computed to parse this structure is shown in Figure 2. In this figure, each matrix element stores the pointer to the state that appears in the child node of the parse tree. The solid arrows represent the traceback directions. The broken arrows represent the base emissions. As observed in these figures, all the outer bases can be emitted by the outer states that reside in the first row of the traceback matrix. In general, all the outer bases of an arbitrary secondary structure can be unambiguously parsed as the right emissions from the outer states in the first row of the traceback matrix. This indicates that only
memory is necessary to store the dynamic programming (DP) scores of the outer state.
|
|
2.2 Inside and outside algorithms
The inside algorithm of the SCFG is given by
|
|
O(i) and
P,I(i, j) represent the inside variables for the states O, P and I, respectively. Here, we use the same coordinate system as that of CONTRAfold (Do et al., 2006), where the indexes i, j and k represent the intervening positions between the nucleotides (see Fig. 3). The score functions
,
and
represent the logarithms of the emission probabilities associated with the outer base, unpaired inner base and base pair emissions, respectively. which are replaced by the partial energies of the structure in the case of the energy model. We have not explicitly written the transition probability factors that are associated with all the terms on the right hand side in the above formulas for simplicity of notation.
|
The corresponding outside algorithm is given by
|
|
i, j
l, (k, l)
(i, j)}, which is shown in Figure 4 as the dark inverted triangle. The computations of the outside variables in that region in turn require the computations of the inside variables of the inner states in the trapezoid region that is enclosed by the bold lines in Figure 4. Hence, the outside algorithm also requires |
| (2) |
O(i) by sliding the
P, I (i, j) from left to right [Fig. 5(a)]. We then compute the outside variables βO(i) and βP, I (i, j) from right to left. Since the inside variables
P, I(i, j) are required for βP, I(i, j), these are recomputed before the outside variables. When the inside and outside variables of the P state are computed for cell (i, j), the BPP value at that position is computed by using Equation (2). The obtained p(i, j) is immediately written to the output in the Rfold implementation.
|
|
The asymptotic computation time is dominated by the calculations of the bifurcation transition I
II in the inside-outside algorithm. Therefore, the time complexity of the algorithm is given by O(NW2).
2.3 Maximal expected accuracy algorithm
Recent studies have shown that the secondary structure predictions based on the principle of maximization of expected accuracy (MEA) (Holmes and Durbin 1998; Miyazawa, 1995) are better than those made by the conventional maximal likelihood algorithms (Do et al. 2006; Knudsen and Hein, 2003). This method first computes the BPP matrix of the sequence and then predicts the structure that maximizes the expected accuracy of the base pair predictions. The functional form of the expected accuracy used in these studies is given by
|
| (3) |
|
|
|
| (4) |
|
| (5) |
|
|
The computation of the structure prediction proceeds as (a)
(b')
(c) in Figure 5. The first step [Fig. 5(a)] is the same as that of the BPP computation. In the second step [Fig. 5(b')], we fill the CYK variables and the traceback pointers as well as the inside and outside variables by using the CYK algorithm that is defined as follows:
|
|
',
' and
' are the scores associated with the outer loop, inner loop and base pair emissions, respectively, that are obtained from the BPPs p(i, j) and the loop probabilities pI(i), pO(i) and pL(i) according to the expected accuracy chosen. For example, if Equation () is used for the expected accuracy, these scores are given as |
|
In a previous study (Hofacker et al., 2004b), an algorithm that predicts locally stable secondary structures is proposed and implemented as the RNALfold program in the Vienna RNA package (Hofacker, 2003). This algorithm produces multiple overlapping structures that may conflict with each other. Although the overlapping RNA structures may be important in some analyses, it is often practical to remove the mutually inconsistent predictions in order to reduce the false positives and redundancies. Rfold fully resolves the conflicts of base pairs by maximizing the globally defined objective function and returns a mutually consistent set of local structures in a transparent manner.
2.4 Application to the energy model
The above-mentioned algorithms can be directly applied to the energy model. The only modification is to replace the inner states and transitions with more complicated ones that identify the stacking of base pairs and the various types of inner loops such as hairpins, interior/bulge loops and multi-loops. An inside-outside algorithm that is applicable to the energy model is presented in the supplementary material of Do et al. (2006). We use a similar algorithm for the inside-outside calculation; this model is also able to account for the energy model and enumerates the secondary structures without redundancy. We have described the detailed structure and unambiguity of the Rfold model in the supplementary material.
2.5 The dataset
We extracted 151 alignments of structural RNA families from the seed alignments in the Rfam 7.0 database (Griffiths-Jones et al., 2003). All these alignments had annotated secondary structures that had been published in the literature. We then selected a single representative sequence from each family that had the maximal number of canonical base pairs. The distribution of the sequence length and the span of the base pairs are shown in Tables 1 and 2, respectively. Table 2 shows that 85% of the base pairs have a span of
100 bases. From these sequences, we created four types of dataset (Datasets 1–4). Dataset 1 comprises these 151 RNA sequences. Dataset 2 is created from Dataset 1 by appending random sequences of length e = 100, 300, 500 and 1000 to both the ends of each sequence. Dataset 3 contains a single sequence of length 172 k bases, which is obtained by concatenating the sequences of Dataset 1 and the random sequences of length 1000 alternately. Dataset 4 comprises 10 random sequences of length 10 k bases, and it is used as the control set to estimate the false positive rate of the structure predictions. The random sequences of these datasets were generated by concatenating the 151 RNA sequences and shuffling the nucleotides of the sequence, while conserving the dinucleotide frequency.
|
|
2.6 Accuracy measures
To estimate the accuracy of the BPPs and the structure predictions, we draw receiver operator characteristic (ROC) curves that represent the balance between the sensitivity to the true base pairs and the rate of false positives in the non-structured sequences.
In the case of BPP comparison, the sensitivity is defined by the fraction of the true base pairs that have a BPP larger than the given threshold value
. The false positive rate is defined by the frequency of the base pairs (i, j) with p(i, j) >
in the non-structured sequence divided by the length of the sequence. We draw the ROC curve by examining several values of
. In the case of structure prediction, the sensitivity is defined by the fraction of base pairs that are correctly predicted by the programs. We define the false positive rate as the fraction of the inner regions in the non-structured sequence. This definition penalizes long inner regions that contain only a small number of predicted base pairs. Furthermore, only the base pairs that satisfy the maximal span constraint are counted as true base pairs in order to remove the effect of trivial loss of sensitivity to the distant base pairs |i – j| > W.
We used the RNAplfold program (Bernhart et al., 2006) for the comparison of BPPs; this program computes the averaged BPPs by using a sliding sequence window. We used the RNALfold program for the comparison of the local secondary structure prediction (Hofacker, 2003; Hofacker et al., 2004b). Since RNALfold returns multiple overlapping structures, we post-processed the RNALfold results as follows: first, we sorted the partial structures of the RNALfold output in ascending order of free energies. We then selected the local structures one by one from the list such that the selected structure did not overlap with the structures that had already been selected. We defined the structure prediction of RNALfold as the structure obtained by combining these partial structures. We note that the accuracy of RNALfold, which is discussed in a subsequent section, depends on this post-processing method.
| 3 IMPLEMENTATION |
|---|
|
|
|---|
Rfold was implemented using the C++ language. We used RNAplfold and RNALfold in version 1.6.2 of the Vienna RNA package (Hofacker, 2003) Experiments were performed on a cluster of Linux machines equipped with dual AMD Opteron 850 2.4 GHz processors and 6 GB RAM.
| 4 DISCUSSION |
|---|
|
|
|---|
4.1 Running time
Table 3 shows the comparison of the computation times of Rfold, RNAplfold, and RNALfold. We computed BPPs and the secondary structures for Dataset 3 with W = 100. In the computation of BPPs, the Rfold and RNAplfold have a similar speed. In the computation of the secondary structures, Rfold is 40 times slower than RNALfold.
|
4.2 Comparison of base pairing probabilities
Figure 6 shows the distributions of the BPP values at the true base pair positions. The x axis represents the probability values and the y axis represents the fraction of the true base pairs for specific ranges of BPPs. W is set to be 800, for which all the annotated base pairs satisfy the maximal span constraint. The left and right figures are the distributions for Rfold and RNAplfold, respectively. Both figures show that the frequency of the probability value peaks at around 1.0. The peak for Dataset 2 is smaller than that of Dataset 1 by
10% due to an increase in the volume of the structure space. With regard to RNAplfold, the peak significantly drops as longer random sequences are added to the RNA sequences, while the distributions of Rfold do not show such behavior. This reduction of the peak indicates that the averaging operation of RNAplfold smears the probabilities of most likely base pairs.
|
The left panel in Figure 7 shows the frequency distribution of the BPPs for the non-structured sequences (Dataset 4). The figure indicates that Rfold predicts a smaller number of false positives as compared to those of RNAplfold.
|
The right panel in Figure 7 shows the ROC curves for the computed BPPs. The false positive rates are calculated by using Dataset 4, and the sensitivities are calculated by using Dataset 1 (circle) and Dataset 2 with e = 1000 (triangle). The sensitivities of Rfold are always higher than those of RNAplfold for a given false positive rate.
4.3 Comparison of different expected accuracies
In Figure 8, we show the comparison of the performance of structure predictions for three different definitions of the expected accuracy (Eqs 3, 4 and ). We used Dataset 4 for the calculations of the false positive rates and Dataset 2 with e = 1000 for the sensitivities. We calculated the MEA structures by using EA0 (cross), EA1 with cI = 0, 0.4cO, 0.8cO (open circle, open triangle and open square, respectively), and EA2 with cI = 0, 0.4cO (filled circle, filled triangle, respectively). We set cP = 1 in all the cases and drew the curves by changing the values of cL and cO for EA0 and EA1,2, respectively.
|
In the figure, the sensitivities of EA0 and EA1 with cI < cO rapidly decrease with an increase in cO/cP because the structure predictions made by using these functions tend to produce large inner regions with a very small number of base pairs. For EA2 with cI = 0, the number of predicted base pairs did not show a significant decrease, even when cO/cP was increased up to 10 000, because of the very small values of the outer probabilities pO(i). In contrast, the number of base pairs predicted by using EA1 with cO < cI becomes zero at cO/cP
4, indicating a better balance between pL and pP as compared to the one between pO and pP. We also note that the predicted structures are relatively insensitive to the values of cI for EA1 as long as cI < cO. Since EA1 with cI = cO is equivalent to EA0, the accuracy is considerably degraded in this case. The accuracies for EA1 with cI > cO showed similar behavior (data not shown). Thus, we conclude that the function EA1 with cI < cO is the best expected accuracy function among the three functions.
4.4 Comparison of secondary structure predictions
Figure 9 shows the comparison of the accuracies of the predicted structures. We used Dataset 4 for the computation of the false positive rate and Dataset 3 for the sensitivity. For Rfold, we used the expected accuracy function in Equation (4) with cI = 0.4cO and drew the curves by varying the compositional weight cO for fixed cP.
|
Figure 9 indicates that Rfold has a higher sensitivity as compared to that of RNALfold for a given false positive rate. Alternatively, Rfold predicts less than half of the false positives while retaining the same sensitivity as that of RNALfold when cI and cO are appropriately selected. The figure also shows that the fraction of detected base pairs that conform to the maximal span constraint slightly increases with a decrease in W.
| 5 CONCLUSION |
|---|
|
|
|---|
We have presented algorithms that exactly compute the BPPs and the MEA secondary structures under the constraint on the maximal span W of base pairs. The algorithms require O(NW2) time and O(N + W2) memory. Our algorithms are based on the separate treatment of the DP variables for the outer state and inner states (Fig. 2) and the analysis of the dependency of the inside and outside variables Fig. 4) under the maximal span constraint. From these analyses, the two-pass algorithm [Fig. 5(a), (b)] for the BPP computation and the three-pass algorithm for the secondary structure prediction [Fig. 5(a), (b'), (c)] are derived. We have demonstrated that our algorithms have a higher sensitivity as compared to previous algorithms for a given false positive rate.
Our analyses can be generalized to other problems related to RNA secondary structures, such as the prediction of secondary structures with pseudoknots and the structural alignment problem. For example, it can be shown that the time complexity of the Sankoff algorithm for structural alignment (Sankoff, 1985) reduces from
to
under the maximal span constraint W. Some of these generalizations will be presented elsewhere.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
This study was partially supported by the Functional RNA Project funded by the New Energy and Industrial Technology Development Organization (NEDO) of Japan. The authors thank their colleagues who worked on the project for discussions and comments.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Dmitrij Frishman
Received on July 9, 2007; revised on November 25, 2007; accepted on November 26, 2007
| REFERENCES |
|---|
|
|
|---|
Bernhart SH, et al. Local RNA base pairing probabilities in large sequences. Bioinformatics (2006) 22:614–615.
Carninci P, et al. The transcriptional landscape of the mammalian genome. Science (2005) 309:1559–1563.
Do C, et al. CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics (2006) 22:e90–e98.
Dunham A, et al. The DNA sequence and analysis of human chromosome 13. Nature (2004) 428:522–528.[CrossRef][Medline]
Griffiths-Jones S, et al. Rfam: an RNA family database. Nucleic Acids Res (2003) 31:439–441.
Hamada M, et al. Mining frequent stem patterns from unaligned RNA sequences. Bioinformatics (2006) 22:2480–2487.
Hofacker I. Vienna RNA secondary structure server. Nucleic Acids Res (2003) 31:3429–3431.
Hofacker I, et al. Alignment of RNA base pairing probability matrices. Bioinformatics (2004a) 20:2222–2227.
Hofacker IL, et al. Prediction of locally stable RNA secondary structures for genome-wide surveys. Bioinformatics (2004b) 20:186–190.
Holmes I, Durbin R. Dynamic programming alignment accuracy. J. Comput. Biol (1998) 5:493–504.[Web of Science][Medline]
Kin T, et al. fRNAdb: a platform for mining/annotating functional RNA candidates from non-coding RNA sequences. Nucleic Acids Res (2007) 35:145–148.
Kiryu H, et al. Robust prediction of consensus secondary structures using averaged base pairing probability matrices. Bioinformatics (2007a) 23:434–441.
Kiryu H, et al. Murlet: a practical multiple alignment tool for structural RNA sequences. Bioinformatics (2007b) 23:1588–1598.
Knudsen B, Hein J. Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucleic Acids Res (2003) 31:3423–3428.
Mathews D, et al. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol (1999) 288:911–940.[CrossRef][Web of Science][Medline]
McCaskill J. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers (1990) 29:1105–1119.[CrossRef][Web of Science][Medline]
Miyazawa S. A reliable sequence alignment method based on probabilities of residue correspondences. Protein Eng (1995) 8:999–1009.
Okazaki Y, et al. Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature (2002) 420:563–573.[CrossRef][Medline]
Sankoff D. Simultaneous solution of the RNA folding, alignment and protosequence problems. SIAM J. Appl. Math (1985) 45:810–825.[CrossRef]
Tabei Y, et al. SCARNA: fast and accurate structural alignment of RNA sequences by matching fixed-length stem fragments. Bioinformatics (2006) 22:1723–1729.
Will S, et al. Inferring Noncoding RNA Families and Classes by Means of Genome-Scale Structure-Based Clustering. PLoS Comput. Biol (2007) 3:e65.[CrossRef][Medline]
This article has been cited by other articles:
![]() |
Y. Tabei and K. Asai A local multiple alignment method for detection of non-coding RNA sequences Bioinformatics, June 15, 2009; 25(12): 1498 - 1505. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. Hamada, H. Kiryu, K. Sato, T. Mituyama, and K. Asai Prediction of RNA secondary structure using generalized centroid estimators Bioinformatics, February 15, 2009; 25(4): 465 - 473. [Abstract] [Full Text] [PDF] |
||||
![]() |
K. Asai, H. Kiryu, M. Hamada, Y. Tabei, K. Sato, H. Matsui, Y. Sakakibara, G. Terai, and T. Mituyama Software.ncrna.org: web servers for analyses of RNA sequences Nucleic Acids Res., July 1, 2008; 36(suppl_2): W75 - W78. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||








O (j)] and the traceback pointers [
O (j)] of the outer state. The thick arrows represent the directions along which the O(W2) memories for the inner states move. The thin arrows that are depicted along the front boundary of the shaded triangles indicate the order of DP computation. For example, the outside step (b') fills inside (upward)







