Skip Navigation


Bioinformatics Advance Access originally published online on December 4, 2007
Bioinformatics 2008 24(3):367-373; doi:10.1093/bioinformatics/btm591
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
24/3/367    most recent
btm591v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Kiryu, H.
Right arrow Articles by Asai, K.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Kiryu, H.
Right arrow Articles by Asai, K.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Rfold: an exact algorithm for computing local base pairing probabilities

Hisanori Kiryu 1,*, Taishin Kin 1 and Kiyoshi Asai 1,2

1Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2-42 Aomi, Koto-ku, Tokyo 135-0064 and 2Department of Computational Biology, Faculty of Frontier Science, The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba 277-8561, Japan

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 SYSTEMS AND METHODS
 3 IMPLEMENTATION
 4 DISCUSSION
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: Base pairing probability matrices have been frequently used for the analyses of structural RNA sequences. Recently, there has been a growing need for computing these probabilities for long DNA sequences by constraining the maximal span of base pairs to a limited value. However, none of the existing programs can exactly compute the base pairing probabilities associated with the energy model of secondary structures under such a constraint.

Results: We present an algorithm that exactly computes the base pairing probabilities associated with the energy model under the constraint on the maximal span W of base pairs. The complexity of our algorithm is given by Formula in time and Formula in memory, where N is the sequence length. We show that our algorithm has a higher sensitivity to the true base pairs as compared to that of RNAplfold. We also present an algorithm that predicts a mutually consistent set of local secondary structures by maximizing the expected accuracy function. The comparison of the local secondary structure predictions with those of RNALfold indicates that our algorithm is more accurate. Our algorithms are implemented in the software named ‘Rfold.’

Availability: The C++ source code of the Rfold software and the test dataset used in this study are available at http://www.ncrna.org/software/Rfold/

Contact: kiryu-h{at}aist.go.jp

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 SYSTEMS AND METHODS
 3 IMPLEMENTATION
 4 DISCUSSION
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Recently, it has been revealed that a substantial number of non-protein-coding RNA (ncRNA) transcripts in eukaryotic cells play various roles in regulating cellular functions (Carninci et al., 2005; Dunham et al., 2004; Kin et al., 2007; Okazaki et al., 2002). In order to investigate such transcripts, several studies have used the base pairing probability (BPP) matrices in their analyses such as secondary structure prediction (Do et al., 2006; Kiryu et al., 2007a; Knudsen and Hein 2003), structural alignment (Hofacker et al., 2004a; Kiryu et al., 2007b; Tabei et al., 2006), finding conserved motifs (Hamada et al., 2006), and clustering of ncRNA candidates (Will et al., 2007). For a pair of sequence positions (i, j), the BPP p(i, j) represents the probability that the positions i and j form a base pair and it is defined by the following formula:


Formula 1

(1)
where {sigma} denotes a secondary structure candidate of sequence x; E({sigma}, x), the secondary structure free energy that is computed using the energy parameters collected by the Turner group (Mathews et al., 1999); R, the gas constant; T, the temperature; Z(x), the partition function; Formula , the set of all the secondary structures that have a base pair between i and j; and {Omega}, the set of all the possible secondary structures. The matrix that tabulates the BPPs for all the possible pairs {(i, j)|1 ≤ i < j ≤ N} is called the BPP matrix, and it is usually calculated by using McCaskill's algorithm (McCaskill, 1990). Since the BPPs compactly represent the complex energy balance of secondary structures, they are very useful in searching for novel ncRNAs from genomic scale sequences. However, the computation of BPPs by using McCaskill's algorithm requires Formula time and Formula memory for a sequence of length N, making it inhibitive to apply the algorithm to long sequences. Since we do not usually consider the possibility that both ends of a chromosome form a base pair, it is natural to constrain the maximal span of base pairs to a fixed value W. The definition of the BPP under this constraint is the same as that given in Equation (1) except that Formula and {Omega} involve only the structures that conform to the constraint.

Previously, two algorithms have been proposed for the computation of BPPs under such a constraint (Bernhart et al., 2006; Hofacker et al., 2004b). In Reference (Hofacker et al., 2004b), an exact method that computes BPPs with Formula time and Formula memory has been introduced. However, their algorithm strongly depends on a less realistic score model, based on the maximal circular matching problem, that cannot take into account base pair stacking. The other algorithm (Bernhart et al., 2006) is based on the complete energy model, and has the same complexity as that of the algorithm by Hofacker et al. However, it is an approximate algorithm that returns the averaged BPPs that are calculated for all the subsequences of fixed length with the use of a sliding window method.

In this article, we present an algorithm that exactly computes the BPPs of Equation (1) under the maximal span constraint. The complexity of our algorithm is the same as that of previous algorithms. We show that our algorithm has a higher sensitivity to true base pairs than the averaged probabilities obtained by RNAplfold. We also present an algorithm that predicts a mutually consistent set of local structures by maximizing the expected accuracy function. The comparison of the local secondary structure predictions with those of RNALfold implies that our algorithm is more accurate. Our algorithms are implemented in a software called ‘Rfold.’


    2 SYSTEMS AND METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 SYSTEMS AND METHODS
 3 IMPLEMENTATION
 4 DISCUSSION
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
2.1 The model
We consider the problem of finding local secondary structures in a long DNA sequence. We assume that a greater part of the sequence is intergenic and only small portions possess stable structures. Because no general statistical bias is observed in the external loops of RNA genes, we do not distinguish the external loops from the intergenic regions. These non-structured parts of the sequence are referred to as the outer regions and the complementary regions that are enclosed by any base pair are referred to as the inner regions. The outer and inner regions are unambiguously determined only by the base pair annotations to the sequence. The maximal size of an inner region is given by the maximal span, which is denoted by W throughout this article.

Although our algorithm computes the BPPs of the full energy model, it is illustrated by using a simpler model in order to avoid complicating the notation. We will describe how the algorithm presented here can be applied to the energy model in a subsequent section. The model is a variant of stochastic context free grammars (SCFGs) and has three non-terminal states O, P and I. The transformation rules of the SCFG are defined as follows.


Formula

Here, {varepsilon} represents the null terminal symbol and a and a1,2 represent the terminal symbols that correspond to the nucleotide characters. The O state emits the outer bases and is referred to as the outer state. P and I states emit paired and unpaired bases in the inner regions and are collectively referred to as the inner states. The P state has only one transition that emits base pairs. We consider the P state separately from the I state in order to ensure that each of the parse subtrees that parse the inner regions has a P state as the root node. We note that although this simplified grammar is ambiguous in parsing the secondary structures of inner regions, the subsequent analyses are not affected by the detailed grammar of the inner states and equally apply to the unambiguous grammar of Rfold, which is described in the supplementary material.

Figure 1 shows an example of a secondary structure. In this figure, the three base pairs are represented by the semicircles and the disk regions correspond to the two inner regions in the sequence. The arrows schematically show the parse tree that parses this structure: the solid arrows represent parts of the parse tree that emit the outer bases and the broken arrows represent the bifurcation transition O {longrightarrow} OP. The traceback matrix that is computed to parse this structure is shown in Figure 2. In this figure, each matrix element stores the pointer to the state that appears in the child node of the parse tree. The solid arrows represent the traceback directions. The broken arrows represent the base emissions. As observed in these figures, all the outer bases can be emitted by the outer states that reside in the first row of the traceback matrix. In general, all the outer bases of an arbitrary secondary structure can be unambiguously parsed as the right emissions from the outer states in the first row of the traceback matrix. This indicates that only Formula memory is necessary to store the dynamic programming (DP) scores of the outer state.


Figure 1
View larger version (5K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. An example of a secondary structure. The base pairs are represented by three semicircles and two disk regions correspond to the inner regions in the sequence. Three shaded bars are the outer regions. The arrows schematically show the parse tree that parses this structure. The solid arrows represent parts of the parse tree that emit the outer bases. The broken arrows represent the bifurcation transition O {longrightarrow} OP of the SCFG model.

 

Figure 2
View larger version (69K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. The traceback matrix that parses the structure shown in Figure 1. The matrix element at position (i, j) stores the pointer to the state that appears in the child node of the parse tree. The solid arrows represent the traceback directions. The broken arrows represent the base emissions. The SCFG states (O, P and I) that emit the bases are also shown.

 
2.2 Inside and outside algorithms
The inside algorithm of the SCFG is given by


Formula

where {alpha}O(i) and {alpha}P,I(i, j) represent the inside variables for the states O, P and I, respectively. Here, we use the same coordinate system as that of CONTRAfold (Do et al., 2006), where the indexes i, j and k represent the intervening positions between the nucleotides (see Fig. 3). The score functions {eta}, {lambda} and {delta} represent the logarithms of the emission probabilities associated with the outer base, unpaired inner base and base pair emissions, respectively. which are replaced by the partial energies of the structure in the case of the energy model. We have not explicitly written the transition probability factors that are associated with all the terms on the right hand side in the above formulas for simplicity of notation.


Figure 3
View larger version (7K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. Coordinate system of our algorithms. The indexes i, j and k of DP variables represent the intervening positions between the nucleotides.

 
The corresponding outside algorithm is given by


Formula

where βO(i) and βP,I (i, j) represent the inside variables for the states O, P and I, respectively. Because of the maximal span constraint, the inside and outside variables of the P state at all the positions (i, j) with |ji| > W can be set to be zero. The DP variables for the inner state I can also be set to be zero at these positions because all the inner states have at least one P state as an ancestor in the parse tree. In the inside algorithm, the computation of the inside variables of the inner states at position (i, j) requires the computations of the inside values at all the pair positions between i and j. Hence the inside algorithm requires Formula memory for the inner states. In the outside algorithm, the computation of the outside variables of the inner states at (i, j) requires the computations of all the outside variables in the region {(k, l)| k ≤ i, j ≤ l, (k, l) != (i, j)}, which is shown in Figure 4 as the dark inverted triangle. The computations of the outside variables in that region in turn require the computations of the inside variables of the inner states in the trapezoid region that is enclosed by the bold lines in Figure 4. Hence, the outside algorithm also requires Formula memory for the inner states. In summary, the space complexity of the inside-outside algorithm is given by Formula , which comprises the Formula memory for the outer state and the Formula memory for the inner states. Once the inside-outside variables are computed, the probability p(i, j) for the present model is obtained by the following formula:


Formula 2

(2)
The procedure for the computation is shown in Figure 5. First, we fill the linear array of inside variables {alpha}O(i) by sliding the Formula memory for {alpha}P, I (i, j) from left to right [Fig. 5(a)]. We then compute the outside variables βO(i) and βP, I (i, j) from right to left. Since the inside variables {alpha}P, I(i, j) are required for βP, I(i, j), these are recomputed before the outside variables. When the inside and outside variables of the P state are computed for cell (i, j), the BPP value at that position is computed by using Equation (2). The obtained p(i, j) is immediately written to the output in the Rfold implementation.


Figure 4
View larger version (24K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4. The dependency of the outside variables for the inner states. The DP matrix, which is similar to Figure 2, is rotated by 45 degrees counter-clockwise and only the banded region {(i, j)| |ji| < W} is shown. The computation of the outside variable at (i, j) (shown as a blob) requires the computations of the outside variables in the region that is indicated as a dark inverted triangle. These outside variables require the computation of the inside variables in the trapezoid region that is enclosed by the bold lines.

 

Figure 5
View larger version (43K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5. The computation procedures of our algorithms. The computation of the BPPs proceeds as (a)->(b). The computation of the secondary structure proceeds as (a)->(b')->(c). The shaded bars at the bottom represent the 1 D arrays that store the DP variables [{alpha}O(j), βO (j) and {gamma}O (j)] and the traceback pointers [{tau}O (j)] of the outer state. The thick arrows represent the directions along which the O(W2) memories for the inner states move. The thin arrows that are depicted along the front boundary of the shaded triangles indicate the order of DP computation. For example, the outside step (b') fills inside (upward)-> outside (downward) -> CYK (upward) variables along the thin arrows for all the sequence positions i = N+1, ... , 0.

 
The asymptotic computation time is dominated by the calculations of the bifurcation transition I {longrightarrow} II in the inside-outside algorithm. Therefore, the time complexity of the algorithm is given by O(NW2).

2.3 Maximal expected accuracy algorithm
Recent studies have shown that the secondary structure predictions based on the principle of maximization of expected accuracy (MEA) (Holmes and Durbin 1998; Miyazawa, 1995) are better than those made by the conventional maximal likelihood algorithms (Do et al. 2006; Knudsen and Hein, 2003). This method first computes the BPP matrix of the sequence and then predicts the structure that maximizes the expected accuracy of the base pair predictions. The functional form of the expected accuracy used in these studies is given by


Formula 3

(3)
where cL and cP are the compositional weights of the loop and base pair regions, respectively. pL(i) is the loop probability and is defined by


Formula

Formula and Formula are the sets of unpaired and paired bases in the secondary structure S, respectively. Since the overall scale of the expected accuracy has no effect on the structure prediction, the ratio cL/cP is the only essential parameter. This ratio was efficiently used in Do et al. (2006) for striking a balance between the specificity and the sensitivity of the prediction. In their studies, the lengths of the input sequences are implicitly assumed to be close to those of the RNA genes and hence the inner and outer loops are equally treated in Equation (3). In contrast, the length scales between the outer loops and the inner loops are quite different in our case and it seems to be necessary to assign different compositional weights to these loops. Therefore, we consider the following two functions as alternative definitions for the expected accuracy.


Formula 4

(4)


Formula 5

(5)
Here, cO and cI are the compositional weights of the outer and inner loops, respectively. pO(i) and pI(i) are the outer and inner loop probabilities, respectively, and they are defined as


Formula

These probabilities satisfy the relation pO(i) + pI(i) = pL(i). Formula and Formula are the sets of unpaired bases in the outer and inner regions, respectively. Later, we shall show that the structures predicted by using EA1(S) have the highest accuracy among those predicted by using these functions.

The computation of the structure prediction proceeds as (a)->(b')->(c) in Figure 5. The first step [Fig. 5(a)] is the same as that of the BPP computation. In the second step [Fig. 5(b')], we fill the CYK variables and the traceback pointers as well as the inside and outside variables by using the CYK algorithm that is defined as follows:


Formula

where {eta}', {lambda}' and {delta}' are the scores associated with the outer loop, inner loop and base pair emissions, respectively, that are obtained from the BPPs p(i, j) and the loop probabilities pI(i), pO(i) and pL(i) according to the expected accuracy chosen. For example, if Equation () is used for the expected accuracy, these scores are given as


Formula

In the third step [Fig. 5(c)], we obtain the local secondary structures by reading the traceback pointers. Because we store only the traceback pointers for the outer state, only the boundaries of the inner regions can be identified. Hence, we recompute the traceback pointers for the inner states to reproduce the entire secondary structure by recomputing the inside, outside, and CYK variables around each of the inner regions. The time and memory complexities of the algorithm are still the same as those in the BPP computation.

In a previous study (Hofacker et al., 2004b), an algorithm that predicts locally stable secondary structures is proposed and implemented as the RNALfold program in the Vienna RNA package (Hofacker, 2003). This algorithm produces multiple overlapping structures that may conflict with each other. Although the overlapping RNA structures may be important in some analyses, it is often practical to remove the mutually inconsistent predictions in order to reduce the false positives and redundancies. Rfold fully resolves the conflicts of base pairs by maximizing the globally defined objective function and returns a mutually consistent set of local structures in a transparent manner.

2.4 Application to the energy model
The above-mentioned algorithms can be directly applied to the energy model. The only modification is to replace the inner states and transitions with more complicated ones that identify the stacking of base pairs and the various types of inner loops such as hairpins, interior/bulge loops and multi-loops. An inside-outside algorithm that is applicable to the energy model is presented in the supplementary material of Do et al. (2006). We use a similar algorithm for the inside-outside calculation; this model is also able to account for the energy model and enumerates the secondary structures without redundancy. We have described the detailed structure and unambiguity of the Rfold model in the supplementary material.

2.5 The dataset
We extracted 151 alignments of structural RNA families from the seed alignments in the Rfam 7.0 database (Griffiths-Jones et al., 2003). All these alignments had annotated secondary structures that had been published in the literature. We then selected a single representative sequence from each family that had the maximal number of canonical base pairs. The distribution of the sequence length and the span of the base pairs are shown in Tables 1 and 2, respectively. Table 2 shows that 85% of the base pairs have a span of <=100 bases. From these sequences, we created four types of dataset (Datasets 1–4). Dataset 1 comprises these 151 RNA sequences. Dataset 2 is created from Dataset 1 by appending random sequences of length e = 100, 300, 500 and 1000 to both the ends of each sequence. Dataset 3 contains a single sequence of length 172 k bases, which is obtained by concatenating the sequences of Dataset 1 and the random sequences of length 1000 alternately. Dataset 4 comprises 10 random sequences of length 10 k bases, and it is used as the control set to estimate the false positive rate of the structure predictions. The random sequences of these datasets were generated by concatenating the 151 RNA sequences and shuffling the nucleotides of the sequence, while conserving the dinucleotide frequency.


View this table:
[in this window]
[in a new window]

 
Table 1. Length distribution of our dataset

 

View this table:
[in this window]
[in a new window]

 
Table 2. The distribution of the distance between the base pairing positions in our dataset

 
2.6 Accuracy measures
To estimate the accuracy of the BPPs and the structure predictions, we draw receiver operator characteristic (ROC) curves that represent the balance between the sensitivity to the true base pairs and the rate of false positives in the non-structured sequences.

In the case of BPP comparison, the sensitivity is defined by the fraction of the true base pairs that have a BPP larger than the given threshold value {varepsilon}. The false positive rate is defined by the frequency of the base pairs (i, j) with p(i, j) > {varepsilon} in the non-structured sequence divided by the length of the sequence. We draw the ROC curve by examining several values of {varepsilon}. In the case of structure prediction, the sensitivity is defined by the fraction of base pairs that are correctly predicted by the programs. We define the false positive rate as the fraction of the inner regions in the non-structured sequence. This definition penalizes long inner regions that contain only a small number of predicted base pairs. Furthermore, only the base pairs that satisfy the maximal span constraint are counted as true base pairs in order to remove the effect of trivial loss of sensitivity to the distant base pairs |ij| > W.

We used the RNAplfold program (Bernhart et al., 2006) for the comparison of BPPs; this program computes the averaged BPPs by using a sliding sequence window. We used the RNALfold program for the comparison of the local secondary structure prediction (Hofacker, 2003; Hofacker et al., 2004b). Since RNALfold returns multiple overlapping structures, we post-processed the RNALfold results as follows: first, we sorted the partial structures of the RNALfold output in ascending order of free energies. We then selected the local structures one by one from the list such that the selected structure did not overlap with the structures that had already been selected. We defined the structure prediction of RNALfold as the structure obtained by combining these partial structures. We note that the accuracy of RNALfold, which is discussed in a subsequent section, depends on this post-processing method.


    3 IMPLEMENTATION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 SYSTEMS AND METHODS
 3 IMPLEMENTATION
 4 DISCUSSION
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Rfold was implemented using the C++ language. We used RNAplfold and RNALfold in version 1.6.2 of the Vienna RNA package (Hofacker, 2003) Experiments were performed on a cluster of Linux machines equipped with dual AMD Opteron 850 2.4 GHz processors and 6 GB RAM.


    4 DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 SYSTEMS AND METHODS
 3 IMPLEMENTATION
 4 DISCUSSION
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
4.1 Running time
Table 3 shows the comparison of the computation times of Rfold, RNAplfold, and RNALfold. We computed BPPs and the secondary structures for Dataset 3 with W = 100. In the computation of BPPs, the Rfold and RNAplfold have a similar speed. In the computation of the secondary structures, Rfold is 40 times slower than RNALfold.


View this table:
[in this window]
[in a new window]

 
Table 3. Comparison of running times of Rfold, RNAplfold, and RNALfold. We used Dataset 3, which consists of a sequence with a length of 172k bases. W is set to be 100

 
4.2 Comparison of base pairing probabilities
Figure 6 shows the distributions of the BPP values at the true base pair positions. The x axis represents the probability values and the y axis represents the fraction of the true base pairs for specific ranges of BPPs. W is set to be 800, for which all the annotated base pairs satisfy the maximal span constraint. The left and right figures are the distributions for Rfold and RNAplfold, respectively. Both figures show that the frequency of the probability value peaks at around 1.0. The peak for Dataset 2 is smaller than that of Dataset 1 by >= 10% due to an increase in the volume of the structure space. With regard to RNAplfold, the peak significantly drops as longer random sequences are added to the RNA sequences, while the distributions of Rfold do not show such behavior. This reduction of the peak indicates that the averaging operation of RNAplfold smears the probabilities of most likely base pairs.


Figure 6
View larger version (18K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 6. Distributions of the BPP values at the true base pair positions. The x axis represents the probability value and the y axis represents the frequency of the fraction of the true base pairs for specific ranges of probabilities. The datasets used are Dataset 1 (cross) and Dataset 2 with e = 100, 300, 500 and 1000 (circle, triangle, square and diamond, respectively). W is set to be 800. The left and right panels show the distribution for Rfold and RNAplfold, respectively.

 
The left panel in Figure 7 shows the frequency distribution of the BPPs for the non-structured sequences (Dataset 4). The figure indicates that Rfold predicts a smaller number of false positives as compared to those of RNAplfold.


Figure 7
View larger version (15K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 7. The left panel shows the distribution of BPPs for the non-structured sequences. The y axis indicates the number of base pairs per unit length. The open and filled circles represent the distributions for Rfold and RNAplfold, respectively. The right panel shows the ROC curves of BPPs. False positive rates are calculated for Dataset 4 and the sensitivities are calculated for Dataset 1 (circles) and Dataset 2 with e = 1000 (triangles). The open and filled symbols represent the ROC curves of Rfold and RNAplfold, respectively. In both the figures, W is set to be 800.

 
The right panel in Figure 7 shows the ROC curves for the computed BPPs. The false positive rates are calculated by using Dataset 4, and the sensitivities are calculated by using Dataset 1 (circle) and Dataset 2 with e = 1000 (triangle). The sensitivities of Rfold are always higher than those of RNAplfold for a given false positive rate.

4.3 Comparison of different expected accuracies
In Figure 8, we show the comparison of the performance of structure predictions for three different definitions of the expected accuracy (Eqs 3, 4 and ). We used Dataset 4 for the calculations of the false positive rates and Dataset 2 with e = 1000 for the sensitivities. We calculated the MEA structures by using EA0 (cross), EA1 with cI = 0, 0.4cO, 0.8cO (open circle, open triangle and open square, respectively), and EA2 with cI = 0, 0.4cO (filled circle, filled triangle, respectively). We set cP = 1 in all the cases and drew the curves by changing the values of cL and cO for EA0 and EA1,2, respectively.


Figure 8
View larger version (14K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 8. Comparison of the accuracies of the structures computed by using the different expected accuracy functions (Equations 3, 4, and ). We used Dataset 4 for the calculations of the false positive rates and Dataset 2 with e = 1000 for the sensitivities. We set W = 800. The MEA structures were calculated by using EA0 (cross), EA1 with cI = 0, 0.4cO, 0.8cO (open circle, open triangle, open square, respectively), and EA2 with cI = 0, 0.4cO (filled circle, filled triangle, respectively).

 
In the figure, the sensitivities of EA0 and EA1 with cI < cO rapidly decrease with an increase in cO/cP because the structure predictions made by using these functions tend to produce large inner regions with a very small number of base pairs. For EA2 with cI = 0, the number of predicted base pairs did not show a significant decrease, even when cO/cP was increased up to 10 000, because of the very small values of the outer probabilities pO(i). In contrast, the number of base pairs predicted by using EA1 with cO < cI becomes zero at cO/cP ~ 4, indicating a better balance between pL and pP as compared to the one between pO and pP. We also note that the predicted structures are relatively insensitive to the values of cI for EA1 as long as cI < cO. Since EA1 with cI = cO is equivalent to EA0, the accuracy is considerably degraded in this case. The accuracies for EA1 with cI > cO showed similar behavior (data not shown). Thus, we conclude that the function EA1 with cI < cO is the best expected accuracy function among the three functions.

4.4 Comparison of secondary structure predictions
Figure 9 shows the comparison of the accuracies of the predicted structures. We used Dataset 4 for the computation of the false positive rate and Dataset 3 for the sensitivity. For Rfold, we used the expected accuracy function in Equation (4) with cI = 0.4cO and drew the curves by varying the compositional weight cO for fixed cP.


Figure 9
View larger version (14K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 9. Comparison of the accuracies of the predicted structures. We used Dataset 4 for the computation of the false positive rate and Dataset 3 for the sensitivity. We examined three values of W — 50 (circle), 100 (triangle) and 200 (square). For Rfold, we used the expected accuracy function Equation (4) with cI = 0.4cO. Since RNALfold has no parameter that strikes the balance between the sensitivity and the false positive rate, only one point is plotted for each W.

 
Figure 9 indicates that Rfold has a higher sensitivity as compared to that of RNALfold for a given false positive rate. Alternatively, Rfold predicts less than half of the false positives while retaining the same sensitivity as that of RNALfold when cI and cO are appropriately selected. The figure also shows that the fraction of detected base pairs that conform to the maximal span constraint slightly increases with a decrease in W.


    5 CONCLUSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 SYSTEMS AND METHODS
 3 IMPLEMENTATION
 4 DISCUSSION
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
We have presented algorithms that exactly compute the BPPs and the MEA secondary structures under the constraint on the maximal span W of base pairs. The algorithms require O(NW2) time and O(N + W2) memory. Our algorithms are based on the separate treatment of the DP variables for the outer state and inner states (Fig. 2) and the analysis of the dependency of the inside and outside variables Fig. 4) under the maximal span constraint. From these analyses, the two-pass algorithm [Fig. 5(a), (b)] for the BPP computation and the three-pass algorithm for the secondary structure prediction [Fig. 5(a), (b'), (c)] are derived. We have demonstrated that our algorithms have a higher sensitivity as compared to previous algorithms for a given false positive rate.

Our analyses can be generalized to other problems related to RNA secondary structures, such as the prediction of secondary structures with pseudoknots and the structural alignment problem. For example, it can be shown that the time complexity of the Sankoff algorithm for structural alignment (Sankoff, 1985) reduces from Formula to Formula under the maximal span constraint W. Some of these generalizations will be presented elsewhere.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 SYSTEMS AND METHODS
 3 IMPLEMENTATION
 4 DISCUSSION
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
This study was partially supported by the ‘Functional RNA Project’ funded by the New Energy and Industrial Technology Development Organization (NEDO) of Japan. The authors thank their colleagues who worked on the project for discussions and comments.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Dmitrij Frishman

Received on July 9, 2007; revised on November 25, 2007; accepted on November 26, 2007

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 SYSTEMS AND METHODS
 3 IMPLEMENTATION
 4 DISCUSSION
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Bernhart SH, et al. Local RNA base pairing probabilities in large sequences. Bioinformatics (2006) 22:614–615.[Abstract/Free Full Text]

    Carninci P, et al. The transcriptional landscape of the mammalian genome. Science (2005) 309:1559–1563.[Abstract/Free Full Text]

    Do C, et al. CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics (2006) 22:e90–e98.[Abstract/Free Full Text]

    Dunham A, et al. The DNA sequence and analysis of human chromosome 13. Nature (2004) 428:522–528.[CrossRef][Medline]

    Griffiths-Jones S, et al. Rfam: an RNA family database. Nucleic Acids Res (2003) 31:439–441.[Abstract/Free Full Text]

    Hamada M, et al. Mining frequent stem patterns from unaligned RNA sequences. Bioinformatics (2006) 22:2480–2487.[Abstract/Free Full Text]

    Hofacker I. Vienna RNA secondary structure server. Nucleic Acids Res (2003) 31:3429–3431.[Abstract/Free Full Text]

    Hofacker I, et al. Alignment of RNA base pairing probability matrices. Bioinformatics (2004a) 20:2222–2227.[Abstract/Free Full Text]

    Hofacker IL, et al. Prediction of locally stable RNA secondary structures for genome-wide surveys. Bioinformatics (2004b) 20:186–190.[Abstract/Free Full Text]

    Holmes I, Durbin R. Dynamic programming alignment accuracy. J. Comput. Biol (1998) 5:493–504.[Web of Science][Medline]

    Kin T, et al. fRNAdb: a platform for mining/annotating functional RNA candidates from non-coding RNA sequences. Nucleic Acids Res (2007) 35:145–148.

    Kiryu H, et al. Robust prediction of consensus secondary structures using averaged base pairing probability matrices. Bioinformatics (2007a) 23:434–441.[Abstract/Free Full Text]

    Kiryu H, et al. Murlet: a practical multiple alignment tool for structural RNA sequences. Bioinformatics (2007b) 23:1588–1598.[Abstract/Free Full Text]

    Knudsen B, Hein J. Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucleic Acids Res (2003) 31:3423–3428.[Abstract/Free Full Text]

    Mathews D, et al. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol (1999) 288:911–940.[CrossRef][Web of Science][Medline]

    McCaskill J. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers (1990) 29:1105–1119.[CrossRef][Web of Science][Medline]

    Miyazawa S. A reliable sequence alignment method based on probabilities of residue correspondences. Protein Eng (1995) 8:999–1009.[Abstract/Free Full Text]

    Okazaki Y, et al. Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature (2002) 420:563–573.[CrossRef][Medline]

    Sankoff D. Simultaneous solution of the RNA folding, alignment and protosequence problems. SIAM J. Appl. Math (1985) 45:810–825.[CrossRef]

    Tabei Y, et al. SCARNA: fast and accurate structural alignment of RNA sequences by matching fixed-length stem fragments. Bioinformatics (2006) 22:1723–1729.[Abstract/Free Full Text]

    Will S, et al. Inferring Noncoding RNA Families and Classes by Means of Genome-Scale Structure-Based Clustering. PLoS Comput. Biol (2007) 3:e65.[CrossRef][Medline]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
Y. Tabei and K. Asai
A local multiple alignment method for detection of non-coding RNA sequences
Bioinformatics, June 15, 2009; 25(12): 1498 - 1505.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
M. Hamada, H. Kiryu, K. Sato, T. Mituyama, and K. Asai
Prediction of RNA secondary structure using generalized centroid estimators
Bioinformatics, February 15, 2009; 25(4): 465 - 473.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
K. Asai, H. Kiryu, M. Hamada, Y. Tabei, K. Sato, H. Matsui, Y. Sakakibara, G. Terai, and T. Mituyama
Software.ncrna.org: web servers for analyses of RNA sequences
Nucleic Acids Res., July 1, 2008; 36(suppl_2): W75 - W78.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
24/3/367    most recent
btm591v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Kiryu, H.
Right arrow Articles by Asai, K.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Kiryu, H.
Right arrow Articles by Asai, K.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?