Skip Navigation


Bioinformatics Advance Access originally published online on December 20, 2005
Bioinformatics 2006 22(5):614-615; doi:10.1093/bioinformatics/btk014
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/5/614    most recent
btk014v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (12)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Bernhart, S. H.
Right arrow Articles by Stadler, P. F.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Bernhart, S. H.
Right arrow Articles by Stadler, P. F.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Local RNA base pairing probabilities in large sequences

Stephan H. Bernhart 1, Ivo L. Hofacker 1,* and Peter F. Stadler 1,2,3

1Institut für Theoretische Chemie, Universität Wien Währingerstr. 17, A-1090 Wien, Austria
2Bioinformatics Group, Department of Computer Science and Interdisciplinary Center for Bioinformatics, University of Leipzig Härtelstr. 16-18, D-04107 Leipzig, Germany
3Santa Fe Institute 1399 Hyde Park Road, Santa Fe, NM 87501, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 ALGORITHM
 3 PERFORMANCE AND APPLICATIONS
 REFERENCES
 

Summary: The genome-wide search for non-coding RNAs requires efficient methods to compute and compare local secondary structures. Since the exact boundaries of such putative transcripts are typically unknown, arbitrary sequence windows have to be used in practice. Here we present a method for robustly computing the probabilities of local base pairs from long RNA sequences independent of the exact positions of the sequence window.

Availability: The program RNAplfold is part of the Vienna RNA Package and can be downloaded from http://www.tbi.univie.ac.at/RNA/

Contact: ivo{at}tbi.univie.ac.at


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 ALGORITHM
 3 PERFORMANCE AND APPLICATIONS
 REFERENCES
 
Computational approaches to detect and classify structured RNAs at genomic scales require efficient ways of computing local RNA secondary structures for both computational and biological reasons: (1) Long-range base pairs in large transcripts are disfavored kinetically relative to short-range pairs (Flamm et al., 2000). (2) Global approaches to RNA folding are limited to sequence length ≤20 000 on most hardware because of memory consumption. (3) In general, the exact boundaries of the transcript are unknown, so that global folds cannot add to the accuracy of the structure prediction relative to folding individual sequence windows.

A recent algorithm for microRNA detection is based upon the idea to consider the stability of secondary structure against changes in the immediate environment (Pfeffer et al., 2005; Sewer et al., 2005). More precisely, this approach considers the frequency with which a certain base pair (i, j) occurs in local minimum energy structures that are computed from sequence windows with a given size L. Here we combine this idea with a recently developed algorithm for local minimum energy structure predictions (Hofacker et al., 2004b). More precisely, we derive recursions for the average equilibrium probability of a base pair (i, j) over all fixed-size sequence windows.


    2 ALGORITHM
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 ALGORITHM
 3 PERFORMANCE AND APPLICATIONS
 REFERENCES
 
Denote by Zij the partition function over all secondary structures on the sequence interval [i, j], writeFormula for the partition function subject to the constraint that i and j pair and let pij be the probability that i and j are actually paired in thermodynamic equilibrium.

The standard backtracking procedure for the partition function folding algorithm (McCaskill, 1990) can be expressed as

Formula 1(1)
Here {Xi}ij,kl is the ratio of the two partition functions Formula 1 with the constraint that both i, j and k, l pair, and Formula 1. The first term describes the case in which the (i, j) pair is external, i.e. not enclosed by another pair, the second (sum) term considers all possible base pairs (k, l) that could enclose (i, j). In the simplest case, i.e. for energies dependent on individual base pairs only, we have

Formula 2(2)
where {zeta}kl is the Boltzmann factor of the pairing energy for the closing base pair (k, l). In the standard energy model, described e.g. by Mathews et al. (1999), Formula 2is a sum over contributions for the different loop types (interior loops, bulges and multi-branched loops) as detailed by McCaskill (1990); for given i, j, k, l it can be computed in constant time from the tabulated partition functions of subsequences.

Let us now turn to interactions localized within a sequence window. We denote by Formula 2 the partition function over all secondary structures on the sequence interval [i, j] when the sequence window [u, u + L] is folded. Similarly, Formula 2 denotes the partition function with the additional constraint that positions i and j are paired. Furthermore, we write Formula 2 for the probability that i and j form a base pair when the sequence window [u, u + L] is folded.

Since the partition functions on a subsequence are independent of the external structures as long as the subsequence is contained in the folded sequence window, we have

Formula 3(3)
Furthermore, we observe that Formula 3 unless [i, j] {subseteq} [u, u + L]. We can immediately restrict Equation (1) to a sequence window [u, u + L] since the recursions for Zij depend only on sub-sequences within the interval [i, j] (McCaskill, 1990). Thus

Formula 4(4)
Next, we define the average probability of an (i, j) pair over all folding windows containing the sequence interval [i, j] as

Formula 5(5)
For i + L > n and jL < 1 the sequence windows are shorter, hence Equation (5) has to be modified accordingly. Substitution of Equation (4) yields in the generic case


Formula 6

(6)
Again, modified expressions apply to the sequence intervals close to the 5' and 3' ends.


    3 PERFORMANCE AND APPLICATIONS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 ALGORITHM
 3 PERFORMANCE AND APPLICATIONS
 REFERENCES
 
Equation (6) implies that the n x L matrix Formula 5 can be computed in Formula 5(n x L3) time and Formula 5(n x L) memory for any fixed L. As is usual in most dynamic programming implementations of RNA folding, CPU requirements can be reduced further to Formula 5(n x L2) by applying the restriction that interior loops are bounded by a maximal size M and assuming that multi-loop energies are linear in loop size and loop degree. Forward and backward recursions can be interleaved in such a way that only Formula 5(L2) partition function values have to be stored at any given time, reducing the memory requirement to Formula 5(n + L2). Thus, our approach is equivalent to a sliding window approach with increment 1, but is faster by a factor of L.

The recursions (6) are implemented in the ANSI C program RNAplfold. Dot plots representing the values of Formula 5 are provided in Postscript format and can be used to visually inspect the results, Figure 1. The program is fast enough for genome-wide applications at least when run on a Linux cluster. Figure 2 shows that ~3 Mb of genomic sequence per hour can be scanned on a single CPU.


Figure 1
View larger version (5K):
[in this window]
[in a new window]
 
Fig. 1 Local structures (L = 100) in a 740 nt region of human X chromosome containing five miRNAs annotated in miRBase 7.1. (Griffiths-Jones, 2004).

 

Figure 2
View larger version (12K):
[in this window]
[in a new window]
 
Fig. 2 Performance of RNAplfold on a Pentium 4 with 3.2 GHz confirms the theoretical scaling of the CPU time as Formula 5 (n x L2).

 
RNAplfold has a number of obvious applications, which we are currently exploring. Along the lines of Pfeffer et al. (2005) it can be used to efficiently retrieve candidates for microRNA precursors or other structured RNAs from genomic sequence data. The parameter L should be a bit larger than the structures of interest. More generally, however, genome-wide tables of Formula 5 provide valuable a priori structure information that can be exploited by algorithms that search for RNA sequence/structure patterns such as e.g. a local variant of marna (Siebert and Backofen, 2005) or pmcomp (Hofacker et al., 2004a). As such RNAplfold provides a starting point for alternative approaches to purely comparative RNA annotation strategies.


    Acknowledgments
 
Financial support by the Austrian GEN-AU bioinformatics integration network sponsored by bm:bwk (200.067/3-VI/1/2002) and the German DFG Bioinformatics Initiative (BIZ-6/1-2) is gratefully acknowledged.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Alfonso Valencia

Received on October 30, 2005; revised on December 15, 2005; accepted on December 15, 2005

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 ALGORITHM
 3 PERFORMANCE AND APPLICATIONS
 REFERENCES
 

    Flamm, C., et al. (2000) RNA folding at elementary step resolution. RNA, 6, 325–338[Abstract].

    Griffiths-Jones, S. (2004) The microRNA Registry. Nucleic Acids Res, . 32, D109–D111[Abstract/Free Full Text].

    Hofacker, I.L., et al. (2004a) Alignment of RNA base pairing probability matrices. Bioinformatics, 20, 2222–2227[Abstract/Free Full Text].

    Hofacker, I.L., et al. (2004b) Prediction of locally stable RNA secondary structures for genome-wide surveys. Bioinformatics, 20, 186–190[Abstract/Free Full Text].

    Mathews, D., et al. (1999) Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol, . 288, 911–940[CrossRef][ISI][Medline].

    McCaskill, J.S. (1990) The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers, 29, 1105–1119[CrossRef][ISI][Medline].

    Pfeffer, S., et al. (2005) Identification of microRNAs of the herpesvirus family. Nat. Methods, . 2, 269–276[CrossRef][ISI][Medline].

    Sewer, A., et al. (2005) Identification of clustered microRNAs using ab initio prediction method. BMC Bioinformatics, 6, 267[CrossRef][Medline].

    Siebert, S. and Backofen, R. (2005) MARNA: multiple alignment and consensus structure prediction of RNAs based on sequencestructure comparisons. Bioinformatics, 21, 3352–3359[Abstract/Free Full Text].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Nucleic Acids ResHome page
S. Moretti, A. Wilm, D. G. Higgins, I. Xenarios, and C. Notredame
R-Coffee: a web server for accurately aligning noncoding RNA sequences
Nucleic Acids Res., July 1, 2008; 36(suppl_2): W10 - W13.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
A. Wilm, D. G. Higgins, and C. Notredame
R-Coffee: a method for multiple alignment of non-coding RNA
Nucleic Acids Res., May 1, 2008; 36(9): e52 - e52.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
H. Kiryu, T. Kin, and K. Asai
Rfold: an exact algorithm for computing local base pairing probabilities
Bioinformatics, February 1, 2008; 24(3): 367 - 373.
[Abstract] [Full Text] [PDF]


Home page
RNAHome page
G. Terai, T. Komori, K. Asai, and T. Kin
miRRim: A novel system to find conserved miRNAs with high sensitivity and specificity
RNA, December 1, 2007; 13(12): 2081 - 2090.
[Abstract] [Full Text] [PDF]


Home page
Brief BioinformHome page
I. M. Meyer
A practical guide to the art of RNA gene prediction
Brief Bioinform, November 1, 2007; 8(6): 396 - 414.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
M. Hamada, K. Tsuda, T. Kudo, T. Kin, and K. Asai
Mining frequent stem patterns from unaligned RNA sequences
Bioinformatics, October 15, 2006; 22(20): 2480 - 2487.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/5/614    most recent
btk014v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (12)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Bernhart, S. H.
Right arrow Articles by Stadler, P. F.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Bernhart, S. H.
Right arrow Articles by Stadler, P. F.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?