Skip Navigation


Bioinformatics Advance Access originally published online on July 14, 2005
Bioinformatics 2005 21(17):3516-3523; doi:10.1093/bioinformatics/bti577
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/17/3516    most recent
bti577v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (20)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Reeder, J.
Right arrow Articles by Giegerich, R.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Reeder, J.
Right arrow Articles by Giegerich, R.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

Consensus shapes: an alternative to the Sankoff algorithm for RNA consensus structure prediction

Jens Reeder and Robert Giegerich *

Faculty of Technology, Bielefeld University 33615 Bielefeld, Germany

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 INTRODUCTION
 RNA SHAPE ANALYSIS AND...
 ALGORITHM IMPLEMENTATION
 EVALUATION
 DISCUSSION
 REFERENCES
 

Motivation: The well-known Sankoff algorithm for simultaneous RNA sequence alignment and folding is currently considered an ideal, but computationally over-expensive method. Available tools implement this algorithm under various pragmatic restrictions. They are still expensive to use, and it is difficult to judge if the moderate quality of results is because of the underlying model or to its imperfect implementation.

Results: We propose to redefine the consensus structure prediction problem in a way that does not imply a multiple sequence alignment step. For a family of RNA sequences, our method explicitly and independently enumerates the near-optimal abstract shape space, and predicts as the consensus an abstract shape common to all sequences. For each sequence, it delivers the thermodynamically best structure which has this common shape. Since the shape space is much smaller than the structure space, and identification of common shapes can be done in linear time (in the number of shapes considered), the method is essentially linear in the number of sequences. Our evaluation shows that the new method compares favorably with available alternatives.

Availability: The new method has been implemented in the program RNAcast and is available on the Bielefeld Bioinformatics Server.

Contact: jreeder{at}TechFak.Uni-Bielefeld.DE,robert{at}TechFak.Uni-Bielefeld.DE

Supplementary information: Available at http://bibiserv.techfak.uni-bielefeld.de/rnacast/supplementary.html


    INTRODUCTION
 TOP
 Abstract
 INTRODUCTION
 RNA SHAPE ANALYSIS AND...
 ALGORITHM IMPLEMENTATION
 EVALUATION
 DISCUSSION
 REFERENCES
 
The role of RNA in all organisms is much broader and more fundamental than it was considered only recently (Lee et al., 1993; Mattick and Gagen, 2001; Lee and Ambros, 2001; Skryabin et al., 2003; Pfeffer et al., 2004). With non-coding RNA, the structure of the molecule is often essential for its function. In analogy to coding RNA, where a conserved encoded protein hints at a similar metabolic function, structural conservation in RNA may give clues to RNA function and to finding of RNA genes. However, structure conservation is more complex to deal with computationally than sequence conservation.

Comparative structure prediction and theSankoff algorithm
The secondary structure of RNA—the level of base pairing—strongly determines the tertiary structure. Since the latter is computationally intractable and experimentally expensive to obtain, secondary structure analysis has become an accepted substitute. Early computational approaches to secondary structure prediction were by Nussinov et al. (1978) and Waterman and Smith (1978). Today's methods use minimum free energy folding, pioneered by Zuker and Stiegler (1981). Such methods are widely used today, although it is known that their results are not completely reliable (Doshi et al., 2004). Better results are generally achieved by comparative analysis of a family of homologous sequences, where sequence and structure conservation is exploited, using a resolved tertiary structure whenever available, sequence alignment, statistical methodsand human expertise (Gutell et al., 1992).

A first comparative approach based on thermodynamics was formulated by Sankoff as early as 1985 (Sankoff, 1985), which performs sequence alignment and minimal free energy folding simultaneously. Its time complexity is , with space , for two sequences of length n, and for more sequences, it becomes exponential in the number of sequences. Given these high computational costs, it seemed unlikely that this algorithm would ever be put into practice. For many years, it rested in oblivion.

Recently, however, interest in comparative methods for RNA structure prediction has been nurtured by findings on the functional versatility of RNA, and several related approaches have been suggested. Some emphasize the sequence conservation aspect, folding a predetermined sequence alignment under thermodynamic rules (RNAalifold, Hofacker et al., 2002). The other extreme emphasizes thermodynamics and suggests to use multiple structure alignments of independently folded sequences (Höchsmann et al., 2004). Other approaches directly implement Sankoff's idea of simultaneous alignment and folding, but introduce various pragmatic restrictions, e.g. Dynalign (Mathews and Turner, 2002) and Foldalign (Gorodkin et al., 1997). For a recent review of these and other tools, the reader is referred to the study of Gardner and Giegerich (2004).

Behind all these approaches, there is the original Sankoff approach as the ideal method—the one that every program tries to approximate in different ways. ‘Making Sankoff practical’ has been a recurring theme at the meetings of the computational RNA community. However, this road may require so many pragmatic restrictions that the ideal loses much of its attraction.

A way out of this dilemma may be to change the definition of a consensus structure. In Sankoff's approach, the consensus is a folded sequence alignment that optimizes a combined sequence similarity and energy score. What if we drop the implicit multiple sequence alignment step (as this problem is known to be NP-complete)? Let us agree that a consensus structure for sequences s1,...,sk is a set of structures x1,...,xk, one for each si, that all have, in some mathematically precise sense, a common shape. Should a sequence alignment of s1,...,sk, compatible with the consensus, also be desired, it may be computed afterward from x1,...,xk, rather than from s1,...,sk, by multiple structure alignment (Höchsmann et al., 2004). The latter phase will certainly need to resort to heuristics, but for the first phase, there may be a chance to achieve a complete and non-heuristic solution in acceptable time.

An alternative to the Sankoff method
A hypothetical method
To explain our new approach, let us first consider a hypothetical, exhaustive method. Let s1 and s2 be two RNA sequences, both of length n. Let us enumerate their foldings in the order of increasing free energy, yielding x1, x2,...,xN1 for s1 and y1, y2,...,yN2 for s2. The numbers N1 and N2 will be very large, even for small n, but let us ignore this for the moment.

If s1 and s2 have a common structure, there must be xi = yj for some i and j. In fact, there may be more such pairs. We rank them by (i+j), and the pair (xi,yj) with minimal rank is our predicted consensus. Just as well, we may produce the k top-ranking consensus pairs.

Using known algorithmic techniques, we can implement the enumeration in time and space, and the identification of common structures in time and space, where we represent structures as strings and employ keyword or suffix trees for fast identity matching. Clearly, if we add a third sequence s3, with structures z1, z2,...,zN3, the (N1+N2) above is replaced by (N1+N2+N3), and hence, this method is additive in the number of sequences! However, it is not practical for the following two reasons:

  • The numbers N1,N2,... are very large and Ni grows exponentially with n. Even if we restrict enumeration to an energy range of say 10% above the minimal free energy, Ni may be large as 100 000 or 1 000 000. This alone might not be a threat on today's computers, but there is one more problem.
  • Sequences s1 and s2 need not to have the same length, and hence their structures cannot be identical. We must allow for some flexibility in the relative position of helices. Therefore, we need to resort to some pairwise similarity computation, catapulting computation time of the identification phase to or higher. The additive behavior is lost.
To make our hypothetical method practical, we need to restrict enumeration to a small, but representative sample of the folding space, and achieve identification of consensus pairs in linear time in spite of their not being identical.

Outline of the consensus shapes prediction method
We build on the recent approach of abstract RNA shape analysis (Giegerich et al., 2004) to solve both of the above problems. Deferring formal definitions, a shape is a family of structures, sharing a common pattern of helix nesting and adjacency. The near-optimal folding space contains only a (relatively) small number of shapes. Using abstract shape analysis, we enumerate representative structures—one per shape, and only those!—for both s1 and s2. The highest ranking structure pair xi and yj, where both have the same shape, then forms our consensus pair. Although the structures xi and yj are only similar, their shapes can be easily computed, and identity matching on shapes can be implemented in time as sketched above, for significantly reduced N1 and N2.

These ideas will be rigorously described below, and we shall report on their implementation and evaluation.


    RNA SHAPE ANALYSIS AND CONSENSUS SHAPES
 TOP
 Abstract
 INTRODUCTION
 RNA SHAPE ANALYSIS AND...
 ALGORITHM IMPLEMENTATION
 EVALUATION
 DISCUSSION
 REFERENCES
 
Abstract shapes
We recall the basic definitions of abstract shape analysis.

  • An RNA sequence s has folding space F(s), the set of all admissible structures under the given base-pairing rules. For each structure x F(s), we can compute its free energy E(x).
  • The minimal free energy structure mfe(s) for a sequence s is the structure x F(s) where E(x) is minimal.
  • For efficient computation of shapes via dynamic programming, they must be represented as trees. Let be the tree-like domain of structures, and a tree-like domain of shapes. A shape abstraction is a mapping {pi} from to that preserves juxtaposition and embedding.
  • The abstract shape space of sequence s is . The class of p-shaped structures in F(s) is F(s|p) = {x|x F(s), {pi}(x) = p}.
  • The shape representative structure for shape p is the structure whose free energy is minimal among all members of that shape class. We call it shrep for short.

Abstract shape analysis, as implemented by the program RNAshapes, is computed for an RNA sequence s and an energy range R. It delivers a list with the following properties:

  • the pi are different shapes, and the are their respective shreps,
  • the list is ordered by increasing energy: and ,
  • the list is restricted to the energy range indicated by R: , and
  • the list completely covers this energy range in the abstract shape space: there is no shape pk+1 such that .
The strength of shape analysis lies in four aspects (for details see Giegerich et al., 2004):
  • It produces a non-heuristic, mathematically well defined synoptic view of the near-optimal folding space, allowing us to concentrate on a small number of shreps.
  • Shape analysis uses the full energy model (Mathews et al., 1999) and runs in the same asymptotic space and time complexity as suboptimal RNA folding.
  • Shapes are meaningful across sequences, hence lend themselves to a comparative approach. This aspect is exploited here for the first time.
  • The approach is generic with respect to the shape abstraction ({pi}) that is actually used. Shapes can be more or less abstract, depending on the level of detail considered relevant.

We illustrate the latter point by defining two shape abstractions used in this study. In general, shape abstractions retain nesting and adjacency of helices, but disregard their size and concrete position in the primary sequence. They may choose to retain or to discard bulges and internal loops, which leads to different levels of abstraction. ‘Level 5’ is the strongest abstraction and does not account for bulges etc. at all. ‘Level 3’ retains helix interruptions, but does not specify whether they result from 5'-bulges, 3'-bulges or internal loops.

As we are not concerned with the algorithmics of shape analysis here, we can forget about tree-like representations of structures and shapes, and define shape abstractions as mappings from the more familiar string representations of structures to string representations of shapes. Structures are represented as dot-bracket strings, e.g. ‘{(((.((.(((...))))).((...)).)))}’. The Level 5 shape of this structure is represented as ‘{[[][]]}’, its Level 3 shape as ‘{[[[]][]]}’. In Table 1, we provide equations defining shape abstractions {pi}5 and {pi}3.


View this table:
[in this window]
[in a new window]
 
Table 1 Definition of Level 5 and Level 3 shape abstractions

 
The rule about the choice of abstraction level is that we generally prefer to work with the less abstract level 3, except for long molecules where a stronger abstraction speeds up the program because the shape space is reduced further.

Incidentally, Level 3 shapes correspond to the two-interval patterns described in Vialette (2004).

Rankings of true shapes
In order to evaluate whether shape analysis bears promise toward consensus prediction, we performed two preliminary studies, using several sequence families from Rfam (Griffiths-Jones et al., 2003) and other databases (Supplementary material) where the ‘true’ structure s is known. From this true structure, we can compute the ‘true’ shape p* = {pi}(s). Question 1 asks for the rank i such that pi = p* in the list of shapes returned by shape analysis. Table 2 shows the outcome. The average rank of the true shape is 5.06, where in 32 of 69 cases (46%) the true shape has rank 1.


View this table:
[in this window]
[in a new window]
 
Table 2 Ranks of true shape in the list of near-optimal shapes using RNAshapes

 
The advantage of shape analysis over complete suboptimal folding (Wuchty et al., 1998) is witnessed by two detailed observations: for one of the tRNA sequences, the true shape has rank 3, whereas the true structure has rank 104 in the complete enumeration. In the worst case observed, a U12 RNA sequence, the true shape has rank 28, whereas its associated true structure has rank 3 695 033. This confirms our hope that the shape space is small enough to completely enumerate its interesting part. But it also confirms that the reliability (in terms of correctly predicted shapes) of single sequence folding lies ~46%—not useless, but not dependable either.

Question 2 asks whether this improves when we move toward a comparative approach by using pairs of sequences. In Table 3 we consider all pairs of predictions (within each family), and report on the rank of the true shape in the list of all common shapes. In the pairwise approach, the average rank of the true shape improves to 3.13, and the true shape now has rank 1 in 128 of 235 cases (53%). We conclude that the power of comparative analysis is well captured by our approach, and expect even better performance by using three or more sequences.


View this table:
[in this window]
[in a new window]
 
Table 3 This histogram shows the rank of the reference shape in all pairwise predictions

 
Consensus shape prediction
We now summarize the proposed method of consensus shape prediction.

For a set of sequences {s1, ..., sk}, intentionally a family of related RNA sequences, we enumerate their shape spaces . Upon those, we define:

DEFINITION 1
A shape p is a common shape of {s1,...,sk} if .

DEFINITION 2
The consensus shape for sequences {s1,...,sk} is the common shape p that minimizes rank .

Here, rank is a scoring function that combines the individual shrep scores. We will discuss several meaningful scoring functions in the Algorithms section.

Computing the intersection of , once we have those, is a trivial task. Usually, there are much less common shapes, than there are shapes in . We then sort all common shapes by their rank and get the consensus shape as the first one in the list.

Note, that the above definitions not only yield the consensus shape, but from shape analysis we also get the set of shreps—the resulting output is a (k + 1)-tuple . These shreps constitute an (unaligned) multiple RNA structure prediction for the input sequences.


    ALGORITHM IMPLEMENTATION
 TOP
 Abstract
 INTRODUCTION
 RNA SHAPE ANALYSIS AND...
 ALGORITHM IMPLEMENTATION
 EVALUATION
 DISCUSSION
 REFERENCES
 
The program RNAcast
The above method has been implemented by the program RNAcast, which stems from ‘RNA consensus abstract shapes technique’. Although most of the method is clear from our definition of the consensus shape, a few details remain to be fixed.

Step 1 Our algorithm starts with sequences s1, ..., sk as input, and an energy threshold R. Let n be their average length. We run RNAshapes on each individual sequence with the provided energy range R. Theoretically every sequence could have its own R, but in practice we use only one.

Step 2. Within the k resulting lists (the shape spaces), we identify all shapes that occur in all the lists. We use hashing techniques for fast identity matching of shapes. Thus this phase runs in a time proportional to . After this step, we have a list of all l common shapes, together with their shreps: .

Step 3. We evaluate each common shape with the scoring function and obtain a sorted list of all common shapes. The first shape of this list is returned as the consensus shape, along with its shreps. If desired, the r ≤ l best common shapes can be reportedas well.

We propose to use the output of RNAcast as input for RNAforester (Höchsmann et al., 2004), a multiple RNA structure alignment program. The unaligned RNAcast output is shown in Figure 1. The resulting alignment is shown in Figure 2.



View larger version (24K):
[in this window]
[in a new window]
 
Fig. 1 Example output for a family of IRES elements of Picornaviridae viruses. It first shows the common shape and the achieved score. Thereafter, for each input RNA and not aligned, there is the sequence, the predicted shrep, its energy and its individual rank in the shape space.

 


View larger version (9K):
[in this window]
[in a new window]
 
Fig. 2 Two multiple RNA structure alignments of five 5S rRNAs computed by RNAforester. On the left-hand side is the alignment of the structures as found in the database (Szymanski et al., 2000). On the right-hand side, the output shreps of RNAcast served as input for the alignment. Obviously, the structures are similar, with RNAcast predicting a few additional compatible base pairs. The alignment visualization should be interpreted as follows: the frequencies of the bases A, C, G and U are proportional to the radius of circles that are arranged for each residue clockwise on the corners of a square, starting at the upper left corner. Additionally, these circles are colored red, green, blue and magenta for the bases A, C, G and U, respectively. The frequency of a gap is proportional to a black circle growing at the center of the square.

 
Left to be defined is the scoring function rank. We propose three different possibilities:
  1. Rank sum score: Each shrep contributes with its individual rank in the sorted shape space of its sequence: .
  2. Sum of energies: .
  3. Sum of probabilities: , where Prob(...) are the probabilities coming from the partition function (McCaskill, 1990), requiring extra O(k · n3) steps to compute. In this case, we have to maximize overall scores.
Overall, it turned out that rank2, the simple sum of energies, performs best, followed by rank3 and at last the rank sum score. But, prediction accuracy for all three scoring function does not differ much. The method seems to be relatively robust, concerning the choice of scoring function. We decided to use rank2 for all computations discussed in this paper.

Technical limitations
Mathematically, at least one common shape always exists for any set of sequences. This does not hold when we limit the search space, and hence the choice of the energy range R is critical. If the true shape is missing in the enumerated shape space of one of our sequences (because the energy of its shrep is too high), this shape cannot become a common shape, nor the consensus. The predicted consensus must be wrong in this case. This has consequences: theoretically, there is no upper bound for the number k of sequences to be considered, and from the efficiency point of view, k can be quite large. But with larger datasets, chances grow that one of the sequences is an outlier where the chosen energy range just is not large enough, and the true consensus shape is missed owing to the effect explained above. Increasing k implies a tendency to increase R, unless we know in advance that the dataset is very homogeneous. When more sequences require to look deeper into the shape space, these two sources of increased efforts multiply. In practice, we suggest to use not more than 10 sequences.

Another point of advice can be given as a rule of thumb. All predicted consensus shreps have the same number of helices. If their native structures are suspected to have different numbers of helices (as with a mixture of four and five stem tRNA) the energy threshold should be large enough to accomodate loss of theextra stem.


    EVALUATION
 TOP
 Abstract
 INTRODUCTION
 RNA SHAPE ANALYSIS AND...
 ALGORITHM IMPLEMENTATION
 EVALUATION
 DISCUSSION
 REFERENCES
 
In our preliminary tests, we evaluated that our method is capable to identify the correct shape in 53% of all pairwise predictions. When we are using RNAcast in the multiple way, the correct shape is predicted for 6 of 10 families and for 3 more families the true shape is on rank 2 or 3. But predicting the correct shape alone is not good enough. Within a shape class, there is considerable structural variation possible. Since shapes abstract from concrete helix positions and sizes, it is theoretically possible that the shrep of a correct shape does not share a single base pair with the true structure.

In this section we evaluate the accuracy achieved by RNAcast on the base pair level and compare it with other tools.

In particular we will answer the following questions:

  1. How accurate are the shreps, given the correct shape?
  2. What is the improvement over single sequence folding algorithms?
  3. How does RNAcast perform compared with other pairwise and multiple folding algorithms?
  4. What are the reasons for wrong predictions?

We evaluate the structure predictions in terms of sensitivity, selectivity and the Matthews correlation coefficient (MCC):


True positive base pairs (TP), true negatives (TN) and false negatives (FN) are counted as usual. No slipping of helices is allowed. For false positives (FP) we use the counting method proposed by Gardner and Giegerich (2004): Predicted base pairs, that do not occur in the reference structure, but are compatible with it, are not counted for FP. A base pair i · j is compatible if neither i nor j is paired to another base in the reference and there is no other base pair k · l that violates the nesting convention (i.e. k < i < l < j). This assumption is meaningful, since the reference structures used in this study often come as a consensus. The members of a specific RNA family share all base pairs in the consensus, but may have additional ones.

Accuracy of the true shrep
Let us first assume, that we already know the correct shape of the family under evaluation. We are then asking for the corresponding shreps. We either look them up in the RNAshapes output, or we generate an RNA folding program restricted to that specific shape and compute the optimal structure directly. We did this for the same set of RNA families as used in the preliminary study and evaluated the accuracy of the shreps.

On average, the sensitivity is 78.2% and selectivity is 78.6%, compared with 65.4 and 65% for the mfe-prediction of the single sequence. This shows, that knowing the correct shape improves secondary structure prediction of single sequences significantly. However, in most realistic cases we do not know the true shape in advance. The best we can do then is to rely on the consensus shape computed by RNAcast. Note that, even when the predicted consensus shape is incorrect, it still may be close to the correct shape, in which case the predicted structures may also come close to the truth.

Improvement over single sequence prediction
We now evaluate the accuracy of the structures predicted by RNAcast, whether or not the predicted shape is correct.

We folded each RNA family in five different ways:

(1) Single sequence prediction using RNAfold (Hofacker et al., 1994),
(2 and 4) RNAcast on all pairwise combinations using shape abstraction {pi}5 and {pi}3, and
(3 and 5) RNAcast in a multiple way on all family members, again in each case with {pi}5 and {pi}3.

In Figure 3 we plot the (average) MCC of each prediction method. We can see, that ‘going comparative’ pays off. In all cases but one (multiple tRNA folding with {pi}5) RNAcast performs better than single sequence prediction. The clover leaf prediction for tRNA failed—one arm of the clover leaf was missed. However, using the less abstract shape mapping yields the correct shape and a higher accuracy. We can further see, that using multiple sequences increases the reliability of prediction. Overall, {pi}3 gives the highest accuracy, especially for shorter sequences (≤150 bases), where additional bulges or internal loops may be more important than in longer sequences. The averaged MCC for RNAcast multiple with shape abstraction {pi}3 is 0.77. This is a definite increase, compared with 0.64 for single sequence prediction.



View larger version (34K):
[in this window]
[in a new window]
 
Fig. 3 Accuracy (MCC) of RNAcast on a set of RNA families, sorted by size. The bars correspond from left to right to RNAfold, RNAcast pairwise, RNAcast multiple (both {pi}5), RNAcast pairwise, RNAcast multiple (both {pi}3).

 
Next we relate our method to existing comparative tools.

Comparison with the Sankoff approach
Comparison with Dynalign
Dynalign was chosen as a state-of-the-art representative of the Sankoff approach. In Mathews and Turner (2002), the sensitivity of Dynalign is measured on a set of 5S rRNA. We found a secondary structure for five sequences of that set in the database (Szymanski et al., 2000) and evaluated RNAcast on them. Single sequence prediction performs relatively bad on this dataset (RNAfold in Table 4). Using RNAcast in a pairwise fashion improves the accuracy clearly, but still is not satisfying. For Dynalign (pairwise only) we compute an average sensitivity of 84.2%. This is only topped by running RNAcast multiple on five sequences simultaneously. The sensitivity is 92% and selectivity is almost perfectat 97.8%.


View this table:
[in this window]
[in a new window]
 
Table 4 Prediction accuracy for a set of 5S rRNA

 
Comparison with the Gardner study
Recently, several multiple RNA folding algorithms were evaluated by Gardner and Giegerich (2004). The study included three different approaches, where ‘Plan B’ referred to tools that approximate the Sankoff approach of simultaneous alignment and folding. We choose the Saccharomyces cerevisiae tRNA–PHE (11 sequences, high sequence similarity) and the Escherichia coli RNAse P (5 sequences, medium similarity) datasets from that study and compare the prediction accuracies. Since Dynalign permits only pairwise folding, Gardner et al. folded the reference sequence with each of the other sequences at a time. We did the same with our program. The corresponding results are in column ‘pairwise’ and ‘Dynalign’ in Table 5. Carnac (Touzet and Perriquet, 2004) can fold multiple sequences and performed quite well in the study. Naturally, our method yields much better results for a multiple sequence input than for only two sequences (column ‘multiple’, Table 5). The sensitivity is comparable with Carnac, which in turn is almost perfectly selective, and thus has a better correlation.


View this table:
[in this window]
[in a new window]
 
Table 5 Comparison with the Gardner study

 
Detailed analysis of mispredictions
Overall, we observed 107 cases where RNAcast was not able to find the true shape. But how bad are the wrong shapes? By visual inspection we could classify a few recurring situations, listed in Table 6.


View this table:
[in this window]
[in a new window]
 
Table 6 Recurring situations, for which RNAcast predicts the wrong shape

 
In three cases, all tRNA, RNAcast predicts an additional hairpin, not mentioned in the database. This hairpin is the variable arm and, therefore, predicted correctly. On other sequences, we found that parts of the reference structure were enclosed by an additional helix, thus forming a multiloop. From our point of view, this situation could not be counted as false, either. It further confirms our choice not to count compatible base pairs as FP. Another point of error was the loss of one hairpin in 22 cases. Instead of the hairpin, we either see a single-stranded region or the region is consumed by the prolongation of a neighboring helix. Nevertheless, the remaining structure is accurate. All remaining cases (41) differ substantially from the reference and have to be counted as wrong predictions without an excuse.

In general, the accuracy for the first three situations is still rather high on the base pair level, in fact higher than single sequence predictions. Usually, sequences for which RNAcast predicts a wrong shape have a low accuracy and are poorly predicted byRNAfold, too.

Efficiency
As is to be expected from the asymptotic analysis, the efficiency of RNAcast is quite good. It strongly depends on RNAshapes' efficiency, which has been optimized recently. On a typical sequence pair of tRNAs (72 and 75 nt) running time remains <1 s, whereas, Dynalign takes 488 s (Table 7). Adding more sequences of similar size increases the runtime only linearly: 10 tRNAs are processed within 5 s. Since the calls to RNAshapes are independent of each other, they can execute in parallel. The RNAse P example (5 sequences of ~240 bases each) with {pi}5 can actually be done in 12 s.


View this table:
[in this window]
[in a new window]
 
Table 7 Performance measured on a 2.8 GHz Dual Xeon system with 2GB RAM

 

    DISCUSSION
 TOP
 Abstract
 INTRODUCTION
 RNA SHAPE ANALYSIS AND...
 ALGORITHM IMPLEMENTATION
 EVALUATION
 DISCUSSION
 REFERENCES
 
Differences from the Sankoff notion of consensus
Let us once more relate RNAcast to Dynalign, which is the best available approximation to the Sankoff algorithm. It is important to keep in mind that although the Sankoff algorithm can, in principle, maximize sequence similarity alongside with free energy minimization, its Dynalign implementation minimizes gap penalties, but otherwise ignores sequence content.

The quantitative results in the previous section show that the new alternative method is comparable or better in the quality of predictions, and much faster computationally. In that section, results from both tools were compared with a ‘gold standard’, which is much easier than comparing them with each other, because they pursue different objectives.

Remember that we have not presented another approach to implement the Sankoff algorithm, but we have significantly changed the problem definition. Although the Sankoff approach determines a sequence alignment reflecting a common set of base pairs, consensus shape prediction produces a consensus abstract shape together with its shrep for each sequence, but no alignment. Since this deviates from the traditional and accepted notion, let us discuss common aspects as well as differences from a conceptual point of view.

The Sankoff approach produces sequences aligned according to the predicted common base pairs, hence with the same Level 5 shape. However, their Level 3 shapes may be different, as some sequences may have gaps where others have bulges. In either case, the structures reported are not necessarily the shreps of their respective shapes. One may refold the structures individually, with the consensus base pairs fixed, but then the refolded structures may be ‘out of shape’ because they exhibit additional hairpins.

RNAcast predictions are unaligned. Using the predicted shreps, a multiple structure alignment may be obtained via RNAforester or a similar structure alignment tools. From the structure alignment, a sequence alignment consistent with the consensus shape may be easily derived. The structure alignment also minimizes the number of gaps, but in contrast to the Sankoff approach, it does so after structure prediction, and not simultaneously. Hence, one may expect cases where the Sankoff approach produces results that fix more strongly the relative positions of helices, whereas with RNAcast, conserved helices may move more flexibly. However, we have not observed this effect to a significant amount in our studies.

Potential improvements
Reality differs from our evaluation scenario. Database families can be considered reliable homologs, but when a new (putative) family is investigated, we cannot be sure whether structure is preserved. With consensus shape prediction, we would like to implement a safeguard against members in the sequence set that really do not share the common shape of the rest. Such a situation will result most probably in a consensus garbage. We expect that leave-one-out tests can be designed to recognize this situation. Such tests can be implemented efficiently, because only Steps 2 and 3 of the RNAcast algorithm, but not the most expensive Step 1 must be iterated.

We have performed an overall evaluation of our new method, but have not yet tried to optimally adjust it to particular datasets. For example, when studying short molecules like microRNA precursors, Level 2 abstraction, which distinguishes 5'-, 3'-bulges and internal loops, might be more conclusive than Level 3. More systematic study and experience is needed to provide guidance about the most conclusive level of shape abstraction to be used in a particular context.

Conflict of Interest: none declared.

Received on May 9, 2005; revised on June 16, 2005; accepted on July 6, 2005

    REFERENCES
 TOP
 Abstract
 INTRODUCTION
 RNA SHAPE ANALYSIS AND...
 ALGORITHM IMPLEMENTATION
 EVALUATION
 DISCUSSION
 REFERENCES
 

    Doshi, K., et al. (2004) Evaluation of the suitability of free-energy minimization using nearest-neighbor energy parameters for RNA secondary structure prediction. BMC Bioinformatics, 5, 105[CrossRef][Medline].

    Gardner, P. and Giegerich, R. (2004) A comprehensive comparison of comparative RNA structure prediction approaches. BMC Bioinformatics, 5, 140[CrossRef][Medline].

    Giegerich, R., et al. (2004) Abstract shapes of RNA. Nucleic Acids Res., 32, 4843–4851[Abstract/Free Full Text].

    Gorodkin, J., et al. (1997) Finding the most significant common sequence and structure motifs in a set of RNA sequences. Nucleic Acids Res., 25, 3724–3732[Abstract/Free Full Text].

    Griffiths-Jones, S., et al. (2003) Rfam: an RNA family database. Nucleic Acids Res., 31, 439–441[Abstract/Free Full Text].

    Gutell, R.R., et al. (1992) Identifying constraints on the higher-order structure of RNA: continued development and application of comparative sequence analysis methods. Nucleic Acids Res., 20, 5785–5795[Abstract/Free Full Text].

    Höchsmann, M., et al. (2004) Pure multiple RNA secondary structure alignments: a progressive profile approach. IEEE/ACM Trans. Comput. Biol. Bioinformatics, 1, 53–62[CrossRef].

    Hofacker, I.L., et al. (2002) Secondary structure prediction for aligned RNA sequences. J. Mol. Biol., 319, 1059–1066[CrossRef][ISI][Medline].

    Hofacker, I.L., et al. (1994) Fast folding and comparison of RNA secondary structures. Monatshefte f. Chemie, 125, 167–188[CrossRef].

    Lee, R. and Ambros, V. (2001) An extensive class of small RNAs in caenorhabditis elegans. Science, 294, 862–864[Abstract/Free Full Text].

    Lee, R., et al. (1993) The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14. Cell, 75, 843–854[CrossRef][ISI][Medline].

    Mathews, D.H., et al. (1999) Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol., 288, 911–940[CrossRef][ISI][Medline].

    Mathews, D.H. and Turner, D.H. (2002) Dynalign: an algorithm for finding the secondary structure common to two RNA sequences. J. Mol. Biol., 317, 191–203[CrossRef][ISI][Medline].

    Mattick, J. and Gagen, M. (2001) The evolution of controlled multitasked gene networks: the role of introns and other noncoding RNAs in the development of complex organisms. Mol. Biol. Evol., 18, 1611–1630[Abstract/Free Full Text].

    McCaskill, J.S. (1990) The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers, 29, 1105–1119[CrossRef][ISI][Medline].

    Nussinov, R., et al. (1978) Algorithms for loop matchings. SIAM J. Appl. Math., 35, 68–82[CrossRef].

    Pfeffer, S., et al. (2004) Identification of virus-encoded microRNAs. Science, 304, 734–736[Abstract/Free Full Text].

    Sankoff, D. (1985) Simultaneous solution of the RNA folding, alignment and protosequence problems. SIAM J. Appl. Math., 45, 810–825[CrossRef].

    Skryabin, B., et al. (2003) Neuronal untranslated BC1 RNA: targeted gene elimination in mice. Mol. Cell Biol., 23, 6435–6441[Abstract/Free Full Text].

    Szymanski, M., et al. (2000) 5S ribosomal RNA database Y2K. Nucleic Acids Res., 28, 166–167[Abstract/Free Full Text].

    Touzet, H. and Perriquet, O. (2004) CARNAC: folding families of related RNAs. Nucleic Acids Res., 32, 142–145.

    Vialette, S. (2004) On the computational complexity of 2-interval pattern matching problems. Theor. Comput. Sci., 312, 223–249[CrossRef].

    Waterman, M.S. and Smith, T.F. (1978) RNA secondary structure: a complete mathematical analysis. Math. Biosciences, 42, 257–266[CrossRef].

    Witwer, C., et al. (2001) Conserved RNA secondary structures in Picornaviridae genomes. Nucleic Acids Res., 29, 5079–5089[Abstract/Free Full Text].

    Wuchty, S., et al. (1998) Complete suboptimal folding of RNA and the stability of secondary structures. Biopolymers, 49, 145–165.

    Zuker, M. and Stiegler, P. (1981) Optimal computer folding of large RNA sequences using thermodynamics and auxiliary informations. Nucleic Acids Res., 9, 133–148[Abstract/Free Full Text].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Brief BioinformHome page
K. Katoh and H. Toh
Recent developments in the MAFFT multiple sequence alignment program
Brief Bioinform, July 1, 2008; 9(4): 286 - 298.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
E. Torarinsson and S. Lindgreen
WAR: Webserver for aligning structural RNAs
Nucleic Acids Res., July 1, 2008; 36(suppl_2): W79 - W84.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
S. Lindgreen, P. P. Gardner, and A. Krogh
MASTR: multiple alignment and structure prediction of non-coding RNAs using simulated annealing
Bioinformatics, December 15, 2007; 23(24): 3304 - 3311.
[Abstract] [Full Text] [PDF]


Home page
RNAHome page
E. S. Andersen, A. Lind-Thomsen, B. Knudsen, S. E. Kristensen, J. H. Havgaard, E. Torarinsson, N. Larsen, C. Zwieb, P. Sestoft, J. Kjems, et al.
Semiautomated improvement of RNA alignments
RNA, November 1, 2007; 13(11): 1850 - 1859.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
H. Kiryu, Y. Tabei, T. Kin, and K. Asai
Murlet: a practical multiple alignment tool for structural RNA sequences
Bioinformatics, July 1, 2007; 23(13): 1588 - 1598.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
E. Torarinsson, J. H. Havgaard, and J. Gorodkin
Multiple structural alignment and clustering of RNA sequences
Bioinformatics, April 15, 2007; 23(8): 926 - 932.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
B. Voss
Structural analysis of aligned RNAs
Nucleic Acids Res., November 14, 2006; 34(19): 5471 - 5481.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
P. Steffen, B. Voss, M. Rehmsmeier, J. Reeder, and R. Giegerich
RNAshapes: an integrated RNA analysis package based on abstract shapes
Bioinformatics, February 15, 2006; 22(4): 500 - 503.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/17/3516    most recent
bti577v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (20)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Reeder, J.
Right arrow Articles by Giegerich, R.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Reeder, J.
Right arrow Articles by Giegerich, R.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?