Skip Navigation


Bioinformatics Advance Access originally published online on June 21, 2005
Bioinformatics 2005 21(16):3352-3359; doi:10.1093/bioinformatics/bti550
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/16/3352    most recent
bti550v2
bti550v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (33)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Siebert, S.
Right arrow Articles by Backofen, R.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Siebert, S.
Right arrow Articles by Backofen, R.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

MARNA: multiple alignment and consensus structure prediction of RNAs based on sequence structure comparisons

Sven Siebert and Rolf Backofen *

Department of Bioinformatics, Institute of Computer Science, Friedrich-Schiller-University Jena Ernst-Abbe Platz 2, 07743 Jena, Germany

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 INTRODUCTION
 METHODS
 APPLICATION
 DISCUSSION
 REFERENCES
 

Motivation: Due to the importance of considering secondary structures in aligning functional RNAs, several pairwise sequence–structure alignment methods have been developed. They use extended alignment scores that evaluate secondary structure information in addition to sequence information. However, two problems for the multiple alignment step remain. First, how to combine pairwise sequence–structure alignments into a multiple alignment and second, how to generate secondary structure information for sequences whose explicit structural information is missing.

Results: We describe a novel approach for multiple alignment of RNAs (MARNA) taking into consideration both the primary and the secondary structures. It is based on pairwise sequence–structure comparisons of RNAs. From these sequence–structure alignments, libraries of weighted alignment edges are generated. The weights reflect the sequential and structural conservation. For sequences whose secondary structures are missing, the libraries are generated by sampling low energy conformations. The libraries are then processed by the T-Coffee system, which is a consistency based multiple alignment method. Furthermore, we are able to extract a consensus-sequence and -structure from a multiple alignment. We have successfully tested MARNA on several datasets taken from the Rfam database.

Availability: MARNA can be used online on our webpage www.bio.inf.uni-jena.de/Software/MARNA/index.html

Contact: backofen{at}inf.uni-jena.de


    INTRODUCTION
 TOP
 Abstract
 INTRODUCTION
 METHODS
 APPLICATION
 DISCUSSION
 REFERENCES
 
In recent years, RNA molecules gained increasing interest since a huge variety of functions associated with them was found. Consequently, research on small RNAs has been elected as the scientific breakthrough of the year 2002 by the readers of the Science magazine (Couzin, 2002). The function of an RNA-molecule is mainly determined by its (secondary) structure. It is assumed that the structure of an RNA is often more conserved than its sequence (even more than for proteins). Hence, one cannot use standard multiple sequence alignment techniques like e.g. Clustal W (Thompson et al., 1994), Dialign (Morgenstern, 1998) or T. Coffee (Notredame et al., 2000) since they completelyneglect structural information.

Multiple sequence- and structure-based alignments of RNAs can be divided into two major classes, the probabilistic and the non-probabilistic approaches. Probabilistic approaches are based on stochastic context-free grammars (SCFG) and require an initial multiple alignment as input. The quality of the outputs crucially depends on this initial alignment. They are used to model RNA-families and/or to predict a secondary structure via comparative analysis [e.g. Cove (Eddy and Durbin, 1994), RNACAD (Brown, 1999) and Pfold (Knudsen and Hein, 2003)]. A non-probabilistic, comparative approach is e.g. given by RNAlign (Corpet and Michot, 1994) that performs an alignment between a bank of aligned sequences and a new sequence.

In this paper, we propose a non-probabilistic approach to align a set of more than two RNAs with or without known conformations. The standard approach is to perform direct pairwise alignments of RNAs using sequence and (secondary) structure information and to combine the pairwise alignments into a multiple alignment. No general approach yet exists albeit there is a wealth of approaches for pairwise alignment of RNAs (see below). The reason is that the results of the pairwise sequence/structure alignments cannot simply be aligned in a progressive way (like profiles for sequence alignments). To the best of our knowledge, there are only two exceptions, namely PMcomp/PMmulti (Hofacker et al., 2004) and RNAforester (Höchsmann et al., 2003). PMcomp aligns RNA base pairing probability matrices and predicts a common folding structure between two sequences. PMmulti uses PMcomp in a progressive alignment strategy and provides multiple alignments with good qualities. However, it has a high complexity of O(n6) time and O(n4) space for the pairwise comparisons. In RNAforester, secondary structures are interpreted as trees, and a tree-based alignment is applied.

We solved the problem of combining pairwise alignments of RNAs as follows. First, alignment edges between RNAs reflecting sequence and structure similarities are generated based on an algorithmpublished by Jiang et al., 2002. In a second step, these edges are collected in a library, which is given as input to the multiple sequence alignment method T-Coffee (Notredame et al., 2000). Structural positions that are supported by several pairwise comparisons are strengthened. Hence, the result comprises sequence and structure similarities of RNAs albeit the progressive alignment strategy is in principle not structure-based.

We have used the algorithm of Jiang et al., 2002 since it provides the greatest scoring flexibility and has moderate complexity. But any other sequence- and structure-based pairwise alignment method can also be adapted to our approach. The computational problem of pairwise alignment of RNAs was first addressed by Sankoff, 1985 who proposed a dynamic programming algorithm that aligns a set of RNA sequences while predicting their common fold at the same time. Subsequently, a variety of pairwise sequence-structure alignment approaches have been developed. Lenhof et al., 1998 addresses the problem of optimally aligning a given RNA sequence of unknown structure to one of known sequence and structure. Local pairwise RNA-alignments using the same scoring scheme as Jiang et al., 2002 are considered by Backofen and Will, 2004. Beside the above listed approaches, there are several approaches that work on a tree based representation of RNAs (see e.g. Jiang et al., 1995; Höchsmann et al., 2003; Shapiro and Zhang, 1990).

We tested our approach on eukaryotic SECIS-elements on tRNA-like 3' UTR elements from Tymovirus/Pomovirus and on the Hammerhead ribozyme (type III). We compared our MARNA results with the manual alignments taken from the Rfam database and with the alignments generated by PMmulti.


    METHODS
 TOP
 Abstract
 INTRODUCTION
 METHODS
 APPLICATION
 DISCUSSION
 REFERENCES
 
A sequence S is a word over the alphabet {A, C, G, U}. S[i] denotes the i-th symbol in S. An arc a is a pair (i,j) N x N such that i < j. i and j are called ends of the arc a. A base is called free if it is not involved in any arc. A secondary structure P is a set of arcs such that no end of an arc appears more than once in P. Here, we consider secondary structures that are nested, i.e. for any two base pairs (i1,i2) P and (j1,j2) P, we have either independent base pairs with i2 < j1 or j2 < i1, or nested base-pairs with i1 < j1 < j2 < i2 or j1 < i1 < i2 < j2. We call the tuple S = (S,P) a sequence-structure. In the following, all RNAs are specified by their sequences and their known secondary structures. Unknown structures will be handled in a later section.

We use the gap symbol ‘’ to denote an inserted/deleted nucleotide. An alignment of two sequence–structures S1 = (S1,P1) and S2 = (S2,P2) is a subset of [1.|S1|] {cup} {} x [1.|S2|] {cup} {}, where for all pairs (i,j), holds

  1. i ≤ i' {Rightarrow} j ≤ j'
  2. i = i' != {Rightarrow} j = j' and
  3. j = j' != {Rightarrow} i = i'.

In addition, we require that for every i [1.|S1|] there is some j with , and vice versa for j [1.|S2|]. The pairs are called alignment edges. We say that i [1.|S1|] is aligned with j if , and analogously for j [1.|S2|]. An alignment edge is called realized if neither i = nor j = .

Pairwise alignment scores
The scoring of an alignment of two sequence-structures S1 = (S1,P1) and S2 = (S2,P2) is based on the notion of edit operations on bases as well as on arcs. We briefly recall the edit operations from Jiang et al., 2002 and present a slightly modified version of their distance-based scoring scheme.

Edit operations on free bases are base match, base mismatch and base deletion. A base match has cost 0, base mismatch and base deletion have positive costs. We combine these cost functions into a single cost function wbase(i,j), where wbase(i,j) = 0 only if S1[i] = S2[j]. We will feel free to write either the positions or the nucleotides as arguments in the cost function whereever necessary.

For arcs, we have a more complex scoring scheme. Consider an arc (i,i') P1 such that i is aligned with j and i' is aligned with j'. An arc match occurs if j,j' form an arc (j,j') P2, S1[i1] = S2[j1] and S1[i2] = S2[j2]. We have an arc mismatch if (j,j') P2, but S1[i1] != S2[j1] or S1[i2] != S2[j2]. Arc matches have cost 0, whereas arc mismatches have cost wam(i,i',j,j'). On the other hand, if (j,j') {notin} P2, then we have an arc deletion with cost wad(i,i',j,j'). Lin et al., subdivided arc deletions into arc breakings, arc alterings and arc removings. An arc breaking occurs if none of j and j' equals the gap symbol . If exactly one of j,j' equals , then we have an arc altering. If both j,j' are equal to , then we have an arc removing. The edit operations on arcs are summarized in Figure 1.



View larger version (25K):
[in this window]
[in a new window]
 
Fig. 1 An alignment of two RNAs with corresponding edit operations on arcs. Alignment edges between bases are shown as solid lines (realized edges) and dashed lines (non-realized edges). The thickness of realized edges corresponds to similarity weights between bases. Non-realized alignment edges are skipped for the multiple alignment step.

 
The total score of an alignment is the sum of costs of all applied edit operations that transform one sequence–structure into the other. The complexity of finding an alignment with minimal costs is determined by the way arc deletions are scored. Jiang et al., 2002 presented a dynamic programming algorithm that solves this problem in O(n2m2) time and O(nm) space under certain restrictions on the scoring of arc deletions. In effect, this requires the existence of functions and for the left and right ends, respectively, such that

In the following, we will not distinguish between left and right ends of an arc, i.e. we set {forall}i,j: , where is a single function to score both ends of an arc.

The effect of this restriction is that one can evaluate both arc ends in an alignment independently, which is a necessary prerequisite for the dynamic programming algorithm. This situation is depicted in Figure 2.



View larger version (14K):
[in this window]
[in a new window]
 
Fig. 2 Independent scoring of both arc ends.

 
In our approach, we even simplify the scoring scheme further by defining to be composed of a base match, base mismatch or base deletion together with a fixed cost for deleting an arc. Hence, we set

where is the cost for deleting one arc.

Multiple alignment
T-Coffee
Once the sequence–structure alignments have been calculated for all pairs of input sequences, we construct the so-called library. A library for a pairwise alignment of two sequence–structures consists of the set of all realized edges together with a weighting of each edge. Then, the libraries for all pairwise alignments are given to T-Coffee (Notredame et al., 2000) to build a single multiple alignment.

The T-Coffee system is a consistency based alignment method that combines local and global information to produce a multiple alignment in the following manner. First, an extended primary library is produced that improves all pairwise alignments by taking into consideration how all other sequences align with the current two RNAs. Edges achieve higher weights if the bases at the end points of these edges are also aligned with other sequences. Second, the improved dataset of pairwise alignments is processed by a progressive alignment strategy. A distance matrix is computed between all sequences using the improved weights of alignment edges. Subsequently, the neighbor-joining method (Saitou and Nei, 1987) provides a phylogenetic tree, which dictates the order of aligning these sequences. Since the initial libraries were generated from sequence–structure alignments, the resulting multiple alignment reflects the sequential and structural similarities of RNAs.

Distance and similarity
The weights attached to realized edges in the libraries correspond to similarity weights. For that reason, we have to transform the distances defined in the previous section into similarity values. Smith and Waterman, 1981 solved the problem of transforming distances into similarities for edit operations on bases. We extend this approach to our set of edit operations. The main observation of Smith and Waterman, 1981 is that one has to consider the number of nucleotides r involved in an edit operation. We call this number the order of the edit operation. In our case, we have edit operations with r = 4 (arc match and arc mismatch), r = 2 (base match and base mismatch) and r = 1 (base deletion). Since we have split the arc deletion operation into two separate edit operations for the arc ends, we have an edit operation with r = 2 if the arc end is aligned with a nucleotide, and an edit operation with r = 1 if the arc end is aligned with .

By enumerating all different edit operations, we can write the distance score of an alignment as

where wr,k is the cost for the k-th edit operation of order r (for r = 4,2,1), and is the number of times the k-th edit operation of order r is used in the alignment . Then we can rewrite the distances wr,k into similarities sr,k as follows:

THEOREM 1
Consider a scoring scheme where wr,k is the cost of the k-th edit operation of order r. Let AMSP be any fixed value, which is interpreted as the maximal similarity per nucleotide position we want to achieve. Define the similarity sr,k for the k-th edit operation of order r by

Then the alignment which minimizes is the alignment that maximizes , and vice versa.

PROOF 1
The optimal alignment for two sequence–structures S1 = (S1,P1) and S2 = (S2,P2) under the similarity score is given by

Since any nucleotide position is involved in exactly one edit operation, we know that is the total number of nucleotide position involved in edit operations. Hence, . Thus,

Thus, one has only to choose the maximal similarity per position AMSP to transform the distance score into a similarity score without changing the global optimal alignment. Albeit it does not change the global optimal alignment, it is important for the T-Coffee system since only the realized edges are considered when combining the pairwise alignments into a multiple alignment. This implies that alignment edges containing a gap have a weight of 0. To achieve a good approximation to this, we set

The above theorem can also be extended to vary the contribution from structural and sequential positions for the generation of the multiple alignment. Obviously, the distance score is flexible enough to strengthen either structural or sequential positions. Structural positions are strengthened by rising the constant cost for arc deletion (i.e. ). But this is somewhat lost if we have the same maximal similarity for structural and sequential positions. This leads to the following modification of the theorem. We say that that a position i in the sequence–structure S = (S,P) is a structural position if there is an i' with (i,i') P or (i',i) P. The position i is defined to be sequential otherwise. The order of an edit operation is now defined by two values rstr and rseq, which are the numbers of structural and sequential positions in the edit operation, respectively. For an alignment the value denotes again the number of times the k-th edit operation of order rstr,rseq is used in . Then we can write the distance score of as .

THEOREM 2
Let wrstr,rseq,k be the cost of the k-th edit operation of order rstr,rseq. Let be the maximal similarity for structural positions, and let be analogously defined for sequential positions. Define the similarity for the k-th edit operation of order rstr,rseq by

Then the alignment which minimizes is the alignment that maximizes the similarity , and vice versa.

The resulting scoring scheme is depicted in Table 1. As discussed above for AMSP, a good choice for (resp. ) is to use the maximal cost for edit operations involving only structural (resp. sequential) positions. Another possibility is to choose such that the maximal weight for a single edge (namely 2) equals the maximal value allowed in T-Coffee. This is a reasonable choice if there is a high confidence in the structures selected for the sequences, and one wants to ensure that the structural positions arealigned. Note that in the current implementation of MARNA, we use the same values for and .


View this table:
[in this window]
[in a new window]
 
Table 1 Edit operations on arcs together with the associated distances and their similarity values given to the T-Coffee system

 
Combining several structures
The previously described approach uses one given structure for each sequence, which could be for example an experimentally confirmed structure. Usually, the structure is not known and has to be computed by secondary structure prediction programs like Mfold (Zuker, 1994) or RNAfold (Hofacker, 2003). Here, we are confronted with the problem that very often the real motif is not found in the minimum free energy structure, but in some sub-optimal structures.

A better strategy is to assign several structures to each sequence covering different possible folds of the sequence. To generate an ensemble of low energy structures, we have used the stochastic backtracking version of RNAsubopt (Vienna RNA package) as well as RNAshapes (Giegerich et al., 2004). The latter avoids the production of a large number of similar structures. The result is a usually small set of different structures for a sequence S. In the following, we call ES the ensemble of the sequence S. Since the structures in ES occur with different frequencies in the low energy spectrum, they have to be weighted. The weight for each structure is given by the conditional probability of seeing this structure under the condition that only structures of the ensemble ES are considered. Thus, we have

(1)
where is the Boltzmann probability that S forms the structure . Since RNAshapes often returns structures with similar energies, we approximate by the uniform distribution in our current implementation of MARNA, thus avoiding the calculation of the Boltzmann probabilities.

Next, we have to use the different structures to form a single library for a pair of sequences. So let S1 and S2 be two sequences. Assume that we have selected n1 structures for the first sequence and n2 structures for the second one. In this setting, n1 = 1 (resp. n2 = 1) means that we have a unique known structure for S1 (resp. S2). Thus, we are able to mix sequences having known structures with sequences where we do not know the structures. Let and be the ensembles of structures selected for sequences S1 and S2, respectively. Then we perform n1 x n2 sequence–structure alignments for and (1 ≤ k ≤ n1, 1≤ l ≤ n2). All realized edges from these alignments are then collected into a single library. For edges that are common to several alignments, the weights are summed up. In order to achieve weights that are consistent with other libraries, the combined similarity values of the realized edges are normalized by multiplying them by .

Consensus structure
Once we have computed the final alignment, we are ready to calculate a consensus structure from this alignment. The standard approach is to estimate possibly conserved bonds by means of the mutual information content [e.g. Luck et al. (1999) Gutell and Woese (1990) Chiu and Kolodziejczak (1991) and Gutell et al. (1992)] between all columns i and j in a given alignment. The keynote is that if there is not very much sequence conservation in these columns, but the columns show a high correlation measured by the mutual information content, then this must be due to a conserved bond. Hofacker et al. (2002) extended this approach by considering the probabilities of forming these base-pairs.

In our case, the situation is different since we explicitly use structure information for the calculation of the alignment. Hence, the calculation of the consensus structure should be based on these given structures.

To exemplify the basic idea, suppose that exactly one structure per sequence is given. Thus, each structure must then be interpreted as the ‘real’ known structure. A conserved base pair between two columns in the alignment is found if the majority of sequences have a base pair at the corresponding sequence positions. The remaining problem is that the resulting set of conserved base pairs alone does not form a nested secondary structure and is thus not a valid consensus structure. This is a problem common to all approaches for calculating a consensus structure. The usual solution is to calculate a secondary structure that maximizes base pair conservation. So let c,c' be two columns with 1 ≤ c < c' ≤ m, where m is the number of columns of the multiple alignment. Furthermore, let bp_cons(c,c') be the number of sequences that have a base pair between the corresponding sequence positions. The consensus structure is then defined to be a secondary structure P {subseteq} [1.m] x [1.m] such that

is maximized. This can be calculated using a variant of the Nussinov algorithm (Nussinov et al., 1978). For this purpose, we define a matrix (Ni,j) with 1 ≤ i,j ≤ m, where

is the maximal base pair conservation for all columns between i and j. The corresponding recursion equation for this matrix is

It is a dynamic programming approach, where the traceback reports the consensus structure of the alignment.

Finally, we have again to consider the case where we are given structure ensembles for some (or all) sequences. The overall structure of the approach is the same, only the definition of conserved base pairs has to be adapted, i.e. the definition of bp_cons(c,c'). If we have several structures for one sequence, then the probability of seeing a particular base pair depends on the probabilities of the structures that contain this base pair. Hence, we can only calculate the expected number of occurrences of base pairs for two columns c and c'. Thus, we redefine bp_cons(c,c') as follows. Consider a multiple alignment of K sequences. For each sequence Sk, let Ek be the ensemble of structures calculated for Sk. For each column c, let be either the position that corresponds to column c in sequence Sk (if aligned), or otherwise. Furthermore, let {delta}P(c,c') be the index function of P, i.e. {delta}P(c,c') is 1 if (c,c') P, and 0 otherwise. Then

where Pr[P | ESk] is defined as given in Equation (1).

Complexity
Here, we assume that all sequences have nearly the same length L and that we generate an ensemble of E structures for each sequence. Note that by using RNAshapes, E is typically small (up to three sequences). The running time of one pairwise alignment is O(E2L4). We have to make N(N – 1)/2 comparisons in a set of N RNAs. Therefore, the pairwise comparison step needs O(E2N2L4) computation time. The most time consuming part of the multiple alignment step consists of building the extended library, which takes O(N3L2) steps in the worst case (Notredame et al., 2000). Altogether, the dominating alignment complexity is given by


    APPLICATION
 TOP
 Abstract
 INTRODUCTION
 METHODS
 APPLICATION
 DISCUSSION
 REFERENCES
 
Our algorithmic approach of multiple alignment of RNAs can be used online at our MARNA server. The maximal sequence length of one RNA is restricted to 500 bases. The maximal number of RNAs depends on the sequence lengths. The sum of all sequence lengths is restricted to 10 000 bases. MARNA is capable of aligning RNA sequences with known as well as unknown secondary structures. In the latter case, the user can choose whether to assign for every sequence a known structure or an ensemble of several structures automatically generated by RNAshapes or by stochastic backtracking (as part of RNAsubopt, implemented in the Vienna RNA Package).

We have tested MARNA on three datasets from the Rfam database (Griffiths-Jones et al., 2003). We compared the alignments as well as the consensus structures given from the Rfam database with the output of the two different alignment tools T-Coffee and PMmulti. The manual alignment and the consensus structure from the Rfam database serve as the reference. For MARNA and PMmulti, we have compared the consensus structures as proposed by the programs. If RNAalifold (Hofacker et al., 2002) yielded a better consensus structure, we have also displayed this one. For T-Coffee, we have used RNAalifold (Hofacker et al., 2002) to predict the consensus structure.

The first dataset consists of seven randomly chosen eukaryotic SECIS-elements (selenocysteine insertion sequence, Rfam accession number RF00031). SECIS-elements are necessary for the incorporation of selenocysteine into a protein sequence directed by an in-frame UGA codon (usually a stop codon) within the coding region of the mRNA. Selenoprotein mRNAs contain a conserved secondary structure in the 3' UTR that is required for the distinction of UGA stop from UGA selenocysteine. The sequences are ~60 nt in length and adopt a hairpin structure that is sufficiently well-defined and conserved, but the primary sequences differ. This dataset is especially hard for sequence–structure alignment programs since it contains four non-standard base pairs (U–U, G–A, A–G, C–U) in the lower part of the stem.

The manual alignment was made from 25 SECIS-elements out of a total set of currently 65 SECIS-elements. We have chosen 7 out of the 25 sequences randomly. The manual alignment is shown in Figure 3 and serve as the ‘true’ alignment. An alignment with T-Coffee reveals some nucleotide similarities among sequences, as expected, but are not suited for the prediction of a consensus structure. Here, MARNA detects the long stem structure with the characteristic bulged A's in the upper loop (Lambert et al., 2002). For this test case, we used the predicted minimum free energy (mfe) structure for each sequence.



View larger version (61K):
[in this window]
[in a new window]
 
Fig. 3 Comparison of multiple alignments and consensus structure predictions of seven randomly chosen SECIS-elements taken from the Rfam database. (a) The restriction of the manual alignment in Rfam to the seven sequences. The consensus structure is proposed by Rfam, i.e. generated of all 65 SECIS elements. (b) The T-Coffee alignment is based on nucleotide similarities and thus disregards structural conformations. It is not surprising that the consensus structure contains no base-pairing interactions. (c) PMmulti detects some bonds, but does not identify the hairpin. (d) MARNA finds the hairpin-like structure as well as the buldged A's in the upper loop of the motif (indicated by a red arrow), which are required for the SECIS-element.

 
We also aligned all the 65 SECIS-elements using MARNA with mfe structures and with ensembles calculated by RNAshapes and by stochastic backtracking (as part of RNAsubopt, Vienna RNA Package). We have compared these results with the results of PMmulti. The results are summarized in Table 2.


View this table:
[in this window]
[in a new window]
 
Table 2 Evaluation of MARNA and PMmulti alignments using the SP-score of the Bali Base benchmark program

 
The second dataset consists of 22 tRNA-like structures, found in the 3' UTR of Tymoviruses and Pomoviruses. They were also taken from the Rfam database. The family is thought to be involved in the initiation of minus-strand synthesis and the disruption of the pseudoknot gives rise to a 50% drop in transcription efficiency. MARNA (both with mfe structures as well as with structure ensembles) is able to detect the four stems as well as the single G between the first and second stem. For PMmulti, the four stems are detected when using RNAalifold in addition (Fig. 4).



View larger version (23K):
[in this window]
[in a new window]
 
Fig. 4 Consensus structure predictions for 22 tRNA-like structures, found in Tymovirus and Pomovirus. (a) The consensus structure of the manual alignment contains four distinct stems. (b) The proposed structure of PMmulti detects two of the four stems. (c) An improvement of the structure prediction applied to the same computed alignment can be achieved by RNAalifold (part of the Vienna RNA package). (d and e) MARNA is able to detect all four stems in case of assigning the mfe structure as well as the shape structure to each sequence. Both consensus structures are very close to the Rfam consensus structure.

 
Finally, we have used the objective function given by Bali Base benchmark program (Thompson et al., 1999) to compare all generated alignments, again taking the Rfam alignments as a reference. The benchmark program returns two scores, namely SP (sum of pairs) and TC (total columns). SP measures the ratio of the number of correctly aligned pairs, whereas TC measures the number of correctly aligned columns. Since conservation of columns is different in sequence alignments and sequence–structure alignments, we have used only the SP-score. The results are summarized in Table 2.


    DISCUSSION
 TOP
 Abstract
 INTRODUCTION
 METHODS
 APPLICATION
 DISCUSSION
 REFERENCES
 
We have presented a multiple alignment method for RNAs considering both the primary sequences and the secondary structures. It generates pairwise sequence–structure alignments and combines them using T-Coffee. Hence, MARNA is not only a structure alignment tool, but also considers sequence similarities. The main advantage is to set individual parameter values capable of weighting either sequence or structure properties. Concerning structures, one can use either user-defined structures, or let MARNA predict an ensemble of low energy structures.

MARNA can be tested online on our website. Although a pairwise comparison needs time complexity of O(n2m2) for two RNAs of lengths n and m, and thus limits the input sequence lengths to 500 bases, MARNA has been tested successfully on many RNAs like tRNAs, rRNAs and ncRNAs.

This paper is based on a previous version published in the German Conference of Bioinformatics (Siebert and Backofen, 2003). It has been extended in several ways. We have added the use of structure ensembles and provided a formal description for the calculation of the alignment weights. Furthermore, we added the calculation of a consensus structure and compared our results with PMmulti using the Balibase benchmark test.


    Acknowledgments
 
The authors would like to thank Anke Busch, Michael Hiller and Sebastian Will for reading the manuscript and for their helpful comments. This work was partially supported by the DFG within the national project ‘Selenoproteine’.

Conflict of Interest: none declared.

Received on April 1, 2005; revised on June 16, 2005; accepted on June 16, 2005

    REFERENCES
 TOP
 Abstract
 INTRODUCTION
 METHODS
 APPLICATION
 DISCUSSION
 REFERENCES
 

    Backofen, R. and Will, S. (2004) Local sequence-structure motifs in RNA. Journal of Bioinformatics and Computational Biology (JBCB), 2, 681–698.

    Brown, M.P. RNA Modeling Using Stochastic Context-Free Grammars (1999) , Santa Cruz >Ph. D. thesis University of California.

    Chiu, D.K. and Kolodziejczak, T. (1991) Inferring consensus structure from nucleic acid sequences. Comput. Appl. Biosci., 7, 347–352[Abstract/Free Full Text].

    Corpet, F. and Michot, B. (1994) RNAlign program: alignment of RNA sequences using both primary and secondary structures. Comput. Appl. Biosci., 10, 389–399[Abstract/Free Full Text].

    Couzin, J. (2002) Breakthrough of the year. Small RNAs make big splash. Science, 298, 2296–2297[Free Full Text].

    Dowell, R.D. and Eddy, S.R. (2004) Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction. BMC Bioinformatics, 5, 71[CrossRef][Medline].

    Eddy, S.R. and Durbin, R. (1994) RNA sequence analysis using covariance models. Nucleic Acids Res., 22, 2079–2088[Abstract/Free Full Text].

    Giegerich, R., et al. (2004) Abstract shapes of RNA. Nucleic Acids Res., 32, 4843–4851[Abstract/Free Full Text].

    Griffiths-Jones, S., et al. (2003) Rfam: an RNA family database. Nucleic Acids Res., 31, 439–441[Abstract/Free Full Text].

    Gutell, R.R., et al. (1992) Identifying constraints on the higher-order structure of RNA: continued development and application of comparative sequence analysis methods. Nucleic Acids Res., 20, 5785–5795[Abstract/Free Full Text].

    Gutell, R.R. and Woese, C.R. (1990) Higher order structural elements in ribosomal RNAs: pseudo-knots and the use of noncanonical pairs. Proc. Natl Acad. Sci. USA, 87, 663–667[Abstract/Free Full Text].

    Höchsmann, M., et al. (2003) Local similarity in RNA secondary structures. Proceedings of Computational Systems Bioinformatics (CSB 2003)Stanford, CA , pp. 159–168.

    Hofacker, I.L. (2003) Vienna RNA secondary structure server. Nucleic Acids Res., 31, 3429–3431[Abstract/Free Full Text].

    Hofacker, I.L., et al. (2004) Alignment of RNA base pairing probability matrices. Bioinformatics, 20, 2222–2227[Abstract/Free Full Text].

    Hofacker, I.L., et al. (2002) Secondary structure prediction for aligned RNA sequences. J. Mol. Biol., 319, 1059–1066[CrossRef][Web of Science][Medline].

    Jiang, T., et al. (2002) A general edit distance between RNA structures. J. Comput. Biol., 9, 371–388[CrossRef][Web of Science][Medline].

    Jiang, T., et al. (1995) Alignment of trees—an alternative to tree edit. Theore. Comp. Sci., 143, 137–148[CrossRef].

    Knudsen, B. and Hein, J. (2003) Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucleic Acids Res., 31, 3423–3428[Abstract/Free Full Text].

    Lambert, A., et al. (2002) A survey of metazoan selenocysteine insertion sequences. Biochimie, 84, 953–959[Medline].

    Lenhof, H.P., et al. (1998) A polyhedral approach to RNA sequence structure alignment. J. Comput. Biol., 5, 517–530[Web of Science][Medline].

    Lin, G.-H., Ma, B., Zhang, K. (2001) Edit distance between two RNA structures. Proceedings of the 5th Annual International Conferences on Compututational Molecular Biology (RECOMB'01) , Montreal, Canada ACM Press.

    Luck, R., et al. (1999) Construct: a tool for thermodynamic controlled prediction of conserved secondary structure. Nucleic Acids Res., 27, , pp. 4208–4217[Abstract/Free Full Text].

    Morgenstern, B., et al. (1998) DIALIGN: finding local similarities by multiple sequence alignment. Bioinformatics, 14, 290–294[Abstract/Free Full Text].

    Notredame, C., et al. (2000) T-Coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol., 302, 205–217[CrossRef][Web of Science][Medline].

    Nussinov, R., et al. (1978) Algorithms for loop matchings. SIAM J. Appl. Math., 35, 68–82[CrossRef].

    Saitou, N. and Nei, M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol., 4, 406–425[Abstract].

    Sankoff, D. (1985) Simultaneous solution of the RNA folding, alignment and protosequence problems. SIAM J. Appl. Math., 45, 810–825[CrossRef].

    Shapiro, B.A. and Zhang, K.Z. (1990) Comparing multiple RNA secondary structures using tree comparisons. Comput. Appl. Biosci., 6, 309–318[Abstract/Free Full Text].

    GCB 2003 Siebert, S. and Backofen, R. (2003) Marna: a server for multiple alignment of rnas. , Neremberg, Gatching near Munich, Germany 135–140.

    Smith, T. and Waterman, M. (1981) Comparison of biosequences. Adv. Appl. Math., 2, 482–489[CrossRef].

    Thompson, J.D., et al. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680[Abstract/Free Full Text].

    Thompson, J.D., et al. (1999) A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res., 27, 2682–2690[Abstract/Free Full Text].

    Zuker, M. (1994) Prediction of RNA secondary structure by energy minimization. Meth. Mol. Biol., 25, 267–294[Medline].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
C. Kemena and C. Notredame
Upcoming challenges for multiple sequence alignment methods in the high-throughput era
Bioinformatics, October 1, 2009; 25(19): 2455 - 2465.
[Abstract] [Full Text] [PDF]


Home page
RNAHome page
D. Fan, P. B. Bitterman, and O. Larsson
Regulatory element identification in subsets of transcripts: Comparison and integration of current computational methods
RNA, August 1, 2009; 15(8): 1469 - 1482.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
M. Wolf, B. Ruderisch, T. Dandekar, J. Schultz, and T. Muller
ProfDistS: (profile-) distance based phylogeny on sequence--structure alignments
Bioinformatics, October 15, 2008; 24(20): 2401 - 2402.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
A. Wilm, D. G. Higgins, and C. Notredame
R-Coffee: a method for multiple alignment of non-coding RNA
Nucleic Acids Res., May 1, 2008; 36(9): e52 - e52.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
S. Lindgreen, P. P. Gardner, and A. Krogh
MASTR: multiple alignment and structure prediction of non-coding RNAs using simulated annealing
Bioinformatics, December 15, 2007; 23(24): 3304 - 3311.
[Abstract] [Full Text] [PDF]


Home page
Brief BioinformHome page
I. M. Meyer
A practical guide to the art of RNA gene prediction
Brief Bioinform, November 1, 2007; 8(6): 396 - 414.
[Abstract] [Full Text] [PDF]


Home page
RNAHome page
K. Chakrabarti, M. Pearson, L. Grate, T. Sterne-Weiler, J. Deans, J. P. Donohue, and M. Ares Jr
Structural RNAs of known and unknown function identified in malaria parasites by comparative genomics and RNA analysis
RNA, November 1, 2007; 13(11): 1923 - 1939.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
X. Xu, Y. Ji, and G. D. Stormo
RNA Sampler: a new sampling based algorithm for common RNA secondary structure prediction and structural alignment
Bioinformatics, August 1, 2007; 23(15): 1883 - 1891.
[Abstract] [Full Text] [PDF]


Home page
Am J Trop Med HygHome page
S. Jongwutiwes, C. Putaporntip, M. Charoenkorn, T. Iwasaki, and T. Endo
Morphologic and Molecular Characterization of Isospora belli Oocysts from Patients in Thailand
Am J Trop Med Hyg, July 1, 2007; 77(1): 107 - 112.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
B. Voss
Structural analysis of aligned RNAs
Nucleic Acids Res., November 14, 2006; 34(19): 5471 - 5481.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
M. Hiller, R. Pudimat, A. Busch, and R. Backofen
Using RNA secondary structures to guide sequence motif finding towards single-stranded regions
Nucleic Acids Res., October 18, 2006; 34(17): e117 - e117.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
J. Schultz, T. Muller, M. Achtziger, P. N. Seibel, T. Dandekar, and M. Wolf
The internal transcribed spacer 2 database--a web server for (not only) low level phylogenetic analyses.
Nucleic Acids Res., July 1, 2006; 34(Web Server issue): W704 - W707.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
D. Dalli, A. Wilm, I. Mainz, and G. Steger
STRAL: progressive alignment of non-coding RNA using base pairing probability vectors in quadratic time
Bioinformatics, July 1, 2006; 22(13): 1593 - 1599.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
S. H. Bernhart, I. L. Hofacker, and P. F. Stadler
Local RNA base pairing probabilities in large sequences
Bioinformatics, March 1, 2006; 22(5): 614 - 615.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/16/3352    most recent
bti550v2
bti550v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (33)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Siebert, S.
Right arrow Articles by Backofen, R.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Siebert, S.
Right arrow Articles by Backofen, R.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?