Bioinformatics Advance Access originally published online on November 15, 2007
Bioinformatics 2007 23(24):3304-3311; doi:10.1093/bioinformatics/btm525
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
MASTR: multiple alignment and structure prediction of non-coding RNAs using simulated annealing

1Bioinformatics Centre and 2Molecular Evolution Group, Department of Molecular Biology, University of Copenhagen, Ole Maaloes Vej 5, 2200 Copenhagen N, Denmark
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: As more non–coding RNAs are discovered, the importance of methods for RNA analysis increases. Since the structure of ncRNA is intimately tied to the function of the molecule, programs for RNA structure prediction are necessary tools in this growing field of research. Furthermore, it is known that RNA structure is often evolutionarily more conserved than sequence. However, few existing methods are capable of simultaneously considering multiple sequence alignment and structure prediction.
Result: We present a novel solution to the problem of simultaneous structure prediction and multiple alignment of RNA sequences. Using Markov chain Monte Carlo in a simulated annealing framework, the algorithm MASTR (Multiple Alignment of STructural RNAs) iteratively improves both sequence alignment and structure prediction for a set of RNA sequences. This is done by minimizing a combined cost function that considers sequence conservation, covariation and basepairing probabilities. The results show that the method is very competitive to similar programs available today, both in terms of accuracy and computational efficiency.
Availability: Source code available from http://mastr.binf.ku.dk/
Contact: stinus{at}binf.ku.dk
| 1 INTRODUCTION |
|---|
|
|
|---|
In recent years, the amount of evidence that RNAs play a much more active role in the cell than previously thought has grown dramatically. The view has now shifted away from the assumption that non-coding RNAs (ncRNA) merely helped in the protein synthesis (e.g. tRNA, rRNA), and today a wide variety of catalytically active RNAs or ribozymes have been characterized. It has also become clear that ncRNA is a very diverse group of molecules both in terms of function and structure.
RNA molecules have been found to play important and diverse roles (Athanasius F. Bompfünewerer Consortium et al., 2007; Bompfünewerer et al., 2005; Meyer, 2007): the recently discovered family of microRNAs (miRNA) is involved in gene expression and cell specialization, vault RNAs seem to be involved in multi-drug resistance important for the treatment of cancer, and small nuclear RNAs (snRNA) are key players in the splicing of pre-mRNA. A large number of other ncRNA families have yet to be functionally characterized.
It has also become clear that these non-protein coding genes vary greatly in size, ranging from microRNAs of
20 nt to more than 10 000 nt in RNAs involved in eukaryotic gene silencing (Jossinet et al., 2007), and also that they are transcribed in different ways: some reside in introns of protein coding genes, while others are large transcripts that include introns and the possibility of alternative splicing although they lack the open reading frame of a protein coding gene (Meyer, 2007).
Experimental studies show that a huge fraction of the human genome is transcribed (Cheng et al., 2005), and computational studies show evidence that thousands of structurally conserved RNAs can be found in the human genome (Pedersen et al., 2006; Washietl et al., 2005). There is therefore little doubt that RNAs are biologically very important, and the structural analysis of RNA sequences is a field of growing interest. Through evolution, the sequences of related RNAs can diverge although the structure remains conserved. Pure sequence comparison methods therefore fail when applied to ncRNAs that have diverged too much (Gardner et al., 2005).
It is ultimately the tertiary structure that determines the function of the molecule, and advances are being made in this field (Das and Baker, 2007; Shapiro et al., 2007). However, in the case of RNA it is often sufficient to determine the secondary structure. The reason is that the formation of secondary structure is fast, and the basepairing interactions are strong. The secondary structure, therefore, contributes the major part of the folding energetics, forming a stable scaffold for the formation of tertiary interactions (Onoa and Tinoco, 2004).
There exist methods to fold a single RNA sequence either by maximizing basepairing interactions (Nussinov et al., 1978), or by minimizing the free energy of the structure [mfold (Zuker and Stiegler, 1981); RNAfold (Hofacker, 2003)]. Another approach is to use an existing sequence alignment and predict a consensus structure based on this. In RNAalifold (Hofacker et al., 2002), this has been pursued using a combination of free energy and covariation. In Pfold Knudsen and Hein, 1999, 2003), a stochastic context-free grammar (SCFG) is used to predict a common structure from a multiple alignment.
If pseudoknots are disregarded, an RNA structure can be represented as a tree. Since comparison of strings can be extended to trees (Tai, 1979; Zhang and Shasha, 1989), alignments could be based on the structures directly. In RNAforester (Höchsmann et al., 2003), the input is a set of RNA sequences with (possibly predicted) secondary structures, and the problem thus becomes a forest alignment problem. The program performs either local or global alignment of the structures, and the output is an alignment and predicted consensus structure based on the structural similarities. MARNA (Siebert and Backofen, 2005) is another heuristic method, where a multiple structural alignment is inferred from all pairwise alignments of secondary structures.
Due to the tight relationship between sequence and structure, the solution to the sequence alignment problem and the structure prediction is interdependent. Whether aligning without considering the structure, or folding without considering sequence alignment, information is ignored. Ideally, one should therefore perform the sequence alignment and structure prediction in parallel. In 1985, Sankoff presented an exact solution to this
–hard problem (Sankoff, 1985), but the exponential running time of
(L3N) and memory usage of
(L2N) makes it intractable even for problems of moderate size.
Various implementations of the Sankoff algorithm exist. Foldalign (Gorodkin et al., 1997; Havgaard et al., 2005) and Dynalign (Mathews and Turner, 2002) are both limited to local pairwise alignment. In PMcomp/PMmulti (Hofacker et al., 2004), the optimal alignment of two basepairing probability matrices is found instead of aligning the RNA sequences per se. A multiple alignment can be built using progressive alignment of the basepairing probability matrices. A similar progressive approach is used in FoldalignM (Torarinsson et al., 2007). LocARNA (Will et al., 2007) is a local alignment tool similar to but more efficient than PMcomp, and this program can also be used to do progressive multiple alignment and structure prediction. In RNAcast (Reeder and Giegerich, 2005), the consensus structure problem is dealt with in a different way: by using abstract shapes (Giegerich et al., 2004), where the structures can be regarded without all details but only using the layout of the structure, the search space is reduced. RNAcast predicts the best common shape for all the sequences and, for each sequence, the energetically best structure.
In RNA Sampler (Xu et al., 2007) stems are the core building blocks. For each sequence, a list of all possible stems consisting of consecutive A–U, G–C and G–U pairs is generated. A pairwise alignment is found by aligning all stems from one sequence with all stems from another, and the loop regions are aligned using ClustalW (Thompson et al., 1994). Since bulges are not allowed in stems, the alignment process can be done efficiently by sliding one stem along the other. Such a block of aligned stems has a conservation score including both nucleotide alignment probabilities and basepairing probabilities. From the set of blocks, a common structure is found by sampling, and the probability of a block being chosen depends on the conservation score. The probability matrices are then updated based on the sampled structures, and the process is iterated until convergence. This process has been extended to multiple sequences by considering all pairs in a set of multiple RNA sequences.
SimulFold (Meyer and Miklos, 2007) is a fully probabilistic model using Bayesian Markov chain Monte Carlo. The program takes as input a set of unaligned sequences D and samples both multiple alignment A, secondary structure S and a phylogenetic tree T from the joint posterior probability P(S,A,T|D). This very comprehensive program came out very recently, but although it has some methodology in common with MASTR (e.g. sampling based on the likelihood of the solution) it does so in a very different way, which also shows in the computational complexity of the program.
Since the exact solution to the problem is too time and memory consuming to be pursued, all the methods above are simplified in one way or another. Furthermore, it has been suggested that the optimal minimum free energy structure is not necessarily a good solution to the consensus structure problem (Ding and Lawrence, 2003). We therefore pursue a heuristic sampling approach where the structure and sequence alignment can be optimized in parallel. In our approach a cost function (or energy) is defined as a sum of three terms: an alignment term, a structure term and a covariance term. This cost function is minimized using simulated annealing to obtain the combined alignment and structure with minimum cost—the best solution according to the cost function. This optimization is carried out by changing the structure on the basepair level or by moving gaps around in the sequence alignment. The change is then judged by the change in the cost function and either accepted or rejected. The procedure is implemented in the program called MASTR (Multiple Alignment of STructural RNAs).
| 2 METHODS |
|---|
|
|
|---|
2.1 Defining the cost function
To find a solution to the problem of simultaneous multiple alignment and structure prediction, we define a cost function that will be minimized in order to search for the optimal solution. A solution should minimize a combined cost function cost(A,
2.1.1 Calculating alignment cost
There exist many ways of determining the cost of a multiple alignment: Sum of Pairs using a substitution matrix and minimization of the entropy of the alignment are two well–known examples (Durbin et al., 1998), and using a phylogenetic tree to sum the pairwise alignments inferred by the edges has also been pursued (Hein, 1989).
During the development of the algorithm, various sequence cost functions were examined. Sum of Pairs and different entropy-based measures were tested using both single nucleotide and dinucleotide domains. We selected the best performing cost function, which proved to be a log-likelihood cost function inspired by Hidden Markov models (HMMs) over single nucleotides.
The cost function is fully probabilistic in its treatment of both gaps and nucleotides. We assume independence between the sites in the alignment. When calculating the cost, we have an alignment of length L consisting of N sequences. Let
denote the jth character in sequence i, and let
denote the probability of seeing character
at this specific position. Assuming the sites are independent, the probability of the multiple alignment A becomes:
|
|
The individual character probabilities need to be determined and gaps have to be taken into account. If
is a gap we have two cases: let PGO denote the gap open probability, i.e. the probability of having a gap at position j given that position j – 1 contained a nucleotide. Similarly, PGE is the gap extension probability used when both position j and j – 1 contain a gap. Both of these probabilities can be estimated from known structural alignments. In the program, they are set to PGO = 0.5 and PGE = 0.74.
If
is a nucleotide from the alphabet
= {A,C,G,U}, the probability
is calculated based on the nucleotides that comprise the column. Additionally, the probability is dependent on the preceding character. If
, we have a gap closing, and the probability is multiplied by (1 – PGE). Similarly, if
, the probability is multiplied by (1 – PGO). Let cj (a) be the number of occurrences of nucleotide a
at position j in the alignment. The probabilities are given as:
|
|
A simple pseudo-count function is used where cj (a) is incremented by 1 for each a
and 1
j
L. An IUPAC ambiguity character is exchanged with one of the nucleotides it symbolizes with equal probability. For instance, if an N occurs in a sequence, it is replaced by any one of the 4 nt with 25% chance each. Similarly, an S will be exchanged with either a C or a G with a 50% chance each.This exchange is done once in the beginning of the program. Having these probabilities, P(A) can be calculated. The cost function used is the negative log-likelihood based on the alignment probability:
|
| (1) |
2.1.2 Calculating structure cost
The cost of the structure is defined as the sum of the cost of the individual basepairs. Let
be the structure consisting of basepairs (i, j):
|
|
There are two ways to score the structure: by the free energy of single sequences and by covariation. In the present work, we use the two measures that proved best at predicting true basepairs in our previous study (Lindgreen et al., 2006): The McCaskill basepair probabilities (McCaskill, 1990), called P(bpi, j), and a novel version of the covariation measure used in RNAalifold (Hofacker et al., 2002) extended to include stacking of basepairs, called C(bpi, j).
McCaskill showed how to calculate the partition function over all possible secondary structures of an RNA sequence. The basepair probabilities are found using the weighted Boltzman ensemble favoring more stable structures. We use RNAfold, which is part of the Vienna package (Hofacker, 2003), to calculate the probability matrices. Since gaps are added to the sequences as part of the alignment this has to be taken into account when indexing the matrices: the partition function is calculated once for each ungapped sequence s = 1, ..., N before the optimization starts, and the results are stored in individual probability matrices Ms. These matrices do not change throughout the algorithm. For a basepair (i, j) in the alignment, we need to correct the indices to be able to find the probability for that particular basepair. Let these gap-corrected indices be denoted (is, js), where is = i – Ms and Ms is the number of gaps preceding position i in sequence s, and similarly for index j. The probability for the basepair in sequence s is then found as Ms(is, js). If either i or j is a gap in sequence s, Ms(is, js) = 0. A basepair (i, j) in the alignment is scored by the mean probability:
|
|
To transform this into a cost function for the basepairs, the negative logarithm of the mean probability is taken and a threshold is introduced. The threshold reflects the background probability Pnull of random basepairs found in the probability matrices:
|
| (2) |
Through evolution, related RNA sequences can mutate which leads to different sequences of nucleotides while the same core secondary structure is retained. When a mutation happens at a position that is involved in a basepair, selection favors mutations at the other position that maintain the structure and molecular function. This is known as compensatory mutations. Thus, structure is often more conserved than sequence, and this signal can be measured by a covariation score.
In Lindgreen et al. (2006), we analyzed various measures of covariation. We refer to this article for details, but here the chosen cost function will be briefly explained. The RNAalifold measure uses a matrix
for each sequence
, where
if sequence
can form a basepair between position i and j, and
otherwise. The function
measures the Hamming distance between two aligned pairs at positions i and j in sequences
and β. The goal is to measure the fraction of consistently aligned pairs. A penalty term, qi, j, measures the fraction of sequences with inconsistent pairs in the alignment. The covariation is then found as:
|
|
To add stacking information, the two basepairs enclosing the pair in question are also considered, but more weight is put on the actual pair:
|
|
To turn this into a cost function, the same approach is used as for the partition function above. The covariation score is negated and a threshold value added:
|
| (3) |
A threshold of
= 0.25 is used. Using the two cost functions costP(i, j) and costC (i, j) [Equations (2) and (Equation (3), respectively], the predicted structure can be evaluated and a move either accepted or rejected based on Equation (4) below.
2.1.3 Combined cost
Since we simultaneously optimize both sequence alignment and structure prediction, the cost function is a combination of three terms: the log-likelihood cost in Equation (1), the basepair probability cost in Equation (2), and the covariation cost in Equation (Equation (3). The cost of the secondary structure is given as a sum over all basepairs in the structure
:
|
|
The parameters
and β are parameters used to balance the contributions from the different terms in the combined cost. As default, they are set to
= 1.5 and β = 0.6, which are obtained from an initial grid search parameter optimization (data not shown).
2.2 Optimizing the solution
Simulated annealing (Kirkpatrick et al., 1983) is an optimization technique inspired by the physical process of annealing, which describes the slow cooling of material to form a crystal structure. The idea is that the positions of the individual atoms can be described as a probability distribution dependent on the temperature of the system: at high temperatures the atoms have a high energy and therefore move around, but as the temperature is lowered, the system becomes more stable. The goal is to form crystals with few defects, and the most stable crystal structure is the one with the lowest free energy. If the temperature of the system is decreased too fast, the crystal structure becomes brittle since the system becomes stuck in a local energy minimum. If the temperature is decreased slowly, the local energy minima can be avoided due to the thermal fluctuations, and the structure becomes more ordered and stable, and the minimum free energy conformation may be reached.
Simulated annealing works in analogy to this. In order to escape from local minima of a cost or energy function, steps towards worse states (i.e. higher cost) should be taken often in the beginning (at high temperature) and occasionally later at lower temperatures. This is done in a Monte Carlo simulation with an artificial temperature parameter that has absolutely no physical meaning. The probability of acceptance depends on the change in cost (huge increases should be accordingly improbable) as well as on the number of iterations (since the system is closer to the stable optimum towards the end). Given an infinite amount of time, it can be shown that simulated annealing will approach the optimal solution to any finite problem (Häggström, 2002). Simulated annealing can be used to minimize any cost function, and has for instance been used for multiple alignment (Lukashin et al., 1992).
Simulated annealing depends on an artificial temperature T that decreases over time. Initially the temperature should be high enough to give an unstable system—in this case an alignment prone to changes. As more and more changes are sampled, the temperature decreases to stabilize the system. Normally the temperature decreases exponentially (Lukashin et al., 1992), although there is no theoretical reason for this. If the new cost is lower than the previous, the change is always accepted. If the change increases the cost, the chance of acceptance P depends on the old cost cOLD, the new (larger) cost cNEW and the temperature T:
|
| (4) |
This is known as the Metropolis–Hastings algorithm (Hastings, 1970, Kirkpatrick et al., 1983). Using this, the possible states are sampled based on the cost of the current state. Since a state only depends on the previous state, this generates a Markov chain. In combination, MCMC using simulated annealing can be used to sample the solution space of multiple alignments and RNA structures. Changes can be made by moving the gaps in the alignment and by adding or removing basepairs in the structure, and the move is either rejected or accepted based on the change in cost.
The initial alignment is built by adding gaps at random to all sequences until they have equal length. By default, the length of the initial alignment is 1.06 · Lmax where Lmax is the length of the longest sequence. The moves through the solution space can either affect the sequence alignment or the structure. Since it makes little sense to try and deduce a common structure from randomly aligned sequences, the first iterations are purely sequence moves. As the alignment becomes more stable, we start doing a combination of sequence and structure moves.
The total number of iterations performed depends both on the length of the longest sequence, since that affects the number of structure moves needed, and on the size of the alignment, since that affects the number of sequence moves needed. The alignment size is measured as the total number of nucleotides in the dataset, Ntotal. The dependencies are denoted Ndep and Ldep, respectively, and the number of iterations is found as:
|
|
We use Ndep = 1000 and Ldep = 1700 as default. After initially only performing sequence moves, a mixture of alignment and structure altering moves are performed. The structure moves are initiated either after a fixed fraction of the total number of iterations or, as per default, after Ndep · Ntotal iterations. The remaining iterations are a mix of sequence and structure moves. The ratio between the two is set by a parameter R. Per default, R = 0.75 of the last iterations are structure moves.
Initially, all moves are accepted (i.e. a temperature of infinity is used) and the first 0.1% of the iterations are used to determine a good starting temperature. These results are used to estimate the standard deviation
of the cost distribution. By deciding on the desired initial probability of acceptance P0 the temperature T0 can be determined:
|
|
|
|
|
|
2.2.1 Sequence moves
The moves aimed at changing the sequence alignment do this by moving gaps in the sequences. They can of course be moved without altering the order of the nucleotides. Three different types of moves are implemented which in combination ensures that the alignment can be reduced, extended and altered locally.
- Gap block move: local changes are facilitated through gap blocks. A gap block is a subsequence consisting of 1 or more consecutive gaps in 1 or more aligned sequences. To make this move, a random gap in a random sequence is picked. Then the gap block is extended vertically with probability 0.85 through the other sequences containing a gap at that position. Afterwards, the gap block is extended horizontally to both sides with probability 0.85 if all the chosen sequences contain a gap there. Finally, the gap block is moved to a randomly chosen new position in the alignment. The procedure is illustrated in Figure 1 and constitutes 85% of the sequence moves.
- Gap insertion: insertion of gaps has to insert the same number of gaps in all sequences. One could insert the gaps at random positions in all sequences, but that would greatly affect the cost of the alignment. Instead, the gaps are inserted in either end of the alignment. From these positions the gaps can diffuse into the alignment as needed. These moves constitute 10% of the sequence moves.
- Gap deletion: removing gaps is done by locating gap columns, i.e. columns in the alignment containing a gap in all of the sequences. Using a well–defined cost function, superfluous gaps are placed in the same columns of the alignment. These moves constitute 5% of the sequence moves.
|
2.2.2 Structure moves.
Structural moves either add or delete basepairs. The structure is forced to contain only non–crossing basepairs (i.e. prediction of pseudoknots is not yet supported), and a minimum loop length of 3 nt is ensured. Using the three simple moves described below, the structure can be built, extended and reduced.
- Adding a basepair: a new basepair is added by choosing a nucleotide pair (i, j) at random and adding it to the structure if it does not violate the constraints. These moves constitute 70% of the structure moves.
- Extending a stem: a stem is extended by choosing a basepair (i, j) already in the structure. The stem that includes basepair (i, j) is then extended by adding a new basepair to it, with a 50% chance of extending the stem either internally or externally. These moves constitute 20% of the structure moves.
- Deleting a basepair: deleting a basepair is done by choosing a pair (i, j) in the structure and removing it. This cannot lead to any new violations of the constraints. These moves constitute 10% of the structure moves.
2.3 Datasets
Since consensus structures were not available from BRaliBase II (Gardner et al., 2005) at the time, we sampled alignments with consensus structures in much the same way as in the BRaliBase study. MASTR relies on the partition function to calculate basepair probability matrices, so we have chosen to use only short (
70–250 nt) sequences in the test. There are known problems when using the partition function to calculate basepair probability matrices for long sequences. Hence, the program will not perform well on long sequences until this has been dealt with.
Datasets were generated from large, trusted seed alignments obtained from Rfam (Griffiths-Jones et al., 2005). The 5 families chosen are tRNA, 5S ribosomal RNA (5S rRNA), U5 spliceosomal RNA (U5), Hepatitis C virus internal ribosome entry site (IRES) and TPP riboswitch THI element (TPP).
From each RNA family, a number of datasets were sampled. Each dataset has an average pairwise identity within a specified 10% interval. These intervals go from 30% to 100% with 5% overlap, i.e. 30–40%, 35–45%, 40–50%, etc. This sampling procedure gave 14 datasets from tRNA, 5S rRNA and U5, 8 datasets from TPP and 2 datasets from IRES. In total, 52 datasets were sampled and details on average pairwise identity, number of sequences and average sequence lengths can be seen in Table 1. The datasets can be obtained from http://mastr.binf.ku.dk/.
|
| 3 RESULTS |
|---|
|
|
|---|
The program MASTR is implemented in C++ and tested against the programs FoldalignM, LocARNA and RNA Sampler. RNAcast is used to produce input to RNAforester, and Clustal alignments were used as input to RNAalifold. All programs were used with their default settings except for RNAforester where the clustering cutoff had to be changed in order to produce complete alignments of all sequences. FoldalignM does not predict a single consensus structure but returns a structure for each sequence in the final alignment. Therefore, we define the consensus structure to be the basepairs that are predicted for all sequences. The other programs all predict a single multiple alignment and consensus structure. To evaluate the predicted structures, Matthew's Correlation Coefficient (MCC) is used:
|
|
Let TP be the number of truly predicted basepairs, FP be the number of predicted basepairs not in the reference structure and FN be the number of basepairs in the reference structure not predicted by the program. TN is defined here as the number of possible basepairing interactions in a sequence that are not predicted and not in the reference structure, i.e. pairs of nucleotide xy that are at least 4 nt apart, and where xy
{AU,UA,CG,GC,UG,GU}.
To evaluate the quality of the alignment, the Sum of Pairs score (SPS) (Thompson et al., 1999) is used. SPS is a sensitivity–like measure that compares the predicted alignment to a reference. For each pair of aligned sequences, the number of aligned positions that are present in both the prediction and the reference alignment is counted. The total number of correctly aligned positions is then compared to the total number of aligned non–gap pairs present in the reference alignment. This yields a number between 0 and 1 where 1 is perfect correspondence between prediction and reference.
In Figure 2, the performance of the programs are compared in terms of structure quality (plot a), alignment quality (plot b) and running time (plot c) as a function of the average pairwise identity of the datasets (%ID). The plots are averages over the different RNA families used for each %ID point.
|
The test shows that MASTR can predict consensus structures of a quality comparable to other existing methods. On the lowest identity datasets MASTR is outperformed by RNA Sampler, but after
45% ID the structure predictions of MASTR are on average better than or comparable to the best programs tested. MASTR is consistently better than or comparable to all other methods regarding alignment quality. As it can be seen, MASTR is clearly faster than FoldalignM by up to an order of magnitude, while LocARNA is even faster. Clustal + RNAalifold is of course by far the fastest method used. RNAforester has a running time comparable to LocARNA but produces consistently worse alignments—probably due to the fact that all sequences had to be included in the same alignment for comparison. The structure predictions are comparable to FoldalignM. The dip in the quality of the structure predictions that is visible for all methods in the highest identity range could be explained by the lack of covariation in these datasets. Most of the methods rely on some signal from compensatory mutations, and this signal diminishes as the sequences become too similar. RNA Sampler does not depend on a covariation term that could explain why the dip is less prominent here. Likewise, FoldalignM shows an almost monotone increase in the predictions as a function of identity.
Since RNA Sampler seems to be the best of the other methods tested here, a more detailed comparison of MASTR and RNA Sampler can be seen in Table 1. In total, 52 datasets are used. The running time is on the same scale for the two programs, although MASTR is in general slightly faster. The structure predictions are better than or equal to RNA Sampler in 28 cases (54%), and the alignments are better than or equal to RNA Sampler in 43 cases (83%). The differences seem to depend both on the RNA family and on the level of identity.
| 4 DISCUSSION |
|---|
|
|
|---|
We have developed a new algorithm for simultaneous alignment and structure prediction of multiple non-coding RNA sequences. As shown above, MASTR is highly competitive both in terms of structure prediction quality, sequence alignment and running time. The program can also handle larger datasets than, e.g. RNA Sampler or FoldalignM.
Although we have not used it in the above tests, it is also possible to add structural constraints if some knowledge is available about one of the sequences (e.g. known basepairs, knowledge about upstream or downstream interactions, or knowledge about non-basepaired positions). Additionally, already aligned sequences can be used as input with or without a consensus structure.
As the testing of the program showed, MASTR does not have top performance on very dissimilar sequences. In this range, one would assume that covariance is important, and it is therefore interesting that RNA sampler, which does not use covariance, is better. One possible explanation for this is that covariation in itself is not enough to deduce structure from alignment. Covariation is only an indicator of conserved basepairs, but it is not sufficient to predict pairing columns [this corresponds well with our previous study (Lindgreen et al., 2006)]. MASTR builds up the structure in small steps, which might make it vulnerable to erronously high covariation, whereas RNA Sampler makes sure that the alignment is structurally sound by fixing whole stems. MASTR therefore needs to have a relatively stable (and correct) alignment before predicting structure. This could explain the relatively poor performance on low identity datasets, and this should be explored further.
In future work, we would like to make a local version of the program. This would be ideal for dealing with long sequences where there are known problems with the standard basepair probability matrices. Since MASTR does not have the same limitations towards crossing basepairs as pure energy-based methods, an extension to include the prediction of pseudoknots will also be pursued.
One of the advantages of MASTR is that the optimization is decoupled from the cost function, which makes it very easy to change the latter. It also makes it reasonably straight forward to add phylogenetic prediction to the program, which would be similar to the goal of SimulFold (Meyer and Miklos, 2007), but MASTR functions in a very different way. We would also like to make it possible to input a set of related and already aligned sequences together with the set of unaligned sequences. Thus, new sequences can be aligned to reference alignments in a structurally sound manner.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
S.L. acknowledges support from the Novo Scholarship Program. The work was supported by the Carlsberg Foundation and the Danish National Research Foundation. The authors thank two anonymous referees for their helpful comments; especially one reviewer for interesting comments on covariation.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Dmitrij Frishman
Present address: Wellcome Trust Sanger Institute, Cambridge, UK. ![]()
Received on August 20, 2007; revised on October 11, 2007; accepted on October 13, 2007
| REFERENCES |
|---|
|
|
|---|
Athanasius F. Bompfünewerer Consortium. RNAs everywhere: genome-wide annotation of structured RNAs. J. Exp. Zoolog. Mol. Dev. Evol (2007) 308:1–25.
Bompfünewerer AF, et al. Evolutionary patterns of non-coding RNAs. Theory Biosci (2005) 123:301–369.[CrossRef][Medline]
Cheng J, et al. Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution. Science (2005) 308:1149–1154.
Das R, Baker D. Automated de novo prediction of native-like RNA tertiary structures. Proc. Natl Acad. Sci (2007) 104:14664–14669.
Ding Y, Lawrence CE. A statistical sampling algorithm for RNA secondary structure prediction. Nucleic Acids Res (2003) 31:7280–7301.
Durbin R, et al. Biological Sequence Analysis. Probabilistic Models of Proteins and Nucleic Acids (1998) Cambridge, UK: Cambridge University Press.
Gardner P, et al. A benchmark of multiple sequence alignment programs upon structural RNAs. Nucleic Acids Res (2005) 33:2433–2439.
Giegerich R, et al. Abstract shapes of RNA. Nucleic Acids Res (2004) 32:4843–4851.
Gorodkin J, et al. Finding the most significant common sequence and structure motifs in a set of RNA sequences. Nucleic Acids Res (1997) 25:3724–3732.
Griffiths-Jones S, et al. Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res (2005) 33:D121–D124.
Häggström O. Finite Markov Chains and Algorithmic Applications (2002) Cambridge, UK: Cambridge University Press.
Hastings WK. Monte carlo sampling methods using Markov chains and their applications. Biometrika (1970) 57:97–109.
Havgaard JH, et al. Pairwise local structure alignment of RNA sequences with sequence similarity less than 40%. Bioinformatics (2005) 21:1815–1824.
Hein J. A new method that simultaneously aligns and reconstructs ancestral sequences for any number of homologous sequences when the phylogeny is given. Mol. Biol. Evol (1989) 6:649–668.[Abstract]
Höchsmann M, et al. Local similarity of RNA secondary structures. (2003) Proceedings of the IEEE Computational Systems Bioinformatics Conference (CSB 2003). 159–168.
Hofacker I. Vienna RNA secondary structure server. Nucleic Acids Res (2003) 31:3429–3431.
Hofacker I, et al. Alignment of RNA base pairing probability matrices. Bioinformatics (2004) 20:2222–2227.
Hofacker IL, et al. Secondary structure prediction for aligned RNA sequences. J. Mol. Biol (2002) 319:1059–1066.[CrossRef][Web of Science][Medline]
Jossinet F, et al. RNA structure: bioinformatic analysis. Curr. Opin. Microbiol (2007) 10:279–285.[CrossRef][Web of Science][Medline]
Kirkpatrick S, et al. Optimization by simulated annealing. Science (1983) 220:671–680.
Knudsen B, Hein J. RNA secondary structure prediction using stochastic context-free grammars and evolutionary history. Bioinformatics (1999) 15:446–454.
Knudsen B, Hein J. Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucleic Acids Res (2003) 31:3423–3428.
Lindgreen S, et al. Measuring covariation in RNA alignments: physical realism improves information measures. Bioinformatics (2006) 22:2988–2995.
Lukashin AV, et al. Multiple alignment using simulated annealing: branch point definition in human mRNA splicing. Nucleic Acids Res (1992) 20:2511–2516.
Mathews DH, Turner DH. Dynalign: an algorithm for finding the secondary structure common to two RNA sequences. J. Mol. Biol (2002) 317:191–203.[CrossRef][Web of Science][Medline]
McCaskill JS. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers (1990) 29:1105–1119.[CrossRef][Web of Science][Medline]
Meyer IM, Miklos I. SimulFold: simultaneously inferring RNA structures including pseudoknots, alignments and trees using a Bayesian MCMC framework. PLoS Comput. Biol (2007) 3:e149.[CrossRef][Medline]
Meyer IM. A practical guide to the art of RNA gene prediction. Brief. Bioinformatics (2007) Epub ahead of print, PMID: 17483123.
Nussinov R, et al. Algorithms for loop matchings. SIAM J. Appl. Math (1978) 35:68–82.[CrossRef]
Onoa B, Tinoco I Jr. RNA folding and unfolding. Curr. Opin. Struct. Biol (2004) 14:374–379.[CrossRef][Web of Science][Medline]
Pedersen J, et al. Identification and classification of conserved RNA secondary structures in the human genome. PLoS Comput. Biol (2006) 2:e33.[CrossRef][Medline]
Reeder J, Giegerich R. Consensus shapes: An alternative to the Sankoff algorithm for RNA consensus structure prediction. Bioinformatics (2005) 21:3516–3523.
Sankoff D. Simultaneous solution of the RNA folding, alignment and protosequence problems. SIAM J. Appl. Math (1985) 45:810–825.[CrossRef]
Shapiro BA, et al. Bridging the gap in RNA structure prediction. Curr. Opin. Struct. Biol (2007) 17:157–165.[CrossRef][Web of Science][Medline]
Siebert S, Backofen R. MARNA: multiple alignment and consensus structure prediction of RNAs based on sequence structure comparisons. Bioinformatics (2005) 21:3352–3359.
Tai K. The tree-to-tree correction problem. J. ACM (1979) 26:422–433.[CrossRef]
Thompson JD, et al. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res (1994) 22:4673–4680.
Thompson JD, et al. A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res (1999) 27:2682–2690.
Torarinsson E, et al. Multiple structural alignment and clustering of RNA sequences. Bioinformatics (2007) 23:926–932.
Washietl S, et al. Mapping of conserved RNA secondary structures predicts thousands of functional noncoding RNAs in the human genome. Nat. Biotechnol (2005) 23:1383–1390.[CrossRef][Web of Science][Medline]
Will S, et al. Inferring noncoding RNA families and classes by means of genome-scale structure-based clustering. PLoS Comput. Biol (2007) 3:e65.[CrossRef][Medline]
Xu X, et al. RNA Sampler: a new sampling based algorithm for common RNA secondary structure prediction and structural alignment. Bioinformatics (2007) 23:1883–1891.
Zhang K, Shasha D. Simple fast algorithms for the editing distance between trees and related problems. SIAM J. Comput (1989) 18:1245–1262.[CrossRef]
Zuker M, Stiegler P. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res (1981) 9:133–148.
This article has been cited by other articles:
![]() |
S. H. Bernhart and I. L. Hofacker From consensus structure prediction to RNA gene finding Brief Funct Genomic Proteomic, November 1, 2009; 8(6): 461 - 471. [Abstract] [Full Text] [PDF] |
||||
![]() |
R. R. Stocsits, H. Letsch, J. Hertel, B. Misof, and P. F. Stadler Accurate and efficient reconstruction of deep phylogenies from structured RNAs Nucleic Acids Res., October 1, 2009; 37(18): 6184 - 6193. [Abstract] [Full Text] [PDF] |
||||
![]() |
D. Fan, P. B. Bitterman, and O. Larsson Regulatory element identification in subsets of transcripts: Comparison and integration of current computational methods RNA, August 1, 2009; 15(8): 1469 - 1482. [Abstract] [Full Text] [PDF] |
||||
![]() |
R. K. Bradley, L. Pachter, and I. Holmes Specific alignment of structured RNA: stochastic grammars and sequence annealing Bioinformatics, December 1, 2008; 24(23): 2677 - 2683. [Abstract] [Full Text] [PDF] |
||||
![]() |
K. Katoh and H. Toh Recent developments in the MAFFT multiple sequence alignment program Brief Bioinform, July 1, 2008; 9(4): 286 - 298. [Abstract] [Full Text] [PDF] |
||||
![]() |
C. B. Do, C.-S. Foo, and S. Batzoglou A max-margin model for efficient simultaneous alignment and folding of RNA sequences Bioinformatics, July 1, 2008; 24(13): i68 - i76. [Abstract] [Full Text] [PDF] |
||||
![]() |
E. Torarinsson and S. Lindgreen WAR: Webserver for aligning structural RNAs Nucleic Acids Res., July 1, 2008; 36(suppl_2): W79 - W84. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||







