Skip Navigation


Bioinformatics Advance Access originally published online on December 15, 2005
Bioinformatics 2006 22(4):445-452; doi:10.1093/bioinformatics/btk008
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/4/445    most recent
btk008v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (22)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Yao, Z.
Right arrow Articles by Ruzzo, W. L.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Yao, Z.
Right arrow Articles by Ruzzo, W. L.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

CMfinder—a covariance model based RNA motif finding algorithm

Zizhen Yao 1,*, Zasha Weinberg 1 and Walter L. Ruzzo 1,2

1Department of Computer Science and Engineering, University of Washington Seattle WA 98195-2350, USA
2Department of Genome Sciences, University of Washington Seattle WA 98195-2350, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 RELATED WORK
 3 METHODS
 4 RESULTS
 5 CONCLUSION
 REFERENCES
 

Motivation: The recent discoveries of large numbers of non-coding RNAs and computational advances in genome-scale RNA search create a need for tools for automatic, high quality identification and characterization of conserved RNA motifs that can be readily used for database search. Previous tools fall short of this goal.

Results: CMfinder is a new tool to predict RNA motifs in unaligned sequences. It is an expectation maximization algorithm using covariance models for motif description, featuring novel integration of multiple techniques for effective search of motif space, and a Bayesian framework that blends mutual information-based and folding energy-based approaches to predict structure in a principled way.

Extensive tests show that our method works well on datasets with either low or high sequence similarity, is robust to inclusion of lengthy extraneous flanking sequence and/or completely unrelated sequences, and is reasonably fast and scalable. In testing on 19 known ncRNA families, including some difficult cases with poor sequence conservation and large indels, our method demonstrates excellent average per-base-pair accuracy—79% compared with at most 60% for alternative methods. More importantly, the resulting probabilistic model can be directly used for homology search, allowing iterative refinement of structural models based on additional homologs. We have used this approach to obtain highly accurate covariance models of known RNA motifs based on small numbers of related sequences, which identified homologs in deeply-diverged species.

Availability: Results and web server version are available at http://bio.cs.washington.edu/yzizhen/CMfinder/

Contact: yzizhen{at}cs.washington.edu

Supplementary information: Supplementary technical details are available at http://bio.cs.washington.edu/yzizhen/CMfinder/


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 RELATED WORK
 3 METHODS
 4 RESULTS
 5 CONCLUSION
 REFERENCES
 
Non-coding RNAs (ncRNAs) are functional RNA molecules that do not code for proteins. In the last five years, many discoveries of highly structured cis-regulatory elements in 5' or 3'-untranslated regions (5'- or 3'-UTR) of mRNAs point to a variety of biological roles for ncRNAs, including localization, replication, translation, degradation and stabilization of transcripts (Conne et al., 2000; Mandal et al., 2003; Hentze and Kuhn, 1996). Notable examples are the mRNA elements known as riboswitches—parts of mRNA molecules that can directly bind a small target molecule to regulate their own activity (Mandal et al., 2003; Winkler and Breaker, 2003; Mandal et al., 2004).

Important computational advances, such as development of the Rfam database (Griffiths-Jones et al., 2003) and fast genome-scale covariance model (CM) searches (Eddy and Durbin, 1994; Weinberg and Ruzzo, 2004a, b), aid RNA research significantly. A key problem remaining is to identify conserved secondary structure motifs among related sequences, and characterize them by models that can be used for homology search. For example, identification of such motifs in untranslated regions of orthologous bacterial genes has been critical to the discovery of new riboswitches (Barrick et al., 2004), but available techniques require significant manual work. Comparative sequence analysis is generally recognized as the most reliable method of RNA structure prediction, but these methods can fail when sequence conservation is too low (owing to poor alignments) or too high (owing to lack of sequence covariation). Single-sequence structure prediction is inaccurate, and simultaneous multiple sequence alignment and folding is computationally expensive. Finally, these tools generally do not interface smoothly with RNA homology search tools.

In this paper, we present a new algorithm for solving this motif discovery problem. Oversimplifying considerably, it is an expectation maximization (EM) algorithm like MEME (Bailey and Elkan, 1995), but instead of weight matrix models, it captures RNA secondary structures with covariance models (Eddy and Durbin, 1994). Because of the greatly increased complexity of the problem, we applied several techniques to solve scalability and convergence issues. These include use of careful heuristics for choosing a set of candidate structure elements to initialize the EM iteration. To improve the accuracy of consensus secondary structure in the M-step, we have combined mutual information with data-dependent, position-specific priors for base pairing based on a thermodynamic model. The key merits of our solution include:

  • Applicable to unaligned input with unrelated sequences, long flanking regions and/or low sequence similarity;
  • Reasonably fast and scalable with respect to the number and length of input sequences;
  • Producing a motif structural alignment and statistical model that can be directly used for homology search.
The third point is particularly important, because we view this tool as only one component of a discovery pipeline wherein a motif model built from a small dataset will be used to find more instances, allowing the model to be extended and refined iteratively.

Extensive testing demonstrates that these goals are largely met. For most tests, the hand-curated ‘seed’ alignments from Rfam are our gold standard. CMfinder achieved better results than other methods in 17 of 19 tests, and predicted base pairs with average 77% sensitivity, 81% specificity and 79% accuracy, compared with at most 60% accuracy for the other methods. Most disagreements with Rfam are local perturbations such as small shifts or extra base pairs. Finally, we have very encouraging results from preliminary tests combining our method with other computational tools to construct a pipeline for novel ncRNA family discovery. In particular, integrating with the ‘Footprinter’ DNA motif discovery tool (Blanchette and Tompa, 2003) and an RNA genome search tool (Weinberg and Ruzzo, 2004a,b), CMfinder produced highly accurate models of several large cis-regulatory RNA families, starting from unaligned 5'-UTRs of small sets of orthologous genes. For example, a blind test identified 447 instances of a bacterial motif that turned out to be the T-box leader. These included 87% of its 342 Rfam family members, plus an additional 148 hits, 89 of which are strongly supported by functional annotation. In this and six other datasets the CMfinder models identify members of corresponding Rfam families with 92% specificity and 80% sensitivity on average. We are currently applying this technique to genome scale search for novel cis-regulatory RNA motifs in microorganisms.


    2 RELATED WORK
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 RELATED WORK
 3 METHODS
 4 RESULTS
 5 CONCLUSION
 REFERENCES
 
A classical approach to find RNA motifs is to construct a consensus RNA secondary structure from a given multiple sequence alignment based on covariation, thermodynamic stability, phylogeny, etc. (Eddy and Durbin, 1994; Hofacker et al., 2002; Akmaev et al., 2000; Gulko and Haussler, 1996; Knudsen and Hein, 1999). These methods require a good alignment, thus are unreliable for datasets with low sequence similarity. Coventry et al. (2004) identify conserved RNA regions based on sequence alignment by searching for correlated reverse-complementary regions. This method is robust to small variations of the alignment, but not major perturbations.

An alternative approach is to fold sequences separately, then find a consensus secondary structure, e.g. using a tree edit algorithm as in Hofacker et al. (1994) and Höchsmann et al. (2003). A key drawback of this approach is the inaccuracy of secondary structure prediction for a single sequence (Dowell and Eddy, 2004). Sankoff (1985); Gorodkin et al. (1997); Havgaard et al. (2005); Mathews and Turner (2002) solve this problem by inferring the alignment and folding simultaneously using dynamic programing. However, computational expense limits these approaches to small datasets.

Probabilistic approaches such as EM algorithms and Gibbs sampling have been applied to infer consensus RNA structure from unaligned sequences, based on covariance models or stochastic context-free grammars (Eddy and Durbin, 1994; Sakakibara et al., 1994; Grate et al., 1994). These methods, however, are not designed for the motif discovery problem in which RNA motifs are only present in local regions of a subset of sequences.

Graph theoretical techniques have been proposed to identify RNA motifs in unaligned sequences by comparison and assembly of stable stems (Touzet and Perriquet, 2004; Ji et al., 2004; Bafna et al., 2005). See Gardner and Giegerich (2004) for a recent comparative survey on RNA structure prediction.


    3 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 RELATED WORK
 3 METHODS
 4 RESULTS
 5 CONCLUSION
 REFERENCES
 
The basic idea of CMfinder is to use a CM to model an RNA motif, a finite mixture model to describe motif distribution in sequences, and an EM framework to search the motif space. It is motivated by two previous techniques: the DNA motif finding program MEME (Bailey and Elkan, 1995), and the CM-based RNA analysis tool COVE (Eddy and Durbin, 1994). The combination of these two methods is non-trivial owing to the increased complexity: in MEME, the motif model is an ungapped weight matrix with a relatively short, fixed length window, while RNA motifs are generally much longer with significant secondary structures and frequent insertions/deletions. COVE assumes that each sequence is an instance of the model and performs global alignment rather than the more difficult local alignment we need. In the EM framework, the expanded search space and higher model complexity suggest that it is infeasible to explore all subsequences as MEME does, and having a good starting point is critical. To address these issues, CMfinder chooses motif candidates with potentially stable secondary structures, selects a conserved set across all sequences and aligns them heuristically. This step is loosely similar to Carnac (Touzet and Perriquet, 2004) and ComRNA (Ji et al., 2004). The subsequent EM iteration refines the model and alignment. The following section elaborates this idea, and further technical details are available in the Supplementary information.

3.1 Construction of heuristic initial alignment
The goal of this step is to identify the approximate location and structure of the motif efficiently. The key design issue is the trade-off between accuracy and efficiency. As the motif will be refined later, we can tolerate alignment errors provided correctly aligned instances are well-represented, but need robustness to differences in dataset size, sequence similarity and consensus structure.

3.1.1 Candidate selection
We first eliminate a large portion of the search space by focusing on strong candidates, i.e. segments with potentially stable secondary structure. For each input sequence, we use Vienna (Hofacker et al., 1994) to compute the minimum free energy of all subsequences, sort them according to their free energy scaled by sequence length, then choose the top ranking candidates iteratively. Similar ideas in Carnac and ComRNA use simple stems as candidates, but in our context allowing a richer set of candidates gives better performance.

3.1.2 Candidate comparison and alignment
To find the consensus alignment of the candidates, our next step is to compare their predicted secondary structures. We use the tree-edit algorithm of Hofacker et al. (1994), modified to compare candidates at the single base or base pair level so that the comparison is sensitive to both sequence and structure. This improves its ability to distinguish between RNAs with relatively simple structures. The complexity of this algorithm is approximately quadratic in the length of the candidates, thus far more efficient than the Sankoff-style algorithms (Sankoff, 1985; Gorodkin et al., 1997; Mathews and Turner, 2002). It is a simpler framework for structure and sequence comparison than the the heuristics used in Carnac and ComRNA. Its drawbacks include the potential inaccuracy of secondary structure prediction and the simplified edit distance model. Despite the efficiency of the tree-edit algorithm, pairwise comparisons of numerous candidates can be expensive. To improve efficiency further, we only align two candidates if they are compatible with locally conserved regions found by BLAST search. This heuristic also improves accuracy by preventing obvious misalignments. CARNAC and comRNA also rely on similar anchor-based techniques to reduce their otherwise prohibitive computational cost; it is less critical in our case, but still valuable.

To construct the initial alignment, we want to choose one candidate from each sequence so as to minimize the sum of pairwise scores. We approximately solve this NP-hard problem using a heuristic that selects a central consensus candidate, and structually aligns it to its best match in each sequence. We repeat this process on the unselected candidates to find 1–10 initial alignments as seeds for the EM algorithm.

3.2 Refining alignments via CM-based EM
To improve the quality of the initial alignments, we adopt an iterative EM-like algorithm based on covariance models. The CM (Eddy and Durbin, 1994) is a probabilistic model for RNA families that cleanly describes both the secondary structure and the primary sequence consensus. We apply a finite mixture model to describe a sequence as a mixture of regions that follow the background distribution and regions that follow the motif CM, then use the EM algorithm to estimate the model and motif instances simultaneously. For clarity, we first assume that motif instances only occur among candidates, a restriction we relax in Section 3.2.3. The following notations are used:

  • M: the motif CM.
  • B: the background distribution.
  • {Gamma} = (M,B,{gamma}): the finite mixture model, where {gamma} is the mixture probability that a sequence contains a motif.
  • N: the total number of sequences.
  • S = (si)1 ≤ i ≤ N: the input sequences.
  • m: the number of candidates in each sequence.
  • Ci = (cij)1 ≤ j≤ m: the candidate set of sequence si.
  • {Pi}i = ({pi}ij): the alignments of candidates Ci with M.
  • Xi = (xij) : the occurrence of the motif in Ci (xij = 1 if cij is a motif instance, and xij = 0 otherwise).
  • D = (L1,L2, ... , Ll): the sequence alignment. Li: a column.
  • {sigma} = ({alpha},ß): consensus secondary structure for D. {alpha}: a set of (indices of) single-stranded columns. ß: a set of (pairs of indices of) base paired columns.
The aim is to find {Gamma} = (M,B,{gamma}) that maximizes the log likelihood

Formula
The E-step estimates the hidden variables Xi and {Pi}i, while the M-step updates the CM M and {gamma}. The major CM functions (e.g. alignment, scanning, etc.) are adopted from Eddy and Durbin (1994), and will not be explained here.

3.2.1 The E-step
For each candidate cij of sequence si, we need to estimate two hidden variables: alignment {pi}ij and motif assignment xij. Given a motif assignment, the inside–outside algorithm is usually used for CM parameter estimation by summing over all possible alignment paths. However, in order to update the model structure, we need a sequence alignment (see M-step). We use the Viterbi algorithm to compute the optimal alignment {pi}ij of candidate cij. Assuming zero or one motif occurrence per sequence ({sum}jxij ≤ 1), the motif assignment probabilities can be estimated based on a mixture model (Bailey and Elkan, 1995):

Formula 1(1)
where {lambda} = {gamma}/m. To save the cost of the inside algorithm for P(cij|M), we use the probability of the optimal alignment P({pi}ij|M) instead. Although this approximation tends to underestimate P(xij = 1), most real motif instances distinguish themselves from the background so dramatically that this approximation is sufficient. The resulting probability can be interpreted as the probability of a candidate with the suggested structural alignment being a motif instance.

3.2.2 The M-step
Here, we update M and {gamma}. The maximum likelihood estimate of {gamma} is simply Formula 1. To update M, we first need to adjust the structure of M, then infer the transition and emission probabilities. Given the structure, the second issue can be solved using a Bayesian posterior estimate with Dirichlet prior (Eddy and Durbin, 1994), hence, we focus on the first issue. This is easy in MEME because the structure is predetermined. In CMfinder, it is equivalent to finding a consensus secondary structure. We formulate this in the following Bayesian framework.

Our goal is to find Formula 1. Assuming independence of non-base paired columns, then

Formula 2(2)

Formula 3(3)

Formula 3
Using maximum likelihood parameter estimates and a multinomial model for each column/column pair, Iij is the mutual information between columns i and j. Without prior information, the optimal structure maximizes {sum}(i,j)isinßIij, which is the approach adopted by COVE. This method works well in large, phylogenetically diverse datasets, but not when there is insufficient covariance, as would be expected with a few closely related sequences.

We solve the problem by introducing an informative prior on structures. Let qi be the prior for column i to be single stranded, and pij the prior for columns i, j to be base paired, then Formula 3, and P(D,{sigma}) can be rewritten as

Formula 4(4)

Formula 4
then the maximum likelihood structure {sigma} maximizes {sum}(i,j)isinß Kij. We infer a prior on structures based on a thermodynamic model. For each sequence, we calculate the partition function Pij (Hofacker et al., 1994; McCaskill, 1990), which estimates the probability of forming base pair i,j, averaged over all possible structures. We estimate the column pairing probabilities pij by averaging the partition functions of the aligned sequences. The probability that a column is unpaired is estimated as qi = 1 – {sum}j pij. Note that candidates are weighted based on their probabilities to be motif instances when computing Iij, pij and qi. Finally, we use a dynamic programming algorithm to choose a set of compatible base pairs maximizing the sum of Kij. Since pij and qi are predicted from the given sequences, they are not ‘priors’ in a strict sense. However, the mutual information and the partition function look at the same data from different perspectives: the mutual information measures the conservation of base pairs in the particular sequences from a evolutionary point of view, while the partition function is based on a thermodynamic model that is generically applicable to all RNAs. Combining them gives us the power of both approaches: we rely on the energy model when there is little mutual information and use mutual information if the structure is ambiguous based on the energy model. In comparison with our method, RNAalifold (Hofacker et al., 2002) uses a linear combination of the energy contribution and mutual information. Our probabilistic integration provides some justification for combining these two seemingly disparate elements.

3.2.3 Adjusting candidates
We introduce a second EM phase identical to the first, except that the CM is used to scan each sequence, and the top hits in each are treated as candidates. This helps to discover motif instances missed by initial candidate selection and to refine the alignment.

3.2.4 Adding additional family members
The EM framework enables automatic update of the covariance models based on new sequences that may contain the motif. We simply repeat the EM algorithm using the existing motif as seed. This feature is highly effective at discovering large and diverse RNA families from a small set of related sequences; see Section 4.3.

3.2.5 Combining Motifs
In theory, this method can find arbitrarily large motifs, but in practice, it works best with relatively short ones (<100 bases). To overcome this, we apply a greedy algorithm to merge multiple motifs hierarchically. See Supplementary information for details.

3.2.6 Run time
The computation time of the EM algorithm is dominated by the Viterbi alignment, with complexity O(NL3|M|) for each iteration (where L is the maximum sequence length and |M| is the number of states in the CM). The EM algorithm generally converges in <15 iterations. Overall, a typical CMfinder run (on <60 sequences of <1 Kb average length) takes 1–60 min, depending on the number and the complexity of motifs, and is practical for most applications.


    4 RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 RELATED WORK
 3 METHODS
 4 RESULTS
 5 CONCLUSION
 REFERENCES
 
Rfam is a large collection of multiple sequence alignments and CMs for ncRNA families (Griffiths-Jones et al., 2003). It contains a CM built from a hand curated ‘seed’ alignment for each ncRNA family. Additional homologs are then predicted by searching genome databases with the model. We used Rfam seed alignments to evaluate CMfinder's performance on three increasingly difficult tasks:

  1. Discovery: given unaligned seed members of an Rfam family (together with flanking regions), can CMfinder construct a good alignment and CM to characterize the family?
  2. Robustness: How does the quality of the alignment degrade as more flanking sequence is added and as seed sequences are replaced by unrelated sequences?
  3. Scale-up: Does it scale to plausible genome-wide discovery scenarios, where the initial set of sequences are taken, say, from a small group of orthologous genes? Are the sensitivity and specificity of the CMfinder model adequate when used to scan several gigabases of genome sequence?
In all the three tasks, our results are strong.

4.1 Discovery
We selected 19 families from Rfam (release 6.1, Aug 2004) as our test data, with varying length (26–216 bases), sequence identity (43–81%), and number of family members (9–75). This dataset captures the diversity of known RNA families, while excluding highly conserved ones, and emphasizing cis-regulatory elements, especially riboswitches. For each family, we took the seed alignment as the motif, and included 200 bases of genomic sequence flanking the motif, randomly distributed between the 5' and 3' sides, to simulate the realistic situation where motif locations are unknown.

We compared CMfinder with RNAalifold (Hofacker et al., 2002), Pfold (Knudsen and Hein, 1999), Foldalign (Havgaard et al., 2005; Gorodkin et al., 1997), ComRNA and Carnac. The accuracy of all predictions are computed at the base pair level relative to Rfam annotation. Let TP be the correctly predicted base pairs, FP the falsely predicted base pairs and FN the true base pairs that are not predicted. The sensitivity is defined as TP/(TP + FN), specificity as TP/(TP + FP) and overall prediction accuracy as their geometric mean. For decent sensitivity and specificity, the latter metric approximates Matthews Correlation (Gorodkin et al., 2001). Note that this is a very stringent metric. Small shifts and extra base pairs are counted as false negatives/positives while they are considered neutral in some other work. More importantly, this penalizes incorrect motif boundaries in this local alignment setting.

Fairly benchmarking different programs created with disparate goals is tricky. We are sometimes using these tools for purposes that they were not designed or optimized to do. Nevertheless, we feel that our choices are plausible ones for our goals (automated, genome-scale RNA motif discovery) and constitute a reasonable comparison of these tools for that purpose. The results may also illustrate the consequences of such tool abuse. CMfinder performs well in this context, but the other tools may excel in other contexts (or this one, if more cleverly used). In general, all programs received the same input and were run with default parameters without any per-data-set tuning. As one exception, among the programs tested, only ComRNA can predict pseudoknots, but we chose non-default parameter settings preventing this, which may understate its performance. (Lacking a ‘gold standard’ for pseudoknots in the test data, all obvious alternatives seemed worse.) As another important exception, both RNAalifold and Pfold require aligned inputs, so we used ClustalW alignments; other programs were given unaligned inputs. For ComRNA, we set a run time constraint comparable with the run time of CMfinder, and chose the motif with the best accuracy for each dataset. For Foldalign, we tried both multiple alignment (Gorodkin et al., 1997) and pairwise alignment (Havgaard et al., 2005), and report the results for the latter as it is faster and generally gives better results. We summarize its accuracy as the average of all pairwise comparisons. Although pairwise alignments are not directly comparable with multiple alignments, the tool ultimately attempts to achieve the same goal as our tool, and it is interesting to test whether more sequences improve consensus structure prediction accuracy. For CMfinder, we produced up to 10 motifs for each dataset, combined them, and retained the motif with the best average alignment score as the final output.

The descriptions of each Rfam family and prediction accuracies of all tested methods are summarized in Table 1. More detailed comparisons including predicted structures and alignments are included in the Supplementary information. CMfinder achieved the best performance on all families except s2m and RFN. The disagreement with Rfam for s2m is because of a small shift of a helix. For RFN, the motif we produced is partial, and most prediction errors are local, and in regions with great sequence conservation. CMfinder significantly outperformed the other methods on families with low sequence conservation or short motifs such as the SECIS family. This is a difficult test case owing to low sequence similarity, and two conserved non-canonical G–A pairs in the stem–loop. The other four methods tested predict no base pairs, while CMfinder correctly aligns and annotates the region enclosed by the two G–A pairs. RNAalifold, Pfold, Carnac and ComRNA have relatively weak performance for such families, presumbly because sequence conservation is insufficient to delineate possible alignments. Inclusion of arbitrary flanking regions makes sequence-based alignment even harder. Although our initial pairwise alignment algorithm is much simpler than pairwise Foldalign, we gained more information by comparing all sequences. We have also tested a set of methods that perform global alignments on Rfam families without flanking regions. Again, CMfinder usually outperforms other methods; see Supplementary information.


View this table:
[in this window]
[in a new window]
 
Table 1 Summary of Rfam test families and results

 
To quantify the effectiveness of each component of our algorithm, we compared the performance of CMfinder with its four variants: (1) heuristic initial alignment before EM iteration (Ini); (2) initial alignment with EM iteration based on mutual information only (Ini_em_mi); (3) initial alignment with EM iteration based on folding energy only (Ini_em_fe) and (4) ClustalW alignment with EM iteration (Clustal_em). To speed the EM iteration, we trimmed regions with >10% gaps from both ends of the ClustalW alignment.

The motif prediction accuracy of the these five methods are shown in Figure 1. First, we observe that the EM iteration improves the prediction accuracy considerably, from an average of 66% in the initial alignments to 83%. Second, the energy-based partition function (Ini_em_fe) generally outperforms mutual information (Ini_em_mi) in the EM algorithm, while using mutual information in addition to folding energy yields improvements on SECIS and s2m. Third, CMfinder has better performance than Clustal_em except for RFN and S_box. For RFN, the CMfinder motif is only partial. Meanwhile, CMfinder is far more effective at locating poorly conserved and/or short motifs, such as IRE and SECIS.


Figure 1
View larger version (10K):
[in this window]
[in a new window]
 
Fig. 1 Comparison of CMfinder with its variants. The initial alignment corresponding to the best output motif is selected for each family. For the rightmost 6, final CMfinder motifs are combinations of multiple motifs, precluding comparison to ini, ini_em_mi and ini_em_fe.

 
This suggests that our heuristics are generally more robust, but ClustalW can be a complementary alternative for constructing initial alignments. Finally, there is significant improvement of Clustal_em over Pfold and RNAalifold. All three are based on ClustalW alignment, yet Clustal_em achieves 61% average prediction accuracy, compared to 36% for Pfold and 27% for RNAalifold. On families where both RNAalifold and Pfold fail, such as S_box, Cobalamin and Purine, Clustal_em has 55–80% prediction accuracy. To summarize, both the initial alignment procedure and the EM module are effective components of CMfinder, which make it reliable on a variety of datasets.

4.2 Robustness
We tested CMfinder on more challenging datasets with larger flanking regions and only a subset of the sequences containing real RNA motifs. To form each dataset, we randomly selected n family members, including a given length of flanking sequence, then permuted the motif regions in k of them (again, randomly selected). The latter sequences serve as control sequences; the rest, with real RNA motifs, are referred to as test sequences. We performed this test on Histone3, IRE and SECIS families. Only the IRE results are presented here; others are available at our supplementary website. Figure 2a shows the scores of the motif instances produced by CMfinder when tested on datasets with flanking regions varying from 50 to 400 bases on both the 5' and 3' sides, with 1/4 control sequences, while Figure 2b varies the fraction of control sequences from 1/8 to 1/2, all with 100-base flanking regions. We categorized three types of predicted motif instances: true and false motif instances in the test sequences (Tt and Ft respectively) and those in the control sequences (Fc). Tts and Fts are determined by whether their overlaps with corresponding Rfam motif instances are >10 bases. Most overlaps were shorter than 5 or longer than 25 bases.


Figure 2
View larger version (12K):
[in this window]
[in a new window]
 
Fig. 2 (a) and (b) Robustness Test. Each column represents a test dataset, each point represents a motif instance predicted in a sequence. Tt, True motif in test seq.; Ft, False motif in test seq.; Fc, False motif in control seq. (see text).

 
Figures 2a and b generally show good score separations between true motif instances (Tt) and false ones (Ft and Fc). As the flanking region increases, candidate selection becomes more difficult owing to the larger number of stable local structures (we used the same number of candidates in all the test cases), and good local alignments occur more easily by chance. Although the score differences between the true and false ones tend to decrease as the flanking region increases, the distinction is generally clear enough to differentiate the two types. There are a total of 20 false motif instances in all datasets above the score threshold 10, but closer examination reveals that 11 of them in fact correspond to real IREs (which often occurs in tandem) present in Rfam, but not seed members. The score difference between true and false motifs is even more apparent in Figure 2b. The only Ft turns out to be a real IRE element. In both figures, there is only one Tt with score <10. Overall, even in the presence of significant amounts of extraneous sequence, CMfinder successfully predicted Histone3, IRE and SECIS motifs, which are among the more difficult of our test families.

4.3 Scale-up: a framework for novel ncRNA discovery
CMfinder is applicable to unaligned sequences with low sequence conservation, is robust to noise, produces highly descriptive motif models and permits automatic update based on new sequence data. These features make it an effective tool for novel ncRNA discovery as we demonstrate in a pilot application of this type, sketched below.

Cis-regulatory elements such as riboswitches can be expected to be conserved in upstream regions of orthologous genes. In an ongoing collaboration with Martin Tompa and colleagues at UW, we have begun a systematic search for ncRNAs based on this scenario. In outline, we are doing the following. For each protein coding gene in a specific organism (Bacillus subtilis, initially), we find its 19 closest matches in GenBank bacterial genomes based on pairwise BLAST scores of protein sequences. We rank the datasets based on patterns of conservation identified by the DNA phylogenetic footprinting tool Footprinter (Blanchette and Tompa, 2003) in the upstream intergenic regions of each group of 20 orthologs. Finally, we apply CMfinder to analyze the top ranking datasets. The covariance models of promising RNA motifs are used to scan the bacterial genome database. In some cases, hits from this scan were used to refine the RNA motifs, and the genome scan repeated. Second scans used an E-value cutoff of 1, but initial scans used an cutoff of 100, as the initial models learned from small datasets tend to be overly specific. Of the initial 115 datasets, we found that over 30 correspond to ribosomal RNAs, 13 contain T_boxes, 22 contain riboswitches, with additional sporadic cases of RNase P, tRNAs, the CIRCE element (Narberhaus, 1999) and other DNA binding sites. We selected seven predicted motifs to analyze more fully and determined their agreement with corresponding Rfam families. The results are summarized in Table 2.


View this table:
[in this window]
[in a new window]
 
Table 2 Comparison of scan predictions with Rfam results

 
In one example, from upstream sequences of 20 orthologous folC genes, we found a conserved RNA motif with nine instances. The first and the second genome scan found 234 and 447 hits respectively. Jeffrey Barrick (personal communication) identified this as the T_box leader (Grundy et al., 1994), a cis-regulatory element that interacts with tRNA to control expression of tRNA processing genes. Our motif model discovered 299 out of 342 Rfam T_box members, and 89 out of 148 additional hits are upstream of and on the same strand as aminoacyl-tRNA synthetase genes, where most T_box leaders are found, and the others are largely in poorly annotated regions. While we started with a small set of sequences from phylogenetically closely related species (Bacillales/Clostridia), the genome scan discovered a diverged homolog in Symbiobacterium thermophilum (an Actinobacterium) included in Rfam, and another in Geobacter sulfurreducens (a {delta} Proteobacterium), presumably the one reported in Winkler et al. (2001).

We repeated the test for the other families, although for the metK and ribB datasets, we scanned the sequence database only once. Averaged over all seven families, our predicted motifs achieved 92% specificity and 81% sensitivity. Except for the T_box, the specificities range from 92 to 100%. For the T_box family, if the 89 additional hits supported by annotation are counted as true positives, then specificity increases to 87% from 67%. All families except ykoY have sensitivity over 80%. Its poor sensitivity is likely due to the fact that the motif is only learned from sequences upstream of ykoY genes, while the Rfam seed also includes motifs from yybP genes. Six of the seven datasets were picked because they held known riboswitches, to confirm that our methods ‘discover’ them. This may have unconsciously biased our execution of the experiment, but the T_box example was entirely blind—we did not know there was an RNA motif near folC genes, only discovering its identity after completing the scans. See Supplementary information for details.

In summary, we have automatically learned highly accurate CMs from small automatically constructed datasets. In contrast, the Rfam models are learned from hand curated seed alignments, usually containing many more sequences. Automated model construction should not supplant the high-quality curated Rfam seed alignments, but we are optimistic that it will allow broad-scale screening for new cis-regulatory ncRNAs with minimal manual intervention.


    5 CONCLUSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 RELATED WORK
 3 METHODS
 4 RESULTS
 5 CONCLUSION
 REFERENCES
 
In this paper, we presented an algorithm for finding conserved RNA motifs in a set of unaligned sequences. The key contribution of this paper is to propose an EM framework that integrates various judiciously chosen techniques for secondary structure prediction and alignment which together makes this approach computationally feasible, robust and accurate. In particular, we proposed a principled statistical framework that combines an energy model with mutual information, which significantly improves the performance. In addition, we have presented an effective RNA motif discovery pipeline using CMfinder and other tools to iteratively expand and refine the motifs automatically. We are now applying this discovery pipeline for large-scale screening of RNA elements in microorganisms, and our preliminary results are very promising. Our future work will involve designing motif significance test statistics, incorporating phylogenetic information, and better priors for covariance models.


    Acknowledgments
 
We thank the reviewers for constructive comments; J. Barrick for offering his biological expertise; B. Knudsen for sharing Pfold; A. Prakash, S. Neph and M. Tompa for valuable input at several stages and insightful comments on the manuscript. Supported in part by NHLBI 1 P01 HL072262-01, HG-00035 and NIEHS P30ES07033.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Thomas Lengauer

Received on June 9, 2005; revised on December 12, 2005; accepted on December 13, 2005

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 RELATED WORK
 3 METHODS
 4 RESULTS
 5 CONCLUSION
 REFERENCES
 

    Akmaev, V.R., et al. (2000) Phylogenetically enhanced statistical tools for RNA structure prediction. Bioinformatics, 16, 501–512[Abstract/Free Full Text].

    Bafna, V., Tang, H., Zhang, S. (2005) Consensus folding of unaligned RNA sequence revisited. Proc. Res. Comp. Mol. Biol, . Cambridge, MA, pp. p1.

    Bailey, T.L. and Elkan, C. (1995) The value of prior knowledge in discovering motifs with MEME. Proc. Intel. Sys. Mol. Biol, , St. Louis , pp. 21–29.

    Barrick, J.E., et al. (2004) New RNA motifs suggest an expanded scope for riboswitches in bacterial genetic control. Proc. Natl Acad. Sci. USA, 101, 6421–6426[Abstract/Free Full Text].

    Blanchette, M. and Tompa, M. (2003) FootPrinter: a program designed for phylogenetic footprinting. Nucleic Acids Res, . 31, 3840–3842[Abstract/Free Full Text].

    Conne, B., et al. (2000) The 3' untranslated region of messenger RNA: A molecular ‘hotspot’ for pathology? Nat. Med, . 6, 637–641[CrossRef][ISI][Medline].

    Coventry, A, et al. (2004) MSARI: multiple sequence alignments for statistical detection of RNA secondary structure. Proc. Natl Acad. Sci. USA, 101, 12102–12107[Abstract/Free Full Text].

    Dowell, R.D. and Eddy, S.R. (2004) Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction. BMC Bioinformatics, 5, 71[CrossRef][Medline].

    Eddy, S.R. and Durbin, R. (1994) RNA sequence analysis using covariance models. Nucleic Acids Res, . 22, 2079–2088[Abstract/Free Full Text].

    Gardner, P.P. and Giegerich, R. (2004) A comprehensive comparison of comparative RNA structure prediction approaches. BMC Bioinformatics, 5, 140[CrossRef][Medline].

    Gorodkin, J., et al. (1997) Finding the most significant common sequence and structure motifs in a set of RNA sequences. Nucleic Acids Res, . 25, 3724–3732[Abstract/Free Full Text].

    Gorodkin, J., et al. (2001) Discovering common stem–loop motifs in unaligned RNA sequence. Nucleic Acids Res, . 29, 2135–2144[Abstract/Free Full Text].

    Grate, L., et al. (1994) RNA modeling using Gibbs sampling and stochastic context free grammars. Proc. Int. Conf. Intell. Syst. Mol. Biol, . 2, 138–146[Medline].

    Griffiths-Jones, S., et al. (2003) Rfam: an RNA family database. Nucleic Acids Res, . 31, 439–441[Abstract/Free Full Text].

    Grundy, F., et al. (1994) Interaction between the acceptor end of tRNA and the T box stimulates antitermination in the Bacillus subtilis tyrS gene: a new role for the discriminator base. J. Bacteriol, . 176, 4518–4526[Abstract/Free Full Text].

    Gulko, B. and Haussler, D. (1996) Using multiple alignments and phylogenetic trees to detect RNA secondary structure. Pac Symp Biocomput, . 350–367.

    Havgaard, J., et al. (2005) Pairwise local structural alignment of RNA sequences with sequence similarity less than 40%. Bioinformatics, 21, 1815–1824[Abstract/Free Full Text].

    Hentze, M.W. and Kuhn, L.C. (1996) Molecular control of vertebrate iron metabolism: mRNA-based regulatory circuits operated by iron, nitric oxide, and oxidative stress. Proc. Natl Acad. Sci. USA, 93, 8175–8182[Abstract/Free Full Text].

    Höchsmann, M., Toller, T., Giegerich, R., Kurtz, S. (2003) Local similarity in RNA secondary structure. Proc. Compu. Sys. Bioinfo, . , Stanford, CA , pp. 159–168.

    Hofacker, I.L., et al. (2002) Secondary structure prediction for aligned RNA sequences. J. Mol. Biol, . 319, 1059–1066[CrossRef][ISI][Medline].

    Hofacker, I.L., et al. (1994) Fast folding and comparison of RNA secondary structure. Chemical Monthly, 125, 167–188[CrossRef].

    Ji, Y., et al. (2004) A graph theoretical approach for predicting common RNA secondary structure motifs including pseudoknots in unaligned sequences. Bioinformatics, 20, 1591–1602[Abstract/Free Full Text].

    Knudsen, B. and Hein, J. (1999) RNA secondary structure prediction using stochastic context-free grammars and evolutionary history. Bioinformatics, 15, 446–454[Abstract/Free Full Text].

    Mandal, M., et al. (2003) Riboswitches control fundamental biochemical pathways in Bacillus subtilis and other bacteria. Cell, 113, 577–586[CrossRef][ISI][Medline].

    Mandal, M., et al. (2004) A glycine-dependent riboswitch that uses cooperative binding to control gene expression [Erratum (2004) Science, 306, 1477]. Science, 306, 275–279[Abstract/Free Full Text].

    Mathews, D.H. and Turner, D.H. (2002) Dynalign: an algorithm for finding the secondary structure common to two RNA sequences. J. Mol. Biol, . 317, 191–203[CrossRef][ISI][Medline].

    McCaskill, J.S. (1990) The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers, 29, 1105–1119[CrossRef][ISI][Medline].

    Narberhaus, F. (1999) Negative regulation of bacterial heat shock genes. Mol. Microbiol, . 31, 1–8[CrossRef][ISI][Medline].

    Sakakibara, Y., Brown, M., Mian, I.S., Underwood, R., Haussler, D. (1994) Stochastic context-free grammars for modeling RNA. Proc. Hawaiian Inter. Conf. Sys. Sci.Hawaii , pp. 284–294.

    Sankoff, D. (1985) Simultaneous solution of the RNA folding, alignment and protosequence problems. SIAM J. Appl. Math, . 45, 810–825[CrossRef].

    Touzet, H. and Perriquet, O. (2004) CARNAC: folding families of related RNAs. Nucleic Acids Res, . 32, W142–W145[Abstract/Free Full Text].

    Weinberg, Z. and Ruzzo, W.L. (2004a) Faster genome annotation of non-coding RNA families without loss of accuracy. Proc. Res. Compu. Mol. Bio, . , Glasgow, Scotland , pp. 243–251.

    Weinberg, Z. and Ruzzo, W.L. (2004b) Exploiting conserved structure for faster annotation of non-coding RNAs without loss of accuracy. Bioinformatics, 20, suppl. 1, I334–I341.

    Winkler, W. and Breaker, R.R. (2003) Genetic control by metabolite-binding riboswitches. Chembiochem, . 4, 1024–1032[CrossRef][ISI][Medline].

    Winkler, W., et al. (2001) The GA motif: an RNA element common to bacterial antitermination systems, rRNA, and eukaryotic RNAs. RNA, 7, 1165–1172[Abstract].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Brief BioinformHome page
K. Katoh and H. Toh
Recent developments in the MAFFT multiple sequence alignment program
Brief Bioinform, July 1, 2008; 9(4): 286 - 298.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
E. Torarinsson and S. Lindgreen
WAR: Webserver for aligning structural RNAs
Nucleic Acids Res., July 1, 2008; 36(suppl_2): W79 - W84.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
A. Wilm, D. G. Higgins, and C. Notredame
R-Coffee: a method for multiple alignment of non-coding RNA
Nucleic Acids Res., May 1, 2008; 36(9): e52 - e52.
[Abstract] [Full Text] [PDF]


Home page
Genome Res.Home page
E. Torarinsson, Z. Yao, E. D. Wiklund, J. B. Bramsen, C. Hansen, J. Kjems, N. Tommerup, W. L. Ruzzo, and J. Gorodkin
Comparative genomics beyond sequence-based alignments: RNA structures in the ENCODE regions
Genome Res., February 1, 2008; 18(2): 242 - 251.
[Abstract] [Full Text] [PDF]


Home page
Brief BioinformHome page
I. M. Meyer
A practical guide to the art of RNA gene prediction
Brief Bioinform, November 1, 2007; 8(6): 396 - 414.
[Abstract] [Full Text] [PDF]


Home page
RNAHome page
E. S. Andersen, A. Lind-Thomsen, B. Knudsen, S. E. Kristensen, J. H. Havgaard, E. Torarinsson, N. Larsen, C. Zwieb, P. Sestoft, J. Kjems, et al.
Semiautomated improvement of RNA alignments
RNA, November 1, 2007; 13(11): 1850 - 1859.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
M. Khaladkar, V. Bellofatto, J. T. L. Wang, B. Tian, and B. A. Shapiro
RADAR: a web server for RNA data analysis and research
Nucleic Acids Res., July 13, 2007; 35(suppl_2): W300 - W304.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
Z. Weinberg, J. E. Barrick, Z. Yao, A. Roth, J. N. Kim, J. Gore, J. X. Wang, E. R. Lee, K. F. Block, N. Sudarsan, et al.
Identification of 22 candidate structured RNAs in bacteria using the CMfinder comparative genomics pipeline
Nucleic Acids Res., July 9, 2007; (2007) gkm487v1.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
E. Torarinsson, J. H. Havgaard, and J. Gorodkin
Multiple structural alignment and clustering of RNA sequences
Bioinformatics, April 15, 2007; 23(8): 926 - 932.
[Abstract] [Full Text] [PDF]


Home page
Genome Res.Home page
E. K. Freyhult, J. P. Bollback, and P. P. Gardner
Exploring genomic dark matter: A critical assessment of the performance of homology search methods on noncoding RNA
Genome Res., January 1, 2007; 17(1): 117 - 125.
[Abstract] [Full Text] [PDF]


Home page
Proc. Natl. Acad. Sci. USAHome page
E. Puerta-Fernandez, J. E. Barrick, A. Roth, and R. R. Breaker
Identification of a large noncoding RNA in extremophilic eubacteria
PNAS, December 19, 2006; 103(51): 19490 - 19495.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
M. Hiller, R. Pudimat, A. Busch, and R. Backofen
Using RNA secondary structures to guide sequence motif finding towards single-stranded regions
Nucleic Acids Res., October 18, 2006; 34(17): e117 - e117.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
S. Neph and M. Tompa
MicroFootPrinter: a tool for phylogenetic footprinting in prokaryotic genomes.
Nucleic Acids Res., July 1, 2006; 34(Web Server issue): W366 - W368.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/4/445    most recent
btk008v1