Bioinformatics Advance Access originally published online on March 23, 2009
Bioinformatics 2009 25(10):1335-1337; doi:10.1093/bioinformatics/btp157
A corrigendum to this article has been published.

© Crown Copyright 2009.
Reproduced with the permission of the Controller of Her Majesty's Stationery Office.

Infernal 1.0: inference of RNA alignments

Eric P. Nawrocki, Diana L. Kolbe and Sean R. Eddy*

HHMI Janelia Farm Research Campus, Ashburn, VA 20147, USA

*To whom correspondence should be addressed.


    ABSTRACT

Summary: INFERNAL builds consensus RNA secondary structure profiles called covariance models (CMs), and uses them to search nucleic acid sequence databases for homologous RNAs, or to create new sequence- and structure-based multiple sequence alignments.

Availability: Source code, documentation and the benchmark are downloadable from http://infernal.janelia.org. INFERNAL is freely licensed under the GNU GPLv3 and should be portable to any POSIX-compliant operating system, including Linux and Mac OS X.

Contact: nawrockie,kolbed,eddys{at}janelia.hhmi.org


    1 INTRODUCTION
When searching for homologous structural RNAs in sequence databases, it is desirable to score both primary sequence and secondary structure conservation. The most generally useful tools that integrate sequence and structure take as input any RNA (or RNA multiple alignment), and automatically construct an appropriate statistical scoring system that allows quantitative ranking of putative homologs in a sequence database (Gautheret and Lambert, 2001; Huang et al., 2008; Zhang et al., 2005). Stochastic context-free grammars (SCFGs) provide a natural statistical framework for combining sequence and (non-pseudoknotted) secondary structure conservation information in a single consistent scoring system (Brown, 2000; Durbin et al., 1998; Eddy and Durbin, 1994; Sakakibara et al., 1994).

Here, we announce the 1.0 release of INFERNAL, an implementation of a general SCFG-based approach for RNA database searches and multiple alignment. INFERNAL builds consensus RNA profiles called covariance models (CMs), a special case of SCFGs designed for modeling RNA consensus sequence and structure. It uses CMs to search nucleic acid sequence databases for homologous RNAs, or to create new sequence- and structure-based multiple sequence alignments. One use of INFERNAL is to annotate RNAs in genomes in conjunction with the RFAM database (Gardner et al., 2009), which contains hundreds of RNA families. RFAM follows a seed profile strategy, in which a well-annotated ‘seed’ alignment of each family is curated, and a CM built from that seed alignment is used to identify and align additional members of the family. INFERNAL has been in use since 2002, but 1.0 is the first version that we consider to be a reasonably complete production tool. It now includes E-value estimates for the statistical significance of database hits, and heuristic acceleration algorithms for both database searches and multiple alignment that allow INFERNAL to be deployed in a variety of real RNA analysis tasks with manageable (albeit high) computational requirements.


    2 USAGE
A CM is built from a Stockholm format multiple sequence alignment (or single RNA sequence) with consensus secondary structure annotation marking which positions of the alignment are single stranded and which are base paired (Eddy, 2009). CMs assign position-specific scores for the four possible residues at single-stranded positions, the 16 possible base pairs at paired positions and for insertions and deletions. These scores are log-odds scores derived from the observed counts of residues, base pairs, insertions and deletions in the input alignment, combined with prior information derived from structural ribosomal RNA alignments. CM parameterization has been described in more detail elsewhere (Eddy, 2002, 2009; Eddy and Durbin, 1994; Klein and Eddy, 2003; Nawrocki and Eddy, 2007).
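The two ingredients described above, a consensus structure that separates paired from single-stranded columns and log-odds scores derived from counts plus a prior, can be illustrated with a minimal Python sketch. It is not INFERNAL's parameterization: the function names are hypothetical, the structure string uses only `<`, `>` and `.`, and a single uniform pseudocount prior stands in for the priors derived from ribosomal RNA alignments.

```python
import math

def parse_pairs(ss_cons):
    """Split a bracket string such as '<<<..>>>' into base-paired and
    single-stranded columns. '<' opens a pair, '>' closes the most recent
    open pair, and every other character is treated as single stranded
    (a simplification of real structure annotation)."""
    stack, pairs, unpaired = [], [], []
    for i, ch in enumerate(ss_cons):
        if ch == "<":
            stack.append(i)
        elif ch == ">":
            if not stack:
                raise ValueError(f"unmatched '>' at column {i}")
            pairs.append((stack.pop(), i))
        else:
            unpaired.append(i)
    if stack:
        raise ValueError(f"unmatched '<' at column {stack[-1]}")
    return sorted(pairs), unpaired

def log_odds(counts, prior, background, weight=1.0):
    """Log2-odds score for each symbol: observed counts are mixed with
    `weight` pseudocounts distributed according to `prior`, then compared
    with the background frequency. (Assumed scheme, for illustration.)"""
    total = sum(counts.values()) + weight
    return {
        sym: math.log2((counts.get(sym, 0) + weight * prior[sym]) / total / background[sym])
        for sym in prior
    }

# Toy three-sequence alignment with a three-pair stem (made-up data).
alignment = ["GGCAAGCC", "GCCUAGGC", "GGCAUGCC"]
pairs, unpaired = parse_pairs("<<<..>>>")

# Single-stranded column: 4 possible residues.
residues = "ACGU"
uniform4 = {r: 0.25 for r in residues}
col = unpaired[0]
col_counts = {}
for seq in alignment:
    col_counts[seq[col]] = col_counts.get(seq[col], 0) + 1
single_scores = log_odds(col_counts, uniform4, uniform4)

# Paired columns: 16 possible base pairs, scored jointly.
all_pairs = [a + b for a in residues for b in residues]
uniform16 = {p: 1.0 / 16 for p in all_pairs}
left, right = pairs[0]
pair_counts = {}
for seq in alignment:
    bp = seq[left] + seq[right]
    pair_counts[bp] = pair_counts.get(bp, 0) + 1
pair_scores = log_odds(pair_counts, uniform16, uniform16)
```

Scoring the 16 base pairs jointly, rather than as two independent residues, is what lets a model reward compensatory substitutions such as G-C to C-G within a stem.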

INFERNAL is composed of several programs that are used in combination by following four basic steps:

  1. Build a CM from a structural alignment with cmbuild.
  2. Calibrate a CM for homology search with cmcalibrate.
  3. Search databases for putative homologs with cmsearch.
  4. Align putative homologs to a CM with cmalign.

The calibration step is optional and computationally expensive (4 h on a 3.0 GHz Intel Xeon for a CM of a typical RNA family of length 100 nt), but is required to obtain E-values that estimate the statistical significance of hits in a database search. cmcalibrate will also determine appropriate hidden Markov model (HMM) filter thresholds for accelerating searches without an appreciable loss of sensitivity. Each model only needs to be calibrated once.
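The four steps can be sketched as a small Python helper that assembles the command lines. This is an illustration only: the file names are hypothetical, no program options are shown, and the argument order (model file first, then alignment or sequence file) is an assumption that should be checked against the user's guide for the installed version.

```python
import shlex

def infernal_workflow(alignment, cm_file, database, homologs, calibrate=True):
    """Return the command lines for build, (optional) calibrate, search, align.

    Argument order is assumed, not verified against a specific release.
    """
    steps = [["cmbuild", cm_file, alignment]]        # 1. build the CM
    if calibrate:
        steps.append(["cmcalibrate", cm_file])       # 2. needed for E-values; slow
    steps.append(["cmsearch", cm_file, database])    # 3. search for homologs
    steps.append(["cmalign", cm_file, homologs])     # 4. align homologs to the CM
    return steps

# Hypothetical file names, for illustration only.
commands = infernal_workflow("trna.sto", "trna.cm", "genome.fa", "hits.fa")
for cmd in commands:
    print(shlex.join(cmd))
```

Each list could be passed to `subprocess.run(cmd, check=True)` to execute the step. Because a model only needs to be calibrated once, `calibrate=False` is appropriate when reusing a model file that has already been calibrated.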


    3 PERFORMANCE
A published benchmark (independent of our lab) (Freyhult et al., 2007) and our own internal benchmark used during development (Nawrocki and Eddy, 2007) both find that INFERNAL and other CM-based methods are the most sensitive and specific tools for structural RNA homology search among those tested. Figure 1 shows updated results of our internal benchmark comparing INFERNAL 1.0 with the previous version (0.72) that was benchmarked in Freyhult et al. (2007), and with family-pairwise search using BLASTN (Altschul et al., 1997; Grundy, 1998). INFERNAL's sensitivity and specificity have greatly improved, mainly due to three improvements in the implementation (Eddy, 2009): a biased composition correction to the raw log-odds scores; the use of Inside log likelihood scores (the summed score of all possible alignments of the target sequence) in place of CYK scores (the single maximum likelihood alignment score); and the introduction of approximate E-value estimates for the scores.
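The difference between a CYK score and an Inside score can be shown with a toy Python example. It assumes the log-odds bit scores of every possible alignment are available as a list, which is only feasible for illustration; the real algorithms obtain the same quantities by dynamic programming without enumerating alignments. The function names and scores are hypothetical.

```python
import math

def cyk_score(alignment_bit_scores):
    """Score of the single best alignment (maximum likelihood)."""
    return max(alignment_bit_scores)

def inside_score(alignment_bit_scores):
    """Log2 of the summed likelihood ratio over all alignments.

    Subtracting the maximum before exponentiating keeps the sum
    numerically stable (the usual log-sum-exp trick, in base 2).
    """
    m = max(alignment_bit_scores)
    return m + math.log2(sum(2.0 ** (s - m) for s in alignment_bit_scores))

# Three hypothetical alternative alignments of one target sequence.
scores = [20.0, 19.0, 17.0]
best = cyk_score(scores)
summed = inside_score(scores)
```

The Inside score is never lower than the CYK score, and the gap grows when several near-optimal alignments exist, which is typical for remote homologs whose alignment to the model is uncertain.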


Fig. 1. ROC curves for the benchmark. Plots are shown for the new INFERNAL 1.0 with and without filters, for the old INFERNAL 0.72 and for family-pairwise searches (FPS) with BLASTN. CPU times are total times for all 51 family searches measured for single execution threads on 3.0 GHz Intel Xeon processors. The INFERNAL 1.0 times do not include time required for model calibration.

 
The benchmark dataset used in Figure 1 includes query alignments and test sequences from 51 RFAM (release 7) families [details in Nawrocki and Eddy (2007)]. No query sequence is >60% identical to a test sequence. The 450 total test sequences were embedded at random positions in a 10 Mb ‘pseudogenome’. Previously, we generated the pseudogenome sequence from a uniform residue frequency distribution (Nawrocki and Eddy, 2007). Because base composition biases in the target sequence database cause the most serious problems in separating significant CM hits from noise, we improved the realism of the benchmark by generating the pseudogenome sequence from a 15-state fully connected HMM trained by Baum–Welch expectation maximization (Durbin et al., 1998) on genome sequence data from a wide variety of species. Each of the 51 query alignments was used to build a CM and search the pseudogenome; a single list of all hits for all families was collected and ranked; and true and false hits were defined as described in Nawrocki and Eddy (2007), producing the ROC curves in Figure 1.
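The pooling and ranking step can be sketched in Python as follows. The hit list, the function name and the choice of axes (sensitivity against false positives per Mb) are assumptions made for illustration; the precise definitions of true and false hits are those given in Nawrocki and Eddy (2007), not the ones here.

```python
def roc_points(hits, n_true, db_mb):
    """Walk down one pooled hit list, best E-value first.

    hits   : list of (evalue, is_true) tuples pooled over all families
    n_true : number of true test sequences embedded in the database
    db_mb  : database size in Mb
    Returns a list of (false positives per Mb, sensitivity) points.
    """
    points = []
    true_pos = false_pos = 0
    for _evalue, is_true in sorted(hits, key=lambda h: h[0]):
        if is_true:
            true_pos += 1
        else:
            false_pos += 1
        points.append((false_pos / db_mb, true_pos / n_true))
    return points

# Made-up hits: three true homologs found, one false hit, one homolog missed.
toy_hits = [(1e-10, True), (1e-3, False), (1e-5, True), (0.5, True)]
curve = roc_points(toy_hits, n_true=4, db_mb=10.0)
```

Ranking all families in one list, rather than per family, tests whether scores are comparable across models, which is one reason calibrated E-values matter for this benchmark.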

INFERNAL searches require a large amount of compute time [our 10 Mb benchmark search takes about 30 h per model on average (Fig. 1)]. To alleviate this, INFERNAL 1.0 implements two rounds of filtering. When appropriate, the HMM filtering technique described by Weinberg and Ruzzo (2006) is applied first, with filter thresholds configured by cmcalibrate [occasionally a model with little primary sequence conservation cannot be usefully accelerated by a primary sequence-based filter, as explained in Eddy (2009)]. The query-dependent banded (QDB) CYK maximum likelihood search algorithm is used as a second filter with relatively tight bands [β = 10^-7; the β parameter is the subtree length probability mass excluded by imposing the bands, as explained in Nawrocki and Eddy (2007)]. Any sequence fragments that survive the filters are searched a final time with the Inside algorithm [again using QDB, but with looser bands (β = 10^-15)]. In our benchmark, the default filters accelerate similarity search by about 30-fold overall, while sacrificing a small amount of sensitivity (Fig. 1). This makes version 1.0 substantially faster than 0.72. BLAST is still orders of magnitude faster, but significantly less sensitive than INFERNAL. Further acceleration remains a major goal of INFERNAL development.
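The role of β can be illustrated with a short Python sketch that derives a band from a subtree length distribution. It assumes the distribution is already known and that the excluded mass is split equally between the two tails; both the function name and the toy distribution are hypothetical, and the actual band calculation is described in Nawrocki and Eddy (2007).

```python
def band_from_distribution(length_probs, beta):
    """Return (dmin, dmax) such that less than beta/2 probability mass
    is excluded from each tail of the subtree length distribution.

    length_probs[d] is the probability that the subtree emits d residues.
    The equal split between tails is an assumption of this sketch.
    """
    half = beta / 2.0
    dmin, excluded = 0, 0.0
    while dmin < len(length_probs) - 1 and excluded + length_probs[dmin] < half:
        excluded += length_probs[dmin]
        dmin += 1
    dmax, excluded = len(length_probs) - 1, 0.0
    while dmax > dmin and excluded + length_probs[dmax] < half:
        excluded += length_probs[dmax]
        dmax -= 1
    return dmin, dmax

# Toy distribution over subtree lengths 0..7 (made-up numbers summing to 1).
probs = [0.001, 0.009, 0.09, 0.4, 0.4, 0.09, 0.009, 0.001]
tight = band_from_distribution(probs, beta=0.05)
loose = band_from_distribution(probs, beta=1e-7)
```

A larger β discards more of the tails and gives a narrower band, so fewer subsequence lengths need to be evaluated; this is why the tighter setting suits a fast filter and the looser setting suits the final scoring pass.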

The computational cost of CM alignment with cmalign has been a limitation of previous versions of INFERNAL. Version 1.0 now uses a constrained dynamic programming approach first developed by Brown (2000) that uses sequence-specific bands derived from a first-pass HMM alignment. This technique offers a dramatic speedup relative to unconstrained alignment, especially for large RNAs such as small and large subunit (SSU and LSU, respectively) ribosomal RNAs, which can now be aligned in roughly 1 and 3 s per sequence, respectively, as opposed to 12 min and 3 h in previous versions. This acceleration has facilitated the adoption of INFERNAL by RDP, one of the main ribosomal RNA databases (Cole et al., 2009).

INFERNAL is now a faster and more sensitive tool for RNA sequence analysis. Version 1.0's heuristic acceleration techniques make some important applications possible on a single desktop computer in less than an hour, such as searching a prokaryotic genome for a particular RNA family, or aligning a few thousand SSU rRNA sequences. Nonetheless, INFERNAL remains computationally expensive, and many problems of interest require the use of a cluster. The most expensive programs (cmcalibrate, cmsearch and cmalign) are implemented in coarse-grained parallel MPI versions which divide the workload into independent units, each of which is run on a separate processor.
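Coarse-grained parallelism of this kind can be sketched in Python by splitting a sequence into overlapping windows that are processed independently and then merged. This is not the MPI implementation: a thread pool stands in for separate processors, a literal motif match stands in for a CM search, and the overlapping-window scheme and all names are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def split_with_overlap(seq_len, chunk_len, overlap):
    """Return (start, end) windows covering [0, seq_len), with consecutive
    windows sharing `overlap` positions so no short hit is cut in two."""
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_len")
    chunks, start = [], 0
    while True:
        end = min(start + chunk_len, seq_len)
        chunks.append((start, end))
        if end == seq_len:
            return chunks
        start = end - overlap

def find_motif(sequence, bounds, motif):
    """Stand-in for a real search: report motif start positions in one window."""
    start, end = bounds
    window = sequence[start:end]
    return [start + i for i in range(len(window) - len(motif) + 1)
            if window.startswith(motif, i)]

def parallel_search(sequence, motif, chunk_len, workers=4):
    """Search each window independently, then merge and de-duplicate hits."""
    chunks = split_with_overlap(len(sequence), chunk_len, overlap=len(motif) - 1)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda b: find_motif(sequence, b, motif), chunks)
    return sorted({pos for hits in results for pos in hits})

# Toy target sequence (made-up data).
target = "ACGU" * 5
hits = parallel_search(target, "GUAC", chunk_len=7)
```

An overlap of one less than the longest possible hit guarantees that every hit lies entirely inside at least one window; the merge step removes hits reported twice from the shared region.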


    ACKNOWLEDGEMENTS
We thank Goran Ceric for his peerless skill in managing Janelia Farm's high-performance computing resources.

Funding: INFERNAL development is supported by the Howard Hughes Medical Institute. It has been supported in the past by an NIH NHGRI training grant (T32-HG000045) to E.P.N.; an NSF Graduate Fellowship to D.L.K.; NIH R01-HG01363; and a generous endowment from Alvin Goldfarb.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Ivo Hofacker

Received on January 13, 2009; revised on March 11, 2009; accepted on March 14, 2009

    REFERENCES

    Altschul SF, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. (1997) 25:3389–3402.

    Brown MP. Small subunit ribosomal RNA modeling using stochastic context-free grammars. Proc. Int. Conf. Intell. Syst. Mol. Biol. (2000) 8:57–66.

    Cole JR, et al. The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucleic Acids Res. (2009) 37:D141–D145.

    Durbin R, et al. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. (1998) Cambridge, UK: Cambridge University Press.

    Eddy SR. A memory-efficient dynamic programming algorithm for optimal alignment of a sequence to an RNA secondary structure. BMC Bioinformatics (2002) 3:18.

    Eddy SR. The Infernal user's guide. (2009) Available at http://infernal.janelia.org/ (last accessed March 27, 2009).

    Eddy SR, Durbin R. RNA sequence analysis using covariance models. Nucleic Acids Res. (1994) 22:2079–2088.

    Freyhult EK, et al. Exploring genomic dark matter: a critical assessment of the performance of homology search methods on noncoding RNA. Genome Res. (2007) 17:117–125.

    Gardner PP, et al. Rfam: updates to the RNA families database. Nucleic Acids Res. (2009) 37:D136–D140.

    Gautheret D, Lambert A. Direct RNA motif definition and identification from multiple sequence alignments using secondary structure profiles. J. Mol. Biol. (2001) 313:1003–1011.

    Grundy WN. Homology detection via family pairwise search. J. Comput. Biol. (1998) 5:479–491.

    Huang Z, et al. Fast and accurate search for non-coding RNA pseudoknot structures in genomes. Bioinformatics (2008) 24:2281–2287.

    Klein RJ, Eddy SR. RSEARCH: finding homologs of single structured RNA sequences. BMC Bioinformatics (2003) 4:44.

    Nawrocki EP, Eddy SR. Query-dependent banding (QDB) for faster RNA similarity searches. PLoS Comput. Biol. (2007) 3:e56.

    Sakakibara Y, et al. Stochastic context-free grammars for tRNA modeling. Nucleic Acids Res. (1994) 22:5112–5120.

    Weinberg Z, Ruzzo WL. Sequence-based heuristics for faster annotation of non-coding RNA families. Bioinformatics (2006) 22:35–39.

    Zhang S, et al. Searching genomes for noncoding RNA using FastR. IEEE/ACM Trans. Comput. Biol. Bioinform. (2005) 2:366–379.





