

Bioinformatics Advance Access originally published online on April 3, 2009
Bioinformatics 2009 25(10):1338-1340; doi:10.1093/bioinformatics/btp161

© 2009 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Identification of ribosomal RNA genes in metagenomic fragments

Ying Huang, Paul Gilna and Weizhong Li*

California Institute for Telecommunications and Information Technology, University of California, La Jolla, San Diego, California, USA

*To whom correspondence should be addressed.


    ABSTRACT
 

Motivation: Identification of genes coding for ribosomal RNA (rRNA) is considered an important goal in the analysis of data from metagenomics projects. Here, we report the development of a software program designed for the identification of rRNA genes from metagenomic fragments based on hidden Markov models (HMMs). This program provides rRNA gene predictions with high sensitivity and specificity on artificially fragmented genomic DNAs.

Availability: Supplementary files, scripts and sample data are available at http://tools.camera.calit2.net/camera/meta_rna.

Contact: liwz{at}sdsc.edu

Supplementary information: Supplementary Data are available at Bioinformatics online.


    1 INTRODUCTION
 
The emerging field of metagenomics promises a more comprehensive understanding of the microbial world. Many projects have used metagenomic approaches to study microbes and microbial communities living under a wide range of environmental conditions (Tringe and Rubin, 2005). Analyzing the sequence data generated by these projects is far from easy and requires accessible and user-friendly tools (Raes et al., 2007). An essential step in any metagenomics project is the identification of genes encoding ribosomal RNAs (rRNAs), which are widely used for phylogenetic analysis and quantification of microbial diversity. Several methods have been proposed for predicting non-coding RNA genes (Meyer, 2007), but a recent benchmark study by Freyhult et al. (2007) indicated that the most commonly used methods yield less than encouraging results. Lagesen et al. (2007) proposed RNAmmer, a program based on hidden Markov models (HMMs) for annotation of rRNA genes. Their algorithm predicts rRNAs in complete genomic sequences with high accuracy. However, a major limitation of RNAmmer is its inability to deal with fragments of rRNAs. Unlike assembled genomic sequences from a single species, the raw sequence reads from a typical metagenomic study often remain unassembled due to insufficient coverage. For a typical metagenomic dataset, the read length is ~100–450 bp with 454 pyrosequencing, or ~700 bp with Sanger sequencing. Meanwhile, the full lengths of most 16S and 23S rRNAs are >1200 bp. Therefore, most rRNA genes in metagenomic sequencing reads are fragmentary and will be overlooked by RNAmmer, which focuses on full-length rRNAs. To overcome this limitation, we used HMMs that can detect incomplete rRNA gene fragments. In this article, we apply our algorithm to simulated sets of sequence reads of various lengths. Our method provides rRNA predictions with high sensitivity and specificity on the benchmark dataset.


    2 ALGORITHM DEVELOPMENT
 
As an important molecular machine in all living organisms, the ribosome consists of two subunits, the small and the large subunit. In prokaryotes, the large subunit of the ribosome contains 5S and 23S rRNAs, while the small subunit contains 16S rRNA. We therefore built predictors for 5S, 16S and 23S rRNAs. To obtain reliable multiple sequence alignments (MSAs) for HMM building, we retrieved MSAs of 5S rRNAs from the 5S Ribosomal Database (Szymanski et al., 2002), and MSAs of 16S and 23S rRNAs from the European rRNA database (Wuyts et al., 2004). These databases provide high-quality alignments that combine sequence and structural information. The MSAs were then divided into bacterial and archaeal domains. All sequences with more than five ambiguous nucleotides at either end were removed from the alignments, and the remaining sequences were clustered at a 98% identity threshold to reduce bias. We then used the HMMER software package (Eddy, 1998), version 2.3.2, to create HMMs from these alignments. We built the HMMs in the ‘fs’ mode of HMMER instead of the ‘ls’ mode used by RNAmmer. In HMMER, ‘ls’ mode is suitable for identifying a complete sequence domain, while ‘fs’ mode is capable of finding domain fragments and may therefore be useful for detecting incomplete rRNA genes. In addition, because the domain of origin of a sequence is not known in metagenomic projects, the HMMs built from the bacterial and the archaeal rRNA alignments were both used to search the input sequences. Each sequence was assigned to the domain whose HMM reported the most significant E-value, and the results obtained from the corresponding HMMs were used as the final result.
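The final domain-assignment step can be illustrated with a short Python sketch. It is not the authors' script: the function name `pick_domain`, the tuple layout `(evalue, start, end)` and the tie-breaking rule (the first domain listed wins) are assumptions made for illustration, and the hits are taken to have been parsed already from the output of the bacterial and archaeal HMM searches.

```python
from typing import Dict, List, Optional, Tuple

# One hit is (evalue, start, end). This layout is hypothetical and is
# not the HMMER output format.
Hit = Tuple[float, int, int]


def pick_domain(hits_by_domain: Dict[str, List[Hit]]) -> Tuple[Optional[str], List[Hit]]:
    """Assign one read to the domain whose HMM gave the smallest E-value.

    Returns (domain, hits of that domain), or (None, []) when neither
    domain reported a hit. On a tie, the domain listed first is kept
    (an assumption; the text does not say how ties are resolved).
    """
    best_domain: Optional[str] = None
    best_evalue = float("inf")
    for domain, hits in hits_by_domain.items():
        if not hits:
            continue
        evalue = min(hit[0] for hit in hits)  # most significant hit in this domain
        if evalue < best_evalue:
            best_domain, best_evalue = domain, evalue
    if best_domain is None:
        return None, []
    return best_domain, hits_by_domain[best_domain]


if __name__ == "__main__":
    read_hits = {
        "bacteria": [(1e-30, 5, 400)],
        "archaea": [(1e-12, 10, 390)],
    }
    domain, final_hits = pick_domain(read_hits)
    print(domain, final_hits)
```

Only the hits from the selected domain are returned, which mirrors the statement that results from the corresponding HMMs are used as the final result.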


    3 EVALUATION
 
The performance of our rRNA prediction algorithm was evaluated using artificial DNA fragments generated from fully sequenced archaeal and bacterial genomes. GenBank files for all fully sequenced genomes were retrieved from the ENTREZ Genome Project (downloaded on September 30, 2008). To reduce the impact of sequence redundancy, we removed species related to the training set (see Supplementary Tables for the remaining species used for evaluation). To simulate current sequencing techniques, fragments of lengths 100–800 bp (in intervals of 100 bp) were randomly sampled from each genome to 1x genome coverage for each length. These fragments were used to investigate the prediction performance of both our method and RNAmmer. They were also searched with BLASTN against the 5S Ribosomal Database and the SILVA database (Pruesse et al., 2007) to identify rRNA genes (with an E-value of 10^-5 or less). In the current analysis, fragments were sampled without considering sequencing errors, so the estimated performances are optimistic. The annotation of rRNA genes was also retrieved from the GenBank files. A sequence fragment that overlapped a known rRNA gene on the same strand by more than 40 nt was considered a positive sample. The ratios of true-positives relative to all annotated fragments (sensitivity) and to all predicted fragments (specificity) were used as performance measures. Both exactly matching predictions and partially matching predictions on the correct strand were counted as true-positives.
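The sampling, labelling and scoring steps described above can be sketched in Python as follows. This is an illustrative reconstruction under stated assumptions, not the evaluation code used in the study: the names `Interval`, `sample_fragments`, `is_positive` and `sensitivity_specificity` are hypothetical, coordinates are taken to be 0-based and half-open, fragments are assumed to be drawn uniformly with a random strand, and a prediction is assumed to be reduced beforehand to the ID of a fragment predicted on the correct strand. Note that the ratio called specificity here (true-positives over all predicted fragments) is the quantity often called precision elsewhere.

```python
import random
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Interval:
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive
    strand: str  # "+" or "-"


def sample_fragments(genome_length: int, fragment_length: int,
                     coverage: float = 1.0,
                     rng: Optional[random.Random] = None) -> List[Interval]:
    """Draw fragments until the requested coverage is reached.

    At 1x coverage this yields genome_length / fragment_length fragments.
    Uniform start positions and a random strand are assumptions.
    """
    rng = rng or random.Random()
    count = int(coverage * genome_length / fragment_length)
    fragments = []
    for _ in range(count):
        start = rng.randrange(genome_length - fragment_length + 1)
        fragments.append(Interval(start, start + fragment_length, rng.choice("+-")))
    return fragments


def overlap_length(a: Interval, b: Interval) -> int:
    """Number of positions shared by two intervals (0 if disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def is_positive(fragment: Interval, genes: Iterable[Interval],
                min_overlap: int = 40) -> bool:
    """True if the fragment overlaps an annotated rRNA gene on the same
    strand by more than min_overlap nucleotides."""
    return any(gene.strand == fragment.strand
               and overlap_length(fragment, gene) > min_overlap
               for gene in genes)


def sensitivity_specificity(annotated: Set[int],
                            predicted: Set[int]) -> Tuple[float, float]:
    """Return (TP / annotated, TP / predicted) over sets of fragment IDs."""
    true_positives = len(annotated & predicted)
    sensitivity = true_positives / len(annotated) if annotated else float("nan")
    specificity = true_positives / len(predicted) if predicted else float("nan")
    return sensitivity, specificity


if __name__ == "__main__":
    genes = [Interval(500, 2000, "+")]  # toy annotation, not real data
    fragments = sample_fragments(10_000, 200, rng=random.Random(0))
    annotated = {i for i, f in enumerate(fragments) if is_positive(f, genes)}
    print(len(fragments), "fragments,", len(annotated), "positives")
```

Passing a seeded `random.Random` keeps a simulated read set reproducible, which is convenient when the same fragments must be scored by several tools.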

Tables 1 and 2 show the prediction sensitivities and specificities for all fragment lengths. The result for RNAmmer is shown in Supplementary Table S5. Most 16S and 23S rRNA genes are substantially longer than 800 bp, so fragments of these genes cannot be detected by a full-domain model such as RNAmmer. Our algorithm identifies sequence reads containing rRNAs with high sensitivity and specificity (>90% in almost all configurations). More importantly, the prediction performance does not vary much across read lengths. One commonly used method for predicting rRNAs in metagenomic projects is based on BLAST (Altschul et al., 1997; Frias-Lopez et al., 2008). However, Lagesen et al. (2007) indicated that results based on BLAST can be problematic due to its inconsistency. For 5S rRNA, our algorithm achieves much better sensitivities than BLASTN (average improvement of 10.2%), while the specificities are around 2.3% lower. For 23S rRNA, the performances of our algorithm and BLASTN are almost the same. The biggest improvement is in 16S rRNA prediction, where our algorithm improves the specificities significantly while keeping the sensitivities slightly better.


Table 1. Prediction sensitivities for different fragment lengths

 

Table 2. Prediction specificities for different fragment lengths

 
The average running time of our algorithm was 744 ms per 800 bp read and 145 ms per 200 bp read on a single 2.33 GHz Xeon® CPU. The running time for BLASTN was 239 ms per 800 bp read and 123 ms per 200 bp read. Additional analyses were performed on the Sargasso Sea metagenomic dataset (Venter et al., 2004), consisting of 811 372 entries totaling over 800 Mbp. On this set, the search speed was 1088 s per Mbp, and our algorithm identified 660 5S, 1337 16S and 2300 23S rRNA genes or fragments of genes.
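To put the reported timings on a common scale, the short Python calculation below converts them to milliseconds per base pair and derives a rough lower bound on the total search time for the Sargasso Sea set. It uses only the figures quoted above. Treating the dataset as exactly 800 Mbp is an assumption (the text says "over 800 Mbp"), so the total of 870 400 s, about 10 days on one CPU, is a lower bound rather than a reported result. The function names are illustrative.

```python
def ms_per_bp(ms_per_read: float, read_length_bp: int) -> float:
    """Convert a per-read running time into milliseconds per base pair."""
    return ms_per_read / read_length_bp


def total_seconds(seconds_per_mbp: float, dataset_mbp: float) -> float:
    """Total search time for a dataset at a fixed per-Mbp search speed."""
    return seconds_per_mbp * dataset_mbp


if __name__ == "__main__":
    # (ms per read, read length in bp), as quoted in the text.
    reported = {
        "HMM, 800 bp": (744, 800),
        "HMM, 200 bp": (145, 200),
        "BLASTN, 800 bp": (239, 800),
        "BLASTN, 200 bp": (123, 200),
    }
    for label, (ms, length) in reported.items():
        print(f"{label}: {ms_per_bp(ms, length):.3f} ms/bp")

    # 800 Mbp is assumed as a lower bound on the dataset size.
    seconds = total_seconds(1088, 800)
    print(f"Sargasso Sea lower bound: {seconds:.0f} s = {seconds / 86400:.1f} days")
```

On these figures the HMM search costs roughly three times as much as BLASTN per 800 bp read, while the two are close for 200 bp reads.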


    4 CONCLUSION
 
With the continued growth of metagenomic sequencing projects, identification of rRNA genes within sequence fragments from these projects remains a very important task. Here, we have reported an HMM-based algorithm that detects rRNA genes in short metagenomic fragments with high accuracy. Our algorithm is written in Python and runs on Linux/Unix and Windows XP systems on which Python and the HMMER package are installed. The scripts, sample dataset and usage instructions are available online at http://tools.camera.calit2.net/camera/meta_rna as a downloadable application.

Funding: Gordon and Betty Moore Foundation (CAMERA project, http://camera.calit2.net).

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Ivo Hofacker

Received on November 3, 2008; revised on March 16, 2009; accepted on March 17, 2009

    REFERENCES
 

    Altschul SF, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. (1997) 25:3389–3402.

    Eddy SR. Profile hidden Markov models. Bioinformatics (1998) 14:755–763.

    Freyhult EK, et al. Exploring genomic dark matter: a critical assessment of the performance of homology search methods on noncoding RNA. Genome Res. (2007) 17:117–125.

    Frias-Lopez J, et al. Microbial community gene expression in ocean surface waters. Proc. Natl Acad. Sci. USA (2008) 105:3805–3810.

    Lagesen K, et al. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res. (2007) 35:3100–3108.

    Meyer IM. A practical guide to the art of RNA gene prediction. Brief. Bioinform. (2007) 8:396–414.

    Pruesse E, et al. SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res. (2007) 35:7188–7196.

    Raes J, et al. Get the most out of your metagenome: computational analysis of environmental sequence data. Curr. Opin. Microbiol. (2007) 10:490–498.

    Szymanski M, et al. 5S Ribosomal RNA database. Nucleic Acids Res. (2002) 30:176–178.

    Tringe SG, Rubin EM. Metagenomics: DNA sequencing of environmental samples. Nat. Rev. Genet. (2005) 6:805–814.

    Venter JC, et al. Environmental genome shotgun sequencing of the Sargasso Sea. Science (2004) 304:66–74.

    Wuyts J, et al. The European ribosomal RNA database. Nucleic Acids Res. (2004) 32:D101–D103.

