Bioinformatics Advance Access originally published online on August 7, 2006
Bioinformatics 2006 22(20):2463-2465; doi:10.1093/bioinformatics/btl430
ARFA: a program for annotating bacterial release factor genes, including prediction of programmed ribosomal frameshifting
1 Biosciences Institute, University College Cork, Cork, Ireland
2 Department of Human Genetics, University of Utah, Salt Lake City, Utah 84112-5330, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Summary: Correct annotation of genes encoding release factors in bacterial genomes is often complicated by utilization of +1 programmed ribosomal frameshifting during synthesis of release factor 2, RF2. In the absence of robust computational approaches for predicting ribosomal frameshifting, the success of proper annotation depends on annotators' familiarity with this phenomenon. Here we describe a novel computer tool that allows automatic discrimination of genes encoding class-I bacterial release factors, RF1, RF2 and RFH. Most usefully, this program identifies and automatically annotates +1 frameshifting in RF2 encoding genes. Comparison of ARFA performance with existing annotations of bacterial genomes revealed that only 20% of RF2 genes utilizing ribosomal frameshifting during their expression are annotated correctly.
Availability: The PHP based web interface of ARFA and the source code are located at http://recode.genetics.utah.edu/arfa
Contact: baranov{at}genetics.utah.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
The number of completely sequenced bacterial genomes is increasing almost every day. Although the annotation process has become routine, it still requires meticulous analysis of the sequenced genomes. The accuracy of this procedure largely depends on the annotator's biological expertise. Owing to the considerable diversity of biological processes, it is hard to expect a uniform level of such expertise. Life has made perfect sense of the cliché that rules are made to be broken: its opportunistic nature has resulted in the evolution of exceptions to almost every rule, including such fundamentals as the genetic code (Knight et al., 2001; Baranov et al., 2002a; Klobutcher and Farabaugh, 2002; Santos et al., 2004; Namy et al., 2004). In the majority of bacteria, release factor 2 (RF2) is encoded in two overlapping ORFs in the same orientation. The upstream ORF is short (25 codons in Escherichia coli) and the downstream one encodes a much larger portion of RF2 (Craigen et al., 1985). In such cases, expression of RF2 genes requires programmed ribosomal frameshifting, which serves as a negative feedback regulator of RF2 biosynthesis (Craigen and Caskey, 1986). Correct annotation of RF2 genes therefore demands some familiarity with programmed ribosomal frameshifting among annotators. When awareness of frameshifting is lacking, RF2 is annotated as a single ORF corresponding to the second, long ORF, with the CDS starting at an internal ATG, GTG or TTG codon, while the first ORF, which contains the real initiation codon, remains outside of the RF2 CDS.
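The two-ORF arrangement can be illustrated with a minimal Python sketch. The toy sequence below is invented for illustration and is not a real RF2 gene; the function names and the shift position are likewise hypothetical. Reading in the initial frame terminates after a short upstream ORF, whereas skipping one nucleotide at the shift position (+1 frameshifting) continues decoding in the overlapping frame.

```python
from itertools import product

# Standard genetic code, built in the conventional TCAG codon order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(seq, start=0):
    """Translate from `start` until a stop codon or the end of the sequence."""
    protein = []
    for i in range(start, len(seq) - 2, 3):
        aa = CODON_TABLE[seq[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def translate_plus1(seq, shift_pos):
    """Translate frame 0 up to `shift_pos`, skip one base, then continue.

    `shift_pos` is the index just after the last codon read in frame 0
    and is assumed to be a multiple of 3.
    """
    upstream = "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, shift_pos, 3))
    return upstream + translate(seq, shift_pos + 1)


# Invented toy sequence: ATG GGG TAT CTT TGA ... in frame 0.
toy = "ATGGGGTATCTTTGACTACGCTTAA"
print(translate(toy))             # MGYL (frame 0 stops at TGA)
print(translate_plus1(toy, 12))   # MGYLDYA (after a +1 shift past CTT)
```

An annotation that ignores the shift would start the CDS inside the second frame and miss the upstream codons entirely.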
Conservation of the frameshifting cassette and its position, together with significant sequence similarity among upstream RF2 ORFs (Baranov et al., 2002b), provides the opportunity to develop a simple, fully automated tool for the correct annotation of shifty RF2 genes. We have developed such a tool, which we named ARFA (Automatic Release Factor Annotation). It detects bacterial class-I release factors and then discriminates between release factors with known specificity to mRNA stop signals (RF1 and RF2). It also identifies RFH, which presumably recognizes and mediates termination at an unknown mRNA signal (Pel et al., 1992; Baranov et al., 2006). For each RF2 gene, it determines whether RF2 is encoded in a single ORF or in two ORFs. If RF2 is encoded in two ORFs, ARFA further finds the frameshift site and generates a detailed description of the frameshift cassette, which includes a slippery site, a weak signal for termination of translation (Tate et al., 1995) and a stimulatory internal Shine-Dalgarno sequence (Weiss et al., 1988).
ARFA is a BioPerl (Stajich et al., 2002) module that can be downloaded from ARFA's website. ARFA can be easily implemented in pipelines for annotation of eubacterial genomes. It can be applied to any given sequence, including eukaryotic mRNA sequences, where it can be used for detecting genes encoding mitochondrial and chloroplast release factors. We will use the data on frameshifting cassettes in bacterial genomes obtained with ARFA for future automatic updates of the RECODE database (Baranov et al., 2001, 2003).
| 2 ALGORITHM AND IMPLEMENTATION |
|---|
As input, ARFA takes DNA or RNA sequences in FASTA format or GenBank accession numbers. It produces output either in GenBank flat file format or in an extended XML format designed for future versions of the RECODE database. The XML output provides a detailed description of the frameshifting cassette. A local version of ARFA can be installed on Linux/Unix platforms; it requires pre-installation of BioPerl 1.5.1 (Stajich et al., 2002), FASTA 3.4 (Pearson and Lipman, 1988; Pearson, 1990) and HMMER 2.3.2 (Eddy, 1998).
The general scheme of ARFA analysis is illustrated in Supplementary Figure 1S. In the first step, ARFA determines the size of the analyzed sequence. For large sequences (>20 kb), it runs a FASTA search against the sequence provided by the user or retrieved from GenBank (Benson et al., 2006) using a user-provided accession number. The FASTA search is used as a rapid filter to reduce the number of unrelated sequences analyzed during the second step. As queries for the FASTA search, ARFA uses a small number of selected RF sequences. This search is performed with relaxed parameters to prevent elimination of true positives. The FASTA search is not performed for short (<20 kb) sequences.
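The size-dependent first step can be sketched as follows. This is a minimal Python illustration, not ARFA's Perl code: the function name, the flank size and the merging of neighboring hits are assumptions made for the example, and the text does not say how a sequence of exactly 20 kb is treated (here it is handled as short).

```python
PREFILTER_THRESHOLD = 20_000  # 20 kb, the size cutoff quoted in the text


def candidate_regions(seq_len, hits, flank=2_000, threshold=PREFILTER_THRESHOLD):
    """Return (start, end) regions to hand to the second (HMM) step.

    `hits` are (start, end) coordinates from a hypothetical prefilter search.
    Short sequences skip the prefilter and are analyzed whole.
    """
    if seq_len <= threshold:
        return [(0, seq_len)]

    # Pad each hit with flanking sequence (assumed) and clip to the sequence.
    padded = sorted((max(0, s - flank), min(seq_len, e + flank)) for s, e in hits)

    # Merge overlapping padded regions so each locus is analyzed once.
    merged = []
    for start, end in padded:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


print(candidate_regions(5_000, []))
# [(0, 5000)]
print(candidate_regions(100_000, [(10_000, 11_000), (12_000, 13_000), (50_000, 51_000)]))
# [(8000, 15000), (48000, 53000)]
```

Keeping the prefilter permissive and leaving the final decision to the second step matches the stated goal of not eliminating true positives early.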
In the second step, ARFA extracts the candidate sequences and performs HMMER searches using HMMs based on alignments of RF1, RF2 and RFH sequences. At this step, a large number of false positives are eliminated and the remaining sequences are classified based on their similarity to the RF HMMs. The E-value threshold is set by default to 1e-40, which was empirically determined to give the best performance. Whether an RF2 gene contains a frameshift site is estimated by comparing candidate RF2 sequences with an HMM based on the alignment of N-terminal parts of selected RF2 sequences. If the N-terminal RF2 HMM has a hit at the 5' end of the long ORF, RF2 is considered to be encoded in a single ORF. If the hit is located in a small ORF overlapping the long one at its 5' end, RF2 is considered to be expressed via ribosomal frameshifting. Further, the nucleotide sequence at the end of the small ORF is compared with an HMM of the frameshift site, which is based on the alignment of nucleotide sequences from known frameshift sites (Baranov et al., 2002b). This is useful for detailed manual analysis of frameshifting cassettes.
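The decision logic of the second step can be summarized in a short Python sketch. It assumes the HMM searches have already been run and only models how their results are combined; the function names, the dictionary of per-family E-values and the location labels are hypothetical, and the default cutoff is taken to be 1e-40.

```python
from typing import Dict, Optional

E_VALUE_CUTOFF = 1e-40  # assumed default threshold


def classify_release_factor(evalues: Dict[str, float],
                            cutoff: float = E_VALUE_CUTOFF) -> Optional[str]:
    """Pick the best-scoring family (lowest E-value) among those passing the cutoff.

    `evalues` maps a family name ("RF1", "RF2", "RFH") to the E-value of the
    candidate against that family's HMM. Returns None for a rejected candidate.
    """
    passing = {family: e for family, e in evalues.items() if e <= cutoff}
    if not passing:
        return None
    return min(passing, key=passing.get)


def rf2_expression_mode(nterm_hit_location: Optional[str]) -> str:
    """Interpret where the N-terminal RF2 HMM hit falls (labels are hypothetical)."""
    if nterm_hit_location == "long_orf_5prime":
        return "single ORF"
    if nterm_hit_location == "upstream_short_orf":
        return "+1 frameshifting"
    return "undetermined"


print(classify_release_factor({"RF1": 1e-20, "RF2": 1e-90, "RFH": 1e-45}))  # RF2
print(classify_release_factor({"RF1": 1e-10}))                              # None
print(rf2_expression_mode("upstream_short_orf"))                            # +1 frameshifting
```

The frameshift-site HMM comparison described above would follow only for candidates reported as "+1 frameshifting".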
ARFA is written in Perl and utilizes BioPerl modules. It can be executed directly from the Linux/Unix command line or called from an external web application or an annotation pipeline. Input and output format options, the E-value threshold, the translation table and initiation codon restrictions are user defined.
| 3 PERFORMANCE |
|---|
To evaluate ARFA prediction sensitivity, sequences of completed bacterial genomes were downloaded from the RefSeq database (Pruitt et al., 2005) on May 20, 2006. The dataset contains chromosomal sequences from 311 bacteria. ARFA detected 311 RF1 genes, 297 RF2 genes and 23 RFH genes. All genomes in which RF2 was not found were from bacteria where UGA is not recognized as a stop codon. While ARFA predictions of RF1-encoding genes matched genome annotations precisely, a number of RF2 and RFH genes are incorrectly annotated in completed genomes. In 12 genomes, RFH-encoding genes are annotated as peptide chain release factor 2, leading to the situation where the same genome contains two genes annotated as RF2. In three genomes, RFH genes are annotated as putative peptide chain release factor 2. The rest of the RFH genes are annotated more accurately as putative peptide chain release factor or peptide chain release factor-like protein. We also found one RF2-encoding gene annotated as the tRNA pseudouridine synthase D gene, truD. For details on misannotated release factors see Supplementary Table 1S. ARFA detected that frameshifting is utilized in the decoding of 259 RF2 genes, which is a slightly larger proportion (87%) than the previous estimate of the distribution of the RF2 frameshifting mechanism among eubacteria (Baranov et al., 2002a). Predicted frameshift cassettes were evaluated manually and were found to be consistent with the canonical consensus for the standard RF2 frameshifting cassette (a few deviations were observed). Supplementary Table 2S lists shifty RF2 genes that are incorrectly annotated as single ORFs. In five cases, frameshifting was annotated, but the location of the frameshift site was annotated incorrectly. In all others, frameshift sites were not detected. To our surprise, only 52 of the shifty RF2 genes are currently annotated correctly in completed genomes. This observation emphasizes the need for an automatic prediction tool such as ARFA. Although ARFA detects incorrectly annotated initiator codons (in those cases where the frameshifting event was not detected during annotation), its own predictions of start codons may not always be accurate.
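The proportions quoted in this section follow directly from the reported counts, as the short Python check below shows. The variable names are ours; the count of incorrectly annotated shifty genes is derived here by subtraction and is not stated explicitly in the text.

```python
rf2_genes = 297            # RF2 genes detected by ARFA
shifty_rf2 = 259           # RF2 genes predicted to use frameshifting
correctly_annotated = 52   # shifty RF2 genes annotated correctly

# Share of RF2 genes that use frameshifting.
pct_shifty = round(100 * shifty_rf2 / rf2_genes)
# Share of shifty RF2 genes with a correct existing annotation.
pct_correct = round(100 * correctly_annotated / shifty_rf2)
# Derived: shifty RF2 genes whose existing annotation is incorrect.
incorrectly_annotated = shifty_rf2 - correctly_annotated

print(pct_shifty)             # 87
print(pct_correct)            # 20
print(incorrectly_annotated)  # 207
```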
To evaluate ARFA prediction selectivity, a random sequence database (totaling 1.7 Gb) was generated by fifth-order Markov chains based on the six-mer frequencies of each of the 311 genomic sequences from RefSeq. ARFA did not detect any RF sequence in this database. Based on the datasets used in this study, we estimate both the selectivity and the sensitivity of ARFA as 100%.
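A fifth-order Markov chain of this kind can be sketched in Python as follows. The generator actually used to build the 1.7 Gb database is not described in the text, so the function names, the choice of starting context and the handling of dead-end contexts are all assumptions of this example.

```python
import random
from collections import Counter, defaultdict


def sixmer_counts(seq):
    """Count all overlapping six-mers in a sequence."""
    return Counter(seq[i:i + 6] for i in range(len(seq) - 5))


def build_transitions(counts):
    """Map each five-base context to counts of the base that follows it."""
    transitions = defaultdict(Counter)
    for sixmer, n in counts.items():
        transitions[sixmer[:5]][sixmer[5]] += n
    return transitions


def generate(transitions, length, rng):
    """Emit a random sequence whose six-mer statistics follow `transitions`."""
    contexts = sorted(transitions)
    seq = list(rng.choice(contexts))  # assumed: uniform choice of start context
    while len(seq) < length:
        followers = transitions.get("".join(seq[-5:]))
        if not followers:
            # Dead-end context never seen with a successor: restart (assumed).
            seq.extend(rng.choice(contexts))
            continue
        bases, weights = zip(*sorted(followers.items()))
        seq.append(rng.choices(bases, weights=weights)[0])
    return "".join(seq[:length])


# Usage on a toy "genome"; a real run would use one model per genomic sequence.
model = build_transitions(sixmer_counts("ACGT" * 10))
random_seq = generate(model, 200, random.Random(0))
print(len(random_seq))  # 200
```

Because each base depends only on the preceding five, such sequences preserve local composition of the source genome while carrying no real genes, which is what makes them suitable as a negative control.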
In general, we believe that approaches that annotate a single gene across all completed genomes (as described here) will become a valuable addition to the more common perpendicular approach of annotating all genes in a single genome, particularly as the number of completed bacterial genomes will soon exceed the number of genes in a given bacterial genome.
| Acknowledgments |
|---|
The authors thank Dr. Mark Yandell for careful reading of the draft manuscript. The authors appreciate personal financial support from Science Foundation Ireland.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Chris Stoeckert
Received on June 6, 2006; revised on July 17, 2006; accepted on August 2, 2006
| REFERENCES |
|---|
Baranov, P.V., et al. (2001) Recode: a database of frameshifting, bypassing and codon redefinition utilized for gene expression. Nucleic Acids Res., 29, 264-267.
Baranov, P.V., et al. (2002a) Recoding: translational bifurcations in gene expression. Gene, 286, 187-201.
Baranov, P.V., et al. (2002b) Release factor 2 frameshifting sites in different bacteria. EMBO Rep., 3, 373-377.
Baranov, P.V., et al. (2003) Recode 2003. Nucleic Acids Res., 31, 87-89.
Baranov, P.V., et al. (2006) Diverse bacterial genomes encode an operon of two genes, one of which is an unusual class-I release factor that potentially recognizes atypical mRNA signals other than normal stop codons. Biology Direct (in press).
Benson, D.A., et al. (2006) GenBank. Nucleic Acids Res., 34, D16-D20.
Craigen, W.J. and Caskey, C.T. (1986) Expression of peptide chain release factor 2 requires high-efficiency frameshift. Nature, 322, 273-275.
Craigen, W.J., et al. (1985) Bacterial peptide chain release factors: conserved primary structure and possible frameshift regulation of release factor 2. Proc. Natl Acad. Sci. USA, 82, 3616-3620.
Eddy, S.R. (1998) Profile hidden Markov models. Bioinformatics, 14, 755-763.
Klobutcher, L.A. and Farabaugh, P.J. (2002) Shifty ciliates: frequent programmed translational frameshifting in euplotids. Cell, 111, 763-766.
Knight, R.D., et al. (2001) Rewiring the keyboard: evolvability of the genetic code. Nat. Rev. Genet., 2, 49-58.
Namy, O., et al. (2004) Reprogrammed genetic decoding in cellular gene expression. Mol. Cell, 13, 157-168.
Pearson, W.R. (1990) Rapid and sensitive sequence comparison with fastp and fasta. Methods Enzymol., 183, 63-98.
Pearson, W.R. and Lipman, D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444-2448.
Pel, H.J., et al. (1992) Sequence comparison of new prokaryotic and mitochondrial members of the polypeptide chain release factor family predicts a five-domain model for release factor structure. Nucleic Acids Res., 20, 4423-4428.
Pruitt, K.D., et al. (2005) NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res., 33, D501-D504.
Santos, M.A., et al. (2004) Driving change: the evolution of alternative genetic codes. Trends Genet., 20, 95-102.
Stajich, J.E., et al. (2002) The bioperl toolkit: perl modules for the life sciences. Genome Res., 12, 1611-1618.
Tate, W.P., et al. (1995) Translational termination efficiency in both bacteria and mammals is regulated by the base following the stop codon. Biochem. Cell Biol., 73, 1095-1103.
Weiss, R.B., et al. (1988) Reading frame switch caused by base-pair formation between the 3' end of 16S rRNA and the mRNA during elongation of protein synthesis in Escherichia coli. EMBO J., 7, 1503-1507.
This article has been cited by other articles:
I. P. Ivanov and J. F. Atkins. Ribosomal frameshifting in decoding antizyme mRNAs from yeast and protists to humans: close to 300 cases reveal remarkable diversity despite underlying conservation. Nucleic Acids Res., March 19, 2007; 35(6): 1842-1858.