Bioinformatics Advance Access originally published online on August 7, 2006
Bioinformatics 2006 22(20):2463-2465; doi:10.1093/bioinformatics/btl430

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

ARFA: a program for annotating bacterial release factor genes, including prediction of programmed ribosomal frameshifting

Michaël Bekaert 1, John F Atkins 1,2 and Pavel V Baranov 1,2,*

1 Biosciences Institute, University College Cork, Cork, Ireland
2 Department of Human Genetics, University of Utah, Salt Lake City, Utah 84112-5330, USA

*To whom correspondence should be addressed.


    ABSTRACT

Summary: Correct annotation of genes encoding release factors in bacterial genomes is often complicated by utilization of +1 programmed ribosomal frameshifting during synthesis of release factor 2, RF2. In the absence of robust computational approaches for predicting ribosomal frameshifting, the success of proper annotation depends on annotators' familiarity with this phenomenon. Here we describe a novel computer tool that allows automatic discrimination of genes encoding class-I bacterial release factors, RF1, RF2 and RFH. Most usefully, this program identifies and automatically annotates +1 frameshifting in RF2 encoding genes. Comparison of ARFA performance with existing annotations of bacterial genomes revealed that only 20% of RF2 genes utilizing ribosomal frameshifting during their expression are annotated correctly.

Availability: The PHP based web interface of ARFA and the source code are located at http://recode.genetics.utah.edu/arfa

Contact: baranov{at}genetics.utah.edu

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
The number of completely sequenced bacterial genomes is increasing almost every day. While the annotation process is becoming routine, it still requires meticulous analysis of sequenced genomes. The accuracy of this procedure largely depends on the annotator's biological expertise. Owing to the considerable diversity of biological processes, it is hard to expect a uniform level of such expertise. Life has made perfect sense of the cliché that ‘rules are made to be broken’. Its opportunistic nature has resulted in the evolution of exceptions to almost every rule, including such fundamentals as the genetic code (Knight et al., 2001; Baranov et al., 2002a; Klobutcher and Farabaugh, 2002; Santos et al., 2004; Namy et al., 2004). In the majority of bacteria, release factor 2 (RF2) is encoded in two overlapping ORFs in the same orientation. The upstream ORF is short (25 codons in Escherichia coli) and the downstream one encodes a much larger portion of RF2 (Craigen et al., 1985). In such cases, expression of RF2 genes requires programmed ribosomal frameshifting, which serves as a negative feedback regulator of RF2 biosynthesis (Craigen and Caskey, 1986). Correct annotation of RF2 genes therefore demands some familiarity with programmed ribosomal frameshifting among annotators. When awareness of frameshifting is lacking, genes encoding RF2 are annotated as a single ORF corresponding to the second, long ORF, with the CDS starting at an internal ATG, GTG or TTG codon, while the first ORF, which contains the real initiation codon, remains outside the RF2 CDS.
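To make the two-ORF arrangement concrete, the following minimal Python sketch decodes a short sequence twice: once conventionally, stopping at the first in-frame stop codon, and once with a +1 shift at that stop codon, so that its first nucleotide is skipped and reading continues in the overlapping frame. The toy sequence and the function names are illustrative assumptions; the sequence is not a real RF2 gene and this is not ARFA code.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def first_stop_index(dna):
    """Return the index of the first in-frame stop codon, or None."""
    for i in range(0, len(dna) - 2, 3):
        if CODON_TABLE[dna[i:i + 3]] == "*":
            return i
    return None


def translate(dna):
    """Translate frame 0 up to (not including) the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def translate_with_plus1_shift(dna):
    """Translate as if the ribosome slips +1 at the first in-frame stop.

    The first nucleotide of the stop codon is skipped, so decoding
    continues in the +1 frame (the downstream, longer ORF).
    """
    stop = first_stop_index(dna)
    if stop is None:
        return translate(dna)
    return translate(dna[:stop] + dna[stop + 1:])


# Hypothetical toy sequence: ATG GCT CTT TGA ... in frame 0.
toy = "ATGGCTCTTTGACTATAAAGGTTAA"
print(translate(toy))                   # short upstream ORF product
print(translate_with_plus1_shift(toy))  # product joined across the +1 shift
```

An annotation that ignores the shift would start the CDS somewhere inside the second frame and lose the residues encoded by the upstream ORF.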

Conservation of the frameshifting cassette and its position, together with significant sequence similarity among upstream RF2 ORFs (Baranov et al., 2002b), provides the opportunity for developing a simple, fully automated tool for the correct annotation of ‘shifty’ RF2 genes. We have developed such a tool, which we named ARFA (Automatic Release Factor Annotation). It detects bacterial class-I release factors and then discriminates between release factors with known specificity to mRNA stop signals (RF1 and RF2). It also discriminates RFH, which presumably recognizes and mediates termination at an unknown mRNA signal (Pel et al., 1992; Baranov et al., 2006). For each RF2 gene, ARFA determines whether RF2 is encoded in a single ORF or in two ORFs. If RF2 is encoded in two ORFs, ARFA further finds the frameshift site and generates a detailed description of the frameshift cassette, which includes a slippery site, a weak signal for termination of translation (Tate et al., 1995) and a stimulatory internal Shine–Dalgarno sequence (Weiss et al., 1988).

ARFA is a Bioperl (Stajich et al., 2002) module that can be downloaded from ARFA's website. ARFA can be easily implemented in pipelines for annotation of eubacterial genomes. It can be applied to any given sequence including eukaryotic mRNA sequences where it can be used for detecting genes encoding mitochondrial and chloroplast release factors. Data regarding frameshifting cassettes in bacterial genomes obtained with ARFA will be used by us for future automatic updates of the RECODE database (Baranov et al., 2001, 2003).


    2 ALGORITHM AND IMPLEMENTATION
As input, ARFA takes DNA or RNA sequences in FASTA format or GenBank accession numbers. It produces output either in GenBank flat file format or in an extended XML format designed for future versions of the RECODE database. The XML output provides a detailed description of the frameshifting cassette. A local version of ARFA can be installed on Linux/Unix platforms; it requires pre-installation of BioPerl 1.5.1 (Stajich et al., 2002), FASTA 3.4 (Pearson and Lipman, 1988; Pearson, 1990) and HMMER 2.3.2 (Eddy, 1998).

The general scheme of ARFA analysis is illustrated in Supplementary Figure 1S. In the first step, ARFA determines the size of the analyzed sequence. For large sequences (>20 kb), it runs a FASTA search against the sequence provided by the user or retrieved from GenBank (Benson et al., 2006) using a user-provided accession number. The FASTA search serves as a rapid filter that reduces the number of unrelated sequences analyzed during the second step. As the query for the FASTA search, ARFA uses a small number of selected RF sequences. This search is performed with relaxed parameters to prevent elimination of true positives. The FASTA search is not performed for short (<20 kb) sequences.
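The size-based dispatch of this first step can be sketched in Python as follows. This is only an illustration under stated assumptions: `candidate_regions` and the `prefilter` callable are hypothetical names, the pre-filter stands in for the relaxed FASTA search rather than invoking it, and the treatment of a sequence of exactly 20 kb (handled here as short) is not specified in the text.

```python
from typing import Callable, List, Tuple

Region = Tuple[int, int]  # (start, end) coordinates, end exclusive

# 20 kb threshold taken from the text; the behaviour at exactly
# 20 kb is an assumption of this sketch.
PREFILTER_THRESHOLD = 20_000


def candidate_regions(sequence: str,
                      prefilter: Callable[[str], List[Region]]) -> List[Region]:
    """Return the regions that should be passed on to the HMM step.

    Large sequences are first reduced by a fast, permissive pre-filter
    (a stand-in for the FASTA search); short sequences skip the
    pre-filter and are analyzed in full.
    """
    if len(sequence) > PREFILTER_THRESHOLD:
        return prefilter(sequence)
    return [(0, len(sequence))]


def fake_prefilter(sequence: str) -> List[Region]:
    """Hypothetical pre-filter that always reports one candidate window."""
    return [(100, 1200)]


print(candidate_regions("A" * 500, fake_prefilter))     # whole short sequence
print(candidate_regions("A" * 50_000, fake_prefilter))  # pre-filtered windows
```

Keeping the pre-filter permissive means it only has to be fast; the decision about which candidates are real is deferred to the more specific second step.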

In the second step, ARFA extracts the candidate sequences and performs HMMER searches using HMMs based on alignments of RF1, RF2 and RFH sequences. At this step, a large number of false positives are eliminated and the remaining sequences are classified based on their similarity to the RF HMMs. The E-value threshold is set by default to 1e–40, which was determined empirically to give the best performance. Whether an RF2 gene contains a frameshift site is assessed by comparing the candidate RF2 sequence with an HMM based on the alignment of the N-terminal parts of selected RF2 sequences. If the N-terminal RF2 HMM has a hit at the 5' end of the long ORF, RF2 is considered to be encoded in a single ORF. If the hit is located in a small ORF overlapping the long one at its 5' end, RF2 is considered to be expressed via ribosomal frameshifting. Further, the nucleotide sequence at the end of the small ORF is compared with an HMM of the frameshift site, which is based on the alignment of nucleotide sequences from known frameshift sites (Baranov et al., 2002b). This is useful for detailed manual analysis of frameshifting cassettes.
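The decision logic of this second step might be sketched in Python as below. The inputs are assumed to be already-parsed HMMER results; the function names, the dictionary layout, the "lowest E-value wins" rule and the inclusive comparison with the threshold are assumptions made for illustration and are not taken from the ARFA source code.

```python
from typing import Dict, Optional

# Default threshold stated in the text.
E_VALUE_THRESHOLD = 1e-40


def classify_release_factor(evalues: Dict[str, float],
                            threshold: float = E_VALUE_THRESHOLD) -> Optional[str]:
    """Assign a candidate to RF1, RF2 or RFH from per-family E-values.

    Candidates with no E-value at or below the threshold are rejected
    as false positives (None). Among passing families, the one with the
    lowest E-value is chosen (an assumption of this sketch).
    """
    passing = {family: e for family, e in evalues.items() if e <= threshold}
    if not passing:
        return None
    return min(passing, key=passing.get)


def rf2_expression_mode(nterm_hit_orf: Optional[str]) -> str:
    """Interpret where the N-terminal RF2 HMM hit was found.

    'long'     -> hit at the 5' end of the long ORF: single ORF
    'upstream' -> hit in the small overlapping ORF: +1 frameshifting
    """
    if nterm_hit_orf == "long":
        return "single ORF"
    if nterm_hit_orf == "upstream":
        return "+1 frameshifting"
    return "undetermined"


# Hypothetical E-values for two candidates.
print(classify_release_factor({"RF1": 1e-60, "RF2": 1e-120, "RFH": 1e-45}))
print(classify_release_factor({"RF1": 1e-5, "RF2": 1e-8}))
print(rf2_expression_mode("upstream"))
```

Separating family classification from the single-ORF versus two-ORF decision mirrors the text: the first uses the full-length family HMMs, the second only the N-terminal RF2 model.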

ARFA is written in Perl and it utilizes BioPerl modules. ARFA can be executed directly from Linux/Unix command line or it can be called from an external web application or an annotation pipeline. Input and output format options, E-value threshold, translational table and initiation codon restrictions are user defined.


    3 PERFORMANCE
To evaluate ARFA prediction sensitivity, sequences of completed bacterial genomes were downloaded from the RefSeq database (Pruitt et al., 2005) on May 20, 2006. The dataset contains chromosomal sequences from 311 bacteria. ARFA detected 311 RF1 genes, 297 RF2 genes and 23 RFH genes. All genomes in which RF2 was not found were from bacteria where UGA is not recognized as a stop codon.

While ARFA predictions of RF1 encoding genes matched genome annotations precisely, a number of RF2 and RFH genes are incorrectly annotated in completed genomes. In 12 genomes, RFH encoding genes are annotated as ‘peptide chain release factor 2’, leading to the situation where the same genome appears to contain two RF2 encoding genes. In three genomes, RFH genes are annotated as ‘putative peptide chain release factor 2’. The rest of the RFH genes are annotated more accurately as ‘putative peptide chain release factor’ or ‘peptide chain release factor-like protein’. We also found one RF2 encoding gene annotated as ‘tRNA pseudouridine synthase D gene, truD’. For details on misannotated release factors see Supplementary Table 1S.

ARFA detected that frameshifting is utilized in the decoding of 259 RF2 genes, which is a slightly larger proportion (~87%) than the previous estimate of the distribution of the RF2 frameshifting mechanism among eubacteria (Baranov et al., 2002a). Predicted frameshift cassettes were evaluated manually and were found to be consistent with the canonical consensus for the standard RF2 frameshifting cassette (a few deviations were observed). Supplementary Table 2S lists ‘shifty’ RF2 genes that are incorrectly annotated as single ORFs. In five cases, frameshifting was annotated, but the location of the frameshift site was annotated incorrectly. In all other cases, the frameshift site was not detected during annotation. To our surprise, only 52 of the ‘shifty’ RF2 genes are currently annotated correctly in completed genomes. This observation emphasizes the need for an automatic prediction tool such as ARFA.
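The proportions quoted above follow directly from the reported counts, as the short Python check below shows. The last figure is an inference rather than a number stated in the text: it assumes that every ‘shifty’ RF2 gene falls into exactly one of three groups (correctly annotated, frameshift annotated at the wrong site, frameshift not annotated).

```python
# Counts reported in the text.
rf2_genes = 297         # RF2 genes detected by ARFA
shifty_rf2 = 259        # RF2 genes predicted to use +1 frameshifting
correctly_annotated = 52
wrong_site = 5          # frameshift annotated, but at the wrong location

shifty_percent = 100 * shifty_rf2 / rf2_genes
correct_percent = 100 * correctly_annotated / shifty_rf2

# Inferred under the three-group assumption stated in the lead-in.
frameshift_missed = shifty_rf2 - correctly_annotated - wrong_site

print(f"RF2 genes using frameshifting: {shifty_percent:.1f}%")      # 87.2%
print(f"Shifty RF2 genes annotated correctly: {correct_percent:.1f}%")  # 20.1%
print(f"Shifty RF2 genes with no frameshift annotated: {frameshift_missed}")
```

The second percentage corresponds to the figure of about 20% given in the Summary.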

Although ARFA detects incorrectly annotated initiator codons (in those cases where the frameshifting event was not detected during annotation), its own predictions of start codons may not always be accurate.

To evaluate ARFA prediction selectivity, a random sequence database (totaling 1.7 Gb) was generated by fifth-order Markov chains based on the six-mer frequencies of each of the 311 genomic sequences from RefSeq. ARFA did not detect any RF sequence in this database. Based on the datasets used in this study, we estimate both the selectivity and the sensitivity of ARFA as 100%.
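A fifth-order Markov chain of this kind can be sketched in a few lines of Python: six-mer counts from a template sequence give, for every five-nucleotide context, the frequencies of the base that follows it, and a random sequence is then emitted one base at a time. The function names, the short template and the circular treatment of the template (used here so that every context has a successor) are assumptions of this sketch; the text does not describe how the original generator was implemented.

```python
import random
from collections import Counter, defaultdict


def sixmer_model(template, order=5):
    """Count which base follows each 5-mer context in the template.

    The template is treated as circular so that every context that can
    be reached during generation has at least one observed successor.
    """
    circular = template + template[:order]
    model = defaultdict(Counter)
    for i in range(len(template)):
        context = circular[i:i + order]
        model[context][circular[i + order]] += 1
    return model


def generate(model, length, seed=0, order=5):
    """Emit a random sequence whose 6-mer statistics follow the model."""
    rng = random.Random(seed)
    out = list(rng.choice(sorted(model)))  # start from a random observed context
    while len(out) < length:
        counts = model["".join(out[-order:])]
        bases = sorted(counts)
        weights = [counts[b] for b in bases]
        out.append(rng.choices(bases, weights=weights)[0])
    return "".join(out[:length])


# Hypothetical short template; a real run would use a whole genome.
template = "ATGGCTCTTTGACTATAAAGGTTAACGCGTACGTTAGC"
random_seq = generate(sixmer_model(template), 200)
print(len(random_seq))
```

Such sequences preserve the local composition of the source genome while destroying long-range gene structure, which is what makes them a reasonable negative control for a gene-detection tool.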

In general, we believe that approaches that specifically annotate a single gene across all completed genomes (as described here) will become a valuable addition to the more common, perpendicular approach of annotating all genes in a single genome, particularly as the number of completed bacterial genomes will soon exceed the number of genes in a given bacterial genome.


    Acknowledgments
 
The authors thank Dr. Mark Yandell for careful reading of the draft manuscript. The authors appreciate personal financial support from Science Foundation Ireland.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Chris Stoeckert

Received on June 6, 2006; revised on July 17, 2006; accepted on August 2, 2006

    REFERENCES

    Baranov, P.V., et al. (2001) Recode: a database of frameshifting, bypassing and codon redefinition utilized for gene expression. Nucleic Acids Res., 29, 264–267.

    Baranov, P.V., et al. (2002a) Recoding: translational bifurcations in gene expression. Gene, 286, 187–201.

    Baranov, P.V., et al. (2002b) Release factor 2 frameshifting sites in different bacteria. EMBO Rep., 3, 373–377.

    Baranov, P.V., et al. (2003) Recode 2003. Nucleic Acids Res., 31, 87–89.

    Baranov, P.V., et al. (2006) Diverse bacterial genomes encode an operon of two genes, one of which is an unusual class-I release factor that potentially recognizes atypical mRNA signals other than normal stop codons. Biology Direct, (in press).

    Benson, D.A., et al. (2006) GenBank. Nucleic Acids Res., 34, D16–D20.

    Craigen, W.J. and Caskey, C.T. (1986) Expression of peptide chain release factor 2 requires high-efficiency frameshift. Nature, 322, 273–275.

    Craigen, W.J., et al. (1985) Bacterial peptide chain release factors: conserved primary structure and possible frameshift regulation of release factor 2. Proc. Natl Acad. Sci. USA, 82, 3616–3620.

    Eddy, S.R. (1998) Profile hidden Markov models. Bioinformatics, 14, 755–763.

    Klobutcher, L.A. and Farabaugh, P.J. (2002) Shifty ciliates: frequent programmed translational frameshifting in euplotids. Cell, 111, 763–766.

    Knight, R.D., et al. (2001) Rewiring the keyboard: evolvability of the genetic code. Nat. Rev. Genet., 2, 49–58.

    Namy, O., et al. (2004) Reprogrammed genetic decoding in cellular gene expression. Mol. Cell, 13, 157–168.

    Pearson, W.R. (1990) Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol., 183, 63–98.

    Pearson, W.R. and Lipman, D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444–2448.

    Pel, H.J., et al. (1992) Sequence comparison of new prokaryotic and mitochondrial members of the polypeptide chain release factor family predicts a five-domain model for release factor structure. Nucleic Acids Res., 20, 4423–4428.

    Pruitt, K.D., et al. (2005) NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res., 33, D501–D504.

    Santos, M.A., et al. (2004) Driving change: the evolution of alternative genetic codes. Trends Genet., 20, 95–102.

    Stajich, J.E., et al. (2002) The Bioperl toolkit: Perl modules for the life sciences. Genome Res., 12, 1611–1618.

    Tate, W.P., et al. (1995) Translational termination efficiency in both bacteria and mammals is regulated by the base following the stop codon. Biochem. Cell Biol., 73, 1095–1103.

    Weiss, R.B., et al. (1988) Reading frame switch caused by base-pair formation between the 3' end of 16S rRNA and the mRNA during elongation of protein synthesis in Escherichia coli. EMBO J., 7, 1503–1507.

