

Bioinformatics Advance Access originally published online on June 30, 2005
Bioinformatics 2005 21(17):3568-3569; doi:10.1093/bioinformatics/bti563

Published by Oxford University Press 2005

TICO: a tool for improving predictions of prokaryotic translation initiation sites

Maike Tech *, Nico Pfeifer , Burkhard Morgenstern and Peter Meinicke

Abteilung Bioinformatik, Institut für Mikrobiologie und Genetik, Georg-August-Universität Göttingen Goldschmidtstrasse 1, 37077 Göttingen, Germany

*To whom correspondence should be addressed.


    Abstract
 

Summary: We provide the tool ‘TICO’ (Translation Initiation site COrrection) for improving the results of conventional gene finders for prokaryotic genomes with regard to exact localization of the translation initiation site (TIS). Currently, TICO provides an interface for direct post-processing of the predictions obtained from the widely used program GLIMMER. Our program is based on a clustering algorithm for completely unsupervised scoring of potential TIS locations.

Availability: Our tool can be freely accessed through a web interface at http://tico.gobics.de/

Contact: maike{at}gobics.de

For prokaryotes, there are a number of gene-finding tools that can reliably predict the location of genes in a genome under study, for example GLIMMER (Delcher et al., 1999), GS-Finder (Ou et al., 2004), MED-Start (Zhu et al., 2004), ZCurve (Guo et al., 2000) and GeneMarkS (Besemer et al., 2001). Essentially, these methods are based on a search for open reading frames that exceed a statistically significant minimal length. In addition, characteristic statistics of sequence content features, such as oligonucleotide frequencies, are considered when evaluating these open reading frames. However, while it is obvious how to identify the end position of a putative gene, it is by no means trivial to determine the corresponding start position, because the codons that signal the initiation of translation may also be used inside genes to code for amino acids. Systematic studies have shown that existing gene finders perform poorly in the prediction of correct translation initiation sites (TIS) (Ou et al., 2004; Zhu et al., 2004; Tech and Merkl, 2003). Consequently, many start positions are incorrectly annotated in databases and, because new annotations often build on existing ones, these errors tend to be propagated to newly annotated genomes.

We present a tool, TICO (Translation Initiation site COrrection), for improving the results of conventional gene finders by analyzing and relocating prior predictions of prokaryotic TIS. Currently our tool provides an interface for post-processing the output of the widely used program GLIMMER. Unlike other programs, it is not based on any specific assumptions about prokaryotic TIS. Some existing tools do provide a sequence model and an unsupervised method for optimizing most of the parameters without the need for prior knowledge (Ou et al., 2004; Zhu et al., 2004; Guo et al., 2000; Suzek et al., 2001). However, these models usually include additional TIS-related parameters that cannot be adjusted by means of the optimization method. These special parameters can, for instance, involve the length of a putative RBS motif, the maximal number of RBS motifs considered or the distribution of the start codon usage. Usually these parameters are set to ‘default’ values, which provide good results on genomes like Escherichia coli and Bacillus subtilis. Because this choice implicitly makes assumptions about TIS characteristics, there is a risk that the results become suboptimal if the tools are applied to genomes of other species.

Our method is based on the analysis of candidate TIS sequences as obtained from the flanking regions of potential start codons. We implemented a clustering algorithm that performs an unsupervised classification of sequences into strong-TIS and weak-TIS categories. As potential TIS locations we consider the positions of all admissible start codons in a specified search range (Fig. 1) around the initial TIS, as predicted by a conventional gene finder. In addition, potential start codons have to share the reading frame of the associated gene, and no in-frame stop codon may occur between the candidate start and the annotated stop. For an initial classification we consider each TIS predicted by the gene finder as a strong TIS and all other candidates within the same search range as weak TIS. The two classes are represented by inhomogeneous second-order Markov models with positional smoothing (Meinicke et al., 2004) of the corresponding trinucleotide probabilities. In an iterative process the candidate TISs are scored with a positional weight matrix (PWM) based on the difference between the log-probabilities of the two second-order Markov models for strong TIS and weak TIS. In each iteration, the candidate with the highest positive score within the search range is considered to be a strong TIS.
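The iterative relabeling described above can be sketched as follows. This is a simplified illustration, not the TICO implementation: it replaces the positionally smoothed Markov models with plain position-specific trinucleotide tables and pseudocounts, assumes the score is the strong-model log-probability minus the weak-model log-probability, and assumes that a gene keeps its current strong TIS when no candidate scores positive. The names `fit`, `relocate`, `pseudo` and `max_iter` are hypothetical.

```python
import math
from collections import Counter


def fit(windows, pseudo=0.5):
    """Fit position-specific trinucleotide tables and return a log-probability function.

    Simplification: pseudocounts stand in for the positional smoothing of the paper.
    Windows are assumed to have equal length and to contain only A, C, G, T.
    """
    n_pos = len(windows[0]) - 2
    total = len(windows) + 64 * pseudo  # 64 possible trinucleotides
    tables = [Counter(w[p:p + 3] for w in windows) for p in range(n_pos)]

    def log_prob(window):
        return sum(math.log((tables[p][window[p:p + 3]] + pseudo) / total)
                   for p in range(n_pos))

    return log_prob


def relocate(genes, max_iter=20):
    """genes: list of (candidate_windows, index_of_initial_tis).

    Returns, for each gene, the index of the candidate finally labeled strong.
    """
    strong = [initial for _, initial in genes]
    for _ in range(max_iter):
        strong_windows = [ws[s] for (ws, _), s in zip(genes, strong)]
        weak_windows = [w for (ws, _), s in zip(genes, strong)
                        for i, w in enumerate(ws) if i != s]
        if not weak_windows:
            break  # nothing to contrast against
        lp_strong, lp_weak = fit(strong_windows), fit(weak_windows)
        updated = []
        for (ws, _), s in zip(genes, strong):
            # Assumed sign convention: positive scores favor the strong-TIS model.
            scores = [lp_strong(w) - lp_weak(w) for w in ws]
            best = max(range(len(ws)), key=scores.__getitem__)
            # Assumption: keep the current label if no candidate scores positive.
            updated.append(best if scores[best] > 0 else s)
        if updated == strong:
            break  # labels are stable
        strong = updated
    return strong
```

In this sketch each gene contributes exactly one strong window per iteration, and all remaining candidates form the weak class, mirroring the initial classification in the text.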



Fig. 1. The figure illustrates two parameters that may be adapted by the user. Above, the search range is shown. This range defines the maximum distance to be searched for alternative start sites around the initially predicted TIS (denoted as initial TIS in the figure). The initial TIS and the alternative start sites are termed candidate TIS. Below, the extract range is shown. This defines the regions around each candidate TIS to be extracted for the scoring based on unsupervised learning. For both parameters, search range and extract range, the length of the upstream and downstream regions can be adjusted independently by the user.

 
TICO currently supports post-processing of predictions obtained by the widely used program GLIMMER (Delcher et al., 1999). GLIMMER is well suited for prior prediction because it has high sensitivity and can therefore be expected to miss relatively few genes entirely. Nevertheless, in future versions of our tool we will also support additional input formats (e.g. GenBank). For post-processing of GLIMMER predictions the user has to submit the annotation as obtained from the GLIMMER output together with the FASTA sequence of the genome. Both files have to be uploaded to our web site. In addition to the annotation and sequence inputs, some optional parameters may be adjusted by the user. First, the range for searching additional candidate TIS (termed search range, Fig. 1) may be adjusted. Here the user may define the maximum distance to search for alternative start sites. The default values are 250 nt upstream and 250 nt downstream of an initial TIS. These values can be altered independently. As mentioned above, all possible start codons that share the reading frame of the predicted TIS (with no in-frame stop codon) are considered in the algorithm. All potential start codons, which include the initially predicted TIS and the alternative start sites, are termed candidate TISs.
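As an illustration of the search range, the following sketch enumerates candidate TISs for a forward-strand gene. It is an assumption-laden example rather than TICO's code: the start codon set (ATG, GTG, TTG) is assumed because the text only speaks of admissible start codons, coordinates are 0-based, and the function name `candidate_tis` is hypothetical.

```python
START_CODONS = {"ATG", "GTG", "TTG"}  # assumed set of admissible start codons
STOP_CODONS = {"TAA", "TAG", "TGA"}


def candidate_tis(seq, initial_tis, stop, upstream=250, downstream=250):
    """Return 0-based positions of candidate start codons on the forward strand.

    initial_tis: index of the first base of the initially predicted start codon.
    stop: index of the first base of the annotated stop codon.
    A candidate must lie within the search range, share the reading frame of
    the initial TIS, and have no in-frame stop codon before the annotated stop.
    """
    seq = seq.upper()
    lo = max(0, initial_tis - upstream)
    hi = min(stop - 3, initial_tis + downstream)
    # First in-frame position that is not upstream of the search range.
    first = initial_tis - ((initial_tis - lo) // 3) * 3

    def open_frame(pos):
        return all(seq[i:i + 3] not in STOP_CODONS for i in range(pos, stop, 3))

    return [pos for pos in range(first, hi + 1, 3)
            if seq[pos:pos + 3] in START_CODONS and open_frame(pos)]
```

The returned list includes the initial TIS itself whenever it is an admissible start codon, matching the definition of candidate TISs above. Reverse-strand genes would need the reverse complement and mirrored coordinates, which this sketch omits.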

The second parameter, termed extract range, may be adjusted in the same way. With this parameter the user can define the window around each candidate TIS to be extracted for analysis by the unsupervised learning routines. Because these sequence windows are used for clustering and scoring a candidate, they should be wide enough to contain the characteristics of the potential start site. The default values are 30 nt upstream and 30 nt downstream of a candidate TIS.
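A minimal sketch of the window extraction might look like this. Whether the downstream part is counted from the first base of the start codon or from the base after it is not stated in the text; the sketch assumes the former. The function name `extract_window` and the choice to skip truncated windows are hypothetical.

```python
def extract_window(seq, tis, upstream=30, downstream=30):
    """Return the sequence window around a candidate TIS, or None if truncated.

    tis is the 0-based index of the first base of the candidate start codon.
    Assumption: the downstream part starts at the start codon itself, so the
    window covers seq[tis - upstream : tis + downstream].
    """
    start = tis - upstream
    end = tis + downstream
    if start < 0 or end > len(seq):
        return None  # window would run past the sequence ends
    return seq[start:end]
```

Returning windows of one fixed length keeps position-specific models aligned across candidates, which is why truncated windows are dropped here instead of padded.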

Third, the standard deviation parameter sigma of the Gaussian probability density function used for smoothing the estimated trinucleotide probabilities may be adjusted. The choice of the Gaussian smoothing kernel does not imply any distributional assumptions about the trinucleotide occurrences; it merely adapts the estimation to a varying number of genes under consideration. The default value of sigma = 0.5 should work well for genomes with approximately 4000 genes. For a genome with a considerably smaller number of genes, it may be useful to choose a higher degree of smoothing, i.e. a larger sigma, in order to prevent vanishing probabilities.
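The effect of sigma can be illustrated with a small sketch of Gaussian positional smoothing. The exact estimator is described in Meinicke et al. (2004); the version below is an assumed simplification in which the counts at each window position are replaced by a Gaussian-weighted sum of the counts at all positions and then normalized. The function name `smoothed_trinucleotide_probs` is hypothetical, and sigma is assumed to be measured in sequence positions.

```python
import math
from collections import Counter


def smoothed_trinucleotide_probs(windows, sigma=0.5):
    """Estimate position-specific trinucleotide probabilities with Gaussian smoothing.

    windows: equal-length sequence windows aligned at the candidate TIS.
    Returns one dict per trinucleotide position mapping trinucleotide -> probability.
    """
    n_pos = len(windows[0]) - 2
    counts = [Counter(w[p:p + 3] for w in windows) for p in range(n_pos)]
    probs = []
    for p in range(n_pos):
        weighted = Counter()
        for q in range(n_pos):
            # Gaussian kernel weight between target position p and source position q.
            weight = math.exp(-((p - q) ** 2) / (2.0 * sigma ** 2))
            for tri, c in counts[q].items():
                weighted[tri] += weight * c
        total = sum(weighted.values())
        probs.append({tri: value / total for tri, value in weighted.items()})
    return probs
```

With a small sigma each position is estimated almost only from its own counts, while a larger sigma borrows counts from neighboring positions, which helps when few genes are available. Trinucleotides never observed in any window still receive no probability mass in this sketch.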

Finally, a minimum gene length may be set by the user. This prevents the algorithm from reannotating a TIS if the resulting gene would be too short to be a plausible coding sequence. If the distance from a candidate TIS to the annotated stop falls below the minimum gene length, the candidate is omitted from the list of candidates.
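The minimum gene length rule amounts to a simple filter over the candidate list. The sketch below assumes forward-strand, 0-based positions and measures the distance from the first base of the candidate start codon to the first base of the annotated stop codon; the function name is hypothetical and no default threshold is given because the text does not state one.

```python
def filter_by_min_length(candidates, stop, min_gene_length):
    """Keep only candidate TIS positions whose distance to the stop is long enough.

    candidates: 0-based positions of candidate start codons (forward strand).
    stop: 0-based position of the first base of the annotated stop codon.
    """
    return [pos for pos in candidates if stop - pos >= min_gene_length]
```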

Currently, TICO provides two output formats: a GLIMMER-like output and an output in GFF (general feature format). The output is emailed to the user in the format selected in the web interface. The adapted GLIMMER output contains all putative genes from the initial GLIMMER prediction in the original notation. For each gene two columns are appended: the score calculated for the respective TIS and the number of nucleotides by which the TIS is shifted from the position initially predicted by GLIMMER. If the new position is located upstream of the initially predicted TIS, the shift is negative; if it is located downstream, the shift is positive. The score is the value from the PWM for the respective TIS.
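The sign convention for the shift column can be made concrete with a short sketch. It assumes genomic coordinates that increase along the forward strand, so that on the reverse strand an upstream move corresponds to a larger coordinate; the function name `tis_shift` is hypothetical and the handling of the reverse strand is an assumption, since the text does not spell it out.

```python
def tis_shift(initial_pos, relocated_pos, strand="+"):
    """Return the TIS shift in nucleotides: negative = upstream, positive = downstream.

    Positions are genomic coordinates of the first base of the start codon.
    On the reverse strand ("-") upstream means a larger genomic coordinate.
    """
    delta = relocated_pos - initial_pos
    return delta if strand == "+" else -delta
```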

The GFF output follows the specifications of the Sanger Institute (http://www.sanger.ac.uk/). It contains the GLIMMER-predicted positions with the feature tag CDS (coding region) and the suggested new positions with the tag REANNCDS. The GFF output can be visualized using the program ARTEMIS (Rutherford et al., 2000). By adding the line colour_of_REANNCDS = 1 to the options file, the relocated TIS will appear grey, while the CDS is colored light blue by default.
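For readers who want to post-process or reproduce such files, the sketch below writes tab-separated GFF records with the two feature tags named above. The column order (seqname, source, feature, start, end, score, strand, frame, attributes) follows the general GFF layout with 1-based inclusive coordinates; the source labels "GLIMMER" and "TICO", the two-decimal score format and the function name `gff_record` are assumptions, since the exact layout of TICO's output is not given in the text.

```python
def gff_record(seqname, source, feature, start, end, score, strand,
               frame="0", attributes=""):
    """Format one tab-separated GFF line (1-based, inclusive coordinates).

    A missing score is written as "." as is usual in GFF.
    """
    score_field = "." if score is None else f"{score:.2f}"
    fields = [seqname, source, feature, str(start), str(end),
              score_field, strand, frame]
    if attributes:
        fields.append(attributes)
    return "\t".join(fields)


def gene_records(seqname, start, end, new_start, score):
    """Return the original CDS line and the relocated REANNCDS line (forward strand)."""
    return [
        gff_record(seqname, "GLIMMER", "CDS", start, end, None, "+"),
        gff_record(seqname, "TICO", "REANNCDS", new_start, end, score, "+"),
    ]
```

Keeping both features for each gene lets a viewer such as ARTEMIS display the initial and the relocated start side by side.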

We compared the results of TICO with those of other state-of-the-art gene finders and post-processors on the EcoGene dataset (Rudd, 2000), which contains 854 annotations with verified N-termini (Table 1). The input was the prediction of GLIMMER2.02. Compared with the GLIMMER input, the accuracy of the TIS prediction could be improved by 31.1%. Other input formats as well as a stand-alone version of the tool will be provided soon.


Table 1 The accuracy of the TIS prediction obtained by TICO compared with other gene finders and the GenBank annotation (GBK)

 


    Acknowledgments
 
Conflict of Interest: none declared.

Received on May 9, 2005; revised on June 24, 2005; accepted on June 27, 2005

    REFERENCES
 

    Besemer, J., et al. (2001) GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res., 29, 2607–2618.

    Delcher, A.L., et al. (1999) Improved microbial gene identification with GLIMMER. Nucleic Acids Res., 27, 4636–4641.

    Guo, F.B., et al. (2000) ZCurve: a new system for recognizing protein-coding genes in bacterial and archaeal genomes. Nucleic Acids Res., 31, 1780–1789.

    Meinicke, P., et al. (2004) Oligo kernels for datamining on biological sequences: a case study on prokaryotic translation initiation sites. BMC Bioinformatics, 5, 169.

    Ou, H.Y., et al. (2004) GS-Finder: a program to find bacterial gene start sites with a self-training method. Int. J. Biochem. Cell Biol., 36, 535–544.

    Rudd, K.E. (2000) EcoGene: a genome sequence database for Escherichia coli K-12. Nucleic Acids Res., 28, 60–64.

    Rutherford, K., et al. (2000) Artemis: sequence visualisation and annotation. Bioinformatics, 16, 944–945.

    Suzek, B.E., et al. (2001) A probabilistic method for identifying start codons in bacterial genomes. Bioinformatics, 17, 1123–1130.

    Tech, M. and Merkl, R. (2003) YACOP: enhanced gene prediction obtained by a combination of existing methods. In Silico Biol., 3, 441–451.

    Zhu, H.-Q., et al. (2004) Accuracy improvement for identifying translation initiation sites in microbial genomes. Bioinformatics, 20, 3308–3317.



