Bioinformatics Advance Access originally published online on October 31, 2007
Bioinformatics 2007 23(22):3088-3090; doi:10.1093/bioinformatics/btm512
CONSORF: a consensus prediction system for prokaryotic coding sequences
¹KOBIC (Korean BioInformation Center), KRIBB, Daejeon 305-806 and ²Soongsil University, 1-1 Sangdo-dong, Dongjak-gu, Seoul 156-743, Korea
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Summary: CONSORF is a fully automatic high-accuracy identification system that provides consensus prokaryotic CDS information. It first predicts the CDSs supported by consensus alignments. The alignments are derived from multiple genome-to-proteome comparisons with other prokaryotes using the FASTX program. Then, it fills the empty genomic regions with the CDSs supported by consensus ab initio predictions. From those consensus results, CONSORF provides prediction reliability scores, predicted frame-shifts, alternative start sites and best pair-wise match information against other prokaryotes. These results are easily accessed from a website.
Availability: The regularly updated CDS predictions of prokaryotic genomes as well as the source code are freely accessible through http://consorf.kobic.re.kr and http://orfome.org.
Contact: j{at}bio.cc, jong{at}kribb.re.kr or sskimb{at}ssu.ac.kr
Supplementary information: The detailed methods and evaluation results can be found at http://consorf.kobic.re.kr/supplementary/.
| 1 INTRODUCTION |
|---|
While the number of known prokaryotic whole genomes is increasing rapidly, the consistency of their coding sequence (CDS) predictions varies depending on the genome (Devos and Valencia, 2001; Riley et al., 2005; Skovgaard et al., 2001). Moreover, it is difficult to systematically re-annotate them fast enough to keep up with new information from the expanding public databases.
Numerous computational methods have been developed for predicting prokaryotic genomic CDSs. Most of them are ab initio prediction methods such as GeneMark (Borodovsky and McIninch, 1993), GeneMark.hmm (Lukashin and Borodovsky, 1998) and GLIMMER (Salzberg et al., 1998). However, the CDSs predicted by such programs usually need further time-consuming manual processing. To complement such ab initio prediction methods, some CDS prediction programs add homology-based methods: ORPHEUS (Frishman et al., 1998), Critica (Badger and Olsen, 1999), FrameD (Schiex et al., 2003) and YACOP (Tech and Merkl, 2003). However, ORPHEUS and FrameD have shown relatively low CDS prediction accuracies, and Critica and YACOP provide neither frame-shift information nor an automated prediction pipeline.
Therefore, we have developed a consensus method, CONSORF, a fully automated and regularly updated prediction system for prokaryotic CDSs. It aims to improve both the sensitivity and the specificity of CDS prediction.
| 2 CONSORF CDS FINDING |
|---|
CONSORF provides five kinds of predicted CDSs in FASTA and XML formats, based on their sources of evidence and refinement levels in prediction:
- homology-based consensus CDSs (called homology CDSs),
- alternative homology-based CDSs, predicted from the overall best match without considering the consensus among hits (called alternative CDSs),
- algorithm-based ab initio consensus CDSs (called ab initio CDSs),
- CDSs from the integration of homology CDSs and ab initio CDSs over certain thresholds (called integrated CDSs) and
- the representative final CDSs, whose start codon positions have been refined by analysing the N-terminal residue matches of integrated CDSs (called representative CDSs).
From a prokaryotic genome sequence, the CONSORF system predicts CDSs using two complementary approaches: homology-based and algorithm-based. In the homology-based approach, pair-wise genome-to-proteome comparisons are performed with the FASTX program (Pearson et al., 1997), generating both homology CDSs and alternative CDSs. In the algorithm-based approach, multiple ab initio predictors, currently GeneMark, GLIMMER and GeneMark.hmm, are run to provide ab initio CDSs.
Homology CDSs are determined from the representative FASTX alignment with the highest sum of bit scores in consensus analyses of stop, start and frame change positions. The ab initio CDSs are determined from the consensus of the algorithm-based CDSs with the highest sum of CDS nucleotide lengths, in consensus analyses of stop and start positions only. In contrast, alternative CDSs are determined directly from the FASTX alignments with the highest individual bit score across all the pair-wise comparisons.
The integrated CDSs are obtained by placing the complementary homology CDSs first and then adding the ab initio CDSs that do not significantly overlap them on the genome; the CDSs integrated in this way were predicted with high accuracy. To determine the most likely start site among candidate starts, the integrated CDSs that aligned exactly with the N-terminal end of a library protein in the pair-wise FASTX comparisons were inspected to provide the final representative CDSs.
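The consensus and integration steps described above can be illustrated with a minimal sketch. It is not the CONSORF implementation: the `CDS` record, the function names, the `min_support` vote count and the `max_overlap` cutoff are all hypothetical, chosen only to make the idea concrete, since the thresholds actually used are not given here. The sketch groups ab initio predictions by strand and stop position, picks the start with the largest summed CDS length, and then adds the resulting ab initio CDSs only where they do not substantially overlap a homology CDS.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CDS:
    """Hypothetical record; coordinates are 1-based, inclusive, start < end."""
    start: int
    end: int
    strand: str   # "+" or "-"
    source: str   # e.g. "homology", "GeneMark", "GLIMMER"

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def stop_position(cds: CDS) -> int:
    """The stop codon lies at the 3' end: `end` on "+", `start` on "-"."""
    return cds.end if cds.strand == "+" else cds.start


def consensus_ab_initio(predictions, min_support=2):
    """Pick one consensus CDS per stop position.

    Assumption: a stop is kept when at least `min_support` distinct programs
    predict it, and the start is the one whose supporting predictions have
    the largest summed nucleotide length.
    """
    by_stop = defaultdict(list)
    for p in predictions:
        by_stop[(p.strand, stop_position(p))].append(p)

    consensus = []
    for (strand, _), group in by_stop.items():
        if len({p.source for p in group}) < min_support:
            continue
        length_sum = defaultdict(int)
        for p in group:
            length_sum[(p.start, p.end)] += p.length
        (start, end), _ = max(length_sum.items(), key=lambda kv: kv[1])
        consensus.append(CDS(start, end, strand, "ab_initio"))
    return sorted(consensus, key=lambda c: c.start)


def overlap_fraction(a: CDS, b: CDS) -> float:
    """Shared positions divided by the length of the shorter CDS."""
    shared = min(a.end, b.end) - max(a.start, b.start) + 1
    return max(shared, 0) / min(a.length, b.length)


def integrate(homology, ab_initio, max_overlap=0.2):
    """Keep all homology CDSs, then fill the gaps with ab initio CDSs.

    Assumption: "significant overlap" means more than `max_overlap` of the
    shorter CDS, regardless of strand.
    """
    integrated = list(homology)
    for candidate in ab_initio:
        if all(overlap_fraction(candidate, kept) <= max_overlap
               for kept in integrated):
            integrated.append(candidate)
    return sorted(integrated, key=lambda c: c.start)
```

Calling `integrate(homology, consensus_ab_initio(predictions))` gives homology evidence priority, so an ab initio CDS survives only in a region that the homology CDSs leave empty. Raising `max_overlap` would admit more ab initio CDSs at the cost of more overlapping calls.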
| 3 CONSORF GENOME BROWSER |
|---|
Our website features a genome browser (Fig. 1) that provides comprehensive and consistent annotation information, including reliability scores, predicted frame-shifts, candidate start sites and the best match. Most of the predicted CDSs are consistent with public CDSs, with some minor improvements. If a homology-based CDS with a high reliability score contains a frame-shift (Fig. 1H), it needs further manual inspection to determine whether the frame-shift is authentic or the result of a sequencing error.
| ACKNOWLEDGEMENTS |
|---|
This research was supported by a grant from the KRIBB Research Initiative Program of Korea. The authors thank Maryana Bhak for editing.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Alfonso Valencia
Received on May 18, 2007; revised on September 20, 2007; accepted on October 8, 2007
| REFERENCES |
|---|
Badger JH, Olsen GJ. CRITICA: coding region identification tool invoking comparative analysis. Mol. Biol. Evol. (1999) 16:512–524.
Borodovsky M, McIninch J. GenMark: parallel gene recognition for both DNA strands. Comput. Chem. (1993) 17:123–133.
Devos D, Valencia A. Intrinsic errors in genome annotation. Trends Genet. (2001) 17:429–431.
Frishman D, et al. Combining diverse evidence for gene recognition in completely sequenced bacterial genomes. Nucleic Acids Res. (1998) 26:2941–2947.
Lukashin AV, Borodovsky M. GeneMark.hmm: new solutions for gene finding. Nucleic Acids Res. (1998) 26:1107–1115.
Pearson WR, et al. Comparison of DNA sequences with protein sequences. Genomics (1997) 46:24–36.
Riley M, et al. Escherichia coli K-12: a cooperatively developed annotation snapshot – 2005. Nucleic Acids Res. (2005) 34:1–9.
Salzberg SL, et al. Microbial gene identification using interpolated Markov models. Nucleic Acids Res. (1998) 26:544–548.
Schiex T, et al. FrameD: a flexible program for quality check and gene prediction in prokaryotic genomes and noisy matured eukaryotic sequences. Nucleic Acids Res. (2003) 31:3738–3741.
Skovgaard M, et al. On the total number of genes and their length distribution in complete microbial genomes. Trends Genet. (2001) 17:425–428.
Tech M, Merkl R. YACOP: enhanced gene prediction obtained by a combination of existing methods. In Silico Biol. (2003) 3:441–451.

