Bioinformatics Advance Access originally published online on October 31, 2007
Bioinformatics 2007 23(22):3088-3090; doi:10.1093/bioinformatics/btm512
© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

CONSORF: a consensus prediction system for prokaryotic coding sequences

Sungsoo Kang 1, Sung-Jin Yang 1, Sangsoo Kim 2,* and Jong Bhak 1,*

1 KOBIC (Korean BioInformation Center), KRIBB, Daejeon 305-806 and 2 Soongsil University, 1-1 Sangdo-dong, Dongjak-gu, Seoul 156-743, Korea

*To whom correspondence should be addressed.


    ABSTRACT

Summary: CONSORF is a fully automatic high-accuracy identification system that provides consensus prokaryotic CDS information. It first predicts the CDSs supported by consensus alignments. The alignments are derived from multiple genome-to-proteome comparisons with other prokaryotes using the FASTX program. Then, it fills the empty genomic regions with the CDSs supported by consensus ab initio predictions. From those consensus results, CONSORF provides prediction reliability scores, predicted frame-shifts, alternative start sites and best pair-wise match information against other prokaryotes. These results are easily accessed from a website.

Availability: The regularly updated CDS predictions of prokaryotic genomes as well as the source code are freely accessible through http://consorf.kobic.re.kr and http://orfome.org.

Contact: j{at}bio.cc, jong{at}kribb.re.kr or sskimb{at}ssu.ac.kr

Supplementary information: The detailed methods and evaluation results can be found at http://consorf.kobic.re.kr/supplementary/.


    1 INTRODUCTION
While the number of known prokaryotic whole genomes is increasing rapidly, the consistency of their coding sequence (CDS) predictions varies depending on the genome (Devos and Valencia, 2001; Riley et al., 2005; Skovgaard et al., 2001). Moreover, it is difficult to systematically re-annotate them fast enough to keep up with new information from the expanding public databases.

There have been numerous computational methods for processing prokaryotic genomic CDSs. Most of them are ab initio prediction methods such as GeneMark (Borodovsky and McIninch, 1993), GeneMark.hmm (Lukashin and Borodovsky, 1998) and GLIMMER (Salzberg et al., 1998). However, the CDSs predicted by such programs usually need further time-consuming manual processing. To complement such ab initio prediction methods, some CDS prediction programs add homology-based methods: ORPHEUS (Frishman et al., 1998), Critica (Badger and Olsen, 1999), FrameD (Schiex et al., 2003) and YACOP (Tech and Merkl, 2003). However, ORPHEUS and FrameD have shown relatively low CDS prediction accuracies, and Critica and YACOP provide neither frame-shift information nor an automated prediction pipeline.

Therefore, we have developed a consensus method, CONSORF, a fully automated and regularly updated prediction system for prokaryotic CDSs. It aims to improve both the sensitivity and the specificity of CDS prediction.


    2 CONSORF CDS FINDING
CONSORF provides five kinds of predicted CDSs in FASTA and XML formats, based on their sources of evidence and ‘refinement levels’ in prediction:

  1. homology-based consensus CDSs (called ‘homology CDSs’),
  2. alternative homology-based CDSs, predicted from the overall best match without considering the consensus among hits (called ‘alternative CDSs’),
  3. algorithm-based ab initio consensus CDSs (called ‘ab initio CDSs’),
  4. CDSs from the integration of ‘homology CDSs’ and ‘ab initio CDSs’ over certain thresholds (called ‘integrated CDSs’) and
  5. the representative final CDSs, whose start codon positions have been refined by analysing the N-terminal residue matches of ‘integrated CDSs’ (called ‘representative CDSs’).

From a prokaryotic genome sequence, the CONSORF system predicts CDSs using two complementary approaches: homology-based and algorithm-based. In the homology-based approach, pair-wise genome-to-proteome comparisons are performed with the FASTX program (Pearson et al., 1997), generating both ‘homology CDSs’ and ‘alternative CDSs’. In the algorithm-based approach, multiple ab initio predictors, currently GeneMark, GLIMMER and GeneMark.hmm, are run to provide ‘ab initio CDSs’. ‘Homology CDSs’ are determined from the representative FASTX alignment with the highest sum of bit scores in consensus analyses of stop, start and frame change positions. Similarly, the ‘ab initio CDSs’ are determined from the consensus of the algorithm-based CDSs with the highest sum of CDS nucleotide lengths in consensus analyses of stop and start positions only. In contrast, ‘alternative CDSs’ are determined directly from the FASTX alignments with the highest individual bit score across all the pair-wise comparisons. The ‘integrated CDSs’ are obtained by placing the ‘homology CDSs’ first and then adding the complementary ‘ab initio CDSs’, avoiding significant positional overlap on the genome; this integration predicts CDSs with high accuracy. To determine the most likely start site among candidate starts, the ‘integrated CDSs’ that align exactly with the N-terminal end of a library protein in the pair-wise FASTX comparisons are inspected to provide the final ‘representative CDSs’.
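The selection and integration steps described above can be illustrated with a minimal Python sketch. It is not the CONSORF source code: the `Hit` record, the function names and the `max_overlap` threshold are hypothetical, the grouping rule (hits must agree on start, stop and frame change positions to count as one consensus group) is one possible reading of the consensus analysis, and strand handling is omitted for brevity.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One FASTX alignment against a library protein (hypothetical record)."""
    start: int            # first genome position of the implied CDS
    stop: int             # last genome position of the implied CDS (inclusive)
    frame_shifts: tuple   # genome positions of frame changes in the alignment
    bit_score: float


def homology_cds(hits):
    """Pick the consensus group with the highest summed bit score.

    Assumption: hits form one group when they agree on start, stop and
    frame change positions; the best-scoring hit represents the group.
    """
    groups = defaultdict(list)
    for h in hits:
        groups[(h.start, h.stop, h.frame_shifts)].append(h)
    best_group = max(groups.values(), key=lambda g: sum(h.bit_score for h in g))
    return max(best_group, key=lambda h: h.bit_score)


def alternative_cds(hits):
    """Pick the single hit with the highest individual bit score."""
    return max(hits, key=lambda h: h.bit_score)


def overlap(a, b):
    """Number of positions shared by two inclusive (start, stop) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def integrate(homology, ab_initio, max_overlap=30):
    """Keep all homology intervals, then add ab initio intervals that do not
    overlap any kept interval by more than max_overlap (assumed threshold)."""
    integrated = list(homology)
    for cand in ab_initio:
        if all(overlap(cand, kept) <= max_overlap for kept in integrated):
            integrated.append(cand)
    return sorted(integrated)


hits = [Hit(100, 400, (), 50.0), Hit(100, 400, (), 45.0), Hit(130, 400, (), 80.0)]
consensus = homology_cds(hits)        # two agreeing hits outweigh one strong hit
alternative = alternative_cds(hits)   # the single strongest hit
merged = integrate([(100, 400)], [(380, 700), (150, 390), (900, 1200)])
```

In this toy input the two hits starting at position 100 sum to 95 bits and outvote the single 80-bit hit starting at position 130, so the consensus and the alternative choice differ only in their start sites. This is the kind of disagreement that the candidate start sites in the browser would expose.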


    3 CONSORF GENOME BROWSER
Our website features a genome browser (Fig. 1) that provides comprehensive and consistent annotation information, including reliability scores, predicted frame-shifts, candidate start sites and the best match. Most of the predicted CDSs are consistent with public CDSs, with some minor improvements. If a homology-based CDS with a high reliability score contains a frame-shift (Fig. 1H), it needs further manual inspection to determine whether the frame-shift is authentic or the result of a sequencing error.
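The triage rule in the last sentence could be applied to downloaded predictions roughly as follows. This is a sketch under stated assumptions: the record fields (`id`, `type`, `reliability`, `frame_shifts`), the 0 to 1 reliability scale and the `min_reliability` cut-off are hypothetical and do not reflect the actual CONSORF XML schema or scoring.

```python
def needs_manual_inspection(cds_records, min_reliability=0.9):
    """Return IDs of homology CDSs that are reliable yet contain a frame-shift.

    Assumption: each record is a dict with the hypothetical keys
    'id', 'type', 'reliability' (0 to 1) and 'frame_shifts' (list of positions).
    """
    return [
        r["id"]
        for r in cds_records
        if r["type"] == "homology"
        and r["reliability"] >= min_reliability
        and r["frame_shifts"]
    ]


records = [
    {"id": "cds1", "type": "homology", "reliability": 0.95, "frame_shifts": [1523]},
    {"id": "cds2", "type": "homology", "reliability": 0.95, "frame_shifts": []},
    {"id": "cds3", "type": "homology", "reliability": 0.40, "frame_shifts": [88]},
    {"id": "cds4", "type": "ab initio", "reliability": 0.99, "frame_shifts": []},
]
flagged = needs_manual_inspection(records)
```

Only the first record is flagged: it is homology-based, scores above the assumed cut-off and carries a frame-shift, so a curator would check whether the shift is real or a sequencing error.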


Fig. 1. Screenshots of the CONSORF genome browser. It displays (A) the information on the currently chosen organism and its chromosome, (B) CDS search field for the chosen organism, (C) display control buttons, (D) current position on the chromosome, and (E) CDS types and their displayed colors. (F) Most of the predicted CDSs are consistent with public CDSs. (G) Candidate start sites and (H and I) potential frame-shifts are represented by vertical bars. (J) The density of color represents CDS prediction reliability based on homology-based and algorithm-based consensus. (K) Two CDSs were predicted in ‘representative CDS’, but no ‘public CDS’ was found in this case. (L) The basic information on a ‘homology CDS’ was displayed when the mouse pointer was positioned over it. (M) The detailed information on the clicked ‘homology CDS’ was displayed.

 

    ACKNOWLEDGEMENTS
This research was supported by a grant from the KRIBB Research Initiative Program of Korea. The authors thank Maryana Bhak for editing.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Alfonso Valencia

Received on May 18, 2007; revised on September 20, 2007; accepted on October 8, 2007

    REFERENCES

    Badger JH, Olsen GJ. CRITICA: coding region identification tool invoking comparative analysis. Mol. Biol. Evol. (1999) 16:512–524.

    Borodovsky M, McIninch J. GenMark: parallel gene recognition for both DNA strands. Comput. Chem. (1993) 17:123–133.

    Devos D, Valencia A. Intrinsic errors in genome annotation. Trends Genet. (2001) 17:429–431.

    Frishman D, et al. Combining diverse evidence for gene recognition in completely sequenced bacterial genomes. Nucleic Acids Res. (1998) 26:2941–2947.

    Lukashin AV, Borodovsky M. GeneMark.hmm: new solutions for gene finding. Nucleic Acids Res. (1998) 26:1107–1115.

    Pearson WR, et al. Comparison of DNA sequences with protein sequences. Genomics (1997) 46:24–36.

    Riley M, et al. Escherichia coli K-12: a cooperatively developed annotation snapshot – 2005. Nucleic Acids Res. (2005) 34:1–9.

    Salzberg SL, et al. Microbial gene identification using interpolated Markov models. Nucleic Acids Res. (1998) 26:544–548.

    Schiex T, et al. FrameD: a flexible program for quality check and gene prediction in prokaryotic genomes and noisy matured eukaryotic sequences. Nucleic Acids Res. (2003) 31:3738–3741.

    Skovgaard M, et al. On the total number of genes and their length distribution in complete microbial genomes. Trends Genet. (2001) 17:425–428.

    Tech M, Merkl R. YACOP: enhanced gene prediction obtained by a combination of existing methods. In Silico Biol. (2003) 3:441–451.

