Bioinformatics Advance Access originally published online on January 4, 2007
Bioinformatics 2007 23(4):414-420; doi:10.1093/bioinformatics/btl639
In search of the small ones: improved prediction of short exons in vertebrates, plants, fungi and protists
Department of Plant Systems Biology, Flanders Interuniversity Institute for Biotechnology (VIB), Technologiepark 927, B-9052 Ghent, Belgium
1 Laboratoire Associé de l'INRA (France), Ghent University, Technologiepark 927, B-9052 Ghent, Belgium
*To whom correspondence should be addressed.
Abstract
Motivation: Prediction of the coding potential of stretches of DNA is crucial in gene calling and genome annotation, where it is used to identify potential exons and to position their boundaries in conjunction with functional sites, such as splice sites and translation initiation sites. The ability to discriminate between coding and non-coding sequences rests on the structure of coding sequences, which are organized in codons, and on the biased usage of those codons. For statistical reasons, the longer the sequence, the easier it is to detect this codon bias. However, in many eukaryotic genomes, where genes harbour many introns, both introns and exons may be short and hard to distinguish on the basis of coding potential.
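As an illustration of why codon bias is harder to detect in short sequences, the following minimal sketch scores a sequence by summing per-codon log-odds. It is not the method presented in this article: the function names (`codon_counts`, `codon_log_odds`, `coding_score`), the uniform non-coding background and the pseudocount are all assumptions made for the example. Because the score is a sum over codons, a short exon contributes only a few terms and therefore a weaker signal.

```python
import math
from collections import Counter

# All 64 codons over the DNA alphabet.
CODONS = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]


def codon_counts(seq, frame=0):
    """Count the complete codons of seq, read in the given frame (0, 1 or 2)."""
    seq = seq.upper()
    return Counter(seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))


def codon_log_odds(coding_training, pseudocount=1.0):
    """Log-odds of each codon: smoothed coding frequency versus a uniform
    background of 1/64 (an assumption of this sketch, not of the article)."""
    counts = Counter()
    for s in coding_training:
        counts.update(codon_counts(s))
    total = sum(counts[c] for c in CODONS) + pseudocount * len(CODONS)
    return {
        c: math.log(((counts[c] + pseudocount) / total) * len(CODONS))
        for c in CODONS
    }


def coding_score(seq, log_odds, frame=0):
    """Sum of per-codon log-odds; codons with ambiguous bases contribute 0."""
    counts = codon_counts(seq, frame)
    return sum(n * log_odds.get(codon, 0.0) for codon, n in counts.items())


if __name__ == "__main__":
    # Toy training set; a real model would be trained on annotated exons.
    model = codon_log_odds(["ATGGCTGCTGCTGCT"])
    for candidate in ("GCTGCT", "GCTGCTGCTGCT", "TTTTTT"):
        print(candidate, round(coding_score(candidate, model), 3))
```

In this toy setting, doubling the length of a sequence with the same codon composition doubles its score, which is one way to see why longer sequences are easier to classify.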
Results: Here, we present novel approaches that specifically aim at a better detection of coding potential in short sequences. The methods use complementary sequence features, combined with identification of which features are relevant in discriminating between coding and non-coding sequences. These newly developed methods are evaluated on different species, representative of four major eukaryotic kingdoms, and extensively compared to state-of-the-art Markov models, which are often used for predicting coding potential. The main conclusions drawn from our analyses are that (1) combining complementary sequence features clearly outperforms current Markov models for coding potential prediction in short sequence fragments, (2) coding potential prediction benefits from length-specific models, and these models are not necessarily the same for different sequence lengths and (3) comparing the results across several species indicates that, although our combined method consistently performs extremely well, there are important differences across genomes.
Supplementary data: http://bioinformatics.psb.ugent.be/
Contact: yvan.saeys{at}psb.ugent.be
Associate Editor: Alfonso Valencia
Received on August 30, 2006; revised on November 24, 2006; accepted on December 14, 2006