Bioinformatics Advance Access published online on January 4, 2007
Bioinformatics, doi:10.1093/bioinformatics/btl639
1 Department of Plant Systems Biology, Flanders Interuniversity Institute for Biotechnology (VIB), Ghent University, Technologiepark 927, B-9052 Ghent, Belgium
* To whom correspondence should be addressed.
Motivation: Prediction of the coding potential for stretches of DNA is crucial in gene calling and genome annotation, where it is used to identify potential exons and to position their boundaries in conjunction with functional sites such as splice sites and translation initiation sites. The ability to discriminate between coding and non-coding sequences stems from the structure of coding sequences, which are organised in codons, and from the biased usage of those codons. For statistical reasons, the longer the sequence, the easier it is to detect this codon bias. However, in many eukaryotic genomes, where genes harbour many introns, both introns and exons may be short and hard to distinguish on the basis of coding potential.

Results: Here, we present novel approaches that specifically aim at better detection of coding potential in short sequences. The methods use complementary sequence features and identify which of these features are relevant for discriminating between coding and non-coding sequences. The newly developed methods are evaluated on different species, representative of four major eukaryotic kingdoms, and extensively compared to state-of-the-art Markov models, which are often used for predicting coding potential. The main conclusions drawn from our analyses are that (1) combining complementary sequence features clearly outperforms current Markov models for coding potential prediction in short sequence fragments; (2) coding potential prediction benefits from length-specific models, and these models are not necessarily the same for different sequence lengths; and (3) although our combined method consistently performs extremely well across several species, there are important differences between genomes.

Supplementary data: http://bioinformatics.psb.ugent.be/
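As a rough illustration of the codon-bias signal mentioned under Motivation, the sketch below scores a fragment by a fixed-frame codon log-odds ratio between a coding and a non-coding frequency table. It is not the method of this article, nor the Markov models it is compared against; the function names `codon_frequencies` and `coding_log_odds`, the pseudocount, and the toy training sequences are assumptions made purely for illustration.

```python
import math
from collections import Counter
from itertools import product

# All 64 codons over the DNA alphabet.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def codon_frequencies(sequences, pseudocount=1.0):
    """Estimate frame-0 codon frequencies with additive smoothing (hypothetical helper)."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            # Skip codons containing ambiguous bases such as N.
            if all(base in "ACGT" for base in codon):
                counts[codon] += 1
    total = sum(counts[c] for c in CODONS) + pseudocount * len(CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}


def coding_log_odds(seq, coding_freq, noncoding_freq):
    """Sum of per-codon log-odds in frame 0; positive values favour 'coding'."""
    seq = seq.upper()
    score = 0.0
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in coding_freq:
            score += math.log(coding_freq[codon] / noncoding_freq[codon])
    return score


# Toy training data (assumed, not from the article).
coding_model = codon_frequencies(["GCTGCTGCTGCT"])
noncoding_model = codon_frequencies(["ATATATATATAT"])

short_score = coding_log_odds("GCTGCT", coding_model, noncoding_model)
long_score = coding_log_odds("GCTGCTGCTGCT", coding_model, noncoding_model)
```

Because the score is a sum over codons, a longer fragment with the same bias accumulates more evidence (`long_score` exceeds `short_score` here), which is the statistical reason short exons are harder to call.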
Received August 30, 2006
Revised November 24, 2006
Accepted December 14, 2006
Article
In search of the small ones: improved prediction of short exons in vertebrates, plants, fungi, and protists
Yvan Saeys 1 *, Pierre Rouzé 2, and Yves Van de Peer 1
2 Laboratoire Associé de l'INRA (France), Ghent University, Technologiepark 927, B-9052 Ghent, Belgium
Yvan Saeys, E-mail: yvan.saeys{at}psb.ugent.be
Abstract
Associate Editor: Alfonso Valencia
This article has been cited by other articles:
V. Krauss, C. Thummler, F. Georgi, J. Lehmann, P. F. Stadler, and C. Eisenhardt. Near Intron Positions Are Reliable Phylogenetic Markers: An Application to Holometabolous Insects. Mol. Biol. Evol., May 1, 2008; 25(5): 821-830.
Y. Saeys, I. Inza, and P. Larranaga. A review of feature selection techniques in bioinformatics. Bioinformatics, October 1, 2007; 23(19): 2507-2517.

