Incorporation of splice site probability models for non-canonical introns improves gene structure prediction in plants
1Department of Genetics, Development and Cell Biology, Iowa State University, 2112 Molecular Biology Building, Ames, IA 50011-3260, USA
2Department of Statistics, Iowa State University, 2112 Molecular Biology Building, Ames, IA 50011-3260, USA
*To whom correspondence should be addressed.
Motivation: The vast majority of introns in protein-coding genes of higher eukaryotes have a GT dinucleotide at their 5'-terminus and an AG dinucleotide at their 3'-terminus. About 1-2% of introns are non-canonical, and the most abundant non-canonical subtype is characterized by GC and AG dinucleotides at the 5'- and 3'-termini, respectively. Most current gene prediction software, whether based on ab initio or spliced alignment approaches, either lacks explicit models for non-canonical introns or excludes their prediction altogether. With the present amounts of genome and transcript data, it is now possible to apply statistical methodology to non-canonical splice site prediction. We pursued one such approach and describe the training and implementation of GC-donor splice site models for Arabidopsis and rice, with the goal of exploring whether specific modeling of non-canonical introns can enhance gene structure prediction accuracy.
Results: Our results indicate that the incorporation of non-canonical splice site models yields dramatic improvements in annotating genes containing GC-AG and AT-AC non-canonical introns. Comparison of models shows differences between monocot and dicot species, but also suggests GC intron-specific biases independent of taxonomic clade. We also present evidence that GC-AG introns occur preferentially in genes with atypically high exon counts.
Availability: Source code for the updated versions of GeneSeqer and SplicePredictor (distributed with the GeneSeqer code) is available at http://bioinformatics.iastate.edu/bioinformatics2go/gs/download.html. Web servers for Arabidopsis, rice and other plant species are accessible at http://www.plantgdb.org/PlantGDB-cgi/GeneSeqer/AtGDBgs.cgi, http://www.plantgdb.org/PlantGDB-cgi/GeneSeqer/OsGDBgs.cgi and http://www.plantgdb.org/PlantGDB-cgi/GeneSeqer/PlantGDBgs.cgi, respectively. A SplicePredictor web server is available at http://bioinformatics.iastate.edu/cgi-bin/sp.cgi. Software to generate training data and parameterizations for Bayesian splice site models is available at http://gremlin1.gdcb.iastate.edu/~volker/SB05B/BSSM4GSQ/
Contact: vbrendel{at}iastate.edu
Supporting information: http://gremlin1.gdcb.iastate.edu/~volker/SB05B/
Received on June 13, 2005; accepted on August 16, 2005