Bioinformatics Advance Access published online on December 7, 2004
Bioinformatics, doi:10.1093/bioinformatics/bti200
1 Institut fuer Informatik, Ludwig-Maximilians-Universitaet Muenchen, Oettingenstrasse 67, 80538 Muenchen, Germany; Fakultaet fuer Informatik, Technische Universitaet Muenchen, Boltzmannstrasse 3, 85748 Garching b. Muenchen, Germany
* To whom correspondence should be addressed.
Motivation: Discovery of host and pathogen genes expressed at the plant-pathogen interface often requires the construction of mixed libraries that contain sequences from both genomes. Sequence identification requires high-throughput and reliable classification of genome origin. When using single-pass cDNA sequences, difficulties arise from the short sequence length, the lack of sufficient taxonomically relevant sequence data in public databases, and ambiguous sequence homology between plant and pathogen genes.

Results: A novel method is described that is independent of the availability of homologous genes and relies on subtle differences in codon usage between plant and fungal genes. We used support vector machines (SVMs) to identify the probable origin of sequences. SVMs were compared to several other machine learning techniques and to a probabilistic algorithm for EST classification that is also based on codon bias differences (PF-IND, Maor et al., 2003). Our software (ECLAT) achieved a classification accuracy of 93.1% on a test set of 3217 EST sequences from H. vulgare and B. graminis, a significant improvement over PF-IND (prediction accuracy of 81.2% on the same test set). EST sequences with at least 50 nt of coding sequence can be classified by ECLAT with high confidence. ECLAT allows training of classifiers for any host-pathogen combination for which there are sufficient classified training sequences.

Availability: ECLAT is freely available on the internet (http://mips.gsf.de/proj/est) or on request as a standalone version.
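The codon usage features that the method relies on can be illustrated with a minimal sketch. This is not ECLAT's implementation: the function name `codon_usage`, the `frame` parameter, and the choice of plain relative frequencies over all 64 codons are assumptions made for illustration, and the reading frame is assumed to be known. The resulting fixed-length vector is the kind of input a classifier such as an SVM could be trained on.

```python
from itertools import product

# All 64 codons in a fixed order, so every sequence maps to a vector
# of the same length (hypothetical feature layout, not ECLAT's).
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_usage(seq, frame=0):
    """Return the relative frequency of each of the 64 codons in seq.

    Assumes the reading frame is known and given by `frame` (0, 1 or 2).
    Codons containing ambiguous bases such as N are skipped.
    """
    seq = seq.upper()
    counts = dict.fromkeys(CODONS, 0)
    # Step through the sequence three bases at a time, ignoring a
    # trailing partial codon.
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in counts:
            counts[codon] += 1
    total = sum(counts.values())
    if total == 0:
        # No complete, unambiguous codon was found.
        return [0.0] * len(CODONS)
    return [counts[c] / total for c in CODONS]
```

Calling `codon_usage("ATGGCCGCCTAA")` yields a 64-element list in which the entry for GCC is 0.5 and the entries for ATG and TAA are 0.25 each. For short ESTs the frequencies are estimated from few codons, which is consistent with the abstract's note that a minimum length of coding sequence is needed for confident classification.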
Received September 16, 2004
Revised November 16, 2004
Accepted November 30, 2004
Article
Support vector machines for separation of mixed plant-pathogen EST collections based on codon usage
2 Bioinformatics group, Turku Centre for Biotechnology, Finland
3 Institute for Bioinformatics GSF - Forschungszentrum fuer Umwelt und Gesundheit, GmbH, Ingolstaedter Landstrasse 1, 85764 Neuherberg, Germany; Department of Genome-Oriented Bioinformatics, Wissenschaftszentrum Weihenstephan, Technische Universitaet Muenchen, 85350 Freising, Germany
4 Institute for Bioinformatics GSF - Forschungszentrum fuer Umwelt und Gesundheit, GmbH, Ingolstaedter Landstrasse 1, 85764 Neuherberg, Germany
Caroline C. Friedel, E-mail: friedel{at}informatik.uni-muenchen.de
This article has been cited by other articles:
J. E. Gewehr, M. Szugat, and R. Zimmer. BioWeka--extending the Weka framework for bioinformatics. Bioinformatics, March 1, 2007; 23(5): 651-653.
S. Rudd and I. V. Tetko. Eclair--a web service for unravelling species origin of sequences sampled from mixed host interfaces. Nucleic Acids Res., July 1, 2005; 33(suppl_2): W724-W727.

