Bioinformatics Advance Access originally published online on March 14, 2008
Bioinformatics 2008 24(9):1198-1205; doi:10.1093/bioinformatics/btn089
Biological sequence classification utilizing positive and unlabeled data
Department of Epidemiology and Biostatistics, Center for Bioinformatics and Molecular Biostatistics, University of California, San Francisco, CA 94107, USA
*To whom correspondence should be addressed.
Abstract
Motivation: In the genomics setting, an increasingly common data configuration consists of a small set of sequences possessing a targeted property (positive instances) amongst a large set of sequences for which class membership is unknown (unlabeled instances). Traditional two-class classification methods do not effectively handle such data.
Results: Here, we develop a novel method, likely positive-iterative classification (LP-IC), for this problem and contrast its performance with the few existing methods, most of which were devised and utilized in the text classification context. LP-IC employs an iterative classification scheme and introduces a class dispersion measure, adopted from unsupervised clustering approaches, to monitor the model selection process. Using two case studies (prediction of HLA binding, and alternative splicing conservation between human and mouse), we show that LP-IC provides superior performance to existing methodologies in terms of: (i) combined accuracy and precision in positive identification from the unlabeled set; and (ii) predictive performance of the resultant classifiers on independent test data.
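To make the positive/unlabeled setting concrete, the following is a minimal, generic sketch of an iterative scheme that relabels unlabeled instances as likely positives and records a dispersion value at each round. It is not the LP-IC procedure, whose details are not given in the abstract. The nearest-centroid classifier, the within-class sum of squared distances used as the dispersion measure, the toy data, and all function names are assumptions chosen purely for illustration.

```python
from statistics import fmean


def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    return tuple(fmean(coords) for coords in zip(*points))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dispersion(groups):
    """Total within-class sum of squared distances to each class centroid
    (an assumed stand-in for a class dispersion measure)."""
    total = 0.0
    for pts in groups:
        if pts:
            c = centroid(pts)
            total += sum(sq_dist(p, c) for p in pts)
    return total


def iterative_pu(positives, unlabeled, max_iter=20):
    """Illustrative positive/unlabeled loop.

    Returns (flags, history): flags[i] is True when unlabeled[i] ends up
    marked as a likely positive; history holds the dispersion per round.
    """
    flags = [False] * len(unlabeled)  # start: treat every unlabeled point as negative
    history = []
    for _ in range(max_iter):
        pos = list(positives) + [u for u, f in zip(unlabeled, flags) if f]
        neg = [u for u, f in zip(unlabeled, flags) if not f]
        history.append(dispersion([pos, neg]))
        if not neg:
            break
        c_pos, c_neg = centroid(pos), centroid(neg)
        # Relabel: an unlabeled point is a likely positive if it lies
        # closer to the positive centroid than to the negative one.
        new_flags = [sq_dist(u, c_pos) < sq_dist(u, c_neg) for u in unlabeled]
        if new_flags == flags:  # labels stable, stop iterating
            break
        flags = new_flags
    return flags, history


# Toy data (hypothetical): two known positives near the origin and an
# unlabeled set mixing points near the origin with points near (5, 5).
positives = [(0.0, 0.0), (0.2, 0.1)]
unlabeled = [(0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
flags, history = iterative_pu(positives, unlabeled)
print(flags)
print(history)
```

In this toy run the dispersion falls once the unlabeled points near the known positives are relabeled, which is the kind of signal a dispersion measure can provide for monitoring the iterations.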
Contact: mark{at}biostat.ucsf.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
Associate Editor: Limsoon Wong
Received on December 18, 2007; revised on January 31, 2008; accepted on March 4, 2008