Bioinformatics Advance Access published online on March 14, 2008
Bioinformatics, doi:10.1093/bioinformatics/btn089
Biological sequence classification utilizing positive and unlabeled data
Department of Epidemiology and Biostatistics, Center for Bioinformatics and Molecular Biostatistics, University of California, 185 Berry Street, Lobby 4, Suite 5700, San Francisco, CA 94107, USA.
To whom correspondence should be addressed. Mark R. Segal, E-mail: mark{at}biostat.ucsf.edu
Abstract
Motivation: In the genomics setting, an increasingly common data configuration consists of a small set of sequences possessing a targeted property (positive instances) amongst a large set of sequences for which class membership is unknown (unlabeled instances). Traditional two-class classification methods do not effectively handle such data.
Results: Here, we develop a novel method, Likely Positive-Iterative Classification (LP-IC), for this problem and contrast its performance with the few existing methods, most of which were devised and utilized in the text classification context. LP-IC employs an iterative classification scheme and introduces a class dispersion measure, adopted from unsupervised clustering approaches, to monitor the model selection process. Using two case studies – prediction of HLA binding, and alternative splicing conservation between human and mouse – we show that LP-IC provides superior performance to existing methodologies in terms of: (i) combined accuracy and precision in positive identification from the unlabeled set; and (ii) predictive performance of the resultant classifiers on independent test data.
Contact: mark{at}biostat.ucsf.edu
Associate Editor: Dr. Limsoon Wong
Received on December 18, 2007; revised on January 31, 2008; accepted on March 4, 2008