Bioinformatics Advance Access published online on May 31, 2007
Bioinformatics, doi:10.1093/bioinformatics/btm287
Logistic regression for disease classification using microarray data: model selection in a large p and small n case
a Drexel University School of Public Health, Philadelphia, PA 19102, USA; b The University of Toledo, Toledo, OH 43614, USA
*To whom correspondence should be addressed. J.G. Liao, E-mail: jg_liao{at}yahoo.com
Abstract
Motivation: Logistic regression is a standard method for building prediction models for a binary outcome, and many authors have extended it to disease classification with microarray data. Because the number of genes is large and the number of subjects is small, however, a feature (gene) selection step must be added to penalized logistic modeling. Model selection for this two-step approach requires new statistical tools, because a prediction error estimate that ignores the feature selection step can be severely biased downward. Generic methods such as cross-validation and the nonparametric bootstrap can be very ineffective because the resulting prediction error estimate is highly variable.
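The downward bias mentioned above can be illustrated with a toy simulation. The sketch below is not the authors' method; it is a minimal standard-library Python example under stated assumptions: pure-noise expression data (so the true error rate is 50%), a simple t-like gene score, a ridge-penalized logistic model fitted by gradient descent, and leave-one-out cross-validation. All function names and parameter values (`simulate`, `k=10`, `lam=1.0`, and so on) are hypothetical choices made for illustration.

```python
import math
import random


def score(X, y, j):
    """Two-sample t-like score for gene j: |mean difference| / pooled SD."""
    g0 = [row[j] for row, label in zip(X, y) if label == 0]
    g1 = [row[j] for row, label in zip(X, y) if label == 1]
    m0, m1 = sum(g0) / len(g0), sum(g1) / len(g1)
    ss = sum((v - m0) ** 2 for v in g0) + sum((v - m1) ** 2 for v in g1)
    return abs(m1 - m0) / math.sqrt(ss / (len(g0) + len(g1) - 2) + 1e-12)


def select_genes(X, y, k):
    """Indices of the k genes with the largest scores."""
    p = len(X[0])
    return sorted(range(p), key=lambda j: score(X, y, j), reverse=True)[:k]


def fit_ridge_logistic(X, y, lam, steps=200, lr=0.1):
    """Gradient descent on mean logistic loss plus (lam / 2n) * ||w||^2."""
    n, k = len(X), len(X[0])
    w, b = [0.0] * k, 0.0
    for _ in range(steps):
        gw, gb = [lam * wj / n for wj in w], 0.0
        for row, label in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, row))
            z = max(-30.0, min(30.0, z))  # guard against overflow in exp
            err = 1.0 / (1.0 + math.exp(-z)) - label
            gb += err / n
            for j in range(k):
                gw[j] += err * row[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b


def loo_error(X, y, k, lam, select_inside):
    """Leave-one-out error; genes are reselected per fold only if select_inside."""
    n = len(X)
    fixed = None if select_inside else select_genes(X, y, k)
    mistakes = 0
    for i in range(n):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        genes = select_genes(X_tr, y_tr, k) if select_inside else fixed
        w, b = fit_ridge_logistic([[r[j] for j in genes] for r in X_tr], y_tr, lam)
        z = b + sum(wj * X[i][j] for wj, j in zip(w, genes))
        mistakes += int((z > 0) != (y[i] == 1))
    return mistakes / n


def simulate(seed, n=20, p=500, k=10, lam=1.0):
    """Pure-noise data: no gene carries signal, so the true error is 50%."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [i % 2 for i in range(n)]
    biased = loo_error(X, y, k, lam, select_inside=False)
    honest = loo_error(X, y, k, lam, select_inside=True)
    return biased, honest


if __name__ == "__main__":
    biased, honest = simulate(seed=0)
    print(f"selection outside CV: {biased:.2f}, selection inside CV: {honest:.2f}")
```

When genes are selected once on the full data set, the held-out subject has already influenced the selection, so the estimated error tends to look far better than chance even though the data contain no signal. Repeating the selection inside each fold removes that optimism, although, as the abstract notes, the resulting estimate can still be highly variable.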
Results: We propose a parametric bootstrap model, tailored to microarray data, for more accurate estimation of the prediction error. The model borrows from the extensive research on identifying differentially expressed genes, especially the local false discovery rate. The proposed method provides guidance on the two critical issues in model selection: the number of genes to include in the model and the optimal shrinkage for the penalized logistic regression. We show that selecting more than 20 genes usually does little to reduce the prediction error further. Application to Golub's leukemia data and our own cervical cancer data leads to highly accurate prediction models.
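For readers unfamiliar with the two quantities named above, they can be written out as follows. This is a sketch under assumptions: the abstract does not state the exact form of the penalty, so a ridge (squared L2) penalty with shrinkage parameter $\lambda$ on the $k$ selected genes is assumed here, and the local false discovery rate is given in its usual two-group mixture form, with $\pi_0$ the proportion of non-differentially expressed genes and $f_0$, $f_1$ the null and non-null densities of the gene-level statistic $z$.

```latex
% Assumed ridge-penalized logistic log-likelihood on the k selected genes
\hat{\beta}_{\lambda}
  = \arg\max_{\beta_0,\,\beta}
    \left\{ \sum_{i=1}^{n} \bigl[\, y_i \eta_i - \log\bigl(1 + e^{\eta_i}\bigr) \bigr]
            - \frac{\lambda}{2} \sum_{j=1}^{k} \beta_j^{2} \right\},
\qquad
\eta_i = \beta_0 + \sum_{j=1}^{k} x_{ij}\,\beta_j .

% Local false discovery rate under the two-group mixture model
\mathrm{fdr}(z) = \frac{\pi_0 f_0(z)}{f(z)},
\qquad
f(z) = \pi_0 f_0(z) + (1 - \pi_0)\, f_1(z) .
```

In this notation, the two model selection questions in the abstract correspond to choosing $k$ (how many genes to include) and $\lambda$ (how much to shrink the coefficients).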
Availability: R library GeneLogit at http://geocities.com/jg_liao.
Associate Editor: Dr. Trey Ideker
Received on September 10, 2006; revised on May 14, 2007; accepted on May 21, 2007