Bioinformatics Advance Access originally published online on May 19, 2005
Bioinformatics 2005 21(15):3301-3307; doi:10.1093/bioinformatics/bti499
Published by Oxford University Press 2005
Prediction error estimation: a comparison of resampling methods
1Biostatistics Branch, Division of Cancer Epidemiology and Genetics, NCI, NIH, Rockville, MD 20852, USA
2Biometric Research Branch, Division of Cancer Treatment and Diagnostics, NCI, NIH, Rockville, MD 20852, USA
3Department of Epidemiology and Public Health, Yale University School of Medicine, New Haven, CT 06520, USA
*To whom correspondence should be addressed.
Motivation: In genomic studies, thousands of features are collected on relatively few samples. One of the goals of these studies is to build classifiers to predict the outcome of future observations. There are three inherent steps to this process: feature selection, model selection and prediction assessment. With a focus on prediction assessment, we compare several methods for estimating the true prediction error of a prediction model in the presence of feature selection.
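The role of feature selection in prediction assessment can be illustrated with a minimal sketch. It is written in Python with the standard library only and is not the authors' R code; the simulated noise data, the t-like selection score, the nearest-centroid classifier and all function names are assumptions chosen for illustration. The class labels are unrelated to the features, so the true error of any classifier is 0.5. Resubstitution reuses the same samples for selection, fitting and testing, whereas LOOCV repeats the feature selection inside every fold.

```python
import random

def mean(v):
    return sum(v) / len(v)

def sample_var(v):
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / (len(v) - 1)

def t_score(col, y):
    """Absolute Welch-style t score of one feature (illustrative selection rule)."""
    a = [v for v, lab in zip(col, y) if lab == 0]
    b = [v for v, lab in zip(col, y) if lab == 1]
    se = (sample_var(a) / len(a) + sample_var(b) / len(b)) ** 0.5
    return abs(mean(a) - mean(b)) / se

def fit(X, y, k):
    """Select the k top-scoring features on X, y only, then store class centroids."""
    p = len(X[0])
    scores = [t_score([row[j] for row in X], y) for j in range(p)]
    idx = sorted(range(p), key=lambda j: -scores[j])[:k]
    centroids = {}
    for label in (0, 1):
        rows = [r for r, lab in zip(X, y) if lab == label]
        centroids[label] = [mean([r[j] for r in rows]) for j in idx]
    return idx, centroids

def predict(model, x):
    """Assign x to the class with the nearest centroid on the selected features."""
    idx, centroids = model
    dist = {lab: sum((x[j] - c) ** 2 for j, c in zip(idx, cen))
            for lab, cen in centroids.items()}
    return min(dist, key=dist.get)

def resubstitution_error(X, y, k):
    """Error on the same samples used for feature selection and fitting."""
    model = fit(X, y, k)
    return sum(int(predict(model, x) != lab) for x, lab in zip(X, y)) / len(y)

def loocv_error(X, y, k):
    """Leave-one-out error with feature selection redone in every fold."""
    wrong = 0
    for i in range(len(y)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        wrong += int(predict(fit(X_train, y_train, k), X[i]) != y[i])
    return wrong / len(y)

# Assumed toy setting: 24 samples, 600 pure-noise features, 8 features selected.
rng = random.Random(0)
n, p, k = 24, 600, 8
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [0] * (n // 2) + [1] * (n // 2)

resub = resubstitution_error(X, y, k)
cv = loocv_error(X, y, k)
print(f"resubstitution: {resub:.2f}  LOOCV: {cv:.2f}  (true error is 0.5)")
```

In a setting like this the resubstitution estimate is expected to fall far below 0.5, while the LOOCV estimate should scatter around it. The essential design choice is that `fit` performs the selection itself, so the held-out sample never influences which features are kept.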
Results: For small studies where features are selected from thousands of candidates, the resubstitution and simple split-sample estimates are seriously biased. In these small samples, leave-one-out cross-validation (LOOCV), 10-fold cross-validation (CV) and the .632+ bootstrap have the smallest bias for diagonal discriminant analysis, nearest neighbor and classification trees. LOOCV and 10-fold CV have the smallest bias for linear discriminant analysis. Additionally, LOOCV, 5- and 10-fold CV, and the .632+ bootstrap have the lowest mean square error. The .632+ bootstrap is quite biased in small sample sizes with strong signal-to-noise ratios. Differences in performance among resampling methods are reduced as the number of specimens available increases.
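For readers unfamiliar with the .632+ bootstrap, the following Python sketch shows how the estimate is usually assembled from three ingredients: the resubstitution error, the leave-one-out bootstrap error, and the no-information error rate. It follows the definition commonly attributed to Efron and Tibshirani and is not taken from the authors' R code; the function and argument names are hypothetical.

```python
def no_information_rate(y, predictions):
    """Expected error if labels and predictions were paired at random:
    the sum over classes of p_l * (1 - q_l), where p_l is the observed
    proportion of class l and q_l the proportion of predictions equal to l."""
    n = len(y)
    gamma = 0.0
    for label in set(y) | set(predictions):
        p = sum(1 for v in y if v == label) / n
        q = sum(1 for v in predictions if v == label) / n
        gamma += p * (1.0 - q)
    return gamma

def err_632_plus(resub, loo_boot, gamma):
    """Combine resubstitution error and leave-one-out bootstrap error.

    resub    : error on the samples used to build the classifier
    loo_boot : average error on samples left out of each bootstrap draw
    gamma    : no-information error rate
    """
    loo = min(loo_boot, gamma)  # cap the bootstrap error at gamma
    if loo > resub and gamma > resub:
        overfit = (loo - resub) / (gamma - resub)  # relative overfitting rate in [0, 1]
    else:
        overfit = 0.0
    weight = 0.632 / (1.0 - 0.368 * overfit)  # ranges from 0.632 up to 1
    return (1.0 - weight) * resub + weight * loo

# Illustrative inputs only, not results from the paper.
example = err_632_plus(resub=0.1, loo_boot=0.3, gamma=0.5)
```

With no apparent overfitting the weight stays at 0.632, which gives the ordinary .632 estimate; as the relative overfitting rate approaches 1, the weight approaches 1 and the estimate moves to the leave-one-out bootstrap error.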
Contact: annette.molinaro@yale.edu
Supplementary Information: A complete compilation of results and R code for simulations and analyses are available in Molinaro et al. (2005) (http://linus.nci.nih.gov/brb/TechReport.htm).
Received on April 6, 2005; revised on April 28, 2005; accepted on May 12, 2005