Bioinformatics Advance Access published online on May 19, 2005
Bioinformatics, doi:10.1093/bioinformatics/bti499
1 Biostatistics Branch, Division of Cancer Epidemiology and Genetics, NCI, NIH, Rockville, MD 20852; Department of Epidemiology and Public Health, Yale University School of Medicine, New Haven, CT 06520
* To whom correspondence should be addressed.
Motivation: In genomic studies, thousands of features are collected on relatively few samples. One of the goals of these studies is to build classifiers to predict the outcome of future observations. There are three inherent steps to this process: feature selection, model selection, and prediction assessment. With a focus on prediction assessment, we compare several methods for estimating the true prediction error of a prediction model in the presence of feature selection.

Results: For small studies where features are selected from thousands of candidates, the resubstitution and simple split-sample estimates are seriously biased. In these small samples, leave-one-out cross-validation (LOOCV), 10-fold cross-validation (CV), and the .632+ bootstrap have the smallest bias for diagonal discriminant analysis, nearest neighbor, and classification trees. LOOCV and 10-fold CV have the smallest bias for linear discriminant analysis. Additionally, LOOCV, 5- and 10-fold CV, and the .632+ bootstrap have the lowest mean square error. The .632+ bootstrap is quite biased in small samples with strong signal-to-noise ratios. Differences in performance among resampling methods are reduced as the number of specimens available increases.

Supplementary Information: A complete compilation of results is available in Molinaro et al. (2005). R code for simulations and analyses is available from the authors.
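The contrast between resubstitution and cross-validation in the presence of feature selection can be illustrated with a minimal sketch. This is not the authors' simulation code (theirs is in R and available on request). It assumes pure-noise data with no true class signal, a simple t-like statistic for feature ranking, and a nearest-centroid classifier standing in for the classifiers named above. All function names and parameter values (`select_features`, `cv_error`, `n = 40`, `p = 1000`, `k = 10`) are hypothetical choices made for illustration.

```python
import numpy as np


def select_features(X, y, k):
    """Return indices of the k features with the largest absolute t-like statistic."""
    a, b = X[y == 0], X[y == 1]
    diff = a.mean(axis=0) - b.mean(axis=0)
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return np.argsort(-np.abs(diff / se))[:k]


def fit_centroids(X, y):
    """Class means for a two-class nearest-centroid classifier (illustrative stand-in)."""
    return np.stack([X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)])


def predict(centroids, X):
    """Assign each row of X to the class with the closest centroid."""
    dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1)


def resubstitution_error(X, y, k):
    """Select features, fit, and evaluate on the same samples."""
    idx = select_features(X, y, k)
    centroids = fit_centroids(X[:, idx], y)
    return float(np.mean(predict(centroids, X[:, idx]) != y))


def cv_error(X, y, k, n_folds, rng):
    """V-fold CV with feature selection repeated inside every training fold."""
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    wrong = 0
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        idx = select_features(X[train], y[train], k)  # held-out samples are not used here
        centroids = fit_centroids(X[train][:, idx], y[train])
        wrong += int(np.sum(predict(centroids, X[test][:, idx]) != y[test]))
    return wrong / len(y)


# Assumed toy setting: 40 samples, 1000 noise features, balanced labels.
rng = np.random.default_rng(0)
n, p = 40, 1000
X = rng.standard_normal((n, p))
y = np.repeat([0, 1], n // 2)

resub = resubstitution_error(X, y, k=10)
cv = cv_error(X, y, k=10, n_folds=10, rng=rng)
print(f"resubstitution error: {resub:.3f}")
print(f"10-fold CV error:     {cv:.3f}")
```

Because the labels are unrelated to the features in this toy setting, the true error of any classifier is 0.5. The resubstitution estimate is expected to fall well below that, while the cross-validated estimate, with selection redone inside each fold, should land much closer to it.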
Received April 6, 2005
Revised April 28, 2005
Accepted May 12, 2005
Prediction error estimation: a comparison of resampling methods
2 Biometric Research Branch, Division of Cancer Treatment and Diagnostics, NCI, NIH, Rockville, MD 20852
3 Biostatistics Branch, Division of Cancer Epidemiology and Genetics, NCI, NIH, Rockville, MD 20852
Annette M. Molinaro, E-mail: annette.molinaro{at}yale.edu