Bioinformatics 20(3) © Oxford University Press 2004; all rights reserved.
Is cross-validation valid for small-sample microarray classification?
1 Section of Clinical Cancer Genetics and 2 Department of Pathology, University of Texas MD Anderson Cancer Center, Houston, TX, USA and 3 Department of Electrical Engineering, Texas A&M University, College Station, TX, USA
Received on March 18, 2003; revised on July 3, 2003; accepted on August 7, 2003
Motivation: Microarray classification typically possesses two striking attributes: (1) classifier design and error estimation are based on remarkably small samples and (2) cross-validation error estimation is employed in the majority of the papers. Thus, it is necessary to have a quantifiable understanding of the behavior of cross-validation in the context of very small samples.
Results: An extensive simulation study has been performed comparing cross-validation, resubstitution and bootstrap estimation for three popular classification rules (linear discriminant analysis, 3-nearest-neighbor and decision trees (CART)), using both synthetic and real breast-cancer patient data. Comparison is via the distribution of differences between the estimated and true errors. Various statistics for the deviation distribution have been computed: mean (for estimator bias), variance (for estimator precision), root-mean-square error (for the combination of bias and variance) and quartile ranges, including outlier behavior. In general, while cross-validation error estimation is much less biased than resubstitution, it displays excessive variance, which makes individual estimates unreliable for small samples. Bootstrap methods provide improved performance relative to variance, but at a high computational cost and often with increased bias (albeit much less than with resubstitution).
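The comparison described above can be illustrated with a minimal sketch. It is not the code used in the study: it assumes a one-dimensional, two-class Gaussian model with unit variance and equal priors, a nearest-mean rule standing in for the classification rules named above, leave-one-out as the cross-validation variant, and a plain out-of-bag bootstrap estimate. All function names, the class separation `delta`, the sample size and the number of trials are hypothetical choices made for illustration. Under this model the true error of a designed classifier is available in closed form, so the deviation (estimate minus true error) can be computed for every simulated sample.

```python
import math
import random
import statistics


def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def train(xs, ys):
    """Nearest-mean rule in 1-D: threshold midway between the class means."""
    m0 = statistics.fmean([x for x, y in zip(xs, ys) if y == 0])
    m1 = statistics.fmean([x for x, y in zip(xs, ys) if y == 1])
    return (m0 + m1) / 2.0, m1 >= m0  # (threshold, class 1 lies above it?)


def predict(model, x):
    t, up = model
    return int((x > t) == up)


def true_error(model, delta):
    """Exact error under the ASSUMED model: N(0,1) vs N(delta,1), equal priors."""
    t, up = model
    e = 0.5 * (1.0 - phi(t)) + 0.5 * phi(t - delta)
    return e if up else 1.0 - e


def resubstitution(xs, ys):
    """Error of the designed classifier counted on its own training data."""
    model = train(xs, ys)
    return sum(predict(model, x) != y for x, y in zip(xs, ys)) / len(xs)


def leave_one_out(xs, ys):
    """Leave-one-out cross-validation; needs at least two points per class."""
    wrong = 0
    for i in range(len(xs)):
        model = train(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        wrong += predict(model, xs[i]) != ys[i]
    return wrong / len(xs)


def bootstrap_oob(xs, ys, rng, reps=25):
    """Train on bootstrap resamples, test on the points left out of each."""
    n, wrong, tested = len(xs), 0, 0
    for _ in range(reps):
        while True:  # redraw until both classes appear in the resample
            idx = [rng.randrange(n) for _ in range(n)]
            if len({ys[i] for i in idx}) == 2:
                break
        model = train([xs[i] for i in idx], [ys[i] for i in idx])
        chosen = set(idx)
        for i in range(n):
            if i not in chosen:
                wrong += predict(model, xs[i]) != ys[i]
                tested += 1
    return wrong / tested if tested else float("nan")


def simulate(n=20, delta=1.0, trials=200, seed=0):
    """Collect deviations (estimated minus true error) over simulated samples."""
    rng = random.Random(seed)
    devs = {"resubstitution": [], "leave-one-out": [], "bootstrap-oob": []}
    half = n // 2
    for _ in range(trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(half)]
        xs += [rng.gauss(delta, 1.0) for _ in range(half)]
        ys = [0] * half + [1] * half
        truth = true_error(train(xs, ys), delta)
        devs["resubstitution"].append(resubstitution(xs, ys) - truth)
        devs["leave-one-out"].append(leave_one_out(xs, ys) - truth)
        devs["bootstrap-oob"].append(bootstrap_oob(xs, ys, rng) - truth)
    return devs


def summarize(devs):
    """Bias, variance and RMS error of one deviation distribution."""
    bias = statistics.fmean(devs)
    var = statistics.pvariance(devs)
    rmse = math.sqrt(statistics.fmean([d * d for d in devs]))
    return bias, var, rmse


if __name__ == "__main__":
    for name, d in simulate().items():
        bias, var, rmse = summarize(d)
        print(f"{name:15s} bias={bias:+.4f} var={var:.4f} rmse={rmse:.4f}")
```

Running the script prints one line per estimator with the mean, variance and root-mean-square of its deviation distribution, the same three summary statistics listed above. The numbers from such a toy model should not be read as reproducing the study's results; the sketch only shows how a deviation distribution is assembled when the true error is known.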
Availability and Supplementary information: A companion web site can be accessed at the URL http://ee.tamu.edu/~edward/cv_paper. The companion web site contains: (1) the complete set of tables and plots regarding the simulation study; (2) additional figures; (3) a compilation of references for microarray classification studies and (4) the source code used, with full documentation and examples.
Contact: edward{at}ee.tamu.edu
* To whom correspondence should be addressed at 214 Zachry Engineering Center, Department of Electrical Engineering, Texas A&M University, College Station, TX 77840, USA.