Bioinformatics 20(3) © Oxford University Press 2004; all rights reserved.

Is cross-validation valid for small-sample microarray classification?

Ulisses M. Braga-Neto 1,3 and Edward R. Dougherty 2,3,*

1 Section of Clinical Cancer Genetics and 2 Department of Pathology, University of Texas MD Anderson Cancer Center, Houston, TX, USA and 3 Department of Electrical Engineering, Texas A&M University, College Station, TX, USA

Received on March 18, 2003; revised on July 3, 2003; accepted on August 7, 2003

Motivation: Microarray classification typically possesses two striking attributes: (1) classifier design and error estimation are based on remarkably small samples and (2) cross-validation error estimation is employed in the majority of the papers. Thus, it is necessary to have a quantifiable understanding of the behavior of cross-validation in the context of very small samples.

Results: An extensive simulation study has been performed comparing cross-validation, resubstitution and bootstrap estimation for three popular classification rules: linear discriminant analysis, 3-nearest-neighbor and decision trees (CART). The study uses both synthetic and real breast-cancer patient data. Estimators are compared via the distribution of differences between the estimated and true errors (the deviation distribution). Various statistics for the deviation distribution have been computed: mean (for estimator bias), variance (for estimator precision), root-mean-square error (for composition of bias and variance) and quartile ranges, including outlier behavior. In general, while cross-validation error estimation is much less biased than resubstitution, it displays excessive variance, which makes individual estimates unreliable for small samples. Bootstrap methods provide improved performance relative to variance, but at a high computational cost and often with increased bias (albeit much less than with resubstitution).
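To make the comparison concrete, here is a minimal sketch of how a deviation distribution of this kind could be simulated. It is not the authors' code (that is provided on the companion web site), and every modelling choice in it is an assumption made for illustration: a one-dimensional, two-class Gaussian model with unit variance and equal priors; a nearest-mean rule standing in for linear discriminant analysis; stratified samples; and a simple leave-out bootstrap, which may differ from the bootstrap estimators studied in the paper. All function and parameter names (`train`, `true_error`, `simulate`, `delta`, `n_boot` and so on) are hypothetical. Because the model is known, the true error of each designed classifier can be computed exactly, so the sketch collects "estimated minus true" deviations and reports their mean, variance and root-mean-square value.

```python
import math
import random
import statistics


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def train(xs, ys):
    """Nearest-mean rule in 1-D: returns (threshold, label assigned above it)."""
    c0 = [x for x, y in zip(xs, ys) if y == 0]
    c1 = [x for x, y in zip(xs, ys) if y == 1]
    m0, m1 = sum(c0) / len(c0), sum(c1) / len(c1)
    return (m0 + m1) / 2.0, (1 if m1 >= m0 else 0)


def predict(model, x):
    threshold, upper = model
    return upper if x > threshold else 1 - upper


def true_error(model, delta):
    """Exact error under the assumed model: N(0,1) vs N(delta,1), equal priors."""
    threshold, upper = model
    p0_above = 1.0 - normal_cdf(threshold)
    p1_above = 1.0 - normal_cdf(threshold - delta)
    if upper == 1:
        return 0.5 * p0_above + 0.5 * (1.0 - p1_above)
    return 0.5 * (1.0 - p0_above) + 0.5 * p1_above


def resubstitution(xs, ys):
    """Error of the designed classifier on its own training sample."""
    model = train(xs, ys)
    return sum(predict(model, x) != y for x, y in zip(xs, ys)) / len(xs)


def leave_one_out(xs, ys):
    """Leave-one-out cross-validation: redesign without point i, test on i."""
    errors = 0
    for i in range(len(xs)):
        model = train(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        errors += predict(model, xs[i]) != ys[i]
    return errors / len(xs)


def leave_out_bootstrap(xs, ys, rng, n_boot):
    """Design on a stratified bootstrap sample, test on the points left out."""
    idx0 = [i for i, y in enumerate(ys) if y == 0]
    idx1 = [i for i, y in enumerate(ys) if y == 1]
    errors = held_out = 0
    for _ in range(n_boot):
        chosen = rng.choices(idx0, k=len(idx0)) + rng.choices(idx1, k=len(idx1))
        model = train([xs[i] for i in chosen], [ys[i] for i in chosen])
        chosen_set = set(chosen)
        for i in range(len(xs)):
            if i not in chosen_set:
                held_out += 1
                errors += predict(model, xs[i]) != ys[i]
    # NaN only if no replicate left any point out (vanishingly unlikely).
    return errors / held_out if held_out else float("nan")


def summarize(deviations):
    """Mean (bias), variance and RMS of the 'estimated minus true' deviations."""
    bias = statistics.fmean(deviations)
    variance = statistics.pvariance(deviations)
    rms = math.sqrt(statistics.fmean(d * d for d in deviations))
    return bias, variance, rms


def simulate(n_per_class=10, delta=1.5, trials=300, n_boot=25, seed=0):
    """Repeat: draw a small sample, design a classifier, record deviations."""
    rng = random.Random(seed)
    ys = [0] * n_per_class + [1] * n_per_class
    deviations = {"resubstitution": [], "leave-one-out": [], "bootstrap": []}
    for _ in range(trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n_per_class)]
        xs += [rng.gauss(delta, 1.0) for _ in range(n_per_class)]
        truth = true_error(train(xs, ys), delta)
        deviations["resubstitution"].append(resubstitution(xs, ys) - truth)
        deviations["leave-one-out"].append(leave_one_out(xs, ys) - truth)
        deviations["bootstrap"].append(
            leave_out_bootstrap(xs, ys, rng, n_boot) - truth)
    return {name: summarize(d) for name, d in deviations.items()}


if __name__ == "__main__":
    for name, (bias, variance, rms) in simulate().items():
        print(f"{name:15s} bias={bias:+.4f} variance={variance:.4f} rms={rms:.4f}")
```

Running the script prints one line per estimator. The numbers depend on the assumed model, sample size and seed, so they illustrate the bookkeeping only and should not be read as a reproduction of the paper's results. The root-mean-square deviation satisfies RMS² = bias² + variance, which is the sense in which it combines bias and variance.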

Availability and Supplementary information: A companion web site can be accessed at the URL http://ee.tamu.edu/~edward/cv_paper. The companion web site contains: (1) the complete set of tables and plots regarding the simulation study; (2) additional figures; (3) a compilation of references for microarray classification studies and (4) the source code used, with full documentation and examples.

Contact: edward{at}ee.tamu.edu

* To whom correspondence should be addressed at 214 Zachry Engineering Center, Department of Electrical Engineering, Texas A&M University, College Station, TX 77840, USA.





