Bioinformatics Advance Access originally published online on August 5, 2004
Bioinformatics 2005 21(1):63-70; doi:10.1093/bioinformatics/bth461
Bioinformatics vol. 21 issue 1 © Oxford University Press 2005; all rights reserved.
How many samples are needed to build a classifier: a general sequential approach
1 Department of Statistics, Texas A&M University, 447 Blocker Building, College Station, TX 77843, USA
2 Department of Electrical Engineering, Texas A&M University, 214 Zachry Engineering Center, College Station, TX 77840, USA
3 Department of Pathology, University of Texas MD Anderson Cancer Center, 1515 Holcombe, Houston, TX 77030, USA
*To whom correspondence should be addressed.
Motivation: The standard paradigm for classifier design is to obtain a sample of feature-label pairs and then to apply a classification rule to derive a classifier from the sample data. In laboratory situations the sample size is typically limited by cost, time or availability of sample material. Thus, an investigator may wish to consider a sequential approach that enrolls enough patients to train a classifier capable of supporting a sound diagnostic decision, while keeping the number of patients as small as possible so that the study remains affordable.
Results: A sequential classification procedure is studied via the martingale central limit theorem. It updates the classification rule at each step and provides stopping criteria to ensure with a certain confidence that at stopping a future subject will have misclassification probability smaller than a predetermined threshold. Simulation studies and applications to microarray data analysis are provided. The procedure possesses several attractive properties: (1) it updates the classification rule sequentially and thus does not rely on distributions of primary measurements from other studies; (2) it assesses the stopping criteria at each sequential step and thus can substantially reduce cost via early stopping; and (3) it is not restricted to any particular classification rule and therefore applies to any parametric or non-parametric method, including feature selection or extraction.
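The enroll, update and check loop described above can be illustrated with a minimal sketch. It is not the authors' martingale-based stopping criterion. As a simplified stand-in, it tests each new subject against the current rule before adding that subject to the training set, and it stops once a one-sided Wilson upper confidence bound on the accumulated error rate falls below the threshold. The nearest-centroid rule, the function names (`wilson_upper`, `nearest_centroid_predict`, `sequential_design`) and the defaults (`threshold=0.2`, `z=1.645`, `min_tests=20`) are all assumptions made for illustration.

```python
import math


def wilson_upper(errors, n, z=1.645):
    """One-sided Wilson upper confidence bound for an error rate (z=1.645 ~ 95%)."""
    p = errors / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre + margin) / (1 + z * z / n)


def nearest_centroid_predict(train, x):
    """Predict the label whose class mean is closest to x (1-D feature, illustrative rule)."""
    means = {}
    for label in {y for _, y in train}:
        values = [v for v, y in train if y == label]
        means[label] = sum(values) / len(values)
    return min(means, key=lambda label: abs(x - means[label]))


def sequential_design(stream, threshold=0.2, z=1.645, min_tests=20):
    """Enroll subjects one at a time; stop once the error bound drops below threshold.

    Returns (subjects enrolled, test errors, tests performed, stopped early?).
    """
    train, errors, tests = [], 0, 0
    for x, y in stream:
        # Test the current rule on the new subject before learning from it,
        # which requires that both classes have already been observed.
        if len({label for _, label in train}) == 2:
            errors += int(nearest_centroid_predict(train, x) != y)
            tests += 1
        # Update the training set (and hence the rule) with the new subject.
        train.append((x, y))
        # Assess the stopping criterion at every sequential step.
        if tests >= min_tests and wilson_upper(errors, tests, z) < threshold:
            return len(train), errors, tests, True
    return len(train), errors, tests, False


# Hypothetical, perfectly separable stream: class 0 at 0.0, class 1 at 10.0.
separable = [(10.0 * (i % 2), i % 2) for i in range(100)]
print(sequential_design(separable))  # (22, 0, 20, True)
```

In this toy run the rule never errs, so sampling stops as soon as the minimum number of sequential tests has been reached. With overlapping classes the bound shrinks more slowly and more subjects are enrolled before stopping. Because the sequential error indicators are not independent, a rigorous guarantee requires an argument such as the martingale central limit theorem used in the article, which this sketch does not reproduce.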
Availability: R-code for the sequential stopping rule is available at http://stat.tamu.edu/~wfu/microarray/sequential/R-code.html
Contact: wfu@stat.tamu.edu
Received on April 20, 2004; revised on July 5, 2004; accepted on July 28, 2004