Bioinformatics Advance Access originally published online on August 5, 2004
Bioinformatics 2005 21(1):63-70; doi:10.1093/bioinformatics/bth461
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Bioinformatics vol. 21 issue 1 © Oxford University Press 2005; all rights reserved.
How many samples are needed to build a classifier: a general sequential approach
1 Department of Statistics, Texas A&M University 447 Blocker Building, College Station, TX 77843, USA
2 Department of Electrical Engineering, Texas A&M University 214 Zachry Engineering Center, College Station, TX 77840, USA
3 Department of Pathology, University of Texas MD Anderson Cancer Center 1515 Holcombe, Houston, TX 77030, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: The standard paradigm for a classifier design is to obtain a sample of feature-label pairs and then to apply a classification rule to derive a classifier from the sample data. Typically in laboratory situations the sample size is limited by cost, time or availability of sample material. Thus, an investigator may wish to consider a sequential approach in which there is a sufficient number of patients to train a classifier in order to make a sound decision for diagnosis while at the same time keeping the number of patients as small as possible to make the studies affordable.
Results: A sequential classification procedure is studied via the martingale central limit theorem. It updates the classification rule at each step and provides stopping criteria to ensure with a certain confidence that at stopping a future subject will have misclassification probability smaller than a predetermined threshold. Simulation studies and applications to microarray data analysis are provided. The procedure possesses several attractive properties: (1) it updates the classification rule sequentially and thus does not rely on distributions of primary measurements from other studies; (2) it assesses the stopping criteria at each sequential step and thus can substantially reduce cost via early stopping; and (3) it is not restricted to any particular classification rule and therefore applies to any parametric or non-parametric method, including feature selection or extraction.
Availability: R-code for the sequential stopping rule is available at http://stat.tamu.edu/~wfu/microarray/sequential/R-code.html
Contact: wfu{at}stat.tamu.edu
| 1 INTRODUCTION |
|---|
|
|
|---|
The standard paradigm for a classifier design is to obtain a sample of feature-label pairs and then to apply a classification rule to derive a classifier from the sample data. In some cases, where sample data can be cheaply obtained or generated via a mathematical model, it may be possible to estimate the sample size necessary to obtain a desired degree of design precision and then proceed to generate a sample of the desired size. Typically in laboratory situations the sample size is limited by cost, time or availability of sample material. This study has been motivated by gene-expression-based cancer classification, where sample mRNA may be severely limited. The sample mRNA may have to be obtained from available tissue taken from deceased patients and for which careful and extensive pathology has been undertaken, or it may be obtained in clinical studies, where recruiting patients can be very costly, with costs ranging from providing free medical care to long-term follow-up. Thus, an investigator may wish to consider a sequential approach in which there is a sufficient number of patients to train a classifier in order to make a sound decision for diagnosis while at the same time keeping the number of patients as small as possible to make the studies affordable.
Consider a study that sequentially obtains and categorizes sample tissue or recruits and diagnoses patients based on certain clinical conditions, the goal being to design a classifier to discriminate among the kinds, attributes or stages of a disease. Rather than apply a procedure involving a pre-determined fixed size sample, we will take a sequential approach. The diagnostic decision follows a learning process, where the decision criteria update with the recruiting process with the aim of making as few misclassifications as possible. The key issue to be addressed is the formulation of a stopping rule for the sequential classification.
| 2 STOPPING RULE FOR THE SEQUENTIAL PROCEDURE |
|---|
|
|
|---|
To formulate the stopping rule, let us set up the relevant random variables and probabilities. Let
i = {Y 1, ..., Y i }, where i = 1, 2, ..., be a set of binary observations in a sequential recruitment of patients, either being diagnosed of disease of interest (1) or no disease (0). Patients are recruited independently from each other, and are not correlated, i.e. no observations within the same family or unit are observed. Each patient will have clinical features or expression levels g i of a certain number of genes. Let Q i be the indicator function that Y i is misclassified based on
i 1, i.e. Q i = 1 if Y i is misclassified or Q i = 0 otherwise. And let
i = P(Q i = 1 |
i 1) be the conditional misclassification probability. We are thus interested in the number of patients (N) we need to recruit before we can claim with certain confidence (1
)% that the next patient to recruit will have a small probability
N of being misclassified with
N
for a given small threshold
> 0. To derive the stopping rule, we make the following assumptions:
- The conditional probability
n is weakly monotonically decreasing, i.e. there exists a finite integer p 0 > 0 such that
i+p
i almost surely for all p
p 0 and i
1. Thus
n converges weakly to 
0.
- The limit probability

> 0, i.e. there is a positive probability that observations may be misclassified. This is generally true in applications since the class distributions are assumed to overlap on a region of positive probability, with 
depending on the type of classifier considered, such as linear discriminant analysis (LDA), and on the scientific context of the problems. For a given classification problem with a non-zero Bayes error, the probability 
> 0 regardless of the type of classifiers.
The stopping rule that we apply depends on the following theorem, which depends on the martingale central limit theorem, and is proven in the Appendix.
THEOREM.
For level 0 <
< 1,
![]() |
where z 1
is the (1
)quantile of the standard Gaussian distribution N(0,1),
![]() |
is the estimator for the conditional probability
I.
The theorem suggests that a conservative sample size N, for given threshold 0 <
< 1, satisfies
![]() |
Thus
![]() |
with
and
. In addition, to ensure a stopping of the sequential procedure when a sufficiently large number N 0 of consecutive perfect classifications occur, we se N 0 = log(
)/log(1
). Therefore, the number of sequential recruitments N in the stopping rule satisfies
![]() |
and
.
Based on this stopping rule and a chosen value of
, the number of samples and the designed classifier are determined by the following algorithm under the assumption that the maximum sample size is M:
- Start with an initial sample S 0 and set the perfect-classification counter N 0 = 0.
- At step i, train the classifier based on current data.
- Recruit one subject and classify based on the clinical feature using the classifier.
- Set Q i = 0 and N 0=N 0 + 1 if the classification is correct; otherwise, set Q i = 1 and N 0 = 0.
- Calculate
.
- If the stopping rule is satisfied or N=M, then stop; otherwise, go back to Step 2 and repeat Steps 26.
While a sequence of consecutive perfect classifications indicates good performance of the classifier and induces early stopping, large Bayes errors, such as 30% or greater, make it difficult to train good classifiers and also ought to induce early stopping to save resources. This involves a hypothesis testing with null hypothesis H 0 : 
0.3 versus the alternative H 1 : 
> 0.3. Much work has been done in the area of hypothesis testing in sequential trials (see Emerson and Fleming, 1990; Lai, 1997 and references therein).
It is well-known that maximum-likelihood estimations in sequential studies are biased due to early stopping, and thus need to be corrected, (see Liu and Hall, 1999; Pinheiro and DeMets, 1997; Todd and Whitehead, 1996; Whitehead, 1986). In this paper, we focus on training classifiers rather than on estimating misclassification error, and thus refer readers to the above references for methods of error estimation and bias correction.
| 3 WEAK CONVERGENCE WITH LARGE SAMPLES |
|---|
|
|
|---|
We consider the large sample behavior for the classifier built by the sequential procedure. Let
n be the classifier trained based on the observation
n . C n is a random function. Let Y be a new observation. By large sample theory, the weak convergence
n (Y)
p 
(Y) holds for some classifier 
. Thus
![]() |
where 
is the sequence of observations
m with m being infinitely large. Since 
does not depend on the observations 
,
n
p
(Q = 1), i.e. the conditional probability of misclassification in the sequential procedure converges in probability to the limit probability for misclassification. It implies that although we set a threshold on the conditional probability in the sequential procedure, it is approximately equivalent to setting a threshold on the unconditional probability for sufficiently large number of sequential steps.
REMARKS
-
(Q = 1) depends on the type of classifier predetermined, and may assume different values for different types of classifiers. It may be much larger than the misclassification rate of the optimal Bayes classifier.
- In practice, if the Bayes classifier has a positive misclassification error, and no matter how small
is chosen to be, the resulting classifier trained by the sequential procedure cannot perform betterthat is, the misclassification error of C N cannot be smaller than the Bayes error even if a smaller
is selected. This will be demonstrated in our simulation studies.
| 4 SIMULATION STUDY |
|---|
|
|
|---|
To demonstrate basic behavior of sequential procedure, we compare the misclassification error of the trained classifier with Bayes error in the simple case of a single feature and two Gaussian class-conditional distributions N(0, 1) and N(
, 1),
> 0, where N(0, 1) and N(
, 1) correspond to the labels 0 and 1, respectively. The misclassification error of the designed classifier is its true error computed from the model. In this case, the Bayes classifier corresponds to the optimal linear classifier and we use LDA. The sequential procedure starts with 10 random samples, 5 from each class, proceeds according to the algorithm with a maximum of 90 sequential steps, and yields a classifier based on a sample size of N.
To examine classifier performance, we drew random samples of 10 000 data points, 5000 from N(0,1) and 5000 from N(
,1), and tested the LDA classifier to estimate its cutoff value
. Thus a data point is categorized as class 0 if less than
, or class 1 otherwise. We repeated the drawing of random samples 50 times to obtain accurate estimation of
. The error of the LDA classifier was then calculated with e(C N ) = [1
(
) +
(
)]/2 for each fixed value
. This sequential training and testing procedure was repeated 1000 times for each fixed pair of (
,
) to compare the LDA error with the Bayes error, which was calculated with [1
(
/2)]. The sample mean misclassification rate and sample SD of 1000 simulation runs with a random sequence of observations in each run were computed and the minimum, maximum and mean numbers of sequential steps in the procedure were also recorded.
Table 1 reports the mean misclassification rate, SD of the rates over 1000 simulation runs, the minimum, maximum and mean numbers of sequential steps, and the SD in training the classifiers for each
and
. The means and SD were computed with the sample mean and sample variances. The maximum of sequential steps is 90. The misclassification rates of the classifiers trained with the sequential procedure are very close to the Bayes errors. Specifying a smaller threshold
increases the number of sequential steps, sometimes substantially, but the overall misclassification rate is only lowered slightly. The most gain in Table 1 is for
= 2.3, with the misclassification rate decreasing by 0.0084 from
= 0.20.05, a 6% reduction in error at the cost of increasing the number of steps from 16 to 83 on the average. Therefore, specifying a very small threshold
is not recommended in practice when patient recruitment is costly.
|
Figure 1 shows the histogram plots of the misclassification errors over 1000 simulation runs by
and
. The arrow in the bottom left corner of each plot indicates the Bayes error for given
, which is the minimum value in each histogram. Although long tails to the right were observed, the tail probabilities are very small and the error estimates are in general very close to the Bayes error. It indicates that the LDA classifier at the stopping performs well with only slight bias.
|
Table 2 shows analogous results for the maximum number of sequential steps being 40. It shows that the mean misclassification error increases with smaller maximum number of sequential steps allowed compared to Table 1. Again, the number of sequential steps is reduced substantially while the increase in mean misclassification error varies, mostly for large Bayes error corresponding to
= 1, and barely for small Bayes error corresponding to
= 4.
|
Tables 1 and 2 indicate that setting a smaller threshold
or increasing the maximum number of steps allowed in the sequential procedure may achieve smaller misclassification error, but the reduction may be slight relative to the increase in the number of required patients. | 5 APPLICATION TO MICROARRAY DATA |
|---|
|
|
|---|
We apply the sequential procedure to two studies of microarray data, one on breast-cancer patient prognosis, and the other on breast-tumor characterization. van't Veer et al. (2002) and van de Vijver et al. (2002) have reported studies of gene expression profiling on breast-cancer patient prognosis based on 295 samples with 70 selected genes. Perou et al. (2000) have considered the molecular portrait of human breast tumors based on 84 samples with expressions of 8102 genes. We have chosen these two datasets because the large sample sizes facilitate accurate estimation of the misclassification errors of the trained classifiers. The breast tumor study has five classes of tissues: luminallike ER+tumors, ERBB2+tumors, normal breast, breast cell lines and basallike tumors. For simplicity, we pool them further and form two classes, 51 samples of tumorlike tissues (including luminallike ER + tumors, ERBB2 + tumors and basallike tumors) and 33 samples of non-tumor tissues (including normal breast and breast cell lines).
To initialize the sequential procedure, we choose a small initial random sample of patients, half with good prognosis or non-tumor tissues and half with poor prognosis or tumorlike tissues, either (5,5) or (10,10). Following application of the sequential algorithm, the misclassification error for the sequentially trained classifier is computed based on the remaining portion of the sample (which serves as hold-out test data) by comparing the predicted and clinical outcomes. For each set of fixed criteria such as
>0,
=0.05 and the maximum number of samples allowed in the sequential procedure, we carry out the sequential procedure 1000 times and report the mean number of sequential steps and the mean misclassification error.
5.1 Example 1: LDA classifier on small number of genes for breast-cancer patient prognosis
An LDA classifier is trained with the sequential procedure on the three genes most highly correlated with the prognosis based on the total of 295 samples. We are aware of the selection bias in such a gene selection procedure based on all 295 samples (Ambroise and McLachlan, 2002), but proceed in this way so as to demonstrate our method without incorporating a gene selection procedure. Example 3 will demonstrate the method incorporating a gene selection procedure. We choose a small number of genes to train the classifier for two reasons. First, biologists and clinicians often have only a small number of experimentally identified genes that play a major role, which depends largely on investigator's subjective judgement and is potentially subject to selection bias as well. Second, the amount of training data increases substantially with the number of classifier features.
Table 3 shows that, as the threshold
decreases from 0.3 to 0.15, the mean misclassification error decreases and the minimum and mean number of sequential steps increase. In both cases, the maximum number of steps allowed in the sequential procedure is achieved. The increase in the maximum number of steps from 50 to 80 is due to the increase in the maximum number of steps allowed in the sequential procedure. With the initial sample size increasing from (5, 5) to (10, 10), the mean misclassification error and the mean number of steps decrease. Although the initial sample size does not contribute directly to the stopping rule, a larger initial sample contributes to better classifier training and therefore may lead to early stopping.
|
5.2 Example 2: K-nearest neighbors (KNN) classifiers on small number of genes for breast-cancer patient prognosis
We fit 3NN and 5NN classifiers separately on the 5 genes most highly correlated with patient prognosis based on the full sample of 295 patients. The procedure is similar to Example 1 and the results are reported in Tables 4 and 5. It is interesting to observe in the tables that although the 3NN and 5NN classification rules have about the same mean number of steps and yield about the same misclassification errors, the 5NN classifier yields a slightly smaller mean number of steps and has slightly smaller errors than the 3NN classifier.
|
|
The mean misclassification errors of
0.220.25 reported in Tables 35 by the sequentially trained classifiers are consistent with the errors reported by Braga-Neto and Dougherty (2004).
5.3 Example 3: 3NN classifier on 10 sequentially selected principal components for breast-tumor characterization
We study the 3NN classifier on 10 principal components of the gene expressions for breast-tumor characterization. First, we remove the genes with missing expression levels and keep 4982 genes with no missing values. Second, we order the 4982 genes by the correlation between the tumor class and each individual gene expression based on the entire 84 samples, and choose 445 genes that have a correlation coefficient of absolute value greater than 0.4. These 445 genes are used to train the 3NN classifier in the sequential procedure. We also incorporate feature extraction in this procedure. At each sequential step, we compute the 10 largest principal components based on the current available sample of gene expressions and train a 3NN classifier based on the principal components. Table 6 reports the mean misclassification error, the mean, minimum and maximum number of sequential steps over 1000 runs for each fixed pair of (
, M), where M is the maximum number of sequential steps allowed. It is shown that the misclassification error decreases with the decrease in the threshold
and increase in the maximum number of steps allowed. By specifying a small threshold
, a decrease in the misclassification error may be achieved, but at the cost of running more sequential steps. With larger initial samples, (10,10) versus (5,5), the sequential procedure runs a fewer number of steps to stop and yields smaller error.
|
We have observed different levels of the misclassification errors in the two microarray studies,
25% in the breast-cancer prognosis study and <10% in the breast-tumor characterization study. This reflects the fact, as we mentioned earlier in Section 2, that the misclassification error depends on many factors, including Bayes error, the number of predictors in the model, and most importantly, the scientific context of the problem. | 6 CONCLUSION |
|---|
|
|
|---|
We have studied a sequential classification procedure and provided a stopping rule. The procedure recruits subjects sequentially, updates the classification rule at each step and stops with certain predetermined confidence level (1
) such that a new subject will have a probability less than a small threshold
> 0 to be misclassified by the classification rule. Simulation studies show that the procedure usually achieves its goal of stopping prior to reaching the maximum allowable sample size, and the gain is over a fairly wide range of
. We have applied the procedure to two microarray datasets. In particular, we have applied it in the context of feature extraction. It has been shown to yield error rates close to those achievable for fixed sample sizes. This procedure has several advantages over classical sample size calculations: (1) it updates the classification rule sequentially and thus depends on the study subjects rather than relying on distributions of primary measurements from other studies that may differ greatly, as do the classical sample size calculations; (2) it assesses the stopping criteria at each sequential step and thus can substantially reduce cost via early stopping; (3) it ensures the required sample size to achieve significance, whereas classical sample size calculations may lead to smaller than required sample sizes and therefore miss the expected significance due to inaccurate estimation; and (4) it is not restricted to any particular classification rule and therefore applies to any parametric and non-parametric method, such as LDA, KNN, classification and regression trees (CART), etc.
Although we have considered the classification of binary outcomes in this study, our procedure also applies to multiple class outcomes as well, since one can still define the misclassification indicator Qi and its conditional probability
i with multiple classes, and these constitute the only information used to derive the stopping rule.
| A APPENDIX |
|---|
|
|
|---|
A.1 Derivation of the stopping rule
Let X i = Q i
i , and
n be the
-field generated by Y n . Then
is a zero-mean martingale (Hall and Heyde, 1980.) Let
. By the variance of conditional expectation, one is ready to conclude the convergence in probability
![]() |
It is thus ready to prove with Chebyshev's Inequality (Knight, 2000, p. 123) that
and
. By the martingale central limit theorem (CLT) (Shorack, 2000, pp. 529530), one has the convergence in distribution
![]() |
Let
. By the CLT,
N(0,1) in distribution. Thus with probability approaching 1
,
![]() |
For the stopping rule, we want to find a sample size N such that the probability Prob(
N
)
1
.
For large m, by the weak monotonicity of
n ,
![]() |
where
converges to 0. Hence for large m,
in probability, i.e.
. Particularly,
. Hence by (A.1)
![]() |
We now show that
as N
. We first show that
in probability for large n. Since
and the convergence in distribution
![]() |
it implies that
for large n. Thus
in probability. Without loss of generality, we assume
a.s. Note that if 0 < a
b
1/2, then a(1 a)
b(1 b). Thus
in probability for large n.
Let M 0 be a large integer such that M 0 N and
in probability for i
M 0.
![]() |
holds in probability, where
is bounded by a finite number M 0. Hence
and by (A.2)
![]() |
holds. This completes the proof for the theorem.
Notice that the stopping rule requires
. However, a series of consecutive perfect classifications from the beginning of the sequential procedure yields
. Meanwhile, a long sequence of perfect classifications indicates good performance of classifiers and thus should invoke stopping. This is amended in the stopping rule by setting the number of consecutive perfect classifications N 0 = log(
)/[log(1
)] with a conservative assumption that the perfect classifications are independent with least probability (1
) for each.
| Acknowledgments |
|---|
W.J.F. was supported by a grant from the National Cancer Institute (CA-90301). E.R.D. was supported by the Translational Genomics Research Institute. R.J.C. and B.M. were supported by a grant from the National Cancer Institute (CA-57030) and by the Texas A&M Center for Environmental and Rural Health via a grant from the National Institute of Environmental Health Sciences (P30-ES09106).
Received on April 20, 2004; revised on July 5, 2004; accepted on July 28, 2004
| REFERENCES |
|---|
|
|
|---|
Ambroise, C and McLachlan, G.J. (2002) Selection bias in gene extraction on the basis of microarray gene-expression data. Proc. Natl. Acad. Sci., USA, 99, 65626566
Braga-Neto, U.M. and Dougherty, E.R. (2000) Is cross-validation valid for small-sample microarray classification?. Bioinformatics, 20, 374380.
Emerson, S.S. and Fleming, T.R. (1990) Parameter estimation following group sequential hypothesis testing. Biometrika, 77, 875892
Hall, P. and Heyde, C.C. Martingale Limit Theory and Its Application, (1980) , New York Academic Press Inc.
Knight, K. Mathematical Statistics, (2000) , New York Chapman and Hall/CRC.
Lai, T.L. (1997) On optimal stopping problems in sequential hypothesis testing. Statist. Sinica, 7, , pp. 3351.
Liu, A. and Hall, W.J. (1999) Unbiased estimation following a group sequential test. Biometrika, 86, 7178
Perou, C.M., Sorlie, T., Eisen, M.B., van de Rijn, M., Jeffrey, S.S., Rees, C.A., Pollack, J.R., Ross, D.T., Johnsen, H., Akslen, L.A., et al. (2000) Molecular portraits of Human Breast Tumours. Nature, 406, 747752[CrossRef][Medline].
Pinheiro, J.C. and DeMets, D.L. (1997) Estimating and reducing bias in group sequential designs with Gaussian independent increment structure. Biometrika, 84, 831845
Shorack, G.R. Probability for Statisticians, (2000) , New York Springer.
Todd, S. and Whitehead, J. (1996) Point and interval estimation following a sequential clinical trial. Biometrika, 83, , pp. 453461
van de Vijver, M.J., He, Y.D., van't Veer, L.J., Dai, H., Hart, A.A.M., Voskuil, D.W., Schreiber, G.J., Peterse, J.L., Roberts, C., Marton, M.J., et al. (2002) A gene-expression signature as a predictor of survival in breast cancer. N. Engl. J. Med., 347, 19992009
van't Veer, L.J., Dai, H., van de Vijver, M.J., He, Y.D., Hart, A.A.M., Mao, M., Peterse, H.L., van der Kooy, K., Marton, M.J., Witteveen, A.T., et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415, 530536[CrossRef][Medline].
Whitehead, J. (1986) On the bias of maximum likelihood estimation following a sequential test. Biometrika, 73, 573581
This article has been cited by other articles:
![]() |
K. K. Dobbin and R. M. Simon Sample size planning for developing classifiers using high-dimensional DNA microarray data Biostat., January 1, 2007; 8(1): 101 - 117. [Abstract] [Full Text] [PDF] |
||||
![]() |
X. Wang, J. Yu, A. Sreekumar, S. Varambally, R. Shen, D. Giacherio, R. Mehra, J. E. Montie, K. J. Pienta, M. G. Sanda, et al. Autoantibody Signatures in Prostate Cancer N. Engl. J. Med., September 22, 2005; 353(12): 1224 - 1235. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||






) numbers of sequential steps and the SD in training LDA classifier by population mean 









