Skip Navigation


Bioinformatics Advance Access originally published online on August 5, 2004
Bioinformatics 2005 21(1):63-70; doi:10.1093/bioinformatics/bth461
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/1/63    most recent
bth461v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (5)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Fu, W. J.
Right arrow Articles by Carroll, R. J.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Fu, W. J.
Right arrow Articles by Carroll, R. J.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

Bioinformatics vol. 21 issue 1 © Oxford University Press 2005; all rights reserved.

How many samples are needed to build a classifier: a general sequential approach

Wenjiang J. Fu 1,*, Edward R. Dougherty 2,3, Bani Mallick 1 and Raymond J. Carroll 1

1 Department of Statistics, Texas A&M University 447 Blocker Building, College Station, TX 77843, USA
2 Department of Electrical Engineering, Texas A&M University 214 Zachry Engineering Center, College Station, TX 77840, USA
3 Department of Pathology, University of Texas MD Anderson Cancer Center 1515 Holcombe, Houston, TX 77030, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 STOPPING RULE FOR...
 3 WEAK CONVERGENCE WITH...
 4 SIMULATION STUDY
 5 APPLICATION TO MICROARRAY...
 6 CONCLUSION
 A APPENDIX
 REFERENCES
 

Motivation: The standard paradigm for a classifier design is to obtain a sample of feature-label pairs and then to apply a classification rule to derive a classifier from the sample data. Typically in laboratory situations the sample size is limited by cost, time or availability of sample material. Thus, an investigator may wish to consider a sequential approach in which there is a sufficient number of patients to train a classifier in order to make a sound decision for diagnosis while at the same time keeping the number of patients as small as possible to make the studies affordable.

Results: A sequential classification procedure is studied via the martingale central limit theorem. It updates the classification rule at each step and provides stopping criteria to ensure with a certain confidence that at stopping a future subject will have misclassification probability smaller than a predetermined threshold. Simulation studies and applications to microarray data analysis are provided. The procedure possesses several attractive properties: (1) it updates the classification rule sequentially and thus does not rely on distributions of primary measurements from other studies; (2) it assesses the stopping criteria at each sequential step and thus can substantially reduce cost via early stopping; and (3) it is not restricted to any particular classification rule and therefore applies to any parametric or non-parametric method, including feature selection or extraction.

Availability: R-code for the sequential stopping rule is available at http://stat.tamu.edu/~wfu/microarray/sequential/R-code.html

Contact: wfu{at}stat.tamu.edu


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 STOPPING RULE FOR...
 3 WEAK CONVERGENCE WITH...
 4 SIMULATION STUDY
 5 APPLICATION TO MICROARRAY...
 6 CONCLUSION
 A APPENDIX
 REFERENCES
 
The standard paradigm for a classifier design is to obtain a sample of feature-label pairs and then to apply a classification rule to derive a classifier from the sample data. In some cases, where sample data can be cheaply obtained or generated via a mathematical model, it may be possible to estimate the sample size necessary to obtain a desired degree of design precision and then proceed to generate a sample of the desired size. Typically in laboratory situations the sample size is limited by cost, time or availability of sample material. This study has been motivated by gene-expression-based cancer classification, where sample mRNA may be severely limited. The sample mRNA may have to be obtained from available tissue taken from deceased patients and for which careful and extensive pathology has been undertaken, or it may be obtained in clinical studies, where recruiting patients can be very costly, with costs ranging from providing free medical care to long-term follow-up. Thus, an investigator may wish to consider a sequential approach in which there is a sufficient number of patients to train a classifier in order to make a sound decision for diagnosis while at the same time keeping the number of patients as small as possible to make the studies affordable.

Consider a study that sequentially obtains and categorizes sample tissue or recruits and diagnoses patients based on certain clinical conditions, the goal being to design a classifier to discriminate among the kinds, attributes or stages of a disease. Rather than apply a procedure involving a pre-determined fixed size sample, we will take a sequential approach. The diagnostic decision follows a learning process, where the decision criteria update with the recruiting process with the aim of making as few misclassifications as possible. The key issue to be addressed is the formulation of a stopping rule for the sequential classification.


    2 STOPPING RULE FOR THE SEQUENTIAL PROCEDURE
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 STOPPING RULE FOR...
 3 WEAK CONVERGENCE WITH...
 4 SIMULATION STUDY
 5 APPLICATION TO MICROARRAY...
 6 CONCLUSION
 A APPENDIX
 REFERENCES
 
To formulate the stopping rule, let us set up the relevant random variables and probabilities. Let Y i = {Y 1, ..., Y i }, where i = 1, 2, ..., be a set of binary observations in a sequential recruitment of patients, either being diagnosed of disease of interest (1) or no disease (0). Patients are recruited independently from each other, and are not correlated, i.e. no observations within the same family or unit are observed. Each patient will have clinical features or expression levels g i of a certain number of genes. Let Q i be the indicator function that Y i is misclassified based on Y i – 1, i.e. Q i = 1 if Y i is misclassified or Q i = 0 otherwise. And let {pi} i = P(Q i = 1 |Y i – 1) be the conditional misclassification probability. We are thus interested in the number of patients (N) we need to recruit before we can claim with certain confidence (1–{alpha})% that the next patient to recruit will have a small probability {pi} N of being misclassified with {pi} N ≤ {varepsilon} for a given small threshold {varepsilon} > 0.

To derive the stopping rule, we make the following assumptions:

  1. The conditional probability {pi} n is weakly monotonically decreasing, i.e. there exists a finite integer p 0 > 0 such that {pi} i+p ≤ {pi} i almost surely for all p ≥ p 0 and i ≥ 1. Thus {pi} n converges weakly to {pi}{infty} ≥ 0.
  2. The limit probability {pi}{infty} > 0, i.e. there is a positive probability that observations may be misclassified. This is generally true in applications since the class distributions are assumed to overlap on a region of positive probability, with {pi}{infty} depending on the type of classifier considered, such as linear discriminant analysis (LDA), and on the scientific context of the problems. For a given classification problem with a non-zero Bayes error, the probability {pi}{infty} > 0 regardless of the type of classifiers.

The stopping rule that we apply depends on the following theorem, which depends on the martingale central limit theorem, and is proven in the Appendix.

THEOREM.

For level 0 < {alpha} < 1,

where z 1 – {alpha} is the (1 – {alpha})—quantile of the standard Gaussian distribution N(0,1),

is the estimator for the conditional probability {pi}I.

The theorem suggests that a conservative sample size N, for given threshold 0 < {varepsilon} < 1, satisfies

Thus

with and . In addition, to ensure a stopping of the sequential procedure when a sufficiently large number N 0 of consecutive perfect classifications occur, we se N 0 = log({alpha})/log(1 – {varepsilon}). Therefore, the number of sequential recruitments N in the stopping rule satisfies

and .

Based on this stopping rule and a chosen value of {varepsilon}, the number of samples and the designed classifier are determined by the following algorithm under the assumption that the maximum sample size is M:

  1. Start with an initial sample S 0 and set the perfect-classification counter N 0 = 0.
  2. At step i, train the classifier based on current data.
  3. Recruit one subject and classify based on the clinical feature using the classifier.
  4. Set Q i = 0 and N 0=N 0 + 1 if the classification is correct; otherwise, set Q i = 1 and N 0 = 0.
  5. Calculate .
  6. If the stopping rule is satisfied or N=M, then stop; otherwise, go back to Step 2 and repeat Steps 2–6.

While a sequence of consecutive perfect classifications indicates good performance of the classifier and induces early stopping, large Bayes errors, such as 30% or greater, make it difficult to train good classifiers and also ought to induce early stopping to save resources. This involves a hypothesis testing with null hypothesis H 0 : {pi}{infty} ≤ 0.3 versus the alternative H 1 : {pi}{infty} > 0.3. Much work has been done in the area of hypothesis testing in sequential trials (see Emerson and Fleming, 1990; Lai, 1997 and references therein).

It is well-known that maximum-likelihood estimations in sequential studies are biased due to early stopping, and thus need to be corrected, (see Liu and Hall, 1999; Pinheiro and DeMets, 1997; Todd and Whitehead, 1996; Whitehead, 1986). In this paper, we focus on training classifiers rather than on estimating misclassification error, and thus refer readers to the above references for methods of error estimation and bias correction.


    3 WEAK CONVERGENCE WITH LARGE SAMPLES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 STOPPING RULE FOR...
 3 WEAK CONVERGENCE WITH...
 4 SIMULATION STUDY
 5 APPLICATION TO MICROARRAY...
 6 CONCLUSION
 A APPENDIX
 REFERENCES
 
We consider the large sample behavior for the classifier built by the sequential procedure. Let C n be the classifier trained based on the observation Y n . C n is a random function. Let Y be a new observation. By large sample theory, the weak convergence C n (Y) -> p C{infty} (Y) holds for some classifier C{infty}. Thus

where Y{infty} is the sequence of observations Y m with m being infinitely large. Since C{infty} does not depend on the observations Y{infty}, {pi}n -> p (Q = 1), i.e. the conditional probability of misclassification in the sequential procedure converges in probability to the limit probability for misclassification. It implies that although we set a threshold on the conditional probability in the sequential procedure, it is approximately equivalent to setting a threshold on the unconditional probability for sufficiently large number of sequential steps.

REMARKS

  1. (Q = 1) depends on the type of classifier predetermined, and may assume different values for different types of classifiers. It may be much larger than the misclassification rate of the optimal Bayes classifier.
  2. In practice, if the Bayes classifier has a positive misclassification error, and no matter how small {varepsilon} is chosen to be, the resulting classifier trained by the sequential procedure cannot perform better—that is, the misclassification error of C N cannot be smaller than the Bayes error even if a smaller {varepsilon} is selected. This will be demonstrated in our simulation studies.


    4 SIMULATION STUDY
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 STOPPING RULE FOR...
 3 WEAK CONVERGENCE WITH...
 4 SIMULATION STUDY
 5 APPLICATION TO MICROARRAY...
 6 CONCLUSION
 A APPENDIX
 REFERENCES
 
To demonstrate basic behavior of sequential procedure, we compare the misclassification error of the trained classifier with Bayes error in the simple case of a single feature and two Gaussian class-conditional distributions N(0, 1) and N({Delta}, 1), {Delta} > 0, where N(0, 1) and N({Delta}, 1) correspond to the labels 0 and 1, respectively. The misclassification error of the designed classifier is its true error computed from the model. In this case, the Bayes classifier corresponds to the optimal linear classifier and we use LDA. The sequential procedure starts with 10 random samples, 5 from each class, proceeds according to the algorithm with a maximum of 90 sequential steps, and yields a classifier based on a sample size of N.

To examine classifier performance, we drew random samples of 10 000 data points, 5000 from N(0,1) and 5000 from N({Delta},1), and tested the LDA classifier to estimate its cutoff value {lambda}. Thus a data point is categorized as class 0 if less than {lambda}, or class 1 otherwise. We repeated the drawing of random samples 50 times to obtain accurate estimation of {lambda}. The error of the LDA classifier was then calculated with e(C N ) = [1 – {Phi}({lambda}) + {Phi}({lambda}{Delta})]/2 for each fixed value {Delta}. This sequential training and testing procedure was repeated 1000 times for each fixed pair of ({Delta}, {varepsilon}) to compare the LDA error with the Bayes error, which was calculated with [1 – {Phi}({Delta}/2)]. The sample mean misclassification rate and sample SD of 1000 simulation runs with a random sequence of observations in each run were computed and the minimum, maximum and mean numbers of sequential steps in the procedure were also recorded.

Table 1 reports the mean misclassification rate, SD of the rates over 1000 simulation runs, the minimum, maximum and mean numbers of sequential steps, and the SD in training the classifiers for each {Delta} and {varepsilon}. The means and SD were computed with the sample mean and sample variances. The maximum of sequential steps is 90. The misclassification rates of the classifiers trained with the sequential procedure are very close to the Bayes errors. Specifying a smaller threshold {varepsilon} increases the number of sequential steps, sometimes substantially, but the overall misclassification rate is only lowered slightly. The most gain in Table 1 is for {Delta} = 2.3, with the misclassification rate decreasing by 0.0084 from {varepsilon} = 0.2–0.05, a 6% reduction in error at the cost of increasing the number of steps from 16 to 83 on the average. Therefore, specifying a very small threshold {varepsilon} is not recommended in practice when patient recruitment is costly.


View this table:
[in this window]
[in a new window]
 
Table 1 Misclassification error, the SD, minimum, maximum and mean () numbers of sequential steps and the SD in training LDA classifier by population mean {Delta} and stopping threshold {varepsilon} for fixed {alpha} = 0.05 and maximum sequential steps 90

 
Figure 1 shows the histogram plots of the misclassification errors over 1000 simulation runs by {Delta} and {varepsilon}. The arrow in the bottom left corner of each plot indicates the Bayes error for given {Delta}, which is the minimum value in each histogram. Although long tails to the right were observed, the tail probabilities are very small and the error estimates are in general very close to the Bayes error. It indicates that the LDA classifier at the stopping performs well with only slight bias.



View larger version (25K):
[in this window]
[in a new window]
 
Fig. 1 Histogram of misclassification errors in 1000 runs for given {Delta} and {varepsilon} in the simulation study with maximum steps M=90. The arrow in the bottom left corner of each plot indicates the Bayes error between Normal (0,1) and Normal ({Delta},1) with specified {Delta}. Top panels, {Delta}=1; Middle panels, {Delta}=2; Bottom panels, {Delta}=3; Left panels, {varepsilon}=0.1; Central panels, {varepsilon}=0.15; and Right panels, {varepsilon}=0.2.

 
Table 2 shows analogous results for the maximum number of sequential steps being 40. It shows that the mean misclassification error increases with smaller maximum number of sequential steps allowed compared to Table 1. Again, the number of sequential steps is reduced substantially while the increase in mean misclassification error varies, mostly for large Bayes error corresponding to {Delta} = 1, and barely for small Bayes error corresponding to {Delta} = 4.


View this table:
[in this window]
[in a new window]
 
Table 2 Misclassification error, SD, minimum, maximum and mean () numbers of sequential steps and the SD in training LDA classifier by population mean {Delta} and stopping threshold {varepsilon} for fixed {alpha} = 0.05 and maximum sequential steps 40

 
Tables 1 and 2 indicate that setting a smaller threshold {varepsilon} or increasing the maximum number of steps allowed in the sequential procedure may achieve smaller misclassification error, but the reduction may be slight relative to the increase in the number of required patients.


    5 APPLICATION TO MICROARRAY DATA
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 STOPPING RULE FOR...
 3 WEAK CONVERGENCE WITH...
 4 SIMULATION STUDY
 5 APPLICATION TO MICROARRAY...
 6 CONCLUSION
 A APPENDIX
 REFERENCES
 
We apply the sequential procedure to two studies of microarray data, one on breast-cancer patient prognosis, and the other on breast-tumor characterization. van't Veer et al. (2002) and van de Vijver et al. (2002) have reported studies of gene expression profiling on breast-cancer patient prognosis based on 295 samples with 70 selected genes. Perou et al. (2000) have considered the molecular portrait of human breast tumors based on 84 samples with expressions of 8102 genes. We have chosen these two datasets because the large sample sizes facilitate accurate estimation of the misclassification errors of the trained classifiers. The breast tumor study has five classes of tissues: luminal–like ER+tumors, ERBB2+tumors, normal breast, breast cell lines and basal–like tumors. For simplicity, we pool them further and form two classes, 51 samples of tumor–like tissues (including luminal–like ER + tumors, ERBB2 + tumors and basal–like tumors) and 33 samples of non-tumor tissues (including normal breast and breast cell lines).

To initialize the sequential procedure, we choose a small initial random sample of patients, half with good prognosis or non-tumor tissues and half with poor prognosis or tumor–like tissues, either (5,5) or (10,10). Following application of the sequential algorithm, the misclassification error for the sequentially trained classifier is computed based on the remaining portion of the sample (which serves as hold-out test data) by comparing the predicted and clinical outcomes. For each set of fixed criteria such as {varepsilon}>0, {alpha}=0.05 and the maximum number of samples allowed in the sequential procedure, we carry out the sequential procedure 1000 times and report the mean number of sequential steps and the mean misclassification error.

5.1 Example 1: LDA classifier on small number of genes for breast-cancer patient prognosis
An LDA classifier is trained with the sequential procedure on the three genes most highly correlated with the prognosis based on the total of 295 samples. We are aware of the selection bias in such a gene selection procedure based on all 295 samples (Ambroise and McLachlan, 2002), but proceed in this way so as to demonstrate our method without incorporating a gene selection procedure. Example 3 will demonstrate the method incorporating a gene selection procedure. We choose a small number of genes to train the classifier for two reasons. First, biologists and clinicians often have only a small number of experimentally identified genes that play a major role, which depends largely on investigator's subjective judgement and is potentially subject to selection bias as well. Second, the amount of training data increases substantially with the number of classifier features.

Table 3 shows that, as the threshold {varepsilon} decreases from 0.3 to 0.15, the mean misclassification error decreases and the minimum and mean number of sequential steps increase. In both cases, the maximum number of steps allowed in the sequential procedure is achieved. The increase in the maximum number of steps from 50 to 80 is due to the increase in the maximum number of steps allowed in the sequential procedure. With the initial sample size increasing from (5, 5) to (10, 10), the mean misclassification error and the mean number of steps decrease. Although the initial sample size does not contribute directly to the stopping rule, a larger initial sample contributes to better classifier training and therefore may lead to early stopping.


View this table:
[in this window]
[in a new window]
 
Table 3 Misclassification error, its SD, minimum, maximum, mean numbers of sequential steps and the SD in training the LDA classifier on the three most highly correlated genes with breast-cancer patient's prognosis (van de Vijver et al., 2002)

 
5.2 Example 2: K-nearest neighbors (KNN) classifiers on small number of genes for breast-cancer patient prognosis
We fit 3NN and 5NN classifiers separately on the 5 genes most highly correlated with patient prognosis based on the full sample of 295 patients. The procedure is similar to Example 1 and the results are reported in Tables 4 and 5. It is interesting to observe in the tables that although the 3NN and 5NN classification rules have about the same mean number of steps and yield about the same misclassification errors, the 5NN classifier yields a slightly smaller mean number of steps and has slightly smaller errors than the 3NN classifier.


View this table:
[in this window]
[in a new window]
 
Table 4 Misclassification error, the SD, minimum, maximum, mean numbers of sequential steps and the SD in training the 3NN classifier on the five most highly correlated genes with breast-cancer patient's prognosis (van de Vijver et al., 2002)

 

View this table:
[in this window]
[in a new window]
 
Table 5 Misclassification error, the SD, minimum, maximum, mean numbers of sequential steps and the SD in training the 5NN classifier on the five most highly correlated genes with breast-cancer patient's prognosis (van de Vijver et al., 2002)

 
The mean misclassification errors of ~0.22–0.25 reported in Tables 35 by the sequentially trained classifiers are consistent with the errors reported by Braga-Neto and Dougherty (2004).

5.3 Example 3: 3NN classifier on 10 sequentially selected principal components for breast-tumor characterization
We study the 3NN classifier on 10 principal components of the gene expressions for breast-tumor characterization. First, we remove the genes with missing expression levels and keep 4982 genes with no missing values. Second, we order the 4982 genes by the correlation between the tumor class and each individual gene expression based on the entire 84 samples, and choose 445 genes that have a correlation coefficient of absolute value greater than 0.4. These 445 genes are used to train the 3NN classifier in the sequential procedure. We also incorporate feature extraction in this procedure. At each sequential step, we compute the 10 largest principal components based on the current available sample of gene expressions and train a 3NN classifier based on the principal components. Table 6 reports the mean misclassification error, the mean, minimum and maximum number of sequential steps over 1000 runs for each fixed pair of ({varepsilon}, M), where M is the maximum number of sequential steps allowed. It is shown that the misclassification error decreases with the decrease in the threshold {varepsilon} and increase in the maximum number of steps allowed. By specifying a small threshold {varepsilon}, a decrease in the misclassification error may be achieved, but at the cost of running more sequential steps. With larger initial samples, (10,10) versus (5,5), the sequential procedure runs a fewer number of steps to stop and yields smaller error.


View this table:
[in this window]
[in a new window]
 
Table 6 Misclassification error, the SD, minimum, maximum, mean numbers of sequential steps and the SD in training the 3NN classifier based on sequentially selected 10 largest principal components of expressions of the 445 most highly correlated genes with the tissue class in the breast-tumor characterization study (Perou et al., 2000)

 
We have observed different levels of the misclassification errors in the two microarray studies, ~25% in the breast-cancer prognosis study and <10% in the breast-tumor characterization study. This reflects the fact, as we mentioned earlier in Section 2, that the misclassification error depends on many factors, including Bayes error, the number of predictors in the model, and most importantly, the scientific context of the problem.


    6 CONCLUSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 STOPPING RULE FOR...
 3 WEAK CONVERGENCE WITH...
 4 SIMULATION STUDY
 5 APPLICATION TO MICROARRAY...
 6 CONCLUSION
 A APPENDIX
 REFERENCES
 
We have studied a sequential classification procedure and provided a stopping rule. The procedure recruits subjects sequentially, updates the classification rule at each step and stops with certain predetermined confidence level (1 – {alpha}) such that a new subject will have a probability less than a small threshold {varepsilon} > 0 to be misclassified by the classification rule. Simulation studies show that the procedure usually achieves its goal of stopping prior to reaching the maximum allowable sample size, and the gain is over a fairly wide range of {varepsilon}. We have applied the procedure to two microarray datasets. In particular, we have applied it in the context of feature extraction. It has been shown to yield error rates close to those achievable for fixed sample sizes.

This procedure has several advantages over classical sample size calculations: (1) it updates the classification rule sequentially and thus depends on the study subjects rather than relying on distributions of primary measurements from other studies that may differ greatly, as do the classical sample size calculations; (2) it assesses the stopping criteria at each sequential step and thus can substantially reduce cost via early stopping; (3) it ensures the required sample size to achieve significance, whereas classical sample size calculations may lead to smaller than required sample sizes and therefore miss the expected significance due to inaccurate estimation; and (4) it is not restricted to any particular classification rule and therefore applies to any parametric and non-parametric method, such as LDA, KNN, classification and regression trees (CART), etc.

Although we have considered the classification of binary outcomes in this study, our procedure also applies to multiple class outcomes as well, since one can still define the misclassification indicator Qi and its conditional probability {pi} i with multiple classes, and these constitute the only information used to derive the stopping rule.


    A APPENDIX
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 STOPPING RULE FOR...
 3 WEAK CONVERGENCE WITH...
 4 SIMULATION STUDY
 5 APPLICATION TO MICROARRAY...
 6 CONCLUSION
 A APPENDIX
 REFERENCES
 
A.1 Derivation of the stopping rule
Let X i = Q i {pi} i , and F n be the {sigma}-field generated by Y n . Then is a zero-mean martingale (Hall and Heyde, 1980.) Let . By the variance of conditional expectation, one is ready to conclude the convergence in probability

It is thus ready to prove with Chebyshev's Inequality (Knight, 2000, p. 123) that and . By the martingale central limit theorem (CLT) (Shorack, 2000, pp. 529–530), one has the convergence in distribution

Let . By the CLT, N(0,1) in distribution. Thus with probability approaching 1 – {alpha},

For the stopping rule, we want to find a sample size N such that the probability Prob({pi} N ≤ {varepsilon}) ≥ 1 – {alpha}.

For large m, by the weak monotonicity of {pi} n ,

where converges to 0. Hence for large m, in probability, i.e. . Particularly, . Hence by (A.1)

We now show that as N -> {infty}. We first show that in probability for large n. Since and the convergence in distribution

it implies that for large n. Thus in probability. Without loss of generality, we assume a.s. Note that if 0 < a ≤ b ≤ 1/2, then a(1 – a) ≤ b(1 – b). Thus in probability for large n.

Let M 0 be a large integer such that M 0 N and in probability for i ≥ M 0.

holds in probability, where is bounded by a finite number M 0. Hence and by (A.2)

holds. This completes the proof for the theorem.

Notice that the stopping rule requires . However, a series of consecutive perfect classifications from the beginning of the sequential procedure yields . Meanwhile, a long sequence of perfect classifications indicates good performance of classifiers and thus should invoke stopping. This is amended in the stopping rule by setting the number of consecutive perfect classifications N 0 = log({alpha})/[log(1 {varepsilon})] with a conservative assumption that the perfect classifications are independent with least probability (1 – {varepsilon}) for each.


    Acknowledgments
 
W.J.F. was supported by a grant from the National Cancer Institute (CA-90301). E.R.D. was supported by the Translational Genomics Research Institute. R.J.C. and B.M. were supported by a grant from the National Cancer Institute (CA-57030) and by the Texas A&M Center for Environmental and Rural Health via a grant from the National Institute of Environmental Health Sciences (P30-ES09106).

Received on April 20, 2004; revised on July 5, 2004; accepted on July 28, 2004

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 STOPPING RULE FOR...
 3 WEAK CONVERGENCE WITH...
 4 SIMULATION STUDY
 5 APPLICATION TO MICROARRAY...
 6 CONCLUSION
 A APPENDIX
 REFERENCES
 

    Ambroise, C and McLachlan, G.J. (2002) Selection bias in gene extraction on the basis of microarray gene-expression data. Proc. Natl. Acad. Sci., USA, 99, 6562–6566[Abstract/Free Full Text].

    Braga-Neto, U.M. and Dougherty, E.R. (2000) Is cross-validation valid for small-sample microarray classification?. Bioinformatics, 20, 374–380.

    Emerson, S.S. and Fleming, T.R. (1990) Parameter estimation following group sequential hypothesis testing. Biometrika, 77, 875–892[Abstract/Free Full Text].

    Hall, P. and Heyde, C.C. Martingale Limit Theory and Its Application, (1980) , New York Academic Press Inc.

    Knight, K. Mathematical Statistics, (2000) , New York Chapman and Hall/CRC.

    Lai, T.L. (1997) On optimal stopping problems in sequential hypothesis testing. Statist. Sinica, 7, , pp. 33–51.

    Liu, A. and Hall, W.J. (1999) Unbiased estimation following a group sequential test. Biometrika, 86, 71–78[Abstract/Free Full Text].

    Perou, C.M., Sorlie, T., Eisen, M.B., van de Rijn, M., Jeffrey, S.S., Rees, C.A., Pollack, J.R., Ross, D.T., Johnsen, H., Akslen, L.A., et al. (2000) Molecular portraits of Human Breast Tumours. Nature, 406, 747–752[CrossRef][Medline].

    Pinheiro, J.C. and DeMets, D.L. (1997) Estimating and reducing bias in group sequential designs with Gaussian independent increment structure. Biometrika, 84, 831–845[Abstract/Free Full Text].

    Shorack, G.R. Probability for Statisticians, (2000) , New York Springer.

    Todd, S. and Whitehead, J. (1996) Point and interval estimation following a sequential clinical trial. Biometrika, 83, , pp. 453–461[Abstract/Free Full Text].

    van de Vijver, M.J., He, Y.D., van't Veer, L.J., Dai, H., Hart, A.A.M., Voskuil, D.W., Schreiber, G.J., Peterse, J.L., Roberts, C., Marton, M.J., et al. (2002) A gene-expression signature as a predictor of survival in breast cancer. N. Engl. J. Med., 347, 1999–2009[Abstract/Free Full Text].

    van't Veer, L.J., Dai, H., van de Vijver, M.J., He, Y.D., Hart, A.A.M., Mao, M., Peterse, H.L., van der Kooy, K., Marton, M.J., Witteveen, A.T., et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415, 530–536[CrossRef][Medline].

    Whitehead, J. (1986) On the bias of maximum likelihood estimation following a sequential test. Biometrika, 73, 573–581[Abstract/Free Full Text].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BiostatisticsHome page
K. K. Dobbin and R. M. Simon
Sample size planning for developing classifiers using high-dimensional DNA microarray data
Biostat., January 1, 2007; 8(1): 101 - 117.
[Abstract] [Full Text] [PDF]


Home page
NEJMHome page
X. Wang, J. Yu, A. Sreekumar, S. Varambally, R. Shen, D. Giacherio, R. Mehra, J. E. Montie, K. J. Pienta, M. G. Sanda, et al.
Autoantibody Signatures in Prostate Cancer
N. Engl. J. Med., September 22, 2005; 353(12): 1224 - 1235.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/1/63    most recent
bth461v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (5)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Fu, W. J.
Right arrow Articles by Carroll, R. J.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Fu, W. J.
Right arrow Articles by Carroll, R. J.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?