Bioinformatics Advance Access originally published online on March 28, 2007
Bioinformatics 2007 23(11):1363-1370; doi:10.1093/bioinformatics/btm117
Classification based upon gene expression data: bias and precision of error rates
1School of Mathematical Sciences, Queensland University of Technology, Gardens Point, GPO Box 2434, Brisbane, QLD 4001, Australia and 2Queensland Institute of Medical Research, Post Office, Royal Brisbane Hospital, 300 Herston Rd., Herston, QLD 4029, Australia
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Motivation: Gene expression data offer a large number of potentially useful predictors for the classification of tissue samples into classes, such as diseased and non-diseased. The predictive error rate of classifiers can be estimated using methods such as cross-validation. We have investigated issues of interpretation and potential bias in the reporting of error rate estimates. The issues considered here are optimization and selection biases, sampling effects, measures of misclassification rate, baseline error rates, two-level external cross-validation and a novel proposal for detection of bias using the permutation mean.
Results: Reporting an optimal estimated error rate incurs an optimization bias. Downward bias of 3–5% was found in an existing study of classification based on gene expression data and may be endemic in similar studies. Using a simulated non-informative dataset and two example datasets from existing studies, we show how bias can be detected through the use of label permutations and avoided using two-level external cross-validation. Some studies avoid optimization bias by using single-level cross-validation and a test set, but error rates can be more accurately estimated via two-level cross-validation. In addition to estimating the simple overall error rate, we recommend reporting class error rates plus where possible the conditional risk incorporating prior class probabilities and a misclassification cost matrix. We also describe baseline error rates derived from three trivial classifiers which ignore the predictors.
Availability: R code which implements two-level external cross-validation with the PAMR package, experiment code, dataset details and additional figures are freely available for non-commercial use from http://www.maths.qut.edu.au/profiles/wood/permr.jsp
Contact: i.wood@qut.edu.au
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
Recent studies suggest that a number of complex diseases can be accurately diagnosed on the basis of measurements of gene expression levels from microarrays and similar technology. Furthermore, they often suggest lists of the genes likely to be involved in the disease. Methods used include support vector machines (SVMs) (Guyon et al., 2002; McLachlan et al., 2004), nearest shrunken centroids (NSC) (Sharma et al., 2005; Tibshirani et al., 2002), neural networks (Khan et al., 2001), classification trees and mixture models (McLachlan et al., 2004).
These studies typically report estimates of classifier accuracy. However, it is not always clear how these results should be interpreted. There are a number of possible sources of bias in such estimates. In this study, we examine some of these, focusing particularly on optimization bias and sampling effects. We review existing methods for estimating and reporting prediction accuracy. We then give suggestions for improvements and examples of their effectiveness in the large p (number of features or predictors), small n (number of labelled samples) case common in gene expression analyses.
| 2 THEORY |
|---|
2.1 Classification and error rates
Given a dataset of n observations, each comprising the measurement of p predictors and an expert-based classification of each point into one of G classes, we can fit a model and use this to classify these observations and future data of the same type. Methods of this type are known as classifiers or discriminant rules. Following Efron (1983), we let the dataset be $\mathbf{x} = (x_1, \ldots, x_n)$, where each observation $x_i = (t_i, y_i)$ consists of a vector of predictors $t_i$ and a class label $y_i$, drawn independently from an unknown distribution F.
We often wish to estimate the misclassification or error rate we could expect, if the classifier were asked to predict the class of a new set of predictors drawn from the same distribution as the original dataset. Let (t, y) be a new point drawn at random from F and define the zero-one loss function Q for the classifier $r_{\mathbf{x}}$ as follows:
$$Q[y, r_{\mathbf{x}}(t)] = \begin{cases} 0 & \text{if } y = r_{\mathbf{x}}(t), \\ 1 & \text{otherwise.} \end{cases} \qquad (1)$$
We define the true error rate $Err = E_F\, Q[y, r_{\mathbf{x}}(t)]$ to be the expectation of Q over F given x (Efron, 1983). There are many ways to measure and report the misclassification rate of a classifier when applied to a labelled test set of data. The class of each test observation is predicted based on the selected predictors, then compared against the given label. Q is the simplest type of error function, which treats all errors as equally important.
It is generally informative to decompose the misclassification rate into a rate for each (true) class. This is particularly useful when the observed number of data points per class is unequal. For example, if a non-diseased class contains 80% of the data and a diseased class contains 20%, then the trivial classifier which predicts every observation to be non-diseased will have a 20% misclassification rate. Examined more closely, it will have a 0% rate of false positives and a 100% rate of false negatives. Observational studies will often produce this type of unbalanced data since some classes of response will be rarer than others in the population of interest.
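The 80%/20% example above can be reproduced with a minimal sketch. The function name `error_rates` and the label strings are illustrative choices, not part of the original study.

```python
from collections import Counter

def error_rates(true_labels, predicted_labels):
    """Return (overall error rate, per-class error rates, average of class error rates)."""
    totals = Counter(true_labels)
    # Count misclassified observations by their TRUE class.
    wrong = Counter(t for t, p in zip(true_labels, predicted_labels) if t != p)
    overall = sum(wrong.values()) / len(true_labels)
    per_class = {g: wrong[g] / totals[g] for g in totals}
    average = sum(per_class.values()) / len(per_class)
    return overall, per_class, average

# 80 non-diseased and 20 diseased observations; the trivial rule predicts
# "non-diseased" for everyone.
truth = ["non-diseased"] * 80 + ["diseased"] * 20
predicted = ["non-diseased"] * 100
overall, per_class, average = error_rates(truth, predicted)
```

The overall rate hides the fact that every diseased observation is misclassified, whereas the per-class rates and their average make it visible.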
The cost of errors in misclassifying observations may also vary from (true) class to class, and we may wish to report an overall estimated error rate or risk which variously weights the errors on each class. This can be done retrospectively if the error rate on each class is reported. In a more sophisticated version, we may have a matrix of misclassification costs $C = (c_{gh})$, where $c_{gh}$ is the cost of misclassifying a data point of class g into class h. This can be applied retrospectively if a matrix of misclassification rates is reported.
For two-class problems, Wessels et al. (2005) suggest reporting the average of the sensitivity and the specificity so that the aforementioned trivial classifier does not appear too successful. Here, we report estimates of the simple overall misclassification rate, error rates for each class and the average of the class error rates. The latter can also be motivated by decision-theoretic considerations regarding the construction of a Bayes optimal rule (McLachlan et al., 2004) (p. 188). Let $\pi_1, \ldots, \pi_G$ be the true proportions of each class in the population of interest. These will often be known with high precision, but sometimes must be estimated from the data. Let $\hat{p}_g$ be the observed proportion of responses in class g. For simplicity, assume that all misclassifications of a given class have equal cost, i.e. $c_{gh} = c_g$ for all $h \neq g$, and that $c_{gg} = 0$ (see McLachlan (1992) p. 8 for more generality). If the error rate conditional upon the true class being g is $Err_g$, then the expected cost per observation is $\sum_{g=1}^{G} \pi_g c_g Err_g$. As argued by McLachlan et al. (2004), misclassification cost is often nearly inversely proportional to relative frequency, so $\pi_g c_g$ may be near constant for all g. In this case, the expected cost will be approximately a multiple of the average of the class error rates $E_a$, so an estimate of this is a useful summary measure.
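The link between expected cost and the average class error rate can be checked with a small sketch. It assumes the expected cost per observation is the sum over classes of prior probability times cost times class error rate, as the paragraph above describes. All numbers below are hypothetical.

```python
def expected_cost(priors, costs, class_errors):
    """Expected cost per observation: sum over classes of prior * cost * class error rate."""
    return sum(priors[g] * costs[g] * class_errors[g] for g in priors)

def average_class_error(class_errors):
    """Unweighted mean of the class-conditional error rates (E_a in the text)."""
    return sum(class_errors.values()) / len(class_errors)

# Hypothetical values: costs are inversely proportional to the priors,
# so prior * cost is the same constant (0.8) for both classes.
priors = {"diseased": 0.2, "non-diseased": 0.8}
costs = {"diseased": 4.0, "non-diseased": 1.0}
class_errors = {"diseased": 0.30, "non-diseased": 0.10}

risk = expected_cost(priors, costs, class_errors)
e_a = average_class_error(class_errors)
# With prior * cost constant, risk equals that constant times G times e_a.
constant = priors["diseased"] * costs["diseased"]
```

When the product of prior and cost is constant across classes, ranking classifiers by expected cost is the same as ranking them by the average of the class error rates.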
2.2 Cross-validation
Some of the most popular methods for estimating error rates are cross-validation (Breiman et al., 1984; Stone, 1974), the bootstrap (Efron, 1983), the holdout method (McLachlan, 1992) (p. 341) and the 0.632 estimator (Efron, 1983; Efron and Tibshirani, 1997). All of these rely on training the classifier on one subset of the data and testing it on a separate subset. Here, we consider only cross-validation, since it is popular and effective (Molinaro et al., 2005), uses the data efficiently and is almost unbiased when used correctly (McLachlan, 1992). However, it does have significant variance when used with small sample sizes (Braga-Neto and Dougherty, 2004) and can be subject to bias if used naively, as we show in the following sections.
Cross-validation can be formalized as follows. The dataset x will be split into K disjoint folds
, of approximately equal sizes nk. One then fits the classifier and tests it K times, such that in iteration k, the classifier is fitted based on the data in the training folds
and evaluated on the test fold
. The cross-validated error rate estimates
and
are obtained by averaging over the performance on the K test folds as follows:
|
| (2) |
|
| (3) |
The simplest method of forming the folds is to split the randomly ordered data into K pieces with the largest fold containing at most one element more than the smallest. The bias due to the uneven distribution of classes within folds can be reduced by attempting to balance or stratify the folds, so that the empirical distribution of classes in each fold is similar to that of the whole dataset (Breiman et al., 1984).
In standard K-fold cross-validation, folds of size $\lfloor n/K \rfloor$ are created by sampling from the data without replacement and each of the remaining n mod K data points is assigned randomly to a different fold. In stratified or balanced cross-validation (Breiman et al., 1984) (p. 246), the data are first ordered by the response value or class. This list is broken up into $\lfloor n/K \rfloor$ bins each containing K points with many similar response values. Any remaining points at the end of the list are assigned to an additional bin. A fold is formed by sampling one point without replacement from each of the bins. Except for the ordering of the data, this is equivalent to standard cross-validation.
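The balanced fold construction described above can be sketched as follows. The function name `balanced_folds` is illustrative, and the example class sizes (33 and 27) are chosen to match the worked example in Section 2.5. This is one reading of the procedure, not the authors' released R code.

```python
import random

def balanced_folds(labels, k, rng):
    """Assign observation indices to k folds, stratified by class label.

    The indices are ordered by class, cut into consecutive bins of k points
    (the last bin may be smaller), and each bin sends one point to each of
    a set of distinct, randomly chosen folds.
    """
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    folds = [[] for _ in range(k)]
    for start in range(0, len(order), k):
        bin_indices = order[start:start + k]
        # Distinct fold numbers, so no fold receives two points from one bin.
        fold_ids = rng.sample(range(k), len(bin_indices))
        for index, fold in zip(bin_indices, fold_ids):
            folds[fold].append(index)
    return folds

labels = [1] * 33 + [2] * 27
folds = balanced_folds(labels, 10, random.Random(0))
class1_per_fold = sorted(sum(labels[i] == 1 for i in fold) for fold in folds)
```

Because only one bin mixes the two classes, three folds end up with four observations of class 1 and the other seven with three, whatever the random seed.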
For a classifier
and dataset x, we define the bias in the estimation of the error rate using cross-validation to be:
. This is typically intractable, but some contributing components in
can be described and efforts made to reduce
.
The number of folds K can be any integer between 2 and n, the number of data points. The use of cross-validation to estimate error introduces a small positive component into B since each training set has mean size $n(K-1)/K$ rather than the n used to construct the final classifier. As K is reduced, the mean training set size shrinks, and this positive bias component grows. The training sets also become more different from each other, which tends to reduce the variance of the error estimate. It is common practice (e.g. McLachlan et al., 2004 p. 214) to compromise between minimizing the bias and variance by using K = 5 or 10 folds.
2.3 Selection bias
Cross-validation can be used to estimate model or classifier parameters as well as perform model and variable selection. However, combining these steps with error estimation for the final classifier can lead to bias unless one is particularly careful. External cross-validation (Ambroise and McLachlan, 2002) (1-external cv) leaves out a single test fold of the data, selects the model, variables and parameters based on the remaining training folds and then evaluates the misclassification rate on the test fold. When averaged over K folds, this should provide a nearly unbiased estimate of the true error rate of the final classifier.
Selection bias (McLachlan et al., 2004) (p. 218) can occur when cross-validation is used internally. In this case, all the available data are used to select a subset of the available predictors. This subset of predictors is then fixed and the error rate is estimated by cross-validation.
2.4 Optimization bias
When largely following the above advice, it is still easy to allow a subtler bias to emerge, which we call optimization bias. This can occur if cross-validation or other methods are used to estimate the error rate for multiple values of a set of free parameters, and then the set of parameter values with the lowest (optimal) estimated error rate is chosen for use in the final classifier. The free parameters can be involved in any aspect of model selection, variable selection or model fitting and include parameters as general as the index of a model or a variable subset. This method is reasonable for choosing the final classifier, but provides a downwardly biased estimate of its error rate. Varma and Simon (2006) have independently investigated this bias and call it parameter selection bias.
As an example of how one might incur optimization bias, assume we have a procedure which, given a fixed $b > 0$, can select b predictors and fit a classifier based on the available data. The error rate for this classifier can be estimated using cross-validation. However, we may then decide to also choose an optimal value $b^*$ from a set of values $\{b_1, \ldots, b_R\}$ via $b^* = \arg\min_{b_r} \widehat{Err}(b_r)$.
For each r, the error estimate $\widehat{Err}(b_r)$ will be nearly unbiased, but the estimate $\widehat{Err}(b^*)$ is now slightly biased. This happens because the same data are used both to estimate the error rate and to select a parameter, namely b. For similar examples involving the use of SVMs with recursive feature elimination on gene expression data, see Zhu et al. (2007).
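A small simulation illustrates why the minimum of several nearly unbiased estimates is biased downwards. It is a deliberately idealized sketch: every candidate parameter value is assumed to have the same true error rate of 0.5, and the estimates are drawn independently, whereas real cross-validation estimates for neighbouring parameter values are correlated. The function name and default settings are hypothetical.

```python
import random

def simulate_optimization_bias(n=60, n_params=20, true_error=0.5,
                               n_reps=2000, seed=1):
    """Compare the mean of one error estimate with the mean of the minimum
    over n_params estimates, when every estimate has the same true error."""
    rng = random.Random(seed)
    single_total, minimum_total = 0.0, 0.0
    for _ in range(n_reps):
        # Each estimate is the fraction of n test points misclassified.
        estimates = [
            sum(rng.random() < true_error for _ in range(n)) / n
            for _ in range(n_params)
        ]
        single_total += estimates[0]      # one fixed parameter value
        minimum_total += min(estimates)   # the "optimal" parameter value
    return single_total / n_reps, minimum_total / n_reps

mean_single, mean_min = simulate_optimization_bias()
```

The estimate for a fixed parameter value averages close to the true 0.5, while the estimate reported for the selected value averages well below it, even though no candidate is truly better than any other.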
Stone (1974) (p. 115) described how to carry out cross-validatory assessment of cross-validatory choice in his seminal paper. While describing only leave-one-out (LOO) cross-validation, he made clear that while one level of cross-validation is satisfactory to optimize the set of free parameters for the final classifier, a separate two-level cross-validation is needed to estimate its error rate. Failure to do this leads to optimization bias.
Two-level external cross-validation (2-external cv) can be used to avoid both selection and optimization bias in these circumstances. By two-level external cross-validation, we mean the following. At the top level, one of K1 folds of data is left out for the purpose of assessing the error rate of the finished classifier. At the lower level, K2-fold cross-validation is then performed on the remaining data to select the optimal value of any free parameters. When all parameters are selected, the classifier can be tested on the left-out fold at the top level. By repeating this for all K1 folds at the top level, one can construct a cross-validatory assessment of the cross-validatory choice. The same two-level procedure can be used with any method for estimating the error rate, whenever there are free parameters to be chosen and an overall assessment of the error rate is desired. If using cross-validation, it is easiest to choose K2 = K1 − 1, so that the same fold structure can be used for both levels. The use of two levels of cross-validation to avoid bias is also discussed by Dudoit and Fridlyand (2003), Statnikov et al. (2005) and Wessels et al. (2005). If instead the whole model selection, variable selection and parameter fitting process is performed without cross-validation, then only one level of external cross-validation is needed to estimate error rates of prediction.
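The two-level procedure can be sketched in a few lines. The sketch below is not the authors' PAMR-based R code. It swaps in a simple nearest-centroid rule whose single free parameter b is the number of predictors kept, and it reuses the top-level folds at the lower level so that the lower level has one fold fewer. The names `fit`, `external_cv` and `two_level_external_cv` and the demonstration data are all hypothetical.

```python
import random

def fit(train, b):
    """Keep the b predictors whose class means differ most, then return a
    nearest-centroid prediction function restricted to those predictors."""
    classes = sorted({y for _, y in train})
    p = len(train[0][0])
    means = {}
    for g in classes:
        rows = [t for t, y in train if y == g]
        means[g] = [sum(row[j] for row in rows) / len(rows) for j in range(p)]
    spread = [max(means[g][j] for g in classes) - min(means[g][j] for g in classes)
              for j in range(p)]
    chosen = sorted(range(p), key=lambda j: -spread[j])[:b]

    def predict(t):
        return min(classes,
                   key=lambda g: sum((t[j] - means[g][j]) ** 2 for j in chosen))
    return predict

def error_rate(predict, test):
    return sum(predict(t) != y for t, y in test) / len(test)

def external_cv(folds, b):
    """One-level external cv: variable selection and fitting are repeated
    on the training folds only, for a FIXED value of b."""
    errors = []
    for k, test in enumerate(folds):
        train = [obs for j, fold in enumerate(folds) if j != k for obs in fold]
        errors.append(error_rate(fit(train, b), test))
    return sum(errors) / len(errors)

def two_level_external_cv(folds, candidates):
    """Two-level external cv: b is chosen by cross-validation inside the
    training folds, and the chosen classifier is tested on the left-out fold."""
    errors = []
    for k, test in enumerate(folds):
        inner_folds = [fold for j, fold in enumerate(folds) if j != k]
        best_b = min(candidates, key=lambda b: external_cv(inner_folds, b))
        train = [obs for fold in inner_folds for obs in fold]
        errors.append(error_rate(fit(train, best_b), test))
    return sum(errors) / len(errors)

# Non-informative demonstration data: 40 observations, 50 noise predictors,
# labels alternating so that every fold holds 4 observations of each class.
rng = random.Random(0)
data = [([rng.gauss(0, 1) for _ in range(50)], i % 2) for i in range(40)]
folds = [data[k::5] for k in range(5)]
candidates = [1, 5, 10, 20]
optimized_one_level = min(external_cv(folds, b) for b in candidates)
two_level = two_level_external_cv(folds, candidates)
```

Here `optimized_one_level` is the kind of figure that carries optimization bias, because the same folds both choose b and score it. `two_level` never scores a classifier on data that influenced the choice of b.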
Optimization bias will increase in magnitude with the variability of the error estimate and with the number of parameter values considered, especially those whose true error rate is near the minimum value (within the range of variability). In the analysis of gene expression, SNP (single nucleotide polymorphism) chip and mass spectroscopy data, the number of available predictors is large, so careful variable selection is needed to avoid optimization bias.
Sharma et al. (2005) built a system to classify patients into those with or without breast cancer based on gene expression levels in blood. They considered 1368 genes and used the NSC method of Tibshirani et al. (2002) to both select a subset of genes for classification and for the classification itself. Sharma et al. (2005) used 10-fold cross-validation to estimate a prediction error rate of 18% based on 102 labelled samples. However, they did this for multiple gene subset sizes controlled through a threshold parameter. The optimal value of the threshold was chosen to be that producing the lowest cross-validation error rate estimate. They reported the error rate estimate for this optimal choice and constructed the final classifier using the whole dataset.
Based on the discussion above, it seems likely that these authors have incurred optimization bias. They could have avoided this bias by choosing the threshold using cross-validation based on a subset of the data as some of the same authors did in Tibshirani et al. (2003). This is the holdout method of assessing cross-validatory choice, which is effectively one fold of 2-external cv with a small K1. One could make even better use of the data by completing a two-level external cross-validation. The resulting estimates would be more accurate since they would be based on K1 holdout estimates, with each observation being used once in a test set.
A number of authors (e.g. McLachlan et al., 2004 (p. 240), Dabney, 2005) estimate and report error rates for various numbers of genes b where for each value of b, an optimal subset of b predictors is selected. The natural response to a table of estimated error rates for various values of b is to choose $b^*$ with the minimal estimated error rate and select $b^*$ variables in the final classifier. In the absence of other information, one is also likely to take the reported error rate estimate for this number of genes to be indicative of that final classifier's error rate. As discussed previously, this estimate will be subject to optimization bias due to the process of choosing the optimal value for b. The study of McLachlan et al. (2004) was repeated using two-level cross-validation to avoid optimization bias, with the results reported in Zhu et al. (2007).
Varma and Simon (2006) investigated the same bias by applying the NSC and SVM to simulated datasets containing two classes of exactly 20 points each. Using 1-external cv, they estimated a bias of –0.122 with K = 10 and an NSC and a bias of –0.083 with K = n and an SVM. Using 2-external cv, they estimated biases of 0.042 and 0.033 for the NSC and SVM, respectively, which they attributed to cross-validation leaving some data out.
2.5 Sampling effects on error rate estimation
If a dataset of size n consists of a sample of two classes, each occurring with probability $\pi_g$, then in most cases, the number of observations from each class will be different from its expected value, i.e. $n_g \neq n\pi_g$. Assuming independence, the number of observations n1 in class 1 can be described by the binomial distribution, i.e. $n_1 \sim \mathrm{Bin}(n, \pi_1)$, and the size of the second class is simply $n_2 = n - n_1$. Then $E(n_1) = n\pi_1$ and $\mathrm{Var}(n_1) = n\pi_1(1-\pi_1)$. For larger samples and $\pi_1$ not too close to 0 or 1, the binomial distribution $\mathrm{Bin}(n, \pi_1)$ can be approximated by a normal distribution with mean $n\pi_1$ and variance $n\pi_1(1-\pi_1)$.
Let m1 be the absolute difference in class size 1 from the expected size, i.e. $m_1 = |n_1 - n\pi_1|$, and similarly $m_2 = |n_2 - n\pi_2|$. Using the normal approximation to the binomial distribution, the distributions for m1 and m2 are both half-normal. For m1, this is the renormalized right-half of a normal distribution with mean 0 and variance $\sigma^2 = n\pi_1(1-\pi_1)$. As a half-normal distribution, it has mean $\sigma\sqrt{2/\pi}$ and variance $\sigma^2(1 - 2/\pi)$, where the unsubscripted $\pi$ is the mathematical constant (Johnson et al., 1994).
As an example, if we have a sample size of 60 with two equiprobable classes ($\pi_1 = \pi_2 = 0.5$), the mean absolute difference in class size from the expected 30 is 3.1, with a variance of 5.45. Thus, class sizes of 33 and 27 would be typical for a random sample from this population and on average one class is 22% larger than the other. Through balanced cross-validation, one might expect the following numbers of each class in each fold (n1, n2): (4,2), (4,2), (4,2), (3,3), (3,3), (3,3), (3,3), (3,3), (3,3), (3,3). Hence 3 out of 10 folds would have a majority from the class which was larger in the sample; the other 7 folds would have equal numbers of each class. These types of sampling effects have an impact on the estimation of classification error rates and their interpretation.
In the above example, if the available predictors were independent of each other and of the response and the method of classification ignored the predictors, it would be likely to use the apparently different class probabilities, as estimated from the training data. Each fold of 10-fold cross-validation would contain on average 3.3 of one class and 2.7 of the other. In the combined training folds, one would expect 29.7 of the larger class and 24.3 of the smaller. Even with n-fold (or LOO) cross-validation, the excluded data point will have the same class as the larger class in the training folds in 33/60 = 55% of cases. A classifier that assigns every point to the larger class in the training set can thus be expected to show an error rate of 45% under this type of cross-validation. However, we know that it would achieve an expected error rate of 50% if applied to new data from the same population or underlying distribution.
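The figures in this example can be reproduced with a short calculation. The sketch assumes the half-normal mean and variance take their standard forms, $\sigma\sqrt{2/\pi}$ and $\sigma^2(1 - 2/\pi)$ with $\sigma^2 = n\pi_1(1-\pi_1)$. The function names are illustrative.

```python
import math

def class_size_deviation(n, pi1):
    """Mean and variance of |n1 - n*pi1| under the normal approximation
    to Binomial(n, pi1), i.e. a half-normal distribution."""
    sigma2 = n * pi1 * (1 - pi1)
    mean = math.sqrt(2 * sigma2 / math.pi)
    variance = sigma2 * (1 - 2 / math.pi)
    return mean, variance

def loo_error_majority_rule(labels):
    """Leave-one-out error of the rule that predicts the larger training class."""
    errors = 0
    for i, held_out in enumerate(labels):
        training = labels[:i] + labels[i + 1:]
        majority = max(set(training), key=training.count)
        errors += majority != held_out
    return errors / len(labels)

mean_dev, var_dev = class_size_deviation(60, 0.5)
loo_error = loo_error_majority_rule([1] * 33 + [2] * 27)
```

With classes of 33 and 27, the majority rule is wrong only when a point from the smaller class is held out, which gives the 45% leave-one-out error rate quoted in the text.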
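Efron and Tibshirani (1997) (p. 552) define the no-information error rate $\gamma$ to be the error rate if the true response or classification is independent of all the predictors. They estimate $\gamma$ by
$$\hat{\gamma} = \sum_{g=1}^{G} \hat{p}_g (1 - \hat{q}_g) \qquad (4)$$
where $\hat{p}_g$ is the observed proportion of responses in class g and $\hat{q}_g$ is the observed proportion of predictions in class g.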
0.45. Consider the following three trivial classifiers. These are simple to implement and use no predictor information. They are presented here in order of expected increasing error rate. The first trivial classifier can be seen as providing a baseline for classifier error rates.
Trivial classifier 1 (TC1): classify all observations as belonging to the largest class in the sample. Without loss of generality let this be class 1, so
,
and
.
Trivial classifier 2 (TC2): classify observations randomly with class probabilities equal to the sample proportions, so
. Then
and
.
Trivial classifier 3 (TC3): classify observations randomly with equal probability for each class, so
. If we use qg in equation (4) instead of
, we obtain:
.
Using
and the Cauchy–Schwarz inequality, it can be shown that
. If
, then
and if
,
. The true averages of the class error rates are the same for all three trivial classifiers and will equal the estimated average for trivial classifier 3, i.e.
.
Efron and Tibshirani (1997) (p. 552) studied classification based on uninformative data with responses of 0 or 1 with probability 0.5. They erroneously claim that on this type of data, the leave-one-out cross-validation estimate $\widehat{Err}$ for the nearest neighbour classifier would have the correct expectation of 0.5. In fact, as described above, it will generally be slightly lower for this and most other classifiers since sampling variation will usually produce one class larger than the other. Classifiers will tend to exploit this imbalance and cross-validation estimates of the error rate derived from the same dataset will be unable to correct for its effect. By default, the NSC takes the prior class probabilities to be the sample class proportions. If no genes are selected, a prior term causes the NSC to classify all observations into the largest sample class, thereby becoming TC1.
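The baseline rates for the three trivial classifiers can be computed from class counts alone. The sketch below assumes the no-information estimate has the form of the Efron and Tibshirani (1997) estimator, the sum over classes of (observed response proportion) times (1 minus prediction proportion). It also assumes the prediction proportions for TC1, TC2 and TC3 are, respectively, an indicator of the largest class, the sample proportions, and 1/G. The class counts are those given in Section 3 of this article.

```python
def no_information_rate(p_hat, q):
    """Sum over classes of response proportion * (1 - prediction proportion)."""
    return sum(p * (1 - q_g) for p, q_g in zip(p_hat, q))

def trivial_baselines(class_counts):
    """Baseline rates for TC1 (largest class), TC2 (sample proportions)
    and TC3 (equal probabilities), in that order."""
    n = sum(class_counts)
    G = len(class_counts)
    p_hat = [count / n for count in class_counts]
    largest = p_hat.index(max(p_hat))
    q_tc1 = [1.0 if g == largest else 0.0 for g in range(G)]
    q_tc2 = p_hat
    q_tc3 = [1.0 / G] * G
    return tuple(no_information_rate(p_hat, q) for q in (q_tc1, q_tc2, q_tc3))

datasets = {
    "simulated": [53, 47],        # classes 1 and 2
    "Khan": [18, 25, 11, 29],     # N, R, B, E
    "Sharma": [24, 36],           # BC, NC
}
baselines = {name: trivial_baselines(counts) for name, counts in datasets.items()}
```

Under these assumptions the TC1 values agree with the baselines quoted later in the article (0.47, 0.651 and 0.4), and the three rates are ordered TC1 ≤ TC2 ≤ TC3 on each dataset.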
2.6 Permutation assessment
The permutation or randomization test is an exact test which can be used to determine a significance level for the acceptance or rejection of a null hypothesis (Good, 1994). The statistic of interest here is the estimated error rate of the classifier. The null hypothesis is that the value of this statistic does not depend upon the given set of labels, i.e. there is no meaningful relationship between the predictors and the given labels. This implies that the classifier would be expected to yield a similar estimated error rate even under a random permutation of the labels.
We can obtain a reasonable approximation by taking a subset of the possible permutations chosen via a uniform distribution over the $n!$ possible relabellings. The p-value for this test is then given by the fraction of the statistics obtained under permutation which are more extreme than the value obtained using the original labelling.
The mean of a statistic under a large number of permutations is also worth consideration. If the permutations successfully remove the relationship between predictors and response, and the trivial classifiers dominate as expected, then we can expect the permutation mean of $\widehat{E}_a$ to be close to $1 - 1/G$. If it is not, then the method used to estimate error is likely to be biased.
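The permutation assessment can be sketched generically. In the setting of this article, the statistic would be a cross-validated error estimate, with the classifier refitted under every permuted labelling. To keep the sketch short and self-contained, the stand-in statistic below is the error rate of a fixed set of predictions. The names `permutation_assessment` and `error_against` are hypothetical.

```python
import random

def permutation_assessment(labels, statistic, n_permutations, rng):
    """Return (observed statistic, p-value, permutation mean).

    The statistic is an error rate, so "more extreme" means strictly lower
    than the value obtained with the original labelling.
    """
    observed = statistic(labels)
    permuted = []
    for _ in range(n_permutations):
        shuffled = list(labels)
        rng.shuffle(shuffled)          # uniform random relabelling
        permuted.append(statistic(shuffled))
    p_value = sum(value < observed for value in permuted) / n_permutations
    permutation_mean = sum(permuted) / n_permutations
    return observed, p_value, permutation_mean

# Stand-in statistic: error rate of fixed predictions against a labelling.
predictions = [0, 1] * 20

def error_against(labels):
    return sum(p != y for p, y in zip(predictions, labels)) / len(labels)

labels = list(predictions)  # perfectly informative labelling
observed, p_value, permutation_mean = permutation_assessment(
    labels, error_against, 1000, random.Random(0))
```

For this balanced two-class stand-in, the permutation mean settles near 0.5. A permutation mean that falls clearly below the expected baseline is the warning sign of bias that this section proposes to look for.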
| 3 METHODS |
|---|
We performed computer experiments to test for bias in the estimation of error rates using 1-external and 2-external cv. In addition, the experiments were designed to test for differences between
The 1-external and 2-external cv versions of the NSC classifier were applied to a simulated non-informative dataset and the Khan (Khan et al., 2001) and Sharma (Sharma et al., 2005) datasets. For each dataset, we estimated the simple, average and class-conditional error rates for the NSC classifier using 1-external and 2-external cv. We also recorded the number of genes selected using the optimal threshold value under 1-external cv and the average number of genes selected across the K folds under 2-external cv. Balanced cross-validation randomly allocates data values to folds, so we repeated each procedure 1000 times to reduce variability and estimated the mean and standard deviation of each of the above estimates across these repetitions. Standard errors were calculated across folds, then averaged over the repetitions.
We also carried out a series of permutation tests for each dataset. We permuted the data labels (responses) 1000 times and refit the NSC classifier under 1-external and 2-external cv. Each time we recorded the simple, average and class error rates and the number of genes selected. We also calculated the mean of each estimate over the permutations. For the non-informative dataset, permutation would be expected to make little difference to any of the estimates.
Since the true distribution is available here, we were also able to estimate the optimization bias with the NSC on sample sizes of 100 by simulating 1000 additional samples of this size and performing 1- and 2-external cv on each.
3.1 Simulated data
The non-informative simulated dataset comprised 100 data points
, each intended to represent an individual drawn from a population of interest. Each individual was given 2000 real-valued predictor measurements
, with each
and a binary response yi with
, i.e.
. Hence each predictor and response value is drawn independently of all others and any relationships between predictors and response are purely due to chance. The dataset was generated randomly once and then used throughout the experiments. The number of observations in classes 1 and 2 were 53 and 47, respectively.
3.2 Khan data
Khan et al. (2001) described a gene expression dataset of 83 observations, each from a child who was determined by clinicians to have a type of small round blue cell tumour (SRBCT). These included the following four classes: neuroblastoma (N), rhabdomyosarcoma (R), Burkitt lymphoma (B; a subset of the non-Hodgkin lymphomas) and the Ewings sarcoma family of tumours (E). The numbers in each class are: 18 N, 25 R, 11 B, 29 E.
For each tissue sample, the levels of gene expression were estimated using a cDNA microarray. A total of 2308 genes and ESTs passed the intensity requirements imposed and the values were normalized (Khan et al., 2001). The full dataset is publicly available at http://home.ccr.cancer.gov/oncology/oncogenomics/. We ignored five additional observations which were not determined to be SRBCTs.
3.3 Sharma data
Sharma et al. (2005) described and made public a dataset containing the expression levels (mRNA) of 1368 genes from 60 blood samples taken from 56 women. Some of the blood samples were analyzed more than once in separate batches giving a total of 102 labelled blood samples. Each blood sample was labelled by clinicians, with 24 labelled as having breast cancer (BC) and 36 labelled as not having it (NC).
The supplementary section of Sharma et al. (2005) supplies both the raw data from macroarray measurements and batch-adjusted data, obtained using ANOVA. The authors found a clear batch effect and removed it for their analysis, so we also used only the batch-adjusted data. To avoid consideration of the method of aggregation, we chose to use just one measurement per blood sample and ignore the others. Hence the Sharma data set used here is a randomly selected subset of 60 observations, rather than the whole 102. The subset used here is publicly available on the website.
Table 1 lists our calculations of the estimated no-information error rates $\hat{\gamma}$, and the true error rates Err for the three types of trivial classifiers described in Section 2.5 on the three datasets. The missing entries for true error rates could be filled in if one knew the prior probabilities of class membership $\pi_g$ for the populations sampled by Khan et al. (2001) and Sharma et al. (2005). These may be available, but are beyond the scope of this article.
| 4 RESULTS AND DISCUSSION |
|---|
4.1 Results on simulated data
The results on the simulated dataset are detailed in Table 2. They show that 1-external cv yielded mean (standard error) estimates of
As discussed in Section 2.5, the estimate
This example also illustrates the value of estimating $\widehat{E}_a$ in addition to $\widehat{Err}$. $\widehat{E}_a$ was unaffected by the difference between the sampling and true proportions and so offers a valuable diagnostic tool for determining whether or not a given method of estimating error is biased when the true proportions are unknown.
The NSC returned large p-values in the range 0.34–0.66 for $\widehat{Err}$ and $\widehat{E}_a$ with both 1-external and 2-external cv under the permutation test on this dataset. This was expected since the given labelling was assigned randomly and uninformatively. The average values of $\widehat{Err}$ and $\widehat{E}_a$ with the given labelling were slightly different to the permutation mean values, but fell well inside a standard deviation.
Under 2-external cv, the mean estimate of $\widehat{Err}$ with permuted labels was 0.487, which is slightly above the 0.47 $\hat{\gamma}$ baseline. The mean estimate of $\widehat{Err}$ using 1-external cv under permuted labellings was 0.420. The 2-external cv estimate $\widehat{E}_a$ was 0.503, which is close to the expected 0.5, while 1-external cv produced an $\widehat{E}_a$ of 0.435. Hence the use of 1-external cv seems to incur an optimization bias of around –0.07 in both $\widehat{Err}$ and $\widehat{E}_a$.
Based on the 1000 additional datasets of size 100, the mean (standard error over the 1000) results for $\widehat{Err}$ were 0.410 (0.0014) and 0.476 (0.0018) for 1- and 2-external cv, respectively. For $\widehat{E}_a$, the respective values were 0.439 (0.0016) and 0.503 (0.0017). Hence, for this true distribution and sample sizes of 100, we estimate the optimization bias in $\widehat{Err}$ and $\widehat{E}_a$ under 1-external cv to be –0.06.
4.2 Results on Khan and Sharma data
The results on the Khan and Sharma datasets are detailed in Tables 3 and 4. On the Khan dataset, 1-external cv produced mean (standard error) results of 0.00026 (0.00027) for $\widehat{Err}$ and 0.00023 (0.00023) for $\widehat{E}_a$. The mean (standard error) estimates for $\widehat{Err}$ and $\widehat{E}_a$ from 2-external cv were 0.00717 (0.0069) and 0.00563 (0.0052), respectively. Tibshirani et al. (2002) reported an estimated error rate of zero for the NSC using a separate test set, but the 2-external cv estimate given here is expected to be more accurate. On this dataset, optimization bias reduced both $\widehat{Err}$ and $\widehat{E}_a$ by an order of magnitude under 1-external cv.
On the Sharma dataset the mean (standard error) estimates of
For both the Khan and Sharma datasets, the permutation tests rejected the null hypothesis with p-values
for
and
estimated using 1-external and 2-external cv. This is unsurprising, and supports an association between the predictors and the given labels.
As with the simulated data set, it is more interesting to consider the permutation mean of $\widehat{Err}$ and $\widehat{E}_a$. The effects of optimization bias are illustrated for the Sharma dataset in Figure 1 through the different distributions of $\widehat{Err}$ and $\widehat{E}_a$ as estimated by 1-external and 2-external cv under 1000 permutations of the labels. The baseline error rates $\hat{\gamma}$ for trivial classifier 1 are 0.651 and 0.4 for the Khan and Sharma datasets, respectively. These values are approximately midway between the permutation means of $\widehat{Err}$ using 1-external and 2-external cv on these datasets.
The permutation means of
Mean class error rates under permutation were very high for the smaller observed classes (B, N and R on the Khan dataset and BC on the Sharma dataset), which indicates that the NSC may have frequently become trivial classifier 1. By checking the raw results, we found that under 1-external cv the NSC became in effect TC1 in 38% of cases on the Khan dataset and in 60% of cases on the Sharma dataset. Under 2-external cv there is another layer of diversity, but class error rates matching TC1 were seen in 16% of cases on the Khan dataset and in 27% of cases on the Sharma dataset. This shows that trivial classifiers are relevant in deriving a baseline error rate.
| 5 CONCLUSIONS |
|---|
We have quantified the bias and precision of error rates in classification based upon gene expression data from simulations and using real datasets, and have shown how common methods of estimation can lead to bias. We have proposed a novel permutation approach to detect bias and shown the effectiveness of two-level external cross-validation in reducing it.
We urge all investigators performing classification tasks to calculate and examine the permutation mean of the average of the estimated class error rates
. If this is noticeably below the expected
, the procedure may be incurring selection or optimization bias. These can be avoided by using two-level external cross-validation.
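The idea behind two-level external cross-validation can be sketched generically as nested cross-validation. The code below is a minimal sketch under stated assumptions, not the protocol used in this study: fold counts, stratification and repetitions are not reproduced, and `fit` is a hypothetical callable that takes training predictors, training labels and one tuning value and returns a prediction function. Any gene selection would have to happen inside `fit` so that held-out samples never influence it.

```python
import random


def k_folds(n, k, rng):
    """Split indices 0..n-1 into k shuffled folds of near-equal size."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cv_error(X, y, fit, param, k, rng):
    """Single-level cross-validated error for one fixed tuning value."""
    wrong = 0
    for fold in k_folds(len(y), k, rng):
        held = set(fold)
        train = [i for i in range(len(y)) if i not in held]
        predict = fit([X[i] for i in train], [y[i] for i in train], param)
        wrong += sum(predict(X[i]) != y[i] for i in fold)
    return wrong / len(y)


def two_level_cv_error(X, y, fit, params, k_outer=5, k_inner=5, seed=0):
    """Nested cross-validation: tune on the inner level, assess on the outer."""
    rng = random.Random(seed)
    wrong = 0
    for fold in k_folds(len(y), k_outer, rng):
        held = set(fold)
        train = [i for i in range(len(y)) if i not in held]
        X_tr = [X[i] for i in train]
        y_tr = [y[i] for i in train]
        # Inner level: pick the tuning value using the outer training part only.
        best = min(params, key=lambda p: cv_error(X_tr, y_tr, fit, p, k_inner, rng))
        # Outer level: refit with the chosen value and score the untouched fold.
        predict = fit(X_tr, y_tr, best)
        wrong += sum(predict(X[i]) != y[i] for i in fold)
    return wrong / len(y)
```

Because the tuning value is chosen afresh inside every outer training set, the error counted on each outer fold never comes from samples that took part in the choice. Reporting the minimum of `cv_error` over `params` on the full data would instead correspond to the optimistic single-level estimate.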
| 6 ACKNOWLEDGEMENTS |
|---|
The authors appreciate discussions with Geoff McLachlan, David Duffy, Ross McVinish, Clair Alston and Georgia Chenevix-Trench and the helpful comments of two anonymous reviewers. This research was primarily supported by the ARC Center for Complex Dynamic Systems and Control CEO348165 and NHMRC Medical Bioinformatics, Genomics and Proteomics Program Grant 389892.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: John Quackenbush
Received on November 27, 2006; revised on March 12, 2007; accepted on March 15, 2007
| REFERENCES |
|---|
Ambroise C, McLachlan GJ. Selection bias in gene extraction on the basis of microarray gene-expression data. PNAS (2002) 99:6562–6566.
Braga-Neto UM, Dougherty ER. Is cross-validation valid for small-sample microarray classification? Bioinformatics (2004) 20:374–380.
Breiman L, et al. Classification and Regression Trees. (1984) Belmont, CA: Wadsworth.
Dabney AR. Classification of microarrays to nearest centroids. Bioinformatics (2005) 21:4148–4154.
Dudoit S, Fridlyand J. Classification in microarray experiments. In: Statistical Analysis of Gene Expression Microarray Data—Speed TP, ed. (2003) Boca Raton: Chapman & Hall. 93–158.
Efron B. Estimating the error rate of a prediction rule: improvement on cross-validation. J. Am. Stat. Assoc (1983) 78:316–331.
Efron B, Tibshirani R. Improvements on cross-validation: The .632+ bootstrap method. J. Am. Stat. Assoc (1997) 92:548–560.
Good P. Permutation Tests: a Practical Guide to Resampling Methods for Testing Hypotheses (1994) New York: Springer-Verlag.
Guyon I, et al. Gene selection for cancer classification using support vector machines. Machine Learning (2002) 46:389–422.
Johnson S, et al. Continuous Univariate Distributions (1994) Vol. 1, 2nd edn. New York: Wiley.
Khan J, et al. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Med (2001) 7:673–679.
McLachlan GJ. Discriminant Analysis and Statistical Pattern Recognition (1992) New York: Wiley.
McLachlan GJ, et al. Analyzing Microarray Gene Expression Data (2004) Hoboken, NJ, USA: Wiley.
Molinaro AM, et al. Prediction error estimation: a comparison of resampling methods. Bioinformatics (2005) 21:3301–3307.
Sharma P, et al. Early detection of breast cancer based on gene-expression patterns in peripheral blood cells. Breast Cancer Res (2005) 7:R634–R644.
Stone M. Cross-validatory choice and assessment of statistical predictions. J. R. Stat. Soc. Ser. B (1974) 36:111–147.
Statnikov A, et al. A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics (2005) 21:631–643.
Tibshirani R, et al. Diagnosis of multiple cancer types by shrunken centroids of gene expression. PNAS (2002) 99:6567–6572.
Tibshirani R, et al. Class prediction by nearest shrunken centroids, with applications to DNA microarrays. Stat. Sci (2003) 18:104–117.
Varma S, Simon R. Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics (2006) 7:91.
Wessels LFA, et al. A protocol for building and evaluating predictors of disease state based on microarray data. Bioinformatics (2005) 21:3755–3762.
Zhu JX, et al. On selection biases with prediction rules formed from gene expression data. J. Stat. Plan. Inference (2007) in press.