Bioinformatics Advance Access originally published online on August 14, 2006
Bioinformatics 2006 22(20):2507-2515; doi:10.1093/bioinformatics/btl438
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
The ties problem resulting from counting-based error estimators and its impact on gene selection algorithms
1 School of Electrical & Electronic Engineering, Nanyang Technological University, Nanyang Avenue Singapore 639798, Singapore
2 Bioinformatics Research Centre, Nanyang Technological University, Nanyang Avenue Singapore 639798, Singapore
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Feature selection approaches, such as filter and wrapper, have been applied to address the gene selection problem in the literature of microarray data analysis. In wrapper methods, the classification error is usually used as the evaluation criterion of feature subsets. Due to the nature of high dimensionality and small sample size of microarray data, however, counting-based error estimation may not necessarily be an ideal criterion for gene selection problem.
Results: Our study reveals that evaluating genes in terms of counting-based error estimators such as resubstitution error, leave-one-out error, cross-validation error and bootstrap error may encounter severe ties problem, i.e. two or more gene subsets score equally, and this in turn results in uncertainty in gene selection. Our analysis finds that the ties problem is caused by the discrete nature of counting-based error estimators and could be avoided by using continuous evaluation criteria instead. Experiment results show that continuous evaluation criteria such as generalised ||w||2 measure for support vector machines and modified Relief's measure for k-nearest neighbors produce improved gene selection compared with counting-based error estimators.
Availability: The companion website is at http://www.ntu.edu.sg/home5/pg02776030/wrappers/. The website contains (1) the source code of all the gene selection algorithms and (2) the complete set of tables and figures of experiments.
Contact: ekzmao{at}ntu.edu.sg
| 1 INTRODUCTION |
|---|
|
|
|---|
Current microarray experiments allow expression levels of thousands of genes to be measured at once to provide complete transcription information in cells. Microarray data is like a goldmine, from which a lot can be explored. One of the explorations is to identify gene markers that present the maximum discriminant power between different conditions of interest, such as tumor and normal tissues. Gene selection has emerged as a major issue of microarray data analysis.
Gene selection can be cast as a feature selection problem in the context of pattern classification. In general, a feature selection algorithm consists of four basic components: search procedure, evaluation criterion, stopping criterion and validation procedure (Dash and Liu, 1997). The search procedure generates candidate feature subsets for evaluation, while the evaluation criterion measures the goodness of the candidate feature subsets. The stopping criterion decides when to terminate the feature selection procedure, and in the validation procedure a particular classification algorithm is used to check whether the final selected subset is valid. Feature selection approaches can be divided into two broad groups, filter methods and wrapper methods, based on their dependence on the classification algorithm used in the validation procedure, see e.g. John et al. (1994). The main difference between the two approaches lies in the evaluation criterion. The filter method employs intrinsic properties of data, which are independent of the classification algorithm, as the criterion for feature subset evaluation, while the wrapper method evaluates feature subsets according to the classification performance. In most pattern classification applications the wrapper method provides features with better predictive accuracy than the filter method, whereas the computational cost of the wrapper method is much higher than that of the filter method. Both approaches have been used in microarray data analysis. For example, Golub et al. (1999), Furey et al. (2000) and West et al. (2001) employed filter methods, while Guyon et al. (2002), Cho et al. (2004) and Zhou and Mao (2005) employed wrapper methods for gene selection.
The goodness of a feature subset is often evaluated by measuring its predictive capacity. Hence, it is natural to consider the error rate or error probability as a choice criterion in the wrapper-based feature selection (John et al., 1994; Kohavi and John, 1997). As a large independent test set is rarely available in practice, the error rate is usually estimated based on the training data. There are several approaches to error estimation, including resubstitution, cross-validation and bootstrap. They are all calculated based on the error-counting mechanism. The leave-one-out error, a special case of cross-validation error, gives an almost unbiased estimate of the probability of test error (Luntz and Brailovsky, 1969). Therefore, several authors (Li et al., 2001; Xiong et al., 2001; Inza et al., 2002; Ooi and Tan, 2003) employed the leave-one-out error as the evaluation criterion in the wrapper-based gene selection and achieved satisfactory performance according to their reported experimental results.
Recent studies (Braga-Neto and Dougherty, 2004a, b) found, however, that leave-one-out error estimator and other three error estimators exhibit excessive variability in small-sized data, which could affect the performance of wrapper-based feature selection (Sima et al., 2005a) or feature-set ranking (Sima et al., 2005b). In this study, we reveal another problem of counting-based error estimators. We find that in high-dimensional and small-sized microarray data, employing counting-based error estimators as gene evaluation criteria could result in severe ties problem, i.e. two or more gene subsets score equally. In the occurrences of ties, the evaluation criterion is unable to differentiate these gene subsets, and the selection decision is usually made in terms of the order in which the candidate gene subsets are evaluated. In a sense, this selection is arbitrary, and less significant genes could be selected while more significant genes could be discarded. Our analysis finds that the root cause of the ties problem lies in the discrete nature of the counting-based error estimators. For microarray data, the counting-based error estimators just have a very limited number of possible values due to the very limited number of samples available, while the number of gene subsets for evaluation is huge because of the excessive high dimensionality. The mapping from gene subsets to error rates is thus many-to-one, and it is highly possible that two or more gene subsets have the same minimal error at some steps of the gene selection procedure.
For any dataset, the problem of ties always exists when the counting-based error estimator is employed as the evaluation criterion. However, the probability of the occurrence of ties is limited in conventional data (low dimensionality and large or medium sample size). We believe this is the reason why the ties problem has not received considerable attentions in the literature of feature selection. However, the high dimensionality and small sample size of microarray data drastically enhance this probability, which results in an arbitrary gene selection. As the high-dimensional and small-sized settings are common in biological data, such as microarray data and mass spectrometry data, our finding is significant for successful exploration of biological applications.
Experimental studies show that the impact of the ties problem is twofold. First, the final selected gene set is uncertain. A slight change in the order of genes in microarray data could affect the selection results remarkably. Second, the classification performance deteriorates since less important genes have chances to be selected. To solve the ties problem, we suggest employing well-designed continuous criteria, which could always have different values for different feature subsets although the difference may not necessarily be substantial, instead of the counting-based error estimators. In the present study, we examined two continuous evaluation criteria including generalized
measure for support vector machines (SVMs) and modified Relief's measure for k-nearest neighbors (KNNs). Experimental results reveal that well-designed continuous evaluation criteria could eliminate the uncertainty raised by the ties problem and improve the gene selection performance.
| 2 EVALUATION CRITERIA |
|---|
|
|
|---|
Error rate, the most commonly used evaluation criterion in wrapper methods, is usually estimated from training data based on error-counting mechanism. Criteria that do not count errors explicitly but reflect error probability implicitly are also seen in other wrapper methods. For example, SVM-RFE (Guyon et al., 2002) employs separating margin, which implicitly reflects error probability, as gene evaluation criterion. In this section, counting-based error estimators and continuous measures implicitly reflecting error probability are reviewed and discussed.
Recently, Braga-Neto and Dougherty (2004a) proposed a bolstered error estimator that presents low variance and generally low bias as well. In some special cases, the bolstered error estimator may be computed in two different ways: one is in a continuous manner, while the other is based on the error-counting mechanism. Therefore, investigation on the bolstered error is helpful to understand the impact of counting-based error estimators on gene selection.
2.1 Counting-based error estimation
A classifier is a function that maps an unlabelled sample X (
) to a label y. We use notation
to denote the class label assigned to sample X by the classifier built on training dataset S in the present study.
The error rate of a classifier is the probability of incorrectly classifying a randomly selected sample. The true error rate depends on the feature-label distribution. If the feature-label distribution were known, the true error could be computed exactly. However, the distribution is often unknown in practice, and the error rate has to be estimated from the given dataset. Most of the commonly used error estimators, including resubstitution, cross-validation and bootstrap, are based on a certain form of error counting.
Consider n training data pairs on a candidate feature subset:
i = 1, ... , n, where
is a feature vector representing the i-th sample, and yi is the class label of Xi. For a two-class problem, yi
{1, 1}.
The resubstitution error,
, also called apparent error, is the ratio of the number of misclassifications on the training dataset S to the number of samples in the dataset:
|
| (1) |
In k-fold cross-validation, the dataset S is randomly split into k mutually exclusive folds St, t = 1, ... , k, of nearly equal size. The training and test procedures are performed k times. Each time t
1, ... , k, the classifier is trained on S\St (the notation \ denotes operation of set subtraction) and tested on St. The cross-validation error rate is the ratio of overall misclassifications of k test subsets to the number of samples in the dataset:
|
| (2) |
xi, yi
. To obtain a more reliable estimate, the cross-validation process may be repeated multiple times using different splits of the data.
The leave-one-out error is a special case of k-fold cross-validation error, with k = n:
|
| (3) |
In the bootstrap error estimation, a bootstrap sample set S* is created by sampling n samples uniformly from the original dataset S with replacement. Some of the samples in S may appear once or multiple times in S*, whereas others may not appear at all. In practice, the bootstrap error is approximated on a total of B independent bootstrap sample sets
, for b = 1, ... , B. Define
= 1 if sample
xi, yi
is not contained in the b-th bootstrap sample set
, and 0 otherwise. The bootstrap error estimate on the basis of the B bootstrap sample sets is given by
|
| (4) |
is upwardly biased, and Efron and Tibshirani (1997) proposed the .632+ bootstrap estimator to correct the upward bias:
![]() | (5) |
is given by
,
is the relative overfitting rate, and
is the no-information error rate that would apply if the feature vector X and label y were independent. For the two-class problem,
is estimated by
, where
is the proportion of label yi equaling 1, and
is the proportion of
equaling 1. In this study, .632+ bootstrap (Efron and Tibshirani, 1997) is employed.
All the error estimators mentioned above are based on 01 decision in error counting. That is, if
, one error is counted, otherwise zero error. For resubstitution and leave-one-out error rates, the possible values are 0/n, 1/n, 2/n, ... . Obviously, the two error estimators are discrete functions with at most n + 1 possible values for a dataset with n samples. Cross-validation and bootstrap errors could have more possible values, but the two error estimators are still discrete functions in nature.
A study in Sima et al. (2005a) showed that the four counting-based error estimators exhibit variability that has impacts on feature selection. Next, we analyse that the discrete nature of the counting-based error estimator could result in the ties problem in gene selection owing to small sample size and high dimensionality of microarray data, and this in turn induces uncertainty to gene selection.
Without loss of generality, we take the leave-one-out error based wrapper method as an example. At each step of the feature selection procedure, no matter what search procedure is used, a number of candidate feature subsets are evaluated. For the high-dimensional microarray data, the number of candidate feature subsets to be evaluated is huge. For example, at each step of a sequential forward gene selection (SFS), the number of candidate gene subsets is almost equal to the dimensionality of the dataset (usually a few thousands). However, the number of possible values of the leave-one-out error is very limited, say 10, no matter what classification algorithm is used. This is just like the following scenario: thousands of balls are randomly tossed into 10 boxes marked 0 to 9. It is almost impossible that the number of balls falling into box 0 is only 1. This example illustrates that at some steps of the gene selection procedure, there might exist a few candidate gene subsets producing the same minimal error. If this ties problem occurs, the error rate as a criterion is unable to judge which gene subset is the best, and the selection decision is often made in terms of the order in which the candidate gene subsets are evaluated. In a sense, the selection loses its way and becomes arbitrary. As a direct result, some less significant genes might be selected while more important ones might be discarded.
For conventional data with low dimensionality and large or medium sample size, the ties problem is not a major issue. But in microarray data which is of small size and high dimensionality, the ties problem often occurs. This will be demonstrated in experimental studies in Section 4.
2.2 Continuous evaluation criteria
As analysed in Section 2.1, the main reason of the arbitrary gene selection is the discrete evaluation criterion. To solve this problem, we suggest employing well-designed continuous criterion, which maps different gene subsets to different values, to replace the discrete counting-based error estimator. Two continuous measures are investigated for wrapper-based gene selection in the present study.
2.2.1 Generalized ||w||2 measure for support vector machines
SVMs are well-suited to work with high dimensional data, such as microarray data (Furey et al., 2000). When used for classification, SVMs separate one class from the other with a hyperplane that maximizes the distance between the hyperplane and the nearest sample of each class. For separable cases, SVMs maximize the separating margin
. The mechanism of SVMs is to minimize the
, and hence to maximize the separating margin. In feature selection applications, the feature subset with minimal
, which leads to the maximal separating margin, is preferred. For non-separable cases, the SVM is generalized to minimize the following criterion (Vapnik, 1998),
![]() | (6) |
. Therefore the
can also be generalized to
for feature evaluation. The latter is called generalized
measure in this work.
Our goal here is not only to find the hyperplane with minimal generalized
measure, but also the feature subset that yields minimal generalized
measure. To achieve this goal, we first induce the SVM classifiers on all candidate feature subsets generated by the search procedure, then identify the feature subset with minimal generalized
measure. The well-known SVM-RFE (Guyon et al., 2002) minimizes the generalized
measure as well, but it uses the Optimal Brain Damage (OBD) (LeCun et al., 1990) algorithm to perform a backward elimination procedure.
2.2.2 Modified Relief's measure for k-nearest neighbors
The modified Relief's measure is inspired by Relief algorithm (Kononenko, 1994). The key idea of Relief is to estimate the quality of a feature according to how well its value distinguishes between samples that are near each other.
In this study, instead of evaluating the significance of a single feature, we propose the modified Relief's measure to assess the quality of a feature subset. The measure is similar to the KNN rule. For a given sample pair
xi, yi
, two groups of nearest neighbors are identified from the remaining data (
): k nearest neighbors from the same class as yi, called nearest hits
for j = 1, ... , k; and k nearest neighbors from the other class, called nearest misses
, for j = 1, ... , k. k is a predefined number. Let
be the center of nearest hits,
be the center of nearest misses, i.e.
and
. The modified Relief's measure is defined as
![]() | (7) |
2.3 Bolstered error estimation
The bolstered error estimator (Braga-Neto and Dougherty, 2004a) is an improved error estimator, and is particularly useful in small-sized settings (Sima et al., 2005a, b). In general, the bolstered resubstitution error, an improved resubstitution error by employing bolstering techniques, is given as follows:
|
| (8) |
is the bolstering kernel. According to Equation (8), the bolstered error provides an inherently continuous measure. For example, when the classifier is linear, the bolstered resubstitution estimation can be rewritten as,
![]() | (9) |
is the cumulative distribution function of a zero-mean Gaussian random variable with variance
i,
i is a parameter that can be determined by the method proposed in Braga-Neto and Dougherty (2004a), and W is given by
. However, in most cases, the integrals in Equation (8) cannot be computed exactly, then a Monte-Carlo integration can be employed instead:
|
| (10) |
, j = 1, ... , M, are Monte-Carlo samples drawn from the sample xi. Note that there are two approaches to compute the bolstered resubstitution error if a linear classification algorithm (such as SVM) is employed. From our point of view, the former [Equation (9)] is a continuous function, while the latter [Equation (10)] is basically discrete since it is based on error-counting mechanism. | 3 SYSTEM AND METHODS |
|---|
|
|
|---|
3.1 Datasets
The second dataset is the breast cancer dataset (West et al. (1999). The dataset consists of expression of 49 tumor samples for 7129 human genes. There are two different response labels in the dataset, the status of the estrogen receptor (ER) and the lymph nodal (LN). In this work, we only considered the former. Among the 49 samples, 25 samples are ER+, while the remaining samples are ER. The data are available at http://data.cgt.duke.edu/west.php
The leukaemia dataset was first described in Golub et al. (1999). The dataset contains expression levels of 7129 human genes of 72 patients with either acute lymphoblastic leukaemia (ALL, 47 cases) or acute myeloid leukaemia (AML, 25 cases). The raw data are available at http://www-genome.wi.mit.edu/cancer/
The third dataset is the colon cancer dataset (Alon et al., 1999). It contains gene expression levels of 2000 genes of 40 tumor and 22 normal colon tissues. The data are publicly available at http://microarray.princeton.edu/oncology/
Each of these datasets was pre-processed using the procedure described in Dudoit et al. (2002). After thresholding, filtering and logarithmic-transforming, the microarray data were standardized to zero mean and unit standard deviation across genes. Since the dimensionality (number of genes) of microarray data is very high, and most of the genes are irrelevant to the discriminant task, we employed a pre-selection procedure to reduce the search space and to save the computational cost. We selected top 1000 genes based on Fisher's ratio. Fisher's ratio is an individual gene ranking criterion, and is used to evaluate how well a single gene is correlated with the separation between classes. For every gene the Fisher's ratio is defined as
, where µ1, µ2,
1,
2 denote the means and standard deviations of two classes, respectively. All the simulations and comparisons in this work were carried out based on the pre-processed and pre-selected data.
3.2 Wrapper schemes
Typical feature selection algorithm consists of four basic components: search procedure, evaluation criterion, stopping criterion and validation procedure (Dash and Liu, 1997). Since our focus is the performance of different evaluation criteria, the search procedure was fixed to sequential forward selection (SFS) (Devijver and Kittler, 1982). SFS is a simple hill-climbing search algorithm. SFS starts from an empty subset of features and sequentially adds features to the subset until the stopping criterion is reached. The stopping criterion in our work was a predefined number of genes, say 10.
In wrapper methods, the evaluation criterion is based on the performance of the classification algorithm used in the validation procedure. In this work, three groups of evaluation criteria were investigated. The first group includes counting-based error estimators, such as resubstitution error, leave-one-out error, cross-validation error and .632+ bootstrap error (Efron and Tibshirani, 1997). The second group includes two continuous criteria: generalized
measure for SVMs and modified Relief's measure for KNNs. The third group is the bolstered error estimators.
With different combinations of the four components in a typical feature selection algorithm, we investigated 12 cases as summarized in Table 1.
|
| 4 EXPERIMENTS AND DISCUSSION |
|---|
|
|
|---|
In this section, we tested the performance of wrapper methods listed in Table 1 on three microarray benchmark datasets. For SVMs, the linear SVM classifier was employed, and the parameter C was set to 0.1. For KNNs and modified Relief's measure, the parameter k was set to 3.
Owing to page limit, only the results of criteria with SVM classification (No. 14, 9, 11 and 12 in Table 1) are presented here. However, all the experimental results are available at our companion website (http://www.ntu.edu.sg/home5/pg02776030/wrappers/)
4.1 Existence of ties problem
We first demonstrate the existence of the ties problem, taking the leukaemia dataset as an example. In the experiment, SVM is employed as the classification algorithm in the validation procedure, and the sequential forward wrapper selection is performed. As illustrated in Table 2, a few gene subsets could lead to equal minimal estimated error at some steps of the gene selection procedure when any counting-based error estimator is used as the evaluation criterion. For example, in the case of leave-one-out error, after four genes were selected, we may identify 563 genes out of the remaining 996 candidate genes, and the addition of any one of the 563 genes into the gene subset selected at the previous step could lead to the same minimal leave-one-out error. Thus, the next selected gene could be any one of the 563 genes. The selection decision is usually made in terms of the order in which the gene subsets are evaluated. In a sense, this selection is arbitrary, and it is reasonable to believe that less significant genes could be selected while more significant ones could be discarded. However, the ties problem does not occur in the continuous criterion-based wrapper method as shown in Table 2 (the last row is the results of generalized
measure). Note that the 10-fold cross-validation error was estimated by the average of 10 runs of 10-fold cross-validation, and the bootstrap was performed with B = 100 replicates, with the goal of keeping the number of induced classifiers in the bootstrap to be the same as 10 runs of 10-fold cross-validation. Actually, the ties problem also happens in leukaemia dataset for KNN rules, and in breast cancer and colon cancer datasets for both rules, please refer to our companion website for details.
|
4.2 Impact of ties problem
Next, we demonstrate the impact of the ties problem. The experimental design is as follows. First, genes of the original dataset (leukaemia dataset, breast cancer dataset or colon cancer dataset) were randomly permuted to generate a new dataset. In the experiment, the procedure was repeated 10 times to generate 10 datasets. Note that the only difference among the 10 datasets is the different order of genes in the data matrix. Second, each of the 10 newly generated datasets was split into a training and a test set with nearly equal size and approximately the same proportions of labels as the entire dataset. The training set is used to perform gene selection and classifier training, while the test set is used to assess quality of the selected gene subsets. The process was repeated 100 times using different training-test splits. Evaluation and comparison of feature selection algorithms are based on the average of the 100 repeats.
As discussed above, evaluating genes in terms of counting-based error estimators has two problems including excessive variability and ties, and both problems have impacts on gene selection. In the experiment, the 100 training-test data splits, the fold partitions in the 10-fold cross-validation and the replicates in B.632+ are kept identical across the 10 newly generated datasets. Such a design should separate the impact of ties from that of variability of error estimation. This is because under such settings the impact of the variability in error estimation is almost identical across the 10 newly generated datasets, and a comparison of the results on the 10 newly generated datasets would reveal the impact of ties problem. Since the SFS procedure is a deterministic search algorithm, the selection performances for a specific evaluation criterion would be the same on the 10 newly generated datasets if there were no ties. If different results are produced in the 10 datasets, the difference should be caused by the ties problem.
The experimental results of the three benchmark datasets, using SVM as the classification algorithm, are shown in Figure 1. Assume that 10 genes were selected. For the leukaemia dataset, the test errors of wrapper methods employing the resubsitution error, leave-one-out error, cross-validation error and bootstrap error as evaluation criteria vary in the range of [6.1% 8.9%], [6.5% 9.3%], [6.1% 8.7%] and [6.1% 7.8%], respectively, as shown in Figure 1a, d, g and j. For the breast cancer dataset, the test errors vary in the range of [13.2% 19.3%], [14.2% 19.6%], [13.9% 18.9%] and [14.0% 16.3%], respectively, as shown in Figure 1b, e, h and k. And for the colon cancer dataset, the test errors vary in [20.0% 22.2%], [18.8% 22.3%], [20.2% 21.6%] and [21.3% 21.7%], respectively, as shown in Figure 1c, f, i and l. Except for the bootstrap estimator on the colon cancer dataset, the variations of these test errors are apparent. Note that the variation is caused only by the different gene order in the data matrix, but the difference in gene order does not change the intrinsic property of the data at all. This indicates that the counting-based error estimator may not necessarily be a reliable and consistent evaluation criterion. However, when continuous evaluation criterion such as generalized
measure was used, exactly the same results were obtained from the 10 newly generated datasets, as shown in Figure 1mo.
|
From the experiments, we can see that different gene orders in the data matrix produce different results for the same wrapper scheme (i.e. the same search procedure, evaluation criterion, stopping criterion and classification algorithm). Actually, the root cause of the problem lies in the error estimator. Owing to the discrete error estimators, some gene subsets would have the same value of evaluation criterion. Since ties occur during the gene selection process, the selection of gene subset would depend on how the ties are broken. For the SFS algorithm, the gene order is used as a tie-breaking rule, i.e. the genes that rank in the front would have greater probability to be selected. For example, on a dataset with 5 genes, it is assumed that gene 1 and gene 2 lead to the same optimal value of evaluation criterion when they are respectively added into the previous selected subsets. If the gene order is 1-2-3-4-5, then gene 1 would be selected; whereas gene 2 would be selected if the order is 2-1-3-4-5. When we permuted the order of genes, different tie-breaking rules were employed, and this in turn led to different results. In other words, the arbitrary gene selection is just caused by the ties problem inherent in discrete counting-based error estimators.
We averaged the test errors across 10 newly generated datasets to assess the performance of wrapper methods using the four error estimators, respectively, as evaluation criterion. As shown in Figure 1mo, the gene subset identified by the generalized ||w||2 measure outperforms those by the other four error estimators. The same conclusion can be drawn from the results of KNN classifier, which are available at our companion website.
For conventional data (low dimensionality and large or medium sample size), the impact of ties is not significant, especially for cross-validation error and bootstrap error (see example at the companion website). Therefore this problem has not received much attention in the literature of feature selection. However, for microarray data analysis, the impact of ties is more significant owing to high dimensionality and small sample size of the microarray data.
In the above experiments, SFS was used as the search procedure. A shortcoming of SFS is the nesting effect, i.e. once a feature has been selected to the subset, it cannot be removed at later steps. Thus, the selected feature subset may not achieve the global optimum for the evaluation criterion. However, the problem we discussed above is caused by the discrete nature of the evaluation criterion rather than the search algorithm. Even if other search procedures are used, the ties problem still exists. To illustrate this, we investigated two other search procedures. The first one is random selection. We randomly generated 1 000 000 gene subsets, each of which contained 10 genes, using the leukaemia dataset. By calculating the leave-one-out error, we found that there are 10 111 gene subsets with zero leave-one-out error. Each of the 1000 genes presents at least 10 times within the 10 111 subsets. This means any of the 1000 genes has a chance to be selected, regardless of its merit. We also employed sequential forward floating selection (SFFS) described in Pudil et al. (1994) for gene selection. SFFS allows backtracking as long as this optimizes the evaluation criterion function, which overcomes the nesting effect in SFS. However, SFFS presents similar performance with SFS on the error counting-based wrapper methods. The experimental results are available at our companion website.
4.3 Comparison between two implementations of bolstered error
The above experimental results corroborate the finding that the ties problem would drive the gene selection procedure to an arbitrary manner. The results also show that the ties problem does not occur when continuous measure is employed, and well-designed continuous evaluation criterion produces better performance than the counting-based error estimator does. However, it is not convincing to conclude it is the ties problem that lowers the capacity of counting-based error estimator in gene (feature) selection, because we compared the rule-specific continuous measure with general error estimator, and the continuous measure might benefit from its close relationship with the classification algorithm (rule) used in the validation procedure. Hence we tested the bolstered resubstitution error estimator to further investigate the impact of ties problem. When a linear classifier is used, as we mentioned in Section 2.3, the bolstered resubstitution error estimator can be continuous or discrete, depending on how it is calculated. The comparison between two ways of bolstered error calculation is helpful to clarify the effect of rule-specific continuous measure against general error estimator. Another advantage of using bolstered error is its low computational cost. We can use a large number of Monte-Carlo samples to relieve the ties problem, while it is almost impossible to repeat numerous times in cross validation and bootstrap.
The performance of bolstered resubstitution error estimators was also tested and compared using the 10 newly generated datasets with different gene order. Note that Monte-Carlo bolstered error estimation is affected by the Monte-Carlo sampling, which in turn induces internal variance. To highlight the impact of ties, and to lower the impact of variability across the 10 newly generated datasets, we generated in advance a sequence of random numbers that will be used in Monte-Carlo sampling, and kept the set of random numbers unchanged across the simulations on the 10 newly generated datasets. The 100 training-test data splits still remain unchanged. Under such experimental settings, the selection performance on the 10 newly generated datasets would be of no difference if there were no ties. In other words, if different performances are obtained, such variation should be resulted from the ties problem.
The experimental results using bolstered resubstitution error as the evaluation criterion and SVM as the classification algorithm are shown in Figure 2. Similar to other counting-based error estimators, the test performance for the wrapper method using Monte-Carlo bolstered error varies across the 10 newly generated datasets, as shown in Figure 2ai. And the continuous bolstered estimation presents consistent performance, as shown in Figure 2jl. For Monte-Carlo bolstered error estimator, we set M to three different values: M = 10, 100 and 1000. According to Equation (10), when M increases, there are more possible values for the bolstered error estimator, so the chance of ties occurring decreases. In other words, the increase of M would alleviate the effect of the ties problem. This is verified by the fact that the variations in Figure 2gi (M = 1000) are much smaller than those in Figure 2ac (M = 10). We averaged the test errors across 10 newly generated datasets to assess the performance of wrapper methods using Monte-Carlo bolstered error as evaluation criterion, and compared them with the wrapper method using continuous bolstered error. The comparison results shown in Figure 2jl reveal that even for the same error estimator, the continuous approach produces better results than error-counting approach for gene selection on high-dimensional and small-sized microarray data. This experiment again demonstrates the advantage of continuous evaluation criteria over error counting-based criteria. More importantly, the performance of wrapper methods using Monte-Carlo bolstered error increases with M increasing. Hence it is clear that the ties problem would reduce the selection capacity of counting-based error estimators.
|
| 5 CONCLUSIONS |
|---|
|
|
|---|
In this work, we have investigated the performance of several error counting-based wrapper methods in gene selection, including the resubstitution error, leave-one-out error, cross-validation error and bootstrap error on three benchmark microarray datasets. Our experiments have showed that error counting-based wrapper methods might select less-significant genes when more than one gene subset result in the same estimated error. We believe that the root cause of the ties problem is the discrete nature of counting-based error estimators. We have tested and suggested two continuous criteria for gene selection. Experimental results have showed that well-designed continuous evaluation criteria, such as the generalized
measure for SVMs and the modified Relief's measure for KNNs, outperform the counting-based error estimators in gene selection on high-dimensional and small-sized microarray data. | Acknowledgments |
|---|
The authors thank the anonymous reviewers for their very constructive comments.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Martin Bishop
Received on November 22, 2005; revised on July 31, 2006; accepted on August 9, 2006
| REFERENCES |
|---|
|
|
|---|
Alon, U., et al. (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl Acad. Sci. USA, 96, 67456750
Braga-Neto, U. and Dougherty, E.R. (2004a) Bolstered error estimation. Pattern Recog, . 37, 12671281[CrossRef].
Braga-Neto, U. and Dougherty, E.R. (2004b) Is cross-validation valid for small-sample microarray classification? Bioinformatics, 20, 374380
Cho, J.-H., et al. (2004) Gene selection and classification from microarray data using kernel machine. FEBS Lett, . 571, 9398[CrossRef][Web of Science][Medline].
Dash, M. and Liu, H. (1997) Feature selection for classification. Intell. Data Anal, . 1, 131156.
Devijver, P. and Kittler, J. Pattern Recognition: a Statistical Approach, (1982) , London Prentice Hall.
Dudoit, S., et al. (2002) Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Stat. Assoc, . 97, 7787[CrossRef][Web of Science].
Efron, B. and Tibshirani, R. (1997) Improvements on cross-validation: the .632+ bootstrap method. J. Am. Stat. Assoc, . 92, 548560[CrossRef][Web of Science].
Furey, T., et al. (2000) Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16, 906914
Golub, T., et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531537
Guyon, I., et al. (2002) Gene selection for cancer classification using support vector machines. Mach. Learn, . 46, 389422[CrossRef].
Inza, I., et al. (2002) Gene selection by sequential search wrapper approaches in microarray cancer class prediction. J. Intell. Fuzzy Syst, . 12, 2533.
John, G.H., Kohavi, R., Pfleger, K. (1994) Irrelevant features and the subset selection problem. In Proceedings of the Eleventh International Conference on Machine Learning. Morgan Kaufmann Publishers, San Francisco, CA 121129.
Kohavi, R. and John, G.H. (1997) Wrappers for feature subset selection. Artif. Intell, . 97, 273324[CrossRef].
Kononenko, I. (1994) Estimating attributes: analysis and extensions of RELIEF. In Bergadano, F. and Raedt, L.D. (Eds.). Lecture Notes in Computer Science, Springer, Berlin vol. 784, , pp. 171182.
LeCun, Y., Denker, J., Solla, S. (1990) Optimal brain damage. In Touretzky, D.S. (Ed.). Advances in Neural Information Processing Systems II, , San Mateo, CA Morgan Kaufman Publishers, pp. 598605.
Li, L., et al. (2001) Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method. Bioinformatics, 17, 11311142
Luntz, A. and Brailovsky, V. (1969) On estimation of characters obtained in statistical procedure of recognition (in Russian). Technicheskaya Kibernatica, 3, .
Ooi, C.H. and Tan, P. (2003) Genetic algorithms applied to multi-class prediction for the analysis of gene expression data. Bioinformatics, 19, 3744
Pudil, P., et al. (1994) Floating search methods in feature selection. Pattern Recogn. Lett, . 15, 11191125[CrossRef].
Sima, C., et al. (2005a) Impact of error estimation on feature selection. Pattern Recogn, . 38, 24722482[CrossRef].
Sima, C., et al. (2005b) Superior feature-set ranking for small samples using bolstered error estimation. Bioinformatics, 21, 10461054
Vapnik, V.N. Statistical Learning Theory, (1998) , New York John Wiley and Sons.
West, M., et al. (2001) Predicting the clinical status of human breast cancer by using gene expression profiles. Proc. Natl Acad. Sci. USA, 98, 1146211467
Xiong, M., et al. (2001) Biomarker identification by feature wrappers. Genome Res, . 11, 18781887
Zhou, X. and Mao, K.Z. (2005) LS Bound based gene selection for DNA microarray data. Bioinformatics, 21, 15591564
This article has been cited by other articles:
![]() |
X. Zhou and D. P. Tuck MSVM-RFE: extensions of SVM-RFE for multiclass gene selection on DNA microarray data Bioinformatics, May 1, 2007; 23(9): 1106 - 1114. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||





,
,
as evaluation criterion, respectively, for 10 newly generated datasets. Each figure includes 10 curves, each of which corresponds to the test performance for one dataset with random gene order. (ac) are related to 
