Bioinformatics Advance Access originally published online on July 26, 2006
Bioinformatics 2006 22(19):2430-2436; doi:10.1093/bioinformatics/btl407
What should be expected from feature selection in small-sample settings
1 Department of Electrical and Computer Engineering, Texas A&M University, College Station TX 77843, USA
2 Computational Biology Division, Translational Genomics Research Institute Phoenix, AZ 85004, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Motivation: High-throughput technologies for rapid measurement of vast numbers of biological variables offer the potential for highly discriminatory diagnosis and prognosis; however, high dimensionality together with small samples creates the need for feature selection, while at the same time making feature-selection algorithms less reliable. Feature selection must typically be carried out from among thousands of gene-expression features and in the context of a small sample (small number of microarrays). Two basic questions arise: (1) Can one expect feature selection to yield a feature set whose error is close to that of an optimal feature set? (2) If a good feature set is not found, should it be expected that good feature sets do not exist?
Results: The two questions translate quantitatively into questions concerning conditional expectation. (1) Given the error of an optimal feature set, what is the conditionally expected error of the selected feature set? (2) Given the error of the selected feature set, what is the conditionally expected error of the optimal feature set? We address these questions using three classification rules (linear discriminant analysis, linear support vector machine and k-nearest-neighbor classification) and feature selection via sequential floating forward search and the t-test. We consider three feature-label models and patient data from a study concerning survival prognosis for breast cancer. With regard to the two focus questions, there is similarity across all experiments: (1) One cannot expect to find a feature set whose error is close to optimal, and (2) the inability to find a good feature set should not lead to the conclusion that good feature sets do not exist. In practice, the latter conclusion may be more immediately relevant, since when faced with the common occurrence that a feature set discovered from the data does not give satisfactory results, the experimenter can draw no conclusions regarding the existence or nonexistence of suitable feature sets.
Availability: http://ee.tamu.edu/~edward/feature_regression/
Contact: edward{at}ece.tamu.edu
| 1 INTRODUCTION |
|---|
A key problem in translational genomics is to find sets of genes whose expression levels can serve as feature sets for diagnosis or prognosis (Hedenfalk et al., 2001; van de Vijver et al., 2002; Wei et al., 2004; Barrier et al., 2005; Reedijk et al., 2005). Owing to the obstacles inherent in dealing with extremely large numbers of interacting variables, small samples create the need for feature selection, while at the same time making feature-selection algorithms less reliable. Two basic related questions arise: (1) Can one expect feature selection to yield a feature set whose error is close to that of an optimal feature set? (2) If a good feature set is not found, should it be concluded that good feature sets do not exist? The second question is confronted by researchers whenever they believe that gene-expression-based discrimination should be possible but they are unable to find good feature sets.
Feature selection is required when the number of features is large with respect to the sample size because the use of a large number of features can result in overfitting the data: the designed classifier performs well on the sample data but not on the feature-label distribution from which the data have been drawn. Feature selection, which is part of the classification rule, results in classifier constraint, not a reduction in the dimensionality of the feature space relative to design. For instance, if there are D features available for linear discriminant analysis (LDA), when used directly, then the classifier family consists of all hyperplanes in D-dimensional space, but if a feature-selection algorithm reduces the number of variables to d < D before the application of LDA, then the classifier family consists of all hyperplanes in D-dimensional space confined to d-dimensional subspaces. The dimensionality of the classification rule has not been reduced, but the new classification rule (feature selection plus LDA) is constrained. The issue is whether it is sufficiently constrained. Given 20 000 gene-expression levels as features, the new rule has significant potential for overfitting.
A major impediment to feature selection is the combinatorial nature of the problem. To select a subset of d features from a set of D potential features and be assured that it provides an optimal classifier with minimum error among all optimal classifiers for subsets of size d, all d-element subsets must be checked unless there is distributional knowledge that mitigates the search requirement, a condition rarely satisfied in practice (Cover and Van Campenhout, 1977). Thus a suboptimal feature-selection algorithm is required. In addition, for small-sample settings, error estimation is problematic; indeed, if error estimation (or other parameter estimation) is required for a feature-selection algorithm, then the impact of error estimation can be greater than the choice of algorithm (Sima et al., 2005).
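To give a sense of scale for the exhaustive search described above, the following minimal sketch counts the d-element subsets of D features. The function name `n_subsets` is hypothetical; the (D, d) pairs are the search sizes quoted elsewhere in this paper.

```python
from math import comb


def n_subsets(D: int, d: int) -> int:
    """Number of d-element feature subsets that an exhaustive search must check."""
    return comb(D, d)


# Search sizes mentioned in this paper (illustrative only).
for D, d in [(50, 4), (70, 7), (20000, 10)]:
    print(f"D={D}, d={d}: {n_subsets(D, d):.3e} subsets")
```

Each subset also requires designing a classifier and estimating its error, so the count is a lower bound on the work involved.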
The two basic questions stated at the outset translate quantitatively into questions concerning conditional expectation. (1) Given the error of an optimal feature set, what is the conditionally expected error of the selected feature set? (2) Given the error of the selected feature set, what is the conditionally expected error of the optimal feature set? The first question gets directly at the question of whether one can expect suboptimal feature-selection algorithms to find good feature sets. The second question relates more directly in practice because there one has a dataset, has applied a feature-selection algorithm and has estimated the error of the resulting classifier. If the classifier is not good, one must confront the issue of whether, given the dataset in hand, there does not exist a feature set from which a good classifier can be designed, or whether there exist feature sets from which good classifiers can be designed but the feature-selection algorithm has failed to find one.
| 2 SYSTEMS AND METHODS |
|---|
The conditional-expectation analysis, and therefore the answers to the questions posed, depends on the feature-label distribution from which the data are drawn, the basic classification rule, the feature-selection algorithm employed and the sample size. The choices of distribution and sample size are limited by the requirement that we have the optimal feature set for the analysis. The consequences of this requirement depend on whether we perform a model-based study or utilize real patient data.
For a model-based study, we must either limit the model to one for which we theoretically know the optimal feature set or limit the number of features and the sample size so that an exhaustive search can be performed over all possible feature sets. So that we might address large numbers of features, we use models for which we theoretically know the optimal feature set of a given size. This limits us to models possessing uncorrelated features. Were we to obtain results showing good feature-selection performance and were the overall intent to demonstrate good performance, then this methodological limitation would be a limitation on the impact of the results. However, as will be seen, feature selection does not achieve good results even in the uncorrelated case, so it is highly unlikely that it would achieve good results for correlated features, and the results demonstrate the conclusion of the paper, that good results are not achieved. In a similar vein, to theoretically know the optimal feature set, we are restricted to relatively non-complex optimal boundaries: linear, multiple linear (bimodal) and quadratic. In analogy to the issue of correlated features, feature selection does not achieve good results on these boundaries, so one must expect it would do no better on more complex boundaries.
For patient data we are confronted with a different issue. Here, the complete dataset serves as the (empirical) distribution, in our case there being 295 sample points. Since an optimal feature set cannot be determined from the distribution, we take three approaches to obtain an optimal feature set. First, we use an existing feature-set test bed for which exhaustive searches have been carried out in a high-performance computing environment using a Beowulf cluster and in which optimal feature sets are known for the patient data (Choudhary et al., 2006). The test bed utilizes 70 gene-expression features supplied with the original published dataset (van't Veer et al., 2002), where they were found to be the most correlated with the disease outcome. Even with only 70 features, computation time is so extensive that the test bed only considers feature sets of no more than 7 genes [for details see (Choudhary et al., 2006)]. To see that the results are not dependent on the pre-selected genes, in a second approach we randomly select 50 genes from the original 24 496 genes in the study and then do an exhaustive search for the best 4-gene feature set using the same methodology as in the test-bed paper. We limit ourselves to the best 4-gene set from among 50 genes to make the computations tractable without resorting to high-performance computing, the key point being that the total number of genes is substantially larger than the number in the selected feature set. For a third approach, we choose 50 genes with the highest variance and then exhaustively select a 4-gene feature set. As will be shown later, the results for all cases are very similar: feature selection does not achieve good results. This is a strong indicator that feature selection would not achieve good results when selecting 10 or 20 features out of 10 000 features. 
Thus, although the methodology is limited when using real data owing to the necessities of an exhaustive search and of using a dataset of sufficient size to create an empirical distribution, when taken in conjunction with the results for the model-based analysis, the poor performance on the patient data is a strong indicator that no better results should be expected when selecting from much larger collections of features.
We consider three classification rules for the model-based studies: LDA, linear support vector machine (SVM) and 3-nearest-neighbor classification (3NN). Since in applications we do not know the feature-label distribution, we assume a random feature-label distribution governed by a random parameter. The exact way in which we carry out the experiments will be discussed subsequently.
For feature selection we employ the sequential floating forward search (SFFS) algorithm for both the model and patient data (Pudil et al., 1994). The performance of SFFS has been studied extensively and it has been shown that it provides good results in relation to competing algorithms (Jain and Zongker, 1997; Kudo and Sklansky, 2000). Error estimation is critical within the SFFS algorithm and since, depending on the classification rule, bolstered and semi-bolstered resubstitution (Braga-Neto and Dougherty, 2004a) have been shown to perform well within the algorithm (Sima et al., 2005), we employ these within SFFS. For the special case of Gaussian models with uncorrelated features, t-test feature selection is appropriate and we will consider it in addition to SFFS to demonstrate that it provides similar performance relative to the conditional-expectation analysis. In this special uncorrelated case, t-test feature selection can be expected to outperform SFFS; however, we consider the conditional-expectation results for SFFS more important because under usual experimental conditions, where the variables are correlated, the t-test feature selection suffers owing to its inability to discover features that only perform well in combination, and therefore is typically only used for preliminary reduction of the total set of available features, if at all. We do not use t-test feature selection for the patient data.
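The floating search can be sketched as follows. This is a minimal illustration of the general SFFS idea (forward inclusion followed by conditional exclusion), not the implementation used in this study. The function `sffs`, the type alias `Subset` and the `score` callback are hypothetical names; in the setting of this paper, `score` would be the negative of an error estimate such as bolstered resubstitution, which is not implemented here.

```python
from typing import Callable, Dict, FrozenSet, Sequence, Tuple

Subset = FrozenSet[int]


def sffs(features: Sequence[int], d: int,
         score: Callable[[Subset], float]) -> Subset:
    """Sequential floating forward search; a larger `score` means a better subset."""
    if not 1 <= d <= len(features):
        raise ValueError("d must lie between 1 and the number of features")
    best: Dict[int, Tuple[float, Subset]] = {}  # best (score, subset) seen per size
    current: Subset = frozenset()
    while len(current) < d:
        # Inclusion: add the single feature whose addition scores highest.
        current = max((current | {f} for f in features if f not in current),
                      key=score)
        k = len(current)
        if k not in best or score(current) > best[k][0]:
            best[k] = (score(current), current)
        # Conditional exclusion: keep dropping the least useful feature while
        # the reduced subset beats the best subset recorded for that size.
        while len(current) > 2:
            reduced = max((current - {f} for f in current), key=score)
            if score(reduced) > best[len(reduced)][0]:
                current = reduced
                best[len(reduced)] = (score(reduced), reduced)
            else:
                break
    return best[d][1]


# Toy criterion (an assumption for illustration): additive feature weights.
weights = [0.1, 0.9, 0.3, 0.8, 0.2, 0.5]
selected = sffs(range(len(weights)), 3, lambda s: sum(weights[f] for f in s))
print(sorted(selected))
```

With an additive criterion the floating step never fires, and the search reduces to picking the highest-weighted features. The backward step matters only when features interact, which is the case the t-test cannot handle.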
Before describing the details of our experiments, we explain how, given the feature-label distribution, we rank feature sets for a classification rule. Given a set $G$ of features corresponding to the feature-label distribution $F_G$, and the desire to select the best feature set of size $d$, an obvious choice is that subset $H \subseteq G$ such that $H$ possesses $d$ features and the Bayes classifier for the distribution $F_H$, the marginal distribution of $F_G$ corresponding to $H$, has minimal error among all Bayes classifiers corresponding to subsets of $G$ possessing $d$ features. In this case we say that $H$ has minimal Bayes error. This approach is reasonable because we are interested in finding good feature sets, regardless of the classification rule employed.
One might argue that there is a problem with this approach if the classification rule is not consistent, meaning that, given a feature set $H$ and a sample of size $n$, the expected error of the designed classifier does not converge in mean to the Bayes error for the feature set as $n \to \infty$. To illustrate the issue, suppose the class conditional distributions are Gaussian with equal covariance matrices. Then the Bayes classifier is a hyperplane and LDA provides a consistent rule. If a feature set has an error not close to the Bayes error, then this difference is clearly a consequence of the feature-selection algorithm (and its application to a sample of the given size). But what of the situation when there are two class conditional Gaussian distributions with unequal covariance matrices and the Bayes classifier is determined by a quadratic surface? In this case, if LDA is the classification rule, then there is an inherent positive lower bound for the difference between the error of a designed classifier and that of the Bayes classifier, where in the present case it arises from the fact that the LDA rule cannot achieve a better result than the optimal-hyperplane decision boundary. Rather than comparing the error of the selected feature set using LDA to the Bayes error, would it not be better to compare it to the error of the optimal hyperplane decision boundary, which in this case exceeds the Bayes error? Certainly this is an option, but this would require that we know a classifier to whose error the errors of the designed classifiers converge in mean. Although there are some situations in which such a classifier is known, such cases are not common.
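The two Gaussian cases just described can be written out explicitly. The following is the standard result for equally likely classes, stated here only as an illustration; the symbols $\mu_0$, $\mu_1$, $\Sigma$, $\Sigma_0$, $\Sigma_1$ and $g$ are notation introduced for this sketch and are not taken from the paper. A point $x$ is assigned to class 1 when $g(x) > 0$. With a common covariance matrix $\Sigma$ the discriminant is linear:

$$
g_{\text{lin}}(x) = (\mu_1 - \mu_0)^{T} \Sigma^{-1} \left( x - \tfrac{1}{2}(\mu_0 + \mu_1) \right).
$$

With unequal covariance matrices $\Sigma_0 \neq \Sigma_1$ the quadratic terms no longer cancel:

$$
g_{\text{quad}}(x) = \tfrac{1}{2}(x - \mu_0)^{T} \Sigma_0^{-1} (x - \mu_0) - \tfrac{1}{2}(x - \mu_1)^{T} \Sigma_1^{-1} (x - \mu_1) - \tfrac{1}{2} \ln \frac{|\Sigma_1|}{|\Sigma_0|}.
$$

Setting $\Sigma_0 = \Sigma_1 = \Sigma$ in $g_{\text{quad}}$ recovers $g_{\text{lin}}$. In the unequal case no hyperplane coincides with the surface $g_{\text{quad}}(x) = 0$, which is the source of the positive lower bound mentioned above.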
Aside from this practical reason, a further reason for comparing the classifier error for a selected feature set to the classifier error for the best Bayesian feature set is that, were the size of the sample not restricted, one would not be using a constrained classification rule like LDA but would instead be estimating the Bayes classifier from an estimate of the feature-label distribution. Indeed, were it known that the class conditional distributions were Gaussian with unequal covariance matrices, then, were it not for insufficient data, we would be using quadratic discriminant analysis (QDA) as the classification rule. Using LDA instead of QDA is a form of regularization to offset the effects of too little data, so that, ipso facto, LDA is being used as the way to better design an approximation to the Bayes classifier. Hence, similar to the choice of feature-selection algorithm, the choice of LDA represents our effort to discover good features.
Although taking as optimal the feature set with the minimal Bayes error is a suitable approach for model-based analysis in which a feature-label distribution is assumed, it is not appropriate when considering patient data because we lack the Bayes classifier. For the patient data, the best feature sets are either taken from a feature-selection test bed that utilizes the classification rule being applied (Choudhary et al., 2006), in which case we also consider 5NN classification, or from 50 selected genes as described previously.
Owing to the large number of experiments, some typical results will be displayed in the paper, with the majority of the results available on the companion website.
| 3 IMPLEMENTATION |
|---|
3.1 Model-based study
We consider three models to generate synthetic sample points. The linear model is a two-class Gaussian model with the classes equally likely and the class-conditional densities being spherical Gaussians possessing common variance $\sigma^2$, the common covariance matrix being $\sigma^2 I$. One class mean is located at the origin and the other at $\mathbf{a}$, where $\mathbf{a} = (a_1, a_2, \ldots, a_D)$. The Bayes classifier is a hyperplane perpendicular to the axis joining the means. The quadratic model is similar to the first model, but instead of there being equal covariance matrices for the class-conditional densities, the covariance matrices are $\sigma_0^2 I$ and $\sigma_1^2 I$, for class 0 and class 1, respectively, with $\sigma_0 \neq \sigma_1$. In the bimodal model, one class consists of a spherical Gaussian distribution of variance $\sigma^2$ centered at the origin, the other is composed of two spherical Gaussian distributions of variance $\sigma^2$ centered at $\mathbf{a}$ and $-\mathbf{a}$, and the prior probability of the second class is equally split between the two distributions composing it. The Bayes classifier is composed of two hyperplanes normal to $\mathbf{a}$.
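A minimal sketch of how sample points might be drawn from the three models is given below. Several details are assumptions made for illustration, because the corresponding symbols are not legible in this copy of the text: the class-1 mean is taken to be a vector `a`, the quadratic model is taken to have covariances `sigma0**2 * I` and `sigma1**2 * I`, and the two modes of the bimodal class are taken to sit at `a` and `-a`. The function name `sample_model` and all default parameter values are hypothetical.

```python
import numpy as np


def sample_model(model, a, n, sigma=1.0, sigma0=1.0, sigma1=2.0, rng=None):
    """Draw n labelled points from the 'linear', 'quadratic' or 'bimodal' model."""
    rng = np.random.default_rng() if rng is None else rng
    a = np.asarray(a, dtype=float)
    y = rng.integers(0, 2, size=n)            # equally likely classes 0 and 1
    z = rng.standard_normal((n, a.size))      # independent unit-variance noise
    if model == "linear":
        # Class 0 at the origin, class 1 at a, common covariance sigma^2 I.
        X = sigma * z + np.outer(y, a)
    elif model == "quadratic":
        # Same means, but a class-dependent standard deviation.
        scale = np.where(y == 1, sigma1, sigma0)[:, None]
        X = scale * z + np.outer(y, a)
    elif model == "bimodal":
        # Class 1 is an equal mixture of Gaussians centred at +a and -a (assumed).
        sign = rng.choice([-1.0, 1.0], size=n)
        X = sigma * z + np.outer(y * sign, a)
    else:
        raise ValueError(f"unknown model: {model!r}")
    return X, y


X, y = sample_model("quadratic", a=[1.0, 0.0, 2.0], n=40,
                    rng=np.random.default_rng(0))
print(X.shape, np.bincount(y, minlength=2))
```

Because the features are generated independently, the marginal model for any subset of features is obtained simply by selecting the corresponding columns of `X`.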
With variances fixed, the Bayes error is solely determined by the distance, $\|\mathbf{a}\|$, between the means of the classes. Moreover, since all features are independent, the set of the $d$ best features is $A_{\text{best}} = \{a_{k_1}, a_{k_2}, \ldots, a_{k_d}\}$, where $\|(a_{k_1}, a_{k_2}, \ldots, a_{k_d})\|$ is maximum for $1 \le k_1, k_2, \ldots, k_d \le D$. In the simulations, each $a_i$ composing $\mathbf{a}$ is independently drawn from a beta distribution, $F(\alpha, \beta)$. We further let $\alpha$ be fixed and let $\beta$ follow a uniform distribution, $U(\beta_1, \beta_2)$. To generate sample points, first we draw randomly from $U(\beta_1, \beta_2)$ to get $\beta$, and then from $F(\alpha, \beta)$ to get $\mathbf{a}$. A sample set of size $n$ is generated for each model. We repeat the procedure $T$ times for a total of $T$ random samples. Apart from $A_{\text{best}}$, which is determined before the sample points are generated, we find a feature set, $A_{\text{SFFS}}$, using the SFFS feature selection.
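The random model generation and the choice of the best feature set can be sketched as below, under the assumption that the class means are separated along a vector whose components are the beta-distributed values. The function names and the parameter values in the usage lines are hypothetical (the values from Table 1 are not reproduced in this copy). For the linear model with equal priors and covariance $\sigma^2 I$, the Bayes error on a subset is the standard expression $\Phi(-\delta / 2\sigma)$, where $\delta$ is the distance between the class means restricted to that subset.

```python
import numpy as np
from math import erf, sqrt


def draw_means(D, alpha, beta_lo, beta_hi, rng):
    """Draw beta ~ U(beta_lo, beta_hi), then D mean components from Beta(alpha, beta)."""
    beta = rng.uniform(beta_lo, beta_hi)
    return rng.beta(alpha, beta, size=D)


def best_subset(a, d):
    """Indices of the d largest |a_i|, which maximise the between-class distance."""
    return np.sort(np.argsort(-np.abs(a))[:d])


def linear_bayes_error(a, subset, sigma):
    """Bayes error of the linear model restricted to `subset`: Phi(-delta / (2 sigma))."""
    delta = np.linalg.norm(np.asarray(a, dtype=float)[subset])
    x = -delta / (2.0 * sigma)
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


rng = np.random.default_rng(0)
a = draw_means(D=50, alpha=2.0, beta_lo=2.0, beta_hi=6.0, rng=rng)  # assumed values
A_best = best_subset(a, d=5)
print(A_best, linear_bayes_error(a, A_best, sigma=1.0))
```

Ranking by $|a_i|$ is valid here only because the features are independent with equal variances; with correlated features no such shortcut exists.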
Overall, for the model-based study, the simulation utilizes the following protocol:
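1. Choose a model by randomly selecting $\beta$ from $U(\beta_1, \beta_2)$ and then $a_1, a_2, \ldots, a_D$ from $F(\alpha, \beta)$ to get $\mathbf{a}$.
2. Obtain $A_{\text{best}} = \{a_{k_1}, a_{k_2}, \ldots, a_{k_d}\}$ from the model ($A_{\text{best}}$ relative to the Bayes classifier).
3. Generate an $n$-point sample $S$ from the model.
4. Design a classifier $\psi_{\text{best}}$ for the feature set $A_{\text{best}}$ according to the classification rule $\Psi$ from $S$.
5. Compute the error $\varepsilon_{\text{best}}$ for $\psi_{\text{best}}$ using the underlying distribution of the model.
6. Apply SFFS using the classification rule $\Psi$ on $S$ to find a feature set $A_{\text{SFFS}}$.
7. Design a classifier $\psi_{\text{SFFS}}$ for the feature set $A_{\text{SFFS}}$ according to the rule $\Psi$ from $S$.
8. Compute the error $\varepsilon_{\text{SFFS}}$ for $\psi_{\text{SFFS}}$ using the underlying distribution of the model.
9. Repeat Steps 1 through 8 $T$ times to form $T$ error pairs $(\varepsilon_{\text{best}}^{(i)}, \varepsilon_{\text{SFFS}}^{(i)})$, $i = 1, 2, \ldots, T$.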
Since we have the underlying distribution, in Steps 4 and 7 we can find the Bayes classifiers for $A_{\text{best}}$ and $A_{\text{SFFS}}$ instead of using the sample data, thereby leading to $T$ Bayes-error pairs, one for each $i = 1, 2, \ldots, T$.
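The protocol can be sketched end to end for the linear model with LDA. This is a simplified illustration rather than the study's code, and it makes several labeled assumptions: t-test ranking stands in for SFFS at the selection step (the same substitution the paper makes in its t-test experiments), the sample is balanced between classes, the class-1 mean is the vector of beta-distributed values, and all parameter values and function names are hypothetical. The true error of a linear classifier is computed analytically from the Gaussian model instead of by simulation.

```python
import numpy as np
from math import erf, sqrt


def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def lda(X, y):
    """Fit LDA with equal priors; returns (w, b), predicting class 1 when w.x + b > 0."""
    m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    Xc = np.vstack([X[y == 0] - m0, X[y == 1] - m1])
    cov = Xc.T @ Xc / (len(X) - 2)            # pooled covariance estimate
    w = np.linalg.pinv(cov) @ (m1 - m0)
    return w, -(w @ (m0 + m1)) / 2.0


def true_error(w, b, a_sub, sigma):
    """Exact error of (w, b) when class 0 ~ N(0, s^2 I) and class 1 ~ N(a_sub, s^2 I)."""
    s = sigma * np.linalg.norm(w)
    return 0.5 * phi(b / s) + 0.5 * phi(-(w @ a_sub + b) / s)


def t_test_select(X, y, d):
    """Rank features by absolute Welch t-statistic and keep the top d."""
    X0, X1 = X[y == 0], X[y == 1]
    se = np.sqrt(X0.var(axis=0, ddof=1) / len(X0) + X1.var(axis=0, ddof=1) / len(X1))
    return np.argsort(-np.abs(X1.mean(axis=0) - X0.mean(axis=0)) / se)[:d]


def run_protocol(T=200, D=50, d=5, n=40, sigma=1.0, alpha=2.0,
                 beta_range=(2.0, 6.0), seed=0):
    """Return a (T, 2) array of (error with A_best, error with selected set) pairs."""
    rng = np.random.default_rng(seed)
    pairs = np.empty((T, 2))
    for i in range(T):
        beta = rng.uniform(*beta_range)                        # Step 1
        a = rng.beta(alpha, beta, size=D)
        best = np.argsort(-a)[:d]                              # Step 2
        y = np.arange(n) % 2                                   # Step 3 (balanced)
        X = sigma * rng.standard_normal((n, D)) + np.outer(y, a)
        selected = t_test_select(X, y, d)                      # Step 6 (stand-in)
        for col, subset in enumerate((best, selected)):        # Steps 4-5 and 7-8
            w, b = lda(X[:, subset], y)
            pairs[i, col] = true_error(w, b, a[subset], sigma)
    return pairs


pairs = run_protocol(T=50)
print(pairs.mean(axis=0))
```

The two columns of `pairs` are the inputs to the scatter plots and conditional-expectation curves discussed in the text.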
A summary of the experiments with different parameters is provided in Table 1. Sample plots for $F(\alpha, \beta)$ are shown in Figure 1.
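We have two interests. First, given the error for the best feature set $A_{\text{best}}$, what error is expected for the SFFS feature set $A_{\text{SFFS}}$? Second, given the error for $A_{\text{SFFS}}$, what error is expected for $A_{\text{best}}$? We denote the results for the two scenarios by $E[\varepsilon_{\text{SFFS}} \mid \varepsilon_{\text{best}}]$ and $E[\varepsilon_{\text{best}} \mid \varepsilon_{\text{SFFS}}]$, respectively. Three figures are plotted for each combination of $\Psi$ and $d$ in every experiment. For the first scenario the three figures are (1) a scatter plot for $(\varepsilon_{\text{best}}, \varepsilon_{\text{SFFS}})$, with the average errors marked with bold dots on their respective axes; (2) a curve of the conditional expectation $E[\varepsilon_{\text{SFFS}} \mid \varepsilon_{\text{best}}]$, estimated by dividing all points into bins based on $\varepsilon_{\text{best}}$, with each bin containing the same number of points, and averaging the corresponding values of $\varepsilon_{\text{SFFS}}$ in each bin; and (3) the scatter plot superimposed with the expectation curve. For the second scenario, the figures are the same but with the roles of $\varepsilon_{\text{best}}$ and $\varepsilon_{\text{SFFS}}$ reversed. In all three figure types, the 45° line is shown, along with the number of bins, and maximum and minimum values. The complete results can be found on our companion website. Figure 2 shows examples of the scatter plots with superimposed expectation curves for the quadratic model in experiment 1: (a) $E[\varepsilon_{\text{SFFS}} \mid \varepsilon_{\text{best}}]$, $d = 5$ for LDA; (b) $E[\varepsilon_{\text{best}} \mid \varepsilon_{\text{SFFS}}]$, $d = 5$ for LDA; (c) $E[\varepsilon_{\text{SFFS}} \mid \varepsilon_{\text{best}}]$, $d = 5$ for 3NN; (d) $E[\varepsilon_{\text{best}} \mid \varepsilon_{\text{SFFS}}]$, $d = 5$ for 3NN; (e) $E[\varepsilon_{\text{SFFS}} \mid \varepsilon_{\text{best}}]$, $d = 10$ for LDA; (f) $E[\varepsilon_{\text{best}} \mid \varepsilon_{\text{SFFS}}]$, $d = 10$ for LDA; (g) $E[\varepsilon_{\text{SFFS}} \mid \varepsilon_{\text{best}}]$, $d = 10$ for 3NN; and (h) $E[\varepsilon_{\text{best}} \mid \varepsilon_{\text{SFFS}}]$, $d = 10$ for 3NN.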
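The equal-count binning estimate of a conditional expectation described above can be sketched as follows. The function name `binned_conditional_mean` is hypothetical, and the number of bins is left to the caller because the values used in the study are not stated in the text.

```python
import numpy as np


def binned_conditional_mean(x, y, n_bins):
    """Estimate E[y | x] by sorting on x, splitting into equal-count bins and averaging.

    Returns the mean x and mean y of each bin; n_bins should not exceed len(x).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(x)
    x_bins = np.array_split(x[order], n_bins)   # bin sizes differ by at most one
    y_bins = np.array_split(y[order], n_bins)
    centers = np.array([b.mean() for b in x_bins])
    means = np.array([b.mean() for b in y_bins])
    return centers, means


# Toy data (illustrative only): y depends weakly on x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0.05, 0.25, size=1000)
y = 0.1 + 0.5 * x + rng.normal(0.0, 0.02, size=1000)
centers, means = binned_conditional_mean(x, y, n_bins=10)
print(np.round(centers, 3), np.round(means, 3))
```

Swapping the arguments gives the estimate for the second scenario, with the roles of the two errors reversed.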
In the model study we have applied t-test feature selection to see if the results are consistent. This means that the t-test replaces SFFS in Step 6 of the simulation procedure. Since our practical concern is with SFFS, we have only performed this for the quadratic model in experiment 1. As can be seen on the companion website, the results are quite similar and we will say no more about the t-test results.
3.2 Patient study
Similar experiments are conducted using patient data from a microarray-based classification study that analyzes microarrays prepared with RNA from breast tumor samples from 295 patients (van de Vijver et al., 2002). Using a previously established 70-gene prognosis profile (van't Veer et al., 2002), a prognosis signature based on gene expression is proposed in (van de Vijver et al., 2002) that correlates well with patient survival data and other clinical measures. Of the 295 microarrays, 115 belong to the good-prognosis class and 180 belong to the poor-prognosis class.
Our first experiment uses intensity gene-expression values associated with the D = 70 genes. The best feature sets of size d = 5, 6 and 7 are obtained from the test bed developed in Choudhary et al. (2006). Because we lack the Bayes classifier in this empirical study, the best feature sets are taken from the test bed for the rule $\Psi$ being considered. The second experiment uses intensity gene-expression values for D = 50 genes randomly selected from the original 24 496 genes, and the best feature sets of size d = 4 are developed using the same methodology as in the test-bed paper. The third experiment uses the D = 50 variance-selected genes. The following protocol is utilized for the patient data:
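1. For the rule $\Psi$, obtain $A_{\text{best}}$ directly from the test bed for the D = 70 genes, or develop $A_{\text{best}}$ for the randomly or variance-selected D = 50 genes.
2. Generate a 50-point sample $S$ from the 295-point empirical distribution.
3. Design a classifier $\psi_{\text{best}}$ for the feature set $A_{\text{best}}$ according to the rule $\Psi$ from $S$.
4. Compute the error $\varepsilon_{\text{best}}$ for $\psi_{\text{best}}$ using hold-out on the 245 points not in $S$.
5. Apply SFFS using the rule $\Psi$ on $S$ to find a feature set $A_{\text{SFFS}}$.
6. Design a classifier $\psi_{\text{SFFS}}$ for the feature set $A_{\text{SFFS}}$ according to the rule $\Psi$ from $S$.
7. Compute the error $\varepsilon_{\text{SFFS}}$ for $\psi_{\text{SFFS}}$ using hold-out on the 245 points not in $S$.
8. Repeat Steps 1 through 7 $T$ times to form $T$ error pairs $(\varepsilon_{\text{best}}^{(i)}, \varepsilon_{\text{SFFS}}^{(i)})$, $i = 1, 2, \ldots, T$.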
It should be noted that the samples are not fully independent on account of overlap resulting from choosing the 50 sample points from among the same 295 sample points; however, as discussed in Braga-Neto and Dougherty (2004b), the samples are only weakly dependent. Owing to the dependency, we limit the total number of samples T to 200. Since this number is insufficient to estimate the conditional expectation, for the patient data we employ linear regression. For the patient data, Figure 3 corresponds to Figure 2, with parts (a)-(d) for the D = 70 genes (the results for 5NN are similar to those for 3NN and are provided on the companion website) and parts (e)-(h) for the randomly selected D = 50 genes. Similar regression is found in the results for the variance-selected D = 50 genes, which are on the companion website.
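The resampling and regression steps for the patient data can be sketched as below. The sketch covers only the mechanics of drawing a 50-point design sample, computing a hold-out error on the remaining 245 points and fitting the regression line. The function names are hypothetical, the classifier is passed in as an arbitrary `predict` callable, and the data in the usage lines are synthetic placeholders rather than patient data.

```python
import numpy as np


def holdout_split(n_total, n_train, rng):
    """Randomly split indices 0..n_total-1 into a design set and a hold-out set."""
    perm = rng.permutation(n_total)
    return perm[:n_train], perm[n_train:]


def holdout_error(predict, X, y, test_idx):
    """Fraction of hold-out points that `predict` labels incorrectly."""
    return float(np.mean(predict(X[test_idx]) != y[test_idx]))


def regression_line(x, y):
    """Least-squares line y = slope * x + intercept through the error pairs."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return float(slope), float(intercept)


rng = np.random.default_rng(0)
train_idx, test_idx = holdout_split(295, 50, rng)
print(len(train_idx), len(test_idx))

# Placeholder data: labels determined by the sign of the first feature.
X = rng.standard_normal((295, 4))
y = (X[:, 0] > 0).astype(int)
print(holdout_error(lambda Z: (Z[:, 0] > 0).astype(int), X, y, test_idx))
```

Repeating the split T times and passing the resulting error pairs to `regression_line` gives the straight-line summary used in place of the binned conditional expectation.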
| 4 DISCUSSION |
|---|
4.1 Model-based study
In discussing the results for the model-based study, we focus on the quadratic model of experiment 1. Similar observations apply to the other models. The case $E[\varepsilon_{\text{SFFS}} \mid \varepsilon_{\text{best}}]$ concerns our first question, predicting the performance of a selected feature set based on the performance of the best feature set. Parts (a) and (c) of Figure 2 provide the scatter plots and conditional expectations for LDA and 3NN for d = 5 (linear SVM being very similar to LDA). The expectation curve for LDA is approximately parallel to the 45° line, with $\varepsilon_{\text{SFFS}}$ exceeding $\varepsilon_{\text{best}}$ by approximately 0.05 for the bulk of the mass, including the mean of $\varepsilon_{\text{best}}$. The situation improves for $\varepsilon_{\text{best}} > 0.13$, but there is little mass there. The situation is worse for 3NN, with $\varepsilon_{\text{SFFS}}$ exceeding $\varepsilon_{\text{best}}$ by approximately 0.07 for most of the mass, with improvement only for
and the improvement being less pronounced. The corresponding plots for d = 10 are in parts (e) and (g) of Figure 2. For LDA, the expectation curve is similar to that in the case of d = 5, except that the errors are smaller and
$\varepsilon_{\text{SFFS}}$ exceeds $\varepsilon_{\text{best}}$ by less. For 3NN, the expectation curve is also similar and the errors are smaller; however, the amount by which $\varepsilon_{\text{SFFS}}$ exceeds $\varepsilon_{\text{best}}$ is substantially more than for d = 5, indicating worse prediction. The salient point deduced from the $E[\varepsilon_{\text{SFFS}} \mid \varepsilon_{\text{best}}]$ expectation curves is that one can expect the error of a selected feature set to be substantially worse than the error of the best feature set.
The case $E[\varepsilon_{\text{best}} \mid \varepsilon_{\text{SFFS}}]$ concerns our second question, predicting the performance of the best feature set based on the performance of the selected feature set. Parts (b) and (d) of Figure 2 provide the scatter plots and conditional expectations for LDA and 3NN for d = 5. In some sense, these are inverse to the $E[\varepsilon_{\text{SFFS}} \mid \varepsilon_{\text{best}}]$ plots, with $\varepsilon_{\text{SFFS}}$ exceeding $\varepsilon_{\text{best}}$ by about 0.05 for LDA and 0.07 for 3NN. The difference is with the interpretation. Since the expectation curves for $E[\varepsilon_{\text{best}} \mid \varepsilon_{\text{SFFS}}]$ are close to being horizontal, and especially so for large feature-set errors, there is little relation between the errors of the classifiers designed from the selected and best feature sets. In particular, if feature selection yields a poor result, one should not conclude that there do not exist good feature sets. Indeed, if we look at Figure 2d for 3NN, there is a substantial number of samples that yield
and
, and many for which
and
.
4.2 Patient study
For the patient data, we focus on 3NN for the D = 70 case, referring to parts (c) and (d) of Figure 3 (the results for all other cases being very similar). In part (c), linear regression for the patient data yields a straight line that has important similarities with the curve for $E[\varepsilon_{\text{SFFS}} \mid \varepsilon_{\text{best}}]$ in the model-based study: (1) the line is increasing; (2) the line lies almost entirely above the 45° line; (3) for the bulk of the mass, $\varepsilon_{\text{SFFS}}$ significantly exceeds $\varepsilon_{\text{best}}$, with $E[\varepsilon_{\text{SFFS}}]$ exceeding $E[\varepsilon_{\text{best}}]$ by 0.08; and (4) only for large values of $\varepsilon_{\text{best}}$ can we expect the two errors to be close, and there is little mass in this region. As with the model-based analysis, one can expect the error of a selected feature set to be substantially worse than the error of the best feature set. Note, however, that with the patient data the regression line is more horizontal, indicating less predictability than for the synthetic data. The situation with $E[\varepsilon_{\text{best}} \mid \varepsilon_{\text{SFFS}}]$ for 3NN with the patient data is striking. The regression line is practically horizontal, and once again nothing can be concluded from a poor result when using feature selection. We note that using quadratic regression yields curves not significantly different from those for linear regression.
Although our main interest is with designed classifiers, in the model-based studies we have also considered the Bayes-error pairs, with the corresponding scatter plots and expectation curves being provided on the companion website. Although there are some differences in the curvatures of the expectation curves for the Bayes-error pairs, these are not significant. The main difference in the Bayes-error scatter plots is that they are tighter and show smaller errors than for the designed classifiers (as is expected). In the model-based studies the Bayes error for $A_{\text{best}}$ is always smaller than the Bayes error for $A_{\text{SFFS}}$, indicating that $A_{\text{SFFS}}$ is not optimal. On the other hand, although $\varepsilon_{\text{best}} \le \varepsilon_{\text{SFFS}}$ for most points, there are points with $\varepsilon_{\text{best}} > \varepsilon_{\text{SFFS}}$. This occurs because the optimal feature set is defined over the whole distribution, whereas feature selection is carried out over a particular sample, thereby making it possible that $\psi_{\text{best}}$, designed according to the sample, may be outperformed by $\psi_{\text{SFFS}}$.
4.3 Concluding remarks
The lack of relation between the errors of the best and selected feature sets is observed throughout our experiments, including both models and patient data, different classification rules, and different feature-set sizes. It is generally more evident for higher variance cases (experiments 3 and 4) than lower variance cases (experiments 1 and 2), and for smaller sample sizes (experiments 1 and 3) than for larger sample sizes (experiments 2 and 4), reflecting the comparative difficulty of feature selection. With regard to the two focus questions, there is similarity across all experiments: (1) One cannot expect to find a feature set whose error is close to optimal, and (2) the inability to find a good feature set should not lead to the conclusion that good feature sets do not exist.
In practice, these conclusions mean that if one discovers a feature set with a satisfactory error (estimate) for a particular application, then classifier design can be considered pragmatically successful, even though one can be quite confident that substantially better feature sets exist. On the other hand, if no satisfactory feature set is found, then one is left hanging, uncertain of whether the search for prognostic markers is futile on account of biological conditions or simply because of the insufficiency of the data relative to the high dimensionality of the problem.
These conclusions lead to the further conclusion that we require feature-selection techniques that are not purely data driven. The search for features needs to be constrained and directed by the use of prior biological knowledge, for instance, with respect to pathways known (or suspected) to be related to the pathology in question. The existence of convenient knowledge databases alone will not suffice; there needs to be close cooperation between cancer biologists and engineers in order to discover how to use the existing knowledge. Moreover, confidence in the correctness of prior assumptions should be integrated into the design techniques, perhaps via fuzzy or probabilistic reasoning. The results presented in this paper point to limitations of a certain kind of approach. Hopefully this will stimulate the search for different kinds of approaches.
| Acknowledgments |
|---|
This research has been supported in part by the National Cancer Institute (CA104620), the National Science Foundation (CCF0514644) and the Translational Genomics Research Institute.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Satoru Miyano
Received on February 2, 2006; revised on July 4, 2006; accepted on July 22, 2006
| REFERENCES |
|---|
Barrier, A., et al. (2005) Colon cancer prognosis prediction by gene expression profiling. Oncogene, 24, 6155-6164.
Braga-Neto, U.M. and Dougherty, E.R. (2004a) Bolstered error estimation. Pattern Recogn., 37, 1267-1281.
Braga-Neto, U.M. and Dougherty, E.R. (2004b) Is cross-validation valid for small-sample microarray classification? Bioinformatics, 20, 374-380.
Choudhary, A., et al. (2006) Genetic test bed for feature selection. Bioinformatics, 22, 837-842.
Cover, T.M. and Van Campenhout, J. (1977) On the possible orderings in the measurement selection problem. IEEE Trans. Syst. Man Cybernet., 7, 657-661.
Hedenfalk, I., et al. (2001) Gene-expression profiles in hereditary breast cancer. N. Eng. J. Med., 344, 539-548.
Jain, A.K. and Zongker, D. (1997) Feature selection: evaluation, application, and small sample performance. IEEE Trans. Pattern Anal. Machine Intell., 19, 153-158.
Kudo, M. and Sklansky, J. (2000) Comparison of algorithms that select features for pattern classifiers. Pattern Recogn., 33, 25-41.
Pudil, P., et al. (1994) Floating search methods in feature selection. Pattern Recogn. Lett., 15, 1119-1125.
Reedijk, M., et al. (2005) High-level coexpression of JAG1 and NOTCH1 is observed in human breast cancer and is associated with poor overall survival. Cancer Res., 65, 8530-8537.
Sima, C., et al. (2005) Impact of error estimation on feature-selection algorithms. Pattern Recogn., 38, 2472-2482.
van de Vijver, M.J., et al. (2002) A gene-expression signature as a predictor of survival in breast cancer. N. Eng. J. Med., 347, 1999-2009.
van't Veer, L.J., et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415, 530-536.
Wei, J.S., et al. (2004) Prediction of clinical outcome using gene expression profiling and artificial neural networks for patients with neuroblastoma. Cancer Res., 64, 6883-6891.