Bioinformatics Advance Access originally published online on October 25, 2005
Bioinformatics 2006 22(1):88-95; doi:10.1093/bioinformatics/bti736
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Gene selection using support vector machines with non-convex penalty
1Department of Statistics, North Carolina State University Raleigh, NC 27695, USA
2Department of Statistics and Operations Research, University of North Carolina Chapel Hill, NC 27599, USA
3Department of Mathematical Sciences, University of Cincinnati OH 45221, USA
4Department of Statistics, University of Georgia Athens, GA 30602, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: With the development of DNA microarray technology, scientists can now measure the expression levels of thousands of genes simultaneously in one single experiment. One current difficulty in interpreting microarray data comes from their innate nature of high-dimensional low sample size. Therefore, robust and accurate gene selection methods are required to identify differentially expressed group of genes across different samples, e.g. between cancerous and normal cells. Successful gene selection will help to classify different cancer types, lead to a better understanding of genetic signatures in cancers and improve treatment strategies. Although gene selection and cancer classification are two closely related problems, most existing approaches handle them separately by selecting genes prior to classification. We provide a unified procedure for simultaneous gene selection and cancer classification, achieving high accuracy in both aspects.
Results: In this paper we develop a novel type of regularization in support vector machines (SVMs) to identify important genes for cancer classification. A special nonconvex penalty, called the smoothly clipped absolute deviation penalty, is imposed on the hinge loss function in the SVM. By systematically thresholding small estimates to zeros, the new procedure eliminates redundant genes automatically and yields a compact and accurate classifier. A successive quadratic algorithm is proposed to convert the non-differentiable and non-convex optimization problem into easily solved linear equation systems. The method is applied to two real datasets and has produced very promising results.
Availability: MATLAB codes are available upon request from the authors.
Contact: hzhang{at}stat.ncsu.edu
Supplementary information: http://www4.stat.ncsu.edu/~hzhang/research.html
| 1 INTRODUCTION |
|---|
|
|
|---|
We consider the problem of gene selection for cancer classification using microarray gene expression data. The objective of gene selection is 2-fold: to provide a better understanding of the underlying biological system that generates data and to improve the prediction performance of classifiers. Effective gene selection often leads to a compact classifier with better accuracy and interpretability (Kitter, 1986).
Gene selection is treated as a variable selection problem in statistics and a dimension reduction problem in machine learning. Many greedy algorithms have been developed in the literature. Gene-ranking methods are particularly popular, which select genes according to some predetermined ranking criteria. There are two main types of ranking criteria: correlation coefficients (Golub et al., 1999; Furey et al., 2000; Pavlidis et al., 2001) and hypothesis testing statistics. Two-sample t-test methods include parametric tests (Devore and Peck, 1997; Thomas et al., 2001; Pan, 2002) and non-parametric tests (Troyanskaya et al., 2002; He, 2004). Although being useful in practice, all these methods that select important genes based on individual gene information thus fail to take into account mutual information among genes. Dimension reduction techniques project the full data onto the first few principal directions then conduct classification in the low-dimensional subspace. West (2003) proposed the idea of meta-genes, which are linear combinations of the original genes. One disadvantage of projection methods is that none of the original genes can be discarded since each principal component generally involves all the genes.
Support vector machines (SVMs) (Boser et al., 1992; Vapnik, 1995; Cristianini and Shawe-Taylor, 1999) have demonstrated superior performances in classifying high-dimensional and low sample size data. However, the standard SVM can suffer from the presence of redundant variables (Hastie et al., 2001; Guyon et al., 2002), since its decision rule utilizes all the variables without discrimination. Several methods have been proposed for variable selection in the SVM (Furey et al., 2000; Rakotomamonjy, 2003; Grandvalet and Canu, 2002; Mukherjee et al., 2000; Chapelle et al., 2002; Weston et al., 2000). Guyon et al. (2002) developed the recursive feature elimination algorithm, which successively eliminates features by training a sequence of SVM classifiers. Bradley and Mangasarian (1998) suggested the L1 SVM, which imposes the absolute value penalty on the directional vector of the separating plane.
Different from all the methods above, we formulate the SVM as a regularization problem with a novel form of the penalty. The optimization problem consists of two parts: the data fit is represented by the hinge loss function, and the regularization is defined as the smoothly clipped absolute deviation penalty. In the regression context, this penalty was proposed and studied by Fan and Li (2001) and shown to have better theoretical properties than the L1 penalty. Following their terminology, we will refer our method as the SCAD SVM. The SCAD SVM conducts variable selection and classification simultaneously, resulting in a compact classifier with high accuracy. We give an iterative algorithm to solve the SCAD SVM, and show that only linear equation system solvers are needed for its implementation. The SCAD SVM is applicable to any biological data of high-dimensional low sample size. Its performances on one microarray dataset and on one metabolism dataset are illustrated in the paper.
| 2 METHODS |
|---|
|
|
|---|
Given a training set
, where xi = (xi1,...xid
d is the input vector and yi
{+1, 1} indicates its class label, the classification problem is to learn a discrimination rule f :
d
{ +1, 1} so that we can assign a class label to any new subject observed in the future. For microarray gene expression data, xi represents the expression levels of d genes of the i-th sample tissue and yi is normal or cancerous; often we have d >> n. In the statistical framework, we assume (xi, yi)s are independent realizations of the random pair (X, Y) which follows a joint distribution P(X, Y). Define g(x) = Prob(Y = +1|X = x). With the 0-1 loss
![]() |
2.1 Support vector machines
SVM is a large margin classifier which separates two classes by maximizing the margin between them. For non-separable data, the soft-margin SVM uses the slack variable to control an upper bound of the misclassification error. For classifying the data with complicated structures where a linear separation is not plausible, the non-linear SVM maps the data from the original input space into a high-dimensional feature space and then implements the linear classification in the feature space. Lin (2002) showed that under some general conditions, the SVM solution approaches the Bayes rule when the sample size increases.
The SVM finds f(x) = b + w · h(x) by minimizing
![]() | (1) |
is a dictionary of basis functions. The parameter
controls the trade-off between minimizing the loss function and maximizing the margin. The hinge loss [1 yf(x)]+ is a convex upper bound for the 0-1 loss. One nice property of the SVM is that its solution only depends on a small subset of the training set called support vectors. The standard SVM cannot select important variables, since all the input variables are used for constructing the classifier. For variable selection purposes, the following thresholding functions can be used to replace the L2 penalty ||w||2:
![]() |
function is a soft-thresholding penalty only if
1 (Bradley and Mangasarian, 1998). This explains why the standard SVM corresponding to
= 2 does not select variables. In Figure 1, the first three penalty functions are all non-differentiable at the origin, which is a necessary condition for a penalty function to produce sparse solutions (Fan and Li, 2001).
|
In the statistics literature, the L1 penalty is also known as the LASSO (Tibshirani, 1996) and widely used for variable selection in linear regression models. The L1 SVM was proposed by Bradley and Mangasarian (1998). Recently Zhu et al. (2003) studied its solution properties and suggested an algorithm to find the whole solution path over a range of tuning parameters. Fung and Mangasarian (2004) developed a fast Newton algorithm to solve its dual problem. Other methods incorporating the model parsimony include some Bayesian methods (Lee et al., 2003; Bae and Mallick, 2004).
2.2 The SCAD SVM
Though the L1 penalty gives sparse solutions, the estimates can be biased for large coefficients since larger penalties are imposed on larger coefficients. In linear regression models, Fan and Li (2001) proposed the smoothly clipped absolute deviation (SCAD) penalty which overcomes the biasness problem of the L1 penalty. They showed that the SCAD penalty produces sparse solutions by thresholding small estimates to zero, provides nearly unbiased estimates for large coefficients and gives a model continuous in data. In this paper, we propose the SCAD SVM to conduct variable selection in the context of classification and study its performances on high-dimensional low sample size data.
The SCAD function, as plotted in Figure 2, is symmetric, nonconvex and singular at the origin. Though having the same form as the L1 penalty at the neighborhood of zero, the SCAD applies a constant penalty for large coefficients while the L1 penalty increases linearly as the coefficient increases. It is this distinct feature that guards the SCAD penalty against producing biases for estimating large coefficients. Mathematically, the SCAD penalty has the expression
![]() | (2) |
> 0 are tuning parameters. In Figure 2, we have a = 3 and
= 0.4. The function in (2) is a quadratic spline function with two knots at
and a
. Except being singular at the origin, the function p
(w) has a continuous first-order derivative. We propose the SCAD SVM as
![]() | (3) |
balances the trade-off between data fitting and model parsimony. If
is too small, the procedure tends to overfit the training data and gives a classifier with little sparsity; if
is too large, the produced classifier can be very sparse but have a poor discriminating power. To tune
properly, we generate a tuning set for the simulated data and use cross validation for the real data. Fan and Li (2001) showed that the Bayes risks are not sensitive to the choice of a, and a = 3.7 is a good choice for various problems. We also use a = 3.7 in our examples.
|
Interestingly, when d >> n, linear classifiers often give better performances than non-linear ones in many applications (Hastie et al., 2001), even though non-linear methods are known to be more flexible. This fact is related to the asymptotic results in Hall et al. (2005): when d >> n, under mild assumptions for data distribution, the pairwise distances between any two points are approximately identical to each other so the data points form an n-simplex. Linear classifiers then become natural choices to discriminate two simplices. Since we focus on classifying high-dimensional low sample size data, only linear SVMs are considered in this paper. In other words, we use the input vector x as basis functions, i.e. h(x) = x and q = d.
| 3 ALGORITHM |
|---|
|
|
|---|
The standard SVM and L1 SVM are often solved using quadratic programming and linear programming methods. However, many standard optimization packages fail to solve (3), because the hinge loss function is not differentiable at zero and the SCAD penalty is not convex in w. In this section, we propose an iterative algorithm to solve the SCAD SVM efficiently and show that only a series of linear equation systems need to be solved.
Successive quadratic algorithm (SQA) is a generalization of Newton's method for unconstrained optimization in that it finds a step away from the current point by minimizing a quadratic approximation of the problem. Numerous optimization packages, including NPSOL, NLPQL, OPSYC, OPTIMA and MATLAB, are found on this approach (More and Wright, 1993). We propose using the SQA to solve the SCAD SVM.
Denote the objective function in (3) by A(b, w). For each i, we have
and
![]() | (4) |
![]() |
(|wj|), we use the following quadratic approximation
![]() |
![]() |
![]() |
Define the matrix X = [1, x1, ..., xd], where 1 is the vector of 1s with length n and xj is the j-th input vector. Define y = [y1, ..., yn]T, w = [w1, ..., wd]T and
= [
1, ...,
n]T with
i = yi (b0 + w0 · xi). Define r = [y1/|
1|, ..., yn/|
n|]T,
,
and
,
. Then the approximate optimization problem becomes
![]() | (5) |
![]() | (6) |
- Step 1: Set k = 1 and specify the initial value (b(1), w(1)).
- Step 2: Let (b0, w0) = (b(k), w(k)). Minimize
(b, w) by solving (6). The solution is denoted as (b(k+1), w(k+1)).
- Step 3: Let k = k + 1. Go to step 2 until convergence.
- Step 2: Let (b0, w0) = (b(k), w(k)). Minimize
is very close to zero, say, smaller than a certain threshold, then the j-th variable is regarded as redundant. Following Fan and Li (2001), we remove the j-th column from the matrix X and adjust the coefficient matrix in (6) correspondingly. The iteration then continues as a reduced optimization problem. The algorithm stops when there is no change in (b(k), w(k)). The j-th variable is regarded as unimportant if |wj| <
, where
is a preselected small positive thresholding value. Based on our experiences, the solutions from the standard SVM provide good starting values. In all of our examples, this algorithm converges quickly. | 4 SIMULATION |
|---|
|
|
|---|
In the high-dimensional low sample size setting, we simulate a dataset which contains many redundant variables. Four methods are compared: standard SVM L1 SVM, SCAD SVM and another classifier called distance weighted discrimination (DWD). The DWD is a large-margin classifier developed using the second-order cone programming by Marron et al. (2004), and it does not suffer from the data piling problem as the standard SVM. We use the OSU SVM package (http://svm.sourceforge.net/docs/3.00/api) to implement the SVM and the algorithm of Fung and Mangasarian (2004) to implement the L1 SVM. A tuning set with the same size as the training set is used to choose the optimal
. Each classifier is evaluated on a test set of size 500. The thresholding value we used for removing variables is 0.001. This example is a modification of the example used in Weston et al. (2000). There are d = 200 inputs and only the first two are relevant. The probability of Y = +1 or 1 is equal. The inputs X1 and X2 are drawn from a mixture of Normal distributions: with probability 0.7, we have X1 = YN(3, 1) and X2 = N(0, 1); with probability 0.3, we have X1 = N(0, 1) and X2 = YN(3, 1). The inputs Xj, j = 3, ..., 200 are independently generated from N(0, 20). The Bayes rule is slightly non-linear around the origin but can be approximated well by linear functions. We consider various settings for the sample size: n = 20, 30, 40, 50, 70, 100. In each setting, we run 30 replicates and plot the average test errors in Figure 3. As n increases, the test errors of the SCAD SVM and L1 SVM decrease prominently compared with those of the DWD and the standard SVM. This suggests that variable selection is important when too many redundant variables are present. Furthermore, the SCAD SVM consistently outperforms the L1 SVM in all the settings.
|
Table 1 shows the average number of variables selected over 30 runs for each method. The values in the parentheses are the standard errors of the corresponding mean values. Since the DWD and the SVM are not designed to select variables, they do not perform well in variable selection. It is observed that the SCAD SVM selects a smaller and a more stable (with smaller standard errors) number of variables than the L1 SVM in almost all cases. Table 2 compares the frequency of selecting correct variables (X1 and X2) in 30 runs between the L1 and SCAD SVM. When the sample size is too small, both methods experience a certain level of difficulty in selecting two important variables from 200 variables. As n increases, both methods tend to select the two variables more correctly. The SCAD SVM performs slightly better than the L1 SVM.
|
|
In Figure 4, for one particular simulated data of size n = 100 we plot the classification boundaries given by the Bayes rule and four learning methods. Note the Bayes rule is available only for simulated data. We useo for the points from the +1 class and x for those from the 1 class. The line symbols are the DWD (dashed), the SVM (dotted), the L1 SVM (dash-dotted), the SCAD SVM (thick solid), and the Bayes rule (thin solid). Since only X1 and X2 are truly relevant to the classification boundary, all the classifiers have been projected from the 200-dimensional input space to the first two-dimensional subspace. Figure 4 shows that the SCAD SVM classifier is the closest to the Bayes rule among all the classifiers. This explains why the SCAD SVM has the smallest test error rate as shown in Figure 3.
|
| 5 REAL DATA |
|---|
|
|
|---|
In practice, gene-ranking methods are widely used to select genes, which are highly differentiated between two types of tissues prior to training. Ranking criteria are often based on the t-statistic (Pan, 2002) or correlation coefficients. For each gene xj, the mean
(resp.
) and standard deviation
(resp.
) using only the tissues labeled +1 (resp. 1) are calculated. Define
Golub et al. (1999) selected p genes with the largest positive wjs and p genes with the largest negative wjs. Furey et al. (2000) used |wj| to select top p genes. Pavlidis et al. (2001) suggested a Fisher-discriminant type correlation coefficient. We compare our method with two ranking methods: t-test and Furey et al. (2000). Each ranking criterion is first applied to select the top 50 and 100 genes, then the standard SVM is fitted. There are two problems with ranking methods: (1) one has to specify the number of selected genes p in advance and often subjectively and (2) the selection is individual-based and hence ignores correlation among genes.
5.1 UNC breast cancer dataset
Three public microarray gene expression datasets are used in this section. They are from Perou et al. (2000), van't Veer et al. (2002), Sotiriou et al. (2003), respectively, and for convenience we use Stanford, Rosetta, and Singapore to refer them. Originally the three sets have 5974 genes and 104 patients, 24187 genes and 97 patients, and 7650 genes and 99 patients, respectively. In a recent study (Z. Hu, C. Fan, J.S. Marron, X. He, B.F. Qaqish, G. Karaca, C. Livasy, L. Carey, E. Reynolds, L. Dressler, A. Nobel, J. Parker, W.G. Ewend, L.R. Sawyer, D. Xiang, J. Wu, Y. Liu, M. Karaca, R. Nanda, M. Tretiakova, A.R. Orrico, D. Dreher, J.P. Palazzo, L. Perreard, E. Nelson, M. Mone, H. Hansen, M. Mullins, J.F. Quackenbush, O.I. Olapade, B.S. Bernard and C.M. Perou (2005). The molecular portraits of breast tumors are conserved across microarray platforms, manuscript submitted.), the three datasets are imputed for missing values, combined and then corrected to adjust the bias since they are from three different batches. The DWD is used for the batch adjustment process; see Benito et al. (2004) for a detailed description of the systematic bias adjustment for microarray data using the DWD. As for the gene identifier for the combined dataset, UniGene is used since it is most convenient to map the identifiers from each dataset to UniGene identifier (Build 161). In case of multiple occurrences of a UCID, the median value is used.
The combined dataset has 2924 genes and totally 300 patients. Our primary interest is to select important genes and use them to classify the tissues into two different types of breast cancer. We use the source information to separate the whole data into three folds naturally. We train each classifier on two folds and test it on the remaining one. For example, the SCAD SVM is first trained on Rosetta and Singapore data, then tested on Stanford data. We refer this as Stanford learning. Then we repeat this procedure for the other two learnings, referred as the Rosetta learning and the Singapore learning. To choose the tuning parameter
, we use 10-fold cross validation within the training set.
Table 3 shows the cross validation error in each learning and the average error rate for five methods. In the Stanford learning, the SCAD SVM has the lowest error rate 0.115. In the Rosetta learning, the SCAD SVM and the SVM are equally best. In the Singapore learning, the SVM is best and the SCAD SVM is slightly worse and the second best. Two ranking methods give the same result (hence only t-test reported), and they are worse than the SVM which uses all the genes. It can be explained by that the ranking methods select individual genes separately and ignore their correlations. For a fair comparison, the DWD is not included here because it was used for preprocessing the combined data. Overall speaking, the SCAD SVM gives the lowest error rate among all.
|
Table 4 gives the number of genes selected in each learning by the SCAD SVM and the L1 SVM. The L1 SVM selects 5972 genes in each learning, where the SCAD SVM only selects 1531 genes for each case. Note that the misclassification rates of the SCAD SVM shown in Table 2 are only based on the selected 1531 genes, which shows the very strong gene selection power of the method. Also note that the gene selection results of both methods are consistent in the sense that they both select the smallest number of genes for prediction in the Stanford learning and both select the largest number of genes in the Singapore learning.
|
Figure 5 lists the UniGene identifiers of all the genes that are selected at least three times by either the SCAD SVM or the L1 SVM. In the second and third columns are the frequencies of each gene selected in three learnings by, respectively, the SCAD SVM and the L1 SVM. The sum of these two columns is in the fourth column, and the fifth column lists the number of times that each gene selected by the t-test. The sixth column shows whether or not the selected gene is in the list of intrinsic genes selected by Perou et al. (2000). The last column displays the corresponding descriptive names of UniGene identifiers. We see that the top gene Hs.169946 is selected by all the methods in each learning and also classified as an intrinsic gene. Hs.79136 and Hs.80420, both intrinsic genes, are selected three times by the SCAD SVM but only two times by the L1 SVM. The top five genes are intrinsic and also selected by the t-test. However, there are 9 out of total 27 selected genes which are neither intrinsic nor selected by t-test. This suggests that one should consider the multivariate gene selection approaches, such as the SCAD SVM and the L1 SVM, rather than individual gene-by-gene methods such as t-test procedures.
|
5.2 Metabolism dataset
Metabolic datasets contain the quantitative measurements of all small molecule metabolites in biological samples. Some biological studies show that most of the metabolites are not informative in predicting disease or non-disease outcomes (Stitt and Fernie, 2003). Consequently, hybrid methods that incorporate variable selection with classification techniques can be very effective in analyzing datasets of this sort. Our metabolism dataset is provided by Metabolon, Inc. and we are actually one of the first research groups to analyze it.
There are metabolic profiles of 63 samples: 32 healthy subjects and 31 subjects diagnosed with a certain disease. Within the patient group, 9 subjects are taking medication and 22 are are not. For each sample, its metabolic profile contains the intensity levels of 317 compounds (metabolites). Table 5 shows the average leave-one-out cross validation error and the number of metabolites selected by each method. The SCAD SVM gives the smallest cross validation error 0.143. Moreover, the SCAD SVM selects 18 important metabolites out of 317 and the L1 SVM selects 32 metabolites. Hence the SCAD SVM achieves the highest classification accuracy using the fewest number of metabolites. This result has great implications on metabolic studies, since one main issue from biological aspects is to identify which metabolites are more relevant to the occurrence of the disease.
|
| 6 DISCUSSION |
|---|
|
|
|---|
For high-dimensional low sample size data, redundant input variables can affect the performances of classifiers. How to combine variable selection and classification in a unified framework has become an imminent problem. In this paper, we propose a new regularization technique for simultaneous classification and variable selection in the SVM. Compared with other methods, our non-convex penalty function achieves more compactness and better accuracy, showing great potential for gene selection in cancer classification problems.
The non-convexity of the penalty function introduces great difficulties in optimization. To implement the SCAD SVM efficiently, we developed an iterative procedure based on the successive quadratic algorithm. From both simulated and real data analysis, we found that this algorithm converges quickly. Overall, the SCAD SVM gives competitive results in terms of both variable selection and classification.
| Acknowledgments |
|---|
The authors are grateful to Dr C. M. Perou and his colleagues for providing breast cancer microarray datasets. This research was supported by Statistical and Applied Mathematical Sciences Institute (SAMSI) data mining and machine learning program during 2003-2004. H.H.Z.'s research is partially supported by NSF grant DMS-0405913. X.L.'s research is partially supported by a fellowship from Charles Phelps Taft Research Center.
Conflict of interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: John Quackenbush
Received on June 12, 2005; revised on October 20, 2005; accepted on October 20, 2005
| REFERENCES |
|---|
|
|
|---|
Bae, K. and Mallick, B.K. (2004) Gene selection using a two-level hierarchical Bayesian model. Bioinformatics, 20, 34233430
Benito, M., et al. (2004) Adjustment of systematic microarray data biases. Bioinformatics, 20, 105144
Boser, E., Guyon, M., Vapnik, V. (1992) A training algorithm for optimal margin classifiers. Proceedings of the Fifth ACM Workshop on Computational Learning TheoryPittsburgh, PA , pp. 144152.
Bradley, P.S. and Mangasarian, O.L. (1998) Feature selection via concave minimization and support vector machines. Proceedings of the 13th International Conference on Machine LearningCA , pp. 8290.
Chaplle, O., et al. (2002) Choosing kernel parameters for SVMs. Mach. Learning, 46, 131159[CrossRef].
Cristianini, N. and Shawe-Taylor, J. An Introduction to SVM, (1999) , Cambridge, MA Cambridge University Press.
Devore, J. and Peck, R. Statistics: The Exploration and Analysis of Data, (1997) 3rd edn , Pacific Grove, CA Duxbury Press.
Donoho, D. and Johnstone, I. (1994) Ideal spatial adaptation via wavelet shrinkage. Biometrika, 81, 425455
Fan, J. and Li, R. (2001) Variable selection via penalized likelihood. J. Am. Stat. Assoc, . 96, 13481360[CrossRef][Web of Science].
Fung, G. and Mangasarian, O.L. (2004) A feature selection Newton method for support vector machine classification. Comput. Optim. Appl. J, . 28, 185202[CrossRef].
Furey, T., et al. (2000) Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16, 906914
Golub, R., et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531537
Grandvalet, Y. and Canu, S. (2002) Adaptive scaling for feature selection in SVMs. Neural Inform. Processing Syst, . 553560 NIPS 2002.
Guyon, I., et al. (2002) Gene selection for cancer classification using SVM. Mach. Learning, 46, 389422[CrossRef].
Hall, P., et al. (2005) Geometric representation of high dimension low sample size data. J. R. Statist. Soc. B, 67, 427444[CrossRef].
Hastie, T., Tibshirani, R., Friedman, J. The Elements of Statistical Learning, (2001) , NY Springer.
He, W. (2004) A spline function approach for detecting differentially expressed genes in microarray data analysis. Bioinformatics, 20, 29542963
Kitter, J. (1986) Feature selection and extraction. In Young, T.Y. and Fu, K.-S. (Eds.). Handbook of Pattern Recognition and Image Processing, , NY Academic Press.
Lee, E., et al. (2003) Gene selection: a Bayesian variable selection approach. Bioinformatics, 19, 9097
Lin, Y. (2002) SVM and the Bayes rule in classification. Data Mining Knowledge Discov, . 6, 259275.
Marron, J.S., et al. (2004) Distance weighted discrimination. J. Am. Stat. Assoc, . in press.
More, J.J. and Wright, S.J. Optimization Software Guide, (1993) , Philadelphia SIAM.
Mukherjee, S., Tamayo, P., Slonim, D., Verri, A., Golub, T., Messirov, P., Poggio, T. (2000) SVM classification of microarray data. AI memo 182, CBCL paper 182, , MA MIT.
Pan, W. (2002) A comparative review of statistical methods for discovering differently expressed genes in replicated microarray experiments. Bioinformatics, 18, 546554
Parvlidis, P., Weston, J., Cai, J., Grundy, W.N. (2001) Gene functional analysis from heterogeneous data. Proceedings of 5th International Conference on Computational BiologyPittsburgh, PA , pp. 249255.
Perou, C.M., et al. (2000) Molecular portraits of human breast tumors. Nature, 406, 747752[CrossRef][Medline].
Rakotomamonjy, A. (2003) Variable selection using SVM-based Criteria. J. Mach. Learning Res, . 3, 13571370[CrossRef].
Sotiriou, C., et al. (2003) Breast cancer classification and prognosis based on gene expression profiles from a population-based study. Proc. Natl Acad. Sci. USA, 100, 1039310398
Stitt, M. and Fernie, A.R. (2003) From measurements of metabolites to metabolomics: an on the fly perspective illustrated by recent studies of carbon-nitrogen interactions. Curr. Opin. Biotechnol, . 14, 136144[CrossRef][Web of Science][Medline].
Thomas, G., et al. (2001) An efficient and robust statistical modeling approach to discover differentially expressed genes using genomic expression profiles. Genome Res, . 11, 12271236
Tibshirani, R. (1996) Regression shrinkage and selection via the lasso. J. R. Statist. Soc., B, 58, 267288.
Troyanskaya, G., et al. (2002) Nonparametric methods for identifying differentially expressed genes in microarray data. Bioinformatics, 18, 14541461
van't Veer, L., et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415, 530536[CrossRef][Medline].
Vapnik, V. The Nature of Statistical Learning Theory, (1995) , NY Springer.
West, M. (2003) Bayes factor regression models in the large p, small n paradigm. Bayesian Statistics, 7, 723732.
Weston, J., et al. (2000) Feature selection for SVMs. Adv. Neural Inform. Processing Syst, . 13, 668674.
Zhu, J., et al. (2003) 1-norm SVMs. Neural Inform. Processing Systems, 16, 4956.
This article has been cited by other articles:
![]() |
N. Becker, W. Werft, G. Toedt, P. Lichter, and A. Benner penalizedSVM: a R-package for feature selection SVM classification Bioinformatics, July 1, 2009; 25(13): 1711 - 1712. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. Ma and M. R. Kosorok Identification of differential gene pathways with principal component analysis Bioinformatics, April 1, 2009; 25(7): 882 - 889. [Abstract] [Full Text] [PDF] |
||||
![]() |
T. Hwang, H. Sicotte, Z. Tian, B. Wu, J.-P. Kocher, D. A. Wigle, V. Kumar, and R. Kuang Robust and efficient identification of biomarkers by classifying features on graphs Bioinformatics, September 15, 2008; 24(18): 2023 - 2029. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. Ma and J. Huang Penalized feature selection and classification in bioinformatics Brief Bioinform, September 1, 2008; 9(5): 392 - 403. [Abstract] [Full Text] [PDF] |
||||
![]() |
D. Huang and T. W. S. Chow Identifying the biologically relevant gene categories based on gene expression and biological data: an example on prostate cancer Bioinformatics, June 15, 2007; 23(12): 1503 - 1510. [Abstract] [Full Text] [PDF] |
||||
![]() |
X. Zhou and D. P. Tuck MSVM-RFE: extensions of SVM-RFE for multiclass gene selection on DNA microarray data Bioinformatics, May 1, 2007; 23(9): 1106 - 1114. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. Wang and J. Zhu Improved centroids estimation for the nearest shrunken centroid classifier Bioinformatics, April 15, 2007; 23(8): 972 - 979. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||


















