Bioinformatics Advance Access originally published online on January 5, 2008
Bioinformatics 2008 24(3):412-419; doi:10.1093/bioinformatics/btm579
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Hybrid huberized support vector machines for microarray classification and gene selection
1Ross School of Business, 2Department of Statistics, University of Michigan, Ann Arbor, MI 48109 and 3School of Statistics, University of Minnesota, Minneapolis, MN 55455, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: The standard L2-norm support vector machine (SVM) is a widely used tool for microarray classification. Previous studies have demonstrated its superior performance in terms of classification accuracy. However, a major limitation of the SVM is that it cannot automatically select relevant genes for the classification. The L1-norm SVM is a variant of the standard L2-norm SVM, that constrains the L1-norm of the fitted coefficients. Due to the singularity of the L1-norm, the L1-norm SVM has the property of automatically selecting relevant genes. On the other hand, the L1-norm SVM has two drawbacks: (1) the number of selected genes is upper bounded by the size of the training data; (2) when there are several highly correlated genes, the L1-norm SVM tends to pick only a few of them, and remove the rest.
Results: We propose a hybrid huberized support vector machine (HHSVM). The HHSVM combines the huberized hinge loss function and the elastic-net penalty. By doing so, the HHSVM performs automatic gene selection in a way similar to the L1-norm SVM. In addition, the HHSVM encourages highly correlated genes to be selected (or removed) together. We also develop an efficient algorithm to compute the entire solution path of the HHSVM. Numerical results indicate that the HHSVM tends to provide better variable selection results than the L1-norm SVM, especially when variables are highly correlated.
Availability: R code are available at http://www.stat.lsa.umich.edu/~jizhu/code/hhsvm/
Contact: jizhu{at}umich.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
|
|
|---|
The DNA microarray technology is a powerful tool for biological and medical research. It can detect thousands of gene expression levels simultaneously, providing a wealth of information. On the other hand, however, microarray datasets usually contain only a small number of samples. These characteristics pose great challenges for sample classification and gene selection. The support vector machine (SVM) is one of the most effective methods for microarray classification (Guyon et al., 2002; Mukherjee et al., 2000; Ramaswamy et al., 2001); however, a major limitation of the SVM is that it cannot perform automatic gene selection. Since in microarray analysis, researchers are often interested in identifying informative genes, it is desirable to have a tool that can achieve both classification and gene selection simultaneously.
Guyon et al. (2002) proposed therecursive feature elimination (RFE) method for the SVM. The method, called the SVM-RFE, recursively eliminates irrelevant genes. At each step, the SVM-RFE trains for a SVM classifier, ranks the genes according to some score function and eliminates one or more genes with the lowest ranking scores. This process is repeated until classification accuracy starts to degrade. The SVM-RFE method is computationally intensive, especially for microarray datasets, which usually has a large number of genes. Bradley and Mangasarian (1998) proposed the L1-norm SVM, which can automatically select genes via the L1-norm regularization. The L1-norm SVM, however, has two limitations:
- The number of selected genes is upper bounded by the sample size. Therefore, when the number of relevant genes exceeds the sample size, the L1-norm SVM can only discover a portion of them.
- For the highly correlated and relevant genes, the L1-norm SVM tends to pick only one or a few of them.
- Similar to the L1-norm SVM, it automatically selects genes.
- The number of selected genes is no longer upper bounded by the sample size.
- When genes are highly correlated, they tend to be selected or removed together, i.e. the grouping effect.
The rest of the article is organized as follows. In Section 2, we describe the HHSVM model. In Section 3, we develop the solution path algorithm. In Section 4, we apply the HHSVM method to simulation and real microarray datasets. We conclude the article with Section 5.
| 2 MODEL |
|---|
|
|
|---|
2.1 The support vector machine
We briefly introduce the SVM and refer interested readers to (Burges, 1998; Evgeniou et al., 1999; Vapnik, 1995) for detailed tutorials. Let {x1, x2, ... , xn} represent the n input vectors, where xi
{1, – 1}. The SVM finds a hyperplane that separates the two classes of data points by the largest distance:
|
| (1) |
|
| (2) |
|
| (3) |
|
| (4) |
i (i = 1, ... , n) are slack variables, and C
0 is a tuning parameter that controls the overlap. Denote the solution as
Many researchers have recognized that the SVM can be equivalently transformed into the loss + penalty format, i.e.
|
| (5) |
0 is a regularization parameter, which controls the balance between the loss and the penalty. The same L2-norm penalty has also been used in the ridge regression (Hoerl and Kennard, 1970) and neural networks. By shrinking the magnitude of the coefficients, the L2-norm penalty reduces the variance of the estimated coefficients and may result in better prediction accuracy.
2.2 The L1-norm SVM
Bradley and Mangasarian (1998) proposed to replace the L2-norm penalty in the standard SVM with the L1-norm penalty (Tibshirani, 1996):
|
| (6) |
1 is large enough, it tends to shrink the coefficients of irrelevant variables to exactly zero. When
changes, different coefficients are set to zero, hence the L1-norm penalty allows a kind of automatic continuous variable selection.
2.3 The hybrid huberized SVM
Zou and Hastie (2005) argued that the L1-norm penalty has two major limitations:
- The number of variables selected by the L1-norm penalty is upper bounded by the sample size n. In microarray analysis, we nearly always have the case p >> n, but the L1-norm SVM can identify at most n relevant genes.
- For highly correlated and relevant variables, the L1-norm penalty tends to select only one or a few of them. In microarray analysis, genes sharing the same biological pathway tend to have highly correlated expression levels. It is often desirable to identify all, rather than a few, of them if they are related to the underlying biological process.
|
|
In this article, we apply the elastic-net penalty to the SVM and propose the HHSVM:
|
| (7) |
1,
2
0 are regularization parameters. Increasing
1 tends to eliminate more irrelevant variables; and increasing
2 makes the grouping effect more prominent, which we will illustrate by a theorem at the end of this section.
Notice that instead of using the standard hinge loss function of the SVM, we use the huberized hinge loss function (Rosset and Zhu, 2007) to measure badness-of-fit:
|
|
0 is a pre-specified constant. Figure 1 compares the standard hinge loss function and the huberized hinge loss function. Notice that they have similar shapes: when yf decreases (misclassification), they both increase linearly; when yf is bigger than 1, they are both equal to zero. Therefore, we expect the classification performance of these two functions should be similar to each other. Also notice that, unlike the hinge loss, the huberized hinge loss function is differentiable everywhere. As we will see in the next section, this differentiability can significantly reduce the computational cost for the HHSVM algorithm.
|
Before delving into details for the HHSVM algorithm, we illustrate the grouping effect with the following theorem:
THEOREM 1
Letand
denote the solution for problem (7). For any pair (j, j'), we have
If the input vector xj and xj' are centered and normalized, then
(8) where
(9) is the sample correlation between xj and xj '.
Details of the proof are in the Supplementary Material. Theorem 1 suggests that highly correlated variables tend to have similar estimated coefficients; hence, they tend to be selected or removed together when
2 is sufficiently large.
| 3 ALGORITHM |
|---|
|
|
|---|
The HHSVM involves two tuning parameters
1 and
2. According to our experience, the prediction performance is often more sensitive to
1, since it has more impact on selecting variables; therefore, the value of
1 must be chosen more carefully. For parameter tuning, one usually specifies a number of candidates, tests each of them and chooses the best one according to some criterion. However, this trial-and-error approach is computationally expensive and the optimal parameter setting can be easily missed. In this section, we propose an efficient algorithm, which solves the entire solution path for every possible value of
1 (when
2 is fixed). The algorithm is based on the fact that the solution (
1.
3.1 Problem setup
Since the huberized hinge loss function has different definitions in different regions, for simplicity we define each region as:
= {i: yi f (xi) > 1} (Right)
= {i: 1 –
< yi f (xi)
1} (Elbow)
= {i: yi f (xi)
1 –
} (Left)
= { j : βj
0, j = 1,2, ... , p} (Active)
|
| (10) |
|
| (11) |
In the linear system (10)–(11), there are |
| + 1 unknowns and |
| + 1 equations, where |
| represents the number of elements in set
. Therefore, the solution β0 and βj (j
) can be uniquely determined, given that the system is non-singular.
When sets
,
,
and
are fixed, the structure of the linear system (10)–(11) is also fixed. Under this condition, β0 and βj (j
) are linear functions of
1, which can be seen from (11). However, as
1 decreases, sooner or later some of the sets
,
,
and
will change. We call this an event. After an event occurs the new β0 and βj (j
) are still linear functions of
1, but their derivatives with respect to
1 will change. Therefore, the entire solution path is piecewise linear in
1, and between any two consecutive events, β0 and βj (j
) change linearly with
1. Each event corresponds to a kink on the piecewise linear solution path.
Our algorithm starts from
1 =
, continuously decreases
1, solves the solutions along this path, and terminates if
1 reaches 0. The algorithm provides the solutions on each kink. For any
1 between two consecutive kinks, the solution can be precisely obtained using linear interpolation. Table 1 shows the outline of the our algorithm.
|
3.2 Initial solution
The algorithm starts from
1 =
. From the objective function of problem (7), we can see that β = 0 at this stage. Therefore, the problem reduces to:
|
| (12) |
Let
represent the optimal solution when
1 =
. Hence
=
(β = 0) and sets
,
and
can be determined using the values of
(i = 1, ... , n). Reducing
1 tends to increase the magnitude of β, and we can find a critical point, denoted as
, where exactly one βj (j = 1, ... , p) joins
, i.e. the estimate βj will become non-zero if
1 is further reduced.
Since Equation (11) must hold for any j
, we can determine the critical point
by:
|
|
|
|
|
|
Let the superscript k indicate the iteration number and k = 0 for the initial stage. Now, we have
k = { j*},
,
(j = 1, ... , p),
, and the
k,
k,
k are determined by
(i = 1, ... , n).
We emphasize again that the initial solution can be easily determined here because of the differentiability of the huberized hinge loss function; however, this is not the case for the hinge loss function. Since the hinge loss is not everywhere differentiable, it is difficult to determine the critical point
and the corresponding
0,
0,
0 and
0 at the initial stage. Zhu et al. (2004) proposed an approach for solving the initial solution of the L1-norm SVM. The approach was based on linear programming, but it can be computationally intensive, especially when p is large.
3.3 Solution path
The algorithm continuously decreases
1 until it reaches 0. Let
, where
1 < 0. When
1 is reduced by a small enough amount, sets
,
,
and
do not change, because of the continuity of f(x) with respect to
1. Therefore, based on the systems (10)–(11), the derivatives of β0 and βj (j
) with respect to
1 can be solved from the following equations:
|
| (13) |
|
| (14) |
Since there are |
| + 1 unknowns and |
| + 1 equations,
and
(j
) are uniquely determined, given that the system is non-singular. Then when |
1| is sufficiently small, the solution and fitted values are linear in
1:
|
|
If we keep reducing
1, some of the sets
,
,
and
will change. We call this an event, and four types of events may occur:
- A point i reaches the boundary between
and
.
- A point i reaches the boundary between
and
.
- A parameter βj becomes zero (j leaves
).
- A zero-valued parameter βj becomes non-zero (j joins
).
The boundary between
and
is 1 –
and the boundary between
and
is 1. Therefore, when yi f(xi) for a point crosses 1 –
or 1, one of the first two events occurs. To determine the step size 
1 for the first event, for each i
or
we calculate:
|
|

1,1 represent the step size for the first event. Since
1 only decreases, 
1,1
0 and its value should be determined by: |
|
Similarly, for the second event we calculate:
|
|
or
. And the step size for the second event is: |
|
When a non-zero βj reduces to 0, the third event occurs. Therefore, we calculate for each j
:
|
|
The step size for the third event is:
|
|
The fourth event (a zero-valued βj becomes non-zero) is a little complicated to determine. For j = 1, ... , p, we define:
|
|
- |Cj| =
1, sign(Cj) = – sign(βj), for j
;
- |Cj| <
1, for j
.
Notice that when |
1| is sufficiently small, Cj is also a linear function of
1:
|
|
As
1 decreases, the value for a |Cj| (j
) will first meet the decreasing
1, after which the corresponding βj will become non-zero if we further reduce
1.
Therefore, to determine the step size for the fourth event, for each j
we calculate:
|
|
The step size
1,4 for the fourth event is:
|
|
After solving for
1,1,
1,2,
1,3 and
1,4, the overall step size
1 can be obtained:
|
|
,
,
and
1, β0, βj, Cj and the fitted value f(xi). Finally, let k = k + 1 and the algorithm goes to the next iteration: solving the linear systems (13)–(14) and calculating the step size 
1. This entire process is repeated, until
1 reaches 0.
Between any two consecutive events, the solutions are linear in
1, and after an event occurs, the derivative of the solution with respect to
1 is changed. Therefore, the solution path is piecewise linear in
1, where each event corresponds to a kink on the path. The algorithm provides solutions at these kinks, and for any
1 between two consecutive kinks the solutions can be calculated precisely via linear interpolation.
3.4 Computational cost
The major computational cost in each iteration comes from solving the linear system (13)–(14). Since this system has |
| + 1 unknowns, the computational cost seems to be O(|
|3) for each iteration. However, between any two iterations, only one element is changed for sets
,
,
and
, so using inverse updating and downdating the computational cost can be reduced to O(|
|2). It is difficult to predict the number of iterations. According to our experience, O(min(n,p)) is a reasonable estimate. The heuristic is that the algorithm needs O(n) to move all the points to
and O(p) steps to add all the parameters into
. Since p is the upper bound for |
|, the overall computational cost is upper bounded by O(min(n,p)p2).
| 4 NUMERICAL RESULTS |
|---|
|
|
|---|
In this section, we use both simulation data and real microarray datasets to illustrate the HHSVM.
4.1 Simulation
The main purpose of the simulation is to demonstrate that when input variables are independent, the L1-norm SVM and the HHSVM perform similarly, but the HHSVM can have an advantage when the variables are highly correlated, especially at identifying the relevant variables.
We first consider the scenario where all input variables are independent. The + class has a normal distribution with mean
|
|
= Ip x p. The – class has a similar distribution except that |
|
We generated 50 = 25 + 25 training data, each input xi is a p = 300-dimensional vector. We compared the standard L2-norm SVM, the L1-norm SVM, and the HHSVM. We used 50 validation data to select the tuning parameters for each method, then applied the selected models to a separate 10 000 testing dataset. Each experiment was repeated 100 times. The means of the prediction errors and the corresponding standard errors (in parentheses) are summarized in Table 2. As we can see, the prediction errors of the L1-norm SVM and the HHSVM are similar and both are better than that of the L2-norm SVM. This is due to the fact that the L2-norm SVM uses all input variables, and its prediction accuracy is polluted by the noise variables.
|
Besides the prediction error, we also compared the selected variables of the L1-norm SVM and the HHSVM (the L2-norm SVM keeps all input variables). In particular, we consider qsignal = number of selected relevant variables, and qnoise = number of selected noise variables. The results are in Table 3. Again, we see that the L1-norm SVM and the HHSVM perform similarly; both are able to identify the relevant variables and remove most of the irrelevant variables.
|
Now we consider the scenario when the relevant variables are correlated. Similar as the independent scenario, the + class has a normal distribution, with mean
|
|
|
|
* are 1 and the off-diagonal elements are all equal to
= 0.8. The – class has a similar distribution except that |
|
Again, we considered n = 25 + 25 and p = 300. Each experiment was repeated 100 times. The results for the prediction errors are shown in Table 2. Now the performance of the HHSVM is slightly better than that of the L1-norm SVM, after taking into account the standard error. The right part of Table 3 compares the variables selected by the L1-norm SVM and the HHSVM, which sheds some light on what happened. Both the L1-norm SVM and the HHSVM are able to identify relevant variables. However, when the relevant variables are highly correlated, the L1-norm SVM tends to keep only a small subset of the relevant variables, and overlook the others, while the HHSVM tends to identify all of them, due to the grouping effect. Both methods seem to work well in removing irrelevant variables.
4.2 Real Data Analysis
The first dataset we considered is the colon cancer dataset in Alon and Barkai (1999). This dataset consists of 62 samples (40 colon cancer tumors and 22 normal tissues). Each sample consists of p = 2000 genes. We randomly split the samples into the training and the test sets for 100 times; for each split, 42 samples (27 cancer samples and 15 normal tissues) were used for training and the rest 20 samples (13 cancer samples and 7 normal tissues) were for testing. For each split, we applied four methods, the standard L2-norm SVM, the L1-norm SVM, the SVM-RFE and the HHSVM, to the training data. Tuning parameters were chosen using 10-fold cross-validation on the training data, and the chosen models were evaluated on the test data. For the standard L2-norm SVM, we tried different kernels and found the linear kernel has the best performance. For the SVM-RFE, we used the same strategy as in Guyon et al. (2002), i.e. eliminating 10% ofthe remaining genes in each iteration. The results are summarized in the upper part of Table 4. We can see that the HHSVM seemed to have a slightly better classification accuracy than other methods. However, we note that this is not necessarily conclusive, for the sample size is small.
|
Tables 5–6
|
|
|
To assess the stability of prediction, we also recorded the frequency that each sample, as a test observation, was correctly classified. For example, if a sample appears in 70 test sets among the 100 random splits, and out of the 70 predictions, the sample was correctly classified for 60 times, then we recorded 60/70 for this sample. The results are shown in Figure 3. As we can see, for most samples, both the L1-norm SVM and the HHSVM classified them correctly for most of the random splits, but there are also some samples where both methods tended to misclassify. Overall, the HHSVM seems to be slightly more stable than the L1-norm SVM in terms of prediction.
|
The second dataset we considered is a more recent lung cancer dataset (Potti et al., 2006), which consists of 198 samples (54 cancer tumors and 144 normal tissues). Each sample consists of expression measurements on 22 215 genes. We randomly splitted the data into training and test sets, with sample sizes 137 (37 cancer samples and 100 normal tissues) and 61, respectively. We repeated it 100 times. The lower part of Table 4 summarizes the results. The gene selection behavior of the L1-norm SVM and the HHSVM for the lung cancer dataset are similar to those for the colon cancer dataset. Due to the lack of space, we skip the results.
| 5 CONCLUSION |
|---|
|
|
|---|
In this article, we have proposed the HHSVM for microarray classification and gene selection. The HHSVM uses the huberized hinge loss function to measure the badness-of-fit and the elastic-net penalty to control the model complexity. The huberized hinge loss function allows efficient computation for calculating the entire solution path. The elastic-net penalty allows automatic flexible variable selection. We have presented some evidence that the new method tends to select more relevant variables (than the L1-norm SVM method), especially when variables are highly correlated.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
We thank the Associate Editor, David Rocke, and two referees for their feedback which led us to improve the article, particularly the numerical result section. We also thank Saharon Rosset for helpful comments. L.W. and J.Z. are partially supported by grant DMS-0505432 and DMS-0705532 from the National Science Foundation.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: David Rocke
Received on June 15, 2007; revised on October 9, 2007; accepted on November 18, 2007
| REFERENCES |
|---|
|
|
|---|
Alon U, Barkai N. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon cancer tissues probed by oligonucleotide arrays. Proc. Natl Acad. Sci. USA (1999) 96:6745–6750.
Bradley P, Mangasarian O. Feature selection via concave minimization and support vector machines. (1998) Proceedings of the 15th International Conference on Machine Learning.
Burges C. A tutorial on support vector machines for pattern recognition. Data Mining Knowl. Discov (1998) 2:121–167.[CrossRef]
Efron B, et al. Least angle regression. Ann. Stat (2004) 32:407–499.[CrossRef]
Evgeniou T, et al. Regularization networks and support vector machines. In: Advances in Large Margin Classifiers.—Smola A, et al, eds. (1999) MIT Press.
Guyon I, et al. Gene selection for cancer classification using support vector machines. Mach. Learn (2002) 46:389–422.[CrossRef]
Hoerl A, Kennard R. Ridge regression: biased estimation for nonorthogonal problems. Technometrics (1970) 12:55–67.[CrossRef][Web of Science]
Mukherjee S, et al. Support vector machine classification of microarray data. Technical report (2000) Massachusetts Institute of Technology. Artificial Intelligence Laboratory.
Potti A, et al. A genomic strategy to refine prognosis in early-stage non-small-cell lung cancer. N. Eng. J. Med (2006) 355:570–580.
Ramaswamy S, et al. Multiclass cancer diagnosis using tumor gene expression signatures. Proc. Natl Acad. Sci. USA (2001) 98:15149–15154.
Rosset S, Zhu J. Piecewise linear regularized solution paths. Ann. Stat (2007) 35:1012–1030.[CrossRef]
Tibshirani R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B (1996) 58:267–288.
Vapnik V. The Nature of Statistical Learning Theory. (1995) New York: Springer-Verlag.
Zhu J, et al. 1-norm SVMs. (2004) Proceedings of the Neural Information Processing Systems.
Zou H, Hastie T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. B (2005) 67:301–320.[CrossRef]
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||








