Skip Navigation


Bioinformatics Advance Access originally published online on January 5, 2008
Bioinformatics 2008 24(3):412-419; doi:10.1093/bioinformatics/btm579
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
24/3/412    most recent
btm579v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (2)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Wang, L.
Right arrow Articles by Zou, H.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Wang, L.
Right arrow Articles by Zou, H.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Hybrid huberized support vector machines for microarray classification and gene selection

Li Wang 1, Ji Zhu 2,* and Hui Zou 3

1Ross School of Business, 2Department of Statistics, University of Michigan, Ann Arbor, MI 48109 and 3School of Statistics, University of Minnesota, Minneapolis, MN 55455, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MODEL
 3 ALGORITHM
 4 NUMERICAL RESULTS
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: The standard L2-norm support vector machine (SVM) is a widely used tool for microarray classification. Previous studies have demonstrated its superior performance in terms of classification accuracy. However, a major limitation of the SVM is that it cannot automatically select relevant genes for the classification. The L1-norm SVM is a variant of the standard L2-norm SVM, that constrains the L1-norm of the fitted coefficients. Due to the singularity of the L1-norm, the L1-norm SVM has the property of automatically selecting relevant genes. On the other hand, the L1-norm SVM has two drawbacks: (1) the number of selected genes is upper bounded by the size of the training data; (2) when there are several highly correlated genes, the L1-norm SVM tends to pick only a few of them, and remove the rest.

Results: We propose a hybrid huberized support vector machine (HHSVM). The HHSVM combines the huberized hinge loss function and the elastic-net penalty. By doing so, the HHSVM performs automatic gene selection in a way similar to the L1-norm SVM. In addition, the HHSVM encourages highly correlated genes to be selected (or removed) together. We also develop an efficient algorithm to compute the entire solution path of the HHSVM. Numerical results indicate that the HHSVM tends to provide better variable selection results than the L1-norm SVM, especially when variables are highly correlated.

Availability: R code are available at http://www.stat.lsa.umich.edu/~jizhu/code/hhsvm/

Contact: jizhu{at}umich.edu

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MODEL
 3 ALGORITHM
 4 NUMERICAL RESULTS
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
The DNA microarray technology is a powerful tool for biological and medical research. It can detect thousands of gene expression levels simultaneously, providing a wealth of information. On the other hand, however, microarray datasets usually contain only a small number of samples. These characteristics pose great challenges for sample classification and gene selection. The support vector machine (SVM) is one of the most effective methods for microarray classification (Guyon et al., 2002; Mukherjee et al., 2000; Ramaswamy et al., 2001); however, a major limitation of the SVM is that it cannot perform automatic gene selection. Since in microarray analysis, researchers are often interested in identifying informative genes, it is desirable to have a tool that can achieve both classification and gene selection simultaneously.

Guyon et al. (2002) proposed therecursive feature elimination (RFE) method for the SVM. The method, called the SVM-RFE, recursively eliminates irrelevant genes. At each step, the SVM-RFE trains for a SVM classifier, ranks the genes according to some score function and eliminates one or more genes with the lowest ranking scores. This process is repeated until classification accuracy starts to degrade. The SVM-RFE method is computationally intensive, especially for microarray datasets, which usually has a large number of genes. Bradley and Mangasarian (1998) proposed the L1-norm SVM, which can automatically select genes via the L1-norm regularization. The L1-norm SVM, however, has two limitations:

  1. The number of selected genes is upper bounded by the sample size. Therefore, when the number of relevant genes exceeds the sample size, the L1-norm SVM can only discover a portion of them.
  2. For the highly correlated and relevant genes, the L1-norm SVM tends to pick only one or a few of them.
In this article, we propose an HHSVM for microarray classification. The HHSVM has the form of ‘loss’ + ‘penalty’, where it uses the huberized hinge function to measure the loss and the elastic-net penalty (Zou and Hastie, 2005) to control the complexity of the model. The HHSVM has several major benefits:
  • Similar to the L1-norm SVM, it automatically selects genes.
  • The number of selected genes is no longer upper bounded by the sample size.
  • When genes are highly correlated, they tend to be selected or removed together, i.e. the grouping effect.
Furthermore, inspired by the LAR/LASSO method (Efron et al., 2004) and the general piecewise linear solution path strategy (Rosset and Zhu, 2007), we develop an efficient algorithm, which solves the entire solution path for every possible value of the regularization parameter.

The rest of the article is organized as follows. In Section 2, we describe the HHSVM model. In Section 3, we develop the solution path algorithm. In Section 4, we apply the HHSVM method to simulation and real microarray datasets. We conclude the article with Section 5.


    2 MODEL
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MODEL
 3 ALGORITHM
 4 NUMERICAL RESULTS
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
2.1 The support vector machine
We briefly introduce the SVM and refer interested readers to (Burges, 1998; Evgeniou et al., 1999; Vapnik, 1995) for detailed tutorials. Let {x1, x2, ... , xn} represent the n input vectors, where xi isin Formula p, and let {y1, y2, ... , yn} be the corresponding output labels, where yi isin {1, – 1}. The SVM finds a hyperplane that separates the two classes of data points by the largest distance:


Formula 1

(1)


Formula 2

(2)


Formula 3

(3)


Formula 4

(4)
where {varepsilon}i (i = 1, ... , n) are slack variables, and C ≥ 0 is a tuning parameter that controls the overlap. Denote the solution as Formula and Formula , a new point with input vector x is then assigned with the label Formula .

Many researchers have recognized that the SVM can be equivalently transformed into the ‘loss’ + ‘penalty’ format, i.e.


Formula 5

(5)
where the loss function (1 – ·)+ = max(1 – ·, 0) is called the hinge loss and the penalty is the L2-norm of the fitted coefficients. {lambda} ≥ 0 is a regularization parameter, which controls the balance between the ‘loss’ and the ‘penalty’. The same L2-norm penalty has also been used in the ridge regression (Hoerl and Kennard, 1970) and neural networks. By shrinking the magnitude of the coefficients, the L2-norm penalty reduces the variance of the estimated coefficients and may result in better prediction accuracy.

2.2 The L1-norm SVM
Bradley and Mangasarian (1998) proposed to replace the L2-norm penalty in the standard SVM with the L1-norm penalty (Tibshirani, 1996):


Formula 6

(6)
Similar to the L2-norm penalty, the L1-norm penalty also reduces the variance of the estimates and improves the prediction accuracy. Furthermore, due to the singularity of the L1-norm function, when {lambda}1 is large enough, it tends to shrink the coefficients of irrelevant variables to exactly zero. When {lambda} changes, different coefficients are set to zero, hence the L1-norm penalty allows a kind of automatic continuous variable selection.

2.3 The hybrid huberized SVM
Zou and Hastie (2005) argued that the L1-norm penalty has two major limitations:

  1. The number of variables selected by the L1-norm penalty is upper bounded by the sample size n. In microarray analysis, we nearly always have the case p >> n, but the L1-norm SVM can identify at most n relevant genes.
  2. For highly correlated and relevant variables, the L1-norm penalty tends to select only one or a few of them. In microarray analysis, genes sharing the same biological pathway tend to have highly correlated expression levels. It is often desirable to identify all, rather than a few, of them if they are related to the underlying biological process.
To overcome these limitations, Zou and Hastie (2005) proposed the elastic-net penalty:


Formula

which is a hybrid of the L1-norm and the L2-norm penalties. The elastic-net penalty retains the variable selection feature of the L1-norm penalty, and the number of selected variables is no longer bounded by n. Furthermore, the elastic-net penalty tends to provide similar estimated coefficients for highly correlated variables, i.e. the grouping effect; hence, highly correlated variables tend to be selected or removed together.

In this article, we apply the elastic-net penalty to the SVM and propose the HHSVM:


Formula 7

(7)
where {lambda}1, {lambda}2 ≥ 0 are regularization parameters. Increasing {lambda}1 tends to eliminate more irrelevant variables; and increasing {lambda}2 makes the ‘grouping effect’ more prominent, which we will illustrate by a theorem at the end of this section.

Notice that instead of using the standard hinge loss function of the SVM, we use the huberized hinge loss function (Rosset and Zhu, 2007) to measure ‘badness-of-fit’:


Formula

where {delta} ≥ 0 is a pre-specified constant.

Figure 1 compares the standard hinge loss function and the huberized hinge loss function. Notice that they have similar shapes: when yf decreases (misclassification), they both increase linearly; when yf is bigger than 1, they are both equal to zero. Therefore, we expect the classification performance of these two functions should be similar to each other. Also notice that, unlike the hinge loss, the huberized hinge loss function is differentiable everywhere. As we will see in the next section, this differentiability can significantly reduce the computational cost for the HHSVM algorithm.


Figure 1
View larger version (10K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. The hinge and the huberized hinge loss functions (with {delta} = 2). Note that the Elbow corresponds to the region (1 –{delta}; 1); the Left and the Right correspond to the regions (– {infty}, 1 – {delta}) and (1,{infty}), respectively.

 
Before delving into details for the HHSVM algorithm, we illustrate the ‘grouping effect’ with the following theorem:

THEOREM 1
Let Formula and Formula denote the solution for problem (7). For any pair (j, j'), we have


Formula 8

(8)
If the input vector xj and xj' are centered and normalized, then


Formula 9

(9)
where {rho} is the sample correlation between xj and xj '.

Details of the proof are in the Supplementary Material. Theorem 1 suggests that highly correlated variables tend to have similar estimated coefficients; hence, they tend to be selected or removed together when {lambda}2 is sufficiently large.


    3 ALGORITHM
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MODEL
 3 ALGORITHM
 4 NUMERICAL RESULTS
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
The HHSVM involves two tuning parameters {lambda}1 and {lambda}2. According to our experience, the prediction performance is often more sensitive to {lambda}1, since it has more impact on selecting variables; therefore, the value of {lambda}1 must be chosen more carefully. For parameter tuning, one usually specifies a number of candidates, tests each of them and chooses the best one according to some criterion. However, this trial-and-error approach is computationally expensive and the optimal parameter setting can be easily missed. In this section, we propose an efficient algorithm, which solves the entire solution path for every possible value of {lambda}1 (when {lambda}2 is fixed). The algorithm is based on the fact that the solution (Formula ) is a piecewise linear function (in a multi-dimensional space) with respect to {lambda}1.

3.1 Problem setup
Since the huberized hinge loss function has different definitions in different regions, for simplicity we define each region as:

  • R = {i: yi f (xi) > 1} (Right)
  • E = {i: 1 – {delta} < yi f (xi) ≤ 1} (Elbow)
  • L = {i: yi f (xi) ≤ 1 – {delta}} (Left)
where Formula . We also define the indices for non-zero βj as the active set Formula :
  • Formula = { j : βj != 0, j = 1,2, ... , p} (Active)
Since problem (7) is an unconstrained convex optimization problem, for any optimal solution the derivatives of its objective function must be zero. Setting the derivative with respect to β0 to zero, we get:


Formula 10

(10)
Setting the derivative with respect to βj (j isin Formula ) to zero, we get:


Formula 11

(11)

In the linear system (10)–(11), there are |Formula | + 1 unknowns and |Formula | + 1 equations, where |Formula | represents the number of elements in set Formula . Therefore, the solution β0 and βj (j isin Formula ) can be uniquely determined, given that the system is non-singular.

When sets L, E, R and Formula are fixed, the structure of the linear system (10)–(11) is also fixed. Under this condition, β0 and βj (j isin Formula ) are linear functions of {lambda}1, which can be seen from (11). However, as {lambda}1 decreases, sooner or later some of the sets L, E, R and Formula will change. We call this an event. After an ‘event’ occurs the new β0 and βj (j isin Formula ) are still linear functions of {lambda}1, but their derivatives with respect to {lambda}1 will change. Therefore, the entire solution path is piecewise linear in {lambda}1, and between any two consecutive events, β0 and βj (j isin Formula ) change linearly with {lambda}1. Each ‘event’ corresponds to a kink on the piecewise linear solution path.

Our algorithm starts from {lambda}1 = {infty}, continuously decreases {lambda}1, solves the solutions along this path, and terminates if {lambda}1 reaches 0. The algorithm provides the solutions on each kink. For any {lambda}1 between two consecutive kinks, the solution can be precisely obtained using linear interpolation. Table 1 shows the outline of the our algorithm.


View this table:
[in this window]
[in a new window]

 
Table 1. Outline of the HHSVM algorithm

 
3.2 Initial solution
The algorithm starts from {lambda}1 = {infty}. From the objective function of problem (7), we can see that β = 0 at this stage. Therefore, the problem reduces to:


Formula 12

(12)
which involves only one parameter β0. Since the huberized hinge loss function is convex and differentiable everywhere, this one-variable optimization problem can be easily solved. In fact, one can easily develop an analytical solution for the initial β0, but due to the lack of space, we skip the details.

Let Formula represent the optimal solution when {lambda}1 = {infty}. Hence Formula = {emptyset} (β = 0) and sets L, E and R can be determined using the values of Formula (i = 1, ... , n). Reducing {lambda}1 tends to increase the magnitude of β, and we can find a critical point, denoted as Formula , where exactly one βj (j = 1, ... , p) joins Formula , i.e. the estimate βj will become non-zero if {lambda}1 is further reduced.

Since Equation (11) must hold for any j isin Formula , we can determine the critical point Formula by:


Formula

The corresponding variable that joins Formula first can be identified as:


Formula

and the sign for Formula is:


Formula

Let the superscript k indicate the iteration number and k = 0 for the initial stage. Now, we have Formula k = { j*}, Formula , Formula (j = 1, ... , p), Formula , and the Lk, Rk, Ek are determined by Formula (i = 1, ... , n).

We emphasize again that the initial solution can be easily determined here because of the differentiability of the huberized hinge loss function; however, this is not the case for the hinge loss function. Since the hinge loss is not everywhere differentiable, it is difficult to determine the critical point Formula and the corresponding Formula 0, L0, E0 and R0 at the initial stage. Zhu et al. (2004) proposed an approach for solving the initial solution of the L1-norm SVM. The approach was based on linear programming, but it can be computationally intensive, especially when p is large.

3.3 Solution path
The algorithm continuously decreases {lambda}1 until it reaches 0. Let Formula , where {Delta} {lambda}1 < 0. When {lambda}1 is reduced by a small enough amount, sets L, E, R and Formula do not change, because of the continuity of f(x) with respect to {lambda}1. Therefore, based on the systems (10)–(11), the derivatives of β0 and βj (jisin Formula ) with respect to {lambda}1 can be solved from the following equations:



Formula 13

(13)



Formula 14

(14)

Since there are |Formula | + 1 unknowns and |Formula | + 1 equations, Formula and Formula (j isin Formula ) are uniquely determined, given that the system is non-singular. Then when |{Delta}{lambda}1| is sufficiently small, the solution and fitted values are linear in {lambda}1:


Formula

If we keep reducing {lambda}1, some of the sets L, E, R and Formula will change. We call this an event, and four types of ‘events’ may occur:

  1. A point i reaches the boundary between L and E.
  2. A point i reaches the boundary between R and E.
  3. A parameter βj becomes zero (j leaves Formula ).
  4. A zero-valued parameter βj becomes non-zero (j joins Formula ).

The boundary between L and E is 1 – {delta} and the boundary between R and E is 1. Therefore, when yi f(xi) for a point crosses 1 – {delta} or 1, one of the first two ‘events ’ occurs. To determine the step size {Delta}{lambda}1 for the first ‘event’, for each i isin L or E we calculate:


Formula

Let {Delta}{lambda}1,1 represent the step size for the first ‘event’. Since {lambda}1 only decreases, {Delta}{lambda}1,1 ≤ 0 and its value should be determined by:


Formula

Similarly, for the second ‘event’ we calculate:


Formula

for each i isin R or E. And the step size for the second ‘event’ is:


Formula

When a non-zero βj reduces to 0, the third ‘event’ occurs. Therefore, we calculate for each j isin Formula :


Formula

The step size for the third ‘event’ is:


Formula

The fourth ‘event’ (a zero-valued βj becomes non-zero) is a little complicated to determine. For j = 1, ... , p, we define:


Formula

which is part of the left-hand side of Equation (11). From (11) and the initial conditions, we know that:

  • |Cj| = {lambda}1, sign(Cj) = – sign(βj), for j isin Formula ;
  • |Cj| < {lambda}1, for j {notin} Formula .

Notice that when |{Delta} {lambda}1| is sufficiently small, Cj is also a linear function of {lambda}1:


Formula

As {lambda}1 decreases, the value for a |Cj| (j {notin} Formula ) will first meet the decreasing {lambda}1, after which the corresponding βj will become non-zero if we further reduce {lambda}1.

Therefore, to determine the step size for the fourth ‘event’, for each j {notin} Formula we calculate:


Formula

The step size {Delta} {lambda}1,4 for the fourth ‘event’ is:


Formula

After solving for {Delta} {lambda}1,1, {Delta} {lambda}1,2, {Delta} {lambda}1,3 and {Delta} {lambda}1,4, the overall step size {Delta} {lambda}1 can be obtained:


Formula

We then update L, E, R and Formula according to the corresponding ‘event’, and also update {lambda}1, β0, βj, Cj and the fitted value f(xi). Finally, let k = k + 1 and the algorithm goes to the next iteration: solving the linear systems (13)–(14) and calculating the step size {Delta}{lambda}1. This entire process is repeated, until {lambda}1 reaches 0.

Between any two consecutive ‘events’, the solutions are linear in {lambda}1, and after an ‘event’ occurs, the derivative of the solution with respect to {lambda}1 is changed. Therefore, the solution path is piecewise linear in {lambda}1, where each ‘event’ corresponds to a kink on the path. The algorithm provides solutions at these kinks, and for any {lambda}1 between two consecutive kinks the solutions can be calculated precisely via linear interpolation.

3.4 Computational cost
The major computational cost in each iteration comes from solving the linear system (13)–(14). Since this system has |Formula | + 1 unknowns, the computational cost seems to be O(|Formula |3) for each iteration. However, between any two iterations, only one element is changed for sets L, E, R and Formula , so using inverse updating and downdating the computational cost can be reduced to O(|Formula |2). It is difficult to predict the number of iterations. According to our experience, O(min(n,p)) is a reasonable estimate. The heuristic is that the algorithm needs O(n) to move all the points to R and O(p) steps to add all the parameters into Formula . Since p is the upper bound for |Formula |, the overall computational cost is upper bounded by O(min(n,p)p2).


    4 NUMERICAL RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MODEL
 3 ALGORITHM
 4 NUMERICAL RESULTS
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
In this section, we use both simulation data and real microarray datasets to illustrate the HHSVM.

4.1 Simulation
The main purpose of the simulation is to demonstrate that when input variables are independent, the L1-norm SVM and the HHSVM perform similarly, but the HHSVM can have an advantage when the variables are highly correlated, especially at identifying the relevant variables.

We first consider the scenario where all input variables are independent. The ‘+’ class has a normal distribution with mean


Formula

and covariance {Sigma} = Ip x p. The ‘–’ class has a similar distribution except that


Formula

So the Bayes optimal classification rule only depends on x1, ... , x10, and the Bayes error is about 0.06, independent of the dimension p.

We generated 50 = 25 + 25 training data, each input xi is a p = 300-dimensional vector. We compared the standard L2-norm SVM, the L1-norm SVM, and the HHSVM. We used 50 validation data to select the tuning parameters for each method, then applied the selected models to a separate 10 000 testing dataset. Each experiment was repeated 100 times. The means of the prediction errors and the corresponding standard errors (in parentheses) are summarized in Table 2. As we can see, the prediction errors of the L1-norm SVM and the HHSVM are similar and both are better than that of the L2-norm SVM. This is due to the fact that the L2-norm SVM uses all input variables, and its prediction accuracy is ‘polluted’ by the noise variables.


View this table:
[in this window]
[in a new window]

 
Table 2. Comparison of test errors

 
Besides the prediction error, we also compared the selected variables of the L1-norm SVM and the HHSVM (the L2-norm SVM keeps all input variables). In particular, we consider qsignal = number of selected relevant variables, and qnoise = number of selected noise variables. The results are in Table 3. Again, we see that the L1-norm SVM and the HHSVM perform similarly; both are able to identify the relevant variables and remove most of the irrelevant variables.


View this table:
[in this window]
[in a new window]

 
Table 3. Comparison of variable selection

 
Now we consider the scenario when the relevant variables are correlated. Similar as the independent scenario, the ‘+’ class has a normal distribution, with mean


Formula

and covariance


Formula

where the diagonal elements of {Sigma}* are 1 and the off-diagonal elements are all equal to {rho} = 0.8. The ‘–’ class has a similar distribution except that


Formula

So the Bayes optimal classification rule depends on x1, ... , x10, which are highly correlated. The Bayes error is 0.13, independent of the dimension p.

Again, we considered n = 25 + 25 and p = 300. Each experiment was repeated 100 times. The results for the prediction errors are shown in Table 2. Now the performance of the HHSVM is slightly better than that of the L1-norm SVM, after taking into account the standard error. The right part of Table 3 compares the variables selected by the L1-norm SVM and the HHSVM, which sheds some light on what happened. Both the L1-norm SVM and the HHSVM are able to identify relevant variables. However, when the relevant variables are highly correlated, the L1-norm SVM tends to keep only a small subset of the relevant variables, and overlook the others, while the HHSVM tends to identify all of them, due to the grouping effect. Both methods seem to work well in removing irrelevant variables.

4.2 Real Data Analysis
The first dataset we considered is the colon cancer dataset in Alon and Barkai (1999). This dataset consists of 62 samples (40 colon cancer tumors and 22 normal tissues). Each sample consists of p = 2000 genes. We randomly split the samples into the training and the test sets for 100 times; for each split, 42 samples (27 cancer samples and 15 normal tissues) were used for training and the rest 20 samples (13 cancer samples and 7 normal tissues) were for testing. For each split, we applied four methods, the standard L2-norm SVM, the L1-norm SVM, the SVM-RFE and the HHSVM, to the training data. Tuning parameters were chosen using 10-fold cross-validation on the training data, and the chosen models were evaluated on the test data. For the standard L2-norm SVM, we tried different kernels and found the linear kernel has the best performance. For the SVM-RFE, we used the same strategy as in Guyon et al. (2002), i.e. eliminating 10% ofthe remaining genes in each iteration. The results are summarized in the upper part of Table 4. We can see that the HHSVM seemed to have a slightly better classification accuracy than other methods. However, we note that this is not necessarily conclusive, for the sample size is small.


View this table:
[in this window]
[in a new window]

 
Table 4. Results on 100 randomsplits of the original datasets: the upper part is for the colon cancer dataset, and the lower part is for the lung cancer dataset

 
Tables 5–6Go summarize the genes that were ‘frequently’ selected by the L1-norm SVM and the HHSVM. As we can see, only 5 genes were selected for more than 50 times by the L1-norm SVM, while 40 genes were selected for more than 50 times by the HHSVM. The selection frequency implies that these genes may have certain power in differentiating the colon cancer samples from the normal tissues, however, the L1-norm SVM was not always able to identify them. Figure 2 shows the number of selected genes for each selection frequency out of the 100 random splits. Again, we can see that the HHSVM is more stable than the L1-norm SVM in terms of selecting genes, i.e. over the 100 random splits, for important genes, the HHSVM selects them more frequently than the L1-norm SVM (lower part of Fig. 2), while for unimportant genes, the HHSVM selects them less frequently (upper part of Fig. 2).


Figure 2
View larger version (14K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. The upper part shows the number of selected genes versus the selection frequency for the L1-norm SVM (left) and the HHSVM (right) on 100 random splits of the colon cancer dataset. The lower part ‘zooms in’ to the region where the ‘selection frequency’ is bigger than 50 (the left panel is for the L1-norm SVM and the right panel is for the HHSVM).

 

View this table:
[in this window]
[in a new window]

 
Table 5. The 32 most frequently selected genes by the L1-SVM from the colon cancer dataset

 

View this table:
[in this window]
[in a new window]

 
Table 6. The 100 most frequently selected genes by the HHSVM from the colon cancer dataset

 
To assess the stability of prediction, we also recorded the frequency that each sample, as a test observation, was correctly classified. For example, if a sample appears in 70 test sets among the 100 random splits, and out of the 70 predictions, the sample was correctly classified for 60 times, then we recorded 60/70 for this sample. The results are shown in Figure 3. As we can see, for most samples, both the L1-norm SVM and the HHSVM classified them correctly for most of the random splits, but there are also some samples where both methods tended to misclassify. Overall, the HHSVM seems to be slightly more stable than the L1-norm SVM in terms of prediction.


Figure 3
View larger version (6K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. Number of samples vs the frequency that a sample (as a test observation) was correctly classified on 100 random splits of the colon cancer dataset. The left panel is for the L1-norm SVM, and the right panel is for the HHSVM.

 
The second dataset we considered is a more recent lung cancer dataset (Potti et al., 2006), which consists of 198 samples (54 cancer tumors and 144 normal tissues). Each sample consists of expression measurements on 22 215 genes. We randomly splitted the data into training and test sets, with sample sizes 137 (37 cancer samples and 100 normal tissues) and 61, respectively. We repeated it 100 times. The lower part of Table 4 summarizes the results. The gene selection behavior of the L1-norm SVM and the HHSVM for the lung cancer dataset are similar to those for the colon cancer dataset. Due to the lack of space, we skip the results.


    5 CONCLUSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MODEL
 3 ALGORITHM
 4 NUMERICAL RESULTS
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
In this article, we have proposed the HHSVM for microarray classification and gene selection. The HHSVM uses the huberized hinge loss function to measure the ‘badness-of-fit’ and the elastic-net penalty to control the model complexity. The huberized hinge loss function allows efficient computation for calculating the entire solution path. The elastic-net penalty allows automatic flexible variable selection. We have presented some evidence that the new method tends to select more relevant variables (than the L1-norm SVM method), especially when variables are highly correlated.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MODEL
 3 ALGORITHM
 4 NUMERICAL RESULTS
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
We thank the Associate Editor, David Rocke, and two referees for their feedback which led us to improve the article, particularly the numerical result section. We also thank Saharon Rosset for helpful comments. L.W. and J.Z. are partially supported by grant DMS-0505432 and DMS-0705532 from the National Science Foundation.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: David Rocke

Received on June 15, 2007; revised on October 9, 2007; accepted on November 18, 2007

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MODEL
 3 ALGORITHM
 4 NUMERICAL RESULTS
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Alon U, Barkai N. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon cancer tissues probed by oligonucleotide arrays. Proc. Natl Acad. Sci. USA (1999) 96:6745–6750.[Abstract/Free Full Text]

    Bradley P, Mangasarian O. Feature selection via concave minimization and support vector machines. (1998) Proceedings of the 15th International Conference on Machine Learning.

    Burges C. A tutorial on support vector machines for pattern recognition. Data Mining Knowl. Discov (1998) 2:121–167.[CrossRef]

    Efron B, et al. Least angle regression. Ann. Stat (2004) 32:407–499.[CrossRef]

    Evgeniou T, et al. Regularization networks and support vector machines. In: Advances in Large Margin Classifiers.—Smola A, et al, eds. (1999) MIT Press.

    Guyon I, et al. Gene selection for cancer classification using support vector machines. Mach. Learn (2002) 46:389–422.[CrossRef]

    Hoerl A, Kennard R. Ridge regression: biased estimation for nonorthogonal problems. Technometrics (1970) 12:55–67.[CrossRef][Web of Science]

    Mukherjee S, et al. Support vector machine classification of microarray data. Technical report (2000) Massachusetts Institute of Technology. Artificial Intelligence Laboratory.

    Potti A, et al. A genomic strategy to refine prognosis in early-stage non-small-cell lung cancer. N. Eng. J. Med (2006) 355:570–580.[Abstract/Free Full Text]

    Ramaswamy S, et al. Multiclass cancer diagnosis using tumor gene expression signatures. Proc. Natl Acad. Sci. USA (2001) 98:15149–15154.[Abstract/Free Full Text]

    Rosset S, Zhu J. Piecewise linear regularized solution paths. Ann. Stat (2007) 35:1012–1030.[CrossRef]

    Tibshirani R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B (1996) 58:267–288.

    Vapnik V. The Nature of Statistical Learning Theory. (1995) New York: Springer-Verlag.

    Zhu J, et al. 1-norm SVMs. (2004) Proceedings of the Neural Information Processing Systems.

    Zou H, Hastie T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. B (2005) 67:301–320.[CrossRef]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
24/3/412    most recent
btm579v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (2)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Wang, L.
Right arrow Articles by Zou, H.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Wang, L.
Right arrow Articles by Zou, H.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?