Skip Navigation


Bioinformatics Advance Access originally published online on March 24, 2007
Bioinformatics 2007 23(8):972-979; doi:10.1093/bioinformatics/btm046
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/8/972    most recent
btm046v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (1)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Wang, S.
Right arrow Articles by Zhu, J.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Wang, S.
Right arrow Articles by Zhu, J.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Improved centroids estimation for the nearest shrunken centroid classifier

Sijian Wang 1 and Ji Zhu 2,*

1Department of Biostatistics and 2Department of Statistics, University of Michigan, Ann Arbor, MI, 48109, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 ALGORITHMS
 4 SIMULATION STUDY
 5 REAL DATA ANALYSIS
 6 CONCLUSION
 ACKNOWLEDGEMENT
 REFERENCES
 

Motivation: The nearest shrunken centroid (NSC) method has been successfully applied in many DNA-microarray classification problems. The NSC uses ‘shrunken’ centroids as prototypes for each class and identifies subsets of genes that best characterize each class. Classification is then made to the nearest (shrunken) centroid. The NSC is very easy to implement and very easy to interpret, however, it has drawbacks.

Results: We show that the NSC method can be interpreted in the framework of LASSO regression. Based on that, we consider two new methods, adaptive L{infty}-norm penalized NSC (ALP-NSC) and adaptive hierarchically penalized NSC (AHP-NSC), with two different penalty functions for microarray classification, which improve over the NSC. Unlike the L1-norm penalty used in LASSO, the penalty terms that we consider make use of the fact that parameters belonging to one gene should be treated as a natural group. Numerical results indicate that the two new methods tend to remove irrelevant genes more effectively and provide better classification results than the L1-norm approach.

Availability: R code for the ALP-NSC and the AHP-NSC algorithms are available from authors upon request.

Contact: jizhu{at}umich.edu

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 ALGORITHMS
 4 SIMULATION STUDY
 5 REAL DATA ANALYSIS
 6 CONCLUSION
 ACKNOWLEDGEMENT
 REFERENCES
 
Class prediction with high-dimensional microarray data has recently received much attention in many fields, such as bioinformatics, machine learning, medicine and statistics (Alizadeh et al., 2000; Dabney, 2005; Dudoit et al., 2002; Eisen et al., 1998; Golub et al., 1999; Hastie et al., 2001; Khan et al., 2001; Liu and Shen, 2006; Pan, 2002; Shen et al., 2006; Wu, 2006; Zhang et al., 2006a). It is considered very helpful for medical research if one can classify and predict the clinical category of a sample based on its gene expression profile. The microarray classification problem is a very challenging task, however, because there are a huge number of variables (genes) but a much smaller number of samples. Hence, finding relevant genes that distinguish samples is also greatly desired in practice.

Tibshirani et al. (2002) proposed the nearest shrunken centroid (NSC) method for class prediction in DNA-microarray studies. The NSC uses ‘shrunken’ centroids as prototypes for each class and identifies subsets of genes that best characterize each class. We describe the NSC algorithm briefly in the next subsection.

1.1 The nearest shrunken centroids method
Assuming we have n samples, and for each sample, we have expressions for p genes. Let xij be the expression for the jth gene and the ith sample. Each sample belongs to one of K classes 1,2, ... ,K. Let Ck be the set of indices of the nk samples in class k. The jth component of the centroid for class k is Formula , the mean expression in class k for gene j; the jth component of the overall centroid is Formula .

Let


Formula 1

(1)
where, sj is the pooled within-class SD for the jth gene:


Formula 2

(2)
and Formula .

The NSC shrinks each Formula to


Formula 3

(3)
where {lambda} is a tuning parameter, and the shrunken centroid in class k for gene j is constructed as:


Formula 4

(4)
Because many of the Formula values are noisy and close to the overall mean Formula , soft thresholding usually produces more reliable estimates of the true means, and if {lambda} is large enough, some of the Formula can be shrunken to zero, hence the corresponding Formula are equal to Formula .

For the classification of a test sample, suppose we have one with expression levels Formula . We define the discriminant score for class k as


Formula 5

(5)

where {pi}k = nk/n , and the classification rule is given by


Formula 6

(6)
So we can see, if for some j, all Formula , k = 1, ... ,K, are zero, hence all Formula are equal to Formula , then the jth gene will not contribute to the discriminant score, and it can be removed.

The NSC classifier is very easy to implement and very easy to interpret; it has been shown to be very successful in many high-dimensional classification problems. However, it has drawbacks.

In this article, we re-derive the NSC method as a LASSO regression on gene expression profiles. This re-interpretation allows us to notice that the L1-norm penalty used by NSC may not be the most effective way in analyzing microarray data. The L1-norm penalty treats all centroids equally or ‘flatly’, but centroids for the same gene are naturally ‘grouped’ together and intuitively should be considered as a group. Also, centroids for different genes, say relevant ones and irrelevant ones, should be treated differently. Enlightened by these observations, we consider two different penalty functions different from the L1-norm penalty to make use of natural grouping information within the data. As we will see in the numerical study, the methods we consider tend to remove irrelevant genes more effectively and provide better classification results.

The remainder of this article is organized as follows. In Section 2, we show how the NSC can be represented as a LASSO regression, and based on that, we consider two new methods: the adaptive L{infty}-norm penalized NSC (ALP-NSC) and the adaptive hierarchically penalized NSC (AHP-NSC), which improve over the NSC. In Section 3, we derive algorithms for the two methods in detail. Numerical results are in Sections 4 and 5. A brief discussion is in Section 6.


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 ALGORITHMS
 4 SIMULATION STUDY
 5 REAL DATA ANALYSIS
 6 CONCLUSION
 ACKNOWLEDGEMENT
 REFERENCES
 
2.1 LASSO interpretation of the nearest shrunken centroids
Assume that the observation xi = (xi1, ... ,xip) from class k follows a multivariate normal distribution: MVN({nu}k, {Sigma}k), where {nu}k = ({nu}k1,..., {nu}kj, ... , {nu}kp) is the mean vector for class k, and {Sigma}k is the covariance matrix for class k. We further assume that Formula , i.e. the covariance matrices are the same across different classes and are diagonal. Such assumption is common when one works with high-dimension low sample size data (Marron and Todd, 2002). Some theoretical justification for this assumption can be found in Bickel and Levina (2004).

We first center and scale each xij to be Formula , and consider the linear regression:


Formula 7

(7)
where Formula ; zik is the indicator for whether the ith sample is in class k, i.e. zik =1 if the ith sample belongs to class k, and zik =0 otherwise; {varepsilon}ij are independent of each other and approximately follow Formula , if sample i belongs to class k.

Now we consider an LASSO-(Tibshirani, 1996) type estimator for µkj:


Formula 8

(8)
After some algebra, one can show that the solutions to (8) are


Formula 9

(9)


Formula 10

(10)
which matches exactly with (3). Therefore, the shrunken centroids used in the NSC can be considered as the solutions to (8). We acknowledge that Wu (2006) presented a similar interpretation of NSC, but used different values of {lambda} for different classes.

We can see from (8) that by using the L1-norm penalty, the NSC shrinks µkj continuously towards zero, and shrinks some of the fitted µkj to be exactly zero when making {lambda} sufficiently large. In order to remove the j th gene, we require all µkj, k = 1, ... ,K, to be zero. However, we can also see from (8) that the L1-norm penalty treats all µkj the same, i.e. it does not use the information that µkj and µk'j are associated with the same gene j. Intuitively, they belong to one ‘group’ and should be treated differently from µkj', which is associated with a different gene j'. In the next two subsections, we consider two different penalty functions, i.e. the L{infty}-norm penalty and the hierarchical penalty that incorporate this information into the modeling procedure. In general, we consider


Formula 11

(11)
where {Omega} = { µkj, k = 1, ... ,K;, j = 1, ...,p}, and J({Omega}) is a penalty function.

2.2 Method I: the adaptive L{infty}-norm penalized NSC (ALP-NSC)
For the ALP-NSC, we consider to estimate µkj by


Formula 12

(12)
where max (| µ1j|,... | µKj|) = | (µ1j, ... µKj)|{infty}. Different from penalizing every µkj individually, the L{infty}-norm penalizes the maximum absolute value of µkj, k = 1, ... , K, for the jth gene. If the maximum of |µkj|, k = 1, ... , K, is shrunken to zero, all µkj are automatically shrunken to zero. The L{infty}-norm penalty has also been used in Zhang et al. (2006b), Zhao et al. (2006) and Zou and Yuan (2006) for other supervised problems.

To further improve the model (12), we borrow the adaptive idea from Shen and Ye (2002) and Zou (2006), i.e. to penalize different genes differently. We consider


Formula 13

(13)
where wj are pre-specified weights. The intuition is that if the jth gene is relevant for distinguishing different classes from each other, we would like the corresponding wj to be small, hence the jth gene is lightly penalized, while if the jth gene is irrelevant and expressed similarly across different classes, we would like the corresponding wj to be large, hence the jth gene is heavily penalized. How to pre-specify wj from the data will be discussed in Section 2.4.

2.3 Method II: the adaptive hierarchically penalized NSC (AHP-NSC)
The L{infty}-norm penalty makes use of the information that µkj and µk'j are associated with the same gene by shrinking the maximum absolute value of µkj within the jth gene. If we denote Mj = maxk (| µ1j|, ..., | µKj|), the corresponding L{infty}-norm penalty on the jth gene is {lambda} Mj, and we can write µkj = Mj {alpha}kj, where –1≤ {alpha}kj≤ 1. This motivates us to reparameterize µkj in a more general way:


Formula 14

(14)
where {gamma}j ≥ 0 (for identifiability reasons). Notice that here {gamma}j plays a similar role as Mj, but it does not have to be the maximum of |µkj|; similarly {theta}kj does not have to be bounded between –1 and 1. This decomposition reflects the information that µkj, k = 1, ... , K, all belong to one single gene xj, by treating each µkj hierarchically. {gamma}j is at the first level of the hierarchy, controlling µkj, k = 1, ... , K, as a group; {theta}kjs are at the second level of the hierarchy, reflecting differences within the jth gene.

To estimate {gamma}j and {theta}kj, we consider


Formula 15

(15)
subject to {gamma}j ≥ 0. Notice that there are two tuning parameters {lambda}{gamma} and {lambda}{theta}. {lambda}{gamma} controls the estimates at the gene-specific level, and it can effectively remove irrelevant genes: if {gamma}j is shrunken to zero, all µkj for the jth gene will be equal to zero. {lambda}{theta} controls the estimates at the class-specific level: if {gamma}j is not equal to zero, some of the {theta}kj hence some of the µkj, k = 1, ... , K, still have the possibility of being zero; in this sense, the hierarchical penalty shares some of the properties of the L1-norm penalty.

The adaptive idea in (13) also applies here. If the jth gene is relevant, we would like to penalize its {gamma}j and {theta}kj lightly, and if the jth gene is irrelevant, we would like to penalize its {gamma}j and {theta}kj heavily. Hence, we consider the AHP-NSC:


Formula 16

(16)
where Formula and Formula are pre-specified weights.

2.4 Computing the adaptive weights
Regarding the adaptive weights wj in (13), Formula and Formula in (16), following Breiman (1995) and Zou (2006), we can choose them using the un-penalized estimates Formula . Specifically:


Formula


    3 ALGORITHMS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 ALGORITHMS
 4 SIMULATION STUDY
 5 REAL DATA ANALYSIS
 6 CONCLUSION
 ACKNOWLEDGEMENT
 REFERENCES
 
In this section, we describe details of our algorithms for estimating µkj in ALP-NSC and AHP-NSC. Notice that both the L{infty}-norm penalty and the hierarchical penalty have non-differential points, so they pose optimization challenges.

3.1 Estimating µkj in ALP-NSC
In ALP-NSC, (13) can be decomposed into p separate minimization problems


Formula 17

(17)
where j = 1, ... ,p. For each j, (17) can be transformed into a quadratic programming problem:


Formula 18

(18)


Formula 19

(19)


Formula 20

(20)
Hence, most commercially available packages can be used to solve it.

We have also explored explicit forms for the solutions to (17), which help us gain more insights into the nature of the L{infty}-norm penalty. We can show that Formula , the solution to the minimization problem (17), can be achieved by shrinking an average of Formula (1), i.e. the solution to (17) when there is no penalty (or {lambda} = 0).

THEOREM 1.
For the jth minimization problem (17), if there exists an indices set C={k1, ... , kr}, such that


Formula 21

(21)
then


Formula 22

(22)
where (:)+ is the positive part of the argument.

Details of the proof are in the Supplementary Material. From Theorem 1, we can see when there are r maximums among Formula , only the corresponding Formula will be shrunken by the L{infty}-norm penalty, and they are shrunken to the same absolute value. This value is based on an average of Formula of the corresponding r classes. We can also see that if the jth gene is irrelevant and all Formula are close to zero, then the L{infty}-norm penalty tends to shrink all of them to zero (with an appropriately chosen {lambda} wj).

To implement Theorem 1, we need to decide r, the number of maximums among Formula , and the set {k1, ... , kr}, which indicates which r Formula should be shrunken. When K is not very large, say K ≤ 20, we can use an exhaustive search to find r and {k1, ... , kr}, i.e. for each 1 ≤ r ≤ K, we search over all possible sets {k1, ... , kr}. For each possible set, we estimate Formula using (22), then check whether the estimates satisfy the assumption (21). If the assumption is satisfied, we compute the corresponding value for the objective function (17). Finally, we choose Formula that give the smallest value for the objective function. When K is large, we will resort to the quadratic programming (18)–(20)GoGo.

3.2 Estimating µkj in AHP-NSC
In AHP-NSC (16), we can use an iterative approach to estimate {gamma}j and {theta}kj, i.e. we first fix {theta}kj and estimate {gamma}j, then we fix {gamma}j and estimate {theta}kj, and we iterate between these two steps until the solution converges. Since at each step, the value of the objective function (16) decreases, the solution is guaranteed to converge. We have the following theorem that helps us solve for {gamma}j and {theta}kj at each step.

THEOREM 2.
• When {theta}kj, j = 1, ... , p and k = 1, ... , K, are fixed,


Formula 23

(23)

where Formula .
When {gamma}j, j = 1, ..., p, are fixed,


Formula 24

(24)

Equations (23) and (24) show that both Formula and Formula are soft-thresholding estimates. Details of the proof are in the Supplementary Material. Here we give some intuitive explanation.

We first look at Formula (23). If all {theta}kj are zero, it is natural to estimate {gamma}j also to be zero because of the penalty on {gamma}j. If not all {theta}kj are 0, say, {theta}k1j, ... ,{theta}krj are not zero, then from our reparameterization, we have {gamma}j = µksj/{theta}ksj, 1≤ s≤ r. Plugging in Formula for µksj, we obtain r estimates for {gamma}j : Formula , 1≤ s≤ r. A natural estimate for {gamma}j is then a weighted average of the Formula , and Equation (23) provides such a (shrunken) average, with weights proportional to {xi}k.

Now considering Formula (24). If {gamma} = 0, it is natural to estimate all {theta}kj also to be zero because of the penalty on {theta}kj. When {gamma}j > 0, we have {theta}kj= µ kj/{gamma}j. Again, plugging in Formula for µkj, we obtain Formula . Equation (24) shrinks Formula and the amount of shrinkage is inversely proportional to Formula . When {gamma}j is large, which indicates the jth gene is relevant, the amount of shrinkage is small, while when {gamma}j is small, which indicates the jth gene is less relevant, the amount of shrinkage is large.


    4 SIMULATION STUDY
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 ALGORITHMS
 4 SIMULATION STUDY
 5 REAL DATA ANALYSIS
 6 CONCLUSION
 ACKNOWLEDGEMENT
 REFERENCES
 
In this section, we use simulated data to demonstrate our methods ALP-NSC and AHP-NSC, and compare the results with that of the NSC. We also compare our methods with an improved version of NSC, i.e. NSC with adaptive choice of thresholds (Tibshirani et al., 2003). The ‘adaptive’ in Tibshirani et al. (2003) has a different meaning from that in our two methods: it still treats different genes ‘flatly’, but uses large thresholds for classes easy to classify and small thresholds for classes difficult to classify.

We first considered a two-class classification scenario. There were a total of p= 10 000 variables with only the first 20 relevant while the other 9980 irrelevant in forming two classes. Specifically, the first 20 variables were i.i.d. N(0,1) for the first class and i.i.d. N(1,1) for the second class, whereas the remaining 9980 variables were all i.i.d. N(0,1) for both classes.

We generated n = 100 training observations, with 70 in the first class and 30 in the second one; similarly, we also generated 1000 test observations, with the same class prior and the same within-class distribution. We denote this as the ‘70-30’ example. Tuning parameters were chosen using 5-fold cross-validation (CV) on the training data. We then computed the misclassification error rate of the chosen model on the test data. We also recorded both the number of relevant variables and the number of irrelevant variables that were selected. We repeated this 100 times. The results are summarized in Table 1.


View this table:
[in this window]
[in a new window]

 
Table 1. Simulation results for the ‘70-30’ example

 
As we can see, our ALP-NSC and AHP-NSC methods performed similarly to the NSC method in terms of selecting relevant variables, but tended to keep much fewer irrelevant variables. The error rates of ALP-NSC and AHP-NSC also seemed to be smaller than that of the NSC.

We then considered two three-class classification scenarios, with highly unbalanced classes. In both scenarios, there were a total of p= 4000 variables with the first 2 relevant while the other 3998 irrelevant in forming three classes. The first 2 variables were i.i.d. from N(0,1) for the first class, i.i.d. N(2.5,1) for the second class and i.i.d. N(5,1) for the third class, whereas the remaining 3998 variables were all i.i.d. from N(0,1) for all three classes. In the first scenario, we generated 20 observations for each of the first class and the third class, and 100 for the second class We denote it as the ‘20-100-20’ example. In the second scenario, we generated 100 observations for each of the first class and the third class, and 20 for the second class. We denote it as the ‘100-20-100’ example. For each scenario, we also generated test observations with the same class prior and the same within-class distribution, but with the sample size 10 times larger than that of the training datasets. We repeated this 100 times. The results are summarized in Table 2.


View this table:
[in this window]
[in a new window]

 
Table 2. Simulation results for ‘20-100-20’ and ‘100-20-100’ examples: the upper part is for the ‘20-100-20’ example, and the lower part is for the ‘100-20-100’ example

 
As we can see, our ALP-NSC and AHP-NSC methods removed all 3998 irrelevant variables for every repetition (out of 100), and the classification error rates were just slightly higher than that of the NSC using only the first 2 relevant variables, i.e., the ‘oracle’. In contrast, the NSC method and the NSC with adaptive thresholds tended to select more noise variables and have higher error rates.


    5 REAL DATA ANALYSIS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 ALGORITHMS
 4 SIMULATION STUDY
 5 REAL DATA ANALYSIS
 6 CONCLUSION
 ACKNOWLEDGEMENT
 REFERENCES
 
In this section, we apply the ALP-NSC and the AHP-NSC methods to three gene microarray datasets.

The first dataset we considered is the Leukemia dataset in Golub et al. (1999). This dataset consists of 38 training data and 34 test data for two types of acute leukemia, acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). Each sample is a vector of p= 7129 genes. We applied the ALP-NSC and the AHP-NSC methods to the training data. Tuning parameters were chosen using 10-fold CV, and the chosen models were evaluated on the test data. The results are summarized in the upper part of Table 3. Both the ALP-NSC and the AHP-NSC had two misclassification on the test data, and they selected 16 exactly same genes. Figure 1 shows the corresponding heatmap. Clear separation of the two classes is evident.


Figure 1
View larger version (85K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. Heatmap of the 16 genes selected by both the ALP-NSC and the AHP-NSC from the Leukemia dataset.

 

View this table:
[in this window]
[in a new window]

 
Table 3. Results on the real datasets: the upper part is for the Leukemia dataset, and the lower part is for the SRBCT dataset

 
The second dataset we considered consists of microarray experiments of small round blue cell tumors (SRBCT) of childhood cancer (Khan et al., 2001). The tumors are classified as Burkitt lymphoma (BL), Ewing sarcoma (EWS), neuroblastoma (NB) or rhabdomyosarcoma (RMS). A total of 63 training samples and 20 test samples were provided. Each sample consists of expression measurements on p= 2308 genes. We analyzed this dataset in the similar way as with the Leukemia dataset. Tuning parameters were chosen using 8-fold CV. The results are summarized in Table 3. The CV errors and the test errors are all zero. The ALP-NSC selected 38 genes, and the AHP-NSC selected 40 genes. The 38 genes selected by ALP-NSC and the 40 genes selected by AHP-NSC have 34 overlapping genes. Figures 2 and 3 show the heatmaps of the selected genes. Similar as in Figure 1, genes that distinguish each class from other classes are also evident.


Figure 2
View larger version (133K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. Heatmap of the 38 genes selected by ALP-NSC from the SRBCT dataset.

 

Figure 3
View larger version (122K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. Heatmap of the 40 genes selected by AHP-NSC from the SRBCT dataset.

 
To further assess the genes that were selected from the Leukemia and the SRBCT datasets, we randomly split each dataset into the training and the test sets for 100 times. The sizes of the two sets and the class priors were kept the same as the original training/test split. Each training/test split was also analyzed in the same way as for the original training/test split. The results are summarized in Tables 4–6GoGo. For the Leukemia dataset, out of the 16 genes selected from the original training/test split, 10 genes were selected for more than 70 times out of the 100 random training/test splits. For the SRBCT dataset, out of the 44 genes that were selected by ALP-NSC and AHP-NSC from the original training/test split, 29 genes were selected for more than 70 times out of the 100 random splits. These results imply that the highly frequently selected genes may have certain power in differentiating the corresponding cancer classes from each other.


View this table:
[in this window]
[in a new window]

 
Table 4. Results on 100 random splits of the real datasets: the first part is for the Leukemia dataset, the second part is for the SRBCT dataset and the third part is for the NCI-60 dataset

 

View this table:
[in this window]
[in a new window]

 
Table 5. The 16 genes that were selected by the ALP-NSC and the AHP-NSC from the Leukemia dataset

 

View this table:
[in this window]
[in a new window]

 
Table 6. The 44 genes that were selected by the ALP-NSC and the AHP-NSC from the SRBCT dataset

 
The Leukemia and the SRBCT datasets are relatively ‘easy’ for classification. We also considered the NCI-60 dataset (Dudoit et al., 2002). In this study, cDNA microarrays were used to examine the variation in gene expressions among 60 cell lines from the National Cancer Institute's anti-cancer drug screen. The cell lines are derived from tumors with eight different sites of origin: breast, central nervous system (CNS), colon, leukemia, melanoma, non-small-cell-lung-carcinoma (NSCLC), ovarian and prostate. A total of 61 samples were provided. Each sample consists of expression measurements on p= 5244 genes. The class sizes are all small, and some of the classes (e.g. breast and NSCLS) are known to be heterogeneous. So this is a more difficult dataset for classification than the Leukemia and the SRBCT datasets. The NCI-60 dataset came without pre-specified training/test split, so we randomly split the data into the training and the test sets comma; with sample sizes 40 and 21, respectively. We repeat it 100 times. The results are summarized in the lower part of Table 4 and Figure 4. We can see that ALP-NSC and AHP-NSC had similar misclassification errors as NSC and adaptive-NSC, but selected much fewer genes. Then, 777 genes were selected for more than 70 times out of the 100 trials. Our methods selected on average about 1000 genes, while NSC selected more than 3000 genes.


Figure 4
View larger version (13K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4. Selection frequency for ALP-NSC and AHP-NSC out of 100 trials for the NCI-60 dataset.

 

    6 CONCLUSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 ALGORITHMS
 4 SIMULATION STUDY
 5 REAL DATA ANALYSIS
 6 CONCLUSION
 ACKNOWLEDGEMENT
 REFERENCES
 
In this article, we first re-interpreted the popular NSC method in the framework of LASSO regression. Based on the penalized linear regression framework, we have considered two new methods for microarray classification, which improve over the NSC. Unlike the L1-norm penalty used in LASSO, the penalty terms that we consider make use of the fact that parameters belonging to one gene should be treated as a natural group. We have presented some evidence that the two new methods tend to remove irrelevant genes more effectively and provide better classification results than the L1-norm approach.


    ACKNOWLEDGEMENT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 ALGORITHMS
 4 SIMULATION STUDY
 5 REAL DATA ANALYSIS
 6 CONCLUSION
 ACKNOWLEDGEMENT
 REFERENCES
 
We would like to thank the Associate Editor and two reviewers for their thoughtful and useful comments. We would also like to thank Trevor Hastie and Rob Tibshirani for helpful discussions. S.W and J.Z are partially supported by grant DMS-0505432 from the National Science Foundation and grant NHLBI-18 HL60884 from National Institutes of Health.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: John Quackenbush

Received on November 17, 2006; revised on January 19, 2007; accepted on February 4, 2007

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 ALGORITHMS
 4 SIMULATION STUDY
 5 REAL DATA ANALYSIS
 6 CONCLUSION
 ACKNOWLEDGEMENT
 REFERENCES
 

    Alizadeh A, et al. Distinct types of diffuse large b-cell lymphoma identified by gene expression profiling. Nature, ( (2000) ) 403, : 503–511.[CrossRef][Medline].

    Bickel P, Levina L. Some theory for fisher's linear discriminant function, "naive bayes", and some alternatives when there are many more variables than observations. Bernoulli, ( (2004) ) 10, : 989–1010.[ISI].

    Breiman L. Better subset regression using the non-negative garrote. Technometrics, ( (1995) ) 37, : 373–384.[CrossRef][ISI].

    Dabney AR. Classification of microarrays to nearest centroids. Bioinformatics, ( (2005) ) 21, : 4148–4154.[Abstract/Free Full Text].

    Dudoit S, et al. Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Stat. Assoc, ( (2002) ) 97, : 77–87.[CrossRef][ISI].

    Eisen M, et al. Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA, ( (1998) ) 95, : 14863–14868.[Abstract/Free Full Text].

    Golub T, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, ( (1999) ) 286, : 531–537.[Abstract/Free Full Text].

    Hastie T, et al. Supervised harvesting of expression trees. Genome Biol, ( (2001) ) 2, : 1–12.[Medline].

    Khan J, et al. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat. Med, ( (2001) ) 7, : 673–679.[CrossRef][ISI][Medline].

    Liu Y, Shen X. Multicategory psi-learning. J. Am. Stat. Assoc, ( (2006) ) 101, : 500–509.[CrossRef][ISI].

    Marron J, Todd M. Distance weighted discrimination. In: Technical Report., ( (2002) ) School of Operations Research and Industrial Engineering, Cornell University..

    Pan W. A comparative review of statistical methods for discovering differently expressed genes in replicated microarray experiments. Bioinformatics, ( (2002) ) 18, : 546–554.[Abstract/Free Full Text].

    Shen X, Ye J. Adaptive model selection. J. Am. Stat. Assoc, ( (2002) ) 97, : 210–221.[CrossRef][ISI].

    Shen R, et al. Eigengene-based linear discriminant model for tumor classification using gene expression microarray data. Bioinformatics, ( (2006) ) 22, : 2635–2642.[Abstract/Free Full Text].

    Tibshirani R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B, ( (1996) ) 58, : 267–288..

    Tibshirani R, et al. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Natl Acad. Sci. USA, ( (2002) ) 99, : 6567–6572.[Abstract/Free Full Text].

    Tibshirani R, et al. Class prediction by nearest shrunken centroids, with application to DNA microarrays. Stat. Sci, ( (2003) ) 18, : 104–117.[CrossRef][ISI].

    Wu B. Differential gene expression detection and sample classification using penalized linear regression models. Bioinformatics, ( (2006) ) 22, : 472–476.[Abstract/Free Full Text].

    Zhang H, et al. Gene selection using support vector machines with non-convex penalty. Bioinformatics, ( (2006a) ) 22, : 88–95.[Abstract/Free Full Text].

    Zhang H, et al. Variable selection for multicategory SVM via sup-norm regularization. ( (2006b) ) Institute of Statistics Mimeo Series 2596, NCSU..

    Zhao P, et al. Grouped and hierarchical model selection through composite absolute penalties. In: Technical Report., ( (2006) ) Department of Statistics, University of California at Berkeley..

    Zou H. The adaptive lasso and its oracle properties. J. Am. Stat. Assoc, ( (2006) ) 101, : 1418–1429.[CrossRef][ISI].

    Zou H, Yuan M. The F{infty}-norm support vector machine. Stat. Sin, ( (2007) ) to appear..


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
F. Tai and W. Pan
Incorporating prior knowledge of gene functional groups into regularized discriminant analysis of microarray data
Bioinformatics, December 1, 2007; 23(23): 3170 - 3177.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/8/972    most recent
btm046v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (1)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Wang, S.
Right arrow Articles by Zhu, J.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Wang, S.
Right arrow Articles by Zhu, J.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?