Bioinformatics Advance Access originally published online on March 24, 2007
Bioinformatics 2007 23(8):972-979; doi:10.1093/bioinformatics/btm046
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Improved centroids estimation for the nearest shrunken centroid classifier
1Department of Biostatistics and 2Department of Statistics, University of Michigan, Ann Arbor, MI, 48109, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: The nearest shrunken centroid (NSC) method has been successfully applied in many DNA-microarray classification problems. The NSC uses shrunken centroids as prototypes for each class and identifies subsets of genes that best characterize each class. Classification is then made to the nearest (shrunken) centroid. The NSC is very easy to implement and very easy to interpret, however, it has drawbacks.
Results: We show that the NSC method can be interpreted in the framework of LASSO regression. Based on that, we consider two new methods, adaptive L
-norm penalized NSC (ALP-NSC) and adaptive hierarchically penalized NSC (AHP-NSC), with two different penalty functions for microarray classification, which improve over the NSC. Unlike the L1-norm penalty used in LASSO, the penalty terms that we consider make use of the fact that parameters belonging to one gene should be treated as a natural group. Numerical results indicate that the two new methods tend to remove irrelevant genes more effectively and provide better classification results than the L1-norm approach.
Availability: R code for the ALP-NSC and the AHP-NSC algorithms are available from authors upon request.
Contact: jizhu{at}umich.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
|
|
|---|
Class prediction with high-dimensional microarray data has recently received much attention in many fields, such as bioinformatics, machine learning, medicine and statistics (Alizadeh et al., 2000; Dabney, 2005; Dudoit et al., 2002; Eisen et al., 1998; Golub et al., 1999; Hastie et al., 2001; Khan et al., 2001; Liu and Shen, 2006; Pan, 2002; Shen et al., 2006; Wu, 2006; Zhang et al., 2006a). It is considered very helpful for medical research if one can classify and predict the clinical category of a sample based on its gene expression profile. The microarray classification problem is a very challenging task, however, because there are a huge number of variables (genes) but a much smaller number of samples. Hence, finding relevant genes that distinguish samples is also greatly desired in practice.
Tibshirani et al. (2002) proposed the nearest shrunken centroid (NSC) method for class prediction in DNA-microarray studies. The NSC uses shrunken centroids as prototypes for each class and identifies subsets of genes that best characterize each class. We describe the NSC algorithm briefly in the next subsection.
1.1 The nearest shrunken centroids method
Assuming we have n samples, and for each sample, we have expressions for p genes. Let xij be the expression for the jth gene and the ith sample. Each sample belongs to one of K classes 1,2, ... ,K. Let Ck be the set of indices of the nk samples in class k. The jth component of the centroid for class k is
, the mean expression in class k for gene j; the jth component of the overall centroid is
.
|
| (1) |
|
| (2) |
|
| (3) |
is a tuning parameter, and the shrunken centroid in class k for gene j is constructed as:
|
| (4) |
is large enough, some of the
For the classification of a test sample, suppose we have one with expression levels
. We define the discriminant score for class k as
|
| (5) |
where
k = nk/n , and the classification rule is given by
|
| (6) |
The NSC classifier is very easy to implement and very easy to interpret; it has been shown to be very successful in many high-dimensional classification problems. However, it has drawbacks.
In this article, we re-derive the NSC method as a LASSO regression on gene expression profiles. This re-interpretation allows us to notice that the L1-norm penalty used by NSC may not be the most effective way in analyzing microarray data. The L1-norm penalty treats all centroids equally or flatly, but centroids for the same gene are naturally grouped together and intuitively should be considered as a group. Also, centroids for different genes, say relevant ones and irrelevant ones, should be treated differently. Enlightened by these observations, we consider two different penalty functions different from the L1-norm penalty to make use of natural grouping information within the data. As we will see in the numerical study, the methods we consider tend to remove irrelevant genes more effectively and provide better classification results.
The remainder of this article is organized as follows. In Section 2, we show how the NSC can be represented as a LASSO regression, and based on that, we consider two new methods: the adaptive L
-norm penalized NSC (ALP-NSC) and the adaptive hierarchically penalized NSC (AHP-NSC), which improve over the NSC. In Section 3, we derive algorithms for the two methods in detail. Numerical results are in Sections 4 and 5. A brief discussion is in Section 6.
| 2 METHODS |
|---|
|
|
|---|
2.1 LASSO interpretation of the nearest shrunken centroids
Assume that the observation xi = (xi1, ... ,xip) from class k follows a multivariate normal distribution: MVN(
k,
k), where
k = (
k1,...,
kj, ... ,
kp) is the mean vector for class k, and
k is the covariance matrix for class k. We further assume that
We first center and scale each xij to be
, and consider the linear regression:
|
| (7) |
ij are independent of each other and approximately follow
Now we consider an LASSO-(Tibshirani, 1996) type estimator for µkj:
|
| (8) |
|
| (9) |
|
| (10) |
for different classes.
We can see from (8) that by using the L1-norm penalty, the NSC shrinks µkj continuously towards zero, and shrinks some of the fitted µkj to be exactly zero when making
sufficiently large. In order to remove the j th gene, we require all µkj, k = 1, ... ,K, to be zero. However, we can also see from (8) that the L1-norm penalty treats all µkj the same, i.e. it does not use the information that µkj and µk'j are associated with the same gene j. Intuitively, they belong to one group and should be treated differently from µkj', which is associated with a different gene j'. In the next two subsections, we consider two different penalty functions, i.e. the L
-norm penalty and the hierarchical penalty that incorporate this information into the modeling procedure. In general, we consider
|
| (11) |
= { µkj, k = 1, ... ,K;, j = 1, ...,p}, and J(
) is a penalty function.
2.2 Method I: the adaptive L
-norm penalized NSC (ALP-NSC)
For the ALP-NSC, we consider to estimate µkj by
|
| (12) |
. Different from penalizing every µkj individually, the L
-norm penalizes the maximum absolute value of µkj, k = 1, ... , K, for the jth gene. If the maximum of |µkj|, k = 1, ... , K, is shrunken to zero, all µkj are automatically shrunken to zero. The L
-norm penalty has also been used in Zhang et al. (2006b), Zhao et al. (2006) and Zou and Yuan (2006) for other supervised problems.
To further improve the model (12), we borrow the adaptive idea from Shen and Ye (2002) and Zou (2006), i.e. to penalize different genes differently. We consider
|
| (13) |
2.3 Method II: the adaptive hierarchically penalized NSC (AHP-NSC)
The L
-norm penalty makes use of the information that µkj and µk'j are associated with the same gene by shrinking the maximum absolute value of µkj within the jth gene. If we denote Mj = maxk (| µ1j|, ..., | µKj|), the corresponding L
-norm penalty on the jth gene is
Mj, and we can write µkj = Mj
kj, where –1
kj
1. This motivates us to reparameterize µkj in a more general way:
|
| (14) |
j
0 (for identifiability reasons). Notice that here
j plays a similar role as Mj, but it does not have to be the maximum of |µkj|; similarly
kj does not have to be bounded between –1 and 1. This decomposition reflects the information that µkj, k = 1, ... , K, all belong to one single gene xj, by treating each µkj hierarchically.
j is at the first level of the hierarchy, controlling µkj, k = 1, ... , K, as a group;
kjs are at the second level of the hierarchy, reflecting differences within the jth gene.
To estimate
j and
kj, we consider
|
| (15) |
j
0. Notice that there are two tuning parameters 
and 
. 
controls the estimates at the gene-specific level, and it can effectively remove irrelevant genes: if
j is shrunken to zero, all µkj for the jth gene will be equal to zero. 
controls the estimates at the class-specific level: if
j is not equal to zero, some of the
kj hence some of the µkj, k = 1, ... , K, still have the possibility of being zero; in this sense, the hierarchical penalty shares some of the properties of the L1-norm penalty.
The adaptive idea in (13) also applies here. If the jth gene is relevant, we would like to penalize its
j and
kj lightly, and if the jth gene is irrelevant, we would like to penalize its
j and
kj heavily. Hence, we consider the AHP-NSC:
|
| (16) |
2.4 Computing the adaptive weights
Regarding the adaptive weights wj in (13),
and
in (16), following Breiman (1995) and Zou (2006), we can choose them using the un-penalized estimates
. Specifically:
|
|
| 3 ALGORITHMS |
|---|
|
|
|---|
In this section, we describe details of our algorithms for estimating µkj in ALP-NSC and AHP-NSC. Notice that both the L
-norm penalty and the hierarchical penalty have non-differential points, so they pose optimization challenges.
3.1 Estimating µkj in ALP-NSC
In ALP-NSC, (13) can be decomposed into p separate minimization problems
|
| (17) |
|
| (18) |
|
| (19) |
|
| (20) |
We have also explored explicit forms for the solutions to (17), which help us gain more insights into the nature of the L
-norm penalty. We can show that
, the solution to the minimization problem (17), can be achieved by shrinking an average of
(1), i.e. the solution to (17) when there is no penalty (or
= 0).
THEOREM 1.
For the jth minimization problem (17), if there exists an indices set C={k1, ... , kr}, such thatthen
(21) where (:)+ is the positive part of the argument.
(22)
Details of the proof are in the Supplementary Material. From Theorem 1, we can see when there are r maximums among
, only the corresponding
will be shrunken by the L
-norm penalty, and they are shrunken to the same absolute value. This value is based on an average of
of the corresponding r classes. We can also see that if the jth gene is irrelevant and all
are close to zero, then the L
-norm penalty tends to shrink all of them to zero (with an appropriately chosen
wj).
To implement Theorem 1, we need to decide r, the number of maximums among
, and the set {k1, ... , kr}, which indicates which r
should be shrunken. When K is not very large, say K
20, we can use an exhaustive search to find r and {k1, ... , kr}, i.e. for each 1
r
K, we search over all possible sets {k1, ... , kr}. For each possible set, we estimate
using (22), then check whether the estimates satisfy the assumption (21). If the assumption is satisfied, we compute the corresponding value for the objective function (17). Finally, we choose
that give the smallest value for the objective function. When K is large, we will resort to the quadratic programming (18)–(20)![]()
.
3.2 Estimating µkj in AHP-NSC
In AHP-NSC (16), we can use an iterative approach to estimate
j and
kj, i.e. we first fix
kj and estimate
j, then we fix
j and estimate
kj, and we iterate between these two steps until the solution converges. Since at each step, the value of the objective function (16) decreases, the solution is guaranteed to converge. We have the following theorem that helps us solve for
j and
kj at each step.
THEOREM 2.
where.
Equations (23) and (24) show that both
and
are soft-thresholding estimates. Details of the proof are in the Supplementary Material. Here we give some intuitive explanation.
We first look at
(23). If all
kj are zero, it is natural to estimate
j also to be zero because of the penalty on
j. If not all
kj are 0, say,
k1j, ... ,
krj are not zero, then from our reparameterization, we have
j = µksj/
ksj, 1
s
r. Plugging in
for µksj, we obtain r estimates for
j :
, 1
s
r. A natural estimate for
j is then a weighted average of the
, and Equation (23) provides such a (shrunken) average, with weights proportional to
k.
Now considering
(24). If
= 0, it is natural to estimate all
kj also to be zero because of the penalty on
kj. When
j > 0, we have
kj= µ kj/
j. Again, plugging in
for µkj, we obtain
. Equation (24) shrinks
and the amount of shrinkage is inversely proportional to
. When
j is large, which indicates the jth gene is relevant, the amount of shrinkage is small, while when
j is small, which indicates the jth gene is less relevant, the amount of shrinkage is large.
| 4 SIMULATION STUDY |
|---|
|
|
|---|
In this section, we use simulated data to demonstrate our methods ALP-NSC and AHP-NSC, and compare the results with that of the NSC. We also compare our methods with an improved version of NSC, i.e. NSC with adaptive choice of thresholds (Tibshirani et al., 2003). The adaptive in Tibshirani et al. (2003) has a different meaning from that in our two methods: it still treats different genes flatly, but uses large thresholds for classes easy to classify and small thresholds for classes difficult to classify.
We first considered a two-class classification scenario. There were a total of p= 10 000 variables with only the first 20 relevant while the other 9980 irrelevant in forming two classes. Specifically, the first 20 variables were i.i.d. N(0,1) for the first class and i.i.d. N(1,1) for the second class, whereas the remaining 9980 variables were all i.i.d. N(0,1) for both classes.
We generated n = 100 training observations, with 70 in the first class and 30 in the second one; similarly, we also generated 1000 test observations, with the same class prior and the same within-class distribution. We denote this as the 70-30 example. Tuning parameters were chosen using 5-fold cross-validation (CV) on the training data. We then computed the misclassification error rate of the chosen model on the test data. We also recorded both the number of relevant variables and the number of irrelevant variables that were selected. We repeated this 100 times. The results are summarized in Table 1.
|
As we can see, our ALP-NSC and AHP-NSC methods performed similarly to the NSC method in terms of selecting relevant variables, but tended to keep much fewer irrelevant variables. The error rates of ALP-NSC and AHP-NSC also seemed to be smaller than that of the NSC.
We then considered two three-class classification scenarios, with highly unbalanced classes. In both scenarios, there were a total of p= 4000 variables with the first 2 relevant while the other 3998 irrelevant in forming three classes. The first 2 variables were i.i.d. from N(0,1) for the first class, i.i.d. N(2.5,1) for the second class and i.i.d. N(5,1) for the third class, whereas the remaining 3998 variables were all i.i.d. from N(0,1) for all three classes. In the first scenario, we generated 20 observations for each of the first class and the third class, and 100 for the second class We denote it as the 20-100-20 example. In the second scenario, we generated 100 observations for each of the first class and the third class, and 20 for the second class. We denote it as the 100-20-100 example. For each scenario, we also generated test observations with the same class prior and the same within-class distribution, but with the sample size 10 times larger than that of the training datasets. We repeated this 100 times. The results are summarized in Table 2.
|
As we can see, our ALP-NSC and AHP-NSC methods removed all 3998 irrelevant variables for every repetition (out of 100), and the classification error rates were just slightly higher than that of the NSC using only the first 2 relevant variables, i.e., the oracle. In contrast, the NSC method and the NSC with adaptive thresholds tended to select more noise variables and have higher error rates.
| 5 REAL DATA ANALYSIS |
|---|
|
|
|---|
In this section, we apply the ALP-NSC and the AHP-NSC methods to three gene microarray datasets.
The first dataset we considered is the Leukemia dataset in Golub et al. (1999). This dataset consists of 38 training data and 34 test data for two types of acute leukemia, acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). Each sample is a vector of p= 7129 genes. We applied the ALP-NSC and the AHP-NSC methods to the training data. Tuning parameters were chosen using 10-fold CV, and the chosen models were evaluated on the test data. The results are summarized in the upper part of Table 3. Both the ALP-NSC and the AHP-NSC had two misclassification on the test data, and they selected 16 exactly same genes. Figure 1 shows the corresponding heatmap. Clear separation of the two classes is evident.
|
|
The second dataset we considered consists of microarray experiments of small round blue cell tumors (SRBCT) of childhood cancer (Khan et al., 2001). The tumors are classified as Burkitt lymphoma (BL), Ewing sarcoma (EWS), neuroblastoma (NB) or rhabdomyosarcoma (RMS). A total of 63 training samples and 20 test samples were provided. Each sample consists of expression measurements on p= 2308 genes. We analyzed this dataset in the similar way as with the Leukemia dataset. Tuning parameters were chosen using 8-fold CV. The results are summarized in Table 3. The CV errors and the test errors are all zero. The ALP-NSC selected 38 genes, and the AHP-NSC selected 40 genes. The 38 genes selected by ALP-NSC and the 40 genes selected by AHP-NSC have 34 overlapping genes. Figures 2 and 3 show the heatmaps of the selected genes. Similar as in Figure 1, genes that distinguish each class from other classes are also evident.
|
|
To further assess the genes that were selected from the Leukemia and the SRBCT datasets, we randomly split each dataset into the training and the test sets for 100 times. The sizes of the two sets and the class priors were kept the same as the original training/test split. Each training/test split was also analyzed in the same way as for the original training/test split. The results are summarized in Tables 4–6
|
|
|
The Leukemia and the SRBCT datasets are relatively easy for classification. We also considered the NCI-60 dataset (Dudoit et al., 2002). In this study, cDNA microarrays were used to examine the variation in gene expressions among 60 cell lines from the National Cancer Institute's anti-cancer drug screen. The cell lines are derived from tumors with eight different sites of origin: breast, central nervous system (CNS), colon, leukemia, melanoma, non-small-cell-lung-carcinoma (NSCLC), ovarian and prostate. A total of 61 samples were provided. Each sample consists of expression measurements on p= 5244 genes. The class sizes are all small, and some of the classes (e.g. breast and NSCLS) are known to be heterogeneous. So this is a more difficult dataset for classification than the Leukemia and the SRBCT datasets. The NCI-60 dataset came without pre-specified training/test split, so we randomly split the data into the training and the test sets comma; with sample sizes 40 and 21, respectively. We repeat it 100 times. The results are summarized in the lower part of Table 4 and Figure 4. We can see that ALP-NSC and AHP-NSC had similar misclassification errors as NSC and adaptive-NSC, but selected much fewer genes. Then, 777 genes were selected for more than 70 times out of the 100 trials. Our methods selected on average about 1000 genes, while NSC selected more than 3000 genes.
|
| 6 CONCLUSION |
|---|
|
|
|---|
In this article, we first re-interpreted the popular NSC method in the framework of LASSO regression. Based on the penalized linear regression framework, we have considered two new methods for microarray classification, which improve over the NSC. Unlike the L1-norm penalty used in LASSO, the penalty terms that we consider make use of the fact that parameters belonging to one gene should be treated as a natural group. We have presented some evidence that the two new methods tend to remove irrelevant genes more effectively and provide better classification results than the L1-norm approach.
| ACKNOWLEDGEMENT |
|---|
|
|
|---|
We would like to thank the Associate Editor and two reviewers for their thoughtful and useful comments. We would also like to thank Trevor Hastie and Rob Tibshirani for helpful discussions. S.W and J.Z are partially supported by grant DMS-0505432 from the National Science Foundation and grant NHLBI-18 HL60884 from National Institutes of Health.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: John Quackenbush
Received on November 17, 2006; revised on January 19, 2007; accepted on February 4, 2007
| REFERENCES |
|---|
|
|
|---|
Alizadeh A, et al. Distinct types of diffuse large b-cell lymphoma identified by gene expression profiling. Nature, ( (2000) ) 403, : 503–511.[CrossRef][Medline].
Bickel P, Levina L. Some theory for fisher's linear discriminant function, "naive bayes", and some alternatives when there are many more variables than observations. Bernoulli, ( (2004) ) 10, : 989–1010.[ISI].
Breiman L. Better subset regression using the non-negative garrote. Technometrics, ( (1995) ) 37, : 373–384.[CrossRef][ISI].
Dabney AR. Classification of microarrays to nearest centroids. Bioinformatics, ( (2005) ) 21, : 4148–4154.
Dudoit S, et al. Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Stat. Assoc, ( (2002) ) 97, : 77–87.[CrossRef][ISI].
Eisen M, et al. Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA, ( (1998) ) 95, : 14863–14868.
Golub T, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, ( (1999) ) 286, : 531–537.
Hastie T, et al. Supervised harvesting of expression trees. Genome Biol, ( (2001) ) 2, : 1–12.[Medline].
Khan J, et al. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat. Med, ( (2001) ) 7, : 673–679.[CrossRef][ISI][Medline].
Liu Y, Shen X. Multicategory psi-learning. J. Am. Stat. Assoc, ( (2006) ) 101, : 500–509.[CrossRef][ISI].
Marron J, Todd M. Distance weighted discrimination. In: Technical Report., ( (2002) ) School of Operations Research and Industrial Engineering, Cornell University..
Pan W. A comparative review of statistical methods for discovering differently expressed genes in replicated microarray experiments. Bioinformatics, ( (2002) ) 18, : 546–554.
Shen X, Ye J. Adaptive model selection. J. Am. Stat. Assoc, ( (2002) ) 97, : 210–221.[CrossRef][ISI].
Shen R, et al. Eigengene-based linear discriminant model for tumor classification using gene expression microarray data. Bioinformatics, ( (2006) ) 22, : 2635–2642.
Tibshirani R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B, ( (1996) ) 58, : 267–288..
Tibshirani R, et al. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Natl Acad. Sci. USA, ( (2002) ) 99, : 6567–6572.
Tibshirani R, et al. Class prediction by nearest shrunken centroids, with application to DNA microarrays. Stat. Sci, ( (2003) ) 18, : 104–117.[CrossRef][ISI].
Wu B. Differential gene expression detection and sample classification using penalized linear regression models. Bioinformatics, ( (2006) ) 22, : 472–476.
Zhang H, et al. Gene selection using support vector machines with non-convex penalty. Bioinformatics, ( (2006a) ) 22, : 88–95.
Zhang H, et al. Variable selection for multicategory SVM via sup-norm regularization. ( (2006b) ) Institute of Statistics Mimeo Series 2596, NCSU..
Zhao P, et al. Grouped and hierarchical model selection through composite absolute penalties. In: Technical Report., ( (2006) ) Department of Statistics, University of California at Berkeley..
Zou H. The adaptive lasso and its oracle properties. J. Am. Stat. Assoc, ( (2006) ) 101, : 1418–1429.[CrossRef][ISI].
Zou H, Yuan M. The F
-norm support vector machine. Stat. Sin, ( (2007) ) to appear..
This article has been cited by other articles:
![]() |
F. Tai and W. Pan Incorporating prior knowledge of gene functional groups into regularized discriminant analysis of microarray data Bioinformatics, December 1, 2007; 23(23): 3170 - 3177. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||


















