Bioinformatics Advance Access originally published online on October 12, 2007
Bioinformatics 2007 23(23):3170-3177; doi:10.1093/bioinformatics/btm488
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Incorporating prior knowledge of gene functional groups into regularized discriminant analysis of microarray data
Division of Biostatistics, School of Public Health, University of Minnesota, A460 Mayo Building (MMC 303), Minneapolis, MN 55455-0378, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Discriminant analysis for high-dimensional and low-sample-sized data has become a hot research topic in bioinformatics, mainly motivated by its importance and challenge in applications to tumor classifications for high-dimensional microarray data. Two of the popular methods are the nearest shrunken centroids, also called predictive analysis of microarray (PAM), and shrunken centroids regularized discriminant analysis (SCRDA). Both methods are modifications to the classic linear discriminant analysis (LDA) in two aspects tailored to high-dimensional and low-sample-sized data: one is the regularization of the covariance matrix, and the other is variable selection through shrinkage. In spite of their usefulness, there are potential limitations with each method. The main concern is that both PAM and SCRDA are possibly too extreme: the covariance matrix in the former is restricted to be diagonal while in the latter there is barely any restriction. Based on the biology of gene functions and given the feature of the data, it may be beneficial to estimate the covariance matrix as an intermediate between the two; furthermore, more effective shrinkage schemes may be possible.
Results: We propose modified LDA methods to integrate biological knowledge of gene functions (or variable groups) into classification of microarray data. Instead of simply treating all the genes independently or imposing no restriction on the correlations among the genes, we group the genes according to their biological functions extracted from existing biological knowledge or data, and propose regularized covariance estimators that encourages between-group gene independence and within-group gene correlations while maintaining the flexibility of any general covariance structure. Furthermore, we propose a shrinkage scheme on groups of genes that tends to retain or remove a whole group of the genes altogether, in contrast to the standard shrinkage on individual genes. We show that one of the proposed methods performed better than PAM and SCRDA in a simulation study and several real data examples.
Contact: weip{at}biostat.umn.edu
| 1 INTRODUCTION |
|---|
|
|
|---|
As a classic method, linear discriminant analysis (LDA) has been well studied and widely used. It is well known for its simplicity as well as robustness. Suppose we have a class variable Y with possible values in
|
|
). By some simple calculations, we have |
|
|
|
, the sample mean and sample covariance, are used: |
|
|
|
k = nk/n, where nk and n are the sample sizes for class k and the pooled samples, respectively. However, with high-dimensional and low-sample-sized data, as arising in sample classification with microarray gene expression data, LDA suffers from the singularity of the sample covariance matrixWe feel that both PAM and SCRDA are at the ends of the two extremes: the covariance matrix in the former is restricted to be diagonal while in the latter there is barely any restriction. Based on the biology of gene functions, we aim to estimate the covariance matrix as an intermediate between the two. The idea is in line with recent efforts of integrating biological information of genes, as available in the Gene Ontology (GO) annotations (Ashburner et al., 2000), Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways (Kanehisa, 1996) or constructed gene networks, into the process of model building. Several methods in this line have been proposed. Lottaz and Spang (2005) proposed a structured analysis of microarray data (StAM), while Wei and Li (2007) proposed a modified boosting method called non-parametric pathway-based regression (NPR). More recently, Tai and Pan (2007a) proposed a group penalization method that handles the genes within different functional groups with different penalty terms. The common rationale behind all these methods is simple: many genes are known to have the same function or involved in the same pathway as that of some known/putative cancer-related genes, and the genes in the same functional group or pathway are more likely to work together. The methods aim to borrow information from existing biological knowledge to improve both predictive accuracy and interpretability of a resulting classifier.
In this article, we propose several versions of a modified LDA, group regularized discriminant analysis (GRDA) that aims to take advantage of existing gene functional groups. Specifically, we lean on, but do not require, the assumption that the genes within the same group are correlated with each other, but are independent of the genes from other groups, leading to a block-diagonal covariance structure. In addition, rather than shrinking individually for each gene as in PAM and SCRDA, again by taking advantage of known gene groups, we propose a group shrinkage scheme to identify biologically significant gene functional groups or pathways.
The rest of the article is organized as follows. We first review the PAM and SCRDA. Then, we introduce our new methods GRDA with three combinations of choosing between two regularized covariance matrices and between two shrinkage schemes. We also discuss some computational issues such that an efficient implementation makes it feasible to handle very high-dimensional data. Results from simulation studies and analyses of four public cancer data sets are presented to evaluate the proposed methods, demonstrating their potential gains over PAM and SCRDA. We end with a short summary and discussion.
| 2 METHODS |
|---|
|
|
|---|
2.1 PAM
The nearest shrunken centroids (PAM) method assumes the independence among the genes, ignoring any possible correlation among the genes. It uses a soft-thresholding rule to shrink
The basic idea of PAM is to shrink the class centroids or class-specific sample means
towards the overall centroid
. Let
|
|
|
|
|
|
0 by soft thresholding:
|
| (1) |
is a tuning parameter that has to be decided, usually by cross-validation (CV). We thus obtain a new shrunken centroid |
|
|
|
k = nk/n is the estimated class prior probability.
It is noted that, if the class centroids
for all k, then gene i does not contribute to classification and is regarded as a noise gene.
2.2 SCRDA
Instead of completely ignoring the correlations between genes, the SCRDA aims to estimate the covariance matrix in a general way:
|
| (2) |
Next, SCRDA shrinks the transformed class mean
towards 0:
|
| (3) |
|
|
2.3 Group Regularized Discriminant Analysis (GRDA)
Many studies have shown that genes in the same functional group or involved in the same pathway are more likely to co-express, hence their expression levels tend to be correlated. In this article, we aim to incorporate such biological information into the development of a regularized covariance matrix estimator and a grouped shrinkage scheme. In contrast, neither PAM nor SCRDA takes advantage of such biological information.
2.3.1 Regularized covariance matrix
Instead of shrinking the sample covariance matrix to an independence structure, we shrink it to a between-group independence structure:
|
| (4) |
1,
2 and
1 =
2
[0,1] are some tuning parameters to be determined; as a simpler alternative, we also consider using
|
| (5) |
2.3.2 Shrunken centroids GRDA (SCGRDA)
Next we consider two shrinkage schemes. The first one is exactly the same as in SCRDA; we call it individual shrinkage as opposed to the second one, called group shrinkage. The first one works as follows,
|
|
The second one is a group shrinkage that tends to retain or remove all the variables or genes in a group (Cai, 1999; Yuan and Lin, 2006). Instead of shrinking each
individually, we shrink them as a group. With regularized covariance matrix (5), that assumes between-group independence, we can actually perform gene selection at group level. Specifically,
|
|
With the two choices of a regularized covariance matrix and two choices of a shrinkage scheme, we have three possible methods:
- ISCGRDA-1: GRDA-1 with individual shrinkage
- ISCGRDA-2: GRDA-2 with individual shrinkage
- GSCGRDA: GRDA-2 with group shrinkage.
2.3.3 Tuning parameters
In GRDA-1, we have three tuning parameters whose values need to be determined by CV:
1,
2 and
, and two tuning parameters for GRDA-2:
and
. We perform a grid search in a tuning parameter space. The grids for
1,
2 or
range from 0 to 0.99 with a equally spaced grids and a was 10 in our study. The grids for
conventionally range from 0 to the maximum of the absolute values of the parameter that needs to be shrunken; e.g. in PAM, it ranges from 0 to max(|dk|). However, in SCRDA and our method SCGRDA,
, the parameter that needs to be shrunken, depends on
, which changes with
. SCRDA fixes the range of
to simplify the computation.
Instead of directly searching in grids of
, we introduce another parameter
, which is a proportion of the total number of genes or gene groups remaining in the model. The range of
was fixed from 0 to 30 in our study. In CV, given
1,
2 or
, we can obtain
. Then we calculate
for each gene. Suppose the order statistic
is the 100(1 –
)th percentile of
, we let
and use it to shrink. Similarly, for group shrinkage, we replace
by
.
The best combination of the shrinkage parameters were selected based on the smallest number of test errors. When there are two or more combinations giving the same smallest test error rates, to break ties, our strategy is to first use as a small number of genes as possible, which means to choose the smallest
; if still tied, we choose the smallest
1 to decrease the weight of the sample covariance matrix; if still tied, we choose the group independence over the individual independence by choosing largest
2 or
.
2.4 Connection to penalized likelihoods
Let X be centered raw expression data, i.e.
. Each gene expression sample is denoted by a p x 1 vector Xi = (xi1,...,xip)T and mean of class k is denoted by µk = (µ1k,...,µpk)T. In LDA, µk and
are estimated by MLEs. In this section, we show the connections between our method and penalized log-likelihood methods; for derivations, see Tai and Pan (2007b). The penalized log-likelihood can be expressed as
|
| (6) |
|
|
(µk,
) is a penalty function for parameters µ and
.
2.4.1 PAM
As pointed out by Wu (2006) and Wang and Zhu (2007), if we let Zi = Xi/mk for gene i in class k and
= diag((s1 + s0)2,...,(sp + s0)2), and use an L1 norm penalty
, the nearest shrunken centroids estimators are the maximizer of L
with Xi replaced by Zi.
2.4.2 GSCGRDA
For GSCGRDA, we assume a between-group independence covariance matrix
= diag(
1,...,
G), where
g is the covariance for group g (g = 1,...,G). Accordingly, we apply a group penalty on µk (Yuan and Lin, 2006), resulting in the following penalized log-likelihood
|
| (7) |
|
| (8) |
|
| (9) |
|
| (10) |
|
| (11) |
g, then solution (10) becomes exact for (8).
The covariance matrix estimator
used in GSCGRDA can be regarded as a maximizer of penalized likelihood
|
|
= diag(
1,...,
G). The penalty term tr
g as an inverse Wishart distribution with mean Dg. A similar idea of deriving
2.5 Computational issues
With high-dimensional microarray data it takes too much memory space or even infeasible to invert a p x p covariance matrix, where p is in the order of thousands or tens of thousands. In order to efficiently and stably invert such a large and potentially sparse matrix, we use the Woodbury formula so that the memory requirement is reduced from inverting a p x p matrix to an n x n matrix; the latter is quite small for microarray data with n less than hundreds. The Woodbury formula can be expressed as
|
|
1X/(n – K) and V = X, then |
|
n, we invert it by the Cholesky decomposition; otherwise, we apply the Woodbury formula to Ai, |
|
1 –
2) and b =
2/(n – K)(1 –
1 –
2)2. In this way, the largest matrix we need to invert is the n x n matrix In + bXi'Xi, which is computationally affordable, considering n is usually small in microarray data analysis. In the context of discriminant analysis, our final goal is to compute the discriminant score
k(x). If we can obtain|
|
1/(n – K),| 3 RESULTS |
|---|
|
|
|---|
3.1 Simulation
For evaluation, we compared our methods to PAM and SCRDA on simulated data. As discussed above, the major difference between these methods is the assumption on the form of the covariance matrix. Hence, we considered simulation set-ups with four different covariance matrices. For each simulated dataset in each case, we had two classes and p = 1000 variables; there were 50 training samples and 500 test samples for each class; the small training samples reflected typical microarray data sizes while the large test samples were used for obtaining accurate test error rates for the purpose of evaluations only. Gene expression levels X were generated from two multivariate normal distributions with different mean vectors but the same covariance structure. Specifically,
|
|
U(0,1) and the remaining ones were 0. The choices of covariance matrices were respectively,- an identity matrix
- a compound symmetric (CS) matrix with
= 0.2: the correlation between any two genes was
- a block diagonal matrix with each block as CS: the block size was 50 x 50, resulting in a total of 20 blocks. The within-block-wise correlations were
1,...,
20
U(0,1)
- a block diagonal matrix plus a weak CS correlation for other off-block elements: the blocking structure was the same as in case 3 with the within-block correlations
1,...,
20
U(0.5,1), and the correlation for off-block elements was
U(0,0.1).
For the grouped methods, we grouped variables according to the blocking structure in case 3 and used this grouping scheme throughout all four simulation set-ups when applying any group-based method. More specifically, we grouped 1000 variables into 20 groups with 50 variables in each group, the first 50 variables in group 1, the next 50 variables in group 2, etc.
To investigate the robustness of the proposed methods to group misspecification, we also considered three scenarios:
- Mixed groups (mix). We randomly chose m
U(0,40) genes from each group, then randomly reassigned them to one of the 20 groups; in this way,
40% genes' groups were misspecified.
- New groups (new). We randomly chose m
U(0,40) genes from each group, and then treated them as separate groups.
- Divided groups (div). We randomly divided each of four groups (1 informative and 3 non-informative ones) into two groups of nearly equal size.
|
PAM and wPAM performed pretty well under the independence case, but suffered severely for other cases because of their ignoring existing correlations. SCRDA performed well in general for all cases, especially when correlations among the genes were fairly strong, but it tended to use more genes than other methods probably because it was unable to capture the sparseness of an underlying covariance matrix. ISCGRDA-1 performed well in all cases due to its generality. However, it suffered from its high computational cost. ISCGRDA-2 performed similarly to ISCGRDA-1 except for the CS case. In terms of prediction error, GSCGRDA outperformed all the other methods except in the CS case, in which it was ranked the second only behind SCRDA. For variable selection, it selected a much higher proportion of informative genes while eliminating much more noise genes. In addition, by using a block covariance structure along with the group shrinkage, GSCGRDA had the capability of genuine gene selection as compared to other RDA-based methods. In particular, GSCGRDA performed better than SVM, which was not really surprising because the former method (along with other LDA-type methods) used the normality assumption on the predictors.
3.2 Real data
We also applied all methods to four public microarray gene expression datasets for tumor classifications.
- Breast cancer data (Huang et al., 2003). There were in total n = 52 samples (18 with recurrence of tumor and 34 without). Each sample contained p = 12 625 genes.
- Lung cancer data (Gordon et al., 2002). The goal was to discriminate between malignant pleural mesothelioma (MPM) and adenocarcinoma (ADCA) of the lung. There were n = 181 tissue samples (31 MPM and 150 ADCA). Each sample contained p = 12 533 genes.
- Prostate cancer data (Singh et al., 2002). The goal was to classify between 77 tumor samples and 59 normal samples. Each sample contained p = 12 600 genes.
- Leukemia data (Armstrong et al., 2001). The goal was to classify each of 62 leukemia samples, among which there were 24 acute lymphoblastic leukemia (ALL) samples, 20 acute leukemia samples carrying chromosomal translocations involving the mixed-lineage leukemia gene (MLL), and 28 acute myelogenousleukemia (AML) samples, into one of the three subtypes. There were p = 12 582 genes.
Gene groups were formed based on the KEGG pathways (Kanehisa, 1996): the genes in a KEGG pathway formed a group, while each of the genes that was not annotated in any KEGG pathway formed its own group with group size one. To deal with the genes annotated in multiple pathways, we applied two methods: (1) (the default): we kept a gene to the pathway with the smallest ID while ignoring its belonging to other pathways (2) (dup): we duplicated the expression profile of the gene in each pathway to which it belonged to.
As a pre-screening, we only used various subsets of the genes in each dataset: we used the genes in KEGG pathways along with the top 3000 genes with the largest sample variances. Results in Table 2 were based on 10 independent repeats of 10-fold CV; within each CV, a second-level CV was used to select tuning parameters.
|
The RDA-based methods tended to use more genes than PAM or wPAM; in particular, SCRDA used almost all the genes for each dataset. However, RDA-based methods performed significantly better than PAM and wPAM for the prostate cancer data. The GSCGRDA performed consistently well for all datasets: it was the best for the breast cancer data and leukemia data, the third best for the prostate cancer data and performed closely to the winner (wPAM) for the lung cancer data. It also consistently outperformed other two GRDA-based methods. In addition, the genes selected from GSCGRDA had a biological interpretation: any group of more than one genes selected at the end corresponded to a KEGG pathway. In Table 3, we show the top 10 most frequently selected pathways by GSCGRDA based on 100 models from 10 repeats of 10-fold CV. Since GSCGRDA selected almost all of the genes for the prostate cancer data, we do not list the selected pathways for the data.
|
It was somewhat reassuring that the two treatments of the genes with multiple annotations in GSCGRDA gave almost equal performance in terms of misclassifications, while the duplication method seemed to select an almost equal number of or fewer unique genes as compared to the first treatment; more work is needed on how to handle the genes with multiple annotations for grouped methods.
To empirically verify the stronger within-group and weaker between-group correlations assumption, based on the lung cancer data, we calculated pairwise Pearson's correlations among the genes within three pathways (with IDs 04514, 04020 and 04360) and the collection of all other genes not in any pathway. For the three pathways, the median correlations were 0.092, 0.095 and 0.097, and the 75th percentiles were 0.169, 0.167, 0.168, respectively, larger than 0.090 and 0.159 for the non-annotated genes.
| 4 DISCUSSION |
|---|
|
|
|---|
In this article, we have proposed a class of modified LDA to incorporate prior knowledge on gene functions into building a classifier. A main difference from other modifications of LDA is that we regularize the covariance matrix by considering group relationships among variables. Unlike most standard classifiers, which treat all the genes equally a priori, our methods assume that the genes in the same group are more likely to function similarly and thus have correlated expressions while the genes from different functional groups or pathways are more likely to be independent or only weakly correlated. We introduce a between-group independence (i.e. block-diagonal) covariance structure into regularization and put more weight on it to account for the biological belief of higher within-group but lower between-group gene correlations. Another main difference is our consideration of a group shrinkage scheme that tends to retain or remove a whole group of the genes altogether. When gene groups are formed informatively, it may not only improve predictive performance, but also facilitate interpretation of results. Although the genes within such selected pathways may be useful for the purpose of clinical outcome classifications, we cannot determine whether differential expressions of these genes are the cause or only manifestation of different outcomes; in fact, even for the purpose of diagnosis or prognosis, more studies may be still needed to evaluate whether identified genes are really trustworthy predictors.
Among the methods studied, in general, GSCGRDA performed best for our simulated and real data. In addition, an advantage of GSCGRDA over other RDA-based methods is that it can realize gene selection while the others cannot. In particular, gene selection is accomplished at the group level, thus naturally associating selected gene groups to their biological interpretations, e.g. pathways. Furthermore, compared to ISGRDA-1 and SCRDA, GSCGRDA is much less computationally intensive by excluding the use of the unrestrictive sample covariance estimate.
Although we only used the KEGG pathways to form gene groups in the real data examples, other sources of biological knowledge for gene functions or pathways, such as GO, can be also utilized in our proposed methods. However, how to take advantage of the hierarchical structure of GO annotations, or known or predicted gene networks, is unclear. Furthermore, how to handle genes with multiple annotations warrants more research. Finally, it may be productive to combine the idea proposed here with other improved PAM methods (Wang and Zhu, 2007). These are interesting topics to be studied.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
This research was partially supported by NIH grant HL65462 and a UM AHC Faculty Research Development grant. The authors thank the three reviewers and the associate editor for helpful and constructive comments.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Joaquin Dopazo
Received on June 5, 2007; revised on August 21, 2007; accepted on September 25, 2007
| REFERENCES |
|---|
|
|
|---|
Ashburner M, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet (2000) 25:25–29.[CrossRef][Web of Science][Medline]
Armstrong, et al. MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nat. Genet (2001) 30:41–47.[CrossRef][Web of Science][Medline]
Cai T. Adaptive wavelet estimation: a block thresholding and oracle inequality approach. Ann. Stat (1999) 27:898–924.[CrossRef]
Gordon J, et al. Translation of microarray data into clinically relevant cancer diagnostic tests using gege expression ratios in lung cancer and mesothelioma. Cancer Res (2002) 62:4963–4967.
Gui J, Li H. Penalized cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data. Bioinformatics (2005) 21:3001–3008.
Guo Y, et al. Regularized linear discriminant analysis and its application in microarrays. Biostatistics (2007) 8:86–100.
Hastie T, et al. The Elements of Statistical Learning. Data mining, Inference, and Prediction (2001) New York, USA: Springer.
Huang X, Pan W. Linear regression and two-class classification with gene expression data. Bioinformatics (2003) 19:2072–2078.
Huang E, et al. Gene expression predictors of breast cancer outcomes. Lancet (2003) 361:1590–1596.[CrossRef][Web of Science][Medline]
Huang D, Pan W. Incorporating biological knowledge into distance-based clustering analysis of microarray gene expression data. Bioinformatics (2006) 22:1259–1268.
Kanehisa M. Toward pathway engineering: a new database of genetic and molecular pathway. Sci. Tech. Jpn (1996) 59:34–38.
Lottaz C, Spang R. Molecular decomposition of complex clinical phenotypes using biologically structured analysis of microarray data. Bioinformatics (2005) 21:1971–1978.
Pan W. Incorporating biological information as a prior in an empirical Bayes approach to analyzing microarray data. Stat. Appl. Genet. Mol. Biol (2005) 4. Article 12.
Pan W. Incorporating gene functions as priors in model-based clustering of microarray gene expression data. Bioinformatics (2006) 22:795–801.
Pang W, et al. Pathway analysis using random forests classification and regression. Bioinformatics (2006) 22:2028–2036.
Singh D, et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell (2002) 1:203–209.[CrossRef][Web of Science][Medline]
Srivastava MS, Kubokawa T. Comparison of discrimination methods for high dimensional data. J. Jpn. Stat. Soc (2007) 37:123–134.
Tai F, Pan W. Incorporating prior knowledge of predictors into penalized classifiers with multiple penalty terms. Bioinformatics (2007a) 23:1775–1782.
Tai F, Pan W. Incorporating prior knowledge of gene functional groups into regularized discriminant analysis of microarray data. Research report 2008–020 (2007b) Division of Biostatisitics, University of Minnesota. Available at http://www.biostat.umn.edu./rrs.php.
Tibshirani R. Regression shrinkage and selection via the LASSO. J. R. Stat. Soc.B (1996) 58:267–288.
Tibshirani R, et al. Class prediction by nearest shrunken centroids with applications to DNA Microarrays. Stat. Sci (2003) 18:104–117.[CrossRef][Web of Science]
Vapnik V. Statistical Learning Theory (1998) New York, USA: Wiley.
Wang S, Zhu J. Improved centroids estimation for the nearest shrunken centroid classifier. Bioinformatics (2007) 23:972–979.
Wang Y, et al. Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet (2005) 365:671–679.[Web of Science][Medline]
Wei Z, Li H. Nonparametric pathway-based regression models for analysis of genomic data. Biostatistics (2007) 8:265–284.
Wu B. Differential gene expression detection and sample classification using penalized linear regression models. Bioinformatics (2006) 22:472–476.
Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. B (2006) 68:49–67.[CrossRef]
This article has been cited by other articles:
![]() |
V. Zuber and K. Strimmer Gene ranking and biomarker discovery under correlation Bioinformatics, October 15, 2009; 25(20): 2700 - 2707. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||


