Bioinformatics Advance Access originally published online on May 5, 2007
Bioinformatics 2007 23(14):1775-1782; doi:10.1093/bioinformatics/btm234
Incorporating prior knowledge of predictors into penalized classifiers with multiple penalty terms
Division of Biostatistics, School of Public Health, University of Minnesota, A460 Mayo Building (MMC 303), Minneapolis, MN 55455-0378, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Motivation: In the context of sample (e.g. tumor) classifications with microarray gene expression data, many methods have been proposed. However, almost all the methods ignore existing biological knowledge and treat all the genes equally a priori. On the other hand, because some genes have been identified by previous studies to have biological functions or to be involved in pathways related to the outcome (e.g. cancer), incorporating this type of prior knowledge into a classifier can potentially improve both the predictive performance and interpretability of the resulting model.
Results: We propose a simple and general framework to incorporate such prior knowledge into building a penalized classifier. As two concrete examples, we apply the idea to two penalized classifiers, nearest shrunken centroids (also called PAM) and penalized partial least squares (PPLS). Instead of treating all the genes equally a priori as in standard penalized methods, we group the genes according to their functional associations based on existing biological knowledge or data, and adopt group-specific penalty terms and penalization parameters. Simulated and real data examples demonstrate that, if prior knowledge on gene grouping is indeed informative, our new methods perform better than the two standard penalized methods, yielding higher predictive accuracy and screening out more irrelevant genes.
Contact: weip{at}biostat.umn.edu
| 1 INTRODUCTION |
|---|
In recent years, tumor classification based on microarray gene expression data has become one of the most active research topics in bioinformatics. Numerous studies on various types of cancers have appeared in the literature, such as breast cancer (Huang et al., 2003; Wang et al., 2005), prostate cancer (Singh et al., 2002; Welsh et al., 2001), lung cancer (Bhattacharjee et al., 2001), leukemia (Golub et al., 1999), etc. Many classification methods have been newly developed or adapted for these applications. In addition to predicting some outcomes related to the cancer, a major goal is to identify genes that are related to the cancer. At the same time, biological functions of genes have been explored intensively by the biological research community. A large amount of biological information gathered so far is stored in databases, such as those with the Gene Ontology (GO) annotations (Ashburner et al. 2000) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa, 1996). In addition, prior experiments with similar biological objectives may have generated data that are relevant to the current study. Hence, borrowing information from prior data or biological knowledge seems natural for the current study; it opens up an opportunity of, and at the same time poses a challenge to, further improving over standard analysis methods that ignore prior knowledge and data.
From a methodological point of view, in the context of sample classification, in order to achieve a high predictive performance and effectively select a few relevant predictors for high-dimensional data like microarray gene expression data, many statistical learning methods have been adapted or developed, such as SVM (Vapnik, 1998), random forest (Breiman, 2001) and PAM (Tibshirani et al., 2003). These methods have gained much popularity because of their superior performance in practice. However, almost all the existing methods treat all the genes equally a priori in the process of model building, ignoring biological knowledge of gene functions, which may result in a loss of their effectiveness. For example, some genes have been identified or hypothesized to be related to cancer by previous studies; others may be known to have the same function as, or to be involved in a pathway with, some known/putative cancer-related genes; hence we may want to treat these genes differently from other genes a priori when choosing genes to predict cancer-related outcomes. To take advantage of such prior information, Lottaz and Spang (2005) proposed a structured analysis of microarray data (StAM), while Wei and Li (2006) proposed a modified boosting method called non-parametric pathway-based regression (NPR). StAM is based on the GO hierarchical structure, in which biological functions of genes are organized as a directed acyclic graph: each node in the graph represents a biological function; a child node has a more specific function while its parent node has a more general one. StAM works by first building a separate classifier for each leaf node based on an existing method (e.g. PAM), then propagating their classification results by a weighted sum to their parent nodes, where the weights are related to the performance of the classifiers; a shrinkage scheme is used to shrink the weights towards zero so that a sparse representation is possible; and the process is repeated until the results are propagated to the root node. Because the final classifier is built based on the GO tree, it greatly facilitates the interpretation of a final result in terms of identifying biological processes that are related to the outcome. However, a downside is that only the genes annotated in the leaf nodes (i.e. with the most detailed biological functions) are used as predictors; because of incomplete knowledge, other relevant genes that are not yet annotated cannot be used, which may in turn result in missing important new genes and losing predictive performance of the final model. In NPR, it is assumed that the genes can first be partitioned into several groups or pathways; then, in boosting, only pathway-specific new classifiers (i.e. using only the genes in each of the pathways) are built. Our idea is similar to NPR with regard to grouping genes, but ours applies to any penalized method through the use of group-specific penalty terms, while NPR only applies to boosting. More recently, Pang et al. (2006) proposed using random forests to rank biological pathways in regression and classification.
In this article, we propose a simple and flexible framework to incorporate prior knowledge of genes into penalized methods. In the Bayesian inference, it is standard to incorporate prior information by specifying a prior distribution. Penalized methods have a close connection to the Bayesian inference: a penalty term is related to a prior distribution for the genes involved in the penalty term (Hastie et al., 2001). However, most penalized methods only have a global penalty term involving all the genes, which essentially specifies the same prior distribution for all the genes. In order to incorporate prior information about different functional groups of the genes into analysis, we adopt group-specific penalty terms in a penalized method, thus allowing genes from different groups to have different prior distributions (e.g. different prior probabilities of being related to the cancer). As two concrete examples, we apply the idea to two penalized methods, the nearest shrunken centroids (also called PAM, Tibshirani et al., 2002) and penalized partial least squares (PPLS, Huang and Pan, 2003).
The rest of the article is organized as follows. We first review the standard PPLS and PAM. Then we introduce two new methods, mPPLS and mPAM, two modifications to PPLS and PAM, respectively: they have multiple penalty terms with multiple penalization parameters; the choice of the penalty terms is guided by prior knowledge. To reduce the computing demand in searching for multiple penalization parameters in mPPLS and mPAM, we present a weighting method that effectively reduces multiple unknown penalization parameters to only one. Simulation studies and analyses of two breast cancer datasets and a prostate cancer dataset are used to evaluate the proposed methods, and in particular illustrate the advantage of the new methods over the standard ones. We end with a short summary and discussion.
| 2 METHODS |
|---|
2.1 Notation
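Let $x_{ij}$ be the expression level of gene i in sample j, and $y_j$ be the cancer type for sample j, i = 1, ... , p; j = 1, ... , n and $y_j \in \{1, \ldots, K\}$. Denote $Y = (y_1, \ldots, y_n)'$ and $X_i = (x_{i1}, \ldots, x_{in})'$. Here, we only consider two-class classification (K = 2) where $y_j$ is binary. Suppose we have $n_1$ tumor (Class I, $C_I$) samples and $n_2$ controls (Class II, $C_{II}$) such that $n = n_1 + n_2$. The means of the expression levels of the Class I samples, the Class II samples and all n samples for gene i are

$$\bar{x}_{i1} = \frac{1}{n_1} \sum_{j \in C_I} x_{ij}, \qquad \bar{x}_{i2} = \frac{1}{n_2} \sum_{j \in C_{II}} x_{ij}, \qquad \bar{x}_i = \frac{1}{n} \sum_{j=1}^{n} x_{ij}.$$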
2.2 Nearest shrunken centroids
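The idea of nearest shrunken centroids is to shrink the class centroids $\bar{x}_{ik}$ (the mean expression level of gene i in class k) toward the overall centroid $\bar{x}_i$. Let

$$d_{ik} = \frac{\bar{x}_{ik} - \bar{x}_i}{m_k (s_i + s_0)},$$

where $s_i$ is the pooled within-class standard deviation of gene i, $s_0$ is a small positive constant and $m_k$ is a normalizing constant depending on $n_k$ and n. Each $d_{ik}$ is shrunk toward 0 by soft thresholding:

$$d'_{ik} = \mathrm{sign}(d_{ik}) \, (|d_{ik}| - \Delta)_+, \qquad (1)$$

where $(a)_+ = \max(a, 0)$ and the shrinkage parameter $\Delta$ has to be decided, usually by cross-validation (CV). We thus obtain a new shrunken centroid

$$\bar{x}'_{ik} = \bar{x}_i + m_k (s_i + s_0) \, d'_{ik}.$$

For a new test sample $x^* = (x^*_1, \ldots, x^*_p)'$, the discriminant score for class k is

$$\delta_k(x^*) = \sum_{i=1}^{p} \frac{(x^*_i - \bar{x}'_{ik})^2}{(s_i + s_0)^2} - 2 \log \pi_k,$$

where $\pi_k = n_k/n$ is the class prior probability. The new test sample is classified as Class I if $\delta_1(x^*) < \delta_2(x^*)$; otherwise, as Class II.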
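The shrinkage and classification steps of nearest shrunken centroids can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, genes are stored in rows, and the normalizing constant is assumed to be sqrt(1/n_k - 1/n), which makes it proportional to the standard error of the centroid difference.

```python
import numpy as np

def soft_threshold(d, delta):
    """Soft thresholding: sign(d) * max(|d| - delta, 0)."""
    return np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)

def shrunken_centroid_scores(X, y, x_new, delta, s0=0.0):
    """Discriminant scores of x_new for classes 1 and 2 (smaller is better).

    X: (p, n) array with genes in rows; y: length-n labels in {1, 2}.
    """
    p, n = X.shape
    classes = (1, 2)
    overall = X.mean(axis=1)
    # Pooled within-class standard deviation of each gene.
    ss = np.zeros(p)
    for k in classes:
        Xk = X[:, y == k]
        ss += ((Xk - Xk.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
    s = np.sqrt(ss / (n - len(classes)))
    scores = []
    for k in classes:
        n_k = int((y == k).sum())
        m_k = np.sqrt(1.0 / n_k - 1.0 / n)  # assumed normalizing constant
        d = (X[:, y == k].mean(axis=1) - overall) / (m_k * (s + s0))
        centroid = overall + m_k * (s + s0) * soft_threshold(d, delta)
        score = (((x_new - centroid) / (s + s0)) ** 2).sum() - 2.0 * np.log(n_k / n)
        scores.append(score)
    return np.array(scores)
```

A sample is assigned to the class with the smaller score. With a very large threshold every class centroid collapses onto the overall centroid, so only the class priors separate the scores.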
2.3 Penalized partial least squares regression
Partial least squares (PLS) was first introduced by Wold (1966) and has been heavily promoted in the chemometrics literature as an alternative to ordinary least squares. It is often used in situations where the predictors are highly collinear, and/or the number of predictors p is large relative to the sample size n, as encountered in microarray data (Nguyen and Rocke, 2002). PLS forms a sequence of uncorrelated linear components, which are linear combinations of the original predictors (i.e. gene expression levels), to predict the outcome.
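To construct PLS components, we first center Y and $X_i$ to give $U_1 = Y - \bar{y}\mathbf{1}$ and $V_{1i} = X_i - \bar{x}_i \mathbf{1}$, where $\mathbf{1} = (1, \ldots, 1)'$ is the n-dimensional unit vector. $U_1$ is regressed against each $V_{1i}$ separately. Since the means of $U_1$ and $V_{1i}$ are 0, for i = 1, ... , p, the resulting least squares regression equations are

$$\hat{U}_{1i} = b_{1i} V_{1i}, \qquad b_{1i} = \frac{V_{1i}' U_1}{V_{1i}' V_{1i}},$$

and the first component is a weighted combination of the p fitted predictors, $T_1 = \sum_{i=1}^{p} w_{1i} b_{1i} V_{1i}$, with weights $w_{1i}$. Suppose that $T_{k-1}$ ($k \geq 2$) has already been constructed from $U_{k-1}$ and $V_{k-1,i}$, and denote the values of $T_{k-1}$, $U_{k-1}$ and $V_{k-1,i}$ as $t_{k-1}$, $u_{k-1}$ and $v_{k-1,i}$. Then we have

$$U_k = U_{k-1} - \frac{t_{k-1}' u_{k-1}}{t_{k-1}' t_{k-1}} T_{k-1}, \qquad V_{ki} = V_{k-1,i} - \frac{t_{k-1}' v_{k-1,i}}{t_{k-1}' t_{k-1}} T_{k-1},$$

and $T_k$ is constructed from $U_k$ and $V_{ki}$ in the same way as $T_1$ is constructed from $U_1$ and $V_{1i}$.

The final model is obtained by regressing Y on $T_1, \ldots, T_q$ and has the form

$$Y = \beta_0 + \sum_{k=1}^{q} \beta_k T_k,$$

where $\beta_0, \ldots, \beta_q$ are estimated by OLS. Since each $T_k$ is a linear combination of the $X_i$, we can rewrite the model as

$$Y = \beta^*_0 + \sum_{i=1}^{p} \beta^*_i X_i.$$

In PPLS, the coefficients are shrunk by soft thresholding:

$$\beta'_i = \mathrm{sign}(\beta^*_i) \, (|\beta^*_i| - \Delta)_+, \qquad (2)$$

where the shrinkage parameter $\Delta$ has to be determined. In practice, we construct a new component based on the shrunken coefficients.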
Note that in practice, one has to choose two tuning parameters, the number of components q in PLS and the shrinkage parameter $\Delta$. We use cross-validation to select the pair $(q, \Delta)$ such that a minimum CV error is achieved. If multiple pairs give the minimum CV error, we choose the one with the smallest q and, for this q, the $\Delta$ such that the number of genes in the model is smallest.
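The tie-breaking rule for the tuning parameters can be written down directly. The sketch below assumes the CV results have already been collected as a list of tuples; the function name and the tuple layout are hypothetical.

```python
def select_tuning(results):
    """Pick (q, delta) from a list of (q, delta, cv_error, n_genes) tuples.

    Rule: minimum CV error; among ties the smallest q; for that q the
    setting that keeps the fewest genes in the model.
    """
    best_error = min(r[2] for r in results)
    tied = [r for r in results if r[2] == best_error]
    q_min = min(r[0] for r in tied)
    tied_q = [r for r in tied if r[0] == q_min]
    return min(tied_q, key=lambda r: r[3])
```

For example, among three settings tied at the minimum error, the one with the fewest components wins, and among those the one retaining the fewest genes.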
2.4 New methods
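In both PAM and PPLS, the magnitude of shrinkage is determined by one shrinkage parameter $\Delta$. In order to incorporate prior knowledge of different gene groups into the model, we propose using group-specific parameters $\Delta_j$'s. Specifically, we assume that, based on prior data or biological knowledge, the genes can be partitioned into $J \geq 1$ groups, $G_1, \ldots, G_J$, and we shrink the genes from different groups by possibly different magnitudes: we replace expression (1) in PAM by

$$d'_{ik} = \mathrm{sign}(d_{ik}) \, (|d_{ik}| - \Delta_j)_+ \quad \text{for } i \in G_j, \qquad (3)$$

and expression (2) in PPLS by

$$\beta'_i = \mathrm{sign}(\beta^*_i) \, (|\beta^*_i| - \Delta_j)_+ \quad \text{for } i \in G_j, \qquad (4)$$

where $d_{ik}$ is the standardized centroid difference of gene i in PAM and $\beta^*_i$ is the coefficient of gene i in PPLS.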
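Group-specific shrinkage only changes which threshold each gene sees. A minimal sketch, assuming the unshrunken estimates (standardized centroid differences in PAM or gene coefficients in PPLS) are held in one array and group labels are integers 0, ..., J-1; the function name is hypothetical.

```python
import numpy as np

def group_soft_threshold(est, groups, deltas):
    """Soft-threshold each estimate with the parameter of its own group.

    est:    unshrunken estimates, one per gene
    groups: integer group label (0..J-1) of each gene
    deltas: length-J array of group-specific shrinkage parameters
    """
    est = np.asarray(est, dtype=float)
    per_gene_delta = np.asarray(deltas, dtype=float)[np.asarray(groups)]
    return np.sign(est) * np.maximum(np.abs(est) - per_gene_delta, 0.0)
```

Setting all entries of `deltas` to the same value recovers the standard single-parameter soft thresholding.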
In practice, each of the shrinkage parameters is unknown and has to be determined; we used a grid search and CV to tune the shrinkage parameters. For a large J, it may be computationally too demanding to determine $\Delta_1, \ldots, \Delta_J$ separately. Hence, we propose the following weighted method: we assume that $\Delta_j = \Delta / w_j$ for j = 1, ... , J, where

$$w_j = \frac{1}{|G_j|} \sum_{i \in G_j} |d_{ik}| \qquad (5)$$

for PAM and

$$w_j = \frac{1}{|G_j|} \sum_{i \in G_j} |\beta^*_i| \qquad (6)$$

for PPLS, so that only one parameter $\Delta$ has to be determined, as in the standard methods, though multiple shrinkage parameters are used. Note that the weight in (5), and thus the shrinkage parameter, depends on class k; because we only considered two-class classifications, we fixed k = 1 in (5). Compared to treating the multiple shrinkage parameters separately, the weighted method, albeit less flexible, largely reduces the computational demand; furthermore, the weighted method penalizes less the genes in a group with a larger mean parameter estimate: when a group contains a larger proportion of non-zero coefficients or a few large non-zero coefficients, indicating the existence of potentially useful genes in the group, the coefficients of the genes in the group will be shrunken less and are thus less likely to be zero, leading to both a higher chance of identifying important genes and in general smaller biases of shrinkage estimates, the latter of which may in turn improve predictive performance (Dabney, 2005). Given that the genes in a pathway or from the same functional group tend to work together, this is biologically reasonable: if the prior knowledge on grouping genes is informative, then the genes in the same group are more likely to be either involved or not involved in the biology; i.e. it is more likely that the genes from the same group have either zero or non-zero parameters simultaneously. Hence, borrowing information across the genes from the same group not only reduces the variability of the resulting estimate for a gene, but also leads to bias reduction for the non-zero parameters (i.e. relevant genes), as opposed to their being shrunken by a constant as in soft thresholding.
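The weighted method can be sketched as follows. The sketch assumes, as the description of the weights suggests, that each group weight is the mean absolute unshrunken estimate in the group, so that a group with larger estimates receives a smaller shrinkage parameter; the function name and this exact form of the weight are assumptions, and group labels are taken to be integers 0, ..., J-1 with no empty group.

```python
import numpy as np

def weighted_soft_threshold(est, groups, delta):
    """Soft thresholding with group parameters delta_j = delta / w_j.

    w_j is taken to be the mean absolute estimate in group j (an assumption).
    Returns the shrunken estimates and the group weights.
    """
    est = np.asarray(est, dtype=float)
    groups = np.asarray(groups)
    sizes = np.bincount(groups)
    with np.errstate(divide="ignore", invalid="ignore"):
        w = np.bincount(groups, weights=np.abs(est)) / sizes
        # A group whose estimates are all zero gets an infinite threshold.
        delta_j = np.where(w > 0, delta / w, np.inf)
    shrunk = np.sign(est) * np.maximum(np.abs(est) - delta_j[groups], 0.0)
    return shrunk, w
```

Only the single parameter `delta` has to be tuned by CV; the group structure enters through the weights alone.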
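Treating each gene as a separate group, the above weighting methods become

$$d'_{ik} = d_{ik} \left(1 - \frac{\Delta}{d_{ik}^2}\right)_+, \qquad \beta'_i = \beta^*_i \left(1 - \frac{\Delta}{\beta_i^{*2}}\right)_+, \qquad (7)$$

which have the form of the non-negative garrote (NG) estimator (Breiman, 1995).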
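When every gene forms its own group and the weight is the absolute value of its own estimate, the threshold becomes delta/|est|, which gives a garrote-type rule. The sketch below illustrates this special case under that assumption; the function name is hypothetical.

```python
import numpy as np

def garrote_threshold(est, delta):
    """Garrote-type shrinkage est * max(1 - delta / est**2, 0); zeros stay zero."""
    est = np.asarray(est, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        factor = np.where(est != 0, 1.0 - delta / est ** 2, 0.0)
    return est * np.maximum(factor, 0.0)
```

Compared with soft thresholding at the same `delta`, large estimates are shrunk less: an estimate of 2 with `delta = 1` becomes 1.5 here, whereas soft thresholding would give 1.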
| 3 RESULTS |
|---|
We used both simulated and real data to compare our methods with the standard ones.
3.1 Simulation
To mimic real data, we used gene expression profiles from a breast cancer study (Huang et al., 2003) as the predictors in our simulation study. The original breast cancer data consisted of a total of 89 breast cancer patients. The arrays used were Affymetrix HG U95Av2 genechips, each containing 12 625 probe sets (also called genes for convenience). We used the observed expression levels as predictors X to simulate the outcome Y. We restricted the number of genes to p = 1000 or 3000 in order to limit the computational time: we used only the top p genes with the largest sample variances across all 89 samples. The binary outcome Y was generated in two steps: we first generated a continuous response Z from a linear model in the gene expression levels with regression coefficients β, and then converted Z into the binary outcome Y.
Two sets of regression coefficients β were used: in the first set, β1,..., β100, for 100 of the genes, were randomly drawn from N(0, 10), while in the second set each of the remaining p – 100 βs was set to 0, representing two gene functional groups: informative and non-informative ones. To investigate the sensitivity of the proposed methods to the misspecification of the gene functional groups, we partitioned the whole set of genes as follows.
- Perfect specification: we correctly set the 100 informative genes as one group and the remaining ones as another group.
- Misspecification: we randomly chose m genes from the informative and non-informative groups, respectively, then exchanged their group memberships; the first group contained 100 – m informative and m non-informative genes, while the other group contained all other genes. We tried m = 20 and m = 80, corresponding to light and heavy misspecifications, respectively.
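The misspecification scheme amounts to swapping the group labels of m informative and m non-informative genes. A minimal sketch, assuming the informative genes are stored first; the function name is hypothetical.

```python
import numpy as np

def misspecify_groups(p, n_informative, m, rng):
    """Group labels (1 = 'informative' group, 0 = other) after swapping m genes.

    Assumes genes 0..n_informative-1 are the truly informative ones.
    """
    labels = np.zeros(p, dtype=int)
    labels[:n_informative] = 1
    moved_out = rng.choice(n_informative, size=m, replace=False)
    moved_in = n_informative + rng.choice(p - n_informative, size=m, replace=False)
    labels[moved_out] = 0  # informative genes wrongly placed in the other group
    labels[moved_in] = 1   # non-informative genes wrongly labelled informative
    return labels
```

With `m = 0` the perfect specification is recovered; `m = 20` and `m = 80` give the light and heavy misspecifications described above, and the first group always keeps 100 members.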
Seven approaches were considered: standard PLS, standard PPLS with only one penalization parameter, our new PPLS with multiple penalization parameters (mPPLS) and with a weighted shrinkage parameter (wPPLS), standard PAM with one penalization parameter, and our new PAM with multiple shrinkage parameters (mPAM) and with a weighted shrinkage parameter (wPAM). As a benchmark, we also considered PLS with only informative genes, the ideal (but not practical) case where we knew the truth about which genes were relevant.
Table 1 showed the mean classification errors and the mean numbers of genes selected over 1000 replications. As expected, the PLS with only informative genes gave lowest misclassification errors (7.17). Generally, the corresponding PPLS- and PAM-based methods performed very similarly. The PPLS-based methods tended to select more informative genes but also more non-informative ones. The proposed methods with multiple shrinkage parameters with a perfect specification of the gene groups (i.e. mPPLS and mPAM) had the best performance among the PPLS- and PAM-based methods, respectively; in particular, they were significantly better than the standard methods with only one shrinkage parameter. The multiple shrinkage parameter methods based on light misspecification (m = 20) also performed better than the methods with no or only one shrinkage parameter. Even the new methods with a severe misspecification (m = 80) performed similarly as the standard PPLS and PAM for p = 1000, and strikingly, it might perform slightly better than PPLS and PAM as the total number of the genes p increased. The multiple shrinkage parameter methods, without regard to perfect specification or mis-specification of the gene groups, were much less sensitive to the total number of the genes included in a starting model: their performance went down much slower than other methods. On the other hand, in all cases, the multiple shrinkage parameter methods used not only a much higher percentage of informative genes but also much fewer genes in total than the standard methods. The weighted penalized methods (i.e. wPPLS and wPAM) provided a good approximation to multiple penalized methods in terms of prediction and gene selection performance. 
We can also observe that the weighted methods with each gene being treated as an individual group, corresponding to using the NG estimates, performed slightly better than the standard method (with the soft-thresholding) while using fewer genes; however, they were not as good as the weighted methods (with two groups).
In summary, the proposed methods, by including more informative genes and fewer non-informative genes, gave better predictive performance and better interpretability than the standard methods.
3.2 Examples
We applied our new methods to three public datasets. The first one was the breast cancer data from Huang et al. (2003), denoted as BCH; here we only focused on the recurrence outcome. There were in total n = 52 samples (18 with recurrence of tumor and 34 without) and p = 12 625 probe sets from Affymetrix HG U95Av2 genechips. The second one was the breast cancer data from Wang et al. (2005), denoted as BCW, containing expression profiles for n = 286 patients with lymph-node-negative primary breast cancer, of whom 107 patients developed distant metastasis during the 5-year follow-up while 179 were relapse free. The genechips used were Affymetrix HG U133A, each containing p = 22 283 probe sets. The third one was on prostate cancer (Welsh et al., 2001), denoted as PSW. There were in total n = 34 samples in PSW: 25 tumors and 9 normal tissues arrayed by Affymetrix HG U95A chips.
A double 10-fold CV was used to estimate the classification errors and the numbers of the genes included in a final model. Specifically, (1) we randomly partitioned a dataset D into 10 parts of almost equal size, denoted as $D_1, \ldots, D_{10}$; (2) for k = 1, ... , 10, we left out $D_k$ as the test data and used $D_{-k} = \bigcup_{q \neq k} D_q$ as training data: (a) a 10-fold CV was conducted on $D_{-k}$ to select appropriate tuning parameters (i.e. the $\Delta$'s and q); (b) the model with the selected tuning parameters was fitted using $D_{-k}$, and we recorded the number of the genes in the fitted model; and (c) the number of classification errors was recorded when the fitted model was applied to $D_k$. In short, we conducted an honest CV in which the test data were never used in any aspect of model building (Ambroise and McLachlan, 2002).
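The double (nested) CV can be sketched generically: an inner CV on the training part selects the tuning parameter, and the left-out part is used only for counting errors. The classifier below is a toy soft-thresholded centroid rule standing in for PAM or PPLS; all function names are hypothetical, and samples are stored in rows here.

```python
import numpy as np

def fit_centroids(X, y, delta):
    """Toy stand-in classifier: class centroids soft-thresholded toward the overall mean."""
    overall = X.mean(axis=0)
    centroids = {}
    for c in np.unique(y):
        d = X[y == c].mean(axis=0) - overall
        centroids[int(c)] = overall + np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)
    return centroids

def predict_centroids(centroids, X):
    keys = sorted(centroids)
    dist = np.stack([((X - centroids[c]) ** 2).sum(axis=1) for c in keys], axis=1)
    return np.array(keys)[dist.argmin(axis=1)]

def count_errors(fit, predict, X_tr, y_tr, X_te, y_te, param):
    model = fit(X_tr, y_tr, param)
    return int((predict(model, X_te) != y_te).sum())

def double_cv(X, y, grid, fit, predict, k=10, seed=0):
    """Total test errors of an outer k-fold CV with an inner k-fold CV for tuning."""
    rng = np.random.default_rng(seed)
    n = len(y)
    total = 0
    for test in np.array_split(rng.permutation(n), k):
        train = np.setdiff1d(np.arange(n), test)
        X_tr, y_tr = X[train], y[train]
        # Inner CV uses the training part only, so the test part stays untouched.
        inner = np.array_split(rng.permutation(len(train)), k)
        inner_err = []
        for param in grid:
            err = 0
            for val in inner:
                rest = np.setdiff1d(np.arange(len(train)), val)
                err += count_errors(fit, predict, X_tr[rest], y_tr[rest],
                                    X_tr[val], y_tr[val], param)
            inner_err.append(err)
        best = grid[int(np.argmin(inner_err))]
        total += count_errors(fit, predict, X_tr, y_tr, X[test], y[test], best)
    return total
```

Because the parameter is chosen anew inside every outer fold, the reported error is not biased by tuning on the test samples.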
3.2.1 Breast cancer data
The SuperArray cancer arrays provided a list of 113 genes known or hypothesized to be related to tumor metastasis (www.superarray.com). We identified 223 probe sets in an Affymetrix HG U95Av2 genechip corresponding to a subset of those 113 genes. These 223 probe sets were treated as the informative group and the remaining 12 402 as the non-informative group. Table 2 provides a comparison between our new multiple shrinkage parameter methods and the standard penalized methods based on a double 10-fold CV. The mPPLS and mPAM performed much better than PPLS and PAM: the former two had fewer CV errors, while including far fewer genes, among which higher proportions came from the informative group. The wPPLS and wPAM performed similarly to PPLS and PAM.
Using the same set of the 113 genes related to tumor metastasis, we obtained 275 probe sets as the informative group and the remaining 22 008 as the non-informative group for the BCW data. Table 3 shows the performance of the methods: mPPLS selected fewer genes than PPLS but gave more errors; on the other hand, mPAM selected far fewer genes and performed better than PAM. Again, the wPPLS and wPAM performed similarly to PPLS and PAM, but selected fewer genes.
Those 113 genes can be further partitioned into seven groups according to their biological functions: cell adhesion, extracellular matrix proteins, cell cycle, cell growth and proliferation, apoptosis, transcription factors and regulators, and other genes involved in metastasis; we treated the remaining genes as another group, resulting in a total of eight groups. We applied the weighted methods to the eight groups of genes. Results are shown in Tables 2 and 3: there was no or only slight improvement.
We also applied wPAM and wPPLS to the BCW data based on the cancer pathway information as described in Wei and Li (2006). 245 genes from 33 cancer-related sub-pathways and 188 cancer-related genes yielded a total of 433 genes in 34 gene groups. The misclassification rates based on 10-fold CV for wPAM and wPPLS were 99/286 = 34.6% and 110/286 = 38.5%, respectively, which were competitive with other methods as shown in Wei and Li (2006): e.g. the random forest gave 33% while SVM gave 42%. Nevertheless, our main purpose here is not to compare our results with those of other non-PAM or non-PPLS classifiers, because there may be inherent differences among the classifiers, e.g. between PAM and SVM, implying that it may be unfair to compare wPAM with SVM; rather, because of the generality of our proposal, in the future we may compare SVM with its modified versions, say mSVM and wSVM, that have multiple penalization parameters (see Hastie et al., 2001, for a formulation of SVM as a penalized method with only a single penalization parameter).
3.2.2 Prostate cancer data
The SuperArray cancer arrays also provided a list of 263 genes useful as molecular markers for the prognosis and diagnosis of prostate cancer, which corresponded to 411 probe sets in an Affymetrix HG U95A chip. As before, we used this list of the probe sets as the informative group while the remaining ones as the non-informative group, and the classification results were shown in Table 4. Even though all the methods gave good predictive accuracy rates, the new methods mPPLS and mPAM used much fewer genes, most of which came from the informative group.
3.2.3 Using BCH data to generate prior for analyzing BCW data
Since the two breast cancer datasets had the same clinical outcome, recurrence of tumor, we would like to combine them in one analysis. We chose to use the BCH data to generate prior information because the BCH study was actually conducted prior to BCW (year 2003 versus 2005). The goal was to find out whether we could gain some improvement in prediction for the BCW data by incorporating gene information drawn from the BCH data. We applied PAM to the BCH data: the final model contained 234 genes, some of which were likely to be related to the recurrence of breast cancer. Those 234 genes corresponded to 452 probe sets on an HGU133A chip for the BCW data. We used those 452 probe sets as the informative group and the remaining 21 831 as the non-informative group for the BCW data. Table 5 shows the performance of the methods. The mPAM performed much better than PAM: mPAM had 104 samples misclassified with on average only 439.8 genes, while PAM had 116 samples misclassified with 2410.8 genes. In addition, mPAM used a higher proportion of the genes from the informative group. The wPPLS and PPLS performed the same: both performed better than mPAM and PAM with fewer classification errors, but used many more genes than the two PAM-based methods; though giving five more errors, mPPLS selected fewer genes than PPLS and wPPLS.
3.2.4 Using KEGG for BCW data
KEGG is a knowledge base for systematic analysis of gene functions in terms of the networks of genes and molecules (Kanehisa, 1996). The major component of KEGG is a pathway database containing information about biochemical pathways. On an HGU133A microarray, there are 6243 probes belonging to one or more of the total 183 KEGG pathways. We combined those 6243 probes and the top 3000 probes with the largest sample variances, resulting in a total of 8072 distinct probes. In order to apply wPAM and wPPLS, we grouped genes according to the KEGG pathways and treated those genes that were not in any pathway as individual groups with group size one. To deal with those genes belonging to multiple pathways, we considered two approaches for the weighted methods. The first, called random assignment, was to randomly assign a gene to one of the several pathways in which it was annotated. The other approach, called non-random assignment, was to first include the gene in each of the multiple pathways in which it was annotated when calculating the group weights $w_j$'s; then we selected the minimum of these $w_j$'s, say $w_{j_0}$, and used $\Delta / w_{j_0}$ as the shrinkage parameter for the gene; other genes in only one or none of the pathways were treated the same as before. For comparison, the standard PAM/PPLS and wPAM/wPPLS with either KEGG pathways or individual genes as groups (NG) were applied. Results obtained from 10 independent runs of a double 10-fold CV are shown in Table 6. First, it is reassuring to see that for the weighted methods the two approaches to handling genes in multiple groups yielded similar results. Second, although the wPPLS with the KEGG grouping had a slightly larger misclassification error rate than PPLS, it did use far fewer genes; on the other hand, wPAM based on the KEGG pathways had a smaller error rate and used far fewer genes than PAM. Finally, the NG gave either a higher or similar misclassification error as compared to the standard methods, but selected fewer genes. In summary, in terms of both predictive accuracy and model parsimony, the weighted methods using the KEGG pathways seemed to be the winner.
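The non-random assignment rule for a gene annotated in several pathways reduces to one line. A minimal sketch with a hypothetical function name; the weights are assumed to be positive and already computed with the gene included in each of its pathways.

```python
def shrinkage_for_gene(delta, pathway_weights):
    """Shrinkage parameter delta / w_min for a gene in several pathways.

    pathway_weights: positive group weights of all pathways containing the gene.
    """
    w_min = min(pathway_weights)
    return delta / w_min
```

Taking the minimum weight means the gene receives the largest of the candidate shrinkage parameters, i.e. it is penalized as in its least promising pathway.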
| 4 DISCUSSION |
|---|
Here, we have proposed a simple and flexible framework to incorporate various sources of prior knowledge on gene functions into building more effective penalized classifiers. In contrast to standard methods that treat all the genes equally a priori, we propose to partition the genes into various groups based on prior data or biological knowledge such that the genes in the same group are more likely to function similarly; we then use group-specific penalty terms and associated penalty parameters to account for possibly varying degrees of relevance of the gene groups to the outcome of interest. Implemented in PAM and PPLS, the proposed methods were shown to have better predictive performance while containing fewer genes as compared to the standard PAM and PPLS with simulated data and several real datasets. We also investigated the robustness of the new methods in a simulation study: even when the gene groups were not completely correctly specified, the new methods worked either better than or at least as well as the standard methods. Nevertheless, in general, the performance of the proposed methods depends on the degree of informativeness of the gene grouping, as expected.
The basic idea of our proposal parallels that of Pan (2005) in the context of detecting differential gene expression and that of Pan (2006) in clustering gene expression profiles for gene function discovery. The importance of incorporating biological knowledge into analysis has been increasingly recognized (Dopazo, 2006), but most applications are in clustering analysis (e.g. Al-Shahrour et al., 2005; Cheng et al., 2004; Fang et al., 2005; Huang and Pan, 2006), while there seem to be fewer studies in classification, with only a few exceptions (Lottaz and Spang, 2005; Pang et al., 2006; Wei and Li, 2006).
In the real data examples, we have shown how to incorporate biological knowledge, extracted from either an existing database or a previous study, into the current analysis; for the latter case, the previous study and the current study used two different microarray platforms for gene expression profiling. This demonstrates the flexibility of our proposed methods. As biological knowledge as well as data from relevant experimental studies accumulate over time, the proposed framework provides a general way to incorporate such ever-increasing amount of prior knowledge into analysis and thus also a potential to further improve the predictive performance.
Our use of multiple penalization parameters or terms for multiple gene groups is related to block thresholding in wavelets (Cai, 1999). However, a major difference is that our groups (or blocks) of the genes are formed based on prior knowledge while they are data-driven in the latter; nonetheless, theoretical optimum properties of block thresholding as compared to term-by-term thresholding (corresponding to a single shrinkage parameter in standard penalized methods) in wavelets may provide theoretical support for our proposal in the current context. Likewise, the theory of adaptive Lasso (Zou, 2006) may also help justify and explain the good performance of our proposed weighted methods. At this moment, we do not have a rigorous theory for our proposed methods; however, in addition to empirical evidence shown in numerical examples, the connection of our methods to the NG estimator, as opposed to soft-thresholding, along with the common wisdom of averaging over groups to reduce noise, suggests an intuitive argument favoring our proposed methods. Finally, although we have only focused on PAM and PPLS as concrete examples, our idea can be equally applied to many penalized methods for regression and classification, such as Lasso (Tibshirani, 1996) and SVM (Vapnik, 1998), or for other outcome variables, such as survival times (Broet et al., 2006; Gui and Li 2005). These are all interesting topics to be studied in the future.
| ACKNOWLEDGEMENTS |
|---|
This research was partially supported by NIH grant HL65462 and a UM AHC Faculty Research Development grant. The authors thank the reviewers for helpful and constructive comments.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Olga Troyanskaya
Received on January 5, 2007; revised on April 4, 2007; accepted on April 26, 2007
| REFERENCES |
|---|
Al-Shahrour F, et al. Discovering molecular functions significantly related to phenotypes by combining gene expression data and biological information. Bioinformatics (2005) 21:2988–2993.
Ambroise C, McLachlan GJ. Selection bias in gene extraction on the basis of microarray gene-expression data. PNAS (2002) 99:6562–6566.
Ashburner M, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. (2000) 25:25–29.
Breiman L. Better subset regression using the nonnegative garrote. Technometrics (1995) 37:373–384.
Breiman L. Random forests. Mach. Learn. (2001) 45:5–32.
Bhattacharjee A, et al. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclass. Proc. Natl Acad. Sci. USA (2001) 98:13790–13795.
Broet P, et al. Identifying gene expression changes in breast cancer that distinguish early and late relapse among uncured patients. Bioinformatics (2006) 22:1477–1485.
Cai T. Adaptive wavelet estimation: a block thresholding and oracle inequality approach. Ann. Stat. (1999) 27:898–924.
Cheng J, et al. A knowledge-based clustering algorithm driven by gene ontology. J. Biopharm. Stat. (2004) 14:687–700.
Dabney AR. Classification of microarrays to nearest centroids. Bioinformatics (2005) 21:4148–4154.
Dopazo J. Functional interpretation of microarray experiments. OMICS: J. Integr. Biol. (2006) 10:398–410.
Fang, et al. Journal of Biomedical Informatics (2006) 39:401–411.
Golub TR, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science (1999) 286:531–537.
Gui J, Li H. Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data. Bioinformatics (2005) 21:3001–3008.
Hastie T, et al. The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2001) New York, USA: Springer.
Huang X, Pan W. Linear regression and two-class classification with gene expression data. Bioinformatics (2003) 19:2072–2078.
Huang E, et al. Gene expression predictors of breast cancer outcomes. Lancet (2003) 361:1590–1596.
Huang D, Pan W. Incorporating biological knowledge into distance-based clustering analysis of microarray gene expression data. Bioinformatics (2006) 22:1259–1268.
Kanehisa M. Toward pathway engineering: a new database of genetic and molecular pathway. Sci. Technol. Japan (1996) 59:34–38.
Lottaz C, Spang R. Molecular decomposition of complex clinical phenotypes using biologically structured analysis of microarray data. Bioinformatics (2005) 21:1971–1978.
Nguyen DV, Rocke DM. Tumor classification by partial least squares using microarray gene expression data. Bioinformatics (2002) 18:39–50.
Pan W. Incorporating biological information as a prior in an empirical Bayes approach to analyzing microarray data. Stat. Appl. Genet. Mol. Biol. (2005) 4:Article 12.
Pan W. Incorporating gene functions as priors in model-based clustering of microarray gene expression data. Bioinformatics (2006) 22:795–801.
Pang H, et al. Pathway analysis using random forests classification and regression. Bioinformatics (2006) 22:2028–2036.
Singh D, et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell (2002) 1:203–209.
Tibshirani R. Regression shrinkage and selection via the LASSO. J. R. Stat. Soc. B (1996) 58:267–288.
Tibshirani R, Hastie T, Narasimhan B, Chu G. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Natl Acad. Sci. USA (2002) 99:6567–6572.
Tibshirani R, Hastie T, Narasimhan B, Chu G. Class prediction by nearest shrunken centroids with applications to DNA microarrays. Stat. Sci. (2003) 18:104–117.
Vapnik V. Statistical Learning Theory (1998) Wiley.
Wang Y, et al. Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet (2005) 365:671–679.
Wei, Li. Biostatistics (2007) 8:265–284.
Welsh JB, et al. Analysis of gene expression identifies candidate markers and pharmacological targets in prostate cancer. Cancer Res. (2001) 61:5974–5978.
Wold H. Estimation of principal components and related models by iterative least squares. In: Krishnaiah PR, ed. Multivariate Analysis (1966) New York: Academic Press. 391–420.
Yuan M, Lin Y. On the non-negative garrotte estimator. J. R. Stat. Soc. B (2007) 69:143–161.
Zou H. The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. (2006) 101:1418–1429.