Bioinformatics Advance Access originally published online on April 6, 2005
Bioinformatics 2005 21(13):3001-3008; doi:10.1093/bioinformatics/bti422
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data
1Department of Statistics, University of California Davis, CA 95616, USA
2Rowe Program in Human Genetics, University of California Davis, CA 95616, USA
*To whom correspondence should be addressed.
| Abstract |
|---|
|
|
|---|
Motivation: An important application of microarray technology is to relate gene expression profiles to various clinical phenotypes of patients. Success has been demonstrated in molecular classification of cancer in which the gene expression data serve as predictors and different types of cancer serve as a categorical outcome variable. However, there has been less research in linking gene expression profiles to the censored survival data such as patients' overall survival time or time to cancer relapse. It would be desirable to have models with good prediction accuracy and parsimony property.
Results: We propose to use the L1 penalized estimation for the Cox model to select genes that are relevant to patients' survival and to build a predictive model for future prediction. The computational difficulty associated with the estimation in the high-dimensional and low-sample size settings can be efficiently solved by using the recently developed least-angle regression (LARS) method. Our simulation studies and application to real datasets on predicting survival after chemotherapy for patients with diffuse large B-cell lymphoma demonstrate that the proposed procedure, which we call the LARSCox procedure, can be used for identifying important genes that are related to time to death due to cancer and for building a parsimonious model for predicting the survival of future patients. The LARSCox regression gives better predictive performance than the L2 penalized regression and a few other dimension-reduction based methods.
Conclusions: We conclude that the proposed LARSCox procedure can be very useful in identifying genes relevant to survival phenotypes and in building a parsimonious predictive model that can be used for classifying future patients into clinically relevant high- and low-risk groups based on the gene expression profile and survival times of previous patients.
Supplementary information: http://dna.ucdavis.edu/~hli/LARSCox-Appendix.pdf
Contact: hli{at}ucdavis.edu
| INTRODUCTION |
|---|
|
|
|---|
DNA microarray technology permits simultaneous measurements of expression levels for thousands of genes, which offers the possibility of a powerful, genome-wide approach to the genetic basis of different types of tumors. The genome-wide expression profiles can be used for molecular classification of cancers, for studying varying levels of drug responses in the area of pharmacogenomics and for predicting different patients' clinical outcomes. The problem of cancer class prediction using gene expression data, which can be formulated as predicting binary or multi-category outcomes, has been studied extensively and has been demonstrated great promise in recent years (Alon et al., 1999; Golub et al., 1999; Alizadeh et al., 2000; Garber et al., 2001; Sorlie et al., 2001). However, there has been less development in relating gene expression profiles to other phenotypes, such as quantitative continuous phenotypes or censored survival phenotypes such as time to cancer recurrence or time to death. Due to the large variability in time to a certain clinical event such as cancer recurrence among cancer patients, studying possibly censored survival phenotypes can be more informative than treating the phenotypes as binary or categorical variables.
The Cox regression model (Cox, 1972) is the most popular method in regression analysis for censored survival data. However, due to the very high-dimensional space of the predictors, i.e. the genes with expression levels measured by microarray experiments, the standard maximum Cox partial likelihood method cannot be applied directly to obtain the parameter estimates. Besides the high-dimensionality, the expression levels of some genes are often highly correlated, which creates the problem of high collinearity. To deal with the problem of collinearity, the most popular approach is the penalized partial likelihood, including both the L2 penalized estimation, which is often called the ridge regression, and the L1 penalized estimation, which was proposed by Tibshirani (1996) and is called the least absolute shrinkage and selection operator (Lasso) estimation. Such a Lasso procedure minimizes the negative log partial likelihood subject to the sum of the absolute value of the coefficients being less than a constant s. Compared to the L2 penalized procedure with constraints on the sum of the square of the coefficients, the Lasso procedure provides a method for variable selection. These penalized procedures have been investigated mainly in the setting where the sample size is greater than the number of predictors. Li and Luan (2003) were the first to investigate the L2 penalized estimation of the Cox model in the high-dimensional low-sample size settings and applied their method to relate the gene expression profile to survival data. To avoid the inversion of large matrices, they used kernel tricks to reduce the computation to involving only inversion of the matrix of the size of the sample size. They demonstrated that such a procedure can be applied to build a model for predicting patients' future survival times.
One limitation of the L2 penalized estimation of the Cox model as presented in Li and Luan (2003) is that it uses all the genes in the prediction and does not provide a way of selecting relevant genes for prediction. However, from the biological point of view, one should expect that only a small subset of the genes are relevant to predicting the phenotypes. Including all the genes in the predictive model introduces noise and is expected to lead to poor predictive performance. Due to the high-dimensionality, the standard variable selection methods such as stepwise and backward selection cannot be applied. Tibshirani (1997) further extended the Lasso procedure for variable selection for the Cox proportional hazard models and proposed using the quadratic programming procedure for maximizing the L1 penalized partial likelihood in order to obtain the parameter estimates. However, such a quadratic programming procedure cannot be applied directly to the settings when the sample size is much smaller than the number of potential predictors, such as in the setting of microarray data analysis.
Recently, Efron et al. (2004) proposed the least angle regression (LARS) procedure for variable selection in the linear regression setting. The LARS selects predictors by its current correlation or angle with the response, where the current correlation is defined as the correlation between the predictor and the current residuals. If the active set is defined as the set of indices corresponding to covariates with the greatest absolute current correlations, as the constraint constant s increases, the predictors are chosen one by one without deletion into the active set. The special feature of LARS is that before a new predictor is chosen to the active set as s increases, the corresponding increment of the coefficients only depends on all predictors in the active set. Efron et al. (2004) further pointed out the link between LARS and Lasso, showing that LARS can be modified to provide solutions for Lasso. Instead of solving Lasso discretely by quadratic programming, modified LARS can give the whole solution path of all predictors. With this powerful algorithm, Lasso can be extended to perform subset selection in the high-dimension and low-sample settings. We propose in this paper to use the LARS algorithm to obtain the solutions for the Cox model with L1 penalty in the setting of very high-dimensional covariates such as the gene expression data obtained by microarrays. We call such an estimation procedure the LARSCox procedure.
The rest of the paper is organized as follows. We first present the model and briefly review the Lasso estimation of the regression coefficients and present a modified LARS procedure for the Lasso estimation. We then evaluate the LARSCox procedure by simulation studies and applications to a real dataset of diffuse large B-cell lymphoma (DLBCL) survival times and gene expression data (Rosenwald et al., 2002). Comparisons of results with methods proposed previously by using simulations and analysis of real datasets of patients with DLBCL are also presented. Finally, we give a brief discussion of the methods and conclusions.
| STATISTICAL MODELS AND METHODS |
|---|
|
|
|---|
Cox proportional hazards model and Lasso estimation
Suppose we have a sample size of n from which to estimate the relationship between the survival time and the gene expression levels X1,...,Xp of p genes. Due to censoring, for i = 1,...,n, the ith datum in the sample is denoted by (ti,
i, xi1, xi2,..., xip), where
i is the censoring indicator and ti is the survival time if
i = 1 or censoring time if
i = 0, and xi = {xi1, xi2,...,xip}' is the vector of the gene expression level of p genes for the ith sample. Our aim is to build the following Cox regression model for the hazard of cancer recurrence or death at time t:
![]() | (1) |
0(t) is an unspecified baseline hazard function, ß = {ß1,...,ßp} is the vector of the regression coefficients and X = {X1,...,Xp} is the vector of gene expression levels with the corresponding sample values of xi = {xi1,...,xip} for the ith sample. We define f(X) = ß'X to be the linear risk score function.
Based on the available sample data, the Cox's partial likelihood (Cox, 1972) can be written as
![]() |
![]() |
Tibshirani (1997) proposed the following iterative procedure to reformulate this optimization problem with constraints as a Lasso problem for linear regression models. Specifically, let
= ß'x, µ =
l/
, A =
2l/

T and z =
+ A µ, where x = (x1,...,xn) is the gene expression matrix. Here since the sum of all elements in each row (or column) of the matrix A is 0, A is clearly a singular matrix. We can however use the generalized inverse. Alternatively, Tibshirani proposed replacing the information matrix A with a diagonal matrix D, which has the same diagonal elements as A. However, in most of our applications, n is usually small and calculation of the generalized inverse is computationally feasible. In addition, due the high-dimensionality of the predictors, it is important to make the algorithm as accurate as possible. With this reparameterization, a one-term Taylor series expansion for l(ß) has the form of
![]() |
+ Aµ, (z
)TA(z
) is invariant to the choice of the generalized inverse of A. The iterative procedure of Tibshirani (1997) involves the following four steps:
- Fix s and initialize
.
- Compute
, µ, A and z based on the current value of
.
- Minimize (z ß'x)T A(z ß'x) subject to
|ßj |
s.
- Repeat step 2 and 3 until
does not change.
A LARSCox procedure for obtaining the Lasso for the Cox model
We propose an efficient algorithm called LARSCox to solve Step 3 of the algorithm, which is based on the recently proposed LARS algorithm (Efron et al., 2004). In their paper, Efron et al. (2004) proved that for the linear regression models, starting from zero, the Lasso solution paths grow piecewise linearly in a predictable way and they also proposed the LARS algorithm to efficiently solve the entire Lasso solution path using the same order of computations as a single ordinary least square fit. We propose applying the LARS algorithm for solving Step 3 of the Lasso algorithm. To do so, we first apply the Cholesky decomposition to obtain T = A1/2 such that T'T = A, and define y = Tz and
; then Step 3 of the iterative procedure presented in the previous section can be rewritten as
![]() |
and can be efficiently solved by the LARS algorithm for a given s.
To determine the value of the tuning parameter s or the number of genes to be used in the final model, one can choose s which minimizes the cross-validated partial likelihood (CVPL) (Verwij and Van Houwelingen, 1993; Huang and Harrington, 2002), which is defined as
![]() |
is the estimate of the score function based on the LARSCox procedure with tuning parameter s from the data without the ith subject. The terms l(f) and l(i)(f) are the log partial likelihoods with all the subjects and without the ith subject, respectively. The optimal value of s is chosen to maximize the sum of the contributions of each subject to the log partial likelihood. This CVPL is a special case of a more general cross-validated likelihood approach for model selections (Smyth, 2001; Van der Laan et al., 2003) and has been demonstrated to perform well in prediction in the context of the penalized Cox regression (Huang and Harrington, 2002).
Evaluation of the predictive performance: time-dependent ROC curves and area under the curves
In order to assess how well the model predicts the outcome, we propose employing the idea of time-dependent receiver-operator characteristics (ROC) curves for censored data and area under the curve (AUC) as our criteria. These methods were recently developed by Heagerty et al. (2000) in the context of the medical diagnosis. For a given score function f(X), we can define time-dependent sensitivity and specificity functions as
![]() |
(t) is the event indicator at time t. A nearest neighbor estimator for the bivariate distribution function is used for estimating these conditional probabilities accounting for possible censoring (Akritas, 1994). Note that a larger AUC at time t based on a score function f(X) indicates better predictability of time to event at time t as measured by sensitivity and specificity evaluated at time t. In our application presented in the next section, we study several different methods of constructing the score function f(X) in the Cox model ( ) and compare their predictive performance based on the AUCs. | EVALUATION OF THE METHODS BY SIMULATION STUDIES |
|---|
|
|
|---|
We performed simulation studies to evaluate how well the LARSCox procedure performs in the high-dimensional and low-sample size settings. We focus on whether the important covariates that are related to survival endpoints can be selected by the LARSCox procedure and how well the model can be used for predicting the survival time for future patients.
In our simulation studies, we assume that 20 out of a total of 500 genes are related to time to cancer recurrence through a Cox regression model with 10 coefficients generated from a uniform U(1, 0.1) distribution and 10 coefficients generated from a uniform U(0.1, 1) distribution (see first column of Table 1 for the coefficients generated). A Weibull distribution with the shape parameter of 5 and the scale parameter of 2 is used for the baseline hazard function, and a uniform U(2,10) is used for simulating the censoring times. Based on this setting, we would expect about 40% censoring.
|
In order to generate gene expression data for 500 predictors (genes), we first generate a 100 x 500 dataset X from a uniform U(1.5, 1.5) distribution. We assume that the first 20 genes with expression levels X1, X2,...,X20 are related to patients' risk cancer recurrence through a Cox model. In order to generate gene expression data for the rest of the 480 genes which are not related to the survival but of which some may be correlated with the 20 relevant genes, we first use GramSchmidt orthonormalization to construct its normalized orthogonal basis {
1,...,
20,
1,...,
80}, where
= {
1,
2,...,
20} is an orthogonal basis of the linear space A expanded by X1,X2,...,X20 and
= {
1,
2,...,
80} is a set of orthogonal basis of B, which is the orthogonal complement space of A. By Cauchy's inequality, it is easy to show that if {
1,...,
20,
1,...,
80} is a set of normalized orthogonal bases, then for any 20 x 80 matrix T, we have
, for
x
R80, y
R20, where
2 is the largest eigenvalue of T'T. Based on this result, we can generate the expression levels of genes which are unrelated to survival from the linear space C = {
+
T} with an appropriate choice of the maximum eigenvalue of T'T in order to control the maximum possible correlation between vectors in spaces A and C. We considered the maximum possible correlation of 0, 0.71, 0.82 and 0.87 in our simulations. These numbers are chosen to assess how gradual changes in correlations between irrelevant and non-irrelevant genes affect the LARSCox procedure in identifying relevant genes. Note that the actual observed possible correlations between the relevant (to the risk of event) 20 genes and the irrelevant genes are much smaller than these values, and for most simulated datasets, the observed maximum of the pair-wise sample correlations between genes is smaller than half of the theoretical maximum correlation.
Effects of between-gene correlations on identifying relevant genes
For each chosen maximum possible correlation between the relevant genes and non-relevant genes, we generated 100 datasets with a sample size of 100 individuals. For each replication, we applied the LARSCox procedure to build a model which included 20 genes by selecting an appropriate s value in the LARSCox estimation. Table 1 summarizes the frequencies that the 20 relevant genes are among the first 20 genes that are selected by the LARSCox procedure. We observe the following interesting results. First, as expected, the predictors with larger coefficients are more likely to be selected by the LARSCox procedure. Second, it is interesting to observe that when the maximum possible correlation between the relevant and non-relevant genes increases, i.e. when the linear space spanned by the non-relevant genes gets close to the linear space expanded by those relevant genes, the chance of the relevant genes with smaller coefficients being selected gets smaller. This is because that at each step, the LARSCox procedure only selects the gene with the largest absolute correlation in the model. Of course, the chance of these relevant genes being selected also depends on the sample size. For example, for the maximum possible correlation of 0.85, more relevant genes are selected if the sample size is increased to 200 (see the last column of Table 1).
Predictive performance and comparison with other methods
We then examined the predictive performance of the proposed method. We simulated a sample size of 100 patients as the training dataset to build the predictive model and evaluated the predictive performance based on another new dataset of 100 patients (test dataset). For each simulation, we generated 500 gene expression levels for each patients with the maximum possible between-gene correlation of 0.82. For each replication, we built a predictive model based on the training set. We applied the CVPL to choose the tuning parameter s used in the model. We also considered three other methods, including the L2 penalized procedure proposed by Li and Luan (2003), the principal-components based partial Cox regression (PC-PCR) procedure proposed by Li and Gui (2004) and the supervised principal components (SPCA) procedure proposed in Bair and Tibshirani (2004). For each method, we build the model based on the training dataset, and predict the risk scores for the 100 patients in the test set. We repeated this procedure 100 times. We used the time-dependent AUC as a criterion to assess the predictive performance.
Figure 1(a)(d) shows the averages of the estimated AUCs over 100 replications using the predictive score for the test sets for each method together with the estimated 95% point-wise confidence intervals. The plot indicates a very good predictive performance of the LARSCox procedure. The AUC is over 75% at the beginning of the follow-ups and remains high even at later times. As a comparison, the other three procedures did not perform as well as indicated by the estimated AUCs [Fig. 1(b)(d)]. It should however be noted that due to the small sample sizes we simulated, the comparisons are not statistically significant as indicated by the slight overlaps of the 95% confidence intervals. Note that both the L2 penalized procedure and the PC-PCR procedure use all the genes in building the predictive models. Clearly, neither of these procedures performed as well as the LARSCox procedure in predicting the survival times for future patients as measured by the AUCs. We also performed the L2 procedure and the PC-PCR procedure using genes selected based on univariate Cox regression analysis and did not observe any improvement in their predictive performances. The SPCA procedure, although it performs gene selection by univariate analysis, did not perform as well as the LARSCox procedure. These results indicate that selecting genes by performing univariate analysis may not be the best choice in building predictive models. In contrast, the LARSCox procedure selects genes by considering all the genes together.
|
As another way of comparing these three different methods, for each replication, we divided the patients in the test set into high- and low-risk groups based on having positive or negative predictive risk scores and tested the statistical significance in the risk of cancer recurrence between the two groups. We observed that for a p-value of <105, all 100 replications showed significant differences in risk between the high- and low-risk groups defined by the LARSCox predicted scores, as compared to only 38, 22 and 36 replications showing significant differences in risk between the high- and low-risk groups defined by the risk scores predicted by the L2 penalized procedure, the PC-PCR procedure and the SPCA procedure.
In summary, the results from our simulation studies indicate that the LARSCox procedure can indeed select genes that are related to censored phenotypes, especially those genes with relatively strong effects, although genes with smaller effects on survival are difficult to identify, especially when the correlations between the gene expression levels are high. When the correlations between the gene expression levels of the relevant genes and non-relevant genes are high, the CVPL procedure tends to select more genes in building the predictive models. However, we observed a better predictive performance of the LARSCox procedure than other procedures.
| APPLICATION TO PREDICTION OF SURVIVAL TIME OF PATIENTS WITH DLBCL |
|---|
|
|
|---|
To further demonstrate the utility of the LARSCox procedure in relating microarray gene expression data to censored survival phenotypes, we re-analyzed a recently published dataset of DLBCL by Rosenwald et al. (2002). This dataset includes a total of 240 patients with DLBCL, including 138 patient deaths during the follow-ups with a median death time of 2.8 years. Rosenwald et al. (2002) divided the 240 patients into a training set of 160 patients and a validation set or test set of 80 patients and built a multivariate Cox model. The variables in the Cox model included the average gene expression levels of smaller sets of genes in four different gene expression signatures together with the gene expression level of BMP6. It should be noted that in order to select the gene expression signatures, they performed a hierarchical clustering analysis for genes across all the samples (including both test and training samples). In order to compare our results with those in Rosenwald et al. (2002) we used the same training and test datasets in our analysis.
The gene expression measurements of 7399 genes are available for analysis. However, there are a large number of missing gene expression values in the dataset. Among the 7399 genes, only 434 genes have no missing values. We first applied a nearest neighbor technique (Troyanskaya et al., 2001) to estimate those missing values. Specifically, for each gene, we first identified eight genes which are the nearest neighbors according to Euclidean distance. We then filled the remaining with the average of the nearest neighbors. Our method is slightly different from that of Troyanskaya et al. (2001) in that the nearest neighbors are not restricted to those 434 genes with no missing data. We also tried the method of Troyanskaya et al. (2001) for filling the missing value, and the results of survival time prediction with the two methods were very close.
Selection of genes related to risk of death
We applied the LARSCox procedure to first build a predictive model using the training data of 160 patients and all the 7399 features. As the tuning parameter increases, more genes are selected and these genes are chosen in order of their relevances in predicting survival. The genes entered first in the model would provide a good list of candidate genes for further investigation. Table 2 shows the GenBank ID and a brief description of the first 10 genes selected. It is interesting to note that seven of these genes belong to the three gene expression signature groups defined in Rosenwald et al. (2002). These three signature groups include Germinal-center B-cell signature, MHC class II signature and Lymph-node signature. No genes in the proliferation signature group defined by Rosenwald et al. (2002) were among the top 10 genes selected by LARSCox. However, ribosomal protein S12 from the proliferation group was among the top 20 gene selected by our method.
|
The other three genes which do not belong to the signature groups of Rosenwald et al. (2002) may also be related to lymphoma or risk of death from lymphoma; however, evidence for their direct involvement in any mechanism is currently lacking. It should also be noted that there is always a possibility that genes are selected because of coexpression with other genes, or for reasons that cannot be explained mechanistically. Among these three genes, AA729003 [GenBank] is a protein coding TCL1A gene which has been demonstrated to be a powerful oncogene and when it is over-expressed in both B and T cells, it predominantly yields mature B cell lymphomas (Pekarsky et al., 1999). The gene L19872 [GenBank] is a Aryl hydrocarbon receptor (AHR), which is a ligand-activated transcription factor involved in the regulation of biological responses to planar aromatic hydrocarbons. The AHR has been shown to regulate xenobiotic-metabolizing enzymes such as cytochrome P450, which belongs to the lymph-node signature group. Finally, the gene AA760674 [GenBank] is a COX15 homolog, which is the terminal component of the mitochondrial respiratory chain that catalyzes the electron transfer from reduced cytochrome c to oxygen (Petruzzella et al., 1998). It has been reported that mutation in the COX15 gene can cause Leigh syndrome (Oquendo et al., 2004); however, its involvement in cancers is not clear.
Evaluation of the predictive performance
We also examined how well the model built by the LARSCox procedure predicts the survival of a future patient. Using the training set of 160 patients, we built a predictive Cox model with the LARSCox procedure using the CVPL to select the optimal tuning value s. The minimum CVPL was obtained when s = 0.28, which corresponds to selecting four genes in the model. We also observed that the CVPL value increases by only 0.001 when the tuning parameter s increases from 0.28 to 0.33, which corresponds to nine genes in the model. As a matter of fact, for s ranging from 0.28 to 0.33, the predictive performances of the resulting models are very comparable. We chose the most parsimonious model with four genes. These four genes are AA805575
[GenBank]
, LC_29222, X00452
[GenBank]
and X59812
[GenBank]
(see Table 2 for a description of these four genes), belonging to three of the four signature groups defined in Rosenwald et al. (2002).
We obtained the estimates of the coefficients of these four genes using the LARSCox procedure, denoted by vector
. The estimated coefficients for all four genes were negative, indicating that high expression levels of these genes reduce the risk of death among the patients with DLBCL. This agrees with what Rosenwald et al. (2002) has found (see Table 2 of their paper). Based on the estimated model with four genes, we estimated the risk scores (
) for the 80 patients in the test dataset based on their gene expression levels of the four genes in the predictive model. The time-dependent AUCs for 120 years after diagnosis based on the estimated scores for the patients in the test set are around 0.67 in the first 10 years of follow-up, indicating a reasonable predictive performance.
To further examine whether clinically relevant groups can be identified by the model, we used zero as a cutoff point of the risk scores and divided the test patients into two groups based on whether they have positive or negative risk scores. Figure 2(b) shows the KaplanMeier curves for the two groups of patients, showing very significant differences (p-value = 0.0004) in overall survival between the high-risk group (36 patients) and the low-risk group (44 patients).
|
A comparison with other methods
As a comparison, we also analyzed the lymphoma dataset using three other methods, the PCR method of Li and Gui (2004) the L2 penalized method of Li and Luan (2003) and the SPCA method of Bair and Tibshirani (2004). Figure 2(b)(d) shows the survival curves of the two groups of patients in the test dataset defined by the scores estimated by each of the three methods. We observe that the two risk groups defined by the LARSCox estimated model showed more significant difference in risk of death than the groups defined by the other three models (p-value of 0.0004 versus 0.003, 0.003 and 0.034). Finally, the AUCs based on the risk scores estimated by the LARSCox procedure are also higher than those from the other three procedures; however, the results are not statistically significant (the time-dependent AUCs and their 95% confidence intervals are provided in the Supplemental material).
As another evaluation of the methods, we performed another set of analyses using the training and testing datasets as defined in Bair and Tibshirani (2004); (data available from http://www-stat.stanford.edu/~tibs/superpc/staudt.html) for the lymphoma dataset. Again, if we used zero as the cutoff point of the predicted scores to divide the 80 patients in the test set into high- and low-risk groups, we observed that the LARSCox procedure gives a slightly more significant difference in risk between the two groups. The log-rank test p-values are 0.007, 0.06, 0.007 and 0.015 for the LARSCox, L2 penalized procedure, PC-PCR procedure and the SPCA procedure, respectively, again indicating that the LARSCox procedure performs well on this new training/testing partition of the lymphoma dataset. The corresponding survival curves are provided in the Supplemental material.
| DISCUSSION AND CONCLUSIONS |
|---|
|
|
|---|
It is clinically relevant and very important to predict patients' time to cancer relapse or time to death due to cancer after treatment using gene expression profiles of the cancerous cells prior to the treatment. Powerful statistical methods for such prediction allow microarray gene expression data to be used most efficiently. In this paper, we have proposed and studied the LARSCox procedure for censored survival data in order to identify important predictive genes for survival using microarray gene expression data. To solve the computational difficulty, we modified the recently developed LARS procedure (Efron et al., 2004) to obtain the solutions for the Lasso estimation of the Cox model. Since the risk of cancer recurrence or death due to cancer may result from the interplay between many genes, methods which can utilize the data of many genes, as in the case of our proposed procedure, are expected to show better performance in predicting risk. Our simulation studies demonstrated that the procedure can indeed be used to identify genes which are related to censored survival outcomes and to built a parsimonious model for predicting future patients' survival. We have also demonstrated the applicability of our methods by analyzing time to death of the diffuse large B-cell lymphoma patients and obtained satisfactory results, as evaluated by both applying the model to the test dataset and time-dependent ROC curves.
While we did not compare the new procedure with all the other procedures available, we did compare the LARSCox procedure with several other previously proposed methods in predictive performance and found that the new procedure performed better than the L2 penalized procedure, PC-PCR procedure and the SPCA procedure (Li and Luan, 2003; Li and Gui, 2004; Bair and Tibshirani, 2004) in predicting the future patients' survival. We would however expect that the LARSCox procedure performs better than other dimension-reduction based procedures such as the partial least squares (Park et al., 2002) or the principal components Cox regression because the LARSCox procedure automatically selects and utilizes only the relevant genes in building the predictive model. A comprehensive comparison of different methods warrants further research. It is worth mentioning that the L1 penalized regression was also demonstrated to perform better than other procedures in the settings of microarray gene expression data and linear models (Segal et al., 2003).
The proposed LARSCox procedure has no computational or methodological limitation in terms of the number of genes that can potentially be used in building models for the prediction of patients' time to clinical event. The method can in principle select n 1 genes, where n is the sample size, although the cross-validation procedure often results in a much smaller number of predictors in the model. However, when the number of predictors is close to the sample size, there is a risk of over-fitting. In addition, as pointed out by Osborne et al. (1998) as s increases, when the number of nonzero coefficients gets close to the number of observations, Lasso may not have a unique solution. This implies that the number of genes selected by the procedure cannot be more than the sample size. In addition, the LARSCox tends to select only one variable from a group of highly correlated genes. These points have also been pointed out by Zou and Hastie (2003) for the Lasso. If the LARSCox procedure is used mainly for selecting important and relevant genes, one may want to include all these highly correlated genes, if one of them is selected. However, if the goal is to build a model with a good predictive accuracy, this should not be a problem since simple models are preferred for the scientific insight into the relationship between survival and gene expressions. One way to extend the LARSCox procedure in order to identify correlated genes is that at each LARS variable selection step, we selected not only one single gene with the largest absolute current inner product, but a group of such genes with similar current inner products. An alternative is to use the elastic net penalty as recently proposed by Zou and Hastie (2004) for the penalized estimation. How well such extensions perform in prediction of future survival time deserves further investigation.
The LARSCox procedure assumes the Cox proportional hazards model, which is the most popular model for censored survival data. However, the proportional hazards assumption may not hold for gene expression data or for all diseases. It is possible to develop robust procedures under mis-specified proportional hazards models along the lines of Lin and Wei (1989). In addition, model checking techniques analogous to those of Lin et al. (1993) can be derived. As an alternative, we can consider similar L1 penalized estimation for the accelerated failure time models (Wei, 1992) or more general semi-parametric transformation models (Cheng et al., 1995). We are currently pursuing these alternative models.
In summary, an important application of microarray technology is to relate gene expression profiles to various clinical phenotypes of patients such as time to cancer recurrence or overall survival time. The statistical model built to relate gene expression profiles to the censored survival time should have the property of high predictive accuracy and parsimony. The proposed LARSCox procedure in this paper can be very useful in building a parsimonious predictive model that can be used for classifying future patients into clinically relevant high- and low-risk groups based on the gene expression profile and survival times of previous patients. The procedure can also be applied to select important genes which are related to patients' survival outcome.
| Acknowledgments |
|---|
This research was supported by NIH grant ES009911. We thank the two reviewers for their very helpful comments.
Received on November 17, 2004; revised on March 4, 2005; accepted on March 30, 2005
| REFERENCES |
|---|
|
|
|---|
Akritas, M.G. (1994) Nearest neighbor estimation of a bivariate distribution under random censoring. Ann. Statist., 22, 12991327.
Alizadeh, A.A., et al. (2000) Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403, 503511[CrossRef][Medline].
Alon, U., et al. (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl Acad. Sci., USA, 96, 67456750
Bair, E. and Tibshirani, R. (2004) Semi-supervised methods for predicting patient survival from gene expression papers. PLoS Biol., 2, 50115022.
Cheng, S.C., et al. (1995) Analysis of transformation models with censored data. Biometrika, 82, 835845
Cox, D.R. (1972) Regression models and life-tables. J. R. Statist. Soc. B, 34, 187220.
Efron, B., et al. (2004) Least angle regression. Ann. Statist., 32, 407499[CrossRef].
Garber, M.E., et al. (2001) Diversity of gene expression in adenocarcinoma of the lung. Proc. Natl Acad. Sci. USA, 98, 1378413789
Golub, T.R., et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531537
Heagerty, P.J., et al. (2000) Time dependent ROC curves for censored survival data and a diagnostic marker. Biometrics, 56, 337344[CrossRef][Web of Science][Medline].
Huang, J. and Harrington, D. (2002) Penalized partial likelihood regression for right-censored data with bootstrap selection of the penalty parameter. Biometrics, 58, 781791[CrossRef][Web of Science][Medline].
Li, H. and Gui, J. (2004) Partial Cox regression analysis for high-dimensional microarray gene expression data. Bioinformatics, 20, suppl. 1, i208i215[Abstract].
Li, H. and Luan, Y. (2003) Kernel Cox regression models for linking gene expression profiles to censored survival data. Pacific Symp. Biocomput., 8, 6576.
Lin, D.Y. and Wei, L.J. (1989) The robust inference for the Cox proportional hazards model. J. Am. Statist. Assoc., 84, 10741078[CrossRef][Web of Science].
Lin, D.Y., et al. (1993) Checking the Cox model with cumulative sums of martingale-based residuals. Biometrika, 80, 557572
Oquendo, C.E., et al. (2004) Functional and genetic studies demonstrate that mutation in the COX15 gene can cause Leigh syndrome. J. Med. Genet., 41, 540544
Research Report Osborne, M.R., Presnell, B., Turlach, B.A. (1998) On the Lasso and its dual. Department of Statistics, University of Adelaide.
Park, P.J., et al. (2002) Linking expression data with patient survival times using partial least squares. Bioinformatics, 18, S120S127[Abstract].
Pekarsky, Y., et al. (1999) Abnormalities at 14q32.1 in T cell malignancies involve two oncogenes. Proc. Natl Acad. Sci. USA, 96, 29492951
Petruzzella, V., et al. (1998) Identification and characterization of human cDNAs specific to BCS1, PET112, SCO1, COX15, and COX11, five genes involved in the formation and function of the mitochondrial respiratory chain. Genomics, 54, 494504[CrossRef][Web of Science][Medline].
Rosenwald, A., et al. (2002) The use of molecular profling to predict survival after themotheropy for diffuse large-B-Cell lymphoma. N. Eng J. Med., 346, 19371947
Segal, M.R., et al. (2003) Regression approaches for microarray data analysis. J. Comput. Biol., 10, 961980[CrossRef][Web of Science][Medline].
Smyth, P. (2001) Model selection of probabilistic clustering using cross-validated likelihood. Statist. Comput., 10, 6372[CrossRef].
Sorlie, T., et al. (2001) Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc. Natl Acad. Sci., 98, 1086910874
Tibshirani, R. (1996) Regression shrinkage and selection via the Lasso. J. R. Statist. Soc. B, 58, 267288.
Tibshirani, R. (1997) The Lasso method for variable selection in the Cox model. Statist. Med., 16, 385395.
Troyanskaya, O., et al. (2001) Missing value estimation methods for DNA microarrays. Bioinformatics, 17, 520525
Technical Report Van der Laan, M.J., Dudoit, S., Keles, S. (2003) Asymptotic optimality of likelihood-based cross validation. , Berkeley Division of Biostatistics, University of California.
Verwij, P.J.M. and Van Houwelingen, J.C. (1993) Cross validation in survival analysis. Statist. Med., 12, 23052314.
Wei, L.J. (1992) The accelerated failure time model. a useful alternative to the Cox regression model in survival analysis. Statist. Med., 11, 18711879.
Zou, H. and Hastie, T. (2004) Regression shrinkage and selection via the elastic net, with applications to microarrays. J. R. Statist. Soc. B, in press.
This article has been cited by other articles:
![]() |
K. J. Archer, V. R. Mas, K. David, D. G. Maluf, K. Bornstein, and R. A. Fisher Identifying Genes for Establishing a Multigenic Test for Hepatocellular Carcinoma Surveillance in Hepatitis C Virus-Positive Cirrhotic Patients Cancer Epidemiol. Biomarkers Prev., November 1, 2009; 18(11): 2929 - 2932. [Abstract] [Full Text] [PDF] |
||||
![]() |
I. Sohn, J. Kim, S.-H. Jung, and C. Park Gradient lasso for Cox proportional hazards model Bioinformatics, July 15, 2009; 25(14): 1775 - 1781. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. Wang, B. Nan, N. Zhu, and J. Zhu Hierarchically penalized Cox regression with grouped variables Biometrika, June 1, 2009; 96(2): 307 - 322. [Abstract] [PDF] |
||||
![]() |
H. Binder, A. Allignol, M. Schumacher, and J. Beyersmann Boosting for high-dimensional time-to-event data with competing risks Bioinformatics, April 1, 2009; 25(7): 890 - 896. [Abstract] [Full Text] [PDF] |
||||
![]() |
D. J. Raz, M. R. Ray, J. Y. Kim, B. He, M. Taron, M. Skrzypski, M. Segal, D. R. Gandara, R. Rosell, and D. M. Jablons A Multigene Assay Is Prognostic of Survival in Patients with Early-Stage Lung Adenocarcinoma Clin. Cancer Res., September 1, 2008; 14(17): 5565 - 5570. [Abstract] [Full Text] [PDF] |
||||
![]() |
F. Tai and W. Pan Incorporating prior knowledge of gene functional groups into regularized discriminant analysis of microarray data Bioinformatics, December 1, 2007; 23(23): 3170 - 3177. [Abstract] [Full Text] [PDF] |
||||
![]() |
F. Tai and W. Pan Incorporating prior knowledge of predictors into penalized classifiers with multiple penalty terms Bioinformatics, July 15, 2007; 23(14): 1775 - 1782. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. Schumacher, H. Binder, and T. Gerds Assessment of survival prediction models based on microarray data Bioinformatics, July 15, 2007; 23(14): 1768 - 1774. [Abstract] [Full Text] [PDF] |
||||
![]() |
D. Huang and T. W. S. Chow Identifying the biologically relevant gene categories based on gene expression and biological data: an example on prostate cancer Bioinformatics, June 15, 2007; 23(12): 1503 - 1510. [Abstract] [Full Text] [PDF] |
||||
![]() |
Z. Wei and H. Li Nonparametric pathway-based regression models for analysis of genomic data Biostat., April 1, 2007; 8(2): 265 - 284. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. Ma and J. Huang Clustering threshold gradient descent regularization: with applications to microarray studies Bioinformatics, February 15, 2007; 23(4): 466 - 472. [Abstract] [Full Text] [PDF] |
||||
![]() |
N. Rajicic, D. M. Finkelstein, D. A. Schoenfeld, and the Inflammation Host Response to Injury Research Survival analysis of longitudinal microarrays Bioinformatics, November 1, 2006; 22(21): 2643 - 2649. [Abstract] [Full Text] [PDF] |
||||
![]() |
N. Sha, M. G. Tadesse, and M. Vannucci Bayesian variable selection for the analysis of microarray data with censored outcomes Bioinformatics, September 15, 2006; 22(18): 2262 - 2268. [Abstract] [Full Text] [PDF] |
||||
![]() |
L. Li Survival prediction of diffuse large-B-cell lymphoma based on both clinical and gene expression information Bioinformatics, February 15, 2006; 22(4): 466 - 471. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||













