Bioinformatics Advance Access originally published online on August 4, 2008
Bioinformatics 2008 24(21):2566-2568; doi:10.1093/bioinformatics/btn412
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Comment on network-constrained regularization and variable selection for analysis of genomic data
1Freiburg Center for Data Analysis and Modeling, University of Freiburg and 2Institute of Medical Biometry and Medical Informatics, University Medical Center Freiburg, Stefan-Meier-Str. 26, 79104 Freiburg, Germany
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Contact: binderh{at}fdm.uni-freiburg.de
Supplementary information: Supplementary data are available at Bioinformatics online.
It is very challenging to build good predictive models from high-dimensional data, arising e.g. from microarrays, as the number of covariates typically is much larger than the number of observations. Starting with the Lasso (Tibshirani, 1996), several approaches have been developed, that perform variable selection and parameter estimation at the same time, by adding a penalty term to the likelihood criterion to be maximized.
Li and Li (2008) present a new approach that employs an extended penalty term for network-constrained regularization. For predictive models fitted to microarray data, this potentially allows to improve prediction performance by incorporating pathway information, available e.g. from the KEGG pathway database (Kanehisa and Goto, 2000). This is also expected to improve interpretability, but the extent to which the interpretation of a fitted model is to be trusted still depends on its prediction performance.
In an application example, Li and Li (2008) use microarray survival data from patients with glioblastoma to illustrate their approach. Prediction performance in a test set is compared to the Lasso and the elastic net (Zou and Hastie, 2005), and the new procedure seems to compare favorably, at least to the former. What is missing is a comparison to the null model, i.e. a model that does not use any covariate information. As for all three procedures model selection via cross-validation indicates that several microarray features should receive non-zero coefficients, it could be believed that there is enough information in the data and therefore comparison to a null model is not needed. However, as will be shown in the following, all three procedures actually perform considerably worse compared to the null model.
The analysis performed by Li and Li (2008) also simply discards censored observations. This is very problematic, and in the course of exploring potential reasons for the bad prediction performance, we will employ more adequate modeling techniques for time-to-event data. It will be shown that predictive information can be extracted from the data. For judging this, a predictive model that uses only clinical covariates will serve as an additional benchmark.
Li and Li (2008) used data from the study described in Horvath et al. (2006), which comprises of two data sets with n=55 and n=65 observations, where the former is used as a training and the latter as a test set. The response of interest is survival time. As there is censoring, i.e. survival time is not observed for all patients, Li and Li (2008) decide to discard censored observations for all the model fitting approches they compare, resulting in a training set with n=50 and a test set with n=61 observations. Models are then fitted to log-time as a continuous response. Table 1 shows the mean squared errors on the test set, as given by Li and Li (2008) in their Table 2. We extended the latter table by results for a null model, i.e. a model that does not use any covariate information and uses mean log-time of the training set as prediction for the test set, and a linear model that employs only the two clinical covariates given in the data (age and sex). The R code for calculating these results and for performing the following analyses is given in the Supplementary Data.
|
It is clearly seen that all three procedures compared by Li and Li (2008) perform considerably worse compared to the null model. This illustrates that the fact that cross-validation selects a certain number of genes with non-zero parameter estimates does not imply that the fitted models perform better than the null model. The latter should therefore always be used as benchmark for performance comparison. Omission of this led Li and Li (2008) to believe that there is substance to the model fit obtained from their approach, which seemed also to be supported by the interpretability of the selected genes in terms of subject matter knowledge. However, in light of the poor prediction performance also the presented interpretation of the fitted model is questionable.
In microarray settings, a second kind of benchmark, besides the null model, is provided by purely clinical models, i.e. models that contain only clinical covariates. Predictive models based on microarray features will probably only be accepted, if their prediction performance is at least on par with clinical models or better. Ideally, clinical covariates and microarray features should be combined into one model, as only then it can be judged, whether the microarray features contain additional predictive information (for example Binder and Schumacher, 2008b). In the present example, a purely clinical model improves over the null model only by a small amount, potentially because important predictors such as tumor characteristics are not available, but clearly outperforms the microarray-only models. This demonstrates that it is possible to outperform the intercept model in the present data.
There are two potential reasons why the analyses of Li and Li (2008) might have missed potential predictive microarray information in the present application. The first is the inadequate handling of censored observations. There is a wealth of model classes that can adequately deal with these, and we will use the classical Cox proportional hazards model with estimation by a boosting technique (Binder and Schumacher, 2008b) in the following. The second reason might be, that Li and Li (2008) restrict their analysis to that 1533 microarray features that are related to 33 KEGG regulatory pathways, discarding about 20 000 covariates.
For performance evaluation we use prediction error curves, i.e. squared distance between true status (1 for still being alive and 0 for having had an event) and the predicted probability of still being alive, tracked over time (Gerds and Schumacher, 2006). In contrast to mean squared error of predicted log-time this is not linked to some kind of transformation model and can be used for all classes of survival models. It also properly handles censored observations in the test set by weighting and does not (inadequately) discard these.
The grey curve in Figure 1 shows the test set prediction error curve for the Kaplan-Meier benchmark, i.e. a null model that does not use any covariate information. Prediction performance of a purely clinical model (using covariates age and sex) is indicated by a dashed curve. It is seen that it improves over the null model. This was not seen so clearly from the model that was built for log-time, after discarding censored observations.
|
In a first step, building of a predictive microarray model is not restricted to the microarray features which are linked to KEGG regulatory pathways, but the 8000 most varying features are used, as suggested by Horvath et al. (2006) in their supplementary material. The boosting technique described in Binder and Schumacher (2008b) is used for parameter estimation, the number of boosting steps being determined by 10-fold cross-validation. While there are many alternative techniques for fitting high-dimensional survival models, such as supervised principal components (Bair et al., 2006), the employed boosting approach is very similar to the Lasso-like techniques used by Li and Li (2008) and also allows for straightforward incorporation of clinical covariates.
The test set prediction error curve of the resulting model fit, which features 15 non-zero coefficients, is indicated by the solid black curve in Figure 1. It is seen that the microarray-only model can improve over the Kaplan-Meier benchmark, i.e. there seems to be some relevant information in the microarray features. However, the clinical model still performs better compared to a microarray-only model. Combining clinical and microarray information into one model (as described in Binder and Schumacher, 2008b) results in a model with a total of 12 non-zero parameter estimates. Of the 10 non-zero estimates that belong to microarray features, eight are also present in the microarray-only model. So there is a considerable overlap, which indicates stability. The corresponding prediction error curve (dash-dotted curve in Fig. 1) shows that the combined model outperforms the purely clinical model, the difference between the two indicating the additional prediction performance gained by employing microarray features.
To evaluate the effect of restricting to genes represented in KEGG pathways, we use a subset of 952 microarray features (of the 8000 most varying features), which is represented in the KEGG regulatory pathways. Use of 10-fold cross-validation, to select the number of boosting steps, results in a model with 11 non-zero parameter estimates. Only one of the microarray features with non-zero parameter estimate also has a non-zero estimate in the model built from 8000 features. This small overlap is no surprise, as only 2 of the 15 microarray features employed by the latter model are linked to KEGG regulatory pathways. Therefore, the fitting algorithm is forced to pick other, potentially less informative, microarray features. Judging only from the number of parameters selected by cross-validation, it could be expected, that prediction performance nevertheless is better than that of the null model. However, the corresponding prediction error curve (dotted curve in Fig. 1) is seen to be far above the Kaplan-Meier benchmark, i.e. prediction performance is much worse compared to the null model. This is similar to the results for the log-time modeling approaches compared by Li and Li (2008). The pathway-related subset of microarray features does not seem to contain enough predictive information. The randomness of the selected microarray-features is also underlined by the small overlap between the genes identified by our KEGG-only model and the subnetworks identified by Li and Li (2008; see their Fig 2), which is only one common gene (SYNJ2).
|
Another useful benchmark value is the so-called no-information prediction error (Efron and Tibshirani, 1997; Gerds and Schumacher, 2007), that mimics some kind of worst case scenario in a situation where the potential predictors do not provide any information on the response (here survival time), but are used for prediction. The no-information prediction error can be calculated (Schumacher et al., 2007) by systematic permutation of the response values and the vector of predictor values (here the microarray features) of all patients. The dashed curve in Figure 2 shows this additional benchmark jointly with the Kaplan-Meier benchmark and the prediction error curve for the prediction based on 952 microarray features represented in the KEGG regulatory pathways. It is seen that the latter is even above the no-information error. This indicates that the test set might be structurally different from the training set.
Some kind of pattern in the training data that is not present in the test data might also be the reason why cross-validation selects a relatively large number of non-zero parameters. Due to the pattern in the training data, cross-validation overestimates prediction performance and chooses a model that is too complex. One way to avoid use of test data, that might be structurally different from the training data, is to perform performance estimation by bootstrap techniques (see Binder and Schumacher, 2008a; Schumacher et al., 2007, for example).
In summary, the lesson learned from a thorough analysis of the application example provided by Li and Li (2008) is, that adequate benchmarks should be used for evaluating prediction performance of a new approach, always including a null model and (in a microarray setting) potentially also a purely clinical model. Furthermore, adequate techniques for fitting models to censored time-to-event data and for evaluating their prediction performance should be employed.
Funding: We gratefully acknowledge support from Deutsche Forschungsgemeinschaft (DFG Forschergruppe FOR 534).
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Olga Troyanskaya
Received on April 11, 2008; revised on July 23, 2008; accepted on August 1, 2008
| REFERENCES |
|---|
|
|
|---|
Bair E, et al. Prediction by supervised principal components. J. Am. Stat. Assoc. (2006) 101:119–137.[CrossRef][Web of Science]
Binder H, Schumacher M. Adapting prediction error estimates for biased complexity selection in high-dimensional bootstrap samples. Stat. Appl. Genet. Mol. Biol. (2008a) 7. Article 12.
Binder H, Schumacher M. Allowing for mandatory covariates in boosting estimation of sparse high-dimensional survival models. BMC Bioinformatics (2008b) 9:14.[CrossRef][Medline]
Efron B, Tibshirani R. Improvements on cross-validation: The.623+ boostrap method. J. Am. Stat. Assoc. (1997) 92:548–560.[CrossRef][Web of Science]
Gerds TA, Schumacher M. Consistent estimation of the expected Brier score in general survival models with right-censored event times. Biom. J. (2006) 48:1029–1040.[CrossRef][Web of Science][Medline]
Gerds TA, Schumacher M. Efron-type measures of prediction error for survival analysis. Biometrics (2007) 63:1283–1287.[Medline]
Horvath S, et al. Analysis of oncogenic signaling networks in glioblastoma identifies ASPM as a molecular target. Proc. Natl Acad. Sci. USA (2006) 103:17402–17407.
Kanehisa M, Goto S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. (2000) 28:27–30.
Li C, Li H. Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics (2008) 24:1175–1182. Doi: 10.1093/bioinformatics/btn081.
Schumacher M, et al. Assessment of survival prediction models based on microarray data. Bioinformatics (2007) 23:1768–1774.
Tibshirani R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B (1996) 58:267–288.
Zou H, Hastie T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. B (2005) 67:301–320.[CrossRef]
This article has been cited by other articles:
![]() |
C. Porzelius, H. Binder, and M. Schumacher Parallelized prediction error estimation for evaluation of high-dimensional models Bioinformatics, March 15, 2009; 25(6): 827 - 829. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||


