Bioinformatics Advance Access originally published online on June 9, 2008
Bioinformatics 2008 24(15):1698-1706; doi:10.1093/bioinformatics/btn262
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Microarray-based classification and clinical predictors: on combined classifiers and additional predictive value
1Sylvia Lawry Centre for MS Research, Hohenlindenerstr. 1, D-81677 Munich, 2Department of Statistics, Ludwig-Maximilians-University of Munich, Ludwigstr. 33, D-80539 Munich and 3Institute of Medical Biometry and Medical Informatics, University Hospital Freiburg, Stefan-Meier-Str. 26, D-79104 Freiburg, Germany
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: In the context of clinical bioinformatics methods are needed for assessing the additional predictive value of microarray data compared to simple clinical parameters alone. Such methods should also provide an optimal prediction rule making use of all potentialities of both types of data: they should ideally be able to catch subtypes which are not identified by clinical parameters alone. Moreover, they should address the question of the additional predictive value of microarray data in a fair framework.
Results: We propose a novel but simple two-step approach based on random forests and partial least squares (PLS) dimension reduction embedding the idea of pre-validation suggested by Tibshirani and colleagues, which is based on an internal cross-validation for avoiding overfitting. Our approach is fast, flexible and can be used both for assessing the overall additional significance of the microarray data and for building optimal hybrid classification rules. Its efficiency is demonstrated through simulations and an application to breast cancer and colorectal cancer data.
Availability: Our method is implemented in the freely available R package MAclinical which can be downloaded from http://www.stat.uni-muenchen.de/~socher/MAclinical
Contact: boulesteix{at}slcmsr.org
| 1 INTRODUCTION |
|---|
|
|
|---|
For the last few years, microarray-based outcome prediction, especially classification, has attracted much attention in the statistics, bioinformatics and medical communities. While cancer research is probably the most important field of application of microarray-based prediction, classifiers have also been proposed for other diseases such as multiple sclerosis (Bomprezzi et al., 2003). Classification studies using microarray data only often aim to demonstrate that microarray data are informative to distinguish different types of tissues or patients, e.g. normal from cancer tissues or responders from non-responders. As a by-product of such a study, researchers usually also explore the molecular mechanisms underlying the considered disease by focusing their attention on the most informative genes.
In the context of outcome prediction, some groups of researchers suggest that gene expression data could be used in clinical practice to provide improved diagnosis or prediction see, e.g.(van't Veer et al., 2002). In this case, it is crucial to assess the additional predictive value of gene expression data compared to the available (good) simple clinical predictors. Since they are in general much more difficult and expensive to collect than clinical predictors, gene expression predictors should be used as prediction tools only when they really lead to an accuracy improvement. A problem related to the additional predictive value is outlined by Ntzani and Ioannidis (2003) who state that adjustment for other classic predictors of the disease outcome [is] essential. This is especially true when the study's aim is to demonstrate the practical benefit of using gene expression predictors in clinical practice, but also in other cases. For instance, suppose that the age and sex distributions are not the same in the two groups that have to be distinguished. If these variables are ignored when performing classification, one may misleadingly conclude that microarray data can separate the two groups very well, whereas the differences in gene expression are in fact due to sex and age differences.
Although taking clinical variables into account may be crucial in the context of microarray-based prediction, this aspect is often either omitted or performed using suboptimal methods and not adequately described in the medical literature. Hundreds of novel methods have been proposed to deal with the small n, large p problem, but very few statisticians address the question of the additional predictive value of microarray data. We give an overview at the end of this section.
Such clinical parameters may include, e.g. age and sex of the patient, disease duration, relapse rate or tumor grade, depending on the investigated disease. A critical study of breast cancer outcome prediction (Eden et al., 2004) suggests that good old clinical markers have similar power in breast cancer prognosis as microarray [...] profilers and, more generally, microarray data are suspected to sometimes yield noise discovery (Ioannidis, 2005). In another context, Hunter et al. (2008) point out that letting the genome out of the bottle may have perverse effects in the context of genetic tests. Similar results have been obtained in the field of multiple sclerosis and magnetic resonance imaging (MRI). MRI, which has long been considered as an efficient tool for disease course prediction, turns out to show only marginal additional predictive value when it is used in combination with simple clinical parameters including, e.g. relapse history and disease duration (Daumer et al., 2006).
In the present article, we focus on a standard binary classification problem: the response variable Y to be predicted can take two values Y=0 or Y=1. The term prediction refers to the prediction of the response Y. For example, Y may stand for the development of metastases within a given period of time (yes/no). Note that not all prediction problems can be easily simplified in terms of binary prediction without substantial loss of information and precision. However, class prediction remains the most commonly encountered prediction problem in high-dimensional settings. Our method is easily generalizable to other prediction problems including survival analysis and multicategorical responses.
The answer to the question of the additional predictive value of microarray data is typically binary: yes, microarray data improve the classification accuracy yielded by clinical predictors or clinical predictors perform at least as well as gene expression predictors—and are much less expensive. The second answer may correspond to different situations. First, it is possible that microarray data are not relevant at all for the prediction problem, in which case an usual classifier for high-dimensional data gives poor results when applied to microarray data alone. The second scenario is that microarray data are relevant for the prediction problem, but redundant with or weaker than clinical parameters, in which case an usual classifier for high-dimensional data yields satisfying results. Note that the term redundant does not imply any causality relationship. Microarray data and clinical data may be redundant because the gene expression influences clinical variables or vice versa, or because both clinical and microarray variables are influenced by common latent unobserved mechanisms. Additional biological knowledge is needed to answer this question, which goes beyond the scope of this article.
In practical studies, the additional predictive value of microarray data is often assessed by using naïve methods. The most simple one is probably subgroup analysis. If one is interested in the predictive value given that a binary predictor is already available, the separate analysis of both subgroups is a natural approach. Considering the small sample sizes in microarray studies and the number of available candidate clinical predictors (typically about 5 to 10), this approach cannot be recommended in general. Another simple approach consists of building a classifier based on all predictors, without distinguishing between microarray and clinical variables. This method seems also inappropriate to answer the question of the additional predictive value: even if we have an excellent clinical predictor, it is likely to get lost within the huge amount of microarray variables. Hence, this approach does not treat clinical predictors fairly. The third intuitive approach consists of building two classifiers: one based on clinical parameters, one based on microarray data. The problem is then that the original question of the additional predictive value cannot be answered at all. If both classifiers perform similarly, one does not know whether microarray data do exactly the same as clinical parameters or rather allow to refine the prediction in some way. Hence, the assessment of the additional predictive value of microarray data is not a trivial issue.
A related problem is the construction of complex classifiers combining clinical parameters and high-dimensional microarray data. Ideally, such a classifier would
- show at least as good performance as simpler classifiers using only clinical parameters or only microarray data, respectively,
- handle different configurations (bad microarray and good clinical predictors, good microarray and bad clinical predictors) by performing correct model selection,
- neither oversummarize microarray data nor favor them in the final classifier through overfitting mechanisms,
- handle both categorical and continuous predictors, since many clinical parameters are categorical,
- decide automatically whether to include microarray data or not, depending on their additional predictive value.
On the one hand, Tibshirani and Efron (2002) suggest the so-called pre-validation (PV) testing framework whose aim is to determine whether microarray data contribute significantly to the prediction problem, given that clinical parameters are already available. The idea is to summarize microarray data in form of the internally cross-validated predicted probability of class membership, thus avoiding that microarray data are artificially favored. This approach is applied, e.g. in a breast cancer study by Pawitan et al. (2005). Note that the aim of this method is not to construct an optimal classifier combining both types of data.
On the other hand, several authors try to involve clinical parameters in the classifier construction in some way. Dettling and Bühlmann (2004) suggest a statistical approach based on penalized logistic regression handling all types of clinical variables. Gevaert et al. (2006) follow an approach based on Bayesian networks involving two steps (structure step and learning step). A related approach is presented by Sun et al. (2007). It is also based on variable selection, although using a completely different selection procedure. The method by Sun et al., (2007) relies on a wrapper feature selection method called I-RELIEF. They use linear discriminant analysis (LDA) as a class prediction method, which can be an inconvenience in the presence of categorical predictors.
Some of these studies do not appear to use any systematic validation strategy and hence have the pitfalls outlined by Dupuy and Simon (2007), which make their results uninterpretable. Moreover, most of them do not provide any adequate answer to the related question of the additional predictive value of microarray data from a testing point of view because it was not their primary goal. For instance, with methods putting microarray and clinical data together, the latter tend to get lost within the huge amount of microarray variables and are thus not treated fairly from the point of view of the additional predictive value. Methods treating the two groups of variables separately and combining them at the end may also fail partly in the frequent case where clinical and microarray data are highly correlated.
In this article, we present a method which simultaneously (i) determines whether microarray data have additional predictive value and (ii) provides a combined classifier fulfilling the five points enumerated above. To the best of our knowledge, there is no other approach treating these two aspects in a common framework. In a very recent article, Binder and Schumacher (2008) address these problems based on a penalized Cox regression approach using componentwise boosting techniques (Tutz and Binder, 2007). However, they only address the prediction of survival times. It is still unclear whether such methods would perform well for classification problems in high-dimensional settings, which may be more affected by separation and overfitting problems.
According to several independent comparison studies (Boulesteix, 2004; Dai et al., 2006; Man et al., 2004), partial least squares (PLS)-based methods range among the best dimension reduction methods for high-dimensional and noisy microarray data in the context of prediction. See Nguyen and Rocke (2002) for the first application of PLS to microarray-based prediction and Boulesteix and Strimmer (2007) for an overview of PLS methods for genomic data.
In this article, we suggest a new approach combining PLS dimension reduction and the principle of pre-validation introduced by Tibshirani and Efron (2002). Random forests (Breiman, 2001) are then applied with both the new components and the clinical variables as predictors. Our proposal contains several novelties: (i) the two-step approach involving a dimension reduction step and a classification step for handling the two types of variables, (ii) the extension of the pre-validation idea to dimension reduction and prediction, (iii) the combination of PLS and random forests which involves several advantages and (iv) a model choice procedure based on the out-of-bag error estimator. The proposed method is described in Section 2 and illustrated in Section 3 through simulations and an application to the breast cancer data by van't Veer et al. (2002) and the colorectal data by Lin et al. (2007).
| 2 METHODS |
|---|
|
|
|---|
Let X denote the n x p matrix containing the column-centered expression values of p genes for n patients, while y denotes the centered vector of classes coded as 0, 1. Similarly, Z denotes the n x q matrix giving the values of q clinical parameters for the n patients. In contrast to the gene expression matrix X, Z may include categorical variables, such as tumor grade or sex of the patient.
In the present section, we first give a short overview of PLS dimension reduction and random forest classification. In the second subsection, we propose a novel method combining PLS dimension reduction with the idea of pre-validation suggested by Tibshirani and Efron (2002). We then outline the whole procedure consisting of summarizing microarray data in form of pre-validated PLS components and applying random forests to both microarray and clinical variables and address the problem of model choice, especially the choice of the number of PLS components.
2.1 An introduction to PLS and random forests
PLS methods were developed in connection with path models in the 60s and 70s (Wold, 1966). Statisticians became interested in its application to robust and computationally efficient regression for data with small sample sizes and large number of highly correlated variables some 25 years ago (Garthwaite, 1994; Martens and Naes, 1989; Stone and Brooks, 1990). The following introduction refers to the review by Boulesteix and Strimmer (2007). PLS regression consists of two steps. During the dimension reduction step, the predictors from matrix X are summarized in the form of a small number of linear combinations called PLS components. Subsequently, assuming that the response is continuous, these extracted PLS components are used as predictors in ordinary least squares regression, hence the term PLS regression. When the response is binary, the linear regression step can of course not be carried out. However, it can be shown (Barker and Rayens, 2003) that, if applied to a categorical response, the dimension reduction step is strongly related to principal component analysis performed on the between-group covariance matrix. Hence, it makes sense to perform PLS dimension reduction in this setting.
PLS dimension reduction constructs k mutually orthogonal components as linear combinations of X:
|
|
|
| (1) |
wi=1 and t
tj=w
XT Xwj=0. The fast extraction of the weight matrix W can be carried out using a sequential algorithm given in, e.g. Martens and Naes (1989). By definition, the most informative components are the first ones, but the determination of the best number of components is a difficult task. Some authors (Boulesteix, 2004; Dai et al., 2006) use cross-validation based strategies. In this article, we use the implementation of SIMPLS included in the R package plsgenomics (Boulesteix, 2004; Boulesteix and Strimmer, 2007), function pls.regression. Although variable selection is not always necessary as a preliminary step to PLS-based classification, some authors argue that it can substantially improve accuracy in the high-dimensional setting (Dai et al., 2006), especially when there are indeed few relevant variables. Many variable selection procedures are available in the literature. One of the most widely used is univariate filtering based on the absolute value of the t-statistic. In the present article, we stick to this standard approach.
Random forests are introduced by (Breiman, 2001) and based on the decision tree methodology. In only 7 years, they have grown to a major data analysis tool, especially in the context of high-dimensional genetic or genomic data (Strobl et al., 2007). Like bagging (Breiman, 1996), the method is based on the aggregation of classification or regression trees built using bootstrap samples drawn out of the original n observations, in order to make tree-based prediction more robust. In order to make the obtained trees even more different and thus increase their stability and to reduce the computation time, random forests have an additional feature. At each split, a subset of candidate predictors is selected out of the available predictors. The size mtry of the subset, which is a method parameter, is often set to
, where p is the total number of predictors.
As a model-free approach, the random forest method does not need any distributional assumptions and can be applied to any type of data. In particular, it behaves well with high-dimensional correlated data, see, e.g. Diaz-Uriarte and Alvarez de Andrés (2006) for an application to microarray-based class prediction. Random forests can also take interactions between variables into account explicitly. Lastly, they are faster than other aggregation methods like bagging, since they do not consider all the available predictors at each split.
Like classification and regression trees, random forests handle all types of responses, in particular multicategorical or censored responses. They also work with all types of predictor variables. However, when predictors do not have the same scale, selection bias may occur using the standard random forest algorithm. It is then recommended to use an alternative version of the random forest method based on conditional inference (Hothorn et al., 2006) implemented in the function cforest of the R package party. Moreover, it can be shown (Strobl et al., 2007) that subsampling (without replacement) is preferable to the bootstrap when drawing samples out of the n observations at each random forest iteration. In this article, we follow these recommendations. The only parameters for which we do not use the default settings are (i) the number of trees, which is set to 200 instead of 500 for computational reasons, (ii) the number of candidate predictors at each split, which we set to
for consistency with the original R package randomForest implementing the method by Breiman (2001), (iii) the threshold defining the stopping criterion (see Hothorn et al. (2006) for more details), which we set to mincriterion=0 in order to obtain trees with long branches, as commonly recommended for trees used in random forests. In very small datasets (say, n
30), one should also modify the parameter minsplit controlling the minimal size of nodes to be split. However, our experience shows that this modification is not necessary in datasets of usual size as those considered here. Note that, in contrast to other methods such as penalized logistic regression, the performance of random forests depends only slightly on the choice of parameters and that different settings would yield similar results.
2.2 Pre-validated PLS
Suppose that we construct PLS components as described in Section 2.1, based on a given learning dataset. Per definition, these components are likely to be strongly related to the response variable, especially in the case of high-dimensional data. Comparing their predictive power to the power of clinical variables in the learning dataset would be an unwise strategy: because of overfitting, there typically will be a bias in favor of the PLS components.
In the present article, we suggest to overcome this problem by extending the pre-validation principle of Tibshirani and Efron (2002) to PLS dimension reduction. Pre-validation is inspired from the well-known cross-validation procedure for evaluation of prediction rules, which consists of partitioning the available sample into distinct subsamples and successively considering each subsample as test data and the remaining subsamples as training data. Unfamiliar readers may refer to the review by Boulesteix et al. (2008) on this subject. Our novel procedure works as follows.
- Divide the learning dataset into G groups. Here, we set G=10, as recommended by Höfling and Tibshirani (2008).
- Leave one group out and run PLS dimension reduction on the remaining G–1 groups.
- Compute the PLS components for the left-out group using the derived weight matrix. We denote these PLS components as pre-validated PLS components.
- Repeat steps 2–3 for each of the G groups.
2.3 Summary: recipe of the analysis
In the present article, we suggest to combine PLS dimension reduction with the random forest methodology in order to take both gene expression and clinical parameters into account when constructing a classifier. Suppose that we have a learning dataset L of size nL (corresponding to XL, ZL, yL) for which we know the response variable. We also have a test dataset T of size nT (corresponding to XT, ZT), for which a prediction has to be made.
In clinical practice, the test dataset would be a set of patients that have to be predicted. In the context of the validation of research findings, the test dataset would be a set of patients for which we also know the response variable, and that are used to assess the prediction accuracy of the combined classifier constructed using the learning data. Note that this scheme is possible only if we have a large enough dataset. Otherwise, one may use an evaluation scheme based on, e.g. cross-validation, repeated subsampling or bootstrap sampling, see Boulesteix et al. (2008) for an overview. In this case, the algorithm is run several times. For example, if leave-one-out cross-validation (LOOCV) is used to assess our combined classifier, one would run the following algorithm for each LOOCV iteration, where the dataset XT consists of only one observation at each iteration.
The matrix XL is assumed to have columns with zero mean and XT to be centered by substraction of the columns means obtained from XL, as usual in PLS-based prediction (Boulesteix and Strimmer, 2007). Let k denote the maximum allowed number of PLS components, typically k=3 in the binary case. The detailed procedure is as follows.
- Cross-validated PLS dimension reduction with learning dataset Construct the nL x k matrix of pre-validated PLS components
as follows. For g=1,...,G:- Carry out variable selection based on X
and y
only, where the superscript (–g) indicates that the observations from the g-th group have been removed from XL and yL, respectively. This yields an expression matrix X
with p* columns, where p* is the pre-fixed number of selected variables. The g-th group is not taken into account in the variable selection process, because variable selection must be considered as a part of the classifier construction, (see, e.g. Boulesteix, 2007; Dupuy and Simon, 2007).
- Run the PLS dimension reduction procedure with k components on the data matrix X
. This yields the p* x k weight matrix W
.
- Build the k components for the excluded g-th group as the product X
W
, where X
denotes the part of the matrix XL corresponding to the g-th group and containing only the p* variables selected in (a). Store the product X
W
in the rows of
corresponding to the g-th group.
- Carry out variable selection based on X
- Classifier construction Construct a random forest using the columns of the matrices
and ZL as predictors and yL as response.
- PLS dimension reduction with the test dataset Carry out variable selection based on XL and yL only, yielding again p* selected variables. For the reduced test data matrix X
consisting of the p* selected variables, compute the matrix TT of PLS components as follows.- Run the PLS dimension reduction procedure with k components on the whole learning data matrix X
of size nL x p*, yielding the p* x k weight matrix WL.
- Build the PLS components for the test dataset as TT=X
WL.
- Run the PLS dimension reduction procedure with k components on the whole learning data matrix X
- Prediction Apply the random forest constructed in Step 2 to the prediction of the test observations using the matrices TT and ZT. For each test observation one obtains a prediction
for the class membership.
|
2.4 Model choice and additive predictive value of microarray data
In this section, we show how the out-of-bag (OOB) error estimator (Breiman, 2001) yielded as a by-product when growing a random forest can be used both for the choice of the number of PLS components and for answering the question of the additive predictive value of microarray data.
The OOB error estimator works as follows. When growing each tree of the random forest, 36.8% of the n observations are put aside and not used for choosing the splits (this default setting of the function cforest, which stems from bootstrap sampling). These observations are called OOB observations. After all the trees are constructed, a pseudo-prediction can be made for each of the n observations using only the trees that did not use it for training, i.e. the trees for which it was an OOB observation. The OOB error rate is then computed by comparison of the n pseudo-predictions with the true classes. Note that, in contrast to in-bag error estimation, this procedure overcomes overfitting problems, since the predicted observations were not used for training the corresponding predicting trees. The OOB error estimator can be used for comparing the prediction accuracy of several random forests. An interesting application in the present context is the choice of the number of components, and, if zero is considered as a possible candidate for the number of components, the question of the additive predictive value of microarray data. To do this, we suggest to replace Step 2 of the procedure outlined above by a modified version 2* as follows.
(2*) Classifier construction
- For l=0,...,k, construct a random forest using the l first columns of the matrix
and the matrix ZL as predictors and yL as response.
- Compute the OOB error for each of the k+1 constructed forests.
- Select the number of components k* yielding the forest with the smallest OOB error.
If k*=0, we conclude that microarray data do not have any predictive value compared to clinical variables alone. If k*>0, it is possible to roughly evaluate the significance of microarray data by computing confidence intervals for the calculated OOB errors, where the sample size is given as the size of the training set. This procedure does not yield a rigorous statistical test, since independence of the observations is not warranted. However, the resulting confidence intervals should give the order of magnitude of the corresponding differences in accuracy.
The whole procedure is implemented in the R package MAclinical. The current version that was used for this article is available from http://www.stat.uni-muenchen.de/~socher/MAclinical. We plan to send a refined version of this package to the Comprehensive R Archive Network.
| 3 RESULTS AND DISCUSSION |
|---|
|
|
|---|
The analyzes described below can be reproduced using the scripts available from http://www.stat.uni-muenchen.de/
socher/MAclinical.
3.1 Simulations
The aim of this simulation study is to compare the performance of our approach to related approaches based on clinical and/or microarray variables. Several data structures are considered: different predictive powers for the microarray variables, different powers for the clinical variables and different class structures. By different class structures, we mean that we examine two settings: (i) a redundant setting where the microarray and clinical variables are generated using exactly the same model, thus discriminating the classes in the same way and giving redundant information and (ii) a non-redundant setting, where observations from class Y=1 are assumed to form two distinct subgroups: one of the subgroups can be discriminated from the other one and from Y=0 by microarray data, whereas the second one is discriminated by clinical variables. The corresponding data generating processes are detailed below.
In the first setting (redundant setting), the random variables Y, X1,...,Xp and Z1,...,Zq have the following joint distribution. The binary response Y follows a binomial distribution with P(Y=1)=0.5. A total of p*<p microarray variables are relevant for class prediction. Each microarray variable Xj (j=1,...,p) is generated as
|
| (2) |
|
| (3) |
In the present simulation, we set µZs to the same value µZs=µZ for all clinical variables and consider different values of µZ successively: µZ=0 (no power), µZ=1 (moderate power) and µZ=3 (strong power). Similarly, µXj is set to the constant µX for the p* genes X1,...,Xp* (with p*<p) and to zero for the remaining genes Xp*+1,...,Xp. Similarly with µZ, the parameter µX takes different values successively: µX=0,0.5,1. In the present study, the total number of genes is set to p=1000 and the number p* of relevant genes to p*=50. We denote this simulation setting as redundant, because the discrimination mechanism is the same for microarray and clinical variables.
For all parameter combinations (µX,µZ), we draw nL and nT i.i.d. observations forming the learning set and test set, respectively. Here, nL is set to nL=50, nT is set to nT=450 in order to obtain accurate estimates of the error rate. A total of Niter=100 datasets are simulated for each parameter setting. The optimal number of PLS components is selected from k=0,1,2,3.
We compare the pre-validated PLS+RF method based on both microarray and clinical variables (pls-pv+rf/xz, with 10-fold PV) to simpler related approaches, in order to determine the effect of pre-validation and to answer the question whether the combined classifiers perform as well as classifiers based on microarray data only or clinical variables only. The considered approaches are (i) PLS+RF based on both microarray and clinical variables without pre-validation (pls+rf/xz), (ii) pre-validated PLS+RF based on microarray data only with 10-fold PV (pls-pv+rf/x), (iii) PLS+RF based on microarray data only without pre-validation (pls+rf/x), and (iv) RF based on clinical variables only (rf/z). As an additional comparison, we also apply standard approaches used when dealing with only one type of predictors: logistic regression for clinical predictors (log/z) and the support vector machines (SVM) method for microarray data (svm/x), which is well-established as one of the most accurate procedures in this setting (Statnikov et al., 2005). We use the R package e1071, with linear kernel and cost set to the default value 1. In order to make clear that we do not tune our method artificially (which would yield an unfair comparison), let us mention that we additionally applied the two pre-validation approaches (pls-pv+rf/xz and pls-pv+rf/x) with leave-one-out pre-validation instead of 10-fold pre-validation (data not shown). However, the results were not different from those obtained with 10-fold pre-validation. Hence, we stick to 10-fold, following Höfling and Tibshirani (2008).
For each iteration, a classifier is built based on the learning dataset only, with the seven methods outlined above successively. The classifiers are then evaluated based on the corresponding test set and the error rate is estimated as the mean proportion of misclassified observations. In this simulation, we do not perform any preliminary variable selection, as suggested by Boulesteix (2004, 2006) in the case of relatively large signal to noise ratios. In real data analysis, one could of course try to improve classification accuracy by preliminary variable selection. This step was omitted in the simulation for computational reasons. Similarly, correlations between genes and/or clinical variables do not seem to affect the results noticeably (data not shown).
As can be seen from Table 1, pre-validation improves classification accuracy noticeably, especially when clinical parameters are good predictors (i.e. for µZ=1 or µZ=3). Since PLS components without pre-validation usually overfit the training data, they are artificially preferred to clinical parameters in the split selection procedure. Moreover, trees are then likely to have longer irrelevant branches. The performance of the pls-pv+rf/xz approach is slightly lower than the performance of pls-pv+rf/x in the case of non-predictive clinical variables, but as good as the rf/z approach in all cases, even when microarray data are not predictive. The comparison to the standard logistic regression and to SVMs reveals interesting features. In Table 1, logistic regression with clinical parameters performs better than the pls-pv+rf/xz approach only when clinical parameters are more predictive than microarray data. Except for the case µX=0,µZ=1 (0.165±0.03 versus 0.216±0.04), this difference is minimal. Note that in this case, random forests with clinical variables only do not perform better than pls-pv+rf/xz. The difference between random forests and logistic regression can be explained by the linear structure of our simulated data: for this simulation setting the flexibility of random forests is not an advantage, in contrast to the non-redundant setting sketched below. SVMs perform approximately as well as pls-pv+rf in the case of non-informative clinical variables, but worse in all other cases.
|
As an illustration of the model selection scheme proposed in Section 2.4 and its ability to assess the additional predictive value of microarray data, we also compute (i) the mean OOB error over the 100 subsampling runs obtained with k*=0 and k*=1 PLS component and (ii) the percentage of runs for which at least one PLS component is selected (i.e. k*>0), for the novel pls-pv+rf/xz method. As can be seen from Table 2, the proportion of simulation runs with k*>0 selected PLS components increases drastically with µX, but this increase also depends on µZ. For a fixed µX, the proportion of runs with k*>0 is much lower with informative clinical variables (µZ=1,3) than with non-informative clinical variables. This can be explained as follows: if clinical variables perform well, it is more difficult for microarray data to yield accuracy improvement.
|
The simulation design outlined above corresponds to the case where both microarray and clinical variables discriminate the two response classes in the same way, for instance because both of them are influenced by the same underlying mechanism. They give essentially redundant information. In practice, investigators often hope that microarray data give additional (i.e. non-redundant) information, for instance, by correctly predicting a particular group that is difficult to predict with clinical predictors only. In the rest of this section, a variant of the above simulation design is applied to investigate the behavior of the different methods in an ideal extreme case where clinical and microarray predictors are perfectly complementary. The observations with Y=1 are assumed to come from two underlying classes 1a and 1b. The microarray variables are drawn to separate 1a from the rest, whereas the clinical variables separate 1b from the rest. The underlying model is the same as in Equation (2) and (3), except that Y is replaced by the binary variables Y(a) and Y(b) defined as Y(a)=1 if Y=1a and 0 otherwise, and Y(b)=1 if Y=1b and 0 otherwise, respectively.
In this non-redundant setting, the two-step pls-pv+rf/xz approach performs almost uniformly better than all other methods (Table 3). This is not surprising, since the simulation setting can be seen as an extreme case where the combination of two types of predictors using tree-based methodologies is expected to work well. However, this case is very important in the context of the additional predictive value. Indeed, by additional predictive value, one often implicitly means that microarray data can predict disease subtypes which are wrongly classified by clinical parameters. Note that the overall performance of all methods decreases dramatically compared to the redundant setting. This is because each observation is discriminated by both clinical and microarray variables in the redundant setting, but by only one of the two types of variables (either clinical or microarray) in the non-redundant setting.
|
3.2 Application to breast cancer data
This widely-used benchmark dataset gives the expression levels of 22 483 genes for 78 breast cancer patients, of which 34 have poor prognosis and 44 have good prognosis (van't Veer et al., 2002). It can be downloaded from the article webpage. The dataset prepared as described in the original manuscript (only genes that show 2-fold differential expression and P-value for a gene being expressed <0.01 in more than five samples are retained, yielding 4348 genes) is included in the R package DENMARKLAB (Fridlyand and Yang, 2004), which we use in the article. The available clinical variables are age (metric), tumor grade (ordinal), estrogen receptor status (binary), progesterone receptor status (binary), tumor size (metric) and angioinvasion (binary).
The classification accuracy is evaluated using the common repeated subsampling method which is a variant of cross-validation also denoted as Monte-Carlo-cross-validation, [see (Boulesteix et al., (2008); Molinaro et al., (2005) for more details]. In a nutshell, instead of splitting the original dataset consisting of 78 observations into, say, 5, 10 or n subsets (like in standard cross-validation), we repeatedly split it into a learning set and a test set according to the ratio 4:1. Variable selection is carried out based on the absolute value of the t-statistic, using the learning set only as commonly recommended (Boulesteix, 2007; Dupuy and Simon, 2007). The method described in Section 2.3 is then applied to the learning and test sets including the step for the optimization of the number of components as described in Section 2.4. The error rates are estimated as the proportion of misclassified test observations. The whole procedure is repeated 100 times and the error rates are averaged. This approach usually leads to more stable results than standard cross-validation, because it is based on a larger number of iterations. Table 4 gives the obtained mean error rates with the seven methods described in Section 3.1, for different numbers of variables in the range of the number of genes used in the original signature proposed by van't Veer et al., (2002).
|
It can be seen from Table 4 that microarray data do not noticeably improve the prediction accuracy yielded by clinical parameters alone, which corroborates the findings of Eden et al. (2004). This result is confirmed by considering the mean OOB error rates obtained with the different numbers of components. For each of the 100 iterations, we estimate the 95% confidence interval for the difference of the OOB misclassification rates obtained with k=0 and k=1, respectively. The sample size is 62 for both rates, since each learning set contains 0.8 x 78
62 observations. For p*=100, the lower bound of the obtained confidence interval exceeds zero for only 9% of the 100 iterations, which suggests that the microarray data do not contribute significantly to the prediction. Similar conclusions are obtained with p*=20,100,200.
3.3 Application to colorectal cancer data
This Affymetrix dataset described by Lin et al. (2007) gives the expression levels of 16 041 genes for 29 good outcome patients and 26 poor outcome patients with colorectal cancer. In addition to microarray data, the two variables sex and age are available. The data are prepared as described in Lin et al. (2007). Gene expression data are expected to have better predictive power than the variables sex and age which typically yield relatively poor prediction accuracy in the case of cancer.
The analysis design is the same as for the van't Veer data. As can be seen from the results given in Table 5, the microarray data now have predictive power whereas the two variables age and sex do not. Unsurprisingly, involving the uninformative variables age and sex (methods pls-pv+rf/xz and pls+rf/xz) slightly decreases the prediction accuracy compared to the methods without clinical variables pls-pv+rf/x and pls+rf/x, but the performance of the approach pls-pv+rf/xz remains comparable to the performance of the standard good performing SVM method. Hence, our approach shows overall good performance in very different situations.
|
As an illustration of the model selection scheme embedded in the PLS-PV+RF method, we show (i) the mean OOB error over the 100 subsampling runs obtained with each number of PLS components, and (ii) the percentage of runs for which at least one PLS component is selected (k*>0), as in the simulation study. As can be seen from Table 6, our method correctly selects at least one PLS component in most runs (
90%) for the colorectal data, i.e. much more often than for the van't Veer data. This result is in agreement with the mean OOB obtained with each number of components. Whereas the mean OOB does not depend on the number of PLS components for the van't Veer data (OOB
0.30 for all p* and all k), it decreases substantially between k=0 and k=1 for the colorectal data, with a further slight decrease from k=1 to k=2.
|
| 4 CONCLUSION |
|---|
|
|
|---|
We have presented a simple two-step approach based on well-established data analysis tools (PLS and random forests) combined with the pre-validation principle by Tibshirani and Efron (2002). This procedure can simultaneously determine whether microarray data have additional predictive value and provide a combined classifier fulfilling the six points enumerated above. Our fast, simple and flexible method is implemented in the R package MAclinical. Its ability to yield efficient hybrid prediction rules in various settings (good/bad microarray/clinical predictors) was demonstrated in both simulations and real data analysis. In particular, our approach does not seem to overestimate the predictive value of microarray data and yields good prediction accuracies when microarray and clinical parameters do not give redundant predictive information. Let us further mention that as far as computation time is concerned our novel method is similar to SVM, i.e. very fast. In their study of pre-validation, Höfling and Tibshirani (2008) point out substantial biases occurring when pre-validated predicted probabilities are tested for significance in linear regression. In our study, the OOB error slightly decreased with the number of included pre-validated PLS components in the non-informative case, but this trend was minimal. This potential slight bias in model selection, which is probably due to the mechanism outlined by Höfling and Tibshirani (2008) (roughly speaking, observations are not independent anymore in cross-validation), should be addressed in future research.
Note that the OOB error used in this article for the choice of the number of PLS components can also be used for the choice of the number of genes, similarly to the procedure for the choice of the number of PLS components. This problem is ignored by many authors, who compare the different number of genes post hoc, i.e. after completing the evaluation procedure. Tuning the number of genes may lead to optimistic bias, not only in the context considered in this article.
The extension of our method to other prediction problems (regression, survival analysis, multicategorical classification) is straightforward. Unlike other common approaches such as logistic regression, it (i) does not require any distributional assumption or a specific type of relationship (e.g. linearity) between the response and the predictors, (ii) does not need any limitation of the number of clinical variables, which is useful in the case of small samples, (iii) can cope with separated classes (in this respect, the aggregation of trees built on perturbed datasets is an advantage) and (iv) can handle interactions, for instance interactions between microarray and clinical predictors, thus potentially identifying subtypes that are not caught by clinical data alone.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
We thank J. Friederichs and B. Holzmann for making the colorectal cancer dataset available. We are also grateful to Casimir Kulikowski and the three referees for their constructive comments.
Funding: A.L.B. was sponsored by the Porticus Foundation in the context of the International School for Technical Medicine and Clinical Bioinformatics.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Joaquin Dopazo
Received on February 8, 2008; revised on May 16, 2008; accepted on June 4, 2008
| REFERENCES |
|---|
|
|
|---|
Barker M, Rayens W. Partial least squares for discrimination. J. Chemometr (2003) 17:166–173.[CrossRef]
Binder H, Schumacher M. Allowing for mandatory covariates in boosting estimation of sparse high-dimensional survival models. BMC Bioinformatics (2008) 9:14.[CrossRef][Medline]
Bomprezzi R, et al. Gene expression profile in multiple sclerosis patients and healthy controls: identifying pathways relevant to disease. Hum. Mol. Genet (2003) 12:2191–2199.
Boulesteix A-L. PLS dimension reduction for classification with microarray data. Stat. Appl. Genet. Mol. Biol. (2004) 3:33.
Boulesteix A-L. Reader's reaction to Dimension reduction for classification with gene expression microarray data by Dai et al. (2006). Stat. Appl. Genet. Mol. Biol. (2006) 5:16.
Boulesteix A-L. WilcoxCV: an efficient R package for variable selection in cross-validation. Bioinformatics (2007) 23:1702–1704.
Boulesteix A-L, Strimmer K. Partial least squares: a versatile tool for the analysis of high-dimensional genomic data. Brief. Bioinform (2007) 8:32–44.
Boulesteix A-L, et al. Evaluating microarray-based classifiers: an overview. Cancer Informat (2008) 6:77–97.
Breiman L. Bagging predictors. Mach. Learn (1996) 24:123–140.
Breiman L. Random forests. Mach. Learn (2001) 45:5–32.[CrossRef]
Dai JJ, et al. Dimension reduction for classification with gene expression data. Stat. Appl. Genet. Mol. Biol (2006) 5:6.
Daumer M, et al. The additional predictive value of magnetic resonance imaging for the prediction of future relapses if relapse history is available. Mult. Scler (2006) 12:S46–S47.
de Jong S. SIMPLS: an alternative approach to partial least squares regression. Chemomet. Intell. Lab. Syst (1993) 18:251–253.[CrossRef]
Dettling M, Bühlmann P. Finding predictive gene groups from microarray data. J. Multivariate Anal (2004) 90:106–131.[CrossRef]
Diaz-Uriarte R, Alvarez de Andrés S. Gene selection and classification of microarray data using random forests. BMC Bioinformatics (2006) 7:3.[CrossRef][Medline]
Dupuy A, Simon R. Critical review of published microarray studies for cancer outcome and guidelines on statistical analysis and reporting. J. Natl Cancer I (2007) 99:147–157.[CrossRef]
Eden P, et al. Good old clinical markers have similar power in breast cancer prognosis as microarray gene expression profilers. Eur. J. Cancer (2004) 40:1837–1841.[CrossRef][Web of Science][Medline]
Fridlyand J, Yang JYH. DENMARKLAB R package. In: Advanced microarray data analysis: class discovery and class prediction (2004) last accessed date 30 June 2008. Available at http://genome.cbs.dtu.dk/courses/norfa2004/Extras/DENMARKLAB.zip.
Garthwaite PH. An interpretation of partial least squares. J. Am. Stat. Assoc (1994) 89:122–127.[CrossRef][Web of Science]
Gevaert O, et al. Predicting the prognosis of breast cancer by integrating clinical and microarray data with Bayesian networks. Bioinformatics (2006) 22:e184–e190.
Höfling H, Tibshirani R. A study of pre-validation. Ann. Appl. Stat (2008) (in press).
Hothorn T, et al. Unbiased recursive partitioning: a conditional inference framework. J. Comput. Graph. Stat (2006) 15:651–674.[CrossRef]
Hunter DJ, et al. Letting the genome out of the bottle – Will we get our wish? New England J. Med (2008) 358:105–107.
Ioannidis JP. Microarrays and molecular research: noise discovery? The Lancet (2005) 365:488–492.
Lin YH, et al. Multiple gene expression classifiers from different array platforms predict poor prognosis of colorectal cancer. Clin. Cancer Res (2007) 13:498–507.
Man MZ, et al. Evaluating methods for classifying expression data. J. Biopharm. Stat (2004) 14:1065–1084.[CrossRef][Medline]
Martens H, Naes T. Multivariate Calibration. (1989) Wiley, New York.
Molinaro A, et al. Prediction error estimation: a comparison of resampling methods. Bioinformatics (2005) 21:3301–3307.
Nguyen DV, Rocke D. Tumor classification by partial least squares using microarray gene expression data. Bioinformatics (2002) 18:39–50.
Ntzani EE, Ioannidis JPA. Predictive ability of DNA microarrays for cancer outcomes and correlates: an empirical assessment. The Lancet (2003) 362:1439–1444.
Pawitan Y, et al. Gene expression profiling spares early breast cancer patients from adjuvant therapy: derived and validated in two population-based cohorts. Breast Cancer Res (2005) 7:R953–R964.[CrossRef][Web of Science][Medline]
Statnikov A, et al. A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics (2005) 21:631–643.
Stone M, Brooks RJ. Continuum regression: cross-validated sequentially constructed prediction embracing ordinary least squares, partial least squares and principal component regression. J. R. Stat. Soc. B (1990) 52:237–269.
Strobl C, et al. Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinformatics (2007) 8:25.[CrossRef][Medline]
Sun Y, et al. Improved breast cancer prognosis through the combination of clinical and genetic markers. Bioinformatics (2007) 23:30–37.
Tibshirani R, Efron B. Pre-validation and inference in microarrays. Stat. Appl. Genet. Mol. Biol (2002) 1:1.[Medline]
Tutz G, Binder H. Boosting ridge regression. Comput. Statist. Data Anal (2007) 51:6044–6059.[CrossRef]
van't Veer LJ, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature (2002) 415:530–536.[CrossRef][Web of Science][Medline]
Wold H. Estimation of principal components and related models by iterative least squares. In. Multivariate Analysis—Krishnaiah PR, ed. (1966) New York: Academic Press.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
