Bioinformatics Advance Access originally published online on November 5, 2004
Bioinformatics 2005 21(7):1104-1111; doi:10.1093/bioinformatics/bti114
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Classification using partial least squares with penalized logistic regression
CNRS/LMC-IMAG BP 53, 38041 Grenoble cedex 9, France
*To whom correspondence should be addressed.
| Abstract |
|---|
|
|
|---|
Motivation: One important aspect of data-mining of microarray data is to discover the molecular variation among cancers. In microarray studies, the number n of samples is relatively small compared to the number p of genes per sample (usually in thousands). It is known that standard statistical methods in classification are efficient (i.e. in the present case, yield successful classifiers) particularly when n is (far) larger than p. This naturally calls for the use of a dimension reduction procedure together with the classification one.
Results: In this paper, the question of classification in such a high-dimensional setting is addressed. We view the classification problem as a regression one with few observations and many predictor variables. We propose a new method combining partial least squares (PLS) and Ridge penalized logistic regression. We review the existing methods based on PLS and/or penalized likelihood techniques, outline their interest in some cases and theoretically explain their sometimes poor behavior. Our procedure is compared with these other classifiers. The predictive performance of the resulting classification rule is illustrated on three data sets: Leukemia, Colon and Prostate.
Availability: Software that implements the procedures and data source on which this paper focuses are freely available at http://www-lmc.imag.fr/SMS/membres/Gersende_Fort,Sophie_Lambert.html
Contact: sophie.lambert{at}imag.fr
| INTRODUCTION |
|---|
|
|
|---|
Microarray technology generates a vast amount of data by measuring, through the hybridization process, the levels of virtually all the genes expressed in a biological sample. One can expect that knowledge gleaned from microarray data will contribute significantly to advances in fundamental questions in biology as well as in clinical medicine. One important goal of analyzing microarray data is to classify the samples. To cite a few, Golub et al. (1999) have considered classification of acute leukemia and Alon et al. (1999) have addressed the cluster analysis of tumor and normal colon tissues. The approaches developed in these papers consist in discrimination methods and machine learning methods [see Dudoit et al. (2002) for a comparative study].
In microarray studies, the number of samples, n, is relatively small compared to the number of genes, p, usually in thousands. Unless a preliminary variable selection step is performed, standard statistical methods in classification perform poorly because there are far more variables than observations. One problem is multicollinearity: estimating equations become singular and have no unique and stable solution. For instance, the pooled within-class sample covariance matrix in Fisher's linear discriminant function is singular if n < p + 2. Even if all genes can be used as in support vector machines, it seems to be not sensible to use all the genes. Indeed, this use allows the presence of the noise associated with genes of little or no discrimination power. That inhibits and degrades the performances of the classification rules in their application to unclassified tumor. In this situation, dimension reduction is needed to reduce the high p-dimensional gene space. In most previously mentioned works, the authors have used univariate methods for reducing the number of genes. Alternative approaches to handle the dimension reduction problem can also be used (see for instance Ghosh, 2002; Nguyen and Rocke, 2002; West et al., 2001; Antoniadis et al., 2003).
Similar data structures have been encountered in the field of chemometrics. The method of partial least squares (PLS) (Wold, 1975; Naes and Martens, 1985; Helland, 1988) has been found to be a useful dimension reduction technique as well as principal component regression (PCR) [Massy, 1965; see Frank and Friedman (1993) for a statistical view of PLS and PCR]. In the context of microarrays, the purpose of PCR is to produce orthogonal tumor descriptors that reduce the dimension to only a few gene components (super-genes) (West et al., 2001). But the dimension reduction is achieved without regard to the response variable and may be inefficient. This is the reason why PLS looks more adapted than PCR for dimension reduction based prediction. Indeed, PLS components are chosen so that the sample covariance between the response and a linear combination of the p predictors (genes) is maximum.
Nguyen and Rocke (2002) proposed using PLS for dimension reduction as a preliminary step to classification, based either on linear logistic discrimination, or linear or quadratic discriminant analysis. However, this seems to be intuitively unappealing because PLS is really designed to handle continuous responses and models that do not suffer from heteroscedasticity as it is the case for Bernoulli or multinomial data. Furthermore, in practice we have observed problems in the convergence of the iteratively reweighted least squares (IRLS) algorithm, which is the usual procedure for solving the maximum likelihood (ML) equation in the field of the generalized linear models (GLM). Indeed, for logistic regression, it is well known that convergence poses a long standing problem. Infinite parameter estimates can occur depending on the configuration of the sample points in the observation space (Albert and Anderson, 1984).
Marx (1996) proposed an extension of PLS to a categorical response variable and illustrated the developments from a spectroscopy example. His approach embedded the usual PLS steps within the IRLS. Unfortunately, we have observed that this algorithm does not converge in many cases of interest (such as in the applications considered in this paper). More recently, Ding and Gentleman (2004, http://www.bepress.com/bioconductor/) proposed an approach based on this procedure. They phrased the problem in a GLM setting and applied Firth's procedure to avoid (quasi)separation.
To deal with the high-dimension problem, another approach consists in penalizing the likelihood. Eilers et al. (2001) propose useing the Ridge penalized logistic regression in order to both stabilize the statistical problem and remove numerical degeneracy due to multicollinearity. They have shown that this method appears to work well with microarray data. Note that this method is not a dimension-reduction technique. Indeed all explanatory variables are allowed into the regression model. From the log-likelihood a so-called Ridge penalty is subtracted. All the genes contribute, which can inhibit and degrade the performances of the classification rules. Note that we can find alternative approaches (see, for example, Huang and Pan, 2003; Ghosh, 2003) for which the classification problem is not viewed as a problem in a logistic regression.
In this paper, we extend the PLS method to a binary response variable. To do that, we want to substitute the categorical response variable in the input of PLS by a continuous-valued pseudo-response variable whose expected value has a linear relationship with the covariates. The limiting pseudo-response variable in the IRLS algorithm seems to be a good candidate. Unfortunately, in the present situation, small n, large p, IRLS no longer works since the limiting pseudo-response variable is, in norm, infinite. The idea developed here is to penalize with a Ridge penalty the likelihood criterion in order to constrain the pseudo-response variable to be finite. That is, our procedure combines a Ridge penalty step and a PLS step and the dimension-reduction step is incorporated in the classification step. Here we present the classification rule for the binary response variable indicating a normal or colon tumor, for instance. Nevertheless, our approach remains valid for multi-categorical response variables. But the binary case is the simplest case which allows us to point out whether such a procedure works well or not and why.
This paper is organized as follows. The Methods section is the methodological part of this paper. It contains a description of the logistic regression and linear discrimination. We then recall the Ridge regression method and derive a weighted PLS algorithm in order to address the dimension reduction in heteroscedastic models. We then introduce an extension of PLS to GLM based on the Ridge penalty, and analyze the Nguyen and Rocke, Marx, Ding and Gentleman and Eilers et al. algorithms. Applications to disease classification through microarray are presented in the Results section.
| METHODS |
|---|
|
|
|---|
Some basic ingredients
After introducing some notations, we recall the principle of linear logistic discrimination, some results on the existence of the maximum likelihood estimator and the classical algorithm used to compute it. Next, we present a regularization method, a penalized maximum likelihood method, and a dimension-reduction technique, PLS.
Notations
Expression levels of the p genes for the n microarray samples are collected in an n x p data matrix X = (xi/j), 1
i
n, 1
j
p. The entry xi/j is the expression level of the variable gene j in the microarray sample i, and the i-th row Xi,· is the vector of a gene expression profile for sample i. More generally, for a matrix A, Ai,j denotes the entry (i, j), A·,j (resp. Ai,·) denotes the column vector collecting the column #j (resp. the row #i). Ai1:i2,j1:j2 is the (i2 i1 + 1) x (j2 j1 + 1) matrix formed by picking out the rows i1 to i2 and columns j1 to j2 of A; A·,j1:j2 is formed by picking out the columns j1 to j2 of A. The labels of the n microarray samples are collected in a {0,..., (g 1)}n-valued vector y. In supervised machine learning, each sample is thought to originate from a specific class k
{0,...,g 1} where the number of possible classes g is known and fixed. A classifier can be regarded as a function G:
p
{0,...,g1} that predicts the unknown class label of a new tissue sample x 
p by G(x). We assume that the data (y, X) collect observations of n statistically independent and identically distributed random pairs (Y, X). We choose a logit model for the data (see, e.g. Fahrmeir and Tutz, 2001), and the logistic discrimination (LD) method for the classification procedure (see, e.g. Timm, 2002). In the terminology of the regression analysis, (X·,j)1
j
p are the predictor variables and (yi)1
i
n the response variables. We include an intercept into the regression model, and denote by Z = [
n X] the design matrix of size n x (p + 1), where
n = (1,...,1)' stands for the column vector of length n (' denotes the transposition operator).
Linear LD
In logit models, the conditional class probabilityor equivalently, the conditional expectation of Y given XP(Y = 1|X = x;
) is related to x and some parameter
p+1 through the relation P(Y = 1| X = x;
) = h([1 x']
) where h(
) = 1/[1 + exp(
)]. The quantity [1 x']
is called the linear predictor.
is an unknown parameter that has to be estimated from the data. In LD, it is usually estimated by
, the ML estimator. The log-likelihood of the observations for the value
of the parameter, simply denoted by l(
), is given by
![]() | (1) |
i
n,
i(
) = (Z
)i.
For a vector z = [1 x'], the predicted class
of each sample is 1 if
and 0 otherwise, where
. Nevertheless, as discussed below, in some cases, including in practice the case n << p, the existence and unicity of
for logit models is not guaranteed.
Maximum likelihood estimate and IRLS
We say that the ML estimate exists if there exists
p+1 of finite norm which is a maximizer of the concave log-likelihood l. Hence, such an estimate is a solution to the normal equation Z'(y
(
)) = 0, where
(
) is the
n-valued mean vector with coordinates
i(
) = h(
i(
)).
If Z is full column-rank, the solution, when exists, is unique. The existence of a solution, when Z is full column-rank, depends on the configuration of the n sample points in the observation space
p (Albert and Anderson, 1984; Santner and Duffy, 1986). There are three exclusive situations: separate, quasi-separate and overlap. In the first two cases, there exists
such that
for all i such that yi = 1 and
for all i such that yi = 0; roughly speaking, this means that there exists a hyperplane that exactly separates the two classes, except maybe some points that can belong to the hyperplane. In such a case, l reaches its maximum as ||
|| tends to +
and the ML estimate does not exist. In the third case, the estimate exists and is computed as the limit of a converging NewtonRaphson sequence; this algorithm is known as the IRLS algorithm (Green, 1984). Let W(
) be the diagonal n x n matrix with diagonal entries Wi,i(
) =
i(
)[1
i(
)]. Each iteration divides into two steps,
![]() | (2) |
![]() | (3) |
(t) are shorthand notations for W(
(t)) and
(
(t)). IRLS can thus be considered as an iterative weighted least square regression of an
n-valued pseudo-variable z(t) onto the columns of Z. We denote this algorithm by IRLS (y, x)
When Z is not full column-rank, the parameter is not identifiable and the ML estimate is not unique when exists; applying the above iterations [Equations (2) and (3)] by replacing the inverse matrix (3) with the MoorePenrose pseudo-inverse, yields the parameter estimate which is of minimal norm among all the solutions. In practice, in the present statistical framework n
p, n = rank(Z) and the minimal norm solution verifies for all 1
i
n, (Z
)i = ln(yi) ln(1 yi); it is thus of infinite norm and the ML estimate cannot exist. This calls for regularization methods.
Ridge penalty and RIRLS
The Ridge estimator (Le Cessie and Van Houwelingen, 1992
is defined as the (unique) maximizer of the penalized likelihood l*(
) = l(
) 0.5
'
2
, where
> 0 is the shrinkage parameter, and
2 is a diagonal matrix with entries
and
![]() | (4) |
1.
always exists, is unique and is computed as the limit of a NewtonRaphson sequence. We denote by RIRLS(y, X,
) (shorthand notation for Ridge-IRLS) this algorithm. It consists in replacing in IRLS, the weighted regression (3) by a weighted Ridge regression
(t+1) = (Z'W(t)Z + 
2)1 Z' W(t) z(t), where z(t) is built as in Equation (2).
controls the amount of shrinkage in the data and can be chosen as the minimum, over a given range, of the BIC criterion 2
(Kass and Raftery, 1995).
Weighted PLS
PLS is both a tool for linear regression and a tool for dimension reduction (Wold, 1975 Naes and Martens, 1985 Helland, 1988). Let y
n be a response vector, X be an n x p data matrix and W be a symmetric positive definite n x n matrix. PLS (i) defines
W-orthogonal scores (tk)1
k
, linear combinations of the columns of Z such that for all k,
and (ii) performs a W-weighted least squares regression of y on (
n, t1,...,t
). This yields the decomposition
![]() |
+1 is W-orthogonal to the vectors (
n, t1,...,t
). Contrary to classical dimension-reduction methods (such as PCR), the scores depend on the response vector y; roughly speaking, given (tk)1
k
l, tl+1 is the linear combination of the columns of Z, i.e. is of the form tl+1 = Zc, which is the most informative on the residual response variable fl+1, when information is defined in terms of the weighted covariance |Cov(
Zc,
fl+1)|(
denotes the square root matrix of W) (Helland, 1988). While the maximal number of PLS scores
max can be lower than rank(X), in practice, it is often equal to rank(X). Helland (1988) shows that the weighted PLS (WPLS) regression applied with
=
max is nothing more than the weighted least squares regression. In the literature, PLS is usually derived with W = I, the identity matrix; we thus detail the algorithm in the weighted case. Let
be the p x p positive-definite diagonal matrix with diagonal entries
j,j, j
2, given by Equation (4).
, t0 =
n, E0 = Xs; f0 = y.
- For k = 0, ...,
, 
). If Z is full column-rank, this algorithm determines a unique estimate
satisfying
; if Z is not full column-rank, the procedure above yields the minimal norm vector among all the vectors verifying y fk+1 = Z
.
Ridge PLS
A direct application of PLS to GLM seems to be intuitively unappealing because PLS handles continuous responses. This is the reason why, in order to extend PLS to GLM, we want to replace the binary vector y with a pseudo-response variable whose expected value has a linear relationship with the covariates. The pseudo-response variable z
at the convergence of RIRLS(y, X,
) verifies this condition and is thus our candidate: it is of the form
, where, conditional to
being the true value of the parameter,
is a centered vector of covariance matrix (W
)1. The main advantage of choosing z
instead of, for example, the pseudo-variable at the convergence of IRLSwhich has a linear structure toois that this allows the combination of a regularization step and of a dimension-reduction step. In addition, this extension is always well defined: recall indeed that in some cases (including the case n
p), the ML estimate does not exist so that the pseudo-variable at convergence of IRLS is of infinite norm.
As a consequence, we propose a new procedure which combines Ridge penalty (the regularization step) and PLS (the dimension-reduction step) the so-called Ridge PLS (RPLS). Let
be some positive real constant and
be some positive integer. RPLS divides in two steps:
- (z
, W
)
RIRLS(y, X,
);
.
for the input of PLS, the dispersion matrix of which is [W
]1. This explains the call, in the second step, to a weighted PLS procedure with weight W
. The use of Xs in WPLS and of
in the penalized ridge criterion makes our procedure invariant to the scaling of the data matrix.
RPLS depends on two parameters,
and
.
is determined at the end of Step 1, as minimizing the BIC criterion (see the Ridge penalty section), and thus independently of
. In the linear regression setting, the optimal choice of
when dimension reduction is achieved by PLS, is to our best knowledge, an open problem: the non-linear dependence of
upon the response vector makes an explicit control of the error term f
+1 impossible. Finally, observe that RPLS provides an estimate
(which is unique, given y, X,
and
).
We are now able to provide an answer to the classification problem in a high-dimensional setting: our classification procedure consists in applying LD with the estimate
.
Comparison with other approaches
We briefly review some regression procedures that use PLS as the dimension-reduction tool to manage the high-dimensional setting. We outline their interest and in some cases, explain their poor behavior.
Nguyen and Rocke's approach
Nguyen and Rocke (2002) substitute the data matrix X by an n x
matrix
, the columns of which are the first
PLS-scores given by WPLS (y, X, I, f). Then they estimate the parameter in the ML sense by running IRLS (y,
). This yields
. As mentioned above, applying PLS with a binary input y is unappealing; in addition, the PLS-regression step does not take into account the heteroscedasticity of the response vector y; finally, in many applications,
since the ML estimate does not exist.
In practice, IRLS is stopped after a maximal number of iterations nmax thus hiding the non-convergence of IRLS. Unfortunately, the estimate
depends on nmax and this yields an unstable procedure for classification. We observed this phenomenon on the Leukemia data set.
is estimated by using the data in the Golub's training set (Golub et al., 1999); classification is performed on the samples from the test set. When p = 150 and
= 3, there are 1 (resp. 2) samples incorrectly classified if nmax = 7 (resp. nmax = 10).
Marx's approach
In Marx (1996) the parameter
is estimated in the ML sense and is obtained at the convergence of IRLS(y,
), where
is defined by IRPLS, an algorithm that extends PLS to GLM. More precisely, IRPLS can be understood as an IRLS algorithm in which the weighted least squares regression (3) is replaced with the WPLS regression, WPLS(z(t), X, W(t), rank(E1)).
collects the first
components at convergence of IRPLS.
As recalled above, WPLS applied with the maximal number of PLS components is nothing else than weighted least squares (note that Marx chooses
= rank(E1) while in theory,
max should be strictly lower than rank(E1)). Hence IRPLS and IRLS coincide, and, when X is full row-rank (which is most often the case when n
p), IRPLS never converges. In practice, IRPLS is stopped after a fixed number of iterations, thus hiding the non-convergence phenomenon. In addition, initializing IRPLS by choosing a linear predictor of the form
(0) = c0y c0(
n y) [where for example c0 = ln(3)], as done by Marx, yields
. A trivial induction shows that for all t
0, z(t) = 2cty ct
n with ct = 1 + ct1 + exp(ct1), and W(t) is proportional to the identity matrix In. Since WPLS(y, X, W,
) = WPLS(
y + ß
n, X, W,
)in terms of the exhibited scoresfor all
, ß
, WPLS (z(t), X, W(t),
) returns the same scores as WPLS (y, X, In,
), thus proving
.
Ding and Gentleman's approach
The originality of their work (Ding and Gentleman, 2004) is that it simultaneously answers to the regularization question and to the dimension-reduction one. They run an approximation of a NewtonRaphson (NR) algorithm for solving a Firth's penalized ML criterion. As in IRLS, any iteration of the NR algorithm is a weighted least squares regression and Ding and Gentleman replace this least square regression by a WPLS one. We call this algorithm FPLS.
We run their method on the data sets described in the next section. On the colon data set and on the prostate data set, the algorithm does not always converge; we observe a cyclic behaviorafter a burn-in period the path is periodic. The estimate
, and consequently the classification rule, may depend on the maximal number of iterations.
This approach is greatly promising since it addresses both the regularization and the dimension-reduction problems. Comparisons of our results with their approach are of interest and will be explored in future research.
Eilers et al.'s approach
Their method (Eilers et al., 2001) does not use PLS. We nevertheless mention their work since their estimate,
is the Ridge-penalized ML estimate (with an un-weighted penalty term i.e.
2 = I). The method of Eilers et al. does not reduce the dimension but only deals with the regularization question. In particular, all the explanatory variables are allowed and included into the regression model, which can deteriorate the performances of the classifier. In the next section, we will outline the high interest of combining a reduction step with the Ridge regularization.
| RESULTS |
|---|
|
|
|---|
We illustrate the interest of RPLS by considering applications for the classification of microarrays data. We compare the classification results from our procedure with those of other classifiers including RIRLS, FPLS, the effective dimension reduction (MAVE, Antoniadis et al., 2003), diagonal linear discriminant analysis (DLDA), diagonal quadratic discriminant analysis (DQDA) and k-nearest neighbors (KNN) based on the Euclidean distance [see Devroye et al. (1996) for an overview of the last three methods].
DLDA, DQDA and KNN are thus introduced in the present paper as classical statistical methods. As commented in the abstract, our goal is to show that these methods poorly behave when applied to high-dimensional data sets. This is exactly what happens, thus stressing the need for interest in methods based on regularization and dimension reduction.
In order to illustrate the interest of PLS over PCR in the regression framework, we compare our algorithm RPLS to RPCR (for Ridge-PCR). By nature, PCR handles continuous responses; this calls for an extension of PCR to GLM, in order to use it as a dimension reduction in GLM. The extension we derived for PLS remains valid for PCR: we exhibit the continuous-valued pseudo-response variable at the convergence of the RIRLS algorithm and use this variable as the input variable for PCR. This yields RPCR.
Data, pre-processing and gene selection
We will consider in turn the Leukemia, Colon and Prostate data sets.1 The Leukemia data set contains 72 tissue samples with pinit = 7129 genes: 47 cases of acute lymphoblastic leukemia (ALL), coded 0 and 25 cases of acute myeloid leukemia (AML), coded 1 (Golub et al., 1999). The Colon data set contains 62 tissue samples with pinit = 2000 genes: 40 tumors tissues, coded 1 and 22 normal tissues, coded 0 (Alon et al., 1999). The Prostate data set contains 102 tissue samples with pinit = 12600 genes: 52 tumors tissues, coded 1 and 50 normal tissues, coded 0 (Singh et al., 2002).
For Leukemia and Colon data (resp. Prostate), the pre-processing steps of Dudoit et al., 2002 [resp. Singh et al., 2002] are applied: thresholding [floor of 100 (resp. 10) and ceiling of 16 000] filtering [exclusion of genes with max/min
5 and (max min)
500 (resp. 50)] log10-transformation/standardization. Notice that the filtering step is applied using only the Learning set. This yields a resulting number of covariates pmax depending on the subdivision Learning and Testing set, lower than pinit but still far larger than the number of observations.
Although the procedures can handle a large number (thousands) of genes, the number of genes may still be too large for practical use. Furthermore, a considerable percentage of the genes do not show differential expression across groups and only a subset of genes are of interest. We perform the preliminary selection of gene based on the BSS/WSS criterion used in Dudoit et al. (2002). When training the rule for the selection of gene, we select p genes by the previous criterion with
for Leukemia data,
for Colon data and
for Prostate data.
Assessing prediction methods
It is common to assess the performance of the classification rules for a selected subset of genes by their errors on the test set and also by their leave-one-out cross-validated errors. Due to the instability of leave-one-out error rates, we also perform a re-randomization study, i.e. an out-of-sample analysis of 100 random subdivisions of the data set into a learning set and a test set. When a test set is available, we randomly split the original data set into a training set and a test set of the same size as the original ones. Otherwise, we choose a test set size equal to one-third of the data [2:1 scheme of Dudoit et al. (2002)]. Each subdivision yields a test set error rate for each predictor; boxplots are used to summarize these error rates over the runs.
The optimal number of PLS or PCR components (for RPLS, FPLS or RPCR) is selected by choosing the value of
minimizing leave-one-out error rates for the training set. This is also employed for other procedures that involve hyperparameters, such as MAVE or KNN. In practice, on the leave-one-out analyses performed on the Colon data sets and on the Prostate data sets, we observed many cases of indecisions for even values of k. This is the reason why, as suggested in Devroye et al. (1996) we run KNN for odd values of k. We really believe that the frequent occurrence of the indecision case shows that KNN is not a pertinent method (for this kind of data sets). The weakness of this classical statistical method is clearly illustrated by the numerical results.
The
range is given by
for Leukemia data,
for Colon data and
for Prostate data. Moreover, the shrinkage parameter (for RIRLS, RPLS or RPCR) is determined as mentioned above on 51 log10-linearly spaced points in the range [102; 103]. Note that, to fairly evaluate and compare the test or leave-one-out cross-validated errors, pre-processing, gene selection and (hyper)-parameter estimations are performed on the training set (at each step of the cross-validation process).
| DISCUSSION |
|---|
|
|
|---|
Different numerical results are reported in Tables 13 and boxplots are plotted in Figures 13. In the tables, the number in brackets for RPLS, RPCR, FPLS and MAVE are the optimal numbers of components chosen as previously indicated and those for KNN are the optimal numbers of nearest neighbors. The numerical results and graphics show the necessity of the dimension-reduction step. This is particularly evident from the Colon and Prostate data results. Indeed note that most of the classifiers proposed in the literature behave well on the Leukemia data set though the other data sets are known to be more problematic. In particular, the boxplots suggest that errors rates for RPLS, RPCR and FPLS are typically lower and less variable. There is no obvious difference between the distributions of error rates for these three methods. However, we can mention that for Colon and Prostate data FPLS has converged only for small
values; and that RPCR needs
values greater than the one of RPLS. Otherwise these methods are robust to the growth of p, thanks to the dimension-reduction step (the larger p is, the larger
has to be chosen to reach the best classification result except for FPLS which does not converge for large
), and to an increasing value of the shrinkage parameter. The good performance of these methods when p = pmax (Table 2) is particularly interesting when applied to microarrays, since it can allow the practitioner to avoid a pre-selection step and thus makes the classification result independent of the criterion applied in this preliminary selection. On the other hand, the methods such as RIRLS, DLDA, DQDA or KNN have very poor performances when p gets large. Note that MAVE stands between the two although it is a dimension-reduction method.
|
|
|
|
|
Concerning the comparison between RPLS and RIRLS, as mentioned above, we do not trust RIRLS due to the non-scaling of the design matrix that makes the utility of the method problem-specific. It may be read in the tables and figures that RPLS and RIRLS have an equivalent behavior for small p values. Nevertheless the latter is not robust to large p. This legitimately suggests adding to this method a dimension-reduction step; we observed on the three data sets, in the resampling analysis, that this would improve the RIRLS method.
RPLS confirms different analyses in the literature. For example, it is known that in the Leukemia data set, samples #28, 66 and 67 have a high misclassification rate (henceforth denoted MR[i] for sample #i) (Dudoit et al., 2002). In the resampling study, for
and
, RPLS systematically misclassifies sample #66, whereas MR[28] and MR[67] decrease when
and p both increase: for p = 1000 and
= 3 (resp. p = 50 and
= 1), MR[28] = 7.02% and MR[67] = 7.55% (resp. 38.60 and 39.62%). Another example is given by the Colon data set, for which samples N8,34 and T30,33,36 are misclassified by both the contributions (Alon et al., 1999; Furey et al., 2000); in the resampling analysis, performed for
,
, N8,34 and T 36 are systematically misclassified, MR[T30]
88.89% and MR[T33]
96.87%. In addition, RPLS always misclassifies N36 [(an example pointed out in Furey et al., 2000)], and behaves poorly for samples T2,33 [samples pointed out in Alon et al. (1999)]. For the Prostate data set, in the leave-one-out study, the minimal number of misclassified samples is five (and is reached by RPLS), namely samples 32,64,68,84 and 92. Samples 32, 84 and 92 are misclassified for all of the LO analysis
; samples 64 and 68 are misclassified in more than 96 and 91% of the LO analysis. In the resampling study, MR[32] = 1, MR[64]
0.41, MR[68]
0.72, MR[84] = 1 and MR[92]
0.90.
| CONCLUSIONS |
|---|
|
|
|---|
We have proposed a statistical dimension-reduction approach for the classification of tumors based on microarray gene expression data. Our method is designed to address the curse of dimensionality and overcome the problem of a high-dimensional gene expression space so common in such type of problems. We have provided a new extension of partial least squares to binary response data, that seems to have better properties than some of the currently used methods. We restricted our attention to the binary case, but the methodology can be extended to cover multi-class problems and we are interested in doing so. Indeed the structure of the algorithm for the binary case and multi-class case is the same, but the choice of the parameter
necessitates more attention in the multi-class case than in the binary case. Future research will also concern the variable and model selection themes in order to determine optimal values for (
,
). | APPENDIX: RPLS |
|---|
|
|
|---|
For given (y, X),
> 0 and
1.- Compute Z
[
n X] and
as in Equation (4).
- RIRLS step:
- Initialize
(0)
p+1. 
- Until convergence, do

- Set

- Initialize
- WPLS step:
.
- t0
n, E0
Xs, f0
z
,
0
0
p,
Ip.
- For k = 0, ...,
, 
- Set
and q
[q1 ... q
]'.
- Conclude:

The procedure, presently derived in
p+1, can be equivalently derived in
r+1 where r + 1 = rank(Z)
n. To that goal, compute UDV', the singular values decomposition (svd) of (X
n
X/n)
, the scaled covariate matrix; collect the first r columns of UD in
= (UD)·,1:r so that Z
= [
n
]
for some
r+1. We denote by J(r) a diagonal matrix with
and
, k = 2,...,r + 1. It is readily seen that RPLS, run by replacing (X,
2) by (
, J(r)), yields an estimate
uniquely related to
by the formulas
![]() |
|
| Acknowledgments |
|---|
The authors are really grateful to A. Antoniadis for constructive and fruitful discussions and to the referees for their constructive comments and criticisms which have substantially improved this article. They would like also thank I. De Feis for helpful comments and B. Ding and R. Gentleman for providing a preprint of their paper prior to publication.
Part of this work was supported by the research project ASBGEN and the Interuniversity Attraction Pole (IAP) research network in Statistics P5/24.
| Footnotes |
|---|
1They can be downloaded from http://www.sdmc.lit.org.sg/aGEDatasets/Datasets.html
Received on July 27, 2004; revised on October 5, 2004; accepted on October 22, 2004
| REFERENCES |
|---|
|
|
|---|
Albert, A. and Anderson, J. (1984) On the existence of maximum likelihood estimates in logistic regression models. Biometrika, 71, 110
Alon, U., Barkai, N., Notterman, D., Gish, K., Ybarra, S., Mack, D., Levine, A. (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl Acad. Sci. USA, 96, 67456750
Antoniadis, A., Lambert-Lacroix, S., Leblanc, F. (2003) Effective dimension reduction methods for tumor classification using gene expression data. Bioinformatics, 19, 563570
Devroye, L., Gyorfi, L., Lugosi, G. Theory of Pattern Recognition, (1996) , New York Springer-Verlag.
Ding, B. and Gentleman, R. (2004) Classification using generalized partial least squares. Technical Report 5, Bioconductor Project Working Papers. http://www.bepress.com/bioconductor/paper5.
Dudoit, S., Fridlyand, J., Speed, T. (2002) Comparison of discrimination methods for the classification of tumors using gene expression data. J. Amer. Stat. Assoc., 97, , pp. 7787[CrossRef].
Eilers, P., Boer, J., Van Ommen, G., Van Houwelingen, H. (2001) Classification of microarray data with penalized logistic regression. Proceedings of SPIE. Progress in Biomedical Optics and Images, vol. 4266, , pp. 187198.
Fahrmeir, L. and Tutz, G. Multivariate Statistical Modelling Based on Generalized Linear Models, (2001) 2nd edn , New York Springer Series in Statistics.
Frank, I. and Friedman, J. (1993) A statistical view of some chemometrics regression tools, with discussion. Technometrics, 35, , pp. 109148[CrossRef].
Furey, T., Cristianini, N., Duffy, N., Bednarsky, D., Schummer, M., Haussler, D. (2000) Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16, 906914
Ghosh, D. (2002) Singular value decomposition regression models for classification of tumors from microarray experiments. Pac. Symp. Biocomput., 98, 1829.
Ghosh, D. (2003) Penalized discriminant methods for the classification of tumors from gene expression data. Biometrics, 59, 9921000[CrossRef][Web of Science][Medline].
Golub, T., Slonim, D., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J., Coller, H., Loh, M., Downing, J., Caligiuri, M., Bloomfield, C., Lander, E. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531537
Green, P. (1984) Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives. J.R. Statist.Soc. B, 46, 149192.
Helland, I. (1988) On the structure of partial least squares regression. Commun. Statist., Simulation Comput., 17, 581607.
Huang, X. and Pan, W. (2003) Linear regression and two-class classification with gene expression data. Bioinformatics, 19, 20722078
Kass, R. and Raftery, A. (1995) Bayes factor. J. Amer. Stat. Assoc., 90, 733795.
Le Cessie, S. and Van Houwelingen, J. (1992) Ridge estimators in logistic regression. J. R. Statist. Soc. C, 41, 191201.
Marx, B.D. (1996) Iteratively reweighted partial least squares estimation for generalized linear regression. Technometrics, 38, 374381[CrossRef].
Massy, W.F. (1965) Principal components regression in exploratory statistical research. J. Amer. Statist. Assoc., 60, 234246[CrossRef].
Naes, T. and Martens, H. (1985) Comparison of prediction methods for multicollinear data. Commun. Statist. Simulation Comput., 14, 545576.
Nguyen, D. and Rocke, D. (2002) Tumor classification by partial least squares using microarray gene expression data. Bioinformatics, 18, 3950
Santner, T. and Duffy, D. (1986) A note on A. Albert and J.A. Anderson's conditions for the existence of maximum likelihood estimates in logistic regression models. Biometrika, 73, 755758
Singh, D., Febbo, P., Ross, D., Jackson, G., Manola, J., Ladd, C., Tamayo, A., Renshaw, A., D'Amico, A.V., Richie, J., et al. (2002) Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1, 203209[CrossRef][Web of Science][Medline].
Timm, N. (2002) Applied Multivariate Analysis. , New York Springer-Verlag.
West, M., Blanchette, C., Dressman, H., Huang, E., Ishida, S., Spang, R., Zuzan, H., Olson, J., Marks, J., Nevins, J. (2001) Predicting the clinical status of human breast cancer by using gene expression profiles. Proc. Natl Acad. Sci., 98, 1146211467
Wold, H. (1975) Soft modelling by latent variables: the non-linear iterative partial least squares (NIPALS) approach. In Gani, J. (Ed.). Perspectives in Probability and Statistics, Papers in Honour of M. S. Bartlett, , London Academic Press, pp. 117142.
This article has been cited by other articles:
![]() |
J.G. Liao and K.-V. Chin Logistic regression for disease classification using microarray data: model selection in a large p and small n case Bioinformatics, August 1, 2007; 23(15): 1945 - 1951. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. Liu, J. M. Hughes-Oliver, and J. A. Menius Jr Domain-enhanced analysis of microarray data using GO annotations Bioinformatics, May 15, 2007; 23(10): 1225 - 1234. [Abstract] [Full Text] [PDF] |
||||
![]() |
A.-L. Boulesteix and K. Strimmer Partial least squares: a versatile tool for the analysis of high-dimensional genomic data Brief Bioinform, January 1, 2007; 8(1): 32 - 44. [Abstract] [Full Text] [PDF] |
||||
![]() |
Y. Tan, L. Shi, S. M. Hussain, J. Xu, W. Tong, J. M. Frazier, and C. Wang Integrating time-course microarray gene expression profiles with cytotoxicity for identification of biomarkers in primary rat hepatocytes exposed to cadmium Bioinformatics, January 1, 2006; 22(1): 77 - 87. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||










