Skip Navigation


Bioinformatics Advance Access originally published online on May 31, 2007
Bioinformatics 2007 23(15):1945-1951; doi:10.1093/bioinformatics/btm287
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/15/1945    most recent
btm287v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (1)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Liao, J.G.
Right arrow Articles by Chin, K.-V.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Liao, J.G.
Right arrow Articles by Chin, K.-V.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Logistic regression for disease classification using microarray data: model selection in a large p and small n case

J.G. Liao 1,* and Khew-Voon Chin 2

1Drexel University School of Public Health, Philadelphia, PA 19102 and 2The University of Toledo,Toledo, OH 43614, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 PARAMETRIC BOOTSTRAP MODEL...
 3 TWO EXAMPLES
 4 CONCLUSION AND DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: Logistic regression is a standard method for building prediction models for a binary outcome and has been extended for disease classification with microarray data by many authors. A feature (gene) selection step, however, must be added to penalized logistic modeling due to a large number of genes and a small number of subjects. Model selection for this two-step approach requires new statistical tools because prediction error estimation ignoring the feature selection step can be severely downward biased. Generic methods such as cross-validation and non-parametric bootstrap can be very ineffective due to the big variability in the prediction error estimate.

Results: We propose a parametric bootstrap model for more accurate estimation of the prediction error that is tailored to the microarray data by borrowing from the extensive research in identifying differentially expressed genes, especially the local false discovery rate. The proposed method provides guidance on the two critical issues in model selection: the number of genes to include in the model and the optimal shrinkage for the penalized logistic regression. We show that selecting more than 20 genes usually helps little in further reducing the prediction error. Application to Golub's leukemia data and our own cervical cancer data leads to highly accurate prediction models.

Availability: R library GeneLogit at http://geocities.com/jg_liao

Contact: jl544{at}drexel.edu


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 PARAMETRIC BOOTSTRAP MODEL...
 3 TWO EXAMPLES
 4 CONCLUSION AND DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
DNA microarray is a new technology that measures the expression levels of thousands of genes simultaneously and has emerged as an important tool in biomedical research. Golub et al. (1999) show that microarray gene expression can be used to classify between acute myeloid leukemia (AML) and acute lymphocytic leukemia. Since then, disease classification using microarray data has been the focus of intensive research with the aim of providing more accurate diagnostic tools than what the traditional pathological method alone can provide. Gene expression can also be used to predict survival time, disease prognostics and response to treatment, all with important clinical implications.

The mathematical and statistical methodologies for building such classification models, from the classical statistical methods to machine learning theory to classification trees, are reviewed and compared by Dudoit et al. (2002), Lee et al. (2005) and Li et al. (2004). This article considers the logistic regression approach, a standard method for binary classification that has been extended for use in microarray data in Eilers et al. (2001), Fort and Lambert-Lacroix (2005), Nguyen and Rocke (2002), Shen and Tan (2005), Zhou et al. (2004), Zhu and Hastie (2005). Let y be an array's binary disease status (1 for cancer and 0 for normal as a general example) and let x = (x1, ... , xp) be the expression vector, where xj is the expression level of the jth gene. A logistic prediction model for {pi} (x), the probability of y=1 given x, is constructed from a training dataset, which can then be applied to the gene expression of a new array to estimate its cancerous probability. Building a logistic prediction model using microarray data, however, is fundamentally different from standard logistic modeling because the number of genes (predictors) p can be thousands while the number of arrays (subjects) n is usually no more than 100. A popular approach is to combine a gene selection step with penalized likelihood inference (Eilers et al., 2001; Shen and Tan, 2005; and Zhu and Hastie, 2004). Step 1, called feature selection, selects a subset of genes to include in the logistic regression. For ease of exposition, we will focus on the method of selecting the q most univariately significant genes (Dudoit et al., 2002) and let Formula , be the expression of the jth selected gene. Let yi, i=1, ... , n, be the binary disease status of the ith array in the training dataset and let xi=(xi1, ... , xip) be its gene expression vector. Step 2 fits the logistic model


Formula 1

(1)
by maximizing the penalized log-likelihood


Formula 2

(2)
where {pi}i = {pi} (xi) as given by model (1), ||ß|| is the Euclidean length of ß = (ß1, ... , ßq), and {tau} isin (0, {infty}) is the shrinkage parameter that controls the degree of shrinkage of ß toward 0 (Cessie and Van Houwelingen, 1990; Van Houwelingen, 2001). There are two key unresolved model selection issues here. The first is how to choose the number of genes q in Step 1. A smaller q makes the prediction model (1) easier and less costly to use but may lead to a larger prediction error. The second is how to find the optimal shrinkage parameter {tau} for a chosen q. To address these issues, we need to be able to estimate the prediction error of a model building strategy. Ambroise and McLachlan (2002) and Simon et al. (2003) show that methods in earlier publications that ignored the feature selection step in the evaluation can severely underestimate the prediction error. To incorporate both Steps 1 and 2 for valid assessment, they use the generic cross-validation and non-parametric bootstrap. Recently, however, Braga-Nato and Dougherty (2004) demonstrated that the prediction error estimation using cross-validation can be too variable to be useful. Efron (2004), in a significant theoretical advance, shows that cross-validation and non-parametric bootstrap, while broadly applicable, pay a substantial price in terms of decreased estimating efficiency, which is especially acute for a large p and small n problem. The parametric bootstrap method, based on models tailored to the specific problem, can offer substantially better accuracy if the model is justified.

This article proposes and develops a parametric bootstrap method for more accurate and reliable estimation of the prediction error for the two-step procedure of building a logistic model, which provides the critically needed guidance on the choice of q and {tau}. For any given q, our method finds the optimal {tau} and the corresponding prediction error. We show that including q = 20 genes in the model is usually sufficient as additional genes help little in further reducing the prediction error. Application to Golub's leukemia data (Golub et al., 1999) and our own breast cancer data (Wong et al., 2003) leads to highly accurate prediction models. A carefully crafted R library, GeneLogit, is supplied on our web-site (http://geocities.com/jg_liao) that can be readily used for data analysis.


    2 PARAMETRIC BOOTSTRAP MODEL SELECTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 PARAMETRIC BOOTSTRAP MODEL...
 3 TWO EXAMPLES
 4 CONCLUSION AND DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
2.1 A parametric bootstrap model
The parametric bootstrap (Efron and Tibshirani, 1993), unlike cross-validation and non-parametric bootstrap, requires a more detailed model for the underlying process that generated data y = (y1, ... , yn). We now propose such a model in a hierarchical form.

Stage 1: given the gene expression vector xi, yi ~ Bernoulli ({pi}i) with


Formula 3

(3)
where b1, ... , bp are the regression coefficients to be further modeled in Stage 2 and b0 is the intercept. Note that Equation (3) is different from Equation (1) in that (3) is the assumed true model for {pi}i while (1) is the working model for building a prediction rule. In Section 2.1, {pi}1, ... , {pi}n are always as given in (3). Usually, only a small percentage of genes from (3) are needed in (1) because the expression levels of many genes are collinear.

Let n1 be the total number of cancer arrays and n0 = n n1 be the total number of normal arrays in the training sample. It can often be more appropriate to model y as drawn conditional on y1 +, ... , + yn = n1 because n1 and n0 are fixed by design (see Section 4 for more discussion). Let


Formula

where {pi}=({pi}1, ... , {pi}n). Let u = (u1, ... , un), where each ui is either 0 or 1, and let Formula . The conditional probability of y given y1 +, ..., + yn = n1 is then


Formula 4

(4)
which depends on b1, ... , bp but not on intercept b0 and is often the likelihood of choice in statistical inference when n1 is fixed by design or when the intercept b0 is considered a nuisance parameter (Agresti, 2002, McCullagh and Nelder, 1989; Chapter 6.7.1). We will therefore use this conditional distribution for the inference and simulation subsequently.

Stage 2: to develop a model for coefficients b1, ..., bp in (3), let M = {j : bj != 0} be the subset of genes with a non-zero regression coefficient. Rewrite (3) as


Formula 5

(5)
where sj = |bj| /bj is the sign of bj for j isin M. Note that Equation (3) is ill-posed in that infinitely many solutions of b0, ... , bp exist for any given {pi}. To put a reasonable structure on bj, we shall seek insight from a formulation of the gene expression vector x that leads to logistic regression (3). Assume that x ~ N1, V) for a tissue drawn from the cancer group and x ~ N0, V) for a tissue drawn from the normal group, where µ1 = (µ11, ... , µ1p) and µ0 = (µ01, ... , µ0p). This leads to logistic regression (3) with (b1, ... , bp)t = V –11 µ0) and intercept b0 that depends on the relative sampling probabilities from the normal and cancer groups (Hastie, et al., 2001). Assume further that the gene expression is independent across genes so that V = diag (v1, ... , vp). We then have bj = (µ1j – µ0j)/vj. Note that bj is a multivariate regression coefficient in (3) while µ1j – µ0j represents the marginal relationship between the jth gene's expression and the outcome y. Let Hj, j=1, ... , p, be the hypothesis that the jth's gene expression has no difference between the cancer and normal arrays, Formula be the alternative hypothesis that expression is stronger in the cancer arrays and Formula be the hypothesis that the expression is stronger in the normal arrays. It follows that Hj, Formula and Formula correspond to bj = 0, sj = 1 and sj = –1, respectively. We can now borrow from the extensive research on identifying differentially expressed genes based on the false discovery rate (FDR) framework (Benjamini and Hochberg, 1995). Let zj be the P-value from testing Hj that summarizes the strength of statistical evidence against Hj. Efron et al. (2001) and Efron and Tibshirani (2002) model zj, j = 1, ... , p, as generated from the mixture distribution


Formula

where {eta}0 is the proportion of genes that do not express differentially, f0 is the uniform density on [0, 1] for P-values under the null hypothesis and f1 is the density for P-values under the alternative. The local FDR, which quantifies the plausibility of individual Hj being true, is given by


Formula 6

(6)
and can be estimated using the method either in Liao et al. (2004) or in Scheid and Spang (2005). To model the subset M, we shall generate it as a random set so that each j has probability Formula of being in M independently for j = 1, ... , p. To determine a reasonable value for sj, for j isin M, or to choose between Formula and Formula after Hj is rejected, note that it is common practice to conclude that the direction of difference in the population is the same as what it is in the sample when a two-sided null hypothesis is rejected (Leventhal and Huynh, 1996). We will thus assign, for j isin M, sj = 1 or sj = – 1 depending on the direction of the gene's expression in the training dataset. Because of the adjustment for multiple comparisons, a gene is only usually included in M when the P-value zj is much smaller than the usual cut point 0.05. This way of modeling bj != 0 and assigning sj is based on the marginal relationship between the outcome y and the jth gene's expression level. We show that it is justified by assuming normal distribution for gene expression vector x and by assuming independent expression across genes within the normal group and within the cancer group. While this is somewhat a strong assumption, we think it provides a useful approximation to an otherwise intractable problem. Similar approach is adopted in Barbieri and Berger (2002) and Ishwaran and Rao (2005) where variables in a multivariate model are selected based on their individual performance instead of their joint posterior distribution in Bayesian analysis of large p and small n data. Further research is needed on the best ways to model bj for a group of highly correlated genes.

To complete the specification for bj, we shall model |bj|, j isin M, as independent random effects from normal distribution Formula truncated on (0, {infty}). The variance component {theta}0, to be estimated from the data, quantifies the size of |bj|. This follows the established statistical tradition of modeling parameters of similar characteristics as random effects (Laird and Ware, 1982) and also naturally motivates penalized likelihood (2) (Cessie and Van Houwelingen, 1990; Van Houwelingen, 2001).

It is easy to draw a bootstrap sample Formula from the proposed two-stage model. To do this, first get an estimate of the local FDR as Formula . Obtain an estimate of {theta}0 Formula by maximizing (7) as discussed subsequently. We can then draw y* as follows.

Step 1: generate M so that each j has probability of Formula of being in M. Assign, for j isin M, sj = 1 if the gene's expression is stronger in cancer arrays for the training dataset and sj = –1 otherwise. Draw |bj| from Formula truncated on (0, {infty}) for j isin M. Model (3) or (5) is now specified except the intercept b0.

Step 2: draw y* from the conditional logistic distribution (4) with {pi} given by model (5) as generated in Step 1. Note that both Steps 1 and 2 need to be performed for each bootstrap sample y** as Step 1 models the uncertainty in regression coefficients in (3) or (5).

Finally, to estimate the variance component {theta}0, let Formula be a bootstrap sample generated with |bj| in Step 1 drawn from the truncated N (0, {theta}2) instead. Let h ({theta}) be the probability of Formula , where the probability incorporates both Steps 1 and 2 in drawing Formula . Our estimate Formula maximizes


Formula 7

(7)
where the factor {theta}–1/3 is added to the likelihood function h ({theta}) to improve the stability of the maximization and can be seen as the kernel of a prior density on, say [10–6, 1], that mildly favors a smaller {theta}.

2.2 Estimation of the prediction error
With what is proposed in Section 2.1, the prediction error of a model building procedure can be easily evaluated using parametric bootstrap. For ease of exposition, we will denote the two-step procedure for building a logistic model in Section 1 by logistic (q, {tau}), where q is the number of the most significant genes included in working model (1) and {tau} is the shrinkage parameter in penalized likelihood (2). For any given pair (q, {tau}), the prediction error of procedure logistic (q, {tau}) can be estimated as follows:

Step 1: draw a bootstrap sample Formula as described in Section 2.1. Note that the intercept b0 in (5) was left unspecified there but its value is needed here. Given the generated M, sj and |bj|, we will choose b0 so that (5) satisfies {pi}1 +, ... ,+ {pi}n = n1, which is the maximum likelihood estimate of b0. The {pi}1, ... , {pi}n are now completely specified in (5).

Step 2: apply the two-step procedure logistic (q, {tau}) to bootstrap sample y*. To be more specific, first test the gene expression difference between arrays with Formula and arrays with Formula for each of the p genes and select the q most significant genes. Then fit logistic regression (1) with y in penalized likelihood (2) replaced by y*. Let Formula be the estimated {pi}1, ... , {pi}n from the resulting model.

Step 3: compute the expected Brier score


Formula 8

(8)
which is smaller if the estimate Formula is closer to the true {pi}. To derive (8), let Formula be n independent Bernoulli trials with each y'i having success probability {pi}i. The Brier score (Brier, 1950) of using Formula to predict y' is


Formula

Formula (8) is the expected Brier score over y'.

Repeat Steps 1–3 here a large number of times (10 000 times for the two examples below) and the average of (8) over these replications serves as an estimate of the prediction error of procedure logistic (q, {tau}).

As mentioned earlier, the closer the bootstrap estimate Formula is to the underlying true {pi}, the smaller the expected Brier score (8) is. The true {pi} from Step 1, however, is a random quantity determined by the generated values of M, sj and |bj|, which models our uncertainty about {pi}. The prediction error estimate of procedure logistic (q, {tau}) averages (8) over {pi}. Our parametric bootstrap method can therefore be alternatively motivated from the perspective of Bayesian model averaging (Hoeting et al., 1999).

2.3 Software implementation
Our proposed method is implemented in software library GeneLogit that runs on the open source R environment (R Development Core Team, 2006) and is available at http://geocities.com/jg_liao For our implementation, the local FDR estimator in Liao et al. (2004) is used. The conditional probability (4) is generally computationally onerous because the probability of y1 +, ... ,+ yn = n1 in the denominator may involve the sum of a large number of terms. To simplify the computation, normal approximation {Phi}(n1 + 0.5)–{Phi}(n1 – 0.5) is used, where {Phi} is the cumulative distribution of normal distribution with mean {pi}1+, ...,+{pi}n and variance {pi}1(1–{pi}1)+, ... ,+{pi}n(1–{pi}n). The free parameter b0 in (3) is chosen so that {pi}1 +, ... ,+ {pi}n=n1 for more accurate approximation, which works well when n>30 and n1 is not too close to 0 or n. The likelihood h({theta}) in (7) is computed by averaging conditional probability (4) over 5000 simulated b1, ... , bp, where b1, ... , bp are generated as in Step 1 in Section 2.1 except |bj| are drawn from N (0,{theta}2) truncated on (0, {infty}). The resulting library GeneLogit is easy to use but still computationally intensive (~20 hours of total computing time on today's dual core CPU from Intel or AMD for either the leukemia or the cervical cancer dataset subsequently). As input, it requires the disease outcome y as a vector and the gene expression as an n x p matrix. No missing values are allowed. The gene expression matrix needs to be standardized so that each column (the expression of a gene across all subjects) has variance of 1. There is no need, however, to center the gene expression to 0 as it does not impact on the regression coefficients b1, ... , bp. The program seems numerically stable as same result is obtained from many different runs. We now illustrate the use of the library in two datasets.


    3 TWO EXAMPLES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 PARAMETRIC BOOTSTRAP MODEL...
 3 TWO EXAMPLES
 4 CONCLUSION AND DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
3.1 Golub leukemia data: classification between ALL and AML
Golub et al. (1999) use gene expression to classify between acute lymphoblastic leukemia (ALL) and AML. Their data have since been analyzed by many authors using different classification methodologies. The training dataset consists of 27 ALL and 11 AML subjects and the test dataset consists of 20 ALL and 14 AML subjects. The expression of 7129 genes were originally measured. We applied the pre-processing procedures in Dudoit et al. (2002) to filter out genes that do not exhibit significant variation across the samples, followed by thresholding and the logarithmic transformation. A total of 3051 genes remain after the pre-processing. Our proposed method is applied to the training dataset (n = 38 and p =3051) in the following steps using GeneLogit library:

  1. Run function localFDR to estimate fdrj and assign sj.
  2. Run function model.estimation to estimate {theta}0 by maximizing (7), which gives Formula .
  3. Use function bootstrap.prediction to find the optimal {tau} and the corresponding prediction error for different values of q. The result for q = 1, 10, 20, 50 and 100 is given in Table 1. For example, the optimal {tau} for q = 1 is found to be 4.542 with a corresponding prediction error estimated to be 0.0191 and the optimal {tau} for q = 20 is 0.628 with a corresponding prediction error of 0.0152. A larger q leads to a smaller optimal {tau} (more shrinkage) and smaller prediction error. Increasing q beyond 20, however, does not help much in further reducing the prediction error. Note that the estimated prediction error is the average over 10 000 expected Brier scores defined in Equation (8) from 10 000 bootstrapped models. For q = 20 and {tau} = 0.628, the 5th and 95th percentiles of these expected Brier score are 0.0116 and 0.0381, respectively.
  4. Based on the result in Table 1, we choose procedure logistic (q = 20, {tau} = 0.628) to build our prediction model by running function pena.logit. The 20 selected genes and the penalized logistic regression coefficients are given in Table 2. Note that the regression coefficients are all close to 0 due to the shrinkage effect.


View this table:
[in this window]
[in a new window]

 
Table 1. The optimal {tau} and the prediction error for different q, leukemia data

 

View this table:
[in this window]
[in a new window]

 
Table 2. Logistic prediction model using procedure logistic(q = 20, {tau} = 0.629), leukemia data

 
To assess the prediction capacity of the model in Table 2 for independent samples, we apply it to Golub's test dataset of 20 ALL and 14 AML subjects. For each of the 34 subjects in the test dataset, we compute Formula , the estimated probability of the ith subject being ALL, by applying the prediction model to the array's expression vector. Note, however, the number of cases over the number of normal subjects is 27/11 for the training dataset and 20/14 for the test dataset. The estimated intercept –0.278 needs to be adjusted to –0.278 – log(27/11) + log(20/14) to account for the different sampling ratios (McCullagh and Nelder, 1989, Chapter 4.3.3). The Formula and the true disease status yi (1 for ALL and 0 for AML) are given in Table 3 for i =1, ... , 34. We see that Formula for every subject with yi =1 and Formula for every subject with yi = 0 except one with Formula . The Brier Score for predicting y1, ... , y34 Formula As comparison, Nguyen and Rocke (2002) report 1–3 classification errors; Lee et al. (2003) and Zhou et al. (2004) report one classification error and Yeung et al. (2005) have two classification errors. The result reported here can be reproduced by running the R code included in our GeneLogit library.


View this table:
[in this window]
[in a new window]

 
Table 3. The true disease status yi and the estimated {pi}i for the 34 arrays in Golub's test dataset

 
3.2 Classification between cervical cancer and normal tissues
Wong et al. (2003) study the gene expression of 26 cervical cancer tissues and 9 normal cervical tissues. The data is also analyzed in Liao et al. (2004) in the context of estimating the local FDR. We now use the data to build a logistic prediction model for classifying between cervical cancer and normal tissues. We use the 3670 genes, among 10 692 assessed, that have complete data for all the 35 subjects. Applying our GeneLogit library using the same steps as for the Golub's data, we obtain Formula . The optimal {tau} and the corresponding prediction error for q = 1, 10, 20, 50 and 100 are given in Table 4. Again, the optimal {tau} decreases as q increases and choosing q beyond 20 does not help much in further reducing the prediction error. We thus choose procedure logistic (q = 20, {tau} = 0.384) to build our prediction model with result given in Table 5. For this dataset, however, there is no separate test dataset to reliably assess the prediction error for independent samples. To overcome this problem, we shall borrow idea from cross-validation. For i = 1, ... , 35, we apply procedure logistic (q = 20, {tau} = 0.384) to the cervical dataset with data from the ith subject removed. The resulted logistic model is then applied to the ith tissue's gene expression to compute the estimate Formula . The result is given in Table 6. We see that all the cervical cancer tissues (yi = 1) have Formula and all the normal tissues (yi = 0) have Formula . The Brier score for predicting y1, ... , y35 using Formula , is 0.023. Note that yi does not contribute to the value of Formula . The high agreement between yi and Formula in Table 6 for all 35 subjects indicates that the procedure logistic (q = 20, {tau} = 0.338) is a good choice.


View this table:
[in this window]
[in a new window]

 
Table 4. The optimal {tau} and the prediction error for different q, cervical cancer

 

View this table:
[in this window]
[in a new window]

 
Table 5. Logistic prediction model using procedure logistic (q=20, {tau} =0.384), cervical cancer

 

View this table:
[in this window]
[in a new window]

 
Table 6. The true disease status yi and the estimated {pi}i for the 38 arrays in the cervical cancer dataset

 

    4 CONCLUSION AND DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 PARAMETRIC BOOTSTRAP MODEL...
 3 TWO EXAMPLES
 4 CONCLUSION AND DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Building a logistic prediction model using microarray data poses considerable technical challenge because of the larger p and small n. Infinitely number of solutions of b0, b1, ..., bp exist for any given {pi}1, ..., {pi}n for logistic model (3). Consequently, any observed data y1, ..., yn is subject to infinitely many interpretations. Additional structure on b0, b1, ... , bp is required for effective data analysis. We propose such a structure in Section 2.1 by writing (3) in the form of (5) and by borrowing from the extensive research on FDR. In our model, the probability of bj != 0 or j isin M is taken to be Formula , obtained from the direction of the jth gene expression in the training dataset and |bj| modeled as random effects. The prediction error of procedure logistic (q, {tau}) can then be estimated using parametric bootstrap to guide the choice of q and {tau}. Application of the proposed method to the leukemia and cervical cancer datasets results in excellent prediction models. Biomedical researchers, we interacted with, found the method intuitive and easy to understand.

We now briefly discuss a few issues. First, the number of cancer arrays n1 and the number of normal arrays n0 in a microarray study are often chosen to be of similar size to increase statistical efficiency even though the underlying normal population can be much larger than the cancer population. This is similar to the case-control design in epidemiology in which individuals in the disease population are often sampled for inclusion in study in greater proportion than subjects in the control population. For such design, the intercept ß0 in (1) depends on the ratio of the sampling proportions but ß1, ... , ßq do not (Agresti, 2002; McCullagh and Nelder, 1989, Chapter 6.7.1). In applying the logistic prediction model to new arrays, it is prudent to find out if the same ratio of sampling proportions is used as in the training dataset. The relatively cancerous risk, however, can be determined from only ß1, ... , ßq. Second, we have focused on the model-building procedure logistic(q, {tau}) in which the q univariately most significantly genes are included in the logistic model. Other feature selection methods can also be used. One may consider, e.g. to include the q jointly most significant genes. Our proposed bootstrap method can be used in the same way in evaluating its prediction error. Computationally, however, it is much more intensive to find the q jointly most significant genes. Third Yeung, et al. (2005) use Bayesian model averaging for building microarray classification models. As discussed in Section 2.2, our proposed bootstrap method can also be motivated from the same perspective. But Yeung et al. method is, in our opinion, a somewhat ad hoc adaptation of standard Bayesian model averaging to the large p and small n microarray data while we have developed a coherent model specifically tailored to such data. Indeed, they report more misclassified arrays for Golub's leukemia data.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 PARAMETRIC BOOTSTRAP MODEL...
 3 TWO EXAMPLES
 4 CONCLUSION AND DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Part of the first author's work was conducted when he was Bio-statistician in the Cancer Center of New Jersey. The research was supported by NCI grant 2P30 CA 72720-04. Dr Yong Lin helped with computation. The authors thank the two reviewers whose comments led to substantial improvement of the manuscript.

Confilict of Interest : none declared.


    FOOTNOTES
 
Associate Editor: Trey Ideker

Received on September 10, 2006; revised on May 14, 2007; accepted on May 21, 2007

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 PARAMETRIC BOOTSTRAP MODEL...
 3 TWO EXAMPLES
 4 CONCLUSION AND DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Ambroise C, McLachlan GJ. Selection bias in gene extraction on the basis of microarray gene-expression data. Proc. Natl Acad. Sci. (2002) 99:6562–6566.[Abstract/Free Full Text]

    Agresti A. Categorical Data Analysis (2002) 2nd. New York: Wiley.

    Barbieri MM, Berger JO. Optimal predictive model selection. Ann. Stat. (2004) 32:870–897.[CrossRef]

    Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. (1995) B57:289–300.

    Braga-Nato UM, Dougherty ER. Is cross-validation valid for small sample microarray classification? Bioinformatics (2004) 20:374–380.[Abstract/Free Full Text]

    Brier GW. Verification of forecasts expressed in terms of probability. Mon. Weather Rev. (1950) 78:1–3.[CrossRef]

    Cessie SL, Van Houwelingen HC. Ridge estimators in logistic regression. Appl. Stat. (1990) 41:191–201.[CrossRef]

    Dudoit S, et al. Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Stat. Assoc. (2002) 97:77–87.[CrossRef][Web of Science]

    Golub T, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science (1999) 286:531–536.[Abstract/Free Full Text]

    Efron B. The estimation of prediction error: covariance penalties and cross-Validation. J. Am. Stati. Associ. (2004) 99:619–632.[CrossRef]

    Efron B, Tibshirani RJ. An Introduction to the Bootstrap (1993) London: Chapmann and Hall/CRC.

    Efron B, Tibshirani R. Empirical Bayes methods and false discovery rates for microarrays. Genet. Epidemiol. (2002) 23:70–86.[CrossRef][Web of Science][Medline]

    Efron B, et al. Empirical Bayes analysis of microarray experiment. J. Am. Stati. Associ. (2001) 96:1151–1160.[CrossRef]

    Eilers PH, et al. Classification of microarray data with penalized logistic regression. In Proceedings of SPIE Volume 4266: Progress in Biomedical optics and Imaging (2001) Vol 2:187–198.

    Fort G, Lambert-Lacroix S. Classification using partial least squares with penalized logistic regression. Bioinformatics (2005) 21:1104–1111.[Abstract/Free Full Text]

    Hastie T, et al. The Elements of Statistical Learning (2001) Berlin: Springer.

    Hoeting J, et al. Bayesian model Averaging. Stat. Sci. (1999) 14:382–401.[CrossRef][Web of Science]

    Laird NM, Ware JH. Random-effects models for longitudinal data. Biometrics (1982) 38:963–974.[CrossRef][Web of Science][Medline]

    Lee KE, et al. Gene selection: a Bayesian variable selection approach. Bioinformatics (2003) 19:90–97.[Abstract/Free Full Text]

    Lee JW, et al. An extensive comparison of recent classification tools applied to microarray data. Comput. Stat. & Data Anal. (2005) 48:869–885.[CrossRef]

    Leventhal L, Huynh C. Directional decisions for two-tailed tests: power, error rates, and sample size. Psychol. Methods (1996) 1:278–292.[CrossRef][Web of Science]

    Li T, et al. A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics (2004) 20:2429–2437.[Abstract/Free Full Text]

    Liao JG, et al. A mixture model for estimating the local false discovery rate in DNA microarray analysis. Bioinformatics (2004) 20:2694–2701.[Abstract/Free Full Text]

    McCullagh P, Nelder JA. Generalized Linear Models (1989) New York: Chapman and Hall.

    Nguyen D, Rocke D. Tumor classification by partial least squares using microarray gene expression data. Bioinformatics (2002) 18:39–50.[Abstract/Free Full Text]

    R Development Core Team. R. fondtation for satistical computing. In: R: a Language and Environment for Statistical Computing (2006) Austria: Vienna. ISBN 3-900051-00-3 http://www.R-project.org.

    Scheid S, Spang R. Twilight; a Bioconductor package for estimating the local false discovery rate. Bioinformatics (2005) 21:2921–2922.[Abstract/Free Full Text]

    Shen L, Tan EC. Dimension reduction-based penalized logistic regression for cancer classification using microarray data. IEEE/ACM Trans. Compu. Biol. Bioinform. (2005) 2:166–175.[CrossRef]

    Simon R, et al. Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J. Nat. Cancer Inst (2003) 95:14–18.[Free Full Text]

    Van Houwelingen JC. Shrinkage and penalized likelihood as methods to improve predictive accuracy. Statistica Neerlandica (2001) 55:17–34.[CrossRef][Web of Science]

    Wong YF, et al. Expression genomics of cervical cancer: molecular classification and prediction of radio-therapy response by DNA microarray. Clini. Cancer Res. (2003) 9:5486–5492.[Abstract/Free Full Text]

    Yeung KY, et al. Bayesian model averaging: development of an improved multi-calss, gene selection and classification tool for microarray data. Bioinformatics (2005) 21:2394–2402.[Abstract/Free Full Text]

    Zhu J, Hastie T. Classification of gene microarrays by penalized logistic regression. Biostatistics (2004) 5:427–443.[Abstract]

    Zhou X, et al. Cancer classification and prediction using logistic regression with Bayesian gene selection. J. Biomed. Inform. (2004) 37:249–259.[CrossRef][Web of Science][Medline]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/15/1945    most recent
btm287v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (1)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Liao, J.G.
Right arrow Articles by Chin, K.-V.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Liao, J.G.
Right arrow Articles by Chin, K.-V.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?