Bioinformatics Advance Access originally published online on May 31, 2007
Bioinformatics 2007 23(15):1945-1951; doi:10.1093/bioinformatics/btm287
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Logistic regression for disease classification using microarray data: model selection in a large p and small n case
1Drexel University School of Public Health, Philadelphia, PA 19102 and 2The University of Toledo,Toledo, OH 43614, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Logistic regression is a standard method for building prediction models for a binary outcome and has been extended for disease classification with microarray data by many authors. A feature (gene) selection step, however, must be added to penalized logistic modeling due to a large number of genes and a small number of subjects. Model selection for this two-step approach requires new statistical tools because prediction error estimation ignoring the feature selection step can be severely downward biased. Generic methods such as cross-validation and non-parametric bootstrap can be very ineffective due to the big variability in the prediction error estimate.
Results: We propose a parametric bootstrap model for more accurate estimation of the prediction error that is tailored to the microarray data by borrowing from the extensive research in identifying differentially expressed genes, especially the local false discovery rate. The proposed method provides guidance on the two critical issues in model selection: the number of genes to include in the model and the optimal shrinkage for the penalized logistic regression. We show that selecting more than 20 genes usually helps little in further reducing the prediction error. Application to Golub's leukemia data and our own cervical cancer data leads to highly accurate prediction models.
Availability: R library GeneLogit at http://geocities.com/jg_liao
Contact: jl544{at}drexel.edu
| 1 INTRODUCTION |
|---|
|
|
|---|
DNA microarray is a new technology that measures the expression levels of thousands of genes simultaneously and has emerged as an important tool in biomedical research. Golub et al. (1999) show that microarray gene expression can be used to classify between acute myeloid leukemia (AML) and acute lymphocytic leukemia. Since then, disease classification using microarray data has been the focus of intensive research with the aim of providing more accurate diagnostic tools than what the traditional pathological method alone can provide. Gene expression can also be used to predict survival time, disease prognostics and response to treatment, all with important clinical implications.
The mathematical and statistical methodologies for building such classification models, from the classical statistical methods to machine learning theory to classification trees, are reviewed and compared by Dudoit et al. (2002), Lee et al. (2005) and Li et al. (2004). This article considers the logistic regression approach, a standard method for binary classification that has been extended for use in microarray data in Eilers et al. (2001), Fort and Lambert-Lacroix (2005), Nguyen and Rocke (2002), Shen and Tan (2005), Zhou et al. (2004), Zhu and Hastie (2005). Let y be an array's binary disease status (1 for cancer and 0 for normal as a general example) and let x = (x1, ... , xp) be the expression vector, where xj is the expression level of the jth gene. A logistic prediction model for
(x), the probability of y=1 given x, is constructed from a training dataset, which can then be applied to the gene expression of a new array to estimate its cancerous probability. Building a logistic prediction model using microarray data, however, is fundamentally different from standard logistic modeling because the number of genes (predictors) p can be thousands while the number of arrays (subjects) n is usually no more than 100. A popular approach is to combine a gene selection step with penalized likelihood inference (Eilers et al., 2001; Shen and Tan, 2005; and Zhu and Hastie, 2004). Step 1, called feature selection, selects a subset of genes to include in the logistic regression. For ease of exposition, we will focus on the method of selecting the q most univariately significant genes (Dudoit et al., 2002) and let
, be the expression of the jth selected gene. Let yi, i=1, ... , n, be the binary disease status of the ith array in the training dataset and let xi=(xi1, ... , xip) be its gene expression vector. Step 2 fits the logistic model
|
| (1) |
|
| (2) |
i =
(xi) as given by model (1), ||ß|| is the Euclidean length of ß = (ß1, ... , ßq), and
(0,
) is the shrinkage parameter that controls the degree of shrinkage of ß toward 0 (Cessie and Van Houwelingen, 1990; Van Houwelingen, 2001). There are two key unresolved model selection issues here. The first is how to choose the number of genes q in Step 1. A smaller q makes the prediction model (1) easier and less costly to use but may lead to a larger prediction error. The second is how to find the optimal shrinkage parameter
for a chosen q. To address these issues, we need to be able to estimate the prediction error of a model building strategy. Ambroise and McLachlan (2002) and Simon et al. (2003) show that methods in earlier publications that ignored the feature selection step in the evaluation can severely underestimate the prediction error. To incorporate both Steps 1 and 2 for valid assessment, they use the generic cross-validation and non-parametric bootstrap. Recently, however, Braga-Nato and Dougherty (2004) demonstrated that the prediction error estimation using cross-validation can be too variable to be useful. Efron (2004), in a significant theoretical advance, shows that cross-validation and non-parametric bootstrap, while broadly applicable, pay a substantial price in terms of decreased estimating efficiency, which is especially acute for a large p and small n problem. The parametric bootstrap method, based on models tailored to the specific problem, can offer substantially better accuracy if the model is justified.
This article proposes and develops a parametric bootstrap method for more accurate and reliable estimation of the prediction error for the two-step procedure of building a logistic model, which provides the critically needed guidance on the choice of q and
. For any given q, our method finds the optimal
and the corresponding prediction error. We show that including q = 20 genes in the model is usually sufficient as additional genes help little in further reducing the prediction error. Application to Golub's leukemia data (Golub et al., 1999) and our own breast cancer data (Wong et al., 2003) leads to highly accurate prediction models. A carefully crafted R library, GeneLogit, is supplied on our web-site (http://geocities.com/jg_liao) that can be readily used for data analysis.
| 2 PARAMETRIC BOOTSTRAP MODEL SELECTION |
|---|
|
|
|---|
2.1 A parametric bootstrap model
The parametric bootstrap (Efron and Tibshirani, 1993), unlike cross-validation and non-parametric bootstrap, requires a more detailed model for the underlying process that generated data y = (y1, ... , yn). We now propose such a model in a hierarchical form.
Stage 1: given the gene expression vector xi, yi
Bernoulli (
i) with
|
| (3) |
i while (1) is the working model for building a prediction rule. In Section 2.1,
1, ... ,
n are always as given in (3). Usually, only a small percentage of genes from (3) are needed in (1) because the expression levels of many genes are collinear.
Let n1 be the total number of cancer arrays and n0 = n – n1 be the total number of normal arrays in the training sample. It can often be more appropriate to model y as drawn conditional on y1 +, ... , + yn = n1 because n1 and n0 are fixed by design (see Section 4 for more discussion). Let
|
|
=(
1, ... ,
n). Let u = (u1, ... , un), where each ui is either 0 or 1, and let |
| (4) |
Stage 2: to develop a model for coefficients b1, ..., bp in (3), let M = {j : bj
0} be the subset of genes with a non-zero regression coefficient. Rewrite (3) as
|
| (5) |
M. Note that Equation (3) is ill-posed in that infinitely many solutions of b0, ... , bp exist for any given
. To put a reasonable structure on bj, we shall seek insight from a formulation of the gene expression vector x that leads to logistic regression (3). Assume that x
N (µ1, V) for a tissue drawn from the cancer group and x
N (µ0, V) for a tissue drawn from the normal group, where µ1 = (µ11, ... , µ1p) and µ0 = (µ01, ... , µ0p). This leads to logistic regression (3) with (b1, ... , bp)t = V –1 (µ1 – µ0) and intercept b0 that depends on the relative sampling probabilities from the normal and cancer groups (Hastie, et al., 2001). Assume further that the gene expression is independent across genes so that V = diag (v1, ... , vp). We then have bj = (µ1j – µ0j)/vj. Note that bj is a multivariate regression coefficient in (3) while µ1j – µ0j represents the marginal relationship between the jth gene's expression and the outcome y. Let Hj, j=1, ... , p, be the hypothesis that the jth's gene expression has no difference between the cancer and normal arrays, |
|
0 is the proportion of genes that do not express differentially, f0 is the uniform density on [0, 1] for P-values under the null hypothesis and f1 is the density for P-values under the alternative. The local FDR, which quantifies the plausibility of individual Hj being true, is given by
|
| (6) |
M, or to choose between
M, sj = 1 or sj = – 1 depending on the direction of the gene's expression in the training dataset. Because of the adjustment for multiple comparisons, a gene is only usually included in M when the P-value zj is much smaller than the usual cut point 0.05. This way of modeling bj
0 and assigning sj is based on the marginal relationship between the outcome y and the jth gene's expression level. We show that it is justified by assuming normal distribution for gene expression vector x and by assuming independent expression across genes within the normal group and within the cancer group. While this is somewhat a strong assumption, we think it provides a useful approximation to an otherwise intractable problem. Similar approach is adopted in Barbieri and Berger (2002) and Ishwaran and Rao (2005) where variables in a multivariate model are selected based on their individual performance instead of their joint posterior distribution in Bayesian analysis of large p and small n data. Further research is needed on the best ways to model bj for a group of highly correlated genes.
To complete the specification for bj, we shall model |bj|, j
M, as independent random effects from normal distribution
truncated on (0,
). The variance component
0, to be estimated from the data, quantifies the size of |bj|. This follows the established statistical tradition of modeling parameters of similar characteristics as random effects (Laird and Ware, 1982) and also naturally motivates penalized likelihood (2) (Cessie and Van Houwelingen, 1990; Van Houwelingen, 2001).
It is easy to draw a bootstrap sample
from the proposed two-stage model. To do this, first get an estimate of the local FDR as
. Obtain an estimate of
0
by maximizing (7) as discussed subsequently. We can then draw y* as follows.
Step 1: generate M so that each j has probability of
of being in M. Assign, for j
M, sj = 1 if the gene's expression is stronger in cancer arrays for the training dataset and sj = –1 otherwise. Draw |bj| from
truncated on (0,
) for j
M. Model (3) or (5) is now specified except the intercept b0.
Step 2: draw y* from the conditional logistic distribution (4) with
given by model (5) as generated in Step 1. Note that both Steps 1 and 2 need to be performed for each bootstrap sample y** as Step 1 models the uncertainty in regression coefficients in (3) or (5).
Finally, to estimate the variance component
0, let
be a bootstrap sample generated with |bj| in Step 1 drawn from the truncated N (0,
2) instead. Let h (
) be the probability of
, where the probability incorporates both Steps 1 and 2 in drawing
. Our estimate
maximizes
|
| (7) |
–1/3 is added to the likelihood function h (
) to improve the stability of the maximization and can be seen as the kernel of a prior density on, say [10–6, 1], that mildly favors a smaller
.
2.2 Estimation of the prediction error
With what is proposed in Section 2.1, the prediction error of a model building procedure can be easily evaluated using parametric bootstrap. For ease of exposition, we will denote the two-step procedure for building a logistic model in Section 1 by logistic (q,
), where q is the number of the most significant genes included in working model (1) and
is the shrinkage parameter in penalized likelihood (2). For any given pair (q,
), the prediction error of procedure logistic (q,
) can be estimated as follows:
Step 1: draw a bootstrap sample
as described in Section 2.1. Note that the intercept b0 in (5) was left unspecified there but its value is needed here. Given the generated M, sj and |bj|, we will choose b0 so that (5) satisfies
1 +, ... ,+
n = n1, which is the maximum likelihood estimate of b0. The
1, ... ,
n are now completely specified in (5).
Step 2: apply the two-step procedure logistic (q,
) to bootstrap sample y*. To be more specific, first test the gene expression difference between arrays with
and arrays with
for each of the p genes and select the q most significant genes. Then fit logistic regression (1) with y in penalized likelihood (2) replaced by y*. Let
be the estimated
1, ... ,
n from the resulting model.
Step 3: compute the expected Brier score
|
| (8) |
. To derive (8), let
i. The Brier score (Brier, 1950) of using |
|
Repeat Steps 1–3 here a large number of times (10 000 times for the two examples below) and the average of (8) over these replications serves as an estimate of the prediction error of procedure logistic (q,
).
As mentioned earlier, the closer the bootstrap estimate
is to the underlying true
, the smaller the expected Brier score (8) is. The true
from Step 1, however, is a random quantity determined by the generated values of M, sj and |bj|, which models our uncertainty about
. The prediction error estimate of procedure logistic (q,
) averages (8) over
. Our parametric bootstrap method can therefore be alternatively motivated from the perspective of Bayesian model averaging (Hoeting et al., 1999).
2.3 Software implementation
Our proposed method is implemented in software library GeneLogit that runs on the open source R environment (R Development Core Team, 2006) and is available at http://geocities.com/jg_liao For our implementation, the local FDR estimator in Liao et al. (2004) is used. The conditional probability (4) is generally computationally onerous because the probability of y1 +, ... ,+ yn = n1 in the denominator may involve the sum of a large number of terms. To simplify the computation, normal approximation
(n1 + 0.5)–
(n1 – 0.5) is used, where
is the cumulative distribution of normal distribution with mean
1+, ...,+
n and variance
1(1–
1)+, ... ,+
n(1–
n). The free parameter b0 in (3) is chosen so that
1 +, ... ,+
n=n1 for more accurate approximation, which works well when n>30 and n1 is not too close to 0 or n. The likelihood h(
) in (7) is computed by averaging conditional probability (4) over 5000 simulated b1, ... , bp, where b1, ... , bp are generated as in Step 1 in Section 2.1 except |bj| are drawn from N (0,
2) truncated on (0,
). The resulting library GeneLogit is easy to use but still computationally intensive (
20 hours of total computing time on today's dual core CPU from Intel or AMD for either the leukemia or the cervical cancer dataset subsequently). As input, it requires the disease outcome y as a vector and the gene expression as an n x p matrix. No missing values are allowed. The gene expression matrix needs to be standardized so that each column (the expression of a gene across all subjects) has variance of 1. There is no need, however, to center the gene expression to 0 as it does not impact on the regression coefficients b1, ... , bp. The program seems numerically stable as same result is obtained from many different runs. We now illustrate the use of the library in two datasets.
| 3 TWO EXAMPLES |
|---|
|
|
|---|
3.1 Golub leukemia data: classification between ALL and AML
Golub et al. (1999) use gene expression to classify between acute lymphoblastic leukemia (ALL) and AML. Their data have since been analyzed by many authors using different classification methodologies. The training dataset consists of 27 ALL and 11 AML subjects and the test dataset consists of 20 ALL and 14 AML subjects. The expression of 7129 genes were originally measured. We applied the pre-processing procedures in Dudoit et al. (2002) to filter out genes that do not exhibit significant variation across the samples, followed by thresholding and the logarithmic transformation. A total of 3051 genes remain after the pre-processing. Our proposed method is applied to the training dataset (n = 38 and p =3051) in the following steps using GeneLogit library:
- Run function localFDR to estimate fdrj and assign sj.
- Run function model.estimation to estimate
0 by maximizing (7), which gives
.
- Use function bootstrap.prediction to find the optimal
and the corresponding prediction error for different values of q. The result for q = 1, 10, 20, 50 and 100 is given in Table 1. For example, the optimal
for q = 1 is found to be 4.542 with a corresponding prediction error estimated to be 0.0191 and the optimal
for q = 20 is 0.628 with a corresponding prediction error of 0.0152. A larger q leads to a smaller optimal
(more shrinkage) and smaller prediction error. Increasing q beyond 20, however, does not help much in further reducing the prediction error. Note that the estimated prediction error is the average over 10 000 expected Brier scores defined in Equation (8) from 10 000 bootstrapped models. For q = 20 and
= 0.628, the 5th and 95th percentiles of these expected Brier score are 0.0116 and 0.0381, respectively.
- Based on the result in Table 1, we choose procedure logistic (q = 20,
= 0.628) to build our prediction model by running function pena.logit. The 20 selected genes and the penalized logistic regression coefficients are given in Table 2. Note that the regression coefficients are all close to 0 due to the shrinkage effect.
|
|
To assess the prediction capacity of the model in Table 2 for independent samples, we apply it to Golub's test dataset of 20 ALL and 14 AML subjects. For each of the 34 subjects in the test dataset, we compute
|
3.2 Classification between cervical cancer and normal tissues
Wong et al. (2003) study the gene expression of 26 cervical cancer tissues and 9 normal cervical tissues. The data is also analyzed in Liao et al. (2004) in the context of estimating the local FDR. We now use the data to build a logistic prediction model for classifying between cervical cancer and normal tissues. We use the 3670 genes, among 10 692 assessed, that have complete data for all the 35 subjects. Applying our GeneLogit library using the same steps as for the Golub's data, we obtain
and the corresponding prediction error for q = 1, 10, 20, 50 and 100 are given in Table 4. Again, the optimal
decreases as q increases and choosing q beyond 20 does not help much in further reducing the prediction error. We thus choose procedure logistic (q = 20,
= 0.384) to build our prediction model with result given in Table 5. For this dataset, however, there is no separate test dataset to reliably assess the prediction error for independent samples. To overcome this problem, we shall borrow idea from cross-validation. For i = 1, ... , 35, we apply procedure logistic (q = 20,
= 0.384) to the cervical dataset with data from the ith subject removed. The resulted logistic model is then applied to the ith tissue's gene expression to compute the estimate
= 0.338) is a good choice.
|
|
|
| 4 CONCLUSION AND DISCUSSION |
|---|
|
|
|---|
Building a logistic prediction model using microarray data poses considerable technical challenge because of the larger p and small n. Infinitely number of solutions of b0, b1, ..., bp exist for any given
1, ...,
n for logistic model (3). Consequently, any observed data y1, ..., yn is subject to infinitely many interpretations. Additional structure on b0, b1, ... , bp is required for effective data analysis. We propose such a structure in Section 2.1 by writing (3) in the form of (5) and by borrowing from the extensive research on FDR. In our model, the probability of bj
0 or j
M is taken to be
) can then be estimated using parametric bootstrap to guide the choice of q and
. Application of the proposed method to the leukemia and cervical cancer datasets results in excellent prediction models. Biomedical researchers, we interacted with, found the method intuitive and easy to understand.
We now briefly discuss a few issues. First, the number of cancer arrays n1 and the number of normal arrays n0 in a microarray study are often chosen to be of similar size to increase statistical efficiency even though the underlying normal population can be much larger than the cancer population. This is similar to the case-control design in epidemiology in which individuals in the disease population are often sampled for inclusion in study in greater proportion than subjects in the control population. For such design, the intercept ß0 in (1) depends on the ratio of the sampling proportions but ß1, ... , ßq do not (Agresti, 2002; McCullagh and Nelder, 1989, Chapter 6.7.1). In applying the logistic prediction model to new arrays, it is prudent to find out if the same ratio of sampling proportions is used as in the training dataset. The relatively cancerous risk, however, can be determined from only ß1, ... , ßq. Second, we have focused on the model-building procedure logistic(q,
) in which the q univariately most significantly genes are included in the logistic model. Other feature selection methods can also be used. One may consider, e.g. to include the q jointly most significant genes. Our proposed bootstrap method can be used in the same way in evaluating its prediction error. Computationally, however, it is much more intensive to find the q jointly most significant genes. Third Yeung, et al. (2005) use Bayesian model averaging for building microarray classification models. As discussed in Section 2.2, our proposed bootstrap method can also be motivated from the same perspective. But Yeung et al. method is, in our opinion, a somewhat ad hoc adaptation of standard Bayesian model averaging to the large p and small n microarray data while we have developed a coherent model specifically tailored to such data. Indeed, they report more misclassified arrays for Golub's leukemia data.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
Part of the first author's work was conducted when he was Bio-statistician in the Cancer Center of New Jersey. The research was supported by NCI grant 2P30 CA 72720-04. Dr Yong Lin helped with computation. The authors thank the two reviewers whose comments led to substantial improvement of the manuscript.
Confilict of Interest : none declared.
| FOOTNOTES |
|---|
Associate Editor: Trey Ideker
Received on September 10, 2006; revised on May 14, 2007; accepted on May 21, 2007
| REFERENCES |
|---|
|
|
|---|
Ambroise C, McLachlan GJ. Selection bias in gene extraction on the basis of microarray gene-expression data. Proc. Natl Acad. Sci., ( (2002) ) 99, : 6562–6566.
Agresti A. Categorical Data Analysis, ( (2002) ) 2nd. New York: Wiley..
Barbieri MM, Berger JO. Optimal predictive model selection. Ann. Stat., ( (2004) ) 32, : 870–897.[CrossRef].
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc., ( (1995) ) B57, : 289–300..
Braga-Nato UM, Dougherty ER. Is cross-validation valid for small sample microarray classification? Bioinformatics, ( (2004) ) 20, : 374–380.
Brier GW. Verification of forecasts expressed in terms of probability. Mon. Weather Rev., ( (1950) ) 78, : 1–3.[CrossRef].
Cessie SL, Van Houwelingen HC. Ridge estimators in logistic regression. Appl. Stat., ( (1990) ) 41, : 191–201.[CrossRef].
Dudoit S, et al. Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Stat. Assoc., ( (2002) ) 97, : 77–87.[CrossRef][ISI].
Golub T, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, ( (1999) ) 286, : 531–536.
Efron B. The estimation of prediction error: covariance penalties and cross-Validation. J. Am. Stati. Associ., ( (2004) ) 99, : 619–632.[CrossRef].
Efron B, Tibshirani RJ. An Introduction to the Bootstrap, ( (1993) ) London: Chapmann and Hall/CRC..
Efron B, Tibshirani R. Empirical Bayes methods and false discovery rates for microarrays. Genet. Epidemiol., ( (2002) ) 23, : 70–86.[CrossRef][ISI][Medline].
Efron B, et al. Empirical Bayes analysis of microarray experiment. J. Am. Stati. Associ., ( (2001) ) 96, : 1151–1160.[CrossRef].
Eilers PH, et al. Classification of microarray data with penalized logistic regression. In Proceedings of SPIE Volume 4266: Progress in Biomedical optics and Imaging, ( (2001) ) Vol 2, : 187–198..
Fort G, Lambert-Lacroix S. Classification using partial least squares with penalized logistic regression. Bioinformatics, ( (2005) ) 21, : 1104–1111.
Hastie T, et al. The Elements of Statistical Learning, ( (2001) ) Berlin: Springer..
Hoeting J, et al. Bayesian model Averaging. Stat. Sci., ( (1999) ) 14, : 382–401.[CrossRef][ISI].
Laird NM, Ware JH. Random-effects models for longitudinal data. Biometrics, ( (1982) ) 38, : 963–974.[CrossRef][ISI][Medline].
Lee KE, et al. Gene selection: a Bayesian variable selection approach. Bioinformatics, ( (2003) ) 19, : 90–97.
Lee JW, et al. An extensive comparison of recent classification tools applied to microarray data. Comput. Stat. & Data Anal., ( (2005) ) 48, : 869–885.[CrossRef].
Leventhal L, Huynh C. Directional decisions for two-tailed tests: power, error rates, and sample size. Psychol. Methods, ( (1996) ) 1, : 278–292.[CrossRef][ISI].
Li T, et al. A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics, ( (2004) ) 20, : 2429–2437.
Liao JG, et al. A mixture model for estimating the local false discovery rate in DNA microarray analysis. Bioinformatics, ( (2004) ) 20, : 2694–2701.
McCullagh P, Nelder JA. Generalized Linear Models, ( (1989) ) New York: Chapman and Hall..
Nguyen D, Rocke D. Tumor classification by partial least squares using microarray gene expression data. Bioinformatics, ( (2002) ) 18, : 39–50.
R Development Core Team. R. fondtation for satistical computing. In: R: a Language and Environment for Statistical Computing, ( (2006) ) Austria: Vienna. ISBN 3-900051-00-3 http://www.R-project.org..
Scheid S, Spang R. Twilight; a Bioconductor package for estimating the local false discovery rate. Bioinformatics, ( (2005) ) 21, : 2921–2922.
Shen L, Tan EC. Dimension reduction-based penalized logistic regression for cancer classification using microarray data. IEEE/ACM Trans. Compu. Biol. Bioinform., ( (2005) ) 2, : 166–175.[CrossRef].
Simon R, et al. Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J. Nat. Cancer Inst, ( (2003) ) 95, : 14–18.
Van Houwelingen JC. Shrinkage and penalized likelihood as methods to improve predictive accuracy. Statistica Neerlandica, ( (2001) ) 55, : 17–34.[CrossRef][ISI].
Wong YF, et al. Expression genomics of cervical cancer: molecular classification and prediction of radio-therapy response by DNA microarray. Clini. Cancer Res., ( (2003) ) 9, : 5486–5492.
Yeung KY, et al. Bayesian model averaging: development of an improved multi-calss, gene selection and classification tool for microarray data. Bioinformatics, ( (2005) ) 21, : 2394–2402.
Zhu J, Hastie T. Classification of gene microarrays by penalized logistic regression. Biostatistics, ( (2004) ) 5, : 427–443.[Abstract].
Zhou X, et al. Cancer classification and prediction using logistic regression with Bayesian gene selection. J. Biomed. Inform., ( (2004) ) 37, : 249–259.[CrossRef][ISI][Medline].
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||