Bioinformatics Advance Access originally published online on September 16, 2004
Bioinformatics 2005 21(4):529-536; doi:10.1093/bioinformatics/bti032
Bioinformatics vol. 21 issue 4 © Oxford University Press 2005; all rights reserved.

A semiparametric approach for marker gene selection based on gene expression data

Zhong Guan 1 and Hongyu Zhao 2,*

1 Department of Mathematical Sciences, Indiana University South Bend, South Bend, IN 46634, USA
2 Department of Epidemiology and Public Health, Yale University School of Medicine, New Haven, CT 06520, USA

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 INTRODUCTION
 METHODOLOGY
 APPLICATION TO LEUKEMIA STUDY
 CLASSIFICATION
 DISCUSSION
 REFERENCES
 

Motivation: Identification of differentially expressed genes is a major issue in gene expression data analysis, and selection of marker genes is critical in tumor classification using gene expression data. In this paper, we propose a semiparametric two-sample test both to identify differentially expressed genes and to select marker genes for sample classification.

Results: A simulation study shows that, when the sample size is small, the proposed method is more robust and powerful than commonly used methods such as t-tests and non-parametric rank-sum tests. Cross-validation shows that sample classification based on genes selected using this semiparametric method has lower misclassification rates.

Contact: hongyu.zhao{at}yale.edu


    INTRODUCTION
Identifying differentially expressed genes is one major goal of microarray data analysis. Selecting marker genes for sample classification is also an important issue for disease classification based on gene expression data. Many methods have been proposed to select differentially expressed genes, including two-sample t-tests (Dudoit et al., 2002b), ANOVA (Kerr et al., 2000), SAM (Tusher et al., 2001), Wilcoxon non-parametric two-sample tests and others.

Since microarray technology is still expensive and requires biological materials that may be difficult to collect, most studies perform only a few replicated microarray experiments. However, the appropriateness of many statistical methods, especially parametric methods such as the t-test and ANOVA, is questionable when the sample size is small. In our experience of analyzing both cDNA and Affymetrix microarray data, the difference between expression levels of genes in different samples is reflected in both means and variances, and the normality assumption for the underlying distribution may not hold. In order to take the effect of the treatments on the variances into account and still use two-sample t-test-like methods, some authors have proposed different variance stabilization methods in microarray data analysis (e.g. see Tusher et al., 2001; Tibshirani, 1988a,b; Huber et al., 2002). Variance shrinkage is another strategy to improve the estimation of variance (Long et al., 2001). O'Brien (1988) considered this issue in the general setting of two-sample comparison by proposing several extensions of the t-, rank-sum and log-rank tests and comparing them with the corresponding conventional tests. O'Brien (1988) observed that the conventional t-, rank-sum and log-rank tests are insensitive to a large class of alternatives that may be expected to occur commonly in practice, and pointed out that the proposed methods should be useful for both identifying and interpreting group differences. Mantel and Brown (1974) also studied logistic-regression-based alternative tests for comparing normal distribution parameters. They showed that these tests are valid under various types of non-random sampling schemes and can be used for any distribution within the exponential family. The efficiency comparison between logistic regression and normal discriminant analysis was given by Efron (1975) (see also Halperin et al., 1971).

In gene expression data analysis, especially in selecting differentially expressed genes that may be used as gene markers to classify human diseases, we hope to find genes that reflect as many different aspects as possible of the differences between samples. To avoid too many parameters and overly complicated calculations, it may be necessary to take only the differences in the means and the variances into account. Because most gene expression datasets contain a small number of replicates, it is usually difficult to check the normality assumption of the underlying population distributions. One robust two-sample test is the logistic regression method, which has been called an extension of the classic two-sample t-test. Our simulation study showed that, in the classification of tumor samples based on gene expression data, if the sample sizes of the two classes in the learning (training) data are not too small, the logistic regression method performs similarly to the non-parametric Wilcoxon test. Otherwise, the logistic method is more powerful than the non-parametric Wilcoxon test. In all situations, the logistic and Wilcoxon methods are more powerful than t-tests. This advantage of the semiparametric method is especially important in microarray data analysis because the number of replicates is usually small in this context.


    METHODOLOGY
Our gene selection procedure is based on multiple tests performed separately on each gene. For a given gene, let $x_{ij}$ denote the gene expression level of the $i$-th replicate in the $j$-th group. For each fixed $j = 0, 1, \ldots, m$, the $x_{ij}$, $i = 1, \ldots, n_j$, are an iid sample from distribution $F_j$. Suppose that

$$\frac{dF_j(x)}{dF_0(x)} = \exp\{\alpha_j + \beta_j^{\tau} T(x)\}, \qquad j = 1, \ldots, m, \tag{1}$$

where $\beta_j \in R^d$ is a $d$-dimensional parameter ($d \ge 1$) and $T(x)$ is a known vector of functions of $x$ such as $T(x) = x$ or $T(x) = (x, x^2)^{\tau}$. Qin and Zhang (1997) and Zhang (1999, 2001, 2002a) studied the goodness-of-fit tests for this model. Let $Y$ be a multicategory response variable with $m+1$ categories, $\pi_j = \Pr(Y = j)$ and $F_j$ be the conditional distribution of $X$ given $Y = j$ for $j = 0, 1, \ldots, m$. It is easy to see that model (1) is equivalent to the following polychotomous logistic model (e.g. see Lesaffre and Albert, 1989a; Zhang, 2002a)

$$\Pr(Y = j \mid X = x) = \frac{\exp\{\alpha_j^{*} + \beta_j^{\tau} T(x)\}}{1 + \sum_{k=1}^{m} \exp\{\alpha_k^{*} + \beta_k^{\tau} T(x)\}}, \qquad j = 1, \ldots, m, \tag{2}$$

where $\alpha_j^{*} = \alpha_j - \log(\pi_0/\pi_j)$. Consider the following hypothesis:

$$H_0: B = 0 \quad \text{versus} \quad H_1: B \neq 0,$$

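where $\alpha = (\alpha_1, \ldots, \alpha_m)^{\tau}$ and $B = (\beta_1^{\tau}, \ldots, \beta_m^{\tau})^{\tau}$. Let $n = n_0 + \cdots + n_m$, $\alpha_0 = 0$, $\beta_0 = 0$, and denote the combined sample $\{x_{01}, \ldots, x_{0n_0}, \ldots, x_{m1}, \ldots, x_{mn_m}\}$ by $\{u_1, \ldots, u_n\}$. Based on the semiparametric model (1), the likelihood of the expression values $x_{ij}$ is

$$L(\alpha, B, F_0) = \prod_{j=0}^{m} \prod_{i=1}^{n_j} dF_j(x_{ij}) = \Big\{\prod_{k=1}^{n} p_k\Big\} \prod_{j=1}^{m} \prod_{i=1}^{n_j} \exp\{\alpha_j + \beta_j^{\tau} T(x_{ij})\},$$

where $p_k = dF_0(u_k) \ge 0$ and

$$\sum_{k=1}^{n} p_k = 1, \qquad \sum_{k=1}^{n} p_k \exp\{\alpha_j + \beta_j^{\tau} T(u_k)\} = 1, \quad j = 1, \ldots, m.$$

Using the Lagrange multiplier method, we get (Zhang, 2002b)

$$p_k = \frac{1}{n \sum_{j=0}^{m} \rho_j \exp\{\alpha_j + \beta_j^{\tau} T(u_k)\}}, \qquad k = 1, \ldots, n,$$

and the profile semiparametric log-likelihood of $(\alpha, B)$

$$\ell(\alpha, B) = \sum_{j=1}^{m} \sum_{i=1}^{n_j} \{\alpha_j + \beta_j^{\tau} T(x_{ij})\} - \sum_{k=1}^{n} \log\Big[\sum_{j=0}^{m} \rho_j \exp\{\alpha_j + \beta_j^{\tau} T(u_k)\}\Big] - n \log n,$$

where $\rho_j = n_j/n$ for $j = 0, \ldots, m$. Let $\hat{\alpha}$ and $\hat{B}$ be the solution to the score equations:

$$\frac{\partial \ell(\alpha, B)}{\partial \alpha_j} = 0, \qquad \frac{\partial \ell(\alpha, B)}{\partial \beta_j} = 0, \qquad j = 1, \ldots, m.$$

Minus twice the log-likelihood ratio statistic for testing $H_0: B = 0$ versus $H_1: B \neq 0$ is

$$LR = 2\{\ell(\hat{\alpha}, \hat{B}) - \ell(0, 0)\} = 2\{\ell(\hat{\alpha}, \hat{B}) + n \log n\},$$

where $\ell(0, 0) = -n \log n$, because $B = 0$ forces $\alpha = 0$ and $\sum_{j=0}^{m} \rho_j = 1$. For large sample sizes, LR has an asymptotic $\chi^2$ distribution with $md$ degrees of freedom.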

Let $z_{ij} = j$, for $j = 0, \ldots, m$; $i = 1, \ldots, n_j$. By fitting the data $\{(x_{ij}, z_{ij}): j = 0, \ldots, m;\ i = 1, \ldots, n_j\}$ with the polychotomous logistic model (2) and using the Newton–Raphson iteration method, we can also get the maximum-likelihood estimates of $\alpha^{*}$ and $B$. This can be performed by the built-in logistic regression functions of statistical packages such as R, S-PLUS and SAS. The simulation results of this paper ($m = 1$) are calculated using the R function glm with family = binomial. The CATMOD procedure of SAS can be used to perform the analysis of generalized logits for polychotomous outcomes.
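As an illustration of the two-group case (m = 1), the test can be sketched as an ordinary logistic regression of the group label on T(x), fitted by Newton–Raphson, with the LR statistic taken as twice the gain in log-likelihood over the intercept-only fit. The sketch below is in pure Python rather than R; all function names are hypothetical, the step-halving safeguard is an added assumption, and only d = 1 and d = 2 are covered because the chi-square tail probability has a closed form for those degrees of freedom.

```python
import math

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def _sigmoid(eta):
    # Numerically stable logistic function.
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)

def _loglik(rows, z, theta):
    # Binomial log-likelihood: sum of z*eta - log(1 + exp(eta)).
    total = 0.0
    for r, zi in zip(rows, z):
        eta = _dot(r, theta)
        total += zi * eta - (max(eta, 0.0) + math.log1p(math.exp(-abs(eta))))
    return total

def _solve(a, b):
    # Gaussian elimination with partial pivoting for a small dense system.
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def fit_logistic(rows, z, max_iter=50, tol=1e-8):
    """Newton-Raphson fit; returns (theta, loglik, converged)."""
    k = len(rows[0])
    theta = [0.0] * k
    ll = _loglik(rows, z, theta)
    for _ in range(max_iter):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for r, zi in zip(rows, z):
            p = _sigmoid(_dot(r, theta))
            for a in range(k):
                grad[a] += (zi - p) * r[a]
                for b in range(k):
                    hess[a][b] += p * (1.0 - p) * r[a] * r[b]
        try:
            step = _solve(hess, grad)
        except ZeroDivisionError:  # singular Hessian, e.g. separated data
            return theta, ll, False
        t = 1.0  # step halving (an added safeguard) against overshooting
        while t > 1e-4 and _loglik(rows, z, [th + t * s for th, s in zip(theta, step)]) < ll:
            t /= 2.0
        theta = [th + t * s for th, s in zip(theta, step)]
        ll = _loglik(rows, z, theta)
        if max(abs(t * s) for s in step) < tol:
            return theta, ll, True
    return theta, ll, False

def semiparametric_lr_test(x0, x1, d=1):
    """LR statistic and asymptotic P-value for H0: beta = 0 (m = 1, d in {1, 2})."""
    if d not in (1, 2):
        raise ValueError("this sketch only covers d = 1 or d = 2")
    xs = list(x0) + list(x1)
    z = [0] * len(x0) + [1] * len(x1)
    rows = [[1.0, x] if d == 1 else [1.0, x, x * x] for x in xs]
    _, ll1, converged = fit_logistic(rows, z)
    n0, n1, n = len(x0), len(x1), len(xs)
    ll0 = n0 * math.log(n0 / n) + n1 * math.log(n1 / n)  # intercept-only fit
    lr = max(0.0, 2.0 * (ll1 - ll0))
    # Chi-square survival function in closed form for 1 and 2 degrees of freedom.
    p = math.erfc(math.sqrt(lr / 2.0)) if d == 1 else math.exp(-lr / 2.0)
    return lr, p, converged
```

A call such as `semiparametric_lr_test(group0, group1, d=2)` returns the statistic, its asymptotic P-value and a convergence flag; a `False` flag is a hint that the data for that gene may be separated and that the estimate may not exist.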

When $T(x) = (x, x^2)^{\tau}$ and $m = 1$, O'Brien (1988) called the above test ‘a natural generalization of the t-test’. It is actually a simultaneous test about the population means and variances. In fact, from (1), it is easy to see that the symmetrized Kullback–Leibler information distance between the two distributions $F_0$ and $F_j$ in the exponential change point model measures the difference between $E_{F_0} T(X)$ and $E_{F_j} T(X)$, i.e.

$$2I(F_0, F_j) = \int \log\frac{dF_j(x)}{dF_0(x)} \{dF_j(x) - dF_0(x)\} = \beta_j^{\tau}\{E_{F_j} T(X) - E_{F_0} T(X)\}.$$

$2I(f, g)$ is also called J-divergence (see Jeffreys, 1946). Compared to parametric models, such as Gaussian models, for simultaneous tests of the means and the variances, this semiparametric test is more robust in the sense that it assumes no specific form for the underlying population distributions and only focuses on the relationship between them. Indeed, the two underlying distributions are assumed to be non-parametric except that the tilt has a parametric exponential form. From (1) and (2), we know that in this logistic model we regress the posterior log odds ratio against $x$. We also ‘regress’ the log ratio of the two unknown density (frequency) functions.
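A short worked case may help to show why the choice $T(x) = (x, x^2)^{\tau}$ gives a simultaneous test of means and variances. It assumes that model (1) has the exponential-tilt form $\log\{dF_1(x)/dF_0(x)\} = \alpha + \beta^{\tau} T(x)$ and, purely for illustration, that the two groups are normal with means $\mu_0, \mu_1$ and variances $\sigma_0^2, \sigma_1^2$ (the method itself does not require normality).

```latex
\begin{align*}
\log\frac{f_1(x)}{f_0(x)}
  &= \log\frac{\sigma_0}{\sigma_1}
     - \frac{(x-\mu_1)^2}{2\sigma_1^2} + \frac{(x-\mu_0)^2}{2\sigma_0^2} \\
  &= \underbrace{\log\frac{\sigma_0}{\sigma_1}
     + \frac{\mu_0^2}{2\sigma_0^2} - \frac{\mu_1^2}{2\sigma_1^2}}_{\alpha}
     + \underbrace{\Big(\frac{\mu_1}{\sigma_1^2} - \frac{\mu_0}{\sigma_0^2}\Big)}_{\beta_1} x
     + \underbrace{\Big(\frac{1}{2\sigma_0^2} - \frac{1}{2\sigma_1^2}\Big)}_{\beta_2} x^2 .
\end{align*}
```

Here $\beta_2 = 0$ exactly when $\sigma_0^2 = \sigma_1^2$, and then $\beta_1 = 0$ exactly when $\mu_0 = \mu_1$, so $H_0: \beta = 0$ is equivalent to equal means and equal variances. With equal variances the quadratic term vanishes and $T(x) = x$ suffices, which corresponds to the d = 1 test.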

The above semiparametric method was applied to the changepoint problem in Guan (2004) and was shown to be more sensitive and robust than some non-parametric methods. Polychotomous discrimination was applied to multiclass cancer classification in Nguyen and Rocke (2002).

An important issue in logistic regression is the existence of the maximum-likelihood estimate of $\beta$. As pointed out by Albert and Anderson (1984) and Lesaffre and Albert (1989b), if for some gene the data $x_{ij}$ are completely, quasicompletely or partially separated, the maximum-likelihood estimate of $B$ in the above logistic regression does not exist. In this case, the gene is clearly a marker gene but may not be detected using this semiparametric method. However, some other methods, such as the t-test and the Wilcoxon test, should be able to detect the marker gene. Moreover, Albert and Anderson (1984), Lesaffre and Albert (1989b) and Santner and Duffy (1986) provided methods to determine whether data are separated or overlapped. Therefore, if the number of iterations needed to find the maximum-likelihood estimate of $\beta$ exceeds a given limit, one can apply these methods to check the separation status. If the data are separated into the correct groups, then this particular gene is obviously a marker gene. A better strategy for finding marker genes is to apply the Wilcoxon test first: if the result of this test is significant, select the gene; otherwise, perform the logistic regression test.
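The strategy described above (Wilcoxon test first, then a check for separation, then the logistic regression test) can be sketched as follows for two groups and T(x) = x. The function names, the significance level of 0.01 and the `fallback_test` callable are hypothetical; the rank-sum P-value uses the normal approximation with midranks and no tie or continuity correction, which is an assumption of this sketch rather than the authors' implementation. The general separation checks of Albert and Anderson (1984) are not reproduced here; in one dimension separation reduces to comparing the ranges of the two groups.

```python
import math

def separation_status(x0, x1):
    """Classify two univariate samples (T(x) = x) as 'complete', 'quasi' or 'overlap'."""
    lo0, hi0, lo1, hi1 = min(x0), max(x0), min(x1), max(x1)
    if hi0 < lo1 or hi1 < lo0:
        return "complete"   # a threshold splits the two groups perfectly
    if hi0 == lo1 or hi1 == lo0:
        return "quasi"      # the groups only touch at a boundary value
    return "overlap"

def rank_sum_p(x0, x1):
    """Two-sided Wilcoxon rank-sum P-value (normal approximation, midranks for ties)."""
    pooled = sorted((v, g) for g, xs in enumerate((x0, x1)) for v in xs)
    n0, n1 = len(x0), len(x1)
    n = n0 + n1
    w, i = 0.0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of ranks i+1, ..., j
        w += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 1)
        i = j
    mean = n1 * (n + 1) / 2.0
    sd = math.sqrt(n0 * n1 * (n + 1) / 12.0)
    return math.erfc(abs(w - mean) / sd / math.sqrt(2.0))

def select_marker(x0, x1, fallback_test, level=0.01):
    """Return (selected, reason, p_value) following the Wilcoxon-first strategy."""
    p_w = rank_sum_p(x0, x1)
    if p_w < level:
        return True, "wilcoxon", p_w
    if separation_status(x0, x1) != "overlap":
        return True, "separated", p_w  # logistic MLE would not exist
    p_f = fallback_test(x0, x1)        # e.g. a logistic regression LR test
    return p_f < level, "fallback", p_f
```

With very small groups the rank-sum test cannot reach small significance levels even for perfectly separated data, which is why the sketch checks separation before falling back to the logistic test.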


    APPLICATION TO LEUKEMIA STUDY
The leukemia dataset contains gene expression levels in two types of acute leukemia: acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML) (Golub et al., 1999). Gene expression levels were measured using Affymetrix high-density oligonucleotide arrays containing 6817 human genes. The data consist of 47 cases of ALL (38 B-cell ALL and 9 T-cell ALL) and 25 cases of AML, and are available at http://www.genome.wi.mit.edu/MPR. This dataset has been analyzed by many authors. For example, Dudoit et al. (2002a) used this dataset as an example to compare several discrimination methods for the classification of tumors.

We applied the semiparametric logistic regression test to this dataset and compared the results with the classical two-sample t-test. Among the 40 genes with the smallest P-values for the semiparametric test with d = 1, 20 genes are not in the top 40 list of the two-sample t-test because they have very large P-values for the t-test. These 20 genes are summarized in Table 1, and many of them are in fact cancer-related (http://www.ncbi.nlm.nih.gov/).


Table 1 Description of the 20 genes

 
Based on the training set of Golub et al. (1999) (27 ALL and 11 AML) and on the whole data, we compared the sets of top 40 significant genes of five methods: Wilcoxon test (W), logistic regression with d = 1 (Lgt1) and d = 2 (Lgt2), BSS/WSS criterion (B/WSS) and t-test (T). The degree of overlap among these methods is summarized in Figures 1 and 2. These results suggest that different methods may lead to quite different sets of marker genes. By increasing the sample sizes, we can get more consistent sets of marker genes. Although the logistic test with d = 2 considers the changes in both mean and variance, the sets of marker genes are quite consistent. On the other hand, the conventional t-test (assuming equal variances) and the Welch adjusted t-test (assuming unequal variances) generated quite different sets of marker genes even for large sample sizes.
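The overlap comparison summarized in Figures 1 and 2 amounts to counting shared genes between the top-k lists of each pair of methods. A minimal sketch, assuming each method supplies a list of gene identifiers ordered from most to least significant (the function name and the input layout are hypothetical):

```python
def overlap_counts(rankings, k=40):
    """Number of genes shared by the top-k lists of each pair of methods.

    rankings maps a method name to a list of gene identifiers ordered
    from most to least significant under that method.
    """
    top = {name: set(genes[:k]) for name, genes in rankings.items()}
    names = list(rankings)
    return {
        (a, b): len(top[a] & top[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }
```

For a criterion such as BSS/WSS, where larger values are more significant, the genes would be ordered by decreasing value before being passed in.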



Fig. 1 Overlap of five methods with genes selected based on Golub et al.'s learning set.

 


Fig. 2 Overlap of five methods with genes selected based on all data.

 
It is not surprising to see such differences because different methods work under different model assumptions. Some methods are not very robust and depend heavily upon the goodness of fit of the model to the data. Others are robust and insensitive to the data distribution. Parametric methods, e.g. t-tests and ANOVA, generally assume normal distributions or large sample sizes. In microarray data analyses, we often have small sample sizes, and the distributions of most datasets do not follow, or even differ greatly from, the normal distribution. It is also well known that such parametric methods are not robust. Non-parametric methods, such as the Wilcoxon test, make almost no assumptions about the underlying distributions; although they are distribution-free and robust, they may lose useful information. The proposed semiparametric method treats the underlying distributions almost non-parametrically and only assumes that the log ratios of the density (frequency) functions are known parametric functions of the observations. This procedure can be viewed as regressing the log ratio of the density (frequency) functions. Therefore, as a local fit of the data, this semiparametric approach provides a flexible, robust and powerful alternative to the existing methods.


    CLASSIFICATION
As mentioned in Dudoit et al. (2002a), the identification of ‘marker’ genes for the classification of tumors is an important issue, and t-tests are generally unable to identify genes that discriminate between all the classes. Using logistic regression with d = 2, we take the change in the variance of the training data into account. We do not intend, and it would in any case be impossible, to use the instability of the expression level of the selected gene to discriminate between different classes. However, this feature is actually present in gene expression data. From this point of view, we see that bagging (Breiman, 1996, 1998) or boosting (Freund and Schapire, 1997) would be necessary even for some ‘stable’ classifiers such as the nearest neighbors (Fix and Hodges, 1951) procedure.

After the selection of the genes, we can use quadratic discriminant analysis to classify a new sample. In the following analysis, we only consider the simplest case of K = 2 classes. Let $x_i = (x_{i1}, \ldots, x_{ip})^{\tau}$ denote the expression profile of the p selected genes of the i-th sample in the training dataset, and let $y_i$ be the class label of this sample.

Several discrimination methods can be used to classify tumors based on gene expression data. Dudoit et al. (2002a) compared many methods and concluded that the k-nearest neighbors (kNN) classifier performs remarkably well compared to more sophisticated methods such as aggregated classification trees (CART). In our analysis, we aggregate the nearest neighbors classifier using bagging and boosting perturbations. A total of 30 genes are selected, taking the 30 most significant genes under each of five tests: semiparametric tests (logistic regression) with d = 1, 2, Wilcoxon test, t-tests with unequal variances and the BSS/WSS criterion (Dudoit et al., 2002a), which is equivalent to the t-test with equal variance. The value of k is chosen to be 3. Using Breiman's (1998) adapted boosting algorithm (see Freund and Schapire, 1998), we correctly classified all 34 observations in the test leukemia dataset with two classes. The prediction votes (PVs) are given in Table 2. Two ALL cases [indices 71 (B-cell) and 67 (T-cell)] and one AML case (index 66) are the most difficult observations to classify. Observations 66 and 67 were among the three observations that tended to be difficult to classify in Dudoit et al. (2002a) and, in comparison with Golub et al. (1999), were misclassified and had low prediction strengths of 0.27 and 0.15, respectively. Using the standard bagging (non-parametric bootstrap) procedure, only 2 (67 and 71) out of 34 test observations are misclassified. To compare different marker gene selection methods in terms of the misclassification rates, we use the learning set/test set (LS/TS) resampling procedure with 2:1, 1:2, 1:3 and 1:6 schemes (see Dudoit et al., 2002a). That is, with $(n_L, n_T)$ = (48, 24), (24, 48), (18, 54) and (10, 62), respectively, the misclassification rates are estimated based on random partitions of the combined dataset of n = 72 observations into a learning set of $n_L$ observations and a test set of $n_T$ observations.
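The LS/TS resampling procedure with a kNN classifier can be sketched as below. This is a minimal illustration under stated assumptions: the function names are hypothetical, distances are Euclidean, and the expression profiles passed in are taken to contain only genes that have already been selected. Re-selecting the genes within each learning set, as would be needed for an unbiased error estimate, is left out for brevity, as are the bagging and boosting steps.

```python
import math
import random
from collections import Counter

def knn_predict(train_x, train_y, x, k=3):
    """Majority vote among the k nearest training profiles (Euclidean distance)."""
    nearest = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

def ls_ts_error_rates(xs, ys, n_learn, replicates=500, k=3, seed=0):
    """Misclassification rate on the test set for each random LS/TS partition."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    rates = []
    for _ in range(replicates):
        rng.shuffle(idx)
        learn, test = idx[:n_learn], idx[n_learn:]
        learn_x = [xs[i] for i in learn]
        learn_y = [ys[i] for i in learn]
        errors = sum(knn_predict(learn_x, learn_y, xs[i], k) != ys[i] for i in test)
        rates.append(errors / len(test))
    return rates
```

For the 1:2 scheme on n = 72 samples one would call `ls_ts_error_rates(xs, ys, n_learn=24)` and summarize the returned rates, for example with a box-and-whisker plot.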


Table 2 Prediction votes of aggregated kNN classifier using boosting

 
Error rate estimation using simulation
We first select the top p = 30 and 40 significant genes based on eight tests: Wilcoxon (W), logistic regression with d = 1 (Lgt1), d = 2 (Lgt2), t-test (T), BSS/WSS criterion (B/WSS), the combination of Wilcoxon and logistic regression with d = 1 (W.Lgt1), the combination of logistic regression with d = 1 and d = 2 (Lgt12), and the combination of Wilcoxon and logistic regression with d = 1 and d = 2 (W.Lgt12). For a combined method, the P-value of a gene is the smallest of the P-values from the tests being combined. We then simulate 500 LS/TS samples (2:1, 1:2, 1:3 and 1:6 schemes) and use the kNN classifier with k = 3 to classify the test set. The misclassification rates are summarized by box-and-whisker plots in Figures 3–6.
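The combination rule and the top-p selection described above can be sketched as follows, assuming each test provides a dictionary from gene identifier to P-value (the function names and this data layout are hypothetical):

```python
def combine_min(*p_value_maps):
    """Per-gene minimum P-value over the tests being combined."""
    genes = set(p_value_maps[0])
    for m in p_value_maps[1:]:
        genes &= set(m)  # keep only genes scored by every test
    return {g: min(m[g] for m in p_value_maps) for g in genes}

def top_p(p_values, p=30):
    """The p genes with the smallest P-values (ties broken by gene identifier)."""
    return sorted(p_values, key=lambda g: (p_values[g], g))[:p]
```

For example, a W.Lgt1 list would be `top_p(combine_min(wilcoxon_p, lgt1_p), p=40)`. Taking the minimum is used here only for ranking genes; the combined value is not itself a calibrated P-value.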



Fig. 3 Summary of error rates (500 replicates, 1:2 scheme, 40 different genes).

 


Fig. 6 Summary of error rates (500 replicates, 1:6 scheme, 40 different genes).

 
This simulation study shows that different marker gene selection methods affect the output of the classification dramatically.

Since we mainly focus on the comparison of different methods of marker gene selection, we can compare these methods by selecting genes based on all datasets available and then compare the classification results using different sets of marker genes. We note that the misclassification rates may be underestimated, although the set of marker genes may converge for each gene selection method when the learning dataset is large enough. Figure 7 compares the results based on the same set of marker genes that are selected using all the datasets available.



Fig. 7 Summary of error rates (500 replicates, 2:1 scheme, 40 same genes).

 

    DISCUSSION
Owing to the expense of microarray experiments and the difficulty of biological sample collection, the sample sizes of most microarray datasets are quite small. In this case, parametric statistical methods such as the t-test and ANOVA are less reliable than non-parametric or semiparametric methods because of likely deviations from the parametric assumptions. For example, although both the t-test and the logistic regression test with d = 1 compare the means, the t-test depends heavily on the correctness of the normality assumption for the data values. This is especially the case for microarray data since there are many unknown, uncontrollable factors that affect the observed expression levels, and the normality assumption is also difficult to test with small sample sizes. Although the Wilcoxon non-parametric two-sample test (equivalent to the Mann–Whitney test) and its k-sample version, the Kruskal–Wallis test, are distribution-free methods, they make no use of any information about the distributions of the data. In contrast, the proposed semiparametric method makes no assumption about the underlying distribution except that there is a parametric link between the two groups, and this link can be interpreted as a regression. Therefore, this semiparametric method is both more powerful and more robust than the other methods. In selecting differentially expressed genes based on gene expression data, we recommend combining several methods, e.g. the t-test or ANOVA, logistic regression with d = 1, 2 and non-parametric methods, and taking the union of the sets of significant genes as the set of candidate genes. In selecting marker genes for classification, the logistic regression method with d = 1 seems appropriate if linear classifiers are applied.

The choice of a method for classifying tumors and cancers based on high-throughput data is also important. Different methods indeed yield quite different misclassification rates for various datasets. In the examples of the present paper, we used only kNN classification because Dudoit et al. (2002a) showed that this method is better than other commonly used classification methods. For other kinds of datasets, other methods may be used. For example, Wu et al. (2003) compared several discriminant methods, including linear and quadratic discriminant analysis, kNN and random forest (RF), for the classification of ovarian cancer using mass spectrometry data, and showed that RF performs better than all the other methods considered in the comparison.



Fig. 4 Summary of error rates (500 replicates, 1:2 scheme, 30 different genes).

 



Fig. 5 Summary of error rates (500 replicates, 1:3 scheme, 40 different genes).

 


    Acknowledgments
 
We thank Dr Biao Zhang for introducing the paper of O'Brien (1988) and his advice on empirical likelihood techniques. This work was supported in part by NIH grant GM59507 and NSF grant DMS 0241160.

Received on April 21, 2004; revised on August 5, 2004; accepted on September 9, 2004

    REFERENCES

    Albert, A. and Anderson, J.A. (1984) On the existence of maximum likelihood estimates in logistic regression models. Biometrika, 71, 1–10.

    Breiman, L. (1996) Bagging predictors. Mach. Learn., 24, 123–140.

    Breiman, L. (1998) Arcing classifier. Ann. Stat., 26, 801–824.

    Dudoit, S., Fridlyand, J., Speed, T.P. (2002a) Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Stat. Assoc., 97, 77–87.

    Dudoit, S., Yang, Y.H., Callow, M.J., Speed, T.P. (2002b) Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Stat. Sinica, 12, 111–139.

    Efron, B. (1975) The efficiency of logistic regression compared to normal discriminant analysis. J. Am. Stat. Assoc., 70, 892–898.

    Fix, E. and Hodges, J. (1951) Discriminatory analysis, nonparametric discrimination: consistency properties. Technical Report, USAF School of Aviation Medicine, Randolph Field, TX.

    Freund, Y. and Schapire, R.E. (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Sys. Sci., 55, 119–139.

    Freund, Y. and Schapire, R.E. (1998) Comment on ‘Arcing classifiers’. Ann. Stat., 26, 824–832.

    Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., Lander, E.S. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531–537.

    Guan, Z. (2004) A semiparametric changepoint model. Biometrika, 91, 849–862.

    Halperin, M., Blackwelder, W.C., Verter, J.I. (1971) Estimation of the multivariate logistic risk function: a comparison of the discriminant function and maximum likelihood approaches. J. Chronic Dis., 24, 125–158.

    Huber, W., von Heydebreck, A., Sultmann, H., Poustka, A., Vingron, M. (2002) Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics, 18, S96–S104.

    Jeffreys, H. (1946) An invariant form for the prior probability in estimation problems. Proc. R. Soc. Lond., Ser. A, 186, 453–461.

    Kerr, M.K., Martin, M., Churchill, G.A. (2000) Analysis of variance for gene expression microarray data. Technical Report, The Jackson Laboratory, Bar Harbor, ME.

    Lesaffre, E. and Albert, A. (1989a) Multiple-group logistic regression diagnostics. J. R. Stat. Soc. Ser. C, 38, 425–440.

    Lesaffre, E. and Albert, A. (1989b) Partial separation in logistic discrimination. J. R. Stat. Soc. Ser. B, 51, 109–116.

    Long, A., Mangalam, H.J., Chan, B.Y., Tolleri, L., Hatfield, G.W., Baldi, P. (2001) Improved statistical inference from DNA microarray data using analysis of variance and a Bayesian statistical framework. Analysis of global gene expression profiling in Escherichia coli K12. J. Biol. Chem., 276, 19937–19944.

    Mantel, N. and Brown, C. (1974) Alternative tests for comparing normal distribution parameters based on logistic regression. Biometrics, 30, 485–497.

    Nguyen, D.V. and Rocke, D.M. (2002) Multi-class cancer classification via partial least squares with gene expression profiles. Bioinformatics, 18, 1216–1226.

    O'Brien, P.C. (1988) Comparing two samples: extensions of the t, rank-sum, and log-rank tests. J. Am. Stat. Assoc., 83, 52–61.

    Qin, J. and Zhang, B. (1997) A goodness-of-fit test for logistic regression models based on case–control data. Biometrika, 84, 609–618.

    Santner, T.J. and Duffy, D.E. (1986) A note on A. Albert and J. A. Anderson's conditions for the existence of maximum likelihood estimates in logistic regression models. Biometrika, 73, 755–758.

    Tibshirani, R. (1988a) Estimating transformations for regression via additivity and variance stabilization. J. Am. Stat. Assoc., 83, 394–405.

    Tibshirani, R. (1988b) Variance stabilization and the bootstrap. Biometrika, 75, 433–444.

    Tusher, V.G., Tibshirani, R., Chu, G. (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. USA, 98, 5116–5121.

    Wu, B., Abbott, T., Fishman, D., McMurray, W., Mor, G., Stone, K., Ward, D., Williams, K., Zhao, H. (2003) Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data. Bioinformatics, 19, 1636–1643.

    Zhang, B. (1999) A chi-squared goodness-of-fit test for logistic regression models based on case–control data. Biometrika, 86, 531–539.

    Zhang, B. (2001) An information matrix test for logistic regression models based on case–control data. Biometrika, 88, 921–932.

    Zhang, B. (2002a) Assessing goodness-of-fit of generalized logit models based on case–control data. J. Multivariate Anal., 82, 17–38.

    Zhang, B. (2002b) An EM algorithm for a semiparametric finite mixture model. J. Stat. Comput. Simul., 72, 791–802.

