Bioinformatics Advance Access originally published online on June 2, 2005
Bioinformatics 2005 21(15):3264-3272; doi:10.1093/bioinformatics/bti519
Practical FDR-based sample size calculations in microarray experiments
1Department of Biostatistics and Applied Mathematics, University of Texas M.D. Anderson Cancer Center TX 77030-4009, USA
2Department of Biostatistics, University of North Carolina at Chapel Hill NC 27599-3260, USA
*To whom correspondence should be addressed.
| Abstract |
|---|
Motivation: Owing to the experimental cost and difficulty in obtaining biological materials, it is essential to consider appropriate sample sizes in microarray studies. With the growing use of the False Discovery Rate (FDR) in microarray analysis, an FDR-based sample size calculation is essential.
Method: We describe an approach to explicitly connect the sample size to the FDR and the number of differentially expressed genes to be detected. The method fits parametric models for degree of differential expression using the Expectation-Maximization algorithm.
Results: The applicability of the method is illustrated with simulations and studies of a lung microarray dataset. We propose to use a small training set or published data from relevant biological settings to calculate the sample size of an experiment.
Availability: Code to implement the method in the statistical package R is available from the authors.
Contact: jhu{at}mdanderson.org
| INTRODUCTION |
|---|
cDNA and oligonucleotide microarrays have become powerful tools for the global estimation and comparison of gene expression. A main application of microarrays is the detection of genes that are differentially expressed under two (or more) different conditions. This problem is more difficult than one might expect, owing to the multiplicity of tests and the attendant need to increase power by employing sensitive modeling. Early approaches used simple thresholds for ratios of expression estimates under the two conditions (Chen et al., 1997), whereas ordinary t-tests and Wilcoxon tests (Dudoit et al., 2002; Troyanskaya et al., 2002) have been used in a manner that controls the family-wise error rate (FWER) or the false discovery rate (FDR). The tests may be improved by explicitly utilizing the relationship between the mean and the variance of estimated expression [Hu and Wright, 2004 (http://www.bios.unc.edu/~fwright/TechReport/); Chen et al., 1997; Ideker et al., 2000]. Similar ideas are employed in regularized t-tests; one version involves adding a constant to the variance estimate in the t denominator (Tusher et al., 2001; Efron et al., 2001). Once a suitable test statistic is chosen, permutation can be used to obtain a suitable null distribution for empirical testing or estimating the FDR (Tusher et al., 2001; Efron et al., 2001; Pan et al., 2001).
Many array studies have demonstrated biologically plausible results with very few arrays (e.g., Yoon et al., 2002), leading to a perception that a researcher might casually hybridize a handful of arrays in the hope of finding something meaningful. To statisticians concerned with multiple-testing issues, such a sample size might seem inherently insufficient, and permutation-based approaches may not be possible. Our view is that in some cases surprisingly few arrays may be sufficient, but the current dominance of cancer microarray research has produced an optimistic view of differential expression that may be unwarranted in other settings. Examples with greater biological subtlety will probably include the search for the downstream effects of a single gene mutation, or examining expression differences among closely related rodent strains.
As the microarray field moves beyond casual hypothesis-generating efforts, it becomes increasingly important to prospectively estimate required sample sizes prior to undertaking an experiment. Unfortunately, there is sparse literature on sample size estimation for microarrays. Using ANOVA analyses of gene expression, Black and Doerge (2002) propose a parametric approach assuming lognormal or gamma distributions for gene expression intensities. Lee and Whitmore (2002) also describe sample size calculations for the ANOVA model, under several different experiment designs. Pan et al. (2002) discuss sample size calculations using a combination of parametric and non-parametric approaches. All these methods relate power to sample size while controlling for the Type I error, although Lee and Whitmore (2002) also discuss some limited connections to the FDR. While the methods in these papers are useful, it is more natural to base sample size calculations directly on the FDR, as this criterion is often used as an error bound in accounting for multiple tests. Our approach is applicable to any array platform, assuming that a single expression estimate is available for each gene and each sample after any necessary platform-dependent pre-processing and normalization. For two-color arrays, log ratios (where one sample serves as a common reference) or estimates based on separate modeling of the color channels (Jin et al., 2001) are often used. Although we highlight the application to microarrays, the approach described here can be used for any high-throughput data involving two-sample comparisons that employs the FDR. Our method is described in the next section, followed by simulation studies and the analysis of real datasets. We conclude with some remarks and discussion.
| METHODS |
|---|
Notation and assumptions
Let x1i (i = 1,...,n1) and x2j (j = 1,...,n2) denote the expression levels for a single gene under two conditions, where n1 and n2 are the total number of arrays under conditions 1 and 2, respectively. For simplicity, we assume n1 = n2 = n. Extensions to more general experimental designs are relatively straightforward and appear in the discussion. For a single gene, we assume that the gene expression estimates (perhaps after suitable transformation) are normally distributed within each condition. A growing literature supports this assumption for estimates derived from two-color arrays (Chen et al., 1997) and oligonucleotide arrays (Giles and Kipling, 2003). We assume there is a total of m genes represented on the arrays. We wish to identify the genes whose expression differs under the two conditions.
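Let $\mu_1$ and $\mu_2$ denote true mean expressions for the gene under conditions 1 and 2, and $\sigma_1^2$ and $\sigma_2^2$ the corresponding variances. For testing purposes, the hypotheses are H0: $\mu_1 = \mu_2$ versus H1: $\mu_1 \neq \mu_2$, and we use the statistic

$$T = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{(s_1^2 + s_2^2)/n}},$$

where $\bar{x}_1 = \sum_{i=1}^{n} x_{1i}/n$, $\bar{x}_2 = \sum_{j=1}^{n} x_{2j}/n$, $s_1^2 = \sum_{i=1}^{n} (x_{1i} - \bar{x}_1)^2/(n-1)$ and $s_2^2 = \sum_{j=1}^{n} (x_{2j} - \bar{x}_2)^2/(n-1)$. The quantity

$$\delta = \frac{\mu_1 - \mu_2}{\sqrt{(\sigma_1^2 + \sigma_2^2)/2}}$$

is the standardized effect size, so the hypotheses can be written as H0: $\delta = 0$ versus H1: $\delta \neq 0$. T then has an approximate t density with $2n - 2$ degrees of freedom and non-centrality parameter $\delta\sqrt{n/2}$ (Welch, 1947). T is exactly t-distributed if $\sigma_1^2 = \sigma_2^2$, and we implicitly use this assumption in model-fitting. Numerical illustrations given further below examine the effect of departures from the assumption. We denote the non-central t density as $g_n(t \mid \delta)$, with cumulative distribution function (CDF) $G_n(t \mid \delta)$ (Johnson et al., 1994).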
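As a concrete illustration of the quantities in this subsection, the following minimal Python sketch computes the equal-n two-sample statistic and the associated non-centrality parameter. It assumes the usual form $T = (\bar{x}_1 - \bar{x}_2)/\sqrt{(s_1^2 + s_2^2)/n}$ and the effect size $\delta = (\mu_1 - \mu_2)/\sqrt{(\sigma_1^2 + \sigma_2^2)/2}$, which are consistent with the $2n - 2$ degrees of freedom and the non-centrality parameter $\delta\sqrt{n/2}$ stated in the text; the function names are illustrative and are not taken from the authors' R code.

```python
import math
from statistics import mean, variance


def two_sample_t(x1, x2):
    """Return (T, df) for two equal-sized groups of expression values."""
    n = len(x1)
    if n != len(x2) or n < 2:
        raise ValueError("both conditions need the same number of arrays, n >= 2")
    # variance() uses the n - 1 denominator, matching 2n - 2 degrees of freedom.
    pooled = (variance(x1) + variance(x2)) / n
    return (mean(x1) - mean(x2)) / math.sqrt(pooled), 2 * n - 2


def effect_size(mu1, mu2, var1, var2):
    """Standardized difference (mu1 - mu2) / sqrt((var1 + var2) / 2)."""
    return (mu1 - mu2) / math.sqrt((var1 + var2) / 2.0)


def noncentrality(delta, n):
    """Non-centrality parameter delta * sqrt(n / 2) for n arrays per condition."""
    return delta * math.sqrt(n / 2.0)
```

For example, an effect size of one standard deviation with n = 8 arrays per condition corresponds to a non-centrality parameter of 2.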
With thousands of genes assayed in a typical microarray experiment, we treat the $\delta$s as realizations from a common CDF, F, for a random variable $\Delta$. The distributions of T and derived P-values depend entirely on F. Because the sign of $\delta$ has biological importance, we decompose F as the mixture:

$$F(\delta) = \pi_0 F_0(\delta) + \pi_1 F_1(\delta) + \pi_2 F_2(\delta), \tag{1}$$

where $\pi_0$, $\pi_1$ and $\pi_2$ are the probabilities that $\Delta$ is 0, positive or negative, respectively, with $\pi_0 + \pi_1 + \pi_2 = 1$. $F_0(\delta) = I(\delta \geq 0)$ is the CDF of the random variable with a point mass at 0, whereas $F_1$ and $F_2$ are the conditional CDFs for positive and negative $\Delta$. Equation (1) encompasses a large number of situations of interest, where $F_1$ and $F_2$ can be discrete or continuous. Finally, we use $f_1$ and $f_2$ to denote the probability density (or mass) functions for $F_1$ and $F_2$.
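We consider three situations for the remainder of the paper. (1) The discrete model assumes that F is discrete with point masses at 0 and constants $a_1$ (positive) and $a_2$ (negative); (2) the exponential mixture model assumes a point mass at 0 and exponential densities for positive and negative $\delta$, or $f_1(\delta) = \lambda_1 \exp(-\lambda_1 \delta)$ for $\delta > 0$ and $f_2(\delta) = \lambda_2 \exp(\lambda_2 \delta)$ for $\delta < 0$; (3) the normal mixture model assumes a point mass at 0 and two normal densities truncated at 0:

$$f_1(\delta) = \frac{\phi\{(\delta - \nu_1)/\tau_1\}}{\tau_1\,[1 - \Phi(-\nu_1/\tau_1)]}, \quad \delta > 0,$$

$$f_2(\delta) = \frac{\phi\{(\delta - \nu_2)/\tau_2\}}{\tau_2\,\Phi(-\nu_2/\tau_2)}, \quad \delta < 0,$$

where $\phi$ and $\Phi$ are the standard normal density and CDF, and $\nu_k$ and $\tau_k$ (k = 1, 2) are the location and scale parameters of the untruncated normal densities.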
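To make the three models concrete, the sketch below draws effect sizes from a point mass at 0 mixed with a positive and a negative component. It is an illustrative Python sampler, not the authors' code: the function name, the argument layout and the use of rejection sampling for the truncated normal case are assumptions made here, and the truncated normal is parameterized by the location and scale of the untruncated density.

```python
import random


def sample_delta(rng, pi0, pi1, model, params):
    """Draw one effect size; pi2 = 1 - pi0 - pi1 is implied."""
    u = rng.random()
    if u < pi0:
        return 0.0                      # null gene: point mass at 0
    positive = u < pi0 + pi1            # otherwise the negative component
    if model == "discrete":
        a1, a2 = params                 # a1 > 0 > a2
        return a1 if positive else a2
    if model == "exponential":
        lam1, lam2 = params             # rates of the two exponential tails
        d = rng.expovariate(lam1 if positive else lam2)
        return d if positive else -d
    if model == "normal":
        (loc1, sd1), (loc2, sd2) = params
        loc, sd = (loc1, sd1) if positive else (loc2, sd2)
        while True:                     # rejection sampling enforces truncation at 0
            d = rng.gauss(loc, sd)
            if d != 0.0 and (d > 0.0) == positive:
                return d
    raise ValueError("model must be 'discrete', 'exponential' or 'normal'")
```

Rejection sampling is simple but can be slow when the untruncated location lies far on the wrong side of 0; an inverse-CDF sampler would avoid that at the cost of more code.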
Statistical models
We let P be the random P-value for a gene. The required number of arrays is related to the distribution of P as shown below. Thus the key to our method is to estimate the distribution of P, from which the sample size can be computed.
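Suppose $\Delta$ has a CDF as given in Equation (1). Conditional on a given $\delta$, the CDF of P at a specific $p_0$ is

$$\Pr(P \leq p_0 \mid \delta) = 1 - G_n(t_{p_0/2} \mid \delta) + G_n(-t_{p_0/2} \mid \delta), \tag{2}$$

where $G_n(\cdot \mid \delta)$ is the non-central t CDF with $2n - 2$ degrees of freedom and non-centrality parameter $\delta\sqrt{n/2}$, and $t_{p_0/2}$ is the $1 - p_0/2$ quantile of the central t density with $2n - 2$ degrees of freedom. We will use $p_0$ as a threshold for rejecting H0, and thus it is the Type I error rate for a single test. The marginal distribution of P, in contrast, reflects the mixture of varying alternatives:

$$\Pr(P \leq p_0) = \pi_0 p_0 + \pi_1 \int \Pr(P \leq p_0 \mid \delta)\, dF_1(\delta) + \pi_2 \int \Pr(P \leq p_0 \mid \delta)\, dF_2(\delta). \tag{3}$$

With m genes and rejection threshold $p_0$, the expected number of rejected genes is $E(R) = m \Pr(P \leq p_0)$, and the positive false discovery rate (pFDR; Storey, 2003) is

$$\mathrm{pFDR} = \Pr(\Delta = 0 \mid P \leq p_0) = \frac{\pi_0 p_0}{\Pr(P \leq p_0)}. \tag{4}$$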
[…] 5), this phenomenon does not appreciably constrain the researcher's ability to achieve a specified pFDR with sufficiently small p0 (see Hu and Wright, 2004).
| NUMERICAL STUDIES |
|---|
In this section, we perform some simple simulations assuming that F is known. Such a situation might arise if the researcher has a hypothesized F and wishes to examine the pFDR for various sample sizes. It can also provide us some insight on how the sample size changes under different distributions of F. We mainly used the discrete and exponential mixture models for illustration because they represent the two extremes in tail behavior among the models considered. In both cases, F is symmetric to simplify the specification of dispersion of F about 0.
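For the discrete model, we have $a_1 = -a_2 = a$ and $\pi_1 = \pi_2$. Then Equation (3) can be written as

$$\Pr(P \leq p_0) = \pi_0 p_0 + (1 - \pi_0)\,[1 - G_n(t_{p_0/2} \mid a) + G_n(-t_{p_0/2} \mid a)]. \tag{5}$$

For the exponential mixture model, we have $\lambda_1 = \lambda_2 = \lambda$, and $\pi_1 = \pi_2$. Then Equation (3) is

$$\Pr(P \leq p_0) = \pi_0 p_0 + (1 - \pi_0) \int_0^\infty [1 - G_n(t_{p_0/2} \mid \delta) + G_n(-t_{p_0/2} \mid \delta)]\, \lambda e^{-\lambda\delta}\, d\delta. \tag{6}$$

Here $G_n(\cdot \mid \delta)$ denotes the non-central t CDF with $2n - 2$ degrees of freedom and non-centrality parameter $\delta\sqrt{n/2}$, and $t_{p_0/2}$ the $1 - p_0/2$ quantile of the central t density. Setting $\lambda = 1/a$ gives $F_1$ and $F_2$ the same means under the discrete and continuous models. We choose a across a range of values we consider to be biologically meaningful, with expression differences in the two groups ranging from 0.25 to 2.5 SD. $\pi_0$ is examined in the range from 0.5 to 0.99, corresponding to numerous studies in which a minority of genes were differentially expressed (Liao et al., 2004). Finally, we use $p_0 = 0.1/m$ for the first numerical example, assuming that the rejection region has been chosen to conservatively control the FWER at no more than 0.1. This specific choice of $p_0$ is used for illustration; the derivations apply to any $p_0$. Later examples demonstrate sample size calculations when a fixed number of genes is rejected and a pFDR is specified.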
With the number of arrays per condition set to n = 10, the expected number of rejected genes E(R) and the pFDR are exhibited in Table 1. For much of the range of a, the continuous model has a lower pFDR because of a small but often-rejected portion of genes with extreme $\delta$ under the exponential mixture model. One interesting observation is that the two E(R) functions cross, so that the discrete model rejects fewer genes than the exponential mixture model for small a, but more genes for large a. It is difficult to draw general lessons about the conservativeness [in terms of pFDR for a given E(R)] of the competing models, so that simulation or numeric integration is essential to evaluate the relationship among the relevant quantities.
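The calculation behind such a table can be sketched for the symmetric discrete model as follows. This is a rough Python illustration under two stated assumptions: E(R) = m Pr(P ≤ p0) with pFDR = π0 p0 / Pr(P ≤ p0), and a large-sample normal approximation T ≈ N(δ√(n/2), 1) in place of the non-central t distribution. Because of the approximation, the numbers will not reproduce Table 1 exactly, especially at n = 10; the function names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def reject_prob(delta, n, p0):
    """Approximate Pr(P <= p0 | delta) for a two-sided test (normal approximation)."""
    z = STD_NORMAL.inv_cdf(1.0 - p0 / 2.0)      # two-sided critical value
    theta = delta * math.sqrt(n / 2.0)          # non-centrality parameter
    return (1.0 - STD_NORMAL.cdf(z - theta)) + STD_NORMAL.cdf(-z - theta)


def discrete_model_summary(a, pi0, n, p0, m):
    """Return (E(R), pFDR) when delta is 0, +a or -a with pi1 = pi2."""
    marginal = pi0 * p0 + (1.0 - pi0) * reject_prob(a, n, p0)
    expected_rejections = m * marginal
    pfdr = pi0 * p0 / marginal
    return expected_rejections, pfdr
```

A call such as `discrete_model_summary(1.0, 0.9, 10, 0.1 / 10000, 10000)` mirrors the setting described above (n = 10, p0 = 0.1/m); replacing the normal CDF with a non-central t CDF from a statistics library would remove the approximation.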
By combining Equations (3) and (4), we can find the relationship among n, E(R) and the pFDR by solving

$$p_0 = \frac{\mathrm{pFDR} \times E(R)}{m\,\pi_0},$$

$$\frac{E(R)}{m} = \pi_0 p_0 + \pi_1 \int \Pr(P \leq p_0 \mid \delta)\, dF_1(\delta) + \pi_2 \int \Pr(P \leq p_0 \mid \delta)\, dF_2(\delta) \tag{7}$$

for n, given the remaining quantities (including $\pi_0$). Table 2 shows the required number of arrays for E(R) = 10 and E(R) = 100, and a range of a values while controlling for several different pFDR values. Again, results are shown for the discrete and the symmetric exponential mixture models. In addition, we also implement a similar procedure for the normal mixture model to obtain sample size results. To make the normal mixture model comparable to the other models, we set the location parameters $\nu_1$ and $\nu_2$ equal to 0 and obtain the values of the scale parameters $\tau_1$ and $\tau_2$ such that the absolute mean equals a for both the positive and the negative $\delta$s. For most of the parameter values examined, the discrete model requires larger sample sizes than the continuous model, again because the latter has many easily-rejected genes with $\delta$ far from zero. The normal mixture model's results are usually between the discrete and the exponential mixture models owing to its medium tails. Note that for large a the required sample size might be as few as 3 or 4 in each condition, even when E(R) = 100. However, to detect more subtle effects, hundreds of arrays may be required.
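The sample size search described above can be sketched as a simple loop over n. The Python below is a minimal illustration for the symmetric discrete model only, assuming E(R) = m Pr(P ≤ p0) and pFDR = π0 p0 / Pr(P ≤ p0), so that the threshold is p0 = pFDR × E(R)/(m π0). It also replaces the non-central t distribution with the normal approximation T ≈ N(δ√(n/2), 1), which is crude for very small n, so the returned values should not be expected to match Table 2. All names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def reject_prob(delta, n, p0):
    """Approximate Pr(P <= p0 | delta) under the normal approximation."""
    z = STD_NORMAL.inv_cdf(1.0 - p0 / 2.0)
    theta = delta * math.sqrt(n / 2.0)
    return (1.0 - STD_NORMAL.cdf(z - theta)) + STD_NORMAL.cdf(-z - theta)


def required_n(a, pi0, m, expected_rejections, pfdr, n_max=10000):
    """Smallest n per condition reaching the target E(R) at the target pFDR.

    Returns None when the targets are infeasible or n_max is exceeded.
    """
    p0 = pfdr * expected_rejections / (m * pi0)     # threshold implied by the targets
    target = expected_rejections / m                # required Pr(P <= p0)
    if not 0.0 < p0 < 1.0 or target > pi0 * p0 + (1.0 - pi0):
        return None                                 # cannot reject that many genes
    for n in range(2, n_max + 1):
        marginal = pi0 * p0 + (1.0 - pi0) * reject_prob(a, n, p0)
        if marginal >= target:                      # rejection probability grows with n
            return n
    return None
```

Because the rejection probability is monotone in n for fixed p0, a bisection search would also work; the linear scan is kept for readability.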
Instead of using the conditional means of $F_1$ and $F_2$, it might be argued that a better overall summary of the magnitude of expression differences is given by var($\Delta$), so that the symmetric discrete and continuous models are made comparable in the first two moments of $\Delta$ by choosing $\lambda = \sqrt{2}/a$. However, for sufficiently small pFDR the exponential mixture model will still typically require a smaller sample size, again owing to the heavier tails of the exponential model compared with the discrete model. The practical importance of the moments can be understood through the following approximate argument. For sufficiently large n and fixed $\delta$, T is approximately distributed as $N(\delta\sqrt{n/2}, 1)$, leading to the approximation $T \approx \Delta\sqrt{n/2} + \varepsilon$, where $\varepsilon \sim N(0,1)$ is independent of $\Delta$. Thus the sample mean and variance of T, which are readily estimated from the data, can be used to estimate the mean and variance of $\Delta$ without a specific model for F. Indeed, as we will see further below, when a maximum-likelihood approach is applied to the competing models using real data, the means and standard deviations of the corresponding $\Delta$ distributions can be compared with each other. Because the pFDR depends on the specific form of F, it is important to choose a plausible model. One approach might be to examine a number of models and conservatively choose the model with the greatest pFDR for a specified E(R). As an alternative, in many circumstances a training set of data may be available and can be used to estimate F, prior to using the proposed method to calculate the sample size. This approach is not necessarily restrictive, as results from previous studies may be available from which it is possible to estimate F. Moreover, because of the thousands of genes on the array, F may be estimated using fewer arrays than are necessary to achieve a desired pFDR. Below we propose a procedure to estimate F using real data.
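The moment argument above can be checked numerically. The Python sketch below assumes the large-n approximation T ≈ Δ√(n/2) + ε with ε ~ N(0,1) independent of Δ, which gives E(Δ) ≈ mean(T)/√(n/2) and var(Δ) ≈ (var(T) − 1)/(n/2). The simulator draws Δ from a discrete model purely for illustration; both function names are hypothetical.

```python
import math
import random


def delta_moments_from_t(t_stats, n):
    """Estimate (mean, variance) of the effect size from observed t-statistics."""
    half_n = n / 2.0
    m = len(t_stats)
    t_mean = sum(t_stats) / m
    t_var = sum((t - t_mean) ** 2 for t in t_stats) / m
    mean_delta = t_mean / math.sqrt(half_n)
    var_delta = max(0.0, (t_var - 1.0) / half_n)   # subtract the unit noise variance
    return mean_delta, var_delta


def simulate_t_stats(rng, m, n, pi0, pi1, a):
    """Simulate m statistics with delta in {0, +a, -a} under the approximation."""
    scale = math.sqrt(n / 2.0)
    out = []
    for _ in range(m):
        u = rng.random()
        delta = 0.0 if u < pi0 else (a if u < pi0 + pi1 else -a)
        out.append(delta * scale + rng.gauss(0.0, 1.0))
    return out
```

With π0 = 0.8, π1 = 0.15, π2 = 0.05 and a = 1, the true mean and variance of Δ are 0.10 and 0.19, which the estimator should recover closely for a large number of genes.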
| ESTIMATING F VIA THE EM ALGORITHM |
|---|
The primary difficulty in estimating F is that the realizations of the $\delta$s are not observed directly. As discussed in the previous section, $\delta$ may be thought of as observed with random t-distributed errors. This view of the T observations as incomplete data suggests the use of the Expectation-Maximization (EM) algorithm (Dempster et al., 1977).
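The incomplete-data view can be illustrated with a simplified EM loop for the discrete model. This is a sketch under a strong, explicitly assumed simplification: the non-central t density is replaced by the normal approximation T ≈ N(δ√(n/2), 1), so each component is a unit-variance normal and the update for the point-mass locations has a closed form (a responsibility-weighted mean). With the exact t density no such closed form exists and numerical maximization is needed. The function names and starting values are hypothetical.

```python
import math
import random


def normal_pdf(x, mu):
    """Density of N(mu, 1) at x."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)


def em_discrete(t_stats, n, pi=(0.6, 0.2, 0.2), a=(1.0, -1.0), iters=150):
    """EM for delta in {0, a1, a2} with T ~ N(delta * sqrt(n / 2), 1)."""
    c = math.sqrt(n / 2.0)
    pi0, pi1, pi2 = pi
    a1, a2 = a
    m = len(t_stats)
    for _ in range(iters):
        s0 = s1 = s2 = w1 = w2 = 0.0
        for t in t_stats:
            # E-step: posterior probability of each component for this gene.
            d0 = pi0 * normal_pdf(t, 0.0)
            d1 = pi1 * normal_pdf(t, a1 * c)
            d2 = pi2 * normal_pdf(t, a2 * c)
            total = max(d0 + d1 + d2, 1e-300)   # guard against underflow
            r1, r2 = d1 / total, d2 / total
            s0 += d0 / total
            s1 += r1
            s2 += r2
            w1 += r1 * t
            w2 += r2 * t
        # M-step: mixing proportions are the mean responsibilities.
        pi0, pi1, pi2 = s0 / m, s1 / m, s2 / m
        if s1 > 0.0:
            a1 = w1 / (c * s1)                  # weighted mean, rescaled to delta units
        if s2 > 0.0:
            a2 = w2 / (c * s2)
    return (pi0, pi1, pi2), (a1, a2)
```

The sketch does not enforce the sign constraints on the two locations; with sensible starting values the positive and negative components stay on their own sides of 0.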
We note that the discrete and exponential mixture models have the same number of parameters, while the normal mixture model has two additional parameters. To demonstrate how the EM algorithm works, we focus on the two models with fewer parameters, denoting the parameter vector $\boldsymbol{\theta} = (\pi_0, \pi_1, \theta_1, \theta_2)^T$. Here $\theta_1$, $\theta_2$ correspond to the appropriate model, i.e. for the discrete model $\theta_1 = a_1$, $\theta_2 = a_2$, and for the exponential mixture $\theta_1 = \lambda_1$, $\theta_2 = \lambda_2$. Let $t_i$ and (unobserved) $\delta_i$ be the realized values of T and $\Delta$ for the i-th gene, respectively, so that the complete data is the set $\{t_i, \delta_i\}$. The complete data log-likelihood is
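$$\ell_c(\boldsymbol{\theta}) = \sum_{i=1}^{m} \Big\{ I(\delta_i = 0)\log\pi_0 + I(\delta_i > 0)\,[\log\pi_1 + \log f_1(\delta_i; \theta_1)] + I(\delta_i < 0)\,[\log\pi_2 + \log f_2(\delta_i; \theta_2)] + \log g_n(t_i \mid \delta_i) \Big\}, \tag{8}$$

where $\pi_2 = 1 - \pi_0 - \pi_1$, $g_n(t \mid \delta)$ is the non-central t density with $2n - 2$ degrees of freedom and non-centrality parameter $\delta\sqrt{n/2}$, and $f_1(\cdot; \theta_1)$, $f_2(\cdot; \theta_2)$ are the densities (point masses in the discrete model) of positive and negative $\delta$, with $F_1$, $F_2$ the corresponding CDFs. The E-step updates the expected values of unobserved quantities in Equation (8), and we let

$$w_{i0}^{(k)} = \frac{\pi_0^{(k)}\, g_n(t_i \mid 0)}{h^{(k)}(t_i)}, \qquad w_{ij}^{(k)} = \frac{\pi_j^{(k)} \int g_n(t_i \mid \delta)\, dF_j(\delta; \theta_j^{(k)})}{h^{(k)}(t_i)}, \quad j = 1, 2,$$

$$h^{(k)}(t_i) = \pi_0^{(k)}\, g_n(t_i \mid 0) + \sum_{j=1}^{2} \pi_j^{(k)} \int g_n(t_i \mid \delta)\, dF_j(\delta; \theta_j^{(k)}),$$

so that $w_{ij}^{(k)}$ is the conditional probability that gene i belongs to component j given $t_i$ and the parameter values at iteration k. The M-step maximizes the expected complete data log-likelihood to obtain the parameter values at iteration k + 1. For the probability mass estimates, we have

$$\pi_j^{(k+1)} = \frac{1}{m} \sum_{i=1}^{m} w_{ij}^{(k)}, \quad j = 0, 1, 2.$$

The updates for $\theta_1$, $\theta_2$ are specific to the model. Despite the simplicity of the discrete model, no closed form is available to update $a_1$, $a_2$. For this and the continuous models, numerical integration and maximization of the expected log-likelihood were performed using functions in R. For the exponential mixture model, updates for the $F_1$ and $F_2$ parameters can be represented by

$$\lambda_1^{(k+1)} = \frac{\sum_i w_{i1}^{(k)}}{\sum_i w_{i1}^{(k)}\, E(\Delta \mid t_i, \Delta > 0, \boldsymbol{\theta}^{(k)})}, \qquad \lambda_2^{(k+1)} = \frac{\sum_i w_{i2}^{(k)}}{\sum_i w_{i2}^{(k)}\, E(|\Delta| \mid t_i, \Delta < 0, \boldsymbol{\theta}^{(k)})},$$

where the expectations are taken with respect to the conditional (posterior) $\Delta$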
distribution. We used the EM procedure to analyze a real dataset, to demonstrate the applicability of our approach.
| EM ALGORITHM RESULTS FOR SIMULATED AND REAL DATASETS |
|---|
Simulation studies
We implemented the normal mixture model in simulations to test the EM algorithm and to highlight any difficulties in handling the greater number of parameters in the model, because the real dataset (discussed later) favored this model. The number of genes per array was set to m = 10 000, with $\pi_0 = 0.8$ and $\pi_1 = \pi_2 = 0.1$. There were n = 5 arrays under each condition. The distribution of $\Delta$ was assumed to follow the normal mixture model with location $\nu_1 = 1$, scale […] and location $\nu_2 = 1$, scale […]. To generate the data with the appropriate characteristics, for each gene i we needed to simulate a $\delta_i$ from F, and then generate an expression profile for the gene consistent with that effect size. We implemented the approach described by Hu and Wright (2004) based on the analysis and modeling of four Affymetrix datasets. The values of $\mu_1$ were generated as independent […] observations multiplied by 1000. For fixed $\beta_0 = 4.6$ and $\beta_1 = 1.7$, $\sigma_1^2$ was obtained from the mean-variance model […]. Once $\mu_1$ and $\sigma_1^2$ were chosen, the remaining vectors were obtained in one of two ways: (1) $\sigma_2^2$ was chosen as $\sigma_1^2$, and then $\mu_2$ chosen to accord with the choice of $\delta$ for each gene (the equal variance scenario). (2) $\mu_2$ and $\sigma_2^2$ were chosen to satisfy both the choice of $\delta$ for the gene and the mean-variance model for condition 2 (the unequal variance scenario). The unequal variance scenario is more realistic, and the simulations allow us to examine the effect of this modest departure from the assumed model when estimating the pFDR and sample sizes.
We also explored the effect of correlation among sets of genes on the array, by introducing correlation among blocks of genes. Within a block of genes of size B, we wanted the correlation coefficient $\rho$ between all pairs of genes. We assumed that all genes within the block had the same $\mu_1$, $\mu_2$, $\sigma_1^2$ and $\sigma_2^2$, and therefore the same $\delta$. Within condition k (k = 1,2), we generated a prototypical gene expression profile $w_{k1}, w_{k2}, \ldots, w_{kn}$ as iid N(0,1). Then for gene i, i = 1,..., B, we let

$$x_{kij} = \mu_k + \sigma_k\left(\sqrt{\rho}\, w_{kj} + \sqrt{1 - \rho}\, e_{kij}\right), \quad j = 1, \ldots, n,$$

where the $e_{kij}$ are all iid N(0,1), independent of the $w_{kj}$. Thus w served as a latent expression profile to produce the x values. It is easy to confirm that within a block and within each condition the expression values for different genes have the correlation $\rho$ and the appropriate means and variances.
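A latent-profile construction of this kind can be sketched in a few lines of Python. The block below assumes one particular form, x = μ + σ(√ρ w + √(1 − ρ) e) with w and e independent standard normal, which yields pairwise correlation ρ between genes in the block along with mean μ and variance σ²; whether this matches the authors' exact construction is an assumption. Function names are hypothetical.

```python
import math
import random


def simulate_block(rng, block_size, n_arrays, mu, sigma, rho):
    """Expression values for one block of genes in one condition.

    Returns a list of block_size rows, each with n_arrays values.
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1] for this construction")
    latent = [rng.gauss(0.0, 1.0) for _ in range(n_arrays)]   # shared profile w
    shared, own = math.sqrt(rho), math.sqrt(1.0 - rho)
    return [
        [mu + sigma * (shared * latent[j] + own * rng.gauss(0.0, 1.0))
         for j in range(n_arrays)]
        for _ in range(block_size)
    ]


def pearson(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

With only n = 5 arrays the sample correlation between two genes is very noisy, so checking the construction requires a much larger number of simulated arrays than a real experiment would have.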
Figure 1 shows the model fits of the three different distributions when the true distribution is truncated normal as described above. The top row shows the results for the equal variance, no correlation scenario. This scenario corresponds to the situation where the model assumptions hold exactly, and the fit is indicated by quantile-quantile (QQ) plots of the observed 10 000 t-statistics versus 100 000 simulations of T using the fitted model. The bottom row shows the fits for the situation where two aspects of the model assumptions do not hold: the variances are unequal and genes are correlated ($\rho$ = 0.5) in blocks of size B = 50. Under both scenarios, the correct truncated normal model fits the best, especially in the extreme tails that typically form rejection regions. This result is confirmed by Pearson correlation coefficients of the QQ plots and Kolmogorov-Smirnov statistics for the simulated versus observed values. For the two other intermediate scenarios (equal variance with correlation, unequal variance with no correlation), the truncated normal model is also correctly identified (data not shown).
For the two scenarios considered in Figure 1, we used the fitted F distributions for each of the three model types to plot the theoretical pFDRs at various E(R) values (Fig. 2). These fitted values can be compared with the true pFDR using the known parameters (bold curve). Not surprisingly, the curve under the estimated normal mixture model is closest to the true model under both the scenarios.
Another comparison is provided by an empirical estimate of the pFDR created by comparing the observed t-statistics with those generated under 252 exhaustively permuted (therefore null) assignments of arrays to the two conditions. Essentially this is the approach implemented in the popular SAM software (Tusher et al., 2001; Storey and Tibshirani, 2003). We consider the t-statistics arising under different permutations to be exchangeable (Reiner et al., 2003), which leads to use of the mean number of rejected genes for each permutation. This is a slight difference from the SAM procedure, which counts the median number of rejected genes for each null permutation (Chu et al., 2001). The resulting empirical pFDR estimate is shown as the thick dashed line (Fig. 2). The jagged appearance of the empirical pFDR is due to the finite number of genes in the observed dataset. Note that the empirical pFDR is quite far from the pFDR obtained under the true model. This phenomenon appears to largely result from an overestimate of $\pi_0$ (approximately 0.95 in both cases, versus the true value of 0.8), which has a direct effect on the estimated pFDR, and leads to a less-extreme estimated rejection threshold, also increasing the apparent pFDR. We expect that this may often occur in situations where $\Delta$ follows a continuous distribution, because many genes with small $\delta$ may appear to be effectively null. From Figure 2 and the other intermediate scenarios (equal variance with correlation, unequal variance with no correlation, data not shown), we note that neither unequal variances nor the block correlation structure among genes has a great impact on the pFDR estimates.
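The permutation comparison can be sketched as follows. With 5 arrays per condition there are C(10, 5) = 252 ways to choose which arrays are labeled condition 1, and the empirical estimate divides the mean number of null rejections per assignment by the observed number of rejections. The Python below is an illustrative simplification, not the SAM implementation: it uses a fixed threshold on |T|, takes an optional multiplier `pi0` (default 1) as an assumed input rather than estimating it, and assumes equal group sizes. All names are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean, variance


def t_stat(x1, x2):
    """Equal-n two-sample t statistic."""
    diff = mean(x1) - mean(x2)
    pooled = (variance(x1) + variance(x2)) / len(x1)
    if pooled == 0.0:
        return 0.0 if diff == 0.0 else math.copysign(math.inf, diff)
    return diff / math.sqrt(pooled)


def label_assignments(n_total, n_group1):
    """All ways to pick which arrays are labeled as condition 1."""
    return list(combinations(range(n_total), n_group1))


def permutation_fdr(data, n1, threshold, pi0=1.0):
    """Empirical FDR estimate; data rows hold condition 1 values first."""
    n_total = len(data[0])
    if n_total != 2 * n1:
        raise ValueError("this sketch assumes equal group sizes")
    observed = sum(abs(t_stat(g[:n1], g[n1:])) >= threshold for g in data)
    if observed == 0:
        return None                              # nothing rejected, estimate undefined
    null_counts = []
    for idx in label_assignments(n_total, n1):
        chosen = set(idx)
        count = 0
        for g in data:
            x1 = [g[j] for j in idx]
            x2 = [g[j] for j in range(n_total) if j not in chosen]
            if abs(t_stat(x1, x2)) >= threshold:
                count += 1
        null_counts.append(count)
    # Mean (rather than median) number of null rejections per assignment.
    return min(1.0, pi0 * mean(null_counts) / observed)
```

The enumeration includes the original labeling and its mirror image, so the mean null count is never exactly zero when at least one gene is rejected.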
A real dataset
We applied the EM algorithm for the three parametric models to a murine lung microarray dataset submitted to Gene Expression Omnibus (E. Hoffman, Children's National Medical Center, GDS251). The study used the Affymetrix U74Av2 array (12 488 probesets) and compared samples from the C57BL6/J (n = 12) and Balb/c (n = 12) strains in a study of sensitivity to pulmonary fibrosis. The study had additional factors balanced over the strains, but we applied the simple two-sample comparison to illustrate our approach.
The two-sample t-statistics for all the genes were computed. The three models discussed above were fitted separately. The parameter estimates for the discrete model were […], […], […], […] and […]. The exponential mixture model estimates were […], […], […], […] and […]. For the truncated normal mixture model, the estimates were […], […], […], […], […], […] and […].
The $\pi_0$ estimates varied among the models. However, we note that the continuous $F_1$ and $F_2$ models allow mass near zero, for which the genes are effectively null. Accordingly, and as suggested earlier, the mean and standard deviations of the corresponding $\Delta$ distributions are fairly comparable. For example, the fit to the discrete model gives E($\Delta$) = 0.033, SD($\Delta$) = 0.207, while for the exponential mixture model the corresponding values are 0.058 and 0.227, and for the normal mixture model are 0.031 and 0.157.
Although the true form of F is unknown, an indication of model fit is given in QQ plots of the observed 12 488 t-statistics versus 100 000 simulations of T (first three panels in Fig. 3). It is clear that the normal mixture F model offers the best fit, especially in the extreme tails (with the possible exception of the two most extreme genes).
As a simple illustration of effectiveness of the model fitting procedure, we performed 1000 simulations of 12 488 t-statistics using the normal mixture model, fixing the parameters at the values estimated from the murine pulmonary dataset. The true values were compared with the observed means ± SD as follows: $\pi_0$ = 0.848 (0.851 ± 0.005); $\pi_1$ = 0.119 (0.117 ± 0.002); $\nu_1$ = 0.346 (0.367 ± 0.012); $\nu_2$ = 0.344 (0.398 ± 0.025), […] and […]. This amount of variation in the estimates has only a minimal effect on sample size computation. Figure 3 (lower right) shows the sample sizes needed to reject varying numbers of genes at a variety of pFDR values. pFDR results were computed using numeric integration. pFDRs for given sample sizes along a range of expected number of rejected genes are exhibited in Table 3, along with the expected number of rejected genes for given sample sizes at a variety of pFDR values. For this dataset, rejection of many genes requires many more arrays than were run in the analyzed dataset. For example, while n = 14 arrays in each condition would be required to reject 30 genes while controlling the pFDR at 0.05, n = 26 would be required to control the pFDR at 0.05 for 300 rejected genes (Fig. 3 and Table 3).
Similar tables and curves will be of use to practicing researchers who wish to balance competing considerations in the sample size versus experimental costs. R computer code to fit the three parametric models, and estimate sample sizes and pFDR values is available from the authors.
| CONCLUDING REMARKS |
|---|
We have described an approach to estimate sample sizes necessary to control the FDR in microarray experiments. By explicitly modeling the degree of differential expression under the alternative hypothesis, we expand the framework of the FDR. This approach is necessary, as the specific tail behavior of $\Delta$ under the alternative is important in determining the pFDR. Although we illustrated our method using a few simple parametric models for $\Delta$, a larger and more flexible range of parameterizations might be explored. Alternatively, empirical approaches might be used, in which the t-statistics are appropriately shrunk toward a common mean to provide an estimate of F. Other extensions to our approach will account for more powerful statistics and a wider range of experimental designs. The essence of the approach, however, will remain unchanged as the FDR will depend on the distribution F of unknown $\delta$, which can be estimated from observed Ts.
Extensions of our two-sample approach to more general linear model situations (e.g. ANOVA and regression) require a more general specification of $\delta$, such as the ratio of explained to residual variance in the linear model. In principle, such an approach is not difficult, but again requires that the sample size be the same under each condition, or at least that the sample size allocation across conditions be preserved from the training set to the anticipated larger study. We do not view the sample size allocation restriction as a major drawback as designs with (nearly) equal numbers of arrays in each condition are common, and sensitivity analysis can be used for reassurance in cases of modest departure from equal allocation.
In practice we may wish to develop sample size estimates by using some similar pilot study, or published data from biological settings that are different, but considered similar enough to the proposed experiment in order to provide meaningful predictions. Here our parametric estimate of F from the published data leads to a compact estimate of the distribution of effect sizes that might be easily carried over to different settings, or possibly to different array platforms. By computing the pFDR under a number of parametric models, it is also straightforward to be conservative in the sample size calculation. For example, we might base our sample size on conservative pFDR estimates from Figure 2, using the greatest pFDR among the three models for each E(R).
Finally, we note that a limitation on the sample size choice can occur when the proportion of truly null genes, $\pi_0$, is very close to 1, so that for a fixed $p_0$, E(R) reaches a limit with increasing n. Essentially this results from an inequality implicit in Equation (4): because $\Pr(P \leq p_0)$ can never exceed $\pi_0 p_0 + (1 - \pi_0)$,

$$\mathrm{pFDR} \geq \frac{\pi_0 p_0}{\pi_0 p_0 + (1 - \pi_0)}. \tag{9}$$

The lower bound is appreciable when $\pi_0$ is very close to 1, and for fixed $p_0$ we must always be rejecting some null genes.
| Acknowledgments |
|---|
The authors would like to thank the editors and reviewers for helpful comments that strengthened the manuscript. This work is supported in part by NIH grant 3 P30 HD003110.
Conflict of Interest: none declared.
Received on January 3, 2005; revised on May 22, 2005; accepted on May 25, 2005
| REFERENCES |
|---|
Black, M.A. and Doerge, R.W. (2002) Calculation of the minimum number of replicate spots required for detection of significant gene expression fold change in microarray experiments. Bioinformatics, 18, 1609-1616.
Chen, Y., et al. (1997) Ratio-based decisions and the quantitative analysis of cDNA microarray images. J. Biomed. Optics, 2, 364-367.
Chu, G., Narasimhan, B., Tibshirani, R., Tusher, V. (2001) SAM, "Significance Analysis of Microarrays": Users Guide and Technical Document.
Cohen, J. (1988) Statistical Power Analysis for the Behavioral Sciences, 2nd edn. Erlbaum, Hillsdale, NJ.
Dempster, A.P., et al. (1977) Maximum likelihood estimation from incomplete data via the EM algorithm (with discussion). J. R. Stat. Soc. Ser. B, 39, 1-38.
Dudoit, S., et al. (2002) Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Stat. Sinica, 12, 111-139.
Efron, B., et al. (2001) Empirical Bayes analysis of a microarray experiment. J. Am. Stat. Assoc., 96, 1151-1160.
Giles, P. and Kipling, D. (2003) Normality of oligonucleotide microarray data and implications for parametric statistical analyses. Bioinformatics, 19, 2254-2262.
Hu, J. and Wright, F.A. (2004) Assessing differential gene expression with small sample sizes in oligonucleotide arrays using a mean-variance model. Submitted.
Ideker, T., et al. (2000) Testing for differentially expressed genes by maximum likelihood analysis of microarray data. J. Comput. Biol., 7, 805-817.
Jin, W., et al. (2001) The contributions of sex, genotype and age to transcriptional variance in Drosophila melanogaster. Nat. Genet., 29, 389-395.
Johnson, N.L., Kotz, S., Balakrishnan, N. (1994) Continuous Univariate Distributions, 2nd edn. Wiley, NY.
Lee, M.T. and Whitmore, G.A. (2002) Power and sample size for DNA microarray studies. Stat. Med., 21, 3543-3570.
Liao, J.G., et al. (2004) A mixture model for estimating the local false discovery rate in DNA microarray analysis. Bioinformatics, 20, 2694-2701.
Neter, J., Wasserman, W., Kutner, M.H. (1985) Applied Linear Statistical Models, 2nd edn. Irwin, Homewood, IL.
Pan, W., et al. (2001) A mixture model approach to detecting differentially expressed genes in replicated microarray experiments. Funct. Integr. Genomics, 3, 117-124.
Pan, W., et al. (2002) How many replicates of arrays are required to detect gene expression changes in microarray experiments? A mixture model approach. Genome Biol., 3, research0022.1-0022.10.
Parmigiani, G., et al. (2002) A statistical framework for expression-based molecular classification in cancer. J. R. Stat. Soc. Ser. B, 64, 717-736.
Reiner, A., et al. (2003) Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics, 19, 368-375.
Storey, J.D. (2003) The positive false discovery rate: a Bayesian interpretation and the Q-value. Ann. Stat., 31, 2013-2035.
Storey, J.D. and Tibshirani, R. (2003) The Analysis of Gene Expression Data: Methods and Software. Springer, NY.
Troyanskaya, O.G., et al. (2002) Nonparametric methods for identifying differentially expressed genes in microarray data. Bioinformatics, 18, 1454-1461.
Tusher, V.G., Tibshirani, R., Chu, G. (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA, 98, 5116-5121.
Welch, B.L. (1947) The generalization of 'Student's' problem when several different population variances are involved. Biometrika, 34, 28-35.
Yoon, H., et al. (2002) Gene expression profiling of isogenic cells with different tp53 gene dosage reveals numerous genes that are affected by tp53 dosage and identifies cspg2 as a direct target of p53. Proc. Natl Acad. Sci. USA, 99, 15632-15637.