Bioinformatics Advance Access originally published online on August 16, 2005
Bioinformatics 2005 21(20):3865-3872; doi:10.1093/bioinformatics/bti626
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Bias in the estimation of false discovery rate in microarray studies
1Department of Medical Epidemiology and Biostatistics, Karolinska Institutet 17177 Stockholm, Sweden
2Genome Institute of Singapore Singapore
3Unit of Biostatistics and Epidemiology, Institute Gustave Roussy Villejuif, France
*To whom correspondence should be adderessed.
| Abstract |
|---|
|
|
|---|
Motivation: The false discovery rate (FDR) provides a key statistical assessment for microarray studies. Its value depends on the proportion
0 of non-differentially expressed (non-DE) genes. In most microarray studies, many genes have small effects not easily separable from non-DE genes. As a result, current methods often overestimate
0 and FDR, leading to unnecessary loss of power in the overall analysis.
Methods: For the common two-sample comparison we derive a natural mixture model of the test statistic and an explicit bias formula in the standard estimation of
0. We suggest an improved estimation of
0 based on the mixture model and describe a practical likelihood-based procedure for this purpose.
Results: The analysis shows that a large bias occurs when
0 is far from 1 and when the non-centrality parameters of the distribution of the test statistic are near zero. The theoretical result also explains substantial discrepancies between non-parametric and model-based estimates of
0. Simulation studies indicate mixture-model estimates are less biased than standard estimates. The method is applied to breast cancer and lymphoma data examples.
Availability: An R-package OCplus containing functions to compute
0 based on the mixture model, the resulting FDR and other operating characteristics of microarray data, is freely available at http://www.meb.ki.se/~yudpaw
Contact: yudi.pawitan{at}meb.ki.se and alexander.ploner{at}meb.ki.se
| 1 INTRODUCTION |
|---|
|
|
|---|
The false discovery rate (FDR) is defined as the expected proportion of false positives among the list of genes declared to be differentially expressed (DE). This directly useful interpretation allows the researcher to balance the length of their candidate list against its quality, and this makes the FDR a more useful metric for controlling the false positive rate than the standard P-value. Benjamini and Hochberg (1995) originally introduced a procedure to control FDR by a sequential P-value adjustment, but it is clear that the FDR can be considered simply as an unknown parameter to be estimated from data [Storey (2002), Tsai et al. (2003), Broët et al. (2004), Pounds and Cheng (2004), Datta and Datta (2005), Dalmasso et al. (2005), Grant et al. (2005), Pawitan et al. (2005)]. The purpose of this paper is to discuss the bias problem in the current estimation of FDR.
It is evident from theory that FDR depends on the proportion
0 of non-differentially expressed (non-DE) genes. In our experience with many microarray data, it is a common feature that there are many genes with small effects. When coupled with small sample size, these genes become barely separable from truly non-DE genes. Using a simple example later, we will show that this is the key determinant of bias in the estimation of
0 and FDR. We develop a mixture-model theory to explain exactly the reasons of bias in two-sample comparison problems. The mixture model also suggests a less-biased estimation of
0. In summary, novel contributions of the paper include (i) a theoretical explanation and formula of the bias in the current estimation of
0 and FDR, hence providing a theoretical connection between the model-based and non-parametric estimation of
0, and (ii) a practical likelihood-based procedure to improve the estimation of
0 based on the mixture model.
Although the ideas apply generally, we consider the common two-group comparison problem using a certain statistic Z to compare the mean log-expression level. The distribution F of the observed statistics is a mixture of the form
![]() | (1) |
0 is the proportion of truly non-DE genes, F0(z) is the distribution of the statistics for non-DE genes, and F1(z) for DE genes. Using a symmetric two-sided procedure, so that we declare DE genes if |Z| > c for a given critical value c > 0, we can compute the proportion of declared DE genes as {F(c) + 1 F(c)}. If F is symmetric, this is equal to 2{1 F(c)}, but we will use the more general asymmetric formulas. Similarly, the classical significance level is equal to {F0(c) + 1 F0(c)}. The FDR as a function of c is then given by
![]() |
0, F0 and F. The estimate of F is immediate from the observed statistics, using the empirical distribution function. The null distribution F0 can be generated using a permutation argument, since for non-DE genes the group labels can be permuted without changing the distribution (Efron et al., 2001). A commonly used formula for estimating
0 (Storey, 2002) is
![]() | (2) |
, and the P-values are associated with statistics zis. This nonparametric or model-free formula has been justified intuitively with the argument that the largest P-values come from non-DE genes, and for these genes the P-values are uniformly distributed between 0 and 1, and
![]() |
such as 0.5 or 0.75 are often used; Storey (2002) proposed optimizing the mean-squared error in the estimation of FDR, but usually there is little difference in the resulting estimate. Since the density f(p) of the P-value distribution is expected to be monotone decreasing, all existing non-parametric estimation procedures of
0 essentially attempt to estimate f(1), the right-hand limit of the P-value density.
Hence, given the estimates
,
and
, we can compute the empirical FDR by
![]() |
![]() | (3) |
In terms of the distribution of P-values we have an equivalent formula
![]() | (4) |
is the empirical distribution of P-values, i.e. the proportion of P-values less than or equal to
. If we order the P-values from smallest to largest as p(1), ..., p(m), at
= p(k), we have
, and exactly
![]() |
To demonstrate the bias problem in the estimation of FDR, we start with a relatively simple theoretical model. Later we show that this model captures most of the characteristics of real data. Suppose we have two groups with n arrays per group, the proportion of truly non-DE genes is
0, and for DE genes the log-fold changes (in log-2 scale) are distributed at ±D with equal proportions. In group 1, the log-expression values are normally distributed with mean zero and variance
2. In group 2, the non-DE genes are log-normal with mean zero and variance
2, while the DE genes are log-normal with mean ±D and variance
2. So, the distribution of the observed t-statistics is a mixture of the form
![]() | (5) |
![]() | (6) |
and
, respectively (Pawitan et al., 2005).
Figure 1a shows the histogram of P-values obtained from 10 000 t-statistics simulated from the model above with n = 10 arrays per group,
0 = 0.6, D = 1.33 and
= 1. The P-values are computed using the the empirical null distribution F0 based on permutations (which in this case is almost identical to the theoretical null distribution F0). The horizontal line at the y-axis equal to
0 = 0.6 matches the level of the histogram at larger P-values, so the bias in estimating
0 using (2) is negligible. Figure 1b on the other hand shows the P-values for a dataset simulated from the same parameters, except for a smaller log-fold change D = 0.45. Here the bias is obvious, as the histogram clearly converges at a much higher level than
0 = 0.6.
|
Figure 1c and d show the effect of the bias in the estimation of
0 on the estimate of FDR. For the data with a large effect size in Figure 1c, the estimated FDR (solid line) and the true FDR (dashed line) are within the 90%-confidence band for the estimated FDR. However, for the data with small effect size in Figure 1d, there is a large difference between the estimated and the true FDR, evidently larger than the sampling variability in the estimated FDR as indicated by the confidence bands. To distinguish it from the latter estimate, sometimes we will call this bias-inflated estimated FDR the apparent FDR. Note that this same problem occurred in the simulation results (Table 5) of Taylor et al. (2005), though the authors did not comment on it. | 2 METHODS |
|---|
|
|
|---|
2.1 Conventional estimation of
0Conventional approaches to estimating
0 can be motivated by the model (1)
![]() |
0 is close to 1, or if there is an interval [a,b] so that F1(z) is almost constant between a and b, then we can use the resulting approximation
![]() | (7) |
0 by
![]() |
![]() |
, for example 0.1 or 0.25, where t1, ..., tm are the observed t-statistics. We also get the standard estimator (2) by using the P-value as the non-parametric test statistic and setting e.g. [a,b] = [0.5, 1]. Alternatively, we can take the limit of the interval [a, b] in (7) to a point z0 representative of the null hypothesis, e.g. z0 = 0 when using a t-statistic, in which case we get
![]() |
However, these simple estimates can be seriously biased if
0 is far from 1 and there is no suitable interval [a, b]. The latter means F1 does not have a flat region, occurring when over-expressed and under-expressed genes do not separate cleanly, as happens when there are many genes with small effects. At best these estimates only work as loose upper bounds for
0.
2.2 Bias for a simple alternative
To analyse the bias theoretically, from model (5), the resulting distribution D(p) of the P-value is also a mixture of the form
![]() |
![]() |
Using some calculus, the P-value density f(p) is a mixture
![]() | (8) |
![]() |
![]() |
![]() |
As previously noted, a non-parametric estimate of
0 such as (2) tries to estimate f(1), so from (8), it is a biased estimate and the amount of bias is given by
![]() | (9) |
0 caused by the bias is the same as the relative error in the FDR:
![]() |
2.3 Bias for a general mixture alternative
Let us assume that the log-fold changes are distributed at discrete values {D0
0, D1, ..., Dq}, with probabilities {
0,
1, ...,
q}. Let N = n1 + n2 be the combined sample size in both groups, where n1 is the number of arrays in group 1 and n2 in group 2. The previous results can then be generalized as follows: the distribution of the test statistics is a mixture
![]() | (10) |
![]() |
0 = 0 and G0(t)
F0(t), the central t distribution with N 2 degrees of freedom. The resulting distribution D(p) of the two-sided P-value is a mixture of the form
![]() |
The P-value density f(p) is a mixture
![]() | (11) |
![]() |
0 is
![]() |
2.4 Estimating
0 from the general mixture model
Since the bias formula depends on the unknown mixing probabilities, in practice it does not help us in estimating
0. The key problem in the standard methods such as (2) is that genes with small effects have large P-values, so any method that relies on the proportion of large P-values will be seriously biased. Using the observed statistics t1, ..., tm as data, an alternative estimate of
0 that is much less biased can be obtained from the mixture model (10). In this model, we have 2q+1 unknown parameters
![]() |
. Note that no assumptions are made regarding individual variances
2. The number of components q can be chosen large enough so that the model captures the distribution of the observed statistics. Specifying too many components typically leads to replication of certain components, splitting the corresponding p0 across several very similar non-centrality parameters. For an objective choice, we can use a model selection criterion such as the (Akaike information criterion (AIC) [e.g. Pawitan (2001), Section 13.5] or the Bayesian Information Criterion (BIC)).
As a working model, assume that the statistics are independent, so that we can immediately apply a likelihood-based method to estimate the parameters. (In practice, the effect of dependence on inference can be accounted for by using the bootstrap.) Based on the data t1,...,tm, the log-likelihood is
![]() |
j, and
0
0. An attractive feature of this mixture model is that, as shown in the previous section, the non-centrality parameter
j can be interpreted in terms of effect size. Furthermore, if the total sample size N is >30, the t-density fj() can be well approximated by the normal density with mean
j and variance one, which leads to simpler formulas.
A commonly used algorithm to estimate mixture models is given by the EM algorithm, described e.g. in [Pawitan (2001), Section 12.5]. We found, however, that because there is typically a large number of genes, the EM algorithm is too slow for routine use. This is compounded by the problem of local maxima in the likelihood surface, so a proper estimation requires replications with many different starting values. We derive a much faster algorithm based on prebinning the data into a relatively fine grid of intervals. We first split the range of the observed statistics into B equispaced bins (Tk1, Tk)s, for k = 1, ..., B. Let y = (y1, ..., yB) be the number of statistics tis that fall in the bins. Then y has a multinomial distribution with probabilities (
1, ...,
B), where
![]() | (12) |
![]() | (13) |
ks, using (10) and (12). To deal with the constraints on the probabilities, we use the logistic transform
![]() |
1, ...,
q) without constraints.
Given the complete set of estimated mixture parameters, we could theoretically proceed to estimate the FDR using the estimated mixture distribution F(t) in (10). We have, however, found that that it is difficult for the mixture model to capture the tail behavior of F(t) properly, and that the estimation of the mixture parameters corresponding to the contribution of the DE genes can be sensitive to the degree of binning used. In contrast, the estimation of
0 is robust towards the effects of binning, and appears unbiased for sufficient effect sizes; see the simulation results below. Intuitively, this is not surprising, as the estimation of
0 relies on the center of the distribution, which has better signal-to-noise ratio, whereas the estimation of the
j for j > 0, and the corresponding effects
js, will rely to a larger degree on the sparsely populated tail bins of the distribution.
In practice, we suggest therefore the use of the mixture model to estimate
0 only, and the use of the empirical distribution functions to estimate F(t) and F0(t) for the purpose of computing the FDR. The other parameters of the mixture model can still be usefully interpreted as descriptive statistics, or for planning the sample size of future studies (Pawitan et al., 2005). The estimated effect sizes provide information about bias of the standard FDR estimate. An R-package OCplus containing functions to compute
0 based on the mixture model, the resulting FDR and other operating characteristics of microarray data, is freely available at http://www.meb.ki.se/~yudpaw.
| 3 RESULTS |
|---|
|
|
|---|
3.1 Determinants of bias
From formula (9), we can show that the bias b in estimating
0 is a decreasing function of the non-centrality parameter
; see Figure 2a. For example, for our previous example, using n = 10,
= 0.6 and D = 0.45, we have
= 1 and f(1) = 0.84, and the bias in estimating
0 is 0.24 or 40%. Recall that
![]() |
is >2.5 or 3. In Figure 1, for n = 10 and D = 1.33 we have
= 3.0, so there is little bias. The curves in Figure 2a vary very little for different n, so the bias depends on the group size n almost exclusively through the non-centrality parameter
.
|
The bias of
and consequently the estimated FDR on the other hand will depend in a non-trivial manner on the scaled effect size D/
. Figure 2b shows the contour plot of the relative error in the estimation of both
0 and FDR for a fixed group size of n = 10. As expected, the error decreases with increasing effect size and percentage of non-DE genes. This effect is even more pronounced for larger sample sizes (figure not shown).
3.2 Simulated data
To illustrate the theoretical result, we first go back to the previous simulation example with D = 0.45 in Figure 1b and d. Figure 3a shows that the theoretical density f(p) from (8) follows closely the histogram of the P-values. The value f(1) = 0.84the upper horizontal dashed linematches the level of the histogram for large P-values. We can also compute the biased FDR using f(1) instead of the true
0, using the mixture model
![]() | (14) |
0.
|
To assess the factors influencing the estimation of
0, we conducted a larger simulation study using model (5) with a simple two-sided alternative (q = 2). We varied the proportion of non-DE genes
0 = 0.5, 0.7 and 0.9, the effect size as measured by the non-centrality parameter
= 0.5, 1 and 2, the number of arrays per group n = 10 and 30. We set the number of genes to 10 000, and each simulation set-up is run 25 times. The boxplots in Figure 4 summarize the estimation of
0.
|
The results indicate that the non-centrality parameter is by far the most important factor when estimating
using the mixture model. For a comparatively large effect
= 2, all estimates are very close to the true value, regardless of
0 or n. For an intermediate effect
=1, the estimates still appear unbiased, but with larger variability. Only for the small effects
= 0.5, we see clear bias in
. From this simulation we expect to be able to properly estimate components with
around 1 or larger, but have to be more careful if the estimated
is less than one.
3.3 Breast cancer data
We applied the mixture-based estimation of
to expression data collected from patients suffering from hereditary breast cancer (Hedenfalk et al., 2001). The patients had mutations either of the BRCA1 gene (n1 = 7) or the BRCA2 gene (n2 = 8). We compared the two types of tumors using a t-test with pooled variance, where the null distribution was generated via the permutation of group labels.
Figure 5a shows the histogram of the P-values from 3170 genes, with a strong spike near zero, indicating many genes with small P-values. The histogram converges at around 0.67, which is the estimate of
0 using the standard formula (2) with
= 0.7. The resulting apparent FDR curve in Figure 5b (solid) yields 155 genes with FDR <5%, at a critical value of c = 3.31.
|
The AIC for the mixture model is minimized at q = 3 components. This provides the following estimates:
![]() |
as a less-biased estimate of
0. Using the profile likelihood of
0, we obtained the nominal 95% confidence interval of 0.4 <
0 < 0.53. (As we commented previously, a more precise interval can be found by bootstrapping.) Using the mixture estimate 0.43 instead of the biased estimate 0.67 for
0, we calculated the corrected FDR curve as before, shown as dashed line in Figure 5b. We found 215 genes with corrected FDR <5%, at a critical level of c = 2.96. To show that the mixture model largely captures the observed distribution, we also show in Figure 5b the biased FDR implied by the mixture model (dotted), which is close to the apparent FDR but slightly larger in the tail area.
3.4 Lymphoma data
We next applied the mixture model approach to the analysis of 240 cases of diffuse large B-cell lymphoma data described in Rosenwald et al. (2002). The authors used a custom-made DNA microarray chip to collect measurements for 7399 probes. The average clinical follow up for patients was 4.4 years, with 138 deaths occurring during this period. For this analysis, we ignore the censoring information and only compare the 102 survivors versus 138 non-survivors, using again a t-test with pooled variance and a permutation null distribution.
Figure 6a shows the histogram of the P-values, which converges at
0.84, which is also the standard estimate of
0 using formula (2) with
= 0.7. The resulting apparent FDR curve in Figure 6b (solid line) indicates 15 genes with FDR <10%, at critical value c = 3.71.
|
The mixture model with q = 2 has the lowest AIC and provides the following estimates:
![]() |
0 is
(95% confidence interval 0.39 <
0 < 0.69). The corrected FDR resulting from the mixture model (dashed line in Fig. 6b) is again smaller than the apparent FDR, and we find 105 genes with estimated FDR <10%, at critical level c = 3.01. The biased FDR implied by the mixture model (dotted) is also close to the apparent FDR (solid). | 4 DISCUSSION |
|---|
|
|
|---|
The massive multiplicity of gene expression microarray data creates a serious problem when we try to discover genes that are DE between experimental conditions. The classical control of the family-wise error rate (FWER) is overly conservative. When used for gene discovery, the microarray is essentially a screening tool that allows researchers to generate a list of candidate genes that needs to be confirmed by different means. In this context, controlling FWERthe probability of any gene being falsely declared DEis not relevant, and most researchers will accept a small number of false positive genes as the price for finding a good candidate list. This explains why the FDR has become popular for assessing the differential expression of genes.
Recent methods of estimation of FDR are either non-parametric, based on the observed distribution of P-values, or based on mixture models [e.g. Pounds and Cheng (2004), Broët et al. (2004)]. These two methods can produce different results, and our analysis shows that they are not estimating the same quantity, and that the non-parametric estimate of
0 is severely biased upwards if there are many genes with small effects. Intuitively, this occurs because some fraction of these genes are not separable from non-DE genes. As shown in our analysis, the term small effect is objectively captured by a small non-centrality parameter, so it is a function of sample size and measurement variance as well as the true effect size. In general, the bias problem is worse in small experiments. The FDR estimated using the biased estimate of
0 is also biased, so it results in unnecessary loss of power.
The concept of local false discovery rate (fdr, to distinguish it from the global FDR) (Efron et al., 2001) also needs an estimate of
0 and its current use suffers from the bias problem mentioned in this paper.
Note that in (1) we can only get estimates for F and F0 easily using the empirical distribution functions. After we plug in these estimates, the equation still contains two unknown parameters,
0 and F1. Current approaches estimate
0 first, and use this biased estimate to compute
. We have shown that this can lead to severe bias in the subsequent estimation of FDR. We suggest that under these circumstances, it is possible to get less bias by estimating
0 and F1 jointly, using a method that we have described.
Recently Efron (2004) argued that genes of small effects are not interesting and they should be considered null genes. He did not, however, consider the effect of these genes on the bias in estimating
0 and FDR. The problem with such genes is that it requires a large sample to detect them. However, since it is now becoming economically feasible to run large microarray studies to detect these genes, it is not at all obvious at what effect size one should consider these genes uninteresting. Genes that are uninteresting in an experiment with sample size n = 10 can become interesting when n = 40 or larger. It is not unrealistic to imagine studies with triple or quadruple the sample size of many current studies, thus reducing the FDR substantially and allowing the discovery of genes with small effects.
We have shown that the multiplicity of genes in a microarray study allows an estimation of the proportion of genes with relatively small effects. Very likely, the majority of these genes are not detectable in the experiment, i.e. the sensitivity is small (Pawitan et al., 2005). However, we believe it is important to know the proportions and effect sizes, since they will inform us how large our experiment should be to detect them. Our simulation study indicates, however, that there is an inherent limit in the determination of effect size, approximately around a non-centrality parameter of
= 1. At this size the mixture-based estimate is less conservative than the current procedure based on (2). As we have shown in the examples, the mixture-model leads to distinct estimates of
0 and FDR compared to the current estimates.
Conflict of Interest: Murthy declares that he is currently conducting research supported by the Genome Institute of Singapore, Singapore.
Received on June 8, 2005; revised on July 11, 2005; accepted on August 10, 2005
| REFERENCES |
|---|
|
|
|---|
Benjamini, Y. and Hochberg, Y. (1995) Controling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Statist. Soc. B, 57, 289300.
Broët, P., et al. (2004) A mixture model-based strategy for selecting sets of genes in multiclass response microarray experiments. Bioinformatics, 20, 25622571
Dalmasso, C., et al. (2005) A simple procedure for estimating the false discovery rate. Bioinformatics, 21, 660668
Datta, S. and Datta, S. (2005) Empirical Bayes screening of many P-values with applications to microarray studies. Bioinformatics, 21, 19871994
Efron, B., et al. (2001) Empirical Bayes analysis of a microarray experiment. J. Am. Stat. Soc., 96, 11511160.
Efron, B. (2004) Large-scale simultaneous hypothesis testing: the choice of a null hypothesis. J. Am. Stat. Soc., 99, 96104.
Grant, G.R., et al. (2005) A practical false discovery rate approach to identifying patterns of differential expression in microarray data. Bioinformatics, 21, 26842690
Hedenfalk, I., et al. (2001) Gene-expression profiles in hereditary breast cancer. N. Engl. J. Med., 344, 539548
Pawitan, Y. In All Likelihood: Statistical Modelling and Inference Using Likelihood, (2001) , Oxford Oxford University Press.
Pawitan, Y., et al. (2005) False discovery rate, sensitivity and sample size for microarray studies. Bioinformatics, 21, 30173024
Pounds, S. and Cheng, C. (2004) Improving false discovery rate estimation. Bioinformatics, 20, 17371745
Rosenwald, A., et al. (2002) The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. N. Engl. J. Med., 346, 19371947
Storey, J.D. (2002) A direct approach to false discovery rates. J. R. Statist. Soc. B., 64, 479498[CrossRef].
Taylor, J., et al. (2005) The miss rate for the analysis of gene expression data. Biostatistics, 6, 111117[Abstract].
Tsai, C.A., et al. (2003) Estimation of false discovery rates in multiple testing: application to gene microarray data. Biometrics, 59, 10711081[CrossRef][ISI][Medline].
This article has been cited by other articles:
![]() |
Y. Pawitan, S. Calza, and A. Ploner Estimation of false discovery proportion under general dependence Bioinformatics, December 15, 2006; 22(24): 3025 - 3031. [Abstract] [Full Text] [PDF] |
||||
![]() |
T. S. Mehta, S. O. Zakharkin, G. L. Gadbury, and D. B. Allison Epistemological issues in omics and high-dimensional biology: give the people what they want Physiol Genomics, December 13, 2006; 28(1): 24 - 32. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. Ploner, S. Calza, A. Gusnanto, and Y. Pawitan Multidimensional local false discovery rate for microarray studies Bioinformatics, March 1, 2006; 22(5): 556 - 565. [Abstract] [Full Text] [PDF] |
||||
![]() |
L. Wolfram, E. Haas, and P. Bauerfeind Nickel Represses the Synthesis of the Nickel Permease NixA of Helicobacter pylori J. Bacteriol., February 15, 2006; 188(4): 1245 - 1250. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||









































(solid line), the corrected FDR using the estimate based on the mixture model (dashed line) and the biased FDR implied by the mixture model (dotted line).

(solid line), the corrected FDR using the estimate based on the mixture model (10) (dashed line) and the biased FDR implied by the mixture model (dotted).


