

Bioinformatics Advance Access originally published online on April 19, 2008
Bioinformatics 2008 24(10):1225-1228; doi:10.1093/bioinformatics/btn120

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

A note on the false discovery rate and inconsistent comparisons between experiments

Roger Higdon 1, Gerald van Belle 2 and Eugene Kolker 1,3,*

1Seattle Children's Research Institute, Seattle, WA 98101, 2Departments of Biostatistics and Environmental and Occupational Health Sciences, University of Washington and 3Division of Biomedical Informatics, Department of Medical Education and Biomedical Informatics, University of Washington, Seattle, WA 98195, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 RESULTS AND DISCUSSION
 3 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: The false discovery rate (FDR) has been widely adopted to address the multiple comparisons issue in high-throughput experiments such as microarray gene-expression studies. However, while the FDR is quite useful as an approach to limit false discoveries within a single experiment, like other multiple comparison corrections it may be an inappropriate way to compare results across experiments. This article uses several examples based on gene-expression data to demonstrate the potential misinterpretations that can arise from using FDR to compare across experiments. Researchers should be aware of these pitfalls and wary of using FDR to compare experimental results. FDR should be augmented with other measures such as p-values and expression ratios. It is worth including standard error and variance information for meta-analyses and, if possible, the raw data for re-analyses. This is especially important for high-throughput studies because data are often re-used for different objectives, including comparing common elements across many experiments. No single error rate or data summary may be appropriate for all of the different objectives.

Contact: Eugene.Kolker@seattlechildrens.org


    1 INTRODUCTION
 
In an effort to control for multiple comparisons and increase power over more conventional methods (Dudoit et al., 2003), the false discovery rate (FDR) (Benjamini and Hochberg, 1995) has become increasingly popular for large, exploratory data analyses. In particular, the FDR has become the standard criterion for assessing results in microarray gene-expression studies, along with the associated q-value (the FDR at a specific p-value threshold), which is used to quantify individual comparisons (Allison et al., 2006; Kerr and Churchill, 2001; Storey and Tibshirani, 2003; Tusher et al., 2001). These quantities are defined in Table 1. The FDR is also widely used in other fields, including imaging (Srikanth et al., 2006), proteomics (Karp et al., 2007) and genetic association and linkage (Chen and Storey, 2006).



 
Table 1. Definitions of error rates in a multiple testing situation using the notation of Benjamini and Hochberg (1995)

 
However, use of the FDR and its associated q-value may result in inconsistent and misleading interpretation of comparisons across different experiments. This inconsistency is shared by other stepwise multiple comparison procedures such as Student–Newman–Keuls (Keuls, 1952) and the Holm–Bonferroni adjustment (Holm, 1979). The difficulty is due in part to the omnibus nature of such tests, where many different elements of the tests and the family of comparisons can lead to the same error rate. The rapid increase in popularity of the FDR has made it more necessary than ever to demonstrate these inconsistencies. They are fundamental to the FDR itself and do not arise from issues of estimating the FDR, a topic which has been discussed at great length elsewhere (Allison et al., 2006; Benjamini and Hochberg, 1995; Storey, 2002; Tsai et al., 2003).

This topic has not been directly addressed in the many papers discussing the FDR. Few papers demonstrate the potential for interpretation error or the issues with comparing the FDR and q-values across different experiments. Others have noted potential inconsistencies in the interpretation of results from any multiple comparison procedure (O’Brien, 1983; Rothman, 1990). This inconsistency is often due to differences in the number of comparisons, as the following example illustrates. Assume that there are two studies: the first compares all pairs of treatments A, B and C, while the second compares only treatments A and B. Focusing on the comparison between A and B, assume both studies observe the same unadjusted p-value of 0.03. Using a Bonferroni correction for its three comparisons, the first study obtains an adjusted p-value of 0.09 (3 × 0.03) for this comparison, while the second study, with a single comparison, needs no adjustment. As a result, the studies would reach different conclusions based on a standard 0.05 p-value threshold, despite observing the same difference between A and B.
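The arithmetic of this two-study example can be sketched in a few lines of Python. This is only an illustration; the function name `bonferroni_adjust` is a hypothetical helper, not part of any analysis described in the article.

```python
def bonferroni_adjust(p_value: float, n_comparisons: int) -> float:
    """Bonferroni-adjusted p-value: multiply by the number of comparisons, cap at 1."""
    return min(1.0, p_value * n_comparisons)


ALPHA = 0.05   # conventional significance threshold used in the text
P_AB = 0.03    # same unadjusted p-value for A versus B in both studies

# Study 1 compares all pairs of A, B and C (3 comparisons); study 2 compares only A and B.
adjusted_study1 = bonferroni_adjust(P_AB, 3)
adjusted_study2 = bonferroni_adjust(P_AB, 1)

for name, adjusted in [("study 1", adjusted_study1), ("study 2", adjusted_study2)]:
    verdict = "significant" if adjusted < ALPHA else "not significant"
    print(f"{name}: adjusted p = {adjusted:.2f} ({verdict})")
```

The same observed difference is declared significant in one study and not in the other, purely because of the size of the family of comparisons.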

Methods to control for the FDR are relatively less sensitive to the number of comparisons than other procedures that adjust for multiple comparisons (Holland and Cheung, 2002). However, even when the numbers of comparisons are identical across different experiments, the thresholds to control for the FDR and the associated q-values for individual comparisons are highly dependent on the results of other comparisons. This situation is often encountered in microarray studies, where many series of experiments are based on the same set of genes.


    2 RESULTS AND DISCUSSION
 
2.1 Ten-gene comparisons
For simplicity, assume there are two studies comparing gene-expression levels between two conditions (expression ratios) for an identical set of 10 genes (clearly, this example can be scaled to any number of genes). Focusing on gene X, assume both studies observe the same unadjusted p-value of 0.01 in a test for differential gene expression (expression ratio not equal to one). Assume also that this is the smallest p-value among the 10 genes in the first study and the third smallest in the second study. In the first study, a conservative estimate of the FDR using a p-value threshold of 0.01 would be 10% (10 × 0.01/1), and in the second the FDR at that same threshold would be ~3% (10 × 0.01/3). Based on a 5% FDR threshold, gene X would be considered differentially expressed in the second study but not in the first. This result appears counterintuitive: despite the same level of differential expression being observed for gene X, it is considered significant when it has the third smallest p-value, but not when it has the smallest.
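The ten-gene example can be reproduced with a small sketch of the Benjamini and Hochberg (1995) adjustment, in which the q-value of the gene ranked k-th among m is estimated as m × p/k and then made monotone. The two lists of p-values below are hypothetical: only the value 0.01 for gene X and its rank in each study come from the text.

```python
def bh_qvalues(p_values):
    """Benjamini-Hochberg adjusted p-values (q-values), returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is the minimum of
    # m * p / rank over all ranks at or above the current one.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, m * p_values[i] / rank)
        q[i] = running_min
    return q


# Gene X is the first entry in both lists, with p = 0.01 (all other values are made up).
study1 = [0.01, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]     # X has the smallest p-value
study2 = [0.01, 0.001, 0.002, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]  # X has the third smallest

q_x_study1 = bh_qvalues(study1)[0]
q_x_study2 = bh_qvalues(study2)[0]
print(f"gene X q-value, study 1: {q_x_study1:.3f}")
print(f"gene X q-value, study 2: {q_x_study2:.3f}")
```

With these inputs, gene X receives a q-value of 0.10 in the first study and about 0.033 in the second, so only the second study would list it at a 5% FDR.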

As we shall illustrate in the rest of this article, the nature of the FDR is such that the larger the pool of differentially expressed genes, the less conservative the p-value threshold becomes. This result makes sense probabilistically. Practically and intuitively, however, should the significance of this p-value change in this way? Common sense might suggest the opposite: one should consider larger p-values when there are only a few differentially expressed genes, not when there are many.

2.2 FDR dependencies
The following equation describes the relationship between the q-value (qv) and the p-value (pv) of an individual comparison. It shows how the q-value is dependent on the totality of comparisons in an experiment.


[Formula 1]    (1)
Notation is defined in Table 1 following Benjamini and Hochberg (1995). Note that the power (1 – β) depends on the significance level or p-value threshold, the particular statistical test, as well as the distribution of alternative hypotheses. For a specific FDR (e.g. 5%), the p-value threshold will therefore depend upon the overall power and the proportion of true null hypotheses (i.e. the proportion of equally expressed genes).
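Equation (1) is not reproduced in this text. A form consistent with the description above would express the q-value as expected false rejections divided by expected total rejections. The following is a reconstruction under that assumption, not necessarily the authors' exact notation; m (number of hypotheses), m_0 (number of true null hypotheses) and π_0 = m_0/m are assumed symbols in the spirit of Benjamini and Hochberg (1995).

```latex
q_v \;\approx\; \frac{m_0\, p_v}{m_0\, p_v + (m - m_0)(1-\beta)}
    \;=\; \frac{\pi_0\, p_v}{\pi_0\, p_v + (1-\pi_0)(1-\beta)},
\qquad \pi_0 = \frac{m_0}{m}
```

Solving for the p-value threshold at a fixed q-value gives p_v = q_v (1 − π_0)(1 − β) / [π_0 (1 − q_v)], which is implicit in p_v because the power itself depends on the threshold. With q_v = 0.05, this form gives a threshold near 0.047 for power 0.9 and π_0 = 0.5, and about 0.0007 for power 0.25 and π_0 = 0.95.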

Figure 1 shows the p-values corresponding to an FDR or q-value of 5% at different proportions of true null hypotheses and power. The p-value threshold decreases as power decreases and as the proportion of null hypotheses (equally expressed genes) increases. At the upper left corner of the figure, where large numbers of hypotheses are rejected (power = 0.9 and 50% null hypotheses), the p-value threshold to achieve a 5% FDR is near 0.05, which is on the border of what would be considered significant for a single comparison. On the other hand, at the bottom right corner, where few hypotheses are rejected (power = 0.25 and 95% null hypotheses), the p-value threshold is below 0.001, a highly significant difference for a single comparison.
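The two corner cases described for Figure 1 can be checked numerically. The sketch below assumes that the q-value is the expected number of false rejections divided by the expected number of all rejections, with power treated as a fixed input; the function names and the symbol `pi0` (proportion of true null hypotheses) are illustrative choices, not taken from the article.

```python
def q_value(p: float, power: float, pi0: float) -> float:
    """Assumed relationship: expected false rejections over expected total rejections."""
    return pi0 * p / (pi0 * p + (1.0 - pi0) * power)


def p_threshold(q: float, power: float, pi0: float) -> float:
    """p-value threshold that yields FDR q, obtained by inverting q_value for p."""
    return q * (1.0 - pi0) * power / (pi0 * (1.0 - q))


# The two corners of Figure 1 discussed in the text.
corners = {
    "power 0.90, 50% nulls": (0.90, 0.50),
    "power 0.25, 95% nulls": (0.25, 0.95),
}
for label, (power, pi0) in corners.items():
    p = p_threshold(0.05, power, pi0)
    print(f"{label}: p-value threshold for 5% FDR = {p:.5f}")
```

Under this assumed form, the first case gives a threshold close to 0.05 and the second a threshold below 0.001, in line with the description of the figure.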


 
Fig. 1. Variation in p-value threshold for a fixed FDR of 5%. Plot shows p-value threshold to achieve a 5% FDR (or p-value corresponding to a q-value of 0.05) as power and the proportion of true null hypotheses vary.

 
The local FDR can be defined as the FDR for genes with a given q-value (or p-value), whereas the q-value is the FDR for genes with p-values as small or smaller (Table 1). It has been argued that the FDR can be misleading because the error rate on the rejection boundary (local FDR) is often much higher than the overall FDR (Efron, 2004). Therefore, using the local FDR to judge significance might be preferable. However, the results shown in Figure 1 are not unique to the overall FDR, but similarly affect the local FDR. For example, in Figure 2 the distribution of the test statistics (i.e. based on log-expression ratios) is assumed to be N(µ,1), where µ is the true log-expression ratio, µ = 0 under the null hypothesis, and the distribution of µ under the alternative hypothesis is N(1,1). This formulation allows easy calculation of p-values, FDR and the local FDR (Efron, 2004). It results in a relationship between p-value thresholds (to achieve 5% FDR) and the proportion of true null hypotheses similar to that shown in Figure 1. However, while the local FDR at the 5% FDR threshold is higher than the overall 5% FDR, it does not vary much over the proportions of true null hypotheses used in Figure 2, ranging from 11.8% down to 10.1%. In fact, if we hold the local FDR fixed at 10%, the curve changes only slightly from the fixed 5% FDR curve, as shown in Figure 2, where similar variation in p-value thresholds is apparent. This demonstrates that the local FDR suffers from the same difficulties and issues as the FDR.
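The normal mixture model described above is simple enough to evaluate directly. The sketch below makes two assumptions that the text does not state: p-values are one-sided (upper tail), and the proportions of true null hypotheses tried are 0.5, 0.7 and 0.9. Under the model, a test statistic from the alternative is marginally N(1, 2), since µ ~ N(1,1) and the statistic is N(µ,1). All function names are illustrative.

```python
import math

SD_ALT = math.sqrt(2.0)  # marginal SD under the alternative: variance 1 (mu) + 1 (noise)


def norm_pdf(x, mean=0.0, sd=1.0):
    u = (x - mean) / sd
    return math.exp(-0.5 * u * u) / (sd * math.sqrt(2.0 * math.pi))


def norm_sf(x, mean=0.0, sd=1.0):
    """Upper-tail probability P(Z > x)."""
    return 0.5 * math.erfc((x - mean) / (sd * math.sqrt(2.0)))


def tail_fdr(z, pi0):
    """Overall FDR when every statistic above z is rejected."""
    null = pi0 * norm_sf(z)
    alt = (1.0 - pi0) * norm_sf(z, 1.0, SD_ALT)
    return null / (null + alt)


def local_fdr(z, pi0):
    """Local FDR: posterior probability of the null exactly at z."""
    null = pi0 * norm_pdf(z)
    alt = (1.0 - pi0) * norm_pdf(z, 1.0, SD_ALT)
    return null / (null + alt)


def z_at_tail_fdr(target, pi0, lo=0.0, hi=10.0):
    """Bisection for the cutoff z; tail_fdr decreases in z on [0, 10] for this model."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if tail_fdr(mid, pi0) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


for pi0 in (0.5, 0.7, 0.9):
    z = z_at_tail_fdr(0.05, pi0)
    print(f"pi0 = {pi0:.1f}: p threshold = {norm_sf(z):.5f}, "
          f"local FDR at threshold = {local_fdr(z, pi0):.3f}")
```

With half of the hypotheses null, the local FDR at the 5% FDR boundary comes out at roughly 0.118 under these assumptions, and the p-value threshold shrinks as the proportion of nulls grows while the local FDR at the boundary changes comparatively little.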


 
Fig. 2. Similarity in variation in p-value threshold for a fixed 5% FDR and 10% local FDR. Plot shows p-value threshold to achieve a 5% FDR and 10% local FDR as a function of the proportion of true null hypotheses. The distribution of the test statistics (i.e. based on log-expression ratios) is assumed to be N(µ,1) where µ is the true log-expression ratio and µ = 0 under the null hypothesis and the distribution of µ under the alternative hypothesis is N(1,1).

 
2.3 Mouse liver experiments
The following example is based on microarray data obtained from the Gene Expression Omnibus (GEO) data repository developed by the National Center for Biotechnology Information at the NIH (Barrett et al., 2007). The study compared mouse liver samples for a treatment (PPARα agonist Wy14643) versus a control (GEO experiment series GSE8295). The experiment was repeated for wild-type and mutant mice. The array contains ~40 000 sequences (genes and variants) and there were four replicates in each treatment group. Log-expression values were analyzed using the Limma package for the R programming language (Smyth, 2004) to generate p-values and q-values for differential expression studies.

Analysis of the two experiments (wild-type and mutant) results in disproportionate numbers of differentially expressed genes at a 5% FDR: 8669 for the wild-type and only 16 for the mutant. In turn, this results in dramatically different p-value thresholds to achieve a 5% FDR (0.014 versus 0.00002), mirroring the example shown in Figure 1. Using 5% FDR (q-value <0.05) as the threshold, there are a number of significant, differentially expressed genes in the wild-type experiment with larger p-values, smaller expression ratios and much lower rankings than non-significant genes in the mutant experiment. The results for three such genes are shown in Table 2.
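The gap between the two thresholds follows largely from the number of rejections. As a rough check, assume the Benjamini and Hochberg (1995) step-up rule, under which the largest rejected p-value is at most q × R/m for R rejections among m tests, and take m = 40 000 from the approximate array size quoted above. Both are assumptions: the exact number of sequences tested and the q-value estimator behind the reported thresholds are not stated here.

```python
def bh_p_threshold(q: float, n_rejected: int, n_tests: int) -> float:
    """Upper bound on the largest p-value rejected by the BH step-up rule at level q."""
    return q * n_rejected / n_tests


N_TESTS = 40_000  # approximate array size; an assumption for this back-of-envelope check

wild_type_bound = bh_p_threshold(0.05, 8669, N_TESTS)
mutant_bound = bh_p_threshold(0.05, 16, N_TESTS)

print(f"wild-type bound: {wild_type_bound:.4f}")
print(f"mutant bound:    {mutant_bound:.5f}")
print(f"ratio:           {wild_type_bound / mutant_bound:.0f}")
```

Under these assumptions the bound for the mutant experiment is 0.00002 and the bound for the wild-type experiment is about 0.011, the same order of magnitude as the reported 0.014. The thresholds differ by a factor of several hundred simply because the numbers of rejected genes do.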



 
Table 2. Comparison of q-values, p-values and expression ratios (ER) for three genes from two different mouse liver microarray experiments (GEO experiment series GSE8295)

 
Clearly, a simple comparison of q-values between the two experiments can be quite misleading for specific genes. If a researcher were specifically interested in these three genes and only had a list of differentially expressed genes at a 5% FDR, the wrong conclusion would be inevitable: that these three genes are significantly differentially expressed in the wild-type experiment but not in the mutant experiment.

2.4 Use of FDR
FDR is a useful concept and control for multiple testing issues, particularly for the huge number of comparisons made in high-throughput experiments such as microarray gene-expression studies. However, relying only on the FDR to judge the significance of results across different experiments can lead to inconsistencies and misinterpretation of individual comparisons.

The FDR is an appropriate error measure to identify a list of genes that has a suitably high likelihood of being differentially expressed based only on the information from the specific experiment. It may be advisable not to use a pre-set FDR threshold, since in some circumstances it may result in huge numbers of candidate genes, while in others it may yield only a few.

The FDR is only one type of useful information for evaluating individual comparisons. For instance, considering the per comparison error rate (p-value), the magnitude of the difference (i.e. expression ratios) and perhaps the local FDR will give a more complete picture of the significance of different genes. The most appropriate criteria depend upon the objective(s) of the study. For example, if one wants to ensure each individual gene has a high likelihood of differential expression, then the local FDR is more appropriate. If one wants to rank the most differentially expressed genes then, as was seen in the mouse liver experiment (Table 2), the FDR, local FDR and p-value all result in the same ranking of genes, while a ranking based on the expression ratios is quite different.

When a researcher's interest is in examining a single gene or a small group of genes across different experiments, the FDR, q-values and local FDR are not the appropriate measures. The research question now focuses on cross-experimental comparisons, rather than using a single, entire experiment as the basis for analysis. In this case, using the FDR, q-value or local FDR may lead one to exclude comparisons or genes that show consistently small p-values and large differences, but did not achieve a desired FDR in all those studies. Therefore, the FDR, q-values or local FDR can give the false impression that across studies results were inconsistent; p-values and expression ratios will be far more informative for comparing the same small set of genes across different experiments.

A meta-analysis of the experiments (Choi et al., 2003) or, better still, a re-analysis of the raw data on the restricted set of genes may provide error rates more specific to this situation. For example, tests for a difference in expression ratios between the mutant and wild-type experiments based on a combined analysis for the three genes in Table 2 show no difference for Hlf and Arntl (both p-values = 0.85). There is also some suggestion that the expression ratio for Per3 was larger in the mutant experiment (p-value = 0.07). These results contradict those based on the FDR or q-values.
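When only summary statistics are available, a cross-experiment comparison for a single gene can be approximated with a simple two-sample z-test on the log-expression ratios. This is a swapped-in simplification, not the combined analysis used by the authors; it assumes independent experiments, approximately normal estimates and known standard errors. The function name and the numeric inputs are hypothetical.

```python
import math


def compare_log_ratios(log_ratio_a, se_a, log_ratio_b, se_b):
    """Two-sided z-test for a difference between two independent log-expression ratios."""
    z = (log_ratio_a - log_ratio_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    # Two-sided normal p-value: 2 * P(Z > |z|) = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Hypothetical log2 expression ratios and standard errors for one gene in two experiments.
z_stat, p_diff = compare_log_ratios(1.0, 0.3, 1.1, 0.3)
print(f"z = {z_stat:.3f}, p-value for a difference between experiments = {p_diff:.3f}")
```

A large p-value from such a test indicates that the two experiments are consistent for that gene, regardless of whether the gene passed the FDR threshold in each experiment separately.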


    3 CONCLUSION
 
Gene expression and other large-scale analyses may initially have the objective of finding biomarkers or other discovery targets, and with such an objective, using the FDR is a sensible method for controlling errors and maximizing the number of potential discoveries. However, the data from these studies may serve many purposes, including much more specialized and targeted analyses. It is important that the reported results from these studies include more than simple gene lists for a given FDR.

Including results for all genes and additional quantitative information such as p-values, expression ratios and local FDR values can help researchers make better comparisons across different experiments. Reporting information on variability, standard errors and sample size may make meta-analyses possible. Better still is to make raw data available along with detailed information about the experimental design and data normalization, as the NIH does with GEO. This will allow researchers to estimate the appropriate error rate for the objectives of the study and avoid inconsistent comparisons that obscure scientific discovery.


    ACKNOWLEDGEMENTS
 
We greatly appreciate Caroline Dombrowski, Katie Kerr and Jared Roach for their insightful comments.

Funding: This work was supported by grants from the National Institutes of Health, National Institute of General Medical Sciences (Grant No. GM076680-01A1) and from the National Science Foundation, Offices of Biological Infrastructure and Molecular and Cellular Biology (Grant No. 0544757) to E.K.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Martin Bishop

Received on January 18, 2008; revised on March 14, 2008; accepted on April 1, 2008

    REFERENCES
 

    Allison DB, et al. Microarray data analysis: from disarray to consolidation and consensus. Nat. Rev. Genet (2006) 7:55–65.

    Barrett T, et al. NCBI GEO: mining tens of millions of expression profiles–database and tools update. Nucleic Acids Res (2007) 35:D760–D765.

    Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Royal Stat. Soc. Ser. B (1995) 57:289–300.

    Chen L, Storey JD. Relaxed significance criteria for linkage analysis. Genetics (2006) 173:2371–2381.

    Choi JK, et al. Combining multiple microarray studies and modeling interstudy variation. Bioinformatics (2003) 19(Suppl. 1):i84–i90.

    Dudoit S, et al. Multiple hypothesis testing in microarray experiments. Stat. Sci (2003) 18:71–103.

    Efron B. Large-scale simultaneous hypothesis testing: the choice of a null hypothesis. J. Am. Stat. Assoc (2004) 99:96–104.

    Holland B, Cheung SH. Familywise robustness criteria for multiple-comparison procedures. J. Royal Stat. Soc. Ser. B (2002) 64:63–77.

    Holm S. A simple sequentially rejective multiple test procedure. Scand. J. Stat (1979) 6:65–70.

    Karp NA, et al. Experimental and statistical considerations to avoid false conclusions in proteomic studies using differential in-gel electrophoresis. Mol. Cell Proteomics (2007) 6:1354–1364.

    Kerr MK, Churchill GA. Experimental design for gene expression microarrays. Biostatistics (2001) 2:183–201.

    Keuls M. The use of the studentized range in connection with an analysis of variance. Euphytica (1952) 1:112–122.

    O’Brien PC. The appropriateness of analysis of variance and multiple-comparison procedures. Biometrics (1983) 39:787–794.

    Rothman KJ. No adjustments are needed for multiple comparisons. Epidemiology (1990) 1:43–46.

    Smyth GK. Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol (2004) 3:Article 3, 1–24.

    Srikanth R, et al. Estimation of false discovery rates for wavelet-denoised statistical parametric maps. Neuroimage (2006) 33:72–84.

    Storey JD. A direct approach to false discovery rates. J. Royal Stat. Soc. Ser. B (2002) 64:479–498.

    Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc. Natl Acad. Sci. USA (2003) 100:9440–9445.

    Tsai CA, et al. Estimation of false discovery rates in multiple testing: application to gene microarray data. Biometrics (2003) 59:1071–1081.

    Tusher VG, et al. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA (2001) 98:5116–5121.

