Bioinformatics Advance Access originally published online on April 25, 2008
Bioinformatics 2008 24(12):1461-1462; doi:10.1093/bioinformatics/btn209
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
fdrtool: a versatile R package for estimating local and tail area-based false discovery rates
Institute for Medical Informatics, Statistics and Epidemiology (IMISE), University of Leipzig, Härtelstr. 16-18, 04107 Leipzig, Germany
| ABSTRACT |
|---|
|
|
|---|
Summary: False discovery rate (FDR) methodologies are essential in the study of high-dimensional genomic and proteomic data. The R package fdrtool facilitates such analyses by offering a comprehensive set of procedures for FDR estimation. Its distinctive features include: (i) many different types of test statistics are allowed as input data, such as P-values, z-scores, correlations and t-scores; (ii) simultaneously, both local FDR and tail area-based FDR values are estimated for all test statistics and (iii) empirical null models are fit where possible, thereby taking account of potential over- or underdispersion of the theoretical null. In addition, fdrtool provides readily interpretable graphical output, and can be applied to very large scale (in the order of millions of hypotheses) multiple testing problems. Consequently, fdrtool implements a flexible FDR estimation scheme that is unified across different test statistics and variants of FDR.
Availability: The program is freely available from the Comprehensive R Archive Network (http://cran.r-project.org/) under the terms of the GNU General Public License (version 3 or later).
Contact: strimmer{at}uni-leipzig.de
| 1 INTRODUCTION |
|---|
|
|
|---|
Multiple testing is often an essential step in the analysis of high-dimensional genomic or proteomic data. In this context, false discovery rates (FDR) have proven to be reliable as statistical criteria to determine the significance of genomic features. Correspondingly, FDR methodologies are currently employed in many settings such as differential expression, spectrometric peak detection, SNP discovery, edge selection in genetic networks, to name but a few examples.
FDR theory starts with the seminal papers by Schweder and Spjøtvoll (1982) and Benjamini and Hochberg (1995). Local FDR was introduced by Efron et al. (2001). For a general overview over FDR methodologies see, e.g. the review of Broberg (2005) and Efron (2004, 2007).
Generally, two distinct types of FDR need be distinguished:
- density-based local FDR (= fdr), and
- tail area-based FDR (= Fdr).
More formally, consider an observed test statistic y
0 designed such that a small y indicates an uninteresting null case, and conversely, a large y an interesting alternative case. It is assumed that across hypotheses i = 1,...,m the test statistics yi follow a two-component mixture, with density
|
| (1) |
|
| (2) |
|
| (3) |
|
| (4) |
0 and for the parameters
. Note that this is not an easy task, and that this is precisely the point where the various FDR algorithms differ. While both Fdr and fdr are defined for any arbitrary test statistic, it is common practice to use P-values for estimating Fdr, and z-scores for fdr computations.
| 2 DISTINCTIVE FEATURES OF FDRTOOL |
|---|
|
|
|---|
In contrast to other FDR estimation schemes, in fdrtool there is no unnecessary distinction between P-values and other test statistics. Instead, one common algorithm is used to fit the mixture distribution and to infer its parameters. Currently, null models are implemented for P-values, z-scores, correlations and t-scores. It is straightforward to extend fdrtool to allow further types of test statistics.
A second distinguishing feature of fdrtool is that, regardless of the choice of test statistic, simultaneously both local FDR as well as tail area-based FDR values are estimated. This enables, e.g. the computation of local FDR from P-values, and also ensures that
.
Furthermore, all null models may contain free parameters, typically related to scale or location. This implies that fdrtool facilitates the use of an empirical null model (Efron, 2004; Schäfer and Strimmer, 2005). This is beneficial if hypotheses are correlated, and if there is an over- or underdispersion of the theoretical null model (Efron, 2007).
The learning algorithm employed in fdrtool merges the Grenander-density approaches (Broberg, 2005; Langaas et al., 2005) with empirical null modeling (Efron, 2004). Precise details of this procedure and its statistical properties will be reported elsewhere (Strimmer, 2008, manuscript in preparation).
| 3 AN EXAMPLE SESSION |
|---|
|
|
|---|
FDR analysis with fdrtool is simple: start the R application (R Development Core Team, 2007), arrange the test statistics in vector format, and run the fdrtool command. In the following example r is a vector of correlations:
library("fdrtool")
fdr.out = fdrtool(r, type="correlation") The resulting graphical output is shown in Figure 1. The actual estimated (local) FDR values can be accessed as follows:
|
fdr.out$pval # p-values
fdr.out$lfdr # local FDR (=fdr)
fdr.out$qval # tail area-based FDR (=Fdr)
fdr.out$param # estimated parametersThe manual accompanying the fdrtool R package documents this and a number of other procedures in more detail.
| 4 CONCLUSION |
|---|
|
|
|---|
fdrtool is a flexible and simple to use software package for the R environment that allows to obtain estimates of local FDR and frequentist FDR, with a unified interface and algorithm for a diverse set of test statistics and variants of FDR.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
I thank Brit B. Turnbull, Stanford University, for valuable discussion of the local FDR estimation procedure implemented in the R package locfdr, and for kindly sharing an unpublished preprint.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Trey Ideker
Received on December 12, 2007; revised on January 28, 2008; accepted on April 23, 2008
| REFERENCES |
|---|
|
|
|---|
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Statist. Soc. B (1995) 57:289–300.
Broberg P. A comparative review of estimates of the proportion unchanged genes and the false discovery rate. BMC Bioinformatics (2005) 6:199.[CrossRef][Medline]
Efron B. Large-scale simultaneous hypothesis testing: the choice of a null hypothesis. J. Amer. Statist. Assoc (2004) 99:96–104.[CrossRef]
Efron B. Correlation and large-scale simultaneous significance tesing. J. Amer. Statist. Assoc (2007) 102:93–103.[CrossRef]
Efron B, et al. Empirical bayes analysis of a microarray experiment. J. Amer. Statist. Assoc (2001) 96:1151–1160.[CrossRef]
Langaas M, et al. Estimating the proportion of true null hypotheses, with application to DNA microarray data. J. R. Statist. Soc. B (2005) 67:565–572.
R Development Core Team. R: a Language and Environment for Statistical Computing. (2007) Vienna, Austria: R Foundation for Statistical Computing.
Schäfer J, Strimmer K. An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics (2005) 21:754–764.
Schweder T, Spjøtvoll E. Plots of p-values to evaluate many tests simultaneously. Biometrika (1982) 69:493–502.
This article has been cited by other articles:
![]() |
C. Li, X. Li, Y. Miao, Q. Wang, W. Jiang, C. Xu, J. Li, J. Han, F. Zhang, B. Gong, et al. SubpathwayMiner: a software package for flexible identification of pathways Nucleic Acids Res., October 1, 2009; 37(19): e131 - e131. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. Nam, M. Li, K. Choi, C. Balch, S. Kim, and K. P. Nephew MicroRNA and mRNA integrated analysis (MMIA): a web tool for examining biological functions of microRNA expression Nucleic Acids Res., July 1, 2009; 37(suppl_2): W356 - W362. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

of the correlations. The second row depicts the corresponding cumulative density functions. In the third row the local FDR as well as the tail area-based FDR are shown in dependence of the value of the test statistic. Note that the default output is in color, but if desired (as in this figure) a black & white version can be produced by invoking the option color.figure=FALSE.