Bioinformatics Advance Access originally published online on July 28, 2008
Bioinformatics 2008 24(20):2407-2408; doi:10.1093/bioinformatics/btn379

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

ExactFDR: exact computation of false discovery rate estimate in case-control association studies

Jérôme Wojcik * and Karl Forner

Department of Bioinformatics, Merck Serono Geneva Research Center, 1202 Geneva, Switzerland

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 ALGORITHM PRINCIPLE
 3 SOFTWARE OVERVIEW
 4 EXAMPLE APPLICATION
 5 DISCUSSION
 REFERENCES
 

Summary: Genome-wide association studies require fast and accurate statistical methods to separate relevant signals from the background noise generated by a huge number of simultaneously tested hypotheses. It is now commonly accepted that exact computation of the association probability value (P-value) is preferable to χ² and permutation-based approximations. Following the same principle, the ExactFDR software package improves the speed and accuracy of the permutation-based false discovery rate (FDR) estimation method by replacing the permutation-based estimate of the null distribution with a generalization of the algorithm used for computing individual exact P-values. It provides a quick and accurate non-conservative estimator of the proportion of false positives in a given selection of markers, and is therefore an efficient and pragmatic tool for the analysis of genome-wide association studies.

Availability: A Java 1.6 (1.5-compatible) version is available on SourceForge: http://sourceforge.net/projects/exactfdr.

Contact: Jerome.wojcik@merckserono.net

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 
Genome-wide case-control association analyses are becoming a routine tool in studies of genetic risk factors for complex diseases. Recent technologies allow hundreds of thousands of single nucleotide polymorphisms (SNPs) to be genotyped on chips. These screening campaigns generally result in a selection of markers that require further validation (genotyping confirmation, ideally replication, and ultimately functional validation). Because validation steps can only be performed at low throughput, selecting the most relevant set of markers from the primary screen is critical. Most currently used methods set a P-value cutoff, but this does not control the rate of false positives generated by the multiple hypothesis testing problem.

Recently, a methodology for estimating the false discovery rate (FDR) was proposed (Forner et al., 2008). It is less conservative than the pioneering control procedure of Benjamini and Hochberg (Benjamini et al., 1995) and is applicable to any study design and any association statistic. The algorithm accurately estimates the proportion of false discoveries V/R, where V and R are the numbers of false positives and of positives, respectively, at a given P-value level. Forner et al. (2008) proposed a permutation-based implementation of this algorithm, similar to other methods developed for differential gene expression studies (Benjamini et al., 2001; Ge et al., 2003; Storey et al., 2003). In that implementation, called here ‘permutation-based FDR estimation’, the distribution of P-values under the null hypothesis is estimated by recomputing all P-values after random shuffling of the case/control labels. The precision of the estimated distribution depends on the number of shuffling rounds, so the method suffers from long execution times. To address this issue and make the algorithm easy to use, we have developed the ExactFDR software package. ExactFDR is a user-friendly tool, available for several common platforms, that implements an exact computation of the FDR estimate based on exact computations of allelic or genotypic P-values (Balding, 2006). Its execution time is short, and it estimates V/R at least as accurately as the permutation-based FDR estimator proposed in Forner et al. (2008).


    2 ALGORITHM PRINCIPLE
 
The principle of the algorithm is to estimate the global (experiment-wise) null distribution of the test statistic and then to adjust the exact P-values globally so that their null distribution is perfectly uniform despite the discreteness and dependency of the data. A given SNP is represented by its genotypic counts (x1,..., x6) in a 2 × 3 contingency table describing the two samples whose genotype distributions are compared (Table 1). The exact P-value computation relies on the enumeration of all contingency tables having the same margins as the observed one (dubbed compatible tables). The exact P-value is the sum of the multiple hypergeometric probabilities of all compatible tables having a statistic at least as extreme as the observed one (Guedj et al., 2006). The ExactFDR algorithm is based on a similar principle: the distribution of P-values under the null hypothesis is derived by comprehensively computing and storing the exact P-values of all the compatible tables of all the SNPs in the study. The type I error rate for a P-value threshold α is then the proportion of stored P-values less than or equal to α. This ratio is computed using all observed individual exact P-values as α thresholds (Forner et al., 2008).
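The enumeration described above can be illustrated with a minimal Java sketch. It is not the ExactFDR source code: the class and method names are hypothetical, the Pearson χ² statistic is assumed as the measure of extremeness, each compatible table is weighted by its multiple hypergeometric probability when the null distribution is pooled, and the expected number of false positives V is estimated as if every SNP were null. It also uses Java 8 lambdas and makes no attempt at the optimizations a genome-wide run would need.

```java
import java.util.ArrayList;
import java.util.List;

class ExactFdrSketch {
    static final double TOL = 1e-9; // tolerance for floating-point ties

    /** log(n!) by direct summation; slow but adequate for a sketch. */
    static double logFactorial(int n) {
        double sum = 0.0;
        for (int i = 2; i <= n; i++) sum += Math.log(i);
        return sum;
    }

    /** Pearson chi-square of a 2 x 3 table {x1..x6}; row 1 = x1..x3, row 2 = x4..x6. */
    static double chiSquare(int[] t) {
        double[] row = {t[0] + t[1] + t[2], t[3] + t[4] + t[5]};
        double n = row[0] + row[1], stat = 0.0;
        for (int i = 0; i < 2; i++) {
            for (int j = 0; j < 3; j++) {
                double expected = row[i] * (t[j] + t[j + 3]) / n;
                if (expected > 0) stat += Math.pow(t[3 * i + j] - expected, 2) / expected;
            }
        }
        return stat;
    }

    /** {statistic, null probability} for every table sharing the margins of x. */
    static List<double[]> compatibleTables(int[] x) {
        int n1 = x[0] + x[1] + x[2], n2 = x[3] + x[4] + x[5];
        int[] col = {x[0] + x[3], x[1] + x[4], x[2] + x[5]};
        double logConst = logFactorial(n1) + logFactorial(n2) - logFactorial(n1 + n2);
        for (int c : col) logConst += logFactorial(c);
        List<double[]> tables = new ArrayList<>();
        for (int a = 0; a <= Math.min(n1, col[0]); a++) {
            for (int b = 0; b <= Math.min(n1 - a, col[1]); b++) {
                int c = n1 - a - b;
                if (c > col[2]) continue; // third column margin would be violated
                int[] t = {a, b, c, col[0] - a, col[1] - b, col[2] - c};
                double logP = logConst;
                for (int v : t) logP -= logFactorial(v);
                tables.add(new double[] {chiSquare(t), Math.exp(logP)});
            }
        }
        return tables;
    }

    /** Exact P-value: total probability of tables at least as extreme as x. */
    static double exactPValue(int[] x) {
        double observed = chiSquare(x), p = 0.0;
        for (double[] t : compatibleTables(x)) {
            if (t[0] >= observed - TOL) p += t[1];
        }
        return Math.min(1.0, p);
    }

    /** Null probability that this SNP's exact P-value is <= alpha. */
    static double typeIError(int[] x, double alpha) {
        List<double[]> tables = compatibleTables(x);
        tables.sort((s, t) -> Double.compare(t[0], s[0])); // most extreme first
        double cumulative = 0.0, reached = 0.0;
        int i = 0;
        while (i < tables.size()) {
            double groupStat = tables.get(i)[0];
            while (i < tables.size() && groupStat - tables.get(i)[0] <= TOL) {
                cumulative += tables.get(i++)[1]; // tied tables share one P-value
            }
            if (cumulative > alpha * (1 + TOL)) break;
            reached = cumulative;
        }
        return Math.min(1.0, reached);
    }

    /** FDR estimate V/R, using each observed exact P-value as the threshold. */
    static double[] estimateFdr(int[][] snps) {
        int m = snps.length;
        double[] p = new double[m], fdr = new double[m];
        for (int i = 0; i < m; i++) p[i] = exactPValue(snps[i]);
        for (int i = 0; i < m; i++) {
            double expectedFalse = 0.0; // estimate of V, assuming all SNPs are null
            int positives = 0;          // R: observed P-values at or below the threshold
            for (int k = 0; k < m; k++) {
                expectedFalse += typeIError(snps[k], p[i]);
                if (p[k] <= p[i] * (1 + TOL)) positives++;
            }
            fdr[i] = Math.min(1.0, expectedFalse / positives);
        }
        return fdr;
    }

    public static void main(String[] args) {
        int[][] snps = {{30, 15, 5, 10, 20, 20}, {17, 16, 17, 16, 17, 17}, {20, 20, 10, 18, 22, 10}};
        double[] fdr = estimateFdr(snps);
        for (int i = 0; i < snps.length; i++) {
            System.out.printf("SNP %d: P = %.3g, FDR estimate = %.3g%n", i + 1, exactPValue(snps[i]), fdr[i]);
        }
    }
}
```

Because the P-value of a table is itself a tail probability, the null probability that a SNP's P-value falls at or below α equals the largest attainable P-value of that SNP not exceeding α, which is what `typeIError` returns. The counts in `main` are made-up illustrative values, not data from the study.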


Table 1. A genotypic 2 x 3 contingency table

 

    3 SOFTWARE OVERVIEW
 
ExactFDR requires an input file listing an identifier and the six genotypic counts (x1,...,x6) for every SNP in the study. The algorithm is implemented as a multithreaded Java package that supports parallel computation on multiprocessor machines. Since the number of individual statistic and multiple hypergeometric probability computations is large [on the order of O(mn²), where n is the total sample size and m the number of SNPs], the software has been extensively optimized at both the programming and algorithmic levels. Interested readers can refer to the code documentation and previous publications for details (Forner et al., 2008; Guedj et al., 2006).
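As a rough illustration of the input and of why the workload parallelizes well, the following Java sketch parses one record per SNP and counts the compatible tables of each SNP in parallel. The actual ExactFDR file layout is not specified here, so the whitespace-separated format, the class and method names and the SNP identifiers are all assumptions made for the example; the parallel stream merely stands in for whatever threading scheme the package uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SnpInputSketch {
    /** One input record: SNP identifier plus genotypic counts x1..x6. */
    static final class Snp {
        final String id;
        final int[] counts;
        Snp(String id, int[] counts) { this.id = id; this.counts = counts; }
    }

    /** Parses "id x1 x2 x3 x4 x5 x6" (whitespace-separated: an assumed layout). */
    static Snp parseLine(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length != 7) {
            throw new IllegalArgumentException("expected 7 fields, got " + fields.length);
        }
        int[] counts = new int[6];
        for (int i = 0; i < 6; i++) {
            counts[i] = Integer.parseInt(fields[i + 1]);
            if (counts[i] < 0) throw new IllegalArgumentException("negative count in: " + line);
        }
        return new Snp(fields[0], counts);
    }

    /** Number of 2 x 3 tables sharing the margins of x: the per-SNP workload. */
    static long countCompatibleTables(int[] x) {
        int n1 = x[0] + x[1] + x[2];
        int[] col = {x[0] + x[3], x[1] + x[4], x[2] + x[5]};
        long count = 0;
        for (int a = 0; a <= Math.min(n1, col[0]); a++) {
            for (int b = 0; b <= Math.min(n1 - a, col[1]); b++) {
                if (n1 - a - b <= col[2]) count++;
            }
        }
        return count;
    }

    /** Processes SNPs independently in parallel; result order matches input order. */
    static long[] workloads(List<Snp> snps) {
        return snps.parallelStream().mapToLong(s -> countCompatibleTables(s.counts)).toArray();
    }

    public static void main(String[] args) {
        List<Snp> snps = new ArrayList<>();
        for (String line : new String[] {"snp_a 30 15 5 10 20 20", "snp_b 2 0 0 0 2 0"}) {
            snps.add(parseLine(line));
        }
        System.out.println(Arrays.toString(workloads(snps)));
    }
}
```

Each SNP contributes a two-dimensional enumeration whose size grows with the square of the sample size, and the SNPs can be processed independently, which is consistent with the O(mn²) cost quoted above and with the choice of a multithreaded design.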


    4 EXAMPLE APPLICATION
 
The ExactFDR package was run on experimental data from 313 cases and 351 controls in a multiple sclerosis whole-genome association study using Affymetrix 500K GeneChip® technology. After quality control filtering, 350 000 SNPs were analyzed. The FDR based on exact allelic and genotypic tests was estimated in 35.0 and 76.5 min, respectively, on a single-processor 1.5 GHz Itanium computer. Execution time drops to 2.1 and 4.5 min, respectively, with 32 processors, demonstrating the efficiency of the multithreaded implementation. The FDR curve corresponding to the genotypic test is shown in Figure 1: it overlaps almost perfectly with the FDR estimates obtained with the previously published permutation-based estimator, indicating that ExactFDR is an accurate estimator of the actual proportion of false discoveries V/R. Differences between the two estimators (Fig. 1b) are most frequently negative (for 70% of estimates), showing that ExactFDR is on average less conservative than the permutation-based estimator. In addition, ExactFDR is about 400 times faster than the permutation-based FDR estimator using 10 000 permutations.
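The parallel speed-up implied by these timings can be worked out directly. Writing T_1 and T_32 for the reported run times on 1 and 32 processors (symbols introduced here for illustration only), the speed-up S and the parallel efficiency E = S/32 are:

```latex
\begin{align*}
S_{\text{allelic}}   &= \frac{T_1}{T_{32}} = \frac{35.0}{2.1} \approx 16.7, &
E_{\text{allelic}}   &= \frac{16.7}{32} \approx 0.52,\\
S_{\text{genotypic}} &= \frac{T_1}{T_{32}} = \frac{76.5}{4.5} = 17.0, &
E_{\text{genotypic}} &= \frac{17.0}{32} \approx 0.53.
\end{align*}
```

Both tests therefore run roughly 17 times faster on 32 processors, a little over half of ideal linear scaling.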


Fig. 1. (a) On an experimental dataset (see text), ExactFDR estimates are compared with permutation-based estimates (using 10 000 permutations); both are plotted against the number of positives R (for the first 1000). The two curves overlap almost perfectly. Differences (exact minus permutation-based estimates) are plotted in (b): average difference over the 10 000 first positives is -1.4e-4 (variance 2.1e-6).

 

    5 DISCUSSION
 
The identification of genetic risk factors in complex diseases requires efficient statistical tools to analyze the data and address the so-called multiple-testing problem (the number of tested hypotheses is much greater than the sample size) in genome-wide association studies. A pragmatic and accurate methodology has been proposed for estimating the FDR, applicable to any study design (Forner et al., 2008). ExactFDR is an exact implementation of this methodology. It combines accurate estimation of the proportion of false positives with fast execution, and it has no arbitrary parameters to set. The software is available on most common platforms and is simple to use. In conclusion, ExactFDR is a useful tool in practice because it makes it possible to estimate, after a genome scan, the proportion of false positives among the SNPs selected for further validation stages.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Alex Bateman

Received on February 26, 2008; revised on June 27, 2008; accepted on July 18, 2008

    REFERENCES
 

    Balding DJ. A tutorial on statistical methods for population association studies. Nat. Rev. Genet. (2006) 7:781–791.

    Benjamini Y, et al. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B (1995) 289–300.

    Benjamini Y, et al. The control of the false discovery rate in multiple testing under dependency. Ann. Stat. (2001) 29:1165–1188.

    Forner K, et al. Universal false discovery rate estimation methodology for genome-wide association studies. Hum. Hered. (2008) 65:183–194.

    Ge Y, et al. Resampling-based multiple testing for microarray data analysis. Test (2003) 12:1–77.

    Guedj M, et al. A fast, unbiased and exact allelic test for case-control association studies. Hum. Hered. (2006) 61:210–221.

    Storey JD, et al. Statistical significance for genomewide studies. Proc. Natl Acad. Sci. USA (2003) 100:9440–9445.

