Bioinformatics Advance Access originally published online on October 12, 2004
Bioinformatics 2005 21(5):660-668; doi:10.1093/bioinformatics/bti063
© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

A simple procedure for estimating the false discovery rate

Cyril Dalmasso, Philippe Broët* and Thierry Moreau

INSERM U472, Faculté de Médecine Paris-Sud 16 Avenue Paul Vaillant-Couturier, 94807 Villejuif Cedex, France

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 1 INTRODUCTION
 2 GENERAL FRAMEWORK FOR...
 3 A GENERAL CLASS...
 4 PROPOSED ESTIMATOR
 5 SIMULATIONS
 6 EXAMPLES
 7 DISCUSSION
 8 APPENDIX
 REFERENCES
 

Motivation: The false discovery rate (FDR) is now the most widely used criterion in microarray data analysis. When the FDR is estimated from the marginal distribution of the P-values without any assumption on gene expression changes, its estimators are necessarily conservatively biased, because only an upper bound estimate can be obtained for the key quantity $\pi_0$, the probability that a gene is unmodified. In this paper, we propose a novel family of estimators for $\pi_0$ that allows the calculation of the FDR.

Results: We present a very simple method for estimating $\pi_0$, called LBE (Location Based Estimator), together with results on its variability. Simulation results indicate that the proposed estimator performs well in finite samples and has the lowest mean square error in most cases when compared with the procedures QVALUE, BUM and SPLOSH. The different procedures are then applied to real datasets.

Availability: The R function LBE is available at http://ifr69.vjf.inserm.fr/lbe

Contact: broet{at}vjf.inserm.fr


    1 INTRODUCTION
New transcriptome-oriented biotechnologies now make it possible to analyse the expression of thousands of genes in parallel in order to select relevant genes whose transcriptional changes are related to a clinical or biological outcome (Schena, 2000). In such a setting, a major multiple testing problem arises because a large number of statistical tests are performed simultaneously (Hochberg and Tamhane, 1987). Until now, statistical procedures devoted to this multiple testing problem have mostly focused on controlling or estimating false positive error criteria.

For cDNA microarray experiments, the most widely used criterion is now the false discovery rate (FDR) introduced by Benjamini and Hochberg (1995). The FDR is the expected proportion of false discoveries among all discoveries. Denoting by V the random variable representing the number of false discoveries and by R the number of significant results obtained from a particular multiple testing procedure, Benjamini and Hochberg defined the FDR as the expectation of the ratio V/R, with V/R set to 0 when R = 0. In large-scale hypothesis-generating studies such as microarray experiments, the FDR seems more relevant than the Family Wise Error Rate (FWER), defined as the probability of committing at least one false discovery (Hochberg and Tamhane, 1987). In this setting, the purpose of this paper is to propose a novel procedure for estimating the FDR.

In their seminal paper, Benjamini and Hochberg (1995) presented a step-up method to control the FDR and discussed another criterion, later called the positive FDR (pFDR) by Storey (2001). This criterion is defined as pFDR = E[(V/R)|R > 0]. However, Benjamini and Hochberg did not retain this criterion because it cannot be controlled: under the complete null hypothesis (all tested null hypotheses are true), all significant results, if any, are necessarily false discoveries. Then pFDR = 1, and it is impossible to ensure that pFDR < $\alpha$ for a given $\alpha \neq 1$.

Storey (2001) demonstrated that if the test statistics are independent and identically distributed, then for a fixed rejection region $\Gamma$, which is the same for every test,

\[ \mathrm{pFDR}(\Gamma) = \Pr(H = 0 \mid T \in \Gamma) = \frac{\pi_0 \Pr(T \in \Gamma \mid H = 0)}{\Pr(T \in \Gamma)} \qquad (1) \]

where H is the variable such that H = 0 if the null hypothesis $H_0$ is true and H = 1 if the alternative hypothesis $H_1$ is true, $\pi_0 = \Pr(H = 0)$ is the probability of not being modified, and T is the test statistic used for all tested hypotheses.

From its definition, the pFDR is related to the FDR through pFDR = FDR/Pr(R > 0). Since Pr(R > 0) tends to one as the number of tested hypotheses tends to infinity, the two criteria are asymptotically equivalent and, in the following, we write FDR for both of them.

Storey and Tibshirani (2003) proposed a method (implemented in the R function QVALUE) for obtaining a conservatively biased estimator of the pFDR based on the marginal distribution of the P-values, without making any assumption on the distribution related to the modified genes. In practice, from (1), estimating the FDR relies on the separate estimation of three terms, $\Pr(T \in \Gamma)$, $\Pr(T \in \Gamma \mid H = 0)$ and $\pi_0$, where only an upper bound estimator can be obtained for the last quantity.

Relying on the same framework, two procedures named BUM (Pounds and Morris, 2003) and SPLOSH (Pounds and Cheng, 2004) have recently been proposed. In practice, all three methods are based on the marginal distribution of the P-values and provide a conservatively biased estimator of the FDR, which results from the overestimation of $\pi_0$.

In this paper, we provide a class of estimators for an upper bound of $\pi_0$ based on the expectation of the transformed P-values, for which we can obtain results on the asymptotic distribution. Like QVALUE, BUM and SPLOSH, our procedure does not make any assumption on the distribution related to modified genes. From our proposed estimators, we can easily obtain estimators of the FDR or of other quantities such as the q-values (Storey, 2003).

The paper is organized as follows. In Section 2, we present the general framework of the procedures QVALUE, BUM and SPLOSH for obtaining a conservatively biased estimator of $\pi_0$ based on the marginal distribution of the P-values. In Section 3, we present a general class of estimators for an upper bound of $\pi_0$, with results on its asymptotic distribution. In Section 4, we propose a particular family of estimators and give guidelines for choosing one estimator in the family depending on the experimental setup and the accuracy needed. In Section 5, we present results from a simulation study that compares the proposed estimators with those provided by QVALUE, BUM and SPLOSH. In Section 6, we apply the different methods to real datasets, and we conclude with a discussion.


    2 GENERAL FRAMEWORK FOR PROCEDURES BASED ON THE MARGINAL DISTRIBUTION OF THE P-VALUES
Data can be modeled by a two-component mixture model (McLachlan and Peel, 2000), whereby the population of genes is considered to be composed of two subpopulations: genes for which the null hypothesis is true (unmodified genes) and genes for which the alternative hypothesis is true (modified genes). Let $p_i$, i = 1, ..., m, be the P-values calculated for the m tested hypotheses. Let P be the random variable of which the P-values are observations, and let f be the marginal probability density function (pdf) of P. Denote by $f_0$ the conditional pdf of P under the null hypothesis and by $f_1$ the conditional pdf of P under the alternative hypothesis. Then:

\[ f(p) = \pi_0 f_0(p) + (1 - \pi_0) f_1(p) \qquad (2) \]

Under the null hypothesis (and if the assumed distribution of the test statistic under the null hypothesis is correct), the P-values are uniformly distributed on [0, 1], so that $f_0(p) = 1_{[0,1]}(p)$ and relation (2) becomes $f(p) = \pi_0 + (1 - \pi_0) f_1(p)$, where the conditional density $f_1$ is unknown. Since $(1 - \pi_0) f_1(p)$ is non-negative, and assuming that f (or $f_1$) is non-increasing for $p \in [0, 1]$, f(1) is the smallest upper bound for $\pi_0$ based on (2). Thus, an unbiased estimator of f(1) provides a conservatively biased estimator of $\pi_0$. As seen below, the procedures QVALUE, BUM and SPLOSH are based on this latter estimator, whereas our procedure is based on the expectation of transformed P-values.

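A widely used estimator for $\pi_0$ is the one proposed by Storey and Tibshirani (2003). Using a tuning parameter $\lambda \in [0, 1)$, $\pi_0$ is estimated by:

\[ \hat{\pi}_0(\lambda) = \frac{\#\{p_i > \lambda;\ i = 1, \ldots, m\}}{m(1 - \lambda)} \]

As argued by Storey and Tibshirani, there is a trade-off between bias (which decreases as $\lambda \to 1$) and variance (which increases as $\lambda \to 1$). Considering $\hat{\pi}_0(\lambda)$ as a function of $\lambda$, Storey and Tibshirani proposed a cubic spline based method to estimate the quantity $\lim_{\lambda \to 1} \hat{\pi}_0(\lambda)$.

Actually, denoting by F the marginal cumulative distribution function (cdf) of P and by $\hat{F}$ its empirical counterpart, Storey and Tibshirani's estimator can be written as:

\[ \hat{\pi}_0(\lambda) = \frac{1 - \hat{F}(\lambda)}{1 - \lambda} = \frac{\hat{F}(1) - \hat{F}(\lambda)}{1 - \lambda} \]

Then, the estimated quantity is:

\[ \lim_{\lambda \to 1} \frac{F(1) - F(\lambda)}{1 - \lambda} = f(1) \]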

Pounds and Morris (2003) proposed a parametric method assuming that the marginal distribution of the P-values is a beta-uniform mixture distribution. The model parameters are estimated by the maximum-likelihood method, and $\hat{\pi}_0 = \hat{f}(1)$.

More recently, Pounds and Cheng (2004) have proposed a method also based on the marginal distribution of the P-values, but applying a local regression method (LOESS; Loader, 1999) to obtain a smooth estimate of f in a transformed space (for more details on the transformation used, see Pounds and Cheng, 2004).
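
To make the fixed-$\lambda$ form of Storey and Tibshirani's estimator concrete, here is a minimal Python sketch. The function name `storey_pi0` and the example P-values are illustrative assumptions. The sketch covers only the estimate at a single value of $\lambda$; it does not reproduce the cubic spline step of QVALUE.

```python
def storey_pi0(p_values, lam=0.5):
    """Fixed-lambda estimate of pi0: #{p_i > lam} / (m * (1 - lam)).

    Illustrative sketch only; QVALUE additionally smooths this
    quantity over a grid of lambda values with a cubic spline.
    """
    if not 0.0 <= lam < 1.0:
        raise ValueError("lam must lie in [0, 1)")
    m = len(p_values)
    if m == 0:
        raise ValueError("at least one P-value is required")
    # Count the P-values falling above the tuning parameter.
    n_above = sum(1 for p in p_values if p > lam)
    return n_above / (m * (1.0 - lam))


# Hypothetical P-values: 4 of the 10 lie above lambda = 0.5.
p_demo = [0.01, 0.02, 0.03, 0.04, 0.10, 0.30, 0.60, 0.75, 0.80, 0.95]
pi0_half = storey_pi0(p_demo, lam=0.5)
```

Larger values of `lam` use only the P-values closest to one, which is where the bias and variance trade-off described above comes from.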


    3 A GENERAL CLASS OF ESTIMATORS
The proposed class of estimators for an upper bound of $\pi_0$ is based on the expectation of P under model (2), which can be expressed as:

\[ E(P) = \pi_0 E_0(P) + (1 - \pi_0) E_1(P) \]

where $E_0$ and $E_1$ denote expectations under the conditional distribution of P under the null and the alternative hypothesis, respectively.

Since under the null hypothesis $P \sim U[0, 1]$, $E_0(P) = 1/2$, so that the previous equation can be written as $2E(P) = \pi_0 + 2(1 - \pi_0)E_1(P)$.

It follows that an estimator of an upper bound of $\pi_0$, leading to a conservatively biased estimator of $\pi_0$, is simply

\[ \hat{\pi}_0 = \frac{2}{m} \sum_{i=1}^{m} p_i \qquad (3) \]

since $E(P) \le E_0(P)$ (Appendix 1).
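
As a minimal illustration, the simple estimator (twice the mean of the observed P-values) can be sketched in Python as follows. The function name `pi0_from_mean` and the example values are assumptions made for this sketch and are not part of the LBE software.

```python
def pi0_from_mean(p_values):
    """Upper-bound estimate of pi0 as twice the mean P-value.

    Illustrative sketch: uses E0(P) = 1/2 under the uniform null,
    so that 2 * mean(p) estimates pi0 + 2 * (1 - pi0) * E1(P).
    """
    m = len(p_values)
    if m == 0:
        raise ValueError("at least one P-value is required")
    return 2.0 * sum(p_values) / m


# Hypothetical P-values with mean 0.2, giving an estimate of 0.4.
pi0_simple = pi0_from_mean([0.1, 0.2, 0.3])
```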

As shown below, a transformation of the random variable P can be considered in order to reduce the bias of this estimator. For any function $\varphi$ defined on [0, 1]:

\[ \frac{E[\varphi(P)]}{E_0[\varphi(P)]} = \pi_0 + (1 - \pi_0) \frac{E_1[\varphi(P)]}{E_0[\varphi(P)]} \qquad (4) \]

A function $\varphi$ leading to an estimator with a lower bias than (3) is such that

\[ \frac{E[\varphi(P)]}{E_0[\varphi(P)]} \le \frac{E(P)}{E_0(P)} \]

that is:

\[ \frac{E_1[\varphi(P)]}{E_0[\varphi(P)]} \le \frac{E_1(P)}{E_0(P)} \qquad (5) \]

Intuitively, functions $\varphi$ that are well suited for achieving the above inequality take greater values for P close to 1 than for P close to 0. The following general theorem gives formal conditions on $\varphi$ that lead to the required inequality (5).

THEOREM.

Let $f_0$ and $f_1$ be two non-increasing probability density functions of the random variable P defined on [0, 1] ($f_0$ denoting the one such that $\lim_{x \to 1} (f_1/f_0)(x) \le 1$), and let $\varphi$ be a real continuous function defined on [0, 1] satisfying the following conditions:

  (i) $\lim_{x \to 1} \varphi(x) = +\infty$
  (ii) $\lim_{x \to 0} \varphi(x) < +\infty$
  (iii) $\varphi$ is convex
  (iv) $\varphi(E_0(P)) \ge E_0(P)$

Then:

\[ \frac{E_1[\varphi(P)]}{E_0[\varphi(P)]} \le \frac{E_1(P)}{E_0(P)} \]

The proof of the theorem is given in Appendix 2.

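In the following, we denote by S the set of functions satisfying (i) to (iv), and the general class of estimators proposed for an upper bound of $\pi_0$ is:

\[ \hat{\pi}_0 = \frac{\frac{1}{m} \sum_{i=1}^{m} \varphi(p_i)}{E_0[\varphi(P)]}, \qquad \varphi \in S \]

Assuming the independence of the P-values, we can obtain results on the asymptotic distribution of $\hat{\pi}_0$. Indeed, according to the central limit theorem, as m tends to infinity:

\[ \sqrt{m} \left( \hat{\pi}_0 - \frac{E[\varphi(P)]}{E_0[\varphi(P)]} \right) \longrightarrow N\!\left( 0, \frac{\sigma^2}{E_0[\varphi(P)]^2} \right) \]

where $E[\varphi(P)]/E_0[\varphi(P)]$ is an upper bound of $\pi_0$ and $\sigma^2$ is the variance of the random variable $\varphi(P)$. Although $\sigma^2$ is unknown, we can obtain an upper bound for this variance as follows.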

Denote by $\sigma_0^2$ the variance of the random variable $\varphi(P)$ under the null hypothesis and let $\Phi(P) = \{\varphi(P) - E[\varphi(P)]\}^2$.

Since $\lim_{x \to 1} \Phi(x) = \infty$, $\lim_{x \to 0} \Phi(x) < \infty$, and $f_0$ and f are two non-increasing pdfs such that $\lim_{x \to 1} (f/f_0)(x) \le 1$, it follows from the lemma given in Appendix 2 that:

\[ E[\Phi(P)] - E_0[\Phi(P)] \le E(P) - E_0(P) \]

But, as stated previously (Appendix 1), $E(P) \le E_0(P)$; then $\sigma^2 \le \sigma_0^2$.

As the distribution of the P-values is known under the null hypothesis, we can obtain an upper bound of the asymptotic variance of the estimator:

\[ \mathrm{Var}(\hat{\pi}_0) \le \frac{\sigma_0^2}{m\, E_0[\varphi(P)]^2} \]

In the next section, we propose a particular family of functions $\varphi$ belonging to the class S and provide a method for selecting one member of the family.


    4 PROPOSED ESTIMATOR
Let $\varphi(x) = -\ln(1 - x)$. This function belongs to the class S, and we can show that, for all $n \in \mathbb{N}$, $E_1(\varphi(P)^{n+1})/E_0(\varphi(P)^{n+1}) \le E_1(\varphi(P)^n)/E_0(\varphi(P)^n)$ (Appendix 3). Then, the set of functions $\varphi(x)^n$ leads to a family of estimators whose bias for $\pi_0$ decreases with n.

It is worth noting that, under the null hypothesis, $\varphi(P)$ follows an exponential distribution with parameter 1. Then, using this change of variable, $E_0(\varphi(P)^n) = n!$ (Appendix 4) and, for $n \in \mathbb{N}$, the proposed estimator is:

\[ \hat{\pi}_0(n) = \frac{1}{m\, n!} \sum_{i=1}^{m} \left[ -\ln(1 - p_i) \right]^n \]

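Following the results stated in the previous section,

\[ \sqrt{m} \left( \hat{\pi}_0(n) - \frac{E[\varphi(P)^n]}{n!} \right) \longrightarrow N\!\left( 0, \frac{\sigma_n^2}{(n!)^2} \right) \]

where $\sigma_n^2$ is the variance of the random variable $\varphi(P)^n$.

An upper bound of $\sigma_n^2$ is its value under the null hypothesis, $E_0[\varphi(P)^{2n}] - \{E_0[\varphi(P)^n]\}^2 = (2n)! - (n!)^2$. Then,

\[ \mathrm{Var}(\hat{\pi}_0(n)) \le \frac{1}{m} \left[ \frac{(2n)!}{(n!)^2} - 1 \right] \qquad (7) \]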

As can easily be seen, there is a trade-off between bias (which decreases as n increases) and variance (which increases as n increases). Even though the proposed estimator is an unbiased estimator of an upper bound of $\pi_0$, it is important to guard against the risk of underestimating $\pi_0$ because of the dispersion of the estimator.
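
The estimator of this section, the mean of $[-\ln(1 - p_i)]^n$ divided by $n!$, can be sketched in a few lines of Python. This is an illustrative sketch and not the authors' R function LBE: the name `lbe_pi0`, the `truncate` option and the rejection of P-values equal to one (for which $-\ln(1 - p)$ is infinite) are assumptions made here.

```python
import math


def lbe_pi0(p_values, n=1, truncate=True):
    """Location-based estimate of an upper bound of pi0.

    Computes (1/m) * sum((-ln(1 - p_i))**n) / n!, using
    E0[(-ln(1 - P))**n] = n! under the uniform null distribution.
    Sketch only; argument names and checks are assumptions.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    m = len(p_values)
    if m == 0:
        raise ValueError("at least one P-value is required")
    total = 0.0
    for p in p_values:
        if not 0.0 <= p < 1.0:
            raise ValueError("P-values must lie in [0, 1)")
        # log1p(-p) evaluates ln(1 - p) accurately for small p.
        total += (-math.log1p(-p)) ** n
    estimate = total / (m * math.factorial(n))
    # Optionally cap the estimate at one, since pi0 is a probability.
    return min(estimate, 1.0) if truncate else estimate


# Hypothetical check: p = 1 - exp(-1) is mapped to exactly 1 by -ln(1 - p).
p_one = 1.0 - math.exp(-1.0)
est_n1 = lbe_pi0([0.0, p_one], n=1, truncate=False)
est_n2 = lbe_pi0([p_one] * 4, n=2, truncate=False)
```

With n = 1 the transformed values are averaged directly; with n = 2 they are squared and the average is divided by 2! = 2.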

In practice, for a specified number m of tested hypotheses, one can choose n according to a chosen value l for the upper bound of the variance, such that $\frac{1}{m}\left[\frac{(2n)!}{(n!)^2} - 1\right] \le l$. Other rules may obviously be considered.
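
The selection rule can be sketched as below, assuming that the upper bound on the variance takes the form $\frac{1}{m}[(2n)!/(n!)^2 - 1]$, which is the form consistent with the standard error bounds quoted in the simulation section. The function names, the fallback to n = 1 when no value of n meets the threshold, and the small numerical tolerance are assumptions of this sketch.

```python
import math


def lbe_variance_bound(m, n):
    """Assumed upper bound on the variance: ((2n)! / (n!)**2 - 1) / m."""
    ratio = math.factorial(2 * n) // (math.factorial(n) ** 2)
    return (ratio - 1) / m


def choose_n(m, threshold=0.05 ** 2, n_max=10):
    """Largest n whose variance bound does not exceed the threshold.

    Falls back to n = 1 when even n = 1 exceeds the threshold
    (an assumption matching the choices reported for small m).
    """
    best = 1
    for n in range(1, n_max + 1):
        # Tolerance guards against floating-point rounding at equality.
        if lbe_variance_bound(m, n) <= threshold + 1e-12:
            best = n
        else:
            break
    return best


selected = {m: choose_n(m) for m in (100, 500, 2000)}
```

Under this reading, a threshold of 0.05 squared gives n = 1 for m = 100 and m = 500, and n = 2 for m = 2000. The exact crossover to n = 3 under this bound is m = 7600, since 19/m must not exceed 0.0025.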


    5 SIMULATIONS
In order to compare the proposed estimator of $\pi_0$, named LBE (Location Based Estimator), with those provided by QVALUE, BUM and SPLOSH, we performed a simulation study as follows.

Data were generated to mimic a two-class comparison study with normalized log-ratio measurements for m genes (i = 1, ..., m) obtained from 20 experiments corresponding to two conditions (j = 1, 2), each with 10 replicated samples (k = 1, ..., 10). Three total numbers of genes were considered (m = 100, 500 and 2000). In each case, all values were independently sampled from a normal distribution, $X_{i,j,k} \sim N(\mu_{ij}, 1)$. For the first condition, all the data were simulated with $\mu_{i1} = 0$. For the second condition, a proportion $\pi_0$ of genes were simulated with $\mu_{i2} = 0$ (unmodified genes), whereas modified genes were simulated using three different configurations: (a) $\mu_{i2} = 1$ for all modified genes; (b) $\mu_{i2} = 2$ for all modified genes; (c) the first half with $\mu_{i2} = 1$ and the second half with $\mu_{i2} = 2$. Three values of $\pi_0$ were considered ($\pi_0$ = 0.2, 0.5 and 0.8).

In each case, the P-values, calculated under the null hypothesis H 0: µ i1 = µ i2, were obtained from the Student's statistic. Then, we estimated {pi}0 from QVALUE, BUM, SPLOSH and LBE.

In the previous section, we provided a method to select n for the LBE estimator according to the experimental setup and a chosen threshold l for the variance. Using this rule with $l = 0.05^2$ for the variance, the selected value is n = 1 for m = 100 and m = 500, and n = 2 for m = 2000. However, for completeness, we considered the LBE estimation with n = 1 and n = 2 in each case.

For each setup, 1000 iterations were performed. The mean, the standard deviation and the mean square error of each estimator were estimated over 1000 iterations.
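
A reduced version of one simulated setup can be sketched in Python as follows. To keep the example dependent on the standard library only, it swaps in a two-sided z-test with known unit variance for the Student's t-test used in the study, so it does not reproduce the paper's tables. The function names, the seed, the clipping of P-values just below one and the use of 50 rather than 1000 iterations are assumptions of the sketch.

```python
import math
import random


def simulate_p_values(m, pi0, mu_alt, n_rep=10, rng=None):
    """P-values for m genes, two conditions, n_rep replicates each.

    Assumption: two-sided z-test with known unit variance,
    used here instead of Student's t-test.
    """
    rng = rng or random.Random()
    n_null = round(pi0 * m)
    se = math.sqrt(2.0 / n_rep)  # standard error of the mean difference
    p_values = []
    for i in range(m):
        mu2 = 0.0 if i < n_null else mu_alt
        mean1 = sum(rng.gauss(0.0, 1.0) for _ in range(n_rep)) / n_rep
        mean2 = sum(rng.gauss(mu2, 1.0) for _ in range(n_rep)) / n_rep
        z = (mean2 - mean1) / se
        # Two-sided normal tail probability: P(|Z| > |z|).
        p_values.append(math.erfc(abs(z) / math.sqrt(2.0)))
    return p_values


def lbe_pi0(p_values, n=1):
    """Untruncated LBE sketch; P-values are clipped just below one."""
    total = sum((-math.log1p(-min(p, 1.0 - 1e-12))) ** n for p in p_values)
    return total / (len(p_values) * math.factorial(n))


# Configuration (b) with pi0 = 0.8 and m = 500, over 50 iterations.
rng = random.Random(2004)
estimates = [lbe_pi0(simulate_p_values(500, 0.8, 2.0, rng=rng)) for _ in range(50)]
mean_est = sum(estimates) / len(estimates)
sd_est = math.sqrt(sum((e - mean_est) ** 2 for e in estimates) / (len(estimates) - 1))
```

The empirical mean and standard deviation of `estimates` play the role of the quantities summarized over the iterations of each setup.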

Table 1 displays the means of the five estimators (for each simulated configuration with the different methods). It shows that even if all the estimators are supposed to be conservatively biased, BUM and SPLOSH procedures dramatically underestimate {pi}0 in most of the simulated configurations. As an example, under configuration (b) and with {pi}0 = 80% and m = 500, the estimates mean for SPLOSH and BUM procedures are and . For a few cases, the estimates mean for QVALUE is less than {pi}0, particularly for a small number of genes and high value of {pi}0. Nevertheless, the greatest underestimation of QVALUE estimator is only of 3.8% [for {pi}0 = 80%, m = 100 and configuration (c)]. The proposed estimator with n = 1 provides an upper bound for {pi}0 in all the cases. For n = 2, the mean of over 1000 simulations is less than {pi}0 with a small difference (<3 x 10–3) in only six cases, which can be explained by the variability of the estimates mean. The estimations provided by LBE are greater than those provided by QVALUE in all cases except one. However, the difference is not >8.7% for n = 1 and 4.8% for n = 2.


Table 1 Mean of all estimates for each simulated configuration with the methods QVALUE, BUM, SPLOSH and LBE with n = 1 and n = 2

 
In contrast, Table 2, which displays the estimated standard error for each method, shows that the standard error of the proposed estimator for n = 1 is always less than that of QVALUE (the smallest difference is 1.8%). As expected, the mean of the proposed estimator decreases with n (in almost all cases) and its variance increases with n in all cases. However, for n = 2, there are only two cases in which the standard error of the proposed estimator is greater than that of QVALUE.


Table 2 Standard error for each simulated configuration with the methods QVALUE, BUM, SPLOSH and LBE with n = 1 and n = 2

 
The estimated standard errors for LBE with n = 1 and n = 2 are less than the upper bounds calculated from (7) for the standard error. Indeed, for m = 100, 500 and 2000 the calculated values are 0.1, 0.045, 0.022 (for n = 1) and 0.224, 0.1, 0.05 (for n = 2).
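
These calculated values can be checked by hand, assuming the bound on the variance has the form $\frac{1}{m}[(2n)!/(n!)^2 - 1]$, so that the bracket equals 1 for n = 1 and 5 for n = 2:

```latex
\begin{align*}
n = 1:&\quad \sqrt{1/100} = 0.100, \quad \sqrt{1/500} \approx 0.045, \quad \sqrt{1/2000} \approx 0.022 \\
n = 2:&\quad \sqrt{5/100} \approx 0.224, \quad \sqrt{5/500} = 0.100, \quad \sqrt{5/2000} = 0.050
\end{align*}
```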

Table 3 presents the mean square error for each estimator. Compared with QVALUE, Table 3 shows that for m = 100 and m = 500, the proposed estimator with n = 1 has the lower mean square error in 16 cases out of 18 and, for m = 2000, the proposed estimator with n = 2 has the lower mean square error in 6 cases out of 9. SPLOSH and BUM have the lowest mean square error of the five estimators in 6 and 5 cases out of 27, respectively. However, these results are difficult to interpret, since it was shown above that these latter estimators frequently tend to underestimate $\pi_0$.


Table 3 Mean square error for each simulated configuration with the methods QVALUE, BUM, SPLOSH and LBE with n = 1 and n = 2

 
As an example, Figure 1 presents histograms of the estimates for the four methods in one case [m = 2000, configuration (c) and $\pi_0$ = 0.8]. It illustrates that the proposed estimator appears to be normally distributed in finite samples, which also appears to be roughly true for QVALUE, but not for BUM and SPLOSH. The figure also illustrates that the variance of QVALUE is higher than the variance of the proposed estimator and that BUM and SPLOSH, in this case, underestimate $\pi_0$.



Fig. 1 Distribution of the estimates for QVALUE, BUM, SPLOSH and LBE with n = 1 and n = 2 in the case m = 2000, configuration (c) and $\pi_0$ = 0.8.

 
Concerning QVALUE and LBE, simulation results have shown that the upper bound for $\pi_0$ estimated by both methods is closer to the true value as $\pi_0$ increases and there is a large overlap between the distributions under the null and alternative hypotheses. This is not surprising since, from (1) and (4), the bias depends on $\pi_0$ and on the distribution of the P-values under the alternative hypothesis.

It is worth noting that, for practical use, investigators would probably truncate the estimator at one. However, simulation results (data not shown) indicate that if n is chosen according to the proposed rule, the truncated and untruncated estimators give very close results.


    6 EXAMPLES
Our proposed estimator, together with QVALUE, BUM and SPLOSH, was applied to the publicly available datasets from the breast cancer study conducted by Hedenfalk et al. (2001), the leukemia study conducted by Golub et al. (1999) and the apolipoprotein AI (Apo AI) experiment conducted by Callow et al. (2000).

The aim of the study of Hedenfalk et al. (2001) was to examine breast cancer tissues from patients with BRCA1- or BRCA2-related cancer and from cases of sporadic breast cancer in order to determine global gene expression patterns in the different classes of tumors. The initial dataset consists of expression ratios for 3226 genes, corresponding to the fluorescent intensities from a tumor sample divided by those from a common reference sample. For each gene, a log-expression ratio was available. In this paper, we focus on the comparison of BRCA1 and BRCA2 with a subset of 3030 genes for which log-ratio values were >0.1 and <10; the data were normalized following a classical analysis of variance model [same as in Broët et al. (2004)].

The aim of the study of Golub et al. (1999) was to identify the differentially expressed genes between acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). The expression levels of 6817 genes were measured using Affymetrix high-density oligonucleotide chips. Data were pre-processed as described in Dudoit et al. (2002), leading to the analysis of 3051 genes.

The aim of the study of Callow et al. (2000) was to identify genes with altered expression in the livers of apo AI knock-out mice compared with inbred control mice. The dataset considered consists of expression values for 6384 genes, corresponding to the log of the fluorescent intensities from a mouse sample divided by those from a common reference sample. We excluded genes having at least one fluorescent intensity equal to zero, so that 6226 genes were retained, and the data were standardized within arrays.

For each dataset, P-values were calculated for each gene from a two-sample t-test. Then, we applied the methods QVALUE, BUM, SPLOSH and LBE to these sets of P-values in order to estimate {pi}0.

The estimates obtained for $\pi_0$ by QVALUE, BUM, SPLOSH and LBE (with n = 2, which corresponds for the three datasets to a threshold $l = 0.05^2$ for the variance of the estimator) are as follows. For the Hedenfalk et al. dataset: 0.669, 0.586, 0.622 and 0.688, respectively; for the Golub et al. dataset: 0.496, 0.453, 0.524 and 0.525, respectively; and for the Callow et al. dataset: 0.901, 0.837, 0.830 and 0.895, respectively.

For each dataset, the LBE and QVALUE estimates are very close, which is not surprising in view of the simulation results presented in the previous section. For the first two datasets, the QVALUE estimate is lower than the LBE estimate, but for the third dataset, the LBE estimate is lower.

Unlike QVALUE, our method provides upper bounds for the variances, which are $1.65 \times 10^{-3}$, $1.64 \times 10^{-3}$ and $8.03 \times 10^{-4}$ for the Hedenfalk et al. dataset, the Golub et al. dataset and the Callow et al. dataset, respectively. These variances correspond to standard errors of 4.06, 4.05 and 2.83%, respectively.

As seen in Storey and Tibshirani (2003), FDR(t) is estimated by $\hat{\pi}_0\, m\, t / \#\{p_i \le t\}$. When selecting all genes so that the estimated FDR is <10%, for the three experiments QVALUE selects 290, 1206 and 9 genes, respectively, and our proposed method selects 282, 1187 and 9 genes, respectively. The BUM and SPLOSH procedures generally selected larger numbers of genes but, as shown by the simulation study, these procedures underestimate $\pi_0$ in many cases and the true FDR may be well above 10%.
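
The way an estimate of $\pi_0$ feeds into the FDR can be sketched as follows, assuming the plug-in form of Storey and Tibshirani (2003), in which FDR(t) is estimated by $\hat{\pi}_0\, m\, t$ divided by the number of P-values not exceeding t. The function names, the cap at one, the convention of returning 0 when no gene is selected and the example P-values are assumptions of this sketch.

```python
def estimated_fdr(p_values, t, pi0_hat):
    """Plug-in FDR estimate at threshold t: pi0_hat * m * t / #{p_i <= t}.

    Returns 0 when no P-value falls below t and caps the estimate at 1
    (both conventions are assumptions of this sketch).
    """
    m = len(p_values)
    n_selected = sum(1 for p in p_values if p <= t)
    if n_selected == 0:
        return 0.0
    return min(1.0, pi0_hat * m * t / n_selected)


def count_selected(p_values, pi0_hat, fdr_level=0.10):
    """Largest number of genes, in increasing P-value order,
    whose estimated FDR does not exceed fdr_level."""
    ordered = sorted(p_values)
    m = len(ordered)
    best = 0
    for k, p in enumerate(ordered, start=1):
        if pi0_hat * m * p / k <= fdr_level:
            best = k
    return best


# Hypothetical P-values and a hypothetical pi0 estimate of 0.8.
p_demo = [0.001, 0.002, 0.003, 0.5, 0.9]
fdr_at_t = estimated_fdr(p_demo, t=0.003, pi0_hat=0.8)
n_genes = count_selected(p_demo, pi0_hat=0.8, fdr_level=0.10)
```

A smaller value of `pi0_hat` lowers every estimated FDR, which is why an underestimated $\pi_0$ leads to more genes being selected at the same nominal level.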


    7 DISCUSSION
In this paper, we propose a novel procedure for estimating the FDR that proceeds, like QVALUE, BUM and SPLOSH, from the marginal distribution of the P-values. For all these procedures, a key quantity is the probability that a gene is unmodified. Estimating this quantity without making assumptions on the distribution of modified genes leads to a conservatively biased estimator of the FDR.

In contrast to QVALUE, BUM and SPLOSH, which rely on complex procedures to estimate the marginal density evaluated at one, our proposed estimators are simply obtained from the expectation of the transformed P-values. Moreover, we provide results on their asymptotic distribution under the assumption that the P-values are independent. From these estimators, the FDR and the q-values are easy to obtain.

In order to select one particular estimator from the proposed family, the following guidelines may be suggested. According to the experimental setup and a threshold $l = 0.05^2$ for the upper bound of the variance of the estimator, one may take n = 1 for 2 ≤ m < 2000, n = 2 for 2000 ≤ m < 7500 and n = 3 for m ≥ 7500. However, this threshold l is arbitrary and should be chosen according to the accuracy needed.

As seen in the simulation study, the BUM and SPLOSH procedures underestimate $\pi_0$ in most cases, leading to an anticonservatively biased estimator of the FDR. The simulation study has shown that the expectations of LBE and QVALUE are close, the latter providing the less biased estimator of $\pi_0$. However, our proposed estimator has the smaller variance, so that the risk of underestimating $\pi_0$ is smaller with LBE than with QVALUE. Regarding the trade-off between bias and variance, the mean square error of the proposed estimator is the smallest in most cases. When the four methods were applied to real datasets, QVALUE and LBE provided very close results, which is in agreement with the simulation results. BUM and SPLOSH selected a greater number of genes, but these results should be interpreted with caution in light of the simulation results.

Although the proposed method is dedicated to the FDR, the estimate of $\pi_0$ can be used with other criteria such as the local FDR (Efron et al., 2001).

In conclusion, the proposed method for estimating an upper bound of $\pi_0$ appears to be very useful for calculating the FDR and can be recommended for its good properties and its simplicity.


    8 APPENDIX
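8.1 Proof of $E(P) \le E_0(P)$
Assuming that the marginal pdf f is non-increasing and that $f_0 = 1_{[0,1]}$, the marginal cdf F and the conditional cdf under the null hypothesis $F_0$ are such that $F \ge F_0$ on [0, 1]. Then,

\[ E(P) = \int_0^1 [1 - F(p)]\, dp \le \int_0^1 [1 - F_0(p)]\, dp = E_0(P). \]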

8.2 Proof of theorem
The proof of the theorem follows the lemma:

LEMMA.

Let $f_0$ and $f_1$ be two non-increasing probability density functions of the random variable P defined on [0, 1] ($f_0$ denoting the one such that $\lim_{x \to 1} (f_1/f_0)(x) \le 1$) and let $\varphi$ be a continuous function defined on [0, 1] such that (i) $\lim_{x \to 1} \varphi(x) = +\infty$ and (ii) $\lim_{x \to 0} \varphi(x) < +\infty$.

Then, $E_1[\varphi(P)] - E_0[\varphi(P)] \le E_1(P) - E_0(P)$.

PROOF OF THE LEMMA











PROOF OF THE THEOREM

  1. Note: since $\varphi$ is convex (iii), Jensen's inequality gives $E_0[\varphi(P)] \ge \varphi[E_0(P)]$.
  2. From the lemma:


8.3 Proof of $E_1(\varphi(P)^{n+1})/E_0(\varphi(P)^{n+1}) \le E_1(\varphi(P)^n)/E_0(\varphi(P)^n)$ for all $n \in \mathbb{N}$ [with $\varphi(P) = -\ln(1 - P)$]
Following the same argument as previously, the following variant of the theorem can easily be shown:

THEOREM.

Let $g_0$ and $g_1$ be two non-increasing pdfs of the random variable Z defined on $[0, +\infty]$ [$g_0$ denoting the one such that $\lim_{x \to +\infty} (g_1/g_0)(x) \le 1$], and let $\psi$ be a real function defined on $[0, +\infty]$ satisfying the following conditions:

  1. $\lim_{x \to +\infty} [\psi(x) - x] = +\infty$
  2. $\lim_{x \to 0} \psi(x) < +\infty$
  3. $\psi$ is convex
  4. $\psi[E_0(Z)] \ge E_0(Z)$
Then:

\[ \frac{E_1[\psi(Z)]}{E_0[\psi(Z)]} \le \frac{E_1(Z)}{E_0(Z)} \]

Denote g 0 and g 1 the conditional pdf of the random variable Z = {varphi}(P) n under the null hypothesis and under the alternative hypothesis, respectively. . Indeed:

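Let $\psi : [0, +\infty] \to \mathbb{R}$ be such that $\psi(Z) = Z^{(n+1)/n}$.

  1. $\lim_{x \to +\infty} [\psi(x) - x] = \lim_{x \to +\infty} [x^{(n+1)/n} - x] = +\infty$.
  2. $\psi(0) = 0 \Rightarrow \lim_{x \to 0} \psi(x) < +\infty$
  3. $\psi''(x) = [(n + 1)/n^2]\, x^{(1-n)/n} \ge 0 \Rightarrow \psi$ is convex
  4. $E_0(Z) = n! \ge 1$ (Appendix 4) $\Rightarrow \psi[E_0(Z)] = E_0(Z)^{(n+1)/n} \ge E_0(Z)$

Then, following the previous theorem and noting that $\psi(Z) = \varphi(P)^{n+1}$:

\[ \frac{E_1[\varphi(P)^{n+1}]}{E_0[\varphi(P)^{n+1}]} \le \frac{E_1[\varphi(P)^n]}{E_0[\varphi(P)^n]} \]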

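8.4 Proof of $\varphi(P) \sim \exp(1) \Rightarrow E_0[\varphi(P)^n] = n!$
Let $X \sim \exp(1)$.

The equality $E_0[\varphi(P)^n] = n!$ is obviously true for n = 1 and n = 2:

\[ E(X) = \int_0^{+\infty} x e^{-x}\, dx = 1 = 1!, \qquad E(X^2) = \int_0^{+\infty} x^2 e^{-x}\, dx = 2 = 2!. \]

Let us assume that $E(X^n) = n!$ and show that $E(X^{n+1}) = (n + 1)!$:

\[ E(X^{n+1}) = \int_0^{+\infty} x^{n+1} e^{-x}\, dx = \left[ -x^{n+1} e^{-x} \right]_0^{+\infty} + (n + 1) \int_0^{+\infty} x^n e^{-x}\, dx = (n + 1) E(X^n) = (n + 1)!. \]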

Received on April 7, 2004; revised on June 23, 2004; accepted on July 24, 2004

    REFERENCES

    Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B, 57, 289–300.

    Broët, P., Lewin, A., Richardson, S., Dalmasso, C., Magdelenat, H. (2004) A mixture model-based strategy for selecting sets of genes in multiclass response microarray experiments. Bioinformatics, (Epub ahead of print).

    Callow, M.J., Dudoit, S., Gong, E.L., Speed, T.P., Rubin, E.M. (2000) Microarray expression profiling identifies genes with altered expression in HDL-deficient mice. Genome Res., 10, 2022–2029.

    Dudoit, S., Fridlyand, J., Speed, T.P. (2002) Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Stat. Assoc., 97, 77–87.

    Efron, B., Tibshirani, R., Storey, J., Tusher, V. (2001) Empirical Bayes analysis of a microarray experiment. J. Am. Stat. Assoc., 96, 1151–1160.

    Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., Lander, E.S. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531–537.

    Hedenfalk, I., Duggan, D., Chen, Y., Radmacher, M., Bittner, M., Simon, R., Meltzer, P., Gusterson, B., Esteller, M., Kallioniemi, O.P., et al. (2001) Gene-expression profiles in hereditary breast cancer. N. Engl. J. Med., 344, 539–548.

    Hochberg, Y. and Tamhane, A. (1987) Multiple Comparison Procedures. Wiley.

    Loader, C. (1999) Local Regression and Likelihood. Springer-Verlag, NY.

    McLachlan, G. and Peel, D. (2000) Finite Mixture Models. Wiley, NY.

    Pounds, S. and Cheng, C. (2004) Improving false discovery rate estimation. Bioinformatics, 20, 1737–1745.

    Pounds, S. and Morris, S.W. (2003) Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of p-values. Bioinformatics, 19, 1236–1242.

    Schena, M. (2000) Microarray biochip technology. Biotechniques, (in press).

    Storey, J.D. (2001) A direct approach to false discovery rates. J. R. Stat. Soc. Ser. B, 64, 479–498.

    Storey, J.D. (2003) The positive false discovery rate: a Bayesian interpretation and the q-value. Ann. Stat., 31, 2013–2035.

    Storey, J.D. and Tibshirani, R. (2003) Statistical significance for genome-wide studies. Proc. Natl Acad. Sci. USA, 100, 9440–9445.





