Skip Navigation


Bioinformatics Advance Access originally published online on July 17, 2008
Bioinformatics 2008 24(18):2015-2022; doi:10.1093/bioinformatics/btn373
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
24/18/2015    most recent
btn373v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Chen, L.
Right arrow Articles by Zhao, H.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Chen, L.
Right arrow Articles by Zhao, H.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Considering dependence among genes and markers for false discovery control in eQTL mapping

Liang Chen 1,*, Tiejun Tong 2 and Hongyu Zhao 3,4,*

1Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA, 2Department of Applied Mathematics, University of Colorado, Boulder, CO, 3Department of Epidemiology and Public Health and 4Department of Genetics, Yale University, New Haven, CT, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: Multiple comparison adjustment is a significant and challenging statistical issue in large-scale biological studies. In previous studies, dependence among genes is largely ignored. However, such dependence may be strong for some genomic-scale studies such as genetical genomics [also called expression quantitative trait loci (eQTL) mapping] in which thousands of genes are treated as quantitative traits and mapped to different genetical markers. Besides the dependence among markers, the dependence among the expression levels of genes can also have a significant impact on data analysis and interpretation.

Results: In this article, we propose to consider both the mean as well as the variance of false discovery number for multiple comparison adjustment to handle dependence among hypotheses. This is achieved by developing a variance estimator for false discovery number, and using the upper bound of false discovery proportion (uFDP) for false discovery control. More importantly, we introduce a weighted version of uFDP (wuFDP) control to improve the statistical power of eQTL identification. In addition, the wuFDP approach can better control false positives than false discovery rate (FDR) and uFDP approaches when markers are in linkage disequilibrium. The relative performance of uFDP control and wuFDP control is illustrated through simulation studies and real data analysis.

Contacts: liang.chen{at}usc.edu; hongyu.zhao{at}yale.edu

Supplementary information: Supplementary figures, tables and appendices are available at Bioinformatics online.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Advanced chip technologies such as microarrays facilitate biological discoveries by studying thousands of genes simultaneously. However, false positive control presents a challenging statistical problem because a large number of hypotheses are tested in such studies. Family-wise error rate (FWER) control is one approach for multiple comparison correction and is defined as the probability of at least one false positive occurring. For example, when FWER is controlled at 0.01, the probability of identifying one or more false positives is less than or equal to 0.01. However, it is generally agreed that FWER control is conservative in genomic studies when there may be many signals and the primary goal is discovery. We may relax our criteria for more discoveries by tolerating more false positives. Consequently, alternative procedures, e.g. false discovery rate [FDR, which is the expectation of false discovery proportion (FDP)] controls (Benjamini and Hochberg, 1995; Storey and Tibshirani, 2003), are widely used for multiple comparison correction in high-dimensional genomic studies. However, the dependence among hypotheses is largely ignored in the existing methods based on FDR control, despite the facts that correlations among hypotheses may be high for genomics studies.

In this article, we focus on the analysis and interpretation of data arising from genetical genomics [expression quantitative trait loci (eQTL) mapping] studies, whose goal is to search for genetic loci associated with gene expression variations in a study population, e.g. samples from experimental crosses or an outbred population. Genetical genomics allows us to systematically study transcriptional regulation through sequence variations across study subjects. In this context, sequence variations can be considered as natural perturbations that can affect gene expressions. This approach has been successfully applied to yeast, fly, maize, mice, rat, human and other organisms (Brem et al., 2002; Bystrykh et al., 2005; Chesler et al., 2005; Hubner et al., 2005; Morley et al., 2004; Schadt et al., 2003; Spielman et al., 2007; Stranger et al., 2005). In these studies, gene expressions can vary significantly across individuals and genes often exhibit a complicated correlation structure among them. For example, genes sharing biological functions or in the same chromosomal domains (Cohen et al., 2000) may be correlated. In addition, markers in close physical proximity may be in linkage disequilibrium and they are highly correlated. Therefore, there is a need to develop statistical methods to address the issue of dependence among hypotheses in such studies. Recently, several papers have addressed the multiple comparison problem for correlated hypothesis tests. (Lehmann and Romano, 2005) proposed to control the probability of k or more false rejections as a generalized family-wise error rate control without making any assumptions about the dependence structure among different hypotheses. (Efron, 2007) addressed this problem from a different perspective by proposing to use the expectation of false discovery number conditioning on a correlation effect parameter.

In this article, we propose to control the upper bound of FDP (uFDP) to handle multiple comparisons for dependent hypotheses. It is similar to the generalized family-wise error rate control because we control the probability of false rejection proportion larger than a given threshold. More importantly, we introduce a weighted version to control the uFDP (wuFDP) to improve the statistical power of eQTL identification. These weights are related to the correlation structure of the hypotheses. Thus, in contrast to previous studies, we not only consider the dependence among hypotheses, but also utilize it to improve statistical power. In addition, the wuFDP approach can better control false positives than FDR and uFDP approaches when markers are in linkage disequilibrium. We, therefore, recommend using wuFDP control as a practical approach to identify significant marker–gene pairs in eQTL studies.

In the following sections, we will illustrate the effect of dependence on false discovery control using an eQTL dataset. Then, we will introduce uFDP and wuFDP controls followed by simulations and real data analysis. Finally, we conclude this article in Section 4.


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
2.1 Impact of dependence among hypotheses
As discussed earlier, the objective of an eQTL study is to identify chromosomal regions affecting the expression levels of the genes measured on microarrays. For each individual, both gene expression levels and marker genotypes are collected and their associations are investigated. If each gene is treated as a quantitative trait, traditional QTL mapping methods can be applied to identify markers associated with each gene. Much work has been done in the literature to consider the dependence among markers in QTL mapping when only one trait is studied. However, the expression levels of many genes are highly dependent, and appropriate statistical methods are needed to take into account such dependence. The complexity of gene expression pattern in eQTL mapping makes the dependence among hypotheses complicated. Supplementary Figure 1 illustrates the non-trivial dependence among genes in an eQTL expression dataset. The dataset contains 1000 genes for 60 individuals. The details of the dataset are described subsequently in Section 3.3. This figure shows the histograms of pairwise correlations among genes. It clearly indicates a strong dependence pattern among genes. When genes are highly correlated with each other, if one of them is falsely declared significant, other correlated genes are also likely to be falsely declared significant. In the presence of correlation among hypotheses, the variance of false positive number will increase and correspondingly FDP may deviate much more from the mean, i.e. FDR, than when hypotheses are independent of each other. This phenomenon was also explored by (Owen, 2005) for the identification of differentially expressed genes under two conditions.


Figure 1
View larger version (21K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. Boxplots of FDP or weighted FDP among 1000 simulations for FDR, uFDP and wuFDP controls. The differential signal β=0.5. The threshold for FDR, uFDP and wuFDP is 0.1. z1–{alpha} is 1.65. The number of correlated non-differentially expressed genes (nc) varies from 0 to 200. The correlation among these genes (c) varies from 0.6 to 0.9.

 
To illustrate the magnitude of the variation of false positive number in the presence of correlated hypotheses in eQTL studies, Supplementary Figure 2 shows the histogram of false positive number from permuted datasets, where all the null hypotheses are true. The permuted data were generated from the original data as follows. In the original dataset, the observations for each individual consist of two components, gene expression data and genotype data. For each permuted dataset, the association between these two sets of observations was randomly paired across all the individuals, whereas the gene expression data vector and the genotype vector were kept intact individually. That is, we permuted the whole transcriptome together instead of permuting every gene separately, so the dependence among genes and the dependence among markers were kept but any association between genes and markers was destroyed. A linear regression model was used to test the association between gene expression and marker genotype. For permuted datasets, all the discoveries are false positives. For 1000 simulations, using the P-value threshold of 1.0 x 10–5, although the average false positive number 258.1 was close to 202.8 (the expected false positive number for 1000 gene by 20 281 marker comparisons under the independence situation), the false positive number was very high for some permuted datasets and the maximum false positive number was 482. As can be seen from Supplementary Figure 2, the distribution is right-skewed and the false positive number tends to deviate much from the mean for some permutations. The SD of the false positive numbers across 1000 permutations was about 31.6, much larger than 14.2, the expected SD when all the hypotheses were independent.


Figure 2
View larger version (25K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. Statistical power for different differential signal (0.3–0.9). The number of correlated non-differentially expressed genes varies from 50 to 200. The correlation among these genes varies from 0.6 to 0.9. The threshold for uFDP and wuFDP is 0.1. z1–{alpha} is 1.65. Triangle symbol line is for wuFDP control and cross symbol line is for uFDP control.

 
As we mentioned that much work has been done to consider marker dependence in QTL mapping. Most of them focus on the FWER control to avoid any single false positive across the whole genome scan (Doerge and Churchill, 1996; Churchill and Doerge, 1994). It has been reported that the applicability of traditional FDR approach for the linkage analysis of a single trait is dubious (Chen and Storey, 2006). The dependence among markers makes the interpretation of FDR problematic. By genotyping more markers around the causal marker, we can identify more true positives. These true positives cannot provide us additional information because they represent the same signal from the causal marker. However, the FDR will be decreased by adding these markers. Correspondingly, more false positives will be included to achieve the specified FDR level. To overcome the above limitations of FDR control, we propose a wuFDP control. First, we define uFDP control.

2.2 uFDP and wuFDP control
As shown in Section 2.1, because FDR control only considers the expected false discovery proportion, the FDP for each realization may differ from the desired FDR level, an issue that may become much more severe when the hypotheses are dependent on each other. To remedy this problem, we propose to control FDP at a certain level so that we are confident about our results. FDP is formally defined as (Lehmann and Romano, 2005)


Formula

where V is the false positive number and R is the total discovery number. V can be written as {sum}FormulavFormula, where h is the total number of hypotheses and vi is the indicator function of whether hypothesis i is falsely rejected. R can be written as {sum}Formulari and ri is the indicator function of whether hypothesis i is rejected.

If R>0, we define the uFDP as


Formula

where z1–{alpha} is the 100(1–{alpha})-th percentile of the standard normal distribution. According to the central limit theorem, if v1,..., vh are independent of each other, when h->{infty}, Formula converges to the standard normal distribution. For a given dataset, R is fixed, if we approximate the distribution of V as a normal distribution and if both the mean and variance of V can be calculated,


Formula

Therefore, for a given {alpha}, the probability that FDP is larger than uFDP can be controlled at {alpha}. It will have a better statistical power than other procedures which control Pr(V/R≥uFDP)≤{alpha} with the equal sign only achieved under certain situations. However, if v1,..., vh are dependent on each other, the approximation of V as a normal distribution may be inappropriate. Therefore, we consider a weighting scheme to make the dependence among hypotheses smaller. Specifically, if a gene (or marker) is highly correlated with other genes (or markers), it will be assigned a smaller weight. If {sum}Formulawiri>0, the weighted version of the uFDP is defined as


Formula

If we approximate the distribution of {sum}Formulawivi as a normal distribution, we can find wuFDP such that


Formula

If we treat V and {sum}Formulawivi as continuous random variables with a unimodal probability density function, according to Vysochanskiï–Petunin inequality (Vysochanskiï and Petunin, 1980), for Formula , we have,


Formula



Formula

If z1–{alpha} is chosen to be 2.33, the FDP or weighted FDP ({sum}Formulawivi/{sum}Formulawiri) are controlled at 0.01 level according to the normal distribution approximation and 0.08 level according to Vysochanskiï–Petunin inequality. If R or {sum}Formulawiri is equal to 0, FDP or weighted FDP is defined as 0, which can also be controlled at {alpha} level.

In practice, the threshold for rejecting hypotheses is chosen to let uFDP or wuFDP equal to our predefined level. uFDP and wuFDP can also be written as


Formula



Formula

where {pi}0 is the proportion of true null hypotheses among all the hypotheses. TFormula is the test statistic random variable under the null and Ti is the observed test statistic. I(·) is the indicator function and t is the threshold. In this article, we estimate {pi}0 as 1, which means only a very small proportion of true marker–gene associations among all of the possible marker–gene pairs. This is a reasonable assumption in genetical genomics studies. We note that a better estimator of {pi}0 will improve the results in other settings.

2.3 Variance estimation of V
In this section, we discuss how to estimate the variance of V in the context of eQTL analysis. To detect eQTL, we fit a linear regression model relating the expression of gene g to the genotype of marker m (coded as 0, 1 or 2 for homozygous rare, heterozygous and homozygous common alleles):


Formula

ygk is the expression level for gene g and individual k, xmk is the number of common alleles for marker m and individual k, the {varepsilon}gk (k=1,..., n) are independent normal random variables with mean 0 and variance {sigma}Formula. Y's and X's are standardized so that {sum}Formulaygk=0, {sum}FormulayFormula=1, {sum}Formulaxmk=0 and {sum}FormulaxFormula=1. Similarly, we can fit a regression between gene g' and marker m'. Although {varepsilon}gk are independent for different individuals and {varepsilon}g'k are independent for different individuals, {varepsilon}gk and {varepsilon}g'k may be related to each other for the same individual k. That is, we have the following covariance structure: Cov({varepsilon}gk, {varepsilon}g'k')={varrho}gg' for k=k' and Cov({varepsilon}gk, {varepsilon}g'k')=0 for k!=k'. Under the above setup, the least squares estimates for βgm and βg'm' are


Formula



Formula

where {sigma}Formula and {sigma}Formula can be estimated by the residual sum of squares divided by (n–2), and their estimates are denoted as Formula and Formula .

Under the null hypotheses that βgm=0 and βg'm'=0,


Formula



Formula

We define the test statistics as follows:


Formula



Formula

where Ptn–2 is the cdf of t distribution with n–2 degrees of freedom, {Phi}–1 is the inverse function of the cdf of standard normal. Under the null hypotheses, (TFormula, TFormula)T can be approximated by a bivariate normal distribution with mean (0, 0)T and variance-covariance matrix Formula . {rho}gg'mm' can be approximated as the correlation between Formula and Formula . With a large sample size, the latter one can be further approximated as the correlation between Formula and Formula , that is Formula . The correlation consists of two parts: {varrho}gg'/{sigma}g{sigma}g' is the correlation between expression-level residuals which can be estimated as the Pearson's correlation between Formula and Formula ; {sum}Formulaxmkxm'k can be treated as the correlation between markers. Denote the estimator for {rho}gg'mm' as Formula .

The covariance between vgm and vg'm' can be estimated as:


Formula

Therefore, for a threshold t and when {pi}0 is set as 1, we can calculate the FDR, uFDP and wuFDP as:


Formula

For each threshold t, we can count the total number of rejections R and calculate FDR, uFDP and wuFDP. We stop at the largest R whose corresponding FDR, uFDP and wuFDP are less than our predefined levels (e.g. 0.1). This FDR is similar to Storey's FDR (Storey and Tibshirani, 2003) except that we estimate {pi}0 as 1.

2.4 Weighting scheme
In this section, we discuss wuFDP control as a way to increase statistical power. The rationale of our approach is based on minimizing the variance of weighted false discoveries. Let {Sigma}=({tau}ij)i,j=1,...,h denote the estimated variance–covariance matrix of v1,...,vh. For the threshold t, we can get a corresponding P-value threshold p, and {tau}ij=pp2 for i=j and {tau}ij=pijp2 for i!=j where Formula . Let hypothesis i correspond to gene g and marker m and hypothesis j correspond to gene g' and marker m'. Given {sum}Formulawi=h, the optimal weights minimizing Var({sum}Formulawivi) are given as


Formula 1

(1)
where 1=(1,..., 1)T with size h. Note that the optimal weights in (1) are not guaranteed to be non-negative, which makes the interpretation difficult. The constraint that wi≥0 in variance minimization is commonly used in portfolio optimization. But the computation is prohibitive for our studies where there are thousands of hypotheses and the weights need to be updated many times according to different threshold values. In this article, we propose to use the following weight for hypothesis i:


Formula 2

(2)
where si={sum}Formula{tau}ij. In Appendix 1 in Supplementary Material, we proved that the weights in (2) are all positive under the setting that the tests are two-sided.

If the hypotheses are independent of each other (i.e. {Sigma} is diagonal), the optimal weights defined in (1) are


Formula

which are equal to our proposed weights in (2).

More generally, the optimality of our proposed weights holds when the variance–covariance matrix {Sigma} is a block diagonal matrix with compound symmetric matrix in each block. The proof is in Appendix 2, in Supplementary Material.

Though the proposed weights is not optimal for general {Sigma}, simulations (data not shown) indicate that the weighted variance is smaller than the original one in most situations. Appendix 3 in Supplementary Material provides some theoretical justifications under certain conditions. Specifically, if {tau}ij≤{tau}kl implies that fij({tau}ij)≥fkl({tau}kl), where fij({tau}ij)=1/sisj, we have


Formula

Because {tau}11=···={tau}hh in our study, we have


Formula

The sum of diagonal elements of the weighted variance is always larger than the original one. This implies that the reduction of variance using the proposed weights (or the optimal weights) are obtained in the off-diagonal elements. v1,..., vh are more likely to be uncorrelated.


    3 RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
3.1 Simulations for gene dependence
We focus on the dependence among genes first. We consider a study consisting of 100 individuals, with each one typed at one marker and measured expression levels for 1000 genes. As for genotype data, among these individuals, 9 have genotype aa (coded as 0), 42 have genotype Aa (coded as 1) and 49 have genotype AA (coded as 2). Thus the major allele frequency is 0.7 and the minor allele frequency is 0.3.

As for the expression levels of the 1000 genes, we assume that 50 genes have different expression levels across different genotype groups. Four scenarios are considered.

  1. All of the non-differentially expressed genes are independent.
  2. Fifty non-differentially expressed genes are correlated with each other (pairwise correlation is 0.6, 0.7, 0.8 or 0.9).
  3. Hundred non-differentially expressed genes are correlated with each other (pairwise correlation is 0.6, 0.7, 0.8 or 0.9).
  4. Two hundred non-differentially expressed genes are correlated with each other (pairwise correlation is 0.6, 0.7, 0.8 or 0.9).

For all of the scenarios, we assume that the differentially expressed genes and the non-differentially expressed genes are independent. For each individual, expression data was simulated using a multivariate normal distribution with mean 0 and corresponding covariance matrix defined by the above correlation structures. For the 50 differentially expressed genes, differential signal β was varied from 0.3, 0.4, 0.5, ..., to 0.9. The simulations were repeated 1000 times. Each gene expression and each marker genotype data were standardized to have sample mean 0 and sample variance 1/(n–1).

Table 1 summarizes the frequency of FDP or weighted FDP larger than 0.1 among those 1000 simulations for FDR, uFDP and wuFDP controls. The differential signal β is 0.5 and FDR, uFDP and wuFDP cutoff values are 0.1. z1–{alpha}=1.65 so that Pr(V/R≥0.1) and Pr({sum}wivi/{sum}wiri≥0.1) can be controlled at 0.05 level according to the normal distribution approximation. For wuFDP control, all of the estimated probabilities of weighted FDP larger than 0.1 are about 0.04~0.05 which are close to our significant level 0.05. Even for the FDP instead of the weighted one, wuFDP still performs well with Pr(FDP≥0.1)≤0.06. For uFDP control, the probabilities of FDP larger than 0.1 are much less than 0.05 when the dependence is high. It suggests that uFDP control may be too conservative in the presence of dependence. On the contrary, for FDR control, there is a chance of 29%~44% to have FDP larger than 0.1. This table indicates that wuFDP and uFDP control FDP and weighted FDP almost at the predefined significant level. However, FDP may be much larger than the controlled average level for regular FDR control. Figure 1 shows the boxplots of FDP or weighted FDP for FDR, uFDP and wuFDP controls. If the dependence among hypotheses is strong, FDP is more likely to deviate from the mean for FDR control. However, uFDP and wuFDP can control FDP not to deviate much from the mean.


View this table:
[in this window]
[in a new window]

 
Table 1. The frequency of FDP or weighted FDP larger than 0.1 among 1000 simulations for FDR, uFDP and wuFDP controls

 
Statistical power of uFDP and wuFDP is compared using these simulation results. Power is defined by the average of ratios between truly declared positives and total positives for these 1000 simulations. In Figure 2, power is plotted for each differential signal (0.3–0.9). The triangle symbol line is for wuFDP control and the cross symbol line is for uFDP control. When the correlation among non-differentially expressed genes increases or the number of correlated non-differentially expressed genes increases, wuFDP control has a better power than uFDP control. If all of the non-differentially expressed genes are independent of each other, Supplementary Figure 3 shows that wuFDP control performs almost the same as uFDP control.

Table 2 summarizes the average true positives and false positives when the signal is 0.5. Here, for wuFDP control, if one hypothesis is correctly rejected or falsely rejected, we count it as one without using its weight. wuFDP control performs well when some non-differentially expressed genes are correlated with each other. The performance of wuFDP is almost the same as that of uFDP, when all of the non-differentially expressed genes are independent of each other. uFDP controls false positives well for the dependent cases, but it has lower statistical power.


View this table:
[in this window]
[in a new window]

 
Table 2. The average true positives (TP) and false positives (FP) among 1000 simulations for FDR, uFDP and wuFDP controls

 
Average sensitivity is defined as the average of ratios between truly declared positives and total positives (i.e. 50). Average specificity is defined as the average of ratios between truly declared negatives and total negatives (i.e. 950). ROC curves are plotted according to the average sensitivity and 1-average specificity for wuFDP, uFDP and FDR controls. The ROC curves shown in Figure 3 also suggest that wuFDP control performs better than uFDP control under the dependence situation. Because the ROC curves correspond to the average sensitivity and the average specificity, FDR control also performs well. However, as we discussed before, FDP for a specific experiment may be much larger than the average value for FDR control. When the dependence is strong, the performance of wuFDP is as good as or even better than FDR control as shown in Figure 3.


Figure 3
View larger version (11K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. ROC curves for FDR, uFDP and wuFDP controls. The differential signal β=0.4. Two hundred non-differentially expressed gene are correlated with correlation 0.7. z1–{alpha} is 1.65 to control Pr(V/R≥uFDP) and Pr({sum}wivi/{sum}wiri≥wuFDP) less than or equal to 0.05. Triangle symbol line is for wuFDP control, cross symbol line is for uFDP control and circle symbol line is for FDR.

 
3.2 Simulations for marker dependence
The above simulations demonstrate that the uFDP and the wuFDP approaches can control the probability of Pr(FDP≥a specified value) at a specified level. In addition, the wuFDP control has a better power than the uFDP control. We next consider the dependence among markers.

We simulate marker genotype data for 100 F2 offsprings from intercross experiments using R package qtl (Broman et al., 2003). For 100 independent genes, 50 of them are differentially expressed. Two scenarios are considered.

  1. Ten markers are on 10 different chromosomes. For marker 1 on chromosome 1, 50 genes are differentially expressed according to the genotypes of marker 1.
  2. Compared with scenario 1, we add additional nine markers on chromosome 1. Thus, 10 markers are equally spaced on chromosome 1 with 1 cM distance. Another nine markers are on chromosome 2–10. Marker 1 is associated with 50 genes.
For scenario 2, markers on chromosome 1 are tightly linked. Therefore, any marker called significant on chromosome 1 is a true positive. Otherwise it is a false positive. Table 3 summarizes the average true positives and false positives for 1000 simulations. Compared with scenario 1, many more true positives are identified in scenario 2. However, these additional true discoveries are purely due to the linkage disequilibrium and they cannot provide more information about the association between gene and marker. On the other hand, we include many more false positives in scenario 2. According to the definition of FDR: Formula , given a test statistic threshold t, the increased number of true positives will make the FDR smaller. Correspondingly, a smaller threshold t will be chosen for a particular FDR level (e.g. 0.1). Thus, an arbitrary small threshold t can be chosen to control FDR at a specific level by adding more markers around the casual marker. The small threshold t will result in more false positives (40.6 versus 7.4). According to the definition of uFDP, the increased number of true positives will make the denominator of uFDP bigger and the uFDP will be smaller correspondingly. More false positives are included for scenario 2 (22.2 versus 3.1). But compared to FDR control, uFDP control is a more stringent approach and the number of false positives are smaller (22.2 versus 40.6). For wuFDP, smaller weights will be assigned to those highly correlated true positives. It penalizes the effect of additional markers and the denominator of wuFDP changes little. Correspondingly, the false positive number changes little (6.9 versus 3.5). Note that the correlation between two hypotheses under the null consists of two parts: the correlation between expression-level residuals and the correlation between markers (see Section 2.3). If the markers are close to each other, the correlation between hypotheses will be high. If two genes are associated with the same tested marker, the correlation between hypotheses may not necessarily be high because we use the correlation between expression residuals instead of expression levels themselves.


View this table:
[in this window]
[in a new window]

 
Table 3. A comparison of FDR, uFDP and wuFDP controls when markers are dependent

 
3.3 Real data analysis
We use a human eQTL dataset to demonstrate the usefulness of our proposed methods. Stranger et al. (2007) used Illumina's Sentrix Human-6 Expression BeadChips to measure gene expression in B-lymphocyte cells of CEPH Utah individuals. In total, we consider 60 unrelated individuals. There are four replicates for each individual. In the original paper, the raw datasets were background corrected and quantile normalized across replicates of a single individual followed by a median normalization across different individuals. We downloaded the normalized data from the Sanger Institute website (ftp://ftp.sanger.ac.uk/pub/genevar/). We chose the 1000 most variable probes among the total of 47 293 probes for further analysis after excluding probes with extreme outliers. These individuals are the subjects in the HapMap project (Consortium, 2003). Genotypes of 20 281 autosomal SNPs which are in the 5' UTR region or 3' UTR region and have minor allele frequency ≥0.1 were downloaded from the HapMap website (http://www.hapmap.org/). UTR is an important region for gene transcription regulation. Therefore, we prioritize these SNPs. The genotype is coded as 0, 1 or 2 for homozygous rare, heterozygous and homozygous common alleles. Gene expression and marker genotype data were standardized to have mean 0 and sample variance 1/(n–1).

As we know, markers far away on the same chromosome or on different chromosomes are in linkage equilibrium in a homogeneous population, as is the case for the samples considered here. Therefore, {sum}Formulaxmkxm'k is very close to 0 for such markers. As a result, the variance–covariance matrix of vi's can be simplified as a block diagonal matrix with each block i corresponding to hypotheses between those 1000 genes and marker i and its neighboring markers which are within 10 Mb region of marker i. FDR, uFDP and wuFDP procedures as mentioned before were applied to perform the analysis.

Using wuFDP 0.1 as a cutoff and at significance level 0.01 (z1–{alpha}=2.33), there are 81 significant gene–marker pairs corresponding to 29 genes and 79 markers. Using uFDP control, we can identify 74 eQTLs corresponding to 27 genes and 73 markers. Using FDR = 0.1 as a cutoff, 96 eQTLs corresponding to 36 genes and 94 markers can be identified. As we mentioned in Section 2.3, the FDR approach is Storey's approach (Storey and Tibshirani, 2003) with Formula .

The 74 significant gene–marker pairs resulting from uFDP control can be identified by all of the three methods: uFDP, wuFDP and FDR (see Supplementary Table 1). If the distance between the middle point of gene probe and marker is<1 Mb, we call this pair as cis-regulation pair. Otherwise, we call the pair as trans-regulation pair. Among the 74 pairs, 57 pairs are cis-regulation pairs. It was reported that signal of cis-regulation eQTL is more abundant and more stable than distal and trans-regulation eQTL across statistical methodologies (Stranger et al., 2005). Previous studies also found a significant proportion of cis-regulation eQTL pairs (Schadt et al., 2003). Table 4 lists the gene–marker pairs identified by wuFDP and FDR but not identified by uFDP control and the gene–marker pairs identified by FDR but not identified by wuFDP and uFDP controls. Compared with uFDP control, wuFDP control identified seven more eQTL pairs with five of them being cis-regulation pairs. Compared with FDR control, wuFDP control missed 15 eQTL pairs with only two of them being cis-regulation pairs.


View this table:
[in this window]
[in a new window]

 
Table 4. Gene–marker pairs identified by wuFDP or FDR but not uFDP

 
The number of pairs identified depends on the threshold used. Using wuFDP or uFDP 0.05 as a cutoff and at significance level 0.05 (z1–{alpha}=1.65), there are 72 or 65 eQTL pairs. Using FDR=0.05 as a cutoff, 81 eQTLs pairs can be identified.


    4 DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
False discovery control is widely used in genomics and proteomics studies involving large-scale hypothesis tests. Most of the approaches assume independence among genes. However, the dependence among hypotheses may lead to misleading interpretation of FDP. In this article, we proposed a method that not only considers the correlation effect in multiple comparison adjustment, but also utilizes such correlations among hypotheses to improve the power of identifying eQTL. We treat different hypotheses differently according to their correlations with other hypotheses. The effective size of each hypothesis for false discovery control is reflected in its weight. If the dataset shows non-trivial dependence among hypotheses such as in Supplementary Figure 1, we recommend to use the wuFDP control. Otherwise, either the wuFDP or uFDP control can be used since they perform almost the same (as shown in Supplementary Figure 3).

We note that weighted analysis has been proposed in the literature in multiple comparisons. Roeder et al. (2006) proposed to assign different weights to P-values for association tests between markers and complex diseases. The weights are predefined and based on prior information such as results from linkage analysis. If the linkage study is informative, which means we choose the correct weights, the weighted FDR procedure for the association study will improve power significantly. On the contrary, the power loss is small if the linkage study is uninformative. However, their weights do not address the dependence problem. Cheverud, (2001) proposed a simple correction for multiple comparisons in interval mapping. Hypothesis tests are not independent if makers are in linkage disequilibrium. The effective number of independent tests for these tests is used in Bonferroni correction. The effective number is measured by the variance of eigenvalues derived from the observed marker correlation matrix.

In this article, instead of only focusing on marker correlations, we also consider gene correlations. Weights based on the variance–covariance matrix of false discovery number V are considered when we perform multiple comparison adjustment. Genes (or markers) which have lower correlations with other genes (or markers) are assigned with larger weights. On the contrary, genes (or markers) which have high correlations with others are penalized by assigning smaller weights. Through these approaches, we can improve statistical power. More importantly, by assigning smaller weights to markers highly correlated with each other, we can handle the situation where many true positives actually come from the same signal source. The markers are declared significant only because they are in linkage disequilibrium with the causal marker. The augmented true positive number will lead to a smaller test statistic threshold for a predefined FDR level. It will include much more false discoveries. However, wuFDP control can penalize markers highly correlated with each other and the weighted number of true positives can represent the number of true signal sources. Therefore, the wuFDP approach can better control false positives.

uFDP and wuFDP controls can be easily applied to other types of data analysis, such as identifying differentially expressed genes across different conditions which sometimes can be treated as expression biomarkers for diseases. Besides the large-scale gene expression data, now it is feasible to do high-throughput and high-quality genotyping (e.g. the Affymetrix 500k array set covers about 500k SNPs and costs about $250 per sample). How to handle this large amount of data computationally is still a significant issue. In this article, we choose a computationally feasible weighting scheme. However, the optimality properties for those weights still need to be studied. And eQTL mapping only provides us the directed genetic interaction links from markers to genes. The detailed transcriptional regulatory path needs to be recovered in combination with other types of data.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
We thank two reviewers for their insightful comments.

Funding: This research was supported in part by NIH grants R01 GM59507, N01 HV28286, P30 DA018343, U24 NS051869, P50 HG 002790, and NSF grant DMS 0714817.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Joaquin Dopazo

Received on January 15, 2008; revised on July 4, 2008; accepted on July 15, 2008

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Benjamini Y, Hochberg Y. Controlling the false discovery rate - a practical and powerful appraoch to multiple testing. J. R. Stat. Soc. Ser. B Stat. Methodol. (1995) 57:289–300.

    Brem R, et al. Genetic dissection of transcriptional regulation in budding yeast. Science (2002) 296:752–755.[Abstract/Free Full Text]

    Broman K, et al. R/qtl: Qtl mapping in experimental crosses. Bioinformatics (2003) 19:889–890.[Abstract/Free Full Text]

    Bystrykh L, et al. Uncovering regulatory pathways that affect hematopoietic stem cell function using "genetical genomics". Nat. Genet. (2005) 37:225–232.[CrossRef][Web of Science][Medline]

    Chen L, Storey J. Relaxed significance criteria for linkage analysis. Genetics (2006) 173:2371–2381.[Abstract/Free Full Text]

    Chesler E, et al. Complex trait analysis of gene expression uncovers polygenic and pleiotropic networks that modulate nervous system function. Nat. Genet. (2005) 37:233–242.[CrossRef][Web of Science][Medline]

    Cheverud J. A simple correction for multiple comparisons in interval mapping genome scans. Heredity (2001) 87:52–58.[CrossRef][Web of Science][Medline]

    Churchill G, Doerge R. Empirical threshold values for quantitative trait mapping. Genetics (1994) 138:963–971.[Abstract]

    Cohen B, et al. A computational analysis of whole-genome expression data reveals chromosomal domains of gene expression. Nat. Genet. (2000) 26:183–186.[CrossRef][Web of Science][Medline]

    Consortium TIH. The international hapmap project. Nature (2003) 426:789–796.[CrossRef][Web of Science][Medline]

    Doerge R, Churchill G. Permutation tests for multiple loci affecting a quantitative character. Genetics (1996) 142:285–294.[Abstract]

    Efron B. Correlation and large-scale simultaneous significance testing. J. Am. Stat. Assoc. (2007) 102:93–103.[CrossRef][Web of Science]

    Hubner N, et al. Integrated transcriptional profiling and linkage analysis for identification of genes underlying disease. Nat. Genet. (2005) 37:243–253.[CrossRef][Web of Science][Medline]

    Lehmann E, Romano J. Generalizations of the familywise error rate. Ann. Stat. (2005) 33:1138–1154.[CrossRef]

    Morley M, et al. Genetic analysis of genome-wide variation in human gene expression. Nature (2004) 430:743–747.[CrossRef][Web of Science][Medline]

    Owen A. Variance of the number of false discoveries. J. R. Stat. Soc. Ser. B Stat. Methodol. (2005) 67:411–426.[CrossRef]

    Roeder K, et al. Using linkage genome scans to improve power of association in genome scans. Am. J. Hum. Genet. (2006) 78:243–252.[CrossRef][Web of Science][Medline]

    Schadt E, et al. Genetics of gene expression surveyed in maize and mouse and man. Nature (2003) 422:297–302.[CrossRef][Web of Science][Medline]

    Spielman R, et al. Common genetic variants account for differences in gene expression among ethnic groups. Nat. Genet. (2007) 39:226–231.[CrossRef][Web of Science][Medline]

    Storey J, Tibshirani R. Statistical significance for genomewide studies. Proc. Natl Acad. Sci. USA (2003) 100:9440–9445.[Abstract/Free Full Text]

    Stranger B, et al. Genome-wide associations of gene expression variation in humans. PLoS Genet. (2005) 1:e78.[CrossRef][Medline]

    Stranger B, et al. Population genomics of human gene expression. Nat. Genet. (2007) 39:1217–1224.[CrossRef][Web of Science][Medline]

    Vysochanskiï D, Petunin Y. Justification of the 3 {sigma} rule for unimodal distributions. Theor. Probab. Math. Stat. (1980) 21:22–36.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
24/18/2015    most recent
btn373v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Chen, L.
Right arrow Articles by Zhao, H.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Chen, L.
Right arrow Articles by Zhao, H.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?