Bioinformatics Advance Access originally published online on March 14, 2008
Bioinformatics 2008 24(9):1168-1174; doi:10.1093/bioinformatics/btn100
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Unequal group variances in microarray data analyses
1Department of Statistics, University of Örebro, Sweden, 2Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden and 3Department of Biomedical Sciences and Biotechnology, University of Brescia, Italy
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: In searching for differentially expressed (DE) genes in microarray data, we often observe a fraction of the genes to have unequal variability between groups. This is not an issue in large samples, where a valid test exists that uses individual variances separately. The problem arises in the small-sample setting, where the approximately valid Welch test lacks sensitivity, while the more sensitive moderated t-test assumes equal variance.
Methods: We introduce a moderated Welch test (MWT) that allows unequal variance between groups. It is based on (i) weighting of pooled and unpooled standard errors and (ii) improved estimation of the gene-level variance that exploits the information from across the genes.
Results: When a non-trivial proportion of genes has unequal variability, false discovery rate (FDR) estimates based on the standard t and moderated t-tests are often too optimistic, while the standard Welch test has low sensitivity. The MWT is shown to (i) perform better than the standard t, the standard Welch and the moderated t-tests when the variances are unequal between groups and (ii) perform similarly to the moderated t, and better than the standard t and Welch tests when the group variances are equal. These results mean that MWT is more reliable than other existing tests over wider range of data conditions.
Availability: R package to perform MWT is available at http://www.meb.ki.se/~yudpaw
Contact: yudi.pawitan{at}ki.se
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
|
|
|---|
Statistical analyses of microarray data have reached a certain consensus, for example (i) assessment based on false discovery rates (FDR) is more relevant than one based on p-values (e.g. Storey and Tibshirani, 2003) and (ii) in small samples the standard t-test can perform poorly, so it should be replaced by some modified versions (e.g. Baldi and Long, 2001; Efron et al., 2001; Lönnstedt and Speed, 2002; Ploner et al., 2006; Smyth, 2004). This article will focus on one issue for which there is yet no consensus: how to deal with unequal variability between groups. By unequal variance problem, we mean that a non-trivial proportion of the genes have unequal group variances. For clarity and simplicity, we will limit our discussion to the most common problem of finding differentially expressed (DE) genes between two biological conditions. We show that unequal group variance is a common problem in practice, and, in this setting, standard FDR estimates based on tests that assume equal variance can give misleading optimistic bias. The standard remedy using the Welch test is valid in large samples, but lacking in sensitivity in small samples. We propose a moderated version of the Welch test, and show that it works well in small samples.
1.1 Notation and standard tests
We start with the standard tests as a way of introducing the notation. Let n1 and n2 be the number of arrays for the two groups, with each array containing probes for m genes. For gene g, we observe a vector of expression values yg of length n1 + n2, which consists of the observations yg1 in the first group, and yg2 in the second group. We define the group means and SDs as usual, but for convenience we drop the subscript g. Denote the standard t-statistic by
|
|
|
|
|
|
|
|
|
|
In microarray data analysis, the standard t-test is known to have poor sensitivity (see e.g. Ploner et al., 2006). This property can be expected to occur also with the Welch version. The moderated t-statistic (Baldi and Long, 2001; Lönnstedt and Speed, 2002; Smyth, 2004), introduced to improve on the performance of the standard t-test, has the form
|
|
|
|
In principle, Ploner et al.'s multidimensional FDR idea (Ploner et al., 2006) also works for unequal variance case, but, again, the sample size must be large enough to generate permutation-based null distribution.
1.2 Unequal-variance problem
Unequal group variances occur frequently in practice. Figure 1 shows four different examples of the histogram of p-values from F-tests for equal variance; see Section 2.1 for the data description. The F-test is the most commonly used test of equal variance, but in principle any test of equal variance is applicable; for example, one might consider a test that is less sensitive to normal assumption. If the group variances are equal, the distributions should be uniform. In fact each histogram shows an excess of small p-values, indicating a sizable proportion of genes with unequal group variances. Using formula (6), we estimate the proportion of genes with unequal group variances as 0.10, 0.21, 0.38 and 0.20 for panels (a) to (d) respectively.
|
What is the effect of unequal variance on the search for DE genes? Figure 2a shows a simulation result under the null setup of no DE genes, comparing n1 = 3 versus n2 = 3 arrays of m = 10 000 genes. The correct FDR is equal to one, but the estimated FDRs based on the standard t and tm are clearly biased downwards, leading to declaration of DE genes with misleadingly small FDR estimate. What is surprising is that the problem is already visible for a balanced sample, where these tests are known to be robust against unequal variance. As we show later, the bias is worse for unbalanced samples. The Welch test W performs as expected with little bias.
|
Does this mean that one should always use W ? In large samples, W converges to the standard normal regardless of equality of the variances, so it is more robust than the standard t. However, in small samples, W suffers the same lack of sensitivity as the standard t. Figure 2b shows a simulation result comparing n1 = 3 versus n2 = 3 arrays, where there are truly DE genes. The true FDRs based on t (solid) and W (dotted) coincide, and these are substantially larger than the FDR achieved by the moderated statistic tm.
1.3 Moderated Welch test
Suppose, as we expect, there is a subset of genes where the group variances are equal and others where the variances are not equal. Had we known which is the case for each gene, we would use the pooled formula when the group variances are equal, and unpooled otherwise, i.e.
|
|
|
| (1) |
2 with degrees of freedom given by |
|
It is clear that probability weighting alone is not sufficient to get a sensitive procedure. For example, if all the group variances were equal, then the weighted formula would yield the pooled standard errors, and we would end up with the standard t-test. We now propose a moderated Welch test (MWT), whose form is motivated by the moderated tm statistic:
|
| (2) |
|
|
1.4 Summary
To summarize our findings, when measurement variability is unequal between groups: (i) the use of t or tm leads to more false positives than declared by the standard FDR estimates, (ii) the standard Welch test is unbiased, but lacks sensitivity, (iii) MWT deals with the unequal-variance problem without the loss of power suffered by the standard Welch test, and finally, (iv) when group variances are in fact equal, MWT and tm perform similarly. These results mean that MWT is a more robust test to use than the standard t, tm or W tests over a wide range of data conditions.
| 2 METHODOLOGY |
|---|
|
|
|---|
2.1 Datasets
To exhibit the common occurrence of unequal group variances (as shown in Figure 1) we used these datasets:
- The BRCA dataset (Hedenfalk et al., 2001), collected from patients with hereditary breast cancer, who had mutations either of the BRCA1 (n1 = 7) or the BRCA2 gene (n2 = 8). After quality control, we are left with 3170 genes.
- A dataset of 240 cases of diffuse large B-cell lymphoma (Rosenwald et al., 2002) with 7399 genes, where we compare n1 = 102 survivors and n2 = 138 non-survivors.
- A dataset of 159 population-based breast cancer cases (Pawitan et al., 2005a,b); we used the Affymetrix U133A chips and filtered out probes with <50% presence across the samples, leaving us with 11 295 genes for analysis. We compared the group variances of two key variables: (a) estrogen receptor (ER) status (n1 = 130 positive and n2 = 29 negative) and (b) survival status at 5 years (n1 = 121 alive and n2 = 38 dead). Note the unbalanced sample sizes in these examples.
For procedure comparisons, we also used the so-called Golden-Spike data (Choe et al., 2005). It is a dataset of 3860 RNA species, where 100–200 RNAs were spiked in at fold-change (FC) level ranging from 1.2- to 4-fold, while a set of 2551 RNA species was spiked-in at a constant (FC = 1) level. Data were designed as a two-group comparison, spike-in (S) versus control (C) (n = 3 each). There is a total of 14 010 probesets, and in our comparisons we define as true DE those probes that have fold-change at least 3, yielding 454 probes.
2.2 Mean and variance models
To allow small-sample inferences, we will assume a parametric model as follows. Given a gene-wise mean µgi and variance
from group i, the log-expression value of the jth sample is
|
| (3) |
|
The group variances
|
| (4) |
gi is an independent perturbation term. The term
gi for the unequal variances in Figure 1. Note that
gi = 0 for genes with equal group variance.
For explicit distribution theory, one option is to allow separate models for the two variances, but we have found that this approach did not work well. Instead we consider the following: first approximate the conditional distribution of the weighted standard error
as (a scaled)
2 with parameter
and dwg degrees of freedom. Then use the inverse
2 distribution to model
as
|
|
2 with (d0 + dwg + 2) and scale |
|
2.3 Estimation of d0 and ![]()
The hyper-parameters d0 and
reflect the variation of
across genes; because there is typically a large number of genes, the method-of-moments estimates for these parameters are adequate. Adapting Smyth (2004), we base our estimates on
|
|
|
|
and
' are the digamma and trigamma functions, and
2.4 Simulation
We perform simulations with m = 10 000 genes per array with two independent groups according to models (3) and (4). The common variance
is generated from an inverse
2 distribution with unit scale and 10 degrees of freedom. We set a fraction
v0 of the genes to have common group variance, and the rest to have unequal variance. Except when stated otherwise, we use
v0 = 0.7 throughout. For genes with unequal variances,
gi in (4) is an independent sample from N(0, 1).
We simulate null and non-null cases under various scenarios as described in Sections 4.2 and 4.3. In the the non-null case, we use a proportion of truly non-DE genes
0 = 0.9; for each DE gene, we set µg1 = 0 and
.
For each scenario we use 100 replications (for a total of 106 genes). In order to evaluate the performance of the different procedures we compare the global FDR associated with the k top ranking genes and draw the resulting true FDR curve as a function of the proportion of genes declared DE. For each procedure, the genes are first ranked according to the test statistics, then the FDR is computed from the known DE status as a proportion of false positives. That is,
|
|
2.5 FDR estimation
The FDR in the previous subsection is the true FDR achieved by a test procedure, which is an unknown parameter. In reality, this FDR needs to be estimated. We will use the usual estimator, for the k top genes,
|
| (5) |
0 of non-DE genes is estimated (e.g. Storey and Tibshirani, 2003) by
|
| (6) |
= 0.7, but the results are similar for values between 0.5 and 0.9. The computation of p-values is straightforward, since all null distributions are known. If the sample sizes are large enough, one might consider the permutation method to compute the p-values. In our computations we assume parametric models, since we wish to use the procedures in small samples.
2.6 FDR weights
The FDR weight w in (1) is computed as the estimated FDR (5) for group-variance comparisons. Under model (3) and under the null hypothesis of equal group variance, we have
|
|
v0 is similarly estimated using (6). In summary, for a gene g with sample variances | 3 OTHER METHODS FOR COMPARISONS |
|---|
|
|
|---|
Three additional methods for dealing with small samples and unequal variances were compared to MWT, namely local-pooled-error (LPE), weighted analysis of microarray experiments (WAME) and BGmix:
- The LPE test for DE genes (Jain et al., 2003) is based on a shrunken estimate of the within-group error variance.
- The WAME (Sjögren et al., 2007) assumes the arrays are dependent, with a certain variance–covariance matrix that accounts for heteroscedasticity and correlation.
- BGmix (Lewin et al., 2007) is a method for identifying DE genes based on a fully Bayesian mixture model. The model is formulated as a three-component mixture, describing genes that are under-, over- and non-DE.
| 4 RESULTS |
|---|
|
|
|---|
4.1 Completing Figure 2
Returning first to Figure 2, we now show the performance of MWT compared to the other tests. In this simulation we use
v0 = 0, i.e. all genes have unequal group variances, but because of the small sample size (n1 = n2 = 3), most of the variance inequalities are too small to detect; in fact the estimated
v0 using formula (6) is around 0.7. In the null case—Figure 4a—the MWT is much less biased than the t or tm tests, while in the non-null case (b) MWT achieves the same performance as the moderated tm.
|
4.2 Null case
It is known that the standard t-test is robust against unequal group variance, especially when the sample sizes are equal. Therefore, to investigate the effect of unequal variance on FDR estimates, we will consider balanced and unbalanced situations. Furthermore, the distribution of the Welch statistic is only approximate, so we expect a different performance in small and large samples. We set the proportion of genes with equal group variance at
v0 = 0.7 throughout. Hence we consider the following four scenarios:- Balanced small samples: n1 = 3 and n2 = 3.
- Unbalanced small samples: n1 = 3 and n2 = 9.
- Balanced larger samples: n1 = 10 and n2 = 10.
- Unbalanced larger samples: n1 = 10 and n2 = 30.
We first consider the null situation where there are no DE genes, so the true/correct FDR is equal to one for all procedures. Figure 5a shows that in a balanced small sample case, there is only a small bias in the FDR estimates. The bias becomes significantly worse when the data are unbalanced: see Figure 5b. Here all procedures show some bias, with MWT giving the smallest.
|
For balanced larger samples in panel (c), all procedures have very little bias. However, when the data are unbalanced (Figure 5d), only the Welch test and MWT are reliable. The standard t and moderated tm have a bias that persists in larger samples.
The figure for the comparison with LPE, WAME and BGmix is given the Supplementary Report II. In small samples, both balanced and unbalanced, LPE and WAME have little bias, with LPE having less bias. With bigger samples all methods have similar bias. In all four scenarios, BGmix suffers consistently worse bias than the others.
4.3 Non-null case
We now consider the case when there are truly DE genes. To reduce the number of plots, only two scenarios will be shown here (the other two are given in the Supplementary Report I):
- Balanced small samples: n1 = 3 and n2 = 3.
- Unbalanced larger samples: n1 = 10 and n2 = 30.
Figure 6a shows that the FDRs of the MWT and moderated tm tests are lower than the FDRs of the standard t and Welch tests. This means that MWT mirrors the known improvement in terms of sensitivity of moderated tm over the standard t-tests. However, Figure 6b shows that the FDR estimates based on the moderated tm and MWT tests tend to be slightly biased. The standard t-test produces the worst negative bias, while the Welch test will give highly conservative estimates.
|
In larger unbalanced situation—Figure 6c—the true FDRs of the Welch and MWT are slightly better than the FDRs of the t and moderated tm. But, more importantly, the bias problem of the latter two is persistent at larger sample size. At about 400 top genes, the bias of tm in (d) is about 50% of the target value shown in (c), so it is quite substantial.
Figures 7a and c show that WAME and MWT have similar true FDR both in small and larger samples, while LPE clearly produces higher FDR. However, both WAME and LPE have quite a large bias in the FDR estimation, both in small and big samples (Figs 7b and d). BGmix has the worst performance in both scenarios with a higher true FDR, and a consistently large negative bias.
|
Finally, in the Supplementary Report I we show that MWT performs similarly to the moderated tm when the group variances are in fact the same.
4.4 Golden-Spike data
Figure 8a shows the histogram of p-values for the F-test of equal variance of the spiked (n = 3) versus control (n = 3) groups of the Golden-Spike data. There is some indication of unequal variance, but it is not substantial. Furthermore, the samples are balanced, so we do not expect a big improvement in performance of MWT over the moderated tm test. This is what we see in the operating characteristic (OC) curves in Figure 8b, where MWT and tm perform similarly, and both are better than the standard t and Welch tests.
|
4.5 Real data analyses
In comparing the performance of the various procedures with real data, note that the standard t and moderated tm tend to be biased downwards, so the apparent estimates are not directly comparable. These estimates are given only to indicate that the FDR from MWT can be computed routinely. Figure 9 shows the FDRs from the datasets in Figure 1. The apparent estimates are comparable, except in the ER example, where we expect the standard and moderated statistics to be most biased.
|
4.6 Simulation from real data
To show the bias problem in real data, we simulate realistic data from the ER status data as follows:
- subtract the group-wise means for each gene;
- randomly subsample arrays with replacement separately from each group: n1 arrays from group 1, and n2 from group 2.
- for DE genes, add an effect size
in group 2, with equal proportions of under- and over-expression, and
the pooled sample SD of gene g, and D the effect size.
0 = 0.9, and the effect size to D = 1. It is worth noting that in this simulation setup, we preserve the real (and unknown) unequal variance patterns as well as the dependence structure between the genes. Figure 10a shows the true FDRs are comparable, and again (b) shows that MWT is unbiased, while the standard t and moderated tm statistics produce substantial negative bias. For example, at around 1000 genes, the bias is about 50% of the target value.
|
| 5 DISCUSSION |
|---|
|
|
|---|
The key ideas in this article are that (i) unequal group variance is a common problem in practice, (ii) this leads to biased FDR estimates based on standard procedures, (iii) the standard statistical test that deals with unequal variance—the Welch test—lacks power in small samples and (iv) there is a moderated form of the Welch test (MWT) that overcomes these problems. Since MWT performs similarly to the moderated tm when the group variances are the same, MWT can be used more reliably over a wider set of conditions.
If we use the standard t or tm tests when the group variances are unequal, the assumed null distribution will have incorrect variance. We then get a mixture of t-variates, which have a heavier tail than the assumed t distribution. This typically leads to estimates of FDR which are too optimistic and, consequently, to more false positives than declared. The standard Welch test gets an approximately correct null distribution, but lacks sensitivity in small samples, because of poor estimation of the variance parameters.
Extension of MWT to testing several groups is straightforward: first compute the FDR for testing equality of several variances, then apply it as a weight to the estimated mean-squared error under equal and under unequal variance assumptions. Since the detailed method and results are rather extensive, the full description of this idea is given in a separate paper (presently included in Supplementary Report II). Briefly, the results are similar to two-group comparisons: in small samples, the moderated F-statistic using weighted mean-squared errors performs better than the standard tests.
Our results show that when the proportion of genes with unequal variance is small or when the sample sizes are balanced, the standard procedures have only small bias; otherwise the bias can be quite large. From our various scenarios, we believe the standard Welch test should not be used in small samples, as it is worse than MWT. The gain in power by MWT over the Welch test is partly due to the weighted formula, which will favor the common variance for a large fraction of genes which have similar variances, and partly due to the pooling of information of standard errors from across the genes.
In conclusion, it might be a good practice to run tests of equal variance as in Figure 1 to check the common variance assumption. We will then know whether the standard tests are reliable. However, in any case the MWT can be expected to work well, so we would recommend its general use in place of the Welch test.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Martin Bishop
Received on January 21, 2008; revised on February 23, 2008; accepted on March 12, 2008
| REFERENCES |
|---|
|
|
|---|
Baldi P, Long AD. A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inference of gene changes. Bioinformatics (2001) 17:509–519.
Choe S, et al. Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset. Genome Biol (2005) 6:R16.[CrossRef][Medline]
Efron B, et al. Empirical Bayes analysis of a microarray experiment. J. Am. Stat. Soc (2001) 96:1151–1160.
Hedenfalk D, et al. Gene-expression profiles in hereditary breast cancer. N. Engl. J. Med (2001) 344:539–548.
Hu J, Wright FA. Assessing differential gene expression with small sample sizes in oligonucleotide arrays using a mean-variance model. Biometrics (2007) 63:41–49.[CrossRef][Web of Science][Medline]
Jain N, et al. Local-pooled-error test for identifying differentially expressed genes with a small number of replicated microarrays. Bioinformatics (2003) 19:1945–1951.
Lewin A, et al. Fully Bayesian mixture model for differential gene expression: simulations and model checks. Stat. Appl. Genet. Mol. Biol (2007) 6. Article 36.
Lönnstedt I, Speed T. Replicated microarray data. Statistica Sinica (2002) 12:31–46.[Web of Science]
Pawitan Y, et al. False discovery rate, sensitivity and sample size for microarray studies. Bioinformatics (2005a) 21:3017–3024.
Pawitan Y, et al. Gene expression profiling spares early breast cancer patients from adjuvant therapy—derived and validated in two population-based cohorts. Breast Cancer Res (2005b) 7:R953–R964.[CrossRef][Web of Science][Medline]
Ploner A, et al. Multidimensional local false discovery rate for microarray studies. Bioinformatics (2006) 22:556–565.
Rosenwald A, et al. The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. N. Engl. J. Med (2002) 346:1937–1947.
Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol (2004) 3. Article 3.
Storey JD, Tibshirani R. Statistical significance for genomewide studies. PNAS (2003) 100:9440–9445.
Sjögren A, et al. Weighted analysis of general microarray experiments. BMC Bioinformatics (2007) 8:387.[CrossRef][Medline]
Tusher VG, et al. Significance analysis of microarrays applied to the ionizing radiation response. PNAS (2001) 98:5116–5121.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||











