Skip Navigation


Bioinformatics Advance Access originally published online on March 14, 2008
Bioinformatics 2008 24(9):1168-1174; doi:10.1093/bioinformatics/btn100
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
24/9/1168    most recent
btn100v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Demissie, M.
Right arrow Articles by Pawitan, Y.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Demissie, M.
Right arrow Articles by Pawitan, Y.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Unequal group variances in microarray data analyses

Meaza Demissie 1, Barbara Mascialino 2, Stefano Calza 3 and Yudi Pawitan 2,*

1Department of Statistics, University of Örebro, Sweden, 2Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden and 3Department of Biomedical Sciences and Biotechnology, University of Brescia, Italy

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODOLOGY
 3 OTHER METHODS FOR...
 4 RESULTS
 5 DISCUSSION
 REFERENCES
 

Motivation: In searching for differentially expressed (DE) genes in microarray data, we often observe a fraction of the genes to have unequal variability between groups. This is not an issue in large samples, where a valid test exists that uses individual variances separately. The problem arises in the small-sample setting, where the approximately valid Welch test lacks sensitivity, while the more sensitive moderated t-test assumes equal variance.

Methods: We introduce a moderated Welch test (MWT) that allows unequal variance between groups. It is based on (i) weighting of pooled and unpooled standard errors and (ii) improved estimation of the gene-level variance that exploits the information from across the genes.

Results: When a non-trivial proportion of genes has unequal variability, false discovery rate (FDR) estimates based on the standard t and moderated t-tests are often too optimistic, while the standard Welch test has low sensitivity. The MWT is shown to (i) perform better than the standard t, the standard Welch and the moderated t-tests when the variances are unequal between groups and (ii) perform similarly to the moderated t, and better than the standard t and Welch tests when the group variances are equal. These results mean that MWT is more reliable than other existing tests over wider range of data conditions.

Availability: R package to perform MWT is available at http://www.meb.ki.se/~yudpaw

Contact: yudi.pawitan{at}ki.se

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODOLOGY
 3 OTHER METHODS FOR...
 4 RESULTS
 5 DISCUSSION
 REFERENCES
 
Statistical analyses of microarray data have reached a certain consensus, for example (i) assessment based on false discovery rates (FDR) is more relevant than one based on p-values (e.g. Storey and Tibshirani, 2003) and (ii) in small samples the standard t-test can perform poorly, so it should be replaced by some modified versions (e.g. Baldi and Long, 2001; Efron et al., 2001; Lönnstedt and Speed, 2002; Ploner et al., 2006; Smyth, 2004). This article will focus on one issue for which there is yet no consensus: how to deal with unequal variability between groups. By ‘unequal variance problem’, we mean that a non-trivial proportion of the genes have unequal group variances. For clarity and simplicity, we will limit our discussion to the most common problem of finding differentially expressed (DE) genes between two biological conditions. We show that unequal group variance is a common problem in practice, and, in this setting, standard FDR estimates based on tests that assume equal variance can give misleading optimistic bias. The standard remedy using the Welch test is valid in large samples, but lacking in sensitivity in small samples. We propose a moderated version of the Welch test, and show that it works well in small samples.

1.1 Notation and standard tests
We start with the standard tests as a way of introducing the notation. Let n1 and n2 be the number of arrays for the two groups, with each array containing probes for m genes. For gene g, we observe a vector of expression values yg of length n1 + n2, which consists of the observations yg1 in the first group, and yg2 in the second group. We define the group means and SDs as usual, but for convenience we drop the subscript g. Denote the standard t-statistic by


Formula

using pooled (squared) standard error


Formula

To allow for unequal variances, one might consider the so-called Welch's t-statistic


Formula

with the unpooled standard error


Formula

In large samples, W has an approximate standard normal distribution regardless of the variances, so no problem arises. However, in small samples the normal approximation does not apply. Instead, under the null, W has an approximate t distribution with estimated degrees of freedom given by


Formula

In microarray data analysis, the standard t-test is known to have poor sensitivity (see e.g. Ploner et al., 2006). This property can be expected to occur also with the Welch version. The moderated t-statistic (Baldi and Long, 2001; Lönnstedt and Speed, 2002; Smyth, 2004), introduced to improve on the performance of the standard t-test, has the form


Formula

where


Formula

and d0 and Formula are hyper-parameters to be estimated from the data. There are other improvements of the t-statistic, e.g. Efron's (Efron et al., 2001) or SAM (Tusher et al., 2001), or the multidimensional FDR (Ploner er al., 2006). However, these forms have unknown null distribution, so they require large-enough samples to allow a permutation test, while in this article our objective is to have a procedure that can work in small-sample problems where the permutation-based null distribution is not available. One key advantage of tm is that under the null it has the standard t distribution with d0 + d degrees of freedom, so no permutation computation is needed for inference and the procedure can be applied to small-sample comparisons.

In principle, Ploner et al.'s multidimensional FDR idea (Ploner et al., 2006) also works for unequal variance case, but, again, the sample size must be large enough to generate permutation-based null distribution.

1.2 Unequal-variance problem
Unequal group variances occur frequently in practice. Figure 1 shows four different examples of the histogram of p-values from F-tests for equal variance; see Section 2.1 for the data description. The F-test is the most commonly used test of equal variance, but in principle any test of equal variance is applicable; for example, one might consider a test that is less sensitive to normal assumption. If the group variances are equal, the distributions should be uniform. In fact each histogram shows an excess of small p-values, indicating a sizable proportion of genes with unequal group variances. Using formula (6), we estimate the proportion of genes with unequal group variances as 0.10, 0.21, 0.38 and 0.20 for panels (a) to (d) respectively.


Figure 1
View larger version (16K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. Histogram of p-values for gene-wise F-tests of equal group variances in four different comparisons: (a) BRCA1 (n1 = 7) versus BRCA2 mutation (n2 = 8) breast cancers. (b) n1 = 102 survivors versus n2 = 138 non-survivors of diffuse large B-cell lymphoma. (c) n1 = 130 estrogen receptor (ER) positive versus n2 = 29 ER negative breast cancers. (d) n1 = 121 survivors versus n2 = 38 non-survivors of breast cancer cases at 5 years.

 
What is the effect of unequal variance on the search for DE genes? Figure 2a shows a simulation result under the null setup of no DE genes, comparing n1 = 3 versus n2 = 3 arrays of m = 10 000 genes. The correct FDR is equal to one, but the estimated FDRs based on the standard t and tm are clearly biased downwards, leading to declaration of DE genes with misleadingly small FDR estimate. What is surprising is that the problem is already visible for a balanced sample, where these tests are known to be robust against unequal variance. As we show later, the bias is worse for unbalanced samples. The Welch test W performs as expected with little bias.


Figure 2
View larger version (8K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. FDR for simulated data under the null and non-null setting. Under the null, the true FDR of all procedures should be equal to one. (a) The average FDR estimates based on the standard t (solid) and moderated tm (dashed) tests are biased, but the estimate based on the Welch test (dotted) is not. (b) The standard t (solid) and the Welch tests (dotted, which coincides with solid, so it is additionally marked with triangles) lose sensitivity, giving larger FDR compared to the moderated tm test (dashed).

 
Does this mean that one should always use W ? In large samples, W converges to the standard normal regardless of equality of the variances, so it is more robust than the standard t. However, in small samples, W suffers the same lack of sensitivity as the standard t. Figure 2b shows a simulation result comparing n1 = 3 versus n2 = 3 arrays, where there are truly DE genes. The true FDRs based on t (solid) and W (dotted) coincide, and these are substantially larger than the FDR achieved by the moderated statistic tm.

1.3 Moderated Welch test
Suppose, as we expect, there is a subset of genes where the group variances are equal and others where the variances are not equal. Had we known which is the case for each gene, we would use the pooled formula when the group variances are equal, and unpooled otherwise, i.e.


Formula

where the indicator I(·) = 1 if the condition is true, and zero otherwise. In practice we of course do not know whether Formula , but we can investigate it empirically. This suggests a probability weighting:


Formula 1

(1)
where the weight w is the (posterior) probability that the variances are equal given the data. If we use the F-test of equal variance, we can condition on the extremeness of the observed statistic: F > Fobs. The probability Formula is in fact given by the FDR for testing the equality of variance, so it can be estimated from the data using standard FDR computations. Intuitively, if two observed variances are very different, it will be associated with low FDR and the unpooled standard error will get a much higher weight. We then approximate the distribution of Formula as a scaled {chi}2 with degrees of freedom given by


Formula

where d and df are the degrees of freedom of the pooled and unpooled standard errors, respectively.

It is clear that probability weighting alone is not sufficient to get a sensitive procedure. For example, if all the group variances were equal, then the weighted formula would yield the pooled standard errors, and we would end up with the standard t-test. We now propose a moderated Welch test (MWT), whose form is motivated by the moderated tm statistic:


Formula 2

(2)
based on the moderated standard error


Formula

where d0 and Formula are again hyper-parameters to be estimated from the data. When the variances are equal, Wm reduces to the moderated tm statistic, but we expect to reduce bias if the variances are unequal.

1.4 Summary
To summarize our findings, when measurement variability is unequal between groups: (i) the use of t or tm leads to more false positives than declared by the standard FDR estimates, (ii) the standard Welch test is unbiased, but lacks sensitivity, (iii) MWT deals with the unequal-variance problem without the loss of power suffered by the standard Welch test, and finally, (iv) when group variances are in fact equal, MWT and tm perform similarly. These results mean that MWT is a more robust test to use than the standard t, tm or W tests over a wide range of data conditions.


    2 METHODOLOGY
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODOLOGY
 3 OTHER METHODS FOR...
 4 RESULTS
 5 DISCUSSION
 REFERENCES
 
2.1 Datasets
To exhibit the common occurrence of unequal group variances (as shown in Figure 1) we used these datasets:

  • The BRCA dataset (Hedenfalk et al., 2001), collected from patients with hereditary breast cancer, who had mutations either of the BRCA1 (n1 = 7) or the BRCA2 gene (n2 = 8). After quality control, we are left with 3170 genes.
  • A dataset of 240 cases of diffuse large B-cell lymphoma (Rosenwald et al., 2002) with 7399 genes, where we compare n1 = 102 survivors and n2 = 138 non-survivors.
  • A dataset of 159 population-based breast cancer cases (Pawitan et al., 2005a,b); we used the Affymetrix U133A chips and filtered out probes with <50% presence across the samples, leaving us with 11 295 genes for analysis. We compared the group variances of two key variables: (a) estrogen receptor (ER) status (n1 = 130 positive and n2 = 29 negative) and (b) survival status at 5 years (n1 = 121 alive and n2 = 38 dead). Note the unbalanced sample sizes in these examples.

For procedure comparisons, we also used the so-called ‘Golden-Spike’ data (Choe et al., 2005). It is a dataset of 3860 RNA species, where 100–200 RNAs were spiked in at fold-change (FC) level ranging from 1.2- to 4-fold, while a set of 2551 RNA species was spiked-in at a constant (FC = 1) level. Data were designed as a two-group comparison, spike-in (S) versus control (C) (n = 3 each). There is a total of 14 010 probesets, and in our comparisons we define as true DE those probes that have fold-change at least 3, yielding 454 probes.

2.2 Mean and variance models
To allow small-sample inferences, we will assume a parametric model as follows. Given a gene-wise mean µgi and variance Formula from group i, the log-expression value of the jth sample is


Formula 3

(3)
To understand the variance pattern, we first analyze the observed gene-wise variances from the BRCA data and show the results in Figure 3. Some correlation (Pearson's r = 0.4) does exist, indicating a common element between the two groups, but the marginal distributions are similar. Similar patterns are observed in other datasets, with the correlation increasing with sample size, as expected since there is less sampling noise.


Figure 3
View larger version (18K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. Analysis of gene variances of the BRCA data: (a) Scatter plot of the variances from BRCA1 versus BRCA2; (b) The marginal density of the variances from BRCA1 (solid) and BRCA2 (dotted).

 
The group variances Formula and Formula vary across genes, so it is convenient to model them as random. In view of the pattern in Figure 3, we consider the model


Formula 4

(4)
where Formula is the common variance between the two groups, and {varepsilon}gi is an independent perturbation term. The term Formula will account for the correlation in Figure 3a, and {varepsilon}gi for the unequal variances in Figure 1. Note that {varepsilon}gi = 0 for genes with equal group variance.

For explicit distribution theory, one option is to allow separate models for the two variances, but we have found that this approach did not work well. Instead we consider the following: first approximate the conditional distribution of the weighted standard error Formula as (a scaled) {chi}2 with parameter Formula and dwg degrees of freedom. Then use the inverse {chi}2 distribution to model Formula as


Formula

where d0 is the degrees of freedom and Formula the scale parameter. Then, the posterior distribution of Formula given Formula is inverse {chi}2 with (d0 + dwg + 2) and scale Formula . So the posterior mean is


Formula

which is the standard error needed for the mean difference, leading to the moderated test Wm in (2). If the group variances are equal, this approach reduces to the standard tm statistic.

2.3 Estimation of d0 and Formula
The hyper-parameters d0 and Formula reflect the variation of Formula across genes; because there is typically a large number of genes, the method-of-moments estimates for these parameters are adequate. Adapting Smyth (2004), we base our estimates on


Formula

which is roughly normal with mean and variance


Formula

where {psi} and {psi}' are the digamma and trigamma functions, and Formula is the average of dwg. Using the sample versions of these moments, from the second equation we get an estimate of d0; then we can compute Formula from the first equation.

2.4 Simulation
We perform simulations with m = 10 000 genes per array with two independent groups according to models (3) and (4). The common variance Formula is generated from an inverse {chi}2 distribution with unit scale and 10 degrees of freedom. We set a fraction {pi}v0 of the genes to have common group variance, and the rest to have unequal variance. Except when stated otherwise, we use {pi}v0 = 0.7 throughout. For genes with unequal variances, {varepsilon}gi in (4) is an independent sample from N(0, 1).

We simulate null and non-null cases under various scenarios as described in Sections 4.2 and 4.3. In the the non-null case, we use a proportion of truly non-DE genes {pi}0 = 0.9; for each DE gene, we set µg1 = 0 and Formula .

For each scenario we use 100 replications (for a total of 106 genes). In order to evaluate the performance of the different procedures we compare the global FDR associated with the k top ranking genes and draw the resulting true FDR curve as a function of the proportion of genes declared DE. For each procedure, the genes are first ranked according to the test statistics, then the FDR is computed from the known DE status as a proportion of false positives. That is,


Formula

where the average is taken over 100 replications.

2.5 FDR estimation
The FDR in the previous subsection is the true FDR achieved by a test procedure, which is an unknown parameter. In reality, this FDR needs to be estimated. We will use the usual estimator, for the k top genes,


Formula 5

(5)
where p(k) is the kth highest p-value; monotonicity is then imposed by applying a cumulative minimum over k, ... , m. The proportion {pi}0 of non-DE genes is estimated (e.g. Storey and Tibshirani, 2003) by


Formula 6

(6)
In our computations we use a cutoff value {lambda} = 0.7, but the results are similar for values between 0.5 and 0.9.

The computation of p-values is straightforward, since all null distributions are known. If the sample sizes are large enough, one might consider the permutation method to compute the p-values. In our computations we assume parametric models, since we wish to use the procedures in small samples.

2.6 FDR weights
The FDR weight w in (1) is computed as the estimated FDR (5) for group-variance comparisons. Under model (3) and under the null hypothesis of equal group variance, we have


Formula

so the p-value can be computed easily. The parameter {pi}v0 is similarly estimated using (6). In summary, for a gene g with sample variances Formula and Formula , we obtain an F-statistic, p-value pg and weight wg = FDR(pg).


    3 OTHER METHODS FOR COMPARISONS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODOLOGY
 3 OTHER METHODS FOR...
 4 RESULTS
 5 DISCUSSION
 REFERENCES
 
Three additional methods for dealing with small samples and unequal variances were compared to MWT, namely local-pooled-error (LPE), weighted analysis of microarray experiments (WAME) and BGmix:

  • The LPE test for DE genes (Jain et al., 2003) is based on a shrunken estimate of the within-group error variance.
  • The WAME (Sjögren et al., 2007) assumes the arrays are dependent, with a certain variance–covariance matrix that accounts for heteroscedasticity and correlation.
  • BGmix (Lewin et al., 2007) is a method for identifying DE genes based on a fully Bayesian mixture model. The model is formulated as a three-component mixture, describing genes that are under-, over- and non-DE.
More detailed descriptions are given in the Supplementary Report II. All these methods are available as R packages, and in all of our analyses we use their default settings. We also considered the method by Hu and Wright (2007) as implemented in the code available at the authors’ website, but we were not able to include it in our simulations, because it was extremely slow.


    4 RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODOLOGY
 3 OTHER METHODS FOR...
 4 RESULTS
 5 DISCUSSION
 REFERENCES
 
4.1 Completing Figure 2
Returning first to Figure 2, we now show the performance of MWT compared to the other tests. In this simulation we use {pi}v0 = 0, i.e. all genes have unequal group variances, but because of the small sample size (n1 = n2 = 3), most of the variance inequalities are too small to detect; in fact the estimated {pi}v0 using formula (6) is around 0.7. In the null case—Figure 4a—the MWT is much less biased than the t or tm tests, while in the non-null case (b) MWT achieves the same performance as the moderated tm.


Figure 4
View larger version (8K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4. (a) Simulation results under the null. Shown are the average FDR estimates based on the standard t (solid), moderated tm (dashed), Welch (dotted) and MWT (thick solid). (b) The corresponding results under non-null setting. Because of overlaps, the Welch test is additionally marked with triangles, and the moderated tm with circles.

 
4.2 Null case
It is known that the standard t-test is robust against unequal group variance, especially when the sample sizes are equal. Therefore, to investigate the effect of unequal variance on FDR estimates, we will consider balanced and unbalanced situations. Furthermore, the distribution of the Welch statistic is only approximate, so we expect a different performance in small and large samples. We set the proportion of genes with equal group variance at {pi}v0 = 0.7 throughout. Hence we consider the following four scenarios:
  1. Balanced small samples: n1 = 3 and n2 = 3.
  2. Unbalanced small samples: n1 = 3 and n2 = 9.
  3. Balanced larger samples: n1 = 10 and n2 = 10.
  4. Unbalanced larger samples: n1 = 10 and n2 = 30.

We first consider the null situation where there are no DE genes, so the true/correct FDR is equal to one for all procedures. Figure 5a shows that in a balanced small sample case, there is only a small bias in the FDR estimates. The bias becomes significantly worse when the data are unbalanced: see Figure 5b. Here all procedures show some bias, with MWT giving the smallest.


Figure 5
View larger version (18K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5. Comparison of average FDR estimates based on the standard t (solid), moderated tm (dashed), Welch (dotted) and moderated Welch tests (thick solid). The data are generated under the null, so the correct FDR is equal to one. (a) n1 = 3 and n2 = 3; (b) n1 = 3 and n2 = 9; (c) n1 = 10 and n2 = 10; (d) n1 = 10 and n2 = 30.

 
For balanced larger samples in panel (c), all procedures have very little bias. However, when the data are unbalanced (Figure 5d), only the Welch test and MWT are reliable. The standard t and moderated tm have a bias that persists in larger samples.

The figure for the comparison with LPE, WAME and BGmix is given the Supplementary Report II. In small samples, both balanced and unbalanced, LPE and WAME have little bias, with LPE having less bias. With bigger samples all methods have similar bias. In all four scenarios, BGmix suffers consistently worse bias than the others.

4.3 Non-null case
We now consider the case when there are truly DE genes. To reduce the number of plots, only two scenarios will be shown here (the other two are given in the Supplementary Report I):

  • Balanced small samples: n1 = 3 and n2 = 3.
  • Unbalanced larger samples: n1 = 10 and n2 = 30.
We will highlight two key issues: (i) different procedures have different sensitivity, which can be compared using the corresponding true FDRs (in contrast to the null case, where all procedures have the same true FDR, which is equal to one) and (ii) different procedures have different biases.

Figure 6a shows that the FDRs of the MWT and moderated tm tests are lower than the FDRs of the standard t and Welch tests. This means that MWT mirrors the known improvement in terms of sensitivity of moderated tm over the standard t-tests. However, Figure 6b shows that the FDR estimates based on the moderated tm and MWT tests tend to be slightly biased. The standard t-test produces the worst negative bias, while the Welch test will give highly conservative estimates.


Figure 6
View larger version (25K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 6. (a) and (c) The true FDRs based on the standard t (solid), moderated tm (dashed with additional circles), Welch (dotted with additional triangles) and MWT (thick solid). (b) and (d) The corresponding bias of the FDR estimates based on those statistics. In (a) and (b) n1 = 3 and n2 = 3; in (c) and (d) n1 = 10 and n2 = 30.

 
In larger unbalanced situation—Figure 6c—the true FDRs of the Welch and MWT are slightly better than the FDRs of the t and moderated tm. But, more importantly, the bias problem of the latter two is persistent at larger sample size. At about 400 top genes, the bias of tm in (d) is about 50% of the target value shown in (c), so it is quite substantial.

Figures 7a and c show that WAME and MWT have similar true FDR both in small and larger samples, while LPE clearly produces higher FDR. However, both WAME and LPE have quite a large bias in the FDR estimation, both in small and big samples (Figs 7b and d). BGmix has the worst performance in both scenarios with a higher true FDR, and a consistently large negative bias.


Figure 7
View larger version (27K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 7. (a) and (c) The true FDRs based on the BGmix (solid), LPE (dashed), WAME (dotted) and MWT (thick solid). In (a) and (b) n1 = 3 and n2 = 3; in (c) and (d) n1 = 10 and n2 = 30.

 
Finally, in the Supplementary Report I we show that MWT performs similarly to the moderated tm when the group variances are in fact the same.

4.4 Golden-Spike data
Figure 8a shows the histogram of p-values for the F-test of equal variance of the spiked (n = 3) versus control (n = 3) groups of the Golden-Spike data. There is some indication of unequal variance, but it is not substantial. Furthermore, the samples are balanced, so we do not expect a big improvement in performance of MWT over the moderated tm test. This is what we see in the operating characteristic (OC) curves in Figure 8b, where MWT and tm perform similarly, and both are better than the standard t and Welch tests.


Figure 8
View larger version (11K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 8. Analysis of the Golden-Spike data from Choe et al. (2005). (a) The histogram of p-values for the F-test of equal variance. (b) OC curves of the standard t (solid), moderated tm (dashed), Welch (dotted) and MWT (thick solid) tests.

 
4.5 Real data analyses
In comparing the performance of the various procedures with real data, note that the standard t and moderated tm tend to be biased downwards, so the apparent estimates are not directly comparable. These estimates are given only to indicate that the FDR from MWT can be computed routinely. Figure 9 shows the FDRs from the datasets in Figure 1. The apparent estimates are comparable, except in the ER example, where we expect the standard and moderated statistics to be most biased.


Figure 9
View larger version (16K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 9. FDR estimates associated with the datasets in Figure 1: using the standard t (solid), moderated tm (dashed), Welch (dotted) and moderated Welch (thick solid).

 
4.6 Simulation from real data
To show the bias problem in real data, we simulate realistic data from the ER status data as follows:
  • subtract the group-wise means for each gene;
  • randomly subsample arrays with replacement separately from each group: n1 arrays from group 1, and n2 from group 2.
  • for DE genes, add an effect size Formula in group 2, with equal proportions of under- and over-expression, and Formula the pooled sample SD of gene g, and D the effect size.
Separate subsampling in the second step is important to preserve the unequal variances. For the example, we simulate two groups with n1 = 40 from the ER positive group and n2 = 10 from the ER negative group (n1/n2 ratio as observed). We set the proportion of non-DE genes to {pi}0 = 0.9, and the effect size to D = 1.

It is worth noting that in this simulation setup, we preserve the real (and unknown) unequal variance patterns as well as the dependence structure between the genes. Figure 10a shows the true FDRs are comparable, and again (b) shows that MWT is unbiased, while the standard t and moderated tm statistics produce substantial negative bias. For example, at around 1000 genes, the bias is about 50% of the target value.


Figure 10
View larger version (7K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 10. Simulated ER data: (a) true FDRs using the standard t (solid), moderated tm (dashed), Welch (dotted) and MWT (thick solid). (b) corresponding biases of the FDR estimates (t and tm coincide).

 

    5 DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODOLOGY
 3 OTHER METHODS FOR...
 4 RESULTS
 5 DISCUSSION
 REFERENCES
 
The key ideas in this article are that (i) unequal group variance is a common problem in practice, (ii) this leads to biased FDR estimates based on standard procedures, (iii) the standard statistical test that deals with unequal variance—the Welch test—lacks power in small samples and (iv) there is a moderated form of the Welch test (MWT) that overcomes these problems. Since MWT performs similarly to the moderated tm when the group variances are the same, MWT can be used more reliably over a wider set of conditions.

If we use the standard t or tm tests when the group variances are unequal, the assumed null distribution will have incorrect variance. We then get a mixture of t-variates, which have a heavier tail than the assumed t distribution. This typically leads to estimates of FDR which are too optimistic and, consequently, to more false positives than declared. The standard Welch test gets an approximately correct null distribution, but lacks sensitivity in small samples, because of poor estimation of the variance parameters.

Extension of MWT to testing several groups is straightforward: first compute the FDR for testing equality of several variances, then apply it as a weight to the estimated mean-squared error under equal and under unequal variance assumptions. Since the detailed method and results are rather extensive, the full description of this idea is given in a separate paper (presently included in Supplementary Report II). Briefly, the results are similar to two-group comparisons: in small samples, the moderated F-statistic using weighted mean-squared errors performs better than the standard tests.

Our results show that when the proportion of genes with unequal variance is small or when the sample sizes are balanced, the standard procedures have only small bias; otherwise the bias can be quite large. From our various scenarios, we believe the standard Welch test should not be used in small samples, as it is worse than MWT. The gain in power by MWT over the Welch test is partly due to the weighted formula, which will favor the common variance for a large fraction of genes which have similar variances, and partly due to the pooling of information of standard errors from across the genes.

In conclusion, it might be a good practice to run tests of equal variance as in Figure 1 to check the common variance assumption. We will then know whether the standard tests are reliable. However, in any case the MWT can be expected to work well, so we would recommend its general use in place of the Welch test.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Martin Bishop

Received on January 21, 2008; revised on February 23, 2008; accepted on March 12, 2008

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODOLOGY
 3 OTHER METHODS FOR...
 4 RESULTS
 5 DISCUSSION
 REFERENCES
 

    Baldi P, Long AD. A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inference of gene changes. Bioinformatics (2001) 17:509–519.[Abstract/Free Full Text]

    Choe S, et al. Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset. Genome Biol (2005) 6:R16.[CrossRef][Medline]

    Efron B, et al. Empirical Bayes analysis of a microarray experiment. J. Am. Stat. Soc (2001) 96:1151–1160.

    Hedenfalk D, et al. Gene-expression profiles in hereditary breast cancer. N. Engl. J. Med (2001) 344:539–548.[Abstract/Free Full Text]

    Hu J, Wright FA. Assessing differential gene expression with small sample sizes in oligonucleotide arrays using a mean-variance model. Biometrics (2007) 63:41–49.[CrossRef][Web of Science][Medline]

    Jain N, et al. Local-pooled-error test for identifying differentially expressed genes with a small number of replicated microarrays. Bioinformatics (2003) 19:1945–1951.[Abstract/Free Full Text]

    Lewin A, et al. Fully Bayesian mixture model for differential gene expression: simulations and model checks. Stat. Appl. Genet. Mol. Biol (2007) 6. Article 36.

    Lönnstedt I, Speed T. Replicated microarray data. Statistica Sinica (2002) 12:31–46.[Web of Science]

    Pawitan Y, et al. False discovery rate, sensitivity and sample size for microarray studies. Bioinformatics (2005a) 21:3017–3024.[Abstract/Free Full Text]

    Pawitan Y, et al. Gene expression profiling spares early breast cancer patients from adjuvant therapy—derived and validated in two population-based cohorts. Breast Cancer Res (2005b) 7:R953–R964.[CrossRef][Web of Science][Medline]

    Ploner A, et al. Multidimensional local false discovery rate for microarray studies. Bioinformatics (2006) 22:556–565.[Abstract/Free Full Text]

    Rosenwald A, et al. The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. N. Engl. J. Med (2002) 346:1937–1947.[Abstract/Free Full Text]

    Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol (2004) 3. Article 3.

    Storey JD, Tibshirani R. Statistical significance for genomewide studies. PNAS (2003) 100:9440–9445.[Abstract/Free Full Text]

    Sjögren A, et al. Weighted analysis of general microarray experiments. BMC Bioinformatics (2007) 8:387.[CrossRef][Medline]

    Tusher VG, et al. Significance analysis of microarrays applied to the ionizing radiation response. PNAS (2001) 98:5116–5121.[Abstract/Free Full Text]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
24/9/1168    most recent
btn100v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Demissie, M.
Right arrow Articles by Pawitan, Y.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Demissie, M.
Right arrow Articles by Pawitan, Y.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?