Skip Navigation


Bioinformatics Advance Access originally published online on July 26, 2006
Bioinformatics 2006 22(18):2244-2248; doi:10.1093/bioinformatics/btl383
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/18/2244    most recent
btl383v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (3)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Huang, Y.
Right arrow Articles by Hsu, J. C.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Huang, Y.
Right arrow Articles by Hsu, J. C.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

To permute or not to permute

Yifan Huang 1, Haiyan Xu 2, Violeta Calian 3 and Jason C. Hsu 4

1 H. Lee Moffitt Cancer Center & Research Institute, The University of South Florida Tampa, FL 33612, USA
2 Department of Clinical Biostatistics, Johnson & Johnson Pharmaceutical Research & Development L.L.C., USA
3 Science Institute, University of Iceland Dunhaga 3, 107 Reykjavik, Iceland
4 Department of Statistics, The Ohio State University Columbus, OH 43210, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MAIN RESULTS AND...
 3 EXAMPLES AND RECOMMENDATIONS
 REFERENCES
 

Permutation test is a popular technique for testing a hypothesis of no effect, when the distribution of the test statistic is unknown. To test the equality of two means, a permutation test might use a test statistic which is the difference of the two sample means in the univariate case. In the multivariate case, it might use a test statistic which is the maximum of the univariate test statistics. A permutation test then estimates the null distribution of the test statistic by permuting the observations between the two samples.

We will show that, for such tests, if the two distributions are not identical (as for example when they have unequal variances, correlations or skewness), then a permutation test for equality of means based on difference of sample means can have an inflated Type I error rate even when the means are equal. Our results illustrate permutation testing should be confined to testing for non-identical distributions.

Contact: calian{at}raunvis.hi.is


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MAIN RESULTS AND...
 3 EXAMPLES AND RECOMMENDATIONS
 REFERENCES
 
A popular technique for testing hypotheses of no effect, when the distribution of the test statistic is unknown, is to resample the data. Permutation testing is a version of this technique.

For example, to test the equality of two means Formula one might use a test statistic which is the difference of the two sample means, and estimate its null distribution by permuting the observations in the combined X and Y samples. The basis for permutation testing is if the X data are sampled from distribution PX and the Y data from distribution PY, then under the null hypothesis of identical distributions Formula all permutations of the observations are equally probable.

We will show, using simple examples, that if the identical distributions hypothesis Formula is false, as for example when PX and PY have unequal variances, skewness or (in the multivariate case) unequal correlations, then a permutation test for Formula based on difference of sample means can have an inflated Type I error rate even when Formula is true. Our results thus illustrate the appropriateness of permutation testing may depend on whether the purpose of testing is to detect differences in means, or non-identical distributions.

This purpose will depend on the intended application. Take the analysis of gene expression levels for instance. For discovering regulatory networks, it may be useful to detect groups of genes with non-identically distributed expression levels between normal and disease subjects. On the other hand, for selecting genes to train a prognostic algorithm using supervised machine learning, as in the re-analysis of the data in van't Veer et al. (2002) by Ein-Dor et al. (2005), detecting differences in gene expression levels (i.e. testing Formula) would be of primary interest if the prognostic algorithm is based on differences.


    2 MAIN RESULTS AND SIMULATED EXAMPLES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MAIN RESULTS AND...
 3 EXAMPLES AND RECOMMENDATIONS
 REFERENCES
 
In this section, we examine the distribution of test statistics for the difference of two means. In the normal distribution case, we show that if the sample sizes are unequal, then a permutation test will pick up signals from unequal variances and (in the multivariate case) unequal correlations, leading potentially to an inflated Type I error rate. In the case of arbitrary distributions, we show even if the sample sizes are equal, a permutation test will pick up signals from unequal skewness and higher order cumulants, leading potentially to an inflated Type I error rate. We illustrate this for a univariate lognormal distribution.

For the univariate case, Romano (1990) showed that the difference of sample means test statistic asymptotically has the issues described above, while Janssen (1997) showed the Welch t-test is asymptotically valid. Our results are based on the exact distribution of the difference in sample mean test statistic for finite samples.

Pollard and van der Laan (2005) proposed a general multiple testing method using asymptotically linear statistics, estimating the null distribution of the test statistic by re-sampling (with replacement), centering the data or statistics appropriately. They prove that, under regularity conditions, asymptotically (as sample size approaches infinity) their method controls the probability of incorrect rejection under the true distribution. [While this error rate control is weaker than strong control of the Familywise error rate (FWER), it should suffice in practice.] In the case of testing for the equality of means when the distributions do not differ by a location shift only, their method will re-sample mean-centered data within each group of the populations to be compared. (In other situations such as testing the equality of correlations or in tests parameters from non-linear models, they re-sample uncentered data then center the test statistics distribution.) Pollard and van der Laan (2003, 2005) then show, in testing the equality of the mean vectors of two bivariate distributions, estimating the null distribution of the test statistic by re-sampling the pooled mean-centered data will result in too small a critical value, resulting in an inflated error rate, if the sample sizes are unequal and the covariances are not the same. This is very similar to our result that permutation test has inflated error rate in that situation. But our method of proof differs from theirs in that they compare the inappropriately re-sampled distribution with the asymptotic distribution of the appropriate non-parametric boot-strap distribution, whereas we compare the permutation distribution with the exact distribution for finite samples.

To gain some intuition towards what might go wrong, before considering permutation tests, let us consider the simpler situation of testing by resampling with replacement, in the case of normal distributions.

Let Formula and Formula, i = 1, 2, ..., m, j = 1, 2, ..., n, where N (µ, {sigma}2) denotes a normal distribution with mean µ and variance {sigma}2. Consider testing the null hypothesis Formula using the test statistic Formula. The distribution of T under the null hypothesis is

Formula 1(1)
If we re-sample Xi, i = 1, ..., m and Yj, j = 1, ..., n, from the pooled sample {X1, ..., Xm, Y1, ..., Yn}, recomputing T each time we re-sample, it turns out the distribution of T under the null hypothesis Formula 1 is

Formula 2(2)
This can be seen as follows. Each re-sampled observation has chance m/(m + n) of being an X with variance Formula 2, and chance n/(m + n) of being a Y with variance Formula 2 So the variance of Formula 2 is

Formula 2

Thus, if Formula 2 then the re-sampled distribution (2) equals the true null distribution (1) only if m = n.

2.1 Effect of different correlations
Let Formula 2 and Formula 2, i = 1, 2, ..., m, j = 1, 2, ..., n, where MVNg(µ, {Sigma}) denotes a g-dimentional multivariate normal distribution with mean vector µ and variance–covariance matrix {Sigma}.

For inference on the difference of means µX – µY, consider the statistics

Formula 3(3)
where Formula 3 and Formula 3. Under the null hypothesis of equality of means Formula 3, the statistic Formula 3 is distributed as:

Formula 4(4)

However, the permutation distribution of T may be different.

THEOREM 2.1.
Let m ≤ n. Under Formula 4, the permutation distribution of Formula 4 is

Formula 5(5)
The proof of Theorem 2.1 is given in Appendix A.

A direct consequence of Theorem 2.1 is that

  1. If m = n even though {Sigma}X != {Sigma}Y, or if m != n but {Sigma}X = {Sigma}Y, the permutation distribution of T coincides with the distribution under the null hypothesis (4).
  2. If m != n and {Sigma}X != {Sigma}Y, the permutation and null - hypothesis distributions of T are different.

A common test for Formula 5 is

Formula 5
To control the familywise error rate (FWER) at {alpha}, the threshold c should be chosen to be the upper {alpha} quantile of the distribution of the maximum of |T1|, |T2|, ..., |Tg|, where T is distributed according to (4).

To assess the potential effect of different correlations on computing critical value based on the permutation distribution (5) instead of the true distribution (4), we conducted a simple simulation. We sampled 10 000 datasets with g = 50 from the permutation distribution (4) given by Theorem 2.1, a mixture of MVNgX, {Sigma}X) and MVNgY, {Sigma}Y) where µX = µY = 0, {Sigma}X has all the diagonal elements equal to 1 and all the off-diagonal elements equal to zero, while {Sigma}Y has all the diagonal elements equal to 1 and all the off-diagonal elements equal to 0.9. For m = 2 and n = 4, the permutation test proves to be liberal (Fig. 1), i.e. it does not control the Type I error rate of testing Formula 5 as the distribution of Formula 5 sampled from the permutation distribution (5) turns out to be stochastically smaller than its distribution sampled from the true distribution (4). (If m = 4 and n = 2, then the permutation test would be conservative.)


Figure 1
View larger version (6K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1 Density and cumulative density plots of max i = 1, ..., g|Ti| for g = 50 with m = 2 and n = 4. The permutation distribution is stochastically smaller than the true null distribution.

 
To compare the critical values of the permutation test with the correct critical value, we drew sets of 10 000 sample data from the permutation distribution (5) and from the true test statistic distribution (4), for g = 500, 600, ..., 1000. Figure 2 shows the extent the permutation test critical values are too small for m = 2 and n = 8.


Figure 2
View larger version (10K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2 Critical values of the test statistic max i = 1, ..., g|Ti| for g = 500, 600, ..., 1000 with m = 2 and n = 8. Critical values based on permutation distribution are smaller than the ones based on true null distribution.

 
As the max|T| test is often used to test component hypotheses in multiple testing, our result has implication in multiple testing as well.

2.2 Effect of different variances
We conducted a simple simulation to show that different variances can cause permutation test for equality of means to be either liberal or conservative.

Let Formula 5 and Formula 5, i = 1, 2, ..., m, j = 1, 2, ..., n, and consider testing the null hypothesis Formula 5 using the test statistic Formula 5.

In the first simulation study, 10 000 sets of random samples are drawn from Formula 5 with m = 6 and from Formula 5 with n = 4, independently. For each random sample, the p-value of the permutation test is computed by complete enumeration. Figure 3a compares the cumulative distribution of the simulated p-values with that of the Uniform(0, 1) distribution (a straight line along the 45° diagonal), showing that for a typical significance level the permutation test is liberal in this case. For instance, if the nominal significance level is 0.05, the actual significance level is ~0.09.


Figure 3
View larger version (7K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3 (a) Cumulative distribution plot of simulated p-value. m = 6 random samples are drawn from Formula 5 and n = 4 random samples are drawn from Formula 5 independently. (b) Cumulative distribution plot of simulated p-value. m = 4 random samples are drawn from Formula 5 and n = 6 random samples are drawn from Formula 5 independently.

 
In the second simulation study, 10 000 sets of random samples are drawn from Formula 5 with m = 4 and from Formula 5 with n = 6, independently. For each random sample, the p-value of the permutation test is computed by complete enumeration. Figure 3b compares the cumulative distribution of the simulated p-values with that of the Uniform(0, 1) distribution (a straight line along the 45° diagonal), showing that for a typical significance level the permutation test is conservative in this case. For instance, if the nominal significance level is 0.05, the actual significance level is 0.04.

2.3 Effect of differences in higher order cumulants
So far, we have shown that if X and Y are multivariate normal, then the permutation distribution of the test statistic happens to be the same as the true distribution of the test statistic if the sample sizes are equal (m = n). This lucky coincidence only holds when X and Y are multivariate normal, as we now show.

Suppose Formula 5 and Formula 5, i = 1, ..., m, j = 1, ..., n, where FX and FY are arbitrary multivariate distributions. The permutation distribution of the test statistic and the true distribution of the test statistic can be described in terms of cumulants ka(FX) and ka(FY), a = 1, 2, 3, ..., of FX and FY (assuming they exist):

THEOREM 2.2.
(1) The true distribution of the test statistic Formula 5 has cumulants

Formula 6(6)

(2) For a given permutation with r elements relabeled, the distribution (Pr) of the test statistic Formula 6 obtained by permutation has cumulants

Formula 7(7)

The proof is given in Appendix B.

For the test statistic Formula 7, results are similar except there are constants Formula 7 multiplying ka(FX) and ka(FY) in Equations (6) and (7).

We thus have

COROLLARY 2.3.
(1) The true and permutation distributions of the test statistic T will have the same even-order cumulants if m = n.

(2) The true and permutation distributions of the test statistic will not necessarily have the same ath order cumulants for a odd, regardless of whether m = n, unless ka(FX) = ka(FY).

So, if X and Y are not multivariate normal, then differences in cumulants of order higher than two can cause permutation test for equality of means to be liberal even if m = n. We use a simulation with different skewness to demonstrate this, even when X and Y are univariate.

Let Formula 7 and Formula 7, i = 1, 2, ..., m, j = 1, 2, ..., n, and consider testing the null hypothesis Formula 7, using the test statistic

Formula 7

We generated 10 000 sets of random samples from µX = –0.25, Formula 7, m = 5 and independently from µY = 0.125, Formula 7, n = 5. The two distributions have the same mean (Formula 7) but different skewness, as shown in Figure 4. For each random sample, the p-value of the permutation test is computed by complete enumeration.


Figure 4
View larger version (9K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4 Density plot of Formula 7 and Formula 7. X and Y have the same expectation (1.284) but different skewness.

 
The cumulative distribution of the simulated p-values is shown in Figure 5. The actual significance level is higher than the nominal significance level. For instance, for a nominal significance level of 0.05, the estimated actual significance level is 0.154, with a 95% confidence interval of (0.150, 0.157). For a nominal significance level of 0.10, the estimated actual significance level is 0.233, with a 95% confidence interval of (0.228, 0.237). Apparently, the permutation test can be liberal.


Figure 5
View larger version (11K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5 Cumulative distribution plot of simulated p-value. m = 5 random samples are drawn from Formula 7 and n = 5 random samples from Formula 7, independently.

 

    3 EXAMPLES AND RECOMMENDATIONS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MAIN RESULTS AND...
 3 EXAMPLES AND RECOMMENDATIONS
 REFERENCES
 
An advantage of permutation testing is no knowledge of the distribution of the observations is required. Its control of error rate, however, only holds under the condition of identical distribution among groups to be compared. If the purpose of testing is to detect differences in means, then permutation testing may pick up unintended signals, rejecting an equality hypothesis for the wrong reason.

For example, van't Veer et al. (2002) reported a 70 gene signature as strongly predictive of short interval to distant metastasis of breast cancer. Their data was reanalyzed by Ein-Dor et al. (2005), in a way similar to the original data analysis. First, based on sample fold changes, (non-statistical) filtering was applied to select 5852 genes from the 24 481 genes probed on the microarrays. Then 1234 genes were selected for more detailed study from these 5852 genes, those genes corresponding to the rejection of the null hypothesis that its expression levels across patients is uncorrelated with their prognosis (a dichotomized outcome of either metastasis-free for >5 years or not, controlling FDR at 10%). Turns out it can be shown that permutation testing of sample correlation between gene expression profile and dichotomized prognoses is equivalent to permutation testing of difference in mean gene expression levels between good and poor prognosis groups. Since m = 51 != n = 45 in Ein-Dor et al. (2005), such permutation testing will detect, in addition to equality of means, unequal variance, correlations and skewness. Whether detecting such differences is useful or not may depend on the intended use of permutation testing. As the purpose of selecting genes in van't Veer et al. (2002) and Ein-Dor et al. (2005) was to train prognostic algorithms using machine learning, it seems to us the hypotheses of primary interests are Formula 7 and not Formula 7 (at least for algorithm based on distances measured by differences).

Instead of permutation testing, we recommend the following alternative approaches.

If gene expression data from microarray experiments can be modeled, and modeling diagnostics of the residuals (observed values minus values predicted from the model) show the errors have reasonable i.i.d. structure, then setting critical values for testing by appropriately re-sampling the residuals controls the error rate asymptotically. For example, Hsu et al. (2006) describe a statistically designed microarray experiment whose (logarithm of the) expression levels can be modeled linearly. After identifying error vectors that can be reasonably assumed to be i.i.d, they then bootstrapped the corresponding residuals vectors to set critical values for multiple testing.

If modeling of the data is difficult and the sample sizes are reasonably large, then we recommend the non-parametric bootstrap method of Pollard and van der Laan (2005) which re-samples data within each group and then centers the re-sampled test statistics to obtain a test null distribution. This re-sampling method is implemented as the MTP function in the multtest package of bioconductor (Pollard et al., 2005). They used both real and simulated data to study the behavior of permutation methods and their bootstrap methods, controlling generalized FWER. They found the error rates of permutation tests to be systematically higher than the target level, with the exception being the equal sample size case for difference of sample means test statistics. They also found that bootstrap methods can be either liberal or conservative in terms of error rates, depending on the test statistics used.


    Acknowledgments
 
The authors thank Professor Yoon Lee for useful discussions. Research of J.C.H. is supported by NSF Grant No. DMS-0505519.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Joaquin Dopazo

Received on December 9, 2005; revised on June 14, 2006; accepted on July 6, 2006

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MAIN RESULTS AND...
 3 EXAMPLES AND RECOMMENDATIONS
 REFERENCES
 

    Hsu, J.C., Chang, J., Wang, T., Steingrimsson, E., Magnusson, M.K., Bergsteinsdottir, K. (2006) Statistically designing microarrays and microarray experiments to enhance sensitivity and specificity. em Technical Report 771, , OH Department of Statistics, The Ohio State University.

    Ein-Dor, L. (2005) Outcome signature genes in breast cancer: is there a unique set? Bioinformatics, 21, 171–178[Abstract/Free Full Text].

    Janssen, A. (1997) Studentized permutation tests for non - i.i.d. hypotheses and the generalized Behrens-Fisher. Stat. Probab. Lett, . 36, 9–21[CrossRef].

    Pollard, K.S. and van der Laan, M. (2003) Multiple testing for gene expression data: an investigation of null distributions with consequences for the permutation test. Proceedings of the 2003 International MultiConference in Computer Science and EngineeringMETMBS'03 Conference , pp. 3–9.

    Pollard, K.S. and van der Laan, M. (2005) Resampling-based multiple testing: asymptotic control of type I error and applications to gene expression data. J. Stat. Plan. Infer, . 125, 85–100[CrossRef].

    Pollard, K.S., Dudoit, S., van der Laan, M.J. (2005) Multiple testing procedures: R multtest package and applications to genomics. Bioinformatics and Computational Biology Solutions Using R and Bioconductor, , New York Springer-Verlag.

    Romano, J. (1990) On the behavior of randomization tests without a group—symmetry assumption. J. Am. Stat. Assoc, . 85, 686–692[CrossRef].

    van't Veer, L.J. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415, 530–536[CrossRef][Medline].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/18/2244    most recent
btl383v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (3)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Huang, Y.
Right arrow Articles by Hsu, J. C.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Huang, Y.
Right arrow Articles by Hsu, J. C.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?