Skip Navigation


Bioinformatics Advance Access originally published online on March 30, 2006
Bioinformatics 2006 22(12):1486-1494; doi:10.1093/bioinformatics/btl109
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/12/1486    most recent
btl109v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (5)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Gao, X.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Gao, X.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Construction of null statistics in permutation-based multiple testing for multi-factorial microarray experiments

Xin Gao

Department of Mathematics and Statistics, York University 4700 Keele Street, Toronto, ON M3J 1P3, Canada

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHOD
 3 RESULTS ON SIMULATED...
 4 RESULTS ON BIOLOGICAL...
 5 DISCUSSION
 REFERENCES
 

Motivation: The parametric F-test has been widely used in the analysis of factorial microarray experiments to assess treatment effects. However, the normality assumption is often untenable for microarray experiments with small replications. Therefore, permutation-based methods are called for help to assess the statistical significance. The distribution of the F-statistics across all the genes on the array can be regarded as a mixture distribution with a proportion of statistics generated from the null distribution of no differential gene expression whereas the other proportion of statistics generated from the alternative distribution of genes differentially expressed. This results in the fact that the permutation distribution of the F-statistics may not approximate well to the true null distribution of the F-statistics. Therefore, the construction of a proper null statistic to better approximate the null distribution of F-statistic is of great importance to the permutation-based multiple testing in microarray data analysis.

Results: In this paper, we extend the ideas of constructing null statistics based on pairwise differences to neglect the treatment effects from the two-sample comparison problem to the multifactorial balanced or unbalanced microarray experiments. A null statistic based on a subpartition method is proposed and its distribution is employed to approximate the null distribution of the F-statistic. The proposed null statistic is able to accommodate unbalance in the design and is also corrected for the undue correlation between its numerator and denominator. In the simulation studies and real biological data analysis, the number of true positives and the false discovery rate (FDR) of the proposed null statistic are compared with those of the permutated version of the F-statistic. It has been shown that our proposed method has a better control of the FDRs and a higher power than the standard permutation method to detect differentially expressed genes because of the better approximated tail probabilities.

Availability: R codes available upon request

Contact: xingao{at}mathstat.yorku.ca


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHOD
 3 RESULTS ON SIMULATED...
 4 RESULTS ON BIOLOGICAL...
 5 DISCUSSION
 REFERENCES
 
The development of microarray technologies has provided plentiful amount of genome-wide expression data for biologists to explore various biological mechanisms of interest. These data with simultaneous measurements of thousands of genes present challenging statistical analysis problems. As the normality assumption justifying parametric testing is often untenable in microarray studies (Hunter et al., 2001; Gao et al., 2005), a class of non-parametric statistical methods have been proposed, including the empirical Bayes method of Efron et al. (2001), the significance analysis of microarray (SAM) method of Tusher et al. (2001), and the mixture model method (MMM) of Pan et al. (2003). This class of non-parametric methods are very powerful as they make less stringent distributional assumptions. Furthermore, they fully utilize the special feature of microarray data that there is a large number of genes whereas the number of replications is small. Typically a summary statistic Z is constructed for each gene to measure the differential gene expression across the different treatment conditions. The null distribution of the test statistic Z is estimated by constructing a corresponding null statistic z. Then the statistical significance is assessed by comparing the observed Z-statistic with this estimated null distribution. The most common way of constructing the null statistic z is to use the permutated version of Z. However, it has been emphasized in the recent literature including Efron et al., (2001), Zhao and Pan (2003), Pan (2003), Xie et al., (2005) that the permutation distribution of Z may not approximate well to the true null distribution of Z because there may exist a relatively large proportion of genes in the microarray data which are differentially expressed. Therefore, the permutation distribution can be regarded as a mixture distribution with one component corresponding to the permutation distribution under the null hypothesis while the other component corresponding to the permutation distribution under the alternative hypothesis. The resulted mixture distribution has a larger variance and a heavier tail compared to true null distribution, which leads to a potential loss of power. For the mathematical justification of this phenomenon, readers are referred to Xie et al., (2005).

There have been a series of pioneering work in constructing a suitable null statistic z to estimate the unknown null distribution of Z. Efron et al. (2001) proposed to construct z based on the pairwise differences which neglect the treatment effects. In the context of MMM approach, Pan et al. (2003) further proposed a new version of Z and z. Assume each condition has even number of replications ni, i = 1, 2. Within each condition, the difference between the sums of the first half and second half Formula is formed. The Z-statistic has the numerator of the expression {sum}jX1j/n1{sum}jX2j/n2, whereas the z statistic has the expression d1/n1d2/n2. The Z and z statistics share the same denominator of the standard error based on the two sample variances of Xi(j). Zhao and Pan (2003) pointed out a weakness associated with this statistic that the numerator and denominator of the Z-statistic are uncorrelated whereas the numerator and denominator of the z are correlated. This undue correlation for z causes its distribution to be different from the null distribution of Z. Two modifications were proposed by Zhao and Pan (2003) which corrected the aforementioned problem and further extended the method to non-symmetric noise distribution. The limitation of these two modifications is that efficiency suffers because of the loss of degrees of freedom in the variance estimation. Pan (2003) further proposed a new method in which the sample in each condition is decomposed into two approximately equal sized halves, denoted as parts i1, i2, respectively. The proposed Z has the numerator measuring the difference Formula whereas the proposed z has the numerator measuring the difference Formula The proposed Z and z share the same numerator of Formula This new set of statistics have the advantage that the undue correlation between the numerator and denominator of the z is corrected and furthermore there are much fewer degrees of freedom lost compared with Zhao and Pan's methods (2003).

All the aforementioned methods have been developed to address the two condition problem in microarray setting. However, a microarray experiment often has a more complicated design than that of two user-defined groups. Besides the treatment effects of interest, there may exist some clinical covariates such as age, gender and certain clinical symptoms, which also influence the gene expression level. For such experiments, a factorial design model is useful to account for the multiple sources of variation. An overall ANOVA model has been proposed to simultaneously consider all the genes on the arrays and incorporate array effect and dye effect (Kerr et al., 2000). A gene specific ANOVA model under the normality assumption was considered in Jin et al. (2001). When the normality assumption is potentially violated, with a limited number of replications, the sensitivity of these parametric approaches could be severely undermined by using the asymptotic distribution to assess the statistical significance of the test statistic. Ideally, it would be very useful to extend the non-parametric approaches discussed above from the two-sample comparison problem to a more complicated multifactorial set up. Thus it is the objective of this article to construct a proper null statistic to approximate the null distribution of the F-statistic which is used as the summary statistic in multifactorial data analysis. It is of further interest to examine the power of the proposed method in detecting true positives and its control of false discovery rate (FDR) under the multiple testing scenario.


    2 METHOD
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHOD
 3 RESULTS ON SIMULATED...
 4 RESULTS ON BIOLOGICAL...
 5 DISCUSSION
 REFERENCES
 
2.1 Model
Given a multifactorial microarray experiment, the expression of the i-th gene could be modelled as follows:

Formula 1(1)
with {sum}j {alpha}j = 0, {sum}k ßk = 0 and {sum}i {gamma}ij = {sum}j {gamma}ij = 0, where i indexes for the gene number, j indexes for the treatment group, k indexes for the covariate group and s indexes for the replicate number. Here Xijks stands for the expression measurement, {theta}i represents the i-th gene specific mean, {alpha}j represents the effect of the j-th treatment group (for instance, drug treatments, tissue types, strains of mice, etc.), ßk represents the effect of the k-th level of a clinical covariate, and {gamma}jk represents the interaction effects between the treatment and the clinical covariate. The error terms {varepsilon}ijks are independently and identically distributed random noise from a continuous distribution function with a common variance {sigma}2. It is worthy to note that this model does not make normality assumption on the error terms {varepsilon}ijks.

2.2 Permutation method
Usually the goal of the analysis is to identify genes which are differentially expressed among the treatment groups, which is equivalent to test the following hypotheses for each individual genes:

Formula 1

For simplicity in notation, the index i is suppressed in the expressions hereafter, as all the observations in the formulae are from the specific gene i. The marginal and overall averages of the expression values are defined as Formula 1, Formula 1 and Formula 1 Formula 1 Formula 1 and Formula 1 It can be shown that the vector Formula 1 follows a multivariate distribution with the covariance matrix Formula 1

Define matrix A = {Sigma}–1[IJJJ {Sigma}–1 / trace ({Sigma}–1)], where IJ denotes an identity matrix of dimension J and JJ denotes a matrix with dimension J and all entries equal to one. It can be shown that A{Sigma} is idempotent and A1 = 0. Therefore under Hi0,

Formula 2(2)
as the sample size goes large. Thus the usual F-statistic for testing the treatment effect in the unbalanced multifactorial design takes the following form (Searle, 1987):

Formula 3(3)
which follows the Formula 3 distribution asymptotically. It is noted that for a balanced design with all njk equal to M, the F statistic above is reduced to a more commonly seen form:

Formula 4(4)

Throughout this article we shall be focused on the development for any arbitrary unbalanced design as it encompasses the balanced design as a special case. Under the normality assumption, the F-statistic above follows the exact Formula 4 distribution regardless of the sample size. However in a replicated microarray analysis, the normality assumption is often violated and the sample size is often so limited that the asymptotic F distribution is not accurate enough to obtain a valid p-value. In order to assess the significance of the F-statistic, the permutation method will be invoked instead to provide p-values. To carry out the permutation in the unbalanced design, the measurements within the k-th covariate levels are combined and shuffled, and a random collection of njk measurements from this pool will be assigned to the cell (j, k) for Formula 4 The procedure is repeated for Formula 4 to complete one permutation. Suppose the permutation is conducted B times, and one F-statistic is obtained from each permutation, the empirical distribution of the total Bn F-statistic will be the proposed permutation distribution to approximate the null distribution of F.

2.3 Null statistic based on pairwise differences
In reality among all the genes on the microarray, there exist a proportion of genes which are differentially expressed. Therefore the permutation distribution actually consists of two components with one generated from the permutation distribution under the null hypothesis and the other generated from the permutation distribution under the alternative hypothesis. Therefore, the permutation distribution of F has a larger variation and a heavier tail than the true null distribution of F (Xie et al., 2005). To take into account the proportion of differentially expressed genes, we can adopt the idea proposed by Efron et al. (2001) to construct a null statistic based on pairwise differences to neglect the treatment effects so that its empirical distribution will approximate well to the null distribution of the F-statistic regardless whether or not the gene is differentially expressed among treatment groups or not.

Assume each cell has even number of replications. Within each cell (j, k), we define Formula 4 pairwise differences

Formula 5(5)
for Formula 5 Define the marginal and overall averages of the pairwise differences as Formula 5 and Formula 5

Then we construct a null statistic

Formula 6(6)

It is noted that the F and f1 share the same denominator. Their numerators bears the same expression with each Formula 6 exchanged by the corresponding Formula 6 Define function sign(a) = 1, if a ≥ 0, and sign(a) = –1, if a < 0. It can be shown that

Formula 7(7)
and

Formula 8(8)

Define the vector Formula 8 as Formula 8 As under the null hypothesis Formula 8 it follows that Formula 8 are all equal and the mean vector of Formula 8 takes the form of µ = ({theta}i + {alpha})1. The distributions of Formula 8 and Formula 8 are identical under the assumption of noise distribution symmetric about zero. The numerator of F can be expressed as

Formula 9(9)
because Aµ = 0. In parallel, the numerator of f1 can be expressed as Formula 9 Because the distributions of X – µ and D are identical, the numerator of F under the null hypothesis has the same distribution as the numerator of f1. Furthermore, F and f1 share the same expression for the denominator. Thus the empirical distribution of f1 can be employed to approximate the null distribution of F. For each gene we randomly permute the expression values within each cell and then form the pairwise differences and generate the corresponding f1. Suppose we permute the data B times, the empirical distribution of the total Bn null statistic f1 provides the approximated null distribution of F. The advantage of this method is that the {chi}2-distribution in the numerator of the f1 maintains a zero non-centrality parameter for both differentially expressed or non-differentially expressed genes thus it can accommodate microarray data with a relatively large proportion of genes with differential expression and possible large magnitude of change.

2.4 Null statistic based on subpartition
If we further examine the above method of null statistic based on pairwise differences, we notice that there are certain limitations of this null statistic. First of all, this null statistic can only tackle the designs with even number of replicates for each cell. Second there is undue correlation between the numerator and denominator of the f1 statistic. This same undue correlation problem was first pointed out by Zhao and Pan (2003) in constructing null statistic for two-sample comparison problem. To see this problem in the multifactorial setting, let us use the similar argument as that of Zhao and Pan (2003). Given two independent observations from the same distribution, Y1 and Y2, it is readily shown that cov(Y1 + Y2, Y1 – Y2) = 0. Namely the pairwise difference will be uncorrelated with the pairwise sum. For the F-statistic, the denominator can be expressed as functions of the sample variance Formula 9 within cell (j, k) which is based on pairwise differences whereas the numerator can be expressed as functions of the sample mean Formula 9 which is based on pairwise sums. Therefore, the numerator and denominator of the F-statistic are uncorrelated. In contrast, for the f1 statistic, the denominator is the same as that of F which is based on pairwise differences whereas the numerator is expressed in terms of functions of Formula 9 which is also based on pairwise differences. Therefore, unlike F, the numerator and denominator of f1 are unduely correlated which causes the distribution of f1 to be different from that of F under null hypothesis.

To overcome the problem of undue correlation and to further accommodate the null statistic to arbitrary odd or even number of cell sizes, we propose the following null statistic based on the idea of subpartition proposed by Pan (2003) in two-sample comparison problem. For each cell (j, k), we partition it into two approximately equal halves indexed as jk1, and jk2 with the first half containing Formula 9 observations and the second half containing Formula 9 observations. Here [a] denotes the integer part of a. Now for each of the total 2JK subpartitions, we form the sample means and sample variances: Formula 9 Formula 9 Formula 9 Formula 9

It is known that Formula 9 is uncorrelated with Formula 9 and Formula 9 is uncorrelated with Formula 9

Define

Formula 10(10)

Formula 11(11)

Formula 12(12)

Denote the vector Formula 12 and the vector Formula 12 It follows that under the assumption of symmetric noise distribution and under the null hypothesis, the vector Formula 12 and the vector Formula 12 follow the same multivariate distribution with the zero mean vector and the covariance matrix to be Formula 12

The proposed F2 and f2 statistic based on subpartition would take the following forms:

Formula 13(13)
and

Formula 14(14)

Then following the similar argument applied to F1 and f1, it is readily shown that the numerators of F2 and f2 share the same distribution under the assumption of symmetric noise distribution. Furthermore, the numerator and the denominator of f2 are uncorrelated. Hence the permutation distribution of f2 provides a good approximation to the null distribution of F2. This method is very general as it works for unbalanced or balanced factorial designs with arbitrary even or odd cell sizes and more importantly it is corrected for the undue correlation problem.


    3 RESULTS ON SIMULATED DATA
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHOD
 3 RESULTS ON SIMULATED...
 4 RESULTS ON BIOLOGICAL...
 5 DISCUSSION
 REFERENCES
 
This section illustrates the performances of the proposed methods presented in Section 2 through some simulation experiments. There are three competing methods, traditional permutation method (F and f), pairwise difference method (F1 and f1) and subpartition method (F2 and f2). (Note the test statistic F1 for the pairwise difference method is the same F-statistic for the traditional permutation method.)

3.1 Comparison of the null distribution
The performances of these methods highly depend on how accurately by the proposed null statistics approximate the null distributions of the test statistics. To investigate this aspect, we performed the following simulations. We simulated a 2 x 2 design with n11 = n12 = n21 = n22 = 4. The arrays were simulated to contain 10 000 genes with 50% genes non-differentially expressed (NDE) whereas the other 50% genes differentially expressed (DE). The gene specific mean was simulated as {theta}i ~ U(0, 1), the covariate effect ß1 = –ß2 ~ N(0, 3), the gene-covariate interaction effect {gamma}11 = –{gamma}12 = –{gamma}21 = {gamma}22 ~ N(0, 1). For those NDE genes, the treatment effect was set to be {alpha}1 = {alpha}2 = 0, whereas for those DE genes, the treatment effect was set to be {alpha}1 = {alpha}2 ~ U(0.1, 4). The noise distribution was simulated as symmetric normal distribution {varepsilon}ijks ~ N(0, 1). The empirical distributions of the null statistics were obtained from 20 permutations which results in 200 000 null statistics values. The empirical null distributions of F, F1 and F2 were generated from the exact F1,12, F1,12 and F1,8 respectively.

According to the simulation, it is noticed that the variance of f statistics is 22.45 whereas the variance of F-statistics under the null is 4.25, which are substantially different. This result also agrees with the findings obtained for two-sample comparison problem in Xie et al. (2005). The larger variation of f is attributed to the contribution of the genes which are DE. On the contrary, the distribution of f1 has a much smaller variance 1.58 compared with the variance of 4.25 for F1 owing to the undue correlation between the numerator and denominator of f1. Best of the three competing methods, f2 has a variation of 7.63 which is very close to the variance of 7.23 for F2 reflecting a much better approximation to F2.

Furthermore all the empirical distributions of the null statistics are plotted against the theoretical density curves of the null distributions. From Figure 1a, it is observed that f has a much heavier tail compared with the null distribution of F which is because of the mixture proportion of DE genes on the array. The heavier tail behavior also explains the larger variation of f listed above. In Figure 1b, the distribution of f1 has a much shorter tail compared to the null distribution of F1, which explains the smaller variation of f1 discussed above. This is because of the undue correlation between the numerator and denominator of f1. In Figure 1c, it is demonstrated that the subpartition method yields the null statistic f2 which provides an almost perfect agreement with the null distribution of F2 for even extreme tail regions.


Figure 1
View larger version (8K):
[in this window]
[in a new window]
 
Fig. 1 (ac) The comparison of the tail density curves of the empirical distributions of the null statistics versus the theoretical null distributions under the normal noise. The solid black curves are the theoretical null distribution and the dashed curves are the null statistics.

 
3.2 Comparison of power and false discovery rates
The above discussion focuses on the approximation to the null distributions. In multifactorial microarray analysis, the data analysis can proceed by first computing the summary F-statistic for each gene. Then the significance of each individual gene is assessed by comparing the observed statistic with the distributions of the null statistics to obtain the empirical p-value. Subsequently a threshold for the p-values or equivalently a threshold for the test statistics themselves is determined such that all the genes above the thresholds are declared as differentially expressed genes.

FDR introduced by Benjamini and Hochberg (1995) has been widely adopted to adjust for the multiple testing problem in microarray data analysis (Storey and Tibshirani, 2003; Hu et al., 2005, etc). FDR is defined as the expected proportion of the false positive discoveries among all the declared positive discoveries. First let us denote the null statistics generated from the B permutations by fm, m = 1, ... , Bn. Also denote the F-statistics from the n genes by Fi, i = 1, ... , n. Given a statistic Fi, the corresponding p-value is defined as p-value Formula 14

Given a fixed cut-off value {alpha} for p-value, we can obtain the realized FDR and its estimates (Storey and Tibshirani, 2003), Formula 14 and Formula 14 where {pi} is the proportion of NDE genes and Formula 14 is its estimator, FP is the number of false positive genes, i.e. the number of genes which are NDE genes but claimed as DE genes, Formula 14 is the estimated number of false positive genes, Formula 14 is the total number of genes claimed as DE genes. It is noted that using the null statistic as the reference distribution, given the p-value threshold {alpha}, Formula 14. Therefore, by selecting the significance level of the non-parametric approaches, we are controlling the estimated number of false positive discoveries. When yielding the same number of Formula 14 the most desirable method should yield the largest number of Formula 14 and in the meantime, the estimated Formula 14 should be as close to the true FP as possible.

In the simulation, a microarray of 2000 genes was simulated with two levels of the treatment (J = 2) and two levels of the clinical covariate (K = 2). An unbalanced design of n11 = 4, n12 = 6, n21 = 6, n22 = 4, was considered. The gene specific mean was simulated as {theta}i ~ U(0, 1), the covariate effect ß1 = –ß2 ~ N(0, 3), the gene–covariate interaction effect {gamma}11 = –{gamma}12 = –{gamma}21 = {gamma}22 ~ N(0, 1). For those NDE genes, the treatment effect was set to be {alpha}1 = {alpha}2 = 0, whereas the for those DE genes, the treatment effect was set to be {alpha}1 = {alpha}2 ~ U(0.1, 4). The noise distribution was simulated as symmetric normal distribution {varepsilon}ijks ~ N(0, 1). Other symmetric noise distributions including uniform, double exponential and t-distributions were also considered. As the comparison results are similar to the normal distribution, only the result under normal noise is presented in this article. The proportion of NDE genes was set to be {pi} = 0.8 and {pi} = 0.5 to represent the situations of low proportion of DE genes and the medium to high proportion of DE genes. To take into account the variability of the statistics, the results were summarized based on 50 simulated datasets.The average values of the estimators together with the standard deviations are provided in Tables 1 and 2. We compared the average numbers of the claimed significant genes Formula 14 and the false positive genes FP, the estimated Formula 14 and the realized FDR for the three competing methods.


View this table:
[in this window]
[in a new window]
 
Table 1 True positives and FDRs for 2 x 2 unbalanced design with {pi} = 0.8 under normal noise

 

View this table:
[in this window]
[in a new window]
 
Table 2 True positives and FDRs for 2 x 2 unbalanced design with {pi} = 0.5 under normal noise

 
In practice, if {pi} is to be estimated, there are methods available based on the distributions of p-values (Storey, 2002; Allison et al., 2002; Pounds and Morris, 2003; Pounds and Cheng, 2004; Guan et al., 2004, http://www.biostat.umn.edu/rrs.php.; Wu et al., 2004). It is worthy to point out that if permutation method is employed to estimate the p-values, the accuracy of the approximation to the null distribution is also very crucial to the precision of the p-values. As a result, the traditional permutation method will lead to over-estimation of the p-values, whereas the paired-difference method will lead to under-estimation of the p-values. There also exist non-parametric approaches to obtain an estimate of the {pi}, however, only upper bounds of {pi} is readily available (Efron et al., 2001; Dalmasso et al., 2005). As it is beyond the scope of this article to compare different estimation methods for {pi}, the true value of {pi} is used in calculating Formula 14.

In Tables 1 and 2, we adjusted the number FP = (# true false positive genes)/{pi} to be consistent with the formulations in Efron (2001) and Storey and Tibshirani (2003), as FP was originally proposed to be based on the testing on all n genes under the null hypothesis. Similar adjustment can be found in Pan (2003). It is demonstrated by the simulation results that when there is a mixture of DE and NDE genes on the array, the proposed subpartition method is more powerful than the traditional permutation method. When maintaining the same estimated false positive numbers Formula 14, the subpartition method identifies more significant genes Formula 14 For instance with a medium proportion of DE genes {pi} = 0.5, when Formula 14 is controlled at 0.2, in average the subpartition method identifies 161 more genes than the traditional permutation method. In terms of FDRs, the false positive number FP of the subpartition method is much closer to the estimated false positive number Formula 14 compared with the traditional permutation method. For instance, when Formula 14 is controlled at 10, with {pi} = 0.5, the average false positive number FP for the permutation method is 6.80, whereas the average FP for the subpartition method is 9.92. Consequently the estimated FDR of the subpartition method is more accurate than that of the permutation method. For instance, when Formula 14 is controlled at 10, with {pi} = 0.5, the average realized and estimated false discovery rates FDR and Formula 14 for the traditional permutation method are 0.0040 and 0.0059, whereas the average FDR and Formula 14 for the subpartition method are 0.0059 and 0.0059. Comparing with the paired-difference method, the subpartition method offers a more valid procedure. It is shown in Tables 1 and 2 that the paired-difference method produces a largely inflated false positive number FP and severely underestimated FDR rates. Although the paired-difference method constantly identifies more significant genes than its two other competitors, it contains much more false positives than it aims to control. For instance, with {pi} = 0.5, and the Formula 14 is aimed to be controlled at 10 in average the paired-difference method produces 39.32 false positives, whereas the permutation method and subpartition method produce 6.80 and 9.92 false positives. Consequently, the corresponding average Formula 14 of the paired-difference method is 0.0055 which is significantly underestimated compared with the true FDR = 0.0218. In contrast, the proposed subpartition method always produces FP numbers close to the estimated Formula 14 and provides a much more accurate estimation of the FDR.

In order to investigate the performance of the proposed methods under unbalanced factorial designs with more than two groups by factor, we simulated a microarray of 2000 genes with three levels of the treatment (J = 3) and three levels of the clinical covariate (K = 3). An unbalanced design of n11 = 4, n12 = 5, n13 = 6, n21 = 5, n22 = 6, n23 = 4, n31 = 4, n32 = 5, n33 = 7 was considered. The gene specific mean was simulated as {theta}i ~ U(0, 1), the covariate effect ß1 ~ N(–2,1), ß2 ~ N(–1,1) and ß3 ~ N(1, 1), the gene–covariate interaction effect {gamma}11 = –{gamma}12 = –{gamma}21 = {gamma}22 ~ N(0, 1), and other {gamma}jk = 0. For those NDE genes, the treatment effect was set to be {alpha}1 = {alpha}2 = {alpha}3 = 0, whereas for those DE genes, the treatment effect was set to be {alpha}1 ~ U(–4,–2), {alpha}2 ~ U(0, 2), {alpha}3 ~ U(4, 6). The noise distribution was simulated as symmetric normal distribution {varepsilon}ijks ~ N(0, 2). The proportion of NDE genes was set to be {pi} = 0.8. The results were summarized in Table 3 based on 50 simulated datasets. It is shown from the simulation results that the subpartition method maintains its validity and power in the unbalanced multifactorial settings with more than two groups per factor. Compared with the traditional method, the subpartition method still detects higher numbers of true positives and obtains more accurate estimates for the number of false positives. It is also interesting to note that as the number of levels increases the total number of replications increases as well. Consequently the gap between the traditional permutation method and the proposed subpartition method is lessened as the bias pertained to the traditional permutation method has been alleviated by the increase of sample size.


View this table:
[in this window]
[in a new window]
 
Table 3 True positives and false discovery rates for 3 x 3 unbalanced design with even and odd number of replicates when {pi} = 0.8 under normal noise

 

    4 RESULTS ON BIOLOGICAL DATA
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHOD
 3 RESULTS ON SIMULATED...
 4 RESULTS ON BIOLOGICAL...
 5 DISCUSSION
 REFERENCES
 
Here we present a statistical analysis on a microarray expression data of Drosophila melanogaster (Jin et al., 2001) that has multiple classes of treatments attributing to the expression level. In the data, there were 24 cDNA microarrays, 6 for each combination of two genotypes (Oregon R and Samarkand) and two genders. In total, the gene expression values of the 3931 clone were obtained on the array representing a third of the genome. Focused on individual gene's expression level measured at the first week, the objective of the analysis was to identify genes whose expression levels are significantly affected by the genotypes and the genders. In the original analysis of Jin et al. (2001), F distribution was employed to assess the statistical significance of the F-statistics obtained from ANOVA tests. Similar to Jin's approach we modelled the expression data using a multifactorial model and used the F-statistic to summarize the differential gene expression across genotypes and across genders. Without normality assumption, owing to the small number of replications, the exact F distribution may not provide an accurate assessment of the statistical significance. Therefore we were motivated to reanalyze this dataset using the non-parametric methods proposed in this article. The three competing methods were applied to the datasets. For each method, 10 permutations were employed to generate Bn = 39 310 null statistics. Using the empirical distributions of the null statistics to approximate the null distributions, the empirical p-values were obtained for each individual genes. We used a series of thresholds for the p-value, ranging from 0.05, 0.01, 0.005, 0.001, 0.0005 to 0.00025. The corresponding estimated number of false positives are 196.55, 39.31, 19.65, 3.93, 1.96 and 0.98. The number Formula 14 of significant genes differentially expressed across the two genders and across the two genotypes were provided in Tables 4 and 5.


View this table:
[in this window]
[in a new window]
 
Table 4 True positives and estimated false positives for genes differentially expressed across two genders

 

View this table:
[in this window]
[in a new window]
 
Table 5 True positives and estimated false positives for genes differentially expressed across two genotypes

 
It is shown in Table 4 that among the total 3931 genes, each of the three methods identifies a large number of genes differentially expressed across the two genders. This is in agreement with the conclusion made by Jin et al. (2001) that the expression profiles of adult fruit flies are greatly influenced by different genders. Thus this dataset is one of the examples that the microarray contains a large proportion of genes belonging to the alternative situation. Therefore, the traditional permutation method will be a convertive procedure due to the heavier tail of the null statistic. Contrastingly the proposed subpartition method is well suited for this data set and it identifies a greater number of significant genes. For example, when the estimated false positives Formula 14 is controlled at about 1, the subpartition method identifies about 97 genes compared to the 50 genes identified by the traditional permutation method. It has been revealed in the simulations that the paired-difference method offers a very liberal number of significant genes due to the underestimated false positive numbers. Therefore, the seemingly large number of significant genes identified by the paired-difference method is less reliable than the list produced by the subpartition method. We also applied the three methods to detect the differential gene expression across the two genotypes. Compared with the gender effect, less numbers of differentially expressed genes were identified, which also agrees with the findings of Jin et al. (2001) that the transcriptional profiles of adult fruit flies are affected most strongly by sex and less so by genotypes. Nevertheless, the subpartition method is consistently more powerful than the traditional method. For example, when the estimated false positive Formula 14 is controlled at 1.96 the subpartition method identifies about five genes in contrast to the three genes identified by the standard permutation method. With regard to the paired-difference method, the number of significant genes is again liberally large. In conclusion, our analysis confirms the result by Jin et al. (2001) that genders and genotypes have played important roles in determining the transcriptional profiles of adult flies. Furthermore, our proposed subpartition method is more powerful in identifying significant genes owing to the better approximated null distribution.


    5 DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHOD
 3 RESULTS ON SIMULATED...
 4 RESULTS ON BIOLOGICAL...
 5 DISCUSSION
 REFERENCES
 
This paper presents a set of non-parametric permutation-based tests to detect differential gene expression for multifactorial microarray experiments. It is shown through the simulation results that the traditional permutation method provides a conservative procedure when the array contains a mixture distribution of DE and NDE genes. This owes to the fact that the DE genes induce a larger variation and a heavier tail to the permutation distribution. With regard to the paired-difference method, it is observed that its null statistic has an undue correlation between the numerator and denominator. Thus the distribution of the null statistic is rather different from the null distribution of the summary statistic. Consequently, although the paired-difference method is seemingly more powerful than the traditional permutation method, it identifies a lot more false positives and offers an unsatisfactory control of the FDR. The aforementioned drawbacks of the traditional permutation method and the paired-difference method have been pointed out by Pan (2003), Zhao and Pan (2003) and Xie (2005) for the two sample comparison problem. We confirmed and extended the findings to the multifactorial setting. In contrast to the traditional permutation method and the paired-difference method, we propose a subpartition method to construct a null statistic which not only neglects the treatment effects for both DE and NDE genes, but also is corrected for the undue correlation problem. Furthermore, by explicitly considering the covariance structure of the mean vector under subpartition, the method works for arbitrary unbalanced factorial designs with even or odd cell sizes. From the simulations and the real biological data analysis, it is shown that the subpartition method offers a much better approximation to the null distribution compared with its two competitors. It is more powerful to identify significant genes than the traditional permutation method and it offers a more satisfactory control of the FDR than the paired-difference method.

It is worthy to point out that the validity of the proposed subpartition method requires a number of basic assumptions. First of all, each cell should have at least four replications, so that the variances for each subpartition could be calculated. Another restriction of the proposed subpartition method, which applies to the paired-difference method as well, is that it requires the assumption of symmetric noise distribution. Third, the model specifies that the {varepsilon}ijks are independent and identically distributed according to a common distribution F. Nevertheless in practice, there might exist correlated random noise in the data. Theoretically it has been shown by Pollard and van der Lann (2004) and Pollard et al. (2005) that the permutation method cannot preserve the correct covariance structure unless the design is balanced or the covariance structure is same across different populations. Thus if the multivariate distributions vary across the different treatment groups and the covariate groups, we should be cautious that the subpartition method may not be invariant under the permutations. To demonstrate this limitation of the subpartition permutation method in the presence of correlated random noise, we conducted the following simulations. A microarray of 2000 genes was simulated with three levels of the treatment (J = 3) and four levels of the clinical covariate (K = 4). An unbalanced design of n11 = 4, n12 = 6, n13 = 8, n14 = 4, n21 = 8, n22 = 6, n23 = 6, n24 = 4, n31 = 7, n32 = 6, n33 = 6, n34 = 7 was considered. The gene specific mean was simulated as {theta}i ~ U(0, 1). The covariate effects were simulated as ß1 ~ N(0, 1), ß2 ~ N(0, 2), ß3 ~ N(0, 3), ß4 ~ N(0, 4). For those NDE genes, the treatment effect was set to be {alpha}1 = {alpha}2 = {alpha}3 = 0, whereas the for those DE genes, the treatment effect was set to be {alpha}1 ~ U(0, 1), {alpha}2 ~ N(2, 1). {alpha}3 ~ U(4, 5). To simulate unequal correlations across different covariate groups, we first simulated the variables zj1 ~ N(0, 1), zj2 ~ N(0, 1), zj3 ~ N(0, 2), and zj4 ~ N(0, 2). Then the noise distribution was simulated as the sum of independent symmetric normal error plus the correlated errors across the covariate groups {varepsilon}jks ~ N(0, 1) + zj1, if s ≤ [njk/2], k = 1, 2, {varepsilon}jks ~ N(0, 1) + zj2, if s > [njk/2], k = 1, 2, {varepsilon}jks ~ N(0, 1) + zj3, if s ≤ [njk/2], k = 3, 4, {varepsilon}jks ~ N(0, 1) + zj4, if s > [njk/2], k = 3, 4. The proportion of NDE genes was set to be {pi} = 0.8. The results are summarized in Table 6 based on 50 simulated datasets. The simulation result clearly demonstrates that in the presence of correlations which are different across covariate groups, both the traditional permutation method and the subpartition permutation method no longer maintain the good control of the FDRs.


View this table:
[in this window]
[in a new window]
 
Table 6 True positives and false discovery rates for 3 x 4 unbalanced design with even and odd number of replicates when {pi} = 0.8 under correlated noise

 
Another assumption to validate our approach requires that all the test statistics are independent and identically distributed, which applies to many other existing FDR approaches as well (Efron et al., 2001, Tusher et al., 2001 and Storey and Tibshirani, 2003). As the mRNAs for all the genes on an array are extracted from the same sample and are subject to the same experimental condition and normalization process, it is reasonable to assume that the marginal distributions for the majority of the genes are of the same type. Furthermore for most microarray studies, genes either act independently or have week dependence whose effect on the FDR becomes negligible as the number of genes goes to infinity (Storey and Tibshirani, 2003). Under this condition, we can employ only a small number of permutations for each gene and pool all the permutations from the genes together to form the empirical null distribution. Collapsing the distributions of the test statistics not only reduces the computational cost but also overcomes the discrete nature of the permutation distribution of a test statistic based on few observations (Tusher et al., 2001, and Reiner et al., 2003, etc). Nevertheless in practice there may exist violations such that strong dependency structure does occur among the test statistics. For example in cancer studies, the co-regulations of genes based on genomic locations and regulation network can lead to strong correlations among the genes (Reiner et al., 2003). To account for the dependency structure, we can relax the assumption and modify our procedures. Instead of pooling all the test statistics together, we employ a large number of permutations. For each permutation, we randomly shuffle the arrays within the same combination of treatment and covariate group and all the genes on the same array are shuffled simultaneously. Upon each permutation, the vector of the subpartition null statistics f2 for all the genes is computed. The data are repeatedly permutated with the empirical dependency structure of the data being preserved. By this means the joint multivariate null distribution of the test statistics can be generated from this resampling scheme. Denote the null statistic for gene i generated from the B permutations by fi,m, m = 1, ... , B. Also denote the F-statistics from the gene i by Fi, i = 1, ... , n. The corresponding empirical p-value for the statistic Fi is defined as p-value Formula 14 I(fi, m ≥ Fi). Given a fixed cut-off value {alpha} for p-value, we can obtain the realized FDR and its estimates. It is noted that using the multivariate null distribution as the reference distribution, given the p-value threshold {alpha} for each marginal distribution, the estimated number of false positives still equals n{alpha}. This is because the expectation of sums of Bernoulli false discovery outcomes do not depend on the correlation structure of the joint distribution (Hu et al., 2005). Therefore we are able to control the estimated FDRs under the dependency structure.


    Acknowledgments
 
This research was supported by Canadian NSERC Grant.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Joaquin Dopazo

Received on December 14, 2005; revised on February 28, 2006; accepted on March 19, 2006

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHOD
 3 RESULTS ON SIMULATED...
 4 RESULTS ON BIOLOGICAL...
 5 DISCUSSION
 REFERENCES
 

    Allison, D.B., et al. (2002) A mixture model approach for the analysis of microarray gene expression data. Comput. Stat. Data. Anal, . 39, 1–20.

    Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Statiat. Soc. Ser. B, 57, 289–300.

    Dalmasso, C., et al. (2005) A simple procedure for estimating the false discovery rate. Bioinformatics, 21, 660–668[Abstract/Free Full Text].

    Efron, B., et al. (2001) Empirical Bayes analysis of a microarray experiment. J. Am. Stat. Assoc, . 96, 1151–1160[CrossRef][Web of Science].

    Gao, X., et al. (2005) Nonparametric tests for differential gene expression and interaction effects in multifactorial microarray experiments. BMC Bioinformatics, 6, 186[Medline].

    Research Report 2004-016 Guan, Z., et al. (2004) ‘Model-based approach to FDR estimation’. , MN Division of Biostatistics, University of Minnesota.

    Hu, J., et al. (2005) Practical FDR-based sample size calculations in microarray experiments. Bioinformatics, 21, 3264–3272[Abstract/Free Full Text].

    Hunter, L., et al. (2001) GEST: a gene expression search tool based on a novel Bayesian similarity metric. Bioinformatics, 17, Suppl. 1, S115–S122[Abstract].

    Jin, W., et al. (2001) The contributions of sex, genotype and age to transcriptional variance in Drosophila melanogaster. Nat. Genet, . 29, 389–395[CrossRef][Web of Science][Medline].

    Kerr, M.K., et al. (2000) Analysis of variance for gene expression microarray data. J. Comput. Biol, . 7, 819–837[CrossRef][Web of Science][Medline].

    Pan, W. (2003) On the use of permutation in and the performance of a class of nonparametric methods to detect differential gene expression. Bioinformatics, 19, 1333–1340[Abstract/Free Full Text].

    Pan, W., et al. (2003) A mixture model approach to detecting differentially expressed genes with microarray data. Funct. Integr. Genomics, 3, 117–124[CrossRef][Medline].

    Pavlidis, P., et al. (2003) Using ANOVA for gene selection from microarray studies of the nervous system. Methods, 31, 282–289[CrossRef][Web of Science][Medline].

    Pollard, K.S. and van der Laan, M.J. (2004) Choice of a null distribution in resampling-based multiple testing. J. Stat. Plan. Infer, . 125, 85–100[CrossRef].

    Working Paper Series, Working Paper 184 Pollard, K.S., Birkner, M.D., Dudoit, S., van der Laan, M.J. (2005) Test statistics null distributions in multiple testing: simulation studies and applications to genomics. , CA U.C. Berkeley Division of Biostatistics.

    Pounds, S. and Morris, S.W. (2003) Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of P-values. Bioinformatics, 19, 1236–1242[Abstract/Free Full Text].

    Pounds, S. and Cheng, C. (2004) Improving false discovery rate estimation. Bioinformatics, 20, 1–9[Abstract/Free Full Text].

    Reiner, A., et al. (2003) Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics, 19, 368–375[Abstract/Free Full Text].

    Searle, S. R. Linear Models for Unbalanced Data, (1987) , New York Wiley.

    Storey, J.D. and Tibshirani, R. (2003) Statistical significance for genomewide studies. Proc. Natl Acad. Sci. USA, 100, 9440–9445[Abstract/Free Full Text].

    Tusher, V.G., et al. (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA, 98, 5116–5121[Abstract/Free Full Text].

    Wu, B., et al. (2004) Parametric and nonparametric FDR estimation. Revisited Research Report 2004-015, , Division of Biostatistics, University of Minnesota, MN.

    Xie, Y., et al. (2005) A note on using permutation based false discoveray rate estimate to compare different analysis methods for microarray data. Bioinformatics, 21, 4280–4288[Abstract/Free Full Text].

    Zhao, Y. and Pan, W. (2003) Modified nonparametric approaches to detecting differentially expressed genes in replicated microarray experiments. Bioinformatics, 19, 1046–1054[Abstract/Free Full Text].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
K. F. Kerr
Comments on the analysis of unbalanced microarray data
Bioinformatics, August 15, 2009; 25(16): 2035 - 2041.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
J. Xu and X. Cui
Robustified MANOVA with applications in detecting differentially expressed genes from oligonucleotide arrays
Bioinformatics, April 15, 2008; 24(8): 1056 - 1062.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/12/1486    most recent
btl109v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (5)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Gao, X.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Gao, X.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?