Skip Navigation


Bioinformatics Advance Access originally published online on March 29, 2005
Bioinformatics 2005 21(11):2684-2690; doi:10.1093/bioinformatics/bti407
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/11/2684    most recent
bti407v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (16)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Grant, G. R.
Right arrow Articles by Stoeckert, C. J.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Grant, G. R.
Right arrow Articles by Stoeckert, C. J., Jr.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

A practical false discovery rate approach to identifying patterns of differential expression in microarray data

Gregory R. Grant *, Junmin Liu and Christian J. Stoeckert, Jr.

Center for Bioinformatics, University of Pennsylvania 1429 Blockley Hall, 423 Guardian Drive, Philadelphia, PA 19104-6021, USA

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 1 INTRODUCTION
 2 DIFFERENTIAL EXPRESSION
 3 THE DATA
 4 THE STATISTICS
 5 THE FDR
 6 FDR ESTIMATION
 7 LEVELS AND PATTERNS
 8 THE T-STATISTIC TUNING...
 9 THE CHOICE OF...
 REFERENCES
 

Summary: Searching for differentially expressed genes is one of the most common applications for microarrays, yet statistically there are difficult hurdles to achieving adequate rigor and practicality. False discovery rate (FDR) approaches have become relatively standard; however, how to define and control the FDR has been hotly debated. Permutation estimation approaches such as SAM and PaGE can be effective; however, they leave much room for improvement. We pursue the permutation estimation method and describe a convenient definition for the FDR that can be estimated in a straightforward manner. We then discuss issues regarding the choice of statistic and data transformation. It is impossible to optimize the power of any statistic for thousands of genes simultaneously, and we look at the practical consequences of this. For example, the log transform can both help and hurt at the same time, depending on the gene. We examine issues surrounding the SAM ‘fudge factor’ parameter, and how to handle these issues by optimizing with respect to power.

Availability: Java and Perl implementations are available at www.cbil.upenn.edu/PaGE

Contact: ggrant{at}pcbi.upenn.edu


    1 INTRODUCTION
 TOP
 Abstract
 1 INTRODUCTION
 2 DIFFERENTIAL EXPRESSION
 3 THE DATA
 4 THE STATISTICS
 5 THE FDR
 6 FDR ESTIMATION
 7 LEVELS AND PATTERNS
 8 THE T-STATISTIC TUNING...
 9 THE CHOICE OF...
 REFERENCES
 
A common use of microarrays is to find differentially expressed genes between two experimental conditions. The statistical problem of controlling the error rates has proven to be difficult using straightforward classical statistics, and instead the relatively new false discovery rate (FDR) approaches (Benjamini and Hochberg, 1995) have become widely accepted as appropriate (Ge et al., 2003). Therefore we will not argue about the merits of the FDR approach, but we will take it as our starting point. The FDR approach is to accept some false positives, while attempting to control the proportion of them in the set of all genes predicted. Benjamini and Hochberg (1995) did not provide a general and powerful method for achieving FDR control. Instead their original method relies on strong assumptions and even when those hold the method can be quite conservative. The method starts with any gene-by-gene p-values, and then for a desired FDR Q it ‘adjusts’ the p-values and a cutoff is determined so that the genes whose adjusted p-values are smaller than the cutoff have the expected proportion of false-positives no more than Q. By describing one method as having greater ‘power’ than another, we will mean that the method predicts a larger set of genes at the desired FDR.

There have been many methods proposed to make the Benjamini and Hochberg methods more general and more powerful (e.g. Benjamini and Yekultieli, 2001; Pounds and Cheng, 2004); however, there are always assumptions required to obtain the gene-by-gene p-values. Typical assumptions are normality of intensity distributions, independence of p-value distributions across genes or identically distributed t-statistics. For the most part the degree to which these assumptions affect the results on real data is unknown. The methods work well on simulated data, but since we do not yet understand the nature of real data well enough to properly simulate them, it is difficult to compare methods in a meaningful way.

When there are only a few replicates available in each condition, and thousands of genes on the array, permutation p-values, or rank-based p-values such as the Wilcoxon rank sum statistic, which would overcome some of the parametric assumptions, are too granular to be useful. For example, with three replicates in each condition, there are only 20 permutations, so an array with 20 000 genes would necessarily clump 1000 of them into the highest significance value 0.05. The empirical Bayes method of Efron and Tibshirani (2002) is based on the Wilcoxon statistic and so requires a similar number of replicates. PaGE is a permutation based method which attempts to avoid as many parametric assumptions as possible, while also avoiding the granularity of p-values and rank-based statistics altogether.

PaGE (Manduchi et al., 2001) and SAM (Tusher et al., 2001) are methods that attempt to control the FDR without the use of corrected p-values. SAM and PaGE use methods of permutation estimation, as described below.1 SAM has become a popular application; however, it relies on some heuristics the consequences of which are not well documented and which in fact present several limitations as we will show below. We propose a slightly different definition of the FDR and we will show why it is more straightforward for permutation estimation. We will then describe a permutation method to control the FDR. The most serious issue regarding any method is the choice of statistic. SAM has relied on a modified t-statistic which depends on an extra parameter. We will show how sensitive results can be to this parameter, and will show how the SAM method of setting this parameter does not generally have good power properties. The current PaGE approach to this problem is different, as described in Section 8.

Other issues relating to the statistic and data transformations are discussed. The conclusion is that no statistic is best for finding all differentially expressed genes, and relying on one statistic over another generally always involves a tradeoff.

PaGE can be used to find differentially expressed genes between two conditions, and to generate patterns across several conditions. PaGE was introduced by (Manduchi et al., 1999), and though the algorithm has changed significantly, the general approach of generating discrete patterns using an FDR-based confidence measure has remained unchanged.

Implementations of PaGE are available, in Java and Perl, at http://www.cbil.upenn.edu/PaGE/. A complete description of the PaGE algorithm as implemented is available in the technical manual documentation, http://www.cbil.upenn.edu/PaGE/techman.pdf.


    2 DIFFERENTIAL EXPRESSION
 TOP
 Abstract
 1 INTRODUCTION
 2 DIFFERENTIAL EXPRESSION
 3 THE DATA
 4 THE STATISTICS
 5 THE FDR
 6 FDR ESTIMATION
 7 LEVELS AND PATTERNS
 8 THE T-STATISTIC TUNING...
 9 THE CHOICE OF...
 REFERENCES
 
We assume that there are two well-defined experimental conditions, and that each gene has a measurable expression intensity that follows some (unknown) distribution in each condition. Differential expression of a gene means that these distributions are different between the two conditions. The distributions can differ in any possible way, but the statistics we use are designed to be sensitive to primarily a difference in the means (e.g. the t-statistic). Even so, the hypotheses being tested are of equality of distributions. This is a necessary consequence of using the permutation methods that we do. The data are assumed to consist of multiple quantified microarray experiments in each condition.

There are many ways to design a study for comparative analysis. We will focus here on the two-sample case where there is a separate measurement in each condition, for each gene. Another popular design, the direct comparison design, is to hybridize the two conditions to the two channels of a two-channel array, respectively, and to generate some number of replicate arrays. The PaGE software handles all of these designs; however, for concision, the direct comparison design will not be discussed here, nor other cases such as paired or reference designs; those cases are discussed in detail in the technical manual http://www.cbil.upenn.edu/PaGE/techman.pdf. The theory for all cases is similar.

We begin by considering just two experimental conditions, called condition 0 and condition 1. Condition 0 will be referred to as the reference condition. Up-regulation of a gene will mean the gene's mean intensity is higher in condition 1 as compared to condition 0, and analogously for down-regulation.


    3 THE DATA
 TOP
 Abstract
 1 INTRODUCTION
 2 DIFFERENTIAL EXPRESSION
 3 THE DATA
 4 THE STATISTICS
 5 THE FDR
 6 FDR ESTIMATION
 7 LEVELS AND PATTERNS
 8 THE T-STATISTIC TUNING...
 9 THE CHOICE OF...
 REFERENCES
 
The data will be assumed to consist of some number m of replicate arrays in condition 0 and some number n of replicates in condition 1. We put the data in a matrix as in Equation (1) below.


{bti407e1}

(1)
The m + n columns in Equation (1) correspond to hybridizations and the g rows correspond to genes. The columns labeled with the Cis correspond to hybridizations from condition 0, and the columns labeled with the Dis correspond to the hybridizations from condition 1.

A permutation of the data will consist of choosing some number k of columns from condition 0, and then choosing the same number of columns from condition 1, and switching them. We then obtain two new conditions from the first m and the last n columns of the permuted data matrix. What is important in a permutation is which columns end up in which condition, and not the order that they happen to be listed in. Therefore there are a total of possible permutations in the two-sample case.

Denote a row of the data matrix by r. If p is a permutation of the columns, we denote the correspondingly permuted row by rp.


    4 THE STATISTICS
 TOP
 Abstract
 1 INTRODUCTION
 2 DIFFERENTIAL EXPRESSION
 3 THE DATA
 4 THE STATISTICS
 5 THE FDR
 6 FDR ESTIMATION
 7 LEVELS AND PATTERNS
 8 THE T-STATISTIC TUNING...
 9 THE CHOICE OF...
 REFERENCES
 
We denote by S any two-class statistic, and think of it as a function which maps rows of the data matrix r to real numbers. Suppose there is some center point c such that S > c indicates up-regulation and S < c indicates down-regulation. The statistics we have in mind are:

  • The modified t-statistic

    (2)
    where µ0 is the mean of (c1,c2,...,cm), µ1 is the mean of (d1,d2,...,dn),

    where and .
    In this case the center c equals 0. When {alpha} = 0 this is the standard two-sample t-statistic. As we will see (Section 8), results can be extremely sensitive to the value of {alpha}; therefore we refer to {alpha} as the t-statistic tuning parameter.2

  • The second statistic is the ratio of the means in the two conditions

    where µ0 is the mean of (c1,c2,...,cm) and µ1 is the mean of (d1,d2,...,dn). In this case the center c = 1.

A statistic need not make sense in all cases. For example if there are negative intensities in the unlogged data then the ratio statistic is not sensible.

We could apply the t-statistic to the unlogged data, or to the logged data, or to any transformation of the data. Therefore, even with fixed {alpha}, the t-statistic is not one statistic but a family of statistics. This will be discussed further in Section 9. The ratio statistic makes the most sense when the data is on a multiplicative scale, such as two-channel ratios. The log ratio is similar to the modified t-statistic with a large value of {alpha}, so we do not consider it separately.

Let M be the data matrix, p a permutation of the columns and Mp the permuted data matrix. Denote the value of the statistic S on row r of M by Sr and on row r of Mp by Srp.


    5 THE FDR
 TOP
 Abstract
 1 INTRODUCTION
 2 DIFFERENTIAL EXPRESSION
 3 THE DATA
 4 THE STATISTICS
 5 THE FDR
 6 FDR ESTIMATION
 7 LEVELS AND PATTERNS
 8 THE T-STATISTIC TUNING...
 9 THE CHOICE OF...
 REFERENCES
 
We will focus on up-regulation in condition 1 versus condition 0. The case of down-regulation follows by switching the roles of conditions 0 and 1. We will also assume that larger values of S are more significant. This is the case for all of the statistics above. The case where smaller values are more significant follows by switching the direction of the inequalities.

For any row r we take the null hypothesis to be that the distribution for any observation in condition 0 is identical to the distribution for any observation in condition 1. Suppose that for g0 of the rows the null hypothesis is true. Let g1 = gg0. For each real number k > c, let be the set of rows r of M such that Sr ≥ k. is the set of predictions if we use k as the ‘cutoff’ for the statistic. Let Rk be the size of . Let Vk be the number of rows in for which the null hypothesis is true.

With this set-up, we will have made Vk false predictions out of Rk total predictions. Provided Rk > 0, we call the ratio Vk/Rk the false discovery proportion of this set of predictions.

We choose k so that this proportion is controlled to a desired level. There are many ways to define what it means to control this proportion, and our FDR definition differs from the original one of Benjamini and Hochberg (1995) as well as that of Storey (2002) and Storey and Tibshirani (2003). Throughout we will define the FDR of the procedure as

(3)
This differs from the original definition of Benjamini and Hochberg (1995) which is given by

(4)

The advantage of the original definition (4) is that it takes into account the dependence between Vk and Rk. However an advantage of definition (3) is that it can be more realistically estimated via permutation distributions.3

The goal is to find the least conservative (i.e. smallest) value of k so that (3) is acceptably low. Sometimes one is willing to tolerate a relatively high FDR such as 0.5; other times a low FDR such as 0.05 is desired.


    6 FDR ESTIMATION
 TOP
 Abstract
 1 INTRODUCTION
 2 DIFFERENTIAL EXPRESSION
 3 THE DATA
 4 THE STATISTICS
 5 THE FDR
 6 FDR ESTIMATION
 7 LEVELS AND PATTERNS
 8 THE T-STATISTIC TUNING...
 9 THE CHOICE OF...
 REFERENCES
 
For each permutation p of the data matrix we obtain a value , the number of rows whose permutation statistic Srp ≥ k. Thus we obtain a permutation distribution Dk of Vk under the complete null hypothesis (that is, when all null hypotheses are true). Note that the distribution Dk restricted to the null genes depends on the joint distribution of the Sr, over all, and the joint distribution of the Sr restricted to the null genes is maintained by permuting the data matrix in columns.

Let be the mean of Dk. Typically permutation distributions are used to derive p-values, for which there is substantial theory. Here, however, we are interested in actually estimating E(Vk) from the permutation distribution, and this requires some justification. Note that similar ‘permutation estimates’ are utilized in the SAM theory of Storey and Tibshirani (2003); however they do not address their properties (Pan, 2003). There are two problems in using as an estimate of E(Vk). First, since it is calculated under the complete null hypothesis, it is at best a measure of how many hypotheses would be falsely rejected if they were all true. So, assuming that some hypotheses are false, would be an overestimate. Second, unless all hypotheses are true, the false hypotheses can cause the distribution of Vk to be different from what it would be if we could consider only the true hypotheses in defining Dk. Since we do not know which hypotheses are true, we must allow the false hypotheses to contribute to the counts involved in Dk.

Regarding the second issue, we argue that is conservative. Suppose that null hypothesis r is false. Then for permutations p which switch only one or a few columns between conditions, the false hypothesis r will tend to have large values of the statistic Sr, and will therefore tend to contribute to the count of more than it would if were true. Similarly, down-regulated genes will tend to over-contribute to the count for those permutations which switch most or all of the columns between the conditions. Therefore the estimate will tend to be larger than the true value of E(Vk). The more hypotheses that are false, the more conservative will be.

Turning to the first issue, since is an overestimate of E(Vk),

(5)
is an underestimate of the number of true positives (the rows in for which the null hypothesis is false). Therefore is an overestimate of the number of true hypotheses. Originally was calculated as an estimate of Vk assuming all hypotheses are true. If we recalculate assuming there are true hypotheses, then we obtain

Since is an overestimate of the number of true hypotheses, is still an overestimate of E(Vk); however it is a better estimate than . Using the same logic, we calculate

and in general

This sequence is decreasing and bounded below, and therefore converges. In fact it converges quickly and PaGE takes as its final estimate for Vk, where . We denote by .

We take as estimate of the FDR

It is useful to also define the quantity CONFk = 1 – k. CONFk is an estimate of the probability that any gene taken at random from is a true positive.

We assign confidences not just to the sets , but to the rows of the data matrix themselves, by

where r is a row of the data matrix. In this way, we have confidence of at least {gamma} that a row r with CONFr = {gamma} represents a truly differentially expressed gene.


    7 LEVELS AND PATTERNS
 TOP
 Abstract
 1 INTRODUCTION
 2 DIFFERENTIAL EXPRESSION
 3 THE DATA
 4 THE STATISTICS
 5 THE FDR
 6 FDR ESTIMATION
 7 LEVELS AND PATTERNS
 8 THE T-STATISTIC TUNING...
 9 THE CHOICE OF...
 REFERENCES
 
SAM offers a multi-class mode which tests the hypothesis that the means in all conditions are equal. This does not, however, give an indication of what exactly is going on in which conditions, or of a way of sorting the genes by their behavior across the conditions. In order to visualize the behavior of genes across multiple conditions, PaGE was designed to perform a series of pairwise comparisons and to generate patterns from the results (Manduchi et al., 2000). One condition is chosen as a reference and the remaining conditions are compared to this reference condition, which we call condition 0.

If there are n conditions including condition 0, then a simple way to generate patterns of length n – 1 would be to put +1 in position i if there is up-regulation at the desired confidence in condition i versus the reference, and similarly –1 if there is down-regulation; if there is no differential regulation at the desired confidence, then put 0.

This would allow one to see where the differential expression is happening. However, in order to make this strategy more descriptive, we do not restrict to just patterns of 0,±1, but use the rest of the integers as well. Higher levels represent a higher confidence of differential expression than lower levels. The levels are determined as follows. First, the user chooses a confidence {gamma}. A cutoff C for the statistic is determined by setting it to the minimum k for which CONFk > {gamma}, if such a C exists (CONFk > {gamma} is defined in Section 6). The set of all rows r such that Sr > C then gives the least conservative set of predictions with a confidence of at least {gamma}. Depending on the data, there may not be any such value of C that achieves confidence {gamma}, in which case C is set to be {infty}.

Now, depending on whether the statistic is on an additive scale (such as the t-statistic), or on a multiplicative scale (such as the ratio statistic), the levels are created differently. In the additive case if the statistic Sr < C, then row r is given level 0. If C ≤ Sr < 2C, then it is given level 1. If 2C ≤ Sr < 3C then it is given level 2. In general if

then it is given level n.

If the statistic is on a multiplicative scale, then row r is given level n if

When the data are multiplicative and the ratio statistic is used then the patterns take on the intuitive meaning of fold-change.

The parameter {gamma} is referred to as the level confidence. As one raises the level confidence, fewer levels are produced and the genes assigned to the levels have higher confidence. We have found this to be a convenient way of visualizing the results of a study with many conditions.


    8 THE T-STATISTIC TUNING PARAMETER
 TOP
 Abstract
 1 INTRODUCTION
 2 DIFFERENTIAL EXPRESSION
 3 THE DATA
 4 THE STATISTICS
 5 THE FDR
 6 FDR ESTIMATION
 7 LEVELS AND PATTERNS
 8 THE T-STATISTIC TUNING...
 9 THE CHOICE OF...
 REFERENCES
 
When using the t-statistic, the number of genes found at the desired confidence can be dramatically affected by the value of the tuning parameter {alpha} in Equation (2). This is particularly true when there are only a few replicates per condition. The FDR is conservatively estimated regardless of the value of {alpha}, so choosing a value of {alpha} which maximizes the number of results at the desired confidence is desirable. If too many genes are found, the confidence can be raised to find a smaller set.

We observe this effect empirically in a wide range of datasets, both real and simulated; see for example the data in Tables 1 3.


View this table:
[in this window]
[in a new window]
 
Table 1 The effect of the t-statistic tuning parametera

 

View this table:
[in this window]
[in a new window]
 
Table 3 Similar to Tables 1 and 2 the effect of the t-statistic tuning parameter on real data consisting of eight replicates direct comparison dataa

 
In these tables the number of genes found differentially expressed, at a fixed confidence, is given as a function of {alpha}. Table 1 represents simulated data while Tables 2 and 3 are from real data. For the simulated data, 50 genes are up-regulated out of 1000 genes. The intensities in each row for each condition are given by beta distributions. With beta distributions, by varying the two parameters as well as the range, we can produce heterogeneous behavior: from unimodal, to bimodal, are highly skewed. Since gene expression data are highly heterogeneous, the beta provides a better model than the normal, with respect to the heterogeneity of distributions. For the complete details of the simulation engine, see Grant et al. (2005; www.cbil.upenn.edu/expression_simulator/)


View this table:
[in this window]
[in a new window]
 
Table 2 Similar to Table 1, the effect of the t-statistic tuning parameter on real data consisting of four replicates per condition, log transformed two-class mouse pancreas cells 0 and 2 h post-partial hepatectomy [data published in White et al (2005)]a

 
In any case, the sensitivity on {alpha} as shown in Table 1 was not dependent on the arbitrary parameter choices used to simulate the data but is a general phenomenon we found to varying degrees regardless of the dataset being real or simulated. Though the nature of the dependence on {alpha} varies from dataset to dataset, the dependence itself is typical.

The reason for this dependence on {alpha} is that the standard t-statistic ({alpha} = 0) is large for those genes with small variances, and the algorithm is forced to be more conservative in its predictions to avoid picking them up. As {alpha} grows, this effect is minimized; however, when {alpha} is too large, then for the differentially expressed genes with small mean difference and small variance, {alpha} dominates the statistic, which tends to obscure the differential expression of these genes in the noise. Therefore some genes get lost as {alpha} goes down, while other genes get lost as {alpha} goes up. What value to set {alpha} to depends on the nature of the differentially expressed genes as well as the non-differentially expressed genes.

There is no known formula that can be applied to the data matrix to determine the value of {alpha} which maximizes the power. However, since the (expected) confidence of the set of predicted genes is the same regardless of {alpha}, a power criterion to determine {alpha} is desirable and we attempt to choose {alpha} to maximize the power. PaGE tries a range of 10 values of {alpha}, from small to large, and chooses the one which gives the greatest number of results. This is the default value of {alpha}. Other values of {alpha} can, however, find genes that the default value misses, as we will see below. Therefore it is important that the user has control over this parameter.

Since we are taking the maximum over 10 values of {alpha} we potentially introduce another multiple testing issue. If these were 10 independent runs this problem might become serious, but they are highly dependent, even for very different values of {alpha}, and simulation studies indicate that the confidence is not significantly affected. The extreme at which point this effect might have a significant impact on the FDR is when there are no differentially expressed genes and a high FDR is requested. In this case a false positive set might consist of one or a very few genes, and the maximization can have a significant effect on the proportion because the numbers involved are small. Because of this we do not recommend raising the FDR much higher than 0.5 when there are very few genes found.

The SAM (Tusher et al., 2001) algorithm uses an approach which depends on a smoothing criterion to select {alpha}. The rationale is that the t-statistic distributions should be identical for all (null) genes, so they impose a uniformity criteria for the t-statistics to determine {alpha}. They do not present the theory however which shows their criteria achieves reasonably optimal power, and in fact this method can give values of {alpha} that are quite far from optimal with regard to the power of the results. This could be due to the fact that they are smoothing over all genes and not just the null ones.

To illustrate this we generated several simulated datasets, using again the simulation engine from Grant et al. (2005). The first dataset has 5000 genes, 300 of which are differentially expressed. Differentially expressed genes have varying mean differences. The first 25 rows have mean difference 1.2 between the two conditions. The next 25 rows have mean difference 1.3. The next have 1.4, etc. Specifically the 12 blocks of 25 have mean differences 1.2, 1.3, 1.4, 1.5, 2, 2.2, 2.4, 2.6, 4, 4.3, 4.6 and 4.9, respectively. This gives us 100 with relatively small mean differences; 100 with medium mean differences and 100 with relatively large mean differences. These are the rows numbered 0–299. The remaining 4700 non-differentially expressed genes were generated by 4700 randomly chosen beta distributions with randomly chosen parameters.

Similarly to the data simulated for Table 1, the intensities in each conditions are given by beta distributions. For the differentially expressed genes, the variances of the distributions increase over each block of 25 from very low to very high. So row 0 has µ1 – µ0 = 1.2 and very low variances, row 1 has µ1 – µ0 = 1.2 but slightly higher variances, up to row 24 which has µ1 – µ0 = 1.2 and high variances. Row 25 then has µ1 – µ0 = 1.3 with low variances, row 26 has µ1 – µ0 = 1.3 with slightly higher variances, etc. In this way a full range of mean differences and variances is represented in the data. The full dataset can be downloaded at cbil.upenn.edu/PaGE/doc/testdata1.txt.

Using the first three replicates of each condition, we ran PaGE with five different values of {alpha} and also ran SAM. Table 4 has a summary of the results obtained with an 80% confidence cutoff for five values of {alpha} and SAM. The realized confidence is the actual confidence achieved, which can be determined since this is simulated data and we know exactly what are the true positives and true negatives.


View this table:
[in this window]
[in a new window]
 
Table 4 The power of the results as a function of the t-statistic tuning parametera

 
Table 5 breaks down the results by type of differentially expressed gene. Note that the number of genes found in each row of the table is maximized on the diagonal of the shaded portion. Therefore small values of {alpha} find genes with small mean differences and small variance; large values of {alpha} find genes with large mean differences and large variance. The above example is far from the worst case. If the non-differentially expressed genes have bimodal distributions and the differentially expressed genes have large variance, then SAM performance can be quite far from optimal.


View this table:
[in this window]
[in a new window]
 
Table 5 The effect of the t-statistic tuning parameter on different spectra of differentially expressed genes

 
The page cbil.upenn.edu/PaGE/doc/example1.htm has the complete results for the genes found at 0.8 confidence or higher. Columns represent different runs for different values of {alpha}. The final column gives the SAM results. The top three rows of the output consist of the actual values of RV (the number of true positives), R (the total number of predictions) and the realized confidence. The 300 differentially expressed genes are listed, numbered 0–299, and an ‘X’ means that that gene was found in that run.

SAM produces results very close to the setting {alpha} = 0.1. The power is maximized around {alpha} = 2.5. The PaGE default is {alpha} = 3.5. At this value 160 genes are reported, as opposed to SAM's 145. But the overlap between the two sets is only 125 genes. SAM finds 20 that PaGE does not with the default {alpha}, and PaGE finds 35 that SAM does not.

Note that it is not necessary to use the PaGE default, and in the software the user has control over the value of {alpha}. The default is simply designed to maximize the number of results. It is important to keep in mind that an FDR only makes sense within the context of a set of predictions. Even though we apply confidences to the individual genes in the sets, these confidences mean the chance of making a mistake when considering genes at random among the set of predictions made. Therefore, if one merges sets of genes found from different runs, the FDR may be increased. If two methods produce sets with the same FDR, but the results overlap proportionally more on the true positives than the false positives, then the FDR of the union of the two sets will be greater than the FDR of either set individually. So when looking for more genes, it is not generally a good idea to use all possible choices of parameter settings and transformations, and merge the results, but rather to find the one or few that work best, and always take the meaning of the confidences in the context of the separate sets of results.

The dataset and parameter settings in the previous example were not particular; we repeated this with many variations and the same behavior was seen in general. In fact, it is possible to generate simple datasets for which SAM performs much worse. To demonstrate this, we generated a simulated dataset of 1000 genes, with 100 differentially expressed, and with the non-differentially expressed genes each having a bimodal distribution. The dataset, SAM q-values and the PaGE confidences can be obtained at cbil.upenn.edu/PaGE/doc/files/bmod.html. The q-values are equal to one minus the confidence. The lowest q-values produced by SAM on this dataset are 0.5 (14 genes). In contrast, PaGE reports 17 genes at confidence >0.8, all but one of which are true positives (Table 6).


View this table:
[in this window]
[in a new window]
 
Table 6 Simulated dataset of 1000 genes, with 100 differentially expressed, and with the remaining genes each having a bimodal distribution

 

    9 THE CHOICE OF STATISTIC, TRANSFORMATION AND OTHER PARAMETERS
 TOP
 Abstract
 1 INTRODUCTION
 2 DIFFERENTIAL EXPRESSION
 3 THE DATA
 4 THE STATISTICS
 5 THE FDR
 6 FDR ESTIMATION
 7 LEVELS AND PATTERNS
 8 THE T-STATISTIC TUNING...
 9 THE CHOICE OF...
 REFERENCES
 
9.1 Using the ratio of means versus the t-statistic
For most datasets the user will probably want to start with the t-statistic option. If the t-statistic does not return many results, then keeping in mind the caveats of the previous section about multiple runs, one can try the other statistic or the log transformation options. Even if there are many genes found, however, different statistics can pick up different kinds of differential expression, as we saw in the previous section regarding using the t-statistic with different values of the tuning parameter {alpha}. To illustrate this further, we generated a simulated dataset with two conditions, 100 genes and four replicates per condition. Two of the genes are differentially expressed. The 98 non-differentially expressed genes have moderate intensity: (beta with mean 50, spread 35). Gene 0 is differentially expressed in the low-intensity range (means of ~4 and ~9 in the two conditions, respectively). Gene 1 is differentially expressed in the high-intensity range (means of ~400 and ~450 in the two conditions, respectively). The data are available at cbil.upenn.edu/PaGE/doc/testdata2.txt. Table 7 shows the results using the ratio of means statistic (left) and the t-statistic (right). Using the ratio of means, the low-intensity differentially expressed gene (gene 0) is much more significant than the high-intensity differentially expressed gene (gene 1). Conversely, when using the t-statistic, the high-intensity differentially expressed gene is much more significant than the low-intensity differentially expressed gene.


View this table:
[in this window]
[in a new window]
 
Table 7 Comparison of results using the ratio of means statistic versus the t-statistic, on simulated dataa

 
9.2 Using logged versus unlogged data
The caveats about choice of statistic apply also to the different possible data transformations one can perform. Perhaps the most common is the log transformation. PaGE offers the option of performing this transformation when one is using the t-statistic. Using the same test datasets as above, Table 8 shows what happens to the confidence of gene 1 when the data are logged versus unlogged. The confidence of Gene 0 goes down while the confidence of Gene 1 goes up.


View this table:
[in this window]
[in a new window]
 
Table 8 Comparison of results using the logged versus unlogged data with the t-statistica

 
Thus one cannot trust either approach to perform better for all genes simultaneously. This happened because applying logs to the data, and then applying the t-statistic, which focuses on differences, is similar to first taking ratios, and then taking logs. Gene 1 in the above example, whose intensities are in the high range of the spectrum, has a relatively small ratio, compared to the null genes whose intensities are in a lower range of the spectrum.

9.3 Other statistics
Statistics which are invariant to monotonic transformations, for example the Wilcoxon signed rank statistic, would allow us to avoid some of the issues above. However, without sufficiently many replicates, the Wilcoxon statistic suffers from the same granularity problems as permutation p-values. Once there are enough replicates for it to work, the null genes have a higher chance of appearing significant with the Wilcoxon because it puts small and large differences on an equal footing, and we are comparing the statistic across many genes. This can obscure the true signal and as a result we were unable to find any datasets for which this method gave superior results. We experimented with other statistics as well, including using p-values themselves as statistics, even adjusted p-values. We were unable to find a significant example where these gave improved results, so we do not include them as options in the PaGE implementation.

Ultimately there is no ‘best’ statistic or transformation. A statistic can often be optimized for a single test, and there is substantial statistical theory about how to do this. But a statistic cannot typically be optimized for thousands of genes at once. Unfortunately each dataset is particular and must be treated as a special case, but by starting with the defaults the user can home in on reasonable parameter settings to suit their needs.


    Acknowledgments
 
We thank Dr. Elisabetta Manduchi and Warren Ewens for valuable comments and discussions. This work was supported in part by NIH grant K25-HG-0052-01A1.


    Footnotes
 
1PaGE 1.0 does not use permutations; the original PaGE algorithm was replaced by a permutation algorithm in PaGE 4.0. Back

2This is analogous to the t-statistic ‘fudge factor’ introduced by Tusher et al. (2001). Back

3Indeed to estimate definition (4) one needs to know something about the random properties of V/R. If we permute the columns of the data matrix (1), we can obtain some kind of approximation to an observation of V under the complete null hypothesis, but this tells us nothing about V/R under the true distribution of the data. The bootstrap distribution obtained by sampling with replacement from the two conditions separately will give us an approximation to the distribution of R, but again tells us nothing about V/R. To obtain information about V/R most authors have had to make strong assumptions about the data. Back

Received on December 15, 2004; revised on March 16, 2005; accepted on March 22, 2005

    REFERENCES
 TOP
 Abstract
 1 INTRODUCTION
 2 DIFFERENTIAL EXPRESSION
 3 THE DATA
 4 THE STATISTICS
 5 THE FDR
 6 FDR ESTIMATION
 7 LEVELS AND PATTERNS
 8 THE T-STATISTIC TUNING...
 9 THE CHOICE OF...
 REFERENCES
 

    Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. Proc. R. Statist. Soc. B, 57, 289–300.

    Benjamini, Y. and Yekultieli, D. (2001) The control of the false discovery rate in multiple testing under dependency. Ann. Statist., 29, 1165–1188[CrossRef].

    Efron, B. and Tibshirani, R. (2002) Empirical Bayes methods and false discovery rates for microarrays. Genet. Epidemiol., 23, 70–86[CrossRef][ISI][Medline].

    Ge, Y., et al. (2003) Resampling-based multiple testing for DNA microarray data analysis. Test, 12, 1–44.

    Technical Report Grant, G.R., Sokolowski, S., Stoeckert, C.J., Jr. (2005) Performance analysis of differential expression prediction algorithms using simulated array data.

    Manduchi, E., et al. (2000) Generation of patterns from gene expression data by assigning confidence to differentially expressed genes. Bioinformatics, 16, 685–698[Abstract/Free Full Text].

    Pan, W. (2003) On the use of permutation in and the performance of a class of nonparametric methods to detect differential gene expression. Bioinformatics, 19, 1333–1340[Abstract/Free Full Text].

    Pounds, S. and Cheng, C. (2004) Improving false discovery rate estimation. Bioinformatics, 20, 1737–1745[Abstract/Free Full Text].

    Simmons, C.A., et al. (2005) Spatial heterogeneity of endothelial phenotypes correlates with side-specific vulnerability to calcification in normal porcine aortic valves. Circ. Res., in press.

    Storey, J.D. (2002) A direct approach to false discovery rates. J. Roy. Statist. Soc., 84, 479–498.

    Storey, J.D. and Tibshirani, R. Parmigiani, G., Garrett, E.S., Irizarry, R.A., Zeger, S. The Analysis of Gene Expression Data, (2003) , New York Springer, pp. 272–290.

    Tusher, V.G., et al. (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA, 98, 5116–5121[Abstract/Free Full Text].

    White, P., et al. (2005) Identification of transcriptional networks during liver regeneration. J. Biol. Chem., 28, 3715–3722.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
DiabetesHome page
P. White, C. Lee May, R. N. Lamounier, J. E. Brestelli, and K. H. Kaestner
Defining Pancreatic Endocrine Precursors and Their Descendants
Diabetes, March 1, 2008; 57(3): 654 - 668.
[Abstract] [Full Text] [PDF]


Home page
Physiol. GenomicsHome page
M. Kidd, B. Nadler, S. Mane, G. Eick, M. Malfertheiner, M. Champaneria, R. Pfragner, and I. Modlin
GeneChip, geNorm, and gastrointestinal tumors: novel reference genes for real-time PCR
Physiol Genomics, August 20, 2007; 30(3): 363 - 370.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
D. Sahoo, D. L. Dill, R. Tibshirani, and S. K. Plevritis
Extracting binary signals from microarray time-course data
Nucleic Acids Res., June 28, 2007; 35(11): 3705 - 3712.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
S.-D. Zhang and T. W. Gant
Effect of pooling samples on the efficiency of comparative studies using microarrays
Bioinformatics, December 15, 2005; 21(24): 4378 - 4383.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
Y. Pawitan, K. R. K. Murthy, S. Michiels, and A. Ploner
Bias in the estimation of false discovery rate in microarray studies
Bioinformatics, October 15, 2005; 21(20): 3865 - 3872.
[Abstract] [Full Text] [PDF]


Home page
Plant CellHome page
M. A. Mazzella, M. V. Arana, R. J. Staneloni, S. Perelman, M. J. Rodriguez Batiller, J. Muschietti, P. D. Cerdan, K. Chen, R. A. Sanchez, T. Zhu, et al.
Phytochrome Control of the Arabidopsis Transcriptome Anticipates Seedling Exposure to Light
PLANT CELL, September 1, 2005; 17(9): 2507 - 2516.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/11/2684    most recent
bti407v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (16)
Right arrowRequest Permissions
Google Scholar<