Bioinformatics Advance Access originally published online on June 5, 2007
Bioinformatics 2007 23(16):2073-2079; doi:10.1093/bioinformatics/btm292
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
A global approach to identify differentially expressed genes in cDNA (two-color) microarray experiments
1Division of Endocrinology, Metabolism and Lipid Research, Department of Internal Medicine and 2Department of Genetics, Washington University School of Medicine, St. Louis, MO 63110, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Currently most of the methods for identifying differentially expressed genes fall into the category of so called single-gene-analysis, performing hypothesis testing on a gene-by-gene basis. In a single-gene-analysis approach, estimating the variability of each gene is required to determine whether a gene is differentially expressed or not. Poor accuracy of variability estimation makes it difficult to identify genes with small fold-changes unless a very large number of replicate experiments are performed.
Results: We propose a method that can avoid the difficult task of estimating variability for each gene, while reliably identifying a group of differentially expressed genes with low false discovery rates, even when the fold-changes are very small. In this article, a new characterization of differentially expressed genes is established based on a theorem about the distribution of ranks of genes sorted by (log) ratios within each array. This characterization of differentially expressed genes based on rank is an example of all-gene-analysis instead of single gene analysis. We apply the method to a cDNA microarray dataset and many low fold-changed genes (as low as 1.3 fold-changes) are reliably identified without carrying out hypothesis testing on a gene-by-gene basis. The false discovery rate is estimated in two different ways reflecting the variability from all the genes without the complications related to multiple hypothesis testing. We also provide some comparisons between our approach and single-gene-analysis based methods.
Contact: yyzhou{at}netra.wustl.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
|
|
|---|
Microarray technology can monitor the expression levels of thousands of genes simultaneously, and is widely applied in many areas of biological research (Eisen et al., 1998; Gu et al., 2002). A basic goal in analyzing microarray data is to determine which genes are differentially expressed among different samples. A natural framework is to perform hypothesis testing (such as two-sample t-test, Wilcoxon test, ANOVA, etc.) (SAS, 1999) on a gene-by-gene basis (single gene analysis). It generally involves two major steps: (1) construting a statistic S for each gene, and (2) determining the distribution of S under some null hypothesis. Many statistical methods (Beasley et al., 2004; Efron and Tibshirani 2002; Kerr et al., 2000; Neuhauser and Senske 2004; Troyanskaya et al., 2002; Tusher et al., 2001; Wolfinger et al., 2001; Yang and Churchill 2007; Zhao et al., 2003) for identifying differentially expressed genes in microarray analysis follow this general framework, and usually a P-value is assigned to each gene as the result of the single gene analysis. P-values can be calculated either parametrically or non-parametrically, depending on how the distribution of S under the null hypothesis is determined.
In a single gene analysis approach, following traditional hypothesis testing, genes are identified as differentially expressed if the P-values are smaller than a predetermined value (significance level). For example, if the expected value of S is 0 under the null hypothesis, such as for log ratios in cDNA (two-color) microarrays, then very small P-value usually requires |S| >>
S0 (or its variants), where
S0 is the width of the distribution of S under some null hypothesis (null distribution), characterizing quantitatively the variability of S for the gene. Single gene analysis-based methods generally rely on |S| >>
S0 as the criterion (i.e. working definition) for identifying differentially expressed genes and estimating
S0 for each gene is a necessary step in determining whether a gene is differentially expressed or not. For genes with large fold-changes, poor accuracy of
S0 estimation may be tolerated; however, for genes with small fold-changes (such as 20–40% changes of gene expression levels), it becomes critical to first estimate
S0 very accurately.
In typical microarray experiments a huge number N of genes are monitored simultaneously; but only a small number R of replicates (arrays) are available; i.e. the so-called small R large N situation. This makes the task of accurately estimating
S0 for each gene very difficult [unstable variance estimates, (Durbin and Rocke 2004; Rocke and Durbin 2001)]. Furthermore, the distribution of S is generally unknown; P-values thus obtained in single gene analysis methods are usually quite unreliable. In some studies people will simply pick an arbitrary fold change, such as 2-fold, as the criterion to be used in declaring a gene to be differentially-expressed.
The rank method developed here follows a different conceptual framework. It is an example of all gene analysis rather than single gene analysis; that is, analyzing genes together by working on the joint distribution of all the genes. It relies on a new characterization of differentially expressed genes different from that used in single gene analysis-based method, i.e. |S| >>
S0 (or its variants). This allows us to avoid the difficult task of estimating
S0 accurately for each gene; hence it can be more reliable in dealing with low fold-changed genes. The statistic
(which also depends on a parameter T) constructed in the Methods section is of group or global nature; i.e. it is related to all the genes rather than being associated with, and calculated for, each single gene, rending it unnecessary to assign P-value for each individual gene separately as practiced in single gene analysis-based methods. The method relies solely on the ranks of the genes in each of several arrays. The exact distribution of
under null hypothesis in our case is unknown and the number of false positives can be estimated either parametrically (by assuming a null distribution) or non-parametrically (empirically from the experimental data).
Below, we describe the rank method in detail and apply it to a cDNA microarray dataset for investigating MIN6 cells' response to glucose and insulin stimulation (Ohsugi et al., 2004). We show that this method can reliably identify differentially expressed genes with a low false discovery rate (FDR), even when the changes of their mean expression levels are quite small (20–40%). Our method possesses a unique feature of being more effective as N (the number of genes on the arrays) increases, which other methods seem lacking. It also provides a very simple way of choosing desired level of false positive rates. We end with a discussion about the essential differences between the rank method presented here and single gene analysis-based methods.
| 2 METHODS |
|---|
|
|
|---|
2.1 Characterization of differentially expressed genes
Suppose two samples A and B are hybridized onto R arrays (replicates), and the total number of genes on each array is N. Let qi(g) be the log ratio for gene g(g = 1, 2, ..., N) on array i(i = 1, 2, ..., R) after certain data pre-processing (Quackenbush, 2002; Yang and Speed 2002b) such as normalizations. Let ri(g) be the rank of gene g(g = 1, 2, ..., N) within array i(i = 1, 2, ..., R) by sorting qi(g) within each array (Cheng et al., 2003). Using standard technique (details in the Supplemental Material) we prove the following theorem under very general conditions.
Theorem (Random Ordering) Let {r(g); g = 1, 2, ..., N} be the ranks of genes sorted by (log) ratios within an array. Under the null hypothesis H0 that no genes are either up- or down-regulated, for each pair of genes g1 and g2 on the array, P{r(g1) < r(g2)} = P{r(g2) < r(g1)}. That is, the relative ordering of any two genes are completely random.
The theorem states that under the null hypothesis H0, no genes can rank consistently high or low across all the arrays. If in a microarray experiment we find many genes ranking consistently high (or low) across all the arrays, the null hypothesis H0 must be rejected and naturally we identify those genes ranking consistently high (or low) as differentially expressed genes. Thus, the random ordering theorem leads to the following characterization of differentially expressed genes: If a gene ranks consistently high (or low) across all the arrays, it is a differentially expressed gene.
2.2 Algorithm
The rank method for identifying differentially expressed genes is based on the above characterization of differentially expressed genes.
- Step 1: Sort genes by (log) ratios within each array
- Step 2: Top-T (and bottom-T) genes, i.e. genes consistently ranking top T (ri(g)
T) (or bottom T (ri(g) > N-T)) on all R replicate arrays, are identified as differentially expressed genes.
The list of differentially expressed genes depends on the choice of T, similar to a threshold or cut-off, and its implication will be discussed subsequently in the article.
2.3 False discovery rate
As with any prediction method based on statistical inferences there is some probability of a false prediction that we need to characterize. In our approach, the false discovery rate depends on both the value of T and the number of replicates R. For simplicity, in the following we focus only on the top ranked genes, but the same approach is equally applicable to identify the bottom-ranked genes. Denote
0 (T) the number of top-T genes calculated from experimental data and
0 (T) the expected number of top-T genes under the null hypothesis H0. We identify
0 (T) with the number of false positives; that is, of the
(T) genes being identified as differentially expressed genes,
0 (T) of them are expected to be simply due to random fluctuation (false positives). The false discovery rate is FDR(T)
0 (T)/
(T) by definition (Bickel, 2004, 2005; Cole et al., 2003; Cui and Churchill 2003). In our context it has practically the same statistical interpretation as those proposed by Benjamini and Hochberg (Benjamini and Hochberg 1995).
The exact form of the null distribution is generally unknown. We consider two ways of estimating
0 (T): either assuming a null distribution (parametric) or estimating
0 (T) empirically from experimental data (non-parametric).
2.3.1 Parametric estimate of
0(T)
The simplest parametric way is to assume that {r(g); g = 1, 2, ..., N} follows that of random permutations. That is, a gene has an equal probability to be ranked anywhere from 1 to N. Then
0 (T) = N(T/N)R, and the standard deviation (SD) of
0 (T) is
|
|
The derivation of the above formula is in Supplemental Material. The random ordering theorem proved does not imply that the null distribution is necessarily that of random permutations. This particular form of the null distribution requires equal variability of all the genes, which is a very strong condition and usually very difficult to justify. Genes with unusually large variability will tend to rank either on top or on bottom (i.e. bi-polar), while genes with small variability are more likely to rank in the middle. If there are Nbp such bi-polar genes, we would expect that the contribution from these bi-polar genes to
0 (T) is Nbp/2R simply by chance. This may lead to much larger false positives than
0 (T) = N(T/N)R predicted under the assumed random permutations. The effects of such bi-polar genes are investigated in detail by computer simulations in the Results section.
2.3.2 Non-parametric estimate of
0 (T)
In the Supplemental Material part (C), we prove that under H0, for each gene, Pr{r(g)
T} = Pr{r(g)
N+1-T}; i.e. a non-differentially expressed gene has the same probability of ranking top-T as ranking bottom-T. Intuitively, the null distribution is invariant under the transformation I1
I2, where I1 and I2 are the intensities of the two channels (samples) for cDNA microarrays. So the null distribution is independent of whether the genes are sorted in ascending or descending order by log ratios of I1 and I2. For R replicates 2R bins of top–bottom combinations can be formed: genes ranking top T on some arrays and bottom-T on the rest of arrays. Under H0 the expected number of genes falling into each one of the 2R bins should be all the same. For example, in four arrays (R=4) the probability that a non-differentially expressed gene will be ranked in the top-T on each (i.e. in the TTTT bin) is the same as the probability that it is in the bottom-T on each (the BBBB bin), and also equal to the probability that it is ranked in the top-T on the first two arrays and the bottom on the last two (the TTBB bin). The same probability holds for all 16 (2R) bins. We can count the number of genes in all of those bins. The only two bins that contain differentially expressed genes are TTTT and BBBB, i.e. genes either ranking top-T or bottom-T on all arrays. But by counting the number of genes in the other 14 (2R–2) bins, we can estimate quite accurately the number of genes due to random fluctuation (i.e. non differentially expressed genes) in those two bins (TTTT and BBBB) containing differentially expressed genes, which is the value of
0 (T) that we seek. In fact we can calculate the average and standard deviation of the gene counts for the remaining 2R–2 bins to estimate
0 (T) and its SD
0(T).
2.4 Computer simulations
As pointed out earlier, the null distribution of the ranks is generally unknown and the assumed random permutation used in the parametric estimation of
0 (T) is only an approximation. We conducted computer simulations to further investigate the effects of such bi-polar genes on
0 (T). In our computer simulations, genes are partitioned into two categories: normal and bi-polar genes. The log ratios of normal genes are drawn from the distribution U[–0.5, 0.5] (i.e. the uniform distribution on the interval [–0.5, 0.5]), while those of the bi-polar genes are drawn from U[–0.5Vbp, 0.5Vbp], where Vbp > 1. The bi-polar genes have larger variability than those of normal genes and tend to rank either high on the top or low at the bottom. Another parameter in the simulations is the number of bi-polar genes Nbp. We would expect that the larger the Nbp or Vbp, the more severe the departure of the rank distributions from random permutations, and the larger the deviations of
0 (T) from those predicted under the random permutations model.
As usual, the null hypothesis H0 can be tested by comparing
(T) calculated from real experimental data against
0 (T) estimated under the null hypothesis H0. Also notice that such statistical tests can be carried out at different values of T. This offers important flexibility in applications as we discuss in the Results section.
| 3 RESULTS |
|---|
|
|
|---|
3.1 Computer Simulations
In order to test the effect of having genes with different variability, we generated several datasets, each with a total of N = 5497 genes and R = 4 replicates, to match the experimental data we analyze subsequently. We have used the following parameters for (Nbp, Vbp) in our computer simulations: (0, 1), (500, 2), (1000, 2), (500, 4), and (1000, 4), respectively. The simulation results are shown in Figure 1. It is clear that as either Nbp or Vbp increases, the number of genes in the top-T bin also increases, with the effect most noticeable at the lower range of T (Fig. 1). The same effect, of course, also occurs for other bins (data not shown). This is as expected based on the model of bi-polar genes and demonstrates that we can easily account for that effect by counting the number of genes in each of the other 14 (2R–2) bins. Obviously, in this simulated data there are no truly differentially expressed genes, so the counts of the TTTT bin is always nearly equal to the expected number calculated non-parametrically (average of the gene counts in the other 14 bins) and the differences between the two are always within the SDs of the non-parametric estimates (data not shown). But, if we had used the random permutation assumption (i.e. the parametric estimate) instead of the empirical distribution we would have significantly underestimated the random contribution to the TTTT bin. The parametric estimate of
0(T) = N(T/N)R is the same in every simulation independent of Nbp and Vbp and is equivalent to the simulation with no bi-polar genes (i.e. in the absence of bi-polar genes Nbp = 0 the parametric estimate of the TTTT bin is quite accurate, as shown in Fig. 1).
|
3.2 Experimental data
The rank method is applied to a cDNA microarray dataset for investigating MIN6 cells response to glucose, insulin and KCl stimulation (Ohsugi et al., 2004). In the experiment, there are N = 5497 genes on each array and R = 4 replicates. Details on RNA preparation, sample paring scheme, scanning, normalization (Quackenbush, 2002; Yang et al., 2002a), direct and indirect ratios (Yang and Speed 2002b), etc. are presented in Ohsugi et al. (2004). The dataset is available as Supplementary Material.
3.2.1 Estimate of false positives
The results of the two methods (parametric and non-parametric) of estimating the expected numbers of false positives
0(T), and the estimates for its SD,
0(T), are shown in Figure 2. The results of the two ways of estimating
0(T) are very close, and in fact are consistent with each other within error bars. By comparing with the simulation data shown in Figure 1, where the non-parametric estimates of
0(T) (Ave_n) appreciably exceeds the parametric estimates of
0(T) (Ave_p), we do not observe any strong indication of bi-polar genes in the experimental dataset. In the rest of the article, the non-parametric estimate of
0(T) will be used in the discussion.
|
3.2.2 Differentially expressed genes and FDR
The
(T), number of top-T genes (and also the bottom-T genes) calculated from real experimental data, is plotted in Figure 3 together with
0(T) (non-parametric estimates), the expected number of top-T genes under the null hypothesis H0 that no genes are either up- or down-regulated.
|
We can see that
(T) increases as T increases, indicating that more and more genes are included in the list of differentially expressed genes. At the same time,
0(T) also increases as T increases, and more false positives should be expected in the list of differentially expressed genes too. Figure 3b shows the false discovery rate FDR(T) for both up- and down-regulated genes as a function of T. At small values of T,
0(T) <<
(T) and FDR(T) is very low, especially for up-regulated genes. As expected, lowering the criteria of selecting differentially expressed genes by increasing T generally increase the number of false positives as well as FDRs. This is consistent with the well-known relationship between the type I and II errors: improve one at the cost of the other.
The expected number of false positives
0(T) is the same for both up- and down-regulated genes. The false discovery rate FDR(T), defined by
0 (T)/
(T), also depends on
(T). At the range of T << N, the lists for up-regulated genes has a smaller FDR than those for the down-regulated genes simply because more up-regulated genes are identified than down-regulated ones are, i.e.
+(T) >
–(T), even though the absolute number of false positives
0(T) are the same for both up- and down-regulated genes.
3.2.3 Choice of T
Which T-value should be used to get the list of differentially expressed genes should be decided in considerations with other factors. For example (Fig. 4), if we choose T = 2700, the list of differentially expressed genes would include nearly 800 genes for either up- or down-regulated classes, of which about 250 are expected to be false positives. This is hardly an acceptable choice in almost any situation because of the huge number of false positives even though the list contains perhaps almost all the truly differentially expressed genes. On the other hand it is not very straightforward to choose between T = 300 and 1000: at T = 300, only 37 genes are identified with virtually no false positives (FDR(300)
0); while at T = 1000, the list contains 153 genes with only about 4 false positives. In the study by Ohsugi et al. (2004), about 100 (up-regulated) genes are identified with false positives of about 1 or 2, corresponding roughly to T = 700.
|
The method provides the flexibility of balancing power and false positives in a simple way by choosing different values of T at which the statistical tests are conducted. Since the calculations involved are quite simple, the results can be easily tabulated for several values of T. Figure 4 shows such an example, and can present a better idea of the trade-offs between power (how many differentially expressed genes can be detected) and the false positives. Researchers can take advantage of this flexibility and choose T that most suits their research strategies and goals.
3.2.4 Fold-changes of differentially expressed genes
We can study the distribution of fold-changes of differentially expressed genes thus identified. The fold changes are calculated using Mixed Model (Wolfinger et al., 2001). Figure 5 shows the cumulative distribution of fold-changes for the list of genes identified at several T values. Many genes with very small fold-changes (1.2–1.4) can be reliably identified with a very low false discovery rate. For example, at T = 900, 131 genes are identified with about three false positives. 100 of them are below 1.4 fold changes (i.e. 40% increases in gene expression levels). More than a dozen genes have been randomly chosen for quantitative real time PCR validation and the results have been very consistent with the microarray measurements (Ohsugi et al., 2004), even though many of the genes chosen for validation have a very small fold-change (smaller than 1.4).
|
| 4 DISCUSSION |
|---|
|
|
|---|
There are two features of the rank method that are essentially different from single gene analysis-based methods. One is the working definition (i.e. criteria) of differentially expressed genes and the other is the adoption of the all gene analysis approach. These two are closely related.
In the rank method top-T genes are identified as differentially expressed genes. This working definition of differentially expressed genes is fundamentally different from |S| >>
S0, commonly used in single gene analysis-based method. As mentioned in the Introduction, this allows us to avoid the difficult task of estimating
S0 accurately for each gene, which is critical for dealing with genes with small fold-changes. One source of such difficulty is the small number of replicates (arrays) in a typical microarray experiment. From a statistical point of view, small sample size generally results in large error bars. In our case we have only four replicates. Another source of difficulty is due to the fact that a large number of genes are measured simultaneously: some genes are bound to have smaller measured variability while others have larger ones. In single gene analysis, it seems to be very difficult to determine convincingly whether a very small variability actually measured is indeed due to true smallness of variability of that particular gene, or simply by chance due to large number of genes measured. The inability to accurately estimate the width
S0 for each gene would make it very difficult for single gene analysis methods to reliably identify genes with small fold changes, such as those shown in Figure 5, given the fact that the standard deviation of logarithmic (base e) ratios for the four arrays are 0.14, 0.15, 0.16 and 0.16, respectively, corresponding roughly to 15% changes.
Another unique feature of the rank method is its all-gene analysis approach. This approach is fundamentally different from single-gene analysis in that the calculations are based on the joint distribution of all the genes in contrast with the single gene analysis framework of one probability distribution for one gene. Single-gene analysis is prevalent and many existing methods belong to this category. This can be attributed to the basic fact that measurements of one gene are independent of all other genes, or equivalently, knowing the measurements of one gene provides no information of any other genes. Interestingly, the theorem we proved relies on exactly the same facts [see Supplemental Material, especially part (B)]. However we end up with a non-trivial joint distribution of ranks of all the genes: the ranks of genes are no longer independently distributed as in the case of single gene analysis.
One of the advantages of all gene analysis is reflected in the way random errors are handled: the estimate of false positives
0 (T) captures the collective variability from all the genes. This is another reason that, unlike single gene analysis, accurately determining the variability for each and every gene is unnecessary and can be avoided.
Another advantage is that by working with the joint distribution of all the genes, the issue of multiple hypotheses testing is greatly simplified. In single-gene-analysis methods, the number of hypothesis testing is equal to the number of genes N. Multiple testing procedures should be followed, since N is usually quite large (N = 5497 in our datasets). Many existing methods of dealing with the issue of multiple testing, such as Family Wise Error Rate [FWER, (Zar, 1999)], generalize family wise error rate (gFWER, (van der Laan et al., 2004)), as well as the classical FDR by (Benjamini and Hochberg 1995; Storey, 2003), etc, usually requires as input the raw P-values for each individual hypothesis testing obtained by some single gene analysis method. We have presented the arguments before that those P-values are usually not very reliable due to the great difficulties in accurately estimating
S0. In contrast, our all gene analysis approach identify a group of genes by a single hypothesis testing due to the characterization of differentially expressed genes we established, and the multiple hypotheses testing-related issues is actually irrelevant in our case.
Our method processes a unique property of being more effective (in the sense of lower false discover rates) as N (total number of genes) increases. In general, large N can only adversely affect single gene analysis-based methods due to issues related to multiple hypotheses testing (Zar, 1999). For the rank method, large number of genes on the arrays are actually more desirable since
0(T)
N(T/N)R (which is exact if no bi-polar genes exist) decreases as the number of gene on the array increases, which in turn leads to smaller FDR. Increasing the number of replicates (R) is helpful in reducing errors for both rank method and single-gene analysis-based methods. In single-gene analysis,
, where
S0 is the width of the null distribution for the particular gene. However, this improvement in estimating the width of the null distribution (
S0) may be rather insignificant compared with what happens for rank method:
0(T)
N(T/N)R drops rather quickly and much faster than
S0 does as R increases, which again leads to much smaller FDR.
By applying the rank method to real microarray dataset, we have demonstrated that, as shown in Figure 5, a large number of genes with small fold-changes can be reliably detected, even if each of them may not be statistically significant when analyzed individually. As we discussed earlier, it would be very difficult for single gene analysis-based methods to reliably identify genes with such small changes. Single gene analysis-based methods try to determine whether the magnitude of the fold-change for each particular gene is statistically significant or not by comparing the fold-change S with
S0, while the rank method tries to determine whether the number of top-T (i.e. differentially expressed) genes is statistically significant or not by comparing it with
0(T).
0(T) measures the strength of the random fluctuations from all the genes. As long as
(T) is much larger than
0(T), we can identify a group of
(T) genes with a very small false discovery rate.
This is a very important result for practical applications of microarray technology. Many stimuli or treatments may often have effects in the form of changing the expression levels of many genes by just a small amount (Mootha et al., 2003; Yechoor et al., 2002). The collective results of these small changes, however, may have profound effects on biological systems (such as cells or organs), especially over a long period of time (such as diet and smoking). Reliably detecting such group of low fold-changed genes is invaluable for a better understanding of many biological processes. Indeed the major conclusion of the microarray study (Ohsugi et al., 2004), used here as an example to illustrate the rank method, that both glucose and insulin stimulate a common set of genes in insulinoma cells is primarily based on such small fold-changed (1.2–1.4) transcripts. The rank method may be used to reliably identify a large number of low fold-changed genes in such cases.
Another feature of the rank method is that by considering various values of T, we can control the false positives (or false positive rate) in a very simple and convenient way. Such flexibility allows us easily to choose the one that most fits the strategy and goal of the investigation.
Clearly in this method information on the fold-changes is lost since we only use ranks in our calculation. This may present an advantage since we do not have to make any arbitrary cut-off (such as 2-fold changes) in determining differentially expressed genes. In order to get information of fold-changes, other methods such as ANOVA can be employed, and the results from different approaches should be compared and cross-examined with each other. Another limitation of the method from a practical point of view is that one single bad array would have damaged the whole set of data. Hence, array quality must be monitored closely and arrays with obvious defects should be excluded from the analysis (this also applies to other methods). In extreme cases such as a gene ranking consistently high in all arrays but one, other methods such as quantitative real time-PCR should be used to study the gene directly. If there are many arrays, the characterization of differentially expressed genes as rank top on all array may be too stringent and some modifications should be made to make it more practically useful. One possible modification is to use rank top on many arrays instead of rank top on all arrays. Consider R = 50 as an example. We can calculate the number of genes that rank top-T on 50, 45, 40, 35 arrays and use similar non-parametric procedures to estimate the contributions from random fluctuation of ranking. Finally, if a large percentage of genes on the arrays are changed among different samples (as in the case of some customized arrays with small numbers of genes: e.g. N < 200), the method would not be very effective since the ranks of all the differentially expressed genes also tend to be random relative to each other, similar to those followed by the ranks of non-differentially expressed genes. This shows most clearly the essence of the rank method: it compares genes with a pool of non-differentially expressed genes. Such approach is orthogonally different from any single gene analysis-based methods, where each individual gene is tested separately for being differentially expressed or not, and one gene is never compared with any other genes during hypothesis testing. The rank method would perform most effectively when there are many differentially expressed (i.e. regulated) genes, but still the overwhelming majority of genes on the arrays remain unchanged. In such a scenario, the handful differentially expressed genes will rank high (or low) consistently across all the arrays relative to the rest of the non-differentially expressed genes whose ranks are simply random distributed.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
This work is supported in part by National Institute of Health Grants DK16746, DK56954, DK99007 (to M.A.P); GM28755 (to G.D.S.); NSF FIBR under grant number 0425749 and the Battelle Pacific under DOE project (to Prof. Bijoy Ghosh, Department of Electrical and Systems Engineering, Washington University in Saint Louis) and the Washington University Diabetes Research and Training Center.
We gratefully acknowledge the D. Melton lab (Harvard University) and K. Kaestner and C. Stoeckert Labs (University of Pennsylvania), as well as Ellen Ostlund, Jessica Murray, Sandy Clifton, Hiroshi Inoue, Chris Sawyer, Mike Heinz, Wesley Warren, Elaine Mardis, and other members of the Genome Sequencing Center for their work with the Endocrine Pancreas Consortium cDNA libraries and microarrays. We would also like to thank Robin Matlib for helpful discussions. We would like to thank the reviewers' comments and suggestions that improved our manuscript.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Martin Bishop
Received on October 4, 2006; revised on May 22, 2007; accepted on May 23, 2007
| REFERENCES |
|---|
|
|
|---|
Beasley TM, et al. Chebyshev's inequality for nonparametric testing with small N and alpha in microarray research. J. Roy. Stat. Soc. C, ( (2004) ) 53, : 95–108.[CrossRef].
Benjamini Y, Hochberg Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Roy. Stat. Soc, ( (1995) ) B, 57, : 289–300..
Bickel DR. Degrees of differential gene expression: detecting biologically significant expression differences and estimating their magnitudes. Bioinformatics, ( (2004) ) 20, : 682–688.
Bickel DR. Probabilities of spurious connections in gene networks: application to expression time series. Bioinformatics, ( (2005) ) 21, : 1121–1128.
Cheng C, et al. Array rank order regression analysis for the detection of gene copy-number changes in human cancer. Genomics, ( (2003) ) 82, : 122–129.[CrossRef][ISI][Medline].
Cole SW, et al. Controlling false-negative errors in microarray differential expression analysis: a PRIM approach. Bioinformatics, ( (2003) ) 19, : 1808–1816.
Cui X, Churchill GA. Statistical tests for differential expression in cDNA microarray experiments. Genome Biol, ( (2003) ) 4, : 210.[CrossRef][Medline].
Durbin BP, Rocke DM. Variance-stabilizing transformations for two-color microarrays. Bioinformatics, ( (2004) ) 20, : 660–667.
Efron B, Tibshirani R. Empirical bayes methods and false discovery rates for microarrays. Genet. Epidemiol, ( (2002) ) 23, : 70–86.[CrossRef][ISI][Medline].
Eisen MB, et al. Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA, ( (1998) ) 95, : 14863–14868.
Gu CC, et al. Role of gene expression microarray analysis in finding complex disease genes. Genet. Epidemiol, ( (2002) ) 23, : 37–56.[CrossRef][ISI][Medline].
Kerr MK, et al. Analysis of variance for gene expression microarray data. J. Comput. Biol, ( (2000) ) 7, : 819–837.[CrossRef][ISI][Medline].
Mootha VK, et al. PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat. Genet, ( (2003) ) 34, : 267–273.[CrossRef][ISI][Medline].
Neuhauser M, Senske R. The Baumgartner-Weiss-Schindler test for the detection of differentially expressed genes in replicated microarray experiments. Bioinformatics, ( (2004) ) 20, : 3553–3564.
Ohsugi M, et al. Glucose and insulin treatment of insulinoma cells results in transcriptional regulation of a common set of genes. Diabetes, ( (2004) ) 53, : 1496–508.
Quackenbush J. Microarray data normalization and transformation. Nat. Genet, ( (2002) ) 32, (Suppl): 496–501.[CrossRef][ISI][Medline].
Rocke DM, Durbin B. A model for measurement error for gene expression arrays. J. Comput. Biol, ( (2001) ) 8, : 557–569.[CrossRef][ISI][Medline].
SAS Institue Inc. SAS online doc, V8 http://v8doc.sas.com. NC, USA: Cary..
Storey J. The positive false discovery rate: a bayesian interpretation and the q-value. Ann. Stat, ( (2003) ) 31, : 2013–2035.[CrossRef].
Troyanskaya OG, et al. Nonparametric methods for identifying differentially expressed genes in microarray data. Bioinformatics, ( (2002) ) 18, : 1454–1461.
Tusher VG, et al. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA, ( (2001) ) 98, : 5116–5121.
van der Laan MJ, et al. Augmentation procedures for control of the generalized family-wise error rate and tail probabilities for the proportion of false positives. Stat. Appl. Genet. Mol. Bio, ( (2004) ) 3, ..
Wolfinger RD, et al. Assessing gene significance from cDNA microarray expression data via mixed models. J. Comput. Biol, ( (2001) ) 8, : 625–637.[CrossRef][ISI][Medline].
Yang YH, et al. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res, ( (2002a) ) 30, : e15.
Yang YH, Speed T. Design issues for cDNA microarray experiments. Nat. Rev. Genet, ( (2002b) ) 3, : 579–588.[CrossRef][ISI][Medline].
Yang H, Churchill G. Estimating p-values in small microarray experiments. Bioinformatics, ( (2007) ) 23, : 38–43.
Yechoor VK, et al. Coordinated patterns of gene expression for substrate and energy metabolism in skeletal muscle of diabetic mice. Proc. Natl. Acad. Sci. USA, ( (2002) ) 99, : 10587–10592.
Zar J. Biostatistical Analysis, ( (1999) ) Upper Saddle River, NJ: Prentics Hall..
Zhao Y, Pan W. Modified nonparametric approaches to detecting differentially expressed genes in replicated microarray experiments. Bioinformatics, ( (2003) ) 19, : 1046–1054.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||



