Bioinformatics Advance Access originally published online on January 18, 2005
Bioinformatics 2005 21(9):2067-2075; doi:10.1093/bioinformatics/bti270
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Use of within-array replicate spots for assessing differential expression in microarray experiments
Walter and Eliza Hall Institute of Medical Research Melbourne, Vic 3050, Australia
*To whom correspondence should be addressed.
| Abstract |
|---|
Motivation: Spotted arrays are often printed with probes in duplicate or triplicate, but current methods for assessing differential expression are not able to make full use of the resulting information. The usual practice is to average the duplicate or triplicate results for each probe before assessing differential expression. This results in the loss of valuable information about genewise variability.
Results: A method is proposed for extracting more information from within-array replicate spots in microarray experiments by estimating the strength of the correlation between them. The method involves fitting separate linear models to the expression data for each gene but with a common value for the between-replicate correlation. The method greatly improves the precision with which the genewise variances are estimated and thereby improves inference methods designed to identify differentially expressed genes. The method may be combined with empirical Bayes methods for moderating the genewise variances between genes. The method is validated using data from a microarray experiment involving calibration and ratio control spots in conjunction with spiked-in RNA. Comparing results for calibration and ratio control spots shows that the common correlation method results in substantially better discrimination of differentially expressed genes from those which are not. The spike-in experiment also confirms that the results may be further improved by empirical Bayes smoothing of the variances when the sample size is small.
Availability: The methodology is implemented in the limma software package for R, available from the CRAN repository http://www.r-project.org
Contact: smyth{at}wehi.edu.au
| 1 INTRODUCTION |
|---|
Microarrays measure the mRNA expression of tens of thousands of genes in a single hybridization experiment. Designed experiments involving two or more microarrays hybridized with RNA from different sources generate expression profiles which can help classify the genes according to functional groups or molecular pathways. Although much attention has been given to the statistical analysis of microarray data many problems are still unresolved (Nguyen et al., 2002; Smyth et al., 2003; Parmigiani et al., 2003; Speed, 2003; Causton et al., 2003; Firestein and Pisetsky, 2002; Tilstone, 2003). Particular challenges and opportunities arise from the multiplicity of genes and the possibilities for parallel inference.
A standard analysis method is to fit the same statistical model separately to the expression measurements for each gene (Wolfinger et al., 2001; Yang and Speed, 2003). A number of authors have noted that inference for each individual gene can be made more reliable by making use of information generated from the whole ensemble of genes (Newton et al., 2001; Tusher et al., 2001; Efron et al., 2001; Efron and Tibshirani, 2002; Lönnstedt and Speed, 2002; Kendziorski et al., 2003; Newton et al., 2004; Smyth, 2004). Such methods have not as yet been applied to experimental designs in which there are technical or biological replicates leading to multiple strata of random variation for each gene. This article develops a between-gene moderation method appropriate for a particular type of technical replication, that of within-array replicate spots. The method proposed is particularly simple in that a suitably chosen parameter is constrained to be common between the genes. The treatment proposed here for within-array replicates may be combined with moderation methods designed for a single error strata.
Spotted microarrays are produced by printing cDNA or oligonucleotide sequences on glass slides using a robotic printer. The spots are laid down using a printhead made up of capillary print tips or pins or inkjets. The DNA is prepared in 96-well or 384-well plates ready for printing, with normally one well for each distinct probe. The robot acquires DNA by dipping the tips of the print head into the wells of the plate before depositing the DNA on the glass slide. In most cases only a small proportion of the DNA in each well is actually printed onto the arrays and any excess is discarded. Provided that there is space on the array, there is no cost, apart from printing time, in printing two or more spots on the arrays from each well. This is accomplished by programming the robotic printer to dip the print head more than once into the same set of wells. This results in a printed array in which each gene appears two or more times a fixed distance apart. Usually multiple printing produces two spots of each gene but an arbitrarily large number of replicate spots may be printed if there is sufficient space to accommodate them on the arrays.
Normally the replicate spots are printed side-by-side in the same row, in the same column or at the top and bottom halves of the array. Any intensity or log-ratio measurements made from the replicate spots will be positively correlated though being observed on the same array. Replicate spots which are side-by-side are likely to be very highly correlated since they are not only printed with the same gene but are also spatially close together and therefore likely to share many common causes including local effects on the array surfaces as well as hybridization and labelling effects. Indeed the value of having multiple prints of each clone on an array has often been questioned given the low within-array variability compared with between-array variability (Tran et al., 2002). Replicate spots in the top and bottom halves of the array are also likely to be positively correlated but less so than side-by-side replicates.
Replicate spots are often used as a quality assessment tool since disagreement between replicates is strong evidence that at least one of the spots is affected by a local artifact. Repeatability of the log-ratios across replicate spots within arrays can be used as a basis for removing outlier spots (Tseng et al., 2001; Hoffmann et al., 2002; Yang et al., 2002; Jenssen et al., 2002; Lyne et al., 2003; König et al., 2004), to construct spot-quality measures (Beissbarth et al., 2000) or to evaluate the effectiveness of a spot-quality scheme (Wang et al., 2001). It is almost universal practice to average the log-intensities or log-ratios obtained from within-array replicate spots before conducting a formal statistical analysis of differential expression (Andrews et al., 2000; Tseng et al., 2001; Berwanger et al., 2002; Hoffmann et al., 2002; Yang et al., 2002; Kaynak et al., 2003; Lyne et al., 2003), although averaging can cause complications when some of the log-ratios are missing or when there are spot-specific quality weights. Many public microarray database programs, such as the Stanford Microarray Database, automatically average log-ratios from duplicate spots. A relatively small number of studies have used within-array replicate level information to improve the assessment of differential expression (Baggerly et al., 2001; Boer et al., 2001; Fan et al., 2004).
The method developed in this paper extracts more information from the within-array replicate spots by estimating the correlation between them. A simple model is explored in which the between-replicate correlation is taken to be constant across genes. The method uses a consensus estimator of the correlation across genes in such a way that the correlation can be taken to be known at the individual gene level. Compared with simply averaging replicate spots, this method greatly improves the precision with which the genewise variances are estimated and thereby improves inference methods designed to identify differentially expressed genes.
The method is validated using data from a microarray experiment involving calibration and ratio control spots in conjunction with spiked-in RNA. Comparing results for calibration and ratio control spots shows that the within-array correlation method results in substantially better discrimination of differentially expressed genes from those which are not compared with simply averaging the replicate spots. Based on these data, the proposed method increases the power to detect differential expression when it is present without incurring a greater rate of Type I errors when it is not.
| 2 cDNA MICROARRAY PREPARATION METHODS |
|---|
2.1 Spike-in control spots
This study uses data from a set of 26 cDNA microarrays which were printed and hybridized as part of a study on human transcription factors. The paper presents data only from the spike-in control spots.
The arrays were printed at the Australian Genome Research Facility with the Hs8k cDNA clone library from Research Genetics and a selection of control spots. Each array was printed with 12 sets of the Lucidea Universal Scorecard system (Amersham). Spots were printed in duplicate, side-by-side by rows, including the 12 sets of ScoreCard spots.
The RNA samples hybridized to the arrays included ScoreCard spike mixes according to the Lucidea ScoreCard User's Guide. The ScoreCard system includes calibration and ratio control spots designed to generate pre-determined fold changes. Each set of ScoreCard spots includes 10 calibration spots, labeled here Calib1Calib10, which have a theoretical fold change of 1 and are expressed at successively decreasing intensities. The ratio controls have fold changes as follows: 3-fold up and down at low intensity (3UL and 3DL), 3-fold up and down at high intensity (3UH and 3DH), 10-fold up and down at low intensity (10UL and 10DL) and 10-fold up and down at high intensity (10UH and 10DH). The same spike mix was applied to all the arrays, so the arrays can be treated as a set of replicate arrays for the purposes of the ScoreCard spots.
2.2 Hybridization
An aliquot of 50 µg of total RNA extracted from HeLa cells and 1 µl of either reference or test spike mRNA was reverse transcribed using an anchored oligo(dT) primer and 200 units of Superscript II reverse transcriptase (Invitrogen) in the presence of 25 mM dATP, 25 mM dCTP, 25 mM dGTP, 15 mM aminoallyl-dUTP (SIGMA #A0410) and 10 mM dTTP. The single strand cDNA was purified using QIAquick PCR purification kit (Qiagen) and labelled with CyDye post-labelling dye (Amersham) for an hour. After a second purification as above, both Cy-5 and Cy-3 labelled cDNAs were pooled and mixed to 25 µg of human Cot1 DNA, 38 µg of polyA DNA and 50 µg of salmon sperm DNA. The mixture was concentrated using a vacuum dryer and resuspended in 50% formamide, 5x SSC and 0.1% SDS.
The arrays were incubated in 50% formamide, 5x SSC, 0.1% SDS and 10 mg/ml BSA for 45 min at 42°C, rinsed with distilled water and dried using an air gun. The labelled cDNA mixture was denatured at 95°C for 5 min, incubated at 45°C for 30 min and cooled to room temperature before being pipetted onto the array. The slides were incubated overnight at 42°C in a hybridization chamber (Corning) placed in a water bath. After incubation the slides were washed in 1 x SSC/0.2% SDS solution for 5 min, in 0.1 x SSC/0.2%SDS solution for 5 min, and twice in 0.1 x SSC for 2 min. The slides were then spun dry using a centrifuge.
2.3 Image analysis and normalization
The arrays were scanned using a Genepix 4000B scanner with adjusted settings in order to obtain a similar green and red overall intensity. The images were analyzed using the SPOT software (Buckley, 2000, http://www.cmis.csiro.au/iap/Spot/spotmanual.htm). Foreground intensities were background corrected using the morph background measure and the ScoreCard spot log-ratios were normalized using global loess normalization with the default smoothing span of 0.3 (Yang et al., 2001; Smyth and Speed, 2003).
| 3 THE BALANCED SINGLE SAMPLE PROBLEM |
|---|
3.1 Individual correlations
For simplicity, consider first a series of n replicate two-color microarray experiments, each array hybridized with RNA from the same two sources. Suppose that each gene is replicated m times on each array at a fixed distance apart. Image analysis and normalization of the microarray data will yield a log-ratio of expression ygij for each spot. Here ygij is the log-ratio for gene g = 1,...,G, array i = 1,...,n and replicate j = 1,...,m. Usually ygij is a normalized version of log2(Rgij/Ggij) where Rgij is the measured red intensity while Ggij is the measured green intensity for that spot. Assume that
![]() |
It is reasonable to assume that observations made on different arrays for a given gene are independent or nearly so. On the other hand, replicate observations made on the same array are likely to be correlated, perhaps highly so. For the remainder of this paper, the term replicate spots will be taken to refer to spots on the same array. Let
g be the correlation between replicate spots for gene g. We will assume that
![]() |
![]() |
j'. Observations with different i are assumed to be independent. Observations on different genes on the same array are also likely to be correlated. The correlations between genes, however, are highly problematic to estimate, because of the very large number of genes compared to the number of arrays, and so these correlations are left unspecified in this article. If the replicate spots are close together we expect
g to be large, perhaps close to unity. If the replicate spots are far apart, the correlation will be much smaller. Note that
g is constrained according to 1/(m 1)
g
1 by the requirement that the covariance matrix of the ygij be non-negative definite. It will be further assumed in this article that the ygij are normally distributed. Although this assumption will be used in deriving specific results in this paper, most of the results of this paper do not depend on normality for their usefulness. See further comments on this issue in Section 8.
For each gene g write
for the sample mean of the replicate observations on array i and
for overall sample mean across all arrays. For each gene let
be the between-arrays standard deviation,
![]() |
be the within-arrays standard deviation,
![]() |
,
and
are mutually independent and sufficient for µg,
g and
g with
![]() |
and
. The within standard deviation
does not contribute any further information. The maximum likelihood estimator of µg is
![]() |
![]() | (1) |
tn1. This explains why it is usual practice to average the replicate spots before undertaking inference for microarrays with within-array replication.
It is useful for later reference to consider the estimation of
g even though it does not contribute here to inference about µg. Let
![]() |
g is a monotonic increasing transformation of
g which takes values on the whole real line. The transformation is reversed by
![]() |
g = tanh
g when m = 2. The residual maximum likelihood (REML) (Searle et al., 1992) estimator of
g is
![]() |
. This shows that
![]() |
![]() |
is the digamma function. The variance is
![]() |
![]() |
is the trigamma function. The distribution of
is somewhat skew to the left because of the differing degrees of freedom for
and
. In the worst case with n = m = 2 the bias of
is 0.35.
3.2 Common correlation
Now we make the simplifying assumption that the between-replicate correlation is common across genes,
g =
for all g. This assumption is motivated by the belief that the correlation springs mainly from the physical proximity of the replicate spots on the same array. The robotic printing ensures that the spacing between the replicate spots is the same for all genes and all arrays. In practice it will not be necessary that the assumption be precisely true but rather that the correlations be sufficiently stable to make the common correlation model a useful one. This is likely to be true when the between and within standard deviations
and
are positively associated across genes, meaning that the correlations are much more stable than the variances. This has been true in all microarray experiments seen by the authors so far.
Let
![]() |
would be
![]() | (2) |
. This estimator remains consistent as n
even if the genes are not independent because it requires only that the mean of the
and the mean of the
converge to quantities in the ratio of 1 + (m 1)
to 1
. For the same reasons it requires only weak assumptions on the independence between genes to be consistent as G
. In practice this estimator is likely to be very accurate if the number of genes G is large. Under the assumption of dependence between genes, the bias is
![]() |
![]() |
The fact that the correlation is common between genes does not change the estimator
for each gene but does substantially change the inference about
. Since the common correlation can be estimated very accurately from the ensemble of genes,
may be treated as known to a very good approximation when undertaking inference about each individual gene. This means that
can contribute to the estimator of
, improving the precision with which we judge whether µg is nonzero. The REML estimator of
is approximately
![]() | (3) |
. The test statistic for testing H0 : µg = 0 now becomes
![]() | (4) |
| 4 RESULTS FOR BALANCED LINEAR MODELS |
|---|
Section 3 considered only replicate arrays comparing two RNA sources. The results of Section 3 generalize easily to arbitrarily complicated microarray experiments comparing
2 RNA sources. Let yg be the vector containing the nm log-ratios or log-intensities observed for gene g. A general microarray experiment can be represented by a linear model
![]() |
![]() |
![]() |
g = cTßg, where c is a vector of known constants, be a particular contrast or linear combination of the regression coefficients and suppose that interest lies in testing H0 :
g = 0. This formulation is sufficiently general to accommodate a wide variety of microarray experiments including dye-swaps, time course experiments and factorial experiments. It is also applicable to single-channel microarray experiments for which ygij is a normalized version of log2 Igij, where Igij is the measured intensity for that spot.
Generalizing from replicate arrays to the linear model causes little extra complication for the theory of Section 2. Let
be the n-vector of array means
and let
be the reduced n x k dimensional design matrix in which there is only one row instead of m rows for each gene by array combination. Then
![]() | (5) |
to be m times the residual standard error which arises from fitting the linear model (5). This mean square is now on nk instead of n1 degrees of freedom. Let
be the estimator of ßg from fitting this model and let
. The estimate
from the reduced linear model (5) is the same as that from the full linear model for yg. Note that
![]() |
![]() |
![]() |
Assuming now that
g =
, the common correlation estimator (2) would now be distributed as
if the genes were independent. The pooled variance estimator (3) now becomes
![]() |
![]() |
g = 0.
It can be seen that the relative difference in degrees of freedom between
and
can be large if k > 1 and especially if k is not <<n. This means that the gain in degrees of freedom of sg over
which results from assuming common correlations is especially important for larger values of k, i.e. for designed experiments involving a larger number of distinct RNA sources to be compared.
| 5 RESULTS FOR UNBALANCED MODELS |
|---|
Suppose that now there are spot-specific weights wgij associated with the observations so that
![]() |
![]() |
![]() |
of ßg is now somewhat dependent on the estimated value for
g. This produces an unbalanced statistical model in which there are no non-iterative formulae for the REML estimators of
g or
g. On the other hand, iterative computational procedures are readily available to compute the numerical REML estimates
and
for any given dataset (Pinheiro and Bates, 2000).
Assume now that
g =
. Even assuming independence between the genes, exact REML estimation of the common correlation would require iterative computation using the entire dataset. This is at best very unattractive computationally and would in most cases involve prohibitive memory storage requirements. An alternative and much easier strategy is to estimate the common correlation
by combining the individual correlation estimators
. The fact that this estimation method is not fully efficient is not important when the number of genes is large. Let
![]() |
is the REML estimator of
g from the data for gene g. By analogy with the balanced case we can conclude that
![]() |
is the between-array degrees of freedom and
is the within-array degrees of freedom for gene g. A combined estimator of
is
![]() |
as n
regardless of the dependence structure between the genes and is consistent as G
given weak assumptions on the dependence structure. The estimator of
is recovered by
![]() |
can be estimated by weighted least squares of yg on X with weight matrix
![]() |
is equal to Rg with
substituted for
g. The weighted least squares estimator is
![]() |
2g is the residual variance
![]() |
= 0 is
![]() |
![]() |
. The t-statistic is on nmk degrees of freedom. If wgij = 1 then
and tg reduce to the same forms as in the balanced case in Section 3 apart from differences in the estimation of
, specifically the replacement of
with
.
Note that the t-statistic tg is not sensitive to small changes in the correlation correlation
, since the estimated residual variance
will tend to compensate. This reassures us that the common correlation model will not lead to misleading results if it fails to be exactly correct for some genes.
The efficiency of
relative to the REML estimator can be computed for the balanced case under the assumption of independence between genes. When G = 1000 and n = m = 2, the standard deviation of
is 0.029, showing that its relative efficiency compared to the REML estimator is about 30%. This is the worst case; efficiency increases with the number of arrays. For example, the efficiency is 70% if there are n = 6 arrays. For the purposes of the methodology of this paper, these are acceptable efficiencies.
| 6 COMBINING WITH EMPIRICAL BAYES MODERATION |
|---|
A number of authors have shown that one can improve on the use of t-statistics for assessing differential expression in microarray experiments by using appropriate shrinkage methods to moderate the genewise sample variances (Baldi and Long, 2001; Efron et al., 2001; Tusher et al., 2001; Lönnstedt and Speed, 2002; Broberg, 2003; Smyth, 2004). We show here that the empirical Bayes method of Smyth (2004) combines in a natural way with the methods of this paper.
In the separate correlation model of Section 2.1, an inverse Gamma prior would be applied to the between-array variances
yielding posterior variances
![]() |
is the prior value and d0 the prior degrees of freedom. Replacing the sample variance in Equation (1) with the posterior variance produces the moderated t-statistic
![]() |
yielding posterior variances
![]() |
![]() |
would lead to posterior variances
![]() |
![]() |
Note that empirical Bayes smoothing could in principle be applied to the correlations as well as the variances. In fact, smoothing the between and within variances
and
independently leads immediately to smoothed correlation estimators from
![]() |
| 7 RESULTS WITH SPIKE-IN DATA |
|---|
The methodology is demonstrated on a set of 26 microarrays for which the differential expression status of a set of control spots is known. Figure 1 shows boxplots of t-statistics for the ScoreCard series of control spots. There are 12 t-statistics in each box. The grey filled boxplots, on the left of each pair of boxplots, show statistics computed using common correlations while the white boxplots on the right of each pair show statistics computed by averaging the duplicate spots.
|
The t-statistics produced by averaging the duplicate spots are on fewer degrees of freedom than those produced by the common correlation method, meaning that they are not directly comparable on the basis of magnitudes alone. One way to compare the t-statistics would be to compute P-values. The vertical axis in the plot actually shows Z-score equivalents of the t-statistics, i.e. the standard normal deviate which has the same P-value as has the t-statistic. The Z-scores put t-statistics with different degrees of freedom on the same scale. Comparing Z-scores is equivalent to comparing P-values but the Z-scores are better suited to graphical presentation.
An ideal test statistic will show Z-score values which are randomly distributed about zero with as little variability as possible for the calibration spots and Z-scores as far from zero as possible for the ratio controls. The greater the separation between the calibration values and the ratio values, the better the performance of the statistic. The plot shows that the t-statistics computed assuming common correlations give much larger absolute Z-scores for the differentially expressed genes while maintaining a similar null distribution for the non-differentially expressed spots. This shows that the common correlation t-statistics have greater power for detecting differential expression while producing no more false positives on average.
The bottom panel of Figure 1 shows the results with empirical Bayes smoothing of the sample variations while the top panel shows results with ordinary t-statistics. The relatively large number of arrays here means that the sample variances are fairly reliable so that use of empirical Bayes changes the picture only slightly.
The minimum number of arrays for which t-statistics can be computed is two, this being the minimum number to return a degree of freedom for error when the duplicate spot values are averaged. In order to examine this extreme situation, we separated the 26 arrays into 13 pairs of arrays and computed t-statistics for each pair. Figure 2 shows the results. Each boxplot in Figure 2 represents 156 values, i.e. 12 values for each of the 13 pairs. The fact that t-statistics are computed from only 2 arrays instead of all 26 means that they are less able to distinguish which spots are differentially expressed; however, the t-statistics using common correlations do markedly better. As before, the common correlation t-statistics have greater power to detect differential expression while having a similar null distribution (Fig. 2, top panel). The relative gain of the common correlation method compared with averaging the duplicate spots is perhaps even greater here with n = 2 than with the larger sample size.
|
With only two arrays for each t-statistic, the sample variances are rather unreliably estimated. When the duplicate spots are averaged, the sample variances have in fact only one degree of freedom. In this situation, empirical Bayes smoothing can be expected to have a large impact on the reliability of the statistics. The bottom panel of Figure 2 shows the same results as the left but with empirical Bayes smoothing of the variances. The empirical Bayes method greatly improves the performance of both statistics, with and without common correlations, and the separation of calibration and ratio values is improved relative to the top panel. The comparison between common and individual correlations is no longer clear cut because the Z-scores for ratio controls without common correlations are so variable, sometimes larger and sometimes smaller than the statistics with common correlations. The important observation here is that the common correlation (grey) boxes are noticeably more compact than the white boxes for all intensities of calibration spots, i.e. the rate of false discoveries is reduced. Furthermore, assuming common correlations also gives larger median Z-scores for all types of ratio controls except for 10DL, meaning that the common correlation method gives greater power in most cases as well as better control of type I errors.
| 8 DISCUSSION |
|---|
This paper shows that setting the between-replicate correlation constant across genes is a useful strategy. Results using spike-in probes show that the statistics assuming common correlations give clearly improved assessment of differential expression. Any bias which is introduced by assuming correlations to be equal seems to be more than offset by an increase in the precision with which the genewise variances are estimated. When the number of arrays is small, the spike-in results were further improved by empirical Bayes smoothing of the sample variances as in Smyth (2004).
The authors have applied the methodology to a variety of microarray experiments with arrays printed in several different laboratories with several different clone libraries. Our experience has been that correlations between side-by-side duplicates are estimated typically in the range 0.70.9, suggesting that side-by-side duplicates share about half of their variability as measured by squared correlation. Correlations between replicates in the top and bottom halves of the array are typically estimated in the range 0.50.6, suggesting that duplicates at the maximum distance apart still share about a quarter of their variability. These observations are consistent with the idea that spots which are further apart should be less highly correlated.
In most experiments the genewise correlation estimates
are found to be too variable across genes to be compatible with a common true correlation and the theoretical scaled Fnk,n sampling variability for
(data not shown). In other words the assumption of constant correlation across genes does not appear to be strictly tenable. On the other hand, the between and within sample variances
and
have been found to be positively associated, meaning that the correlation estimates
are less variable, relative to the theoretical F-distribution, than are the sample variances
relative to their theoretical
2 distribution. So the assumption of constant correlation appears to be valid in practice in the relative sense that the correlations are more nearly constant than are the variances themselves.
The effectiveness of the common correlation model seems to be due to three main characteristics. First, the estimated common correlation is very stable, it being a consensus estimator derived from a large number of genes. This stability results in a favorable variancebias trade-off, especially for small data sets. Second, the correlation is a nuisance parameter rather than a quantity of primary interest. It has been noted that the genewise t-statistics are not sensitive to small changes in the correlation estimate, so it is not necessary to track small differences in the genewise correlations provided that the common correlation estimate is broadly correct. Third, the common correlation model causes genes with poor quality data to be downweighted. Good quality data seems to give rise to consistently high correlations between replicate spots. Even for arrays with a lot of poor quality spots, the common correlation is generally large and positive. Those genes which do give rise to low or even negative correlations seem to do so most often because of poor quality data, e.g. artifacts on the surface of the arrays which affect only one of a set of replicate spots. Holding the correlation fixed forces the estimated residual variance for these genes to be relatively large to reflect the disagreement between the replicate spots. This means that statistical significance for these genes is downweighted, a phenomenon which seems conservative and desirable. Allowing each gene to estimate its own correlation would cause disagreements between replicate spots to be disregarded.
The formal calculations in this paper have assumed normality of the expression log-ratios as well as independence and constant variances across arrays for each gene. There are several reasons to expect the methodology to remain useful even for data which deviate from normality. First, apart from the bias correction b(f1,f2) which is relatively small in magnitude, the estimators derived here remain consistent given only the first and second moments of the response distribution. Second, the estimation procedures can be modified to make them more resistant to outliers. A simple method which has proved effective is to estimate
not from
in Section but from a robust mean of the
. This has the effect of ignoring a small proportion of aberrant correlation estimates. The default estimator used in the limma software package is the trimmed mean removing 15% of the values from each tail.
Third, and perhaps most importantly, the basic purpose of differential expression analyses for microarray data is to rank the genes in terms of evidence for differential expression (Smyth et al., 2003). An effective ranking, which reliably ranks the truly differentially expressed genes near the top, is even more important than the ability to decide which genes are significantly differentially expressed. It is more important then that the P-values for different genes are correctly ordered than that the P-values have the correct uniform null distribution. On this measure, the common correlation method has clear advantages over alternatives even when the underlying model is not exactly correct. It is more effective than averaging the replicate spots because it takes into account deviations between replicates when estimating the precision of the data for each gene. Compared with rank-based or permutation tests, the parametric method described here has the advantage of greater resolution, i.e. lack of granularity in the estimators and P-values, allowing genes to be more finely graduated for small or moderate sized datasets.
For the reasons explained above, the application of the methods described here is not limited to high quality data sets for which normality might be reasonable nor to very large data sets for which rigorous checking of the distributional assumptions might be feasible. In fact the benefits, relative to alternative methods such as averaging the replicate spots or simply ranking genes on fold change, may be most pronounced in experiments with very few arrays or with poor quality data. This expectation is borne out by the spike-in experimental data with n = 2.
As with any statistical modelling technique, it is assumed that appropriate quality assessment has been done of the data before application of the method proposed here. It has been described in Section how the method is capable of incorporating spot and array quality weights which might arise from such quality assessment.
The method used in this paper differs from previous work on empirical Bayes or shrinkage estimators in that a suitably chosen parameter is simply set equal across genes. The idea works here because the correlation parameter is of secondary interest from an inferential point of view and because it is relatively stable across genes. The technique is applicable to other situations involving mixed model analyses of microarray data such as those with technical as well as biological replication or the separate channel analyses described by Jin et al. (2001) and Wolfinger et al. (2001). These situations have within-block or within-spot correlations for which consensus estimators might be used across genes.
The methods described in this paper are implemented in the software package limma for the R computing environment (Smyth et al., 2004, http://bioinf.wehi.edu.au/limma). Limma is part of the Bioconductor project at http://www.bioconductor.org (Gentleman et al., 2004).
| Acknowledgments |
|---|
The authors thank Lisa Martin, Melanie O'Keefe and Cathy Jensen of the Australian Genome Research Facility for printing the microarrays used in the study described in the paper. Thanks are due to Terry Speed and Matthew Ritchie for valuable discussions. This research was supported by NHMRC Grants 257501 and 257529.
Received on May 9, 2004; revised on December 26, 2004; accepted on January 7, 2005
| REFERENCES |
|---|
Andrews, J., Bouffard, G.G., Cheadle, C., Lu, J., Becker, K.G., Oliver, B. (2000) Gene discovery using computational and microarray analysis of transcription in the Drosophila melanogaster testis. Genome Res., 10, 20302043
Baggerly, K.A., Coombes, K.R., Hess, K.R., Stivers, D.N., Abruzzo, L.V., Zhang, W. (2001) Identifying differentially expressed genes in cDNA microarray experiments. J. Comput. Biol., 8, 639659[CrossRef][ISI][Medline].
Baldi, P. and Long, A.D. (2001) A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics, 17, 509519
Beissbarth, T., Fellenberg, K., Brors, B., Arribas-Prat, R., Boer, J., Hauser, N.C., Scheideler, M., Hoheisel, J.D., Schutz, G., Poustka, A., Vingron, M. (2000) Processing and quality control of DNA array hybridization data. Bioinformatics, 16, 10141022
Berwanger, B., Hartmann, O., Bergmann, E., Bernard, S., Nielsen, D., Krause, M., Kartal, A., Flynn, D., Wiedemeyer, R., Schwab, M., Schafer, H., Christiansen, H., Eilers, M. (2002) Loss of a FYN-regulated differentiation and growth arrest pathway in advanced stage neuroblastoma. Cancer Cell, 2, 377386[CrossRef][ISI][Medline].
Boer, J.M., Huber, W.K., Sültmann, H., Wilmer, F., von Heydebreck, A., Haas, S., Korn, B., Gunawan, B., Vente, A., Fuzesi, L., Vingron, M., Poustka, A. (2001) Identification and classification of differentially expressed genes in renal cell carcinoma by expression profiling on a global human 31,500-element cDNA array. Genome Res., 11, 18611870
Broberg, P. (2003) Statistical methods for ranking differentially expressed genes. Genome Biol., 4, R41[CrossRef][Medline].
Buckley, M.J. Spot User's Guide, (2000) , Sydney, Australia CSIRO Mathematical and Information Sciences.
Causton, H.C., Quackenbush, J., Brazma, A. Microarray Gene Expression Data Analysis: A Beginner's Guide, (2003) , Malden, MA Blackwell Publishing.
Efron, B. and Tibshirani, R. (2002) Empirical Bayes methods and false discovery rates for microarrays. Genet. Epidemiol., 23, 7086[CrossRef][ISI][Medline].
Efron, B., Tibshirani, R., Storey, J.D., Tusher, V. (2001) Empirical Bayes analysis of a microarray experiment. J. Am. Stat. Assoc., 96, 11511160[CrossRef][ISI].
Fan, J., Tam, P., Woude, G.V., Ren, Y. (2004) Normalization and analysis of cDNA microarrays using within-array replications applied to neuroblastoma cell response to a cytokine. Proc. Natl Acad. Sci. USA, 101, 11351140
Firestein, G.S. and Pisetsky, D.S. (2002) DNA microarrays: boundless technology or bound by technology? Guidelines for studies using microarray technology. Arthritis Rheum., 46, 859861[CrossRef][ISI][Medline].
Gentleman, R.C., Carey, V.J., Bates, D.M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J., Hornik, K., Hothorn, T., Huber, W., Iacus, S., Irizarry, R., Leisch, F., Li, C., Maechler, M., Rossini, A.J., Sawitzki, G., Smith, C., Smyth, G., Tierney, L., Yang, J.Y., Zhang, J. (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol., 5, R80[CrossRef][Medline].
Hoffmann, K.F., Johnston, D.A., Dunne, D.W. (2002) Identification of Schistosoma mansoni gender-associated gene transcripts by cDNA microarray profiling. Genome Biol., 3, Research0041.
Jenssen, T.K., Langaas, M., Kuo, W.P., Smith-Sorensen, B., Myklebost, O., Hovig, E. (2002) Analysis of repeatability in spotted cDNA microarrays. Nucleic Acids Res., 30, 32353244
Jin, W., Riley, R.M., Wolfinger, R.D., White, K.P., Passador-Gurgel, G., Gibson, G. (2001) The contributions of sex, genotype and age to transcriptional variance in Drosophila melanogaster. Nat. Genet., 29, 389395[CrossRef][ISI][Medline].
Kaynak, B., von Heydebreck, A., Mebus, S., Seelow, D., Hennig, S., Vogel, J., Sperling, H.P., Pregla, R., Alexi-Meskishvili, V., Hetzer, R., Lange, P.E., Vingron, M., Lehrach, H., Sperling, S. (2003) Genome-wide array analysis of normal and malformed human hearts. Circulation, 107, 24672474
Kendziorski, C.M., Newton, M.A., Lan, H., Gould, M.N. (2003) On parametric empirical Bayes methods for comparing multiple groups using replicated gene expression profiles. Stat. Med., 22, 38993914[CrossRef][ISI][Medline].
König, R., Baldessari, D., Pollet, N., Niehrs, C., Eils, R. (2004) Reliability of gene expression ratios for cDNA microarrays in multiconditional experiments with a reference design. Nucleic Acids Res., 32, e29
Lönnstedt, I. and Speed, T. (2002) Replicated microarray data. Stat. Sinica, 12, 3146.
Lyne, R., Burns, G., Mata, J., Penkett, C.J., Rustici, G., Chen, D., Langford, C., Vetrie, D., Bahler, J. (2003) Whole-genome microarrays of fission yeast: characteristics, accuracy, reproducibility, and processing of array data. BMC Genomics, 4, 27[CrossRef][Medline].
Newton, M.A., Kendziorski, C.M., Richmond, C.S., Blattner, F.R., Tsui, K.W. (2001) On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. J. Comput. Biol., 8, 3752[CrossRef][ISI][Medline].
Newton, M.A., Noueiry, A., Sarkar, D., Ahlquist, P. (2004) Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics, 5, 155176[Abstract].
Nguyen, D.V., Arpat, A.B., Wang, N., Carroll, R.J. (2002) DNA microarray experiments: biological and technological aspects. Biometrics, 58, 701717[CrossRef][ISI][Medline].
(Eds.). The Analysis of Gene Expression Data. Statistics for Biology and Health, (2003) , NY Springer-Verlag.
Pinheiro, J.C. and Bates, D.M. Mixed-effects Models in S and S-PLUS. Statistics and computing, (2000) , NY Springer.
Searle, S.R., Casella, G., McCulloch, C.E. (1992) Wiley series in probability and mathematical statistics. Applied probability and statistics. Variance Components, , NY Wiley.
Smyth, G.K. (2004) Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol., 3, Article 3.
Smyth, G.K. and Speed, T. (2003) Normalization of cDNA microarray data. Methods, 31, 265273[CrossRef][ISI][Medline].
Smyth, G.K., Yang, Y.H., Speed, T. (2003) Statistical issues in cDNA microarray data analysis. Methods Mol. Biol., 224, 111136[Medline].
Smyth, G.K., Thorne, N., Wettenhall, J. Limma: Linear Models for Microarray, User's Guide, (2004) .
(Ed.). Statistical Analysis of Gene Expression Microarray Data, (2003) , Boca Raton, FL Chapman and Hall/CRC press.
Tilstone, C. (2003) DNA microarrays: vital statistics. Nature, 424, 610612[CrossRef][Medline].
Tran, P.H., Peiffer, D.A., Shin, Y., Meek, L.M., Brody, J.P., Cho, K.W. (2002) Microarray optimizations: increasing spot accuracy and automated identification of true microarray signals. Nucleic Acids Res., 30, e54
Tseng, G.C., Oh, M.K., Rohlin, L., Liao, J.C., Wong, W.H. (2001) Issues in cDNA microarray analysis: quality filtering, channel normalization, models of variations and assessment of gene effects. Nucleic Acids Res., 29, 25492557
Tusher, V.G., Tibshirani, R., Chu, G. (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA, 98, 51165121
Wang, X., Ghosh, S., Guo, S.W. (2001) Quantitative quality control in microarray image processing and data acquisition. Nucleic Acids Res., 29, E75-5.
Wolfinger


















































