Biases induced by pooling samples in microarray experiments
1UMR AgroParisTech/INRA MIA 518, 16 rue Claude Bernard 75231 Paris Cedex 5 (France) and 2Department of Statistics G. Parenti, viale Morgagni 59, 50100 and 3 CSPO Biostatistics Unit, via San Salvi, Florence, Italy
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: If there is insufficient RNA from the tissues under investigation from one organism, then it is common practice to pool RNA. An important question is to determine whether pooling introduces biases, which can lead to inaccurate results. In this article, we describe two biases related to pooling, from a theoretical as well as a practical point of view.
Results: We model and quantify the respective parts of the pooling bias due to the log transform as well as the bias due to biological averaging of the samples. We also evaluate the impact of the bias on the statistical differential analysis of Affymetrix data.
Contact: maryhuar{at}inapg.fr
| 1 INTRODUCTION |
|---|
|
|
|---|
In microarray experiments, pooling refers to the study design in which material collected from several individuals is combined in a pooled sample before hybridization. Labelling and hybridization are then performed on the composite sample. There are several reasons for pooling. When extraction from a single individual does not provide enough material, pooling is an alternative to RNA amplification (Gold et al., 2004). Pooling is sometimes used to assemble a stable reference condition, to reduce the number of arrays for cost-saving purposes (Churchill, 2002), or to reduce the subject-to-subject variability and thus increase the power of statistical tests (Churchill, 2002; Churchill and Oliver, 2001; Han et al., 2004; Simon and Dobbin, 2003).
Pooling design has recently received thoughtful attention in both statistical and biological publications about gene expression experiments. Authors mainly focus on two important questions:
- how to define equivalent designs i.e. what is the required number of subjects and arrays to achieve a given power in the statistical analysis (Shih et al., 2004; Wit and McClure, 2004)?
- is the signal derived from a pool design equivalent to the average of expression signals from an individual-based design? This hypothesis, known as biological averaging assumption (BAA), has been studied on real data (Han et al., 2004; Kendziorski et al., 2003, 2005).
In this article, we focus on the study of the validity of BAA. There is no general agreement in the literature. For example, in Kendziorski et al. (2005), the authors conclude that biological averaging occurs for most but not all genes. However in Shih et al. (2004), the authors find that this assumption may not hold especially when the signals are strong ... the pooling bias appears to be severer for the Affymetrix arrays. However, no quantitative study of a possible pooling bias has yet been made.
There are two reasons why BAA may not hold (Kendziorski et al., 2005):
- there may be an imperfect averaging of the individual RNA: Xp is different from (1/ns)
Xi, where ns is the number of samples, Xi is the number of labelled and hybridized RNA copies of a given gene from sample i for i = 1, ns and Xp is the corresponding quantity for the pooled sample. We call this bias the pool bias.
- differences between the pool signal and the average of the individual signals could be due to the log transformation that occurs in the normalization process [for instance in the RMA procedure, Irizarry et al. (2003)]. Indeed, the log transformation is applied to individual samples in the absence of pooling, and to the pool sample otherwise, and, for any positive sequence X1,...Xn,
(Xi)
log(
Xi). If there is no pool bias,
and, as a result, the same equality cannot be true on the log scale, which is used for further statistical analysis. We call this bias the log bias.
The overall difference, on the log scale, between the pool and the mean of the corresponding individuals is called the pooling bias.
The goal of this study is to provide a better insight on the pooling bias, to quantify the respective parts of the log bias and the pool bias and to evaluate their impact on the statistical differential analysis of Affymetrix data. In Section 2, we define a general model for the expression measurement at the probe level. This framework is used in Section 3 to derive some tools to detect the two biases. We exemplify the two biases on both simulated and real data, using the Kendziorski experiment (Kendziorski et al., 2005). Finally, Section 4 is devoted to the impact of the pooling bias on the differential analysis.
| 2 GENERAL FRAMEWORK |
|---|
|
|
|---|
We consider here a two stage model. In the following the superscript B (respectively T) denotes the biological (respectively technical) variability.
2.1 Model on the expression of genes
The number of RNA copies of gene k from sample i is modelled by:
|
| (1) |
k is the population mean number of RNA copies of gene k, and |
| (2) |
i is called the efficiency factor, and depends on the number of cells included in the RNA preparation and the quality of the hybridization and labelling processes. In the following, this efficiency factor is assumed to depend only on the sample RNA preparation and the array i, and does not depend on the gene and the probe. For a pool of ns samples, the number of RNA copies of gene k contained in the pooled sample is
|
| (3) |
k for all k.
2.2 Model on the measure of fluorescence at the probe level
For probe j associated with gene k, the expression measurement, on the log scale, is for the perfect match
|
| (4) |
|
| (5) |
T, corresponding to the technical variability. The distribution of
At this step the model is quite general since few assumptions are made on
and
. Notice that if we note eik = log(Yik), the previous model can be rewritten
|
|
|
| (6) |
| 3 BIAS QUANTIFICATION |
|---|
|
|
|---|
3.1 Pool bias
We consider an experiment where RNA samples are extracted from ns subjects. The RNA are used to make both individual samples and a pooled sample of the ns subjects. Each sample is hybridized on a slide. For a given probe j and gene k, we obtain the measurements PMijk and PMpjk, respectively for each individual i, i = 1,..., ns and for the corresponding pool p. The mean expression for a given probe across the samples is:
|
|
|
|
|
|
|
|
|
| (7) |
Simulated data are an effective way to illustrate the theoretical computations above. Simulations are performed according to Models(2) and (5) with
p =
i = 1. For a given individual i and a given gene k, we have chosen to set
. Figure 1 a plots the pooled PM versus the mean of individual PMs. The values of the parameters are
and
with normal distribution for the errors. The strong linearity along the line y = x between PMp and
illustrates the theoretical computations above. We observe similar results for different values of the parameters, provided that
T is not too high (figures not shown).
|
We now turn to real data, using the Kendziorski dataset (Kendziorski et al., 2005). This experiment aims at comparing gene expression in mammary glands from female rats fed with two different diets (normal, denoted A and supplemented with the retinoic X receptor ligand LG100268, denoted B). RNA samples were obtained for 12 rats in each condition, and hybridized on Affymetrix RAE230A chips to measure gene expression for 15 923 genes. Individual RNA were also used to construct 6 pools of pairs, 4 pools of triples and 1 pool of 12 subjects, in each condition. Further details about this experiment can be found in Kendziorski et al. (2005).
If we compute the ratio between two individual slides 1 and 2, the same reasoning as above gives:
|
|
We observe a different picture for pool data, since the ratio
is less than one. In Figure 1c, we consider the individual and pool arrays with ns = 12. There appears to be a strong linear relationship, but the slope of the line is 0.75 rather than 1. This low ratio is not due to array effects, for similar computations on the 21 remaining pools of pairs, triples and 12 individuals give ratio values less than one. Except for one pool of pairs, the ratios are lower than 1 and lie between 0.5 and 0.96. The mean ratios are 0.77 for pools of 2, 0.715 for pools of 3 and 0.725 for pools of 12. These results are in keeping with the lower level of the mean expression for the arrays corresponding to pooled samples in comparison with arrays for individual samples (Fig. 2a).
|
One may wonder whether the ratio is probe (or gene) specific or not. Figure 2b and c represents the plot between ratios
We computed 2 scores per probe, Shigh and Slow, that count how often a probe ratio
belongs to the 5% highest (respectively lowest) ratios for a given pool p. Slow and Shigh vary between 0 and 22, since we considered the 22 pools all together. To get the same information at the gene level, we computed two scores
and
per gene, by summing the scores Shigh and Slow of their associated probes. Each gene is represented by 11 probes (except for 36 that are represented by 20), meaning that
and
vary from 0 to 242. Under the assumption that no gene is specifically affected by pooling, each score has a binomial distribution
(n0,p0) with n0 = 22 x 11 and p0 = 0.05. A gene will have a significant specific pool bias if Shigh or Slow is greater than the 0.05/15 923 quantile of the binomial distribution, which is 30. In the Kendziorski data, we found more than 700 genes with either a
or a
value higher than 30, representing 4.6% of the total number of genes.
There are two reasons why the observed ratio
is not 1. Since
|
|
3.2 Log bias
In this section, we study the bias that exists between the log-transformed pool signal and the mean of the log transformed individual signals. We define the log bias for a given probe j as:
|
|
|
|
· =
p. We obtain: |
|
Assuming that the coefficient of variation of Xi is low, i.e.
, and using
we get:
|
|
|
| (8) |
Finally pooling results first in a lowering (for the Kendziorski experiment) of the raw microarray response that varies from microarray to microarray and secondly in an increase in the signal on the log scale after normalization, which depends on the biological variability of each gene. However, the sum of these two opposite effects is not equal to zero and varies on a relatively large scale from gene to gene and sample to sample. Therefore there is some theoretical and material evidence on the Kendziorski dataset that pooling affects the absolute measurement of gene expression.
| 4 DIFFERENTIAL ANALYSIS |
|---|
|
|
|---|
An important point is to assess the consequences of pooling on the differential expression (DE) inference. First, if the arrays are made with pools composed of the same number of samples, the efficiency factor
p is more or less the same for all arrays. Moreover, array to array normalization corrects, at least partially, the mean effect of the pooling on raw PM values, so that the efficiency factor should not distort the DE inference. However, the log bias artefact is not corrected by normalization precisely because it is produced by normalization. For example, if the mean and the variability of the expression of a particular gene are increased in condition A in reference with condition B, its DE will be higher in a pooled experiment than in an individual one. Therefore individual experiments and experiments with pooling may lead to different conclusions for this given gene. One may think of other combinations of DE and variability that may lead to conflicting results. To assess the impact of the pooling bias on DE, we performed a differential analysis on individual and pool arrays. For the individual study, RMA normalization was performed on the total batch of individual arrays, and genes were then ranked according to their associated T statistic, giving a unique reference list of genes. For the pool study, we performed three different normalization procedures for each pool batch (pool of 2, 3 and 12):
- Norm1 consists in a simple RMA normalization.
- Norm2 is a two-step normalization procedure. Data are first corrected for the pool bias: ratios
are estimated for each probe using the corresponding individual samples, and according to expression (7). Then classical RMA normalization is applied on the pool bias corrected data.
- Norm3 is a two-step normalization procedure, where data are first corrected for both the pool bias and the log-bias according to expressions (7) and (8), and then normalized with RMA.
To compare our results with those of Kendziorski, we plotted the number of DE calls in common between the pool of 2 and reference for lists of fixed size (Fig. 3). We see that the correction of the pool bias increases the agreement between the reference and the pool of 12 lists. The additional correction of the log bias improves the agreement but the gain is very slight. Results obtained with pools of 3 and pools of 12 are similar (not shown here).
|
| 5 DISCUSSION |
|---|
|
|
|---|
In many articles, the BAA refers both to the fact that RNA abundance levels average out when pooled, and that average on the scale or raw RNA abundance will not correspond to the processed RNA measurement (Kendziorski et al., 2005). Here, we proposed to break down the overall pooling bias into two parts, the pool bias and the log bias, that are properly defined by expressions(7) and (8). This distinction allowed us to describe the pool bias as a combination of an overall effect which depends on the efficiency factor of each slide, and a gene specific effect which can be related to the RNA abundance with and without pooling. We were also able to quantify each part of the pooling bias. The main conclusions of this study are the following:
- pooling seems to lower the efficiency of the labelling or hybridization steps. This artefact, which has been found in Kendziorski's experiment, has to be confirmed by other experimental results. Shih et al. (2004) suggested that a possible reason for this artefact is that mixing of the RNA may cause some alteration of individual RNA contributions. Such a bias can be easily detected in experiments by producing a few individual slides to compare their average signal level to that of pool slides. The impact of this bias on the differential analysis should be negligible, since it is mainly corrected by the array to array normalization step.
- some genes (up to 4.6% of the total number of genes in the Kendziorski experiment) are specifically affected by pooling. Specific gene biases are much more difficult to quantify and correct and would require both pool and individual slides for the same individuals, which cannot routinely be done in practice. Such biases could lead to different results between pool and individual slides analyses.
- the bias induced by the log transformation, included in most normalization methods, is experimentally and theoretically well assessed. While this bias is systematic, we showed that its consequences are of limited importance in the Kendziorski experiment. Yet it may potentially have an influence in other experiments, particulary for genes with high biological variability.
In Kendziorski et al. (2005), the authors observed that for the majority of genes where there was a large [pooling bias], the difference was similar across biological conditions. Considering the two parts of the bias, it is difficult to conclude whether this may hold for further experiments. On one hand, for the bias to be 0 the specific gene bias has to be similar in both conditions. On the other hand, we showed that for a given gene, the log bias depends on the coefficient of variation cv. Compared with the log-ratio computed for individuals, the log-ratio for pools is biased by
|
|
1 =
2 and the bias is 0 if
1 >
2, and the bias is 0 only if While the theoretical formulas we derived here are universal, the computations were all made on a unique set of microarrays. Future work will consist in applying the presented analytical tools to additional data to check to what extent the conclusions we have drawn on the Kendziorsky experiment can be generalized.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
1 http://www.geneontology.org/
2 The GO ontology performs multiple hypothesis testing to adjust the cutoffvalue. ![]()
3 We used the recommended cut-off of 0.05 for all our validations. ![]()
4 The trends for the other two clustering algorithms are similar and are omitted. ![]()
5 Of the 1124, 2121 clusters produced by the CE-agglo algorithm contained singletons, whereas for the CE-balls algorithm, 1939, of the 2783 clusters contained singletons. ![]()
6 The remaining clusters have comparable p-values. ![]()
7 To make a fair comparison, we used the same dataset as the earlier work. It was downloaded on 11.06-2005 and contained 15147 interactions among 4741 proteins. We provide the soft clusters obtained as Supplementary material. ![]()
8 The plots for the molecular function and cellular component ontologies follow similar trends and have been omitted due to lack of space. ![]()
| REFERENCES |
|---|
|
|
|---|
Churchill GA. Fundamentals of experimental designs for cdna microarrays. Nat. Genet, ( (2002) ) 32, : 490–495.[CrossRef][ISI][Medline].
Churchill GA, Oliver B. Sex, flies and microarrays. Nat. Genet, ( (2001) ) 29, : 322–356..
Gold D, et al. A comparative analysis of data generated using two different target preparation methods for hybridization to high-density oligonucleotide microarrays. BMC Genomics, ( (2004) ) 5, ..
Han ES, et al. Reproducibility, sources of variability, pooling, and sample size: important considerations for the design of high density oligonucleotide array experiments. J. Gerontol. Biol. Sci, ( (2004) ) 59, : 306–315.[ISI][Medline].
Irizarry RA, et al. Summaries of affymetrix genechip probe level data. Nucleic Acids Res, ( (2003) ) 31, ..
Kendziorski C, et al. On the utility of pooling biological samples in microarray experiments. Proc. Natl Acad. Sci. USA, ( (2005) ) 102, : 4252–4257.
Kendziorski CM, et al. The efficiency of pooling mrna in microarray experiments. Biostatistics, ( (2003) ) 4, : 465–477.[Abstract].
Shih JH, et al. Effects of pooling mrna in microarray class comparisons. Bioinformatics, ( (2004) ) 20, : 3318–3325.
Simon RM, Dobbin K. Experimental design of DNA microarray experiment. BioTechniques, ( (2003) ) 34, : S16–S21..
Wit E, McClure J. Statistics for Microarrays, ( (2004) ) Hoboken, NJ, USA: John Wiley & Sons..
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||







