Skip Navigation

Bioinformatics 2007 23(13):i313-i318; doi:10.1093/bioinformatics/btm182
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Google Scholar
Right arrow Articles by Mary-Huard, T.
Right arrow Articles by Bar-Hen, A.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Mary-Huard, T.
Right arrow Articles by Bar-Hen, A.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Biases induced by pooling samples in microarray experiments

Tristan Mary-Huard 1,*, Jean-Jacques Daudin 1, Michela Baccini 2,3, Annibale Biggeri 2,3 and Avner Bar-Hen 1

1UMR AgroParisTech/INRA MIA 518, 16 rue Claude Bernard 75231 Paris Cedex 5 (France) and 2Department of Statistics ‘G. Parenti’, viale Morgagni 59, 50100 and 3 CSPO Biostatistics Unit, via San Salvi, Florence, Italy

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 GENERAL FRAMEWORK
 3 BIAS QUANTIFICATION
 4 DIFFERENTIAL ANALYSIS
 5 DISCUSSION
 REFERENCES
 

Motivation: If there is insufficient RNA from the tissues under investigation from one organism, then it is common practice to pool RNA. An important question is to determine whether pooling introduces biases, which can lead to inaccurate results. In this article, we describe two biases related to pooling, from a theoretical as well as a practical point of view.

Results: We model and quantify the respective parts of the pooling bias due to the log transform as well as the bias due to biological averaging of the samples. We also evaluate the impact of the bias on the statistical differential analysis of Affymetrix data.

Contact: maryhuar{at}inapg.fr


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 GENERAL FRAMEWORK
 3 BIAS QUANTIFICATION
 4 DIFFERENTIAL ANALYSIS
 5 DISCUSSION
 REFERENCES
 
In microarray experiments, pooling refers to the study design in which material collected from several individuals is combined in a pooled sample before hybridization. Labelling and hybridization are then performed on the composite sample. There are several reasons for pooling. When extraction from a single individual does not provide enough material, pooling is an alternative to RNA amplification (Gold et al., 2004). Pooling is sometimes used to assemble a stable reference condition, to reduce the number of arrays for cost-saving purposes (Churchill, 2002), or to reduce the subject-to-subject variability and thus increase the power of statistical tests (Churchill, 2002; Churchill and Oliver, 2001; Han et al., 2004; Simon and Dobbin, 2003).

Pooling design has recently received thoughtful attention in both statistical and biological publications about gene expression experiments. Authors mainly focus on two important questions:

  • how to define equivalent designs i.e. what is the required number of subjects and arrays to achieve a given power in the statistical analysis (Shih et al., 2004; Wit and McClure, 2004)?
  • is the signal derived from a pool design equivalent to the average of expression signals from an individual-based design? This hypothesis, known as biological averaging assumption (BAA), has been studied on real data (Han et al., 2004; Kendziorski et al., 2003, 2005).

In this article, we focus on the study of the validity of BAA. There is no general agreement in the literature. For example, in Kendziorski et al. (2005), the authors conclude that ‘biological averaging occurs for most but not all genes’. However in Shih et al. (2004), the authors find that ‘this assumption may not hold especially when the signals are strong ... the pooling bias appears to be severer for the Affymetrix arrays’. However, no quantitative study of a possible pooling bias has yet been made.

There are two reasons why BAA may not hold (Kendziorski et al., 2005):

  • there may be an imperfect averaging of the individual RNA: Xp is different from (1/ns) Formula Xi, where ns is the number of samples, Xi is the number of labelled and hybridized RNA copies of a given gene from sample i for i = 1, ns and Xp is the corresponding quantity for the pooled sample. We call this bias the pool bias.
  • differences between the pool signal and the average of the individual signals could be due to the log transformation that occurs in the normalization process [for instance in the RMA procedure, Irizarry et al. (2003)]. Indeed, the log transformation is applied to individual samples in the absence of pooling, and to the pool sample otherwise, and, for any positive sequence X1,...Xn, Formula (Xi) ≤ log(Formula Xi). If there is no pool bias, Formula and, as a result, the same equality cannot be true on the log scale, which is used for further statistical analysis. We call this bias the log bias.

The overall difference, on the log scale, between the pool and the mean of the corresponding individuals is called the pooling bias.

The goal of this study is to provide a better insight on the pooling bias, to quantify the respective parts of the log bias and the pool bias and to evaluate their impact on the statistical differential analysis of Affymetrix data. In Section 2, we define a general model for the expression measurement at the probe level. This framework is used in Section 3 to derive some tools to detect the two biases. We exemplify the two biases on both simulated and real data, using the Kendziorski experiment (Kendziorski et al., 2005). Finally, Section 4 is devoted to the impact of the pooling bias on the differential analysis.


    2 GENERAL FRAMEWORK
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 GENERAL FRAMEWORK
 3 BIAS QUANTIFICATION
 4 DIFFERENTIAL ANALYSIS
 5 DISCUSSION
 REFERENCES
 
We consider here a two stage model. In the following the superscript B (respectively T) denotes the biological (respectively technical) variability.

2.1 Model on the expression of genes
The number of RNA copies of gene k from sample i is modelled by:


Formula 1

(1)
where {lambda}k is the population mean number of RNA copies of gene k, and Formula represents an independent random term with mean 0 and SD Formula , corresponding to the subject-to-subject variability. Formula is assumed to be finite, but may take different values for different genes. The number of labelled RNA copies of gene k of sample i, hybridized on the array i, (the numbering of the sample and of the array are identical) is modelled by:


Formula 2

(2)
where {alpha}i is called the efficiency factor, and depends on the number of cells included in the RNA preparation and the quality of the hybridization and labelling processes. In the following, this efficiency factor is assumed to depend only on the sample RNA preparation and the array i, and does not depend on the gene and the probe. For a pool of ns samples, the number of RNA copies of gene k contained in the pooled sample is


Formula 3

(3)
If BAA is true, then Formula = {lambda}k for all k.

2.2 Model on the measure of fluorescence at the probe level
For probe j associated with gene k, the expression measurement, on the log scale, is for the perfect match


Formula 4

(4)


Formula 5

(5)
where ajk is the specific effect of probe j for gene k, and Formula is an independent random term with mean 0 and SD {sigma}T, corresponding to the technical variability. The distribution of Formula is assumed to be the same for each probe, each gene and each slide, and the two sources of variability Formula and Formula are supposed independent.

At this step the model is quite general since few assumptions are made on Formula and Formula . Notice that if we note eik = log(Yik), the previous model can be rewritten


Formula

which is the model used in RMA normalization (Irizarry et al., 2003). For a pooled sample we have:


Formula 6

(6)


    3 BIAS QUANTIFICATION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 GENERAL FRAMEWORK
 3 BIAS QUANTIFICATION
 4 DIFFERENTIAL ANALYSIS
 5 DISCUSSION
 REFERENCES
 
3.1 Pool bias
We consider an experiment where RNA samples are extracted from ns subjects. The RNA are used to make both individual samples and a pooled sample of the ns subjects. Each sample is hybridized on a slide. For a given probe j and gene k, we obtain the measurements PMijk and PMpjk, respectively for each individual i, i = 1,..., ns and for the corresponding pool p. The mean expression for a given probe across the samples is:


Formula

Since Formula and Formula are independent, the mean of Formula with respect to the random variables Formula and Formula is:


Formula

where Formula , since Formula are identically distributed. For a pooled sample,


Formula

The expectation of PMpk is


Formula

From these two computations we conclude that


Formula 7

(7)
If Formula and if Formula , i.e. the efficiency factors are similar for the pool and the individual slides, then Formula

Simulated data are an effective way to illustrate the theoretical computations above. Simulations are performed according to Models(2) and (5) with {alpha}p = {alpha}i = 1. For a given individual i and a given gene k, we have chosen to set Formula . Figure 1 a plots the pooled PM versus the mean of individual PMs. The values of the parameters are Formula and Formula with normal distribution for the errors. The strong linearity along the line y = x between PMp and Formula illustrates the theoretical computations above. We observe similar results for different values of the parameters, provided that {sigma}T is not too high (figures not shown).


Figure 1
View larger version (11K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. Plot of the pool PM versus the mean of individual PM. Data are simulated according to Models (2) and (5) with gaussian errors. (b) PM of individual slide A2 versus PM of individual slide A3 (10 000 points presented in place of the total 175 477). (c) Plot of the pool of 12 PM versus the mean of individual PM (10 000 points presented).

 
We now turn to real data, using the Kendziorski dataset (Kendziorski et al., 2005). This experiment aims at comparing gene expression in mammary glands from female rats fed with two different diets (normal, denoted A and supplemented with the retinoic X receptor ligand LG100268, denoted B). RNA samples were obtained for 12 rats in each condition, and hybridized on Affymetrix RAE230A chips to measure gene expression for 15 923 genes. Individual RNA were also used to construct 6 pools of pairs, 4 pools of triples and 1 pool of 12 subjects, in each condition. Further details about this experiment can be found in Kendziorski et al. (2005).

If we compute the ratio between two individual slides 1 and 2, the same reasoning as above gives:


Formula

This ratio only depends on the efficiency factor of the two slides, and should be roughly equal to 1. Figure 1b shows that the PM values of individual A2 versus A3 are distributed along the line y = x. The whole set of 11 x 12 = 132 individual ratios varies between0.75 and 1.38, the mean is 1.02 and the SE 0.13. This confirms that expression (7) gives a good picture of the biological process.

We observe a different picture for pool data, since the ratio Formula is less than one. In Figure 1c, we consider the individual and pool arrays with ns = 12. There appears to be a strong linear relationship, but the slope of the line is 0.75 rather than 1. This low ratio is not due to array effects, for similar computations on the 21 remaining pools of pairs, triples and 12 individuals give ratio values less than one. Except for one pool of pairs, the ratios are lower than 1 and lie between 0.5 and 0.96. The mean ratios are 0.77 for pools of 2, 0.715 for pools of 3 and 0.725 for pools of 12. These results are in keeping with the lower level of the mean expression for the arrays corresponding to pooled samples in comparison with arrays for individual samples (Fig. 2a).


Figure 2
View larger version (12K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. (a) Boxplots for the mean PM of the arrays for pooled samples and individual samples (conditions A and B, before normalization). (b) plot of Figure 2 where the upper index refers to the specific pool of 3 considered (10 000 points presented in place of the total 175 477). (c) Plot of Figure 2 where the upper index refers to the specific pool of 3 considered (10 000 points presented).

 
One may wonder whether the ratio is probe (or gene) specific or not. Figure 2b and c represents the plot between ratios Formula and Formula for two different pool samples (pool of 3). Hence ratios vary greatly from pool to pool for a given probe. The Spearman correlations between these ratios from pool to pool are equal to 0.29 for the pools of 12 and for pools of 3 the Spearman correlations are between –0.34 and 0.43 with a mean equal to 0.14. For pools of 2, the mean correlation is equal to 0.08 and lie between –0.58 and 0.60. This argues for a non-specific ratio for a majority of probes.

We computed 2 scores per probe, Shigh and Slow, that count how often a probe ratio Formula belongs to the 5% highest (respectively lowest) ratios for a given pool p. Slow and Shigh vary between 0 and 22, since we considered the 22 pools all together. To get the same information at the gene level, we computed two scores Formula and Formula per gene, by summing the scores Shigh and Slow of their associated probes. Each gene is represented by 11 probes (except for 36 that are represented by 20), meaning that Formula and Formula vary from 0 to 242. Under the assumption that no gene is specifically affected by pooling, each score has a binomial distribution Formula (n0,p0) with n0 = 22 x 11 and p0 = 0.05. A gene will have a significant specific pool bias if Shigh or Slow is greater than the 0.05/15 923 quantile of the binomial distribution, which is 30. In the Kendziorski data, we found more than 700 genes with either a Formula or a Formula value higher than 30, representing 4.6% of the total number of genes.

There are two reasons why the observed ratio Formula is not 1. Since


Formula

either Formula . Of course it could be a combination of both reasons. From Figure 2 and the Spearman test we conclude that for a majority of probes the ratio Formula is similar, meaning that pooling induces a overall lowering effect that is not gene specific. This overall lowering corresponds to Formula : there is a difference in efficiency between individual and pool slides. However, we showed that 4.6% of the genes were affected by an additional specific effect that comes from Formula . While there is good hope that the normalization process eliminates the overall lowering effect of pooling (since most normalizations correct for a slide effect), the gene specific effect will not be removed and may affect the differential analysis.

3.2 Log bias
In this section, we study the bias that exists between the log-transformed pool signal and the mean of the log transformed individual signals. We define the log bias for a given probe j as:


Formula

According to (5), and dropping the gene index k for the sake of simplicity, we have for a given gene:


Formula

We suppose that the BAA hypothesis holds, and again for the sake of simplicity that {alpha}· = {alpha}p. We obtain:


Formula

Assuming that the coefficient of variation of Xi is low, i.e. Formula , and using Formula we get:


Formula

Therefore


Formula 8

(8)
where ns is the number of individuals combined in the pool sample. Thus, the expression is higher for the pool sample than for the mean of the corresponding individual samples on the log scale, and the mean difference, for a given gene, is proportional to the square of the coefficient of variation of Formula . Kendziorski found that for 25% of the genes with the largest SD, more than 80% have larger values in the pools of two (Fig. 4A in Kendziorski's paper). This artefact is well explained by the distortion of the log transformation that is described in this section.

Finally pooling results first in a lowering (for the Kendziorski experiment) of the raw microarray response that varies from microarray to microarray and secondly in an increase in the signal on the log scale after normalization, which depends on the biological variability of each gene. However, the sum of these two opposite effects is not equal to zero and varies on a relatively large scale from gene to gene and sample to sample. Therefore there is some theoretical and material evidence on the Kendziorski dataset that pooling affects the absolute measurement of gene expression.


    4 DIFFERENTIAL ANALYSIS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 GENERAL FRAMEWORK
 3 BIAS QUANTIFICATION
 4 DIFFERENTIAL ANALYSIS
 5 DISCUSSION
 REFERENCES
 
An important point is to assess the consequences of pooling on the differential expression (DE) inference. First, if the arrays are made with pools composed of the same number of samples, the efficiency factor {alpha}p is more or less the same for all arrays. Moreover, array to array normalization corrects, at least partially, the mean effect of the pooling on raw PM values, so that the efficiency factor should not distort the DE inference. However, the log bias artefact is not corrected by normalization precisely because it is produced by normalization. For example, if the mean and the variability of the expression of a particular gene are increased in condition A in reference with condition B, its DE will be higher in a pooled experiment than in an individual one. Therefore individual experiments and experiments with pooling may lead to different conclusions for this given gene. One may think of other combinations of DE and variability that may lead to conflicting results.

To assess the impact of the pooling bias on DE, we performed a differential analysis on individual and pool arrays. For the individual study, RMA normalization was performed on the total batch of individual arrays, and genes were then ranked according to their associated T statistic, giving a unique reference list of genes. For the pool study, we performed three different normalization procedures for each pool batch (pool of 2, 3 and 12):

  • Norm1 consists in a simple RMA normalization.
  • Norm2 is a two-step normalization procedure. Data are first corrected for the pool bias: ratios Formula are estimated for each probe using the corresponding individual samples, and according to expression (7). Then classical RMA normalization is applied on the pool bias corrected data.
  • Norm3 is a two-step normalization procedure, where data are first corrected for both the pool bias and the log-bias according to expressions (7) and (8), and then normalized with RMA.
For each of the three normalized datasets, we ranked the gene according to their T statistic to obtain three DE lists.

To compare our results with those of Kendziorski, we plotted the number of DE calls in common between the pool of 2 and reference for lists of fixed size (Fig. 3). We see that the correction of the pool bias increases the agreement between the reference and the pool of 12 lists. The additional correction of the log bias improves the agreement but the gain is very slight. Results obtained with pools of 3 and pools of 12 are similar (not shown here).


Figure 3
View larger version (19K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. Plot of the proportion of common DE genes between the individual analysis and the Norm1 normalized pool analysis (black curve), the Norm2 normalized pool analysis (red curve) and the Norm3 normalized corrected pool analysis (green curve), versus the size of the DE list. The pool analyses are performed on the total batch of pools of 2.

 

    5 DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 GENERAL FRAMEWORK
 3 BIAS QUANTIFICATION
 4 DIFFERENTIAL ANALYSIS
 5 DISCUSSION
 REFERENCES
 
In many articles, the BAA refers both to the fact that ‘RNA abundance levels average out when pooled’, and that ‘average on the scale or raw RNA abundance will not correspond to the processed RNA measurement’ (Kendziorski et al., 2005). Here, we proposed to break down the overall pooling bias into two parts, the pool bias and the log bias, that are properly defined by expressions(7) and (8). This distinction allowed us to describe the pool bias as a combination of an overall effect which depends on the efficiency factor of each slide, and a gene specific effect which can be related to the RNA abundance with and without pooling. We were also able to quantify each part of the pooling bias. The main conclusions of this study are the following:
  • pooling seems to lower the efficiency of the labelling or hybridization steps. This artefact, which has been found in Kendziorski's experiment, has to be confirmed by other experimental results. Shih et al. (2004) suggested that ‘a possible reason for this artefact is that mixing of the RNA may cause some alteration of individual RNA contributions.’ Such a bias can be easily detected in experiments by producing a few individual slides to compare their average signal level to that of pool slides. The impact of this bias on the differential analysis should be negligible, since it is mainly corrected by the array to array normalization step.
  • some genes (up to 4.6% of the total number of genes in the Kendziorski experiment) are specifically affected by pooling. Specific gene biases are much more difficult to quantify and correct and would require both pool and individual slides for the same individuals, which cannot routinely be done in practice. Such biases could lead to different results between pool and individual slides analyses.
  • the bias induced by the log transformation, included in most normalization methods, is experimentally and theoretically well assessed. While this bias is systematic, we showed that its consequences are of limited importance in the Kendziorski experiment. Yet it may potentially have an influence in other experiments, particulary for genes with high biological variability.

In Kendziorski et al. (2005), the authors observed that ‘for the majority of genes where there was a large [pooling bias], the difference was similar across biological conditions’. Considering the two parts of the bias, it is difficult to conclude whether this may hold for further experiments. On one hand, for the bias to be 0 the specific gene bias has to be similar in both conditions. On the other hand, we showed that for a given gene, the log bias depends on the coefficient of variation cv. Compared with the log-ratio computed for individuals, the log-ratio for pools is biased by


Formula

For a non-DE gene, {lambda}1 = {lambda}2 and the bias is 0 if Formula . But for a DE gene {lambda}1 > {lambda}2, and the bias is 0 only if Formula increases in proportion to Formula .

While the theoretical formulas we derived here are universal, the computations were all made on a unique set of microarrays. Future work will consist in applying the presented analytical tools to additional data to check to what extent the conclusions we have drawn on the Kendziorsky experiment can be generalized.

Conflict of Interest: none declared.


    FOOTNOTES
 
1 http://www.geneontology.org/ Back

2 The GO ontology performs multiple hypothesis testing to adjust the cutoffvalue. Back

3 We used the recommended cut-off of 0.05 for all our validations. Back

4 The trends for the other two clustering algorithms are similar and are omitted. Back

5 Of the 1124, 2121 clusters produced by the CE-agglo algorithm contained singletons, whereas for the CE-balls algorithm, 1939, of the 2783 clusters contained singletons. Back

6 The remaining clusters have comparable p-values. Back

7 To make a fair comparison, we used the same dataset as the earlier work. It was downloaded on 11.06-2005 and contained 15147 interactions among 4741 proteins. We provide the soft clusters obtained as Supplementary material. Back

8 The plots for the molecular function and cellular component ontologies follow similar trends and have been omitted due to lack of space. Back


    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 GENERAL FRAMEWORK
 3 BIAS QUANTIFICATION
 4 DIFFERENTIAL ANALYSIS
 5 DISCUSSION
 REFERENCES
 

    Churchill GA. Fundamentals of experimental designs for cdna microarrays. Nat. Genet, ( (2002) ) 32, : 490–495.[CrossRef][ISI][Medline].

    Churchill GA, Oliver B. Sex, flies and microarrays. Nat. Genet, ( (2001) ) 29, : 322–356..

    Gold D, et al. A comparative analysis of data generated using two different target preparation methods for hybridization to high-density oligonucleotide microarrays. BMC Genomics, ( (2004) ) 5, ..

    Han ES, et al. Reproducibility, sources of variability, pooling, and sample size: important considerations for the design of high density oligonucleotide array experiments. J. Gerontol. Biol. Sci, ( (2004) ) 59, : 306–315.[ISI][Medline].

    Irizarry RA, et al. Summaries of affymetrix genechip probe level data. Nucleic Acids Res, ( (2003) ) 31, ..

    Kendziorski C, et al. On the utility of pooling biological samples in microarray experiments. Proc. Natl Acad. Sci. USA, ( (2005) ) 102, : 4252–4257.[Abstract/Free Full Text].

    Kendziorski CM, et al. The efficiency of pooling mrna in microarray experiments. Biostatistics, ( (2003) ) 4, : 465–477.[Abstract].

    Shih JH, et al. Effects of pooling mrna in microarray class comparisons. Bioinformatics, ( (2004) ) 20, : 3318–3325.[Abstract/Free Full Text].

    Simon RM, Dobbin K. Experimental design of DNA microarray experiment. BioTechniques, ( (2003) ) 34, : S16–S21..

    Wit E, McClure J. Statistics for Microarrays, ( (2004) ) Hoboken, NJ, USA: John Wiley & Sons..


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Google Scholar
Right arrow Articles by Mary-Huard, T.
Right arrow Articles by Bar-Hen, A.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Mary-Huard, T.
Right arrow Articles by Bar-Hen, A.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?