Skip Navigation


Bioinformatics Advance Access originally published online on July 14, 2005
Bioinformatics 2005 21(18):3637-3644; doi:10.1093/bioinformatics/bti583
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/18/3637    most recent
bti583v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (12)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Liu, X.
Right arrow Articles by Rattray, M.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Liu, X.
Right arrow Articles by Rattray, M.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

A tractable probabilistic model for Affymetrix probe-level analysis across multiple chips

Xuejun Liu 1, Marta Milo 2, Neil D. Lawrence 2 and Magnus Rattray 1,*

1School of Computer Science, University of Manchester Oxford Road, Manchester M13 9PL, UK
2Department of Computer Science, University of Sheffield Regent Court 211 Portobello Street, Sheffield S1 4DP, UK

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 1 INTRODUCTION
 2 METHODS
 3 RESULTS AND DISCUSSION
 4 CONCLUSION
 REFERENCES
 

Motivation: Affymetrix GeneChip® arrays are currently the most widely used microarray technology. Many summarization methods have been developed to provide gene expression levels from Affymetrix probe-level data. Most of the currently popular methods do not provide a measure of uncertainty for the expression level of each gene. The use of probabilistic models can overcome this limitation. A full hierarchical Bayesian approach requires the use of computationally intensive MCMC methods that are impractical for large datasets. An alternative computationally efficient probabilistic model, mgMOS, uses Gamma distributions to model specific and non-specific binding with a latent variable to capture variations in probe affinity. Although promising, the main limitations of this model are that it does not use information from multiple chips and does not account for specific binding to the mismatch (MM) probes.

Results: We extend mgMOS to model the binding affinity of probe-pairs across multiple chips and to capture the effect of specific binding to MM probes. The new model, multi-mgMOS, provides improved accuracy, as demonstrated on some bench-mark datasets and a real time-course dataset, and is much more computationally efficient than a competing hierarchical Bayesian approach that requires MCMC sampling. We demonstrate how the probabilistic model can be used to estimate credibility intervals for expression levels and their log-ratios between conditions.

Availability: Both mgMOS and the new model multi-mgMOS have been implemented in an R package, which is available at http://www.bioinf.man.ac.uk/resources/puma

Contact: magnus{at}cs.man.ac.uk


    1 INTRODUCTION
 TOP
 Abstract
 1 INTRODUCTION
 2 METHODS
 3 RESULTS AND DISCUSSION
 4 CONCLUSION
 REFERENCES
 
Microarrays provide a practical method for measuring the expression levels of tens of thousands of genes simultaneously (Schena et al., 1995; Lockhart et al., 1996). To provide meaningful information from microarray experiment data, various levels of analysis are conducted on this data. An important initial stage is the probe-level analysis which provides a summary of the expression value for each gene. Further downstream analysis is then required depending on the aims of microarray experiments, such as to detect differentially expressed genes and to find co-regulated genes. Micorarrays are associated with high levels of experimental uncertainty and probe-level analysis is required to determine an accurate summary of the expression level for a particular gene in the face of this uncertainty. It is also very useful to associate this measurement with a level of confidence that can then be propagated and incorporated into further analyses using probabilistic models or Bayesian methods.

Affymetrix GeneChip® arrays (Lockhart et al., 1996) are currently the most widely used microarrays. Short specific oligonucleotide probes are tethered and immobilized on the surface of Affymetrix arrays. Target cRNA is fluorescently labelled and hybridized to the array. A 2D image is then generated, with each probe being identified by a position and an intensity. Each probe is 25 bases long, and each gene is represented by 11–20 probe-pairs referred to as a probe-set. Each probe-pair comprises a perfect match (PM) probe and a mismatch (MM) probe. The PM probe and MM probe have almost the same base sequences except that the middle base of the MM probe is changed to the complimentary one of that in the PM probe. The aim of the one base mismatch in the MM probe is to measure non-specific hybridization, which occurs when the target binds to a probe that does not have the perfectly complimentary sequence. Therefore, by design, the MM probe acts as a background measurement for its corresponding PM probe. Probe-level analysis is the process of estimating the expression level of each gene from intensities of PM and MM probes in the corresponding probe-set.

With multiple probes in Affymetrix microarrays, it is possible to adopt various statistical and probabilistic methods to provide confident gene expression results. MAS 5.0 (Affymetrix, 2001), MBEI (Li and Wong, 2001), RMA (Irizarry et al., 2003), GCRMA (Wu et al., 2004), gMOS (Milo et al., 2003), mgMOS (Milo et al., 2004) and BGX (Hein et al., 2005) have been proposed to measure gene expression levels. MAS 5.0, MBEI, RMA and GCRMA are statistical methods and they are able to calculate the gene expression levels accurately. However, these methods are incapable of providing the credibility of the expression values that may be very useful for further statistical analysis. gMOS, mgMOS and BGX are probabilistic models and overcome this issue but they have a number of limitations. gMOS and mgMOS do not currently make use of information shared across multiple chips and this reduces their ability to accurately model probe specific effects. They also make an assumption that there is no specific binding to the MM probe and this assumption has been shown to be inaccurate (Irizarry et al., 2003;Hein et al., 2005). BGX is a hierarchical Bayesian model and requires a Markov chain Monte Carlo (MCMC) method for parameter estimation, which is computationally demanding for large datasets. BGX also fails to account for the probe specific effects in the estimation of non-specific hybridization and this makes it difficult to obtain reasonable expression values for weakly expressed genes.

In this paper we propose a multi-chip version of mgMOS, multi-mgMOS, for the analysis of gene expression data with the purpose of providing the uncertainty of the measured expression value for each gene. An important feature of our new model is that it accounts for the similar binding affinity of the same probe across multiple chips. The fact that the same, or similar, probes share similar binding properties has also been used in GCRMA which explicitly takes into account the probe sequence base composition. In contrast, our own approach is to include a probe-specific parameter that is shared across chips and will therefore automatically capture the shared composition effects but will also capture other probe-specific effects. Importantly, the binding affinity parameters can be analytically integrated out of the likelihood and therefore the number of model parameters does not scale with the size of the probe-set. The likelihood can be written in closed form and this allows fast gradient-based optimization to be used in order to obtain the model parameters so that the computation is much faster than BGX.


    2 METHODS
 TOP
 Abstract
 1 INTRODUCTION
 2 METHODS
 3 RESULTS AND DISCUSSION
 4 CONCLUSION
 REFERENCES
 
One of the most important characteristics of microarray data is the high probe sequence specificity. PM and MM probe intensities, and also differences between them, vary in probe-specific ways (Li and Wong, 2001; Irizarry et al., 2003; Wu et al., 2004; Hein et al., 2005). Another phenomenon observed from experimental data is that as the amount of target mRNA increases both MM and the PM intensities also increase (Naef and Magnasco, 2003; Irizarry et al., 2003; Wu et al., 2004; Hein et al., 2005). So there is a considerable amount of true signal binding to MM probes which should not be treated as background or non-specific hybridization. To correctly account for non-specific hybridization, it is necessary to consider both PM and MM probes. We extend the probabilistic model mgMOS (Milo et al., 2004) to model these phenomena. Before describing our new model we introduce the previously developed individual chip model mgMOS.

mgMOS is a modified version of gMOS (Milo et al., 2003), which is a probabilistic model assuming an underlying gamma distribution for the PM and MM probe intensities. The model is defined by

(1)
where ygj and mgj represent, respectively, PM and MM intensities of the j-th probe-pair in the g-th probe-set, and Ga(a,b) represents the gamma distribution with parameters a and b. The specific binding signal for the j-th probe pair, sgj, also follows a gamma distribution, sgj ~ Ga({alpha}g,bg). The probability density of sgj is

(2)
where {Gamma}(·) is the Gamma function. In gMOS, the PM and MM probe intensities are independently sampled from two gamma distributions. Milo et al. (2004) improve the original gMOS to the modified gMOS (mgMOS) by modelling the correlation between PM and MM intensities within a probe-set. mgMOS assumes that PM and MM intensities are drawn from a joint probability density

(3)
where bgj ~ Ga(cg,dg). The bgj are latent variables reflecting the different binding affinity of probes within the probe-set. This modified distribution accurately captures the correlated changes in the binding affinity of probe-pairs within the probe-set.

mgMOS has been shown to be an efficient and accurate model (Milo et al., 2004). However, two significant problems remain with the model, which we describe below:

  1. The existing model is a single chip model and does not account for the fact that the latent variables, bgj, are modelling the same information for every chip in the dataset. The estimated bgj for the same probe-pair on the same type of chip may differ and cannot reflect an intrinsic characteristic of probe sequences.
  2. The intensities of MM probes are taken as background and non-specific hybridization, so they cannot account for presence of true signal in the MM probes. This causes improper estimates of the non-specific binding from MM intensities.

To overcome these limitations we propose a multiple chip model, multi-mgMOS, which is given by

(4)
where ygjc and mgjc represent, respectively, PM and MM intensities of the j-th probe-pair in the g-th probe set on the c-th chip, and both of them follow a gamma distribution with the same inverse scale parameter bgj which is probe-pair specific. The shape parameters of the two gamma distributions are different and are comprised of two parts: the background term agc and the true specific hybridization signal term {alpha}gc which are probe-set and chip specific. Similar to the approach in BGX, we allow a fraction of the true signal, {varphi}, to bind to MM probes. The parameter {varphi} is then shared by all probes. In practice this is an approximation as it can be observed that the parameter {varphi} varies between probe-pairs.

The scale parameter is a latent variable bgj which also follows a gamma distribution, bgj ~ Ga(cg,dg), with the parameters cg and dg which are both probe-set specific. This scale parameter is shared across chips for each probe-pair and therefore captures the sequence-dependent nature of the binding affinity.

From (4) the true signal for the j-th probe-pair in the g-th probe-set of the c-th chip, sgjc, follows a gamma distribution

(5)
with a gamma prior over bgj as defined above. For {varphi} = 0, ygjc and mgjc are simple combinations of Gamma distributed variables representing signal and noise. For {varphi} > 0, the mean effect of true signal binding to the MM probe is {varphi} <sgjc>, where <sgjc> is the mean of sgjc over all probes in the probe-set for a particular chip. Our model can be considered a tractable approximation to one in which the MM intensity is the direct sum of the background and {varphi}sgjc.

The log-likelihood of the observed PM and MM intensities for each probe-set g is

(6)
where qg = {sum}c(2agc + (1 + {varphi}){alpha}gc) + cg and wgj = {sum}c(ygjc + mgjc) + dg. The parameters agc,{varphi},{alpha}gc,cg and dg can be estimated iteratively using maximum likelihood. Firstly, with fixed {varphi}, we fit agc, {alpha}gc,cg and dg for each probe-set, then using the fitted agc, {alpha}gc,cg and dg we estimate {varphi}. This process is iterated until all parameters reach stable values.

We find that the model has a flat likelihood for a range of parameters. In order to make the model parameters uniquely identifiable, we adopt the empirical knowledge of {varphi} which can be estimated from spike-in data. We assume that for highly expressed spike-in genes the background and non-specific hybridization can be ignored. Therefore, in (4) we set agc to zero and get {varphi} ~= <mgjc>/<ygjc>, which means the difference between log(MM) and log(PM) for each probe-pair is approximately equal to the constant log({varphi}) (shown in Figure 1a). Using the experimental data from all known spike-in genes whose spiked concentrations are above 50 pM we obtain the fitted log-normal distribution for {varphi} (shown in Figure 1b). We introduce a log-normal prior for {varphi} to obtain the maximum a posteriori (MAP) estimate of {varphi}. The posterior distribution of {varphi} is

(7)
The logarithm of the posterior probability of {varphi} is then (ignoring an irrelevant constant term)

(8)
This is the quantity that is maximized to estimate the model parameters.



View larger version (14K):
[in this window]
[in a new window]
 
Fig. 1 (a) The logarithm of the intensity of one probe pair of the spike-in gene 37777_at in Affymetrix Latin Square spike-in the dataset (described in Section 3.1) versus the logarithm of transcription concentrations. The solid and dotted lines represent PM and MM intensities, respectively. (b) The histogram and fitted log-normal distribution of {varphi}, which measures the fractional amount of specific binding to the MM probe, estimated from the highly expressed spike-in genes.

 
Once the parameters have been estimated the distribution of signal sgjc for probe-pair j in probe-set g of chip c is

(9)
The expected log true probe signal and the variance of log signal for the g-th probe-set are respectively given by

(10)
where {Psi} is the digamma function, which is the derivative of the logarithm of the gamma function, and {Psi}' is the first derivative of the digamma function.

In (10) cg and dg are characteristic of the probe-set for a specific type of chip, while <log(sgjc)> varies with {alpha}gc over different chips in the dataset. The posterior distribution of {alpha}gc is unimodal, so we use a truncated Gaussian to approximate the posterior distribution of {alpha}gc. The Gaussian is centered at the maximum likelihood estimate which corresponds to the MAP estimate under a uniform prior on {alpha}gc. The Hessian of the log-likelihood is used to determine the curvature at the mode. We have also numerically calculated the histogram of the distribution of {alpha}gc and the truncated Gaussian approximates the histogram well for every case where we have compared them. Some examples are shown in Figure 2. We are assuming that the posterior distribution of {alpha}gc factorizes as a product of independent distributions for each chip and analysis of the Hessian of the log-likelihood shows this to be a good assumption unless the number of chips under consideration is very small.



View larger version (16K):
[in this window]
[in a new window]
 
Fig. 2 Posterior probability density function of estimated {alpha} (upper panel) and log expression levels (lower panel) for spike-in gene 37777_at in the Affymetrix Latin Square spike-in dataset (described in Section 3.1) at concentrations (a) 0, (b) 8 and (c) 512 pM. The thin solid lines are numerically calculated from the histogram of the posterior distribution of {alpha} and the thick dash-dotted lines are from the truncated Gaussian approximation.

 
With the posterior distribution of {alpha}gc, it is then straightforward to calculate the percentiles and credibility intervals. We use (10) and the percentiles of {alpha}gc to calculate the percentiles of <log(sgjc)> since these are invariant under the transformation of (10). We take the median of < log(sgjc)> as the expression level of gene g on chip c. It is possible to share the parameter {alpha} across technical replicates as in BGX (Hein et al., 2005). However, in most cases we prefer to use a different {alpha} parameter for each chip since this is more robust to outliers and between-chip variation.

We have implemented both mgMOS and multi-mgMOS in a publicly available R-package using the fast C program donlp2 (Spellucci, 1998) for parameter optimization.


    3 RESULTS AND DISCUSSION
 TOP
 Abstract
 1 INTRODUCTION
 2 METHODS
 3 RESULTS AND DISCUSSION
 4 CONCLUSION
 REFERENCES
 
3.1 Datasets
To evaluate the performance of the new probabilistic model, we consider two bench-mark datasets and a real time-course dataset. Dataset A is the same one used by Hein et al. (2005) to evaluate BGX. It is a subset of the publicly available GeneLogic spike-in dataset (http://www.genelogic.com/media/studies/spikein.cfm) and includes only 1011 probe-sets for each chip. Dataset A includes six chips under two different conditions, each of which has three replicates. All chips have a common complex cRNA derived from an acute myeloid leukemia (AML) tumor cell line as an identical background. There are 11 spike-in genes which were spiked in at different concentrations for the two conditions. According to differences between spiked in concentrations in the two conditions, ranks of the difference between expression levels for the 11 spike-in genes are shown in Table 1. For more details of dataset A please refer to Hein et al. (2005).


View this table:
[in this window]
[in a new window]
 
Table 1 The ranks of the 11 spike-in genes in dataset A, with respect to the degree of differences between expression levels under two conditions, obtained with the three different probabilistic methods

 
Dataset B is a subset of the Affymetrix human genome U95 Latin Square spike-in dataset (http://www.affymetrix.com/support/tech6nical/sample_data/datasets.affx). We include 14 chips (chip a–m and q) from group 1521. There are 14 genes spiked in at different concentrations on each chip. The 14 concentrations were 0, 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512 and 1024 pM. The combined spike-in material for each chip contains the 14 spike-in genes, each of which was spiked in at one of the 14 different concentrations. This dataset includes all of the 12 626 probe-sets on each chip.

Dataset C (http://www.ncbi.nlm.nih.gov/projects/geo/) used in Lin et al. (2004) measures the mRNA expression profile in mouse back skin from eight representative time points to discover regulators in hair-follicle morphogenesis and cycling. This dataset has 25 chips and each time point has three or four replicates which are measured from various littermates. In this dataset eight genes, not previously known to be hair-cycle associated, were identified by their temporal and spatial expression patterns during the hair-growth cycle and confirmed by quantitative real-time PCR (qr-PCR) experiments. For the first hair-growth cycle the microarray data come from five time points (day 1, 6, 14, 17 and 23) and qr-PCR data are from eight time points (day 1, 5, 8, 12, 15, 17, 19 and 23). So there are three time points in common (day 1, 17 and 23). For more details of dataset C please refer to Lin et al. (2004).

3.2 Comparison with other methods
We use the simplified GeneLogic dataset A to compare multi-mgMOS with other alternative probabilistic models BGX (Hein et al., 2005) and mgMOS (Milo et al., 2004). The performance of otherpopular statistical methods on dataset A is shown in Hein et al. (2005). In order to further demonstrate the accuracy of multi-mgMOS, we also compare it with the most popular statistical models, MAS 5.0 (Affymetrix, 2001), MBEI (Li and Wong, 2001) and GCRMA (Wu et al., 2004), using the larger dataset B. In order to demonstrate the advantages of the probabilistic approach, we show some probabilistic quantities of interest in the results such as the credibility in both the expression measures and the signal log-ratios. Furthermore, we use the PCR-validated time-course dataset C to show the performance of our proposed method on a real biological dataset. The following software were used: BGX(http://www.bgx.org.uk/software.html), MBEI (http://www.biostat.harvard.edu/complab/dchip) and GCRMA (http://www.bioconductor.org).

3.2.1 Comparison with BGX and mgMOS
Figure 3 shows scatter plots of gene expression values from BGX, mgMOS and multi-mgMOS for dataset A. The dotted points represent the non-spike-in genes and the star points represent the spike-in genes. Since the expression levels of non-spike-in genes in the two samples are identical, the dotted points should follow the diagonal. For non-spike-in genes, the correlation coefficient between gene expression levels for the two conditions estimated with BGX, mgMOS and multi-mgMOS are 0.9919, 0.9926 and 0.9934, respectively, demonstrating that multi-mgMOS is most consistent for these genes. The spike-in genes are spiked in at different concentrations under the two conditions, so the star points should be away from the diagonal. Except for one inappropriate spike-in gene, all the other 10 spike-in genes situate away from the diagonal. Table 1 shows results for the estimated ranks of 11 spike-in genes from the three models. All the models rank 10 of 11 spike-in genes in the top 10 and show similar performance, although multi-mgMOS seems to show slightly worse performance in identifying the rank of spike-in genes 5 and 8. The results on this small dataset are rather inconclusive and it is difficult to distinguish between the three methods. We investigate the difference in performance of these models on the larger datasets B and C below.



View larger version (13K):
[in this window]
[in a new window]
 
Fig. 3 Scatter plots of gene expression measures of the two conditions in dataset A from (a) BGX, (b) mgMOS and (c) multi-mgMOS. For mgMOS and multi-mgMOS, the mean estimated gene expression levels for each condition over the three replicates are used.

 
3.2.2 Sensitivity to variation in concentration
Figure 4 shows curves of log expression values for 12 spike-in genes in dataset B from six methods against log concentrations which are scaled to (1,14). Following instructions of Affymetrix, two spike-in genes, 407_at and 36889_at, in dataset B are excluded due to the poor performance of certain probe-pairs. For multi-mgMOS and mgMOS the negative log expression levels are truncated at –0.5 and negative values obtained from other methods are truncated at zero. Ideally, curves for spike-in genes should have a slope of one since the difference in the concentration should result in an identical difference in measured expression level. For highly expressed spike-in genes, all six methods obtain similar results, but for low expressed spike-in genes, the slopes of curves of multi-mgMOS, mgMOS and GCRMA are closest to one showing high sensitivity to the variation in concentrations of these methods. BGX estimates non-specific hybridization signal equally over the whole chip without taking probe effects into consideration. This seems unreasonable in practice since PM and MM share very similar oligo sequences. Consequently, BGX has poor performance in the low expressed area where non-specific binding has the strongest effect. The inability of mgMOS to share probe-specific effects across chips results in reduced accuracy for some genes. This can be observed from some spike-in genes at low concentrations where the expression measurements seem higher than true concentrations. GCRMA uses the GC content of probes in order to obtain improvement for the lower end, but the slope is still slightly <1. The same problem exists for other popular statistical methods which get relatively large expression measures for those weakly expressed genes.



View larger version (26K):
[in this window]
[in a new window]
 
Fig. 4 Curves of logarithm of gene expression values obtained from different methods for the 12 spike-in genes in the dataset B against the logarithm of transcription concentrations: (a) multi-mgMOS, (b) BGX, (c) mgMOS, (d) MAS 5.0, (e) MBEI and (f) GCRMA. The ideal slope of curves is one.

 
3.2.3 Performance on a real dataset
In dataset C, we obtain the profiles of eight PCR confirmed genes from MAS 5.0, GCRMA, BGX, mgMOS and multi-mgMOS. As an example, Figure 5 shows results from the five models for three probe-sets measuring the expression of gene Dab2 in the first hair-growth cycle. The data from one randomly selected sample is shown. For the first probe-set, GCRMA does not correctly identify the anti-hair-growth pattern as shown in the leftmost plot of the first row. MAS 5.0, BGX, mgMOS and multi-mgMOS obtain more reasonable cycle patterns for all three probe-sets. However, BGX is less confident in capturing the cycle pattern for the first probe-set due to the large credibility intervals for expression levels at day 17 and 23.



View larger version (25K):
[in this window]
[in a new window]
 
Fig. 5 Temporal profile of gene Dab2, which contains three probe-sets in column (a), (b) and (c) from dataset C using the following five models: MAS 5.0, GCRMA, BGX, gMOS and multi-mgMOS. The qr-PCR profile is the dotted line in each plot. The first row shows the results from MAS 5.0 (dashed lines) and GCRMA (solid lines). The second, third and fourth rows show the profile from BGX, mgMOS and multi-mgMOS, respectively. The 5–95% credibility intervals are also plotted for each time point for the probabilistic models BGX, mgMOS and multi-mgMOS. The credibility intervals are truncated at 2.0 and –4.0 in order to make plots clear.

 
Table 2 shows the root mean square error (RMSE) of estimated profiles for eight hair-cycle associated genes. The RMSE to qr-PCR data is shown in the top row of Table 2 and measures the difference between the estimated profiles and the corresponding qr-PCR results for three time points (day 1, 17 and 23) in the first hair-growth cycle for all three mice. This is computed using all probe-sets for the eight genes. We find that mgMOS and multi-mgMOS obtain the best values of RMSE and the reason for this becomes apparent when we look in detail at the profiles from each probe-set. Hair-growth patterns obtained from mgMOS and multi-mgMOS are consistent with the qr-PCR results for all the eight genes. Profiles from BGX are significantly different from the corresponding qr-PCR data for two probe-sets associated with genes Elf5 and Wnt11. Profiles from MAS 5.0 are inconsistent with qr-PCR profiles for two probe-sets related to gene Fbln1. Profiles from GCRMA are inconsistent for one and two probe-sets from genes Dab2 and Fbln1, respectively.


View this table:
[in this window]
[in a new window]
 
Table 2 The RMSE of profiles from multi-mgMOS, mgMOS, MAS 5.0 and GCRMA for hair-growth associated genes in dataset C

 
There are three PCR confirmed genes (Junb, Dab2 and Fbln1) that have multiple probe-sets, and they have 2, 3 and 4 probe-sets, respectively. We use profiles of these nine probe-sets for three mice to calculate the RMSE to the same gene probe-set (bottom row of Table 2) and this shows that multi-mgMOS identifies the most consistent quantities for probe-sets associated with the same gene.

We found that the performance of the new method was most impressive on this real dataset and we believe that it is important to validate new methods for normalization and probe-level analysis on real experimental data as well as on spike-in data. Spike-in data have the unrealistic property that almost all genes have identical expression levels in different experiments. This property may be better suited to some methods than others. For example, the quantile normalization (Bolstad et al., 2003) used in RMA and GCRMA works under the assumption that gene expression levels have a similar distribution in different experiments. This assumption is especially well-suited to the analysis of artificial spike-in datasets in which the distribution of expression levels between experiments is almost identical. It is unclear how well this assumption holds in general.

3.2.4 Model selection
We have seen that mg-MOS and multi-mgMOS have quite similar performance in terms of accuracy. However, there are often quite substantial differences in terms of the inferred posterior signal distribution and corresponding error bars. Therefore we would like to determine which model has the most statistical support by using standard model selection methods. We have computed Akaike's information criterion (AIC) (Akaike, 1973) and the Bayesian information criterion (BIC) (Schwartz, 1978) on all three datasets as shown in Table 3. According to results of AIC and BIC, multi-mgMOS provides a better explanation of these three datasets. This may explain why multi-mgMOS typically obtains more confident results when compared with mgMOS (shown in Figure 5).


View this table:
[in this window]
[in a new window]
 
Table 3 Results of AIC and BIC model selection criteria for mgMOS and multi-mgMOS on the three datasets used in the paper

 
3.2.5 Computational efficiency
A major advantage of multi-mgMOS over BGX is that the likelihood can be written in closed form and with the use of an efficient optimization package donlp2 (Spellucci, 1998) the parameters can be obtained much faster than BGX. The difference in computation time between BGX and multi-mgMOS for the three datasets used in this study are shown in Table 4. As a multiple chip model, multi-mgMOS is expected to perform better on relatively large datasets and its computational efficiency makes this applicable in practice.


View this table:
[in this window]
[in a new window]
 
Table 4 The computation time of BGX and multi-mgMOS on different datasets. Computation time is obtained on a 1.8 GHz AMD Opteron machine

 
3.3 Credibility of expression measures
We randomly selected one spike-in gene 37777_at from dataset B and plotted the probability density function of estimated expression levels at concentrations 0, 8 and 512 pM as shown respectively in the lower panel of Figure 2. The upper panel is the related posterior distribution of {alpha}. As the concentrations increases, the most likely expression level increases and the variance of the expression measurement decreases to obtain more and more confidence in the estimated expression levels. Figure 6 shows 2.5–97.5% credibility intervals of expression levels for 12 spike-in genes in dataset B. As concentration increases, more confidence with the expression estimates is obtained.



View larger version (33K):
[in this window]
[in a new window]
 
Fig. 6 2.5–97.5% credibility intervals of expression levels for 12 spike-in genes in dataset B. The expression levels and the credibility intervals are truncated at –0.5 to aid clarity.

 
A key use of microarrays is identifying differentially expressed genes from different experimental samples. With our model it is easy to carry out this task. For dataset A with technical replicates whose sample comes from the same fragmented complex cRNA, we assume that the variation between chips are low enough to share {alpha} across replicates. Using the delta method (Oehlert, 1992) to approximate the posterior distribution of < log(sgjc)>, we obtain the distribution of the difference between the expression levels for each gene in dataset A. Figure 7 shows the median and 5–95% credibility of the differences between the estimated expression levels for (a) all genes and (b) 51 genes under two conditions from dataset A. For the spike-in genes, except gene 1004, which is not spiked in properly, the credibility intervals do not include zero, which means the spike-in genes are differentially expressed with high probabilities. The credibility intervals of all non-spike-in genes except five include zero. With increased credibility intervals, the error bars of the five non-spike-in genes that are false positives under 5–95% credibility intervals embrace zero gradually, while the true positives remain significant. Notice that in some cases the credibility intervals are very large. This is because the log-ratio of signal is essentially unidentifiable for very low expressed genes. However, this does not present a problem for the method since we only infer a significant log-ratio when the credibility intervals do not include zero.



View larger version (21K):
[in this window]
[in a new window]
 
Fig. 7 Median and 5–95% credibility intervals of the log-ratio between expression levels of dataset A under two conditions for (a) all genes and (b) genes from 961 to 1011. Median values are represented by star points and credibility intervals are represented by horizontal solid lines.

 

    4 CONCLUSION
 TOP
 Abstract
 1 INTRODUCTION
 2 METHODS
 3 RESULTS AND DISCUSSION
 4 CONCLUSION
 REFERENCES
 
We have presented a new probabilistic model for probe-level analysis of Affymetrix microarray data. This model shows competitive performance compared with other models on commonly used benchmarks and gives very impressive results on a real time-course dataset. The likelihood function can be written in closed form and the computation is therefore very fast, which allows its potential application to large datasets. Probe effects are shared across all chips of the same type and this improves the accuracy as well as the model's support. Moreover, as a probabilistic model, multi-mgMOS provides a measure of confidence for the inferred true signal and this will be very useful in downstream analyses, especially those adopting Bayesian methods.

We submitted our results on the benchmark described by Cope et al. (2004) on 15 June, 2005. This is a popular benchmark for evaluating probe-level analysis methods. At the time of submission we ranked top of all methods in 4 out of 15 criteria in the original assessment and top in 3 out of 14 criteria in the newer assessment (see Supplementary Material). Our models show excellent sensitivity to variation in concentration, which is consistent with the results in Section 3.2.2. The evaluation methods in this benchmark use a simple fold-change rule to identify differentially expressed genes and do not consider approaches that make use of credibility intervals, such as those proposed in Section 3.3. This leads to an underestimate of the area under ROC curve for our model. We use the median posterior estimate of <log(s)> as our point estimate of log concentration and this estimator has a low bias. However, for weak signals this estimator is naturally associated with a large variance and this leads to some large point estimates of fold-change for weakly expressed genes. These are considered ‘false positives’ by Cope et al. (2004) but we would reject most of these large fold-changes as insignificant by taking their associated credibility intervals into account, as we have explained in Section 3.3. Our point estimate of signal must be combined with a credibility interval in order to get sensible results for weakly expressed genes. We could reduce the typical size of these large fold-changes by, for example, using <log(s+c)> for some positive constant c as our signal estimate. This would reduce variance but at the cost of increased bias and reduced accuracy in measuring concentration. In our opinion, criteria that are solely based on point estimates of fold-change are inappropriate for the evaluation of probabilistic methods.

Many popular analysis methods (e.g. clustering, dimensionality reduction and classification) can be modified to include measurement variance since in these methods chips are typically treated as independent data points. Other forms of analysis make explicit use of biological replicates, such as t-tests and the clustering approach in Lin et al. (2004). The number of replicates is often very small and we think that the variance estimate could be improved by also including the within-replicate variance identified by a probabilistic probe-level analysis. This is the current focus of our work.


    Acknowledgments
 
The authors gratefully acknowledge Padhraic Smyth and Bogi Andersen for providing the mouse time-course qr-PCR data. X.L. especially thanks Gwenn Englebienne for helping with C coding. M.R. and N.D.L. gratefully acknowledge a BBSRC award ‘Improved processing of microarray data with probabilistic models’. M.M. is supported by an Advanced Training Fellowship from the Wellcome Trust.

Conflict of Interest: none declared.

Received on May 5, 2005; revised on June 30, 2005; accepted on July 8, 2005

    REFERENCES
 TOP
 Abstract
 1 INTRODUCTION
 2 METHODS
 3 RESULTS AND DISCUSSION
 4 CONCLUSION
 REFERENCES
 

    Affymetrix. Microarray Suite User Guide Version 5.0, (2001) , Santa Clara CA Affymetrix Inc.

    Akaike, H. (1973) Information theory and extension of the maximum likelihood principle. 2nd International Symposium in Information TheoryBudapest Akademia Kiado, pp. , pp. 267–281.

    Bolstad, B.M., et al. (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19, 185–193[Abstract/Free Full Text].

    Cope, L.M., et al. (2004) A benchmark for Affymetrix GeneChip expression measures. Bioinformatics, 20, 323–331[Abstract/Free Full Text].

    Hein, A.K., et al. (2005) BGX: a fully Bayesian integrated approach to the analysis of Affymetrix GeneChip data. Biostatistics, 6, 349–373[Abstract/Free Full Text].

    Irizarry, R.A., et al. (2003) Exploration, normalization and summaries of high density oligonucleotide array probe level data. Biostatistics, 4, 249–264[Abstract].

    Li, C. and Wong, W. (2001) Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc. Natl Acad. Sci. USA, 98, 31–36[Abstract/Free Full Text].

    Lin, K.K., et al. (2004) Identification of hair cycle-associated genes from time-course gene expression profile data by using replicate variance. Proc. Natl Acad. Sci. USA, 101, 15955–15960[Abstract/Free Full Text].

    Lockhart, D.J., et al. (1996) Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat. Biotechnol., 14, 1675–1680[CrossRef][ISI][Medline].

    Milo, M., et al. (2003) A probabilistic model for the extraction of expression levels from oligonucleotide arrays. Biochem. Soc. Trans., 31, 1510–1512[ISI][Medline].

    Milo, M., Niranjan, M., Holley, M.C., Rattray, M., Lawrence, N.D. (2004) A probabilistic approach for summarising oligonucleotide gene expression data. Submitted for publication, Technical report available upon request.

    Naef, F. and Magnasco, M.O. (2003) Solving the riddle of the bright mismatches: Labeling and effective binding in oligonucleotide arrays. Phys. Rev. E, 68, 011906[CrossRef].

    Oehlert, G.W. (1992) A note on the delta method. Amer. Stat., 46, 27–29.

    Schena, M., et al. (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 270, 467–470[Abstract/Free Full Text].

    Schwartz, G. (1978) Estimating the dimension of a model. Ann. Stat., 6, 461–464.

    Spellucci, P. (1998) A SQP method for general nonlinear programs using only equality constrained subproblems. Math. Prog., 82, 413–448.

    Wu, Z., et al. (2004) A model based background adjustment for oligonucleotide expression arrays. J. Am. Stat. Assoc., 99, 909–917[CrossRef][ISI].

    Zhang, L., et al. (2003) A model of molecular interactions on short oligonucleotide microarrays. Nat. Biotechnol., 21, 818–821[CrossRef][ISI][Medline].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
W. K. Lim, K. Wang, C. Lefebvre, and A. Califano
Comparative analysis of microarray normalization procedures: effects on reverse engineering gene networks
Bioinformatics, July 1, 2007; 23(13): i282 - i288.
[Abstract] [Full Text] [PDF]


Home page
EndocrinologyHome page
R. Stienstra, S. Mandard, D. Patsouris, C. Maass, S. Kersten, and M. Muller
Peroxisome Proliferator-Activated Receptor {alpha} Protects against Obesity-Induced Hepatic Inflammation
Endocrinology, June 1, 2007; 148(6): 2753 - 2763.
[Abstract] [Full Text] [PDF]


Home page
BiostatisticsHome page
V. Purutcuoglu and E. Wit
FGX: a frequentist gene expression index for Affymetrix arrays
Biostat., April 1, 2007; 8(2): 433 - 437.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
X. Liu, M. Milo, N. D Lawrence, and M. Rattray
Probe-level measurement error improves accuracy in detecting differential gene expression
Bioinformatics, September 1, 2006; 22(17): 2107 - 2113.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
G. Sanguinetti, M. Rattray, and N. D. Lawrence
A probabilistic dynamical model for quantitative inference of the regulatory mechanism of transcription
Bioinformatics, July 15, 2006; 22(14): 1753 - 1759.
[Abstract] [Full Text] [PDF]


Home page
Brief BioinformHome page
M. Rattray, X. Liu, G. Sanguinetti, M. Milo, and N. D. Lawrence
Propagating uncertainty in microarray data analysis
Brief Bioinform, March 1, 2006; 7(1): 37 - 47.



Home page
BioinformaticsHome page
G. Sanguinetti, M. Milo, M. Rattray, and N. D. Lawrence
Accounting for probe-level noise in principal component analysis of microarray data
Bioinformatics, October 1, 2005; 21(19): 3748 - 3754.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/18/3637    most recent
bti583v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (12)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Liu, X.
Right arrow Articles by Rattray, M.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Liu, X.
Right arrow Articles by Rattray, M.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?