Bioinformatics Advance Access originally published online on July 28, 2006
Bioinformatics 2006 22(19):2381-2387; doi:10.1093/bioinformatics/btl399
Joint estimation of calibration and expression for high-density oligonucleotide arrays
Division of Biostatistics, Department of Health Sciences Research, Mayo Clinic College of Medicine 200 First Street SW, Rochester, MN 55905, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: The need for normalization in microarray experiments has been well documented in the literature. Currently, many analysis methods treat normalization and analysis as a series of steps, with summarized data carried forward to the next step.
Results: We present a unified algorithm which incorporates normalization and class comparison in one analysis using probe level perfect match and mismatch data. The algorithm is based on calibration models common to most biological assays, and the resulting chip-specific parameters have a natural interpretation. We show that the algorithm fits into the statistical generalized linear models framework, describe a practical fitting strategy and present results of the algorithm applied to an example dataset as well as based on metrics used in affycomp. The algorithm ranks amongst the top third of the affycomp competitors, performing best in measures of bias.
Availability: R functions are available on request from the authors.
Contact: oberg.ann{at}mayo.edu
| 1 INTRODUCTION |
|---|
|
|
|---|
Microarray technology provides researchers with a powerful tool to measure expression levels of thousands of genes in a specimen sample simultaneously. Owing to the costly nature of these studies and sometimes sample scarcity, an experiment typically yields results for only a few samples. In addition, it is usually feasible to pursue further research regarding only a small subset of the genes on the array. Hence, it is important to make efficient use of the data in order to distinguish biological variation from random error.
Currently, many methods of class comparison involve a separate normalization step and comparison step, although unifying normalization together with class comparison into one analysis has been advocated by Kerr et al. (2000), Kepler et al. (2002), Emptage et al. (2003) and others. The purpose of normalization is to remove systematic sources of variability while preserving the biologic variation of interest. The simplest of the normalization methods [applied by Affymetrix MAS software (Affymetrix, 2001), and others http://www.affymetrix.com/support/technical/manuals.affx, Affymetrix, Inc] involves setting the overall mean of each chip to a common value. (The microarray laboratory in our institution for instance, uses the value of 1500.) This assumes a linear relationship between the log true expression level and the log fluorescent intensity actually observed over the entire range of gene expression values. There is compelling evidence that this relationship is in fact non-linear (Bolstad et al., 2003; Dudoit et al., 2002; Kepler et al., 2002; Naef et al., 2003), and several more sophisticated normalization methods have been proposed. The normalized probe-level data are then typically summarized into a per-gene estimate of expression. Class comparisons and other analyses are then performed on the data at this level.
We show here that the massive number of probes on a high-density oligonucleotide array chip actually allows the construction of an indirect calibration curve using the available data in an unsummarized form, and that these curves have several desirable features. We present an algorithm that unifies the normalization and class comparison steps into one analysis, show how it fits into the statistical class of generalized linear models of McCullagh and Nelder (1983) and describe a practical fitting strategy. Section 2 describes the datasets used for illustrative purposes. Section 3 motivates and introduces the concept of calibration and describes its application here. Section 4 describes the algorithm and its implementation and Section 5 describes the performance of the method when applied to an example dataset as well as based on the benchmark of Affymetrix GeneChip expression measures (Affycomp) of Cope et al. (2004) and Irizarry et al. (2005). Conclusions and discussion are presented in Section 6.
| 2 DATA |
|---|
|
|
|---|
The properties of the proposed algorithm were evaluated on data from publicly available spike-in experiments as well as on an experimental data set published by Abraham et al. (2005). These datasets are briefly described in this section. The spike-in datasets are available from the Affymetrix website http://www.affymetrix.com or on the affycomp website http://affycomp.jhsph.edu.
2.1 Affymetrix U95A spike-in data
The Affymetrix U95A spike-in dataset has 16 genes which were spiked in varying concentrations into a pancreas background in concentrations ranging from 0 to 1024 pM and hybridized in a cyclic Latin Square design onto Hu95A chips. There were at least three chips per concentration combination. In this dataset, there are only 16 genes expected to show fold changes in concentration expression. All other 12 610 genes should be identically expressed on all arrays. A more complete description can be found in Cope et al. (2004) and Irizarry et al. (2005) or via a search for Latin square data on the Affymetrix website.
2.2 Affymetrix U133A spike-in data
The Affymetrix U133A spike-in dataset has 42 genes spiked into a human HeLa cell line background in a cyclic Latin square fashion and hybridized onto Hu133A chips. The concentrations range from 0 to 512 pM and the sample was hybridized to 42 chips. In this dataset, there are only 42 genes expected to show fold changes in concentration expression. The other 22 258 genes should be identically expressed on all arrays. More details can be found via a search for Latin square data on the Affymetrix website or in Cope et al. (2004) and Irizarry et al. (2005).
2.3 Immunoglobulin light chain amyloidosis (AL) data
Abraham et al. (2005) performed microarray experiments using the Affymetrix Hu133A platform in order to identify a set of genes which could distinguish immunoglobulin light chain amyloidosis (AL) from multiple myeloma (MMy). Class comparison analyses were performed on data from 24 AL and 28 MMy patients.
| 3 CALIBRATION |
|---|
|
|
|---|
In simpler, low volume assays, it is common to normalize the data using a direct calibration curve f
![]() |
is random error. In a 96 well ELISA assay, for instance, f is directly estimated using reference samples of known concentration in one row of wells. The fitted curve is used to recover estimates of the true values x from the data y. Microarray data does not typically contain the information needed for a direct calibration. In particular, it is not clear what could be used as reference targets that would fairly represent the wide variety of probes on the array. (See the Gene Logic technical note entitled Optimization of an external standard for the normalization of Affymetrix GeneChip arrays available from the Gene Logic website http://www.genelogic.com for one approach to this issue, however.) Here, we focus on an indirect calibration function applied to an array as a whole accounting for the largest effects, with future work dedicated to accounting for probe-specific background effects.
Hekstra et al. (2003) and Burden et al. (2004) have shown that the Langmuir isotherm
![]() |
It is a general principle that for any biological assay that spans a wide range, an S-shaped curve between the true (unmeasured) value and the observed assayed value is almost a guarantee. A lower threshold on the observed data can be owing to background binding, lower limits of detection for the instrumentation, or other causes [the mismatch (MM) probes are actually designed to estimate this], while an upper limit may be owing to either biochemical or instrumentation saturation. Given the wide range of gene expression values, from
10 to 46 000 seen on the Affymetrix platform, this S-shape relationship is likely to be true.
Furthermore, the review article of Finney (1976) describes analytical approaches to the binding problem in radioligand assay. He states that for most problems, a logistic or probit function fit to the log of the true value (on the horizontal axis) versus the log of the observed value is sufficient. A further advantage of this is that the data in question is approximately equivariant on the log scale (Ballman and Therneau, 2005), making both plots and analyses more straightforward.
Figure 1 contains two Langmuir isotherms over the concentration and fluorescence range found in the U95 spike-in data with matching logistic curves overlayed. The Langmuir and logistic curves are nearly identical in each case, and we have chosen to retain the logistic fit for this report owing to its statistical familiarity. For the Langmuir form, a and b control the upper and lower thresholds, and log2(K) translates the curve left-right. For the logistic form
,
=
3(xK),
1 and
2 again control the upper and lower limits, the location parameter K corresponds to a per probe binding constant, and
3 = 1/2 causes the curve to have slope 1 at the midpoint; this is what is shown in the figure.
|
The importance of a per-probe binding constant is shown in Figures 2 and 3, both of which show the data for chip 1 of the U95 experiment. Figure 2 shows the raw fluorescence data versus the spiked in values, equivalent to assuming K = 0 for all probes, along with a lowess smooth; Wu and Irizarry (2004) pursue this in more detail. In contrast, Figure 3 shows the result with probe-specific values for K; based on a fit to all 59 chips and 256 PM/MM probe pairs of the U95 spike-in with x = the true spike-in amount. The fit is much tighter, and in particular the logistic (Langmuir) form of the true calibration curve is much more clear.
|
|
| 4 IMPLEMENTATION |
|---|
|
|
|---|
4.1 An algorithm
A goal of many microarray experiments is to evaluate differences in gene expression between experimental groups. The normalization function is necessary, but is not of interest in and of itself. Our basic calibration model is
![]() |
where gik is the expression of the i-th gene on the k-th chip and pij is the binding efficiency of the j-th probe associated with the i-th gene. The primary parameters of interest are the gene effects gik, secondarily the calibration functions fk themselves, and the probe effects pij are ancillary.
Note that if the chip functions fk were known, then this is the estimation of a generalized linear model (McCullagh and Nelder, 1983) with f1 as the link function and
ijk as the linear predictor. This opens up a whole set of well-developed tools for estimation and hypothesis testing.
It is not feasible to fit all the parameters at once, however, and we use a simple iterative algorithm where each step is described in more detail in subsequent sections:
- Create initial estimates
of fk.
- Solve for pij and gik, given
and yijk, creating
and
. Notice that this step can be done separately for each gene i.
- Solve for fk, given
,
, and yijk, creating an updated
. Notice that this step can be done separately for each chip k.
- Iterate steps 1 and 2 a fixed number of times, or until convergence.
4.2 Calibration function, fk
We parameterize the logistic calibration function for the kth chip as
![]() |
The parameters are the lower threshold
1k, the upper threshold
1k +
2k, the inflection point
4k and the slope
3k. The chip inflection parameter
4k is completely confounded with the per-chip gene expression gik. This confounding corresponds to an obvious physical interdigitation: if each signal on chip 2 were 20% larger than the corresponding signal on chip 1, it is not possible to say, without outside information or assumptions, whether this is a chip effect (such as a different sample handling method or different scanner) or that all gene products on the second chip are actually more highly expressed (such as in a dilution experiment). The first corresponds to
41 >
42 and gi1 = gi2 for all i, the second to
41 =
42 and gi1 > gi2 for all i.
For computation, we will assume without loss of generality that
4k = 0 for all k, i.e. set the chip effect to 0. After the fit, this can be adjusted based on chosen constraints, for instance rescaling so that all of the gene effects gik have mean 0 within a chip will credit all systematic variations to chip effects rather than to the experimental unit that was hybridized to that chip. This is probably what will be done most often in practice.
The other non-identifiability is between gene effects and probes within gene:
for any constant c. We have chosen to make the perfect match (PM) probe effects within each gene sum to zero. Any comparison of the absolute values of gik between genes would require the assumption that the average binding efficiency of their two probe sets is identical.
The MM data are used in addition to the PM data in this algorithm. A wide range of pij s facilitates detection of the curvature by the fitting process, and the MM data help to ensure a wide range. Hence, the information added by the MM data is key to the success of this algorithm. In fact, the algorithm surprisingly has substantially more difficulty recovering the true calibration curve without the MM data when evaluated in simulations (data not shown).
4.3 Initial estimates
As an initial estimate of the lower threshold of the calibration function, we use a percentile of the MM data. If the MM probes measure only background binding, as was purposed in their design, then we would expect an average MM probe to be close to the true lower threshold. In actuality, some of them will have activity (Ballman and Therneau, 2005 http://mayoresearch.mayo.edu/mayo/research/biostat/techreports.cfm), so a percentile between the 20th and 30th is used. For the upper threshold, we use the 99th percentile of the PM values since the 100th percentile represents saturated housekeeping genes.
An initial value for the slope
3k can be set to 1 since this only controls the scale of the primary parameters gik and pij. As stated above, the offset parameter
4k is set to 0.
4.4 Estimation of gene and probe effects given fk
Since this estimation will be done on a very large number of genes, the calculations should be as fast as possible. Calls to a general non-linear estimation would be slow. The iterative weighted least squares (IWLS) approach pioneered in generalized linear models (McCullagh and Nelder, 1983) allows fast and easy estimation of the logistic parameters. For a given gene i, we have
![]() | (1) |
1k and
1k +
2k, the IWLS update is the solution to a weighted regression with a working dependent variable z and case weights wijk, where
![]() |
and
is the variance of the yijk vector evaluated at the current value of the linear predictor
.
Since we are using the log2(intensity) data without background subtraction, the variance is approximately constant, i.e.
(this is evident in Fig. 3 in that the width of the scatter around the predicted line does not vary greatly over the range of the data). Ballman et al. (2004) examined plots for all 1508 probes for the spiked in genes of the U95 and U133 spike in experiments and reached the same conclusion. This provides a key simplification in the estimation procedure. Background subtraction makes no essential difference in the formulation of Equation (1), as it corresponds more or less to subtracting a constant from
1; but it would add a major complication to the variance function
in that it is no longer constant.
With V a constant and using the logistic formulation for fk, the update formula simplifies to a regression of
on DX, where D is a diagonal matrix with elements dijk. The weights dijk serve to down-weight points in the far left or right tails of the function. Notice that if we were to assume d = 1, then this becomes a bias correction algorithm similar to fastlo (Ballman et al., 2004), but with a more complex linear model (both gene and probe effects rather than the simple probe mean) and a logistic mean function rather than a non-parametric smooth.
| 5 RESULTS |
|---|
|
|
|---|
The calibration algorithm was applied to the datasets described in Section 2. The resulting gene estimates (i.e. the gik), centered and scaled to attribute the systematic effects to the chip, were submitted to affycomp (Cope et al., 2004; Irizarry et al., 2005). The full report can be found on the affycomp website http://affycomp.jhsph.edu/ under the method name chipcal4 on the new assessment link.
Figure 4 shows the estimated calibration curve for chip 1 of the U95 data from the proposed method, along with the best possible estimate created from a fit where the true concentrations are known (i.e. the model fit to the spike in genes only). The two curves are surprisingly similar.
|
At the time of this writing, competition results were available on the new assessment link of the affycomp website for 48 methods in the U95 table and 42 unique methods for the U133 table. In the 14-dimensional score displayed for all submitted methods our calibration method has average rank (based on absolute distance from the ideal value) of 21 for the U95 and 19 for the U133 result tables, receiving the highest ranks in the items measuring bias and the lowest ranks in the items measuring variance. When the methods are ranked based on their average score ranking, this algorithm places 13th and 14th for the U95 and U133 data, respectively.
A plot of observed expression intensity versus true expression intensity is displayed in Figure 4a of the reports. The slopes from a regression line fitted to these data (observed versus nominal expression values) are 0.94 and 0.95 for the U95 and U133 data, respectively. A value of 1 would indicate no bias. While these slopes show that there is bias towards zero, this algorithm ranks fourth and sixth on this measure in the U95 and U133 datasets, respectively, indicating that this method has less bias than most. It is our thought that the excellent performance in this area may be owing to the fact that we are, at least approximately, capturing the correct functional form of the calibration curve directly in the model.
The Achilles heel of the current algorithm lies in variance. Figure 2b for both the U95 and U133 reports indicate that variability for the non-spiked in genes increases at both the lowest and highest expression values, with the median standard deviation ranking in the top third for both datasets. The R2 values from the regression lines fit to the observed versus nominal expression values of 0.70 and 0.82 for the U95 and U133 fits rank near the bottom of the list. When signal detection is based on the criteria of a fold change >2, the algorithm performs reasonably well for medium and high fold changes with ROC AUC values of 0.80 and 0.77 for the U95 data and 0.76 and 0.96 for the U133 data. However, for low expression genes, the AUC is <0.5 at 0.36 and 0.44 in the U95 and U133 experiments, respectively, indicating a large number of false positives. Further exploration of the false positives, however, revealed that a large fraction of them are owing to a single aberrant probe which had an overly large influence on the two-way ANOVA of Section 4.4 and that they are located on the horizontally flat portions of the logistic curve.
Going beyond the Affycomp results, we explored the behavior of contrasts for both the spike in and null genes, using the six chips in experiments 1 and 2 of the U133 experiment. For each gene i, the fit gives six gene effect estimates
for which k = 1, 2, 3 represent the first spike-in pattern and k = 4, 5, 6 the second. The variance matrix and residual standard error of the fit can be used to test whether c'g = 0 for these six chips where c = (1, 1, 1, 1, 1, 1, 0, ... , 0) and
, a contrast which incorporates both gene and probe variability. We compare these to results from RMA which are based on only the six summary expression values. Similarly, we explored the behavior of these contrasts in the U95 data utilizing the two spike in combinations having 12 replicates each.
The two chosen spike in combinations in both datasets are such that all but two of the spike in genes have a nominal fold change of 2, with the other two spike in genes having fold changes of 0 versus 0.125 or 512 versus 0 for the U133 data and 0 versus 1024 or 0.25 versus 0 for the U95 data. The remaining genes on the chip have a nominal fold change of zero. Contrast estimates and confidence intervals for the genes having fold change of two from both datasets are displayed in Figure 5 for both our algorithm and RMA. Ideally the contrast estimates would be centered at one (the plots are on the log2 scale) and the confidence intervals would not include zero. Both algorithms have greater difficulty in detecting true changes in the lower abundance range than in the high abundance range. Performance in the mid abundance range is better in the U95 than in the U133 data. Of the 42 fold changes in the U133 data 25 are significant at the 5% level with this algorithm while RMA declares 30 significant (24 of them are in common). There are 11 genes which neither algorithm declares significantly differentially expressed. Of the 16 fold changes in the U95 data 15 are significant at the 5% level with this algorithm while RMA declares 16 significant. This algorithm declares 6.3 and 19.4% of the null genes significantly differentially expressed at the 5% level in the U133 and U95 data, respectively, while RMA declares 5.7 and 4.3% significant. Scatter plots of the distributions of the null gene test statistics for this algorithm versus the RMA algorithm are displayed in Figure 6. These results indicate that both algorithms do a reasonable job of controlling type I error in the U133 data. The current algorithm is not controlling this well in the U95 data. Inspection revealed that, as for signal detection via fold change, a single probe outlier was typically the cause of the false positives, indicating that robust estimation methods will probably improve the control of type I error. As would be expected, the statistical criterion for significance performs much better than the fold change criterion which does not account for variability present in the data.
|
|
The algorithm was applied to the AL data described in Section 2.3 and the hypothesis of differential expression between AL and MMy samples was tested. Before and after calibration, box plots of the overall chip distributions in Figure 7 demonstrate that the calibration algorithm removed systematic differences between the chips. The false discovery rate (FDR) versus type I error rate is displayed in Figure 8 for both our algorithm and RMA. Our algorithm clearly controls the FDR better than RMA does across the entire range of alpha values. At an alpha level of 0.01 (0.05), the FDR for our algorithm is 1.4% (4.3%) while for RMA it is 5.2% (12%).
|
|
| 6 DISCUSSION |
|---|
|
|
|---|
We have presented an algorithm for normalization and analysis of high-density oligonucleotide microarray data. The current algorithm, cast into the generalized linear model framework, brings a statistical framework to bear on the normalization problem and paves the way for a large set of well-established methods to be easily applied in this situation. In addition, the generalized linear model framework allows one to conduct hypothesis tests based on variability at the probe level, accounting for all sources of variation present in an experiment. While our implementation assumes the variance is not a function of the mean, the algorithm itself does not rely on this assumption. The implementation can be modified accordingly in order to allow application to background subtracted data.
The algorithm incorporates biological principles as well as efficient classical statistical techniques. It uses probe level PM and MM data as the input and results in a chip-specific normalization or indirect calibration function. The logistic function used to model this is based on calibration models common to most biological assays and the parameters have a natural interpretation. Parameter estimates from these calibration functions could eventually be used for quality control purposes once their behavior is better understood. The algorithm is simple to implement using standard software. R functions that implement this algorithm are available from the authors.
The algorithm estimates the true calibration functions well. Results from the affycomp competition show that on average, this algorithm is amongst the top third of affycomp competitors based on average rank in the 14-dimensional score and performs well with respect to bias. There is room for improvement with respect to variance and signal detection. Including the MM probes in the modeling process is probably adding variability to the estimates, in a sense acting like background subtraction. However, it is well-known that MM probes measure signal, and without the MMs the algorithm has mediocre performance. In addition, knowledge of probe-specific non-specific binding (such as GC content) has not yet been incorporated into this algorithm. Reports have shown that incorporating this knowledge into normalization routines improves the variance at the lower expression levels (Naef and Magnasco, 2003; Wu et al., 2004). Work is in progress to incorporate probe-specific background binding information into the algorithm.
Regarding signal detection, the affycomp results indicate that this algorithm produces a large number of false positives based on the criteria of fold change greater than two, and the statistical contrast results demonstrate that type I error is not well controlled in the U95 data. Inspection of the null (non-spiked in) genes (i.e. those with expected fold change of zero) which were given large fold changes by the algorithm reveals that these particular genes are those that fall on the horizontally flat portions of the logistic curve and the large fold changes are owing to one or two outliers. This suggests that using robust methods would improve the algorithm with respect to this metric. Hence, we are currently working to incorporate robust methods into this algorithm.
| Acknowledgments |
|---|
The authors are grateful to the reviewers for their insightful comments which resulted in an improved manuscript, to Mr E. M. Lunde for his programming support and to Dr R. S. Abraham for use of the AL data. Partial funding for A.L.O. was provided by the Fraternal Order of Eagles Cancer Research Fund.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Alvis Brazma
Received on December 23, 2006; revised on June 9, 2006; accepted on July 18, 2006
| REFERENCES |
|---|
|
|
|---|
Abraham, R., et al. (2005) Functional gene expression analysis of clonal plasma cells identifies a unique molecular profile for light chain amyloidosis. Blood, 105, 794803
Affymetrix. (2001) Microarray suite user guide, version 5. Technical Report.
Ballman, K.V. and Therneau, T.M. (2005) An exploration of affymetrix probe-set intensities in spike-in experiments. , Rochester, MN Technical Report No. 74 Department of Health Sciences Research, Mayo Clinic.
Ballman, K.V., et al. (2004) Faster cyclic loess: normalizing RNA arrays via linear models. Bioinformatics, 20, 27782786
Bolstad, B.M., et al. (2003) A comparison of normalization methods for high-density oligonucleotide array data based on variance and bias. Bioinformatics, 19, 185193
Burden, C.J., et al. (2004) Statistical analysis of adsorption models for oligonucleotide microarrays. Stat Appl. Genetic Mol. Biol, . 3, Article 35.
Cope, L.M., et al. (2004) A benchmark for affymetrix genechip expression measures. Bioinformatics, 20, 323331
Dudoit, S., et al. (2002) Statistical methods for identifying genes with differential expression in replicated cdna microarray experiments. Statistica Sinica, 12, 111139[Web of Science].
Emptage, M.R., et al. (2003) Treatment of microarray experiments as split-plot designs. Biopharm. Stat, . 13, 159178[CrossRef].
Finney, D.J. (1976) Radioligand assay. Biometrics, 32, 721740[CrossRef][Web of Science][Medline].
Hekstra, D., et al. (2003) Absolute mRNA concentrations from sequence-specific calibration of oligonucleotide arrays. Nucleic Acids Res, . 31, 19621968
Irizarry, R.A., Wu, Z., Jaffee, H.A. (2005) Comparison of affymetrix genechip expression measures. Technical Report Working Paper 86, Johns Hopkins University, Department of Biostatistics Working Papers.
Kepler, T.B., et al. (2002) Normalization and analysis of DNA microarray data by self-consistency and local regression. Genome Biol, . 3, research0037.1research0037.12.
Kerr, M.K., et al. (2000) Analysis of variance for gene expression microarray data. J. Comput. Biol, . 7, 819837[CrossRef][Web of Science][Medline].
McCullagh, P. and Nelder, J.A. Generalized Linear Models, (1983) Chapman and Hall.
Naef, F. and Magnasco, M.O. (2003) Solving the riddle of the bright mismatches: Labeling and effective binding in oligonucleotide arrays. Phys. Rev. E Stat. Nonlin. Soft Matter Phys, . 68, 011906011904[Medline].
Naef, F., et al. (2003) A study of accuracy and precision in oligonucleotide arrays: extracting more signal at large concentrations. Bioinformatics, 19, 178184
Wu, Z. and Irizarry, R. (2004) Stochastic models inspired by hybridization theory for short oligonucleotide arrays [extended abstract]. San Diego, CA. RECOMB, Copyright 2004 ACM 1-58113-755-9/04/0003.
Wu, Z., et al. (2004) A model-based background adjustment for oligonucleotide expression arrays. J. Am. Stat. Assoc, . 99, 909917[CrossRef][Web of Science].
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||













