Bioinformatics Advance Access originally published online on July 10, 2007
Bioinformatics 2007 23(18):2493-2494; doi:10.1093/bioinformatics/btm357
RefPlus: an R package extending the RMA Algorithm
1Statistical Sciences, AstraZeneca, Alderley Park, Macclesfield, Cheshire SK10 4TG, UK, 2Department of Research, Koo Foundation Sun Yat-Sen Cancer Center, 125 Lihder Road, Taipei 112, Taiwan and 3Cancer Discovery Medicine, AstraZeneca, Alderley Park, Macclesfield, Cheshire SK10 4TG, UK
*To whom correspondence should be addressed.
ABSTRACT
Summary: RMA has become a widely used methodology to pre-process Affymetrix gene expression microarrays. A limitation of RMA is that the calculated probeset intensities change when a set of microarrays is re-pre-processed after the inclusion of additional microarrays into the analysis set. Here we report the availability of the RefPlus package containing functions to perform the Extrapolation Strategy and Extrapolation Averaging algorithms which address these issues.
Availability: The software is implemented in the R language and can be downloaded from the Bioconductor project website (http://www.bioconductor.org).
Contact: Chris.Harbron@AstraZeneca.Com
Supplementary information: Further details of the workings and evaluation of these functions are given in the documentation available on the Bioconductor website.
1 INTRODUCTION
It is often necessary to analyse microarray data at one or more interim stages during a study. Multiple-microarray pre-processing algorithms for Affymetrix microarrays, such as RMA (Irizarry et al., 2003), have the undesirable property that the probeset intensities change when microarrays are re-pre-processed after additional microarrays are included. A similar situation arises when developing and applying prediction or classification models using microarrays. Any new sample to be predicted by the model must first be pre-processed, and pre-processing it together with the training set used to develop the model changes the probeset intensities of the training microarrays and hence the parameters of the fitted model.
An extension to RMA, the Extrapolation Strategy, provides a solution to these problems. This method was independently developed by Goldstein (2006) and also by Katz et al. (2006) as refRMA. It avoids having to re-pre-process already pre-processed microarrays when new arrays are added to the data set, but still maintains many of the desirable properties of RMA. RMA is applied to a reference set of microarrays, storing the parameters of the RMA fit. To process additional microarrays, these parameters are directly applied, without any re-estimation, to the new microarrays leaving the gene expression measurements of the reference microarrays unchanged. A similar strategy has also been considered in the PLIER algorithm (Affymetrix, 2005), where a model file fitted by a set of microarrays can be stored and used later.
The use of the RMA algorithm for processing large numbers of microarrays can be limited by available computer memory. One approach is to apply the Extrapolation Strategy, using a subset of microarrays as the reference set and processing the remaining microarrays using the parameters calculated from this reference set. Alternatively, the Extrapolation Averaging algorithm (Goldstein, 2006) gives an improved approximation to RMA by averaging multiple Extrapolation Strategy results over different reference sets.
2 ALGORITHMS
2.1 RMA
RMA consists of three steps:
- Background correction: probe-level data for each microarray are background corrected independently using a probabilistic model.
- Quantile normalization: the background corrected probe-level data on each microarray are normalized to a common set of quantiles, derived from background corrected data from all microarrays.
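- Expression calculation: estimated separately for each probeset using median polish on the linear model:

$$y_{ij} = \mu_i + \alpha_j + \varepsilon_{ij} \qquad (1)$$

where $y_{ij}$ is the log2-transformed, background corrected and normalized intensity of probe $j$ on microarray $i$, $\mu_i$ is the probeset intensity on microarray $i$, $\alpha_j$ is the probe effect of probe $j$ and $\varepsilon_{ij}$ is an error term. For further details on the RMA algorithm refer to Irizarry et al. (2003).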
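The expression calculation step can be illustrated with a minimal Python sketch of Tukey's median polish for a single probeset. This is not the RefPlus or Bioconductor implementation. The function name `median_polish`, the iteration limit, the tolerance and the choice to centre the probe effects at median zero are all assumptions made for illustration. The input is taken to be already background corrected, normalized and log2-transformed.

```python
from statistics import median


def median_polish(y, max_iter=10, tol=1e-6):
    """Fit y[i][j] ~ mu[i] + alpha[j] by Tukey's median polish.

    Rows are microarrays, columns are the probes of one probeset.
    Returns (mu, alpha, residuals); alpha is re-centred to median zero.
    """
    n_arrays, n_probes = len(y), len(y[0])
    resid = [list(row) for row in y]
    row_eff = [0.0] * n_arrays
    col_eff = [0.0] * n_probes
    overall = 0.0
    for _ in range(max_iter):
        # Sweep the row (microarray) medians out of the residuals.
        row_med = [median(r) for r in resid]
        for i in range(n_arrays):
            row_eff[i] += row_med[i]
            resid[i] = [v - row_med[i] for v in resid[i]]
        shift = median(col_eff)
        overall += shift
        col_eff = [c - shift for c in col_eff]
        # Sweep the column (probe) medians out of the residuals.
        col_med = [median(resid[i][j] for i in range(n_arrays))
                   for j in range(n_probes)]
        for j in range(n_probes):
            col_eff[j] += col_med[j]
            for i in range(n_arrays):
                resid[i][j] -= col_med[j]
        shift = median(row_eff)
        overall += shift
        row_eff = [r - shift for r in row_eff]
        # Stop once a full sweep no longer changes anything.
        if max(abs(m) for m in row_med + col_med) < tol:
            break
    # Re-centre the probe effects so that their median is zero.
    shift = median(col_eff)
    alpha = [c - shift for c in col_eff]
    mu = [overall + shift + r for r in row_eff]
    return mu, alpha, resid


# Illustrative data: three microarrays, three probes, exactly additive.
mu_true = [8.0, 10.0, 9.0]
alpha_true = [-1.0, 0.0, 1.0]
y_example = [[m + a for a in alpha_true] for m in mu_true]
mu_hat, alpha_hat, resid_hat = median_polish(y_example)
print(mu_hat, alpha_hat)
```

For this exactly additive example the sketch returns the microarray and probe effects used to build the data, with all residuals equal to zero.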
2.2 Extrapolation strategy
The extrapolation strategy divides the set of microarrays into two distinct sets: the reference set used to generate reference sets of parameters for future processing and the future set of all other microarrays which are subsequently processed. The extrapolation strategy consists of four steps:
- RMA: RMA is applied to the reference set to obtain the probeset intensities of the reference set microarrays. The reference quantiles and reference probe effects are stored.
- Background correction: as in RMA, applied to the future set.
- Normalization: the background corrected probe level data from the future microarrays are quantile normalized to the reference quantiles.
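- Expression calculation: the probeset intensities of the future microarrays are estimated using model (1), assuming that the probe effects of the future microarrays are the same as the probe effects of the reference set. The estimated logarithmic intensity $\hat{\mu}_f$ of a probeset on a future array $f$ is:

$$\hat{\mu}_f = \operatorname{median}_j \left( y_{fj} - \hat{\alpha}_j \right) \qquad (2)$$

where $y_{fj}$ is the log2-transformed, background corrected and normalized intensity of probe $j$ on the future array and $\hat{\alpha}_j$ is the reference probe effect of probe $j$.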
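The normalization and expression calculation steps for a future microarray can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the RefPlus code. The function names are hypothetical, background correction and the log2 transform are left to the caller, ties are broken by position rather than averaged, and the future-array estimate is assumed to be the median over probes of the normalized log intensity minus the stored reference probe effect.

```python
from statistics import median


def reference_quantiles(ref_arrays):
    """Mean of the sorted values across reference arrays, one per rank."""
    sorted_arrays = [sorted(a) for a in ref_arrays]
    n_values = len(sorted_arrays[0])
    return [sum(a[k] for a in sorted_arrays) / len(sorted_arrays)
            for k in range(n_values)]


def normalize_to_reference(array, ref_q):
    """Give each value the stored reference quantile of the same rank."""
    order = sorted(range(len(array)), key=lambda k: array[k])
    out = [0.0] * len(array)
    for rank, k in enumerate(order):
        out[k] = ref_q[rank]
    return out


def extrapolate_probeset(y_future, alpha_ref):
    """Median over probes of y_fj - alpha_j, using stored probe effects."""
    return median(y - a for y, a in zip(y_future, alpha_ref))


# Illustrative values only: two reference arrays, one future array.
ref_q = reference_quantiles([[1.0, 3.0, 2.0], [2.0, 4.0, 6.0]])
future_norm = normalize_to_reference([10.0, 30.0, 20.0], ref_q)
mu_future = extrapolate_probeset(future_norm, [-1.0, 1.0, 0.0])
print(ref_q, future_norm, mu_future)
```

Because only the stored quantiles and probe effects are used, the reference microarrays are never touched when a future microarray is processed.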
2.3 Extrapolation averaging
Extrapolation averaging consists of repeated application of the extrapolation strategy using different reference sets and can be described in four steps:
- Randomly select n microarrays as a reference set; the remaining microarrays form the future set. Here n is the maximum number of microarrays that can be processed in one batch by RMA within the available computer memory.
- Apply the extrapolation strategy to this reference and future set.
- Repeat steps 1 and 2 several times.
- Calculate the probeset intensities as an average on the log2 scale of the gene expression profiles calculated in steps 1–3.
Any additional microarrays can be pre-processed by using the extrapolation strategy to calculate a gene expression profile based on the saved parameters from all of the reference sets and averaging these gene expression profiles across the reference sets.
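The averaging over reference sets can be sketched for a single probeset as below. This is a simplified Python illustration rather than the RefPlus implementation. The names `fit_reference` and `extrapolation_average`, the fixed number of polish iterations and the random seed are assumptions. The input is assumed to be already background corrected, normalized and log2-transformed, so the per-reference-set quantile normalization is deliberately left out.

```python
import random
from statistics import median


def fit_reference(y_ref, n_iter=10):
    """Compact median polish on the reference set: returns (mu, alpha)."""
    resid = [list(row) for row in y_ref]
    mu = [0.0] * len(resid)
    alpha = [0.0] * len(resid[0])
    for _ in range(n_iter):
        for i, row in enumerate(resid):      # sweep out microarray medians
            m = median(row)
            mu[i] += m
            resid[i] = [v - m for v in row]
        for j in range(len(alpha)):          # sweep out probe medians
            m = median(row[j] for row in resid)
            alpha[j] += m
            for row in resid:
                row[j] -= m
    shift = median(alpha)                    # centre probe effects at zero
    return [m + shift for m in mu], [a - shift for a in alpha]


def extrapolation_average(y, n_ref, n_repeats, seed=0):
    """Average probeset estimates over randomly chosen reference sets."""
    rng = random.Random(seed)
    n_arrays = len(y)
    totals = [0.0] * n_arrays
    for _ in range(n_repeats):
        # Step 1: draw a reference set; the rest form the future set.
        ref_idx = rng.sample(range(n_arrays), n_ref)
        # Step 2: fit the reference set, then extrapolate the future set.
        mu_ref, alpha = fit_reference([y[i] for i in ref_idx])
        estimates = dict(zip(ref_idx, mu_ref))
        for f in range(n_arrays):
            if f not in estimates:
                estimates[f] = median(v - a for v, a in zip(y[f], alpha))
        for i in range(n_arrays):
            totals[i] += estimates[i]
    # Step 4: average the log2-scale estimates over all repeats.
    return [t / n_repeats for t in totals]


# Illustrative data: five microarrays, three probes, exactly additive.
mu_true = [8.0, 10.0, 9.0, 11.0, 7.0]
alpha_true = [-1.0, 0.0, 1.0]
y_all = [[m + a for a in alpha_true] for m in mu_true]
averaged = extrapolation_average(y_all, n_ref=3, n_repeats=4)
print(averaged)
```

With exactly additive data every reference set gives the same estimates, so the average reproduces the values used to build the example. With real data the estimates differ between reference sets and the averaging reduces that variation.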
3 CONCLUSIONS
The RMA algorithm has been found to have good performance characteristics in the pre-processing of Affymetrix gene expression data (Irizarry et al., 2006). A limitation of RMA is that the probeset intensities change when the analysis set of microarrays changes. This can be an issue when a study is analysed at interim stages, as the processed data for the same samples will vary between analyses. This property also makes predictive models difficult to apply: additional microarrays must be pre-processed before the model can be applied to them, but without changing the model parameters. For large sets of microarrays, computer memory can also limit the use of RMA. The extrapolation strategy and extrapolation averaging algorithms implemented in the R package RefPlus provide an easily applied solution to these issues. An evaluation using the data from Bhattacharjee et al. (2001), available with the R package on the Bioconductor website, shows that both algorithms provide a close approximation to RMA, even in challenging situations.
ACKNOWLEDGEMENTS
The authors would like to acknowledge colleagues within AstraZeneca who provided valuable suggestions and comments, and thank the authors of Bhattacharjee et al. (2001) who permitted the use of their microarray data.
Conflict of Interest: none declared.
FOOTNOTES
Associate Editor: David Rocke
Received on May 25, 2007; revised on June 25, 2007; accepted on July 4, 2007
REFERENCES
Affymetrix (2005) Guide to Probe Logarithmic Intensity Error (PLIER) Estimation.
Bhattacharjee A, et al. (2001) Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl Acad. Sci. USA, 98, 13790–13795.
Goldstein DR. (2006) Partition resampling and extrapolation averaging: approximation methods for quantifying gene expression in large numbers of short oligonucleotide arrays. Bioinformatics, 22, 2364–2372.
Irizarry RA, et al. (2003) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4, 249–264.
Irizarry RA, et al. (2006) Comparison of Affymetrix GeneChip expression measures. Bioinformatics, 22, 789–794.
Katz S, et al. (2006) A summarization approach for Affymetrix GeneChip data using a reference training set from a large, biologically diverse database. BMC Bioinformatics, 7, 464.
