Bioinformatics Advance Access originally published online on September 28, 2004
Bioinformatics 2005 21(4):554-556; doi:10.1093/bioinformatics/bti052
Bioinformatics vol. 21 issue 4 © Oxford University Press 2005; all rights reserved.
arrayMagic: two-colour cDNA microarray quality control and preprocessing
Department of Molecular Genome Analysis, German Cancer Research Center INF 580, Heidelberg, 69120, Germany
*To whom correspondence should be addressed.
| Abstract |
|---|
Summary: arrayMagic is a software package for quality control and preprocessing of two-colour cDNA microarray data. The automated analysis pipeline comprises data import, normalization, replica merging, quality diagnostics and data export. The script-based processing combines reproducibility and flexibility at high throughput and provides quality-assured, preprocessed microarray data for high-level follow-up analysis.
Availability: The R package arrayMagic is available under the BSD license at http://www.bioconductor.org
Contact: a.buness{at}dkfz.de
Supplementary information: The package contains documentation in the form of manual pages and a vignette with a guided tour of a typical workflow.
| INTRODUCTION |
|---|
Two-colour cDNA microarray technology has evolved into a routine laboratory procedure. Our motivation in implementing arrayMagic was to deal with the large amount of data generated by microarray projects in an efficient, reliable and reproducible manner. We focused on preprocessing and quality assurance, leaving out high-level analysis, which has to be addressed specifically for each project.
The main design goal was to allow for the rapid construction of customized quality assessment and control (QA/QC) and preprocessing pipelines for such projects from a small set of building blocks. arrayMagic bridges the gap between the image quantification software and subsequent statistical and explorative analyses like testing for differential expression or classification. It simplifies the task of building reproducible processing pipelines: even for idiosyncratic experimental designs and non-trivial combinations and selections of the data, the whole procedure from raw data to normalized, quality-controlled, annotated and summarized data is documented in a concise script that can be re-run or extended at any time. The compendium technology (Gentleman, 2004) can be used to produce distributable objects containing the data as well as revivable documents reporting the processing.
We aimed to integrate normalization methods, quality scores and visualizations that had been reported previously. In addition, we provide tools for dealing with different microarray layouts within one experiment and for merging data from replicate probes or hybridizations. The researcher obtains an instant overview of the quality of the experiment.
| NORMALIZATION |
|---|
Normalization strategies for two-colour microarrays can be divided into two groups: adjustment of the colour channels or of the log-ratios. Moreover, depending on the experimental design and the objectives either a single channel intensity or a log-ratio-based analysis might be more appropriate.
The tool offers log-ratio-based normalization by means of the loess method (Yang et al., 2002) and direct intensity-based normalization by means of the vsn (Huber et al., 2002) and quantile normalization (Bolstad et al., 2003) methods. We will also use the terms log-ratios and log-transformed intensities for the data resulting from the vsn method. Groups of hybridizations, subsets of spots (e.g. by grid, print-tip or PCR plate) and colour channels can each be normalized separately. Plots characterizing the distributions of the log-ratios and colour channels before and after normalization are generated (Fig. 1b).
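To make the idea behind intensity-based quantile normalization concrete, the following is a minimal sketch in plain Python. It illustrates the general technique (forcing every column to share the same empirical distribution) and is not arrayMagic's implementation; the function name `quantile_normalize`, the toy intensities and the tie handling (ties broken by position) are assumptions made for illustration.

```python
def quantile_normalize(columns):
    """Quantile-normalize equally long intensity columns.

    Each column holds the intensities of one channel or array, with
    probes in the same order. Hypothetical helper for illustration.
    """
    n = len(columns[0])
    if any(len(col) != n for col in columns):
        raise ValueError("all columns must have the same length")

    # Reference distribution: mean of the k-th smallest value across columns.
    sorted_cols = [sorted(col) for col in columns]
    reference = [sum(col[k] for col in sorted_cols) / len(columns) for k in range(n)]

    normalized = []
    for col in columns:
        # Indices of the column ordered by value (ties broken by position).
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            # The entry with a given rank receives the reference value of that rank.
            out[i] = reference[rank]
        normalized.append(out)
    return normalized


# Toy intensities for two colour channels (made-up numbers).
red = [5.0, 2.0, 3.0, 4.0]
green = [4.0, 1.0, 4.5, 2.0]
normalized_red, normalized_green = quantile_normalize([red, green])
```

After normalization both channels contain the same set of values, and only the assignment of values to probes differs, which is what "making the data conform" amounts to for this method.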
| QUALITY CONTROL AND ASSESSMENT |
|---|
Quality-assured data are a prerequisite for any reliable high-level analysis. In addition, quality control makes it possible to monitor and improve the laboratory procedures.
The quality of hybridizations is best assessed in the context of normalization. In a model-based approach like vsn, the model is a summary of past experience and our expectations on the data. Thus, it can be used to identify hybridizations or groups of measurements that do not fit. Other methods like loess or quantile normalization place more emphasis on making the data conform in any situation. In these cases, statistics of the data distribution can be calculated (e.g. location and scale of the distribution of normalized log-ratios) and compared against expectations. Moreover, as long as the majority of the data are assumed to be acceptable, outlier detection methods can be used for quality control.
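As an illustration of comparing distribution statistics across hybridizations, the sketch below computes a location (median) and a scale (median absolute deviation) of the normalized log-ratios for each hybridization and flags those whose scale is unusual relative to the rest of the experiment. The helper names, the toy data and the cut-off `k = 3` are assumptions chosen for this example; they are not taken from the package, and a suitable cut-off depends on the experiment.

```python
from statistics import median


def mad(values):
    """Median absolute deviation (unscaled)."""
    values = list(values)
    center = median(values)
    return median([abs(v - center) for v in values])


def summarize(hybridizations):
    """Map each hybridization name to (location, scale) of its log-ratios."""
    return {name: (median(q), mad(q)) for name, q in hybridizations.items()}


def flag_by_scale(summary, k=3.0):
    """Flag hybridizations whose scale lies more than k MADs from the typical scale."""
    scales = {name: scale for name, (_, scale) in summary.items()}
    typical = median(scales.values())
    spread = mad(scales.values())
    return sorted(name for name, s in scales.items() if abs(s - typical) > k * spread)


# Toy normalized log-ratios; hyb5 is deliberately much noisier than the others.
hybridizations = {
    "hyb1": [-0.2, -0.1, 0.0, 0.1, 0.2],
    "hyb2": [-0.3, -0.1, 0.0, 0.1, 0.3],
    "hyb3": [-0.2, -0.15, 0.0, 0.15, 0.2],
    "hyb4": [-0.25, -0.12, 0.0, 0.12, 0.25],
    "hyb5": [-2.0, -1.0, 0.0, 1.0, 2.0],
}
summary = summarize(hybridizations)
flagged = flag_by_scale(summary)
```

The cut-off is defined relative to the other hybridizations rather than as a fixed number, so the same code adapts to experiments with different overall noise levels, provided the majority of the data are acceptable.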
Visual inspection of the data is supported by spatial false-colour representations of the foreground and background intensities and of the log-ratios. This makes it possible to detect scratches and artefacts (Fig. 1a). Most notably, the spatial plots of the normalized data are useful for assessing the necessity of background correction and for assuring spatial homogeneity of the data.
Several quality scores are calculated, stored in a report file and, in part, visualized. These scores include spot replicate concordance, the correlation of the two colour channels and a robust measure of noise $W$ for each hybridization. $W$ is defined as the median absolute deviation of the normalized log-ratios $q_i$, i.e. $W = \operatorname{mad}_i(q_i) = \operatorname{median}_i\left(\left|q_i - \operatorname{median}_j(q_j)\right|\right)$. Because it is based on medians, a minority of differentially expressed genes should not disturb $W$.
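The noise measure W can be sketched in a few lines of plain Python. The function name `noise_w` and the toy log-ratios are assumptions for illustration; the sketch follows the unscaled median absolute deviation as written in the text, whereas R's `mad` function multiplies by a consistency constant of 1.4826 by default.

```python
from statistics import median


def noise_w(log_ratios):
    """Median absolute deviation of the normalized log-ratios of one hybridization."""
    center = median(log_ratios)
    return median([abs(q - center) for q in log_ratios])


# Toy log-ratios of a hybridization in which no gene changes markedly.
unchanged = [-0.2, -0.1, -0.1, 0.0, 0.1, 0.1, 0.2]

# The same hybridization with two strongly differentially expressed genes added.
with_changes = unchanged + [4.0, -5.0]

w_unchanged = noise_w(unchanged)
w_with_changes = noise_w(with_changes)
```

In this toy example both values of W coincide, which illustrates why a small number of strongly regulated genes leaves the measure essentially unaffected.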
We do not find it practical to define universally applicable thresholds on quality scores. The scores should be evaluated not at the level of a single hybridization, but in the context of all data in the experiment. In our experience, this has been very useful for detecting outliers in large-scale experiments. In particular, a global view of all pairwise similarities between hybridizations, as shown in Figure 1c, has proved to be useful.
For two arrays $a$ and $b$, we define a similarity score $S_{ab} = \operatorname{mad}_i(x_{ia} - x_{ib})$, where $x_{ia}$ can be the log-ratio of the $i$-th probe on the $a$-th array, or the log-transformed normalized intensity of an individual colour channel. Especially in the case of biologically related samples, this is an informative measure of similarity.
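A minimal sketch of the pairwise score, assuming that it is the median absolute deviation of the probe-wise differences between two arrays, could look as follows in plain Python. The function names, array labels and values are hypothetical, and the arrays are assumed to list the same probes in the same order.

```python
from itertools import combinations
from statistics import median


def mad(values):
    """Median absolute deviation (unscaled)."""
    center = median(values)
    return median([abs(v - center) for v in values])


def similarity_scores(arrays):
    """Return {(a, b): S_ab} for every unordered pair of arrays.

    `arrays` maps an array name to its per-probe values (log-ratios or
    log-transformed normalized intensities), probes in identical order.
    """
    scores = {}
    for a, b in combinations(sorted(arrays), 2):
        differences = [xa - xb for xa, xb in zip(arrays[a], arrays[b])]
        scores[(a, b)] = mad(differences)
    return scores


# Toy data: B equals A plus a constant offset, C is unrelated to both.
arrays = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [1.5, 2.5, 3.5, 4.5, 5.5],
    "C": [5.0, 1.0, 4.0, 2.0, 3.0],
}
scores = similarity_scores(arrays)
```

Smaller values indicate more similar arrays. Because the deviations are taken around the median difference, a constant offset between two arrays does not contribute to the score, as the pair A and B shows.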
| IMPLEMENTATION |
|---|
The software is implemented in the R language (Ihaka and Gentleman, 1996) and integrates into the Bioconductor project (Gentleman et al., 2004, http://www.bepress.com/bioconductor/paper1), an open source software project for bioinformatics. It uses building blocks from the packages Biobase, vsn and limma. The software works on Linux, Windows and MacOS.
| CONCLUSION |
|---|
The open source software tool arrayMagic facilitates the analysis of two-colour cDNA microarray data. It aims to provide quality-assured and normalized data. The script-based pipeline supports reproducible, batch-like processing. The workflow starts with the result files of image quantification. Several quality scores and diagnostics are calculated and visualized, which together offer a broad view of data quality. The processed data can be exported as an HTML file or as a tab-delimited file with spot and sample annotation and may serve as input for follow-up analysis in other commonly used tools. High-level follow-up analysis within the framework of R and Bioconductor is supported by an adequate representation of the data. Documentation of all functionality and a step-by-step example following a typical workflow are part of the package.
Received on July 17, 2004; accepted on September 22, 2004
| REFERENCES |
|---|
Bolstad, B.M., Irizarry, R.A., Astrand, M., Speed, T.P. (2003) A comparison of normalization methods for high density oligonucleotide array data based on bias and variance. Bioinformatics, 19, 185-193.
Gentleman, R. (2004) Reproducible research: a bioinformatics case study. Stat. Appl. Genet. Mol. Biol., 3, .
Gentleman, R., Carey, V.J., Bates, D.J., Bolstad, B.M., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J., et al. (2004) Bioconductor: open software development for computational biology and bioinformatics. Bioconductor Project Working Papers. Working Paper 1.
Huber, W., von Heydebreck, A., Sültmann, H., Poustka, A., Vingron, M. (2002) Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics, 18 (Suppl. 1), S96-S104.
Ihaka, R. and Gentleman, R. (1996) R: a language for data analysis and graphics. J. Comput. Graph. Statist., 5, 299-314.
Yang, Y.H., Dudoit, S., Luu, P., Lin, D.M., Peng, V., Ngai, J., Speed, T.P. (2002) Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res., 30, e15.
