Bioinformatics Advance Access originally published online on June 22, 2007
Bioinformatics 2007 23(17):2339-2341; doi:10.1093/bioinformatics/btm321
MAnGO: an interactive R-based tool for two-colour microarray analysis
1Centre de Génétique Moléculaire, CNRS UPR2167 and Gif/Orsay DNA Microarray Platform (GODMAP), 91190 Gif-sur-Yvette, 2Univ Paris-Sud 11, 91405 Orsay and 3Univ Pierre and Marie Curie, 75005 Paris, France
*To whom correspondence should be addressed.
ABSTRACT
Summary: MAnGO (Microarray Analysis at the Gif/Orsay platform) is an interactive R-based tool for the analysis of two-colour microarray experiments. It is a compilation of various methods, which allows the user (1) to control data quality by detecting biases with a large number of visual representations, (2) to pre-process data (filtering and normalization) and (3) to carry out differential analyses. MAnGO is not only a turn-key tool oriented towards biologists, but also a flexible and adaptable R script oriented towards bioinformaticians.
Availability: http://bioinfome.cgm.cnrs-gif.fr/
Contact: reymond{at}cgm.cnrs-gif.fr
1 MOTIVATION
Many software packages are currently available for microarray data processing and analysis. However, they often remain focused on a specific stage of the analysis or on specific data (slide type, experimental design or organism).
Despite many improvements in the protocols for microarray experiments, technical biases remain and must be taken into account throughout data processing. It is therefore necessary to have routines that control the quality of the data before and after each pre-processing step, to ensure that these biases are corrected prior to further analysis.
User experience and the expectations of the scientific community point to a growing need for an integrated and flexible tool that combines a wide range of methods for background subtraction, filtering, normalization and differential analysis with many diagnostic plots to validate the choice of the most appropriate method at each stage of the analysis. MAnGO was developed to meet this need. It is a tool (1) that is simple and flexible, (2) that includes various methods adapted to a specific question, (3) that can be used at any stage of a microarray data analysis, (4) that helps to carry out a complete differential analysis after appropriate pre-processing, (5) that exports data for post-processing tasks and finally (6) that can be easily modified by an advanced R user, who can integrate specific methods adapted to user requirements.
2 DESCRIPTION
2.1 General description
MAnGO allows a complete two-colour microarray data analysis from raw data files. MAnGO guides the user throughout the analysis protocol, from data import to data pre-processing and then to differential analysis. Besides file selection, the information provided in the import stage is used to adapt the analysis to the data. Prior to any processing, an overview stage allows the estimation of data quality and the identification of possible biases. Then, the pre-processing stage corrects data for the observed biases and noise, through background subtraction, filtering and normalization. For differential expression assessment, the analysis can be performed either on each slide separately (Speed, 2003) or across multiple slides (Smyth, 2004). At each step, graphics and tables allow the user to select the appropriate procedure for the data analysis. Throughout the analysis, the MAnGO script generates files and graphics that are stored in a predefined results folder. At the end, a summary of the overall procedure is saved as an HTML report.
2.2 Architecture description
MAnGO is an R script (R Development Core Team, 2006) that uses the interactive functionality of R (and, optionally, Tcl/Tk widgets) to let the user enter microarray data, choose between processing methods and display graphics. As the analysis progresses, the script calls functions from several R packages (sma, R2HTML and statmod) and Bioconductor packages (Biobase, Dyndoc, limma, reposTools, tkWidgets, vsn and widgetTools), as well as the functions contained in Rmango1, the package that we specifically designed for MAnGO (Fig. 1).
MAnGO data objects are based on the limma package RG/MAList objects (Smyth and Speed, 2003) with specific fields added to conveniently manage the experimental design and the replicate spots. The script preserves all the data objects and the R environment is automatically saved at the end of each stage. As a result, the analysis can be restarted at any stage.
2.3 Detailed description of the analysis process
2.3.1 Import
In this stage, input files are selected and data are extracted. To analyse the data appropriately, additional information about the experiment (experimental design, control spot definition, spot annotations and array design) must also be specified. This information is required to handle, during the analysis, the different levels of replication (biological, technical, dye swap and spot replicates). At the end of the import stage, the script selects the analysis protocol best suited to the experimental design. The script is designed for the gpr (GenePix Result) data file format, but it can be adapted to other formats with only minor changes. MAnGO requires only the original data files, without any re-formatting, to perform an analysis.
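As an illustration of what extracting data from a gpr file involves, the following is a minimal Python sketch, not the MAnGO import code (which is written in R). It assumes the usual tab-delimited layout of GenePix files: a first line starting with `ATF`, a second line giving the number of header records, the `key=value` header records themselves, and then a table with a column header row. The function name `read_gpr`, the sample content and the column names used here are assumptions made for the example.

```python
import csv


def read_gpr(text):
    """Parse the text of a gpr file into (header dict, list of row dicts).

    Assumed layout: line 1 starts with "ATF", line 2 starts with the number
    of header records, then the records, then a tab-delimited table.
    """
    lines = text.splitlines()
    if len(lines) < 2 or not lines[0].startswith("ATF"):
        raise ValueError("not an ATF/GPR file")
    n_header = int(lines[1].split()[0])
    header = {}
    for record in lines[2:2 + n_header]:
        # Header records look like "key=value", usually wrapped in quotes.
        key, _, value = record.strip().strip('"').partition("=")
        header[key] = value
    # The remaining lines form the spot table; the first one holds column names.
    rows = list(csv.DictReader(lines[2 + n_header:], delimiter="\t"))
    return header, rows


# Hypothetical two-spot file used only to exercise the parser.
sample = "\n".join([
    "ATF\t1.0",
    "2\t5",
    '"Type=GenePix Results 3"',
    '"Creator=hypothetical example"',
    "\t".join(['"Name"', '"F635 Median"', '"B635 Median"',
               '"F532 Median"', '"B532 Median"']),
    "\t".join(['"geneA"', "1200", "100", "600", "90"]),
    "\t".join(['"geneB"', "300", "110", "900", "95"]),
])

header, rows = read_gpr(sample)
# Background-subtracted red-channel intensities for each spot.
red_minus_background = [
    int(r["F635 Median"]) - int(r["B635 Median"]) for r in rows
]
```

Real files carry many more columns and header records; the sketch only shows why no re-formatting of the original file is needed before the intensities can be used.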
2.3.2 Overview
In this stage, raw data are visualized through descriptive statistics and diagnostic plots (Dudoit et al., 2002) so that the overall quality of the data can be evaluated. The plots include a correlation tree, box-plots, intensity or log-ratio histograms, density plots, background and foreground intensity images, MA-plots and print-tip MA-plots. Each plot provides different types of quality information. For example, when a spatial bias is identified in a slide having only one print-tip block, artificial blocks can be created in the pre-processing stage.
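The quantities shown on an MA-plot can be computed directly from the two channel intensities. The sketch below uses the standard definitions M = log2(R/G) and A = (1/2) log2(R·G), where R and G are the red and green intensities of a spot. It is written in Python for illustration only; the function name `ma_values` and the choice to return `None` for non-positive intensities are assumptions, not MAnGO behaviour.

```python
import math


def ma_values(red, green):
    """Return (M, A) lists for paired red and green spot intensities.

    M is the log2 ratio and A the mean log2 intensity. Spots with a
    non-positive intensity in either channel yield None in both lists.
    """
    m_values, a_values = [], []
    for r, g in zip(red, green):
        if r <= 0 or g <= 0:
            m_values.append(None)
            a_values.append(None)
            continue
        m_values.append(math.log2(r / g))
        a_values.append(0.5 * math.log2(r * g))
    return m_values, a_values


# Three hypothetical spots: one red-biased, one balanced, one unusable.
m_example, a_example = ma_values([1024, 256, 0], [256, 256, 500])
```

Plotting M against A for every spot of a slide gives the MA-plot; an intensity-dependent trend away from M = 0 suggests a dye bias that normalization should remove.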
2.3.3 Pre-processing
This stage consists of a series of procedures that correct the data and make slides comparable (Smyth and Speed, 2003). At each step, diagnostic plots and explanations are provided to help the user choose the appropriate method. Pre-processing starts with a background correction step, using limma methods. Then, bad spots can be eliminated and certain types of spots (such as over-represented expressed controls or saturated spots) can be excluded from the normalization computations. After filtering, an intra-slide normalization can be performed using median, loess or spline methods, on all data or only on controls, by block (print tip) or on the whole slide (Smyth and Speed, 2003). An inter-slide normalization can also be performed, either by scaling or by adjusting the quantile distribution (Yang et al., 2002). Then, post-filtering of outlier or missing values in replicate spots can be applied. Finally, an overall pre-processing evaluation is provided by control spot profile visualization, hierarchical clustering of slides and comparison of inter-slide variances.
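The simplest of the intra-slide options, median normalization, can be sketched as follows. The example is in Python for illustration and is not the limma or MAnGO implementation: it subtracts the median log-ratio from every spot, either over the whole slide or separately within each print-tip block. The function name `median_normalize` and the handling of missing values as `None` are assumptions.

```python
from collections import defaultdict
from statistics import median


def median_normalize(m_values, blocks=None):
    """Centre log-ratios on zero by subtracting a median.

    With blocks=None one median is used for the whole slide; otherwise
    blocks gives the print-tip block of each spot and each block is
    centred on its own median. Missing values (None) are left as None.
    """
    if blocks is None:
        blocks = [0] * len(m_values)
    per_block = defaultdict(list)
    for m, b in zip(m_values, blocks):
        if m is not None:
            per_block[b].append(m)
    centres = {b: median(values) for b, values in per_block.items()}
    return [
        None if m is None else m - centres[b]
        for m, b in zip(m_values, blocks)
    ]


# Hypothetical log-ratios for six spots printed by two print tips.
m_raw = [1.0, 2.0, 3.0, -1.0, 0.0, 1.0]
tips = [1, 1, 1, 2, 2, 2]
m_by_slide = median_normalize(m_raw)
m_by_block = median_normalize(m_raw, tips)
```

In this toy case the two blocks have different medians, so block-wise centring removes a print-tip offset that whole-slide centring leaves in place. Loess and spline normalization follow the same idea but subtract an intensity-dependent curve rather than a constant.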
2.3.4 Single slide analysis
This stage consists of identifying, for each slide, differentially expressed genes by three different methods for single slide analysis (Speed, 2003) available in the package sma (stat.Newton, stat.Chen and stat.ChurSap). As a rule, a gene is considered differentially expressed if it is identified by all three methods. This stage is useful prior to the inter-slide analysis for evaluating data quality: control spot behaviour can be examined, and reproducibility can be assessed by visualizing the differentially expressed genes common to replicate slides and/or to dye-swap slides.
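The consensus rule described above amounts to a set intersection. The following Python sketch assumes that each method has already produced a list of gene identifiers; the statistical tests themselves are not reproduced here, and the function name `consensus_calls` and the gene identifiers are hypothetical.

```python
def consensus_calls(*calls):
    """Return the set of gene identifiers present in every input collection."""
    if not calls:
        return set()
    return set.intersection(*(set(c) for c in calls))


# Hypothetical calls from the three single slide methods on one slide.
newton_calls = ["g1", "g2", "g3"]
chen_calls = ["g2", "g3", "g4"]
chursap_calls = ["g3", "g2"]
slide_consensus = consensus_calls(newton_calls, chen_calls, chursap_calls)
```

The same function can be applied to the consensus sets of replicate or dye-swap slides to list the genes they have in common.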
2.3.5 Inter-slide analysis
This stage provides an adaptation of the limma package differential analysis procedure. An empirical Bayes linear modelling approach is used to compute a moderated t-statistic (Smyth, 2004). The user simply selects the direct and/or indirect comparisons to be made and the analysis is then performed automatically (the design and contrast matrices required for the linear model are generated from the import information and the selected comparisons). Different P-value adjustment methods (Bonferroni, FDR, etc.) can then be used to control the false positive error rate. A histogram of the P-values helps the user to choose the adjustment method. An MA-plot (with the mean expression ratios between the two conditions and their confidence intervals/SDs) and a volcano plot of the most significant differentially expressed genes are displayed for each comparison.
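To make the difference between the adjustment methods concrete, the sketch below implements two of them in Python: the Bonferroni correction and the Benjamini-Hochberg step-up procedure, which is one common way of controlling the FDR. Whether MAnGO offers exactly this FDR variant is an assumption; the function names and the example P-values are hypothetical.

```python
def bonferroni(pvalues):
    """Multiply each P-value by the number of tests, capped at 1."""
    n = len(pvalues)
    return [min(1.0, p * n) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    n = len(pvalues)
    # Visit P-values from largest to smallest so that a running minimum
    # enforces monotonicity of the adjusted values.
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for position, index in enumerate(order):
        rank = n - position  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw P-values for four genes.
raw_p = [0.01, 0.04, 0.03, 0.5]
p_bonferroni = bonferroni(raw_p)
p_bh = benjamini_hochberg(raw_p)
```

Bonferroni controls the probability of any false positive and is therefore the more conservative of the two; the Benjamini-Hochberg values are never larger than the Bonferroni ones, which is why an FDR approach usually retains more genes at the same threshold.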
3 DISCUSSION AND PERSPECTIVES
Developed in the context of a microarray platform, MAnGO supports a great diversity of data. It has been tested on data from Agilent Technologies® human, mouse and yeast slides, from other purchased pangenomic slides and custom slides, with or without controls, and on data resulting from direct and indirect designs. It has also been successfully tested on large amounts of data.
By manipulating data locally, the script ensures the protection of confidential data and avoids the time-consuming data import and export required by web services (Rainer et al., 2006). Moreover, a script solution using an open source R package offers considerable flexibility, as it can be modified easily. An advanced user can add new methods and functionalities to the script, as long as the structure of the data object is preserved. Such a user can also take an existing data object to test other methods directly in the R environment and, if needed, restart the script to continue with the next step using the modified data. Data can also be exported for other post-processing analyses.
Future modifications of MAnGO will address the following main aspects: (1) provision of quantitative quality information by proposing metrics to identify bad spots and bad slides, (2) inclusion of new methods better suited to custom slides or to experiments using a genomic reference and (3) addition of other analysis methods such as ANOVA or clustering.
ACKNOWLEDGEMENTS
The authors thank all the members and users of the GODMAP platform and the Bioinfome team of the Centre de Génétique Moléculaire for their help, tests of MAnGO and remarks. This work was supported by the CNRS, the Region Ile-de-France, Sanofi-Aventis and the Fondation de France.
Conflict of Interest: none declared.
FOOTNOTES
Associate Editor: David Rocke
Received on April 15, 2007; revised on June 5, 2007; accepted on June 8, 2007
REFERENCES
Dudoit S, et al. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Stat. Sin. (2002) 12:111–139.
R Development Core Team. R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria (2006). http://www.R-project.org.
Rainer J, et al. CARMAweb: comprehensive R- and Bioconductor-based web service for microarray data analysis. Nucleic Acids Res. (2006) 34(Web Server issue):W498–W503.
Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol. (2004) 3:Article 3. Epub 2004 Feb 12.
Smyth GK, Speed TP. Normalization of cDNA microarray data. Methods (2003) 31:265–273.
Speed TP. Statistical analysis of gene expression microarray data. In: Speed T (ed.) Chapman & Hall/CRC, Boca Raton (2003).
Yang YH, et al. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res. (2002) 30:e15.