Bioinformatics Advance Access originally published online on September 27, 2005
Bioinformatics 2005 21(21):4069-4070; doi:10.1093/bioinformatics/bti663
A friendly statistics package for microarray analysis
1 Department of Genetics, University of Cambridge, UK
2 Department of Pathology, University of Cambridge, UK
3 Cambridge Computational Biology Institute at the Department of Applied Mathematics and Theoretical Physics, University of Cambridge, UK
*To whom correspondence should be addressed.
| Abstract |
|---|
Summary: The friendly statistics package for microarray analysis (FSPMA) is a tool that aims to bridge the gap between ease of use and analytical power. FSPMA is a platform-independent R package that allows efficient exploration of microarray data without the need for computer programming. Analysis is based on a mixed model ANOVA library (YASMA) that was extended to allow more flexible comparisons and other useful operations such as k nearest neighbour imputation and spike-based normalization. Processing is controlled by a definition file that specifies all the steps necessary to derive analysis results from quantified microarray data. In addition to providing analysis without programming, the definition file also serves as exact documentation of all the analysis steps.
Availability: The library is available under the GPL 2 license and is provided, together with additional information, at http://www.ccbi.cam.ac.uk/software/psyk/software.html#fspma
Contact: peter{at}sykacek.net
| INTRODUCTION |
|---|
The number of analysis packages for microarray data is vast, and yet one still faces the problem of how best to analyse any particular dataset. Easy-to-use tools are appealing, but many are available only commercially. More elaborate packages such as LIMMA (Smyth et al., 2003) or YASMA (Wernisch et al., 2003) require programming skills and are thus out of reach for non-specialists. The friendly statistics package for microarray analysis (FSPMA) aims to bridge the gap between tools that are simple to use and tools that offer powerful analysis. It is a set of R scripts based on YASMA that makes it possible to explore data efficiently without computer programming. The entire process is controlled by a definition file that specifies all steps needed to generate analysis results from microarray data. The analysis is centred around an existing tool for mixed model ANOVA (analysis of variance) for balanced experiments. Mixed model ANOVA was chosen because it allows correct treatment of nested effects, which would otherwise be regarded as independent, identically distributed samples. We thus obtain more realistic P-values in the ANOVA table and in subsequent tests. The library introduced here provides some useful extensions of the original YASMA package. To allow for more general comparisons, gene ranking is based on contrasts. We provide a k nearest neighbour (knn)-based method to impute missing values, as well as spike-based normalization, which can equally well be used with housekeeping genes. The tool operates on quantified single- and two-channel microarray data, whether normalized or not, as long as the experiment is a balanced reference design. In addition to providing analysis without programming, the definition file serves as exact documentation of all analysis steps, which is important in its own right.
| DATA LOADING |
|---|
Analysis requirements in a microarray laboratory can be rather diverse. Experiments are typically done with single- or two-colour arrays, and sometimes the data have been preprocessed, e.g. by conversion to log ratios or by application of a favourite normalization method. To obtain a generic solution, these various data sources have to be standardized. This is done by having default values for unavailable channels, a boolean dye swap indicator for each file and a flag indicating whether the data are log transformed or not. File headers are ignored and data columns are identified via their column names, so column order is unimportant and files with heterogeneous structures can be combined in one analysis run.
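The standardization described above can be illustrated with a minimal Python sketch. It is not FSPMA's R implementation: the function name, its parameters, the `ID` column and the choice of log2(1) = 0 as the default for an unavailable channel are all assumptions made for illustration.

```python
import csv
import math

def load_log_ratios(text, red_col, green_col=None, id_col="ID",
                    dye_swap=False, is_log=False):
    """Return {gene id: log2 ratio} for one tab-delimited quantification file.

    All parameter names are hypothetical; they do not mirror FSPMA's options.
    """
    lines = text.splitlines()
    # Skip any preamble: the table starts at the first line naming red_col.
    start = next(i for i, line in enumerate(lines)
                 if red_col in line.split("\t"))
    ratios = {}
    # Columns are looked up by name, so their order in the file is irrelevant.
    for row in csv.DictReader(lines[start:], delimiter="\t"):
        red = float(row[red_col])
        if not is_log:
            red = math.log2(red)
        if green_col is None:
            green = 0.0  # assumed default for an unavailable channel: log2(1)
        else:
            green = float(row[green_col])
            if not is_log:
                green = math.log2(green)
        ratio = red - green
        # A dye-swapped slide measures the inverse ratio, so flip the sign.
        ratios[row[id_col]] = -ratio if dye_swap else ratio
    return ratios

example = "quantifier v1\nID\tF635\tF532\ng1\t8\t2\ng2\t4\t4\n"
print(load_log_ratios(example, "F635", "F532"))
```

Because every file is reduced to the same gene-to-log-ratio mapping, single-channel, two-channel, dye-swapped and pre-logged files can be mixed within one run.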
| IMPUTING AND NORMALIZATION |
|---|
To accommodate spots flagged as poor quality or missing information, the library provides an implementation of knn imputation (Troyanskaya et al., 2001). Alternatively, all such spots can be handled by removing their corresponding genes. In terms of normalization, the library uses YASMA's functionality to provide removal of within-slide location and scale, or removal of the amplitude-dependent mean by subtracting a loess fit. In addition, FSPMA allows normalization based on RNAs of known concentration, spiked into the RNA samples, where the spike residual log ratio (i.e. the difference between the actual and the theoretical log ratio of spike concentration) is used to normalize the data. Options for spike-based normalization include removing a global mean or a loess fit, and/or adjusting the variance of each slide to the global variance across all slides. The loess fit can be based on spot position (spatial effects), subgrid number (pin effect) or spot intensity, as well as interactions of the above.
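The two operations can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the library's code: `knn_impute` uses a root mean squared distance over shared slides and an unweighted neighbour average, and `spike_normalize` covers only the simplest option named above, removal of the mean spike residual, applied here per slide. Both function names and both choices are hypothetical.

```python
import math

def knn_impute(matrix, k=2):
    """Fill None entries of a genes x slides matrix from the k nearest genes."""
    def distance(a, b):
        # Root mean squared difference over slides observed in both genes.
        shared = [(x - y) ** 2 for x, y in zip(a, b)
                  if x is not None and y is not None]
        return math.sqrt(sum(shared) / len(shared)) if shared else math.inf

    imputed = [list(row) for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbours must have a measurement on slide j.
            candidates = sorted(
                (distance(row, other), other[j])
                for n, other in enumerate(matrix)
                if n != i and other[j] is not None)
            nearest = [v for d, v in candidates[:k] if d < math.inf]
            if nearest:  # leave the gap if no comparable gene exists
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed

def spike_normalize(log_ratios, spikes):
    """Subtract the mean spike residual log ratio from one slide.

    log_ratios: {spot: observed log ratio}
    spikes:     {spot: theoretical log ratio of the spike concentrations}
    """
    residuals = [log_ratios[s] - expected for s, expected in spikes.items()]
    offset = sum(residuals) / len(residuals)
    return {spot: value - offset for spot, value in log_ratios.items()}
```

The loess-based and variance-adjusting options described in the text would replace the constant `offset` with a fitted, covariate-dependent correction; they are not shown here.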
| ANOVA AND CONTRAST BASED RANKING |
|---|
We chose YASMA (Wernisch et al., 2003) for ANOVA calculations because it implements a mixed effects ANOVA including the gene effect. FSPMA requires that all non-gene effects of an experiment are specified. Each effect is either random or fixed. Random effects are variables where the experiment does not contain instances of all possible levels (e.g. biological replicate). Fixed effects are those variables where all possible levels are part of the experiment, or where other levels are not of interest (e.g. time point in a longitudinal study). The description of an experiment is automatically converted to an ANOVA model equation, where each effect is considered hierarchically and modelled as an interaction term with the previous grouping. As an example, gene, G, within-slide replicate, r, technical replicate, s, and time point, t, where gene and time point are fixed effects and technical replicate and within-slide replicate are random effects, will result in the ANOVA model equation
$$y_{G,t,s,r} = \mu_{G,t} + B_{G,t,s} + \varepsilon_{G,t,s,r},$$
with $y_{G,t,s,r}$ as the observed expression value and $\mu_{G,t}$ as the mean of each gene-time interaction. Variable $B_{G,t,s}$ is a Gaussian random variable that represents interactions of gene and time with the random effect technical replicate. Finally, $\varepsilon_{G,t,s,r}$ is the residual. Such equations are used to calculate the ANOVA table and variance components using the functions provided by YASMA. If the ANOVA table allows rejection of the null hypothesis for the fixed effect of interest (e.g. time), the user may further assess the differences between groups. To do so, the library allows for general contrasts, so that evaluations beyond pairwise comparisons are possible. We illustrate this in Table 1 using a longitudinal study of mammary gland development (Clarkson et al., 2004): the first column shows the time points of the experiment; the second column illustrates a contrast for pairwise comparisons between the last lactation day and involution onset; the third column is a more general contrast that tests for significant differences between groups of time points and is here indicative of causes of type II apoptosis.
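To make the notion of a general contrast concrete, the following Python sketch evaluates a contrast as a weighted sum of group means whose weights sum to zero. The time point labels, the weights and the mean values are hypothetical and are not taken from Table 1; the function computes only the contrast estimate, not the mixed model test statistic that YASMA derives from the variance components.

```python
def contrast_estimate(group_means, weights):
    """Return sum_i w_i * mean_i for one gene; the weights must sum to zero."""
    if abs(sum(weights.values())) > 1e-9:
        raise ValueError("contrast weights must sum to zero")
    return sum(w * group_means[level] for level, w in weights.items())

# Hypothetical mean log expression of one gene at three time points.
means = {"lactation_last": 2.0, "involution_onset": 3.0, "involution_late": 5.0}

# Pairwise contrast: involution onset versus last lactation day.
pairwise = {"lactation_last": -1.0, "involution_onset": 1.0}

# General contrast: average of both involution time points versus lactation.
grouped = {"lactation_last": -1.0,
           "involution_onset": 0.5,
           "involution_late": 0.5}

print(contrast_estimate(means, pairwise))
print(contrast_estimate(means, grouped))
```

A pairwise comparison is thus the special case with one weight of +1 and one of -1, while grouped contrasts spread the weights over several levels.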
In addition, a gene-based ANOVA rank list can be produced. This ranks genes by the P-values of an F-statistic computed under the null hypothesis that all levels of the corresponding effect have identical means. The total number of comparisons within a definition file is used to adjust P-values for multiple testing. For each comparison, an ordered gene list is written to a separate tab-delimited file.
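The ranking and output step might look as follows in Python. The text does not state which correction FSPMA applies, so the Bonferroni-style multiplication by the number of comparisons used here is an assumption, as are the function names and the column headings of the output.

```python
def rank_genes(p_values, n_comparisons):
    """Adjust P-values and return (gene, adjusted P) pairs, smallest first.

    Assumption: a Bonferroni-style adjustment over the number of comparisons.
    """
    adjusted = {gene: min(1.0, p * n_comparisons)
                for gene, p in p_values.items()}
    return sorted(adjusted.items(), key=lambda item: item[1])

def to_tsv(ranked):
    """Format a ranked gene list as tab-delimited text with a header line."""
    lines = ["gene\tadjusted_p"] + [f"{gene}\t{p:.4g}" for gene, p in ranked]
    return "\n".join(lines) + "\n"

# Hypothetical raw P-values for one comparison out of three in total.
ranked = rank_genes({"g1": 0.04, "g2": 0.001, "g3": 0.5}, n_comparisons=3)
print(to_tsv(ranked))
```

Writing one such table per comparison reproduces the layout described above, with each file holding a single ordered gene list.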
| DISCUSSION |
|---|
FSPMA-based analysis of microarray experiments is accessible to non-programmers with a basic understanding of ANOVA, random and fixed effects, and contrasts. Such users are supported by FSPMA's extensive consistency checks of definition files. For the expert, FSPMA allows efficient analysis of balanced reference designs by providing pre-defined definition files. In non-standard situations that go beyond what is possible with mixed effects ANOVA, the library can still serve as a front end for data loading and normalization.
| Acknowledgments |
|---|
The authors are grateful for the suggestions on how to improve the package that were kindly provided by the reviewers of this paper. This work was funded by the BBSRC's Exploiting Genomics initiative under ref. 8/EGH16106, Shared Genetic Pathways in Cell Number Control.
Conflict of Interest: none declared.
Received on February 21, 2005; revised on July 18, 2005; accepted on September 5, 2005
| REFERENCES |
|---|
Clarkson, R.W.E., et al. (2004) Gene expression profiling of mammary gland development reveals putative roles for death receptors and immune mediators in post-lactational regression. Breast Cancer Res., 6, R92-R109.
Smyth, G.K., et al. (2003) Statistical issues in microarray data analysis. In Brownstein, M.J. and Khodursky, A.B. (Eds.), Functional Genomics: Methods and Protocols. Humana Press, Totowa, NJ, pp. 111-136.
Troyanskaya, G.O., et al. (2001) Missing value estimation methods for DNA microarrays. Bioinformatics, 17, 520-525.
Wernisch, L., et al. (2003) Analysis of whole-genome microarray replicates using mixed models. Bioinformatics, 19, 53-61.