Skip Navigation


Bioinformatics Advance Access originally published online on November 14, 2006
Bioinformatics 2007 23(2):184-190; doi:10.1093/bioinformatics/btl572
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/2/184    most recent
btl572v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by de Haan, J. R.
Right arrow Articles by Buydens, L. M. C.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by de Haan, J. R.
Right arrow Articles by Buydens, L. M. C.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Interpretation of ANOVA models for microarray data using PCA

J. R. de Haan 1, R. Wehrens 1, S. Bauerschmidt 2,4, E. Piek 3, R. C. van Schaik 2,4 and L. M. C. Buydens 1,*

1 Institute for Molecules and Materials, Analytical Chemistry, Radboud University Nijmegen Toernooiveld 1, 6525 ED, Nijmegen, The Netherlands
2 NV Organon Molenstraat 110, 5340 BH, Oss, The Netherlands
3 Department of Applied Biology, Radboud University Nijmegen Toernooiveld 1, 6525 ED, Nijmegen, The Netherlands
4 Centre for Molecular and Biomolecular Informatics, Nijmegen Centre for Molecular Life Sciences, Radboud University Nijmegen Toernooiveld 1, 6525 ED, Nijmegen, The Netherlands

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS AND ANALYSIS
 3 RESULTS
 4 CONCLUSION
 REFERENCES
 

Motivation: ANOVA is a technique, which is frequently used in the analysis of microarray data, e.g. to assess the significance of treatment effects, and to select interesting genes based on P-values. However, it does not give information about what exactly is causing the effect. Our purpose is to improve the interpretation of the results from ANOVA on large microarray datasets, by applying PCA on the individual variance components. Interaction effects can be visualized by biplots, showing genes and variables in one plot, providing insight in the effect of e.g. treatment or time on gene expression. Because ANOVA has removed uninteresting sources of variance, the results are much more interpretable than without ANOVA. Moreover, the combination of ANOVA and PCA provides a simple way to select genes, based on the interactions of interest.

Results: It is shown that the components from an ANOVA model can be summarized and visualized with PCA, which improves the interpretability of the models. The method is applied to a real time-course gene expression dataset of mesenchymal stem cells. The dataset was designed to investigate the effect of different treatments on osteogenesis. The biplots generated with the algorithm give specific information about the effects of specific treatments on genes over time. These results are in agreement with the literature. The biological validation with GO annotation from the genes present in the selections shows that biologically relevant groups of genes are selected.

Availability: R code with the implementation of the method for this dataset is available from http://www.cac.science.ru.nl under the heading "Software".

Contact: L.Buydens{at}science.ru.nl


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS AND ANALYSIS
 3 RESULTS
 4 CONCLUSION
 REFERENCES
 
Microarray analysis is not straightforward because of the large number of genes, which are investigated simultaneously. By incorporating several factors of interest (for instance time and different treatments) in the experimental design, the interpretation of the data becomes even more difficult. The influence of the factors of interest should be separated from each other to draw sensible conclusions from the data analysis. By application of ANOVA the different factors are modelled separately and significance of interactions between factors can be investigated as well (Fisher, 1925). Interpretation of the ANOVA models and resulting P-values can be difficult. Here a method is shown which enhances the interpretation of ANOVA models by application of principal component analysis (PCA) on the ANOVA results. Relevant information is often known beforehand to be present in the interaction terms of the ANOVA model. With the technique presented here this information can be accessed. The method works fast and straightforward, displaying the results in an orderly fashion.

The application of ANOVA to microarray data has been described by Kerr et al. and Churchill (Kerr et al., 2000; Kerr and Chruchill 2001; Kerr et al., 2002; Churchill, 2004). In these papers it is shown how a distinction between the effects of different factors can be made with ANOVA. Normalization is then performed by removing the disturbing effects from the data. Thus, when performing normalization with ANOVA, the aim is to make sound estimations of the relative expression of genes by modelling the different factors of variation and subsequently correcting for disturbing factors (Cui and Churchill, 2003). There the ANOVA is split in a normalization model and a gene specific model. The residuals of the normalization model are used as the normalized signal, which is then modelled by the gene specific model. This approach was used as well in an article by Wolfinger et al. (2001). Recently, Juenger et al. (2006) used the ANOVA framework to asses sources of variability in their expression data of Arabidopsis thalania.

ANOVA can be used for the detection of differentially expressed genes when there are more than two conditions in an experiment (Cui and Churchill, 2003; Pavlidis, 2003; Wolfinger et al., 2001). Gene selections are made based on significance of an effect, calculated by performing ANOVA on a gene-by-gene basis (Churchill, 2004). The selection of genes showing differential expression (with a significant effect) can be made based on a fixed cutoff, but it is preferred to apply multiple hypothesis testing on the results of the significance tests from ANOVA. Tan et al. (2003) applied ANOVA to detect differentially expressed genes to determine cross-platform comparability of three commercial microarray platforms. The number of genes which are declared differentially expressed with ANOVA are compared between platforms in order to make statements about the level of agreement. A recent example of the application of ANOVA to identify differentially expressed genes is from Bushari et al. (2006). The authors identified 36 differentially expressed genes for groups of mice with different diets. The interest was to identify genes which were involved in the effect of fescue toxicosis on gene expression in the liver.

Although ANOVA is useful in identifying genes with significant group differences, it is not clear which of the groups are responsible for this difference (Pavlidis, 2003). It is highly desirable to get this information from the ANOVA results as well. One way of achieving this is analysing the individual ANOVA components, the main and interaction effect matrices, with PCA. This allows to examine this information in an intuitive way by displaying it using biplots. These explicitly show the correlation between individual genes and other factors, such as time and treatment.

PCA (Jackson, 1991) has been applied to microarray data in several publications as a data exploration tool (Holter et al., 2000; Raychaudhuri et al., 2000). An example of reduction of dimensionality of microarray data to so called eigengenes and eigenarrays is the study of Alter et al. (2000). for a well known yeast cell cycle dataset (Spellman et al., 1998). PCA allows to visualize correlations in datasets by compressing information in a low number of dimensions. The method is very flexible and large datasets can be handled easily. An important step in PCA is the determination of the number of latent variables, which contain relevant information. Alter et al. suggest to remove components, which are thought to be resulting from noise; however, the determination of which components in the PCA can be attributed to noise is not at all straightforward. Furthermore, Hörnquist et al. (2003) (for time-series data) and Misra et al. (2006) (for non time-series data) have recognized that PCA can be disturbed by variance resulting from noise in the data, or genes which are not affected by the biological experiment. Modelling the different factors in the experiment with ANOVA is an approach to solve this.

In this article, the combination of the techniques of PCA and ANOVA is shown for the analysis of microarray data. Performance of PCA on ANOVA models has been described previously for metabolomics data of guinea pigs (Smilde et al., 2005), studying the effects of different doses of Vitamin C on the development of osteoarthritis. Another example has been published in the field of proteomics (Harrington et al., 2005). There, the aim was to find consistent biomarkers for premature delivery in humans. Besides finding biomarkers from the mass spectra, ANOVA-PCA was used to investigate effects of the design of the experiment on the data.

ANOVA and PCA have been used on microarray data separately, but we will demonstrate the advantages of combining the two methods. The combination of PCA with ANOVA will cause PCA to concentrate on the variance components that are really important. This avoids the problem, mentioned before, of uninformative factors disturbing the picture. Therefore the results will be more interpretable, as will be shown below. The aim of the analysis presented here is to display the relationships between genes and factors in the experiment in an appealing way.

Here, the value of ANOVA-PCA is demonstrated for a multi-treatment time-series dataset. In the dataset analysed here, the biological background is in the field of osteogenesis. First, the main and interaction components of the ANOVA model are calculated with the factors time, treatment and gene. Next, the interaction effects are analysed with PCA, showing biologically relevant correlations between genes and treatments. Another application for the algorithm is making gene selections. The external validation of the results is performed with biological annotation information from the Gene Ontology (GO) database (The Gene Ontology Consortium, 2000), indicating that biologically relevant selections of genes are made.


    2 METHODS AND ANALYSIS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS AND ANALYSIS
 3 RESULTS
 4 CONCLUSION
 REFERENCES
 
2.1 Description of the mesenchymal stem cell dataset
The microarray experiment presented here was performed on human mesenchymal stem cells, triggered to undergo osteogenic differentiation (E. Piek et al., manuscript in preparation). Dexamethasone can induce the differentiation of stem cells to osteoblasts. It is known that BMP2 and Vitamin D3 can potentiate osteogenesis in combination with dexamethasone (Jørgensen, et al., 2004). The biological process of interest for the dataset which is used for the analysis in this article is skeletal development.

Mesenchymal stem cells are able to differentiate into osteoblasts. Mature osteoblasts which mineralize the extracellular matrix are formed after two weeks. The experiment was set up as a time-series experiment and the expression measurements were taken at10 time points (at 1, 3, 6, 12, 24, 48, 72, 120, 192 and 288 h after onset of treatment). More measurements were taken in the first 24 h of the time course, because the first part is expected to be important for the onset of differentiation.

In addition to sampling multiple timepoints during the differention process, another factor of interest is introduced to the experiment by inducing osteogenesis of the mesenchymal stem cells with different treatments. The three treatments are named after the substances added to the culture medium (osteogenic differentiation medium, Cambrex Bioscience, Verviers, Belgium). For the first treatment a combination of Vitamin D3 (VIT) and Dexamethasone (DEX) was used. The second treatment consisted of a combination of BMP2 and DEX. The third treatment consisted of Dexamethasone only. Furthermore an untreated sample is measured as a control at each time point. Thus, four time profiles can be generated for each gene on the microarrays. A graphical view of the experimental design of the experiment is given in Figure 1.


Figure 1
View larger version (18K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1 Schematic representation of the design of the mesenchymal stem cell dataset. The conditions are indicated on each row. Each column represents a time point (in hours). The dimension genes is not shown in the table, but is taken into account in the analysis. Each cell in the table contains three replicate arrays.

 
The data can be represented as a cube with the axes gene, time and treatment (see Fig. 1). For each cell in the cube three replicate measurements have been performed. The hybridizations were randomly assigned to six different groups in order to randomize the experimental effects of measuring at different time. The hybridizations were performed with Affymetrix Human Genome U133 A GeneChips (Lipshutz et al., 1999).

The normalization of the data was performed with Rosetta Resolver Version 5 and subsequently the expression data was log transformed (Weng et al., 2006).

2.2 Description of the ANOVA model
ANOVA can be used in microarray data analysis to investigate the significance of the effects from factors which could possibly influence the gene expression. The ANOVA fixed effects model in which three of the possible factors of interest are incorporated is given by expression 1. In this model the measured gene expression (Xijkr) is assumed to be the result of the added effects of the factors time (T), treatment (S) and gene (G) over timepoint i, treatment j, gene k and replicate r:


Formula 1

(1)

The effect of interactions between factors is also incorporated in the three factor ANOVA model shown here (TS, TG, GS and three way interaction TSG). The effects are added to a general mean expression value which is indicated with µ. Finally the remaining variation is captured in the error term {varepsilon}. In the normal application of ANOVA, the sum of squares and mean squares are calculated for each factor and interaction. With an F-test the significance of the effect of each factor is then calculated. In this application of a combination of ANOVA and PCA however, the F-test for significance of effects of factors is not performed. Instead of calculating sums of squares from the main effects and interaction matrices, these matrices are analysed with PCA. The main effects are vectors with a length equal to the number of levels in each factor. The interaction matrices consist of the combined effect of two factors, when corrected for the general effect of these factors. For example, the interaction matrix of the factors gene and treatment shows the effect of a gene and treatment after the general treatment effect and general gene effect have been corrected for. Thus, the interaction effect could be interpreted as the response of a gene to the treatments additional to general gene and treatment effects. This is of course interesting to the biologists who are looking for genes which respond to the treatments incorporated in the experiment.

First, the main effect is calculated for each of the factors of interest. The three factors used in the ANOVA model are time effect, treatment effect and the effect of each individual gene. The main effect is then calculated for each factor by subtractingthe general overall mean from the mean per timepoint, the mean per gene and the mean per treatment

Formula 2(2)
where y is gene expression, and q is either i... , .j. . or . .k. , depending on the effect which is calculated. The next step is the calculation of the interaction matrices. The interaction matrices of the interaction between gene and treatment (GS), time and treatment (TS) and time and gene (TG) are calculated with:

Formula 3(3)

Formula 4(4)

Formula 5(5)
The three factor interaction (GTS) of time and gene can be calculated as follows:

Formula 6(6)

The three factor interaction is in fact a data cube. In order to analyse it further with PCA the cube has to be unfolded. The two factor interaction matrices and the unfolded three factor interaction matrix can now be analysed with PCA and interactions can be identified as explained below.

2.3 PCA on ANOVA interaction terms
The interaction matrices can be analysed with PCA to identify genes with interesting and significant interactions. Genes are interesting for the biologist when they can be correlated with for instance a specific treatment or timepoint, depending on the biological question for the dataset. The gene is considered to have a significant interaction with e.g. treatment when a particular interaction effect is different from the value expected from the mean effects of all the genes and treatments. The technique of PCA is especially suited to visualize the interesting interactions because it focuses on displaying the data in a way that most variance is shown: the interesting genes are likely to be among those with large positive or negative values.

An especially useful graphical representation of the data are the so called biplot (Gabriel, 1971). In this representation, objects (e.g. genes) and variables (e.g. treatments) can be visualized in the same picture, allowing for interpretation of the location of scores in the plot. Chapman et al. (2001) showed that biplots can be used to view the effects of different treatments related to plant defense. The first principal component was stated to be correlated to the mean response of the genes. In contrast, in the results shown below the data are modelled by the ANOVA model and the analysis is performed on the resulting interactions and not on the raw data.

2.4 Gene selection
In microarray analysis the number of measured genes is substantial and not all genes are expected to be involved in the biological process of interest. The goal of the osteogenic microarray experiment presented here is to find the genes which are involved in the process of mesenchymal stem cell differentiation. The aim is therefore to find a meaningful reduced set of genes. This set can consist of unknown genes as well as genes known to be involved in the process, which can confirm the experimental results. The selection can then be used to form new hypotheses and to perform further research. Here a method is shown which can be used to make gene selections from the results of the analysis with ANOVA and PCA.

Before any validation can take place a selection criterion for the identification of groups of interesting genes has to be defined. The selection method which will be described here is applied on the scores of the PCA performed on the interaction of the factors gene and treatment. The Hotelling T2 distribution is used to estimate the statistical significance of the interactions of the genes in the score plots. Genes which are different and interesting to the process, are expected to havea significant interaction when they are tested against a critical value for the distribution of the interactions of all the probesets. A gene which has a significant interaction is expected to have biological relevance because the gene shows a combined effect (corrected for general effects).

The Hotelling T2 distribution is used to calculate the critical value Formula 6:

Formula 7(7)
Here n is the sample size, p is the number of variables used, F is the F-statistic with (p, np) degrees of freedom and {alpha} is the level at which an observation is considered to be statistically significant. By calculating a Hotelling T2 statistic for each observation a statement can be made about the chance for this observation to be an outlier by chance alone under the null hypothesis. The Hotelling T2 statistic for each observation can be calculated with

Formula 8(8)
in which u is the score of the ith observation on the kth Principal Component. The variance of the scores is taken into account by Formula 8 and a is the number of PCs considered to be relevant. By making use of the F-statistic the group of genes with a significant interaction can be identified by predefining the number of principal components to be taken into account and selecting a confidence level cut-off.

The test described here can be applied to the number of components which explain a relevant amount of variance. One option of selecting the number of components is to incorporate those components which capture a certain percentage of variance in the data (for instance 95%). The number of PCs in the analysis can be determined as well by plotting the percentage of variance in each component to see how the variance is distributed (See Fig. 2).


Figure 2
View larger version (14K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2 Plots showing the explained variance for each component of PCA performed on (A) the interaction between the factors gene and treatment (B) the three factor interaction (C) original data, averaged over replicates. In (B and C) only the first 5 PCs of 40 PCs are shown.

 

    3 RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS AND ANALYSIS
 3 RESULTS
 4 CONCLUSION
 REFERENCES
 
Application of the algorithm on the mesenchymal dataset results in main effects and interaction matrices (as described in section 2.2). Especially the interaction matrices of genes and treatment, and genes, treatment and time, contain interesting biological information. These can be further analysed with PCA.

First the interaction between the factors gene and treatment is inspected, because this is expected to be the most interesting from a biological point of view. It is desirable to be able to correlate the interaction of a gene with a treatment. Other interactions are of interest as well, but for now we address the question whether we can make distinctions between the treatments.

3.1 Analysis of the interaction with PCA
The first three components contain more than 95% of the explained variance after performance of PCA on the interaction matrix of the factors gene and treatment (see Fig. 2A). Because the fourth PC has little variation compared to the other PCs the first three PCs will be used to construct the biplots to make gene selections.

The investigation of the interactions between the factors gene and treatment is interesting because this can give information about genes which are different from the general effects. A normal PCA which does not correct for the general effects would be distorted by the general effects. The biplots of the two first three principal components from the PCA performed on the interaction matrix of the factors gene and treatment are displayed in Figure 3.


Figure 3
View larger version (26K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3 Biplots of a principal component analysis performed on the interaction between the factors gene and treatment. The arrows indicate the loadings of the treatments (DEX, BMP, VIT and unt). All 22 283 probe sets from the dataset were used in the analysis. A group of genes associated with skeletal development or known to be involved is indicated with gene symbols instead of points.

 
The first principal component shown on the x-axis of Figure 3A indicates variation, which distinguishes between the three samples treated with the substances (DEX, VIT and BMP) and the untreated sample. This can be seen because the arrows, which indicate the loadings, point in opposite directions in the plot for treated and untreated. Furthermore, it appears that the BMP treatment is different from DEX and VIT. The biplot of the second and third component (Fig. 3B) shows the distinction between the different treatments and the genes most correlated with a certain treatment.

A number of genes which have a significant interaction ({alpha} = 10–5, Hotelling T2 on first 2 components) and the GO annotation skeletal development (or are known to be developed in the process) are displayed in text in Figure 3. Some examples of genes which are known to be associated with one of the treatments will be given to illustrate the advantages of the biplot. For instance BGLAP, also known as osteocalcin, is induced by VIT and not DEX and BMP (Jørgensen et al., 2004). Furthermore FKBP5 is induced by dexamethasone (Vermeer et al., 2003). Dexamethasone is present in all treatments and the position of FKBP5 in the plot confirms this. In agreement with the data presented in Figure 3 DLX5 and COMP are known to be induced by BMP (Lee et al., 2003; Tian et al., 2003).

The results of a general per-gene two way ANOVA can not be associated with specific treatments based on their P-value for the interaction between time and treatment, even if it is low. The sorted P-values of the treatment effect of the two way ANOVA for each gene would show the same problem: the genes cannot be correlated or anticorrelated with a specific treatment. As can be seen in Figure 3 this problem can be solved by performing PCA on the interaction matrix because here associations can be made. The advantage of the combination of ANOVA and PCA can be shown by comparing the results of the technique with results from PCA on the data when it is not processed by ANOVA. In order to perform PCA on the complete dataset, the data cube with the dimensions time, gene and treatment is unfolded by combining the dimension treatments and time. Thus the resulting data matrix consists of 22 693 rows for genes and 40 columns for 10 timepoints and 4 treatments. In this analysis, the mean of the three replicates in each cell was taken. PCA was then applied to the data matrix. The results shown in Figure 4 are difficult to interpret when compared to the results of PCA on interactions (see Fig. 3). When performing PCA on the unfolded mean-centered data matrix, the explained variance is almost totally in PC1 (96.5% of explained variance). PC2 and PC3 together contain only 2.2% of the explained variance. Furthermore, the loadings are pointing in the same direction on PC1, making it difficult to associate genes with treatments and timepoints (see Fig. 4). When the loadings are plotted per treatment, a separation can be seen in PC two between early and late timepoints, but this separation is similar for each treatment, and PC2 contains only 1.76% variation.


Figure 4
View larger version (26K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4 Loading plots of PCA performed on the whole data matrix (without application of ANOVA). The arrows indicate the loadings of the 40 variables from the matrix. The four subplots show the loadings from each of the four different treatments (DEX, BMP, VIT and untreated). All 22 283 probe sets from the dataset were used in the analysis.

 
Instead of inspecting the two way interaction matrix of gene and treatment, one may also focus on the three way interaction between gene, treatment and time. This matrix should contain information on which genes show a response to treatments over time that cannot be explained by main effects and two factor interactions alone. The loading plot of PCA on the unfolded three factor interaction matrix can be seen in Figure 5. Thus a gene with a significant interaction in Figure 5 can be associated with an effect of that gene at a certain timepoint for a certain treatment (depending on the loadings, which correlate with the gene). The first notable difference between the three factor interaction plot (Fig. 5) and the PCA on original data, averaged over replicates (Fig. 4) is the distribution of loadings, which is not in one direction but occupies all four quadrants. Furthermore the variance is not captured in the first PC alone but more evenly distributed (see Fig. 2B and C). The loadings are ordered clockwise according to the timepoints. The orientation of the first five timepoints is similar for DEX and BMP. This orientation is clearly different from the first five loadings from the untreated and VIT in the experiment.


Figure 5
View larger version (25K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5 Loading plots of PCA performed on the three factor interaction matrix from the application of ANOVA. The four subplots show the loadings from each of the four different treatments (DEX, BMP, VIT and untreated). All 22 283 probe sets from the dataset were used in the analysis.

 
3.2 Gene selection
After the application of ANOVA and visual inspection of the biplot of PCA on the interaction between gene and treatment, interesting genes can be selected. An example of a selection of a group of genes according to the Hotelling T2 statistic can be seen in Figure 6. The scores of the first 2 PCs from PCA on the interaction matrix of the factors gene and treatment are shown. The selected genes are on the outside of the ellipsoidal shape of the distribution in the first two Principal Components. These describe 90.2% of the varation in the data. The selection was made as reported before and an {alpha} = 10–5 leads to a selection of 384 genes in this case.


Figure 6
View larger version (33K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 6 Example of a selection of genes, which was identified with the Hotelling T2 distribution. The {alpha} for this selection was 10–5 and 384 genes were selected. Selected genes are represented by the points outside the circle.

 
The GO enrichment analysis of this selection can be seen as a first external biological validation of the selection of interesting genes. It will give an indication whether the biological function of the selected genes can be attributed to the process of skeletal development, or even to the differentiation of mesenchymal stem cells.

In Table 1 the GO terms are displayed, which correspond to the gene selection. The results are consistent with the biological framework of the dataset. The term skeletal development is a good example for this. The terms development, organ development, morphogenesis and cell differentiation can be linked to the process of osteogenesis as well.


View this table:
[in this window]
[in a new window]

 
Table 1 Results of GO enrichment analysis

 
This example was performed using the first two PCs. These were chosen because the difference between treated and untreated was clearly visible in the biplot (see Fig. 3A). The selection was made with the first three PCs as well: then, the processes development, organ development, skeletal development and cell differentiation were found in the first 10 terms again. The term skeletal development was ranked lower (6th) when compared with enrichment after 2 PCs.

The cut-off values which are used as selection criterion for groups of genes are known to have effects on the outcome of the biological results of expression experiments (Pan et al., 2005). In other words: different cutoffs can render quite different biological results. Therefore the enrichment process was performed with several gene selections on the first 2 PCs ({alpha} in Hotelling T2 between 10–1 and 10–8). This analysis showed that the top eight processes were consistently enriched for a wide range of {alpha} values (data not shown).

Next, GO enrichment analysis was performed on genes selected from the unfolded three factor interaction. Again the Biological Process category was used for the enrichment. This was not done for a wide range of cutoffs but at {alpha} = 10–5. The first 4 PCs were used. The enrichment results were comparable with the enrichment results of the selected genes for the two factor interaction when the rank of the enrichment of the term skeletal development is considered. Several selectionswere made from the PCA of the three factor interaction matrix. For a selection of 228 genes, selected from the first 2 PCs with Hotelling T2 the first enriched term was skeletal development. Skeletal development was still ranked first for a selection of 365 genes, selected from the first 3 PCs. When the first four PCs were used to make the selection, 486 genes were selected and skeletal development showed as the fourth enriched term in the list (with morphogenesis ranking first). Clearly, gene selection based on interaction matrices leads to biologically relevant sets. Although the selection of the number of PCs to use may lead to differences, the same, biologically relevant, GO categories are enriched in the selections. In our example, this is true for both the two factor and three factor ANOVA interactions.

3.3 Application of ANOVA on selected genes
The results shown before were obtained by applying the combination of ANOVA and PCA on all probe sets on the chip, in order to select subsets. The resulting gene selections are biologically valid, but it is interesting to see the results of the algorithm when it is performed on a selected subset of genes. Or, put differently, what happens to the main effects and interaction effects when a selection of genes is used in ANOVA-PCA?

Thus the selection of 384 genes mentioned previously was used to perform ANOVA and PCA again with just these genes. As expected, the results show that the main effects of these subsets are different from the overall main effects present in the whole dataset.

When the three factor interaction effect of gene, time and treatment is shown for the selection of 384 genes a clear difference can be observed between the untreated time-series and the treated samples (see Fig. 7). A picture of the three factor interaction effect in time over all the genes is less informative because it will be disturbed by all the genes which are not necessarily involved in the biological process of interest. The untreated profile decreases further and longer than the treatment profiles. Furthermore the BMP treatment displays a higher peak at 72 h when compared to DEX. For the peak of VIT treatment a shift in kinetics can be seen because the maximum of the peak is at 24 h. The shift in Vitamin D kinetics has been confirmed experimentally with in vitro cell-based biological assays (Piek et al., manuscript in preparation).


Figure 7
View larger version (19K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 7 Display of average three factor interaction effect of gene, time and treatment for a set of 384 genes selected from the first and second PC of the interaction matrix of the factors gene and treatment.

 

    4 CONCLUSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS AND ANALYSIS
 3 RESULTS
 4 CONCLUSION
 REFERENCES
 
The combination of ANOVA and PCA, as demonstrated here, is a valuable tool in microarray data analysis, perhaps even more so than in metabolomics: because more information is available on the functions of genes, the biological interpretation of the results is more straightforward than in the case of metabolites. ANOVA-PCA can be used to split and visualize the different components in the ANOVA model, showing relations between genes, treatments, and time points. Several genes which are known to be induced by specific treatments can be seen to lie close to these treatments in the biplots; for genes with an unknown function these plots can provide plausible hypotheses to be tested. Furthermore, specific interaction matrices can be used to select interesting subsets of genes, identified with the Hotelling T2 test. Such a selection for the current data set is shown to be significantly enriched with genes involved in skeletal development, which is in line with the biological framework.

The shifted kinetics of Vitamin D induced osteogenesis, which had been suspected based on earlier experiments, is clearly visible in the ANOVA-PCA results. It has been confirmed using in vitro cell-based biological assays (Piek et al., manuscript in preparation). Application of either ANOVA or PCA would not have shown this as clearly. Many other questions can be addressed as well, depending on the nature of the data set and the interests of the researcher.


    Acknowledgments
 
The authors would like to thank NV Organon for their financial support of the research presented here.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Joaquin Dopazo

Received on September 29, 2006; accepted on November 8, 2006

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS AND ANALYSIS
 3 RESULTS
 4 CONCLUSION
 REFERENCES
 

    Alter, O., et al. (2000) Singular value decomposition for genome-wide expression data processing and modeling. Proc. Natl Acad. Sci. USA, 97, 10101–10106[Abstract/Free Full Text].

    Bushari, S., et al. (2006) Effect of fescue toxicosis on hepatic gene expression in mice. J. Animal Sci, . 84, 1600–1612[Abstract/Free Full Text].

    Chapman, S., et al. (2001) Using biplots to interpret gene expression patterns in plants. Bioinformatics, 18, 202–204.

    Churchill, G.A. (2004) Using ANOVA to analyze microarray data. Biotechniques, 37, 173–177[Web of Science][Medline].

    Cui, X.Q. and Churchill, G.A. (2003) Statistical tests for differential expression in cDNA microarray experiments. Genome Biol, . 4, 210[CrossRef][Medline].

    Fisher, R.A. Statistical Methods for Research Workers, (1925) , Edinburgh Oliver and Boyd.

    Gabriel, K.R. (1971) The biplot graphic display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA, 95, 14863–12868.

    Harrington, P.B., et al. (2005) Analyis of variance-principal component analysis: a soft tool for proteomic discovery. Analytica Chimica Acta, 544, 118–127[CrossRef].

    Holter, N.S., et al. (2000) Fundamental patterns underlying gene expression profiles: simplicity from complexity. Proc. Natl Acad Sci, . 97, 8409–8414[Abstract/Free Full Text].

    Hörnquist, M., et al. (2003) Effective dimensionality for principal component analysis of time series expression data. Biosystems, 71, 311–317[CrossRef][Web of Science][Medline].

    Jackson, J. A Users Guide to Principal Components, (1991) , NY Wiley & Sons.

    Jørgensen, N.R., et al. (2004) Dexamethasone, BMP-2, and 1,25-dihydroxyvitamin D enhance a more differentiated osteoblast phenotype: validataion of an in vitro model for human bone marrow-derived primary osteoblasts. Steroids, 69, 219–226[CrossRef][Web of Science][Medline].

    Juenger, T.E., et al. (2006) Natural genetic variation in whole-genome expression in Arabidopsis thalania: the impact of physiological QTL introgression. Mol. Ecol, . 15, 1351–1365[CrossRef][Medline].

    Kerr, M.K., et al. (2000) Analysis of variance for gene expression microarray data. J. Computat. Biol, . 7, 819–837.

    Kerr, M.K. and Churchill, G.A. (2001) Experimental design for gene expression microarrays. Biostatistics, 2, 183–201[Medline].

    Kerr, M.K., et al. (2002) Statistical analysis of a gene experssion microarray experiment with replication. Statistica Sinica, 12, 203–217.

    Lee, M.H., et al. (2003) BMP-2-induced Osterix expression is mediated by Dlx5 but is independent of Runx2. Biochem. Biophys. Res., Commun, . 309, 689–694[CrossRef][Web of Science][Medline].

    Lipshutz, R.J., et al. (1999) High density synthetic oligonucleotide arrays. Natl Genet, . 21, 20–24.

    Misra, J., et al. (2006) Interactive exploration of microarray gene expression patterns in a reduced dimensional space. Genome Res, . 12, 1112–1120.

    Pan, K.-H., et al. (2005) Effects of threshold choice on biological conclusions reached during analysis of gene expression. Proc. Natl Acad. Sci, . 102, 8961–8964[Abstract/Free Full Text].

    Pavlidis, P. (2003) Using ANOVA for gene selection from microarray studies of the nervous system. Methods, 31, 282–289[CrossRef][Web of Science][Medline].

    Raychaudhuri, S., et al. (2000) Principal components analysis to summarize microarray experiments: application to sporulation time series. Pac. Symp. Biocomput, . 455–466.

    Smilde, A.K., et al. (2005) ANOVA-Simultaneous component analysis (ASCA): a new tool for analyzing designed metabolomics data. Bioinformatics, 21, 3043–3048[Abstract/Free Full Text].

    Spellman, P.T., et al. (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell, 9, 3273–3297[Abstract/Free Full Text].

    Tan, P.K., et al. (2003) Evaluation of gene expression measurements from commercial microarray platforms. Nucleic Acids Res, . 31, 5676–5684[Abstract/Free Full Text].

    Tian, H., et al. (2003) Effects of bone morphogenetic protein-2 on cartilage oligomeric matrix protein expression in chondrocytes. Beijing Da Xue Xue Bao, . 35, 317–203[Medline].

    The Gene Ontology Consortium. (2000) Gene Ontology: tool for the unifaction of biology. Natl Genet, . 25, 25–29.

    Vermeer, H., et al. (2003) Glucocorticoid-induced increase in lymphocytic FKBP51 messenger ribonucleic acid expression: a potential marker for glucocorticoid sensitivity, potency, and bioavailability. J. Clin. Endocrinol. Metabo, . 88, 277–284.

    Weng, L., et al. (2006) Rosetta error model for gene expression analysis. Bioinformatics, 22, 1111–1121[Abstract/Free Full Text].

    Wolfinger, R.D., et al. (2001) Assessing gene significance from cDNA microarray expression data via mixed models. J. Comput. Biol, . 8, 625–637[CrossRef][Web of Science][Medline].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
J. Immunol.Home page
M. Schmolke, D. Viemann, J. Roth, and S. Ludwig
Essential Impact of NF-{kappa}B Signaling on the H5N1 Influenza A Virus-Induced Transcriptome
J. Immunol., October 15, 2009; 183(8): 5180 - 5189.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/2/184    most recent
btl572v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by de Haan, J. R.
Right arrow Articles by Buydens, L. M. C.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by de Haan, J. R.
Right arrow Articles by Buydens, L. M. C.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?