Bioinformatics Advance Access originally published online on February 25, 2005
Bioinformatics 2005 21(10):2548-2549; doi:10.1093/bioinformatics/bti343
A web-based tool for principal component and significance analysis of microarray data
Developmental Genomics and Aging Section, Laboratory of Genetics, National Institute on Aging, National Institutes of Health, Baltimore, MD 21224, USA
*To whom correspondence should be addressed.
Abstract
Summary: We have developed a program for microarray data analysis, which features the false discovery rate for testing statistical significance and the principal component analysis using the singular value decomposition method for detecting the global trends of gene-expression patterns. Additional features include analysis of variance with multiple methods for error variance adjustment, correction of cross-channel correlation for two-color microarrays, identification of genes specific to each cluster of tissue samples, biplot of tissues and corresponding tissue-specific genes, clustering of genes that are correlated with each principal component (PC), three-dimensional graphics based on virtual reality modeling language and sharing of PC between different experiments. The software also supports parameter adjustment, gene search and graphical output of results. The software is implemented as a web tool and thus the speed of analysis does not depend on the power of a client computer.
Availability: The tool can be used on-line or downloaded at http://lgsun.grc.nia.nih.gov/ANOVA/
Contact: kom{at}mail.nih.gov
Global gene-expression analysis with microarrays has become a routine procedure in biomedical research. Although many programs have been developed to support the statistical analysis of microarray results (Kim et al., 2001; Theilhaber et al., 2004; TIGR, 2004 http://www.tigr.org/software/tm4/; Tusher et al., 2001), they do not necessarily contain all the advanced analysis methods. To facilitate the use of these relatively new methods, we developed the NIA Array Analysis software. A complete description of the software, as well as a glossary of technical and statistical terms, can be found at http://lgsun.grc.nia.nih.gov/ANOVA/. In this paper, we describe the main features of this software.
The NIA Array Analysis software can be used for both single-color and two-color microarrays, with or without a dye swap. It takes a tab-delimited text file as input and generates output in both graphics and text formats. An additional tool (Arrayjoin) assembles multiple input files from different experiments into one input file. The software can also take an annotation file that hyperlinks each microarray probe to various web resources, including Unigene, TIGR, MGI and the NIA Mouse Gene Index. These gene links allow users to incorporate microarray data into other programs, e.g. GenMAPP for Gene Ontology analysis. All results can be saved as a stand-alone web page for sharing or releasing the data.
The software offers an optional adjustment of signal intensities when two-color hybridizations are used. This is based on our observation that signal intensities in one channel (e.g. red) often increase with increasing signal intensities in the other channel (e.g. green), even when the same reference RNA is always used for the red channel. If readings from the two channels were independent, the signal intensities in the red channel would not vary among experiments; any such variation should therefore be corrected.
We have implemented single-factor analysis of variance (ANOVA) for testing statistical significance. Testing multiple hypotheses with ANOVA requires some modifications, such as error variance averaging and the false discovery rate (FDR). The average error variance for genes with similar signal intensities is estimated using a sliding window of adjustable size applied to genes sorted by their average signal intensities. Because some genes (outliers) may have unusually high error variance, genes with the highest variance values (the top 1% by default) are not used for the error variance averaging. To obtain an estimate of the true error variance, the software provides the following five error models as options: (1) actual error variance (this option processes each gene independently), (2) intensity-specific average error variance, (3) Bayesian error model (Baldi and Long, 2001), (4) the maximum of the intensity-specific average error variance and the actual error variance and (5) the maximum of the intensity-specific average error variance and the Bayesian error variance. Option (4), the most conservative model, is used as the default. However, if the error variance is too high, none of these models is reliable. Thus, we tag and visually examine genes with high error variance (five times greater than the average). Users can also select more stringent criteria for removing outliers (i.e. a lower z-threshold level). The default threshold (z = 8) removes only the most deviating outliers. Estimation of the z-value is based on the ANOVA results; thus, ANOVA is applied iteratively, with outlier removal in each cycle, until no new outliers are detected.
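The intensity-specific averaging and the default error model (option 4) described above can be illustrated with a minimal Python sketch. This is not the tool's own code: the function names, the interpretation of the window size as a number of genes, and the parameter names `window` and `trim_fraction` are assumptions made for illustration only.

```python
from statistics import mean

def intensity_specific_variance(intensity, variance, window=51, trim_fraction=0.01):
    """Average the error variance over a sliding window of genes sorted by
    average signal intensity (hypothetical helper, not the tool's API)."""
    n = len(variance)
    # Gene indices ordered by average signal intensity.
    order = sorted(range(n), key=lambda i: intensity[i])
    # Genes with the highest variance (top 1% by default) are left out of the averaging.
    n_trim = int(trim_fraction * n)
    by_variance = sorted(range(n), key=lambda i: variance[i])
    excluded = set(by_variance[n - n_trim:])
    half = window // 2
    smoothed = [0.0] * n
    for pos, gene in enumerate(order):
        neighbours = [g for g in order[max(0, pos - half):pos + half + 1]
                      if g not in excluded]
        # Fall back to the gene's own variance if every neighbour was excluded.
        smoothed[gene] = mean(variance[g] for g in neighbours) if neighbours else variance[gene]
    return smoothed

def conservative_variance(intensity, variance, **kwargs):
    """Option (4): the larger of the actual and the intensity-specific average variance."""
    averaged = intensity_specific_variance(intensity, variance, **kwargs)
    return [max(actual, avg) for actual, avg in zip(variance, averaged)]
```

Taking the maximum means that a gene with an unusually low variance estimate is pulled up to the level typical of its intensity range, while a gene with a genuinely high variance keeps its own estimate.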
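The FDR estimates the expected proportion of false positives among significant genes (Benjamini and Hochberg, 1995; Reiner et al., 2003). Traditional p-values, which are designed for testing a single hypothesis, are not suited to the comparison of several thousand genes. The Bonferroni correction is not appropriate either, because it is too stringent: it is designed to allow no false positives among significant genes. We have implemented the original method (Benjamini and Hochberg, 1995), which in its standard step-up form is

$$\mathrm{FDR}_r = \min_{i \ge r} \left( \frac{p_i \, N}{i} \right) \qquad (1)$$

where r is the rank of a gene when genes are ordered by increasing p-value, p_i is the p-value of the gene with rank i and N is the total number of genes tested.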
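The Benjamini and Hochberg (1995) procedure cited above can be sketched in a few lines of Python. This is an illustrative implementation of the standard step-up method, not code taken from the tool; the function name and the optional `n_tests` argument (which lets the caller supply a number of tests N different from the number of p-values) are assumptions.

```python
def benjamini_hochberg(p_values, n_tests=None):
    """Return the FDR value for each p-value, in the original input order."""
    n = len(p_values)
    if n_tests is None:
        n_tests = n
    # Gene indices ordered by increasing p-value; rank 1 is the smallest.
    order = sorted(range(n), key=lambda i: p_values[i])
    fdr = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that the running minimum
    # keeps the FDR non-decreasing with rank.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n_tests / rank)
        fdr[idx] = running_min
    return fdr
```

Genes whose FDR value falls below a chosen threshold (for example 0.05) would then be reported as significant.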
The software offers two methods of clustering tissue samples and subsequently identifying correlated genes. First, hierarchical clustering of samples (e.g. tissues and cells) is done by using the average distance method. A set of genes unique to each cluster is identified in the following manner. For each gene, g, we first identify a sample T1(g) with the lowest average expression, E[T1(g)], within the cluster and a sample T2(g) with the highest average expression, E[T2(g)], outside the cluster. If K genes satisfy E[T1(g)] > E[T2(g)], these genes always have higher expression in samples within the cluster than in samples outside the cluster. To determine whether the difference E[T1(g)] - E[T2(g)] is statistically significant, we calculate z-values based on the error model and p-values based on the single-tail normal distribution. Finally, we estimate FDR values using Equation 1, in which N is the smaller of 2K and the total number of genes. The set of K genes represents only half (the positive part) of the normal distribution, and thus K is doubled for estimating the FDR.
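The search for cluster-specific genes described above can be sketched as follows. The data layout (nested dictionaries of average log-expression per gene and sample), the function names and, in particular, the standard error of the difference, taken here as sqrt(variance × (1/n1 + 1/n2)) for n1 and n2 replicates, are assumptions for illustration; the text does not specify how the z-value is derived from the error model.

```python
import math

def upper_tail_p(z):
    """Single-tail p-value under the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def cluster_specific_genes(expr, error_var, n_reps, cluster):
    """expr: {gene: {sample: average log-expression}}; cluster: set of sample
    names that is a non-empty proper subset of all samples (hypothetical layout)."""
    candidates = []
    for gene, by_sample in expr.items():
        inside = [s for s in by_sample if s in cluster]
        outside = [s for s in by_sample if s not in cluster]
        t1 = min(inside, key=by_sample.get)    # lowest expression within the cluster
        t2 = max(outside, key=by_sample.get)   # highest expression outside the cluster
        diff = by_sample[t1] - by_sample[t2]
        if diff > 0:
            # Assumed standard error of the difference between two sample means.
            se = math.sqrt(error_var[gene] * (1.0 / n_reps[t1] + 1.0 / n_reps[t2]))
            candidates.append((gene, upper_tail_p(diff / se)))
    k = len(candidates)
    # N is the smaller of 2K and the total number of genes.
    n_tests = min(2 * k, len(expr))
    candidates.sort(key=lambda item: item[1])
    fdr, running_min = {}, 1.0
    for rank in range(k, 0, -1):
        gene, p = candidates[rank - 1]
        running_min = min(running_min, p * n_tests / rank)
        fdr[gene] = running_min
    return fdr
```

The returned dictionary contains only the K candidate genes; a gene that is lower in any within-cluster sample than in some outside sample is never reported.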
Second, we have implemented principal component analysis (PCA). One advantage of PCA is that the principal components are always orthogonal (uncorrelated), whereas other methods (e.g. K-means clustering) often produce redundant correlated clusters. We have also implemented the singular value decomposition method, which reduces the dimension in both columns and rows of the data matrix. The method combines samples and genes in a single graph (called a biplot) so that their association can be analyzed visually (Chapman et al., 2002; Gabriel, 1971). The NIA Array Analysis tool generates interactive two-dimensional (2D) and 3D biplots (Fig. 1). Each gene in a biplot is hyperlinked to its annotation and to a histogram showing its expression levels in each sample. We identify two sets of genes that are positively and negatively correlated with each principal component (PC). If the degree of gene-expression change associated with a specific PC exceeds a user-defined threshold, then the gene is considered correlated with the PC.
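A minimal sketch of PCA by singular value decomposition, producing coordinates for both genes and samples as needed for a biplot, is given below using NumPy. It is an illustration under stated assumptions rather than the tool's implementation: the input is assumed to be a genes × samples matrix of log-transformed expression, each gene is centered on its mean, and the "degree of gene-expression change" associated with a PC is approximated by the gene's score on that PC. All function names are hypothetical.

```python
import numpy as np

def pca_biplot_coordinates(log_expr, n_components=2):
    """Return gene scores, sample loadings and explained-variance fractions."""
    x = np.asarray(log_expr, dtype=float)
    # Center each gene (row) so that PCs describe variation among samples.
    centered = x - x.mean(axis=1, keepdims=True)
    # centered = u @ diag(s) @ vt, with orthonormal columns in u and rows in vt.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    k = n_components
    gene_scores = u[:, :k] * s[:k]       # gene coordinates in the biplot
    sample_loadings = vt[:k].T           # sample coordinates in the biplot
    explained = s ** 2 / np.sum(s ** 2)  # fraction of variance per PC
    return gene_scores, sample_loadings, explained[:k]

def genes_correlated_with_pc(gene_scores, pc, threshold):
    """Indices of genes whose score on one PC exceeds the threshold in the
    positive or negative direction (assumed proxy for expression change)."""
    scores = gene_scores[:, pc]
    positive = np.flatnonzero(scores >= threshold)
    negative = np.flatnonzero(scores <= -threshold)
    return positive, negative
```

Plotting the first two or three columns of `gene_scores` together with the corresponding columns of `sample_loadings` gives a 2D or 3D biplot; because the rows of `vt` are orthonormal, the resulting PCs are uncorrelated by construction.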
The NIA Array Analysis tool has been successfully used for the last two years (Hamatani et al., 2004; Sharov et al., 2003). This open-source non-restricted software will be a valuable resource for the research community.
Received on January 14, 2005; revised on February 15, 2005; accepted on February 17, 2005
REFERENCES
Baldi, P. and Long, A.D. (2001) A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics, 17, 509-519.
Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B, 57, 289-300.
Chapman, S., et al. (2002) Using biplots to interpret gene expression patterns in plants. Bioinformatics, 18, 202-204.
Gabriel, R. (1971) The biplot graphical display of matrices with application to principal component analysis. Biometrika, 58, 453-467.
Hamatani, T., et al. (2004) Dynamics of global gene expression changes during mouse preimplantation development. Dev. Cell, 6, 117-131.
Kim, S.K., et al. (2001) A gene expression map for Caenorhabditis elegans. Science, 293, 2087-2092.
Reiner, A., et al. (2003) Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics, 19, 368-375.
Sharov, A.A., et al. (2003) Transcriptome analysis of mouse stem cells and early embryos. PLoS Biol., 1, e74.
Theilhaber, J., et al. (2004) GECKO: a complete large-scale gene expression analysis platform. BMC Bioinformatics, 5, 195.
TIGR. (2004) TM4 microarray software suite.
Tusher, V.G., et al. (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA, 98, 5116-5121.