Bioinformatics Vol. 17 no. 9 2001
Pages 763-774
© 2001 Oxford University Press
Principal component analysis for clustering gene expression data
Computer Science and Engineering, Box 352350, University of Washington, Seattle, WA 98195, USA
Received on January 1, 2001; revised on May 3, 2001; accepted on May 23, 2001
Motivation: There is a great need to develop analytical methodology to analyze and to exploit the information contained in gene expression data. Because of the large number of genes and the complexity of biological networks, clustering is a useful exploratory technique for analysis of gene expression data. Other classical techniques, such as principal component analysis (PCA), have also been applied to analyze gene expression data. Using different data analysis techniques and different clustering algorithms to analyze the same data set can lead to very different conclusions. Our goal is to study the effectiveness of principal components (PCs) in capturing cluster structure. Specifically, using both real and synthetic gene expression data sets, we compared the quality of clusters obtained from the original data to the quality of clusters obtained after projecting onto subsets of the principal component axes.
Results: Our empirical study showed that clustering with the PCs instead of the original variables does not necessarily improve, and often degrades, cluster quality. In particular, the first few PCs (which contain most of the variation in the data) do not necessarily capture most of the cluster structure. We also showed that clustering with PCs has a different impact on different algorithms and different similarity metrics. Overall, we would not recommend PCA before clustering except in special circumstances.
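The point that the leading PCs need not carry the cluster structure can be illustrated with a small toy example. The sketch below is not the authors' data or evaluation procedure; the two-variable synthetic data set, the helper names `pca_scores` and `between_cluster_fraction`, and the between-cluster variance fraction used as a stand-in for "cluster structure" are all assumptions made for illustration only.

```python
import numpy as np

def pca_scores(X):
    """Project the rows of X onto the principal component axes (SVD of centered data)."""
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / np.sum(s**2)  # fraction of total variance per PC
    return Xc @ Vt.T, explained

def between_cluster_fraction(scores, labels):
    """For each PC, the fraction of its variance due to differences between cluster means."""
    grand_mean = scores.mean(axis=0)
    total = ((scores - grand_mean) ** 2).sum(axis=0)
    between = np.zeros(scores.shape[1])
    for k in np.unique(labels):
        members = scores[labels == k]
        between += len(members) * (members.mean(axis=0) - grand_mean) ** 2
    return between / total

# Hypothetical data: two clusters that differ only along a low-variance variable.
rng = np.random.default_rng(0)
n = 200
labels = np.repeat([0, 1], n)
X = np.column_stack([
    rng.normal(0.0, 10.0, 2 * n),               # high variance, no cluster information
    rng.normal(0.0, 0.5, 2 * n) + 3.0 * labels,  # low variance, separates the clusters
])

scores, explained = pca_scores(X)
frac = between_cluster_fraction(scores, labels)

print("variance explained per PC:      ", np.round(explained, 3))
print("between-cluster fraction per PC:", np.round(frac, 3))
```

In this constructed case the first PC accounts for nearly all of the variance but almost none of the separation between the two groups, so keeping only the first PC before clustering would discard the informative direction. Real expression data are rarely this clean, so the example should be read as an intuition aid rather than evidence.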
Contact: kayee{at}cs.washington.edu
Supplementary information: http://www.cs.washington.edu/homes/kayee/pca
* To whom correspondence should be addressed.