Bioinformatics Vol. 19 no. 4 2003
Pages 459-466
© 2003 Oxford University Press
Comparisons and validation of statistical clustering techniques for microarray gene expression data
1 Department of Mathematics and Statistics and Department of Biology, Georgia State University, Atlanta, GA 30303, USA
2 Department of Statistics, University of Georgia, Athens, GA 30602, USA
Received on May 30, 2002; revised on October 18, 2002; accepted on October 21, 2002
Motivation: With the advent of microarray chip technology, large data sets are emerging that contain the simultaneous expression levels of thousands of genes at various time points during a biological process. Biologists are attempting to group genes based on the temporal pattern of their expression levels. While hierarchical clustering (UPGMA) with correlation distance has been the most common choice in microarray studies, many more clustering algorithms are available in the pattern recognition and statistics literature. At present there do not seem to be any clear-cut guidelines for choosing a clustering algorithm to group genes based on their expression profiles.
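As an illustration of the baseline approach mentioned above, UPGMA (average-linkage hierarchical clustering) with correlation distance can be sketched in a few lines. This is a minimal sketch in Python with SciPy, not the authors' S+ code; the function name `upgma_correlation_clusters`, the toy profiles, and the choice of seven time points are assumptions made only for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def upgma_correlation_clusters(profiles, n_clusters):
    """Cluster rows of `profiles` (genes x time points) by average linkage.

    Hypothetical helper for illustration; the distance between two genes
    is 1 minus the Pearson correlation of their expression profiles.
    """
    # Condensed vector of pairwise correlation distances between genes.
    distances = pdist(profiles, metric="correlation")
    # method="average" is the UPGMA agglomeration rule.
    tree = linkage(distances, method="average")
    # Cut the dendrogram so that at most `n_clusters` groups remain.
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Toy data (an assumption, not the sporulation data set): five genes whose
# expression rises over time and five whose expression falls.
rng = np.random.default_rng(0)
time = np.arange(7)
rising = time + rng.normal(scale=0.1, size=(5, 7))
falling = -time + rng.normal(scale=0.1, size=(5, 7))
profiles = np.vstack([rising, falling])

labels = upgma_correlation_clusters(profiles, n_clusters=2)
print(labels)
```

Because correlation distance ignores the overall level and scale of a profile, genes are grouped by the shape of their temporal pattern rather than by their absolute expression.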
Results: In this paper, we consider six clustering algorithms (of various flavors!) and evaluate their performance on a well-known, publicly available microarray data set on sporulation of budding yeast and on two simulated data sets. Among other things, we formulate three reasonable validation strategies that can be used with any clustering algorithm when temporal observations or replications are present. We evaluate each of the six clustering methods with these validation measures. While the best method depends on the exact validation strategy and the number of clusters used, overall Diana appears to be a solid performer. Interestingly, the performances of correlation-based hierarchical clustering and model-based clustering (another method that has been advocated by a number of researchers) appear to be at opposite extremes, depending on which validation measure one employs. We also show that the group means produced by Diana are the closest to, and those produced by UPGMA the farthest from, a model profile based on a set of hand-picked genes.
Availability: S+ codes for the partial least squares based clustering and for calculating the validation measures are available from the authors upon request. All other clustering methods considered have S+ implementations in the library MASS. The sporulation data set is publicly available at http://cmgm.stanford.edu/pbrown/sporulation
Supplementary information: http://www.mathstat.gsu.edu/~matsnd/clustering/supp.htm
Contact: sdatta{at}mathstat.gsu.edu
* To whom correspondence should be addressed.