
Bioinformatics Vol. 19 no. 4 2003
Pages 459-466
© 2003 Oxford University Press

Comparisons and validation of statistical clustering techniques for microarray gene expression data

Susmita Datta 1,* and Somnath Datta 2

1 Department of Mathematics and Statistics and Department of Biology, Georgia State University, Atlanta, GA 30303, USA
2 Department of Statistics, University of Georgia, Athens, GA 30602, USA

Received on May 30, 2002; revised on October 18, 2002; accepted on October 21, 2002

Motivation: With the advent of microarray chip technology, large data sets are emerging that contain the simultaneous expression levels of thousands of genes at various time points during a biological process. Biologists are attempting to group genes based on the temporal pattern of their expression levels. While hierarchical clustering (UPGMA) with correlation ‘distance’ has been the most common choice in microarray studies, many more clustering algorithms are available in the pattern recognition and statistics literature. At the moment there do not seem to be any clear-cut guidelines for choosing a clustering algorithm to group genes based on their expression profiles.
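As a point of reference for the approach named above, the following is a minimal sketch of average-linkage hierarchical clustering (UPGMA) with a correlation 'distance', taken here as one minus the Pearson correlation between two expression profiles. It is an illustration only, not the authors' S+ code: the function names `correlation_distance` and `upgma`, the choice to stop at a fixed number of clusters `k`, and the toy profiles are all assumptions made for this example.

```python
import math


def correlation_distance(x, y):
    """Return 1 - Pearson correlation between two expression profiles.

    Assumes both profiles have the same length and non-zero variance.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - sxy / math.sqrt(sxx * syy)


def upgma(profiles, k):
    """Average-linkage agglomerative clustering, stopped at k clusters.

    profiles: list of equal-length lists (one row per gene).
    Returns a sorted list of clusters, each a sorted list of row indices.
    """
    n = len(profiles)
    # Pairwise correlation distances between all genes.
    d = [[correlation_distance(profiles[i], profiles[j]) for j in range(n)]
         for i in range(n)]
    clusters = [[i] for i in range(n)]

    def average_linkage(a, b):
        # UPGMA: unweighted mean of all between-cluster pairwise distances.
        return sum(d[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                dist = average_linkage(clusters[p], clusters[q])
                if best is None or dist < best[0]:
                    best = (dist, p, q)
        _, p, q = best
        merged = sorted(clusters[p] + clusters[q])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (p, q)]
        clusters.append(merged)
    return sorted(clusters)


# Hypothetical toy data: two rising and two falling temporal profiles.
toy_profiles = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.5],
    [4.0, 3.0, 2.0, 1.0],
    [8.0, 6.5, 4.0, 2.0],
]
toy_clusters = upgma(toy_profiles, 2)
```

Because the correlation 'distance' ignores the scale of a profile, genes whose expression rises (or falls) together end up in the same group even when their absolute levels differ, which is why it is popular for grouping genes by temporal pattern.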

Results: In this paper, we consider six clustering algorithms (of various flavors!) and evaluate their performance on a well-known publicly available microarray data set on sporulation of budding yeast and on two simulated data sets. Among other things, we formulate three reasonable validation strategies that can be used with any clustering algorithm when temporal observations or replications are present. We evaluate each of the six clustering methods with these validation measures. While the ‘best’ method depends on the exact validation strategy and the number of clusters used, overall Diana appears to be a solid performer. Interestingly, the performances of correlation-based hierarchical clustering and model-based clustering (another method that has been advocated by a number of researchers) appear to be at opposite extremes, depending on which validation measure one employs. Finally, we show that the group means produced by Diana are the closest, and those produced by UPGMA the farthest, from a model profile based on a set of hand-picked genes.

Availability: S+ codes for the partial least squares based clustering are available from the authors upon request. All other clustering methods considered have S+ implementations in the library MASS. S+ codes for calculating the validation measures are available from the authors upon request. The sporulation data set is publicly available at http://cmgm.stanford.edu/pbrown/sporulation

Supplementary information: http://www.mathstat.gsu.edu/~matsnd/clustering/supp.htm

Contact: sdatta{at}mathstat.gsu.edu

* To whom correspondence should be addressed.





