Bioinformatics Advance Access originally published online on November 22, 2007
Bioinformatics 2008 24(2):176-183; doi:10.1093/bioinformatics/btm562
Analysis of a Gibbs sampler method for model-based clustering of gene expression data
1Department of Plant Systems Biology, VIB and 2Department of Molecular Genetics, UGent, Technologiepark 927, 9052 Gent, Belgium
*To whom correspondence should be addressed.
Abstract
Motivation: Over the last decade, a large variety of clustering algorithms has been developed to detect coregulatory relationships among genes from microarray gene expression data. Model-based clustering approaches have emerged as statistically well-grounded methods, but the properties of these algorithms when applied to large-scale data sets are not always well understood. An in-depth analysis can yield important insights into the performance of the algorithm, the expected quality of the output clusters, and the possibilities for extracting more relevant information from a particular data set.
Results: We have extended an existing algorithm for model-based clustering of genes to simultaneously cluster genes and conditions, and used three large compendia of gene expression data for Saccharomyces cerevisiae to analyze its properties. The algorithm uses a Bayesian approach and a Gibbs sampling procedure to iteratively update the cluster assignment of each gene and condition. For large-scale data sets, the posterior distribution is strongly peaked on a limited number of equiprobable clusterings. A GO annotation analysis shows that these local maxima are all equally significant biologically, and that simultaneously clustering genes and conditions performs better than clustering genes alone under the assumption of independent conditions. A collection of distinct equivalent clusterings can be summarized as a weighted graph on the set of genes, from which we extract fuzzy, overlapping clusters using a graph spectral method. The cores of these fuzzy clusters contain tight sets of strongly coexpressed genes, while the overlaps reveal relations between genes showing only partial coexpression.
Availability: GaneSh, a Java package for coclustering, is available under the terms of the GNU General Public License from our website at http://bioinformatics.psb.ugent.be/software
Contact: yves.vandepeer{at}psb.ugent.be
Supplementary information: Supplementary data are available on our website at http://bioinformatics.psb.ugent.be/supplementary_data/anjos/gibbs
Associate Editor: Martin Bishop
Received on July 10, 2007; revised on October 31, 2007; accepted on November 6, 2007
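The Results paragraph describes summarizing several equivalent clusterings as a weighted graph on the genes and then applying a graph spectral method. The abstract does not specify the edge weights or the spectral method, so the following Python sketch rests on assumptions: it takes the edge weight to be the fraction of clusterings in which two genes share a cluster, and it uses plain power iteration for the leading eigenvector as a stand-in for the spectral step. The function names `coclustering_weights` and `leading_eigenpair` and the toy labels are hypothetical and are not taken from GaneSh.

```python
from itertools import combinations


def coclustering_weights(clusterings):
    """Map each gene pair (i, j), i < j, to the fraction of clusterings
    in which the two genes share a cluster (assumed edge weight)."""
    n_runs = len(clusterings)
    n_genes = len(clusterings[0])
    weights = {}
    for i, j in combinations(range(n_genes), 2):
        together = sum(1 for labels in clusterings if labels[i] == labels[j])
        if together:
            weights[(i, j)] = together / n_runs
    return weights


def leading_eigenpair(weights, n_genes, n_iter=200):
    """Power iteration on the symmetric weight matrix.

    Illustrative stand-in for a graph spectral method; it assumes the
    largest eigenvalue is strictly dominant in absolute value.
    """
    vector = [1.0] * n_genes
    value = 0.0
    for _ in range(n_iter):
        product = [0.0] * n_genes
        for (i, j), w in weights.items():
            # Each undirected edge contributes in both directions.
            product[i] += w * vector[j]
            product[j] += w * vector[i]
        value = max(product)
        if value == 0.0:  # graph without edges: nothing to iterate on
            break
        vector = [x / value for x in product]  # rescale so the maximum is 1
    return value, vector


# Toy input: three clusterings of five genes, one cluster label per gene.
clusterings = [
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
]
weights = coclustering_weights(clusterings)
value, scores = leading_eigenpair(weights, n_genes=5)
```

In this toy input, gene 2 changes cluster in one of the three runs, so it receives fractional weights to both groups; overlapping, fuzzy cluster membership arises from this kind of partial co-occurrence.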