Bioinformatics Advance Access published online on April 10, 2008
Bioinformatics, doi:10.1093/bioinformatics/btn133
Divisive Correlation Clustering Algorithm (DCCA) for grouping of genes: Detecting varying patterns in expression profiles
aDepartment of Computer Science and Engineering, Netaji Subhash Engineering College, Garia, Kolkata, India. bMachine Intelligence Unit, Indian Statistical Institute, Kolkata, India.
*To whom correspondence should be addressed. Dr. Rajat K. De, E-mail: rajat{at}isical.ac.in
Abstract
Motivation: Cluster analysis of gene expression data is a useful tool for identifying biologically relevant groups of genes that show similar expression patterns under multiple experimental conditions. Many methods have been proposed for clustering gene expression data; however, most of them have several shortcomings when applied to such data. In the present paper, we focus on several of these shortcomings of conventional clustering algorithms and propose a new clustering algorithm that is able to produce a better clustering solution than some existing methods.
Results: We present the Divisive Correlation Clustering Algorithm (DCCA), which is suitable for finding groups of genes having a similar pattern of variation in their expression values. To detect clusters with high correlation and biological significance, we use the correlation clustering concept introduced by Bansal et al. (2004). DCCA produces a clustering solution without taking the number of clusters to be created as an input. It uses the correlation matrix in such a way that every gene in a cluster has its highest average correlation with the genes in that cluster. To test its performance, we applied DCCA and some well-known conventional methods to one artificial dataset and nine gene expression datasets, and compared the results. The clustering results of DCCA are found to be more significantly relevant to the biological annotations than those of the other methods. These results show the superiority of DCCA over some other methods for clustering gene expression data.
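The cluster property stated above (each gene has its highest average correlation with the genes of its own cluster) can be illustrated with a minimal sketch. This is not the DCCA procedure itself, which the abstract does not spell out; it only shows one way to compute a Pearson correlation matrix and move genes until that property holds. All function names (`pearson`, `correlation_matrix`, `average_correlation`, `reassign`), the toy profiles, and the handling of constant profiles and singleton clusters are assumptions made for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # assumption: a constant profile is treated as uncorrelated
    return cov / (sx * sy)

def correlation_matrix(profiles):
    """Pairwise gene-by-gene correlation matrix."""
    n = len(profiles)
    return [[pearson(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]

def average_correlation(gene, members, corr):
    """Mean correlation of `gene` with `members`, excluding the gene itself."""
    others = [m for m in members if m != gene]
    if not others:
        return 0.0  # assumption: a singleton cluster scores 0
    return sum(corr[gene][m] for m in others) / len(others)

def reassign(labels, corr, max_iter=100):
    """Move every gene to the cluster with which it has the highest
    average correlation; repeat until no gene changes cluster."""
    labels = list(labels)
    for _ in range(max_iter):
        clusters = {}
        for gene, label in enumerate(labels):
            clusters.setdefault(label, []).append(gene)
        new_labels = [
            max(clusters,
                key=lambda c: average_correlation(gene, clusters[c], corr))
            for gene in range(len(labels))
        ]
        if new_labels == labels:
            break
        labels = new_labels
    return labels

# Toy data: genes 0 and 1 rise together, genes 2 and 3 fall together.
profiles = [
    [1, 2, 3, 4],
    [2, 4, 6, 8.5],
    [4, 3, 2, 1],
    [8, 6, 4, 1.5],
]
corr = correlation_matrix(profiles)
# Gene 3 starts in the wrong cluster and is moved next to gene 2.
print(reassign([0, 0, 1, 0], corr))  # -> [0, 0, 1, 1]
```

Note that the number of clusters in this sketch is fixed by the initial labels, whereas DCCA is described as determining it without user input; how DCCA splits clusters is not covered here.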
Availability of the software: The software has been developed in C and Visual Basic and runs on Microsoft Windows platforms. It may be downloaded as a zip file from http://www.isical.ac.in/~rajat and must then be installed. The two Word files included in the zip file should be consulted before installing and running the software.
Contact: rajat{at}isical.ac.in
Supplementary Material: Supplementary Material has been uploaded as a separate file.
Associate Editor: Dr. Trey Ideker
Received on September 14, 2007; revised on January 21, 2008; accepted on April 9, 2008