Bioinformatics Vol. 18 no. 4 2002
Pages 536-545
© 2002 Oxford University Press
Clustering gene expression data using a graph-theoretic approach: an application of minimum spanning trees
Protein Informatics Group, Life Sciences Division, Oak Ridge National Laboratory, MS 6480, Oak Ridge, TN 27831-6480, USA
Received on August 28, 2001
; revised on November 2, 2001
; accepted on November 7, 2001
Motivation: Gene expression data clustering provides a powerful tool for studying functional relationships of genes in a biological process. Identifying correlated expression patterns of genes represents the basic challenge in this clustering problem.
Results: This paper describes a new framework for representing a set of multi-dimensional gene expression data as a Minimum Spanning Tree (MST), a concept from the graph theory. A key property of this representation is that each cluster of the expression data corresponds to one subtree of the MST, which rigorously converts a multi-dimensional clustering problem to a tree partitioning problem. We have demonstrated that though the inter-data relationship is greatly simplified in the MST representation, no essential information is lost for the purpose of clustering. Two key advantages in representing a set of multi-dimensional data as an MST are: (1) the simple structure of a tree facilitates efficient implementations of rigorous clustering algorithms, which otherwise are highly computationally challenging; and (2) as an MST-based clustering does not depend on detailed geometric shape of a cluster, it can overcome many of the problems faced by classical clustering algorithms. Based on the MST representation, we have developed a number of rigorous and efficient clustering algorithms, including two with guaranteed global optimality. We have implemented these algorithms as a computer software EXpression data Clustering Analysis and VisualizATiOn Resource (EXCAVATOR). To demonstrate its effectiveness, we have tested it on three data sets, i.e. expression data from yeast Saccharomyces cerevisiae , expression data in response of human fibroblasts to serum, and Arabidopsis expression data in response to chitin elicitation. The test results are highly encouraging.
Availability: EXCAVATOR is available on request from the authors.
Contact: xyn{at}ornl.gov
* To whom correspondence should be addressed.
![]()
CiteULike
Connotea
Del.icio.us What's this?
This article has been cited by other articles:
![]() |
A. Bhattacharya and R. K. De Divisive Correlation Clustering Algorithm (DCCA) for grouping of genes: detecting varying patterns in expression profiles Bioinformatics, June 1, 2008; 24(11): 1359 - 1366. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. C. Wieland, J. S. Brownstein, B. Berger, and K. D. Mandl Density-equalizing Euclidean minimum spanning trees for the detection of all disease cluster shapes PNAS, May 29, 2007; 104(22): 9404 - 9409. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. Gupta, C. D. Maranas, and R. Albert Elucidation of directionality for co-expressed genes: predicting intra-operon termination sites Bioinformatics, January 15, 2006; 22(2): 209 - 214. [Abstract] [Full Text] [PDF] |
||||
![]() |
Z. Liu, F. Mao, J.-t. Guo, B. Yan, P. Wang, Y. Qu, and Y. Xu Quantitative evaluation of protein-DNA interactions using an optimized knowledge-based potential Nucleic Acids Res., January 26, 2005; 33(2): 546 - 558. [Abstract] [Full Text] [PDF] |
||||
![]() |
W.P. Kuo Overview of Bioinformatics and its Application to Oral Genomics Adv. Dent. Res., December 1, 2003; 17(1): 89 - 94. [Abstract] [Full Text] [PDF] |
||||
![]() |
F. Katagiri and J. Glazebrook Local Context Finder (LCF) reveals multidimensional relationships among mRNA expression profiles of Arabidopsis responding to pathogen infection PNAS, September 16, 2003; 100(19): 10842 - 10847. [Abstract] [Full Text] [PDF] |
||||



