Skip Navigation


Bioinformatics Advance Access originally published online on January 12, 2005
Bioinformatics 2005 21(9):1927-1934; doi:10.1093/bioinformatics/bti251
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/9/1927    most recent
bti251v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (5)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Kim, D.-W.
Right arrow Articles by Lee, D.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Kim, D.-W.
Right arrow Articles by Lee, D.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

Detecting clusters of different geometrical shapes in microarray gene expression data

Dae-Won Kim 1, Kwang H. Lee 1,2 and Doheon Lee 1,*

1Department of BioSystems and Advanced Information Technology Research Center, Korea Advanced Institute of Science and Technology 373–1 Guseong-dong, Yuseong-gu, Daejeon, 305–701, Korea
2Department of Electrical Engineering and Computer Science, Korea Advanced Institute of Science and Technology 373–1 Guseong-dong, Yuseong-gu, Daejeon, 305–701, Korea

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 1 INTRODUCTION
 2 THE GUSTAFSON-KESSEL METHOD
 3 RESULTS
 4 CONCLUSIONS
 REFERENCES
 

Motivation: Clustering has been used as a popular technique for finding groups of genes that show similar expression patterns under multiple experimental conditions. Many clustering methods have been proposed for clustering gene-expression data, including the hierarchical clustering, k-means clustering and self-organizing map (SOM). However, the conventional methods are limited to identify different shapes of clusters because they use a fixed distance norm when calculating the distance between genes. The fixed distance norm imposes a fixed geometrical shape on the clusters regardless of the actual data distribution. Thus, different distance norms are required for handling the different shapes of clusters.

Results: We present the Gustafson–Kessel (GK) clustering method for microarray gene-expression data. To detect clusters of different shapes in a dataset, we use an adaptive distance norm that is calculated by a fuzzy covariance matrix (F) of each cluster in which the eigenstructure of F is used as an indicator of the shape of the cluster. Moreover, the GK method is less prone to falling into local minima than the k-means and SOM because it makes decisions through the use of membership degrees of a gene to clusters. The algorithmic procedure is accomplished by the alternating optimization technique, which iteratively improves a sequence of sets of clusters until no further improvement is possible. To test the performance of the GK method, we applied the GK method and well-known conventional methods to three recently published yeast datasets, and compared the performance of each method using the Saccharomyces Genome Database annotations. The clustering results of the GK method are more significantly relevant to the biological annotations than those of the other methods, demonstrating its effectiveness and potential for clustering gene-expression data.

Availability: The software was developed using Java language, and can be executed on the platforms that JVM (Java Virtual Machine) is running. It is available from the authors upon request.

Contact: dhlee{at}bisl.kaist.ac.kr

Supplementary information: Supplementary data are available at http://dragon.kaist.ac.kr/gk


    1 INTRODUCTION
 TOP
 Abstract
 1 INTRODUCTION
 2 THE GUSTAFSON-KESSEL METHOD
 3 RESULTS
 4 CONCLUSIONS
 REFERENCES
 
The DNA microarray technology has enabled biologists to monitor the expression levels of thousand of genes in parallel under multiple experimental conditions. Since the diauxic shift (DeRisi et al., 1997), sporulation (Chu et al., 1998) and the cell cycle (Cho et al., 1998) in the yeast Saccharomyces cerevisiae were explored, many experiments have been made to monitor the gene-expression levels of various organisms during some biological process.

The present study focuses on the application of clustering methods for analyzing gene-expression datasets that are comprised of the expression patterns of specific environmental conditions, rather than time-course type of data. Since Eisen et al. (1998) first used the hierarchical clustering method to find groups of coexpressed genes, numerous methods have been studied for clustering gene-expression data: self-organizing map (SOM) (Tamayo et al., 1999), k-means clustering (Tavazoie et al., 1999), simulated annealing (Lukashin and Fuchs, 2001), graph-theoretic approach (Xu et al., 2001), mutual information approach (Steuer et al., 2002), fuzzy c-means clustering (Dembele and Kastner, 2003), kernel hierarchical clustering (Qin et al., 2003), diametrical clustering (Dhilon et al., 2003), quantum clustering (QC) with singular value decomposition (Horn and Axel, 2003), bagged clustering (Dudoit and Fridlyand, 2003) and CLICK (Sharan et al., 2003).

Of the clustering methods reported to date, the most widely used methods are the hierarchical, k-means and SOM due to their superiority and availability of several free software tools. However, these conventional clustering methods have a number of limitations. As Yeung et al. (2001), and Gibbons and Roth (2002) pointed out, the performance of the hierarchical clustering method was close to random, despite its wide-usage, and worse than the k-means and SOM. Such cases arise because the hierarchical clustering is likely to produce one single large cluster and several singletons. Furthermore, this method suffers from the defect that it can never repair what was done in previous steps.

The partitional clustering methods such as the k-means and SOM are also problematic when the clusters differ in shape (Babuska, 1998; Bezdek et al., 1999; Jain et al., 1999). The shape of clusters is determined by a distance norm, which is a fundamental factor when developing a clustering method. Let xj be the expression vector for the j-th gene over different environmental conditions (or samples), and let vi be the i-th cluster centroid (prototype). Then a typical distance norm between xj and vi is represented as:

(1)
where A is a symmetric and positive definite matrix. Thus different norms can be induced by the choice of the matrix A. The Euclidean norm is induced when A = I where I is an identity matrix. The Mahalonobis norm is induced when A = M–1 where M–1 is the inverse of the covariance matrix of patterns in the system.

Although many clustering methods have been studied in the literature, a common limitation of conventional methods is to use a fixed distance norm for finding clusters; this fixed norm imposes a fixed geometrical structure and finds clusters of that shape even if they are not present. The Euclidean norm-based methods find only spherical shape of clusters, and the Mahalanobis norm-based methods find only ellipsoidal ones even if those shapes of clusters are not present in a dataset. For example, let us consider a dataset distributed in four clusters, plotted in Figure 1a, which is composed of the expression vectors of 400 genes measured at two conditions. Although the clusters differ in shape, the conventional clustering method using the Euclidean distance is limited to identify the four clusters because it imposes a spherical shape on the clusters with no regard to the actual data distribution, as depicted in Figure 1b. In such cases, it is of no help to use another distance norm.



View larger version (19K):
[in this window]
[in a new window]
 
Fig. 1 Clustering of the conventional clustering method using the Euclidean distance: (a) plot of the original dataset measured at two conditions in which four clusters have different shapes; (b) conventional method detects clusters of spherical shape regardless of the actual data distribution. The horizontal axis represents the expression value at the first condition. The vertical axis represents the expression value at the second condition.

 
To tackle the addressed problems, different distance norms are required for handling the different shapes of clusters; specifically, different As should be applied to the different clusters. In the present work, we present the Gustafson–Kessel (GK) method based on an adaptive distance norm (Gustafson and Kessel, 1979; Babuska, 1998; Bezdek et al., 1999) for clustering gene-expression data. Different norm-inducing matrix As are adapted by estimating the data covariance, and used to guess the associated structure of the data in each individual cluster. Recently, Guthke et al. (2002) showed that the GK method gave higher accuracy than other methods in predicting the gene function of Escherichia coli. However, the work of Guthke et al. was not very extensive and the experimental comparisons were made using a very small subset of data (265 genes). In the present study we provide a comprehensive analysis on the GK method, and show the superiority of the GK method to other methods through extensive experiments with various datasets. The remainder of this paper is organized as follows: Section 2 describes the formulation of the GK method; Section 3 highlights the potential of the GK method through several tests on the yeast datasets; and Section 4 presents our concluding remarks.


    2 THE GUSTAFSON–KESSEL METHOD
 TOP
 Abstract
 1 INTRODUCTION
 2 THE GUSTAFSON-KESSEL METHOD
 3 RESULTS
 4 CONCLUSIONS
 REFERENCES
 
The GK method generates a fuzzy partition that provides a degree of membership of each data point to a given cluster. The objective of the GK method is to classify a set of data points X = {x1,x2,...,xn} in p-dimensional space into k disjoint and homogeneous clusters represented as C = C1,C2,...,Ck. To detect clusters of different geometrical shapes in a dataset, this method introduces an adaptive distance norm for each cluster. Each cluster Ci has its own norm-inducing matrix Ai, which affects the distance norm given as the following.

(2)
where V = [v1,v2,...,vk] is a vector of the centroids of the fuzzy clusters C1,C2,...,Ck. Use of the matrix Ai makes it possible for each cluster to adapt the distance norm to the geometrical structure of the data at each iteration. Based on the norm-inducing matrices, the objective of the GK method is obtained by minimizing the function Jm.

(3)
where

is a k-tuple of the norm-inducing matrices,

(4)
is a fuzzy partition matrix of X satisfying the following constraints,

(5)
and

(6)
is a weighting exponent that controls the membership degree µij of each data point xj to the cluster Ci. The choice of appropriate m value is of importance because the final clusters may vary depending on the m value selected. As m -> 1, J1 produces a hard partition where µij {0,1}. As m approaches infinity, J{infty} produces a maximum fuzzy partition where µij = 1/c.

To obtain a feasible solution by minimizing Equation (3), the additional constraint is required for Ai.

(7)
where {rho}i is a cluster volume for each cluster. Equation (7) guarantees that Ai is positive-definite, indicating that Ai can be varied to find the optimal shape of cluster with its volume fixed. Using the Lagrange multiplier method, minimization of Equation (3) with respect to Ai produces the following equation.

(8)
where

(9)
is the fuzzy covariance matrix of cluster Ci. The set of fuzzy covariance matrices is represented as a k-tuple of F = (F1, F2,..., Fk). To show Ai in Equation (8) satisfies the constraint of a symmetric and positive-definite matrix, suppose that there are p linearly independent data points {xi} Rp in the dataset. Then, the matrices {xi}{xi}T are symmetric and positive semi-definite and also their weighted sum (Fi), and hence Ai is symmetric and positive-definite.

There is no general agreement on what value to use for {rho}i; without any prior knowledge, a rule of thumb that many investigators use is {rho}i = 1 in practice (Bezdek et al., 1999). Moreover, based on the notion that {rho}i represents the cluster volume for each cluster, in the present study we estimated {rho}i, shown in the following equation, by exploiting the definition on the volume of fuzzy cluster Ci (Dumitrescu et al., 2000), making the GK method to be fully operational.

(10)

Under this formulation, the fixed norm calculated for the distance between xj and vi is replaced in the GK method with the following distance,

(11)

The eigenstructure of the covariance matrices Fi identifies the shape of clusters. Let {lambda}1,{lambda}2, ..., {lambda}p be the eigenvalues of Fi, and they are arranged in decreasing order; then the length of the r-th axis of each cluster is given by for r = 1, ..., p. Figure 2 shows the eigenstructures of the clusters for the dataset from Figure 1a. The thick lines represent the eigenvectors of Fi, which correctly detect the shape of the clusters. The cluster centroids and eigenvalues obtained by the GK method for the four clusters are listed in Table 1. The last column is the ratio of the length of the first axis () to the length of the second axis (), which describes the overall distribution of data. We can see that the ratio value of the cluster C1 is about 1.0, indicating that C1 has a spherical shape. In contrast, the cluster C4 has a ratio value of 9.36, indicating that the shape of C4 would be an elongated ellipsoid. The eigenvalue calculations, shown in Figure 2 and Table 1, indicate that the shape of each cluster is recognized by its eigenstructure, and therefore the adaptive distance norm is capable of identifying the inherent structure of the data.



View larger version (16K):
[in this window]
[in a new window]
 
Fig. 2 The GK method successfully identifies the four clusters by exploiting the covariance matrices Fi for the clusters. Dots represent data points and thick lines represent the hyperellipsoides given by the eigenstructures of clusters. The horizontal axis represents the expression value at the first condition. The vertical axis represents the expression value at the second condition.

 

View this table:
[in this window]
[in a new window]
 
Table 1 The cluster centroids (vi) and eigenvalues ({lambda}i) obtained by the GK method for the dataset from Figure 2

 
The saddle point of Jm is obtained by considering the constraint Equation (5) as the Lagrange multipliers:

(12)
and by setting {nabla}Jm = 0. If for all i,j and m > 1, then (U,V) may minimize Jm only if,

(13)
and

(14)
This solution also satisfies the remaining constraints of Equation (5).

The GK method iteratively improves a sequence of sets of clusters until no further improvement in Jm(U,V,A:X) is possible. This type of iteration is often referred to as the alternating optimization (AO) scheme. It loops through the estimates for Vt–1 -> Ut -> Vt (where t is the iteration step) and terminates on ||Vt Vt–1|| ≤ {varepsilon}. Equivalently, the initialization of the algorithm can be done on U0, and the iterates become Ut–1 -> Vt -> Ut, with the termination criterion ||UtUt–1|| ≤ {varepsilon}. This way of alternating optimization makes the GK method be less prone to falling into local minima than the k-means and SOM methods because it makes soft decisions in each iteration through the use of membership functions.

Algorithm 1
Gustafson–Kessel method

Given the dataset X = {x1,..., xn}, xj Rp, the number of clusters (k), the weighting exponent (m), and the termination criterion ({varepsilon}), this method finds k disjoint and homogeneous clusters.

  1. Initialize (initially, t <- 1) of xj belonging to cluster Ci for 1 ≤ i ≤ k, 1 ≤ j ≤ n such that:


  2. Update the cluster centroids for 1 ≤ i ≤ k using:


  3. Update the cluster covariance matrices Fi for 1 ≤ i ≤ k using:


  4. Compute the distances between xj and for 1 ≤ i ≤ k, 1 ≤ j ≤ n using:


  5. Update by the following procedure. For each xj, 1 ≤ j ≤ n,
    1. if DijAi > 0, 1 ≤ i ≤ k, then update the membership of xj at t by:


    2. if DijAi = 0 for some i I {subseteq} 1,..., k, then for all i I, set to be between [0,1] such that:



  6. If ||UtUt–1|| ≤ {varepsilon}, then stop; otherwise, t <- t + 1 and go to Step 2.

Algorithm 1 shows the procedural steps of the GK method for clustering the n x p gene-expression data where n is the number of genes and p is the number of experiments. A singularity can occur at Step 4 when the number of data is small or when the data within a cluster are much linearly correlated (Babuska, 1998). In such cases, the cluster covariance matrix becomes singular and cannot be inverted; thus we cannot compute the distances at Step 4. To deal with such cases, in the present study, the membership degrees are set to µij = 0 for non-singular classes, and arbitrary values for singular classes subject to the constraints of Equation (5).

THEOREM 1
The GK method given in Algorithm 1 converges in a finite number of iterations.

PROOF 1
We first show that a saddle point of Jm appears at most once by the GK method before it stops. Suppose that this is not true, i.e. Ut1 = Ut2 for some t1, t2 where t1 != t2. By the AO scheme, we get Vt1+1 and Vt2+1 as optimal solutions for U = Ut1 and U = Ut2, respectively. Therefore, we have

(15)
However, the sequence Jm(·,·) generated by the GK method is strictly decreasing (Selim and Ismail, 1984). Hence equation (15) is false and Ut1 != Ut2. Since there are a finite number of saddle points of Jm (Selim and Ismail, 1984), the algorithm will converge in a finite number of iterations.

A similar proof concerning the convergence of the k-means-type algorithms to a local minimum has been stated by Selim and Ismail, 1984.


    3 RESULTS
 TOP
 Abstract
 1 INTRODUCTION
 2 THE GUSTAFSON-KESSEL METHOD
 3 RESULTS
 4 CONCLUSIONS
 REFERENCES
 
3.1 Datasets and implementation parameters
To test the effectiveness with which the GK method clusters gene-expression data, we applied the GK method and five well-known clustering methods to three recently published yeast datasets and compared the performance of each method. In the present study, we used EXPANDER (Sharan et al., 2003) software that implements many clustering methods, of which we investigated the results of the k-means, SOM, and CLICK methods. Along with these, we examined the results of the bagged clustering (BagClust) (Dudoit and Fridlyand, 2003) and the QC (Horn and Axel, 2003).

The datasets employed were the yeast PHO-regulation dataset of Ogawa et al. (2000), the yeast ATP-regulation dataset of Mizuguchi et al. (2004), and the yeast Calcineurin-regulation dataset of Yoshimoto et al. (2002). The Ogawa's PHO dataset contains the expression profiles of 6013 yeast genes measured at eight PHO-regulated experiments. The Mizuguchi's ATP dataset consists of the expression levels of the 6215 yeast genes measured at three different ATPase experiments. The Yoshimoto's Calcineurin dataset contains the expression profiles of 6102 yeast genes at 24 experiments by the presence and absence of Ca2+, Na+, CRZ1 and FK506. These three datasets were obtained from a public website containing various published large-scale yeast datasets (http://sgdlite.princeton.edu/download/yeast_datasets/).

In these experiments, the parameters used in the GK method were {varepsilon} = 0.001, m = 2.5, and {rho} = 1; these values were chosen because they have been overwhelmingly favored in many studies (Bezdek et al., 1999). In the tests reported here, we analyzed the performance of each method under changes in the number of clusters of k, with varied from k = 2 to k = 10.

3.2 Performance comparison
In the present study, the clustering results were assessed using two validation measures: z-score and Jaccard score.

First, the z-score (Gibbons and Roth, 2002) is calculated by investigating the relation between a clustering result and the functional annotation of the genes in the cluster. To achieve this, the score uses the Saccharomyces Genome Database (SGD) annotation of the yeast genes, along with the gene ontology developed by the Gene Ontology Consortium (Ashburner et al., 2000; Issel-Tarver et al., 2002). A higher score of z indicates that genes are better clustered by function, indicating a more biologically significant clustering result.

Figure 3 shows the clustering results of the k-means, SOM, BagClust and GK methods for the yeast PHO dataset. The z-scores of the four clustering methods are plotted with respect to k = 2,3,...,10. The k-means method gave z-scores of ranging from –0.9 to 3.1, and SOM gave scores from –0.8 to 2.5. The z-scores of the BagClust were ranged from –0.6 to 1.1. The SOM method outperformed the k-means and BagClust methods at low k-values, and the k-means method showed better performance than the SOM and BagClust methods at high k-values. Compared to these three methods, the GK method provided superior clustering performance over a wide range of k-values; the z-scores were varied from 16.8 to 25.3.



View larger version (13K):
[in this window]
[in a new window]
 
Fig. 3 Comparison of the clustering performance of the k-means, SOM, BagClust and GK methods for the yeast PHO-regulation dataset of Ogawa et al. (2000). The horizontal axis represents the number of clusters given, the vertical axis represents the z-score. The z-score is computed with the relation between a clustering result and the SGD functional annotation of the genes in the cluster (Gibbons and Roth (2002)).

 
Figure 4 shows the clustering results of the k-means, SOM, BagClust and GK methods for the yeast ATP dataset. The k-means method gave z-scores of ranging from 1.0 to 3.9. On the whole, the SOM and BagCluster showed similar tendency for all k values. In comparison to these methods, it is evident that the GK clustering method shows markedly better performance, giving z-scores of >15.0 for k > 3. This result is in agreement with the work of Mizuguchi et al. (2004), which reported that the optimal k-value lies ~3 for this dataset.



View larger version (14K):
[in this window]
[in a new window]
 
Fig. 4 Comparison of the clustering performance of the k-means, SOM, BagClust, and GK methods for the yeast ATP-regulation dataset of Mizuguchi et al. (2004). The horizontal axis represents the number of clusters given, the vertical axis represents the z-score. The z-score is computed with the relation between a clustering result and the SGD functional annotation of the genes in the cluster (Gibbons and Roth, 2002).

 
The clustering results achieved by the six clustering methods for each of the yeast PHO and ATP datasets are listed in Table 2. For the PHO dataset, the k-means method showed the best z-score of 3.15 at k = 9. The SOM and BagClust methods provided best z-values of 2.53 and 1.14 at k = 7, respectively. The CLICK and QC methods yielded z-values of 1.65 and 1.46 at k = 31 and k = 16, respectively; these two methods automatically find the optimal k. In contrast, the best value of z = 25.38 of the GK method was obtained at k = 10. Furthermore, we tested the performance of the GK method for two different values of {rho} = 1 and . It is observed that the GK method for these two {rho} values showed similarly better performance than other methods; the z-scores were varied from 5.77 to 29.80 over k = 2,3,...,10. The best scores of the GK method were z = 25.38 for {rho} = 1, and z = 29.80 for . For the ATP dataset, the best z-values of the k-means and SOM methods were 3.97 and 1.84 at k = 5 and k = 3, respectively. The BagClust and CLICK provided the best z = 0.87 and z = –1.03 at k = 10, respectively, whereas the QC method yielded the best z = 0.94 at k = 8. The GK method with respect to {rho} = 1, provided significantly better clustering performance than other methods, giving z-scores of >20.0.


View this table:
[in this window]
[in a new window]
 
Table 2 Summary of clustering results obtained by six clustering methods for the yeast PHO-regulation dataset and the ATP-regulation dataset

 
In addition to the assessment using the z-score, we quantified the clustering result of each method using the Jaccard and Minkowski scores. Let T be the ‘true’ solution and C the solution a clustering algorithm generated. Let n11 be the number of pairs of data that are in the same cluster in both T and C. Let n10 be the number of pairs that are in the same cluster only in T, and n01 be the number of pairs that are in the same cluster only in C. Then the Jaccard score is defined as

(16)
A higher value of SJ(T, C) indicates a better clustering result; the Minkowski score is defined as

(17)

In this case, a lower value of SM(T, C) indicates a well-clustered result. To measure the quality of clustering with these two scores, we applied five clustering methods to the yeast Calcineurin dataset that provides a putative true solution obtained through manual inspection by Yoshimoto et al. (2002). Table 3 lists the comparison results of five clustering methods for k = 3. The GK method is the most effective of the methods considered; it provides the highest Jaccard score, with the lowest Minkowski score. The k-means and BagCluster methods showed better scores than the SOM and QC methods, and the QC method proved the most ineffective of the methods considered.


View this table:
[in this window]
[in a new window]
 
Table 3 Comparison of the clustering performance of the k-means, SOM, BagClust, QC and GK methods for the yeast Calcineurin-regulation dataset of Yoshimoto et al. (2002)

 
The results of the comparison calculations indicate that the GK method gave markedly better clustering performance than the other five methods considered, highlighting the effectiveness and potential of the GK method.

3.3 Functional enrichment
The enriched functional categories for each cluster obtained by the GK method on the yeast PHO and ATP datasets are listed in Tables 4 and 5 respectively. The enrichment of each GO category in each of the clusters was calculated by its P-value. To compute the P-value, we employed the hypergeometric distribution used by Tavazoie et al. (1999) and Dembele and Kastner (2003) in order to obtain the probability of observing the number of genes from a specific GO functional category within each cluster. More detailed explanation on this P-value can be found in Tavazoie et al. (1999) and Dembele and Kastner (2003). A low P-value indicates that the genes belonging to the enriched functional categories are biologically significant in the corresponding clusters. In the present study, only functional categories with P-value <5.0 x 10–7 are reported.


View this table:
[in this window]
[in a new window]
 
Table 4 Enrichment of GO categories in each of the clusters obtained by the GK method for the yeast PHO-regulation dataset of Ogawa et al. (2000)

 

View this table:
[in this window]
[in a new window]
 
Table 5 Enrichment of GO categories in each of the clusters obtained by the GK method for the yeast ATP-regulation dataset of Mizuguchi et al. (2004)

 
Of the ten clusters obtained for the yeast PHO dataset (Table 4), the cluster C9 contains several enriched categories on ‘ribosome’. The highly enriched category in cluster C9 is the ‘ribonucleoprotein complex’ with P-value of 4.83 x 10–35. The GO category ‘ribosome’ is also highly enriched in this cluster with P-value of 4.20 x 10–45. The cluster C2 contains an enriched category ‘RNA processing’ with P-value of 7.30 x 10–44. In the case of the ATP dataset (Table 5), the cluster C3 contains the yeast genes corresponding to the ATP-involved GO biological process. The highly enriched categories in cluster C3 are the ‘ATP metabolism’ with P-value of 4.89 x 10–9 and the ‘purine ribonucleoside triphosphate biosynthesis’ with P-value of 1.39 x 10–7. From the results of Tables 4 and 5, we see that the cluster obtained by the GK method shows a high enrichment of functional categories.

Besides, as mentioned earlier, the GK method produces a fuzzy clustering result, which provides a convenient way of selecting genes showing tight association to given clusters (Dembele and Kastner, 2003). Dembele and Kastner referred to this process as ‘restricted’ selection. Similar to the work of Dembele and Kastner, we removed the most loosely associated genes in each cluster using a threshold-based selection; specifically, genes with the highest membership degree µij > 0.5 were selected. To see the effect of the restricted selection, we compared the percentage of genes in the raw cluster and its corresponding restricted cluster for GO functional categories. Table 6 lists a comparison result for the cluster C9 shown from Table 4. In comparison to the raw cluster, the percentage of genes in the restricted cluster were remarkably increased in most cases. For instance, the percentage for the ‘ribonucleoprotein complex’ was increased from 14.12% in the raw cluster to 20.36% in the restricted cluster; this indicates that the genes in this category were retained more frequently than the average genes in the cluster. We see from this example that selecting tightly associated genes using the threshold of membership degree can increase the biological relevance of the genes in the cluster.


View this table:
[in this window]
[in a new window]
 
Table 6 Enrichment of GO categories in the raw cluster C9 from Table 4 and in the corresponding restricted cluster obtained by the GK method for the yeast PHO-regulation dataset of Ogawa et al. (2000)

 

    4 CONCLUSIONS
 TOP
 Abstract
 1 INTRODUCTION
 2 THE GUSTAFSON-KESSEL METHOD
 3 RESULTS
 4 CONCLUSIONS
 REFERENCES
 
It is well established that clustering is a useful technique for analysis of large amounts of gene-expression data, and a variety of clustering method have been used in biological research. However, conventional clustering methods suffer from a tendency to impose a fixed shape on the clusters even if the clusters in many real-life datasets may differ in shape. To address this problem, we have presented the GK clustering method using an adaptive distance norm. The adaptive norm is described in terms of the covariance matrix for each cluster in which the eigenstructure is exploited to identify the shape of the cluster. The present assessment has shown that the GK method is the superior and effective method for clustering gene-expression data.

Despite these benefits of the GK method, several issues require further investigation. The computational costs of the GK method are much higher than other clustering methods such as the k-means method. In each iteration step, k · n matrices are computed and subsequently inverted, and additionally, k determinants are computed. Especially, the GK method becomes computationally inefficient when applied to high dimensional data. Figure 5 shows the real run time of the GK method to cluster the yeast PHO-regulation and the ATP-regulation datasets. It is observed from the figure that the eight-dimensional PHO dataset shows a rapid increase in the computational time more than the three-dimensional ATP dataset as the number of clusters is increasing. In such a case, it maybe useful to initialize the cluster centroids of the GK method with the resulting centroids of the k-means method so that the GK method can converge fast to the saddle points of Jm with the reduced number of iterations. Second, the GK method is problematic when the number of data is small or when the data within a cluster are much linearly correlated (Babuska, 1998). In such cases, the cluster covariance matrix becomes singular and cannot be inverted; thus we cannot compute the distances at Step 4 in Algorithm 1. To tackle this singularity problem, it would be helpful to prevent the ratio of the maximum to the minimum eigenvalue from being larger than some predefined threshold.



View larger version (12K):
[in this window]
[in a new window]
 
Fig. 5 The run time of the GK method for clustering the yeast PHO-regulation dataset and the ATP-regulation dataset. The horizontal axis represents the number of clusters given, the vertical axis represents the real run time in seconds.

 


    Acknowledgments
 
This work was supported by the Korean Systems Biology Research Grant (M1-0309-02-0002) from the Ministry of Science and Technology. We would like to thank CHUNG Moon Soul Center for BioInformation and BioElectronics and the IBM SUR program for providing research and computing facilities.

Received on August 13, 2004; revised on December 16, 2004; accepted on December 23, 2004

    REFERENCES
 TOP
 Abstract
 1 INTRODUCTION
 2 THE GUSTAFSON-KESSEL METHOD
 3 RESULTS
 4 CONCLUSIONS
 REFERENCES
 

    Ashburner, M., Ball, C., Blake, J., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Davis, A.P., Dolinski, K., Dwight, S.S., et al. (2000) Gene ontology: tool for the unification of biology. Nat. Genet., 25, 25–29[CrossRef][Web of Science][Medline].

    Babuska, R. Fuzzy Modeling for Control, (1998) , Boston Kluwer Academy Publishers.

    Bezdek, J., Keller, J., Krisnapuram, R., Pal, N.R. Fuzzy Models and Algorithms for Pattern Recognition and Image Processing, (1999) , Boston Kluwer Academy Publishers.

    Cho, R., Campbell, M., Winzeler, E., Steinmetz, L., Conway, A., Wodicka, L., Wolfsberg, T.G., Gabrielian, A.E., Landsman, D., Lockhar, D.J., Davis, R.W., et al. (1998) A genome-wide transcriptional analysis of the mitotic cell cycle. Mol. Cell, 2, 65–73[CrossRef][Web of Science][Medline].

    Chu, S., DeRish, J., Eisen, M., Mulholland, J., Botstein, D., Brown, P.O., Herskowitz, I. (1998) The transcriptional program of sporulation in budding yeast. Science, 282, 699–705[Abstract/Free Full Text].

    Dembele, D. and Kastner, P. (2003) Fuzzy c-means method for clustering microarray data. Bioinformatics, 19, 973–980[Abstract/Free Full Text].

    DeRisi, J., Iyer, V., Brown, P. (1997) Exploring the metabolic and genetic control of gene expression on a genomic scale. Science, 282, 257–264.

    Dhilon, I., Marcotte, E., Roshan, U. (2003) Diametrical clustering for identifying anti-correlated gene clusters. Bioinformatics, 19, 1612–1619[Abstract/Free Full Text].

    Dudoit, S. and Fridlyand, J. (2003) Bagging to improve the accuracy of a clustering procedure. Bioinformatics, 19, 1090–1099[Abstract/Free Full Text].

    Dumitrescu, D., Lazzerini, B., Jain, L. Fuzzy Sets and Their Applications to Clustering and Training, (2000) , Florida CRC Press.

    Eisen, M., Spellman, P., Brown, P., Botstein, D. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA, 95, 14863–14868[Abstract/Free Full Text].

    Gibbons, F. and Roth, F. (2002) Judging the quality of gene expression-based clustering methods using gene annotation. Genome Res., 12, 1574–1581[Abstract/Free Full Text].

    Gustafson, E. and Kessel, W. (1979) Fuzzy clustering with a fuzzy covariance matrix. Proc. IEEE Conf. Decision Control, 761–766.

    Guthke, R., Schmidt-Heck, W., Hahn, D., Pfaff, M. (2002) Gene expression data mining for functional genomics using fuzzy technology. Advances in Computational Intelligence and Learning: Methods and Applications, , New York Springer, pp. 475–488.

    Horn, D. and Axel, I. (2003) Novel clustering algorithm for microarray expression data in a truncated svd space. Bioinformatics, 19, 1110–1115[Abstract/Free Full Text].

    Issel-Tarver, L., Christie, K., Dolinski, K., Andrada, R., Balakrishnan, R., Ball, C.A., Binkley, G., Dong, S., Dwight, S.S., Fisk, D.G. (2002) Saccharomyces, genome database. Methods Enzymol., 350, 329–346[Web of Science][Medline].

    Jain, A., Murty, M., Flynn, P. (1999) Data clustering: a review. ACM Comput. Surv., 31, 264–323[CrossRef].

    Lukashin, A. and Fuchs, R. (2001) Analysis of temporal gene expression profiles: clustering by simuulated annealing and determining the optimal number of clusters. Bioinformatics, 17, 405–414[Abstract/Free Full Text].

    Mizuguchi, G., Shen, X., Landry, J., Wu, W.H., Sen, S., Wu, C. (2004) ATP-driven exchange of histone h2az variant catalyzed by swr1 chromatin remodeling complex. Science, 303, 343–348[Abstract/Free Full Text].

    Ogawa, N., DeRisi, J., Brown, P.O. (2000) New components of a system for phosphate accumulation and polyphosphate metabolism in Saccharomyces cerevisiae revealed by gemomic expression analysis. Mol. Biol. Cell, 11, 4309–4321[Abstract/Free Full Text].

    Qin, J., Lewis, D., Noble, W. (2003) Kernel hierarchical gene clustering from microarray gene expression data. Bioinformatics, 19, 2097–2104[Abstract/Free Full Text].

    Selim, S. and Ismail, M. (1984) K-means type algorithms: a generalized convergence theorem and the caracterization of local optimality. IEEE Trans. Pattern Anal. Mach. Intell., 6, 284–288.

    Sharan, R., Maron-Katz, A., Shamir, R. (2003) Click and expander: a system for clustering and visualizing gene expression data. Bioinformatics, 19, 1787–1799[Abstract/Free Full Text].

    Steuer, R., Kurths, J., Daub, C., Weise, J., Selbig, J. (2002) The mutual information: detecting and evaluating dependencies between variables. Bioinformatics, 18, S231–S240[Abstract].

    Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E.S., Golub, T.R. (1999) Interpreting patters of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc. Natl Acad. Sci. USA, 96, 2907–2912[Abstract/Free Full Text].

    Tavazoie, S., Hughes, J., Campbell, M., Cho, R.J., Church, G.M. (1999) Systematic determination of genetic network architecture. Nat. Genet., 22, 281–285[CrossRef][Web of Science][Medline].

    Xu, Y., Olman, V., Xu, D. (2001) Clustering gene expression data using a graph-theoretic approach: an application of minimum spanning trees. Bioinformatics, 17, 309–318[Abstract/Free Full Text].

    Yeung, K., Haynor, D., Ruzzo, W. (2001) Validating clustering for gene expression data. Bioinformatics, 17, 309–318[Abstract/Free Full Text].

    Yoshimoto, H., Saltsman, K., Gasch, A., Li, H.X., Ogawa, N., Botstein, D., Brown, P.O., Cyert, M.S. (2002) Genome-wide analysis of gene expression regulated by the calcineurin/crz1p signaling pathway in Saccharomyces cerevisiae. J. Biol. Chem., 277, 31079–31088[Abstract/Free Full Text].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Brief BioinformHome page
B. Andreopoulos, A. An, X. Wang, and M. Schroeder
A roadmap of clustering algorithms: finding a match for a biomedical application
Brief Bioinform, May 1, 2009; 10(3): 297 - 314.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
A. Bhattacharya and R. K. De
Divisive Correlation Clustering Algorithm (DCCA) for grouping of genes: detecting varying patterns in expression profiles
Bioinformatics, June 1, 2008; 24(11): 1359 - 1366.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
D.-W. Kim, K.-Y. Lee, K. H. Lee, and D. Lee
Towards clustering of incomplete microarray data without the use of imputation
Bioinformatics, January 1, 2007; 23(1): 107 - 113.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/9/1927    most recent
bti251v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (5)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Kim, D.-W.
Right arrow Articles by Lee, D.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Kim, D.-W.
Right arrow Articles by Lee, D.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?