Bioinformatics Advance Access originally published online on January 12, 2005
Bioinformatics 2005 21(9):1927-1934; doi:10.1093/bioinformatics/bti251
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Detecting clusters of different geometrical shapes in microarray gene expression data
1Department of BioSystems and Advanced Information Technology Research Center, Korea Advanced Institute of Science and Technology 3731 Guseong-dong, Yuseong-gu, Daejeon, 305701, Korea
2Department of Electrical Engineering and Computer Science, Korea Advanced Institute of Science and Technology 3731 Guseong-dong, Yuseong-gu, Daejeon, 305701, Korea
*To whom correspondence should be addressed.
| Abstract |
|---|
|
|
|---|
Motivation: Clustering has been used as a popular technique for finding groups of genes that show similar expression patterns under multiple experimental conditions. Many clustering methods have been proposed for clustering gene-expression data, including the hierarchical clustering, k-means clustering and self-organizing map (SOM). However, the conventional methods are limited to identify different shapes of clusters because they use a fixed distance norm when calculating the distance between genes. The fixed distance norm imposes a fixed geometrical shape on the clusters regardless of the actual data distribution. Thus, different distance norms are required for handling the different shapes of clusters.
Results: We present the GustafsonKessel (GK) clustering method for microarray gene-expression data. To detect clusters of different shapes in a dataset, we use an adaptive distance norm that is calculated by a fuzzy covariance matrix (F) of each cluster in which the eigenstructure of F is used as an indicator of the shape of the cluster. Moreover, the GK method is less prone to falling into local minima than the k-means and SOM because it makes decisions through the use of membership degrees of a gene to clusters. The algorithmic procedure is accomplished by the alternating optimization technique, which iteratively improves a sequence of sets of clusters until no further improvement is possible. To test the performance of the GK method, we applied the GK method and well-known conventional methods to three recently published yeast datasets, and compared the performance of each method using the Saccharomyces Genome Database annotations. The clustering results of the GK method are more significantly relevant to the biological annotations than those of the other methods, demonstrating its effectiveness and potential for clustering gene-expression data.
Availability: The software was developed using Java language, and can be executed on the platforms that JVM (Java Virtual Machine) is running. It is available from the authors upon request.
Contact: dhlee{at}bisl.kaist.ac.kr
Supplementary information: Supplementary data are available at http://dragon.kaist.ac.kr/gk
| 1 INTRODUCTION |
|---|
|
|
|---|
The DNA microarray technology has enabled biologists to monitor the expression levels of thousand of genes in parallel under multiple experimental conditions. Since the diauxic shift (DeRisi et al., 1997), sporulation (Chu et al., 1998) and the cell cycle (Cho et al., 1998) in the yeast Saccharomyces cerevisiae were explored, many experiments have been made to monitor the gene-expression levels of various organisms during some biological process.
The present study focuses on the application of clustering methods for analyzing gene-expression datasets that are comprised of the expression patterns of specific environmental conditions, rather than time-course type of data. Since Eisen et al. (1998) first used the hierarchical clustering method to find groups of coexpressed genes, numerous methods have been studied for clustering gene-expression data: self-organizing map (SOM) (Tamayo et al., 1999), k-means clustering (Tavazoie et al., 1999), simulated annealing (Lukashin and Fuchs, 2001), graph-theoretic approach (Xu et al., 2001), mutual information approach (Steuer et al., 2002), fuzzy c-means clustering (Dembele and Kastner, 2003), kernel hierarchical clustering (Qin et al., 2003), diametrical clustering (Dhilon et al., 2003), quantum clustering (QC) with singular value decomposition (Horn and Axel, 2003), bagged clustering (Dudoit and Fridlyand, 2003) and CLICK (Sharan et al., 2003).
Of the clustering methods reported to date, the most widely used methods are the hierarchical, k-means and SOM due to their superiority and availability of several free software tools. However, these conventional clustering methods have a number of limitations. As Yeung et al. (2001), and Gibbons and Roth (2002) pointed out, the performance of the hierarchical clustering method was close to random, despite its wide-usage, and worse than the k-means and SOM. Such cases arise because the hierarchical clustering is likely to produce one single large cluster and several singletons. Furthermore, this method suffers from the defect that it can never repair what was done in previous steps.
The partitional clustering methods such as the k-means and SOM are also problematic when the clusters differ in shape (Babuska, 1998; Bezdek et al., 1999; Jain et al., 1999). The shape of clusters is determined by a distance norm, which is a fundamental factor when developing a clustering method. Let xj be the expression vector for the j-th gene over different environmental conditions (or samples), and let vi be the i-th cluster centroid (prototype). Then a typical distance norm between xj and vi is represented as:
![]() | (1) |
Although many clustering methods have been studied in the literature, a common limitation of conventional methods is to use a fixed distance norm for finding clusters; this fixed norm imposes a fixed geometrical structure and finds clusters of that shape even if they are not present. The Euclidean norm-based methods find only spherical shape of clusters, and the Mahalanobis norm-based methods find only ellipsoidal ones even if those shapes of clusters are not present in a dataset. For example, let us consider a dataset distributed in four clusters, plotted in Figure 1a, which is composed of the expression vectors of 400 genes measured at two conditions. Although the clusters differ in shape, the conventional clustering method using the Euclidean distance is limited to identify the four clusters because it imposes a spherical shape on the clusters with no regard to the actual data distribution, as depicted in Figure 1b. In such cases, it is of no help to use another distance norm.
|
To tackle the addressed problems, different distance norms are required for handling the different shapes of clusters; specifically, different As should be applied to the different clusters. In the present work, we present the GustafsonKessel (GK) method based on an adaptive distance norm (Gustafson and Kessel, 1979; Babuska, 1998; Bezdek et al., 1999) for clustering gene-expression data. Different norm-inducing matrix As are adapted by estimating the data covariance, and used to guess the associated structure of the data in each individual cluster. Recently, Guthke et al. (2002) showed that the GK method gave higher accuracy than other methods in predicting the gene function of Escherichia coli. However, the work of Guthke et al. was not very extensive and the experimental comparisons were made using a very small subset of data (265 genes). In the present study we provide a comprehensive analysis on the GK method, and show the superiority of the GK method to other methods through extensive experiments with various datasets. The remainder of this paper is organized as follows: Section 2 describes the formulation of the GK method; Section 3 highlights the potential of the GK method through several tests on the yeast datasets; and Section 4 presents our concluding remarks.
| 2 THE GUSTAFSONKESSEL METHOD |
|---|
|
|
|---|
The GK method generates a fuzzy partition that provides a degree of membership of each data point to a given cluster. The objective of the GK method is to classify a set of data points X = {x1,x2,...,xn} in p-dimensional space into k disjoint and homogeneous clusters represented as C = C1,C2,...,Ck. To detect clusters of different geometrical shapes in a dataset, this method introduces an adaptive distance norm for each cluster. Each cluster Ci has its own norm-inducing matrix Ai, which affects the distance norm given as the following.
![]() | (2) |
![]() | (3) |
![]() |
![]() | (4) |
![]() | (5) |
![]() | (6) |
1, J1 produces a hard partition where µij
{0,1}. As m approaches infinity, J
produces a maximum fuzzy partition where µij = 1/c.
To obtain a feasible solution by minimizing Equation (3), the additional constraint is required for Ai.
![]() | (7) |
i is a cluster volume for each cluster. Equation (7) guarantees that Ai is positive-definite, indicating that Ai can be varied to find the optimal shape of cluster with its volume fixed. Using the Lagrange multiplier method, minimization of Equation (3) with respect to Ai produces the following equation.
![]() | (8) |
![]() | (9) |
Rp in the dataset. Then, the matrices 
T are symmetric and positive semi-definite and also their weighted sum (Fi), and hence Ai is symmetric and positive-definite.
There is no general agreement on what value to use for
i; without any prior knowledge, a rule of thumb that many investigators use is
i = 1 in practice (Bezdek et al., 1999). Moreover, based on the notion that
i represents the cluster volume for each cluster, in the present study we estimated
i, shown in the following equation, by exploiting the definition on the volume of fuzzy cluster Ci (Dumitrescu et al., 2000), making the GK method to be fully operational.
![]() | (10) |
Under this formulation, the fixed norm
calculated for the distance between xj and vi is replaced in the GK method with the following distance,
![]() | (11) |
The eigenstructure of the covariance matrices Fi identifies the shape of clusters. Let
1,
2, ...,
p be the eigenvalues of Fi, and they are arranged in decreasing order; then the length of the r-th axis of each cluster is given by
for r = 1, ..., p. Figure 2 shows the eigenstructures of the clusters for the dataset from Figure 1a. The thick lines represent the eigenvectors of Fi, which correctly detect the shape of the clusters. The cluster centroids and eigenvalues obtained by the GK method for the four clusters are listed in Table 1. The last column is the ratio of the length of the first axis (
) to the length of the second axis (
), which describes the overall distribution of data. We can see that the ratio value of the cluster C1 is about 1.0, indicating that C1 has a spherical shape. In contrast, the cluster C4 has a ratio value of 9.36, indicating that the shape of C4 would be an elongated ellipsoid. The eigenvalue calculations, shown in Figure 2 and Table 1, indicate that the shape of each cluster is recognized by its eigenstructure, and therefore the adaptive distance norm is capable of identifying the inherent structure of the data.
|
|
The saddle point of Jm is obtained by considering the constraint Equation (5) as the Lagrange multipliers:
![]() | (12) |
Jm = 0. If
for all i,j and m > 1, then (U,V) may minimize Jm only if,
![]() | (13) |
![]() | (14) |
The GK method iteratively improves a sequence of sets of clusters until no further improvement in Jm(U,V,A:X) is possible. This type of iteration is often referred to as the alternating optimization (AO) scheme. It loops through the estimates for Vt1
Ut
Vt (where t is the iteration step) and terminates on ||Vt Vt1||
. Equivalently, the initialization of the algorithm can be done on U0, and the iterates become Ut1
Vt
Ut, with the termination criterion ||Ut Ut1||
. This way of alternating optimization makes the GK method be less prone to falling into local minima than the k-means and SOM methods because it makes soft decisions in each iteration through the use of membership functions.
Algorithm 1
GustafsonKessel method
Given the dataset X = {x1,..., xn}, xj
Rp, the number of clusters (k), the weighting exponent (m), and the termination criterion (
), this method finds k disjoint and homogeneous clusters.
- Initialize
(initially, t
1) of xj belonging to cluster Ci for 1
i
k, 1
j
n such that: 
- Update the cluster centroids
for 1
i
k using: 
- Update the cluster covariance matrices Fi for 1
i
k using: 
- Compute the distances between xj and
for 1
i
k, 1
j
n using: 
- Update
by the following procedure. For each xj, 1
j
n,- if DijAi > 0, 1
i
k, then update the membership of xj at t by: 
- if DijAi = 0 for some i
I
1,..., k, then for all i
I, set
to be between [0,1] such that: 
- if DijAi > 0, 1
- If ||Ut Ut1||
, then stop; otherwise, t
t + 1 and go to Step 2.
Algorithm 1 shows the procedural steps of the GK method for clustering the n x p gene-expression data where n is the number of genes and p is the number of experiments. A singularity can occur at Step 4 when the number of data is small or when the data within a cluster are much linearly correlated (Babuska, 1998). In such cases, the cluster covariance matrix becomes singular and cannot be inverted; thus we cannot compute the distances at Step 4. To deal with such cases, in the present study, the membership degrees are set to µij = 0 for non-singular classes, and arbitrary values for singular classes subject to the constraints of Equation (5).
THEOREM 1
The GK method given in Algorithm 1 converges in a finite number of iterations.
PROOF 1
We first show that a saddle point of Jm appears at most once by the GK method before it stops. Suppose that this is not true, i.e. Ut1 = Ut2 for some t1, t2 where t1t2. By the AO scheme, we get Vt1+1 and Vt2+1 as optimal solutions for U = Ut1 and U = Ut2, respectively. Therefore, we have
However, the sequence Jm(·,·) generated by the GK method is strictly decreasing (Selim and Ismail, 1984). Hence equation (15) is false and Ut1
(15) Ut2. Since there are a finite number of saddle points of Jm (Selim and Ismail, 1984), the algorithm will converge in a finite number of iterations.
A similar proof concerning the convergence of the k-means-type algorithms to a local minimum has been stated by Selim and Ismail, 1984.
| 3 RESULTS |
|---|
|
|
|---|
3.1 Datasets and implementation parameters
To test the effectiveness with which the GK method clusters gene-expression data, we applied the GK method and five well-known clustering methods to three recently published yeast datasets and compared the performance of each method. In the present study, we used EXPANDER (Sharan et al., 2003) software that implements many clustering methods, of which we investigated the results of the k-means, SOM, and CLICK methods. Along with these, we examined the results of the bagged clustering (BagClust) (Dudoit and Fridlyand, 2003) and the QC (Horn and Axel, 2003).
The datasets employed were the yeast PHO-regulation dataset of Ogawa et al. (2000), the yeast ATP-regulation dataset of Mizuguchi et al. (2004), and the yeast Calcineurin-regulation dataset of Yoshimoto et al. (2002). The Ogawa's PHO dataset contains the expression profiles of 6013 yeast genes measured at eight PHO-regulated experiments. The Mizuguchi's ATP dataset consists of the expression levels of the 6215 yeast genes measured at three different ATPase experiments. The Yoshimoto's Calcineurin dataset contains the expression profiles of 6102 yeast genes at 24 experiments by the presence and absence of Ca2+, Na+, CRZ1 and FK506. These three datasets were obtained from a public website containing various published large-scale yeast datasets (http://sgdlite.princeton.edu/download/yeast_datasets/).
In these experiments, the parameters used in the GK method were
= 0.001, m = 2.5, and
= 1; these values were chosen because they have been overwhelmingly favored in many studies (Bezdek et al., 1999). In the tests reported here, we analyzed the performance of each method under changes in the number of clusters of k, with varied from k = 2 to k = 10.
3.2 Performance comparison
In the present study, the clustering results were assessed using two validation measures: z-score and Jaccard score.
First, the z-score (Gibbons and Roth, 2002) is calculated by investigating the relation between a clustering result and the functional annotation of the genes in the cluster. To achieve this, the score uses the Saccharomyces Genome Database (SGD) annotation of the yeast genes, along with the gene ontology developed by the Gene Ontology Consortium (Ashburner et al., 2000; Issel-Tarver et al., 2002). A higher score of z indicates that genes are better clustered by function, indicating a more biologically significant clustering result.
Figure 3 shows the clustering results of the k-means, SOM, BagClust and GK methods for the yeast PHO dataset. The z-scores of the four clustering methods are plotted with respect to k = 2,3,...,10. The k-means method gave z-scores of ranging from 0.9 to 3.1, and SOM gave scores from 0.8 to 2.5. The z-scores of the BagClust were ranged from 0.6 to 1.1. The SOM method outperformed the k-means and BagClust methods at low k-values, and the k-means method showed better performance than the SOM and BagClust methods at high k-values. Compared to these three methods, the GK method provided superior clustering performance over a wide range of k-values; the z-scores were varied from 16.8 to 25.3.
|
Figure 4 shows the clustering results of the k-means, SOM, BagClust and GK methods for the yeast ATP dataset. The k-means method gave z-scores of ranging from 1.0 to 3.9. On the whole, the SOM and BagCluster showed similar tendency for all k values. In comparison to these methods, it is evident that the GK clustering method shows markedly better performance, giving z-scores of >15.0 for k > 3. This result is in agreement with the work of Mizuguchi et al. (2004), which reported that the optimal k-value lies
3 for this dataset.
|
The clustering results achieved by the six clustering methods for each of the yeast PHO and ATP datasets are listed in Table 2. For the PHO dataset, the k-means method showed the best z-score of 3.15 at k = 9. The SOM and BagClust methods provided best z-values of 2.53 and 1.14 at k = 7, respectively. The CLICK and QC methods yielded z-values of 1.65 and 1.46 at k = 31 and k = 16, respectively; these two methods automatically find the optimal k. In contrast, the best value of z = 25.38 of the GK method was obtained at k = 10. Furthermore, we tested the performance of the GK method for two different values of
= 1 and
. It is observed that the GK method for these two
values showed similarly better performance than other methods; the z-scores were varied from 5.77 to 29.80 over k = 2,3,...,10. The best scores of the GK method were z = 25.38 for
= 1, and z = 29.80 for
. For the ATP dataset, the best z-values of the k-means and SOM methods were 3.97 and 1.84 at k = 5 and k = 3, respectively. The BagClust and CLICK provided the best z = 0.87 and z = 1.03 at k = 10, respectively, whereas the QC method yielded the best z = 0.94 at k = 8. The GK method with respect to
= 1,
provided significantly better clustering performance than other methods, giving z-scores of >20.0.
|
In addition to the assessment using the z-score, we quantified the clustering result of each method using the Jaccard and Minkowski scores. Let T be the true solution and C the solution a clustering algorithm generated. Let n11 be the number of pairs of data that are in the same cluster in both T and C. Let n10 be the number of pairs that are in the same cluster only in T, and n01 be the number of pairs that are in the same cluster only in C. Then the Jaccard score is defined as
![]() | (16) |
![]() | (17) |
In this case, a lower value of SM(T, C) indicates a well-clustered result. To measure the quality of clustering with these two scores, we applied five clustering methods to the yeast Calcineurin dataset that provides a putative true solution obtained through manual inspection by Yoshimoto et al. (2002). Table 3 lists the comparison results of five clustering methods for k = 3. The GK method is the most effective of the methods considered; it provides the highest Jaccard score, with the lowest Minkowski score. The k-means and BagCluster methods showed better scores than the SOM and QC methods, and the QC method proved the most ineffective of the methods considered.
|
The results of the comparison calculations indicate that the GK method gave markedly better clustering performance than the other five methods considered, highlighting the effectiveness and potential of the GK method.
3.3 Functional enrichment
The enriched functional categories for each cluster obtained by the GK method on the yeast PHO and ATP datasets are listed in Tables 4 and 5 respectively. The enrichment of each GO category in each of the clusters was calculated by its P-value. To compute the P-value, we employed the hypergeometric distribution used by Tavazoie et al. (1999) and Dembele and Kastner (2003) in order to obtain the probability of observing the number of genes from a specific GO functional category within each cluster. More detailed explanation on this P-value can be found in Tavazoie et al. (1999) and Dembele and Kastner (2003). A low P-value indicates that the genes belonging to the enriched functional categories are biologically significant in the corresponding clusters. In the present study, only functional categories with P-value <5.0 x 107 are reported.
|
|
Of the ten clusters obtained for the yeast PHO dataset (Table 4), the cluster C9 contains several enriched categories on ribosome. The highly enriched category in cluster C9 is the ribonucleoprotein complex with P-value of 4.83 x 1035. The GO category ribosome is also highly enriched in this cluster with P-value of 4.20 x 1045. The cluster C2 contains an enriched category RNA processing with P-value of 7.30 x 1044. In the case of the ATP dataset (Table 5), the cluster C3 contains the yeast genes corresponding to the ATP-involved GO biological process. The highly enriched categories in cluster C3 are the ATP metabolism with P-value of 4.89 x 109 and the purine ribonucleoside triphosphate biosynthesis with P-value of 1.39 x 107. From the results of Tables 4 and 5, we see that the cluster obtained by the GK method shows a high enrichment of functional categories.
Besides, as mentioned earlier, the GK method produces a fuzzy clustering result, which provides a convenient way of selecting genes showing tight association to given clusters (Dembele and Kastner, 2003). Dembele and Kastner referred to this process as restricted selection. Similar to the work of Dembele and Kastner, we removed the most loosely associated genes in each cluster using a threshold-based selection; specifically, genes with the highest membership degree µij > 0.5 were selected. To see the effect of the restricted selection, we compared the percentage of genes in the raw cluster and its corresponding restricted cluster for GO functional categories. Table 6 lists a comparison result for the cluster C9 shown from Table 4. In comparison to the raw cluster, the percentage of genes in the restricted cluster were remarkably increased in most cases. For instance, the percentage for the ribonucleoprotein complex was increased from 14.12% in the raw cluster to 20.36% in the restricted cluster; this indicates that the genes in this category were retained more frequently than the average genes in the cluster. We see from this example that selecting tightly associated genes using the threshold of membership degree can increase the biological relevance of the genes in the cluster.
|
| 4 CONCLUSIONS |
|---|
|
|
|---|
It is well established that clustering is a useful technique for analysis of large amounts of gene-expression data, and a variety of clustering method have been used in biological research. However, conventional clustering methods suffer from a tendency to impose a fixed shape on the clusters even if the clusters in many real-life datasets may differ in shape. To address this problem, we have presented the GK clustering method using an adaptive distance norm. The adaptive norm is described in terms of the covariance matrix for each cluster in which the eigenstructure is exploited to identify the shape of the cluster. The present assessment has shown that the GK method is the superior and effective method for clustering gene-expression data.
Despite these benefits of the GK method, several issues require further investigation. The computational costs of the GK method are much higher than other clustering methods such as the k-means method. In each iteration step, k · n matrices are computed and subsequently inverted, and additionally, k determinants are computed. Especially, the GK method becomes computationally inefficient when applied to high dimensional data. Figure 5 shows the real run time of the GK method to cluster the yeast PHO-regulation and the ATP-regulation datasets. It is observed from the figure that the eight-dimensional PHO dataset shows a rapid increase in the computational time more than the three-dimensional ATP dataset as the number of clusters is increasing. In such a case, it maybe useful to initialize the cluster centroids of the GK method with the resulting centroids of the k-means method so that the GK method can converge fast to the saddle points of Jm with the reduced number of iterations. Second, the GK method is problematic when the number of data is small or when the data within a cluster are much linearly correlated (Babuska, 1998). In such cases, the cluster covariance matrix becomes singular and cannot be inverted; thus we cannot compute the distances at Step 4 in Algorithm 1. To tackle this singularity problem, it would be helpful to prevent the ratio of the maximum to the minimum eigenvalue from being larger than some predefined threshold.
|
| Acknowledgments |
|---|
This work was supported by the Korean Systems Biology Research Grant (M1-0309-02-0002) from the Ministry of Science and Technology. We would like to thank CHUNG Moon Soul Center for BioInformation and BioElectronics and the IBM SUR program for providing research and computing facilities.
Received on August 13, 2004; revised on December 16, 2004; accepted on December 23, 2004
| REFERENCES |
|---|
|
|
|---|
Ashburner, M., Ball, C., Blake, J., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Davis, A.P., Dolinski, K., Dwight, S.S., et al. (2000) Gene ontology: tool for the unification of biology. Nat. Genet., 25, 2529[CrossRef][ISI][Medline].
Babuska, R. Fuzzy Modeling for Control, (1998) , Boston Kluwer Academy Publishers.
Bezdek, J., Keller, J., Krisnapuram, R., Pal, N.R. Fuzzy Models and Algorithms for Pattern Recognition and Image Processing, (1999) , Boston Kluwer Academy Publishers.
Cho, R., Campbell, M., Winzeler, E., Steinmetz, L., Conway, A., Wodicka, L., Wolfsberg, T.G., Gabrielian, A.E., Landsman, D., Lockhar, D.J., Davis, R.W., et al. (1998) A genome-wide transcriptional analysis of the mitotic cell cycle. Mol. Cell, 2, 6573[CrossRef][ISI][Medline].
Chu, S., DeRish, J., Eisen, M., Mulholland, J., Botstein, D., Brown, P.O., Herskowitz, I. (1998) The transcriptional program of sporulation in budding yeast. Science, 282, 699705
Dembele, D. and Kastner, P. (2003) Fuzzy c-means method for clustering microarray data. Bioinformatics, 19, 973980
DeRisi, J., Iyer, V., Brown, P. (1997) Exploring the metabolic and genetic control of gene expression on a genomic scale. Science, 282, 257264.
Dhilon, I., Marcotte, E., Roshan, U. (2003) Diametrical clustering for identifying anti-correlated gene clusters. Bioinformatics, 19, 16121619
Dudoit, S. and Fridlyand, J. (2003) Bagging to improve the accuracy of a clustering procedure. Bioinformatics, 19, 10901099
Dumitrescu, D., Lazzerini, B., Jain, L. Fuzzy Sets and Their Applications to Clustering and Training, (2000) , Florida CRC Press.
Eisen, M., Spellman, P., Brown, P., Botstein, D. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA, 95, 1486314868
Gibbons, F. and Roth, F. (2002) Judging the quality of gene expression-based clustering methods using gene annotation. Genome Res., 12, 15741581
Gustafson, E. and Kessel, W. (1979) Fuzzy clustering with a fuzzy covariance matrix. Proc. IEEE Conf. Decision Control, 761766.
Guthke, R., Schmidt-Heck, W., Hahn, D., Pfaff, M. (2002) Gene expression data mining for functional genomics using fuzzy technology. Advances in Computational Intelligence and Learning: Methods and Applications, , New York Springer, pp. 475488.
Horn, D. and Axel, I. (2003) Novel clustering algorithm for microarray expression data in a truncated svd space. Bioinformatics, 19, 11101115
Issel-Tarver, L., Christie, K., Dolinski, K., Andrada, R., Balakrishnan, R., Ball, C.A., Binkley, G., Dong, S., Dwight, S.S., Fisk, D.G. (2002) Saccharomyces, genome database. Methods Enzymol., 350, 329346[ISI][Medline].
Jain, A., Murty, M., Flynn, P. (1999) Data clustering: a review. ACM Comput. Surv., 31, 264323[CrossRef].
Lukashin, A. and Fuchs, R. (2001) Analysis of temporal gene expression profiles: clustering by simuulated annealing and determining the optimal number of clusters. Bioinformatics, 17, 405414
Mizuguchi, G., Shen, X., Landry, J., Wu, W.H., Sen, S., Wu, C. (2004) ATP-driven exchange of histone h2az variant catalyzed by swr1 chromatin remodeling complex. Science, 303, 343348
Ogawa, N., DeRisi, J., Brown, P.O. (2000) New components of a system for phosphate accumulation and polyphosphate metabolism in Saccharomyces cerevisiae revealed by gemomic expression analysis. Mol. Biol. Cell, 11, 43094321
Qin, J., Lewis, D., Noble, W. (2003) Kernel hierarchical gene clustering from microarray gene expression data. Bioinformatics, 19, 20972104
Selim, S. and Ismail, M. (1984) K-means type algorithms: a generalized convergence theorem and the caracterization of local optimality. IEEE Trans. Pattern Anal. Mach. Intell., 6, 284288.
Sharan, R., Maron-Katz, A., Shamir, R. (2003) Click and expander: a system for clustering and visualizing gene expression data. Bioinformatics, 19, 17871799
Steuer, R., Kurths, J., Daub, C., Weise, J., Selbig, J. (2002) The mutual information: detecting and evaluating dependencies between variables. Bioinformatics, 18, S231S240[Abstract].
Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E.S., Golub, T.R. (1999) Interpreting patters of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc. Natl Acad. Sci. USA, 96, 29072912
Tavazoie, S., Hughes, J., Campbell, M., Cho, R.J., Church, G.M. (1999) Systematic determination of genetic network architecture. Nat. Genet., 22, 281285[CrossRef][ISI][Medline].
Xu, Y., Olman, V., Xu, D. (2001) Clustering gene expression data using a graph-theoretic approach: an application of minimum spanning trees. Bioinformatics, 17, 309318
Yeung, K., Haynor, D., Ruzzo, W. (2001) Validating clustering for gene expression data. Bioinformatics, 17, 309318
Yoshimoto, H., Saltsman, K., Gasch, A., Li, H.X., Ogawa, N., Botstein, D., Brown, P.O., Cyert, M.S. (2002) Genome-wide analysis of gene expression regulated by the calcineurin/crz1p signaling pathway in Saccharomyces cerevisiae. J. Biol. Chem., 277, 3107931088
This article has been cited by other articles:
![]() |
A. Bhattacharya and R. K. De Divisive Correlation Clustering Algorithm (DCCA) for grouping of genes: detecting varying patterns in expression profiles Bioinformatics, June 1, 2008; 24(11): 1359 - 1366. [Abstract] [Full Text] [PDF] |
||||
![]() |
D.-W. Kim, K.-Y. Lee, K. H. Lee, and D. Lee Towards clustering of incomplete microarray data without the use of imputation Bioinformatics, January 1, 2007; 23(1): 107 - 113. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

















t2. By the AO scheme, we get Vt1+1 and Vt2+1 as optimal solutions for U = Ut1 and U = Ut2, respectively. Therefore, we have





