Skip Navigation


Bioinformatics Advance Access originally published online on September 1, 2005
Bioinformatics 2005 21(21):3993-3999; doi:10.1093/bioinformatics/bti644
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrowOA All Versions of this Article:
21/21/3993    most recent
bti644v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (2)
Google Scholar
Right arrow Articles by Torrente, A.
Right arrow Articles by Brazma, A.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Torrente, A.
Right arrow Articles by Brazma, A.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oxfordjournals.org
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions{at}oxfordjournals.org

A new algorithm for comparing and visualizing relationships between hierarchical and flat gene expression data clusterings

Aurora Torrente 1,2,*, Misha Kapushesky 1 and Alvis Brazma 1

1EMBL Outstation—Hinxton, European Bioinformatics Institute Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
2Universidad Carlos III de Madrid Leganés, CP 28911, Madrid, Spain

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 1 INTRODUCTION
 2 DESCRIPTION OF THE...
 3 TESTING THE METHOD...
 4 IMPLEMENTATION IN EXPRESSION...
 5 CONCLUSIONS
 REFERENCES
 

Motivation: Clustering is one of the most widely used methods in unsupervised gene expression data analysis. The use of different clustering algorithms or different parameters often produces rather different results on the same data. Biological interpretation of multiple clustering results requires understanding how different clusters relate to each other. It is particularly non-trivial to compare the results of a hierarchical and a flat, e.g. k-means, clustering.

Results: We present a new method for comparing and visualizing relationships between different clustering results, either flat versus flat, or flat versus hierarchical. When comparing a flat clustering to a hierarchical clustering, the algorithm cuts different branches in the hierarchical tree at different levels to optimize the correspondence between the clusters. The optimization function is based on graph layout aesthetics or on mutual information. The clusters are displayed using a bipartite graph where the edges are weighted proportionally to the number of common elements in the respective clusters and the weighted number of crossings is minimized. The performance of the algorithm is tested using simulated and real gene expression data. The algorithm is implemented in the online gene expression data analysis tool Expression Profiler.

Availability: http://www.ebi.ac.uk/expressionprofiler

Contact: aurora{at}ebi.ac.uk

Supplementary information: http://www.ebi.ac.uk/microarray/General/Publications/publications.html


    1 INTRODUCTION
 TOP
 Abstract
 1 INTRODUCTION
 2 DESCRIPTION OF THE...
 3 TESTING THE METHOD...
 4 IMPLEMENTATION IN EXPRESSION...
 5 CONCLUSIONS
 REFERENCES
 
Clustering is one of the most widely used methods in gene expression data analysis. Clustering results are often rather sensitive to the particular clustering method and the parameters used (see, e.g. Quackenbush, 2001). Many ‘classic’ and new clustering algorithms, including ones for hierarchical and flat (k-means) clustering have been implemented in different tools and are widely used. In the absence of clear biologically meaningful benchmark clusters, it is difficult to judge which clustering method is most appropriate in a particular situation. Various cluster quality assessment methods have been proposed (Rand, 1971; Fowlkes and Mallows, 1983; Kaufman and Rousseeuw, 1990), but none of them can be applied universally to all datasets. Moreover, clustering is only one step in gene expression data analysis—to understand the biological meaning of different clusterings it is important to be able to compare the results obtained by different methods and to visualize the results of such comparisons. It is important to be able to see how clusters in one clustering relate to those in another, in order to make expert decisions regarding their biological meaning.

If the results of two clusterings are very different, examining simple one-to-one or one-to-many relationships between the clusters in each clustering may not always help to understand the global relationships. In this case finding the best many-to-many relationships may be more appropriate (for instance, two clusters on one side may correspond to three clusters on the other side). If we assume that there is a ‘true’ but unknown grouping of genes underlying the data, such many-to-many relationships between clusters may help one to approximate the ‘true’ clusters.

An even more complicated situation occurs when comparing a flat to a hierarchical clustering—on the one side, we have a fixed number of clusters, but on the other side a dendrogram (e.g. Fig. 6). Of course, one could cut the dendrogram at some level to obtain a set of flat clusters (Eisen et al., 1998; Tibshirani et al., 1999). However, how can we determine the optimal cut-off level? Furthermore, the optimal cut-off level may not be the same in all branches: in fact, as will be shown later, this occurs only in special cases. Ideally, the clustering comparison algorithm itself should determine the optimal cut-off level for each branch.



View larger version (47K):
[in this window]
[in a new window]
 
Fig. 6 A comparison of the hierarchical clustering of (correlation-based distance, average linkage) sim 140 most varying genes (≥ 0.9 SD in 60% of the hybridizations) in the S. pombe stress response dataset to a k-means clustering of the same dataset (correlation-based distance, k = 5).

 
How do we define ‘optimal’ correspondence between clusters? One way is based on the concept of mutual information—how much one clustering can tell about the other. In terms of the minimal description length (MDL) principle (Rissanen, 1978), this can be defined as the length of the shortest encoding of one clustering using the other as given. An alternative definition is based on applying aesthetics to the visualization of the correspondence between the clusterings. We can use a bipartite graph (bi-graph) to visualize the two clusterings (each clustering is represented by a set of nodes on either side with the edges weighted by the number of common elements in the connected clusters, as in Fig. 1), minimizing the number of edge crossings.



View larger version (16K):
[in this window]
[in a new window]
 
Fig. 1 (a) A weighted bi-graph representing two clusterings A1, ..., A5 and B1, ..., B6 of the same set of elements. For instance, clusters A5 and B2 have high overlap, while A5 and B5 do not have any. (b) A drawing of the graph after minimizing edge crossings with the proposed gravity-center algorithm. (c) Finding the ‘superclusters’ as connected components in the graph using the most significant edges.

 
Here we present a clustering comparison method and its implementation that finds a many-to-many correspondence between groups of clusters from two clusterings. The number of clusters in each of the clusterings may be different, and we allow the comparison of either two flat clusterings, or a flat and a hierarchical clustering. In the latter case, the mutual information-based criterion and the graph layout aesthetics are used to find the optimal cut-off level in different branches of the hierarchical clustering tree.

We have extensively tested our method on simulated and real gene expression data. We demonstrate that, from simulated data, our method can be used to approximate true clusters without a priori knowledge of the number of the clusters, even with a high level of noise. We used real gene expression data from Schizosaccharomyces pombe stress response microarray experiments (Chen et al., 2003), and show that we can restore biologically meaningful clusters. The algorithm, the user interface and the visualization method have been implemented as a part of Expression Profiler (EP) (Kapushesky et al., 2004), the online microarray data analysis tool at the EBI (http://www.ebi.ac.uk/expressionprofiler). The software is available from sourceforge (http://ep-sf.sourceforge.net).


    2 DESCRIPTION OF THE METHOD
 TOP
 Abstract
 1 INTRODUCTION
 2 DESCRIPTION OF THE...
 3 TESTING THE METHOD...
 4 IMPLEMENTATION IN EXPRESSION...
 5 CONCLUSIONS
 REFERENCES
 
2.1 Comparison of two flat clusterings
Let A1, ..., Am and B1, ..., Bn be two partitions of a set of the same elements G = {g1, ..., gN}, i.e. A1 {cup} ··· {cup} Am = B1 {cup} ··· {cup} Bn = G. For instance, A and B can be two different clusterings of a set of genes. We can visualize the similarities between them by drawing a weighted bi-graph, where nodes on the left and right hand sides represent, respectively, sets A1, ..., Am and B1, ..., Bn, and the weight wij of the edge <Ai, Bj> (visualized as thickness) corresponds to the number of elements common to both sets, i.e. |Ai {cap} Bj|. In Figure 1a, we have two partitions with 5 and 6 clusters each.

If we have partitions with many clusters, or with densely interconnected ones, the bi-graph representing their correspondences can present many edge crossings, making the visualization difficult to interpret. To help to see and understand the similarities between both partitions a graph layout that minimizes the weighted edge crossings can be used. This is a well-known problem in drawing bi-graphs, generalized to weighted edges. It has been proven to be NP-hard (Eades et al., 1986). We use heuristics, based on those described in (Gansner et al., 1993).

The algorithm that we use assigns to each node v coordinates (x(v), y(v)). Nodes in the same partitions have the same abscissa (0 and 1, typically). In the initial set-up, nodes on each side are equally spaced vertically; the initial ordering is set by the user (by default, the ordering given by the labels of the clusters in each partition) and as the iterations of the method run, the ordinates are updated. Each iteration works alternately on either side; for a given partition, each node is assigned a new position using the barycenter (gravity-center) function. We consider weighted edges as multi-edges collapsed into one; therefore, if node Ai is connected to Bj with an edge <Ai, Bj> having weight wij, it can be understood as Ai being connected, with wij non-weighted edges, to wij nodes drawn at the same position. Then, if P is the list of positions (ordinates) of the incident vertices of node v, considering each position as many times as the corresponding edge weight, the gravity-center of v is the mean over all values in P. Nodes on this side are reordered by sorting on these gravity-centers, and drawn at points with y-coordinates equal to these values. If the new positions cause two neighbouring nodes to be less than smin apart, the ordinates of the second and all the following ones are increased by smin, thus shifting all nodes down.

An additional heuristic is used to improve the quality of the method: adjacent nodes are swapped if the transposition leads to a reduction of the number of edge crossings. These exchanges are repeated until there is no improvement. At each iteration, if the number of crossings decreases, the new ordering is stored. The maximum number of iterations is set to 24, as suggested in Gansner et al., (1993). The algorithm runs until there are no changes in the ordering of both sides or until the maximum number of iterations is reached. Figure 1b shows the performance of the algorithm on the bi-graph on the left. The weighted number of edge crossings has dropped from 580 to 69 and the thickest edges do not cross each other.

To find and visualize the most important many-to-many correspondences between the two partitions, we additionally use a simplified representation of the bi-graph. The basic idea is the following: we want to group the sets Ai (more precisely, the elements g of the sets Ai) into groups X1, ..., Xk (i.e. for each i = 1, ..., m, every g Ai belongs to exactly one Xj for j = 1, ..., k) and the sets Bi into groups Y1, ..., Yk (i.e. for each i = 1, ..., n, every g Bi belongs to exactly one Yj for j = 1, ..., k), so that

X {approx} Y means here that X and Y have high overlap, i.e. most elements on one side are also present on the other side. For instance in Figure 2 below, we form groups X1 = A1 {cup} A2, X2 = A3, X3 = A4 and Y1 = B1, Y2 = B2 {cup} B3 and Y3 = B4, because A1 {cup} A2 {approx} B1, A3 {approx} B2 {cup} B3 and A4 {approx} B4.



View larger version (9K):
[in this window]
[in a new window]
 
Fig. 2 Two clusterings A1, ..., Am and B1, ..., Bn of the same elements, such that A1 {cup} A2 {approx} B1, A3 {approx} B2 {cup} B3 and A4 {approx} B4.

 
To find these groups or superclusters we use the following greedy algorithm:
  1. Take, for each node, the edge with the highest weight.
  2. Find the connected components in this edge-reduced bi-graph, which define the cluster groupings on each side and their correspondences and connect them in ‘superclusters’ (see Fig. 1c).

The idea of merging clusters has been previously used by Sharan and Shamir (2000) in the CLICK algorithm, and by Hartuv et al. (1999) in the HCS algorithm. However, our application is different, as both these approaches are exploiting the similarities between the clusters in the same clustering, while we are using the comparison of two different clusterings.

Theoretically it is possible to obtain a fully connected graph, which leads to the trivial solution X1 = Y1 = G, but in practice our results on different gene expression datasets show that this does not happen normally. In artificial datasets we obtained fully connected graphs in the following two cases: when there is no underlying structure in the data, or when the number of clusters in the clusterings is smaller than the number of underlying groups and the elements are homogeneously distributed among the groups.

2.2 Comparison of a flat and a hierarchical clustering
We implemented a method that cuts different branches of the hierarchical tree at ‘optimal’ level, which may be different at different branches, to find the best matches with a given flat clustering. The method explores the tree depth-first, starting from the root. Suppose we are at branch A. Assume that the graph is visualized so that the hierarchical tree is on the left, while the flat clustering is on the right (Fig. 3). We compute score {sigma} = S(A), defined below. Then we run one iteration of the gravity-center algorithm (Section 2.1) on the right side for the children nodes A1 and A2, and compute score {sigma}1 = S'(A1, A2). We swap the nodes A1 and A2, run the algorithm, and compute {sigma}2 = S'(A2, A1). If either {sigma}1 or {sigma}2 is >{sigma}, we split the tree using the ordering giving the higher score. If both {sigma}1 and {sigma}2 are <{sigma}, we start the look-ahead algorithm to explore the tree depth-first starting from A, looking h steps ahead, where h is given by the user. If after h steps none of the scores has improved, we do not split the node A; otherwise we split the tree to obtain the path to the node with the highest score. This is illustrated in Figure 4.



View larger version (6K):
[in this window]
[in a new window]
 
Fig. 3 A1 and A2, descendants of a sub-branch A of the original tree, are tested as potential split points against the flat clustering B1, ..., Bk.

 


View larger version (5K):
[in this window]
[in a new window]
 
Fig. 4 Performance of the look-ahead algorithm: a decrease in the value of the score in (b) with respect to that of (a) is overcome with a better value in (c).

 
The correspondence algorithm without the ‘look-ahead’ is described in Box 1 more formally (the full algorithm is given in the web Supplement 1). A is a tree with a fixed layout (or a node in a special case), and Y = (y(B1), ..., y(Bk)) is the set of ordinates of the flat clusters.


procedure correspondence(A, Y); Y global;
    if A is a leaf
        then
            link(A, Y)
        else
            find the two descendants of A:A1 and A2
            compute {sigma} = S(A), {sigma}1 = S'(A1, A2) and
            {sigma}2 = S'(A2, A1)
            Y12 = gravity_center((A1, A2), Y)
            Y21 = gravity_center((A2, A1), Y)
            if max({sigma}1, {sigma}2) < {sigma}
                then
                    link(A, Y)
                else if ({sigma}1 ≥ {sigma}2)
                        then
                            Y = Y12
                            link((A1, A2), Y)
                            correspondence(A1, Y)
                            correspondence(A2, Y)
                        else
                            Y = Y21
                            link((A2, A1), Y)
                            correspondence(A2, Y)
                            correspondence(A1, Y)

Box. 1. The procedure gravity_center() finds the optimal ordering of the nodes on the flat side according to the gravity-center algorithm and the procedure link() adds to the bi-graph the edges that connect the set of nodes in the first argument with the nodes in the second one, in the ordering given by Y.

 

We use two different scoring functions for S. The first one, S1, is based on the concept of mutual information—we try to find a split of the tree optimizing the mutual information between the two clusterings; the second one, S2, is based on graph layout aesthetics.

More precisely, suppose the flat clustering is B = {B1, ..., Bn}. If the dendrogram has m leaves A = {A1, ..., Am}, we can express the distribution of genes across the dendrogram as (where N is the number of elements in G = A1 {cup} ··· {cup} Am). The distribution of genes in a particular leaf Ai across the flat clusters {B1, ..., Bn} will be

Notice that is the conditional probability of finding an arbitrary gene from Ai in cluster Bj. Therefore, the conditional Shannon entropy (Shannon, 1948) of B, given A, will be

Note that is the number of bits needed to encode the clusters in clustering B, knowing A. Similarly, denoting the probabilities and , the Shannon entropy of A, given the flat clustering B, is

We use Shannon entropy in the context of the MDL principle. According to this principle we try to minimize the length of the message that needs to be sent to describe the clustering B if the clustering A is known to the receiver, or vice versa. These message lengths are H(B|A) and H(A|B), respectively. However, we also need to add to the message the coding of the clusters themselves, which adds to the length in bits a term given by the expressions

and

respectively. To find a compromise between the information about the dendrogram conveyed by knowing the flat clustering and vice versa, we define a symmetric score given by the average of H(B|A) and H(A|B) plus the length of the coding of both clusterings

Suppose we want to decide whether or not to split branch Am. We denote its descendants by and . We will split Am if the average of the length of the messages that encode the information about one clustering contained in the other decreases when replacing the branch with its descendants. Since splitting a branch Ai affects only the i-th term of the entropy expression, we define the scoring function for a given branch Am as

and the score for its descendants as . As the scoring {sigma}1 thus defined is symmetric in its arguments, if {sigma}1 > {sigma} the ordering given in the tree will be the one with a smaller number of edge crossings. For the look-ahead algorithm we generalize this score from two leaves to an arbitrary subtree.

A similar idea of measuring the gain or loss of information is described in (Jonnalagadda et al., 2004), where they perform k-means clustering successively on the same dataset, increasing in each iteration the number of clusters by one. However, in our approach we use this measure to find the best correspondence between two different clusterings.

The second score S2 is based on a graph layout aesthetics demonstrated in Figure 5: if splitting a node A into A1 and A2 creates ‘thick’ connections from each of the Ai to each of the Bj (as in Fig. 5, top), but not reciprocally (as in Fig. 5, middle), then the splitting is rewarded, otherwise it is penalized. The intermediate cases depend on the exact weights—intuitively they are difficult to decide.



View larger version (15K):
[in this window]
[in a new window]
 
Fig. 5 The scoring function S2 allows to split when corresponding clusters are identified (top) and prevents unnecessary splits (middle). Non-symmetric splits (bottom) are allowed depending on the thickness of the corresponding edges.

 
To achieve this for branch A, compute the product:

The rationale of this scoring can be explained as follows. The first term corresponds to the sum of Jaccard indices (Harper, 1999), , and rewards the cases where most of the elements are present in corresponding clusters on both sides, regardless of the absolute size. The second term is the addition of the square of the number of elements in common |A {cap} Bj|2, and rewards fewer thicker edges (i.e. edges representing larger absolute number of common elements), penalizing a relatively larger number of smaller edges. For the descendants A1 and A2 we compute {sigma}1 as:

where crossings(A1, A2) is the number of edge crossings within the subtree that results from expanding branch A. This number is based on the weights: it is computed as {sum}j>1 {sum}k<j w(A1, Bj) x w(A2, Bk), which is not symmetric in its arguments. The computational experiments described in the next section show that this heuristic scoring function performs almost as well as the information-theoretic one, outperforming it for a small number of look-ahead steps.

After determining how the branches in the dendrogram will be collapsed, we can run the greedy algorithm on the corresponding flat clusterings. As the computational experiments of the next section show, the resulting number of superclusters is usually close to the true number of groups in the data, partially or fully restoring them.


    3 TESTING THE METHOD ON SIMULATED DATA
 TOP
 Abstract
 1 INTRODUCTION
 2 DESCRIPTION OF THE...
 3 TESTING THE METHOD...
 4 IMPLEMENTATION IN EXPRESSION...
 5 CONCLUSIONS
 REFERENCES
 
We tested the algorithm both on simulated and on real gene expression data. We generate a number of ‘true’ clusters as elements scattered around gravity-centers with normal distribution and differently chosen standard deviations (called ‘noise level’). More precisely, the groups are constructed around random seeds, adding a normal random noise of mean 0 and SD {sigma} = 1, ..., 6, to each component, and using the following models, all in 10 dimensions. For comparing two flat clusterings we generated: (1) four groups of 250 elements; (2) four groups of size randomly chosen between 1 and 250, to allow for outliers; (3) seven groups of 250 observations; (4) seven groups of size randomly chosen between 1 and 250. For comparing a hierarchical to a flat clustering, we generated: (5) four groups of size randomly chosen between 1 and 100; (6) eight groups of size randomly chosen between 1 and 100. Then we ran k-means clustering, with k larger than the number of ‘true’ clusters and hierarchical agglomerative clustering (Euclidean distance, complete linkage). Next we apply our algorithms to compare the different clustering results. We show that the superclusters that are obtained by our methods are closer to the ‘true’ groups than the initial clusters obtained by k-means (see Table 1 and Supplement 2 online).


View this table:
[in this window]
[in a new window]
 
Table 1 The measure of the noise B/W is the ratio between the average distance between clusters centers, B, and the average distance between each point and the center of the cluster where it belongs, W

 
For comparing two k-means clusterings, we selected k = 8 and k = 11. We ran k-means 100 times for each value of k, and computed the average percentage of misclassified elements in each cluster (with the convention of considering each cluster as part of the true group that is the most voted by its elements) and the average number of superclusters. The summary of the results for two flat clusterings is shown in the Table 1. In the case of 4 real groups, the greedy algorithm shows the worst performance when the data is highly noisy ({sigma} = 6), increasing the percentage of misclassified elements by <2% (from 18.96% in k-means, with k = 11, to 20.79% after grouping). However, the average number of superclusters found is often <7, versus the 8 and 11 clusters of k-means. Similarly, in the case of 7 real groups the increase in the percentage of misclassified elements is not >3%, except for the case of {sigma} = 1, 2, 3 and k = 11, where the merging causes this percentage to increase in 9. However, this is a consequence of the error rate obtained with k-means when k = 8. The average number of superclusters is close to 7 in all the datasets.

For comparing a flat and a hierarchical clustering, we run k-means 100 times on the same dataset, while the hierarchical clustering is deterministic, therefore we run it once. Usually, regardless of the scoring function that we use, the algorithm collapses the dendrogram in such a way that after forming the superclusters there are almost no differences between the number of misclassified elements in the superclusters and in the original flat clusters.

Firstly, we used the information theoretical based scoring function, S1, using several values of the look-ahead parameter h. We run k-means with k larger than the real number of groups and compute the average number of superclusters, as well as the average number of branches in the collapsed tree. With no look-ahead, the number of superclusters underestimates the number of true clusters, and the percentage of misclassified elements, even with moderate levels of noise, is high. For example, for the case of 4 real groups, with noise level {sigma} = 4, and running k-means with k = 10, we get >20% of misclassified elements in the clustering that results from cutting the dendrogram, even if the number of elements closer to centers other than their own is <3%; for 8 real groups we get even worse results: datasets with noise level {sigma} = 3 have a misclassification percentage >15%, while for {sigma} = 4 is ~35%. The average number of branches in the tree is always smaller than the real number of groups. However, looking just one step ahead makes the average number of superclusters come close to the real number of groups and the percentage of misclassification to decrease. For example, for 4 true groups, and {sigma} = 4, the error rate decreases in >35%, while for 8 real groups and {sigma} = 4, it decreases in >40% (see Tables 2 and 3, in the Supplement 2 online). The average number of branches is always larger than the number of true clusters. For further look-ahead steps, the error rates do not decrease significantly.

Secondly, we used the aesthetics-based score S2, using the same values of h and k. With no look-ahead, the restoration of the 4 true groups is accurate: for low levels of noise ({sigma} < 4) the percentage of misclassified elements is <1%, and only for {sigma} = 6 we get values >15%. Moreover, the average number of superclusters is correctly estimated ~4. The number of branches needed to find them ranges between 5 and 9. In the case of 8 real groups the restoration is worse; the number of real groups is underestimated in all the cases, and for {sigma} > 3 the percentage of misclassification is >10%, increasing up to 46% when {sigma} = 6. Notice that even if the error rate obtained with k-means is much smaller than that of the hierarchical classification, there is a large increase of the error rate on the flat side when the superclusters are formed, due to the percentage obtained on the hierarchical side. The average number of branches in this case ranges between 6 and 9. When we look-ahead one step, the percentage of misclassified elements decreases, while the average number of superclusters increases. With 4 real groups, the differences are not so remarkable as those of 8 real groups, where the decrease in the percentage of misclassification is up to 10. However, the number of superclusters still underestimates the number of true groups.

The summary of the results for a hierarchical clustering and a non-hierarchical clustering is shown in the Supplement 2 online. In brief, it appears that the visualization aesthetics-based scoring function outperforms the MDL principle for a small number of look-ahead steps. Therefore, it can be used to improve the execution time (in our implementation it takes seconds to compare clusters of thousands of genes without look-ahead, and minutes with look-ahead).


    4 IMPLEMENTATION IN EXPRESSION PROFILER AND APPLICATIONS ON S.POMBE STRESS RESPONSE DATA
 TOP
 Abstract
 1 INTRODUCTION
 2 DESCRIPTION OF THE...
 3 TESTING THE METHOD...
 4 IMPLEMENTATION IN EXPRESSION...
 5 CONCLUSIONS
 REFERENCES
 
The algorithm is available through the online gene expression data analysis tool EP. A user can upload a gene expression matrix together with associated annotations to the EP platform, where a number of analytical components are available, including data transformation, sub-selection, statistical tests and clustering, both hierarchical and k-means. The Clustering Comparison component, under the ‘Clustering’ menu allows the user to run the algorithm and view the visualization of the output. The user is presented with the analysis history of the current dataset and can select a pair of previously performed clusterings to be compared. Alternatively, the user can select a dataset to be analysed and set up two sets of clustering parameters. In this case the two clusterings will be executed and then their results will be compared and the visual display will be produced. For running hierarchical clusterings the user needs to specify a distance measure (e.g. Euclidean or Pearson correlation, etc.) and type of linkage (complete, average, etc.). Similarly, the distance measure should be specified for flat clusterings, as well as the number k for k-means.

Adjustable clustering comparison algorithm parameters are also presented. The user can specify the number of steps to use in the look-ahead tree search and the type of scoring function to use (information-theoretic or aesthetics-based).

The underlying clustering comparison algorithm is implemented in R. The source code is available as part of the open-source distribution of EP. The EP data analysis tool also provides a visualization of the comparison (Fig. 6), displaying side-by-side the tree, with optimal cut-off points marked in red, the expression heatmap, gene annotations, the set of k clusters resulting from the flat clustering, and the clustering comparison correspondence. The latter is depicted as lines of varying thickness, mapping sub-branches of the tree to flat clustering superclusters. Line thickness is proportional to the number of elements common to both sets; by placing the mouse cursor over a line, a Venn diagram is displayed with the exact numbers showing how many elements in the cluster are generated by cutting the respective branch, how many in the corresponding flat cluster and how many of these two overlap.

As an example, Figure 6 shows the comparison of a hierarchical clustering (correlation-based distance, average linkage) to a k-means clustering (correlation-based distance, k = 5) of the S.pombe stress response dataset, (ArrayExpress accession number E-MEXP-29). The highly responding (yellow) cluster is highly enriched with CESR genes (12/64), while near the middle of the hierarchical tree, the cluster of genes of high negative response (blue) are mostly mitochondrial ones. The down-regulated CESR genes are also enriched (25/62).


    5 CONCLUSIONS
 TOP
 Abstract
 1 INTRODUCTION
 2 DESCRIPTION OF THE...
 3 TESTING THE METHOD...
 4 IMPLEMENTATION IN EXPRESSION...
 5 CONCLUSIONS
 REFERENCES
 
We have developed a novel method that allows the user to explore relationships between different clustering results, including hierarchical clusterings without any a priori given cut-offs. Although the biological relevance of the comparison results depends on the clusters that are compared, in some cases we can even improve the original clustering. For instance, in simulated data, if the noise level is such that most of the elements in the clusters are closer to the centers of their ‘true’ clusters than to the centers of other clusters, our method effectively allows restoring the ‘true’ clusters. When the noise level increases to the extent that no clustering structure is left, restoring the ‘true’ clusters becomes difficult. Nevertheless, we have shown that when applied to real gene expression data, we can obtain biologically meaningful results, which help expert biologists explore their data.

Note that our visualisation algorithm of the comparison between a hierarchical clustering and a flat clustering effectively defines an order in the hierarchical tree. It is sometimes wrongly assumed by naïve users that a hierarchical clustering has a defined order. In fact, every branch in the tree can be flipped without affecting the clustering, and in this way, the same clustering can be presented in rather different ways. Applying the cluster comparison method developed in this paper reduces this ambiguity.

Can the proposed method be generalized for comparing two hierarchical clusterings? This is a topic of our current research, but there are several important reasons that make this more difficult. First of all, when comparing two hierarchical clusterings there are considerably more degrees of freedom—both trees can be cut at many different levels to obtain flat clusterings. For instance, we have to avoid trivial solutions—perfect correspondence between the roots of the two trees or the leaves of the two trees by adding appropriate terms to the scoring functions. In practical applications the visualization of the correspondence between the clusters plays a central role, and such visualization is more difficult when comparing two hierarchical clusterings.


    Acknowledgments
 
We would like to thank Jaak Vilo in Tartu University for useful discussions, Paulis Kikusts and his team in the University of Latvia for valuable advise on graph layout algorithms, Gabriela Rustici and Jurg Bahler for help with interpreting the results, and Christine Coerner for the help in the software implementation. The work was partly funded by the fellowship ES2001-0159, BM, supported by the Spanish Ministry of Science and Technology (MCyT) and by the European Commission as TEMBLOR, contract no. QLRI-CT-2001-00015.

Conflict of Interest: none declared.

Received on June 16, 2005; revised on August 1, 2005; accepted on August 23, 2005

    REFERENCES
 TOP
 Abstract
 1 INTRODUCTION
 2 DESCRIPTION OF THE...
 3 TESTING THE METHOD...
 4 IMPLEMENTATION IN EXPRESSION...
 5 CONCLUSIONS
 REFERENCES
 

    Quackenbush, J. (2001) Computational analysis of microarray data. Nat. Rev. Genet., 2, 418–427[CrossRef][ISI][Medline].

    Rand, W.M. (1971) Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc., 66, 846–850[CrossRef][ISI].

    Fowlkes, E.B. and Mallows, C.L. (1983) A method for comparing two hierarchical clusterings. J. Am. Stat. Assoc., 78, 553–584[CrossRef].

    Kaufman, L. and Rousseeuw, P.J. (1990) Finding Groups in Data: An Introduction to Cluster Analysis. , New York Wiley 2, 418–427.

    Eisen, M.B., et al. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA, 95, 14863–14868[Abstract/Free Full Text].

    Technical Report Tibshirani, R., Hastie, T., Eisen, M., Ross, D., Botstein, D., Brown, P. (1999) Clustering methods for the analysis of DNA microarray data. Department of Statistics, Stanford University.

    Rissanen, J. Modelling by shortest data description. Automatica, 14, 465–471.

    Chen, D., et al. (2001) Global transcriptional responses of fission yeast to enviromental stress. Mol. Biol. Cell, 14, 214–229.

    Kapushesky, M., et al. (2003) Expression Profiler: next generation an online platform for analysis of microarray data. Nucleic Acids Res., W465–470.

    Eades, P., McKay, B., Wormald, N. (1986) On an edge crossing problem. In Proceedings of the 9th Australian Computer Science ConferenceJanuary 29–31Australian National University, Canberra , pp. 327–334.

    Gansner, E.R., et al. (1993) A technique for drawing directed graphs. IEEE Trans. Software eng., 19, 214–230[CrossRef].

    Sharan, R. and Shamir, R. (2000) CLICK: a clustering algorithm with applications to gene expression analysis. Proc. Int. Conf. Intell. Syst. Mol. Biol., 8, 307–316[Medline].

    Hartuv, E., Schmitt, A., Lange, J., Meier-Ewert, S., Lehrach, H., Shamir, R. (1999) An algorithm for clustering cDNA for gene expression analysis using short oligonucleotide fingerprints. In Proceedings of the Third International Symposium on Computational Molecular Biology RECOMB'99August 11–14, 1999Lyon, France , New York ACM Press, pp. 188–197.

    Shannon, C.E. (1948) A mathematical theory of communication. Bell Syst. Tech. J., 27, 379–423 623–656.

    Jonnalagadda, S. and Srinivasan, R. (2004) An information theory approach for validating clusters in microarray data. In Proceedings of the 12th Intelligent Systems for Molecular BiologyJuly 31–August 4, 2004Glasgow, UK http://www.iscb.org/ismbeccb2004/short%20papers/39.pdf.

    Harper, D.A.T. Numerical Palaeobiology, (1999) , New York John Wiley & Sons.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrowOA All Versions of this Article:
21/21/3993    most recent
bti644v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (2)
Google Scholar
Right arrow Articles by Torrente, A.
Right arrow Articles by Brazma, A.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Torrente, A.
Right arrow Articles by Brazma, A.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?