Skip Navigation


Bioinformatics Advance Access originally published online on September 25, 2007
Bioinformatics 2007 23(20):2708-2715; doi:10.1093/bioinformatics/btm414
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
23/20/2708    most recent
btm414v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Leone, M.
Right arrow Articles by Weigt, M.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Leone, M.
Right arrow Articles by Weigt, M.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Clustering by soft-constraint affinity propagation: applications to gene-expression data

Michele Leone *, Sumedha and Martin Weigt

Institute for Scientific Interchange, Viale Settimio Severo 65, Villa Gualino, I-10133 Torino, Italy

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 THE ALGORITHM
 3 APPLICATION TO BIOLOGICAL...
 4 METHODS
 5 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: Similarity-measure-based clustering is a crucial problem appearing throughout scientific data analysis. Recently, a powerful new algorithm called Affinity Propagation (AP) based on message-passing techniques was proposed by Frey and Dueck (2007a). In AP, each cluster is identified by a common exemplar all other data points of the same cluster refer to, and exemplars have to refer to themselves. Albeit its proved power, AP in its present form suffers from a number of drawbacks. The hard constraint of having exactly one exemplar per cluster restricts AP to classes of regularly shaped clusters, and leads to suboptimal performance, e.g. in analyzing gene expression data.

Results: This limitation can be overcome by relaxing the AP hard constraints. A new parameter controls the importance of the constraints compared to the aim of maximizing the overall similarity, and allows to interpolate between the simple case where each data point selects its closest neighbor as an exemplar and the original AP. The resulting soft-constraint affinity propagation (SCAP) becomes more informative, accurate and leads to more stable clustering. Even though a new a priori free parameter is introduced, the overall dependence of the algorithm on external tuning is reduced, as robustness is increased and an optimal strategy for parameter selection emerges more naturally. SCAP is tested on biological benchmark data, including in particular microarray data related to various cancer types. We show that the algorithm efficiently unveils the hierarchical cluster structure present in the data sets. Further on, it allows to extract sparse gene expression signatures for each cluster.

Contact: leone{at}isi.it, sumedha{at}isi.it and weigt{at}isi.it


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 THE ALGORITHM
 3 APPLICATION TO BIOLOGICAL...
 4 METHODS
 5 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Clustering based on a measure of similarity is a crucial problem which appears throughout scientific data analysis. For an overview see Jain et al. (1999). Recently, a powerful algorithm called Affinity Propagation (AP) based on message passing was proposed by Frey and Dueck (2007a). As reported impressively in the original publication, this algorithm achieves a considerable improvement over standard clustering methods like K-means (MacQueen, 1967), spectral clustering (Shi and Malik, 2000) and super-paramagnetic clustering (Blatt et al., 1996; Shental et al., 2003).

Based on an ad hoc pairwise similarity function between data points, AP seeks to identify each cluster by one of its elements, the so-called exemplar. Each point in the cluster refers to this exemplar, each exemplar is required to refer to itself as a self-exemplar. This hard constraint forces clusters to appear as stars of radius one: there is only one central node, and all other nodes are directly connected to it. Subject to this constraint, AP seeks at maximizing the overall similarity of all data points to their exemplars. The solution to this hard combinatorial task is approximated following the ideas of belief-propagation (Kschischang, 2001; Yedidia et al., 2005). One of the important points of AP is its computational efficiency: whereas a naive implementation of belief propagation for N data points leads to O(N3) messages which have to be determined self-consistently, the elegant formulation of Frey and Dueck allows to work with O(N2) messages only. Therefore, the algorithm is feasible even in the presence of very large data sets.

Albeit its impressive power in a wide range of applications (Frey and Dueck, 2007a), AP in its present form suffers from a number of drawbacks. The most important ones related to the present work are:

  • The hard constraint in AP relies strongly on cluster-shape regularity: elongated or irregular multi-dimensional data might have more than one simple cluster center. AP may force division of single clusters into separate ones.
  • Since all data points in a cluster must point to the same exemplar, all information about both the internal structure and the hierarchical merging/dissociation of clusters is lost.
  • AP has robustness limitations: a small perturbation of similarities may influence the choice of one or few exemplars, and the hard constraint may trigger an avalanche leading to a different partitioning of the data set into clusters. This point is particularly important in the presence of noise in the data as, e.g. in microarray measurements.
  • AP forces each exemplar to point to itself. A relaxation of the hard constraint may allow for cluster structures including second- or higher-order pointing processes.

These problems may be solved by modifying the original optimization task of AP. As a first step, we relax the hard constraint by introducing a finite penalty term for each constraint violation. This softening can be chosen in a way that the computational complexity of the algorithm remains unchanged, but its performance on biological test sets is improved considerably. Moreover, relaxing the constraint helps in gaining valuable insight into the hierarchical structure of the clustering, increasing result robustness at the same time. By tuning the cluster number we see the merging of two clusters into a single one, or the dissociation of single clusters into two separated ones.


    2 THE ALGORITHM
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 THE ALGORITHM
 3 APPLICATION TO BIOLOGICAL...
 4 METHODS
 5 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Given a set Formula of N data points, the original algorithm of Frey and Dueck takes as input a collection of real valued similarities S(µ,{nu}) between the pairs xµ,x{nu},µ,{nu}isin{1,...,N}. The choice of the similarity measure is heuristic, it depends on the nature of data to be clustered. In the case of high-dimensional data as present in gene-expression analysis, similarity may be measured by the Pearson correlation coefficient or the negative pairwise Euclidean distance. However, for the algorithm described below it is not even necessary that the similarities are symmetric.

The algorithm searches for a mapping Formula which maps each data point µ to its exemplar cµ which itself is a data point. This mapping shall minimize the cost function (or energy) Formula which equals minus the overall similarity of the data points to their exemplars. In the original AP algorithm c is restricted by N hard constraints: whenever a data point is selected as an exemplar by another data point, it has to be its own self-exemplar.

In this setting, we need to specify self-similarities S(µ,µ). They describe the availability of data points for being self-exemplars (and thus to serve as a cluster center). Since all data points are a priori equally suitable to play such a role, one is naturally led to choose all self-similarities to have some common value Formula . The free model parameter {sigma} acts like a chemical potential in statistical physics, setting the prior likelihood of the number of self-exemplars (and of separated clusters consequently). For very small value of {sigma}, every data point prefers to be its own exemplar, and the number of clusters equals the number of data points. In the opposite extreme case of large {sigma}, self-exemplars have high cost in E1[c]. All data points are collected in one large cluster with a single exemplar. For intermediate values {sigma} acts as a tuning parameter for the cluster number which decreases monotonously with {sigma}. Frey and Dueck argue that, if the data set has some underlying structure, the correct clustering can be identified by a comparably broad range of values of the self-similarity in which the inferred cluster structure does not change. If data are not sparse and clusters are symmetrically shaped, then affinity propagation works very well and produces the correct clustering in a very short time.

Finding the cost minimum of E1[c] subject to the self-exemplar constraint is a computationally hard task. It can be achieved exactly only for very small systems. The central idea of AP is therefore to identify the exemplars using message passing (belief propagation, BP) as a heuristic strategy (Yedidia et al., 2005): real-valued messages between pairs of data points are updated recursively until a stable clustering emerges. The original AP equations are a direct application of BP [or, equivalently, Max-Sum (Kschischang et al., 2001)] to the clustering problem.

There are two types of messages exchanged between data points (Frey and Dueck, 2007a): the responsibility r(µ, {nu}) is sent from data point µ to candidate exemplar {nu}; it reflects the accumulated evidence that µ chooses {nu} as its cluster exemplar. The availability a(µ, {nu}) is sent from candidate exemplar {nu} to datum µ; it reflects the appropriateness for {nu} to be an exemplar for µ as a result of the self-exemplar constraint. As mentioned before, the original AP imposes constraints on exemplars to be self-exemplars. We modify the algorithm of Frey and Dueck by softening this hard constraint. We write the constraint attached to a given data point µ as follows, with pisin[0,1]:


Formula 1

(1)
The first case assigns a penalty p if data point µ is chosen as an exemplar by some other data point {nu}, without being a self-exemplar. The hard-constraint limit of Frey and Dueck is recovered by setting p to zero. For p = 1, Formula becomes identically one, the minimization task of E1[c] becomes unconstrained and independent for all data points, thus each data point chooses his nearest neighbor as an exemplar. An intermediate value of p allows to interpolate between these two extreme cases. It presents a compromise between the minimization of E1[c] on one hand, and the search for compact clusters on the other hand.

Finally, we introduce a positive real-valued parameter ß weighing the relative importance of the cost minimization with respect to the constraints. In a statistical-physics perspective, this parameter can be seen as a formal inverse temperature. Its introduction allows us to define the probability of an arbitrary clustering c as


Formula 2

(2)
where the partition function Z serves to normalize P[c]. In this setting, both clusterings of non-optimal cost and of many violated self-exemplar constraints are suppressed exponentially. The task of finding high-scoring c can be understood as a minimization problem with the modified cost function


Formula 3

(3)
AP is recovered by taking p = 0 since any violated constraint sets P[c] to zero in Equation (2). For general p, the optimal clustering c* can be determined component-wise by maximizing the marginal probabilities,


Formula 4

(4)
for all data points µ.

2.1 The SCAP equations
In the limit Formula , Equation (2) becomes concentrated to the true cost minima. Looking at Equation (3) it becomes obvious that p has to scale as Formula in order to have some non-trivial effect. In this limit, we find the SCAP equations (with Formula , see Section 4):


Formula 5

(5)
These 2N2 equations are closed and can be solved iteratively. Following Equation (4), the exemplar c* µ of any data point µ can be computed by maximizing the marginal a posteriori probability:


Formula 6

(6)
Compared to original AP, SCAP amounts to an additional threshold on the self-availabilities a(µ,µ) and the self-responsibilities r(µ,µ): for small enough Formula , Formula in many cases, up to p = 1 (or Formula , where every site chooses its best first neighbor as its exemplar. At the same time, beyond a certain threshold the self responsibility r(µ,µ) is substituted with Formula . For Formula (i.e. p = 0) the original AP equations are recovered.

In practice, this means that variables are discouraged to be self-exemplars beyond a given threshold, even in the case someone is already pointing at them. The resulting clustering is more stable and obviously allows for a hierarchical structure where {lambda} can point to µ that can point to {nu}, etc. Also loops are possible. In most of the tests performed (both on artificial and biological cancer data) clusters are almost tree-like besides a dimer.

2.2 Efficient implementation of the algorithm
The iterative solution of Equations (5) can be implemented in the following way:

  1. Define the similarity S(µ,{nu}) for each set of data points. Choose the values of the self-similarity {sigma} and of the constraint strength Formula . Initialize all a(µ,{nu}) = r(µ,{nu}) = 0
  2. For all Formula , first update the N responsibilities r(µ,{nu}) and then the N availabilities a({nu},µ), using Equations (5).
  3. Identify the exemplars cµ by looking at the maximum value of r(µ,{nu}) + a(µ,{nu}) for given µ, according to Equation (6).
  4. Repeat steps 2–3 till there is no change in exemplars for a large number of iterations (we used 10–100 iterations). If not converged after Tmax iterations (typically 100–1000), stop the algorithm.

Three notes are necessary at this point:

  • Step 3 is formulated as a sequential update: for each data point µ, all outgoing responsibilities and then all incoming availabilities are updated before moving to the next data point. In numerical experiments this was found to converge faster and in a larger parameter range than the damped parallel update suggested by Frey and Dueck (2007a). Dependence of the result on initial conditions was not observed.
  • The naive implementation of the update Equations (5) requires 2N2 updates, each one of computational complexity O(N). A factor N can be gained by first computing the unrestricted max and sum once for a given µ, and then implying the restriction only inside the internal loop over {nu}. Like this, the total complexity of a global update is O(N2) and thus feasible even for very large data sets.
  • Belief propagation on loopy graphs is not guaranteed to converge. We observe, however, efficient convergence of the sequential update over wide parameter ranges. To handle the possibility of non-convergence, we have introduced a cutoff in the number of iterations. If this is reached, the algorithm stops, and the actual parameter combination is discarded.

2.3 Extracting cluster signatures
In many clustering tasks input data consist of high-dimensional vectors, a specific example being genome-wide microarrays. Frequently only few components of these vectors carry useful information about the cluster structure, extracting such cluster signatures is of crucial importance in understanding the mechanisms behind the cluster structure.

In the following, we will use the specific case of microarray data. Therefore we use the notion gene for a component of the input vector, even if at this stage the discussion is still general. The total number of genes is denoted by G. We propose a simple measure of the influence of single genes on the total similarity measure of a cluster, as compared to random choices of the exemplar selection c. For simplicity, we assume the similarity between data points Formula and Formula to be additive in single-gene contributions


Formula 7

(7)
This is true, e.g. for the Pearson correlation or the negative square Euclidean distance. It can be easily generalized to similarity measures which are given by a monotonous function of a sum over gene contributions (like the negative of the Euclidean distance which is the square root of the sum of single-gene contributions).

Having found a clustering given by the exemplar selection c, we can calculate the similarity of a cluster C defined as a connected component of the directed graph given by c. It is given by


Formula 8

(8)
as a sum over single-gene contributions


Formula 9

(9)
These have to be compared to random exemplar choices which are characterized by their mean


Formula 10

(10)
and variance


Formula 11

(11)
The relevance of a gene can now be ranked according to


Formula 12

(12)
which measures the distance of the actual Si(C) from the distribution of random exemplar mappings. Genes can be ranked according to the value of Ii(C), highest-ranking genes are considered a cluster signature. The same procedure can be carried through for each cluster independently, but also for cluster combinations.


    3 APPLICATION TO BIOLOGICAL DATA
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 THE ALGORITHM
 3 APPLICATION TO BIOLOGICAL...
 4 METHODS
 5 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
3.1 Iris data
The data consist of measurements of sepal length, sepal width, petal length and petal width, performed for 150 flowers, chosen from three species of the flower Iris. It is a benchmark problem for clustering (Duda and Hart, 1973). Super-paramagnetic clustering is able to cluster 125 of the data points correctly, leaving 25 points unclustered (Blatt et al., 1996).

When we apply AP on Iris data, we identify three clusters making 16 errors. With SCAP, we identify them with just nine errors. We use the Manhattan distance measure for the similarity function, i.e. Formula .

We saw that the species Iris Setosa separates without any errors. On increasing the value of Formula , the Iris Setosa cluster stays intact and the clusters for Versicolor and Virginica merge with each other, reflecting the fact that they are closer to each other than to Setosa. The errors occur because some samples from these species were closer to samples from other species than to their own.

3.2 Brain cancer data
We used a test data set monitoring the expression levels of more than 7000 genes for 42 patients, which were previously correctly classified into five diagnosis types by an a posteriori assessment method (Pomeroy et al., 2002) (10 medulloblastoma, 10 malignant glioma, 10 atypical teratoid/rhabdoid tumors, 4 normal cerebella and 8 primitive neuroectodermal tumors – PNET). Each array was filtered, log-normalized to mean zero and variance one, resulting in G = 6010 genes. Due to this choice Pearson correlation and negative square Euclidean distance are equivalent. The diagnosis information was not used during clustering, but only for checking the algorithmic outcome.

3.2.1 Imposing five clusters in AP and SCAP
Since we knew that the correct clustering was to identify five different patterns, our first approach was to tune {sigma} and Formula in order to get five clusters. First, we fixed Formula to infinity (original AP) and changed {sigma} finding around {sigma}~120 the desired number of five clusters with eight errors. The error was calculated a posteriori by counting every data point which referring to an exemplar of a different diagnosis. Next we fixed {sigma} to a sufficiently large value (the result becomes insensitive on {sigma} once the latter takes large values), and we changed Formula . In this case, for Formula we got six clusters with eight errors, for Formula four clusters with again eight errors. Five clusters were not found to correspond to any extended Formula -region. Note that in both cases all errors occur in the last cluster: samples supposed to take diagnosis 5 (PNET) rarely find an exemplar of the same class. Instead they distribute over the other four diagnoses.

3.2.2 Clustering with AP
Then, instead of fixing the number of clusters, we changed {sigma} continuously for Formula . We counted the number of clusters and of errors as a function of {sigma}, see Figure 1. The algorithm ground state (configuration of maximum marginals values) in the limit of Formula is a single cluster.


Figure 1
View larger version (9K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. Plot of number of clusters as a function of self-affinity S(µ,µ) {equiv}{sigma}, for Figure 1 (the original AP algorithm). Based on this we would conclude that the data has three non-trivial clusters.

 
The first non-trivial clustering occurs when the number of clusters remain unchanged for a stable range of {sigma} values. In this preliminary study, we took that to be the actual predicted data clustering. Hence, by looking at Figure 1, we would conclude that there are three well-distinguishable clusters in the present data set. Look, however, to the number of errors: it is found to be 14–15 in this range, basically due to the wrong assignment of two entire classes to only three exemplars. Four or five clusters can be imposed and lead to lower error values, but require fine-tuning of {sigma}.

3.2.3 Clustering with SCAP
We then fix {sigma} to be very large and change only Formula . For Formula we start with seven clusters, this number decreases rapidly as Formula increases, see Figure 2. As before, the point at which the number of clusters is robust against changes in Formula was taken as the best SCAP clustering. From Figure 2 we conclude that SCAP identifies four clusters. The number of errors in classification is eight.


Figure 2
View larger version (8K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. Plot of number of cluster and errors of SCAP as a function of cutoff parameter Figure 2. This plot suggests that the data has four clusters.

 
Right from Formula , where each data point chooses its closest neighbor as his exemplar, errors are due to misclassifications of the fifth diagnosis (PNET). The other data points select exemplars of the same diagnosis, but various clusters of same diagnosis exist. Only in the case of four clusters, as shown in Figure 3, each of the first four diagnoses is assembled in an isolated cluster, with the PNET arrays distributed over the three cancer-related clusters. The normal tissue (30–33 in the figure) is well separated from all others. Only if we go towards three clusters, it merges with diagnosis type (0–9) (medulloblastoma), showing that these two are closer in expression in between them than compared to others. A more detailed analysis of the brain cancer data is provided in the Supplementary Material.


Figure 3
View larger version (9K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. Graphical presentation of clusters at Figure 3. Increasing Figure 3 did not separate the data set (10–19) and (34-–41). Instead they together make one big cluster.

 
Note that SCAP also provides information about the internal organization inside the clusters. We find, e.g. that the misclassified patterns are always peripherical cluster elements. No other data point refers to them. This information is lost in AP. Due to the hard constraint all points belonging to the same cluster refer to the same exemplar, and information about the internal cluster structure is not contained in c*. A graphical representation of the cluster structure in this case is contained in the Supplementary Material.

In Pomeroy et al. (2002), data were clusterized using hierarchical clustering. Even if the overall cluster structure is similar to the one we found, there is no clear-cut clustering into 4–5 classes, some arrays (well-clustered with SCAP) were only added at very late stages of hierarchical clustering. The global nature of SCAP leads to a better clustering performance than the local and greedy hierarchical clustering.

Another interesting point comes from the comparison of our clustering results with the supervised classification results of Dettling (2004). There, a number of state-of-the-art classification algorithms are applied, with training sets containing 2/3, test sets containing 1/3 of the data points. Dettling finds that the minimal generalization error made is 23.8%, corresponding to ca. ten errors on a data set of cardinality 42. It is interesting to note that SCAP in the clustering corresponding to four clusters makes only eight errors. Note that training in Dettling (2004) is done on a subset of patterns, but supervision in this case seems to add no valuable information to the unsupervised clustering results.

Last but not least, we use the procedure described above to extract cluster signatures in the most stable case of four clusters depicted in Figure 3. The lists of the highest ranking genes together with their relevance value Ii(C) is given in the Supplementary Material. The number of statistically relevant genes (we consider a threshold Ii(C)~=3) depends on the cluster and is largest for the normal tissue (42 genes), it is much smaller in particular for the first cluster (ca. 4 genes). If we take the first 15–25 genes per cluster, i.e an overall signature of 60–100 genes, we already find basically the same clustering as before, only two new errors of previously well-assigned patterns appear. At gene signatures 120–240, only one of these errors survives. We therefore find that the signature found in this way carries most of the information needed for the clustering. Note also that, due to the fact that in an unsupervised way we did not separate the fifth diagnosis type into a single cluster, we do not have by definition a cluster signature for this cancer type.

3.3 Other benchmark cancer data
3.3.1 Lymphoma cancer data
We used a data set of 62 patients for 4026 genes, showing three different diagnosis (Alizadeh et al., 2000). In the limit of Formula going to infinity, we find the first non-trivial clustering for {sigma} between 150 and 250. In this regime AP group data into three sets, making with three errors. For very high {sigma} and varying Formula , the three-groups clustering becomes more stable and robust, while the algorithm makes just one assignment prediction error. In this case, Dettling finds a minimal generalization error of 0.95%, corresponding to less than one error in 62 patterns. Supervision adds some information, even if clustering itself makes only one error.

3.3.2 SRBCT cancer data
This set has 63 samples with 2308 genes and 4 expression diagnosis patterns (Kahn et al., 2001). For Formula going to infinity, the best tuning-robust estimates groups cluster data into 5 clusters making as many as 22 errors. On the other hand, with finite Formula , SCAP finds a regime of four clusters, making only seven assignment errors. Here, Dettling reports only 1.24% generalization error in supervised classification, corresponding to less than one error on 63 patterns. Classification thus performs considerably better than clustering alone.

3.3.3 Leukemia
This set has 72 samples with 3571 genes and 2 diagnoses (Golub et al., 1999). In the case of infinite Formula , the original AP groups data into two clusters with four errors, while for variable Formula (fixing {sigma} very large) SCAP finds two clusters with two errors. Also classification leads to 2.5% of errors, a result which is slightly better than our clustering result.


    4 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 THE ALGORITHM
 3 APPLICATION TO BIOLOGICAL...
 4 METHODS
 5 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
In the process of choosing exemplars, we need to calculate marginals


Formula 13

(13)
Pµ(c) is the probability that data point µ chooses point c as its exemplar. The calculation of marginals can be done iteratively via a message-passing algorithm called Belief Propagation (BP) (Yedidia et al., 2005). It is exact on tree factor graphs but usable heuristically in the general case. Together with a generalized larger family of message-passing algorithms, it was shown to be very powerful in solving NP-hard combinatorial problems on locally tree-like structures (Hartmann and Weigt, 2005; Mézard et al., 2002). Recently, the applicability of BP was also shown to be efficient in some important problems giving rise to dense and loopy factor graphs (Braunstein and Zecchina, 2006; Kabashima, 2003).

Looking at Figures 4 and 5, BP computes beliefs Formula (c) for the marginal probabilities as products of messages Formula coming from each compatibility constraint, times the local prior computed as the exponential of the similarity between point µ and its putative exemplar c. Up to overall normalization, we write:


Formula 14

(14)
where ß plays the role of an annealing parameter measuring the relative importance given to the priors compared to the information passed by the messages. Message Formula can be interpreted, as the probability that constraint {lambda} alone forces µ to select exemplar c. It can be calculated via the following self-consistent equations


Formula 15

(15)


Formula 16

(16)
where the N2 functions Formula can be seen as probabilities that data point µ chose c to be its exemplar if constraint {lambda} were absent in Equation (2). These probabilities are called cavity probabilities, because the disregarding of one data point/constraint effectively carves a cavity in the original factor graph. The link direction of functions As and Bs is shown if Figure 4 together with the problem's; factor graph. Figure 5 shows a pictorial representation of the flow of messages (15) and (16).


Figure 4
View larger version (18K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4. SCAP factor graph and direction of sent messages. Variables are represented by circles, constraints by squares.

 

Figure 5
View larger version (10K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5. Pictorial representation of messages flow.

 
Along the lines of Frey and Dueck (2007a and b), but bearing in mind the modified form for the compatibility constraints, Equation (15) can be simplified after a few manipulations in the following way, depending on cases:


Formula 17

(17)
with Formula being normalization constants.

It is remarkable that the number of effectively independent quantities present in Equation (15) is much smaller than the apparent O(N3) real valued numbers. Indeed, functional messages A take only two different values: Formula , that from now on will be called A({nu},µ) to avoid index redundancy, and Formula independently on c, as long as it is != µ. The exchange of indexes in A(µ,{nu}) is pure convention, but it has been introduced for coherency with the definition of availabilities given in Frey and Dueck, (2007a). It follows immediately from the normalization condition that Formula .

For the cavity probability functions, manipulation of Equation (16) involving the use of the last normalization condition and of the normalization constant rescaling, leads to


Formula



Formula 18

(18)


Formula 19

(19)


Formula

with Formula guaranteeing normalization Formula . Messages Formula have been also renamed with a symbol coherent with the responsibility-availability notation of Frey and Dueck (2007a). It can be seen that self consistent equations close into the Formula quantities A({nu},µ) and R(µ,{nu}) alone. Indeed, the effective dependence on the exemplar choice is dropped, and the computational size of the problem reduces by a factor N. The set of equations of A and R can be solved iteratively via the BP algorithm. The case in which one is interested not in the whole form of the posterior probability function, but only in retaining information about the most probable exemplar chosen by each data point, can be seen to be equivalent to taking the Formula limit where availabilities a and responsibilities r are introduced in the following way:


Formula 20

(20)


Formula 21

(21)


Formula 22

(22)


Formula 23

(23)
and treating the exponential scaling in a regime where prior similarities between data points S are of the same order of magnitude of the valued of a and r. The rescaling of responsibilities can be freely done as it does not change the number of independent variables. From the last definitions one is led to equations


Formula 24

(24)


Formula 25

(25)
where the last relation already assumes a large ß limit with non-degeneracy of the most probable value of the cavity probabilities. This hypothesis is equivalent to having non-degenerate choices of exemplars for all data points, i.e. to the existence of a single optimal clustering identified via the SCAP algorithm as the unique ground state of the system energy (3). This is a sensible assumption, but it is not always satisfied in interesting cases. Studying the degenerate number and behavior of clustering choices is another crucial question that is only partially answered by the introduction of the relaxation parameter p and will be the subject of further work beyond this article. In the large ß regime, in order to work with quantities all with the same scaling, it is useful to define


Formula 26

(26)
and consider Formula , a and r fixed varying ß. Equating Equations (24) and (25) with Equations (17) and (19), respectively, and extracting the leading terms in the large ß limit assuming no degeneracy, the following equations are found, using Equation (26):


Formula 27

(27)
Making another change of variables redefining the self-responsibilities as


Formula 28

(28)
we get, in terms of the rescaled quantities,


Formula 29

(29)

leading to SCAP Equations (5).1 After convergence, marginals can be written (Frey and Dueck, 2007a; Yedidia, 2005) as


Formula 30

(30)
In the Formula limit, one can write Equation (6).


    5 DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 THE ALGORITHM
 3 APPLICATION TO BIOLOGICAL...
 4 METHODS
 5 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Affinity Propagation is a new powerful tool for unsupervised clustering. It has many very strong points. First it is very efficient, convergence to the final clustering is very fast, the latter appears to be independent on the initialization of messages. Second, due to its hard constraints AP identifies exemplars which are prototypical data points representing a whole cluster.

This last point is, however, also a first limitation of the original AP algorithm. If clusters cannot be well represented by a single cluster exemplar, AP has to fail. The hard constraint renders the algorithm greedy, and small fluctuations in the similarity measure may trigger avalanches in the exemplar choice leading to different clusterings for only slightly modified model parameters.

We have introduced a soft-constraint version of affinity propagation which is able to cure a part of these problems without loosing the efficiency of the original AP:

  • By relaxing the hard constraint on clusters exemplars, we could introduce a parameter (Formula ) controlling the algorithm greediness. Formula is a better tuning parameter than {sigma} (it is more informative and leads to more robust and stable clustering) and it is easier to interpret the statistical meaning of its tuning process.
  • Clusters are more robust than in the original formulation of the algorithm. Moreover, even though a second a priori free parameter is introduced, the overall dependence of the algorithm on free parameters is reduced, and an optimal tuning strategy naturally emerges.
  • The cluster structure can be efficiently probed. This concerns the internal structure of the clusters since SCAP is able to identify central and peripherical nodes of each clusters, as well as the hierarchical organization leading to a process of cluster merging if cluster number is reduced by looking to less fine structures.
  • In the case of high-dimensional data, the relation between data points and their exemplars can be used to extract a sparse cluster signature. In the case of brain tumors, we have found that 20–40 genes per cluster are sufficient to reproduce almost the same clustering as found using all genes.

We conclude that SCAP is more efficient than AP in particular in the case of noisy, irregularly organized data—and thus in biological applications concerning microarray data. The computational efficiency of SCAP allows us to treat also very large data sets.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 THE ALGORITHM
 3 APPLICATION TO BIOLOGICAL...
 4 METHODS
 5 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
We acknowledge very useful discussions with Alfredo Braunstein, Andrea Pagnani and Riccardo Zecchina. M.L. would like to thank the Malawi Polytechnic for hospitality during the preparation of the manuscript. The work of S. and M.W. was supported by the EC via the STREP GENNETEC (‘Genetic networks: emergence and complexity’).

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Trey Ideker

1 Note that the last change of variable via the subtraction of the overall quantity maxµFormula is redundant if self-responsibilities are negative, as it is usually the case. Back

Received on May 24, 2007; revised on July 25, 2007; accepted on August 9, 2007

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 THE ALGORITHM
 3 APPLICATION TO BIOLOGICAL...
 4 METHODS
 5 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Alizadeh A, et al. Distinct types of diffuse large B-cell-lymphoma identified by gene expression profiling. Nature (2000) 403:503–511.[CrossRef][Medline]

    Blatt M, et al. Super-paramagnetic clustering of data. Phys. Rev. Lett (1996) 76:3251.[CrossRef][Web of Science][Medline]

    Braunstein A, Zecchina R. Learning by message passing in networks of discrete synapses. Phys. Rev. Lett (2006) 96:030201.[CrossRef][Medline]

    Dettling M. BagBoosting for tumor classification with gene expression data. Bioinformatics (2004) 20:3583–3593.[Abstract/Free Full Text]

    Duda RO, Hart PE. Pattern Classification and Scene Analysis (1973) New York: Wiley.

    Frey JF, Dueck D. Clustering by passing messages between data points. Science (2007a) 315:972–976.[Abstract/Free Full Text]

    Frey JF, Dueck D. Mixture modeling by affinity propagation. (2007b) Freely available at books.nips.cc/papers/files/nips18/NIPS2005_0799.pdf.

    Golub T, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science (1999) 286:531–537.[Abstract/Free Full Text]

    Hartmann AK, Weigt M. Phase Transitions in Combinatorial Optimization Problems (2005) Berlin: Wiley-VCH.

    Jain AK, et al. Data clustering: a review. ACM Comput. Surv (1999) 31:264–323.[CrossRef]

    Kabashima Y. A CDMA multiuser detection algorithm on the basis of belief propagation. J. Phys. A (2003) 36:11111–11121.[CrossRef]

    Khan J, et al. Classification and diagnostic prediction of cancer using gene expression profiling and artificial neural networks. Nat. Med (2001) 6:673–679.[CrossRef][Web of Science]

    Kschischang FR, et al. Factor graphs and the sum-product algorithm. IEEE Trans. Inform. Theory (2001) 47:1.

    MacQueen J. Some Methods for Classification and Analysis of Multivariate Observations. Le Cam LM, Neyman J, eds. (1967) Vol. 1. Proceedings of Fifth Berkeley Symposium on Mathematical Statistics and Probability. Berkeley CA: University of California Press. 281–297.

    Mézard M, et al. Analytic and algorithmic solution of random satisfiability problems. Science (2002) 297:812.[Abstract/Free Full Text]

    Pomeroy S, et al. Prediction of central nervous system embryonal tumor outcome based on gene expression. Nature (2002) 415:436–442.[CrossRef][Medline]

    Shental N, et al. Pairwise clustering and graphical models. (2003) In 17th International Conference on Neural Information Processing Systems - NIPS. Canada: Vancouver.

    Shi J, Malik J. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell (2000) 22:888–905.[CrossRef]

    Yedidia JS, et al. Belief propagation and generalizations. IEEE Trans. Inform. Theory (2005) 51:2282.[CrossRef]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
23/20/2708    most recent
btm414v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Leone, M.
Right arrow Articles by Weigt, M.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Leone, M.
Right arrow Articles by Weigt, M.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?