Skip Navigation

Bioinformatics 2007 23(13):i29-i40; doi:10.1093/bioinformatics/btm212
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Google Scholar
Right arrow Articles by Asur, S.
Right arrow Articles by Parthasarathy, S.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Asur, S.
Right arrow Articles by Parthasarathy, S.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

An ensemble framework for clustering protein–protein interaction networks

Sitaram Asur , Duygu Ucar and Srinivasan Parthasarathy *

Department of Computer Science and Engineering, Ohio State University, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 RELATED WORK
 3 ALGORITHMS
 4 EXPERIMENTS
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: Protein–Protein Interaction (PPI) networks are believed to be important sources of information related to biological processes and complex metabolic functions of the cell. The presence of biologically relevant functional modules in these networks has been theorized by many researchers. However, the application of traditional clustering algorithms for extracting these modules has not been successful, largely due to the presence of noisy false positive interactions as well as specific topological challenges in the network.

Results: In this article, we propose an ensemble clustering framework to address this problem. For base clustering, we introduce two topology-based distance metrics to counteract the effects of noise. We develop a PCA-based consensus clustering technique, designed to reduce the dimensionality of the consensus problem and yield informative clusters. We also develop a soft consensus clustering variant to assign multifaceted proteins to multiple functional groups. We conduct an empirical evaluation of different consensus techniques using topology-based, information theoretic and domain-specific validation metrics and show that our approaches can provide significant benefits over other state-of-the-art approaches. Our analysis of the consensus clusters obtained demonstrates that ensemble clustering can (a) produce improved biologically significant functional groupings; and (b) facilitate soft clustering by discovering multiple functional associations for proteins.

Contact: srini{at}cse.ohio-state.edu

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 RELATED WORK
 3 ALGORITHMS
 4 EXPERIMENTS
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Proteins are central components of cell machinery and life. In fact, as noted by Kahn (1995), it is the proteins dynamically generated by a cell that execute the genetic program. von Mering et al. (2002) state that, to fully understand cell machinery, studying proteins in isolation is not enough—(clusters of) interactions need to be delineated as well, since proteins work with other proteins to regulate and support each other for specific functions. Recent advances in technology have enabled scientists to determine, identify and validate pair-wise protein interactions through a range of experimental and in-silico methods (Fields and Song, 1989; Fields and Sternglanz, 1994; Phizicky and Fields, 1995; Vasilescu et al., 2004). Such data can be naturally represented in the form of interaction networks. The task of extracting relevant groupings or functional modules from such interaction networks, for the purposes of understanding the behavior of organisms, protein function prediction and drug design is challenging and an active area of research (Brun et al., 2004; Pereira-Leal et al., 2004; Phizicky and Fields, 1995; Ucar et al., 2005; Wu et al., 2002; Yook et al., 2004). The challenges involved are manifold.

First, is the issue of data quality. Different experimental and in-silico methods can be used to compute interactions, each with its own strengths and weaknesses (Fields and Song, 1989; Fields and Sternglanz, 1994; Phizicky and Fields, 1995; Vasilescu et al., 2004). Often, the overlap, in terms of common interactions across experimental settings, is not very high. An added complexity is that the data obtained from such methods is believed to be quite noisy—many interactions are conjectured to be false positives. Integrating data from such sources yields interaction networks that are inherently noisy (Bader and Hogue, 2002). To address this problem, various researchers have examined data preprocessing techniques to identify and eliminate potential false positives (and to identify potential false negatives) by examining the topological characteristics of such networks (Chen et al., 2006; Saito et al., 2002; Ucar et al., 2005).

Second, even if the network is assumed to be noise free, partitioning the network using classical graph partitioning or clustering schemes is inherently difficult. A common characteristic of Protein–Protein Interaction (PPI) networks is that, a few nodes (hubs) have very large degrees, while most other nodes have very few interactions. Applying traditional clustering approaches typically results in a clustering arrangement that is quite poor—containing one or a few giant core clusters and several tiny clusters (possibly singleton clusters). To address this problem, researchers have relied on various refinements that take into account domain expertise and topological information (e.g. targeting scale-free networks) to constrain the clustering process resulting in an improved clustering arrangement (Singh et al., 2006; Friedel and Zimmer, 2006).

Third, some proteins are believed to be multi-functional—effective strategies for soft clustering of these essential proteins are needed. This dictates the need to leverage or adapt soft clustering approaches. To address this problem, recent research has examined strategies such as hub duplication (Ucar et al., 2006) and partitioning the line-graph transform of the original PPI network. The former ensures the soft clustering of hub proteins, that are believed to be multi-functional (Jeong et al., 2001), while the latter targets the clustering of edges in the original graph (nodes in the transformed graph) to dictate the eventual set of protein clusters (Pereira-Leal et al., 2004).

In this work we examine an alternative approach, ensemble clustering, to resolve these three problems simultaneously. Ensemble clustering has been proposed in the literature as a useful approach to strengthen the quality of simple clustering algorithms (Fred and Jain, 2002; Gionis et al., 2005; Strehl and Ghosh, 2002a, b; Topchy et al., 2004). The goal is to combine multiple, diverse and independent clustering arrangements to obtain a single, comprehensive consensus clustering. Empirical evidence has suggested that intelligent combination of these clusters can lead to novel and meaningful cluster structures, even in the presence of noise (Topchy et al., 2004).

However, naively applying ensemble clustering to the problem at hand will not work. There are certain key questions that need to be resolved. First, what are the base clustering methods to be used for processing PPI networks? An appealing option here is to leverage domain and topological information to identify good candidate base clustering methods. Second, clustering ensembles typically do not scale very well—building a consensus is expensive and is affected by the dimensionality of the problem on hand. An attractive option here is to investigate the use of traditional dimensionality reduction options to improve the scalability of the consensus building step. Third, are there ways in which one can make the ensemble clustering more robust to noise effects? For example, by developing suitable pruning or weighting strategies. Fourth, the existing literature on ensemble clustering algorithms is limited to hard clustering problems—can one adapt such approaches for soft clustering? Faced with these challenges our contributions are:

  • We have designed and evaluated the use of two topology-driven distance metrics for network clustering. We use three traditional graph partitioning algorithms with the two metrics to obtain six base clusterings that are diverse and yet informative about the topological properties of nodes in the network.
  • We have designed and evaluated a consensus method that relies on Principal Component Analysis (PCA) to reduce the dimensionality of the consensus determination problem. The ensemble solution on the reduced dimensional space can then be efficiently computed using traditional consensus methods.
  • We have also developed a topology-driven strategy for pruning weak base clusters that significantly improves the quality of the resulting ensemble cluster arrangement.
  • We have designed an adaptation to the above approach that allows for soft ensemble clustering of proteins in interaction networks. This enables our method to model and account for multi-faceted proteins.
  • We conduct a detailed empirical evaluation and comparison of our approaches with other state-of-the-art algorithms on the PPI network of budding yeast (Saccharomyces cerevisiae). We use topological, information theoretic and domain-specific cluster validation metrics to evaluate and modulate the improvements gained from each component of the proposed ensemble clustering methodology.

Our experimental results show that our algorithms can provide significant improvement in cluster quality across the board (not just the top clusters), when compared to previously reported methods. We also show that ensemble clustering can effectively facilitate the discovery of multiple functional associations for proteins.


    2 RELATED WORK
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 RELATED WORK
 3 ALGORITHMS
 4 EXPERIMENTS
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Many clustering algorithms of various types have been applied to analyze PPI networks. Bader and Hogue (2003) proposed the three-stage Molecular Complex Detection (MCODE) algorithm to identify densely connected regions from a PPI graph. First, each vertex of the graph is associated with a weight based on the local neighborhood density of that vertex. Second, clusters are created around the top-weighted vertices (seed vertices) by iteratively adding high-scoring vertices to the cluster. Finally, clusters that are not dense enough are eliminated from the final set of partitions.

The algorithm Markov Clustering (MCL) (van Dongen, 2000) is a fast and scalable unsupervised clustering algorithm for graphs, based on the simulation of stochastic flow in graphs. The algorithm simulates random walks within a graph by alternation of two operators called expansion and inflation. Eventually, the iteration results in the separation of the graph into different segments (clusters). A recent study (Brohee and van Helden, 2006) compared four clustering algorithms,—Markov CLustering (MCL), Restricted Neighborhood Search Clustering (RNSC), Super Paramagnetic Clustering (SPC) and MCODE, on six PPI networks to identify protein complexes. The clusters obtained from the algorithms were compared with known annotated complexes. Their conclusion was that MCL algorithm far outperformed the other algorithms in the extraction of complexes from interaction networks.

The ensemble clustering problem has been studied previously in the machine learning community by many researchers, although it has been applied mainly to small classification datasets thus far. Fred and Jain (2002) map clusterings produced by multiple runs of the k-means algorithm with different initializations into a co-association matrix. They then apply a hierarchical single-link algorithm to partition this matrix into the final consensus clusters. In a later work, Topchy et al. (2004) present two approaches to prove the effectiveness of a cluster ensemble—using plurality voting and using a metric on the space of partitions.

Gionis et al. (2005) provide a formal definition to the problem of cluster aggregation and discuss a few consensus algorithms with theoretical guarantees. The algorithms they propose use the distance matrix representation and are suitable mainly for small datasets. The Agglomerative algorithm they proposed merges clusters that have distances less than 1/2, which is a hard-coded threshold. If a point has distance greater than half with all other clusters, it is placed in a cluster by itself. The Balls algorithm tries to find ball-shaped clusters, grouping together proteins that are close to each other and far from other nodes. Both these algorithms have been evaluated only on small categorical datasets. They have not been evaluated on large graph datasets. We use these two algorithms for comparison with our techniques.

Strehl and Ghosh (2000a, b) define the cluster ensemble problem as an optimization problem and aim to maximize the normalized mutual information (NMI) of the consensus clustering from the initial clusters obtained from 10 base clustering algorithms. They use a hypergraph representation with an Formula matrix, where n is the number of points and m is the total number of clusters in all the clusterings. They introduce three different algorithms to obtain consensus clusterings, namely Cluster-based Similarity Partitioning (CSPA), HyperGraph Partitioning (HGPA) and Meta-Clustering (MCLA) algorithms. In CSPA, they construct a similarity matrix from the clusters obtained from the base clustering algorithms. This similarity matrix is treated as a weighted graph and partitioned using the Metis (Karypis and Kumar, 1998) algorithm to obtain the consensus clustering. In HGPA, the goal is to find a hyperedge separator that partitions the hypergraph into k unconnected components by cutting a minimal number of hyperedges. The HMetis algorithm is used for this purpose. In MCLA, the main idea is to group related hyperedges (base clusters) to obtain meta-clusters. A representative cluster is obtained for each meta-cluster. Finally, each data point is compared with the representative clusters and assigned to the meta-cluster it is most associated with. We use these three ensemble consensus techniques in our evaluation.


    3 ALGORITHMS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 RELATED WORK
 3 ALGORITHMS
 4 EXPERIMENTS
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
In this section, we describe our topological similarity metrics, base clustering algorithms and consensus methods in detail.

3.1 Similarity metrics
We introduce two different similarity metrics designed to capture diverse topological properties of PPI networks. Our goal is to weight edges of the PPI network to reflect the reliability of the corresponding interactions. Accordingly, edges with low values of weights will indicate potential false positive (noisy) interactions. Clustering algorithms can then use these weights to eliminate noisy edges and yield meaningful partitions. To assign suitable weights, we focus on two different topological features—Clustering coefficient and Edge betweenness.

3.1.1 Clustering coefficient-based
The first similarity metric is based on the Clustering coefficient, a popular metric from graph theory. The Clustering coefficient (Watts and Strogatz, 1998) is a measure that represents the interconnectivity of a vertex's neighbors. The Clustering coefficient of a vertex v with degree kv can be defined as follows:


Formula

where nv denotes the number of triangles that go through node v.

Essentially, if the edge between two nodes contributes significantly to the clustering coefficients of the nodes, then the nodes are considered similar and should be clustered together. To calculate the similarity of nodes vi and vj, we first calculate their clustering coefficients as Formula and Formula . We then remove the interaction(edge) between these nodes and recalculate the clustering coefficient of each node as Formula and Formula . The difference between these two values represent the importance of the edge for each node. Accordingly, the Clustering coefficient-based similarity of two nodes is then calculated as follows:


Formula

Note that if two nodes are not linked in the original network, their Clustering coefficient-based similarity score is zero. The similarity scores are normalized into the range [0-1] using min–max normalization.

3.1.2 Betweenness-based
The second metric is based on the Shortest-path Edge betweenness measure, which was first introduced by Newman and Girvan (2004). It is a popular measure for clustering networks in sociology and ecology to obtain communities. This measure favors edges between communities and disfavors ones within communities. The Shortest-path betweenness measure computes, for each edge in the graph, the fraction of shortest paths that pass through it. As pointed out by Holme et al. (2003), edge-betweenness uses properties calculated from the whole graph, allowing information from non-local features to be used in the clustering. To take advantage of the global information that is captured by the edge-betweenness measure Holme et al. (2003), we use it as a similarity metric, as follows.


Formula

where Formula is the number of shortest paths passing through edge Formula and Formula is the maximum number of shortest paths passing through an edge in the graph. Similar to the previous metric, this metric is defined only for connected pairs and rescaled into the range [0-1] using min–max normalization.

Note that, the Edge betweenness and Clustering coefficient are designed to capture different properties of the topology. The Clustering coefficient measure is a local measure, since it considers only the immediate neighborhood of nodes. The Betweenness measure on the other hand, considers all shortest paths between all nodes in the graph, thereby capturing global topological information.

3.2 Base algorithms
We use three conventional graph clustering algorithms to obtain the base clusters.

3.2.1 Repeated bisections (rbr)
The repeated bisections algorithm is a top-down clustering algorithm that computes the desired k-way clustering solution, by performing a sequence of Formula repeated bisections, where k is the required number of clusters. The input matrix is first clustered into two groups, after which one of the groups is selected and bisected further. This process continues until the desired number of clusters is found. During each step, a cluster is bisected so that the resulting two-way clustering solution optimizes the I2 clustering criterion function, which is given as:


Formula 1

(1)
where k is the total number of clusters, Si is the set of objects assigned to the i th cluster, v and u represent two objects, and Formula is the similarity between two objects.

3.2.2 Direct k-way partitioning (direct)
In this method, the desired k-way clustering solution is computed by simultaneously finding all k clusters. Initially, a set of k objects is selected from the datasets to act as the seeds of the k clusters. Then, for each object, its similarity to these k seeds is computed, and it is assigned to the cluster corresponding to its most similar seed. This initial clustering is then repeatedly refined to optimize the I2 clustering criterion function.

3.2.3 Multilevel k-way Partitioning (Metis)
Metis (kMetis) is a popular multilevel partitioning algorithm, developed by Karypis and Kumar (1998). It works in three phases: coarsening, initial partitioning and refinement. In the coarsening phase, the original graph is transformed into a sequence of smaller graphs. An initial k-way partitioning of the coarsest graph that satisfies the balancing constraints while minimizing the cut value is obtained in the next phase. Then the partitioning is projected back to the original graph by going through intermediate partitions. After projecting a partition, a refinement algorithm is employed to reduce the edge-cut while conserving the balance constraints.

3.3 Consensus methods
Using the base algorithms with the two topological metrics we discussed in the first subsection, we obtain six sets of k clusters. Our goal is to combine these individual clusterings to obtain a meaningful consensus clustering. Given n individual clusterings Formula , each having k clusters, a consensus function F is a mapping from the set of clusterings to a single, aggregated clustering:


Formula

Ideally, the consensus clustering needs to be representative of the individual component clusterings.

3.3.1 PCA-based consensus
The consensus technique we propose consists of three stages—Cluster Purification, Dimensionality Reduction and Consensus clustering.

Cluster purification—It has been well documented that different clustering algorithms typically yield diverse clusterings (Richard and Lippmann, 1991; Topchy et al., 2004). This is due to the different criteria and similarity metrics employed for clustering. Hence, it is likely that some of the clusters obtained are less consistent with the topology of the original graph than others. We believe that such clusters contribute to noise and distort the consensus function. To find these clusters, we once again rely on a topological measure. We define a reliability measure for each cluster, that is based on the topology of the proteins in the cluster. The shortest path distance between two proteins i and j is the minimum number of interactions in the original graph that separate them. For each cluster, we compute the intra-cluster distance as the average shortest path distance between all pairs of proteins in that cluster.


Formula 2

(2)
where Formula represents the nodes in cluster Formula and Formula represents the shortest path distance in terms of number of edges between nodes i and j. Formula signifies the diameter of the original PPI graph and is used for normalization. The reliability of a cluster is inversely proportional to its intra-cluster distance.


Formula 3

(3)
If the distance between nodes in a cluster is high, it indicates that the cluster is not very modular. Hence, we use a threshold value to prune away weak clusters. We choose a threshold value ensuring that each protein is represented in at least one third of the reliable subset of clusters.

Dimensionality reduction—We then represent the remaining clusters in a binary format with an Formula cluster-membership matrix, where m is the total number of clusters obtained using all base algorithms. Each row represents a point while each column corresponds to a cluster. The value I(x,y) in the matrix represents the indicator function of point x wrt cluster cly.


Formula

Even after pruning clusters, it is likely that the number of dimensions (clusters) is too large for the direct application of clustering algorithms. For instance, in our case, we have six algorithm-metric combinations each producing k clusters after pruning. If the value of k is large, clustering the Formula -dimensional points would prove inefficient, since distance metric computations that are integral to clustering, do not scale well to high dimensions (Aggarwal, 2001). It is also likely that noise still exists in the clusters.

To obtain a more scalable and efficient representation for clustering, we use the popular dimensionality reduction technique of PCA. The goal is to reduce the number of dimensions of the cluster-membership matrix without compromising the information required for clustering. As we described above, each feature vector (row) in the matrix corresponds to the cluster membership pattern of a node. Since we are using hard clustering algorithms, a node can occur only in six clusters. For large values of k, the binary feature vectors will be very sparse. Also, since the occurrence of a node in a cluster is not independent of other clusters in a clustering, there is bound to be a lot of redundancy in the feature vectors. Several researchers (Ding et al., 2002; Hoyle and Rattray, 2003; Strehl and Ghosh, 2002a, b) have suggested the application of dimensionality reduction techniques (such as PCA) as a pre-processing step to clustering sparse high-dimensional data. PCA uses the eigen decomposition of the correlation matrix to find orthogonal directions with total maximum variance of projections. In our case, it can use the correlations between the cluster membership patterns of nodes to eliminate redundancies reducing the matrix to a more compact representation, retaining only discriminatory information. Schein et al. (2003) have shown how a generalized linear model, that they term logistic PCA, is better suited for dimensionality reduction of binary data. Logistic PCA is based on a multivariate generalization of the Bernoulli distribution, as compared to conventional PCA, which assumes a Gaussian distribution on the data. Accordingly, we convert the Formula clusters into a cluster-membership matrix and apply logistic PCA, to reduce the number of dimensions. Traditional clustering algorithms can then be applied on this reduced representation without performance concerns, to obtain consensus clustering arrangements.

Consensus Clustering—To perform consensus clustering, we apply two different consensus clustering algorithms on the PCA representation - the Recursive Bisection (PCA-rbr) algorithm, which performed the best of the three base clustering algorithms, and the popular Agglomerative Hierarchical (PCA-agglo) algorithm. The agglomerative hierarchical clustering algorithm is a popular bottom-up clustering algorithm. In this method, the desired k-way clustering solution is computed using the agglomerative paradigm whose goal is to locally optimize (minimize or maximize) a particular clustering criterion function. The algorithm finds the clusters by initially assigning each object to its own cluster and then repeatedly merging pairs of clusters until either the desired number of clusters has been obtained or all of the objects have been merged into a single cluster leading to a complete agglomerative tree.

3.3.2 Weighted consensus
An alternative approach to pruning, is to weight proteins based on the reliability of the clusters they belong to. The intuition here is that, if two proteins are present together in a cluster of poor reliability, the corresponding interaction between them can be deemed to be of low significance and given a low weight. The base clusters obtained can be used to construct a new graph, with an edge existing between proteins iff they have been clustered together at least once. The weights for these edges are proportional to the reliability of the clusters they belong to.


Formula 4

(4)
where Formula is the Reliability score of cluster Formula and Formula is the cluster membership function.


Formula

The weighted graph is then clustered using the Agglomerative Hierarchical (PCA-agglo) algorithm.

3.3.3 Soft consensus clustering
As we mentioned earlier, several proteins are known to participate in several functions in the cell. By assigning all proteins to a single cluster each, we are inhibiting the number of functions that can be discovered. To overcome this issue, we construct a variant of the PCA-agglo consensus algorithm to perform soft clustering of proteins. The hard agglomerative algorithm places each protein into the most likely cluster to satisfy a clustering criterion. However, it is possible for a protein to belong to many clusters with varying degrees. The probability of a protein belonging to an alternate cluster can be expressed as a factor of its distance from the nodes in the cluster. If a protein has sufficiently strong interactions with the proteins that belong to a particular cluster, then it can be considered amenable to multiple membership. We use the average shortest path distance to quantify this measure.


Formula 5

(5)
where Formula denotes the length of the shortest path between i and j, Diam(G) is the diameter of the PPI graph, and Vclk denotes the nodes in cluster clk. The algorithm computes the probability for each protein and each cluster. We use a global threshold to assign all nodes that have high propensity towards multiple membership into their respective alternate clusters. Note that, although we perform this operation for all nodes, the nodes with the highest probability for multiple membership are the hubs in the PPI graph, which have been hypothesized to be multi-functional in nature (Jeong et al., 2001). Owing to their high degrees, they are more likely to interact with proteins having different functions.

3.3.4 Putting it all together
Figure 1 gives the overview of our ensemble framework. In the first step, the two topological metrics (Clustering Coefficient-based and Betweenness-based) are used with the three base clustering algorithms to reduce the noise in the PPI graph and produce six base clustering arrangements. In the consensus stage, the base clusters obtained are subjected to cluster purification to eliminate noisy clusters. We described two different techniques—pruning and weighting. The pruned clusters are fed into the PCA algorithm, which removes redundancies and noise and yields a compact representation. The result of the PCA step is a reduced matrix that contains only discriminatory information for proteins to be easily clustered. Alternately, the weights based on cluster reliability can be used to construct a new graph. For final consensus clustering, we use two algorithms as mentioned before—the Agglomerative algorithm and the RBR algorithm. Additionally, soft clustering can be performed to cluster certain proteins in multiple clusters.


Figure 1
View larger version (14K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. Overview of the Ensemble framework. Note that although we show only the agglomerative algorithm in the figure, the rbr algorithm can be used similarly.

 

    4 EXPERIMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 RELATED WORK
 3 ALGORITHMS
 4 EXPERIMENTS
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
4.1 Dataset
The PPI network of budding yeast (S.cerevisiae) has been studied earlier in several works (Arnau et al., 2005; Ucar et al., 2005, 2006; Wu et al., 2002; Yook et al., 2004). This dataset is available from the Database of Interacting Proteins (DIP). At the time of download (01-07-2007), it consisted of 17 194 interactions among 4928 proteins.

4.2 Validation Metrics
Before presenting our experimental results, we would like to describe our validation metrics. We use both domain-specific and general metrics to evaluate the quality of the consensus clusters.

4.2.1 Topological measure: modularity
The first metric we use is a topology-based Modularity metric, originally proposed by Newman and Girvan (2004). This metric uses a k x k symmetric matrix of clusters where each element Formula represents the fraction of edges that link nodes between clusters i and j and each Formula represents the fraction of edges linking nodes within cluster i. The modularity measure is given by


Formula

4.2.2 Information theoretic measure: normalized mutual information
Another metric to evaluate the quality of clusters obtained is the amount of mutual information shared between clusterings. This metric was originally described by Strehl and Ghosh (2002a, b). They define the optimal combined clustering as the one that shares the most information, in terms of mutual information, with the original clusterings. Assume r groupings denoted as Formula . Suppose there are two clusterings {lambda}a and {lambda}b of sizes ka and kb, respectively. Let nh be the number of objects in cluster Ch according to {lambda}a, nl the number of objects in cluster Cl according to {lambda}b and Formula is the number of objects in cluster Ch according to {lambda}a and in Cluster Cl according to {lambda}b. The [0-1] normalized mutual information Formula can be calculated as follows:


Formula

The average normalized mutual information (ANMI) between a set of r labelings, {Lambda} and a labeling named {lambda}i is defined as follows:


Formula

Here {Lambda} is the set of base clusterings and {lambda}i is the consensus clustering.

4.2.3 Domain-based measure: clustering score
For the PPI network, we need to test if the clusters obtained correspond to known functional modules. This can be done by validating the clusters using known biological associations from the Gene Ontology Consortium Online Database (Ashburner et al., 2000).1 The Gene Ontology (GO) database provides three vocabularies of known associations—Cellular Component which refers to the localization of proteins inside the cell, Molecular Function which refers to shared activities at the molecular level and Biological Process, which refers to entities at both the cellular and organism levels of granularity. Earlier researchers have used these three ontologies to validate the biological significance of clusters (Arnau et al., 2005; Ucar et al., 2005, 2006). We use all three annotations for validation and comparison. As of 1, February 2007, the GO database (http://www.geneontology.org/GO.downloads.ontology.shtml) contained 1864 cellular component terms, 7527 molecular function terms and 13 155 biological process terms.

Merely counting the proteins that share an annotation will be misleading since the underlying distribution of genes among different annotations is not uniform. Hence, P-values are used to calculate the statistical and biological significance of a group of proteins. The P-values essentially represent the chance of seeing that particular grouping, or better, given the background distribution. Assume a cluster of size n, with m proteins sharing a particular biological annotation. Also assume that there are N proteins in the database with M of them known to have that same annotation. Then using the Hypergeometric Distribution, the probability of observing m or more proteins that are annotated with the same GO term out of n proteins is:


Formula

Smaller P-values imply that the grouping is not random and is more significant biologically than one with a higher P-value. A cuttoff 2 parameter is used to differentiate significant groups from the insignificant ones. If a cluster is associated with a P-value greater than cutoff, it is considered insignificant3.

As the p-value of a single cluster is statistically not representative, we define a Clustering score function to quantify the overall clusters, as follows.


Formula

where nS and nI denotes the number of significant and insignificant clusters, respectively and Formula denotes the smallest P-value of the significant cluster i. Hence, each cluster is associated with one Clustering score for each of the three ontologies.

4.3 Experimental results
4.3.1 Evaluation of similarity metrics
We first evaluate the two similarity metrics we have developed for base clustering. In particular, we wish to validate the benefits of using weighted metrics for eliminating noise. To do this, we apply the clustering algorithms on an unweighted graph, where all edges are treated the same ( = 1 ). We then compare the clusters obtained using the domain-based Clustering score measure. To compare, we also implement a neighborhood metric based on the Czekanowski-Dice distance metric (Brun et al., 2004), which has been previously employed for clustering PPI graphs (Chua et al., 2006). The neighborhood-based similarity metric is defined as:


Formula 6

(6)
Here, Int(i) and Int(j) denote the adjacency lists of proteins i and j, respectively, and {Delta} represents the symmetric difference between the sets. Note that using this metric, nodes that do not interact with each other may have non-zero similarity if they have common neighbors. The comparison, in terms of Clustering scores for the RBR algorithm,4 is given in Figure 2.


Figure 2
View larger version (28K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. Domain-based comparison of base similarity metrics.

 
The Betweenness and Clustering Coefficient-based metrics have high Clustering score values for all three ontologies. This indicates that the Betweenness and Clustering Coefficient-based metrics can help reduce the effect of noise, leading to meaningful clusters. The Neighborhood metric, on the other hand, performs worse than the unweighted scenario. The metric assigns non-zero scores to pairs of nodes that are not connected in the original graph, if they have common neighbors. The results suggest that this addition of new edges contributes to increased noise in the PPI graph.

4.3.2 Consensus clustering
We use the three graph clustering algorithms with the two topology-based metrics to obtain six independent base clusterings each. Estimating the optimal number of clusters, k, is a serious issue in clustering. Some earlier approaches (Strehl and Ghosh, 2002a, b; Ray and Turi, 1999) have suggested using the ratio between the inter-cluster and intra-cluster similarities to estimate the value. We used both similarity metrics with the Metis algorithm to estimate cluster quality for different values of k. We performed the same operation with the other two algorithms. Finally, one of the values optimal for all three algorithms was chosen as the value of k. Accordingly, the value of k for the PPI dataset was chosen to be 100. Once the base clusters are obtained, the cluster purification step is performed to prune away weak clusters. The remaining clusters are then represented in the form of a matrix, as described earlier, and PCA is applied to reduce the dimensions. We select the number of dimensions that capture 95% of the total variance. We then perform consensus clustering using three algorithms—the agglomerative hierarchical algorithm (PCA-agglo), the repeated bisections divisive algorithm (PCA-rbr) and the soft consensus (PCA-softagglo) algorithm. We also investigate the benefits of weighted (Wt-agglo) consensus clustering, for comparison. To compare with our consensus technique, we use the three ensemble algorithms proposed by Strehl and Ghosh (2002a, b) - CSPA, HGPA and MCLA, and two ensemble algorithms—Balls (CE-balls) and Agglomerative (CE-agglo) proposed by Gionis et al. (2005). The latter two algorithms do not accept the required number of clusters as a parameter. When we used the default settings for both, with a distance matrix based on shortest path distances, the CE-agglo algorithm produced 2121 clusters and the CE-balls algorithm yielded 2783 clusters for the 4928 proteins. Most of these clusters contained only singletons or pairs. Also, the CSPA algorithm ran out of memory for this dataset. It seems to be conducive only for small datasets.

Modularity and NMI—First, we compare the consensus algorithms in terms of their Modularity and ANMI scores. Figure 3 shows the comparative results in terms of both these metrics for four consensus methods. The CE-agglo and CE-balls algorithms, as we mentioned earlier, resulted in a large number of clusters, most of which contained only singletons and pairs.5 Hence, the modularity and NMI scores were very low for these clusters and are not presented here.


Figure 3
View larger version (22K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. Modularity and NMI scores for consensus algorithms.

 
It can be observed that the PCA-agglo and PCA-rbr algorithms perform the best with high scores in terms of both metrics.

Domain-based Evaluation—We proceed to evaluate the clusters obtained from the consensus algorithms using the domain-based metric. Figure 4 shows the comparison in terms of Clustering Score for the Biological Process, Molecular Function and Cellular Component ontologies. Since the CE-agglo and CE-balls contain a large number of singletons, they have very few significant clusters. The PCA-based consensus methods once again do better than all the other algorithms. The PCA-agglo and PCA-rbr algorithms provide the best clustering scores overall. The CE-balls and CE-agglo, due to the large number of singletons and pairs, perform the worst, with very poor Clustering scores for all three ontologies. The MCLA algorithm does best among the algorithms we compare against, although its scores are lower than the PCA-based methods. The Wt-agglo consensus method has poor results due to the fact that it produces 55 singleton clusters. However, we found that out of the other 45 clusters, most were significant. The fact that not all proteins were clustered by the weighted consensus method suggests that pruning with PCA is a better option.


Figure 4
View larger version (31K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4. Domain-based clustering scores for consensus algorithms. Comparisons with MCLA, HGPA, CE-Balls and CE-agglo.

 
4.3.3 Comparison with MCODE and MCL
Next, we compare our consensus technique with two algorithms commonly utilized for extracting functional modules from PPI graphs—MCODE and MCL. A recent study (Brohee and van Helden, 2006) that compared these algorithms (among others) showed that the MCL algorithm, in particular, was very effective in identifying protein complexes from protein interaction networks. We wish to investigate the benefits of ensemble clustering when compared to these two algorithms.

We used the MCODE and MCL algorithms to extract clusters from the PPI graph. We used the default settings for MCODE (fluff option set to 0.1, mode score cutoff set to 0.2, degree cutoff set to 2), and obtained 59 clusters. One major drawback of this algorithm is that not all the proteins (vertices) in the network are clustered. The clusters we obtained consisted of only 794 proteins (out of 4928). From the domain-based metric, we found that among these 59 clusters, 46 clusters had significant Cellular Component annotations, 40 clusters had significant Molecular Function and 50 clusters had significant Biological Process annotations. On the other hand, the MCL algorithm generated 1246 clusters for the 4928 proteins. However, on examination, we found that most of these clusters were insignificant. Only 277 out of the 1246 were significant in terms of Biological Process annotations.

The P-value distributions for the 50 best clusters for PCA-agglo, MCODE and MCL for the Biological Process ontology are shown in Figure 5. Note that the graph illustrates improvements across the board and not merely among the best clusters. The MCODE algorithm produces only 50 significant clusters for this ontology. The biological significance of these clusters is very poor compared to the other two. The top 50 (out of 277) MCL clusters have consistently lower significance than the PCA-agglo clusters, as can be observed from the figure. The PCA-agglo algorithm yielded a large percentage of significant clusters (88 out of the 100 clusters were significant for Biological Process) and with small P-values [high values of –log(p-value)]. Moreover, PCA-agglo clustered all 4928 proteins whereas in the case of MCODE, a majority of the proteins (around Formula ) were unclustered. In the case of MCL, the top 30 clusters are of much lower significance than the PCA-agglo clusters, although the two algorithms become comparable subsequently.


Figure 5
View larger version (46K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5. P-value distribution comparison with MCODE and MCL for biological process ontology.

 
When we compared the modularity scores, we once again found the PCA-based methods outperforming MCODE and MCL. The modularity scores are given in Table 1. As we mentioned earlier, MCL produced a large number of clusters and most of the proteins in the clusters were sparsely connected. Since MCODE did not cluster all proteins, we only consider edges among the proteins clustered to compute the modularity. The results show that the ensemble methods produce denser clusters, with the PCA-agglo algorithm performing the best overall.


View this table:
[in this window]
[in a new window]

 
Table 1. Modularity scores comparison

 
Qualitative comparison with MCODE—We analyze the highest ranked cluster obtained by MCODE and the corresponding PCA-agglo cluster using the Cellular Component ontology to compare the effectiveness of these algorithms in terms of identifying protein complexes. The best scoring cluster in MCODE (with score 5.615) is composed of 26 proteins among which 15 belong to a known complex proteasome regulatory particle (GO:0005838). This grouping is associated with a small P-value of 8.5e–34. On the other hand, the PCA-agglo cluster that includes a majority of the same vertices has 21 proteins belonging to the proteasome regulatory particle complex. The significance of this result can be accentuated by the fact that out of the 6472 annotated proteins for yeast in the GO database, there exist only 23 proteins annotated with this complex. PCA-agglo groups 21 of them in one cluster (P-value 7.6e–48). The corresponding clusters produced by the two algorithms are plotted in Figure 6a and b. The white vertices represent proteins that are known to be part of this complex whereas the black ones do not have a known annotation in GO for that term. As can be seen from these two clusters, the cluster obtained by the PCA-agglo algorithm is denser compared to the MCODE cluster. In the MCODE cluster, there exist two separate dense regions, one composed of proteins in the proteasome regulatory particle complex and the other composed of proteins in the snRNP U6 complex (GO:0005688). This example indicates that PCA-agglo can obtain dense and homogeneous clusters.


Figure 6
View larger version (19K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 6. (a) MCODE cluster (b) PCA-agglo cluster.

 
Qualitative comparison with MCL—Next, we compare the clusters obtained by the MCL algorithm with the ones from PCA-rbr. The MCL algorithm partitioned our interaction network into 1246 clusters. Among these only 277 of them had significant Biological Process annotations, 216 had significant Molecular Function and 226 of them had significant Cellular Component annotations. This meant that, around 900-1000 of the clusters were insignificant. On the other hand, out of the 100 clusters produced by PCA-rbr there exist 89 clusters with significant Cellular Component annotations, 87 clusters with significant Molecular Function annotations and 90 clusters with significant Biological Process annotations. Although MCL is able to produce more clusters, the precision (percentage of significant clusters) and the biological significance within the clusters are low.

For our analysis we considered the clusters with significant Biological Process annotations for the two algorithms. The corresponding distributions for the top 476 clusters are shown in Figure 7. The MCL algorithm grouped 1940 proteins into 277 significant clusters with average cluster size of seven. Although PCA-rbr algorithm identifies only 90 clusters with significant annotations, 4145 proteins are grouped into these clusters (average cluster size is 46). To assess the biological homogeneity of these clusters, we label each of these clusters with the most significant GO annotations (P-values). Accordingly, the most significant annotation for MCL clusters for the Biological Process ontology has a P-value of 7.15e–46, whereas the most significant annotation for the PCA-rbr clusters is 2.2e–56. Furthermore, the average P-values for all significant clusters of MCL is 1.2e–04 whereas the average for PCA-rbr clusters are 1.1e–05. These results show that MCL produces many small-sized clusters which are not as homogeneous as the clusters obtained by the PCA-rbr algorithm.


Figure 7
View larger version (38K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 7. P-value distribution comparison with PCA-rbr and MCL for biological Process ontology.

 
To further analyze the effectiveness of these algorithms for protein complex identification purposes, we compared the most significant cluster obtained by MCL algorithm according to the Cellular Component ontology with its counterpart among the PCA-rbr clusters. The best cluster produced by the MCL algorithm (for this ontology) groups 31 proteins, among which 26 are known to be part of organellar large ribosomal subunit (GO:0000315). This arrangement is associated with a P-value of 5.7e–56. To find the corresponding PCA-rbr cluster, we identified the cluster that includes the most number of proteins from this cluster. As expected, the corresponding PCA-rbr cluster is also enriched with the proteins that are associated with organellar large ribosomal subunit. There exist 30 proteins (out of 40) in the corresponding PCA-rbr cluster which have known annotations with this complex (P-value is 1.3e–62). This cluster includes all 25 proteins that are correctly put together by the MCL algorithm as well as five other proteins (IMG1, MRP7, MRPL17, YDR115W, MRPL15) from the same complex that MCL fail to locate into this cluster. This illustrative example shows that the PCA-rbr clusters are larger and more homogeneous and may hence be better suited for the extraction of protein complexes.

4.3.4 Soft clustering
As we mentioned earlier, many proteins in PPI networks are believed to exhibit multiple functionalities, interacting with different groups of proteins for different functions. To identify these multi-faceted proteins, we used the soft-clustering variant of the PCA-agglo algorithm, which allows proteins to belong to multiple clusters. The algorithm identifies proteins that have high propensity for multiple membership. We use a strict threshold of 0.2 and assign a protein to an alternate cluster only if its average shortest path distance to the cluster is below 0.2. When we obtain the soft clusters, we found that a majority of the proteins that had multiple membership were hub proteins (proteins with high degrees). This is consistent with our initial assumption, since hub proteins are likely to be well-connected and are believed to exhibit multiple functionalities.

To emphasize the benefits of performing soft clustering, we provide an illustrative example.

CKA1 is a multi-faceted hub protein, involved in multiple cellular events such as maintenance of cell morphology and polarity, and regulating the actin and tubulin cytoskeletons. When we analyze the base clusterings using the clustering scores, we find that the base clusterings associate this hub protein in different groups. Three of the base algorithms (direct-betweenness, rbr-clustering coefficient and rbr-betweenness) group CKA1 with all the other proteins (CKB1,CKB2,CKA2) in protein kinase CK2 complex. On the other hand, the direct-clustering coefficient base algorithm grouped CKA1 together with 33 other proteins that take part in rRNA metabolism and the metis-betweenness base algorithm clusters it with proteins associated with cell organization and biogenesis (23 other proteins). These results indicate that most of the base clustering algorithms (except metis-clustering coefficient) are able to assign a multi-faceted protein to a cluster that includes proteins associated with one of its functions. A hard consensus clustering algorithm can only associate CKA1 with the most popular term. Accordingly, the pca-agglo consensus algorithm groups CKA1 with the protein kinase CK2 complex proteins in consensus with the majority of the base algorithms. This cluster, in which CKA1 has been placed by the PCA-agglo algorithm, has few proteins associated with the cell organization and biogenesis functionality. The soft clustering algorithm, on the other hand, places CKA1 into three clusters with significant enrichment scores. One of these clusters consists of proteins associated with rRNA metabolism with a significant P-value of 1.4e–21. The second cluster includes all protein kinase CK2 complex proteins (1.6e–09), whereas the third cluster is composed of cell organization and biogenesis proteins (4.8e–20). This example clearly shows that soft consensus clustering can lead to the discovery of multiple functionalities for proteins. The benefit of ensemble clustering is once again evident, since the different base clustering algorithms uncover different functionalities, which can be summed up adequately by the soft consensus clustering algorithm.

In our earlier work (Ucar et al., 2006) we developed a soft clustering method based on hub-duplication for the PPI dataset. Now, we compare the performance of the PCA-based soft consensus method7 with the hub-duplication technique. The P-value distributions for the Biological Process ontology 8 is shown in Figure 8. It can be observed that the PCA-soft-agglo method consistently yields clusters with higher biological significance than the hub-duplication technique. It can be hypothesized that the good performance of the soft ensemble algorithm is due to the fact that it assimilates the results of different base clusterings, whereas typical soft clustering algorithms use a single clustering criterion.


Figure 8
View larger version (41K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 8. P-value distribution comparison for soft clustering.

 
The hard and soft clusters we have obtained are provided as Supplementary Materials and are available from Bioinformatics online.


    5 CONCLUSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 RELATED WORK
 3 ALGORITHMS
 4 EXPERIMENTS
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
In this article, we have presented an ensemble framework for partitioning PPI networks. To obtain informative base clusters, we have developed two topological metrics that can counteract the effect of noisy (false positive) interactions in the PPI network. We have presented a detailed consensus technique involving PCA, designed to scale to large datasets and reduce the dimensionality of the consensus determination problem. Additionally, we have introduced topology-based pruning strategies to complement PCA in the task of eliminating redundant and noisy data. Finally, we have presented a soft consensus clustering algorithm, that is designed to discover multiple functional associations for proteins. Our thorough empirical evaluation and comparison of these consensus clustering algorithms with other state-of-the-art approaches using topological, information theoretic and domain specific validation metrics, demonstrate that the proposed PCA-based algorithms, apart from the scalability advantage, can lead to consensus clusters with high efficiency. Also, the PCA-based soft consensus clustering algorithm proves to be very effective in identifying multiple functionalities of proteins. The qualitative comparison of our clusters with those of popular algorithms such as MCODE and MCL reveals that ensemble algorithms can yield larger, denser clusters with improved biological significance. In the future, we would like to focus on extensions for the base algorithms. Also, we would like to extend the notion of ensembles to inculcate domain bias for fusing information from multiple experimental and in-silico PPI networks.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 RELATED WORK
 3 ALGORITHMS
 4 EXPERIMENTS
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
This work is supported in part by the DOE Early Career Principal Investigator Award No. DE-FG02-04ER25611 and NSF CAREER Grant IIS-0347662.

Conflict of Interest: none declared.


    FOOTNOTES
 
1 http://www.geneontology.org/ Back

2 The GO ontology performs multiple hypothesis testing to adjust the cutoff value. Back

3 We used the recommended cut-off of 0.05 for all our validations. Back

4 The trends for the other two clustering algorithms are similar and are omitted. Back

5 1124 of the 2121 clusters produced by the CE-agglo algorithm contained singletons, whereas for the CE-balls algorithm, 1939 of the 2783 clusters contained singletons. Back

6 The remaining clusters have comparable P-values. Back

7 To make a fair comparison, we used the same dataset as the earlier work. It was downloaded on 11-06-2005 and contained 15147 interactions among 4741 proteins. We provide the soft clusters obtained as Supplementary material. Back

8 The plots for the molecular function and cellular component ontologies follow similar trends and have been omitted due to lack of space. Back


    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 RELATED WORK
 3 ALGORITHMS
 4 EXPERIMENTS
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Aggarwal CC. Re-designing distance functions and distance-based applications for high dimensional data. SIGMOD Record, ( (2001) ) 30, : 13–18..

    Arnau V, et al. Iterative cluster analysis of protein interaction data. Bioinformatics, ( (2005) ) 21, : 364–378.[Abstract/Free Full Text].

    Ashburner M, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet, ( (2000) ) 25, : 25–29.[CrossRef][ISI][Medline].

    Bader GD, Hogue CW. Analyzing yeast protein-protein interaction data obtained from different sources. Nat Biotechnol, ( (2002) ) 20, : 991–997.[CrossRef][ISI][Medline].

    Bader GD, Hogue CWV. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics, ( (2003) ) 4, ..

    Brohée S, van Helden J. Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics, ( (2006) ) 7, : 488.[CrossRef][Medline].

    Brun C, et al. Clustering proteins from interaction networks for the prediction of cellular functions. BMC Bioinformatics, ( (2004) ) 5, : 95.[CrossRef][Medline].

    Chen J, et al. Increasing confidence of protein interactomes using network topological metrics. Bioinformatics, ( (2006) ) 22, : 1998–2004.[Abstract/Free Full Text].

    Chua HN, et al. Exploiting indirect neighbours and topological weight to predict protein function from protein-protein interactions. Bioinformatics, ( (2006) ) 22, : 1623–1630.[Abstract/Free Full Text].

    Ding C, et al. Adaptive dimension reduction for clustering high dimensional data. Proc. ICDM, ( (2002) ) 107–114..

    Fields S, Song OK. A novel genetic system to detect protein-protein interactions. Nature, ( (1989) ) 340, : 245–246.[CrossRef][Medline].

    Fields S, Sternglanz R. The two-hybrid system: an assay for protein-protein interactions. Trends Genet, ( (1994) ) 10, : 286–292.[CrossRef][ISI][Medline].

    Fred A, Jain AK. Data clustering using evidence accumulation. In Proc. of the 16th Int'l Conference on Pattern Recognition, ( (2002) ) 276–280..

    Friedel CC, Zimmer R. Inferring topology from clustering coefficients in protein-protein interaction networks. BMC Bioinformatics, ( (2006) ) 7, : 519.[CrossRef][Medline].

    Gionis A, et al. Clustering Aggregation. 21st International Conference on Data Engineering, ( (2005) ) 341–352..

    Holme P, et al. Subnetwork Hierarchies of Biochemical Pathways. Bioinformatics, ( (2003) ) 19, : 532–538.[Abstract/Free Full Text].

    Hoyle DC, Rattray M. PCA learning f