Skip Navigation

Bioinformatics 2007 23(13):i149-i158; doi:10.1093/bioinformatics/btm194
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Google Scholar
Right arrow Articles by Dutkowski, J.
Right arrow Articles by Tiuryn, J.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Dutkowski, J.
Right arrow Articles by Tiuryn, J.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Identification of functional modules from conserved ancestral protein–protein interactions

Janusz Dutkowski * and Jerzy Tiuryn

Institute of Informatics, Warsaw University, Poland

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 METHODS
 4 RESULTS AND DISCUSSION
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: The increasing availability of large-scale protein–protein interaction (PPI) data has fuelled the efforts to elucidate the building blocks and organization of cellular machinery. Previous studies have shown cross-species comparison to be an effective approach in uncovering functional modules in protein networks. This has in turn driven the research for new network alignment methods with a more solid grounding in network evolution models and better scalability, to allow multiple network comparison.

Results: We develop a new framework for protein network alignment, based on reconstruction of an ancestral PPI network. The reconstruction algorithm is built upon a proposed model of protein network evolution, which takes into account phylogenetic history of the proteins and the evolution of their interactions. The application of our methodology to the PPI networks of yeast, worm and fly reveals that the most probable conserved ancestral interactions are often related to known protein complexes. By projecting the conserved ancestral interactions back onto the input networks we are able to identify the corresponding conserved protein modules in the considered species. In contrast to most of the previous methods, our algorithm is able to compare many networks simultaneously. The performed experiments demonstrate the ability of our method to uncover many functional modules with high specificity.

Availability: Information for obtaining software and supplementary results are available at http://bioputer.mimuw.edu.pl/papers/cappi.

Contact: januszd{at}mimuw.edu.pl


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 METHODS
 4 RESULTS AND DISCUSSION
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Large-scale protein–protein interaction (PPI) networks are now available for many model organisms. The arising challenge is to analyze this data to reveal the basic components and organization of cell machinery. One of the key approaches to the analysis of large PPI networks is through network alignment, or global comparison of two or more (multiple alignment) networks to identify similar regions (Sharan and Ideker, 2006). As in the case of biological sequence data, the comparative approach demonstrates high potential. The basic motivation behind it is the principle of evolutionary conservation of functional units of the cell. Just like homologous gene and protein sequences are retained in many species, the common patterns of protein interaction realizing essential functions are expectedly retained as well. Cross-species comparison of PPI networks may enable functional annotation of the proteins, prediction and verification of protein interactions and finally identification of essential cellular units.

Research in the area of network alignment has begun recently and is constantly progressing towards more advanced and general methods. One of the first studies of this type was performed by Matthews et al. (2001) who identified homologous pairs of interactions (one from each of the two PPI networks). In a more recent formulation of network alignment Kelley et al. (2003) construct an alignment graph, in which the nodes represent pairs of homologous proteins (one from each of the two compared networks) and edges represent conserved interactions. The alignment graph is used to search for conserved pathways (simple paths). This approach is extended in Sharan et al. (2005b) to enable the search for conserved complexes (dense subgraphs of the alignment graph). Similar procedure, but with different scoring of the edges is presented in Koyuturk et al. (2006). In Sharan et al. (2005a) the authors modify their previous procedure to align three (rather than two) networks simultaneously.

One of the drawbacks of the alignment graph is that it includes a node for every pair (or triplet) of similar proteins (one from each input network). The commonly used similarity functions (e.g. BLAST E-value threshold) generally impose a many-to-many correspondence between proteins, which causes the size of the alignment graph to grow exponentially with the number of aligned networks. Flannick et al. (2006) present a different approach to network alignment, which addresses this problem by greedily assigning the aligned proteins to non-overlapping homology classes and progressively aligning multiple input networks. The algorithm also allows searching for a larger class of conserved topologies defined by the user.

Another important issue is the evolutionary relevance of the identified alignments. As suggested in a recent review by Sharan and Ideker (2006), network alignment procedures should be more strongly rooted in an evolutionary model of how networks evolve. This is not quite achieved by the previous methods. Although Koyuturk et al. (2006) apply evolutionary-based scoring scheme and Flannick et al. (2006) reconstruct the most parsimonious history of protein evolution, these methods do not attempt to probabilistically model the evolution of protein interactions. This problem is addressed in two recent works by Berg and Lassig (2006) and Hirsh and Sharan (2006). The former work builds network alignments using a score function which incorporates link dynamics represented as a Markov process, while the latter assigns probabilities to interaction loss and emergence in a single evolutionary step leading from an ancestral module (complex) to the observed interaction patterns.

Here we present a different approach based explicitly on the phylogenetic history of proteins and a stochastic evolutionary model of interaction emergence, loss and conservation during each of the events, which have impact on the network.

We develop a novel approach to aligning PPI networks. At the heart of our procedure lies reconstruction of the conserved ancestral PPI (CAPPI) network. First, we reconstruct the hypothetical sequence of evolutionary events (duplications, deletions and speciations), by which the proteins of the input PPI networks evolved from their counterparts in the common ancestral network. In the next step, we determine the posterior probabilities of interaction between proteins at each stage of evolution up to and including the common ancestor network. The probability of protein interaction is calculated under a proposed stochastic model of network evolution. The topology of the ancestral network (and each network at every stage of evolution) is determined by the most probable interactions. Finally, we identify modules in the CAPPI network and project them back onto the input networks to determine the alignment. Compared to previous approaches the proposed framework provides more insight into the protein network evolution. Its parameters are clearly tied to the evolutionary events shaping the network. The algorithm also scales well with the number of species, allowing for simultaneous comparison of many networks. We apply the CAPPI procedure to align the PPI networks of Saccharomyces cerevisiae, Drosophila melanogaster and Caenorhabditis elegans, three of the largest publicly available PPI networks to date. Within the aligned network modules we detect a large number of known yeast complexes with high specificity, favorably comparing to the previous study on the same data.

The article is organized as follows: Section 2 presents a brief overview of our methodology. Section 3 provides the details of the CAPPI reconstruction procedure, presents the model of network evolution and describes the identification of conserved network modules. Section 4 presents the performed experiments and their outcome, as well as comparison to previously proposed methods. A brief summary of the work is given in Section 5.


    2 APPROACH
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 METHODS
 4 RESULTS AND DISCUSSION
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
The aim of network alignment is the identification of similar (conserved) network regions in two or more networks. Here we present a general framework for aligning protein interaction networks. We approach the problem by reconstructing the hypothetical ancestral PPI network based on the protein sequences and interactions found in the input networks. Nodes of the ancestral network represent ancestral proteins, descendants of which have been conserved in the observed input networks. Edges of the ancestral network are assigned weights, which denote the probability of interaction between the respective ancestral proteins. These probabilities are calculated as the posterior probabilities given the interaction data in the input networks and the evolutionary model, by which the input networks are assumed to have evolved from the common ancestral network. The regions of the ancestral network with high interaction probability can therefore be regarded as a merged representation of the common proteins and interactions that have been conserved in the input networks.


    3 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 METHODS
 4 RESULTS AND DISCUSSION
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
We now present the details of our methods. The subsequent steps of the CAPPI procedure are outlined in Figure 1.


Figure 1
View larger version (27K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. Overview of the CAPPI methodology. We start with the input PPI networks (color-coded fragments of three networks are shown at the top). We determine non-overlapping homologous protein groups via clustering. Next, we build gene trees for each protein group and reconcile the trees with a common species tree (reconciled gene trees for three protein groups with outgoing arrows are shown). Finally, we determine the probability of each ancestral protein interaction (dotted lines) based on the interactions observed in the input networks (mapped to the leaves of the trees) and on the sequence of evolutionary events (duplications, speciations and deletions) extracted from the reconciled gene trees.

 
3.1 Reconstructing the phylogenetic history
We assume that we are given a number of PPI networks, each coming from a different species. The first step towards aligning the networks is determining the homology relationship between all proteins. We choose to split the proteins into non-overlapping families of (putatively) homologous proteins [a similar approach was presented in Flannick et al. (2006)]. The proteins in each cluster are believed to have descended from a common ancestral protein. In contrast to the approach taken by Flannick et al. (2006), where proteins are greedily assigned to families during the iterative alignment process (so as to maximize the alignment score), we choose to determine the homology relationship for all proteins directly by previously established methods for identification of protein families. Our grouping, therefore, has a greater chance of representing true homologous protein families. As one possibility, we could split the proteins according to the clusters of the COG database (Tatusov et al., 2003). However, to allow the application to all species and protein sequences we decide on a more general approach. Specifically, we determine the homologous clusters using TRIBE-MCL algorithm by Enright et al. (2002) with BLAST E-values as pairwise distances between proteins.

In the next step, for each group of homologous proteins we reconstruct its evolutionary history by means of phylogenetic analysis. We perform multiple alignment of the protein sequences using the CLUSTALW method (Higgins et al., 1994). Next, we calculate the distance matrix using the PROTDIST procedure and construct phylogenetic trees using the Neighbor Joining algorithm implemented in the Phylip package (Felsenstein, 2005).

For consistency with the evolutionary history, it is necessary to reconcile the gene tree of each protein class with the species tree of the aligned organisms (Page and Charleston, 1997). The reconciliation algorithm minimizing the duplication and loss score function is implemented in NOTUNG (Durand et al., 2006) and a more efficient reconciliation algorithm has been recently shown by Górecki and Tiuryn (2006).

Given the sequence of evolutionary events extracted from the reconciled trees we proceed to determine the posterior probability of interactions between proteins at each stage of evolution, given the interactions observed in the input networks. The reconstructed evolutionary history of each protein family serves as the backbone for the reconstruction of protein interactions. The formal description of the network reconstruction procedure, the proposed model of network evolution and the details of calculating the posterior probability of ancestral interactions are presented in the next section.

3.2 Reconstructing the ancestral network
Let SO be the set of species, for which we are given the observed PPI networks. We assume that we are given a phylogenetic tree of the considered species, in which leaves are labeled with the species from SO and inner nodes are labeled with unknown (hidden) predecessor species. The tree is determined by the indexed set of nodes Formula of all the species in the tree (observed and hidden) and three functions Formula . We will assume that the common ancestor (the root of the tree) is s1. LS(i) and RS(i) return the index of the left and right son of species si respectively (or Formula if the right (left) son does not exist) and F(i) returns the index of the father node (or Formula if i = 1). We also assume that the set of proteins in all observed species has been split into non-overlapping protein families (equivalence classes) and for each family we are given a gene tree, which is reconciled with the aforementioned species tree. From the reconciled gene trees we are able to extract the set of duplication events taking place in species si, that is after the speciation event which established si and before the speciation event when si evolved into two distinct species. For the purpose of this discussion we will assume that we can impose an ordering on duplication events in each species1 (in practice this is hard to achieve, since we do not know the order of events which occur within different protein families; but we have found by simulation, that changing this specific order, has negligible effect on the final result). We will denote by Formula the undirected graph representing the protein network of si after the j-th duplication event occurring in this species. The ancestral graph is denoted by Formula . Graph Formula has exactly one protein from each protein family—the protein placed at the root of the appropriate gene tree. The sequence of duplication events in species si is given by Formula , where Formula denotes duplication of protein Formula into two proteins Formula . The problem we consider is to determine the probability of interaction between nodes in the graph Formula and all its descendants, based on the observed graphs of the species in SO and assuming the sequences of evolutionary events, by which each of the observed networks evolved from Formula .

3.3 Duplication and divergence model of protein network evolution
The following model of network evolution motivated by the general duplication model (Sole et al., 2002; Pastor-Satorras et al., 2003) is used to determine the probability of observing graph Gi, j under the assumption of the sequence of speciations and duplications, by which Gi, j evolved from the ancestral network Formula .

The model has four parameters: pd, {delta}d, ps and {delta}s

  • We start with the ancestral graph Formula , 2 and perform a defined sequence of duplications and speciations.
  • In case of duplication Formula graph Gi, j is constructed on the basis of Formula in the following way:
    D1. All vertices besides np and edges which are not incident to np are copied from Formula to Vi,j.
    D2. Vertices na and nb are added to Vi,j.
    D3. For each edge Formula we add to Ei, j edges Formula and Formula independently, each with probability pd.
    D4. For each vertex Formula such that Formula we add to Ei,j edges Formula and Formula independently, each with probability Formula .

  • In case of speciation of species si graph Formula is constructed on the basis of Formula in the following way3:S1. All vertices are copied from Formula to Formula .S2. Each edge Formula is added to Formula independently with probability ps.S3. Each edge Formula is added to Formula independently with probability Formula .

Steps D3 and D4 associated with duplication are referred to as local or correlated divergence, because they only affect the edges of the newly added vertices. The steps following speciation can be referred to as global divergence, as they can affect any edge in the network. Wagner (2001) points out that duplicate gene products usually diverge quickly and loose common interactions. One reason for this is that there is greater tolerance for mutations of the newly duplicated proteins, because of functional and structural redundancy of the duplicates. This would suggest that pd (the probability of edge conservation following duplication) should generally be much lower than ps (the probability of edge conservation after speciation). Also, as suggested by Berg et al. (2004), the overall rate of interaction attachment and loss should be equal making the model parameters interrelated.

Under the proposed model the probability of interaction between proteins in network Gi, j is determined by the interactions in the ancestral network Formula and the assumed sequence of speciations and duplications, which led to the formation of Gi, j. In the following, we use the above model to infer the posterior probabilities of ancestral interactions given the observed input networks.

3.4 The most probable ancestral topology
For each graph Gi, j and each pair of vertices Formula we will denote by Formula the binary random variable equal 1 when there exists an edge Formula , and 0 otherwise.

Assuming the duplication and divergence model described earlier the probability Formula depends on the existence or lack of an edge between the protein pair being the direct predecessor of the pair (nx, ny). Let us consider the last evolutionary event which could affect the pair (nx, ny). Three cases are possible:

  1. Vertex Formula was created by duplication Formula from vertex Formula , where Formula . (4) Then we have Formula , and Formula Formula .
  2. Vertex Formula duplicated from vertex Formula , where Formula . (symmetrical to i.)
  3. Vertices Formula emerged by means of speciation from vertices Formula . We then have Formula Formula , and Formula Formula .

The above dependencies can be represented using a Bayesian network (BN) model (Neapolitan, 2003). We will start the construction of the BN from the instantiated random variables, which represent the edges or non-edges in the observed graphs of the species in SO. By considering the last duplication or speciation event we recursively determine the direct predecessor of each possible edge and assign the conditional probabilities as described above, until we reach the corresponding possible edge (random variable) in the ancestral graph.

Each random variable corresponding to a possible edge nxny depends on exactly one random variable denoting the edge (or non-edge) in the direct predecessor graph. Therefore, the considered BN will be in the form of a set of trees (forest), where each tree models the joint distribution of the random variables corresponding to interactions, which are descendants of one of the interactions in the ancestral graph.

Now we can formulate the problem of finding the ancestral topology as a Bayesian network inference problem. Precisely, we would like to determine for each i and j the posterior probability


Formula

of interaction between a protein pair (nx, ny) in species i after j-th duplication, given the set of instantiated variables E, which are the interactions and non-interactions between nodes in the networks of observed species. We assume that the prior probabilities of interactions between proteins in the ancestral network Formula are given and we have the conditional probabilities linking each child node in the BN to its father. The problem we solve here is the classical problem of inference in Bayesian networks. The tree structure of our BN model enables an efficient solution using a simple version of the message-passing algorithm due to Pearl (Neapolitan, 2003).

In the above discussion, we have considered duplications and speciations, but left out (for simplicity) the protein deletion events present in the reconciled trees. In fact, deletions do not complicate the model, but only prune the set of possible interactions that have to be considered. Before we move on, let us also point out that in the above framework one can easily incorporate the probability of not observing an existing interaction or observing a false interaction. It would only require the inclusion of additional random variables at the leaves of our Bayesian tree models. We will not explore this issue further here, leaving it as a possible direction for future work.

3.5 Identifying conserved ancestral modules
Our ultimate goal is to determine the conserved functional modules in observed networks. Previously proposed network alignment procedures assumed a specific topology of functional modules. The candidate network regions were scored for fulfilling the desired structure. In contrast to the previous methods, we do not impose a predefined topology when searching for conserved modules. Instead, we specifically identify network regions, which are descendents of putative ancestral modules.

The ancestral network Formula reconstructed according to the procedure described in previous sections is a complete graph in which each edge nxny is assigned a weight corresponding to the probability of interaction between the adjacent proteins nx and ny. The subsets of vertices connected by highly weighted edges are likely to constitute a functional module. To identify the modules of the ancestral network we set an edge threshold value t and eliminate from the graph Formula all edges with weights below that threshold.

As shown in the next section, the threshold value can be determined by observing gradual decomposition of the largest component. At low values of t the nodes of the ancestral network form one giant component. As we raise the value of the threshold t, the giant component decomposes and many components with heavy edges are revealed. The cut-off value can further be refined by determining the level, at which the probability of interaction between two vertices in the ancestral graph is statistically significant compared to the background model. To estimate the level of edge significance we repeatedly run our algorithm on randomized versions of the input data maintaining the original homology relationship of the proteins and their phylogenetic history and permuting the protein interaction data. The original PPIs are randomized by redistributing the edges of the input networks, while maintaining their node degree sequences [similar technique was applied in Kelley et al. (2003); Koyuturk et al. (2006) and Sharan et al. (2005b)]. For each set of randomized networks we reconstruct the ancestral PPI network and count the number of edges with weights exceeding a given threshold. Next, we compute the false discovery rate (FDR) for multiple hypothesis testing. The FDR q-value for a given edge weight w is estimated as:


Formula

where Formula is the number of edges with weights equal to or higher than w in the previously reconstructed ancestral network and Formula (Formula ) is the number of such edges in the i-th randomized network. We determine the threshold value t at which the edge weights are significant (e.g. q-value <0.05 ). We then decompose the graph Formula by deleting the edges of Formula with weights lower than t and removing all nodes without any interactions. The remaining connected components of the network constitute the ancestral network modules.

With the putative ancestral modules at hand, we proceed to identify the respective descendant modules in each of the considered species. To this end we project the ancestral modules onto the input networks by mapping the nodes of the ancestral network (ancestral proteins) to their descendants (proteins) in the input networks and identify the conserved descendant interactions.


    4 RESULTS AND DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 METHODS
 4 RESULTS AND DISCUSSION
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
We applied our method to search for conserved functional modules in the networks of yeast (S. cerevisiae), worm (C. elegans) and fly (D. melanogaster), three of the largest PPI networks available in public domain. In the following, we present our results and comparison to the previously proposed methods. All algorithms are used with their default parameters, except where noted.

4.1 Ancestral network reconstruction
We downloaded the PPI and protein sequence data from the database of interacting proteins (DIP) (Salwinski et al., 2004) (April 2006 download). In order to identify protein families, we performed MCL clustering (I = 1.2) of all the sequences available in the DIP database (including sequences from species other that the three species of interest) taking BLAST E-values as the pairwise distances. We found that better clustering results are achieved when a larger number of species are included. This is intuitively correct: as homologous proteins should be more similar to each other than to other proteins, larger and more diverse protein universe serves as a better background for homology identification. The clustering identified 6971 non-overlapping protein groups, 460 of which included protein representatives from all three species (yeast, fly and worm). We decided to limit further analysis to only these 460 clusters. Note that our protein groups thus have support in at least three species, which is consistent with the major requirement assumed in the construction of orthologous groups in the COG database (Tatusov et al., 2003).

Next, for each of the 460 protein clusters separately, we performed multiple sequence alignment by CLUSTALW, calculated the distance matrix using PROTDIST and constructed a family gene tree using the Neighbor Joining algorithm. This resulted in an unrooted tree for each family, which was rooted and reconciled with the species tree of yeast, fly and worm using the procedures implemented in NOTUNG.

For each pair of protein families we calculated the posterior probability of interaction between the respective ancestral proteins according to the model described earlier. The choice of model parameters is discussed subsequently. In an analogous way, we computed the probability of each ancestral self loop (representing the interactions of proteins with themselves) based on the observed interactions within one protein family. The resulting ancestral network consists of 460 nodes and Formula edges weighted by the probability of interaction of adjacent nodes.

4.1.1 Estimating parameters of edge dynamics
Here we discuss the choice of parameters pd, {delta}d, ps, {delta}s and the prior probability of interaction of any two given ancestral proteins. The model of network evolution presented in Section 3 is motivated by and related to the proteome growth model (Sole et al., 2002), often referred to as the general duplication and divergence model. The general duplication model included only the parameters of link conservation or emergence following the duplication event (corresponding to pd and {delta}d in our model). Sole et al. (2002) denoted the probability of edge deletion following duplication event by {delta} and set its value to 0.53, which was influenced by estimations of rates of link addition and deletion made by Wagner (2001). The general duplication model assumes, however, that after a duplication event one of the duplicates remains unchanged. Our model is symmetrical in the sense that both of the duplicates are subject to deletion/emergence of edges. Our first estimate of the probability of edge conservation pd is therefore 0.7, which (assuming independence) gives almost the same joint probability of edge conservation in both duplicates as in Sole et al. (2002). The probability that a new edge is introduced was estimated by Sole et al. (2002) to be 0.06 divided by the number of nodes in the network. For the purpose of the experiments we used a constant value Formula (without normalizing it further by the size of the network at each stage, as assumed in our theoretical model). The probability of edge conservation after the speciation event was set to Formula . This parameter is related to the overall divergence of the network over time and should be more conservative than the probability pd related to the fast divergence of interactions of newly duplicated proteins. The probability of edge emergence after speciation was set to Formula . As in the case of {delta}d we did not normalize it by the number of edges in our experiments. We set the Formula probability of each possible ancestral interaction at 0.01, which is motivated by the observation that PPI networks are generally sparse with a small average node degree. As shown in the following section network reconstructed with these parameters yielded satisfactory results in terms of identified modules. We further studied the effect of individual parameters (additional results are available on our web page) on the reconstructed alignment and decided for a more conservative setting, raising the probability of edge conservation and lowering the probability of edge emergence. Precisely we set Formula , Formula , Formula . This allowed us to identify more modules, which were well conserved across the considered networks.

4.2 Decomposition of the giant component
As stated above, the reconstructed ancestral network is a complete weighted graph with 460 nodes. However, the weights of the edges (probabilities of interaction) vary considerably depending on the evidence supporting the given interaction in the input networks and the evolutionary history. As we gradually eliminate the edges with the lowest weights the initially connected graph decomposes into a large number of small components, suggesting the existence of network modules. This is consistent for a wide range of parameters we have tested. The exact transition point and speed of decomposition varies with the choice of model parameters (especially the choice of prior probability of interaction), however, the general phenomenon was always observed. The decomposition of the ancestral network constructed with the model parameters discussed earlier is presented in Figure 2.


Figure 2
View larger version (18K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. Decomposition of the largest component (1a and 2a) and change in the number of modules and identified MIPS complexes (1b and 2b) with increasing edge threshold. The results for the less conservative parameters are presented in panels 1a and 1b and for the more conservative parameters (preferred) in panels 2a and 2b. In case of both settings, it is evident that as the giant component decomposes, many pure modules and in turn MIPS complex categories are identified. The nodes of the ancestral network not involved in any interactions are eliminated from the graph, which explains why the fraction of nodes in the largest component (plot 1a) increases suddenly at high threshold values. The optimal threshold values (in terms of identified pure modules) are indicated on plots 1a and 2a with blue circles.

 
4.3 Edge weight threshold selection
The connected components of the ancestral graph identified at a sufficiently high edge threshold level determine the ancestral modules of conserved interactions. To determine the significance level we calculated the FDR edge weight q-values using randomized networks. We then deleted all edges with weights below threshold 0.48 (q-value of 0.049). We also eliminated from the graph all nodes without any interactions.

4.4 Conserved modules and quality assessment
The ancestral network decomposed by eliminating edges below the threshold value of 0.48 contains 40 modules (visualized in Figure 3). Overall, 75 nodes representing ancestors of conserved protein families are present in the identified modules. By projecting the ancestral nodes onto their present-day descendants we obtain an alignment consisting of 40 respective network regions (modules) in the three input networks. We have found that a large part of the detected conserved modules match well to the known yeast protein complexes collected in the MIPS database (Mewes et al., 2006). To formally evaluate the quality of this finding we compute the purity score for each identified module as proposed by Sharan et al. (2005b). The purity score of a module with respect to a given MIPS category is defined as the number of module's proteins annotated to that category divided by the number of all annotated proteins contained in that module. The module is defined to be pure if it contains at least three annotated proteins and at least half of these share the same annotation (purity Formula ). The module is impure if it contains at least three annotated proteins and its purity with respect to any considered MIPS category is less than 0.5. All other modules are treated as not sufficiently annotated (unknown). Following previous studies (Sharan et al., 2005a, b) we only consider the annotations at MIPS level 3 and exclude annotations based on high-throughput experiments (category 550).


Figure 3
View larger version (27K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. The ancestral network modules identified with edge threshold value of 0.48. The pure modules (colored yellow) matching known protein complexes are described in Table 1. The projection of the largest module (in the top left corner) onto the input PPI networks is presented in Figure 4.

 
Overall, out of the 40 identified modules 14 are pure and 1 is impure. The 14 pure modules match to 16 MIPS categories summarized in Table 1. The largest ancestral module is found to be pure with respect to the MIPS category 360.10.20 (proteasome). The identified module projected onto the three PPI networks of the considered species is presented in Figure 4. The only one impure module we have found is composed of the ancestral nodes 109 and 54. Closer examination of this module yields interesting results. The module contains 15 annotated proteins assigned to five level 3 MIPS categories: 260.20.10 (4 proteins), 260.20.30 (4 proteins), 260.20.20 (4 proteins), 260.30.10 (2 proteins), 260.20.99 (1 protein). The module is thus pure with respect to level 2 category 260.20 (intracellular transport) and the interactions between the members of the respective level 3 categories are perhaps not unexpected.


View this table:
[in this window]
[in a new window]

 
Table 1. MIPS categories matched by pure modules identified by CAPPI

 

Figure 4
View larger version (13K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4. The conserved modules identified by projecting ancestral module 193-266-134-219-84 (see also Figure 3) onto the networks of yeast, fly and worm. Seven of the nine yeast proteins are assigned to the MIPS category 360.10.20 (Proteasome). The families corresponding to ancestral nodes 193, 266, 134, 219 and 84 are colored red, yellow, blue, purple and green, respectively. All three species contain representatives of each protein family, but only interacting proteins are visualized to present truly conserved PPI regions. The overall small coverage of C. elegans protein interactions in DIP may explain the apparent weak conservation of this module in case of this species.

 
The effect of edge threshold selection (discussed in the previous section) on the identified modules is visualized in Figure 2. The results for the chosen parameter values are presented in the bottom right panel. We observe that immediately after the decomposition of the giant component the number of pure modules increases to 14 and the number of impure modules drops to 1. Pure modules matching known protein complexes are connected to other pure modules by lighter edges, which disappear as the threshold is raised. It can also be seen that by choosing a sufficiently high threshold (0.96) we can eliminate the impure module from our solution and still identify seven pure modules. For comparison, we also plot the results for less restrictive parameters, which are perhaps better motivated by previous studies. Similar observations can be made with respect to these results and in fact, in this case, even with small threshold values we do not find any impure modules.

4.5 Comparison to existing methods
Our approach implemented in the CAPPI framework achieves two main goals. First, it is rooted in an evolutionary network growth model based on edge dynamics and phylogenetic information about the history of the network constituents. Second, it is able to simultaneously align multiple networks. Of the available network alignment methods only NetworkBLAST (Sharan et al., 2005a, b) and Graemlin (Flannick et al., 2006) have demonstrated the ability to align multiple (>2) networks and none of the methods apply evolutionary history reconstruction or probabilistic modeling of network growth to the extent presented here. A preliminary implementation of Graemlin is publicly available, however, the algorithm has many network dependent parameters, which have only been estimated for the SRINI networks used by the authors. The estimation of these parameters for other networks is not supported by the current implementation. NetworkBLAST was previously used to align the DIP networks of yeast, worm and fly (Sharan et al., 2005a) and the results provided by the authors enabled a straightforward preliminary comparison of the methods. Below we compare the results of the two methods using the MIPS complex database as a reference set of true functional modules.

NetworkBLAST experiments were performed on an earlier version of the DIP database (February 2004), which contained fewer interactions and protein sequences. In order to allow a fair comparison we have repeated our experiments on the same version of the database. The results are summarized in Table 2. Overall, the NetworkBLAST returned a much larger number of aligned modules. Also, a larger fraction of the returned modules were pure. However, the identified modules overlap considerably (up to 80% overlap was allowed by the authors) and among the 80 pure modules only 5 different MIPS categories are matched. In contrast, the alignments returned by CAPPI are non-overlapping. We provide CAPPI results for two edge weight thresholds. With a less restrictive edge threshold (corresponding to q-value <0.05) our method returned 39 modules among which 10 were pure. The 10 pure modules matched 13 MIPS complex categories altogether. The only one impure module returned was the same as the impure module identified in case of 2006 database version described earlier.


View this table:
[in this window]
[in a new window]

 
Table 2. Comparison of CAPPI and NetworkBLAST results

 
Raising the edge threshold to a more significant level (corresponding to q-value <0.03) resulted in a smaller number of pure modules and less identified MIPS categories. However, with this setting no impure modules were returned by CAPPI. At the same time CAPPI still identified two more MIPS categories than NetworkBLAST. We note that in case of both methods a considerable number of identified modules contain less than three annotated proteins (thus, they are not classified as being pure or impure). These modules may perhaps represent yet unknown protein complexes. Note that we intentionally do not formally define a sensitivity measure here, because the number of conserved complexes present in the DIP database is unknown. Instead we rely on the absolute number of matched MIPS complexes to compare the ability of the methods to identify true complexes. The assessment of the specificity of both methods is based on the number of impure modules, divided by the number of all identified modules. With the more restrictive edge threshold CAPPI achieves a perfect specificity score.

For completeness we also provide the results for the DIP 2006 version discussed in the previous section. We observe that at the lower edge threshold (corresponding to q-value <0.05) CAPPI identified more pure modules and MIPS categories on the more complete 2006 DIP database than on the 2004 edition. The results obtained with the more restrictive threshold remained unchanged with the introduction of new proteins and interactions, indicating that the group of alignments with the best support in the data is the same for both database versions.

The above experiments show that CAPPI is able to identify a relatively high number of known functional modules, while at the same time avoiding false-positive (impure) modules. More extensive tests are still needed to assess the performance of CAPPI and compare it to other methods under various conditions and with different reference databases.


    5 CONCLUSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 METHODS
 4 RESULTS AND DISCUSSION
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
We have presented a new approach to PPI network alignment. The method is based on probabilistic modeling of link dynamics guided by reconstructed evolutionary history of the proteins. The parameters of our model are clearly tied to the evolutionary events shaping the PPI networks. We have shown that the proposed framework can successfully identify conserved functional modules in multiple networks. Other applications of the proposed methodology are possible. The basic novel principle behind our approach, the reconstruction of ancestral PPI network, can be applied not only to align different PPI networks, but also to predict unknown protein interactions and functions, and to integrate many sources of PPI data. It can also be useful in evaluating hypothesis about PPI network evolution and comparing different evolutionary models. We have implemented our method using established algorithms for protein family detection and phylogeny reconstruction. However, the basic components of our framework can easily be replaced by better counterparts if such should become available. Our future studies will be directed towards formal analysis of the proposed model of network evolution, as well as the development of methods of comparing the identified conserved network modules with respect to the phylogenetic history.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 METHODS
 4 RESULTS AND DISCUSSION
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
We would like to thank Bartek Wilczynski, BogusFormula aw Kluge and Norbert Dojer for helpful discussions and Trey Ideker for providing the NetworkBLAST results. This work was supported by the Polish Ministry of Science grants No 3 T11F 021 28 and PBZ-MNiI-2/1/2005.

Conflict of Interest: none declared.


    FOOTNOTES
 
1 Corresponding to the chronological order. Back

2 For simplicity, we will assume that the initial graph G1,0 does not contain self-loops. However, they are easily incorporated and necessary to model the predecessors of interactions between duplicate proteins. Back

3 Graph GRS(i),0 is constructed independently in the same manner. Back

4 If k<j then there were other duplication events dk+1(i), ... , dj(i)in species si, which did not affect the protein pair (nx, ny). Back


    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 METHODS
 4 RESULTS AND DISCUSSION
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Berg J, Lassig M. Cross-species analysis of biological networks by Bayesian alignment. PNAS (2006) 103:10967.[Abstract/Free Full Text]

    Berg J, et al. Structure and evolution of protein interaction networks: a statistical model for link dynamics and gene duplications. BMC Evolution. Biol (2004) 4:51.[CrossRef]

    Durand D, et al. A hybrid micro-macroevolutionary approach to gene tree reconstruction. J. Comput. Biol (2006) 13:320–335.[CrossRef][Web of Science][Medline]

    Enright AJ, et al. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res (2002) 30:1575–1584.[Abstract/Free Full Text]

    Felsenstein J. PHYLIP (Phylogeny Inference Package) version 3.6 (2005) Seattle, US: Department of Genome Sciences, University of Washington. Distributed by the author.

    Flannick J, et al. Graemlin: general and robust alignment of multiple large interaction networks. Genome Res (2006) 16:1169–1181.[Abstract/Free Full Text]

    Górecki P, Tiuryn J. Inferring phylogeny from whole genomes. Bioinformatics (2006) 23:e116–e122.[CrossRef][Web of Science]

    Higgins D, et al. CLUSTALW: improving the sensitivity of progressivemultiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res (1994) 22:4673–4680.[Abstract/Free Full Text]

    Hirsh E, Sharan R. Identification of conserved protein complexes based on a model of protein network evolution. Bioinformatics (2006) 23:e170–e176.[CrossRef][Web of Science]

    Kelley BP, et al. Conserved pathways within bacteria and yeast as revealed by global protein network alignment. Proc. Natl Acad. Sci. USA (2003) 100:11394–11399.[Abstract/Free Full Text]

    Koyuturk M, et al. Pairwise alignment of protein interaction networks. J. Comput. Biol (2006) 13:182–199.[CrossRef][Web of Science][Medline]

    Matthews LR, et al. Identification of potential interaction networks using sequence-based searches for conserved protein-protein interactions or ‘interologs’. Genome Res (2001) 11:2120–2126.[Abstract/Free Full Text]

    Mewes HW, et al. MIPS: analysis and annotation of proteins from whole genomes in 2005. Nucleic Acids Res (2006) 34:D169–D172.[Abstract/Free Full Text]

    Neapolitan RE. Learning Bayesian Networks (2003) NJ: Upper Saddle River, Prentice Hall.

    Page R, Charleston M. From gene to organismal phylogeny: reconciled trees and the gene treespecies tree problem. Mol Phylogenet Evol (1997) 7:231–240.[CrossRef][Web of Science][Medline]

    Pastor-Satorras R, et al. Evolving protein interaction networks through gene duplication. J. Theor. Biol (2003) 222:199–210.[CrossRef][Web of Science][Medline]

    Salwinski L, et al. The database of interacting proteins: 2004 update. Nucleic Acids Res (2004) 32:D449.[Abstract/Free Full Text]

    Sharan R, Ideker T. Modeling cellular machinery through biological network comparison. Nat. Biotechnol (2006) 24:427–433.[CrossRef][Web of Science][Medline]

    Sharan R, et al. Conserved patterns of protein interaction in multiple species. Proc. Natl Acad. Sci. USA (2005a) 102:1974–1979.[Abstract/Free Full Text]

    Sharan R, et al. Identification of protein complexes by comparative analysis of yeast and bacterial protein interaction data. J. Comput. Biol (2005b) 12:835–846.[CrossRef][Web of Science][Medline]

    Sole RV, et al. A model of large-scale proteome evolution. Adv. Compl. Syst (2002) 5:43.[CrossRef]

    Tatusov RL, et al. The COG database: an updated version includes eukaryotes. BMC Bioinformatics (2003) 4.

    Wagner A. The yeast protein interaction network evolves rapidly and contains few redundant duplicate genes. Mol. Biol. Evol (2001) 18:1283–1292.[Abstract/Free Full Text]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
X. Guo and A. J. Hartemink
Domain-oriented edge-based alignment of protein interaction networks
Bioinformatics, June 15, 2009; 25(12): i240 - 1246.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Google Scholar
Right arrow Articles by Dutkowski, J.
Right arrow Articles by Tiuryn, J.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Dutkowski, J.
Right arrow Articles by Tiuryn, J.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?