Skip Navigation


Bioinformatics Advance Access originally published online on February 2, 2006
Bioinformatics 2006 22(7):823-829; doi:10.1093/bioinformatics/btl014
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrowOA All Versions of this Article:
22/7/823    most recent
btl014v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (21)
Google Scholar
Right arrow Articles by Yu, H.
Right arrow Articles by Gerstein, M.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Yu, H.
Right arrow Articles by Gerstein, M.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions@oxfordjournals.org

Predicting interactions in protein networks by completing defective cliques

Haiyuan Yu {dagger}, Alberto Paccanaro {dagger}, Valery Trifonov 1,{dagger} and Mark Gerstein 1,*

Department of Molecular Biophysics and Biochemistry 266 Whitney Avenue Yale University PO Box 208114
1Department of Computer Science, Yale University New Haven, CT 06520-8285, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSION
 REFERENCES
 

Datasets obtained by large-scale, high-throughput methods for detecting protein–protein interactions typically suffer from a relatively high level of noise. We describe a novel method for improving the quality of these datasets by predicting missed protein–protein interactions, using only the topology of the protein interaction network observed by the large-scale experiment. The central idea of the method is to search the protein interaction network for defective cliques (nearly complete complexes of pairwise interacting proteins), and predict the interactions that complete them. We formulate an algorithm for applying this method to large-scale networks, and show that in practice it is efficient and has good predictive performance. More information can be found on our website http://topnet.gersteinlab.org/clique/

Contact: Mark.Gerstein{at}yale.edu

Supplementary information: Supplementary Materials are available at Bioinformatics online.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSION
 REFERENCES
 
A fundamental problem in modern biology is the identification of the complete set of interactions among the proteins in a cell (Jansen et al., 2003; Marcotte et al., 1999; Goldberg and Roth, 2003). Different experimental methods are available to identify such interactions, and they can be roughly divided into two main categories: small-scale (low-throughput) and large-scale (high-throughput) techniques. Given a set of proteins, small-scale techniques such as co-IP determine the interaction between one pair of proteins at a time (Bader et al., 2003; Mewes et al., 2002; Xenarios et al., 2002; Xia et al., 2004). On the other hand, large-scale techniques, e.g. yeast two-hybrid and TAP-tagging, allow identifying a large number of interacting pairs in a single experiment (Gavin et al., 2002; Ho et al., 2002; Ito et al., 2000; Uetz et al., 2000).

With the advent of genome-wide analysis, we are interested in the identification of the interaction among a great number of proteins (even of all the proteins in a genome). When the number of proteins is in the thousands, the number of possible interacting pairs is in the millions (Kumar and Snyder, 2002). To discover all these interactions using small-scale experiments becomes very labor-intensive and time-consuming, and in this situation large-scale experiments are preferred.

However, low-throughput experiments allow much more precise identification of the interacting pairs than high-throughput experiments—the latter are known to be more error-prone (Jansen et al., 2002; von Mering et al., 2002).

Two types of errors are possible: the large-scale experiment can wrongly indicate that an interaction exists, i.e. yield a false positive (FP); or it can fail to detect an interaction that actually exists, thus producing a false negative (FN). However, experimentalists would agree that these two types of errors occur with different frequency in large-scale experiments. While FPs have ‘higher visibility’ owing to the relatively small number of true interactions, it is generally observed that experiments allow a higher absolute degree of confidence when an interaction is observed, but a much lower degree when no interaction is detected. In other words, most of the errors (as an absolute count, not relative to the numbers of actual interacting or non-interacting protein pairs) are FNs: it is believed that when no interaction is detected, it is not unlikely that the interaction actually exists, but the experiment has failed to detect it. In support of this observation, Figure 1 shows the differences between the low-throughput and high-throughput experimental data on protein–protein interactions in a subset of 56 proteins of Saccharomyces cerevisiae, for which we were able to obtain complete matrices of experimental results.


Figure 1
View larger version (14K):
[in this window]
[in a new window]
 
Fig. 1 A graphical representation of the symmetric matrix of the differences between complete protein–protein interaction data obtained in small-scale and large-scale experiments on 56 proteins of S.cerevisiae (only the upper triangle is shown). There is a colored box for each cell in the matrix indicating the type of the interaction between proteins i and j. White boxes indicate interactions observed in small-scale but not in large-scale experiments (FNs); black boxes stand for interactions observed in large-scale but not in small-scale experiments (FPs); gray boxes show protein pairs for which both the small- and the large-scale experiments produced the same result. The number of FNs exceeds the number of FPs by an order of magnitude.

 
The results of the two types of experiments were the same for 1033 of the 1596 pairs of proteins (including possible self-interactions); of the 563 cases when the results were different, 521 (92.5%) were FNs and 42 (7.5%) were FPs. Ideally, we would like to have a computational method which would be able to correct many of the errors made by large-scale interaction experiments. In this paper we propose a new method, based purely on topological properties of graphs representing protein interaction networks, that attempts to detect those interactions that have been missed by large-scale experiments. Our algorithm searches for defective cliques in these graphs and predicts the interactions which complete them to full cliques.

The basic idea of the algorithm derives from the way in which large-scale experiments are carried out, and particularly from the matrix model interpretation of their results (Bader and Hogue, 2002; Gavin et al., 2002; Ho et al., 2002; Rigaut et al., 1999). In these experiments one protein, the bait, is used to pull out the set of proteins interacting with it, i.e. its protein complex, in the form of a list. When such lists differ only in a few elements, it is reasonable to assume that this is because of experimental errors, and the missing elements should therefore be added. Each list can be represented as a fully connected graph in which proteins occupy the nodes. Then the problem of identifying lists that differ in only a few elements is equivalent to finding a clique with a few missing edges, which we shall call defective clique.

The rest of this paper is organized as follows. In Section 2 we shall introduce some basic notions and give an overview of our method. In Section 2.2 we present a more efficient and practically useful algorithm implementing the method, and in Section 3 we present the results of applying the method to several datasets obtained from experimental observations of the protein interaction network of yeast.


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSION
 REFERENCES
 
Before describing our approach, let us introduce some basic terms. A graph is a pair (V, E) of a set of vertices V and a set of edges E {subseteq} V x V where each edge is a pair of the vertices it connects; if <v1, v2> is in E, then the vertices v1 and v2 are adjacent. In a graph representing a protein interaction network, the vertices are proteins, and the edges are the pairs of interacting proteins.

A clique in a graph is a set K of vertices such that K x K {subseteq} E i.e. each pair of vertices in K is connected by an edge in E. The size of this clique is the number of vertices in it.

As discussed in Section 1, under the matrix model interpretation of the results of large-scale experiments, two proteins interacting with the same protein clusters are likely to interact with each other. Thus in graph-theoretic terms our approach is based on the following observation about protein interaction networks:

(*) If vertices P and Q are both adjacent to each vertex in a clique K, then it is likely that P and Q are adjacent to each other, if they are not adjacent already.

This observation can be depicted as shown in Figure 2A; in this example the size of the clique K is 5. The dashed edge between P and Q corresponds to an interaction which is missing from the experimental data, but which [according to observation (*)] is very likely to occur. We say that P, Q and K form a defective clique KPQ with a missing edge PQ. (Note that a defective clique could in theory have more than one missing edge.)


Figure 2
View larger version (14K):
[in this window]
[in a new window]
 
Fig. 2 Schematic illustrations of a defective clique and how the concept evolved. (A) A defective clique in a protein interaction network. KP and KQ are both (k + 1)-cliques, with k overlapping vertices (i.e. clique K). The dashed edge between proteins P and Q corresponds to a predicted interaction. KPQ is a defective clique with a missing edge PQ. (B) The decomposition of the defective clique (KPQ) into the union of two overlapping cliques (KP and KQ). (C) Generalized defective cliques. In general, a defective clique consists of two cliques: K {cup} KP and K {cup} KQ. There are two parameters to determine a defective clique: k, the size of the overlapping subclique (i.e. K); l, the size of the non-overlapping subcliques (i.e. KP {cup} KQ). In the defective clique K {cup} KP {cup} KQ, the dashed edges between subcliques KP and KQ correspond to predicted interactions.

 
Clearly the size of K plays an important role in determining how likely it is that P and Q interact. For example, if the size of K is 1 (i.e. P and Q both interact with one or more proteins, but those proteins do not interact among themselves), the likelihood of an interaction between P and Q is much smaller than in the case when the size of K is, say, 42. Thus a natural parameter of a prediction algorithm based on observation (*) is the minimal size k of K for which the interaction PQ is predicted.

Another parameter with which we can extend observation (*) is the number of edges missing from the clique when its size is sufficiently large. We will discuss the effects of this parameter in Section 2.2, when we describe our algorithm in detail.

2.1 An algorithm for finding defective cliques
Our definition of a defective clique does not immediately suggest a method for finding such patterns in a protein interaction network. For this purpose it is useful to find an alternative characterization of a defective clique in standard graph-theoretic terms, which will allow us to use some off-the-shelf algorithms.

The main idea of our algorithm is based on the realization that a defective clique KPQ of size n with one missing edge is the union of two (complete) cliques of size n – 1, namely K {cup} {P} and K {cup} {Q}, as shown in Figure 2B. Thus we can reduce the algorithm for finding defective cliques to the following two main steps (which may be repeated until no new edges are added):

Step 1: Find all cliques in the network.
Step 2:
  • Find pairs of cliques overlapping on all but one node each.
  • In each of these pairs predict the edges between the non-overlapping nodes.
  • Add the new edges to the network.

(Note that defective cliques with more than one missing edge could also be determined by applying this recipe.)

However, directly applying this naïve recipe to typical protein interaction networks is unrealistic for the following reason: Since every subset of nodes in a given clique is itself a clique, the number of all cliques in a graph is at least 2q, where q is the size of the largest clique in the graph. For example, the large-scale experimental data for the protein interaction network of S.cerevisiae we used to test our algorithm (see Section 3) contains four cliques of size 38; this yields >1012 cliques (even if we do not consider cliques of size <5, whose number is negligible), hence >1023 pairs of cliques to check in Step 2 of the algorithm. Since this number is prohibitively large, we need a more effective formulation of the algorithm. For this purpose in the next section we design an equivalent algorithm which only considers the maximal cliques in the graph.

2.2 Improving efficiency using maximal cliques
A maximal clique in a graph G is one which is not contained in any other clique in G. In the worst case the problem of finding all maximal cliques still takes time exponential in the size of the graph11; however, if Step 1 is modified to only produce the maximal cliques in the graph, for the reasons discussed in the previous section the output of Step 1 would be reduced by a factor exponential in the size of the largest clique. This would lead to a corresponding reduction (by the square of that factor) of the running time of Step 2 of the algorithm.

In practice, the protein interaction networks are rather sparse [e.g. <15 000 interactions are observed with high confidence in the network of S.cerevisiae, out of over 18 million possible pairs of about 6000 proteins (von Mering et al., 2002)]. Our results show that existing algorithms for finding maximal cliques (Tsukiyama et al., 1977) are very efficient on graphs with this structure. However, if we only compare maximal cliques for overlap on all but one node each, as we did with all cliques in the naïve version, the output of this algorithm will not be the same as that of the naïve version. The reason is that if a defective clique consists of a core clique K and two nodes P and Q, Step 2 of the algorithm will not, in general, attempt to match the cliques K {cup} {P} and K {cup} {Q}, but two maximal cliques they are contained in, say K {cup} KP and K {cup} KQ (note that KP and KQ always exist, but are not necessarily unique). However, KP will in general contain other nodes in addition to P, and these nodes might not all appear in KQ. As a result, the non-overlapping parts KP and KQ of the maximal cliques will consist of more than one node each, and Step 2 of the naïve algorithm will fail to predict the edge PQ.

Hence, to obtain the same results as with our original algorithm, we have to modify Step 2 of the algorithm to look for partial overlaps of maximal cliques which differ in more than one node. This leads us to a generalization of the notion of a defective clique, shown in Figure 2C. To obtain the same result as in the original approach, any pair of nodes Pi and Qi, belonging to the two non-overlapping components KP and KQ respectively, must be predicted as interacting, because the original algorithm would have predicted it (since it completes the defective non-maximal clique KPiQi). The maximal size l of non-overlapping subcliques KP and KQ is a parameter of the algorithm.

Thus one round of the algorithm we use in our experiments becomes

Step 1: Find all maximal cliques in the network.
Step 2:
  • For each pair of maximal cliques, overlapping on at least k nodes and with non-overlapping components of at most l nodes each, predict the edges between all pairs of nodes between the two non-overlapping components.
  • Add the new edges to the network.

Since even the number of maximal cliques can be significant (in the hundreds of thousands for some of our experimental datasets), and their sizes can be in the hundreds of nodes, the number of comparisons between nodes in pairs of cliques in Step 2 can still be formidable in practice. We further reduce the time complexity of Step 2 by organizing the cliques (represented as strings sorted by node index) in a suffix tree. This structure allows us to reuse some comparison results among cliques sharing a common prefix of nodes.

Step 1 of the algorithm has an upper bound of O(nmµ) for its time complexity (Tsukiyama et al., 1977), where n is the number of nodes, m, the number of edges, and µ, the number of maximal cliques in the graph. This implies an upper bound of O(nmµ + µ2) for the time complexity of one round of the algorithm (Step 1 followed by Step 2). In our tests the running time was indeed dominated by the time spent in Step 2.


    3 RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSION
 REFERENCES
 
We tested our method on two datasets of protein–protein interactions in S.cerevisiae, obtained from large-scale experiments (Bader and Hogue, 2002; Yu et al., 2004). In both cases we compared its predictions with a ‘gold standard’ set of protein pairs, known with high degree of confidence to be ‘positive’ [interacting—protein pairs in the same protein complex determined by the MIPS complex catalog (Mewes et al., 2002)] or ‘negative’ [non-interacting—protein pairs with different sub-cellular localizations (Kumar et al., 2002)], as published in Jansen et al. (2003). Here, the gold standard positives are a collection of small-scale experiment results (Mewes et al., 2002).

Since neither the large-scale experimental datasets nor the gold standard set are complete (i.e. there are protein pairs for which no experiment has been performed), the question of how to treat missing information arises. We took a conservative approach, assuming that no information indicates no interaction. Since the set of new interactions, predicted by the algorithm, does not decrease with the addition of edges to the input data, our results represent a lower bound on the predictions that would be made with less missing information.

3.1 Performance on a complete dataset
To illustrate the method on a small example, in which we can accurately assess its performance, we considered a sub-graph of the protein interaction network of S.cerevisiae for a set of 43 proteins, for which the gold standard is complete (for each pair of proteins it is known if they interact or not, i.e. there is no missing information). Here, we used the large-scale protein interaction network obtained by Yu et al. (2004). The graph of the gold standard on this subset of proteins consists of four components, all of them cliques; we will refer to them as to G1 through G4, as defined in Figure 3A. The graph of the large-scale experimental dataset consists of 6 maximal cliques named E1 through E6, of size at least 2, plus 15 singleton nodes (cliques of size 1), shown in Figure 3B and C, where the data is presented in the form obtained after running Step 1 of the algorithm. Note that all elements of clique E2 except for MRPL38 appear also in clique E1 (protein names in bold; Figure 3B), and that the pairs of cliques E3E4 and E4E5 each share a node.


Figure 3
View larger version (39K):
[in this window]
[in a new window]
 
Fig. 3 Subset of the gold standard set without missing information and the corresponding large-scale interaction sub-network. (A) The gold standard. There are four maximal cliques in the gold standard set. Please see Supplementary Figure 1 for the network view of these four cliques. (B) The large-scale experimental data. There are six maximal cliques and 15 singleton nodes in the large-scale experimental data. (C) Network view of the experimental data in (B), excluding the singletons.

 
Applying Step 2 of the algorithm with parameters k = 6 and l = 17 to the experimental data finds the partial overlap between cliques E1 and E2, and predicts the interactions between MRPL38 and all nodes in E1 which are not in E2. After these edges are added, the cliques E1 and E2 are merged into one clique of size 24, which is a subset of the gold standard clique G1, missing only the protein MRPL49 (in the experimental dataset this node is a singleton, so the clique completion is unable to recover its interactions). These are the only new interactions the algorithm predicts. All of them are positive in the gold standard for this set of proteins, therefore all of the predictions in this case are correct. G1, E1 and E2 consist of Mitochondrial ribosomal proteins.

3.2 Performance on the available S.cerevisiae dataset
We applied the clique completion method to a large-scale experimental dataset of the protein interaction network of S.cerevisiae obtained by Bader and Hogue, (2002). Unlike the smaller set analyzed in the previous section, the gold standard for the proteins in this dataset is incomplete (as is the dataset itself).

The initial graph contains 6645 edges between 2283 nodes. In this graph Step 1 of the algorithm found 4934 maximal cliques. Step 2 of the algorithm, configured to search for partial overlaps of size at least k = 4 and non-overlapping parts of size at most l = 3, predicted 437 new interactions. Adding these interactions reduces the number of maximal cliques by 276, showing consolidation of smaller complexes into larger ones, as expected.

As a criterion of the effectiveness of the algorithm we used the likelihood ratio of the predicted interactions, defined in Jansen et al. (2003) as

Formula
where P+ is the number of true positives—predicted interactions which are positive in the gold standard; P is the number of FPs—predicted interactions which are negative in the gold standard; G+ is the total number of positive pairs in the gold standard and G is the total number of negative pairs in the gold standard.

Higher values of L correspond to sets of predictions having higher overlap with the positive and/or lower overlap with the negative gold standard, and generally indicate better predictors. One of the advantages of calculating L is that it naturally takes into account the biased sampling between positives and negative, which is often the case for biological data (see Supplementary Materials).

The gold standard set contained G+ = 8250 positive and G = 2 708 622 negative pairs when restricted to the proteins in this experimental dataset. Of the 437 interactions predicted by the method in this test, 94 were in the gold standard set; of them 73 were positive (P-values < 10–10; see Supplementary Materials) and 21 negative, which yields a likelihood ratio of 1141.3, significantly higher than the likelihood ratios of other single features reported in Jansen et al. (2003) (essentiality, expression correlation, MIPS function and GO biological process), which are below 400.

The values of the parameters chosen in our test are in a ‘plateau’ of relative stability of the results. In a wider spectrum of parameter values, the likelihood ratio of the predicted set was between 59.13 and 3720.94 when varying the parameters of the algorithm as follows: k (the minimal overlap size) between 4 and 7, and l (the maximal size of the non-overlapping parts) between 1 and 20; the number of predicted interactions was between 12 and 8993. The average running time was <4 s on a desktop machine. We also calculated the receiver operating characteristic (ROC) curve, and compared our method with the four available large-scale yeast interaction experiments. Our method out-performs all of them (Fig. 4).


Figure 4
View larger version (15K):
[in this window]
[in a new window]
 
Fig. 4 The trade-off between detection rate and error rate for different values of k and l to evaluate the performance of our defective clique method. The curve is also known as the ROC curve (Egan, 1975). The inset highlights the lower left corner of the ROC curve to show the comparison between our method and the four large-scale experimental datasets.

 
Another parameter indicating the quality of the prediction is the functional enrichment in the predicted interactions, i.e. the ratio of the frequency of functionally similar pairs among the predictions to the expected frequency in the yeast genome (see Supplementary Materials). (Note that functional similarity is not a feature taken into account when constructing the input set or predicting the new interactions.)

The distribution of the likelihood ratio and functional enrichment of the predicted edges as a function of the maximal size of a defective clique they complete is shown in Figure 5A. They show that even for small sizes of the overlap the predicted edges are much more likely to be in the positive than in the negative gold standard, and are significantly more likely to be functionally similar than the average interacting pair.


Figure 5
View larger version (20K):
[in this window]
[in a new window]
 
Fig. 5 (A) Distribution of the likelihood ratio L of predicted edges as a function of the maximal size of a defective clique they complete. (B) Comparison with the predictions of the RNSC algorithm on three of the datasets published in King et al. (2004).

 
Taking into account the size of the predicted set and the fact that the predictions were made only on the basis of the topology of the input set, we believe the high value of these measures is a strong argument for the usefulness of this method as a predictor of new interactions.

3.3 Biological examples
With the addition of the 437 predicted interactions, we were able to discover many protein complexes that are not present in the initial network. For example, Casein kinase II complex is composed of two catalytically active subunits (CKA1 and CKA2) and two regulatory subunits (CKB1 and CKB2). It is involved in regulating cell growth and proliferation (Ackermann et al., 2001). However, based on the original large-scale interaction experiments, the interaction between CKA2 and CKB1 is missing. Therefore, the whole complex could not be determined as a fully connected clique. We were only able to discover two three-cliques: {CKA1, CKA2, CKB2} and {CKA1, CKB1, CKB2}. Only after our defective clique procedure, CKA2 and CKB1 was predicted to be connected and the whole complex became a four-clique (Fig. 6A).


Figure 6
View larger version (21K):
[in this window]
[in a new window]
 
Fig. 6 Two biological examples of protein complexes that can only be discovered by our defective clique method. (A) Casein Kinase II complex consisting of fours proteins. (B) Exosome complex consisting of seven proteins.

 
Another good example is the exosome complex, consisting of seven proteins (Fig. 6B). It is involved in RNA processing (Mewes et al., 2002; Mitchell et al., 1997). In the original large-scale interaction network, RRP43, RRP4 and RRP42 are disconnected. Therefore, the whole complex is divided into three five-cliques: {RRP42, RRP46, SKI6, DIS3, RRP45}, {RRP43, RRP46, SKI6, DIS3, RRP45} and {RRP4, RRP46, SKI6, DIS3, RRP45}. Our defective clique procedure successfully predicted the interactions among RRP43, RRP4 and RRP42. The complex, thus, became a seven-clique as described in the MIPS complex catalog (Mewes et al., 2002).

3.4 Comparison with related work
King et al. have designed the Restricted Neighborhood Search Clustering (RNSC) algorithm, which partitions proteins into clusters depending on their interactions (King et al., 2004). The RNSC algorithm can also be viewed as a method for predicting new interactions, if we consider all pairs of proteins in a predicted cluster to be interacting. We compared the predictions of RNSC with those of our algorithm on the datasets published in King et al. (2004), which are based on the data of von Mering et al. (2002); the results are shown in Figure 5B. Since the two algorithms represent very different approaches to discover clusters, the overlap of their predictions is noteworthy.

Bader and Hogue also proposed the Molecular Complex Detection (MCODE) algorithm to discover protein complexes, which can be viewed as another way to predict new interactions (Bader and Hogue, 2003). The method essentially looks for k-cores in the network. A k-core is a sub-graph G of n (n ≥ k) vertices with minimal degree k [in G, degree (v) ≥ k for every v {subseteq} G]. By definition, all defective cliques determined with the parameters k and l are at least (k + 1)-cores, i.e. results from our method will be a subset of the MCODE method. Therefore, our predictions are much more stringent than theirs, whereas the MCODE method could potentially discover more interactions.


    4 CONCLUSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSION
 REFERENCES
 
We presented a method for predicting new protein–protein interactions, based purely on topological properties of networks of observed interactions. Comparing the results with the gold standard set and functional annotations confirmed that it is a very good predictor. While computationally expensive, we believe the method has the advantage of being more robust by virtue of its independence of non-topological features such as functional classification.


    Acknowledgments
 
MG acknowledges support from the NIH grant 5P50GM062413-03. Funding to pay the Open Access publication charges for this article was provided by NIH.

Conflict of Interest: none declared.


    FOOTNOTES
 
{dagger}The authors wish it to be known that, in their opinion, the first three authors should be regarded as joint First Authors. Back

Associate Editor: Thomas Lengauer

1More precisely, the problem is NP-complete, i.e. only exponential-time algorithms for solving it are known. Back

Received on April 14, 2005; revised on January 9, 2006; accepted on January 20, 2006

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSION
 REFERENCES
 

    Ackermann, K., et al. (2001) Genes targeted by protein kinase CK2: a genome-wide expression array analysis in yeast. Mol. Cell. Biochem, . 227, 59–66[CrossRef][Medline].

    Bader, G.D. and Hogue, C.W. (2002) Analyzing yeast protein–protein interaction data obtained from different sources. Nat. Biotechnol, . 20, 991–997[CrossRef][Web of Science][Medline].

    Bader, G.D. and Hogue, C.W. (2003) An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics, 4, 2[CrossRef][Medline].

    Bader, G.D., et al. (2003) BIND: the Biomolecular Interaction Network Database. Nucleic Acids Res, . 31, 248–250[Abstract/Free Full Text].

    Egan, J.P. Signal Detection Theory and ROC-Analysis, (1975) , New York Academic Press.

    Gavin, A.C., et al. (2002) Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415, 141–147[CrossRef][Medline].

    Goldberg, D.S. and Roth, F.R. (2003) Assessing experimentally derived interaction in a small world. Proc. Natl Acad. Sci. USA, 100, 4372–4376[Abstract/Free Full Text].

    Ho, Y., et al. (2002) Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature, 415, 180–183[CrossRef][Medline].

    Ito, T., et al. (2000) Toward a protein–protein interaction map of the budding yeast: a comprehensive system to examine two-hybrid interactions in all possible combinations between the yeast proteins. Proc. Natl Acad. Sci. USA, 97, 1143–1147[Abstract/Free Full Text].

    Jansen, R., et al. (2002) Integration of genomic datasets to predict protein complexes in yeast. J. Struct. Funct. Genomics, 2, 71–81[CrossRef][Medline].

    Jansen, R., et al. (2003) A Bayesian networks approach for predicting protein–protein interactions from genomic data. Science, 302, 449–453[Abstract/Free Full Text].

    King, A.D., et al. (2004) Protein complex prediction via cost-based clustering. Bioinformatics, 20, 3013–3020[Abstract/Free Full Text].

    Kumar, A. and Snyder, M. (2002) Protein complexes take the bait. Nature, 415, 123–124[CrossRef][Medline].

    Kumar, A., et al. (2002) Subcellular localization of the yeast proteome. Genes Dev, . 16, 707–719[Abstract/Free Full Text].

    Marcotte, E.M., et al. (1999) Detecting protein function and protein–protein interactions from genome sequences. Science, 285, 751–753[Abstract/Free Full Text].

    Mewes, H.W., et al. (2002) MIPS: a database for genomes and protein sequences. Nucleic Acids Res, . 30, 31–34[Abstract/Free Full Text].

    Mitchell, P., et al. (1997) The exosome: a conserved eukaryotic RNA processing complex containing multiple 3'–>5' exoribonucleases. Cell, 91, 457–466[CrossRef][Web of Science][Medline].

    Pellegrini, M., et al. (1999) Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc. Natl Acad. Sci. USA, 96, 4285–4288[Abstract/Free Full Text].

    Rigaut, G., et al. (1999) A generic protein purification method for protein complex characterization and proteome exploration. Nat. Biotechnol, . 17, 1030–1032[CrossRef][Web of Science][Medline].

    Tsukiyama, S., et al. (1977) A new algorithm for generating all the maximal independent sets. SIAM J. Comput, . 6, 505–517[CrossRef].

    Uetz, P., et al. (2000) A comprehensive analysis of protein–protein interactions in Saccharomyces cerevisiae. Nature, 403, 623–627[CrossRef][Medline].

    von Mering, C., Krause, R., Snel, B., Cornell, M., Oliver, S.G., Fields, S., Bork, P. (2002) Comparative assessment of large-scale data sets of protein–protein interactions. Nature, 417, 399–403[Medline].

    Xenarios, I., et al. (2002) DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res, . 30, 303–305[Abstract/Free Full Text].

    Xia, Y., et al. (2004) Analyzing cellular biochemistry in terms of molecular networks. Annu. Rev. Biochem, . 73, 1051–1087[CrossRef][Web of Science][Medline].

    Yu, H., et al. (2004) TopNet: a tool for comparing biological sub-networks, correlating protein properties with topological statistics. Nucleic Acids Res, . 32, 328–337[Abstract/Free Full Text].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Sci SignalHome page
R. P. Alexander, P. M. Kim, T. Emonet, and M. B. Gerstein
Understanding Modularity in Molecular Networks Requires Dynamics
Sci. Signal., July 28, 2009; 2(81): pe44 - pe44.
[Abstract] [Full Text] [PDF]


Home page
Mol. Cell. ProteomicsHome page
H. Wang, B. Kakaradov, S. R. Collins, L. Karotki, D. Fiedler, M. Shales, K. M. Shokat, T. C. Walther, N. J. Krogan, and D. Koller
A Complex-based Reconstruction of the Saccharomyces cerevisiae Interactome
Mol. Cell. Proteomics, June 1, 2009; 8(6): 1361 - 1381.
[Abstract] [Full Text] [PDF]


Home page
Brief BioinformHome page
H. Chen, L. Ding, Z. Wu, T. Yu, L. Dhanapalan, and J. Y. Chen
Semantic web for integrated network analysis in biomedicine
Brief Bioinform, March 1, 2009; 10(2): 177 - 192.
[Abstract] [Full Text] [PDF]


Home page
J. Biol. Chem.Home page
A. Ma'ayan
Insights into the Organization of Biochemical Regulatory Networks Using Graph Theory Analyses
J. Biol. Chem., February 27, 2009; 284(9): 5451 - 5455.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
M. Zampieri, N. Soranzo, and C. Altafini
Discerning static and causal interactions in genome-wide reverse engineering problems
Bioinformatics, July 1, 2008; 24(13): 1510 - 1515.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
B. Zhang, B.-H. Park, T. Karpinets, and N. F. Samatova
From pull-down data to protein interaction networks and complexes with biological relevance
Bioinformatics, April 1, 2008; 24(7): 979 - 986.
[Abstract] [Full Text] [PDF]


Home page
Proc. Natl. Acad. Sci. USAHome page
A. Presser, M. B. Elowitz, M. Kellis, and R. Kishony
The evolutionary dynamics of the Saccharomyces cerevisiae protein interaction network after duplication
PNAS, January 22, 2008; 105(3): 950 - 954.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
K. Y. Yip, H. Yu, P. M. Kim, M. Schultz, and M. Gerstein
The tYNA platform for comparative interactomics: a web tool for managing, comparing and mining multiple networks
Bioinformatics, December 1, 2006; 22(23): 2968 - 2970.
[Abstract] [Full Text] [PDF]


Home page
Brief BioinformHome page
T. Aittokallio and B. Schwikowski
Graph-based methods for analysing networks in cell biology
Brief Bioinform, September 1, 2006; 7(3): 243 - 255.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrowOA All Versions of this Article:
22/7/823    most recent
btl014v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (21)
Google Scholar
Right arrow Articles by Yu, H.
Right arrow Articles by Gerstein, M.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Yu, H.
Right arrow Articles by Gerstein, M.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?