Skip Navigation


Bioinformatics Advance Access originally published online on September 17, 2004
Bioinformatics 2005 21(2):239-247; doi:10.1093/bioinformatics/bth491
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/2/239    most recent
bth491v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (7)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Leone, M.
Right arrow Articles by Pagnani, A.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Leone, M.
Right arrow Articles by Pagnani, A.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

Bioinformatics vol. 21 issue 2 © Oxford University Press 2005; all rights reserved.

Predicting protein functions with message passing algorithms

Michele Leone * and Andrea Pagnani

Institute for Scientific Interchange (ISI) Viale Settimio Severo 65, Turin, I-10133, Italy

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 INTRODUCTION
 METHODS
 DATA AND GRAPH ANALYSIS
 RESULTS
 DISCUSSION
 REFERENCES
 

Motivation: In the last few years, a growing interest in biology has been shifting toward the problem of optimal information extraction from the huge amount of data generated via large-scale and high-throughput techniques. One of the most relevant issues has recently emerged that of correctly and reliably predicting the functions of a given protein with that of functions exploiting information coming from the whole network of proteins physically interacting with the functionally undetermined one. In the present work, we will refer to an ‘observed’ protein as the one present in the protein–protein interaction networks published in the literature.

Methods: The method proposed in this paper is based on a message passing algorithm known as Belief Propagation, which accepts the network of protein's physical interactions and a catalog of known protein's functions as input, and returns the probabilities for each unclassified protein of having one chosen function. The implementation of the algorithm allows for fast online analysis, and can easily be generalized into more complex graph topologies taking into account hypergraphs, i.e. complexes of more than two interacting proteins.

Results: Benchmarks of our method are the two Saccharomyces cerevisiae protein–protein interaction networks and the Database of Interacting Proteins. The validity of our approach is successfully tested against other available techniques.

Contact: leone{at}isiosf.isi.it

Supplementary information: http://isiosf.isi.it/~pagnani


    INTRODUCTION
 TOP
 Abstract
 INTRODUCTION
 METHODS
 DATA AND GRAPH ANALYSIS
 RESULTS
 DISCUSSION
 REFERENCES
 
The most classical protein function prediction methods are those inferring similarity to function from sequence homologies (Hodgman, 2000) between proteins listed in the databases using programs, such as FASTA (Pearson and Lipman, 1988) and BLAST (Altschul et al., 1997); by comparison with known protein interactions in similar genomes [the so-called Rosetta Stone Method (Marcotte et al., 1999b)] or by phylogenetic analysis (Marcotte et al., 1999a). Most recently, a new class of method has been proposed that relies on the available data on the global structure of the PPI networks for a growing number of organisms of completely sequenced genome (Uetz et al., 2000; Ito et al., 2001; Giot et al., 2003). The most complete available online data are structured in a graph-like format, with graph sites indexed with protein names and links representing a physically experimentally tested interaction between two proteins. More limited databases on larger protein complexes are also available (Gavin et al., 2002; Ho et al., 2002). From the side of functional classification, databases or more complex dynamic classification schemes are now available (MIPS1 and Gene Ontology among others2), which provide a classification for the continuously growing number of proteins, listing them in different functional categories classes with a hierarchical like organization. Currently available methods that try to exploit the global PPI network structure to infer yet unknown functions for the unclassified proteins whose interactions with the rest of the graph are at least partially known are the so-called Neighboring Counting Method (Schwikowski et al., 2000), the {chi}2-method (Hishigaki et al., 2001), the Bayesian approaches (Deng et al., 2002; Letovsky and Kasif, 2003), the Redundancy Method (Samanta and Liang, 2003) and a more recent Monte Carlo simulated annealing (SA) approach (Vazquez et al., 2003).


    METHODS
 TOP
 Abstract
 INTRODUCTION
 METHODS
 DATA AND GRAPH ANALYSIS
 RESULTS
 DISCUSSION
 REFERENCES
 
Let us name G a PPI graph, with a set of vertexes V = {1,...,N} representing the observed proteins, each protein name being assigned a numerical value form 1 to N. Let us also define a mapping between the set of all observed functions and the numbered set F = {1,...,F}. Each protein i belonging to V can then be characterized with a discrete variable X i that can take values f F. One would like to compute the probability P i (f) = Pr(X i = f) for each protein to obtain a given function f provided that the functions are assigned to the proteins in the rest of the graph. The method is based on the definition of a score function E on the PPI graph [Equation (1)] that counts the number of all common predicted functions among the neighboring proteins of the graph over all interactions. In addition to this, a certain fraction of the proteins is already classified, which means that there exists a subset A V of vertexes with at least one function belonging to F attached to it (Fig. 1a for an example of a graph portion). The effect of the already classified proteins with a function in the neighborhood of protein i on the PPI network is taken into account as an external field acting on i and proportional to the number of neighbors belonging to A with that given function. From this score function, a variational potential (called Gibbs potential) can be defined, which measures the distance between the true unknown function probabilities and performs a trial estimation on them. The values of the best estimated probabilities are found extremizing the Gibbs potential (Yedidia et al., 2003; Yeang and Jaakkola, 2003). The Gibbs potential extremizing equations used in this work are commonly known under the name of Belief Propagation (BP) equations and can easily be found by a procedure called Cavity Method (Mèzard and Parisi, 2001). We have solved the BP equations both for the probabilities of a completely unclassified proteins belonging to V\A and for the more complete model where we let a protein belonging to A the possibility of having other yet unknown functions. The modifications applied to the method are technically minor and hence they will not be described here. Given a choice of initial conditions on probability functions {P i (X i )} i=1,...,N and a choice of the score function E, the algorithm calculates the stationary probabilities whose values extremize the resulting Gibbs potential. The potential in general depends on one free real parameter ß that plays the role of an inverse temperature and weights the possibility of allowing functional assignments that do not exactly maximize the score, but could still be possible due to their large degeneracy: at low enough values of ß (high temperature) almost any function assignation to proteins in V\A gives an equivalent value to the potential. In this region the system is said to be in a ‘paramagnetic phase’. Every functional assignment is therefore accepted and the algorithm is not predictive. After a certain critical value ß c the shape of the Gibbs potential changes: only some values of the probability functions extremize it. Augmenting ß, the algorithm tends to weight more and more of those functional assignments that exactly maximize the score. Strictly at zero temperature (ß -> {infty}) only the score maximizing functional assignments survive with non-zero probability. Given sets V, A and V\A, the PPI graph G, the graph of unclassified proteins U G and the set of observed function F, a score function can be defined following Vazquez et al. (2003) as follows:



View larger version (22K):
[in this window]
[in a new window]
 
Fig. 1 From G to the BP equations: (a) A small fraction of U network G. Circles represent proteins with their numerical ID used by the algorithm. Classified proteins are filled with dots, while unclassified ones are left open. Each classified protein has a series of functions whose numerical values F are written in boxes. In (b) only the corresponding part of the U subgraph has been drawn. Dotted arrows represent the external fields acting on the unclassified proteins; and the vectors whose non-zero components are defined in the lower boxes. For each protein in U, they count the number of classified neighbors having a given function. Upper boxes are sets of all functions of all the classified proteins neighboring a given unclassified one. Thick arrows represent ‘messages’ among unclassified proteins according to Equation (4). Notice how U is significantly less connected than G and often divides in smaller connected components. (c) A more detailed representation of the message passing between proteins i and j, in direction i -> j.

 

where J ij is the adjacency matrix of U (J ij = 1 if i and j V\A and they interact with each other). {delta}({sigma}; {tau}) is the Kronecker delta function measured between functions {sigma} and {tau} assigned to the neighboring proteins and h i ({tau}) is an external field that counts the number of classified neighbors of protein i in the original graph G that have at least function {tau}. The Gibbs potential can then be calculated as a variational way to compute the quantity


called as free-energy of the system, a fundamental quantity in statistical physics that counts the logarithm of the sum of all the weights of the probabilities each configuration of the variables in the systems appear with. Configurations with a largest statistical weight can then be calculated as those maximizing this potential function. Using the message passing approach (Yedidia et al., 2003; Mèzard and Parisi, 2001) under the assumption that correlations are low enough in the graph hence one can write P ij (X i ,X j ) {vprop} P i (X i )P j (X j ) if proteins i and j are chosen at random, and one can calculate each P j (X j ) as a product of conditional probabilities contributions M i->j (X j ) incoming to j from all neighbors of protein j, conditional to the fact that j has a function X j :


where I(j) V\A denotes the set of unclassified neighbors of j and u i->j ({sigma}) is a ‘message’ that represents the field in direction {sigma} F acting on protein j due to the presence of protein i when protein j has function {sigma}. Equations for the message functions can be solved iteratively as fixed points of the system of equations


one for each link of U, for both directions in the graph. Self-consistent BP equations can be rewritten in terms of messages, u's. The ones explicitly used in our algorithm are shown in the following:


where



and


The BP algorithm has been written in terms of that equation and solved at any ß with a population dynamic technique (Mèzard and Parisi, 2001). In general, all previously described quantities depend on the inverse temperature ß. Equation (3) turns out to be a good approximation of the solution for the problem of finding the probabilities for configurations maximizing (2). A pictorial representation of the iteration procedure is shown in Figure 1c.


    DATA AND GRAPH ANALYSIS
 TOP
 Abstract
 INTRODUCTION
 METHODS
 DATA AND GRAPH ANALYSIS
 RESULTS
 DISCUSSION
 REFERENCES
 
As benchmarks for the method, we have used two yeast Saccharomyces cerevisiae PPI graphs, referred in the following as U (Uetz et al., 2000) and D (Xenarios et al., 2000), respectively. The functional categories set F was extracted from the MIPS database. The U network contains N = 1826 proteins out of which 1370 belong to A, while the remaining 456 are unclassified or have an unclear MIPS classification; and M = 2238 pairwise interactions. The D network contains N = 4713 proteins out of which 3303 belong to A and 1410 are unclassified; and M = 14 846 interactions.

The two graphs differ substantially in size, but a finer analysis indeed showed that the U graph is almost entirely contained into the D graph. Quantitatively it turns out that the two graphs shares 1779 proteins, and that there are 2177 edges of the U graph joining proteins shared with the D graph (to be confronted to the 2238 edges of the original U graph). The two graphs shares 1885 edges, which means that >80% of the U graph is actually contained into the D graph. The choice of two PPI networks of different size coming from different experimental sources was made for the following reasons: (1) We initially used U PPI network in order to directly compare our results with those of Vazquez et al. (2003) and Schwikowski et al. (2000); however, being aware of the low reliability of the U PPI interactions, we decided to extend our analysis to a more reliable dataset. (2) The necessity to test the algorithm performance on a larger graph with a larger number of functionally undetermined proteins. Nevertheless, we want to stress again that the main focus of this paper is methodological, i.e. it is the description of the data analysis algorithm and not any assessment on the actual validity of the starting datasets.

Different choices of functional categories sets F are possible in the MIPS database depending on the level of the coarse-grained specification of the hierarchical classification scheme. We used the latest publicly available finest classification scheme retrieving F = 165 functional categories present in U and F = 176 in D, but experiments were also run on the most coarse-grained classification scheme. The results are available upon request.

The U PPI graph consists of a giant component of 1299 proteins (990 classified), and the rest of the proteins are grouped into 184 smaller isolated components of at most 13 proteins. We have also analyzed the structure of the U graph, which turns out to be 456 proteins, grouped into 309 isolated components (clusters) of size at most 27. Each cluster in U can be considered as an isolated functional island of the graph surrounded by external fields. More details on the cluster composition of both G and U for the U PPI network are shown in Table 1. One may also wonder that these clusters are more than a topological feature of our model, but also reflect a more interesting functional segregation. In other words, one is interested to understand in quantitative terms how different clusters in V\A label different functional areas in our graphs. Toward this aim, we measured intercluster and intracluster functional overlaps as in Equations (6) and (7). Both observables take value in the interval (0,1) and give a measure of the functional similarity of clusters (higher values indicate higher similarity). The emerging scenario showed clear signs of segregation since the intracluster overlap distributions has support in the interval (0,0.1), while the intercluster distribution has support in the whole interval (0,1). This test can be interpreted as a coherence test on the graph itself, and also on the working hypothesis of our method, since segregation is tacitly assumed in the functional form of score function where only first neighbors interactions on the graph are taken into account. Let us define the notion of intracluster and intercluster functional overlap as follows:


View this table:
[in this window]
[in a new window]
 
Table 1 Cluster size (cs) and number of clusters (noc) for both the original graph G (the two leftmost columns) and the graph of the unknown proteins U (the two rightmost columns)

 


where index i labels the different clusters C i and run between 1 and the total number of clusters C, N i is the number of proteins in the cluster C i , {phi}(s l ,s k ) counts the numbers of functions that proteins l and k have in common, and {Phi}(C i ) is the number of different functions acting onto cluster C i , while {Phi}(C i {cup} C j ) is the number of different functions acting onto the union set C i {cup}C j . It is interesting to note that according to Equations (6) and (7), both o i and O ij have taken real values in the interval (0,1). We can consider the probability densities of the two variables O i and o ij as follows:



where {delta}(a, b) is the Kronecker delta function equal to 1 when a = b and 0 otherwise. It is interesting at this point to compare the average intracluster overlap <O> = 0.440673 with the average intercluster overlap <o> = 0.0294147 which is a factor 15 smaller and that can be taken as a quantitative measure of the functional segregation on the PPI graph.

We then define the cumulative distribution functions as follows:



The two cumulative functions are displayed in Figure 2. The algorithm can be run separately and in parallel on all connected components of U, because there is no exchange of information between them. Equivalently speaking, the score function can be written as a sum over all components c of separated scores: E = {sum} c E c ({X i } ic ).



View larger version (12K):
[in this window]
[in a new window]
 
Fig. 2 Cumulative probability distribution of intracluster/ intercluster functional overlap as defined in Equations (6) and (7) for the U U graph. Since the intracluster overlap turns out to be always <0.085, the intracluster cumulative probability distribution (solid line) saturates to 1 above this value. The intercluster overlap instead shows a clear sign of segregation. Note that the sudden jump at 1 for the dashed line is due to a significant fraction of clusters (84 out of 309) with functional overlap strictly equal to 1.

 

    RESULTS
 TOP
 Abstract
 INTRODUCTION
 METHODS
 DATA AND GRAPH ANALYSIS
 RESULTS
 DISCUSSION
 REFERENCES
 
We have run our algorithm solving Equations (3) and (4) at several values of ß > ß c and for different choices of initial conditions of populations {P i (X i )} i=1,...,N . The results are always very stable with respect to the initial conditions. Instead of maximizing the Gibbs potential directly at zero temperature, we have worked at finite ß because we were also interested in predicting functional assignments that could be biologically allowed although not strictly maximizing the score function (1). Above ß c , the function probabilities for each protein converge on a set of values organized in hierarchies. The probability values are ß-dependent, but not the hierarchical structure (Fig. 5). All the results presented in the following are therefore considered at a single given high value of ß (ß = 2 for the D PPI graph and ß = 10 for the U PPI graph). Different values of ß were chosen because if larger the graph, the heavier is the numerical effort increasing ß. Moreover, low-ranked function probabilities acquire very low numerical values at large ß, making the numerical analysis for the rank separation more delicate. Nevertheless, as already stated, the hierarchical rank structure is not dependent on ß (higher value results were checked for the D graph as well), so a lower value was taken for the larger graph. For any protein i and the component c containing i, we have filtered out all the background noise probability values for functions that are not present in c and still have a non-zero contributions due to the form of Equation (3). We have then collected and ranked the remaining functional probabilities, following their emerging hierarchical structure. A list of predicted functions for all the unclassified proteins in the U PPI network using MIPS 2003 functional categories catalog is presented in the Supplementary Table. The rank division is explained in Figure 5. To probe the reliability of our algorithm, we have followed the standard procedures of Vazquez et al. (2003): starting from G and a corresponding MIPS functional annotation, we disregarded the functions of a given fraction d of classified proteins and considered them as unclassified. We have called d ‘dilution’ of G. If a previously classified protein is considered unclassified we say it has been ‘whitened’. With this procedure one obtains a new larger graph U d of unclassified proteins, where the algorithm can be run and its ability of finding again the erased functions can be tested. This testing procedure is very similar but more stringent than the Leave One-Out Method (Deng et al., 2002), because it assumes an extensive fraction of proteins in the graph as unclassified instead of only one each time. We repeated the procedure for both PPI networks. The results for a set of performance parameters are shown in Figure 3 as a function of protein degrees for some fixed dilution values. Figure 3a: Reliability. We have defined as a first reliability parameter F 1 the fraction of whitened proteins for which the algorithm predicts correctly at least one function. Figure 3b measures a second reliability parameter F 2, defined as the fraction of correctly predicted functions out of all functions a whitened protein has on the original graph G. This test is more stringent because it checks the ability of the algorithm for predicting not only one function, but as many as it can. It is worth noting that under the F 2 test the method still performs very well when all non-background noise ranks are considered. Figure 3c: Sharpness. S measures the precision of the method and it is defined as the fraction of the number of correctly predicted functions over the number of all predicted ones. It is intuitive that the sharpness decreases with the number of probability levels (ranks) one accepts as significant. For whitened proteins of degree 5, for instance, on average only 31% of predicted functions belong to the set of already known erased ones, while in the case of best ranks only, this percentage increases to 65–70%. In case, one allows the algorithm to predict protein functions not present in the MIPS database also on the classified proteins in G, the sharpness still decreases. The results as a function of network dilution are shown in Figure 4. When (Fig. 3a) a direct comparison with other methods on the same G and MIPS catalog was possible, the results of our method were systematically better than both the Neighboring Counting Method and the SA. Performance further improves if we consider not only highest rank predictions, but all significant non-background noise probabilities. Together with the other available methods, BP performs worse in predicting functions on leaves of the PPI graph, i.e. on whitened proteins with only one neighbor. Nevertheless, even in this case we observed better reliability results.



View larger version (15K):
[in this window]
[in a new window]
 
Fig. 5 Example of predicted probabilities ranks for protein YDR386W in the MIPS catalog and for the U PPI network. In this example, out of all possible 140 functions only the ones with a vertical bar have non-background noise probabilities. Bar heights (ranks) are proportional to the logarithm of the probability of having a given function for all the functions ordered on the horizontal axis. For the aim of our analysis only the ranks cardinality are counted and not the absolute value of the predictions.

 


View larger version (26K):
[in this window]
[in a new window]
 
Fig. 3 (a) Reliability parameter F 1: fraction of whitened proteins for which the algorithm predicts correctly at least one function. (b) Reliability parameter F 2: fraction of correctly predicted functions over erased ones for whitened proteins in G. (c) Sharpness S: fraction of correctly predicted functions over all predicted ones. F 1, F 2 and S are plotted versus protein degree for different U dilutions. The results are displayed for three dilution levels d 4 = 0.4, d 5 = 0.5 and d 7 = 0.7. Dotted lines are the results considering only functions of higher probabilities (first best rank). Dotted-dashed lines are the results considering both best and second best ranks. Thick solid lines consider all non-background noise ranks. SA and NCM are the Simulated Annealing and the Neighboring Counting Method results for dilution d = 0.4. Note that a low value of Sharpness does not necessarily indicate a poor performance of the algorithm. It could be due to the fact that some functions have not also been observed in already classified proteins and therefore the catalogs are incomplete not only for proteins in U, but on all G.

 


View larger version (24K):
[in this window]
[in a new window]
 
Fig. 4 (a) F 1, (b) F 2 and (c) S versus dilution, averaged over all the PPI network and over n=10 random dilution realizations. Thick lines are the results for the D network and the dotted lines for the U network. For each network, we have again considered 1st best, 1st and 2nd best and not all noise ranks results. The different spacing between lines comparing the two networks reflects their different topological structure. Proteins to be whitened were chosen randomly in A. The procedure was repeated n = 10 times (larger n datasets can be easily produced, but data are already very stable for n = 10) for each d, and the results were averaged. We disregarded the few observed proteins with degree greater than >8 as statistically not significant. The quantitative differences between the two datasets can be explained mainly noticing that, due to the larger number of edges of the D PPI network, the average number of ranked predicted functions increases in comparison with the U network. This is particularly dramatic in the case of ‘all ranks’ predictions.

 

    DISCUSSION
 TOP
 Abstract
 INTRODUCTION
 METHODS
 DATA AND GRAPH ANALYSIS
 RESULTS
 DISCUSSION
 REFERENCES
 
Hierarchical probabilities structure
Let us consider a simple example, a protein i surrounded by three classified neighbors, two having function {sigma} and one having function {tau}. According to (3), in the zero temperature (ß -> {infty}) limit one has P i ({sigma}) = 1 and P i ({tau}) = 0 together with all other function probabilities. However, if the interaction between protein i and the one neighbor with function {tau} is correct, a biologically more sound functional assignment would be that of giving to protein i both functions. Working at finite ß one can see again from (3) that a non-zero value of P i ({tau}) is also found. The numerical value will continuously depend on ß. The hierarchy of the values of the predicted probabilities turns out to be nevertheless very stable after crossing the critical point ß c . One example of the probabilities at convergence for a given ß and for a randomly chosen protein is shown in Figure 5.

Extension to the algorithm from the unclassified proteins graph U only to all proteins in G
Looking at four subsequent versions of MIPS databases releases (2001, 2002, 2003 and 2004, respectively), one can see that new functions are progressively assigned to already classified proteins, hence an inference procedure that allows for this possibility is in principle more complete. However, this procedure can lead to a spreading in the values of inferred probabilities, loosing in Sharpness. Indeed, we tested the performance of our algorithms in both the general and the restricted case, without noticing any significant difference in the performance. The results shown in the body of the text have been limited for clarity to the restricted case where inference is measured only on proteins V\A and for the 2002 and 2003 MIPS catalogs, in order to compare them with results already present in the literature. The same algorithm could be run on the latest 2004 MIPS catalog release with no effort.

Comparison with other available methods
Differently from SA (Vazquez et al., 2003), BP algorithm allows to compute directly and in a single run all probabilities P i (f) for a given protein i to be assigned a function f. This is an advantage with respect to the SA approach, where the output of a single run is one configuration only out of a mutually exclusive set, and in order to obtain trustful probabilities one should average over a large number of SA runs. Moreover, provided one can trust the numerical BP results hierarchies at convergence, some non-ground state configurations that could have a biological sound interpretation (for details see Methods section) are captured in the BP approach in a hierarchical way, while they are missed by SA unless one had time to run a number of cooling experiments in the order of 106 [compare with the 102 runs reported by Vazquez et al. (2003)]. Different from Letovsky and Kasif (2003), our version of BP naturally converges and does not therefore need iterations truncation. The connection of computed probability values with the real unknown ones can be made only at convergence of the BP iteration equations, and it is not clear how to interpret the probability values after only a limited number [two in Letovsky and Kasif (2003)] of iteration steps, when one could still be in the middle of a transient heavily dependent on initial conditions. Moreover, truncating the iteration after a small number of steps means disregarding propagation of information coming from distant regions of the network, which is the spirit of any message passing algorithm like BP. The method could still in principle work if the most distant message passing nodes of any chosen node i V\A were a few neighboring steps away. This turns out to be almost the case for the considered PPI networks, due to high clusterization and function segregation of V\A, as described in the Methods section, but it is not generally true in inference problems. In a second Bayesian approach (Deng et al., 2002), a large number of external parameters (one set for each function) has to be estimated before running the Bayesian inference algorithm. Still, the Gibbs potential (Deng et al., 2002) could in principle be of a more complete form, allowing for the presence of a chemical potential-like terms (one for each function) proportional to the overall number of times one function is present in the graph. However, it is not clear what the reliability of the biological significance of a term of this type, since influence from the classified functions of distant proteins should already be taken into account in the structure of the message passing procedure. Moreover, if the property of functional segregation was also true on the complete (still unknown) PPI network, it is not clear why a protein should have a high probability of being classified with a certain function only because a large group of proteins with a very frequent function existed, even if not interacting with the protein under consideration. In addition, our BP method does not require keeping track of single configurations of functions under the iterations, but only directly of probability weights. The algorithm converges into a stable fixed point and does not need the definition of a measuring time window period (Deng et al., 2002). Together with the Monte Carlo approach, our algorithm does not need previous estimation of external parameters defining the Gibbs potential, except for the overall tuning inverse temperature ß.

Limitations
Our method of course has many limitations. (1) The uncertainty over the graph structure, due to the presence of false positive and false negative interactions. The degree distribution of the network too could vary, even though some authors suggest that there is an evidence for a stabilization toward a scale-free-like form (Yook et al., 2004). Attempts to healing PPI networks errors or missing links are described in the literature (Lappe and Holm, 2004; Jansen et al., 2003), together with a general description of a message passing approach to network reconstruction (Yeang and Jaakkola, 2003). Our algorithm could be generalized to partially deal in parallel with these problems, considering two sets of dynamical variables {X i } and {J ij } instead of {X i } only. Each {J ij } could then take values in a discrete set measuring the likelihood of the interaction between proteins i and j to be present as a function of reliability of the experimental data and of the predicted functions assigned to the proteins under consideration. The extrapolated set {J ij } could then be taken as a starting point to calculate new function probabilities over the whole graph again using the BP procedure. Extensions of the method in this direction are under study but are not presented in this paper. (2) The Kronecker delta function defines a binary distance between functions: one link in U contributes 1 to the total score only if the interacting proteins have exactly the same function; 0 otherwise. However, the MIPS classification scheme is organized hierarchically: some proteins have very specific functions, while others can be classified only in a more coarse-grained functional categories. This hierarchical classification scheme, although very natural, is prone to a major potential weakness. The number of levels of a given functional category is largely dependent on the number of experimentalists who worked so far on the proteins belonging to the category itself. The higher the interest, the deeper and richer is the hierarchical classification. This makes a bit arbitrary cutting the hierarchies at a given depth since trees relative to different functional categories are far from being isomorphic. The choice of a binary distance seems unsatisfactory for the total classification, where a more complete notion of hierarchical distance between different functional categories would be needed. In particular, one would like to work with a distance function that recognizes as possibly close as two neighboring proteins of functions {sigma} and {tau} in the case {sigma} belongs to a very specific functional category, while {tau} only belongs to a broader one that includes the first, but with no further specification. In this paper we have limited the method to the binary distance score function (1), considering only functions at a chosen hierarchical level in the MIPS catalogs and disregarding all the others. In this way some information on partial knowledge of the functions assigned to a given protein is lost. This limitation becomes particularly dramatic in the case of the use of catalogs that are not organized hierarchically, but in a more complex way, such as Gene Ontology.


    Acknowledgments
 
The authors would like to thank Paolo Provero, for providing the starting input for this work, together with Riccardo Zecchina for fruitful discussions and reading the manuscript. We would also like to thank Alexei Vazquez, Alessandro Flammini and Vittoria Colizza for data and discussions.


    Footnotes
 
1 The MIPS Comprehensive Yeast Genome Database (CYGD), http://mips.gsf.de/proj/yeast/CYGD/db/ Back

2 The Gene Ontology Consortium, http://www.geneontology.org/ Back

Received on May 7, 2004; revised on July 1, 2004; accepted on August 16, 2004

    REFERENCES
 TOP
 Abstract
 INTRODUCTION
 METHODS
 DATA AND GRAPH ANALYSIS
 RESULTS
 DISCUSSION
 REFERENCES
 

    Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402[Abstract/Free Full Text].

    Deng, M., Zhang, K., Mehta, S., Chen, T., Sun, F. (2002) Prediction of protein function using protein–protein interaction data. Proceedings of the IEEE Computer Society Bioinformatics Conference (CSB'02), August 14–16, , Stanford, CA , pp. 197–207.

    Gavin, A.C., Bosche, M., Krause, R., Grandi, P., Marzioch, M., Bauer, A., Schultz, J., Rick, J.M., Michon, A.M., Cruciat, C.M., et al. (2002) Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415, 141–147[CrossRef][Medline].

    Giot, L., Bader, J.S., Brouwer, C., Chaudhuri, A., Kuang, B., Li, Y., Hao, Y.L., Ooi, C.E., Godwin, B., Vitols, E., et al. (2003) A protein interaction map of Drosophila melanogaster . Science, 302, 1727–1736[Abstract/Free Full Text].

    Hishigaki, H., Nakai, K., Ono, T., Tanigami, A., Tagaki, T. (2001) Assessment of prediction accuracy of protein function from protein–protein interaction data. Yeast, 18, 523–531[CrossRef][Web of Science][Medline].

    Ho, Y., Grubler, A., Heilbut, A., Bader, G.D., Moore, L., Adams, S.L., Millar, A., Taylor, P., Bennett, K., Boutilier, K., et al. (2002) Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature, 415, 180–183[CrossRef][Medline].

    Hodgman, T. (2000) A historical perspective on gene/protein functional assignment. Bioinformatics, 16, 10–15[Abstract/Free Full Text].

    Ito, T., Chiba, T., Ozawa, R., Yoshida, M., Hattori, M., Sakaki, Y. (2001) A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl Acad. Sci. USA, 98, 4569–4574[Abstract/Free Full Text].

    Jansen, R., Yu, H., Greenbaum, D., Kluger, Y., Krogan, N.J., Chung, S., Emili, A., Snyder, M., Greenbalt, J.F., Gerstein, M. (2003) A Bayesian networks approach for predicting protein–protein interactions from genomic data. Science, 302, 449–453[Abstract/Free Full Text].

    Lappe, M. and Holm, L. (2004) Unraveling protein interaction networks with near-optimal efficiency. Nat. Biotechnol., 22, 98–103[CrossRef][Web of Science][Medline].

    Letovsky, S. and Kasif, S. (2003) Predicting protein function from protein/protein interaction data: a probabilistic approach. Bioinformatics, 19, Suppl. 1, i197–i204[Abstract].

    Marcotte, E.M., Pellegrini, M., Thompson, M.J., Yeates, T.O., Eisenberg, D. (1999a) A combined algorithm for genome-wide prediction of protein function. Nature, 402, 83–86[CrossRef][Medline].

    Marcotte, E.M., Pellegrini, M., Ng, H.L., Rice, D.W., Yeates, T.O., Eisenberg, D. (1999b) Detecting protein functions and protein–protein interactions from genome sequences. Science, 285, 751–753[Abstract/Free Full Text].

    Mèzard, M. and Parisi, G. (2001) The Bethe lattice spin glass revisited. Eur. Phys. J. B, 20, 217[CrossRef].

    Pearson, W.R. and Lipman, D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444–2448[Abstract/Free Full Text].

    Samanta, M.P. and Liang, S. (2003) Predicting protein functions from redundancies in large-scale protein interaction networks. Proc. Natl Acad. Sci. USA, 100, 12579–12583[Abstract/Free Full Text].

    Schwikowski, B., Uetz, P., Fields, S. (2000) A network of protein–protein interactions in yeast. Nat. Biotechnol., 18, 1257–1261[CrossRef][Web of Science][Medline].

    Uetz, P., Giot, L., Cagney, G., Mansfield, T.A., Judson, R.S., Knight, J.R., Lockshon, D., Narayan, V., Srinivasan, M., Pochart, P., et al. (2000) A comprehensive analysis of protein–protein interactions in Saccharomyces cerevisiae . Nature, 403, 623–627[CrossRef][Medline].

    Vazquez, A., Flammini, A., Maritan, A., Vespignani, A. (2003) Global protein function prediction in protein–protein interaction networks. Nat. Biotechnol., 21, 697–700[CrossRef][Web of Science][Medline].

    Xenarios, I., Rice, D.W., Salwinski, L., Baron, M.K., Marcotte, E.M., Eisenberg, D. (2000) DIP: the database of interacting proteins. Nucleic Acids Res., 28, 289–291[Abstract/Free Full Text].

    Yeang, C.-H. and Jaakkola, T. (2003) Physical network models and multi-source data integration. The Seventh Annual International Conference on Research in Computational Molecular Biology (RECOMB 2003), April 10–13, , Berlin .

    Yedidia, J., Freeman, W., Weiss, Y. Understanding Belief Propagation and Its Generalizations, Exploring Artificial Intelligence in the New Millennium, (2003) Elsevier Science & Technology Books.

    Yook, S., Oltvai, Z., Barabási, A.-L. (2004) Functional and topological characterization of protein interaction networks. Proteomics, 4, , pp. 928–942[CrossRef][Web of Science][Medline].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Proc. Natl. Acad. Sci. USAHome page
G. Bianconi, P. Pin, and M. Marsili
Assessing the relevance of node features for network structure
PNAS, July 14, 2009; 106(28): 11433 - 11438.
[Abstract] [Full Text] [PDF]


Home page
Genome ResHome page
T. Ideker and R. Sharan
Protein networks in disease
Genome Res., April 1, 2008; 18(4): 644 - 652.
[Abstract] [Full Text] [PDF]


Home page
Brief BioinformHome page
P. Larranaga, B. Calvo, R. Santana, C. Bielza, J. Galdiano, I. Inza, J. A. Lozano, R. Armananzas, G. Santafe, A. Perez, et al.
Machine learning in bioinformatics
Brief Bioinform, March 1, 2006; 7(1): 86 - 112.
[Abstract] [Full Text] [PDF]


Home page
Genome ResHome page
S. Bandyopadhyay, R. Sharan, and T. Ideker
Systematic identification of functional orthologs based on protein network comparison
Genome Res., March 1, 2006; 16(3): 428 - 435.
[Abstract] [Full Text] [PDF]


Home page
J R Soc InterfaceHome page
P. Holme and M. Huss
Role-similarity based functional prediction in networked systems: application to the yeast proteome
J R Soc Interface, September 22, 2005; 2(4): 327 - 333.
[Abstract] [Full Text] [PDF]


Home page
Genome ResHome page
S. Bandyopadhyay, R. Sharan, and T. Ideker
Systematic identification of functional orthologs based on protein network comparison
Genome Res., March 1, 2006; 16(3): 428 - 435.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/2/239    most recent
bth491v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (7)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Leone, M.
Right arrow Articles by Pagnani, A.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Leone, M.
Right arrow Articles by Pagnani, A.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?