Bioinformatics Advance Access originally published online on May 19, 2005
Bioinformatics 2005 21(15):3279-3285; doi:10.1093/bioinformatics/bti492

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

Inferring protein–protein interactions through high-throughput interaction data from diverse organisms

Yin Liu 1, Nianjun Liu 2 and Hongyu Zhao 2,3,*

1Program of Computational Biology and Bioinformatics, Yale University New Haven, CT 06520, USA
2Department of Epidemiology and Public Health, Yale University School of Medicine New Haven, CT 06520, USA
3Department of Genetics, Yale University School of Medicine New Haven, CT 06520, USA

*To whom correspondence should be addressed.


    Abstract

Motivation: Identifying protein–protein interactions is critical for understanding cellular processes. Because protein domains represent binding modules and are responsible for the interactions between proteins, computational approaches have been proposed to predict protein interactions at the domain level. The fact that protein domains are likely evolutionarily conserved allows us to pool information from data across multiple organisms for the inference of domain–domain and protein–protein interaction probabilities.

Results: We use a likelihood approach to estimate domain–domain interaction probabilities by integrating large-scale protein interaction data from three organisms, Saccharomyces cerevisiae, Caenorhabditis elegans and Drosophila melanogaster. The estimated domain–domain interaction probabilities are then used to predict protein–protein interactions in S.cerevisiae. Based on a thorough comparison of sensitivity and specificity, Gene Ontology term enrichment and gene expression profiles, we have demonstrated that it may be far more informative to predict protein–protein interactions from diverse organisms than from a single organism.

Availability: The program for computing the protein–protein interaction probabilities and supplementary material are available at http://bioinformatics.med.yale.edu/interaction

Contact: hongyu.zhao{at}yale.edu


    INTRODUCTION
Protein–protein interactions play critical roles in the control of most cellular processes. Many proteins involved in signal transduction, gene regulation, cell–cell contact and cell cycle control require interaction with other proteins or cofactors to activate those processes (Papin et al., 2004; Tucker et al., 2001; Wang, 2002). Recently, systematic identifications of protein interactions in Saccharomyces cerevisiae have been conducted using high-throughput techniques such as yeast two-hybrid screening methods (Ito et al., 2001; Uetz et al., 2000) or affinity purification coupled with mass spectroscopy (Gavin et al., 2002; Ho et al., 2002). Although these experimental approaches have generated enormous amounts of data and valuable resources for studying protein interactions, these methods suffer from high false positive and false negative rates owing to their limitations (Mrowka et al., 2001; von Mering et al., 2002). For example, the false negative rate of the yeast two-hybrid assay used to construct S.cerevisiae interaction maps has been estimated to be >70% (Deng et al., 2002). Therefore, there is a great need to develop complementary computational methods capable of accurately predicting interactions between proteins through integrated analysis of data from multiple sources.

A number of computational approaches have been proposed to predict protein–protein interactions, including those based on genomic information (Enright et al., 1999; Tsoka et al., 2000), three-dimensional structural information (Lu et al., 2003; Aloy et al., 2004), integration of multiple genomic datasets (Jansen et al., 2003; Lin et al., 2004; Iossifov et al., 2004) and literature mining (Marcotte et al., 2001). Protein–protein interactions can also be predicted on the basis of evolutionary relationship. It has been shown that interacting proteins often exhibit coordinated evolution, so that proteins with similar phylogenetic trees are more likely to interact with each other (Pazos et al., 2001; Goh et al., 2002; Ramani et al., 2003). In addition, the concept of ‘interologs’ has been proposed based on the idea that a pair of interacting proteins are coevolving so that their respective orthologs in other organisms tend to interact as well (Walhout et al., 2000).

Several methods have been proposed to predict protein interactions in S.cerevisiae on the basis of another important principle, namely, domain–domain interactions. The protein domain as a unit of structure, function and evolution also serves as a unit for protein–protein interactions. Therefore, it is important to take into account domain–domain interactions when we infer plausible interacting protein pairs. In these methods, proteins are characterized by one or more domains and each domain is responsible for a specific interaction with another domain. Sprinzak and Margalit (2001) identified the domain pairs that are highly correlated with interacting protein pairs using protein–protein interaction data from S.cerevisiae as training data. The information was further used to predict interacting protein pairs that contain an interacting domain pair. Similarly, Gomez et al. (2001, 2003) and Deng et al. (2002) estimated the probabilities of domain–domain interactions using protein–protein interaction data from S.cerevisiae as training data; the estimated domain–domain interaction probabilities can be used to infer protein–protein interaction probabilities. These methods depend strongly on the accuracy of the training data and have mostly been applied to protein–protein interaction data from a single organism, so they may be inferior to methods that can incorporate more information in estimating domain–domain interaction probabilities.

Because domains are likely evolutionarily conserved, information from multiple organisms may be integrated together to improve the estimation of domain–domain interaction probabilities. In our study, we incorporate information from three organisms, S.cerevisiae, Caenorhabditis elegans and Drosophila melanogaster, to effectively utilize the domain information as the evolutionary connection among these model organisms. The protein–domain relationship can be extracted from relevant databases such as PFAM and SMART (Bateman et al., 2004; Letunic et al., 2004). By integrating large-scale protein–protein interaction data from these three organisms, we have extended a likelihood approach proposed by Deng et al. (2002) to estimate the probabilities of domain–domain interactions based on information from all three organisms. Considering each protein as a collection of domains, we can then estimate the probabilities of protein–protein interactions in S.cerevisiae based on the inferred domain–domain interaction probabilities. The protein pairs with interaction probabilities above a certain threshold can then be predicted to interact with each other. In order to assess the performance of our method, we first apply it to the interaction data from S.cerevisiae only and compare its performance with that of three other methods that predict protein interactions based on the domain composition of proteins in the cross-validation measurement, and we demonstrate that our method provides comparable performance to the others. Then, we compare our prediction results based on all three organisms with those based on S.cerevisiae alone. We find that the integrated analysis provides more reliable inference of protein–protein interactions than the analysis from a single organism based on the analysis of sensitivity and specificity, Gene Ontology term enrichment and gene expression profiles.


    METHODS
Data sources
In our study, the high-throughput yeast two-hybrid data from three organisms, S.cerevisiae, C.elegans and D.melanogaster, are used to infer domain–domain interaction probabilities. For S.cerevisiae, we use a combined dataset from two independent studies (Ito et al., 2000; Uetz et al., 2000), which includes a total of 5295 interactions. For C.elegans, 4714 interactions were reported from yeast two-hybrid experiments (Li et al., 2004). For D.melanogaster, results from two-hybrid experiments yielded a total of 20 349 interaction pairs (Giot et al., 2003). The protein–domain relationships for each protein in S.cerevisiae, C.elegans and D.melanogaster are extracted from PFAM (Bateman et al., 2004) and SMART (Letunic et al., 2004).

Maximum likelihood estimation of domain–domain and protein–protein interaction probabilities
We estimate the probabilities of domain–domain interactions through the extension of a likelihood approach proposed by Deng et al. (2002) so that it can incorporate information from all three organisms. In this model, we make the following assumptions: (1) domain–domain interactions are independent, so whether two domains interact or not does not depend on the interactions among other domains; (2) the probability that two domains m and n interact is the same among all the three organisms; (3) two proteins i and j interact if and only if at least one pair of domains from the two proteins interact.
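Under assumptions (1) and (3), the probability that two proteins interact is one minus the probability that none of their domain pairs interacts. The following minimal Python sketch illustrates this relationship. The function name, the dictionary layout for the domain–domain probabilities and the treatment of (m, n) as an unordered pair are assumptions made for illustration and are not taken from the authors' program.

```python
from itertools import product


def protein_interaction_prob(domains_i, domains_j, lam):
    """Probability that two proteins interact, given their domain sets.

    lam maps an unordered domain pair (stored as a sorted tuple) to its
    interaction probability; pairs missing from lam are treated as 0
    (an assumption of this sketch).
    """
    # Collect each unordered domain pair once.
    domain_pairs = {tuple(sorted((m, n))) for m, n in product(domains_i, domains_j)}
    prob_no_pair_interacts = 1.0
    for pair in domain_pairs:
        # Independence (assumption 1): multiply the non-interaction probabilities.
        prob_no_pair_interacts *= 1.0 - lam.get(pair, 0.0)
    # Assumption 3: the proteins interact iff at least one domain pair interacts.
    return 1.0 - prob_no_pair_interacts


# Hypothetical domain labels and probabilities, for illustration only.
example_lam = {("A", "B"): 0.5, ("A", "C"): 0.2}
example_prob = protein_interaction_prob({"A"}, {"B", "C"}, example_lam)
```

Because assumption (2) makes the domain–domain probabilities species independent, the same dictionary can be reused for protein pairs from any of the three organisms.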

With these assumptions, we have

$$\Pr(P_{ijk} = 1) = 1 - \prod_{(D_{mn} \in P_{ijk})} (1 - \lambda_{mn}),$$

where $P_{ijk}$ represents the protein pair i and j in species k; $P_{ijk} = 1$ if protein i and protein j in species k interact with each other, and $P_{ijk} = 0$ otherwise. Here, k = 1, 2, 3 represents species S.cerevisiae, C.elegans and D.melanogaster, respectively, $\lambda_{mn}$ represents the probability that domain m interacts with domain n and the notation $(D_{mn} \in P_{ijk})$ denotes all pairs of domains from protein pair i and j in species k. The probability that proteins i and j in species k are observed to be interacting in the experiments is

$$\Pr(O_{ijk} = 1) = \Pr(P_{ijk} = 1)(1 - f_n) + [1 - \Pr(P_{ijk} = 1)] f_p,$$

where $O_{ijk} = 1$ if interaction between protein i and j is observed in species k, and $O_{ijk} = 0$ otherwise. Here, $f_n$ and $f_p$ represent the false negative rate and false positive rate of the protein interaction data. It has been estimated that the total number of interactions between all yeast proteins is ~20 000–30 000 (Bader et al., 2004). Therefore, for S.cerevisiae, we have

We obtained a total of 5717 proteins from SWISS-PROT and TrEMBL; therefore,

Similarly, for C.elegans, $f_n$ is ~0.90 by mapping the observed interactions to a benchmark data set (Li et al., 2004) and we estimate $f_p$ to be $< 3 \times 10^{-5}$. For D.melanogaster, $f_n$ is ~0.80 (Giot et al., 2003) and we estimate $f_p$ to be $< 3.6 \times 10^{-4}$.
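To give a feel for the magnitude of these error rates, the sketch below works through one plausible way of bounding them from the counts quoted in the text for S.cerevisiae (5295 observed interactions, 5717 proteins, 20 000–30 000 true interactions). The definitions used here are assumptions for illustration only: the false negative rate is bounded from below by treating every observed interaction as real, and the false positive rate is bounded from above by treating every observed interaction as false and dividing by the number of non-interacting pairs (self-pairs excluded). It is not a reproduction of the authors' derivation.

```python
def false_negative_lower_bound(n_observed, n_true_total):
    """Lower bound on fn, assuming every observed interaction is real."""
    return 1.0 - n_observed / n_true_total


def false_positive_upper_bound(n_observed, n_proteins, n_true_total):
    """Upper bound on fp, assuming every observed interaction is false.

    The denominator is the number of non-interacting protein pairs,
    taken here as all unordered pairs minus the true interactions.
    """
    n_pairs = n_proteins * (n_proteins - 1) // 2
    return n_observed / (n_pairs - n_true_total)


# Counts quoted in the text for S.cerevisiae.
fn_low = false_negative_lower_bound(5295, 20_000)
fn_high = false_negative_lower_bound(5295, 30_000)
fp_max = false_positive_upper_bound(5295, 5717, 30_000)
print(f"fn lower bound: {fn_low:.3f} to {fn_high:.3f}, fp upper bound: {fp_max:.2e}")
```

Under these assumed definitions the false positive rate is of the order of 10^-4, because the number of non-interacting pairs is far larger than the number of observed interactions.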

The likelihood function that characterizes the probability of the observed protein interaction data across all three organisms is

$$L = \prod_{i,j,k} \Pr(O_{ijk} = 1)^{O_{ijk}} [1 - \Pr(O_{ijk} = 1)]^{1 - O_{ijk}}.$$

The likelihood L is a function of the parameters $\lambda_{mn}$ once fixed values are specified for $f_n$ and $f_p$. To obtain the maximum likelihood estimates (MLEs) of the parameters, we use the EM algorithm (Dempster et al., 1977), which consists of the expectation (E) step and the maximization (M) step. In the E-step, we calculate the expectations of the complete data given the observed data. Here, the complete data include all the domain–domain interactions for each protein pair i and j of each of the three organisms, denoted by $D_{mn}^{(ijk)}$, where $D_{mn}^{(ijk)} = 1$ if domains m and n interact in protein pair i and j of species k, and $D_{mn}^{(ijk)} = 0$ otherwise. We have

$$E\left(D_{mn}^{(ijk)} \mid O_{ijk} = o_{ijk}, \lambda^{(t-1)}\right) = \frac{\lambda_{mn}^{(t-1)} (1 - f_n)^{o_{ijk}} f_n^{1 - o_{ijk}}}{\Pr\left(O_{ijk} = o_{ijk} \mid \lambda^{(t-1)}\right)},$$

where the superscript (t − 1) denotes the parameter values from the previous iteration.

With the expectations of the complete data, in the M-step, we update the $\lambda_{mn}$ by

$$\lambda_{mn}^{(t)} = \frac{1}{N_{mn}} \sum E\left(D_{mn}^{(ijk)} \mid O_{ijk} = o_{ijk}, \lambda^{(t-1)}\right),$$

where $N_{mn}$ is the total number of protein pairs containing domain pair (m, n) across the three organisms, and the summation is over all these protein pairs.

We update the parameter estimates of the $\lambda_{mn}$ by iterating between the E-step and the M-step until convergence to obtain the MLEs of the $\lambda_{mn}$ for all the domain pairs. The estimated values of the $\lambda_{mn}$ allow us to compute the protein interaction probabilities so that two proteins with an interaction probability greater than a certain threshold can be predicted to be interacting partners.
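The iteration between the E-step and the M-step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' program: the input layout (a list of protein pairs pooled over all organisms, each given as a set of domain-pair keys plus a 0/1 observation), the initial value, the stopping rule and the requirement 0 < fn < 1 and 0 <= fp < 1 are all choices made for this sketch. The E-step expectation follows from Bayes' rule under the model: a domain pair that interacts implies that the protein pair interacts, so the observation is positive with probability 1 − fn.

```python
def em_domain_interactions(pairs, fn, fp, init=0.5, max_iter=500, tol=1e-6):
    """Estimate domain-domain interaction probabilities by EM.

    pairs: list of (domain_pairs, observed), where domain_pairs is a set of
    hashable domain-pair keys and observed is 1 if the protein pair was
    observed to interact and 0 otherwise.
    """
    keys = {d for domain_pairs, _ in pairs for d in domain_pairs}
    lam = {d: init for d in keys}
    # N_mn: number of protein pairs that contain each domain pair.
    n_containing = {d: 0 for d in keys}
    for domain_pairs, _ in pairs:
        for d in domain_pairs:
            n_containing[d] += 1

    for _ in range(max_iter):
        expected = {d: 0.0 for d in keys}
        for domain_pairs, observed in pairs:
            prob_none = 1.0
            for d in domain_pairs:
                prob_none *= 1.0 - lam[d]
            p_true = 1.0 - prob_none
            p_obs_one = p_true * (1.0 - fn) + (1.0 - p_true) * fp
            if observed == 1:
                weight, denom = 1.0 - fn, p_obs_one
            else:
                weight, denom = fn, 1.0 - p_obs_one
            # E-step: expected interaction indicator for each domain pair.
            for d in domain_pairs:
                expected[d] += lam[d] * weight / denom
        # M-step: average the expectations over the containing protein pairs.
        new_lam = {d: expected[d] / n_containing[d] for d in keys}
        change = max(abs(new_lam[d] - lam[d]) for d in keys)
        lam = new_lam
        if change < tol:
            break
    return lam


# Toy data with hypothetical domain pairs: ("A", "B") is always observed
# to interact, ("C", "D") never is.
toy_pairs = [({("A", "B")}, 1)] * 5 + [({("C", "D")}, 0)] * 5
toy_lam = em_domain_interactions(toy_pairs, fn=0.2, fp=0.01)
```

On the toy data the estimate for ("A", "B") moves towards 1 and the estimate for ("C", "D") moves towards 0, which is the qualitative behaviour expected from the model.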

Cross-validated comparison and receiver operating characteristic analysis
To compare our likelihood approach with other similar methods that predict protein interactions based on protein domain information, we measure the performance of each prediction using a 5-fold cross-validation. As all the other methods predicting protein interaction pairs are applied to the interaction data from S.cerevisiae only, we define the training interaction data for the cross-validation as follows: we consider the 3543 yeast physical interaction pairs in MIPS as positive examples (Mewes et al., 2004) and the other possible protein pairs, a total of 6 895 215 pairs, as negative examples. At each iteration of the cross-validation experiments we reserve one-fifth of both positives and negatives for testing and use the remaining data for training. The training–test procedure is repeated five times.
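The split described above, in which one-fifth of the positives and one-fifth of the negatives are held out in each round, can be sketched as follows. The function name, the fixed random seed and the round-robin assignment of shuffled examples to folds are assumptions of this sketch; the text does not specify how the folds were drawn.

```python
import random


def k_fold_splits(positives, negatives, k=5, seed=0):
    """Yield ((train_pos, train_neg), (test_pos, test_neg)) for each fold."""
    rng = random.Random(seed)
    pos, neg = list(positives), list(negatives)
    rng.shuffle(pos)
    rng.shuffle(neg)
    for fold in range(k):
        # Every k-th shuffled example goes to the test set of this fold.
        test_pos, test_neg = pos[fold::k], neg[fold::k]
        train_pos = [p for idx, p in enumerate(pos) if idx % k != fold]
        train_neg = [q for idx, q in enumerate(neg) if idx % k != fold]
        yield (train_pos, train_neg), (test_pos, test_neg)


# Small stand-in lists; the study uses 3543 positives and 6 895 215 negatives.
folds = list(k_fold_splits(range(10), range(100, 120)))
```

Splitting positives and negatives separately keeps the class ratio the same in every fold, which matters here because negatives outnumber positives by roughly three orders of magnitude.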

The prediction accuracy is measured using the receiver operating characteristic (ROC) curve, which demonstrates the trade-offs between sensitivity and specificity. It is a plot of the true positive rate (sensitivity) against the false positive rate (1 – specificity) for different thresholds. Here, the true positive rate, denoted as TPF, is calculated as the number of predicted protein pairs that are included in the positive examples divided by 3543, the total number of positives; the false positive rate, denoted as FPF, is calculated as the number of predicted protein pairs that are included in the negative examples divided by 6 895 215, the total number of negatives. The ROC score, calculated as the area under the ROC curve, is a measurement of prediction accuracy. The closer the ROC score is to 1.0, the better the prediction. In our study, we repeat the entire cross-validation procedure three times in order to estimate the variance of the ROC score.
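The ROC construction described above can be sketched as follows: sweep the threshold from the highest score downwards, record (FPF, TPF) at each distinct threshold and integrate with the trapezoidal rule. The function names and the trapezoidal integration are choices made for this sketch; the text does not state how the area was computed.

```python
def roc_points(scores, labels):
    """Return lists (fpf, tpf) obtained by lowering the decision threshold.

    labels are 1 for positive examples and 0 for negative examples.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda item: -item[0])
    fpf, tpf = [0.0], [0.0]
    true_pos = false_pos = 0
    idx = 0
    while idx < len(ranked):
        threshold = ranked[idx][0]
        # Pairs with tied scores cross the threshold together.
        while idx < len(ranked) and ranked[idx][0] == threshold:
            if ranked[idx][1] == 1:
                true_pos += 1
            else:
                false_pos += 1
            idx += 1
        fpf.append(false_pos / n_neg)
        tpf.append(true_pos / n_pos)
    return fpf, tpf


def roc_score(fpf, tpf):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (fpf[k] - fpf[k - 1]) * (tpf[k] + tpf[k - 1]) / 2.0
        for k in range(1, len(fpf))
    )


# Hypothetical scores for two positives and two negatives.
example_fpf, example_tpf = roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
example_auc = roc_score(example_fpf, example_tpf)
```

In the toy example three of the four positive–negative pairs are ranked correctly, so the area is 0.75; a random ranking would give a value near 0.5.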

Gene Ontology analysis
We determine whether the two genes encoding the predicted interacting protein pair have any GO annotation enriched in the biological process ontology by using the Saccharomyces Genome Database (SGD) GO TermFinder (http://search.cpan.org/dist/GO-TermFinder/). The probability that two genes share the same biological process by chance is calculated through the hypergeometric distribution. The P-value is calculated using the following equation:

$$P = 1 - \sum_{i=0}^{x-1} \frac{\binom{M}{i}\binom{N-M}{n-i}}{\binom{N}{n}},$$

where N and M represent the total number of genes in the population and the number of genes that have a particular biological process category annotation, respectively, and n and x represent the number of genes in the set and the number of genes in the set annotated with the particular biological process, respectively. Because each gene set we investigate is a pair of genes, both n and x are equal to 2. The P-value is corrected for multiple testing using Bonferroni correction and a protein pair is considered as GO term enriched if the corrected P-value is <0.05.
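The hypergeometric tail probability and the Bonferroni correction can be sketched with the Python standard library. The function names are chosen for this sketch, and the number of tests used in the correction is left as a parameter because the text does not state it. With n = x = 2 the P-value reduces to the chance that both genes of a random pair carry the annotation.

```python
from math import comb


def hypergeom_pvalue(N, M, n, x):
    """Probability of at least x annotated genes in a set of n genes.

    N: genes in the population, M: genes carrying the annotation.
    """
    upper = min(n, M)
    favourable = sum(comb(M, i) * comb(N - M, n - i) for i in range(x, upper + 1))
    return favourable / comb(N, n)


def bonferroni(p_value, n_tests):
    """Bonferroni-corrected P-value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Hypothetical numbers: 6000 genes, 50 of them annotated with the term,
# and a gene pair in which both genes carry the annotation.
p_pair = hypergeom_pvalue(6000, 50, 2, 2)
p_corrected = bonferroni(p_pair, n_tests=100)
```

The smaller the annotated category relative to the population, the smaller the P-value for a pair sharing it, so shared membership in a specific process counts as stronger evidence than shared membership in a broad one.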

To assess the overall statistical significance of the observed GO term enrichment, we generate randomized protein–domain associations by randomly permuting the domain labels of all proteins while leaving the number of domains associated with each protein untouched. We then run the same prediction procedure on the permuted domain information. This process is repeated 100 times and the number of predicted protein pairs having GO term enrichment is recorded for each permutation. The empirical P-value for the observed GO term enrichment is calculated as the fraction of the permutations having a larger number of GO term enriched protein pairs than that based on the observed data.
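The permutation procedure can be sketched as follows. The interpretation taken here is an assumption: all domain labels are pooled, shuffled and dealt back so that each protein keeps its original number of domains. The prediction and GO enrichment steps are not shown; the sketch only covers the label shuffling and the empirical P-value, and the permuted counts in the example are placeholders rather than results.

```python
import random


def permute_domain_labels(protein_domains, rng):
    """Shuffle domain labels across proteins, preserving each protein's domain count.

    protein_domains maps a protein identifier to a list of domain labels.
    """
    pooled = [d for domains in protein_domains.values() for d in domains]
    rng.shuffle(pooled)
    permuted, start = {}, 0
    for protein, domains in protein_domains.items():
        permuted[protein] = pooled[start:start + len(domains)]
        start += len(domains)
    return permuted


def empirical_pvalue(observed_count, permuted_counts):
    """Fraction of permutations with a strictly larger count than observed."""
    larger = sum(1 for c in permuted_counts if c > observed_count)
    return larger / len(permuted_counts)


# Hypothetical protein-domain table, for illustration only.
table = {"p1": ["A", "B"], "p2": ["C"], "p3": ["D", "E", "F"]}
shuffled = permute_domain_labels(table, random.Random(1))
# Placeholder counts standing in for the 100 permutation runs.
p_emp = empirical_pvalue(203, [10, 20, 300, 5])
```

With 100 permutations the smallest non-zero empirical P-value is 0.01, so a reported value of 0 means that no permutation exceeded the observed count.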


    RESULTS
The protein–domain relationships are extracted from PFAM and SMART, and there are a total of 3317 domains associated with the proteins of the three organisms (S.cerevisiae, C.elegans and D.melanogaster). The distribution of these domains across the three organisms is shown in a Venn diagram in Figure 1.



Fig. 1 The distribution of the domains in S.cerevisiae, C.elegans and D.melanogaster.

 
Sensitivity and specificity
In this study, we have extended a likelihood approach by Deng et al. (2002) to integrate information from diverse organisms to infer protein–protein interaction probabilities. We compare the performance of the likelihood approach with three other methods that have also been used for protein interaction prediction: the sequence-signature method proposed by Sprinzak and Margalit (2001), the attraction-only model (Gomez et al., 2001) and the attraction–repulsion model (Gomez et al., 2003). All four methods explore the experimental protein interaction data to assign the probability or score for each protein pair, and make predictions of interacting protein pairs based on a selected decision threshold. To compare the performance of each prediction method, we apply these methods to the same training interaction data obtained from a single organism (S.cerevisiae only) and measure the performance of each method using 5-fold cross-validation. For different thresholds, the sensitivity and specificity of each prediction method are calculated and the ROC scores that measure the accuracy of prediction for each method are obtained (see Methods). The results in Figure 2 clearly demonstrate that, with only the information from a single organism, the prediction performance of the likelihood approach, with a ROC score of 0.628 ± 0.005, is comparable to that of the attraction–repulsion model, and is significantly better than those of the attraction-only model and the sequence-signature method.



Fig. 2 ROC score summary. Error bars indicate the standard deviation over three cross-validation experiments.

 
The advantage of our extended likelihood approach is that it allows us to incorporate the large-scale protein–protein interaction data from diverse organisms. In order to assess the benefit of simultaneous analysis of multiple organisms, we investigate the information gain from the joint analysis of all three organisms compared with the analysis based solely on S.cerevisiae. Because information from C.elegans and D.melanogaster can affect (and hopefully improve) the estimated domain–domain interaction probabilities in S.cerevisiae, the predicted protein–protein interactions differ between the two methods. Taking the 3543 protein–protein physical interactions recorded in MIPS as true positives, we estimate the sensitivity and specificity for each threshold of the two methods either based on information from all three organisms or based on information from S.cerevisiae alone. The results are summarized in the ROC curves in Figure 3. The improvement based on the joint analysis of three organisms can be easily seen from this figure.



Fig. 3 ROC curves of the prediction results based on different information sources.

 
Evaluation of GO term enrichment
In order to evaluate the quality of our predicted protein interactions, we investigate whether two genes encoding a predicted interacting protein pair are functionally related. Because genes are more likely to share the same biological process if they are functionally related (Vazquez et al., 2003), we determine whether these two genes have any GO annotation enrichment in the biological process ontology compared with what would be expected by chance from a random pair of genes. We observe that, out of the top 1000 predicted interacting protein pairs based on the information from all three organisms, 203 pairs have at least one GO term enriched, whereas only 91 pairs out of the top 1000 predicted pairs based on the information from yeast alone have a GO term enriched. To assess the statistical significance of these results, we compare these predictions with those based on randomized protein–domain associations (see Methods). We find that the 203 observed GO term enriched pairs based on the information from all three species are statistically significant (empirical P-value is 0), whereas the observed 91 GO term enriched pairs based on S.cerevisiae alone are not statistically significant (empirical P-value is 0.06).

Gene expression profiles
Interacting proteins are more likely to be coexpressed than a random pair of genes and this fact has been used for experimental validation of the predicted protein–protein interactions (Ge et al., 2001; Kemmeren et al., 2002). In our study, we test whether there is statistical evidence suggesting that gene expression profiles are more similar between the predicted protein pairs, where the similarity is defined by the Pearson correlation coefficient between the gene expression profiles of these two genes. For gene expression profiles, we use publicly available gene expression data, including a time-course study during the yeast cell cycle (Spellman et al., 1998) and the Rosetta ‘compendium’ set, which is composed of 300 diverse mutations and chemical treatments (Hughes et al., 2000).
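The similarity measure used here, the Pearson correlation coefficient between two expression profiles, and its average over a set of gene pairs can be sketched as follows. The function names and the dictionary layout of the expression data are assumptions of this sketch, and the gene names and values in the example are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def mean_pair_correlation(expression, gene_pairs):
    """Average Pearson correlation over pairs with profiles for both genes."""
    values = [
        pearson(expression[g1], expression[g2])
        for g1, g2 in gene_pairs
        if g1 in expression and g2 in expression
    ]
    return sum(values) / len(values)


# Made-up expression profiles over four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
avg_corr = mean_pair_correlation(profiles, [("geneA", "geneB"), ("geneA", "geneC")])
```

A profile with zero variance would make the denominator zero, so constant profiles need to be filtered out before calling `pearson`.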

To test whether the correlation coefficients of gene expressions for the predicted interacting protein pairs are significantly higher than those for random gene pairs, we compare the distribution of the correlation coefficients between the predicted interacting protein pairs with a probability threshold of 0.1, the physical interaction protein pairs from MIPS, the predicted interacting pairs excluding those pairs from MIPS, and random pairs. We find that the distribution of the correlation coefficients of the predicted protein pairs is similar to that of the annotated interacting protein pairs in MIPS, which are verified interacting proteins. Compared with random protein pairs, the predicted protein pairs have a higher mean correlation coefficient (Supplementary Data). In addition, we compare the mean expression correlation coefficient for the predicted interacting protein pairs based on information from all three organisms and that based on information from S.cerevisiae alone. For this comparison, we first identify the top N predicted interacting pairs based on either method, where N takes values of 100, 500, 1000, 2000, 5000 and 10 000. We then calculate the average correlation coefficient for the predicted interacting pairs in the set for each method. As shown in Table 1, as N increases, the mean correlation coefficient decreases owing to the inclusion of a larger proportion of false positives in the data set. More importantly, for any given N, the mean correlation coefficient for the predicted interacting protein pairs based on the information from all three organisms is significantly higher than that for protein pairs predicted using the information from S.cerevisiae alone. In addition, the distributions of the correlation coefficients for the top 1000 predicted protein pairs based on two different sources are shown in Figure 4. 
As can be seen from this figure, there is a general shift of the distribution to higher correlation coefficient values for protein pairs predicted based on the information from all three organisms compared with those predicted based on S.cerevisiae alone, indicating that the prediction based on the information from all three organisms is likely to yield more reliable predicted interacting protein pairs.


Table 1 Comparison of the mean correlation coefficient for the selected predicted protein pairs based on two different information sources

 


Fig. 4 Comparisons of the distributions of the Pearson correlation coefficients for the top 1000 predicted interacting protein pairs based on different information sources. sdc, prediction based on the information from three organisms S.cerevisiae, D.melanogaster and C.elegans.

 
Biological significance of the predictions
In this section, we discuss the biological relevance of the predicted interacting protein pairs. Although many of the predicted pairs are in the MIPS database, some of the top ones are not. Table 2 summarizes the top 10 predictions that are not in the MIPS database, and all these predictions have estimated interaction probabilities equal to 1. Table 2 also provides the functional annotation of these genes. Some of our predicted protein pairs include subunits of the same protein complex; for example, MCD1 and IRR1 are subunits of the yeast cohesin complex. Some other predictions involve interactions between proteins belonging to the same family, such as OCA1 and SNZ1, or between members of two different families, such as the VAC and ECM families. The interactions between VAC8, a phosphorylated vacuole membrane protein that is required for protein targeting from cytoplasm to vacuole (Scott et al., 2000), and the members of the ECM family, such as ECM15, may indicate that the ECM proteins are required for vacuole formation in three-dimensional extracellular matrices.


Table 2 The top 10 predicted interacting protein pairs that are not included in the MIPS physical interaction dataset

 
Some of our predictions may be biologically important. For example, it has been shown that the lack of Srp1 export might impair cNLS-dependent nuclear protein import in yeast (Stade et al., 2002). Because the ubiquitin-like modification of some proteins, such as RanGAP1, is required for protein nucleocytoplasmic trafficking (Matunis et al., 1998), the ubiquitin ligase may be involved in the nuclear protein import. Therefore, it may be reasonable to consider that Srp1 and BUL2, a component of the ubiquitin ligase complex, interact with each other and play a role in the nuclear protein import process together. The interaction between CUP2 and THI4 may indicate that genes activated by the transcription factor CUP2 are involved in the process of thiamine biosynthesis, in which THI4 plays an important role. Another example is the protein pair DCS1–NTH2. NTH2 is a neutral trehalase, and it has been proposed that the phosphorylation of DCS1 by CaM kinase II would lead to its dissociation from the neutral trehalase, and thus that the activity of the neutral trehalase would be upregulated (Souza et al., 2002). Therefore, the lack of CaM kinase II would downregulate the neutral trehalase activity as a result of the interaction between DCS1 and NTH2. In addition, we may predict the functions of some unknown proteins based on their interacting partners. For example, YMR009W is predicted to interact with FUN34, a transmembrane protein that is involved in ammonia production; therefore, we can predict that YMR009W may also be involved in this process.


    DISCUSSION AND CONCLUSIONS
In this article, we propose estimating the probabilities of interactions between domain pairs by pooling information from three organisms—S.cerevisiae, C.elegans and D.melanogaster—based on large-scale protein interaction data. Using the estimated domain–domain interaction probabilities, we can then estimate the probabilities of interactions between each protein pair in a given organism. We focus our attention on predicting the protein interactions in S.cerevisiae, and we have found that, even based on the information from S.cerevisiae only, the likelihood approach is among the best-performing methods considered in our comparisons. Because of the experimental errors of large-scale two-hybrid assays, the domain interactions inferred from one organism may not be reliable, and the incorporation of data from other organisms can indeed improve the estimated domain–domain and protein–protein interactions. The extension of the likelihood approach allows the incorporation of the information from all three organisms, and the prediction results were found to be better than those obtained based on the information from S.cerevisiae alone through the examinations of ROC curves, GO term enrichments and expression profiles. Therefore, we conclude that the approach proposed in this study outperforms those used for comparison, providing more informative inference of protein interactions.

The results from our approach can be further improved when the domain information is further and more reliably annotated in the future. Currently, only about two-thirds of the S.cerevisiae proteins have a defined domain composition, and we have considered possible interactions only between those proteins with annotated domain information. As a result, the predictions based on domain–domain interactions will be able to capture only a portion of all interactions, the number of which is estimated to be ~20 000–30 000 in S.cerevisiae. Our predicted interacting pairs depend on the threshold value used for the estimated interaction probabilities, and the number of predicted pairs increases as we reduce the threshold. Owing to the unknown number of truly interacting protein pairs as well as the incompleteness of the annotated domain information, it is difficult to set a threshold value to match the expected number of interacting pairs. When we set the threshold at 0.1, 20 088 protein pairs are predicted to interact with each other. At this level, using MIPS physical interaction data as the gold standard, we estimate the sensitivity and specificity to be 38.6 and 99.7%, respectively. (The list of all the predicted interactions is provided as supplementary information.) As the interacting protein pairs included in MIPS are far from complete, these values calculated based on the MIPS data could be different from the actual values.
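The sensitivity and specificity quoted at the 0.1 threshold can be checked for internal consistency with a short calculation. The sketch assumes that all 20 088 predicted pairs fall within the set of pairs used for evaluation (3543 MIPS positives and 6 895 215 negatives, as defined in Methods) and recovers the approximate number of true positives from the reported sensitivity; both are assumptions of this illustration rather than figures given in the text.

```python
def sensitivity_specificity(n_predicted, n_true_pos, n_positives, n_negatives):
    """Sensitivity and specificity from prediction counts."""
    n_false_pos = n_predicted - n_true_pos
    sensitivity = n_true_pos / n_positives
    specificity = 1.0 - n_false_pos / n_negatives
    return sensitivity, specificity


# Approximate true positive count implied by a sensitivity of 38.6%.
approx_true_pos = round(0.386 * 3543)
sens, spec = sensitivity_specificity(20_088, approx_true_pos, 3543, 6_895_215)
print(f"sensitivity = {sens:.1%}, specificity = {spec:.1%}")
```

The high specificity reflects the very large number of negative pairs: even though most of the predicted pairs are not in MIPS, they are a small fraction of all pairs treated as negatives.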

It is well known that two-hybrid assays contain many errors, and the exact error rates are hard to assess because the actual protein–protein interactions are not yet known. Based on the number of interactions in our training data, we have estimated the ranges of the false positive and false negative rates (see Methods). The estimated value of fn agrees with the literature in which the dataset is published, whereas the estimated value of fp differs from those established in the literature by an order of magnitude because a different definition of false positive is used (the number of incorrect interactions observed in experiments divided by the total number of observed interactions). We fix the fn and fp rates in our analysis because this approach has been shown to be robust with respect to a range of experimental error rates (Supplementary Data). In our study, we set the error rates to fp = 3 × 10⁻⁴ and fn = 0.85 for the interaction data from all three organisms to ease the computation; the resulting predictions are used for the GO term enrichment and gene expression analyses. In addition, we have applied our approach to a core interaction dataset including 1374 interactions from S.cerevisiae (Ito et al., 2000; Uetz et al., 2000), 2135 interactions from C.elegans (Li et al., 2004) and 4625 interactions from D.melanogaster (Giot et al., 2003). We set the error rates to fp = 0 and fn = 0.95 because this dataset contains only high-confidence interactions. However, the analysis yields a smaller number of predicted interactions and, measured by sensitivity and specificity, the overall performance with the core dataset does not match that obtained with the dataset including all the interactions (Supplementary Data). Given that the core dataset contains only ~8000 interactions for all three organisms, which is much smaller than the number of expected interactions, the information in the core dataset may be further from complete than that in the full dataset, even though it has a smaller false positive rate, thus limiting the prediction power of our approach.
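
The gap between the two definitions of false positive can be made concrete with a small calculation. The sketch below assumes that fp denotes the probability that a non-interacting pair is nevertheless observed and fn the probability that a truly interacting pair is missed, and it reads the parenthetical definition above (incorrect observations divided by all observations) as the alternative one. The pair counts are invented round numbers and are not intended to reproduce the magnitude reported here; only the two rates are taken from the text.

```python
def expected_observation_counts(n_pairs, n_true, fp, fn):
    """Expected numbers of correct and incorrect observed interactions.

    Assumed definitions (not confirmed by the text above):
      fp = Pr(pair observed | pair does not interact)
      fn = Pr(pair not observed | pair interacts)
    """
    true_observed = n_true * (1.0 - fn)
    false_observed = (n_pairs - n_true) * fp
    return true_observed, false_observed


def fraction_of_observations_incorrect(n_pairs, n_true, fp, fn):
    """Alternative definition: incorrect observations / all observations."""
    true_observed, false_observed = expected_observation_counts(
        n_pairs, n_true, fp, fn
    )
    return false_observed / (true_observed + false_observed)


# Hypothetical counts: one million candidate pairs, 5000 of them true.
N_PAIRS = 1_000_000
N_TRUE = 5_000
FP, FN = 3e-4, 0.85  # the rates quoted in the text

share_incorrect = fraction_of_observations_incorrect(N_PAIRS, N_TRUE, FP, FN)
print(f"per-pair false positive rate: {FP}")
print(f"share of observed interactions that are incorrect: {share_incorrect:.3f}")
```

Because non-interacting pairs vastly outnumber interacting ones, even a very small per-pair rate can correspond to a sizeable share of incorrect observations, which is why the two definitions yield such different numbers.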

We predict protein–protein interactions through the annotated protein domains, which mediate protein interactions through direct physical contact. Therefore, our goal, precisely defined, is to predict whether two proteins interact directly and physically, not whether they belong to the same complex. In this study, we have focused on the integration of two-hybrid data from different organisms. The predictions reveal potential physical interactions between proteins, but some of these may not be biologically relevant under physiological conditions. In principle, other types of data can be integrated into the approach; for example, the integration of data from high-throughput mass spectrometry protein complex purification, along with correlated mRNA expression profiles, is expected to extend our predictions, yielding functionally related protein pairs.
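
One common way to pass from domain-level to protein-level probabilities is the form used by Deng et al. (2002), in which two proteins interact if at least one of their domain pairs interacts and domain pairs are treated as independent. The Python sketch below illustrates that combination rule under those assumptions; the function name, the domain labels and the probabilities are hypothetical, and the sketch should not be read as the exact implementation used in this study.

```python
def protein_interaction_probability(domains_i, domains_j, domain_pair_prob):
    """Pr(proteins i and j interact) = 1 - prod(1 - lambda_mn).

    domains_i, domains_j: iterables of domain identifiers for each protein.
    domain_pair_prob: dict mapping frozenset({m, n}) to the interaction
        probability lambda_mn of that domain pair; missing pairs are
        treated as 0 (an assumption for this sketch).
    """
    # Each unordered domain pair is counted once.
    pairs = {frozenset((m, n)) for m in domains_i for n in domains_j}
    prob_no_interaction = 1.0
    for pair in pairs:
        prob_no_interaction *= 1.0 - domain_pair_prob.get(pair, 0.0)
    return 1.0 - prob_no_interaction


# Hypothetical domains and probabilities.
toy_lambda = {
    frozenset(("D1", "D3")): 0.2,
    frozenset(("D2", "D3")): 0.5,
}
p_interact = protein_interaction_probability(["D1", "D2"], ["D3"], toy_lambda)
print(f"Pr(interaction) = {p_interact:.2f}")
```

Under this rule, a protein pair with many annotated domain pairs can reach a high interaction probability even when each individual domain pair has a modest one, and proteins without annotated domains cannot be scored at all.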

The basic principle of our approach is that domain–domain interactions are likely conserved across different organisms, which allows us to borrow information from diverse organisms to improve the predictions of protein–protein interactions in a given organism. Although our current approach has indeed led to improved predictions, it can be refined further. For example, we may first improve the predictions of protein–protein interactions within each organism by integrating diverse data sources from that organism (e.g. Jansen et al., 2003; Lin et al., 2004) and then perform a joint analysis across organisms based on the results of these integrated analyses. In addition, the current approach estimates the interaction probability of each domain–domain pair separately; these probabilities may be estimated more accurately by pooling information from domains with similar structures or functions. Finally, a Bayesian approach may be adopted both to incorporate prior information on domain–domain interactions and to better infer domain–domain interaction probabilities.


    Acknowledgments
 
This research was supported in part by National Science Foundation grant DMS-0241160 and Y.L. was supported by the NIH Institutional Training Grants for Informatics Research.

Conflict of Interest: none declared.

Received on March 2, 2005; revised on April 14, 2005; accepted on May 6, 2005

    REFERENCES

    Aloy, P., et al. (2004) Structure-based assembly of protein complexes in yeast. Science, 303, 2026–2029.

    Bader, J.S., et al. (2004) Gaining confidence in high-throughput protein interaction networks. Nat. Biotechnol., 22, 78–85.

    Bateman, A., et al. (2004) The Pfam protein families database. Nucleic Acids Res., 32, D138–D141.

    Dempster, A.P., et al. (1977) Maximum likelihood from incomplete data via the EM algorithm. J. R. Statist. Soc. B, 39, 1–38.

    Deng, M., et al. (2002) Inferring domain-domain interactions from protein-protein interactions. Genome Res., 12, 1540–1548.

    Enright, A.J., et al. (1999) Protein interaction maps for complete genomes based on gene fusion events. Nature, 402, 86–90.

    Gavin, A.C., et al. (2002) Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415, 141–147.

    Ge, H., et al. (2001) Correlation between transcriptome and interactome mapping data from Saccharomyces cerevisiae. Nat. Genet., 29, 482–486.

    Giot, L., et al. (2003) A protein interaction map of Drosophila melanogaster. Science, 302, 1727–1736.

    Goh, C.S. and Cohen, F.E. (2002) Co-evolutionary analysis reveals insights into protein-protein interactions. J. Mol. Biol., 324, 177–192.

    Gomez, S.M., et al. (2001) Probabilistic prediction of unknown metabolic and signal-transduction networks. Genetics, 159, 1291–1298.

    Gomez, S.M., et al. (2003) Learning to predict protein-protein interactions from protein sequences. Bioinformatics, 19, 1875–1881.

    Ho, Y., et al. (2002) Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature, 415, 180–183.

    Hughes, T.R., et al. (2000) Functional discovery via a compendium of expression profiles. Cell, 102, 109–126.

    Iossifov, I., et al. (2004) Probabilistic inference of molecular networks from noisy data sources. Bioinformatics, 20, 1205–1213.

    Ito, T., et al. (2001) A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl Acad. Sci. USA, 98, 4569–4574.

    Jansen, R., et al. (2003) A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science, 302, 449–453.

    Kemmeren, R., et al. (2002) Protein interaction verification and functional annotation by integrated analysis of genome-scale data. Mol. Cell., 9, 1133–1143.

    Letunic, I., et al. (2004) SMART 4.0: towards genomic data integration. Nucleic Acids Res., 32, D142–D144.

    Li, S., et al. (2004) A map of the interactome network of the metazoan C.elegans. Science, 303, 540–543.

    Lin, N., et al. (2004) Information assessment on predicting protein-protein interactions. BMC Bioinformatics, 5, 154.

    Lu, L., et al. (2003) Multimeric threading-based prediction of protein-protein interactions on a genomic scale: application to the Saccharomyces cerevisiae proteome. Genome Res., 13, 1146–1154.

    Marcotte, E.M., et al. (2001) Mining literature for protein-protein interactions. Bioinformatics, 17, 359–363.

    Matunis, M.J., et al. (1998) SUMO-1 modification and its role in targeting the Ran GTPase-activating protein, RanGAP1, to the nuclear pore complex. J. Cell Biol., 140, 499–509.

    Mewes, H.W., et al. (2004) MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Res., 32, D41–D44.

    Mrowka, R., et al. (2001) Is there a bias in proteome research? Genome Res., 11, 1971–1973.

    Papin, J. and Subramaniam, S. (2004) Bioinformatics and cellular signaling. Curr. Opin. Biotechnol., 15, 78–81.

    Pazos, F. and Valencia, A. (2001) Similarity of phylogenetic trees as indicator of protein-protein interaction. Protein Eng., 14, 609–614.

    Ramani, A.K. and Marcotte, E.M. (2003) Exploiting the co-evolution of interacting proteins to discover interaction specificity. J. Mol. Biol., 327, 273–284.

    Scott, S.V., et al. (2000) Apg13p and Vac8p are part of a complex of phosphoproteins that are required for cytoplasm to vacuole targeting. J. Biol. Chem., 275, 25840–25849.

    Souza, A.C., et al. (2002) Evidence for a modulation of neutral trehalase activity by Ca2+ and cAMP signaling pathways in Saccharomyces cerevisiae. Braz. J. Med. Biol. Res., 35, 11–16.

    Spellman, P.T., et al. (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell., 9, 3273–3297.

    Sprinzak, E. and Margalit, H. (2001) Correlated sequence-signatures as markers of protein-protein interaction. J. Mol. Biol., 311, 681–692.

    Stade, K., et al. (2002) A lack of SUMO conjugation affects cNLS-dependent nuclear protein import in yeast. J. Biol. Chem., 277, 49554–49561.

    Tsoka, S. and Ouzounis, C.A. (2000) Prediction of protein interactions: metabolic enzymes are frequently involved in gene fusion. Nat. Genet., 26, 141–142.

    Tucker, C.L., et al. (2001) Towards an understanding of complex protein networks. Trends Cell Biol., 11, 102–106.

    Uetz, P., et al. (2000) A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature, 403, 623–627.

    Vazquez, A., et al. (2003) Global protein function prediction from protein-protein interaction networks. Nat. Biotechnol., 21, 697–700.

    von Mering, C., et al. (2002) Comparative assessment of large-scale data sets of protein-protein interactions. Nature, 417, 399–403.

    Walhout, A.J., et al. (2000) Protein interaction mapping in C.elegans using proteins involved in vulval development. Science, 287, 116–122.

    Wang, J. (2002) Protein recognition by cell surface receptors: physiological receptors versus virus interactions. Trends Biochem. Sci., 27, 122–126.





