Bioinformatics Advance Access originally published online on February 4, 2005
Bioinformatics 2005 21(9):2043-2048; doi:10.1093/bioinformatics/bti305
Accurate extraction of functional associations between proteins based on common interaction partners and common domains
1Graduate School of Information Sciences, Nara Institute of Science and Technology 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan
2Department of Computational Biology, Faculty of Frontier Sciences, The University of Tokyo 5-1-5 Kashiwanoha, Kashiwa, Chiba 277-8561, Japan
3Computational Biology Research Center, The National Institute of Advanced Industrial Science and Technology Aomi Frontier Building 2-43 Aomi, 17F, Koto-ku, Tokyo 135-0064, Japan
*To whom correspondence should be addressed.
| Abstract |
|---|
Motivation: Genomic and proteomic approaches have accumulated a huge amount of data which provide clues to protein function. However, interpreting single omic data for predicting uncharacterized protein functions has been a challenging task, because the data contain a lot of false positives. To overcome this problem, methods for integrating data from various omic approaches are needed for more accurate function prediction.
Results: We have developed a method that extracts functionally similar proteins with high confidence by integrating protein-protein interaction data and domain information. We used this method to analyze publicly available data from Saccharomyces cerevisiae. We identified 1042 functional associations, involving 765 proteins of which 98 (12.8%) had no previously ascribed function. Our method extracts functionally similar protein pairs more accurately than conventional methods, and can be used to predict functions for previously uncharacterized proteins. It can also be applied to protein-protein interaction data for any species.
Contact: okada-k{at}cb.k.u-tokyo.ac.jp
| 1 INTRODUCTION |
|---|
Although genomic and proteomic approaches have accumulated a huge amount of data which provide clues to protein function, annotating uncharacterized proteins is still one of the most challenging problems of the post-genomic era. All omic approaches have intrinsic shortcomings, such as false positives and negatives; furthermore, data emerging from any single omic approach can provide only crude indications of protein function. Methods for integrating data from various genomic and proteomic approaches are needed to build more robust biological hypotheses (Vidal, 2001). For example, focusing on the intersection of protein-protein interaction datasets generated by different kinds of technologies reduced the error rates in these data (von Mering et al., 2002). It has also been reported that comparing transcriptome profiling and protein-protein interaction data reveals a correlation between the two sets of data (Ge et al., 2003). To predict uncharacterized protein functions, we introduce a strategy that integrates protein-protein interaction data and domain information.
Protein-protein interaction data, which are usually generated by two-hybrid or pull-down assays, are indicative of protein complexes and/or signal transduction pathways, and have been generated for several model organisms (Uetz et al., 2000; Walhout and Vidal, 2001; Giot et al., 2003; Li et al., 2004). Although protein-protein interaction data are not yet available for other species, interaction mapping projects are likely to be extended to other organisms in the near future. The availability of these high-throughput data enables us to study protein-protein interactions not only between individual proteins or small complexes, but also throughout the entire protein-protein interaction network, and computational approaches have been developed to interpret the resulting large-scale protein-protein interaction maps. These include the k-core method for identifying subsets of interconnected proteins, in which each protein has at least k interactions (Bader and Hogue, 2002), and a method for the assignment of protein functions based on global connectivity patterns of a protein network (Vazquez et al., 2003).
These methods are based on the idea that a protein-protein interaction reflects common function. However, the true positive prediction rate for protein functions in these studies is limited: error rates are estimated to be >50% (Deane et al., 2002). As an alternative approach, Samanta and Liang (2003) focused on proteins having common interaction partners. The underlying idea of this approach is that if two proteins share common interaction partners, they should have close functional associations.
Domains are defined as structurally compact, independently folding parts of protein molecules, and are viewed as evolutionarily conserved structural units (Ponting and Russell, 2002). Therefore, the function of a protein can often be regarded as the interactions involving its domains. For example, rod phosphodiesterase 6 should function in SH3-mediated cellular pathways because its proline-rich domain interacts with SH3-containing proteins (Morin et al., 2003). In such cases, a part of each protein may be conserved as a domain determining interaction partners. Thus, protein-protein interactions and domains should be closely related to each other (Sprinzak and Margalit, 2001; Ng et al., 2003). Based on this idea, Lappe et al. (2001) proposed a method for assigning SCOP folds to pairs of proteins that have common interaction partners. By developing an integrative strategy that focuses on proteins having common interaction partners and on domain information, we show here that functionally associated proteins can be extracted with higher confidence. After testing our method, we assigned functions to previously uncharacterized proteins encoded by genes in the Saccharomyces cerevisiae genome.
| 2 MATERIALS AND METHODS |
|---|
2.1 Protein-protein interaction data
Pairwise protein-protein interaction data among S.cerevisiae proteins were obtained from the MIPS website (http://www.mips.gsf.de/). The file used in the present study was PPI_120803.tab, which contains two different types of protein-protein interaction data: physical (approximately 85% of the interactions) and genetic (approximately 15% of the interactions). We did not distinguish between genetic and physical interactions in the present study. There were 200 cases in which a protein interacted with itself, and these were eliminated before further analysis. The resulting non-redundant protein-protein interaction dataset contained 8479 protein pairs involving 4326 proteins.
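The cleaning step described above (dropping self-interactions and keeping each unordered pair once) might look like the following minimal sketch. It assumes the interactions are already loaded as a list of identifier pairs; the function name `clean_interactions` and the identifiers `P1` to `P3` are hypothetical, and the layout of the MIPS file is not modeled here.

```python
def clean_interactions(raw_pairs):
    """Return the set of unordered pairs without self-interactions.

    raw_pairs: iterable of (protein_a, protein_b) identifier tuples.
    """
    cleaned = set()
    for a, b in raw_pairs:
        if a == b:
            # A protein listed as interacting with itself is discarded.
            continue
        # frozenset makes (a, b) and (b, a) the same pair.
        cleaned.add(frozenset((a, b)))
    return cleaned


# Hypothetical toy input: one duplicate in reversed order, one self-interaction.
raw = [("P1", "P2"), ("P2", "P1"), ("P3", "P3")]
pairs = clean_interactions(raw)
proteins = set().union(*pairs)  # proteins that take part in at least one pair
```

Counting `len(pairs)` and `len(proteins)` on the real data would correspond to the pair and protein totals reported in the text.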
2.2 Domain information
The Pfam_ls database (http://www.pfam.wustl.edu/) was used to detect full-length complete domains of proteins. Domain information for all proteins in the protein-protein interaction data was searched for in the database using the program hmmpfam (HMMER 2.3.1 package, from the same website). We used the Pfam TC (trusted cutoff) score to detect true domains in Pfam database entries. In total, 1273 distinct domains, such as the SH3 domain and the LSM domain, were detected among the 4326 proteins in the protein-protein interaction data, and the average length of these domains was 158 amino acids.
2.3 Integrating lists
Figure 1 summarizes the procedures for constructing an integrated table consisting of protein pairs, common domains and number of common interaction partners. The adjacency matrix of the protein-protein interaction data is a matrix with rows and columns labeled by graph vertices, with 1 or 0 in position (vi, vj) according to whether proteins vi and vj interact with each other or not. From the adjacency matrix, we made a common interaction partner (CIP) matrix in which the entry in position (pi, pj) is the number of common interaction partners of proteins pi and pj. From the CIP matrix, we extracted CIP pairs, i.e. pairs of proteins which have at least one common interaction partner.
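One way to obtain such a CIP matrix is to multiply the adjacency matrix by itself: the off-diagonal entry (i, j) of the product counts the proteins that interact with both i and j. The sketch below illustrates this on a hypothetical four-protein network; it assumes a symmetric 0/1 adjacency matrix with a zero diagonal, and the function names are illustrative only.

```python
def cip_matrix(adjacency):
    """Count common interaction partners for every pair of proteins.

    adjacency: symmetric 0/1 square matrix (list of lists), zero diagonal.
    Entry (i, j) of the result is the number of proteins k that interact
    with both i and j; the diagonal is left at 0.
    """
    n = len(adjacency)
    cip = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                cip[i][j] = sum(adjacency[i][k] * adjacency[k][j] for k in range(n))
    return cip


def cip_pairs(cip):
    """Return index pairs (i < j) with at least one common interaction partner."""
    n = len(cip)
    return {(i, j) for i in range(n) for j in range(i + 1, n) if cip[i][j] >= 1}


# Hypothetical network with interactions 0-2, 1-2 and 1-3.
adjacency = [
    [0, 0, 1, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
]
cip = cip_matrix(adjacency)
pairs = cip_pairs(cip)
```

Here proteins 0 and 1 share partner 2, and proteins 2 and 3 share partner 1, so these two pairs are the only CIP pairs.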
After detecting domains for all proteins, we also made a domain matrix, in which a row represents a domain and a column represents a protein. We used the domain matrix to identify common domain (CD) pairs of proteins which have at least one common domain.
Finally, we extracted pairs of proteins which have both common interaction partners and common domains by taking the intersection of the resulting CIP pairs and CD pairs.
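The CD-pair extraction and the final intersection can be sketched as follows. This is a minimal illustration, not the original implementation: the domain matrix is represented as a mapping from each protein to its set of domains (one column of the matrix), and the protein and domain labels are hypothetical.

```python
from itertools import combinations


def cd_pairs(domains_by_protein):
    """Return sorted protein pairs that share at least one domain."""
    pairs = set()
    for p, q in combinations(sorted(domains_by_protein), 2):
        if domains_by_protein[p] & domains_by_protein[q]:
            pairs.add((p, q))
    return pairs


# Hypothetical domain assignments (columns of the domain matrix).
domains_by_protein = {
    "P1": {"D1", "D2"},
    "P2": {"D2"},
    "P3": {"D3"},
    "P4": {"D1"},
}
cd = cd_pairs(domains_by_protein)

# Hypothetical CIP pairs, written as sorted tuples like the CD pairs.
cip = {("P1", "P2"), ("P2", "P3")}

# Pairs with both common interaction partners and common domains.
integrated = cip & cd
```

Storing every pair as a sorted tuple keeps the two sets directly comparable, so the integration step reduces to a set intersection.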
2.4 Protein function classification
A functional classification table for yeast proteins (Mewes et al., 2002) was obtained from the MIPS ftp site (ftp://ftpmips.gsf.de/yeast/catalogues/funcat/). The table includes 14 subcellular classes; these classes were eliminated before further analysis. The evidence file on the MIPS ftp site indicated that, in some cases, the functional categorization of proteins was based on their domains or on high-throughput yeast two-hybrid systems. To avoid biased estimation, these functional categorizations were excluded before further analysis. Using the MIPS functional classification scheme, 2884 of the proteins in the protein-protein interaction data were assigned to 89 characterized classes, and the remaining 1442 proteins to uncharacterized classes.
| 3 RESULTS |
|---|
3.1 Method evaluation
Four types of protein pairs were generated for the evaluation of our method: single linkage pairs of proteins which interact with each other (Fig. 2a), domain linkage pairs of proteins which lack common interaction partners but have common domains (Fig. 2b), interaction partner linkage pairs of proteins which lack common domains but have common interaction partners (Fig. 2c), and interaction partner and domain linkage pairs of proteins which have common interaction partners and common domains (Fig. 2d). For comparison, we also generated total combination linkage pairs, i.e. all possible pairs of the proteins in the protein-protein interaction data.
If one or both proteins in a pair belonged to an uncharacterized class, the pair was excluded from the evaluation. The number of residual pairs, called characterized protein pairs, was represented by N. The individual proteins in the nth pair of this group (1 ≤ n ≤ N) were labeled i_n and j_n. The set of functions of i_n assigned in the MIPS functional scheme was represented by I_n, and that of j_n by J_n. We defined r, the rate of function correspondence between paired proteins, as follows:

$$ r = \frac{1}{N} \sum_{n=1}^{N} \delta(I_n, J_n) $$

where δ(A, B) is the discrete function for sets A and B: δ(A, B) is equal to 1 if A ∩ B ≠ ∅, and to 0 otherwise. A comparison of function correspondence rates is presented in Table 1. For single linkage pairs, r was 47.7%; assignment of protein functions by single linkage alone would thus be doubtful. Similarly, the values of r for interaction partner linkage pairs and domain linkage pairs were 19.7 and 47%, respectively, representing no improvement in reliability but showing that the reliability of functional assignment by domain information alone was comparable to that by protein-protein interaction data.
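As an illustration of how this rate can be computed, the sketch below counts, among the pairs in which both proteins have at least one assigned function, the fraction whose function sets overlap. The function name, the protein labels and the function labels are hypothetical; the result is a fraction, which corresponds to the percentages quoted in the text after multiplying by 100.

```python
def correspondence_rate(pairs, functions):
    """Fraction of characterized pairs whose function sets intersect.

    pairs: iterable of (protein_i, protein_j) tuples.
    functions: dict mapping a protein to its set of function classes;
               a missing or empty set means the protein is uncharacterized.
    Returns None when no characterized pair is available.
    """
    characterized = [
        (i, j) for i, j in pairs if functions.get(i) and functions.get(j)
    ]
    if not characterized:
        return None
    # A pair scores 1 when the two function sets share at least one class.
    hits = sum(1 for i, j in characterized if functions[i] & functions[j])
    return hits / len(characterized)


# Hypothetical annotations; protein "D" is uncharacterized.
functions = {
    "A": {"transport"},
    "B": {"transport", "metabolism"},
    "C": {"metabolism"},
    "D": set(),
}
pairs = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "D")]
r = correspondence_rate(pairs, functions)
```

In this toy case the pair ("A", "D") is excluded, and two of the three remaining pairs share a function class.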
In contrast, the function correspondence rate for interaction partner and domain linkage pairs, based on common interaction partners and common domains, was 77%, much higher than for any of the three stand-alone pair analyses. The reliability of predicting uncharacterized protein functions therefore was improved greatly by integrating protein-protein interaction data and domain information.
3.2 Uncharacterized proteins
3.2.1 Annotation of uncharacterized proteins
Our method extracted 1042 protein pairs which had common interaction partners and common domains, involving 765 proteins of which 98 were uncharacterized. Among these protein pairs, 201 contained at least one uncharacterized protein. It was possible to assign functions to 76 of the 98 uncharacterized proteins because the function of the other protein in the pair is known; samples of these 76 are listed in Table 2. For example, the previously uncharacterized protein YHR105w can be predicted to function in vacuolar transport, because protein YGL212w, with which YHR105w pairs, is known to be involved in vacuolar transport. The fact that they both possess the PF00787 domain, which occurs in a variety of eukaryotic proteins and binds to phosphoinositides, further supports this prediction.
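The annotation step can be pictured as copying the known functions of the characterized protein in a pair to its uncharacterized partner. The following sketch is one simple reading of that step, not the original implementation; the function name is hypothetical, and the example reuses the YHR105w/YGL212w pair from the text together with a pair in which neither protein has a known function.

```python
def transfer_annotations(pairs, functions):
    """Predict functions for uncharacterized proteins from their paired partners.

    pairs: iterable of (protein_a, protein_b) tuples extracted by the method.
    functions: dict mapping characterized proteins to their function sets.
    Returns a dict mapping each uncharacterized protein to predicted functions.
    """
    predicted = {}
    for a, b in pairs:
        # Try both directions: a annotated from b, and b annotated from a.
        for unknown, known in ((a, b), (b, a)):
            if not functions.get(unknown) and functions.get(known):
                predicted.setdefault(unknown, set()).update(functions[known])
    return predicted


functions = {"YGL212w": {"vacuolar transport"}}
pairs = [
    ("YHR105w", "YGL212w"),  # uncharacterized paired with characterized
    ("YML013w", "YMR067c"),  # both uncharacterized: nothing to transfer
]
predicted = transfer_annotations(pairs, functions)
```

Pairs in which both proteins are uncharacterized yield no prediction, which is why only some of the uncharacterized proteins can be annotated this way.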
3.2.2 Clusters of uncharacterized proteins
As shown in Figure 3, we identified 19 uncharacterized proteins in eight clusters which represent potentially new functions. Proteins YNL056w, YNL099c, YCR095c and YNL032w had the domain PF03162 (Fig. 3a), which is annotated as tyrosine phosphatase family (Pfam database, http://www.sanger.ac.jp/). Likewise, proteins YMR192w and YPL249c shared the domain PF00566 (Fig. 3b), annotated as a TBC domain and implying that these proteins are GTPase activators; so too for proteins YNL293w and YOL112w (Fig. 3c). Proteins YGR031w and YGR110w had the domain PF00561 (Fig. 3d) which is annotated as alpha/beta hydrolase fold and has a catalytic function in a very wide range of hydrolytic enzymes (Pfam database). Proteins YBL056w, YBR125c and YER089c had the domain PF00481 (Fig. 3e) which is annotated as protein phosphatase 2C and is found in protein serine/threonine phosphatases (Pfam database).
On the other hand, proteins YML013w and YMR067c had the domain PF00789 (Fig. 3f), annotated as a UBX domain but whose function is unknown (Pfam database). Other functionally unknown domains are conserved hypothetical ATP-binding protein (PF03029; Fig. 3g), shared between YLR243w and YOR262w, TGS domain (PF02824; Fig. 3h) and GTP1/OBG family (PF01018; Fig. 3h), shared between YAL036c and YGR173w.
| 4 DISCUSSION |
|---|
Detecting a physical interaction between proteins is considered one of the most powerful approaches for inferring uncharacterized protein functions. However, the term physical interaction encompasses various types of association including ligand-receptor interaction, pathway scaffolding and molecular machines (Walhout and Vidal, 2001). It is also recognized that such datasets tend to contain false positives and false negatives. Therefore, it is important to use suitable algorithms for extracting target interactions. Focusing on pairs of proteins that have common interaction partners is a strategy for extracting functionally associated proteins from protein-protein interaction networks. However, on its own this strategy could not extract functional associations with a high function correspondence rate (19.7%; Table 1). To compare interaction partner linkage pairs and interaction partner and domain linkage pairs in detail, we calculated the function correspondence rates of pairs for each number of common interaction partners, as shown in Figure 4. This reveals that as the number of common interaction partners decreases, the function correspondence rate also decreases. It seems that pairs having fewer common interaction partners occur more frequently among false positives (Samanta and Liang, 2003). By focusing on common interaction partners as well as common domains, the correspondence rate in pairs having fewer common partners was improved. Thus, our method can extract functionally associated protein pairs with a relatively high rate, even when the number of common interaction partners is low.
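A breakdown of the function correspondence rate by number of common interaction partners can be sketched as below. The input is assumed to be a mapping from each protein pair to its number of common interaction partners, plus the function sets of the proteins; all names and values are hypothetical and the sketch does not reproduce the values plotted in Figure 4.

```python
from collections import defaultdict


def rate_by_partner_count(pair_counts, functions):
    """Function correspondence rate for each number of common partners.

    pair_counts: dict mapping (protein_i, protein_j) to the number of
                 common interaction partners of that pair.
    functions: dict mapping a protein to its set of function classes.
    Pairs with an uncharacterized protein are skipped.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for (i, j), k in pair_counts.items():
        fi, fj = functions.get(i), functions.get(j)
        if not fi or not fj:
            continue
        totals[k] += 1
        if fi & fj:
            hits[k] += 1
    return {k: hits[k] / totals[k] for k in sorted(totals)}


# Hypothetical data; protein "D" is uncharacterized.
functions = {"A": {"f1"}, "B": {"f1", "f2"}, "C": {"f2"}, "D": set()}
pair_counts = {("A", "B"): 1, ("A", "C"): 1, ("B", "C"): 3, ("A", "D"): 2}
rates = rate_by_partner_count(pair_counts, functions)
```

Running the same function on the interaction partner linkage pairs and on the interaction partner and domain linkage pairs would give the two curves to compare.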
For predicting SCOP folds, Lappe et al. (2001) proposed a method based on the idea that pairs having common interaction partners would be assigned the same SCOP fold. However, according to our results, only a few of the pairs which had common interaction partners also had common domains (approximately 1%). This means that protein pairs which have common interaction partners do not necessarily have common domains. Taking this result into account may help to improve the performance of fold assignment.
We also used domain information to make protein pairs. However, the function correspondence rate for domain linkage pairs was almost as low as for the other two types of pairs (47%; Table 1). Domain information should thus be interpreted with caution. In particular, attention must be paid to domains which are widely conserved even across different function classes. Integrating domain information with protein-protein interaction data improved the function correspondence rate. That is, taking common interaction partners into account filters out pairs of proteins that share such domains but belong to different function classes, which should contribute to the improvement of the function correspondence rate.
Our method extracted 1042 protein pairs which had common interaction partners and common domains. As shown in Figure 5, in a large proportion of these pairs the two proteins do not have strong sequence similarity to each other. In particular, only about 17.3% of the uncharacterized proteins in these extracted pairs have strong similarity to their paired characterized proteins (E-value < E-10). Our strategy should therefore contribute to effective gene characterization for sequenced genomes as a second step in gene annotation, augmenting initial assignments made by sequence-based approaches.
To examine the effect of false positives and negatives, we added or deleted edges randomly without changing the power-law nature of the network. Figure 6 indicates that the deletion of random edges affected the performance of our method more adversely than the addition of random edges. This is because most of the added edges do not create common interaction partners and can therefore be ignored. In addition, interaction partner and domain linkage pairs were relatively unaffected by the addition or deletion of random edges, compared to interaction partner linkage pairs, as also shown in Figure 6. This indicates that pairs which have common domains are not strongly affected by randomly added or deleted edges, because a domain can be seen as determining its interaction partners.
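A perturbation experiment of this kind can be sketched as follows. This is only an illustration under a strong assumption: edges are deleted and added uniformly at random, and nothing in the code checks or enforces the power-law degree distribution mentioned in the text, since the exact procedure used for that is not described. Function names and the toy network are hypothetical, and edges are assumed to be stored as sorted tuples.

```python
import random


def delete_random_edges(edges, fraction, rng):
    """Return a copy of the edge set with a fraction of edges removed."""
    ordered = sorted(edges)
    keep = len(ordered) - round(fraction * len(ordered))
    return set(rng.sample(ordered, keep))


def add_random_edges(edges, nodes, count, rng):
    """Return a copy of the edge set with `count` new random edges added."""
    ordered_nodes = sorted(nodes)
    result = set(edges)
    target = len(result) + count
    max_edges = len(ordered_nodes) * (len(ordered_nodes) - 1) // 2
    if target > max_edges:
        raise ValueError("not enough unconnected pairs to add that many edges")
    while len(result) < target:
        a, b = rng.sample(ordered_nodes, 2)  # two distinct proteins
        result.add((min(a, b), max(a, b)))   # keep tuples sorted
    return result


# Hypothetical toy network.
nodes = {"A", "B", "C", "D"}
edges = {("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")}
rng = random.Random(0)  # fixed seed so a run can be repeated
deleted = delete_random_edges(edges, 0.5, rng)
added = add_random_edges(edges, nodes, 1, rng)
```

Recomputing the CIP pairs and the function correspondence rate on `deleted` and `added` for increasing perturbation levels would give curves of the kind compared in Figure 6.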
We found that our method, which integrates protein-protein interaction data with domain information, is a reliable approach to protein function prediction because of its high true positive prediction rate and its robustness to randomly added or deleted edges, and that it has a high potential to overcome the limitations of relying on individual omic datasets. In the present study we used yeast to test our method, but it can also be applied to protein-protein interaction data for any species.
| Acknowledgments |
|---|
We thank Ian Smith (Nara Institute of Science and Technology) for reviewing the manuscript, and CBRC (Computational Biology Research Center; http://www.cbrc.jp) members for all kinds of support and discussions. We also thank two reviewers for helpful comments.
Received on September 17, 2004; revised on January 27, 2005; accepted on January 31, 2005
| REFERENCES |
|---|
Bader, G.D. and Hogue, C.W.V. (2002) Analyzing yeast protein-protein interaction data obtained from different sources. Nat. Biotechnol., 20, 991-997.
Deane, C.M., et al. (2002) Protein interactions: Two methods for assessment of the reliability of high throughput observations. Mol. Cell Proteomics, 1, 349-356.
Ge, H., et al. (2003) Integrating omic information: a bridge between genomics and systems biology. Trends Genet., 19, 551-560.
Giot, L., et al. (2003) A protein interaction map of Drosophila melanogaster. Science, 302, 1727-1736.
Hunter, T. and Plowman, G.D. (1997) The protein kinases of budding yeast: six score and more. Trends Biochem. Sci., 22, 18-22.
Lappe, M., et al. (2001) Generating protein interaction maps from incomplete data: application to fold assignment. Bioinformatics, 17, S149-S156.
Li, S., et al. (2004) A map of the interactome network of the metazoan C. elegans. Science, 303, 540-543.
Mayer, B.J. (2001) SH3 domains: complexity in moderation. J. Cell Sci., 114, 1253-1263.
Mewes, H.W., et al. (2002) MIPS: a database for genomes and protein sequences. Nucleic Acids Res., 30, 31-34.
Morin, F., et al. (2003) A proline-rich domain in the gamma subunit of phosphodiesterase 6 mediates interaction with SH3-containing proteins. Mol. Vis., 9, 449-459.
Ng, S., et al. (2003) InterDom: a database of putative interacting protein domains for validating predicted protein interactions and complexes. Nucleic Acids Res., 31, 251-254.
Ponting, C.P. and Russell, R.R. (2002) The natural history of protein domains. Annu. Rev. Biophys. Biomol. Struct., 31, 45-71.
Samanta, M.P. and Liang, S. (2003) Predicting protein functions from redundancies in large-scale protein interaction networks. Proc. Natl Acad. Sci. USA, 100, 12579-12583.
Sprinzak, E. and Margalit, H. (2001) Correlated sequence-signatures as markers of protein-protein interaction. J. Mol. Biol., 311, 681-692.
Tatusova, T.A. and Madden, T.L. (1999) BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences. FEMS Microbiol. Lett., 174, 247-250.
Uetz, P., et al. (2000) A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature, 403, 623-627.
Vazquez, A., et al. (2003) Global protein function prediction from protein-protein interaction networks. Nat. Biotechnol., 21, 697-700.
Vidal, M. (2001) A biological atlas of functional maps. Cell, 104, 333-339.
von Mering, C., et al. (2002) Comparative assessment of large-scale data sets of protein-protein interactions. Nature, 417, 399-403.
Walhout, A.J.M. and Vidal, M. (2001) Protein interaction maps for model organisms. Nat. Rev. Mol. Cell Biol., 2, 55-62.









