
Bioinformatics 2008 24(16):i35-i41; doi:10.1093/bioinformatics/btn290

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

An integrative approach for predicting interactions of protein regions

Sven-Eric Schelhorn, Thomas Lengauer and Mario Albrecht*

Max Planck Institute for Informatics, Campus E1.4, 66123 Saarbrücken, Germany

*To whom correspondence should be addressed.


    ABSTRACT
 

Motivation: Protein–protein interactions are commonly mediated by the physical contact of distinct protein regions. Computational identification of interacting protein regions aids in the detailed understanding of protein networks and supports the prediction of novel protein interactions and the reconstruction of protein complexes.

Results: We introduce an integrative approach for predicting protein region interactions using a probabilistic model fitted to an observed protein network. In particular, we consider globular domains, short linear motifs and coiled-coil regions as potential protein-binding regions. Possible cooperations between multiple regions within the same protein are taken into account. A fine-grained confidence system allows for varying the impact of specific protein interactions and region annotations on the modeling process. We apply our prediction approach to a large training set using a maximum likelihood method, compare different scoring functions for region interactions and validate the predicted interactions against a collection of experimentally observed interactions. In addition, we analyze prediction performance with respect to the inclusion of different region types, the incorporation of confidence values for training data and the utilization of predicted protein interactions.

Contact: mario.albrecht{at}mpi-inf.mpg.de

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
Protein-protein interactions (PPIs) can be attributed to physical contacts between specific protein regions (PRs) of interacting proteins. Therefore, analysis and prediction of protein region interactions (PRIs) are of great use for studying protein function, structure and evolution (Bornberg-Bauer et al., 2005; Orengo and Thornton, 2005), for exploring protein networks, signaling pathways and the effects of post-translational modifications in PRs (Albrecht et al., 2005; Pawson and Nash, 2003; Seet et al., 2006) and for investigating the assembly of protein complexes and their 3D structures (Aloy and Russell, 2006; Bahadur and Zacharias, 2007). Frequently, PRs also constitute binding sites of nucleic acids, metabolites or drugs (Bhattacharyya et al., 2006; Santonico et al., 2005). Apart from that, PRIs may also be used for quality assessment of PPIs (Ramírez et al., 2007; Schlicker et al., 2007). At least three types of protein binding regions are known to be involved in PPIs and are detailed in the following: globular domains, peptides as short linear motifs (SLiMs) and coiled-coil regions.

1.1 Protein region types
Globular domains are evolutionarily conserved building blocks of proteins and consist of more than 30 residues, which independently fold into a stable compact structure (Bornberg-Bauer et al., 2005). They are often associated with specific functions and are well known to form interactions with regions in other proteins (Pawson and Nash, 2003). However, experimentally observed PRIs between globular domains can only explain up to 20% of the PPIs in major eukaryotic model organisms (Itzhaki et al., 2006).

Proteins may also contain non-globular SLiMs, which do not fold into tertiary structures and are located in intrinsically disordered parts of the protein outside globular domains (Dyson and Wright, 2005). SLiMs can often be represented by short consensus sequences and are assumed to mediate transient PPIs or act as recognition elements in temporary complexes (Puntervoll et al., 2003). The number of PPIs mediated by SLiMs is estimated to be up to 15–40% of all interactions in the human interactome (Neduva and Russell, 2005).

Coiled-coil regions are specific structural motifs in proteins that are ubiquitous in the cell and are known to facilitate helical protein dimerization (Lupas and Gruber, 2005).

As most eukaryotic proteins contain more than one binding region, sophisticated cellular functions may involve cooperative binding of multiple PRs in interacting proteins (Bornberg-Bauer et al., 2005; Moza et al., 2006). Statistical arguments about the frequency of domain combinations in interacting proteins support the essential role of cooperative PRIs as mediators of PPIs (Han et al., 2005; Wang et al., 2007b).

1.2 Related work
Most prediction approaches for PRIs use experimentally observed protein networks as training data and aim at discovering all PRIs that underlie the whole observed PPI network. Since the number of possible PRIs is huge and training data are usually noisy, efficient and noise-tolerant methods have to be applied to successfully derive PRIs. The prediction problem becomes even more difficult if not only single PRIs, but also region cooperations are considered. Here, cooperation means that two or more regions in the same protein jointly interact with a region in another protein. In addition, preparing suitable training and test sets of sufficient size is not easy in practice as only relatively few PRIs have been experimentally observed until now. Moreover, PPI data are spread over several databases worldwide, which use different identifiers for the interacting proteins.

One popular class of approaches for predicting PRIs consists of association-based methods that rely on evidence counting of PRs in interacting and non-interacting proteins (Sprinzak and Margalit, 2001). This approach has been further improved by the integration of data from multiple sources (Ng et al., 2003), by using more elaborate scoring systems (Chen et al., 2006; Kim et al., 2003), and by the inclusion of cooperative domains (Han et al., 2005).

Recent probabilistic approaches use a maximum likelihood (ML) method to predict PRIs that best explain an observed PPI network. While early realizations were limited to small training sets of experimentally observed PPIs (Deng et al., 2002), succeeding methods included larger amounts of PPIs (Liu et al., 2005) and employed reference sets of non-interacting PRIs in addition to new scoring functions (Riley et al., 2005). The most recent ML-based approaches introduce integrative models that combine different data sources in a Bayesian fashion (Lee et al., 2006) and incorporate fine-grained modeling of experimental errors (Wang et al., 2007a).

Several non-probabilistic approaches are based on linear programming (LP) and model the prediction of PRIs as an optimization problem of constructing a network of PRIs that yield a given PPI network (Hayashida et al., 2005). The LP optimization criterion has been refined with different scoring and validation methods (Guimaraes et al., 2006; Guimaraes and Przytycka, 2008) and extended to model cooperative domains (Wang et al., 2007b).

Other approaches for predicting PRIs make use of evolutionary methods (Jothi et al., 2006; Kann et al., 2007) or models based on statistical analysis of protein super-families (Nye et al., 2004) and phylogeny (Pagel et al., 2006, 2008). Further approaches that were primarily developed for PPI prediction, but also utilize PRIs, are based on support vector machines (Bock and Gough, 2001), probabilistic networks (Gomez and Rzhetsky, 2002), random decision forests (Chen and Liu, 2005), weighting networks (Wuchty, 2006) and set cover problems (Huang et al., 2007).

1.3 Novel methodological features
In this work, we present Integrative Prediction of PRIs (IPPRI), a novel approach to the prediction and validation of interactions between protein-binding regions. Most of the related methods described earlier focus on predicting highly reliable PRIs with respect to a reference set of observed PRIs. Since available reference sets are still small, many predicted PRIs cannot be validated. Therefore, the following presentation of our IPPRI approach does not focus primarily on enhancing prediction accuracy, but instead concentrates on incorporating the new region types SLiMs and coiled coils, which are as important as globular domains for the formation of PPIs. In addition, we compile sizable PPI training sets for eukaryotic PRIs, compare a new region-scoring function for PRIs with established PRI-scoring functions, and apply a collection of validation methods including a novel suitability test of PRIs for protein complex reconstruction. We chose a probabilistic approach to PRI modeling because it provides the flexibility required for methodological extensions such as the consideration of confidence values for both PPIs and PRs and the modeling of cooperative interactions between multiple protein regions within a single protein. Such an approach is also known to deliver good results in acceptable runtime, which enables comprehensive benchmarking procedures and parameter optimizations (Lee et al., 2006; Riley et al., 2005).


    2 METHODS
2.1 Protein interaction data
The probabilistic method described in this section uses PPIs as training data. As we strive for the eventual application of the predicted PRIs to human proteins, five eukaryotic taxa (Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, Mus musculus and Homo sapiens) were selected and experimentally observed, binary PPIs for these taxa were obtained from five major public databases: BioGRID (Stark et al., 2006), DIP (Chatr-aryamontri et al., 2007), HPRD (Peri et al., 2003), IntAct (Kerrien et al., 2007) and MINT (Chatr-aryamontri et al., 2007). Please refer to Supplementary Data File 1 for an explanation of the applied data integration procedure using PICR (Cote et al., 2007) and detailed statistics about the data sets. Since certain PRIs may exist in vivo that have not been observed yet in experiment, we additionally include computationally predicted binary PPIs of the five taxa from BIOVERSE (McDermott et al., 2005) and HiMAP (Rhodes et al., 2005) into our training set of PPIs. This enables IPPRI to identify new PRIs that solely rely on these predicted PPIs. To take the different reliabilities of experimental and computational PPIs into account, we propose a fine-grained confidence system that assigns a confidence value to each PPI. We used a Gene Ontology (GO)-based functional similarity score for estimating confidences (Schlicker et al., 2006), see Supplementary Data File 1 for details.

2.2 Protein region annotation
Annotation information about the presence of Pfam-A/Pfam-B globular domains and coiled coils in proteins is obtained from Pfam (Bateman et al., 2003; Finn et al., 2005), while SLiMs are annotated by scanning each protein sequence against consensus motifs defined by regular expressions in the eukaryotic linear motif (ELM) resource (Puntervoll et al., 2003). However, as most SLiMs are short and unspecific motifs, this search greatly overestimates the number of true SLiM occurrences in interacting proteins. For this reason, two filtering steps are performed to avoid false positive findings of SLiMs. First, the globular domain filter discards all SLiM occurrences that overlap with known globular domains in the same sequence because SLiMs primarily occur in unstructured parts of the protein (Dyson and Wright, 2005). Second, in the neighborhood filtering step, a SLiM occurrence in a protein is retained only if the interaction partner of this protein contains a known Pfam-A binding domain of this SLiM. The filtering procedure reduces the number of SLiM occurrences found from 142 029 instances to 16 319 instances in our training set. Note that the neighborhood filtering procedure induces a slight bias in the training set towards SLiM interactions that occur in the positive reference set. However, as no reliable procedure for identifying functionally relevant SLiMs in a large set of protein sequences is known to us, we chose to use this filtering step until better methods are available.
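The two filtering steps can be illustrated with a minimal sketch. It is not the IPPRI implementation: the motif name `toy_PxxP`, its pattern, the domain name `toy_domain`, the toy sequence and all function names are hypothetical, and domain boundaries are assumed to be given as half-open residue intervals.

```python
import re
from typing import NamedTuple

class Hit(NamedTuple):
    motif: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive

def scan_motifs(sequence, motifs):
    """Return all regex matches of each motif, including overlapping ones."""
    hits = []
    for name, pattern in motifs.items():
        # The lookahead lets the scanner report overlapping occurrences.
        for m in re.finditer(f"(?=({pattern}))", sequence):
            hits.append(Hit(name, m.start(1), m.end(1)))
    return hits

def globular_domain_filter(hits, domains):
    """Discard hits that overlap any annotated domain interval (start, end)."""
    def overlaps(hit):
        return any(hit.start < d_end and d_start < hit.end
                   for d_start, d_end in domains)
    return [h for h in hits if not overlaps(h)]

def neighborhood_filter(hits, partner_domains, binding_domains):
    """Keep a hit only if the partner carries a known binding domain of the motif."""
    return [h for h in hits
            if binding_domains.get(h.motif, set()) & partner_domains]

# Toy data (all hypothetical).
motifs = {"toy_PxxP": r"P..P"}
binding_domains = {"toy_PxxP": {"toy_domain"}}
sequence = "MAPSLPAAAKPTTPGG"
domains = [(0, 8)]  # one globular domain covering residues 0..7

hits = scan_motifs(sequence, motifs)
outside_domains = globular_domain_filter(hits, domains)
kept = neighborhood_filter(outside_domains, {"toy_domain"}, binding_domains)
```

In this toy case the scan finds two occurrences; the first overlaps the annotated domain and is discarded, and the second survives both filters because the partner carries the assumed binding domain.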

2.3 Region interaction validation
To validate the predicted PRIs, a positive reference set of interactions for all region types was obtained from public databases and literature. 5233 structurally known Pfam-A domain interactions contained in the Protein Data Bank (PDB) (Berman et al., 2003) were taken from the databases iPfam (Finn et al., 2004) and 3DID (Stein et al., 2005). Please refer to Supplementary Data File 1 for details. Here, we ignore all PRIs that are derived solely from structures containing intra-protein domain–domain interactions since IPPRI uses only inter-protein interactions for training. To this end, we mapped the PDB chains of all structures to sequences of the interacting proteins using the E-MSD database (Velankar et al., 2005). Whenever two interacting regions are located in the same chain or in different chains but mapped to the same protein identifier without any sequence overlap, we excluded the corresponding PDB structure complex from our positive reference set. Similarly, we filtered out PDB structure complexes that exhibit crystal packing artifacts by using the Protein Quaternary Structure (PQS) service (Henrick and Thornton, 1998). Overall, 1847 domain interactions were excluded from the positive reference set during the filtering procedure.

To collect experimentally observed interaction partners of SLiMs, we searched the literature indexed by ELM. In addition, we used the DOMINO (Ceol et al., 2006) database of experimentally observed interactions of peptides to extract interactions that involve SLiM consensus motifs. Additionally, cooperative interactions consisting of domain pairs in the same protein interacting with another domain in a second protein within the same PDB structure complex were derived from iPfam and added to the positive reference set.

As proposed in (Riley et al., 2005), we derived a negative reference set of putative non-interactions from all non-interacting domains contained in PDB structures used by iPfam and 3DID. While it is not absolutely certain that these domain interactions never take place in vivo, they form a more suitable negative set than the negative reference set of unobserved interactions that contains all PRIs that are not in the positive reference set.

2.4 Probabilistic method
Following related work (Riley et al., 2005), IPPRI identifies PRs that possess a high interaction propensity by probabilistic modeling of the PPI network at the level of PRIs. To this end, our probabilistic method optimizes a set of parameters so that the observed PPI network containing PPIs of all five taxa is well explained by PRIs.

Formally, our probabilistic method maximizes the expected likelihood L of the observed PPI data given a model P of how PRIs result in PPIs and a set θ of parameters that describe the interaction probabilities of two regions in the model. We find θ by solving the corresponding ML estimation problem with a variant of the expectation maximization (EM) algorithm (Dempster et al., 1977). In the following, an observed PPI between proteins pm and pn from the same taxon will be denoted by pmn=1, while pmn=0 indicates no interaction. All proteins in our training set contain at least one PR, denoted as ri, and ri ∈ pm indicates that pm contains region ri. Two PRs ri and rj that are contained in two different proteins pm and pn, respectively, may form a PRI rij. Since not all PRIs are biologically plausible, for instance, the interaction of two SLiMs is improbable, we introduce an indicator Iij=1 if regions ri and rj can interact in our model and Iij=0 otherwise. The estimated model parameters are the probabilities θij=Pr(rij=1) that two regions ri and rj form an interaction.

Our method applies a variant of the EM algorithm, an iterative two-step procedure that refines the estimates of θij in each iteration. The model parameters θij are initialized with the number of protein pairs (pm,pn), whose proteins contain the regions ri and rj, respectively, and form an observed PPI pmn=1, divided by the overall number of protein pairs that contain the regions. In the expectation step of the EM algorithm, the probability Pmn of a PPI pmn=1 to occur is computed as the disjunction of all PRIs that may interact and potentially mediate pmn (Equation (1)).


Formula 1

(1)
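One common way to compute such a disjunction of independent events is a noisy-OR, Pmn = 1 − Π (1 − θij·Iij), taken over all region pairs with ri in pm and rj in pn. The sketch below illustrates that reading; the noisy-OR form is an assumption based on the description above, and the function name, the dictionary layout and the `allowed` callback (standing in for the indicator Iij) are hypothetical.

```python
from itertools import product
from math import prod

def ppi_probability(regions_m, regions_n, theta, allowed=lambda ri, rj: True):
    """Noisy-OR probability that at least one allowed region pair interacts.

    theta maps an unordered region pair (a frozenset) to its interaction
    probability; pairs missing from theta contribute probability 0.
    """
    factors = [
        1.0 - theta.get(frozenset((ri, rj)), 0.0)
        for ri, rj in product(regions_m, regions_n)
        if allowed(ri, rj)  # plays the role of the indicator I_ij
    ]
    return 1.0 - prod(factors)

# Toy parameters (hypothetical region names and values).
theta = {frozenset(("A", "B")): 0.5, frozenset(("A", "C")): 0.2}
p_mn = ppi_probability(["A"], ["B", "C"], theta)  # 1 - 0.5 * 0.8
```

With no allowed region pair the product is empty and the function returns 0, which matches the intuition that a PPI without any candidate PRI cannot be explained by the model.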
For each PRI rij that may mediate a PPI pmn, let Λ, indexed by rij and pmn, be the estimated interaction propensity of rij within this specific PPI. The expected value of Λ depends on the value of θij and the probability of the PPI to occur (Equation (2)).


Formula 2

(2)
The sum of the expected propensities of a PRI rij over all PPIs that rij may mediate is denoted as Mij, and the loss of expected propensities due to competition with other PRIs for binding is denoted as Nij. The expected values of both variables depend on Λ as shown in Equation (3).


Formula 3

(3)
Let Zij denote the number of protein pairs pm and pn of the same taxon that contain ri and rj, respectively, but for which no PPI was observed so that pmn=0. In the maximization step, Mij, Nij and Zij are used to update the current estimate of θij (Equation (4)).


Formula 4

(4)
Based on these updated estimates, the expected likelihood of the observed PPI data set, the training set, can be computed according to Equation (5).


Formula 5

(5)
The EM algorithm terminates when the likelihood converges, frequently after not more than 30 iterations.

2.5 Extensions to the probabilistic method
The probabilistic method can be extended to predicting cooperative region interactions, that is, PRIs between two cooperative regions in the same protein and another region in a second protein. To this end, all possible pairs of PRs within a protein are identified and added to the probabilistic model as additional, combined regions that represent possible region cooperations.
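Enumerating the combined regions of a protein can be sketched as follows. This is an illustration only: the function name is hypothetical, and treating a combined region as an unordered pair of distinct region types (so that repeated occurrences of the same region collapse) is an assumption.

```python
from itertools import combinations

def with_cooperative_regions(regions):
    """Return the single regions of a protein plus all unordered region pairs."""
    singles = sorted(set(regions))
    # Each pair acts as one additional, combined region in the model.
    pairs = list(combinations(singles, 2))
    return singles + pairs

model_regions = with_cooperative_regions(["B", "A", "C"])
```

A protein with k distinct regions contributes k·(k−1)/2 combined regions, so the number of candidate PRIs grows quadratically with the number of regions per protein.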

As Equation (1) contains an implicit independence assumption, PRIs involving cooperative regions cannot be modeled straightforwardly as normal, non-cooperative interactions. For this reason, when cooperative interactions are included in the model, an additional step of interaction selection has to be performed before the expectation step in each iteration of the EM algorithm. In this additional step, the dependent PRIs are separated from each other by temporarily deactivating the PRI with lower interaction probability θij. Given that protein pm contains region ri and protein pn contains a potentially cooperative region (rk,rl), two different but exclusive PRIs can be active in the probabilistic model. Either the cooperative interaction ri(kl) is active or the two independent single PRIs rik and ril are active. Which of these two alternatives is selected depends on the probability of each alternative to mediate the PPI pmn. If Equation (6) holds true given the current estimates of θij, the cooperative interaction is temporarily deactivated and the two independent, non-cooperative interactions remain active. However, if Equation (6) does not hold true, the cooperative interaction is selected to be active in this iteration and the two independent non-cooperative interactions are temporarily deactivated.


Formula 6

(6)
Note that this procedure closely follows the intuition that a cooperative PRI should have a stronger effect on the probability of a PPI than the two single PRIs acting alone. The interaction indicators I of temporarily deactivated PRIs are set to 0 for the current iteration of the algorithm and changed back to their original value before the next iteration begins. The rest of the EM algorithm is not affected by this modeling of cooperative interactions, and computation time does not increase considerably. Note that, as θij changes in each iteration, cooperative interactions and non-cooperative interactions remain in constant competition with each other.
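The selection step can be sketched as a comparison of two probabilities. The criterion used here, namely that the single PRIs stay active when their joint noisy-OR probability exceeds the probability of the cooperative PRI, is an assumption derived from the verbal description; it is not a reproduction of Equation (6), and the function name and dictionary keys are hypothetical.

```python
def select_active(theta_ik, theta_il, theta_coop):
    """Return temporary indicator values for one iteration of the EM algorithm.

    Assumed criterion: keep the two single PRIs if together they explain the
    PPI better than the cooperative PRI, otherwise keep the cooperative PRI.
    """
    p_singles = 1.0 - (1.0 - theta_ik) * (1.0 - theta_il)
    if p_singles > theta_coop:
        return {"ik": 1, "il": 1, "i(kl)": 0}
    return {"ik": 0, "il": 0, "i(kl)": 1}

# Cooperative PRI wins: 1 - 0.8 * 0.7 = 0.44 is smaller than 0.6.
coop_case = select_active(0.2, 0.3, 0.6)
# Single PRIs win: 1 - 0.5 * 0.5 = 0.75 is larger than 0.6.
single_case = select_active(0.5, 0.5, 0.6)
```

Because the indicators are reset before the next iteration, the outcome of this comparison can flip whenever the parameter estimates change.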

IPPRI also supports the inclusion of fine-grained confidences into the probabilistic model. Given a region ri ∈ pm, the confidence v of this region instance (indexed by ri and pm) describes our belief that this specific region instance is functionally relevant in mediating PPIs. Similarly, given a PPI pmn, the confidence wmn describes our belief that pmn is a true PPI. Both confidences can be used in the model by weighting the global sums Mij and Nij according to the confidences in the regions and the PPIs on which the PRI rij relies. This weighting scheme is included in the model as shown in Equation (7), which replaces Equation (3) if confidences are included.


Formula 7

(7)

2.6 Scoring functions
Once estimated, the model parameters θij can be used directly for scoring PRIs. Benchmarking is then commonly done by ranking all PRIs by descending θij and comparing the top n interactions with the positive reference set of PRIs obtained from structural evidence for PPIs.

It has been shown in related work (Riley et al., 2005) that using θij directly leads only to a modest enrichment of PRIs in the positive reference set within the top ranks. This is most likely due to the fact that θij is relatively sensitive to noise in the training data and might change drastically in response to only a few changed PPIs. Also, θij does not capture the importance of a PRI for the whole PPI network. As a consequence, several scoring functions have been developed that use θij as a central element, but also include global information about the importance of a PRI within the context of the whole PPI network. IPPRI computes two established scoring functions employed in related work, and we also introduce a new scoring function that makes use of additional information contained in the region architecture of interacting proteins.

Let Tij be the number of protein pairs pm and pn of the same taxon that contain the regions ri and rj, respectively, regardless of whether pmn is an observed interaction or not. We denote the product of θij and Tij as the frequency scoring function Sfreq (Lee et al., 2006). Sfreq captures the importance of a PRI by including the number of PPIs it could maximally mediate.

An alternative scoring function directly utilizes the likelihood of the PPI network (Riley et al., 2005). This function, adapted here as likelihood scoring function Slike, measures the drop in likelihood when excluding the respective scored PRI from the model.


Formula 8

(8)
Note that in the original publication (Riley et al., 2005), the denominator of Equation (8) was computed by excluding rij and refitting the model for each scored PRI. However, this method is computationally prohibitive for models that contain a large number of PRIs. Therefore, we decided to use a computationally less expensive version that does not solve the ML estimation problem repeatedly and instead sets θij=0 for the excluded PRI. The performance of the two variants of Slike is nearly identical using the validation methods described in the next subsection.

Furthermore, we propose the region scoring function Sregion, a novel scoring function that relies on the likelihood scoring function, but also considers information about the region co-occurrence Fij that counts the number of proteins containing both regions ri and rj. This region scoring function is defined by Sregion=Slike (1+Fij). Region co-occurrence in protein sequences may be a good indicator for a PRI because the evolution of PRs involves mechanisms on the genomic level such as gene duplication, divergence and recombination that may distribute regions interacting within one protein to novel proteins (Hurles, 2004). These evolutionarily related regions in novel proteins may retain their propensity to interact and mediate PPIs (Enright et al., 1999).
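The frequency score (the estimated interaction probability multiplied by Tij) and the region score (the likelihood score multiplied by 1+Fij) are simple products, as the following sketch shows. The function names, the toy values and the ranking helper are hypothetical, and the likelihood score is taken as a precomputed input rather than derived from Equation (8).

```python
def frequency_score(theta_ij, t_ij):
    """Interaction probability times the number of protein pairs containing both regions."""
    return theta_ij * t_ij

def region_score(s_like, f_ij):
    """Likelihood score boosted by the region co-occurrence count F_ij."""
    return s_like * (1 + f_ij)

def rank_by(scores):
    """Return PRIs sorted by descending score."""
    return sorted(scores, key=scores.get, reverse=True)

# Toy likelihood scores and co-occurrence counts (hypothetical values).
s_like = {("A", "A"): 1.5, ("A", "B"): 2.0, ("B", "C"): 1.0}
f = {("A", "A"): 3, ("A", "B"): 0, ("B", "C"): 1}
s_region = {pri: region_score(s_like[pri], f[pri]) for pri in s_like}
ranking = rank_by(s_region)
```

In the toy data the pair of identical regions overtakes the pair with the highest likelihood score once co-occurrence is taken into account, which illustrates how the co-occurrence term can reorder the ranking.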

For validation purposes, IPPRI also supports a random scoring function Srand that assigns a value between 0 and 1 to each PRI according to a uniform random distribution.

Apart from the scoring functions presented here in detail, IPPRI additionally computes other scoring functions like the specificity score, the modularity score (Riley et al., 2005), and the number of witnesses (Guimaraes et al., 2006). However, as these functions are only of minor importance for the results of this work, they are not presented in detail here.

2.7 Validation methods
After scoring PRIs, the PRI predictions can be compared to the positive and negative reference sets described in Section 2.3.

Certain validation methods assess the quality of the ranked list of predicted PRIs that result from the application of a scoring function. Good scoring functions should enrich experimentally observed PRIs at the top ranks of the list. A common measure for determining the quality of such an enrichment is the positive predictive value (PPV, a.k.a. precision) defined as TP/(TP+FP). The PPV is based on the assumption that PRIs in the positive reference set are true and PRIs in the negative reference set of unobserved interactions are false.
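As a minimal sketch, the PPV at a rank cutoff n can be computed by counting how many of the top n predictions occur in the positive reference set and treating all other top-ranked predictions as false positives. The function name and the toy data are hypothetical.

```python
def ppv_at(ranked, positives, n):
    """PPV = TP / (TP + FP) among the top n predictions of a ranked list."""
    top = ranked[:n]
    if not top:
        return 0.0
    tp = sum(1 for pri in top if pri in positives)
    return tp / len(top)

ranked = ["r1", "r2", "r3", "r4", "r5"]
positives = {"r1", "r3", "r5"}
ppv_top4 = ppv_at(ranked, positives, 4)  # 2 of the top 4 are in the reference set
```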

As the current positive reference sets are incomplete, many of the unobserved interactions could still be true. Apart from that, the performance of a scoring function is difficult to estimate without a random scoring function for comparison. To remedy both problems, a new measure, denoted here as estimated PPV (ePPV), was developed in (Riley et al., 2005). The ePPV uses a negative set of PRIs and normalizes the performance of the scoring functions by the random scoring function. The ePPV is computed based on the fold values of PRIs in the positive reference set (Vpos) and in the negative reference set of putative non-interactions (Vneg). These fold values indicate how many more interactions of the respective sets are promoted by a scoring function in comparison to the random scoring function. The ePPV is then defined as Vpos/(Vpos+Vneg).
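The ePPV can be sketched as follows. The fold value is assumed here to be the ratio between the number of reference PRIs found in the top n and the number expected in the top n under a random ordering; this ratio and the analytic expectation (used instead of sampling a random scoring function) are assumptions, and all names are hypothetical.

```python
def fold(ranked, reference, n):
    """Observed reference PRIs in the top n divided by the random expectation."""
    if not ranked:
        return 0.0
    observed = sum(1 for pri in ranked[:n] if pri in reference)
    in_list = sum(1 for pri in ranked if pri in reference)
    expected = n * in_list / len(ranked)  # expectation for a uniformly random order
    return observed / expected if expected else 0.0

def eppv(ranked, positives, negatives, n):
    """ePPV = Vpos / (Vpos + Vneg)."""
    v_pos = fold(ranked, positives, n)
    v_neg = fold(ranked, negatives, n)
    total = v_pos + v_neg
    return v_pos / total if total else 0.0

ranked = list("abcdefgh")
positives = {"a", "b", "e"}
negatives = {"c", "f", "g", "h"}
value = eppv(ranked, positives, negatives, 4)  # Vpos = 4/3, Vneg = 1/2
```

A scoring function that behaves like the random scoring function yields fold values near 1 for both sets and therefore an ePPV near 0.5 under this reading.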

Furthermore, we measure how well the scoring functions can identify PRIs in the positive reference set that involve a PR of a specific region type. For each PR of the selected region type, we identify the top rank of any predicted PRI that is also in the positive reference set and that involves the PR. Good scoring functions should promote the PRIs in the positive reference set to the top ranks of all predicted PRIs for each PR.
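This per-region evaluation can be sketched by walking down the ranked list once and recording, for each region, the first rank at which one of its reference PRIs appears. The restriction to a selected region type is omitted for brevity, and the function name and toy data are hypothetical.

```python
def best_reference_rank(ranked, positives):
    """Map each region to the best (1-based) rank of a reference PRI involving it."""
    best = {}
    for rank, pri in enumerate(ranked, start=1):
        if pri in positives:
            for region in set(pri):
                # setdefault keeps the first, i.e. best, rank seen for a region.
                best.setdefault(region, rank)
    return best

ranked = [("A", "B"), ("C", "D"), ("A", "C"), ("D", "D")]
positives = {("C", "D"), ("D", "D")}
best = best_reference_rank(ranked, positives)
```

Regions without any reference PRI in the ranked list, such as A and B in the toy data, do not appear in the result.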

Moreover, to evaluate the performance of scoring functions at recovering the putative architecture of known protein complexes, we analyzed how well each scoring function is able to identify CORUM (Ruepp et al., 2007) complexes (see Supplementary Data File 1 for the exact procedure).

Finally, we measure the performance of scoring functions in correctly identifying PRIs of the positive reference set among all PRIs that may mediate a PPI (Guimaraes et al., 2006). To this end, we first identify all PPIs in our training set that may be mediated by a PRI in the positive reference set. For each scoring function, we then calculate the number of PPIs whose PRIs in the reference set achieve the best score.
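A minimal sketch of this count is given below. It assumes that each PPI comes with the list of PRIs that may mediate it and that ties for the best score are resolved in favor of the first candidate; both the data layout and the tie handling are assumptions, and the names are hypothetical.

```python
def count_recovered_ppis(candidates_by_ppi, score, positives):
    """Count PPIs whose best-scoring candidate PRI is in the positive reference set."""
    recovered = 0
    for candidates in candidates_by_ppi.values():
        # Only PPIs that may be mediated by a reference PRI are considered.
        if not any(c in positives for c in candidates):
            continue
        best = max(candidates, key=lambda c: score[c])
        if best in positives:
            recovered += 1
    return recovered

candidates_by_ppi = {
    ("p1", "p2"): [("A", "B"), ("A", "C")],
    ("p3", "p4"): [("A", "B"), ("D", "E")],
    ("p5", "p6"): [("D", "E")],  # no reference PRI, so this PPI is skipped
}
score = {("A", "B"): 0.9, ("A", "C"): 0.4, ("D", "E"): 1.2}
positives = {("A", "B")}
n_recovered = count_recovered_ppis(candidates_by_ppi, score, positives)
```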


    3 RESULTS AND DISCUSSION
3.1 Basic training set
To verify the implementation of our method, we successfully reproduced published PRI predictions (Riley et al., 2005) using the training set provided by that publication. Afterwards, to allow comparisons with related methods (Guimaraes et al., 2006; Lee et al., 2006; Riley et al., 2005), we applied IPPRI to a basic training set of 8552 PPIs obtained from DIP covering five taxa. Here, the proteins participating in the PPIs were annotated only with Pfam-A domains, and no confidence values or region cooperations were considered. Although the basic training set contains fewer taxa and consequently fewer PPIs and PRIs than the training sets used in related work, the top 1000 predicted PRIs do overlap with related PRI predictions as expected from the similar basis of the prediction methods in terms of training data and methodology (Fig. S1).

The analyses of PPV (Fig. S2) and ePPV (Fig. S3) show that θij scores poorly in predicting PRIs of the positive reference set, while Sfreq and Slike both perform comparatively well. The novel Sregion shows excellent performance with respect to the PPV, indicating that region co-occurrence is a good predictor of PRIs in the positive reference set. However, in the ePPV evaluation, Sregion displays only average performance as it promotes not only PRIs in the positive reference set, but, to a smaller degree, also putatively non-interacting region pairs with frequent region co-occurrence.

Regardless of the scoring function used, the quality of PRI predictions drops considerably after a rank cutoff of about 1000. Therefore, we decided to focus our further analysis on the top 1000 PRIs ranked by Sregion (see Supplementary Data File 2). It is notable that the PRIs promoted to top ranks by Sregion are enriched in region homodimers of identical PRs. This is due to the fact that identical regions often occur multiple times within a protein and therefore receive a high co-occurrence term in the scoring function. This may partly explain the good performance of Sregion when compared to the positive reference set, as PDB structures of interacting PRs are known to be biased towards PRIs between homodimers (Guimaraes and Przytycka, 2008; Itzhaki et al., 2006).

Further evaluation of the ranking of the first PRI of a specific region among all predicted PRIs of this region reveals that Sregion promotes considerably more PRIs in the positive reference set to top ranks than Sfreq and Slike, while θij performs near random (Fig. S4).

3.2 Extended training set
The extended training set contains 39 012 experimental PPIs of 8005 Pfam-A annotated proteins from the databases listed in Section 2.1. This increase in PPIs compared to the basic training set leads to many new predicted PRIs (see Supplementary Data File 3). Notably, training set preparation, model fitting and PRI scoring by our IPPRI implementation are fast and do not take longer than 2 h even on this large data set.

As in the case of the basic training set, the top 1000 predicted PRIs overlap considerably with related PRI predictions (Fig. 3). The PPV/ePPV evaluation on the extended training set also yields scoring-function performances similar to those on the basic set, except that θij performs even worse, presumably due to the increased noise in the much larger extended training set and the sensitivity of θij to noise (Figs. 1 and 2).


Fig. 1. PPV for PRI scoring functions using the extended training set.

 

Fig. 2. ePPV for PRI scoring functions using the extended training set.

 

Fig. 3. Heat map for the overlap of the top 1000 region interactions in various predicted and validated data sets using the extended training set.

 
The analysis of correctly identified PRIs that mediate PPIs within the extended training set shows that Sfreq, Slike and Sregion all correctly identify most of the PRIs in the positive reference set. In contrast, Formula identifies no more PPIs than the random scoring function (Fig. S5).

This finding is in contrast to the computed recovery rate of experimentally determined protein complexes (Fig. S6). In this assessment, no scoring function performs well at reconstituting all complexes. While some complexes appear to be correctly recovered, the additional co-occurrence term of Sregion seems to hinder the recovery of most complexes. This may be partly attributed to the fact that the scoring functions are designed for identifying the interacting regions of binary PPIs.

To reveal completely new top-scoring PRIs that rely solely on predicted PPIs, we included high-confidence predicted PPIs in the extended training set (see Supplementary Data File 1). Eight such PRIs were found in the top 1000 predicted PRIs (see Supplementary Data File 4).

3.3 Additional region types and region cooperation
The proteins in the extended training set can be annotated with additional region types aside from Pfam-A domains. The inclusion of interactions between Pfam-A domains and SLiMs into our probabilistic model reveals that PRIs involving SLiMs can be identified by Sfreq, Slike and Sregion (Fig. 4) without decreasing the quality of predicted Pfam-A PRIs. Several newly predicted PRIs within the top 1000 predictions involving SLiMs are listed in Supplementary Data File 5.


Fig. 4. Recovery of SLiM PRIs in the positive reference set using the extended training set. We regard the rank of the first validated interaction among all SLiM domain interactions.

 
Another methodological extension is the consideration of coiled-coil regions and region cooperativity in addition to Pfam-A domains and SLiMs. The analysis of the scored PRIs of this extension uncovers 232 high-ranking cooperative PRIs in the top 1000 PRIs (see Supplementary Data File 6), most of them involving coiled-coil regions. However, only very few of these cooperative PRIs could be verified because no PRIs of coiled-coil regions with other PRs are contained in the positive reference set. The top-scoring PRI in the top 1000 is the interaction between two coils. This PRI may be responsible for up to 4723 PPIs of the extended training set, corroborating that coil interactions are indeed relevant for mediating PPIs.
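An upper bound on the number of PPIs that a single PRI may mediate can be obtained by counting the PPIs in which one protein carries the first region and the partner carries the second. The sketch below shows this counting on toy data; it is an assumed reading of how such a figure could be derived, and the function name, protein identifiers and region names are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def ppis_explainable_by(region_pair: Tuple[str, str],
                        ppis: Iterable[Tuple[str, str]],
                        annotations: Dict[str, Set[str]]) -> int:
    """Count PPIs whose two proteins carry the two regions of `region_pair`.

    This is an upper bound: a PPI is counted whenever the region pair is
    present, regardless of which regions actually mediate the contact.
    """
    r1, r2 = region_pair
    count = 0
    for p, q in ppis:
        regions_p = annotations.get(p, set())
        regions_q = annotations.get(q, set())
        if (r1 in regions_p and r2 in regions_q) or (r2 in regions_p and r1 in regions_q):
            count += 1
    return count


# Toy annotations with hypothetical protein and region names.
toy_annotations = {"P1": {"coil", "SH3"}, "P2": {"coil"}, "P3": {"kinase"}}
toy_ppis = [("P1", "P2"), ("P1", "P3"), ("P2", "P3")]
print(ppis_explainable_by(("coil", "coil"), toy_ppis, toy_annotations))
```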

Inclusion of low-confidence Pfam-B domains into the extended training set did not significantly decrease validation performance regarding the top 1000 PRIs. This suggests that PRIs involving Pfam-B domains are not considered relevant by the scoring functions. Therefore, it was not necessary to down-weight confidences of Pfam-B domains in the confidence system of our probabilistic method. In total, 282 PRIs involving Pfam-B domains can be found in the top 1000 PRIs (see Supplementary Data File 7).

3.4 Protein interaction confidences
To assess the effect of including confidence values for PPIs, we generated 10 000 random binary PPIs between DIP proteins that were not reported to interact with each other in any of the PPI databases we used. Although some of these random PPIs may still occur in vivo, it is reasonable to assume that most of them do not. We thus denote these artificial PPIs as putatively negative and include them in the extended training set. The interacting proteins were annotated only with Pfam-A domains, and IPPRI was applied, in a first version, with all PPI confidences set to 1 and, in a second version, with PPI confidences equaling the GO-based functional similarity scores of two interacting proteins.
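The generation of putatively negative PPIs can be sketched as rejection sampling of random protein pairs that are absent from the set of reported interactions. This is an illustrative sketch only: the function name is hypothetical, and the restriction to pairs of two distinct proteins is an assumption.

```python
import random
from typing import Iterable, List, Tuple


def sample_putative_negatives(proteins: Iterable[str],
                              known_ppis: Iterable[Tuple[str, str]],
                              n: int,
                              seed: int = 0) -> List[Tuple[str, str]]:
    """Draw `n` distinct random protein pairs that are not reported PPIs.

    Assumption: self-interactions are excluded, so each pair consists of
    two different proteins.
    """
    rng = random.Random(seed)
    pool = sorted(set(proteins))
    pool_set = set(pool)
    known = {frozenset(p) for p in known_ppis}
    # Make sure enough unreported pairs exist, so the loop below terminates.
    total_pairs = len(pool) * (len(pool) - 1) // 2
    known_in_pool = sum(1 for k in known if len(k) == 2 and k <= pool_set)
    if n > total_pairs - known_in_pool:
        raise ValueError("not enough unreported protein pairs to sample from")
    negatives = set()
    while len(negatives) < n:
        pair = frozenset(rng.sample(pool, 2))
        if pair not in known:
            negatives.add(pair)
    return sorted(tuple(sorted(p)) for p in negatives)


# Toy example with hypothetical protein identifiers.
toy_proteins = ["P1", "P2", "P3", "P4"]
toy_known = [("P1", "P2"), ("P3", "P4")]
print(sample_putative_negatives(toy_proteins, toy_known, 2))
```

In the first setting described above, each sampled pair would enter the training set with confidence 1, and in the second with its functional similarity score.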

Our analysis of the predicted PRIs shows that the putatively negative PPIs have a negative impact on prediction performance when given the full but wrong confidence of 1. In particular, the performance of Formula and Sfreq declines notably, while the other scoring functions remain largely unaffected (Fig. S7). This can be expected because Formula is sensitive to false PPIs and Sfreq is dependent on Formula. The putatively negative PPIs appear not to have significant impact on the likelihood measure, which contributes to the robustness of Slike and Sregion.

Importantly, the performance of Formula and Sfreq improves significantly when using confidence values and even outperforms the scoring functions on the extended training set without putatively negative PPIs and confidences (Fig. S8). This demonstrates that confidence values are indeed useful for down-weighting unreliable and false PPIs contained in noisy data.

Alternatively, confidence values may also allow for selecting only PPIs for inclusion into the probabilistic model that surpass a certain confidence threshold. This is illustrated in Fig. S9, in which only PPIs with a confidence of at least 0.7 are considered. This procedure leads to a >3-fold increase in the PPV performance of Formula at a rank cutoff of 1000 (see Supplementary Data File 8).
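Selecting PPIs by a confidence threshold amounts to a simple filter applied before model fitting. The sketch below assumes that each PPI is stored as a triple of two protein identifiers and a confidence value in [0, 1]; this representation and the function name are hypothetical.

```python
from typing import Iterable, List, Tuple

WeightedPPI = Tuple[str, str, float]


def filter_by_confidence(ppis: Iterable[WeightedPPI],
                         threshold: float = 0.7) -> List[WeightedPPI]:
    """Keep only PPIs whose confidence is at least `threshold`."""
    return [ppi for ppi in ppis if ppi[2] >= threshold]


# Toy data with hypothetical protein identifiers and confidence values.
scored_ppis = [("P1", "P2", 0.9), ("P1", "P3", 0.7), ("P2", "P3", 0.35)]
print(filter_by_confidence(scored_ppis))
```

In contrast to down-weighting, this hard threshold discards low-confidence PPIs entirely, which trades training set size for reliability.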


    4 CONCLUSIONS
We presented IPPRI, an integrative approach for predicting protein region interactions from protein networks. IPPRI significantly advances the modeling of molecular mechanisms underlying the formation of protein interactions by incorporating different region types and their combinations into a unified system. The consideration of SLiMs and coiled coils as alternative protein binding sites in addition to globular protein domains is particularly important for identifying the true interacting regions of proteins.

Our method can be efficiently trained on a large set of protein interactions to allow for the prediction of many region interactions, which are assessed by several validation methods. Several top-ranking novel region interactions that involve cooperations between two regions or that rely solely on predicted protein interactions in the training data might be useful for further studies. We could also show that good scoring functions are essential for prediction performance and that region co-occurrence is a valuable predictor of PRIs. The use of confidence values for protein interactions increases the robustness of some of the scoring functions to false PPIs. While interacting protein regions of binary protein–protein interactions can be reliably identified by our approach, complex reconstruction using the predicted region interactions has been difficult and requires further work and improved methods.


    ACKNOWLEDGEMENTS
Part of this study was financially supported by the German National Genome Research Network (NGFN) and by the German Research Foundation (DFG), contract number KFO 129/1-2. The work was conducted in the context of the BioSapiens Network of Excellence funded by the European Commission under grant number LSHG-CT-2003-503265.

Conflict of Interest: none declared.


    REFERENCES

    Albrecht M, et al. Decomposing protein networks into domain-domain interactions. Bioinformatics (2005) 21(Suppl. 2):ii220–ii221.

    Aloy P, Russell R. Structural systems biology: modelling protein interactions. Nat. Rev. Mol. Cell Biol (2006) 7:188–197.

    Bahadur RP, Zacharias M. The interface of protein-protein complexes: analysis of contacts and prediction of interactions. Cell. Mol. Life Sci (2007) 65:1059–1072.

    Bateman A, et al. The Pfam protein families database. Nucleic Acids Res (2003) 32:D138–D141.

    Berman H, et al. Announcing the worldwide Protein Data Bank. Nat. Struct. Biol (2003) 10:980.

    Bhattacharyya RP, et al. Domains, motifs, and scaffolds: the role of modular interactions in the evolution and wiring of cell signaling circuits. Annu. Rev. Biochem (2006) 75:655–680.

    Bock J, Gough D. Predicting protein-protein interactions from primary structure. Bioinformatics (2001) 17:455–460.

    Bornberg-Bauer E, et al. The evolution of domain arrangements in proteins and interaction networks. Cell. Mol. Life Sci (2005) 62:435–445.

    Ceol A, et al. DOMINO: a database of domain-peptide interactions. Nucleic Acids Res (2006) 35:D557–D560.

    Chatr-aryamontri A, et al. MINT: the Molecular INTeraction database. Nucleic Acids Res (2007) 35:D572–D574.

    Chen L, et al. Inferring protein interactions from experimental data by association probabilistic method. Proteins (2006) 62:833–837.

    Chen X, Liu M. Prediction of protein-protein interactions using random decision forest framework. Bioinformatics (2005) 21:4394–4400.

    Cote RG, et al. The Protein Identifier Cross-Reference (PICR) service: reconciling protein identifiers across multiple source databases. BMC Bioinformatics (2007) 8:401.

    Dempster A, et al. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B (1977) 39:1–38.

    Deng M, et al. Inferring domain-domain interactions from protein-protein interactions. Genome Res (2002) 12:1540–1548.

    Dyson HJ, Wright PE. Intrinsically unstructured proteins and their functions. Nat. Rev. Mol. Cell Biol (2005) 6:197–208.

    Enright AJ, et al. Protein interaction maps for complete genomes based on gene fusion events. Nature (1999) 402:86–90.

    Finn R, et al. iPfam: visualization of protein-protein interactions in PDB at domain and amino acid resolutions. Bioinformatics (2004) 21:410–412.

    Finn R, et al. Pfam: clans, web tools and services. Nucleic Acids Res (2005) 34:D247–D251.

    Gomez S, Rzhetsky A. Towards the prediction of complete protein-protein interaction networks. Pac. Symp. Biocomput (2002) 413–424.

    Guimaraes K, et al. Predicting domain-domain interactions using a parsimony approach. Genome Biol (2006) 7:R104.

    Guimaraes KS, Przytycka TM. Interrogating domain-domain interactions with parsimony based approaches. BMC Bioinformatics (2008) 9:171.

    Han D, et al. A domain combination based probabilistic framework for protein-protein interaction prediction. Genome Informatics (2005) 14:250–259.

    Hayashida M, et al. A simple method for inferring strengths of protein-protein interactions. Genome Informatics (2005) 15:56–68.

    Henrick K, Thornton JM. PQS: a protein quaternary structure file server. Trends Biochem. Sci (1998) 23:358–361.

    Huang C, et al. Predicting protein-protein interactions from protein domains using a set cover approach. IEEE/ACM Trans. Comput. Biol. Bioinform (2007) 4:78–87.

    Hurles M. Gene duplication: the genomic trade in spare parts. PLoS Biol (2004) 2:E206.

    Itzhaki Z, et al. Evolutionary conservation of domain-domain interactions. Genome Biol (2006) 7:R125.

    Jothi R, et al. Co-evolutionary analysis of domains in interacting proteins reveals insights into domain-domain interactions mediating protein-protein interactions. J. Mol. Biol (2006) 362:861–875.

    Kann M, et al. Predicting protein domain interactions from coevolution of conserved regions. Proteins (2007) 67:811–820.

    Kerrien S, et al. IntAct - open source resource for molecular interaction data. Nucleic Acids Res (2007) 35:D561–D565.

    Kim W, et al. Large scale statistical prediction of protein-protein interaction by potentially interacting domain (PID) pair. Genome Informatics (2003) 13:42–50.

    Lee H, et al. An integrated approach to the prediction of domain-domain interactions. BMC Bioinformatics (2006) 7:269.

    Liu Y, et al. Inferring protein-protein interactions through high-throughput interaction data from diverse organisms. Bioinformatics (2005) 21:3279–3285.

    Lupas AN, Gruber M. The structure of alpha-helical coiled coils. Adv. Protein Chem (2005) 70:37–78.

    McDermott J, et al. BIOVERSE: enhancements to the framework for structural, functional and contextual modeling of proteins and proteomes. Nucleic Acids Res (2005) 33:W324–W325.

    Moza B, et al. Long-range cooperative binding effects in a T cell receptor variable domain. Proc. Natl Acad. Sci. USA (2006) 103:9867–9872.

    Neduva V, Russell R. Linear motifs: evolutionary interaction switches. FEBS Lett (2005) 579:3342–3345.

    Ng S, et al. InterDom: a database of putative interacting protein domains for validating predicted protein interactions and complexes. Nucleic Acids Res (2003) 31:251–254.

    Nye T, et al. Statistical analysis of domains in interacting protein pairs. Bioinformatics (2004) 21:993–1001.

    Orengo CA, Thornton JM. Protein families and their evolution-a structural perspective. Annu. Rev. Biochem (2005) 74:867–900.

    Pagel P, et al. The DIMAweb resource – exploring the protein domain network. Bioinformatics (2006) 22:997–998.

    Pagel P, et al. DIMA 2.0 – predicted and known domain interactions. Nucleic Acids Res (2008) 36:D651–D655.

    Pawson T, Nash P. Assembly of cell regulatory systems through protein interaction domains. Science (2003) 300:445–452.

    Peri S, et al. Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Res (2003) 13:2363–2371.

    Puntervoll P, et al. ELM server: a new resource for investigating short functional sites in modular eukaryotic proteins. Nucleic Acids Res (2003) 31:3625–3630.

    Ramírez F, et al. Computational analysis of human protein interaction networks. Proteomics (2007) 7:2541–2552.

    Rhodes DR, et al. Probabilistic model of the human protein-protein interaction network. Nat. Biotechnol (2005) 23:951–959.

    Riley R, et al. Inferring protein domain interactions from databases of interacting proteins. Genome Biol (2005) 6:R89.

    Ruepp A, et al. CORUM: the comprehensive resource of mammalian protein complexes. Nucleic Acids Res (2007) 36:D646–D650.

    Santonico E, et al. Methods to reveal domain networks. Drug Discov. Today (2005) 10:1111–1117.

    Schlicker A, et al. A new measure for functional similarity of gene products based on Gene Ontology. BMC Bioinformatics (2006) 7:302.

    Schlicker A, et al. Functional evaluation of domain-domain interactions and human protein interaction networks. Bioinformatics (2007) 23:859–865.

    Seet BT, et al. Reading protein modifications with interaction domains. Nat. Rev. Mol. Cell Biol (2006) 7:473–483.

    Sprinzak E, Margalit H. Correlated sequence-signatures as markers of protein-protein interaction. J. Mol. Biol (2001) 311:681–692.

    Stark C, et al. BioGRID: a general repository for interaction datasets. Nucleic Acids Res (2006) 34:D535–D539.

    Stein A, et al. 3did: interacting protein domains of known three-dimensional structure. Nucleic Acids Res (2005) 33:D413–D417.

    Velankar S, et al. E-MSD: an integrated data resource for bioinformatics. Nucleic Acids Res (2005) 33:D262–D265.

    Wang H, et al. InSite: a computational method for identifying protein-protein interaction binding sites on a proteome-wide scale. Genome Biol (2007a) 8:R192.

    Wang RS, et al. Analysis on multi-domain cooperation for predicting protein-protein interactions. BMC Bioinformatics (2007b) 8:391.

    Wuchty S. Topology and weights in a protein domain interaction network – a novel way to predict protein interactions. BMC Genomics (2006) 7:122.

