Bioinformatics Advance Access originally published online on October 27, 2004
Bioinformatics 2005 21(7):993-1001; doi:10.1093/bioinformatics/bti086

Published by Oxford University Press 2004.

Statistical analysis of domains in interacting protein pairs

Tom M. W. Nye 1,*, Carlo Berzuini 1,2, Walter R. Gilks 1, M. Madan Babu 3 and Sarah A. Teichmann 3

1Medical Research Council Biostatistics Unit Cambridge, UK
2Dipartimento di Informatica e Sistemistica, Università di Pavia Italy
3Medical Research Council Laboratory of Molecular Biology Cambridge, UK

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 

Motivation: Several methods have recently been developed to analyse large-scale sets of physical interactions between proteins in terms of physical contacts between the constituent domains, often with a view to predicting new pairwise interactions. Our aim is to combine genomic interaction data, in which domain–domain contacts are not explicitly reported, with the domain-level structure of individual proteins, in order to learn about the structure of interacting protein pairs. Our approach is driven by the need to assess the evidence for physical contacts between domains in a statistically rigorous way.

Results: We develop a statistical approach that assigns p-values to pairs of domain superfamilies, measuring the strength of evidence within a set of protein interactions that domains from these superfamilies form contacts. A set of p-values is calculated for SCOP superfamily pairs, based on a pooled data set of interactions from yeast. These p-values can be used to predict which domains come into contact in an interacting protein pair. This predictive scheme is tested against protein complexes in the Protein Quaternary Structure (PQS) database, and is used to predict domain–domain contacts within 705 interacting protein pairs taken from our pooled data set.

Contact: thomas.nye{at}mrc-bsu.cam.ac.uk


    1 INTRODUCTION
 
Proteins frequently bind together in pairs or larger complexes to take part in biological processes. Understanding such interactions across the entire genome is an important goal with diverse implications about protein function. Experimental techniques that test thousands of pairs of proteins within a genome for their ability to interact (Ito et al., 2001; Uetz et al., 2000) have therefore attracted considerable interest. These large-scale interaction assays produce no explicit information about structural aspects of the interactions—how the proteins come into contact physically and the shapes they adopt. Our aim in this paper is to combine such interaction data with information about the structure of individual proteins at the domain level in order to learn about the structure of protein complexes.

A number of different experimental and computational techniques are available for elucidating the structure of protein complexes (Russell et al., 2004), but the complete three-dimensional atomic structure is available only for a limited range of proteins and their complexes. While such information can be used to predict the geometry of protein complexes of unknown structure (Aloy and Russell, 2002), this source of information is severely limited by the difficulty of performing experiments to determine structures at the atomic level. Given the large amount of genomic interaction data of the type described above, and the fact that further data are likely to become available in the future as more experiments are performed, techniques for extracting structural information from such data are important. In addition, while the complete three-dimensional structure is not available for the majority of proteins, it is often possible to elucidate the structure of a protein at the domain level by sequence homology alone.

A number of approaches to analysing and predicting protein interactions in terms of constituent domains have recently been published. The basic idea is to work within some classification of protein domains, and identify pairs of domain superfamilies (or other such groupings) that occur frequently in interacting protein pairs within some large-scale interaction data set. This information can then be used to make predictions about new groups of proteins that might interact, and also indicate the domains which might be contacting each other in these proteins. A common approach (Kim et al., 2002; Ng et al., 2003; Sprinzak and Margalit, 2001) has been to score domain pairs according to how frequently the pair occurs in interacting protein pairs, one in each protein, relative to the abundance of the domains across the proteome. However, there is little consensus over how such scores should be calculated, and a number of statistical issues arise: is it meaningful to compare the scores for different domain pairs directly? How would the scores be distributed if the domain composition of proteins had no influence on interaction? A more rigorous alternative approach due to Deng et al. (2002) employs a maximum likelihood method to estimate a probability of interaction for every domain pair. This involves using the EM algorithm to maximize the likelihood in a space with potentially thousands of dimensions.

This paper presents an approach to pairwise protein interaction aimed at addressing the statistical issues mentioned above. However, unlike existing approaches (Deng et al., 2002; Sprinzak and Margalit, 2001), our aim is to predict the most likely pair of domains mediating a given protein interaction, rather than predicting new protein interactions. Given a pair of proteins that are known to interact, the ability to predict structural aspects of this interaction would be of great biological value. Aloy et al. (2004) considered a similar structural prediction problem, but at a larger scale, predicting protein–protein contacts within complexes containing more than two proteins.

The approach we adopt has the following form. A simple model of contacts between protein domains is used to assign a score to superfamily pairs. Throughout the paper the term contact is used to signify physical binding between two domains in separate proteins. Rather than comparing scores for different superfamily pairs directly, a data-simulation method is used to generate a p-value for the observed score of each superfamily pair, measuring the strength of evidence for the role played by the superfamily pair in protein interaction. We generate a set of p-values for SCOP superfamily pairs, based on a set of protein interactions obtained by pooling together several data sets of interactions from yeast.

The ranked list of superfamily pairs we obtain can be used to predict which domains come into contact in an interacting protein pair: it is natural to expect that the pair with the lowest p-value should form a contact. We test this idea by extracting contacts between proteins in complexes in the Protein Quaternary Structure (PQS) database (Henrick and Thornton, 1998) and comparing these contacts with our predictions. We also predict where contacts occur between the interacting protein pairs in our pooled set of yeast interactions, and show how our analysis expands the repertoire of superfamily interactions found in the PQS.

1.1 Experimental data
Broadly speaking, experiments that identify protein interactions fall into two categories: low throughput approaches that test for interactions between a limited selection of proteins, and high throughput approaches that survey a large part of the proteome. All such techniques are affected by experimental errors, with high throughput approaches generally more affected. Experimental techniques can also be classified as to whether they identify interacting protein pairs or larger complexes of proteins, but in this paper we deal only with pairwise interactions.

The yeast two-hybrid technique is a high throughput approach that has been used to screen the yeast genome for pairwise interactions by two different groups (Ito et al., 2001; Uetz et al., 2000). The advantage of screening the entire genome (or at least a large part of it) for interactions is offset by a high rate of experimental error, and it is vital that the false positive rate is included in any analysis of the data. The potential benefits and drawbacks of such high throughput techniques have generated intense interest: see Legrain et al. (2001) and von Mering et al. (2002) for reviews of this area, and Sprinzak et al. (2003) for estimates of the false positive rate.

Given a list of interactions obtained from such experiments, absence of a protein pair from the list does not necessarily mean that the two proteins do not interact—it could be that interaction of the pair in question was not tested in the experimental set-up. This false negative probability has to be accounted for when making statistical inference based on absence from the list of interactions.

The experimental data used in this paper all concern the proteome of Saccharomyces cerevisiae. Three data sets of interactions are used: interactions observed in the yeast two-hybrid experiments of Ito et al. (2001) and Uetz et al. (2000), together with a set of interactions from the MIPS database (Mewes et al., 2002). The MIPS database contains interactions determined by a variety of experimental methods, many of which are focussed on individual interactions rather than large-scale interaction determination. Data sets of larger protein complexes are not used since they do not contain information about which proteins are in physical contact, and the evidence they contain for domain–domain contacts is therefore much weaker. More details about the data used are given in Section 2.4.

1.2 More about domains
Domains are structural subunits of proteins that can be thought of as ‘building blocks’ that are conserved during evolution. Proteins can consist of a single domain or, more frequently, of a combination of several domains: in prokaryotes about one-third of all proteins are single domain while in eukaryotes the fraction is about 20% (Apic et al., 2001). Nature is able to construct a vast array of different proteins by combining domains, and the present-day variety of domains is believed to have evolved from a relatively small number of ancestral gene sequences. Domains which are related to each other by descent from a common ancestor are said to be homologous, and can be grouped together to form a superfamily. Domains in the same superfamily usually preserve the same three-dimensional conformation, if not function, but their sequences may have diverged considerably during evolution, a phenomenon called remote homology. We regard the SCOP database (Murzin et al., 1995) as the standard for classifying domains in this way, and we work throughout with the SCOP classification.

It is often possible to determine a set of constituent domain superfamilies for a protein given its amino acid sequence. The SUPERFAMILY library (Gough et al., 2001) is a facility coupled to the SCOP classification used in this study to perform such assignments. It consists of a library of hidden Markov models (HMMs)—algorithms that have been trained to detect remote homology to the SCOP superfamilies in protein sequences. Given an amino acid sequence it produces a string of superfamily labels that represents the sequence of domains in the protein. This string may contain gaps for regions where no superfamily assignment was made, and the process may fail to assign any superfamilies to certain proteins. Proteins that were not assigned any superfamilies were removed from our analysis and gaps in the domain assignments were ignored. This can introduce a bias if such regions are involved in interaction, as described further in Section 4. More information on obtaining domain assignments is available from http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/comb.html.

While domains within the same superfamily are related by a common ancestor, it is possible that during the course of evolution their functions may have diverged. Similarly, the repertoire of interaction partners for each domain will not necessarily be conserved. If we observe domains from a particular superfamily pair coming into contact in a given protein interaction, it does not follow that in other protein pairs domains from this superfamily pair will always form contacts. However, our statistical approach is able to take this into account: the p-values we generate measure the evidence for the role played by each superfamily pair in protein interaction. When an interaction has been poorly conserved between the members of two superfamilies, this superfamily pair will be less significant and assigned a higher p-value.


    2 METHODS
 
This section describes the algorithm used to assign p-values to superfamily pairs and the procedure for predicting domain–domain contacts between proteins. Section 2.1 introduces our notation and assumptions. The algorithm used to compute the p-values is described in Sections 2.2 and 2.3 and is illustrated schematically in Figure 1. In Section 2.4 we provide details of the data set used to generate the p-values. Finally, Section 2.5 describes the scheme for predicting domain–domain contacts.



Fig. 1 A schematic representation of the algorithm to generate p-values. Shaded symbols represent superfamilies.

 
2.1 Notation and assumptions
Consider the complete set of proteins expressed in some organism, and let $\mathcal{P}$ denote the set of all (unordered) protein pairs. Let $\mathcal{D}$ denote the set of all possible pairs of domain superfamilies. Note that $\mathcal{P}$ includes proteins paired with themselves, representing the possibility of interaction between identical proteins. Similarly $\mathcal{D}$ includes domain superfamilies paired with themselves. Throughout we often use the symbol $i$ to denote an element of $\mathcal{P}$ and $j$ to denote an element of $\mathcal{D}$.

Each protein $p$ is assumed to have a domain assignment: a string $(A_1, A_2, A_3, \ldots)$ where each $A_k$ is a domain superfamily. Our analysis does not acknowledge any uncertainty in this assignment, a valid assumption on account of the high accuracy of the domain HMMs. We will write $A \in p$ to denote that protein $p$ contains a domain of superfamily $A$. Given a protein pair $i = (p,q)$ and superfamily pair $j = (A,B)$, let $N_{ij}$ be the number of distinct ways of having $p$ bound to $q$ via a contact between a domain in superfamily $A$ and a domain in superfamily $B$. Of course $N_{ij} \neq 0$ if and only if $A \in p$ and $B \in q$, or $B \in p$ and $A \in q$.
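The count of distinct ways a superfamily pair can mediate binding between two proteins can be illustrated with a short sketch. It assumes that domains are distinguished by their position in the architecture, so that repeated superfamilies each contribute separately; the function name `count_contacts` and the example architectures are hypothetical.

```python
from collections import Counter

def count_contacts(arch_p, arch_q, sf_a, sf_b):
    """Number of ways p can bind q through one (sf_a, sf_b) domain contact.

    Assumption of this sketch: domains are distinguished by position, so
    repeated superfamilies within a protein each count separately.
    """
    cp, cq = Counter(arch_p), Counter(arch_q)
    n = cp[sf_a] * cq[sf_b]
    if sf_a != sf_b:
        # The superfamily pair is unordered: also allow sf_b in p, sf_a in q.
        n += cp[sf_b] * cq[sf_a]
    return n

# Hypothetical domain architectures (strings of superfamily labels).
p = ["A", "B", "A"]
q = ["B", "C"]

print(count_contacts(p, q, "A", "B"))  # 2
print(count_contacts(p, q, "B", "B"))  # 1
print(count_contacts(p, q, "C", "C"))  # 0
```

The count is non-zero exactly when one protein contains the first superfamily and the other contains the second, in either order.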

We make the simplifying assumption that each pair of proteins either interacts or not, thereby ignoring issues such as the difference between stable and transient interactions, or interactions that require specific cellular conditions. Let $\mathcal{I}$ denote the set of such interacting pairs. In its simplest form the experimental data is a list $\mathcal{O}$ of protein pairs that have been observed to interact. It is important that experimental error and incompleteness of the data are taken into account, for which we assume fixed probabilities of false positive and false negative:

$$\Pr(i \notin \mathcal{I} \mid i \in \mathcal{O}) = \epsilon, \qquad \Pr(i \in \mathcal{I} \mid i \notin \mathcal{O}) = f.$$

Here Pr(U | V) denotes the conditional probability of an event U given another event V. It follows that the probability that protein pair $i$ interacts, given the experimental data $\mathcal{O}$, is given by:

$$\Pr(i \in \mathcal{I} \mid \mathcal{O}) = \begin{cases} 1 - \epsilon & \text{if } i \in \mathcal{O}, \\ f & \text{if } i \notin \mathcal{O}. \end{cases} \qquad (1)$$

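More generally, we may have several different sets of experimentally observed interacting pairs $\mathcal{O}_1, \ldots, \mathcal{O}_m$, each with a different false positive rate $\epsilon^{(k)}$. A simple approach to combining these is to take the probability of false positive for a particular pair $i$, denoted $\epsilon_i$, to be the lowest error rate across data sets containing that pair, $\epsilon_i = \min\{\epsilon^{(k)} : i \in \mathcal{O}_k\}$. The observed sets of interactions that we consider overlap very little, so this is a valid assumption. Writing $\mathcal{O} = \bigcup_k \mathcal{O}_k$ for the pooled data and $f$ for the probability of false negative, Equation (1) is then replaced by

$$\Pr(i \in \mathcal{I} \mid \mathcal{O}) = \begin{cases} 1 - \epsilon_i & \text{if } i \in \mathcal{O}, \\ f & \text{if } i \notin \mathcal{O}. \end{cases} \qquad (2)$$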

Estimates for the probabilities of false positive and negative in our data are given in Section 2.4.
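The combination rule for several data sets can be sketched as follows. The sketch assumes that the false positive rate of a data set is the probability that a pair it reports does not truly interact, and that the false negative rate is the probability that an unreported pair does interact. The function name, the data set contents and the numerical rates are hypothetical and chosen only for illustration.

```python
def interaction_probability(pair, datasets, false_negative):
    """Probability that `pair` interacts, given several observed data sets.

    `datasets` is a list of (observed_pairs, false_positive_rate) tuples.
    A pair found in several data sets takes the lowest false positive rate.
    """
    rates = [fp for observed, fp in datasets if pair in observed]
    if rates:
        return 1.0 - min(rates)
    return false_negative

# Hypothetical data: unordered protein pairs stored as frozensets.
set_one = {frozenset({"P1", "P2"}), frozenset({"P3"})}   # {"P3"} is a self-pair
set_two = {frozenset({"P1", "P2"})}
datasets = [(set_one, 0.5), (set_two, 0.2)]               # illustrative rates
f = 5.7e-4                                                # value from Section 2.4

print(interaction_probability(frozenset({"P1", "P2"}), datasets, f))  # 0.8
print(interaction_probability(frozenset({"P3"}), datasets, f))        # 0.5
print(interaction_probability(frozenset({"P2", "P3"}), datasets, f))  # 0.00057
```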

2.2 Assigning p-values to superfamily pairs
For each superfamily pair $j$ we want to test the null hypothesis $H_j$ that presence of the superfamily pair in a protein pair does not affect whether the two proteins interact. We also consider the global null hypothesis $H_\infty = \bigcap_j H_j$, that interaction is entirely unrelated to the domain architectures of proteins. In order to test these hypotheses a statistic $s(j)$ is attached to each superfamily pair $j$, reflecting the strength of evidence present in the experimental data that $j$ affects protein interaction. A p-value is calculated for each statistic by simulating data drawn from the global null hypothesis $H_\infty$. We now explain this procedure in detail.

The first step is to consider the expected number of contacts between domains of type $j$ under the null $H_\infty$ given the observed set of interactions $\mathcal{O}$. In order to compute this we make the following assumption: when a pair of proteins $i = (p,q)$ interact they do so by bringing a single domain from $p$ into contact with a single domain from $q$. Biological evidence suggests that contacts may in fact occur between several domains in an interacting protein pair, but this simplifying assumption can nonetheless be used to extract useful information. Given the experimental data $\mathcal{O}$, under the null $H_\infty$ and the assumption above, the expected number of contacts $E_j$ between domains of type $j$ across the entire proteome is given by:

$$E_j = \sum_i \Pr(i \in \mathcal{I} \mid \mathcal{O}) \, \Pr(\text{contact of type } j \text{ in pair } i \mid i \in \mathcal{I}, H_\infty),$$

where the sum runs over all protein pairs $i$ and $\Pr(i \in \mathcal{I} \mid \mathcal{O})$ is the probability that pair $i$ interacts given the data. The total number of possible contacts within pair $i$ is $\sum_k N_{ik}$, so under the global null

$$\Pr(\text{contact of type } j \text{ in pair } i \mid i \in \mathcal{I}, H_\infty) = \frac{N_{ij}}{\sum_k N_{ik}}.$$

It follows that

$$E_j = \sum_i \Pr(i \in \mathcal{I} \mid \mathcal{O}) \, \frac{N_{ij}}{\sum_k N_{ik}}. \qquad (3)$$

The first factor in this expression depends on $\mathcal{O}$ and is given by Equation (2). The second factor depends only on the domain architectures of the proteins. In addition to the expected number of contacts $E_j$, we also need to consider the total number of different possible contacts of type $j$ within the proteome, defined by

$$N_j = \sum_i N_{ij}.$$
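A minimal sketch of this calculation is given below. Each protein pair contributes its probability of interaction multiplied by the fraction of its possible contacts that are of the given type, and the total number of possible contacts of each type is accumulated alongside. The function names, the toy architectures and the probability function are hypothetical. The sketch also assumes that, for a protein paired with itself, every ordered choice of one domain from each copy is a distinct possible contact.

```python
from collections import Counter, defaultdict
from itertools import combinations_with_replacement

def contact_counts(arch_p, arch_q):
    """Possible contacts of every superfamily-pair type within one protein pair."""
    cp, cq = Counter(arch_p), Counter(arch_q)
    counts = Counter()
    for a in cp:
        for b in cq:
            counts[tuple(sorted((a, b)))] += cp[a] * cq[b]
    return counts

def expected_contacts(architectures, prob):
    """Return (expected contacts, possible contacts) keyed by superfamily pair.

    `prob(p, q)` gives the probability that proteins p and q interact.
    """
    expected, possible = defaultdict(float), defaultdict(int)
    # Unordered protein pairs, including each protein paired with itself.
    for p, q in combinations_with_replacement(sorted(architectures), 2):
        counts = contact_counts(architectures[p], architectures[q])
        total = sum(counts.values())
        weight = prob(p, q)
        for j, n in counts.items():
            expected[j] += weight * n / total   # equal weight for each possible contact
            possible[j] += n
    return dict(expected), dict(possible)

# Toy proteome; only the pair (p1, p2) has non-zero interaction probability.
architectures = {"p1": ["A", "B"], "p2": ["C"], "p3": ["A"]}
prob = lambda p, q: 0.5 if {p, q} == {"p1", "p2"} else 0.0

E, N = expected_contacts(architectures, prob)
print(E[("A", "C")], N[("A", "C")])  # 0.25 2
```

Summing the expected contacts over all superfamily pairs recovers the sum of the interaction probabilities, which is a convenient consistency check.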

Next we calculate a measure of interaction for each superfamily pair j in the following way. A 2 x 2 array is defined by

and the log odds ratio s(j) for the array is calculated:

A large value for s(j) is obtained when the superfamily pair j has a relatively large expected number of contacts in comparison with the other superfamily pairs. Measures of interaction other than the log odds ratio can be used, but in practice the choice of statistic was observed to have a small effect on the p-values.
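For a generic 2 x 2 array with rows (a, b) and (c, d), the log odds ratio is log(ad/bc). The sketch below computes it and applies it to one plausible layout: superfamily pair j against all other pairs, and expected contacts against the remaining possible contacts. That layout, the function names and the pseudo-count used to avoid division by zero are assumptions of this example and are not taken from the text.

```python
import math

def log_odds_ratio(a, b, c, d, pseudo=0.5):
    """Log odds ratio of the 2 x 2 array [[a, b], [c, d]].

    `pseudo` is added to every cell to avoid division by zero; this
    correction is an assumption of the sketch.
    """
    return math.log((a + pseudo) * (d + pseudo) / ((b + pseudo) * (c + pseudo)))

def score(e_j, n_j, e_total, n_total):
    """Hypothetical layout: pair j versus all other pairs (rows),
    expected contacts versus remaining possible contacts (columns)."""
    return log_odds_ratio(
        e_j, n_j - e_j,
        e_total - e_j, (n_total - n_j) - (e_total - e_j),
    )

# Illustrative numbers only: a pair with more expected contacts scores higher.
print(score(5.0, 10, 50.0, 1000) > score(1.0, 10, 50.0, 1000))  # True
```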

Recall that for each superfamily pair $j$ we are interested in testing the null hypothesis $H_j$ that presence of the superfamily pair in a protein pair has no effect on whether the two proteins interact. If protein interaction is not affected by the domain content of the proteins (i.e. if the global null hypothesis $H_\infty$ were true) then we can create a new set of proteins in which the domains have been shuffled between and within proteins, that interact in exactly the same way as the original set. The precise nature of this shuffling is discussed below. It is important to note that while the domains are shuffled, the network of interactions between proteins remains fixed. By shuffling the domains around in this way to create $n$ replicate data sets, and re-calculating the statistics $s(j)$ for each set, we generate a set of statistics $s_1(j), \ldots, s_n(j)$ for each superfamily pair $j$, that represents a sample from the distribution of the statistic $s(j)$ under $H_\infty$. By counting the number of times the simulated statistics exceed the observed statistics we can obtain a p-value for the observed data:

$$p_j = \frac{1}{n} \, \#\{ r : s_r(j) \geq s(j) \}.$$

The p-value $p_j$ estimates the probability of observing a statistic at least as large as the observed $s(j)$ if the null hypothesis $H_j$ is true.
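The counting step can be sketched in a few lines. The sketch assumes that ties count towards the p-value (a simulated statistic equal to the observed one is treated as exceeding it); the function name and the numbers are hypothetical.

```python
def empirical_p_value(observed, simulated):
    """Fraction of simulated statistics at least as large as the observed one."""
    return sum(s >= observed for s in simulated) / len(simulated)

# Four hypothetical replicate statistics for one superfamily pair.
print(empirical_p_value(2.0, [1.0, 2.0, 3.0, 0.5]))  # 0.5
```

With n replicates the smallest non-zero p-value that can be resolved is 1/n, which is one reason a large number of shuffles is needed.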

2.3 Shuffling algorithms
This section describes in some detail the algorithm used to shuffle domain superfamilies. The exact way in which the domain superfamilies are shuffled at each stage reflects different implicit assumptions in the null hypothesis H{infty}. The simplest method is random permutation of the domain superfamilies as follows. A string is formed by concatenating the domain assignments of all the proteins. A random permutation is applied to the string, and the shuffled string is used to re-assign domain superfamilies to proteins. This maintains the marginal frequencies of each superfamily within the data set as well as the number of domains found in each protein.

However, in nature certain consecutive superfamily pairs are over-represented in the domain architectures of proteins (Apic et al., 2001); indeed this tends to be the rule rather than the exception. For example the ‘motif’ A, B may occur more frequently in the domain architectures from an entire proteome than would be expected given the marginal frequency of the two superfamilies A and B. It is desirable to include this feature when testing the significance of superfamily pairs. This is achieved by shuffling the superfamilies at each stage using an algorithm that partially maintains the marginal frequency of consecutive superfamilies across the proteome. The first step is to shuffle the domains in the single-domain proteins, using the permutation method described above. Next, a string is formed by concatenating the domain architectures of all the multi-domain proteins. The letters of this string are shuffled in such a way as to preserve their marginal frequencies, together with the marginal frequency of consecutive letter pairs, via an algorithm described in Fitch (1983). The shuffled string is then used to re-assign domains to the multi-domain proteins.
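The simple permutation method can be sketched as follows. This illustrates only the plain random permutation, which preserves superfamily frequencies and the number of domains per protein; it does not attempt the shuffle that also preserves consecutive-pair frequencies. The function name and toy architectures are hypothetical.

```python
import random

def shuffle_architectures(architectures, rng):
    """Randomly permute all domains across proteins, keeping protein lengths."""
    names = list(architectures)
    # Concatenate the domain assignments of all proteins into one string.
    pool = [d for name in names for d in architectures[name]]
    rng.shuffle(pool)
    # Cut the shuffled string back into pieces of the original lengths.
    shuffled, start = {}, 0
    for name in names:
        k = len(architectures[name])
        shuffled[name] = pool[start:start + k]
        start += k
    return shuffled

architectures = {"p1": ["A", "B"], "p2": ["C"], "p3": ["A", "A", "D"]}
shuffled = shuffle_architectures(architectures, random.Random(0))
print({name: len(doms) for name, doms in shuffled.items()})
# {'p1': 2, 'p2': 1, 'p3': 3}
```

Because the protein names are unchanged, the interaction network can be reused as it is with the shuffled architectures when recomputing the statistics.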

2.4 Data and simulation
The domain assignments for the yeast open reading frames (ORFs) were obtained using the SUPERFAMILY library release number 1.61. After eliminating ‘null’ ORFs that were not assigned any superfamilies 3035 ORFs remained, containing a total of 555 different domain superfamilies. Of these ORFs, 2182 were assigned a single domain superfamily while 853 were assigned more than one superfamily.

Table 1 summarizes the interaction data sets. The Ito data set comprises two parts (‘core’ and ‘non-core’) with stronger levels of evidence for interactions in the core part. The MIPS data was taken from the pairwise interactions released on 12 August 2003, and consists of all physical interactions obtained via methods other than yeast two-hybrid. The second column of Table 1 lists the number of interactions that remain after eliminating ORFs that were not assigned any superfamilies. The approximate estimates of the false positive rates are taken from Sprinzak et al. (2003). Figure 2 shows the extent to which the data sets overlap (after eliminating null ORFs).


Table 1 Interaction data sets

 


Fig. 2 Numbers of interactions in the data sets, following removal of ORFs that were not assigned any domains.

 
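The simulation to generate p-values was performed using a set of interactions obtained by pooling together the Uetz data set, the MIPS data set and the Ito data set (combining core and non-core). The probability of false positive for each interaction was taken to be the minimum for each of the data sets it was found in. The false negative rate $f$ was estimated as follows:

$$f = \frac{|\mathcal{I} \cap \mathcal{O}^c|}{|\mathcal{O}^c|} = \frac{|\mathcal{I}| - |\mathcal{I} \cap \mathcal{O}|}{|\mathcal{P}| - |\mathcal{O}|},$$

where $\mathcal{P}$ is the set of all protein pairs, $\mathcal{I}$ the set of interacting pairs, $\mathcal{O}$ the pooled set of observed interactions, and $\mathcal{O}^c$ is the complement of $\mathcal{O}$. In fact, $|\mathcal{O}|$ is small compared to $|\mathcal{P}|$ and so can be ignored in the denominator. There are 6335 ORFs in the yeast genome and approximately $15 \times 10^3$ interactions, giving $|\mathcal{I}| / |\mathcal{P}| \approx 7.5 \times 10^{-4}$. The estimate of the total number of interactions is necessarily approximate, and is taken from Deng et al. (2002) and Legrain et al. (2001). The correction term $|\mathcal{I} \cap \mathcal{O}| / |\mathcal{P}|$ can be estimated using the probabilities of false positive, to give a final false negative estimate of $f = 5.7 \times 10^{-4}$ for the combined data set.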
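The arithmetic behind the uncorrected estimate can be checked directly. The sketch assumes that the false negative rate is the fraction of unobserved protein pairs that truly interact, and that the set of protein pairs includes each protein paired with itself. The function name and its parameters are hypothetical.

```python
n_orfs = 6335                           # ORFs in the yeast genome
n_pairs = n_orfs * (n_orfs + 1) // 2    # unordered pairs, self-pairs included
n_interactions = 15e3                   # approximate total number of interactions

# Uncorrected estimate: true interactions divided by all protein pairs.
uncorrected = n_interactions / n_pairs
print(n_pairs)                # 20069280
print(f"{uncorrected:.2e}")   # 7.47e-04

def false_negative_rate(n_true, n_true_observed, n_pairs, n_observed=0):
    """Fraction of unobserved pairs that interact (assumed definition)."""
    return (n_true - n_true_observed) / (n_pairs - n_observed)
```

Subtracting the true positives already present in the observed data lowers the estimate, which is the direction of the correction described above.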

In order to reduce the computational burden, p-values were calculated only for those superfamily pairs for which there was some evidence of interaction in the pooled data set: each possible superfamily pair was included provided it was present in at least one experimentally observed interaction. Let $\mathcal{D}_{\mathcal{O}}$ denote this set of superfamily pairs. Using the pooled set of interactions, $\mathcal{D}_{\mathcal{O}}$ contained 1931 superfamily pairs. Over $68 \times 10^3$ iterations of the algorithm were performed to compute the p-values.

2.5 Predicting contacts
We expect that in any protein pair that is known to interact the domains belonging to the superfamily pair with the lowest p-value are most likely to form a contact. This type of prediction was tested out on protein complexes in the PQS database (Henrick and Thornton, 1998): for each pair of interacting proteins we predict a domain–domain contact and compare this against the true three-dimensional configuration. The PQS is an Internet resource that makes available coordinates for likely quaternary states for structures contained in the Brookhaven Protein Data Bank (PDB) that were determined by X-ray crystallography. Contacts between proteins in the PQS were extracted by analysing the positions of constituent atoms.

Predictions were made for interacting protein pairs that satisfied the following constraints:

  1. At least one of the proteins must contain more than one domain. If this is not the case predicting which domains come into contact is trivial.
  2. Both proteins must only contain domains from superfamilies that are represented in the yeast genome. Presence of domains from superfamilies not found in yeast would bias the results.
  3. In addition we require that at least one of the possible contacts between the proteins is represented in the set of superfamily pairs defined in Section 2.4 (those present in at least one experimentally observed interaction). If all the possible contacts lie outside this set then there is no evidence in the pooled data to support any of the contacts, so attempting to make predictions is futile.
There are 5032 interacting protein pairs in the PQS satisfying conditions 1 and 2, and of these 1564 also satisfy condition 3. Predictions were also made for this data set using the scores of Sprinzak and Margalit (2001) and Deng et al. (2002). The scores were obtained by training on the pooled data set described in Section 2.4. The ‘Sprinzak-score’ for each superfamily pair can be calculated very easily, whereas the ‘Deng-score’ is computed via an involved likelihood maximization procedure. When making domain–domain contact predictions based on each of these scores the domain pair with the highest score was taken as the predicted contact.

Note that when a protein contains a number of domains from the same superfamily, several potential contacts may be assigned the same p-value (or score), and our predictive scheme is unable to distinguish between these. When the minimum p-value occurs for several potential contacts in this way, we simply choose one of them at random.
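The prediction rule, including the random tie-break, can be sketched as follows. The function name, the dictionary of p-values keyed by sorted superfamily pairs, and the example values are hypothetical. Candidate contacts whose superfamily pair has no p-value are simply skipped in this sketch.

```python
import random

def predict_contact(arch_p, arch_q, p_values, rng=random):
    """Return (index in p, index in q) of the predicted contact, or None."""
    scored = []
    for a_idx, a in enumerate(arch_p):
        for b_idx, b in enumerate(arch_q):
            key = tuple(sorted((a, b)))       # unordered superfamily pair
            if key in p_values:
                scored.append((p_values[key], a_idx, b_idx))
    if not scored:
        return None                            # no candidate has a p-value
    best = min(pv for pv, _, _ in scored)
    # Several potential contacts may share the lowest p-value: pick one at random.
    ties = [(a_idx, b_idx) for pv, a_idx, b_idx in scored if pv == best]
    return rng.choice(ties)

# Illustrative p-values only.
p_values = {("A", "C"): 0.2, ("B", "C"): 0.01}
print(predict_contact(["A", "B"], ["C"], p_values))  # (1, 0)
```

When a protein contains repeated domains from one superfamily, such as the architecture B, B, the two candidate contacts are indistinguishable and either may be returned.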

Domain contact predictions were also made using the p-values for interacting protein pairs satisfying condition 1 above taken from our pooled set of ‘training’ data. In total, 705 of the interactions satisfied this condition. For many of the yeast ORFs there is no known protein structure, and even less is known about the structure of binary complexes. These predictions therefore provide novel information as to how the interactions could be mediated.


    3 RESULTS
 
Results can be downloaded from http://www.mrc-bsu.cam.ac.uk/personal/thomas/protein_files.html. A list of superfamily pairs and their p-values is available from this website together with the domain–domain contact predictions for our pooled set of interaction data.

Analysis of interacting protein pairs in the PQS reveals contacts between domains from 660 different superfamily pairs (restricting to superfamilies represented in the yeast genome). Our analysis of the experimental genomic interaction data suggests contacts between a set of 716 superfamily pairs in the following way. We make the basic assumption that superfamily pairs assigned a sufficiently low p-value form contacts, and extract a list of pairs by imposing a p-value threshold. The p-value threshold was chosen to ensure a false discovery rate (FDR) of 5%: we expect 5% of superfamily pairs falling below the threshold not to be significant (see Benjamini and Hochberg, 1995 for more information on the FDR). The FDR of 5% corresponds to a p-value threshold of 0.0185, and 716 superfamily pairs lie below this threshold. Seventy-three of these pairs occur as contacts in the PQS. Figure 3 shows a small part of the network of superfamily interactions, indicating how our analysis extends the repertoire of contacts represented in the PQS.
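The choice of threshold can be illustrated with the Benjamini and Hochberg step-up rule, under which the k-th smallest of m p-values is compared with k times the target FDR divided by m. The reported figures appear consistent with this rule: with 1931 superfamily pairs tested and 716 below the threshold, 0.05 × 716 / 1931 ≈ 0.0185. The function name and the small example list are hypothetical.

```python
def bh_threshold(p_values, fdr=0.05):
    """Largest p-value accepted by the Benjamini-Hochberg step-up rule, or None."""
    m = len(p_values)
    threshold = None
    for k, p in enumerate(sorted(p_values), start=1):
        if p <= fdr * k / m:
            threshold = p        # keep the largest p-value that passes
    return threshold

print(bh_threshold([0.9, 0.03, 0.01, 0.02]))   # 0.03
print(round(0.05 * 716 / 1931, 4))             # 0.0185
```

All superfamily pairs with a p-value at or below the returned threshold are then treated as significant.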



Fig. 3 An extract of the network of SCOP superfamily interactions. Interactions between domains on separate chains within the PQS are shown as blue edges. Red edges represent superfamily pairs with p-value below a certain threshold (0.0185). Edges satisfying both criteria are marked in green. Loops correspond to interactions between domains in the same superfamily. Superfamilies of enzymatic domains are shaded yellow.

 
Figure 4 shows the results of testing the predictive scheme described in Section 2.5 on the PQS. Each interacting protein pair in the PQS satisfying our constraints is classified according to the number of potential contacts. (For example, a protein containing two domains interacting with a three-domain protein has six potential contacts.) Since contacts often occur between several different domain pairs within each interacting protein pair in the PQS, if we simply pick one potential contact at random there is a certain probability that this will be observed as a true contact. The expected success rate by picking a potential contact at random is shown in the figure, together with the success rates obtained using the Sprinzak-score, the Deng-score and the p-values.
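The random baseline can be made concrete with a small sketch: for each protein pair, the chance that a randomly chosen potential contact is a true contact is the number of true contacts divided by the number of potential contacts, and the baseline is the average over pairs. The function name and the counts below are hypothetical.

```python
def random_baseline(pairs):
    """Expected success rate of picking one potential contact at random.

    `pairs` is a list of (true_contacts, potential_contacts) per protein pair.
    """
    return sum(true / potential for true, potential in pairs) / len(pairs)

# Two hypothetical protein pairs, each with four potential contacts.
print(random_baseline([(1, 4), (2, 4)]))  # 0.375
```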



Fig. 4 Domain–domain contact prediction results. The results are broken down according to the potential number of domain–domain contacts available between protein pairs in the PQS database, and the number of protein pairs within each such category is shown at the bottom of the figure. The proportion of protein pairs for which four different prediction methods correctly predict a domain–domain contact is shown in the main graph. It is often observed in the PQS that several different domain pairs are in contact within each interacting protein pair. Any potential contact picked at random therefore has some probability of being confirmed as a contact in the PQS, and this baseline success rate is shown by the hatched bars. The other bars correspond to prediction using the Sprinzak-score, Deng-score and lowest p-value as described in Section 2.5. The error bars correspond to a 90% confidence interval based on a binomial distribution assumption.

 
From the figure it can be seen that the ‘expert’ prediction methods do not always outperform naive prediction at random (e.g. in the case of four potential contacts). Moreover, in some cases there is insufficient test data for meaningful comparisons between the methods to be made. For interacting proteins with two or six potential contacts, the p-value prediction method does not perform as well as prediction using the Deng or Sprinzak scores, although in the case of six potential contacts the evidence for this is limited. However, as the number of potential contacts increases—and as the prediction problem becomes harder—the p-value method outperforms the other methods. In particular, in the case of nine potential contacts there are 231 test pairs in the PQS, and the p-value method performs significantly better than prediction based on the Sprinzak or Deng scores.

It should be noted that the ability to make predictions of this kind is limited by three factors. Firstly, it is a well-known feature of the genomic protein–protein interaction data sets that they explore a relatively small region of the vast space of possible interactions (Legrain et al., 2001), so predictions made on the basis of these data are correspondingly limited, although this coverage will probably improve in the future. For example, in 26% of the PQS protein pairs with four potential contacts, none of the true contacts arose as a possible contact in the genomic data sets. Secondly, despite the constraints we impose on PQS entries, the protein complexes for which we make predictions are not representative of all such complexes in yeast, owing to the nature and constraints of crystallographic experiments. Thirdly, when the proteins contain a number of repeated domains, several different potential contacts will receive the same p-value, and our predictive scheme is unable to distinguish between these. This also applies to the Sprinzak and Deng scores.

In addition to testing the predictive approach against the PQS, domain–domain contact predictions were made for 705 protein interactions in our pooled data set, for which more than one domain–domain contact was possible. These predictions provide basic structural information for a large number of protein interactions for which such information has previously not been available. Selected examples are shown in Figure 5. Note that these predictions condition on each protein pair interacting: if the two proteins truly interact our prediction indicates the domain–domain contact most favoured by our analysis, but the pooled data set also contains false positives. However, it is natural to assume that protein pairs are more likely to interact when the lowest p-value is small. A total of 409 of the 705 interactions have the smallest p-value below the threshold of 0.0185 fixed above, and this information is included in the results on our website.
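A minimal sketch of lowest-p-value prediction is given below, assuming each protein is represented as a list of superfamily labels and that p-values are stored per unordered superfamily pair. The function name, the data layout, the superfamily labels and the p-values in the example are all hypothetical; skipping superfamily pairs that have no p-value is also an assumption. Only the threshold of 0.0185 comes from the text.

```python
THRESHOLD = 0.0185  # p-value threshold quoted in the text


def predict_contact(domains_a, domains_b, pvalues):
    """Return (tied best contacts, lowest p-value, whether it is below THRESHOLD).

    domains_a, domains_b: superfamily labels in N- to C-terminal order.
    pvalues: dict mapping frozenset of two superfamily labels to a p-value.
    """
    scored = []
    for i, fam_a in enumerate(domains_a):
        for j, fam_b in enumerate(domains_b):
            key = frozenset((fam_a, fam_b))
            if key in pvalues:  # assumption: pairs without a p-value are skipped
                scored.append((pvalues[key], i, j))
    if not scored:
        return [], None, False
    best_p = min(p for p, _, _ in scored)
    ties = [(i, j) for p, i, j in scored if p == best_p]
    return ties, best_p, best_p < THRESHOLD


# Hypothetical proteins: A has a repeated superfamily X, B has Y and Z.
table = {frozenset(("X", "Y")): 0.004, frozenset(("X", "Z")): 0.2}
print(predict_contact(["X", "X"], ["Y", "Z"], table))
```

In this example both copies of the repeated superfamily X tie for the lowest p-value against Y, which illustrates why repeated domains cannot be distinguished by any score defined at the superfamily level.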



Fig. 5 Schematic diagram showing predicted domain–domain contacts for pairwise interactions selected from the Uetz and MIPS data sets. The different shapes correspond to SCOP superfamilies, while regions of proteins that were not assigned SCOP superfamilies are shown as black lines. Proteins are displayed with the N-terminal on the left. The arrows indicate the predicted domain–domain contact for each interaction. (A) Predicted contacts between a Cullin repeat domain on the protein CDC53, and domains from the ‘Skp1 dimerization domain-like’ superfamily on the proteins SKP1 and MDM30. These protein interactions are from the Uetz data set and are supported by other experiments [Yamanaka,A. et al. (2002) Curr. Biol. 12(4), 267–275, and Fritz,S. et al. (2003) Mol. Biol. Cell. 14(6), 2303–2313]. Domains from these superfamilies are known to interact in a human protein complex [Zheng,N. et al. (2002) Nature 416(6882), 36–36], and a 3D structure for this interaction is known (PDB identifier 1ldk). (B) Predicted contacts between an ATPase domain and domain I of DNA repair protein MutS. Four pairwise interactions are shown: MLH1–MSH3, PMS1–MSH3, MLH1–MSH6 and PMS1–MSH6. These proteins are involved in mismatch repair in meiosis and mitosis, specifically single-base and insertion–deletion mismatch repair. The mode of interaction is unknown, and our predictions are included as a suggestion for experimental verification.

 

    4 DISCUSSION
 
This paper proposes a methodology for the analysis of large sets of protein interaction data from genomic experiments in terms of the constituent domains within the proteins. The main motivation behind the methodology is the need to assess the evidence in such data sets for physical contact between domains in a statistically rigorous way. In particular, unlike existing approaches in the literature, our method allows domain pairs to be ranked in terms of evidence of contact by using a rigorous statistical measure of evidence, the p-value. A sophisticated simulation technique is necessary to generate these p-values, since the usual statistical association tests do not have an explicit asymptotic null distribution due to the complexity of the data. Our methodology allows for observational uncertainty, specifically false positive and false negative rates in interaction experiments. Estimates for these rates are incorporated in the analysis, with the possibility of merging data reflecting different degrees of error in the same analysis.

By imposing a p-value threshold we extracted a set of 716 superfamily pairs that play a statistically significant role in protein interaction. The p-value threshold was chosen in such a way as to control the false discovery rate (Benjamini and Hochberg, 1995), i.e. the expected proportion of pairs incorrectly included on the list. Under the assumption that domains from these superfamily pairs form physical contacts, we have demonstrated how large-scale interaction data sets extend the collection of superfamily contacts observed in the PQS database.
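For readers unfamiliar with false discovery rate control, the following sketch shows the standard Benjamini and Hochberg (1995) step-up rule for choosing a p-value threshold. It is an illustration of that general procedure under the assumption of a target rate q, not a reproduction of the analysis that produced the threshold used in this paper; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg_threshold(pvalues, q):
    """Largest p-value rejected by the step-up rule at target rate q, or None."""
    m = len(pvalues)
    threshold = None
    for rank, p in enumerate(sorted(pvalues), start=1):
        # Step-up rule: keep the largest ordered p-value with p <= rank * q / m.
        if p <= rank * q / m:
            threshold = p
    return threshold


# Hypothetical p-values for four superfamily pairs.
example = [0.5, 0.01, 0.03, 0.02]
cutoff = benjamini_hochberg_threshold(example, q=0.05)
selected = [p for p in example if cutoff is not None and p <= cutoff]
print(cutoff, selected)
```

All pairs with a p-value at or below the returned cutoff are selected, so the cutoff plays the same role as a fixed p-value threshold applied to the ranked list.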

We have also tested a simple method for predicting domain–domain contacts between interacting proteins on the basis of the p-values. Predictions were made for interacting protein pairs in the PQS for which the contacts are known. Prediction based on the p-values outperformed prediction using other scores when the number of potential domain–domain contacts between two proteins is relatively high, and hence when the prediction problem is harder. For smaller numbers of potential contacts the p-value method was not as successful as other methods. Domain contact predictions based on the p-values were also made for 705 interacting protein pairs taken from the Uetz, Ito and MIPS data sets. In this way we have suggested novel structural information for a large number of protein interactions.

As discussed in Section 3, the predictive power of our method is limited by the quality and coverage of binary interaction data. The method is also limited by the fact that gaps in domain assignments are ignored, and by the assumption that interaction is mediated by domain–domain contact, whereas protein interactions can also be mediated by a domain binding to a short protein motif (Pawson and Nash, 2003; Puntervoll et al., 2003). This could lead to erroneous predictions: a contact could be predicted between two domains when in fact it is an adjacent gap that is mediating the interaction. This gap could contain another domain that was not recognized by the assignment procedure or a short peptide motif of the type described above. Our methodology also relies on the assumption of a single domain–domain contact between interacting proteins, but analysis of the PQS reveals that many protein interactions involve contacts between several domains.

It is clear that this work raises many questions about the pattern and nature of domain–domain contacts in protein complexes. The PQS contains a wealth of information about domain–domain contacts in protein complexes, and while this paper has examined this information very briefly, a more detailed analysis is likely to be fruitful.

Received on June 3, 2004; revised on September 1, 2004; accepted on October 5, 2004

    REFERENCES
 

    Aloy, P. and Russell, R.B. (2002) Interrogating protein interaction networks through structural biology. Proc. Natl Acad. Sci. USA, 99, 5896–5901.

    Aloy, P., Bottcher, B., Ceulemans, H., Leutwein, C., Mellwig, C., Fischer, S., Gavin, A.C., Bork, P., Superti-Furga, G., Serrano, L., Russell, R.B. (2004) Structure-based assembly of protein complexes in yeast. Science, 303, 2026–2029.

    Apic, G., Gough, J., Teichmann, S.A. (2001) Domain combinations in archaeal, eubacterial and eukaryotic proteomes. J. Mol. Biol., 310, 311–325.

    Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Statist. Soc. B, 57, 289–300.

    Deng, M.H., Mehta, S., Sun, F.Z., Chen, T. (2002) Inferring domain–domain interactions from protein–protein interactions. Genome Res., 12, 1540–1548.

    Fitch, W.M. (1983) Random sequences. J. Mol. Biol., 163, 171–176.

    Gough, J., Karplus, K., Hughey, R., Chothia, C. (2001) Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J. Mol. Biol., 313, 909–919.

    Henrick, K. and Thornton, J.M. (1998) PQS: a protein quaternary structure file server. Trends Biochem. Sci., 23, 358–361.

    Ito, T., Chiba, T., Ozawa, R., Yoshida, M., Hattori, M., Sakaki, Y. (2001) A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl Acad. Sci. USA, 98, 4569–4574.

    Kim, W.K., Park, J., Suh, J.K. (2002) Large scale statistical prediction of protein–protein interaction by potentially interacting domain pair. Genome Inform., 13, 42–50.

    Legrain, P., Wojcik, J., Gauthier, J.M. (2001) Protein–protein interaction maps: a lead towards cellular functions. Trends Genet., 17, 346–352.

    Mewes, H.W., Frishman, D., Güldener, U., Mannhaupt, G., Mayer, K., Mokrejs, M., Morgenstern, B., Münsterkoetter, M., Rudd, S., Weil, B. (2002) MIPS: a database for genomes and protein sequences. Nucleic Acids Res., 30, 31–34.

    Murzin, A.G., Brenner, S.E., Hubbard, T., Chothia, C. (1995) Scop—a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536–540.

    Ng, S.K., Zhang, Z., Tan, S.H. (2003) Integrative approach for computationally inferring protein domain interactions. Bioinformatics, 19, 923–929.

    Pawson, T. and Nash, P. (2003) Assembly of cell regulatory systems through protein interaction domains. Science, 300, 445–452.

    Puntervoll, P., Linding, R., Gemund, C., Chabanis-Davidson, S., Mattingsdal, M., Cameron, S., Martin, D.M.A., Ausiello, G., Brannetti, B., Costantini, A., et al. (2003) ELM server: a new resource for investigating short functional sites in modular eukaryotic proteins. Nucleic Acids Res., 31, 3625–3630.

    Russell, R.B., Alber, F., Aloy, P., Davis, F.P., Korkin, D., Pichaud, M., Topf, M., Sali, A. (2004) A structural perspective on protein–protein interactions. Curr. Opin. Struc. Biol., 14, 313–324.

    Sprinzak, E. and Margalit, H. (2001) Correlated sequence-signatures as markers of protein–protein interaction. J. Mol. Biol., 311, 681–692.

    Sprinzak, E., Sattath, S., Margalit, H. (2003) How reliable are experimental protein–protein interaction data? J. Mol. Biol., 327, 919–923.

    Uetz, P., Giot, L., Cagney, G. (2000) A comprehensive analysis of protein–protein interactions in Saccharomyces cerevisiae. Nature, 403, 623–627.

    von Mering, C., Krause, R., Snel, B., Cornell, M., Oliver, S.G., Fields, S., Bork, P. (2002) Comparative assessment of large-scale data sets of protein–protein interactions. Nature, 417, 399–403.





