Bioinformatics Advance Access originally published online on October 27, 2004
Bioinformatics 2005 21(7):993-1001; doi:10.1093/bioinformatics/bti086
Published by Oxford University Press 2004.
Statistical analysis of domains in interacting protein pairs
1Medical Research Council Biostatistics Unit Cambridge, UK
2Dipartimento di Informatica e Sistemistica, Università di Pavia Italy
3Medical Research Council Laboratory of Molecular Biology Cambridge, UK
*To whom correspondence should be addressed.
| Abstract |
|---|
Motivation: Several methods have recently been developed to analyse large-scale sets of physical interactions between proteins in terms of physical contacts between the constituent domains, often with a view to predicting new pairwise interactions. Our aim is to combine genomic interaction data, in which domain-domain contacts are not explicitly reported, with the domain-level structure of individual proteins, in order to learn about the structure of interacting protein pairs. Our approach is driven by the need to assess the evidence for physical contacts between domains in a statistically rigorous way.
Results: We develop a statistical approach that assigns p-values to pairs of domain superfamilies, measuring the strength of evidence within a set of protein interactions that domains from these superfamilies form contacts. A set of p-values is calculated for SCOP superfamily pairs, based on a pooled data set of interactions from yeast. These p-values can be used to predict which domains come into contact in an interacting protein pair. This predictive scheme is tested against protein complexes in the Protein Quaternary Structure (PQS) database, and is used to predict domain-domain contacts within 705 interacting protein pairs taken from our pooled data set.
Contact: thomas.nye{at}mrc-bsu.cam.ac.uk
| 1 INTRODUCTION |
|---|
Proteins frequently bind together in pairs or larger complexes to take part in biological processes. Understanding such interactions across the entire genome is an important goal with diverse implications about protein function. Experimental techniques that test thousands of pairs of proteins within a genome for their ability to interact (Ito et al., 2001; Uetz et al., 2000) have therefore attracted considerable interest. These large-scale interaction assays produce no explicit information about structural aspects of the interactions: how the proteins come into contact physically and the shapes they adopt. Our aim in this paper is to combine such interaction data with information about the structure of individual proteins at the domain level in order to learn about the structure of protein complexes.
A number of different experimental and computational techniques are available for elucidating the structure of protein complexes (Russell et al., 2004), but the complete three-dimensional atomic structure is available only for a limited range of proteins and their complexes. While such information can be used to predict the geometry of protein complexes of unknown structure (Aloy and Russell, 2002), this source of information is severely limited by the difficulty of performing experiments to determine structures at the atomic level. Given the large amount of genomic interaction data of the type described above, and the fact that further data are likely to become available in the future as more experiments are performed, techniques for extracting structural information from such data are important. In addition, while the complete three-dimensional structure is not available for the majority of proteins, it is often possible to elucidate the structure of a protein at the domain level by sequence homology alone.
A number of approaches to analysing and predicting protein interactions in terms of constituent domains have recently been published. The basic idea is to work within some classification of protein domains, and identify pairs of domain superfamilies (or other such groupings) that occur frequently in interacting protein pairs within some large-scale interaction data set. This information can then be used to make predictions about new groups of proteins that might interact, and also indicate the domains which might be contacting each other in these proteins. A common approach (Kim et al., 2002; Ng et al., 2003; Sprinzak and Margalit, 2001) has been to score domain pairs according to how frequently the pair occurs in interacting protein pairs, one in each protein, relative to the abundance of the domains across the proteome. However, there is little consensus over how such scores should be calculated, and a number of statistical issues arise: is it meaningful to compare the scores for different domain pairs directly? How would the scores be distributed if the domain composition of proteins had no influence on interaction? A more rigorous alternative approach due to Deng et al. (2002) employs a maximum likelihood method to estimate a probability of interaction for every domain pair. This involves using the EM algorithm to maximize the likelihood in a space with potentially thousands of dimensions.
This paper presents an approach to pairwise protein interaction aimed at addressing the statistical issues mentioned above. However, unlike existing approaches (Deng et al., 2002; Sprinzak and Margalit, 2001), our aim is to predict the most likely pair of domains mediating a given protein interaction, rather than predicting new protein interactions. Given a pair of proteins that are known to interact, the ability to predict structural aspects of this interaction would be of great biological value. Aloy et al. (2004) considered a similar structural prediction problem, but at a larger scale, predicting protein-protein contacts within complexes containing more than two proteins.
The approach we adopt has the following form. A simple model of contacts between protein domains is used to assign a score to superfamily pairs. Throughout the paper the term contact is used to signify physical binding between two domains in separate proteins. Rather than comparing scores for different superfamily pairs directly, a data-simulation method is used to generate a p-value for the observed score of each superfamily pair, measuring the strength of evidence for the role played by the superfamily pair in protein interaction. We generate a set of p-values for SCOP superfamily pairs, based on a set of protein interactions obtained by pooling together several data sets of interactions from yeast.
The ranked list of superfamily pairs we obtain can be used to predict which domains come into contact in an interacting protein pair: it is natural to expect that the pair with the lowest p-value should form a contact. We test this idea by extracting contacts between proteins in complexes in the Protein Quaternary Structure (PQS) database (Henrick and Thornton, 1998) and comparing these contacts with our predictions. We also predict where contacts occur between the interacting protein pairs in our pooled set of yeast interactions, and show how our analysis expands the repertoire of superfamily interactions found in the PQS.
1.1 Experimental data
Broadly speaking, experiments that identify protein interactions fall into two categories: low throughput approaches that test for interactions between a limited selection of proteins, and high throughput approaches that survey a large part of the proteome. All such techniques are affected by experimental errors, with high throughput approaches generally more affected. Experimental techniques can also be classified as to whether they identify interacting protein pairs or larger complexes of proteins, but in this paper we deal only with pairwise interactions.
The yeast two-hybrid technique is a high throughput approach that has been used to screen the yeast genome for pairwise interactions by two different groups (Ito et al., 2001; Uetz et al., 2000). The advantage of screening the entire genome (or at least a large part of it) for interactions is offset by a high rate of experimental error, and it is vital that the false positive rate is included in any analysis of the data. The potential benefits and drawbacks of such high throughput techniques have generated intense interest: see Legrain et al. (2001) and von Mering et al. (2002) for reviews of this area, and Sprinzak et al. (2003) for estimates of the false positive rate.
Given a list of interactions obtained from such experiments, absence of a protein pair from the list does not necessarily mean that the two proteins do not interact: it could be that interaction of the pair in question was not tested in the experimental set-up. This false negative probability has to be accounted for when making statistical inference based on absence from the list of interactions.
The experimental data used in this paper all concern the proteome of Saccharomyces cerevisiae. Three data sets of interactions are used: interactions observed in the yeast two-hybrid experiments of Ito et al. (2001) and Uetz et al. (2000), together with a set of interactions from the MIPS database (Mewes et al., 2002). The MIPS database contains interactions determined by a variety of experimental methods, many of which are focussed on individual interactions rather than large-scale interaction determination. Datasets of larger protein complexes are not used since they do not contain information about which proteins are in physical contact, and the evidence they contain for domain-domain contacts is therefore much weaker. More details about the data used are given in Section 2.4.
1.2 More about domains
Domains are structural subunits of proteins that can be thought of as building blocks that are conserved during evolution. Proteins can consist of a single domain or, more frequently, of a combination of several domains: in prokaryotes about one-third of all proteins are single domain while in eukaryotes the fraction is about 20% (Apic et al., 2001). Nature is able to construct a vast array of different proteins by combining domains, and the present-day variety of domains is believed to have evolved from a relatively small number of ancestral gene sequences. Domains which are related to each other by descent from a common ancestor are said to be homologous, and can be grouped together to form a superfamily. Domains in the same superfamily usually preserve the same three-dimensional conformation, if not function, but their sequences may have diverged considerably during evolution, a phenomenon called remote homology. We regard the SCOP database (Murzin et al., 1995) as the standard for classifying domains in this way, and we work throughout with the SCOP classification.
It is often possible to determine a set of constituent domain superfamilies for a protein given its amino acid sequence. The SUPERFAMILY library (Gough et al., 2001) is a facility coupled to the SCOP classification used in this study to perform such assignments. It consists of a library of hidden Markov models (HMMs), algorithms that have been trained to detect remote homology to the SCOP superfamilies in protein sequences. Given an amino acid sequence it produces a string of superfamily labels that represents the sequence of domains in the protein. This string may contain gaps for regions where no superfamily assignment was made, and the process may fail to assign any superfamilies to certain proteins. Proteins that were not assigned any superfamilies were removed from our analysis and gaps in the domain assignments were ignored. This can introduce a bias if such regions are involved in interaction, as described further in Section 4. More information on obtaining domain assignments is available from http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/comb.html.
While domains within the same superfamily are related by a common ancestor, it is possible that during the course of evolution their functions may have diverged. Similarly, the repertoire of interaction partners for each domain will not necessarily be conserved. If we observe domains from a particular superfamily pair coming into contact in a given protein interaction, it does not follow that in other protein pairs domains from this superfamily pair will always form contacts. However, our statistical approach is able to take this into account: the p-values we generate measure the evidence for the role played by each superfamily pair in protein interaction. When an interaction has been poorly conserved between the members of two superfamilies, this superfamily pair will be less significant and assigned a higher p-value.
| 2 METHODS |
|---|
This section describes the algorithm used to assign p-values to superfamily pairs and the procedure for predicting domain-domain contacts between proteins. Section 2.1 introduces our notation and assumptions. The algorithm used to compute the p-values is described in Sections 2.2 and 2.3 and is illustrated schematically in Figure 1. In Section 2.4 we provide details of the data set used to generate the p-values. Finally, Section 2.5 describes the scheme for predicting domain-domain contacts.
2.1 Notation and assumptions
Consider the complete set of proteins expressed in some organism, and let $\mathcal{P}$ denote the set of all (unordered) protein pairs. Let $\mathcal{D}$ denote the set of all possible pairs of domain superfamilies. Note that $\mathcal{P}$ includes proteins paired with themselves, representing the possibility of interaction between identical proteins. Similarly, $\mathcal{D}$ includes domain superfamilies paired with themselves. Throughout we often use the symbol $i$ to denote an element of $\mathcal{P}$ and $j$ to denote an element of $\mathcal{D}$.
Each protein $p$ is assumed to have a domain assignment: a string $(A_1, A_2, A_3, \ldots)$ where each $A_k$ is a domain superfamily. Our analysis does not acknowledge any uncertainty in this assignment, a valid assumption on account of the high accuracy of the domain HMMs. We will write $A \in p$ to denote that protein $p$ contains a domain of superfamily $A$. Given a protein pair $i = (p,q)$ and superfamily pair $j = (A,B)$, let $N_{ij}$ be the number of distinct ways of having $p$ bound to $q$ via a contact between a domain in superfamily $A$ and a domain in superfamily $B$. Of course $N_{ij} > 0$ if and only if $A \in p$ and $B \in q$, or $B \in p$ and $A \in q$.
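The count $N_{ij}$ can be illustrated with a minimal Python sketch. The function name `count_contacts` and the toy domain assignments are hypothetical, and the sketch assumes that every domain-to-domain pairing between the two proteins counts as one distinct way; it treats the two proteins as separate molecules and makes no special correction for a protein paired with itself.

```python
from collections import Counter

def count_contacts(domains_p, domains_q, sf_a, sf_b):
    """Count the pairings of one domain from each protein that realise (sf_a, sf_b).

    domains_p, domains_q: lists of superfamily labels (the domain assignments).
    Assumption: each domain-to-domain pairing is one distinct way of binding.
    """
    cp, cq = Counter(domains_p), Counter(domains_q)
    if sf_a == sf_b:
        # A superfamily paired with itself: one orientation only.
        return cp[sf_a] * cq[sf_a]
    # Either protein may contribute the sf_a domain.
    return cp[sf_a] * cq[sf_b] + cp[sf_b] * cq[sf_a]

# Hypothetical domain assignments for two proteins p and q.
p = ["A", "B", "A"]
q = ["B", "C"]
print(count_contacts(p, q, "A", "B"))  # 2
print(count_contacts(p, q, "B", "B"))  # 1
print(count_contacts(p, q, "C", "C"))  # 0
```

The count is zero exactly when neither orientation of the superfamily pair is present across the two proteins, in line with the condition stated above.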
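We make the simplifying assumption that each pair of proteins $i$ in the set $\mathcal{P}$ of all protein pairs either interacts or not, thereby ignoring issues such as the difference between stable and transient interactions, or interactions that require specific cellular conditions. Let $\mathcal{I} \subset \mathcal{P}$ denote the set of such interacting pairs. In its simplest form the experimental data is a list $\mathcal{O} \subset \mathcal{P}$ of protein pairs that have been observed to interact. It is important that experimental error and incompleteness of the data are taken into account, for which we assume fixed probabilities of false positive and false negative:
$$\varepsilon = \Pr(i \notin \mathcal{I} \mid i \in \mathcal{O}), \qquad f = \Pr(i \in \mathcal{I} \mid i \notin \mathcal{O}).$$
The probability that a pair $i$ interacts given the data, $\Pr(i \in \mathcal{I} \mid \mathcal{O})$, is given by:
$$\Pr(i \in \mathcal{I} \mid \mathcal{O}) = \begin{cases} 1 - \varepsilon & \text{if } i \in \mathcal{O}, \\ f & \text{if } i \notin \mathcal{O}. \end{cases} \qquad (1)$$
More generally, we may have several different sets of experimentally observed interacting pairs $\mathcal{O}_1, \ldots, \mathcal{O}_m$, each with a different false positive rate $\varepsilon_l$. A simple approach to combining these is to take the probability of false positive for a particular pair $i$ to be the lowest error rate across data sets containing that pair. The observed sets of interactions that we consider overlap very little, so this is a valid assumption. Writing $\mathcal{O} = \bigcup_l \mathcal{O}_l$, Equation (1) is then replaced by
$$\Pr(i \in \mathcal{I} \mid \mathcal{O}) = \begin{cases} 1 - \min_{l \,:\, i \in \mathcal{O}_l} \varepsilon_l & \text{if } i \in \mathcal{O}, \\ f & \text{if } i \notin \mathcal{O}. \end{cases} \qquad (2)$$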
Estimates for the probabilities of false positive and negative in our data are given in Section 2.4.
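The rule for combining data sets can be sketched in a few lines of Python. This is an illustration under our reading of the text rather than a definitive implementation: a pair observed in at least one data set is taken to interact with probability one minus the lowest false positive rate among the data sets containing it, and an unobserved pair is assigned the false negative probability. The function name, the protein labels and the rates below are hypothetical.

```python
def interaction_probability(pair, datasets, false_negative):
    """Probability that `pair` truly interacts, given the observed data sets.

    pair:           frozenset of protein identifiers (an unordered pair).
    datasets:       list of (set_of_pairs, false_positive_rate) tuples.
    false_negative: probability that an unobserved pair nevertheless interacts.
    """
    rates = [fp for pairs, fp in datasets if pair in pairs]
    if rates:
        # Observed at least once: use the lowest false positive rate.
        return 1.0 - min(rates)
    return false_negative

# Hypothetical data sets with made-up error rates.
set_a = {frozenset(("P1", "P2")), frozenset(("P2", "P3"))}
set_b = {frozenset(("P1", "P2"))}
datasets = [(set_a, 0.5), (set_b, 0.25)]

print(interaction_probability(frozenset(("P1", "P2")), datasets, 0.001))  # 0.75
print(interaction_probability(frozenset(("P2", "P3")), datasets, 0.001))  # 0.5
print(interaction_probability(frozenset(("P4", "P5")), datasets, 0.001))  # 0.001
```

Using `frozenset` keys keeps the pairs unordered, so a pair is found regardless of the order in which its two proteins are listed.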
2.2 Assigning p-values to superfamily pairs
For each superfamily pair $j \in \mathcal{D}$ we want to test the null hypothesis $H_j$ that presence of the superfamily pair in a protein pair does not affect whether the two proteins interact. We also consider the global null hypothesis $H_0 = \bigcap_{j} H_j$, that interaction is entirely unrelated to the domain architectures of proteins. In order to test these hypotheses a statistic is attached to each superfamily pair $j$ that reflects the strength of evidence present in the observed interactions $\mathcal{O}$ that $j$ affects protein interaction. A p-value is calculated for each statistic by simulating data drawn from the global null hypothesis $H_0$. We now explain this procedure in detail.
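The first step is to consider the expected number of contacts between domains of type $j$ under the global null hypothesis $H_0$ given the observed set of interactions $\mathcal{O}$. In order to compute this we make the following assumption: when a pair of proteins $i = (p,q)$ interact they do so by bringing a single domain from $p$ into contact with a single domain from $q$. Biological evidence suggests that contacts may in fact occur between several domains in an interacting protein pair, but this simplifying assumption can nonetheless be used to extract useful information. Given the experimental data $\mathcal{O}$, under the null $H_0$ and the assumption above, the expected number of contacts $E_j$ between domains of type $j$ across the entire proteome is given by:
$$E_j = \sum_{i} \Pr(i \text{ interacts} \mid \mathcal{O}) \, \Pr(\text{the contact in } i \text{ is of type } j \mid i \text{ interacts}, H_0).$$
The total number of possible contacts within the pair $i$ is $\sum_k N_{ik}$, so under the global null
$$\Pr(\text{the contact in } i \text{ is of type } j \mid i \text{ interacts}, H_0) = \frac{N_{ij}}{\sum_k N_{ik}},$$
$$E_j = \sum_{i} \Pr(i \text{ interacts} \mid \mathcal{O}) \, \frac{N_{ij}}{\sum_k N_{ik}}, \qquad (3)$$
where the first factor is the probability that the pair $i$ interacts and is given by Equation (2). The second factor depends only on the domain architectures of the proteins. In addition to the expected number of contacts $E_j$, we also need to consider the total number of different possible contacts of type $j$ within the proteome, defined by
$$T_j = \sum_{i} N_{ij}.$$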
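A minimal sketch of these two quantities in Python follows. It assumes, as our reading of the single-contact assumption under the global null, that the contact in an interacting pair is equally likely to be any of the possible domain pairings between the two proteins. The function name, the dictionary layout and the toy numbers are hypothetical.

```python
from collections import defaultdict

def expected_and_possible_contacts(n_counts, interaction_prob):
    """Return (expected, possible) contact counts per superfamily pair.

    n_counts:         {protein_pair: {superfamily_pair: N_ij}}
    interaction_prob: {protein_pair: probability that the pair interacts}
    """
    expected = defaultdict(float)
    possible = defaultdict(int)
    for pair, row in n_counts.items():
        total = sum(row.values())  # all possible contacts within this pair
        for sf_pair, n in row.items():
            possible[sf_pair] += n
            if total > 0:
                # Share of the pair's interaction probability given to sf_pair.
                expected[sf_pair] += interaction_prob[pair] * n / total
    return dict(expected), dict(possible)

# Toy example with two protein pairs and two superfamily pairs.
n_counts = {"i1": {"AB": 2, "AC": 2}, "i2": {"AB": 1}}
interaction_prob = {"i1": 0.5, "i2": 0.25}
expected, possible = expected_and_possible_contacts(n_counts, interaction_prob)
print(expected)  # {'AB': 0.5, 'AC': 0.25}
print(possible)  # {'AB': 3, 'AC': 2}
```

In this toy example the expected counts sum to 0.75, the sum of the interaction probabilities, as they must when each interacting pair contributes exactly one contact.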
Next we calculate a measure of interaction for each superfamily pair j in the following way. A 2 x 2 array
is defined by
![]() |
![]() |
Recall that for each superfamily pair $j$ we are interested in testing the null hypothesis $H_j$ that presence of the superfamily pair in a protein pair has no effect on whether the two proteins interact. If protein interaction is not affected by the domain content of the proteins (i.e. if the global null hypothesis $H_0$ were true) then we can create a new set of proteins in which the domains have been shuffled between and within proteins, that interact in exactly the same way as the original set. The precise nature of this shuffling is discussed below. It is important to note that while the domains are shuffled, the network of interactions between proteins remains fixed. By shuffling the domains around in this way to create $n$ replicate data sets, and re-calculating the statistics $s(j)$ for each set, we generate a set of statistics $s_1(j), \ldots, s_n(j)$ for each superfamily pair $j$ that represents a sample from the distribution of the statistic $s(j)$ under $H_0$. By counting the number of times the simulated statistics exceed the observed statistic $s(j)$ we can obtain a p-value for the observed data:
$$p_j = \frac{1}{n} \, \#\{\, k : s_k(j) > s(j) \,\}.$$
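The counting step can be sketched as follows in Python. The function name is hypothetical, and the sketch follows the wording above literally by counting only simulated statistics that strictly exceed the observed one; whether ties were counted in the original analysis is not stated, so this is an assumption.

```python
def empirical_p_value(observed, simulated):
    """Fraction of simulated statistics that strictly exceed the observed one.

    observed:  statistic computed from the real data for one superfamily pair.
    simulated: statistics for the same pair from the n shuffled replicates.
    """
    if not simulated:
        raise ValueError("at least one simulated replicate is required")
    exceed = sum(1 for s in simulated if s > observed)
    return exceed / len(simulated)

# Hypothetical values: two of the four replicates exceed the observed statistic.
print(empirical_p_value(2.0, [0.5, 1.0, 2.5, 3.0]))  # 0.5
```

With this definition the smallest non-zero p-value is 1/n, so the number of replicates fixes the resolution of the p-values.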
2.3 Shuffling algorithms
This section describes in some detail the algorithm used to shuffle domain superfamilies. The exact way in which the domain superfamilies are shuffled at each stage reflects different implicit assumptions in the global null hypothesis $H_0$. The simplest method is random permutation of the domain superfamilies as follows. A string is formed by concatenating the domain assignments of all the proteins. A random permutation is applied to the string, and the shuffled string is used to re-assign domain superfamilies to proteins. This maintains the marginal frequencies of each superfamily within the data set as well as the number of domains found in each protein. However, in nature certain consecutive superfamily pairs are over-represented in the domain architectures of proteins (Apic et al., 2001); indeed this tends to be the rule rather than the exception. For example the motif A, B may occur more frequently in the domain architectures from an entire proteome than would be expected given the marginal frequency of the two superfamilies A and B. It is desirable to include this feature when testing the significance of superfamily pairs. This is achieved by shuffling the superfamilies at each stage using an algorithm that partially maintains the marginal frequency of consecutive superfamilies across the proteome. The first step is to shuffle the domains in the single-domain proteins, using the permutation method described above. Next, a string is formed by concatenating the domain architectures of all the multi-domain proteins. The letters of this string are shuffled in such a way as to preserve their marginal frequencies, together with the marginal frequency of consecutive letter pairs, via an algorithm described in Fitch (1983). The shuffled string is then used to re-assign domains to the multi-domain proteins.
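The simple permutation method can be sketched in Python as below. The sketch covers only the plain random permutation; the pair-preserving shuffle of Fitch (1983) used for the multi-domain proteins is not reproduced here. The function name and the toy architectures are hypothetical.

```python
import random

def permute_domains(architectures, rng=None):
    """Randomly permute domain labels across proteins.

    architectures: {protein_id: [superfamily labels in order]}
    Each protein keeps its number of domains, and the overall count of
    each superfamily is unchanged.
    """
    if rng is None:
        rng = random.Random()
    names = list(architectures)
    # Concatenate all domain assignments into one pool and shuffle it.
    pool = [d for name in names for d in architectures[name]]
    rng.shuffle(pool)
    # Deal the shuffled labels back out, protein by protein.
    shuffled, start = {}, 0
    for name in names:
        length = len(architectures[name])
        shuffled[name] = pool[start:start + length]
        start += length
    return shuffled

# Hypothetical architectures for three proteins.
original = {"P1": ["A", "B"], "P2": ["C"], "P3": ["A", "A", "D"]}
replicate = permute_domains(original, random.Random(0))
print({name: len(doms) for name, doms in replicate.items()})
```

Because the protein identifiers are untouched, the interaction network defined on those identifiers stays fixed while the domain content changes, which is the property the simulation relies on.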
2.4 Data and simulation
The domain assignments for the yeast open reading frames (ORFs) were obtained using the SUPERFAMILY library release number 1.61. After eliminating null ORFs that were not assigned any superfamilies 3035 ORFs remained, containing a total of 555 different domain superfamilies. Of these ORFs, 2182 were assigned a single domain superfamily while 853 were assigned more than one superfamily.
Table 1 summarizes the interaction data sets. The Ito data set comprises two parts (core and non-core) with stronger levels of evidence for interactions in the core part. The MIPS data was taken from the pairwise interactions released on 12 August 2003, and consists of all physical interactions obtained via methods other than yeast two-hybrid. The second column of Table 1 lists the number of interactions that remain after eliminating ORFs that were not assigned any superfamilies. The approximate estimates of the false positive rates are taken from Sprinzak et al. (2003). Figure 2 shows the extent to which the data sets overlap (after eliminating null ORFs).
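The simulation to generate p-values was performed using a set of interactions obtained by pooling together the Uetz data set, the MIPS data set and the Ito data set (combining core and non-core). The probability of false positive for each interaction was taken to be the minimum over the data sets it was found in. The false negative rate $f$ was estimated as follows:
$$f = \frac{|\mathcal{I} \cap \mathcal{O}^c|}{|\mathcal{O}^c|} = \frac{|\mathcal{I}| - |\mathcal{I} \cap \mathcal{O}|}{|\mathcal{P}| - |\mathcal{O}|},$$
where $\mathcal{I}$ is the set of truly interacting pairs, $\mathcal{O}$ is the set of observed pairs, and $\mathcal{O}^c$ is the complement of $\mathcal{O}$ within the set $\mathcal{P}$ of all protein pairs. In fact, $|\mathcal{O}|$ is small compared to $|\mathcal{P}|$ and so can be ignored in the denominator. There are 6335 ORFs in the yeast genome and approximately $15 \times 10^3$ interactions, giving $|\mathcal{I}|/|\mathcal{P}| \approx 7.5 \times 10^{-4}$. The estimate of the total number of interactions $|\mathcal{I}|$ is necessarily approximate, and is taken from Deng et al. (2002) and Legrain et al. (2001). The correction term $|\mathcal{I} \cap \mathcal{O}|/|\mathcal{P}|$ can be estimated using the probabilities of false positive, to give a final false negative estimate of $f = 5.7 \times 10^{-4}$ for the combined data set.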
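The order of magnitude of this estimate can be checked with a short calculation. The sketch below assumes that the set of protein pairs counts unordered pairs and includes self-pairs, as in Section 2.1; it computes only the uncorrected ratio of interactions to protein pairs, not the final corrected estimate, since the correction term depends on data not reproduced here.

```python
def unordered_pairs(n_proteins: int) -> int:
    """Number of unordered protein pairs, self-pairs included."""
    return n_proteins * (n_proteins + 1) // 2

n_pairs = unordered_pairs(6335)   # ORFs in the yeast genome (from the text)
n_interactions = 15_000           # approximate total quoted in the text
uncorrected = n_interactions / n_pairs

print(n_pairs)                    # 20069280
print(f"{uncorrected:.2e}")       # 7.47e-04
```

The uncorrected ratio is an upper bound on the false negative rate: subtracting the true interactions already present among the observed pairs lowers it towards the value of 5.7 x 10^-4 quoted above.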
In order to reduce the computational burden, p-values were calculated only for those superfamily pairs for which there was some evidence of interaction in the pooled data set: each possible superfamily pair was included provided it was present in at least one experimentally observed interaction. Let $\mathcal{D}^{*}$ denote this set of superfamily pairs. Using the pooled set of interactions, $\mathcal{D}^{*}$ contained 1931 superfamily pairs. Over $68 \times 10^3$ iterations of the algorithm were performed to compute the p-values.
2.5 Predicting contacts
We expect that in any protein pair that is known to interact the domains belonging to the superfamily pair with the lowest p-value are most likely to form a contact. This type of prediction was tested out on protein complexes in the PQS database (Henrick and Thornton, 1998): for each pair of interacting proteins we predict a domain-domain contact and compare this against the true three-dimensional configuration. The PQS is an Internet resource that makes available coordinates for likely quaternary states for structures contained in the Brookhaven Protein Data Bank (PDB) that were determined by X-ray crystallography. Contacts between proteins in the PQS were extracted by analysing the positions of constituent atoms.
Predictions were made for interacting protein pairs that satisfied the following constraints:
- At least one of the proteins must contain more than one domain. If this is not the case predicting which domains come into contact is trivial.
- Both proteins must only contain domains from superfamilies that are represented in the yeast genome. Presence of domains from superfamilies not found in yeast would bias the results.
- In addition we require that at least one of the possible contacts between the proteins is represented in the set $\mathcal{D}^{*}$ of superfamily pairs defined in Section 2.4. If all the possible contacts lie outside $\mathcal{D}^{*}$ then there is no evidence in the observed interactions $\mathcal{O}$ to support any of the contacts, so attempting to make predictions is futile.
Note that when a protein contains a number of domains from the same superfamily, several potential contacts may be assigned the same p-value (or score), and our predictive scheme is unable to distinguish between these. When the minimum p-value occurs for several potential contacts in this way, we simply choose one of them at random.
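The predictive rule, including the random tie-break, can be sketched in Python as follows. The function name, the dictionary of p-values and the example proteins are hypothetical; superfamily pairs without a p-value are simply skipped, and `None` is returned when no potential contact has a p-value.

```python
import random
from itertools import product

def predict_contact(domains_p, domains_q, p_values, rng=None):
    """Predict the contacting domains as (index in p, index in q).

    domains_p, domains_q: lists of superfamily labels for the two proteins.
    p_values: {frozenset({A, B}): p-value} for scored superfamily pairs.
    Ties at the minimum p-value are broken at random.
    """
    if rng is None:
        rng = random.Random()
    candidates = []
    for (a_idx, a), (b_idx, b) in product(enumerate(domains_p), enumerate(domains_q)):
        key = frozenset((a, b))  # unordered superfamily pair
        if key in p_values:
            candidates.append((p_values[key], a_idx, b_idx))
    if not candidates:
        return None  # no potential contact has a p-value
    best = min(c[0] for c in candidates)
    ties = [(a_idx, b_idx) for p, a_idx, b_idx in candidates if p == best]
    return rng.choice(ties)

# Hypothetical p-values for two superfamily pairs.
p_values = {frozenset(("A", "C")): 0.2, frozenset(("B", "C")): 0.01}
print(predict_contact(["A", "B"], ["C"], p_values))  # (1, 0)
```

Returning domain positions rather than superfamily labels makes the tie between repeated domains explicit: two copies of the same superfamily in one protein yield two candidates with the same p-value.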
Domain contact predictions were also made using the p-values for interacting protein pairs, taken from our pooled set of training data, that satisfy the first condition above. In total, 705 of the interactions satisfied this condition. For many of the yeast ORFs there is no known protein structure, and even less is known about the structure of binary complexes. These predictions therefore provide novel information as to how the interactions could be mediated.
| 3 RESULTS |
|---|
Results can be downloaded from http://www.mrc-bsu.cam.ac.uk/personal/thomas/protein_files.html. A list of superfamily pairs and their p-values is available from this website together with the domain-domain contact predictions for our pooled set of interaction data.
Analysis of interacting protein pairs in the PQS reveals contacts between domains from 660 different superfamily pairs (restricting to superfamilies represented in the yeast genome). Our analysis of the experimental genomic interaction data suggests contacts between a set of 716 superfamily pairs in the following way. We make the basic assumption that superfamily pairs assigned a sufficiently low p-value form contacts, and extract a list of pairs by imposing a p-value threshold. The p-value threshold was chosen to ensure a false discovery rate (FDR) of 5%: we expect 5% of superfamily pairs falling below the threshold not to be significant (see Benjamini and Hochberg, 1995 for more information on the FDR). The FDR of 5% corresponds to a p-value threshold of 0.0185, and 716 superfamily pairs lie below this threshold. Seventy-three of these pairs occur as contacts in the PQS. Figure 3 shows a small part of the network of superfamily interactions, indicating how our analysis extends the repertoire of contacts represented in the PQS.
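For readers unfamiliar with FDR control, the sketch below shows the standard step-up procedure of Benjamini and Hochberg (1995) in Python. The text does not describe how the threshold of 0.0185 was computed, so presenting this procedure is an assumption on our part; the function name and the example p-values are hypothetical.

```python
def bh_threshold(p_values, fdr=0.05):
    """Largest p-value accepted by the Benjamini-Hochberg step-up rule.

    Returns None when no p-value passes. All p-values less than or equal
    to the returned threshold are declared significant.
    """
    ordered = sorted(p_values)
    m = len(ordered)
    threshold = None
    for rank, p in enumerate(ordered, start=1):
        # Step-up rule: keep the largest p with p <= fdr * rank / m.
        if p <= fdr * rank / m:
            threshold = p
    return threshold

# Hypothetical p-values for ten superfamily pairs.
example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(bh_threshold(example))  # 0.008
```

In this toy example only the two smallest p-values fall below their rank-dependent cut-offs, so the threshold is 0.008 even though several other values are below 0.05.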
Figure 4 shows the results of testing the predictive scheme described in Section 2.5 on the PQS. Each interacting protein pair in the PQS satisfying our constraints is classified according to the number of potential contacts. (For example, a protein containing two domains interacting with a three-domain protein has six potential contacts.) Since contacts often occur between several different domain pairs within each interacting protein pair in the PQS, if we simply pick one potential contact at random there is a certain probability that this will be observed as a true contact. The expected success rate by picking a potential contact at random is shown in the figure, together with the success rates obtained using the Sprinzak-score, the Deng-score and the p-values.
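The baseline of picking at random can be made concrete with a small calculation. In this Python sketch the function names are hypothetical, and the number of true contacts in the example is made up for illustration; only the two-domain and three-domain case comes from the text.

```python
def potential_contacts(n_domains_p, n_domains_q):
    """Number of potential domain-domain contacts between two proteins."""
    return n_domains_p * n_domains_q

def random_success_rate(n_domains_p, n_domains_q, n_true_contacts):
    """Chance that one potential contact picked uniformly at random is true."""
    return n_true_contacts / potential_contacts(n_domains_p, n_domains_q)

print(potential_contacts(2, 3))       # 6
print(random_success_rate(2, 3, 3))   # 0.5
```

Averaging this rate over all test pairs with the same number of potential contacts gives the expected success rate of random prediction for that class.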
From the figure it can be seen that the expert prediction methods do not always outperform naive prediction at random (e.g. in the case of four potential contacts). Moreover, in some cases there is insufficient test data for meaningful comparisons between the methods to be made. For interacting proteins with two or six potential contacts, the p-value prediction method does not perform as well as prediction using the Deng or Sprinzak scores, although in the case of six potential contacts the evidence for this is limited. However, as the number of potential contacts increases, and as the prediction problem becomes harder, the p-value method outperforms the other methods. In particular, in the case of nine potential contacts there are 231 test pairs in the PQS, and the p-value method performs significantly better than prediction based on the Sprinzak or Deng scores.
It should be noted that the ability to make predictions of this kind is limited by three factors. First, it is a well-known feature of the genomic protein-protein interaction data sets that they explore a relatively small region of the vast space of possible interactions (Legrain et al., 2001), and so making predictions on the basis of this data will be limited, though in the future this coverage will probably improve. For example, in 26% of the PQS protein pairs with four potential contacts, none of the true contacts arose as a possible contact in the genomic data sets. Secondly, despite the constraints we impose on PQS entries, the protein complexes for which we make predictions are not representative of all such complexes in yeast, owing to the nature and constraints of crystallographic experiments. Thirdly, when the proteins involve a number of repeated domains several different potential contacts will receive the same p-value, and our predictive scheme is unable to distinguish between these. This also applies to the Sprinzak and Deng scores.
In addition to testing the predictive approach against the PQS, domain-domain contact predictions were made for 705 protein interactions in our pooled data set, for which more than one domain-domain contact was possible. These predictions provide basic structural information for a large number of protein interactions for which such information has previously not been available. Selected examples are shown in Figure 5. Note that these predictions condition on each protein pair interacting: if the two proteins truly interact our prediction indicates the domain-domain contact most favoured by our analysis, but the pooled data set also contains false positives. However, it is natural to assume that protein pairs are more likely to interact when the lowest p-value is small. A total of 409 of the 705 interactions have the smallest p-value below the threshold of 0.0185 fixed above, and this information is included in the results on our website.
| 4 DISCUSSION |
|---|
This paper proposes a methodology for the analysis of large sets of protein interaction data from genomic experiments in terms of the constituent domains within the proteins. The main motivation behind the methodology is the need to assess the evidence in such data sets for physical contact between domains in a statistically rigorous way. In particular, unlike existing approaches in the literature, our method allows domain pairs to be ranked in terms of evidence of contact by using a rigorous statistical measure of evidence, the p-value. A sophisticated simulation technique is necessary to generate these p-values, since the usual statistical association tests do not have an explicit asymptotic null distribution due to the complexity of the data. Our methodology allows for observational uncertainty, specifically false positive and false negative rates in interaction experiments. Estimates for these rates are incorporated in the analysis, with the possibility of merging data reflecting different degrees of error in the same analysis.
By imposing a p-value threshold we extracted a set of 716 superfamily pairs that play a statistically significant role in protein interaction. The p-value threshold was chosen in such a way as to control the false discovery rate (Benjamini and Hochberg, 1995), i.e. the expected number of pairs incorrectly included on the list. Under the assumption that domains from these superfamily pairs form physical contacts, we have demonstrated how large-scale interaction data sets extend the collection of superfamily contacts observed in the PQS database.
We have also tested a simple method for predicting domaindomain contacts between interacting proteins on the basis of the p-values. Predictions were made for interacting protein pairs in the PQS for which the contacts are known. Prediction based on the p-values outperformed prediction using other scores when the number of potential domaindomain contacts between two proteins is relatively high, and hence when the prediction problem is harder. For smaller numbers of potential contacts the p-value method was not as successful as other methods. Domain contact predictions based on the p-values were also made for 705 interacting protein pairs taken from the Uetz, Ito and MIPS data sets. In this way we have suggested novel structural information for a large number of protein interactions.
As discussed in Section 3, the predictive power of our method is limited by the quality and coverage of binary interaction data. The method is also limited by the fact that gaps in domain assignments are ignored, and by the assumption that interaction is mediated by domain-domain contact; protein interactions can also be mediated by a domain binding to a short protein motif (Pawson and Nash, 2003; Puntervoll et al., 2003). This could lead to erroneous predictions: a contact could be predicted between two domains when in fact it is an adjacent gap that is mediating the interaction. This gap could contain another domain that was not recognized by the assignment procedure or a short peptide motif of the type described above. Our methodology also relies on the assumption of a single domain-domain contact between interacting proteins, but analysis of the PQS reveals that many protein interactions involve contacts between several domains.
It is clear that this work raises many questions about the pattern and nature of domain-domain contacts in protein complexes. The PQS contains a wealth of information about domain-domain contacts in protein complexes, and while this paper has examined this information very briefly, a more detailed analysis is likely to be fruitful.
Received on June 3, 2004; revised on September 1, 2004; accepted on October 5, 2004
| REFERENCES |
|---|
|
|
|---|
Aloy, P. and Russell, R.B. (2002) Interrogating protein interaction networks through structural biology. Proc. Natl Acad. Sci. USA, 99, 5896-5901.
Aloy, P., Bottcher, B., Ceulemans, H., Leutwein, C., Mellwig, C., Fischer, S., Gavin, A.C., Bork, P., Superti-Furga, G., Serrano, L., Russell, R.B. (2004) Structure-based assembly of protein complexes in yeast. Science, 303, 2026-2029.
Apic, G., Gough, J., Teichmann, S.A. (2001) Domain combinations in archaeal, eubacterial and eukaryotic proteomes. J. Mol. Biol., 310, 311-325.
Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Statist. Soc. B, 57, 289-300.
Deng, M.H., Mehta, S., Sun, F.Z., Chen, T. (2002) Inferring domain-domain interactions from protein-protein interactions. Genome Res., 12, 1540-1548.
Fitch, W.M. (1983) Random sequences. J. Mol. Biol., 163, 171-176.
Gough, J., Karplus, K., Hughey, R., Chothia, C. (2001) Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J. Mol. Biol., 313, 909-919.
Henrick, K. and Thornton, J.M. (1998) PQS: a protein quaternary structure file server. Trends Biochem. Sci., 23, 358-361.
Ito, T., Chiba, T., Ozawa, R., Yoshida, M., Hattori, M., Sakaki, Y. (2001) A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl Acad. Sci. USA, 98, 4569-4574.
Kim, W.K., Park, J., Suh, J.K. (2002) Large scale statistical prediction of protein-protein interaction by potentially interacting domain pair. Genome Inform., 13, 42-50.
Legrain, P., Wojcik, J., Gauthier, J.M. (2001) Protein-protein interaction maps: a lead towards cellular functions. Trends Genet., 17, 346-352.
Mewes, H.W., Frishman, D., Güldener, U., Mannhaupt, G., Mayer, K., Mokrejs, M., Morgenstern, B., Münsterkoetter, M., Rudd, S., Weil, B. (2002) MIPS: a database for genomes and protein sequences. Nucleic Acids Res., 30, 31-34.
Murzin, A.G., Brenner, S.E., Hubbard, T., Chothia, C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536-540.
Ng, S.K., Zhang, Z., Tan, S.H. (2003) Integrative approach for computationally inferring protein domain interactions. Bioinformatics, 19, 923-929.
Pawson, T. and Nash, P. (2003) Assembly of cell regulatory systems through protein interaction domains. Science, 300, 445-452.
Puntervoll, P., Linding, R., Gemund, C., Chabanis-Davidson, S., Mattingsdal, M., Cameron, S., Martin, D.M.A., Ausiello, G., Brannetti, B., Costantini, A., et al. (2003) ELM server: a new resource for investigating short functional sites in modular eukaryotic proteins. Nucleic Acids Res., 31, 3625-3630.
Russell, R.B., Alber, F., Aloy, P., Davis, F.P., Korkin, D., Pichaud, M., Topf, M., Sali, A. (2004) A structural perspective on protein-protein interactions. Curr. Opin. Struc. Biol., 14, 313-324.
Sprinzak, E. and Margalit, H. (2001) Correlated sequence-signatures as markers of protein-protein interaction. J. Mol. Biol., 311, 681-692.
Sprinzak, E., Sattath, S., Margalit, H. (2003) How reliable are experimental protein-protein interaction data? J. Mol. Biol., 327, 919-923.
Uetz, P., Giot, L., Cagney, G. (2000) A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature, 403, 623-627.
von Mering, C., Krause, R., Snel, B., Cornell, M., Oliver, S.G., Fields, S., Bork, P. (2002) Comparative assessment of large-scale data sets of protein-protein interactions. Nature, 417, 399-403.

















