Bioinformatics Advance Access originally published online on July 5, 2005
Bioinformatics 2005 21(17):3548-3557; doi:10.1093/bioinformatics/bti567
Local modeling of global interactome networks

1Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, USA
2Center for Cancer Systems Biology and Department of Cancer Biology, Dana-Farber Cancer Institute, and Department of Genetics, Harvard Medical School, Boston, MA 02115, USA
3Program in Computational Biology, Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
*To whom correspondence should be addressed.
| Abstract |
|---|
Motivation: Systems biology requires accurate models of protein complexes, including physical interactions that assemble and regulate these molecular machines. Yeast two-hybrid (Y2H) and affinity-purification/mass-spectrometry (APMS) technologies measure different protein-protein relationships, and issues of completeness, sensitivity and specificity fuel debate over which is best for high-throughput interactome data collection. Static graphs currently used to model Y2H and APMS data neglect dynamic and spatial aspects of macromolecular complexes and pleiotropic protein function.
Results: We apply the local modeling methodology proposed by Scholtens and Gentleman (2004) to two publicly available datasets and demonstrate its uses, interpretation and limitations. Specifically, we use this technology to address four major issues pertaining to protein-protein networks. (1) We motivate the need to move from static global interactome graphs to local protein complex models. (2) We formally show that accurate local interactome models require both Y2H and APMS data, even in idealized situations. (3) We briefly discuss experimental design issues and how bait selection affects interpretability of results. (4) We point to the implications of local modeling for systems biology including functional annotation, new complex prediction, pathway interactivity and coordination with gene-expression data.
Availability: The local modeling algorithm and all protein complex estimates reported here can be found in the R package apComplex, available at http://www.bioconductor.org
Contact: dscholtens{at}northwestern.edu
Supplementary information: http://daisy.prevmed.northwestern.edu/~denise/pubs/LocalModeling
| INTRODUCTION |
|---|
Cellular systems depend on multiprotein complexes in which individual proteins assemble into functional modules (Alberts, 1998; Hartwell et al., 1999). Complexes may have stable or dynamic composition, and many share common proteins as members. Complex co-members may physically bind to each other, or instead directly interact with common bridge proteins. Large-scale experiments in Saccharomyces cerevisiae (Gavin et al., 2002; Ho et al., 2002; Ito et al., 2001; Uetz et al., 2000), Caenorhabditis elegans (Li et al., 2004), and Drosophila (Giot et al., 2003) continue to add to the vast compilation of data characterizing two different protein-protein relationships. Yeast two-hybrid (Y2H) detects binary physical protein-protein interactions, and affinity-purification/mass-spectrometry (APMS) identifies hit proteins that are co-members of complexes with bait proteins without regard to physical complex topology. Graph theoretic analyses have led to topological descriptions of the overall interactome (Jeong et al., 2001; Salwinski and Eisenberg, 2003), but current static models for these data neglect dynamic and spatial aspects of complex formation. Scholtens and Gentleman (2004) describe a statistical approach to local modeling of all complexes and their constituent proteins, thereby meeting one more requirement for a more complete functional characterization of the cell.
In this paper, we apply the methodology of Scholtens and Gentleman (2004) to estimate protein complex membership for two publicly available datasets and show how moving from static global interactome graphs to local models of protein-protein interactions provides a more realistic view of functional modules in the cell. We demonstrate the appropriate joint use of APMS and Y2H data, pointing out the need for increased data collection using both technologies. Recognizing the importance of integrated analyses of genomic and proteomic data, we examine the similarity of gene expression profiles for protein complex co-members. We also raise relevant APMS experimental design issues which await further research before they are resolved. The local modeling algorithm is summarized so that readers understand the nature of the results and the ensuing discussion.
To motivate the local modeling algorithm and formalize the conceptual differences between Y2H and APMS read-outs, we employ a hypothetical example of six proteins, P1,...,P6, composing two distinct complexes, C1 and C2. Suppose P4 is a part of complex C1 with P1, P2 and P6, and is also part of a different complex C2 with P3 and P5 with physical topologies shown in Figure 1a. In an idealized situation, the static graph obtained from 100% complete, sensitive and specific (100CSS) Y2H data would point to all binary physical interactions. However, the resulting model would fail to reflect the dynamic and/or spatial aspects of the system, i.e. P4's presence in two distinct complexes (Fig. 1b, upper panel). Current spoke and matrix models (Bader and Hogue, 2003) of equally idealized 100CSS APMS data would generate networks of complex co-memberships. In the spoke model, edges connect the bait protein in a purification to all corresponding hit proteins (Fig. 1b, lower panel), and in the matrix model, additional edges connect all pairs of hits. In the hypothetical example in Figure 1b, the matrix model would place edges between all six proteins since they would all be found in the purification using P4 as the bait. Even with 100CSS APMS data, the matrix model generates false positive (FP) complex co-memberships, e.g. P1 and P3 would be connected by an edge even though they are not complex co-members. For this reason, we work with the spoke model which, in the 100CSS setting, accurately records the complex co-memberships assayed by APMS. Despite its correct representation of observed APMS data, the static spoke model does not recognize that a protein might be a part of different complexes in vivo, either in the same cell or in different cells of a population.
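The difference between the spoke and matrix models in this hypothetical example can be sketched in a few lines of Python. This is only an illustration under the idealized 100CSS assumption: the complex memberships are taken from the text, while the helper names (`purification`, `spoke_edges`, `matrix_edges`, `true_comembers`) are hypothetical and are not part of apComplex.

```python
from itertools import combinations

# Hypothetical complexes from the six-protein example in the text.
COMPLEXES = {
    "C1": {"P1", "P2", "P4", "P6"},
    "C2": {"P3", "P4", "P5"},
}

def purification(bait):
    """Hits for one bait under idealized (100CSS) APMS: every co-member of the bait."""
    hits = set()
    for members in COMPLEXES.values():
        if bait in members:
            hits |= members
    hits.discard(bait)
    return hits

def spoke_edges(bait):
    """Spoke model: one edge from the bait to each of its hits."""
    return {frozenset((bait, hit)) for hit in purification(bait)}

def matrix_edges(bait):
    """Matrix model: edges between every pair of proteins in the purification."""
    nodes = purification(bait) | {bait}
    return {frozenset(pair) for pair in combinations(nodes, 2)}

def true_comembers(a, b):
    """True if the two proteins share at least one complex."""
    return any(a in members and b in members for members in COMPLEXES.values())

spoke = spoke_edges("P4")
matrix = matrix_edges("P4")
# Matrix-model edges that join proteins sharing no complex (false positives).
false_positives = sorted(tuple(sorted(e)) for e in matrix if not true_comembers(*e))

print(len(spoke), len(matrix))
print(false_positives)
```

With P4 as bait, every spoke edge is a true co-membership, whereas the matrix model adds edges such as P1 to P3 that join proteins from different complexes.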
Even for 100CSS data, both Y2H and spoke or matrix modeling of APMS would fail to generate the necessary information to extract a correct functional wiring diagram. This is a limitation for systems biology research since valid assessment of both the functional activity and interactivity of protein complexes depends on their accurate description as physically distinct modules. Local modeling addresses this limitation and points to the next steps for ultimately understanding the operation and regulation of cellular systems.
| SYSTEM AND METHODS |
|---|
Modeling idealized data
Local modeling exploits the direct relationship between maximal complete subgraphs in spoke APMS graphs and complex membership. In a graph of nodes and edges, complete subgraphs are sets of nodes for which all pairwise edges are present (e.g. P1, P2 and P4), and maximal complete subgraphs are complete subgraphs not contained in any other complete subgraph (e.g. P1, P2, P4 and P6; Fig. 1c, lower panel). Although the commonly used static spoke APMS representation only reflects complex co-membership, maximal complete subgraphs in the spoke network capture complex membership, represented in a bipartite graph (Fig. 1c, lower panel). In this example, maximal complete subgraphs lead to strict assignment of complex membership with P1, P2, P4 and P6 composing C1, and P3, P4 and P5 composing C2. Maximal complete subgraphs differentiate between C1 and C2 and retain P4 as a member of both. In general, maximal complete subgraphs and their corresponding bipartite graphs capture multicomplex membership by individual proteins, a biological reality that must be accommodated for an accurate understanding of functional systems.
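As an illustration of the mapping from maximal complete subgraphs to complex membership, the following Python sketch enumerates maximal cliques for the six-protein example. It assumes idealized 100CSS data with every protein used as a bait, so that an edge joins two proteins exactly when they share a complex. The basic Bron-Kerbosch recursion is used here for convenience and is not claimed to be the procedure implemented in apComplex.

```python
from itertools import combinations

# Hypothetical complexes from the six-protein example in the text.
COMPLEXES = {"C1": {"P1", "P2", "P4", "P6"}, "C2": {"P3", "P4", "P5"}}

# Idealized spoke graph: an undirected edge joins two proteins iff they are co-members.
adjacency = {}
for members in COMPLEXES.values():
    for a, b in combinations(sorted(members), 2):
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

def maximal_cliques(adj):
    """Enumerate maximal complete subgraphs with the basic Bron-Kerbosch recursion."""
    found = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            found.append(frozenset(current))  # nothing can extend this clique
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adj), set())
    return found

cliques = sorted(maximal_cliques(adjacency), key=lambda c: sorted(c))
# Bipartite representation: each estimated complex mapped to its member proteins.
membership = {f"complex_{k + 1}": sorted(c) for k, c in enumerate(cliques)}
print(membership)
```

In this sketch P4 appears in both estimated complexes, which is the multicomplex membership that a single static graph cannot express.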
APMS-determined complex membership is the first step in local modeling; given the proteins in a complex, their physical connectivity still needs to be ascertained using Y2H data. Hence, both APMS and Y2H data are necessary for accurate complex models, even with 100CSS data (Fig. 1c). Appropriate use of both data types can extend systems modeling from collections of proteins performing coordinated functions to a description of the physical mechanics underlying systems operation.
Modeling actual interactome data
Actual APMS experiments are typically not genome-wide and therefore, not 100% complete, i.e. not all complex co-memberships are tested. At the start of APMS experiments a collection of proteins is prespecified to be baits, one for each purification. Each bait finds a set of hit proteins that are co-members with itself in at least one complex. The hits may be baits for other purifications, or they may be hit-only proteins that are detected as hits but never used as baits; therefore, all proteins in APMS datasets can be designated as either baits or hit-only proteins. The resultant set of co-memberships is not unbiased and is limited to those that were directly tested (i.e. determined by the bait proteins). A subtle but important point: under 100CSS conditions a bait will not find proteins as hits that are not complex co-members with itself. These non-co-memberships are meaningful features in APMS data which help distinguish between distinct complexes.
In the hypothetical APMS spoke graph in Figure 2a, P1, P2 and P3 are baits and P4, P5 and P6 are hit-only proteins. Noting the two types of proteins helps distinguish between three types of edges in the APMS spoke graph: (1) tested and present, (2) tested and absent and (3) untested. All possible directed edges originating at baits P1, P2 and P3 are tested. If complex co-membership is detected the edges are present, e.g. the edges between P1 and P2 and from P1 to P4, and if co-membership is not detected the edges are absent, e.g. the absent edges between P1 and P3 and from P1 to P5. On the contrary, all possible edges among hit-only proteins are untested. The edges between P4, P5 and P6 are missing, not because they are known not to exist, but because complex co-membership for these proteins is never directly assayed. It would be inappropriate to assume the existence of any edges originating at hit-only proteins since they are never tested. Local modeling acknowledges these missing data and only models the observed data.
A mapping from maximal complete subgraphs in the spoke models to the bipartite graph is not reasonable for incomplete APMS data owing to the untested edges among hit-only proteins. We, therefore, define an analog to complete subgraphs for bait/hit APMS data, namely BH-complete subgraphs. BH-complete subgraphs are defined to be sets of nodes containing at least one bait protein for which all tested edges are present. Since, by definition, all edges originating at bait proteins are tested, all baits in a BH-complete subgraph must have edges present to all other baits and all hit-only proteins in the subgraph. BH-complete subgraphs may not contain tested and absent edges originating at baits. Edges connecting pairs of hit-only proteins are untested and will necessarily be missing in BH-complete subgraphs. We do require that BH-complete subgraphs contain at least one bait protein since the subgraph for a collection of hit-only proteins would only contain untested edges and would, therefore, not be particularly meaningful. Maximal BH-complete subgraphs are BH-complete subgraphs not contained in any other BH-complete subgraphs; we let these collections determine complex membership mappings.
Figure 2a contains two maximal BH-complete subgraphs: (1) P1, P2, P4, P6 and (2) P3, P4, P5. In the P1, P2, P4, P6 subgraph all tested edges are present, i.e. the edges between P1 and P2, and from P1 to P4, P1 to P6, P2 to P4 and P2 to P6. The potential edge connecting hit-only proteins P4 and P6 is untested, and as the definition states, untested edges are permitted in BH-complete subgraphs. Similarly, in the P3, P4, P5 subgraph, the tested edges from P3 to P4 and P3 to P5 are present, and the potential edge connecting P4 and P5 is untested. All other subgraphs in Figure 2a are not maximal BH-complete, either because they are not maximal or because they are not BH-complete. For example, the subgraph P2, P4, P6 is not maximal; it is BH-complete since all tested edges are present, but it is contained in the subgraph P1, P2, P4, P6 which is also BH-complete. As another example, the subgraph P3, P4, P5, P6 is not BH-complete because the tested edge from P3 to P6 is absent. Similarly, the subgraph P1, P3, P4, P5, P6 and the entire graph P1, P2, P3, P4, P5, P6 are not BH-complete since not all tested edges originating at bait proteins are present.
In Figure 2a, maximal BH-complete subgraphs in the APMS spoke model correspond to a bipartite graph reflecting exact complex membership. Importantly, different baits may result in different estimates. For example, if P2, P3 and P4 were used as baits, our algorithm would detect the two correct complex estimates, as well as a third irrelevant complex (Fig. 2b). Without further APMS experiments, it is impossible to determine actual complex membership among the hit-only proteins. It would be difficult to compare two experiments that use different bait proteins as the set of tested co-memberships could be quite different. This raises interesting questions of experimental design and suggests that it is essential for the investigator to critically evaluate the set of tested co-memberships so that the resultant complex estimates pertain to the cellular networks of interest and are as complete as possible.
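The definition of maximal BH-complete subgraphs and the effect of bait choice can be checked with a brute-force sketch. The hit sets for the first design follow the edges listed for Figure 2a. The hit sets for the alternative design (P2, P3 and P4 as baits) are an assumption derived from the hypothetical complexes under 100CSS conditions, since Figure 2b is not reproduced here. The function name `maximal_bh_complete` is hypothetical, and exhaustive enumeration is only practical for toy examples.

```python
from itertools import combinations

PROTEINS = ["P1", "P2", "P3", "P4", "P5", "P6"]

def maximal_bh_complete(hits_by_bait):
    """Brute-force search for maximal BH-complete subgraphs.

    hits_by_bait maps each bait to the set of proteins it detected as hits;
    edges originating at hit-only proteins are untested and never required.
    """
    baits = set(hits_by_bait)

    def bh_complete(nodes):
        if not nodes & baits:  # at least one bait is required
            return False
        # every bait must have an observed edge to every other node in the set
        return all(
            other in hits_by_bait[bait]
            for bait in nodes & baits
            for other in nodes - {bait}
        )

    candidates = [
        frozenset(subset)
        for size in range(1, len(PROTEINS) + 1)
        for subset in combinations(PROTEINS, size)
        if bh_complete(set(subset))
    ]
    # keep only BH-complete sets not strictly contained in another one
    maximal = [c for c in candidates if not any(c < d for d in candidates)]
    return sorted(sorted(c) for c in maximal)

# Design of Figure 2a: P1, P2 and P3 are baits.
design_a = {"P1": {"P2", "P4", "P6"}, "P2": {"P1", "P4", "P6"}, "P3": {"P4", "P5"}}
# Alternative design (assumed 100CSS read-out): P2, P3 and P4 are baits.
design_b = {
    "P2": {"P1", "P4", "P6"},
    "P3": {"P4", "P5"},
    "P4": {"P1", "P2", "P3", "P5", "P6"},
}

estimates_a = maximal_bh_complete(design_a)
estimates_b = maximal_bh_complete(design_b)
print(estimates_a)
print(estimates_b)
```

Under these assumptions the first design yields the two true complexes, while the second also yields P1, P4, P5, P6, a set held together only by the bait P4 and by untested edges among hit-only proteins.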
In addition to being incomplete, APMS data are neither 100% sensitive nor 100% specific, resulting in false negative (FN) observations (a complex co-member of the bait is not found as a hit) and FP observations (a protein not in any complexes with the bait is found as a hit). In the first 3-bait example, suppose an FN observation occurs between P2 and P4 and FP observations occur between P3 and P7, and P8 and P3 (Fig. 3). When using <100CSS APMS data to estimate the bipartite graph, the local modeling algorithm detects maximal BH-complete subgraphs, allowing for a number of FN missing edges in accordance with a user-specified sensitivity value. To prevent the admission of too many FNs, the algorithm also monitors the consistency of the overall number of true negatives (TNs) with the specificity set by the user.
| ALGORITHM |
|---|
The local modeling algorithm of Scholtens and Gentleman (2004) assumes that the errors in the (imperfect) observation of these edges are independent, and uses a logistic regression model for the probability $p_{ij}$ of detecting protein $j$ as a hit using protein $i$ as a bait:

$$\log\left(\frac{p_{ij}}{1 - p_{ij}}\right) = \mu + \alpha Y_{ij},$$

where $Y_{ij} = 1$ if the edge from $i$ to $j$ exists in the true APMS graph and 0 otherwise. Thus $e^{\mu+\alpha}/(1+e^{\mu+\alpha})$ is the sensitivity of the APMS technology and $1/(1+e^{\mu})$ is the specificity. The algorithm first considers the likelihood $L$, the product of $e^{(\mu+\alpha Y_{ij})Z_{ij}}/(1+e^{\mu+\alpha Y_{ij}})$ for all tested edges, where $Z_{ij} = 1$ if protein $j$ is found as a hit using protein $i$ as a bait and 0 otherwise. The algorithm begins by maximizing $L$ to estimate each $Y_{ij}$ and then locates maximal BH-complete subgraphs in the estimated APMS graph and records these as the initial complex estimates. If parameter specification is done according to the suggestions in Scholtens and Gentleman (2004) where sensitivity is assumed to be less than specificity, then $L$ will be maximized when doubly tested but singly observed edges (i.e. unreciprocated edges connecting bait proteins) are estimated to exist. If so, the initial maximal BH-complete subgraphs may include doubly tested but unreciprocated edges, e.g. P3 and P8 in the graph of observed APMS data in Figure 3. All singly tested edges must still be observed; for example, P3, P4, P5, P7, P8 is not maximal BH-complete in Figure 3 since the singly tested edges from P8 to P4, P5 and P7 are not present.
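The link between the logistic parameters and the user-specified sensitivity and specificity can be made concrete with a short sketch. It assumes the model has the form logit(p) = mu + alpha * Y, where the coefficient name `alpha` and the function names are assumptions introduced here for illustration. The values 0.75 and 0.995 are the TAP settings reported later in the paper.

```python
import math

def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))

def model_parameters(sensitivity, specificity):
    """Return (mu, alpha) such that
    e^(mu+alpha) / (1 + e^(mu+alpha)) = sensitivity and 1 / (1 + e^mu) = specificity."""
    mu = logit(1.0 - specificity)
    alpha = logit(sensitivity) - mu
    return mu, alpha

def log_likelihood(tested_edges, mu, alpha):
    """Log of L for (y, z) pairs: y = 1 if the edge is estimated to exist,
    z = 1 if the edge was observed in the purification."""
    total = 0.0
    for y, z in tested_edges:
        eta = mu + alpha * y
        total += eta * z - math.log1p(math.exp(eta))
    return total

mu, alpha = model_parameters(sensitivity=0.75, specificity=0.995)

# A doubly tested, singly observed bait-bait edge: observed one way, missed the other.
edge_exists = log_likelihood([(1, 1), (1, 0)], mu, alpha)
edge_absent = log_likelihood([(0, 1), (0, 0)], mu, alpha)
print(math.exp(edge_exists), math.exp(edge_absent))
```

With these settings the likelihood contribution is roughly 0.19 when the unreciprocated edge is estimated to exist and roughly 0.005 when it is estimated to be absent, which is consistent with the statement that such edges are estimated to exist when sensitivity is lower than specificity.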
Although APMS experiments do result in high-dimensional data, there are, in fact, at most two observations of each edge when the purifications are performed only once. FN observations will break a true complex subgraph into multiple maximal BH-complete subgraphs. The local modeling algorithm incorporates a second criterion, called $C$, to assess combinations of the maximal BH-complete subgraphs. For a proposed complex $c_k$, let $\phi(c_k)$ be the binomial probability for the number of observed edges $x_k$ out of the number of tested edges $t_k$, specifically

$$\phi(c_k) = \binom{t_k}{x_k} s^{x_k} (1 - s)^{t_k - x_k},$$

where $s$ is the sensitivity of the APMS technology, and let $\psi(c_k)$ equal the two-sided P-value from Fisher's exact test for the distribution of observed incoming edges for each protein. $C$ is then taken to be the product of $\phi(c_k) \times \psi(c_k)$ for all proposed complexes $c_k$. After the initial estimate, pairwise unions of complex estimates are investigated. The union which leads to the highest increase in $P = L \times C$ is accepted. The pairwise union process is then repeated using the new complex estimates. The algorithm stops when no union proposals increase $P = L \times C$ (Fig. 3). Since $C$ is the product of probabilities ranging from 0 to 1 over all complex estimates, it tends to increase for a smaller number of complexes that reasonably reflect the underlying BH-complete subgraph structure. An increase in $C$ may lead to a decrease in $L$; the product $P = L \times C$ balances the contribution of each term.
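A small numerical sketch may help convey how the binomial part of C treats a missing edge. It assumes that the success probability of the binomial is the user-specified sensitivity and that each directed bait-to-protein edge inside a proposed complex counts as one tested edge. The Fisher's exact test factor is omitted, and the function name `binomial_factor` is hypothetical.

```python
from math import comb

def binomial_factor(observed, tested, sensitivity):
    """Binomial probability of seeing `observed` of `tested` edges
    when each true edge is detected with probability `sensitivity`."""
    missed = tested - observed
    return comb(tested, observed) * sensitivity**observed * (1.0 - sensitivity) ** missed

# Proposed complex {P1, P2, P4, P6} with baits P1 and P2:
# tested edges are P1->P2, P2->P1, P1->P4, P1->P6, P2->P4 and P2->P6.
all_observed = binomial_factor(observed=6, tested=6, sensitivity=0.75)
one_missed = binomial_factor(observed=5, tested=6, sensitivity=0.75)  # e.g. FN for P2->P4
print(all_observed, one_missed)
```

At a sensitivity of 0.75, observing five of six tested edges is more probable than observing all six, so under these assumptions a single FN observation does not by itself argue against merging the fragments of a true complex.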
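If desired, externally derived similarity measures between proteins can be incorporated into an extended logistic regression model:

$$\log\left(\frac{p_{ij}}{1 - p_{ij}}\right) = \mu + \alpha Y_{ij} + \beta S_{ij},$$

where $\mu$, $\alpha$ and $Y_{ij}$ are as in the model above, $S_{ij}$ is the similarity measure for proteins $i$ and $j$, and $\beta$ is its regression coefficient.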
Local modeling may result in three types of complex estimates: (1) multibait-multiedge (MBME) complexes containing multiple baits and multiple edges; (2) single bait-multihit (SBMH) complexes containing one bait and multiple hits and (3) unreciprocated bait-bait (UnRBB) complexes containing two proteins, both used as baits, connected by one unreciprocated edge (Fig. 3). MBME complexes contain data from multiple purifications, allowing a more detailed view of complex co-membership among the hits, and are, therefore, the most accurate local models afforded by the data. Any hits found by a common bait are reported in a SBMH complex if they do not appear together in a MBME complex. SBMH complexes may contain proteins from multiple complexes since edges between hit-only proteins are untested, but they are crucial outputs that prevent any loss of information after refined local modeling represented in the MBME complexes. The edges in UnRBB complexes may be FPs since they are tested twice, observed once and not contextually confirmed by other edges, or the unreciprocated edge may be an FN observation of an edge that does exist but was not observed experimentally. MBME complexes are believed to be the most reliable outputs and are the basis for our ensuing discussion of the implications of local modeling for systems biology. SBMH and UnRBB complexes, although less reliable, can be used to design future experiments. It is important to note that a large portion of protein complexes may be missed since the complex estimates are strictly limited to the cell types and experimental conditions used for the purifications.
| IMPLEMENTATION |
|---|
Detecting previously characterized protein complexes in the global APMS network
We applied our algorithm to 589 raw tandem affinity purifications (TAP) from Saccharomyces cerevisiae (Gavin et al., 2002). Excluding homodimers, there are 455 bait proteins and 909 hit-only proteins in these data. The raw purifications were manually organized and originally released as 232 annotated yTAP complexes (Krause et al., 2003). Our algorithm predicted 708 complexes including 260 MBME, 325 SBMH and 123 UnRBB complexes available for use and examination at http://www.bioconductor.org/Docs/Papers/2003/apComplex and in the apComplex package.
To reflect the previously estimated sensitivity of TAP, we used a sensitivity of 0.75. Previous studies suggest that 50% of reported protein-protein interactions may be FPs (von Mering et al., 2002). In the TAP data, $455 \times (454 + 909) = 620\,165$ edges were tested and 3420 were observed. If 1720 are FPs, then an estimate of the specificity of TAP is $1 - 1720/(620\,165 - 1720 \times 1.25) \approx 0.997$. For our analysis, we conservatively used a specificity parameter of 0.995. For analyses of new data, investigators are encouraged to perform repeated purifications using well-characterized proteins as baits to estimate the sensitivity of their APMS technology for detecting known co-memberships. Specificity can then be estimated using a procedure similar to ours. If parameter specification in this manner is not possible, investigators may want to start with the values used here for the TAP data and observe changes in the complex estimates for differing values of specificity and sensitivity. The robustness of our algorithm to parameter specification is discussed in Scholtens and Gentleman (2004).
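The arithmetic behind the specificity estimate can be reproduced directly. The sketch below reads the expression as 1 minus the ratio of assumed FPs to the number of tested edges less 1720 times 1.25, which is an interpretation of the formula as printed. The counts are those reported in the text, and the variable names are illustrative.

```python
n_baits = 455
n_hit_only = 909

# Each bait tests an edge to every other bait and to every hit-only protein.
tested_edges = n_baits * ((n_baits - 1) + n_hit_only)

assumed_false_positives = 1720  # roughly half of the 3420 observed edges
specificity = 1 - assumed_false_positives / (tested_edges - assumed_false_positives * 1.25)

print(tested_edges)
print(round(specificity, 3))
```

The result rounds to 0.997, so the value of 0.995 used in the analysis is slightly more conservative than the estimate.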
yTAP complexes are unions of all hits for hand-picked sets of baits linked together by spoke edges, neglecting the fact that hits in an individual purification may belong to different complexes. Since our local modeling algorithm accounts for this biological reality, MBME complexes better reflect in vivo complexes. For example, all hits from six purifications using Arc15, Arc18, Arc35, Arc40, Arp2 and Arp3 as baits were grouped together in yTAPC153 (Fig. 4b). We model the seven known subunits of the Arp2/3 complex (Winter et al., 1997) and assign the five loosely connected proteins to other complexes with their corresponding baits (Fig. 4a and c).
Another striking example involves the RNA polymerase complexes (PolI, PolII and PolIII). We detect three MBME complexes corresponding to the distinct Pol complexes known to exist in vivo (Archambault and Friesen, 1993). Importantly, their overlapping members (e.g. Rpc40 is a known member of both PolI and PolIII) are correctly modeled. In contrast, the yTAP analysis collapses PolI and PolIII together in yTAPC154 and includes several extraneous proteins in the estimate of PolII in yTAPC145 (Fig. 5). A correct description of cellular systems demands recognition of these functionally dependent but physically distinct RNA polymerase complexes.
In another example, the heterotrimeric composition of protein phosphatase 2A (PP2A) was reported for the PP2A proteins Tpd3, Pph21, Pph22, Cdc55 and Rts1 (Jiang and Broach, 1999). All four trimers, along with cell-cycle regulators Zds1 and Zds2, are combined into the 46-subunit yTAPC151 (Fig. 6h). Our local model distinguishes between the four known PP2A trimers and, furthermore, demonstrates the exclusive association of Zds1 and Zds2 with the Cdc55/Pph22/Tpd3 trimer (Fig. 6b-g). Such precise local modeling is not presented in the manually compiled yTAP complexes nor is it immediately evident in the static spoke representation of the APMS data for these proteins (Fig. 6a). Local modeling recovers crucial information regarding the dynamic composition of PP2A as well as interactivity of Zds1 and Zds2 with a specific PP2A trimer.
For a large-scale assessment of the detection of known protein complexes in publicly available datasets, we compared our models to 267 hand-curated complexes available at MIPS (ftp://ftpmips.gsf.de/yeast/catalogues/complexes/complex.130603) using a similarity measure, $\mathrm{sim}$. For a true complex, TC, and a predicted complex, PC, define $\mathrm{sim}(TC, PC) = \min(i/a, i/b)$, where $i$ is the number of proteins in the intersection of TC and PC, and $a$ and $b$ are the number of proteins in TC and PC, respectively. This measure finds the proportion of overlapping proteins from the perspective of both complexes and takes the minimum as a conservative similarity measure. We similarly compared the yTAP complexes to the MIPS complexes. We used the proteins common between the MIPS and TAP data, leaving 129 MIPS multiprotein complexes for comparison. Using $\mathrm{sim} > 0.70$ to guarantee close correspondence between complex estimates, we mapped 80 of our complexes to 62 MIPS complexes (Table 1). MIPS complexes with multiple mappings from our complexes contained a high percentage of common core elements. For example, six of our complex estimates mapped to the casein kinase II complex containing Cka1, Cka2, Ckb1 and Ckb2 reported by MIPS. Our TAP analysis showed these four proteins interacting with other proteins in distinct complexes, and the similarity criterion $\mathrm{sim} > 0.70$ mapped all six to their common core. Only 38 yTAP complexes mapped to 36 MIPS complexes.
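The similarity measure is simple enough to sketch directly. In the example below the casein kinase II core is taken from the text, while the extra proteins `X1` and `X2` are hypothetical placeholders used only to show how the 0.70 threshold behaves. The function name `complex_similarity` is likewise an illustrative choice.

```python
def complex_similarity(true_complex, predicted_complex):
    """min(i/a, i/b): overlap proportion seen from both complexes, taking the smaller."""
    tc, pc = set(true_complex), set(predicted_complex)
    if not tc or not pc:
        return 0.0
    shared = len(tc & pc)
    return min(shared / len(tc), shared / len(pc))

ck2_core = {"Cka1", "Cka2", "Ckb1", "Ckb2"}

# Hypothetical predicted complexes: the core plus one or two additional proteins.
one_extra = ck2_core | {"X1"}
two_extra = ck2_core | {"X1", "X2"}

print(complex_similarity(ck2_core, one_extra))
print(complex_similarity(ck2_core, two_extra))
```

A predicted complex containing the four-protein core plus one extra protein passes the 0.70 threshold (4/5), whereas one with two extra proteins does not (4/6), which shows how strict the criterion is for small complexes.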
The large-scale comparison and the specific examples just described confirm that our local models better resemble well-characterized complexes. The local modeling algorithm provides an automated and repeatable analysis of raw APMS data that accommodates multicomplex membership by individual proteins and membership in different complexes for hits in a single purification. The importance of accounting for these biological realities is not to be underestimated. In our TAP analysis, 341 of the 669 proteins composing all MBME estimates are members of two or more complexes and 296 of the 455 purifications detect hits that are members of different MBME complexes. For large-scale systems investigations, local modeling is readily executable and describes functional modules at a finer level of detail than is otherwise available.
Combining APMS and Y2H data
Another large-scale assessment of our complexes and the yTAP estimates involved publicly available Y2H physical interactions, specifically the Ito core (Ito et al., 2001) and Uetz data (Uetz et al., 2000). For a complex to exist, all subunits must physically bind to at least one other subunit. We calculated the overlap proportion of proteins in our complexes that are reported in the combined Ito/Uetz data to physically interact with at least one other complex member. For Arp2/3, the combined Ito/Uetz data reported one physical interaction between Arc15 and Arc19, resulting in an overlap proportion of 2/7 = 0.286 for our complex estimate. (Note that this single physical interaction suggests that the Ito/Uetz data are not comprehensive since one physical interaction is insufficient to compose seven Arp2/3 proteins into one complex.) For the 12-subunit yTAPC153 (Fig. 4b), the proportion is 2/12 = 0.167. Excluding complexes with no representation in the Ito/Uetz data, for our MBME complexes, the mean overlap proportion was 0.412 (25th and 75th percentiles 0.25 and 0.50, respectively). For the yTAP complexes, the mean overlap proportion was 0.318 (25th and 75th percentiles 0.14 and 0.39, respectively). Even though the publicly available Y2H data are not 100CSS, our complex estimates are more structurally substantiated than the yTAP estimates.
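The overlap proportion for the Arp2/3 example can be reproduced with a short sketch. The seven subunit names and the single Arc15 to Arc19 interaction come from the text. The function name `overlap_proportion` and the representation of Y2H data as a list of protein pairs are assumptions made for illustration.

```python
def overlap_proportion(complex_members, y2h_pairs):
    """Fraction of complex members reported to bind at least one other member."""
    members = set(complex_members)
    bound = set()
    for a, b in y2h_pairs:
        # only interactions with both partners inside the complex count
        if a != b and a in members and b in members:
            bound.update((a, b))
    return len(bound) / len(members)

arp23 = ["Arp2", "Arp3", "Arc15", "Arc18", "Arc19", "Arc35", "Arc40"]
ito_uetz_pairs = [("Arc15", "Arc19")]  # the one interaction reported for these proteins

print(round(overlap_proportion(arp23, ito_uetz_pairs), 3))
```

Only two of the seven subunits are covered by a reported physical interaction, which gives the proportion 2/7 quoted above.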
If available Y2H data were more comprehensive, local modeling could be carried out to further completion. For example, we identify Clp1, Pcf11, Rna14 and Rna15 as an MBME complex; these four proteins are known to compose cleavage factor IA (CFIA). The Ito/Uetz data report physical interactions between Pcf11 and Clp1, Rna14 and Rna15 (Fig. 7). The identification of these four proteins as composing a distinct complex, followed by structural prediction using the Ito/Uetz Y2H physical interactions matches the behavior and mechanical assembly reported by Gross and Moore (2001). Unfortunately, sparse Y2H data prevent local modeling to its fullest extent for many complexes, e.g. the aforementioned lack of physical interactions reported for Arp2/3. Continued and improved Y2H data collection will mitigate the effects of apparent FN and previously reported FP rates (von Mering et al., 2002) on local modeling.
Provisional functional annotation prediction
Many MBME predictions are not documented in the literature, and our analyses suggest provisional functional annotation for many uncharacterized proteins. We predict a four-subunit complex containing Pph9, Spt5, YNL201C and YBL046W (Supplementary Figure 1a). YNL201C and YBL046W have incomplete Gene Ontology (GO) annotation (Dwight et al., 2002). Although it may be incorrect to confer the specific functionality of Pph9 and Spt5 on YNL201C or YBL046W, an understanding of a protein complex in which they work at a minimum allows informed hypothesis generation.
Similarly, for a seven-subunit complex prediction, YCR072C, Kre32, YNL182C and YHR197W have incomplete GO annotation, and the remaining three, Bud20, Sda1 and YLR106C are not documented as complex co-members (Supplementary Figure 1b). This complex suggests many avenues for further research, pertaining to both individual proteins and potentially to a new protein complex. In short, local modeling provides a much needed platform for identifying specific follow-up experiments regarding functional annotation after global network data collection.
Pathway interactivity
Communication between cellular systems may also be investigated through local models of proteins associated with different pathways. An example involves the five translation initiation factor 3 (eIF3) core complex proteins: Prt1, Rpg1, Nip1, Tif34 and Tif35. We predict the co-activity of eIF3 proteins with Tif5 (Phan et al., 1998) and Sua7 (Supplementary Figure 2a). Prt1, Rpg1 and Nip1 are also known to form a stable subcomplex of eIF3. We predict that these three proteins interact separately with Ckb2, Hat1 and Cdc95 (Supplementary Figure 2b-d). These observations may be used to confirm the suggestion that Prt1, Rpg1 and Nip1 perform several different translation initiation functions (Phan et al., 2001).
Bait selection
We analyzed another APMS dataset, known as high-throughput mass spectrometry-protein complex identification (HMS-PCI) data (Ho et al., 2002), containing 493 baits and 1085 hit-only proteins. Only 81 of the 493 HMS-PCI baits also found hits in the TAP experiments, and a comparison of the estimated complexes from each dataset demonstrates the great influence of the set of baits on the interpretable data. As suggested previously, when designing APMS experiments, investigators must take care to select baits that probe the cellular networks of interest as completely as possible. We applied our algorithm to the HMS-PCI data with a sensitivity of 0.75, but a higher FP probability of 0.01. The HMS-PCI lab process includes one round of mild purification, and it has been suggested that their MS method may identify background proteins (Bader and Hogue, 2002). We doubled the FP probability to conservatively accommodate the potential increase in FP observations for these data.
Our HMS-PCI analysis resulted in 1008 complex predictions, including 329 UnRBBs and 437 SBMHs. Several of the 242 MBMEs were similar to those found in the TAP data, despite the different baits in the two experiments. For example, our HMS-PCI estimate of the Arp2/3 complex contained six of the seven known elements (Supplementary Figure 3). In the HMS-PCI data, only two Arp2/3 proteins were used as baits, rather than six as in the TAP data. Arc15, the element we excluded, was detected in the Arp2 purification along with 21 other hits, but was missed in the Arc40 purification. There were insufficient data to differentiate Arc15 from the 21 other Arp2 hits for inclusion in the Arp2/3 estimate.
The different HMS-PCI baits did enable complex identifications that were unobserved in the TAP data. One example is an eight-subunit prediction, six of which (Fyv10, Gid7, Gid8, Rmd5, Vid28 and Vid30) carry out catabolic degradation of fructose-1,6-bisphosphatase (FBPase) in S. cerevisiae (Regelmann et al., 2003) (Supplementary Figure 4). The relationships of the other two proteins Cct2 and YBL049W to the FBPase proteins may represent co-involvement in other complexes.
Local modeling and the cell cycle
Joint analyses of APMS and other high-throughput data will provide a solid platform for new systems biology investigations (Vidal, 2001). For example, yeast cells synchronized according to cell-cycle periodicity have been used to study genome-wide mRNA transcript fluctuation. We investigated the concordance of our complex predictions with genes exhibiting similar cell-cycle-controlled expression profiles, using data in which 3000 genes were grouped into 30 k-means clusters (Tavazoie et al., 1999). For the 578 genes in common between the cell-cycle and TAP data, there were 578(577)/2 = 166 753 pairs of genes, 6619 of which shared membership in one of the 30 clusters. For each of our MBMEs and the yTAP complexes, we assessed the statistical significance of the total number of intracluster pairs using the hypergeometric distribution. We used the overlapping proteins and any remaining complexes with more than two subunits, resulting in 101 MBME and 107 yTAP complexes for evaluation. We found that 42/101 = 0.42 and 31/101 = 0.31 of our MBMEs had P-values <0.05 and 0.01, respectively. For the yTAP data, we found 32/107 = 0.30 and 21/107 = 0.20 had P-values <0.05 and 0.01, respectively. Our MBMEs contain a higher proportion of proteins with similar gene-expression profiles than the yTAP complexes, and, therefore, better reflect previous reports of co-regulated interacting proteins (Ge et al., 2001; Grigoriev, 2001; Kemmeren et al., 2002).
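The hypergeometric assessment described above can be sketched with exact integer arithmetic. The gene and pair counts are those reported in the text. The one-sided upper-tail form of the test is an assumption, and the five-subunit complex with four intracluster pairs is a hypothetical example rather than one of the reported complexes.

```python
from math import comb

def hypergeom_upper_tail(total_pairs, intracluster_pairs, complex_pairs, observed):
    """P(X >= observed), where X counts intracluster pairs among complex_pairs
    pairs drawn without replacement from total_pairs gene pairs."""
    denominator = comb(total_pairs, complex_pairs)
    upper = min(complex_pairs, intracluster_pairs)
    numerator = sum(
        comb(intracluster_pairs, k)
        * comb(total_pairs - intracluster_pairs, complex_pairs - k)
        for k in range(observed, upper + 1)
    )
    return numerator / denominator

n_genes = 578
total_pairs = n_genes * (n_genes - 1) // 2  # all gene pairs in common between the datasets
intracluster_pairs = 6619                   # pairs sharing one of the 30 clusters

# Hypothetical complex: 5 subunits give 10 pairs, 4 of which share a cluster.
p_value = hypergeom_upper_tail(total_pairs, intracluster_pairs, complex_pairs=10, observed=4)
print(total_pairs, p_value)
```

Because only about 4% of all gene pairs share a cluster, even a handful of intracluster pairs within a small complex yields a small P-value under this sketch.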
Three proteins, Smc1, Smc3 and Mcd1, which participate in mitotic chromosome assembly during cell division (Hirano, 1999), were locally modeled in one MBME and reported to share cell-cycle cluster membership (Supplementary Figure 5). Several other predicted complexes demonstrated similarly significant numbers of intracluster pairs. Local modeling done under specific experimental conditions could clarify the coordination of gene expression and protein complex formation for systems operation. Important areas of further research also include the study of complexes whose proteins fall into different expression clusters and a comparison of the expression profiles of proteins found in many complexes with those of proteins involved in only one.
DISCUSSION
A comprehensive catalog of multiprotein complex membership is a foundational element of systems biology. The local modeling algorithm proposed by Scholtens and Gentleman (2004) accommodates biological realities, such as multicomplex membership by individual proteins, and exploits subtle differences between APMS and Y2H data types. In this paper, we use local modeling to move from global static interactome graphs to a rigorous characterization of protein macromolecules at a finer level of detail than previously available using APMS data. Y2H-detected physical interactions complete the local interactome model by providing a view of the mechanics of assembly into functional units. Local modeling provides a platform for explicit hypothesis development regarding functional annotation and pathway interactivity. Joint analyses of protein complexes with other data, such as gene-expression profiles, promise salient insight into the systems of modular networks responsible for cellular activity.
Conflict of Interest: none declared.
Footnotes
Present address: Northwestern University Medical School, Department of Preventive Medicine, 680 North Lake Shore Drive Suite 1102, Chicago, IL 60611, USA.
Received on January 31, 2005; revised on June 13, 2005; accepted on June 28, 2005
REFERENCES
Alberts, B. (1998) The cell as a collection of protein machines: Preparing the next generation of molecular biologists. Cell, 92, 291–294.
Archambault, J. and Friesen, J.D. (1993) Genetics of eukaryotic RNA polymerase I, II, and III. Microbiol. Rev., 57, 703–724.
Bader, G. and Hogue, C.W. (2002) Analyzing yeast protein–protein interaction data obtained from different sources. Nat. Biotechnol., 20, 991–997.
Bader, G. and Hogue, C.W. (2003) An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics, 4, 2.
Dwight, S.S., et al. (2002) Saccharomyces Genome Database (SGD) provides secondary gene annotation using the Gene Ontology (GO). Nucleic Acids Res., 30, 69–72.
Gavin, A.-C., et al. (2002) Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415, 141–147.
Ge, H., et al. (2001) Correlation between transcriptome and interactome mapping data from Saccharomyces cerevisiae. Nat. Genet., 29, 482–486.
Giot, L., et al. (2003) A protein interaction map of Drosophila melanogaster. Science, 302, 1727–1736.
Grigoriev, A. (2001) A relationship between gene expression and protein interactions on the proteome scale: analysis of bacteriophage T7 and the yeast Saccharomyces cerevisiae. Nucleic Acids Res., 29, 3513–3519.
Gross, S. and Moore, C. (2001) Five subunits are required for reconstitution of the cleavage and polyadenylation activities of Saccharomyces cerevisiae cleavage factor I. Proc. Natl Acad. Sci. USA, 98, 6080–6085.
Hartwell, L., et al. (1999) From molecular to modular cell biology. Nature, 402, C47–C52.
Hirano, T. (1999) SMC-mediated chromosome mechanics: a conserved scheme from bacteria to vertebrates? Genes Devel., 13, 11–19.
Ho, Y., et al. (2002) Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature, 415, 180–183.
Ihaka, R. and Gentleman, R. (1996) R: A language for data analysis and graphics. J. Comp. Graph. Stat., 5, 299–314.
Ito, T., et al. (2001) A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl Acad. Sci. USA, 98, 4569–4574.
Jeong, H., et al. (2001) Lethality and centrality in protein networks. Nature, 411, 41–42.
Jiang, Y. and Broach, J. (1999) Tor proteins and protein phosphatase 2A reciprocally regulate Tap42 in controlling cell growth in yeast. EMBO J., 18, 2782–2792.
Kemmeren, P., et al. (2002) Protein interaction verification and functional annotation by integrated analysis of genome-scale data. Mol. Cell, 9, 1133–1143.
Krause, R., et al. (2003) A comprehensive set of protein complexes in yeast: mining large scale protein–protein interaction screens. Bioinformatics, 19, 1901–1908.
Li, S., et al. (2004) A map of the interactome network of the metazoan C. elegans. Science, 303, 540–543.
Phan, L., et al. (1998) Identification of a translation initiation factor 3 (eIF3) core complex, conserved in yeast and mammals, that interacts with eIF5. Mol. Cell. Biol., 18, 4935–4946.
Phan, L., et al. (2001) A subcomplex of three eIF3 subunits binds eIF1 and eIF5 and stimulates ribosome binding of mRNA and tRNAiMet. EMBO J., 20, 2954–2965.
Regelmann, J., et al. (2003) Catabolite degradation of fructose-1,6-bisphosphatase in the yeast Saccharomyces cerevisiae: a genome-wide screen identifies eight novel GID genes and indicates the existence of two degradation pathways. Mol. Biol. Cell, 14, 1652–1663.
Salwinski, L. and Eisenberg, D. (2003) Computational methods of analysis of protein–protein interactions. Curr. Opin. Struct. Biol., 13, 377–382.
Scholtens, D. and Gentleman, R. (2004) Making sense of high-throughput protein–protein interaction data. Stat. App. in Genetics and Mol. Biol., 3, Article 39.
Tavazoie, S., et al. (1999) Systematic determination of genetic network architecture. Nat. Genet., 22, 281–285.
Uetz, P., et al. (2000) A comprehensive analysis of protein–protein interactions in Saccharomyces cerevisiae. Nature, 403, 623–627.
Vidal, M. (2001) A biological atlas of functional maps. Cell, 104, 333–339.
von Mering, C., et al. (2002) Comparative assessment of large-scale data sets of protein–protein interactions. Nature, 417, 399–403.
Winter, D., et al. (1997) The complex containing actin-related proteins Arp2 and Arp3 is required for the motility and integrity of yeast actin patches. Curr. Biol., 7, 519–529.