Skip Navigation


Bioinformatics Advance Access originally published online on March 14, 2008
Bioinformatics 2008 24(8):1093-1099; doi:10.1093/bioinformatics/btn079
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
24/8/1093    most recent
btn079v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Higham, D. J.
Right arrow Articles by Przulj, N.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Higham, D. J.
Right arrow Articles by Przulj, N.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Fitting a geometric graph to a protein–protein interaction network

Desmond J. Higham 1, Marija Rasajski 2,3 and Natasa Przulj 2,*

1Department of Mathematics, University of Strathclyde, Glasgow G1 1XH, UK, 2Department of Computer Science, UC Irvine, Irvine, CA 92697, USA and 3Faculty of Electrical Engineering, University of Belgrade, Belgrade, Serbia

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 Introduction
 2 THEORETICAL BASIS
 3 INITIAL ALGORITHM
 4 PRACTICAL ALGORITHM
 5 DATA AND RESULTS
 6 Discussion
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: Finding a good network null model for protein–protein interaction (PPI) networks is a fundamental issue. Such a model would provide insights into the interplay between network structure and biological function as well as into evolution. Also, network (graph) models are used to guide biological experiments and discover new biological features. It has been proposed that geometric random graphs are a good model for PPI networks. In a geometric random graph, nodes correspond to uniformly randomly distributed points in a metric space and edges (links) exist between pairs of nodes for which the corresponding points in the metric space are close enough according to some distance norm. Computational experiments have revealed close matches between key topological properties of PPI networks and geometric random graph models. In this work, we push the comparison further by exploiting the fact that the geometric property can be tested for directly. To this end, we develop an algorithm that takes PPI interaction data and embeds proteins into a low-dimensional Euclidean space, under the premise that connectivity information corresponds to Euclidean proximity, as in geometric-random graphs. We judge the sensitivity and specificity of the fit by computing the area under the Receiver Operator Characteristic (ROC) curve. The network embedding algorithm is based on multi-dimensional scaling, with the square root of the path length in a network playing the role of the Euclidean distance in the Euclidean space. The algorithm exploits sparsity for computational efficiency, and requires only a few sparse matrix multiplications, giving a complexity of O(N2) where N is the number of proteins.

Results: The algorithm has been verified in the sense that it successfully rediscovers the geometric structure in artificially constructed geometric networks, even when noise is added by re-wiring some links. Applying the algorithm to 19 publicly available PPI networks of various organisms indicated that: (a) geometric effects are present and (b) two-dimensional Euclidean space is generally as effective as higher dimensional Euclidean space for explaining the connectivity. Testing on a high-confidence yeast data set produced a very strong indication of geometric structure (area under the ROC curve of 0.89), with this network being essentially indistinguishable from a noisy geometric network. Overall, the results add support to the hypothesis that PPI networks have a geometric structure.

Availability: MATLAB code implementing the algorithm is available upon request.

Contact: natasha{at}ics.uci.edu


    1 Introduction
 TOP
 ABSTRACT
 1 Introduction
 2 THEORETICAL BASIS
 3 INITIAL ALGORITHM
 4 PRACTICAL ALGORITHM
 5 DATA AND RESULTS
 6 Discussion
 ACKNOWLEDGEMENTS
 REFERENCES
 
Large, complex networks (also called graphs) arise in a vast array of applications (Newman, 2003). Efforts to develop models that describe and summarize complex networks have focused on various network features such as motifs (Milo et al., 2002), graphlets (Przulj et al., 2004) and graphlet degree distributions Przulj (2006), clustering coefficients (Watts and Strogatz, 1998), pathlengths (Watts and Strogatz, 1998) and degree distributions (Khanin and Wit, 2006; Newman, 2003; Thomas et al., 2003).

Studying protein–protein interaction (PPI) networks has recently become possible due to advances in experimental high-throughput technologies such as yeast-2-hybrid (Y2H) (Y2H) (Ito et al., 2000; Uetz et al., 2000), tandem affinity purification (TAP) (Gavin et al., 2002) and high-throughput mass spectrometric protein complex identification (HMS-PCI) (Ho et al., 2002). A significant amount of experimental PPI network data for several organisms has already been generated. (Gavin et al., 2002; Giot et al., 2003; Ho et al., 2002; Ito et al., 2000; Krogan et al., 2006; Li et al., 2004; Rual et al., 2005; Stelzl et al., 2005; Uetz et al., 2000).

Understanding the patterns of intricate wiring in PPI networks is clearly of great importance for basic biological understanding, and also has the potential to feed back into the strategies for optimal interactome detection (Lappe and Holm, 2004). Further benefits of an accurate PPI model include (a) generation of synthetic datasets of any size in order to test computational algorithms, (b) detection of false positives and false negatives, (c) possible insights into the evolutionary processes that created the network and (d) convenience of representing complex networks in terms of a small number of model parameters and thereby distinguishing between networks for different organisms. Thus, modelling of PPI networks has become an active research area and several different random graph models have already been suggested (Grindrod, 2002; Grindrod and Kibble, 2004; Morrison et al., 2006; Przulj and Higham, 2006; Przulj et al., 2004; Thomas et al., 2003; Vazquez et al., 2001). Among them are geometric random graphs (Przulj et al., 2004, 2006) in which nodes correspond to uniformly randomly distributed points in a low-dimensional Euclidean space and edges exist between pairs of nodes in the graph if the corresponding points in the space are close enough (within some radius {varepsilon}) according to the Euclidean distance norm. Other models include: Erdös–Rényi random graphs (Erdös and Rényi, 1959, 1960), generalized random graphs (Bender and Canfield, 1978), small-world (Watts and Strogatz, 1998), scale-free (Barabási and Albert, 1999; Simon, 1955) and stickiness (Przulj and Higham, 2006) networks.

Our research focuses on geometric random graphs. The key observation that drives our work is that, in contrast with other putative PPI models, the geometric structure can be examined constructively. To test whether a given PPI network has a geometric structure, rather than measuring local and global statistics of the PPI network and comparing these with local and global statistics of random geometric graphs, a more direct question can be addressed:

Can we represent the given PPI network as a geometric graph by embedding the proteins in R2, R3 or R4 and finding an {varepsilon} such that proteins are connected if and only if they are {varepsilon}-close?

There are two main themes to this work. The first theme is designing and testing an algorithm to discover whether a network has an underlying geometric structure. This theme is dealt with in Sections 2–4. The second theme, covered in Section 5, is to use this tool to study PPI networks.

We remark that the reverse engineering problem considered here is related to the general, but less well-defined, tasks of ordering and clustering (Grindrod, 2002; Grindrod and Kibble, 2004; Titz et al., 2004). Spectral (eigenvalue/eigenvector-based) algorithms have proved successful for ordering and clustering (Grindrod and Kibble, 2004; Higham, 2003), and this provides motivation for a spectral algorithm to address the geometric embedding issue.


    2 THEORETICAL BASIS
 TOP
 ABSTRACT
 1 Introduction
 2 THEORETICAL BASIS
 3 INITIAL ALGORITHM
 4 PRACTICAL ALGORITHM
 5 DATA AND RESULTS
 6 Discussion
 ACKNOWLEDGEMENTS
 REFERENCES
 
As in previous studies (Przulj, 2006; Przulj et al., 2004), we focus on non-periodic, uniform, Euclidean geometric random graphs in R2, R3 and R4. These are defined as follows [see Penrose (2003) for further details about geometric random graphs and their properties]. In the two-dimensional (2D) case, to create a network (an undirected, unweighted graph) with N nodes we place a set of N points, Formula , uniformly in the unit square; that is, each x[i] isin R2 has its two components drawn independently from the uniform (0,1) distribution, and all points are generated independently. Then, for each pair of points (x[i], x [j]), we create an edge between nodes i and j of the geometric random graph if and only if ||x[i] x [j] ||2 ≤ {varepsilon}, where ||·||2 denotes Euclidean distance and {varepsilon} > 0 is a parameter. In other words, nodes i and j are linked if and only if points i and j are within Euclidean distance {varepsilon}. The process is illustrated in the upper left picture of Figure 1. The three-dimensional (3D) and four-dimensional (4D) cases are defined analogously, by placing points in R3 and R4.


Figure 1
View larger version (55K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. Upper left: a geometric random graph with N = 100 and {varepsilon} = 0.25. Upper right: node placement produced by the algorithm. Lower left: spurious edges introduced (using {varepsilon} = 0.25). Lower right: missing edges (using {varepsilon} = 0.25).

 
Our algorithm makes use of ideas from multi-dimensional scaling (MDS). We summarize here the necessary details, referring the reader to (Cox and Cox, 1994) for further information and historical references, and (Kaski et al., 2003; Taguchi and Oono, 2004) for examples where MDS has been used in bioinformatics.

Suppose that, for a set of N objects, we are given the pairwise Euclidean distances dij between all pairs, and we are asked to find a set of N vectors Formula in Rm such that


Formula 1

(1)
In other words, our task is to find locations in Rm for the objects so that the pairwise distances are respected. Finding the smallest dimension m for which a solution is possible may be regarded as part of the problem. In our context, we will think of the dimension as being fixed at 2, 3 or 4, and we will be seeking N locations Formula for which the constraints (1) are well approximated in a sense that will be made precise.

Given data {dij} that respects the triangle inequality, double centering produces the symmetric, positive semi-definite matrix A isin RN x N defined by


Formula 2

(2)
Letting X isin RmxN be the matrix whose jth column is x [j], it may then be shown that


Formula 3

(3)
Now A has the real Schur decomposition (Golub and Van Loan, 1996) A = UT{Sigma}U, where U isin RNxN is orthogonal (its rows are eigenvectors of A) and {Sigma} = diag({sigma}i) is diagonal with diagonal entries ordered high-to-low (these are the eigenvalues of A). We then see that a solution X in (3) may be computed as


Formula 4

(4)
To find an ‘optimal’ approximation such that x[i] isin Rr, we may truncate (4) using only the r most positive eigenvalues, so that


Formula 5

(5)
where u[k] isin RN is the kth row of U. This is optimal in the sense that Formula is the closest rank-r matrix to X, in any orthogonally invariant norm (Golub and Van Loan, 1996).


    3 INITIAL ALGORITHM
 TOP
 ABSTRACT
 1 Introduction
 2 THEORETICAL BASIS
 3 INITIAL ALGORITHM
 4 PRACTICAL ALGORITHM
 5 DATA AND RESULTS
 6 Discussion
 ACKNOWLEDGEMENTS
 REFERENCES
 
In this section, we outline the main ideas behind the algorithm and show how we propose to evaluate its accuracy. Our task is to embed proteins into R2, R3 or R4 given the PPI network. Rather than Euclidean distances, we have only {0,1} connectivity information. For this reason, we will use a function of the pathlength in lieu of Euclidean distance. By construction, if nodes A and B in a geometric graph are connected (pathlength one), then their Euclidean distance is between zero and {varepsilon}. Similarly, a pathlength of two indicates that an intermediate node, C, is {varepsilon}-close to both A and B, with the Euclidean distance between A and B lying somewhere between {varepsilon} and 2{varepsilon}. In the absence of exact distance information we will adopt the heuristic that a ‘typical’ configuration has a right angle for the angle ABC, and assume that a typical length-two path corresponds to a distance of Formula . The square root can also be regarded as an attempt to compromise between the opposing factors where (1) one of the distances A-to-C or C-to-B is much less than {varepsilon} and (2) the nodes A, B and C are co-linear. More generally, we will use the square root of the graph pathlength in lieu of Euclidean distance, so that dij in (1) is taken to be Formula , where pathij denotes the pathlength between nodes i and j. In practice, we tried several alternative monotonically increasing functions of pathij and found that the resulting algorithm was insensitive to this detail.

A minor issue is the natural scale-invariance of the problem—re-scaling {varepsilon} and the distances dij does not change the network. Because the traditional geometric model assumes that all points lie in the unit disk/cube, we will normalize the coordinate vectors in (5) so that


Formula 6

(6)

We now illustrate these ideas on data arising from a small geometric graph in R2 for which the results can be easily visualized. Here we took N = 100 nodes with coordinates drawn independently from the uniform (0, 1) distribution and joined nodes that were within Euclidean distance {varepsilon} = 0.25. The resulting graph is shown in the upper left picture of Figure 1. The six most positive eigenvalues of A in (2), using Formula for dij, were found to be 39.9, 31.3, 8.7, 7.4, 6.1 and 4.0. If we had used exact Euclidean distance information then, since we started with a geometric graph in R2, it would be possible to embed exactly in R2, so that only the first two eigenvalues would be non-zero. In using pathlength to approximate Euclidean distance, we have lost this property, but it is reassuring that the first two eigenvalues remain strongly dominant. The 2D embedding from the algorithm is shown in the upper right, and the lower pictures display the false positives and false negatives arising when a geometric graph with radius {varepsilon} = 0.25 is formed.

To measure the ability of the algorithm to recover the original network, we present a receiver operator characteristic (ROC) curve (Bradley, 1997; Tape, 2000) in Figure 2, marked Geometric/MDS. Here, we increased {varepsilon} from 0 to Formula in small increments, and for each {varepsilon} we generated the geometric graph arising from the MDS node placement as in the upper right picture of Figure 1. The horizontal axis is then defined as 1 – specificity, that is, 1 – TN/(TN + FP), and the vertical axis is defined as sensitivity, that is TP/(TP + FN). Here TN denotes the number of true negatives that is, the number of distinct pairs i and j for which there is no edge in the reverse engineered graph and there is no edge in the original graph. Similarly, TP, FP and FN denote the number of true positives, false positives and false negatives, respectively. With {varepsilon} = 0, we place no edges in the network, and hence we have perfect specificity, 1 – TN/(TN + FP) = 0, but the worst possible sensitivity, TP/(TP + FN) = 0. The other extreme Formula connects all nodes, giving the worst possible specificity, 1 – TN/(TN + FP) = 1, but perfect sensitivity, TP/(TP + FN) = 1. Increasing {varepsilon} always improves sensitivity at the expense of specificity. Good performance corresponds to having a curve that rises rapidly, containing points close to x = 0, y = 1, and the area under the curve (AUC) is a widely-used measure of quality (Bradley, 1997; Tape, 2000). In Figure 2 we have an AUC of 0.988 for MDS curve.


Figure 2
View larger version (19K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. ROC curves displaying sensitivity and specificity. Upper curve, marked Geometric/MDS, arises from our MDS-based algorithm on a geometric network. Lower curve, marked Geometric/Coin Flip, arises when interactions are predicted at random for the geometric network. Middle curve, marked Erdos/MDS, arises when our MDS-based algorithm is applied to an Erdös-Rényi random graph.

 
For comparison, we have added two more ROC curves in Figure 2. First, for the same network, we show the effect of ‘randomly guessing’ links. Here, we take a biased coin that lands heads with probability p. For each pair of nodes we flip the coin and predict a link if the coin lands heads. As p is varied this leads to the ROC curve labelled Geometric/Coin Flip, which has an AUC of 0.47. Next, we generated a network with 100 nodes that did not have an underlying geometric structure. Instead we used an Erdös–Rényi model (Erdös and Rényi, 1959, 1960) as in (Przulj et al., 2004) where, for each pair of nodes, a link was inserted with independent probability 0.8. Applying our MDS-based algorithm produced the ROC curve labelled Erdos/MDS, which has an AUC of 0.64.


    4 PRACTICAL ALGORITHM
 TOP
 ABSTRACT
 1 Introduction
 2 THEORETICAL BASIS
 3 INITIAL ALGORITHM
 4 PRACTICAL ALGORITHM
 5 DATA AND RESULTS
 6 Discussion
 ACKNOWLEDGEMENTS
 REFERENCES
 
For PPI networks, where the number of nodes is typically in the thousands, we propose setting


Formula 7

(7)
where K and Kmax are parameters in the algorithm. Here we have introduced a cutoff K with longer pathlengths rounded to a single value Kmax. Using the cutoff K in (7), rather than computing and recording the pathlength for all pairs, has three main advantages.

  1. By choosing K relatively small, the resulting algorithm can exploit sparsity in the original network. This is explained further below.
  2. The case where the network consists of two or more disconnected components (i.e. some pathlengths are infinite) is conveniently handled.
  3. The cutoff reflects the fact that for a true geometric graph in the unit cube, there is an upper bound on the maximum Euclidean distance.
We remark that, intuitively, it is clear that accurate information concerning near-neighbours is more important than information concerning distant nodes. In our experiments, we found that the results from the algorithm were not sensitive to the choice of K and Kmax, although, as explained below, the computational benefits can be dramatic.

Repeating the experiment in Figure 1, but with parameters K = 4 and Kmax = 5, gave rise to leading eigenvalues 38.3, 30.1, 10.7, 8.9, 6.2 and 3.8, and an area under the ROC curve of 0.984, showing that retaining level-four path information is adequate in this case. Similar effects were observed more generally, so we used these values in all computations.

We now explain how our distance measure (7) allows sparsity to be exploited. To measure complexity, we assume that the number of connections per protein is fixed, independently of N, and we consider asymptotics as N -> {infty}. We assume that subspace iteration (Golub and Van Loan, 1996) is used to compute eigenpairs of the matrix A in (2). The significant computational task is then the formulation of a matrix-vector product; given some v isin RN compute the product Av. By construction the elements Formula in (2) now come from a matrix of the form P + Kmax 11T, where 1 isin RN denotes the vector of 1's, and hence 11T isin RNxN is the matrix of 1's, and P has values


Formula 8

(8)
Hence, pij is non-zero only when there is a path of length ≤ K between proteins i and j. In other words, P has the same sparsity pattern as the Kth power of the adjacency matrix for the network. For a fixed value of K, this means that P has the same sparsity as the original network. It then follows that the product Av may be written


Formula 9

(9)
where psum isin RN has Formula , < v > isin R denotes the average value vT1/N and {alpha} isin R denotes Formula .

It follows that each step of the subspace iteration involves a matrix–vector multiply with a sparse matrix. In our computations, we stopped the iteration when successive eigenvector approximations agreed to within 10–3 in Euclidean norm.

Overall, given a protein–protein interaction network, our algorithm may be summarized as:

  1. Compute the pathlengths up to length K.
  2. Compute the first two, three or four most positive eigenvalues of A, and the corresponding eigenvectors.
  3. Embed the nodes in R2, R3 or R4 using (5) and (6).
  4. Examine the accuracy of the embedding as {varepsilon} is varied.

To measure computational cost we note that a sparse matrix–vector multiply Av is an O(N) process. Step 1 of the algorithm can be achieved by forming the matrices A, A2, A3, ... , AK, which has, at most, an O(N2) operation count. Step 2 costs O(N) per iteration of the subspace iteration algorithm. Step 3 is negligible. In Step 4, having generated the protein locations, computing the pairwise distance data {||x[i]x [j] ||2}i!=j, so that choices for {varepsilon} can be tested, is an O(N2) task.


    5 DATA AND RESULTS
 TOP
 ABSTRACT
 1 Introduction
 2 THEORETICAL BASIS
 3 INITIAL ALGORITHM
 4 PRACTICAL ALGORITHM
 5 DATA AND RESULTS
 6 Discussion
 ACKNOWLEDGEMENTS
 REFERENCES
 
5.1 PPI networks
Using high-throughput techniques such as Y2H (Ito et al., 2000; Uetz et al., 2000), TAP (Gavin et al., 2002) and HMS-PCI (Ho et al., 2002), a significant amount of experimental PPI data has been generated. Our algorithm has been applied to 19 PPI networks of four eucaryotic organisms: yeast Saccharomyces cerevisiae, fruitfly Drosophila melanogaster, nematode worm Caenorhabditis elegans and human. We used nine yeast, one fruitfly, three worm and six human PPI networks obtained from different studies that used different PPI detection techniques, as well as from curated databases (described below).

The high-confidence part of the yeast PPI network described by (von Mering et al., 2002) is henceforth denoted by ‘YHC’. This dataset is discussed in more detail in the next subsection. We denote by ‘Y11K’ the yeast PPI network defined by the top 11 000 interactions in the (von Mering et al., 2002) classification. ‘YIC’ denotes the ‘core’ yeast PPI network from (Ito et al., 2000) Y2H study. We denote by ‘YIP’ the entire yeast PPI network from (Ito et al., 2000). ‘YU’ stands for yeast PPI network from (Ito et al., 2000). Y2H study. ‘YICU’ is the union of yeast PPI networks from Ito et al. (2000) and Uetz et al. (2000). We denote by ‘YD’ the yeast PPI network obtained from the database of interacting proteins (DIP) (Xenarios et al., 2000). ‘YK’ is the yeast PPI network from (Krogan et al., 2006) obtained by TAP and matrix-assisted laser desorption/ionization-time of flight mass spectrometry and liquid chromatography tandem mass spectrometry. ‘YM’ is the yeast PPI network from MIPS (Mewes et al., 2002). ‘FH’ is the high-confidence part of the fruitfly PPI network from (Giot et al., 2003) in which a two-hybrid-based protein-interaction map of the fly proteome has been presented.

‘WE’ is the entire worm PPI network published by Li et al., (2004), where more than 4000 interactions were identified from Y2H screens, and ‘WC’ denotes the ‘core’ part of the worm PPI network also from (Li et al., 2004). By ‘WS’ we denote the worm PPI network from (Zhong and Sternberg, 2006), where prediction techniques have been used to generate this PPI network, consisting of 18 183 interactions.

The human PPI network from (Stelzl et al., 2005), obtained by Y2H screens, which contains high- medium- and low-confidence data is denoted by ‘HSL’, its part that contains only high- and medium-confidence data ‘HSM’, and only high-confidence interaction from this study by ‘HSH’. ‘HR’ is the human PPI network from (Rual et al., 2005), also obtained by Y2H screens. By ‘HH’ we denote the human PPI network from the human protein reference database (HPRD) (Peri et al., 2004). We denote by ‘HM’ the human PPI network from MINT (Zanzoni et al., 2002).

5.2 Results
Using the algorithm described in section 4, we embedded these networks in 2D, 3D and 4D space. The resulting areas under the ROC curves are shown in Figure 3. One striking feature is that using only two dimensions typically gives results that are as good as the cases where dimension three or four is used. Of the 19 PPI networks, all but the YM 2D case produce an area under the ROC curve >0.6, and 11 networks have areas under the ROC curves above 0.75. Note that some of the human PPI networks (Rual et al., 2005; Stelzl et al., 2005) that we analysed come from the first Y2H studies of the human interactome and thus are considered to be of low confidence (‘HR’, ‘HSL’, ‘HSM’, and ‘HSH’ in Figure 3). We expect that low areas under the ROC curve for these networks are due to the noise present in them. Human PPI networks from curated databases MINT and HPRD are of higher confidence than the PPI networks from Y2H studies resulting in high areas under the ROC curve (see ‘HM’ and ‘HH’ in Figure 3).


Figure 3
View larger version (16K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. Values of the areas under the ROC curve arising from embedding nine yeast, one fruitfly, three worm, and six human PPI networks into 2D, 3D and 4D Euclidean space.

 
5.3 Further tests
5.3.1 High-confidence data
Yeast S.Cerevisiae is an organism important for research in human biology. Von Mering et al., (2002) performed a systematic synthesis and evaluation of PPIs obtained using the main high-throughput PPI detection methods for yeast. They integrated 78 390 interactions between 5321 yeast proteins, out of which 2455 are identified by more than one PPI detection method (von Mering et al., 2002). This high-confidence PPI network, which has 2455 interactions amongst 988 proteins, appears as YHC in Figure 3. The actual areas under the ROC curve are 0.892, 0.893, 0.896 for embedding into 2D, 3D and 4D space, respectively. This represents a very good match to the geometric model and, reassuringly, it is the best over all the datasets. Hence, in our further investigations we will focus on this YHC data.

5.3.2 Geometric structure in other random network models
Modelling real-world networks by various types of random graphs began with the work of Erdös and Rényi. One classical random graph model connects nodes uniformly at random with some fixed probability (Erdös and Rényi, 1959, 1960). This simple model does not describe many important properties of real-world networks such as degree distribution and clustering coefficients. Efforts to improve the applicability of these networks produced generalized random graphs (Bender and Canfield, 1978) in which the edges are chosen at random as in Erdös–Rényi graphs, but the degree distribution matches the degree distribution of the real network. Attempts to further improve global properties of the real-world networks led to numerous types of models such as small-world (Watts and Strogatz, 1998) and scale-free (Barabási and Albert, 1999; Simon, 1955) networks. Other models have been constructed with the idea of simulating some biological and topological properties of real biological networks, e.g. stickiness model (Przulj and Higham, 2006).

In Figure 4 we present the results of an experiment to measure the extent to which several random graph models can produce a geometric structure. Here, seven types of random networks have been generated corresponding to, i.e. having the same number of nodes and edges as, the dataset YHC: Erdös–Rényi random graphs (denoted by ‘ER’), random graphs with the same degree distribution as the data (denoted by ‘ER-DD’), 2D geometric random graphs (‘GEO-2D’), 3D geometric random graphs (‘GEO-3D’), 4D geometric random graphs (‘GEO-4D’), scale-free Barabasi–Albert model graphs (‘SF’) and stickiness model graphs (‘STICKY’). Note that ‘ER-DD’, ‘SF’, and ‘STICKY’ are three different types of scale-free networks. For each type, 30 networks have been generated, embedded in 2D, 3D and 4D space, and the area under the ROC curve has been computed. Results are also shown for networks obtained from randomly rewiring 10% (‘G-10P’), 20% (‘G-20P’), and 30% (‘G-30P’) of edges in a 3D geometric random graph, in an attempt to account for false positives and false negatives (further details in Section 5.3.3). The mean and the SD for AUC of these networks are shown in Figure 4, and we see that the algorithm clearly distinguishes between geometric and non-geometric models including scale-free networks. In other words, even allowing for noise in the data, the lack of geometric structure in the other models is apparent.


Figure 4
View larger version (19K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4. Means and SDs of the areas under the ROC curve arising from embedding seven types of random networks and three types of rewired geometric random networks (which are used to simulate the noise present in the real data) into 2D, 3D and 4D Euclidean space.

 
5.3.3 Robustness of our approach
Unfortunately, noise is inherent in all current PPI networks (Mrowka et al., 2001). On one hand, PPI datasets contain a large percentage of false positive interactions. One example is when proteins interact indirectly, i.e. through mediation of one or more molecules, but this is recorded as a direct physical interaction by the experimental method. On the other hand, imperfect experimental methods lead to false negative interactions. Different biochemical techniques produce different sets of false positives and negatives. Thus, trying to find high-confidence PPI networks by overlapping multiple datasets may result in discarding many real interactions.

Since the PPI networks are thought to contain a large percentage of false negatives and a large percentage of false positives, we tested with simulated noise. From Figure 4, we see that for randomly generated GEO-3D graphs the embedding into 3D space is excellent. At the right in Figure 4 we show the effect of rewiring 10%, 20% and 30% of the edges in these GEO-3D graphs. The resulting networks are denoted by G-10P, G-20P and G-30P, respectively. We generated 30 networks of each type (corresponding to the percentages of rewired edges), embedded them in 2D, 3D and 4D space, and computed the areas under the ROC curve. The mean area under the ROC curve for the 10% rewired GEO-3D networks are: 0.811, 0.875 and 0.918, corresponding to 2D, 3D and 4D embeddings, respectively.

In Figure 5, ROC curves corresponding to embeddings into 3D Euclidean space of the high-confidence yeast PPI network (denoted by YHC) and the model networks ER, ER-DD, GEO-3D and SF generated with the same number of nodes and edges as the YHC network, are presented in the same graph for comparison. Also, we included one GEO-3D network with 10% rewired edges (denoted by GEO-3D-10%). We see low areas under ROC curves for random (ER) and scale-free (ER-DD and SF) network types. We also see that the YHC–ROC curve is consistent with that of a noisy geometric network. So, overall, from the ROC curve perspective, any departure from geometric structure in the PPI network can be explained by the inherent noise.


Figure 5
View larger version (31K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5. ROC curves corresponding to embedding of four random networks (ER, ER-DD, GEO-3D and SF), one rewired network (GEO-3D-10%) and the high confidence yeast PPI network (YHC) into 3D Euclidean space.

 

    6 Discussion
 TOP
 ABSTRACT
 1 Introduction
 2 THEORETICAL BASIS
 3 INITIAL ALGORITHM
 4 PRACTICAL ALGORITHM
 5 DATA AND RESULTS
 6 Discussion
 ACKNOWLEDGEMENTS
 REFERENCES
 
It has already been established that the random geometric graph model gives an excellent fit for various global and local measures of PPI networks such as pathlengths, clustering coefficients, relative graphlet frequencies (Przulj et al., 2004), and graphlet degree distributions (Przulj, 2006). The main idea of this work is to test directly whether PPI networks are geometric by embedding them into a low-dimensional Euclidean space. We developed an algorithm that takes PPI network data and attempts to recover the geometric network structure, using specificity and sensitivity measures to quantify the results. The algorithm was demonstrated to work well on artificially constructed geometric random networks, even in the presence of noise. We applied the algorithm to the 19 PPI networks of various organisms (yeast, fruitfly, worm and human) and seven types of random network models including three types of scale-free networks. Also, we compared these results with the results of rewired geometric networks, where rewiring simulates the noise that is present in the real PPI network data. The results we obtained in this work yield support to the hypothesis that the structure of PPI networks is consistent with the structure of a noisy geometric random graph. The fact that the algorithm produced a better fit on high-confidence PPI data suggests that the algorithm could be used to help discover false positives and false negatives in PPI networks.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 Introduction
 2 THEORETICAL BASIS
 3 INITIAL ALGORITHM
 4 PRACTICAL ALGORITHM
 5 DATA AND RESULTS
 6 Discussion
 ACKNOWLEDGEMENTS
 REFERENCES
 
We thank the Institute for Genomics and Bioinformatics (IGB) at UC Irvine for providing computing resources. This project was supported by the NSF CAREER IIS-0644424 grant. M.R. is grateful to the Serbian Ministry of Science for financial support. D.J.H. was supported by grant EP/E0493701/1 from the Engineering and Physical Sciences Research Council of the UK.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Olga Troyanskaya

Received on September 19, 2007; revised on February 8, 2008; accepted on February 27, 2008

    REFERENCES
 TOP
 ABSTRACT
 1 Introduction
 2 THEORETICAL BASIS
 3 INITIAL ALGORITHM
 4 PRACTICAL ALGORITHM
 5 DATA AND RESULTS
 6 Discussion
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Barabási A-L, Albert R. Emergence of scaling in random networks. Science (1999) 286:509–512.[Abstract/Free Full Text]

    Bender EA, Canfield ER. The asymptotic number of labeled graphs with given degree sequences. J. Combinatorial Theory A (1978) 24:296–307.[CrossRef]

    Bradley AP. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recogn. (1997) 30:1145–1159.[CrossRef]

    Cox TF, Cox MAA. Multidimensional Scaling (1994) London: Chapman and Hall.

    Erdös P, Rényi A. On random graphs. Publ. Math. (1959) 6:290–297.

    Erdös P, Rényi A. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci. (1960) 5:17–61.

    Gavin AC, et al. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature (2002) 415:141–147.[CrossRef][Medline]

    Giot L, et al. A protein interaction map of drosophila melanogaster. Science (2003) 302:1727–1736.[Abstract/Free Full Text]

    Golub GH, Van Loan CF. Matrix Computations (1996) 3rd. Baltimore: Johns Hopkins University Press.

    Grindrod P. Range-dependent random graphs and their application to modeling large small-world proteome datasets. Phys. Rev. E (2002) 66:066702.[CrossRef]

    Grindrod P, Kibble M. Review of uses of network and graph theory concepts within proteomics. Expert Rev. Proteomics (2004) 1:229–238.[CrossRef][Web of Science][Medline]

    Higham DJ. Unravelling small world networks. J. Comp. Appl. Math. (2003) 158:61–74.[CrossRef]

    Ho Y, et al. Systematic identification of protein complexes in saccharomyces cerevisiae by mass spectrometry. Nature (2002) 415:180–183.[CrossRef][Medline]

    Ito T, et al. Toward a protein–protein interaction map of the budding yeast: a comprehensive system to examine two-hybrid interactions in all possible combinations between the yeast proteins. Proc. Natl Acad. Sci. USA (2000) 97:1143–1147.[Abstract/Free Full Text]

    Kaski S, et al. Trustworthiness and metrics in visualizing similarity of gene expression. BMC Bioinformatics (2003) 4:48.[CrossRef][Medline]

    Khanin R, Wit E. How scale-free are gene networks? J. Computat. Biol. (2006) 13:810–818.[CrossRef]

    Krogan NJ, et al. Global landscape of protein complexes in the yeast saccharomyces cerevisiae. Nature (2006) 440:637–643.[CrossRef][Medline]

    Lappe M, Holm L. Unraveling protein interaction networks with near-optimal efficiency. Nat. Biotechnol. (2004) 22:98–103.[CrossRef][Web of Science][Medline]

    Li S, et al. A map of the interactome network of the metazoan c elegans. Science (2004) 303:540–543.[Abstract/Free Full Text]

    Mewes HW, et al. MIPS: a database for genomes and protein sequences. Nucleic Acids Res. (2002) 30:31–34.[Abstract/Free Full Text]

    Milo R, et al. Network motifs: simple building blocks of complex networks. Science (2002) 298:824–827.[Abstract/Free Full Text]

    Morrison JL, et al. A lock-and-key model for protein–protein interactions. Bioinformatics (2006) 22:2012–2019.[Abstract/Free Full Text]

    Mrowka R, et al. Is there a bias in proteome research? Genome Res. (2001) 11:1971–1973.[Abstract/Free Full Text]

    Newman MEJ. The structure and function of complex networks. SIAM Rev. (2003) 45:167–256.[CrossRef]

    Penrose M. Geometric Random Graphs (2003) Oxford: Oxford University Press.

    Peri S, et al. Human protein reference database as a discovery resource for proteomics. Nucleic Acids Res. (2004) 32 Database issue. D497–D501, 1362–4962 (Journal Article).

    Przulj N. Biological network comparison using graphlet degree distribution. Bioinformatics (2006) 23:e177–e183.[CrossRef][Web of Science]

    Przulj N, Higham D. Modelling protein–protein interaction networks via a stickiness index. J. R. Soc. Interface (2006) 3:711–716.[Abstract/Free Full Text]

    Przulj N, et al. Modeling interactome: Scale-free or geometric? Bioinformatics (2004) 20:3508–3515.[Abstract/Free Full Text]

    Przulj N, et al. Efficient estimation of graphlet frequency distributions in protein–protein interaction networks. Bioinformatics (2006) 22:974–980.[Abstract/Free Full Text]

    Rual J-F, et al. Towards a proteome-scale map of the human protein–protein interaction network. Nature (2005) 437:1173–1178.[CrossRef][Medline]

    Simon HA. On a class of skew distribution functions. Biometrika (1955) 42:425–440.[Free Full Text]

    Stelzl U, et al. A human proteinprotein interaction network: a resource for annotating the proteome. Cell (2005) 122:957–968.[CrossRef][Web of Science][Medline]

    Taguchi Y, Oono Y. Relational patterns of gene expression via nonmetric multidimensional scaling analysis. Bioinformatics (2004) Advance Access, published online on October 27, 2004, doi:10.1093/bioinformatics/bti067.

    Tape TG. Interpreting diagnostic tests. University of Nebraska Medical Center (2000) Available at http://gim.unmc.edu/dxtests/.

    Thomas A, et al. On the structure of protein–protein interaction networks. Biochem. Soc. Trans. (2003) 31:1491–1496.[Web of Science][Medline]

    Titz B, et al. What do we learn from high-throughput protein interaction data. Expert Rev. Proteomics (2004) 1:111–121.[CrossRef][Web of Science][Medline]

    Uetz P, et al. A comprehensive analysis of protein–protein interactions in saccharomyces cerevisiae. Nature (2000) 403:623–627.[CrossRef][Medline]

    Vazquez A, et al. Modeling of protein interaction networks. ComPlexUs (2001) 1:38–44.[CrossRef]

    von Mering C, et al. Comparative assessment of large-scale data sets of protein–protein interactions. Nature (2002) 417:399–403.[Medline]

    Watts DJ, Strogatz SH. Collective dynamics of ‘small-world’ networks. Nature (1998) 393:440–442.[CrossRef][Medline]

    Xenarios I, et al. DIP: the Database of Interacting Proteins. Nucleic Acids Res. (2000) 28:289–291.[Abstract/Free Full Text]

    Zanzoni A, et al. Mint: a molecular interaction database. FEBS Letters (2002) 513:135–140.[CrossRef][Web of Science][Medline]

    Zhong W, Sternberg P. Genome-wide prediction of C. elegans genetic interactions. Science (2006) 311:1481–1484.[Abstract/Free Full Text]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
24/8/1093    most recent
btn079v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Higham, D. J.
Right arrow Articles by Przulj, N.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Higham, D. J.
Right arrow Articles by Przulj, N.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?