Skip Navigation


Bioinformatics Advance Access originally published online on November 16, 2006
Bioinformatics 2007 23(2):222-231; doi:10.1093/bioinformatics/btl581
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrowOA All Versions of this Article:
23/2/222    most recent
btl581v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (5)
Google Scholar
Right arrow Articles by Li, A.
Right arrow Articles by Horvath, S.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Li, A.
Right arrow Articles by Horvath, S.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© 2006 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Network neighborhood analysis with the multi-node topological overlap measure

Ai Li 1 and Steve Horvath 1,2,*

1 Department of Biostatistics, School of Public Health, University of California Los Angeles, CA 90095-1772, USA
2 Department of Human Genetics, David Geffen School of Medicine, University of California Los Angeles, CA 90095-7088, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 APPLICATIONS
 4 SIMULATION
 5 COMPARING MTOM TO...
 6 CONCLUSION
 REFERENCES
 

Motivation: The goal of neighborhood analysis is to find a set of genes (the neighborhood) that is similar to an initial ‘seed’ set of genes. Neighborhood analysis methods for network data are important in systems biology. If individual network connections are susceptible to noise, it can be advantageous to define neighborhoods on the basis of a robust interconnectedness measure, e.g. the topological overlap measure. Since the use of multiple nodes in the seed set may lead to more informative neighborhoods, it can be advantageous to define multi-node similarity measures.

Results: The pairwise topological overlap measure is generalized to multiple network nodes and subsequently used in a recursive neighborhood construction method. A local permutation scheme is used to determine the neighborhood size. Using four network applications and a simulated example, we provide empirical evidence that the resulting neighborhoods are biologically meaningful, e.g. we use neighborhood analysis to identify brain cancer related genes.

Availability: An executable Windows program and tutorial for multi-node topological overlap measure (MTOM) based analysis can be downloaded from the webpage (http://www.genetics.ucla.edu/labs/horvath/MTOM/).

Contact: shorvath{at}mednet.ucla.edu

Supplementary information: Supplementary material is available at Bioinformatics online.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 APPLICATIONS
 4 SIMULATION
 5 COMPARING MTOM TO...
 6 CONCLUSION
 REFERENCES
 
The main focus of this paper is a fundamental screening task: how to define the neighborhood of an initial set of nodes (genes) in a network. Intuitively speaking, a neighborhood is composed of nodes that are highly connected to a given set of genes. Thus neighborhood analysis facilitates a guilt-by-association screening strategy for finding genes that interact with a given set of biologically interesting genes. To define a neighborhood of an initial gene set, one can make use of a similarity measure. For example, when dealing with gene expression microarray data, it is natural to use the correlation coefficient to measure pairwise gene co-expression similarity (Eisen et al., 1998; Golub et al., 1999).

Here, we consider the setting of an undirected network that can be represented by a symmetric adjacency matrix A =[aij]. In an unweighted network, aij = 1 if nodes i and j are connected and 0 otherwise. In a weighted network, aij isin [0,1] encodes the pairwise connection strength.

A simple approach for defining a neighborhood of node i is to choose the nodes with highest adjacencies aij. In an unweighted network, this amounts to choosing the directly connected neighbors of node i.

Erroneous links (adjacencies) can have a strong impact on network topological inference (Line et al., 2004; Lin and Zhao, 2005). Since spurious or weak connections in the adjacency matrix may lead to ‘noisy’ neighborhoods, it can be advantageous to use node (dis-)similarity measures that are based on common interacting partners or on topological metrics (Ravasz et al., 2002; Brun et al., 2003; Zhao et al., 2006; Chen et al., 2006; Chua et al., 2006). A comparison of different measures can be found in Chua et al. (2006) and Goldberg and Roth (2003).

A limitation of many network similarity measures is that they measure pairwise similarity. Although pairwise similarities are useful for clustering procedures and many gene annotation procedures, we will argue that it can be advantageous for neighborhood analysis to introduce multi-node similarity measures.

We outline a procedure for generalizing pairwise network similarity or dissimilarity measures that are based on shared neighbors. The resulting multi-point measures keep track of the numbers of shared neighbors among multiple network nodes. We apply our approach to the topological overlap measure (TOM) (Ravasz et al., 2002).

1.1 Topological overlap measure
The topological overlap of two nodes reflects their similarity in terms of the commonality of the nodes they connect to. In an unweighted network, the number of shared neighbors of nodes i and j is given by Formula. The topological overlap T = [tij] is a normalized version of this quantity. Specifically, the following definition of the pairwise TOM can be found in the Supplementary material of Ravasz et al. (2002):


Formula 1

(1)
The inclusion of the term aij in the numerator makes tij explicitly depend on the existence of a direct link between the two nodes in question. An advantage of the quantity 1 in the denominator is that it prevents the denominator from becoming 0 when Formula.

In the following, we use 0 ≤ aij ≤ 1 to prove that 0 ≤ tij ≤ 1. Since Formula and Formula, which implies Formula. Along with aij ≤ 1, we find that the numerator of tij is smaller than the denominator.


    2 APPROACH
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 APPLICATIONS
 4 SIMULATION
 5 COMPARING MTOM TO...
 6 CONCLUSION
 REFERENCES
 
2.1 Multi-node topological overlap measure
Here, we generalize the topological overlap matrix to multiple nodes and show how to use it in neighborhood analysis. Our multi-node TOM (MTOM) is motivated by the observation that formula (1) can be expressed as

Formula 2(2)
where N(i, j) denotes the set of direct neighbors shared by i and j, N(i,–j) denotes the set of the neighbors of i excluding j and |·| denotes the number of elements (cardinality) in its argument. Algebraically, we find

Formula 3(3)
The binomial coefficient Formula 3 in the denominator of (2) is an upper bound of aij.

In light of formula (2), it is natural to define the MTOM for three different nodes i, j, k as follows:

Formula 4(4)
where

Formula 5(5)
Here, N(i, j, –k) can be regarded as the set of the neighbors shared by i and j excluding k. The binomial coefficient Formula 5 in the denominator of (4) is an upper bound of aij + aik + ajk and equals the number of connections that can be formed between i, j and k. Analogous to the proof for two nodes, one can prove that 0 ≤ tijk ≤ 1. It is straightforward to extend the definition of the TOM to four or more nodes.

Generalizing MTOM to weighted networks. The algebraic formulas for MTOM do not require that the adjacencies aij take on binary values, i.e. they encode an unweighted network. Even for a weighted network with 0 ≤ aij ≤ 1, MTOM takes on values in the unit interval. Therefore, we use the algebraic formulation of the topological overlap matrix to define MTOM for weighted networks. Two simple examples illustrating the MTOM computation for four nodes are presented in Figure 1.


Figure 1
View larger version (9K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1 Computing the four node TOMs for nodes A,B,C,D in two simple networks 1) Formula 5 and 2) Formula 5.

 
2.2 MTOM-based neighborhoods
We consider two basic approaches for defining a neighborhood based on the concept of multi-node topological overlap. The default approach is to build the neighborhood recursively. The non-recursive alternative is computationally faster but produces less interconnected neighborhoods.

The MTOM-based neighborhood analysis requires as input an initial seed neighborhood composed of S0 node(s) (S0 ≥ 1) and the requested final size of the neighborhood St = S0 + S ≥ S0, where S is the total number of nodes that will add to the initial neighborhood.

  1. Recursive approach
    1. For each node outside of the current neighborhood, compute the MTOM value of the combined set of this node and the node(s) in the current neighborhood.
    2. Add the node associated with the highest MTOM value to the current neighborhood to reach the updated neighborhood.
    3. Repeat Steps (a) and (b) S times until the final neighborhood size St is reached.

  2. Non-recursive approach
    1. For each node outside of the initial neighborhood, compute the MTOM value of the combined set of this node and the node(s) in the initial neighborhood.
    2. Choose the S nodes associated with the highest MTOM values and combine with the initial neighborhood as the final neighborhood.

Since the recursive approach leads to neighborhoods with higher MTOM values, it is preferable over the computationally faster, non-recursive approach.

2.3 Local permutations for choosing the neighborhood size S
An obvious challenge is to choose the number S = St S0 of nodes to be added to the initial neighborhood. Although prior knowledge of the pathway size may guide this choice, this information is not always available. We propose a permutation test based guideline to assist with the choice of S. The permutation test compares MTOM values based on the original adjacency matrix with their corresponding values in permuted versions of the adjacency matrix. We find that global (whole network) permutations often lead to a network without any module structure and to unrealistically large estimates of the neighborhood size (thousands of nodes). Therefore, we propose to permute only those rows of the adjacency matrix that correspond to nodes in the initial seed neighborhood. Next, the corresponding columns are permuted so that the resulting permuted adjacency matrix remains symmetric. After performing multiple permutations, one can estimate the 95th percentile of the permuted MTOM values. Figure 2 shows the original MTOM value as a function of S the 95th percentile of the MTOM values calculated on the basis of locally permuted versions of the adjacency matrix. In our applications, we find that there is a value Sc such that if more than Sc nodes are added to the initial neighborhood recursively MTOM value curve dips below the 95th percentile of the permuted MTOM value curves. Since for neighborhood sizes smaller than St0, where St0 = S0 + Sc, the neighborhood is more interconnected than 95% of the locally permuted neighborhoods, we chose a neighborhood size close to St0 in our applications. The proposed local permutation test for choosing a neighborhood size is meant as a heuristic. In practice, the user should explore the robustness of the estimate with respect to picking other percentiles, e.g. the 90th percentile. Of course, prior biological knowledge regarding the neighborhood size should take precedence over the rough estimate provided by the local permutation test. As an alternative, we suggest that hierarchical clustering analysis involving the pairwise TOM dissimilarity may also provide some estimate on how large a cluster may surround the initial set. Neighborhood analysis, similar to gene screening strategies, leads to results that require careful validation involving independent datasets and biological validation methods.


Figure 2
View larger version (10K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2 Using a local permutation test to choose the neighborhood size, i.e. the number of nodes S to be added to the initial neighborhood. (a) Yeast cell-cycle example; (b) Drosophila protein–protein interaction network. The solid line shows the MTOM values (y-axis) of the observed network as a function of different neighborhood sizes (x-axis). The dashed line shows the 95th percentile of the MTOM values based on locally permuted adjacency matrices. A local permutation only permutates those rows (and columns) of the adjacency matrix that correspond to node of the initial neighborhood. As heuristic, we suggest to choose a value for S close to where the solid line (observed values) crosses the dashed line (95th percentile of permuted values).

 

    3 APPLICATIONS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 APPLICATIONS
 4 SIMULATION
 5 COMPARING MTOM TO...
 6 CONCLUSION
 REFERENCES
 
In the following sections, we apply our methods to gene co-expression networks and simple protein–protein interaction (PPI) networks.

3.1 Predicting brain cancer genes in a co-expression network
The proposed neighborhood analysis can be used for both unweighted and weighted gene co-expression networks. Here, we apply the method to find brain cancer related genes based on different initial seed neighborhoods. The data consisted of 55 brain cancer patients and their survival times. The gene expression profiles of each patient were measured using Affymetrix HG-U133A microarrays as detailed in Horvath et al. (2006). The details of the gene co-expression network construction are presented in Zhang and Horvath (2005). Briefly, the network adjacency matrix was defined by raising the Pearson correlation matrix between the gene expression profiles to the sixth power, i.e. aij = |cor(xi, xj)|6, where xi and xj are the expression profiles of gene i and j, respectively. Our findings remain largely unchanged with regard to different choices of the power ß = 6. Further, an unweighted network construction approach leads to similar results (see our online Supplementary material).

To illustrate the value of taking a multi-node perspective, we applied the MTOM approach to an initial seed neighborhood composed of five well-known cancer-related genes: TOP2A, Rac1, TPX2, EZH2 and KIF14. Table 1 shows the results from the recursive MTOM analysis. Out of 20 probes in the MTOM neighborhood, we find that 15 are cancer related, which provides empirical evidence that the MTOM approach leads to biologically meaningful results.


View this table:
[in this window]
[in a new window]

 
Table 1 MTOM neighborhood analysis of an initial neighborhood composed of five well-known cancer genes: TOP2A, Rac1, TPX2, EZH2 and KIF14

 
In the following sections, we provide a limited comparison to the simple (naive) approach of defining a neighborhood of node t based on ranking the remaining network nodes by their adjacencies with node t. For weighted gene co-expression networks constructed with the power function, this naive approach is equivalent to choosing genes based on the absolute values of their correlations with a given gene expression profile.

It is worth emphasizing that when dealing with microarray data, one can also determine the neighborhood of a quantitative microarray sample trait, e.g. cancer survival time. To accomplish this mathematically, one considers the sample trait as an additional, idealized gene expression profile when constructing the co-expression network. For our weighted network example, the adjacency between the sample trait T and the i-th gene expression profile is given by aTi = | cor(T, xi)|ß. In Table 2, we report the results of using MTOM to find a neighborhood with 20 gene neighbors around the sample trait ‘brain cancer survival time’. We find that the MTOM-based neighborhoods are enriched with cancer- and neuron-related genes. Out of the 20 probe sets, 11 are related to neuron cells and 10 are related to cancer. Since the brain cancer microarray data were based on neuronal tissue samples, finding neuron- or cancer-related genes provides indirect (but only tentative) evidence that the resulting neighborhoods are biologically meaningful. Note that several of the probe sets in Table 2 correspond to the same genes, but the correlation between gene expression profiles and survival time varies greatly across the different probe sets of a gene.


View this table:
[in this window]
[in a new window]

 
Table 2 Neighborhood of survival time based on the recursive MTOM approach

 
In contrast, a standard, naive approach, which simply selects a neighborhood on the basis of the absolute values of the correlations between gene expression profile and survival time, leads to a neighborhood with fewer cancer- and neuron-related genes. Out of the 20 most highly correlated probe sets in Table 3, only 4 are related to neuron cells and only 6 are related to cancer. Comparing Tables 2 and 3 provides indirect empirical evidence that the MTOM neighborhood analysis leads to biologically more meaningful results than the standard approach in this application.


View this table:
[in this window]
[in a new window]

 
Table 3 Correlation based neighborhood of the survival time (TTS)

 
3.2 Neighborhood analysis for predicting cell cycle proteins in yeast
Here, we use an MTOM neighborhood analysis to predict cell cycle related proteins. Numerous protein annotation methods have been presented in the literature, e.g. recent papers include Deng et al. (2006) and Carroll and Pavlovic (2006). Our limited analysis is meant to illustrate the value of taking a multi-node perspective. A comprehensive comparison to other methods is beyond the scope of this article. The protein identifiers of the open reading frames (ORF) were obtained from the Saccharomyces Genome Database (SGD) and the yeast protein–protein interactions (PPI) were retrieved from the Munich Information Center for Protein Sequences (MIPS) (Guldener et al., 2006). We restricted the analysis to the largest connected component composed of 3858 proteins with 7196 pairwise interactions. To compare different neighborhood analysis approaches, we studied the neighborhoods of subsets of 101 cell cycle related proteins found in the Kyoto Encyclopedia of Genes and Genomes (KEGG). A local permutation test suggested S = 10. Within each neighborhood, we determined the number C of cell cycle related proteins. We found that C is significantly correlated with the network connectivity k of the initial protein (Spearman correlation r = 0.36, P-value ≤ 0.001) across the 101 cell cycle genes. We focused the neighborhood analysis on subsets of the 50 most highly connected ‘hub’ cell cycle related proteins. These proteins had a connectivity ≥4, i.e. each initial protein had at least four known interactions. Our results were largely unchanged with regard to using more highly connected genes in the initial neighborhood set. However, using less connected proteins (k < 3) leads to neighborhoods that contain very few cell cycle related proteins.

As can be seen from Figure 3, the neighborhoods of cell cycle genes tend to be enriched with other cell cycle genes as well. A major advantage of the MTOM screening approach is the ability to input multiple initial nodes as seed set. Figure 3 shows that an initial seed neighborhood composed of two cell cycle related hub proteins leads to far better results than using a single protein as input. But as Figure 4 indicates, this is only true for protein pairs that have high topological overlap. Note that pairs of proteins resulting in neighborhoods with high percentages of cell cycle related proteins are composed of proteins with high TOM.


Figure 3
View larger version (34K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3 Comparing the percentage of cell cycle proteins R (y-axis) in neighborhoods constructed in different ways for the Yeast Protein–Protein Physical Interaction Network (MIPS Data). The recursive approach involving an initial neighborhood of two cell cycle related ‘hub’ proteins performs better than approaches based on an initial set composed of a single protein. In this application, the recursive and the non-recursive MTOM neighborhood analysis involving a single initial protein do not lead to better results than the naive approach of building a neighborhood on the basis of direct connections (adjacency = 1) with the initial protein. We also report the P-values of the Kruskal–Wallis rank sum test, which is a non-parametric multi-group comparison test.

 


Figure 4
View larger version (13K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4 Boxplots for visualizing the distribution of the topological overlap (y-axis) of initial protein pairs that lead to neighborhoods with a high percentage of cell cycle genes (x-axis). A boxplot consists of the most extreme values (the whiskers) in the dataset (maximum and minimum values), the lower and upper quartiles (lower and upper boundary of the box) and the median value (horizontal line inside the notch). A notch is drawn on each side of the box. If the notches of two plots do not overlap, the two medians differ significantly.

 
3.3 Neighborhood analysis for predicting essential yeast proteins
Networks are a natural framework for understanding PPI, see e.g. Jeong et al. (2001), Yook et al. (2004) and Deng et al. (2006). Knock-out experiments in lower organisms (e.g. yeast, fly and worm) have shown that essential proteins tend to be highly connected ‘hub’ proteins in PPI networks (Jeong et al., 2001, 2003; Hahn and kern, 2005). Here we use MTOM-based neighborhood analysis to predict essential proteins in a yeast PPI network (BioGrid data) (Breitkreutz et al., 2003). The largest connected component contained 3332 proteins that include 877 essential proteins. We find that proteins that are in the neighborhood of essential, highly connected hub proteins have an increased chance of being essential as well. Specifically, we picked essential seed genes from among the 200 most highly connected essential proteins. Based on our local permutation test, we chose S = 30 for MTOM analysis. The percentage of essential proteins in the neighborhoods constructed by different methods are reported in Figure 5. Apart from seed sets composed of a single gene, we also considered seeds involving two and three essential hub proteins with high topological overlap. Note that as the initial neighborhood size increases, so does the biological signal in the resulting neighborhoods. In this application, neighborhoods built on the basis of multiple interconnected initial proteins lead to better results than standard methods that can only input a single protein.


Figure 5
View larger version (37K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5 Comparing the percentage of essential proteins R (y-axis) in neighborhoods constructed in different ways for the yeast PPI network (BioGrid Data). The neighborhoods initialized by sets of two or three hub proteins contain more essential proteins than those constructed from a single protein. We also report the P-values of the Kruskal–Wallis rank sum test, which is a standard non-parametric multi-group comparison test.

 
3.4 Neighborhood analysis for predicting essential proteins in Drosophila
Here, we apply MTOM based neighborhood analysis to predict essential proteins in a Drosophila (fly) PPI network (BioGrid Data) (Breitkreutz et al., 2003). The largest connected component contained 2294 proteins that include 282 known essential proteins. Since essential genes tend to be highly connected, we chose subsets of the 100 most highly connected essential proteins as initial seeds. Our local permutation test suggested S = 30. Figure 6 reports the percentages of essential proteins in neighborhoods constructed using alternative methods. In this application, the recursive MTOM neighborhood analysis involving a single initial seed protein leads to a better result than both the naive and the non-recursive MTOM approaches. Further, Figure 6 demonstrates the value of choosing multiple nodes as seeds for neighborhood analysis.


Figure 6
View larger version (39K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 6 Comparing the percentages of essential proteins R (y-axis) in neighborhoods constructed in the Drosophila PPI network. The recursive approach involving an initial neighborhood of a single essential protein performs better than the non-recursive and naive approaches. As the initial neighborhood size increases, so does the biological signal in the resulting neighborhoods.

 

    4 SIMULATION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 APPLICATIONS
 4 SIMULATION
 5 COMPARING MTOM TO...
 6 CONCLUSION
 REFERENCES
 
To evaluate our method, we simulated a network model motivated by our yeast and cancer co-expression network applications. Although this simple model was motivated by our unpublished research on the structure of co-expression networks, it is beyond the scope of this article to discuss the relationship of this simple model to actual weighted gene co-expression networks. Here, we use the model to argue that MTOM can lead to more meaningful results than standard neighborhood analysis methods.

Specifically, we simulated a gene expression dataset {xij} composed of 2000 genes (1 ≤ i ≤ 2000) and 60 microarray samples (1 ≤ j ≤ 60). The network was simulated to be composed of six modules with sizes n1 = 400, n2 = 400, n3 = 400, n4 = 300, n5 = 300 and n6 = 200. Within the k-th module, the i-th gene had the following expression value in the j-th microarray sample.

Formula 5
where the stochastic noise Formula 5 was simulated to follow a normal distribution with mean 0 and variance 6. The vector Formula 5 is given below and turned out to be highly correlated (r > 0.95) with the first principal component of the corresponding module expression matrix (also known as module eigengene or metagene).

Formula 5
where the indicator function Formula 5 equals 1 if the condition is satisfied and 0 otherwise. To quantify co-expression, we correlated the simulated gene expressions with each other, which resulted in a 2000 x 2000 dimensional correlation matrix. To arrive at a simulated weighted gene co-expression network (adjacency matrix), we raised the entries of the correlation matrix to the power of ß = 6, i.e. aij = |cor(xi, xj)|6, where xi and xj are the expression profiles of gene i and j, respectively.

The goal of our neighborhood analysis was to determine membership in the first module that contained n1 = 400 genes. We considered initial neighborhoods composed of 1 or 2 genes out of the 50 most highly connected module genes. We considered S = 30. For each neighborhood, the percentage of module 1 genes represents the simulated biological signal. Figure 7 shows the results from averaging the signal over 50 MTOM analyses corresponding to a single initial hub gene and 500 MTOM analyses corresponding to pairs of genes with high topological overlap.


Figure 7
View larger version (38K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 7 Comparing the percentage of module 1 genes (y-axis) that are retrieved by different neighborhood construction methods for the simulated network. The recursive approach involving an initial neighborhood of two ‘hub’ genes in the first module leads to the best neighborhoods. In this application, the recursive and the non-recursive MTOM neighborhood analysis involving a single initial gene outperforms the naive approach of simply using the 30 genes with highest adjacency with the initial gene. Further, an initial neighborhood composed of two genes (with high topological overlap) leads to better results than initial neighborhoods composed of a single gene.

 

    5 COMPARING MTOM TO THE AVERAGE PAIRWISE TOM
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 APPLICATIONS
 4 SIMULATION
 5 COMPARING MTOM TO...
 6 CONCLUSION
 REFERENCES
 
One can easily define a multi-node similarity measure by the average of the pairwise similarities between the nodes. Since the average pairwise similarity measure is computationally much simpler, it is important to argue that a MTOM performs better than the average pairwise TOMs. To facilitate such a comparison, we study here the performance of the averaged TOM neighborhood construction method which non-recursively add nodes based on average pairwise TOM.

To compare our proposed recursive MTOM method and with the averaged TOM neighborhood construction method, we carried out three comparisons.

The first comparison involves comparing the simulated or biological signal in the resulting neighborhoods for the different applications. Using simulated and biological applications, we find that the MTOM method outperforms the averaged TOM method. In Figure 8, we report three representative comparisons.


Figure 8
View larger version (8K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 8 Recursive MTOM neighborhoods contain a significantly better signal (y-axis) than averaged TOM neighborhoods. Here we report three representative examples: (a) the simulated network; (b) essential genes in the yeast PPI network; (c) essential genes in the Drosophila (fly) PPI network. We report the Kruskal–Wallis P-values for comparing the median values. The median value corresponds to the horizontal line inside the box. The corresponding notch around the median line denotes the 95% confidence interval.

 
The second comparison involves comparing the MTOM values of the neighborhoods constructed with the different methods. As is to be expected, MTOM based neighborhoods have significantly higher MTOM values than neighborhoods constructed with the averaged TOM method (Fig. 9).


Figure 9
View larger version (8K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 9 Recursive MTOM neighborhoods have higher MTOM values (y-axis) than averaged TOM neighborhoods. Here we report three representative examples: (a) the simulated network; (b) essential genes in the yeast PPI network; (c) essential genes in the Drosophila (fly) PPI network.

 
The third comparison involves comparing the average pairwise TOM values of the neighborhoods constructed with the different methods. According to this metric, we find that the recursive MTOM method is significantly better than the averaged TOM approach (Fig. 10). In summary, we find that the proposed MTOM outperforms the average pairwise TOM in our applications and simulated example.


Figure 10
View larger version (8K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 10 Recursive MTOM neighborhoods have higher average pairwise TOM value (y-axis) than averaged TOM neighborhoods. Here we report three representative examples: (a) the simulated network; (b) essential genes in the yeast PPI network; (c) essential genes in the Drosophila (fly) PPI network.

 

    6 CONCLUSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 APPLICATIONS
 4 SIMULATION
 5 COMPARING MTOM TO...
 6 CONCLUSION
 REFERENCES
 
If individual network connections are susceptible to noise, then it can be advantageous to define neighborhoods on the basis of a more robust measure based on shared neighbors, e.g. the TOM. To illustrate the value of taking a multi-node perspective when defining neighborhoods, we generalize the standard pairwise TOM to measure the topological overlap of multiple nodes (MTOM). MTOM is a natural extension of the standard pairwise TOM to multiple nodes. But it should be straightforward to adapt our approach to alternative overlap measures described in Brun et al. (2003), Zhao et al. (2003), Chen et al. (2006) and Chua et al. (2006). Since computation time was a concern in our analyses, we presented a recursive and non-recursive approach for constructing neighborhoods. But it may be worth while to explore the use of alternative, more time consuming, construction methods. For example, stepwise methods that allow for node deletion at each step may lead to neighborhoods with higher MTOM values.

Further, we describe a local permutation scheme for determining the size of a neighborhood.

Using four network applications and a simulated example, we provide evidence that the MTOM approach yields biologically meaningful results. For example, we use MTOM to identify brain cancer related genes in a co-expression network and to identify essential genes in protein interaction networks. We provide empirical evidence that a neighborhood surrounding an initial set of two or more nodes can be far more informative than the neighborhood of a single node.

Our approach has several limitations. First and foremost, MTOM-based neighborhood analysis will only be useful in applications that satisfy the following assumption: the more neighbors are shared by multiple nodes, the stronger is the biological relationship among them. The second limitation of our approach is that we assume the setting of an undirected network. Although these types of networks are widely used in systems biology, we briefly mention that directed, Bayesian or Boolean network models allow for a probabilistic or even causal analysis of similar data, see e.g. Schaefer and Strimmer (2005) and Carroll and Pavlovic (2006).

The TOM can serve as a filter that decreases the effect of spurious or weak connections. Our applications and several publications provide empirical evidence that the topological overlap matrix leads to biologically meaningful results (Ravasz et al., 2002; Ye and Godzik, 2004; Carlson et al., 2006; Gargalovic et al., 2006; Ghazalpour et al., 2006; Horvath et al., 2006). But there will undoubtedly be situations when alternative similarity measures are preferable. We expect that the multi-node measures will also be useful for module detection when coupled with a suitable clustering procedure.


    Acknowledgments
 
The authors would like to thank our collaborators Jun Dong, Dan Geschwind, Peter Langfelder, Jake Lusis, Paul Mischel, Stan Nelson, Mike Oldham, Anja Presson, Lin Wang and Wei Zhao. Funding to pay the Open Access publication charges for this article was provided by 1U19AI063603-01.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Golan Yona

Received on August 27, 2006; revised on November 6, 2006; accepted on November 14, 2006

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 APPLICATIONS
 4 SIMULATION
 5 COMPARING MTOM TO...
 6 CONCLUSION
 REFERENCES
 

    Breitkreutz, B., et al. (2003) The GRID: the general repository for interaction datasets. Genome Biol, . 4, R23[CrossRef][Medline].

    Brun, C., et al. (2003) Functional classification of proteins for the prediction of cellular function from a protein–protein interaction network. Genome Biol, . 5, R6[CrossRef][Medline].

    Carlson, M., et al. (2006) Gene connectivity, function, and sequence conservation: predictions from modular yeast co-expression networks. BMC Genomics, 7, 40[CrossRef][Medline].

    Carroll, S. and Pavlovic, V. (2006) Protein classification using probabilistic chain graphs and the gene ontology structure. Bioinformatics, 22, 1871–1878[Abstract/Free Full Text].

    Chen, J., et al. (2006) Increasing confidence of protein interactomes using network topological metrics. Bioinformatics, 22, 1998–2004[Abstract/Free Full Text].

    Chua, N.H., et al. (2006) Exploiting indirect neighbours and topological weight to predict protein function from protein–protein interactions. Bioinformatics, 22, 1623–1630[Abstract/Free Full Text].

    Deng, M., et al. (2006) Mapping gene ontology to proteins based on protein–protein interaction data. Bioinformatics, 20, 895–902.

    Eisen, M., et al. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA, 95, 14863–14868[Abstract/Free Full Text].

    Gargalovic, P., et al. (2006) Identification of inflammatory gene modules based on variations of human endothelial cell responses to oxidized lipids. Proc. Natl Acad. Sci. USA, 103, 12741–12746[Abstract/Free Full Text].

    Ghazalpour, A., et al. (2006) Integrating genetics and network analysis to characterize genes related to mouse weight. PLos Genet, . 2, e130[CrossRef][Medline].

    Goldberg, D. and Roth, F. (2003) Assessing experimentally derived interactions in a small world. Proc. Natl Acad. Sci. USA, 100, 4372–4376[Abstract/Free Full Text].

    Golub, T.R., et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531–537[Abstract/Free Full Text].

    Guldener, U., et al. (2006) Mpact: the mips protein interaction resource on yeast. Nucleic Acids Res, . 34, 436–441[Abstract/Free Full Text].

    Hahn, M.W. and Kern, A.D. (2005) Comparative genomics of centrality and essentiality in three eukaryotic protein–interaction networks. Mol. Biol. Evol, . 22, 803–806[Abstract/Free Full Text].

    Horvath, S., et al. (2006) Analysis of oncogenic signaling networks in glioblastoma identifies aspm as a novel molecular target. Proc. Natl Acad. Sci. USA, 103, 22–29.

    Jeong, H., et al. (2001) Lethality and centrality in protein networks. Nature, 411, 41[CrossRef][Medline].

    Jeong, H., et al. (2003) Prediction of protein essentiality based on genome data. ComPlexUs, 1, 19–28.

    Lin, N. and Zhao, H. (2005) Are scale-free networks robust to measurement errors? BMC Bioinformatics, 16, 119[CrossRef].

    Lin, N., et al. (2004) Information assessment on predicting protein–protein interactions. BMC Bioinformatics, 4, 154.

    Ravasz, E., et al. (2002) Hierarchical organization of modularity in metabolic networks. Science, 297, 1551–1555[Abstract/Free Full Text].

    Schaefer, J. and Strimmer, K. (2005) An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics, 21, 754–764[Abstract/Free Full Text].

    Ye, Y. and Godzik, A. (2004) Comparative analysis of protein domain organization. Genome Biol, . 14, 343–353[CrossRef].

    Yook, S.Y., et al. (2004) Functional and topological characterization of protein interaction networks. Proteomics, 4, 928–942[CrossRef][ISI][Medline].

    Zhang, B. and Horvath, S. (2005) A general framework for weighted gene co-expression network analysis. Stat. Appl. Genet. Mol. Biol, . 4, 17.

    Zhao, W., et al. (2006) Information theoretic method for recovering temporal gene regulations from time series microarray data. Bioinformatics, 22, 2129–2135[Abstract/Free Full Text].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
J. Neurosci.Home page
J. A. Miller, M. C. Oldham, and D. H. Geschwind
A Systems Level Analysis of Transcriptional Changes in Alzheimer's Disease and Normal Aging
J. Neurosci., February 6, 2008; 28(6): 1410 - 1420.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrowOA All Versions of this Article:
23/2/222    most recent
btl581v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (5)
Google Scholar
Right arrow Articles by Li, A.
Right arrow Articles by Horvath, S.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Li, A.
Right arrow Articles by Horvath, S.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?