Skip Navigation

Bioinformatics 2008 24(16):i28-i34; doi:10.1093/bioinformatics/btn296
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Pandey, J.
Right arrow Articles by Grama, A.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Pandey, J.
Right arrow Articles by Grama, A.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Functional coherence in domain interaction networks

Jayesh Pandey 1,*, Mehmet Koyutürk 2, Shankar Subramaniam 3 and Ananth Grama 1

1Department of Computer Science, Purdue University, West Lafayette, IN, USA, 2Department of Electrical Engineering & Computer Science, Case Western Reserve University, Cleveland, OH, USA and 3Department of Biomedical Engineering, University of California, San Diego, USA, CA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 MATERIALS
 4 RESULTS
 5 CONCLUSION
 REFERENCES
 

Motivation: Extracting functional information from protein–protein interactions (PPI) poses significant challenges arising from the noisy, incomplete, generic and static nature of data obtained from high-throughput screening. Typical proteins are composed of multiple domains, often regarded as their primary functional and structural units. Motivated by these considerations, domain–domain interactions (DDI) for network-based analyses have received significant recent attention. This article performs a formal comparative investigation of the relationship between functional coherence and topological proximity in PPI and DDI networks. Our investigation provides the necessary basis for continued and focused investigation of DDIs as abstractions for functional characterization and modularization of networks.

Results: We investigate the problem of assessing the functional coherence of two biomolecules (or segments thereof) in a formal framework. We establish essential attributes of admissible measures of functional coherence, and demonstrate that existing, well-accepted measures are ill-suited to comparative analyses involving different entities (i.e. domains versus proteins). We propose a statistically motivated functional similarity measure that takes into account functional specificity as well as the distribution of functional attributes across entity groups to assess functional similarity in a statistically meaningful and biologically interpretable manner. Results on diverse data, including high-throughput and computationally predicted PPIs, as well as structural and computationally inferred DDIs for different organisms show that: (i) the relationship between functional similarity and network proximity is captured in a much more (biologically) intuitive manner by our measure, compared to existing measures and (ii) network proximity and functional similarity are significantly more correlated in DDI networks than in PPI networks, and that structurally determined DDIs provide better functional relevance as compared to computationally inferred DDIs.

Contact: jpandey{at}cs.purdue.edu


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 MATERIALS
 4 RESULTS
 5 CONCLUSION
 REFERENCES
 
Availability of high-throughput protein–protein interaction (PPI) data makes it possible to study the function of biological systems from a network perspective. Recent advances in this area have focused on the development of computational infrastructure for network-based functional annotation (Sharan et al., 2007), identification of functionally coherent modules (Spirin and Mirny, 2003) and evolutionary network analysis (Koyutürk et al., 2006). However, the use of PPI data for computational assessment of network function poses several challenges: (i) PPI data generated by high-throughput screening is generally noisy and incomplete (Titz et al., 2004), (ii) PPI data provides only a generic and static picture of the cellular network, i.e. it does not capture the spatio-temporal dynamics of biological systems (Han et al., 2004) and (iii) proteins themselves are typically composed of multiple functional domains. For this reason, significant efforts are devoted to increasing the quality and reliability of PPI data, as well as using other data sources and abstractions to study interaction data (Lee et al., 2004).

An important limitation of PPI data that relates to the dynamics of cellular systems is that it does not explicitly capture the domain specificity of interactions. Domains in proteins are often regarded as primary functional and structural units (Bateman et al., 2004). Therefore, the functional relevance of an interaction may be considered at the domain level as well. However, the specificity of interactions at this level cannot be captured by high-throughput screening. Consequently, domain–domain interactions (DDI) are often identified using either dedicated structural analysis (Gong et al., 2005) or computational inference from PPI data (Deng et al., 2002; Riley et al., 2005). As DDI data and databases become commonplace (Ng et al., 2003; Raghavachari et al., 2007), DDI networks provide an attractive abstraction for functional network analysis (Schlicker et al., 2007; Wuchty, 2006).

In this article, we investigate how functional modularity manifests itself in a network of molecular interactions, considering different molecular entities—proteins and domains. This question is studied extensively on PPI and gene co-expression networks (Sevilla et al., 2005), however, knowledge on interactions involving different molecular entities is relatively scarce. In order to provide a basis and motivation for computational analysis of DDI networks, we investigate how network proximity in a DDI network relates to the functional coherence of domains. For this purpose, we consider PPI networks as a reference, and compare PPI and DDI networks comprehensively in terms of the relationship between network proximity and functional similarity.

While comparing networks composed of different molecular entities, it is particularly difficult to quantify the functional similarity between two entities in an unbiased manner. This is because functional similarity may have different meanings for different molecular entities. Furthermore, from a practical standpoint, the functional information available for different types of molecular entities may have different characteristics. This is indeed the case for proteins and domains. Most of the available functional annotations for domains are derived from annotations for proteins (Schug et al., 2002). Consequently, they are more general, scarce, and incomplete compared to protein annotations. Motivated by this observation, we develop a formal framework for evaluating metrics for assessing functional similarity between two molecular entities. We establish essential attributes of admissible measures of functional coherence, and demonstrate that existing, well-accepted measures are ill-suited to comparative analyses involving different entities. We propose an information theoretic functional similarity measure that takes into account functional specificity as well as distribution of functional attributes across entities. This results in a more statistically meaningful and biologically interpretable functional similarity measure that relies on only positive evidence to quantify the functional coherence of molecular entities—thus eliminating any artifacts caused by incompleteness of annotations. On a comprehensive collection of PPI and DDI data, we show that our measure indeed captures the relation between network proximity and functional coherence in a more biologically interpretable manner.

Using our proposed functional similarity measure, we compare PPI and DDI networks for diverse species comprehensively. We consider PPIs from large public databases that integrate different sources of data, as well DDIs that are derived from different sources, ranging from structural analysis to computational inference. Our results show that functional coherence is more closely related to network proximity in DDI networks as compared to PPI networks, clearly motivating the use of DDI data in the analysis of networks for functional inference. We also show that, for different sub-ontologies of Gene Ontology (GO) (Ashburner et al., 2000), functional coherence manifests itself differently in the networks.


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 MATERIALS
 4 RESULTS
 5 CONCLUSION
 REFERENCES
 
Understanding the relationship between network topology and functional modularity requires measures for assessing the functional similarity (or coherence) of a group of entities with respect to each other. For example, in testing the hypothesis that functional modularity is related to high connectivity in PPI networks, it is common to investigate the functional purity of groups of proteins that induce dense subgraphs in the network (Grossmann et al., 2006). In this work, we focus on the relationship between the topological proximity of two entities in a network and their functional similarity. The eventual goal is to determine whether functional relationship manifests itself better in PPI or DDI networks.

There exist several approaches to assessing functional similarity of bio-molecules (e.g. genes, proteins, domains) (Lord et al., 2003; Schlicker et al., 2007). Since functional categories are not isolated, but rather related to each other through a taxonomy (e.g. GO), it is necessary to consider the underlying taxonomy while comparing molecules in terms of their functional annotation (Resnik, 1995). Various approaches take into account different factors, including taxonomical distance, specificity/generality (rank in hierarchy) of common ancestor, and associated number of molecules for the functional terms being compared. Since most molecules are associated with multiple functional terms, assessment of functional similarity between two molecules poses an additional challenge, namely one of evaluating the similarity between two sets of terms, as opposed to a pair of terms. Common, and relatively straightforward approaches to this problem include taking the maximum (Schlicker et al., 2007) or average (Lord et al., 2003) of similarities among all pairs of terms in the two sets. We show that neither of these alternatives provide robust metrics for extending term similarity to set similarity. We develop an information theoretic measure for set similarity that directly computes similarity of sets as a whole, as opposed to computing an aggregate of pairwise term similarities. Our measure takes into account the information content of the most specific of the common ancestors of all terms, and quantifies positive reinforcement of similar terms, avoiding negative contributions arising from incomplete data.

In order to motivate this approach, we provide a formal framework for the problem, and identify the desirable properties of a metric for evaluating the functional similarity between two molecules in this framework.

2.1 Concepts and ontologies
Let C={ci|1≤i≤N} be a finite partially ordered set of concepts. In terms of GO, these concepts represent the GO terms (i.e. molecular function, biological process and cellular component). Without loss of generality, we refer to concepts as terms throughout this article. Terms are related to each other through is a and part of relationships, such that ci->cj denotes ci is a/part of cj. Note that, if ci->cj, then the molecules associated with ci are also associated with cj, known as the true path rule. Based on these relationships, we define a binary relation over C, denoted by ≥ . We say cj is an ancestor of ci, denoted by ci ≥ cj if and only if either ci->cj, or for some {ell}≥1, there exist cFormulalisinC for 1≤l≤{ell} such that ci->cFormula1, cFormulal->cFormulal+1 for 1≤l<{ell}, and cFormula{ell}->cj (cj is an ancestor of ci in GO hierarchy). Two terms ci, cj are comparable, denoted by ci~cj, if either cj ≥ ci or ci ≥ cj. If ci and cj are comparable, then the shortest path between ci and cj is given by L(ci, cj)=L(cj, ci)={ell}+1 for minimum such {ell}.

We denote the set of ancestors of a term ci by Ai={ckisinC|ci ≥ ck}. Note that, not all ancestors of a term are comparable, since the GO hierarchy is a directed acyclic graph, as opposed to a tree. We represent the root term of GO with a terminal concept r, such that ci ≥ r {forall} ciisinC.

2.2 Semantic similarity of terms
Semantic similarity measures are intended to quantify the similarity between two terms based on the underlying taxonomical relationships. For a semantic similarity measure {delta}: C2->R, we identify the following as properties that must be satisfied for the measure to be meaningful:

  1. {delta}(ci, cj)={delta}(cj, ci) for all ci, cj,
  2. {delta}(ci, ci)≤{delta}(cj, cj) for cj ≥ ci,
  3. {delta}(ci, cj)≤{delta}(cj, cj) for all ci, cj,
  4. if exist cmisinAi{cap}Aj such that cm ≥ cn, {forall} cnisinAk{cap}Al, then {delta}(ci, cj)≥{delta}(ck, cl).

The first property states that a semantic similarity measure must be symmetric. The second property states that more specific terms should have at least as much self-similarity as more general terms. The third property states that a term should not be less similar to itself than to any other term. The fourth property states that terms with more specific common ancestors should be more similar to each other, compared to those with less specific common ancestors. Note that if {delta} satisfies properties (iii) and (iv), then {delta}(r, r)≤{delta}(ci, cj) and {delta}(ci, r)={delta}(r,r) {forall} i,j.

We now discuss candidate measures for assessing the semantic similarity of two terms.

Distance: Let the depth of a term ci be d(ci)=L(ci, r) and the depth of the entire hierarchy be D=max1≤i≤N L(ci, r). For terms ci and cj that are not comparable, let L(ci, cj)=minFormulakisinAi{cap}AjL(ci, ck)+L(ck, cj). Then, the distance between two terms in the hierarchy is defined as {delta}E(ci, cj)=2DL(ci, cj). This measure satisfies all properties except (iv), i.e. it does not take into account the specificity of the common ancestor.

Information content: This measure takes into account the distribution of terms among molecules. Let Gc be the set of molecules that are associated with term c. Then, the information content of a term is defined as I(c)=–log2(|Gc|/|Gr|) (where Gr is the set of all molecules) (Resnik, 1995). Clearly, I(r)=0, and as a consequence of the true path rule, I(cj)≥I(ci) for cj ≥ ci. Then, the semantic similarity between two terms is defined as


Formula 1

(1)

Note that, {lambda}(ci,cj)=argmaxcisinAi{cap}Aj I(c) is said to be the minimum common ancestor of ci and cj.


Figure 1
View larger version (23K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. Sample ontology and associated annotations. Each node of the hierarchy represents a GO term, each set of terms represents a protein (or domain).

 
This measure satisfies all four properties described above, but does not take into account the specificity of terms with identical common ancestors, as illustrated in Figure 1. In the figure, we have {delta}I(c2, c3)={delta}I(c5, c6). Although c5 and c6 are more specific concepts farther apart from each other, their similarity is equal to that of their parents, c2 and c3. This problem can be alleviated by normalizing the similarity between two terms by the self-similarities of the terms being compared, e.g. {delta}L(ci, cj)=2{delta}I(ci, cj)/(I(ci)+I(cj)) (Lin, 1998), and {delta}JC(ci, cj)=1/(1–2{delta}I(ci, cj)+I(ci)+I(cj)) (Jiang and Conrath, 1997). It is evident in Figure 1 that these measures satisfy {delta}L(c2, c3)≥{delta}L(c5, c6) and {delta}JC(c2, c3)≥{delta}JC(c5, c6). We now generalize these term-similarity measures to set similarity.

2.3 Functional similarity of molecules
Since most molecules are associated with multiple molecular functions and sometimes involved in multiple processes, the annotation of a molecule consists of a set of terms. While assessing the similarity of term sets, we assume that each set consists of terms that are not comparable, i.e. each branch of the hierarchy is represented by at most one term in each set. In GO, this involves considering only the most specific annotations associated with a gene, which enables non-redundant representation of functional annotation. In this representation, the association between the gene and the ancestors of the most specific term is implied by the true path rule.

A set of terms S{subseteq}C is said to be non-redundant if {forall} ci, cjisinS, ci{nsim}cj. Note that, to satisfy non-redundancy requirement for any set, we define the trim of a term set S as {Upsilon}(S)={ciisinS: exist no cjisinS s.t. cj ≥ ci. By definition, {Upsilon}(S) is non-redundant for any S. Now we can define the semantic similarity measure for sets assuming that the sets are non-redundant, since any set of terms has a unique trim1. For two non-redundant sets Si, Sj{subseteq}C, we need a measure {rho}(Si, Sj) to assess their semantic similarity. We identify the following as properties of any such measure:

  1. {rho}(Si, Sj)={rho}(Sj, Si) for all Si, Sj,
  2. {rho}(Si, Sj)≤{rho}(Si{cup}ck, Sj{cup}ck) where {forall}clisinSi,{cup}Sj, cl{nsim}ck for all Si, Sj,
  3. {rho}(Si, Sj)≤{rho}(Si, Sj{cup}Sk) for all Si, Sj, Sk,
  4. {rho}(Si, Sj)≤{rho}(Sj, Sj) for all Si, Sj.

Property (ii) states that adding a common annotation for two molecules should not decrease the similarity between these two molecules. Property (iii) states that if new annotations are added for a new molecule, the similarity of this molecule to any other molecule should not decrease. This seemingly unintuitive property is motivated by the fact that existing annotations are quite incomplete. For this reason, we require semantic similarity measures to rely only on positive evidence, avoiding negative conclusions based on lack of annotations. Property (iv) states that a set of annotations should be at least as similar to itself as it is to any other set.

Common approaches to computing similarity between sets include, taking the average or maximum of the similarities between all pairs in the two sets. We first discuss the limitations of such straightforward approaches, and propose a generalization of Resnik's information content-based term similarity measure to sets of terms. This also relates to the statistical significance of the similarity between two term sets.

Average: Formula (Lord et al., 2003). The idea behind this measure is that the semantic similarity between any two pairs of annotations contributes to the functional similarity between two molecules, so that molecules with more similar functions are assigned a higher similarity score. By letting Si={a}, Sj={b}, and choosing ck such that 3{delta}(a,b)/4>{delta}(ck,ck)+{delta}(a,ck)+{delta}(b,ck), it can be shown that this measure does not satisfy property (ii). Furthermore, it satisfies property (iii) only if {rho}A(Si, Sj)={rho}A(Si, Sk). Similarly, letting Si={a, b}, Sj={c} and 2({delta}(a,c)+{delta}(b,c)–{delta}(a,b))>{delta}(a,a)+{delta}(b,b), it can be seen that this measure violates property (iv) as well.

Maximum: {rho}M(Si, Sj)=maxckisinSi, clisinSj{delta}(ck, cl) (Sevilla et al., 2005). This measure is based on the notion that if two molecules perform similar functions in at least one context, then they can be considered functionally similar. While this measure satisfies all properties, it satisfies (ii) weakly, i.e. {rho}M(Si, Sj)={rho}M(Si{cup}ck, Sj{cup}ck) unless there exists no cmisinSi and cnisinSj such that {delta}(cm, cn)≥{delta}(ck, ck).

Average of maximums: Average functional similarity between two proteins can be defined in terms of a compromise between these two measures (Schlicker et al., 2007), namely


Formula

This modification provides a more biologically sound formulation of average functional similarity between two molecules, since a function of one molecule may be considered to be shared by another molecule as long as the other molecule is associated with a sufficiently similar function. However, this measure also fails to satisfy properties (ii), (iii) and (iv).

Information content: Observing that the notion of minimum common ancestor can be extended to sets of terms, we propose a set-similarity measure that is defined on entire sets, as opposed to a composite of pairwise similarities. Let {Lambda}(Si, Sj)={cup}FormulakisinSi, clisinSj{lambda}(ck,cl) be the minimum common ancestor set of term sets Si and Sj. Here, {sqcup} denotes a generalized union operator that preserves non-redundancy, i.e. A {sqcup} B={Upsilon}(A{cup}B). We define the similarity between two term sets as the information content of the set of minimum common ancestors, i.e.


Formula 2

(2)
where G{Lambda}(Si, Sj)={cap}ckisin{Lambda}(Si, Sj)Gck is the set of molecules that are associated with all terms in the minimum common ancestor set of Si and Sj. Note that the above definition also generalizes the concept of information content from a single term to a set of terms.

Example.
Consider the ontology in Figure 1. The root term in this ontology is R. The annotation sets for five molecules are also shown in the figure. Consider the similarity between the two molecules with annotation sets S1 and S2. Since {lambda}(c4, c4)=c4, {lambda}(c6, c4)={lambda}(c7, c4)=R, and c4 ≥ R, we have {Lambda}(S1, S2)={c4}. Consequently, {rho}I(S1, S2)=–log2(|Gc4|/|GR|)=–log2(|S1, S2, S3, S5}|/|{S1, S2, S3, S4, S5})=log2(5/4). On the other hand, since {Lambda}(S1, S3)={c4, c6}, we have {rho}I(S1, S3)=log2(5/2)>{rho}I(S1, S2). Observe that {rho}M(S1, S2)={rho}M(S1, S3), illustrating that {rho}I is stronger than {rho}M in terms of property (ii).

THEOREM 1.
{rho}I satisfies all properties required for a measure of semantic similarity between two sets of terms.

PROOF.
  1. Trivially, {rho}I(Si, Sj)={rho}I(Sj, Si) for all Si, Sj.
  2. Since ck{nsim}cn for all cnisinSi{cup}Sj, we have {Lambda}(Si{cup}ck, Sj{cup}ck)={Lambda}(Si, Sj){sqcup}{Lambda}(Si{sqcup}Sj, ck){sqcup}cksup{Lambda}(Si, Sj){cup}ck, leading to G{Lambda}(Si{cup}ck, Sj{cup}ck){subseteq}G{Lambda}(Si, Sj) and |G{Lambda}(Si{cup}ck, Sj{cup}ck)|≤|G{Lambda}(Si, Sj)|. Consequently, {rho}I(Si{cup}ck, Sj{cup}ck)≥{rho}I(Si, Sj).
  3. {Lambda}(Si, Sj{cup}Sk)={Lambda}(Si, Sj){sqcup}{Lambda}(Si, Sk)sup{Lambda}(Si, Sj). Therefore, G{Lambda}(Si, Sj{cup}Sk){subseteq}G{Lambda}(Si, Sj), leading to {rho}I(Si, Sj{cup}Sk)≥{rho}I(Si, Sj).
  4. Clearly, ck ≥ {lambda}(ck, cl) for any ck, cl. Now consider any cmisin{Lambda}(Si, Sj). Since cm={lambda}(ck, cl) for some ckisinSi and clisinSj, there always exists cnisin{Lambda}(Si, Si) such that cn ≥ ck ≥ cm. Consequently, we must have G{Lambda}(Si, Si){subseteq}G{Lambda}(Si, Sj), leading to {rho}I(Si, Sj)≤{rho}I(Si, Si).

Note that, {rho}I also has the problem associated with Resnik's measure (Section 2.2) and that this problem can be alleviated through normalization by self similarities, e.g. Formula or Formula


    3 MATERIALS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 MATERIALS
 4 RESULTS
 5 CONCLUSION
 REFERENCES
 
In order to evaluate the suitability of PPIs and DDIs to different functional analyses, we obtain protein and domain interaction data for five well-studied eukaryotic species from public databases. These datasets contain physical protein–protein interactions, as well as structural and computationally inferred domain–domain interactions.

3.1 Protein–protein interactions
We obtain protein interaction data for five species, Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens, Saccharomyces cerevisiae, and Schizosaccharomyces pombe, from the BioGrid database (Breitkreutz et al., 2007). The networks are chosen to be largest among available networks in the database, with the expectation that larger networks are relatively more comprehensive. We filter the dataset to obtain a set of physical interactions between proteins, i.e. genetic interactions are removed based on experiment type (e.g. knockout experiments). The interaction data is binary, i.e. no confidence score is associated with the interactions. The numbers of proteins and interactions in each PPI network are shown in Table 1. Integr8 (Kersey et al., 2005) is used to map the proteins in the interaction dataset to their Uniprot names. The data is filtered to keep only those proteins for which Pfam domain decomposition is known using Integr8.


View this table:
[in this window]
[in a new window]

 
Table 1. Protein–protein interaction dataset

 
3.2 Domain interactions
We obtain domain interaction data from the DOMINE database (Raghavachari et al., 2007). This dataset is composed of known, as well as predicted domain interactions. Interactions inferred from PDB entries of protein complexes are collected from iPfam and 3did. Predicted interactions are obtained through computational methods, which infer domain interactions from protein interaction networks or co-evolution of conserved sites for details, see(Raghavachari et al., 2007). Based on the source and quality of the data, we partition this dataset into five classes:
  • Struct: Only known domain interactions (structure based).
  • HC+NA : High Confidence (HC) and Structure based (NA) interactions.
  • HC+MC : High Confidence (HC) and Medium Confidence (MC) interactions.
  • Comp-2: Interactions predicted by at least two computational approaches.
  • Comp-1: Interactions predicted by at least one computational approach.

The numbers of domains and interactions in each class are shown in Table 2. Note that domain–domain interactions here are binary, i.e. there is no confidence score associated with these interactions.


View this table:
[in this window]
[in a new window]

 
Table 2. Domain–domain interaction dataset

 
3.3 GO and annotations
Gene Ontology Annotation (GOA) (Camon et al., 2006) is used to obtain annotation information for Uniprot proteins. The mapping of Pfam-A domains to their GO functions is obtained from pfam2go (http://www.geneontology.org/external2go/pfam2go). We use only the biological process and molecular function sub-ontologies of GO for evaluation, since the coverage for the cellular component sub-ontology is relatively low.


    4 RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 MATERIALS
 4 RESULTS
 5 CONCLUSION
 REFERENCES
 
We first compare different semantic similarity measures on comprehensive PPI and DDI data. Then, using our proposed semantic similarity measure, we investigate the differences between PPI and DDI networks in terms of the relationship between network proximity and functional similarity.

4.1 Comparison of semantic similarity measures
For each network, we compute the distance between all pairs of molecules (proteins or domains) in the network. Then, we group molecule pairs according to their distance and compute the average semantic similarity for each group. Since the distribution and range of semantic similarity scores varies across different measures, we normalize semantic similarity scores to obtain a mean similarity score of zero and SD of one in each network. In other words, for each similarity measure {rho}x, the similarity score between two molecules Pi, PjisinP is computed as Formula , where P denotes the set of molecules in the network. Note also that this normalization is useful in comparing PPI and DDI networks as well, since the distribution of available annotations across proteins and domains can be significantly different. In general, since domain annotations are generally derived from protein annotations, domain annotations are relatively scarce and more general (higher in the GO hierarchy) compared to protein annotations.


Figure 2
View larger version (22K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. Comparison of different semantic similarity measures in terms of their behavior with respect to network distance: (a) network distance versus average semantic similarity for pairs of proteins in C.elegans PPI network, (b) distribution of semantic similarity scores for direct neighbors, indirect neighbors and other domain pairs in the Struct DDI network.

 
In Figure 2a, the behavior of different semantic similarity measures with respect to network distance in the C.elegans, PPI network is shown. We consider five measures, namely {rho}A/{delta}I (average of Resnik's term similarity measure), {rho}H/{delta}I (average of maximums for Resnik's term similarity measure) {rho}A/{delta}JC (average of self-normalized Resnik's term similarity measure), {rho}I (proposed information content based molecule similarity measure) and {rho}JC (proposed information content based molecule similarity measure with self-normalization). As evident in the figure, all semantic similarity measures demonstrate a negative relation between network distance and functional similarity. However, if average term similarity score is used to compare molecules, an anomaly is observed in that average semantic similarity tends to increase for pairs of proteins at larger distances (≥4). This behavior demonstrates the inadequacy of average-based measures in handling randomness. Observe that in a network, the number of protein pairs with given distance grows with increasing distance and goes down after a point, which is the behavior of the curve for {rho}A/{delta}I in Figure 2 in reverse direction. Consequently, the growth in average similarity with respect to network distance beyond a point can be explained by the decrease in the number of pairs with larger distance, which is likely to be an artifact of randomness. On the other hand, all other measures show a consistent decline in semantic similarity with respect to network distance, with saturation at distance ≥5. However, it is worth noting that the proposed information content based measure provides the sharpest decline in semantic similarity with increasing distance throughout, while it provides the sharpest decline for distance ≤3 when it is used with self-normalization.

In Figure 2b, a comparison of the distribution of semantic similarity scores for the average information content ({rho}A/{delta}I) and self-normalized information content ({rho}JC) measures is shown. In this figure, domain pairs are grouped according to their distances in the Struct DDI network, to obtain groups immediate neighbors (distance =1) indirect neighbors (distance =2) and other domain pairs (distance >2). In the figure, the cumulative distribution of similarity score is shown for each group, i.e. the vertical axis shows the fraction of domain pairs with similarity larger than the value on horizontal axis, where similarity scores are normalized to range from 0 to 1. Observe that, {rho}JC provides very large (>90%) similarity score for a much larger fraction (>60%) of neighboring domain pairs, as compared to {rho}A (<10%), while keeping fraction of highly similar domain pairs with distance >2 considerably low (<10%). In general, the curves for {rho}JC demonstrate a sharper decline for similarity ≤20% as compared to their counterparts for {rho}A and remain well above them, particularly for neighboring domains up to similarity >90%, illustrating that {rho}JC is more successful than {rho}A in reflecting the differences between (directly or indirectly) interacting and arbitrary pairs of domains, in terms of functional similarity.

4.2 Comparison of PPI and DDI networks


Figure 3
View larger version (20K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. Comparison of the relation between network proximity and semantic similarity with respect to molecular functions in PPI and DDI networks: (a) raw semantic similarity, (b) normalized semantic similarity with zero mean and unit SD in each network. For distance 5,6 the similarity values are very close to that for distance 4. For annotation, see Section 3.2.

 
Using the proposed semantic similarity measure with self-normalization ({rho}JC), we compare the relationship between network proximity and functional similarity, using PPI and DDI networks described in Section 3. We find that the following pairs of networks yield similar results: HC+MC and HC+NA DDI, C.elegans and D.melanogaster PPI, S.cerevisiae and H.sapiiens PPI. For clarity, we do not display results for HC+NA DDI, D.melanogaster and H.sapiiens PPI in Figure 3. The behavior of semantic similarity with respect to network distance is shown in Figure 3, for the molecular function sub-ontology of GO, i.e. semantic similarity here refers to the similarity between the molecular functions of a pair of proteins or domains. Since the same semantic similarity measure is used for each network, the semantic similarity scores are compatible across different networks. The behavior of these raw semantic similarity scores for different networks is shown in Figure 3a. Since the annotations of proteins and domains are largely incomplete, and the coverage of annotations may differ significantly across different networks, the distribution of semantic similarity scores can also vary significantly. For this reason, we normalize similarity scores using the procedure described in the previous section, to ensure that the similarity scores have zero mean and unit SD in each network. The behavior of normalized similarity scores for different networks is shown in Figure 3b.

As evident in the figure, immediate and indirect neighbors perform (more) similar molecular functions. Furthermore, the negative correlation between network distance and functional similarity is stronger in the Struct DDI network, as compared to all other networks. This network is followed by other relatively more reliable HC+MC DDI network (and HC+NA DDI, not shown here). These observations suggest that network proximity is likely to be more relevant to, hence indicative of, functional coherence and modularity. However, this conclusion is tempered by the observation that the DDI networks that are based on structural information are relatively more reliable than PPI networks, which may come from noisy high-throughput screening.

The figure also shows that in PPI networks of relatively well-studied organisms such as S.cerevisiae, and C.elegans, functional similarity between two proteins that are further apart in the network is larger, on average, than that in the DDI and other PPI networks. This observation suggests that functional similarity between two arbitrary proteins in model organisms is expected to be larger than the functional similarity between two arbitrary domains or proteins in other organisms. This may be because more functional information is available for model organisms. As seen in Figure 3b, network-based normalization alleviates this problem. Furthermore, after normalization, it becomes apparent that the relationship between functional similarity and network distance is stronger in computationally inferred DDI networks than that in PPI networks. Since computational inference of domain–domain interactions is generally based on protein–protein interactions, this observation provides further evidence that supports the notion that network proximity in DDI networks is likely to be a better indicator of functional modularity than PPI networks.


Figure 4
View larger version (20K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4. Comparison of the relation between network proximity and semantic similarity with respect to biological processes in PPI and DDI networks. (a) raw semantic similarity, (b) normalized semantic similarity with zero mean and unit SD in each network. For annotation, see Section 3.2.

 
The behavior of semantic similarity with respect to network distance for the biological process sub-ontology of GO is shown in Figure 4. Here, semantic similarity refers to the similarity between the biological processes that a pair of proteins individually take part in. The behavior of process similarity with respect to network distance is generally similar to that of functional similarity, however, there are differences worth noting. First, when the similarity scores are not normalized with respect to network, the process similarity for arbitrary pairs of proteins in model organisms appears to be lower, on average, than that for arbitrary pairs of domains. This is in contrast to the argument based on annotation coverage. However, even after normalization, PPI networks demonstrate weaker relationship between network proximity and process similarity, as compared to DDI networks. Yet, the gap that is observed for functional similarity closes when processes are considered, particularly for the S.pombe PPI network, which shows similar process similarity between neighbors compared to computationally inferred DDI networks. Furthermore, indirect neighbors in the S.pombe PPI network have highest average process similarity among all networks considered. This might be indicative of the difference between molecular functions and biological processes in terms of their relationship to functional similarity. In general, it is possible to speculate that molecular function is a lower level property of a molecule that is directly related to its structure, while biological processes are higher level constructs, related to the wider neighborhood in the network. For this reason, while our results suggest that domain–domain interactions may be more informative in terms of identification of function and functional modularity, it may be necessary to consider DDI networks along with PPI networks to extract information about process modularity.


    5 CONCLUSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 MATERIALS
 4 RESULTS
 5 CONCLUSION
 REFERENCES
 
We investigate metrics for quantifying functional similarity in PPIs and DDIs. We present essential attributes of admissible metrics for term- and set-similarity, show that existing commonly used measures are not admissible, and present an admissible metric. We establish that the proposed metric provides highly intuitive biological interpretations from comprehensive comparative analysis of PPIs and DDIs. In doing so, we conclusively establish the metric, as well as validate the role of DDIs in quantifying functional coherence.

Conflict of Interest: none declared.


    FOOTNOTES
 
1To see that {Upsilon}(S) is unique for S, recall that the underlying hierarchy of terms is represented by a directed acyclic graph. Consequently, its transitive closure is also an acyclic graph, in which an edge represents ancestral relationship between two terms. Observe that the trim of a term set is equivalent to the set of nodes with no incoming arcs in the subgraph induced by the term set on this transitive closure, therefore it is uniquely defined. Back


    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 MATERIALS
 4 RESULTS
 5 CONCLUSION
 REFERENCES
 

    Ashburner M, et al. Gene Ontology: Tool for the unification of biology. The Gene Ontology consortium. Nat. Genet (2000) 25:25–29.[CrossRef][Web of Science][Medline]

    Bateman A, et al. The Pfam protein families database. Nucleic Acids Res (2004) 32:D138–D141.[Abstract/Free Full Text]

    Breitkreutz B, et al. The BioGRID interaction database: 2008 update. Nucleic Acids Res (2007) 36((Database issue)):D637–D640.[CrossRef][Web of Science][Medline]

    Camon E, et al. The Gene Ontology Annotation (GOA) Database: Sharing Biological Knowledge with GO. In: Silico Genomics and Proteomics: Functional Annotation of Genomes and Proteins.—Mulder N, Apweiler R, eds. (2006) Nova Science Publishers. 37–54.

    Deng M, et al. Inferring domain–domain interactions from protein–protein interactions. Genome Res (2002) 12:1540–1548.[Abstract/Free Full Text]

    Gong S, et al. A protein domain interaction interface database: interpare. BMC Bioinformatics (2005) 6:207.[CrossRef][Medline]

    Grossmann S, et al. An improved statistic for detecting over-represented gene ontology annotations in gene sets. (2006) In Proccedings of 10th Annual Conference on Research in Computational Molecular Biology’06. Venice, Italy: Springer. 85–98.

    Han J-DJ, et al. Evidence for dynamically organized modularity in the yeast protein interaction network. Nature (2004) 430:88–93.[CrossRef][Web of Science][Medline]

    Jiang JJ, Conrath DW. Semantic similarity based on corpus statistics and lexical taxonomy. (1997) In 10th International Conference on Research on Computational Linguistics: Taipei, Taiwan. 19–33.

    Kersey P, et al. Integr8 and genome reviews: integrated views of complete genomes and proteomes. Nucleic Acids Res (2005) 33:297–302.[CrossRef]

    Koyutürk M, et al. Detecting conserved interaction patterns in biological networks. J. Comput. Biol (2006) 13:1299–1322.[CrossRef][Web of Science][Medline]

    Lee I, et al. A probabilistic functional network of yeast genes. Science (2004) 306:1555–1558.[Abstract/Free Full Text]

    Lin D. An information-theoretic definition of similarity. (1998) In Proceedings of the 15th International Conference on Machine Learning. San Francisco, CA: Morgan Kaufmann. 296–304.

    Lord P, et al. Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics (2003) 19:1275–1283.[Abstract/Free Full Text]

    Ng S, et al. InterDom: a database of putative interacting protein domains for validating predicted protein interactions and complexes. Nucleic Acids Res (2003) 31:251–254.[Abstract/Free Full Text]

    Raghavachari B, et al. DOMINE: a database of protein domain interactions. Nucleic Acids Res (2007) 36(Database issue):D656–D661.[CrossRef][Web of Science][Medline]

    Resnik P. Using information content to evaluate semantic similarity in a taxonomy. (1995) In Proceedings of the 14th International Joint Conference on Articial Intelligence: Montreal, Canada. 448–453.

    Riley R, et al. Inferring protein domain interactions from databases of interacting proteins. Genome Biol (2005) 6:R89.[CrossRef][Medline]

    Schlicker A, et al. Functional evaluation of domain–domain interactions and human protein interaction networks. Bioinformatics (2007) 23:859–865.[Abstract/Free Full Text]

    Schug J, et al. Predicting gene ontology functions from ProDom and CDD protein domains. Genome Res (2002) 12:648–655.[Abstract/Free Full Text]

    Sevilla J, et al. Correlation between gene expression and GO semantic similarity. IEEE/ACM Trans. Comput. Biol. Bioinform (2005) 2:330–338.[CrossRef]

    Sharan R, et al. Network-based prediction of protein function. Mol. Syst. Biol (2007) 3:88.[Medline]

    Spirin V, Mirny LA. Protein complexes and functional modules in molecular networks. Proc. Natl Acad. Sci. USA (2003) 100:12123–12128.[Abstract/Free Full Text]

    Titz B, et al. What do we learn from high-throughput protein interaction data? Expert Rev. Proteomics (2004) 1:111–121.[CrossRef][Web of Science][Medline]

    Wuchty S. Topology and weights in a protein domain interaction network–a novel way to predict protein interactions. BMC Genomics (2006) 7:122.[CrossRef][Medline]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Pandey, J.
Right arrow Articles by Grama, A.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Pandey, J.
Right arrow Articles by Grama, A.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?