Skip Navigation

Bioinformatics 2007 23(13):i377-i386; doi:10.1093/bioinformatics/btm203
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Google Scholar
Right arrow Articles by Pandey, J.
Right arrow Articles by Grama, A.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Pandey, J.
Right arrow Articles by Grama, A.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Functional annotation of regulatory pathways

Jayesh Pandey 1,*, Mehmet Koyutürk 1, Yohan Kim 2, Wojciech Szpankowski 1, Shankar Subramaniam 2 and Ananth Grama 1

1Department of Computer Science, Purdue University2Department of Chemistry and Biochemistry, University of California, San Diego, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 BACKGROUND AND MOTIVATION
 3 METHODS
 4 RESULTS AND DISCUSSION
 5 CONCLUDING REMARKS
 REFERENCES
 

Motivation: Standardized annotations of biomolecules in interaction networks (e.g. Gene Ontology) provide comprehensive understanding of the function of individual molecules. Extending such annotations to pathways is a critical component of functional characterization of cellular signaling at the systems level.

Results: We propose a framework for projecting gene regulatory networks onto the space of functional attributes using multigraph models, with the objective of deriving statistically significant pathway annotations. We first demonstrate that annotations of pairwise interactions do not generalize to indirect relationships between processes. Motivated by this result, we formalize the problem of identifying statistically overrepresented pathways of functional attributes. We establish the hardness of this problem by demonstrating the non-monotonicity of common statistical significance measures. We propose a statistical model that emphasizes the modularity of a pathway, evaluating its significance based on the coupling of its building blocks. We complement the statistical model by an efficient algorithm and software, NARADA, for computing significant pathways in large regulatory networks. Comprehensive results from our methods applied to the Escherichia coli transcription network demonstrate that our approach is effective in identifying known, as well as novel biological pathway annotations.

Availability: NARADA is implemented in Java and is available at http://www.cs.purdue.edu/homes/jpandey/narada/

Contact: jpandey{at}cs.purdue.edu


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 BACKGROUND AND MOTIVATION
 3 METHODS
 4 RESULTS AND DISCUSSION
 5 CONCLUDING REMARKS
 REFERENCES
 
Gene regulatory networks represent powerful formalisms for modeling cell signaling. These networks are inferred from gene expression, as well as other sources of data, using various statistical and computational methods (Friedman et al., 2000; Husmeier, 2003). Recent studies on networks of specific organisms show that interactions between genes that take part in specific pairs of biological processes are significantly overrepresented (Lee et al., 2002; Tong et al., 2004). Generalizing such observations to pathways of arbitrary length may allow identification of standardized pathways, enabling creation of reference databases of direct and indirect interactions between various processes. Knowledge of such pathways is useful, not only in general understanding of the relationship between cellular processes at the systems level, but also in projecting existing knowledge of cellular organization of model organisms to other species. Increasing availability of species-specific interaction data, coupled with attempts aimed at creating standardized dictionaries of functional annotation for biomolecules provide the knowledge base that can be effectively used for this purpose. What is lacking is a comprehensive set of tools that combine these two sources of data to identify significantly overrepresented patterns of interaction through reliable statistical modeling with a formal computational basis.

In this article, we introduce the notion of functional network characterization, derived from a gene regulatory network and associated functional annotations of genes. We use the Gene Ontology (GO) (Ashburner et al., 2000) for annotations, however, our methods themselves generalize to other networks and annotations. Functional network characterization is based on the abstract notion of regulatory interactions between pairs of functional attributes (as opposed to genes). In this context, we demonstrate that methods for identifying significant pairwise annotations do not generalize to pathway annotations. We introduce the problem of identifying statistically overrepresented pathways of functional attributes, targeted at the identification of chains of regulatory interactions between functional attributes. We study the hardness of this problem, focusing on the non-monotonicity of commonly used statistical significance measures. We show that the problem is hard along two dimensions: (i) the pathway space of the functional attribute network, and (ii) the space of functional resolution specified by GO hierarchy. Emphasizing the modularity of a pathway to assess its significance, we propose a statistical model that focuses on the coupling of the building blocks of a pathway. We use this statistical model to derive efficient algorithms for solving the pathway annotation problem. Our methods are implemented in a web-based tool, NARADA which provides an intuitive user and data interface.

Comprehensive evaluation of NARADA on an Escherichia coli transcription network from RegulonDB (Salgado et al., 2006) shows that our method identifies several known, as well as novel pathways, at near-interactive query rates. Note that the current knowledge of regulatory networks is incomplete, and limited to a few model organisms. Therefore, the application of our method on currently available data does not provide a comprehensive library of regulatory network annotation. On the other hand, the partial annotation provided by our method forms a useful basis for extending our knowledge of regulatory networks beyond well-studied processes and model organisms.


    2 BACKGROUND AND MOTIVATION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 BACKGROUND AND MOTIVATION
 3 METHODS
 4 RESULTS AND DISCUSSION
 5 CONCLUDING REMARKS
 REFERENCES
 
2.1 Results from previous studies
Lee et al. (2002) study the Saccharomyces cerevisiae transcription regulation network with a view to understanding relationships between functional categories. They observe that transcriptional regulators within a functional category commonly bind to genes encoding regulators within the same category (e.g. cell cycle, metabolism, environmental response). They also report that many transcriptional regulators within a functional category bind to transcriptional regulators that play key roles in the control of other cellular processes. For example, cell cycle activators are observed to bind to genes that are responsible for regulation of metabolism, environmental response, development and protein biosynthesis. Tong et al. (2004) identify putative genetic interactions in yeast via synthetic genetic array (SGA) analysis and investigate the functional relevance of their results in the context of GO annotations. They construct a network of GO terms by inserting an edge between any pair of terms that are bridged by a significant number of interacting gene pairs. Here, two GO terms are said to be bridged by an interaction if one of the interacting genes is associated with one of the terms, and the other gene with the second term, but neither is associated with both terms. They show that the resulting network is clustered according to underlying biological processes, while some biological processes buffer one another. For example, microtubule-based functions buffer both actin-based and DNA synthesis or repair functions, suggesting coordination of these functions through interaction of various genes.

2.2 Approach
Establishing functional relationships from gene interactions is essential to understanding functional organization of a cell. Current investigations are limited to case-specific studies that generally focus on validation or evaluation of results through simple statistical analyses — yet they provide significant insights (Gamalielsson et al., 2006; Lee et al., 2002; Tong et al., 2004). Computational tools that are based on sophisticated abstractions and customized statistical models are likely to yield novel insights. The basic approach for integrating existing knowledge of gene networks and functional annotations is to project the network in the gene space onto the functional attribute space through mapping of genes to attributes as specified by the annotation. A simple method for achieving this annotates each gene with its function and identifies overrepresented interacting annotations. This method yields interesting insights, as illustrated by Tong et al. (2004) in the context of synthetic genetic arrays. This model, however, does not generalize beyond pairwise interactions since each interaction between a pair of functional attributes is within a specific context (a different pair of genes) in the network, as illustrated by the following example.

2.3 Motivating example
Two regulatory pathways are shown in Figure 1 — each node is identified by its corresponding gene (gi) and tagged by the functional attribute (Tj) associated with the gene. In Figure 1a, genes g1, g2 and g3 indirectly regulate genes g5, g6 and g7 through gene g4. In Figure 1b, the network is isolated and there is no indirect regulation. Now assume the network of functional attributes derived from the simple method described above, separately for each gene network. For both networks, since all genes associated with functional attribute T1 regulate a gene with T2, one may conclude that T1 regulating T2 is significant. A similar conclusion follows for the regulatory effect of T2 on T3. If only pairwise interactions are considered, we derive the same network of functional attributes from both genetic networks (Figure 1c). This network clearly suggests that functional attribute T1 indirectly regulates T3 through T2. This is indeed a correct observation for the network in Figure 1a. However, this is not true for the network in Figure 1b.


Figure 1
View larger version (11K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. Pairwise assessment of regulatory interactions between functional attributes may result in identification of non-existent patterns. Two regulatory networks are shown in (a) and (b). The nodes are labeled by corresponding genes and each gene is tagged with a set of functional attributes. The network of functional attributes resulting from both networks, considering only pairwise interactions, is shown in (c).

 
To address this problem, we develop a formal framework for projecting a gene network on to a network of functional attributes, using multigraph models that accurately capture the context in which an interaction occurs. Through this framework, we generalize pairwise interactions between functional attributes to the identification of regulatory pathways of functional attributes.


    3 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 BACKGROUND AND MOTIVATION
 3 METHODS
 4 RESULTS AND DISCUSSION
 5 CONCLUDING REMARKS
 REFERENCES
 
We now describe the biological, statistical, and computational formalisms that underlie our methods.

3.1 Formal model for functional attribute networks
A gene regulatory network is modeled by a labeled directed graph Formula . In this network, nodes Formula represent genes. Directed edge Formula , where Formula , represents a regulatory interaction between genes gi and gj. Formula specifies a labeling of edges that represents the mode of regulation: activation ( + ), repression (–) or dual regulation (±). A sample gene regulatory network is shown in Figure 2. In our discussion, for the sake of simplicity, we omit the mode of regulation and treat all interactions as activator interactions, whenever appropriate.


Figure 2
View larger version (11K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. (a) A sample gene regulatory network and the functional annotation of the genes in this network. Each node represents a unique gene and is tagged by the set of functional attributes attached to that gene. Activator interactions are shown by regular arrows, repressor interactions are shown by dashed arrows. (b) Functional attribute network derived from the gene regulatory network in (a). In this multigraph, nodes (functional attributes) are represented by squares and ports (genes) are represented by dark circles.

 
Each gene in the network is associated with a set of functional attributes. These attributes describe a functional annotation of the gene, i.e. they map an individual biological entity to known functional classes.

DEFINITION 1. Functional Annotation.
Given a set of genes VG and a set of functional attributes VF, let Formula and Formula denote the power set of VG and VF, respectively. Then, functional annotation Formula defines mapping Formula and Formula , such that Formula if and only if Formula , for any Formula and Formula . The frequency of Tj, Formula , is equal to the number of genes that are mapped to Tj.

In Figure 2a, each gene gi is tagged with the functional attributes in Formula . For each Tj, Formula is composed of the genes tagged by Tj. We use Gene Ontology (GO) (Ashburner et al., 2000) as a reference library for annotating genes. For each gene, GO specifies the molecular functions associated with it, biological processes it takes part in, and cellular components it may be part of. The functional attributes in GO, known as GO terms, are organized hierarchically through is a and part of relationships. For example, ‘regulation of stereoid biosynthetic process’ is a ‘regulation of stereoid metabolic process’ and is part of ‘stereoid biosynthetic process'. This hierarchy is abstracted using a directed acyclic graph (DAG). In this representation, if Ti is a, or part of Tj, then Formula , i.e. the genes associated with Ti form a subset of genes associated with Tj. In this case, Tj is said to be a parent of Ti. A term may have more than one parent, i.e. Formula and Formula does not imply Formula . Furthermore, there is a unique Formula with no parent, called root, such that Formula . In the rest of this section, we use a network of functional attributes with no constraints (e.g. GO hierarchy) on function Formula . We discuss the issue specifically relating to the GO hierarchy when addressing the implementation of NARADA.

We model networks of functional attributes using multigraphs. A multigraph is a generalized graph, where multiple edges are allowed between a single pair of nodes.

DEFINITION 2. Functional Attribute Network.
Given gene regulatory network Formula , a set of functional attributes VF, and functional annotation Formula , the corresponding functional attribute network Formula is a multigraph defined as follows. The set of functional attributes VF is also the set of nodes in F. Each node Formula contains a set of ports corresponding to the set of genes associated with Ti, i.e. Formula . Each multiedge TiTj is a set of ordered port pairs (edges) Formula , such that Formula , Formula and Formula .

The functional attribute network corresponding to the gene regulatory network in Figure 2a is shown in Figure 2b. This multigraph model captures the context of each interaction accurately through the concept of ports. As illustrated in Figure 1, if a simple graph model is used, paths that do not exist in the gene network emerge in the functional attribute network. This is not possible in the multigraph model, since a path must leave a node from the port at which it enters the node.

DEFINITION3. Path.
In functional attribute network Formula , a path Formula is an ordered set of node-port pairs such that (i) Formula for Formula (nodes are not repeated), (ii) Formula for Formula and (iii) Formula for Formula (consecutive edges are connected through the same port). The length of {pi} is Formula .

In Figure 2b, Formula is a path but Formula is not, since multiedge Formula does not contain the edge Formula . Note that allowing Formula and Formula , we may also include cycles in this definition. According to the above definition, paths are characterized by ports. While analyzing regulatory pathways of functional attributes, however, we are interested in paths that are characterized by nodes in the functional attribute network. Clearly, such pathways may correspond to multiple paths in the functional attribute network. Therefore, we model them using multipaths.

DEFINITION 4. Multipath.
In functional attribute network Formula , a multipath Formula is an ordered set of nodes such that (i) Formula for Formula , and (ii) there exist Formula for Formula , such that Formula is a path. The occurrence set Formula of {Pi} consists of all distinct paths that satisfy (ii) and each such path is called an occurrence of {Pi}. The frequency of {Pi}, Formula , is equal to the number of occurrences of {Pi}.

We use the terms pathway and multipath interchangeably, to emphasize the biological meaning of a multipath. Allowing Formula , we also extend this definition to multicycles, occurrences of which correspond to cycles in the gene network. In Figure 2b, Formula (also denoted Formula throughout this article) is a multipath with frequency four. On the other hand, multipath Formula does not exist in this network, i.e. it has frequency zero, although multiedges T2 {dashv} T4 and T4 -> T3 both exist. Note that the distinction between activator and inhibitor interactions is emphasized in this example for illustrative purposes, while it is omitted in the definition for simplicity. A multipath with high frequency is likely to be biologically interesting, since it corresponds to a regulatory pathway of functional attributes that recurs in various contexts in the underlying cellular organization. In order to quantify this biological significance, it is useful to evaluate frequency from a statistical perspective.

3.2 Hardness of significant pathway identification
Raw counts have long been used as a measure of significance—primarily because of the resulting algorithmic simplicity. This is a direct consequence of its monotonicity properties, namely that a subgraph (or substring/subset) of a frequent graph (or string/set) is itself frequent (Koyutürk et al., 2006b). In identification of significantly overrepresented pathways of functional attributes, frequency alone does not provide a good measure of statistical significance. This is because, the degree distribution of gene regulatory networks and the distribution of the frequency of functional attributes are both highly skewed. Consequently, paths including functional attributes that are associated with high-degree genes (e.g. molecular functions related to transcription) and those associated with many genes (e.g. GO terms that are at coarser levels of GO hierarchy) are likely to dominate. For this reason, a statistical measure that takes into account these distributions is needed.

Monotonicity of common statistical significance measures. We identify the basic properties of a useful measure of statistical significance.

PROPOSITION 1. Statistical Interpretability.
Consider a set Formula of binary random variables and the set of corresponding observations Formula , where Formula for Formula corresponds to an observation supporting a hypothesis. Let Formula be a real-valued function, used to assess the statistical significance of the collection of observations defined by Formula . Let Formula and Formula be disjoint binary random variable sets, i.e. Formula , and let Formula and Formula be the respective observation sets. A function f is statistically interpretable if it satisfies the following conditions:
  1. If y = 0 {forall} Formula , then Formula ,
  2. If y = 1 {forall} Formula then Formula .

Here, without loss of generality, Formula implies that Formula is a more interesting observation than Formula . More generally, the binary random variables characterize a pattern, and a larger set of these variables corresponds to a larger (or more general) pattern. This property simply states that additional positive (negative) observations should increase (decrease) our confidence that a pattern is interesting.

Most significance measures used in the analysis of discrete biological data are statistically interpretable. Consider, for example, the identification of significantly enriched GO terms in a set of genes. For a given term, the binary variables (X), one for each gene (Formula ), indicate whether the gene is associated with the term (X = 1). Adding a new gene (Formula ) to this set will improve the significance of enrichment (Formula ) if the new gene is associated with the term (Y = 1). If not (Y = 0), the enrichment of the term in the new set will be less significant (Formula ). Indeed, existing methods and statistical measures for this problem demonstrate this property (Grossman et al., 2006; Hsiao et al., 2005).

Now we show that, in contrast to approximations that do not take into account the size of the sample space (e.g. frequency), statistically interpretable measures of significance do not possess monotonicity.

THEOREM 1.
Let f be a monotonically non-decreasing (nonincreasing) function, i.e. for any Formula and Formula , Formula (Formula ). Then f is not statistically interpretable.

PROOF.
Without loss of generality, assume f is non-decreasing. Let Y be a set of binary random variables, and y be a set of corresponding observations, such that Formula , Formula . Since f is monotonically non-decreasing, we have Formula . This contradicts condition (ii) in Proposition 1.        {square}

3.2.1 Monotonicity with respect to GO hierarchy
We now show that this result directly applies to the monotonicity of useful significance measures with respect to the GO hierarchy. Consider an ordered set of GO terms Formula . For any ordered set Formula such that Formula for Formula , define a binary random variable indicating the existence of the corresponding path in the underlying regulatory network. Clearly, the frequency of multipath Formula is equal to the sum of the realizations of these random variables. Let X be the set of these random variables. Now, without loss of generality, consider pathway Formula , such that TP is a parent of Formula , i.e. Formula . Then, for each gene Formula , there are multiple additional random variables, each for one of Formula . Let Y be the set of these random variables. In this setting, the definition of statistical interpretability directly applies. If all paths of the sort Formula exist in the underlying regulatory network, then the pathway Formula is more significant than Formula . If none of them exist, then the pathway containing the child is more significant. Applying Theorem 1, we conclude that a statistically interpretable function, that quantifies the significance of the frequency of a multipath in the functional attribute network, cannot be monotonic with respect to GO hierarchy.

The example in Figure 3 illustrates this point. Here, both Formula and Formula are parents of T1. Since all genes that are not in T1 but in Formula regulate T3, the regulatory effect of Formula on T3 is more significant than that of T1. Since none of the genes absent in T1 but present in Formula regulate T3, the regulatory effect of Formula on T3 is less significant than that of T1. Thus, any statistically interpretable measure f should satisfy Formula , which violates monotonicity. Note also that frequency, which is monotonically non-decreasing with respect to height (proximity to root) in GO hierarchy, is not statistically interpretable as Formula .


Figure 3
View larger version (5K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. Example illustrating that an interpretable measure of statistical significance is not monotonic with respect to GO hierarchy. GO terms Figure 3 and Figure 3 are parents of T1. The regulatory effect of Figure 3 on T2 is more significant than that of T1, but the regulatory effect of Figure 3 is less significant.

 
This result can be interpreted as follows. GO hierarchy defines a combinatorial space of resolution for pathways of functional attributes. In other words, a pathway may be generalized or specialized by replacing a node (GO term) in the pathway with one of its ancestors or descendants in the GO DAG. Since this can be done for each node in the pathway, the size of this space is exponential in pathway length. However, as demonstrated above, the significance of a pathway fluctuates in this space. Consequently, all significant pathways cannot be efficiently identified using traditional inductive techniques, by starting from the highest (lowest) resolution in GO hierarchy and pruning out coarser (finer) terms in chunks.

Alternate approaches to this problem are necessary, not only in the context of significant pathway identification, but also other combinatorial problems in systems biology that involve hierarchical annotations. One possible approach is to develop a measure of statistical significance that admits a tight bound on the significance of a pathway in terms of the frequencies of pathways that are at a higher (lower) GO resolution. The discussion above clearly demonstrates that it is not straightforward to do so. Indeed, the statistical model we introduce in the next section does not easily lead to such tight bounds, since it emphasizes the modularity of a pathway to assess its significance. Consequently, in our implementation of NARADA, we use the most specific GO terms as the default resolution. Development of measures and methods that effectively prune out parts of the GO space remains an open problem.

3.2.2 Monotonicity with respect to pathway length
We apply Theorem 1 to the multipath space of a functional attribute network, i.e. to the relationship between a multipath and its subpaths. As before, a multipath is represented by a set of binary random variables, each corresponding to one of its potential occurrences. Without loss of generality, consider multipaths Formula and Formula . The random variables that represent {Pi}k do not form a superset of those that represent Formula . Rather, they are extensions of them, as defined below:

DEFINITION 5. Extension.
Given a set Formula , an extension Formula of Formula , denoted Formula , is defined as follows. Each Formula , is attached to a subset Formula . Each Formula is attached to exactly one Formula , i.e. for any Formula , Formula .

Each potential occurrence of {Pi}k is a superpath of exactly one potential occurrence of Formula and there may be multiple such occurrences of {Pi}k that correspond to a particular occurrence of Formula . Therefore, the set of random variables that represent {Pi}k form an extension of the set of random variables that represent Formula .

PROPOSITION 2. Statistical Interpretability w.r.t. Extension.
Consider Formula , Formula , and Formula as defined in Proposition 1. Let Formula and let Formula be the respective observation set. A function f is statistically interpretable with respect to extension if it satisfies the following conditions:
  1. If for all Formula such that x = 1, z = 0 {forall} Formula , then Formula ,
  2. If for all Formula such that x = 1, z = 1 {forall} Formula , then Formula .

Each x = 1 corresponds to an occurrence of the corresponding pathway. Consequently, statistical interpretability with respect to extension of a pathway requires the following. If for all occurrences of Formula , all corresponding potential occurrences of {Pi}k exist in the network, then {Pi}k is statistically more interesting than Formula . If none of them occurs, then Formula is more interesting.

COROLLARY 1.
Let f be a monotonically non-decreasing (non-increasing) function with respect to extension, i.e. for any Formula and Formula , Formula (Formula ). Then f is not statistically interpretable with respect to extension.

The example shown in Figure 1 illustrates this result. In both of the scenarios shown in Figure 1a and b, Formula . In (a), Formula , i.e. condition (i) in Definition 2 (all potential occurrences of Formula , given the occurrences of T1 -> T2, exist in the network), hence the pathway Formula is more interesting than both T1 -> T2 and T2 -> T3. In (b), on the other hand, Formula (condition (ii) holds), so both T1 -> T2 and T2 -> T3 are more interesting than Formula . This discussion motivates the statistical model we present in the next section.

3.3 Statistical model for pathways of functional attributes
We present a novel statistical model for assessing the significance of the frequency of a multipath in a functional attribute network. In this approach, the ‘interestingness’ of a pathway is associated with its modularity, i.e. the significance of the coupling of its building blocks. In statistical terms, this is achieved by conditioning the distribution of the frequency (modeled as a random variable) of a pathway on the frequency of its subpaths (modeled as fixed parameters).

3.3.1 Motivating example
We illustrate the notion of the significance of coupling between regulatory interactions using the regulatory network and its corresponding functional attribute network shown in Figure 2. In this example, Formula Formula , i.e regulatory interactions T1 -> T2, T2 {dashv} T3, and T2 -> T4 occur twice. Furthermore, regulatory pathway (multipath in the functional attribute network) Formula occurs four times, i.e. Formula . Observe that, given the frequencies of T1 -> T2 and T2 {dashv} T3, this is the maximum value Formula can take. In other words, any gene with annotation T2, which is up-regulated by a T1 gene, always down-regulates a T3 gene. This observation suggests that, T1 plays an indirect, but important role in the regulation of T3. On the contrary, Formula , since gene g4 with annotation T2 up-regulates a T4-gene (g6), but it is not regulated by a T1-gene. These observations suggest that the coupling between regulatory interactions T1 -> T2 and T2 {dashv} T3 is stronger than the coupling between T1 -> T2 and T2 -> T4. In other words, the pathway Formula is more likely to be modular, compared to Formula .

We develop a statistical model that evaluates the modularity of regulatory pathways based on the coupling between their building blocks. For each pathway, our model assumes that the frequency of the building blocks of a pathway are known, i.e. constitute the background distribution. We quantify the statistical significance of a pathway with the conditional probability of its frequency based on this background.

3.3.2 Baseline model
To quantify the significance of a pathway of shortest length (i.e. a single regulatory interaction), we rely on a reference model that generates a functional attribute network. This model takes into account (i) the degree distribution of the underlying gene network, as well as (ii) the distribution of the number of genes associated with each functional attribute, based on the independent edge generation paradigm commonly used in modeling networks with arbitrary degree distribution (Chung et al., 2003; Itzkovitz et al., 2003). Note that this model is better suited to multigraphs than simple graphs (King, 2004). We refer to this model as the baseline model, and denote it Formula .

The baseline model is defined by a set of parameters, and specifies the expected multidegree of each node in the functional attribute network. Here, the multidegree of a node in a multigraph refers to the number of multiedges incident to that node. Given gene regulatory network Formula , functional attribute set VF, and annotation Formula , the expected in-degree Formula and out-degree Formula of a functional attribute Formula are estimated as follows:


Formula 1

(1)
where we denote the estimate of a parameter x by Formula . Note also that, if f is a function of x, we use fi to denote f(xi) whenever appropriate. Given these parameters, Formula generates a functional attribute network as follows: there is a pool of potential edges that contains Formula potential edges between each pair of functional attributes Ti and Tj. The size of the pool is given by: Formula A total of n edges are drawn from this pool, independently and without replacement, where n is equal to the number of edges in the observed functional attribute network, i.e. Formula Let Formula and Formula denote the random variables that correspond to the in and out degrees of Ti in the generated network. Then, we have


Formula 2

(2)
and similarly Formula . In other words, the expected values of multidegrees in the generated network mirror the specifications.

3.3.3 Significance of a regulatory interaction
Let Formula denote the random variable representing the frequency of pathway {Pi} in the generated functional attribute network. Clearly, Formula is a hypergeometric random variable with parameters m (number of items), Formula (number of good items), n (number of selected items) and Formula (number of selected good items) (Feller 1968). Hence, the P-value of a regulatory interaction TiTj in the observed network, i.e. the probability of observing at least Formula interactions between genes associated with Ti and genes associated with Tj, is given by


Formula 3

(3)

3.3.4 Significance of a pathway
We now present a statistical model to assess the statistical significance of a pathway of functional attributes, which assumes a background distribution based on the occurrence of the building blocks of a pathway. Let Formula denote the path Formula . For Formula , we want to evaluate the significance of the coupling between pathways Formula and Formula . In other words, we want to understand how strong a conclusion of the sort ‘If a gene Formula is regulated through a chain of regulatory interactions characterized by Formula , then this gene is likely to regulate a Formula gene through pathway Formula (or vice versa) can be.

To achieve this, we assume a reference model, in which the frequency of pathways Formula and Formula is established a priori. Let Formula and Formula denote Formula and Formula , respectively. Then, the P-value of the coupling between Formula and Formula is defined as follows:


Formula 4

(4)

Our model for the distribution of Formula , given Formula and Formula , is illustrated in Figure 4. Assume that a pool contains all possible occurrences of multipaths Formula and Formula . Clearly, there are Formula and Formula potential occurrences of each multipath. This is shown in Figure 4a. Now consider a pair of paths, one corresponding to a potential occurrence of Formula , the other to Formula . Such a pair corresponds to a path, i.e. an occurrence of Formula , only if the second path originates in the port in which the first one terminates. This is illustrated in Figure 4b and c. Since there are Formula and Formula occurrences of Formula and Formula , respectively, the problem is formulated as follows: we draw Formula paths from Formula potential occurrences of Formula and Formula paths from Formula potential occurrences of Formula , forming Formula pairs. What is the probability that in at least Formula of these pairs, the port on Tj will be common?


Figure 4
View larger version (16K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4. Model testing whether the frequency of path Figure 4, given Figure 4 and Figure 4, is significant. (a) Pool of possible Figure 4 and Figure 4 edges. There are Figure 4 and Figure 4 possible Figure 4 and Figure 4 edges, respectively. (b) A possible pair of edges that corresponds to a path. (c) A possible pair of edges that does not correspond to a path. (d) Figure 4 Figure 4 edges and Figure 4 Figure 4 edges are randomly selected from the pool. (e) A possible configuration of selected edges. In this case, Figure 4.

 
We approximate this probability using our result on the behavior of dense subgraphs (Koyutürk et al., 2006a) and Chvátal's bound on hypergeometric tail (Chvátal, 1979). In order to apply these results, we resolve dependencies assuming that the selected path pairs are independent from each other. Then, letting Formula be the probability that a given path pair will go through the same gene and Formula be the fraction of observed paths among all existing pairs, we obtain the following bound:


Formula 5

(5)
where Formula denotes weighted entropy. This estimate is Bonferroni-corrected for multiple testing, i.e. it is adjusted by a factor of Formula .

3.4 NARADA: a software for identification of significant regulatory pathways
Based on the above statistical model, we develop algorithms and a comprehensive software tool, NARADA, for projecting gene regulatory networks on the functional attribute domain.

The input to NARADA consists of three files: (i) a gene regulatory network, in which the source gene, target gene and the mode of interaction are specified for each regulatory interaction, (ii) specification of the functional attributes and their relations (e.g. Gene Ontology Formula file) and (iii) annotation file that specifies the mapping between genes and functional attributes. NARADA currently handles three types of queries:

  • Formula : Given a functional attribute T, find all significant pathways that are regulated by (originate from) genes that are associated with T.
  • Formula : Given a functional attribute T, find all significant pathways that regulate (terminate at) genes that are associated with T.
  • Formula : Given a sequence of functional attributes Formula , find all occurrences of the corresponding pathway in the gene network and determine its significance.

A pathway is identified as being significant if its P-value is less than the critical {alpha}-level, a user-defined parameter.

NARADA delivers near interactive query response using a novel, biologically motivated pruning technique. We call a pathway strongly significant if all of its subpaths are significant. In biological terms, a strongly significant pathway is likely to correspond to a significantly modular process, in which not only the building blocks of the pathway, but also its constituent building blocks are tightly coupled. In the context of queries implemented in NARADA, these subpaths are limited to those that originate from (terminate at) the query term. The option for searching strongly significant paths is also available in NARADA.

The main motivation in identification of significant regulatory pathways is understanding the crosstalk between different processes, functions and cellular components. Therefore, functions and processes that are known to play a key role in gene regulation (e.g. transcription regulator activity or DNA binding) may overload the identified pathways and overwhelm other interesting patterns. However, genes that are responsible for these functions are likely to bridge regulatory interactions between different processes (Lee et al., 2002), so they cannot be ignored. For this reason, such GO terms are short-circuited, i.e. if process Ti regulates Tj, which is a key process in transcription, and Tj regulates another process Tk, then the pathway Formula is replaced with regulatory interaction Formula .


    4 RESULTS AND DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 BACKGROUND AND MOTIVATION
 3 METHODS
 4 RESULTS AND DISCUSSION
 5 CONCLUDING REMARKS
 REFERENCES
 
We test NARADA comprehensively on the E.coli transcriptional network obtained from RegulonDB (Salgado et al., 2006). The release 5.6. of this dataset contains 1364 genes with 3159 regulatory interactions. A total of 193 of these interactions specify dual regulation. We separate these dual regulatory interactions as up and down regulatory interactions. We use Gene Ontology (Ashburner et al., 2000) as a library of functional attributes. The annotation of E. coli genes is obtained from UniProt GOA Proteome (Camon et al., 2004). Using the mapping provided by GO, the gene network is mapped to functional attribute networks of the three name spaces in GO. Mapping to the biological process space provides maximum coverage in number of genes annotated, 881 genes are mapped to one or more of 318 process terms. We discuss here results obtained by this mapping only. Results relating to molecular functions and cellular components, as well as comprehensive results on pathways of biological processes, are available at the NARADA website.

We use NARADA to identify all significant forward and reverse pathways of length 2–5. In order to identify these paths, we run queries Formula and Formula with a critical {alpha} of 0.01 on all 318 biological processes. The number of pathways obtained using combinations of the algorithmic options described in the previous section are shown in Table 1. On a Pentium M (1.6 GHz) laptop with 1.21 GB RAM the brute-force approach takes on average 0.5 s per query for path length 2, to 12 s per query for paths of length 5. For strongly significant paths, it takes <2 s per query for paths of length 5, while for shortcutting terms it is 8 s per query for paths of length 4. Strongly significant pathways, i.e. those obtained by extending only significant pathways, compose a significant portion of the highly significant pathways. This observations suggests that significantly modular pathways are also likely to be composed of significantly modular building blocks.


View this table:
[in this window]
[in a new window]

 
Table 1. Total number of significant pathways found by NARADA on E.coli transcription network for various path lengths

 
4.1 Discussion
One of the prominent features of the detected significant pathways is that a large number of them begin with terms relating to transcriptional and translational regulation while ending in other cellular processes (Fig. 5). This can be explained by the fact that the network consists of a set of transcription factor genes and set of genes regulated by them. Therefore, most of the regulatory pathways of length 3 or more have to begin at or flow through this set of genes annotated with processes relating to transcription, translation and regulation thereof. Pathways involving other process terms occur with lower frequency, but most of them are highly significant.


Figure 5
View larger version (35K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5. Sample significantly overrepresented pathways in E.coli transcription network. (a) DNA recombination -> transcription -> phosphorylation (b) transcription Figure 5 flagellum biogenesis -> cell motility. The pathways in functional attribute space are shown on the upper panel, their occurrences in the gene network are shown on the lower panel.

 
Samples of pathways obtained are shown in Table 2. Some pathways like (sensory perception Formula transcription -> transport) occur frequently and may constitute a common mechanism for regulation of transport-related activities. Parts of the significant pathways that regulate phosphorylation via genes involved in transcription and DNA recombination are shown in Figure 5a. As genes involved in transcription are abundantly present in the network, part of the pathway (DNA recombination -> transcription) occurs rarely (12 times) and is not significant, but in 6 of the 12 times it occurs, the genes involved in transcription regulate phosphorylation. The fis transcriptional regulator is responsible for regulation of nuoA-N operon (Wackwitz et al., 1999), while the fhlA transcriptional activator regulates the hyf locus (Hopper et al., 1994; Skibinski et al., 2002). Indeed, it is observed that the integration host factor (ihfA,ihfB) affects the regulation of these phosphorylation related genes (nuoA-N, hyf,hyc) directly and indirectly (Hopper et al., 1994; Nasser et al., 2002).


View this table:
[in this window]
[in a new window]

 
Table 2. Selection of significantly overrepresented pathways identified by NARADA on E. coli transcription network

 
In Figure 5b, significant pathways that regulate cell motility are shown. This is part of a response to a query of type Formula . The flhD operon that encodes flhC and flhD has been shown to act as positive regulator of flagellar regulons(fli, flg) (Liu and Matsumura, 1994). The flagellar master operon flhDC, in turn, is tightly regulated at the transcriptional level by rscAB, fur, ompR (Francez-Charlot et al., 2003; Lehnen et al., 2002; Ko and Park, 2000). The output of NARADA captures this indirect regulation of flagellar expression perfectly.

4.2 Case study: Regulatory network of molybdate ion transport
Figure 6 shows all significant paths of maximum length 3 regulated by molybdate ion transport. The genes associated with molybdate ion transport are modE and the operon modABCD, but it has been observed that the gene modE down-regulates the operon modABCD (McNicholas et al., 1997), and the operon does not regulate any other gene. The three pathways at the bottom of the figure are the only significant paths of length 2 originating at molybdate ion transport. As can be seen on on the upper side of the figure (paths of length 3), molybdate ion transport promotes and suppresses various processes indirectly, through DNA-dependent regulation of transcription, two-component signal transduction system and nitrate assimilation. It is important to note that direct regulation of these intermediate terms by molybdate ion transport is not significant by itself. By extending the search beyond pairwise interactions, NARADA is able to capture these significant indirect interactions successfully.


Figure 6
View larger version (56K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 6. Direct and indirect regulation of various processes by molybdate ion transport and the corresponding gene network.

 
The paths of length 2 mirror the direct regulation of moaABCDE operon (McNicholas et al., 1997) and oppABCDF operon (Tao et al., 2005) by modE. Furthermore, modE indirectly regulates cytochrome complex assembly ccm operon (Overton et al., 2006), electron transport nap operon (McNicholas and Gunsalus, 2002), nitrate assembly nar operon (Self et al., 1999) and mitochondrial electron transport nuo operon (Bongaerts, et al. 1995; Overton, et al., 2006). All these indirect regulations occur through genes involved in respiratory nitrate reductase narXL (Tao et al., 2005). In the RegulonDB network, we observe that modE indeed regulates narL, which regulates other genes. NARADA associates the mediation of modE's regulatory effect on several other processes with the functional associations of narL.

An interesting observation is that, even though the r