Functional annotation of regulatory pathways
1Department of Computer Science, Purdue University2Department of Chemistry and Biochemistry, University of California, San Diego, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Standardized annotations of biomolecules in interaction networks (e.g. Gene Ontology) provide comprehensive understanding of the function of individual molecules. Extending such annotations to pathways is a critical component of functional characterization of cellular signaling at the systems level.
Results: We propose a framework for projecting gene regulatory networks onto the space of functional attributes using multigraph models, with the objective of deriving statistically significant pathway annotations. We first demonstrate that annotations of pairwise interactions do not generalize to indirect relationships between processes. Motivated by this result, we formalize the problem of identifying statistically overrepresented pathways of functional attributes. We establish the hardness of this problem by demonstrating the non-monotonicity of common statistical significance measures. We propose a statistical model that emphasizes the modularity of a pathway, evaluating its significance based on the coupling of its building blocks. We complement the statistical model by an efficient algorithm and software, NARADA, for computing significant pathways in large regulatory networks. Comprehensive results from our methods applied to the Escherichia coli transcription network demonstrate that our approach is effective in identifying known, as well as novel biological pathway annotations.
Availability: NARADA is implemented in Java and is available at http://www.cs.purdue.edu/homes/jpandey/narada/
Contact: jpandey{at}cs.purdue.edu
| 1 INTRODUCTION |
|---|
|
|
|---|
Gene regulatory networks represent powerful formalisms for modeling cell signaling. These networks are inferred from gene expression, as well as other sources of data, using various statistical and computational methods (Friedman et al., 2000; Husmeier, 2003). Recent studies on networks of specific organisms show that interactions between genes that take part in specific pairs of biological processes are significantly overrepresented (Lee et al., 2002; Tong et al., 2004). Generalizing such observations to pathways of arbitrary length may allow identification of standardized pathways, enabling creation of reference databases of direct and indirect interactions between various processes. Knowledge of such pathways is useful, not only in general understanding of the relationship between cellular processes at the systems level, but also in projecting existing knowledge of cellular organization of model organisms to other species. Increasing availability of species-specific interaction data, coupled with attempts aimed at creating standardized dictionaries of functional annotation for biomolecules provide the knowledge base that can be effectively used for this purpose. What is lacking is a comprehensive set of tools that combine these two sources of data to identify significantly overrepresented patterns of interaction through reliable statistical modeling with a formal computational basis.
In this article, we introduce the notion of functional network characterization, derived from a gene regulatory network and associated functional annotations of genes. We use the Gene Ontology (GO) (Ashburner et al., 2000) for annotations, however, our methods themselves generalize to other networks and annotations. Functional network characterization is based on the abstract notion of regulatory interactions between pairs of functional attributes (as opposed to genes). In this context, we demonstrate that methods for identifying significant pairwise annotations do not generalize to pathway annotations. We introduce the problem of identifying statistically overrepresented pathways of functional attributes, targeted at the identification of chains of regulatory interactions between functional attributes. We study the hardness of this problem, focusing on the non-monotonicity of commonly used statistical significance measures. We show that the problem is hard along two dimensions: (i) the pathway space of the functional attribute network, and (ii) the space of functional resolution specified by GO hierarchy. Emphasizing the modularity of a pathway to assess its significance, we propose a statistical model that focuses on the coupling of the building blocks of a pathway. We use this statistical model to derive efficient algorithms for solving the pathway annotation problem. Our methods are implemented in a web-based tool, NARADA which provides an intuitive user and data interface.
Comprehensive evaluation of NARADA on an Escherichia coli transcription network from RegulonDB (Salgado et al., 2006) shows that our method identifies several known, as well as novel pathways, at near-interactive query rates. Note that the current knowledge of regulatory networks is incomplete, and limited to a few model organisms. Therefore, the application of our method on currently available data does not provide a comprehensive library of regulatory network annotation. On the other hand, the partial annotation provided by our method forms a useful basis for extending our knowledge of regulatory networks beyond well-studied processes and model organisms.
| 2 BACKGROUND AND MOTIVATION |
|---|
|
|
|---|
2.1 Results from previous studies
Lee et al. (2002) study the Saccharomyces cerevisiae transcription regulation network with a view to understanding relationships between functional categories. They observe that transcriptional regulators within a functional category commonly bind to genes encoding regulators within the same category (e.g. cell cycle, metabolism, environmental response). They also report that many transcriptional regulators within a functional category bind to transcriptional regulators that play key roles in the control of other cellular processes. For example, cell cycle activators are observed to bind to genes that are responsible for regulation of metabolism, environmental response, development and protein biosynthesis. Tong et al. (2004) identify putative genetic interactions in yeast via synthetic genetic array (SGA) analysis and investigate the functional relevance of their results in the context of GO annotations. They construct a network of GO terms by inserting an edge between any pair of terms that are bridged by a significant number of interacting gene pairs. Here, two GO terms are said to be bridged by an interaction if one of the interacting genes is associated with one of the terms, and the other gene with the second term, but neither is associated with both terms. They show that the resulting network is clustered according to underlying biological processes, while some biological processes buffer one another. For example, microtubule-based functions buffer both actin-based and DNA synthesis or repair functions, suggesting coordination of these functions through interaction of various genes.
2.2 Approach
Establishing functional relationships from gene interactions is essential to understanding functional organization of a cell. Current investigations are limited to case-specific studies that generally focus on validation or evaluation of results through simple statistical analyses — yet they provide significant insights (Gamalielsson et al., 2006; Lee et al., 2002; Tong et al., 2004). Computational tools that are based on sophisticated abstractions and customized statistical models are likely to yield novel insights. The basic approach for integrating existing knowledge of gene networks and functional annotations is to project the network in the gene space onto the functional attribute space through mapping of genes to attributes as specified by the annotation. A simple method for achieving this annotates each gene with its function and identifies overrepresented interacting annotations. This method yields interesting insights, as illustrated by Tong et al. (2004) in the context of synthetic genetic arrays. This model, however, does not generalize beyond pairwise interactions since each interaction between a pair of functional attributes is within a specific context (a different pair of genes) in the network, as illustrated by the following example.
2.3 Motivating example
Two regulatory pathways are shown in Figure 1 — each node is identified by its corresponding gene (gi) and tagged by the functional attribute (Tj) associated with the gene. In Figure 1a, genes g1, g2 and g3 indirectly regulate genes g5, g6 and g7 through gene g4. In Figure 1b, the network is isolated and there is no indirect regulation. Now assume the network of functional attributes derived from the simple method described above, separately for each gene network. For both networks, since all genes associated with functional attribute T1 regulate a gene with T2, one may conclude that T1 regulating T2 is significant. A similar conclusion follows for the regulatory effect of T2 on T3. If only pairwise interactions are considered, we derive the same network of functional attributes from both genetic networks (Figure 1c). This network clearly suggests that functional attribute T1 indirectly regulates T3 through T2. This is indeed a correct observation for the network in Figure 1a. However, this is not true for the network in Figure 1b.
|
To address this problem, we develop a formal framework for projecting a gene network on to a network of functional attributes, using multigraph models that accurately capture the context in which an interaction occurs. Through this framework, we generalize pairwise interactions between functional attributes to the identification of regulatory pathways of functional attributes.
| 3 METHODS |
|---|
|
|
|---|
We now describe the biological, statistical, and computational formalisms that underlie our methods.
3.1 Formal model for functional attribute networks
A gene regulatory network is modeled by a labeled directed graph
. In this network, nodes
represent genes. Directed edge
, where
, represents a regulatory interaction between genes gi and gj.
specifies a labeling of edges that represents the mode of regulation: activation ( + ), repression (–) or dual regulation (±). A sample gene regulatory network is shown in Figure 2. In our discussion, for the sake of simplicity, we omit the mode of regulation and treat all interactions as activator interactions, whenever appropriate.
|
Each gene in the network is associated with a set of functional attributes. These attributes describe a functional annotation of the gene, i.e. they map an individual biological entity to known functional classes.
DEFINITION 1. Functional Annotation.
Given a set of genes VG and a set of functional attributes VF, letand
denote the power set of VG and VF, respectively. Then, functional annotation
defines mapping
and
, such that
if and only if
, for any
and
. The frequency of Tj,
, is equal to the number of genes that are mapped to Tj.
In Figure 2a, each gene gi is tagged with the functional attributes in
. For each Tj,
is composed of the genes tagged by Tj. We use Gene Ontology (GO) (Ashburner et al., 2000) as a reference library for annotating genes. For each gene, GO specifies the molecular functions associated with it, biological processes it takes part in, and cellular components it may be part of. The functional attributes in GO, known as GO terms, are organized hierarchically through is a and part of relationships. For example, regulation of stereoid biosynthetic process is a regulation of stereoid metabolic process and is part of stereoid biosynthetic process'. This hierarchy is abstracted using a directed acyclic graph (DAG). In this representation, if Ti is a, or part of Tj, then
, i.e. the genes associated with Ti form a subset of genes associated with Tj. In this case, Tj is said to be a parent of Ti. A term may have more than one parent, i.e.
and
does not imply
. Furthermore, there is a unique
with no parent, called root, such that
. In the rest of this section, we use a network of functional attributes with no constraints (e.g. GO hierarchy) on function
. We discuss the issue specifically relating to the GO hierarchy when addressing the implementation of NARADA.
We model networks of functional attributes using multigraphs. A multigraph is a generalized graph, where multiple edges are allowed between a single pair of nodes.
DEFINITION 2. Functional Attribute Network.
Given gene regulatory network, a set of functional attributes VF, and functional annotation
, the corresponding functional attribute network
is a multigraph defined as follows. The set of functional attributes VF is also the set of nodes in F. Each node
contains a set of ports corresponding to the set of genes associated with Ti, i.e.
. Each multiedge TiTj is a set of ordered port pairs (edges)
, such that
,
and
.
The functional attribute network corresponding to the gene regulatory network in Figure 2a is shown in Figure 2b. This multigraph model captures the context of each interaction accurately through the concept of ports. As illustrated in Figure 1, if a simple graph model is used, paths that do not exist in the gene network emerge in the functional attribute network. This is not possible in the multigraph model, since a path must leave a node from the port at which it enters the node.
DEFINITION3. Path.
In functional attribute network, a path
is an ordered set of node-port pairs such that (i)
for
(nodes are not repeated), (ii)
for
and (iii)
for
(consecutive edges are connected through the same port). The length of
is
.
In Figure 2b,
is a path but
is not, since multiedge
does not contain the edge
. Note that allowing
and
, we may also include cycles in this definition. According to the above definition, paths are characterized by ports. While analyzing regulatory pathways of functional attributes, however, we are interested in paths that are characterized by nodes in the functional attribute network. Clearly, such pathways may correspond to multiple paths in the functional attribute network. Therefore, we model them using multipaths.
DEFINITION 4. Multipath.
In functional attribute network, a multipath
is an ordered set of nodes such that (i)
for
, and (ii) there exist
for
, such that
is a path. The occurrence set
of
consists of all distinct paths that satisfy (ii) and each such path is called an occurrence of
. The frequency of
,
, is equal to the number of occurrences of
.
We use the terms pathway and multipath interchangeably, to emphasize the biological meaning of a multipath. Allowing
, we also extend this definition to multicycles, occurrences of which correspond to cycles in the gene network. In Figure 2b,
(also denoted
throughout this article) is a multipath with frequency four. On the other hand, multipath
does not exist in this network, i.e. it has frequency zero, although multiedges T2
T4 and T4
T3 both exist. Note that the distinction between activator and inhibitor interactions is emphasized in this example for illustrative purposes, while it is omitted in the definition for simplicity. A multipath with high frequency is likely to be biologically interesting, since it corresponds to a regulatory pathway of functional attributes that recurs in various contexts in the underlying cellular organization. In order to quantify this biological significance, it is useful to evaluate frequency from a statistical perspective.
3.2 Hardness of significant pathway identification
Raw counts have long been used as a measure of significance—primarily because of the resulting algorithmic simplicity. This is a direct consequence of its monotonicity properties, namely that a subgraph (or substring/subset) of a frequent graph (or string/set) is itself frequent (Koyutürk et al., 2006b). In identification of significantly overrepresented pathways of functional attributes, frequency alone does not provide a good measure of statistical significance. This is because, the degree distribution of gene regulatory networks and the distribution of the frequency of functional attributes are both highly skewed. Consequently, paths including functional attributes that are associated with high-degree genes (e.g. molecular functions related to transcription) and those associated with many genes (e.g. GO terms that are at coarser levels of GO hierarchy) are likely to dominate. For this reason, a statistical measure that takes into account these distributions is needed.
Monotonicity of common statistical significance measures. We identify the basic properties of a useful measure of statistical significance.
PROPOSITION 1. Statistical Interpretability.
Consider a setof binary random variables and the set of corresponding observations
, where
for
corresponds to an observation supporting a hypothesis. Let
be a real-valued function, used to assess the statistical significance of the collection of observations defined by
. Let
and
be disjoint binary random variable sets, i.e.
, and let
and
be the respective observation sets. A function f is statistically interpretable if it satisfies the following conditions:
- If y = 0
![]()
, then
,
- If y = 1
![]()
then
.
Here, without loss of generality,
implies that
is a more interesting observation than
. More generally, the binary random variables characterize a pattern, and a larger set of these variables corresponds to a larger (or more general) pattern. This property simply states that additional positive (negative) observations should increase (decrease) our confidence that a pattern is interesting.
Most significance measures used in the analysis of discrete biological data are statistically interpretable. Consider, for example, the identification of significantly enriched GO terms in a set of genes. For a given term, the binary variables (X), one for each gene (
), indicate whether the gene is associated with the term (X = 1). Adding a new gene (
) to this set will improve the significance of enrichment (
) if the new gene is associated with the term (Y = 1). If not (Y = 0), the enrichment of the term in the new set will be less significant (
). Indeed, existing methods and statistical measures for this problem demonstrate this property (Grossman et al., 2006; Hsiao et al., 2005).
Now we show that, in contrast to approximations that do not take into account the size of the sample space (e.g. frequency), statistically interpretable measures of significance do not possess monotonicity.
THEOREM 1.
Let f be a monotonically non-decreasing (nonincreasing) function, i.e. for anyand
,
(
). Then f is not statistically interpretable.
PROOF.
Without loss of generality, assume f is non-decreasing. Let Y be a set of binary random variables, and y be a set of corresponding observations, such that,
. Since f is monotonically non-decreasing, we have
. This contradicts condition (ii) in Proposition 1.
3.2.1 Monotonicity with respect to GO hierarchy
We now show that this result directly applies to the monotonicity of useful significance measures with respect to the GO hierarchy. Consider an ordered set of GO terms
. For any ordered set
such that
for
, define a binary random variable indicating the existence of the corresponding path in the underlying regulatory network. Clearly, the frequency of multipath
is equal to the sum of the realizations of these random variables. Let X be the set of these random variables. Now, without loss of generality, consider pathway
, such that TP is a parent of
, i.e.
. Then, for each gene
, there are multiple additional random variables, each for one of
. Let Y be the set of these random variables. In this setting, the definition of statistical interpretability directly applies. If all paths of the sort
exist in the underlying regulatory network, then the pathway
is more significant than
. If none of them exist, then the pathway containing the child is more significant. Applying Theorem 1, we conclude that a statistically interpretable function, that quantifies the significance of the frequency of a multipath in the functional attribute network, cannot be monotonic with respect to GO hierarchy.
The example in Figure 3 illustrates this point. Here, both
and
are parents of T1. Since all genes that are not in T1 but in
regulate T3, the regulatory effect of
on T3 is more significant than that of T1. Since none of the genes absent in T1 but present in
regulate T3, the regulatory effect of
on T3 is less significant than that of T1. Thus, any statistically interpretable measure f should satisfy
, which violates monotonicity. Note also that frequency, which is monotonically non-decreasing with respect to height (proximity to root) in GO hierarchy, is not statistically interpretable as
.
|
This result can be interpreted as follows. GO hierarchy defines a combinatorial space of resolution for pathways of functional attributes. In other words, a pathway may be generalized or specialized by replacing a node (GO term) in the pathway with one of its ancestors or descendants in the GO DAG. Since this can be done for each node in the pathway, the size of this space is exponential in pathway length. However, as demonstrated above, the significance of a pathway fluctuates in this space. Consequently, all significant pathways cannot be efficiently identified using traditional inductive techniques, by starting from the highest (lowest) resolution in GO hierarchy and pruning out coarser (finer) terms in chunks.
Alternate approaches to this problem are necessary, not only in the context of significant pathway identification, but also other combinatorial problems in systems biology that involve hierarchical annotations. One possible approach is to develop a measure of statistical significance that admits a tight bound on the significance of a pathway in terms of the frequencies of pathways that are at a higher (lower) GO resolution. The discussion above clearly demonstrates that it is not straightforward to do so. Indeed, the statistical model we introduce in the next section does not easily lead to such tight bounds, since it emphasizes the modularity of a pathway to assess its significance. Consequently, in our implementation of NARADA, we use the most specific GO terms as the default resolution. Development of measures and methods that effectively prune out parts of the GO space remains an open problem.
3.2.2 Monotonicity with respect to pathway length
We apply Theorem 1 to the multipath space of a functional attribute network, i.e. to the relationship between a multipath and its subpaths. As before, a multipath is represented by a set of binary random variables, each corresponding to one of its potential occurrences. Without loss of generality, consider multipaths
and
. The random variables that represent
k do not form a superset of those that represent
. Rather, they are extensions of them, as defined below:
DEFINITION 5. Extension.
Given a set, an extension
of
, denoted
, is defined as follows. Each
, is attached to a subset
. Each
is attached to exactly one
, i.e. for any
,
.
Each potential occurrence of
k is a superpath of exactly one potential occurrence of
and there may be multiple such occurrences of
k that correspond to a particular occurrence of
. Therefore, the set of random variables that represent
k form an extension of the set of random variables that represent
.
PROPOSITION 2. Statistical Interpretability w.r.t. Extension.
Consider,
, and
as defined in Proposition 1. Let
and let
be the respective observation set. A function f is statistically interpretable with respect to extension if it satisfies the following conditions:
- If for all
such that x = 1, z = 0
![]()
, then
,
- If for all
such that x = 1, z = 1
![]()
, then
.
Each x = 1 corresponds to an occurrence of the corresponding pathway. Consequently, statistical interpretability with respect to extension of a pathway requires the following. If for all occurrences of
, all corresponding potential occurrences of
k exist in the network, then
k is statistically more interesting than
. If none of them occurs, then
is more interesting.
COROLLARY 1.
Let f be a monotonically non-decreasing (non-increasing) function with respect to extension, i.e. for anyand
,
(
). Then f is not statistically interpretable with respect to extension.
The example shown in Figure 1 illustrates this result. In both of the scenarios shown in Figure 1a and b,
. In (a),
, i.e. condition (i) in Definition 2 (all potential occurrences of
, given the occurrences of T1
T2, exist in the network), hence the pathway
is more interesting than both T1
T2 and T2
T3. In (b), on the other hand,
(condition (ii) holds), so both T1
T2 and T2
T3 are more interesting than
. This discussion motivates the statistical model we present in the next section.
3.3 Statistical model for pathways of functional attributes
We present a novel statistical model for assessing the significance of the frequency of a multipath in a functional attribute network. In this approach, the interestingness of a pathway is associated with its modularity, i.e. the significance of the coupling of its building blocks. In statistical terms, this is achieved by conditioning the distribution of the frequency (modeled as a random variable) of a pathway on the frequency of its subpaths (modeled as fixed parameters).
3.3.1 Motivating example
We illustrate the notion of the significance of coupling between regulatory interactions using the regulatory network and its corresponding functional attribute network shown in Figure 2. In this example,
, i.e regulatory interactions T1
T2, T2
T3, and T2
T4 occur twice. Furthermore, regulatory pathway (multipath in the functional attribute network)
occurs four times, i.e.
. Observe that, given the frequencies of T1
T2 and T2
T3, this is the maximum value
can take. In other words, any gene with annotation T2, which is up-regulated by a T1 gene, always down-regulates a T3 gene. This observation suggests that, T1 plays an indirect, but important role in the regulation of T3. On the contrary,
, since gene g4 with annotation T2 up-regulates a T4-gene (g6), but it is not regulated by a T1-gene. These observations suggest that the coupling between regulatory interactions T1
T2 and T2
T3 is stronger than the coupling between T1
T2 and T2
T4. In other words, the pathway
is more likely to be modular, compared to
.
We develop a statistical model that evaluates the modularity of regulatory pathways based on the coupling between their building blocks. For each pathway, our model assumes that the frequency of the building blocks of a pathway are known, i.e. constitute the background distribution. We quantify the statistical significance of a pathway with the conditional probability of its frequency based on this background.
3.3.2 Baseline model
To quantify the significance of a pathway of shortest length (i.e. a single regulatory interaction), we rely on a reference model that generates a functional attribute network. This model takes into account (i) the degree distribution of the underlying gene network, as well as (ii) the distribution of the number of genes associated with each functional attribute, based on the independent edge generation paradigm commonly used in modeling networks with arbitrary degree distribution (Chung et al., 2003; Itzkovitz et al., 2003). Note that this model is better suited to multigraphs than simple graphs (King, 2004). We refer to this model as the baseline model, and denote it
.
The baseline model is defined by a set of parameters, and specifies the expected multidegree of each node in the functional attribute network. Here, the multidegree of a node in a multigraph refers to the number of multiedges incident to that node. Given gene regulatory network
, functional attribute set VF, and annotation
, the expected in-degree
and out-degree
of a functional attribute
are estimated as follows:
|
| (1) |
|
| (2) |
3.3.3 Significance of a regulatory interaction
Let
denote the random variable representing the frequency of pathway
in the generated functional attribute network. Clearly,
is a hypergeometric random variable with parameters m (number of items),
(number of good items), n (number of selected items) and
(number of selected good items) (Feller 1968). Hence, the P-value of a regulatory interaction TiTj in the observed network, i.e. the probability of observing at least
interactions between genes associated with Ti and genes associated with Tj, is given by
|
| (3) |
3.3.4 Significance of a pathway
We now present a statistical model to assess the statistical significance of a pathway of functional attributes, which assumes a background distribution based on the occurrence of the building blocks of a pathway. Let
denote the path
. For
, we want to evaluate the significance of the coupling between pathways
and
. In other words, we want to understand how strong a conclusion of the sort If a gene
is regulated through a chain of regulatory interactions characterized by
, then this gene is likely to regulate a
gene through pathway
(or vice versa) can be.
To achieve this, we assume a reference model, in which the frequency of pathways
and
is established a priori. Let
and
denote
and
, respectively. Then, the P-value of the coupling between
and
is defined as follows:
|
| (4) |
Our model for the distribution of
, given
and
, is illustrated in Figure 4. Assume that a pool contains all possible occurrences of multipaths
and
. Clearly, there are
and
potential occurrences of each multipath. This is shown in Figure 4a. Now consider a pair of paths, one corresponding to a potential occurrence of
, the other to
. Such a pair corresponds to a path, i.e. an occurrence of
, only if the second path originates in the port in which the first one terminates. This is illustrated in Figure 4b and c. Since there are
and
occurrences of
and
, respectively, the problem is formulated as follows: we draw
paths from
potential occurrences of
and
paths from
potential occurrences of
, forming
pairs. What is the probability that in at least
of these pairs, the port on Tj will be common?
|
We approximate this probability using our result on the behavior of dense subgraphs (Koyutürk et al., 2006a) and Chvátal's bound on hypergeometric tail (Chvátal, 1979). In order to apply these results, we resolve dependencies assuming that the selected path pairs are independent from each other. Then, letting
|
| (5) |
3.4 NARADA: a software for identification of significant regulatory pathways
Based on the above statistical model, we develop algorithms and a comprehensive software tool, NARADA, for projecting gene regulatory networks on the functional attribute domain.
The input to NARADA consists of three files: (i) a gene regulatory network, in which the source gene, target gene and the mode of interaction are specified for each regulatory interaction, (ii) specification of the functional attributes and their relations (e.g. Gene Ontology
file) and (iii) annotation file that specifies the mapping between genes and functional attributes. NARADA currently handles three types of queries:
: Given a functional attribute T, find all significant pathways that are regulated by (originate from) genes that are associated with T.
: Given a functional attribute T, find all significant pathways that regulate (terminate at) genes that are associated with T.
: Given a sequence of functional attributes
, find all occurrences of the corresponding pathway in the gene network and determine its significance.
A pathway is identified as being significant if its P-value is less than the critical
-level, a user-defined parameter.
NARADA delivers near interactive query response using a novel, biologically motivated pruning technique. We call a pathway strongly significant if all of its subpaths are significant. In biological terms, a strongly significant pathway is likely to correspond to a significantly modular process, in which not only the building blocks of the pathway, but also its constituent building blocks are tightly coupled. In the context of queries implemented in NARADA, these subpaths are limited to those that originate from (terminate at) the query term. The option for searching strongly significant paths is also available in NARADA.
The main motivation in identification of significant regulatory pathways is understanding the crosstalk between different processes, functions and cellular components. Therefore, functions and processes that are known to play a key role in gene regulation (e.g. transcription regulator activity or DNA binding) may overload the identified pathways and overwhelm other interesting patterns. However, genes that are responsible for these functions are likely to bridge regulatory interactions between different processes (Lee et al., 2002), so they cannot be ignored. For this reason, such GO terms are short-circuited, i.e. if process Ti regulates Tj, which is a key process in transcription, and Tj regulates another process Tk, then the pathway
is replaced with regulatory interaction
.
| 4 RESULTS AND DISCUSSION |
|---|
|
|
|---|
We test NARADA comprehensively on the E.coli transcriptional network obtained from RegulonDB (Salgado et al., 2006). The release 5.6. of this dataset contains 1364 genes with 3159 regulatory interactions. A total of 193 of these interactions specify dual regulation. We separate these dual regulatory interactions as up and down regulatory interactions. We use Gene Ontology (Ashburner et al., 2000) as a library of functional attributes. The annotation of E. coli genes is obtained from UniProt GOA Proteome (Camon et al., 2004). Using the mapping provided by GO, the gene network is mapped to functional attribute networks of the three name spaces in GO. Mapping to the biological process space provides maximum coverage in number of genes annotated, 881 genes are mapped to one or more of 318 process terms. We discuss here results obtained by this mapping only. Results relating to molecular functions and cellular components, as well as comprehensive results on pathways of biological processes, are available at the NARADA website.
We use NARADA to identify all significant forward and reverse pathways of length 2–5. In order to identify these paths, we run queries
and
with a critical
of 0.01 on all 318 biological processes. The number of pathways obtained using combinations of the algorithmic options described in the previous section are shown in Table 1. On a Pentium M (1.6 GHz) laptop with 1.21 GB RAM the brute-force approach takes on average 0.5 s per query for path length 2, to 12 s per query for paths of length 5. For strongly significant paths, it takes <2 s per query for paths of length 5, while for shortcutting terms it is 8 s per query for paths of length 4. Strongly significant pathways, i.e. those obtained by extending only significant pathways, compose a significant portion of the highly significant pathways. This observations suggests that significantly modular pathways are also likely to be composed of significantly modular building blocks.
|
4.1 Discussion
One of the prominent features of the detected significant pathways is that a large number of them begin with terms relating to transcriptional and translational regulation while ending in other cellular processes (Fig. 5). This can be explained by the fact that the network consists of a set of transcription factor genes and set of genes regulated by them. Therefore, most of the regulatory pathways of length 3 or more have to begin at or flow through this set of genes annotated with processes relating to transcription, translation and regulation thereof. Pathways involving other process terms occur with lower frequency, but most of them are highly significant.
|
Samples of pathways obtained are shown in Table 2. Some pathways like (sensory perception
transport) occur frequently and may constitute a common mechanism for regulation of transport-related activities. Parts of the significant pathways that regulate phosphorylation via genes involved in transcription and DNA recombination are shown in Figure 5a. As genes involved in transcription are abundantly present in the network, part of the pathway (DNA recombination
transcription) occurs rarely (12 times) and is not significant, but in 6 of the 12 times it occurs, the genes involved in transcription regulate phosphorylation. The fis transcriptional regulator is responsible for regulation of nuoA-N operon (Wackwitz et al., 1999), while the fhlA transcriptional activator regulates the hyf locus (Hopper et al., 1994; Skibinski et al., 2002). Indeed, it is observed that the integration host factor (ihfA,ihfB) affects the regulation of these phosphorylation related genes (nuoA-N, hyf,hyc) directly and indirectly (Hopper et al., 1994; Nasser et al., 2002).
|
In Figure 5b, significant pathways that regulate cell motility are shown. This is part of a response to a query of type
4.2 Case study: Regulatory network of molybdate ion transport
Figure 6 shows all significant paths of maximum length 3 regulated by molybdate ion transport. The genes associated with molybdate ion transport are modE and the operon modABCD, but it has been observed that the gene modE down-regulates the operon modABCD (McNicholas et al., 1997), and the operon does not regulate any other gene. The three pathways at the bottom of the figure are the only significant paths of length 2 originating at molybdate ion transport. As can be seen on on the upper side of the figure (paths of length 3), molybdate ion transport promotes and suppresses various processes indirectly, through DNA-dependent regulation of transcription, two-component signal transduction system and nitrate assimilation. It is important to note that direct regulation of these intermediate terms by molybdate ion transport is not significant by itself. By extending the search beyond pairwise interactions, NARADA is able to capture these significant indirect interactions successfully.
|
The paths of length 2 mirror the direct regulation of moaABCDE operon (McNicholas et al., 1997) and oppABCDF operon (Tao et al., 2005) by modE. Furthermore, modE indirectly regulates cytochrome complex assembly ccm operon (Overton et al., 2006), electron transport nap operon (McNicholas and Gunsalus, 2002), nitrate assembly nar operon (Self et al., 1999) and mitochondrial electron transport nuo operon (Bongaerts, et al. 1995; Overton, et al., 2006). All these indirect regulations occur through genes involved in respiratory nitrate reductase narXL (Tao et al., 2005). In the RegulonDB network, we observe that modE indeed regulates narL, which regulates other genes. NARADA associates the mediation of modE's regulatory effect on several other processes with the functional associations of narL.
An interesting observation is that, even though the r


is



