Skip Navigation


Bioinformatics Advance Access originally published online on September 17, 2004
Bioinformatics 2005 21(2):227-238; doi:10.1093/bioinformatics/bth484
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/2/227    most recent
bth484v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (6)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Ott, S.
Right arrow Articles by Miyano, S.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Ott, S.
Right arrow Articles by Miyano, S.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

Bioinformatics vol. 21 issue 2 © Oxford University Press 2005; all rights reserved.

Superiority of network motifs over optimal networks and an application to the revelation of gene network evolution

S. Ott 1,*, A. Hansen 2, S.-Y. Kim 1 and S. Miyano 1

1 Human Genome Center, Institute of Medical Science, University of Tokyo 4-6-1 Shirokanedai, Minato-ku, Tokyo, 108-8639, Japan
2 Wolfson Institute for Biomedical Research, University College London The Cruciform Building, Gower Street, London, WC1E 6AU, UK

*To whom correspondence should be addressed at Wolfson Institute for Biomedical Research, University College London, The Cruciform Building, Gower Street, London WC1E 6AU, UK.


    Abstract
 TOP
 Abstract
 INTRODUCTION
 METHODOLOGY
 EVALUATING GENE NETWORK...
 APPLICATION TO GENE NETWORK...
 DISCUSSION AND CONCLUSION
 REFERENCES
 

Motivation: Estimating the network of regulative interactions between genes from gene expression measurements is a major challenge. Recently, we have shown that for gene networks of up to around 35 genes, optimal network models can be computed. However, even optimal gene network models will in general contain false edges, since the expression data will not unambiguously point to a single network.

Results: In order to overcome this problem, we present a computational method to enumerate the most likely m networks and to extract a widely common subgraph (denoted as gene network motif) from these. We apply the method to bacterial gene expression data and extensively compare estimation results to knowledge. Our results reveal that gene network motifs are in significantly better agreement to biological knowledge than optimal network models. We also confirm this observation in a series of estimations using synthetic microarray data and compare estimations by our method with previous estimations for yeast. Furthermore, we use our method to estimate similarities and differences of the gene networks that regulate tryptophan metabolism in two related species and thereby demonstrate the analysis of gene network evolution.

Availability: Commercial license negotiable with Gene Networks Inc. (cherkis{at}gene-networks.com)

Contact: sascha-ott{at}gmx.net


    INTRODUCTION
 TOP
 Abstract
 INTRODUCTION
 METHODOLOGY
 EVALUATING GENE NETWORK...
 APPLICATION TO GENE NETWORK...
 DISCUSSION AND CONCLUSION
 REFERENCES
 
The estimation of gene networks from expression level measurements has been one focus of bioinformatics research in recent years (Chen et al., 1999; Friedman et al., 2000; Hartemink et al., 2001; Ideker et al., 2002; Rung et al., 2002; van Someren et al., 2002). Knowledge of gene networks will be important for understanding cellular processes, designing new strategies to combat disease and so on.

A widely used approach to model gene networks are Bayesian networks (Buntine, 1991; Cooper and Herskovits, 1992; Friedman and Goldszmidt, 1998; Heckerman, 1999; Friedman et al., 2000; Hartemink et al., 2001; Pe'er et al., 2001; Imoto et al., 2002; Ong et al., 2002; Tamada et al., 2003; Nariai et al., 2004; Ott et al., 2004), which model expression levels of genes as random variables and gene networks as joint probability distributions of expression patterns. These distributions are decomposed using directed acyclic graphs (DAGs), which we will call networks. Networks are scored using score functions based on the likelihood of networks, given the data.

A recent result shows that one can optimally estimate gene network models for gene networks of up to about 35 genes (Ott et al., 2004).1 This result holds for any score function s : G x 2 G -> IR assigning a score to a pair of a gene g and a possible selection of parents for g. However, while optimal gene network models are the most likely models, there may still be very different models that have approximately the same likelihood, considering the large number of DAGs [in the case of 10 genes {approx}4.17 x 1018 (Robinson, 1973)]. Therefore, even optimal gene network models will in general not match the target gene network.

Our endeavour to tackle this problem is 3-fold. First, we provide an algorithm for the enumeration of optimal and suboptimal networks in the order of their likelihood, and extract frequent subgraphs of the most likely m networks, denoted as gene network motifs in this work.2 Second, we rigorously and extensively compare estimated network models to available knowledge about gene networks. Third, we apply the gene network motif extraction to microarray data of Bacillus subtilis and Escherichia coli in order to demonstrate the analysis of gene network evolution.

Our evaluations show that gene network motifs are in significantly better agreement with knowledge about transcriptional regulation than optimal network models. We also derived the conclusion that the gene networks governing the regulation of tryptophan metabolism in the above species have probably changed significantly during their evolution while other parts of the network have been conserved.


    METHODOLOGY
 TOP
 Abstract
 INTRODUCTION
 METHODOLOGY
 EVALUATING GENE NETWORK...
 APPLICATION TO GENE NETWORK...
 DISCUSSION AND CONCLUSION
 REFERENCES
 
Throughout this work, we assume a set of genes G and a score function s : G x 2 G -> IR. The score of a network N is defined as score(N) = {sum} gG s(g,P N (g)), where P N (g) denotes the set of g's parents in N. In order to find the model that explains the given data best, we need to find the DAG N that minimizes score(N).

Since this problem is NP-hard (Chickering, 1996), and the search space is of super-exponential size (Robinson, 1973), heuristic approaches have frequently been applied (Friedman et al., 2000; Hartemink et al., 2001; Pe'er et al., 2001; Imoto et al., 2002; Ong et al., 2002; van Someren et al., 2002; Tamada et al., 2003; Nariai et al., 2004), though the accuracy of heuristics is uncertain. However, we have recently shown that optimal networks can be found using (|G|/2 + 1) · 2|G| dynamic programming steps that leads to an exact algorithm applicable in all kinds of research settings (Ott et al. 2004).

For larger gene networks, empiric or heuristic methods can be used to select a biologically meaningful subspace of the search space. If this is done as described in Ott and Miyano (2003), the approach of Ott et al. (2004) can be used to perform an optimal search within the selected subspace.

We have extended the algorithm of Ott et al. (2004) in order to solve the enumeration task. Since the likelihood of gene network motifs is the sum of the likelihood of the networks containing it, they will turn out to be more reliable than single network estimations.

Enumerating optimal gene networks
Without loss of generality, we assume networks with equal score to be sorted in some way, therefore allowing the notion ‘the k-th best network’. For m IN, we use IN ≤m to denote {1, ..., m}. An ordering of a set A G can be described as a permutation {pi} : {1, ..., |A|} -> A. We use {Pi} A to denote the set of all permutations of A. For {pi} {Pi} A , we say that a network N A x A is {pi}-linear if for all (g,h) N{pi}–1(g) < {pi}–1(h) holds, that is, all edges in N comply with the direction given by {pi}.

Our strategy is to divide the space of DAGs on a set A G into subspaces of {pi}-linear networks, for all {pi} {Pi} A , and to decompose the problem of finding optimal and suboptimal networks into the following two problems:

  1. Find the subspace of the search space that contains the (sub)optimal network searched for.
  2. Find the (sub)optimal network within the selected subspace.

We first define some functions and then show how these functions can be computed for gene networks of considerable size.

DEFINITION 1.

Let m IN. We define Fm : G x 2 G x IN ≤m -> 2 G inductively.3 First, for all g G and A G, we define

Then, denoting the set of all previous solutions {F m (g,A,p)| p < k} as J(k),

for all 1 < k ≤ m.

DEFINITION 2.

Let m IN. We define Sm : G x 2 G x IN ≤m -> IR as

for all g G, A G, and k IN ≤m .

By the definitions, F m (g,A,k) is the k-th best choice of parents for a gene g when the parents have to be selected from A, and S m (g,A,k) is the score for this choice. When m is clear from the context, we use F and S instead of F m and S m , respectively. Note that F m and S m are partially defined functions, since m may be larger than the number of subsets of A.

In order to find optimal and suboptimal networks for a given subspace specified by a permutation {pi}, we need the following function Q A . For a given gene g, let us denote the set of all genes that precede g in {pi} as V({pi},g) = {h|{pi}–1(h) < {pi}–1(g)}.

DEFINITION 3.

Let A G. We define Q A : {Pi} A x IN |A| -> 2 A x A as

for all {pi} {Pi} A and v IN |A|.

In Definition 3, we have used a vector v IN |A| to determine the rank of the selection of parents for the particular genes. Below it will be shown that Q A ({pi},v) is the set of edges of an optimal or suboptimal {pi}-linear network on A, its rank depending on v. Next, we define two functions M m and D m that specify subspaces, in which (sub)optimal networks can be found, and the choice of a network from the subspace, respectively.

DEFINITION 4.

Let m IN. We inductively define functions Mm : 2 G x IN ≤m -> {cup} AG {Pi} A and over their second parameter. Let A G. First, we define

and

Let k IN ≤m with k > 1 and let N be a network on A with optimal score among networks not in {(A,Q A (M m (A,p),D m (A,p)))|p < k}. Let {pi}* {Pi} A be a permutation such that N is {pi}* -linear. We define

Let v * IN |A| such that for every g A, the set of g's parents, PaN (g), equals

We define:

As F m and S m , M m and D m are partial functions.

The functions and their intuitive meanings are summarized in Table 1, the definition of the algorithm is given in Table 2.


View this table:
[in this window]
[in a new window]
 
Table 1 Intuitive meanings of the functions used to define the algorithm

 

View this table:
[in this window]
[in a new window]
 
Table 2 The algorithm

 
The algorithm computes the functions F m and S m in Step 1 and in Step 2 for all g G, A G, and j ≤ m. This can be done by applying dynamic programming, since only function values of F m for a set A of lower cardinality or lower j are needed in order to select B * in Step 2a.

In Steps 3 and 4, functions M m and D m can be computed similarly using dynamic programming, since for the selection of a triple in Step 4a only function values of M m and D m for a set A of lower cardinality or lower j are needed. The triple (g,p,q) specifies a network on A, g being a candidate for the last element in the permutation searched for, p specifying the remaining permutation that can be chosen from up to m previously computed permutations of A – {g}. Then, to form a network in the subspace defined by the resulting permutation, the q-th best selection of parents for g is used, while for the other genes parents are selected as indicated by D m (A – {g}, p).

Functions F m , S m , M m and D m are partially defined in the case of high m which we did not explicitly mention in the description of the algorithm. In our implementation of the algorithm, we store permutations {pi} and make use of the restrictions for edges in {pi}-linear networks, yielding a memory- and time-efficient coding of networks. Moreover, the two applications of dynamic programming in Steps 1/2 and Steps 3/4, respectively, can be performed in an alternating way, thus reducing the memory requirements substantially. The number of network comparisons in Step 4a can be minimized in practice, since networks with different scores are different. Moreover, the algorithm can well be parallelized.

Correctness
Let us denote the k-th best network on a set A G by N* A,k . We first reformulate two lemmata from Ott et al. (2004).

LEMMA 1.

Let A G and {pi} {Pi} A . Let N* be a {pi}-linear network on A with minimal score. Then, score((A,Q A ({pi},(1, ..., 1)))) = score(N*) holds

LEMMA 2.

Let A G and m IN. Let g* = arg min gA (S m (g,A – {g},1) + N*A–{g},1). Define {pi} {Pi} A by {pi}(i) = M(A – {g*},1)(i) for i 1,...,|A| – 1}, and {pi}(|A|) = g*. Then, {pi} = M m (A, 1).

The following theorem provides the correctness. We regard as one dynamic programming step the computation that is executed for one g G and one A G in Step 2, respectively for one A G in Step 4. We use n to denote |G|.

THEOREM 1.

Let m N. The best m networks can be found using (n/2 + 1) · 2 n Idynamic programming steps, where the complexity of a dynamic programming step depends on m.

PROOF.

By the definitions, the output of the algorithm, Q G (M m (G,i),D m (G,i)), i ≤ m, are the best m networks on G. We only need to prove that the recursive formulas given in the algorithm are correct. The equations given in Step 1 are correct by the definitions of F m and S m . When we select a subset of a set A G in Step 2, we have basically two choices: the whole set A or a true subset. In the former case, we can compute the score of the choice directly, in the latter, we can use previously computed values of F m and S m , which gives the correctness of Step 2.

After the execution of Step 2, we have computed all values of F m and S m . Using these values, function Q can be computed directly. Therefore, we only need to compute functions M m and D m in order to being able to produce the output in Step 5. The equations in Step 3 are again correct by the definitions. We observe that with Lemma in combination with Definition 4, the following equation follows by induction:


From this equation and Lemma 2 we see that the recursion in Step 4 is correct for k = 1 (variable j in the algorithm). For k > 1, we compute the suboptimal permutation M m (A,k) and the suboptimal choice of parents D m (A,k) in the same way, restricting to a network not previously chosen.

The dynamic programming in Steps 1 and 2 requires 2 n–1 steps for each gene, since a gene may not be one of its parents. In the dynamic programming in Steps 3 and 4 a total number of 2 n steps is needed, which completes the proof.

Taking advantage of a combinatorial explosion
The time required for one dynamic programming step depends on the number of solutions m, but can be regarded as a feasible constant for m ≤ 200. Here, we show how m can be chosen as high as 20 000 in practice, with only slightly increasing the computation time compared to m = 200.

The algorithm as stated in Table 2 computes a fixed number of solutions, regardless of the size of the set A. However, it is often possible to derive the best m solutions for a set A in layer |A| = j (2 ≤ j ≤ |G|) from the knowledge of the best m' solutions for sets in layer j – 1 with m' < m. The number of derivable solutions may vary from no increase in the worst case to a quadratic increase in the best case (Ott, 2004). A quadratic increase in one step means a super-exponential increase of derived solutions with increasing layers. In the practical case, it is unlikely to encounter one of the extreme cases as we validated in computational experiments using microarray data.

We found that the number of derivable solutions usually increases exponentially. This allows us to compute a lower number of solutions m' for the layers exhibiting a high number of subsets, and to exponentially increase the number of solutions for higher layers of rapidly declining size. This strategy takes advantage of the intrinsic combinatorial explosiveness of the search space and works well for m' = 100 and m = 10 000, or m' = 200 and m = 20 000 in order to compute the best m networks (Ott, 2004). The observed increase in the number of solutions was by about a factor 1.5 when climbing up one layer.

Extracting gene network motifs
A straightforward approach to exploit the information given by an enumeration of the most likely m networks would be to count the occurrences of each edge in the networks, select only the edges which have a count above some threshold, and compose a partial network from these edges.5 However, a partial network composed from edges of high counts does not have to be frequent in the networks at hand and may, therefore, be unlikely. This leads to the following problem, which we refer to as the gene network motif extraction problem:

Given graphs N 1 = (G,E 1),...,N m = (G,E m ), and k IN, find a set M G x G with |M| = k maximizing the number of graphs N i that include M, i.e. M E i .

The motif extraction problem is equivalent to the well-known problem of finding maximal frequent item sets. The problem of finding balanced complete bipartite subgraphs (Garey and Johnson, 1979) can be reduced to both problems, so they are NP-hard. However, the problem can be solved for practical instances as encountered in this work by exhaustively searching over subsets M of the set of edges with at least c occurrences (c IN), using the fact that edges in a motif M with at least c occurrences must also have at least c occurrences. Using this search strategy, we find the optimal solution M* as long as the empirically chosen constant c is not set too high.


    EVALUATING GENE NETWORK ESTIMATIONS
 TOP
 Abstract
 INTRODUCTION
 METHODOLOGY
 EVALUATING GENE NETWORK...
 APPLICATION TO GENE NETWORK...
 DISCUSSION AND CONCLUSION
 REFERENCES
 
The principled evaluation of gene network estimations is an important issue which has been rarely addressed in the literature. Since transcription and translation co-localize in procaryotes, rendering mRNA expression data sufficient for estimating gene networks, we chose B.subtilis and E.coli as targets for evaluating the proposed methods, using the available microarray data and biological knowledge of both species.

Data and score function
For E.coli, we selected the data sets GDS95–GDS100 from the Gene Expression Omnibus.6 Changes in gene expression levels were elicited by perturbations of tryptophan metabolism, UV exposure and novobiocin treatment (Khodursky et al., 2000a; Khodursky et al., 2000b; Courcelle et al., 2001). We also received data of transcriptional regulation from RegulonDB (Salgado et al. 2001) for comparison with estimation results.

For B.subtilis, we used a data set of 70 microarrays from time-course experiments under various treatments, a data set of 99 microarrays from gene disruptant experiments (unpublished data) and a data set of known transcriptional regulation from DBTBS (Ishii et al., 2001; Makita et al., 2004).

The score functions (s : G x 2 G -> IR) used in our implementation include the MDL score (Friedman and Goldszmidt, 1998), the BDe score (Cooper and Herskovits, 1992; Friedman and Goldszmidt, 1998) and the BNRC score (Imoto et al., 2002). We selected the BNRC score for our computations because it can be applied without discretization of the data, avoiding additional parameters and loss of information, and gene interactions are modeled using B-splines, allowing for non-linear relationships. Furthermore, in computational experiments on the B.subtilis data set, optimal networks with respect to the BNRC score turned out to be in the best agreement with textbook knowledge among these score functions (Ott, 2004; Sonenshein et al., 2001).

Selection of target networks
We selected all relations from the knowledge data set for E.coli, for which experimental evidence is provided (evidence level three or higher) yielding a set of 899 known relations. From the B.subtilis data set, we selected 840 regulatory relations with evidence in the literature. We applied a random procedure to select target networks from these data. Since we need to select genes in a way that there are some known regulatory relations, we select the first few genes randomly, and then iteratively select genes that are connected to the previously selected genes. In each iteration, we randomly select a connected component of the intermediate network and a new gene with a known relation to at least one gene in the component. Since trivial choices of target networks should be avoided, we choose a gene not connected to the previously selected genes when five connected genes have been selected in a row.

The selection procedure yields a partially known gene network N of non-trivial structure represented as a matrix. Each pair of genes with (without) a known relation is represented with 1 (0) in the corresponding entry of the matrix, but 0.5 is set instead of 0 for pairs (g,h) fulfilling at least one of the following conditions:

  1. h regulates g.
  2. A gene i in the target network regulates g and h.
  3. A gene i in the target network is regulated by g (h) and regulates h (g).
  4. Condition 2 or Condition 3 hold for a gene i outside the target network.

Using these conditions, nearly correct estimations are distinguished from wrong estimations. If edge (g,h) is estimated and (h,g) is a known regulatory relation, then the fact that these two genes interact was correctly estimated (Condition 1). In the same way, indirect relations established by Conditions 2–4 are also not entirely wrong, if estimated.

Evaluating edge counts
The above algorithm allows to enumerate most likely networks. We, therefore, asked whether the number of occurrences (edge count) of a given edge in the most likely m networks carries biologically relevant information. We applied the procedure described above to select 30 target networks of 10 genes for B.subtilis and used the exact algorithm to enumerate the best 500 network models for each set of genes G i . For each G i (i = 1,...,30) and for each possible directed edge (g,h) or undirected edge {g,h} in a network on G i , we counted the number of occurrences in the 500 estimated networks for G i . We then examined how each group of edges (0, 0.5 and 1) distributes over the range of edge counts.

The results (Figure 1) show that most edges have an edge count below 50 or above 450 (first/last interval). In these two intervals, group 0 separates clearly from group 0.5 and group 1, whereas group 0.5 and group 1 do not show a clear separation, pointing to the difference between indirectly correct estimations (group 0.5) and estimations that can not be justified from current knowledge. Group 0 shows fewer high edge counts, but dominates the interval of lowest edge counts, indicating that edges with high edge count are more likely to be true than edges with low edge count.



View larger version (11K):
[in this window]
[in a new window]
 
Fig. 1 Result for B.subtilis data; x-axis: discretized ratio of edge counts to the number of estimated networks (500), y-axis: proportion of edges of the group 0, 0.5 and 1 per interval. Left: time-course data, Right: disruptant data.

 
Disruptant data yield a stronger separation of group 0 than time-course data, which might be due to the (different) amount of data rather than to the experimental method. Despite the lack of information about the target networks (since they had been randomly selected, irrespective of the microarray data) in the microarray data and true relations possibly being unknown or estimated as transitive edges, the observed separation of group 0 is an encouraging result.

A computation on the E.coli data set did not yield a clear separation of group 0, which might be due to the lower number of microarrays (53) and the low number of affected genes in these specific experiments (see above).

We conclude that the above evaluation scheme can be applied not only for evaluating score functions, but also for evaluating the significance of the data.

Superiority of gene network motifs
In order to evaluate the motif extraction approach, we selected the disruptant data set for B.subtilis and 1000 target gene networks N i of 10 genes in a random way as described above, enumerated the most likely 100 network models for each target network, and extracted motifs with two or more edges (highest-scoring motif among largest motifs), at threshold c = 80, resp. 95. We randomly selected one edge of each optimal network, and one edge of each motif. We then checked the correctness of both edges, using the DBTBS data, and computed the probability p i of randomly guessing a single edge from N i as the ratio of the number of edges of N i to the number of possible edges. According to the results of the previous subsection, we judged 1 entries and 0.5 entries as correct. We computed an upper bound for the probability of guessing at least k single edges correctly among n networks, P(n,k), by using the inequality , where p denotes .7 We observe that the results (Table 3) for the motif extraction approach as well as for the optimal models are in significant agreement with the knowledge. But the former approach clearly outperforms the results for optimal gene networks. We conclude that gene network motifs are even more reliable than the network with the best score. We note that our results also hold for the estimated networks as a whole, since we selected arbitrary edges for the evaluation.


View this table:
[in this window]
[in a new window]
 
Table 3 Evaluation results for the motif extraction approach using 80 and 95 as thresholds

 
This result was confirmed in further evaluations for gene networks of 15 genes and in the case of accepting only one entry of the knowledge matrix as correct (Ott, 2004). Examining the optimal number of enumerated networks m, we found that higher numbers yield better results, if m is chosen from {1,...,150}.

Evaluations using synthetic data
In order to evaluate whether the observed higher reliability of network motifs holds in general, we produced synthetic microarray data using the following procedure. First, we selected a DAG of 11 vertices (genes) as an artificial gene network. Four genes in our network have no parents, four genes have one parent, two genes have two parents, and one gene has three parents. We then selected a linear or non-linear function for each gene to describe the dependency of its expression values and the expression values of its parents. The expression levels of genes without parents were modelled as normally distributed with standard deviation set to 1. For genes with parents we added a normally distributed system error to the input from its parents. The standard deviation of this system error was set as half the standard deviation of its input. Finally, we added a measurement error of varying standard deviation to the data.

We assessed the accuracy of optimal networks and network motifs by comparing these graphs to the artificial network and computing the sensitivity and the specificity.8 Figures 2 and 3 summarize the results for varying measurement error. Each data point is an average value of 10 repetitions of data generation and network estimation.



View larger version (9K):
[in this window]
[in a new window]
 
Fig. 2 Average sensitivity as a function of noise-to-signal ratio. Rectangles represent average values for optimal networks, triangles represent average values for network motifs.

 


View larger version (9K):
[in this window]
[in a new window]
 
Fig. 3 Average specificity as a function of noise-to-signal ratio. Rectangles represent average values for optimal networks, triangles represent average values for network motifs.

 
We observe that optimal networks show a higher sensitivity, while the specificity of network motifs outperforms optimal networks. This result is consistent with our above finding on real data and shows that network motifs provide the biologist with more reliable estimations of gene regulation than optimal networks do. The increase in reliability comes at the expense of sensitivity, but reliability might be of higher priority, considering the high complexity of gene network models and the noisy data.

In a second series of estimations we used synthetic data to evaluate the influence of the number of microarrays on the sensitivity and specificity of estimations. Figures 4 and 5 summarize these results, each data point is an average value after 10 repetitions. As expected, the accuracy of estimations increases with an increasing number of arrays, while the observed differences between optimal networks and network motifs do not seem to depend on the number of arrays.



View larger version (9K):
[in this window]
[in a new window]
 
Fig. 4 Average sensitivity as a function of the number of microarrays. Rectangles represent average values for optimal networks, triangles represent average values for network motifs.

 


View larger version (9K):
[in this window]
[in a new window]
 
Fig. 5 Average specificity as a function of the number of microarrays. Rectangles represent average values for optimal networks, triangles represent average values for network motifs.

 
Application to yeast data
In Friedman et al. (2000), the bootstrap method has been applied to assess the confidence of features (e.g. edges) of networks estimated by a heuristic algorithm. In the bootstrap approach, the data set is modified using a random procedure m times, and a network estimation is done for each modified data set. The confidence of estimated features is then computed as the number of occurrences in m estimated graphs. In our approach, we do not modify the given data, but instead analyze the subspace of the most likely network structures. Still, both methods are similar in the sense that they aim to distinguish reliable parts of estimations from less reliable ones.

In Friedman et al. (2000), gene networks have been estimated for yeast, using the data of Spellman et al. (1998). As a comparison of both approaches, we have extracted network motifs for two small networks using the same microarray data. Figures 6 and 7 show the extracted network motifs and regulatory relations estimated in Friedman et al. (2000). We observe that out of 13 edges estimated by the bootstrap approach, we find 6 edges in the network motif, 3 edges that are reversed, and 4 edges that do not appear in the network motif. However, in these four cases we find directed paths of length 2 that represent the same relationships indirectly. Therefore, both estimation results are quite consistent.



View larger version (29K):
[in this window]
[in a new window]
 
Fig. 6 Estimation result for yeast data, subnetwork around cdc5. Solid arrows show edges included in the network motif estimated by our method, dashed arrows represent estimations given in Friedman et al. (2000), edge counts are included for solid arrows.

 


View larger version (44K):
[in this window]
[in a new window]
 
Fig. 7 Estimation result for yeast data, subnetwork around cln2. Solid arrows show edges included in the network motif estimated by our method, dashed arrows represent estimations given in Friedman et al. (2000), edge counts are included for solid arrows.

 

    APPLICATION TO GENE NETWORK EVOLUTION
 TOP
 Abstract
 INTRODUCTION
 METHODOLOGY
 EVALUATING GENE NETWORK...
 APPLICATION TO GENE NETWORK...
 DISCUSSION AND CONCLUSION
 REFERENCES
 
There are indications that evolution may be driven more strongly by changes in expression levels than changes in protein structures (Enard et al., 2002; Oleksiak et al., 2002), thus indicating a kind of gene network evolution. The methods decribed can help to unravel similarities and evolutionary changes in gene regulatory networks of related species. As we did in the above evaluations, we choose Gram-negative rod-shaped E.coli and Gram-positive rod-shaped B.subtilis as promising targets for our estimations. But instead of using the whole data set described above, we now restrict the data according to the matter of investigation, selecting specific experiments and specific genes only, thus increasing specificity.

Given the above data, the most promising target network is the tryptophan network because for E.coli there is few (18 microarrays), but highly specific data from experiments designed to unravel details of the regulation of the genes involved in tryptophan metabolism, and for B.subtilis there is a comparatively high number of 59 microarrays obtained from time-course experiments under various nutritional conditions, likely to affect the tryptophan network. We selected genes known to be involved in the well-studied tryptophan network and extracted directed motifs based on 100 enumerated optimal network models, using 50 as the threshold. A graph was derived from the output file showing the largest motif with highest score extracted from the enumerated networks, genes are presented by nodes, and each edge is labelled with its weight.

The largest motif obtained from the data set for E.coli is found 54 times and contains 13 edges with weights ranging from 79 to 100 (Figure 8), the one for B.subtilis data was found 61 times and contains 16 edges, weights ranging from 77 to 100 (Figure 9).



View larger version (38K):
[in this window]
[in a new window]
 
Fig. 8 Extracted motif for E.coli.

 


View larger version (40K):
[in this window]
[in a new window]
 
Fig. 9 Extracted motif for B.subtilis.

 
trpA, trpB, trpC, trpD and trpE are linked by highly weighted edges in both motifs, forming the core of this network, corresponding to the fact that they are positioned close to each other in the trp-operon in both species. Even the order of position in the trp-operon can partially be recognized in the graphs. But in B.subtilis, seven genes code for the enzymes in the trp biosynthesis pathway, as opposed to five in E.coli. The two extra genes, trpF and pabA are also contained in the derived motif. trpF is connected to trpB and trpD, corresponding to its approximate position on the trp-operon. pabA, also called trpG in B.subtilis, is found closely connected to trpA and trpB although it is not located in the trp-operon but in the folate operon (Sonenshein et al., 2001). This close connection may indicate the close functional relation in the tryptophan biosynthesis pathway. The E.coli pabA gene, which has not been found to be deeply involved in this pathway, remains unconnected in the extracted motif.

On the other hand, mtrB is closely linked to trpE and trpF in the graph for B.subtilis. This can be explained by the fact that its gene product, tryptophan RNA-binding attenuation protein (TRAP) has been found to be among the key regulators of tryptophan biosynthesis in B.subtilis (Valbuzzi and Yanofsky, 2001). Interestingly, mtr, coding for a tryptophan specific transport protein in E.coli (Ong et al., 2002) is also associated with core genes of tryptophan biosynthesis like trpB and trpE in the graph. The structure of the network motifs suggests that the function of mtr in E.coli might be similar to the one of mtrB in B.subtilis.

Recently, a TRAP-inhibitory protein (anti-TRAP, AT) has been found and has been identified as the gene product of yczA (Valbuzzi and Yanofsky, 2001). However, no association between yczA and the other genes examined can be found in the results; but yczA-mRNA levels may be low since AT does not act alone but depends on tryptophan levels in the cell and only binds to Trp-activated TRAP (Valbuzzi and Yanofsky, 2001). No homologue of yczA has been found so far in E.coli, but BLASTp searches revealed that an amino acid sequence of AT shows similarities to that of the cystein-rich domain of the chaperone DnaJ (Szabo et al. 1996; Martinez-Yamout et al., 2000). Therefore, dnaJ had been included into this calculation, the results suggest an association with the tryptophan network in E.coli and even more in B.subtilis.

The reason might be cross-hybridization with yczA-mRNA or interaction on the protein level as a chaperone or as a TRAP regulator. This would explain the strong link between dnaJ and mtrB in the figure for B.subtilis (Figure 9). trpS, coding for the tryptophanyl-tRNA synthetase has been included into this analysis because tRNA Trp had been shown to play an important, yet different, role in the regulation of tryptophan biosynthesis in both bacteria (Valbuzzi and Yanofsky, 2001). The structures of the respective network motifs differ indeed. The figure for E.coli (Figure 8) shows trpS being linked to mtr and wrbA. mtr may be involved in transcription termination which can be observed in abundance of tRNA Trp (Valbuzzi and Yanofsky, 2001), stalling at the leader peptide, which is encoded by trpL, corresponding to the edge from trpL to mtr in the figure. wrbA encodes a tryptophan repressor binding protein, so this edge corresponds to Trp activating the tryptophan repressor. However, there is no direct link between trpS and trpR, the tryptophan repressor gene. This can be explained by the fact that the tryptophan repressor protein acts through conformational change when Trp binds, not through mRNA expression. Yet the graph shows an association of trpR with trpD, which may indicate a general production of tryptophan repressor protein together with tryptophan biosynthesis enzymes, though trpR is not located in the trp-operon. In the B.subtilis graph, trpS expression is linked to pabA and dnaJ. If dnaJ functions as yczA, this is consistent with the finding that tRNA Trp has been found to be associated with the formation of AT-inactivated TRAP (Valbuzzi and Yanofsky, 2001).

We conclude that comparing estimated gene networks can be valuable to confirm previous knowledge and to derive new hypotheses. The meaning of edges may vary, ranging from transcriptional regulation to membership in the same operon or more complex interactions, and has, therefore, to be assessed in the light of previous knowledge after gene network estimation, and the understanding of such estimations has to be developed further. Our method can thus be helpful in both medical and biological applications (e.g. finding candidate target genes).


    DISCUSSION AND CONCLUSION
 TOP
 Abstract
 INTRODUCTION
 METHODOLOGY
 EVALUATING GENE NETWORK...
 APPLICATION TO GENE NETWORK...
 DISCUSSION AND CONCLUSION
 REFERENCES
 
We have provided a theoretical basis for the enumeration of optimal and suboptimal gene network models for gene networks of considerable size, presented results of a comprehensive comparison of estimations to biological knowledge, showed that gene network motifs are superior to optimal networks and applied the motif extraction approach to the intriguing problem of gene network evolution. We note that our evaluation criterion for estimated networks is very demanding and our work is one among very few that perform a principled evaluation of the estimated networks.9

The algorithmic methodology described is generally applicable in all situations where a score s with functionality s: G x 2 G -> IR is given, which is the case for all scores within the Bayesian network framework, but also for most other score functions. This is an important property for gene network estimation techniques, since work on new scores incorporating previous knowledge is on-going (Imoto et al., 2003; Nariai et al., 2004). In contrast to heuristic approaches, the exact algorithm guarantees finding the most likely networks.

Since the number of significantly distinct gene expression patterns observed in typical data sets is low, the number of essential (groups of) genes will be within the range of feasibility in many cases. For gene networks of size beyond the computational limits of our algorithm, techniques as in Ott and Miyano (2003) can be applied, such that our approach of finding gene network motifs is not limited to small gene networks.

Rigorously assessing the accuracy of gene network estimations using available knowledge as we conducted in the above evaluations, is a promising approach to develop standards for comparing the strength of gene network estimation methods. Since there is little work on the principled evaluation of gene network estimations, this should be further pursued.


    Acknowledgments
 
The authors would like to thank the members of the RegulonDB team for processing our query in natural language, and Yuko Makita for providing yet unpublished data from DBTBS. Furthermore, we are grateful for discussions and advice from Michiel de Hoon and Seiya Imoto. A.H. was supported by a HFSPO LT fellowship and a Wellcome Trust Functional Genomics Programme grant (066790/E/02/Z).


    Footnotes
 
1 This boundary of feasibility holds in the biologically relevant case of a limited number of regulators for every gene. Back

2 Therefore, our definition of a gene network motif seems to differ from (Milo et al., 2002), but might turn out to be closely related. Back

3 We define IN as {1,2, ...} in this work. Back

4 Since network N is among the best m networks on A, the choice of parents for each gene g must also be among the best m choices. Back

5 This approach was used in order to compose partial networks based on networks enumerated by the bootstrap method (Pe'er et al., 2001). It was shown that, assuming independency of edge counts, regions of significantly many edges with high edge counts can be found in gene network estimations. However, this assumption does not account for the fact that the edge counts of all edges connected to the same gene g depend on the expression measurements of g. Back

6 http://www.ncbi.nlm.nih.gov/geo/ Back

7 In order to avoid too rough estimations of p-values, random selection of target networks is repeated when a target network of extreme p i was selected. The actual value of p was about 0.4 in our evaluations. Back

8 Sensitivity is defined as TP/(TP+FN), specificity as TP/(TP+FP), where TP, FP and FN denote the number of true positives, false positives, and false negatives, respectively. Back

9 The probably most demanding evaluation so far is given in (Rung et al., 2002). Back

Received on March 26, 2004; revised on August 2, 2004; accepted on August 17, 2004

    REFERENCES
 TOP
 Abstract
 INTRODUCTION
 METHODOLOGY
 EVALUATING GENE NETWORK...
 APPLICATION TO GENE NETWORK...
 DISCUSSION AND CONCLUSION
 REFERENCES
 

    Buntine, W. (1991) Theory refinement on Bayesian networks. UAI '91, 7, 52–60.

    Chen, T., He, H.L., Church, G.M. (1999) Modeling gene expression with differential equations. Pac. Symp. Biocomput., 4, 29–40[Medline].

    Chickering, D.M. (1996) Learning Bayesian networks is NP complete. In Fisher, D. and Lenz, H.-J. (Eds.). Learning from Data: Artificial Intelligence and Statistics V, , NY Springer-Verlag, pp. 121–130.

    Cooper, G.F. and Herskovits, E. (1992) A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9, 309–347[CrossRef].

    Courcelle, J., Khodursky, A., Peter, B., Brown, P.O., Hanawalt, P.C. (2001) Comparative gene expression profiles following UV exposure in wild-type and SOS-deficient. Escherichia coli. Genetics, 158, 41–64.

    Enard, W., Khaitovich, P., Klose, J., Zöllner, S., Heissig, F., Giavalisco, P., Nieselt-Struwe, K., Muchmore, E., Varki, A., Ravid, R., et al. (2002) Intra- and interspecific variation in primate gene expression patterns. Science, 296, 340–343[Abstract/Free Full Text].

    Friedman, N. and Goldszmidt, M. (1998) Learning Bayesian networks with local structure. In Jordan, M.I. (Ed.). Learning and Inference in Graphical Models, , Dordrecht, The Netherlands Kluwer Academic Publishers, pp. 421–459.

    Friedman, N., Linial, M., Nachman, I., Pe'er, D. (2000) Using Bayesian networks to analyze expression data. J. Comp. Biol., 7, 601–620.

    Garey, M.R. and Johnson, D.S. Computers and Intractability, (1979) , San Francisco, CA W.H. Freeman and Company.

    Hartemink, A.J., Gifford, D.K., Jaakkola, T.S., Young, R.A. (2001) Using graphical models and genomic expression data to statistically validate models of genetic regulatory networks. Pac. Symp. Biocomput., 6, , pp. 422–433.

    Heckerman, D. (1999) A tutorial on learning with Bayesian networks. In Jordan, M. (Ed.). Learning in Graphical Models, , Cambridge, MA MIT Press.

    Ideker, T., Ozier, O., Schwikowski, B., Siegel, A.F. (2002) Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics, 18, , pp. 233–240.

    Imoto, S., Higuchi, T., Goto, T., Tashiro, K., Kuhara, S., Miyano, S. (2003) Combining microarrays and biological knowledge for estimating gene networks via Bayesian networks. Comput. Syst. Bioinformatics, 2, 104–113.

    Ishii, T., Yoshida, K., Terai, G., Fujita, Y., Nakai, K. (2001) DBTBS: a database of Bacillus subtilis promoters and transcription factors. Nucleic Acids Res., 29, 278–280[Abstract/Free Full Text].

    Khodursky, A.B., Peter, B.J., Cozzarelli, N.R., Botstein, D., Brown, P.O., Yanofsky, C. (2000a) DNA microarray analysis of gene expression in response to physiological and genetic changes that affect tryptophan metabolism in Escherichia coli . Proc. Natl Acad. Sci. USA, 97, 12170–12175[Abstract/Free Full Text].

    Khodursky, A.B., Peter, B.J., Schmid, M.B., DeRisi, J., Botstein, D., Brown, P.O. (2000b) Analysis of topoisomerase function in bacterial replication fork movement: use of DNA microarrays. Proc. Natl Acad. Sci. USA, 97, 9419–9424[Abstract/Free Full Text].

    Makita, Y., Nakao, M., Ogasawara, N., Nakai, K. (2004) DBTBS: database of transcriptional regulation in Bacillus subtilis and its contribution to comparative genomics. Nucleic Acids Res., 32, D75–D77[Abstract/Free Full Text].

    Martinez-Yamout, M., Legge, G.B., Zhang, O., Wright, P.E., Dyson, H.J. (2000) Solution structure of the cysteine-rich domain of the Escherichia coli chaperone protein DnaJ. J. Mol. Biol., 300, 805–818[CrossRef][ISI][Medline].

    Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D.D., Alon, U. (2002) Network motifs: simple building blocks of complex networks. Science, 298, 824–827[Abstract/Free Full Text].

    Nariai, N., Kim, S.-Y., Imoto, S., Miyano, S. (2004) Using protein–protein interactions for refining gene networks estimated from microarray data by Bayesian networks. Pac. Symp. Biocomput., 9, 336–347.

    Oleksiak, M.F., Churchill, G.A., Crawford, D.L. (2002) Variation in gene expression within and among natural populations. Nat. Genet., 32, 261–266[CrossRef][ISI][Medline].

    Ong, I.M., Glasner, J.D., Page, D. (2002) Modelling regulatory pathways in E. coli from time series expression profiles. Bioinformatics, 18, 241–248.

    Ott, S. (2004) Finding optimal models for gene networks. , Tokyo, Japan PhD thesis University of Tokyo.

    Ott, S. and Miyano, S. (2003) Finding optimal gene networks using biological constraints. Genome Informatics, 14, 124–133[Medline].

    Ott, S., Imoto, S., Miyano, S. (2004) Finding optimal models for small gene networks. Pac. Symp. Biocomput., 9, 557–567.

    Pe'er, D., Regev, A., Elidan, G., Friedman, N. (2001) Inferring subnetworks from perturbed expression profiles. Bioinformatics, 17, 215–224.

    Robinson, R.W. (1973) Counting labeled acyclic digraphs. In Harary, F. (Ed.). New Directions in the Theory of Graphs, , New York Academic Press, pp. 239–273.

    Rung, J., Schlitt, T., Brazma, A., Freivalds, K., Vilo, J. (2002) Building and analysing genome-wide gene disruption networks. Bioinformatics, 18, 202–210[Abstract/Free Full Text].

    Salgado, H., Santos-Zavaleta, A., Gama-Castro, S., Millán-Zárate, D., Díaz-Peredo, E., Sánchez-Solano, F., Pérez-Rueda, E., Bonavides-Martínez, C., Collado-Vides, J. (2001) RegulonDB (version 3.2): transcriptional regulation and operon organization in Escherichia coli K-12. Nucleic Acids Res., 29, 72–74[Abstract/Free Full Text].

    Sonenshein, A.L., Hoch, J.A., Losick, R. Bacillus subtilis and its Closest Relatives: From Genes to Cells, (2001) , Washington, DC ASM Press.

    Spellman, P., Sherlock, G., Zhang, M., Iyer, V., Anders, K., Eisen, M., Brown, P., Botstein, D., Futcher, B. (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell, 9, , pp. 3273–3297[Abstract/Free Full Text].

    Szabo, A., Korszun, R., Hartl, F.U., Flanagan, J. (1996) A zinc finger-like domain of the molecular chaperone DnaJ is involved in binding to denatured protein substrates. EMBO J., 15, 408–417[ISI][Medline].

    Tamada, Y., Kim, S.-Y., Bannai, H., Imoto, S., Tashiro, K., Kuhara, S., Miyano, S. (2003) Estimating gene networks from gene expression data by combining Bayesian network model with promoter element detection. Bioinformatics, 19, 227–236.

    Valbuzzi, A. and Yanofsky, C. (2001) Inhibition of the B. subtilis regulatory protein TRAP by the TRAP-inhibitory protein, AT. Science, 293, 2057–2059[Abstract/Free Full Text].

    van Someren, E.P., Wessels, L.F.A., Backer, E., Reinders, M.J.T. (2002) Genetic network modeling. Pharmacogen., 3, 507–525.


Add to CiteULike CiteULike