Skip Navigation


Bioinformatics Advance Access originally published online on October 14, 2004
Bioinformatics 2005 21(6):765-773; doi:10.1093/bioinformatics/bti064
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/6/765    most recent
bti064v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (12)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Rice, J. J.
Right arrow Articles by Stolovitzky, G.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Rice, J. J.
Right arrow Articles by Stolovitzky, G.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2004. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

Reconstructing biological networks using conditional correlation analysis

John Jeremy Rice , Yuhai Tu and Gustavo Stolovitzky *

Computational Biology Center, IBM T.J. Watson Research Center PO Box 218, Yorktown Heights, NY 10598, USA

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION AND CONCLUSIONS
 REFERENCES
 

Motivation: One of the present challenges in biological research is the organization of the data originating from high-throughput technologies. One way in which this information can be organized is in the form of networks of influences, physical or statistical, between cellular components. We propose an experimental method for probing biological networks, analyzing the resulting data and reconstructing the network architecture.

Methods: We use networks of known topology consisting of nodes (genes), directed edges (gene–gene interactions) and a dynamics for the genes' mRNA concentrations in terms of the gene–gene interactions. We proposed a network reconstruction algorithm based on the conditional correlation of the mRNA equilibrium concentration between two genes given that one of them was knocked down. Using simulated gene expression data on networks of known connectivity, we investigated how the reconstruction error is affected by noise, network topology, size, sparseness and dynamic parameters.

Results: Errors arise from correlation between nodes connected through intermediate nodes (false positives) and when the correlation between two directly connected nodes is obscured by noise, non-linearity or multiple inputs to the target node (false negatives). Two critical components of the method are as follows: (1) the choice of an optimal correlation threshold for predicting connections and (2) the reduction of errors arising from indirect connections (for which a novel algorithm is proposed). With these improvements, we can reconstruct networks with the topology of the transcriptional regulatory network in Escherichia coli with a reasonably low error rate.

Contact: gustavo{at}us.ibm.com

Supplementary information: Available from our website: www.research.ibm.com/FunGen


    INTRODUCTION
 TOP
 Abstract
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION AND CONCLUSIONS
 REFERENCES
 
A fundamental task in the post-genomic era is to understand the gene regulatory networks controlling cellular functions. Although still a nascent discipline, many attempts have been made to reverse engineer biological networks or pathways (for reviews see Rice and Stolovitzky, 2004; van Someren et al. 2002). Broadly, the goals of these efforts can be placed into three hierarchical levels.

Topology. This level is concerned with mapping interactions among genes, proteins, metabolites and so on without any regard to directionality. Initial work in this area includes methods to identify metabolic network topologies based on correlation (Arkin and Ross, 1995) and later on mutual information (Samoilov et al., 2001) of time-series measurements of species concentration. Other examples include relevance networks connecting genes with similar responses (Butte and Kohane, 2000) and networks based on protein–protein interactions (Uetz et al., 2000).

Topology with qualitative connections. This level includes not only associations between cellular entities but also information on the nature of such associations. The goal is to create a diagram of directional connections (arrows) from the input to the output nodes with some qualitative indicator of how the input affects the output (e.g. a positive or negative arrow). Many methods have been proposed at this level including a more refined correlation-based approach (Arkin et al., 1997); fuzzy logic analysis (Woolf and Wang, 2000); Jacobian matrix methods (Chevalier et al., 1993; Kholodenko et al., 2002); binary network approaches (Liang et al., 1998); Bayesian network inference (Friedman et al., 2000; Hartemink et al., 2001; Smith et al., 2003); and functional module analysis (Segal et al., 2003; Troyanskaya et al., 2003).

Quantitative connections. This level seeks to determine a quantitative description of the interactions, such as a mathematical relationship that maps outputs in terms of inputs. The goal is to produce a set of equations that could simulate and reproduce the behavior of the actual system. Some examples are linear models of gene regulation (D'Haeseleer et al., 1999; Yeung et al., 2002), perturbation and regression methods (Gardner et al., 2003) and genetic algorithms (Koza et al., 2001).

In this paper, we propose a novel methodology to infer the ‘wiring’ of basic cellular circuits of the second level described above. Our work deals with a conditional correlation-based method in which we seek to deduce directional connections in a network based on gene expression measurements. The directionality comes from the asymmetry of the conditional correlation matrix, which deals with the correlation between two genes given that one of them has been perturbed. As others have done (Koza et al., 2001; Yeung et al., 2002) we also used synthetic networks of known topology as surrogates for real biological networks. Although a study of real data would have been desirable, our method requires several replicates of the same gene perturbations, and this type of data is still unavailable. Indeed our approach indicates the need for new experimental designs for network reconstruction. Until such experiments become available, we will have to assess the performance of our method using synthetic networks. In addition, synthetic networks allow precise control of data quantity and quality, and the reconstructed networks can be objectively measured against the original network, a feature not generally possible with real biological networks. The networks consist of nodes (genes), directed edges (gene–gene interactions) and dynamics of the genes’ mRNA concentrations in terms of the gene–gene interactions. We interrogate the network by forcing each node (the ‘mutated’ node) to a zero concentration and measuring the resulting stationary state concentrations of the other nodes. Hence, our method does not require time-series measurements that have been considered by others (Arkin and Ross, 1995; Arkin et al., 1997; Samoilov et al., 2001; Yeung et al., 2002). Replicate measurements of stationary state concentrations are used to compute the correlation between the perturbed gene and all the other genes. Although others have used correlation of experimental expression data to infer undirected connections (Butte and Kohane, 2000) this work attempts to reconstruct whole networks in a node-by-node fashion to predict activating or inhibiting connections using conditional correlations. We test the algorithm on a variety of synthetic networks based on both random placement of connections and on connections found in the transcriptional regulatory network in Escherichia coli.


    METHODS
 TOP
 Abstract
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION AND CONCLUSIONS
 REFERENCES
 
We will consider networks with three different types of topologies, as described below.

Topology I: fully synthetic networks. In recent years, network theory has become a powerful tool to describe complex interactions used in diverse areas such as the Internet, social networks, food webs, etc. (for an overview see Barabasi, 2002). One important feature of the architecture of many of these networks is that the distribution of the number of links associated with the network nodes does not follow a uniform distribution but rather a power law distribution. These networks are called scale-free (Barabasi and Albert, 1999). A wide set of biological networks have recently been shown to exhibit scale-free behavior, such as metabolic networks (Jeong et al., 2000), protein–protein interaction networks (Jeong et al., 2001) and transcriptional networks (Babu et al., 2004). To reproduce this property in our synthetic networks, we considered random networks in which the connections between nodes are chosen at random using an approximate power law distribution. Specifically, the output degree (OD) of each node i, ODi, is drawn from the distribution

(1)
where ODmin and ODmax are the upper and lower bounds on the OD allowed in the network and {eta} is a constant chosen to be 2.5 (Wagner, 2002). For each node i, we drew ODi from the distribution of Equation (1) and node i was connected to ODi, of other nodes chosen randomly among the N – 1 other nodes. Each connection was randomly assigned to be either excitatory or inhibitory.

Topology II: E.coli network. To construct our E.coli transcription network, we start with the interaction topology reported by Shen-Orr et al., (2002). This network contains 423 nodes (operons) and 578 interactions. Next, we removed all the auto-regulation loops. This reduced the number of interactions to 518. In the original dataset, connections between transcription factors and promoters are positive, negative or both (i.e. some connection can be either positive or negative depending on other factors). We removed the third category by randomly assigning these connections to be either positive or negative.

Topology III: randomized E.coli networks. Recent studies have suggested that certain network ‘motifs’ are overrepresented in real networks (Milo et al., 2002; Shen-Orr et al., 2002) when compared with randomized networks that preserve the degree properties of the original network. To explore the effect of these motifs in the network reconstruction process, we will consider the networks for which each node has the same number of ingoing and outgoing edges as in the E.coli network, but for which the identity of the nodes connected to and from each node is randomized. This type of network is half way between the two previously discussed ones.

Network dynamics. Each network will be equipped with a dynamics for the mRNA concentration of each node. Following Yeung et al., (2002), the kinetic behavior governing the RNA concentration of gene i, ui, is considered to be determined by the concentration of the nodes impinging on node i according to

(2)
The interpretation of this equation is as follows: the rate of change of the mRNA concentration of gene i is composed of a degradation term [{lambda} ui, the first term in the right-hand side of Equation (2)], a constitutive rate of transcription ({alpha}) and the effects that other genes exert on gene i. The nature of this influence can be excitatory (represented by the sum over the set Ai , both in the numerator and in the denominator only) and inhibitory (represented by the sum over Ri, in the denominator only). The time scale is chosen such that the degradation rate {lambda} is unity. The baseline rate of transcription {alpha} is fixed at 0.5 for all nodes. The sets of excitatory and inhibitory nodes reaching node i, Ai and Ri , are disjoint. The influence of the excitatory (inhibitory) inputs of node j on node i is incorporated via the Hill-like coefficients {gamma}ijij), which control the non-linear dependence of the output node on the input nodes. Unless otherwise specified, these parameters will be set to 1. Noise is incorporated in the model through {varepsilon}i, a Gaussian white-in-time noise source independent for each node. The noise strength {varepsilon} is the same for all nodes, i.e. <{varepsilon}i (t){varepsilon}j (t')> = {varepsilon}2 {delta}ij {delta} (tt'), where {delta}ij is Kroenecker’s delta and {delta}(t t') is Dirac's delta function.

Simulation protocol. The numerical simulation protocol entails simulation of the intact network to produce replicate gene expression measurements of the wild-type system, and subsequently forcing one node at a time (the ‘mutated’ node) to a value of zero and then tracking the output of all nodes in the system at equilibrium. We collected the data by integrating Equation (2). After sufficient time for the initial conditions to decay, k samples are obtained for each node by recording the values of all the ui's for every 100 units of time. This sampling time is much longer than the relaxation time of the system (of the order of 1/{lambda}, which was set to 1). Thus, the samples can be interpreted as k biological replicates of the system measured in its stationary state. Our protocol can be implemented by measuring the gene expression of all genes of interest in k independent gene expression assays. In our simulations we chose k = 10 for each node.

In real biological systems, the type of perturbations that we model with this numerical protocol can be implemented with a variety of techniques that alter the function of single genes. Examples of these technologies are as follows: (1) antisense single-stranded RNA, which hybridizes with its target mRNA leading to reduced levels of its protein; (2) RNA interference, consisting of short double-stranded RNA of specific sequence, known as small interference RNA (siRNA), which can selectively and efficiently inhibit expression of specific genes; (3) gene-specific knockout or transgenic models, in which a gene is either completely silenced or its promoter is manipulated to render the gene inducible under the experimenter's control; and (4) perturbation of the cellular state with small chemical compounds that target the active site of specific proteins thereby enhancing or diminishing its function. All these technologies could potentially be used singly or in combination for a real implementation of our methodology. In recent years, however, RNA interference (Dykxhoorn et al., 2003) has emerged as the predominant tool with valuable properties such as specificity (Chi et al., 2003) and scalability (Luo et al., 2004). Owing to these desirable properties, this technique is simulated in our numerical study. However, one must keep in mind that experimental realization of these assays are rarely as perfect as the computational counterpart, since not all genes in a network will have the appropriate siRNA for its perturbation (as is assumed in our method), and that some siRNA can silence non-intended targets (Jackson et al., 2003). Despite these limitations, efforts are underway to produce libraries of thousands of siRNAs (i.e. Paddison et al., 2004) suitable for applications such as large-scale functional genomics studies in the lines of which we are simulating here.

Sample data are shown in Figure 1 for a 30-node random network. The rows correspond to different nodes, and the columns correspond to different experiments. In each row, the data (ui) have been scaled to obtain mean 0 and variance 1. The color scale goes from saturated green (≤–2) to saturated red (>2). Node 30 (lowest row, the mutated gene) is at its steady-state (plus fluctuations) value in the first 10 replicate experiments (corresponding to the wild-type state) and then is forced to zero for the next 10 experiments. The response of the other nodes often shows correlated (e.g. node 26) or anti-correlated (e.g. node 25) behavior with respect to the mutated node. This correlated behavior usually depends on the order of the connection between the mutated node and the other nodes. The order of the connection between a source node and a destination node corresponds to the minimum number of connections that we need to traverse to go from the former to the latter. In Figure 1A, both nodes 25 and 26 have a first-order connection to the mutated node. In Figure 1B, some examples of first-order connections are between nodes p and q and between nodes p and r. Similarly, there are second-order connections between nodes q and s and between nodes p and s. Such higher order connections can also produce correlations (e.g. nodes 1 and 7 have connections of the orders 2 and 3, respectively, to the mutated node).



View larger version (44K):
[in this window]
[in a new window]
 
Fig. 1 (A) Data from a 30-node network. Rows are scaled with zero mean and unit variance. Color scale goes from saturated green (≤–2) to saturated red (>2). Node 30 is at its stationary state for the first 10 experiments and forced to 0 for the next 10 experiments. Row labels are the node number (left) and the order of connection to node 30 (right). (B) Illustration of the order of connection between two nodes. Node p is connected to node q and node r via first-order connections. Nodes q and s and nodes p and s have a second-order connection.

 
Data processing.For each mutated node, the conditional pairwise correlation ({rho}lm) between node l and node m given that node l was perturbed is computed as

(3)
where ul(i) and um(i) are the measurements corresponding to nodes l and m in experiment i, the first k experiments (1 ≤ i ≤ k) refer to the wild-type network (no mutation to any gene), the second k experiments (k + 1 ≤ i ≤ 2k) correspond to the network with gene l mutated and <ul> is the average of ul(i) through the 2k experiments. Hence, we compute a total of N – 1 conditional correlations for each mutated node. Note that {rho}lm is a non-symmetric matrix. The asymmetry stems from the fact that we were conditioning the correlation on the node corresponding to the first index (node l) being perturbed. In what follows we will refer to {rho}lm as the correlation between nodes l and m, but it should be kept in mind that this is a conditional correlation. As a simple baseline reconstruction algorithm, a positive threshold correlation value ({rho}thresh) was chosen. Subsequently, a putative connection was assumed from node l to node m when the correlation between these two nodes exceeds the threshold (|{rho}lm| > {rho}thresh). The sign of the correlation coefficient was then used to give the edge a sign (i.e. ‘+’ is activating and ‘–’ is repressing).

We can compare the reconstructed connections to the actual connections present in the network used to generate the data. A true positive (TP) corresponds to a correctly reconstructed edge with the correct sign, and a true negative (TN) corresponds to correctly determining that no edge is present. A false negative (FN) refers to a failure to correctly predict the existence of an edge. A false positive (FP) corresponds to a predicted edge that does not exist in the original network, or one that exists but has the opposite sign.

We assess the performance of the reconstruction as a function of {rho}thresh by computing the Receiver Operator Characteristics (ROCs). In these curves, a fractional error is computed as the sum of FP and FN rates and plotted as a function of {rho}thresh. The FP rate is computed as the number of FPs divided by the sum of FPs and TNs. Similarly, the FN rate is computed as the number of FNs divided by the sum of TPs and FNs. It is customary to represent ROC curves as plots in which the TP rate is plotted as a function of the FP rate, but in such a representation the parameter used to traverse the ROC curve (in our case {rho}thresh) is implicit. As one of our main objectives is the determination of an optimal {rho}thresh, we decided to draw the ROC as the sum of the FP rate and the FN to allow for a clear visualization of the effect of the {rho}thresh in the network reconstruction. For this study, each ROC corresponds to the reconstruction of a single network [see Supplementary materials 1 (SM1)].


    RESULTS
 TOP
 Abstract
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION AND CONCLUSIONS
 REFERENCES
 
Effects of noise on network reconstruction
Figure 2 shows ROCs for reconstructions with different noise strengths {varepsilon}, for random networks with 100 nodes, ODmin = 1, ODmax = 5 and {gamma},ß = 1. The case for no noise ({varepsilon} = 0) is shown by the flat thin trace in Figure 2. This ROC has no {rho}thresh dependence. Here, the number of FNs is zero, but many FPs arise from the perfect correlations between nodes connected at orders of two or more. For {varepsilon} = 0.01, the FPs greatly increase for low values of {rho}thresh, whereas the FNs drop abruptly close to {rho}thresh = 1. Interestingly, the small amount of added noise lowers the minimum error below that found with no noise, a result that runs against intuition. This is because a moderate amount of noise strength masks higher order connections that produce FP while affecting the correlation of first-order connections to a lesser extent (see SM2). With increasing noise ({varepsilon} = 0.1), the ROC is more U-shaped as FNs increase at smaller {rho}thresh values. As the noise level increases, the error rates generally increase over the whole range of {rho}thresh due to a greater number of FNs.



View larger version (26K):
[in this window]
[in a new window]
 
Fig. 2 ROC curves as function of {rho}thresh. The different traces (solid lines) show ROCs for different noise strengths for a 100-node random network with ODmin = 1 and ODmax = 5. ROC curves are U-shaped. Error rates tend to increase as noise strength is increased. For noise strengths 0.9 and 0.5, we plotted independently the FN rate dependence as a function of the correlation threshold (‘x’ signs for {varepsilon} = 0.9 and open diamonds for {varepsilon} = 0.5) as well as the FP rate dependence on the correlation threshold (plus symbols for {varepsilon} = 0.9; the similar curve for {varepsilon} = 0.5 is almost indistinguishable from plus symbols and thus is not represented in the figure).

 
With moderate levels of noise, order 1 connections can be distinguished from the higher order connections, but many connections of order 2–4 produce similar correlations as first-order connections giving rise to FPs. With very noisy data even the correlation between first-order connections becomes smaller making first-order connections harder to distinguish on the basis of correlation, and the FN rate increases markedly ({varepsilon} =0.5 and 0.9 in Fig. 2).

Effects of network properties on reconstruction of synthetic networks
In the remaining sections of the paper we fix {varepsilon} = 0.1, as it produces a U-shaped ROC that is typical of higher noise strengths. Figure 3A shows ROCs for reconstructions of networks with 30, 100 and 300 nodes, ODmin = 1 and ODmax = 5. For these reconstructions, the error rate increases as the number of nodes decreases. The primary effect at play is that as the network size shrinks, the whole network becomes more connected and the number of higher order connections increases, giving rise to FPs as correlations exist between nodes that are not directly connected. The number of higher order connections decreases as the network size increases but does not completely go away.



View larger version (21K):
[in this window]
[in a new window]
 
Fig. 3 ROC dependence on network parameters. For this and remaining figures, unless otherwise specified random networks have 100 nodes, {varepsilon} = 0.1, {gamma},ß = 1, and ODmin = 1 and ODmax = 5. (A) ROCs for networks with 30, 100 and 300 nodes. The error rate increases as the number of nodes decreases. (B) ROCs for networks in which each node has an OD of exactly 1, 3 and 5. The error rates increase as the OD increases. (C) ROCs for {gamma} and ß taking values 1, 3, 5 or in the range 1–5, as labeled. The best reconstruction occurs when all exponents are 1 and worsens for {gamma} and ß taking larger values.

 
Figure 3B shows the effects of node OD on ROCs for networks in which each node has the same OD equal to 1, 3 and 5 as labeled. As the degree increases, the error rates increase as a result of the generation of more connections of order >1 which are prone to produce FPs. In general, sparse networks with many disconnected nodes tend to be reconstructed with low error rates given that the disconnected nodes are poorly correlated and are thus less likely to generate FPs.

Figure 3C shows the effects of varying {gamma} and ß, the exponents on the inputs to each node in Equation (2). For this set of reconstructions, the topology of the connections is the same. Reconstruction is most accurate when all exponents are equal to 1. This result can be explained as follows. Exponents of magnitude 1 produce a fairly linear input–output relationship. Higher exponents produce a steeper and less linear response. It is not surprising that the linear model of interactions provides the best reconstruction results as our method is based on correlations, a statistical measure designed to detect the linear component of dependences. Hence, the ROCs show increasingly higher error rates when the exponent magnitudes are all set to 3 or 5. One other case is shown in which the exponent magnitudes are uniformly distributed between 1 and 5. Unlike the other runs, this network is heterogeneous with respect to the connection exponents. The ROC for this heterogeneous network with the mean exponent to 3 is quite similar to that where all exponents are set to 3. Hence, there appears to be little effect of homogeneous versus heterogeneous exponents.

Methods to reduce false positives
Low error rates are achieved when the correlation of first-order connections are larger than that of higher order connections and unconnected nodes. However, order 1 connections cannot always be easily distinguished from order 2–4 connections on the basis of correlation alone, and hence some number of FPs seems unavoidable. To improve our ability to reconstruct networks, we employed two strategies. First, we attempt to choose an optimal threshold value ({rho}opt) to distinguish first from higher order connections. Second, we attempt to reject FPs that are likely to be a result of second-order connections masquerading as first-order connections. These two strategies are discussed in the next section.

To choose an optimal threshold value, we generate a model of the correlation distribution. First, the logarithm of the normalized histogram of the measured correlation was computed (Fig. 4, thin jagged line). This distribution shows a first mode at {rho} = 0 due to the correlation between weakly connected nodes, and a second mode close to {rho} ~ 1 presumably corresponding to directly connected nodes. The thick line in Figure 4 is a fit to the histogram based on a model with two separate components. The first component is the theoretical distribution of the correlation between two randomly chosen k-dimensional vectors (where k will be fitted by enumerative technique from the data) given by the formula Prandom ({rho};k) = 2(1 – {rho}2)(k–3)/2/{Omega}(k), where

[with {Gamma}(k) being the Gamma function, see SM3]. The second component accounts for the contribution of the highly correlated nodes, and will be modeled with a truncated Gaussian distribution:

Our correlation distribution model consists of the mixture of these two components as

(4)
where f is a weighing factor. The parameters f, k, µ and {sigma} were chosen to minimize the root mean square error between the histogram and Pfit ({rho},k,f,{sigma}) over the range 0 ≤ {rho} ≤ 1.



View larger version (18K):
[in this window]
[in a new window]
 
Fig. 4 Logarithm of the probability density function (pdf) of the correlations between the mutated node and all the other nodes, after mutating all the nodes in the network one by one (jagged thin line) and its theoretical fit [thicker line; Equation (4)]. The fit consists of a weighted mixture of a first component corresponding to random correlations (thin line at the left) and a second component produced by first-order connections in the network (thin parabolic trace at the right). The inset shows a plot of the same two components without the weighting factor term. The intersection of these two curves, shown by the arrow at 0.645 is {rho}opt, the optimal threshold that balances FPs and FNs. This {rho}opt differs from the {rho}thresh at which the theoretical fit attains its minimum, shown by an arrow in the main plot.

 
A maximum-likelihood calculation (see SM4) shows that {rho}opt is the value of {rho} for which Prandom ({rho};k) = Pcorr({rho};µ,{sigma}) (Fig. 4, inset). {rho}opt is the {rho}thresh that just balances competing effects: a smaller {rho}thresh increases the FPs and a larger {rho}thresh increases the FNs. Note that the {rho}opt (shown by the arrow in the inset of Fig. 4) does not correspond to the minimum in the fitted histogram (shown by the arrow in the main plot in Fig. 4).

The second strategy to improve reconstructions is to reduce FPs by looking for second-order connections that can explain high correlations between two nodes. To do this, we inspect all possible second-order paths between each pair of nodes whose correlation is above {rho}opt. Using the scheme of Figure 1B, for any pair of nodes q and s, if there exists a node r such that min(|{rho}qr|,|{rho}rs|) > | {rho}qs| and sign({rho} qr · {rho}rs) = sign({rho}qs), then we do not infer a direct connection between q and s even if the condition |{rho}qs| > {rho}opt holds. The net result is the removal of FP connections if an alternative explanation can be found based on a second-order connection. This procedure, which we call the triangle reduction construction, can also generate FNs. For example, a TP edge between nodes p and r could be wrongly removed if the second-order path through node q shows higher correlations (Fig. 1B).

The triangle reduction scheme is a heuristic criterion. However, under some assumptions it can be shown that if nodes q and s are not connected but nodes q and r and nodes r and s are, then it must be true that min(|{rho}qr|,|{rho}rs|) > |{rho}qs|. This means that the triangle reduction criterion is a necessary, albeit not sufficient, condition for the absence of a connection (see SM5).

We applied the two strategies outlined above sequentially. First, the {rho}opt was determined. Next, the putative first-order connections were determined for this threshold value. Then, the triangle reduction method was applied to remove FPs. The decrease in FPs by application of the triangle reduction is slightly dependent on the order in which the triangles were chosen. For our results, we explored all the triangles in random order. Table 1 shows {rho}opt, the number of TPs, FPs and FNs before and after the application of the triangle reduction for all the networks reconstructed in this paper. It can be seen that the number of FPs is dramatically reduced for all the cases considered, without a serious increase in FNs.


View this table:
[in this window]
[in a new window]
 
Table 1 Reconstruction improvement from the triangle reduction rule. The results before (pre) and after (post) the application of the triangle reduction rule. Unless otherwise specified, {gamma},ß = 1, ODmin = 1 and ODmax = 5

 
Reconstruction of the transcriptional control network of E.coli
Biological networks have very different properties than their random counterparts, with certain ‘motifs’ being over- or under-represented in real versus random networks (Milo et al., 2002; Shen-Orr et al., 2002). To compare the reconstructability of biological versus the random networks considered so far, we generated a network using the connection topology of the E.coli transcription network adapted from the study of Shen-Orr et al. Figure 5 shows results for the reconstructed E.coli network as the thick solid trace. Similar to earlier figures, we chose the sum of FP rate and FN rate as the measure of error. The inset in Figure 5 shows the decomposition of this error into the FP rate, which decreases as {rho}thresh increases, and the FN rate, which increases as {rho}thresh increases. It is clear that the FP rate has decreased considerably before the FN rate starts to increase under these conditions, making it relatively easy to separate their respective effects. The minimum error rate is 0.01 which occurs at {rho}thresh = 0.62 (shown by the solid arrow). Recall that we are computing the error rates as the sum of FP and FN rates. Therefore, our error rates give equal weight to the FP and FN rates. The minimum error rate of ~1% appears on the surface to be very good, but some care must be taken in its interpretation (see SM6). This value of {rho}thresh produces 929 FPs, far exceeding the 518 actual connections in the network. Our method of computing {rho}opt by optimization of the likelihood (SM4) does not impose equal weighting of the FP and FN rate, and clearly our high rate of FPs illustrates the need to reduce the number of FPs. As shown in Table 1 the number of FPs is reduced to 212 by choosing {rho}opt = 0.72 (shown by the dashed arrow in Fig. 5) via the maximization of the likelihood. With the application of the triangle reduction, the number of FPs is further reduced to 85 (a reduction of 127 FPs). This decrease in FPs produced only a moderate increase of 13 FNs. Such an approach may be justified as investigators balance their need to err in the direction of FPs and FNs.



View larger version (20K):
[in this window]
[in a new window]
 
Fig. 5 Reconstruction of E.coli transcriptional network. The ROC for the reconstructed E.coli network (thick solid line) has a minimum error (i.e. minimum sum of FP and FN rates) of 0.01 at {rho}thresh = 0.62 (solid arrow) but the {rho}opt found by the maximum-likelihood method yields {rho}opt = 0.72 (dashed arrow). ROCs are also shown for random networks of similar size. The thick gray trace shows results for a random network with ODmin = 1 and ODmax = 3. The thin dotted trace is the ROC for a degree-preserving randomized version of the original E.coli network. The inset shows the relative contribution of the FP and FN rate to the total error rate in the neighborhood of the minimum of the total error.

 
Next, we investigated how the reconstruction of the E.coli network compares to reconstructions of random networks of similar size. The first random network is constructed with ODmin = 1 and ODmax = 3 to produce an average OD of 1.19 (similar to 1.22 for the original E.coli network). The error rate is much higher for the random network (thick gray trace in Fig. 5). Two factors increase the error rate in the random network. The first source of error is that the number of connections of order >1 is larger in the random network in which each node has at least one outgoing connection. By construction, every node is connected to at least one other node by a first-order connection, at least one other by a second-order connection, at least one other by a third-order and so on. This increase in the number of connections of higher order is a potential source of FPs. In contrast, the E.coli network is more star-like with a few nodes with high OD and many nodes with an input degree (ID) of 1 and an OD of 0. A second source of error is an increase in the ID of nodes in the random network. The E.coli network has relatively few nodes with high OD and many nodes with an ID of 1. These nodes are the easiest to reconstruct as the correlation tends to be highest when the receiving node has a single input. In contrast, as the number of input nodes increases, each input contributes less to the overall output decreasing the measured correlation. Thus, one expects direct connection to ID 1 nodes to be correctly identified more readily than direct connection to nodes with larger ID. For the examples shown, the E.coli network has 228 connections to nodes of ID 1 and 142 connections to nodes of ID 2. For comparison, the random network considered here had 157 connections to nodes with ID 1 and 186 connections to nodes of ID 2. The net effect of this difference is an increase in the FN rate, which can be clearly seen in Figure 5 as the thick gray line attains larger values than the solid line to the right of the arrows.

To check that the ID of reconstructed nodes has an important effect on the error, we developed another random network where the ID and OD of each node is preserved as the internode connections are rearranged. This degree-preserving randomized network is reconstructed to approximately the same error level as that of the original E.coli network (compare thick solid and dashed traces in Fig. 5). The small differences in traces are within reproducibility bounds. We conclude that the distribution of ID and OD of nodes plays an important role in the ability of this method to reconstruct networks. Note that sparseness is only part of the story: all three networks just considered are similarly sparse if one just computes the ratio of connections to nodes (see also SM7).


    DISCUSSION AND CONCLUSIONS
 TOP
 Abstract
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION AND CONCLUSIONS
 REFERENCES
 
We have shown that our method can reconstruct networks with a wide range of sizes, connectivity and noise levels with a reasonable error rate. We now discuss some limitations of our methodology.

Foremost, we are only attempting to reconstruct the connections between nodes and whether the connections exert positive or negative influences. While this information can provide the biologist with useful hypotheses, it clearly fails to capture all the richness of biological interactions. Our method does not reconstruct interactions that are activating and inhibiting as is sometimes found in biology. For instance, the original E.coli transcription network contains 24 cases of these dual interactions, 10 of which are associated with a single transcription factor that regulates 72 targets. Hence, at least in this system, dual regulation occurs in a minority of cases and we expect that a majority of connections can be reconstructed.

We chose to ignore self-loops in this initial study. Self-loops may increase the difficulty in reconstructing networks. Positive autoregulation is likely to cause nodes to be more binary-like and are thus less linear. From the results of Figure 3C, reconstructions degrade as the node input–output relationship becomes less linear. The possible effects of negative autoregulation will depend on the function. Negative feedback is thought to decrease the temporal delay between an increase in activating transcription factors and a subsequent increase in the corresponding protein level (Rosenfeld et al., 2002). This type of regulation should have little effect on our methods as the data are collected at equilibrium. In contrast, negative-regulation may also reduce fluctuations in output in the face of changing conditions such as a method proposed to promote robustness (Becskei and Serrano, 2000). In this case, a node may show less of a response to small changes in its input and reconstruction may be hampered, but further study is needed to quantify this effect.

Our continuum-based network dynamics [Equation (2)] may not capture the discrete nature of nanoscale environments (McAdams and Arkin, 1999). However, a similar continuum-based model could reproduce bacterial transcriptional events as reported by high-temporal-resolution measurements in living cells by using green fluorescent protein reporter plasmids (Ronen et al., 2002). Hence, a continuum-based approach appears to be reasonable, at least for a first approximation model of the system.

Noise plays a critical role in our reconstruction method. In the noiseless case, all higher order connections are perfectly correlated giving rise to many FPs. The addition of a small amount of noise decreases the number of FPs as noise masks higher order connections (Fig. 2). However as noise increases further, the error rate increases. When substantial noise is present, first-order connections do not produce correlation magnitudes much larger than those of connections of higher orders (see SM2). In principle one can compensate for this error by considering more data. However, this approach does not fully compensate for high noise levels because more data points makes FPs to be reduced but not the FNs (see SM8). The resulting effect of taking more data is to shift {rho}opt to lower {rho}thresh values with a narrower region around the minimum error in the ROC. In real biological experiments, collecting large amounts of data may not be feasible.

As mentioned above, more data can be collected to improve reconstruction. Indeed our model for the correlation distribution can be used to estimate the number of measurements needed to obtain a reasonably low error rate. In this paper, we collected 10 wild-type replicates and 10 replicates for each mutated node (a reasonable estimate of what might be collected experimentally) to reconstruct each node for a total of 10(N + 1) samples for the whole network. If each sample corresponds to a gene chip that samples M gene expression levels, then the network reconstruction requires a total of 10(N + 1)M gene expression levels to be measured. Note that the new data are collected to reconstruct each node, but the data collected for one node could be reused for other nodes.

Our triangle reduction scheme has some potential limitations. For example, a source node can have a direct connection to a target as well as an indirect path through an intermediate node. When the indirect path has the same net sign as the direct path, the subnetwork has been termed a coherent feed-forward motif (Shen-Orr et al., 2002). The same study showed that coherent feed-forward motifs are over-represented in the transcriptional regulatory network in E.coli compared to random graphs of similar size. If this result holds across all biological networks, then the effectiveness of the triangle reduction may be limited as it will increase the number of FNs. However, for our reconstruction, we still achieved considerable reduction in FPs with a small increase in FNs (Table 1).

Despite the limitations discussed above, our method has some worthy advantages. We can reconstruct a wide variety of network sizes and topologies with some potential sources of error that we have characterized. While our method only provides qualitative information on the connections, we suspect that this output will be useful to the biologists, even if it contains some errors. For example, a mostly correct set of connections could provide a valuable first step in reconstructing complex biological interactions. While one might desire a full set of kinetic equations to describe the network, this goal does not appear to be tractable in a general sense at present. Reconstructions of synthetic gene networks have been attempted but assumptions of sparseness and linearity are required as well as very well-characterized derivatives of the signals (Yeung et al., 2002). Calculation of derivatives can be problematic with typically noisy biological systems. The kinetic equations for small metabolic networks have been reconstructed (Koza et al., 2001) but the method requires a priori assumptions about node connections as well as tightly controlled manipulation of node inputs that may not be biologically feasible. Clearly, no method is a one-size-fits-all in the world of systems biology. Indeed, untangling complex biological networks will almost certainly require flexible and integrated use of both data and tools. We hope that our pairwise conditional correlation method can be one of the valuable tools in this quest.


    Acknowledgments
 
We thank A. Kershenbaum for his valuable advice on many aspects of graph theory. We also thank K. Duggar, T. Syeda-Mahamood, A. Califano and C. Wiggins for their useful discussions.

Received on February 4, 2004; revised on August 3, 2004; accepted on August 23, 2004

    REFERENCES
 TOP
 Abstract
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION AND CONCLUSIONS
 REFERENCES
 

    Arkin, A. and Ross, J. (1995) Statistical construction of chemical reaction mechanism from measured time-series. J. Phys. Chem., 99, 970–979[CrossRef].

    Arkin, A., Shen, P., Ross, J. (1997) A test case of correlation metric construction of a reaction pathway from measurements. Science, 29, 1275–1279.

    Babu, M.M., Luscombe, N.M., Aravind, L., Gerstein, M., Teichmann, S.A. (2004) Structure and evolution of transcriptional regulatory networks. Curr. Opin. Struct. Biol., 14, 283–291[CrossRef][Web of Science][Medline].

    Barabasi, A.L. Linked, the New Science of Networks, (2002) , Cambridge, MA Perseus Publishing.

    Barabasi, A.L. and Albert, R. (1999) Emergence of scaling in random networks. Science, 286, , pp. 509–512[Abstract/Free Full Text].

    Becskei, A. and Serrano, L. (2000) Engineering stability in gene networks by autoregulation. Nature, 405, 590–593[CrossRef][Medline].

    Butte, A.J. and Kohane, I.S. (2000) Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. Pac. Symp. Biocomput., 418–429.

    Butte, A.J., Tamayo, P., Slonim, D., Golub, T.R., Kohane, I.S. (2000) Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks. Proc. Natl Acad. Sci. USA, 97, 12182–12186[Abstract/Free Full Text].

    Chevalier, T., Schreiber, I., Ross, J. (1993) Toward a systematic determination of complex reaction mechanisms. J. Phys. Chem., 97, 6776–6787[CrossRef].

    Chi, J.T., Chang, H.Y., Wang, N.N., Chang, D.S., Dunphy, N., Brown, P.O. (2003) Genomewide view of gene silencing by small interfering RNAs. Proc. Natl Acad. Sci. USA, 100, 6343–6346[Abstract/Free Full Text].

    D'Haeseleer, P., Wen, X., Fuhrman, S., Somogyi, R. (1999) Linear modeling of mRNA expression levels during CNS development and injury. Pac. Symp. Biocomput., 41–52.

    Dykxhoorn, D.M., Novina, C.D., Sharp, P.A. (2003) Killing the messenger: short RNAs that silence gene expression. Nat. Rev. Mol. Cell. Biol., 4, 457–467[CrossRef][Web of Science][Medline].

    Friedman, N., Linial, M., Nachman, I., Pe’er, D. (2000) Using Bayesian networks to analyze expression data. J. Comput. Biol., 7, 601–620[CrossRef][Web of Science][Medline].

    Gardner, T.S., di Bernardo, D., Lorenz, D., Collins, J.J. (2003) Inferring genetic networks and identifying compound mode of action via expression profiling. Science, 301, 102–105[Abstract/Free Full Text].

    Hartemink, A.J., Gifford, D.K., Jaakkola, T.S., Young, R.A. (2001) Using graphical models and genomic expression data to statistically validate models of genetic regulatory networks. Pac. Symp. Biocomput., 422–433.

    Jackson, A.L., Bartz, S.R., Schelter, J., Kobayashi, S.V., Burchard, J., Mao, M., Li, B., Cavet, G., Linsley, P.S. (2003) Expression profiling reveals off-target gene regulation by RNAi. Nat. Biotechnol., 21, 635–637[CrossRef][Web of Science][Medline].

    Jeong, H., Mason, S.P., Barabasi, A.L., Oltvai, Z.N. (2001) Lethality and centrality in protein networks. Nature, 411, 41–42[CrossRef][Medline].

    Jeong, H., Tombor, B., Albert, R., Oltvai, Z.N., Barabasi, A.L. (2000) The large-scale organization of metabolic networks. Nature, 407, 651–654[CrossRef][Medline].

    Kholodenko, B.N., Kiyatkin, A., Bruggeman, F.J., Sontag, E., Westerhoff, H.V., Hoek, J.B. (2002) Untangling the wires: a strategy to trace functional interactions in signaling and gene networks. Proc. Natl Acad. Sci. USA, 99, 12841–12846[Abstract/Free Full Text].

    Koza, J.R., Mydlowec, W., Lanza, G., Yu, J., Keane, M.A. (2001) Reverse engineering of metabolic pathways from observed data using genetic programming. Pac. Symp. Biocomput., 6, 434–445.

    Liang, S., Fuhrman, S., Somogyi, R. (1998) Reveal, a general reverse engineering algorithm for inference of genetic network architectures. Pac. Symp. Biocomput., 18–29.

    Luo, B., Heard, A.D., Lodish, H.F. (2004) Small interfering RNA production by enzymatic engineering of DNA (SPEED). Proc. Natl Acad. Sci. USA, 101, 5494–5499[Abstract/Free Full Text].

    McAdams, H.H. and Arkin, A. (1999) It's a noisy business! Genetic regulation at the nanomolar scale. Trends Genet., 15, 65–69[CrossRef][Web of Science][Medline].

    Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D., Alon, U. (2002) Network motifs: simple building blocks of complex networks. Science, 298, 824–827[Abstract/Free Full Text].

    Paddison, P.J., Silva, J.M., Conklin, D.S., Schlabach, M., Li, M., Aruleba, S., Balija, V., O'Shaughnessy, A., Gnoj, L., Scobie, K., Chang, K., et al. (2004) A resource for large-scale RNA-interference-based screens in mammals. Nature, 428, 427–431[CrossRef][Medline].

    Rice, J.J. and Stolovitzky, G.A. (2004) Making the most of it: pathway reconstruction and integrative simulation using the data at hand. Biosilico, 2, 70–77.

    Ronen, M., Rosenberg, R., Shraiman, B.I., Alon, U. (2002) Assigning numbers to the arrows: parameterizing a gene regulation network by using accurate expression kinetics. Proc. Natl Acad. Sci. USA, 99, 10555–10560[Abstract/Free Full Text].

    Rosenfeld, N., Elowitz, M.B., Alon, U. (2002) Negative autoregulation speeds the response times of transcription networks. J. Mol. Biol., 323, 785–793[CrossRef][Web of Science][Medline].

    Samoilov, M., Arkin, A., Ross, J. (2001) On the deduction of chemical reaction pathways from measurements of time series of concentrations. Chaos, 11, 108–114[CrossRef][Web of Science][Medline].

    Segal, E., Shapira, M., Regev, A., Pe'er, D., Botstein, D., Koller, D., Friedman, N. (2003) Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nat. Genet., 34, 166–176[CrossRef][Web of Science][Medline].

    Shen-Orr, S.S., Milo, R., Mangan, S., Alon, U. (2002) Network motifs in the transcriptional regulation network of Escherichia coli. Nat. Genet., 31, 64–68[CrossRef][Web of Science][Medline].

    Smith, V.A., Jarvis, E.D., Hartemink, A.J. (2003) Influence of network topology and data collection on network inference. Pac. Symp. Biocomput., 164–175.

    Troyanskaya, O.G., Dolinski, K., Owen, A.B., Altman, R.B., Botstein, D. (2003) A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). Proc. Natl Acad. Sci., USA, 100, 8348–8353[Abstract/Free Full Text].

    Uetz, P., Giot, L., Cagney, G., Mansfield, T.A., Judson, R.S., Knight, J.R., Lockshon, D., Narayan, V., Srinivasan, M., Pochart, P., Qureshi-Emili, A., et al. (2000) A comprehensive analysis of protein–protein interactions in Saccharomyces cerevisiae. Nature, 403, 623–627[CrossRef][Medline].

    van Someren, E.P., Wessels, L.F., Backer, E., Reinders, M.J. (2002) Genetic network modeling. Pharmacogenomics, 3, 507–525[CrossRef][Web of Science][Medline].

    Wagner, A. (2002) Estimating coarse gene network structure from large-scale gene perturbation data. Genome Res., 12, 309–315[Abstract/Free Full Text].

    Woolf, P.J. and Wang, Y. (2000) A fuzzy logic approach to analyzing gene expression data. Physiol. Genomics, 3, 9–15[Abstract/Free Full Text].

    Yeung, M.K., Tegner, J., Collins, J.J. (2002) Reverse engineering gene networks using singular value decomposition and robust regression. Proc. Natl Acad. Sci. USA, 99, 6163–6168[Abstract/Free Full Text].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
D. Ucar, I. Neuhaus, P. Ross-MacDonald, C. Tilford, S. Parthasarathy, N. Siemers, and R.-R. Ji
Construction of a reference gene association network from multiple profiling data: application to data analysis
Bioinformatics, October 15, 2007; 23(20): 2716 - 2724.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
K. Missal, M. A. Cross, and D. Drasdo
Gene network inference from incomplete expression data: transcriptional control of hematopoietic commitment
Bioinformatics, March 15, 2006; 22(6): 731 - 738.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/6/765    most recent
bti064v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (12)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Rice, J. J.
Right arrow Articles by Stolovitzky, G.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Rice, J. J.
Right arrow Articles by Stolovitzky, G.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?