Skip Navigation


Bioinformatics Advance Access originally published online on August 30, 2005
Bioinformatics 2005 21(21):4007-4013; doi:10.1093/bioinformatics/bti648
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/21/4007    most recent
bti648v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (6)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Xing, B.
Right arrow Articles by van der Laan, M. J.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Xing, B.
Right arrow Articles by van der Laan, M. J.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oxfordjournals.org

A causal inference approach for constructing transcriptional regulatory networks

Biao Xing 1,* and Mark J. van der Laan 2

1Genentech Inc. 1 DNA Way, South San Francisco, CA 94080, USA
2Division of Biostatistics, School of Public Health, University of California 140 Warren Hall #7360, Berkeley, CA 94720, USA

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 1 INTRODUCTION
 2 METHODS
 3 DATA ANALYSIS
 4 DISCUSSION
 REFERENCES
 

Motivation: Transcriptional regulatory networks specify the interactions among regulatory genes and between regulatory genes and their target genes. Discovering transcriptional regulatory networks helps us to understand the underlying mechanism of complex cellular processes and responses.

Method: This paper describes a causal inference approach for constructing transcriptional regulatory networks using gene expression data, promoter sequences and information on transcription factor (TF) binding sites. The method first identifies active TFs in each individual experiment using a feature selection approach. TFs are viewed as ‘treatments’ and gene expression levels as ‘responses’. For every TF and gene pair, a marginal structural model is built to estimate the causal effect of the TF on the expression level of the gene. The model parameters can be estimated using the G-computation procedure or the IPTW estimator. The P-value associated with the causal parameter in each of these models is used to measure how strongly a TF regulates a gene. These results are further used to infer the overall regulatory network structures.

Results: Our analysis of yeast data suggests that the method is capable of identifying significant transcriptional regulatory interactions and the corresponding regulatory networks.

Availability: The software is under development.

Contact: xing.biao{at}gene.com


    1 INTRODUCTION
 TOP
 Abstract
 1 INTRODUCTION
 2 METHODS
 3 DATA ANALYSIS
 4 DISCUSSION
 REFERENCES
 
Transcriptional regulatory networks specify the interactions among regulatory genes and between regulatory genes and their target genes. Discovering transcriptional regulatory networks is an important scientific task since it helps us to understand the underlying mechanism of complex cellular processes and responses.

There is a rich literature in methods for inferring regulatory networks. Lee et al. (2002) and Bar-Joseph et al. (2003) used the experiment-based genome-wide location analysis to investigate how yeast transcription factors (TF) bind to promoter sequences across genome, then constructed TF–promoter binding networks to infer transcriptional regulatory networks. Location analysis experiments provide in vivo evidence of TF binding to genes. However, physical binding does not directly imply transcriptional regulatory activities. Moreover, location analysis experiments are often restricted to a certain growth condition. As a result, TF-promoter binding network structures specific to other growth conditions may not be observed.

As microarray data on gene expression programs become available, various statistical data mining tools have been devised for discovering (often more broadly defined) gene networks, for example, reverse engineering approaches (Somogyi et al., 1997; Liang et al., 1998; D'haeseleer et al., 2000), differential equations (Chen et al., 1999; D'haeseleer et al., 1999), Bayesian networks (Friedman et al., 2000; Yoo et al., 2002), etc. These methods often require large number of time-course data or rely on very greedy computational strategies.

Some other computational methods attempt to integrate gene expression data, DNA sequences and functional annotations into a comprehensive framework for discovering transcriptional regulatory networks (Pilpel et al., 2001; Wang et al., 2002; Segal et al., 2003; Beer and Tavazoie, 2004). These methods allow one to infer motif-to-gene or to some extent gene-to-gene regulatory networks.

Xing and van der Laan (2005) described a statistical approach for constructing transcriptional regulatory networks using gene expression, promoter sequence and TF binding site data. This approach first identifies TFs that are significantly associated with changes in gene expression profile under each experiment condition, then estimates the strength with which a regulatory gene regulates a potential target gene, and finally averages evidence across experiments to infer the transcriptional regulatory network structures. This method employs a naive normal mixture model to estimate the strength with which a regulatory gene regulates a potential target gene. The normal mixture model is chosen for computational convenience. There are some concerns with the appropriateness of using a normal or other mixture models, e.g. the data may not appear to be normally distributed and consequently the model estimation may be poor. In addition, the mixture model is built on the transformed gene expression data for each TF separately. There is not enough control for the possible confounding effects of other TFs on the estimated transcriptional regulatory interactions between the TF under analysis and its potential target genes.

Here we describe an alternative approach based on causal inference methodology. We can view each gene as a subject and each TF as analogous to a ‘treatment’, which may have direct causal effects on the ‘responses’ (i.e. expression levels) of its target genes whose regulatory region is able to be bound by the TF when it is active in an experiment. For genes whose regulatory region does not contain the TF binding sites, there is no direct causal effect of the TF on the expressions of those genes. More specifically, a ‘treatment’ variable is created for each TF and coded into 1 when the TF is active and 0 when the TF is inactive in an experiment. Then for each experiment we will have a vector of ‘treatments’ associated with all the TFs, some of which may be active and some may be not. For different experiments, there will be different combinations of ‘treatments’ as the active or inactive status of the TFs can be different across experiments. The changes in the expression level of a gene across experiments are seen as results from different combinations of ‘treatments’ under different experiment conditions.

It should be noted that here a ‘treatment’ is not an usual treatment as we see in a controlled clinical trial, which is known prior to study and is highly manipulable. In a typical microarray gene expression experiment, the TF activities may not be deliberately controlled at all or may be under only limited manipulation. To our purpose of constructing transcriptional regulatory networks, we will need to first estimate the activities of TFs in an experiment and treat the TFs as ‘treatments’ that are causally responsible for the changes in gene expression levels.

Under this framework, we describe in the next section a causal inference approach for constructing transcriptional regulatory networks using gene expression data, promoter sequences and TF binding sites. In Section 3, we apply the method to yeast data to study the yeast transcriptional regulatory network. We conclude with a discussion of the uses and limitations of our method in Section 4.


    2 METHODS
 TOP
 Abstract
 1 INTRODUCTION
 2 METHODS
 3 DATA ANALYSIS
 4 DISCUSSION
 REFERENCES
 
2.1 A brief introduction to causal inference methods
In causal inference, one is concerned with estimation of the causal effect (a parameter with a causal interpretation) of a variable that can be manipulated (e.g. a treatment) on an outcome of interest, possibly adjusted for other variables. Robins (1986, 1999a, b), Robins et al. (2000) and van der Laan and Robins (2002) described causal inference methods for estimating the average marginal causal effect of treatment A on outcome Y adjusted for covariates V W, based on longitudinal data involving time-independent or time-dependent treatments. The marginal causal effects are defined using the concept of counterfactual. The counterfactual Ya represents the random variable Y one would have observed, if, possibly contrary to the fact, one would have ‘assigned’ A = a, where , and denotes a set of possible treatments.

For a point treatment study, which is a special case of longitudinal study, the observed data structure is O = (A, YA, W), where W is a set of baseline covariates. The full data structure is , W), where Ya is the counterfactual. A marginal structural model (MSM) can be used to estimate the average marginal causal effect of A on Y (i.e. the effect of a on E(Ya|V)) adjusted for V W as follows: E(Ya|V) = m(a, V|ß). For example, we may specify a linear model like m(a, V|ß) = ß0 + ß1a + ßvV, where ß0 is the intercept, ß1 is the parameter for the marginal causal effect, ßv is a vector of regression coefficients, and V is a subset of baseline covariates to be adjusted. Let {varepsilon}a(ß) = Yam(a, V|ß). Then E({varepsilon}a(ß)|V) = 0 for each . The estimation of causal parameter ß1 requires the assumption of no unobserved confounders, i.e. g(A|X) = g(A|W), where g is a probability density (or mass) function, representing the treatment assignment mechanism. In other words, this assumption states that treatment A is conditionally independent of the counterfactual outcomes given W, i.e. A is randomized within each stratum of W.

2.2 A causal inference method for constructing transcriptional regulatory networks
2.2.1 Data
Consider a particular organism such as the budding yeast. Let S(j, l) {A, C, T, G} denote the DNA base pair at the l-th position of the promoter sequence of the j-th gene. Let S = (S(j, l): j = 1, ..., J, l = 1, ..., L(j)) denote all the promoter regions for J genes, where L(j) is the length of the promoter sequence of the j-th gene. For simplicity, we can let L(j) = L for all j = 1, ..., J.

Let M = (M(1), ..., M(K)) be a vector of DNA binding motifs, possibly of variable lengths, which correspond to the binding sites of K known TFs. Suppose we know the correspondence between a TF k and its producer gene g(k), for k = 1, ..., K.

Consider now a particular gene expression experiment. Let Y = (Y(j):j = 1, ..., J) be the observed gene expression vector. Given data of S, M and Y, we can estimate the set of TFs that are active under the experiment condition (where a TF is ‘active’ refers to the situation in which the DNA binding site of the TF is significantly associated with the changes in the genome-wide gene expression values), using linear regression with model selection or multiple testing procedures as described in Bussemaker et al. (2001), Keles et al. (2002), Conlon et al. (2003) and Xing and van der Laan (2005). These procedures provide ways to transform single-experiment gene expression data (Y), promoter sequences (S) and TF binding sites (M) into a vector of ‘treatment’ codes indicating which TF is active or not active in an experiment. Denote the ‘treatment’ vector by A = {varphi}(Y, S, M) = (A(k):k = 1, ..., K) {0, 1}K, where {varphi} is a mapping function corresponding to a procedure used to estimate A, A(k) = 1 means the k-th TF is active under the current experiment condition and A(k) = 0 otherwise. So, after having obtained this transformation, we define our data as (A, Y, S, M).

Now suppose we have a collection of n gene expression experiments under different conditions, possibly conducted for independent study purposes. We assume that we can view these n gene expression experiments as n i.i.d. draws from some population data generating distribution. Under this assumption, we have n i.i.d. observations (Ai, Yi), i = 1, ..., n and fixed sequence data S and M.

2.2.2 A causal inference model
We view the data as analogous to data from a point treatment study, where the ‘treatment’ is whether a TF is active or not active. Consider a non-parametric marginal structural model which estimates the causal effect of TF k on gene expression j for each j = 1, ..., J and k = 1, ..., K.

Let (j, k) be given. We now define a counterfactual gene expression outcome Ya(j, k), which represents the expression level of gene j one would have observed if one had set/assigned A(k) = a, a {0, 1}. Here Y0(j, k) can be thought of as the expression of gene j one would have observed if one had knocked out gene g(k) (thereby eliminating TF k), or controlled TF k to be inactive in the sense of binding and regulation, while Y1(j, k) can be thought of as the expression value of gene j one would have observed if one had controlled TF k to be actively involved in binding and regulation under the experiment condition. Define ß1(j, k) = EP [Y1(j, k) – Y0(j, k)]. Such a parameter ß1(j, k) measures a marginal causal effect of A(k) on gene expression Y(j).

Let G(k1, k2) = I(M(k1) S(g(k2))), that is, G(k1, k2) equals 1 if the binding site of TF k1 is contained in the promoter region of gene k2; it otherwise equals 0. Let G = (G(k1, k2):k1 = 1, ..., K, k2 = 1, ..., K) denote the corresponding K x K matrix. Note that matrix G represents a directed graph defined by applying the rule ‘if G(k1, k2) = 1, then draw an arrow from k1 to k2’ for each (k1, k2) {1, ..., K}2. From this graph G we can obtain a K x K potential TF connectivity matrix, denoted by

Let W(j, k) = (A(l): l {1, ..., K}, l != k and {delta}(k, l) = 0) be the sub-vector of A corresponding to all TFs which are not on the potential causal pathway from A(k) to Y(j), and are therefore potential confounders of A(k). We now think of the full data as X(j, k) = (Y0(j, k),Y1(j, k),W(j, k)). We link the observed data to this counterfactual data by the relation:

We assume there are no unmeasured confounders; in other words, A(k) is conditionally independent of the counterfactual gene expressions (Y0(j, k), Y1(j, k)) given W(j, k):

In the context of transcriptional regulatory interaction, this assumption implies that all other TFs affecting the absence/presence of TF k (i.e., A(k)) should be measured and captured in W(j, k). In addition, we assume the (j, k)-specific experimental treatment assignment assumption holds, which states that 0 < P(A(k) = 1 | W(j, k)) < 1 a.e. In other words, we assume the probability that TF k is active in an experiment, given the observed confounders is not zero or one. This now defines a non-parametric marginal structural model M(j, k) for the data structure O(j, k), and the parameter of interest is given by ß1(j, k).

We consider a simple (j, k)-specific non-parametric marginal structural model as follows

(1)
where a {0, 1}, ß0(j, k) = E(Y0(j, k)) is the background effect when TF k is inactive, ß1(j, k) = E[Y1(j, k) Y0(j, k)] is the causal effect of TF k on gene j. The indicator function I(M(k) S(j)) constrains the model to estimate the causal parameter only when there is a possible direct effect of transcription factor k on gene expression j.

We are interested in estimating the causal parameter ß1 (j, k) for every gene and every TF. Several strategies have been proposed for the estimation of the marginal causal parameters: (1) the G-computation estimation procedure (Robins, 1986; 1987), which requires the model for be correctly specified (where FX represents the full data distribution); (2) the inverse probability of treatment weighted (IPTW) estimation procedure (Robins, 1999a,b, Robins et al. 2000, van der Laan and Robins, 2002), which requires the treatment mechanism (g(A|W)) be correctly specified; and (3) the double robust (DR) estimation procedure (Robins, 2000, van der Laan and Robins, 2002), which requires either the model for EFX(Y|A, W) or the model for g(A|W) be correctly specified. Since our analysis involves large scale genomic data, we may use the G-computation estimator or the IPTW estimator for computational ease.

2.2.3 The G-computation estimation procedure
Let P and Pn denote the true and empirical data distribution, respectively. Since

We can estimate E(Ya(j, k)) by

Then the causal parameter ß1(j, k) is estimated by

The G-computation estimation requires us to assume a suitable model for E(Y(j) | A(k) = a, W(j, k)). For simplicity, we may assume a linear model of the form as follows:

where {gamma} is the coefficient and Za = (1,a,W(j, k))T is a vector. Similarly, we write Z1 = (1, 1, W(j, k))T, Z0 = (1, 0, W(j, k))T and Z = (1, A(k), W(j, k))T.

We first estimate the model based on the observed data, that is,

Then, we estimate and , using the above estimated model but using Z1 and Z0 instead of Z.

To estimate , i.e., the variance of the estimated causal effect, we need to estimate the influence curve of the estimator . We describe the estimation procedure below. For notation convenience, we drop the index for j and k.

Let M(a, W|{gamma}) be the true data generating model, and be the empirically estimated model. Then we can write ß1 = EP [M(1, W) – M(0, W)] and . is a consistent estimator for ß1. Under the regularity condition, is asymptotically linear with the influence curve being

where . In other words,

Note IC(Oi) can be estimated by

and {sigma}2 can be estimated by

The Wald test statistic for H01 = 0 is .

2.2.4 The IPTW estimation procedure
The IPTW estimator is constructed by the following estimating function:

where h is a vector function of A and V, and g is a conditional distribution of A, given W (i.e. treatment mechanism). It is shown that, under the assumption of no unobserved confounders and the experimental treatment assignment assumption, the above estimating function is unbiased for ß, i.e. E(Dh(O|ß,g)) = 0 [see e.g. the appendix of Neugebauer and van der Laan (2002)].

As suggested by Robins (1999b), a sensible choice of h is . Consequently, the estimate of ß can be obtained by regressing Y over A with weights . For example, if we specify a marginal structural model as (1), we may simply choose . We then estimate ß using the weighted least square estimation method by regressing Y over A with weights .

To estimate the weight, we need to model the treatment mechanism g(A|W). For our particular problem, A is binary with A = 1 if the TF under study is active and A = 0 otherwise. A convenient choice of the model for g(A|W) is the logistic regression model, for example,

where m is such that A(m) W(j, k).

So, the estimated weight is for i = 1, ..., n, where g(Ai(k) = 1|Wi(k) = wi(k)) is estimated using the logit model above and

2.2.5 Inference on the regulatory network structures
After fitting the causal marginal structural model (1) for every gene and every TF, we can obtain a J by K P-value matrix, P, whose element is defined by

The J by K transcriptional regulatory interaction matrix, B, can be estimated by

where p is a user specified P-value threshold (e.g., p = 0.001).

Note may be interpreted that the marginal causal effect of the k-th TF on the j-th gene is significantly different from zero. Therefore, we infer that there is a transcriptional regulatory interaction between the k-th TF and the j-th gene.

We then can use the methods described in Lee et al. (2002) and Xing and van der Laan (2005) to identify network motifs and assemble the overall regulatory network.


    3 DATA ANALYSIS
 TOP
 Abstract
 1 INTRODUCTION
 2 METHODS
 3 DATA ANALYSIS
 4 DISCUSSION
 REFERENCES
 
We apply our method to study the transcriptional regulatory network in Saccharomyces cerevisiae (budding yeast).

3.1 Data
3.1.1 DNA microarray experiments
We collect 658 DNA microarray experiments on yeast gene expression programs under various conditions: 7 on diauxic shift (DeRisi et al., 1997), 10 on sporulation (Chu et al., 1998), 60 on cell cycle (Spellman et al., 1998), 4 on adaptive evolution (Ferea et al., 1999), 173 on environmental stress (Gasch et al., 2000), 6 on Copper regulation (Gross et al., 2000), 300 on diverse mutations and chemical treatments (Hughes et al., 2000), 8 on Pho metabolism (Ogawa et al., 2000), 12 on SNF/SWI mutants (Sudarsanam et al., 2000), 26 on FKH1 and FKH2 roles during cell cycle (Zhu et al., 2000) and 52 on DNA damage (Gasch et al., 2001).

Prior to analysis, the data are normalized by subtracting the genome-wise median for every experiment. In addition, the log2(ratios) are truncated by ±log2(20).

3.1.2 Promoter sequences
We extract promoter sequences of 700 bps in length in the upstream non-coding region [–700, –1] for 6136 open reading frames (ORFs) using the SCPD database (Zhu and Zhang, 1999).

3.1.3 Transcription factor binding sites
We collect known binding sites for 46 yeast TFs from SCPD (Zhu and Zhang, 1999), TRANSFAC (Wingender et al., 1996), and YPD of Incyte Proteome BioKnowledge Library (Hodges et al., 1999) (Table 1).


View this table:
[in this window]
[in a new window]
 
Table 1 Some yeast transcription factors and their specific binding sites

 
3.2 Analysis results
3.2.1 Estimated transcriptional regulatory interactions
The estimated number of transcriptional regulatory interactions between TFs and genes is a function of the cut-off value used. Table 2 shows the results at different cut-off levels.


View this table:
[in this window]
[in a new window]
 
Table 2 Estimated number of regulatory interactions (with 6136 ORFs and 46 transcription factors)

 
In our analysis, we choose p = 0.001 as a threshold to infer the yeast transcriptional regulatory interaction. The estimated transcriptional regulatory interaction matrix is then used to find network motifs and network structures.

3.2.2 Network motifs
Network motifs are the simplest units of the network architecture, which suggest models for the regulatory mechanism that can be tested. Lee et al. (2002) described six regulatory network motifs in terms of TF binding (Fig. 1) and algorithms to find them. We redefine the network motifs in terms of transcriptional regulatory interaction as follows: (a) auto-regulation motif, in which a regulator gene regulates its own expression; (b) feed-forward loop motif, in which a master regulator regulates the second regulator, and both regulate a common target gene; (c) multi-component loop motif, in which regulator(1) regulates regulator(2), ..., regulator(n – 1) regulates regulator(n), and regulator(n) regulates regulator(1), where n ≥ 2; (d) single input motif, in which a single regulator uniquely regulates a set of target genes; (e) multi-input motif, in which a set of regulators regulate a set of target genes together; and (f) regulator chain motif, in which regulator(1) regulates regulator(2), ..., regulator(n – 1) regulates regulator(n), where n ≥ 2, and the chain ends if regulator(n) does not directly regulate any other regulator that is not in the chain.



View larger version (23K):
[in this window]
[in a new window]
 
Fig. 1 Transcriptional regulatory network motifs: (a) Auto-regulation, (b) Feed-forward loop, (c) Multi-component loop, (d) Single-input motif, (e) Multi-input motif and (f) Regulator chain motif. Transcription factors are indicated by circles and genes by boxes. Solid arrows indicate regulatory interaction between transcription factors and their target genes. Dashed arrows link transcription factors and their producer genes. The diagram is modified from Lee et al. (2002).

 
We used the algorithms described in Lee et al. (2002) and Xing and van der Laan (2005) to find interesting transcriptional regulatory network motifs. We found 6 auto-regulated genes, 37 feed-forward loops, 1 multi-component loops, 26 single-input modules, 254 multi-input modules and 96 regulator chains, based on the estimated transcriptional regulatory interactions matrix for 46 TFs and 6136 genes, at a threshold of p = 0.001.

To assess the significance of the findings, we compared our results with those from Lee et al. (2002). Our analysis involves 46 TFs; the analysis of Lee et al. (2002) involves 106 TFs. We have 33 TFs in common. However, since the presence of the additional TFs affects the finding of almost all the network motifs, particularly the single-input and multi-input modules and regulator chains (a result of the network motif finding algorithm). So the comparison focuses on only the auto-regulation motif and feed-forward loop motif.

Table 3 lists genes that are likely to be auto-regulated. At the threshold of P = 0.001, we found 6 regulator genes (out of 46) that may be auto-regulated: ADR1, MIG1, NDT80, RAP1, ROX1 and TBP1. Among these, RAP1 was also identified as an auto-regulated gene in Lee et al. (2002), and NDT80 and ROX1 were computationally identified as auto-regulated genes in Xing and van der Laan (2005).


View this table:
[in this window]
[in a new window]
 
Table 3 Autoregulated genes

 
NDT80p functions at the pachytene of yeast gametogenesis (sporulation) to activate transcription of a set of genes required for both meiotic division and gamete formation. There is evidence that NDT80p activates its own transcription through an upstream MSE consensus site (Chu and Herskowitz, 1998; Lindgren et al., 2000.

The ROX1 gene encodes a heme-induced repressor of hypoxic genes in yeast. Experiments indicated that ROX1p is capable of binding to its own upstream region and represses its own expression (Deckert et al., 1995). ROX1 was included in (Lee et al. 2002), but was not identified as auto-regulated.

At a less restrictive threshold level, STE12 and SWI4 are also found to be auto-regulated, which were also identified as auto-regulated genes in Lee et al. (2002) and Xing and van der Laan (2005). However, we did not found literature support for ADR1, MIG1 and TBP1. Regulation mechanisms for these genes are not completely clear.

We found 37 feed-forward loops involving 23 TFs at the threshold level of P = 0.001. Among these, RAP1-HSF1 was also identified in Lee et al. (2002). ADR1, LEU3, MIG1 and TBP1 seem to form a feed-forward loop with MSN4, as the same relationships were also found computationally in Xing and van der Laan (2005).

3.2.3 Overall transcriptional regulatory network
We can construct the overall transcriptional regulatory network based on the estimated transcriptional regulatory interaction matrix. Figure 2 shows the estimated regulator network, which consists of 36 regulatory genes that have estimated transcriptional regulatory interactions with either themselves (i.e. auto-regulation) or with other regulators. The remaining 10 regulatory genes that are involved in the analysis but have no estimated transcriptional regulatory interactions with any regulators are not shown. Each of the 46 regulatory genes involved in the analysis has its own set of potential target genes, which are, to make it clear, not shown in the graph either.



View larger version (24K):
[in this window]
[in a new window]
 
Fig. 2 Estimated yeast transcriptional regulatory network. Ovals indicate regulatory genes. Arrows indicate the direction of transcriptional regulatory interactions. Regulators without significant interactions with other regulators are not shown. The potential target genes of each regulator are not shown.

 
The analysis results show that the proposed statistical approach is capable of identifying significant transcriptional regulatory interactions and the corresponding regulatory network structures. For example, the constructed network directly connects most of the regulators that are known to regulate the yeast cell cycle process, such as RME1, SWI4, SWI5, ACE2, MCM1, FKH1 and FKH2, to form a sub-network for cell cycle regulation. Among the estimated cell-cycle related transcriptional regulatory interactions, some have already been experimentally confirmed. For example, ACE2 induces the meiosis repressor RME1 (Toone et al., 1995; McBride et al., 1999); REB1 directly increases the mRNA abundance of SWI5 (Svetlov and Cooper, 1995); FKH2 upregulates cell-cycle dependent expression of the CLB2 cluster of genes, which include SWI5 and ACE2 (Boros et al., 2003).

The method is capable of revealing the transcriptional regulatory network structure that is not obvious under a single experiment condition. For example, our analysis suggests that SUM1 transcriptionally regulates NDT80, and NDT80 is auto-regulated. In fact, SUM1p and NDT80p bind competitively to the MSE sites in the promoter region of NDT80 and result in very different consequences: NDT80p activates the expression of NDT80, but SUM1p represses the expression of NDT80 (Pak and Segall, 2002). The cross link between SUM1 and NDT80 may not be observed in a location analysis based on only one kind of growth condition.


    4 DISCUSSION
 TOP
 Abstract
 1 INTRODUCTION
 2 METHODS
 3 DATA ANALYSIS
 4 DISCUSSION
 REFERENCES
 
In this paper we described a causal inference model based approach for constructing transcriptional regulatory networks using data on gene expression, promoter sequence and transcription factor binding sites. The method views an active TF in a given experiment analogous to a point treatment and the gene expressions as responses. The concept of counterfactual gene expression is introduced and a marginal structural model is built for every gene and TF pair to infer the regulatory interactions. Our analysis based on 658 microarray experiments on yeast gene expression programs and 46 TFs suggests that the method is capable of identifying significant transcriptional regulatory interactions and uncovering the corresponding network structures.

The computational approach is based on available gene expression and sequence data, so it is time-wise and resource-wise more efficient. It is especially suitable for mining the fast accumulating microarray data on gene expressions under various experiment conditions. Since data from many different experiment conditions are explored, our method may be, in some senses, more advantageous over location analysis and single TF perturbation experiment based approaches for its capability of finding the transcriptional regulatory network structure that is not fully observable under a single experiment condition, for example, the interaction between SUM1 and NDT80. However, since certain regulatory mechanisms will only occur under specific environmental conditions and will not be seen under other circumstances, aggregating over other experiments where this regulatory interaction is not evident may diminish the signal. Therefore, the estimation may be improved if we have some prior knowledge about the relationship between the regulatory interactions and experiments, and select analysis data carefully.

As compared with our previous method (Xing and van der Laan, 2005), this method is time-wise more efficient since it does not use the naive normal mixture model, and the IPTW estimation of the marginal structural models is faster than the EM algorithm based estimation of the mixture models. But the MSM needs the no-unmeasured confounders assumption and the experimental treatment assignment assumption to obtain a consistent estimate of the causal effects. We usually are not able to check whether these assumptions all hold without further knowledge except the data. The analyses based on data from real experiments seem to suggest that the two methods can be complementary to each other in maximizing significant findings.

The method has some limitations. First, it may fail to estimate the regulatory interactions of a TF that results in only subtle changes in the genome-wide gene expression profile. Second, the method relies on knowledge of TF binding sites. The number of TFs with known consensus binding sites is small and their functional coverage is somewhat limited. However, this may not be a problem when more and more TF binding sites are characterized and added to our knowledge. Also, we may use putative transcription factor binding sites in the analysis. Using putative TF binding sites will increase the error rates in estimation, but the constructed networks should suggest more models for further testing.

It is also worthwhile to mention that instead of using causal estimate ß(j, k), we can also use association ß(j, k). But then the coefficient would only measure the association between transciption factor k and gene j. However, it would be easier for estimation and thereby worthwhile for data analysis purposes.


    Acknowledgments
 
This study is supported in part by the Russell M. Grossman Endowment for dissertation assistance to the first author at University of California at Berkeley.

Conflict of Interest: none declared.

Received on May 27, 2005; revised on August 6, 2005; accepted on August 25, 2005

    REFERENCES
 TOP
 Abstract
 1 INTRODUCTION
 2 METHODS
 3 DATA ANALYSIS
 4 DISCUSSION
 REFERENCES
 

    Bar-Joseph, Z., et al. (2003) Computational discovery of gene modules and regulatory networks. Nat. Biotechnol., 21, 1337–1342[CrossRef][Web of Science][Medline].

    Beer, M.A. and Tavazoie, S. (2004) Predicting gene expression from sequence. Cell, 117, 185–198[CrossRef][Web of Science][Medline].

    Boros, J., et al. (2003) Molecular determinants of the cell-cycle regulated mcm1p-fkh2p transcription factor complex. Nucleic Acids Res., 31, 2279–2283[Abstract/Free Full Text].

    Bussemaker, H.J., et al. (2001) Regulatory element detection using correlation with expression. Nat. Genet., 27, 167–171[CrossRef][Web of Science][Medline].

    Chen, T., et al. (1999) Modeling gene expression with differential equations. Proc. Pac. Symp. Biocomput., 4, 29–40.

    Chu, S., et al. (1998) The transcriptional program of sporulation in budding yeast. Science, 282, 699–705[Abstract/Free Full Text].

    Chu, S. and Herskowitz, I. (1998) Gametogenesis in yeast is regulated by a transcriptional cascade dependent on ndt80. Mol. Cell, 1, 685–696[CrossRef][Web of Science][Medline].

    Conlon, E.M., et al. (2003) Integrating sequence motif discovery and microarray analysis. Proc. Natl Acad. Sci. USA, 100, 3339–3344[Abstract/Free Full Text].

    Deckert, J., et al. (1995) Multiple elements and auto-repression regulate rox1, a repressor of hypoxic genes in saccharomyces cerevisiae. Genetics, 139, 1149–1158[Abstract].

    DeRisi, J.L., et al. (1997) Exploring the metabolic and genetic control of gene expression on a genomic scale. Science, 278, 680–686[Abstract/Free Full Text].

    D'Haeseleer, P., et al. (1999) Linear modeling of mRNA expression levels during cns development and injury. Proc. Pac. Symp. Biocomput., 4, 41–52.

    D'Haeseleer, P., et al. (2000) Genetic network inference: from co-expression clustering to reverse engineering. Bioinformatics, 16, 707–726[Abstract/Free Full Text].

    Ferea, T.L., et al. (1999) Systematic changes in gene expression patterns following adaptive evolution in yeast. Proc. Natl Acad. Sci. USA, 96, 9721–9726[Abstract/Free Full Text].

    Friedman, N., et al. (2000) Using bayesian networks to analyze expression data. J. Comput. Biol., 7, 601–620[CrossRef][Web of Science][Medline].

    Gasch, A.P., et al. (2000) Genomic expression programs in the response of yeast cells to environmental changes. Mol. Biol. Cell, 11, 4241–4257[Abstract/Free Full Text].

    Gasch, A.P., et al. (2001) Genomic expression responses to DNA-damaging agents and the regulatory role of the yeast ATR homolog Mec1p. Mol. Biol. Cell, 12, 2987–3003[Abstract/Free Full Text].

    Gross, C., et al. (2000) Identification of the copper regulon in Saccharomyces cerevisiae by DNA microarrays. J. Biol. Chem., 275, 32310–32316[Abstract/Free Full Text].

    Hodges, P.E., et al. (1999) Yeast proteome database (YPD): a model for the organization and presentation of genome-wide functional data. Nucleic Acids Res., 27, 69–73[Abstract/Free Full Text].

    Hughes, T.R., et al. (2000) Functional discovery via a compendium of expression profiles. Cell, 102, 109–126[CrossRef][Web of Science][Medline].

    Keles, S., et al. (2002) Identification of regulatory elements using a feature selection method. Bioinformatics, 18, 1167–1175[Abstract/Free Full Text].

    Lee, T.I., et al. (2002) Transcriptional regulatory networks in Saccharomyces cerevisiae. Science, 298, 799–804[Abstract/Free Full Text].

    Liang, S., et al. (1998) REVEAL: a general reverse engineering algorithm for inference of genetic network architectures. Proc. Pac. Symp. Biocomput., 3, 18–29.

    Lindgren, A., et al. (2000) The pachytene checkpoint in saccharomyces cerevisiae requires the Sum1 transcriptional repressor. EMBO J., 19, 6489–6497[CrossRef][Web of Science][Medline].

    McBride, H.J., et al. (1999) Distinct regions of the Swi5 and Ace2 transcription factors are required for specific gene activation. J. Biol. Chem., 274, 21029–21036[Abstract/Free Full Text].

    U.C. Berkeley Division of Biostatistics Working Paper Series Neugebauer, R. and van der Laan, M.J. (2002) Why prefer double robust estimates? Illustration with causal point treatment studies. http://www.bepress.com/ucbbiostat/paper115.

    Ogawa, N., et al. (2000) New components of a system for phosphate accumulation and polyphosphate metabolism in Saccharomyces cerevisiae revealed by genomic expression analysis. Mol. Biol. Cell, 11, 4309–4321[Abstract/Free Full Text].

    Pak, J. and Segall, J. (2002) Regulation of the premiddle and middle phases of expression of the Ndt80 gene during sporulation of Saccharomyces cerevisiae. Mol. Cell Biol., 22, 6417–6429[Abstract/Free Full Text].

    Pilpel, Y., et al. (2001) Identifying regulatory networks by combinatorial analysis of promoter elements. Nat. Genet., 29, 153–159[CrossRef][Web of Science][Medline].

    Robins, J.M. (1986) A new approach to causal inference in mortality studies with sustained exposure periods—application to control of the healthy worker survivor effect. Math. Modelling, 7, 1393–1512[CrossRef].

    Robins, J.M. (1987) A graphical approach to the identification and estimation of causal parameters in mortality studies with sustained exposure periods. J. Chron. Dis., 40, 139s–161s[Medline].

    Robins, J.M. (1999a) Association, causation, and marginal structural models. Synthese, 121, 151–179[CrossRef].

    Robins, J.M. (1999b) Marginal structural models versus structural nested models as tools for causal inference. In Halloran, M. and Berry, D. (Eds.). Statistical Models in Epidemiology: The Environment and Clinical Trials, , NY Springer-Verlag, pp. 95–134.

    Robins, J.M., et al. (2000) Marginal structural models and causal inference in epidemiology. Epidemiology, 11, 550–560[CrossRef][Web of Science][Medline].

    Robins, J.M. (2000) Robust estimation in sequentially ignorable missing data and causal inference models. Proceedings of the American Statistical Association , pp. 6–10 Section on Bayesian Statistical Science 1999, Alexandria, VA. 1999.

    Segal, E., et al. (2003) Genome-wide discovery of transcriptional modules from DNA sequence and gene expression. Bioinformatics, 19, i273–i282[Abstract].

    Somogyi, R., et al. (1997) The gene expression matrix: towards the extraction of genetic network architectures. Proceedings of Second World Congress of Nonlinear Analysts 30, , pp. 1815–1824.

    Spellman, P.T., et al. (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell, 9, 3273–3297[Abstract/Free Full Text].

    Sudarsanam, P., et al. (2000) Whole-genome expression analysis of Snf/Swi mutants of Saccharomyces cerevisiae. Proc. Natl Acad. Sci. USA, 97, 3364–3369[Abstract/Free Full Text].

    Svetlov, V.V. and Cooper, T.G. (1995) Review: compilation and characteristics of dedicated transcription factors in Saccharomyces cerevisiae. Yeast, 11, 1439–1484[CrossRef][Web of Science][Medline].

    Toone, W.M., et al. (1995) RME1, a negative regulator of meiosis, is also a positive activator of G1 cyclin gene expression. EMBO J., 14, 5824–5832[Web of Science][Medline].

    van der Laan, M.J. and Robins, J. (2002) Unified methods for Censored Longitudinal Data and Causality. , NY Springer-Verlag.

    Wang, W., et al. (2002) A systematic approach to reconstructing transcription networks in Saccharomyces cerevisiae. Proc. Natl Acad. Sci. USA, 99, 16893–16898[Abstract/Free Full Text].

    Wingender, E., et al. (1996) TRANSFAC: a database on transcription factors and their DNA binding sites. Nucleic Acids Res., 24, 238–241[Abstract/Free Full Text].

    Xing, B. and van der Laan, M.J. (2005) A statistical method for constructing transcriptional regulatory networks using gene expression and sequence data. J. Comput. Biol., 12, 229–246[CrossRef][Web of Science][Medline].

    Yoo, C., et al. (2002) Discovery of causal relationships in a gene-regulation pathway from a mixture of experimental and observational DNA microarray data. Proc. Pac. Symp. Biocomput., 7, 498–509.

    Zhu, G., et al. (2000) Two yeast forkhead genes regulate the cell cycle and pseudohyphal growth. Nature, 406, 90–94[CrossRef][Medline].

    Zhu, J. and Zhang, M.Q. (1999) SCPD: a promoter database of yeast Saccharomyces cerevisiae. Bioinformatics, 15, 607–611[Abstract/Free Full Text].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Nucleic Acids ResHome page
S. Ranjan, J. Seshadri, V. Vindal, S. Yellaboina, and A. Ranjan
iCR: a web tool to identify conserved targets of a regulatory protein across the multiple related prokaryotic species.
Nucleic Acids Res., July 1, 2006; 34(Web Server issue): W584 - W587.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
W. Li, M. Wang, P. Irigoyen, and P. K. Gregersen
Inferring causal relationships among intermediate phenotypes and biomarkers: a case study of rheumatoid arthritis
Bioinformatics, June 15, 2006; 22(12): 1503 - 1507.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/21/4007    most recent
bti648v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (6)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Xing, B.
Right arrow Articles by van der Laan, M. J.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Xing, B.
Right arrow Articles by van der Laan, M. J.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?