Skip Navigation


Bioinformatics Advance Access originally published online on September 11, 2006
Bioinformatics 2006 22(22):2775-2781; doi:10.1093/bioinformatics/btl473
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/22/2775    most recent
btl473v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (5)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Sanguinetti, G.
Right arrow Articles by Rattray, M.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Sanguinetti, G.
Right arrow Articles by Rattray, M.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Probabilistic inference of transcription factor concentrations and gene-specific regulatory activities

Guido Sanguinetti 1,*, Neil D. Lawrence 1 and Magnus Rattray 2

1 Department of Computer Science, Regent Court 211 Portobello Road, Sheffield, S1 4DP, UK
2 School of Computer Science, University of Manchester Oxford Road, Manchester, M13 9PL, UK

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1. INTRODUCTION
 2 METHODS
 3. RESULTS
 4. DISCUSSION
 REFERENCES
 

Motivation: Quantitative estimation of the regulatory relationship between transcription factors and genes is a fundamental stepping stone when trying to develop models of cellular processes. Recent experimental high-throughput techniques, such as Chromatin Immunoprecipitation (ChIP) provide important information about the architecture of the regulatory networks in the cell. However, it is very difficult to measure the concentration levels of transcription factor proteins and determine their regulatory effect on gene transcription. It is therefore an important computational challenge to infer these quantities using gene expression data and network architecture data.

Results: We develop a probabilistic state space model that allows genome-wide inference of both transcription factor protein concentrations and their effect on the transcription rates of each target gene from microarray data. We use variational inference techniques to learn the model parameters and perform posterior inference of protein concentrations and regulatory strengths. The probabilistic nature of the model also means that we can associate credibility intervals to our estimates, as well as providing a tool to detect which binding events lead to significant regulation. We demonstrate our model on artificial data and on two yeast datasets in which the network structure has previously been obtained using ChIP data. Predictions from our model are consistent with the underlying biology and offer novel quantitative insights into the regulatory structure of the yeast cell.

Availability: MATLAB code is available from http://umber.sbs.man.ac.uk/resources/puma

Contact: guido{at}dcs.shef.ac.uk

Supplementary information: Supplementary Data are available at Bioinformatics online


    1. INTRODUCTION
 TOP
 ABSTRACT
 1. INTRODUCTION
 2 METHODS
 3. RESULTS
 4. DISCUSSION
 REFERENCES
 
Quantitative modelling of the regulatory network of the cell is one of the grand challenges of bioinformatics. Although recent techniques, such as Chromatin Immunoprecipitation (ChIP) (Lee et al., 2002; Harbison et al., 2004) have uncovered much information about the architecture, or connectivity, of the networks, any quantitative model would require the knowledge of both the concentration of transcription factor proteins at a given time and the intensity with which they can promote or repress transcription of their target genes. Experimental estimation of these variables meets with insurmountable obstacles: measuring protein concentrations is a notoriously difficult task, and little help can be gleaned from knowledge of transcription factor gene expression levels. Transcription factors are often post-transcriptionally regulated and have low and noisy expression levels. Furthermore, the effect a transcription factor has on a target gene depends greatly on the experimental conditions (Harbison et al., 2004; Papp and Oliver, 2005), making experimental estimation of the strength of regulatory relationships a difficult task.

An idea that has gained a lot of interest in recent years has been to infer information about regulatory activity from the expression levels of target genes. Using information from ChIP experiments or from sequence analysis about the connectivity of the network and genome-wide microarray data for the expression levels of the targets, it should be possible to gain insights on the activity of the transcription factors.

This idea has been pursued by a number of research groups (Liao et al., 2003; Gao et al., 2004; Boulesteix and Strimmer, 2005). Most methods aim to infer a matrix of transcription factor activities (TFAs), which are supposed to sum up in a single number the concentration of the transcription factor at a certain experimental point and its binding affinity to its target genes. The techniques used are modified forms of linear regression, where the TFAs are obtained as regression coefficients. These models were able to obtain results in broad accordance with the existing biological knowledge, and have the advantage of being fast and practical for genome-wide analysis. However, a major limitation of these methods is that the TFAs inferred are constant across genes, i.e. they can only infer the mean influence of a transcription factor on its target genes1. Also, none of these methods is probabilistic and therefore it is difficult to see how credibility intervals can be obtained, as well as how the models can be made robust against false positives (a notorious problem of ChIP data).

A more sophisticated approach was taken in Nachman et al. (2004). Here, dynamical Bayesian networks were used to model the concentrations of transcription factor proteins. The binding of a transcription factor protein to a target gene was then modelled using binary variables allowing for gene-specific effects to be considered. Although the model is elegant and more realistic than the regression based models, its computational costs ruled out genome-wide investigation.

In a recent study, we proposed a probabilistic extension of the linear regression model to describe gene and environment specific effects (Sanguinetti et al., 2006). We allowed a different regression matrix for each gene, and rendered the model identifiable by using a prior distribution on the gene-specific TFAs that was shared across all genes. The temporal structure of the data was captured by requiring the prior distribution to form a stationary Markov chain. Bayesian inference was exactly tractable with this model and the posterior estimates for the gene-specific TFAs proved to be consistent with known biological regulatory relationships. However, although useful, the model lacked in interpretability. While one could argue that the TFAs obtained by regression are monotonically linked to the actual protein concentrations, it was hard to separate the effect of high protein concentrations from the effect of high regulatory strength. This is a serious problem when trying to use the results of the model for further analysis, as it is impossible to distinguish between what is a time varying quantity (the protein concentration) from what should be considered a condition dependent parameter (the strength with which a transcription factor regulates one of its targets).

Here, we propose a modified model that separately models concentrations and regulatory strengths. While this makes the model no longer exact, it has the advantage of providing probabilistic estimates for the intensities of the regulatory relationships, hence allowing the genome-wide quantitative reconstruction of the dynamical process of transcriptional regulation. We model the temporal structure of the data by using a Markov chain, and we develop an efficient variational Expectation Maximization (EM) algorithm for the estimation of the model parameters and posterior statistics.

Models with a Markov chain prior on continuous valued latent variables are special cases of dynamical Bayesian networks known as state space models, or Kalman filters (Kalman, 1960; Haykin, 2002), and these models have previously been used in microarray time-series analysis. For example, Beal et al. (2005) recently applied a state space model to learn about interactions between a subset of genes from highly replicated microarray experiments. However, they did not make use of prior knowledge about potential regulatory interactions to explicitly infer the activity of transcription factors. This knowledge greatly reduces the search space, allowing genome-wide applications to become feasible and reducing the need for substantial experimental replication.

Recently, Sabatti and James (2006) also presented an extension of the linear regression model that provides separate estimates of protein concentrations and regulatory intensities for regulatory networks of known connectivity. They use Markov chain Monte Carlo (MCMC) to perform approximate inference of the posterior distribution of the concentration and intensities, and provided an R package implementing it. While their model is in many ways similar to ours, in other ways the models differ substantially. A more detailed comparison of the two models is addressed in the Discussion section.

We validate the model both on synthetic data and real data from two yeast time series: the benchmark cell cycle data set of Spellman et al. (1998) and the recent metabolic cycle data set of Tu et al. (2005). The connectivity data we use in the cell cycle case is that obtained by Lee et al. (2002) using ChIP, while for the metabolic cycle data we combine the ChIP data of Lee et al. (2002) with the more recent data of Harbison et al. (2004).

Our results are largely confirmed in the biological literature, but, using the gene-specific nature of our model, we also manage to predict biologically plausible regulatory relationships which are not documented in the literature. The probabilistic nature of our model also means that we can identify false positives in the ChIP data as regulatory relationships below a certain significance threshold.


    2 METHODS
 TOP
 ABSTRACT
 1. INTRODUCTION
 2 METHODS
 3. RESULTS
 4. DISCUSSION
 REFERENCES
 
The logged gene expression measurements are collected in a design matrix Y isin RNxT, where N is the number of genes and T the number of time points in a time-series microarray experiment. The elements of Y are written yn(t) to denote the expression level of gene n at time t. The connectivity of the network is represented by a binary matrix X isin RNxq, where q is the number of transcription factors; element (i,j) of X is one if transcription factor j binds gene i, zero otherwise.

2.1 Model
We propose a discrete time state space model (Kalman, 1960), which takes the form,

Formula 1(1)

Formula 2(2)
Equation (1) describes a linear model of the effect that each transcription factor has on the expression level of each gene. Equation (2) describes the dynamics of the underlying transcription factor concentrations as a 1st order Markov chain. Elements of the concentration matrix C = [cm(t)] represent the relative concentration of transcription factor protein m at time t. Elements of the activity matrix B = [bnm] model the regulatory strength with which transcription factor m influences the target gene n. The mean vector µ = [µn] represents the baseline expression level for each gene, i.e. the expression level in the absence of any of the known transcription factors being bound. The variables {varepsilon}nt and {eta}mt each represent zero mean i.i.d. Gaussian noise on the measurements and underlying process respectively. The measurement noise has variance {sigma}2, Formula 2. The process noise Formula 2 has a variance that ensures the Markov process governing the dynamics of cm(t) is stationary with unit variance [we set Formula 2]. The parameter vector {gamma} = [{gamma}m] has components {gamma}m isin [0,1] that determine the temporal variability in the concentration of transcription factor m. Values of {gamma}m that are close to one indicate very little variability in time while lower values correspond to larger changes. Intermediate values of this parameter indicate a smoothly varying transcription factor concentration profile.

The assumption that the activities bnm are independent of time is reasonable for time-series data when the conditions (e.g. pH, temperature, growth medium etc.) are kept relatively constant. Of course, large changes in the relative proportion of different transcription factors would eventually lead to the simple linear relationship in equation (1) breaking down, but by making this simplification the model remains sufficiently tractable for practical application to a genome-wide study.

An important feature of the model is that the connectivity matrix X is sparse, mirroring the biological fact that few transcription factors bind any single gene. This is crucial for the identifiability of our model: models of this high dimensionality (in both the measurement and latent space) with full matrices would require very large numbers of replicates to be identifiable, and even then would be unlikely to correctly identify the sparsity structure of the connectivity matrix. However, the presence of the matrix X ensures that only a few of the bnm need to be estimated, making the task possible with the limited amount of data normally available in microarray time-series data. The degree of sparsity of the matrix X varies depending on the organism studied and the type of data used to build the network. Typically, regulatory networks for yeast constructed from ChIP data can be expected to lead to matrices with approximately only 1% nonzero entries, while networks for higher organisms constructed from motif data (see e.g. Xie et al., 2005) lead to matrices with ~20% nonzero entries, although these are expected to contain very many false positives.

The model is over-parameterized and we therefore use Bayesian methods for inferring the posterior probability of the activities in B. This also provides us with a tool to determine which of these activities are significant. The bnm are each given a zero mean spherical Gaussian prior distribution with variance {alpha}2, which sets the typical scale of regulatory effects,

Formula 3(3)
The baseline expression level vector µ is also given a spherical Gaussian prior with zero mean and unit variance.

The Markov process in equation (2) can be formulated as a Gaussian distribution on the vector {kappa} = [c(1), ..., c(T)]T, where c(t) = [cm(t)] is the row vector of concentrations at time t,

Formula 3
with

Formula 4(4)
Here we have defined

Formula 4

Having defined our model, we can now write a joint distribution for the observed and latent variables

Formula 5(5)
The standard approach would now be to marginalize the latent variables and apply type II maximum likelihood to obtain the values of the hyperparameters {alpha},{sigma},{gamma}. Estimates of the activities and concentrations can then be obtained by posterior estimation using Bayes' theorem. However, exact marginalisation of equation (5) is intractable and we have to resort to approximate techniques. We use a variational EM algorithm that exploits a factorisation assumption to achieve an efficient approximation of the log likelihood.

2.2. Variational inference
The most common method for carrying out approximate Bayesian inference for models with many parameters is MCMC. However, we prefer to use a variational EM algorithm here. The advantage of using this approach over MCMC is that we can deal more efficiently with the huge number of parameters in the matrices B and C. The deterministic nature of the resulting parameter optimization is also attractive, as it is easier to assess convergence compared to MCMC. Variational EM algorithms have previously been applied to similar models with impressive results (Beal et al., 2005) although it is always important to validate approximate methods in a new application using simulated data.

Variational inference (see e.g. Jordan et al., 1999) approximates the intractable posterior probability distribution for the model parameters by using a simplified form, usually involving a factorized approximation. An EM algorithm can then be used to minimize the KL-divergence between this approximation and the true posterior distribution. The EM algorithm exploits Jensen's inequality to obtain the following bound on the log likelihood

Formula 6(6)
where <>q denotes expectation under the probability distribution q, q(B,C,µ) is any probability distribution on the variables B, C and µ and H denotes the entropy of the distribution. For brevity, we use {theta} to denote collectively the hyperparameters {alpha}, {sigma} and {gamma}. It can be shown that the bound is saturated if and only if q(B,C,µ) is the posterior distribution p(B,C,µ|Y,{theta}).

Computing the posterior distribution is as intractable as computing the marginal likelihood. We will, however, construct a tractable distribution that approximates the posterior distribution. We take the approximating distribution to factorize over the hidden variables

Formula 6
We can then construct the approximating distribution iteratively as follows: start with any distributions q2(C) and q3(µ) (e.g. the prior distributions) and average the joint likelihood under them. Then the choice for q1(B) that maximizes the bound in equation (6) can be computed exactly since we averaged over C and µ. However, the sufficient statistics of q1(B) will depend on the sufficient statistics of q2(C) and q3(µ). We can now iterate computing similarly updated distributions for q2(C) and q3(µ) until the algorithm is deemed to have converged. The approximation becomes exact if the random variables C, B and µ are independent a posteriori.

E-step: posterior updates.
In our model, the log of the joint probability has the following form

Formula 7(7)
where {chi}n is the diagonal matrix having the n-th row of X on the diagonal, Formula is the n-th row of B, c(t) = [cm(t)] and K and {kappa} are defined in equation (4).

By inspection, one obtains that

Formula 7
with

Formula 8(8)
where <>q2 and <>q3 represent averaging under q2({kappa}) and q3(µ), respectively. Notice that both the posterior inverse covariance and the posterior mean are sparse, mirroring the sparsity of the connectivity data.

Similarly we can compute the approximating distribution for C

Formula 8
with

Formula 9(9)
where {otimes} denotes the matrix Kronecker product2. yn = [yn(t)] and e is a T dimensional vector whose entries are all one.

Notice that the matrix K' is of size Tq x Tq. For most genome-wide networks Tq is in the thousands, which makes a naïve inversion of the covariance matrix computationally prohibitive [the complexity and memory requirements for inverting a matrix of size Tq are O((Tq)3)]. Also, fast recursive methods to compute posterior expectations, such as forward-backward algorithms are prone to numerical instability when the dimension of the latent space becomes high (for a detailed review of recursive methods from the point of view of control theory, as well as for a discussion of their numerical problems, see Haykin, 2002). To avoid these problems, we used the recent inversion algorithm for banded block matrices proposed by Asif and Moura (2005). This computes the whole covariance matrix while reducing the computational complexity and memory requirements by a factor of order T (for most typical microarray applications T is between 10 and 40) and maintaining numerical stability.

Finally, we can compute the approximating distribution for µ, which is easily derived as

Formula 9
where

Formula 10(10)
Notice that, in the limit in which the observation noise {sigma} goes to zero, Equation (10) returns the temporal mean of the difference between the observed values and the predicted values, as expected.

The update equations (810) can be iterated until convergence.

M-step: hyperparameter updates
Having computed an approximation to the logarithm of the marginal likelihood, we can perform an M-step to optimize the hyperparameters. Fixed point update equations for {alpha}2 and {sigma}2 are readily found as

Formula 11(11)
Unfortunately, it is not possible to obtain fixed point equations for the hyperparameters {gamma}, since they appear both in the prior mean and in the prior covariance for the concentrations C. We optimized them using a scaled conjugate gradient algorithm in the NETLAB implementation of Nabney (2002).


    3. RESULTS
 TOP
 ABSTRACT
 1. INTRODUCTION
 2 METHODS
 3. RESULTS
 4. DISCUSSION
 REFERENCES
 
3.1. Artificial data
To check the consistency of our model, we first tested it on artificial data. We randomly generated data from the model with known parameters, and ran the EM algorithm from a random initialization. To simulate more faithfully a real situation, we used a connectivity matrix obtained from the transcriptional regulatory network of the yeast cell.

We generated eight samples from a system simulating a cellular network with 649 genes and 19 transcription factors. The parameters obtained at convergence were generally in good agreement with the true ones. The inferred posterior expectations, apart from the obvious sign ambiguity (the sign of cm and bm can both be changed without altering the model)3 are in accordance with the true ones. Figure 1 shows a plot of the original activities versus the inferred ones and a reconstructed concentration profile versus the true profile.


Figure 1
View larger version (11K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1 Experimental results on synthetic data: left shows the posterior estimates for the gene-specific activities versus the true ones; right shows a true concentration profile (dashed line) and the reconstructed posterior concentration profile. The true value of the temporal continuity {gamma} was 0.7255 and the reconstructed value of {gamma} was 0.7240.

 
3.2. Cell cycle data
We then turned to the benchmark yeast cell cycle dataset of Spellman et al. (1998). This consists of the expression profiles of 6181 genes measured at 24 equally spaced time points covering the yeast cell cycle. We integrated the microarray data with the connectivity described by Lee et al. (2002), who performed ChIP on 113 transcription factors, measuring their binding to 6270 genes. Although both these datasets are relatively old, they have been extensively studied in the literature on regulatory networks, thus providing an excellent benchmark for model validation and comparison.

The ChIP data are continuous, but, following the suggestion of Lee et al. (2002), we binarized it by considering only regulatory relationships which gave P-values smaller than 10–3. This threshold was confirmed as providing a good compromise between retaining enough regulations without introducing too many false positives in our previous study (see Supplementary Data to Sanguinetti et al., 2006).

We removed from the dataset genes which were not bound by any transcription factor and transcription factors not binding any gene. We also removed the expression data of genes with five or more missing values in the microarray data, leaving a network of 1975 genes and 104 transcription factors.

Figure 2 shows posterior estimates of the protein concentrations and gene specific activities, as well as expression profiles, of two of the most important regulators of the yeast cell cycle, ACE2 and SWI5.


Figure 2
View larger version (15K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2 Expression profiles, inferred protein concentrations and gene-specific activities for ACE2 (top), SWI5 (bottom). (a) and (d) show the transcription profiles of ACE2 and SWI5, (b) and (e) show the posterior temporal profile for protein concentration levels with confidence intervals at each point and (c) and (f) show histograms of the significance levels of the regulatory interactions for each transcription factor acting on all of its target genes (ratios between the inferred mean regulatory strength <b> and its associated standard deviation {sigma}b).

 
The protein concentration profiles obtained by posterior estimation are similar to the ones obtained using regression methods [cf. figure 4 in Boulesteix and Strimmer (2005) and figure 2 in the supplementary material]. This is in accordance with the idea that TFAs obtained by regression are monotonically linked to protein concentrations and provide an estimate of the average effect of a transcription factor over its target genes.

The main novelty of our model is represented by the third column of Figure 2. This plots as a histogram the ratios between the posterior gene specific activities of the two transcription factors and the associated posterior standard deviations for all of their target genes. This can be viewed as a generalization of the standard significance level where we also retain the sign of the interaction. In both cases, there is a large number of genes whose level of regulation is not statistically significant, which could then be considered as false positives in the ChIP data.

A detailed analysis of which genes are significantly regulated can be used for further validation of our model. For example, ACE2 exhibits a group of four genes very significantly promoted (signal to noise ratios between 14 and 19). These are SCW11, CTS1, DSE1 and DSE2. They were clustered with ACE2 in Spellman et al. (1998) and were also identified as the four main targets of ACE2 in our previous study (Sanguinetti et al., 2006). A search of the GO database4 reveals that the functional annotations of these genes are very coherent: CTS1 is well known to be mediated by ACE2 and is required for cell separation, while both DSE1 and DSE2 are involved in degrading the cell wall causing the daughter cell to separate from the parent. The functional annotation for SCW11 is less clear but it is known that its protein is again localized at the cell wall.

At the other end of the spectrum, we find that our model predicts with reasonable confidence (signal to noise ratio 4.4) that ACE2 represses NCE4. Although, this interaction is not documented in the literature as far as we know, NCE4 is known to be important in ensuring DNA stability during replication (Mullen et al., 2005). It is therefore not unreasonable that ACE2, whose main function is to terminate mitosis, should inhibit production of NCE4.

Among the most significant targets of SWI5 we find SIC1 and PCL2 (signal to noise ratio 12.1 and 4.3, respectively), which were identified as main targets for SWI5 by Aerne et al. (1998), and EGT2 (signal to noise ratio 11.1). EGT2 was shown in Kovaceh et al. (1996) to be primarily regulated by SWI5, but also regulated, to a lesser extent, by ACE2. This is confirmed by our model which assigns a signal to noise ratio of 1.5 to the activity of ACE2 on EGT2.

At a more global level, we can use the posterior distribution over the regulatory intensities to assess which binding events (non zero entries in the connectivity matrix) result in significant regulation. For example, only 1238 out of 3656 bindings give regulations with a signal to noise ratio greater than 2 (95% significance level). Similarly, we get that only 86 transcription factors regulate at least one target at 95% significance, and that only 155 genes are regulated by more than one transcription factor at 95% significance level (while 792 are bound by more than one transcription factor according to the connectivity data). For a more detailed discussion of these global issues, as well as for more examples of inferred concentrations and regulatory intensities, we refer to the Supplementary Data.

3.3. Metabolic cycle data
Tu et al. (2005) investigated the molecular origin of the glycolitic and respiratory oscillations that constitute the yeast metabolic cycle. mRNA was prepared at regular intervals of ~25 min over three consecutive cycles. The study identified that, at 95% significance, over half of the yeast genes (approximately 3500) display periodic behaviour, and can hence be assumed to be metabolic cycle-regulated.

In order to capture as many regulatory relationships as possible, we built our network by merging the results of two ChIP experiments (Lee et al., 2002; Harbison et al., 2004). After eliminating genes that were not bound by any transcription factors and transcription factors not binding any genes, this resulted in a very large network of 177 transcription factors and 3195 target genes. Although merging two ChIP experiments can be expected to result in an increased number of false positives, the probabilistic nature of our model means it can deal efficiently with false positives by assigning a low signal to noise ratio to regulatory relationships which are not consistent with the expression data. For example, while the ChIP data describes 7082 binding events, only 3150 of these were associated by our model with a regulatory intensity which was significant at 95% confidence level. These significant regulations involved 163 of the 177 transcription factors and 2264 of the 3178 genes present in the network.

A transcription factor that is of particular interest in the metabolic cycle of yeast is LEU3, which is involved in the metabolism of branched chain amino acids. This has been the subject of a recent experimental study (Boer et al., 2005) in which LEU3's regulon was investigated through comparison of in vitro binding affinities, ChIP data and data from mutant strains. They identified nine target genes for LEU3 confirmed by all experimental techniques, plus several more putative targets that were confirmed by two experiments.

Figure 3 shows the expression profile, concentration profile and activities for LEU3. Figure 3c shows the gene-specific activities of LEU3 on its target genes. Again, as we already noticed for the cell cycle data, most target genes are not significantly regulated. However, the model identifies a number of very highly significantly regulated genes. The three most significantly regulated genes (signal to noise ratio greater than 10) are OAC1, BAT1 and LEU1, which were all confirmed as targets of LEU3 by the in vivo experiments of Boer et al. (2005). At the other end of the spectrum, our model predicts significant downregulation for two genes, YLR356W and YHR209W. The functional annotation of these genes is very poor (for YHR209W the protein is unknown); a significant link with a well characterized transcription factor such as LEU3 could be the start to a better understanding of these genes.


Figure 3
View larger version (8K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3 (a) Expression profile for LEU3 during the metabolic cycle, (b) posterior protein concentration profile inferred by our model for LEU3 and (c) posterior mean to standard deviation ratios for the gene-specific activities of LEU3 on its target genes.

 
For more examples of transcription factors involved in the metabolic cycle see the Supplementary Data.


    4. DISCUSSION
 TOP
 ABSTRACT
 1. INTRODUCTION
 2 METHODS
 3. RESULTS
 4. DISCUSSION
 REFERENCES
 
In this paper we introduced a novel probabilistic model to infer transcription factor protein concentration and regulatory strengths from microarray data when the structure of the regulatory network is known. The expression levels of target genes are modelled as sparse linear combinations of the transcription factor protein concentrations, where the coefficients represent the intensity of the regulatory relationship between a transcription factor and its targets. The regulatory intensities are given a spherical Gaussian prior distribution, while the protein concentrations are modelled as a discrete time state space model. Approximate posterior inference allows estimation of both the intensities and the protein concentration profiles with associated credibility intervals.

A key feature of our model is the way it exploits the natural sparsity of the regulatory network. State space models had previously been used to analyse microarray data (Beal et al., 2005), but the absence of a sparsity constraint meant that they could only be applied to small networks and highly replicated data. The sparse nature of the inference in our model means that we can successfully apply it in genome-wide studies of time courses.

The contribution that is closest to ours is the recent paper of Sabatti and James (2006). While we share many of their aims, there are some important differences between the two approaches: first, their approach is static and cannot account for the temporal structure of the data. While it is in principle possible to modify their algorithm to include dynamics, the authors themselves acknowledge that this may make the computational cost prohibitive [see Supplementary Data to Sabatti and James (2006)]. Second, one of the aims of their approach is to be able to identify false positives in the network structure. To do this, they need a prior distribution over the binary connectivity matrix, which they obtain from sequence information using their Vocabulon method (Sabatti et al., 2005). It is not clear, however, how to obtain such a prior distribution for ChIP data. It would seem therefore that Sabatti and James's approach is perhaps most suitable for network structures derived from motif analysis, rather than ChIP data. Last, there are important differences at the algorithmic level in the choice of approximate inference techniques (MCMC versus variational EM) and in the optimization of the hyperparameters.

We demonstrated our model both on artificial data and on two yeast data sets, the benchmark cell cycle data set of Spellman et al. (1998) and the more recent metabolic cycle data set of Tu et al. (2005), using network structure obtained by ChIP (Lee et al., 2002; Harbison et al., 2004).

While results on artificial data confirmed the identifiability of our model, results on biological data provided wide ranging predictions on the regulatory network of the yeast cell. Most of these predictions were confirmed by the existing biological literature. However, as in the case of ACE2 repressing NCE4, our model predicted regulatory relationships which are not documented in the biological literature but are consistent with the known function of both the transcription factor and the target gene.

In this paper, we made a number of simplifying assumptions. We considered all noise on microarray measurement to be explained by a spherical Gaussian term. This is not always a realistic assumption, and we are aware that the model's results, particularly on low expressed genes, could be negatively affected by large levels of noise. While in principle it is straightforward to propagate noise through a probabilistic model along the lines outlined in Sanguinetti et al. (2005), the computational costs of considering heteroscadistic models can be significant. Also, we assumed the regulatory intensity with which a transcription factor affects a target gene to be constant across time. This is not always the case in reality; however, making the regulatory intensities time-dependent would make the model less identifiable. In these cases it would perhaps be more appropriate to use a model that combines concentrations and intensities, such as the one presented in Sangunetti et al. (2006).

Perhaps the most glaring assumption we make is that an additive linear model is appropriate to describe a complex biological process, such as transcription. While this is clearly not the case, a linear model should still capture the most prominent features of the system. Although nonlinear models do obtain better results (Nachman et al., 2004; Beer and Tavazoie, 2004), their computational complexity rules out inference on a genome-wide scale, thus providing a serious limit to their usefulness in exploratory studies.


    Acknowledgments
 
The authors gratefully acknowledge support from a BBSRC award ‘Improved processing of microarray data with probabilistic models’.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: John Quackenbush

1Gao et al. (2004) introduces gene-specific regulatory strengths by considering correlations across different conditions, but their method is incapable of obtaining gene-specific results from single condition data. Back

2The Kronecker product (also known as tensor product) of a q x q matrix A with a T x T matrix B is the Tq x Tq block matrix C whose ij th block is given by aij B. Back

3The sign ambiguity can easily be resolved for real data by comparing protein concentration profiles with expression data or using known regulatory relationships Back

4 http://db.yeastgenome.org/ Back

Received on May 22, 2006; revised on July 28, 2006; accepted on September 2, 2006

    REFERENCES
 TOP
 ABSTRACT
 1. INTRODUCTION
 2 METHODS
 3. RESULTS
 4. DISCUSSION
 REFERENCES
 

    Aerne, B.L., et al. (1998) Swi5 controls a novel wave of cyclin synthesis in late mitosis. Mol. Biol. Cell, 9, 945–956[Abstract/Free Full Text].

    Asif, A. and Moura, J.M.F. (2005) Block matrices with L-banded inverse: Inversion algorithms. IEEE Trans. Signal Proc, . 53, 630–643[CrossRef].

    Beal, M.J., et al. (2005) A bayesian approach to reconstructing genetic regulatory networks with hidden factors. Bioinformatics, 21, 349–356[Abstract/Free Full Text].

    Beer, M. and Tavazoie, S. (2004) Predicting gene expression from sequence. Cell, 117, 185–198[CrossRef][Web of Science][Medline].

    Boer, V.M., et al. (2005) Contribution of the Saccharomyces cerevisiae transcriptional regulator leu3p to physiology and gene expression in nitrogen- and carbon-limited chemostat cultures. FEMS Yeast Res, . 5, 885–897[CrossRef][Web of Science][Medline].

    Boulesteix, A.-L. and Strimmer, K. (2005) Predicting transcription factor activities from combined analysis of microarray and ChIP data: a partial least squares approach. Theor. Biol. Med. Model, 2, 1471–16582.

    Gao, F., et al. (2004) Defining transcriptional networks through integrative modeling of mRNA expression and transcription factor binding data. BMC Bioinformatics, 5, 1471–16582.

    Harbison, C.T., et al. (2004) Transcriptional regulatory code of a eukaryotic genome. Nature, 431, 99–104[CrossRef][Medline].

    Haykin, S. Adaptive Filter Theory, (2002) , Englewood Cliffs, NJ Prentice Hall.

    Jordan, M.I., et al. (1999) An introduction to variational methods for graphical models. Mach. Learn, . 37, 183–233[CrossRef].

    Kalman, R.E. (1960) A new approach to linear filtering and prediction problems. J. Basic Eng, . 82, 35–45.

    Kovaceh, B. (1996) EGT2 gene transcription is induced predominantly by Swi5 in early G1. Mol. Cell. Biol, . 16, 3264–3274[Abstract].

    Lee, T.I., et al. (2002) Transcriptional regulatory networks in Saccharomyces cerevisiae. Science, 298, 799–804[Abstract/Free Full Text].

    Liao, J.C., et al. (2003) Network component analysis: Reconstruction of regulatory signals in biological systems. Proc. Nat. Acad. Sci. USA, 100, 15522–15527[Abstract/Free Full Text].

    Mullen, J.R., et al. (2005) Yeast Rmi1/Nce4 controls genome stability as a subunit of the Sgs1-Top3 complex. Mol. Cell. Biol, . 25, 4476–4487[Abstract/Free Full Text].

    Nabney, I.T. Netlab: Algorithms for Pattern Recognition, (2002) , London Springer.

    Nachman, I., et al. (2004) Inferring quantitative models of regulatory networks from expression data. Bioinformatics, 20, i248–i256[Abstract].

    Papp, B. and Oliver, S. (2005) Genome-wide analysis of the context dependence of regulatory networks. Genome Biol, . 6, .

    Sabatti, C. and James, G.M. (2006) Bayesian sparse hidden components analysis for transcription regulation networks. Bioinformatics, 22, 739–746[Abstract/Free Full Text].

    Sabatti, C. (2005) Vocabulon: a dictionary model approach for reconstruction and localizarion of transcription factor binding sites. Bioinformatics, 21, 932–937[Abstract/Free Full Text].

    Sanguinetti, G., et al. (2005) Accounting for probe-level noise in principal component analysis of microarray data. Bioinformatics, 21, 3748–3754[Abstract/Free Full Text].

    Bioinformatics Sanguinetti, G., et al. (2006) A probabilistic dynamical model for quantitative inference of the regulatory mechanism of transcription. 22, 1753–1759.

    Spellman, P.T., et al. (1998) Comprehensive identification of cell cycle regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell, 9, 3273–3297[Abstract/Free Full Text].

    Tu, B.P., et al. (2005) Logic of the yeast metabolic cycle: temporal compartmentalization of cellular processes. Science, 310, 1152–1158[Abstract/Free Full Text].

    Xie, X., et al. (2005) Systematic discovery of regulatory motifs in human promoters and 3' UTRs by comparison of several mammals. Nature, 434, 338–345[CrossRef][Medline].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
G. Sanguinetti, A. Ruttor, M. Opper, and C. Archambeau
Switching regulatory models of cellular stress response
Bioinformatics, May 15, 2009; 25(10): 1280 - 1286.
[Abstract] [Full Text] [PDF]


Home page
J. Biol. Chem.Home page
K. S. Davidge, G. Sanguinetti, C. H. Yee, A. G. Cox, C. W. McLeod, C. E. Monk, B. E. Mann, R. Motterlini, and R. K. Poole
Carbon Monoxide-releasing Antibacterial Molecules Target Respiration and Global Transcriptional Regulators
J. Biol. Chem., February 13, 2009; 284(7): 4516 - 4524.
[Abstract] [Full Text] [PDF]


Home page
CarcinogenesisHome page
N. K. Clemo, T. J. Collard, S. L. Southern, K. D. Edwards, M. Moorghen, G. Packham, A. Hague, C. Paraskeva, and A. C. Williams
BAG-1 is up-regulated in colorectal tumour progression and promotes colorectal tumour cell survival through increased NF-{kappa}B activity
Carcinogenesis, April 1, 2008; 29(4): 849 - 857.
[Abstract] [Full Text] [PDF]


Home page
Brief Funct Genomic ProteomicHome page
J. Noirel, S. Y. Ow, G. Sanguinetti, A. Jaramillo, and P. C. Wright
Automated extraction of meaningful pathways from quantitative proteomics data
Brief Funct Genomic Proteomic, March 7, 2008; (2008) eln011v1.
[Abstract] [Full Text] [PDF]


Home page
J. Biol. Chem.Home page
L. R. Jarboe, D. R. Hyduke, L. M. Tran, K. J. Y. Chou, and J. C. Liao
Determination of the Escherichia coli S-Nitrosoglutathione Response Network Using Integrated Biochemical and Systems Analysis
J. Biol. Chem., February 22, 2008; 283(8): 5148 - 5157.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
M. P. Brynildsen, T.-Y. Wu, S.-S. Jang, and J. C. Liao
Biological network mapping and source signal deduction
Bioinformatics, July 15, 2007; 23(14): 1783 - 1791.
[Abstract] [Full Text] [PDF]


Home page
J. Biol. Chem.Home page
J. D. Partridge, G. Sanguinetti, D. P. Dibden, R. E. Roberts, R. K. Poole, and J. Green
Transition of Escherichia coli from Aerobic to Micro-aerobic Conditions Involves Fast and Slow Reacting Regulatory Components
J. Biol. Chem., April 13, 2007; 282(15): 11230 - 11237.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/22/2775    most recent
btl473v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (5)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Sanguinetti, G.
Right arrow Articles by Rattray, M.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Sanguinetti, G.
Right arrow Articles by Rattray, M.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?