Gaussian process modelling of latent chemical species: applications to inferring transcription factor activities
1School of Computer Science, University of Manchester, Kilburn Building, Oxford Road, Manchester, M13 9PL and 2Adaptive Informatics Research Centre, Helsinki University of Technology, PO Box 5400, FI-02015 TKK, Finland
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Inference of latent chemical species in biochemical interaction networks is a key problem in estimation of the structure and parameters of the genetic, metabolic and protein interaction networks that underpin all biological processes. We present a framework for Bayesian marginalization of these latent chemical species through Gaussian process priors.
Results: We demonstrate our general approach on three different biological examples of single input motifs, including both activation and repression of transcription. We focus in particular on the problem of inferring transcription factor activity when the concentration of active protein cannot easily be measured. We show how the uncertainty in the inferred transcription factor activity can be integrated out in order to derive a likelihood function that can be used for the estimation of regulatory model parameters. An advantage of our approach is that we avoid the use of a coarsegrained discretization of continuous time functions, which would lead to a large number of additional parameters to be estimated. We develop exact (for linear regulation) and approximate (for non-linear regulation) inference schemes, which are much more efficient than competing sampling-based schemes and therefore provide us with a practical toolkit for model-based inference.
Availability: The software and data for recreating all the experiments in this paper is available in MATLAB from http://www.cs.man.ac.uk/~neill/gpsim.
Contact: neill{at}cs.man.ac.uk
| 1 INTRODUCTION |
|---|
|
|
|---|
Ordinary differential equations (ODEs) are the most common framework in use for modelling biological sub-systems (Alon, 2006). Well established methodologies have been developed for estimating the parameters of these equations in the context of a particular experiment or set of experiments, using e.g. least squares and maximum likelihood combined with an appropriate optimization algorithm (Mendes and Kell, 1998). More recently, significant progress has been made on Bayesian parameter estimation in the context of ODEs (Coleman and Block, 2006). Through the use of advanced Monte Carlo techniques it is even possible to, given a specific data set, rank model structures through the use of Bayes factors (Vyshemirsky and Girolami, 2008). This shows the potential for ODE models to be closely integrated with biological investigations, informing the process of biological experimental design.
A challenging problem for parameter estimation in ODE models occurs where one or more chemical species influencing the dynamics are controlled outside of the sub-system being modelled. For example, a signalling pathway can be triggered by a signal external to the pathway itself. In a regulatory sub-system, one or more transcription factors (TFs) may influence the expression of a set of target genes, but these TFs may not be regulated at the transcriptional level, instead being activated by another sub-system such as a signalling pathway. Similarly, in a metabolic pathway external metabolites and enzymes will influence the dynamics of the pathway. If these external chemical species have a constant influence, e.g. as in the case of steady state behaviour of a metabolic pathway, then they can simply be treated as additional parameters of the model and their effect can be estimated along with the other model parameters. However, more often these external factors are time-varying quantities. In this case, they are functional parameters and cannot be estimated by the standard methods discussed above. One approach for dealing with this is to discretize in time, treating the time-varying function as a sequence of discrete parameters. However, this leaves the problem of choosing the correct granularity for the discretization and either ignoring temporal continuity, or assuming a simple Markovian relationship and thereby introducing further parameters and assumptions. Here, we propose an alternative approach. We deal with these parameters as continuous functions of time, avoiding the need for arbitrary discretization.
To further compound the problem of dealing with the time-varying effects of these chemical species, their concentration is often not directly observable and their dynamics must therefore be inferred indirectly according to their influence on measured elements of the system. This is a common problem and it is a natural consequence of the fact that some quantities are relatively easy to measure in a high throughput manner (e.g. mRNA concentrations with a microarray), whereas others are much more difficult to measure (e.g. the concentration of TFs located in the nucleus). In this article, we advocate the use of Gaussian processes (GPs) to define prior distributions over these latent chemical species. This allows us to marginalize their contributions in the interaction network of interest. We present a basic toolkit of algorithms based on GPs which allow us to consider different response models (Michaelis Menten kinetics, repression responses) and cascades of interactions in which chemical species of interest are missing. The application domain we consider is inference of TF activity in both developmental and signalling networks.
Inference of TF activity in a given network is a well studied problem with both genome wide approaches (Liao et al., 2003; Sanguinetti et al., 2006a,b) and algorithms designed for a subset of genes (Barenco et al., 2006; Khanin et al., 2006; Nachman et al., 2004; Rogers et al., 2006). Our approach is most directly inspired by Barenco et al. (2006) who infer TF activity in a single input module network motif through a differential equation with a linear response. We build on their work to consider simple cascade networks and non-linear response models such as Michaelis Menten kinetics and repression responses (Alon, 2006; Khanin et al., 2006).
GPs (Rasmussen and Williams, 2006) are probabilistic models of functions that encode particular assumptions about the function such as smoothness and timescale. They are commonly applied in the context of regression and interpolation. GP modelling provides not only the inference of continuous quantities without discretization but also the natural capability of handling uncertainty. Another attractive characteristic of the GP is that the result of any linear operator on the function leads to another GP and we will exploit this when applying GPs in the context of differential equations.
Our focus in this article will be the inference of TF activities given mRNA concentrations. We start by considering the model given in Barenco et al. (2006) and reviewing the work done by Lawrence et al. (2007) who provided the first treatment of this model with GPs. Then, in Section 2.3, we extend our model to the case of repression. We follow Khanin et al. (2006) and apply the model in the context of the SOS system in Escherichia coli where genes are controlled by the transcriptional repressor protein LexA. In both these cases, no model of TF translation is included as the proteins are post-translationally regulated. In our final example (Section 2.4) we show how, in the context of Drosophila mesoderm development, a model of translation can also be incorporated to improve the quality of the TF inference in cases where TFs are primarily regulated at the transcriptional level.
| 2 METHODS AND RESULTS |
|---|
|
|
|---|
The following linear model of gene activation was considered by Barenco et al. (2006),
|
| (1) |
|
| (2) |
2.1 GP inference
Our general approach is to assume that f(t) was drawn from a GP. A GP is a prior distribution over functions that leads to highly flexible non-linear functions in which characteristics such as the stationarity, the roughness and the timescale of the signal can be controlled. GPs are the functional analogue of the Gaussian distribution. They are fully specified by a mean function, µ(t), and a covariance function, k(t,t'). The mean function is an unconstrained function of time, whilst the covariance function is constrained to be a positive definite function. Various covariance functions can be used but we will predominantly make use of the squared exponential covariance (sometimes known as the Gaussian or RBF covariance),
|
| (3) |
|
| (4) |
|
| (5) |
, cross covariances between the genes can be computed,
|
| (6) |
As was shown in Lawrence et al. (2007), if f(t) is drawn from a GP with a squared exponential covariance function, we can analytically solve the integrals in Equations (456) and thereby define a probabilistic model of the expression data for which the parameters of the differential equation, Bj, Sj, Dj and the scale l can all be determined, either through maximum likelihood or Bayesian sampling.
A likelihood function for the model parameters
={Bj, Sj, Dj}
and GP scale l is obtained by integrating out the latent function f(t)
|
| (7) |
Once the model parameters have been estimated then the model can be used to estimate the predictive GP distribution for f(t) and for each target gene xj(t). The mean and covariance of this predictive distribution for f(t) are given by,
|
| (8) |
|
| (9) |
ij
mr
where 
is the measurement noise associated with xi(tr). Here tp and tq denote any choice of points over which one wishes to evaluate the GP, whereas tr and tm are specifically the times at which the data are measured. This equation was used to plot the credible intervals for f in the figures presented in this article. It is similarly possible to compute the predictive distribution for the data by replacing f with xi in the above expression. The predictive distribution can be used for ranking genes according to their fit to the model [as in the application considered by Barenco et al. (2006)]. We illustrate our results from applying the method below, first on the previously studied linear activation model and then showing how we can extend the approach to non-linear response functions for activation and repression, followed by a two-layer activation model.
2.2 Activation
In this section, we consider a system in which a single TF activates a number of targets. The example we consider is the TF p53 which is a tumour repressor activated during DNA damage. According to Barenco et al. (2006), irradiation is performed to disrupt the equilibrium of the p53 network stimulating transcription of p53 target genes. Seven samples of the expression levels of the target genes in three replicas are collected as the raw time course data.
2.2.1 Linear activation
In this section, we recreate the results presented by Lawrence et al. (2007), for the linear model with several key differences. First, in their original paper Barenco et al. (2006) constrained f(0) to be zero, forcing the basal transcription rate to account for all transcription at time t=0. This constraint was not included in Lawrence et al. (2006), but is included here. This allows us to incorporate the prior information of the latent TF profiles as much as possible. Second, Lawrence et al. (2007) used an unnormalized version of the Affymetrix array data. We found that simple median based normalization removed the effect of a couple of repeats that were anomalously high. Inspection of the processed data used by Barenco et al. (2006) showed that they had also dealt with these anomalies.
In Figure 1, we show results of the TF inference as well as the model parameter estimations. The latent TF activity profiles are reconstructed with 95% credible intervals. The result resembles the activity profile of p53 measured by western blot (Barenco et al., 2006) and the kinetic parameters in the model are also closely matched with the results there (we followed Barenco et al. in fixing kinetic parameters for p21 to improve identifiability).
|
2.2.2 MAP-Laplace approximation
The differential equation with a linear response is an attractive model to use in the context of GPs as it allows the joint distribution over the gene expression and TF activity to be determined analytically, given the model parameters. However, as a model, it has some shortcomings. First, it treats both the gene expression and the TF activity as GPs. Since a GP cannot encode the information that a function is constrained positive, this means that the concentrations are a priori allowed to be negative. Whilst the posteriors, in the region where there is data, tend to stay positive (Fig. 1), when the predictions move away from the data they allow the TF activity to become negative. In Figure 2a, we show the results using the linear model for a different replicate to the one used in Figure 1. In this case it can be observed that, although the mean inferred profile remains positive or very close to zero, the inferred distribution of profiles summarized by the credible intervals includes profiles with negative concentrations. This contradicts our prior knowledge that concentrations are positive quantities.
|
A potential solution to this problem is to place a GP prior over, for example, the log of the TF activity. However, this, in effect, is a non-linear response in the differential equation. The non-linear response means that it is no longer possible to construct the joint distribution over gene expression and TF activity in a closed form. We must, necessarily, turn to approximations to make progress. In Lawrence et al. (2007) the use of a MAP-Laplace approximation is suggested. They demonstrate how the concentration of the TF activity can be constrained positive by placing the GP in log space.
Consider the following modification to the model,
|
| (10) |
|
| (11) |
We now build on this work in two major ways. We go beyond the simple exponential response model considered in Lawrence et al. (2007) by exploring a Michaelis Menten kinetics inspired response and a response that has been suggested as appropriate for repression (Alon, 2006). We also extend the algorithm to learn the parameters of the differential equation by maximizing the approximation to the log likelihood.
2.2.3 Michaelis Menten kinetics
Michaelis Menten reaction kinetics were designed for the case where a chemical reaction is enabled through an enzyme. If the concentration of the enzyme is much lower than the concentration of the substrate the reaction rate becomes limited by the reduced enzyme availability. The justification in the context of modelling transcription is that there are a limited number of binding sites on the genome for the TF. This bottle neck has the same effect as a reduced availability of enzyme and Michaelis Menten kinetics is therefore also appropriate as a model of transcription.
We implemented Michaelis Menten kinetics for the p53 data by taking the non-linearity to have the following form
|
| (12) |
j and we are using a GP f(t) to model the log of the TF activity. Figure 2b shows the results of applying the above kinetic model with positively constrained TF concentration to the p53 data set. It can be seen that the inferred distribution of TF profiles no longer includes negative concentrations and the result is now closer in shape to the replicate shown in Figure 1.
2.3 Repression
The same framework can easily be adapted to the case of a repressor by using an analogous Michaelis Menten model of repression,
|
| (13) |
We applied our method to the same microarray data as Khanin et al. (2006). They identify 14 target genes which are repressed by the TF LexA in E.Coli and mRNA measurements for these genes are collected over six time points. The amount of LexA is reduced after UV irradiation, decreasing for a few minutes and then recovering to its normal level.
In the case of repression, we have to include the transient terms in Equation (2),
|
| (14) |
j} are an additional set of model parameters that have to be inferred. The result for the inferred TF profile along with predictions for a selection of target genes are shown in Figure 3. Our inferred TF profile and reconstructed target gene profiles are similar to those obtained by Khanin et al. (2006), showing how different model parameters can lead to very different target gene expression given the same stimulus. However, whereas Khanin et al. (2006) had to use interpolation to go from a discretized model to a continuous profile, our GP representation avoids any need for discretization or interpolation. The model parameters shown in the figure are estimated by maximum likelihood.
|
2.4 Cascaded differential equations
As a final example, we consider a simple cascade of differential equations. Returning to the framework of the linear model, we consider the case where the TF protein is regulated at the level of transcription. In other words, we model the process of translation from mRNA to protein for the TF, but we are only able to measure the mRNA of the TF and of its targets. The simple model we consider would only be appropriate in the case where the TF does not require activation, e.g. by phosphorolation, after translation. The model is, therefore, not appropriate for many signalling pathways, but seems sensible in the context of development where TFs are often functional directly after translation. We therefore considered an example of the development of the mesoderm in Drosophila combining target gene predictions from Chromatin immunoprecipitation (ChIP–chip) data (Sandmann et al., 2006) with microarray time-course data from wild-type development (Tomancak et al., 2002).
We take the production rate of active TF to be given by,
|
| (15) |
gives the decay rate of the active TF protein,
gives a rate of translation and y(t) is the concentration of the TF's mRNA. The solution for f(t), setting transient terms to zero, is
|
| (16) |
. The derivation and form of the resulting covariance function are rather involved, but remarkably all the integrals required can be calculated analytically for the squared exponential covariance function given by Equation (3).
We applied the above driven input model to a simple cascade in Drosophila mesoderm development, focusing on the TF Mef2 (Myocyte enhancing factor 2). We selected six targets of Mef2 identified by ChIP–chip assays (Sandmann et al., 2006) and which were observed to be up-regulated after Mef2 is expressed. Affymetrix time course microarray data from wild-type embryos (Tomancak et al., 2002) provide us with observations of Mef2 expression (y(t)) and the expression of the target genes xj(t) at hourly intervals. The microarray time course was replicated three times, and we used all three replicates to fit the model parameters
,{Bj,Dj,Sj} and the kernel scale l by maximum likelihood (we set
=1 without loss of generality since the scale of the TF protein is arbitrarily determined by the Sj parameters).
Figure 4 shows the results for one of the replicates. Error bars associated with the mRNA data are obtained from the Affymetrix probe-level processing of each replicate (Liu et al., 2005). In Figure 4a, we show the inferred mRNA profile for Mef2 along with the microarray data. In Figure 4b, we show the inferred TF profile for Mef2. Figure 4c and d shows the model predictions for two of the targets, along with the associated expression data, showing again the difference in response possible given different model parameters.
|
We observe that using the TF mRNA data results in the model obtaining tighter credibility intervals for the inferred TF. The fit to the target gene expression looks reasonable, although the response in Figure 4c appears to be sharper than predicted by the model. It also appears that the peak of the inferred Mef2 mRNA profile precedes the peak in the data. The model here highlights the fact that some of our simplifying assumptions are being violated by the biology. The most critical assumption that we have made is that all of the targets used to train the model are unique targets of Mef2. In reality it is likely that other TFs are co-bound with Mef2. In future work, we expect to extend the model to account for these other TFs. We are currently using unpublished ChIP–chip data (obtained from E. Furlong, EMBL, Heidelberg) for other TFs involved in mesoderm development in order to refine our model and selection of targets.
| 3 DISCUSSION |
|---|
|
|
|---|
GPs allow for inference over chemical species concentrations in a continuous time manner. In this article, we have laid out the basic technologies required for this inference and highlighted some of the issues that arise: e.g. dealing with non-linear response models and layered systems. Barenco et al. (2006) have already shown that these models can be used for ranking targets of TFs. Such rankings can also be given in the context of our models, both for linear and non-linear response models.
Another important direction of future research will be scaling the models used to much larger systems with multiple interacting TFs. Larger systems will lead to an increase in computational requirements, particularly if it is necessary to account for correlations between multiple interacting latent chemical species.
The non-linear response models require particular approximations as exact inference in these models is not tractable. One issue with the Laplace approximation we have applied is that it is only responsive to the mode of the posterior distribution. Variational approximations (Jordan et al., 1998) provide an alternative approach which can lead to a rigorous lower bound on the likelihood. Finally, it also makes sense to validate the quality of the approximation through MCMC methods. We are working towards efficient MCMC algorithms for validating the quality of our approximate inference.
All the experiments we have shown have been in the context of transcription networks. However, the technologies are equally applicable for other biological systems, such as metabolic networks.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
For data provision and assistance with our questions we would like to thank Charles Girardot and Eileen Furlong of EMBL in Heidelberg (mesoderm development in D.Melanogaster), Martino Barenco and Mike Hubank at the Institute of Child Health in UCL (p53 pathway) and Raya Khanin and Ernst Wit of the University of Glasgow and the University of Lancaster (E.coli repressor system).
Funding: This work is funded by BBSRC Grant. No BBS/B/0076X Improved processing of microarray data with probabilistic models, EPSRC Grant. No EP/F005687/1 Gaussian Process Models for Systems Identification with Applications in Systems Biology and the EU FP6 Network of Excellence PASCAL.
Conflict of Interest: none declared.
| REFERENCES |
|---|
|
|
|---|
Alon U. An Introduction to Systems Biology: Design Principles of Biological Circuits. (2006) London: Chapman and Hall/CRC.
Barenco M, et al. Genome Biol (2006) 7:R25.[CrossRef][Medline]
Coleman MC, Block DE. Am. Inst. Chem. Eng. J (2006) 52:651–667.
Jordan MI, et al. Jordan Michael I, ed. (1998) Kluwer, Dordrecht, The Netherlands. 105–162. In Learning in Graphical Models, Vol. 89 of Series D: Behavioural and Social Sciences.
Khanin R, et al. Proc. Natl Acad. Sci. USA (2006) 103:18592–18596.
Lawrence ND, et al. Modelling transcriptional regulation using gaussian processes. In: Advances in Neural Information Processing Systems.—Schölkopf B, Platt JC, Hofmann T, eds. (2007) 19. MA: MIT Press. Cambridge.
Liao JC, et al. Proc. Natl Acad. Sci. USA (2003) 100:15522–15527.
Liu X, et al. Bioinformatics (2005) 21:3637–3644.
Mendes P, Kell D. Bioinformatics (1998) 14:869–883.
Nachman I, et al. Bioinformatics (2004) 20(Suppl. 1):248–256.[CrossRef]
Rasmussen CE, Williams CKI. Gaussian Processes for Machine Learning. (2006) Cambridge, MA: MIT Press.
Rogers S, et al. Probabilistic Modeling and Machine Learning in Structural and Systems Biology. (2006) Finland: Tuusula. June 17, 18, 2006.
Sandmann T, et al. Dev. Cell (2006) 10:797–807.[CrossRef][Web of Science][Medline]
Sanguinetti G, et al. Bioinformatics (2006a) 22:2275–2281.
Sanguinetti G, et al. Bioinformatics (2006b) 22:1753–1759.
Tomancak P, et al. Genome Biol (2002) 3. RESEARCH0088.
Vyshemirsky V, Girolami M. Bioinformatics (2008) 24:833–839.
This article has been cited by other articles:
![]() |
T. Aijo and H. Lahdesmaki Learning gene regulatory networks from gene expression measurements using non-parametric molecular kinetics Bioinformatics, November 15, 2009; 25(22): 2937 - 2944. [Abstract] [Full Text] [PDF] |
||||
![]() |
G. Sanguinetti, A. Ruttor, M. Opper, and C. Archambeau Switching regulatory models of cellular stress response Bioinformatics, May 15, 2009; 25(10): 1280 - 1286. [Abstract] [Full Text] [PDF] |
||||
![]() |
P. D. W. Kirk and M. P. H. Stumpf Gaussian process regression bootstrapping: exploring the effects of uncertainty in time course data Bioinformatics, May 15, 2009; 25(10): 1300 - 1306. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||




