Bioinformatics Advance Access originally published online on July 28, 2008
Bioinformatics 2008 24(18):2071-2078; doi:10.1093/bioinformatics/btn367
Modelling non-stationary gene regulatory processes with a non-homogeneous Bayesian network and the allocation sampler
1School of Biological Sciences, The University of Edinburgh, Swann Building, King's Buildings, Edinburgh EH9 3JR, 2Centre for Systems Biology at Edinburgh (CSBE), Darwin Building, King's Buildings, Edinburgh EH9 3JU, 3Biomathematics and Statistics Scotland (BioSS), JCMB, King's Buildings, Edinburgh EH9 3JZ, 4Advanced Technologies (Cambridge) Ltd, Cambridge CB4 0WA and 5Division of Pathway Medicine (DPM), Medical School, The University of Edinburgh, Chancellor's Buildings, Edinburgh EH16 4SB, UK
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Method: The objective of the present article is to propose and evaluate a probabilistic approach based on Bayesian networks for modelling non-homogeneous and non-linear gene regulatory processes. The method is based on a mixture model, using latent variables to assign individual measurements to different classes. The practical inference follows the Bayesian paradigm and samples the network structure, the number of classes and the assignment of latent variables from the posterior distribution with Markov Chain Monte Carlo (MCMC), using the recently proposed allocation sampler as an alternative to RJMCMC.
Results: We have evaluated the method using three criteria: network reconstruction, statistical significance and biological plausibility. In terms of network reconstruction, we found improved results both for a synthetic network of known structure and for a small real regulatory network derived from the literature. We have assessed the statistical significance of the improvement on gene expression time series for two different systems (viral challenge of macrophages, and circadian rhythms in plants), where the proposed new scheme tends to outperform the classical BGe score. Regarding biological plausibility, we found that the inference results obtained with the proposed method were in excellent agreement with biological findings, predicting dichotomies that one would expect to find in the studied systems.
Availability: Two supplementary papers on theoretical (T) and experi-mental (E) aspects and the datasets used in our study are available from http://www.bioss.ac.uk/associates/marco/supplement/
Contact: marco{at}bioss.ac.uk, dirk{at}bioss.ac.uk
| 1 INTRODUCTION |
|---|
|
|
|---|
The ultimate objective of systems biology is the elucidation of the regulatory networks and signalling pathways of the cell. The ideal approach would be the deduction of a detailed mathematical description of the entire system in terms of a set of coupled non-linear differential equations. As high-throughput measurements at the cell level are inherently stochastic and most kinetic rate constants cannot be measured directly, the parameters of the system would have to be estimated from the data. Unfortunately, multiple parameter sets of non-linear systems of differential equations can offer equally plausible solutions, and standard optimization techniques in high-dimensional multimodal parameter spaces are not robust and do not provide a reliable indication of the confidence intervals. Most importantly, model selection would be impeded by the fact that more complex pathway models would always provide a better explanation of the data than less complex ones, rendering this approach intrinsically susceptible to over-fitting.
To assist the elucidation of regulatory network structures, probabilistic machine learning methods based on Bayesian networks can be employed, as proposed in the seminal paper by Friedman et al. (2000). In a nutshell, the idea is to simplify the mathematical description of the biological system by replacing coupled differential equations by simple conditional probability distributions of a standard form such that the unknown parameters can be integrated out analytically. This results in a scoring function (the marginal likelihood) of closed form that depends only on the structure of the regulatory network and avoids the over-fitting problem referred to above. Novel fast Markov Chain Monte Carlo (MCMC) algorithms, like Grzegorczyk and Husmeier (2008), can be applied to systematically search the space of network structures for those that are most consistent with the data. To obtain the closed form expression of the marginal likelihood referred to above, two probabilistic models with their respective conjugate prior distributions have been employed in the past: the multinomial distribution with the Dirichlet prior, leading to the so-called BDe score (Cooper and Herskovits, 1992), and the linear Gaussian distribution with the normal-Wishart prior, leading to the BGe score (Geiger and Heckerman, 1994). These approaches are restricted in that they either require the data to be discretized (BDe) or can only capture linear regulatory relationships (BGe). A non-linear non-discretized model based on heteroscedastic regression has been proposed by Imoto et al. (2003). However, this approach no longer allows the marginal likelihood to be obtained in closed form and requires a restrictive approximation (the Laplace approximation) to be adopted. Another non-linear model based on node-specific Gaussian mixture models has been proposed in Ko et al. (2007). Again, the marginal likelihood is intractable. The authors resort to the Bayesian information criterion (BIC) of Schwarz (1978) for model selection, which is only a good approximation to the marginal likelihood in the limit of very large datasets. In the present article we propose a non-linear generalization of the BGe score, which is motivated by the fact that any probability distribution can, in principle, be approximated arbitrarily closely by a mixture model. Our method is based on recent work by (Nobile and Fearnside 2007), who proposed the allocation sampler as an alternative to the computationally expensive approach of reversible jump Markov chain Monte Carlo (RJMCMC) (Green, 1995). We will describe the method in Section 2. We have evaluated our approach on a set of synthetic and real-world datasets described in Section 3 according to criteria outlined in Section 4. The results are presented in Section 5 and discussed in Section 6. A concluding summary of the proposed method and the results can be found in Section 7.
| 2 METHOD |
|---|
|
|
|---|
We focus on the most important methodological aspects. A more detailed representation can be found in Supplementary Material T.
2.1 Bayesian network methodology
Static Bayesian networks (BNs) are interpretable and flexible models for representing probabilistic relationships between interacting variables. At a qualitative level, the graph of a BN describes the relationships between the domain variables in the form of conditional independence relations. At a quantitative level, local relationships between variables are described by conditional probability distributions. Formally, a BN is defined by a graph
, a family of conditional probability distributions F, and their parameters q, which together specify a joint distribution over the domain variables.
The graph
of a BN consists of a set of N nodes (variables) X1,...,XN and a set of directed edges between these nodes. The parent set of node Xn, symbolically
n, is defined as the set of all parent nodes of Xn, that is, the set of nodes from which an edge points to Xn in
. The structure of a static BN is defined to be a directed acyclic graph (DAG), that is, a directed graph without any cycles of directed edges (loops). It is due to this acyclicity constraint that the joint probability distribution in BNs can be uniquely factorized as follows:
|
| (1) |
n). Given data
and a parametric model, (DAGs),
can be scored with respect to their posterior probabilities:
|
| (2) |
|
) is the marginal likelihood and P(
) is the prior distribution over the space of graphs. For two stochastic models BDe and BGe a closed-form solution can be derived for the likelihood P(
|
) (Cooper and Herskovits, 1992; Geiger and Heckerman, 1994).
When time series data (X1,t,...XN,t)t=1,...,m have been collected, dynamic Bayesian networks (DBNs) can be employed. In DBNs edges correspond to interactions with a time delay
; e.g. for
=1 an edge pointing from Xi to Xj means that the realization of Xj at time point t is influenced by the realization of Xi at the previous time point t–1. In DBNs parameters are tied such that the transition probabilities between time slices t–1 and t are the same for all t, resulting in a homogeneous Markovian dependence. Because of the time delay of interactions and the bipartite graph structure thus imposed, the acyclicity of the underlying graph
is automatically guaranteed, and Equation (1) is replaced by:
|
| (3) |
n,t–1 denotes the parent set of Xn at the previous time point t–1. For more details see Friedman et al. (1998).
MCMC methods can be used for sampling DAGs
from the posterior distribution P(
|
). The structure MCMC approach of Madigan and York (1995) generates a sample of graphs
1, ...,
T as follows: given a graph
i, a new candidate graph
i+1 is proposed with probability:
|
| (4) |
(
i) denotes the neighbourhood of
i, that is the collection of all valid graphs that can be reached from
i by deletion, addition or reversal of one single edge of the current graph
i, and |
(
i)| is the cardinality of this collection. We note that all neighbour graphs
i+1 have to be acyclic when non-dynamic BNs are employed. The graph
i+1 is accepted with probability:
|
| (5) |
i+1:=
i. The Markov chain {
i} converges to the posterior distribution P(
|
) (Madigan and York, 1995). If a fan-in restriction is imposed on the cardinality of the parent sets, all graphs possessing a node with more than fan-in parent nodes have to be excluded from the graph neighbourhoods. Structure MCMC generates a graph sample {
1,...,
T}, from which posterior probabilities of edges can be computed. We focus on undirected edges for independent data and directed edges for time-dependent data. There is an undirected edge between Xi and Xj (i<j) in
, if
possesses either the edge Xi
Xj or the edge Xi
Xj. Likewise, there is a directed edge from Xi to Xj (i
j) in
, if
possesses the edge Xi
Xj. An estimator for the posterior probabilities of an edge is given by the fraction of graphs in the sample that contain the edge of interest. When the true graph for the domain is known, the concept of receiver operator characteristic (ROC) curves and area under receiver operator characteristic (AUROC) values can be used to evaluate the global network reconstruction accuracy of BN inference (see e.g. Husmeier (2003) for details). An alternative and more intuitive criteria is given by (TP|FP=5) counts: for each MCMC output a threshold
is imposed on the inferred edge posterior probabilities such that five false positive (FP) edges are extracted and the corresponding number of true positive (TP) edges, symbolically (TP|FP=5), exceeding the threshold
, is counted (Werhli et al., 2006).
2.2 Gaussian mixture Bayesian network model
We assume that we have either m independent and identically distributed (iid) observations (BNs) or m+1 time-dependent observations with a homogeneous first-order Markovian dependence structure (DBNs) for the variables X1,...,XN. This gives a dataset matrix of size N-by-m, where
j (j=1,..., m) is the j-th observation of the N nodes. The allocation vector
of size m defines an allocation of the m observations to
mixture components:
means that the j-th observation is allocated to the k-th component.
denotes the data subset consisting of all observations allocated to the k-th component by
. The joint posterior probability of a graph
, an allocation vector
, and
mixture components can be factorized as follows:
|
| (6) |
|
| (7) |
can be computed independently with the BGe scoring metric of Geiger and Heckerman (1994), as derived and discussed in Supplementary Materials T. If no observation is allocated to the k-th component
)=const. For the synthetic Raf-Mek-Erk network data we employ a more restrictive graph prior (see Supplementary Materials T and E). For the prior on
, P(
), we take a truncated Poisson distribution with parameter
=1 restricted to 1


MAX. This prior is known to be suitable for finite mixture models (Nobile, 2005). We further assume that the probability distribution of the allocation vector
is given by:
|
| (8) |

pk = 1 are the non-negative mixture weights, and nk is the number of observations allocated to the k-th mixture component by
is thus given by
|
| (9) |
2.3 Gaussian mixture allocation MCMC inference
The new Gaussian mixture allocation MCMC sampling scheme (BGM) generates a sample from the joint posterior distribution
given in Equation (6) and comprises six different types of moves in the state-space
. The first move type is a structure MCMC single edge operation on the graph
while the number of components
and the allocation vector
are left unchanged. According to Equation (4), a new graph
is proposed, and the new state
is accepted or rejected according to Equation (5) where the likelihood terms P(
|
) in Equation (5) have to be replaced by the
terms given in Equation (7). The five other move types are adapted from Nobile and Fearnside (2007) and operate on
or on
and
. If there are
>2 mixture components, then moves of the type M1 and M2 can be used to re-allocate some observations from one component k1 to another one k2. That is, a new allocation vector
is proposed while
and
are left unchanged. The Ejection move type proposes an increase in the number of mixture components by one and simultaneously tries to re-allocate some observations to fill the new component. More precisely, it randomly selects a mixture component and tries to re-allocate some of its observations to the newly proposed component
+1, while
is left unchanged. The Absorption move is complementary to the Ejection move and decreases the number of mixture components by one. It randomly selects two mixture components and deletes one of them after having reallocated all of its observations to the other component. The acceptance probabilities for M1, M2, Ejection and Absorption moves are of the same functional form:
|
| (10) |
*=
for M1 and M2 moves
*=
+1 for Ejection moves, and
*=
–1 for Absorption moves. See Supplementary Material T and Nobile and Fearnside (2007) for details. Finally, the sixth move type uses Gibbs sampling to re-allocate a single observation by sampling its new allocation from the corresponding univariate conditional distribution, while leaving
and the other components of | 3 DATA |
|---|
|
|
|---|
We have evaluated the proposed method on synthetic data generated from a widely studied protein signalling network and on gene expression time series from two different biological systems. For details of the simulation studies see Supplementary Material E.
3.1 Synthetic data
For a comparative evaluation study, Werhli et al. (2006) generated synthetic datasets for the Raf-Mek-Erk signalling pathway presented in Sachs et al. (2005), which consists of 11 nodes representing phosphorylated proteins and 20 directed edges. As Werhli et al. (2006) assigned Gaussian regulatory mechanisms with varying (randomly sampled) parameters, we can generate Gaussian mixture network data as follows: for obtaining data with
=1,...,5 components we randomly selected
of the original datasets, sampled the same number of observations m/
from each and merged these observations to a single dataset of size m. For each of 16 combinations of m (m=30,60,120,180) and
we generated five datasets by applying this procedure. Additionally, for each
=1,...,5 we generated five further datasets along this line Werhli et al. (2006) with m=480 observations each.
3.2 Bone marrow-derived macrophages
Interferons (IFNs) play a pivotal role in the innate and adaptive mammalian immune response against infection, and central research efforts, therefore, aim to elucidate their regulatory interactions (Honda et al., 2006). For the present study, we have applied our method to gene expression time series from bone marrow-derived macrophages, which were sampled at 24 x 30 min time intervals. The macrophages were subjected to three external conditions: (1) infection with Cytomegalovirus (CMV), (2) treatment with Interferon Gamma (IFN
) and (3) infection with Cytomegalovirus after pretreatment with IFN
(CMV+IFN
). To obtain the gene expression profiles, samples derived from the macrophages were hybridized to Agilent mouse genome arrays. Samples were co-hybridized with a pooled common control RNA. Expression levels were obtained in the form of log2 scale signal intensity ratios between the sample and the pooled control RNA. Differential dye-label incorporation between the two samples on each array was corrected by applying a within-array, non-linear, loess normalization to the ratios. Global non-biological variations between ratio distributions were corrected by applying median-absolute-deviation between-array normalization. We focus on time series of the Interferon regulatory factors (Irfs) 1, 2 and 3 (which we write as Irf1, Irf2 and Irf3, respectively), as a gold standard network for the interactions between these factors can be derived from the literature (Darnell et al., 1994; Raza et al., 2008): Irf2
Irf1
Irf3. The Irfs are the key regulators in the response of the macrophage cell to pathogens. They mediate the cellular signalling that leads to a transcriptional response to the initial binding events on the surface of the cell.
3.3 Circadian regulation in Arabidopsis thaliana
We have also applied our method to two gene expression time series from A.thaliana cells, which were sampled at 13 x 2 h time intervals with Affymetrix microarray chips, and robust multi-array (RMA) normalized. The expressions were measured twice independently under experimentally generated constant light condition, but differed with respect to the prehistories. In the first experiment, T20, the plant was entrained in a 10h:10h light/dark cycle, while the plant in the second experiment, T28, was entrained in 14h:14h light/dark cycle. Our analysis focuses on nine genes, namely LHY, CCA1, TOC1, ELF4, ELF3, GI, PRR9, PRR5 and PRR3, which are known to be involved in circadian regulation (Mas, 2008; Salome and McClung 2004).
| 4 EVALUATION |
|---|
|
|
|---|
To evaluate the proposed method (BGM), we have compared it with Bayesian learning of homogeneous BNs using the standard BGe score, as described in Geiger and Heckerman, (1994). We have applied a 3-fold evaluation procedure. First, we use static synthetic network data to show that if the data are, in fact, of a heterogeneous nature, BGM achieves improved network reconstruction results. Second, using gene expression time series from bone marrow-derived macrophages, we focus on a small subsystem of the IFN pathway whose biology is well understood, and we demonstrate that BGM leads to a better pathway reconstruction. Third, we consider a larger set of circadian genes from A.thaliana. Since the true network structure in this case is not known, we apply two standard methods from statistics for the evaluation: Bayes factors and predictive distributions. We briefly describe these methods in the remainder of this section. The mathematical details can be found in Supplementary Material T. We want to compare two competing hypotheses. According to the null hypothesis H0, the conventional homogeneous DBN (BGe) is the adequate model. We want to compare this with the alternative hypothesis H1 that the proposed non-homogeneous DBN (BGM) provides the right description of the system. We want to pursue a Bayesian approach, according to which the decision between the two hypotheses is based on the Bayes factor: P(
|H1/P(
|H0). Note that the two hypotheses are nested, and that P(
|H0)=P(
|
=1,H1). We can therefore follow Huelsenbeck et al. (2004) and calculate the Bayes factor using the Savage-Dickey ratio (Verdinelli and Wasserman, 1995):
|
| (11) |
is the number of mixture components (segments). The validity of Equation (11) can easily be proven from Bayes rule:
|
| (12) |
and CMV+IFN
(macrophages) or T20 and T28 (circadian genes), respectively. Denote by
the gene expression data obtained under a condition used for training. Denote by
. As before,
denotes the number of mixture components,
denotes the graph, and let
. We get the following expression for the predictive distribution:
|
| (13) |
|
| (14) |
|
| (15) |
|
| (16) |
and | 5 RESULTS |
|---|
|
|
|---|
The mean AUROC values and the mean (TP|FP=5) counts for assessing the reconstruction of the Raf-Mek-Erk pathway from the synthetic data, described in Section 3.1, are represented as histograms in Figures 1 and 2. It can be seen that BGM performs significantly better than BDe and BGe for almost all combinations of
and m. Only if there is either one single component or a small sample size (m=30), there is no (significant) difference between BGM and BGe. In particular for
=1, BGM assigns all observations to one single component, and so does not differ from BGe. Figure 3 reveals that BGM infers for each number of components
TRUE the correct number of components for the synthetic data with m=480 observations. Histograms of the numbers of inferred components for the synthetic data with fewer data points m are provided inFigure 1 in Supplementary Material E.
|
|
|
A comparison of Figures 1 and 2 reveals that the reconstruction accuracy is slightly worse for the datasets with m=480 observations. This finding might appear counter intuitive, as larger datasets contain more information and should therefore lead to better performances. However, our finding is consistent with the fact that increased dataset sizes lead to likelihood landscapes that are more rugged and, hence, result in increased mixing and convergence problems. This shortcoming of the structure MCMC sampler by Madigan and York (1995) has already been reported (e.g. see Grzegorczyk and Husmeier, 2008).
For the macrophage gene expression time series, BGM infers
=2 components for the conditions CMV and IFN
, while for the third condition (CMV+IFN
) most of the sampled states consist of
=1 component only, as shown in Figure 4. The fraction of sampled states for which two observations i and j are allocated to the same component k (1
k
) can be used as a connectivity measure C(i, j). Figure 5 displays the resulting connectivity matrices graphically as heat matrices. From the heat matrices the same systematic trend can be observed for the three conditions. The first part (observations no. 2–6) and the last part of the three time series (observations no. 8–25) are allocated to different components. For condition CMV (IFN
) the allocation of observation no. 7 (no. 9) is not fixed, that is, the allocation changes during the MCMC simulation. For CMV+IFN
, whose number of components peaks at
=1 (Fig. 4), the separation between the two parts is less pronounced, though consistent with the other results. To understand whether BGM also leads to a better network reconstruction accuracy, we compare the mean posterior probabilities of the true and false edges of BGM in Figure 6 with those obtained from BGe and BDe. For the IFN
condition (Fig. 6b) it becomes obvious that BGM has performed substantially better than BGe and BDe. For the other two conditions the difference between the posterior means for the true and the false edges is also best for BGM, but the difference is less pronounced [BDe: 0.24, BGe: 0.24, BGM: 0.39 (CMV) and BDe: –0.42, BGe: 0.09, BGM: 0.14 (CMV+IFN
)]. Since it appears that the three conditions do not lead to systematic deviations between the expression profiles of Irf1, Irf2 and Irf3, we treat the three experiments as independent replications and compute predictive probabilities, as discussed in Section 4. The predictive probabilities for BGM are much higher than those of BGe (Table 2). This finding provides further evidence that BGM does not overfit the data but outputs results that can be confirmed by independent replications. The BGM/BGe Bayes factors are: 36.45 (CMV), 2.73 (IFN
) and 0.71 (CMV+IFN
). This finding is consistent with Figure 4, where the peaks for CMV and IFN
are at 2, while CMV+IFN
peaks at 1. Gene expression time series plots and scatter plots for the three Irf factors can be found in Supplementary Material E.
|
|
|
For both A.thaliana gene expression time series (see Supplementary Material E) the number of components inferred with BGM peaks at 2 (Fig. 7a and b). The heat matrices shown in Figure 7a and b appear to be of a similar structure, but subject to a translation along the main diagonal. More precisely, it appears that the transition from the first to the second component is shifted by 2–3 time points (4–6 h). Compared with BGe the Bayes factors are in favour of BGM: 5.66 (T20) and 9.41 (T28). The predictive probabilities are given in Table 1 and confirm the improved generalization performance. Further plots for the Arabidopsis data are provided in Supplementary Material E, and the inference results are discussed in more detail in Section 6.
|
|
| 6 DISCUSSION |
|---|
|
|
|---|
The results for the synthetic data generated from the Raf-Mek-Erk pathway of Sachs et al. (2005) show that the proposed BGM scheme consistently outperforms the conventional BGe and BDe metrics in terms of global network reconstruction accuracy (Figs 1 and 2). This confirms that BGM is superior when the data stem from a mixture distribution, and that the proposed sampling scheme (allocation MCMC) renders the inference, which is more complex than for the conventional case, practically viable. Furthermore, histograms of the number of inferred mixture components (Fig. 2) reveal that BGM succeeds in inferring the correct number of components. To assess whether BGM achieves any improvement for real biological applications, we applied it to gene expression data obtained from two different platforms (Agilent and Affymetrix) for two different systems: macrophages challenged with viral infection, and circadian rhythms in plants.
For macrophages challenged with CMV or pretreated with IFN
, BGM tends to infer a two-stage process (Fig. 4). This two-stage process reflects a state change in the host macrophage brought about by infection (CMV) or immune activation (IFN
), and can be found in all three experimental conditions (Fig. 5). Interestingly, though, this state change is less pronounced in the combined condition CMV+IFN
(Fig. 4c), where the Bayes factor does not support the more flexible heterogeneous model (see the previous section). This observation is consistent with the known biological responses of macrophages to simultaneous infection by virus (mCMV) and immune (IFN
) activation. It suggests that upon dual challenge with both an infection and immune activation (CMV+IFN
) signalling leads to a pronounced singular response. This is in agreement with observations of cooperation between viral and immune signalling in effective vigorous anti-viral state within the host macrophage, as discussed in Benedict et al. (2001).
For the A.thaliana gene expression time series, BGM also infers a two-stage process (Fig. 7). In this application, the two stages are most likely related to the diurnal nature of the dark/light cycle. We have applied our method to two sets of plant samples, which were subjected to different prehistories, related to different lengths of the artificial, experimentally controlled light/dark cycle. Although the two-stage nature of the process is preserved, the state co-allocation posterior probabilities, shown in the heatmap of Figure 7, points to a phase shift of about 4–6 h as a consequence of the increased day length. This phase shift is biologically plausible and indeed expected. It can be explained by the early phase of entrainment that is required to elicit a phase delay that matches the 24-h period of the wild-type plants to the longer light/dark cycle (T28), compared to the later phase of entrainment required to elicit a phase advance to match the shorter light/dark cycle (T20) (Johnson et al., 2003).
We anticipate that a non-linear and non-homogeneous generalization of (Bayesian) networks will have broader general utility for reconstructing regulatory networks in systems biology. In this regard there is increasing interest in the development of new statistical methods, as exemplified by the recent and related work of (L`bre, 2008). Our article complements this work and constitutes a natural generalization of the BGe score of (Geiger and Heckerman, 1994) by applying the ideas of mixture models and allocation sampling presented in Nobile and Fearnside (2007). This is an advantage over the work of Ko et al. (2007). While the latter model is more flexible owing to the fact that different nodes can have different breakpoints, it leaves the computation of the marginal likelihood intractable. The authors resort to BIC Schwarz (1978) as a crude approximation to the marginal likelihood. However, this approximation is only valid in the limit of very large datasets, and BIC is known to be over-regularized in many practical applications. For a more detailed theoretical comparison between BGM and the approaches of L`, (2008), and Ko et al. (2007) see Supplementary Material E. The evaluation of the proposed BGM approach on synthetic benchmark data and the novel application to two real biological scenarios provide an encouraging demonstration of the viability of the proposed BGM method.
| 7 CONCLUSION |
|---|
|
|
|---|
We have proposed a non-linear and non-homogeneous generalization of the BGe score for Bayesian networks (BGM). BGM is based on a mixture model, using latent variables to assign individual measurements to different classes. The practical inference follows the Bayesian paradigm and samples the graph, the number of classes and the assignment of latent variables from the posterior distribution with MCMC, using the allocation sampler of Nobile and Fearnside (2007) as an alternative to RJMCMC (Green, 1995). We have evaluated BGM using three criteria: network reconstruction, statistical significance and agreement with intrinsic biological features. In terms of network reconstruction, we found improved results both for a synthetic network of known structure (Figs 1 and 2) and for a small real regulatory network derived from the literature (Fig. 6). For assessing the statistical significance of the improvement, we computed two scores: Bayes factors and predictive distributions. We applied these scores to gene expression time series obtained on different platforms (Agilent and Affymetrix) for two different systems (viral challenge of macrophages and circadian rhythms in plants), where BGM tended to outperform BGe (Tables 1 and 2). Interestingly, we found that when the improvement obtained with BGM was significant, the posterior distribution peaked at two latent classes (Figs 4 and 7). This result provides excellent agreement with intrinsic dichotomies that we expect to find in these systems, related to the dichotomy between the healthy and diseased state of the cell, and the diurnal contrast between light and darkness.
|
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
We are grateful to T. Forster, S. Watterson, K. Robertson, and P. Dickinson for discussions and assistance with the handling and interpretation of the data, and to D. Nutter for technical support with computational issues.
Funding: M.G. is supported by the Biotechnology and Biological Sciences Research Council (BBSRC) and by the Engineering and Physical Sciences Research Council (EPSRC). D.H. is supported by the Scottish Government Rural and Environment Research and Analysis Directorate (RERAD). Experimental work on Arabidopsis was supported by BBSRC grant G19886 [GenBank] to A.J.M. The Centre for Systems Biology at Edinburgh is a Centre for Integrative Systems Biology (CISB) funded by BBSRC and EPSRC, reference BB/D019621/1.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Trey Ideker
Received on May 17, 2008; revised on July 9, 2008; accepted on July 14, 2008
| References |
|---|
|
|
|---|
Benedict CA, et al. Lymphotoxins and cytomegalovirus cooperatively induce interferon-b establishing host-virus détente. Immunity (2001) 15:617–626.[CrossRef][Web of Science][Medline]
Cooper GF, Herskovits E. A Bayesian method for the induction of probabilistic networks from data. Mach. Learn. (1992) 9:309–347.
Darnell J, et al. Jak-STAT pathways and transcriptional activation in response to IFNs and other extracellular signaling proteins. Science (1994) 264:1415–1421.
Friedman N, et al. Using Bayesian networks to analyze expression data. J. Comput. Biol. (2000) 7:601–620.[CrossRef][Web of Science][Medline]
Friedman N, et al. Learning the structure of dynamic Bayesian probabilistic networks. In: Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI).—Cooper GF, Moral S, eds. (1998) San Francisco, CA: Morgan Kaufmann Publishers. 139–147.
Geiger D, Heckerman D. Learning Gaussian networks. In: Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence (UAI).—López de Mántaras R, Poole D, eds. (1994) Seattle, Washington, USA: Morgan Kaufmann Publishers. 235–243.
Green P. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika (1995) 82:711–732.
Grzegorczyk M, Husmeier D. Improving the structure MCMC sampler for Bayesian networks by introducing a new edge reversal move. Mach. Learn. (2008) 71:265–305.[CrossRef]
Honda,K., et al. Type I Inteferon gene induction by the Interferon regulatory factor family of transcription factors. Immunity (2006) 25:349–360.[CrossRef][Web of Science][Medline]
Huelsenbeck J, et al. Bayesian phylogenetic model selection using reversible jump Markov chain monte carlo. Mol. Biol. Evol. (2004) 21:1123–1133.
Husmeier D. Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic Bayesian networks. Bioinformatics (2003) 19:2271–2282.
Imoto S, et al. Bayesian network and nonparametric heteroscedastic regression for nonlinear modeling of genetic network. J. Bioinform. Computat. Biol. (2003) 1:231–252.[CrossRef]
Johnson C, et al. Entrainment of circadian programs. Chronobiol. Int. (2003) 20:741–774.[CrossRef][Web of Science][Medline]
Ko Y, et al. Inference of gene pathways using Gaussian mixture models. In: Proceedings of the 2007 IEEE International Conference on Bioinformatics and Biomedicine (BIBM'07).—Xia J, ed. (2007) USA: IEEE Computer Society, Los Alamitos, CA. 362–367.
Lèbre S. Analyse de processus stochastiques pour la génomique : étude du modèle MTD et inférence de réseaux bayésiens dynamiques. In: Ph.D. thesis. (2008) Paris, France: Université d'Evry-Val-d'Essonne.
Madigan D, York J. Bayesian graphical models for discrete data. Int. Stat. Rev. (1995) 63:215–232.
Mas P. Circadian clock function in Arabidopsis thaliana: time beyond transcription. Trends Cell Biol. (2008) 18:273–281.[CrossRef][Web of Science][Medline]
Nobile A. Bayesian finite mixtures: a note on prior specification and posterior computation. In: Technical report. (2005) UK: Department of Statistics, University of Glasgow.
Nobile A, Fearnside A. Bayesian finite mixtures with an unknown number of components: the allocation sampler. Stat. Comput. (2007) 17:147–162.[CrossRef]
Raza S, et al. A logic based diagram of signalling pathways central to macrophage activation. BMC Syst. Biol. (2008) 2. Article 36.
Sachs K, et al. Protein-signaling networks derived from multiparameter single-cell data. Science (2005) 308:523–529.
Salome P, McClung C. The Arabidopsis thaliana clock. J. Biol. Rhythms (2004) 19:425–435.
Schwarz G. Estimating the dimension of a model. Ann. Stat. (1978) 6:461–464.[CrossRef]
Vehtari A, Lampinen J. Bayesian model assessment and comparison using cross-validation predictive densities. Neural Comput. (2002) 14:2439–2468.[CrossRef][Web of Science][Medline]
Verdinelli I, Wasserman L. Computing Bayes factors using a generalization of the Savage-Dickey density ratio. J. Am. Stat. Assoc. (1995) 90:614–618.[CrossRef][Web of Science]
Werhli AV, et al. Comparative evaluation of reverse engineering gene regulatory networks with relevance networks, graphical Gaussian models and Bayesian networks. Bioinformatics (2006) 22:2523–2531.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||


