Bioinformatics Advance Access originally published online on April 26, 2007
Bioinformatics 2007 23(12):1511-1518; doi:10.1093/bioinformatics/btm142
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
An approximation method for solving the steady-state probability distribution of probabilistic Boolean networks
1Advanced Modeling and Applied Computing Laboratory, Department of Mathematics, The University of Hong Kong, Pokfulam Road, Hong Kong, 2Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Hong Kong and 3Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji-city, Kyoto 611-0011, Japan
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Probabilistic Boolean networks (PBNs) have been proposed to model genetic regulatory interactions. The steady-state probability distribution of a PBN gives important information about the captured genetic network. The computation of the steady-state probability distribution usually includes construction of the transition probability matrix and computation of the steady-state probability distribution. The size of the transition probability matrix is 2n-by-2n where n is the number of genes in the genetic network. Therefore, the computational costs of these two steps are very expensive and it is essential to develop a fast approximation method.
Results: In this article, we propose an approximation method for computing the steady-state probability distribution of a PBN based on neglecting some Boolean networks (BNs) with very small probabilities during the construction of the transition probability matrix. An error analysis of this approximation method is given and theoretical result on the distribution of BNs in a PBN with at most two Boolean functions for one gene is also presented. These give a foundation and support for the approximation method. Numerical experiments based on a genetic network are given to demonstrate the efficiency of the proposed method.
Contact: sqzhang{at}hkusua.hku.hk
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
|
|
|---|
Mathematical modeling and computational study of regulatory interactions between DNA, RNA, proteins and small molecules based on the microarray data are hot topics in bioinformatics and have been studied by a number of researchers (Celis et al., 2000; Hughes et al., 2001; Lipshutz et al., 1999; Lockhart and Winzeler, 2000; Schena et al., 1995). There have been many formalisms proposed in the literatures to study the genetic regulatory networks such as directed graphs, Bayesian networks, Boolean networks (BNs), probabilistic Boolean networks (PBNs), ordinary and partial differential equations and many other mathematical models (Jong, 2002). Among these models, BN and PBN (an extension from BN) attract much attention in the biophysics community. Reviews of BN models can be found in Huang (1999), Kauffman (1993) and Somogyi and Sniegoski (1996). In a BN model, gene expression states are quantified into two levels: on and off (represented as 1 and 0, respectively). Even though most biological phenomena manifest them in continuous domain, the binary expression can capture the qualitative relationships, and show promising and useful results (Shmulevich and Zhang, 2002; Szallasi and Liang, 1998; Wuensche, 1998). In a BN model, the target gene is predicted by several genes called its input genes via a Boolean function. Once the input genes and the Boolean functions are known, the BN model is constructed deterministically.
However, genetic regulation exhibits uncertainty in the biological level and microarray data for inferring the model may have errors due to experimental noise in the complex measurement processes. Thus, a deterministic model is not favorable to such real situations and to develop a model incorporating the uncertainty is needed, which results in the development of PBNs. PBNs have been recently developed and studied in the literatures. In a PBN, for each gene, there can be more than one Boolean function. The state transits into a number of states with certain probabilities according to the realizations of all possible BNs. Thus, the dynamics (transitions) of the system can be described by a Markov chain. Detailed explanations of extending BN to the instantaneously random PBN can be found in Shmulevich et al. (2002a). The random gene perturbations are introduced into the PBN model in the paper (Shmulevich et al., 2002b), where the perturbation describes the random inputs from the outside. Introducing the random gene perturbation into the system makes it stable in the long run. Further extension of PBN model to the context-sensitive PBN model was introduced by Pal et al. (2005). A brief review of the PBN model based on mathematical formulation will be given in the next section.
Given a PBN, a natural and important problem is to study its steady-state probability distribution (Brun et al., 2005, Shmulevich et al., 2003). It provides the first-order statistical information of a PBN. Based on such information of a PBN, one can understand a genetic network, and identify the influence of different genes in a network. Furthermore, one can figure out how to control some genes in a network, such that the whole system can evolve into a target state or desired steady-state probability distribution. Therapeutic gene intervention or gene control policy (Datta et al., 2003, 2004; Ng et al., 2006; Shmulevich et al., 2002b) can then be developed and studied.
Monte-Carlo simulation method has been proposed in Shmulevich et al. (2003) to estimate the steady-state probability distribution of a PBN. The idea is that, by simulating the underlying Markov chain for a sufficiently long time until it converges, one can get an approximation of the steady-state probability distribution. Although it has been shown that Monte-Carlo simulation method can perform well in a small PBN, it can be successfully used only if we are sufficiently confident that the system has evolved into its steady state before the algorithm stops. Theoretically, a priori bound on the number of iterations is too large to be useful even for a moderate size network (Rosenthal, 1995). Thus in practice, only empirical determination methods can be used to stop the chain and get an estimate of the steady-state probability distribution (Shmulevich et al., 2003). On the other hand, matrix-based method, a deterministic method can be used to obtain the steady-state probability distribution more accurately and efficiently.
It is well known that in Markov chain theory, if a Markov chain is irreducible and aperiodic, the steady-state probability distribution is independent of the initial condition. We remark that in a PBN with random gene perturbations (Shmulevich et al., 2002b), the underlying transition probability matrix can be shown to be irreducible and aperiodic. In Zhang et al. (2007), power method has been successfully applied to compute the steady-state probability distribution based on an efficient construction of the transition probability matrix of a PBN without random perturbation. The complexity of the construction of probability transition matrix is
, where N is the total number of BNs and n is the number of genes.
The main aim of this article is to propose an efficient and effective approximation method, based on the method presented in Zhang et al. (2007) to find the steady-state probability distribution for the general PBN model. The rest of the article is organized as follows. In Section 2.1, a brief review on the mathematical formulation of the PBN model is given. In Section 2.2, the methodology to compute the steady-state probability is introduced with an error analysis. In Section 3.1, numerical experiments are given to demonstrate the efficiency and effectiveness of the proposed method. In Section 3.2, we give the probability distribution of BNs in a PBN with at most two Boolean functions for one gene. Finally, in Section 4, we give a brief summary to conclude the article.
| 2 METHODS |
|---|
|
|
|---|
2.1 A review of probabilistic Boolean networks
In this section, we give a brief review of the PBNs. A PBN is the generalization of a BN. A BN G(V, F) consists of a set of nodes V and Boolean functions F where
|
|
Let vi(t) be the state of vi at time t, where vi = 0 represents that the gene is unexpressed and vi = 1 means it is expressed. The overall expression levels of all the genes in the network at time step t is given by the following column vector
|
|
This vector is called the gene activity profile (GAP) of the network at time t. We note that when v(t) ranges from
(all entries are 0) to
(all entries are 1), it takes on all the 2n possible states of the n genes. The list of Boolean functions represents the rules of the regulatory interactions among the nodes (genes):
|
|
Here, each gene will update its state according to the states of its input genes in the previous step and its corresponding Boolean function. Thus, a BN is a deterministic dynamical system.
In a PBN, for each target gene, instead of having only one single Boolean function, it has a number of Boolean functions having equivalent prediction abilities. All these Boolean functions can be selected randomly with some probabilities. We assume that for the ith gene, there are l(i) possible Boolean functions:
|
|
and the probability of choosing function
is
, where
is a function with respect to the activity levels of n genes. A PBN is said to be independent if the elements from different F(i) are independent (Lähdesmäki et al., 2006). For an independent PBN of n genes, there are at most
|
| (1) |
different possible BNs. This means that there are totally N possible realizations of the genetic network. Suppose fj be the j-th possible realization,
|
|
The probability to choose the j-th realization is:
|
| (2) |
If the joint probability distribution of F(1), F(2), ... , F(n) cannot be factorized as the product of F(i), then it is a dependent PBN. For a dependent PBN, we use the same notations as those for independent PBNs. But notice that now the expressions of N and Pj will be different from (1) and (2).
Let a and b be any two column vectors with n entries being either 0 or 1, which represent the states of the system at time t + 1 and t. Then
|
|
By letting a and b ranging from
to
independently, one can get the transition probability matrix A with size
. It can be expressed as:
|
|
where Aj is the transition matrix corresponding to the j-th BN.
Random gene perturbation is the description of the random inputs from the outside due to external stimuli and this is meaningful in an open genome system. The effect of the random gene perturbation is to make the genes flip from state 1 to state 0 or vice versa. It makes the underlying Markov chain of the PBN ergodic and therefore all the 2n states in the system are communicated (Shmulevich et al., 2002b). When random gene perturbation is included, the transition probability matrix
is
|
| (3) |
Here
is Hamming distance between the two vectors v(i) and v(j), p is the perturbation probability of each gene and
is the indicator function.
The instantaneously random PBN was extended to the context-sensitive PBN in Pal et al. (2005). In a context-sensitive PBN, at each time step the BN will be changed with a probability q. The transition probability matrix A without the gene perturbations can be obtained from the following:
|
|
Here the probability
|
|
describes the chance that state b will make a transition into state a in network l when b belongs to network j. The difference between a context-sensitive PBN and an instantaneously random PBN is that the column vectors a and b are assumed to be in a certain BN with some probability at each time point and it will change to other BNs with a probability q. Similar to the instantaneously random PBN, when the column vectors a and b run from state
to
, we can get the transition matrix A. It can also be described as:
|
| (4) |
The random gene perturbations can be introduced into a context-sensitive PBN similarly.
2.2 Computation of the steady-state probability distribution
In this section, a computational method for approximating the steady-state probability distribution is introduced. The computational method consists of two steps: (i) the construction of the transition probability matrix of the PBN without perturbation and the construction of the perturbation matrix, with which we can get the final transition matrix; (ii) computing the eigenvector corresponding to the maximum eigenvalue. The eigenvector in the normalized form is the steady-state probability vector. From Equation (3), we observe that the final transition matrix
is the sum of the transition matrix without perturbation A multiplied by
and the perturbation matrix. The perturbation matrix is same for different networks since it only depends on the number of genes and the random gene perturbation probability. Although the construction of this matrix may cost much time, once the matrix is constructed, it can be used later for all the networks with same number of genes and same perturbation probability. When there is no perturbation, the transition matrix is sparse, while it is dense if there is perturbation. In the third step, the power method, an efficient and widely used method, is applied to solve the dominant eigenvalue and the corresponding eigenvector. We remark that in our case, the dominant eigenvector is actually the steady-state probability vector. In our numerical tests, the computation of the eigenvector can be finished within one minute. When there are 12 genes, it cost
5 s and when there are 14 genes, it cost
13 s only. Now our main aim is to reduce the computational cost for construction of the transition matrix A for both the instantaneously random PBN and the context-sensitive PBN without gene perturbations.
In Shmulevich et al. (2002c), the transition probability matrix A is constructed by computing all the entries one by one. For each entry, all the BNs have to be considered so as to determine if the network contributes to it or not. Take for example, if one wants to compute the entry A(j, i), one needs to consider if the first BN is applied, whether state j can be visited by i. And then consider the second network, and so on. Since the matrix A is large and sparse in practice, much time was wasted in computing the zero entries. In Zhang et al. (2007), an efficient algorithm has been proposed to construct the transition probability matrix. The idea is to consider the non-zero entries only. The method is based on the state-space of the Markov chain. Given a state i, if a specific Boolean function can lead it to state j, then A(j, i) will have value corresponding to the probability of this BN. If another BN also can lead state i to state j, then the probability will be greater by the probability corresponding to the BN. Although this is only an improvement in computing the transition probability matrix, it already saves much time and makes significant progress in computing the steady-state probability. We remark that the computational complexity is
for method proposed in Shmulevich et al. (2002c), while it is
in Zhang et al. (2007) where n is the number of genes and N is the number of BNs in the PBN. Moreover if there are m states for each gene, the computational complexity will change from
to
.
We observe that in many realizations of a PBN, a lot of BNs have slim chances to be chosen. Therefore our proposed method here is to consider only those BNs with probability greater than a given threshold. This of course can save much time in the construction of the transition probability matrix. Since the idea here only involves the probabilities of choosing the BNs, it can be extended to the PBNs having multiple values. Moreover, it can be applied to both the dependent and independent PBNs. In the next section, we will show some numerical experiments for the proposed method. The following is a simple explanation of the error due to removal of the BNs.
Suppose the steady-state probability vector of
is X. There are n0 BNs being removed whose corresponding transition matrix are (A1, A2, ... , An0) and their probability of being chosen are given by
, respectively. Then after the removal of these n0 BNs and making a normalization, the transition probability matrix becomes
|
| (5) |
Here
is the perturbation matrix. Suppose that the steady-state probability vector for the linear system
, then from (5), we have
|
| (6) |
|
| (7) |
We note that in each column of Ak (k=1,2, ...,n0), there is one non-zero entry and it is equal to one. We do not know the exact form of each Ak. Here, we assume that the position of the non-zero entry follows the uniform distribution. Let
|
|
Since the term
is the transition probability matrix with perturbation corresponding to the k-th BN, E(yi) = 1. We remark that the Chernoff bound in Motwani and Rahavan (1995, p. 72) states that:
Let
be independent Poisson trials such that for
,
where 0 < pi < 1. Then, for
|
|
and
,
|
|
By letting
,
and
, we can have
|
|
We should note that n should be larger than 2 since
is assumed (e is the base of natural log). However, if
,
always holds.
Thus, we have
|
|
|
|
|
|
If
is equal to or very close to
, we can see
|
|
Since this error estimate only gives an expected upper error bound, it cannot be applied for all the cases to estimate n0. From the analysis, we can see generally, the error bound can be determined by total probability of the removed BNs and the number of genes in the PBN.
Take for example, if n = 10, the total probability of all the removed BNs is 0.01 (the remaining networks is 0.99), and
. Then the expected residual norm
of the new steady-state probability vector is bounded by
|
|
We remark that for the context-sensitive PBN, from (4), we can see
|
|
corresponds to Pj in the instantaneously random PBN. With the same method, the error can be estimated.
| 3 RESULTS |
|---|
|
|
|---|
3.1 Numerical experiments
In this section, we present some numerical experiments based on a network described in Shmulevich et al. (2003). Gene (TOP2A) is the only input gene of gene (SCYB10); (INP10); (IP10) and the in-degree of it is zero, then it is not considered in the tests. Thus, the total number of genes studied in the network is 14 and the total number of possible states is equal to 214, i.e. 16 384. For simplicity of discussion, here we consider independent PBNs only. The settings of the experiments are the same as those in Zhang et al. (2007). The number of Boolean functions of each gene is given by
|
|
and the probabilities for choosing the Boolean functions is shown in Table 1. In the numerical experiment, the number of input genes in a BN is set to be no more than three and the input genes are randomly selected. The Boolean functions are also generated randomly.
|
All the numerical experiments are done in a notebook computer with the following configuration: Intel Pentium M 1.5 GHz and RAM: 768 MB. Figure 1 presents the distribution of all the 1536 BNs. We observe that a relatively small number of BNs with high probability constitute most of the total probability. Table 2 shows the experimental results for the instantaneously random PBNs and Table 3 shows the experimental results for the context-sensitive PBNs. The perturbation probability for each gene is set to be 0.01. In the results of the context-sensitive model, the probability of the transition from one network to others is set to be 0.01. We set a lower bound of all Pj. If Pj is less than the lower bound, the corresponding network is not considered in the construction of the transition probability matrix. We compute the error between the true steady-state probability distribution and the approximate one with the following vector norms:
|
|
|
3.2 Probability distribution of the Boolean networks
Since the number of input genes of one gene cannot be very large (Guelzim et al., 2002), the number of Boolean functions should be very small. For the independent PBN, if the maximum number of Boolean functions for all the target genes is two, and the probability of choosing one Boolean function follows the uniform distribution, then the number of BNs dropped given a threshold will follow some interesting probability distribution.
We first consider the case that for any gene i the probability that the first Boolean function is chosen is given by c. This means that
|
|
We may further assume
. In the case of c = 0.5, all BNs have the same probability 2–n of being chosen. Thus given a threshold, either all the BNs are removed or all are retained. If the threshold level is given by ß, then the condition that a BN will be removed is given by
|
|
|
|
Here, k is the number of first Boolean functions being chosen in the BNs. Let k* be the minimum integer such that the above inequality holds. Then the number of BNs being removed will be given by
|
|
Therefore, the proportion of BNs being removed is given by
|
|
We then consider the following. We derive the probability that a randomly chosen BN will be removed during the construction of the transition matrix in the approximate computation given that for any gene i, the probability of its first Boolean function being chosen follows the uniform distribution U(0, 1). We begin with the following lemma.
LEMMA 1. (Ross, 1997) If u follows the uniform distribution U[0, 1] then 1–u also follows the uniform distribution and the random variables
|
|
follow the exponential distribution
.
LEMMA 2. If
are independent and identically distributed, and follow the exponential distribution with
= 1, then
|
|
has the following Erlangian distribution of m phases
|
|
Proof: We will prove this by using mathematical induction. When
,
, the statement holds. Suppose that for
, we have
|
|
Then
|
|
Here the proposition follows.
PROPOSITION 1. Given that the threshold level is ß and there are n genes, the probability distribution function of all the BNs is given by
|
|
Proof: Assume there are n genes and for each gene i there correspondingly li = 2 Boolean functions:
. Let the probability of choosing
be
,
. Suppose that Pj is the probability of choosing the j -th BN,
|
| (8) |
Given a constant ß,
|
|
Since
,
follows the exponential distribution. According to the above lemmas,
|
| (9) |
Then by differentiating (9) with respect to ß, the probability density function is got as
|
|
We use a simple randomly generated network to illustrate the results. There are 12 genes in the network, each gene has 2 Boolean functions, and therefore there are totally 4096 BNs. For each gene there are 4 input genes, which are the input of one of the two Boolean functions. Figure 2 shows the distribution of all BNs. Figure 3 gives a close picture to the distribution of the first 500 BNs with highest probability. Table 4 shows the selection probabilities of all the Boolean functions. Tables 5 and 6 show the details of this experiment for both the instantaneously random PBN model and the context- sensitive PBN model. The notations are same as those in the previous section. The perturbation probability is 0.03 here. In the context-sensitive PBN, the transition probability of a BN to other BNs is set to be 0.5. From the errors, it can be seen that the approximate method can give a reasonable explanation of the original system after dropping some BNs. The total probability of the first 500 states with highest probability is about 0.88. We note that almost all of these states appear in the approximate solution depending on the requirement of the error. The computational time for constructing the transition matrix are recorded in the tables. It decreased much in our approximation method depending on the requirement of the error tolerance.
|
|
|
|
|
| 4 CONCLUSION |
|---|
|
|
|---|
In this article, we presented a matrix-based approximation method for computing the steady-state probability distribution of PBNs. This method works in the construction of the transition probability matrix, which is of complexity
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
W.-K.C. is supported in part by RGC Grant 7017/07P, HKU Strategic Research Theme Fund on Computational Physics and Numerical Methods, Hung Hing Ying Physical Research Fund, HKU GRCC Grants Nos. 10206647, 10206483 and 10206147.
M.K.N. is supported in part by RGC 7046/03P, 7035/04P, 7035/05P and HKBU FRGs.
T.A. is partially supported by Grant-in-Aid Systems Genomics from MEXT, Japan.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Trey Ideker
Received on September 25, 2006; revised on April 6, 2007; accepted on April 6, 2007
| REFERENCES |
|---|
|
|
|---|
Brun M, et al. Steady-state probabilities for attractors in probabilistic Boolean networks. EURASIP J. Signal Processing (2005) 85:1993–2013.
Celis JE, et al. Gene expression profiling: monitoring transcription and translation products using DNA microarrays and proteomics. FEBS Lett (2000) 480:2–16.[CrossRef][Web of Science][Medline]
Datta A, et al. External control in Markovian genetic regulatory networks. Mach. Learn (2003) 52:169–191.[CrossRef]
Datta A, et al. External control in Markovian genetic regulatory networks: the imperfect information case. Bioinformatics (2004) 20:924–930.
Guelzim N, et al. Topological and causal structure of the yeast transcriptional regulatory network. Nat. Genet (2002) 31:60–63.[CrossRef][Web of Science][Medline]
Huang S. Gene expression profiling, genetic networks, and cellular states: an integrating concept for tumorigenesis and drug discovery. J. Mol. Med (1999) 77:469–480.[CrossRef][Web of Science][Medline]
Hughes TR, et al. Expression profiling using microarrays fabricated by an ink-jet oligonucleotide synthesizer. Nat. Biotechnol (2001) 19:342–347.[CrossRef][Web of Science][Medline]
Jong HD. Modeling and simulation of genetic regulatory systems: a literature review. J. Comp. Biol (2002) 9:67–103.[CrossRef]
Kauffman SA. The Origins of Order: Self-organization and Selection in Evolution. (1993) New York: Oxford University Press.
Lähdesmäki H, et al. Relationships between probabilistic Boolean networks and dynamic Bayesian networks as models of gene regulatory networks. Signal Processing (2006) 86:814–834.[CrossRef][Web of Science][Medline]
Lipshutz RJ, et al. High density synthetic oligonucleotide arrays. Nat. Genet (1999) 21:20–24.[CrossRef][Web of Science][Medline]
Lockhart DJ, Winzeler EA. Genomics, gene expression and DNA arrays. Nature (2000) 405:827–836.[CrossRef][Medline]
Motwani R, Raghavan P. Randomized Algorithms. (1995) New York: Cambridge University Press.
Ng MK, et al. A control model for Markovian genetic regulatory networks. Trans. Comput. Syst. Biol (2006) 4070:36–48.[CrossRef]
Pal R, et al. Intervention in context-sensitive probabilistic Boolean networks. Bioinformatics (2005) 21:1211–1218.
Rosenthal JS. Minorization conditions and convergence rates for Markov chain Monte Carlo. J. Am. Stat. Assoc (1995) 90:558–566.[CrossRef][Web of Science]
Ross SM. Introduction to Probability Model. (1997) 7th edn. San Diego, CA: Academic Press.
Schena M, et al. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science (1995) 270:467–470.
Shmulevich I, Zhang W. Binary analysis and optimization-based normalization of gene expression data. Bioinformatics (2002) 18:555–565.
Shmulevich I, et al. From Boolean to probabilistic Boolean networks as models of genetic regulatory networks. Proc. IEEE (2002a) 90:1778–1792.[CrossRef]
Shmulevich I, et al. Gene perturbation and intervention in probabilistic Boolean networks. Bioinformatics (2002b) 18:1319–1331.
Shmulevich I, et al. Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks. Bioinformatics (2002c) 18:261–274.
Shmulevich I, et al. Steady-state analysis of genetic regulatory networks modeled by probabilistic Boolean networks. Comp. Funct. Genomics (2003) 4:601–608.[CrossRef]
Somogyi R, Sniegoski C. Modeling the complexity of gene networks: understanding multigenic and pleiotropic regulation. Complexity (1996) 1:45–63.
Szallasi Z, Liang S. Modeling the normal and neoplastic cell cycle with realistic Boolean genetic networks': their application for understanding carcinogenesis and assessing therapeutic strategies. Proc. Pac. Symp. Biocomput (1998) 3:66–76.
Wuensche A. Genomic regulation modeled as a network with basins of attraction. Proc. Pac. Symp. Biocomput (1998) 3:89–102.
Zhang S, et al. Simulation study in probabilistic Boolean network models for genetic regulatory networks. Int. J. Data Min. Bioinformatics (2007) 1:217–240.[CrossRef]
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||












