Bioinformatics Advance Access originally published online on March 25, 2007
Bioinformatics 2007 23(11):1313-1320; doi:10.1093/bioinformatics/btm054
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
A sequential Monte Carlo EM approach to the transcription factor binding site identification problem
Signal Processing Laboratory, Department of Engineering, Cambridge University, UK
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: A significant and stubbornly intractable problem in genome sequence analysis has been the de novo identification of transcription factor binding sites in promoter regions. Although theoretically pleasing, probabilistic methods have faced difficulties due to model mismatch and the nature of the biological sequence. These problems result in inference in a high dimensional, highly multimodal space, and consequently often display only local convergence and hence unsatisfactory performance.
Algorithm: In this article, we derive and demonstrate a novel method utilizing a sequential Monte Carlo-based expectation-maximization (EM) optimization to improve performance in this scenario. The Monte Carlo element should increase the robustness of the algorithm compared to classical EM. Furthermore, the parallel nature of the sequential Monte Carlo algorithm should be more robust than Gibbs sampling approaches to multimodality problems.
Results: We demonstrate the superior performance of this algorithm on both semi-synthetic and real data from Escherichia coli.
Availability: http://sigproc-eng.cam.ac.uk/
ej230/smc_em_tfbsid.tar.gz
Contact: ej230{at}cam.ac.uk
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
|
|
|---|
The fundamental element in the control of gene expression is a class of proteins called transcription factors (TFs) (Watson et al., 2004). These direct the timing, location and rate of gene expression by binding to the DNA double helix immediately upstream of specific genes. Should the complete set of transcription factor binding sites (TFBSs) for an organism be found, it would reveal the fundamental topology of the gene regulatory network, a significant advancement of our understanding. Unfortunately, determining these TFBSs is still an open problem. Chemical assays exist, but are too costly and time consuming for a problem of this scale. The alternative is a computational approach. However, to date, no approach has proven sufficiently accurate or precise and much work remains (Hu et al., 2005; Jensen et al., 2004).
There exist two main categories of computational methods: enumerative and probabilistic. Enumerative methods search for over-represented sequences by direct string comparison. Although highly effective, they are not physically or biologically well motivated and hence cannot model physical traits beyond conservation. Probabilistic methods are model based and are hence much more flexible and potentially powerful. These methods usually treat the promoter sequence as a set of binding sites embedded in a background of nucleotides. Inference is performed to determine the location of the binding sites and their parameters. The independent multinomial model is the standard for the binding site and a Markov chain model is usually used for the background.
1.1 Probabilistic methods and justification
The canonical treatment of this problem is as a missing data problem, with the binding-site locations missing and the motif model parameters to be inferred. Hence, algorithms based on expectation-maximization (EM) (Dempster et al., 1977) (such as MEME (Bailey and Elkan, 1994)), as well as EMs stochastic counterpart, data augmentation (Tanner and Wong, 1987), (such as the Gibbs Sampler method (Hughes et al., 2000; Lawrence et al., 1993)) are popular.
We wish to infer the model parameters, being the motif-binding parameters (often represented as a logo such as in Fig. 3). The dimension of this parameter space is
, where L is the motif length, and often >20. This may certainly be considered high dimensional, which is widely known to complicate inference.
A more serious concern, and the focus of this article, is the multimodatility of the parameter space. Unfortunately, in these problems there exist a multitude of suboptimal solutions, corresponding to the many (semi-)conserved patterns found among the promoters. This produces a highly multimodal, highly mode spread probability space. This is a well-known pathological condition for both classical EM and Gibbs sampling. Our algorithm is designed specifically for these problems.
In conditions, such as these the Gibbs sampler is unable to move the Markov chain to another mode of equal importance because of its inability to step over valleys of low probability (Celeux et al., 2000). Importance sampling approaches, due to the parallel nature, present a promising solution. There are settings ... where Markov Chain Monte Carlo (MCMC) has a very hard time converging ... while importance sampling based on identical proposals manages to reach regions of interest for the target distribution (Robert and Casella, 2004). For this reason, we propose an algorithm based on importance sampling. Specifically, we propose a sequential importance sampling/resampling (SIR) algorithm in place of the usual E step of the EM algorithm, following (Andrieu and Doucet 2003). This avoids the greediness problems of classical EM, and should produce a more robust representation of the probability landscape and hence superior results.
The convergence difficulties with the Gibbs sampler method for the TFBS identification problem have been recently and independently noted in Down and Hubbard (2005), who propose an alternative approach.
1.2 Development of the approach
In the statistics literature, SIR has long been shown useful for data augmentation (Rubin, 1987); however, there the sampling distribution is fixed. Given its importance for convergence, (Geyer, 1996) proposes a dynamic sampling distribution. They present a stochastic EM method with multiple parsing of the data set, optimizing the sampling distribution after each parse. While an improvement, for large data sets this is still computationally burdensome, and does not produce as rapid a convergence as possible. In Andrieu and Doucet (2003), which considers online estimation, the authors suggest even finer updating by segmenting the data set into blocks and updating the sampling distribution after each block. Similarly but for the batch case, Chopin (2002) develops a particle filter which has a stepwise evolving sampling distribution. This fine data partition and rapid sequential incorporation allows the data to quickly influence inference, and produces a tempering effect. Our proposal is to use the already fine segmentation of the promoter set into individual promoters, which should achieve these advantages.
In summary, we propose an EM-based algorithm due to its natural application to this latent variable mixture problem. However, we propose replacing the classical E step with a Monte Carlo approximation in order to gain robustness against multimodality. This is very similar to data augmentation; however, to gain further robustness we propose using an importance sampling method. This allows multiple, semi-independent explorations of the probability space. Finally, this importance sampling is performed sequentially (with resampling) as this allows a fine data partition and the itinerant advantages. This algorithm will be discussed in detail in Section 3 below.
| 2 MODELS |
|---|
|
|
|---|
The promoter region of each gene is modelled as a mixture of background sequence and an unknown number of probabilistically conserved motif instances at unknown locations. The vector of motif instance locations is treated as a hidden variable which is inferred and from which the model of the motif calculated. This section presents concise representation of this pervasive model.
2.1 Background model
Very little biological knowledge exists on the nature of this sequence; and hence given the linear nature of the data a Markov model of particular order is employed. Early efforts focused on order 0 models while increasingly higher-order models are utilized for eukaryotes. Recently switching Markov model based methods have been introduced (Down and Hubbard, 2005; Thompson, 2003) but these are beyond our scope.
We begin with some notation. Define the set of nucleotide symbols
. Furthermore, define the sequence of N nucleotides
and let sub-scripting indicate a sub-string operation,
. Let the space of
order Markov transitions be denoted by
|
|
T, be written as |
|
|
|
|
|
|
| (1) |
|
| (2) |
in the observed sequence.
The probabilities,
0, are the parameters of this model, and are found empirically on a per-organism basis. Fitting the maximum-likelihood (ML) estimation (Koski, 2001) on the full set of promoter regions of the organism of interest results in
|
| (3) |
2.2 Motif model
The binding-site motif is modelled as an independent, non-identically distributed multinomial distribution. Thus, a motif of width w,
has likelihood
|
|
is termed the position weight matrix and defined as
|
| (4) |
In extension, the likelihood of a set of M motif instances,
, is found by exchangeability to be
|
| (5) |
|
| (6) |
2.3 Mixture model
Let
be a promoter set of K promoters, each of which is N nucleotides in length,
. Consider
as a single sequence modelled as a Markov chain with transition probabilities
0 with a number of motif instances distributed according to an independent multinomial model with parameter
, inserted randomly at positions
at positions m in motifs k, where
and Mk is the number of motif instances in promoter k.
Ignoring minor edge effects, the likelihood for this sequence is given through independence arguments by
|
| (7) |
|
|
|
|
|
| (8) |
| 3 INFERENCE ALGORITHM |
|---|
|
|
|---|
Having developed the models and likelihood functions in the previous section, it remains to perform inference based on them. The goal of this algorithm is to find the most probable model,
3.1 Sequential Monte Carlo EM
The EM algorithm for the latent variable problem consists of two steps:
which are iterated until
converges.
For the E step, we propose a sequential Monte Carlo (SMC) EM method, using the natural segmentation of S into promoters Sk and hence sampling via SIS/R. We approximate (9) by
|
| (11) |
to the weighted average of each particles
, and the next EM iteration commences. The resulting full algorithm is summarized in Algorithm 1.
3.2 Monte Carlo integration
As it is a non-standard technique, we very briefly review the Monte Carlo method here; further details are provided in (Andrieu, 2004; Robert and Casella, 2004). Monte Carlo integration is the use of computational techniques to generate a set of samples
in order to approximate a probability measure
, with an empirical measure
Under certain conditions the integral
|
| (12) |
|
| (13) |
3.2.1 Importance sampling
A particular method of sampling is to recast (12) as
|
| (14) |
|
| (15) |
3.2.2 Sequential importance sampling (SIS)
In high-dimensional problems, it is difficult to find a good importance distribution. Sequential importance sampling solves this problem by building samples from the trial distribution sequentially by dimension. Decompose
into its dimensions
and let
, where
indicate all the dimensions up to and including k. Then the importance distribution may be written
|
|
|
| (16) |
|
| (17) |
3.3 Sequential latent variable algorithm
Consider sequential importance sampling for inference in the case of latent variables. Following (Liu, 2001), let the data set comprise observed and missing data
with a parameter prior f(
). The problem at hand is to maximize the posterior
. By marginalization
|
| (18) |
|
|
|
| (19) |
|
|
|
| (20) |
3.4 Framing of TFBS-ID as a sequential latent variable problem
The TFBS-ID problem is clearly amenable to the sequential latent variable treatment outlined above. Beginning with the parameter estimation for a fixed motif model
i
|
| (21) |
|
|
|
| (22) |
|
| (23) |
3.4.1 The weighting function
Consider, marginalizing the weighting function developed as Equation (23) with respect to the discrete latent variable,
|
| (24) |
We assume that Ak is independent of
as there is no reason to believe that in any given collection of promoters, the TFs will bind in a related way between the promoters. Some TFs do display a position-dependent binding with respect to the transcription start site; however, this may be accounted for in the
term in (29). Thus, in the absence of the promoter data Sk, we may take the first term in Equation (24) as flat and absorb it into proportionality.
|
| (25) |
|
| (26) |
|
| (27) |
3.4.2 The sampling function
Upon substitution of the model, the sampling distribution for the missing data in (22) becomes:
|
| (28) |
|
| (29) |
|
| (30) |
| 4 RESULTS AND DISCUSSION |
|---|
|
|
|---|
4.1 Semi synthetic
In order to observe the properties of the proposed algorithm in a controlled setting, we have constructed a semi-synthetic test set. This allows the algorithm to be tested in an environment where the model assumptions are met, thus examining the algorithm, rather than the models. Furthermore, as the ground truth for this scenario is known, it is possible to assign true performance measures to the algorithm.
The test set consists of several subsets, each of which is a collection of artificial promoters, with a background of nucleotides generated according to human IID frequencies, into which are inserted one, or two real binding sites for CRP (catabolite receptor protein (Wingender et al., 2000)) at random locations. Each subset of the test set is differentiated by an artificial reduction of the information content of binding site. This is achieved by sequentially altering the binding-site model until it is the background model. Thus the test set consists of sets of promoters, each of which has a different information content.
We tested this semi-synthetic set with both our algorithm (SEM) and AlignAce (Hughes et al., 2000), which makes identical model assumptions but is Gibbs sampler based. This is done as the thesis of this article is that the Gibbs sampler should present inferior convergence to the true mode due to multimodality. The results of this test are shown in Figure 1, which display the true-positive and true-negative ratios of the two algorithms as a function of the varying information content. Here, the true and false counts are with respect to the nucleotides.
|
This graph highlights two important characteristics of the algorithms. The first concerns the robustness of the true-positive ratio. At information content above
120 nats (information content with logs taken to the natural number base), the algorithms are indistinguishable. However, the AlignAce algorithm collapses sharply, such that by 80 nats it makes no true predictions, while the SEM algorithm is still 80% correct, and only falls to 0% true positive by 60 nats. At virtually all points, the SEM algorithm has a higher true-positive ratio than the Gibbs sampler. The second important feature is the false-positive ratio. Again, at virtually every point, the SEM has a lower false-positive ratio than AlignAce. More important is the failure mode of the algorithm. Below 80 nats AlignAce makes no true calls, yet continues to make false calls. SEM, however, fails into silence—when the true positive falls to zero, so does the false positive. This is because the algorithm will favour the case of no binding sites present, and hence will make no (true or false) predictions.
This is intentionally the simplest data case, in order to clearly demonstrate the superior ability of the proposed algorithm to find a solution that is known to exist. Although the data set is contrived, this example illustrates the superior performance of the proposed algorithm by controlling for obscuring effects that the more realistic tests to follow will introduce.
4.2 Full E.coli K12 set
A recent comparative study of motif-finding algorithms (Hu et al., 2005) provides an interesting benchmark of competitive algorithms. Included in this study is a data set consisting of laboratory-verified binding sites for 82 motifs from the E.coli K12 organism. Each example of each motif is aligned with a margin of the flanking nucleotides retained on each end. There are several sets where the length of the margin varies from 100 to 800 nucleotides. This provides a good test data set as the true position of the binding site in each sequence is known, yet the data are not synthetic.
4.2.1 Results
The data set is initially filtered to remove binding sites with fewer than two examples and to remove overlapping sequences of the genome. Each of the remaining sets has been run through the proposed algorithm and in each case five motifs have been requested. The results are reported for that with the highest specificity. The algorithm is tested with both an independent multinomial (Markov order 0) and a
order background model a flat (all frequencies set to 0.25) (SMC Flat), and an order
and
order Markov background (SMC O(0) and SMC O(3) respectively).
Table 1 below reports the algorithmic performance on this data set at various margin lengths and with a background model of varying order. Sensitivity and specificity are as usual defined by
|
|
|
|
The performance of the flat background model is unexpectedly superior to that of the third-order Markov background. The reason, as Figure 2 shows, is the AT content, which we now explain. There are several (mostly unknown) factors that allow a motif to be recognized well by the algorithm. By ordering the TFs by specificity with zeroth-order background, we group them according to some such set of characteristics. Hence, neighbouring TFs on the abscissa of this plot are in some sense similarly difficult to find according to this large set unknown set of factors. Plotting the specificity of the model with a third-order background with the same ordering, we see that above
20 % specificity the performance of the algorithm is similar for most TFs, except for being unable to find many. The final curve on this plot shows the AT content, and it is seen that TFs with high AT content relative to nearby TFs are identically those that the algorithm is unable to find with a third-order background. This is because poly-A, poly-T sequences and poly-AT repeats occur often in intergenic regions. Hence, these transitions probabilities are highly weighted in the third-order background model that has been fitted to intergenic sequences. Consequently, when calculating the motif likelihood according to Equation (8), true motifs with these patterns will score a low probability. This explains this behaviour.
In addition, the signal strength of the logos is of interest. In the Supplementary Materials, we provide the Kullback–Leibler divergence or cross entropy between the empirical true- and inferred-motifs. We also plot the motif logos, which present the same information graphically, for CRP only below.
From these logos, which are a graphical representation of
several points may be made. Although, CRP has a very well defined and clear reference logo in the literature, it is not sufficiently conserved in this data set for the logo to be very clear, Figure 3. As all the binding sites in this set represent actual binding sites, this speaks to the faintness of the sought after signal in practice.
|
| 5 CONCLUSION |
|---|
|
|
|---|
We have identified a problem in the current algorithmic approaches to TFBS identification arising from the irregularity of the probability space. The effect of this problem is that single line of search algorithms, such as the Gibbs sampler, may easily become trapped in local modes, leading to poor performance. To overcome this problem, we have proposed a batch setting SMC EM algorithm that benefits from having many independent, parallel searches, which by covering the alignment space more comprehensively return higher performance. We have demonstrated this increased performance with both semi-synthetic as well as true sequence data from E.coli. A strong observation from this work is that the main problem in inference problems in this field is the modelling. Specifically, there is difficulty in defining and thus fitting the background model, and hence inference in general is severely inhibited. The successful extension of this TFBS identification to higher organisms, such as mammalians, will require the solution to this problem.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
The authors thank Mr Adam Johansen for his frequent discussions and corrections, and Dr Arnaud Doucet for his valuable input. Thanks Trinity College and the Cambridge Commonwealth Trust for funding.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Alex Bateman
Received on October 9, 2006; revised on January 21, 2007; accepted on February 9, 2007
| REFERENCES |
|---|
|
|
|---|
Andrieu C. Monte Carlo methods for absolute beginners. Lecture Notes in Computer Science (2004) 113–145.
Andrieu C, Doucet A. Online expectation-maximization type algorithms for parameter estimation in general state space models. In. (2003) Proceedings of the International Conference on Acoustics, Speech, and Signal Processing.
Bailey TL, Elkan C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. (1994) Proceedings of the Second International conference on intelligent systems for molecular biology.
Celeux G, et al. Computational and inferential difficulties with mixture posterior distributions. J. Am. Stat. Assoc (2000) 95.
Chopin N. A sequential particle filter method for static models. Biometrika (2002) 89:539–552.
Coessens B, et al. INCLUSive: a web portal and service registry for microarray and regulatory sequence analysis. Nucleic Acids Res (2003) 31:3468–3470.
Crooks GE, et al. WebLogo: a sequence logo generator. Genome Res (2004) 14:1188–1190.
Dempster AP, et al. Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. R. Stat. Soci. Ser. B (1977) 39:1–38.
Down TA, Hubbard T.JP. NestedMICA: sensitive inference of overrepresented motifs in nucleic acid sequence. Nucleic Acids Res (2005) 33:1445–1453.
Geyer CJ. Estimation and optimization of functions. In. Markov Chain Monte Carlo in Practice (1996).
Hu J, et al. Limitations and potentials of current motif discovery algorithms. Bioinformatics (2005) 33.
Hughes JD, et al. Computational identification of cis-regulatory elements associated with groups of functionally related genes in saccharomyces cerevisiae. J. Mol. Biol (2000) 296:1205–1214.[CrossRef][Web of Science][Medline]
Hughes JD, et al. Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J. Mol. Biol (2000) 296:1205–1214.[CrossRef][Web of Science][Medline]
Jensen ST, et al. Computational discovery of gene regulatory binding motifs: a Bayesian perspective. Stat. Sci (2004) 19:188–204.[CrossRef][Web of Science]
Koski T. Hidden Markov Models for Bioinformatics (2001) Kluwer Academic Publishers.
Lawrence CE, et al. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science (1993) 262:208–214.
Liu JS. Monte Carlo Strategies in Scientific Computing. (2001) Springer.
Robert CP, Casella G. Monte Carlo statistical Methods. (2004) Springer.
Robison K, et al. A comprehensive library of DNA binding site matrices for 55 proteins applied to the complete Escherichia coli K-12 genome. J. Mol. Biol (1998) 284:241–254.[CrossRef][Web of Science][Medline]
Rubin DB. The calculation of posterior distributions by data augmentation: comment: a noniterative sampling/importance resampling alternative to the data augmentation algorithm for creating a few imputations when fractions of missing information are modest: the SIR algorithm. J. Am. Stat. Assoc (1987) 82.
Tanner M, Wong W. The calculation of posterior distributions by data augmentation. J. Am. Stat. Assoc (1987) 82:528–550.[CrossRef][Web of Science]
Thompson W, et al. Gibbs recursive sampler: finding transcription factor binding sites. Nucleic Acids Res (2003) 31:3580–3585.
Watson JD, et al. Molecular Biology of the Gene. (2004) Cold Spring Harbour Laboratory Press.
Wingender E, et al. TRANSFAC: an integrated system for gene expression regulation. Nucleic Acids Res (2000) 28:316–319.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||














