Skip Navigation


Bioinformatics Advance Access originally published online on March 25, 2007
Bioinformatics 2007 23(11):1313-1320; doi:10.1093/bioinformatics/btm054
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/11/1313    most recent
btm054v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Jackson, E. S.
Right arrow Articles by Fitzgerald, W. J.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Jackson, E. S.
Right arrow Articles by Fitzgerald, W. J.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

A sequential Monte Carlo EM approach to the transcription factor binding site identification problem

Edmund S. Jackson * and William J. Fitzgerald

Signal Processing Laboratory, Department of Engineering, Cambridge University, UK

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MODELS
 3 INFERENCE ALGORITHM
 4 RESULTS AND DISCUSSION
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: A significant and stubbornly intractable problem in genome sequence analysis has been the de novo identification of transcription factor binding sites in promoter regions. Although theoretically pleasing, probabilistic methods have faced difficulties due to model mismatch and the nature of the biological sequence. These problems result in inference in a high dimensional, highly multimodal space, and consequently often display only local convergence and hence unsatisfactory performance.

Algorithm: In this article, we derive and demonstrate a novel method utilizing a sequential Monte Carlo-based expectation-maximization (EM) optimization to improve performance in this scenario. The Monte Carlo element should increase the robustness of the algorithm compared to classical EM. Furthermore, the parallel nature of the sequential Monte Carlo algorithm should be more robust than Gibbs sampling approaches to multimodality problems.

Results: We demonstrate the superior performance of this algorithm on both semi-synthetic and real data from Escherichia coli.

Availability: http://sigproc-eng.cam.ac.uk/~ej230/smc_em_tfbsid.tar.gz

Contact: ej230{at}cam.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MODELS
 3 INFERENCE ALGORITHM
 4 RESULTS AND DISCUSSION
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
The fundamental element in the control of gene expression is a class of proteins called transcription factors (TFs) (Watson et al., 2004). These direct the timing, location and rate of gene expression by binding to the DNA double helix immediately upstream of specific genes. Should the complete set of transcription factor binding sites (TFBSs) for an organism be found, it would reveal the fundamental topology of the gene regulatory network, a significant advancement of our understanding. Unfortunately, determining these TFBSs is still an open problem. Chemical assays exist, but are too costly and time consuming for a problem of this scale. The alternative is a computational approach. However, to date, no approach has proven sufficiently accurate or precise and much work remains (Hu et al., 2005; Jensen et al., 2004).

There exist two main categories of computational methods: enumerative and probabilistic. Enumerative methods search for over-represented sequences by direct string comparison. Although highly effective, they are not physically or biologically well motivated and hence cannot model physical traits beyond conservation. Probabilistic methods are model based and are hence much more flexible and potentially powerful. These methods usually treat the promoter sequence as a set of binding sites embedded in a background of nucleotides. Inference is performed to determine the location of the binding sites and their parameters. The independent multinomial model is the standard for the binding site and a Markov chain model is usually used for the background.

1.1 Probabilistic methods and justification
The canonical treatment of this problem is as a missing data problem, with the binding-site locations missing and the motif model parameters to be inferred. Hence, algorithms based on expectation-maximization (EM) (Dempster et al., 1977) (such as MEME (Bailey and Elkan, 1994)), as well as EM’s stochastic counterpart, data augmentation (Tanner and Wong, 1987), (such as the ‘Gibbs Sampler’ method (Hughes et al., 2000; Lawrence et al., 1993)) are popular.

We wish to infer the model parameters, being the motif-binding parameters (often represented as a ‘logo’ such as in Fig. 3). The dimension of this parameter space is Formula , where L is the motif length, and often >20. This may certainly be considered high dimensional, which is widely known to complicate inference.

A more serious concern, and the focus of this article, is the multimodatility of the parameter space. Unfortunately, in these problems there exist a multitude of suboptimal solutions, corresponding to the many (semi-)conserved patterns found among the promoters. This produces a highly multimodal, highly mode spread probability space. This is a well-known pathological condition for both classical EM and Gibbs sampling. Our algorithm is designed specifically for these problems.

In conditions, such as these ‘the Gibbs sampler is unable to move the Markov chain to another mode of equal importance because of its inability to step over valleys of low probability’ (Celeux et al., 2000). Importance sampling approaches, due to the parallel nature, present a promising solution. ‘There are settings ... where Markov Chain Monte Carlo (MCMC) has a very hard time converging ... while importance sampling based on identical proposals manages to reach regions of interest for the target distribution’ (Robert and Casella, 2004). For this reason, we propose an algorithm based on importance sampling. Specifically, we propose a sequential importance sampling/resampling (SIR) algorithm in place of the usual E step of the EM algorithm, following (Andrieu and Doucet 2003). This avoids the greediness problems of classical EM, and should produce a more robust representation of the probability landscape and hence superior results.

The convergence difficulties with the Gibbs sampler method for the TFBS identification problem have been recently and independently noted in Down and Hubbard (2005), who propose an alternative approach.

1.2 Development of the approach
In the statistics literature, SIR has long been shown useful for data augmentation (Rubin, 1987); however, there the sampling distribution is fixed. Given its importance for convergence, (Geyer, 1996) proposes a dynamic sampling distribution. They present a stochastic EM method with multiple parsing of the data set, optimizing the sampling distribution after each parse. While an improvement, for large data sets this is still computationally burdensome, and does not produce as rapid a convergence as possible. In Andrieu and Doucet (2003), which considers online estimation, the authors suggest even finer updating by segmenting the data set into blocks and updating the sampling distribution after each block. Similarly but for the batch case, Chopin (2002) develops a particle filter which has a stepwise evolving sampling distribution. This fine data partition and rapid sequential incorporation allows the data to quickly influence inference, and produces a tempering effect. Our proposal is to use the already fine segmentation of the promoter set into individual promoters, which should achieve these advantages.

In summary, we propose an EM-based algorithm due to its natural application to this latent variable mixture problem. However, we propose replacing the classical E step with a Monte Carlo approximation in order to gain robustness against multimodality. This is very similar to data augmentation; however, to gain further robustness we propose using an importance sampling method. This allows multiple, semi-independent explorations of the probability space. Finally, this importance sampling is performed sequentially (with resampling) as this allows a fine data partition and the itinerant advantages. This algorithm will be discussed in detail in Section 3 below.


    2 MODELS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MODELS
 3 INFERENCE ALGORITHM
 4 RESULTS AND DISCUSSION
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
The promoter region of each gene is modelled as a mixture of background sequence and an unknown number of probabilistically conserved motif instances at unknown locations. The vector of motif instance locations is treated as a hidden variable which is inferred and from which the model of the motif calculated. This section presents concise representation of this pervasive model.

2.1 Background model
Very little biological knowledge exists on the nature of this sequence; and hence given the linear nature of the data a Markov model of particular order is employed. Early efforts focused on order 0 models while increasingly higher-order models are utilized for eukaryotes. Recently switching Markov model based methods have been introduced (Down and Hubbard, 2005; Thompson, 2003) but these are beyond our scope.

We begin with some notation. Define the set of nucleotide symbols Formula . Furthermore, define the sequence of N nucleotides Formula and let sub-scripting indicate a sub-string operation, Formula . Let the space of Formula order Markov transitions be denoted by


Formula

Let the probabilities of a particular transition, {psi} isin {Psi}T, be written as Formula , with the set of all such probabilities given by


Formula

Let the set of such transitions observed in the string S be


Formula

Using the notation Formula to denote a transition of a Formula order Markov chain, the probability of a sequence, S, is given by


Formula

where Formula . The terms in Formula are calculated by appropriate lower order Markov chains, or usually simply ignored, as in the given approximation, as they contribute only a small edge effect. Thus, by exchangeability we have


Formula 1

(1)


Formula 2

(2)
where, the cardinality of the set Formula is simply the number of observations of the transition {psi} in the observed sequence.

The probabilities, {theta}0, are the parameters of this model, and are found empirically on a per-organism basis. Fitting the maximum-likelihood (ML) estimation (Koski, 2001) on the full set of promoter regions of the organism of interest results in


Formula 3

(3)
and hence the set of Formula order transition probabilities, Formula

2.2 Motif model
The binding-site motif is modelled as an independent, non-identically distributed multinomial distribution. Thus, a motif of width w, Formula has likelihood


Formula

where, {theta} is termed the ‘position weight matrix’ and defined as


Formula 4

(4)
where, Formula is the probability of observing a nucleotide of type j in position i of the motif. This matrix is the subject of subsequent inference.

In extension, the likelihood of a set of M motif instances, Formula , is found by exchangeability to be


Formula 5

(5)
Similarly to (3) the ML solution of the parameters is


Formula 6

(6)

2.3 Mixture model
Let Formula be a promoter set of K promoters, each of which is N nucleotides in length, Formula . Consider Formula as a single sequence modelled as a Markov chain with transition probabilities {theta}0 with a number of motif instances distributed according to an independent multinomial model with parameter {theta}, inserted randomly at positions Formula at positions m in motifs k, where Formula and Mk is the number of motif instances in promoter k.

Ignoring minor edge effects, the likelihood for this sequence is given through independence arguments by


Formula 7

(7)
where subscripts indicate sub-strings; Formula is the sequence with the motif instances at A removed and Formula is the set of motif instances. With slight abuse of notation, we may define the set of nucleotides at position i of all motif instances in promoter k as


Formula

and the set of all motif instances in promoter k as


Formula

Substituting the models (1) and (5) into (7), we arrive at the mixture likelihood function


Formula 8

(8)


    3 INFERENCE ALGORITHM
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MODELS
 3 INFERENCE ALGORITHM
 4 RESULTS AND DISCUSSION
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Having developed the models and likelihood functions in the previous section, it remains to perform inference based on them. The goal of this algorithm is to find the most probable model, Formula , of a repeated TFBS motif supposed within a set of sequences Formula . This is achieved by inferring the latent variable, A, the set of motif instance positions, and calculating Formula based on this. The calculation proceeds on two levels; the upper level is a stochastic EM algorithm to be described in Section 3.1 within which the E step is performed with a Monte Carlo algorithm described in Section 3.2.

3.1 Sequential Monte Carlo EM
The EM algorithm for the latent variable problem consists of two steps:

  • Expectation (E): of the log-likelihood function


    Formula 9

    (9)

  • Maximization (M): with respect to {theta}


    Formula 10

    (10)

which are iterated until Formula converges.

For the E step, we propose a sequential Monte Carlo (SMC) EM method, using the natural segmentation of S into promoters Sk and hence sampling via SIS/R. We approximate (9) by


Formula 11

(11)
The Monte Carlo aspect of this process is described in Section 3.2 below. In each EM iteration, once Formula formed it must be maximized. This is done by first initializing {theta} to the weighted average of each particle’s Formula and proceeding with simulated annealing. Then Formula is set to the maximum value of {theta}, and the next EM iteration commences. The resulting full algorithm is summarized in Algorithm 1.

3.2 Monte Carlo integration
As it is a non-standard technique, we very briefly review the Monte Carlo method here; further details are provided in (Andrieu, 2004; Robert and Casella, 2004). Monte Carlo integration is the use of computational techniques to generate a set of samples Formula in order to approximate a probability measure Formula , with an empirical measure Formula Under certain conditions the integral


Formula 12

(12)
may be approximated with


Formula 13

(13)
where Formula are independently sampled from Formula . Thus, if one can sample Formula and evaluate Formula one may perform this integral numerically.

Formula

3.2.1 Importance sampling
A particular method of sampling is to recast (12) as


Formula 14

(14)
It is apparent that by sampling Formula (termed the importance distribution) and weighting by Formula (the importance weights) one may approximate


Formula 15

(15)
provided that Formula . This technique is called importance sampling.

3.2.2 Sequential importance sampling (SIS)
In high-dimensional problems, it is difficult to find a good importance distribution. Sequential importance sampling solves this problem by building samples from the trial distribution sequentially by dimension. Decompose Formula into its dimensions Formula and let Formula , where Formula indicate all the dimensions up to and including k. Then the importance distribution may be written


Formula

This allows the importance weight to be written


Formula 16

(16)
and hence to be calculated sequentially according to


Formula 17

(17)

3.3 Sequential latent variable algorithm
Consider sequential importance sampling for inference in the case of latent variables. Following (Liu, 2001), let the data set comprise observed and missing data Formula with a parameter prior f({theta}). The problem at hand is to maximize the posterior Formula . By marginalization


Formula 18

(18)
This integral is a clear candidate for Monte Carlo solution with


Formula

So by sampling Formula we may solve the integral (18), producing an empirical posterior distribution function. Given the dimensionality, dimension-wise sequential importance sampling is appropriate. If we choose the sampling function to be


Formula 19

(19)
then from Equation (17) the weight recursion is solved as


Formula

Thus, the weight is updated sequentially as


Formula 20

(20)

3.4 Framing of TFBS-ID as a sequential latent variable problem
The TFBS-ID problem is clearly amenable to the sequential latent variable treatment outlined above. Beginning with the parameter estimation for a fixed motif model Formula i


Formula 21

(21)
By setting


Formula

we may invoke the latent variable framework, if we can sample Formula . Using the sequential framework we may sample from Equation (19)


Formula 22

(22)
and evaluate the weighting recursion (20)


Formula 23

(23)
In this section we will derive these equations.

3.4.1 The weighting function
Consider, marginalizing the weighting function developed as Equation (23) with respect to the discrete latent variable,


Formula 24

(24)
where, Formula is the set of all possible alignments in promoter k.

We assume that Ak is independent of Formula as there is no reason to believe that in any given collection of promoters, the TFs will bind in a related way between the promoters. Some TFs do display a position-dependent binding with respect to the transcription start site; however, this may be accounted for in the Formula term in (29). Thus, in the absence of the promoter data Sk, we may take the first term in Equation (24) as flat and absorb it into proportionality.


Formula 25

(25)
The term under summation may be solved by realizing that the only information that Formula contributes to inference at dimension k is an estimate of the model parameters Formula . The ML, Formula of (6) where Formula is


Formula 26

(26)
By substituting this into (8) we arrive at the likelihood function Formula , which is substituted into Equation (25) allowing us to write the unnormalized weighting function as


Formula 27

(27)
For simple models (a low number of motifs considered per promoter) this summation may be performed exhaustively, and this is the technique adopted in the proposed algorithm.

3.4.2 The sampling function
Upon substitution of the model, the sampling distribution for the missing data in (22) becomes:


Formula 28

(28)
Through Bayes’ rule we can manipulate


Formula 29

(29)
The first term is the prior on the binding site, which can reasonably be taken as flat. The second term is again the likelihood function (8), thus


Formula 30

(30)
As it is discrete, sampling this function is trivial through inversion sampling. Furthermore, it is computationally convenient, as it is required in the particle weighting (27) and hence need only to be calculated once.


    4 RESULTS AND DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MODELS
 3 INFERENCE ALGORITHM
 4 RESULTS AND DISCUSSION
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
4.1 Semi synthetic
In order to observe the properties of the proposed algorithm in a controlled setting, we have constructed a semi-synthetic test set. This allows the algorithm to be tested in an environment where the model assumptions are met, thus examining the algorithm, rather than the models. Furthermore, as the ground truth for this scenario is known, it is possible to assign true performance measures to the algorithm.

The test set consists of several subsets, each of which is a collection of artificial promoters, with a background of nucleotides generated according to human IID frequencies, into which are inserted one, or two real binding sites for CRP (catabolite receptor protein (Wingender et al., 2000)) at random locations. Each subset of the test set is differentiated by an artificial reduction of the information content of binding site. This is achieved by sequentially altering the binding-site model until it is the background model. Thus the test set consists of sets of promoters, each of which has a different information content.

We tested this semi-synthetic set with both our algorithm (SEM) and AlignAce (Hughes et al., 2000), which makes identical model assumptions but is Gibbs sampler based. This is done as the thesis of this article is that the Gibbs sampler should present inferior convergence to the true mode due to multimodality. The results of this test are shown in Figure 1, which display the true-positive and true-negative ratios of the two algorithms as a function of the varying information content. Here, the true and false counts are with respect to the nucleotides.


Figure 1
View larger version (28K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. TFBS semi-synthetic data results. This graph is plotted with information contents (in nats) on the x-axis and performance metric on the y-axis. It demonstrates the superior performance of the SEM algorithm in terms of higher true-positive ratio and lower false-positive ratio, especially at low information content.

 
This graph highlights two important characteristics of the algorithms. The first concerns the robustness of the true-positive ratio. At information content above ~120 nats (information content with logs taken to the natural number base), the algorithms are indistinguishable. However, the AlignAce algorithm collapses sharply, such that by 80 nats it makes no true predictions, while the SEM algorithm is still 80% correct, and only falls to 0% true positive by 60 nats. At virtually all points, the SEM algorithm has a higher true-positive ratio than the Gibbs sampler.

The second important feature is the false-positive ratio. Again, at virtually every point, the SEM has a lower false-positive ratio than AlignAce. More important is the failure mode of the algorithm. Below 80 nats AlignAce makes no true calls, yet continues to make false calls. SEM, however, fails into silence—when the true positive falls to zero, so does the false positive. This is because the algorithm will favour the case of no binding sites present, and hence will make no (true or false) predictions.

This is intentionally the simplest data case, in order to clearly demonstrate the superior ability of the proposed algorithm to find a solution that is known to exist. Although the data set is contrived, this example illustrates the superior performance of the proposed algorithm by controlling for obscuring effects that the more realistic tests to follow will introduce.

4.2 Full E.coli K12 set
A recent comparative study of motif-finding algorithms (Hu et al., 2005) provides an interesting benchmark of competitive algorithms. Included in this study is a data set consisting of laboratory-verified binding sites for 82 motifs from the E.coli K12 organism. Each example of each motif is aligned with a margin of the flanking nucleotides retained on each end. There are several sets where the length of the margin varies from 100 to 800 nucleotides. This provides a good test data set as the true position of the binding site in each sequence is known, yet the data are not synthetic.

4.2.1 Results
The data set is initially filtered to remove binding sites with fewer than two examples and to remove overlapping sequences of the genome. Each of the remaining sets has been run through the proposed algorithm and in each case five motifs have been requested. The results are reported for that with the highest specificity. The algorithm is tested with both an independent multinomial (Markov order 0) and a Formula order background model a flat (all frequencies set to 0.25) (SMC Flat), and an order Formula and Formula order Markov background (SMC O(0) and SMC O(3) respectively).

Table 1 below reports the algorithmic performance on this data set at various margin lengths and with a background model of varying order. Sensitivity and specificity are as usual defined by


Formula

where, Formula , Formula and Formula are the nucleotide true-positive, false-negative and false-positive counts. The performance of the comparative algorithms is reproduced from (Hu et al., 2005). This table shows the favourable performance of the algorithm with respect to competing algorithms when the SMC algorithm uses a flat background model. The higher-order models obtained from (Coessens et al., 2003) display rapidly decreasing performance. This is unexpected and explored below. The effect of the background model is displayed in Figure 2 below, which displays the specificity of the algorithm with a 300 nucleotide margin with a flat, and third-order Markov background for each TF. In addition, the percentage of A and T nucleotides in the true binding site is plotted, which we call the AT content henceforth.


Figure 2
View larger version (61K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. Specificity comparison. This graph shows the specificity for each TF individually, under two background models Flat and SMC O(3). It also plots the AT content of the true motif. It demonstrates the deleterious effect that high AT content exerts of specificity.

 

View this table:
[in this window]
[in a new window]

 
Table 1. Performance Summary. This table compares the specificity (Sp) and sensitivity (Sn) across the set between existing algorithms, and the proposed algorithm using zeroth- and third-order Markov background models

 
The performance of the flat background model is unexpectedly superior to that of the third-order Markov background. The reason, as Figure 2 shows, is the AT content, which we now explain. There are several (mostly unknown) factors that allow a motif to be recognized well by the algorithm. By ordering the TFs by specificity with zeroth-order background, we group them according to some such set of characteristics. Hence, neighbouring TFs on the abscissa of this plot are in some sense similarly difficult to find according to this large set unknown set of factors. Plotting the specificity of the model with a third-order background with the same ordering, we see that above ~ 20 % specificity the performance of the algorithm is similar for most TFs, except for being unable to find many. The final curve on this plot shows the AT content, and it is seen that TFs with high AT content relative to nearby TFs are identically those that the algorithm is unable to find with a third-order background.

This is because poly-A, poly-T sequences and poly-AT repeats occur often in intergenic regions. Hence, these transitions probabilities are highly weighted in the third-order background model that has been fitted to intergenic sequences. Consequently, when calculating the motif likelihood according to Equation (8), true motifs with these patterns will score a low probability. This explains this behaviour.

In addition, the signal strength of the logos is of interest. In the Supplementary Materials, we provide the Kullback–Leibler divergence or cross entropy between the empirical true- and inferred-motifs. We also plot the motif logos, which present the same information graphically, for CRP only below.

From these logos, which are a graphical representation of Formula several points may be made. Although, CRP has a very well defined and clear reference logo in the literature, it is not sufficiently conserved in this data set for the logo to be very clear, Figure 3. As all the binding sites in this set represent actual binding sites, this speaks to the faintness of the sought after signal in practice.


Figure 3
View larger version (30K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. CRP logos. On the top is the reference CRP Motif. This is the motif logo generated from a different reference data set (Robinson et al., 1998) of 549 binding sites for the CRP TF in E.coli K12. In the centre is the true motif in this test set. Although the same in form, it is clearly less conserved. The lower motif is that inferred by the algorithm. The algorithm is visibly only able to find well-conserved motifs. Produced with WebLogo (Crooks et al., 2004).

 

    5 CONCLUSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MODELS
 3 INFERENCE ALGORITHM
 4 RESULTS AND DISCUSSION
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
We have identified a problem in the current algorithmic approaches to TFBS identification arising from the irregularity of the probability space. The effect of this problem is that single line of search algorithms, such as the Gibbs’ sampler, may easily become trapped in local modes, leading to poor performance. To overcome this problem, we have proposed a batch setting SMC EM algorithm that benefits from having many independent, parallel searches, which by covering the alignment space more comprehensively return higher performance. We have demonstrated this increased performance with both semi-synthetic as well as true sequence data from E.coli. A strong observation from this work is that the main problem in inference problems in this field is the modelling. Specifically, there is difficulty in defining and thus fitting the background model, and hence inference in general is severely inhibited. The successful extension of this TFBS identification to higher organisms, such as mammalians, will require the solution to this problem.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MODELS
 3 INFERENCE ALGORITHM
 4 RESULTS AND DISCUSSION
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
The authors thank Mr Adam Johansen for his frequent discussions and corrections, and Dr Arnaud Doucet for his valuable input. Thanks Trinity College and the Cambridge Commonwealth Trust for funding.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Alex Bateman

Received on October 9, 2006; revised on January 21, 2007; accepted on February 9, 2007

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MODELS
 3 INFERENCE ALGORITHM
 4 RESULTS AND DISCUSSION
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Andrieu C. Monte Carlo methods for absolute beginners. Lecture Notes in Computer Science (2004) 113–145.

    Andrieu C, Doucet A. Online expectation-maximization type algorithms for parameter estimation in general state space models. In. (2003) Proceedings of the International Conference on Acoustics, Speech, and Signal Processing.

    Bailey TL, Elkan C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. (1994) Proceedings of the Second International conference on intelligent systems for molecular biology.

    Celeux G, et al. Computational and inferential difficulties with mixture posterior distributions. J. Am. Stat. Assoc (2000) 95.

    Chopin N. A sequential particle filter method for static models. Biometrika (2002) 89:539–552.[Abstract/Free Full Text]

    Coessens B, et al. INCLUSive: a web portal and service registry for microarray and regulatory sequence analysis. Nucleic Acids Res (2003) 31:3468–3470.[Abstract/Free Full Text]

    Crooks GE, et al. WebLogo: a sequence logo generator. Genome Res (2004) 14:1188–1190.[Abstract/Free Full Text]

    Dempster AP, et al. Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. R. Stat. Soci. Ser. B (1977) 39:1–38.

    Down TA, Hubbard T.JP. NestedMICA: sensitive inference of overrepresented motifs in nucleic acid sequence. Nucleic Acids Res (2005) 33:1445–1453.[Abstract/Free Full Text]

    Geyer CJ. Estimation and optimization of functions. In. Markov Chain Monte Carlo in Practice (1996).

    Hu J, et al. Limitations and potentials of current motif discovery algorithms. Bioinformatics (2005) 33.

    Hughes JD, et al. Computational identification of cis-regulatory elements associated with groups of functionally related genes in saccharomyces cerevisiae. J. Mol. Biol (2000) 296:1205–1214.[CrossRef][Web of Science][Medline]

    Hughes JD, et al. Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J. Mol. Biol (2000) 296:1205–1214.[CrossRef][Web of Science][Medline]

    Jensen ST, et al. Computational discovery of gene regulatory binding motifs: a Bayesian perspective. Stat. Sci (2004) 19:188–204.[CrossRef][Web of Science]

    Koski T. Hidden Markov Models for Bioinformatics (2001) Kluwer Academic Publishers.

    Lawrence CE, et al. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science (1993) 262:208–214.[Abstract/Free Full Text]

    Liu JS. Monte Carlo Strategies in Scientific Computing. (2001) Springer.

    Robert CP, Casella G. Monte Carlo statistical Methods. (2004) Springer.

    Robison K, et al. A comprehensive library of DNA binding site matrices for 55 proteins applied to the complete Escherichia coli K-12 genome. J. Mol. Biol (1998) 284:241–254.[CrossRef][Web of Science][Medline]

    Rubin DB. The calculation of posterior distributions by data augmentation: comment: a noniterative sampling/importance resampling alternative to the data augmentation algorithm for creating a few imputations when fractions of missing information are modest: the SIR algorithm. J. Am. Stat. Assoc (1987) 82.

    Tanner M, Wong W. The calculation of posterior distributions by data augmentation. J. Am. Stat. Assoc (1987) 82:528–550.[CrossRef][Web of Science]

    Thompson W, et al. Gibbs recursive sampler: finding transcription factor binding sites. Nucleic Acids Res (2003) 31:3580–3585.[Abstract/Free Full Text]

    Watson JD, et al. Molecular Biology of the Gene. (2004) Cold Spring Harbour Laboratory Press.

    Wingender E, et al. TRANSFAC: an integrated system for gene expression regulation. Nucleic Acids Res (2000) 28:316–319.[Abstract/Free Full Text]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/11/1313    most recent
btm054v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Jackson, E. S.
Right arrow Articles by Fitzgerald, W. J.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Jackson, E. S.
Right arrow Articles by Fitzgerald, W. J.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?