Skip Navigation


Bioinformatics Advance Access originally published online on November 17, 2007
Bioinformatics 2008 24(1):46-55; doi:10.1093/bioinformatics/btm543
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrowOA All Versions of this Article:
24/1/46    most recent
btm543v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Google Scholar
Right arrow Articles by Liang, K.-C.
Right arrow Articles by Anastassiou, D.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Liang, K.-C.
Right arrow Articles by Anastassiou, D.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

A profile-based deterministic sequential Monte Carlo algorithm for motif discovery

Kuo-Ching Liang , Xiaodong Wang * and Dimitris Anastassiou

Columbia University, Department of Electrical Engineering, New York, NY 10025, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 SYSTEM MODEL
 3 DSMC MOTIF DISCOVERY...
 4 INSERTION/DELETION MODEL
 5 EXPERIMENTAL RESULTS
 6 DISCUSSION AND CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: Conserved motifs often represent biological significance, providing insight on biological aspects such as gene transcription regulation, biomolecular secondary structure, presence of non-coding RNAs and evolution history. With the increasing number of sequenced genomic data, faster and more accurate tools are needed to automate the process of motif discovery.

Results: We propose a deterministic sequential Monte Carlo (DSMC) motif discovery technique based on the position weight matrix (PWM) model to locate conserved motifs in a given set of nucleotide sequences, and extend our model to search for instances of the motif with insertions/deletions. We show that the proposed method can be used to align the motif where there are insertions and deletions found in different instances of the motif, which cannot be satisfactorily done using other multiple alignment and motif discovery algorithms.

Availability: MATLAB code is available at http://www.ee.columbia.edu/~kcliang

Contact: xw2008{at}columbia.edu


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 SYSTEM MODEL
 3 DSMC MOTIF DISCOVERY...
 4 INSERTION/DELETION MODEL
 5 EXPERIMENTAL RESULTS
 6 DISCUSSION AND CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 
The vast and fast increasing amount of sequenced genomic information presents us with an opportunity for biological discovery in silico by identifying conserved DNA segments, or motifs, associated with known or even unknown biological functionalities.

There is a wide range of motif discovery algorithms available, which in general can be grouped into three categories: consensus-based algorithms, projection-based algorithms and profile-based algorithms. Algorithms such as WINNOWER (Pevzner and Sze, 2000) and CONSENSUS (Hertz and Stormo, 1999) are examples of consensus-based algorithms, where a consensus is formed for each of the possible sets of starting locations, and the best consensus is chosen to describe the motifs in the data. Projection-based algorithms such as Projection (Buhler and Tompa, 2002) and Uniform Projection Motif Finder (UPMF) in (Raphael et al., 2004), are designed to solve the ({ell}, k) motif problems where each instance of a motif of length {ell} differs from the original at exactly k positions. These k positions serve as hashing functions for all possible contiguous sequences of {ell} nucleotides. The possible motif sequences are placed in buckets based on their hashing functions. The algorithms then search the buckets that exceeds a threshold to explore.

In profile-based algorithms, a position weight matrix (PWM) is used to describe the statistical distribution of the four possible nucleotides at every position in the motif. For motifs of length M, the PWM is a matrix of size 4 x M, so that each column of the PWM represents the distribution of the 4 nt at the corresponding position in the motif. The PWM is estimated in the various profile-based algorithms and is used to predict the most likely location of the motif within each sequence. In Lawrence and Reilly (1990), by treating the locations of the motifs in each sequence as missing information, an expectation-maximization (EM) algorithm is proposed to estimate and locate the motifs. In Bailey and Elkan (1993, 1994), MEME, an algorithm based on EM, is introduced with support for finding an unknown number of motifs present with unknown number of occurrences in the sequences. In Lawrence et al. (1993), Roth et al. (1998) and Hughes et al. (2000), the Gibbs Motif Sampler and AlignACE techniques are proposed based on the Gibbs sampler, a Markov chain Monte Carlo (MCMC) algorithm, to estimate the PWM and the locations of the motifs in the sequences. In Liu et al. (2001), the Gibbs sampler-based BioProspector technique is proposed allowing for a two-block motif model and palindromic patterns.

With the ever increasing amount of sequenced genomic data for various organisms, improvements in motif discovery algorithms in terms of both accuracy and ability to process large datasets is desirable. With this goal in mind, we propose a deterministic sequential Monte Carlo (DSMC) algorithm for the profile-based approach to motif discovery to estimate the PWM and the locations of the motifs. Furthermore, we extend our algorithm to address cases where some insertions and deletions are found in different instances of the motif. We use a hidden Markov model (HMM) approach in which the state of the sequence is the location of the motif, and the observation is the nucleotide order of the sequence. Modeling the PWM with a prior distribution also provides a simple method to incorporate prior knowledge about the PWM from previous studies. Since the PWM is an unknown parameter in the system model, the EM approach may converge to local maxima, thus is less efficient and less accurate than a Bayesian approach. Furthermore, although the underlying system is not sequential in nature, we take a sequential approach to the problem. It has been shown in Fearnhead (2004) that sequential Monte Carlo methods provide efficient alternatives to batched-processing methods such as Gibbs sampling.


    2 SYSTEM MODEL
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 SYSTEM MODEL
 3 DSMC MOTIF DISCOVERY...
 4 INSERTION/DELETION MODEL
 5 EXPERIMENTAL RESULTS
 6 DISCUSSION AND CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 
We assume that each sequence contains one motif of length M, and that the motifs are described statistically by the unknown PWM of size 4 x M. For simplicity, and without loss of generality, we also assume that each sequence has a fixed length of L. To estimate the starting locations of the motifs in a set of sequences, we model the system as an HMM. In this approach, a natural choice for the state space of the HMM would be the set of all possible starting locations for the motif within the sequence, and the hidden state at time interval t is the unknown starting location of the motif, with the observation at time t being the t-th sequence in the dataset. Furthermore, given the starting location, or state, at time t, the t-th sequence has an emission probability that depends upon the unknown PWM. In the following, we will give the formulation of our model that will be solved with a DSMC algorithm.

Let S = {s1, s2, ... , sT}, with st = [st, 1,..., st, L], be the set of T nucleotide sequences of length L in which we wish to find a common motif of length M. The distribution of the motif is described by the 4 x M position weight matrix {Theta} = [{theta}1, {theta}2, ... , {theta}M], where the vector {theta}j = [Formula j, 1, ... , Formula j, 4]', j = 1, ... , M, is the probability distribution of the nucleotides {A, C, G, T } at the j-th position of the PWM. In the non-motif regions, each nucleotide is assumed to be independent and identically distributed according to the background distribution {theta}0 {triangleq} [Formula 0,1,..., Formula 0, 4]'. The background distribution {theta}0 can be given prior to motif discovery as user-specified parameters, or can be computed from the nucleotide frequencies from the given dataset.

At each time interval, we observe the order of the nucleotides in a given sequence, and the state of the system during the corresponding interval is the location i of the first nucleotide of the motif in the sequence. Since the last M – 1 nucleotides in a sequence are not valid starting locations for a motif with length M, at step t, t = 1,..., T, the state, denoted as xt, takes value from the set {chi} = {1, 2, ... , LM}, where LM = L M = 1.

Let at, i be a sequence fragment of length M in st with the first nucleotide located at position i, and Formula be the remaining fragment in st with at,i removed. We then define a vector n(a) = [n1, n2, n3, n4] where nj, j = 1, ... , 4, denotes the number of each of the four types of nucleotides found in the sequence fragment a. Given two vectors {theta} = [Formula 1, ... , Formula 4]' and n = [n1, ... , n4]', we define Formula .

Since the nucleotides located in the motif are independent of the other motif nucleotides and non-motif nucleotides, given the PWM {Theta}, the background distribution {theta}0, and the state at time t, the distribution of the observed sequence st is then given as follows:


Formula 1

(1)
where at, i(m) is the m-th element of the sequence fragment at, i, and n(at,i(m)) is a 1 x 4 vector of zeros except at the position corresponding to the nucleotide at, i (m), where it is a one. Note that the background distribution we used here is the single nucleotide distribution, and each of the non-motif nucleotides are assumed to be distributed i.i.d. according to {theta}0. The single nucleotide distribution can be easily replaced by higher order distributions by replacing the Formula term in (1) with the higher order non-motif distribution.

From the discussion above, we formulate our inference problem as follows. Let us denote the state realizations up to time t as xt {triangleq} [x1, x2, ... , xt] and similarly the sequences up to time t as St {triangleq} [s1, s2, ... , st]. Given the sequences St and the non-motif nucleotide distribution {theta}0, we wish to estimate the state realizations xt, which are the starting locations of the motif in sequences s1, ... , st, and the position weight matrix {Theta}, which describes the statistical properties of the motif. The proposed DSMC algorithm performs the inference sequentially, where the location of the motif in each sequence is inferred in the order in which they are arranged in S. In the next section, we derive the DSMC algorithm to solve this inference problem.


    3 DSMC MOTIF DISCOVERY ALGORITHM
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 SYSTEM MODEL
 3 DSMC MOTIF DISCOVERY...
 4 INSERTION/DELETION MODEL
 5 EXPERIMENTAL RESULTS
 6 DISCUSSION AND CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 
In this section, we derive a DSMC motif discovery algorithm for the case where each sequence in the dataset contains exactly one instance of the same motif. In Section 4, we extend this algorithm to treat insertions/deletions.

3.1 Deterministic sequential sampling
In the sequential treatment of the motif discovery problem, at time t we want to make an online inference of the starting locations of the motifs xt = (x1, ..., xt) based on the observation St = (s1, ... , st). We assume that we have at time t 1 a set of samples and their associated weights Formula properly weighted with respect to the posterior distribution p(xt – 1|St – 1). Since the locations of the first nucleotide of the motifs take value from a finite set of possible locations, this naturally leads us to consider all possible extensions for each sample from t – 1 at time t. The deterministic approach to SMC was developed in Fearnhead (1998) and Punskaya (2003), and extended to include unknown parameters and Markov state processes. For each sample Formula , k = 1,... , K, we consider all LM possible extensions, and perform a selection step to keep only K of the K x LM samples to avoid an exponential growth of the number of samples. Figure 1 illustrates the particle extensions and selections in deterministic sequential sampling. At time t, we have K = 4 particles, each particle has LM = 5 possible extensions, all of which are realized at t + 1. After all necessary estimations are performed at t + 1, only 4 of the extensions with the largest weights are retained. Thus, the DSMC algorithm is a deterministic tree-search algorithm. Multiple runs through the dataset will always give the same result as long as the ordering of the sequences in the dataset remains the same. Therefore, the DSMC algorithm has no convergence issue as in other algorithms based on MCMC. Compared to other MCMC techniques, sequential Monte Carlo methods do not consider the entire dataset in the beginning, so the estimations for the first few sequences may be poor. However, performing a second pass through the dataset, even only for the first few sequences, and using the results from second pass can improve the performance of the algorithms.


Figure 1
View larger version (7K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. Extension and selection steps for deterministic sequential sampler.

 
Given a set of samples and associated weights Formula that does not contain duplicate paths, we can obtain the posterior distribution of xt – 1 as


Formula 2

(2)
where Formula , and Formula is the indicator function such that Formula for x = 0 and Formula otherwise. From Bayes' theorem we have


Formula 3

(3)
From this relationship, we can approximate the posterior distribution of xt as


Formula 4

(4)
where Formula represents the vector obtained by appending the element i, i isin {1, ... , LM} to the vector Formula , and Formula with


Formula 5

(5)

Depending on the system under consideration, evaluating (3) can be an easy or very difficult task. It turns out, as we will show in the following, from our system model and our choice of prior distribution for {Theta}, that {Theta} depends only on some sufficient statistic Tt that can be easily updated and will be derived in the next section. We can derive an analytical solution to the integral in (3) with respect to p({Theta}|Tt – 1).

3.2 DSMC estimator
In Bayesian inference, we often assume uninformative priors, thus even when different prior probabilities are used, additional observations will tend to bring the posterior probabilities close to each other. In this article, we use the product Dirichlet distribution as the prior for the PWM {Theta}. For the prior distribution of the m-th column of the PWM, we have {theta}m ~ Formula ({rho} m1, ... , {rho}m4), m = 1, 2, ... , M. Please refer to Evans et al. (2002) for a detailed discussion on the Dirichlet distribution. Recently, it has been shown that positions within a binding site may be interdependent (Bulyk et al., 2002). However, while the independence assumption does not fit the data perfectly, in most cases it does provide a good approximation (Benos et al., 2002). Therefore, to reduce the complexity of the motif discovery algorithm, we choose to model the columns of the PWM as independent from one another.

Let us denote {rho}m {triangleq} [{rho}m1, ..., {rho}m4], and assuming independent priors, the prior distribution for the PWM {Theta} is then the product Dirichlet distribution Formula . To update the posterior distribution of {Theta}, we use the following conditional posterior distribution for {Theta}:


Formula 6

(6)
where we denote {Lambda}M({Theta};{rho}1,..., {rho}M) as the product Dirichlet pdf for {Theta}, {rho}m(t) {triangleq} [{rho}m1(t),..., {rho}m4(t)], m = 1,..., M, as the parameters of the distribution of {Theta} at time t, and Formula . Note that the posterior distribution of {theta} depends only on the sufficient statistics Tt {triangleq} {{rho}mj(t),1 ≤ m ≤ M,1 ≤ j ≤ 4}, which is easily updated based on Tt – 1, xt, and st as given by (6), i.e. Tt = Tt(Tt – 1, xt, st).

We now outline the DSMC algorithm for motif discovery. By our choice of the product Dirichlet distribution as the prior for {Theta}, the integral with respect to p({Theta}|Tt – 1) can also be computed analytically. We have p({Theta}|Tt – 1) = {Lambda}M({Theta}; {rho}1(t 1),...,{rho}M (t – 1)) and p(st|xt = i,{theta}) = Formula (st; i, {Theta}), therefore we get


Formula 7

(7)
which is used in (5) for weight update. In the selection step, the K samples with the highest weights are retained.

We are now ready to give the DSMC motif discovery algorithm:

ALGORITHM 1. [DSMC motif discovery algorithm]
  • Initialization: Use the first {gamma} sequences to enumerate the Formula possible samples, where {gamma} is the largest integer such that Formula , and compute their weights.
  • Update: For q = {gamma} = 1,{gamma} = 2, ...
    • For k = 1, 2,...
    • Enumerate all possible sample extensions. Formula
    • {forall}i, compute the weights Formula according to (5) and (7)

  • Select and preserve K distinct sample streams Formula with the highest importance weights Formula from the set Formula Formula
  • {forall} k, update the sufficient statistics Formula according to (6).

Note that due to the sequential nature of the algorithm, the ordering of the sequences may affect the performance. In particular, if the initial sequences in the dataset have poorly conserved motifs or contains no motifs at all, a good estimate of the PWM cannot be made, and the inferences made for the following sequences may be poor. To lessen this effect, we can average the inference results over different permutations of the sequence order, and take as the final result the most frequently appearing estimation.

3.3 Unknown motif length
In the previous section, we have assumed that the length of the motif is known, thus we can simply use the DSMC algorithm to search for a motif of the given length. However, the motif length is often not given when dealing with a dataset containing an unknown motif. In Algorithm 1, estimating the unknown motif length is not directly supported. However, we can apply the algorithm on the dataset with different given motif lengths, each with a score computed based on the estimated results. The motif length that obtained the highest score is chosen as the estimated motif length.

For Bayesian analysis, a natural choice for a score is the posterior distribution p(ST|xT):


Formula 8

(8)
where Formula is the m-th column of the estimated PWM.


    4 INSERTION/DELETION MODEL
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 SYSTEM MODEL
 3 DSMC MOTIF DISCOVERY...
 4 INSERTION/DELETION MODEL
 5 EXPERIMENTAL RESULTS
 6 DISCUSSION AND CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 
In this section, we present modifications to the DSMC motif discovery algorithm to cope with situations where insertions and deletions exist in different instances of a motif. For certain transcription factors, it is known that there are indel mutations in some of their binding sites (McIver et al., 1999). Bulges in the structure of nucleotide chains can lead to indels in the functional motifs for some of the sequences. Furthermore, even if indel mutations have rendered some motifs non-functional, we are often still interested in detecting their presence. In most cases, existing motif discovery algorithms are unable to locate these motifs due the partial shift of the nucleotides. We propose an extension of the DSMC motif discovery algorithm to handle datasets that contain indel mutations in some instances of the motif.

In multiple alignment (Karplus et al., 1998; Krogh et al., 1994), a sequence containing a motif is often modelled as a hidden Markov model, with states divided into motif states and non-motif states. In our case, we model a sequence containing a single motif with indel mutations as shown by Figure 2. As a convention in this article, for a sequence with motif of length M, we index the motif states from 1 to M, insertion states M + 1 to 2M 1, the deletion states 2M to 3M – 1, and 3M and 3M + 1 for non-motif states before and after the motif, respectively. Thus, for a dataset with motif length M, the number of states for the HMM is R = 3M + 1.


Figure 2
View larger version (10K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. Markov model for insertion/deletion occurring within a motif.

 
We define z1:q {triangleq} [z1,..., zq] and s1:q {triangleq} [s1,...,sq], where zq and sq are the state and observation at time q for the sequence s, respectively. The state transition from zq to zq + 1 follows the transition probability matrix {Phi} {triangleq} [{varphi}ij]R x R. Note that {Phi} is a sparse matrix where the entries are non-zero only at positions that correspond to valid state transitions indicated by Figure 2. Row r of the transition probability matrix, {varphi}r, is assumed to have the Dirichlet prior {varphi}r ~ Formula r1, ... , βrR).

The posterior distribution of {varphi}r is then given by


Formula 9

(9)
where {Lambda}1({varphi}r; βr1,..., βrR) is the Dirichlet pdf and βrp(q) is the Dirichlet coefficient of the p-th element in the r-th row at time q. The parameter βrp is non-zero only for entries that correspond to valid state transitions. Note that the posterior distribution of {Phi} is only dependent on the sufficient statistic Uq {triangleq}rp(q),1 ≤ p, r ≤ R}, and is updated according to (9).

The distribution of the observation in the motif states again has a Dirichlet prior, and we make use of the estimated PWM obtained in Algorithm 1. Thus, for each motif state m, m = 1,..., M, the observed nucleotides have the Dirichlet prior {theta}m ~ Formula ({rho}m1,..., {rho}m4), where parameters{rho}mj are initialized with values according to the corresponding column in Formula obtained after using Algorithm 1 to process the entire dataset. The posterior distribution of {theta}m is then given by


Formula 10

(10)
and it depends only on the sufficient statistics Tq {triangleq} {{rho}mj(q),1 ≤ m ≤ M, 1 ≤ j ≤ 4}.

For both the non-motif states and the insertion states, the nucleotide distribution follows the non-motif nucleotide distribution {theta}0 {triangleq} [{theta}01,...,{theta}04]', which is assumed to be known. In the case of deletion states, the nucleotides in the motif positions that correspond to deletion mutations are removed, and the sequence fragments to either side of the deletions are concatenated. Therefore, the outputs of the deletion states are not observed, and this occurs with probability 1.

Then, similar to (3), the posterior distribution p(z1:q | s1:q) can be derived as


Formula 11

(11)
with


Formula 12

(12)
and


Formula 13

(13)

After the (q – 1)-th nucleotide, p(z1:q|s1:q) can be approximated by


Formula 14

(14)
where Formula is the set of states that Formula can make transitions to, and the weight update formula is given by


Formula 15

(15)

Note that to assist Algorithm 2 to more accurately predict the starting locations of the motifs and the locations of the indel mutations, we first process the dataset using Algorithm 1. The result of Algorithm 1 are then used to update the parameters for {theta}m, m = 1, ... , M, so that Algorithm 2 is initialized with better knowledge of the PWM, and is able to more accurately locate the motif within the sequences. Furthermore, Algorithm 2 is used to process the nucleotides immediately surrounding the starting locations predicted by Algorithm 1, e.g. 100 nt up and downstream relative to the predicted starting locations. From our experience, if the site predicted by Algorithm 1 is far away from the known location, Algorithm 2 will also make the incorrect prediction due to the lack of conservation in the true motif. In other words, Algorithm 2 is used to refine the results of Algorithm 1 to locate possible indel mutations that are found in instances of motifs. Another approach would be to divide long sequences into several shorter sequences, each processed by Algorithm 2 separately, and the best motif is then selected based on a score similar to (8).

In conclusion, the DSMC algorithm to locate a motif with possible indel mutations within some sequences is described as follows:

ALGORITHM 2 [DSMC motif discovery algorithm for motif with indel mutations]
  • Process dataset using Algorithm 1 update {theta}m, m = 1,..., M, using Algorithm 1 results.
  • For each sequence in the dataset:
    • Initialization: Enumerate all paths through the Markov model up to the {gamma}-th nucleotide such that the total number of paths enumerated is less than or equal to K.
    • For q = {gamma} + 1, {gamma} + 2, ...
      • For k = 1, 2, ...
        • Enumerate all possible sample extensions. Formula .
        • {forall}i, compute the weights Formula according to (15).
        • Select and preserve K distinct sample streams Formula with the highest importance weights Formula from the set Formula .
        • {forall}k, update the sufficient statistics Formula and Formula according to (10) and (9), respectively.




    5 EXPERIMENTAL RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 SYSTEM MODEL
 3 DSMC MOTIF DISCOVERY...
 4 INSERTION/DELETION MODEL
 5 EXPERIMENTAL RESULTS
 6 DISCUSSION AND CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 
We have implemented the proposed DSMC motif discovery algorithms and evaluated their performance on synthetic and real data. The results are compared to those of BioProspector, MEME (Bailey and Elkan, 1993, 1994), and AlignACE.

5.1 Results for synthetic data
We used the following rule to generate synthetic data for performance comparisons. For the dominant nucleotide at each position in the motif, we assign the probability of 70%, while the remaining nucleotides are assigned a probability of 10%. Non-motif frequency is assigned as 25% for each nucleotide. We compared the performance of DSMC Algorithm 1, refinement of Algorithm 1 by 2, BioProspector, MEME and AlignACE using synthesized datasets at various motif lengths. We generated 10 motif datasets for each motif lengths between 17 and 22. Each dataset contains 50 sequences, each of which is 200 nt long. A motif of corresponding length is present in each of the sequences. The performance comparison using the synthetic datasets is given in Figure 3, in terms of performance coefficient. The accuracy of each algorithm is plotted against this length of motif.


Figure 3
View larger version (10K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. Accuracy comparison using synthesized dataset with motifs of different length.

 
We see from Figure 3 that DSMC Algorithm 1 achieves higher performance than the other algorithms do at all motif lengths tested. Note that results of Algorithm 1 after being refined by Algorithm 2 is as good as the unrefined results except in the regions of short motif lengths. In these regions, due to the low conservation and short sequence lengths, Algorithm 2 is more likely to erroneously include insertions and deletions to attempt to obtain a better match to the PWM.

In another experiment, we constructed 20 datasets, each of which contained 20 sequences with length of 100 nt. In each sequence, we embedded a motif with length of 16 nt, where the dominant nucleotides appear with probability 91%, and the remaining nucleotides appear with probability 3%. Each sequence has a 40% chance of having a 1-nt-long insertion or deletion occurring within the motif, where the insertions and deletions appear with equal probability. We compared the performance of both DSMC Algorithms with those of BioProspector, MEME and AlignACE. The comparisons in terms of their average performance coefficient and the average number of correctly predicted motif locations are shown in Table 1.


View this table:
[in this window]
[in a new window]

 
Table 1. Performance comparisons using synthetic datasets with insertion and deletion occurring inside motifs

 
We can see that on the average, Algorithm 1 has a slight edge over the existing algorithms, whereas after running Algorithm 2, the results is improved over all algorithms by having the highest performance coefficient and predicting on average more than two more correct motif location than Algorithm 1, BioProspector, MEME and AlignACE can. In Figure 4, we give an example of motif alignment of a synthetic dataset as estimated by Algorithm 2. Once again, the boldfaced first sequence in each alignment is the motif consensus.


Figure 4
View larger version (30K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4. Alignment of the motifs found by Algorithm 1 and the corresponding alignment with insertion/deletion predicted by after running Algorithm 2.

 
In Figure 5, we show the convergence of the performance coefficient of Algorithm 1. The result is averaged over 10 datasets, where each dataset contains 25 sequences of 200 nt long each. Each sequence is generated to contain a motif of length 17, where the dominant nucleotide in the motif appears with 70%. In Figure 5, Algorithm Step at 0 represents the average performance coefficient of the datasets after the first pass. For Algorithm Step t, t = 1,..., 25, the average performance coefficient is computed by replacing the estimation made during the first pass for the t- th sequence with its second pass estimation. Similarly, Algorithm Steps 26–51 represents the sequential update of the estimations to the third pass estimations. As we can see from Figure 5, performing a third pass through the dataset does not improve the performance, and in most cases, only the first few sequences need to be re-estimated in the second pass.


Figure 5
View larger version (8K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5. Convergence of performance coefficient of DSMC Algorithm 1 over the first three passes.

 
5.2 Results for Real Data
5.2.1 Cyclic-AMP receptor protein
The first dataset we used consists of 18 sequences from Escherichia coli with cyclic-AMP receptor protein (CRP) binding sites (Stormo and Hartzel, 1989). Each of the sequences is 105 nt long, and the dataset contains 23 motifs that have been experimentally determined with the footprinting method (Liu et al., 2004) A prediction performance comparison for this dataset using BioProspector, MEME and AlignACE is presented in Jensen et al. (2004).

Our results are presented in terms of the slightly modified ‘performance’ coefficient defined in Pevzner and Sze (2000) as follows: Let a *t be the known motif segment in st, and let AT{triangleq}Formula {triangleq}Formula be the sets containing the estimated and known motif segments in ST, respectively. The performance coefficient is equal to


Formula 16

(16)
where Formula is the number of nucleotids common to both segments a and b, and Formula is the number of unique nucleotides in both a and b.

Note that we may encounter cases where insertions or deletions are found in an instance of the motif, or the motif states predicted by Algorithm 2 may contain insertions or deletions. Thus (16) needs to be modified to handle these situations. We construct the known motif fragment Formula by counting the nucleotides found between the first and the last nucleotides of the known motif. In other words, the number of nucleotides in Formula is increased by 1 with an insertion, and decreased by 1 with a deletion. For Algorithm 2, at is constructed similarly, reflecting the number of insertion/deletion states found in the estimated state transitions. Since all other algorithms, including Algorithm 1, do not assume insertion/deletions, their size of Formula is always the given length of the motif.

For each algorithm, we performed six trials on the CRP dataset and the most frequently appeared position is used as the estimated motif position for each sequence. The six trials consist of the CRP dataset in its original and a randomly permuted sequence order, searched using motif lengths of 21, 22 and 23. The results found by using motif length 21 and 23 are shifted so that their consensus have the same starting location as the consensus found using motif length 22. By averaging over results from different motif lengths and different sequence orders, the performance of MEME and AlignACE improved by about two positions, whereas the DSMC algorithms and BioProspector improved by about one position. The performance coefficient is then computed using motif length of 22. The performance results of the DSMC Algorithms, BioProspector, MEME and AlignACE on the CRP dataset are given in Table 2, where the first row shows the performance coefficients and the second row indicates the number of correctly located motifs. The results obtained from each algorithm is compared with the experimentally confirmed CRP motif position as given in (Liu et al., 2004). Note that for MEME and AlignACE, the consensus of the predicted motif is shifted to the left by 2 nt from the known consensus, therefore, the known motif locations for these two algorithms are shifted accordingly so that fair comparisons of the performance coefficients can be made.


View this table:
[in this window]
[in a new window]

 
Table 2. Performance comparisons of motif discovery algorithms using the CRP dataset

 
In our simulations, Algorithm 1 and BioProspector have the top two scores both in terms of performance coefficient and the number of correctly predicted motif locations. Note that although Algorithm 1 and BioProspector have the same number of correctly predicted motif locations, Algorithm 1 has a higher performance coefficient since some of its incorrectly predicted positions have higher degree of overlap with the known motifs than those predicted by BioProspector. The results of Algorithm 1 after being refined by Algorithm 2 actually performs worse than those of Algorithm 1, due to the incorrect prediction of two motifs. These errors seems to have been caused by repetitions of one of the highly conserved 5-nt block from the CRP motif at other locations within the sequence. Figure 6 shows the CRP motifs predicted by each algorithm. The first sequence in each panel is the consensus of the CRP motif, and sequences with stars to the left are those that are correctly predicted. The CRP motif is known to contain two blocks of highly conserved nucleotides: ‘TGTGA’ and ‘TCACA’. For each algorithm, the matches to the two highly conserved blocks are highlighted in red. Interestingly, AlignACE contains the least number of correctly identified motifs although its performance coefficient is higher than that of MEME. Upon further inspection of the motif positions incorrectly predicted by MEME and AlignACE, we observe that while the positions predicted by AlignACE have higher degree of overlap with the known positions, they however have poorer alignments with the two highly conserved blocks in the CRP motif. It is in our opinion that this example illustrates the insufficiencies of the performance coefficient as a measure for the performance of motif discovery algorithms. In a typical motif discovery problem, there are two primary interests for any algorithm: the positions of the motifs, and the alignment or the consensus logo of the motif. As we can see from (16), the computation of performance coefficient only considers the amount of overlap between the predicted and the known motifs, whether the predicted motifs are aligned has no effect on the end result. Thus we have the situation that we observe in this example, where the incorrectly predicted motif positions by AlignACE are closer to the known positions but has a poorer estimate for the PWM compared to the known motifs. We believe that both of these properties are important in a motif discovery problem.


Figure 6
View larger version (48K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 6. Alignment of CRP motifs found.

 
With this in mind, we compare the alignment in Figure 6 for Algorithm 1 and BioProspector. We can see that the only difference between the predictions are in the 21st and 23rd motifs. Comparing the two predicted motifs for each algorithm to the complete consensus of the CRP motif, we can see that Algorithm 1 has 22 matches to the CRP motif consensus in both predicted motifs, whereas BioProspector only has 18. For only the two highly conserved blocks, Algorithm 1 has 13 matches to the consensus whereas BioProspector has 12. Thus both in terms of performance coefficient and alignment to the consensus, Algorithm 1 slightly outperforms BioProspector. Furthermore, we include in Table 2 the logarithm of the motif scores using (8), where the estimated PWM used for each algorithm is built from the motif predicted by that algorithm. As we can see from Table 2, the motif scores is closer to what we observed from the alignments in Figure 6 where MEME shows stronger alignment of the conserved regions in the CRP motif than that of AlignACE, and that Algorithm 1 and BioProspector have the highest scores and show very similar alignment properties.

Next, we repeated the above simulations for Algorithm 1 without assuming a known motif length. Algorithm 1 was ran using motif lengths between 16 and 25, and for each motif length a score was computed according to (8). The simulation results showed that the motif length 23 has the highest score, and in Table 2 we compared this result with the predictions by BioProspector, MEME and AlignACE from (Jensen and Liu, 2004).

5.2.2 Drosophilo melanogaster Dscam introns in exon 6 cluster
It has recently been discovered (Anastassiou et al., 2006; Graveley, 2005) that the mutually exclusive choice of 1 out of the 48 alternative exons in exon cluster 6 of the alternatively spliced Dscam gene of the fruit fly Drosophila melanogaster is governed by competing alternative pre-mRNA secondary structures. There exists a conserved ‘docking site’ containing an ‘anchor sequence’ of ~ 40 nt, upstream of the first of the tandemly arranged 48 alternative exons. Specifically, the sequence ‘AAATTGAAAACTGCCTGAATGTTGGGATAGGGTACTCGACAA’ is perfectly conserved among six Drosophila species. On the other hand, in direct upstream of each of the alternative exons there is a binding site containing a moderately conserved segment of the reverse complement of the anchor sequence. The introns are competing for binding with the anchor sequence, and the one that binds with the anchor sequence activates the selection of the exon directly downstream from it. The first exon does not have such a motif (Anastassiou et al., 2006). Thus our second dataset consists of the 47 introns upstream of the 2nd up to the 48th alternative exon in the exon 6 cluster of the Dscam gene, and the motif is the reverse complement of the anchor sequence.

We applied DSMC Algorithm 1 to the Dscam dataset with various motif lengths. Motif length of 21 resulted in the longest estimated PWM with contiguous highly conserved positions. The consensus sequence of the estimated PWM, ‘CCAACATTCAGGCAGTTTTTC’, is very similar to a segment of the reverse complement of the anchor sequence, ‘CCAACATTCAGGCAGTTTTCA’, with two errors at the end due to misalignment of the predicted motifs.We then applied Algorithm 2 to the dataset using the estimated PWM from Algorithm 1 as the parameters for {Theta}. As the identified motifs may be slightly shifted versions of each other, to compute the performance coefficient, Equation (16) is applied to the motif positions estimated by the algorithm with respect to the corresponding part of the known ‘superset’ motif. In Table 3, we give the performance comparisons of the different algorithms for 45 introns out of the total of 47 available. The 45th and 46th introns are excluded since they do not contain the fragment ‘CCAACATTCAGGCAGTTTTCA’, as shown in the supplement to (Anastassiou et al., 2006), and thus are not used in our comparisons since their contribution to the performance coefficient cannot be computed. Table 3 shows the performance coefficient comparisons for the different algorithms.


View this table:
[in this window]
[in a new window]

 
Table 3. Performance comparisons using Dscam dataset

 
We can see that the refinement of Algorithm 1's; results by Algorithm 2 produces the most exact matches to the known sequence. From (Graveley, 2005) we observe that many of the intronic sequences contain insertions and deletions relative to the consensus, which from a biological standpoint are bulges in the secondary structure. When an insertion or deletion occurs in the motif of an intronic sequence, the PWM-based algorithm needs to ‘decide’ which side of the insertion/deletion has better match to the estimated PWM. For algorithms that are unable to deal with insertions and deletions, if the right side of the insertion/deletion is determined to contain more matches, the estimated motif location will be shifted to the right to obtain higher score. Thus, we observe that many estimated locations are shifted from the correct locations by a few nucleotides. The HMM used in Algorithm 2 provides a mechanism to break up the motif, or remove fragments, so that the match to the PWM is improved.

5.3 Results for motif discovery assessment project datasets
In (Tompa et al., 2005), a large-scale assessment of 13 motif discovery algorithms was performed. The 52 datasets used in the assessment contain transcription factor binding sites drawn from mouse, human, fly and yeast. The binding sites are planted in three different types of background sequences: their original promoter sequence, the promoter sequence of a randomly chosen gene from the same genome and sequence generated using Markov chain of order 3. Four datasets with no motifs were also added as negative controls. In this section, we present the comparison of the performance of the DSMC algorithms on these datasets with those of MEME and AlignACE, which were also part of the study in (Tompa et al., 2005). The metrics used in the comparison are: nucleotide and site-level sensitivity (nSn and sSn), nucleotide and site-level positive predictive value (nPPV and sPPV), nucleotide-level performance coefficient (nPC), nucleotide-level correlation coefficient (nCC) and site-level average site performance (sASP). Please refer to (Tompa et al., 2005) regarding the detailed formulas for these metrics. From Figure 7, we can see that the DSMC algorithms have comparable or better performances in nPPV, nPC, nCC, sPPV and sASP, whereas for the measures nSn and sSn, the DSMC algorithms performed more poorly than those of MEME and AlignACE. The sensitivity metrics, nSn and sSn, are both similar measures to the performance coefficient, except that they do not take into consideration of the false positives as performance coefficient does. As the figure shows, the DSMC algorithms predicted less false positives than by MEME and AlignACE, thus we see a more comparable performance by the algorithms in terms of nPC. The DSMC algorithms also have higher positive predictive values, meaning that more of its predictions are the true nucleotide or true sites.


Figure 7
View larger version (30K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 7. Performance comparison for combined measurement over 56 datasets.

 

    6 DISCUSSION AND CONCLUSIONS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 SYSTEM MODEL
 3 DSMC MOTIF DISCOVERY...
 4 INSERTION/DELETION MODEL
 5 EXPERIMENTAL RESULTS
 6 DISCUSSION AND CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 
We have proposed a novel deterministic sequential Monte Carlo solution to the hidden Markov model of the motif discovery problem. We have shown that for certain types of datasets, the DSMC algorithm can provide better performance than those of other algorithms. We have also proposed an extension to treat datasets that have insertions and deletions occurring inside motifs.

The complexity of the DSMC algorithms compares favorably with existing algorithms. We implemented the Gibbs sampling-based algorithm proposed in (Lawrence et al., 1993) using MATLAB for a comparison with Algorithm 1. Our tests have shown that for a dataset of 50 200-nt-long sequences, DSMC Algorithm 1 on average takes about 1.5 min on an Intel Core 2 Dual E6700 cpu, whereas it takes 3 min for the Gibbs sampling-based algorithm to converge to the final solutions. However, running time for Algorithm 2 is much longer, where each 200-nt-long sequence takes up to 6 min. This is due to Algorithm 2 having to process each nucleotide in the sequence to determine the state it is in.

From our simulations, the DSMC algorithms are shown to perform well when most of the sequences in the given dataset contain at least one instance of the motif. For datasets that have insertions/deletions in some instances of the motif, Algorithm 2 is shown to improve the performance and provides an alignment of the estimated motifs. At this moment, the DSMC algorithms are not able to accurately predict the locations of motifs if a significant portion of the dataset contains no motif at all. Algorithm 1 performs the search by predicting the most likely candidate for each sequence in each pass. Multiple occurrences of the motif in the same sequence are discovered in the subsequent passes through the dataset. If the majority of the motifs are located in a few sequences, the algorithm is unable to form a good estimate of the PWM, and subsequent passes through the dataset will also produce poor results.

The computational complexity of Algorithm 2 immediately points to a possible future research topic of improving its performance. Also, modifications to the HMM model may allow us to better treat datasets where significant portions of the sequences contains no motifs. Furthermore, the scope of this article focuses on improving the performance of traditional models where the sequences are assumed to be independent. The current system model for Algorithm 1 can be modified to support models where the locations of the motifs are correlated between adjacent sequences by casting the current model into an HMM problem. The states and observations in the HMM model remain the same as the current model, and the state transition probabilities can be estimated with some metric based on data provided by microarray experiment results.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 SYSTEM MODEL
 3 DSMC MOTIF DISCOVERY...
 4 INSERTION/DELETION MODEL
 5 EXPERIMENTAL RESULTS
 6 DISCUSSION AND CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 
This work was supported in part by the U.S. National Science Foundation (NSF) under grant DMS-0244583.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Martin Bishop

Received on April 16, 2007; revised on October 11, 2007; accepted on October 27, 2007

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 SYSTEM MODEL
 3 DSMC MOTIF DISCOVERY...
 4 INSERTION/DELETION MODEL
 5 EXPERIMENTAL RESULTS
 6 DISCUSSION AND CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Anastassiou D, et al. Variable window binding for mutually exclusive alternative binding. Genome Biol (2006) 7:R2.[CrossRef][Medline]

    Bailey T, Elkan C. Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Technical Report (1993) CS93-302. University of California, San Diego.

    Bailey T, Elkan C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In: In Proceedings of the 2nd Int'l Conference on Intelligent Systems for Molecular Biology. (1994) San Diego: AAAI Press. 28–36.

    Benos P, Bulyk M, Stormo G. Additivity in proteinDNA interactions: how good an approximation is it? Nucleic Acids Res (2002) 30:4442–4451.[Abstract/Free Full Text]

    Buhler J, Tompa M. Finding motifs using random projections. J. Comput. Biol (2002) 9:225–242.[CrossRef][Web of Science][Medline]

    Bulyk M, Johnson P, Church G. Nucleotides of transcription factor binding sites exert interdependent effects on the binding affinities of transcription factors. Nucleic Acids Res (2002) 30:1255–1261.[Abstract/Free Full Text]

    Evans M, Hastings N, Peacock B. Statistical Distributions. (2002) 3rd. New York: Wiley-Interscience.

    Fearnhead P. Sequential Monte Carlo methods in filter theory. In: Ph.D. Dissertation. (1998) University of Oxford.

    Fearnhead P. Particle filters for mixture models with an unknown number of components. J. Stat. Comput (2004) 14:11–21.[CrossRef]

    Graveley B. Mutually exclusive splicing of the insect Dscam pre-mRNA directed by competing intronic RNA secondary structures. Cell (2005) 123:65–73.[CrossRef][Web of Science][Medline]

    Hertz G, Stormo G. Indentifying DNA and protein patterns with statistically significant alignment of multiple sequences. Bioinformatics (1999) 15:563–577.[Abstract/Free Full Text]

    Hughes J, et al. Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J. Mol. Biol (2000) 296:1205–1214.[CrossRef][Web of Science][Medline]

    Jensen S, Liu J. BioOptimizer: a Bayesian scoring function approach to motif discovery. Bioinformatics (2004) 20:1557–1564.[Abstract/Free Full Text]

    Jensen S, et al. Computational discovery of gene regulatory binding motifs: a Bayesian perspective. Stat. Sci (2004) 19:188–204.[CrossRef][Web of Science]

    Karplus K, et al. Hidden Markov Models for detecting remote protein homologies. Bioinformatics (1998) 14:846–856.[Abstract/Free Full Text]

    Krogh A, et al. Hidden Markov models in computational biology: applications to protein modeling. J. Mol. Biol (1994) 235:1501–1531.[CrossRef][Web of Science][Medline]

    Lawrence C, et al. Detecting subtle signals: a Gibbs sampling strategy for multiple alignment. Science (1993) 262:208–214.[Abstract/Free Full Text]

    Lawrence C, Reilly A. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins Struct. Funct. Genet (1990) 7:41–51.[CrossRef][Web of Science][Medline]

    Liu J, et al. Statistical models for biological sequence motif discovery. In: Case Studies in Bayesian Statistics VI. (2004) New York: Springer.

    Liu X, Brutlag D, et al. Bioprospector: discover conserved DNA motifs in upstream regulatory regions of co-expressed genes. (2001) Proceedings of the 6th Pacific Symposium on Biocomputing 2001. USA: Hawaii.

    McIver S, et al. Regulation ofmgatranscription in the Group A Streptococcus: specific binding of Mga within its own promoter and evidence for a negative regulator. J. Bacteriol (1999) 7:5373–5383.

    Pevzner P, Sze S. Combinatorial approaches to finding subtle signals in DNA sequences. In: In Proceedings of the 8th Int'l Conferences on Intelligent Systems for Molecular Biology. (2000) San Diego: AAAI Press. 269–278.

    Punskaya E. Sequential Monte Carlo methods for digital communications. In: Ph.D. dissertation. (2003) University of Cambridge.

    Raphael B, et al. A uniform projection method for motif discovery in DNA sequences. IEEE Trans. Comput. Biol. Bioinform (2004) 1:91–94.[CrossRef]

    Roth F, et al. Finding DNA regulatory motifs within unaligned non-coding sequences clustered by whole-genome mRNA quantitation. Nat. Biotechnol (1998) 10:939–945.

    Stormo G, Hartzell G. Identifying protein-binding sites from unaligned DNA fragments. Proc. Natl Acad. Sci. USA (1989) 86:1183–1187.[Abstract/Free Full Text]

    Tompa M, et al. Assessing computational tools for the discovery of transcription factor binding sites. Nat. Biotechnol (2005) 23:137–144.[CrossRef][Web of Science][Medline]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrowOA All Versions of this Article:
24/1/46    most recent
btm543v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Google Scholar
Right arrow Articles by Liang, K.-C.
Right arrow Articles by Anastassiou, D.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Liang, K.-C.
Right arrow Articles by Anastassiou, D.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?