Skip Navigation

Bioinformatics 2008 24(16):i160-i166; doi:10.1093/bioinformatics/btn282
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by da Fonseca, P. G. S.
Right arrow Articles by Sagot, M.-F.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by da Fonseca, P. G. S.
Right arrow Articles by Sagot, M.-F.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Efficient representation and P-value computation for high-order Markov motifs

Paulo G. S. da Fonseca 1,2,*, Katia S. Guimarães 1 and Marie-France Sagot 2

1Centro de Informática, Universidade Federal de Pernambuco, 50732-970, Recife, Brazil and 2Université de Lyon, F-69000, Lyon; Université Lyon 1; INRIA Rhône-Alpes; CNRS, UMR5558, Laboratoire de Biométrie et Biologie Evolutive, F-69622, Villeurbanne, France

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 DISCUSSION
 4 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: Position weight matrices (PWMs) have become a standard for representing biological sequence motifs. Their relative simplicity has favoured the development of efficient algorithms for diverse tasks such as motif identification, sequence scanning and statistical significance evaluation. Markov chainbased models generalize the PWM model by allowing for interposition dependencies to be considered, at the cost of substantial computational overhead, which may limit their application.

Results: In this article, we consider two aspects regarding the use of higher order Markov models for biological sequence motifs, namely, the representation and the computation of P-values for motifs described by a set of occurrences. We propose an efficient representation based on the use of tries, from which empirical position-specific conditional base probabilities can be computed, and extend state-of-the-art PWM-based algorithms to allow for the computation of exact P-values for high-order Markov motif models.

Availability: The software is available in the form of a Java objectoriented library from http://www.cin.ufpe.br/~paguso/kmarkov.

Contact: paguso{at}cin.ufpe.br


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 DISCUSSION
 4 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Position weight matrices (PWMs) have become a de facto standard probabilistic model for biological sequence motifs, for instance, for representing transcription factor-binding sites (TFBS). Key to this success is their simplicity, which allows for immediate interpretation and easier analytical manipulation, and the consequent computational efficiency, which has also favoured the development of numerous techniques for the various tasks related to the analysis of sequence motifs such as motif discovery for a recent survey(GuhaThakurta, 2006), sequence scanning (Beckstette et al., 2006; Pizzi et al., 2007) and statistical significance assessment (Bejerano, 2003; Touzet and Varre, 2007; Zhang et al., 2007). A strong hypothesis on which the PWM model is based is that of the independence of each position of the motif relative to other positions. This assumption, the practicality it confers to the model notwithstanding, comes at the cost of an alleged simplification of the actual biochemical interactions that take place at the regulatory sites.

Markov models have been used for biological sequence analysis in general (Durbin et al., 1999), and in particular for representing binding site motifs (Ellrott et al., 2002; Huang et al., 2006; Zhao et al., 2005). They represent a step up in terms of expressiveness over PWMs, since they allow for local dependencies between adjacent positions to be represented directly (which does not mean that non-adjacent positions are independent). Other more general models have been proposed which allow for more complex inter-position (in)dependence schemes to be considered. For example, Barash et al. (2003) have successfully used Bayesian network models to represent TFBS motifs. Unfortunately, however, the gain in terms of expressiveness is often accompanied by a substantial computational overhead, which may limit their application in practice. Moreover, we conjecture that no model will replace PWMs as a standard until it includes a reasonably complete algorithmic framework for the diverse tasks related to the analysis of sequence motifs mentioned above.

In this article, we address two aspects regarding the utilization of higher order Markov models for biological sequence motifs. First, we consider the representation of high-order Markov motifs described by a set of occurrences. We propose a convenient data structure based on tries, from which empirical position-specific conditional base probabilities can be computed as needed. Next, we consider the problem of computing exact P-values for putative occurrences of high-order Markov motifs. We propose non-trivial extensions of two state-of-the-art PWM-based algorithms for working with a Markov dependency model. We illustrate the use of the proposed model and algorithms on a set of TFBS motifs defined by collections of aligned sites extracted from the TRANSFAC database (Wingender et al., 2000).


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 DISCUSSION
 4 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
2.1 The K-order Markov model
A motif m of length W is a probabilistic model that defines a multivariate discrete probability distribution Pm(·) over the space of words of length W of a fixed alphabet A. If x=x1··· xWisinAW, then Pm(x) is actually a shorthand for P(X1=x1,...,XW=xW|m), where Xj is a r.v. corresponding to the position j of the pattern. Here, we consider sequence motifs that are described by discrete K-order non-homogeneous Markov chains. The likelihood of a particular word x=x1···xW under such a model m is given by


Formula

The model parameters are the position-specific transition probabilities


Formula

for all possible assignments of a0,...,aKin A and for each position j=1,...,W. If K=0, then the model resumes to a PWM. In general, though, this model captures local relationships between individual letters that can vary according to their positions in the sequence.

2.2 Representing a set of motif occurrences
In most practical cases, the parameters of a motif model are derived from a set of occurrences, even during a learning procedure. For example, if we examine the algorithms that learn motif models by EM (Bailey and Elkan, 1994), we see that the model parameters are updated so as to maximize the likelihood of the occurrences indicated by the current estimates of the hidden variables. The same goes for Gibbs sampling-based techniques (Lawrence et al., 1993), where the model parameters are calculated from the last sampled occurrences. In other words, the model parameters summarize the relevant information contained in a set of occurrences. Motivated by these observations, we adopt an alternative approach that consists in representing a set of occurrences with a convenient data structure in such a way that the information necessary for the computation of word probabilities can be recovered as needed. Our proposed representation is based on the use of index structures known as tries (Fredkin, 1960; Knuth, 1998).

DEFINITION 1
(Trie). A trie over a fixed alphabet A is a rooted tree such that (i) every edge is labelled with a character in A and (ii) every node has, at most, one outgoing edge labelled with a, for any aisinA.

An important property resulting from the definition of tries is that we can then establish a 1 : 1 association between each node v and the word v obtained by concatenating the labels of the edges on the path from the root to that node [by definition,lbl (root)={varepsilon}, the empty string]. We then define the set of words represented by a trie T by


Formula

If no word of a set of words X is a proper prefix of another, then there exists one, and only one trie T=T(X) s.t. (T)=X. Since we can enforce this pre-condition on X by appending a ‘sentinel’ symbol ${notin}A to each of its elements, we can assume w.l.o.g. that it holds. Moreover, we can generalize the definition of tries so that they can represent multisets of words: we store the number of copies of a word in the multiset in the corresponding leaf of the trie. We also attach to every internal node the sum of the counts of the leaves of the subtree rooted at that node. We will refer to these numbers generically as leaf counts. Figure 1 displays an example of an extended trie with its leaf counts. Two fundamental properties of leaf counts are:P1. The leaf count of a node v of T(X) corresponds to the number of words of X having (v) as prefix.P2. The leaf count of an internal node v equals the sum of leaf counts of its children.


Figure 1
View larger version (6K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. Extended trie representing a multiset of 20 words of length 3.

 
The extended trie encodes the character frequencies at each position of the represented words as in a PWM. This fact is established by the following proposition, which follows immediately from the definition of extended tries.

PROPOSITION 1.
Let T=T(X) be a trie representing a multiset of words X. The frequency of the character a at position j in X is obtained from T as the sum of the leaf counts of the nodes at height j (the root is assumed to be at height zero) whose incoming edges are labelled by a, divided by the total number of represented words, i.e. the leaf count of the root node.

Consider the trie of Figure 1 for an example of the property above. The proportion of words whose second letter is a t in that set, Pr[x2=t], is given by the sum of the leaf counts of the two nodes at height 2 whose incoming edge is labelled by t, divided by the total number of words, that is Pr[x2=t]=(2+2)/20=0.2. However, tries are most useful when we consider connections between positions. Suppose, for instance, that we want to establish the probability of having a c in the first and third positions. By looking at the leaf count of the central node at height one, we see that nine words start with a c. By looking at the leaf counts of its children, we see that in seven out of those nine words, the initial c is followed by an a, whereas in the other two, the second letter is a t. Since we made no restrictions about the second letter, in principle all those words can interest us. However, by looking at leaf counts further down in the tree, we notice that although the two words that start with ct end by c, only two out of the seven words that start with ca end with a c. The two words of the latter case plus the other two of the former are then the only ones to match our requirements. The desired probability can thus be computed as Pr[x1=c,x3=c]=(2+2)/20=0.2. This principle is generalized in the following proposition which is, again, a direct consequence of the definition of extended trie.

PROPOSITION 2.
Let T=TX be a trie representing a multiset of words X, each of length W. Let i=(i1,...,iL)sub(1,...,W) be a ‘subvector’ of positions, and a=(a1,...,aL)isinAL be a vector of L characters. We denote by xi=a the event [xi1=a1,...,xiL=aL]. Then the probability of xi=a in X is given by the sum of the leaf counts of all nodes of height iL whose ingoing edges are labelled by aL, and such that the paths from the root to those nodes go through nodes at heights i1,...,iL–1 whose incoming edges are labelled by a1,...,aL–1, respectively, divided by the total number of represented words.

Conditional probabilities can be computed from Proposition 2 and the relation P(A|B)=P(A,B)/P(B). For example, according to the trie of Figure 1, the probability of having a ct in the third position, given that the base at the first position is a c is given by Pr[x3=g|x1=c]=/#[x1,3=c]/#[x1=c]=5/9{approx}0.56.

Finally, we note that, for a K-order Markov motif model m built from a set of motif occurrences X, we need to compute transition probabilities of the kind Pm(xj|xjK···xj–1) and thus we need to count the occurrences of contiguous k-mers xjK···xj on X. This can be easily done as follows. Starting from the nodes at height jK–1 in the trie, we follow the edges labelled by each position of the k-mer, xjK, xjK+1, ..., xj, in sequence, and we sum up the leaf counts of the final nodes whenever we can reach the end of the k-mer. Hence we need to visit at most


Formula

vertices for computing Pm(x1···xW). However, this worst case O(K|A|WK) behaviour is rare in practice since it happens only when the motif trie is complete. Most often the trie is very sparse for it contains only a few occurrences of the motif. In addition, once Pm(xj|xjK··· xj–1) is computed, it can be stored for subsequent use.


Figure 2
View larger version (20K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. Branch and bound motif P-value computation. This algorithm must be initially called with BranchAndBoundPvalue(m, m0, l, {varepsilon}).

 
2.3 Computing P-values for Markov motif models
We consider now the problem of computing motif occurrence P-values for K-order Markov models.

DEFINITION 2.
Let m be a K-order motif model of length W. We define the log-likelihood score of a word x relative to m as


Formula

Given also a background sequence model m0, we define the P-value of a score l as


Formula

In words, it represents the probability, under the background (null) model, for a random sequence to have a score at least as good as l relative to the motif model m.

For the PWM model, the problem of computing P-values is well studied in the literature and has been recently demonstrated to be NP-hard (Touzet and Varre, 2007; Zhang et al., 2007). The naive solution consists in exhaustively enumerating each word of the desired size W, computing its log-likelihood under the motif model m and comparing it to the target score l. Approximate methods exist which avoid exhaustive enumeration by sampling words from the background distribution and estimating the P-value through the empirical expected value of the binary indicator function 1({ell}(y, m)≥l). Here, however, we consider extensions of recent exact PWM-based techniques for our case of K-order Markov models.

2.3.1 Branch-and-bound P-value computation
The algorithm presented by Bejerano (2003) is based on a recursive enumeration of the words of length W in such a way that, at each point of the enumeration tree, we have a prefix of length j≤W and we can compute bounds for the log-likelihood score of the words having this prefix. At this point, we compare these bounds with the target score l to decide whether or not to pursue the enumeration further down the tree. Procedures employing this kind of heuristic fall into a broad class of algorithms commonly referred to as branch and bound algorithms (e.g. Michalewicz and Fogel (2004), Sec. 4.4). A skeleton of this particular procedure is shown in Figure 2.

The algorithm of Figure 2 can also be used for K-order Markov models, its efficacy depending on the estimation of the bounds performed in line 7. In order to explain our solution for this problem, we need some notation. First, we notice that the log-likelihood score of a word x=x1···xW w.r.t. a K-order Markov model m of length W decomposes additively into {ell}(x;m)={sum}Formulalog P(xj|xjK··· xj–1). We denote by {ell}j(x;m)=P(xj|xjK··· xj–1) the term corresponding to position j, which we call the j-position score. The j-position score depends only on the character a at that position and the k-mer kisinAmin{j–1,K} that precedes it. Thus we write {ell}j(k·a;m) to denote the j-position score of a word x with xj=a and xj–|k|···j–1=k. We denote by {ell}u··· v(x;m)={sum}j=uv{ell}j(x; m) the partial sum of the position scores for positions from u to v, or the subword score from u to v. If u=1, then we call it the v-prefix score of x. If v=W, it is called the u-suffix score of x. Notice that, as for the position score, the subword score {ell}u··· v(x;m) depends not only on the subword xu··· xv but also on the k–mer k that precedes it. We thus write {ell}u··· v(k·;z;m) to denote the subword score from u to v of a word x s.t. xu–|k|···u–1=k and xu··· v=z. For the v-prefix score of a word starting with visinAv we write {ell}··· v(v;m), and for the u-suffix score of a word ending with the suffix uisinAWu+1 preceded by the k-mer k we write {ell}u···(k·;u; m). Finally, we denote by {ell}maxu··· v(khellip;m)=maxzisinAvu+1 {ell}u··· v(k·;z; m) the maximum subword score from u to v of a word with the ‘seed’ k-mer xu–|k|··· u–1=k preceding the subword. By the same token, we denote by {ell}Formulam)=maxvisinAv {ell}··· v(v;m) the maximum v-prefix score and by {ell}Formula(k·;m)=maxuisinAWu+1} {ell}u···(k·;u;m) the maximum u-suffix score for of a word with the ‘seed’ k-mer xu–|k|··· u–1=k. The minimum subword, prefix and suffix scores, {ell}Formula(k·;m), {ell}Formulam) and {ell}Formulau···(k·;m) are defined accordingly.

Back to the branch and bound problem, suppose that we are at a point in the enumeration tree where we have spelled a (j–1)-prefix y ending with the k-mer k, that is, the prefix is of the form y=*k for some kisinAmin{K,j–2}. Then we propose to use the bounds Lmin(y)={ell}Formula(k·;m) and Lmax(y)={ell}maxj···(k·;m) in line 7 of the algorithm of Figure 2. Notice that these are the most strict bounds we can use knowing only the prefix y. The computation of these bounds is done by a dynamic programming procedure based on the following proposition.

LEMMA 1.
Let m be a1 K-order Markov motif model of length W, 0≤u≤v≤W, and kisinAmin{K,u–1}. In addition, let {ell}Formula(k·*z; m) denote the maximum subword score from u to v of a word x s.t. xu–|k|···u–1=k and xu···v=*z, that is, we require that the subword xu···v ends with z. Then the following holds:
  1. If|z|≥min{vu+1,K, then, for all aisinA,


    Formula


  2. {ell}Formula(k·;m)=maxzisinAz{ell}Formula(k·*z; m) for any 0≤z≤Wu+1.

PROOF.
To prove proposition (i) notice that, by definition, {ell}Formula(k·*za;m) corresponds to the best subword score from u to v+1 of a word x s.t. xu–|k|···u–1=k and xu···v+1=*za. The subword score from u to v+1 of one such a word is of the form


Formula

The condition |z|≥min{vu+1,K} guarantees that, if the size of the subword (i.e. uv+1) is greater than K, then z has size at least K. Otherwise, z fills the whole space u···v. This means that all the letters affecting the last term of the subword score {ell}u···v+1(x; m) are fixed. Hence {ell}v+1(x; m) is constant and


Formula

Part (ii) follows directly from the definition of {ell}Formula(k·;m) and {ell}Formula (k·*z; m). It is only saying that the best score for a suffix starting at position u preceded by k is the best score for a suffix starting at position u preceded by k and ending with a certain z of size z, among all possible choices of z, no matter its length.


Figure 3
View larger version (36K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. Dynamic programming computation of the maximum suffix log-score. The minimum suffix log-score can be computed similarly, only by replacing ‘max’ by ‘min’ in lines 11 and 12.

 
Lemma 1 still holds if we consider minimum scores instead of maximum scores, and the proof is analogous. Based on this result, we can compute {ell}Formula(k·; m) [and equally {ell}Formula(k·;m)] as shown in Figure 3. For each position j = u,..., W we build a table of the best subword scores {ell}maxu···j(k·*z; m) for all possible z of size z=min{K,ju+1}. We start with the value {ell}Formula(k; m)=0 and use the entries of the table corresponding to position j to build the entries of position j+1 with the aid of item (i) of Lemma 1. When the size of the subword u...j is bigger than K, the best subword scores from u to j+1, obtained after applying relation (i), would refer to subwords ending with all possible suffixes of length K+1. However we do not need to keep all this information since, for the next iteration, only the last K letters of the subword will matter. We hence shrink the table by considering only the best scores based on the last K positions of the subword. At the end, we use item (ii) of the lemma to return the desired result.

As a final note, we remark that the algorithm of Figure 3 takes time O((Wu)· |A|K+1). Indeed, for each position j=u,..., W, we need to build an extended table based on the |A|K entries of the previous iteration and the possible |A| new letters, hence |A|K+1 entries. Then we eventually need to shrink this table, which demands visiting again its |A|K+1 entries. At the end of the algorithm, we have to scan the final |A|K entries corresponding to the last position to obtain the final result. We also remark that, for each prefix in the recursive enumeration, the computation of the maximum and minimum suffix score bounds can be done simultaneously. Moreover, all prefixes of the same length ending with the same k-mer will share the same bounds, therefore this computation needs to be carried out only once per k-mer and per prefix length. Thus the worst case total cost of computing score bounds along the enumeration procedure is {sum}Formula|A|K· (Wu)·|A|K+1 = |A|2K+1·W·(W–1)/2 = O(W2|A|K).

2.3.2 Computing P-values by iterative model refining
We discuss now our extension of the method proposed by Touzet and Varre (2007). We begin with the very general observation that an essential factor impacting the efficiency of score P-value algorithms is the number of possible scores. We say that a score s is possible for a model m whenever there is a word x s.t. the score of x under m equals s. Indeed, in a principled enumeration schema like the one discussed in the previous section, if there are many possible scores then, at any given point, it is more likely that some of them will fail to observe our pruning conditions forcing the recursion to go down. On the other hand, if the number of possible scores is small, the number of possible sequences being the same, the score distribution is more ‘concentrated’, which results in more sequences being counted or discarded by groups, hence abbreviating the enumeration. Therefore, it would be interesting to reduce the number of possible scores of a model, for example, by truncating the values of its parameters at an arbitrary precision. Of course, we cannot do this in general for any score threshold, since this could result in some sequences whose scores are normally above the threshold not being counted due to the reduction in their scores caused by the precision loss. However, for sufficiently extreme P-values and for an adequate precision, we may miss no sequence while speeding up the P-value computation.

To formalize these ideas, let us re-phrase some key definitions found in Touzet and Varre (2007). First, given a number xisinR and a precision isin≥1, we define the round value of x at precision isin as Formula , where {lfloor}·{rfloor} denotes the ‘integer part’ function. The precision isin will be normally taken to be a non-negative power of 10, i.e. isin=10k for k≥0 and, in this case, the rounding operation will correspond to truncating the number at its k-th decimal place. Next, given a K-order Markov motif model of length W, m, we call the derived isin-round model misin the model obtaining by rounding each parameter of m at precision isin, that is


Formula

Since, for any x≥0, [x]isin≤x, and since log is a monotonically increasing function, it follows that {ell}(x;misin)≤{ell}(x;m) for all isin≥1 and for all xisinAW. We can thus define the maximal error associated to the precision isin as


Formula

By this very definition of the maximum error, we have that, for any isin≥1 and xisinAW


Formula

Finally, given a null model m0, we define the log–score distribution of m (under m0) as the function


Formula

that is, for sisinR, Qm0(s;m) denotes the probability under m0 for a sequence to have log-score s in m.

The algorithm we are about to describe is based on the following proposition.

LEMMA 2.
Let m be a K-order Markov model, isin'≥isin be two precisions and t'≤t be two real numbers s.t. P-value(t;misin)=P-value(t{Delta}m,isin;misin) and P-value(t';misin')=P-value(t'–{Delta}m,isin';misin'). Then


Formula

Lemma 2 is essentially the same as Lemma 8 in Touzet and Varre (2007). We refer thus to that paper for a proof. This result says that we can approximate the log-score distribution of the original model m with no error by the score distribution of the round model misin' inside the interval[t',t). If, for a still higher precision isin'', we could find a value t''≤t' s.t. P-value(t'';misin'')=P-value(t''–{Delta}misin'';misin'') then, according to the lemma, we would be able to approximate the original score distribution inside the adjacent interval [t'',t') with the refined model misin''. The net effect is that we would have the original score distribution in the interval [t'',t) exactly estimated from two round distributions. If we continue increasing the precision, we get an increasingly larger interval.


Figure 4
View larger version (33K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4. P-value iterative refinement.

 
This idea is used to compute P-value(l,m) by approximating the log-score distribution on a series of adjacent intervals covering the entire range [l,+{infty}) using round models of increasing precision. The procedure is shown in Figure 4. The two main problems of this algorithm when applied to K-order Markov models correspond to lines 5 and 6. The former refers to the computation of the maximum error, which is a trivial problem when PWMs are used, but which is more complicated in our case. The latter corresponds to the computation of the score distribution, which is the core of the algorithm with exponential behaviour. The strategy to alleviate its deleterious impact consists in starting with models with low precision and augmenting it progressively. The increase in the precision up to a reasonable level may still be necessary, but it is also partially compensated by the fact that the interval over which the distribution is computed diminishes as the precision grows.

For the purpose of computing the maximum error associated to the round model misin, we have elaborated a dynamic programming algorithm which is based on the following proposition.

LEMMA 3.
Let {Delta}m,isin[j,k]=maxy=*kisinAj{ell}···j(y;m)–{ell}···j(y;misin), i.e. {Delta}m,isin[j,k] denotes the maximum rounding error for a prefix of length jending withk (by convention, {Delta}m,isin[0,{varepsilon}]=0). Then, the following holds:
  1. If|k|≥min{K,j} then, {forall} aisinA,


    Formula


  2. {Delta}m,isin=maxkisinAk{Delta}m,isin[W,k], for any 0<k≤W.

PROOF.
For part (i), let y=*kaisinAj+1, that is, y is a prefix of length j+1 ending with ka. Then


Formula

Because of the restriction on the size of k, the last term of the sum above is fixed, that is,


Formula

Hence


Formula

Part (ii) is a direct consequence of the definition of {Delta}m,isin[W,k]. It is only saying that the maximum error for a W-word is the maximum error for a W-word (=W-prefix) ending with a certain k among all possible choices for k.

The algorithm of Figure 5 computes {Delta}m,isin by using Lemma 3 to build tables D[j,k] containing the values of {Delta}m,isin[j,k] for j=1,...,W and kisinAmin{j,K}. It is easy to see that, for each j=1,..., W, the algorithm takes at most |A|K+1 steps to build the table Formula , and then |A|K+1 additional steps to read all its elements and ‘compress’ it into the final table D[j,k]. Hence the overall time complexity of the algorithm is O(W·|A|K+1).


Figure 5
View larger version (18K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5. Maximum rounding error for K-order Markov models.

 
Finally, we need to discuss how to compute the score distribution Qm0(s;m). More specifically, our objective is to compute Qm0(s;m) for every possible score s within a given interval [c,d]. The idea is to perform the enumeration of the words in AW and select the words x satisfying c≤{ell}(x;m)≤d, grouping them by their scores. As in the branch and bound P-value computation, we use bounds for the scores of a word starting with a given prefix to prune the enumeration. However, we now perform a breadth-first enumeration whereas in the branch and bound algorithm we performed a depth-first enumeration. This is necessary because we keep track of the possible prefix scores for all prefixes of length j in order to compute the possible scores for prefixes of length j+1. As a matter of fact, we do not need to keep track of all possible prefixes of length j and their scores for the computation of the possible prefix scores of length j+1 since only the last k=min{K,j} letters of the j-prefix affect the (j+1)-position score. We hence need only to keep a table with the possible combinations of (j,k,s), where j is the size of the prefix, kisinAmin{K,j} is such that the prefix is of the form *k and s is a possible score for one or more prefixes of this form. The value of the entry in the table indexed by (j,k,s) corresponds to Pm0({yisinAj|y=*k {wedge} {ell}...j(y;m)=s}). In the end, the entries of the form (j=W,k,s) summarize all possible scores in [c,d] and the corresponding words of length W, grouped by each possible suffix k, with their joint probabilities under the background model. The procedure is shown in Figure 6.


Figure 6
View larger version (26K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 6. Log-likelihood score distribution.

 
The core of the algorithm consists in computing the score distribution tables S[j,k,s] by taking the entries of S[j–1,k,s] and considering their extension to position j upon the addition of each character aisinA, thus |S[j–1,k,s]|·|A| possibilities. For each possible extension, the algorithm makes use of suffix score bounds, as in the branch and bound procedure, and so the maximum time cost for computing these bounds is the same as in that algorithm. The maximum time of the table construction phase is thus given by {sum}j=1W|S[j–1,k,s]|·|A|+{sum}Formula(Wj)|A|2K+1. The final phase of the algorithm consists in reading the entries of S[W,k,s] to produce the final table Q[s]. Overall, the worst case occurs when each prefix of length j=1,...,W originates a distinct prefix score, all of them passing the pruning criteria. In this case, |S[j,k,s]|=|A|j and thus the the total cost of the algorithm (assuming W>K) is given by Formula Formula as expected. However, in practice, the iterative refinement algorithm considerably reduces the size of the score distribution tables (and therefore the running time of the algorithm) by, on the one hand, working with models of the lowest possible precision, which reduces the number of possible scores and, on the other hand, by decreasing the score interval as the model precision increases, which makes the pruning filter more strict.

2.3.3 Computing the log-likelihood threshold for a given P-value
Despite the speed-up brought by the P-value computation algorithms proposed here, this problem remains computationally costly. This is why, instead of computing the P-value of a putative occurrence in order to determine if it is significant or not, it is preferable to have a likelihood threshold and test whether the likelihood of this occurrence is greater than this threshold. The algorithm proposed by Touzet and Varre (2007) for computing the maximum possible score whose P-value is greater than a given value p can also be used for K-order Markov models once we have extended the algorithms for computing the log-likelihood distribution (Fig. 6) and the maximum error (Fig. 5), and by using our adapted branch and bound algorithm as the counterpart of the algorithm ‘FastPvalue’ of that paper.


    3 DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 DISCUSSION
 4 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
To illustrate our discussion, we report some results of an experiment performed to study the behaviour of our P-value algorithms. We have selected seven motifs with lengths ranging from 6 to 12 from the dataset used in Barash et al. (2003), which contains TBFS alignments extracted from the TRANSFAC database (Wingender et al., 2000). We chose, for each length, the motif with the greatest number of complete aligned sites (i.e. gaps were not allowed). Next, we used each of these motifs to build a K-order Markov model for K=0,1,2 using the motif trie representation. Then, for each of these models, we took the same set of 20 words sampled from the uniform background model, and computed the P-value of their log-scores using the brute-force (BF), branch and bound (BB) and iterative refinement (IR) methods. The mean computa-tion times are summarized in Table 1. By analysing this table we notice that:

  1. The implemented heuristics significantly improve over the naive method in practice, although for small motif lengths, the gain is not substantial due to the overhead brought by the calculation of score bounds and score distributions.
  2. cr>As expected, the mean computation time per P-value of the naive method remains more or less unaltered for a fixed motif length, no matter what the motif order is (SD is also low). The BB and IR methods, on the contrary, are more sensitive to the variation of the model order, since the number of possible scores varies exponentially with this number.
  3. For a fixed motif order, all algorithms are negatively affected by an increase of the motif length. However, the actual computation times depend on each particular model and on the distribution of the scores. It may happen that a model with a bigger length gives lower computation times if the test sequences give more extreme P-values. Hence it is difficult to compare different values from the same column.
  4. Comparing the values of each line, we can see that, in general, the IR method performs better than the BB method. The difference is accentuated as both the motif size and order grows since this increases exponentially the number of possible scores, a problem which is addressed more directly in the IR method by using models with lower precision.


View this table:
[in this window]
[in a new window]

 
Table 1. P-value computation times for different values of K and W using three methods: BF, BB and IR.

 

    4 CONCLUSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 DISCUSSION
 4 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
In this article, we have considered two practical issues regarding the use of high-order Markov models for the representation of biological sequence motifs. First, we proposed a compact representation for a set of example motif occurrences from which position-specific conditional base probabilities can be efficiently computed. This representation can be useful in situations where the model is constantly updated, for instance, during the course of a motif learning procedure. As an example, we have included, in the software library mentioned in the abstract, an implementation of the motif site sampler (Lawrence et al., 1993) which uses the trie representation to infer K-order Markov motifs in a set of unaligned sequences. In the original Gibbs sampling procedure, the model parameters are updated as soon as a new site is sampled. With the proposed representation, we only include the newly sampled occurrence in the trie, which can be done in linear time. Then we considered the problem of assessing the statistical significance of putative occurrences through the computation of exact P-values. We proposed extensions of PWM-based techniques to deal with K-order Markov models. These PWM algorithms figure among the most efficient algorithms in their category, and these extensions enrich the algorithmic toolkit of Markov dependency models contributing to the dissemination of their use. It is known, however, that standard K-order Markov models can suffer from the scarcity of training data, and therefore it will be interesting to study the adaptation of the proposed representation and algorithms to more general Markov-type dependency models such as the ones proposed by Zhao et al. (2005) or Huang et al. (2006), for instance.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 DISCUSSION
 4 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
The authors thank Augusto Vellozo for the helpful discussions.

Funding: P.F. was funded by a doctoral scholarship from the CAPES Foundation (Brazilian ministry of education) under co-supervision of K.G. and M.F.S. This project was also funded by an ANR research grant (Project REGLIS no. 05-NT05-3_45205) and counted on the computational infrastructure of the PRABI (Rhone-Alpes Pole of Bioinformatics).

Conflict of Interest: none declared.


    FOOTNOTES
 
In the pattern-matching literature, the term u-suffix usually refers to the suffix on length u whereas here it means the suffix starting at position u. Back


    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 DISCUSSION
 4 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Bailey TL, Elkan C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc. Int. Conf. Intell. Syst. Mol. Biol (1994) 2:28–36.[Medline]

    Barash Y, et al. Modeling dependencies in protein-DNA binding sites. In: In Proceedings of the Seventh Annual International Conference on Research in Computational Molecular Biology (RECOMB’03). (2003) New York, NY, USA: ACM. 28–37.

    Beckstette M, et al. Fast index based algorithms and software for matching position specific scoring matrices. BMC Bioinformatics (2006) 7:389.[CrossRef][Medline]

    Bejerano G. Efficient exact P-value computation and applications to biosequence analysis. In: In Proceedings of the Seventh Annual International Conference on Research in Computational Molecular Biology (RECOMB’03). (2003) New York, NY, USA: ACM. 38–47.

    Durbin R, et al. Biological sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. (1999) Cambridge, UK: Cambridge University Press.

    Ellrott K, et al. Identifying transcription factor binding sites through markov chain optimization. Bioinformatics (2002) 18(Suppl 2):S100–S109.[Abstract]

    Fredkin E. Trie memory. Comm. ACM (1960) 3:490–499.[CrossRef]

    GuhaThakurta D. Computational identification of transcriptional regulatory elements in DNA sequence. Nucleic Acids Res (2006) 34:3585–3598.[Abstract/Free Full Text]

    Huang W, et al. Optimized mixed markov models for motif identification. BMC Bioinformatics (2006) 7:279.[CrossRef][Medline]

    Knuth DE. Sorting and searching. In: In The Art of Computer Programming, vol. 3 of The Art of Computer Programming. (1998) 2nd edn. Reading, MA, USA: Addison.

    Lawrence CE, et al. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science (1993) 262:208–214.[Abstract/Free Full Text]

    Michalewicz Z, Fogel D. How to Solve it: Modern Heuristics. (2004) Berlin, Germany: Springer.

    Pizzi C, et al. Fast search algorithms for position specific scoring matrices. In: In Proceedings of the Bioinfomatics Research and Development BIRD 2007, vol. 4414 of Lecture Notes in Bioinformatics. (2007) Berlin, Germany: Springer. 239–250.

    Touzet H, Varre J-S. Efficient and accurate P-value computation for position weight matrices. Algorithms Mol. Biol (2007) 2:15.[CrossRef][Medline]

    Wingender E, et al. Transfac: an integrated system for gene expression regulation. Nucleic Acids Res (2000) 28:316–319.[Abstract/Free Full Text]

    Zhang J, et al. Computing exact P-values for DNA motifs. Bioinformatics (2007) 23:531–537.[Abstract/Free Full Text]

    Zhao X, et al. Finding short DNA motifs using permuted markov models. J. Comput. Biol (2005) 12:894–906.[CrossRef][Web of Science][Medline]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by da Fonseca, P. G. S.
Right arrow Articles by Sagot, M.-F.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by da Fonseca, P. G. S.
Right arrow Articles by Sagot, M.-F.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?