Bioinformatics Advance Access originally published online on October 26, 2006
Bioinformatics 2006 22(23):2870-2875; doi:10.1093/bioinformatics/btl528
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Adding sequence context to a Markov background model improves the identification of regulatory elements
National Center for Biotechnology Information, National Library of Medicine National Institutes of Health, Bethesda, MD 20894, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Many computational methods for identifying regulatory elements use a likelihood ratio between motif and background models. Often, the methods use a background model of independent bases. At least two different Markov background models have been proposed with the aim of increasing the accuracy of predicting regulatory elements. Both Markov background models suffer theoretical drawbacks, so this article develops a third, context-dependent Markov background model from fundamental statistical principles.
Results: Datasets containing known regulatory elements in eukaryotes provided a basis for comparing the predictive accuracies of the different background models. Non-parametric statistical tests indicated that Markov models of order 3 constituted a statistically significant improvement over the background model of independent bases. Our model performed slightly better than the previous Markov background models. We also found that for discriminating between the predictive accuracies of competing background models, the correlation coefficient is a more sensitive measure than the performance coefficient.
Availability: Our C++ program is available at ftp://ftp.ncbi.nih.gov/pub/spouge/papers/archive/AGLAM/2006-07-19
Contact: spouge{at}ncbi.nlm.nih.gov
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
|
|
|---|
In the post-genomic era, identifying regulatory elements is an important step to understanding the mechanisms of gene regulation. Regulatory motifs in DNA take considerable time and effort to identify experimentally, so the development of computational tools to discover them is of great interest in bioinformatics. Most computational tools for identifying regulatory elements use one of two methods, alignment or enumeration (Ohler and Niemann, 2001). On one hand, enumerative methods consider all words of a particular length, listing the over-represented words (Marino-Ramirez et al., 2004; Pavesi et al., 2004; Sinha and Tompa, 2002). Enumerative methods have difficulties, however, with word-lengths longer than 12 or with large datasets. On the other hand, alignment methods consider local alignments and maximize a score, usually with expectation maximization (Bailey and Elkan, 1995) or Gibbs sampling (Lawrence et al., 1993; Liu et al., 1995; Tharakaraman et al., 2005). The score usually corresponds to a probability model and often is the sum of entries in a position-specific score matrix (PSSM). In this set-up, the PSSM represents the log-likelihood ratio between a statistical model for a motif and a background model. In the past, background models usually postulated random sequences whose letters were selected independently from a fixed distribution on the nucleotide alphabet {A, C, G, T}. Recently, however, researchers have proposed Markov background models, to extract more empirical information about nucleotide word frequencies. Presently, there are two different Markov background models extant (Liu et al., 2001; Thijs et al., 2001). Here, we examine the two Markov background models and compare them to a third model derived from fundamental statistical principles. We then compare the predictive accuracy of the different background models on a standard test dataset (Tompa et al., 2005).
| 2 METHODS |
|---|
|
|
|---|
The usual statistical approach posits two probability models for nucleotide sequences containing possible embedded motifs: a null model and an alternative model. In our null model, there is no regulatory element; in our alternative, each sequence contains a single element representing a transcription binding factor (TBF) motif. Tools implementing statistical models need not restrict themselves to a single element per sequence, and indeed, many do not (Hughes et al., 2000; Liu et al., 1995; Thijs et al., 2001). Although the restriction to a single element is easily removed, we make it here for convenience and simplicity, principally to be able to compare the different Markov background models under controlled conditions.
Though the two probability models assume multiple sequences as an input, to start, let us consider a single random sequence A
(A1, ... , An), i.e. a single string of length n. Let a
(a1, ... , an) be a realization of the sequence A. In the null model, the sequence A is generated by a Markov background model of order m. Let wk
(ak, ... , ak+m1) represent the k-th word of length m (m-letter word) in the realization a. The probability of observing a is written as follows:
|
|
(w) is the equilibrium probability of the word w, and p(u,v) is the transition probability given the word u to the word v. Naturally, p(u,v) > 0 only if the last m 1 letters of the word u and the first m 1 letters of the word v are the same.
In our alternative model, the sequence A is generated by a background equilibrium Markov model of order m, except for one random element, a t-word positioned somewhere within the sequence. Let J be the initial position of the random element, with a uniform prior over all its possible positions, before the sequence is known. The alternative model is
|
|
to differentiate this alternative model from two others,
and
below.
Starting with the probability of the sequence A containing an element at the j-th position conditional on the observed sequence,
|
| (1) |
In Equation (1),
can replace
, because both are constants given the data, the sequence {A = a}. Note: (A) although many factors cancel, Equation (1) is derived from a model of the entire sequence A; and (B) the probability in Equation (1) depends on both of the m-words neighboring an element. We call the Markov background model in Equation (1) the context-2 model, where 2 represents two-sided dependency on the m-words neighboring an element. As a simple example of Equation (1), consider an order-1 Markov model and a motif consisting of a single position 0 embedded within a sequence. Then, Equation (1) becomes
![]() | (2) |
In contrast, the equilibrium probability of an isolated m-word representing a putative TBF element is
|
| (3) |
![]() | (4) |
Thijs et al. (2001) proposed yet another Markov background, which in the present notation can be written
|
| (5) |
Equation (5) conditions on the m-word to the left of a putative TBF element, but not on the m-word to the right. We call the Markov background model in Equation (5) the context-1 model, where 1 represents the dependency on the single m-word to the left of an element. As a simple example, for an order-1 Markov model and a motif consisting of a single position 0 embedded within a sequence, Equation (5) becomes
![]() | (6) |
If the Markov order is 0, the three models are identical, but for higher Markov orders, important logical and statistical considerations distinguish the three models. Logically, the statistical analysis using a Markov model should not depend on the arbitrary direction chosen for the Markov transitions. Consider the reverse of the Markov chain considered above. The reverse chain has the same equilibrium probabilities as the forward chain, with its transition probabilities
satisfying
. Although the probabilities in Equations (1) and (3) for the context-2 and context-0 models are invariant under reversal, the probability in Equation (5) for the context-1 model is not. The context-2 model, moreover, incorporates all available sequence information in accord with standard modeling procedures, including the sequence outside the putative TBF element, whereas the context-0 model does not.
The simple examples in Equations (2), (4) and (6) for Markov order 1 show that statistical inferences might differ among the three models described (Fig. 1). We implemented all three context-dependent Markov models inside the A-GLAM (anchored gapless local alignment of multiple sequences) program (Frith et al., 2004; Tharakaraman et al., 2005). A-GLAM is a tool for finding regulatory motifs, based on a probability model. The probability model represents binding sites as a gapless alignment block, and each column of the alignment is assumed to follow a multinomial distribution over nucleotide alphabet {A, C, G, T}. An alignment that maximizes the likelihood function of the motif is reported as a candidate of binding sites. A-GLAM implements a Gibbs sampling algorithm to optimize the parameters of its probability model.
|
| 3 RESULTS |
|---|
|
|
|---|
To test the context-dependent Markov models, we used a standard test dataset (Tompa et al., 2005) consisting of human, yeast, mouse and fly sequences with experimentally verified binding sites. Tompa et al. (2005) considered three types of data: real, generic and Markov. We used only real datasets, because the background sequences in the generic and Markov datasets were artificial and therefore not suited to the purpose of testing background models. To avoid special cases that unnecessarily complicate our study, we further removed three datasets having only one sequence or containing ambiguous characters, leaving 49 real datasets.
Transition probabilities in higher-order Markov model have to be estimated from some background dataset. Because the transition probabilities may differ from species to species, we constructed four background datasets, one for each species, by combining all datasets corresponding to that species. For example, there were eight datasets from yeast, so the background transition probabilities for yeast were estimated from a combination of the 8 yeast datasets. This study examined order-3 Markov models, because background datasets for all four species were large enough to estimate order-3 transition probabilities, but not order 4. We therefore compared the three different Markov background models of order 3 to each other and to the Markov background model of order 0 (independent bases).
To measure the predictive accuracy of A-GLAM using the four different background models, we used the correlation coefficient (Tompa et al., 2005), defined as follows:
![]() | (7) |
![]() | (8) |
We extended the A-GLAM program (Tharakaraman et al., 2005) (written in C++), so it could calculate under each of the four background models. The program also calculates an E-value for each individual site by multiplying the P-value for candidate site by the number of site locations within a sequence (Huang et al., 2004). Although many different tests of the four models are possible, to compare the models on an even and natural footing, we simply ran A-GLAM in its default setting for each model. In particular, the default setting for A-GLAM is a double-stranded ZOOPS mode (Zero or One Occurrence Per Sequence), where it searches the two strands corresponding to each input DNA sequence for zero or one instance of a sequence motif. Although arguments can be made for searching for more than one instance per sequence, searching for the strongest instance in each sequence provides a particularly straightforward evaluation of the different background models.
Table 1 shows an anecdotal result from a dataset, hm06r. An alignment from an order-0 Markov model is in Table 1 (A), and an alignment from an order-3 context-2 Markov model is in Table 1 (B). The sequences in the table were sorted according to their E-values, in the same order as their scores. Because A-GLAM was run in the ZOOPS mode, each sequence was reported as containing zero or one site. For example, A-GLAM reported no site in seq_6 as shown in Table 1 (A). The experimentally verified sites are underlined in the table. An order-3 Markov model in Table 1 (B) contains three known sites in the alignment while an order-0 Markov model in Table 1 (A) contains only one known site. The correlation coefficient between the alignment in Table 1 (A) and the known sites is 0.017 while the correlation coefficient between the alignment in Table 1 (B) and the known sites is 0.202. Not only did the order-3 Markov models identify three known sites, but Table 1 (B) also shows those three sites received the smallest E-values. In contrast, the known site in Table 1 (A) did not have the smallest E-value. For putative elements, the order-0 Markov model provided smaller E-values than the order-3 Markov model, which might mislead practitioners who judge the strength of a given alignment solely by looking at the E-values. Because an order-0 Markov model assigns a spuriously low background probability to repeats, Table 1 (A) shows the order-0 Markov model wrongly predicted regulatory elements in C-rich regions. In contrast, the order-3 Markov model assigned larger background probabilities to repeats than the order-0 Markov model. Therefore, the order-3 Markov model successfully avoided predictions in C-rich regions and identified several known sites, as shown in Table 1 (B).
|
Table 2 contains a systematic summary over the test datasets for the four different models. Among 49 datasets, the first 26 datasets are from human, next 8 are from yeast, following 11 are from mouse, and the last 4 are from fly. The table contains the correlation coefficients from the formula (7) along with the ranks of the correlation coefficients for each model within each dataset. Specifically, each row contains a dataset name and four correlation coefficients and four ranks. For example, the dataset hm05r has four correlation coefficients: 0.051, 0.039, 0.038 and 0.082. Ranks are assigned according to relative magnitude. Since 0.082 is the largest of all, its rank is 1. The rank of the value 0.051 is 2, and so on. The smaller a rank is, the better the result. Ranks were used to compare the predictive accuracy of competing models, because an analysis based on ranks is robust against an occasional unusual magnitude among the correlation coefficients, thereby retaining fidelity to any overall trend within the data. Correlation coefficients and their ranks for each species are followed by averages. For completeness, Table 2 also lists the average correlation coefficients, but our analysis gives primacy to the comparison with average ranks.
|
In all four species, and in terms of average ranks, the context-2 model of order 3 predicted most successfully while the order-0 Markov model predicted least successfully. With the single exception of yeast, all species show a trend, with the context-2 model of order 3 predicting best, followed by the context-1 of order 3, the context-0 of order 3, and the Markov model of order 0. Overall rank averages also reflect the same trend (the average ranks being in opposite order to the correlation coefficients). To evaluate statistical significance of our results, let µ represent the average rank of the correlation coefficient corresponding to a particular model. We examined various one-sided hypotheses on µ with the one-sample Wilcoxon test, e.g.
,
, and
. The P-value for the test between the order-0 model and the context-0 model is 0.023; between the context-0 and the context-1, 0.338; and between the context-1 and the context-2, 0.002. Thus, the improvement of a Markov model of order 3 over the Markov model of order 0 is statistically significant at the level 0.05, as is the improvement of the context-2 model over the context-1 model. Our test datasets are all from eukaryotes, which might partially explain why these results contrast with a previous report based on datasets from prokaryotes (Hu et al., 2005). The previous study, however, took the performance coefficient as its measure of prediction accuracy, finding that differences between the predictive accuracy of an order-0 Markov model and higher-order Markov models (for example, order 3) were not obvious. With this discrepancy in mind, we also evaluated our results using the performance coefficient [defined in Equation (8)]. Based on the average ranks of the performance coefficient, the context-2 model of order 3 predicted most successfully and the Markov model of order 0 predicted least successfully (Supplementary Table 1), as before. The statistical test showed that the context-2 model significantly outperformed other three models with a P-value of 0.034. Differences in the other comparisons were not as dramatic, however, probably because the performance coefficient produced more ties than correlation coefficient.
To check that the comparison of the different background models was stable under incidental perturbations of the test conditions, we increased the number of iteration steps in the Gibbs sampling, to guard against A-GLAM converging prematurely to a local rather than global maximum. Although, e.g. the correlation coefficient of the context-1 model on hm23r (0.2774) degraded to a negative value (0.051) over longer sampling, results remained generally stable and confirmed the results from A-GLAM's default settings. Most perturbations had similar, occasional, inconsequential effects, with one notable exception. Because the test datasets contain experimentally determined binding sites, all known sites are present in the strands corresponding to sequences in the datasets. Accordingly, we restricted the search space to the strand given in the dataset (as opposed to both the given strand and its complementary strand), to take advantage of strand-specific information that might occasionally be available in a practical situation. The difference between the order-0 Markov model and the order 3 Markov models remained statistically significant (P-value at most 0.024), but differences among the context-0, context-1 and context-2 models of order 3 were not significant. Note, however, that whereas the results from the order-0 Markov model, the context-0 model and the context-1 model improved when the search was restricted to a single-stranded search, the result from the context-2 model did not improve. Although an entirely satisfactory explanation is lacking, possibly the context-2 model might have already reached a practical limit, leaving little room for improvement.
| 4 DISCUSSION |
|---|
|
|
|---|
This article presents a context-dependent higher-order Markov model, the context-2 model. We named the previous versions of the higher-order Markov model context-0 and context-1 model (Fig. 1). Intuitively, the context of a putative element should affect statistical inferences, under the following logic. In IUPAC notation (Suzuki et al., 2001), S denotes a base with strong pairing (G or C); and W, a base with weak pairing (A or T). Consider a hypothetical AT-rich (W-rich) motif embedded in a GC-rich (S-rich) regulatory region. In background GC-rich DNA, the transitions S
W or W
S at the motif boundaries are improbable. Our context-2 Markov model therefore sees the transitions from motif to background as evidence of a TBF element. In contrast, the context-0 Markov model discounts these transitions completely, and the context-1 Markov model ignores the transition on the right. Moreover, the asymmetry in the context-1 Markov model is somewhat artificial, because logically, the directionality of the Markov model should not affect the analysis. Moreover, although the context-0 model correctly lacks any inherent directionality, it ignores potentially useful sequence information outside a putative regulatory element. In our hands, computation on the order-3 Markov models required little additional computer time or space, when compared to an independent background model. When A-GLAM was run to find the binding sites in test datasets, order-3 Markov models, e.g. only doubled the computation time of the background model of order 0. We could discern no appreciable difference in computation time for the different context-dependent Markov models.
Other articles implicitly present higher-order Markov background models as improvements over an order-0 Markov model. To the best of our knowledge, however, these articles do not give a controlled comparison specifically testing their Markov improvement. Here, non-parametric statistical tests evaluated potential improvements associated with higher-order Markov models, using as a gold standard 49 test datasets where regulatory elements had been experimentally identified.
The correlation coefficient was our primary measure of the predictive accuracy of competing models. We also compared models with the performance coefficient, but it usually downplayed differences between models. To see why, assume that there are no true positives in a dataset. The corresponding performance coefficient is 0 [Equation (8)], whereas the correlation coefficient decreases from 0 as positives are predicted. For example, consider two hypothetical predictions with correlation coefficients of 0.1 and 0.2. On one hand, a correlation coefficient of 0.1 should be considered better than a correlation coefficient of 0.2, because the corresponding prediction has fewer false positives. On the other hand, both alignments have a performance coefficient of 0. In contrast to claims elsewhere (Hu et al., 2005), we conclude that as a measure of predictive accuracy, the correlation coefficient is more sensitive than the performance coefficient, and therefore superior for distinguishing predictive accuracies.
In our tests, the higher-order Markov models were superior to a background model of independent letters, the superiority of all three order-3 Markov models being statistically significant. The context-2 Markov was consistently superior to the other higher-order Markov models, although the statistical significance of the improvement occasionally disappeared as conditions of the testing varied. Because repeats are often composed of repeated one- or three-letter words, Markov models of order 3 should capture some of their essence. One might therefore expect that part of the superiority of the order-3 Markov models could be due to modeling repeats as part of the background. Our results did not bear that expectation out, however. The superiority of Markov background models was not as pronounced in human (where DNA has abundant repeats) as in some other species (Table 2).
Statistical tests showed the clear superiority of some models, but all correlation coefficients remained disquietingly low. A-GLAM's performance is comparable to any other motif-finding tool, however (Supplementary Table 2). The low coefficients therefore suggest at least two possibilities. First, (although perhaps unlikely) any tool presently available might be only slightly more accurate than a random guess. Second, the best standard test dataset available for evaluating motif prediction (Tompa et al., 2005) might not provide a good gold standard: its real biological sequences could contain large numbers of unidentified motifs, and experiments might not have accurately identified the boundaries of its known motifs. The fact that it is difficult to quantify the second possibility accurately indicates that the science of finding regulatory motifs is still very incomplete.
| Acknowledgments |
|---|
The authors thank Sergey Sheetlin for helpful discussion. This research was supported by the Intramural Research Program of the National Library of Medicine at the National Institutes of Health. Funding to pay the Open Access charges was provided by the Intramural Research program of the National Library of Medicine at National Institutes of Health/DHHS.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: John Quackenbush
Received on July 24, 2006; revised on September 20, 2006; accepted on October 10, 2006
| REFERENCES |
|---|
|
|
|---|
Bailey, T.L. and Elkan, C. (1995) Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Machine Learning J, . 21, 5183.
Frith, M.C., et al. (2004) Finding functional sequence elements by multiple local alignment. Nucleic Acids Res, . 32, 189200
Hu, J., et al. (2005) Limitations and potentials of current motif discovery algorithms. Nucleic Acids Res, . 33, 48994913
Huang, H., et al. (2004) Determination of local statistical significance of patterns in Markov sequences with application to promoter element identification. J. Comput. Biol, . 11, 114[CrossRef][Web of Science][Medline].
Hughes, J.D., et al. (2000) Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J. Mol. Biol, . 296, 12051214[CrossRef][Web of Science][Medline].
Lawrence, C.E., et al. (1993) Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262, 208214
Liu, J.S., et al. (1995) Bayesian models for multiple local sequence alignment and Gibbs sampling strategies. J. Am. Statist. Assoc, . 90, 11561169[CrossRef][Web of Science].
Liu, X., et al. (2001) BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac. Symp. Biocomput, . 127138.
Marino-Ramirez, L., et al. (2004) Statistical analysis of over-represented words in human promoter sequences. Nucleic Acids Res, . 32, 949958
Ohler, U. and Niemann, H. (2001) Identification and analysis of eukaryotic promoters: recent computational approaches. Trends Genet, . 17, 5660[CrossRef][Web of Science][Medline].
Pavesi, G., et al. (2004) Weeder Web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes. An algorithm for finding signals of unknown length in DNA sequences. Nucleic Acids Res, . 32, W199W203
Sinha, S. and Tompa, M. (2002) Discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Res, . 30, 55495560
Suzuki, Y., et al. (2001) Identification and characterization of the potential promoter regions of 1031 kinds of human genes. Genome Res, . 11, 677684
Tharakaraman, K., et al. (2005) Alignments anchored on genomic landmarks can aid in the identification of regulatory elements. Bioinformatics, 21, I440I448.
Thijs, G., et al. (2001) A higher-order background model improves the detection of promoter regulatory elements by Gibbs sampling. Bioinformatics, 17, 11131122
Tompa, M., et al. (2005) Assessing computational tools for the discovery of transcription factor binding sites. Nat. Biotechnol, . 23, 7144[CrossRef][Web of Science][Medline].
This article has been cited by other articles:
![]() |
M. Mihara, T. Itoh, and T. Izawa In Silico Identification of Short Nucleotide Sequences Associated with Gene Expression of Pollen Development in Rice Plant Cell Physiol., October 1, 2008; 49(10): 1451 - 1464. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||










