Skip Navigation


Bioinformatics Advance Access originally published online on October 26, 2006
Bioinformatics 2006 22(23):2870-2875; doi:10.1093/bioinformatics/btl528
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrowOA All Versions of this Article:
22/23/2870    most recent
btl528v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Google Scholar
Right arrow Articles by Kim, N.-K.
Right arrow Articles by Spouge, J. L.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Kim, N.-K.
Right arrow Articles by Spouge, J. L.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

Published by Oxford University Press 2006
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Adding sequence context to a Markov background model improves the identification of regulatory elements

Nak-Kyeong Kim , Kannan Tharakaraman and John L. Spouge *

National Center for Biotechnology Information, National Library of Medicine National Institutes of Health, Bethesda, MD 20894, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 

Motivation: Many computational methods for identifying regulatory elements use a likelihood ratio between motif and background models. Often, the methods use a background model of independent bases. At least two different Markov background models have been proposed with the aim of increasing the accuracy of predicting regulatory elements. Both Markov background models suffer theoretical drawbacks, so this article develops a third, context-dependent Markov background model from fundamental statistical principles.

Results: Datasets containing known regulatory elements in eukaryotes provided a basis for comparing the predictive accuracies of the different background models. Non-parametric statistical tests indicated that Markov models of order 3 constituted a statistically significant improvement over the background model of independent bases. Our model performed slightly better than the previous Markov background models. We also found that for discriminating between the predictive accuracies of competing background models, the correlation coefficient is a more sensitive measure than the performance coefficient.

Availability: Our C++ program is available at ftp://ftp.ncbi.nih.gov/pub/spouge/papers/archive/AGLAM/2006-07-19

Contact: spouge{at}ncbi.nlm.nih.gov

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 
In the post-genomic era, identifying regulatory elements is an important step to understanding the mechanisms of gene regulation. Regulatory motifs in DNA take considerable time and effort to identify experimentally, so the development of computational tools to discover them is of great interest in bioinformatics. Most computational tools for identifying regulatory elements use one of two methods, alignment or enumeration (Ohler and Niemann, 2001). On one hand, enumerative methods consider all words of a particular length, listing the over-represented words (Marino-Ramirez et al., 2004; Pavesi et al., 2004; Sinha and Tompa, 2002). Enumerative methods have difficulties, however, with word-lengths longer than 12 or with large datasets. On the other hand, alignment methods consider local alignments and maximize a score, usually with expectation maximization (Bailey and Elkan, 1995) or Gibbs sampling (Lawrence et al., 1993; Liu et al., 1995; Tharakaraman et al., 2005). The score usually corresponds to a probability model and often is the sum of entries in a position-specific score matrix (PSSM). In this set-up, the PSSM represents the log-likelihood ratio between a statistical model for a motif and a background model. In the past, background models usually postulated random sequences whose letters were selected independently from a fixed distribution on the nucleotide alphabet {A, C, G, T}. Recently, however, researchers have proposed Markov background models, to extract more empirical information about nucleotide word frequencies. Presently, there are two different Markov background models extant (Liu et al., 2001; Thijs et al., 2001). Here, we examine the two Markov background models and compare them to a third model derived from fundamental statistical principles. We then compare the predictive accuracy of the different background models on a standard test dataset (Tompa et al., 2005).


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 
The usual statistical approach posits two probability models for nucleotide sequences containing possible embedded motifs: a null model and an alternative model. In our null model, there is no regulatory element; in our alternative, each sequence contains a single element representing a transcription binding factor (TBF) motif. Tools implementing statistical models need not restrict themselves to a single element per sequence, and indeed, many do not (Hughes et al., 2000; Liu et al., 1995; Thijs et al., 2001). Although the restriction to a single element is easily removed, we make it here for convenience and simplicity, principally to be able to compare the different Markov background models under controlled conditions.

Though the two probability models assume multiple sequences as an input, to start, let us consider a single random sequence A colone (A1, ... , An), i.e. a single string of length n. Let a colone (a1, ... , an) be a realization of the sequence A. In the null model, the sequence A is generated by a Markov background model of order m. Let wk colone (ak, ... , ak+m–1) represent the k-th word of length m (m-letter word) in the realization a. The probability of observing a is written as follows:


Formula

where {pi}(w) is the equilibrium probability of the word w, and p(u,v) is the transition probability given the word u to the word v. Naturally, p(u,v) > 0 only if the last m – 1 letters of the word u and the first m – 1 letters of the word v are the same.

In our alternative model, the sequence A is generated by a background equilibrium Markov model of order m, except for one random element, a t-word positioned somewhere within the sequence. Let J be the initial position of the random element, with a uniform prior over all its possible positions, before the sequence is known. The alternative model is


Formula

where qj(a) is the probability of finding the letter a at the j-th position within the element. We use the notation ‘Formula’ to differentiate this alternative model from two others, Formula and Formula below.

Starting with the probability of the sequence A containing an element at the j-th position conditional on the observed sequence,


Formula 1

(1)

In Equation (1), Formula can replace Formula, because both are constants given the data, the sequence {A = a}. Note: (A) although many factors cancel, Equation (1) is derived from a model of the entire sequence A; and (B) the probability in Equation (1) depends on both of the m-words neighboring an element. We call the Markov background model in Equation (1) the ‘context-2’ model, where ‘2’ represents two-sided dependency on the m-words neighboring an element. As a simple example of Equation (1), consider an order-1 Markov model and a motif consisting of a single position 0 embedded within a sequence. Then, Equation (1) becomes

Formula 2(2)

In contrast, the equilibrium probability of an isolated m-word representing a putative TBF element is


Formula 3

(3)
Equation (3) is essentially the probability (Liu et al., 2001) assign to a putative TBF element under their alternative hypothesis. Note: (1) Equation (3) ignores information outside the element and considers only on the words within the element; and (2) the probability in Equation (3) does not depend on either m-words neighboring an element. We call the Markov background model in Equation (3) the ‘context-0’ model, where ‘0’ represents the absence of dependency on the m-words neighboring an element. As a simple example of Equation (3), consider an order-1 Markov model and a motif consisting of a single position 0 embedded within a sequence, as above. Then, Equation (3) becomes

Formula 4(4)

Thijs et al. (2001) proposed yet another Markov background, which in the present notation can be written


Formula 5

(5)

Equation (5) conditions on the m-word to the left of a putative TBF element, but not on the m-word to the right. We call the Markov background model in Equation (5) the ‘context-1’ model, where ‘1’ represents the dependency on the single m-word to the left of an element. As a simple example, for an order-1 Markov model and a motif consisting of a single position 0 embedded within a sequence, Equation (5) becomes

Formula 6(6)

If the Markov order is 0, the three models are identical, but for higher Markov orders, important logical and statistical considerations distinguish the three models. Logically, the statistical analysis using a Markov model should not depend on the arbitrary direction chosen for the Markov transitions. Consider the reverse of the Markov chain considered above. The reverse chain has the same equilibrium probabilities as the forward chain, with its transition probabilities Formula 6 satisfying Formula 6. Although the probabilities in Equations (1) and (3) for the context-2 and context-0 models are invariant under reversal, the probability in Equation (5) for the context-1 model is not. The context-2 model, moreover, incorporates all available sequence information in accord with standard modeling procedures, including the sequence outside the putative TBF element, whereas the context-0 model does not.

The simple examples in Equations (2), (4) and (6) for Markov order 1 show that statistical inferences might differ among the three models described (Fig. 1). We implemented all three context-dependent Markov models inside the A-GLAM (‘anchored gapless local alignment of multiple sequences’) program (Frith et al., 2004; Tharakaraman et al., 2005). A-GLAM is a tool for finding regulatory motifs, based on a probability model. The probability model represents binding sites as a gapless alignment block, and each column of the alignment is assumed to follow a multinomial distribution over nucleotide alphabet {A, C, G, T}. An alignment that maximizes the likelihood function of the motif is reported as a candidate of binding sites. A-GLAM implements a Gibbs sampling algorithm to optimize the parameters of its probability model.


Figure 1
View larger version (10K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1 A comparison of the context-0 model, the context-1 model, and the context-2 model. In the context-0 model, the probability of the element location depends only on the sequence within putative element. In the context-1 model, the probability of the element location is influenced by the word preceding the putative element, as well as the sequence within the element. In the context-2 model, the probability of the element location is influenced by both words neighboring the element, as well as the sequence within the element.

 

    3 RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 
To test the context-dependent Markov models, we used a standard test dataset (Tompa et al., 2005) consisting of human, yeast, mouse and fly sequences with experimentally verified binding sites. Tompa et al. (2005) considered three types of data: ‘real’, ‘generic’ and ‘Markov’. We used only real datasets, because the background sequences in the generic and Markov datasets were artificial and therefore not suited to the purpose of testing background models. To avoid special cases that unnecessarily complicate our study, we further removed three datasets having only one sequence or containing ambiguous characters, leaving 49 real datasets.

Transition probabilities in higher-order Markov model have to be estimated from some ‘background’ dataset. Because the transition probabilities may differ from species to species, we constructed four background datasets, one for each species, by combining all datasets corresponding to that species. For example, there were eight datasets from yeast, so the background transition probabilities for yeast were estimated from a combination of the 8 yeast datasets. This study examined order-3 Markov models, because background datasets for all four species were large enough to estimate order-3 transition probabilities, but not order 4. We therefore compared the three different Markov background models of order 3 to each other and to the Markov background model of order 0 (independent bases).

To measure the predictive accuracy of A-GLAM using the four different background models, we used the correlation coefficient (Tompa et al., 2005), defined as follows:

Formula 7(7)
where TP (true positives) is the number of nucleotides in both known sites and predicted sites, FP (false positives) is the number of nucleotides not in known sites but in predicted sites, TN (true negatives) is the number of nucleotides in neither known sites nor predicted sites and FN (false negatives) is the number of nucleotides in known sites but not in predicted sites. The correlation coefficient is the Pearson correlation coefficient of two binary variables, so it ranges from –1 to +1. The correlation coefficient of value +1 indicates perfectly correct prediction while value –1 indicates perfectly incorrect prediction. If sites were predicted randomly, the correlation coefficient would be 0. For completeness, we also examined another measure, the performance coefficient (Tompa et al., 2005):

Formula 8(8)

We extended the A-GLAM program (Tharakaraman et al., 2005) (written in C++), so it could calculate under each of the four background models. The program also calculates an E-value for each individual site by multiplying the P-value for candidate site by the number of site locations within a sequence (Huang et al., 2004). Although many different tests of the four models are possible, to compare the models on an even and natural footing, we simply ran A-GLAM in its default setting for each model. In particular, the default setting for A-GLAM is a ‘double-stranded ZOOPS mode’ (Zero or One Occurrence Per Sequence), where it searches the two strands corresponding to each input DNA sequence for zero or one instance of a sequence motif. Although arguments can be made for searching for more than one instance per sequence, searching for the ‘strongest’ instance in each sequence provides a particularly straightforward evaluation of the different background models.

Table 1 shows an anecdotal result from a dataset, ‘hm06r’. An alignment from an order-0 Markov model is in Table 1 (A), and an alignment from an order-3 context-2 Markov model is in Table 1 (B). The sequences in the table were sorted according to their E-values, in the same order as their scores. Because A-GLAM was run in the ZOOPS mode, each sequence was reported as containing zero or one site. For example, A-GLAM reported no site in seq_6 as shown in Table 1 (A). The experimentally verified sites are underlined in the table. An order-3 Markov model in Table 1 (B) contains three known sites in the alignment while an order-0 Markov model in Table 1 (A) contains only one known site. The correlation coefficient between the alignment in Table 1 (A) and the known sites is 0.017 while the correlation coefficient between the alignment in Table 1 (B) and the known sites is 0.202. Not only did the order-3 Markov models identify three known sites, but Table 1 (B) also shows those three sites received the smallest E-values. In contrast, the known site in Table 1 (A) did not have the smallest E-value. For putative elements, the order-0 Markov model provided smaller E-values than the order-3 Markov model, which might mislead practitioners who judge the strength of a given alignment solely by looking at the E-values. Because an order-0 Markov model assigns a spuriously low background probability to repeats, Table 1 (A) shows the order-0 Markov model wrongly predicted regulatory elements in C-rich regions. In contrast, the order-3 Markov model assigned larger background probabilities to repeats than the order-0 Markov model. Therefore, the order-3 Markov model successfully avoided predictions in C-rich regions and identified several known sites, as shown in Table 1 (B).


View this table:
[in this window]
[in a new window]

 
Table 1 The alignments of ‘hm06r’ produced by A-GLAM

 
Table 2 contains a systematic summary over the test datasets for the four different models. Among 49 datasets, the first 26 datasets are from human, next 8 are from yeast, following 11 are from mouse, and the last 4 are from fly. The table contains the correlation coefficients from the formula (7) along with the ranks of the correlation coefficients for each model within each dataset. Specifically, each row contains a dataset name and four correlation coefficients and four ranks. For example, the dataset ‘hm05r’ has four correlation coefficients: 0.051, –0.039, –0.038 and 0.082. Ranks are assigned according to relative magnitude. Since 0.082 is the largest of all, its rank is 1. The rank of the value 0.051 is 2, and so on. The smaller a rank is, the better the result. Ranks were used to compare the predictive accuracy of competing models, because an analysis based on ranks is robust against an occasional unusual magnitude among the correlation coefficients, thereby retaining fidelity to any overall trend within the data. Correlation coefficients and their ranks for each species are followed by averages. For completeness, Table 2 also lists the average correlation coefficients, but our analysis gives primacy to the comparison with average ranks.


View this table:
[in this window]
[in a new window]

 
Table 2 A summary table for the Tompa's test datasets

 
In all four species, and in terms of average ranks, the context-2 model of order 3 predicted most successfully while the order-0 Markov model predicted least successfully. With the single exception of yeast, all species show a trend, with the context-2 model of order 3 predicting best, followed by the context-1 of order 3, the context-0 of order 3, and the Markov model of order 0. Overall rank averages also reflect the same trend (the average ranks being in opposite order to the correlation coefficients). To evaluate statistical significance of our results, let µ represent the average rank of the correlation coefficient corresponding to a particular model. We examined various one-sided hypotheses on µ with the one-sample Wilcoxon test, e.g. Formula 8, Formula 8, and Formula 8. The P-value for the test between the order-0 model and the context-0 model is 0.023; between the context-0 and the context-1, 0.338; and between the context-1 and the context-2, 0.002. Thus, the improvement of a Markov model of order 3 over the Markov model of order 0 is statistically significant at the level 0.05, as is the improvement of the context-2 model over the context-1 model. Our test datasets are all from eukaryotes, which might partially explain why these results contrast with a previous report based on datasets from prokaryotes (Hu et al., 2005). The previous study, however, took the performance coefficient as its measure of prediction accuracy, finding that differences between the predictive accuracy of an order-0 Markov model and higher-order Markov models (for example, order 3) were not obvious.

With this discrepancy in mind, we also evaluated our results using the performance coefficient [defined in Equation (8)]. Based on the average ranks of the performance coefficient, the context-2 model of order 3 predicted most successfully and the Markov model of order 0 predicted least successfully (Supplementary Table 1), as before. The statistical test showed that the context-2 model significantly outperformed other three models with a P-value of 0.034. Differences in the other comparisons were not as dramatic, however, probably because the performance coefficient produced more ties than correlation coefficient.

To check that the comparison of the different background models was stable under incidental perturbations of the test conditions, we increased the number of iteration steps in the Gibbs sampling, to guard against A-GLAM converging prematurely to a local rather than global maximum. Although, e.g. the correlation coefficient of the context-1 model on hm23r (0.2774) degraded to a negative value (–0.051) over longer sampling, results remained generally stable and confirmed the results from A-GLAM's default settings. Most perturbations had similar, occasional, inconsequential effects, with one notable exception. Because the test datasets contain experimentally determined binding sites, all known sites are present in the strands corresponding to sequences in the datasets. Accordingly, we restricted the search space to the strand given in the dataset (as opposed to both the given strand and its complementary strand), to take advantage of strand-specific information that might occasionally be available in a practical situation. The difference between the order-0 Markov model and the order 3 Markov models remained statistically significant (P-value at most 0.024), but differences among the context-0, context-1 and context-2 models of order 3 were not significant. Note, however, that whereas the results from the order-0 Markov model, the context-0 model and the context-1 model improved when the search was restricted to a single-stranded search, the result from the context-2 model did not improve. Although an entirely satisfactory explanation is lacking, possibly the context-2 model might have already reached a practical limit, leaving little room for improvement.


    4 DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 
This article presents a context-dependent higher-order Markov model, the ‘context-2’ model. We named the previous versions of the higher-order Markov model ‘context-0’ and ‘context-1’ model (Fig. 1). Intuitively, the context of a putative element should affect statistical inferences, under the following logic. In IUPAC notation (Suzuki et al., 2001), S denotes a base with strong pairing (G or C); and W, a base with weak pairing (A or T). Consider a hypothetical AT-rich (W-rich) motif embedded in a GC-rich (S-rich) regulatory region. In background GC-rich DNA, the transitions S -> W or W -> S at the motif boundaries are improbable. Our context-2 Markov model therefore sees the transitions from motif to background as evidence of a TBF element. In contrast, the context-0 Markov model discounts these transitions completely, and the context-1 Markov model ignores the transition on the right. Moreover, the asymmetry in the context-1 Markov model is somewhat artificial, because logically, the directionality of the Markov model should not affect the analysis. Moreover, although the context-0 model correctly lacks any inherent directionality, it ignores potentially useful sequence information outside a putative regulatory element.

In our hands, computation on the order-3 Markov models required little additional computer time or space, when compared to an independent background model. When A-GLAM was run to find the binding sites in test datasets, order-3 Markov models, e.g. only doubled the computation time of the background model of order 0. We could discern no appreciable difference in computation time for the different context-dependent Markov models.

Other articles implicitly present higher-order Markov background models as improvements over an order-0 Markov model. To the best of our knowledge, however, these articles do not give a controlled comparison specifically testing their Markov ‘improvement’. Here, non-parametric statistical tests evaluated potential improvements associated with higher-order Markov models, using as a ‘gold standard’ 49 test datasets where regulatory elements had been experimentally identified.

The correlation coefficient was our primary measure of the predictive accuracy of competing models. We also compared models with the performance coefficient, but it usually downplayed differences between models. To see why, assume that there are no true positives in a dataset. The corresponding performance coefficient is 0 [Equation (8)], whereas the correlation coefficient decreases from 0 as positives are predicted. For example, consider two hypothetical predictions with correlation coefficients of –0.1 and –0.2. On one hand, a correlation coefficient of –0.1 should be considered better than a correlation coefficient of –0.2, because the corresponding prediction has fewer false positives. On the other hand, both alignments have a performance coefficient of 0. In contrast to claims elsewhere (Hu et al., 2005), we conclude that as a measure of predictive accuracy, the correlation coefficient is more sensitive than the performance coefficient, and therefore superior for distinguishing predictive accuracies.

In our tests, the higher-order Markov models were superior to a background model of independent letters, the superiority of all three order-3 Markov models being statistically significant. The context-2 Markov was consistently superior to the other higher-order Markov models, although the statistical significance of the improvement occasionally disappeared as conditions of the testing varied. Because repeats are often composed of repeated one- or three-letter words, Markov models of order 3 should capture some of their essence. One might therefore expect that part of the superiority of the order-3 Markov models could be due to modeling repeats as part of the background. Our results did not bear that expectation out, however. The superiority of Markov background models was not as pronounced in human (where DNA has abundant repeats) as in some other species (Table 2).

Statistical tests showed the clear superiority of some models, but all correlation coefficients remained disquietingly low. A-GLAM's performance is comparable to any other motif-finding tool, however (Supplementary Table 2). The low coefficients therefore suggest at least two possibilities. First, (although perhaps unlikely) any tool presently available might be only slightly more accurate than a random guess. Second, the best standard test dataset available for evaluating motif prediction (Tompa et al., 2005) might not provide a good ‘gold standard’: its real biological sequences could contain large numbers of unidentified motifs, and experiments might not have accurately identified the boundaries of its known motifs. The fact that it is difficult to quantify the second possibility accurately indicates that the science of finding regulatory motifs is still very incomplete.


    Acknowledgments
 
The authors thank Sergey Sheetlin for helpful discussion. This research was supported by the Intramural Research Program of the National Library of Medicine at the National Institutes of Health. Funding to pay the Open Access charges was provided by the Intramural Research program of the National Library of Medicine at National Institutes of Health/DHHS.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: John Quackenbush

Received on July 24, 2006; revised on September 20, 2006; accepted on October 10, 2006

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 

    Bailey, T.L. and Elkan, C. (1995) Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Machine Learning J, . 21, 51–83.

    Frith, M.C., et al. (2004) Finding functional sequence elements by multiple local alignment. Nucleic Acids Res, . 32, 189–200[Abstract/Free Full Text].

    Hu, J., et al. (2005) Limitations and potentials of current motif discovery algorithms. Nucleic Acids Res, . 33, 4899–4913[Abstract/Free Full Text].

    Huang, H., et al. (2004) Determination of local statistical significance of patterns in Markov sequences with application to promoter element identification. J. Comput. Biol, . 11, 1–14[CrossRef][ISI][Medline].

    Hughes, J.D., et al. (2000) Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J. Mol. Biol, . 296, 1205–1214[CrossRef][ISI][Medline].

    Lawrence, C.E., et al. (1993) Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262, 208–214[Abstract/Free Full Text].

    Liu, J.S., et al. (1995) Bayesian models for multiple local sequence alignment and Gibbs sampling strategies. J. Am. Statist. Assoc, . 90, 1156–1169[CrossRef][ISI].

    Liu, X., et al. (2001) BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac. Symp. Biocomput, . 127–138.

    Marino-Ramirez, L., et al. (2004) Statistical analysis of over-represented words in human promoter sequences. Nucleic Acids Res, . 32, 949–958[Abstract/Free Full Text].

    Ohler, U. and Niemann, H. (2001) Identification and analysis of eukaryotic promoters: recent computational approaches. Trends Genet, . 17, 56–60[CrossRef][ISI][Medline].

    Pavesi, G., et al. (2004) Weeder Web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes. An algorithm for finding signals of unknown length in DNA sequences. Nucleic Acids Res, . 32, W199–W203[Abstract/Free Full Text].

    Sinha, S. and Tompa, M. (2002) Discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Res, . 30, 5549–5560[Abstract/Free Full Text].

    Suzuki, Y., et al. (2001) Identification and characterization of the potential promoter regions of 1031 kinds of human genes. Genome Res, . 11, 677–684[Abstract/Free Full Text].

    Tharakaraman, K., et al. (2005) Alignments anchored on genomic landmarks can aid in the identification of regulatory elements. Bioinformatics, 21, I440–I448.

    Thijs, G., et al. (2001) A higher-order background model improves the detection of promoter regulatory elements by Gibbs sampling. Bioinformatics, 17, 1113–1122[Abstract/Free Full Text].

    Tompa, M., et al. (2005) Assessing computational tools for the discovery of transcription factor binding sites. Nat. Biotechnol, . 23, 7–144[CrossRef][ISI][Medline].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrowOA All Versions of this Article:
22/23/2870    most recent
btl528v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Google Scholar
Right arrow Articles by Kim, N.-K.
Right arrow Articles by Spouge, J. L.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Kim, N.-K.
Right arrow Articles by Spouge, J. L.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?