Skip Navigation

Bioinformatics 2007 23(2):e156-e162; doi:10.1093/bioinformatics/btl319
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Noto, K.
Right arrow Articles by Craven, M.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Noto, K.
Right arrow Articles by Craven, M.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Machine Learning in Computational Biology

Learning probabilistic models of cis-regulatory modules that represent logical and spatial aspects

Keith Noto * and Mark Craven

Department of Computer Sciences and Department of Biostatistics and Medical Informatics, University of Wisconsin Madison, WI 53706, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 RESULTS
 4 CONCLUSION
 REFERENCES
 

Motivation: The process of transcription is controlled by systems of factors which bind in specific arrangements, called cis-regulatory modules (CRMs), in promoter regions. We present a discriminative learning algorithm which simultaneously learns the DNA binding site motifs as well as the logical structure and spatial aspects of CRMs.

Results: Our results on yeast datasets show better predictive accuracy than a current state-of-the-art approach on the same datasets. Our results on yeast, fly and human datasets show that the inclusion of logical and spatial aspects improves the predictive accuracy of our learned models.

Availability: Source code is available at http://www.cs.wisc.edu/~noto/crm

Contact: noto{at}cs.wisc.edu


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 RESULTS
 4 CONCLUSION
 REFERENCES
 
Eukaryotic gene transcription is controlled by multiple factors, which to bind to DNA in a specific arrangement in a gene's promoter region. This type of regulation system is called a cis-regulatory module (CRM). Such a module consists of (1) specific sequences of DNA called motifs to which transcription factors bind, (2) logical relationships between these sites and (3) spatial relationships between these sites. Three examples of logical relationships are as follows:

  • AND logic; multiple required binding sites,
  • OR logic; a set of motifs, any of which satisfies a binding site,
  • NOT logic; a binding site which must not appear in a promoter sequence.
Three examples of spatial relationships are as follows:
  • Order preference; e.g. binding site A appears upstream of B,
  • Distance preference; e.g. binding site A appears ~125 bp from B, or binding site A appears somewhere within 50 bp from the estimated start of transcription,
  • Strand preference; e.g. binding site A appears on the template DNA strand (as opposed to the transcribed strand).

Given (1) a set of positive DNA sequence examples thought to contain a hidden CRM, and (2) a set of negative DNA sequence examples thought not to contain the CRM, the task we consider is to learn a representation of a CRM which distinguishes between the positive and negative sequences.

Several methods have been proposed for this task (Aerts et al., 2003; Beer and Tavazoie, 2000; Keles et al., 2004; Segal and Sharan, 2004; Sinha et al., 2003; Zhou and Wong, 2004). These methods either learn motifs, or learn rich representations of the relationships between candidate motifs, but not both. For instance, the approach of Keles et al. (2004) employs a rich representation of the logical relationships between motifs. The approach of Beer and Tavazoie (2004) involves logical aspects and spatial constraints regarding order, distance and strand. However, these approaches finalize motifs during a pre-processing step, before logical and spatial aspects are considered. The approaches of Segal and Sharan (2004) and Zhou and Wong (2004) learn motifs, and they do represent that binding sites must appear relatively close together, but in a window of a fixed and predetermined size.

Since motifs are hidden in data, learning the relevant logical and spatial aspects may help to identify motifs which would not otherwise appear to be significantly overrepresented. None of the aforementioned approaches learn both binding site motifs as they learn the logical and spatial aspects of a CRM. The approach we present is, to our knowledge, the first learning algorithm which does so.

Another advantage of our approach is that it can take advantage of background knowledge. Learning CRMs is a difficult problem in part because training set sizes are often limited (for instance, they may consist of a few genes which are upregulated together) and because the relevant binding site motifs may appear anywhere in a large control region of a gene (tens of thousands of base pairs in higher eukaryotes). Since our approach is entirely probabilistic, a prior probability distribution over where binding sites are likely to be [e.g. from a set of multiple alignment conservation scores or from data concerning hypersensitive regions (Noble et al., 2005)] fits naturally into our approach.


    2 APPROACH
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 RESULTS
 4 CONCLUSION
 REFERENCES
 
2.1 CRM representation
Our representation of a CRM is illustrated in Figure 1. It can describe a logical relationship among binding site motifs (using AND, OR and NOT), each represented by the standard position weight matrix (PWM), and a set of probabilities representing the spatial aspects: order, distance and strand. Note that we make a distinction between binding sites and motifs, since because of OR logic, more than one motif may be associated with a given binding site. Also note that the pairwise distance distributions in Figure 1b refer to adjacent binding sites. Hence, the order of binding sites determines which distance distributions are relevant.


Figure 1
View larger version (11K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1 An example model in our CRM representation. (a) A logical structure of motifs consisting of a conjunction of binding sites (rounded boxes), each with a disjunction of motif PWMs (lettered boxes). Each binding site is associated with a strand preference (one number indicating the probability of binding to the template DNA strand). A binding site may be negated. (b) Each pair of binding sites is associated with an order preference (one number indicating the probability of one site being upstream of the other) and a distance distribution over the possible intermotif distances of two adjacent binding sites. Also, each binding site has order and distance preferences relative to a fixed point, such as the estimated transcription start site (TSS).

 
It is useful to think of an instantiation of our CRM representation as a hidden Markov model (HMM), such as the one shown in Figure 2, and each sequence as being generated by a path through this model. The logical structure of a CRM representation is reflected in the structure of the HMM. The submodel denoted ‘BG represents a fixed background distribution over DNA sequences of arbitrary length. We use a 5th-order HMM as the background submodel, which is trained on the appropriate DNA, such as promoter sequences in the organism being analyzed. A DNA sequence with a hidden CRM takes an ‘upper’ path through the model, and does the following: (1) probabilistically chooses an ordering of the binding sites, (2) generates DNA from a background distribution, at some point (3) chooses a PWM from which to generate the first binding site, (4) chooses a DNA strand, (5) generates DNA probabilistically from the PWM distribution, (6) generates more DNA from the background distribution, etc. The amount of sequence generated by the background distribution submodel between one binding site and the next depends on the probability of that distance, which is given by our model parameters (Figure 1b). This means that the HMM is a generalized HMM (Burge and Karlin, 1997), because the amount of sequence explained by the background state is not represented by transition probabilities alone. A DNA sequence without a hidden CRM takes the ‘lower’ path in the model, and it generates all DNA from the background distribution.


Figure 2
View larger version (12K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2 A DNA-generating hidden Markov model representing the logical structure shown in Figure 1 (inset). BG represents a fixed background submodel. Shaded lettered boxes represent PWMs (these are upside-down for reverse complement distributions). A sequence with the CRM takes the upper path, and generates DNA from PWMs and the background distribution. A sequence without a CRM takes the lower path, and generates all DNA from the background distribution. Note that several parameters are tied together. All instances of the background submodel are identical, and the submodel corresponding to each binding site appears multiple times for each possible binding site order.

 
Note that several parameters are tied together: all instances of the background submodel are identical, and the submodel corresponding to each binding site appears multiple times for each possible binding site order. Also note that the model illustrated in Figure 2 contains no negated binding sites (NOT logic). These cases are slightly different and are discussed in Section 2.5.

2.2 Learning structure
The task of learning one of our CRM models involves learning both the logical structure (such as the example shown in Figure 1a) and the parameters for a given structure (which is discussed in the next section). Given a structure and parameters, we evaluate a model by how well it classifies the training data (when there is a large number of training examples, we prefer to use held-aside tuning data for evaluation), according to a user-defined metric.

We search for the CRM logical structure using a best-first beam search (Mitchell, 1997). Our search operators are shown in Figure 3. We start with a structure containing a single motif. At each step in the search process, we apply each operator to the highest-scoring solution to obtain new logical structures. We learn the parameters for these structures, evaluate the models and repeat until some maximum number of structures has been evaluated.


Figure 3
View larger version (21K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3 Illustration of our search space operators. (a) An initial CRM logical structure. (b) The result of applying the AND operator to the initial structure. This introduces a new binding site with an untrained PWM X. (c) The result of the OR operator. The second binding site becomes a disjunction of the original PWM B and an untrained PWM Y. (d). The result of the NOT operator. This is the same as the AND operator, except the new binding site is negated.

 
Each application of our search operators adds a new, untrained, PWM. The model parameters we learn depend on the values in this PWM, so we initialize it by sampling from the training data.

2.3 Learning parameters
Given the logical structure of a candidate CRM model, we set the parameters,Formula, in an attempt to optimize

Formula 1(1)
where cx is the label (i.e. positive or negative), of a training example sequence x, and Vx is a prior probability distribution describing locations on x where binding sites are likely to occur.

To train our parameters, we use the discriminative learning approach of Krogh (1994). We calculate the expected number of times each parameter is used to explain our input sequences, and compare this with the expected number of times each parameter should be used (i.e. positive examples should always take the upper paths in Figure 2, and negative examples should always take the lower path). We iteratively update our parameters according to:

Formula 2(2)
where Formula 2 is the current value for the k-th parameter, mk is the expected number of times it is used in correct paths to generate the training sequences, nk is the expected number of times it is used in all paths, N is a normalization constant, and {eta} is the learning rate, which can be adjusted dynamically so that the parameters do not fall below zero.

After we update model parameters, we do the appropriate normalization and smoothing. For most of the parameters, smoothing is done with pseudocounts. For our distance distributions, we smooth each histogram with a Gaussian-shaped kernel with standard deviation Formula 2 for a sample size of n (John and Langley, 1995).

2.4 Efficient computation
To classify a sequence, x, we calculate the probability that x takes a positive path through the HMM. This is given by

Formula 3(3)
where {Pi}pos represents the set of positive paths through the HMM, and {Pi}all represents the set of all possible paths.

To understand how we compute (3), consider the example shown in Figure 4. Here, our model has two binding sites, A and B, as shown in Figure 4a. We consider each possible order separately, so assume the binding site order is fixed, with A upstream of B. Figure 4b. shows two possible locations for A and B on sequence x, which is of length L. To compute (3), we use a dynamic programming (DP) algorithm. {alpha} is a DP matrix, shown in Figure 4c. Each entry in the matrix, {alpha}(S,l) represents the likelihood of the subsequence of x, from location l to L, given that binding site S occurs at location l. The iterative update step in the DP algorithm for our two sites, A and B, with A at location i, is

Formula 4(4)
where xx...j represents the subsequence of x from i to j, wA and wB are the widths of motifs A and B, {Theta}A is the PWM for motif A, BG is the background distribution and {Theta}dist (A, B) is the probability distribution in {Theta} over the distance between A and B.


Figure 4
View larger version (17K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4 (a) A CRM logical structure. (b) Possible binding site locations on a DNA sequence x. (c) A dynamic programming matrix {alpha}, where {alpha}(A,i) represents the likelihood of sequence x from location i to L when site A occurs at location i. (d) A probability distribution over the locations of binding sites A and B, respectively. These probabilities tend to be extreme and high probabilities are sparsely distributed, which allows us to skip computation on all but the most likely values.

 
The run-time complexity of this calculation is O(L2) (for two or more binding sites), which is impractical for long DNA sequences such as mammalian promoters. To make this computation tractable, we take advantage of one key insight: a large proportion of the sequence likelihood often depends on a small proportion of the combinations of binding site locations (i.e. where the sequence DNA actually matches the PWM distributions). Therefore, returning to our example in Figure 4, we first scan the sequence for the most likely locations for A and B (shown in Figure 4d), and consider those in order of decreasing likelihood, until we have considered enough, e.g. 95% of the probability over all possible locations of both A and B. Figure 4c shows which cells (they are crossed out) in the DP matrix {alpha} that we ignore because they contribute very little to the sequence probability. In practice, this decreases run time substantially and speeds up parameter convergence. It is a necessary step in learning CRMs from the long promoter sequences we wish to consider.

We calculate the expected number of times a parameter is used in (2) with a similar calculation, except that we need an additional DP matrix for the upstream subsequences. The expected number of times a parameter is used is proportional to the sum of the sequence likelihood over the paths that use it.

2.5 Negated binding sites
Recall that our model structure allows for negated binding sites (NOT logic). In these cases, there are three groups of paths through the corresponding HMM, as shown in Figure 5. The correct paths for positive examples are still the upper paths, corresponding to the CRM model with negated binding sites removed. However, the ‘correct’ paths for the negative examples are divided among the lower (background) path, and the ‘middle’ paths (with negated sites), proportional to the likelihood that the example takes each path.


Figure 5
View larger version (11K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5 An HMM with three groups of paths representing a CRM structure with a negated binding site (inset). A negative example sequence should take either the background path, or the paths that include the negated site.

 

    3 RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 RESULTS
 4 CONCLUSION
 REFERENCES
 
We wish to test whether or not our approach is able to learn accurate CRM models. To this end, we run our approach on several datasets, using cross-validation to measure the predictive accuracy of our learned models.

For the datasets that we consider, we train a 5th-order HMM on promoter sequences in the same genome to use as the background distribution.

As a metric for scoring learned models during our logical structure search, we use a statistic called F1. Given a trained model, we probabilistically estimate how many positive examples were generated by positive paths through our model (true positives, tp), and through negative paths (false negatives, fn), and how many negative examples were generated by positive paths (false positives, fp). Precision is given by P = tp/tp + fp. Recall is given by R = tp/tp + fp. F1 is the harmonic mean of precision and recall, and is given by F1 = 2PR/P + R.

3.1 Evaluating predictive accuracy
In order to test our algorithm's effectiveness in identifying CRMs, we compare our approach to that of Segal and Sharan (2004) on the same datasets. We recreated 25 yeast datasets where each gene in a given set has evidence (P-value < 0.001) from the genome-wide analysis of Lee et al. (2002) that two particular proteins bind somewhere in its upstream region. For each dataset, we use 500 bp promoter sequences, and choose 100 yeast promoter sequences at random to use as negative examples.

To show that the predictions of our learned models on held-aside data are more accurate than could be obtained by chance, we compute a classification margin (following Segal and Sharan) which is the largest difference between the true positive rate and the false positive rate as a threshold is varied on what is called a positive example. If there is <1% chance that a randomly-labeled test set with the same cardinality of positive and negative examples would have the classification margin of one of our test sets (or a higher one), then we consider this result statistically significant.

Our results are shown in Table 1. We find significant results in 21 of 25 datasets, compared to 12 of 25 found by the approach developed by Segal and Sharan.


View this table:
[in this window]
[in a new window]

 
Table 1 Results of finding CRMs in 25 yeast datasets from Lee et al. (2002)

 
Recall that we train our models using a discriminative approach. Our experiments show that our learner is more accurate than models learned using a standard, generative training approach. Of the 30 datasets mentioned in this section, the discriminative method finds more accurate models for 20 of them, especially on the yeast datasets described in the next section.

3.2 Evaluating the effectiveness of logical and spatial aspects
In order to evaluate whether the logical structure and spatial aspects of our representation improve the ability of our learner to recover CRMs, we compare our approach to a restricted version wherein we do not allow logical structure beyond AND, and all our spatial probabilities are fixed by a uniform distribution. That is, the model space of this restricted version is simply a conjunction of motifs which may appear in any order, in any location, and so we refer to it as the ‘bag-of-motifs’ approach. The classification margin is higher using our approach than using the bag-of-motifs approach on 16 of the 25 Lee et al. datasets described above.

We test our approach on four additional datasets from both yeast and fly, for which we obtain promoter sequences from genes known to be co-regulated. Table 2 describes these datasets and includes a classification margin and P-value showing that we find statistically significant CRMs in all four datasets. Since there is a large discrepancy between the number of positive and negative examples in these datasets, we create precision–recall (PR) curves, which show the tradeoff between precision and recall over classification thresholds. These results are shown in Figure 6.


Figure 6
View larger version (12K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 6 Precision–recall curves from four datasets described in Table 2. The accuracy of our models dominates that of the bag-of-motifs approach over almost all of the recall space in these datasets.

 


View this table:
[in this window]
[in a new window]

 
Table 2 Descriptions and classification margins of four datasets we use to test the effectiveness of our representation's logical structure and spatial aspects (p-value < 0.01 shown in bold.)

 
In each case, the PR curve for our model dominates the PR curve for the bag-of-motifs model over all or almost all of the recall space.

The yeast ESR PAC/RRPE genes described in Table 2 contain two known elements in their upstream regions, the PAC element (consensus sequence GCGATGAG) and the RRPE element (consensus sequence AAAAAwTTTTT). Figure 7 shows the hypothesis CRM model learned by our approach (Figure 7a and b) and that of the bag-of-motifs approach (Figure 7c), when trained on the entire dataset.


Figure 7
View larger version (12K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 7 (a, b) The hypothesis CRM model learned by our algorithm for the yeast PAC/RRPE dataset (Table 2). Both the PAC element (consensus sequence GCGATGAG) and the RRPE element (consensus sequence AAAAAwTTTTT) appear as overrepresented in our learned PWMs. (c) The model learned by the bag-of-motifs approach on the same dataset.

 
The PWMs recovered by our algorithm are shown in Figure 7a as sequence logos (Crooks et al., 2004), which show a high amount of overlap with the known consensus sequences. The bag-of-motifs approach did not recover the PAC element. This example illustrates how the inclusion of spatial preferences in the representation can aid the learner in finding better motif models. Moreover, the inclusion of these aspects leads to more accurate classifications even when the ‘right’ motifs have been learned.

3.3 CRMs in human
In order to determine whether or not our approach can be effective in finding CRMs in DNA sequences in more complex organisms than yeast and fly, we test our approach on several human promoter sequences annotated with GO term 3677 (DNA binding proteins).

This set consists of 95 positive examples of 4000 bp regions that have evidence of being bound by a transcription factor called TAF1, and 284 negative examples with evidence of not having a TAF1 binding site (these data were obtained from the Thomson lab at the University of Wisconsin; unpublished data).

The classification margin we obtain from this dataset is 0.220 (P-value = 8.9e – 4). The comparison of precision and recall with the bag-of-motifs approach is shown in Figure 8.


Figure 8
View larger version (12K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 8 The precision–recall curve for our approach compared with the bag-of-motifs approach on human DNA sequences.

 

    4 CONCLUSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 RESULTS
 4 CONCLUSION
 REFERENCES
 
We have presented a probabilistic learning algorithm which is capable of learning multiple motifs and rich representations of logical and spatial relationships among them simultaneously. Our models can be thought of as generalized HMMs, but they are specifically designed to represent aspects of CRMs. We learn their structure by searching for the logical structure of the underlying CRM, and our representation is compact because of extensive parameter sharing. We have also presented a learning algorithm to train these HMMs, which uses a heuristic approach to make it efficient enough to learn from mammalian sequence data.

We have shown that our motif learner performs better than a current state-of-the-art approach on the 25 yeast datasets from Lee et al. and we have shown that learning information about the logical structure and spatial aspects of a CRM helps our learner find better models on five datasets, as measured by predictive accuracy.


    Acknowledgments
 
The authors would like to thank Audrey Gasch and James Thomson and their groups for help with the datasets. This research was supported in part by NIH/NLM training grant 5T15LM005359 and NSF grant IIS-0093016.


    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 RESULTS
 4 CONCLUSION
 REFERENCES
 

    Aerts, S., et al. (2003) Computational detection of cis-regulatory modules. Bioinformatics, 19, 5–14.

    Beer, M.A. and Tavazoie, S. (2004) Predicting gene expression from sequence. Cell, 117, 185–198[CrossRef][ISI][Medline].

    Burge, C. and Karlin, S. (1997) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol, . 268, 78–94[CrossRef][ISI][Medline].

    Crooks, G.E. (2004) WebLogo: a sequence logo generator. Genome Res, . 14, 1188–1190[Abstract/Free Full Text].

    John, G.H. and Langley, P. (1995) Estimating continuous distributions in Bayesian classifiers. Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pp. 338–345 Morgan Kaufmann.

    Keles, S., et al. (2004) Regulatory motif finding by logic regression. Bioinformatics, 20, 2799–2811[Abstract/Free Full Text].

    Krogh, A. (1994) Hidden Markov models for labeled sequences. Proceedings of the 12th IAPR International Conference on Pattern Recognition IEEE Computer Society Press, pp. 140–144.

    Lee, T.I., et al. (2002) Transcriptional regulatory networks in Saccharomyces cerevisiae. Science, 298, 799–804[Abstract/Free Full Text].

    Mitchell, T.M. Machine Learning, (1997) , New York McGraw-Hill.

    Noble, W.S., et al. (2005) Predicting the in vivo signature of human gene regulatory sequences. Bioinformatics, 21, i338–i343[Abstract].

    Segal, E. and Sharan, R. (2004) A discriminative model for identifying spatial cis-regulatory modules. Proceedings of the Eighth Annual International Conference on Computational Molecular Biology (RECOMB), San Diego, California, USA. ACM Press, pp. 141–149.

    Sinha, S., et al. (2003) A probabilistic method to detect regulatory modules. Bioinformatics, 19, 292–301.

    Zhou, Q. and Wong, W.H. (2004) CisModule: De novo discovery of cis-regulatory modules by hierarchical mixture modeling. Proc. Natl Acad. Sci. USA, 101, 12114–12119[Abstract/Free Full Text].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Noto, K.
Right arrow Articles by Craven, M.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Noto, K.
Right arrow Articles by Craven, M.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?