Bioinformatics Advance Access originally published online on June 5, 2008
Bioinformatics 2008 24(15):1669-1675; doi:10.1093/bioinformatics/btn254
Modeling promoter grammars with evolving hidden Markov models

The Bioinformatics Centre, Department of Biology & Biotech Research and Innovation Centre, University of Copenhagen, Ole Maaloes Vej 5, 2200 Copenhagen N, Denmark
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Describing and modeling biological features of eukaryotic promoters remains an important and challenging problem within computational biology. The promoters of higher eukaryotes in particular display a wide variation in regulatory features, which are difficult to model. Often several factors are involved in the regulation of a set of co-regulated genes. If so, promoters can be modeled with connected regulatory features, where the network of connections is characteristic for a particular mode of regulation.
Results: With the goal of automatically deciphering such regulatory structures, we present a method that iteratively evolves an ensemble of regulatory grammars using a hidden Markov Model (HMM) architecture composed of interconnected blocks representing transcription factor binding sites (TFBSs) and background regions of promoter sequences. The ensemble approach reduces the risk of overfitting and generally improves performance. We apply this method to identify TFBSs and to classify promoters preferentially expressed in macrophages, where it outperforms other methods due to the increased predictive power given by the grammar.
Availability: The software and the datasets are available from http://modem.ucsd.edu/won/eHMM.tar.gz
Contact: krogh{at}binf.ku.dk
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
|
|
|---|
One of the fundamental challenges in computational biology is to decipher the signals underlying transcriptional regulation (Stormo, 2000; Wasserman and Sandelin, 2004). The goal of this is 2-fold:not only to minimize the number of experiments necessary in the laboratory, but also to understand the general mechanism underlying the precise selection of expressed genomic loci. Transcription of a typical gene by RNA Polymerase II is directed by DNA sequence signals, which are bound by specific proteins: transcription factors (TFs). These transcription factor binding sites (TFBSs) are often located in the region around the transcription initiation site (Smale and Kadonaga, 2003). The TFBSs for a given TF usually show a constrained pattern of nucleotides, which can be represented by a position-specific scoring matrix (PSSM) (Stormo, 2000; Wasserman and Sandelin, 2004). While such PSSMs are adequate predictors of sites bound in vitro, the information content in the model is too small to make meaningful predictions at genomic scales. The binding preference of a TF alone is not sufficient to find its cognate functional sites. One of the principal methods used to solve this problem is to find combinations of sites. Such combinations of TFBS are known as cis-regulatory modules, which can direct tissue-specific expression of genes (Stormo, 2000; Wasserman and Fickett, 1998).
In higher eukaryotes, it is challenging to model the interaction between TFs and TFBSs, as the location, composition and number of TFBSs varies greatly even in genes having similar expression patterns. A bottleneck in this type of analysis is the lack of data: there are only a handful of gene sets where the number of experimentally defined sites is sufficient for direct training of a predictive model. On the other hand, genome-wide data on both promoter locations (Carninci et al., 2006) and expression patterns (Su et al., 2002) are available. Thus, a commonly occurring situation in experimental biology is that a set of genes are found to be co-expressed, leading to the hypothesis that they are co-regulated or at least share some regulatory features. In some of these cases, there are experimental indications on what type of features those might be (for instance, some particular TFs might be suspected to be involved in the regulation of most genes within the set). Many algorithms aimed at analyzing data originating from this type of situation have been presented. Most have focused on identifying regions in which predicted sites co-occur (Berman et al., 2002; Markstein et al., 2002; Rebeiz et al., 2002). Others have suggested probabilistic models (Crowley et al., 1997; Frith et al., 2001, 2002, 2003; Rajewsky et al., 2002). Of particular interest are Cister and COMET, designed using two parallel strings of hidden Markov model (HMM), which represent forward and backward reading of a PSSM (Frith et al., 2001, 2002). Background states are used to model spacing between PSSMs. Comet calculates E-values considering the score from the HMM and the gap between the putative TFs. A more recent program, Cluster-Buster, employed a quadratic-time algorithm to find clusters of pre-specified motifs in nucleotide sequences (Frith et al., 2003). Similarly, Stubb is a program that uses correlation between motifs to model the coordinated TFBSs and also incorporate phylogenetic information (Sinha et al., 2003). Given a reasonable set of PSSMs as input, the applications using probabilistic models have adequate performance in the sense that they can successfully locate known clusters of motifs.
However, these approaches do not consider the internal structure of the co-occurring TFBSs, just the proximity or co-localization of motifs. This is the motivation for our study: we focus on how to model the promoters of such a group of genes both to gain understanding of the mechanism of regulation and to be able to search for other genes that are regulated the same way. This can be viewed as an extension of the above approaches—not only do we want to find cis-regulatory modules; but also to model the grammar of the promoters, by modeling both the binding sites, the order in which they appear, and the regions in between them. Thus, for a set of genes which are hypothesized to have similar promoter structures, we make a model by combining known PSSMs with models of the regions in between them: in effect, an HMM architecture based on smaller blocks which in turn also are HMMs. We use genetic algorithms (GAs) to automatically optimize the HMM architecture, starting from a simple network. This is similar to another evolutionary method for motif discovery, EMCMODULE (Gupta and Liu, 2005). Using an evolutionary Monte Carlo method, EMCMODULE updates the motifs and their locations in the sequences in each genetic cycle. It also has some similarity to Stubb (Sinha et al., 2003), which uses intercorrelations of motifs by constructing a history-conscious HMM, where a previous non-background motif is remembered.
In this work, we search for multiple models, each of which represents one aspect of the underlying grammar. The evolving strategy searches for possible motif grammars, while transforming the internal representation of the context. We use an evolutionary algorithm to optimize an HMM structure (Won et al., 2007). The structure learning method has been successfully applied to prediction of protein secondary structure (Won et al., 2007) and prokaryotic promoters (Won et al., 2006). The method we present here differs from the previous work in that we use a special block designed to model a PSSM. We also present a way of designing a promoter classifier using PSSM blocks.
Below, we show that our method can find dense clusters of sites with performance equal to or better than other methods, and can successfully reconstruct known regulatory grammars at the same time. We also show that our method can be used as a classifier, where the modeled grammar gives additional predictive power compared to other approaches.
| 2 METHODS |
|---|
|
|
|---|
2.1 Evolving an HMM
We present an evolutionary strategy that evolves the structure of an HMM. The search space is restricted to interconnections of static sub-structures called blocks, which correspond to regulatory features or spacers (Won et al., 2006). Such evolved HMMs we call e-HMMs. Four types of blocks were used (Fig. 1): linear block, self-loop block, forward-jump blocks and zero blocks. A linear block consists of N states, where each state only has a transition to the next state. These linear blocks are used to model sites with a certain length (similar to a PSSM). Self-loop blocks are linear blocks in which each state has an additional transition to itself. They are used to represent a background distribution and model the length distribution between other blocks. A forward-jump block is a linear block where the first state is also connected to the last M states (with 1
M <N) to model spacers of varying lengths up to a (small) maximum length. Zero blocks are empty blocks with no states, which just place holders in the model structure and may change to another type of block at some point (through mutation or crossover). In addition to the four types of blocks suggested in our previous work, we use a type that represents forward and backward reading of a PSSM to model TFBSs (Fig. 1e), since functional TFBS generally can occur on both strands. Each end of the PSSM block is a silent state that does not emit any symbols, but has transitions to other states and these transitions can capture possible directional preferences of the binding sites. Inside a PSSM block the emission probabilities are set according to the base frequencies of a known PSSM with a pseudo count of one, and these probabilities are fixed. The number of states in a PSSM block is 2q+2, where q is the length of the corresponding PSSM.
|
All block types except the PSSM blocks are called background blocks as they are used to model background DNA sequence or spacing between TFBSs.
An HMM consists of a fixed (preset) number of blocks, some of which may be zero blocks. Initially, block types are randomly allocated and the blocks are fully connected to form a complete HMM. In this context, fully connected means that the end state of each (non-zero) block is connected to the start states of all other blocks and itself. An example of an HMM composed of three blocks is shown in Figure 2. States in background blocks are tied within blocks, meaning that their emission probabilities of the states inside a block are identical at all times. Thus, in the example depicted inFigure 2, the first three states are tied, and so are the last three states, but emission probabilities of the states in the first and the last block are different.
|
The method requires a set of PSSMs that are believed to be overrepresented in the input set of sequences. The aim of the method is then to find an optimal HMM architecture containing these PSSM blocks as well as evolving spacer blocks that explains the sequences. To do this, the genetic algorithm starts from a number of randomly chosen blocks to form a population. Inside the genetic cycle, genetic operators select members of the population (called parents) and evolve them to produce new members (called children). Children and parents are evaluated using a fitness function based on the likelihood. According to the fitness function, the selection procedure selects a number of members in the population for the next genetic cycle. Genetic operators (crossover, mutation) are applied to evolve the structure while retaining the properties of the blocks. Crossover operations swap a number of blocks in two parents to create two children. Mutations change the number of transitions or states in the children. In each cycle, the best member in the population is stored. See Supplementary Material for more detail.
2.2 Genetic operators and training of e-HMMs
Genetic algorithms evolve a population of solutions with genetic operators. Two genetic operators are used: crossover and mutation (Won et al., 2006). In crossover, two members are chosen through the selection procedure described subsequently. A number of blocks are randomly selected and swapped to create two children. The number of blocks is constant, but the number of the states of an HMM can be changed. Figure S1 illustrates an example of the crossover scheme. Mutations can either change the block type or the state structure of a block. In a block-type mutation, another type is selected at random. Mutations to the state structure can happen inside any block of the HMM except the PSSM blocks (Fig. S2). They change the number of states and the number of transitions in a block by randomly inserting or deleting a state. By applying these operators, the e-HMM evolves the model. To obtain suitable HMM architectures we tested various numbers of blocks between 25 and 40. Table S1 shows parameters used in the simulation. We have used a hybrid GA with traditional genetic operators to explore the space of HMM topologies in combination with Baum–Welch optimization of the transition and emission probabilities. After a number of iterations, most of the initial transitions converge to zero. The remaining transitions decide the grammar of the evolved HMM. The log likelihood of a model also tells us how well it fits the data. Given an HMM, we calculated the fitness value using the Akaike Information Criterion (AIC) (Akaike, 1973) which balances the likelihood and the model complexity (the number of parameters). The fitness value is
|
| (1) |
µ of the population. The number of free parameters in the HMM is called fµ, and the parameter
balances the likelihood and the complexity of the HMM.
A member of the population is selected with the Boltzmann probability
|
| (2) |
is the SD of the fitness in the population and s is a constant that controls the strength of the selection. In the work reported here, we used a value of s equal to 0.3 and
equal to 0.5.
2.3 Parameter training using PSSM scores
The evolved model is trained again considering the PSSM matches in a sequence and the distances to other matches. It is likely that a putative binding site with high matrix score and located close to other binding sites is a true binding site. To train with this additional biological information, we adopted the method for including database matches previously used for gene detection in Drosophila (Krogh, 2000). We assigned a probability distribution over labels to each base in the sequence based on the PSSM score and the distance to other candidate TFBSs. A label is associated with a TF or background. An HMM state emits only a single label. The training algorithm with labels calculates a path where a state label matches with a sequence label. By assigning multiple labels to a sequence, multiple PSSM blocks or background can have a path through each base in the sequence. First, putative binding sites are obtained by considering the PSSM score.
|
| (3) |
·z, where z is a normalization factor and s
is a pseudo count. Pseudo counts are used in order to avoid label probabilities of 0. Table S2 gives the value before normalization assigned to each label based on its PSSM score. The normalized distribution is scaled using a confidence. Motifs are usually found clustered in real data. If any other putative binding sites are found within a certain distance, a confidence of 1 is used, and otherwise 0.1. We observed the performance with various distances D along with the decoding method. The scaled probability of the k-th label at position l is
|
| (4) |
=(
1,
2,...,
L) and a label y=(y1, y2, ..., yL) in an HMM is
|
| (5) |
l) is a label in the state
l.
is the Kronecker delta function. It is 1 if yl=c(
l), and 0, otherwise. These probabilities are multiplied along a path, so the probability of not using a path with high label probabilities is heavily penalized. See Supplementary Material for more details.
2.4 Decoding e-HMMS and interpreting e-HMM ensembles
In each iteration of the genetic algorithm, new models are estimated. Many of these models are equally good, and choosing one best model in the end rarely yields the best results. Therefore we use an ensemble of models to analyze the data, which is a well-known method for limiting overfitting (Riis and Krogh, 1996). Depending on the problem, the way of decoding the e-HMM and interpreting an ensemble of e-HMMs differs. To find putative binding sites we used posterior state probabilities to calculate the most probable states for the given sequences (Durbin et al., 1999). For each position in a sequence, the state with the highest probability is chosen, and if it is a state in a PSSM block, the position is predicted to be a binding site for the corresponding transcription factor. Associated with each site is a score of the sequence given a PSSM (3). If a predicted site has a score below a chosen cut-off, it is discarded. Among the remained sites we only counted predicted sites that clustered. Any two sites were regarded as clustered if the gap between them was smaller than a certain size (D). We checked the performance while varying D from 80 to infinity. When D is infinite, all predicted sites above the cut-off are kept. We add up the number of times each site is predicted in an iteration of the GA. If a site is predicted as a specific motif 10 times in 15 iterations, the ensemble method gives a score of 10/15 to the site. Consequently, this method locates signals surviving for a long period of genetic permutations.
2.5 Constructing a classifier
To design a classifier using e-HMMs we calculated the log-odds ratio
|
| (6) |
is a set of HMM parameters. The positive model (
+) is an evolved e-HMM. To design the negative model (
–) we took all PSSM blocks out of the positive model and used only background blocks. In the example shown in Figure 2, the negative model becomes a 2-block HMM, without the PSSM block in the middle. The negative model is trained using the Baum–Welch algorithm with a set of negative sequences believed not to share the same grammar or sites as the positive set. For an ensemble of N models, we used the averaged log-odds ratios for classification:
|
| (7) |
2.6 Dataset construction
2.6.1 Muscle-specific sequence set
We used the muscle-specific promoter sequences as downloaded in the 24.nonaligned.pos.train.fa.tar.gz file at http://bayesweb.wadsworth.org/gibbs/module/ and extracted a 1000–1500 nt region from each included sequence that contained all the annotated TFBS. The sequences are mostly 1000 nt long; in the few cases when this span did not cover all TFBS, a 1500 bp span was used.
2.6.2 Macrophage-specific sequence set
For the macrophage promoter set construction, we used cap analysis of gene expression (CAGE)-derived promoters. CAGE is a method for sequencing the first 20–21 nt of full-length cDNAs. A unique strength of the method is nucleotide resolution Transcription Start Site (TSS) detection coupled with a data depth enabling measuring the tissue specificity of core promoters and individual TSSs. CAGE tags were sequenced and mapped to the mouse genome (assembly MM5) as described in (Carninci et al., 2006). A total of 11 567 973 CAGE tags were sequenced from more than 20 tissues, using 144 distinct CAGE libraries. Tag clusters (tags mapped to the genome that overlap with at least 1 nt on the same strand) with more than 30 tags from the libraries studied were used. We focused our analysis on tags from libraries derived from lipopolysaccharide (LPS) induced bone marrow macrophages (Library IDs: CBV, CBW,CBX, CBY, CBZ, CCA, CCB,CDR, CDS, CDT, CDU, CDW, CDY) and used remaining tags for assessing how constrained expression was to macrophage regions. In total, 15 717 tag clusters were analyzed. If a tag cluster had more than 60% of tags from the LPS-induced set (correcting for sample size), it was considered to be LPS-induced specific, otherwise the cluster was labeled negative. For each cluster, a sequence region of –300 to + 50 in relation to the most used TSS in the cluster was extracted from the mm5 genome assembly for the testing. This resulted in a positive set of 503 LPS-induced promoters and 15 314 negative promoters. We conducted a 3-fold cross-validation with this set of sequences.
| 3 RESULTS |
|---|
|
|
|---|
3.1 The e-HMM performance
For testing the method, we constructed three test sets aimed to assess the performance of our e-HMMs to both classify promoters and find motifs;
- Two artificial sets, where we first assess how well our method can find implanted motifs compared to other methods, and then whether it can rediscover the artificially constructed promoter grammar.
- A well-annotated experimental set of muscle-specific promoters, where we assess how well-known motif clusters can be found compared with two other methods, and present some insights into the underlying regulatory grammar in these promoters.
- An unannotated set of core promoters preferentially expressed in macrophage cells involved in immunological response, where we test the ability to classify promoters with respect to tissue specificity.
3.1.1 Reconstruction of a known grammar
We generated 120 artificial sequences using an HMM with five muscle specific PSSM (Wasserman and Fickett, 1998) blocks and some background blocks. This is equivalent to knowing the underlying grammar in these sequences. An important caveat with this study is that all the sequences were constructed with the same grammar, so there are no other grammars present. The generated sequences have 501 TFBSs (121 MYODs, 120 MEFs, 120 SRFs, 79 SP1s and 61 TEFs). For PSSMs we used V_$MYOD_Q6_01, V_$SRF_Q4, V_$MEF2_02, V_$TEF_Q6 and V_$SP1_Q6 PSSMs from the TRANSFAC database (Matys et al., 2003).
Starting with 25 blocks, we evolved a population of HMMs using the artificial sequences and the same PSSMs without any prior information on the HMM that generated the sequences.
First, we assessed how well the e-HMMs predict locations of motifs. Figure 3 compares the result of the e-HMM, COMET and Cluster-Buster. COMET and Cluster-Buster are TFBS predictors given the sequence and PSSMs. We ran COMET and Cluster-Buster while varying E-value and cluster threshold, respectively, from 0 until it has a maximum number of predictions.
|
We counted the number of correctly predicted TFBSs (TP) and incorrectly predicted TFBSs (FP) while changing the cutoff value of Cluster-Buster and COMET. As our algorithm is not deterministic, the performance of the e-HMMs fluctuates. Nevertheless, the e-HMMs perform substantially better than both Cluster-Buster and COMET on this set. Next, we investigated to what degree the evolved HMMs can reconstruct the original grammar (this is equivalent to the similarity of the HMM that constructed the artificial sequences). Figure 4a shows the HMM used for construction of the artificial sequences, whileFigure 4b shows the corresponding remodeled directed graph acquired after decoding the sequence through one of the evolved HMMs. This particular model found 438 TPs among 607 predictions (TPs+FPs). The figure shows that the evolved HMM approximates the original (correct) grammar quite well. For example, the transitions from TEF are modeled well with high transition probabilities to end and SP1_1, and other transitions have P<0.1. Thus, in this example of a known grammar, with no additional grammars present, our model is better at finding the locations of the grammar elements and can reconstruct the grammar at the same time. In Supplementary Material, we describe a more generalized test using a graph where all TFs are connected, and transitions are randomly assigned (Fig. S6, Table S4 and Supplementary Material); the correlation between assigned and predicted transition probabilities is generally successful.
|
3.1.2 Analysis of an annotated set
Above, we observed that the HMM method has the power to model the existing grammar of motifs. The grammar of real data is composed of many rules—either variants of the same grammar or multiple grammars that are different. To model this, rules that are evolved in each stage are collected and the most conserved (stable) grammar during the artificial evolution are evaluated (see Section 2).
To assess the performance with real data, we used the 48 human and mouse sequences (Wasserman et al., 2000) containing 166 experimentally verified TFBSs from muscle-specific promoters. It is commonly assumed that these sequences share a transcriptional logic. For this simulation we used five TF models: MYF, MEF2, SRF, TEF and SP1 from the JASPAR database. As above, we compared the result of COMET, Cluster-Buster and our method to first see if the known TFBSs can be detected, and then assess the hypothesized underlying grammar. Figure 5 shows the correct predictions (TP) versus the total number of predictions (TP+FP) of COMET, Cluster-Buster and e-HMMs. The performance of individual e-HMMs (Fig. 5) is consistently better than Cluster-Buster and overall, the performance of individual e-HMMs is comparable to that of COMET. As discussed above, our algorithm can also construct an ensemble of e-HMMs to include prediction information from individual e-HMMs(see Section 2). The ensemble models perform better than individual e-HMMs in almost all instances, and is generally closer to the performance of a perfect predictor, although in the range of 60–150 total predictions (x-axis), the COMET method has comparable performance (Fig. S3). When the number of predictions is larger (>150), the ensemble method outperformed the other methods. Figure S4 compares the receiver operator characteristic (ROC) curve of the three methods: the eHMM method performed better than the other methods.
|
For clarity, the above test only measures if the known TFBS are found, and not the underlying grammar. The HMM structure is only used to weight sites higher if they cluster—we explore the effect of changing these settings in the Supplementary Material. As a summary, the e-HMMs are generally comparable or better than COMET and Cluster-Buster for finding the known sites.
However, this is not the main attraction with the model: what sets e-HMMs apart from the other methods is that we can assess the grammar to form a hypothesis of the underlying biology. Table S3 shows strong edges between motifs in two evolved HMMs (resulting from iterations 100 and 120, respectively). We counted the number of grammars of the HMM at the 100th and 120th iteration and found 78 grammar and 72 grammar, respectively, after decoding. Among the grammar MYF
MEF is well preserved. SP1 is not frequently found in the HMM at 100th iteration, but new rules from SP1 are emerged during the evolution, as new edges from SP1 are present in the HMM from the 120th iteration. Conversely, the TEF
MEF edge decreased significantly.
3.1.3 Classification of macrophage-specific core promoters
Most interesting datasets are not as thoroughly analyzed and annotated as the muscle set that we analyze above. In genomics, we often face the situation where some promoters are shown to be active under some stimuli, and others are not. In general, in these situations we have no knowledge about the functional sites in these promoters, although often some hypothesis exists regarding the identity of the transcription factors involved (so, we might have the identity of the factors but not their sites nor the regulatory grammar). This type of situation is a case where we can use e-HMMs as classifiers, if it can find grammars that distinguish two sets of promoters from one another. To test this, we choose a large, novel biological promoter set where we have some indications of which TFs might be determinants of tissue-specific expression, but not the details of its mode of regulation. Specifically, we extracted promoters preferentially expressed in mouse macrophages induced by LPS, as measured by CAGE data presented in (Carninci et al., 2006) (See Section 2). LPS treatment provokes a drastic expression response similar to that of in vivo immunological response (Nilsson et al., 2006). To train the e-HMMs, we selected PSSMs among the TFs presented in Carninci et al. (2006), which are likely determinants in this set. We chose four TFs whose binding preferences are not similar: IRF, PU.1, SOX-9 and c-Ets-2 [from TRANSFAC (Matys et al., 2003)]. We also included NF-kappaB [JASPAR (Bryne et al., 2007) ID MA0061], as this factor is expected to play a major role in immunological response (Bonizzi and Karin, 2004). The positive model for the classifier was designed automatically by evolving e-HMMs with the positive training set. To run the genetic algorithm, we generated 30 HMMs to construct a population where each HMM was composed of 40 blocks including zero blocks and PSSM blocks. The negative model was derived from the positive model by taking PSSM blocks out and performing parameter training with the negative set, as described previously. Figure 6 shows a ROC curve plotting sensitivity (=TP/(TP+FN)) against 1-specificity (=TN/(FP+TN)) of the 3-fold cross-validation of the three e-HMM classifiers. To test the evolved model, we selected HMMs after some evolutionary iterations. The three HMMs are acquired from the 80th, 90th and 100th iteration of the simulation. We also tested ensemble of these three HMMs. For comparison, we also applied Cluster-Buster and COMET to identify motif clusters given the same PSSMs. Also, we compared our method with the performance of a classifier using support vector machines (SVMs) (Vapnik, 1995). Using these results, we checked if they can classify the macrophage dataset well by counting the number of predicted sites as in the test by Chowdhary et al. (2006). For COMET and Cluster-Buster we regarded a sequence as an induced macrophage promoter if any cluster prediction was made on the sequence. Changes in the cut-off values were used to derive the ROC plots as above. Cluster-Buster performed better than COMET in this test, but had lower sensitivity than the e-HMM classifiers for the whole range of specificity values. The SVM classifier in this test had the worst performance.
|
To investigate which grammar the HMMs have found, we decoded the sequence with the evolved HMMs. Table S5 lists the grammar the three HMMs found.
| 4 DISCUSSION |
|---|
|
|
|---|
In this study, we present a method that models co-occurring motifs in a set of promoters. A wealth of methods taking a set of PSSMs to classify gene sets have been published (Frith et al., 2001, 2002, 2003; Gupta and Liu, 2005; Rajewsky et al., 2002; Rebeiz et al., 2002; Sharan and Myers, 2005; Sinha et al., 2003; Wasserman and Fickett, 1998). The methods range from relatively simple statistical frameworks such as logistic regression (Wasserman and Fickett, 1998) to general classifiers based on SVM (Sharan and Myers, 2005) or other more complex methods. However, none of these methods are aimed at describing the inter-dependencies of sites. In our approach, using blocks derived from known TFBSs as well as blocks modeling the intermediate regions, we build HMM architectures describing the promoters using an evolving model. Thus, we can model the grammar of the regulation, i.e. the order of transcription factors, the distance between them, the composition of intermediate regions, etc. This has two immediate advantages. First, it gives increased predictive power for classifying sequences. Second, it represents a step towards more realistic descriptions of promoter architecture and function, which is necessary for understanding the complex transcription process. An important issue is that the grammar obtained by e-HMMs may not be the optimal or the simplest one because e-HMMs search for the rules in the sequence set heuristically. Thus, it represents a hypothesis that to some degree can explain the data, and these types of hypothesis can give a starting point for experimental trials.
A cis-regulatory structure learning algorithm using HMMs was presented by Noto and Craven (2007), in which they tried to build logical relationships among binding sites. Starting from a structure with a single motif, they expanded their logics. Our structure learning method is different in that we search for a whole logic using a genetic algorithm. A greedy approach to infer the cis-regulatory logic of transcriptional network was suggested using DNA sequence and mRNA expression data (Beer and Tavazoie, 2004). They used a Bayesian network that models combinatorial regulatory rules to predict gene expression pattern of Saccharomyces cerevisiae. While the model in this work share some similarity with ours, the study is different as they tried to learn gene expression patterns, which we do not (so, the gene expression data is part of the training data). However, this points to a possible extension of the e-HMM concept—to incorporate expression data in the actual model. Recent studies have claimed that regulatory grammars are very flexible (Brown et al., 2007) and a predictor can perform well without considering any grammar (Segal et al., 2008). It is not clear whether this is a universal truth in mammalian gene regulation; the increased classification power of our method compared to others imply that part of grammars are detectable and increase accuracy if taken into account.
A further enhancement introduced is the usage of ensembles of models in classification as well as in motif identification. The ensemble approach is beneficial in that it compensates for a biased result of a single HMM and thus prevents overfitting the data. As seen in the simulation with the muscle set, the ensemble approach accumulates the results of each genetic step, and generally produces better results than other methods. The concept of e-HMMs can be extended in many directions. For instance, it may be applied to protein sequences. As this methodology is new, many aspects await further study: for instance, the selection of the fitness function, which has implications on search space and the evolution pace within the process. Nevertheless, we believe that methods trying to model the underlying regulatory grammar are an important step at the road from phenomena-based promoter analysis to hypothesis-based.
| ACKNOWLEDGEMENT |
|---|
|
|
|---|
Funding: This work was supported by a grant from the Novo Nordisk Foundation.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Alex Bateman
Present address: Department of Chemistry & Biochemistry, University of California San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0359, USA. ![]()
Received on January 28, 2008; revised on April 28, 2008; accepted on May 30, 2008
| REFERENCES |
|---|
|
|
|---|
Akaike H. Information theory and extension of the maximum likelihood principle. In. Petro B, Csaki F, eds. (1973) Akadémiai Kiadó, Budapest. 267–281. Proceedings of the 2nd International Symposium on Information Theory.
Beer MA, Tavazoie S. Predicting gene expression from sequence. Cell (2004) 117:185–198.[CrossRef][Web of Science][Medline]
Berman BP, et al. Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome. Proc. Natl Acad. Sci. USA (2002) 99:757–762.
Bonizzi G, Karin M. The two NF-kappaB activation pathways and their role in innate and adaptive immunity. Trends Immunol (2004) 25:280–288.[CrossRef][Web of Science][Medline]
Brown CD, et al. Functional architecture and evolution of transcriptional elements that drive gene coexpression. Science (2007) 317:1557–1560.
Bryne JC, et al. JASPAR, the open access database of transcription factor-binding profiles: new content and tools in the 2008 update. Nucleic Acids Res (2007) 36(Database issue):D102–D106.[CrossRef][Web of Science][Medline]
Carninci P, et al. Genome-wide analysis of mammalian promoter architecture and evolution. Nat. Genet (2006) 38:626–635.[CrossRef][Web of Science][Medline]
Chowdhary R, et al. Dragon Promoter Mapper (DPM): a Bayesian framework for modelling promoter structures. Bioinformatics (2006) 22:2310–2312.
Crowley EM, et al. A statistical model for locating regulatory regions in genomic DNA. J. Mol. Biol (1997) 268:8–14.[CrossRef][Web of Science][Medline]
Durbin R, et al. Biological Sequence Analysis (1999) Cambridge: Cambridge University Press.
Frith MC, et al. Detection of cis-element clusters in higher eukaryotic DNA. Bioinformatics (2001) 17:878–889.
Frith MC, et al. Statistical significance of clusters of motifs represented by position specific scoring matrices in nucleotide sequences. Nucleic Acids Res (2002) 30:3214–3224.
Frith MC, et al. Cluster-Buster: finding dense clusters of motifs in DNA sequences. Nucleic Acids Res (2003) 31:3666–3668.
Gupta M, Liu JS. De novo cis-regulatory module elicitation for eukaryotic genomes. Proc. Natl Acad. Sci. USA (2005) 102:7079–7084.
Krogh A. Using database matches with for HMMGene for automated gene detection in Drosophila. Genome Res. (2000) 10:523–528.
Markstein M, et al. Genome-wide analysis of clustered Dorsal binding sites identifies putative target genes in the Drosophila embryo. Proc. Natl Acad. Sci. USA (2002) 99:763–768.
Matys V, et al. TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res. (2003) 31:374–378.
Nilsson R, et al. Transcriptional network dynamics in macrophage activation. Genomics (2006) 88:133–142.[CrossRef][Web of Science][Medline]
Noto K, Craven M. Learning probabilistic models of cis-regulatory modules that represent logical and spatial aspects. Bioinformatics (2007) 23:e156–e162.
Rajewsky N, et al. Computational detection of genomic cis-regulatory modules applied to body patterning in the early Drosophila embryo. BMC Bioinformatics (2002) 3:30.[CrossRef][Medline]
Rebeiz M, et al. SCORE: a computational approach to the identification of cis-regulatory modules and target genes in whole-genome sequence data. Site clustering over random expectation. Proc. Natl Acad. Sci. USA (2002) 99:9888–9893.
Riis SK, Krogh A. Improving prediction of protein secondary structure using structured neural networks and multiple sequence alignments. J. Comput. Biol (1996) 3:163–183.[Web of Science][Medline]
Segal E, et al. Predicting expression patterns from regulatory sequence in Drosophila segmentation. Nature (2008) 451:535–540.[CrossRef][Web of Science][Medline]
Sharan R, Myers EW. A motif-based framework for recognizing sequence families. Bioinformatics (2005) 21(Suppl. 1):i387–i393.[Abstract]
Sinha S, et al. A probabilistic method to detect regulatory modules. Bioinformatics (2003) 19(Suppl. 1):i292–i301.[Abstract]
Smale ST, Kadonaga JT. The RNA polymerase II core promoter. Annu. Rev. Biochem (2003) 72:449–479.[CrossRef][Web of Science][Medline]
Stormo GD. DNA binding sites: representation and discovery. Bioinformatics (2000) 16:16–23.
Su AI, et al. Large-scale analysis of the human and mouse transcriptomes. Proc. Natl Acad. Sci. USA (2002) 99:4465–4470.
Vapnik V. The Nature of Statistical Learning Theory (1995) New York: Springer.
Wasserman WW, Fickett JW. Identification of regulatory regions which confer muscle-specific gene expression. J. Mol. Biol (1998) 278:167–181.[CrossRef][Web of Science][Medline]
Wasserman WW, Sandelin A. Applied bioinformatics for the identification of regulatory elements. Nat. Rev. Genet (2004) 5:276–287.[CrossRef][Web of Science][Medline]
Wasserman WW, et al. Human-mouse genome comparisons to locate regulatory sites. Nat. Genet (2000) 26:225–228.[CrossRef][Web of Science][Medline]
Won K, et al. Evolving the structure of Hidden Markov Models. IEEE T. Evolut. Comput (2006) 10:39–49.[CrossRef]
Won KJ, et al. An evolving method for learning HMM structure: prediction of protein secondary structure. BMC Bioinformatics (2007) 8:357.[CrossRef][Medline]
This article has been cited by other articles:
![]() |
A. Vandenbon and K. Nakai Modeling tissue-specific structural patterns in human and mouse promoters Nucleic Acids Res., October 22, 2009; (2009) gkp866v1. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||






