Bioinformatics Advance Access originally published online on April 13, 2006
Bioinformatics 2006 22(18):2310-2312; doi:10.1093/bioinformatics/btl125
© 2006 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Dragon Promoter Mapper (DPM): a Bayesian framework for modelling promoter structures
1 Knowledge Extraction Lab, Institute for Infocomm Research 21 Heng Mui Keng Terrace, Singapore 119613, Singapore
2 Department of Mathematics and Statistics, University of Guelph Guelph ON, Canada N1G 2W1, Canada
3 Norsys Software Corporation, 3512 West 23rd Avenue Vancouver, BC, Canada V6S 1K5, Canada
4 School of Computing, National University of Singapore Singapore 117543, Singapore
5 University of the Western Cape, South African National Bioinformatics Institute (SANBI) Private Bag X17, Bellville 7535, South Africa
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Summary: Dragon Promoter Mapper (DPM) is a tool for modelling the promoter structure of co-regulated genes using the methodology of Bayesian networks. DPM exploits an exhaustive set of motif features (the motif, its strand, the order of motif occurrence and the mutual distance between adjacent motifs) and generates models from the target promoter sequences. These models may be used (1) to detect regions in a genomic sequence that are similar to the target promoters or (2) to classify other promoters as similar or not to the target promoter group. DPM can also be used for modelling enhancers and silencers.
Availability: http://defiant.i2r.a-star.edu.sg/projects/BayesPromoter/
Contact: vlad{at}sanbi.ac.za
Supplementary information: A manual for the DPM web server is provided at http://defiant.i2r.a-star.edu.sg/projects/BayesPromoter/html/manual/manual.htm
Critical components in gene regulation are transcription factor binding sites (TFBSs) that are usually present in a gene's promoter region. TFBSs operate together in a constitutive manner and provide a unique framework and functionality to the promoter structure. A promoter structure is characterized by the TFBS organization within the promoter, which is specific to gene groups (Werner, 1999). Elucidation of promoter structure may further enhance our understanding of gene regulation.
Different techniques are used to model promoter structure, ranging from simple binary scoring schemes (Halfon et al., 2002; Berman et al., 2002; Markstein et al., 2002; Frech, 1997; Sosinsky et al., 2003) to more sophisticated hidden Markov models (HMMs) (Grundy et al., 1997; Frith et al., 2001, 2002 and 2003; Bailey and Noble, 2003; Sinha et al., 2003). Though most of these programs are statistical in nature, their design objectives and strategies vary. For example, for motif discovery, which forms part of promoter structure modelling, some researchers have followed the IUPAC consensus (Markstein et al., 2002) to represent TFBSs, while others have used position weight matrices (PWMs) (Berman et al., 2002; Markstein et al., 2002; Frech, 1997; Sosinsky et al., 2003; Grundy et al., 1997; Frith et al., 2002; Bailey and Noble, 2003; Frith et al., 2001; Sinha et al., 2003). Owing to their design requirements, these programs generally tend to have various built-in restrictions. For example, FastM, along with ModelInspector (Frech, 1997), allows generation of promoter structure models using just two TFBSs; in Cis-analyst (Berman et al., 2002), the number of TFBS clusters to be identified within the promoter is restricted; Target Explorer (Sosinsky et al., 2003) looks only for TFBS clusters with a fixed number of motifs specified by the user; rVISTA (Loots et al., 2002), TraFaC (Jegga et al., 2002) and CisMols (Jegga et al., 2005) are based on comparative sequence analysis and thus are restricted to work only on single higher eukaryotic sequences (from one species), tending to miss species-specific TFBSs.
These programs also differ in the motif features they consider for modelling promoter structure. For example, Target Explorer (Sosinsky et al., 2003) and Cis-analyst (Berman et al., 2002) consider the mere presence of motifs, while Cister (Frith et al., 2001), COMET (Frith et al., 2002), Cluster-Buster (Frith et al., 2003) and MCAST (Bailey and Noble, 2003) also take into account the spacing between motifs; Meta-MEME (Grundy et al., 1997) and the method proposed by Sinha et al. (2003) additionally consider the order of motif occurrence. Overall, each program has its own strengths, its own limitations and its own set of parameters suited to specific situations.
We present here Dragon Promoter Mapper (DPM), which implements a novel methodology to model the promoter structure of co-regulated genes. DPM uses an exhaustive set of biologically meaningful features and is based on a robust probabilistic formulation, that of Bayesian networks. DPM can analyse any type and any number of sequences and can consider a variable number of motifs in a target TFBS cluster. Thus, it can equally be used for modelling enhancers and silencers, although we focus our presentation on promoters only. To the best of our knowledge, no system other than DPM currently allows the user to model dependencies of arbitrary order between the motifs in a sequence, a capability that in itself improves the prediction quality of such models.
DPM builds a Bayesian model of the promoter structure associated with the training data. Training data may contain promoter sequences of different classes; background sequences may also be used. For modelling, DPM exploits higher order features of the motifs present within the sequence. These features include the motifs, the strand on which they are found, their order of occurrence and the spacer length between adjacent motifs. DPM builds the model by discovering the motifs in the training data using a set of predefined PWMs (Fig. 1 of DPM manual, http://defiant.i2r.a-star.edu.sg/projects/BayesPromoter/html/manual/fig_1.htm). The motif organization obtained for each sequence after PWM scanning is transformed into the higher order motif-feature definition that DPM uses to train the Bayesian model. DPM trains the model with the Expectation Maximization algorithm (Dempster et al., 1977) using uniform (Dirichlet) priors, and relies on Netica functions (http://www.norsys.com/) for Bayesian networks.
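The transformation from PWM hits to motif features can be illustrated with a minimal Python sketch. It is not DPM's implementation: the function names, the log-odds scoring against a uniform background, the score threshold and the two toy PWMs are all assumptions made for illustration. It scans both strands and reports, for each hit, the motif, its strand, its order of occurrence and the spacer to the preceding hit.

```python
import math

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def consensus_pwm(consensus, p=0.85):
    """Hypothetical helper: a toy PWM giving probability p to the consensus base."""
    rest = (1 - p) / 3
    return [{b: (p if b == c else rest) for b in BASES} for c in consensus]

def log_odds(pwm, background=0.25):
    """Convert base probabilities (all assumed > 0) into log2-odds scores."""
    return [{b: math.log2(col[b] / background) for b in BASES} for col in pwm]

def scan(seq, name, pwm, threshold):
    """Return (start, name, strand, width) for windows scoring >= threshold."""
    scores = log_odds(pwm)
    width = len(scores)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        # A site on the minus strand is read as the reverse complement.
        for strand, site in (("+", window), ("-", reverse_complement(window))):
            score = sum(scores[i][base] for i, base in enumerate(site))
            if score >= threshold:
                hits.append((start, name, strand, width))
    return hits

def motif_features(seq, pwms, threshold=5.0):
    """List hits in sequence order with strand and spacer to the previous hit."""
    hits = sorted(h for name, pwm in pwms.items()
                  for h in scan(seq, name, pwm, threshold))
    features = []
    prev_end = None
    for order, (start, name, strand, width) in enumerate(hits, start=1):
        # Spacer is negative when two hits overlap; None for the first hit.
        spacer = None if prev_end is None else start - prev_end
        features.append({"order": order, "motif": name,
                         "strand": strand, "spacer": spacer})
        prev_end = start + width
    return features

# Toy input (illustrative only): a GC-box-like site on the plus strand
# followed by a TATA-like site on the minus strand.
pwms = {"TATA": consensus_pwm("TATAA"), "GC_box": consensus_pwm("GGGCG")}
features = motif_features("ACGGGCGTTACTTATACCA", pwms)
```

Each dictionary in `features` corresponds to one motif occurrence; a table of such records per sequence is the kind of input from which the nodes of a Bayesian model could be populated.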
A Bayesian model has two main components: (1) a directed acyclic graph (DAG) structure, where each node represents a random variable and directed arcs indicate dependencies between these variables, and (2) model parameters, defined by a set of conditional probability distributions for each node in the network. The model nodes may encode the motif features and the class of the sequences used, while the graphical structure of the model may encode the dependencies between these nodes (Fig. 2 of DPM manual, http://defiant.i2r.a-star.edu.sg/projects/BayesPromoter/html/manual/fig_2.htm). A trained DPM Bayesian model may be used for inference based on the junction-tree algorithm (Huang and Darwiche, 1994). For a query sequence, DPM returns a probability distribution over all the target sequence classes and assigns the query sequence to the class with the highest probability. A higher classification probability indicates greater structural similarity between the query sequence and the target class, and a higher likelihood that the query sequence belongs to that class. DPM inference thus essentially classifies a query sequence into one of the target classes based on structural similarity.
To use DPM, a user may follow these steps:
Step 1 (Training data). Collect promoter sequences of transcripts assumed to be co-regulated in order to model them. Background sequences (e.g. random DNA sequences) may also be used.
Step 2 (Query data). Collect the query sequences that you want to analyze for the presence of the promoter model associated with the training data.
Step 3 (PWM file). Find out which motifs are specific to your target sequence classes in the training data. Compile a list of PWMs associated with these motifs.
Submit the training data, query data and PWM file, along with other user options, to DPM. DPM builds the promoter model from the training data, the PWM file and an automatically generated model definition file. The model definition file contains the information for the Bayesian promoter model.
Step 4 (Model tuning and testing). This intermediate step allows you to modify the default model definition file generated by DPM. DPM also provides a utility for testing the performance of the model using leave-one-out cross-validation. Depending on the test results, you may either proceed to processing the query data or tune the model further (by modifying any or all of the training data, PWM file and model definition file) and test again.
Step 5 (Mapping model to query data). DPM maps the model to the query sequences. The output file contains the probability distribution for each of the query sequences over the target classes. For long query sequences, the output shows the regions, of the same length as the training data sequences, whose structure is similar to that of the target class.
More details on the above steps can be found in the manual on the DPM website.
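Steps 4 and 5 can be sketched in Python under strong simplifying assumptions. The sketch below is not DPM's network and does not use Netica: it assumes a naive Bayes structure (the class node is the only parent of every feature node), fully observed features (so that estimation under a uniform Dirichlet prior reduces to add-one smoothed counting rather than requiring EM), and a hypothetical `toy_featurize` function that checks exact consensus matches. All names and toy data are illustrative.

```python
import math
from collections import Counter

def train(examples, states):
    """examples: list of (label, {node: state}); states: {node: [states]}."""
    labels = sorted({label for label, _ in examples})
    n_class = Counter(label for label, _ in examples)
    counts = Counter((label, node, feats[node])
                     for label, feats in examples for node in states)
    # Posterior-mean estimates under a uniform Dirichlet prior (add-one).
    prior = {c: (n_class[c] + 1) / (len(examples) + len(labels)) for c in labels}
    cpt = {(c, node, s): (counts[(c, node, s)] + 1) / (n_class[c] + len(node_states))
           for c in labels
           for node, node_states in states.items()
           for s in node_states}
    return prior, cpt

def posterior(model, feats):
    """Probability distribution over classes given observed feature states."""
    prior, cpt = model
    log_joint = {c: math.log(p) + sum(math.log(cpt[(c, node, s)])
                                      for node, s in feats.items())
                 for c, p in prior.items()}
    top = max(log_joint.values())  # subtract the maximum for numerical stability
    weights = {c: math.exp(v - top) for c, v in log_joint.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}

def loo_accuracy(examples, states):
    """Leave-one-out cross-validation: hold out each example in turn."""
    correct = 0
    for i, (label, feats) in enumerate(examples):
        dist = posterior(train(examples[:i] + examples[i + 1:], states), feats)
        correct += max(dist, key=dist.get) == label
    return correct / len(examples)

def map_to_query(model, query, window_len, featurize, step=1):
    """Slide a training-length window over the query; return (start, posterior)."""
    last_start = max(len(query) - window_len, 0)
    return [(start, posterior(model, featurize(query[start:start + window_len])))
            for start in range(0, last_start + 1, step)]

# Toy data (illustrative only).
STATES = {"TATA": ["present", "absent"], "GC_box": ["present", "absent"]}
SITES = {"TATA": "TATAA", "GC_box": "GGGCG"}

def toy_featurize(window):
    return {node: "present" if site in window else "absent"
            for node, site in SITES.items()}

training = [
    ("promoter", {"TATA": "present", "GC_box": "present"}),
    ("promoter", {"TATA": "present", "GC_box": "present"}),
    ("promoter", {"TATA": "present", "GC_box": "absent"}),
    ("background", {"TATA": "absent", "GC_box": "absent"}),
    ("background", {"TATA": "absent", "GC_box": "present"}),
    ("background", {"TATA": "absent", "GC_box": "absent"}),
]
model = train(training, STATES)
accuracy = loo_accuracy(training, STATES)
query = "C" * 10 + "GGGCGTTACTATAA" + "C" * 8
results = map_to_query(model, query, window_len=14, featurize=toy_featurize)
best_start, best_dist = max(results, key=lambda r: r[1]["promoter"])
```

Here `best_start` marks the window whose feature states are most probable under the promoter class. A network with arcs between motif nodes, as DPM allows, would replace the per-node product in `posterior` with general Bayesian network inference.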
DPM is simple to use and has the advantage of being highly flexible, allowing users to design their model to suit their needs by manipulating the nodes and the DAG structure of the model. The model can be used for the following:
- Search a genomic sequence for regions whose structure is similar to that of the target promoters. These regions may represent potential promoters.
- Find characterized promoters whose structure is similar to that of the target promoters. Such promoters may potentially be co-regulated with the target group.
- Approximately predict the transcription start sites of promoters similar to promoters of the target class.
The web server manual (http://defiant.i2r.a-star.edu.sg/projects/BayesPromoter/html/manual/manual.htm) contains a detailed example that explains the steps needed to use the DPM system. We have also provided an example comparison of DPM with several similar systems (COMET, Cluster-Buster, Meta-MEME and MCAST), which shows distinct and specific advantages of DPM, as discussed on the Comparative analysis page of the DPM server (http://defiant.i2r.a-star.edu.sg/projects/BayesPromoter/html/manual/Comparison_analysis.doc). We believe that users will find DPM a useful complement to the existing set of promoter analysis tools.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Keith A Crandall
Received on December 29, 2005; revised on March 26, 2006; accepted on March 28, 2006
| REFERENCES |
|---|
Bailey, T.L. and Noble, W.S. (2003) Searching for statistically significant regulatory modules. Bioinformatics, 19, Suppl. 2, ii16-ii25.
Berman, B.P., et al. (2002) Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome. Proc. Natl Acad. Sci. USA, 99, 757-762.
Dempster, A.P., et al. (1977) Maximum likelihood from incomplete data via the EM algorithm. J. R. Statist. Soc. B, 39, 1-38.
Frech, K. (1997) A novel method to develop highly specific models for regulatory units detects a new LTR in GenBank which contains a functional promoter. J. Mol. Biol., 270, 674-687.
Frith, M.C., et al. (2001) Detection of cis-element clusters in higher eukaryotic DNA. Bioinformatics, 17, 878-889.
Frith, M.C., et al. (2002) Statistical significance of clusters of motifs represented by position specific scoring matrices in nucleotide sequences. Nucleic Acids Res., 30, 3214-3224.
Frith, M.C., et al. (2003) Cluster-Buster: finding dense clusters of motifs in DNA sequences. Nucleic Acids Res., 31, 3666-3668.
Grundy, W.N., et al. (1997) Meta-MEME: motif-based hidden Markov models of protein families. Comput. Appl. Biosci., 13, 397-406.
Halfon, M.S., et al. (2002) Computation-based discovery of related transcriptional regulatory modules and motifs using an experimentally validated combinatorial model. Genome Res., 12, 1019-1028.
Huang, C. and Darwiche, A. (1994) Inference in belief networks: a procedural guide. Intl. J. Approximate Reasoning, 11, 1-158.
Jegga, A.G., et al. (2002) Detection and visualization of compositionally similar cis-regulatory element clusters in orthologous and coordinately controlled genes. Genome Res., 12, 1408-1417.
Jegga, A.G., et al. (2005) CisMols analyzer: identification of compositionally similar cis-element clusters in ortholog conserved regions of coordinately expressed genes. Nucleic Acids Res., 33, W408-W411 [Erratum (2005) Nucleic Acids Res., 33, 4377.]
Loots, G.G., et al. (2002) rVista for comparative sequence-based discovery of functional transcription factor binding sites. Genome Res., 12, 832-839.
Markstein, M., et al. (2002) Genomewide analysis of clustered Dorsal binding sites identifies putative target genes in the Drosophila embryo. Proc. Natl Acad. Sci. USA, 99, 763-768.
Sinha, S., et al. (2003) A probabilistic method to detect regulatory modules. Bioinformatics, 19, Suppl. 1, i292-i301.
Sosinsky, A., et al. (2003) Target explorer: an automated tool for the identification of new target genes for a specified set of transcription factors. Nucleic Acids Res., 31, 3589-3592.
Werner, T. (1999) Models for prediction and recognition of eukaryotic promoters. Mamm. Genome, 10, 168-175.

