Bioinformatics Advance Access originally published online on February 18, 2007
Bioinformatics 2007 23(8):1032-1034; doi:10.1093/bioinformatics/btm047
ClusterDraw web server: a tool to identify and visualize clusters of binding motifs for transcription factors
Department of Molecular and Cell Biology, University of California, Berkeley, Berkeley, CA 94720, USA
| ABSTRACT |
|---|
ClusterDraw is a program for identifying binding sites and binding-site clusters. Its major difference from existing tools is the ability to scan a wide range of parameter values and weigh the statistical significance of all possible clusters smaller than a selected size. The program produces graphs along with decorated FASTA files. The ClusterDraw web server is available at the following URL: http://flydev.berkeley.edu/cgi-bin/cld/submit.cgi
Contact: dxp{at}berkeley.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
A large number of programs have been developed to identify transcription regulatory regions in genomic sequences (Alkema et al., 2004; Berman et al., 2004; Frith et al., 2003; Markstein et al., 2002; Philippakis et al., 2005; Pierstorff et al., 2006; Rajewsky et al., 2002; Sinha et al., 2006; Sosinsky et al., 2003; Waleev et al., 2006). However, this important task remains a challenge. One obstacle is the presence of a large number of non-functional binding-site matches (Papatsenko et al., 2002). Available binding motifs are imperfect and, often, the appropriate thresholds for binding motif searches are not known. In addition, a search for binding-site clusters may require the expected cluster size, or window size, to be specified. This adds a second ambiguous parameter to the search. A statistical solution to the cluster size problem was employed by A. Wagner in r-scan analysis (Wagner, 1997, 1998, 1999; Karlin and Brendel, 1992). ClusterDraw takes advantage of the r-scan algorithm, combined with an exhaustive search over a wide range of binding-site match P-values (Lifanov et al., 2003). The program calculates cluster significance from the sum l (in bases) of the N – 1 consecutive distances between all N site matches present in a cluster, and determines statistical significance for every possible cluster smaller than a given size lmax. Among all overlapping clusters, the program selects those producing the best statistical scores. The described method is equivalent to a search for the best clusters in the parameter space defined by the motif match quality, the size of the resolution window and the position in a sequence.
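The exhaustive enumeration of candidate clusters described above can be illustrated with a minimal Python sketch. It assumes site matches are given as a list of integer positions; the function name `candidate_clusters` and the tuple layout are hypothetical and are not taken from ClusterDraw itself. The span of a run of consecutive sites equals the sum of its N – 1 inter-site distances, so it can be read directly from the first and last positions.

```python
from typing import List, Tuple

def candidate_clusters(positions: List[int], l_max: int) -> List[Tuple[int, int, int]]:
    """Enumerate every run of at least two consecutive site matches
    whose span l is smaller than l_max.

    Returns tuples (index_of_first_site, n_sites, span_in_bases).
    Hypothetical helper for illustration only.
    """
    pos = sorted(positions)
    clusters = []
    for i in range(len(pos)):
        for j in range(i + 1, len(pos)):
            # Span of sites i..j = sum of the N - 1 consecutive distances.
            span = pos[j] - pos[i]
            if span >= l_max:
                break  # positions are sorted, so later sites only widen the span
            clusters.append((i, j - i + 1, span))
    return clusters

# Example: two tight pairs separated by a long gap.
print(candidate_clusters([10, 20, 100, 105], l_max=50))
```

Each candidate returned this way would then be scored for significance, and the best-scoring cluster among overlapping candidates retained.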
| 2 ALGORITHM |
|---|
2.1 Calculating motif match P-values
Calculation of the cumulative match P-value for a word is based on the score M calculated using a position weight matrix (PWM; Prestridge and Stormo, 1993; see Equation 1S, Supplementary Material). First, for a given PWM, the algorithm finds all possible words producing a score greater than or equal to M. Then, the expected frequencies of all these words are calculated using the standard approach (see Equation 2S, Supplementary Material). The sum of the word frequencies over all words scoring greater than or equal to M is the cumulative match probability PM corresponding to the matrix score M (see also Equations 2S–4S in Supplementary Material):
|
| (1) |
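The enumeration described in this section can be sketched in Python as follows. This is an illustration under stated assumptions, not the ClusterDraw implementation: the PWM score of a word is assumed to be the sum of per-position weights, and the expected frequency of a word is assumed to be the product of independent background base frequencies. Equations 1S and 2S are not reproduced in this text, so both choices are stand-ins. The function name `cumulative_match_probability` is hypothetical.

```python
from itertools import product
from typing import Dict, List

def cumulative_match_probability(
    pwm: List[Dict[str, float]],
    threshold: float,
    background: Dict[str, float],
) -> float:
    """Sum the expected frequencies of all words scoring >= threshold.

    pwm: one dict of base -> weight per motif position (assumed additive score).
    background: base -> frequency (assumed independent positions).
    """
    total = 0.0
    # Exhaustive enumeration: 4**len(pwm) words, practical only for short motifs.
    for word in product("ACGT", repeat=len(pwm)):
        score = sum(column[base] for column, base in zip(pwm, word))
        if score >= threshold:
            freq = 1.0
            for base in word:
                freq *= background[base]
            total += freq
    return total

# Toy two-position motif preferring "AG", uniform background.
toy_pwm = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
]
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(cumulative_match_probability(toy_pwm, 2.0, uniform))
```

Brute-force enumeration grows as 4 to the power of the motif length, so a practical tool would likely need a faster scheme for long motifs; the sketch only makes the definition concrete.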
2.2 Calculating cluster significance
Cluster significance score E is calculated from the cluster size l, the number of matches N, the match probability cutoff P and the number of binding motifs in the search T using the binomial distribution (Wagner, 1997):
|
| (2) |
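To make the role of the four quantities l, N, P and T concrete, the following Python sketch computes a binomial upper-tail probability. It is not Equation 2 and should not be read as the ClusterDraw scoring formula. It is one plausible reading under two assumptions: each of the l positions is treated as an independent trial, and a position counts as a match if any of the T motifs matches it with probability P. The function names are hypothetical.

```python
from math import exp, lgamma, log, log1p

def binomial_upper_tail(n_trials: int, k_min: int, p: float) -> float:
    """P(X >= k_min) for X ~ Binomial(n_trials, p), with 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    total = 0.0
    for k in range(max(k_min, 0), n_trials + 1):
        # Work in log space so large binomial coefficients do not overflow.
        log_term = (
            lgamma(n_trials + 1) - lgamma(k + 1) - lgamma(n_trials - k + 1)
            + k * log(p) + (n_trials - k) * log1p(-p)
        )
        total += exp(log_term)
    return min(total, 1.0)

def cluster_tail_probability(l: int, n_sites: int, p_cutoff: float, n_motifs: int) -> float:
    """Chance of seeing at least n_sites matches in l bases (assumed model)."""
    # Assumption: T motifs match independently, each with probability P per position.
    p_any = 1.0 - (1.0 - p_cutoff) ** n_motifs
    return binomial_upper_tail(l, n_sites, p_any)

# A denser cluster of the same size is less likely to arise by chance.
print(cluster_tail_probability(200, 5, 0.001, 2))
print(cluster_tail_probability(200, 3, 0.001, 2))
```

Under this model the tail probability falls as more matches are packed into the same span and rises as the span grows, which is the qualitative behaviour a cluster significance score needs.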
| 3 RESULTS AND DISCUSSION |
|---|
The ClusterDraw web server has a standard common gateway interface (CGI); current settings allow processing of up to 100 KB of sequence data. For user convenience, motifs can be entered as multiple alignments or position frequency matrices (PFMs). The basic interface provides only three options: minimal combination of binding motifs, cluster significance cutoff and background model/organism selection. The advanced interface provides options to control maximal cluster size, minimal match P-value, statistics and graphics. By default, ClusterDraw filters overlapping binding sites by finding local maxima; however, options are available to control this function and even to extract overlapping sites/composite elements (Makeev et al., 2003; Waleev et al., 2006). A ClusterDraw output plot is shown in Figure 1A.
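The default filtering of overlapping sites by local maxima can be illustrated with a short Python sketch. The exact rule used by ClusterDraw is not specified in the text, so the following is an assumption: two matches overlap when their start positions are closer than the motif width, and a match is kept only if no overlapping match scores higher (ties go to the leftmost match). The name `filter_local_maxima` is hypothetical.

```python
from typing import List, Tuple

def filter_local_maxima(
    matches: List[Tuple[int, float]], width: int
) -> List[Tuple[int, float]]:
    """Keep matches that are the best-scoring among their overlapping neighbours.

    matches: (start_position, score) pairs.
    width: motif width; starts closer than this are treated as overlapping.
    """
    ordered = sorted(matches)
    kept = []
    for i, (pos, score) in enumerate(ordered):
        is_max = True
        for j, (other_pos, other_score) in enumerate(ordered):
            if i == j or abs(other_pos - pos) >= width:
                continue
            # An overlapping match wins if it scores higher,
            # or if it ties and lies further to the left.
            if other_score > score or (other_score == score and j < i):
                is_max = False
                break
        if is_max:
            kept.append((pos, score))
    return kept

# Three overlapping matches and one isolated match.
print(filter_local_maxima([(10, 5.0), (12, 7.0), (14, 6.0), (40, 3.0)], width=8))
```

Skipping this filter would keep all overlapping matches, which corresponds to the option of extracting overlapping sites or composite elements.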
To validate the performance of ClusterDraw, cluster significance profiles generated by ClusterDraw were compared to profiles generated by AHAB (Rajewsky et al., 2002). AHAB was selected as one of the algorithms less sensitive to the window size and optimized to analyze the same type of data (i.e. fly enhancers); a comparison of AHAB with other programs is available (Pierstorff et al., 2006). The results of the tests were striking: in most cases, cluster significance profiles produced by AHAB and ClusterDraw were highly correlated (see Fig. 1), and the predictive power of both algorithms was similar as well. Differences were found in the ranks of the highest scores (see arrows in Fig. 1B). One can explain the agreement between the two different programs by the fact that they both perform exhaustive local searches. AHAB identifies the best out of all possible partitions for a given set of binding motifs in a window; ClusterDraw finds the best out of all possible overlapping clusters. These tests demonstrate the efficiency of an exhaustive search strategy in the detection of regulatory regions.
Performance tests for ClusterDraw and AHAB were also run on genomic sequences from the mosquito Anopheles gambiae and the honeybee Apis mellifera. These sequences contained Anopheles and Apis sim enhancers, recently identified in the M. Levine lab (Zinzen et al., 2006). Given the binding motifs present in the Drosophila sim enhancer, both programs were able to correctly predict sim enhancers in mosquito (see Fig. 1E) and honeybee (data not shown).
The absence of the window and match score cutoff parameters in ClusterDraw, as well as the correlation of the program's predictions with other programs and experimental data, provide new opportunities (Clyde et al., 2003; Ochoa-Espinosa et al., 2005) in the exploration of transcription regulatory regions.
| ACKNOWLEDGEMENTS |
|---|
The author thanks Mike Levine, who participated in algorithm improvement and provided data for testing. The work was supported by a grant from the Moore Foundation to the Center for Integrative Genomics, University of California, Berkeley. Funding to pay the Open Access publication charges was provided by the Center for Integrative Genomics, University of California, Berkeley. The Center is supported by a grant from the Moore Foundation.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Alex Bateman
Received on October 25, 2006; revised on January 12, 2007; accepted on February 6, 2007
| REFERENCES |
|---|
Alkema WB, et al. MSCAN: identification of functional clusters of transcription factor binding sites. Nucleic Acids Res. (2004) 32:W195–W198.
Berman BP, et al. Computational identification of developmental enhancers: conservation and function of transcription factor binding-site clusters in Drosophila melanogaster and Drosophila pseudoobscura. Genome Biol. (2004) 5:R61.
Clyde DE, et al. A self-organizing system of repressor gradients establishes segmental complexity in Drosophila. Nature (2003) 426:849–853.
Frith MC, et al. Cluster-Buster: finding dense clusters of motifs in DNA sequences. Nucleic Acids Res. (2003) 31:3666–3668.
Karlin S, Brendel V. Chance and statistical significance in protein and DNA sequence analysis. Science (1992) 257:39–49.
Lifanov AP, et al. Homotypic regulatory clusters in Drosophila. Genome Res. (2003) 13:579–588.
Makeev VJ, et al. Distance preferences in the arrangement of binding motifs and hierarchical levels in organization of transcription regulatory information. Nucleic Acids Res. (2003) 31:6016–6026.
Markstein M, et al. Genome-wide analysis of clustered Dorsal binding sites identifies putative target genes in the Drosophila embryo. Proc. Natl. Acad. Sci. USA (2002) 99:763–768.
Ochoa-Espinosa A, et al. The role of binding site cluster strength in Bicoid-dependent patterning in Drosophila. Proc. Natl. Acad. Sci. USA (2005) 102:4960–4965.
Papatsenko DA, et al. Extraction of functional binding sites from unique regulatory regions: the Drosophila early developmental enhancers. Genome Res. (2002) 12:470–481.
Philippakis AA, et al. Modulefinder: a tool for computational discovery of cis regulatory modules. In: Pac. Symp. Biocomput. (2005) 519–530.
Pierstorff N, et al. Identifying cis-regulatory modules by combining comparative and compositional analysis of DNA. Bioinformatics (2006) 10:10.
Prestridge DS, Stormo G. SIGNAL SCAN 3.0: new database and program features. Comput. Appl. Biosci. (1993) 9:113–115.
Rajewsky N, et al. Computational detection of genomic cis-regulatory modules applied to body patterning in the early Drosophila embryo. BMC Bioinformatics (2002) 3:30.
Sinha S, et al. Stubb: a program for discovery and analysis of cis-regulatory modules. Nucleic Acids Res. (2006) 34:W555–W559.
Sosinsky A, et al. Target explorer: an automated tool for the identification of new target genes for a specified set of transcription factors. Nucleic Acids Res. (2003) 31:3589–3592.
Wagner A. A computational genomics approach to the identification of gene networks. Nucleic Acids Res. (1997) 25:3594–3604.
Wagner A. A computational "genome walk" technique to identify regulatory interactions in gene networks. In: Pac. Symp. Biocomput. (1998) 264–278.
Wagner A. Genes regulated cooperatively by one or more transcription factors and their identification in whole eukaryotic genomes. Bioinformatics (1999) 15:776–784.
Waleev T, et al. Composite module analyst: identification of transcription factor binding site combinations using genetic algorithm. Nucleic Acids Res. (2006) 34:W541–W545.
Zinzen RP, Cande J, Ronshaugen M, Papatsenko D, Levine M. Evolution of the ventral midline in insect embryos. Dev Cell. (2006) 11:895–902.