

Bioinformatics Advance Access originally published online on April 26, 2005
Bioinformatics 2005 21(13):3058-3059; doi:10.1093/bioinformatics/bti461

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

TFBScluster: a resource for the characterization of transcriptional regulatory networks

Ian John Donaldson, Michael Chapman and Berthold Göttgens*

Department of Haematology, Cambridge Institute for Medical Research, University of Cambridge, Hills Road, Cambridge CB2 2XY, UK

*To whom correspondence should be addressed.


    Abstract

Summary: One major challenge of the post-sequencing era of the human genome project will be the functional annotation of the non-coding portion of the genome, in particular gene regulatory sequences. We have developed a new web-based tool, TFBScluster, which performs genome-wide identification of transcription factor binding site clusters that are conserved in multiple mammalian genomes. Clusters representing candidate gene regulatory elements can be filtered further, based on the presence or absence of additional user-defined DNA sequence motifs or by constraining the orientation or order of binding sites. Comprehensive results files, returned by email, are designed to facilitate experimental validation of computationally identified candidate gene regulatory sequences. TFBScluster, therefore, has the potential to contribute to deciphering transcriptional networks that regulate a wide range of mammalian developmental processes.

Availability: http://hscl.cimr.cam.ac.uk/TFBScluster_genome_34.html

Contact: bg200{at}cam.ac.uk

Supplementary information: http://hscl.cimr.cam.ac.uk/sup_don05_app_api.html

A key goal of the post-genome sequencing era will be the identification of cis-regulatory sequences to reconstruct the transcriptional networks that lie at the heart of normal development and are frequently perturbed in human disease processes. Since transcription factor binding sites (TFBSs) are often short (4–6 nt) and degenerate, genome-wide computational identification of TFBSs must be able to discriminate functional (true positive) from the vast numbers of non-functional (false positive) sites. The proportion of functional TFBSs in such analyses may be increased by (1) concentrating on TFBSs conserved between species (Lenhard et al., 2003), (2) devising algorithms capable of predicting the likelihood that a given region of the genome is involved in regulating gene expression (Kolbe et al., 2004), and (3) exploiting the observation that most functional TFBSs form part of clusters of multiple sites (Krivan and Wasserman, 2001; Markstein et al., 2002; Wasserman and Fickett, 1998). Taking advantage of all three approaches, we have recently demonstrated for the first time that human gene regulatory sequences with predicted in vivo biological activity (when assayed under the most rigorous conditions in transgenic mice) can be identified through whole-genome computational analysis (Donaldson et al., 2005). We have now built on this approach and developed a web-integrated suite of bioinformatics tools (TFBScluster) designed for the identification of candidate human gene regulatory sequences with predicted biological activity.

Briefly, TFBScluster utilizes genome-wide libraries of human TFBSs that are conserved in a range of other mammalian species. Binding site libraries currently implemented include libraries for 35 IUPAC consensus strings (conserved between human/mouse and human/mouse/rat) and 163 position weight matrices (conserved between human/mouse/rat). These libraries are used to identify TFBSs located in user-defined TFBS clusters. TFBScluster therefore differs significantly from existing tools, which are either designed specifically for the identification of TFBS clusters in lower model eukaryotes, such as Drosophila (Berman et al., 2002; Grad et al., 2004; Markstein et al., 2002; Sosinsky et al., 2003), limited to identifying TFBS clusters in relatively small sets of aligned sequences (Frith et al., 2003; Johansson et al., 2003; Loots and Ovcharenko, 2004; Yu et al., 2004), or restricted to sequences flanking predicted transcriptional start sites (Dieterich et al., 2005; Jegga et al., 2002). Moreover, to increase the proportion of functional binding sites, TFBScluster search results can be further screened based on the order, spacing and orientation of binding sites as well as the presence or absence of user-defined motifs.
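
To make the idea of screening by order, spacing and orientation concrete, the following Python sketch checks one cluster against such constraints. It is only an illustration: the function name `matches_arrangement`, its parameters and the exact meaning given to "order", "spacing" and "orientation" are assumptions, not the options exposed by the TFBScluster web interface.

```python
def matches_arrangement(sites, order=None, strands=None, max_gap=None):
    """Check one cluster against optional arrangement constraints.

    sites   : list of (start, end, name, strand) tuples, 1-based inclusive
              coordinates, strand given as "+" or "-".
    order   : required left-to-right sequence of site names (hypothetical option).
    strands : dict mapping a site name to its required strand (hypothetical option).
    max_gap : largest allowed number of nucleotides between neighbouring sites
              (hypothetical option).
    """
    ordered = sorted(sites)  # left to right along the chromosome

    # Order: the names, read left to right, must equal the requested list.
    if order is not None and [s[2] for s in ordered] != list(order):
        return False

    # Orientation: every constrained site must lie on the requested strand.
    if strands is not None:
        for _, _, name, strand in ordered:
            if name in strands and strands[name] != strand:
                return False

    # Spacing: count the nucleotides separating each pair of neighbours.
    if max_gap is not None:
        for left, right in zip(ordered, ordered[1:]):
            if right[0] - left[1] - 1 > max_gap:
                return False

    return True
```

For a cluster with a GATA site at 100-105 and an ETS site at 120-125, both on the forward strand, `order=["GATA", "ETS"]` is satisfied, while `max_gap=10` is not, because 14 nt separate the two sites.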

The genome-wide positions of 35 IUPAC consensus strings were identified using a revised version of our PERL program TFBSsearch.pl (Chapman et al., 2004) in human/mouse and human/mouse/rat aligned genomes available from Genome Bioinformatics at the UCSC (http://genome.ucsc.edu/downloads.html). Background and relevant references for the consensus binding site sequences used can be found at http://hscl.cimr.cam.ac.uk/TFBScluster_genome_34_background_iupac.html. For each consensus string, five library files were prepared using increasing levels of sequence conservation. The first file contains all ‘non-exact’ core positions where degenerate IUPAC positions may differ between aligned sequences. The second file contains ‘exact’ positions of only those sites with identical sequence across the alignment. In the remaining three files, the IUPAC code ‘N’ (any nucleotide) is added to the start and end of the core search string, thereby extending the conserved site beyond the original core string to ensure that the core binding sites are situated within extended blocks of high sequence conservation. Search strings were used with one, two and three conserved flanking nucleotides to identify TFBSs located in blocks of increasing sequence conservation. The genome-wide positions of 163 position weight matrices conserved in human/mouse/rat whole-genome alignments were obtained from the UCSC genome browser (http://genome.ucsc.edu/). Position weight matrices used are from the Transfac database v4.0 (Heinemeyer et al., 1999). A list of 163 matrices is provided as background information (http://hscl.cimr.cam.ac.uk/TFBScluster_genome_34_background_pwm.html) and includes hyperlinks to database records for the individual matrices maintained on the TESS website (http://www.cbil.upenn.edu/tess). 
To increase the likelihood that TFBSs represent functional sites, TFBScluster can be restricted to only analyse sites located in regions of regulatory potential (Kolbe et al., 2004) with scores ≥0.0002, based on the threshold suggested by UCSC (http://hgdownload.cse.ucsc.edu/goldenPath/hg16/regPotential/).
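
The 'non-exact', 'exact' and flank-extended conservation levels described for the consensus-string libraries can be illustrated with a short Python sketch. This is not the TFBSsearch.pl code: it works on a gap-free toy alignment, the names `iupac_to_regex` and `conserved_sites` are hypothetical, and treating the flank-extended searches as requiring identical sequence across species is an assumption.

```python
import re

# Standard IUPAC nucleotide codes as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(consensus):
    """Compile an IUPAC consensus string into a regular expression."""
    return re.compile("".join(IUPAC[code] for code in consensus.upper()))


def conserved_sites(consensus, aligned, flank=0, exact=False):
    """Return 0-based start columns of the core site conserved in all species.

    aligned : equal-length aligned sequences, reference species first.
    flank   : number of 'N' codes added to each side of the core string.
    exact   : if True, the (extended) site must be identical in every species;
              if False, degenerate positions may differ between species.
    """
    pattern = iupac_to_regex("N" * flank + consensus + "N" * flank)
    width = len(consensus) + 2 * flank
    hits = []
    for start in range(len(aligned[0]) - width + 1):
        windows = [seq[start:start + width].upper() for seq in aligned]
        # Every species must match the pattern; gap characters never match.
        if not all(pattern.fullmatch(window) for window in windows):
            continue
        if exact and len(set(windows)) != 1:
            continue
        hits.append(start + flank)  # report the start of the core string
    return hits
```

With human `TTCAGATAAGGC` and mouse `TTCTGATAGGGC`, `conserved_sites("WGATAR", [human, mouse])` reports column 3, whereas the same call with `exact=True` reports nothing because the two species differ at the degenerate positions.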

Using the web interface of TFBScluster, TFBS libraries are screened for clusters of TFBSs in the human genome that contain a set number of sites in a defined window of sequence. In TFBScluster, cluster sizes are defined by the start of the first (left-most) TFBS and the end of the last (right-most) TFBS, and overlapping clusters are merged, with the result that some clusters may be larger than the user-defined window size. By default, there is no restriction on the distance between individual TFBSs. However, since some binding sites may overlap, the web interface includes an option that counts only non-overlapping TFBSs when determining whether the minimum number of each TFBS for the user-defined cluster has been met. This ensures that the minimum set of TFBSs is free to bind proteins simultaneously.
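
The cluster definition above can be sketched in Python as follows. This is a minimal illustration, not the TFBScluster.pl implementation: the function names are hypothetical, windows are anchored at each site in turn (an assumption about how the scan proceeds), and the non-overlap option is simplified to counting non-overlapping sites within each TFBS type separately.

```python
from collections import defaultdict


def count_non_overlapping(intervals):
    """Greedy count of mutually non-overlapping (start, end) intervals."""
    count, last_end = 0, None
    for start, end in sorted(intervals, key=lambda iv: iv[1]):
        if last_end is None or start > last_end:
            count += 1
            last_end = end
    return count


def find_clusters(sites, required, window, non_overlapping=False):
    """Find merged TFBS clusters on one chromosome.

    sites    : (start, end, name) tuples, 1-based inclusive coordinates.
    required : dict mapping each TFBS name to its minimum count.
    window   : maximum span, in nt, in which the minimum counts must be met.
    """
    sites = sorted(s for s in sites if s[2] in required)
    candidates = []
    for i, (first_start, _, _) in enumerate(sites):
        limit = first_start + window - 1
        by_name = defaultdict(list)
        for start, end, name in sites[i:]:
            if start > limit:
                break
            if end <= limit:  # the site must lie wholly inside the window
                by_name[name].append((start, end))
        members = [iv for ivs in by_name.values() for iv in ivs]
        if not members:
            continue
        counts = {
            name: count_non_overlapping(ivs) if non_overlapping else len(ivs)
            for name, ivs in by_name.items()
        }
        if all(counts.get(name, 0) >= k for name, k in required.items()):
            # Cluster runs from the start of the first to the end of the last site.
            candidates.append((min(s for s, _ in members),
                               max(e for _, e in members)))

    # Merge overlapping candidates; a merged cluster may exceed the window size.
    merged = []
    for start, end in sorted(candidates):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(end, merged[-1][1]))
        else:
            merged.append((start, end))
    return merged
```

With GATA sites at 100-105 and 240-245, an ETS site at 150-155, one required site of each type and a 100 nt window, the two candidate clusters (100-155 and 150-245) overlap and are merged into a single cluster spanning 100-245, which is longer than the window.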

The user may choose to receive a simple list of candidate clusters in the Sanger GFF format (http://www.sanger.ac.uk/Software/formats/GFF/) with the start and end chromosomal coordinates of each cluster relative to the human genome (‘short’ analysis). This output is directly portable into many downstream sequence analysis programs and also easily displayed in genome browsers. The TFBScluster interface also provides a much more comprehensive analysis of candidate TFBS clusters (‘long’ analysis).
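
For readers unfamiliar with GFF, the sketch below writes one cluster as a tab-separated GFF record (sequence name, source, feature, start, end, score, strand, frame and an optional group field). The `source` and `feature` labels used here are placeholders chosen for illustration; the labels in actual TFBScluster output may differ.

```python
def cluster_to_gff(chrom, start, end, source="TFBScluster",
                   feature="TFBS_cluster", score=".", strand=".", group=None):
    """Format one cluster as a GFF line.

    Coordinates are 1-based and inclusive, as in GFF. The default source and
    feature strings are hypothetical placeholders, not documented output.
    """
    fields = [chrom, source, feature, str(start), str(end),
              str(score), strand, "."]  # frame is "." for non-coding features
    if group is not None:
        fields.append(group)
    return "\t".join(fields)


def clusters_to_gff(chrom, clusters):
    """Format a list of (start, end) clusters, one GFF line per cluster."""
    return "\n".join(cluster_to_gff(chrom, start, end) for start, end in clusters)
```

Because each record carries plain chromosomal coordinates, such a file can be loaded as a custom track in a genome browser or passed to other tools that read GFF.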

The results files for ‘long’ analysis include information for the genes that have been localized to such clusters of TFBSs and are therefore potentially regulated by them. Comprehensive information about potentially regulated genes is retrieved via the Ensembl API (Stabenau et al., 2004), and web-links to the UCSC genome browser are provided (see http://hscl.cimr.cam.ac.uk/sup_don05_app_api/TFBScluster_ex_file.html for examples of result files). Consequently, lists of candidate TFBS clusters can be converted into lists of ‘localized’ genes with the relevant gene identifiers (Swiss-Prot/Locuslink accessions) to perform downstream analysis based, for example, on gene function (Gene Ontology). Moreover, ‘long’ analysis generates comprehensive information on the local sequence surrounding each cluster. An alignment is stored for the predicted cluster sequence as well as an extended sequence containing 350 nt on either side of the cluster. The TFBScluster web interface provides the option to search these alignments in order to retain or exclude cluster candidates, based on the presence of a user-supplied DNA sequence motif (any user-defined IUPAC string) or a pattern of motifs.
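
The motif-based retention or exclusion of clusters can be sketched in Python as below. This is a simplified illustration under stated assumptions: it searches a single (human) sequence rather than the stored multi-species alignment, it scans both strands, and all function names are hypothetical.

```python
import re

# Standard IUPAC nucleotide codes as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement of every IUPAC code (e.g. R = A/G pairs with Y = C/T).
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(motif):
    """Reverse complement of an IUPAC string."""
    return motif.upper().translate(COMPLEMENT)[::-1]


def has_motif(sequence, motif, both_strands=True):
    """True if the IUPAC motif occurs in the sequence (optionally either strand)."""
    motifs = [motif.upper()]
    if both_strands:
        motifs.append(reverse_complement(motif))
    seq = sequence.upper()
    return any(re.search("".join(IUPAC[c] for c in m), seq) for m in motifs)


def extended_region(chrom_seq, start, end, flank=350):
    """Cluster sequence plus `flank` nt on either side (1-based inclusive input)."""
    return chrom_seq[max(0, start - 1 - flank):min(len(chrom_seq), end + flank)]


def filter_clusters(clusters, chrom_seq, motif, keep_if_present=True, flank=350):
    """Retain clusters whose extended region contains (or lacks) the motif."""
    kept = []
    for start, end in clusters:
        present = has_motif(extended_region(chrom_seq, start, end, flank), motif)
        if present == keep_if_present:
            kept.append((start, end))
    return kept
```

Setting `keep_if_present=False` turns the same search into an exclusion filter, and `flank=0` restricts it to the cluster sequence itself.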

TFBScluster.pl and the associated programs are written in PERL and are accessible via a PERL CGI interface on a web server, hosted by the University of Cambridge. The user HTML input pages contain default parameters where appropriate. Through its ease of use and streamlined provision of information required for functional validation of candidate gene regulatory regions, TFBScluster has the potential to contribute to deciphering the transcriptional networks controlling a wide range of human developmental processes. Finally, TFBScluster can be easily adapted to analyse additional genomes as long as they have been annotated in Ensembl. The development of a mouse-centric resource is currently underway.


    Acknowledgments
 
We are grateful for many stimulating discussions with Tony Green and Josette-Renée Landry (Cambridge University) and Noel Buckley, Alex Bruce and Michael Sadowski (Leeds University). Work in the authors' laboratory is funded by the Leukaemia Research Fund, BBSRC, Cambridge MIT Institute and an IBM SUR equipment grant.

Received on January 20, 2005; revised on March 18, 2005; accepted on April 21, 2005

    REFERENCES

    Berman, B.P., et al. (2002) Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome. Proc. Natl Acad. Sci. USA, 99, 757–762.

    Chapman, M.A., et al. (2004) Analysis of multiple genomic sequence alignments: a web resource, online tools, and lessons learned from analysis of mammalian SCL loci. Genome Res., 14, 313–318.

    Dieterich, C., et al. (2005) Comparative promoter region analysis powered by CORG. BMC Genomics, 6, 24.

    Donaldson, I.J., et al. (2005) Genome-wide identification of cis-regulatory sequences controlling blood and endothelial development. Hum. Mol. Genet., 14, 595–601.

    Frith, M.C., et al. (2003) Cluster-Buster: finding dense clusters of motifs in DNA sequences. Nucleic Acids Res., 31, 3666–3668.

    Grad, Y.H., et al. (2004) Prediction of similarly acting cis-regulatory modules by subsequence profiling and comparative genomics in Drosophila melanogaster and D. pseudoobscura. Bioinformatics, 20, 2738–2750.

    Heinemeyer, T., et al. (1999) Expanding the TRANSFAC database towards an expert system of regulatory molecular mechanisms. Nucleic Acids Res., 27, 318–322.

    Jegga, A.G., et al. (2002) Detection and visualization of compositionally similar cis-regulatory element clusters in orthologous and coordinately controlled genes. Genome Res., 12, 1408–1417.

    Johansson, O., et al. (2003) Identification of functional clusters of transcription factor binding motifs in genome sequences: the MSCAN algorithm. Bioinformatics, 19, Suppl. 1, i169–i176.

    Kolbe, D., et al. (2004) Regulatory potential scores from genome-wide three-way alignments of human, mouse, and rat. Genome Res., 14, 700–707.

    Krivan, W. and Wasserman, W.W. (2001) A predictive model for regulatory sequences directing liver-specific transcription. Genome Res., 11, 1559–1566.

    Lenhard, B., et al. (2003) Identification of conserved regulatory elements by comparative genome analysis. J. Biol., 2, 13.

    Loots, G.G. and Ovcharenko, I. (2004) rVISTA 2.0: evolutionary analysis of transcription factor binding sites. Nucleic Acids Res., 32, W217–W221.

    Markstein, M., et al. (2002) Genome-wide analysis of clustered dorsal binding sites identifies putative target genes in the Drosophila embryo. Proc. Natl Acad. Sci. USA, 99, 763–768.

    Sosinsky, A., et al. (2003) Target Explorer: an automated tool for the identification of new target genes for a specified set of transcription factors. Nucleic Acids Res., 31, 3589–3592.

    Stabenau, A., et al. (2004) The Ensembl core software libraries. Genome Res., 14, 929–933.

    Wasserman, W.W. and Fickett, J.W. (1998) Identification of regulatory regions which confer muscle-specific gene expression. J. Mol. Biol., 278, 167–181.

    Yu, H., et al. (2004) Cluster Analyzer for Transcription Sites (CATS): a C++-based program for identifying clustered transcription factor binding sites. Bioinformatics, 20, 1198–1200.





