Bioinformatics Advance Access originally published online on February 24, 2006
Bioinformatics 2006 22(8):999-1001; doi:10.1093/bioinformatics/btl062
START: an automated tool for serial analysis of chromatin occupancy data

1 Children's Hospital Informatics Program at the Harvard-MIT Division of Health Sciences and Technology, Harvard Medical School 300 Longwood Avenue, Boston, MA 02115, USA
2 Division of Neuroscience, Children's Hospital Boston, Harvard Medical School 300 Longwood Avenue, Boston, MA 02115, USA
*To whom correspondence should be addressed.
ABSTRACT
Summary: The serial analysis of chromatin occupancy technique (SACO) promises to become a widely used method for the unbiased genome-wide experimental identification of loci bound by a transcription factor of interest. We describe the first web-based automatic tool, termed sequence tag analysis and reporting tool (START), for processing SACO data generated by experiments performed for the yeast, fruit fly, mouse, rat or human genomes. The program uses as input sequences of inserts from a SACO library from which it extracts all SACO tags, maps them to genomic locations and annotates them. START returns detailed information about these tags including the genes, the genomic elements and the miRNA precursors found in their vicinity, and makes use of the MAPPER database to identify putative transcription factor binding sites located close to the tags.
Availability: The program is available at http://bio.chip.org/start/
Contact: vdmarinescu{at}chip.org
Supplementary information: Supplementary information is available at http://bio.chip.org/doc/start/START-supplementary.pdf
INTRODUCTION
Among the experimental techniques for identifying genome-wide targets for a transcription factor (TF) of interest, the recently introduced serial analysis of chromatin occupancy (SACO) (Impey et al., 2004), also termed GMAT (Roh et al., 2004), STAGE (Kim et al., 2005) and SABE (Chen and Sadowski, 2005), has emerged as a promising choice given its ability to combine the specificity of chromatin immunoprecipitation (ChIP) (Weinmann and Farnham, 2002) with the sensitivity of serial analysis of gene expression (SAGE) (Velculescu et al., 1995). In contrast to ChIP-on-chip assays that require building an array of promoter sequences that can potentially bind a given TF (Ren et al., 2000), SACO allows the unbiased genome-wide identification of all genomic targets for the TF. The technique consists of a first (ChIP) step in which the TF is cross-linked in vivo to its genomic targets, and the DNA is sheared by sonication, isolated using a highly specific antibody for the TF, released from the protein-DNA complex by reverse cross-linking and purified. In a second (SAGE-like) step the DNA fragments are amplified by ligation-mediated PCR and digested with the NlaIII endonuclease (a four-base-cutter enzyme that recognizes the sequence CATG). The resulting DNA fragments are ligated to linkers containing a recognition site for the MmeI endonuclease and then cleaved with MmeI to release 21 bp DNA fragments (referred to as tags) containing an NlaIII (CATG) site; these tags are ligated end-to-end into ditags, concatenated and subcloned into sequencing vectors, thus generating the SACO library (Impey et al., 2004; Roh et al., 2004; Kim et al., 2005).
The initial result of a SACO experiment is a collection of sequenced inserts that contain a large number of 21 bp tags. Each tag labels a genomic fragment of ~1 kb (depending on the sonication resolution in the ChIP step) that contains the tag and one or several sites to which the TF bound. Further processing of these data is based on the assumption that the 21 bp sequence of the tag can specifically identify a chromosomal location in an entire genome (Impey et al., 2004; Roh et al., 2004) and, therefore, lead to the identification of the genes and genomic regions targeted by the TF. The computational processing of this type of information is challenging due to the vast search space of all putative tags in the genome (around 20 million for the human and mouse genomes) and the need to integrate complex genomic annotations in order to identify the genomic features in the vicinity of the tags. Such analyses have been done so far either by manually processing a limited number of tags (Chen and Sadowski, 2005) or with custom scripts made available by the authors who developed the technique (Impey et al., 2004; Roh et al., 2004; Kim et al., 2005). These implementations are usually restricted to a small number of genomes whose annotations are not always promptly updated, consist of programs that have to be run from the command line and, most importantly, output a limited amount of information (usually restricted to the location of the tags) that has to be further processed in order to generate interpretable results. Here we present the first web-based tool, termed sequence tag analysis and reporting tool (START), for the comprehensive analysis of SACO data, applicable to experiments performed in the yeast, fruit fly, mouse, rat and human genomes.
START, available at http://bio.chip.org/start/, returns a detailed and comprehensive set of results pertaining to the location of the tags with respect to genes, regulatory elements (predicted binding sites for the TF of interest), other genomic elements (CpG islands, miRNA precursors) as well as general statistics for the distribution of tags and statistical measures of significance for the genes found as targets.
RESULTS
Genomic sequence processing and annotation
The most recent genomic sequences and annotations for Saccharomyces cerevisiae (sacCer1), Drosophila melanogaster (dm2 Release 4), Mus musculus (mm7), Rattus norvegicus (rn3) and Homo sapiens (hg17) were downloaded from the UCSC Genome Browser (Karolchik et al., 2003). Each genome was processed to extract (in forward and reverse orientations) the list of all possible 21 bp tag sequences that start with the string "CATG" (the NlaIII recognition site marking the beginning of a SACO tag sequence). A large proportion of genomic tag sequences (~90%) occur in only one location in the genome, but many tags are repeated several times. Tags occurring more than five times in a genome at different locations were eliminated from the analysis. Table 1 in the Supplementary Material presents the total number of tags and the average number of tags per gene found in the analysed genomes. Data on evolutionary conservation and location of CpG islands in the genomes were obtained from the UCSC Genome Browser (Karolchik et al., 2003), the location of miRNA precursors was downloaded from miRBase (Griffiths-Jones, 2004), and the positions of putative TF binding sites found in the upstream regulatory regions of the human, mouse and Drosophila genes were retrieved using the MAPPER database (Marinescu et al., 2005).
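The genome preprocessing step described above can be illustrated with a short Python sketch. It is not the START implementation (which is not published in this article); the function names, the in-memory dictionary and the `max_copies` parameter are assumptions made for illustration. The sketch enumerates every 21 bp tag that starts with CATG on either strand and drops tags found at more than five locations.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def index_genomic_tags(chromosomes, tag_len=21, site="CATG", max_copies=5):
    """Map each candidate tag to its (chromosome, start, strand) locations.

    `chromosomes` is a dict of name -> sequence (hypothetical input format).
    Tags found at more than `max_copies` locations are discarded.
    """
    valid = set("ACGT")
    locations = defaultdict(list)
    for name, seq in chromosomes.items():
        seq = seq.upper()
        pos = seq.find(site)
        while pos != -1:
            # Forward-strand tag: the site plus the bases downstream of it.
            fwd = seq[pos:pos + tag_len]
            if len(fwd) == tag_len and set(fwd) <= valid:
                locations[fwd].append((name, pos, "+"))
            # Reverse-strand tag: CATG is palindromic, so the same site
            # anchors a tag that reads upstream on the opposite strand.
            start = pos + len(site) - tag_len
            if start >= 0:
                rev = reverse_complement(seq[start:pos + len(site)])
                if set(rev) <= valid:
                    locations[rev].append((name, start, "-"))
            pos = seq.find(site, pos + 1)
    return {tag: locs for tag, locs in locations.items()
            if len(locs) <= max_copies}
```

For a mammalian genome the number of candidate tags is around 20 million, so a production version would more plausibly stream the sequence and store the tags in a database than hold everything in a Python dictionary.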
Input format and analysis parameters
The START input is a single file or Zip archive containing the sequences of inserts from a SACO library in FASTA format. The program searches the sequences in forward and reverse orientations for ditags of 42–46 bp (to accommodate potential cloning artifacts) that begin and end with the CATG string and reads a 17 bp sequence for the tag following this string. As exemplified in Figure 1 of the Supplementary Material, the run parameters to be supplied by the user include the genome to be searched, whether the match between an input tag and a genomic tag should be exact or allow one mismatch, and the maximum distance allowed between a tag and the nearest evolutionarily conserved element (see the START online Help pages for details) or CpG island. As one of the main purposes of a SACO experiment is to identify the target genes for the TF of interest, START reports the genes containing tags within a user-specified window upstream or downstream of their transcript. We provide a measure of statistical significance for each gene based on a standard normal approximation to hypothesis testing on a proportion (Fleiss et al., 2004) that indicates whether the proportion of input tags mapping close to a gene is significantly different from the proportion of tags expected to map close to that gene if n tags (equal to the total number of tags in the input) were selected randomly from the genome (see the Supplementary Material and the START online Help pages for details).
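Two steps in this paragraph lend themselves to a small Python sketch: the extraction of tags from ditags and the per-gene significance test. Both functions are illustrative assumptions, not the START code. `extract_tags` assumes that a ditag is delimited by two consecutive CATG occurrences and that the second tag is read on the opposite strand. `proportion_z_test` uses the textbook one-sample z statistic for a proportion with a two-sided P-value; the exact formulation used by START is given in its Supplementary Material and may differ.

```python
import math

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_tags(insert, site="CATG", tag_len=21, min_ditag=42, max_ditag=46):
    """Return the tags found in one sequenced insert (hypothetical helper)."""
    insert = insert.upper()
    # Record the position of every CATG anchoring site in the insert.
    sites = []
    pos = insert.find(site)
    while pos != -1:
        sites.append(pos)
        pos = insert.find(site, pos + 1)
    tags = []
    # Each pair of consecutive sites delimits one candidate ditag.
    for left, right in zip(sites, sites[1:]):
        ditag = insert[left:right + len(site)]
        if min_ditag <= len(ditag) <= max_ditag:
            tags.append(ditag[:tag_len])                       # CATG + 17 bp
            tags.append(reverse_complement(ditag[-tag_len:]))  # opposite strand
    return tags


def proportion_z_test(k, n, p0):
    """z statistic and two-sided P-value for k of n input tags near a gene.

    p0 is the proportion expected by chance, e.g. the gene's share of all
    genomic tags (an assumption about how the null proportion is obtained).
    """
    p_hat = k / n
    standard_error = math.sqrt(p0 * (1.0 - p0) / n)
    z = (p_hat - p0) / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value
```

Because each ditag yields one tag per strand, running `extract_tags` on the reverse complement of an insert returns the same set of tags, so a single pass is enough in this simplified model.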
The optional output selection allows the user to restrict the analysis to a set of genes (e.g. genes found to be differentially expressed in a given experimental condition; genes that are labeled by tags that fall within their 5' or 3' regions) and to produce additional files that contain the following information: the tags that are repeated a given number of times in the input (the higher this number the higher the confidence that the tag uncovers a real binding locus for the TF); tags that cluster within a given distance from each other and are found within a specified and tunable window from genes, and tags that map within a given distance from a miRNA precursor. Although little is known about the sequences and factors that orchestrate the regulation of miRNA transcription, recent evidence suggests a functional role for the presence of binding sites for a TF in the proximity of a miRNA precursor (Vo et al., 2005). In addition, based on the TF binding site models and the results of the MAPPER database, START can optionally return the predictions for all or selected MAPPER models around the tags that fall within the 5' regulatory regions of the genes.
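As a rough illustration of two of these optional outputs, the Python sketch below counts repeated tags and groups mapped tag positions into clusters. The single-linkage rule used here (neighbouring tags closer than `max_gap` join the same cluster) is an assumption; START's own definition of a cluster is documented in its Help pages, and the function names are hypothetical.

```python
from collections import Counter


def repeated_tags(tags, min_count=2):
    """Return the tags seen at least `min_count` times in the input."""
    counts = Counter(tags)
    return {tag: n for tag, n in counts.items() if n >= min_count}


def cluster_positions(positions, max_gap):
    """Group tag positions on one chromosome into clusters.

    A position joins the current cluster when it lies within `max_gap` bp
    of the previous position; single-tag groups are not reported.
    """
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return [cluster for cluster in clusters if len(cluster) > 1]
```

Calling `cluster_positions` once per chromosome keeps tags on different chromosomes from being merged by coordinate alone.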
Since a typical run for a genome-wide map of ~50 000 tags takes on the order of 3 h, a status page allows the user to monitor its progress and, at the completion of the analysis, a link to a page containing the complete results is returned to the user by e-mail. The results are saved on the server in private individual accounts from where they can be accessed and downloaded at any time. A sample input file containing sequences of 100 clones from a SACO library is provided on the entry page of the application. These sequences are part of the results of a SACO experiment for the TF MEF2 performed in mouse (CD1) hippocampal neurons. These cells were cultured for 10 days in vitro and depolarized by KCl for 1 h before formaldehyde fixation. ChIP was carried out using a MEF2A antibody (C-21; Santa Cruz Biotech). Immunoprecipitated DNA fragments were then processed to generate the pool of ditags. The resulting ditags were concatemerized (each concatemer typically contains 10–20 ditags) and subcloned into the SphI-cut pZero-1 vector (Invitrogen). Vectors were transformed by electroporation into bacteria (E.cloni 10G SUPREME DUOs; Lucigen Corp.) and then processed by a commercial service (Seqwright) for cell plating, robotic colony picking, DNA isolation and sequencing using the M13 reverse sequencing primer (5'-CAGGAAACAGCTATGAC-3').
Output format and analysis options
The START output is organized as a collection of tab-delimited files structured so that they could be easily imported and used in a local relational database (e.g. Microsoft Access or MySQL). The source data files (tags in input, tags in genome) present a comprehensive account of all tags and their location in the input and in the genome. Each tag is assigned a unique number that is computed with a hash function from the actual nucleotide sequence of the tag, so that results of different runs can be easily compared. The tab-delimited START output files can be used in a relational database to narrow down the candidate targets for further biological validation through suitable queries. For example, a query could ask for all genes that are identified by at least five tags that map in the promoter of the gene within an evolutionarily conserved region, tags that form at least one cluster and have predicted binding sites for the TF of interest in their vicinity. Thus, START is instrumental in identifying novel targets for a TF, in evaluating the binding of a TF to chromatin in different experimental conditions or in intersecting binding and functional data by restricting the analysis to sets of genes selected based on functional assays (e.g. microarray experiments).
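The kind of query described above can be sketched in Python with the standard `sqlite3` module. The table layout, column names and the 2-bit tag encoding are assumptions made for illustration: the article does not specify START's file columns or its hash function. The example covers only the first part of the query, namely genes labelled by a minimum number of distinct tags that map to a conserved promoter region.

```python
import sqlite3


def tag_id(tag):
    """Encode a tag as an integer, 2 bits per base (hypothetical scheme).

    The value depends only on the sequence, so it is stable across runs.
    """
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    value = 0
    for base in tag:
        value = value * 4 + code[base]
    return value


def genes_with_min_tags(rows, min_tags):
    """Return (gene, n_tags) for genes with enough conserved promoter tags.

    `rows` holds (tag sequence, gene, region, conserved) tuples, a stand-in
    for records imported from the tab-delimited output files.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tag_gene "
        "(tag_id INTEGER, gene TEXT, region TEXT, conserved INTEGER)"
    )
    conn.executemany(
        "INSERT INTO tag_gene VALUES (?, ?, ?, ?)",
        [(tag_id(tag), gene, region, int(conserved))
         for tag, gene, region, conserved in rows],
    )
    result = conn.execute(
        "SELECT gene, COUNT(DISTINCT tag_id) AS n_tags "
        "FROM tag_gene "
        "WHERE region = 'promoter' AND conserved = 1 "
        "GROUP BY gene "
        "HAVING COUNT(DISTINCT tag_id) >= ? "
        "ORDER BY gene",
        (min_tags,),
    ).fetchall()
    conn.close()
    return result
```

Further conditions from the example (cluster membership, predicted binding sites nearby) would be added as joins against the corresponding output tables.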
In the experiment described above (T.-K. Kim et al., manuscript in preparation), we are currently constructing two SACO libraries in order to identify MEF2 gene targets in neurons in the presence and absence of neuronal activity (known to induce the expression of several hundred genes that are critical for brain development and synaptic remodeling). MEF2 is highly enriched in muscle and brain tissues (McKinsey et al., 2001), and belongs to a group of TFs that mediate activity-dependent transcription. SACO analyses performed in both non-depolarized and depolarized neurons might reveal the regulated binding of MEF2 to gene targets in response to neuronal activity, and thus, provide a better understanding of the biological consequences of MEF2-dependent gene expression in the nervous system.
DISCUSSION
In recent years, the large-scale mapping of TF binding to chromatin was made possible by the use of ChIP-on-chip approaches in which DNA fragments selected by ChIP are hybridized on custom-made DNA microarrays (Ren et al., 2000). For mammalian genomes, such microarrays conventionally contain only sequences of proximal promoters (typically within 1 kb of the transcription start site) of currently annotated genes, an experimental choice that is inherently biased. Recently, high-density tiled microarrays that provide coverage at a high resolution of the non-repetitive sequences of a large portion of the human genome have been developed (Cawley et al., 2004). Their use allowed the unbiased genome-wide location analysis for selected TFs and has unambiguously revealed that only a small fraction of these sites are located within proximal promoters (Cawley et al., 2004). However, performing a complete genome-wide analysis of TF binding using tiled microarrays would be prohibitively expensive for sizeable genomes such as the mammalian ones given the cost required for constructing such arrays. Therefore, ChIP-sequencing-based methods, such as SACO, have been developed as an alternative way of investigating the genome-wide mapping of a TF of interest in an unbiased, yet cost-effective and sensitive manner. This approach is becoming a more popular choice for generating genome-wide TF location data and a growing number of studies are being published that use and refine it. Our software can be easily applied to most of these ChIP-sequencing-based methods with either no or only minor modifications. 
START can be used for experiments performed in the yeast, fruit fly, mouse, rat or human genomes that use the LongSAGE linkers (Saha et al., 2002) for ligation to the ends of the DNA fragments and the NlaIII endonuclease as a four-base-cutter; these conditions are true for the SACO, GMAT and STAGE methods (but not for SABE, since it uses a different four-base-cutter restriction endonuclease, TaiI, in the protocol). However, since the process of generating the internal START database is totally automated we will be able to accommodate experimental variations from this protocol (e.g. the use of a different recognition site in ditags) if necessary.
The START data processing is based on integrating a large number of genomic annotations and resources and, most significantly, TF binding site prediction data for a large number of genomes. These features, combined with its powerful and user-friendly web interface, make START a valuable resource for the large-scale experimental study of transcriptional regulation.
Acknowledgments
The authors thank Prof. Paola Sebastiani (Department of Biostatistics, Boston University) for her advice on the statistical significance test for the tagged genes.
Conflict of Interest: none declared.
FOOTNOTES
Present address: Department of Molecular Genetics and Microbiology, University of Florida ARBR1-291, P.O. Box 100266, Gainesville, FL 32610, USA
Associate Editor: Martin Bishop
Received on December 19, 2005; revised on February 15, 2006; accepted on February 18, 2006
REFERENCES
Cawley, S., et al. (2004) Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell, 116, 499–509.
Chen, J. and Sadowski, I. (2005) Identification of the mismatch repair genes PMS2 and MLH1 as p53 target genes by using serial analysis of binding elements. Proc. Natl Acad. Sci. USA, 102, 4813–4818.
Fleiss, J.L., Levin, B. and Paik, M.C. (2004) Statistical Methods for Rates and Proportions. John Wiley and Sons, Inc., Hoboken, NJ.
Griffiths-Jones, S. (2004) The microRNA Registry. Nucleic Acids Res., 32, D109–D111.
Impey, S., et al. (2004) Defining the CREB regulon: a genome-wide analysis of transcription factor regulatory regions. Cell, 119, 1041–1054.
Karolchik, D., et al. (2003) The UCSC Genome Browser Database. Nucleic Acids Res., 31, 51–54.
Kim, J., et al. (2005) Mapping DNA-protein interactions in large genomes by sequence tag analysis of genomic enrichment. Nat. Methods, 2, 47–53.
Marinescu, V.D., et al. (2005) The MAPPER database: a multi-genome catalog of putative transcription factor binding sites. Nucleic Acids Res., 33, D91–D97.
McKinsey, T.A., et al. (2001) Control of muscle development by dueling HATs and HDACs. Curr. Opin. Genet. Dev., 11, 497–504.
Ren, B., et al. (2000) Genome-wide location and function of DNA binding proteins. Science, 290, 2306–2309.
Roh, T.Y., et al. (2004) High-resolution genome-wide mapping of histone modifications. Nat. Biotechnol., 22, 1013–1016.
Saha, S., et al. (2002) Using the transcriptome to annotate the genome. Nat. Biotechnol., 20, 508–512.
Velculescu, V.E., et al. (1995) Serial analysis of gene expression. Science, 270, 484–487.
Vo, N., et al. (2005) A cAMP-response element binding protein-induced microRNA regulates neuronal morphogenesis [Erratum (2006) Proc. Natl Acad. Sci. USA, 103, 825]. Proc. Natl Acad. Sci. USA, 102, 16426–16431.
Weinmann, A.S. and Farnham, P.J. (2002) Identification of unknown target genes of human transcription factors using chromatin immunoprecipitation. Methods, 26, 37–47.

