Bioinformatics Advance Access originally published online on February 4, 2005
Bioinformatics 2005 21(9):1789-1796; doi:10.1093/bioinformatics/bti307
A comparative analysis of relative occurrence of transcription factor binding sites in vertebrate genomes and gene promoter areas
1Center for Biomedical Genomics and BioInformatics, Molecular and Microbiology Department, College of Arts and Sciences, George Mason University Fairfax, VA 22031, USA
2Vavilov Institute of General Genetics Gubkina str, 3, GSP-1, 111991, Moscow, Russia
3Russian Center of Haematology Moscow 125167, Russia
*To whom correspondence should be addressed.
| Abstract |
|---|
Motivation: The detection of transcription factor binding sites (TFBS) in genomic sequences is a basic task for elucidating the transcriptional aspects of gene regulation. Evaluation procedures applicable to the TFBS prediction outputs need improvement. Predicted TFBS located outside of the transcription associated areas are often neglected from the functional and the evolutionary points of view, therefore deserving a systematic overview.
Results: We calculated theoretical occurrences of 184 TFBS according to their position weight matrices and the dinucleotide statistics of the completed vertebrate genomes, then performed a TFBS prediction in the corresponding complete genomic sequences and their repeat-free, repetitive and regulatory fractions. Repeat-free fractions of the closely related mammalian genomes were characterized by strong similarities in TFBS occurrences. A significant over-representation of multiple TFBS was found in both repetitive and non-repetitive genome fractions.
Availability: F-values and real TFBS occurrences calculated for human, chimp, mouse, rat, zebrafish and fugu genomes are available for free download at http://www.gmu.edu/departments/mmb/baranova/pages/bioinformatics
Contact: abaranov{at}gmu.edu
| 1 INTRODUCTION |
|---|
Understanding the regulation of gene expression is a crucial issue in molecular biology. Sequence-specific transcription factors (TFs) are required for initiation of transcription, as this process cannot be executed in the presence of eukaryotic RNA polymerase alone. Hundreds of transcription factors serve as tissue or timing specific inductors or repressors of transcription in the variety of vertebrate cells. Specificity of each TF is defined by its interaction with specific DNA sequences located at the promoter or enhancer regions. The DNA recognition process, which is extremely selective, is mediated by non-covalent interactions between appropriately arranged structural motifs of the protein and exposed surfaces of the DNA bases and backbone (Vazquez et al., 2003). In the last few decades, attempts have been made to describe these mechanisms by general sets of rules and associated models (Benos et al., 2002). Most popular software algorithms which allow one to map the potential binding sites in the putative promoter region are based on position weight matrices (PWMs). These put the TFBS in a statistical framework, having positional base frequencies derived from binding site examples described in experimental papers (Harr et al., 1983; Murakami et al., 2004). In simple words, such search algorithms compare the known statistical pattern of bases in DNA with an unknown sequence, and calculate the statistical significance of the match for a given PWM at all positions in the unknown sequence (Harr et al., 1983). A list of all potential TF binding sites in the given DNA sequence with given probability threshold levels is obtained as an output file (Quandt et al., 1995).
TFBS are usually represented by relatively short (10-15 bp) nucleotide sequences, which leads to prediction of hundreds of potential TF sites, often overlapping with each other. To properly evaluate the results listed in the output file, researchers rely on random expectation values (re-values) indicating the theoretical occurrence of a given TF binding site in DNA. If the TFBS re-value approximately corresponds to the predicted occurrence of a given TFBS in the sequence of interest, then the results of the prediction are regarded as inconclusive. In contrast, when a re-value is significantly lower than its predicted occurrence, there is a high probability that the given TF is indeed able to bind to the sequence of interest in vivo. In the popular software MatInspector, re-values for each of the PWMs are defined as expectation values for the number of PWM matches per 1000 nucleotides of random DNA sequence (Quandt et al., 1995). Random DNA sequence is composed of four nucleotides (A, G, C and T) in equimolar quantities, whereas the real nucleotide composition of vertebrate genomes differs significantly from equimolar. Moreover, dinucleotide occurrences in real genomes are very far from random, as a classical methylation pathway leads to conversion of CpG to TpG and hence CpG deficiency and TpG abundance (Sved and Bird, 1990).
As multiple vertebrate genome sequences become available (Lander et al., 2001; Mouse Genome Sequencing Consortium, 2002; Pennisi, 2003; Rat Genome Sequencing Consortium, 2004), we performed recalculation of re-values for available PWMs on the basis of real dinucleotide occurrences in the vertebrate genomes. In some cases redefined re-values were significantly different from those reported in the MatInspector reference file (http://www.genomatix.de/cgi-bin/matinspector/). We also created the software application PWMatcher that allows us to predict all the potential TFBS in five vertebrate genomes and study their distributions in complete genomes, promoter areas, repeat-free areas and areas covered by repeats.
| 2 METHODS |
|---|
Recognition of the TF binding sites in vertebrate genomes was performed using PWMs available in the public database TRANSFAC 6.0 located at http://www.gene-regulation.com. This database contains information about species specificity of 184 transcription factors and their PWMs. Some TFBS correspond to multiple known PWMs, as during PWM construction sequences evidently binding to a given TF are grouped according to the evidence quality values ranging from 1 to 6 and reflecting the experimental reliability of a certain protein-DNA interaction (Quandt et al., 1995).
Complete genome sequences of human, mouse, rat and zebrafish, as well as a draft version of the chimpanzee genome were downloaded from the NCBI repository (ftp://ftp.ncbi.nih.gov/genebank/genomes/). Complete Fugu rubripes genome was downloaded from ftp://ftp.ensembl.org/pub/current_fugu/data/fasta/dna/. We also utilized NCBI databases containing information on the positions of genes and repeats in human, mouse and rat genomes. We downloaded masking_coordinates.gz files that list locations for segments of repetitive sequence in the human, mouse and rat genomic contigs (ftp://ftp.ncbi.nih.gov/genomes). All repeats corresponding to the listed coordinates were downloaded onto an accessory database DATACORD.REP and were classified as LINE, LINE mixed with SINE, LINE mixed with any other sort of repeats except SINE, SINE, SINE mixed with any other sort of repeats except LINE, and OTHER. The latter category included various transposons and their molecular remnants, simple repeats, low complexity regions, etc.
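The six-way classification of masked segments can be illustrated with a minimal Python sketch. The function name and the input (a collection of repeat family labels found in one segment) are hypothetical and not part of DATACORD.REP; in particular, assigning a segment that contains LINE, SINE and further repeat types to the category "LINE mixed with SINE" is our assumption, as the text does not state how such segments were handled.

```python
def classify_repeat(families):
    """Assign a masked segment to one of six categories.

    `families` is an iterable of repeat family labels found in the segment,
    e.g. ["LINE"], ["SINE", "simple_repeat"]. Labels other than LINE and SINE
    are all treated as "other" repeats.
    """
    fams = {f.upper() for f in families}
    has_line = "LINE" in fams
    has_sine = "SINE" in fams
    has_other = bool(fams - {"LINE", "SINE"})
    if has_line and has_sine:
        # Assumption: any LINE + SINE segment lands here, even with other repeats.
        return "LINE mixed with SINE"
    if has_line:
        return "LINE mixed with other" if has_other else "LINE"
    if has_sine:
        return "SINE mixed with other" if has_other else "SINE"
    return "OTHER"
```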
The absolute coordinates for the first 5' gene exons were calculated using MapView. These coordinates represent the exact nucleotide distance between the first sequenced nucleotide of the chromosome and the first nucleotide of the first mapped exon. We used the classic definition of a promoter as the gene area located between the positions -2000 and +1 relative to the major transcription start site, retrieved these promoter fragments for human, mouse and rat from GenBank, and systematized them in the data storage facility DATACORD, which has been developed specially for this purpose.
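For illustration, retrieval of such a promoter window from a chromosome sequence might look like the Python sketch below. It is not the DATACORD implementation: the function names are hypothetical, coordinates are assumed to be 1-based, and the handling of genes on the reverse strand (taking the window downstream in chromosome coordinates and reverse-complementing it) is our assumption. With no position 0 between -1 and +1, the window is 2001 nt long.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def extract_promoter(chrom_seq, tss, strand="+", upstream=2000):
    """Return the -upstream..+1 window around a transcription start site.

    `tss` is the 1-based chromosome coordinate of position +1.
    The window is clipped at the chromosome ends.
    """
    if strand == "+":
        start, end = max(1, tss - upstream), tss
        return chrom_seq[start - 1:end]
    if strand == "-":
        start, end = tss, min(len(chrom_seq), tss + upstream)
        return reverse_complement(chrom_seq[start - 1:end])
    raise ValueError("strand must be '+' or '-'")
```

For example, with `upstream=3` the call `extract_promoter("AAAACGTTTT", 5)` returns the four nucleotides at chromosome positions 2 to 5.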
We calculated dinucleotide frequencies for complete human, chimp, mouse, rat, zebrafish and fugu genomes as well as for separated regulatory fractions represented by human, mouse and rat promoter regions stored in DATACORD facility. Artificial nucleotide sequences (1 Gb each) were generated to emulate genomes and genome fractions stated above by an order-1 Markov model essentially as described by Gorbachev et al. (2002). F-values corresponding to frequencies of TFBS findings in artificially generated genomic sequences were calculated using software application PWMatcher that allows TFBS prediction according to weight matrix algorithm similar to the one used in software application MatInspector (Quandt et al., 1995).
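The generation of an artificial sequence that reproduces observed dinucleotide frequencies can be sketched in Python as follows. This is a minimal illustration of an order-1 Markov chain, not the procedure of Gorbachev et al. (2002) itself; all function names are hypothetical, and the choice of the first nucleotide and the uniform fallback for a nucleotide with no observed successors are our assumptions.

```python
import random
from collections import Counter

BASES = "ACGT"

def dinucleotide_counts(seq):
    """Count overlapping dinucleotides, skipping any that contain non-ACGT symbols."""
    seq = seq.upper()
    pairs = (seq[i:i + 2] for i in range(len(seq) - 1))
    return Counter(p for p in pairs if set(p) <= set(BASES))

def transition_table(counts):
    """Convert dinucleotide counts into P(next base | current base)."""
    table = {}
    for a in BASES:
        row = [counts.get(a + b, 0) for b in BASES]
        total = sum(row)
        # Assumption: fall back to a uniform row if base `a` was never followed by anything.
        table[a] = [c / total for c in row] if total else [0.25] * 4
    return table

def generate_sequence(counts, length, seed=0):
    """Emit an artificial sequence of `length` nt from an order-1 Markov chain."""
    rng = random.Random(seed)
    table = transition_table(counts)
    # First base is drawn in proportion to how often each base starts a dinucleotide.
    first_weights = [sum(counts.get(a + b, 0) for b in BASES) for a in BASES]
    current = rng.choices(BASES, weights=first_weights)[0]
    out = [current]
    for _ in range(length - 1):
        current = rng.choices(BASES, weights=table[current])[0]
        out.append(current)
    return "".join(out)
```

F-values would then be obtained by running the TFBS search on the generated sequence and dividing the number of matches by its length in kb.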
With PWMatcher we performed the prediction of all TF binding sites with matrix similarity more than 0.85 and core similarity more than 0.75. The core similarity was calculated for the highest conserved positions of the matrix (usually four positions). The maximum core similarity of 1.0 is only reached when the highest conserved bases of a matrix match exactly in the sequence. Matrix similarity was calculated for all positions included in the TFBS matrix with a maximum of 1.0, with mismatches in highly conserved positions of the matrix decreasing the matrix similarity more than mismatches in less conserved regions. The matrix similarity and core similarity were calculated as described earlier (Quandt et al., 1995). PWMatcher software is available at our group's web-site and ftp-server: http://www.gmu.edu/departments/mmb/baranova/pages/bioinformatics; ftp://194.67.85.195/Database/TF_analysis/
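The two thresholds can be illustrated with a simplified Python sketch of a weight matrix scan. It is not the PWMatcher or MatInspector code: the conservation weight used here (ln 4 plus the sum of p ln p over a column) and the choice of the core as the four consecutive positions with the highest total weight are assumptions made for illustration, all names are hypothetical, and only the forward strand is scanned. A matrix is given as a list of columns of counts in the order A, C, G, T.

```python
import math

BASES = "ACGT"

def conservation(column):
    """Illustrative weight: 0 for a uniform column, ln(4) for a fully conserved one."""
    total = sum(column)
    return math.log(4) + sum((c / total) * math.log(c / total) for c in column if c > 0)

def core_start(matrix, core_len=4):
    """Start of the consecutive block with the highest summed conservation (assumption)."""
    ci = [conservation(col) for col in matrix]
    return max(range(len(matrix) - core_len + 1), key=lambda s: sum(ci[s:s + core_len]))

def similarity(matrix, site, weights):
    """Weighted score of `site` divided by the best achievable score (maximum 1.0)."""
    num = sum(w * col[BASES.index(b)] for w, col, b in zip(weights, matrix, site))
    den = sum(w * max(col) for w, col in zip(weights, matrix))
    return num / den if den else 0.0

def scan(matrix, seq, matrix_cut=0.85, core_cut=0.75, core_len=4):
    """Return (position, core similarity, matrix similarity) for hits above both cut-offs."""
    n = len(matrix)
    ci = [conservation(col) for col in matrix]
    s = core_start(matrix, core_len)
    hits = []
    for i in range(len(seq) - n + 1):
        site = seq[i:i + n].upper()
        if any(b not in BASES for b in site):
            continue  # skip windows containing N or other symbols
        core = similarity(matrix[s:s + core_len], site[s:s + core_len], [1.0] * core_len)
        if core <= core_cut:
            continue  # core check first, so most windows are rejected cheaply
        msim = similarity(matrix, site, ci)
        if msim > matrix_cut:
            hits.append((i, round(core, 3), round(msim, 3)))
    return hits
```

Checking the core first mirrors the two-step logic described above: a mismatch in the most conserved positions rejects a window before the full matrix similarity is computed.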
We also calculated F-values and occurrences for the invertebrate Drosophila melanogaster and for two sets of bacterial genomes. According to their GC content, bacterial genomes were chosen from the database described earlier (Sandberg et al., 2003). BACSET I contained genomes of Archaeoglobus fulgidus (2 178 400 nt in size), Synechocystis (3 573 470 nt) and Vibrio cholerae (4 033 462 nt), each characterized by a GC content close to the GC content of human promoters. BACSET II contained genomes of Chlamydia trachomatis (1 042 519 nt), Chlamydophila pneumoniae (1 225 935 nt) and Pasteurella multocida (2 257 487 nt), each characterized by a GC content close to the GC content of the human genome.
| 3 RESULTS |
|---|
3.1 Calculation of TF re-values in artificial genomic sequences with dinucleotide content corresponding to real vertebrate genomes
We calculated the GC content in the complete human genome and in promoter areas of human genes defined as the nucleotide sequence corresponding to positions -2000 to +1 relative to the major transcription start site. Nucleotide sequences corresponding to 23 887 human promoters were retrieved from GenBank relying on MapView absolute coordinates and were organized in an accessory database DATACORD. As we expected, the percentage of GC nucleotides in promoter areas was significantly higher compared to the complete sequence of the human genome (48.08% G+C in human promoters versus 40.88% G+C in the human genome). The relative and theoretical occurrences of dinucleotides in the human genome and in promoter areas are summarized in Table 1. We compared dinucleotide occurrences between downloaded sequences and artificial sequences with the same nucleotide content. As described before (Holliday and Grigg, 1993), depletion of CpG dinucleotides due to an enhanced mutation rate of 5-methylcytosine was evident for both the human genome (0.99 versus 4.18%) and human promoters (2.59 versus 5.78%).
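The theoretical CpG values quoted above appear consistent with a simple independence model, which the following Python sketch makes explicit. It assumes that C and G are equally frequent and occur independently of their neighbours, so that the expected CpG frequency is the square of half the G+C fraction; this assumption and the function name are ours.

```python
def expected_cpg_percent(gc_percent):
    """Expected CpG frequency (%) if C and G are equally frequent and independent."""
    f = gc_percent / 100 / 2  # frequency of C, taken to be equal to that of G
    return 100 * f * f

for label, gc, observed in [("genome", 40.88, 0.99), ("promoters", 48.08, 2.59)]:
    expected = expected_cpg_percent(gc)
    # Columns: fraction label, expected CpG (%), observed/expected ratio
    print(label, round(expected, 2), round(observed / expected, 2))
# prints:
# genome 4.18 0.24
# promoters 5.78 0.45
```

Under this model CpG is present at roughly a quarter of its expected frequency in the complete genome and at somewhat less than half in promoters.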
We compared random-based values (random expectation, or re-values) having equal number of G, C, A and T nucleotides as described earlier (Quandt et al., 1995) and values that correspond to the real dinucleotide occurrences in the human genome (frequency-based values, or F-values), as the latter significantly differ from random A-C-G-T-uniform sequence (Table 1). For this purpose we generated two sets of artificial nucleotide sequences (ten 100 Mb-length sequences each) according to the described rules, and calculated all PWM matches in them.
For 17 of 184 TFs studied, F-values differ from re-values by more than tenfold (Supplementary Table 1). Two out of 17 TFBS, namely ROR-α and EVI1, have re-values which were 10 times lower than the F-values, indicating that the significance of PWM matching to the sequence of interest should be lowered considerably. On the other hand, for 15 out of 17 PWMs F-values were smaller than the corresponding re-values by an order of magnitude. For all these factors the significance of TFBS matches in the sequence of study was underestimated previously, especially for PAX6 and COMP1 TFBS with re-values that are 80 times higher than the corresponding F-values. Nine out of 15 PWMs of the latter type were found to contain CpG dinucleotides, including the PWM for the helix-loop-helix TF HEN1 that contains two CpGs. In total, 98 of 184 PWMs are characterized by F-values higher than expected and 86 PWMs by lower than expected. For 65 out of 184 TFs, re-values differ from F-values by at least threefold.
F-values calculated for human, chimp, mouse, rat, zebrafish and fugu genomes are represented in Supplementary Table 2 also available for free download at ftp://194.67.85.195/Database/TF_analysis/
3.2 Occurrences of TFBS in vertebrate genomes differ from calculated F-values
To study the distribution of TFBS in vertebrate genomes we downloaded genomic sequences of interest, including human, chimpanzee, mouse, rat, zebrafish and Drosophila genomes and calculated TFBS occurrences (O-values). For all studied genomes the differences between genome-specific F-values and O-values, represented by O versus F ratios (the ratio shows by how many times O and F differ), were significant, reaching an order of magnitude in some cases. In a significant proportion of cases genomic O-values were found to be much larger than the corresponding F-values. Five TFs, namely TP53, NRSF, EVI1, SP1 and GC box element, were characterized by real occurrences exceeding F-values in all five vertebrate genomes analyzed (Table 2). For each vertebrate genome studied the TFs mentioned earlier occupy the top five places in the TF PWM lists sorted according to O versus F ratios. TP53 binding sites in the fugu genome were the only exception to this rule, as its occurrence was not different from the F-value calculated on the basis of the fugu genome dinucleotide content.
We calculated average O versus F ratios in each species studied. The largest average ratios were characteristic for zebrafish (1.69), mouse (1.54) and human genomes (1.49), while the average ratios calculated for chimpanzee, fugu and rat were 1.45, 1.44 and 1.43, respectively. The average ratios calculated for three bacterial genomes (BACSET II) and for Drosophila were 1.15 and 1.33, respectively. The Drosophila genome could be placed between vertebrate and bacterial genomes according to the individual ratios calculated. Four out of five factors characterized by the highest O versus F ratios in vertebrates are found to be in the top 10 positions of the Drosophila TFBS list sorted according to ratios discussed.
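The ranking and averaging of O versus F ratios can be sketched in Python as follows. The function names are hypothetical and the numbers in the usage example are placeholders rather than values from our tables; the sketch takes the ratio as plain O/F, which is an assumption where the fold difference is meant in either direction.

```python
def of_ratios(observed, expected):
    """Ratio of observed (O) to theoretical (F) occurrences per kb for each PWM."""
    return {name: observed[name] / expected[name]
            for name in observed if expected.get(name)}

def rank_by_ratio(ratios):
    """PWM names sorted from the most to the least over-represented."""
    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)

def mean_ratio(ratios):
    """Average O versus F ratio over all PWMs of one genome."""
    return sum(ratios.values()) / len(ratios)

# Placeholder values for two hypothetical PWMs, in matches per kb.
observed = {"PWM_A": 0.5, "PWM_B": 0.25}
expected = {"PWM_A": 0.25, "PWM_B": 0.25}
ratios = of_ratios(observed, expected)
print(rank_by_ratio(ratios))  # [('PWM_A', 2.0), ('PWM_B', 1.0)]
print(mean_ratio(ratios))     # 1.5
```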
3.3 Closely related mammalian genomes are characterized by strong similarities in TFBS occurrences
At the time of the manuscript preparation (in summer, 2004) the chimpanzee genome was available and shared only as an incomplete draft composed of shotgun sequences not more than 2 kb in size each. Nevertheless, the similarity in occurrences of predicted transcription factor binding sites in human and chimpanzee genomes was almost perfect. The maximal differences in the genome O-values were characteristic of PWMs for SP1 factor (Ohum/Ochimp = 1.1095, where O is occurrence per 1000 bp), for early growth response 4 factor (EGR4), also known as NGFI-C (Ohum/Ochimp = 1.0978), and for GC box-like sequences (Ohum/Ochimp = 1.096). Such differences are rather small, but nevertheless they could not be explained solely by differences in the dinucleotide content of human and chimpanzee genomes, as ratios of the corresponding F-values (Fhum/Fchimp) for listed TFs are not the maximal ratios observed (data not shown). Maximal ratio of F-values (Fchimp/Fhum = 2.090) was observed for TP53 factor, in reality characterized by almost equal occurrences in human and chimpanzee genome (Ochimp/Ohum = 1.002). The latter observation could be explained by the extreme importance of preservation of all TP53 sites for proper functioning of the cell cycle in long living animals. O- and F-values for 10 transcription factors with the strongest differences in occurrence for human and chimpanzee genomes are summarized in Table 3.
Two other relatively close genomes, Mus musculus and Rattus norvegicus, were characterized by much stronger differences in TFBS distribution. Surprisingly, the maximal differences in both TFBS occurrences (Omouse/Orat) and ratios of corresponding F-values (Fmouse/Frat) were found for TP53, indicating a selective pressure toward accumulation of TP53 binding sites in the mouse lineage (Omouse/Orat = 6.70), that could be explained by accumulation of corresponding dinucleotides only partially (Fmouse/Frat = 2.0). Another TF with striking differences of TFBS distribution between mouse and rat genomes is EVI1 that has important roles in cell proliferation, vascularization and cell-specific developmental signaling at midgestation (Hoyt, 1997). The Omouse/Orat ratio for this factor is 2.22, while the Fmouse/Frat ratio is 1.02. This is also indicative of positive selection for EVI1 TFBS in the mouse lineage. On the other hand, binding sites for factors STAT1, STAT3 and V-JUN were positively selected in rat, as their occurrences were increased in the rat genome in comparison with the mouse genome. Data on the 10 transcription factors with the strongest differences in occurrence in the mouse and rat genomes are summarized in Table 4.
To perform a statistical evaluation of TFBS occurrences in vertebrate genomes we created density histograms for all 184 PWMs for a genomic interval in each of the chosen windows. All PWMs were subdivided into four groups: Group 1 encompassing 23 TFBS of lowest occurrences (less than 0.01 TFBS per kb), Group 2 containing 37 TFBS of relatively low occurrence (from 0.01 to 0.1 TFBS per kb), Group 3 of 73 TFBS of moderate occurrence (from 0.1 to 1 TFBS per kb) and Group 4 of 51 common TFBS (more than 1 TFBS per kb of genomic sequences). The above mentioned groups were formed according to TFBS occurrences in the human genome. Independent TFBS groupings for the mouse and rat genomes were performed and found to be not significantly different from the human grouping (data not shown). Density histograms of predicted TFBS were formed for genomic intervals of 50 Mb for Group 1, 5 Mb for Group 2, 500 kb for Group 3 and 100 kb for Group 4.
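The grouping and windowing scheme can be written down as a short Python sketch. The names are hypothetical, hit positions are assumed to be 0-based, and the treatment of values lying exactly on a group boundary (0.01, 0.1 or 1 TFBS per kb) is our assumption, since the text does not specify it.

```python
# Histogram window (in nt) used for each occurrence group.
WINDOW_BY_GROUP = {1: 50_000_000, 2: 5_000_000, 3: 500_000, 4: 100_000}

def occurrence_group(per_kb):
    """Map a TFBS occurrence (sites per kb) to Groups 1-4."""
    if per_kb < 0.01:
        return 1
    if per_kb < 0.1:
        return 2
    if per_kb <= 1.0:
        return 3
    return 4

def window_counts(positions, sequence_length, window):
    """Number of predicted sites falling into each consecutive window."""
    n_windows = -(-sequence_length // window)  # ceiling division
    counts = [0] * n_windows
    for pos in positions:
        counts[pos // window] += 1
    return counts
```

A density histogram for one PWM is then the distribution of the values returned by `window_counts` with the window taken from `WINDOW_BY_GROUP`.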
As significant portions of vertebrate genomes are occupied by various types of repeats, we decided to subdivide complete genomic sequences into repetitive and non-repetitive sequence fractions and assess them separately. To identify functionally meaningful TF PWMs with similar occurrences in mammalian genomes, all types of repeats annotated in the human, chimp, mouse and rat genomes were excluded from the analyzed sequences and the remaining non-repetitive genomic fractions were subjected to TFBS prediction as described above.
Density distributions of TFBS proved to have equal variances for the different factors in the same genome and are equally distributed between 33 PWMs. For comparing dispersions we used the F-criterion with a significance level of p = 0.01; for determination of distribution, we used the chi-square criterion with the same p-value. For evaluation of relative similarity in TFBS occurrences for those 33 PWMs, we used the T-criterion (p = 0.01). Among these 33 PWMs we found that only PWMs for factors COMP1, Delta EF1, ZID and SRY have close O-values in mouse and human genomes. Human versus rat comparison of occurrences for the same group of PWMs with normal TFBS distribution revealed only two factors, COMP1 and ZID. Comparison of TFBS occurrences in rat and mouse genomes revealed similarity for 9 PWMs, namely COMP1, ZID, BRN2, OCT1, OCT6, CDP, TCF11, NKX2-5 and GFI1. The O-values for COMP1 and ZID were similar for all three of the species compared. For the rest of the PWMs characterized by differing dispersions (151 PWMs) we determined confidence intervals (CI, with 99% confidence) and analyzed only the overlapping ones. This analysis revealed 32 additional PWMs with O-values statistically similar in two or more genomes (Supplementary Table 2).
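The confidence interval comparison can be illustrated by the Python sketch below. It is only a schematic reading of the procedure: it assumes that the interval is built for the mean per-window occurrence and uses a normal approximation, neither of which is specified above, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_ci(values, confidence=0.99):
    """Confidence interval for the mean, using a normal approximation (assumption)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * stdev(values) / sqrt(len(values))
    centre = mean(values)
    return centre - half_width, centre + half_width

def intervals_overlap(a, b):
    """True if two closed intervals (low, high) share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]
```

Two genomes would be treated as having similar O-values for a PWM when `intervals_overlap(mean_ci(x), mean_ci(y))` is true for their per-window occurrences `x` and `y`.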
We also studied distributions of PWM TFBS for each pair of genomes as well as for the human, mouse and rat genomes together by applying a chi-square test for homogeneity (p = 0.01). Only four PWMs with similar TFBS distributions in all three genomes were found: COMP1, EVI1, STAT1 and NFkappaB50. Nineteen PWMs correspond to TFBS that are equally distributed in human and mouse genomes, while 14 PWMs were equally distributed in human and rat genomes, and 115 PWMs in the genomes of mouse and rat (Supplementary Table 2). As the rodent genomes are much more closely related to each other than to the human genome, this outcome was expected. Similarities in TFBS distributions, interestingly, do not always correspond to similarities in their occurrence values, as the homogeneity criterion proved to be more sensitive to demonstrate genome relatedness from the TF binding point of view.
To visualize patterns of TFBS occurrences in vertebrate genomes we calculated occurrences of TFBS in each genomic interval, which were plotted against the number of intervals (described above) with a given range of O-values. The distributions of TFBS for COMP1 and CHOP were determined in the three vertebrate genomes analyzed (Fig. 1). TFBS for factor COMP1 revealed a strong similarity in distribution among all three genomes (Fig. 1A), while for factor CHOP no similarity was demonstrated for any criterion applied (Fig. 1B).
3.4 Distribution of TFBS in mammalian repeats
Different vertebrate genomes contain lineage-specific as well as common types of repeats; we studied TFBS distributions separately in different types of repeats in the human, mouse and rat genomes. We separated all three complete genomic sequences into repetitive and non-repetitive fractions and calculated TFBS occurrences in them. For 18 PWMs the TFBS occurrences in non-repetitive and repetitive fractions of the human genome differ by at least twofold. Thirteen out of 18 mentioned PWMs were over-represented in human repeats, and only five were under-represented (Table 5). The same type of analysis was performed for mouse and rat genomes (data not shown). We calculated mean of TFBS occurrences of non-repetitive (N) and repetitive (R) fractions in human and rodent genomes differing in PWMs by more than twofold. A significant trend of over-representation of TFBS in human repeats in comparison to repeats of rodents was observed, as the mean values of N/R for Group I PWMs were 0.89 for H.sapiens, 1.92 for M.musculus and 2.10 for R.norvegicus.
To visualize a general picture of TFBS distributions among vertebrate genomes, we created a Venn diagram representing TFBS PWMs over-represented in repetitive genomic fractions in each of the species studied (Fig. 2). As expected, more repeat-associated TFBS types are found to be common in repeat-specific mouse versus rat comparison (31) than in human versus mouse comparison (5) and in human versus rat comparison (4).
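The regions of such a three-set Venn diagram follow directly from set operations, as in this minimal Python sketch. The function name and region labels are hypothetical; the inputs are the collections of PWM names over-represented in the repeats of each species.

```python
def venn_regions(human, mouse, rat):
    """Split three collections of PWM names into the seven regions of a Venn diagram."""
    h, m, r = set(human), set(mouse), set(rat)
    return {
        "all_three": h & m & r,
        "human_mouse_only": (h & m) - r,
        "human_rat_only": (h & r) - m,
        "mouse_rat_only": (m & r) - h,
        "human_only": h - m - r,
        "mouse_only": m - h - r,
        "rat_only": r - h - m,
    }
```

The pairwise counts quoted above correspond to the sizes of the "only" pairwise regions, i.e. PWMs shared by two species but absent from the third.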
We subdivided all repetitive DNA extracted from complete human, mouse and rat genomes into six categories: LINE, LINE mixed with SINE, LINE mixed with any other sort of repeats except SINE, SINE, SINE mixed with any other sort of repeats except LINE, and OTHER. The last category included various transposons and their molecular remnants, simple repeats, low complexity regions, etc. The relative input of each category of repeats into genomic TFBS distributions was assessed by calculation of a maximal density of TFBS found in each category of repeats. The TFBS distributions in the repetitive DNA of mouse and rat genomes are much more similar to each other than to that in the human genome, as we found 31 TFBS over-represented in common in rodent, but not in the human repeats. Twenty-nine PWMs were found to be over-represented in repeats of all three mammalian species analyzed. As expected, corresponding TFBS were found relatively more often in LINE elements that are common in all mammalian lineages. On the other hand, 15 out of 20 TFBS uniquely over-represented in human repeats were found to be mostly corresponding to SINE and SINE mixed. PWMs for factors GC and SP1 were over-represented in category OTHER repeats in all three vertebrate genomes studied.
3.5 Occurrences of TFBS in regulatory areas of mammalian genes
Among predicted TFBS, those which are located in regulatory areas of human genes are the most interesting, as such TFBS can significantly contribute to the cellular level of the mRNA transcripts. As the dinucleotide content of the complete genome and gene promoters differs significantly (Table 1), certain TFBS could cluster in the regulatory area of a gene (for review see FitzGerald et al., 2004). We retrieved nucleotide sequences corresponding to 23 887 human, 28 327 mouse and 22 579 rat promoters from GenBank, as well as their absolute coordinates (see corresponding databases ftp://ftp.ncbi.nih.gov/genomes/). The sequences were organized in an accessory database DATACORD as described before (Section 2). After prediction of TFBS, 13 PWMs that are two or more times over-represented in human promoters in comparison to the non-repetitive fraction of the human genome were revealed. In addition, 10 PWMs were found to preferentially match mouse regulatory sequences, and 6 PWMs were similarly over-represented in the rat regulatory sequences (Table 6). SP1 binding sites and GC box elements were over-represented in promoters of all three mammalian species. We also obtained a list of TFBS significantly under-represented in mammalian promoters in comparison to the mammalian genomes. In all mammalian genomes analyzed, the top positions in the lists of under-represented TFBS were occupied by the AT-rich PWMs, including PWMs for factors HNF1 and MEF2 for rat promoters (Onorep/Oprom in the range 1.94-1.43); factors MEF2, OCT1 and HNF1 for mouse promoters (Onorep/Oprom in the range 1.73-1.48); factors OCT1, HNF1, MEF2 and EVI1 for human promoters (Onorep/Oprom in the range 1.71-1.67).
All PWMs corresponding to the TFBS over-represented in mammalian promoters contain at least one conservative CpG dinucleotide. Only five out of 184 PWMs (2.7%) contain two CpG dinucleotides, and three of them (EGR1_01, EGR2_01 and EGR3_01) match TFBS over-represented in promoters at least twofold. Two other PWMs with two CpG dinucleotides, E2F_01 and HEN1_02, are also somewhat over-represented, with coefficients of 1.89 and 1.49, respectively. Some PWMs with only one CpG dinucleotide, e.g. the PWM for Epstein-Barr virus transcription factor R (BRLF1, or EBVR) and the PWM for the TAX/CREB complex, are over-represented in promoters to a much larger extent (coefficients of 5.08 and 4.25, respectively). The list of TFBS over-represented in promoters of mammalian genes seems to be conserved between humans and rodents.
| 4 DISCUSSION |
|---|
We calculated theoretical occurrences (F-values) of 184 TFBS according to their PWMs and dinucleotide content of the various vertebrate genomes. As the dinucleotide content of vertebrate genomes differs significantly from that of statistically uniform random sequence, F-values calculated on the basis of real dinucleotide frequencies are much more reliable for applications requiring estimation of significance of the PWM match in the sequence of interest. Large-scale TFBS prediction in corresponding complete genomic sequences has been performed. For all studied genomes the differences between F-values and occurrences of predicted TFBS were significant, for some factors differing by orders of magnitude. In our opinion, this finding may indicate TFBS preservation in evolution that could be explained by partial protection of DNA segments covered by bound proteins as mutation accumulation rate in such segments will be lower. The net effect of TFBS protection against mutations could be substantial, even in cases when TF binding to its corresponding TFBS occurs for a short period of time not necessarily associated with transcription.
TFBS preservation, revealing itself as the high O/F value for each given factor, could also be explained by the relative DNA segment stability dependent on its nucleotide sequence per se. If so, the high O/F values for a given TFBS observed for a vertebrate genome should also be observed in non-vertebrate and even prokaryotic genomes presumably not exposed to eukaryotic transcription factors. To support or reject this hypothesis we calculated F-values and real genomic occurrences for TFBS in the genome of the fruit fly Drosophila melanogaster and in the prokaryotic genomes comprising BACSET II with GC content close to that of the human genome. To our surprise, TFBS for TP53 and NRSF factors characterized by the highest O versus F ratios in vertebrate genomes were also found to be preserved in bacterial genomes (for TP53 O/F = 5.52; for NRSF O/F = 1.65). However, the sequence per se hypothesis does not hold for EVI1 and SP1 TFBS or for the GC box element, which have O versus F ratios in prokaryotes close to 1 (data not shown). The latter TFBS may indeed be preserved in eukaryotic genomes due to their binding to certain proteins, indicating that both binding-related and sequence per se mechanisms of TFBS preservation are functioning.
The average O/F values calculated for three bacterial genomes (BACSET II) and for Drosophila were much lower than similar ratios for genomes of fishes and mammals, indicating the presence of positive selection in vertebrate lineages towards preserving potential TFBS in their genomes in comparison to non-vertebrates. A prominent conservation of the list of TFs with the highest O/F values was noted when vertebrate genomes were compared with the genome of Drosophila. Invertebrate genomes encode transcription factors for both the SP1 family (Ramachandran et al., 2001) and the Kruppel family (Dang et al., 2000), which may bind to vertebrate SP1 and EVI1 sites with low efficiency. As vertebrate TF function across invertebrate species and the level of conservation of the corresponding TFBS are not well characterized, this observation deserves further investigation.
Human, mouse and rat genomes were separated into repeat-free, repetitive and regulatory fractions and also subjected to TFBS prediction. Fifteen out of 20 TFBS uniquely over-represented in human repeats were found to be mostly corresponding to SINE and SINE mixed. This is easily explained by the unique character of human SINE repeats (Alu-repeats) which burst into the primate lineage 40-50 million years ago (Oshima et al., 2003). PWMs for estrogen-receptor binding ERE elements previously found to occur in the human Alu-repeats (Klinge, 2001) were almost evenly distributed through non-repetitive and repetitive fractions of the human genome (N versus R ratio = 1.077). Nevertheless, human SINEs were found to be enriched in ERE elements, as ERE occurs once per 1513 bp of human SINEs, once per 2455 bp of human LINEs and once per 1878 bp of non-repetitive human DNA. PWMs for factors GC and SP1 were over-represented in category OTHER repeats in all three vertebrate genomes studied. That could be explained by the symmetrical nature of these GC-rich PWMs that may overlap with non-perfect G stretches in simple repeats and GC-rich low complexity regions. The same is true for homeobox-containing transcription factor PBX1 with its core element (caa)t(caa) possibly annealing to non-perfect repeat (CAA)n.
Repeat-free fractions of the closely related mammalian genomes were found to be characterized by strong similarities in TFBS distributions. A significant trend of over-representation of TFBS in the human repeats in comparison to repeats of rodents was observed. This finding supports previous observations describing the suppression of nucleotide substitutions at certain clustered positions of Alu-repeats corresponding to sites for protein binding (Britten, 1994). As such suppression of mutations cannot be explained by sequence requirements for Alu-repeat replication, Britten suggested that Alu-repeats have sequence-dependent functions in the primate genomes, probably related to the regulation of gene transcription.
Thirteen PWMs were found to be over-represented twofold or more in human promoters in comparison to the non-repetitive fraction of the human genome. This can be explained only partially by the presence of conserved CpG dinucleotides in the corresponding PWMs. The list of TFBS over-represented in promoters of mammalian genes seems to be conserved between humans and rodents. For example, TFBS corresponding to the entire set of EGR (early growth response) genes, which are pertinent to gonadal development and differentiation (Ohno, 1999) as well as to neuronal plasticity (O'Donovan et al., 1999), are over-represented in all three mammalian species studied. Included in that set are the TFBS for factors EGR1, EGR2, EGR3 and NGFI-C, also known as EGR4. We also noted that the two TFBS most strongly over-represented in human promoters, but not in rodent ones, correspond to sites for TFs produced by Epstein–Barr virus and human T-cell leukemia virus type 1. As both viruses are human-specific and unable to propagate in rodent cells, this finding might indicate concerted evolution of viral TFs and the dinucleotide content of human promoters, preserving the viral ability to disorganize the expression of as many human genes as possible.
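The twofold criterion used above amounts to a simple ratio filter over per-PWM frequencies. The sketch below illustrates this under stated assumptions: the function name, the PWM labels and all frequency values are hypothetical and are not taken from the study.

```python
def overrepresented_pwms(promoter_freq, background_freq, min_ratio=2.0):
    """Return {pwm: ratio} for PWMs whose promoter/background ratio is >= min_ratio."""
    result = {}
    for pwm, p_freq in promoter_freq.items():
        b_freq = background_freq.get(pwm)
        if not b_freq:  # skip PWMs with a missing or zero background frequency
            continue
        ratio = p_freq / b_freq
        if ratio >= min_ratio:
            result[pwm] = ratio
    return result


# Hypothetical per-bp site frequencies, for illustration only (not values from the study).
promoter = {"PWM_A": 4.0e-4, "PWM_B": 1.1e-4, "PWM_C": 9.0e-4}
background = {"PWM_A": 1.0e-4, "PWM_B": 1.0e-4, "PWM_C": 3.0e-4}

hits = overrepresented_pwms(promoter, background)
print(sorted(hits))
```

Here the background stands for the non-repetitive genome fraction; the same filter could be applied per species to compare the resulting lists between humans and rodents.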
The list of TFBS occurrences in mammalian genomes, and particularly in promoter areas, could serve as a starting point for molecular biologists studying gene regulation, as it provides real TFBS frequencies for 184 PWMs commonly used for de novo TFBS prediction.
| Acknowledgments |
|---|
The authors are very grateful to Dr. Mikhail Gelfand for extremely valuable scientific discussions, critical reading of the manuscript and for suggesting additional controls included in the submitted version, and to Prof. Nick K. Yankovsky for valuable advice. We also thank our GMU MMB colleagues Dr. A. Christensen and Shobha Gowder for help with English grammar and everything else.
Received on August 11, 2004; revised on January 7, 2005; accepted on February 2, 2005
| REFERENCES |
|---|
Benos, P.V., et al. (2002) Is there a code for protein-DNA recognition? Probab(ilistical)ly. Bioessays, 24, 466–475.
Britten, R.J. (1994) Evolutionary selection against change in many Alu repeat sequences interspersed through primate genomes. Proc. Natl Acad. Sci. USA, 91, 5992–5996.
Dang, D.T., et al. (2000) The biology of the mammalian Kruppel-like family of transcription factors. Int. J. Biochem. Cell Biol., 32, 1103–1121.
FitzGerald, P.C., et al. (2004) Clustering of DNA sequences in human promoters. Genome Res., 14, 1562–1574.
Gorbachev, O.G., et al. (2002) Stochastic Processes. MIPT Publishers, Moscow.
Harr, R., et al. (1983) Search algorithm for pattern match analysis of nucleic acid sequences. Nucl. Acids Res., 11, 2943–2957.
Holliday, R. and Grigg, G.W. (1993) DNA methylation and mutation. Mutat. Res., 285, 61–67.
Hoyt, P.R., et al. (1997) The Evi1 proto-oncogene is required at midgestation for neural, heart, and paraxial mesenchyme development. Mech. Dev., 65, 55–70.
Klinge, C.M. (2001) Estrogen receptor interaction with estrogen response elements. Nucl. Acids Res., 29, 2905–2919.
Lander, E.S., et al. (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860–921.
Mouse Genome Sequencing Consortium. (2002) Initial sequencing and comparative analysis of the mouse genome. Nature, 420, 520–562.
Murakami, K., et al. (2004) Assessment of clusters of transcription factor binding sites in relationship to human promoter, CpG islands and gene expression. BMC Genomics, 5, 16.
O'Donovan, K.J., et al. (1999) The EGR family of transcription-regulatory factors: progress at the interface of molecular and systems neuroscience. Trends Neurosci., 22, 167–173.
Ohno, S. (1999) The one-to-four rule and paralogues of sex-determining genes. Cell Mol. Life Sci., 55, 824–830.
Ohshima, K., et al. (2003) Whole-genome screening indicates a possible burst of formation of processed pseudogenes and Alu repeats by particular L1 subfamilies in ancestral primates. Genome Biol., 4, R74.
Pennisi, E. (2003) Evolution. Chimp genome draft online. Science, 302, 1876.
Quandt, K., et al. (1995) MatInd and MatInspector: new fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucl. Acids Res., 23, 4878–4884.
Ramachandran, A., et al. (2001) Novel Sp family-like transcription factors are present in adult insect cells and are involved in transcription from the polyhedrin gene initiator promoter. J. Biol. Chem., 276, 23440–23449.
Rat Genome Sequencing Consortium. (2004) Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature, 428, 493–521.
Sandberg, R., et al. (2003) Quantifying the species-specificity in genomic signatures, synonymous codon choice, amino acid usage and G+C content. Gene, 311, 35–42.
Sved, J. and Bird, A. (1990) The expected equilibrium of the CpG dinucleotide in vertebrate genomes under a mutation model. Proc. Natl Acad. Sci. USA, 87, 4692–4696.
Vazquez, M.E., et al. (2003) From transcription factors to designed sequence-specific DNA-binding peptides. Chem. Soc. Rev., 32, 338–349.