Skip Navigation


Bioinformatics Advance Access originally published online on December 6, 2005
Bioinformatics 2006 22(4):430-437; doi:10.1093/bioinformatics/bti819
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrowOA All Versions of this Article:
22/4/430    most recent
bti819v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (11)
Google Scholar
Right arrow Articles by Sauer, T.
Right arrow Articles by Wingender, E.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Sauer, T.
Right arrow Articles by Wingender, E.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions@oxfordjournals.org

Evaluating phylogenetic footprinting for human–rodent comparisons

Tilman Sauer 1,*, Ekaterina Shelest 1 and Edgar Wingender 1,2

1Department of Bioinformatics, UKG, Georg-August-University of Goettingen Goldschmidtstrasse 1, 37077 Goettingen, Germany
2BIOBASE GmbH, Halchtersche Strasse 33, 38304 Wolfenbuettel, Germany

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS AND DISCUSSION
 CONCLUSIONS
 REFERENCES
 

Motivation: ‘Phylogenetic footprinting’ is a widely applied approach to identify regulatory regions and potential transcription factor binding sites (TFBSs) using alignments of non-coding orthologous regions from two or more organisms. A systematic evaluation of its validity and usability based on known TFBSs is needed to use phylogenetic footprinting most effectively in the identification of unknown TFBSs.

Results: In this paper we use 2678 human, mouse and rat TFBSs from the TRANSFAC® database for this evaluation. To ensure the retrieval of correct orthologous sequences, we combine gene annotation and sequence homology searches. Demanding a sequence identity of at least 65% is most effective in discriminating TFBSs from non-functional sequence parts, while different alignment algorithms only have a minor influence on TFBS identification by human–rodent comparisons. With this threshold ~72% of the known TFBSs are found conserved, a number which varies significantly between different transcription factors and also depends on the function of the regulated gene. TFBSs for certain transcription factors do not require strict sequence conservation but instead may show a high pattern conservation, limiting somewhat the validity of purely sequence-based phylogenetic footprinting.

Availability: Scripts are available from the authors upon request.

Contact: tsa{at}bioinf.med.uni-goettingen.de

Supplementary information: http://www.sybig.de/download


    INTRODUCTION
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS AND DISCUSSION
 CONCLUSIONS
 REFERENCES
 
Comparative genomics is a powerful approach for the prediction of gene structure and of cis-acting regulatory elements (Bulyk, 2003; Cooper and Sidow, 2003; Duret and Bucher, 1997; Frazer et al., 2003; Ureta-Vidal et al., 2003). The latter approach is also termed ‘phylogenetic footprinting’ and was first introduced by Tagle et al. (1988) who investigated primate {gamma}- and {varepsilon}-globin genes. The basic assumption of phylogenetic footprinting is that regulatory elements in non-coding regions are under a higher selective pressure during evolution than non-functional regions. Deleterious mutations in regulatory regions are removed from a population owing to purifying selection and non-functional regions are therefore expected to accumulate more mutations over time than regulatory regions. Aligning orthologous regions from two or more organisms hence should highlight the conserved parts which are expected to play a functional role. The term ‘functional’ is used in the following tantamount to an experimentally proven transcription factor–DNA interaction.

Owing to the availability of an increasing number of complete eukaryotic genomes, phylogenetic foot-printing is an approach which is widely applied to identify regulatory regions and potential transcription factor binding sites (TFBSs) in different organisms. The inherent problem of comparative genomics is the question that which species should be compared with each other to most reliably identify functional regions in the genomic sequence.

Several studies dealt with this topic for Drosophila species. Bergman et al. (2002) indicated that large-scale comparative genomic sequence data from Drosophila pseudoobscura could greatly contribute to the functional annotation of the genome of Drosophila melanogaster. Species like Drosophila erecta are too closely related to differentiate whether conservation of non-coding sequence is due to functional constraint or shared ancestry, whereas the Anopheles gambiae genome may not substantially aid genome annotation as it lacks non-coding conservation in comparison with D.melanogaster. Emberly et al. (2003) found that sequence conservation between D.melanogaster and D.pseudoobscura alone may not reliably discriminate regions containing TFBSs from those that do not. However, it has been shown that a combination of TFBS clustering and comparative sequence analysis is much more effective in identifying regulatory modules in D.melanogaster than each individual method on its own (Berman et al., 2004; Sinha et al., 2004).

A catalogue of regulatory motifs in Saccharomyces cerevisiae was discovered by a comparative analysis using high-quality draft sequences of three related yeast species (Kellis et al., 2003). These three species had sufficient sequence similarity to S.cerevisiae to reliably align orthologous regions, but sufficient sequence divergence to allow recognition of many functional elements by their greater degree of conservation. The same sequence data were also used to study the evolution of known TFBSs in S.cerevisiae, which show fewer substitutions than background DNA (Moses et al., 2003).

Human–mouse comparisons have been used extensively to identify potential regulatory regions which in many cases proved to be functional (Dermitzakis et al., 2002; Elnitski et al., 2003; Hardison et al., 1997; Loots et al., 2000). Human and mouse diverged ~75–90 million years ago. The divergence rate between their genomes has been low enough that one can still align orthologous sequences, but high enough to allow the discrimination of functional elements by their greater conservation. To assess the usefulness of human–mouse comparisons, several previous studies addressed the point to what extent experimentally known TFBSs can be identified by phylogenetic footprinting (Dermitzakis and Clark, 2002; Lenhard et al., 2003; Levy and Hannenhalli, 2002; Liu et al., 2004; Wasserman et al., 2000). The data collections of these studies comprehended betweeen 99 and 481 TFBSs of which about 60–68% could be detected by human–mouse comparisons.

There are several crucial points in using phylogenetic footprinting which have to be recognized to obtain reliable results. The three major problems are (1) the retrieval of the correct orthologous sequence from the second organism, (2) the choice of an alignment algorithm and (3) the choice of criteria for determining conservation. In this paper we address these three points and derive our own approach from the conclusions drawn.

To evaluate the usefulness of human–mouse or human–rat genome comparison for phylogenetic foot-printing and to tune the approach, we mapped 2676 experimentally proven TFBSs from the TRANSFAC® database (Matys et al., 2003) to different alignments and checked their conservation. In this study we used a collection of TFBSs larger than in the previous studies and systematically investigated the influence of the conservation threshold (CT) and alignment algorithm on TFBS identification as well as the dependence of conservation on the type of TF and the functional category of the regulated gene.

As the DNA-binding specificity of some TFs is low, sequence conservation may not reflect properly the existence of an orthologous TFBS. To address the identification of TFBSs which are not well conserved, we therefore abstracted the conservation of TFBS from the sequence level to the pattern level using position-specific scoring matrices (PSSMs).


    MATERIALS AND METHODS
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS AND DISCUSSION
 CONCLUSIONS
 REFERENCES
 
Mapping of TFBS sequences
A total of 3508 TFBS (human: 1745; mouse: 1045; rat: 718) were extracted from the TRANSFAC® database (PROFESSIONAL Release 9.1). For all the TFBSs in TRANSFAC® it was confirmed experimentally that a TF binds to them. The TFBSs were lengthened via their EMBL links (given by TRANSFAC®) by 50 bp on each side to facilitate a unique mapping to their corresponding genome (human: NCBI build 35; mouse: NCBI build m33; rat: RGSC 3.4) using BLAT (Kent et al., 2002). Sequences containing the TFBSs with an 800 bp environment were retrieved from the Ensembl database (Hubbard et al., 2005).

The mapped TFBSs were assigned to their corresponding gene by retrieving annotation from Ensembl. This step was necessary as a fraction of TFBSs is not linked to an Ensembl gene in TRANSFAC®. As the TFBS sequences are annotated strand-specifically, each TFBS was assigned to the Ensembl Gene ID of the gene being on the same strand as the TFBS and having the transcript with the lowest distance of its transcription start site (TSS) to the TFBS. This gene was assumed to be the one regulated by the TFBS. The gene descriptions stored in TRANSFAC® were in very good agreement with the Ensembl gene descriptions (data not shown). The following Ensembl-databases were used: homo_sapiens_core_30_35c, mus_musculus_core_30_33f and rattus_norvegicus_core_30_34.

Sequence retrieval of orthologous sequences
If a TFBS could be assigned to a gene, the Ensembl-database ensembl_mart_30 was queried for known orthologous genes (mouse orthologs for human TFBSs, human orthologs for rat or mouse TFBSs). The sequences obtained above were blasted against the second genome using WU-BLAST (W. Gish, http://blast.wustl.edu) with standard parameters. We preferred a whole genome approach to a WU-BLAST search just in regions near the known orthologous genes, to avoid an arbitrary choice of the size of these regions. Prior to blasting potential repeats were species-specifically masked using RepeatMasker (A. F. A. Smit and P. Green, http://repeatmasker.org) in combination with MaskerAid (Bedell et al., 2000) and RepBaseUpdate of March 2004 (Jurka et al., 2000). The sequence around the WU-BLAST-hit having the highest possible score and being assigned to a known orthologous gene (as described earlier) was extracted (see also Supplementary Material). Start and end of the query sequence were extrapolated to retrieve a sequence of similar length, i.e. ~800 bp.

If the obtained orthologous sequence pairs (OSPs) were found to overlap, they were merged into one OSP.

All sequences containing the TFBSs were mapped to whole genome alignments (WGAs) made with BLASTZ, which were downloaded from the UCSC website (ftp://hgdownload.cse.ucsc.edu; hg17vsMm5, mm5vsHg17 and rn3vsHg17 in axtNet format).

Alignment of orthologous sequences
The obtained OSPs were aligned with several global alignment algorithms (AVID, CLUSTAL W, CONREAL, DIALIGN, LAGAN and T-COFFEE) (Berezikov et al., 2004; Bray et al., 2003; Brudno et al., 2003; Morgenstern et al., 2000; Notredame et al., 2000; Thompson et al., 1994) and two local alignment algorithms (BLASTZ and LALIGN) (Huang and Miller, 1991; Schwartz et al., 2003). The alignment programs were run with default parameters. They are not specially designed for aligning non-coding sequences, except for CONREAL which uses a different approach by scanning both sequences with PSSMs for potential TFBSs to establish anchors between the orthologous sequences and to guide the sequence alignment.

Conservation rate and background conservation
The conservation of the TFBSs is calculated using their sequence identity in the obtained pairwise alignments. The percentage of identity (PI) value of TFBS i (with i = 1, ..., N) is defined as the fraction of identical symbols in the columns j of the pairwise alignment over the length li of the TFBS i:

Formula 1(1)

Xj, Yj isin {A,C,G,T,–} are the human and rodent nucleotides or gaps (a nucleotide which is aligned to a gap is treated the same as a mismatch) at position j of TFBS i in the pairwise alignment. A TFBS i is considered as conserved, if its PI value is equal to or above a certain CT. The conservation rate Cseq is defined as the fraction of all N TFBSs that are considered as conserved:

Formula 2(2)

Analogously the conservation of the background is calculated. The background is defined as all columns of the obtained alignments which do not overlap with exons or known TFBSs. Every length li of all TFBSs (with i = 1, ..., N) is used as a sliding window to calculate PI values for all possible positions of the background. One obtains a distribution of PI values which is based on the same length distribution as the PI value distribution used for calculating Cseq. The background conservation rate Formula 2 is defined as the fraction of these PI values that are equal to or above a certain CT.

Gene function dependency of TFBS conservation
The gene ontology (GO) annotation for each gene linked to a TFBS was replaced by the corresponding ‘GO slim’ terms (Harris et al., 2004). To construct ‘GO slim’ a set of high level terms was selected by the EBI to cover most aspects of each of the molecular function, biological process and cellular component ontologies. If several GO terms for one gene correspond to the same GO slim term, the GO slim term was counted only once. The conservation rate Cseq for all TFBSs linked to a certain GO slim term was calculated individually.

Conservation rate on pattern level
The complete vertebrate PSSM library of TRANSFAC® was used to predict TFBSs in the OSPs using the MATCH program (Kel et al., 2003). All PSSMs related to TFs binding to an experimentally verified TFBS were used to scan both sequences. Matrix similarity score (mSS) thresholds from the predefined profile minFN were used to minimize false negative predictions.

If a PSSM hit in the orthologous sequence occurred in the alignment within ±2 bp of a PSSM hit for the same matrix in the orginial sequence and its score did not differ from the score of the original hit by a tolerance limit of 0.15, the PSSM hit was called ‘pattern-conserved’. Since the position of a PSSM hit does not necessarily coincide with the TFBS documented in TRANSFAC®, we allowed for a positional tolerance of ±10 bp around the start of the TFBS to check for pattern-conservation, unless the TFBS itself was used in the construction of the PSSM. In this case it is known where exactly the hit should occur. The conservation rate Cpat is defined as the fraction of all known TFBSs which are pattern-conserved. The background conservation rate Formula 2 is defined as the fraction of all predicted TFBSs (for a particular TF) which are pattern-conserved.

Significance of conservation rates
If K TFBSs out of a group of N TFBSs are conserved (Cseq,pat = K/N), the probability to observe this conservation rate by chance (given a background conservation rate Formula 2) is calculated using a cumulative binomial distribution:

Formula 3(3)


    RESULTS AND DISCUSSION
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS AND DISCUSSION
 CONCLUSIONS
 REFERENCES
 
Orthologous sequence retrieval
Intuitively orthologous sequences from a second species would be retrieved according to the annotated TSS, e.g. Jareborg et al. (1999) or Liu et al. (2004) focussed in their studies on the sequences 1000 bp upstream of the TSS for both human and mouse. But it is generally not valid to retrieve an orthologous sequence from the second species just by the known distance from the TSS in the first genome, since the TSS in the second species may be different or annotated incorrectly in either genome. Moreover, when analyzing enhancer sequences that may be located at rather large distances from the TSS, the chances of insertions/deletions between them and the TSS increase with distance. Therefore, an anchor on the second genome is needed to retrieve the correct orthologous sequence. In our approach this anchor is provided by a WU-BLAST search.

Of the 3508 TFBSs 3383 could be mapped to their corresponding genome and linked to a gene (see Materials and Methods section and Supplementary Material), with 72 TFBSs no mapping result was obtained and 53 TFBSs could not be mapped on to a gene. Failures may be due to incomplete genome assemblies and annotation or to errors in the sequences obtained from EMBL. The vast majority of TFBSs are located in the region of –1000 to +250 bp around the TSS (see Supplementary Material). Of these 3383 TFBSs 2971 (88%) were found to be covered by the WGAs, whereas 412 (12%) occurred in gaps of the WGAs.

For 224 of the 3383 TFBSs there was no annotation of an orthologous gene, giving 3159 TFBSs where a WU-BLAST search was performed. For 481 of these 3159 TFBSs the WU-BLAST search failed to identify an anchor on the second genome (15%). Reasons for this could be as described above, very weak sequence homologies that were not detected by the WU-BLAST search or owing to repeat-masking. It is known that regulatory sequences occur in repetitive DNA (Jordan et al., 2003). Repeats were found more frequently in sequences where no corresponding orthologous sequence could be identified, indicating that the repeat-masking may have prevented the identification of an orthologous sequence: For 184 of the 481 sequences (38%) repeats were found with an average of 34% of masked base pairs. In the remaining 2678 cases (85%) an orthologous sequence was identified. Here the amount of repeat-masked sequence was significantly lower: For 572 (21%) of these cases repeats were found with an average of 18% of masked base pairs. This low repeat content is also reflected in the corresponding identified orthologous sequences, where 467 (17%) orthologous sequences contain repeats with an average of 19% of masked base pairs.

Merging overlapping OSPs gave 614 OSPs containing 1349 human and 812 mouse TFBSs and 161 OSPs containing 515 rat TFBSs. For all 775 clusters the TSS distances of the human sequences were determined and compared with the TSS distances of the rodent sequences (Fig. 1). In 54% of the cases the TSS distances differed by <100 bp, but in ~25% by >500 bp (in 18% by >1000 bp), demonstrating the need for an anchor on the second genome other than the annotated TSS: as the average length of the sequences which have been aligned is ~950 bp, a TSS distance difference of 1000 bp would render it impossible to align the known TFBSs to their presumed orthologous counterparts if the orthologous sequences would have been retrieved in relation to the TSS.


Figure 1
View larger version (10K):
[in this window]
[in a new window]
 
Fig. 1 The diagram shows the TSS distance differences for the 775 OSPs. The distance difference in base pairs is plotted against the cumulative percentage of the sequence pairs below a certain TSS distance difference. About 25% of the sequence pairs differ by >500 bp in their TSS distance.

 
Sequence conservation of TFBS
The aim of this work was to evaluate and compare different alignment algorithms in their ability to identify experimentally known TFBSs. The obtained results should give hints on how to use phylogenetic footprinting to optimally identify unknown TFBSs by human–rodent comparisons. Two parameters are of importance for this evaluation: sensitivity and specificity. As not all TFBSs are expected to be conserved between human and rodents, the term sensitivity is misleading, because a sensitivity of 1 cannot be reached a priori. To avoid misconception the term ‘conservation rate’ is used instead (see Materials and Methods section).

It is difficult to determine specificity in phylogenetic footprinting directly, because erroneously identified conserved regions [false-positives (FPs)] cannot be distinguished from potentially functional conserved regions. To get an impression of how each algorithm and parameter combination can distinguish between real functional conservation and background conservation, the percentage of background sequences with a PI value equal to or above a chosen CT was determined (see Materials and Methods section). This percentage is an overestimation of the real FP rate, as there should be other functional TFBSs (true positives) in the vicinity of the known TFBSs. These true positives would be counted as FPs, thus rendering the background conservation rate Formula 3 very conservative. The functionality of the known TFBSs is reflected in a higher sequence conservation (average PI value of 75.6%) compared with the background sequences (52.0%) (Fig. 2).


Figure 2
View larger version (10K):
[in this window]
[in a new window]
 
Fig. 2 The graph shows what fraction of the TFBSs and the background sequences suffices the chosen CT. With increasing CT these fractions decrease. For sake of clarity only results obtained with AVID are shown.

 
Plotting the obtained conservation rate versus the background conservation rate as a receiver operating characteristic (ROC) curve shows no significant differences between the investigated alignment algorithms (Fig. 3). Global alignment programs perform slightly better than local alignment programs and the BLASTZ WGAs. Local tools generally have lower sensitivity as the alignments produced do not cover the input sequences completely. This observation was also made by Pollard et al. (2004) who simulated alignments over a range of divergence times and compared them with alignments produced by pairwise alignment tools. The BLASTZ WGAs are a bit weaker in performance than the other alignment programs, but are better in coverage, as 2971 TFBSs could be mapped to the WGAs, whereas only 2678 TFBSs were covered by the OSPs obtained using WU-BLAST. CONREAL performs best in separating the known TFBSs from non-functional sequences, but it has to be emphasized that CONREAL utilizes known TFBSs in the form of PSSMs to create the alignment (using predicted TFBSs as anchor points) in contrast to the other alignment programs. The circularity in this approach may be a drawback as footprints which cannot yet be assigned to any known TF remain unaligned. This disadvantage does not hold true for the other alignment algorithms, whose runtimes are additionally at least 20 times shorter. Alignments obtained with AVID were selected for further investigations.


Figure 3
View larger version (24K):
[in this window]
[in a new window]
 
Fig. 3 To compare the performance of the different algorithms, the obtained conservation rate is plotted versus the background conservation rate. The global alignment programs perform slightly better than the local alignment programs and the WGAs.

 
While human and rodent sequences are still similar enough to guide most alignment programs, the performance differences between the alignment algorithms may vary substantially for increasing evolutionary distances as was shown recently by Pollard et al. (2004). Very high evolutionary distances may even render it impossible to obtain an accurate alignment. Rosenberg (2005) found that if <50% of positions between two non-coding sequences are identical, CLUSTAL W creates pairwise alignments that do not differ from random data. Thus for sequence comparisons of species more diverged than human and rodent one particular alignment program may be more clearly preferable than others.

The CT which gives the optimal trade-off between conservation rate and the background conservation rate was determined using the ROC curves by calculating the distance of each point in the plot to the upper left corner of the graph (see Supplementary Material). At this point the conservation rate would equal 100%, while the background conservation rate would equal 0%. A CT of 65% gives the minimal distance and is optimal in separating the real TFBSs from the background sequences and is therefore used for further analyses. This CT is an average value determined using all TFBSs in the dataset, for individual TFs the optimal CT may differ slightly. It gives a high conservation rate of Cseq = 71.7% and a background conservation rate of Formula 3. With this threshold 28.3% of TFBSs remain that are not considered as conserved. A fraction of these TFBSs may be species-specific to human or rodents and therefore not detectable by sequence comparisons. Also it may be that regulatory regions, owing to their inner variability, gain and lose TFBSs, change their relative positioning or the spacing between them. The latter has been shown for Drosophila by Ludwig et al. (2000).

A subset of 100 rodent TFBSs was found to be aligned with human TFBSs belonging to the same TF. Of these TFBSs 94 suffice the CT of 65%, proving that phylogenetic footprinting properly detects TFBSs existing in both species. These 100 rodent TFBSs were not included in the calculation of Cseq to avoid counting the same information twice.

As for 481 TFBSs no orthologous sequence could be identified, the above mentioned Cseq may be biased. If the orthologous sequence retrieval failed owing to low overall sequence homology and no other reasons, these TFBS would have a lower probability to be conserved and thus could decrease the percentage of conserved TFBSs. When assuming that all these 481 TFBSs are not conserved, a Cseq of 60.4% would be obtained as an underestimate of the genome-wide conservation of TFBSs between human and rodents. Nevertheless, when using phylogenetic footprinting to detect unknown TFBSs, ~72% of real TFBSs would be detected because the orthologous sequence is needed as a prerequisite.

The fraction of conserved TFBSs observed in this study is in good agreement with observations made in earlier studies. Levy and Hannenhalli (2002) found 65% of experimentally defined TFBSs in regions conserved between human and mouse (having a minimal length of 50 bp and a minimal PI value of 70%). Lenhard et al. (2003) could detect ~68% of TFBSs in conserved regions defined by a sliding window of 50 bp and a CT of 70%. It has to be noted that the PI value of the individual TFBSs themselves may be <70%, as the average length of TFBSs is much smaller than the used window length of 50 bp. Liu et al. (2004) found ~60% of TFBSs and ~25% of background sequences to be conserved when using a window of 21 bp centered on the TFBSs and a CT of 70%. We did not focus on a certain window size to determine conserved regions but instead investigated the TFBSs themselves as did Dermitzakis and Clark (2002). They found ~62% of TFBSs to suffice a CT of 70%. Using the same CT we observe 66% of the TFBSs and 28% of the background sequences to be conserved.

TF dependency of TFBS sequence conservation
As TFBSs are bound by TFs, it is reasonable to subgroup the TFBSs according to their related TFs and determine the conservation rate for each of these subgroups individually. TFBSs of MyoD (Cseq = 100%), MEF-2 (100%), SRF (96.9%) or NF-AT1 (96.7%) have higher conservation rates than TFBSs for Sp1 (63.7%), C/EBP{alpha} (63.2%), AP-2{alpha}A (48.2%) or GATA-1 (44.0%) when regarding conservation on sequence level (Table 1 and Supplementary Material). The obtained results are in good agreement with those by Wasserman et al. (2000) who found all of 20 MEF2, 23 MyoD and 15 SRF TFBSs to be conserved, but only 18 of 24 Sp1 TFBSs. The observed Cseq for some TFs (e.g. AP-2{alpha}A and GATA-1) does not differ significantly from the background conservation (after Bonferroni correction for multiple testing). Sequence conservation therefore may not be a sufficient criterion to identify TFBSs for these TFs.


View this table:
[in this window]
[in a new window]
 
Table 1 TF dependency of the conservation rate

 
The TFs with a high conservation rate of their TFBSs may have an essential role in gene regulation for both human and rodents which increases the selective pressure on their TFBSs. For the TFs with low TFBS conservation rate the non-conserved TFBSs maybe existed in a homotypic cluster of an ancestor genome, of which different elements mutated and lost function in the distinct evolutionary lines. The remaining elements of the homotypic cluster probably compensated for this loss of function. For the even-skipped stripe 2 enhancer (S2E) in Drosophila which regulates the expression of the gene even-skipped (eve) it has been shown that although the expression pattern of eve is nearly identical among several species, the S2E sequences from these species have substantially diverged (Ludwig et al., 2000, 2005). The authors showed that these sequence differences have functional consequences, but are masked by other coevolved differences. There is experimental evidence that at least some of the factors with a lower conservation rate (Sp1, GATA-1) appear in clusters (Hardison et al., 1993; Hermfisse et al., 1996). As further examples, TRANSFAC documents four AP-2{alpha}A TFBSs between –230 and –110 bp upstream of the TSS of the human MT2A gene and six c-Myb TFBSs in the human SFRS2 gene (–366 to +16 bp relative to the TSS).

Gene function dependency of TFBS sequence conservation
Several studies have discovered a correlation between the locations of highly conserved sequence elements and genes involved in the regulation of transcription and development (Bejerano et al., 2004; Sandelin et al., 2004; Woolfe et al., 2005).

These highly conserved sequences are probably existing in all vertebrate species and are essential to vertebrate development. They are in most cases at least several kb apart from known genes and may act as distal enhancers or play a role in structuring the genomic architecture. In contrast to that, Iwama and Gojobori (2004) focussed in their study on the conservation of sequences directly upstream of 3055 orthologous human–mouse gene pairs. They found that TF genes and developmental process-related genes show the highest degree of upstream sequence conservation, while genes involved in metabolism and cell cycle show less upstream sequence conservation.

While Iwama and Gojobori (2004) investigated upstream sequence conservation in general, in this study we explicitly checked if there is a correlation between gene function and the conservation of individual TFBSs. GO slim annotation was retrieved for all genes linked to a TFBS from Ensembl (see Materials and Methods section). The results are shown in Table 2 (see also Supplementary Material). TFBSs regulating genes having transcription regulator activity (Cseq = 80.5%) or which are involved in cell death (77.9%), the regulation of biological processes (77.3%), nucleic acid binding (77.2%), signal transduction (75.8%) and development (75.1%) have high conservation rates. TFBSs regulating genes involved in transport (62.1%) and having certain catalytic activities, e.g. like kinase (66.9%), transferase (61.9%) and oxidoreductase activity (59.9%), exhibit low conservation rates. These results are in very good agreement with the studies mentioned earlier and indicate that TFBSs regulating genes involved in processes essential to vertebrate evolution are under a high selective pressure and therefore exhibit a higher sequence conservation.


View this table:
[in this window]
[in a new window]
 
Table 2 Gene function dependency of the conservation rate

 
Because the investigated TFBSs are linked simultaneously to TFs and GO slim terms, we also checked if the results of the gene function dependency could be explained by the results of the TF dependency analysis, but we could not find that certain TFs act solely on genes in certain functional categories (data not shown).

Pattern conservation of TFBS
We have shown that the sequence conservation of TFBSs differs among different TFs. Since the DNA-binding specificity of some TFs is rather relaxed, the sequence of a TFBS can be poorly conserved, but the TF may still find a recognizable binding site in the orthologous sequence which may match with a ‘pattern’, e.g. a probabilistic representation of a binding sequence as it is presented by a PSSM. On the contrary it may be that even if the sequence of a TFBS is highly conserved, a single mutation, insertion or deletion may have rendered the orthologous TFBS non-functional. A corresponding PSSM could be retrieved for 1990 of 2678 TFBSs from the TRANSFAC® database and pattern conservation was determined as described in the Materials and Methods section for AVID alignments.

TFBSs (93) of the subset of 100 rodent TFBSs which were found to be aligned with human TFBSs belonging to the same TF are pattern conserved and not included in the calculation of the pattern conservation rate, giving a Cpat of 69.5%.

On sequence level 72.3% of the 1890 TFBSs are found conserved (with a CT of 65%). Of the conserved sequences 58.4% are conserved by both sequence and pattern, which means that 13.9% are conserved by sequence only and 11.1% by pattern only. Also 16.6% are conserved neither on sequence nor on pattern level.

TF dependency of TFBS pattern conservation
Generally sequence and pattern conservation of TFBSs are positively correlated, but for several TFs they differ (Table 1 and Supplementary Material). This may basically have two reasons: (1) As the conservation on sequence level does not require 100% sequence identity, it may happen for some TFBSs that just a few essential nucleotides have mutated or the TFBS has lost its function owing to a single insertion or deletion and a matrix search would give no hit above the mSS threshold in the orthologous sequence. A similar case would be a sequence-conserved TFBS that is highly degenerate and therefore not detectable with a corresponding PSSM in the original sequence. This would appear as a sequence-, but not pattern-conserved TFBS. (2) Alternatively it may be that the sequence identity is generally low but the TF binds nevertheless to both the original and the orthologous sequence as it may have a low binding specificity. Such a case would appear as a pattern-, but not sequence-conserved site.

Examples for the first case are the TFBSs of ER-{alpha} which are well conserved on the sequence level (82%), but the conservation rate on pattern level is just ~53%. For example in one ER-{alpha} site in the rat Calbindin D9K gene one essential nucleotide in the core of the binding site has mutated in the orthologous sequence, which decreases the mSS drastically, although the surrounding sequence is conserved. It is possible that the used PSSM is too stringent owing to the low number of TFBSs (20 for ER-{alpha}) used for their creation and may not reflect the real variability for TFBSs of this type, i.e. the mutation of 1 nt in the core of the binding site may not necessarily prevent the TF from binding in vivo. TFBSs for NF-{kappa}B and c-Ets-1 also show an increased conservation rate on sequence level compared with the pattern level.

Examples for the second case are TFBSs for C/EBP{alpha}, AP-2{alpha}A and Sp1. C/EBP{alpha} shows low conservation on sequence level (63%), but high conservation on pattern level (90%), which may, at least in part, be due to the high degeneracy of C/EBP{alpha} binding sites: The corresponding PSSMs will match frequently, which is reflected in the high background pattern conservation rate Formula 3 of 39%. For all investigated TFs we found a weak positive correlation between the pattern conservation of the known TFBSs and the background (correlation coefficient R = 0.53). Also AP-2{alpha}A and to a smaller extent Sp1 show high pattern and low sequence conservation, while having a lower background conservation rate than C/EBP{alpha}. The conservation rate of TFBSs for AP-2{alpha}A does not differ significantly from the background conservation on sequence level (Bonferroni corrected p-value of 1), whereas this difference is significant on the pattern level (Bonferroni corrected p-value of 5 x 10–8). The concept of pattern conservation therefore is useful for certain TFs where purely sequence based phylogenetic footprinting may fail.


    CONCLUSIONS
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS AND DISCUSSION
 CONCLUSIONS
 REFERENCES
 
In this study we evaluated the usefulness of phylogenetic footprinting when using human–mouse/rat comparisons. Starting from known TFBSs we tuned the approach to obtain the best discrimination between functional and non-functional elements. Retrieving the correct orthologous sequence is the first crucial point in this approach and in terms of performance it is recommendable to create an anchor on the orthologous genome by combining a WU-BLAST search and gene annotation, because retrieving orthologous sequences just by using an identical relative distance to the TSS in both genomes may not return the correct orthologous region, as in 25% of the investigated 775 OSPs the TSS distance difference exceeded 500 bp. However orthologous sequences could not be identified in all cases by this approach, which can be avoided by using WGAs which proved to be better in coverage by providing alignments for 2971 TFBSs compared with 2678 TFBSs in the WU-BLAST approach.

The optimal cut-off to determine conservation is the second crucial point and was evaluated in two points: first, the fraction of TFBSs which complied with the CT and second, what FP rates were obtained with this CT. Two measures for the FP rate have been introduced showing that a CT of 65% optimally discriminates real TFBSs from non-functional regions. About 72% of the known TFBSs suffice the CT of 65%, suggesting that ~72% of unknown TFBSs may be identified by human–rodent comparisons, if an orthologous sequence could be retrieved.

Alignment algorithms differ only marginally in their performance when comparing human and rodent sequences with global alignment programs performing slightly better than local alignment programs or the WGAs. For the comparison of more diverged species it is expected that the performance difference will increase, as indicated by the results of Pollard et al. (2004) and Rosenberg (2005).

The conservation of TFBSs shows some dependence on the corresponding TF, e.g. binding sites for MyoD, MEF-2, SRF or NF-AT1 are conserved very frequently, whereas binding sites for Sp1, C/EBP{alpha}, AP-2{alpha}A or GATA-1 show a lower conservation rate. Depending on the functionality of the regulated genes, the conservation rate for TFBSs also differs: genes involved in transcriptional regulation, development or signal transduction show a higher TFBS conservation rate than genes involved in catalytic activities and transport. Additionally, for some TFs sequence conservation may not be informative about the conservation of the TFBS in the orthologous sequence, as the constraints for binding may be relatively relaxed. A PSSM for a certain TF may still score above the mSS threshold although several positions have mutated in the orthologous sequence. This leads to significant differences in the conservation rates on sequence and pattern level for several factors, e.g. for AP-2{alpha}A binding sites one obtains a conservation rate of 90% on pattern level compared with 48% on sequence level. These findings have to be taken into account when assessing the reliability of phylogenetic footprints for certain TFs.


    Acknowledgments
 
This work was supported by the Bioinformatics Competence Center ‘Intergenomics’ using a grant of the German Ministry of Education and Research (grant no. 031U210A). The authors are indebted to Martin Haubrock, Alexander Kel, Peter Meinicke and Anne-Kathrin Schultz for stimulating discussions and three anonymous reviewers who helped to improve this manuscript. Funding to pay the Open Access publication charges for this article was provided by the Georg-August-University of Goettingen.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Nikolaus Rajewsky

Received on September 27, 2005; revised on November 30, 2005; accepted on December 2, 2005

    REFERENCES
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS AND DISCUSSION
 CONCLUSIONS
 REFERENCES
 

    Bedell, J.A., et al. (2000) MaskerAid: a performance enhancement to RepeatMasker. Bioinformatics, 16, 1040–1041[Abstract/Free Full Text].

    Bejerano, G., et al. (2004) Ultraconserved elements in the human genome. Science, 304, 1321–1325[Abstract/Free Full Text].

    Berezikov, E., et al. (2004) CONREAL: conserved regulatory elements anchored alignment algorithm for identification of transcription factor binding sites by phylogenetic footprinting. Genome Res, . 14, 170–178[Abstract/Free Full Text].

    Bergman, C.M., et al. (2002) Assessing the impact of comparative genomic sequence data on the functional annotation of the Drosophila genome. Genome Biol, . 3, RESEARCH0086.1.

    Berman, B.P., et al. (2004) Computational identification of developmental enhancers: conservation and function of transcription factor binding-site clusters in Drosophila melanogaster and Drosophila pseudoobscura. Genome Biol, . 5, R61[CrossRef][Medline].

    Bray, N., et al. (2003) AVID: A global alignment program. Genome Res, . 13, 97–102[Abstract/Free Full Text].

    Brudno, M., et al. (2003) LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res, . 13, 721–731[Abstract/Free Full Text].

    Bulyk, M.L. (2003) Computational prediction of transcription-factor binding site locations. Genome Biol, . 5, 201[CrossRef][Medline].

    Cooper, G.M. and Sidow, A. (2003) Genomic regulatory regions: insights from comparative sequence analysis. Curr. Opin. Genet. Dev, . 13, 604–610[CrossRef][Web of Science][Medline].

    Dermitzakis, E.T. and Clark, A.G. (2002) Evolution of transcription factor binding sites in Mammalian gene regulatory regions: conservation and turnover. Mol. Biol. Evol, . 19, 1114–1121[Abstract/Free Full Text].

    Dermitzakis, E.T., et al. (2002) Numerous potentially functional but non-genic conserved sequences on human chromosome 21. Nature, 420, 578–582[CrossRef][Medline].

    Duret, L. and Bucher, P. (1997) Searching for regulatory elements in human noncoding sequences. Curr. Opin. Struct. Biol, . 7, 399–406[CrossRef][Web of Science][Medline].

    Elnitski, L., et al. (2003) Distinguishing regulatory DNA from neutral sites. Genome Res, . 13, 64–72[Abstract/Free Full Text].

    Emberly, E., et al. (2003) Conservation of regulatory elements between two species of Drosophila. BMC Bioinformatics, 4, 57[CrossRef][Medline].

    Frazer, K.A., et al. (2003) Cross-species sequence comparisons: a review of methods and available resources. Genome Res, . 13, 1–12[Abstract/Free Full Text].

    Hardison, R., et al. (1993) Comparative analysis of the locus control region of the rabbit beta-like gene cluster: HS3 increases transient expression of an embryonic epsilon-globin gene. Nucleic Acids Res, . 21, 1265–1272[Abstract/Free Full Text].

    Hardison, R.C., et al. (1997) Long human–mouse sequence alignments reveal novel regulatory elements: a reason to sequence the mouse genome. Genome Res, . 7, 959–966[Free Full Text].

    Harris, M.A., et al. (2004) The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res, . 32, D258–D261[Abstract/Free Full Text].

    Hermfisse, U., et al. (1996) The aldolase A promoter in proliferating rat thymocytes is regulated by a cluster of SP1 sites and a distal modulator. Biochem. Biophys. Res. Commun, . 225, 997–1005[CrossRef][Web of Science][Medline].

    Huang, X. and Miller, W. (1991) A time-efficient, linear-space local similarity algorithm. Adv. Appl. Math, . 12, 337–357[CrossRef].

    Hubbard, T., et al. (2005) Ensembl 2005. Nucleic Acids Res, . 33, D447–D453[Abstract/Free Full Text].

    Iwama, H. and Gojobori, T. (2004) Highly conserved upstream sequences for transcription factor genes and implications for the regulatory network. Proc. Natl Acad. Sci. USA, 101, 17156–17161[Abstract/Free Full Text].

    Jareborg, N., et al. (1999) Comparative analysis of noncoding regions of 77 orthologous mouse and human gene pairs [Erratum (1999) Genome Res, 9, 1156.]. Genome Res, . 9, 815–824[Abstract/Free Full Text].

    Jordan, I.K., et al. (2003) Origin of a substantial fraction of human regulatory sequences from transposable elements. Trends Genet, . 19, 68–72[CrossRef][Web of Science][Medline].

    Jurka, J., et al. (2000) Repbase update: a database and an electronic journal of repetitive elements. Trends Genet, . 16, 418–420[CrossRef][Web of Science][Medline].

    Kel, A.E., et al. (2003) MATCH: A tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Res, . 31, 3576–3579[Abstract/Free Full Text].

    Kellis, M., et al. (2003) Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature, 423, 241–254[CrossRef][Medline].

    Kent, W.J., et al. (2002) BLAT–the BLAST-like alignment tool. Genome Res, . 12, 656–664[Abstract/Free Full Text].

    Lenhard, B., et al. (2003) Identification of conserved regulatory elements by comparative genome analysis. J. Biol, . 2, 1.

    Levy, S. and Hannenhalli, S. (2002) Identification of transcription factor binding sites in the human genome sequence. Mamm. Genome, 13, 510–514[CrossRef][Web of Science][Medline].

    Liu, Y., et al. (2004) Eukaryotic regulatory element conservation analysis and identification using comparative genomics. Genome Res, . 14, 451–458[Abstract/Free Full Text].

    Loots, G.G., et al. (2000) Identification of a coordinate regulator of interleukins 4, 13, and 5 by cross-species sequence comparisons. Science, 288, 136–140[Abstract/Free Full Text].

    Ludwig, M.Z., et al. (2000) Evidence for stabilizing selection in a eukaryotic enhancer element. Nature, 403, 564–567[CrossRef][Web of Science][Medline].

    Ludwig, M.Z., et al. (2005) Functional evolution of a cis-regulatory module. PLoS Biol, . 3, e93[CrossRef][Medline].

    Matys, V., et al. (2003) TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res, . 31, 374–378[Abstract/Free Full Text].

    Morgenstern, B., et al. (2000) A space-efficient algorithm for aligning large genomic sequences. Bioinformatics, 16, 948–949[Abstract/Free Full Text].

    Moses, A.M., et al. (2003) Position specific variation in the rate of evolution in transcription factor binding sites. BMC Evol. Biol, . 3, 19[CrossRef][Medline].

    Notredame, C., et al. (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol, . 302, 205–217[CrossRef][Web of Science][Medline].

    Pollard, D.A., et al. (2004) Benchmarking tools for the alignment of functional noncoding DNA. BMC Bioinformatics, 5, 6[CrossRef][Medline].

    Rosenberg, M.S. (2005) Evolutionary distance estimation and fidelity of pair wise sequence alignment. BMC Bioinformatics, 6, 102[CrossRef][Medline].

    Sandelin, A., et al. (2004) Arrays of ultraconserved non-coding regions span the loci of key developmental genes in vertebrate genomes. BMC Genomics, 5, 99[CrossRef][Medline].

    Schwartz, S., et al. (2003) Human-mouse alignments with BLASTZ [Erratum (2004) Genome Res. 14, 786.]. Genome Res, . 13, 103–107[Abstract/Free Full Text].

    Sinha, S., et al. (2004) Cross-species comparison significantly improves genome-wide prediction of cis-regulatory modules in Drosophila. BMC Bioinformatics, 5, 129[CrossRef][Medline].

    Tagle, D.A., et al. (1988) Embryonic epsilon and gamma globin genes of a prosimian primate (Galago crassicaudatus). Nucleotide and amino acid sequences, developmental regulation and phylogenetic footprints. J. Mol. Biol, . 203, 439–455[CrossRef][Web of Science][Medline].

    Thompson, J.D., et al. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res, . 22, 4673–4680[Abstract/Free Full Text].

    Ureta-Vidal, A., et al. (2003) Comparative genomics: genome-wide analysis in metazoan eukaryotes. Nat. Rev. Genet, . 4, 251–262[Web of Science][Medline].

    Wasserman, W.W., et al. (2000) Human–mouse genome comparisons to locate regulatory sites. Nat. Genet, . 26, 225–228[CrossRef][Web of Science][Medline].

    Woolfe, A., et al. (2005) Highly conserved non-coding sequences are associated with vertebrate development. PLoS Biol, . 3, e7[CrossRef][Medline].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Nucleic Acids ResHome page
F. Colecchia, D. Kottwitz, M. Wagner, C. V. Pfenninger, G. Thiel, I. Tamm, C. Peterson, and U. A. Nuber
Tissue-specific regulatory network extractor (TS-REX): a database and software resource for the tissue and cell type-specific investigation of transcription factor-gene networks
Nucleic Acids Res., June 1, 2009; 37(11): e82 - e82.
[Abstract] [Full Text] [PDF]


Home page
Brief BioinformHome page
E. Wingender
The TRANSFAC project as an example of framework technology that supports the analysis of genomic regulation
Brief Bioinform, July 1, 2008; 9(4): 326 - 332.
[Abstract] [Full Text] [PDF]


Home page
J. Biol. Chem.Home page
J. W. Tullai, J. Chen, M. E. Schaffer, E. Kamenetsky, S. Kasif, and G. M. Cooper
Glycogen Synthase Kinase-3 Represses Cyclic AMP Response Element-binding Protein (CREB)-targeted Immediate Early Genes in Quiescent Cells
J. Biol. Chem., March 30, 2007; 282(13): 9482 - 9491.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrowOA All Versions of this Article:
22/4/430    most recent
bti819v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (11)
Google Scholar
Right arrow Articles by Sauer, T.
Right arrow Articles by Wingender, E.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Sauer, T.
Right arrow Articles by Wingender, E.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?