Bioinformatics Advance Access originally published online on January 10, 2006
Bioinformatics 2006 22(6):676-684; doi:10.1093/bioinformatics/btk032
© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Short fuzzy tandem repeats in genomic sequences, identification, and possible role in regulation of gene expression

Valentina Boeva 1,*, Mireille Regnier 2, Dmitri Papatsenko 3 and Vsevolod Makeev 4,5

1Department of Bioengineering and Bioinformatics, Moscow State University Moscow, Russia
2INRIA Rocquencourt France
3University of California Berkeley, USA
4State Research Center GosNIIGenetika Moscow, Russia
5Engelhardt Institute of Molecular Biology, Russian Academy of Sciences Moscow, Russia

*To whom correspondence should be addressed.


    ABSTRACT

Motivation: Genomic sequences are highly redundant and contain many types of repetitive DNA. Fuzzy tandem repeats (FTRs) are of particular interest. They are found in regulatory regions of eukaryotic genes and are reported to interact with transcription factors. However, accurate assessment of FTR occurrences in different genome segments requires a specific algorithm for efficient FTR identification and classification.

Results: We have obtained formulas for P-values of FTR occurrence and developed an FTR identification algorithm implemented in TandemSWAN software. Using TandemSWAN we compared the structure and the occurrence of FTRs with short period length (up to 24 bp) in coding and non-coding regions including UTRs, heterochromatic, intergenic and enhancer sequences of Drosophila melanogaster and Drosophila pseudoobscura. Tandems with period three and its multiples were found in coding segments, whereas FTRs with periods that are multiples of six are overrepresented in all non-coding segments. Periods equal to 5–7 and 11–14 were characteristic of the enhancer regions and other non-coding regions close to genes.

Availability: TandemSWAN web page, stand-alone version and documentation can be found at http://bioinform.genetika.ru/projects/swan/www/

Contacts: valeyo{at}imb.ac.ru

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
Eukaryotic genomes contain many types of repetitive sequences, such as long repeats, satellite DNA and many other yet unclassified sequences of various lengths and levels of repetitiveness (Singer and Berg, 1991). So far, the efforts of researchers have been predominantly focused on nearly perfect repeats such as microsatellites and others (Li et al., 2002). Analysis of more divergent (fuzzy) tandem repeats was complicated by problems related to their discrimination from background and insufficient annotation level of genomes.

In this study we focus on fuzzy tandems containing n occurrences (n > 2) of a mismatched word with a period of T bases (T ~ 3–24) without insertions or deletions. Tandem repeats are usually classified into microsatellites (1–6 bp), minisatellites (6–24 bp, and in some cases longer) (Vergnaud and Denoeud, 2000) and ‘classical’ satellites. The length scale of fuzzy repeats considered here corresponds to the micro- and minisatellite repeat classes. However, we do not consider periods with T = 1 or 2, as they correspond to poly-A or TATA-like sequences, different biological objects explored elsewhere (Katti et al., 2001; Schug et al., 1998; Subramanian et al., 2003).

Fuzzy tandem repeats (FTRs) have been found in regulatory regions of eukaryotic genes (Shi et al., 2000); such tandems sometimes form cooperative arrays of binding sites and interact with transcription factors (Gao and Finkelshtein, 1998; Ott and Hansen, 1996; Meloni et al., 1998; Ramchandran et al., 2000). However, it is still unclear (1) how to define and extract fuzzy tandems, (2) whether functionally different sequences are enriched in tandems of a specific structure and (3) what biological function (if any) fuzzy tandems perform in the genome. If the genome distribution of FTRs is uneven, their exploration should help to locate structural/functional sequence categories and to understand the underlying mechanisms of their function.

The degree of FTR propagation varies from one genome to the other and from one functional sequence category to the other; existing algorithms (Benson, 1999; Kolpakov et al., 2003) return up to 10–15% of the Drosophila melanogaster and >10% (Benson, 1999) of the human genome as tandem repeats of various structure.

Accumulation of tandems in genomes is a result of errors during replication and of some rearrangement events (Dover, 1982; Singer and Berg, 1991; Ellegren, 2004). From that perspective, much of the repetitive genomic DNA might be considered non-informative; however, there are cases where the presence of tandems is tightly linked to a biological function (Nakamura et al., 1998).

For instance, long tandem repeats constitute a large portion of heterochromatin satellite DNA and are involved in centromere formation and function (Martienssen, 2003); sometimes presence of long tandems even serves as a signal of extra centromere formation (Singer and Berg, 1991). Much less is known about the role of shorter repetitive sequences, especially highly mismatched fuzzy tandems (FTRs), quite abundant in exons, introns and transcription regulatory sequences (Nakamura et al., 1998). In exons, FTRs may reflect sequence periodicities existing in protein sequence or even structural features, such as hydrophobic helices (Katti et al., 2000; Li et al., 2004); it is unclear if these tandems have any function at the DNA level. In complex eukaryotic regulatory regions, such as enhancers and silencers, FTRs appear to be linked with some types of binding sites for transcription factors (Antoniewski et al., 1996; Ott and Hansen, 1996; Ramchandran et al., 2000). One of the attractive models suggests that an FTR with a unit consensus similar to a binding site modulates exact response to regulator concentration (Carroll et al., 2001; Davidson et al., 2000).

Repeats of various types may also be important for regulation that controls spatial packaging/dynamics of eukaryotic DNA. Thus, 8–16 bp repeats separated by a distance of <200 bases may characterize Scaffold Attached Regions (Boulikas, 1995); periodic signals also appear to play a role in nucleosome positioning (Ioshikhes et al., 1999). Periodic signals are present in prokaryotic and eukaryotic promoters, where they correspond to arrays of sites for DNA-binding proteins (Kutuzova et al., 1999; Kravatskaia et al., 2002; Makeev et al., 2003).

Tandem structure may sometimes be important for genome functioning: many human diseases are known to be caused by an increase in the number of copies, etc. (Verkerk et al., 1991; Huntington's Disease Collaborative Research Group, 1993; Fu et al., 1992; Thibodeau et al., 1993; Wooster et al., 1994; Villafranca et al., 2001; Niv et al., 2005). From a practical point of view, variations in tandem structure serve in many important applications, such as linkage analysis and DNA fingerprinting (Edwards et al., 1992; Weber and May, 1989).

Here we conducted a functional analysis of tandems in Drosophila at a genome-wide level by (1) introducing probabilistic models for tandems with a high degree of fuzziness and (2) finding tandem structures specific to certain functional sequence categories.


    2 METHODS
2.1 Algorithm
Fuzzy tandems differ in number of mismatched letters, period T and the number of repeated units n, also called the exponent. For instance, in the tandem ATcgc|ATggc|ATtcc|ATcgg only two positions are identical in all units; this level of divergence makes it difficult to detect such tandems with the existing tools (Benson, 1999; Kolpakov et al., 2003).
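
The divergence of the example tandem can be checked with a few lines of Python. This is only an illustrative sketch, not part of TandemSWAN; the function name `conserved_columns` is hypothetical. It splits the tandem into units of length T and counts the positions that carry the same letter in every unit.

```python
def conserved_columns(tandem: str, period: int) -> int:
    """Count positions within the period that carry the same letter in every unit."""
    seq = tandem.upper()
    # Cut the tandem into consecutive units of length `period`.
    units = [seq[i:i + period] for i in range(0, len(seq) - period + 1, period)]
    # A column is conserved when all units share one letter at that offset.
    return sum(1 for column in zip(*units) if len(set(column)) == 1)


# The tandem from the text: ATcgc|ATggc|ATtcc|ATcgg, period T = 5, exponent n = 4.
print(conserved_columns("ATcgcATggcATtccATcgg", 5))  # 2
```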

Typically, periodic signals in biological sequences are detected with the help of autocorrelation analysis (Makeev and Tumanyan, 1996; Chaley et al., 1999; Chechetkin and Lobzin, 1998) and/or periodic alignments (e.g. Benson, 1999). However, such algebraic methods per se usually cannot select the best repeat from overlapping repeats with different periods. In addition, in the case of fuzzy repeats the probability that a tandem repeat appears by chance cannot be neglected. In this work, we amend a detection/scoring algorithm with statistical criteria for tandem discrimination. At the first step, candidate repeats are found using local autocorrelation analysis. At the second step, the candidate repeats are filtered based on their statistical weights. The filtering allows one to obtain a set of non-overlapping tandems for any sequence. Non-overlapping tandems identified in a sequence comprise a coverage map that can be easily compared with genome annotations and sequence feature maps.
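
The second, filtering step can be pictured with a small sketch. The text states only that the most statistically significant of the overlapping candidates is kept; the greedy strategy below (accept candidates in order of increasing P-value, skip any that overlaps an accepted one) is an assumption made for illustration, and the `Candidate` type and `select_non_overlapping` function are hypothetical names, not TandemSWAN code.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    start: int      # 0-based start, inclusive
    end: int        # 0-based end, exclusive
    period: int
    p_value: float  # smaller means more significant


def select_non_overlapping(candidates: List[Candidate]) -> List[Candidate]:
    """Greedy filter (assumed strategy): best P-value first, no overlaps allowed."""
    chosen: List[Candidate] = []
    for cand in sorted(candidates, key=lambda c: c.p_value):
        # Keep the candidate only if it is disjoint from everything already kept.
        if all(cand.end <= kept.start or cand.start >= kept.end for kept in chosen):
            chosen.append(cand)
    return sorted(chosen, key=lambda c: c.start)


example = [
    Candidate(0, 30, 3, 1e-4),
    Candidate(10, 40, 6, 1e-6),   # overlaps the first one and is more significant
    Candidate(50, 70, 5, 1e-3),
]
for c in select_non_overlapping(example):
    print(c.start, c.end, c.period)
```

The result is a set of disjoint intervals, which is the property a coverage map needs.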

Until now, there have been no algorithms providing solutions to all the problems addressed in this study. Some algorithms (Los Alamos National Laboratory, http://biosphere.lanl.gov/tandyman/cgi-bin/tandyman.cgi) detect only perfect tandems, others return only tandems with predefined parameters (Castelo et al., 2002; Sagot and Myers, 1998) or cannot resolve the problem of overlapping periods (Benson, 1999; Kolpakov et al., 2003). We included an option in our software that allows one to calculate the statistical significance of repeats found by other repeat finders, particularly TRF (Benson, 1999) and MREPS (Kolpakov et al., 2003).

Identification of candidate repeats. In the first step of the algorithm we search for candidate repeats for each period T from the range of interest.

The algorithm compares a seed word of length (period) T in sequence position i with words of the same length in positions i − T and i + T. For each letter of the seed word, the number of mismatches w, found from both comparisons, is then recorded in the corresponding sequence position of an output data array. If the three compared symbols are identical, the score equals zero; if only two symbols are identical, the score equals 1; and if all three symbols are different, the score equals 2. An example of the output array obtained after the described local autocorrelation procedure is shown in Figure 1. The algorithm identifies putative tandem repeats by finding minima of the local sum A of scores w in the second pass,

$$A_T[i] = \sum_{k=i}^{i+T-1} w_T[k] \qquad (1)$$
All positions with the local sum below threshold K are included in candidate tandem repeats for the selected period. Greater values of K correspond to tandems with a higher degree of fuzziness. This procedure is repeated for each period T from the input range. For each T, tandems are extracted for different K, which runs from zero to (TC), where C is a user-defined parameter, ‘the significance level’, literally the maximal number of mismatches allowed.
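
The two passes described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the TandemSWAN implementation: the function names are hypothetical, positions are 0-based, the score w at position i is taken as the number of distinct letters among positions i − T, i and i + T minus one, and a position is accepted when its local sum is at most K (the text says "below threshold", so the non-strict comparison is an assumption).

```python
from typing import List, Optional


def mismatch_scores(seq: str, period: int) -> List[Optional[int]]:
    """First pass: w[i] = 0, 1 or 2 for the letters at i - T, i and i + T."""
    n = len(seq)
    scores: List[Optional[int]] = [None] * n  # undefined near the sequence ends
    for i in range(period, n - period):
        scores[i] = len({seq[i - period], seq[i], seq[i + period]}) - 1
    return scores


def local_sums(scores: List[Optional[int]], period: int) -> List[Optional[int]]:
    """Second pass: A[i] = w[i] + ... + w[i + T - 1], where all terms are defined."""
    n = len(scores)
    sums: List[Optional[int]] = [None] * n
    for i in range(period, n - 2 * period + 1):
        sums[i] = sum(scores[i:i + period])
    return sums


def candidate_positions(seq: str, period: int, threshold: int) -> List[int]:
    """Positions whose local sum does not exceed the threshold K (assumed <=)."""
    sums = local_sums(mismatch_scores(seq, period), period)
    return [i for i, a in enumerate(sums) if a is not None and a <= threshold]


seq = "ATCGCATGGCATTCCATCGG"  # the fuzzy tandem ATcgc|ATggc|ATtcc|ATcgg
print(local_sums(mismatch_scores(seq, 5), 5)[5:11])  # [3, 3, 3, 3, 3, 4]
print(candidate_positions(seq, 5, 3))                # [5, 6, 7, 8, 9]
```

Consecutive accepted positions would then be merged into one candidate repeat; that merging step is not shown here.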


Fig. 1 Identification of candidate repeats. The i-th element of the output array (w in the text) contains the number of mismatches between three sequence positions: i, i + T and i − T. The i-th element of the local sum array (A in the text) contains the sum of T sequential elements of the output array starting from position i. Small values of the local sum indicate tandem positions (see the text for details).

 
Filtration of candidate tandem repeats. The extraction step may return, for the same DNA segment, a collection of tandems differing in phase, fuzziness and number of repeated units. However, genome-wide analysis (i.e. map feature comparison) requires a non-overlapping set of tandems. Therefore we filter the extracted overlapping tandems (including those with multiple periods, like 3 and 6) and select the most statistically significant one. We propose two statistical models for calculation of FTR P-values, ‘the MaSk’ and ‘the MotiF’. The corresponding P-values are denoted here as PS-value and PF-value; their calculation is based on the ‘MaSk’ and ‘MotiF’ probabilities, PrS and PrF. ‘MaSk’ characterizes combinatorial properties of tandem repeats such as the minimal number of identical symbols in the corresponding repeat position. For instance, the ‘MaSk’ for the tandem TTC|TCC|TGG is (3,[3,1,2]), which means that the repeat has an exponent equal to 3, and at the first position all three letters are identical, at the second position it can be any letter and at the third position at least two letters must be identical. In all cases the MaSk is considered regardless of the particular letters in the sequence. The probability of obtaining the ‘MaSk’ at a random position, called the ‘MaSk’ probability, is equal to:

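$$\mathrm{Pr}_S(n,[k_1,\dots,k_T]) = \prod_{i=1}^{T}\ \sum_{\substack{(a_1,\dots,a_n)\,\in\,\{A,C,G,T\}^n \\ \max_b \#\{j:\ a_j=b\}\ \ge\ k_i}}\ \prod_{j=1}^{n} p_{a_j} \qquad (2)$$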
Here, the summation is taken over all possible letter combinations that comply with the specific mask. In this formula n is the exponent of the candidate repeat, T is the period, k_i is the maximal number of identical symbols in position i, 1 ≤ k_i ≤ n, of the repeat. The parameters of the Bernoulli model, the symbol frequencies pA, ... , pT, are evaluated from the entire sequence. For instance, the MaSk probability calculated for the tandem TTC|TCC|TGG assuming pA = pC = pG = pT = 0.25 is equal to PrS(3, [3,1,2]) = 0.04.
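
The worked value PrS(3, [3,1,2]) = 0.04 can be reproduced by brute-force enumeration. The sketch below is an illustration, not the authors' code: it assumes that a letter combination complies with the mask at position i when its most frequent letter occurs at least k_i times, an interpretation taken from the TTC|TCC|TGG example, and that positions are independent under the Bernoulli model. The names `mask_of` and `mask_probability` are hypothetical.

```python
from collections import Counter
from itertools import product
from math import prod
from typing import Dict, List, Sequence


def mask_of(units: Sequence[str]) -> List[int]:
    """Per-position count of the most frequent letter across the repeated units."""
    return [max(Counter(column).values()) for column in zip(*units)]


def mask_probability(n: int, mask: Sequence[int], freqs: Dict[str, float]) -> float:
    """Probability that n aligned random units satisfy the mask at every position."""
    letters = list(freqs)
    total = 1.0
    for k in mask:
        position_prob = 0.0
        # Enumerate all len(letters)**n letter combinations for this position.
        for combo in product(letters, repeat=n):
            if max(Counter(combo).values()) >= k:
                position_prob += prod(freqs[a] for a in combo)
        total *= position_prob
    return total


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
mask = mask_of(["TTC", "TCC", "TGG"])
print(mask)                                   # [3, 1, 2]
print(mask_probability(3, mask, uniform))     # 0.0390625
```

The enumeration grows as 4^n per position, so it is practical only for small exponents; it is meant to make the definition concrete rather than to be efficient.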

The corresponding PS-value (the probability to obtain a tandem satisfying the ‘MaSk’ in a random text of length N) is equal to

[Equation (3)]
where N is taken equal to the length of the entire sequence or the scanning window length.

The ‘MotiF’ model is based on the concept of a motif. A motif H represents the set of all words that comply with the IUPAC consensus [http://bioinformatics.org/sms/iupac.html] of the observed FTR. For example, the consensus for ATC|ATG|TTG is ‘WTS’ and the motif is {ATC,ATG,TTC,TTG}. Such a motif representation has advantages and drawbacks; we discussed some of these issues in Kotelnikova et al. (2005). The probability of finding a motif H in the sequence is simply the probability of finding any word belonging to it:

$$\mathrm{Pr}_F(H) = \sum_{h \in H} \Pr(h) \qquad (4)$$
Then the MotiF P-value, PF-value, is the probability to find at least n consecutive occurrences of motif H in a sequence of length N, given that H has been already found once in the sequence:

[Equation (5)]
P-values calculated using either model allow for unambiguous discrimination between, for instance, a longer, highly mismatched tandem and a shorter one, containing fewer, but better matching units. As we pointed out earlier, weighting also helps to eliminate overlapping tandems.
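
The motif construction used by the ‘MotiF’ model can be illustrated on the ATC|ATG|TTG example. The sketch below is hypothetical helper code, not TandemSWAN: it derives the IUPAC consensus column by column, expands it into the word set H, and computes the motif probability assuming a Bernoulli model with independent positions, under which the sum over all words of H equals the product of per-position letter-class probabilities.

```python
from itertools import product
from typing import Dict, Sequence, Set

# Standard IUPAC nucleotide codes and the letters each code stands for.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
LETTERS_TO_CODE = {frozenset(v): k for k, v in IUPAC_CODES.items()}


def consensus(units: Sequence[str]) -> str:
    """IUPAC consensus of equally long repeat units."""
    return "".join(LETTERS_TO_CODE[frozenset(col)] for col in zip(*units))


def motif_words(cons: str) -> Set[str]:
    """All words that comply with the consensus (the motif H)."""
    return {"".join(w) for w in product(*(IUPAC_CODES[c] for c in cons))}


def motif_probability(cons: str, freqs: Dict[str, float]) -> float:
    """Sum of word probabilities over H, written as a product over positions."""
    p = 1.0
    for code in cons:
        p *= sum(freqs[a] for a in IUPAC_CODES[code])
    return p


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
cons = consensus(["ATC", "ATG", "TTG"])
print(cons)                               # WTS
print(sorted(motif_words(cons)))          # ['ATC', 'ATG', 'TTC', 'TTG']
print(motif_probability(cons, uniform))   # 0.0625
```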

2.2 Implementation
The FTR extraction algorithm is implemented as a C++ package TandemSWAN, available for online data processing and for download from the following URL: http://bioinform.genetika.ru/projects/swan/www. TandemSWAN accepts input sequences in most available file formats; user-defined parameters include minimal and maximal period lengths, ‘significance level’, ‘the MaSk’ or ‘the MotiF’ statistical mode and ‘the penalty factor’ for sub-periods (see online help for details on parameter settings). Memory requirements and running time depend on repeat abundance in the query sequence and on parameter values; e.g. the running time for the 22 Mb Drosophila chrX amounts to ~2.5 h. TandemSWAN includes utilities for calculation of P-values for tandem repeats obtained by related programs, MREPS (Kolpakov et al., 2003) and TRF (Benson, 1999).

2.3 Coverage of random sequence with FTRs identified by TandemSWAN agrees with theoretical prediction
Genome-wide exploration requires convenient FTR maps, which we refer to here as ‘coverage maps’ or fraction of the sequence dataset positions (e.g. percentage of total exon length) covered by FTRs with a specific structure. In the case of sequences generated under a Bernoulli model, this fraction can be evaluated analytically for each particular FTR period. To test the performance of our algorithm, we compared this analytical value with the results of FTR identification in simulated random sequences. Indeed, for each period T and ‘significance level’ C the probability Θ to find a candidate tandem repeat in the first step of the algorithm starting from some position iT (Fig. 1) is written as

[Equation (6)]
Here w_T[k], the elements of the array w_T (see definitions in ‘Identification of candidate repeats’ and Fig. 1), are considered as random variables whose expectation E and variance V depend on the letter frequencies evaluated from the D.melanogaster genome. According to the central limit approximation, Equation (6) can be written as follows:

[Equation (7)]
where F(x) is the standard normal cumulative distribution. To obtain the fraction of the random sequence covered by tandem repeats found at the second step of the algorithm, statistical weighting, one should take into account possible overlaps of candidate tandem repeats in neighboring positions. Finally, the coverage can be approximated as ~3TΘN/(5T 2), where the T-dependent factor at Θ reflects tandem repeat overlaps.
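
The quantities E and V, and the central limit approximation, can be made concrete with a short sketch. It is an illustration under explicit assumptions rather than a reproduction of Equations (6) and (7): the threshold is passed in as a plain integer `threshold`, the first-step probability is taken as Pr(local sum ≤ threshold), and the 0.5 continuity correction in `normal_tail` is an added choice. The T scores inside one window involve disjoint letter triples, so under the Bernoulli model they are independent and the exact probability follows by convolution.

```python
from itertools import permutations
from math import erf, sqrt
from typing import Dict, List, Tuple


def score_distribution(freqs: Dict[str, float]) -> List[float]:
    """[P(w=0), P(w=1), P(w=2)] for three independent letters."""
    p0 = sum(p ** 3 for p in freqs.values())                      # all identical
    p2 = sum(freqs[a] * freqs[b] * freqs[c]
             for a, b, c in permutations(freqs, 3))               # all different
    return [p0, 1.0 - p0 - p2, p2]


def mean_and_variance(dist: List[float]) -> Tuple[float, float]:
    mean = sum(w * p for w, p in enumerate(dist))
    var = sum(w * w * p for w, p in enumerate(dist)) - mean ** 2
    return mean, var


def exact_tail(dist: List[float], period: int, threshold: int) -> float:
    """Pr(sum of `period` independent scores <= threshold), by convolution."""
    sums = [1.0]
    for _ in range(period):
        new = [0.0] * (len(sums) + 2)
        for s, ps in enumerate(sums):
            for w, pw in enumerate(dist):
                new[s + w] += ps * pw
        sums = new
    return sum(sums[: threshold + 1])


def normal_tail(dist: List[float], period: int, threshold: int) -> float:
    """Central limit approximation of the same probability."""
    mean, var = mean_and_variance(dist)
    z = (threshold + 0.5 - period * mean) / sqrt(period * var)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
dist = score_distribution(uniform)
print(dist)                      # [0.0625, 0.5625, 0.375]
print(mean_and_variance(dist))   # (1.3125, 0.33984375)
print(exact_tail(dist, 12, 12), normal_tail(dist, 12, 12))
```

Comparing the two tail functions for a given period shows how well the normal approximation holds for short periods, where the number of summed terms is small.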

We generated several 1 Mb sequences with uniform letter frequencies and with letter frequencies from the genome of D.melanogaster, identified FTRs with different parameter settings and calculated coverage maps for periods in the range 3–15 bases. Comparison between the observed and the calculated coverage of FTRs (Fig. 2) demonstrates that the devised formula [Equation (7)] accurately describes the distribution of the majority of FTRs present in the random sequence. The agreement between theoretical and observed coverage values holds for the range of periods explored in this study and depends only moderately on letter frequencies.
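
A reduced version of this experiment can be sketched in Python. The sketch covers only the first step of the algorithm: it simulates a Bernoulli sequence and measures the fraction of positions whose local sum does not exceed a threshold, which can then be compared with the exact first-step probability. It does not reproduce the statistical filtering or the coverage values of Figure 2; the function name, the seed and the non-strict comparison with the threshold are assumptions.

```python
import random
from typing import Dict


def simulate_candidate_fraction(length: int, period: int, threshold: int,
                                freqs: Dict[str, float], seed: int = 0) -> float:
    """Fraction of window start positions with local sum <= threshold."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    letters = list(freqs)
    weights = [freqs[a] for a in letters]
    seq = rng.choices(letters, weights=weights, k=length)

    # Mismatch score for every position that has both neighbors at distance T.
    scores = [len({seq[i - period], seq[i], seq[i + period]}) - 1
              for i in range(period, length - period)]

    # Sliding-window sum over T consecutive scores.
    window = sum(scores[:period])
    hits = 1 if window <= threshold else 0
    positions = 1
    for j in range(period, len(scores)):
        window += scores[j] - scores[j - period]
        if window <= threshold:
            hits += 1
        positions += 1
    return hits / positions


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
observed = simulate_candidate_fraction(200_000, 3, 1, uniform)
# For uniform frequencies, T = 3 and threshold 1 the exact value is
# p0**3 + 3 * p0**2 * p1 with p0 = 1/16 and p1 = 9/16.
expected = 0.0625 ** 3 + 3 * 0.0625 ** 2 * 0.5625
print(observed, expected)
```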


Fig. 2 Comparison of map coverage values obtained by TandemSWAN with theoretical expectation. The expected coverage was calculated according to Equation (7) in the text. –•–, TandemSWAN, significance level 1; –*–, theoretical, significance level 1; –○–, TandemSWAN, significance level 2; –Δ–, theoretical, significance level 2; –x–, TandemSWAN, significance level 3; –◇–, theoretical, significance level 3. A 1 Mb long Bernoulli random sequence with average genomic nucleotide frequencies was simulated.

 

    3 RESULTS AND DISCUSSION
FTR density is uneven across the genome of Drosophila. Distribution of functional and other sequence features is known to be unequal among eukaryotic chromosomes and between different chromosome loci. Therefore, we decided to explore the distribution of FTRs, as detected by our algorithm, across a sample eukaryotic genome, D.melanogaster (Celniker et al., 2002). We focused our attention on the genome of Drosophila because of its outstanding level of annotation, the availability of many related genomes and its relatively small size (~120 Mb). We performed FTR extraction with default parameter settings (see TandemSWAN help) using the ‘MaSk’ model for statistical weighting. In this test we explored map coverage, calculated for non-overlapping 16 kb windows, without discriminating tandem motifs and periods.
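
The windowed map coverage can be computed from a list of non-overlapping FTR intervals as in the following sketch. It is an illustration with hypothetical names: intervals are assumed to be half-open, 0-based and mutually disjoint, and the window size of 16 000 positions is an assumption about what "16 kb" means here.

```python
from typing import List, Sequence, Tuple


def window_coverage(intervals: Sequence[Tuple[int, int]], seq_length: int,
                    window: int = 16_000) -> List[float]:
    """Fraction of each non-overlapping window covered by the given intervals."""
    n_windows = (seq_length + window - 1) // window
    covered = [0] * n_windows
    for start, end in intervals:
        pos = start
        while pos < end:
            w = pos // window
            # Stop at the interval end or at the window border, whichever is first.
            stop = min(end, (w + 1) * window)
            covered[w] += stop - pos
            pos = stop
    # The last window may be shorter than the others.
    sizes = [min(window, seq_length - w * window) for w in range(n_windows)]
    return [c / s for c, s in zip(covered, sizes)]


# Toy example with a window of 20 positions on a sequence of length 40.
print(window_coverage([(0, 10), (15, 25)], 40, window=20))  # [0.75, 0.25]
```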

We have found that FTR density (map coverage) along all chromosomes is highly inhomogeneous (Fig. 3a), and the central arm regions have higher FTR density than the distal regions (Fig. 3b). In addition, the average FTR density on the X-chromosome is substantially higher than on autosomes (see ‘Relative FTR density across the genome’ below).


Fig. 3 FTR occurrence in different segments of D.melanogaster genome. FTR density is higher on the X-chromosome (a) and on the centers of chromosome arms (b).

 
In order to assess possible roles of FTRs, we correlated FTR distribution in the Drosophila genome with positions of coding sequences and some other sequence features, such as local AT/GC composition. We have found that gene-rich segments contain fewer FTRs than intergenic regions, while AT-rich segments are enriched in FTRs (Supplementary Fig. 1). The reasons for fluctuation of FTR density in the genome can be different; therefore we explored FTR structure in functionally distinct sequence datasets.

FTRs with 3k periods prevail in coding regions. We investigated whether FTRs of specific periods prevail in certain functional categories. We assembled several sequence datasets containing all exons, 3'-untranslated regions (3'-UTRs), 5'-UTRs, intergenic regions, intergenic heterochromatin (from Drosophila Heterochromatin Project, http://www.dhgp.org/; http://flybase.net/annot/dmel_het_release3.2b2.txt; ftp://flybase.net/genomes/Drosophila_melanogaster/current_hetchr/fasta/) and a dataset containing 124 transcription regulatory regions, i.e. enhancers (or cis-regulatory modules) (https://webfiles.berkeley.edu/dap5/public_html/index.html). Corresponding datasets were also constructed for the related species D.pseudoobscura (Richards et al., 2005). In order to explore FTRs within a functionally related group of genes we also generated the corresponding datasets for a selection of 16 developmental gene loci from D.melanogaster and D.pseudoobscura. To investigate the prevalence of specific tandems in all these datasets we compared total sequence coverage by FTRs with different periods (Fig. 4).


Fig. 4 Fraction covered by FTRs with different periods calculated for DNA with different function from D.melanogaster and D.pseudoobscura. FTRs with periods that are multiples of 3 are overrepresented in exons and FTRs with periods 6 and 12 in non-translated DNA. Note the comparative deficiency of period 9 FTRs in intergenic DNA, especially in D.pseudoobscura. FTRs were identified by SWAN with significance level C = 2 without filtration by statistical significance. For comparison, in all cases curve ‘–x–’ shows results for a 1 Mb simulated Bernoulli random sequence with average genomic nucleotide frequencies. (a) D.melanogaster exons: ‘–*–’, autosomes (26 719 758 bp); ‘–Δ–’, X-chromosome (5 479 105 bp). (b) D.melanogaster euchromatin intergenic and heterochromatin DNA: ‘–*–’, autosome intergenic (47 527 681 bp); ‘–Δ–’, X-chromosome intergenic (11 598 950 bp); ‘–□–’, heterochromatin (7 089 934 bp). (c) D.melanogaster UTRs: ‘–•–’, autosome 5'-UTRs (3 331 080 bp); ‘–○–’, X-chromosome 5'-UTRs (686 503 bp); ‘–◇–’, autosome 3'-UTRs (5 549 366 bp); ‘–x–’, X-chromosome 3'-UTRs (1 254 227 bp). (d) D.melanogaster regulatory regions compared with autosome intergenic DNA: ‘–*–’, dorsal and twist enhancers (114 354 bp); ‘–○–’, 124 enhancers (181 690 bp); ‘–Δ–’, autosome intergenic DNA (47 527 681 bp), the same as in panel (b). (e) D.melanogaster regulatory regions normalized for its autosome intergenic DNA: ‘–*–’, 124 enhancers (181 690 bp); ‘–Δ–’, AP enhancers (115 599 bp); ‘–•–’, AP spacers (349 634 bp). (f) D.pseudoobscura intergenic DNA and CDS: ‘–□–’, autosome intergenic (49 347 738 bp); ‘–•–’, X-chromosome intergenic (30 400 371 bp); ‘–*–’, autosome exons (14 069 722 bp); ‘–Δ–’, X-chromosome exons (5 417 123 bp). (g) D.pseudoobscura and D.melanogaster CDS: ‘–□–’, D.pseudoobscura autosomes (14 069 722 bp); ‘–*–’, D.melanogaster autosomes (26 719 758 bp). (h) D.pseudoobscura and D.melanogaster intergenic DNA: ‘–□–’, D.pseudoobscura autosome (49 347 738 bp); ‘–*–’, D.melanogaster autosome (47 527 681 bp). (i) D.pseudoobscura and D.melanogaster AP enhancers: ‘–□–’, D.pseudoobscura (60 085 bp); ‘–*–’, D.melanogaster (115 599 bp).

 
As expected, the most striking signals were detected in datasets containing exons (Fig. 4a). We found that in the coding regions FTRs with periods divisible by 3 prevail, whereas tandems with periods not equal to 3k are suppressed (below random expectation). We also found that 3k-periodic FTRs in coding regions of the X-chromosome have a greater coverage than 3k-periodic FTRs found in exons of autosomes. This suggests that FTR density, even within similar functional units, may be linked with the physical map, i.e. a particular place in the genome.

Surprisingly, we also detected a high presence of periods that are multiples of 6 (but not the other 3k periods) in the non-coding sequences. Apparently, there is a 6k background in the genome, not related to periodicities caused by codon triplets [see ‘Possible source of 3k (6k) background’ below].

FTR periods specific to sequence categories other than exons. FTRs with periods 6 and 12 were found to be highly abundant throughout all analyzed datasets, including transcription regulatory regions, intergenic spacers, UTRs and even intergenic heterochromatin (Fig. 4b–d). At the same time, non-coding regions were also found to be enriched in FTRs with periods other than 3k. Outside exons we observed a 2- to 3-fold FTR excess over random expectation, which supports a ‘non-random’ origin of FTRs and the non-random character of genomic sequences in general.

In order to detect differences in the FTR structure among datasets representing non-coding regions, we compared prevalence of all periods, i.e. FTR profiles, as shown in Figure 4 (Table 1 and Supplementary Table 1). The correlation analysis has shown that according to prevailing FTR periods, all 22 datasets can be subdivided into at least three groups of similarity, one corresponding to coding regions, another to heterochromatin and the last one corresponding to intergenic regions, spacers and others.
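
Comparing FTR profiles amounts to correlating two vectors of coverage values indexed by period. The text does not state which correlation measure was used; the sketch below assumes the Pearson coefficient, and the profile values in it are made-up toy numbers chosen only to show the call, not data from this study.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Toy coverage profiles for periods 3..8 (illustrative values only).
toy_profile_a = [0.9, 0.2, 0.2, 0.8, 0.2, 0.2]
toy_profile_b = [0.3, 0.3, 0.3, 0.7, 0.4, 0.4]
print(round(pearson(toy_profile_a, toy_profile_b), 3))
```

Applying such a function to every pair of datasets yields a correlation matrix of the kind summarized in Table 1.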


Table 1 Correlation between FTR profiles for D.melanogaster datasets

 
Comparison of absolute levels of FTR presence in different datasets has shown that intergenic heterochromatin, in general, contains fewer FTRs than euchromatin (Fig. 4b). Moreover, heterochromatic regions displayed some excess of FTRs with periods equal to 3k.

In general, comparison of different sequence categories demonstrated that FTRs with all explored periods are overabundant in the genome (with the exception of exons) and that FTRs with 6k periods for some reason strongly prevail, even in non-coding DNA.

FTRs in enhancers are similar, but not identical, to those in intergenic regions. Repetitive sequences in transcription regulatory regions are of special interest. While periodic signals present in exons (3k-periodic FTRs) can be explained by the genetic triplet code (and by periodicities in protein sequences), in regulatory regions FTRs may well represent a background. To investigate this problem, we removed the 6k background by normalizing FTR coverage values in functional datasets to the coverage in genome fragments without any functional annotation (non-functional).
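
The normalization can be pictured as a per-period ratio of two coverage profiles. The exact form of the normalization is not given in the text, so the division below is an assumption, the function name is hypothetical, and the numbers are toy values rather than results from this study.

```python
from typing import Dict


def normalize_profile(profile: Dict[int, float],
                      background: Dict[int, float]) -> Dict[int, float]:
    """Per-period coverage divided by the background coverage (assumed form)."""
    return {period: profile[period] / background[period]
            for period in profile
            if background.get(period, 0.0) > 0.0}  # skip periods with no background


# Toy values: coverage fraction by period in a functional set and in the background.
toy_enhancers = {6: 0.030, 7: 0.020, 8: 0.018}
toy_background = {6: 0.030, 7: 0.010, 8: 0.012}
print(normalize_profile(toy_enhancers, toy_background))
# A ratio near 1 means the period is no more frequent than in the background.
```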

We focused on 124 annotated (experimentally validated) enhancer regions from D.melanogaster (https://webfiles.berkeley.edu/dap5/public_html/data_06/124_Dmel_Enc.fa). The vast majority of these sequences are involved in regulation of developmental genes. However, this group is neither functionally nor structurally homogeneous. The enhancers have different lengths (0.3–3 kb) and regulate genes transcribed at different developmental stages. To achieve better representation we considered the entire dataset (124 sequences, 181 690 bp) and two subsamples, so-called ‘AP’ (anterior–posterior, 72 sequences, 117 377 bp) and ‘DV’ (dorso-ventral, 136 sequences, 114 354 bp) enhancers. Along with enhancers we also considered a separate dataset combining ‘spacers’ between enhancers and a dataset combining coding regions from the same genome locations. The corresponding datasets were also constructed for D.pseudoobscura ‘AP’ enhancers (sequences are available from the website).

Analysis of the normalized FTR distributions in enhancer datasets and in spacers (Fig. 4e) has shown some degree of enrichment in FTRs with periods 7 and 8 in all datasets representing loci of developmental genes. No major differences were found in FTR distribution between the enhancers and their flanking regions or ‘spacers’. However, the overall FTR distribution was not identical to that found in the other non-functional, intergenic regions of the genome.

Along with the assessment of general FTR distributions in enhancers, we also investigated a possible relation between FTR motifs and the binding motifs for transcription factors present in the enhancers. In some single cases (even-skipped stripe 2 enhancer) we observed some similarity, but on the larger scale the correlation was not found to be significant.

It appears that FTRs in enhancers differ from FTRs in the rest of the genome in their period and, perhaps, in motif composition; however, the insufficient number of annotated enhancer regions in the genome and the presence of the 6k background complicate the analysis.

Possible source of 3k (6k) background. Our results show that in Drosophila tandems with periods 6 and 12 are found in the entire genome, whereas other 3k periods are restricted to protein coding sequences.

The abundance of 6k periods might be explained either by the mechanisms of DNA replication and rearrangement or by possible structural DNA features. Existing experimental data suggest that certain 3-periodic synthetic DNA sequences have substantially different helix stability, probably owing to mismatched alignments at equilibrium temperatures during melting (Delcourt and Blake, 1991). So, the quasi-repeated structures may be important for the functional stability/flexibility of the DNA molecule.

Periods of 3k abundant in coding sequences are apparently related to periodicity in protein sequences and the triplet nature of the genetic code. Repetitive structures in the protein sequences, such as the 3–10 helix or hydrophobic alpha helices, may also cause periodicity at the level of DNA (Katti et al., 2000).

The ‘coincidence’ that 6k periods also fall into the 3k periods may have even deeper roots. Apparently, nucleic acids and their replication appeared in a certain form before the triplet genetic code (Lifson, 1997); so it may be that a 3–6k ‘matrix’ was present in DNA long before the genetic code itself, and itself served as a source for the formation of what we know today as the triplet genetic code.

Be the origin of 3k/6k periodic FTRs ‘mechanical’ (DNA stability/replication) or ‘historical’ (3k matrix), it is unlikely that this non-specifically distributed signal is connected with fine mechanisms of genome functioning, such as regulation of gene expression. However, this does not exclude the possibility that FTRs with other periods or FTRs containing some specific motifs are involved in some regulatory functions. Moreover, if the quasi-periodic structure of native DNA sequence is indeed important, it would pose an additional constraint on all motifs in the sequence, including regulatory signals.

Role of FTRs with periods other than 3k. Analysis of FTRs with periods different from 3k has shown that periods 7 and 8 are more abundant in enhancers and in the surrounding regions without functional annotation (Fig. 4e). Currently, it is not clear what function these FTRs may perform in enhancers and whether their presence is related to any function at all. For instance, we have found no correlation between FTRs and recognition motifs for transcription factors present in the same enhancers. However, we considered only a limited number of binding motifs (11), most of which are far from being perfect. Improved enhancer annotations, a larger number of considered motifs and better recognition of the functional motif matches may shed more light on possible roles of FTRs in the regulation of transcription.

As suggested earlier (Papatsenko et al., 2002), some FTRs may act as cassettes containing synergistically acting tandems of binding sites that respond to certain threshold levels of transcription factors. However, it is also possible that the presence of specific repetitive sequences provides an enhancer with a certain spatial geometry required for the correct assembly of regulatory protein complexes. Some FTRs may also be involved in the maintenance of chromatin structure and/or spatial DNA geometry in a broader context.

The role of tandem repeats in regulatory regions was discussed recently by Sinha and Siggia (2005). The authors assessed high-quality tandem repeats found in enhancers of D.melanogaster and D.pseudoobscura using TRF and MREPS and demonstrated their low conservation. They concluded that tandem repeats carry a limited functional load. This agrees with our first finding that the majority of FTRs found in enhancers have the same predominant periods as non-annotated intergenic DNA. On the other hand, some FTRs found in enhancer regions have specific properties and may be involved in regulatory function.

Relative FTR density across the genome. While the qualitative composition of FTRs is surprisingly similar across the genome, their density may vary substantially from one genomic location to another and from one genome to another. This may or may not be connected with the local gene density or even the presence of ‘gene deserts’ (Ovcharenko et al., 2005).

We have found that the genome of D.pseudoobscura has a greater DNA fraction covered by FTRs in all functional categories than the genome of D.melanogaster (Fig. 4). In both genomes, the X-chromosome has a higher FTR density than the other chromosome arms, and distal locations of the chromosome arms have lower FTR densities than more central loci (Fig. 3).
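The "DNA fraction covered by FTRs" used above can be illustrated with a minimal sketch. It assumes repeats are given as half-open, 0-based intervals that lie inside the region of interest; the function names and the example coordinates are hypothetical and are not taken from TandemSWAN.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), 0-based coordinates (assumed)


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals so that no base is counted twice."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def covered_fraction(intervals: Iterable[Interval], region_length: int) -> float:
    """Fraction of a region of `region_length` bases covered by at least one repeat."""
    if region_length <= 0:
        raise ValueError("region_length must be positive")
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return covered / region_length


# Hypothetical repeat coordinates in a 1000-base region.
repeats = [(10, 40), (30, 60), (500, 520)]
print(covered_fraction(repeats, 1000))  # 0.07
```

Merging before summing matters because repeats with different periods can overlap; without it, the covered fraction would be overestimated.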

It is noteworthy that the Drosophila X-chromosome has a greater number of short perfect repeats (Katti et al., 2001) and probably a greater number of recent gene duplications as compared with autosomes (Thornton and Long, 2002). This difference between sex-related chromosomes and the rest of the genome was also reported in humans, where LINE1 repeat elements cover one-third of the X-chromosome (Ross et al., 2005). However, in the Caenorhabditis elegans genome the reported repeat density in the X-chromosome is lower (Achaz et al., 2001).

The difference in FTR fraction between the D.melanogaster and D.pseudoobscura genomes is probably related to the higher compactness of the D.melanogaster genome. Finally, we have found that FTRs are relatively less abundant in intergenic heterochromatin. Perfect tandems with a periodicity of 5 found in pericentromeric heterochromatic regions (Sun et al., 2003) actually cover a surprisingly low fraction of the total heterochromatic DNA.

Problems of FTR exploration and focus of the TandemSWAN algorithm. Eukaryotic genomes are overwhelmed by repetitive sequences that probably carry no biological function and are caused, for instance, simply by peculiarities of DNA replication (Ellegren, 2004). This non-functional noise needs to be filtered, which can be done in part by exploring repeat parameters in different sequence categories and/or by normalizing to the noise in non-functional regions. However, even the background signals may serve to maintain overall DNA survivability and might contain signals required, for instance, for chromatin packaging.

Here we explored fuzzy tandems since they are largely outside the focus of regular tandem-finding programs (Castelo et al., 2002; Sagot and Myers, 1998; Benson, 1999; Kolpakov et al., 2003). The specifics of our algorithm can be illustrated by comparing TandemSWAN with the two most popular repeat finders, TRF (Benson, 1999) and MREPS (Kolpakov et al., 2003) (Fig. 5 and Supplementary Table 2). In comparing with TRF and MREPS, we tried to obtain sequence sets that were as similar as possible to those obtained with TandemSWAN using the parameters characteristic of our study. In fact, both TRF and MREPS are usually used to search for more precise repeats than those in Figure 5, and TRF at least was operating near the limit of repeat fuzziness allowed by its internet-based version. All three tools perform similarly in the case of perfect repeats. However, the results of TandemSWAN and TRF/MREPS become quite different in the case of FTR extraction.


Fig. 5 Similarity between tandem sets identified by different repeat finders. Fractions of a 50 kb fragment of D.melanogaster chr2L sequence covered with tandem repeats identified by different algorithms with the following parameters: TandemSWAN, minimal period 3, maximal period 15, significance level 2, filtration with (a) PS-probability < 10^-5 and (b) PS-probability < 10^-3, ‘MaSk’ statistical mode; TRF (Benson, 1999), minimal period 3, maximal period 15, match 2, mismatch 2, indel 15, pmatch 80, pindel 0, minscore 20; MREPS (Kolpakov et al., 2003), err 15, maxperiod 15, minperiod 3.
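A comparison of the kind summarized in Fig. 5 can be sketched as follows. This is only an illustration under stated assumptions: each tool's output is reduced to half-open, 0-based intervals inside the same fragment, and the names `covered_positions` and `compare_coverage` as well as the example coordinates are hypothetical. It does not reproduce how the figure itself was computed.

```python
from typing import Dict, Iterable, Set, Tuple

Interval = Tuple[int, int]  # half-open [start, end), 0-based coordinates (assumed)


def covered_positions(intervals: Iterable[Interval]) -> Set[int]:
    """Set of all sequence positions covered by at least one interval."""
    positions: Set[int] = set()
    for start, end in intervals:
        positions.update(range(start, end))
    return positions


def compare_coverage(
    tool_a: Iterable[Interval], tool_b: Iterable[Interval], region_length: int
) -> Dict[str, float]:
    """Fractions of the region covered by both tools, only tool A, or only tool B."""
    if region_length <= 0:
        raise ValueError("region_length must be positive")
    pos_a = covered_positions(tool_a)
    pos_b = covered_positions(tool_b)
    return {
        "both": len(pos_a & pos_b) / region_length,
        "only_a": len(pos_a - pos_b) / region_length,
        "only_b": len(pos_b - pos_a) / region_length,
    }


# Hypothetical outputs of two repeat finders on a 1000-base fragment.
result = compare_coverage([(0, 100), (200, 250)], [(50, 120)], 1000)
print(result)  # {'both': 0.05, 'only_a': 0.1, 'only_b': 0.02}
```

A position set is adequate for a fragment of about 50 kb; for whole chromosomes an interval-sweep comparison would use less memory.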

 

    4 CONCLUSIONS
 
In this work we have formulated a working definition of FTRs based on their statistical properties, developed an extraction algorithm and compared the properties of tandems found in sequences carrying different functions. We developed statistics that allow calculation of P-values for FTR occurrence in Bernoulli-type random sequences, which can be useful for other algorithms. This statistical approach is implemented in the TandemSWAN program, which aims to identify FTRs with a broad spectrum of listed parameters.
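To give a feeling for what a P-value under a Bernoulli-type random model means, the sketch below uses a deliberately simplified stand-in statistic, not the statistic implemented in TandemSWAN: the number of positions i at which a sequence matches itself at lag p. For an i.i.d. sequence with uniform base frequencies these match indicators are independent with probability 1/4 each, so the upper tail is binomial. The function names and the example sequence are hypothetical; for non-uniform base frequencies the binomial tail is only an approximation.

```python
from math import comb


def lag_matches(seq: str, period: int) -> int:
    """Count positions i with seq[i] == seq[i + period]."""
    return sum(1 for i in range(len(seq) - period) if seq[i] == seq[i + period])


def binomial_tail(n: int, k: int, q: float) -> float:
    """P(X >= k) for X ~ Binomial(n, q)."""
    return sum(comb(n, j) * q**j * (1 - q) ** (n - j) for j in range(k, n + 1))


def lag_match_pvalue(seq: str, period: int, match_prob: float = 0.25) -> float:
    """Probability of seeing at least the observed number of lag-`period` matches
    in a random sequence of the same length.

    Assumes an i.i.d. model with uniform base frequencies (match_prob = 1/4),
    for which the binomial tail is exact.
    """
    n = len(seq) - period
    if period <= 0 or n <= 0:
        raise ValueError("period must be positive and shorter than the sequence")
    return binomial_tail(n, lag_matches(seq, period), match_prob)


# Hypothetical fuzzy repeat of the unit ACG with one substitution.
example = "ACGACGACTACG"
print(lag_matches(example, 3), lag_match_pvalue(example, 3))
```

Small values indicate that the observed self-similarity at the given period would rarely arise by chance in such a random sequence; a genome-wide scan would additionally need a correction for the number of windows and periods tested.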

Using this approach, we identified short FTRs with periods of 3–24 bases in the D.melanogaster and D.pseudoobscura genomes and compared FTR structure and occurrence in coding and non-coding regions, heterochromatic regions and regulatory (enhancer) sequences. We have found that different types of tandems are abundant in different functional sequence categories, with each category having its own pattern of preferred period lengths. Tandems with period 3 and its multiples were found to be characteristic of coding regions. FTRs with 6k periods are characteristic of all non-coding DNA. FTRs with periods equal to 5, 6, 7 and 11, 12, 14 were enriched in loci of developmental genes and developmental enhancers. On average, regulatory modules have no fewer FTRs than nearby spacers; furthermore, FTRs with periods 7 and 8 are found more often in Drosophila cis-regulatory modules than in other non-coding DNA. Obviously, both the evolution and the DNA structure of regulatory modules are subject to many additional parameters, such as DNA melting and adsorption of protein regulatory factors. Thus, it is possible that FTRs found in cis-regulatory modules have some particular sequence structure facilitating their function and should be studied in greater detail. To understand the role of the omnipresent 6 bp-related repeats, it is first necessary to test whether they are present in different bacterial, animal and plant taxa, and not only in Drosophila species. This work is currently in progress.


    Acknowledgments
 
The authors are thankful to M. Borodovsky, A. P. Lifanov, N. A. Oparina, N. G. Esipova, M. Lassig, A. V. Favorov, V. E. Ramensky, M. G. Gelfand and A. A. Mironov for valuable discussion. They also thank R. Zinzen for careful reading of the manuscript and suggested changes. This study has been supported by the French Program EcoNet-08159PG, INTAS grant 04-83-3994, Russian State Contract No 02.434.11008, RFBR grant 04-04-49601, Fogarty RO3 TW005899-01A1 program, Russian Academy of Science Presidium Program in Molecular and Cellular Biology, project #10 and Ludwig Institute of Cancer Research Grant CRDF GAP RBO-1268.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Steven L. Salzberg

Received on October 14, 2005; revised on December 22, 2005; accepted on December 28, 2005

    REFERENCES
 

    Antoniewski, C., et al. (1996) Direct repeats bind the EcR/USP receptor and mediate ecdysteroid responses in Drosophila melanogaster. Mol. Cell. Biol., 16, 2977–2986.

    Achaz, G., et al. (2001) Study of intrachromosomal duplications among the eukaryote genomes. Mol. Biol. Evol., 18, 2280–2288.

    Benson, G. (1999) Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res., 27, 573–580.

    Benson, G. and Waterman, M. (1994) A method for fast database search for all k-nucleotide repeats. Nucleic Acids Res., 22, 4828–4836.

    Boulikas, T. (1995) Chromatin domains and prediction of MAR sequences. Int. Rev. Cytol., 162A, 279–388.

    Carroll, S.B., Grenier, J.K. and Weatherbee, S.D. (2001) From DNA to Diversity: Molecular Genetics and the Evolution of Animal Design. Blackwell Science, Malden, MA. ISBN 0-632-04511-6.

    Castelo, A.T., et al. (2002) TROLL--tandem repeat occurrence locator. Bioinformatics, 18, 634–636.

    Celniker, S., et al. (2002) Finishing a whole genome shotgun: release 3 of the Drosophila melanogaster euchromatic genome sequence. Genome Biol., 3, RESEARCH0079.

    Chaley, M.B., et al. (1999) Method revealing latent periodicity of the nucleotide sequences modified for a case of small samples. DNA Res., 6, 153–163.

    Chechetkin, V.R. and Lobzin, V.V. (1998) Nucleosome units and hidden periodicities in DNA sequences. J. Biomol. Struct. Dyn., 15, 937–947.

    Delcourt, S.G. and Blake, R.D. (1991) Stacking energies in DNA. J. Biol. Chem., 266, 15160–15169.

    Davidson, H., et al. (2000) Genomic sequence analysis of Fugu rubripes CFTR and flanking genes in a 60 kb region conserving synteny with 800 kb of human chromosome 7. Genome Res., 10, 1194–1203.

    Dover, G.A. (1982) Molecular drive, a cohesive model of species evolution. Nature, 299, 111–117.

    D.melanogaster heterochromatin genome data from Drosophila Heterochromatin Genome Project.

    Edwards, A., et al. (1992) Genetic variation at five trimeric and tetrameric tandem repeat loci in four human population groups. Genomics, 12, 241–253.

    Ellegren, H. (2004) Microsatellites: simple sequences with complex evolution. Nat. Rev. Genet., 5, 435–445.

    Fu, Y.-H., et al. (1992) An unstable triplet repeat in a gene related to myotonic muscular dystrophy. Science, 255, 1256–1258.

    Gao, Q. and Finkelstein, R. (1998) Targeting gene expression to the head: the Drosophila orthodenticle gene is a direct target of the Bicoid morphogen. Development, 125, 4185–4193.

    Huntington's Disease Collaborative Research Group. (1993) A novel gene containing a trinucleotide repeat that is expanded and unstable on Huntington's disease chromosomes. Cell, 72, 971–983.

    Ioshikhes, I., et al. (1999) Periodical distribution of transcription factor sites in promoter regions and connection with chromatin structure. Proc. Natl Acad. Sci. USA, 96, 2891–2895.

    Karlin, S., et al. (1988) Efficient algorithms for molecular sequence analysis. Proc. Natl Acad. Sci. USA, 85, 841–845.

    Katti, M.V., et al. (2001) Differential distribution of simple sequence repeats in eukaryotic genome sequences. Mol. Biol. Evol., 18, 1161–1167.

    Katti, M.V., et al. (2000) Amino acid repeat patterns in protein sequences: their diversity and structural-functional implications. Protein Sci., 9, 1203–1209.

    Kolpakov, R., et al. (2003) mreps: Efficient and flexible detection of tandem repeats in DNA. Nucleic Acids Res., 31, 3672–3678.

    Kotelnikova, E.A., et al. (2005) Evolution of transcription factor DNA binding sites. Gene, 347, 255–263.

    Kravatskaia, G.I., et al. (2002) Similarities in periodical structures in the position of nucleotides in regions of initiation of replication of bacterial genomes. Biofizika, 47, 595–599.

    Kutuzova, G.I., et al. (1999) Periodicity in contacts of RNA-polymerase with promoters. Biofizika, 44, 216–223.

    Landau, G.M., et al. (2001) An algorithm for approximate tandem repeats. J. Comput. Biol., 8, 1–18.

    Li, L., et al. (2004) Pseudo-periodic partitions of biological sequences. Bioinformatics, 20, 295–306.

    Li, Y.C., et al. (2002) Microsatellites: genomic distribution, putative functions and mutational mechanisms: a review. Mol. Ecol., 11, 2453–2465.

    Lifson, S. (1997) On the crucial stages in the origin of animate matter. J. Mol. Evol., 44, 1–8.

    Makeev, V.Ju. and Tumanyan, V.G. (1996) Search of periodicities in primary structure of biopolymers: a general Fourier approach. Comput. Appl. Biosci., 12, 49–54.

    Makeev, V.J., et al. (2003) Distance preferences in the arrangement of binding motifs and hierarchical levels in organization of transcription regulatory information. Nucleic Acids Res., 31, 6016–6026.

    Martienssen, R.A. (2003) Maintenance of heterochromatin by RNA interference of tandem repeats. Nat. Genet., 35, 213–214.

    Meloni, R., et al. (1998) A tetranucleotide polymorphic microsatellite, located in the first intron of the tyrosine hydroxylase gene, acts as a transcription regulatory element in vitro. Hum. Mol. Genet., 7, 423–428.

    Nakamura, Y., et al. (1998) VNTR (variable number of tandem repeat) sequences as transcriptional, translational, or functional regulators. J. Hum. Genet., 43, 149–152.

    Niv, E., et al. (2005) Microsatellite instability in patients with chronic B-cell lymphocytic leukaemia. Br. J. Cancer, 92, 1517–1523.

    Ovcharenko, I., et al. (2005) Evolution and functional classification of vertebrate gene deserts. Genome Res., 15, 137–145.

    Ott, R.W. and Hansen, L.K. (1996) Repeated sequences from the Arabidopsis thaliana genome function as enhancers in transgenic tobacco. Mol. Gen. Genet., 252, 563–571.

    Papatsenko, D.A., et al. (2002) Extraction of functional binding sites from unique regulatory regions: the Drosophila early developmental enhancers. Genome Res., 12, 470–481.

    Ramchandran, R., et al. (2000) A (GATA)(7) motif located in the 5' boundary area of the human beta-globin locus control region exhibits silencer activity in erythroid cells. Am. J. Hematol., 65, 14–24.

    Richards, S., et al. (2005) Comparative genome sequencing of Drosophila pseudoobscura: chromosomal, gene, and cis-element evolution. Genome Res., 15, 1–18.

    Ross, M.T., et al. (2005) The DNA sequence of the human X chromosome. Nature, 434, 325–337.

    Sagot, M. and Myers, E. (1998) Identifying satellites in nucleic acid sequences. In Proceedings of the Second Annual International Conference on Computational Molecular Biology. ACM Press, NY, pp. 234–242.

    Schug, M.D., et al. (1998) The distribution and frequency of microsatellite loci in Drosophila melanogaster. Mol. Ecol., 7, 57–70.

    Shi, X.M., et al. (2000) Tandem repeat of C/EBP binding sites mediates PPARgamma2 gene transcription in glucocorticoid-induced adipocyte differentiation. J. Cell Biochem., 76, 518–527.

    Singer, M. and Berg, T. (1991) Genes and Genomes. University Science Books, Mill Valley, California.

    Sinha, S. and Siggia, E.D. (2005) Sequence turnover and tandem repeats in cis-regulatory modules in Drosophila. Mol. Biol. Evol., 22, 874–885.

    Subramanian, S., et al. (2003) Genome-wide analysis of microsatellite repeats in humans: their abundance and density in specific genomic regions. Genome Biol., 4, R13.

    Sun, X., et al. (2003) Sequence analysis of a functional Drosophila centromere. Genome Res., 13, 182–194.

    Thibodeau, S.N., et al. (1993) Microsatellite instability in cancer of the proximal colon. Science, 260, 816–819.

    Thornton, K. and Long, M. (2002) Rapid divergence of gene duplicates on the Drosophila melanogaster X chromosome. Mol. Biol. Evol., 19, 918–925.

    Verkerk, A., et al. (1991) Identification of a gene (FMR-1) containing a CGG repeat coincident with a breakpoint cluster region exhibiting length variation in fragile X syndrome. Cell, 65, 905–914.

    Vergnaud, G. and Denoeud, F. (2000) Minisatellites: mutability and genome architecture. Genome Res., 10, 899–907.

    Villafranca, E., et al. (2001) Polymorphisms of the repeated sequences in the enhancer region of the thymidylate synthase gene promoter may predict downstaging after preoperative chemoradiation in rectal cancer. J. Clin. Oncol., 19, 1779–1786.

    Weber, J.L. and May, P.E. (1989) Abundant class of human DNA polymorphisms which can be typed using the polymerase chain reaction. Am. J. Hum. Genet., 44, 388–396.

    Wooster, R., et al. (1994) Instability of short tandem repeats (microsatellites) in human cancers. Nat. Genet., 6, 152–156.




This article has been cited by other articles:

    A. Merkel and N. Gemmell. Detecting short tandem repeats from genome data: opening the software black box. Brief Bioinform, July 10, 2008; (2008) bbn028v1.

    S. B. Mudunuri and H. A. Nagarajaram. IMEx: Imperfect Microsatellite Extractor. Bioinformatics, May 15, 2007; 23(10): 1181 - 1187.

    S. Archak, E. Meduri, P. S. Kumar, and J. Nagaraju. InSatDb: a microsatellite database of fully sequenced insect genomes. Nucleic Acids Res., January 12, 2007; 35(suppl_1): D36 - D39.

