Bioinformatics Advance Access originally published online on January 10, 2006
Bioinformatics 2006 22(6):676-684; doi:10.1093/bioinformatics/btk032
Short fuzzy tandem repeats in genomic sequences, identification, and possible role in regulation of gene expression
1Department of Bioengineering and Bioinformatics, Moscow State University, Moscow, Russia
2INRIA Rocquencourt, France
3University of California, Berkeley, USA
4State Research Center GosNIIGenetika, Moscow, Russia
5Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Moscow, Russia
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Motivation: Genomic sequences are highly redundant and contain many types of repetitive DNA. Fuzzy tandem repeats (FTRs) are of particular interest. They are found in regulatory regions of eukaryotic genes and are reported to interact with transcription factors. However, accurate assessment of FTR occurrences in different genome segments requires a specific algorithm for efficient FTR identification and classification.
Results: We have obtained formulas for P-values of FTR occurrence and developed an FTR identification algorithm implemented in the TandemSWAN software. Using TandemSWAN we compared the structure and the occurrence of FTRs with short period length (up to 24 bp) in coding and non-coding regions, including UTRs, heterochromatic, intergenic and enhancer sequences of Drosophila melanogaster and Drosophila pseudoobscura. Tandems with period three and its multiples were found in coding segments, whereas FTRs with periods that are multiples of six are overrepresented in all non-coding segments. Periods equal to 5–7 and 11–14 were characteristic of the enhancer regions and other non-coding regions close to genes.
Availability: TandemSWAN web page, stand-alone version and documentation can be found at http://bioinform.genetika.ru/projects/swan/www/
Contacts: valeyo{at}imb.ac.ru
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
Eukaryotic genomes contain many types of repetitive sequences, such as long repeats, satellite DNA and many other yet unclassified sequences of various lengths and levels of repetitiveness (Singer and Berg, 1991). So far, the efforts of researchers have been predominantly focused on nearly perfect repeats such as microsatellites and others (Li et al., 2002). Analysis of more divergent (fuzzy) tandem repeats was complicated by problems related to their discrimination from background and insufficient annotation level of genomes.
In this study we focus on fuzzy tandems containing n occurrences (n > 2) of a mismatched word with a period of T bases (T = 3–24) without insertions or deletions. Tandem repeats are usually classified into microsatellites (1–6 bp), minisatellites (6–24 bp, and in some cases longer) (Vergnaud and Denoeud, 2000) and classical satellites. The length scale of fuzzy repeats considered here corresponds to the micro- and minisatellite repeat classes. However, we do not consider periods with T = 1 or 2, as they correspond to poly-A or TATA-like sequences, a different biological object explored elsewhere (Katti et al., 2001; Schug et al., 1998; Subramanian et al., 2003).
Fuzzy tandem repeats (FTRs) have been found in regulatory regions of eukaryotic genes (Shi et al., 2000); such tandems sometimes form cooperative arrays of binding sites and interact with transcription factors (Gao and Finkelshtein, 1998; Ott and Hansen, 1996; Meloni et al., 1998; Ramchandran et al., 2000). However, it is still unclear (1) how to define and extract fuzzy tandems, (2) whether functionally different sequences are enriched in tandems of a specific structure and (3) what biological function (if any) fuzzy tandems perform in the genome. If the genome distribution of FTRs is uneven, their exploration should help to locate structural/functional sequence categories and to understand the underlying mechanisms of their function.
The degree of FTR propagation varies from one genome to another and from one functional sequence category to another; existing algorithms (Benson, 1999; Kolpakov et al., 2003) return up to 10–15% of the Drosophila melanogaster genome and >10% (Benson, 1999) of the human genome as tandem repeats of various structure.
Accumulation of tandems in genomes is a result of errors during replication and some rearrangement events (Dover, 1982; Singer and Berg, 1991; Ellegren, 2004). From that perspective, much of the repetitive genomic DNA might be considered non-informative; however, there are cases where the presence of tandems is tightly linked to a biological function (Nakamura et al., 1998).
For instance, long tandem repeats constitute a large portion of heterochromatin satellite DNA and are involved in centromere formation and function (Martienssen, 2003); sometimes presence of long tandems even serves as a signal of extra centromere formation (Singer and Berg, 1991). Much less is known about the role of shorter repetitive sequences, especially highly mismatched fuzzy tandems (FTRs), quite abundant in exons, introns and transcription regulatory sequences (Nakamura et al., 1998). In exons, FTRs may reflect sequence periodicities existing in protein sequence or even structural features, such as hydrophobic helices (Katti et al., 2000; Li et al., 2004); it is unclear if these tandems have any function at the DNA level. In complex eukaryotic regulatory regions, such as enhancers and silencers, FTRs appear to be linked with some types of binding sites for transcription factors (Antoniewski et al., 1996; Ott and Hansen, 1996; Ramchandran et al., 2000). One of the attractive models suggests that an FTR with a unit consensus similar to a binding site modulates exact response to regulator concentration (Carroll et al., 2001; Davidson et al., 2000).
Repeats of various types may also be important for regulation that controls spatial packaging/dynamics of eukaryotic DNA. Thus, 8–16 bp repeats separated by a distance of <200 bases may characterize Scaffold Attached Regions (Boulikas, 1995), and periodic signals appear to play a role in nucleosome positioning (Ioshikhes et al., 1999). Periodic signals are present in prokaryotic and eukaryotic promoters, where they correspond to arrays of sites for DNA-binding proteins (Kutuzova et al., 1999; Kravatskaia et al., 2002; Makeev et al., 2003).
Tandem structure may sometimes be important for genome functioning: many human diseases are known to be caused by an increase in the number of copies, etc. (Verkerk et al., 1991; Huntington's Disease Collaborative Research Group, 1993; Fu et al., 1992; Thibodeau et al., 1993; Wooster et al., 1994; Villafranca et al., 2001; Niv et al., 2005). From a practical point of view, variations in tandem structure serve in many important applications, such as linkage analysis and DNA fingerprinting (Edwards et al., 1992; Weber and May, 1989).
Here we conducted a functional analysis of tandems in Drosophila at a genome-wide level by (1) introducing probabilistic models for tandems with a high degree of fuzziness and (2) finding tandem structures specific to certain functional sequence categories.
| 2 METHODS |
|---|
2.1 Algorithm
Fuzzy tandems differ in number of mismatched letters, period T and the number of repeated units n, also called the exponent. For instance, in the tandem ATcgc|ATggc|ATtcc|ATcgg only two positions are identical in all units; this level of divergence makes it difficult to detect such tandems with the existing tools (Benson, 1999; Kolpakov et al., 2003).
Typically, periodic signals in biological sequences are found with the help of autocorrelation analysis (Makeev and Tumanyan, 1996; Chaley et al., 1999; Chechetkin and Lobzin, 1998) and/or periodic alignments (e.g. Benson, 1999). However, such algebraic methods per se usually cannot select the best repeat from overlapping repeats with different periods. In addition, in the case of fuzzy repeats the probability that a tandem repeat appears by chance cannot be neglected. In this work, we amend a detection/scoring algorithm with statistical criteria for tandem discrimination. At the first step, candidate repeats are found using local autocorrelation analysis. At the second step, the candidate repeats are filtered based on their statistical weights. The filtering allows one to obtain a set of non-overlapping tandems for any sequence. Non-overlapping tandems identified in a sequence comprise a coverage map that can be easily compared with genome annotations and sequence feature maps.
Until now, there have been no algorithms providing solutions to all the problems addressed in this study. Some algorithms (Los Alamos National Laboratory, http://biosphere.lanl.gov/tandyman/cgi-bin/tandyman.cgi) detect only perfect tandems, others return only tandems with predefined parameters (Castelo et al., 2002; Sagot and Myers, 1998) or cannot resolve the problem of overlapping periods (Benson, 1999; Kolpakov et al., 2003). We included an option in our software that allows one to calculate the statistical significance of repeats found by other repeat finders, particularly TRF (Benson, 1999) and MREPS (Kolpakov et al., 2003).
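The filtering step can be illustrated with a minimal sketch. It assumes a simple greedy rule (keep the most significant candidate first, discard anything that overlaps a kept one); the `Candidate` record and the function names are hypothetical, and the actual TandemSWAN filter, which also applies a penalty factor for sub-periods, may differ.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """Hypothetical record for one candidate tandem repeat."""
    start: int      # 0-based, inclusive
    end: int        # exclusive
    period: int
    p_value: float  # smaller means more significant


def overlaps(a: Candidate, b: Candidate) -> bool:
    """True if the two half-open intervals share at least one position."""
    return a.start < b.end and b.start < a.end


def filter_non_overlapping(cands: List[Candidate]) -> List[Candidate]:
    """Greedy filter: visit candidates by increasing P-value and keep
    each one only if it does not overlap an already kept candidate."""
    kept: List[Candidate] = []
    for c in sorted(cands, key=lambda c: c.p_value):
        if not any(overlaps(c, k) for k in kept):
            kept.append(c)
    return sorted(kept, key=lambda c: c.start)
```

The result is a set of disjoint intervals, which is the property a coverage map needs; the greedy order is only one possible way of deciding which of several overlapping candidates survives.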
Identification of candidate repeats. In the first step of the algorithm we search for candidate repeats for each period T from the range of interest.
The algorithm compares a seed word of length (period) T at sequence position i with words of the same length at positions i − T and i + T. For each letter of the seed word, the number of mismatches w found from both comparisons is recorded at the corresponding sequence position of an output data array. If the three compared symbols are identical, the score equals 0; if only two symbols are identical, the score equals 1; and if all three symbols are different, the score equals 2. An example of the output array obtained after the described local autocorrelation procedure is shown in Figure 1. The algorithm identifies putative tandem repeats by finding minima of the local sum A of scores w in the second pass,
![]() | (1) |
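The scoring rule described above can be sketched as follows. The per-position score is taken directly from the text (0, 1 or 2 depending on how many of the three compared symbols agree); the windowed sum is only an assumption about the form of the local sum A, with a hypothetical `window` parameter, and is not a transcription of Equation (1).

```python
from typing import Dict, List, Optional


def mismatch_scores(seq: str, period: int) -> List[Optional[int]]:
    """Score w for each position that has neighbours at -period and +period.

    w = 0 if the three symbols are identical, 1 if exactly two agree,
    2 if all three differ. Positions without both neighbours get None.
    """
    scores: List[Optional[int]] = [None] * len(seq)
    for pos in range(period, len(seq) - period):
        symbols = {seq[pos - period], seq[pos], seq[pos + period]}
        scores[pos] = len(symbols) - 1
    return scores


def local_sums(scores: List[Optional[int]], window: int) -> Dict[int, int]:
    """Sum of w over every window made only of scored positions.

    The window length is an illustrative assumption; low sums mark
    putative tandem repeats.
    """
    sums: Dict[int, int] = {}
    for start in range(len(scores) - window + 1):
        chunk = scores[start:start + window]
        if all(w is not None for w in chunk):
            sums[start] = sum(chunk)
    return sums
```

For the tandem ATcgc|ATggc|ATtcc|ATcgg with T = 5, `mismatch_scores("ATCGCATGGCATTCCATCGG", 5)` gives 0 at the conserved AT positions of the two inner units and non-zero scores elsewhere.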
Filtration of candidate tandem repeats. The extraction step may return, for the same DNA segment, a collection of tandems that differ in phase, fuzziness and number of repeated units. However, genome-wide analysis (i.e. map feature comparison) requires a non-overlapping set of tandems. Therefore, we filter the extracted overlapping tandems (including those with multiple periods, like 3 and 6) and select the most statistically significant one. We propose two statistical models for the calculation of FTR P-values, the MaSk and the MotiF. The corresponding P-values are denoted here as the PS-value and the PF-value; their calculation is based on the MaSk and MotiF probabilities, PrS and PrF. The MaSk characterizes combinatorial properties of tandem repeats, namely the minimal number of identical symbols at each repeat position. For instance, the MaSk for the tandem TTC|TCC|TGG is (3,[3,1,2]), which means that the repeat has an exponent equal to 3, that at the first position all three letters are identical, that at the second position any letters are allowed and that at the third position at least two letters must be identical. In all cases the MaSk is considered regardless of the particular letters in the sequence. The probability of obtaining the MaSk at a random position, called the MaSk probability, is equal to:
![]() | (2) |
where ki ≤ n is the minimal number of identical symbols at position i of the repeat and n is the exponent of the repeat. The parameters of the Bernoulli model, the symbol frequencies pA, ... , pT, are estimated from the entire sequence. For instance, the MaSk probability calculated for the tandem TTC|TCC|TGG assuming pA = pC = pG = pT = 0.25 is equal to PrS(3, [3,1,2]) ≈ 0.04.
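The worked value above can be checked numerically. The sketch below is our reading of the MaSk definition, not a transcription of Equation (2): it assumes that positions are independent under the Bernoulli model and that "at least k identical" means some letter occurs at least k times among the n letters at that position. The function names are hypothetical, and the enumeration over all 4^n letter combinations is practical only for small exponents.

```python
from collections import Counter
from itertools import product
from math import prod
from typing import Dict, Sequence


def prob_at_least_k_identical(n: int, k: int, freqs: Dict[str, float]) -> float:
    """Probability that, among n independent letters, some letter
    occurs at least k times (brute-force enumeration)."""
    total = 0.0
    for letters in product(freqs, repeat=n):
        if max(Counter(letters).values()) >= k:
            total += prod(freqs[a] for a in letters)
    return total


def mask_probability(n: int, mask: Sequence[int], freqs: Dict[str, float]) -> float:
    """Product over repeat positions of the per-position probabilities."""
    return prod(prob_at_least_k_identical(n, k, freqs) for k in mask)


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(mask_probability(3, [3, 1, 2], uniform))  # 0.0390625, i.e. about 0.04
```

The three factors are 1/16 (all three letters identical), 1 (no constraint) and 5/8 (at least two of three identical), whose product 0.0390625 rounds to the 0.04 quoted in the text.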
The corresponding PS-value (the probability of obtaining a tandem satisfying the MaSk in a random text of length N) is equal to
![]() | (3) |
The MotiF model is based on the concept of a motif. A motif H represents the set of all words that comply with the IUPAC consensus [http://bioinformatics.org/sms/iupac.html] of the observed FTR. For example, the consensus for ATC|ATG|TTG is WTS and the motif is {ATC,ATG,TTC,TTG}. Such a motif representation has advantages and drawbacks; we discussed some of these issues in Kotelnikova et al. (2005). The probability of finding a motif H in the sequence is simply the probability of finding any word belonging to it:
![]() | (4) |
![]() | (5) |
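The motif construction in the ATC|ATG|TTG example can be sketched as follows. The helper names are hypothetical; the code table is the standard IUPAC nucleotide alphabet. The last function gives only the probability of seeing a motif word at one fixed position under a Bernoulli model, which is an assumption made for illustration and not the PF-value of the equations above.

```python
from itertools import product
from math import prod
from typing import Dict, Sequence, Set

# Standard IUPAC nucleotide codes and the letters each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
CODE_OF = {frozenset(letters): code for code, letters in IUPAC.items()}


def consensus(units: Sequence[str]) -> str:
    """IUPAC consensus of equally long repeat units, column by column."""
    return "".join(CODE_OF[frozenset(column)] for column in zip(*units))


def expand(cons: str) -> Set[str]:
    """All words that comply with an IUPAC consensus (the motif H)."""
    return {"".join(word) for word in product(*(IUPAC[c] for c in cons))}


def motif_word_probability(cons: str, freqs: Dict[str, float]) -> float:
    """Probability that a word of the motif occurs at one fixed position."""
    return sum(prod(freqs[a] for a in word) for word in expand(cons))


print(consensus(["ATC", "ATG", "TTG"]))  # WTS
print(sorted(expand("WTS")))             # ['ATC', 'ATG', 'TTC', 'TTG']
```

Note that the motif contains TTC, a word that never occurs in the observed tandem; this is one of the drawbacks of the consensus representation mentioned in the text.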
2.2 Implementation
The FTR extraction algorithm is implemented as a C++ package, TandemSWAN, available for online data processing and for download from the following URL: http://bioinform.genetika.ru/projects/swan/www. TandemSWAN accepts input sequences in most available file formats; user-defined parameters include minimal and maximal period lengths, significance level, the MaSk or the MotiF statistical mode and the penalty factor for sub-periods (see online help for the details on parameter settings). Memory requirements and running time depend on repeat abundance in the query sequence and on parameter values; e.g. the running time for the 22 MB Drosophila chrX amounts to ~2.5 h. TandemSWAN includes utilities for calculation of P-values for tandem repeats obtained by related programs, MREPS (Kolpakov et al., 2003) and TRF (Benson, 1999).
2.3 Coverage of random sequence with FTRs identified by TandemSWAN agrees with theoretical prediction
Genome-wide exploration requires convenient FTR maps, which we refer to here as coverage maps: the fraction of the sequence dataset positions (e.g. the percentage of total exon length) covered by FTRs with a specific structure. In the case of sequences generated under the Bernoulli model, this fraction can be evaluated analytically for each particular FTR period. To test the performance of our algorithm, we compared this analytical value with the results of FTR identification in simulated random sequences. Indeed, for each period T and significance level C, the probability of finding a candidate tandem repeat in the first step of the algorithm, starting from some position i − T (Fig. 1), is written as
![]() | (6) |
![]() | (7) |
3T
N/(5T 2), where the T-dependent factor at
reflects tandem repeat overlaps. We generated several 1 Mb sequences with uniform letter frequencies and with letter frequencies from the genome of D.melanogaster, identified FTRs with different parameter settings and calculated coverage maps for periods in the range 3–15 bases. Comparison between the observed and the calculated coverage of FTRs (Fig. 2) demonstrates that the devised formula [Equation (7)] accurately describes the distribution of the majority of FTRs present in the random sequence. The agreement between theoretical and observed coverage values holds for the range of periods explored in this study and depends only moderately on letter frequencies.
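The observed coverage used in this comparison is simply the fraction of sequence positions lying inside at least one identified FTR. A minimal sketch, assuming repeats are given as half-open (start, end) intervals and using hypothetical function names:

```python
from typing import List, Sequence, Tuple

Interval = Tuple[int, int]  # (start, end), 0-based, end exclusive


def covered_positions(intervals: Sequence[Interval], length: int) -> List[bool]:
    """Mark every sequence position that lies inside at least one interval."""
    covered = [False] * length
    for start, end in intervals:
        for pos in range(max(start, 0), min(end, length)):
            covered[pos] = True
    return covered


def coverage_fraction(intervals: Sequence[Interval], length: int) -> float:
    """Fraction of the sequence covered by the given repeats."""
    return sum(covered_positions(intervals, length)) / length


def windowed_coverage(intervals: Sequence[Interval], length: int,
                      window: int) -> List[float]:
    """Coverage fraction in consecutive non-overlapping windows
    (the last window may be shorter than the others)."""
    covered = covered_positions(intervals, length)
    fractions = []
    for start in range(0, length, window):
        chunk = covered[start:start + window]
        fractions.append(sum(chunk) / len(chunk))
    return fractions
```

Positions shared by overlapping intervals are counted once, so the same functions work both for the raw candidates and for the filtered, non-overlapping set; restricting the input to repeats of one period gives the per-period coverage.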
| 3 RESULTS AND DISCUSSION |
|---|
FTR density is uneven across the genome of Drosophila. The distribution of functional and other sequence features is known to be unequal among eukaryotic chromosomes and between different chromosome loci. Therefore, we decided to explore the distribution of FTRs, as detected by our algorithm, across a sample eukaryotic genome, D.melanogaster (Celniker et al., 2002). We focused our attention on the genome of Drosophila because of its outstanding level of annotation, the availability of many related genomes and its relatively small size (~120 MB). We performed FTR extraction with the default parameter setting (see TandemSWAN help) using the MaSk model for statistical weighting. In this test we explored map coverage, calculated for non-overlapping 16 KB windows, without discriminating tandem motifs and periods. We have found that FTR density (map coverage) along all chromosomes is highly inhomogeneous (Fig. 3a), and the central arm regions have higher FTR density than the distal regions (Fig. 3b). In addition, the average FTR density on the X-chromosome is substantially higher than on autosomes (see Relative FTR density across the genome below).
In order to assess possible roles of FTRs, we correlated the FTR distribution in the Drosophila genome with positions of coding sequences and some other sequence features, such as local AT/GC composition. We have found that gene-rich segments contain fewer FTRs than intergenic regions, while AT-rich segments are enriched in FTRs (Supplementary Fig. 1). The reasons for fluctuation of FTR density in the genome can be different; therefore, we explored FTR structure in functionally distinct sequence datasets.
FTRs with 3k periods prevail in coding regions. We investigated whether FTRs of specific periods prevail in certain functional categories. We assembled several sequence datasets containing all exons, 3'-untranslated regions (3'-UTRs), 5'-UTRs, intergenic regions, intergenic heterochromatin (from the Drosophila Heterochromatin Project, http://www.dhgp.org/; http://flybase.net/annot/dmel_het_release3.2b2.txt; ftp://flybase.net/genomes/Drosophila_melanogaster/current_hetchr/fasta/) and a dataset containing 124 transcription regulatory regions, i.e. enhancers (or cis-regulatory modules) (https://webfiles.berkeley.edu/dap5/public_html/index.html). Corresponding datasets were also constructed for the related species D.pseudoobscura (Richards et al., 2005). In order to explore FTRs within a functionally related group of genes, we also generated the corresponding datasets for a selection of 16 developmental gene loci from D.melanogaster and D.pseudoobscura. To investigate the prevalence of specific tandems in all these datasets, we compared total sequence coverage by FTRs with different periods (Fig. 4).
As expected, the most striking signals were detected in datasets containing exons (Fig. 4a). We found that in the coding regions FTRs with periods divisible by 3 prevail, whereas tandems with periods not equal to 3k are suppressed (below random expectation). We also found that 3k periods in coding regions of the X-chromosome have a greater coverage than 3k-periodic FTRs found in exons of autosomes. This suggests that FTR density even within similar functional units may be linked with the physical map, i.e. a particular place in the genome.
Surprisingly, we also detected a high presence of periods that are multiples of 6 (but not the other 3k periods) in the non-coding sequences. Apparently, there is a 6k background in the genome, not related to periodicities caused by codon triplets [see Possible source of 3k (6k) background below].
FTR periods specific to sequence categories other than exons. FTRs with periods 6 and 12 were found to be highly abundant throughout all analyzed datasets, including transcription regulatory regions, intergenic spacers, UTRs and even intergenic heterochromatin (Fig. 4b–d). At the same time, non-coding regions were also found to be enriched in FTRs with periods other than 3k. Outside exons we observed a 2- to 3-fold FTR excess over random expectation, which supports the non-random origin of FTRs and the non-random character of genomic sequences in general.
In order to detect differences in the FTR structure among datasets representing non-coding regions, we compared prevalence of all periods, i.e. FTR profiles, as shown in Figure 4 (Table 1 and Supplementary Table 1). The correlation analysis has shown that according to prevailing FTR periods, all 22 datasets can be subdivided into at least three groups of similarity, one corresponding to coding regions, another to heterochromatin and the last one corresponding to intergenic regions, spacers and others.
Comparison of absolute levels of FTR presence in different datasets has shown that intergenic heterochromatin, in general, contains fewer FTRs than euchromatin (Fig. 4b). Moreover, heterochromatic regions displayed some excess of FTRs with periods equal to 3k.
In general, comparison of different sequence categories demonstrated that FTRs with all explored periods are overabundant in the genome (with the exception of exons) and that FTRs with 6k periods for some reason strongly prevail, even in non-coding DNA.
FTRs in enhancers are similar, but not identical, to those in intergenic regions. Repetitive sequences in transcription regulatory regions are of special interest. While periodic signals present in exons (3k-periodic FTRs) can be explained by the genetic triplet code (and by periodicities in protein sequences), in regulatory regions FTRs may well represent a background. To investigate this problem, we removed the 6k background by normalizing FTR coverage values in functional datasets to the coverage in genome fragments without any functional annotation (non-functional).
We focused on 124 annotated (experimentally validated) enhancer regions from D.melanogaster (https://webfiles.berkeley.edu/dap5/public_html/data_06/124_Dmel_Enc.fa). The vast majority of these sequences are involved in the regulation of developmental genes. However, this group is neither functionally nor structurally homogeneous. The enhancers differ in length (0.3–3 kb) and regulate genes transcribed at different developmental stages. To achieve better representation we considered the entire dataset (124 sequences, 181 690 bp) and two subsamples, so-called AP (anterior–posterior, 72 sequences, 117 377 bp) and DV (dorso-ventral, 136 sequences, 114 354 bp) enhancers. Along with enhancers we also considered a separate dataset combining spacers between enhancers and a dataset combining coding regions from the same genome locations. The corresponding datasets were also constructed for D.pseudoobscura AP enhancers (sequences are available from the website).
Analysis of the normalized FTR distributions in enhancer datasets and in spacers (Fig. 4e) has shown some degree of enrichment in FTRs with periods 7 and 8 in all datasets representing loci of developmental genes. No major differences were found in FTR distribution between the enhancers and their flanking regions or spacers. However, the overall FTR distribution was not identical to that found in the other non-functional, intergenic regions of the genome.
Along with the assessment of general FTR distributions in enhancers, we also investigated a possible relation between FTR motifs and the binding motifs for transcription factors present in the enhancers. In some single cases (even-skipped stripe 2 enhancer) we observed some similarity, but on the larger scale the correlation was not found to be significant.
It appears that FTRs in enhancers differ from FTRs in the rest of the genome in their period and, perhaps, in motif composition; however, the insufficient number of annotated enhancer regions in the genome and the presence of the 6k background complicate the analysis.
Possible source of 3k (6k) background. Our results show that in Drosophila tandems with periods 6 and 12 are found in the entire genome, whereas other 3k periods are restricted to protein coding sequences.
The abundance of 6k periods might be explained, either through the mechanisms of DNA replication and rearrangement, or from possible structural DNA features. Existing experimental data suggest that certain 3-periodic synthetic DNA sequences have substantially different helix stability probably owing to mismatched alignments at equilibrium temperatures during melting (Delcourt and Blake, 1991). So, the quasi-repeated structures may be important for functional stability/flexibility of DNA molecule.
Periods of 3k abundant in coding sequences are apparently related to periodicity in protein sequences and the triplet nature of the genetic code. Repetitive structures in protein sequences, such as the 3₁₀ helix or hydrophobic alpha helices, may also cause periodicity at the level of DNA (Katti et al., 2000).
The coincidence that 6k periods also fall into the 3k periods may have even deeper roots. Apparently, nucleic acids and their replication appeared in a certain form before the triplet genetic code (Lifson, 1997); so it may be that a 3–6k matrix was present in DNA long before the genetic code itself, and by itself served as a source for the formation of what we know today as the triplet genetic code.
Be the origin of 3k/6k periodic FTRs mechanical (DNA stability/replication) or historical (3k matrix), it is unlikely that this non-specifically distributed signal is connected with fine mechanisms of genome functioning, such as regulation of gene expression. However, this does not exclude the possibility that FTRs with other periods or FTRs containing some specific motifs are involved in some regulatory functions. Moreover, if the quasi-periodic structure of the native DNA sequence is indeed important, it would pose an additional constraint on all motifs in the sequence, including regulatory signals.
Role of FTRs with periods other than 3k. Analysis of FTRs with periods different from 3k has shown that periods 7 and 8 are more abundant in enhancers and in the surrounding regions without functional annotation (Fig. 4e). Currently, it is not clear what function these FTRs may perform in enhancers and whether their presence is related to any function at all. For instance, we have found no correlation between FTRs and recognition motifs for transcription factors present in the same enhancers. However, we considered only a limited number of binding motifs (11), most of which are far from perfect. Improving enhancer annotations, increasing the number of considered motifs and better recognizing functional motif matches may shed more light on possible roles of FTRs in the regulation of transcription.
As has been suggested earlier (Papatsenko et al., 2002), some FTRs may play a role as cassettes, containing synergistically acting tandems of binding sites and responding to certain threshold levels of transcription factors. However, it is also possible that the presence of specific repetitive sequences provides a certain spatial geometry to an enhancer, required for the correct assembly of regulatory protein complexes. Apparently, some FTRs may be involved in the maintenance of chromatin structure and/or spatial DNA geometry even in a broader context.
The role of tandem repeats in regulatory regions was discussed recently by Sinha and Siggia (2005). The authors assessed high-quality tandem repeats found in enhancers of D.melanogaster and D.pseudoobscura using TRF and MREPS and demonstrated their low conservation. They concluded that the tandem repeats carry a limited functional load. This agrees with our first finding that the majority of FTRs found in enhancers have the same predominant periods as non-annotated intergenic DNA. On the other hand, some FTRs found in enhancer regions have specific properties and may be involved in regulatory function.
Relative FTR density across the genome. While qualitative composition of FTRs is surprisingly similar across the genome, their density may substantially vary from one genomic location to the other and from one genome to the other. This may or may not be connected with the local gene density or even presence of gene deserts (Ovcharenko et al., 2005).
We have found that the genome of D.pseudoobscura has a greater DNA fraction covered by FTRs in all functional categories than the genome of D.melanogaster (Fig. 4). In both genomes the X-chromosome has a higher FTR density than the other chromosome arms, and distal locations of the chromosome arms have lower FTR densities than more central loci (Fig. 3).
It is noteworthy that the Drosophila X-chromosome has a greater number of short perfect repeats (Katti et al., 2001) and probably a greater number of recent gene duplications as compared with autosomes (Thornton and Long, 2002). This difference between sex-related chromosomes and the rest of genome was also reported in human, where LINE1 repeat elements cover one-third of the human X-chromosome (Ross et al., 2005). However, in the case of Caenorhabditis elegans genome the reported repeat density in X-chromosome is lower (Achaz et al., 2001).
The difference in FTR fraction between the genomes of D.melanogaster and D.pseudoobscura is probably related to the higher compactness of the D.melanogaster genome. Finally, we have found that FTRs are relatively less abundant in intergenic heterochromatin. Perfect tandems with a periodicity of 5 found in pericentromeric heterochromatic regions (Sun et al., 2003) actually cover a surprisingly low fraction of the total heterochromatic DNA.
Problems of FTR exploration and focus of the TandemSWAN algorithm. Eukaryotic genomes are overwhelmed by repetitive sequences that probably carry no biological function and are caused, for instance, simply by peculiarities of DNA replication (Ellegren, 2004). This non-functional noise needs to be filtered, which in part can be done by exploring repeat parameters in different sequence categories and/or by normalizing to the noise in the non-functional regions. However, even the background signals may serve the maintenance of overall DNA survivability and might contain signals required, for instance, for chromatin packaging.
Here we explored fuzzy tandems since they are largely out of the focus of the regular tandem-finding programs (Castelo et al., 2002; Sagot and Myers, 1998; Benson, 1999; Kolpakov et al., 2003). Specifics of our algorithm can be illustrated by comparing TandemSWAN with the two most popular repeat finders, TRF (Benson, 1999) and MREPS (Kolpakov et al., 2003) (Fig. 5 and Supplementary Table 2). In the comparison with TRF and MREPS we tried to obtain sequence sets that were as similar as possible to those obtained with SWAN using the parameters characteristic of our study. Actually, both TRF and MREPS are usually used to search for more precise repeats than those in Figure 5, and at least TRF was operating near the limit of repeat fuzziness allowed by its internet-based version. All three tools perform similarly in the case of perfect repeats. However, the results of TandemSWAN and TRF/MREPS become quite different in the case of FTR extraction.
| 4 CONCLUSIONS |
|---|
In this work we have formulated a working definition of FTRs based on their statistical properties, developed an extraction algorithm and compared the properties of tandems found in sequences carrying different functions. We developed statistics that allow calculation of P-values for FTR occurrence in Bernoulli-type random sequences, which can be useful for other algorithms. This statistical approach is implemented in the TandemSWAN program, which aims to identify FTRs with a broad spectrum of the listed parameters.
Using this approach, we identified short FTRs with periods of 3–24 bases in the D.melanogaster and D.pseudoobscura genomes and compared FTR structure and occurrence in coding and non-coding regions, heterochromatic regions and regulatory (enhancer) sequences. We have found that different types of tandems are abundant in different functional sequence categories, with each category having its own pattern of preferred period lengths. Tandems with period 3 and its multiples were found to be characteristic of coding regions. FTRs with 6k periods are characteristic of all non-coding DNA. FTRs with periods equal to 5, 6, 7 and 11, 12, 14 were enriched in loci of developmental genes and developmental enhancers. The regulatory modules on average have no fewer FTRs than nearby spacers; furthermore, FTRs with periods 7 and 8 are found more often in Drosophila cis-regulatory modules than in other non-coding DNA. Obviously, both the evolution and the DNA structure of regulatory modules are subject to many additional parameters, such as DNA melting and adsorption of protein regulatory factors. Thus, it is possible that FTRs found in cis-regulatory modules have some particular sequence structure facilitating their function and should be studied in greater detail. To understand the role of the 6 bp-related omnipresent repeats, it is necessary first to test whether they are present in different bacterial, animal and plant taxa, and not only in Drosophila species. This work is currently in progress.
| Acknowledgments |
|---|
The authors are thankful to M. Borodovsky, A. P. Lifanov, N. A. Oparina, N. G. Esipova, M. Lassig, A. V. Favorov, V. E. Ramensky, M. G. Gelfand and A. A. Mironov for valuable discussion. They also thank R. Zinzen for careful manuscript reading and suggested changes. This study has been supported by the French Program EcoNet-08159PG, INTAS grant 04-83-3994, Russian State Contract No 02.434.11008, RFBR grant 04-04-49601, Fogarty RO3 TW005899-01A1 program, Russian Academy of Science Presidium Program in Molecular and Cellular Biology, project #10 and Ludwig Institute of Cancer Research Grant CRDF GAP RBO-1268.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Steven L. Salzberg
Received on October 14, 2005; revised on December 22, 2005; accepted on December 28, 2005
| REFERENCES |
|---|
|
|
|---|
Antoniewski, C., et al. (1996) Direct repeats bind the EcR/USP receptor and mediate ecdysteroid responses in Drosophila melanogaster. Mol. Cell. Biol, . 16, 29772986
Achaz, G., et al. (2001) Study of intrachromosomal duplications among the eukaryote genomes. Mol. Biol. Evol, . 18, 22802288
Benson, G. (1999) Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res, . 27, 573580
Benson, G. and Waterman, M. (1994) A method for fast database search for all k-nucleotide repeats. Nucleic Acids Res, . 22, 48284836
Boulikas, T. (1995) Chromatin domains and prediction of MAR sequences. Int. Rev. Cytol, . 162A, 279388[CrossRef].
Carroll, S.B., Grenier, J.K., Weatherbee, S.D. From DNA to Diversity, (2001) , Malden, MA ISBN 0-632-04511-6 Molecular Genetics and the Evolution of Animal Design. Blackwell Science.
Castelo, A.T., et al. (2002) TROLLtandem repeat occurrence locator. Bioinformatics, 18, 634636
Celniker, S., et al. (2002) Finishing a whole genome shotgun: release 3 of the Drosophila melanogaster euchromatic genome sequence. Genome Biol, . 3, RESEARCH0079.
Chaley, M.B., et al. (1999) Method revealing latent periodicity of the nucleotide sequences modified for a case of small samples. DNA Res, . 6, 153163[Abstract].
Chechetkin, V.R. and Lobzin, V.V. (1998) Nucleosome units and hidden periodicities in DNA sequences. J. Biomol. Struct. Dyn, . 15, 937947[Web of Science][Medline].
Delcourt, S.G. and Blake, R.D. (1991) Stacking energies in DNA. J. Biol. Chem, . 266, 1516015169
Davidson, H., et al. (2000) Genomic sequence analysis of Fugu rubripes CFTR and flanking genes in a 60 kb region conserving synteny with 800 kb of human chromosome 7. Genome Res, . 10, 11941203
Dover, G.A. (1982) Molecular drive, a cohesive model of species evolution. Nature, 299, 111117[CrossRef][Medline].
D.melanogaster heterochomatin genome data from Drosophila Heterochromatin Genome Project.
Edwards, A., et al. (1992) Genetic variation at five trimeric and tetrameric tandem repeat loci in four human population groups. Genomics, 12, 241253[CrossRef][Web of Science][Medline].
Ellegren, H. (2004) Microsatellites: simple sequences with complex evolution. Nat. Rev. Genet, . 5, 435445[CrossRef][Web of Science][Medline].
Fu, Y.-H., et al. (1992) An unstable triplet repeat in a gene related to myotonic muscular dystrophy. Science, 255, 12561258
Gao, Q. and Finkelstein, R. (1998) Targeting gene expression to the head: the Drosophila orthodenticle gene is a direct target of the Bicoid morphogen. Development, 125, 41854193[Abstract].
Huntington's Disease Collaborative Research Group. (1993) A novel gene containing a trinucleotide repeat that is expanded and unstable on Huntington's disease chromosomes. Cell, 72, 971983[CrossRef][Web of Science][Medline].
Ioshikhes, I., et al. (1999) Periodical distribution of transcription factor sites in promoter regions and connection with chromatin structure. Proc. Natl Acad. Sci. USA, 96, 28912895
Karlin, S., et al. (1988) Efficient algorithms for molecular sequence analysis. Proc. Natl Acad. Sci. USA, 85, 841845
Katti, M.V., et al. (2001) Differential distribution of simple sequence repeats in eukaryotic genome sequences. Mol. Biol. Evol, . 18, 11611167
Katti, M.V., et al. (2000) Amino acid repeat patterns in protein sequences: their diversity and structural-functional implica-tions. Protein Sci, . 9, 12031209[Web of Science][Medline].
Kolpakov, R., et al. (2003) mreps: Efficient and flexible detection of tandem repeats in DNA. Nucleic Acids Res, . 31, 36723678
Kotelnikova, E.A., et al. (2005) Evolution of transcription factor DNA binding sites. Gene, 347, 255263[CrossRef][Web of Science][Medline].
Kravatskaia, G.I., et al. (2002) Similarities in periodical structures in the position of nucleotides in regions of initiation of replication of bacterial genomes. Biofizika, 47, 595599[Medline].
Kutuzova, G.I., et al. (1999) Periodicity in contacts of RNA-polymerase with promoters. Biofizika, 44, 216223[Medline].
Landau, G.M., et al. (2001) An algorithm for approximate tandem repeats. J. Comput. Biol, . 8, 118[CrossRef][Web of Science][Medline].
Li, L., et al. (2004) Pseudo-periodic partitions of biological sequences. Bioinformatics, 20, 295306
Li, Y.C., et al. (2002) Microsatellites: genomic distribution, putative functions and mutational mechanisms: a review. Mol. Ecol, . 11, 24532465[CrossRef][Medline].
Lifson, S. (1997) On the crucial stages in the origin of animate matter. J. Mol. Evol, . 44, 18.
Makeev, V.Ju. and Tumanyan, V.G. (1996) Search of periodicities in primary structure of biopolymers: a general Fourier approach. Comput. Appl. Biosci, . 12, 4954
Makeev, V.J., et al. (2003) Distance preferences in the arrangement of binding motifs and hierarchical levels in organization of transcription regulatory information. Nucleic Acids Res, . 31, 60166026
Martienssen, R.A. (2003) Maintenance of heterochromatin by RNA interference of tandem repeats. Nat. Genet, . 35, 213214[CrossRef][Web of Science][Medline].
Meloni, R., et al. (1998) A tetranucleotide polymorphic microsatellite, located in the first intron of the tyrosine hydroxylase gene, acts as a transcription regulatory element in vitro. Hum. Mol. Genet, . 7, 423428
Nakamura, Y., et al. (1998) VNTR (variable number of tandem repeat) sequences as transcriptional, translational, or functional regulators. J. Hum. Genet, . 43, 149152[CrossRef][Web of Science][Medline].
Niv, E., et al. (2005) Microsatellite instability in patients with chronic B-cell lymphocytic leukaemia. Br. J. Cancer, . 92, 15171523[CrossRef][Web of Science][Medline].
Ovcharenko, I., et al. (2005) Evolution and functional classification of vertebrate gene deserts. Genome Res, . 15, 137145
Ott, R.W. and Hansen, L.K. (1996) Repeated sequences from the Arabidopsis thaliana genome function as enhancers in transgenic tobacco. Mol. Gen. Genet, . 252, 563571[Web of Science][Medline].
Papatsenko, D.A., et al. (2002) Extraction of functional binding sites from unique regulatory regions: the Drosophila early developmental enhancers. Genome Res, . 12, 470481
Ramchandran, R., et al. (2000) A (GATA)(7) motif located in the 5' boundary area of the human beta-globin locus control region exhibits silencer activity in erythroid cells. Am. J. Hematol, . 65, 1424[CrossRef][Web of Science][Medline].
Richards, S., et al. (2005) Comparative genome sequencing of Drosophila pseudoobscura: chromosomal, gene, and cis-element evolution. Genome Res, . 15, 118
Ross, M.T., et al. (2005) The DNA sequence of the human X chromosome. Nature, 434, 325337[CrossRef][Web of Science][Medline].
Sagot, M. and Myers, E. (1998) Identifying satellites in nucleic acid sequences. Proceedings of the Second Annual International Conference on Computational Molecular Biology , NY ACM Press, pp. 234242.
Schug, M.D., et al. (1998) The distribution and frequency of microsatellite loci in Drosophila melanogaster. Mol. Ecol, . 7, 5770[CrossRef][Medline].
Shi, X.M., et al. (2000) Tandem repeat of C/EBP binding sites mediates PPARgamma2 gene transcription in glucocorticoid-induced adipocyte differentiation. J. Cell Biochem, . 76, 518527[CrossRef][Web of Science][Medline].
Singer, M. and Berg, T. Genes and Genomes, (1991) , Mill Valley, California University Science Books.
Sinha, S. and Siggia, E.D. (2005) Sequence turnover and tandem repeats in cis-regulatory modules in drosophila. Mol. Biol. Evol, . 22, 874885
Subramanian, S., et al. (2003) Genome-wide analysis of microsatellite repeats in humans: their abundance and density in specific genomic regions. Genome Biol, . 4, R13[CrossRef][Medline].
Sun, X., et al. (2003) Sequence analysis of a functional Drosophila centromere. Genome Res, . 13, 182194
Thibodeau, S.N., et al. (1993) Microsatellite instability in cancer of the proximal colon. Science, 260, 816819
Thornton, K. and Long, M. (2002) Rapid divergence of gene duplicates on the Drosophila melanogaster X chromosome. Mol. Biol. Evol, . 19, 918925
Verkerk, A., et al. (1991) Identification of a gene (FMR-1) containing a CGG repeat coincident with a breakpoint cluster region exhibiting length variation in fragile X syndrome. Cell, 65, 905914[CrossRef][Web of Science][Medline].
Vergnaud, G. and Denoeud, F. (2000) Minisatellites: mutability and genome architecture. Genome Res, . 10, 899907
Villafranca, E., et al. (2001) Polymorphisms of the repeated sequences in the en-hancer region of the thymidylate synthase gene promoter may predict downstaging after preoperative chemoradiation in rectal cancer. J. Clin. Oncol, . 19, 17791786
Weber, J.L. and May, P.E. (1989) Abundant class of human DNA polymorphisms which can be typed using the polymerase chain reaction. Am. J. Hum. Genet, . 44, 388396[Web of Science][Medline].
Wooster, R., et al. (1994) Instability of short tandem repeats (microsatellites) in human cancers. Nat. Genet, . 6, 152156[CrossRef][Web of Science][Medline].








, TandemSWAN, significance level 2;
theoretical, significance level 2; x TandemSWAN, significance level 3;
, theoretical, significance level 3. A 1M long Bernoulli random sequence with average genomic nucleotide frequencies was simulated.

, heterochromatin (7 089 934 bp). (c) D.melanogaster UTRs: , autosome 5'-UTRs (3 331 080 bp); 



