Bioinformatics Advance Access originally published online on August 30, 2005
Bioinformatics 2005 21(21):3951-3958; doi:10.1093/bioinformatics/bti651
Negative correlation between compositional symmetries and local recombination rates
1Department of Molecular, Cellular and Developmental Biology, Yale University New Haven, CT, USA
2Department of Epidemiology and Public Health, Yale University New Haven, CT, USA
3Department of Genetics, Yale University New Haven, CT, USA
*To whom correspondence should be addressed at 60 College Street, New Haven, CT 06520-8034, USA
| Abstract |
|---|
Although still poorly understood, the universal reverse complement symmetry in genomes may contain considerable information about the genome. In this article, under the hypothesis that recombination rate variation may be related to higher-order DNA structure, we studied the association between local recombination rates and local symmetry levels in mouse, rat and human. We found significant negative correlations between recombination rates and reverse complement compositional symmetries in all three organisms. This negative correlation also held at the level of individual chromosomes, when the data from each chromosome were analyzed separately.
Contact: hongyu.zhao{at}yale.edu
| INTRODUCTION |
|---|
Chargaff's first parity rule states that the frequency of A is equal to that of T and the frequency of C is equal to that of G in double-stranded DNA (Magasanik and Chargaff, 1951). Watson and Crick's DNA helix model explained the first parity rule (Watson and Crick, 1953). Chargaff and colleagues also observed that for single-stranded DNA, the equalities hold approximately (Rudner et al., 1968). That is, when only one strand of the double-stranded DNA is considered, the frequency of A is approximately equal to that of T and the frequency of C is approximately equal to that of G. This intra-strand parity rule for single nucleotides can be extended to longer oligonucleotides (Prabhu, 1993; Qi and Cuticchia, 2001). For example, under this parity rule, for single-stranded DNA, at order 2 (thus length 2), the frequency of GA is equal to that of TC (TC is the reverse complement of GA) and the frequency of CT is equal to that of AG (AG is the reverse complement of CT). Therefore, there is reverse complement symmetry in single-stranded DNA. Baisnee et al. (2002) conducted a comprehensive study of this single-strand reverse complement symmetry. They measured the symmetry at orders 1–9 for a wide range of genomes including viruses, bacteria, archaea, mitochondria and eukaryota, and demonstrated that the higher-order symmetry does not entirely result from the lower-order symmetry (Baisnee et al., 2002). The reason for this single-strand reverse complement symmetry is still not well understood.
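The order-2 example above can be checked on any sequence with a few lines of code. The following is only an illustrative sketch in Python (not the code used in this study); the function names and the toy sequence are hypothetical.

```python
from collections import Counter

# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def kmer_counts(seq: str, k: int) -> Counter:
    """Count overlapping oligonucleotides of length k on one strand."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


if __name__ == "__main__":
    toy = "GATCTCAGAGATC"  # hypothetical toy sequence, far too short for real use
    counts = kmer_counts(toy, 2)
    for kmer in ("GA", "CT"):
        rc = reverse_complement(kmer)
        # Under the intra-strand parity rule the two counts should be similar
        # in a long genomic sequence.
        print(kmer, counts[kmer], rc, counts[rc])
```

On real genomic windows of several megabases the paired counts are expected to be close, whereas on a short toy string they need not be.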
Forsdyke (1995) hypothesized that this symmetry results from DNA stem–loop secondary structures. The single strands of supercoiled duplex DNA may form stem–loop structures, which may facilitate the initiation of homologous recombination by way of "kissing" between the tips of stem–loop structures. The evolutionary advantage of recombination would then select for single-strand reverse complement symmetry (Forsdyke, 1995). Baisnee et al. argued that the reverse complement symmetry does not result from point mutation or recombination alone, but from the combined effect of different mechanisms at different orders (Baisnee et al., 2002). Overall, the reverse complement symmetry may contain multiple levels of information about the genome.
It is widely known that recombination rates vary along chromosomes, with widespread recombination hotspots and coldspots (reviewed in Lichten and Goldman, 1995; Petes, 2001; Nachman, 2002). However, the reasons for this variation are poorly understood. The cross-over hot-spot instigator (Chi) sequences locally increase recombination in Escherichia coli (Smith, 1988), but for most recombination hotspots no specific sequence motifs can be found. In yeast, double-strand DNA breaks (DSBs) initiate most, if not all, meiotic recombination, and these DSB sites usually lie in deoxyribonuclease I sensitive regions (Wu and Lichten, 1994). Together, these observations suggest that DNA structure and accessibility may have an important role in recombination rate variation. GC content has been reported to be positively correlated with local recombination rates in the human genome (Fullerton et al., 2001). Other sequence features such as poly(A)/poly(T) fraction and CpG fraction also have significant correlations with recombination rates (Kong et al., 2002).
In this paper, we studied the association between local recombination rates and genome compositional reverse complement symmetry using publicly available genome-wide recombination rate data and genomic sequence data in Mus musculus (mouse), Rattus norvegicus (rat) and Homo sapiens (human). We found that local recombination rates are negatively correlated with compositional symmetries.
| METHODS |
|---|
Sequence data
We downloaded the genomic sequences of Mus musculus, Rattus norvegicus and Homo sapiens from the NCBI [ftp://ftp.ncbi.nih.gov/genomes/, April (May for mouse), 2005].
Measure of reverse complement symmetry
In this paper, we adopted a symmetry measure defined by
$$S^N = 1 - \frac{\sum_i |f_i - f_i'|}{\sum_i \left(|f_i| + |f_i'|\right)}$$
(Baisnee et al., 2002), where f_i is the frequency of the i-th N-mer oligonucleotide in a genomic region and f_i' is the frequency of its reverse complement in the same region. We allow overlapping sequences in deriving these counts. S^N is computed over the complete set of N-mers, and its value ranges from 0 (total asymmetry) to 1 (perfect symmetry). Baisnee and colleagues stated that this measure has some advantages over Pearson's correlation coefficient; for example, Pearson's correlation coefficient is sensitive to outliers (Baisnee et al., 2002).
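As an illustration of how such a measure can be computed, the sketch below assumes the measure has the form S^N = 1 − Σ|f_i − f_i'| / Σ(f_i + f_i'), which is our reading of the definition of Baisnee et al. (2002). It is written in Python for readability rather than in the Perl used for the actual analyses, and the function names are hypothetical. Skipping N-mers that contain N is also an assumption of this sketch.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_BASES = set("ACGT")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def symmetry_measure(seq: str, n: int) -> float:
    """Assumed form: S^N = 1 - sum|f_i - f_i'| / sum(f_i + f_i').

    Overlapping N-mers are counted on one strand. Raw counts are used
    instead of frequencies because the common normalising factor cancels
    in the ratio.
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - n + 1):
        kmer = seq[i:i + n]
        if set(kmer) <= _BASES:  # assumption: skip N-mers containing N
            counts[kmer] += 1
    if not counts:
        raise ValueError("no unambiguous N-mers in sequence")
    # Include N-mers that are absent themselves but whose reverse
    # complement is present, so the sums run over both members of a pair.
    keys = set(counts) | {reverse_complement(k) for k in counts}
    num = sum(abs(counts[k] - counts[reverse_complement(k)]) for k in keys)
    den = sum(counts[k] + counts[reverse_complement(k)] for k in keys)
    return 1.0 - num / den
```

N-mers that are absent together with their reverse complement contribute zero to both sums, so restricting the sums to observed pairs gives the same value as summing over the complete set of N-mers.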
Other sequence features
In each non-overlapping genomic sequence window, after correcting for the number of Ns in the genome sequence, the fraction of G or C is the GC content. The fraction of CpG dinucleotides is the CpG fraction. The fraction of poly A_n or T_n where n ≥ 4 is the poly(A)/poly(T) [(A)n≥4 and (T)n≥4] tract fraction.
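These three features can be sketched as follows. This is a hypothetical Python illustration, not the Perl code used in the study. In particular, the choice of denominator (the number of non-N bases in the window) and the decision to count every base lying inside an A or T run of length at least 4 are assumptions of this sketch, since the text does not spell them out.

```python
import re


def sequence_features(window: str) -> dict:
    """GC content, CpG fraction and poly(A)/poly(T) tract fraction of a window.

    Assumption: all three fractions are taken relative to the number of
    called (non-N) bases, which is one way of "correcting for Ns".
    """
    seq = window.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        raise ValueError("window contains no called bases")
    gc_content = (seq.count("G") + seq.count("C")) / called
    # "CG" cannot overlap itself, so str.count finds every CpG dinucleotide.
    cpg_fraction = seq.count("CG") / called
    # Bases lying in runs of at least four consecutive A or four consecutive T.
    tract_bases = sum(len(m.group()) for m in re.finditer(r"A{4,}|T{4,}", seq))
    return {
        "gc_content": gc_content,
        "cpg_fraction": cpg_fraction,
        "poly_at_fraction": tract_bases / called,
    }
```

For the toy window "AAAACGNNTT", eight bases are called, two of them are G or C, there is one CpG, and only the run of four As qualifies as a tract.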
Local recombination rates
The genome-wide recombination rates for Mus musculus, Rattus norvegicus and Homo sapiens were taken from Jensen-Seaman et al. (2004). In that paper, the authors estimated recombination rates using the mouse OB × CAST F2 intercross genetic map (Dietrich et al., 1996) including 4880 markers, the rat SHRSP × BN F2 intercross genetic map (Steen et al., 1999) including 2305 markers, and the human Iceland pedigree map (Kong et al., 2002) including 5114 markers. Assuming that genetic distance is linear between immediately flanking genetic markers, they assigned each base a recombination rate. The average recombination rate of the bases within each non-overlapping 5 Mb window along the whole genome is given in that paper's Supplementary files. For human, they also estimated sex-specific recombination rates, which can be downloaded from the UCSC Genome Bioinformatics database, Table Browser (http://genome.ucsc.edu/). These sex-specific recombination rates correspond to non-overlapping 1 Mb windows. Except in the sex difference analysis, all windows used below are 5 Mb.
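The windowing step can be illustrated with a short sketch. The code below is not the procedure of Jensen-Seaman et al. (2004) itself but a minimal Python illustration of the idea: the rate between two adjacent markers is taken as constant (genetic distance divided by physical distance), and each window's rate is the base-weighted average over the bases that lie between markers. The function name, the input layout and the handling of bases outside the outermost markers (ignored here) are assumptions.

```python
def window_rates(markers, window_bp=5_000_000):
    """Average recombination rate (cM/Mb) per non-overlapping window.

    markers: iterable of (physical_position_bp, genetic_position_cM) for one
    chromosome. Returns {window_index: mean rate}, where window k covers
    [k * window_bp, (k + 1) * window_bp).
    """
    markers = sorted(markers)
    weighted = {}  # sum of rate * number of bases, per window
    covered = {}   # number of bases with a rate estimate, per window
    for (p0, g0), (p1, g1) in zip(markers, markers[1:]):
        if p1 <= p0:
            continue  # skip markers sharing a physical position
        rate = (g1 - g0) / ((p1 - p0) / 1e6)  # cM per Mb in this interval
        w = p0 // window_bp
        while w * window_bp < p1:
            lo = max(p0, w * window_bp)
            hi = min(p1, (w + 1) * window_bp)
            weighted[w] = weighted.get(w, 0.0) + rate * (hi - lo)
            covered[w] = covered.get(w, 0) + (hi - lo)
            w += 1
    return {w: weighted[w] / covered[w] for w in covered}
```

With markers at 0 Mb (0 cM), 2 Mb (2 cM) and 6 Mb (4 cM), the first 5 Mb window contains 2 Mb at 1 cM/Mb and 3 Mb at 0.5 cM/Mb, giving an average of 0.7 cM/Mb.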
In our analyses, we calculated the reverse complement symmetry measures in these 5 Mb or 1 Mb windows. We also calculated the CpG fraction, GC content, and poly(A)/poly(T) [(A)n≥4 and (T)n≥4] tract fraction. Windows with >20% N bases and windows without estimated recombination rates were excluded from this study. For the 5 Mb non-overlapping windows, there were 426 windows in total for mouse, 467 for rat and 543 for human. For the 1 Mb windows for human, we only considered the autosomes. There were 2563 windows in total with female-specific recombination rates and 2439 windows with male-specific recombination rates.
We used Perl to calculate the reverse complement symmetry measures, CpG fraction, GC content and poly(A)/poly(T) [(A)n≥4 and (T)n≥4] tract fraction. R was used for all the statistical analyses.
| RESULTS |
|---|
Reverse complement symmetry in mouse, rat and human
In Figure 1, we summarize the symmetry measures for oligonucleotide lengths 1–12 in mouse, rat and human for non-overlapping 5 Mb windows. From these box plots of symmetry measures, we can see that the symmetry measure is high for short oligonucleotides and drops for long oligonucleotides. The variance of the symmetry measure for short oligonucleotides is very small (≤10⁻⁵). In order to capture the potential relationship between local recombination rates and symmetry levels, we focused on oligonucleotide length 12, which has the highest variances for these three organisms (1.04 × 10⁻³ for mouse, 1.25 × 10⁻³ for rat and 5.29 × 10⁻⁴ for human) among the orders examined and reasonable symmetry levels (mean 0.41 for mouse, 0.40 for rat and 0.44 for human) in our study.
Negative correlation between recombination rates and symmetry measures
In Figure 2, the correlations between recombination rates and symmetry measures are plotted for oligonucleotide lengths 1–12. For short lengths, there was almost no correlation, which is not surprising because of the small variance of the calculated symmetry measures across different regions. For longer lengths, there were negative correlations for all three organisms.
The scatter plot of recombination rates against symmetry measures at order 12 for mouse is shown in Figure 3. The left panel shows all the data points. The regression coefficient for the symmetry measure was −3.67 with P-value 8.05 × 10⁻⁹. The two circled points may be considered possible outliers. After removing these two points, the regression coefficient was −4.76 with P-value 7.24 × 10⁻¹⁰ (scatter plot in the right panel). For these two possible outliers, the symmetry measure was unusually high (0.77) or unusually low (0.27). The region with the high symmetry measure was on chromosome X, a 5 Mb region from 25 to 30 Mb. The number of Ns in this region was 770 153, which was high. The presence of this many Ns may affect the symmetry measure calculation. The region with the low symmetry measure was also on chromosome X, from 140 to 145 Mb. The number of Ns in this region was only 50 000. Because these unusual symmetry measures represent true biological observations, we kept these two points in the following study.
Figure 4 shows the scatter plot of recombination rates against symmetry measures at order 12 for rat. The regression coefficient for the symmetry measure was −3.37 with P-value 3.54 × 10⁻¹⁰.
Figure 5 is the scatter plot of recombination rates against symmetry measures at order 12 for human. The left panel shows all the data points. The regression coefficient for the symmetry measure was −11.85 with P-value <2 × 10⁻¹⁶. After removing the circled point, the regression coefficient became −14.01 with P-value <2 × 10⁻¹⁶. For the circled point, the symmetry measure was high (0.68), and the corresponding region lay on chromosome 9, from 40 to 45 Mb. The number of Ns in this region was 659 729, which was also high. We kept this point in the following study.
For mouse, Pearson's correlation coefficient between recombination rates and symmetry measures at order 12 was −0.27 (−0.21 at order 10 and −0.28 at order 11). For rat, the correlation between recombination rates and symmetry measures at order 12 was −0.29 (−0.15 at order 10 and −0.24 at order 11). For human, the correlation between recombination rates and symmetry measures at order 12 was −0.39 (−0.40 at order 10 and −0.47 at order 11). These correlations are summarized in Table 1. We also list the correlations between recombination rates and three other sequence features: poly(A)/poly(T) [(A)n≥4 and (T)n≥4] fraction, CpG fraction and GC content. The poly(A)/poly(T) fraction had a negative correlation with the recombination rate, whereas CpG fraction and GC content both had positive correlations with the recombination rate. In Table 2, we summarize the pairwise correlations among symmetry measure, poly(A)/poly(T) fraction, CpG fraction and GC content. For these three organisms, poly(A)/poly(T) fraction, CpG fraction and GC content were highly correlated (absolute value of correlation ≥0.88). This suggests that poly(A)/poly(T) fraction, CpG fraction and GC content may capture similar information in genomic sequences. However, the symmetry measure was much less correlated with these three DNA features. The absolute correlations between the symmetry measure and the three other DNA features were about 0.6 for mouse and 0.7 for rat. For human, the correlation between the symmetry measure and CpG fraction was only −0.08, with −0.16 for GC content and 0.29 for poly(A)/poly(T) fraction. The symmetry measure always had a negative correlation with GC content and CpG fraction and a positive correlation with poly(A)/poly(T) fraction.
Multiple regressions of local recombination rate on symmetry measure, poly(A)/poly(T) [(A)n≥4 and (T)n≥4] fraction, CpG fraction and GC content were carried out, and the results are summarized in Table 3. In order to capture potential interactions among sequence features, we performed backward stepwise regression with the Akaike information criterion (AIC) for model selection. Because these recombination rates were estimated for contiguous non-overlapping windows, they were possibly autocorrelated. The Durbin–Watson test was performed to test for possible autocorrelation among the regression residuals. The autocorrelation was 0.04 with P-value 0.5 for mouse, 0.08 with P-value 0.05 for rat and 0.26 with P-value 0 for human. For rat and human, we therefore also fitted the generalized linear model incorporating autocorrelated residuals, and the coefficients and P-values were re-calculated. The final models are shown in Table 3. Using these sequence features, we can explain about 20% of the variance of the local recombination rates for mouse, 19% for rat and 49% for human. The symmetry measure had a significant negative effect on recombination rates for mouse and human, while it was not significant for rat. This difference may be due to less accurate estimation of recombination rates in rat. The results also show that there were significant interactions between symmetry measure, CpG fraction and poly(A)/poly(T) fraction in human and mouse. GC content had a positive correlation with recombination rates (0.38 for mouse, 0.26 for rat and 0.44 for human). In the multiple regression models, however, GC content had a significant negative effect on the recombination rate. This phenomenon was also noted in previous papers (Jensen-Seaman et al., 2004; Kong et al., 2002), where the authors found that GC content was negatively correlated with the recombination rate after accounting for the CpG fraction and poly(A)/poly(T) [(A)n≥4 and (T)n≥4] fraction.
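For readers unfamiliar with the residual check described above, the following Python sketch shows the two quantities involved: the lag-1 autocorrelation of the residuals and the Durbin–Watson statistic. It is an illustration only (the analyses in this study were run in R), the function names are hypothetical, and the sketch does not compute the P-value of the test.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    divided by the sum of squared residuals. Values near 2 indicate no
    lag-1 autocorrelation; values below 2 indicate positive autocorrelation."""
    num = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    den = sum(e * e for e in residuals)
    return num / den


def lag1_autocorrelation(residuals):
    """Sample lag-1 autocorrelation of the residuals, ordered by window
    position along the chromosome."""
    mean = sum(residuals) / len(residuals)
    centered = [e - mean for e in residuals]
    num = sum(a * b for a, b in zip(centered, centered[1:]))
    den = sum(c * c for c in centered)
    return num / den
```

For long residual series the two quantities are related approximately by DW ≈ 2(1 − r), where r is the lag-1 autocorrelation, so a positive autocorrelation such as the one reported for human corresponds to a statistic below 2.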
Negative correlation exists at chromosome level
To determine whether the negative association also holds at the level of individual chromosomes, we calculated Pearson's correlation coefficient between recombination rates and symmetry measures for each individual chromosome. The results are summarized in Table 4. For mouse, 8 out of 20 chromosomes had significant negative correlations with one-sided P-value <0.05. For rat, 8 out of 21 chromosomes had significant negative correlations. For human, 16 out of 23 chromosomes had significant negative correlations. For most of the chromosomes, the correlations were negative, and no significant positive correlation was found.
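The per-chromosome calculation can be sketched as below. This is a hypothetical Python illustration, not the R code used for Table 4. It returns Pearson's r and the usual t statistic, t = r·sqrt((n − 2)/(1 − r²)); the choice of this t-based test for the one-sided P-value is an assumption of the sketch, and the P-value itself is left to a statistics library (for example the lower tail of a t distribution with n − 2 degrees of freedom).

```python
import math
from collections import defaultdict


def pearson_r(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def per_chromosome_correlations(rows):
    """rows: iterable of (chromosome, recombination_rate, symmetry_measure),
    one row per window. Returns {chromosome: (r, t)}; chromosomes with
    fewer than three windows are skipped."""
    by_chrom = defaultdict(lambda: ([], []))
    for chrom, rate, symmetry in rows:
        by_chrom[chrom][0].append(rate)
        by_chrom[chrom][1].append(symmetry)
    result = {}
    for chrom, (rates, syms) in by_chrom.items():
        n = len(rates)
        if n < 3:
            continue
        r = pearson_r(rates, syms)
        if abs(r) >= 1.0:
            t = math.copysign(math.inf, r)  # perfect correlation
        else:
            t = r * math.sqrt((n - 2) / (1.0 - r * r))
        result[chrom] = (r, t)
    return result
```

A negative t with a small lower-tail probability would correspond to a significant negative correlation in the one-sided sense used above.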
In Figure 6, we plot 1 − symmetry measure at order 12 (upper panel) and recombination rate (lower panel) along each individual chromosome in the human genome. Note that if there is a negative correlation between the recombination rate and the symmetry measure, the correlation is positive between the recombination rate and (1 − symmetry measure). We plot 1 − symmetry measure instead of the symmetry measure for visual convenience. This figure clearly shows the variation of recombination rates and of symmetry levels along chromosomes. The negative association between recombination rate and symmetry measure is also apparent.
Sex difference
Since recombination rates differ between males and females (Broman et al., 1998), we also studied the correlations between sex-specific recombination rates and symmetry levels. We only considered autosomes. The correlation between female-specific recombination rate and symmetry measure at order 12 was −0.12 (−0.24 at order 10 and −0.21 at order 11). The correlation between male-specific recombination rate and symmetry measure at order 12 was −0.14 (−0.32 at order 10 and −0.24 at order 11). To test the statistical significance of the observed sex difference, we used the following regression model to jointly consider sex-specific recombination rates and symmetry measure. The model is:
[model equation not recovered]
[...] is the error term. Here n = 2563 and m = 2439. Because some windows did not have estimated female recombination rates and some windows did not have estimated male recombination rates, n was not equal to m. The sex-specific symmetry effect β_d was estimated to be −2.78 with P-value <2 × 10⁻¹⁶. The common symmetry effect β_s was −2.59 with P-value 1.13 × 10⁻⁹. Therefore, the symmetry measure had a negative effect on both sex-specific recombination rates. Compared to females, the symmetry measure had an additional negative effect on male-specific recombination rates.
| DISCUSSION |
|---|
In this article, we have explored the negative correlations between local recombination rates and local symmetry levels. The negative correlation was significant for the three organisms studied using estimated local recombination rates. This negative correlation was observed not only at the genome level but also at the chromosome level. The results for rat were less significant, which may be due to less reliable recombination rate estimates. For human, we note that the negative correlation was significantly stronger for males than for females. Although there appeared to be some heterogeneity of variances in the regression analyses, this is unlikely to change our conclusions given the extreme P-values from these analyses. Here, we studied mouse, rat and human at 5 Mb resolution. With more accurate genetic maps, the association could be explored more precisely. As to the genome sequence windows, even after we removed the windows with >20% N bases, the remaining windows still contained 8.6% Ns for rat, 2.1% for mouse and 0.6% for human. The missing sequence information may affect the symmetry measure calculation, and may be another reason for the less significant results for rat. Although we measured the symmetry level at order 12 in this article, the results and conclusions were similar for order 10 and order 11.
The reverse complement symmetry in many organisms has been known for a long time. However, it has not drawn much attention, and there is currently little explanation for this universal symmetry phenomenon. Baisnee et al. (2002) argued that the symmetry results from the combined effect of different mechanisms at different orders. Unfortunately, they did not quantify the relative contributions of these different mechanisms. Forsdyke suggested that because the stem–loop structure in supercoiled DNA facilitates the initiation of recombination, there is evolutionary pressure to produce reverse complement DNA sequences (Forsdyke, 1995). If the local stem–loop structure were the only force behind the reverse complement symmetry, higher local symmetry levels should result in higher recombination rates. On the contrary, our analysis shows that there is a negative instead of a positive correlation between the local symmetry levels and the local recombination rates. We hypothesize that although reverse complement symmetry can give rise to stem–loop structures, the presence of symmetry may help maintain the stability of the chromatin, so that a high symmetry level inhibits the occurrence of recombination events.
| Acknowledgments |
|---|
This work was supported in part by NSF grant DMS 0241160 and NIH grant GM 59507. We thank the reviewers for their constructive comments.
Conflict of Interest: none declared.
Received on June 6, 2005; revised on August 25, 2005; accepted on August 26, 2005
| REFERENCES |
|---|
Baisnee, P.F., et al. (2002) Why are complementary DNA strands symmetric? Bioinformatics, 18, 1021–1033.
Broman, K.W., et al. (1998) Comprehensive human genetic maps: individual and sex-specific variation in recombination. Am. J. Hum. Genet., 63, 861–869.
Dietrich, W.F., et al. (1996) A comprehensive genetic map of the mouse genome. Nature, 380, 149–152.
Forsdyke, D.R. (1995) A stem–loop "kissing" model for the initiation of recombination and the origin of introns. Mol. Biol. Evol., 12, 949–958.
Fullerton, S.M., et al. (2001) Local rates of recombination are positively correlated with GC content in the human genome. Mol. Biol. Evol., 18, 1139–1142.
Jensen-Seaman, M.I., et al. (2004) Comparative recombination rates in the rat, mouse and human genomes. Genome Res., 14, 528–538.
Kong, A., et al. (2002) A high-resolution recombination map of the human genome. Nat. Genet., 31, 241–247.
Lichten, M. and Goldman, A.S. (1995) Meiotic recombination hotspots. Annu. Rev. Genet., 29, 423–444.
Magasanik, B. and Chargaff, E. (1951) Studies on the structure of ribonucleic acids. Biochim. Biophys. Acta, 7, 396–412.
Nachman, M.W. (2002) Variation in recombination rate across the genome: evidence and implications. Curr. Opin. Genet. Dev., 12, 657–663.
Petes, T.D. (2001) Meiotic recombination hot spots and cold spots. Nat. Rev. Genet., 2, 360–369.
Prabhu, V.V. (1993) Symmetry observations in long nucleotide sequences. Nucleic Acids Res., 21, 2797–2800.
Qi, D. and Cuticchia, A.J. (2001) Compositional symmetries in complete genomes. Bioinformatics, 17, 557–559.
Rudner, R., et al. (1968) Separation of B. subtilis DNA into complementary strands. 3. Direct analysis. Proc. Natl Acad. Sci. USA, 60, 921–922.
Smith, G.R. (1988) Homologous recombination in procaryotes. Microbiol. Rev., 52, 1–28.
Steen, R.G., et al. (1999) A high-density integrated genetic linkage and radiation hybrid map of the laboratory rat. Genome Res., 9, AP1–8, insert.
Watson, J.D. and Crick, F.H. (1953) Genetical implications of the structure of deoxyribonucleic acid. Nature, 171, 964–967.
Wu, T.C. and Lichten, M. (1994) Meiosis-induced double-strand break sites determined by yeast chromatin structure. Science, 263, 515–518.






