Bioinformatics Advance Access originally published online on March 15, 2005
Bioinformatics 2005 21(10):2322-2328; doi:10.1093/bioinformatics/bti376
Identification and measurement of neighbor-dependent nucleotide substitution processes
1Max Planck Institute for Molecular Genetics Ihnestrasse 73, 14195 Berlin, Germany
2Physics Department and Center for Theoretical Biological Physics UC San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0374, USA
*To whom correspondence should be addressed.
| Abstract |
|---|
Motivation: Neighbor-dependent substitution processes generate specific patterns of dinucleotide frequencies in the genomes of most organisms. The CpG-methylation-deamination process is, e.g., a prominent process in vertebrates (CpG effect). Such processes, often with unknown mechanistic origins, need to be incorporated into realistic models of nucleotide substitutions.
Results: Based on a general framework of nucleotide substitutions we developed a method that is able to identify the most relevant neighbor-dependent substitution processes, estimate their relative frequencies and judge whether they are important enough to be included in the model. Starting from a model for neighbor-independent nucleotide substitution we successively added neighbor-dependent substitution processes in the order of their ability to increase the likelihood of the model describing given data. The analysis of neighbor-dependent nucleotide substitutions based on repetitive elements found in the genomes of human, zebrafish and fruit fly is presented.
Availability: A web server to perform the presented analysis is freely available at: http://evogen.molgen.mpg.de/server/substitution-analysis
Contact: arndt{at}molgen.mpg.de
| 1 INTRODUCTION |
|---|
The mutation rate of a nucleotide can be drastically affected by the identity of the neighboring nucleotides in the genome. A well-known and well-studied example of this fact is the increased mutation of cytosine to thymine in CpG dinucleotides in vertebrates (Coulondre et al., 1978; Razin and Riggs, 1980). This process is triggered by the methylation of cytosine in CpG followed by deamination and mutation from CpG to TpG or CpA (on the reverse strand). As a result of this process, the number of CpG is decreased while the number of TpG and CpA is increased with respect to what is expected from independently evolving nucleotides. Most of the deviant dinucleotide odds ratios (dinucleotide frequencies normalized for the base composition) in the human genome can be explained by the presence of the CpG-methylation-deamination process (Arndt et al., 2002). Biochemical studies in the 1970s already compared these odds ratios for different genomes and different fractions of genomic DNA (Russell et al., 1976; Russell and Subak-Sharpe, 1977) and concluded that these ratios are a remarkably stable property of genomes. Subsequently, Karlin and coworkers (Karlin and Burge, 1995; Karlin and Mrázek, 1997; Karlin et al., 1997) elaborated and expanded upon these observations, showing that the pattern of dinucleotide abundance constitutes a genomic signature in the sense that it is stable across different parts of a genome and is generally similar between related organisms. Since this signature is also present in non-coding and intergenic DNA, it is tempting to study neighbor-dependent mutation and fixation processes (which we refer to as the substitution process henceforth) to understand the evolution of neutral DNA. However, to pursue this line of research, it is necessary to establish accurate and yet computationally tractable models of nucleotide evolution, beyond the familiar and widely used single nucleotide substitution models (see Lio and Goldman, 1998 for a review).
Recently, a mathematical and computational framework to include such neighbor-dependent substitution processes has been introduced (Arndt et al., 2002) and was successfully applied to model the CpG-methylation-deamination process in vertebrates (Arndt et al., 2003). Other extensions of single nucleotide substitution models, which generalize the 4 x 4 substitution matrix for single nucleotides to a 16 x 16 matrix for dinucleotides, have also been considered (Siepel and Haussler, 2004; Lunter and Hein, 2004). Similarly, the framework of Arndt et al. (2002) allows the inclusion of any type of neighbor-dependent process, and these models allow one to make a quantitative analysis of neighbor-dependent processes as well as to get reliable estimates of other properties, e.g. the stationary GC-content. Here we will extend this framework to include multiple neighbor-dependent substitution processes and infer their relevance without prior knowledge of the underlying biomolecular processes, which are often not fully understood or characterized.
The rest of the paper is organized as follows. In the next section, we will describe details of our method. A public web server at http://evogen.molgen.mpg.de/server/substitution-analysis is provided for readers who want to analyze their own sequences. First applications of such an analysis will be presented in the Results section, where we study neighbor-dependent substitutions in human (Homo sapiens), zebrafish (Danio rerio) and fruit fly (Drosophila melanogaster). In all these studies, we compare repetitive elements found in the genomes of the above species with their respective ancestral Master sequence, which can easily be reconstructed from all identified copies. All copies accumulated nucleotide substitutions, which we first try to model by including only single nucleotide substitutions without any neighbor dependencies. Subsequently, we ask which neighbor-dependent substitution process could be added to better describe the observed data. Our strategy is to capture most of the observed data by single nucleotide substitutions (independent of the neighboring bases) and then include neighbor-dependent substitutions one by one to generate successively better models with the least number of parameters. Neighbor-dependent processes are added in the order of their ability to describe the observed data better. Naturally, the addition of any further process (together with one rate parameter) to a model will increase the likelihood of this model to describe the observed data. In order not to over-fit the data we use a likelihood ratio test to judge whether the addition of a further process is justified. Compared with other approaches, the strength of our approach is to generate a model with fewer parameters that nevertheless captures the essential neighbor-dependent substitution processes. This prevents over-fitting the model to the given data and eases the computational demand for the quantitative estimation of the parameters.
| 2 METHODS |
|---|
2.1 The substitution model
In total, there are 12 distinct neighbor-independent substitution processes of a single nucleotide by another; 4 of them are the so-called transitions that interchange a purine with a purine or a pyrimidine with a pyrimidine. The remaining eight processes are the so-called transversions that interchange a purine with a pyrimidine and vice versa. The rates of these processes, $\alpha \to \beta$, will be denoted as $r_{\alpha\beta}$, where $\alpha, \beta \in \{A, C, G, T\}$ denote a nucleotide. In addition to these 12 processes, we also want to consider neighbor-dependent processes of the kind $\kappa\lambda \to \kappa\sigma$ and $\kappa\lambda \to \sigma\lambda$, where $\kappa\lambda$, $\kappa\sigma$ and $\sigma\lambda$ denote dinucleotides, and either the right or the left base of a dinucleotide changes. There might be several of these processes present in our model and their rates will be denoted by $r_{\kappa\lambda \to \kappa\sigma}$ or $r_{\kappa\lambda \to \sigma\lambda}$. We do not consider the very rare processes where both nucleotides of a dinucleotide change at the same time. In vertebrates, the most important neighbor-dependent process to be considered is the substitution of cytosine in CpG resulting in TpG or CpA. The rate of this substitution, especially in mammals, is about 40 times higher than that of a transversion (Arndt et al., 2003). This process is triggered by the methylation and subsequent deamination of cytosine in CpG pairs. It is commonly (and erroneously) assumed that this process affects only CpG dinucleotides. However, this is not the case, as has been shown (Arndt et al., 2002).
Our substitution model is defined by the set of substitution processes, which include all neighbor-independent single nucleotide changes and additional neighbor-dependent processes. All these processes carry one rate parameter r giving the number of substitutions per base pair and time. Further, the length of the time span, dt, and the respective substitution processes that act on some sequence have to be specified. In our application, this would be time, T, between the insertion of the repetitive element into the genome and its current observation. Since we have the freedom to rescale time and measure it in units of T, the time span dt = 1 and with this choice, the substitution rates are in fact equal to the substitution frequencies giving the number of nucleotide substitutions per base pair. In the simplest case our model includes neighbor-independent processes only and is parameterized by 12 substitution frequencies. For each additional neighbor-dependent process we need to add one more parameter. The set of all these substitution frequencies will be denoted by {r}. The number of parameters can actually be reduced by a factor of 2 when one considers substitutions for neutrally evolving DNA. In this case, we cannot distinguish the two strands of the DNA and therefore, the substitution rates are reverse complement symmetric, e.g. the rate for the substitution C → A is equal to the rate for the substitution G → T (in the following we will denote this process by C:G → A:T; for the rates we have $r_{CA} = r_{GT}$).
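The parameter counting above can be checked with a short Python sketch. It enumerates the 12 single-nucleotide processes, splits them into transitions and transversions, and groups each process with its reverse complement. The helper names (`is_transition`, `strand_class`) are illustrative and not part of the original method.

```python
from itertools import permutations

BASES = "ACGT"
PURINES = {"A", "G"}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def is_transition(a, b):
    """True if a -> b stays within the purines or within the pyrimidines."""
    return (a in PURINES) == (b in PURINES)

def strand_class(a, b):
    """Canonical label shared by a -> b and its reverse-complement process."""
    reverse_complement = (COMPLEMENT[a], COMPLEMENT[b])
    return min((a, b), reverse_complement)

# All ordered pairs of distinct bases: the 12 neighbor-independent processes.
processes = list(permutations(BASES, 2))
transitions = [p for p in processes if is_transition(*p)]
transversions = [p for p in processes if not is_transition(*p)]

# Under reverse complement symmetry each process shares its rate with one partner.
classes = sorted({strand_class(a, b) for a, b in processes})

print(len(processes), len(transitions), len(transversions), len(classes))
```

For example, `strand_class("C", "A")` and `strand_class("G", "T")` return the same label, which corresponds to the single rate shared by the C:G → A:T process.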
In order to facilitate the subsequent maximum-likelihood analysis we need to compute the probability, $P_{\{r\}}(\cdot\beta_2\cdot \mid \alpha_1\alpha_2\alpha_3)$, that the base $\alpha_2$, flanked by $\alpha_1$ to the left and by $\alpha_3$ to the right, changes into the base $\beta_2$ for given neighbor-dependent substitution frequencies {r}. This probability can be easily calculated by numerically solving the time evolution of the probability to find three bases, $p(\alpha\beta\gamma; t)$, at time $t$, which is given by the Master equation and can be written as the following set of differential equations:
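$$\frac{\partial}{\partial t}\, p(\alpha\beta\gamma;t) = \sum_{\varepsilon} r_{\varepsilon\alpha}\, p(\varepsilon\beta\gamma;t) + \sum_{\varepsilon} r_{\varepsilon\beta}\, p(\alpha\varepsilon\gamma;t) + \sum_{\varepsilon} r_{\varepsilon\gamma}\, p(\alpha\beta\varepsilon;t) + \sum_{\kappa\lambda} r_{\kappa\lambda\to\alpha\beta}\, p(\kappa\lambda\gamma;t) + \sum_{\kappa\lambda} r_{\kappa\lambda\to\beta\gamma}\, p(\alpha\kappa\lambda;t) \tag{1}$$
where the rates $r_{\alpha\alpha}$ and $r_{\alpha\beta\to\alpha\beta}$ are defined by
$$r_{\alpha\alpha} = -\sum_{\beta\neq\alpha} r_{\alpha\beta}, \qquad r_{\alpha\beta\to\alpha\beta} = -\sum_{\gamma\delta\neq\alpha\beta} r_{\alpha\beta\to\gamma\delta}$$
such that $\sum_{\alpha\beta\gamma} \partial p(\alpha\beta\gamma;t)/\partial t = 0$, since the total influx is balanced by an appropriate outflux of probability. The first three terms on the right hand side in Equation (1) describe single nucleotide substitutions on the three sites, whereas the last two sums (which are summed over all pairs of nucleotides) represent the neighbor-dependent processes at the sites (1,2) and (2,3), respectively. To describe the evolution of three nucleotides $\alpha_1\alpha_2\alpha_3$, these differential equations have to be solved for initial conditions of the form
$$p(\alpha\beta\gamma; t=0) = \delta_{\alpha\alpha_1}\,\delta_{\beta\alpha_2}\,\delta_{\gamma\alpha_3} \tag{2}$$
The transition probability is then obtained by summing over the flanking bases at time $t = 1$,
$$P_{\{r\}}(\cdot\beta_2\cdot \mid \alpha_1\alpha_2\alpha_3) = \sum_{\beta_1\beta_3} p(\beta_1\beta_2\beta_3; t=1) \tag{3}$$
and the integration is repeated for each of the 64 initial triplets $\alpha_1\alpha_2\alpha_3$. After each iteration four of the transition probabilities $P_{\{r\}}(\cdot\beta_2\cdot \mid \alpha_1\alpha_2\alpha_3)$ with $\beta_2$ = A, C, G or T can be computed. Note that the above set of differential equations describes the time evolution of only three nucleotides. It can be easily extended to describe systems of length $N > 3$, in which case one has to solve for $4^N$ functions $p(\alpha_1\alpha_2\ldots\alpha_N; t)$.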
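The triplet calculation described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the integrator is a classical fourth-order Runge-Kutta scheme chosen here for simplicity, and the example frequencies are the ones quoted for the synthetic data in Section 2.3.

```python
from itertools import product

BASES = "ACGT"
TRIPLETS = ["".join(t) for t in product(BASES, repeat=3)]
INDEX = {t: i for i, t in enumerate(TRIPLETS)}

def build_rate_matrix(single, dinuc):
    """Return Q with Q[i][j] = rate of triplet i -> triplet j.

    single: {(a, b): rate} for single-nucleotide substitutions a -> b
    dinuc:  {(xy, zw): rate} for dinucleotide processes xy -> zw
    """
    n = len(TRIPLETS)
    Q = [[0.0] * n for _ in range(n)]
    for s in TRIPLETS:
        i = INDEX[s]
        for pos in range(3):                      # single-site changes
            for b in BASES:
                if b != s[pos]:
                    t = s[:pos] + b + s[pos + 1:]
                    Q[i][INDEX[t]] += single.get((s[pos], b), 0.0)
        for pos in range(2):                      # site pairs (1,2) and (2,3)
            for (src, dst), rate in dinuc.items():
                if s[pos:pos + 2] == src:
                    t = s[:pos] + dst + s[pos + 2:]
                    Q[i][INDEX[t]] += rate
        Q[i][i] = -sum(Q[i])                      # outflux balances influx
    return Q

def evolve(p, Q, t=1.0, steps=50):
    """Integrate dp/dt = p Q from 0 to t with RK4 (a choice made here)."""
    n = len(p)

    def deriv(v):
        return [sum(v[i] * Q[i][j] for i in range(n)) for j in range(n)]

    h = t / steps
    for _ in range(steps):
        k1 = deriv(p)
        k2 = deriv([p[j] + 0.5 * h * k1[j] for j in range(n)])
        k3 = deriv([p[j] + 0.5 * h * k2[j] for j in range(n)])
        k4 = deriv([p[j] + h * k3[j] for j in range(n)])
        p = [p[j] + h / 6.0 * (k1[j] + 2 * k2[j] + 2 * k3[j] + k4[j])
             for j in range(n)]
    return p

def transition_probs(triplet, Q, steps=50):
    """P(. b2 . | a1 a2 a3): evolve a point mass, then sum over the flanks."""
    p0 = [0.0] * len(TRIPLETS)
    p0[INDEX[triplet]] = 1.0
    p1 = evolve(p0, Q, steps=steps)
    out = {b: 0.0 for b in BASES}
    for t, prob in zip(TRIPLETS, p1):
        out[t[1]] += prob
    return out

# Illustrative frequencies: 0.01 everywhere, then overwrite the four transitions.
single = {(a, b): 0.01 for a in BASES for b in BASES if a != b}
single.update({("A", "G"): 0.03, ("T", "C"): 0.03,    # A:T -> G:C
               ("G", "A"): 0.05, ("C", "T"): 0.05})   # G:C -> A:T
dinuc = {("CG", "TG"): 0.4, ("CG", "CA"): 0.4}        # CpG -> CpA/TpG
Q = build_rate_matrix(single, dinuc)
probs = transition_probs("ACG", Q)
```

With these inputs, the probability that the central C of ACG ends up as T is larger than for the central C of ACA, which reflects the CpG effect. If `dinuc` is empty, the result no longer depends on the flanking bases.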
2.2 Estimation of substitution frequencies
To estimate the above-mentioned substitution frequencies from real sequence data, we need to compare a pair of sequences, an ancestral sequence $\vec{\alpha}$ and a daughter sequence $\vec{\beta}$, where the daughter sequence represents the state of the ancestral sequence after the substitution processes acted upon it for some time. Note that we do not assume any other properties regarding the nucleotide or dinucleotide distributions of the sequences. The two sequences, in particular, need not be in their stationary state with respect to the substitution model. In practice, these pairs of ancestral and daughter sequences can be obtained in various ways. One very fruitful approach is to take alignments of repetitive sequences, which can be found in various genomes due to the activity of retroviruses. Such repetitive elements have entered these genomes during short periods in evolution. Hence, all copies of such elements in a genome have been subjected to nucleotide substitutions for the same time and have accumulated corresponding amounts of changes. Various such repetitive elements and their respective alignments to the once active Master [which is taken to be the ancestral sequence (Arndt et al., 2003)] can be identified using RepeatMasker (http://www.repeatmasker.org).
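The log likelihood that a sequence $\vec{\beta}$ evolved from a master sequence $\vec{\alpha}$ under a given substitution model parameterized by the substitution frequencies {r} is given by
$$\begin{aligned} \log L_{\{r\}} &= \log P_{\{r\}}(\vec{\beta} \mid \vec{\alpha}) \\ &\approx \log \prod_{i} P_{\{r\}}(\cdot\beta_i\cdot \mid \alpha_{i-1}\alpha_i\alpha_{i+1}) = \sum_{\alpha_1\alpha_2\alpha_3,\,\beta_2} N(\alpha_1\alpha_2\alpha_3 \to \cdot\beta_2\cdot)\, \log P_{\{r\}}(\cdot\beta_2\cdot \mid \alpha_1\alpha_2\alpha_3) \end{aligned} \tag{4}$$
where $P_{\{r\}}(\vec{\beta} \mid \vec{\alpha})$ is the probability of the evolution of the sequence $\vec{\alpha}$ into $\vec{\beta}$. This probability can be approximated very well by the product in the second line, owing to the fact that the correlations induced by the substitutional processes are very short ranged (Arndt et al., 2002). We, therefore, take into account the identities of bases and the dynamics on the nearest neighbors to the left and to the right, and neglect those on the next nearest neighbors and beyond. For most applications this approximation turns out to be sufficient since estimated substitution frequencies deviate by <1% from their actual values (see below). Note that this approximation is in fact exact in the absence of neighbor-dependent substitution processes. The numbers $N(\alpha_1\alpha_2\alpha_3 \to \cdot\beta_2\cdot)$ denote the counts of observations of a base substitution from $\alpha_2$ (flanked by $\alpha_1$ to the left and $\alpha_3$ to the right) to $\beta_2$.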
To estimate the substitution frequencies {r*} for a given pair of $\vec{\alpha}$ and $\vec{\beta}$, or given numbers $N(\alpha_1\alpha_2\alpha_3 \to \cdot\beta_2\cdot)$, we maximize the above likelihood by adjusting the substitution frequencies. This can be easily done using Powell's method (Press et al., 1992) while taking care of boundary conditions (Box, 1966), i.e. the positivity of the substitution frequencies.
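The counting step and the sum in Equation (4) can be sketched as follows. The sketch assumes an ungapped pairwise alignment given as two equal-length strings. The function names and the toy probability model are illustrative only, and the maximization itself (Powell's method in the text) is not shown.

```python
import math
from collections import Counter

def count_substitutions(ancestor, daughter):
    """Count N(a1 a2 a3 -> . b2 .) over all interior alignment columns."""
    if len(ancestor) != len(daughter):
        raise ValueError("sequences must be aligned and of equal length")
    counts = Counter()
    for i in range(1, len(ancestor) - 1):
        # Key: ancestral triplet centered on i, and the daughter base at i.
        counts[(ancestor[i - 1:i + 2], daughter[i])] += 1
    return counts

def log_likelihood(counts, prob):
    """Sum of N(a1 a2 a3 -> . b2 .) * log P(. b2 . | a1 a2 a3).

    prob(triplet, b2) is any callable returning the transition probability.
    """
    return sum(n * math.log(prob(tri, b2)) for (tri, b2), n in counts.items())

def toy_prob(triplet, b2):
    """Hypothetical stand-in model: the central base is kept with probability 0.9."""
    return 0.9 if b2 == triplet[1] else 0.1 / 3

counts = count_substitutions("ACGTA", "ATGTA")
print(log_likelihood(counts, toy_prob))
```

In a full analysis, `prob` would be replaced by the triplet transition probabilities obtained from the Master equation for a trial set {r}, and an optimizer would adjust {r} to maximize the returned value.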
2.3 Uncertainty of estimates for finite sequence length
Owing to the stochastic nature of the substitution process and the fact that only a finite amount of sequence data is ever available to estimate the substitution frequencies {r*}, estimated frequencies will show deviations from the real substitution frequencies. In general, we do not know or cannot infer these real frequencies otherwise. In order to be able to analyze the uncertainty of frequency estimates from finite sequences, we synthetically (in silico) generate pairs of ancestral and daughter sequences using known substitution processes and rates {r}. In the following, we include just one neighbor-dependent substitution process, namely the CpG-methylation-deamination process, CpG → CpA/TpG, which plays a predominant role in the analysis of nucleotide substitutions in vertebrates. The nucleotides of the ancestral sequences $\vec{\alpha}$ (of length $N$) have been chosen randomly with equal probability from the four nucleotides. Subsequently, the ancestral sequence was synthetically aged by applying substitutions using a Monte Carlo algorithm as described in Arndt et al. (2002), yielding the sequence $\vec{\beta}$. The resulting pair of sequences is then analyzed using the above procedure to get estimates of the rates {r*}. We repeated this experiment 500 times and got estimates for the means $\{\langle r^*\rangle\}$ and standard deviations $\{\Delta r^*\}$ of these measurements. In addition, we computed the stationary GC-content from each set of substitution frequencies (Arndt et al., 2002). Results of this analysis are presented in Figure 1, where we show the mean and standard deviation of estimated rates for different lengths of sequences $N$. The transversion frequencies were chosen to be 0.01, the frequency of the A:T → G:C transition to be 0.03, that of the G:C → A:T transition to be 0.05 and that of the CpG → CpA/TpG transition to be 0.4, as indicated by the dotted lines in Figure 1. This choice of frequencies mimics the relative strength of the substitution processes as they are observed in the human genome. As can be seen, the uncertainty of observed substitution frequencies correlates positively with the substitution frequencies and negatively with the length of the sequences.
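For orientation, the stationary GC-content has a closed form in the special case without any neighbor-dependent process, which the following sketch evaluates. This is an assumption-laden simplification: the text computes the stationary GC-content with the CpG process included, following Arndt et al. (2002), and that calculation is not reproduced here. The function name is hypothetical.

```python
def stationary_gc(r):
    """Stationary G+C fraction for strand-symmetric single-nucleotide rates.

    r maps (a, b) -> rate of a -> b. With reverse complement symmetry every
    A or T turns into G or C at rate `gain`, and every G or C turns into
    A or T at rate `loss`, so the G+C fraction x obeys
    dx/dt = (1 - x) * gain - x * loss.
    """
    gain = r[("A", "G")] + r[("A", "C")]   # A:T -> G:C and A:T -> C:G
    loss = r[("G", "A")] + r[("G", "T")]   # G:C -> A:T and G:C -> T:A
    return gain / (gain + loss)

# Single-nucleotide frequencies quoted in the text, without the CpG process.
rates = {("A", "G"): 0.03, ("A", "C"): 0.01,
         ("G", "A"): 0.05, ("G", "T"): 0.01}
print(stationary_gc(rates))   # close to 0.4
```

Adding the CpG → CpA/TpG process removes additional G:C pairs, so the stationary GC-content of the full model is expected to differ from this neighbor-independent value.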
To further quantify these uncertainties and discuss their dependence on various quantities, we plotted the deviations $\{|\langle r^*\rangle - r|\}$ and the standard deviations $\{\Delta r^*\}$ as a function of the sequence length $N$ in Figure 2. The standard deviations decrease with $1/\sqrt{N}$. In the absence of neighbor-dependent substitutions and for ancestral sequences with equally probable nucleotides the standard deviation for reverse complement symmetric frequencies can actually be calculated as
![]() | (5) |
CpA/TpG can be computed to be of the order of:
![]() | (6) |
and
as described in the previous section.
The deviations of the observed frequencies from the real frequencies, $\{|\langle r^*\rangle - r|\}$ (Fig. 2), also decrease with $1/\sqrt{N}$ and are always bounded from above by $\{\Delta r^*\}$. Note that the estimates of substitution frequencies are very precise, although we used an approximation when deriving the likelihood in Equation (4). This property does not hold true for neighbor-dependent processes in general. For instance, we observe small (<1%, data not shown) but systematic deviations of the estimated substitution frequencies if we include the process ApA/TpT → CpA/TpG. In this case, one should also take into account the identity and dynamics of nucleotides on the next nearest neighbor sites and the associated neighbor-dependent processes. One would have to introduce corrections of higher order in Equation (4). This is true because of initial overlapping states of the neighbor-dependent process, i.e. two ApAs in a triplet AAA. However, such corrections do not have to be considered for the CpG → CpA/TpG process. For a given CpG, the next nearest neighbor-dependent process might only occur on a neighboring CpG, which in contrast to ApAs cannot overlap with the given CpG. Hence, correlations to the next CpG are even smaller, which makes the estimation of substitution frequencies neglecting such correlations very precise. In the absence of any neighbor-dependent process, there is no approximation involved to compute the likelihood in Equation (4) and therefore estimates will be asymptotically exact for $N \to \infty$.
The above formulas for the standard deviation, Equations (5) and (6), lose their validity if any one of the frequencies is of the order of one. However, the standard deviations are still decreasing with increasing sequence length. In Figure 3 we present estimated frequencies from sequences of various degrees of divergence. The substitution rates have been chosen in the ratios 1:3:5:40 for the transversions, the A:T → G:C transition, the G:C → A:T transition and the CpG → CpA/TpG process. On the horizontal axis we plot the length of the time interval for which the ancestral sequence (of length $N = 10^7$) has been aged. The dotted lines give the real substitution frequencies, which are the products of the corresponding rates and the length of the time interval. So long as not all substitution frequencies are >1 (to the left of the dashed vertical line in Fig. 3) the substitution frequencies can be faithfully estimated, even if single frequencies exceed one (the dashed horizontal line). If all substitution frequencies are of the order of or larger than one, the estimation of substitution frequencies is not possible anymore (to the right of the dashed vertical line). In this case, all nucleotides, more or less, underwent one or more substitution processes, making it impossible to estimate the frequencies of the underlying processes.
In reality, however, the nucleotides in the ancestral sequence will not be randomly distributed with equal probability from the four nucleotides (as assumed above). Moreover, genomic sequences will show non-trivial dinucleotide distributions, i.e. neighboring bases are not independent and the dinucleotide frequencies $f_{\alpha\beta}$ will deviate from the product of nucleotide frequencies $f_\alpha f_\beta$ (Karlin and Burge, 1995). Both these factors will influence the deviations between the observed and the real substitution frequencies and in such cases the above formulas (5) and (6) do not hold anymore. We also expect additional errors due to the presence of unaccounted neighbor-dependent processes. Depending on the magnitude of the rates for such processes the errors can get quite significant, as discussed below. To exclude the latter type of errors one actually has to try to incorporate additional neighbor-dependent processes and judge whether their inclusion is actually relevant (as discussed in the next subsection).
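How far the dinucleotide frequencies deviate from the product of nucleotide frequencies can be expressed as an odds ratio, i.e. the dinucleotide frequency divided by the product of the two nucleotide frequencies. The sketch below computes this ratio on a single strand. The function name is illustrative, and the strand-symmetrized variant used in genomic-signature studies is deliberately not implemented.

```python
from collections import Counter

def dinucleotide_odds_ratios(seq):
    """Return rho[xy] = f_xy / (f_x * f_y) for the dinucleotides seen in seq."""
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_mono = sum(mono.values())
    n_di = sum(di.values())
    return {
        xy: (count / n_di) / ((mono[xy[0]] / n_mono) * (mono[xy[1]] / n_mono))
        for xy, count in di.items()
    }

rho = dinucleotide_odds_ratios("ACAC")
print(rho)
```

A ratio of one means that the two neighboring bases occur together as often as expected for independent sites. Values below one, as reported for CpG in vertebrate genomes, indicate depletion.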
Further, it is not possible for genomic applications to repeat the measurements of substitution frequencies for different sets of sequences to get an estimate of the typical errors. However, one can still get estimates of the expected standard deviation by bootstrapping the available data. One has to resample the available data, drawing randomly and with replacement $N$ pairs of aligned ancestral and daughter nucleotides (keeping the information of the ancestral base identity to the left and to the right), and generate a list of counts $N(\alpha_1\alpha_2\alpha_3 \to \cdot\beta_2\cdot)$, which will be used to maximize the likelihood and estimate the substitution frequencies as described above. One repeats this resampling procedure $M$ times, and from the $M$ estimates of the substitution frequencies and stationary GC-content their standard deviation is calculated, which gives the statistical error as a result of the limited amount of sequence data. We found that $M = 500$ samples are sufficient to estimate those errors (data not shown).
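The resampling loop can be sketched as below. The estimator passed in is a placeholder: in the procedure described above it would be the full maximum-likelihood fit, whereas the toy estimator used here is just an observed fraction. The function name and the toy data are hypothetical.

```python
import random
from collections import Counter
from statistics import stdev

def bootstrap_std(observations, estimator, m=500, seed=0):
    """Standard deviation of `estimator` over m resamples with replacement.

    observations: list of (ancestral_triplet, daughter_base) pairs
    estimator:    callable mapping a Counter of such pairs to a float
    """
    rng = random.Random(seed)
    n = len(observations)
    estimates = []
    for _ in range(m):
        sample = rng.choices(observations, k=n)   # draw N pairs with replacement
        estimates.append(estimator(Counter(sample)))
    return stdev(estimates)

# Toy data: the central C of ACG is observed as T in 30 of 100 aligned copies.
observations = [("ACG", "T")] * 30 + [("ACG", "C")] * 70

def fraction_c_to_t(counts):
    """Placeholder estimator: observed fraction of C -> T in the ACG context."""
    return counts[("ACG", "T")] / sum(counts.values())

print(bootstrap_std(observations, fraction_c_to_t, m=200))
```

Keeping the ancestral triplet in each resampled pair preserves the neighbor information that the likelihood in Equation (4) needs.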
2.4 Extending the model to include additional processes
Next, we address how one can extend a given substitution model and include additional neighbor-dependent processes to maximize the potential of such a model to describe the observed data. With the inclusion of additional neighbor-dependent processes, the likelihood of a model {r'} will in any case be greater than that of the original model {r}. This is the case because the models are nested and one has an additional parameter to explain the given data. To test whether the inclusion of a new parameter is justified, we employ the likelihood ratio test for nested models. Let $\lambda = L_{\{r\}}/L_{\{r'\}}$ be the likelihood ratio; then $-2\log\lambda$ has an asymptotic $\chi^2$ distribution with degrees of freedom equal to the difference in the numbers of free parameters of the two models, which in our case is one (Ewens and Grant, 2001).
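The test statistic and its tail probability for one degree of freedom can be computed with the standard library alone, as in the following sketch. The function names are illustrative. The tail formula uses the identity that a chi-square variable with one degree of freedom is the square of a standard normal variable.

```python
import math

def lrt_statistic(loglik_base, loglik_extended):
    """-2 log(lambda) with lambda = L_base / L_extended for nested models."""
    return 2.0 * (loglik_extended - loglik_base)

def chi2_1_sf(x):
    """Upper tail P(X > x) of a chi-square variable with one degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

# Hypothetical log likelihoods of a model before and after adding one process.
statistic = lrt_statistic(-100.0, -90.0)
print(statistic, chi2_1_sf(statistic))
```

Because the extended model contains the base model as a special case, the statistic cannot be negative when both models are fitted to their maxima.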
In practice, we extend a given substitution model, in turn, by each of the 4 × 4 × 3 × 2 = 96 possible neighbor-dependent processes. Out of these extended models, we choose the best one, i.e. the one with the highest likelihood $L_{\{r'\}}$. Since the best is chosen out of a finite set of possibilities, we have to account for multiple testing and use a Bonferroni correction. Hence we require $-2\log\lambda > 15$ to have significance on the 5% level (note that $\int_0^{15} \chi^2_1(x)\,dx = 0.99989 > 1 - 0.05/96$). We confirmed this conservative threshold also by simulations using sequences that have been synthetically mutated according to a known model.
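The count of candidate processes and the threshold check can be reproduced with a few lines of Python. The sketch enumerates all dinucleotide processes in which exactly one base changes, pairs each with its reverse complement, and evaluates the chi-square distribution function at 15. Variable and function names are illustrative.

```python
import math
from itertools import product

BASES = "ACGT"
COMP = {"A": "T", "C": "G", "G": "C", "T": "A"}

def revcomp(s):
    """Reverse complement of a short DNA string."""
    return "".join(COMP[b] for b in reversed(s))

def neighbor_dependent_processes():
    """All dinucleotide processes in which exactly one base changes."""
    procs = []
    for src in map("".join, product(BASES, repeat=2)):
        for pos in range(2):                 # left or right base changes
            for b in BASES:
                if b != src[pos]:
                    procs.append((src, src[:pos] + b + src[pos + 1:]))
    return procs

procs = neighbor_dependent_processes()

# Group each process with its reverse-complement partner.
classes = {min(p, (revcomp(p[0]), revcomp(p[1]))) for p in procs}

alpha = 0.05 / len(procs)                    # Bonferroni-corrected level
cdf_at_15 = math.erf(math.sqrt(15 / 2))      # chi-square (1 d.o.f.) CDF at 15

print(len(procs), len(classes), cdf_at_15 > 1 - alpha)
```

The pairing step also shows why only half as many extensions need to be tested for neutrally evolving DNA: CpG → CpA and CpG → TpG, for instance, fall into the same class.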
| 3 RESULTS |
|---|
As a first test, we applied the above method to identify and measure neighbor-dependent substitution processes acting on the human genome. We took as input 334 000 copies (comprising about 9 Mbp) of the AluSx SINEs that have been found in a genome-wide search of the human genome (release v20.34c.1 at ensembl.org from April 1, 2004) together with individual alignments to the ancestral Master AluSx found in RepBase (Jurka, 2000). These elements are assumed to have evolved neutrally and therefore, the substitution process is reverse complement symmetric. The results are presented in Table 1. In the first column of data we give estimates for the six neighbor-independent single nucleotide substitutions. We subsequently tested 48 possible extensions of this simple substitution model by one additional neighbor-dependent substitution process together with its reverse complement symmetric process (note that in this case only 48 extensions have to be considered). As expected (and as shown in the second column in Table 1), the CpG-methylation-deamination process (CpG → CpA/TpG) gives the best improvement with $-2\log\lambda = 7.7 \times 10^6$, which is clearly above the threshold of 15. The substitution frequency of this process is about 45 times higher than that of a transversion. Extending the model from 6 to 7 parameters by including the CpG → CpA/TpG process mostly affects the estimate for the G:C → A:T transition, which decreases by about a factor of 3. Please also note that, as a consequence, the estimate of the stationary GC-content from these rates rises from 21% for the 6-parameter model to 34% for the 7-parameter model. This reveals that estimates of substitution frequencies and the stationary nucleotide composition are very much affected by the underlying substitution model. Substantial deviations can be observed when the substitution model does not include all relevant processes, as is the case for the 6-parameter model for nucleotide substitutions in the human lineage.
In principle, there can be even more neighbor-dependent processes we have to account for. According to our method, the second process that needs to be included to improve the model is the substitution CpG → CpC/GpG ($-2\log\lambda = 1.3 \times 10^5$). This is another CpG-based process and probably also triggered by the methylation of cytosine. The substitution frequency obtained is about 30 times smaller than that of the CpG → CpA/TpG process. Nevertheless, it is nearly three times larger than the corresponding single nucleotide transversion frequency. The third process to be included is the substitution TpT/ApA → TpG/CpA ($-2\log\lambda = 9.6 \times 10^4$). The instability of the TpT dinucleotide does not come as a surprise here, since two consecutive thymine nucleotides tend to form a thymine photodimer T <> T. This process is one of the major lesions formed in DNA during exposure to UV light (Douki et al., 1997).
Next, we turn to the analysis of the DANA repeats in zebrafish (D. rerio). Results are presented in Table 2. Again, we start with a model just comprising single nucleotide transversions and transitions. As observed in human, the transitions occur more often than transversions and there is a strong A:T bias in the single nucleotide substitutions. After accounting for the CpG-process the ratio of transversions to transitions is roughly 1:3:5 as it is observed for the human lineage. Zebrafish, being a vertebrate, also utilizes methylation as an additional process to regulate gene expression. As a consequence, we also observe a higher mutability of the CpG dinucleotide due to the deamination process in zebrafish. However, the substitution frequency for the CpG → CpA/TpG process in zebrafish is only about 8 times higher than that of a transversion, suggesting that the degree of methylation is generally lower than in human. The reduced frequency for the CpG deamination process found in the fish is also consistent with previous studies which found an elevated CpG deamination frequency in the human lineage only after the mammalian radiation (Arndt et al., 2003).
We also investigated non-vertebrate sequence data. As an example, we present here the analysis of the DNAREP1_DM repeat in D. melanogaster (Table 3). The need to include neighbor-dependent processes in this case is clearly not as strong as for vertebrate genomes. The values of $-2\log\lambda$ are three orders of magnitude smaller, but still above the threshold of our method. The first such process is the substitution TpA → TpT/ApA. Although the corresponding substitution frequency is lower than all the single nucleotide transitions and transversions, the dinucleotide frequencies in the stationary state deviate up to 10% from their neutral expectation under a neighbor-independent substitution model (data not shown). Therefore, even processes with a small contribution to the overall substitutions have a large influence on the observed patterns of dinucleotide frequencies or genomic signatures and therefore may very well be solely responsible for the generation of such patterns in different species.
| 4 CONCLUSION |
|---|
We presented a framework to identify the existence and measure the rates of neighbor-dependent nucleotide substitution processes. We discussed the extension of models of nucleotide substitutions in human and included more neighbor-dependent processes besides the well-known CpG-methylation-deamination process (Arndt et al., 2002). We could also show that the CpG-methylation-deamination process is the predominant substitution process in zebrafish, while it does not play a role in fruit fly. We exemplified our method using sequence data from one particular subfamily of repeats from these three organisms. In the case of the human genome a much more thorough analysis of various families of repeats has been presented in Arndt et al. (2003). A similar study, which would also include neighbor-dependent substitutions for other species, will further broaden our knowledge about the molecular processes that are responsible for nucleotide mutations and their fixation.
| Acknowledgments |
|---|
We thank Nadia Singh and Dmitri Petrov (Stanford) for kindly providing sequence data on the DNAREP1_DM repeat in Drosophila melanogaster.
Received on November 17, 2004; revised on February 16, 2005; accepted on March 4, 2005
| REFERENCES |
|---|
Arndt, P.F., et al. (2002) DNA sequence evolution with neighbor-dependent mutation. 6th Annual International Conference on Computational Biology RECOMB2002, Washington DC, ACM Press, KK.
Arndt, P.F., et al. (2003) Distinct changes of genomic biases in nucleotide substitution at the time of mammalian radiation. Mol. Biol. Evol., 20, 1887-1896.
Box, M.J. (1966) A comparison of several current optimization methods and use of transformations in constrained problems. Computer Journal, 9, 67-77.
Coulondre, C., et al. (1978) Molecular basis of base substitution hotspots in Escherichia coli. Nature, 274, 775-780.
Douki, T., et al. (1997) Far-UV-induced dimeric photoproducts in short oligonucleotides: sequence effects. Photochem. Photobiol., 66, 171-179.
Ewens, W.J. and Grant, G. (2001) Statistical Methods in Bioinformatics: An Introduction. Springer, New York.
Jurka, J. (2000) Repbase update: a database and an electronic journal of repetitive elements. Trends Genet., 16, 418-420.
Karlin, S. and Burge, C. (1995) Dinucleotide relative abundance extremes: a genomic signature. Trends Genet., 11, 283-290.
Karlin, S. and Mrázek, J. (1997) Compositional differences within and between eukaryotic genomes. Proc. Natl Acad. Sci. USA, 94, 10227-10232.
Karlin, S., et al. (1997) Compositional biases of bacterial genomes and evolutionary implications. J. Bacteriol., 179, 3899-3913.
Lio, P. and Goldman, N. (1998) Models of molecular evolution and phylogeny. Genome Res., 8, 1233-1244.
Lunter, G. and Hein, J. (2004) A nucleotide substitution model with nearest-neighbour interactions. Bioinformatics, 20, Suppl 1, i216-i223.
Press, W.H., Teukolsky, S.A., Vetterling, W.T. and Flannery, B.P. (1992) Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, Cambridge.
Razin, A. and Riggs, A.D. (1980) DNA methylation and gene function. Science, 210, 604-610.
Russell, G.J., et al. (1976) Doublet frequency analysis of fractionated vertebrate nuclear DNA. J. Mol. Biol., 108, 1-23.
Russell, G.J. and Subak-Sharpe, J.H. (1977) Similarity of the general designs of protochordates and invertebrates. Nature, 266, 533-536.
Siepel, A. and Haussler, D. (2004) Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Mol. Biol. Evol., 21, 468-488.