Evolution and Phylogenetics
Efficient inference on known phylogenetic trees using Poisson regression
IBM T.J. Watson Research Center, Yorktown Heights NY 10598, USA
| ABSTRACT |
|---|
|
|
|---|
Motivation: We suggest the use of Poisson regression for time inference and hypothesis testing on a bifurcating Phylogenetic tree with known topology. This method is computationally simple and naturally accommodates variable substitution rates across different sites, without requiring the estimation of these rates. We identify the assumptions under which this is a maximum-likelihood inference approach and show that in some realistic situationsin particular, when the probability of repeated mutation within each branch of the tree is smallthese assumptions hold with high probability.
Results: Our motivating domain is human mitochondrial DNA trees, and we illustrate our method on a problem of estimating the time to most recent common ancestor of all non-African mtDNA, using publicly available data. We test for molecular clock violations using multiple comparisons, and conclude that the global molecular clock hypothesis cannot be rejected based on these data.
Contact: srosset{at}us.ibm.com
| 1 INTRODUCTION |
|---|
|
|
|---|
In this article we concentrate on the inference phase of phylogenetic analysis. That is, given a phylogenetic tree, how can we estimate parameters (such as time) and test hypotheses about the mutation process (such as molecular clock/lineage independence). The tree construction phase, which is arguably the most difficult part of the process, and often the most scientifically important one, is not addressed here, and we assume that we start from a correct phylogenetic tree, which we will assume here is bifurcating (although our methods can also account for multi-furcation).
We propose the use of Poisson regression (PR) to estimate parameters, do inference on these parameters and test hypotheses on this tree. We show that this allows us to accommodate site-dependent mutation rates and different mutational clock assumptions. We show that our Poisson regression method is a maximum-likelihood phylogenetic inference approach if we assume:
- We observe all mutations, i.e. there are no multiple mutations in the same site in any branch of the Phylogenetic tree.
- The overall mutation rate (i.e. the sum of mutation probabilities in all sites) is fixed, regardless of the current state, which holds under some commonly used nucleotide substitution models, the most general of which is the Kimura (1981) 3-parameter model (K3ST).
We compare PR with two prevalent approaches: maximum-likelihood inference and the commonly used simple
method (Forster et al., 1996). PR is more efficient than the former (and equivalent to it statistically under the assumptions mentioned above) and statistically more sound than the latter.
Finally, we analyze the human mtDNA tree using PR. We argue that the probability of repeated mutation within branches of the tree is negligible; therefore, the use of PR is well justified. We use PR to estimate the time to most recent common ancestor of all non-African mtDNA, and also employ a multiple comparison approach to test for molecular clock violations. Our TMRCA confidence interval is 67 000 ± 13 400older and significantly tighter than the commonly quoted estimates of Ingman et al. (2000). Our testing for molecular clock violations concludes that the global clock hypothesis cannot be rejected based on these data.
| 2 THE POISSON REGRESSION METHOD |
|---|
|
|
|---|
Consider a known rooted phylogenetic tree for non-recombining DNA, such as the simple tree in Figure 1. Assume the terminal nodes (leaves) are extant samples and the internal nodes of the tree represent their common ancestors. Further assume that we have mutation counts on each branch of the tree, e.g. from using the Fitch algorithm (Fitch, 1971). We associate parameters with non-leaf nodes (denoted ß1, ß2, ß3, ...). These represent distances from each such node to the terminal nodes (present day) under a molecular clock assumption, measured in terms of expected mutation counts. Thus, we expect to encounter about ß1 mutations when traversing any path from the ancestor at the root to an extant samples at the leaves. We assume the observed mutation counts on each branch of the tree (denoted by xij) have a Poisson(
ij) distribution with a parameter, i.e. the difference between the parameter at the beginning and the end of the branch, or the difference between the expected mutation distance from the starting node and ending node of this branch. For branches leading directly to leaves, the Poisson parameter is simply the ß parameter of the starting node for this branch (since the leaves represent the present). In Figure 1 we illustrate this setting on two branchesthe one between nodes 1 and 3 and the one from internal node 4 to leaf 8.
|
The Poisson log-likelihood function for a tree with known, fixed structure, given mutation counts is
![]() | (1) |
is of no consequence for maximum-likelihood estimation of the ßs. Since the Poisson parameters of the observed counts on the branches are linear functions of the ß parameters, finding the maximum-likelihood estimates of the ß parameters is simply a Poisson regression with an identity link function. This link function (identity) is different from the one usually used for Poisson regression, which is the log link (the natural link for the Poisson family). However our problem here is still a well-defined Generalized Linear Model (GLM), which can be solved using any good GLM package, such as the glm function in R. See McCullagh and Nelder (1989) for discussion of the theory of GLMs and Venables and Ripley (1994) for discussion of the glm function in S+ and R.
The advantages of using a GLM approach for inference are both computational and statistical:
- GLM methodology offers highly efficient solutions to even large phylogenetic problems (see McCullagh and Nelder, 1989). Problems with up to several hundreds of variables (internal nodes in the tree) can easily be solved in <1 min of CPU time on a standard desktop computer. This efficiency allows us to solve big problems, and also to experiment with many different solutions to smaller ones, like we do in Section 5, where we enumerate many possible ways of relaxing the molecular clock assumption and compare their outcomes.
- The theory of GLMs offers easily attainable uncertainty estimates on the estimated parameters, in particular confidence intervals. These are based on asymptotic Gaussian approximations of the likelihood around the ML estimates and are not guaranteed to be exactly correct for our finite samples. However they are typically reasonable for Poisson quantities, especially for relatively large values of Poisson parameters. An alternative approach to inference would be to use the parametric bootstrap (Efron and Tibshirani, 1994; Felsenstein, 2003), which would also be facilitated by the computational efficiency of the Poisson regression approach.
As a concrete example, consider setting the parameters of the example in Figure 1 to be
![]() | (2) |
If we draw a random sample of mutation counts on all branches and fit the Poisson regression, we get the results in Table 1.
|
Some of these estimates (e.g. ß3, ß5) are very close to their true values and some are a little off. Most importantly, the standard deviation estimates in the third column seem well calibrated. In particular, all parameter estimates are within the 1.7 estimated standard deviations of their true value.
| 3 PR AND MAXIMUM LIKELIHOOD |
|---|
|
|
|---|
Maximum-likelihood (ML) methods (Felsenstein, 2003; Yang, 1993) assume a mutation (or substitution) model and, subject to that model, attempt to find the tree and tree parameters (specifically, branch lengths) which best explain the observed data, where the quality of explanation is measured by the likelihood of the data, given the parameters. In the context of learning tree structure the maximum-likelihood approach has been applied extensively. It has also been applied to inference, in particular hypothesis testing, when the phylogenetic tree is assumed known. See Felsenstein (2003, Chapters 16 and 19) for a review of tree generation and inference using maximum-likelihood approaches. For our purpose we point out two important facts about this area:
- The essence of the ML approach is in calculating the maximum-likelihood parameters for a single tree, with a tree-search wrapped around that. Thus, efficient calculations for known trees are critical for tree generation as well.
- Allowing for variable mutation rates by site significantly complicates the ML analysis. In fact Yang [(1993, 1994, 1997), http://abacus.gene.ucl.ac.uk/software/paml.html], who originally showed how to adapt ML to variable rates with a Gamma prior, advocates practical use of a variable rate approach for inference only and not for tree construction, owing to the increased computational burden.
Our Poisson regression is a maximum-likelihood approach, under the Poisson model we assume. The interesting question is, can we consider it to be an exact (or at least approximate) maximum-likelihood approach for phylogenetic inference and under what assumptions?
The answer, as we show below, is that the Poisson regression estimate is indeed a maximum-likelihood estimate of tree branch lengths if we assume the following:
(1) An appropriately simple substitution model, in particular one where the mutation probability is independent of the current nucleotide (and consequently all 4 nt are equally likely to appear). This is true of three of the most commonly used substitution models:
- The JukesCantor model (JC69; Jukes and Cantor, 1969), which assumes that all substitutions are equally likely.
- The Kimura 2-parameter model (K80; Kimura,1980), which allows for different probability of transitions and transversions.
- The Kimura 3-parameter model (K3ST; Kimura,1981), which allows for different probability of transitions and two different types of transversions. For a site with overall mutation rate
per time unit, the instantaneous transition matrix for K3ST is
with a + b1 + b2 = 1
(3)
(2) That we observe all mutations. This is typically not a realistic assumption, since on a phylogenetic tree we can only observe (or infer) the haplotypes on the nodes of the tree, representing inferred ancestors just before branching events. Assuming the tree topology and inferred ancestors are correct [strong assumption, but one usually required to facilitate inference, e.g. Langley and Fitch (1974)], the situation in which we observe all mutations, is if no repeated or back mutations have occurred along any single branch of the tree. In that case, we know that all sites that differ between a starting and ending node of a branch (ancestor and descendent) have mutated once along this branch. It is important to emphasize that we do not need to assume there are no repeated mutations anywhere in the tree. The same nucleotide position can mutate many times, as long as no two mutations happen on the same branch of the tree. This is a non-testable assumption, and indeed in some situations it is also not reasonable. We show in Section 5.2 that it is reasonable for our motivating domain, of human mtDNA trees, when we consider only coding region mutations, without the D-loop.
The assumption of observing all mutations has a significant simplifying effect on the likelihood function. It means that a maximum-likelihood analysis does not have to account for all possible combinations of unobserved mutations that could lead to the observed and inferred haplotypes on the tree. Hence it may not be surprising that these two assumptions lead us to the relatively simple PR formulation as a maximum-likelihood approach. We now prove this equivalence rigorously.
First, it is easy to show (e.g. Huelsenbeck and Crandall, 1997) that a site with mutation rates as in (3) has a total number of mutations in time t distributed as Pois(t ·
). Given this, assume now a genetic region of length n has an appropriate nucleotide substitution model (i.e. K3ST or simpler) and each nucleotide position has a different overall mutation rate. Denote these rates by
=
1, ... ,
n. This implies that transitions in site j occur at rate a·
j and the two types of transversions occur at rates b1·
j and b2·
j, with a + b1 + b2 + 1. Assume WLOG
(this just fixes a scale for the time length estimates). For a fixed phylogenetic tree with m branches, denote the unknown branch lengths by t = t1, ... , tm and the mutation counts for each (branch, site) combination by
. Let
be the mutation counts per branch. Then
THEOREM 1is a sufficient statistic for the maximum-likelihood estimation of branch lengths t under the model K3ST. In other words, the maximum-likelihood estimate is independent of
1, ... ,
n, a, b1, b2 and the specific identities of the mutations on the edges when we condition on
PROOF. The log-likelihood of a given tree with m branches as a function of branch lengths t = t1, ... , tm, mutation rates
=
1, ... ,
n, mutation parameters a, b1, b2 and mutation counts for each (branch, site) combination C = cij:
|
| (4) |
and thus
is a sufficient statistics for maximum-likelihood estimation of t.
The same is not true for phylogeny estimation. In particular, the specific mutation rates
are meaningful for maximum-likelihood generation of trees, since it also implicitly involves estimating the C matrix (mutation counts, dependent on phylogeny), which cannot be decomposed from
.
Recall, that the tls in this representation are linear functions of the ßjs in our original representation (1). Keeping the scaling for which
, we can write
if l is the branch from node i to node j, and the
part of (4) is exactly our original maximum-likelihood problem in (1). Thus we have proved our central result, which makes explicit the connection between PR and ML:
THEOREM 2.
If the substitution model is K3ST or simpler and all mutations on the branches are observed, then the PR solution is the maximum-likelihood phylogenetic inference solution, even if the mutation rates vary by locus.
Theorems 1 and 2 have some interesting implications on maximum-likelihood methods in general. For example, we can conclude from Theorem 1 that if the probability of repeated or back mutation on any branch is very low, and hence the counts cij are almost fully observed, then the result of maximum-likelihood inference with and without the variable rate assumption should be almost identical. We claim in Section 5.2 that human mtDNA trees are one such domain, where branch lengths are short enough relative to individual mutation rates (when considering coding region only). Inspecting the literature, we can indeed see the expected effect: cases where the analysis obtains the same results whether a constant or varying mutation rate is assumed. For example, Howell et al. (2004) use the software package PAML (Yang, 1997) to analyze African mtDNA data. Their conclusions regarding molecular clock validity are of independent interest, but for our purpose here we are interested in their comparison of the results of PAML with and without the variable rate analysis (under a Gamma prior), which are practically identical (e.g. rows 5 and 6, and rows 11 and 12 in their Table 3).
4 TMRCA ESTIMATION AND COMPARISON TO
|
|---|
|
|
|---|
An important use of the PR method is for estimating the age of various ancestral nodes in the phylogenetic tree under a mutational clock assumption. The Poisson parameters estimated at the nodes are estimates of the TMRCA of the lineage starting at this node, in units of mutations. Thus, we are simultaneously estimating the age of all ancestral nodes, taking into account the whole tree.
It is interesting to compare our method with the simple, widely used
method (Forster et al., 1996). This method only estimates the age of the root node (i.e. TMRCA) of a whole tree (or sub-tree) by averaging the mutational length of all paths from that node to the leaves of the tree.
is clearly an unbiased estimate under a mutational clock assumption and is certainly a reasonable one. The PR estimate is not guaranteed to be unbiased; however, it is unbiased to first order (McCullagh and Nelder, 1989), and therefore almost unbiased in practice, as it is indeed in the example below. Under the Poisson assumption, there are several pieces of evidence that the variance of the PR estimate is likely smaller than that of
:
- As a maximum-likelihood estimate, the asymptotic variance of the PR estimate of TMRCA is equal to the Fisher bound, i.e. no estimator of TMRCA can have lower variance.
- By the RaoBlackwell theorem, PR is the minimum variance estimate of its own expectation (which we know to be unbiased to first order).
For a concise review of the statistical principles underlying these statements, see Casella and Berger (2001, chapter 7).
Where we expect PR to have a bigger advantage over
in estimating the age of internal nodes in our tree, i.e. TMRCA of sub-trees, since
does not take into account the whole tree, just the relevant sub-tree.
We now compare the performance of PR and
for estimating the parameters of the top three nodes in the example of Section 2. We know
is unbiased, and since it is a simple linear combination of the counts on the branches, we can calculate its variance analytically. To estimate bias and variance of the PR estimates, we simulate 100 000 random draws of data from the model (2) and empirically estimate the mean and variance based on those. Our results are as follows:
- Top node (
= 20). From
: Bias = 0, variance = 6.75, which is also the expected meansquared error (MSE). From PR: Bias
[0.014,0.012] (90% confidence interval), variance 6.69, expected MSE is below 6.7. MSE reduction from using PR: 1%.
- Left node (
= 9) From
: Bias = 0, variance = 3.5, which is also the expected MSE. From PR: Bias
[0.005,0.013], variance 2.96, expected MSE is below 3. MSE reduction from using PR: 14%.
- Right node (
= 13) From
: Bias = 0, variance = 5.5, which is also the expected MSE. From PR: Bias
[0,0.02], variance 4.45, expected MSE is below 4.5. MSE reduction from using PR: 18%.
| 5 EXAMPLE: TIMING THE EXODUS OF MTDNA FROM AFRICA |
|---|
|
|
|---|
We now use our approach to investigate the question of the date of the out of Africa event for the mitochondrial DNA. For that, we should identify a date for the MRCA of clades M,N,L3 in the mtDNA tree (for a discussion on this tree, see Jobling et al., 2004, chapter 8). Since all non-African DNA are derived from this ancestor, and no more recent ancestor for non-African mtDNA can be defined, the date of the M,N MRCA is a likely earliest bound for the out-of-Africa event for mtDNA. A widely quoted confidence interval for this ancestor is the one given by Ingman et al. (2000) of 52 000 ± 27 500. As we will see, our data and method lead to an older expected age and a much tighter confidence interval.
For this purpose, we have constructed a tree of mtDNA samples using publicly available full sequences from the NCBI data repository (http://www.ncbi.nlm.nih.gov/entrez) and the known structure of the mtDNA haplogroup tree (we completely avoided parts of the tree that are disputed in the literature, such as the sub-clades of haplogroups H, T and L0L1). The resulting tree is shown in Figure 2. We used two randomly drawn samples each from Haplgroups H, V, J, T, I, W, C, Z, D (total 18 samples), for which we fully counted on the known mtDNA phylogeny, as well as seven samples from haplogroup L2, for which we reconstructed their relationship using neighbor joining, leading to a verification of the accepted L2 nomenclature for sub-clades L2a, L2b and L2c (see haplogroup annotation in Fig. 2). We reconstructed the coding region ancestral types using the Fitch algorithm (Fitch, 1971), and the resulting mutation counts for the coding region on each branch are shown in blue in Figure 2. The terminal nodes represent two samples each, with the total number of mutations on both branches on them (e.g. the rightmost node for Z implies the two Z samples differ by 12 SNPs; hence, this is the inferred total number of mutations on both branches). Note that for the PR likelihood (1) we only need this total number for both branches, which share the same Poisson parameter.
|
We ran our Poisson regression on this tree, to obtain time estimates and confidence intervals on all the internal nodes. Figure 2 shows the resulting parameter estimates and, for some nodes, a confidence interval of ±2 SD. The resulting estimate for the MN super-clade ancestor is 17.5 ± 3.5 mutations. Using the same conversion factor as in Ingman et al. (2000) of 1.7 x 108 substitutions per site per year, this translates to a confidence interval of 67 000 ± 13 400 years for the MN TMRCA. Comparing that to the result of Ingman et al. (2000) quoted above, we can see that we obtain an older age and a tighter confidence interval. The older age is likely not owing to the method used but to the difference in datathe difference in estimated age is clearly within the uncertainty bounds defined by the confidence intervals. We should also comment that our assumption of no repeated mutation should result in PR under-estimating the TMRCA rather than over-estimating, if it is violated.
The tighter confidence interval is probably owing to our use of a better inference approach. Ingman et al. (2000) do not explicitly mention what method they used. We do know, however, that they used a somewhat larger and no less diverse set of samples than our analysis (53 samples, including 5 from L1 and 7 from L2, and representatives from all 9 MN haplogroups above).
5.1 Testing the molecular clock
We would like to test the validity of the molecular clock assumption, i.e. the underlying assumption that the mutation distance from every ancestral node to its leaves is fixed. The typical approach to this is to completely relax the molecular clock assumption and, instead of putting a parameter at each node, put one on each branch in the tree, increasing the number of parameters roughly by 2-fold. A likelihood ratio test (LRT) is then performed between the two resulting nested models [see Felsenstein (2003) for description of this methodology]. For example, Howell et al. (2004) successfully use this approach to reject the molecular clock hypothesis within the L2 haplogroup, based on a detailed tree they construct. In the case of PR, with no clock assumption it simply estimates the parameter for each branch as the observed mutation count. On our tree of Figure 2, the molecular clock hypothesis cannot be rejected based on the resulting LRT, giving a P-value of
0.7.
A compromise between the global clock and no clock assumptions is that there are several different clocks which affect different parts of the tree. This is the approach that Yang (1997) calls local clock. In particular, we may hypothesize two different clocks, where a sub-tree is subject to a different clock than the rest of the tree. In the tree of Figure 2 there are a total of 24 internal branches defining subtrees below them. Each time we assume a rate shift for one such sub-tree, we add just one degree of freedom to our model fit, and hence the resulting LRT uses one degree of freedom. Out of the 24 P-values we get here, the lowest are 0.017 for the branch marked with double asterisk in Figure 2, and 0.05 for the branch marked with single asterisk. Given that we have carried out 24 tests, these P-values cannot be considered significant1. Thus we can safely conclude that our tests reveal no evidence against a global clock hypothesis.
5.2 mtDNA and repeated mutation on branches
As we discussed in Section 3, the assumption of no repeated or back mutation on any branch of the phylogenetic tree is crucial for the interpretation of our PR method as a maximum-likelihood phylogenetic inference approach. We may be interested, therefore, whether this is a realistic assumption for the domain we consider here, of human mtDNA trees, considering coding regions only (and omitting the fast mutating control regions).
To calculate the probability of repeated mutation we need to know the distribution of mutation rates in the relevant region. Tamura and Nei (1993) and others have argued that a Gamma prior with shape parameter
0.1 is appropriate for the distribution of mutation rates in the mtDNA. Furthermore, it is easy to show [as Tamura and Nei (1993) mention] that this Gamma prior with a Poisson likelihood leads to a negative binomial distribution of the number of mutations per site, with parameters: NB(0.1,1µ/0.1), where µ is the expected number of mutations per site.
Now, consider a representative branch whose true length is 20 000 years. Considering the well-established rough time scale for our tree in Figure 2, it is very likely that the vast majority of its actual branches would be shorter than this length. Using the number proposed by Ingman et al. (2000) of 1.7 x 108 substitution per site per year in the mtDNA coding region, which implies that the number of substitutions in a randomly selected coding site in 20 000 years has a NB(0.1,0.9966) distribution, which implies it has probability of
6 x 107 of being bigger than 1. Considering now that the coding region has about 15 500 total sites in it, we get that the overall probability of any site having more than one substitution on our hypothesize branch of 20 000 years is about 15 500.6 x 107 = 0.009
So, a branch of length 20 000 years has probability of slightly <1% of having any sites with repeated mutation. Since our tree has
25 branches, we can reasonably assume that it is unlikely to have repeated or back mutations on any branch. Even if it does, we would have to be exceedingly unlucky for it to have more than one or two such events.
An important question is, what would be the effect of this slight under-counting on the Poisson parameter estimates? It is easy to show via simulation that the effect of under-counting by one mutation is always less than one mutation unit on the estimates on the tree, and usually much less.
So when we consider this additional uncertainty relative to the confidence intervals in Figure 2, which are typically bigger than one, we can see that taking this additional uncertainty into account is not expected to have a significant effect on our results.
| 6 CONCLUSIONS |
|---|
|
|
|---|
We have proposed the use of Poisson regression with identity link-function as an approach for efficient and reliable inference on known phylogenetic trees. We have shown that the PR method can be considered an approximate maximum-likelihood approach under some non-guaranteed conditions, that are nonetheless reasonable in our motivating domain of the human mtDNA tree. The main advantages of the PR method over the exact maximum-likelihood calculation are as follows:
- It implicitly allows variable mutation rates across sites, in the sense that the PR estimates are not affected by the individual rates, but only by the sum of rates across all sites.
- Since PR is a Generalized Linear Modeling approach, the extensive inference machinery developed for GLM is at our disposal.
- In principle, PR should be significantly more efficient and well behaved computationally than general-purpose maximum-likelihood fitting.
We have demonstrated the PR method on the problem of estimating the out-of-Africa TMRCA of the human mtDNA, and obtained confidence intervals that are significantly tighter than the commonly quoted result of Ingman et al. (2000), obtained with a comparable (in fact, slightly bigger) number of samples.
The mtDNA example we present here is relatively small, as it served to demonstrate the ability of PR to accomplish a tighter confidence interval than Ingman et al. (2000) with comparable data. We plan to apply PR to a much larger dataset of mtDNA, in an attempt to obtain tighter confidence intervals.
| Acknowledgments |
|---|
This work was inspired by the Genographic project (www.nationalgeographic.com/genographic) and in particular by discussions with Doron M. Behar of Family Tree DNA. The author is also grateful to Gabriela Alexe, Claudia Perlich and Ajay Royyuru of IBM and Trevor Hastie of Stanford University for their help in data preparation and comments.
| FOOTNOTES |
|---|
1Since these tests are not independent the simple Bonferroni or FDR-type corrections are not directly applicable, but a minimal p-value of 0.017 is by all measures not impressive with 24 tests.
| REFERENCES |
|---|
|
|
|---|
Casella, G. and Berger, R. Statistical Inference, (2001) 2 edn , CA Duxbury Press Belmont.
Efron, B. and Tibshirani, R. An Introduction to the Bootstrap, (1994) , FL Chapman Hall/CRC Boca Raton.
Felsenstein, J. Inferring Phylogenies, (2003) , MA Sinaur Associates, Sunderland.
Fitch, W. (1971) Toward defining the course of evolution: minimum change for a specified tree topology. J. Mol. Evol, . 20, 406416.
Forster, P., et al. (1996) Origin and evolution of native american mtdna variation: a reappraisal. Am. J. Hum. Genet, . 59, 935945[Web of Science][Medline].
Howell, N., et al. (2004) African haplogroup l mtDNA sequences show violations of clock-like evolution. Mol. Biol. Evol, . 21, 18431854
Huelsenbeck, J. and Crandall, K. (1997) Phylogeny estimation and hypothesis testing using maximum likelihood. Annu. Rev. Ecol. Syst, . 28, 437466[CrossRef][Web of Science].
Ingman, M., et al. (2000) Mitochondrial genome variation and the origin of modern humans. Nature, 408, .
Jobling, M., Hurles, M., Tyler-Smith, C. Human Evolutionary Genetics: Origins, Peoples and Disease, (2004) , New York Garland Science.
Jukes, T. and Cantor, C. (1969) Evolution of protein molecules. In Munro, H. (Ed.). Mammalian Protein Metabolism, , New York Academic Press Vol. 3, , pp. 21132.
Kimura, M. (1980) A simple method for estimating evolutionary rate of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol, . 16, 111120[CrossRef][Web of Science][Medline].
Kimura, M. (1981) Estimation of evolutionary distances between homologous nucleotide sequences. Proc. Natl Acad. Sci. USA, 78, 454458
Langley, C. and Fitch, W. (1974) An examination of the constancy of the rate of molecular evolution. J. Mol. Evol, . 3, 161177[CrossRef][Web of Science][Medline].
McCullagh, P. and Nelder, J. Generalized Linear Models, (1989) , London Chapman and Hall.
Tamura, K. and Nei, M. (1993) Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol, . 10, 512526[Abstract].
Venables, W. and Ripley, B. Modern Applied Statistics with S-Plus, (1994) , New York Springer.
Yang, Z. (1993) Maximum likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol, . 10, 13961401[Abstract].
Yang, Z. (1994) Maximum likelihood phylogenetic estimation from dna sequences with variable rates over sites: approximate methods. J. Mol. Evol, . 39, 306314[CrossRef][Web of Science][Medline].
Yang, Z. (1997) PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci, . 13, 555556
This article has been cited by other articles:
![]() |
S. Rosset, R. S. Wells, D. F. Soria-Hernanz, C. Tyler-Smith, A. K. Royyuru, D. M. Behar, and and The Genographic Consortium Maximum-Likelihood Estimation of Site-Specific Mutation Rates in Human Mitochondrial DNA From Partial Phylogenetic Classification Genetics, November 1, 2008; 180(3): 1511 - 1524. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||





