Skip Navigation


Bioinformatics Advance Access originally published online on November 29, 2005
Bioinformatics 2006 22(3):341-345; doi:10.1093/bioinformatics/bti803
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/3/341    most recent
bti803v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (53)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Beerli, P.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Beerli, P.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Comparison of Bayesian and maximum-likelihood inference of population genetic parameters

Peter Beerli

School of Computational Science and Department of Biological Sciences, Florida State University Tallahassee, FL 32306-4120, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 DISCUSSION
 4 CONCLUSION
 REFERENCES
 

Comparison of the performance and accuracy of different inference methods, such as maximum likelihood (ML) and Bayesian inference, is difficult because the inference methods are implemented in different programs, often written by different authors. Both methods were implemented in the program MIGRATE, that estimates population genetic parameters, such as population sizes and migration rates, using coalescence theory. Both inference methods use the same Markov chain Monte Carlo algorithm and differ from each other in only two aspects: parameter proposal distribution and maximization of the likelihood function. Using simulated datasets, the Bayesian method generally fares better than the ML approach in accuracy and coverage, although for some values the two approaches are equal in performance.

Motivation: The Markov chain Monte Carlo-based ML framework can fail on sparse data and can deliver non-conservative support intervals. A Bayesian framework with appropriate prior distribution is able to remedy some of these problems.

Results: The program MIGRATE was extended to allow not only for ML(-) maximum likelihood estimation of population genetics parameters but also for using a Bayesian framework. Comparisons between the Bayesian approach and the ML approach are facilitated because both modes estimate the same parameters under the same population model and assumptions.

Availability: The program is available from http://popgen.csit.fsu.edu/

Contact: beerli{at}csit.fsu.edu


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 DISCUSSION
 4 CONCLUSION
 REFERENCES
 
Population genetics changed considerably after Kingman (1982a, b, c) introduced the n-coalescent. The n-coalescent (coalescent for short) allows us to calculate probabilities of relationships among a random population sample. This in turn facilitates calculations of probabilities of whole genealogies under a specific population model, for example, two populations exchanging migrants at a constant rate. The first applications that calculated the likelihood of the population size parameter based on DNA samples were described by Griffiths and Tavaré (1994) and Kuhner et al. (1995). Bahlo and Griffiths (2000) and Beerli and Felsenstein (1999, 2001) extended the basic estimation of a single parameter to joint estimations of migration rates and population sizes, whereas Kuhner et al. (2000) allowed for the estimation of recombination rate. These maximum-likelihood (ML) approaches were complemented by several Bayesian approaches (Nielsen, 1998, 2000; Hey and Nielsen, 2004; Beaumont, 1999 and others). All of these approaches try to estimate population genetic parameters. They typically treat the genealogy as a nuisance parameter and summarize over all possible genealogies G; to be precise, they sample over all possible labeled histories T and branch lengths B, taking into account the genetic data and the population genetic model. The likelihood of the data given the model parameters is

Formula 1(1)
where k(T, B|{pi}) is the Kingman coalescent probability density and L(D|T, B) is the likelihood of the data given the genealogy.

Nielsen (personal communication, 2001) suggested that the ML approach is hampered by several problems. Maximizing the likelihood function L(D|{pi}) for complicated scenarios with many parameters is very difficult. The Metropolis–Hastings algorithm (Metropolis et al., 1953; Hastings, 1970) with static driving values {pi}0, as implemented in MIGRATE and other programs, can take a prohibitively long run time required to explore completely all the possible genealogies. These problems have been shown by Abdo et al. (2004), although Abdo et al. apparently failed to recognize that the problems are far less serious when using biologically reasonable datasets and when the guidelines about convergence outlined in the MIGRATE-manual (available from http://popgen.csit.fsu.edu) are followed.


    2 APPROACH
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 DISCUSSION
 4 CONCLUSION
 REFERENCES
 
The program MIGRATE uses a Metropolis–Hastings algorithm to explore all possible genealogies (Beerli and Felsenstein, 1999). The adaptation of the program to a Bayesian framework was not difficult because only a module handling the prior distributions and a minor change in the program flow need to be added, together with changes in the input and output user interfaces.

The program MIGRATE calculates the posterior probability distribution per locus, treating each locus as completely unlinked to the others. This assumption is reasonable: most biologists would prefer to sample such loci rather than partially linked or completely linked loci because unlinked loci can be treated as independent replicates of the genealogical history. MIGRATE approximates the posterior distribution

Formula 2(2)
using a Metropolis–Hastings approach. The integral over G is a condensed expression of the sum over topologies and integral over all branch lengths. The denominator is

Formula 2
where we integrate over all possible parameter values {pi}.

The updating scheme of the genealogies is the same in the ML and the Bayesian approach and was described by Beerli Felsenstein (1999). The updating scheme of the parameters is based on arbitrary prior distributions r({pi}). MIGRATE allows the user to choose between a small number of prior distributions.

  • Uniform prior distribution between a minimum and a maximum value for each parameter.
  • Exponential prior distribution with a minimum, mean and maximum value for each parameter.

The incorporation of additional prior distributions, such as a gamma distribution, are planned.

A key issue in Metropolis–Hastings algorithms is the acceptance or not of a change of the current state in the Markov chain. The algorithm should accept fairly often so that the chain can explore the solution space more efficiently; poor algorithms will reject often and force very long runs to achieve equilibrium and an appropriate sample of the possible states. Typically, the acceptance or rejection of a move in the Markov chain is based on a ratio that consists of two parts: (1) the ratio of probabilities to move from an old state to a new state using a prior distribution and the effect of the data (Metropolis et al., 1953), the Metropolis ratio rM; and (2) the ratio of probabilities to be in the old or new state and go to the new or old state (Hastings, 1970), the Hastings ratio, rH. In the Bayesian implementation in MIGRATE the ratio of accepting a move suggested by the parameter prior is only dependent on the Kingman coalescent probability density. The acceptance/rejection ratio is

Formula 3(3)
which reduces to

Formula 4(4)
If we consider the uniform random prior distribution (URP) then

Formula 5(5)
and the Hasting ratio rH will turn into

Formula 6(6)
For the exponential-prior distribution a similar logic applies, although moving from Formula 6 to Formula 6 versus from Formula 6 to Formula 6 will not have equal probability as with the URP (6). In this case the prior probabilities in the Hastings-ratio will cancel with the prior probabilities in the Metropolis ratio [formula (3)].

The performance of the improvements were illustrated on a dataset for four populations with a unidirectional migration pattern (Fig. 1). Simulated DNA sequence alignments, generated using the population model described in Figure 1, were analyzed to show the performance of the Bayesian and the ML approach. One dataset with 10 loci, and four groups of 100 single locus datasets were analyzed. Each dataset contained 20 individuals from each of the four populations. Using a coalescence-based simulator (cf. Hudson, 2002) ‘true’ genealogies using population sizes ({Theta}T) for all populations of 0.1, 0.01, 0.001 and 0.0001, and Mji referenced in Figure 1 were created. DNA sequences of 10 000 bp length were then simulated on these true genealogies using an F84 model with equal base frequencies and transition/transversion ratio of 2.0. These datasets were then analyzed using either the ML inference mode (Beerli and Felsenstein, 1999, 2001) or the Bayesian inference mode in MIGRATE. The ML mode was run for 10 short chains visiting 100 000 genealogies and storing 5000, updating the driving parameter after each chain, and two long chains with 10 000 000 visited genealogies and 50 000 sampled using an adaptive heating scheme. The Bayesian inference was run for 10 000 000 updates, approximately half of which were updates of the 16 parameters and approximately half (~5 000 000 because of random switching between genealogy and parameter updates) were genealogy updates. New parameters were proposed using an exponential prior distribution with population size mean of 2{Theta}T and boundaries of {Theta}T/10 and 10{Theta}T, and scaled migration rate mean M of 200 and boundaries of 0 and 1000. Results for uniform priors with the same boundaries were very similar, and therefore are not shown.


Figure 1
View larger version (17K):
[in this window]
[in a new window]
 
Fig. 1 Population scenario used in the example: four populations exchange migrants unidirectionally as follows: from population 1 to 4 (M14), from 4 to 3 (M43), from 3 to 2 (M32) and from 2 to 1 (M21). Parameters are scaled effective population sizes {Theta}i (4x effective population size x mutation rate per site per generation), and scaled immigration rates Mji (immigration rate divided by mutation rate). Migration along routes indicated by solid arrows was simulated using ‘true’ values of M = 100; migration along all eight other migration routes was simulated with a value of M = 0. Migration along the dashed arrows are discussed in the Results section.

 
The results and problematic issues are shown only for population 4, but the pattern is identical for the other three populations. The scenario chosen for an example is difficult for any gene flow estimator because it requires the estimation of 12 migration rates and 4 population sizes. With high migration rates, haplotypes are distributed evenly over all populations, so that establishing the directionality of gene flow from estimated migration rates is difficult. With low migration rates, however, the difference from zero, and thus the directionality, is also difficult to establish. The number of variable sites or the number of alleles in the dataset is crucial for accurate estimation of population size and migration rates of any magnitude. Single locus datasets with low variability do not allow estimating migration rates with great precision.

Despite these difficulties, with sufficient data, estimates are expected to be useful for inferring direction and magnitude of gene flow and magnitude of population size. Using the 16 parameter model analyzed here will produce very variable parameter estimates from single locus data, however, and such analyses are not advisable for real biological data.

2.1 Multilocus analysis
Figure 2 shows that the variability of individual loci resulting from the coalescent can be large and that there are difficulties in reaching convergence; the combined estimate over all loci, however, gives a rather accurate picture. The variability for migration rate estimates is much larger than for the population size estimates. It is difficult to establish the gene flow direction (M41 versus M14) for the single locus estimates. The estimate over all loci clearly allows the distinction between the two directions: M14 is much bigger than M41. The estimation of migration parameter values between populations with no direct connections, for example, migration rate M24 between population 2 and 4, is consistently low (Fig. 2).


Figure 2
View larger version (24K):
[in this window]
[in a new window]
 
Fig. 2 Posterior distributions f estimated using exponential priors: expected mode for the scaled migration rate M14 is 100, expected modes for M41 and for M24 are zero, expected mode for the scaled effective population size {Theta}4 is 0.01. The posterior distributions of 10 independent loci (thin lines) and the combined posterior distribution (thick line) are shown. The relationship among the populations is explained in Figure 1.

 
2.2 Comparison of Bayesian and ML inference
MIGRATE allows direct comparison of the success of parameter inference using the Bayesian approach and the ML approach. In theory the results should be very similar. Table 1 shows medians and quartiles of 100 single locus runs. The medians and quartiles were chosen because they are a better indicator of the distribution of the results than mean and standard deviation because these are heavily influenced by large outliers. The median of the maximum posterior probabilities is similar to the median of the ML estimates for moderate values of the population size ({Theta} = 0.01). The results for the low-variability datasets are mixed; the medians of the two methods are still comparable but the range of the quartiles of ML M estimates are very large; standard deviations (data not shown) were even larger because of outliers in the ML analysis. Several of the 100 runs reported values that were very different from the true value. The datasets with the smallest true {Theta} (0.0001) shows even more problems with the ML approach because the medians for {Theta} is strongly overestimated and the range of the quartiles for M is huge. In contrast to ML the Bayesian runs recover the population size, but report very low values for the migration rate. Figure 3 shows a comparison of posterior distributions of the scaled migration rate M of the first datasets of each population size category (Table 1). The power to make inferences about the magnitude of the migration rate is directly correlated with the magnitude of the population size. For very small population sizes there is no power to estimate such low migration rates in the chosen 16 parameter problems with a single locus dataset of 10 000 bp for each of the 100 individuals. The posterior distribution is similar to the exponential prior distribution used. In contrast to the problems encountered in the migration rate estimations, the posterior distributions for {Theta} are strongly peaked near 0.0001 (data not shown).


View this table:
[in this window]
[in a new window]
 
Table 1 Comparison of Maximum likelihood and Bayesian inference

 

Figure 3
View larger version (13K):
[in this window]
[in a new window]
 
Fig. 3 Posterior distribution of the scaled migration rate M14 for four different values of {Theta}4 of a single locus dataset. The population model is explained in Figure 1. Graphs are results from the first replicate of the four replicate groups shown in Table 1. Data were simulated with M14 = 100.

 
The ML method has difficulty is recovering the expected values when the dataset is very variable, whereas the Bayesian inference is closer to the ‘true’ values for all the scenarios. The range of the quartiles of the ML approach is often much larger than the range of the Bayesian approach.

The coverage of the Bayesian approach is rather conservative and includes the ‘true’ values in the 95% credibility interval with frequencies of 0.85–1.00 for the migration and population size parameters (Table 2), whereas the ML approach has difficulty with convergence, especially on low-variability datasets, and so has a rather low coverage (frequencies between 0.06 and 0.94).


View this table:
[in this window]
[in a new window]
 
Table 2 Coverage of Maximum likelihood and Bayesian inferences

 

    3 DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 DISCUSSION
 4 CONCLUSION
 REFERENCES
 
The scenario chosen for an example is difficult for any gene flow estimation program that uses only a single sample in time. The problem stems from the fact that the only information about the directionality are the mutations in the dataset. If the migration rate is high, all mutations, even the rare ones, are distributed over all populations and any directionality estimation based on a single locus will fail. With low migration rates among the populations, each population will acquire unique mutations and in principle the magnitude and directionality can be estimated even for single locus dataset, if there is enough variability in the dataset. In reality, however, such an estimation has proven difficult because the difference between the migration rates between two populations is small and often close to zero. Estimation based on single locus datasets thus often cannot recover the directionality, but multilocus estimates will allow the inference of the migration direction (Fig. 2).

The power to estimate migration rate is crucially dependent on the number of variable sites or number of alleles in the dataset. Too little variation leads to haphazard results in the ML method because the MCMC process has no strong guidance whether to insert or remove migration events during the course of the analysis; the process is more dependent on the static driving parameters. Comparison of several runs will deliver very different results and therefore show non-convergence. The only remedy is to run these analyses much longer to get a better estimate of the uncertainty of the estimate. Bayesian analysis is straightforward in such cases because when the posterior distribution is similar to the prior distribution, we can conclude that the dataset does not contain enough information for the inference. The ML method also has difficulties exploring the distribution around the ML estimate with highly variable data because the genealogy is very well defined by the large number of variable sites: the static driving value and the updating scheme (Beerli and Felsenstein, 1999) will not explore many different migration scenarios and therefore the tails of the distribution are not visited. This results in too narrow support intervals with small coverage values. In contrast, Bayesian inference manipulates the parameters using a diffuse prior. This forces more changes of the genealogy, therefore exploring more different migration scenarios and visiting the tails of the posterior distribution more efficiently.

The coverage shown for the Bayesian runs might be conservative but this is preferable to the coverage reported for ML, especially in the low-variability datasets ({Theta} ≤ 0.001, Table 2). Some ML runs did not really converge and were estimating either very large or zero migration rates.


    4 CONCLUSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 DISCUSSION
 4 CONCLUSION
 REFERENCES
 
Many users of MIGRATE have reported in numerous email queries that achieving convergence with the ML approach with low-information data, such as single locus datasets or data with a low mutation rate, is difficult and needs special attention. Bayesian inference seems to allow such users to achieve reliable results with less effort than the ML approach. It seems appropriate that if only the parameters and their support intervals are of interest, then biologists should prefer the Bayesian approach, although it will be interesting to see whether this will hold for all biological datasets.


    Acknowledgments
 
The author thank Rasmus Nielsen for the suggestion to implement a Bayesian method into MIGRATE. At first, the author was uncertain about its success for complicated scenarios, like the one used (Fig. 1). Thomas Uzzell, Koffi Sampson and two anonymous reviewers helped to improve the manuscript. Funding for this research was supplied through Florida State University and National Science Foundation grant DEB-0108249 to Scott Edwards and P.B.


    FOOTNOTES
 
Associate Editor: Frank Dudbridge

Received on September 8, 2005; revised on November 23, 2005; accepted on November 25, 2005

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 DISCUSSION
 4 CONCLUSION
 REFERENCES
 

    Abdo, Z., et al. (2004) Evaluating the performance of likelihood methods for detecting population structure and migration. Mol. Ecol, . 13, 837–851[CrossRef][Medline].

    Bahlo, M. and Griffiths, R.C. (2000) Inference from gene trees in a subdivided population. Theor. Popul. Biol, . 57, 79–95[CrossRef][Web of Science][Medline].

    Beaumont, M.A. (1999) Detecting population expansion and decline using microsatellites. Genetics, 153, 2013–2029[Abstract/Free Full Text].

    Beerli, P. and Felsenstein, J. (1999) Maximum-likelihood estimation of migration rates and effective population numbers in two populations using a coalescent approach. Genetic, 152, 763–773.

    Beerli, P. and Felsenstein, J. (2001) Maximum likelihood estimation of migration rates and effective population numbers in two populations using a coalescent approach. Proc. Nath Acad. Sci. USA, 98, 4563–4568.

    Griffiths, R. and Tavaré, S. (1994) Sampling theory for neutral alleles in a varying environment. Philos. Trans. R. Soc. Lond. Ser. B Biol. Sci, . 344, 403–410[Web of Science][Medline].

    Hastings, W.K. (1970) Monte Carlo sampling methods using markov chains and their application. Biometrika, 57, 97–109[Abstract/Free Full Text].

    Hey, J. and Nielsen, R. (2004) Multilocus methods for estimating population sizes, migration rates and divergence time, with applications to the divergence of Drosophila pseudoobscura and D. persimilis. Genetics, 167, 747–760[Abstract/Free Full Text].

    Hudson, R.R. (2002) Generating samples under a Wright–Fisher neutral model of genetic variation. Bioinformatics, 18, 337–338[Abstract/Free Full Text].

    Kingman, J.F.C. (1982a) Exchangeability and the evolution of large populations. In Koch, G. and Spizzichino, F. (Eds.). Exchangeability in Probability and Statistics, , Amsterdam North-Holland, pp. 97–112.

    Kingman, J.F.C. (1982b) On the genealogy of large populations. J. Appl. Probab, . 19A, 27–43.

    Kingman, J.F.C. (1982c) The coalescent. Stochastic Proc. Appl, . 13, 235–248[CrossRef].

    Kuhner, M.K., et al. (1995) Estimating effective population size and mutation rate from sequence data using Metropolis–Hastings sampling. Genetics, 140, 1421–1430[Abstract].

    Kuhner, M.K., et al. (2000) Maximum likelihood estimation of recombination rates from population data. Genetics, 156, 1393–1401[Abstract/Free Full Text].

    Metropolis, N., et al. (1953) Equation of state calculations by fast computing machines. J. Chem. Phys, . 21, 1087–1092[CrossRef].

    Nielsen, R. (1998) Maximum likelihood estimation of population divergence times and population phylogenies under the infinite sites model. J. Theor. Popul. Biol, . 53, 143–151.

    Nielsen, R. (2000) Estimation of population parameters and recombination rates from single nucleotide polymorphisms. Genetics, 154, 931–942[Abstract/Free Full Text].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
J HeredHome page
D. E. Janes, T. Ezaz, J. A. Marshall Graves, and S. V. Edwards
Recombination and Nucleotide Diversity in the Sex Chromosomal Pseudoautosomal Region of the Emu, Dromaius novaehollandiae
J. Hered., March 1, 2009; 100(2): 125 - 136.
[Abstract] [Full Text] [PDF]


Home page
J HeredHome page
J. I. Schmidt, K. J. Hundertmark, R. T. Bowyer, and K. G. McCracken
Population Structure and Genetic Diversity of Moose in Alaska
J. Hered., March 1, 2009; 100(2): 170 - 180.
[Abstract] [Full Text] [PDF]


Home page
J HeredHome page
R. M. Jennings, T. M. Shank, L. S. Mullineaux, and K. M. Halanych
Assessment of the Cape Cod Phylogeographic Break Using the Bamboo Worm Clymenella torquata Reveals the Role of Regional Water Masses in Dispersal
J. Hered., January 1, 2009; 100(1): 86 - 96.
[Abstract] [Full Text] [PDF]


Home page
Progress in Physical GeographyHome page
B. R. Riddle, M. N. Dawson, E. A. Hadly, D. J. Hafner, M. J. Hickerson, S. J. Mantooth, and A. D. Yoder
The role of molecular genetics in sculpting the future of integrative biogeography
Progress in Physical Geography, April 1, 2008; 32(2): 173 - 202.
[Abstract] [PDF]


Home page
Mol Biol EvolHome page
F. Rousset and R. Leblois
Likelihood and Approximate Likelihood Analyses of Genetic Structure in a Linear Habitat: Performance and Robustness to Model Mis-Specification
Mol. Biol. Evol., December 1, 2007; 24(12): 2730 - 2745.
[Abstract] [Full Text] [PDF]


Home page
GeneticsHome page
D. A. Moeller, M. I. Tenaillon, and P. Tiffin
Population Structure and Its Effects on Patterns of Nucleotide Polymorphism in Teosinte (Zea mays ssp. parviglumis)
Genetics, July 1, 2007; 176(3): 1799 - 1809.
[Abstract] [Full Text] [PDF]


Home page
J. Virol.Home page
S. Zarate, S. L. K. Pond, P. Shapshak, and S. D. W. Frost
Comparative Study of Methods for Detecting Sequence Compartmentalization in Human Immunodeficiency Virus Type 1
J. Virol., June 15, 2007; 81(12): 6643 - 6651.
[Abstract] [Full Text] [PDF]


Home page
GeneticsHome page
M. K. Kuhner and L. P. Smith
Comparing Likelihood and Bayesian Coalescent Estimation of Population Parameters
Genetics, January 1, 2007; 175(1): 155 - 165.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
D. L. Aylor, E. W. Price, and I. Carbone
SNAP: Combine and Map modules for multilocus population genetic analysis
Bioinformatics, June 1, 2006; 22(11): 1399 - 1401.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/3/341    most recent
bti803v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (53)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Beerli, P.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Beerli, P.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?