Bioinformatics Advance Access originally published online on December 7, 2004
Bioinformatics 2005 21(8):1516-1529; doi:10.1093/bioinformatics/bti178
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Bayesian models for the analysis of genetic structure when populations are correlated
1Department of Statistics U-4120 University of Connecticut Storrs, CT 06269-4120, USA
2Department of Ecology and Evolutionary Biology U-3043 University of Connecticut Storrs, CT 06269-4120, USA
*To whom correspondence should be addressed at Department of Public Health and Preventive Medicine, Mail code #CB669, Center for Policy and Research in Emergency Medicine, Oregon Health and Science University, Portland, OR 97239, USA.
| Abstract |
|---|
|
|
|---|
Motivation: Population allele frequencies are correlated when populations have a shared history or when they exchange genes. Unfortunately, most models for allele frequency and inference about population structure ignore this correlation. Recent analytical results show that among populations, correlations can be very high, which could affect estimates of population genetic structure. In this study, we propose a mixture beta model to characterize the allele frequency distribution among populations. This formulation incorporates the correlation among populations as well as extending the model to data with different clusters of populations.
Results: Using simulated data, we show that in general, the mixture model provides a good approximation of the among-population allele frequency distribution and a good estimate of correlation among populations. Results from fitting the mixture model to a dataset of genotypes at 377 autosomal microsatellite loci from human populations indicate high correlation among populations, which may not be appropriate to neglect. Traditional measures of population structure tend to overestimate the amount of genetic differentiation when correlation is neglected. Inference is performed in a Bayesian framework.
Contact: fur{at}ohsu.edu
| INTRODUCTION |
|---|
|
|
|---|
The individual members of any species usually form different populations that are more or less geographically isolated from one another. Inevitably, the populations tend to diverge and there are differences of allele frequencies among these populations. Describing and understanding the patterns of genetic differentiation in natural populations has been a central focus of population genetics since the founding of the field. While it has long been known that semi-isolated populations will tend to diverge over time, leading to a correlation between alleles within a population, it has been less widely appreciated that allele frequencies will tend to be correlated among populations due to shared history or gene flow. Fu et al. (2003) provide exact analytical expressions for the correlation in allele frequencies for a set of populations, subject to drift, mutation and migration, including simple algebraic forms for several special cases under the finite island migration model. They show that the correlation in allele frequencies among populations can be very large for realistic rates of mutation and migration, unless an enormous number of populations are exchanging genes.
Most models for analysis of genetic diversity in geographically structured populations have ignored the among-population correlation in allele frequency (Weir and Hill, 2002). Balding and Nichols (1995) for example, adopted a beta distribution to describe allele frequency at biallelic loci. This beta distribution and its multiallelic version have been widely used to infer genetic differentiation and population structure in both likelihood-based and Bayesian approaches Balding and Nichols, 1995, 1997, Roeder et al., 1998; Holsinger, 1999; Holsinger et al., 2002; Falush et al., 2003). Balding (2003) presented arguments suggesting that a Dirichlet distribution is an appropriate choice whenever populations are exchangeable. Nicholson et al. (2002) proposed a truncated normal model to describe the allele frequency for single nucleotide polymorphisms (SNPs). For both the beta and truncated normal models, the populations have been interpreted as dependent in the sense that the mean of allele frequencies from a set of populations is assumed to be the allele frequency of a hypothetical ancestral population and the populations have each diverged from the ancestral population. Hence, the populations are related to one another by having the common ancestral population. However, these models do not incorporate correlations across populations induced by gene flow. The implicit assumption is that stochastic changes in allele frequency occurred independently within each population rather than being correlated as a result of ongoing migration. As a result, estimates of genetic differentiation based on these models do not incorporate the correlation among populations. Beerli and Felsenstein (1999) estimated migration rates and effective population numbers by using a coalescent approach. In their model, the correlation across populations was implicitly accounted for, but they did not estimate the magnitude of the correlation.
Unfortunately, the correlation across populations affects the estimates of genetic differentiation. Fu et al. (2003) and Song et al. (submitted for publication) discussed some implications of this correlation for measures of genetic differentiation based on Wright's FST (Wright, 1969). In particular, they showed that when populations are small, the estimates of population structure measures are remarkably different depending on whether the correlation is incorporated or not. Nicholson et al. (2002) also expressed their concern, more than once, that the correlation across populations due to shared history or gene flow is typically neglected. In this study, we propose a new mixture beta model to approximate the allele frequency in which the correlation among populations induced by shared history or gene flow is incorporated and explicitly estimated. The allele frequency at each locus in any population is described by the sum of two beta variables in which one of them is common across populations. In general, such a mixture of beta distributions forms a very rich class of distributions (Diaconis and Ylvisaker, 1985). We also extend our approach to genetic data with clusters. The performance of the model is evaluated by using four sets of simulated data from finite island model and illustrated by a real dataset with phenotypes at 377 autosomal microsatellite loci from 52 human populations. The analyses are implemented in a Bayesian framework.
| METHODS |
|---|
|
|
|---|
Mixture beta model for allele frequency
Assume that we have allele frequencies in K populations at I loci, each locus having two alleles A1 and A2. Let pIxK denote the allele frequencies of A1, i.e. the ik-th element of p, pik, is the allele frequency of A1 at locus i in population k, i = 1,...,I; k = 1,...,K. It is sufficient to work with p since the allele frequencies of A1 and A2 sum to 1. To incorporate the correlation among populations into the analysis, we describe allele frequency pik by using the following mixture model
![]() | (1) |
For each locus, we assume that they have the same probabilistic structure and focus on loci that have been subjected to similar evolutionary process. In particular, we assume
![]() | (2) |
x,
y,
k, k = 1,...,K and
are all between 0 and 1. It follows that
![]() | (3) |
k. Again, it is possible to have a different covariance among any pair of populations by assuming wi. Notice that (3) implies that our formulation only allows positive correlation among populations, which, again, agrees with the results of Fu et al. (2003).
When w = 1, pik = xik and (1) reduces to a formulation similar to the usual beta model. The major difference between our formulation and previous ones Balding and Nichols, 1995; Roeder et al., 1998; Holsinger, 1999; Holsinger et al., 2002; Falush et al., 2003) is that in our formulation, E(xik) =
k, e.g. the mean allele frequency is calculated across loci for each population while in previous ones, the beta model is given by
![]() | (4) |
i is the mean allele frequency across populations for each locus. The model of Nicholson et al. (2002) has similar specification and the two models agree to the first and second moments. In these models,
i can be interpreted as the allele frequency in an ancestral population from which the sampled populations have each independently diverged. Thus these populations are related by sharing the ancestral population. Conditional on
i, the allele frequencies are independent. However, this relationship to a common ancestral population is not sufficient to produce an among-population correlation in allele frequencies since, marginally,
![]() | (5) |
i is considered as the allele frequency of the ancestor population and a parameter in the beta model. The same argument applies to the truncated normal distribution by Nicholson et al. (2002). They also recognized that their model does not account for correlation across populations induced by shared history or by gene glow, which could be the most likely deviation from real data.
When E(pik) =
i, estimate of
in (4) is analogous to Weir and Cockerham's (1984)
and is interpreted as a measure of population structure [e.g. Roeder et al. (1998), Holsinger (1999) and see the discussion in Song et al., submitted for publication]. In our formulation, E(pik) =
k, and our estimate of
is a function of
x,
y and w. While this complicates interpretation of the parameters, the traditional method to estimate FST typically ignores correlation across populations.
Recall that Wright's (1951) definition of FST for one locus with two alleles is given by
![]() | (6) |
in Equation (4) corresponds with this definition if we regard
as the temporal variance in allele frequency. We define
(I) as an estimate of FST corresponding with (6) where
is regarded as the temporal variance in allele frequency. If we regard
as the variance in allele frequency across a set of contemporaneous populations, a natural analog of (6) for a finite set of K populations is
![]() | (7) |
. We define
(II) as an estimate of FST corresponding with (7).
These two definitions of FST are equivalent only when the populations are independent, i.e. when there has been no gene flow among them since they simultaneously diverged from an ancestral population. To see this, note that
for a sample of populations. This approaches
only as K approaches infinity and
approaches 0. Thus, the amount of differentiation observed among contemporaneous populations is smaller than the temporal variance in allele frequencies. Weir and Hill (2002) also showed similar effect of correlation on the estimates of genetic variation. Denote the numerator on the right-hand side of (7) as Num and the denominator Denom, then
(III) = E(Num/Denom) fully reflects the correlation among populations (Fu et al., 2003). While an analytical expression for E(Num/Denom) through parameters of the mixture beta distribution is not available, a Bayesian estimate of E(Num/Denom) is directly obtainable for cases where K is known by plugging the posterior estimate of pik into (7) for each locus. A comparison with Bayesian estimate of
based on (4) will reveal more about how the correlation affects the FST analysis in a Bayesian framework.
Bayesian model
For different types of genetic data, the likelihood functions are slightly different. For clarity, we specify the Bayesian model for allele frequency data and codominant genotype data separately. Data directly on allele frequency may be available from haploid organisms or the haploid stage of diploid organisms. More often, data are available as genotype or phenotype counts and calculation of allele frequencies varies with the dominance types and presence or absence of inbreeding.
Modeling allele frequencies
For loci with two allele types, the number of each type in a sample at any locus is assumed to follow a binomial distribution. Consider a diploid population with N individuals, the total number of alleles at each locus is 2N. We use NA1 and NA2, both I x K matrices, to denote the numbers of allele types A1 and A2 in K populations at locus I, e.g. NA1,ik and NA2,ik are the numbers of allele A1 and A2 at locus i in population k, respectively. Also let x and y denote the collection of xik and yi, i = 1,...,I, k = 1,...,K, respectively. If we assume that the magnitude of gametic disequilibrium within populations is negligible, which is equivalent to assuming independent loci, then the likelihood is given by
![]() | (8) |
To complete model specification, we use (2) as the prior distributions for xik and yi and denote them as P(xik|
x,
k) and P(yi|
y,
), respectively. Let P(·) denote the prior distribution for any of the other parameters and hyperparameters. We use uniform(0,1) for P(·) throughout this paper though P(·) could also be specified by using information from previous comparable studies, if available, e.g. the power prior (Ibrahim and Chen, 2000). Let
= (
1,...,
K), then the full conditional posterior distribution for x, y,
, w,
x,
y and
is given by
![]() | (9) |
Modeling codominant markers
Again let us assume that the sample consists of data on genetic variation in K diploid populations at I loci, each locus having two alleles A1 and A2. The phenotype corresponding with each of the three genotypes A1A1, A1A2 and A2A2 is distinguishable when A1 and A2 are codominant. Thus, there is a one-to-one mapping between phenotype and genotype counts. Let NA11, NA12 and NA22, all I x K matrices, denote the numbers of A1A1, A1A2 and A2A2 genotypes in the sample, respectively and similarly,
A11,
A12 and
A22, the frequencies of these genotypes. The numbers of different genotypes at locus i of population k are usually described by a multinomial distribution. If we assume that genotypes are sampled at random across loci, which is equivalent to assuming that magnitudes of gametic and identity disequilibrium within populations are negligible, the likelihood of the sample is given as:
![]() | (10) |
![]() | (11) |
Moreover, the prior distributions for xik and yi are based on (2), and P(·) is used to denote the prior distribution for the rest of the parameters and hyperparameters. The full conditional posterior distribution for x, y,
= (
1,...,
K), w, f,
x,
y and
can be written as
![]() | (12) |
Analytical expressions for the posterior distributions of the parameters and hyperparameters derived from (9) and (12) are not available in closed form. The posterior inference is achieved through Markov chain Monte Carlo (MCMC) simulation. We use the MetropolisHasting algorithm (Gilks et al., 1996) in MCMC implementation.
Test for goodness of fit
We evaluated how well the mixture model fits data in two ways. One is to evaluate whether the mixture beta model provides a good approximation for the allele frequency itself in each population. The other is whether the mixture model appropriately incorporates the correlation among allele frequencies. Note that although the beta model (4) neglects the correlation among populations, there is no need to check whether it provides a good approximation for the allele frequency since the allele frequency pik itself is modeled directly. In the mixture model, the allele frequency is modeled as the weighted sum of two beta variables, thus the accuracy of the estimated allele frequency is compromised at the expense of incorporating correlation. Even so, we are going to show that this compromise occurs to a very minor degree and the mixture model provides an adequate approximation for the allele frequency.
To check whether the mixture beta model provides a good approximation for the allele frequency in our Bayesian model, we apply the
2-statistic to test goodness of fit. For (9), conditioning on xik, yi and w, the statistic is defined in the following way:
![]() | (13) |
and
, where
,
and
i are values sampled from the posterior distribution,
and
are treated as the expected values of A1 and A2 from the model. Strictly, the Bayesian estimates of allele frequencies are not unbiased. However, with a large amount of data and using Uniform(0,1) as the prior specification, the likelihood dominates the posterior inference and we expect the bias to be negligible. So if the mixture model is a good approximation of the allele frequency, the test statistic approximately has a
2-distribution with degrees of freedom I x K 1. Thus,
2/(I x K 1) in (13) should be close to 1.
Under the finite island model, for loci with two alleles, the stationary variance (
2) and correlation (
) for allele frequencies are given by (Fu et al., 2003):
![]() | (14) |
Fitting codominant markers to (12), the
2-statistic is defined similarly but slightly different from (13):
![]() | (15) |
,
and
where
,
and
are obtained by plugging in the corresponding posterior parameter estimates to (11) and pik = wxik + (1 w)yi. Again if the mixture model is a good approximation of the allele frequency, the test statistics approximately has a
2-distribution with degrees of freedom 2(I x K 1) for this model and
2 in (15) divided by 2(I x K 1) should be close to 1.
Modeling data with clusters
In practice, it is very common to have genetic data from different geographical regions. The human microsatellite data analyzed later in this paper is one such example. Populations within the same geographical region may be relatively homogeneous but considerably different from those in other regions. More generally, we may just have a set of populations in which some populations are more similar to one another than to others. In such a case, the set of populations could be clustered into different groups naturally by geographical regions or by an appropriate clustering method (e.g. Pritchard et al., 2000). It is reasonable to assume that populations within the same cluster are more correlated than populations belonging to different clusters. The within-cluster correlation might also be quite different from one another. To incorporate this characteristic into the model, we assume a different set of parameters for each cluster. Suppose there are J clusters in the sample, then for each cluster j, we specify
![]() | (16) |
![]() | (17) |
The above formulation completely neglects the possible correlation among clusters. In fact, by assuming either a common w or a common
x or a common
y (or equivalently yi) across clusters, we are interested in comparing the following six models:
- Model (I): common w, common
x and common
y;
- Model (II): different w, common
x and common
y;
- Model (III): common w, different
x and common
y;
- Model (IV): different w, different
x and common
y;
- Model (V): common w, different
x and different
y;
- Model (VI): different w, different
x and different
y.
- Model (II): different w, common
x and different
y and the other is different w and common
x and different
y. We will not include these two models in the comparison. In our view, it is not reasonable to assume a common parameter across clusters for the individual component of the mixture model but a different parameter for the common component.
Model (VI) is the same model as specified by (16) and (17). Model (V) assumes that the distributions of the two beta variates are different from cluster to cluster but the weight used to mix the two distributions is the same across clusters. Model (I) treats the whole set of the populations as equally correlated and there is no cluster effect. Models (II), (III) and (IV) attempt to build correlation among populations both within clusters and between clusters by sharing a common y. For example, given locus i, it follows from model (IV) that
![]() | (18) |
Model comparison
Models (I)(VI) all assume some correlation among populations. Besides comparing models (I)(VI), we are also interested in comparing models (I)(VI) with four beta models that do not incorporate correlation among populations. One of the models is (4) and the other three are modified versions of (4) with cluster effects incorporated to different extents. Equation (4) assumes no cluster effect, i.e. a common
and
i for the whole set of populations. For notation, we refer to (4) as model (i). The other three models assume a common
and cluster-specific
i [model (ii)], or a cluster-specific
and common
i [model (iii)], or both
and
i cluster-specific [model (iv)]. Model (iv) is given by
![]() | (19) |
ij, (pikj) =
ij(1
ij)
j and
ij represents the mean of allele frequency across all populations in cluster j at locus i. A common
implies that the genetic differentiation among populations, if not correlated, is the same across clusters, though the mean and variance of allele frequency might be different across clusters.
We use a quadratic loss L measure (Ibrahim and Laud, 1994; Gelfand and Ghosh, 1998, Ibrahim et al., 2004) to compare models. Quadratic loss L measure is a decisiontheoretic approach to model choice based on expected loss on replicate datasets. For codominant data, the number of phenotype follows a multinomial distribution and we use the multivariate version developed by Ibrahim et al. (2004) to estimate L measure. Let y denote observed data and z = (z1,...,zn) denote future response. Each zi is a vector. Let
![]() | (20) |
). The expectation for variancecovariance matrix and mean of z is taken with respect to the posterior predictive distribution, which is defined as:
![]() | (21) |
given observed data. The quantity
is a number between 0 and 1, and usually decided by the investigator. Equation (20) indicates that L measure could be considered as a weighted sum of expected variance and estimated variance of future response. When
= 0, L measure depends only on the expected variance of the future observations. For reasons discussed in the section on Test for goodness-of-fit (allele frequency is modeled as a sum of two beta variates), the estimated allele frequency from the mixture model is not as accurate as that from the beta model, thus the mixture model would produce a larger estimated variance of future response than the beta model. With this in mind, we are more interested in small values of
and testing whether the mixture model could provide better (or smaller) expected variance for future observation. For L measure in general, a smaller value indicates a better model. | IMPLEMENTATION |
|---|
|
|
|---|
Finite island model simulation
Wright's (1951) island model is one of the most important theoretical models in population genetics. Its defining assumption is that migration among all populations occurs with equal probability. Substantial results have been derived from this widely studied model (Crow and Aoki, 1984; Cockerham and Weir, 1987; Fu et al., 2003). Under this model, the correlation among populations is induced by migration (gene exchange), but the number of populations involved in gene exchange and mutation rate affects the magnitude of correlation.
To evaluate the performance of the mixture beta model, we considered loci with two alleles and simulated data from the finite island model with different combinations of mutation rate (µ) and migration rate (m), population size (2N) and number of populations (K). In particular, there are four different sets of simulations. The first set of simulation is equivalent to exhaustive sampling, and each simulated dataset consist of allele frequencies from N individuals in each of K populations. In this simulation, we used symmetric mutation rate between the two alleles and the mutation rate µ is set to be one of (1.0 x 103, 1.0 x 104, 1.0 x 105, 1.0 x 106). The migration rate m is one of (0.001, 0.01, 0.1) and population size 2N, one of (1000, 2000, 20 000). We choose the number of populations K to be one of (2, 10, 25, 100). The above combinations of population process parameters would result in the mean allele frequency being 0.5 in each combination, variance varying from 1.0 x 104 to 0.20 and correlation varying from 0.02 to 0.999. In the second set of simulation, we only considered cases with a relatively large population size of 2N = 20 000. For each of the K populations, allele frequencies of n = 100 individuals were sampled from the whole population with a subset of mutation and migration rates chosen from the set above. In the third set of simulation, the mutation rate (µ) is set to be asymmetric and the number of populations (K) is one of (100, 250, 500, 1000). Then the combinations of mutation rate, migration rate, population size were chosen such that the mean allele frequency was 0.4, the correlation was
0.28 and the variance was within the range of (0.01, 0.04). In each combination, 2n = 40 was sampled from each population and K' = 52 populations were sampled. These values were chosen to approximate those from the full human dataset analyzed below. For the fourth set of simulation, the number of populations K is one of (25, 50, 100, 250, 500, 1000) and population size, one of (500, 1000, 2000). Again, we used symmetric mutation rates between two alleles, and chose mutation and migration rates such that the correlation was
0.67 and variance varies in the range of (0.002, 0.06). In each combination, 2n = 40 was sampled from each population and K' = 2 or K' = 10 populations were sampled. This simulation evaluates the performance of mixture beta model when data are available only from a small number of populations. The variance and correlation of this simulation are comparable to those from the human data of different geographical regions. Finally, in the first three sets of simulations, for each combination of process parameters, we sampled allele frequency from 50 independent generations from the stationary distribution, which is equivalent to 50 independent loci evolving under the finite island model. In the fourth set of simulations, we sampled allele frequency from both 50 and 377 loci (there are 377 loci in the human dataset). All the simulated data were fitted to model (9).
For the first set of simulations, Table 1 (K = 25) and Table 2 (K = 100) show the variance and correlation among allele frequencies estimated from mixture model, compared with true values from finite island model and values calculated from the simulated data as summary statistics. With data from exhaustive sampling, the mixture model provides accurate estimates of variance and correlation for all simulated cases when the migration rate is 0.1 and for some cases when the migration rate is smaller (m = 0.01 and µ
1.0 x 104; m = 0.001 and µ
1.0 x 103 for the given 2N and K). The variance and correlation are very close to the true values, whether the true correlation is 0.04 or 0.99. In some other cases when m = 0.01 (K = 25, µ = 1.0 x 105, 2N = 1000 or 2N = 2000; K = 25, µ = 1.0 x 106, 2N = 2000) or 0.001 (K = 25, µ = 1.0 x 106, 2N = 20 000), the mixture model provides good estimates of variance and correlation, which are close to the true values. Only when both mutation and migration rates were small (e.g. m = 0.001 and µ = 1.0 x 105 or 1.0 x 106) along with relatively small population size (2N = 1000 or 2000), did the mixture model fail to estimate the correlation accurately but estimates of variance were still fairly good.
|
|
In population genetics, the distribution of allele frequency are commonly described in terms of 2Nm and 2Nµ. For the mixture model, as shown from the current simulations, the performance is satisfactory other than when both 2Nm and 2Nµ are very small (2Nm = 1 or 2 and 2Nµ is <1). However, obtaining the explicit conditions when the performance of the mixture model is satisfactory/unsatisfactory through simulation has not yet proven feasible because it is determined by the complex interaction among the four process parameters. On the other hand, the values of process parameters are rarely known and it is almost impossible to judge the fit of the mixture model from the values of 2Nm and 2Nµ. Instead, the simulated data that produced these estimates provide some clues. Specifically, when the mixture model does not provide a good estimate of correlation, many loci are at fixation for one of the two alleles. That is, many loci have allele frequencies of 0 or 1. Therefore, the mixture beta model or (any beta model) would not be a good description of the data under these conditions since these models do not have mass at 0 or 1.
When K = 2 or K = 10, the results are similar (data not shown). Both variance and correlation could be accurately estimated unless the values of allele frequencies have a large amount of 0s and 1s. The parameter estimates are associated with a wider credible interval.
The last columns of Tables 1 and 2 give the values of
2/df for the mixture model. In some earlier simulations, the values of
2 were not calculated and represented by /. The values of
2/df are very close to 1 in all the calculated cases, indicating that the mixture model could provide a good approximation of allele frequency whether the correlation is well estimated or not.
Table 3 shows the comparison of moments from the second set of simulations where allele frequencies from 2n = 200 are sampled from 2N = 20 000. Values of
2/df are close to 1 and the estimated variance and correlation are very close to the true values from finite island model in all simulated cases. Table 4 shows the results from the third set of simulations with K' = 52 and 2n = 40, similar to the results from the second set of simulations. Allele frequency is well approximated and both variance and correlation well estimated. For the fourth set of simulations with K' = 2 or K' = 10 from K = (25, 50, 100, 250, 500, 1000), Table 5 selectively shows estimates with K' = 2. For both I = 50 and I = 377, correlation could be well estimated and with I = 377, correlation is estimated with better precision. Estimates with K' = 10 is omitted here. For the last three sets of simulations, there are few 1s and 0s in the data. Also for sampled data, the mixture data usually provides better estimates for variance and correlation than estimates as summary statistics directly from simulated data.
|
|
|
In general, these results suggest that the mixture model works well for a broad range of data, only if the data consist of loci in which populations are mostly fixed for alternative alleles. Both allele frequency and the variancecovariance structure are accurately estimated. Model parameter estimates of
x,
y, w and
were reported by Fu (2003). The mean allele frequency could always be well estimated whether the mutation rates between the two alleles are symmetric or not. Estimates of
x and
y indicate that the mixture model could be two unimodal betas, a U-shaped beta and unimodal beta as well as two U-shaped betas.
An example of human data
Cann et al. (2002) reported the development of a HGDP-CEPH Human Genome Diversity Cell Line Panel. The dataset includes genotypes at 377 autosomal microsatellite loci in 1056 individuals from 52 worldwide populations. Rosenberg et al. (2002) clustered the 52 populations into five groups, which correspond to five geographical areas, by using the Bayesian clustering method proposed by Pritchard et al. (2000). In the dataset, there are multiple allele types for each locus. To illustrate the use of our mixture model, we designate the most frequent allele type as A1 and group all the other allele types into a pseudo-allele type A2. Since the numbers of each of the three genotypes A1A1, A1A2 and A2A2 are known from the dataset, this is an example of codominant data. We also used the five clusters presented by Rosenberg et al. (2002) and fit the data to models (i)(iv) and models (I)(VI). The five groups are EuroAsia (21 populations), African (6 populations), EastAsia (18 populations), American (5 populations) and Oceania (2 populations).
The results of model choice analyses using L measure are shown in Table 6. Among models (I)(VI), model (VI) consistently produces the smallest L measure. Thus, we consider it as the best model for these data. Model (VI) recognizes substantial differences among clusters but ignores the correlation among clusters. Since none of models (II)(IV) gives a performance as good as model (VI), we take this as evidence that these models could not appropriately incorporate correlation among clusters. From now on, we focus further discussion on each cluster. Values of
2/df (Table 7) from each model are all very close to 1, suggesting that models (I)(VI) provide an adequate approximation to allele frequency pik.
|
|
Among models (i)(iv), model (iv) outperforms the other three models based on L measure. Model VI produces smaller L measure than model (iv) when
< 0.3, which indicates that model (VI) provides a better estimate of expected variance for future response than model (iv). Also, the estimated
2/df and the corresponding 95% credible interval from model (VI) is 1.040 (1.027, 1.054). This value is only a little larger than the corresponding estimates from model (iv): 1.028 (1.014, 1.046). In other words, not only does model (VI) produce a smaller expected variance for future response and incorporate correlation into account, it costs little accuracy in estimating allele frequency for the mixture model to do so. In short, a model accounting for both the grouping of populations into population clusters and the correlation in allele frequencies across populations within clusters provides the most satisfactory description of these data.
Table 7 shows the parameter estimates from models (I)(VI). For model (VI), posterior estimate of inbreeding coefficient f is 0.015. A small value is expected since in human populations, inbreeding occurs very infrequently and the closest degree of inbreeding usually encountered in most societies is first-cousin mating. Also with codominant data, f is precisely estimated as shown by the narrow credible interval. Estimates of
x,
y and w are quite different across clusters, indicating that it is appropriate to fit data from each cluster into separate mixture models. Figure 1 shows the kernel density estimation of parameters from each cluster based on model (VI).
|
Estimated variance of allele frequency and correlation among populations within each cluster from model (VI) are given in Table 8. Since there are only a few 0s and 1s in the dataset, along with the small estimate of f, we believe that these estimates of correlation among populations are reliable and accurate. In general, the populations are highly correlated and it would not be sensible to ignore such correlation in the analysis. Particularly, the cluster of EastAsia has the highest correlation of 0.885 followed by African, 0.767; EuroAsia, 0.747; and Oceania, 0.614. The populations in the Americas have the least correlation of 0.420. These values are consistent with evidence that the history of humans is shorter in the Americas than on other continents. The most recent claim by Seielstad et al. (2003) is that humans first entered America within the last 15 000 years.
|
Estimates of
(I) and
(III) are shown in Table 8, where the estimates for
(III) were obtained by plugging posterior estimates of allele frequency pik from model (VI) for each loci then taking average over populations within each cluster. These estimates assume that the actual number of populations exchanging genes is equal to the number in our sample, but the estimates for EuroAsia and East Asia are not likely to be much affected because of the relatively large number of populations sampled in these regions (1820; compared with Song et al. submitted for publication). For all but one cluster, estimates of
(III) are small (i.e.
0.05) and indicate little genetic differentiation among populations within each cluster; the cluster of American populations shows evidence of moderate genetic differentiation among populations with an estimate of
(III) > 0.1.
Comparing
(III) with
(I) from model (iv), we again note that estimates of
(I) from each cluster are uniformly greater than estimates of
(III) (Table 8). These results are expected, because the variance in allele frequency among contemporaneous populations is smaller than the variance in allele frequency across time. The small number of populations sampled from America and Oceania is likely to play a role in the difference, because the influence of the among-population correlation in allele frequencies on
(III) declines as the number of sampled populations increases. Rosenberg et al. (2002) used Weir and Cockerham's method (Weir and Cockerham, 1984) to estimate allele frequency differentiation for each cluster. These results are denoted as
c and also shown in Table 8 for comparison purpose. It is not surprising that their estimates are similar to estimates of
's based on model (iv) but greater than our estimates of
(III) as both estimates neglect correlation among populations.
In this analysis, we reduced multiallelic microsatellite data to biallelic data by designating the most frequent allele type as A1. To test the sensitivity of our estimates to this assumption, we did a second analysis in which we assigned the second most frequent allele type as A1 and fit the converted data to Model (VI). We report the results in Table 8. The estimates of
(III) and correlation are very similar to the estimates when most frequent allele type is treated as A1. It remains the same as the cluster of EastAsia has the highest correlation (0.851 versus 0.885) followed by Africa and EuroAsia with the Americans having the least correlation. Estimates of f (0.0147 (0.0117, 0.0181)) and
2/df (1.033 (1.015, 1.055)) are also similar, so are estimates of f (0.0147 (0.0117, 0.0181)) and
2/df (1.033 (1.015, 1.055)). So our estimates are quite robust. Furthermore, the fact that estimates of
(I) from model (iv) are very similar to estimates by Rosenberg et al. (2002) although the former are based on converted biallelic data and the latter based on multiallelic data, also indicates this conversion had little effect.
| DISCUSSION |
|---|
|
|
|---|
In population genetics, probability models have widely been used to describe allele frequency and infer about population structure. The beta model developed by Balding and Nichols (1995) and its multiallelic version are the most commonly used parametric approach. Nicholson et al. (2002) proposed a truncated normal model for single nucleotide polymorphism allele frequencies. The beta distribution is usually justified as the equilibrium distribution under several genetic models of interest and the rationale for the truncated normal distribution is in terms of modeling the transient states of allele frequency. However, the assumption of equilibrium or transient states is effectively impossible to check. In most cases, it is more practical to select the model which gives the best fit of data. The marginal distribution of allele frequency is likely to be approximated by a beta distribution unless it is multimodal.
Correlation among populations has not been treated adequately in most probability models for allele frequency, even though it affects the estimate of population structure. The mixture model we present here explicitly incorporates the correlation among population. Based on evaluation using simulated data from the finite island model, the mixture model provides a good approximation of allele frequency and an accurate estimate of correlation in general unless many populations are fixed for only one allele at many loci. The model we develop can be easily applied either to allele frequency data or to dominant/codominant genotype data.
Although parameters of our model are not directly interpretable as measures of population structure, estimates of FST corresponding with two different interpretations of its variance term are available. Our results show that the amount of genetic differentiation among contemporaneous populations is overestimated when among-population correlation is not accounted for. In the future, we will pursue how to construct models appropriate for datasets with a large proportion of 0s and 1s. One possibility would be to use a mixture of truncated normal distribution (Nicholson et al., 2002) since it has mass at 0 and 1. Our current model is appropriate for multilocus genotype data with two allele types and we will seek to extend our model to multiallelic data. We also extend our approach to model data with clusters and attempt to model correlation both within and among clusters. Effects of correlation on estimates of effective size, migration rate and other quantities of interest will also be the topic of future research.
Received on May 19, 2004; revised on October 13, 2004; accepted on November 22, 2004
| REFERENCES |
|---|
|
|
|---|
Balding, D.J. (2003) Likelihood-based inference for genetic correlation coefficients. Theoret. Popul. Biol., 63, 221230.
Balding, D.J. and Nichols, R.A. (1995) A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica, 96, 312[CrossRef][ISI][Medline].
Balding, D.J. and Nichols, R.A. (1997) Significant genetic correlations among Caucasians at forensic DNA loci. Heredity, 78, 583589.
Beerli, P. and Felsenstein, J. (1999) Maximum-likelihood estimation of migration rates and effective population numbers in two populations using a coalescent approach. Genetics, 152, 763773
Cann, H.M., de Toma, C., Cazes, L., Legrand, M.F., Morel, V., Piouffre, L., Bodmer, J., Bodmer, W.F., Bonne-Tamir, B., Cambon-Thomsen, A., et al. (2002) A human genome diversity cell line panel. Science, 296, 261262[ISI][Medline].
Cockerham, C.C. and Weir, B.S. (1987) Correlations, descent measures: drift with migration and mutation. Proc. Natl Acad. Sci. USA, 84, 85128514
Crow, J.F. and Aoki, K. (1984) Group selection for a polygenic behavioral trait: estimating the degree of population subdivision. Proc. Natl Acad. Sci. USA, 81, 60736077
Diaconis, P. and Ylvisaker, D. (1985) Quantifying prior opinion. Proceedings of the 2nd Valencia International Meeting on Bayesian StatisticsSeptember 1983North-Holland, Amsterdam , pp. 133156 Bayesian Statistics 2.
Falush, D., Stephens, M., Pritchard, J.K. (2003) Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics, 164, 15671587
Fu, R., Gelfand, A.E., Holsinger, K.E. (2003) Exact moment calculations for genetic models with migration, mutation and drift. Theor. Popul. Biol., 63, 231243[CrossRef][ISI][Medline].
Fu, R. (2003) Probabilistic structure and statistical inference for nonexplicit population models of allele frequency. , CT PhD Dissertation University of Connecticut.
Gelfand, A.E. and Ghosh, S.K. (1998) Model choice: a minimum posterior predictive loss approach. Biometrika, 85, 113
Gilks, W.R., Richardson, S., Spiegelhalter, D.J. (1996) Introducing Markov chain Monte Carlo. In Gilks, W.R., Richardson, S., Spiegelhalter, D.J. (Eds.). Markov Chain Monte Carlo in Practice, , London Chapman & Hall, pp. 119.
Holsinger, K.E. (1999) Analysis of genetic diversity in geographically structured populations: a Bayesian perspective. Hereditas, 130, 245255[CrossRef].
Holsinger, K.E., Lewis, P.O., Dey, D.K. (2002) Bayesian analysis of population structure with dominant marker data. Mol. Ecol., 11, 11571164[CrossRef][Medline].
Ibrahim, J.G. and Laud, P.W. (1994) A predictive approach to the analysis of designed experiments. J. Am. Stat. Assoc., 89, 309319[CrossRef].
Ibrahim, J.G. and Chen, M.-H. (2000) Power prior distributions for regression models. Stat. Sci., 15, 4660.
Ibrahim, J.G., Chen, M., -H. and Sinha, D. (2004) Bayesian methods for joint modeling of longitudinal and survival data with applications to cancer vaccine studies. Statist. Sin., 14, 863883.
Nicholson, G., Smith, A.V., Jónsson, F., Gústaffson, Ó., Steánsson, K., Donnelly, P. (2002) Assessing population differentiation and isolation from single-nucleotide polymorphism data. J. R. Stat. Soc. Ser. B, 64, 695716[CrossRef].
Pritchard, J.K., Stephens, M., Donnelly, P. (2000) Inference of population structure using multilocus genotype data. Genetics, 155, 945959
Roeder, K., Escobar, M., Kadane, J.B., Balazs, I. (1998) Measuring heterogeneity in forensic databases using hierarchical Bayes models. Biometrika, 85, 269287
Rosenberg, N.A., Pritchard, J.K., Weber, J.L., Cann, H.M., Kidd, K.K., Zhivotovsky, L.A., Feldman, M.W. (2002) Genetic structure of human populations. Science, 298, 23812385
Seielstad, M., Yuldasheva, N., Singh, N., Underhill, P., Oefner, P., Shen, P., Wells, R.S. (2003) A novel Y-chromosome variant puts an upper limit on the timing of first entry into the Americas. Am. J. Hum. Genet., 73, 700705[CrossRef][ISI][Medline].
Weir, B.S. and Cockerham, C.C. (1984) Estimating F-statistics for the analysis of population structure. Evolution, 38, 13581370[CrossRef][ISI].
Weir, B.S. and Hill, W.G. (2002) Estimating F-statistics. Annu. Rev. Genet., 36, 721750[CrossRef][ISI][Medline].
Wright, S. (1951) The genetical structure of populations. Ann. Eugen., 15, 323354.
Wright, S. Evolution and the Genetics of Populations. Volume 2, The Theory of Gene Frequencies, (1969) , Chicago, IL University of Chicago Press.
| |||||||||||||||||||





















