Skip Navigation

Bioinformatics 2007 23(13):i479-i489; doi:10.1093/bioinformatics/btm171
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Google Scholar
Right arrow Articles by Sohn, K.-A.
Right arrow Articles by Xing, E. P.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Sohn, K.-A.
Right arrow Articles by Xing, E. P.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Spectrum: joint bayesian inference of population structure and recombination events

Kyung-Ah Sohn and Eric P. Xing *

School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 INHERITANCE MODEL
 3 MCMC INFERENCE
 4 RESULTS
 5 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: While genetic properties such as linkage disequilibrium (LD) and population structure are closely related under a common inheritance process, the statistical methodologies developed so far mostly deal with LD analysis and structural inference separately, using specialized models that do not capture their statistical and genetic relationships. Also, most of these approaches ignore the inherent uncertainty in the genetic complexity of the data and rely on inflexible models built on a closed genetic space. These limitations may make it difficult to infer detailed and consistent structural information from rich genomic data such as populational single nucleotide polymorphisms (SNP) profiles.

Results: We propose a new model-based approach to address these issues through joint inference of population structure and recombination events under a non-parametric Bayesian framework; we present Spectrum, an efficient implementation based on our new model. We validated Spectrum on simulated data and applied it to two real SNP datasets, including single-population Daly data and the four-population HapMap data. Our method performs well relative to LDhat 2.0 in estimating the recombination rates and hotspots on these datasets. More interestingly, it generates an ancestral spectrum for representing population structures which not only displays sub-structure based on population founders but also reveals details of the genetic diversity of each individual. It offers an alternative view of the population structures to that offered by Structure 2.1, which ignores chromosome-level mutation and recombination with respect to founders.

Contact: epxing{at}cs.cmu.edu


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 INHERITANCE MODEL
 3 MCMC INFERENCE
 4 RESULTS
 5 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Single nucleotide polymorphisms, or SNPs, represent the largest class of individual differences in DNA. SNPs are remnants of ancient, (possibly) neutral DNA alterations dated back to a time measured at a genealogical scale; they contain more fine-grained information on molecular evolution than that revealed by orthologous genomic sequences from multiple species. In general, the higher the frequency of a SNP allele, the older the mutation that produced it, so high-frequency SNPs largely predate human population diversification whereas low-frequency ones appeared afterwords. Therefore, population-specific alleles may bear important information about human evolution such as specific migrations and genetic diversifications (Stoneking, 2001).

Recent experimental advances have led to an explosion in SNP data from various populations. For example, the international SNP map working group (International SNP Map Working Group) has reported the identification and mapping of 1.4 million SNPs in human genomes from four world populations. The deluge of SNP data fuels the long-standing interest in analyzing patterns of genetic variations to reconstruct the ancestral structures of modern human populations, for such genetic ancestral information can both shed light on the evolutionary history of modern populations and provide guidelines for more accurate association studies and for many other population genetics problems.

A number of variants of statistical admixture models for genetic polymorphisms have been proposed for the analysis of current population structure (Falush et al., 2003; Pritchard et al., 2000; Rosenberg et al., 2002). These models are instances of a more general class of hierarchical Bayesian models known as mixed membership models (Erosheva et al., 2004), which postulate that genetic markers of each individual are iid (Pritchard et al., 2000) or spatially coupled (Falush et al., 2003) samples from multiple population-specific fixed-dimensional multinomial distributions (known as ancestry proportions (Flausch et al., 2003) or AP) of marker alleles. Under this assumption, the admixture model identifies each ancestral population by a specific AP (that defines a unique allele frequency profile for each ancestral population for each marker) and displays the fraction of contributions from each AP in a modern individual chromosome as a structural map. Figure 1 shows an example of structural maps of four modern populations inferred from a portion of the HapMap multi-population dataset by Structure 2.1 (Falusch et al., 2003; Pritchard et al., 2000). In this population structural map, each individual is represented as a thin vertical line which shows the fraction of the individual's chromosome which originated from each ancestral population, as given by a unique AP.


Figure 1
View larger version (21K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. Population structural map inferred by Structure 2.1 on HapMap multi-population data consisting of CEU, YRI, HCB and JPT populations.

 
However, since an AP merely represents the frequency of alleles in an ancestral population, rather than the actual allelic content or haplotypes of the alleles themselves, the admixture model does not model genetic drift due to mutations from the ancestral alleles. Moreover, in the extant admixture models, the correlations between loci along the chromosome are only captured by the linkage disequilibrium due to variation in the AP fractions over all markers among individuals, or due to a ‘recombination’ process between APs (rather than ancestral chromosomes) for sampling markers along a modern chromosome. These two scenarios are known as ‘mixture LD’ and ‘admixture LD’ respectively (Falush et al., 2003). Neither one captures the actual recombination events at the ancestral chromosome level, so they do not enable inference of the founding genetic patterns, the recombination events, the age of the founding alleles or the composition of individual chromosomes at founding chromosome level (Excoffier and Hamilton, 2003). Actually, while this model aims to provide ancestry information for each individual and each locus, there is no explicit representation of ‘ancestors’ as a real chromosome haplotype. Therefore, the inferred population structural map emphasizes revealing the contributions of abstract population-specific ancestral proportion profiles, which does not directly reflect individual diversity. This representation may not be optimal, as seen in Figure 1: each modern population is represented by a very homogeneous (but distinct) population structural sub-map, which reflects little about the actual genetic diversity of each population and individual and little about the relative similarity between populations. For example, the YRI population from Africa is known to be genetically diverse, but in Figure 1 it appears to be the most homogeneous.1

In this article, we present a new method, Spectrum, for inferring and representing population structures, using a unified statistical framework for modeling the genetic inheritance process that allows both recombination among an unspecified number of founding alleles and mutations from these founders. Based on this model, which represents a well-defined generative model for the observed chromosomes, we represent the population structure in terms of an ancestral spectrum which shows the ancestral composition of each modern individual chromosome in terms of its origin among the chromosomal ancestors. By considering the different ancestral association patterns among populations, this spectrum helps to separate the sub-populations, as well as reveal the diversity among individuals and populations. Moreover, our model allows us to recover the recombination events in each individual chromosome. In fact, the population structure can play an important role for the LD analysis. Figure 2 shows the LD measurements for all pairwise loci on the ENm010 region from HapMap DB. When we compute LD in three populations of CEU (European ancestry), HCB and JPT (Asian ancestry) together (Fig. 2a), some degree of block-like patterns are visible, but when CEU (European ancestry) and YRI (African ancestry) populations are mixed (Fig. 2b), the block structure is less obvious. This result implies the existence of different genetic processes in the evolutionary history of the two populations. Hence, if we perform LD or recombination analysis on a population which may have a concealed sub-population structure, it would be more informative to perform LD analysis on each sub-population separately, and our ancestral spectrum offers a way to classify such sub-populations on genetic basis. While the statistical methodologies developed so far mostly deal with ancestral inference and LD analysis separately using specialized models that do not capture the close statistical and genetic relationships of these two problems, we propose a unified framework which allows joint inference of the population structure and the recombination patterns.


Figure 2
View larger version (93K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. The LD measurements, Figure 2 (upper right), and the P-values for Fisher's exact test (lower left), of HapMap DB (Gadmundur et al., 2005). Note the LD-block structures on the mixed populations of CEU and YRI are rather opaque compared to the LD patterns of CEU+HCB+JPT populations.

 
We assume that individual chromosomes in a modern population originated from a number of ancestral chromosomes via biased random recombination and mutation. By associating each ancestor with a hidden state, the recombination between the ancestors can follow a state transition process, and the mutation can follow an emission process in the hidden Markov model. Hence each individual chromosome can be thought of as a ‘mosaic’ of ancestral chromosomes under this model.

Several existing methods have employed similar ideas. For example, Daly et al. (2004) and Greenspan and Geiger (2001) have developed hidden Markov models for locating recombination hotspots in haplotypes; Anderson and Novembre (2003) proposed a minimum description length (MDL) method for optimal haplotype block finding. While these models are based on a similar assumption that each observed haplotype is a ‘mosaic’ of ancestral haplotypes and the formation of the mosaic is governed by a hidden Markov process over the ancestor space, these HMMs cannot be used easily to infer individual recombination events because the block boundaries (which conceptually correspond to the recombination sites) of all individual chromosomes are decided outside the model via model selection, and the only intrinsic stochasticity lies in the choice of the ‘ancestors’ at each block for each chromosome rather than the genomic locations of recombination events in each chromosome. It is also unclear to what extent this class of approaches might be helpful for applications involving explicit ancestral map inference as in Rosenberg et al. (2002) and for interpreting LD patterns that do not have sharp block boundaries as in Figure 2b.

While most of the previous approaches ignore the inherent uncertainty in the genetic complexity (e.g. the number of genetic founders of a population) of the data, our new approach employs a recently developed non-parametric Bayesian model known as a Hidden Markov Dirichlet Process (Xing and Sohn, 2007) to extend a closed genetic inheritance model based on a fixed number of founders to an open ancestral space, which allows more flexible control over the number of genetic founders than has been provided by the statistical methods proposed thus far. We report validation of Spectrum on both simulated data and on two real datasets of HapMap and Daly data, and compare with a number of established methods.


    2 INHERITANCE MODEL
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 INHERITANCE MODEL
 3 MCMC INFERENCE
 4 RESULTS
 5 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
We describe a statistical model for generating individual haplotypes in a modern population from a hypothetical pool of ancestral haplotypes via recombination and mutations. We begin our exposition with a parametric Bayesian model of genetic inheritance involving recombination and mutation over a fixed number of ancestors; then we extend the model to open ancestral space which requires no ad hoc specification of the number of ancestors, via a non-parametric Bayesian approach.

2.1 Hidden Markov model for recombination and mutation in closed ancestral space
We begin with the assumption that modern chromosomes are derived from ancestral chromosomes via biased random recombination and mutation. This assumption corresponds to an idealized non-interference model for chromosomal crossover and a star genealogy over every inherited site. Although very simple and not realistic, this assumption has been widely adopted by statistical genetic models, such as the BLADE model for mapping (Liu et al., 2001), and numerous models for haplotype inference (Zhang et al., 2006). If the number of ancestors is known to be K, sequential selection of recombination targets from a set of ancestral chromosomes can be modeled as a hidden Markov process, where the hidden states correspond to the ancestors, the transition probabilities correspond to the recombination rates between the recombining chromosome pairs, and the emission model corresponds to a mutation process that passes the chosen chromosome in the ancestors to the descendants.

Assuming that individual haplotypes over T SNPs Formula for e = 1, 2 are given unambiguously for the study population, as is the case in many LD and haplotype-block analyses (Anderson and Novembre, 2003; Daly et al., 2001), we can now treat the paternal and maternal haplotypes of N individual as 2N iid samples and omit the parental index e. Although this assumption may seem stringent, our model can easily generalize to unphased genotype data by incorporating a simple genotype model, as will be explained later in this section.

Now, let Formula for Formula be the K ancestral chromosomes, and let Formula denote the sequence of inheritance variables that specify the index of the ancestral chromosome at each SNP locus for each chromosome i. Also suppose that the transition probabilities of the HMM are given as a Formula matrix {pi}. When no recombination takes place during the inheritance process that produces the haplotype Hi from an ancestor k, then Formula for all Formula . When recombination occurs between a locus t and t + 1, we have Formula . We can introduce a Poisson point process to control the duration of non-recombinant inheritance. That is, given that Formula , then with probability Formula , where dt is the physical distance between two loci, r reflects the rate of recombination per unit distance and Formula is the self-transition probability of ancestor k defined by HMM, we have Formula ; otherwise, the source state (i.e. ancestor chromosome k) pairs with a target state (e.g. ancestor chromosome Formula ) between loci t and t + 1 with probability Formula . That is,


Formula 1

(1)

Hence, each haplotype Hi can be thought of as a mosaic of segments of multiple ancestral chromosomes from the ancestral pool Formula .

The emission process of this model corresponds to a mutation model from an ancestor to the matching descendant. For simplicity, we adopt the single-locus mutation model in Xing et al. (2004):


Formula 2

(2)

where hi,t and ai,t denote the alleles at locus t of an individual chromosome i and its corresponding ancestor k, respectively; {theta}k indicates the ancestor-specific mutation rate; and Formula denotes the number of possible alleles. As discussed in Liu et al. (2001), this model corresponds to a star genealogy resulting from infrequent mutations over a shared ancestor and is widely used in statistical genetics as an approximation to a full coalescent genealogy starting from the shared ancestor. Assuming that the mutation rate {theta}k admits a Beta prior with hyperparameter Formula ,2 the marginal conditional likelihood of all the haplotype instances Formula given the set of ancestors Formula and the ancestor indicators Formula can be obtained by integrating out {theta} from the joint conditional probability starting from Equation (2) which reduces to:


Formula 3

(3)

where Formula is the gamma function, Formula is the normalization constant associated with Formula (which is a prior distribution for {theta}), Formula is the number of alleles which were not mutated with respect to the ancestral allele, and Formula is the number of mutated alleles. The counting record Formula is a sufficient statistic for the parameter {theta}k (Xing et al., 2004).

2.2 Genotype model for unphased data
The model described above can be easily generalized to unphased genotype sequence data by introducing a genotyping model as in Xing et al. (2004). We assume that the observed genotype at a locus is determined by the paternal and maternal alleles of this site via the following genotyping model:


Formula 4

(4)
where Formula denotes the unordered pair of two actual SNP allele instances at locus t; ‘Formula denotes set difference by exactly one element; ‘Formula ’ denotes set difference of both elements; and µ1 and µ2 are appropriately defined normalizing constants. Again we place a beta prior Formula on {xi} for smoothing. Under the above model specifications, it is standard to derive the posterior distribution of each haplotype Hie given all other haplotypes and all genotypes by integrating out parameters {xi} and resorting to the Bayes theorem, which enables a collapsed Gibbs sampling step where necessary.

It is noteworthy that the proposed model presents a well-defined generative model for the observed haplotypes or genotypes based on a spatial point process for stochastic recombination and also random mutations over a pool of complete ancestral chromosomes. The difference in our model compared to approaches with a similar HMM assumption (Anderson and Novembre, 2003; Daly et al., 2001; Patil et al., 2001) is that, in those models, the ‘ancestors’ are defined independently for each block rather than as whole chromosomes, which is biologically less meaningful. Although such a generative process is still a simplification of the real biological mechanism, it enables the joint statistical characterization of a number of genetic variables of interest, via posterior inference based on well-founded statistical principles, and it strikes a reasonable tradeoff between being biologically meaningful and computationally manageable.

2.3 Hidden Markov Dirichlet process for inheritance in open ancestral space
So far, we have been assuming that recombination and mutation take place in a closed ancestral space; that is, the number of ancestral chromosomes is known a priori. But this assumption, which is also widely adopted in other existing approaches for LD analysis and ancestral inference, ignores the inherent uncertainty in the genetic complexity of populations. Model selection according to information theoretic score or Bayes factors is a typical solution to problems of this nature, but it can be inflexible when the hypothesis space is large. Recently, we have developed a non-parametric Bayesian framework for modeling genetic polymorphism based on the Dirichlet process (DP) mixtures and extension (Xing and Sohn, 2007; Xing et al., 2004) which allows more flexible control over the number of genetic founders.

Under coalescence-with-mutation (but without recombination), one can treat a haplotype from a modern individual as a descendant of their most recent common ancestor. Hoppe (1984) observed that a coalescent process in an infinite population leads to a partition of the population at every generation that can be succinctly captured by the following Pólya urn scheme.

Consider an urn that at the outset contains a ball of a single color. At each step we either draw a ball from the urn and replace it with two balls of the same color, or we are given a ball of a new color which we place in the urn. One can see that such a scheme leads to a partition of the balls according to their color. Mapping each ball to an individual haplotype and each color to its corresponding ancestor, this partition is equivalent to the one resulting from the coalescence-with-mutation process (Hoppe, 1984) and the probability distribution of the resulting allele spectrum—the numbers of colors (respectively haplotypes) with every possible number of representative balls (respectively descendants)—is captured by the well-known Ewen's sampling formula (Tavare and Ewens, 1998). Blackwell and MacQueen (1973) showed that this Pólya urn model yields samples whose distributions are those of the marginal probabilities under the Dirichlet process (Ferguson, 1973).

Xing et al. (2004) proposed a haplotype inheritance model (without recombination) as a Dirichlet process mixture (DPM), of which a DP is used as the prior over the unbounded ancestral space (of founding haplotypes and their associated mutation rates). This model can be understood in the above Pólya urn scenario as associating each individual haplotype with a ball in the Pólya urn and associating the ancestral haplotypes and their own mutation rates with the colors. Essentially a DPM defines a ‘clustering’ of the modern individual haplotypes based on the ball color. Notice that our construction so far requires no prior specification of the number of ancestors. Thus a DPM offers a principled approach to generalize the finite mixture model for haplotypes to an infinite mixture model that models uncertainty regarding the size of the ancestral haplotype pool, and at the same time it provides a reasonable approximation to the coalescence model by utilizing the partition structure resulted thereof (but allows further mutation within each partite to introduce further diversity among descendants of the same founder).

Using a further extension of DPM known as the Hidden Markov Dirichlet Process (HMDP) (Xing and Sohn, 2007) which models stochastic transitions among states in an open state space, we can extend the HMM model proposed in Section 2.1 to work in an infinite ancestral space. Recall that in the HMM inheritance model described earlier, the transition probabilities can be represented as a Formula matrix, and each row of the matrix indicates the probabilities of transitioning (i.e. recombination) from the source state (e.g. ancestor k) to all the target states (all ancestors in the pool), which sums to 1. Now we do not restrict ourselves with such a K and generalize the HMM to a space with countably infinite ancestors in principal. Without going into technical details (but see Xing and Sohn, 2007), our generalization can be understood as modeling each row of transition probabilities (from a specific ancestor) of an HMM with a unique DP over open ancestral space, letting all these DPs (each of which is over a particular row) follow a higher level DP to ensure that they are all defined on the same open ancestral space. We have developed a hierarchical Pólya urn scheme to realize this model and facilitate sampling based posterior inference (Xing and Sohn, 2007) for which we omit details due to lack of space. But at a high level, the recombination probability under HMDP Formula can be expressed by the same formula as in Equation (1), except that the Formula now indicates the transition probability from a source state k to a target state Formula in an open ancestral space under HMDP (see Xing and Sohn, 2007) for the somewhat cumbersome form for this variable). This Formula specifies the probability of ancestor chromosome k pairing with ancestor Formula given that a recombination is taking place, and Formula can grow arbitrarily large as needed conditioning on the given data.

The generative process described above leads naturally to an algorithm for population genetic inference. Unlike the classical coalescence models for recombination (Hudson, 1983), which have been primarily used for theoretical analysis and simulation and are not feasible for reverse ancestral inference based on observed genetic data, Spectrum provides a non-parametric Bayesian formalism for recombination and inheritance that is well suited for data-driven posterior inference on the latent variables that can yield rich information of the population ancestry and genetic structure of the study population. For example, using Spectrum, given the haplotype (or genotype) data, one can infer the ancestral structure, LD and recombination patterns of a population using the posterior distribution of inheritance variable c and ancestral state a, as we will elaborate in the sequel.


    3 MCMC INFERENCE
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 INHERITANCE MODEL
 3 MCMC INFERENCE
 4 RESULTS
 5 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
In this section, we briefly describe a Gibbs sampling algorithm for posterior inference under HMDP. Recall that a Gibbs sampler draws samples of each random variable in the model from the conditional distribution of the variables given (previously sampled) values of all the remaining variables. The variables of interest in our model include Formula , the inheritance variables specifying the origins of SNP alleles of all loci on each haplotype, and Formula , the founding alleles at all loci of each ancestral haplotype. All other variables in the model, e.g. the mutation rate {theta}, are integrated out.

The Gibbs sampler alternates between two stages. First it samples the inheritance variables Formula , conditioning on all given individual haplotypes Formula and the most recently sampled configuration of the ancestor pool Formula ; then given Formula and current values of the ci,t's, it samples every ancestor ak.

To improve the mixing rate, we sample the inheritance variables one block at a time. That is, every time, we sample {delta} consecutive states Formula starting at a randomly chosen locus t + 1 along a haplotype. (For simplicity we omit the haplotype index i here and in the forthcoming expositions when it is clear from context that the statements or formulas apply to all individual haplotypes.) Let Formula denote the set of previously sampled inheritance variables. Let n and m denote the sufficient statistics for the transitions between ancestors in HMDP Pólya urn scheme. And let lk denote the sufficient statistics associated with all haplotype instances originated from ancestor k. The predictive distribution of a {delta}-block of inheritance variables can be written as:


Formula 5

(5)

This expression is simply Bayes' theorem with Formula playing the role of the likelihood and Formula playing the role of the posterior. Note that, naively, the sampling space of an inheritance block of length {delta} is Formula where Formula represents the cardinality of the ancestor pool. However, if we assume that the recombination rate is low and block length is not too big, then the probability of having two or more recombination events within a {delta}-block is very small and thus can be ignored. This approximation reduces the sampling space of the {delta}-block to Formula , i.e. Formula possible recombination targets times {delta} possible recombination locations. Accordingly, Equation (5) reduces to:


Formula

for some Formula . Recall that in an HMDP model for recombination, given that the total recombination probability between two loci d-units apart is Formula (assuming d and r are both very small), the transition probability from state k to state Formula is:


Formula

where Formula represents the transition probability vector for ancestor k under HMDP. Putting everything together, we have the proposal distribution for a block of inheritance variables.

To sample the ancestors Formula , we can derive the posterior distribution from Equation (3). We refer the reader to Xing et al. (2004) for further details.


    4 RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 INHERITANCE MODEL
 3 MCMC INFERENCE
 4 RESULTS
 5 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
We validated Spectrum on a simulated dataset and analyzed two real datasets: the HapMap four-population data (Gudmundur et al., 2005) and the single-population Daly data (Daly et al., 2001). Although Spectrum can be applied to both haplotype and genotype data, in this article we focus on haplotype data for simplicity. The HapMap data includes 209 individuals' haplotypes (phased by PHASE software (Gudmundur et al., 2005) on the ENm010 region of chromosome 7. The Daly data includes 256 individuals (after excluding one person due to severe missing data), whose haplotypes (512 in total) can be recovered from trio data. For each dataset, we focus on the analysis of population structure and recombination patterns based on the ancestral origin of each SNP locus in each individual haplotype.

4.1 Analyzing a simulated haplotype population
We simulated a population of individual haplotypes with a fixed number Ks (unknown to Spectrum) of randomly generated ancestor haplotypes, on each of which a set of recombination hotspots were pre-specified. Then we applied a hand-specified recombination process, which is defined by a Ks-dimensional HMM, to the ancestor haplotypes to generate Ns individual haplotypes via sequentially recombining segments of different ancestors according to the simulated HMM states at each locus and mutating certain ancestor SNP alleles according to the emission model. All the ancestor haplotypes were set to be 100 SNPs long. The hotspots are pre-specified at every 10th loci in the ancestor haplotypes. Overall, 30 datasets, each containing 100 individuals (i.e. 200 haplotypes) with 100 SNPs, were generated from Ks = 5 ancestor haplotypes. Since there is no extant method that can perform both structural analysis and recombination analysis, we compared our method with existing algorithms specialized for each of our tasks. For ancestral inference, we implemented three standard fixed-dimensional HMMs, with 3, 5 (the true number of ancestors for the simulation) and 10 hidden states, respectively. For recombination analysis, we selected the widely used LDhat 2.0 (Fearnhead and Donnelly, 2001) for comparison. Structure 2.1 yields a different kind of population map that is not quantitatively comparable to that from Spectrum; thus we only show empirical comparisons on real data.

4.1.1 Structural analysis
Spectrum uncovers the genetic origins of all loci of each individual haplotype in a population from Gibbs samples of the inheritance variables Formula . For each individual, we define an empirical ancestor composition vector {eta}e, which records the fractions of every ancestor in all the ci,t's of that individual. Figure 3 displays an ancestral spectrum constructed from the {eta}e's of all individuals. In this spectrum, each individual is represented by a vertical line which is partitioned into colored segments in proportion to the ancestral fraction recorded by {eta}e. Five spectrums, corresponding to (1) true ancestor compositions, (2) ancestor compositions inferred by Spectrum, and (3–5) ancestor compositions inferred by HMMs with 3, 5, 10 states, respectively, are shown in Figure 3. To assess the accuracy of our estimation, we calculated the distance between the true ancestor compositions and the estimated ones as the mean squared distance between true and estimated {eta}e over all individuals in a population, and then over all 30 simulated populations. We found that the distance between the Spectrum-derived population spectrum and the true spectrum is 0.190 Formula , whereas the distance between HMM-spectrum and true spectrum is Formula , significantly worse than that of Spectrum even though the HMM is set to have the true number of ancestral states (i.e. K = 5). Because of dimensionality incompatibility and apparent dissimilarity to the true spectrum for other HMMs (i.e. K = 3 and 10), we forgo the above quantitative comparison for these two cases.


Figure 3
View larger version (49K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. Analysis of simulated haplotype populations. The true (panel 1) and estimated (panel 2 for Spectrum, and panel 3–5 for three HMMs) population maps of ancestral compositions in a simulated population.

 
4.1.2 Recombination analysis
From the Gibbs samples of Formula , we can also infer the recombination status of each locus of each haplotype. We define the empirical recombination rates {lambda}e at each locus to be the ratio of individuals who are determined to have recombinations at that locus over the total number of haplotypes in the population. We classify a locus to be a recombination hotspot if its {lambda}e is greater than an empirical threshold {omega}, which is set to be the third quartile value of the estimated recombination rates. Alternatively, we can set {omega} to be the {lambda}e value at which the false positive rate and the false negative rate become equal in a held-off set. Due to the stochastic nature of the recombination position in our simulation, we score a correct hit of recombination hotspot if the identified hotspot based on {lambda}e-thresholding falls within a small window around the true position, and the window is set to be 0, Formula and Formula , respectively. Table 1 summarizes the results of the performance comparison, which show that Spectrum outperforms LDhat 2.0 and HMM significantly in most of the cases.


View this table:
[in this window]
[in a new window]

 
Table 1. False positive and false negative rates for recombination hotspot detection over 30 population samples. Two kinds of threshold {omega}'s are used. The results with different tolerance windows wtol are also shown

 
4.2 Analyzing real datasets
4.2.1 Population structure analysis
We analyzed the population structure of HapMap data (on the ENm010 region) based on the ancestor composition vector {eta}e. Figure 4 shows the results from Spectrum and from Structure 2.1 with different pre-determined numbers of populations K. Both algorithms successfully identified the major geographical populations grouped as CEU, YRI and HCB+JPT populations. However, the population map from Structure 2.1 does not reflect the diversity of each population or similarity between populations as mentioned earlier in this article. In contrast, the result from Spectrum reveals the relative diversity of each population clearly by showing the ancestral association fraction for each individual from shared ancestors.


Figure 4
View larger version (29K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4. Inferred population structure of HapMap four population data from Spectrum, and Structure 2.1 with different pre-specified numbers of population K.

 
For further comparison, we applied each method to the YRI population only. In Figure 5, panel (a) shows the ancestral spectrum of YRI when this population only is subject to analysis by Spectrum; and panel (b) re-displays the YRI spectrum extracted from Figure 4a, where all four populations were analyzed together. Figure 5 c and d presents the maps from Structure 2.1 applied to YRI only, under three- and five-cluster assumptions, respectively. While it is not straightforward to match (a) with (b) pictorially, both maps reveal that this population is rather diverse. On the other hand, Figure 5 c and d, both from Structure 2.1, show two very different structures from those in Figure 4, where the four populations were analyzed together. Since Structure 2.1 maps each individual locus to its origin of population (represented by a unique AP) rather than to its origin of ancestral chromosome, this result is not surprising considering the different level of details of the two (i.e. our spectrum and their map) representations. It seems that our method provides an arguably more robust and consistent way of showing the population structure in terms of origin of ancestral chromosome, which clearly illustrates the sharing of ancestors between populations, as well as the diversities of each population. It is also noteworthy that in Structure 2.1 the choice of K can significantly affect the result, and it is not always easy to choose the best K, as shown in Figure 5. In contrast, our method does not rely on a fixed number of ancestors, instead giving a flexible model for the genetic inheritance under a non-parametric Bayesian framework.


Figure 5
View larger version (79K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5. Inferred population structure of HapMap YRI population data from (a)–(b) Spectrum, and (c)–(d) Structure 2.1 with different number of clusters K.

 
Next, we analyzed the 256 individuals (i.e. 512 haplotypes) from the Daly dataset with 103 SNPs. For a more informative revelation of the underlying population structure captured by the empirical ancestor composition vector {eta}e, we clustered the individuals based on their {eta}e's and then ordered all individuals accordingly Figure 6. Specifically, all individuals were clustered into six clusters (which is an empirical choice just for illustration) using the K-means algorithm; within each group, individual orderings were determined by their distances to the cluster centroid. Interestingly, we can see that although the Daly data were reported to be from a European-derived population that is expected to be genetically less diverse, our ancestral map suggests that in this population there exists distinct sub-structures, each with a unique ancestral composition.


Figure 6
View larger version (11K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 6. The estimated population map of the Daly dataset. The ordering of all individuals in the sample population was determined by a K-means clustering with K = 6, followed by a within-cluster ordering of samples based on their distances to the cluster centroid. The black vertical bars show the K-means cluster boundaries.

 
4.2.2 Recombination analysis
For the analysis of recombination events in real datasets, rather than picking an empirical threshold, we determined the recombination hotspots as follows. We fitted the estimated {lambda}e's of all loci with a 1D mixture of Gaussians (Fig. 7). Then we used the intersection point of the two Gaussian components as the threshold for determining hotspot loci. This threshold is essentially the point where the posterior probabilities of {lambda}e being a baseline recombination rate or a hotspot recombination rate are equal. The mass in the area where the two Gaussians overlap represents the Bayes-error of loci classification under this model. One can also employ more rigorous model-based methods for hotspot classification, and we will return to this point in the discussion section.


Figure 7
View larger version (11K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 7. A mixture of Gaussian fitting of the estimated {lambda}e on HapMap data.

 
Figure 8 shows the recovered recombination rates on the ENm010 region of chromosome 7 for each population in HapMap DB. While the algorithm was run with all the populations together, according to the implications about the distinct genetic structure reflected in the ancestral map (Fig. 4), we estimated the empirical recombination rates separately for each population (i.e. CEPH, YRI and HCB+JPT) by using the posterior samples belonging to each population only. Figure 8 shows the recombination rate estimates and the detected recombination hotspots, together with the corresponding LD measurement. While each recombination pattern largely agrees with the given LD patterns, noticeably different patterns of recombination hotspots of the three groups are observed, which may reflect different recombination histories of the ancestors of these populations and the need for the population-based recombination analysis. For comparison, the result on the mixed populations are also shown together for Spectrum and LDhat 2.0 in the last column of Figure 8.


Figure 8
View larger version (30K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 8. For each population of HapMap data, the LD measure with the estimated recombination rates along the chromosomal position are shown together with the detected recombination hotspots. The last column shows the result on the mixed four populations from both Spectrum and LDhat 2.0.

 
Finally, we give the comparison of the recombination hotspot estimation on the Daly dataset with those reported in Daly et al. (2001) (which is based on an HMM employing different numbers of states at different chromosome segments) and in Anderson and Novembre (2003) [which is based on a minimal description length (MDL) principle]. In Figure 9, we show the plot of the empirical recombination rates estimated from Spectrum, side-by-side with the reported recombination hotspots. We also display the LD measurements together. Note that according to Spectrum, certain estimated recombination hotspots are very close to each other; for example, at locus 398 kb, two hotspots are right next to each other. This finding suggests that the actual LD patterns in a population sample may not simply fall into blocks with sharp boundaries universal to all individuals, as assumed in Daly's HMM model. It is more appropriate to define ‘hotspot regions’ (i.e. stretches of consecutive hotspot loci) rather than point ‘hotspot loci’, where necessary, to delineate haplotype blocks, as discussed in Li and Stephens (2003). For example, according to the estimated {lambda}e's shown in Figure 9, 15 hotspot loci/regions (represented as thick solid vertical bars in Fig. 9) were identified, and they divide the entire study region into 16 haplotype blocks of low diversity. Note that in Figure 9, the x-axis represents the actual genetic locations of the SNP loci (starting from 274 kb at the leftmost with respect to a genetic reference). Since the SNPs of interest are not located uniformly in this region, the spatial intervals as seen from Figure 9 between hotspots may not reflect the ‘lengths’ of the haplotype blocks. For example, the block between 445–518 kb contains 15 SNPs. At the same time, the seemingly longest interval between 738–877 kb contains only three SNPs, two of which have high recombination rates, which render this interval to be a hotspot region as explained below. Biologically, this is not surprising because the probability of recombination between adjacent SNPs increases with their physical distance, in addition to depending on the intrinsic recombination rate. This ‘hotspot region’ between 738–877 kb is more likely to be merely a consequence of sparse location and sampling of SNPs in this region, rather than a biologically meaningful hotspot region.


Figure 9
View larger version (47K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 9. Analysis of the Daly data. Upper panel: the LD-map of the data. Lower panel: a plot of {lambda}e estimated via Spectrum; and the haplotype block boundaries according to Spectrum (black solid line), HMM (Anderson and November, 2003) (red dotted line), and MDL (Daly et al., 2001) (blue dashed line). Note that the thickness of the black solid lines delineating the haplotype blocks is proportional to the width of the hotspot regions between adjacent blocks.

 
Table 2 summarizes the summary statistics that characterize each haplotype block (and hotspot regions). We used the threshold of 0.005 determined by the mixture of Gaussians as described above to identify recombination hotspots. The blocks were determined accordingly, with the constraint that the lengths of the identified blocks were at least three SNPs long, to avoid over-fragmenting the haplotypes. In column 1 of Table 2, the blocks with blockID starting with an ‘r’ represent the hotspot regions which contain more than two SNPs, and others represent the haplotype blocks. The number of SNPs within the blocks varied from 3 to 15 (the second column of Table 2). The actual genomic region and length of each block are shown in the third and the fourth columns, respectively. The lengths of the smallest and the biggest blocks were 1.3 and 93 kb, respectively, while the average was 22 kb. We also report the total number of distinct haplotypes as a reflection of diversity for each block, of which the most diverse is, not surprisingly, one of the largest blocks (which spans 71 kb), which contains 17 different haplotypes. This is significantly lower than the Formula possible different haplotypes one could observe had there existed no co-inheritance among loci in this block. Note that the 17 haplotypes reported here indicate the actual total observed diversity in this region among the study population, not the number of prototypes underlying these haplotypes that parsimoniously account for the majority of the observed diversity when small amounts of mutation are allowed, as reported in Daly et al. (2001). The actual demographic diversity of these blocks is much lower than that which is reflected by the total number of haplotypes, as shown by the results in columns 6–15. In columns 6–11 of Table 2, we report the ancestor association frequencies of haplotypes within each block, where the associations were directly estimated from the inheritance variable ci,t's sampled by our algorithm. We can see that, overall, six founders sufficed to fully account for our data, and indeed within each block, only 3–4 of them were significantly used. We present the number of necessary haplotypes to cover over 95 and 90 % of the entire population, which were mostly around 3 with a few blocks with higher diversity around 10.


View this table:
[in this window]
[in a new window]

 
Table 2. Haplotype block structures and the summary statistics of the blocks for the Daly data. The block boundaries correspond to the x-coordinates of the {lambda}e peaks in Figure 9

 

    5 DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 INHERITANCE MODEL
 3 MCMC INFERENCE
 4 RESULTS
 5 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
We have proposed a new Bayesian method, Spectrum, for jointly modeling genetic recombination (with mutation) and population structure. Under a pool of complete ancestral chromosomes, Spectrum describes the underlying genetic process of recombination and mutation explicitly in terms of the association between ancestors and modern individuals. By incorporating a Hidden Markov Dirichlet process prior, which facilitates a well-defined transition process between infinite ancestor spaces, the proposed method can efficiently infer a number of important genetic variables, such as recombination hotspots and ancestor patterns, jointly under a unified statistical framework.

Our model provides a new way of representing a population structure in terms of an ancestral spectrum which shows the ancestral association composition of each modern individual chromosome with the chromosomal ancestors. While the existing method based on admixture models (Falush et al., 2003) gives some degree of clear population label information, it is less informative in showing the population diversity or relationship between populations in the genetic history. In contrast, the Spectrum identifies the structure of sub-populations by considering the different ancestral association patterns among populations, in addition to displaying the diversity among individuals and populations, which yields a more informative representation for the population structure among shared ancestors across the populations.

Moreover, Spectrum allows us to recover the recombination events in each individual chromosome. Unlike other existing methods based on HMMs for recombination analysis which assume fixed recombination sites for the population and consider block-wise ancestors, we proposed a full generative model for haplotype inheritance which explicitly models the individual-level genetic recombination and mutation along the chromosome.

As of now, Spectrum does not intrinsically capture the heterogeneity of recombination rates over loci, and the recombination rates are determined by the posterior distribution of recombination events under a universal recombination rate, rather than directly by a maximum-likelihood estimation of site-specific recombination rates as in Li and Stephens (2003). Also, we have not addressed the issues of threshold calculations and confidence measures of hotspot predictions as in Li and Stephens (2003). These problems are of importance in various applications such as linkage-based quantitative trait locus mapping and disease-gene mapping. One way of addressing these issues is to explicitly introduce more recombination states (e.g. for both base-line recombination and hotspot recombination) into the infinite HMM we proposed and/or to introduce priors for site-specific recombination rates for Bayesian inference.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 INHERITANCE MODEL
 3 MCMC INFERENCE
 4 RESULTS
 5 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
This material is based upon work supported by the National Science Foundation under Grant No. 0523757 and by the Pennsylvania Department of Health's Health Research Program under Grant No. 2001NF-Cancer Health Research Grant ME-01-739. E.P.X. is also supported by a NSF CAREER Award under Grant No. DBI-0546594.

Conflict of Interest: none declared.


    FOOTNOTES
 
1 In fact, the genetic diversity of each individual is captured at a higher level by the population APs. But the AP profiles are very hard to visualize and interpret because they consist of allele frequency profiles for every locus and are independent a priori across loci. Back

2 For simplicity, we assume that the mutation rates pertaining to different ancestors follow the same prior Beta({alpha}h, ßh) Back


    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 INHERITANCE MODEL
 3 MCMC INFERENCE
 4 RESULTS
 5 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Anderson EC, Novembre J. Finding haplotype block boundaries by using the minimum-description-length principle. Am. J. Hum. Genet, ( (2003) ) 73, : 336–354.[CrossRef][ISI][Medline].

    Blackwell D, MacQueen JB. Ferguson distributions via Pólya urn schemes. Ann. Stat, ( (1973) ) 1, : 353–355..

    Daly MJ, et al. High-resolution haplotype structure in the human genome. Nat. Genet, ( (2001) ) 29, : 229–232.[CrossRef][ISI][Medline].

    Erosheva E, et al. Mixed-membership models of scientific publications. Proc. Natl Acad. Sci. USA, ( (2004) ) 101, ((Suppl. 1)): 5220–5227.[Abstract/Free Full Text].

    Excoffier L, Hamilton G. Comment on genetic structure of human populations. Science, ( (2003) ) 300, : 1877b.[Free Full Text].

    Falush D, et al. Inference of population structure: Extensions to linked loci and correlated allele frequencies. Genetics, ( (2003) ) 164, : 1567–1587.[Abstract/Free Full Text].

    Fearnhead P, Donnelly PJ. Estimating recombination rates from population genetic data. Genetics, ( (2001) ) 159, : 1299–1318.[Abstract/Free Full Text].

    Ferguson TS. A Bayesian analysis of some nonparametric problems. Ann. Stat, ( (1973) ) 1, : 209–230..

    Greenspan G, Geiger D. High density linkage disequilibrium mapping using models of haplotype block variation. Bioinformatics, ( (2004) ) 20, ((Suppl. 1)): 137–144.[CrossRef].

    International SNP Map Working Group. A map of human genome sequence variation containing 1.42 millionsingle nucleotide polymorphisms. Nature, . 409, : 928–933..

    Hoppe FM. Pólya-like urns and the ewens' sampling formula. J. Math. Biol, ( (1984) ) 20, : 91–94.[CrossRef][ISI].

    Hudson RR. Properties of a neutral allele model with intragenic recombination. Theor. Popul. Biol, ( (1983) ) 23, : 183–201.[CrossRef][ISI][Medline].

    Li N, Stephens M. Modelling linkage disequilibrium, and identifying recombination hotspots using snp data genetics. Genetics, ( (2003) ) 165, : 2213–2233.[Abstract/Free Full Text].

    Liu JS, et al. Bayesian analysis of haplotypes for linkage disequilibrium mapping. Genome Res, ( (2001) ) 11, : 1716–1724.[Abstract/Free Full Text].

    Patil N, et al. Blocks of limited haplotype diversity revealed by high-resolutionscanning of human chromosome 21. Science, ( (2001) ) 294, : 1719–1723.[Abstract/Free Full Text].

    Pritchard JK, et al. Inference of population structure using multilocus genotype data. Genetics, ( (2000) ) 155, : 945–959.[Abstract/Free Full Text].

    Rosenberg NA, et al. Genetic structure of human populations. Science, ( (2002) ) 298, : 2381–2385.[Abstract/Free Full Text].

    Stoneking M. Single nucleotide polymorphisms: From the evolutionary past. Nature, ( (2001) ) 409, : 821–822.[CrossRef][Medline].

    Tavare S, Ewens WJ. The Ewens sampling formula. Encyclopedia of Statistical Sciences, ( (1998) ) 2, : 230–234..

    Thorisson GA, et al. The international hapmap project web site. Genome Res, ( (2005) ) 15, : 1591–1593..

    Xing EP, et al. Bayesian haplotype inference via the Dirichlet process. ( (2004) ) Proceedings of the 21st International Conference on Machine Learning. New York: ACM Press. 879–886..

    Xing EP, Sohn K.-A. Hidden Markov Dirichlet process: Modeling genetic recombination in open ancestral space. Bayesian Analysis, ( (2007) ) September..

    Zhang Y, et al. A coalescence-guided hierarchical bayesian method for haplotype inference. Am. J. Hum. Genet, ( (2006) ) 79, : 313–322.[CrossRef][ISI][Medline].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow