Joint estimation of gene conversion rates and mean conversion tract lengths from population SNP data
1Computer Science Division and 2Department of Statistics, University of California, Berkeley, CA, USA
*To whom correspondence should be addressed.
ABSTRACT
Motivation: Two known types of meiotic recombination are crossovers and gene conversions. Although they leave behind different footprints in the genome, it is a challenging task to tease apart their relative contributions to the observed genetic variation. In particular, for a given population SNP dataset, the joint estimation of the crossover rate, the gene conversion rate and the mean conversion tract length is widely viewed as a very difficult problem.
Results: In this article, we devise a likelihood-based method using an interleaved hidden Markov model (HMM) that can jointly estimate the aforementioned three parameters fundamental to recombination. Our method significantly improves upon a recently proposed method based on a factorial HMM. We show that modeling overlapping gene conversions is crucial for improving the joint estimation of the gene conversion rate and the mean conversion tract length. We test the performance of our method on simulated data. We then apply our method to analyze real biological data from the telomere of the X chromosome of Drosophila melanogaster, and show that the ratio of the gene conversion rate to the crossover rate for the region may not be nearly as high as previously claimed.
Availability: A software implementation of the algorithms discussed in this article is available at http://www.cs.berkeley.edu/~yss/software.html.
Contact: yss{at}eecs.berkeley.edu
1 INTRODUCTION
A major evolutionary mechanism responsible for generating genetic variation in a population is meiotic recombination, which creates a chimeric genome from the two homologous genomes of an individual. Two known types of meiotic recombination are crossovers and gene conversions, which are typically modeled as follows. Both events involve taking two equal-length parental sequences to produce a descendant sequence of the same length. In a crossover event, the descendant sequence consists of some prefix of one of the parental sequences, followed by a suffix of the other parental sequence. In a gene conversion event, the descendant sequence is formed by copying a short segment (called a conversion tract) starting at a particular position in one of the parental sequences to the same position in the other parental sequence. Hence, the typical pattern created by gene conversion is: a prefix of sequence h followed by a short internal fragment of a sequence h', which is then followed by a suffix of the first sequence h. It is believed that the conversion tract typically ranges between 50 bp and 2000 bp (Hilliker et al., 1994; Jeffreys and May, 2004).
Although crossovers and gene conversions have different effects on the evolutionary history of chromosomes and therefore leave behind different footprints in the genome, it is a challenging task to tease apart their relative contributions to the observed genetic variation. For example, the methods employed in recent studies (Crawford et al., 2004; International HapMap Consortium, 2005; Myers et al., 2005) of recombination rate variation in the human genome actually capture combined effects of crossovers and gene conversions.
Studying gene conversion is important for a number of reasons, a few of which we mention below. First, in several organisms—e.g. humans (Frisse et al., 2001; Pritchard and Przeworski, 2001) and Drosophila melanogaster (Langley et al., 2000)—gene conversion has been shown to be necessary to explain the observed pattern of linkage disequilibrium (LD), i.e. the statistical non-independence of alleles at different loci. Second, it has been argued that ignoring gene conversion may cause problems in association studies (Wall, 2004a) and linkage analysis (Mancera et al., 2008). Third, methods for detecting signatures of natural selection usually require estimates of fine-scale recombination rates (see, e.g. Voight et al. 2006), and their success may hinge on having reliable estimates of crossover and gene conversion rates, as well as the distribution of the conversion tract length. Lastly, gene conversion also plays an important role in molecular evolution. Biased gene conversion is believed to be a significant source of biases in substitution, and variation in biased gene conversion effects appears to be partially responsible for variation in substitution patterns across the mammalian phylogeny (Hwang and Green, 2004).
Gene conversion rate variation in the human genome is currently not well understood, though a recent sperm-typing study (Jeffreys and May, 2004) of the major histocompatibility complex region suggests that the rate of gene conversion can be about 5–15 times higher than that of crossover. Gene conversion has been hard to study in populations because of the lack of fine-scale data. However, the genomic resequencing data to be produced over the next several years will allow us to quantify the fundamental parameters of gene conversion. Therefore, algorithmic and statistical tools to study gene conversion are becoming increasingly important.
Song et al. (2007) recently developed algorithms to distinguish the role of gene conversion from crossover in the derivation of SNP sequences in a population. Their method can produce an explicit evolutionary history of the input sequences using mutations and recombinations (crossovers and gene conversions), but it cannot produce estimates of recombination parameters. The parameters fundamental to recombination are the crossover rate, the gene conversion rate and the mean conversion tract length—the conversion tract length is often assumed to follow a geometric distribution (Wiuf and Hein, 2000), in which case the mean completely specifies the distribution. Joint estimation of all three parameters is widely viewed as a very difficult problem. There currently exist several statistical methods (reviewed in Section 2) that can jointly estimate crossover and gene conversion rates, but all existing methods, with the only exception being the recent work of Gay et al. (2007), cannot estimate the mean conversion tract length at the same time.
To obtain accurate parameter estimates, it is crucial to make full use of data, and that is exactly what Gay et al. (2007) aimed to achieve in their work. Specifically, they constructed a likelihood-based method by incorporating gene conversion into a popular framework called the Product of Approximate Conditionals (PAC), first proposed by Li and Stephens (2003) to estimate crossover rates only. The work of Gay et al. marks important progress towards developing practical tools for studying gene conversion.
The goal of this article is to improve on the work of Gay et al. (2007) by introducing modifications to the model which we show are crucial to make the joint estimation of all three parameters feasible. Briefly, Gay et al. disallowed overlapping gene conversions in their model, for computational simplicity. We show that this simplification frequently leads to gross errors in the estimation of the gene conversion rate and the mean conversion tract length, when all three parameters are being estimated. In their article, Gay et al. did not try to estimate the mean conversion tract length, but always fixed it to some reasonable value (actually, the true value in the case of simulation study). Therefore, they did not encounter this problem when testing their method. In this article, we devise algorithms to incorporate overlapping gene conversions into the PAC model and show that this modification dramatically improves the estimation of the gene conversion rate and the mean conversion tract length.
To test the performance of our method, we carry out a simulation study. We then apply our method to analyze real biological data from the telomere of the X chromosome of D.melanogaster, and show that the ratio of the gene conversion rate to the crossover rate for the region may not be nearly as high as it was claimed to be by Gay et al. (2007).
2 PREVIOUS METHODS
We briefly review previous work on estimating recombination parameters. Throughout this article, the population-scaled crossover and gene conversion rates are denoted by $\rho = 4N_e c$ and $\gamma = 4N_e g$, respectively, where $N_e$ is the effective population size, $c$ is the per-generation probability of crossover per unit distance (kilobase in this article) and $g$ is the per-generation probability of initiating a gene conversion per unit distance. The conversion tract length is assumed to follow a geometric distribution, and $\lambda$ denotes the mean of that distribution.
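The geometric tract-length assumption can be illustrated with a minimal sketch. The function name `sample_tract_length` and the choice to measure lengths in whole base pairs are assumptions made for illustration only; they are not taken from the authors' software.

```python
import math
import random


def sample_tract_length(mean_bp, rng):
    """Draw a tract length (in bp) from a geometric distribution on {1, 2, ...}.

    A geometric distribution with success probability p has mean 1/p, so the
    mean alone fixes the whole distribution.
    """
    if mean_bp <= 1:
        return 1
    p = 1.0 / mean_bp
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    # Inverse-CDF sampling: P(length > n) = (1 - p)^n
    return max(1, math.ceil(math.log(u) / math.log(1.0 - p)))


rng = random.Random(1)
lengths = [sample_tract_length(500, rng) for _ in range(20000)]
sample_mean = sum(lengths) / len(lengths)  # should be close to 500 bp
```

With a mean of 500 bp (0.5 kb), most sampled tracts are short, but the distribution has a long right tail.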
2.1 An overview of previous work
There exist several statistical methods for estimating gene conversion rates from population genetic data. Padhukasahasram et al. (2006) suggested using multiple summary statistics from SNP data to estimate crossover and gene conversion rates jointly. This approach makes only partial use of the information in the data.
The methods proposed by Frisse et al. (2001), Ptak et al. (2004) and Wall (2004b) generalize the composite-likelihood approach of Hudson (2001). Briefly, these methods break up the dataset into smaller subsets (pairs or triplets of segregating sites), compute the likelihoods (as functions of $\rho$ and $\gamma$, with $\lambda$ fixed) for the subsets, and then multiply those likelihoods together to form a composite likelihood. The point estimates of $\rho$ and $\gamma$ are obtained by maximizing the composite likelihood over a suitably chosen finite grid. These methods do not take into account the dependence between the smaller subsets.
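The grid maximization described above can be sketched as follows. The helper `subset_loglik` is a hypothetical stand-in for the two-site or three-site likelihood computation of the cited methods, which is not reproduced here; the toy function in the usage example exists only to make the sketch runnable.

```python
import itertools


def composite_mle(subsets, subset_loglik, rho_grid, gamma_grid):
    """Maximize a composite log-likelihood over a finite (rho, gamma) grid.

    The composite log-likelihood is the sum of the subset log-likelihoods,
    i.e. the subsets are treated as if they were independent.
    Returns (best_value, best_rho, best_gamma).
    """
    best = None
    for rho, gamma in itertools.product(rho_grid, gamma_grid):
        total = sum(subset_loglik(s, rho, gamma) for s in subsets)
        if best is None or total > best[0]:
            best = (total, rho, gamma)
    return best


# Toy usage: each "subset" carries the parameter values it prefers.
toy_subsets = [(1.0, 2.0), (1.0, 2.0)]
toy_loglik = lambda s, rho, gamma: -(rho - s[0]) ** 2 - (gamma - s[1]) ** 2
best_value, best_rho, best_gamma = composite_mle(
    toy_subsets, toy_loglik, [0.5, 1.0, 2.0], [0.5, 1.0, 2.0]
)
```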
Assuming that each gene conversion tract contains a single SNP, Hellenthal (2006) incorporated gene conversion into the PAC framework, originally proposed by Li and Stephens (2003) to estimate crossover rates only. Gay et al. (2007) later generalized this approach to allow for an arbitrary conversion tract length, and their method can be used to estimate $\rho$, $\gamma$ and $\lambda$ jointly from SNP data. The main advantage of these likelihood-based approaches is that they improve the statistical efficiency of the estimates by utilizing as much of the information in the data as possible. The work of Gay et al., further detailed below, is most relevant to our own work.
2.2 The PAC model with gene conversion
The PAC model is motivated by the coalescent (Kingman, 1982) and its generalization to include recombination (Hudson, 1983). The main idea of the model is to relate the observed pattern of LD directly to the underlying recombination processes.
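Given a set $H = \{h_1, \ldots, h_n\}$ of haplotypes sampled from a population, the probability of observing $H$ given $\rho$, $\gamma$ and $\lambda$ can be decomposed as

$$\mathbb{P}(h_1, \ldots, h_n \mid \rho, \gamma, \lambda) = \mathbb{P}(h_1 \mid \rho, \gamma, \lambda)\, \mathbb{P}(h_2 \mid h_1, \rho, \gamma, \lambda) \cdots \mathbb{P}(h_n \mid h_1, \ldots, h_{n-1}, \rho, \gamma, \lambda). \tag{1}$$

The PAC approach replaces each exact conditional probability on the right-hand side with an approximate conditional $\hat{\pi}$, thus obtaining the following approximation for the joint probability:

$$L_{\mathrm{PAC}}(\rho, \gamma, \lambda) = \hat{\pi}(h_1 \mid \rho, \gamma, \lambda)\, \hat{\pi}(h_2 \mid h_1, \rho, \gamma, \lambda) \cdots \hat{\pi}(h_n \mid h_1, \ldots, h_{n-1}, \rho, \gamma, \lambda). \tag{2}$$

We refer to $L_{\mathrm{PAC}}(\rho, \gamma, \lambda)$ as the PAC likelihood. The goal is to estimate $\rho$, $\gamma$ and $\lambda$ under the framework of maximum likelihood estimation (MLE), using $L_{\mathrm{PAC}}$ as a surrogate function for the original intractable likelihood function (1).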
By exchangeability, the value of the right-hand side of (1) is invariant under a permutation of the haplotype indices $1, \ldots, n$. However, because the approximate conditionals $\hat{\pi}$ in (2) are not exact, the PAC likelihood $L_{\mathrm{PAC}}$ does depend on the order of haplotypes being considered. To account for this lack of exchangeability, Li and Stephens (2003) suggested averaging the PAC likelihood over several (say, between 10 and 20) random permutations of the input haplotypes.
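The product of approximate conditionals and the averaging over random orderings can be sketched as below. The callable `cond_logprob(h, previous)` is a hypothetical placeholder for the log of the approximate conditional of one haplotype given the earlier ones; the constant function in the usage example is a toy, not a real model.

```python
import math
import random


def pac_log_likelihood(haplotypes, cond_logprob):
    """Log PAC likelihood for one ordering: sum of log approximate conditionals."""
    return sum(
        cond_logprob(haplotypes[k], haplotypes[:k]) for k in range(len(haplotypes))
    )


def averaged_pac_log_likelihood(haplotypes, cond_logprob, n_perms=10, seed=0):
    """Log of the PAC likelihood averaged over random haplotype orderings."""
    rng = random.Random(seed)
    logs = []
    for _ in range(n_perms):
        order = list(haplotypes)
        rng.shuffle(order)
        logs.append(pac_log_likelihood(order, cond_logprob))
    # Average the likelihoods (not the log-likelihoods) using log-sum-exp.
    m = max(logs)
    return m + math.log(sum(math.exp(x - m) for x in logs) / len(logs))


toy_haps = ["0101", "0011", "1100"]
toy_value = averaged_pac_log_likelihood(toy_haps, lambda h, prev: -1.0, n_perms=5)
```

Fixing the seed keeps the same permutations across calls, which matters when two methods or two parameter values are compared on the same orderings.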
The approximate conditional $\hat{\pi}(h_{k+1} \mid h_1, \ldots, h_k)$ is constructed by assuming that haplotype $h_{k+1}$ is an imperfect mosaic of the first $k$ haplotypes. That is, $h_{k+1}$ is obtained by copying segments from $h_1, \ldots, h_k$; a crossover or a gene conversion can change the haplotype from which copying is performed. Furthermore, copying can be imperfect, corresponding to mutation. See Figure 1 for an illustration. The copying process proceeds along the sequence from one end to the other, and it is assumed to be Markovian. This process can easily be modeled as a hidden Markov model (HMM) (Rabiner, 1989).
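To compute $\hat{\pi}(h_{k+1} \mid h_1, \ldots, h_k)$, Gay et al. (2007) used a factorial HMM (Ghahramani and Jordan, 1997) with two hidden chains, $X$ and $G$, summing over all hidden-state paths:

$$\hat{\pi}(h_{k+1} \mid h_1, \ldots, h_k) = \sum_{X} \sum_{G} \mathbb{P}(h_{k+1}, X, G \mid h_1, \ldots, h_k), \tag{3}$$

where $X_j \in \{1, \ldots, k\}$ and $G_j \in \{\emptyset, 1, \ldots, k\}$ are hidden states. The states $X_j$ and $G_j$ jointly determine the index $c_j$ of the haplotype from which $h_{k+1,j}$ (the allele at the $j$-th site of $h_{k+1}$) is copied: if $G_j = \emptyset$ (the null state, which indicates that the $j$-th site is not in a gene conversion tract), then $c_j = X_j$; otherwise, $c_j = G_j$. To capture the imperfect nature of the copying process resulting from mutation, the emission probability of the HMM is set up as follows: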
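[Equation (4)]

where $\theta/L$ is the rate of mutation per site. If $\theta$ is not specified, it is estimated by using Watterson's unbiased estimator (Watterson, 1975):

$$\hat{\theta}_W = \frac{S}{\sum_{m=1}^{n-1} 1/m}, \tag{5}$$

where $S$ is the number of segregating sites in the sample of $n$ haplotypes.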
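Watterson's estimator divides the number of segregating sites by the harmonic number $\sum_{m=1}^{n-1} 1/m$. A minimal sketch (the function name is chosen here for illustration):

```python
def watterson_theta(num_segregating_sites, num_haplotypes):
    """Watterson's estimator: S divided by the (n-1)-th harmonic number."""
    if num_haplotypes < 2:
        raise ValueError("need at least two haplotypes")
    a_n = sum(1.0 / m for m in range(1, num_haplotypes))
    return num_segregating_sites / a_n


# Example: 3 segregating sites in 3 haplotypes gives 3 / (1 + 1/2) = 2.0
theta_example = watterson_theta(3, 3)
```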
As in the original PAC model of Li and Stephens (2003), crossover is modeled as a Poisson process with rate $\rho$ across the sequence. The transition probability of the $X$ chain has only two distinct cases, depending on whether the hidden states of adjacent sites are the same or not:

[Equation (6)]
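As an illustration of a two-case transition of this kind, the sketch below uses the form given by Li and Stephens (2003), in which the copying process switches at rate $\rho/k$ per unit distance and, after a switch, chooses uniformly among the $k$ haplotypes. That this is exactly the form intended here is an assumption; the argument names (`d` for the distance between adjacent sites) are also illustrative.

```python
import math


def x_transition(x_next, x_prev, rho, d, k):
    """Li-Stephens-style transition probability (assumed form).

    With probability exp(-rho * d / k) no switch occurs; otherwise the new
    copying source is drawn uniformly from the k haplotypes.
    """
    stay = math.exp(-rho * d / k)
    jump = (1.0 - stay) / k
    return stay + jump if x_next == x_prev else jump


row = [x_transition(x, 1, rho=0.5, d=2.0, k=4) for x in range(1, 5)]
```

Because only two distinct values occur in each row, the forward recursion for this chain can be carried out in time linear in $k$ per site rather than quadratic.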
The transition probability of the $G$ chain is more complicated. By assuming that the conversion tract length follows a geometric distribution, both initiation and termination of a conversion tract are modeled as Poisson processes along the sequence, with rates $\gamma$ and $1/\lambda$, respectively. Gay et al. used $\lambda$ (not $1/\lambda$) to denote the termination rate and assumed that the termination process goes on all the time, even when the copying process is not in a gene conversion state. Further, they made the additional assumption that conversion tracts from different gene conversion events cannot overlap. For example, consider the following probability of moving from state $g \in \{1, \ldots, k\}$ to state $g' \in \{1, \ldots, k\}$, where $g \neq g'$:

[Equation (7)]
Lastly, the initial probability of the $G$ chain depends on how the rate of starting a gene conversion tract compares with the rate of ending one, i.e.

[Equation: initial probability of the $G$ chain]
In the above HMM formulation, it is straightforward to compute the conditional probability $\hat{\pi}(h_{k+1} \mid h_1, \ldots, h_k)$ by using the standard forward–backward algorithm.
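For the likelihood alone, only the forward pass of the forward–backward algorithm is needed. The following is a generic sketch with per-site rescaling to avoid underflow; the callables `trans` and `emit` are hypothetical interfaces, and the two-state model in the usage example is a toy rather than the copying model itself.

```python
import math


def forward_log_likelihood(states, init, trans, emit, obs):
    """Log-likelihood of `obs` under an HMM, via the scaled forward pass.

    init[s]          : initial probability of state s
    trans(j, s, t)   : probability of moving from s at site j to t at site j + 1
    emit(s, j, o)    : probability of observing o at site j in state s
    """
    alpha = {s: init[s] * emit(s, 0, obs[0]) for s in states}
    log_lik = 0.0
    for j in range(1, len(obs)):
        scale = sum(alpha.values())
        log_lik += math.log(scale)
        alpha = {
            t: emit(t, j, obs[j])
            * sum(alpha[s] / scale * trans(j - 1, s, t) for s in states)
            for t in states
        }
    return log_lik + math.log(sum(alpha.values()))


toy_ll = forward_log_likelihood(
    states=[0, 1],
    init={0: 0.5, 1: 0.5},
    trans=lambda j, s, t: 0.5,
    emit=lambda s, j, o: 0.9 if s == o else 0.1,
    obs=[0, 1, 1],
)
```

In the toy model the hidden states are independent and uniform, so every site has marginal probability 0.5 and the result equals $3 \log 0.5$.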
3 OUR MODEL
As described above, the work of Gay et al. (2007) assumes that crossovers and gene conversions are independent, and that gene conversion tracts cannot overlap. In this section, we construct a new model that couples the crossover and gene conversion processes. We then describe how overlapping gene conversions can be incorporated into the model.
3.1 Interleaved HMM
By assuming independence of the two hidden chains, the factorial HMM formulation of Gay et al. (2007) cannot model the typical alternating pattern of gene conversion, i.e. a prefix of haplotype $h$ followed by an internal fragment of a haplotype $h'$, which is then followed by a suffix of the first haplotype $h$. To remedy this, we couple the two hidden chains by using an interleaved HMM, illustrated in Figure 2b. Directed edges from the $G$ chain to the $X$ chain constrain the $X$ chain to stay in its previous state whenever the $G$ chain is active. More precisely,

$$\mathbb{P}(X_{j+1} = x' \mid X_j = x, G_{j+1}) = \begin{cases} \delta_{x,x'}, & \text{if } G_{j+1} \neq \emptyset, \\ \mathbb{P}(X_{j+1} = x' \mid X_j = x), & \text{if } G_{j+1} = \emptyset, \end{cases} \tag{8}$$

where $\delta_{x,x'}$ equals 1 if $x = x'$ and 0 otherwise, and $\mathbb{P}(X_{j+1} \mid X_j)$ in the second line is the same as in (6). If site $j + 1$ is in a conversion tract (i.e. $G_{j+1} \neq \emptyset$), the $G$ chain is active and the copying process keeps track of the previous state of the $X$ chain (i.e. $X_{j+1} = X_j$). If $G_{j+1} = \emptyset$, the $X$ chain evolves according to the usual transition probability $\mathbb{P}(X_{j+1} \mid X_j)$. We point out that coupling the two hidden chains does not increase the complexity of the forward–backward computation. Even in the factorial HMM, the two hidden chains become dependent upon conditioning on the observed variables. Therefore, the computational complexity is the same for both HMMs.
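The coupling rule can be written as a thin wrapper around any uncoupled $X$-chain transition. In this sketch the null state is represented by Python's `None`, and `x_transition` is a hypothetical callable supplied by the caller; both are illustrative choices.

```python
def coupled_x_transition(x_next, x_prev, g_next, x_transition):
    """Transition of the X chain in the interleaved HMM.

    If the next site lies in a conversion tract (g_next is not None), the
    X chain is frozen in its previous state; otherwise it follows the usual
    uncoupled transition probability.
    """
    if g_next is not None:
        return 1.0 if x_next == x_prev else 0.0
    return x_transition(x_next, x_prev)


# Toy uncoupled transition for two haplotypes.
toy_x = lambda x_next, x_prev: 0.7 if x_next == x_prev else 0.3
frozen = coupled_x_transition(2, 2, g_next=5, x_transition=toy_x)
blocked = coupled_x_transition(1, 2, g_next=(3, 5), x_transition=toy_x)
free = coupled_x_transition(1, 2, g_next=None, x_transition=toy_x)
```

Freezing $X$ during a tract is what lets the copying process return to the original haplotype once the tract ends, producing the alternating pattern described above.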
3.2 Modeling overlapping gene conversions
The key new feature of our model is that it allows for overlapping gene conversion events in the copying process. This means that the copying process does not need to terminate a gene conversion event before initiating another gene conversion event.
Figure 3 shows two examples of genealogies that can generate overlapping gene conversion tracts in the coalescent model with gene conversion (Wiuf and Hein, 2000). In Figure 3a, two gene conversion events have conversion tracts that overlap partially, while in Figure 3b, one conversion tract is entirely nested inside the other conversion tract.
Motivated by the common belief that the conversion tract length is typically short, between 50 bp and 2000 bp (Hilliker et al., 1994; Jeffreys and May, 2004), we restrict each overlap to involve only a pair of gene conversion events, although a generalization to more than two gene conversion events can easily be achieved at the expense of more computation time. In terms of the underlying HMM, we augment the state space of the $G$ chain as follows. When computing $\hat{\pi}(h_{k+1} \mid h_1, \ldots, h_k)$, we add doublet states $(g, g')$, with $g, g' \in \{1, \ldots, k\}$, to the state space $\{\emptyset, 1, \ldots, k\}$ considered in Gay et al.'s model. If $G_j = (g, g')$, then site $j$ of haplotype $h_{k+1}$ is within a region of overlapping gene conversion events involving two haplotypes $h_g$ and $h_{g'}$. The second entry $g'$ in a doublet state $(g, g')$ is said to be active, and it indicates that the conversion tract from $h_{g'}$ overwrites the conversion tract from $h_g$ at marker $j$ of $h_{k+1}$. In Figure 3a, $g$ is active in the region of overlapping gene conversions, while in Figure 3b $g'$ is active in the region of overlap. As in Gay et al.'s model, the hidden states $X_j \in \{1, \ldots, k\}$ and $G_j$ jointly determine the index $c_j$ of the haplotype from which $h_{k+1,j}$ is copied. In our model,

$$c_j = \begin{cases} X_j, & \text{if } G_j = \emptyset, \\ g, & \text{if } G_j = g, \\ g', & \text{if } G_j = (g, g'). \end{cases}$$
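The augmented state space and the copying rule can be sketched as follows. Representing the null state by `None`, singlets by integers and doublets by 2-tuples is an encoding chosen for this illustration.

```python
def augmented_g_states(k):
    """All states of the augmented G chain: null, k singlets, k*k doublets."""
    singlets = list(range(1, k + 1))
    doublets = [(g, h) for g in singlets for h in singlets]
    return [None] + singlets + doublets


def copied_index(x, g):
    """Index of the haplotype copied at a site, given states X = x and G = g."""
    if g is None:              # not in a conversion tract: copy from X
        return x
    if isinstance(g, tuple):   # overlapping tracts: the second entry is active
        return g[1]
    return g                   # single tract: copy from g


states_k3 = augmented_g_states(3)
```

For $k = 3$ this gives $1 + 3 + 9 = 13$ states, including the degenerate doublets $(g, g)$.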
3.3 Transition probabilities for the augmented G chain
We now describe the transition probabilities $\mathbb{P}(G_{j+1} = s' \mid G_j = s)$ for the augmented $G$ chain in the computation of $\hat{\pi}(h_{k+1} \mid h_1, \ldots, h_k)$. Instead of using the formulation described in (7), which implicitly allows for infinitely many gene conversion events between two adjacent sites, we explicitly enumerate all valid paths of events, where a path is valid if (i) it starts in state $s$ and ends in state $s'$, and (ii) it contains at most $a$ initiations and $b$ terminations of gene conversions. In our implementation, we use $a = b = 1$ for simplicity, but it is straightforward to consider larger values of $a$ and $b$ without increasing the asymptotic complexity of the forward–backward algorithm in our HMM.
For $a = b = 1$, the path $(g, g') \to g' \to (g', g'')$ is valid, since it contains exactly one initiation event and one termination event. In contrast, the path $g \to \emptyset \to g' \to (g, g')$ is not valid, since it contains two initiation events.
For a given pair of states $s$, $s'$ of the $G$ chain (and for given values of $a$ and $b$), all valid paths starting in $s$ and ending in $s'$ can be enumerated using dynamic programming. We use $\Omega_{s,s'}$ to denote the set of all such valid paths. To compute the probability $\mathbb{P}(\omega)$ for a given path $\omega \in \Omega_{s,s'}$, we make the following assumptions:
- Instead of allowing the termination process to run all the time, as Gay et al. (2007) assume, we assume that no termination event can occur if the current state in $\omega$ is the $\emptyset$ state.
- If the current state in $\omega$ is a singlet $g$, then an initiation event uniformly chooses $g' \in \{1, \ldots, k\}$ and creates either $(g, g')$ or $(g', g)$ with equal probability; the termination process has rate $1/\lambda$.
- If the current state in $\omega$ is a doublet $(g, g')$, then no initiation can occur, since we assume only pairwise overlaps of gene conversions. The termination process has rate $2/\lambda$, and when a termination event occurs, one makes a transition from $(g, g')$ to either $g$ or $g'$ with equal probability.
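The rules listed above determine which single events are possible from each state, so the valid paths can be enumerated mechanically. The sketch below uses a plain depth-first search rather than dynamic programming, and it additionally assumes that an initiation from the null state leads to a singlet and that a termination from a singlet leads to the null state. States are encoded as `None` (null), integers (singlets) and 2-tuples (doublets); all function names are illustrative.

```python
def successors(state, k):
    """Yield (next_state, event) pairs reachable by one event from `state`."""
    everyone = range(1, k + 1)
    if state is None:                    # null state: initiation only
        for g in everyone:
            yield g, "init"
    elif isinstance(state, tuple):       # doublet: termination only
        for g in set(state):
            yield g, "term"
    else:                                # singlet: either event
        yield None, "term"
        for g in everyone:
            for nxt in {(state, g), (g, state)}:
                yield nxt, "init"


def valid_paths(start, end, k, a=1, b=1):
    """All paths from `start` to `end` with at most a initiations, b terminations."""
    found = set()

    def walk(path, inits, terms):
        if path[-1] == end:
            found.add(path)
        for nxt, event in successors(path[-1], k):
            ni = inits + (event == "init")
            nt = terms + (event == "term")
            if ni <= a and nt <= b:
                walk(path + (nxt,), ni, nt)

    walk((start,), 0, 0)
    return found


paths_1_to_2 = valid_paths(1, 2, k=3)
```

With $a = b = 1$, the search finds exactly three paths from singlet 1 to singlet 2: through the null state, through $(1, 2)$ and through $(2, 1)$.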
With the above assumptions, $\mathbb{P}(\omega)$ can be computed by integrating over all possible positions along the sequence where the events in $\omega$ can happen. In contrast, recall that Gay et al. only integrate over the position of the last termination event. It turns out that the main computation involves a symbolic convolution of exponential functions, which can be easily evaluated. The transition probability $\mathbb{P}(G_{j+1} = s' \mid G_j = s)$ can be obtained by adding up the probabilities of all valid paths in $\Omega_{s,s'}$ and then normalizing to make sure that the outgoing probabilities sum to 1, that is,

$$\mathbb{P}(G_{j+1} = s' \mid G_j = s) = \frac{\sum_{\omega \in \Omega_{s,s'}} \mathbb{P}(\omega)}{\sum_{s''} \sum_{\omega \in \Omega_{s,s''}} \mathbb{P}(\omega)}.$$
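The normalization step can be sketched independently of how the path probabilities are computed. Here `path_mass[s][t]` is assumed to already hold the summed probability of all valid paths from state `s` to state `t`; the numbers in the usage example are made up.

```python
def normalize_rows(path_mass):
    """Turn summed path probabilities into transition probabilities.

    path_mass[s][t] is the total probability of valid paths from s to t.
    Each row is rescaled so that the outgoing probabilities sum to 1.
    """
    transition = {}
    for s, row in path_mass.items():
        total = sum(row.values())
        transition[s] = {t: mass / total for t, mass in row.items()}
    return transition


toy_mass = {"null": {"null": 0.90, "single": 0.06}, "single": {"null": 0.2, "single": 0.6}}
toy_transition = normalize_rows(toy_mass)
```

Normalization is needed because truncating to at most $a$ initiations and $b$ terminations discards some probability mass, so the raw row sums fall below 1.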
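As a concrete example, consider the transition probability $\mathbb{P}(G_{j+1} = g' \mid G_j = g)$, where $g, g' \in \{1, \ldots, k\}$ and $g \neq g'$. For $a = b = 1$, $\Omega_{g,g'}$ contains three valid paths, namely $\omega_1 = g \to \emptyset \to g'$, $\omega_2 = g \to (g, g') \to g'$ and $\omega_3 = g \to (g', g) \to g'$. The probability of $\omega_1$ is given by

[Equation: $\mathbb{P}(\omega_1)$]

In a similar vein, one can show that the probabilities $\mathbb{P}(\omega_2)$ and $\mathbb{P}(\omega_3)$ are given by

[Equation: $\mathbb{P}(\omega_2)$ and $\mathbb{P}(\omega_3)$]

The transition probability $\mathbb{P}(G_{j+1} = g' \mid G_j = g)$ is proportional to $\mathbb{P}(\omega_1) + \mathbb{P}(\omega_2) + \mathbb{P}(\omega_3)$. Table 1 lists the transition probabilities in the $G$ chain of our implementation with $a = b = 1$. In the table, $g$, $g'$ and $g''$ denote distinct elements of $\{1, \ldots, k\}$.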
3.4 Initial probabilities of the G chain
We wish to use the stationary distribution of the transition matrix of the $G$ chain as the initial probability at the first SNP site. However, in the computation of $\hat{\pi}(h_{k+1} \mid h_1, \ldots, h_k)$, the augmented $G$ chain has $k^2 + k + 1$ states: the null state $\emptyset$, $k$ singlet states $(g)$, $k$ degenerate doublet states $(g, g)$ and $k^2 - k$ non-degenerate doublet states $(g, g')$, where $g \neq g'$. Finding an eigenvector of that transition matrix could be computationally expensive for moderate values of $k$. Therefore, we make the following approximation: we collapse the transition matrix to a $4 \times 4$ matrix, whose rows and columns are indexed by null, singlet, degenerate doublet and non-degenerate doublet. Each entry in the collapsed matrix is obtained by summing over the corresponding entries in the original transition matrix. We find the left eigenvector $v = (v_0, v_1, v_2, v_3)$ of the collapsed matrix with eigenvalue 1. Then, for $g, g' \in \{1, \ldots, k\}$, where $g \neq g'$, the initial probabilities of the $G$ chain are specified as

[Equation: initial probabilities of the $G$ chain in terms of $v_0, \ldots, v_3$]
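A left eigenvector with eigenvalue 1 of a small row-stochastic matrix can be found in several ways. The sketch below uses power iteration, a substitute chosen here for simplicity instead of a general eigen-solver. The second function spreads each class probability uniformly over the states in that class; this uniform split is an assumption made for illustration, not a formula taken from the text.

```python
def stationary_distribution(P, tol=1e-13, max_iter=100_000):
    """Left eigenvector v (v = vP, entries summing to 1) by power iteration."""
    n = len(P)
    v = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(x - y) for x, y in zip(nxt, v)) < tol:
            return nxt
        v = nxt
    return v


def expand_initial_probs(v, k):
    """Hypothetical expansion: split each class mass evenly among its states."""
    v0, v1, v2, v3 = v
    init = {None: v0}
    for g in range(1, k + 1):
        init[g] = v1 / k                       # singlet states
        init[(g, g)] = v2 / k                  # degenerate doublets
        for h in range(1, k + 1):
            if h != g:
                init[(g, h)] = v3 / (k * k - k)  # non-degenerate doublets
    return init


toy_v = stationary_distribution([[0.9, 0.1], [0.5, 0.5]])
toy_init = expand_initial_probs((0.7, 0.2, 0.04, 0.06), k=3)
```

Working with the collapsed matrix keeps this step at constant cost regardless of $k$.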
3.5 Complexity of the algorithm
Since the augmented HMM has $O(k^3)$ states when computing $\hat{\pi}(h_{k+1} \mid h_1, \ldots, h_k)$, a naive implementation of the forward–backward algorithm takes $O(k^6 L)$ time, where $L$ is the number of polymorphic sites in the input data (i.e. the length of each haplotype). Hence, the computational complexity of the PAC likelihood $L_{\mathrm{PAC}}$ (for fixed parameters $\rho$, $\gamma$, $\lambda$) in our model is $O(n^7 L)$, where $n$ is the total number of input haplotypes. However, by exploiting the sparsity and regularity of transition probabilities, we can use algorithmic shortcuts to reduce the complexity to $O(n^4 L)$. As in Gay et al.'s method, we use a standard derivative-free optimization procedure to find the maximum likelihood estimates of $\rho$, $\gamma$ and $\lambda$ based on $L_{\mathrm{PAC}}$.
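The text does not say which derivative-free procedure is used. As one simple example of the family, the sketch below implements a compass (coordinate pattern) search that maximizes an objective using function values only; the quadratic objective in the usage example is a toy stand-in for the PAC log-likelihood.

```python
def compass_search(f, x0, step=1.0, tol=1e-6, max_iter=10_000):
    """Maximize f by compass search: probe +/- step along each coordinate.

    The step is halved whenever no probe improves the objective, and the
    search stops once the step falls below `tol`.
    """
    x = list(x0)
    fx = f(x)
    for _ in range(max_iter):
        if step < tol:
            break
        improved = False
        for i in range(len(x)):
            for delta in (step, -step):
                y = list(x)
                y[i] += delta
                fy = f(y)
                if fy > fx:
                    x, fx, improved = y, fy, True
        if not improved:
            step /= 2.0
    return x, fx


toy_objective = lambda p: -(p[0] - 1.0) ** 2 - (p[1] - 2.0) ** 2
toy_argmax, toy_max = compass_search(toy_objective, [0.0, 0.0])
```

Since rates and tract lengths must be positive, one option is to run such a search over the logarithms of the parameters.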
4 RESULTS
In this section, we summarize the performance of our method on simulated data and then consider a real biological application. In both cases, we compare our method with GenCo, the method developed by Gay et al. (2007).
4.1 Simulation study
To test the performance of our method, we used Hudson's (2002) coalescent simulation program MS to generate simulated datasets. In general, it is possible that the evolutionary history of a particular region R in a genome involves gene conversions with one end of the conversion tract falling outside R and the other end falling within R. To account for such events, we simulated a 30 kb region and then discarded 5 kb from each end. In all simulations, we used $\theta = 1.0$/kb for the mutation rate and $\lambda = 0.5$ kb for the mean conversion tract length, both of which are relevant to humans [see Ptak et al. (2004) and Frisse et al. (2001), respectively]. For each dataset, GenCo and our method were each run 10 times, taking 20 random permutations of haplotype order in each iteration. The same permutations were used in the two methods. In the first iteration, both GenCo and our method started the optimization procedure at the true values of $\rho$, $\gamma$ and $\lambda$, while in the subsequent iterations, the maximum likelihood estimates from the previous iteration were used as initial values. For the crossover rate, we used $\rho = 0.5$ or 1.0/kb, while for the gene conversion rate, we used $\gamma = 0.5$, 1.0 or 2.5/kb. For each parameter setting, we generated 100 simulated datasets, each with 20 haplotypes. For each simulated dataset, we estimated all three parameters $\rho$, $\gamma$ and $\lambda$, while $\theta$ was set to Watterson's estimate (5). Table 2 summarizes the performance results. The estimate columns display the mean and SD (shown in parentheses) of the corresponding estimates. A further column shows the number of datasets with crossover estimates $\hat{\rho}$ within a factor of $k$ of the true $\rho$; analogous columns are defined for the gene conversion rate $\gamma$ and the mean tract length $\lambda$.
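The summary statistics described above are simple to compute. In the sketch below, "within a factor of $k$" is interpreted as the estimate lying between the true value divided by $k$ and the true value multiplied by $k$; this interpretation, the function names and the example numbers are assumptions made for illustration.

```python
import statistics


def count_within_factor(estimates, true_value, factor):
    """Number of estimates e with true/factor <= e <= true*factor."""
    low, high = true_value / factor, true_value * factor
    return sum(1 for e in estimates if low <= e <= high)


def mean_and_sd(estimates):
    """Sample mean and sample standard deviation of the estimates."""
    return statistics.mean(estimates), statistics.stdev(estimates)


toy_estimates = [0.3, 0.5, 0.9, 1.2, 5.0]
toy_count = count_within_factor(toy_estimates, true_value=0.5, factor=2)
toy_mean, toy_sd = mean_and_sd(toy_estimates)
```

The count is less sensitive to a few extreme estimates than the mean, which is why the two kinds of summary can tell different stories for the same method.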
4.1.1 Estimation of $\rho$
Both our method and GenCo produced reasonable estimates of $\rho$. The two estimates had similar means, but our estimate generally had a smaller variance than that of GenCo.
4.1.2 Estimation of $\gamma$
Our improvement over GenCo is clearly illustrated in the estimation of $\gamma$. GenCo's estimate of $\gamma$ was substantially biased upward, with means above the true $\gamma$ by factors of tens to thousands. In most cases, this significant bias was not a result of only a few outliers; as the corresponding within-factor column in Table 2 and the histogram in Figure 4a show, GenCo produced very large estimates of $\gamma$ for a significant fraction of simulated datasets. In contrast, as Table 2 and the histogram in Figure 4b indicate, our estimate of $\gamma$ was much better behaved for all parameter settings, though it was slightly biased upward for $\gamma = 0.5$ and 1.0/kb.
4.1.3 Estimation of $\lambda$
GenCo's estimate of $\lambda$ was slightly biased upward. This upward bias occurred even though many estimates were well below the true value $\lambda = 0.5$ kb, as shown in the histogram in Figure 4c. In GenCo, a very large [...]. Our estimate of $\lambda$ is much more accurate, with a smaller variance. However, as the cases with $\gamma = 2.5$/kb suggest, our estimate of the mean tract length $\lambda$ seems slightly biased downward when $\gamma$ is large.
4.2 A real biological application
Gay et al. (2007) used their method to study recombination patterns in two genes, namely su(s) and su(wa), surveyed by Langley et al. (2000) and located near the telomere of the X chromosome of D.melanogaster. The su(s) and su(wa) loci are about 4.1 kb and 2.5 kb long, respectively, and are about 400 kb apart. Langley et al. (2000) surveyed samples from both an African and a European population, but only the African sample was considered by Gay et al., and we do the same here. The su(s) dataset contains 50 haplotypes and 41 SNPs, while the su(wa) dataset contains 50 haplotypes and 46 SNPs.
Gay et al. reported that, upon fixing the mean tract length to 0.352 kb (Hilliker et al., 1994), they obtained [...] and [...], thus concluding [...]. In their paper, Gay et al. did not specify whether the above estimates were for the su(s) locus or the su(wa) locus. To compare their method GenCo with our method, we redid the analysis, following the same procedure as in Section 4.1, i.e. taking 20 random permutations of haplotype order and iterating the computation 10 times. We used $\rho = 1.0$/kb and $\gamma = 1.0$/kb as the starting values of the optimization procedure in the first iteration. The results, summarized in Table 3, are quite different between the two methods. Assuming $\lambda = 0.352$ kb, GenCo suggests that the gene conversion rate is substantially higher than the crossover rate in each gene, while our method implies that the two rates are comparable.
We also performed the analysis with $\lambda$ as a free parameter; Gay et al. (2007) did not consider this analysis in their paper. In this case, we used $\rho = 5.0$/kb, $\gamma = 5.0$/kb and $\lambda = 0.352$ kb as the starting values of the optimization procedure in the first iteration. GenCo and our method again produced generally different results. The corresponding maximum likelihood estimates of $\rho$, $\gamma$ and $\lambda$ are shown in Table 4. For the su(s) locus, GenCo and our method produced similar estimates of [...], but GenCo produced a much smaller estimate of [...] than that of our method, while the opposite is true for [...]. [...], but GenCo produced a much larger estimate of [...] than that of our method, though both methods produced a value of [...]. [...] in both methods were quite small; this could be an artifact of the methods, which tend to produce small estimates of $\lambda$ when estimates of $\gamma$ are large.
As discussed in Section 4.1, both GenCo and our method tend to overestimate $\gamma$ (GenCo more so than our method), but the fact that both methods detected strong signals of gene conversion suggests that gene conversion is likely to have played an important role in shaping the observed pattern of genetic variation in the two genes. This agrees with Langley et al.'s conclusion. However, unlike what Gay et al. (2007) concluded, our analysis implies that crossover may not have been greatly suppressed in the su(s) and su(wa) loci.
5 DISCUSSION
High-throughput sequencing technology has advanced remarkably in the past few years (Bentley, 2006), and soon it will become routine to obtain whole-genome sequence information. Such fine-scale data from populations will allow us to quantify fundamental population genetics parameters with high accuracy. In particular, it will soon be possible to provide a genomic annotation of gene conversion rates and characterize the distribution of conversion tract lengths. Hence, improved algorithms and statistical tools for studying gene conversion are much in need.
In this article, we have developed a model that allows overlapping gene conversions. We believe that this aspect of our model is crucial in making the joint estimation of the gene conversion rate and the mean conversion tract length feasible. Although the joint estimation of the three parameters $\rho$, $\gamma$ and $\lambda$ is indeed a very difficult problem, and the method proposed here is unlikely to be optimal, we believe that we have taken an important step towards devising a robust, reliable method.
Our current method can be improved in several ways. When the gene conversion rate $\gamma$ is high, our method tends to underestimate the conversion tract length $\lambda$ slightly. On the other hand, when $\gamma$ is small, our method tends to overestimate $\gamma$ slightly. We believe that both biases can be corrected by considering larger threshold values ($a$ and $b$) on the maximum number of allowed gene conversion initiation and termination events. We will explore this improvement in the future. Other important future directions include handling missing data and variable rates across the sequence.
The PAC model proposed by Li and Stephens (2003) is a useful framework with many applications. Hellenthal et al. (2008) recently proposed using a PAC-based copying model to infer human colonization history. Clearly, the accuracy of that inference method can benefit from having a more realistic copying model, such as the one proposed here.
ACKNOWLEDGEMENTS
We thank Jo Gay for making her source code available to us and Charles H. Langley for providing us with the su(s) and su(wa) data.
Funding: Department of Energy (BER KP110201 to M.I.J. in part); National Institutes of Health (R01-GM071749 to M.I.J. and R00-GM080099 to Y.S.S. in part); an Alfred P. Sloan Research Fellowship (to Y.S.S. in part); Packard Fellowship for Science and Engineering (to Y.S.S. in part).
Conflict of Interest: none declared.
REFERENCES
Bentley DR. Whole-genome re-sequencing. Curr. Opin. Genet. Dev. (2006) 16:545–552.
Crawford DC, et al. Evidence for substantial fine-scale variation in recombination rates across the human genome. Nat. Genet. (2004) 36:700–706.
Frisse L, et al. Gene conversion and different population histories may explain the contrast between polymorphism and linkage disequilibrium levels. Am. J. Hum. Genet. (2001) 69:831–843.
Gay JC, et al. Estimating meiotic gene conversion rates from population genetic data. Genetics (2007) 177:881–894.
Ghahramani Z, Jordan MI. Factorial hidden Markov models. Mach. Learn. (1997) 29:245–273.
Hellenthal G. Exploring Rates and Patterns of Variability in Gene Conversion and Crossover in the Human Genome. PhD Thesis. (2006) Seattle: University of Washington.
Hellenthal G, et al. Inferring human colonization history using a copying model. PLoS Genet. (2008) 4:e1000078.
Hilliker AJ, et al. Meiotic gene conversion tract length distribution within the rosy locus of Drosophila melanogaster. Genetics (1994) 137:1019–1026.
Hudson RR. Properties of a neutral allele model with intragenic recombination. Theor. Popul. Biol. (1983) 23:183–201.
Hudson RR. Two-locus sampling distributions and their application. Genetics (2001) 159:1805–1817.
Hudson RR. Generating samples under the Wright-Fisher neutral model of genetic variation. Bioinformatics (2002) 18:337–338.
Hwang DG, Green P. Bayesian Markov chain Monte Carlo sequence analysis reveals varying neutral substitution patterns in mammalian evolution. Proc. Natl Acad. Sci. USA (2004) 101:13994–14001.
International HapMap Consortium. A haplotype map of the human genome. Nature (2005) 437:1299–1320.
Jeffreys AJ, May CA. Intense and highly localized gene conversion activity in human meiotic crossover hot spots. Nat. Genet. (2004) 36:151–156.
Kingman JFC. The coalescent. Stoch. Process. Appl. (1982) 13:235–248.
Langley CH, et al. Linkage disequilibria and the site frequency spectra in the su(s) and su(wa) regions of the Drosophila melanogaster X chromosome. Genetics (2000) 156:1837–1852.
Li N, Stephens M. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics (2003) 165:2213–2233.
Mancera E, et al. High-resolution mapping of meiotic crossovers and non-crossovers in yeast. Nature (2008) 454:479–485.
Myers S, et al. A fine-scale map of recombination rates and hotspots across the human genome. Science (2005) 310:321–324.
Padhukasahasram B, et al. Estimating recombination rates from single-nucleotide polymorphisms using summary statistics. Genetics (2006) 174:1517–1528.
Pritchard JK, Przeworski M. Linkage disequilibrium in humans: models and data. Am. J. Hum. Genet. (2001) 69:1–14.
Ptak SE, et al. Insights into recombination from patterns of linkage disequilibrium in humans. Genetics (2004) 167:387–397.
Rabiner L. A tutorial on HMM and selected applications in speech recognition. Proc. IEEE (1989) 77:257–286.
Song YS, et al. Algorithms to distinguish the role of gene-conversion from single-crossover recombination in the derivation of SNP sequences in populations. J. Comput. Biol. (2007) 14:1273–1286.
Voight BF, et al. A map of recent positive selection in the human genome. PLoS Biol. (2006) 4:e72.
Wall JD. Close look at gene conversion hot spots. Nat. Genet. (2004a) 36:114–115.
Wall JD. Estimating recombination rates using three-site likelihoods. Genetics (2004b) 167:1461–1473.
Watterson G. On the number of segregation sites. Theor. Popul. Biol. (1975) 7:256–276.
Wiuf C, Hein J. The coalescent with gene conversion. Genetics (2000) 155:451–462.