Bioinformatics Advance Access originally published online on December 6, 2006
Bioinformatics 2007 23(4):520-521; doi:10.1093/bioinformatics/btl622

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

msHOT: modifying Hudson's ms simulator to incorporate crossover and gene conversion hotspots

Garrett Hellenthal 1,* and Matthew Stephens 2

1 Department of Statistics, University of Oxford, 1 South Parks Road, Oxford OX1 3TG, UK
2 Department of Statistics, University of Chicago, 5734 S. University Avenue, Eckhart 126, Chicago, IL 60637, USA

*To whom correspondence should be addressed.


    ABSTRACT

Summary: We have incorporated both crossover and gene conversion hotspots into an existing coalescent-based program for simulating genetic variation data for a sample of chromosomes from a population.

Availability: The source code for msHOT is available at http://home.uchicago.edu/~rhudson1, along with accompanying instructions.

Contact: hellenth{at}stats.ox.ac.uk


    1 INTRODUCTION
Richard R. Hudson's ‘program for generating samples under neutral models’ (the ms simulator) (2002) is a widely used program for simulating genetic variation data, in particular single nucleotide polymorphism (SNP) data, for randomly sampled haplotypes from a population. The program allows the user to specify various aspects of population demography (e.g. population sizes and migration patterns) and factors governing evolution [e.g. mutation, crossover and gene conversion (gc) rates]. However, it presently does not allow recombination rates to vary along the simulated region. This is a limitation because ‘hotspots,’ or areas of the genome in which crossover and/or (allelic) gc occur at higher rates than the genome-wide average, appear to be common in humans (Jeffreys et al., 2001; Jeffreys and May, 2004; Myers et al., 2005). We have incorporated both crossover and gc hotspots into a freely available, updated simulator called msHOT. The output and usage are the same as in the ms program of Hudson (2002), except that msHOT accepts additional arguments for specifying hotspot features. Though other coalescent-based simulators have been written to incorporate variable crossover rates (Schaffner et al., 2005), ours is the first to our knowledge to include the option for gc hotspots as well.


    2 MODEL
The current implementation of ms allows the user to specify the (population-scaled) rate of crossing-over, $\rho$, and the relative rate of gc to crossover, $f$, for the genetic region to be simulated. Here $\rho = 4N_0 r$, where $N_0$ is the current diploid population size and $r$ is the probability of a crossover occurring in the region in a single transmission from parent to offspring, and $f = g/r$, where $g$ denotes the probability of a gc initiating in the region of interest in a single transmission from parent to offspring (Wiuf and Hein, 2000). Since $r$ and $g$ are typically small, dividing these parameters by the sequence length of the genetic region gives, respectively, the crossover probability per base pair, $r_{bp}$, and the gc probability per base pair, $g_{bp}$.
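The relations above can be inverted to recover per-base-pair probabilities from the population-scaled inputs. The following is a minimal sketch, not code from ms or msHOT; the function name `per_bp_rates` and the numeric values are illustrative assumptions.

```python
def per_bp_rates(rho, f, n0, n_sites):
    """Convert population-scaled inputs into per-base-pair probabilities.

    rho     : population-scaled crossover rate for the region (rho = 4 * N0 * r)
    f       : ratio of gc initiation to crossover probability (f = g / r)
    n0      : current diploid population size N0
    n_sites : sequence length of the region in base pairs
    """
    r = rho / (4.0 * n0)      # crossover probability for the whole region
    g = f * r                 # gc initiation probability for the whole region
    return r / n_sites, g / n_sites


# Illustrative values only (hypothetical, not taken from the article).
r_bp, g_bp = per_bp_rates(rho=40.0, f=2.0, n0=10_000, n_sites=100_000)
print(r_bp, g_bp)
```

With these inputs the region-wide crossover probability is 40 / 40000 = 0.001, which spread over 100,000 base pairs gives a per-base-pair probability of about 1e-8.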

Our modification, msHOT, allows the user to insert as many (non-overlapping) crossover hotspots and (non-overlapping) gene conversion hotspots into the genetic region as they wish by specifying the location and intensity of each. Specifically, incorporating $H$ crossover hotspots requires the user to specify a left endpoint ($a_h$), right endpoint ($b_h$), and intensity ($\lambda_h$) for each, $h = 1, \ldots, H$. Inside hotspot $h$, the probability of a crossover occurring between two adjacent base pairs in a single transmission from parent to offspring is $\lambda_h r_{bp}$. Outside any hotspot, this probability is $r_{bp}$. Similarly, incorporating $\tilde{H}$ gc hotspots requires the user to specify a left endpoint ($\tilde{a}_h$), right endpoint ($\tilde{b}_h$) and intensity ($\tilde{\lambda}_h$) for each, $h = 1, \ldots, \tilde{H}$. Inside gc hotspot $h$, the probability of a gc initiation between two adjacent base pairs in a single transmission from parent to offspring is $\tilde{\lambda}_h g_{bp}$. Outside any hotspot, this probability is $g_{bp}$ (see Figure 1a). Gene conversion hotspots may overlap with crossover hotspots.


Fig. 1 (a) Illustration of varying crossover and/or gc intensities in a genetic region $[S, E]$. Here the crossover (respectively gc) probability $r_{bp}$ (respectively $g_{bp}$) is increased by a multiple $\lambda_1$ in $[a_1, b_1]$ and by a multiple $\lambda_2$ in $[a_2, b_2]$. (b) Illustration of the three distinct gc types that can influence variation in the genetic region $[S, E]$. The grey vertical lines represent the initiation point of each gc event, and the black horizontal bars represent the tract length of each of the gc events.

 
In Hudson's ms, gc events initiate at some base pair, which is assumed to form the left endpoint of the region affected by the gc. The right endpoint is then determined by the length (in physical distance) of the region affected by the gc (i.e. the tract length), which is assumed to have a geometric distribution with user-specified mean. This difference in the treatment of the left and right endpoints causes an awkward asymmetry when the rate of gc initiation is allowed to vary along the region. To deal with this, we changed the model to assume that gc events initiate at some point and then spread both right and left independently according to geometric distributions with user-specified mean $t^*$. Thus, in our model, the tract length is the sum of two independent geometric random variables. Incidentally, this may also better represent current knowledge of the biology underlying gc events (Szostak et al., 1983).
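The two-sided tract model can be illustrated with a short simulation. This is a sketch under stated assumptions rather than the msHOT implementation: it assumes each geometric variable takes values 1, 2, ... with mean t*, and the helper names `sample_geometric` and `sample_tract` are hypothetical.

```python
import math
import random


def sample_geometric(mean, rng):
    """Draw from a geometric distribution on {1, 2, ...} with the given mean.

    Assumption: Pr(T = k) = q**(k - 1) * (1 - q) with q = 1 - 1/mean.
    """
    if mean <= 1.0:
        return 1
    q = 1.0 - 1.0 / mean
    u = 1.0 - rng.random()          # uniform on (0, 1], avoids log(0)
    return 1 + int(math.log(u) / math.log(q))


def sample_tract(z, t_star, rng):
    """Spread left and right from initiation point z, independently."""
    left = z - sample_geometric(t_star, rng)
    right = z + sample_geometric(t_star, rng)
    return left, right


rng = random.Random(0)
left, right = sample_tract(z=1000, t_star=50.0, rng=rng)
print(left, right, right - left)
```

Under this parametrization the expected tract length is 2t*, twice the mean of either arm, which is worth keeping in mind when comparing tract-length settings against a one-sided model.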


    3 IMPLEMENTATION
The basic algorithm of msHOT is as described in Hudson (1983). In brief, ms generates ancestral recombination graphs for a sample of chromosomes by stochastically determining ‘events’ to occur on the ancestral material of the chromosomes going back in time, until all the material has coalesced into a common ancestor. We refer to any individual segment of this ancestral material as an ‘ancestral segment.’ Potential ‘events’ include the coalescence of two such segments or a recombination event (crossover or gc) occurring in a single segment. Incorporating hotspots involves changing the rates at which these recombination events occur, as described below. (The consequences of these events, which involve splitting ancestral segments, are not changed by the introduction of hotspots and are already dealt with in Hudson's code.)

The rate of each possible recombination event, backwards in time, is determined by computing the probability of the event occurring in a single generation forwards in time, and multiplying this by $4N_0$. We therefore focus on computing the relevant probabilities forwards in time. In the following we use $[S, E]$ to denote an ancestral segment beginning at $S$ and ending at $E$.

Crossover. Assume the entire simulated region contains $H$ crossover hotspots, each with left endpoint $a_h$, right endpoint $b_h$, and intensity $\lambda_h$, $h = 1, \ldots, H$. Under the model described above, the probability of a crossover initiating at any particular location $z \in [S, E]$ is:

$$r_{bp}\left[1 + \sum_{h=1}^{H} (\lambda_h - 1)\, I_{z \in [a_h, b_h]}\right] \qquad (1)$$

Here $I_{z \in [a_h, b_h]}$ is an indicator function, taking the value 1 if $z$ is in crossover hotspot $h$ and 0 otherwise. The total probability of a crossover occurring in $[S, E]$ is found by summing over $z$ in Equation (1). If a crossover is to occur in $[S, E]$, the location $z$ is selected with probability proportional to Equation (1).
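The crossover calculation can be sketched as follows. The sketch relies only on the model stated earlier (probability λ_h r_bp inside hotspot h and r_bp outside, with non-overlapping hotspots); the function names are hypothetical, and the brute-force sum over every position is chosen for clarity, whereas an efficient implementation would presumably sum piecewise over hotspot boundaries.

```python
import random


def crossover_prob(z, r_bp, hotspots):
    """Per-site crossover probability at position z.

    hotspots: list of (a, b, lam) tuples, assumed non-overlapping,
    with inclusive endpoints (an assumption of this sketch).
    """
    multiplier = 1.0
    for a, b, lam in hotspots:
        if a <= z <= b:
            multiplier += lam - 1.0   # raises r_bp to lam * r_bp inside the hotspot
    return r_bp * multiplier


def total_crossover_prob(S, E, r_bp, hotspots):
    """Sum the per-site probability over every z in [S, E]."""
    return sum(crossover_prob(z, r_bp, hotspots) for z in range(S, E + 1))


def sample_crossover_site(S, E, r_bp, hotspots, rng):
    """Choose z in [S, E] with probability proportional to the per-site value."""
    sites = list(range(S, E + 1))
    weights = [crossover_prob(z, r_bp, hotspots) for z in sites]
    return rng.choices(sites, weights=weights, k=1)[0]


# Illustrative values: one hotspot of intensity 5 covering positions 10..19.
hotspots = [(10, 19, 5.0)]
print(total_crossover_prob(0, 29, 1e-3, hotspots))
print(sample_crossover_site(0, 29, 1e-3, hotspots, random.Random(1)))
```

In this toy segment the hotspot covers a third of the positions but carries 50 of the 70 units of weight, so roughly five in seven sampled crossovers should fall inside it.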

Gene conversion. Each gc event can be thought of as having an ‘initiation point’ and ‘right’ and ‘left’ endpoints. We distinguish three types of gc event that can influence patterns of genetic variation in [S, E] (see Figure 1b):

  1. Type i: a gc event initiates within [S, E] and has endpoints that may be either inside or outside this region.
  2. Type ii: a gc event initiates to the left of S and has a right endpoint within [S, E].
  3. Type iii: a gc event initiates to the right of E and has a left endpoint within [S, E].
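The three event types can be expressed as a small classifier. This is an illustrative reading of the list above, not code from msHOT: the function name `classify_gc_event` and the inclusive boundary conventions are assumptions, and events matching none of the three types are treated as not affecting the segment.

```python
def classify_gc_event(init, left, right, S, E):
    """Classify a gc event relative to the ancestral segment [S, E].

    init        : initiation point of the event
    left, right : endpoints of the converted tract (left <= init <= right)
    Returns "i", "ii", "iii", or None if the event matches none of the types.
    """
    if S <= init <= E:
        return "i"      # initiates inside the segment; endpoints may lie anywhere
    if init < S and S <= right <= E:
        return "ii"     # initiates to the left, tract ends inside the segment
    if init > E and S <= left <= E:
        return "iii"    # initiates to the right, tract starts inside the segment
    return None


print(classify_gc_event(init=50, left=40, right=60, S=0, E=100))
print(classify_gc_event(init=-5, left=-20, right=10, S=0, E=100))
print(classify_gc_event(init=120, left=90, right=130, S=0, E=100))
```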

The following subsections give the relative probabilities, and describe how to determine the endpoints, for each of these types of event. Assume the entire simulated region contains $\tilde{H}$ gene conversion hotspots, each with left endpoint $\tilde{a}_h$, right endpoint $\tilde{b}_h$, and intensity $\tilde{\lambda}_h$, $h = 1, \ldots, \tilde{H}$.

‘Type i’: The probability of a type i event initiating at $z \in [S, E]$ is:

$$g_{bp}\left[1 + \sum_{h=1}^{\tilde{H}} (\tilde{\lambda}_h - 1)\, I_{z \in [\tilde{a}_h, \tilde{b}_h]}\right] \qquad (2)$$

where $I_{z \in [\tilde{a}_h, \tilde{b}_h]}$ is an indicator denoting whether location $z$ is in gc hotspot $h$. If a type i event occurs, its endpoints are determined by first selecting the initiation point, $z$, with probabilities proportional to Equation (2), and then simulating the left and right endpoints as $z - T_1$ and $z + T_2$, where the $T_i$ are randomly sampled from a geometric($t^*$) distribution. (These endpoints may fall outside $[S, E]$.)

‘Type ii’: The probability that a type ii event initiates at a location $y$ to the left of $S$ (thus $y$ is outside $[S, E]$, in contrast to $z$ above) and has a right endpoint at $x \in [S, E]$ is given by:

$$g_{bp}\left[1 + \sum_{h=1}^{\tilde{H}} (\tilde{\lambda}_h - 1)\, I_{y \in [\tilde{a}_h, \tilde{b}_h]}\right] q^{\,x - y - 1}(1 - q) \qquad (3)$$

where $q = 1 - 1/t^*$ is the probability that a tract extends by one further base pair, $y$ ranges from $-\infty$ to $S$ (for simplicity, we have assumed, as in ms, that the chromosome has infinite length), and $I_{y \in [\tilde{a}_h, \tilde{b}_h]}$ is an indicator for whether $y$ is in hotspot $h$. The total probability of a type ii event occurring in $[S, E]$ is obtained by summing Equation (3) over possible values of $x$ and $y$. (We deal with the infinite sum over $y$ by use of standard geometric series results. All locations outside the simulated region are assumed to have the background probability of a gene conversion, $g_{bp}$.) If a type ii event occurs, its right endpoint, $x^*$, is chosen via a truncated geometric distribution [i.e. $\Pr(X^* = x^*) \propto q^{\,x^* - S}(1 - q)$, for $x^* = S + 1, \ldots, E$].
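Sampling the right endpoint of a type ii event from the truncated geometric distribution can be sketched as below. The sketch takes q as a given parameter in (0, 1) and makes no assumption about how it is derived from the mean tract length; the function names are hypothetical and this is not the msHOT source.

```python
import random


def truncated_geometric_pmf(S, E, q):
    """Normalised Pr(X* = x) for x = S+1, ..., E, proportional to q**(x - S) * (1 - q)."""
    support = range(S + 1, E + 1)
    weights = [q ** (x - S) * (1.0 - q) for x in support]
    total = sum(weights)
    return {x: w / total for x, w in zip(support, weights)}


def sample_right_endpoint(S, E, q, rng):
    """Draw the right endpoint x* of a type ii event within (S, E]."""
    pmf = truncated_geometric_pmf(S, E, q)
    xs = list(pmf)
    return rng.choices(xs, weights=[pmf[x] for x in xs], k=1)[0]


pmf = truncated_geometric_pmf(S=100, E=110, q=0.9)
print(sum(pmf.values()))
print(sample_right_endpoint(100, 110, 0.9, random.Random(2)))
```

Because the distribution is only specified up to proportionality, the factor (1 - q) cancels on normalisation; it is kept here to mirror the expression in the text.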

‘Type iii’: The type iii gc events are similar to the type ii events above, but with locations starting from the end of an ancestral segment and counting from right to left.


    Acknowledgments
 
The authors thank E.C. Anderson for sharing an annotated version of Hudson's code edited to incorporate crossover hotspots and R.R. Hudson for kindly agreeing to distribute our modified version of his code.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Keith A Crandall

Received on September 28, 2006; revised on November 21, 2006; accepted on November 28, 2006

    REFERENCES

    Hudson, R. (1983) Properties of a neutral allele model with intragenic recombination. Theor. Popul. Biol., 23, 183–201.

    Hudson, R. (2002) Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics, 18, 337–338.

    Jeffreys, A. and May, C. (2004) Intense and highly localized gene conversion activity in human meiotic crossover hot spots. Nat. Genet., 36, 151–156.

    Jeffreys, A., et al. (2001) Intensely punctate meiotic recombination in the class II region of the major histocompatibility complex. Nat. Genet., 29, 217–222.

    Myers, S., et al. (2005) A fine-scale map of recombination rates and hotspots across the human genome. Science, 310, 321–324.

    Schaffner, S., et al. (2005) Calibrating a coalescent simulation of human genome sequence variation. Genome Res., 15, 1576–1583.

    Szostak, J., et al. (1983) The double-strand-break repair model for recombination. Cell, 33, 25–35.

    Wiuf, C. and Hein, J. (2000) The coalescent with gene conversion. Genetics, 155, 451–462.





