Bioinformatics Advance Access originally published online on September 27, 2005
Bioinformatics 2005 21(23):4309-4311; doi:10.1093/bioinformatics/bti689
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
HapSim: a simulation tool for generating haplotype data with pre-specified allele frequencies and LD coefficients
Department of Mathematics, Imperial College London, UK
| ABSTRACT |
|---|
|
|
|---|
We have developed a simulation tool HapSim for the generation of haplotype data. The simulated haplotypes are such that their allele frequencies and linkage disequilibrium coefficients match exactly those estimated in a real sample.
Availability: The program is available as an R package and can be downloaded from http://cran.r-project.org/
Contact: g.montana{at}imperial.ac.uk
| INTRODUCTION |
|---|
|
|
|---|
With the widespread availability of molecular data, novel computational methods for gene mapping are being developed at an increasing rate. It is often the case that the statistical properties and behavior of these methods need to be assessed and tested by simulation, especially when no analytical derivations are available. A number of simulation programs have been developed and are currently used to this end; see, for instance, Hudson (2002) and Excoffier et al. (2000).
Once a mechanism for the generation of DNA polymorphism data is in place, entire casecontrol association studies can also be simulated, e.g. through the random allocation of phenotypes conditional on genotypes at one or more loci, and according to a variety of different genetic models. Among other applications, such a simulation framework allows for comparisons and performance evaluations of multi-locus gene mapping methods to be made.
We suggest a simple haplotype simulation approach whose main feature lies in its ability to produce haplotype data with pre-specified allele frequencies and pairwise linkage disequilibria as measured by the r2 index. Assuming that linkage disequilibria of order greater than two can be ignored, a haplotype is modeled as a multivariate random variable possessing known marginal distributions and pairwise correlation coefficients. We have implemented an algorithm that can efficiently generate large samples of such variables.
| SIMULATION ALGORITHM |
|---|
|
|
|---|
Suppose we have collected a sample of N haplotypes comprising L binary markers coded as 0 and 1, and call pi the estimated allele frequency at locus i. Further assume that, for all pairs of markers i and j, the linkage disequilibrium coefficient
has been calculated as (pij pi pj)2 [pi pj(1 pi)(1 pj)]1. The joint probability pij of observing the same allele at both loci is then easily derived. We now wish to simulate a multivariate Bernoulli variable S = (S1, S2, ..., Sm) such that each variable Si has a marginal distribution pi = Pr(Si = 1), and the correlation between each pair of variables, say Si and Sj, is exactly rij. The following algorithm draws a random sample from the distribution of S:
- Compute a covariance matrix C = {cij} by solving the equations
(z(pi), z(pj); cij) = pij uniquely for cij (1
i < j
L). Here the notation
(z(x), z(y);
) refers to the c.d.f. of a standard bivariate normal distribution with correlation coefficient
, whereas z(p) denotes the p-th quantile of the univariate standard normal distribution.
- Simulate a random vector m = (m1, ..., mL) from a multivariate normal distribution with mean vector (0, ..., 0) and covariance matrix C. The required random sample s = (s1, s2, ..., sL) is then obtained by thresholding the components of m according to the following rule: si = 1 if mi
z(pi) else si = 0.
The underlying simulation strategy we adopted follows a well-known general recipe taken from the stochastic simulation cookbook, i.e. transform a random variable to a normally distributed variate via the c.d.f. and then map back. Despite its appealing simplicity, this algorithm had not been implemented before in the context of simulating binary genetic markers.
A few computational remarks are in order. The L(L 1)/2 equations that appear in Step (1) are solved with high precision by an iterative bisection method; the function implementing the equation solver has been written in C in order to achieve a substantial speedup at run-time. It is important to notice that the covariance matrix C obtained from all the pairwise calculations of Step (1) is not guaranteed to be positive defined. When this is indeed the case, the non-positive definitiveness poses a problem, as a working covariance matrix is needed for sampling from a multivariate normal distribution1. Accordingly, the algorithm executes an intermediate step as needed:
(1a) Replace the non-positive definite covariance matrix C by its approximated positive definite version.
The approximation above, sometimes called bending (Hayes and Hill, 1981), requires performing an initial spectral decomposition of C; all the eigenvalues smaller than a minimum tolerance threshold are then replaced by this minimum value.
In all cases when the covariance matrix approximation applies, the mean square error between the target r2 and the corresponding LD coefficient estimated from a simulated sample will stay above a positive constant, even when the sample size increases. However, as we briefly illustrate in the sequel, the approximation error is indeed small, at times almost negligible, and the estimated LD plot is generally in very good agreement with its target. On the other hand, the estimated allele frequencies are not affected by this approximation, and are unbiased.
| ILLUSTRATION |
|---|
|
|
|---|
Let us assume that a sample dataset X has been loaded into R. The generation of virtual haplotypes is a 2-fold process: first, the covariance matrix C is computed by creating a haplotype object H
haplodata(X); then, a sample Y of n haplotypes is obtained by the command Y
haplosim(n, H). Other summary statistics (e.g. allele frequencies and a locus-specific genetic diversity index) can also be easily extracted from both the H and Y objects. We briefly illustrate the use of HapSim on the ACE (angiotensin I converting enzyme) data, as analyzed and made available by Lin and Altman (2004). This dataset contains 52 SNPs typed on DCP1 in 11 individuals. All the 22 phased chromosomes were used as a target sample. We then simulated haplotype samples of sizes ranging from 12 to 2000. For each sample size, 100 replicated samples were generated, and the average mean square errors of both marginal probabilities and correlation coefficients were obtained.
Figure 1 clearly shows that, as the sample size increases, the marginal probability estimates quickly converge to the target values; on the other hand, the error in the estimation of pairwise correlations does not go down to zero due to the covariance matrix approximation. The minimum bias obtained at large sample sizes is marked in Figure 1 by a horizontal line.
|
Figure 2 reassures that this error is not much of a concern. The upper triangular matrix represents the LD coefficients as estimated from the real data, whereas the lower triangular matrix displays the corresponding LD in a small sample of 22 simulated haplotypes; it clearly emerges that the block-like LD structure of the real data is reproduced with high accuracy.
|
| DISCUSSION |
|---|
|
|
|---|
We have described a simple simulation procedure for generating artificial haplotypes matching pre-specified first and second order data summaries. An important assumption of our model is that the pairwise LD coefficients describe sufficiently well the real association structure among markers in the dataset at hand, whereas disequilibria of order greater than two (e.g. LD involving three or more markers) are close to zero, and can be ignored. When this assumption is violated, and in regions with very low LD, the simulated haplotype patterns may not always be in very good agreement with the observed ones.
Our implementation only relies on simulating a multivariate normal distribution, and therefore allows the user to efficiently generate many large samples in a matter of a few seconds. Even the initial and most computationally intensive task involving the solution of a large number of equations is quickly performed on a common desktop PC.
The R package provides a set of functions for haplotype simulation which is intended to be used as a component of a broader simulation study coded in R. Two potential applications are in order. The LD coefficient r2 arises naturally in the context of association mapping studies and is related to their power (Pritchard and Przeworski, 2001); therefore HapSim, being able to generate large samples of chromosomes that closely mimic observed r2 patterns, can be used as a building block of a simulation program for power and sample size estimation. We are currently employing it to characterize the properties of selected SNP tagging algorithms that only rely on pairwise LD. Other possible applications are under evaluation in a collaboration with GlaxoSmithKline.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
1Multivariate normal variates are drawn by the mvrnorm function in the MASS library, also available from the CRAN repository.
Received on August 1, 2005; revised on September 21, 2005; accepted on September 22, 2005
| REFERENCES |
|---|
|
|
|---|
Excoffier, L., et al. (2000) SIMCOAL: a general coalescent program for the simulation of molecular data in interconnected populations with arbitrary demography. J. Hered, . 91, 506509
Hayes, J.F. and Hill, W.G. (1981) Modification of estimates of parameters in the construction of genetic selection indices (bending). Biometrics, 37, 483493[CrossRef].
Hudson, R.R. (2002) Generating samples under a WrightFisher neutral model of genetic variation. Bioinformatics, 18, 337338
Lin, Z. and Altman, R.B. (2004) Finding haplotype tagging SNPs by use of principal components analysis. Am. J. Hum. Genet, . 75, 850861[CrossRef][Web of Science][Medline].
Pritchard, J.K. and Przeworski, M. (2001) Linkage disequilibrium in humans: models and data. Am. J. Hum. Genet, . 69, 114[CrossRef][Web of Science][Medline].
This article has been cited by other articles:
![]() |
F. A. Wright, H. Huang, X. Guan, K. Gamiel, C. Jeffries, W. T. Barry, F. Pardo-Manuel de Villena, P. F. Sullivan, K. C. Wilhelmsen, and F. Zou Simulating association studies: a data-based resampling method for candidate regions or whole genome scans Bioinformatics, October 1, 2007; 23(19): 2581 - 2588. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||


