Skip Navigation


Bioinformatics Advance Access originally published online on September 27, 2005
Bioinformatics 2005 21(23):4309-4311; doi:10.1093/bioinformatics/bti689
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/23/4309    most recent
bti689v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (1)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Montana, G.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Montana, G.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oxfordjournals.org

HapSim: a simulation tool for generating haplotype data with pre-specified allele frequencies and LD coefficients

Giovanni Montana

Department of Mathematics, Imperial College London, UK


    ABSTRACT
 TOP
 ABSTRACT
 INTRODUCTION
 SIMULATION ALGORITHM
 ILLUSTRATION
 DISCUSSION
 REFERENCES
 

We have developed a simulation tool HapSim for the generation of haplotype data. The simulated haplotypes are such that their allele frequencies and linkage disequilibrium coefficients match exactly those estimated in a real sample.

Availability: The program is available as an R package and can be downloaded from http://cran.r-project.org/

Contact: g.montana{at}imperial.ac.uk


    INTRODUCTION
 TOP
 ABSTRACT
 INTRODUCTION
 SIMULATION ALGORITHM
 ILLUSTRATION
 DISCUSSION
 REFERENCES
 
With the widespread availability of molecular data, novel computational methods for gene mapping are being developed at an increasing rate. It is often the case that the statistical properties and behavior of these methods need to be assessed and tested by simulation, especially when no analytical derivations are available. A number of simulation programs have been developed and are currently used to this end; see, for instance, Hudson (2002) and Excoffier et al. (2000).

Once a mechanism for the generation of DNA polymorphism data is in place, entire case–control association studies can also be simulated, e.g. through the random allocation of phenotypes conditional on genotypes at one or more loci, and according to a variety of different genetic models. Among other applications, such a simulation framework allows for comparisons and performance evaluations of multi-locus gene mapping methods to be made.

We suggest a simple haplotype simulation approach whose main feature lies in its ability to produce haplotype data with pre-specified allele frequencies and pairwise linkage disequilibria as measured by the r2 index. Assuming that linkage disequilibria of order greater than two can be ignored, a haplotype is modeled as a multivariate random variable possessing known marginal distributions and pairwise correlation coefficients. We have implemented an algorithm that can efficiently generate large samples of such variables.


    SIMULATION ALGORITHM
 TOP
 ABSTRACT
 INTRODUCTION
 SIMULATION ALGORITHM
 ILLUSTRATION
 DISCUSSION
 REFERENCES
 
Suppose we have collected a sample of N haplotypes comprising L binary markers coded as 0 and 1, and call pi the estimated allele frequency at locus i. Further assume that, for all pairs of markers i and j, the linkage disequilibrium coefficient has been calculated as (pijpi pj)2 [pi pj(1 – pi)(1 – pj)]–1. The joint probability pij of observing the same allele at both loci is then easily derived.

We now wish to simulate a multivariate Bernoulli variable S = (S1, S2, ..., Sm) such that each variable Si has a marginal distribution pi = Pr(Si = 1), and the correlation between each pair of variables, say Si and Sj, is exactly rij. The following algorithm draws a random sample from the distribution of S:

  1. Compute a covariance matrix C = {cij} by solving the equations {Phi}(z(pi), z(pj); cij) = pij uniquely for cij (1 ≤ i < j ≤ L). Here the notation {Phi}(z(x), z(y); {rho}) refers to the c.d.f. of a standard bivariate normal distribution with correlation coefficient {rho}, whereas z(p) denotes the p-th quantile of the univariate standard normal distribution.
  2. Simulate a random vector m = (m1, ..., mL) from a multivariate normal distribution with mean vector (0, ..., 0) and covariance matrix C. The required random sample s = (s1, s2, ..., sL) is then obtained by thresholding the components of m according to the following rule: si = 1 if mi ≤ z(pi) else si = 0.

The underlying simulation strategy we adopted follows a well-known general recipe taken from the stochastic simulation cookbook, i.e. transform a random variable to a normally distributed variate via the c.d.f. and then map back. Despite its appealing simplicity, this algorithm had not been implemented before in the context of simulating binary genetic markers.

A few computational remarks are in order. The L(L – 1)/2 equations that appear in Step (1) are solved with high precision by an iterative bisection method; the function implementing the equation solver has been written in C in order to achieve a substantial speedup at run-time. It is important to notice that the covariance matrix C obtained from all the pairwise calculations of Step (1) is not guaranteed to be positive defined. When this is indeed the case, the non-positive definitiveness poses a problem, as a working covariance matrix is needed for sampling from a multivariate normal distribution1. Accordingly, the algorithm executes an intermediate step as needed:

(1a) Replace the non-positive definite covariance matrix C by its approximated positive definite version.

The approximation above, sometimes called bending (Hayes and Hill, 1981), requires performing an initial spectral decomposition of C; all the eigenvalues smaller than a minimum tolerance threshold are then replaced by this minimum value.

In all cases when the covariance matrix approximation applies, the mean square error between the target r2 and the corresponding LD coefficient estimated from a simulated sample will stay above a positive constant, even when the sample size increases. However, as we briefly illustrate in the sequel, the approximation error is indeed small, at times almost negligible, and the estimated LD plot is generally in very good agreement with its target. On the other hand, the estimated allele frequencies are not affected by this approximation, and are unbiased.


    ILLUSTRATION
 TOP
 ABSTRACT
 INTRODUCTION
 SIMULATION ALGORITHM
 ILLUSTRATION
 DISCUSSION
 REFERENCES
 
Let us assume that a sample dataset X has been loaded into R. The generation of virtual haplotypes is a 2-fold process: first, the covariance matrix C is computed by creating a haplotype object H <- haplodata(X); then, a sample Y of n haplotypes is obtained by the command Y <- haplosim(n, H). Other summary statistics (e.g. allele frequencies and a locus-specific genetic diversity index) can also be easily extracted from both the H and Y objects.

We briefly illustrate the use of HapSim on the ACE (angiotensin I converting enzyme) data, as analyzed and made available by Lin and Altman (2004). This dataset contains 52 SNPs typed on DCP1 in 11 individuals. All the 22 phased chromosomes were used as a target sample. We then simulated haplotype samples of sizes ranging from 12 to 2000. For each sample size, 100 replicated samples were generated, and the average mean square errors of both marginal probabilities and correlation coefficients were obtained.

Figure 1 clearly shows that, as the sample size increases, the marginal probability estimates quickly converge to the target values; on the other hand, the error in the estimation of pairwise correlations does not go down to zero due to the covariance matrix approximation. The minimum bias obtained at large sample sizes is marked in Figure 1 by a horizontal line.



View larger version (7K):
[in this window]
[in a new window]
 
Fig. 1 ACE data: mean square error of marginal probabilities (dashed line) and pairwise correlations (solid line) against sample size.

 
Figure 2 reassures that this error is not much of a concern. The upper triangular matrix represents the LD coefficients as estimated from the real data, whereas the lower triangular matrix displays the corresponding LD in a small sample of 22 simulated haplotypes; it clearly emerges that the block-like LD structure of the real data is reproduced with high accuracy.



View larger version (58K):
[in this window]
[in a new window]
 
Fig. 2 ACE data: hybrid LD matrix obtained from real and artificial samples of 11 individuals each.

 

    DISCUSSION
 TOP
 ABSTRACT
 INTRODUCTION
 SIMULATION ALGORITHM
 ILLUSTRATION
 DISCUSSION
 REFERENCES
 
We have described a simple simulation procedure for generating artificial haplotypes matching pre-specified first and second order data summaries. An important assumption of our model is that the pairwise LD coefficients describe sufficiently well the real association structure among markers in the dataset at hand, whereas disequilibria of order greater than two (e.g. LD involving three or more markers) are close to zero, and can be ignored. When this assumption is violated, and in regions with very low LD, the simulated haplotype patterns may not always be in very good agreement with the observed ones.

Our implementation only relies on simulating a multivariate normal distribution, and therefore allows the user to efficiently generate many large samples in a matter of a few seconds. Even the initial and most computationally intensive task involving the solution of a large number of equations is quickly performed on a common desktop PC.

The R package provides a set of functions for haplotype simulation which is intended to be used as a component of a broader simulation study coded in R. Two potential applications are in order. The LD coefficient r2 arises naturally in the context of association mapping studies and is related to their power (Pritchard and Przeworski, 2001); therefore HapSim, being able to generate large samples of chromosomes that closely mimic observed r2 patterns, can be used as a building block of a simulation program for power and sample size estimation. We are currently employing it to characterize the properties of selected SNP tagging algorithms that only rely on pairwise LD. Other possible applications are under evaluation in a collaboration with GlaxoSmithKline.

Conflict of Interest: none declared.


    FOOTNOTES
 
1Multivariate normal variates are drawn by the mvrnorm function in the MASS library, also available from the CRAN repository. Back

Received on August 1, 2005; revised on September 21, 2005; accepted on September 22, 2005

    REFERENCES
 TOP
 ABSTRACT
 INTRODUCTION
 SIMULATION ALGORITHM
 ILLUSTRATION
 DISCUSSION
 REFERENCES
 

    Excoffier, L., et al. (2000) SIMCOAL: a general coalescent program for the simulation of molecular data in interconnected populations with arbitrary demography. J. Hered, . 91, 506–509[Free Full Text].

    Hayes, J.F. and Hill, W.G. (1981) Modification of estimates of parameters in the construction of genetic selection indices (‘bending’). Biometrics, 37, 483–493[CrossRef].

    Hudson, R.R. (2002) Generating samples under a Wright–Fisher neutral model of genetic variation. Bioinformatics, 18, 337–338[Abstract/Free Full Text].

    Lin, Z. and Altman, R.B. (2004) Finding haplotype tagging SNPs by use of principal components analysis. Am. J. Hum. Genet, . 75, 850–861[CrossRef][Web of Science][Medline].

    Pritchard, J.K. and Przeworski, M. (2001) Linkage disequilibrium in humans: models and data. Am. J. Hum. Genet, . 69, 1–14[CrossRef][Web of Science][Medline].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
F. A. Wright, H. Huang, X. Guan, K. Gamiel, C. Jeffries, W. T. Barry, F. Pardo-Manuel de Villena, P. F. Sullivan, K. C. Wilhelmsen, and F. Zou
Simulating association studies: a data-based resampling method for candidate regions or whole genome scans
Bioinformatics, October 1, 2007; 23(19): 2581 - 2588.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/23/4309    most recent
bti689v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (1)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Montana, G.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Montana, G.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?