Bioinformatics Advance Access originally published online on December 5, 2006
Bioinformatics 2007 23(2):252-254; doi:10.1093/bioinformatics/btl574
© 2006 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

LdCompare: rapid computation of single- and multiple-marker $r^2$ and genetic coverage

K. Hao †,*, X. Di † and S. Cawley †

Algorithm and Data Analysis, Affymetrix Inc. 3420 Central Expressway, Santa Clara, California, USA

*To whom correspondence should be addressed.


    ABSTRACT
 

Summary: The scale of genetic-variation datasets has increased enormously, and the linkage disequilibrium (LD) structure of these polymorphisms, particularly in whole-genome association studies, is of great interest. Calculating single- and multiple-marker correlations at a genome-wide scale remains computationally challenging. We have developed a program that efficiently characterizes whole-genome LD structure on large numbers of SNPs in terms of single- and multiple-marker correlations.

Availability: LdCompare is licensed under the GNU General Public License (GPL). Source code, documentation, testing datasets and precompiled executables are available for download at: http://www.affymetrix.com/support/developer/tools/devnettools.affx

Contact: ke_hao{at}affymetrix.com


    1 INTRODUCTION
 
The scale of genetic-variation datasets has increased enormously [~9 million SNPs in current dbSNP database, http://www.ncbi.nlm.nih.gov/SNP; (Karchin et al., 2005; Wang et al., 2005)]. Recently, the international HapMap project genotyped ~4 million SNPs in four ethnic groups (Altshuler et al., 2005). These developments have enabled, for the first time, genome-wide association studies targeting clinical traits (de Bakker et al., 2005; Wang et al., 2005). To date, it is still technically and financially prohibitive to type all SNPs on a meaningful sample size. Fortunately the correlation among nearby markers [linkage disequilibrium (LD)] allows us to survey a subset of all SNPs and infer nearby ungenotyped markers (de Bakker et al., 2005). Furthermore, rapid advances in biotechnology currently enable typing of 500 000 or more SNPs in a single experiment (Marchini et al., 2006).

In parallel, there is great interest in statistical approaches to efficiently design and analyze genome-wide association studies. One important aspect is evaluating the LD structure of a SNP panel (i.e. a collection of SNPs) and its genomic coverage. As a brief definition, a panel's coverage is the fraction of SNPs, among a targeted pool, exceeding some correlation threshold with at least one SNP genotyped by the panel. Both single- and multiple-marker coverage (using single markers and combinations of markers, respectively, to tag the SNPs in the target data) are frequently used in estimating the statistical power of a study (Altshuler et al., 2005; Barrett et al., 2005; de Bakker et al., 2005). However, such computation on large numbers of SNPs is a challenging task. In the context of HapMap Phase II (~4 million SNPs) and the relatively simple case of computing single-marker coverage, determining all pairwise correlations for markers located within 100 kb of one another involves ~$10^8$ calculations of $r^2$. In the more complicated scenario of three-marker coverage, ~$10^{13}$ $r^2$ evaluations are needed. Furthermore, the computational demand jumps exponentially as the sliding window size increases. To our knowledge, LdCompare is the first tool capable of practical evaluation of whole-genome three-marker coverage.
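The orders of magnitude quoted above can be checked with a back-of-envelope estimate. The sketch below is illustrative only: it assumes a genome length of about 3 × 10^9 bp and uniformly spaced SNPs, neither of which is stated in the text, and it does not claim to reproduce how the authors derived their figures.

```python
from math import comb

GENOME_BP = 3.0e9      # ASSUMPTION: approximate genome length, not given in the text
N_SNPS = 4_000_000     # HapMap Phase II SNP count quoted in the text
WINDOW_BP = 100_000    # markers located within 100 kb of one another

# ASSUMPTION: SNPs are spread uniformly along the genome.
density = N_SNPS / GENOME_BP
neighbours = round(2 * WINDOW_BP * density)   # SNPs within +/-100 kb of a target

# Single-marker case: each unordered pair within 100 kb is evaluated once.
pairwise = N_SNPS * neighbours // 2

# Three-marker case: each target SNP is tested against every
# 3-SNP combination drawn from its neighbours.
three_marker = N_SNPS * comb(neighbours, 3)

print(f"neighbours per SNP:          {neighbours}")
print(f"pairwise r2 evaluations:     {pairwise:.1e}")
print(f"three-marker r2 evaluations: {three_marker:.1e}")
```

Under these assumptions the two counts fall in the $10^8$ and $10^{13}$ ranges, respectively, consistent with the figures given in the text.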


    2 PROGRAM FEATURE SUMMARY
 

  • The program has a uniform framework for single- and multiple-marker modes. It automatically detects the running mode from the settings in the parameter file.
  • Both diploid genotypes and phased haplotypes can be accommodated. Pairwise $r^2$ and single-marker coverage can be computed in either case; multiple-marker $r^2$ and coverage require phased haplotype data (because reconstructing 3- or 4-SNP haplotypes on the fly could be very time-consuming and potentially less accurate).
  • Standard linkage format input files (http://www.broad.mit.edu/mpg/haploview/files.php) are used for diploid data, from which an EM algorithm is applied to reconstruct two-marker haplotypes. Pre-phased haplotype data are also accommodated in a straightforward input file format.
  • The program outputs single- and multiple-marker $r^2$. Researchers can use this information to compute the cumulative distribution of coverage and for downstream tag SNP selection.
  • A list of user-defined parameters, such as a minor allele frequency (MAF) filter, is provided for maximum flexibility.
  • The program is written in C++. Users are welcome to modify and redistribute it under the GNU General Public License (GPL).
  • The code is optimized for rapid computation and memory efficiency. The program has been developed and tested on Microsoft Windows, Linux and Sun Microsystems Solaris.
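The EM reconstruction of two-marker haplotypes from diploid genotypes mentioned in the list above can be sketched as follows. This is a textbook two-locus EM written in Python for illustration, not LdCompare's C++ code; the function names, the 0/1/2 genotype coding and the uniform starting frequencies are all assumptions. Only double heterozygotes are phase-ambiguous, so the E-step splits them between the two possible phasings in proportion to the current haplotype frequencies.

```python
from collections import Counter

HAPLOTYPES = [(0, 0), (0, 1), (1, 0), (1, 1)]

def em_two_snp_haplotypes(genotypes, max_iter=200, tol=1e-12):
    """Estimate two-SNP haplotype frequencies from unphased genotypes.

    genotypes: list of (g1, g2), each the number of copies (0, 1 or 2)
    of the allele coded 1 at SNP 1 and SNP 2 (hypothetical coding).
    """
    base = Counter()        # haplotype counts from unambiguous individuals
    n_double_het = 0        # individuals heterozygous at both SNPs
    for g1, g2 in genotypes:
        if g1 == 1 and g2 == 1:
            n_double_het += 1
            continue
        a = (0, 1) if g1 == 1 else (g1 // 2, g1 // 2)
        b = (0, 1) if g2 == 1 else (g2 // 2, g2 // 2)
        base[(a[0], b[0])] += 1
        base[(a[1], b[1])] += 1

    total = 2 * len(genotypes)
    freqs = {h: 0.25 for h in HAPLOTYPES}   # assumed starting point
    for _ in range(max_iter):
        # E-step: probability that a double heterozygote is 00/11 rather than 01/10.
        cis = freqs[(0, 0)] * freqs[(1, 1)]
        trans = freqs[(0, 1)] * freqs[(1, 0)]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: add expected counts from double heterozygotes and renormalize.
        counts = {h: float(base[h]) for h in HAPLOTYPES}
        counts[(0, 0)] += w * n_double_het
        counts[(1, 1)] += w * n_double_het
        counts[(0, 1)] += (1 - w) * n_double_het
        counts[(1, 0)] += (1 - w) * n_double_het
        new = {h: c / total for h, c in counts.items()}
        delta = max(abs(new[h] - freqs[h]) for h in HAPLOTYPES)
        freqs = new
        if delta < tol:
            break
    return freqs

def r2_from_haplotype_freqs(freqs):
    """Pairwise r2 from two-SNP haplotype frequencies."""
    p = freqs[(1, 0)] + freqs[(1, 1)]       # allele 1 frequency at SNP 1
    q = freqs[(0, 1)] + freqs[(1, 1)]       # allele 1 frequency at SNP 2
    denom = p * (1 - p) * q * (1 - q)
    if denom == 0:
        return 0.0                          # a monomorphic SNP carries no LD information
    d = freqs[(1, 1)] - p * q
    return d * d / denom

# Toy data (made up): two SNPs in complete LD.
toy = [(0, 0)] * 5 + [(2, 2)] * 5 + [(1, 1)] * 10
toy_freqs = em_two_snp_haplotypes(toy)
print(toy_freqs, r2_from_haplotype_freqs(toy_freqs))
```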


    3 ALGORITHM DETAILS
 
Consider two collections, or panels, of SNPs, denoted P1 and P2, both genotyped on the same sample set. One aim of LdCompare is to calculate the single- and multiple-marker $r^2$ and coverage of one SNP panel (e.g. P1) on the other (e.g. P2). For memory efficiency, the computation is carried out one chromosome at a time. For each P2 SNP we keep track of the maximum $r^2$ to any P1 SNP located within a specified distance (typically 100 kb). Once the maximal $r^2$ has been calculated for each SNP in P2, the genome-wide coverage can be characterized.
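The bookkeeping described above can be sketched for a single chromosome as follows. This is a minimal Python illustration assuming phased 0/1 haplotype data and P1 sorted by position; the function names, the data layout and the 0.8 default threshold are hypothetical and are not taken from LdCompare. The pairwise statistic is the standard $r^2 = D^2 / [p_x(1-p_x)p_y(1-p_y)]$ with $D = p_{xy} - p_x p_y$.

```python
from bisect import bisect_left, bisect_right

def r2(x, y):
    """Standard r2 between two bi-allelic markers given as 0/1 lists
    (one entry per phased chromosome)."""
    n = len(x)
    px, py = sum(x) / n, sum(y) / n
    pxy = sum(a & b for a, b in zip(x, y)) / n
    denom = px * (1 - px) * py * (1 - py)
    if denom == 0:
        return 0.0                      # monomorphic marker: no LD information
    d = pxy - px * py
    return d * d / denom

def max_r2_per_target(p1, p2, window_bp=100_000):
    """For each P2 SNP, the maximum r2 to any P1 SNP within window_bp.

    p1, p2: lists of (position, alleles) on one chromosome; p1 must be
    sorted by position (assumed layout).
    """
    positions = [pos for pos, _ in p1]
    best = []
    for pos2, hap2 in p2:
        lo = bisect_left(positions, pos2 - window_bp)
        hi = bisect_right(positions, pos2 + window_bp)
        best.append(max((r2(hap1, hap2) for _, hap1 in p1[lo:hi]), default=0.0))
    return best

def coverage(max_r2, threshold=0.8):
    """Fraction of target SNPs whose best r2 reaches the threshold."""
    return sum(v >= threshold for v in max_r2) / len(max_r2)

# Toy data (made up): two panel SNPs, three target SNPs, four chromosomes.
panel = [(1_000, [0, 0, 1, 1]), (500_000, [0, 1, 0, 1])]
targets = [(1_500, [0, 0, 1, 1]), (2_000, [0, 1, 0, 1]), (499_000, [1, 0, 1, 0])]
best = max_r2_per_target(panel, targets)
print(best, coverage(best))
```

In the toy data the second target matches the panel SNP at position 500 000 perfectly but lies outside the 100 kb window, so it is not covered.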

3.1 Two-marker coverage
Here we implement an algorithm proposed previously (de Bakker et al., 2005; Pe'er et al., 2006). Consider three SNPs A, B and C. Each SNP carries two possible alleles, denoted A and a, B and b, and C and c, respectively. We are interested in the coverage of C by A and B. The first step is to compute the LD, in terms of $r^2$, between SNP C and SNPs A and B. SNPs A and B may form four possible haplotypes (AB, Ab, aB and ab), so together they can be treated as a multi-allelic marker carrying four alleles. Pooling {Ab, aB, ab}, we transform this multi-allelic marker into a bi-allelic SNP that carries alleles AB and nonAB. We then compute the $r^2$ between this new bi-allelic SNP and SNP C, and record the result as $r^2_{AB,C}$. Similarly, we calculate $r^2_{Ab,C}$ by pooling {AB, aB, ab}, and likewise $r^2_{aB,C}$ and $r^2_{ab,C}$. Furthermore, we compute the $r^2$ between SNP A and SNP C, recorded as $r^2_{A,C}$, as well as the $r^2$ between SNP B and SNP C, recorded as $r^2_{B,C}$. We define the $r^2$ between SNP C and SNPs A and B as $\max\{r^2_{AB,C}, r^2_{Ab,C}, r^2_{aB,C}, r^2_{ab,C}, r^2_{A,C}, r^2_{B,C}\}$. The second step is to compute coverage, which requires a pre-specified threshold ($r^2_{\mathrm{threshold}}$). We define SNP C as covered by SNPs A and B if $\max\{r^2_{AB,C}, r^2_{Ab,C}, r^2_{aB,C}, r^2_{ab,C}, r^2_{A,C}, r^2_{B,C}\} \ge r^2_{\mathrm{threshold}}$.
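The two-step procedure above can be sketched in a few lines of Python. This is an illustration of the described pooling scheme on phased 0/1 haplotype data, not the LdCompare source; the function names, the allele coding and the 0.8 default threshold are assumptions.

```python
def r2(x, y):
    """Standard r2 between two bi-allelic markers given as 0/1 lists
    (one entry per phased chromosome)."""
    n = len(x)
    px, py = sum(x) / n, sum(y) / n
    pxy = sum(a & b for a, b in zip(x, y)) / n
    denom = px * (1 - px) * py * (1 - py)
    if denom == 0:
        return 0.0                      # monomorphic marker: no LD information
    d = pxy - px * py
    return d * d / denom

def two_marker_r2(a, b, c):
    """Maximum over the two single-marker r2 values and the four
    'one haplotype versus the pooled rest' r2 values."""
    candidates = [r2(a, c), r2(b, c)]
    for allele_a in (0, 1):
        for allele_b in (0, 1):
            # Bi-allelic pseudo-SNP: 1 if the chromosome carries this A-B
            # haplotype, 0 for the three pooled haplotypes.
            pseudo = [int(x == allele_a and y == allele_b) for x, y in zip(a, b)]
            candidates.append(r2(pseudo, c))
    return max(candidates)

def is_covered(a, b, c, threshold=0.8):
    """Step two: compare the two-marker r2 with the chosen threshold."""
    return two_marker_r2(a, b, c) >= threshold

# Toy data (made up): C carries allele 1 only on chromosomes carrying 1 at both A and B.
snp_a = [0, 0, 1, 1, 0, 0, 1, 1]
snp_b = [0, 1, 0, 1, 0, 1, 0, 1]
snp_c = [0, 0, 0, 1, 0, 0, 0, 1]
print(r2(snp_a, snp_c), r2(snp_b, snp_c), two_marker_r2(snp_a, snp_b, snp_c))
```

In this toy example neither A nor B alone tags C well (each single-marker $r^2$ is 1/3), whereas the pooled two-marker haplotype tags it perfectly.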

3.2 Three+ marker coverage
Now consider four SNPs (A, B, C and D), where we are interested in the coverage of SNP D by SNPs A, B and C. A, B and C form $2^3 = 8$ possible haplotypes. Again, we construct a new bi-allelic SNP by pooling seven of the haplotypes together, and obtain the $r^2$ after eight iterations, one per haplotype. Coverage by four or more markers can be computed in the same framework, but has not been implemented at the current stage.
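The haplotype-pooling loop generalizes naturally to any number of predictor SNPs. The Python sketch below iterates over all $2^k$ haplotypes of $k$ predictors, so it covers the three-marker case with eight iterations; it is an illustration under the assumption of phased 0/1 data, with hypothetical function names, and it evaluates only the pooled-haplotype terms described in this subsection.

```python
from itertools import product

def r2(x, y):
    """Standard r2 between two bi-allelic markers given as 0/1 lists
    (one entry per phased chromosome)."""
    n = len(x)
    px, py = sum(x) / n, sum(y) / n
    pxy = sum(a & b for a, b in zip(x, y)) / n
    denom = px * (1 - px) * py * (1 - py)
    if denom == 0:
        return 0.0                      # e.g. a haplotype that is never observed
    d = pxy - px * py
    return d * d / denom

def multi_marker_r2(predictors, target):
    """Best r2 between the target and any single predictor haplotype
    contrasted with all other haplotypes pooled together."""
    observed = list(zip(*predictors))   # one k-SNP haplotype per chromosome
    best = 0.0
    for hap in product((0, 1), repeat=len(predictors)):   # 2**k iterations
        pseudo = [int(obs == hap) for obs in observed]
        best = max(best, r2(pseudo, target))
    return best

# Toy data (made up): D carries allele 1 only on the A=1, B=1, C=0 haplotype.
snp_a = [0, 0, 0, 0, 1, 1, 1, 1]
snp_b = [0, 0, 1, 1, 0, 0, 1, 1]
snp_c = [0, 1, 0, 1, 0, 1, 0, 1]
snp_d = [0, 0, 0, 0, 0, 0, 1, 0]
print(multi_marker_r2([snp_a, snp_b, snp_c], snp_d))
```

Because the number of haplotypes doubles with each added predictor, and the number of predictor combinations per window grows combinatorially, the cost of this approach rises quickly with $k$.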


    4 RESULTS AND DISCUSSIONS
 
LdCompare is written in ANSI C++ and is usable on most operating systems. Running on a 2.8 GHz Intel Xeon workstation with 1 GB of RAM, it computes the whole-genome (~4 million HapMap SNPs, 30 trios) pairwise $r^2$ and the single-marker coverage of the Affymetrix® Mapping 500 K Product (using a ±100 kb sliding window) within 2 h. Two-marker $r^2$ and coverage take about 12 h on a single CPU. Three-marker coverage computation is divided into small jobs (by chromosome), so these jobs can be carried out in parallel; the largest chromosome finishes within 24 h. Computational intensity rises exponentially as the sliding window size increases.

Using this tool, we have evaluated the coverage of the Affymetrix Mapping 500 K Product on HapMap Phase II SNPs (release 19). Both datasets are publicly available, at http://www.hapmap.org/downloads/index.html.en and http://www.affymetrix.com/support/technical/sample_data/500k_hapmap_genotype_data.affx, respectively. Since the 500 000 SNPs have not been selected from a particular set of SNPs or human population, the coverage decays smoothly as the $r^2$ threshold rises (Fig. 1). In contrast, if a SNP panel is selected by Carlson's greedy approach (Carlson et al., 2004) using an arbitrary threshold $r^2_{\mathrm{threshold}}$, the coverage drops sharply when $r^2 > r^2_{\mathrm{threshold}}$ (data not shown). Two-marker coverage offers a substantial gain over single-marker coverage (Fig. 1); however, the additional gain is smaller when we extend to three-marker coverage. Moreover, the coverage remains nearly unchanged between ±100 kb and ±200 kb sliding windows (data not shown), suggesting that there is little 'useful' $r^2$ (e.g. $r^2 \ge 0.8$) beyond the 100 kb range in the human genome.


Fig. 1 Single-, two- and three-marker coverage of the Affymetrix Mapping 500 K product on the human genome using ±100 kb sliding window.
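A coverage curve of the kind shown in Fig. 1 can be derived from the per-SNP maximum $r^2$ values by sweeping the threshold. The Python sketch below is illustrative only: the function name is hypothetical and the input values are made-up toy numbers, not results from the paper.

```python
def coverage_curve(max_r2_values, thresholds):
    """Return (threshold, coverage) pairs, where coverage is the fraction
    of target SNPs whose maximum r2 is at least the threshold."""
    n = len(max_r2_values)
    return [(t, sum(v >= t for v in max_r2_values) / n) for t in thresholds]

# Toy per-SNP maximum r2 values (made up for illustration).
toy_max_r2 = [1.0, 0.9, 0.5, 0.2]
curve = coverage_curve(toy_max_r2, [0.0, 0.5, 0.8, 1.0])
for threshold, cov in curve:
    print(f"r2 >= {threshold:.1f}: coverage {cov:.2f}")
```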

 
In summary, LdCompare provides geneticists with a practical solution for characterizing single- or multiple-marker LD patterns at the whole-genome scale, and provides two- or three-marker $r^2$ estimation. We have compared the single- and two-marker coverage results from LdCompare with those from Haploview and found them to be near-identical, though LdCompare is significantly faster (in large part because it is a far simpler program designed for the sole purpose of computing multi-marker $r^2$, whereas Haploview performs many additional calculations). The single- and multiple-marker $r^2$ computed by LdCompare also facilitates downstream tagSNP selection. Although the HapMap project provides tagSNPs, in many cases researchers still need to conduct tagSNP selection suited to their own study, because many SNP tagging algorithms exist and show various strengths (e.g. efficiency or tagSNP portability). Also, studies indicate that HapMap project tagSNPs have restricted applicability in some regions (Mueller et al., 2005; Willer et al., 2006). More importantly, tagSNPs that incorporate multiple-marker LD will further boost the power of association tests.


    Acknowledgments
 
The authors thank Dr Itsik Pe'er from the Broad Institute at Harvard and MIT for providing phased HapMap II data. Funding to pay the Open Access publication charges for this article was provided by Algorithm and Data Analysis Group, Affymetrix, Inc.

Conflict of Interest: none declared.


    FOOTNOTES
 
†The authors wish it to be known that, in their opinion, the first three authors should be regarded as joint First Authors.

Associate Editor: Martin Bishop

Received on September 21, 2006; revised on November 10, 2006; accepted on November 11, 2006

    REFERENCES
 

    Altshuler, D., et al. (2005) A haplotype map of the human genome. Nature, 437, 1299–1320.

    Barrett, J.C., et al. (2005) Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics, 21, 263–265.

    Carlson, C.S., et al. (2004) Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. Am. J. Hum. Genet., 74, 106–120.

    de Bakker, P.I., et al. (2005) Efficiency and power in genetic association studies. Nat. Genet., 37, 1217–1223.

    Karchin, R., et al. (2005) LS-SNP: large-scale annotation of coding non-synonymous SNPs based on multiple information sources. Bioinformatics, 21, 2814–2820.

    Marchini, J., et al. (2006) A comparison of phasing algorithms for trios and unrelated individuals. Am. J. Hum. Genet., 78, 437–450.

    Mueller, J.C., et al. (2005) Linkage disequilibrium patterns and tagSNP transferability among European populations. Am. J. Hum. Genet., 76, 387–398.

    Pe'er, I., et al. (2006) Evaluating and improving power in whole genome association studies using fixed marker sets. Nat. Genet., 38, 663–667.

    Wang, W.Y., et al. (2005) Genome-wide association studies: theoretical and practical concerns. Nat. Rev. Genet., 6, 109–118.

    Wang, X., et al. (2005) Single nucleotide polymorphism in transcriptional regulatory regions and expression of environmentally responsive genes. Toxicol. Appl. Pharmacol., 207, 84–90.

    Willer, C.J., et al. (2006) Tag SNP selection for Finnish individuals based on the CEPH Utah HapMap database. Genet. Epidemiol., 30, 180–190.

