Bioinformatics Advance Access originally published online on January 18, 2007
Bioinformatics 2007 23(6):774-776; doi:10.1093/bioinformatics/btl657
snp.plotter: an R-based SNP/haplotype association and linkage disequilibrium plotting package
1 GCAP/CBDB, NIMH/NIH, 10 Center Drive, Room 4S-235, Bethesda, MD 20814, USA and 2 Epidemiology, Johns Hopkins SPH, Baltimore, MD, USA
*To whom correspondence should be addressed.
ABSTRACT
Summary: snp.plotter is a newly developed R package that produces high-quality plots of results from genetic association studies. Its main features include options to display a linkage disequilibrium (LD) plot below the P-value plot using either the r² or D' LD metric, to space markers evenly along the X-axis or to place them according to the physical map, and to specify plot labels, colors, symbols and the LD heatmap color scheme. snp.plotter can plot single SNP and/or haplotype data and can plot multiple sets of results simultaneously. R is a free software environment for statistical computing and graphics available for most platforms. The proposed package provides a simple way to convey both association and LD information in a single appealing graphic for genetic association studies.
Availability: Downloadable R package and example datasets are available at http://cbdb.nimh.nih.gov/~kristin/snp.plotter.html and http://www.r-project.org
Contact: nicodemusk{at}mail.nih.gov
1 INTRODUCTION
Genetic association studies have been an important strategy for identifying susceptibility genes for a range of diseases including Alzheimer disease, deep vein thrombosis, inflammatory bowel disease, hypertriglyceridemia, diabetes and schizophrenia (Morton, 2005). Single nucleotide polymorphisms (SNPs) are often used to test for a statistical association between a disease phenotype and either single markers or, via haplotype-based analyses, multiple markers. SNPs may be tightly linked and exhibit correlation, known as linkage disequilibrium (LD). Knowledge of LD aids in the selection of SNPs and haplotypes to be examined for association with a disease (Abecasis et al., 2005) and in localizing a putative causal variant. Given the importance of LD to genetic association studies, researchers often plot the results of association studies in relation to the LD present in the chromosomal region or gene examined. However, the LD plot and the association result plot are often created separately using different software, which makes the two plots difficult to align and the resulting graphic unclear. We propose snp.plotter, which produces Portable Document Format (PDF) or Encapsulated PostScript (EPS) images of genetic association results using single SNP and/or haplotype data with a corresponding LD heatmap in one correctly aligned graphic.
2 SOFTWARE OVERVIEW
snp.plotter is a package for R, the freely available statistical computing and graphics environment, which runs on several platforms including Windows, MacOS and UNIX/Linux (R Development Core Team, 2006). Nearly all aspects of the images produced by snp.plotter are customizable, including labels, symbols, colors and color schemes, LD metric, graph P-value threshold, Y-axis scale, and lines corresponding to user-specified P-value thresholds. snp.plotter can visualize multiple sets of SNP and haplotype association results. Haplotype results can be plotted using global and/or individual haplotype P-values. P-value results may be plotted using physical spacing or can be evenly spaced. Even spacing of P-values aids in elucidating results in areas with dense SNP maps. Figures are produced in two print sizes (3.5 and 7 inches), corresponding to one and two columns, respectively, on a printed page, and in resolution-independent formats (PDF and EPS) for ease of use in manuscript preparation. snp.plotter figures can be easily imported into LaTeX documents, and because the formats used are resolution-independent, figures can be converted into raster image formats such as JPG, PNG and BMP without a loss in quality.
3 DATA INPUT
snp.plotter uses four types of input files: a configuration file, a single SNP results file and a haplotype results file for each result set, and a genotype data file; all files are plain-text and tab-delimited. The configuration file is the preferred method of running snp.plotter because it allows users to save preferred settings and avoids the difficulty of writing extended R commands. A sample configuration file follows:
SNP.FILE=snp20_ss.txt,snp20_ss2.txt
HAP.FILE=snp20_haplo.txt,snp20_haplo2.txt
GENOTYPE.FILE=snp20_geno.txt
DISP.LDMAP=TRUE
COLOR.LIST=blue,red
SYMBOLS=circle-fill,square
LD.TYPE=rsquare
IMAGE.TYPE=pdf
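Because the configuration file is a list of KEY=VALUE lines, it can also be generated from within R. The following is a minimal sketch using base R only; the helper name `write_config` and the output location are illustrative and are not part of snp.plotter.

```r
# Hypothetical helper (not part of snp.plotter): write KEY=VALUE lines.
write_config <- function(settings, path) {
  # Collapse vector-valued settings into comma-separated strings.
  values <- vapply(settings, function(v) paste(v, collapse = ","), character(1))
  lines <- paste0(names(settings), "=", values)
  writeLines(lines, path)
  invisible(lines)
}

# Settings copied from the sample configuration file above.
settings <- list(
  SNP.FILE      = c("snp20_ss.txt", "snp20_ss2.txt"),
  HAP.FILE      = c("snp20_haplo.txt", "snp20_haplo2.txt"),
  GENOTYPE.FILE = "snp20_geno.txt",
  DISP.LDMAP    = TRUE,
  COLOR.LIST    = c("blue", "red"),
  SYMBOLS       = c("circle-fill", "square"),
  LD.TYPE       = "rsquare",
  IMAGE.TYPE    = "pdf"
)

# The file is written to a temporary directory purely for illustration.
config_path <- file.path(tempdir(), "config.txt")
config_lines <- write_config(settings, config_path)
```

Keeping the settings in a named list makes it straightforward to vary a single option, such as LD.TYPE, and regenerate the file.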
The single SNP result set, SNP.FILE, must contain four columns: ASSOC, SNP.NAME, LOC and SS.PVAL, corresponding to positive or negative association (indicating susceptibility or protective alleles), a SNP label, the location and a P-value for each SNP.
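A tab-delimited file with these four columns can be assembled in base R. This is a sketch under stated assumptions: the SNP labels, positions and P-values are made up, and the "+"/"-" coding of ASSOC is assumed here, not taken from the text above.

```r
# Illustrative values only; the "+"/"-" coding of ASSOC is an assumption.
snp_results <- data.frame(
  ASSOC    = c("+", "-", "+", "-"),
  SNP.NAME = c("rs0001", "rs0002", "rs0003", "rs0004"),  # made-up labels
  LOC      = c(10500, 12750, 18200, 20100),              # made-up positions
  SS.PVAL  = c(0.0312, 0.4500, 0.0008, 0.2100),
  stringsAsFactors = FALSE
)

# Write a plain-text, tab-delimited file with a header row.
snp_path <- file.path(tempdir(), "example_ss.txt")
write.table(snp_results, snp_path, sep = "\t", quote = FALSE, row.names = FALSE)

# Read the file back and list any required column that is missing.
check <- read.delim(snp_path, stringsAsFactors = FALSE)
required <- c("ASSOC", "SNP.NAME", "LOC", "SS.PVAL")
missing_cols <- setdiff(required, names(check))
```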
Haplotypes are specified using three required columns: ASSOC, GBL.PVAL and IND.PVAL, corresponding to positive or negative association, a global P-value and an individual P-value for each haplotype. These are followed by one column per SNP containing the alleles of the corresponding haplotypes. Haplotypes are presented with the major allele given as 1 and the minor allele as 2; haplotype variants for a set of SNPs should be grouped together in the file. SNP labels in HAP.FILE must be the same as in SNP.FILE, and only SNPs with corresponding haplotypes need to be included.
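The layout described above can be sketched in base R for two 3-SNP windows with two haplotype variants each. All values are made up, and two details are assumptions that the text does not specify: ASSOC is coded "+"/"-", and SNPs outside a given haplotype window are left as NA.

```r
# Illustrative values only. Assumptions: ASSOC is coded "+"/"-", and SNPs
# that are not part of a haplotype window are left as NA.
hap_results <- data.frame(
  ASSOC    = c("+", "-", "+", "-"),
  GBL.PVAL = c(0.012, 0.012, 0.300, 0.300),  # shared within each window
  IND.PVAL = c(0.004, 0.210, 0.150, 0.480),  # one per haplotype variant
  rs0001   = c(1, 2, NA, NA),                # 1 = major allele, 2 = minor
  rs0002   = c(1, 1, 1, 2),
  rs0003   = c(2, 1, 2, 2),
  rs0004   = c(NA, NA, 1, 1)
)

# Rows 1-2 are variants of the first window (rs0001 to rs0003) and rows 3-4
# are variants of the second (rs0002 to rs0004), so variants stay grouped.
hap_path <- file.path(tempdir(), "example_haplo.txt")
write.table(hap_results, hap_path, sep = "\t", quote = FALSE, row.names = FALSE)

hap_check <- read.delim(hap_path)
allele_cols <- setdiff(names(hap_check), c("ASSOC", "GBL.PVAL", "IND.PVAL"))
```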
Genotype data are formatted in modified LINKAGE format pedigree files; this marker information is used in the creation of LD plots and may be based on the controls from a case-control study or the founders in a family-based study. An optional file type can be used to specify color schemes for LD plots; PALETTE.FILE colors are hexadecimal HTML color codes with one color per line. The first and last colors correspond to the lowest and highest value of the chosen LD metric, respectively.
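A palette file in this format can be produced with base R's colorRampPalette, which returns hexadecimal color codes. The sketch below is illustrative only; the choice of white-to-red, the number of colors and the file name are arbitrary.

```r
# Interpolate ten colors from white (lowest LD value) to red (highest).
ld_colors <- colorRampPalette(c("white", "red"))(10)

# Write one hexadecimal color code per line.
palette_path <- file.path(tempdir(), "palette.txt")
writeLines(ld_colors, palette_path)
```

The first line of the resulting file is the color for the lowest value of the chosen LD metric and the last line is the color for the highest.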
4 snp.plotter USAGE
The package uses the grid graphics package to create and place individual graphic elements, and the genetics package to calculate linkage disequilibrium (Warnes and Leisch, 2005). Modified code from the LDheatmap package is used to create an LD heatmap (Shin et al., 2006). Once snp.plotter and its dependencies are installed, snp.plotter can be loaded into R using this command:
library(snp.plotter)
snp.plotter is then run using the following command; this command produces the desired figure in the current working directory:
snp.plotter(config.file="config.txt")
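For readers unfamiliar with the two LD metrics that the package can display, the standard definitions of D' and r² for a pair of biallelic SNPs can be sketched as follows. This is not the code used by the genetics package, which works from genotype data; the sketch assumes that the haplotype and allele frequencies are already known, and the function name `ld_metrics` is illustrative.

```r
# Pairwise LD from the frequency of the A-B haplotype (p_ab) and the
# frequencies of allele A (p_a) and allele B (p_b) at two biallelic SNPs.
ld_metrics <- function(p_ab, p_a, p_b) {
  # Raw disequilibrium coefficient.
  d <- p_ab - p_a * p_b
  # Largest magnitude that D could take given the allele frequencies.
  d_max <- if (d >= 0) {
    min(p_a * (1 - p_b), (1 - p_a) * p_b)
  } else {
    min(p_a * p_b, (1 - p_a) * (1 - p_b))
  }
  c(D = d,
    Dprime = abs(d) / d_max,  # reported here as an absolute value
    rsquare = d^2 / (p_a * (1 - p_a) * p_b * (1 - p_b)))
}

ld <- ld_metrics(p_ab = 0.4, p_a = 0.5, p_b = 0.5)
# D = 0.15, Dprime = 0.6, rsquare = 0.36
```

In this example the two metrics differ noticeably (0.6 versus 0.36), which is one reason a choice between them is offered.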
In addition, an optional web interface for snp.plotter, built on the Rpad R package, is available for download. The web interface is best suited to intranet environments because users have complete access to any command in R and any system command (Short and Grosjean, 2006). snp.plotter must be installed on the machine running Rpad. Instructions for server deployment are presented on the Rpad website. The interface includes the majority of features but is limited to one result set. With basic knowledge of HTML and R, the snp.plotter interface can be extended to change the options presented or to perform additional analyses the researcher may require.
5 EXAMPLE
The HapMap Project catalogs SNPs from populations with African, Asian and European ancestry (The International HapMap Consortium, 2005). Sample data for 20 SNPs were obtained from HapMap, and two case-control populations with 500 cases and 500 controls were simulated using the Simulation of Haplotype Heterogeneity, Interaction and Population Stratification (SH2IPS) R package (Nicodemus and Luna, 2006). Logistic regression was used to determine association of each SNP with the disease phenotype. Disease association of haplotypes was evaluated with haplo.stats using a 3-SNP sliding window (Schaid et al., 2002). The results are presented in Figure 1 using snp.plotter. Single SNP and global haplotype P-values are shown for the two populations; the adjoining LD plot uses the r² metric.
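The single SNP step of such an analysis can be sketched in base R. The genotypes below are random stand-ins generated under no association, not the HapMap or SH2IPS data, and the additive coding of each SNP (0, 1 or 2 minor alleles) is an assumption, since the text does not state which genetic model was fitted.

```r
set.seed(1)  # arbitrary seed so the simulated data are reproducible
n_cases <- 500
n_controls <- 500
n_snps <- 20
status <- c(rep(1, n_cases), rep(0, n_controls))

# Simulated minor-allele counts (0, 1 or 2) under no association; a
# stand-in for real genotype data, not the SH2IPS simulation.
genotypes <- matrix(
  rbinom((n_cases + n_controls) * n_snps, size = 2, prob = 0.3),
  ncol = n_snps
)

# Fit one logistic regression per SNP and keep the P-value of the SNP term.
ss_pval <- apply(genotypes, 2, function(g) {
  fit <- glm(status ~ g, family = binomial)
  coef(summary(fit))["g", "Pr(>|z|)"]
})
```

The resulting vector holds one P-value per SNP, which is the quantity stored in the SS.PVAL column of a single SNP result file.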
ACKNOWLEDGEMENTS
We are grateful to Dr Daniel Weinberger, Dr Steven Huffaker, and Anushka Aqil for comments and feedback and to Dr Richard Coppola for help with Rpad.
Conflict of Interest: none declared.
FOOTNOTES
Associate Editor: Martin Bishop
Received on November 21, 2006; revised on November 21, 2006; accepted on December 21, 2006
REFERENCES
Abecasis GR, et al. Linkage disequilibrium: ancient history drives the new genetics. Hum. Hered. (2005) 59:118–124.
The International HapMap Consortium. A haplotype map of the human genome. Nature (2005) 437:1299–1320.
Morton NE. Linkage disequilibrium maps and association mapping. J. Clin. Invest. (2005) 115:1425–1430.
Nicodemus KK, Luna A. Simulation of haplotype heterogeneity, interaction, and population stratification. R package version 1.0. (2006).
R Development Core Team. R: a language and environment for statistical computing. (2006).
Schaid DJ, et al. Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am. J. Hum. Genet. (2002) 70:425–434.
Shin J, et al. LDheatmap: Graphical display of pairwise linkage disequilibria between SNPs. R package version 0.2. (2006).
Short T, Grosjean P. Rpad: workbook-style, web-based interface to R. R package version 1.1.1. (2006).
Warnes G, Leisch F. Genetics: population genetics. R package version 1.2.0. (2005).
