Bioinformatics Advance Access originally published online on March 23, 2007
Bioinformatics 2007 23(10):1294-1296; doi:10.1093/bioinformatics/btm108
GenABEL: an R library for genome-wide association analysis
¹Department of Epidemiology and Biostatistics, Erasmus MC Rotterdam, Postbus 2040, 3000 CA Rotterdam, The Netherlands and ²Statistical Genetics Group, Max-Planck-Institute of Psychiatry, Kraepelinstr. 10, D-80804 Munich, Germany
*To whom correspondence should be addressed.
ABSTRACT
Here we describe an R library for genome-wide association (GWA) analysis. It implements effective storage and handling of GWA data, fast procedures for genetic data quality control, testing of association of single nucleotide polymorphisms with binary or quantitative traits, visualization of results and also provides easy interfaces to standard statistical and graphical procedures implemented in base R and special R libraries for genetic analysis. We evaluated GenABEL using one simulated and two real data sets. We conclude that GenABEL enables the analysis of GWA data on desktop computers.
Availability: http://cran.r-project.org
Contact: i.aoultchenko{at}erasmusmc.nl
1 INTRODUCTION
Genome-wide association (GWA) analysis is a tool of choice for identifying genes for complex traits. Effective storage, handling and analysis of GWA data represent a challenge to modern computational genetics. GWA studies generate large amounts of data: hundreds of thousands of single nucleotide polymorphisms (SNPs) are genotyped in hundreds or thousands of patients and controls. Data on each SNP undergo several types of analysis: characterization of the frequency distribution, testing of Hardy–Weinberg equilibrium, analysis of association of single SNPs and of haplotypes with different traits, and so on. Because SNP genotypes in dense marker sets are correlated, significance testing in GWA analysis is preferably performed using computationally intensive permutation procedures, which further increases the computational burden (Evans and Cardon, 2006).
Effective software making GWA analysis possible on desktop computers should meet the following criteria:
(1) Facilitate effective data storage and manipulation.
(2) Give access to a wide range of statistical and graphical tools.
(3) Implement fast procedures for specific GWA tests.
With these objectives in mind, we developed the GenABEL software, implemented as an R library. R is a free, open source language and environment for statistical analysis (http://www.r-project.org/). Building upon existing statistical analysis facilities allowed for rapid development of the package.
2 IMPLEMENTATION
2.1 Objective (1)
Storing GWA data in standard R data types is inefficient. A SNP genotype for a single person may take four values (AA, AB, BB and missing), so two bits are sufficient to store it. However, the standard R data types occupy 32 bits, leading to an overhead of 1500% compared to the theoretical optimum. Use of the raw R data format, occupying eight bits, would still leave 75% of the RAM used inefficiently; moreover, this data type cannot be used directly in an analysis. We developed a new R data class, snp.data, which uses the optimal two bits to store a single SNP genotype. The standard R subsetting model was applied to this class, allowing retrieval of subsets of the data by SNP and study subject index, name or logical condition. Coercion to the R integer and character types, and to the data types used by the "haplo.stats" (Schaid et al., 2002) and "genetics" libraries, was implemented.
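The storage argument can be illustrated with a minimal Python sketch that packs four genotypes into each byte and reproduces the 1500% and 75% figures. The numeric genotype codes, the bit order and all function names here are assumptions made for illustration; the text does not describe the internal layout of snp.data.

```python
import numpy as np

# Hypothetical 2-bit codes; the actual encoding used by snp.data is not given in the text.
MISSING, AA, AB, BB = 0, 1, 2, 3


def pack_genotypes(codes):
    """Pack genotype codes (values 0..3) into bytes, four genotypes per byte."""
    codes = np.asarray(codes, dtype=np.uint8)
    if codes.size == 0:
        return np.zeros(0, dtype=np.uint8)
    if codes.max() > 3:
        raise ValueError("genotype codes must fit in two bits")
    pad = (-codes.size) % 4  # pad with MISSING up to a multiple of four
    padded = np.concatenate([codes, np.zeros(pad, dtype=np.uint8)]).reshape(-1, 4)
    # The first genotype of each group occupies the two highest bits of the byte.
    return (padded[:, 0] << 6) | (padded[:, 1] << 4) | (padded[:, 2] << 2) | padded[:, 3]


def unpack_genotypes(packed, n):
    """Recover the first n genotype codes from a packed byte array."""
    packed = np.asarray(packed, dtype=np.uint8)
    shifts = np.array([6, 4, 2, 0], dtype=np.uint8)
    codes = (packed[:, None] >> shifts) & 3
    return codes.reshape(-1)[:n]


def overhead_percent(bits_used, bits_needed=2):
    """Extra storage relative to the theoretical optimum, in percent."""
    return 100.0 * (bits_used - bits_needed) / bits_needed


def wasted_fraction(bits_used, bits_needed=2):
    """Share of the allocated bits that carry no genotype information."""
    return (bits_used - bits_needed) / bits_used
```

With these definitions, `overhead_percent(32)` gives 1500.0 and `wasted_fraction(8)` gives 0.75, matching the figures quoted for 32-bit and 8-bit storage.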
2.2 Objective (2)
R provides extensive statistical analysis and graphical facilities, which was one of the reasons why we implemented GenABEL as an R library. The functions scan.glm and scan.glm.2D were developed to iteratively apply the standard R procedure glm (estimation of generalized linear models) to GWA data. The functions scan.haplo and scan.haplo.2D use the "haplo.stats" library to run sliding-window haplotype analysis and to evaluate the associations between a trait and haplotypes formed by all possible pairs of SNPs in a region. These functions are relatively slow and are aimed at the analysis of selected regions. To represent the objects generated by GenABEL graphically, new methods were designed for the generic R function plot.
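The idea of applying a regression model to one SNP at a time can be sketched as follows. This is only an illustration of the looping pattern, assuming a plain linear model fitted by least squares and genotypes coded as allele counts without missing values; it does not reproduce the interface or output of scan.glm, and the function name is hypothetical.

```python
import numpy as np


def scan_linear(genotypes, trait):
    """Fit trait ~ intercept + genotype separately for each SNP column.

    genotypes: (subjects x SNPs) array of allele counts (an assumed coding).
    Returns the estimated slope for every SNP.
    """
    genotypes = np.asarray(genotypes, dtype=float)
    trait = np.asarray(trait, dtype=float)
    n_subjects, n_snps = genotypes.shape
    slopes = np.empty(n_snps)
    for j in range(n_snps):
        # Design matrix with an intercept column and the j-th SNP.
        design = np.column_stack([np.ones(n_subjects), genotypes[:, j]])
        coef, *_ = np.linalg.lstsq(design, trait, rcond=None)
        slopes[j] = coef[1]
    return slopes
```

Refitting a full model for every SNP is flexible but costly, which is consistent with the text's remark that such functions suit selected regions rather than whole-genome scans.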
2.3 Objective (3)
Fast statistical genetic analysis procedures were implemented in ANSI C and integrated into our library. These procedures facilitate data quality control and rapid single-SNP GWA analysis. The check.trait function provides summary statistics for phenotypic data and checks for outliers at a specified P-value or false discovery rate cut-off level. The function check.marker, based on summary.snp.data, allows the selection of a set of SNPs which pass user-specified criteria on call rate, redundancy, minimal marker allele frequency and deviation from Hardy–Weinberg equilibrium (using an exact test, Wigginton et al., 2005). The functions ccfast and qtscore enable fast GWA analysis of case-control data and quantitative traits, respectively. The functions emp.ccfast and emp.qtscore were developed to estimate empirical genome-wide significance.
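The marker quality-control criteria can be sketched in Python as below. This is a simplified illustration, not the check.marker implementation: the Hardy–Weinberg P-value is obtained by directly enumerating the conditional distribution of heterozygote counts rather than by the efficient recursion of Wigginton et al., the redundancy check is omitted, and the thresholds, genotype coding (0, 1, 2 copies of allele B, `None` for missing) and function names are assumptions.

```python
from math import exp, lgamma, log


def _log_prob_het(n_ab, n_a, n_b):
    """Log probability of n_ab heterozygotes given allele counts n_a and n_b under HWE."""
    n = (n_a + n_b) // 2
    n_aa = (n_a - n_ab) // 2
    n_bb = (n_b - n_ab) // 2
    return (n_ab * log(2.0) + lgamma(n + 1)
            - lgamma(n_aa + 1) - lgamma(n_ab + 1) - lgamma(n_bb + 1)
            + lgamma(n_a + 1) + lgamma(n_b + 1) - lgamma(2 * n + 1))


def hwe_exact_p(n_aa, n_ab, n_bb):
    """Exact HWE P-value: total probability of outcomes no more likely than the observed one."""
    n_a = 2 * n_aa + n_ab
    n_b = 2 * n_bb + n_ab
    observed = _log_prob_het(n_ab, n_a, n_b)
    total = 0.0
    # The heterozygote count always has the same parity as the allele counts.
    for het in range(n_a % 2, min(n_a, n_b) + 1, 2):
        lp = _log_prob_het(het, n_a, n_b)
        if lp <= observed + 1e-9:
            total += exp(lp)
    return min(1.0, total)


def passes_qc(genotypes, min_call_rate=0.95, min_maf=0.01, min_hwe_p=1e-4):
    """Apply call-rate, allele-frequency and HWE filters to one SNP (thresholds are illustrative)."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return False
    call_rate = len(called) / len(genotypes)
    n_aa, n_ab, n_bb = (called.count(k) for k in (0, 1, 2))
    freq_b = (n_ab + 2 * n_bb) / (2 * len(called))
    maf = min(freq_b, 1.0 - freq_b)
    return (call_rate >= min_call_rate and maf >= min_maf
            and hwe_exact_p(n_aa, n_ab, n_bb) >= min_hwe_p)
```

Direct enumeration costs time linear in the number of possible heterozygote counts per SNP, which is adequate for a sketch; a production implementation would use the recursive formulation for speed and numerical stability.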
3 EXAMPLE
We applied GenABEL to the analysis of one simulated and two real data sets. The simulated data set is distributed together with GenABEL: using the MS program (Hudson, 2002), 833 SNPs covering a 2.5 Mb region were simulated in 2500 people. We denote this data set as 2500 x 0.8 k. The two real data sets were both genotyped using Affymetrix 250 K SNP arrays. The first included 197 people (197 x 250 k) and the second included 500 people (500 x 250 k). All analyses were performed on a workstation with a 64-bit Intel Xeon 2.8 GHz processor, running SuSE Linux 9.2, using R v. 2.4.1. Analysis under Windows 2000 showed similar benchmark results.
Table 1 shows the maximum resident memory size used by the package. Most of the memory is occupied by descriptive data (such as SNP names) and by objects storing the results of analysis. For the 250 K GWA data, the maximum resident set size was 402 MB (the set containing 500 people). A data set obtained by quadruplicating every person in the 500 x 250 k set (2000 people) occupied a maximum of 1.24 GB. Memory use thus grows with the number of subjects, but less than proportionally: if the number of subjects increases N times, the RAM required increases by less than N times. Thus, GenABEL facilitates analysis of GWA data on at least 2500 subjects on desktop computers (2 GB of RAM).
Table 1 also shows that GWA and regional analyses can be run in the course of a few minutes. Estimation of empirical genome-wide significance is one of the most time-consuming steps. Analysis time grows roughly in proportion to the product of the number of subjects, the number of SNP tests and the number of analysis replicas; as with RAM, an N-fold increase of this product increases computation time by slightly less than N times. Using GenABEL, it was possible to estimate empirical genome-wide significance using 500 permutations in a data set of 2000 people within 76 min.
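Why the permutation step dominates run time can be seen from a minimal sketch of empirical genome-wide significance. It assumes a quantitative trait, complete genotypes coded as allele counts, and an n·r² statistic per SNP; this is one common construction and is not claimed to be the procedure implemented in emp.qtscore. The function names are hypothetical.

```python
import numpy as np


def score_stats(genotypes, trait):
    """Per-SNP association statistic n * r^2 (approximately chi-square with 1 df under the null)."""
    g = genotypes - genotypes.mean(axis=0)
    y = trait - trait.mean()
    num = (g.T @ y) ** 2
    den = (g ** 2).sum(axis=0) * (y ** 2).sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        r2 = np.where(den > 0, num / den, 0.0)  # monomorphic SNPs get a statistic of 0
    return len(trait) * r2


def empirical_genomewide_p(genotypes, trait, n_perm=500, seed=0):
    """Genome-wide empirical P-values from the permutation distribution of the maximum statistic."""
    rng = np.random.default_rng(seed)
    observed = score_stats(genotypes, trait)
    # Each replica rescans every SNP in every subject, so the cost scales with
    # subjects x SNPs x replicas.
    max_null = np.array([
        score_stats(genotypes, rng.permutation(trait)).max()
        for _ in range(n_perm)
    ])
    exceed = (max_null[None, :] >= observed[:, None]).sum(axis=1)
    return (1 + exceed) / (n_perm + 1)
```

Comparing each SNP against the maximum statistic of every permuted scan accounts for the correlation between markers, at the price of repeating the whole scan once per permutation.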
GenABEL facilitates not only GWA analysis, but also the presentation of results. Figure 1 presents graphs generated in the analysis of the 2500 x 0.8 k set. In Figure 1A, the associations between the simulated quantitative trait and SNPs are shown for the whole region. Figure 1B and C present the results of more detailed analyses of the region surrounding the most significant association signal.
4 CONCLUSIONS
We developed the GenABEL package for GWA analysis, which implements effective GWA data storage and handling, fast procedures for genetic data quality control and analysis and interfaces to standard and specific R data types and functions. The package is available at http://cran.r-project.org.
ACKNOWLEDGEMENTS
We would like to thank Prof. L. Cardon, Prof. D. Clayton, Dr B. Müller-Myhsok and Dr M. Kayser for their valuable insights. This work was supported by the Netherlands Organization for Scientific Research (NWO-RFBR 047.016.009), the Centre for Medical Systems Biology (CMSB), the European Special Populations Research Network (FP6) and the Russian Foundation for Basic Research (RFBR). The work of Y.S.A. was supported by a grant from Stichting MS.
Conflict of Interest: none declared.
FOOTNOTES
Associate Editor: Martin Bishop
Received on December 3, 2006; revised on February 14, 2007; accepted on March 13, 2007
REFERENCES
Evans DM, Cardon LR. Genome-wide association: a promising start to a long race. Trends Genet. (2006) 22:350–354.
Hudson RR. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics (2002) 18:337–338.
Schaid DJ, et al. Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am. J. Hum. Genet. (2002) 70:425–434.
Wigginton JE, et al. A note on exact tests of Hardy-Weinberg equilibrium. Am. J. Hum. Genet. (2005) 76:887–893.