Bioinformatics Advance Access originally published online on March 23, 2007
Bioinformatics 2007 23(10):1294-1296; doi:10.1093/bioinformatics/btm108

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

GenABEL: an R library for genome-wide association analysis

Yurii S. Aulchenko 1,*, Stephan Ripke 2, Aaron Isaacs 1 and Cornelia M. van Duijn 1

1 Department of Epidemiology and Biostatistics, Erasmus MC Rotterdam, Postbus 2040, 3000 CA Rotterdam, The Netherlands and 2 Statistical Genetics Group, Max-Planck-Institute of Psychiatry, Kraepelinstr. 10, D-80804 Munich, Germany

*To whom correspondence should be addressed.


    ABSTRACT

Here we describe an R library for genome-wide association (GWA) analysis. It implements effective storage and handling of GWA data, fast procedures for genetic data quality control, testing of association of single nucleotide polymorphisms with binary or quantitative traits, visualization of results and also provides easy interfaces to standard statistical and graphical procedures implemented in base R and special R libraries for genetic analysis. We evaluated GenABEL using one simulated and two real data sets. We conclude that GenABEL enables the analysis of GWA data on desktop computers.

Availability: http://cran.r-project.org

Contact: i.aoultchenko{at}erasmusmc.nl


    1 INTRODUCTION
Genome-wide association (GWA) analysis is a tool of choice for the identification of genes for complex traits. Effective storage, handling and analysis of GWA data represent a challenge to modern computational genetics. GWA studies generate large amounts of data: hundreds of thousands of single nucleotide polymorphisms (SNPs) are genotyped in hundreds or thousands of patients and controls. Data on each SNP undergo several types of analysis: characterization of the frequency distribution, testing of Hardy–Weinberg equilibrium, analysis of association of single SNPs and haplotypes with different traits, and so on. Because SNP genotypes in dense marker sets are correlated, significance testing in GWA analysis is preferably performed using computationally intensive permutation procedures, which further increases the computational burden (Evans and Cardon, 2006).

Effective software making GWA analysis possible on desktop computers should meet the following criteria:

  1. Facilitate effective data storage and manipulation.
  2. Give access to a wide range of statistical and graphical tools.
  3. Implement fast procedures for specific GWA tests.

With these objectives in mind, we developed the GenABEL software, implemented as an R library. R is a free, open source language and environment for statistical analysis (http://www.r-project.org/). Building upon existing statistical analysis facilities allowed for rapid development of the package.


    2 IMPLEMENTATION
2.1 Objective (1)
GWA data storage using standard R data types is inefficient. A SNP genotype for a single person may take four values (AA, AB, BB and missing), so two bits are sufficient to store it. However, the standard R data types occupy 32 bits, leading to an overhead of 1500% compared to the theoretical optimum. Use of the raw R data format, occupying eight bits, would still leave 75% of the RAM used inefficiently; moreover, this data type cannot be used directly in an analysis. We developed a new R data class, snp.data, which uses the optimal two bits to store information on a single SNP genotype. The standard R subsetting model was applied to this class, allowing retrieval of subsets of the data by SNP and study subject index, name or logical condition. Coercion to the R integer and character types, and to the data types used by the "haplo.stats" (Schaid et al., 2002) and "genetics" libraries, was implemented.
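The two-bit idea can be illustrated outside R. The following Python sketch packs four genotypes into each byte and reproduces the overhead figures quoted above. The code values, the bit order within a byte and the function names are assumptions made for illustration; they are not taken from the snp.data implementation.

```python
# Illustrative sketch only: 2-bit genotype packing, four genotypes per byte.
# The encoding and names are hypothetical, not GenABEL's internal layout.

CODES = {None: 0, "AA": 1, "AB": 2, "BB": 3}   # None stands for "missing"
DECODE = {code: genotype for genotype, code in CODES.items()}


def pack(genotypes):
    """Pack a list of genotypes into a bytearray, 2 bits per genotype."""
    packed = bytearray((len(genotypes) + 3) // 4)
    for i, genotype in enumerate(genotypes):
        shift = 2 * (i % 4)                     # position within the byte
        packed[i // 4] |= CODES[genotype] << shift
    return packed


def unpack(packed, n):
    """Recover the first n genotypes from a packed bytearray."""
    return [DECODE[(packed[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(n)]


genotypes = ["AA", "AB", None, "BB", "AB"]
packed = pack(genotypes)                        # 5 genotypes fit in 2 bytes
restored = unpack(packed, len(genotypes))

# Figures quoted in the text, relative to the 2-bit optimum.
overhead_32bit = (32 - 2) / 2 * 100   # percent extra storage for a 32-bit type
wasted_8bit = (8 - 2) / 8 * 100       # percent of an 8-bit raw value left unused
```

Packing makes random access slightly more expensive (a shift and a mask per genotype) in exchange for a 16-fold reduction in storage relative to a 32-bit type.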

2.2 Objective (2)
R provides extensive statistical analysis and graphical facilities. This was one of the reasons why we implemented GenABEL as an R library. The functions scan.glm and scan.glm.2D were developed to iteratively apply the standard R procedure glm (estimation of generalized linear models) to GWA data. The functions scan.haplo and scan.haplo.2D use the "haplo.stats" library to run sliding-window haplotype analysis and to evaluate the associations between a trait and haplotypes formed by all possible pairs of SNPs in a region. These functions are relatively slow and are aimed at the analysis of selected regions. To represent the objects generated by GenABEL graphically, new methods were designed for the generic R function "plot".

2.3 Objective (3)
Fast statistical genetic analysis procedures were implemented in ANSI-standard C and integrated into our library. These procedures facilitate data quality control and rapid single-SNP GWA analysis. The check.trait function provides summary statistics for phenotypic data and checks for outliers at a specified P-value or false discovery rate cut-off level. The function check.marker, based on summary.snp.data, allows the selection of a set of SNPs which pass user-specified criteria on call rate, redundancy, minimal marker allele frequency and deviation from Hardy–Weinberg equilibrium (using an exact test, Wigginton et al., 2005). The functions ccfast and qtscore enable a fast GWA analysis for case-control data and quantitative traits, respectively. The functions emp.ccfast and emp.qtscore were developed to estimate empirical genome-wide significance.
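The general idea behind empirical genome-wide significance can be sketched as a max-statistic permutation procedure. The Python example below is a minimal illustration under stated assumptions, not a description of what emp.qtscore does internally: the test statistic (n times the squared correlation between genotype dosage and trait), the (count + 1)/(permutations + 1) estimator, the function names and the simulated data are all chosen here for illustration.

```python
import random


def score_stat(genotype, trait):
    """n * r^2 between genotype dosage (0/1/2) and a quantitative trait."""
    n = len(trait)
    mean_g = sum(genotype) / n
    mean_y = sum(trait) / n
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotype, trait))
    sxx = sum((g - mean_g) ** 2 for g in genotype)
    syy = sum((y - mean_y) ** 2 for y in trait)
    if sxx == 0 or syy == 0:
        return 0.0   # monomorphic SNP or constant trait carries no information
    return n * sxy * sxy / (sxx * syy)


def empirical_p(snps, trait, n_perm=200, seed=1):
    """Experiment-wise P-values from the permutation distribution of the
    maximum statistic over all SNPs."""
    rng = random.Random(seed)
    observed = [score_stat(g, trait) for g in snps]
    exceed = [0] * len(snps)
    shuffled = list(trait)
    for _ in range(n_perm):
        rng.shuffle(shuffled)   # break the genotype-trait link, keep SNP correlation
        max_stat = max(score_stat(g, shuffled) for g in snps)
        for j, obs in enumerate(observed):
            if max_stat >= obs:
                exceed[j] += 1
    return [(e + 1) / (n_perm + 1) for e in exceed]


# Simulated toy data: 20 SNPs in 100 people, the trait depends on SNP 0 only.
rng = random.Random(0)
snps = [[rng.choice([0, 1, 2]) for _ in range(100)] for _ in range(20)]
trait = [2.0 * g + rng.gauss(0, 1) for g in snps[0]]
p_values = empirical_p(snps, trait)
```

Because the trait values are permuted as a whole, the correlation among SNPs is preserved in every replica, which is why this kind of procedure is preferred over a Bonferroni correction for dense marker sets. The cost is one full scan per replica, which is the product of subjects, SNP tests and replicas.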


    3 EXAMPLE
We applied GenABEL to the analysis of one simulated and two real data sets. The simulated data set is distributed together with GenABEL. Using the MS program (Hudson, 2002), 833 SNPs covering a 2.5 Mb region were simulated in 2500 people. We denote this data set as 2500 x 0.8 k. The two real data sets both used Affymetrix 250 K SNP arrays. The first included 197 people (197 x 250 k) and the second included 500 people (500 x 250 k). All analyses were performed on a workstation with a 64-bit Intel Xeon 2.8 GHz processor, running SuSE Linux 9.2, using R v. 2.4.1. Analysis under Windows 2000 showed similar benchmark results.

Table 1 shows the maximum resident memory size used by the package. Most of the memory is occupied by descriptive data (such as SNP names) and by objects storing the results of analysis. For the 250 K GWA data, the maximum resident memory set size was 402 MB (the set containing 500 people). A data set obtained by quadruplicating every person in the 500 x 250 k set occupied a maximum of 1.24 GB. The memory occupied is roughly proportional to the number of subjects, though if the number of subjects increases N times, the RAM required increases by less than N times. GenABEL will therefore facilitate analysis of GWA data on at least 2500 subjects on desktop computers (RAM 2 GB). Table 1 also shows that GWA and regional analyses can be run in the course of a few minutes. Estimation of empirical genome-wide significance is one of the most laborious parts. Time for analysis grows roughly in proportion to the product of the number of subjects, SNP tests and analysis replicas. Again, as is the case with RAM, with an N-fold increase of this product, the time for computations increases by slightly less than N times. Using GenABEL, it was possible to estimate empirical genome-wide significance using 500 permutations in a data set of 2000 people within 76 min.
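A back-of-the-envelope calculation helps put the memory figures above in context. The Python sketch below estimates the size of the packed genotype matrix alone, assuming exactly 250 000 SNPs for a "250 K" array (a round figure used here for illustration; the actual number of SNPs on the arrays is not given in the text) and 1 MB = 10^6 bytes.

```python
def genotype_megabytes(n_people, n_snps, bits_per_genotype=2):
    """Size of the genotype matrix in MB (1 MB = 10**6 bytes)."""
    return n_people * n_snps * bits_per_genotype / 8 / 1e6


N_SNPS = 250_000   # assumed round figure for a "250 K" array

packed_500 = genotype_megabytes(500, N_SNPS)       # 2 bits per genotype
packed_2000 = genotype_megabytes(2000, N_SNPS)     # quadruplicated data set
as_int32_500 = genotype_megabytes(500, N_SNPS, bits_per_genotype=32)

print(f"500 people, 2-bit:   {packed_500:.2f} MB")
print(f"2000 people, 2-bit:  {packed_2000:.2f} MB")
print(f"500 people, 32-bit:  {as_int32_500:.2f} MB")
```

Under these assumptions the packed genotypes for 500 people come to roughly 31 MB, a small fraction of the 402 MB reported, which is consistent with the statement that most of the memory is taken by descriptive data and result objects. Storing the same matrix in a 32-bit type would by itself exceed the reported total.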


Table 1. Characteristics of GenABEL v. 1.1–6

 
GenABEL facilitates not only GWA analysis, but also presentation of results. Figure 1 presents graphs generated in the analysis of the 2500 x 0.8 k set. In Figure 1A, the associations between the simulated quantitative trait and SNPs are shown for the whole region. Figure 1B and C present the results of more detailed analyses of the region surrounding the most significant association signal.


Fig. 1. Analysis of simulated data set. Region-wide analysis of single SNP association, using qtscore. (A) Nominal (above zero) and region-wise empirical (below zero) significance is presented as –log10P. Dotted lines correspond to experimentwise 5% significance (Bonferroni corrected above and empirical for below zero). Dots: allelic 1 d.f. test; crosses: genotypic 2 d.f. test. (B) Analysis of region surrounding SNPs showing highest significance. Dotted line: two-SNP sliding window haplotype analysis; solid line: three-SNP sliding window analysis. (C) Analysis of all pairs of SNPs in the region. Intensity corresponds to –log10P from analysis of haplotype association (above diagonal) and D' (below diagonal).
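For orientation, the position of a Bonferroni-corrected 5% line on the –log10P scale can be worked out directly. The short Python sketch below assumes that all 833 simulated SNPs were tested; if some SNPs were excluded during quality control, the number of tests and hence the threshold would differ slightly.

```python
import math

n_tests = 833          # assumed: every simulated SNP was tested
alpha = 0.05           # experiment-wise significance level

bonferroni_p = alpha / n_tests            # per-test P-value cut-off
threshold = -math.log10(bonferroni_p)     # height of the line on the plot

print(f"per-test cut-off: {bonferroni_p:.2e}")
print(f"-log10 threshold: {threshold:.2f}")
```

Under this assumption the line sits at about 4.22 on the –log10P axis.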

 

    4 CONCLUSIONS
We developed the GenABEL package for GWA analysis, which implements effective GWA data storage and handling, fast procedures for genetic data quality control and analysis and interfaces to standard and specific R data types and functions. The package is available at http://cran.r-project.org.


    ACKNOWLEDGEMENTS
We would like to thank Prof. L.Cardon, Prof. D.Clayton, Dr B.Müller-Myhsok and Dr M.Kayser for their valuable insights. This work was supported by the Netherlands Organization for Scientific Research (NWO-RFBR 047.016.009), the Centre for Medical Systems Biology (CMSB), the European Special Populations Research Network (FP6) and the Russian Foundation for Basic Research (RFBR). The work of Y.S.A was supported by a grant from ‘Stichting MS’.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Martin Bishop

Received on December 3, 2006; revised on February 14, 2007; accepted on March 13, 2007

    REFERENCES

    Evans DM, Cardon LR. Genome-wide association: a promising start to a long race. Trends Genet. (2006) 22:350–354.

    Hudson RR. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics (2002) 18:337–338.

    Schaid DJ, et al. Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am. J. Hum. Genet. (2002) 70:425–434.

    Wigginton JE, et al. A note on exact tests of Hardy-Weinberg equilibrium. Am. J. Hum. Genet. (2005) 76:887–893.

