Bioinformatics Advance Access originally published online on May 23, 2006
Bioinformatics 2006 22(15):1928-1929; doi:10.1093/bioinformatics/btl268
© 2006 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
SNPStats: a web tool for the analysis of association studies
1 Catalan Institute of Oncology, IDIBELL, Epidemiology and Cancer Registry L'Hospitalet, Barcelona, Spain
2 Autonomous University of Barcelona, Laboratory of Biostatistics and Epidemiology Bellaterra, Barcelona, Spain
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Summary: A web-based application has been designed from a genetic epidemiology point of view to analyze association studies. Main capabilities include descriptive analysis and tests for Hardy-Weinberg equilibrium and linkage disequilibrium. Analysis of association is based on linear or logistic regression according to the response variable (quantitative or binary disease status, respectively). Analysis of single SNPs: multiple inheritance models (co-dominant, dominant, recessive, over-dominant and log-additive), and analysis of interactions (gene-gene or gene-environment). Analysis of multiple SNPs: haplotype frequency estimation, analysis of association of haplotypes with the response, including analysis of interactions.
Availability: http://bioinfo.iconcologia.net/SNPstats. Source code for local installation is available under GNU license.
Contact: v.moreno{at}iconcologia.net
Supplementary Information: Figures with a sample run are available on Bioinformatics online. A detailed online tutorial is available within the application.
The analysis of association between genetic polymorphisms and diseases makes it possible to identify susceptibility genes (Cordell and Clayton, 2005). These studies can be analyzed with general-purpose statistical packages, but the researcher usually needs additional software for specific analyses, such as haplotype estimation, and results from different packages are difficult to integrate.
We present a free web-based tool to help researchers analyze association studies based on SNPs or biallelic markers. Both the selection of analyses and the output have been designed from a genetic epidemiology perspective. The application can also be used for learning purposes. We have written (in Spanish) an analysis guide with detailed explanations (Iniesta et al., 2005). Similarly extensive help in English is also available on the website.
The software is used following three steps, with the possibility of performing multiple analyses in one session. The steps are as follows.
(1) Data entry. Raw data in tabular form can be pasted in a window or uploaded from a text file. Variables can be named and the user can choose the field delimiter and the missing value code (Supplementary Figure 1). SNPs should be coded as genotypes with each allele separated by a slash (e.g. T/T, T/C, C/C).
(2) Data processing. A list with the variables read by the application is presented with an initial suggestion about the type: quantitative, categorical or SNP, which can be modified (Supplementary Figure 2). The user is prompted to select those needed for the analysis and to specify which one is the response, which may be binary (disease status) or quantitative. For categorical variables, including SNPs, the user can reorder the categories. The first one will be treated as reference category in the analysis. The application assumes that the main interest is the analysis of the SNPs in relation to the response. Other variables selected with type quantitative or categorical will be added to the regression models for analysis as covariates and treated as potential confounders.
(3) Analysis customization. The third step requests the selection of the desired statistical analyses, which are described later in this article (Supplementary Figure 3).
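The data-entry conventions in step (1) can be illustrated with a minimal Python sketch. It is not part of SNPStats; the function name `parse_genotype_table`, the default delimiter, the default missing-value code `NA`, and the decision to treat `T/C` and `C/T` as the same genotype are all assumptions made for illustration.

```python
import csv
import io


def parse_genotype_table(text, delimiter="\t", missing="NA"):
    """Parse pasted tabular data and return a list of dicts, one per row.

    Cells equal to the missing-value code (or empty) become None.
    Cells with exactly one slash, such as 'T/C', become a sorted allele
    tuple, so 'C/T' and 'T/C' map to the same genotype (an assumption).
    All other cells are kept as strings.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    rows = []
    for raw in reader:
        row = {}
        for name, value in raw.items():
            if name is None:  # extra cells beyond the header are ignored
                continue
            value = (value or "").strip()
            if value == missing or value == "":
                row[name] = None
            elif value.count("/") == 1:
                allele_1, allele_2 = value.split("/")
                row[name] = tuple(sorted((allele_1, allele_2)))
            else:
                row[name] = value
        rows.append(row)
    return rows


sample = "id\tcase\tsnp1\n1\t1\tT/C\n2\t0\tC/T\n3\t0\tNA\n"
rows = parse_genotype_table(sample)
```

Here `rows[0]["snp1"]` and `rows[1]["snp1"]` hold the same tuple, and the third subject's genotype is recorded as missing.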
Regarding the statistical analysis, the association with disease is modeled depending on the response variable. If it is binary, the application assumes an unmatched case-control design and unconditional logistic regression models are used. If the response is quantitative, then a single population is assumed and linear regression models are used to assess the proportion of variation in the response explained by the SNPs.
The association for each SNP is analyzed in turn and adjusted for the selected covariates. If more than one SNP is selected, then the application assumes that haplotype analysis is appropriate. Haplotype frequencies are estimated using the implementation of the EM algorithm coded into the haplo.stats package (Sinnwell and Schaid, 2005, http://mayoresearch.mayo.edu/mayo/research/biostat/schaid.cfm). The analysis of association between haplotypes and disease appropriately accounts for the uncertainty in the estimation of haplotypes for individuals who are heterozygous at multiple loci when phase is unknown, or when missing values are present (Schaid et al., 2002). Individuals with missing values in the response, in all SNPs or in any covariate are excluded from analysis.
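To make the idea of EM-based haplotype frequency estimation concrete, the following Python sketch handles the simplest case: two biallelic SNPs, where only double heterozygotes have ambiguous phase. It is an illustration only, not the haplo.stats implementation, which handles many loci and missing genotypes. The function name, the 0/1/2 genotype coding (copies of allele "1") and the convergence settings are assumptions.

```python
def em_two_snp_haplotypes(genotypes, tol=1e-10, max_iter=1000):
    """Estimate the four haplotype frequencies for two biallelic SNPs.

    genotypes: list of (g1, g2); each g is the number of copies (0, 1, 2)
    of allele '1' at that SNP. Haplotypes are keyed as (allele1, allele2).
    """
    fixed = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.0}
    n_double_het = 0
    for g1, g2 in genotypes:
        if g1 == 1 and g2 == 1:
            n_double_het += 1  # phase ambiguous: 00+11 or 01+10
            continue
        # Alleles on the two chromosomes; with at most one heterozygous
        # SNP the haplotype pair is fully determined.
        a1 = (0, 0) if g1 == 0 else (1, 1) if g1 == 2 else (0, 1)
        a2 = (0, 0) if g2 == 0 else (1, 1) if g2 == 2 else (0, 1)
        fixed[(a1[0], a2[0])] += 1
        fixed[(a1[1], a2[1])] += 1

    total = 2 * len(genotypes)
    freqs = {hap: 0.25 for hap in fixed}
    for _ in range(max_iter):
        # E-step: probability that a double heterozygote carries 00 + 11.
        cis = freqs[(0, 0)] * freqs[(1, 1)]
        trans = freqs[(0, 1)] * freqs[(1, 0)]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: add expected counts from the ambiguous individuals.
        counts = dict(fixed)
        counts[(0, 0)] += n_double_het * w
        counts[(1, 1)] += n_double_het * w
        counts[(0, 1)] += n_double_het * (1 - w)
        counts[(1, 0)] += n_double_het * (1 - w)
        new_freqs = {hap: c / total for hap, c in counts.items()}
        change = max(abs(new_freqs[hap] - freqs[hap]) for hap in freqs)
        freqs = new_freqs
        if change < tol:
            break
    return freqs


example = [(0, 0)] * 5 + [(2, 2)] * 5 + [(1, 1)] * 10
estimated = em_two_snp_haplotypes(example)
```

In this toy sample the unambiguous individuals carry only the 00 and 11 haplotypes, so the EM iterations assign the double heterozygotes almost entirely to the 00 + 11 configuration.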
The software's main page can be found online at http://bioinfo.iconcologia.net/SNPstats. The application uses the PHP server-side programming language to build the input forms, upload data, call the statistical analysis procedures and process the output. The statistical analyses are performed in a batch call to the R statistical software (R Development Core Team, 2005, http://www.R-project.org). The contributed packages genetics (Warnes and Leisch, 2005) and haplo.stats (Sinnwell and Schaid, 2005, http://mayoresearch.mayo.edu/mayo/research/biostat/schaid.cfm) are called to perform some of the analyses. Anonymous use is guaranteed and data are treated as confidential. Source code for local installation (Linux and Windows) is also available under GNU license.
SNPStats returns a complete set of results for the analysis, ranging from descriptive statistics to haplotype analysis. The descriptive statistics returned are the absolute frequencies and proportions for categorical variables, and the mean, standard deviation and a list of percentiles for the quantitative ones. The total valid sample size and the count of missing values are always displayed (Supplementary Figure 4).
Each SNP is described by its allele and genotype frequencies. An exact test for Hardy-Weinberg equilibrium is performed (Supplementary Figure 5). When the response variable is binary, these statistics can be displayed by each response group. The user will usually be interested in checking Hardy-Weinberg equilibrium in the control population.
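As an illustration of what an exact test for Hardy-Weinberg equilibrium computes, the sketch below enumerates every heterozygote count compatible with the observed allele counts and sums the probabilities of the outcomes that are no more likely than the observed one. This is the standard conditional exact test; whether it matches the SNPStats output digit for digit is an assumption, and the function name is hypothetical.

```python
from math import exp, lgamma, log


def hwe_exact_pvalue(n_aa, n_ab, n_bb):
    """Exact Hardy-Weinberg test p-value from the three genotype counts."""
    n = n_aa + n_ab + n_bb
    n_a = 2 * n_aa + n_ab  # copies of allele A
    n_b = 2 * n_bb + n_ab  # copies of allele B

    def log_prob(het):
        # P(het | n, n_a) = n! / (hom_a! het! hom_b!) * 2^het * n_a! n_b! / (2n)!
        hom_a = (n_a - het) // 2
        hom_b = (n_b - het) // 2
        return (lgamma(n + 1) - lgamma(hom_a + 1) - lgamma(het + 1)
                - lgamma(hom_b + 1) + het * log(2)
                + lgamma(n_a + 1) + lgamma(n_b + 1) - lgamma(2 * n + 1))

    # Possible heterozygote counts share the parity of the allele count.
    het_counts = range(n_a % 2, min(n_a, n_b) + 1, 2)
    probs = {het: exp(log_prob(het)) for het in het_counts}
    observed = probs[n_ab]
    # Small relative tolerance guards against floating-point ties.
    p_value = sum(p for p in probs.values() if p <= observed * (1 + 1e-9))
    return min(1.0, p_value)


p_balanced = hwe_exact_pvalue(25, 50, 25)  # genotype counts close to HWE
p_extreme = hwe_exact_pvalue(50, 0, 50)    # no heterozygotes at all
```

In a case-control study the function would typically be applied to the genotype counts of the controls only, in line with the recommendation above.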
The analysis of association for each SNP can be performed for either quantitative or binary response variables. For binary responses, the logistic regression analysis is summarized with genotype frequencies, proportions, odds ratios (OR) and 95% confidence intervals (CI) (Fig. 1). For quantitative responses, linear regression is summarized by means, standard errors, mean differences with respect to a reference category and 95% CI of the differences.
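The inheritance models listed in the abstract differ in how the three genotypes are grouped before comparison. The Python sketch below shows the grouping for the dominant, recessive and over-dominant models and computes a crude (unadjusted) OR with a Woolf-type 95% CI from the resulting 2x2 table. For an unadjusted model with a single binary predictor, logistic regression yields this same OR; covariate adjustment, the co-dominant model (each genotype compared separately with the reference) and the log-additive model (regression on the allele count) are not covered. Function names and the example counts are hypothetical.

```python
from math import exp, log, sqrt


def collapse(counts, model):
    """Group genotype counts (ref_hom, het, var_hom) into
    (reference group, comparison group) for a given inheritance model."""
    ref_hom, het, var_hom = counts
    if model == "dominant":       # het + var_hom versus ref_hom
        return ref_hom, het + var_hom
    if model == "recessive":      # var_hom versus ref_hom + het
        return ref_hom + het, var_hom
    if model == "overdominant":   # het versus both homozygotes
        return ref_hom + var_hom, het
    raise ValueError("unsupported model: " + model)


def odds_ratio_ci(cmp_cases, cmp_controls, ref_cases, ref_controls,
                  z=1.959964):
    """Crude OR and Woolf (log-scale) CI; all four cells must be > 0."""
    odds_ratio = (cmp_cases * ref_controls) / (cmp_controls * ref_cases)
    se = sqrt(1 / cmp_cases + 1 / cmp_controls
              + 1 / ref_cases + 1 / ref_controls)
    return (odds_ratio,
            exp(log(odds_ratio) - z * se),
            exp(log(odds_ratio) + z * se))


# Hypothetical genotype counts: (ref_hom, het, var_hom).
cases = (100, 80, 20)
controls = (120, 70, 10)

results = {}
for model in ("dominant", "recessive", "overdominant"):
    ref_cases, cmp_cases = collapse(cases, model)
    ref_controls, cmp_controls = collapse(controls, model)
    results[model] = odds_ratio_ci(cmp_cases, cmp_controls,
                                   ref_cases, ref_controls)
```

Comparing the entries of `results` shows how the same genotype counts can give quite different ORs depending on the inheritance model assumed.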
SNPStats can also perform analyses of interactions. For simplicity, models with only one pair of interacting variables can be selected at a time. Three summary tables are shown (Supplementary Figure 7). The first one is the cross-classification that uses a common reference category for both interacting variables. ORs or mean differences are estimated, together with 95% CI, for all other combinations. The next two tables use the margins as reference category and estimate ORs or mean differences of one variable nested within the other one. A global test for interaction is performed, as well as a test for the interaction in the linear trend of the nested variable. The latter assumes that the nested variable is ordinal and tests whether the trend differs among the categories of the other variable. This test might be more sensitive than the global one due to the reduction in degrees of freedom.
When more than one SNP is included in the analysis, SNPStats offers the possibility of performing linkage disequilibrium (LD) and haplotype analysis. For LD, matrices with selected statistics (D, D', Pearson's r and associated P-values) are shown (Supplementary Figure 8).
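For a pair of biallelic SNPs, the LD statistics named above follow directly from the four haplotype frequencies. The Python sketch below uses the usual textbook definitions; it returns a signed D', which is only one of the conventions in use, and it omits the associated P-values. The function name and argument order are assumptions.

```python
from math import sqrt


def ld_statistics(f_ab, f_aB, f_Ab, f_AB):
    """Return (D, D', r) from haplotype frequencies for two biallelic SNPs.

    f_ab is the frequency of the haplotype carrying allele 'a' at SNP 1
    and allele 'b' at SNP 2, and so on; the four values must sum to 1.
    """
    if abs(f_ab + f_aB + f_Ab + f_AB - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    p_a = f_ab + f_aB  # frequency of allele 'a' at SNP 1
    p_b = f_ab + f_Ab  # frequency of allele 'b' at SNP 2
    if not (0.0 < p_a < 1.0 and 0.0 < p_b < 1.0):
        raise ValueError("both SNPs must be polymorphic")

    d = f_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0
    r = d / sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r


complete_ld = ld_statistics(0.5, 0.0, 0.0, 0.5)
no_ld = ld_statistics(0.25, 0.25, 0.25, 0.25)
```

When phase is unknown, the haplotype frequencies passed to such a function would themselves be estimates, for example from an EM procedure.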
In the analysis of haplotypes, descriptive statistics show the estimated relative frequency for each haplotype (Supplementary Figure 9). Cumulative frequencies are also shown to help in the selection of the threshold cut point to group rare haplotypes for further analysis. The association analysis of haplotypes is similar to that of genotypes in that either logistic regression results are shown as OR and 95% CI or linear regression results with differences in means and 95% CI. The most frequent haplotype is automatically selected as the reference category and rare haplotypes are pooled together in a group. The analysis of haplotypes assumes a log-additive model by default, but dominant and recessive models are available as alternative choices.
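The haplotype summary described above (frequencies sorted in decreasing order, cumulative frequencies, the most frequent haplotype as reference, and rare haplotypes pooled) can be sketched in a few lines of Python. The function name, the example frequencies and the 0.01 threshold are hypothetical; in SNPStats the cut point is chosen by the user.

```python
def summarize_haplotypes(freqs, rare_threshold=0.01):
    """Sort haplotypes by frequency and pool those below the threshold.

    freqs: dict mapping haplotype label -> estimated relative frequency.
    Returns (table, reference, common, rare_total), where table rows are
    (haplotype, frequency, cumulative frequency).
    """
    ordered = sorted(freqs.items(), key=lambda item: item[1], reverse=True)
    table = []
    cumulative = 0.0
    for haplotype, frequency in ordered:
        cumulative += frequency
        table.append((haplotype, frequency, cumulative))
    reference = ordered[0][0]  # most frequent haplotype
    common = [hap for hap, f in ordered if f >= rare_threshold]
    rare_total = sum(f for _, f in ordered if f < rare_threshold)
    return table, reference, common, rare_total


example_freqs = {"TC": 0.55, "CC": 0.30, "TG": 0.145, "CG": 0.005}
table, reference, common, rare_total = summarize_haplotypes(example_freqs)
```

With these example frequencies the haplotype `TC` becomes the reference category and `CG` is the only haplotype that falls into the pooled rare group.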
When haplotypes are selected for interaction, tables similar to the genotype interaction ones are shown, with haplotypes replacing genotypes (Supplementary Figure 10). This analysis of interactions and the presentation of its results are unique among the available alternatives we explored and are an important contribution to the analysis of genetic epidemiology studies, which often focus on testing for gene-environment interactions (Lake et al., 2003).
As a limitation, we are aware that the available analyses have been selected for the most frequent study profile and might not be adequate in some instances. In future versions we plan to implement more response types: survival data for studies of prognosis, multinomial data for categorical responses with more than two categories and paired designs (matched case-control or nested case-control).
| Acknowledgments |
|---|
Funding support was provided by the Spanish Instituto de Salud Carlos III (networks of centres RCESP C03/09 and RTICCC C03/10). Funding to pay the Open Access publication charges for this article was provided by Instituto de Salud Carlos III (FIS 03/0114).
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Charlie Hodgman
Received on March 6, 2006; revised on May 16, 2006; accepted on May 18, 2006
| REFERENCES |
|---|
Cordell, H.J. and Clayton, D.G. (2005) Genetic association studies. Lancet, 366, 1121-1131.
Iniesta, R., et al. (2005) Análisis estadístico de polimorfismos genéticos en estudios epidemiológicos. Gac. Sanit., 19, 333-341.
Lake, S., et al. (2003) Estimation and tests of haplotype-environment interaction when linkage phase is ambiguous. Hum. Hered., 55, 56-65.
R Development Core Team. (2005) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.
Schaid, D.J., et al. (2002) Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am. J. Hum. Genet., 70, 425-434.
Sinnwell, J.P. and Schaid, D.J. (2005) haplo.stats: statistical analysis of haplotypes with traits and covariates when linkage phase is ambiguous. R package version 1.2.2.
Warnes, G. and Leisch, F. (2005) genetics: Population Genetics. R package version 1.2.0.