Bioinformatics Advance Access originally published online on December 5, 2007
Bioinformatics 2008 24(3):435-437; doi:10.1093/bioinformatics/btm603
Association studies for untyped markers with TUNA
1Departments of Statistics and 2Department of Medicine, 5734 S. University Avenue, Chicago, IL 60637, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Summary: The software package TUNA (Testing UNtyped Alleles) implements a fast and efficient algorithm for testing association of genotyped and ungenotyped variants in genome-wide case-control studies. TUNA uses linkage disequilibrium (LD) information from existing comprehensive variation datasets such as HapMap to construct databases of frequency predictors, each a linear combination of haplotype frequencies of genotyped SNPs. The predictors are used to estimate untyped allele frequencies and to perform association tests. The methods incorporated in TUNA estimate these frequencies accurately, and the software is computationally efficient, with modest memory and CPU requirements.
Availability: The software package is available for download from the website: http://www.stat.uchicago.edu/~wen/tuna/
Contact: nicolae{at}galton.uchicago.edu
| 1 INTRODUCTION |
|---|
Genome-wide association studies are now well recognized as powerful tools for finding genetic risk variants for complex traits (e.g. WTCCC, 2007). However, even with the availability of newly developed high-throughput genotyping platforms, it is likely that the disease-causing variants are not directly genotyped. The ability to perform statistical tests on untyped variants based on genotyped data therefore becomes crucial in finding and explaining association signals. Considerable recent effort has gone into methods and software designed to tackle this problem (Marchini et al., 2007; Nicolae, 2006b; Scheet and Stephens, 2006; Servin and Stephens, 2007). We introduce here a software package, called TUNA, that provides a fast and accurate solution for inference on untyped variation. The package consists of two computational elements: (i) a predictor database building program that efficiently extracts LD information from a reference population panel dataset such as HapMap, in which a much larger set of genetic variants is studied; and (ii) an analysis program that performs single-SNP statistical testing. The software is written in ANSI C++ and can be compiled and used on all modern operating systems. The programs require only a modest amount of computing resources and run fast enough for application to a genome-wide case-control study. In the following sections, we describe the methods and algorithms implemented in TUNA and investigate the performance of the package.
| 2 OVERVIEW OF THE METHOD |
|---|
The method implemented in TUNA aims to estimate allele frequencies for all the markers in HapMap (or another reference database) based on genotype data from a subset of markers (e.g. the Illumina HumanHap300 BeadChip SNP set) in a group of subjects (e.g. the cases in a case-control sample). The frequency is estimated as a linear combination of observed haplotype frequencies. For example, Table 1 shows haplotype frequencies from HapMap CEU for one SNP that is absent from the Illumina HumanHap300 set (rs897623, denoted by $T$) and three Illumina SNPs (rs10909880, rs3748816 and rs12409348, denoted by $A_1$, $A_2$ and $A_3$, respectively). The TUNA estimate of the frequency of allele 1 of $T$ is $U_1 = h_{000} + h_{101} \times 0.017 / (0.017 + 0.333)$, where the haplotype frequencies $h_{000}$ and $h_{101}$ are estimated from the available genotypes.
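The linear predictor in this example can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not TUNA's C++ implementation: the function name `predict_allele_frequency` and the sample haplotype frequencies are hypothetical, and only the two numbers 0.017 and 0.333 come from the text.

```python
# Reference-panel values quoted in the text for haplotype 1-0-1 at (A1, A2, A3):
# 0.017 and 0.333 are the two frequencies that enter the weight of h101.
W_101 = 0.017 / (0.017 + 0.333)

# Weight of each A1-A2-A3 haplotype in the predictor for allele 1 of T.
# Haplotype 0-0-0 enters with weight 1; haplotypes not listed get weight 0.
WEIGHTS = {"000": 1.0, "101": W_101}


def predict_allele_frequency(hap_freqs, weights=WEIGHTS):
    """Return U1 = sum over haplotypes of weight * haplotype frequency."""
    return sum(weights.get(hap, 0.0) * freq for hap, freq in hap_freqs.items())


# Hypothetical haplotype frequencies estimated in a study sample (illustrative only).
sample_freqs = {"000": 0.20, "101": 0.35, "110": 0.30, "011": 0.15}
u1 = predict_allele_frequency(sample_freqs)
print(f"estimated frequency of allele 1 at T: {u1:.3f}")
```

In a real analysis the haplotype frequencies would themselves be estimated from unphased genotypes, a step this sketch leaves out.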
Given a set of genotyped markers (e.g. high quality Illumina HumanHap300 SNPs) and information on the population composition of the samples (e.g. Caucasian), TUNA first builds a database containing, for every SNP in HapMap, the following information:
- the amount of information available on the SNP in the genotyped markers. The information is quantified using $M_D$ (Nicolae, 2006a), a multi-locus measure of LD that is similar in interpretation to $r^2$. $M_D$ is defined as the asymptotic relative efficiency (ratio of sample sizes necessary to achieve the same power) of two allele frequency estimators: the direct estimator (as if the marker were genotyped) and the indirect estimator (via LD, based on the available genotypes as described above). Note that pairwise $r^2$ might give an incomplete picture of the information available (Nicolae, 2006a,b; Nicolae et al., 2006c).
- the set of genotyped markers used in the frequency estimation (e.g. the markers $A_1$, $A_2$ and $A_3$). This is done by first finding the set of $N$ (e.g. four) SNPs that give maximum $M_D$ within a pre-defined window size (e.g. 400 kb). We further remove SNPs that do not contribute significantly to the information on the SNP of interest by computing an adjusted $M_D$ that depends on $N$ and on $n$, the number of haplotypes. The adjusted $M_D$ penalizes large $N$, and its definition is inspired by the adjusted $R^2$ widely used in regression.
- the weights for each possible haplotype (in Table 1, the haplotype 1-0-1 has weight 0.017/0.350 = 0.048).
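The weight in the worked example is the share of a typed haplotype's reference frequency that carries allele 1 at the untyped SNP. A minimal Python sketch of that calculation follows. The function `haplotype_weights` and the reference table are hypothetical. Only the two 1-0-1 entries (0.017 and 0.333) are taken from the text, and the procedure TUNA actually uses may differ in detail.

```python
from collections import defaultdict


def haplotype_weights(joint_freqs):
    """Compute one weight per typed haplotype from reference-panel frequencies.

    joint_freqs maps (allele at T, haplotype at the typed SNPs) -> frequency.
    The weight is the fraction of the haplotype's frequency with allele 1 at T.
    """
    total = defaultdict(float)
    with_allele_1 = defaultdict(float)
    for (t_allele, hap), freq in joint_freqs.items():
        total[hap] += freq
        if t_allele == 1:
            with_allele_1[hap] += freq
    return {hap: with_allele_1[hap] / total[hap] for hap in total if total[hap] > 0}


# Hypothetical reference table; only the 1-0-1 rows use numbers from the text.
reference_joint = {
    (1, "000"): 0.250,
    (1, "101"): 0.017,
    (0, "101"): 0.333,
    (0, "110"): 0.400,
}
weights = haplotype_weights(reference_joint)
print({hap: round(w, 3) for hap, w in weights.items()})
```

A haplotype that always carries allele 1 in the reference panel gets weight 1, and one that never does gets weight 0, which matches the form of the predictor in the example.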
The resulting database depends on the set of genotyped markers that will be used and on the population used in inference.
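For reference, the adjusted $R^2$ that the adjusted $M_D$ is modeled on has the standard form below for a regression with $n$ observations and $p$ predictors. An analogous penalty for $M_D$ would put the number of haplotypes in the role of $n$ and the number of selected SNPs $N$ in the role of $p$. That mapping is an assumption made for illustration, not a statement of the exact formula used in TUNA.

$$
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
$$

Because the factor $(n-1)/(n-p-1)$ grows with $p$, adding a predictor raises the adjusted value only if it improves the fit by more than the penalty, which is the behavior the text describes for large $N$.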
The case-control statistical tests implemented in TUNA aim to find differences in the allele frequencies in the two groups. Note that the linear haplotype predictor obtained in the database construction is used not for frequency estimation, but for defining the null hypothesis to be tested. For example, for the rs897623 case described above, we test the hypothesis that the frequencies $h_{000} + 0.048\,h_{101}$ are equal in cases and controls. This can be done using tools developed for haplotype-based genetic association studies (Lin and Zeng, 2006; Nicolae, 2006b). Our first implementation used a likelihood ratio test where the MLEs were calculated using an ECM algorithm (Kim and Taylor, 1995; Meng and Rubin, 1993). The ECM algorithm occasionally converged to local maxima, so we adopted a different strategy for assessing significance. We use the squared difference of the estimated allele frequencies (in cases and controls) as the test statistic and estimate its variance using two methods: a direct estimation based on the asymptotic interpretation of $M_D$, and a resampling-based evaluation. Both test statistics are expected to have an asymptotic $\chi^2_1$ distribution under the null. A permutation-based assessment of significance is also implemented in the software.
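The squared-difference test and its permutation variant can be sketched as follows. This is a simplified illustration under stated assumptions, not TUNA's implementation: it assumes phased haplotypes are already known for every chromosome (TUNA works from genotypes), it uses a plain bootstrap as the resampling scheme, and it permutes chromosomes rather than individuals. All function names and the toy data are hypothetical.

```python
import math
import random

# Predictor weights from the worked example (haplotypes not listed have weight 0).
WEIGHTS = {"000": 1.0, "101": 0.017 / (0.017 + 0.333)}


def predictor_frequency(haplotypes, weights):
    """Mean predictor weight over a group's haplotypes (the group's estimate)."""
    return sum(weights.get(h, 0.0) for h in haplotypes) / len(haplotypes)


def bootstrap_variance(haplotypes, weights, n_boot, rng):
    """Resampling-based variance of the predictor frequency for one group."""
    estimates = []
    for _ in range(n_boot):
        resample = [rng.choice(haplotypes) for _ in haplotypes]
        estimates.append(predictor_frequency(resample, weights))
    mean = sum(estimates) / n_boot
    return sum((e - mean) ** 2 for e in estimates) / (n_boot - 1)


def chi2_1df_pvalue(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def squared_difference_test(cases, controls, weights, n_boot=500, seed=0):
    """Squared frequency difference scaled by its estimated variance."""
    rng = random.Random(seed)
    diff = predictor_frequency(cases, weights) - predictor_frequency(controls, weights)
    var = (bootstrap_variance(cases, weights, n_boot, rng)
           + bootstrap_variance(controls, weights, n_boot, rng))
    stat = diff ** 2 / var
    return stat, chi2_1df_pvalue(stat)


def permutation_pvalue(cases, controls, weights, n_perm=1000, seed=0):
    """Share of label permutations with a squared difference at least as large."""
    rng = random.Random(seed)
    observed = (predictor_frequency(cases, weights)
                - predictor_frequency(controls, weights)) ** 2
    pooled = list(cases) + list(controls)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_cases, perm_controls = pooled[:len(cases)], pooled[len(cases):]
        d = (predictor_frequency(perm_cases, weights)
             - predictor_frequency(perm_controls, weights))
        if d ** 2 >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy data: 200 chromosomes per group (illustrative only).
cases = ["000"] * 60 + ["101"] * 80 + ["110"] * 60
controls = ["000"] * 30 + ["101"] * 90 + ["110"] * 80
stat, p_asymptotic = squared_difference_test(cases, controls, WEIGHTS)
p_permutation = permutation_pvalue(cases, controls, WEIGHTS)
print(f"statistic={stat:.2f}, asymptotic p={p_asymptotic:.2g}, permutation p={p_permutation:.2g}")
```

The permutation p-value adds one to both numerator and denominator so that it can never be exactly zero, a common convention rather than something taken from the source.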
| 3 APPLICATION |
|---|
For prediction methods, the most important issues to consider are the validity of the model assumptions and the accuracy of the imputation. Most existing software is computationally expensive for model evaluation, so we implement in TUNA a simple and fast algorithm that addresses this. For each genotyped SNP, we estimate two allele frequencies: a direct estimate from the observed genotypes and an indirect estimate obtained with the methods described above. Comparing the two sets of results provides a direct assessment of the accuracy of the procedure. This feature is especially important when the reference database is not well matched with the studied population (e.g. using the Caucasian HapMap to construct predictors for non-Caucasian data).
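The comparison of direct and indirect estimates can be summarized with a few simple statistics. The sketch below is one possible summary, not TUNA's own output: the function `compare_estimates`, the choice of metrics and the SNP identifiers are all hypothetical.

```python
import math


def compare_estimates(direct, indirect):
    """Summarize agreement between direct and indirect frequency estimates.

    Both arguments map a SNP identifier to an estimated allele frequency;
    only SNPs present in both dictionaries are compared.
    """
    shared = sorted(set(direct) & set(indirect))
    diffs = [indirect[snp] - direct[snp] for snp in shared]
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    max_abs_diff = max(abs(d) for d in diffs)
    return {"n_snps": len(shared), "rmse": rmse, "max_abs_diff": max_abs_diff}


# Hypothetical estimates for three genotyped SNPs (illustrative only).
direct = {"snp_a": 0.10, "snp_b": 0.42, "snp_c": 0.75}
indirect = {"snp_a": 0.12, "snp_b": 0.40, "snp_c": 0.75}
summary = compare_estimates(direct, indirect)
print(summary)
```

A large discrepancy across many genotyped SNPs would suggest that the reference panel is a poor match for the study sample.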
We applied TUNA to two datasets from the Illumina iControl Database. We downloaded data on 248 Caucasian samples and 75 Latino samples that were genotyped using the HumanHap300 platform. For both samples, the HapMap CEU population is used as the LD reference panel. Out of the 2.557 M polymorphic SNPs in HapMap, 2.114 M are well-tagged ($M_D > 0.7$) by SNPs in this Illumina set, and 177 K are well-tagged only through multi-marker predictors ($M_D > 0.7$ and max $r^2 < 0.5$). Figure 1 shows the comparison of the direct and indirect allele frequency estimators for all SNPs on chromosome 22 and for the well-tagged ($M_D > 0.7$) subset of SNPs. Note that the method is more accurate on the Caucasian controls because they match the reference panel better, and is more accurate when more information (as measured by $M_D$) is available. All computations (database construction and frequency estimation) took only 300 s of CPU time and 10 MB of total memory on an AMD Opteron 2.6 GHz Linux system. In general, one can analyze a full genome scan on a regular workstation in a few hours.
| 4 DISCUSSION |
|---|
We introduce in this note a novel software package, TUNA, that provides a powerful computational tool for inference on ungenotyped variants in genome-wide association studies. Studying hidden information has many advantages, including: (i) an increased power to detect genetic associations, because we can effectively turn a scan based on a few hundred thousand markers into something that more closely approximates a scan of all the SNPs in the HapMap; (ii) a clear interpretation of the detected associations, because every statistical test that is performed corresponds to one (typed or untyped) marker; and (iii) a simple way to integrate data from different platforms, because each marker in HapMap can be assigned a P-value for association.
Methods for association testing at untyped variation are an active research area, and many statistical and computational issues are still under investigation. For example, it is clear that if we have no prior information on risk variation, a larger number of markers leads to an increase in power even after adjusting for multiple comparisons. If we test untyped alleles only for the markers where the prediction/imputation is perfect, the same conclusion applies: there is an increase in power. Inaccurate prediction/imputation is equivalent to a reduction in sample size for a genotyped marker (Nicolae, 2006a), and more research is needed on the thresholds for prediction accuracy that must be imposed to guarantee an increase in power. Another important issue is the choice of the reference panel used in constructing the prediction database. Our procedure is designed to guard against an increase in false positives when the reference panel comes from a different population than the samples under investigation. We use the predictors to define null hypotheses that are tested in the sample, so an incorrect prediction leads to testing an uninteresting null hypothesis rather than to a false positive. The price for this robustness is a possible decrease in power, because testing uninteresting hypotheses makes the multiple comparison adjustment more stringent. We believe that this robustness and the computational efficiency are the benefits of this algorithm over methods where individual genotypes are imputed.
We would like to end by noting that the software and methods described in this article can also be applied to indirect association testing of other genomic variants, such as insertions/deletions and copy number polymorphisms. The only requirement is the availability of reference databases that contain the typed SNPs and genotypes on the genomic variants, which can be used to construct the haplotype predictors.
| ACKNOWLEDGEMENTS |
|---|
The research was supported in part by NIH grants HL084715, DK62429, DK077489 and HL087665.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Keith Crandall
Received on September 23, 2007; revised on November 27, 2007; accepted on November 30, 2007
| REFERENCES |
|---|
Kim DK, Taylor JMG. The restricted EM algorithm for maximum likelihood estimation under linear restrictions on the parameters. J. Am. Stat. Assoc (1995) 90:708–716.
Lin DY, Zeng D. Likelihood-based inference on haplotype effects in genetic association studies (with discussion). J. Am. Stat. Assoc (2006) 101:89–118.
Marchini J, et al. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat. Genet (2007) 39:906–913.
Meng XL, Rubin DB. Maximum likelihood estimation via the ECM algorithm: a general framework. Biometrika (1993) 80:267–278.
Nicolae DL. Quantifying the amount of missing information in genetic association studies. Genet. Epidemiol (2006a) 30:703–717.
Nicolae DL. Testing untyped alleles (TUNA)-applications to genome-wide association studies. Genet. Epidemiol (2006b) 30:718–727.
Nicolae DL, et al. Coverage and characteristics of the Affymetrix GeneChip Human Mapping 100K SNP set. PLoS Genet (2006c) 2:e67.
Scheet P, Stephens M. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet (2006) 78:629–644.
Servin B, Stephens M. Imputation-based analysis of association studies: candidate regions and quantitative traits. PLoS Genet (2007) 3:e114.
Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature (2007) 447:661–678.