Bioinformatics Advance Access originally published online on August 25, 2005
Bioinformatics 2005 21(20):3935-3937; doi:10.1093/bioinformatics/bti643
PAWE-3D: visualizing power for association with error in case–control genetic studies of complex traits
1Laboratory of Statistical Genetics, Rockefeller University, 1230 York Avenue, New York, NY 10021, USA
2The Rogosin Institute, 505 East 70th Street, New York, NY 10021, USA
3Weill Medical College of Cornell University, 1300 York Avenue, New York, NY 10021, USA
4Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY 11794, USA
*To whom correspondence should be addressed.
| Abstract |
|---|
Summary: A website that plots power and sample size calculations over a range of up to eight parameters (including diagnostic misclassification error parameters) for two commonly used statistical tests of genetic association, the linear trend test and the genotypic test of association.
Availability: This method is made available via the website http://linkage.rockefeller.edu/pawe3d/
Contact: pawe3d{at}linkage.rockefeller.edu
Power and sample size calculations are a critical part of the study design for genetic association analysis. Traditionally, statistical power for linkage or association analysis is computed by specifying genetic model parameters, such as the disease allele frequency and the conditional probabilities Pr(affected | j copies of disease allele), where j = 0, 1 or 2 for a diallelic disease locus (Boehnke, 1986; Weeks et al., 1990; Gordon et al., 2002; Purcell et al., 2003; De La Vega et al., 2005). The conditional probabilities are often referred to as penetrances (Ott, 1999). Equivalently, one can specify the genotype relative risks (Schaid and Sommer, 1993) and the prevalence of the disease (Sham, 1998; Purcell et al., 2003). Although these values can usually be estimated with a high degree of accuracy for Mendelian disorders, they are typically unknown for complex diseases (Ulgen et al., 2004). One statistical method for dealing with such uncertainty is to consider a range of values for each parameter. One can then report either the worst-case scenario (i.e. the smallest power or largest required sample size observed over the range) or the median power and/or sample size (Cox and Hinkley, 1979). This approach has been used in several genetic applications in recent years (Gordon et al., 1997; Vieland, 1998; Abreu et al., 1999; Cousin et al., 2003; Gordon et al., 2005; Zheng et al., 2005). One advantage is that researchers can observe a distribution of power values over the range of parameter values considered, including the minimum, median, average and maximum power.
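The equivalence between the two parameterizations mentioned above can be illustrated with a minimal sketch. It assumes Hardy-Weinberg proportions at the disease locus, which is an assumption of this example and not something stated in the text; the function name `penetrances` and the input values are hypothetical.

```python
def penetrances(pd, r1, r2, k):
    """Return (f0, f1, f2), where fj = Pr(affected | j copies of disease allele).

    pd: disease allele frequency; r1 = f1/f0 and r2 = f2/f0 are the genotype
    relative risks; k is the disease prevalence.
    Assumes Hardy-Weinberg genotype proportions (an assumption of this sketch).
    """
    # Genotype frequencies for 0, 1 and 2 copies of the disease allele.
    g = ((1 - pd) ** 2, 2 * pd * (1 - pd), pd ** 2)
    # Prevalence is the genotype-weighted mean penetrance: k = sum(fj * gj).
    f0 = k / (g[0] + r1 * g[1] + r2 * g[2])
    # Callers should check that all three values lie in [0, 1].
    return (f0, r1 * f0, r2 * f0)


# Illustrative (made-up) parameter values.
f0, f1, f2 = penetrances(pd=0.1, r1=1.5, r2=2.5, k=0.05)
print(f0, f1, f2)
```

Going the other way is immediate: the relative risks are f1/f0 and f2/f0, and the prevalence is the weighted sum of the penetrances.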
| VISUALIZING THE POWER |
|---|
We have implemented a method to visualize power and sample size for two commonly used statistical tests for genetic association, the linear trend test (Cochran, 1954; Armitage, 1955) and the genotypic test of association (Gordon et al., 2002). The linear trend test is actually a class of tests that are functions of weights. For genetic case–control association analyses, Sasieni (1997) made recommendations for the choice of weights assuming different underlying genetic models. Power and/or sample size are computed through derivation of the respective test's non-centrality parameter (Mitra, 1958; Chapman and Nam, 1968). The power for a fixed sample size of cases and controls, and the minimal sample size for a fixed power, each at a specified significance level, are computed as functions of the genotype relative risks for the heterozygote (R1) and for the risk-allele homozygote (R2), the disease allele frequency (pd), the frequency (p1) of the SNP marker allele in coupling with the disease allele, a measure of disequilibrium between the disease and SNP loci [D' (Lewontin, 1964) or r² (Fisher, 1970; Weir, 1990)] and the disease prevalence (K). Alternatively, if one is studying a quantitative trait locus (QTL) and specifies lower and upper cutoffs for the definition of affected and unaffected individuals, then power and/or sample size are calculated as functions of the QTL variance, the dominance/additive ratio, the frequency of the QTL increaser allele, the frequency (p1) of the SNP marker allele in coupling with the QTL increaser allele and a measure of disequilibrium between the QTL and SNP loci (D' or r², as above) (Purcell et al., 2003).
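As a rough illustration of how power follows from a non-centrality parameter, the sketch below handles the 1-degree-of-freedom linear trend test. The non-centrality formula is one common large-sample approximation (squared expected weighted frequency difference divided by its pooled null variance) and is not necessarily the derivation used by PAWE-3D; the function names and the genotype frequencies are hypothetical.

```python
from statistics import NormalDist


def trend_ncp(case_freqs, control_freqs, n_cases, n_controls, weights=(2, 1, 0)):
    """Approximate non-centrality parameter of the linear trend test.

    case_freqs / control_freqs: genotype frequencies for 0, 1, 2 copies of the
    SNP minor allele. Uses a pooled-variance approximation (an assumption of
    this sketch).
    """
    n = n_cases + n_controls
    pooled = [(n_cases * p + n_controls * q) / n
              for p, q in zip(case_freqs, control_freqs)]
    # Expected difference in mean weight between cases and controls.
    delta = sum(x * (p - q) for x, p, q in zip(weights, case_freqs, control_freqs))
    mean = sum(x * pi for x, pi in zip(weights, pooled))
    var = sum(x * x * pi for x, pi in zip(weights, pooled)) - mean ** 2
    return delta ** 2 / ((1 / n_cases + 1 / n_controls) * var)


def power_1df(ncp, alpha=0.05):
    """Power of a 1-df chi-square test with non-centrality parameter ncp.

    A non-central chi-square variable with 1 df is the square of a
    Normal(sqrt(ncp), 1) variable, so only the normal CDF is needed.
    """
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha / 2)
    root = ncp ** 0.5
    return (1 - nd.cdf(z - root)) + nd.cdf(-z - root)


# Illustrative (made-up) genotype frequencies in cases and controls.
cases = (0.70, 0.26, 0.04)
controls = (0.81, 0.18, 0.01)
print(power_1df(trend_ncp(cases, controls, n_cases=300, n_controls=300)))
```

The genotypic test has 2 degrees of freedom, so it would need a non-central chi-square distribution function, which the Python standard library does not provide.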
Furthermore, because of work documenting the effects of diagnostic misclassification on the power of the linear trend test (Zheng and Tian, 2005) and the genotypic test of association (Bross, 1954; Edwards et al., 2005), we also include two misclassification probabilities: the probability of misclassifying a true affected as an observed unaffected, and the probability of misclassifying a true unaffected as an observed affected.
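To see why misclassification costs power, the following sketch mixes the true case and control genotype frequencies according to the two error probabilities. The names `theta` and `phi` are chosen here for illustration, and the model (errors independent of genotype, applied to a population with prevalence `k`) is an assumption of this example.

```python
def observed_genotype_freqs(case_freqs, control_freqs, k, theta, phi):
    """Genotype frequencies among OBSERVED cases and controls.

    theta: Pr(observed unaffected | true affected)
    phi:   Pr(observed affected | true unaffected)
    k:     prevalence of true affected status
    Assumes misclassification does not depend on genotype (an assumption of
    this sketch).
    """
    p_obs_aff = k * (1 - theta) + (1 - k) * phi
    p_obs_unaff = 1 - p_obs_aff
    # Bayes' rule: observed cases mix true cases with misclassified controls.
    obs_cases = [((1 - theta) * k * p + phi * (1 - k) * q) / p_obs_aff
                 for p, q in zip(case_freqs, control_freqs)]
    # Observed controls mix true controls with misclassified cases.
    obs_controls = [(theta * k * p + (1 - phi) * (1 - k) * q) / p_obs_unaff
                    for p, q in zip(case_freqs, control_freqs)]
    return obs_cases, obs_controls


# Illustrative (made-up) true genotype frequencies.
cases = (0.70, 0.26, 0.04)
controls = (0.81, 0.18, 0.01)
obs_cases, obs_controls = observed_genotype_freqs(
    cases, controls, k=0.45, theta=0.03, phi=0.03)
print(obs_cases, obs_controls)
```

The observed case and control frequencies are pulled towards each other, which shrinks the difference that the association test is trying to detect.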
In total, there are eight disease model parameters required for the determination of power and/or sample size at a given significance level, assuming a diallelic disease or QTL and a marker locus that are in disequilibrium. Our webtool, PAWE-3D, allows one to perform power calculations considering a range of values for any subset of the eight parameters (with the remaining parameters specified at a single value). If we consider a range for only one parameter, the resulting figure is a graph. If we consider a range for exactly two parameters, the resulting figure is a contour plot. If we consider a range for three or more parameters, the resulting figure is a histogram. The figures are created by randomly sampling from either a Uniform or a Beta distribution for 100 000 data points in the n-dimensional cube defined by the parameter intervals and computing power and/or sample size for these data points.
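The sampling scheme and the choice of figure described above can be sketched as follows. This is not the PAWE-3D implementation: the function names, the parameter keys and the Beta shape parameters are hypothetical, and the Beta draw is simply rescaled to each interval.

```python
import random


def figure_type(n_ranged):
    """Figure produced for a given number of parameters specified as ranges."""
    if n_ranged < 1:
        raise ValueError("at least one parameter must be given as a range")
    return {1: "graph", 2: "contour plot"}.get(n_ranged, "histogram")


def sample_points(ranges, fixed, n_points=100_000, beta=None, seed=0):
    """Yield parameter settings drawn from the cube defined by `ranges`.

    ranges: {name: (low, high)} for the parameters that vary.
    fixed:  {name: value} for the parameters held at a single value.
    beta:   optional (a, b) shape parameters; Uniform sampling if None.
    """
    rng = random.Random(seed)
    for _ in range(n_points):
        point = dict(fixed)
        for name, (lo, hi) in ranges.items():
            u = rng.betavariate(*beta) if beta else rng.random()
            point[name] = lo + (hi - lo) * u  # rescale from [0, 1] to [lo, hi]
        yield point


# Hypothetical parameter names; five ranged parameters give a histogram.
ranges = {"r1": (1.5, 2.5), "r2": (1.5, 2.5),
          "theta": (0.0, 0.03), "phi": (0.0, 0.03), "k": (0.40, 0.50)}
fixed = {"pd": 0.01, "p1": 0.01, "d_prime": 1.0}
points = list(sample_points(ranges, fixed, n_points=1000))
print(figure_type(len(ranges)), len(points))
```

Power or sample size would then be computed at each sampled point and the results plotted in the form returned by `figure_type`.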
| EXAMPLE SAMPLE SIZE CALCULATION |
|---|
Consider the following example, drawn from the design of a case–control genetic association study of modifier loci in the PKD1 locus (Rossetti et al., 2002, 2003) for polycystic kidney disease. In this design, affected (respectively unaffected) status is defined by being at high (respectively low) risk of developing premature end-stage renal failure, as determined by the diagnostic instrument used in the Consortium for Radiologic Imaging Studies of Polycystic Kidney Disease (CRISP) cohort (Chapman et al., 2003). The prevalence (K) of case individuals is 40–50%, although we assume equal numbers of cases and controls when performing the statistical analysis. Sequencing of several polymorphic SNPs in these patients indicates that the large majority of these SNPs have a minor allele frequency (p1) of 0.01. If we assume that one of the SNPs is a modifier locus that increases the risk of being a case, then the disease allele frequency (pd) equals the minor allele frequency and D' (or r²) between the two SNPs is 1.0.
Since we anticipate that the modifier loci will have small to moderate genotype relative risks (Schaid and Sommer, 1993), we consider genotype relative risks R1 and R2 in the range [1.5, 2.5]. We also consider both misclassification probabilities in the range [0.00, 0.03]. Using the information above, we consider a prevalence K in the range [0.40, 0.50]. Thus, we compute sample sizes over ranges for five parameters and, consequently, the resulting figure is a histogram. In this example, we use a Uniform prior distribution for the parameters considered.
In Figure 1, we present the histogram of sample size calculations (cases and controls) for the linear trend test with power = 0.80, ratio of cases to controls = 1.0 and significance level = 5%. The weights considered for the linear trend test are X0 = 2, X1 = 1, X2 = 0, where Xi is the weight corresponding to the genotype having i copies of the SNP minor allele. Note that the sample sizes assume that this test statistic is the one used for analysis. In Table 1, we present sample size values corresponding to specified percentile thresholds.
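For a fixed ratio of cases to controls, the non-centrality parameter of a 1-df test grows in proportion to the total sample size, so a required sample size can be obtained by dividing the non-centrality needed for the target power by the non-centrality contributed per individual. The sketch below illustrates this under that assumption; the function names and the per-individual value are hypothetical, and this is not claimed to be the procedure PAWE-3D uses.

```python
import math
from statistics import NormalDist


def required_ncp(power=0.80, alpha=0.05):
    """Non-centrality parameter giving the target power for a 1-df test."""
    if not alpha < power < 1:
        raise ValueError("need alpha < power < 1")
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha / 2)

    def power_at(ncp):
        root = math.sqrt(ncp)
        return (1 - nd.cdf(z - root)) + nd.cdf(-z - root)

    lo, hi = 0.0, 1.0
    while power_at(hi) < power:      # expand until the target is bracketed
        hi *= 2
    for _ in range(100):             # bisection; power increases with ncp
        mid = (lo + hi) / 2
        if power_at(mid) < power:
            lo = mid
        else:
            hi = mid
    return hi


def total_sample_size(ncp_per_individual, power=0.80, alpha=0.05):
    """Smallest total sample size (cases plus controls) reaching the power."""
    return math.ceil(required_ncp(power, alpha) / ncp_per_individual)


# Illustrative (made-up) non-centrality contributed per sampled individual.
print(total_sample_size(ncp_per_individual=0.01))
```

With a 1:1 ratio of cases to controls, half of the returned total would be cases and half controls.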
Table 1 shows that a total sample size of 587 (respectively 1005, 2825) is sufficient to achieve 80% power for at least half (respectively 75%, all) of the genetic model parameter settings. In the spirit of minimax theory (Cox and Hinkley, 1979), these results give researchers a way of determining the worst-case sample size requirements.
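The percentile reading of the sample size distribution can be sketched with a nearest-rank rule: the value at the p-th percentile is the smallest computed sample size that suffices for at least p% of the sampled parameter settings. The function name and the input values below are made up for illustration and are not the values behind Table 1.

```python
import math


def sample_size_at_percentile(sample_sizes, pct):
    """Smallest sample size sufficient for at least pct% of the settings."""
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(sample_sizes)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))  # nearest-rank rule
    return ordered[rank - 1]


# Made-up required sample sizes, one per sampled parameter setting.
sizes = [300, 450, 500, 620, 700, 900, 1200, 2500]
for pct in (50, 75, 100):
    print(pct, sample_size_at_percentile(sizes, pct))
```

The 100th percentile is the worst case over the sampled settings, which corresponds to the minimax reading described above.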
| Acknowledgments |
|---|
The authors acknowledge grants received from the National Institutes of Health, K01-HG00055 and MH44292.
Conflict of Interest: none declared.
Received on May 31, 2005; revised on August 18, 2005; accepted on August 22, 2005
| REFERENCES |
|---|
Abreu, P.C., et al. (1999) Direct power comparisons between simple LOD scores and NPL scores for linkage analysis in complex diseases. Am. J. Hum. Genet., 65, 847–857.
Armitage, P. (1955) Tests for linear trends in proportions and frequencies. Biometrics, 11, 375–386.
Boehnke, M. (1986) Estimating the power of a proposed linkage study: a practical computer simulation approach. Am. J. Hum. Genet., 39, 513–527.
Bross, I. (1954) Misclassification in 2 x 2 tables. Biometrics, 10, 478–486.
Chapman, A.B., et al. (2003) Renal structure in early autosomal-dominant polycystic kidney disease (ADPKD): the Consortium for Radiologic Imaging Studies of Polycystic Kidney Disease (CRISP) cohort. Kidney Int., 64, 1035–1045.
Chapman, D.G. and Nam, J.M. (1968) Asymptotic power of chi square tests for linear trends in proportions. Biometrics, 24, 315–327.
Cochran, W.G. (1954) Some methods for strengthening the common chi-squared tests. Biometrics, 10, 417–451.
Cousin, E., et al. (2003) Association studies in candidate genes: strategies to select SNPs to be tested. Hum. Hered., 56, 151–159.
Cox, D.R. and Hinkley, D.V. (1979) Theoretical Statistics. CRC Press, Boca Raton.
De La Vega, F.M., et al. (2005) Power and sample size calculations for genetic case/control studies using gene-centric SNP maps: application to Human Chromosomes 6, 21, and 22 in three populations. Hum. Hered., 60, 43–60.
Edwards, B.J., et al. (2005) Power and sample size calculations in the presence of phenotype errors for case/control genetic association studies. BMC Genet., 6, 18.
Fisher, R.A. (1970) Statistical Methods for Research Workers, 14th ed. Hafner/MacMillan, New York.
Gordon, D., et al. (1997) Association of posterior p-values of S.A.G.E. SIBPAL proportion-IBD and Haseman–Elston statistics for ACTHR112. Genet. Epidemiol., 14, 629–634.
Gordon, D., et al. (2002) Power and sample size calculations for case–control genetic association tests when errors are present: application to single nucleotide polymorphisms. Hum. Hered., 54, 22–33.
Gordon, D., et al. (2005) Power for complex trait genetic association. Clin. Neuroscience Res., 5, 31–35.
Lewontin, R.C. (1964) The interaction of selection and linkage. I. General considerations; heterotic models. Genetics, 49, 49–67.
Mitra, S.K. (1958) On the limiting power function of the frequency chi-square test. Ann. Math. Stat., 29, 1221–1233.
Ott, J. (1999) Analysis of Human Genetic Linkage. Johns Hopkins, Baltimore.
Purcell, S., et al. (2003) Genetic power calculator: design of linkage and association genetic mapping studies of complex traits. Bioinformatics, 19, 149–150.
Rossetti, S., et al. (2002) The position of the polycystic kidney disease 1 (PKD1) gene mutation correlates with the severity of renal disease. J. Am. Soc. Nephrol., 13, 1230–1237.
Rossetti, S., et al. (2003) Association of mutation position in polycystic kidney disease 1 (PKD1) gene and development of a vascular phenotype. Lancet, 361, 2196–2201.
Sasieni, P.D. (1997) From genotypes to genes: doubling the sample size. Biometrics, 53, 1253–1261.
Schaid, D.J. and Sommer, S.S. (1993) Genotype relative risks: methods for design and analysis of candidate-gene association studies. Am. J. Hum. Genet., 53, 1114–1126.
Sham, P. (1998) Statistics in Human Genetics. J. Wiley and Sons, Inc., New York.
Ulgen, A., et al. (2004) Percentiles of the null distribution of 2 maximum lod score tests. Hum. Hered., 57, 39–48.
Vieland, V.J. (1998) Bayesian linkage analysis, or: how I learned to stop worrying and love the posterior probability of linkage. Am. J. Hum. Genet., 63, 947–954.
Weeks, D.E., et al. (1990) SLINK: a general simulation program for linkage analysis. Am. J. Hum. Genet., 47, A204 Suppl.
Weir, B.S. (1990) Genetic Data Analysis: Methods for Discrete Population Genetic Data. Sinauer Associates, Inc., Sunderland.
Zheng, G. and Tian, X. (2005) The impact of diagnostic error on testing genetic association in case-control studies. Stat. Med., 24, 869–882.
Zheng, G., et al. (2005) On averaging power for genetic association and linkage studies. Hum. Hered., 59, 14–20.