

Bioinformatics Advance Access originally published online on November 28, 2008
Bioinformatics 2009 25(2):272-273; doi:10.1093/bioinformatics/btn616

© 2008 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Programs for calculating the statistical powers of detecting susceptibility genes in case–control studies based on multistage designs

Nobutaka Kitamura 1,*, Kouhei Akazawa 1, Akinori Miyashita 2, Ryozo Kuwano 2, Shin-ichi Toyabe 1, Junichiro Nakamura 1, Norihito Nakamura 1, Tatsuhiko Sato 1 and M. Aminul Hoque 3

1Department of Medical Informatics, Niigata University Medical and Dental Hospital, Niigata 951-8520, 2Genome Science Branch, Center of Bioresource-Based Researches, Brain Research Institute, Niigata University, Niigata 951-8585, Japan and 3Department of Statistics, University of Rajshahi, Rajshahi-6205, Bangladesh

*To whom correspondence should be addressed.


    ABSTRACT
 

Motivation: A two-stage association study is the most commonly used method among multistage designs to efficiently identify disease susceptibility genes. Recently, some SNP studies have utilized more than two stages to detect disease genes. However, there are few available programs for calculating statistical powers and positive predictive values (PPVs) of arbitrary n-stage designs.

Results: We developed programs for a multistage case–control association study using R language. In our programs, input parameters include numbers of samples and candidate loci, genome-wide false positive rate and proportions of samples and loci to be selected at the k-th stage (k=1,..., n). The programs output statistical powers, PPVs and numbers of typings in arbitrary n-stage designs. The programs can contribute to prior simulations under various conditions in planning a genome-wide association study.

Availability: The R programs are freely available for academic users and can be downloaded from http://www.med.niigata-u.ac.jp/eng/resources/informatics/gwa.html

Contact: nktmr{at}m12.alpha-net.ne.jp

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 
Genome-wide association (GWA) case–control studies using genetic polymorphism data, including data on single nucleotide polymorphisms (SNPs) and microsatellite markers, aim to identify genes involved in common human diseases. Several multistage GWA case–control designs have been proposed as powerful and cost-effective study designs, of which the two-stage design is the most commonly used. In these designs, all markers are genotyped in the first stage and a statistical test is conducted for each marker. Only the markers found significant in the first stage are carried forward to the following stages. At the end of the last stage, the overall evidence for association is evaluated by replication-based analysis (RBA) (Satagopan et al., 2002) or joint analysis (JA) (Skol et al., 2006).

Recently, some SNP-based studies have used multistage designs with more than two stages (Kathiresan et al., 2007; Prentice et al., 2006). However, there are few available programs for multistage designs. Skol et al. (2006) proposed JA, whose statistic is a linear combination of the stage statistics weighted by the square roots of the proportions of subjects, and presented power calculations for one- and two-stage designs on their website, but they did not address designs with more than two stages. In previous studies, statistical power was used to evaluate the performance of different designs. Positive predictive values (PPVs), on the other hand, are more practical for identifying disease susceptibility genes implicated in complex traits. However, there is no program for calculating the statistical powers, PPVs and numbers of typings of multistage designs with an arbitrary number of stages n > 2.

In this article, we present programs, written in the R language (Crawley, 2005), for calculating these indicators for arbitrary n-stage designs, using either disease model parameters (prevalence, disease-associated allele frequency and genetic relative risk) or 2×3 contingency table data as input. The programs can output the indicators for allelic, additive, dominant and recessive models under various conditions.


    2 METHODS
 
2.1 Statistical background for detecting risk SNPs
We extended the power-evaluation methods of Skol et al. (2006) for two-stage designs to generalized n-stage designs (n being an arbitrary natural number). Let $\pi_{s,k}$ be the proportion of the sample used at the k-th stage and let $\pi_{m,k}$ be the proportion of loci selected at the k-th stage (k=1,..., n). In RBA, the test statistic at each stage is based on the difference between the expected disease-associated allele frequency in cases (p') and that in controls (p). The critical value for the k-th stage of RBA, $C_{m,k}$, is obtained as $\Phi^{-1}[1 - \pi_{m,k}/2]$ under the null hypothesis, and the power of a k-stage RBA design is calculated by multiplying the cumulative distribution functions of the normal distribution for the individual stages under the alternative hypothesis.
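The RBA calculation described above can be sketched in a few lines of R. This is a minimal illustration and not the authors' program: it assumes N cases and N controls, a two-sided z-test on allele frequencies with unpooled variance, and independent samples at each stage. The function name `rba_power` and the allele frequencies in the example are hypothetical.

```r
# Sketch of RBA critical values and power for an n-stage design.
# Assumptions (not taken from the paper): n_per_group cases and
# n_per_group controls, two alleles per subject, unpooled variance.
rba_power <- function(p_case, p_ctrl, n_per_group, pi_s, pi_m) {
  stopifnot(length(pi_s) == length(pi_m))
  crit <- qnorm(1 - pi_m / 2)              # C_{m,k} = Phi^{-1}[1 - pi_{m,k}/2]
  alleles <- 2 * n_per_group * pi_s        # alleles per group typed at stage k
  se <- sqrt((p_case * (1 - p_case) + p_ctrl * (1 - p_ctrl)) / alleles)
  mu <- (p_case - p_ctrl) / se             # mean of the z statistic at stage k
  # Two-sided probability of exceeding the critical value at each stage
  stage_power <- pnorm(crit - mu, lower.tail = FALSE) + pnorm(-crit - mu)
  list(critical = crit, stage_power = stage_power, power = prod(stage_power))
}

# Illustrative two-stage design with hypothetical allele frequencies
res <- rba_power(p_case = 0.49, p_ctrl = 0.39, n_per_group = 1000,
                 pi_s = c(0.5, 0.5), pi_m = c(0.01, 0.01))
print(res$critical)
print(res$power)
```

When p_case equals p_ctrl, each stage's rejection probability reduces to $\pi_{m,k}$, which is a convenient sanity check on the critical values.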

The test statistic of JA is represented as the sum of the statistics of each stage, weighted by $\sqrt{\pi_{s,k}}$. Because, by definition, the JA statistic is not independent of the statistics of the preceding stages, the critical values of JA ($C_{j,k}$) are calculated iteratively by solving an equation involving n-fold integrals under the null hypothesis using the Newton–Raphson method. The power of JA is then calculated with $C_{j,k}$ under the alternative hypothesis.
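To illustrate how a JA critical value can be obtained, the following sketch treats only the two-stage case. It is not the authors' implementation: instead of `pmvnorm` and Newton–Raphson it uses a one-dimensional trapezoid rule and base R's `uniroot`, and it assumes that the per-marker false positive rate is the genome-wide rate divided by M, that the joint statistic is sqrt(pi_s1)*z1 + sqrt(1 - pi_s1)*z2 with independent standard normal z1 and z2 under the null, and that 0 < pi_s1 < 1. The function names are hypothetical.

```r
# Null probability that a locus passes stage 1 (|z1| > c_1) and the
# joint test (|z_J| > c_j), with z_J = a*z1 + b*z2 (two-stage sketch).
joint_null_prob <- function(c_j, c_1, pi_s1, z_max = 12, n_grid = 20001) {
  a <- sqrt(pi_s1)         # weight of the stage-1 statistic
  b <- sqrt(1 - pi_s1)     # weight of the stage-2 statistic
  z1 <- seq(c_1, z_max, length.out = n_grid)
  # P(|z_J| > c_j | z1), since z_J given z1 is N(a * z1, b^2)
  tail_prob <- pnorm((c_j - a * z1) / b, lower.tail = FALSE) +
    pnorm((-c_j - a * z1) / b)
  y <- dnorm(z1) * tail_prob
  h <- z1[2] - z1[1]
  # Trapezoid rule; the factor 2 accounts for z1 < -c_1 by symmetry
  2 * h * (sum(y) - (y[1] + y[n_grid]) / 2)
}

# Solve joint_null_prob(c_j) = alpha_marker for c_j by root finding
solve_c_joint <- function(alpha_marker, pi_m1, pi_s1) {
  c_1 <- qnorm(1 - pi_m1 / 2)
  uniroot(function(c_j) joint_null_prob(c_j, c_1, pi_s1) - alpha_marker,
          interval = c(0, 10), tol = 1e-10)$root
}

# Illustrative values: alpha_genome = 0.05 spread over M = 500 000 loci
c_j <- solve_c_joint(alpha_marker = 0.05 / 500000, pi_m1 = 0.01, pi_s1 = 0.5)
print(c_j)
```

For more than two stages the same equation involves a higher-dimensional normal integral, which is where a multivariate normal routine becomes necessary.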

PPVs are calculated from the number of true positive loci ($tp_k$) and the number of false positive loci ($fp_k$). Let N be the number of samples in cases and controls, M be the number of candidate loci and m be the number of true loci. $tp_k$ is given as $m \times$ (power of RBA or JA), and $fp_k$ is given as $(M - m) \times \prod_{i=1}^{k} \pi_{m,i}$.

Therefore, the PPV at the k-th stage is calculated as $tp_k/(tp_k + fp_k)$.
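The PPV calculation can be written directly from these definitions. The sketch below assumes that the false positive count uses the product of the selection proportions of all stages up to and including stage k; the function name `ppv_at_stage` and the power value in the example are hypothetical.

```r
# PPV at stage k from the power and the selection proportions pi_m[1..k].
ppv_at_stage <- function(power, n_true, n_loci, pi_m) {
  tp <- n_true * power                     # expected true positive loci
  fp <- (n_loci - n_true) * prod(pi_m)     # expected false positive loci
  tp / (tp + fp)
}

# Illustrative call: m = 5 true loci among M = 500 000, assumed power 0.8
print(ppv_at_stage(power = 0.8, n_true = 5, n_loci = 500000,
                   pi_m = c(0.01, 0.01)))
```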

The number of typings of one-stage design is given as 2MN.

The number of typings of n-stage design is


Formula
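As a sketch of how such a count can be computed, the following R function assumes that stage k types a fraction $\pi_{s,k}$ of the 2N subjects at the loci that survived the selections of stages 1 to k-1. This assumption is chosen because it reduces to 2MN for a single stage; it is not a formula taken from the text, and the function name `n_typings` is hypothetical.

```r
# Number of genotypings in an n-stage design (assumed form, see lead-in).
n_typings <- function(n_per_group, n_loci, pi_s, pi_m) {
  stopifnot(length(pi_s) == length(pi_m), abs(sum(pi_s) - 1) < 1e-8)
  # Fraction of loci entering stage k: all loci at stage 1, then the
  # cumulative product of the selection proportions of earlier stages.
  entering <- c(1, cumprod(pi_m)[-length(pi_m)])
  sum(2 * n_per_group * pi_s * n_loci * entering)
}

# One-stage design: every subject is typed at every locus (2MN)
print(n_typings(1000, 500000, pi_s = 1, pi_m = 1e-4))
# Two-stage design with equal sample split
print(n_typings(1000, 500000, pi_s = c(0.5, 0.5), pi_m = c(0.01, 0.01)))
```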

2.2 Algorithm and programs for multistage designs
The algorithm for calculating the powers, PPVs and numbers of typings of multistage designs consists of three modules: a module for internal parameters, a module for RBA and a module for JA.

Let a candidate locus have two alleles, A and a, and let the penetrance of genotype aa be f0. The relative risks of Aa and AA with respect to aa in the population are then f1/f0 and f2/f0, respectively. Using disease model parameters such as the prevalence (Prev) and the disease-associated allele rate (ppo) as input, f0, f1 and f2 in the allelic model, in which the numbers of each allele in cases and controls are counted, are calculated as follows:


Formula

Therefore, p' and p are calculated as follows:


Formula
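A minimal sketch of these quantities in R is given below. It assumes a multiplicative allelic model (f1 = GRR × f0, f2 = GRR² × f0) and Hardy–Weinberg proportions in the population, neither of which is stated explicitly here, so it should be read as one consistent possibility and not as the authors' formulas. The function name `allelic_model` is hypothetical. With Prev = 0.1, allele frequency 0.4 and GRR = 1.44, these assumptions give an odds ratio close to 1.5.

```r
# Penetrances and allele frequencies in cases and controls under an
# assumed multiplicative allelic model with Hardy-Weinberg proportions.
allelic_model <- function(prev, p_pop, grr) {
  # Prev = f0 * ((1 - p)^2 + 2 p (1 - p) GRR + p^2 GRR^2)
  f0 <- prev / (1 - p_pop + grr * p_pop)^2
  f1 <- grr * f0
  f2 <- grr^2 * f0
  # Frequency of allele A among cases (p') and among controls (p)
  p_case <- (p_pop^2 * f2 + p_pop * (1 - p_pop) * f1) / prev
  p_ctrl <- (p_pop^2 * (1 - f2) + p_pop * (1 - p_pop) * (1 - f1)) / (1 - prev)
  odds_ratio <- (p_case / (1 - p_case)) / (p_ctrl / (1 - p_ctrl))
  list(f0 = f0, f1 = f1, f2 = f2,
       p_case = p_case, p_ctrl = p_ctrl, odds_ratio = odds_ratio)
}

res <- allelic_model(prev = 0.1, p_pop = 0.4, grr = 1.44)
print(unlist(res))
```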

In additive models, dominant models and recessive models, those genotype rates in cases and controls are calculated according to each definition (refer to Supplementary Material).

The programs for calculating the indicators were written in the R language (version 2.7.1). In the programs, the n-fold integrals are evaluated with the ‘pmvnorm’ function of the mvtnorm package. This function computes the distribution function of the multivariate normal distribution for arbitrary limits and correlation matrices based on algorithms by Genz (1992). We prepared an R package and a web user interface (WUI) for these calculations, both available on our website (see Availability).

2.3 Simulation algorithm
Assume the following conditions for calculating the powers, PPVs and numbers of typings in one- to five-stage designs: N=1000, M=500 000, m=5, genome-wide false positive rate ($\alpha_{genome}$)=0.05 and odds ratio (OR)=1.5, which corresponds to Prev=0.1, disease-associated allele frequency=0.4 and genetic relative risk (GRR)=1.44; the genetic model is set to be an allelic model. The samples were divided equally among the stages. In addition, the $\pi_{m,k}$ are set to equal proportions whose product is 0.0001, so that the number of loci remaining in the final stage is the same for every design.
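The design parameters implied by these conditions follow from simple arithmetic: with n equal stages, each sample proportion is 1/n and each selection proportion is the n-th root of 0.0001. The short R sketch below tabulates them for one to five stages; the function name `equal_design` is hypothetical.

```r
# Equal-split design: pi_s,k = 1/n and pi_m,k = final_fraction^(1/n),
# so that the product of the pi_m,k equals final_fraction.
equal_design <- function(n_stages, final_fraction = 1e-4) {
  stopifnot(n_stages >= 1)
  list(
    pi_s = rep(1 / n_stages, n_stages),
    pi_m = rep(final_fraction^(1 / n_stages), n_stages)
  )
}

n_loci <- 500000
for (n in 1:5) {
  d <- equal_design(n)
  cat(sprintf("%d-stage: pi_s = %.3f, pi_m = %.4f, loci left = %.0f\n",
              n, d$pi_s[1], d$pi_m[1], n_loci * prod(d$pi_m)))
}
```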


    3 RESULTS
 
Figure 1 shows the powers and PPVs and the numbers of typings of RBA and JA in one- to five-stage designs. The powers of JA were always higher than those of RBA. In RBA, with an increase in the number of stages, the numbers of typings and the powers monotonically decreased. However, in JA, the powers decreased more slowly than in RBA. The powers of the three-stage design were higher than the powers of two-, four- and five-stage designs. On the other hand, PPVs of RBA and those of JA both exhibited plateau behaviors in a high range of PPVs.


Fig. 1. Powers and PPVs and numbers of typings by RBA and JA in one- to five-stage designs.

 

    4 DISCUSSION
 
Multistage case–control association designs have the distinct advantage of considerable reduction in the number of typings, while maintaining high powers for many practical projects to detect disease-related genes.

We developed programs in the R language for multistage designs with an arbitrary number of stages. In this study, the properties of multistage designs under RBA and JA were investigated by comparing one- to five-stage designs. The powers of JA were larger than those of RBA for every number of stages, and three-stage designs were superior in power to two-, four- and five-stage designs under the condition that the samples are divided equally among the stages (refer to Supplementary Material).

Factors affecting powers include N, M, $\pi_{s,k}$, $\pi_{m,k}$ and the study design. However, there is no simple way to predict the powers of the various study designs. Therefore, our programs should be useful to researchers at the planning stage of GWA studies.

Funding: This research was supported by grants-in-aid for scientific research [Based research (B)] by the Japanese Ministry of Education, Culture, Sports, Science and Technology.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Alfonso Valencia

Received on June 5, 2008; revised on November 14, 2008; accepted on November 24, 2008

    REFERENCES
 

    Crawley MJ. Statistics: An Introduction Using R. (2005) England: John Wiley & Sons.

    Genz A. Numerical computation of multivariate normal probabilities. J. Comput. Graph. Stat. (1992) 1:141–150.

    Kathiresan S, et al. A genome-wide association study for blood lipid phenotypes in the Framingham Heart Study. BMC Med. Genet. (2007) 8. http://www.biomedcentral.com/1471-2350/8/S1/S17 (last accessed on December 7, 2008).

    Prentice RL, Qi L. Aspects of the design and analysis of high-dimensional SNP studies for disease risk estimation. Biostatistics (2006) 7:339–654.

    Satagopan JM, et al. Two-stage designs for gene-disease association studies. Biometrics (2002) 58:163–170.

    Skol AD, et al. Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nat. Genet. (2006) 38:209–213.

