Bioinformatics Advance Access originally published online on October 4, 2006
Bioinformatics 2006 22(24):3103-3105; doi:10.1093/bioinformatics/btl507
P2BAT: a massive parallel implementation of PBAT for genome-wide association studies in R
1 Department of Biostatistics, Harvard School of Public Health, 655 Huntington Avenue, Boston, MA 02115, USA
2 Harvard Medical School, Channing Laboratory, 181 Longwood Avenue, Boston, MA 02115, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Summary: The software tool P2BAT provides a massively parallel and user-friendly implementation of the PBAT analysis tools for family-based association tests (FBATs) in large-scale studies, including genome-wide association studies with several thousand subjects. Built on the original PBAT implementation of the Lange–Van Steen algorithm, which bypasses the multiple testing problem in family-based association studies, P2BAT integrates all PBAT analysis tools for binary and complex traits into R and makes them accessible through a user-friendly GUI. The genome-wide analysis tools are fully automated and can be run massively parallel directly through the GUI. P2BAT is fully documented and contains graphical output tools for time-to-onset analysis. P2BAT also features the ability to test for gene–environment/drug interaction.
Availability: The P2BAT package is available as the R package pbatR which can be downloaded from http://cran.r-project.org/. The PBAT-software is available at http://www.biostat.harvard.edu/~clange/.
Contact: thoffman{at}hsph.harvard.edu
| 1 INTRODUCTION |
|---|
The era of genome-wide association studies has finally begun (Herbert et al., 2006; Kachergus et al., 2005; Klein et al., 2005), offering a unique chance to identify genes for complex traits through an unbiased search at a genome-wide level. The initial fear was that the new wealth of genomic data could not be translated into increased statistical power to detect new genes, but would instead be diluted by the multiple comparisons problem. This concern about the major statistical roadblock in such studies now seems to be fading as new methodology emerges. For studies of unrelated individuals, several statistical approaches have been suggested (Hirschhorn and Daly, 2005; Thomas et al., 2004; Roeder et al., 2005; Verzilli et al., 2006). For genome-wide association studies in family-based designs, Van Steen et al. (2005) proposed a novel testing strategy that bypasses the multiple testing problem within one study and thereby reduces the impact of study heterogeneity. The approach has been applied successfully to a 100 K-scan in a family sample of the Framingham Heart Study (Herbert et al., 2006; Lange et al., 2003; Laird and Lange, 2006), which is, to date, the only successful genome-wide association study revealing a novel, replicable candidate gene for obesity.
However, so far, no software implementation for genome-wide association studies exists that can analyze the vast amount of information produced by such studies in a user-friendly way and that runs massively parallel on clusters, reducing the analysis time to a couple of minutes. With P2BAT, we have developed such a software tool. P2BAT implements all the analysis features of PBAT in R (R Development Core Team, 2005) and makes them accessible through a user-friendly GUI. Further, without requiring any additional effort from the user, P2BAT allows one to run the analysis massively parallel with as many parallel jobs as specified (Fig. 1). The parallelization in P2BAT is achieved by running multiple instances of the original PBAT program, using the queuing system of a cluster. The process is fully automated and monitored by P2BAT. The package P2BAT is available as the R package pbatR in conjunction with the software PBAT. The two software packages can be downloaded from http://cran.r-project.org/ and from http://www.biostat.harvard.edu/~clange/, respectively. Detailed instructions are available on the webpage http://people.fas.harvard.edu/~tjhoffm/pbatR.html.
| 2 USAGE/DATA FORMAT |
|---|
P2BAT can be run either from the command line or through a graphical interface. In both cases, the data must be supplied as a pedigree file and a phenotype file. The first line of the pedigree file contains the names of the markers. Each subsequent line gives one individual's pedigree id, subject id, father id, mother id, gender, affection status and each pair of marker alleles, all separated by spaces. Missing data in this file are encoded with a 0. Except for the marker names, the pedigree file may not contain any alphabetic characters. The first line of the phenotype file lists the names of all the traits in that file. Each subsequent line gives one individual's pedigree id, subject id and the values of each trait, all separated by spaces. In contrast to the pedigree file, a hyphen (-) must be used here to indicate missing data.
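The two file layouts described above can be made concrete with a small sketch. The following Python code is not part of P2BAT or pbatR; it only illustrates the layout with invented toy data (the ids, the marker names `m1` and `m2`, and the traits `bmi` and `age` are hypothetical). Treating a genotype as missing when either allele is 0 is an assumption of this sketch.

```python
# Toy files in the layout described in the text. All ids, marker names and
# trait values are invented for illustration.
PED_TEXT = """\
m1 m2
1 1 0 0 1 0 1 2 2 2
1 2 0 0 2 0 1 1 0 0
1 3 1 2 1 2 1 2 2 2
"""

PHE_TEXT = """\
bmi age
1 1 27.5 54
1 2 - 51
1 3 22.1 -
"""


def parse_ped(text):
    """Parse a pedigree file: a header of marker names, then one row per
    individual with six id/status fields followed by two alleles per marker."""
    rows = [line.split() for line in text.strip().splitlines()]
    markers = rows[0]
    expected = 6 + 2 * len(markers)
    records = []
    for fields in rows[1:]:
        if len(fields) != expected:
            raise ValueError(f"expected {expected} fields, got {len(fields)}")
        values = [int(f) for f in fields]  # int() rejects alphabetic entries
        genotypes = {}
        for i, name in enumerate(markers):
            pair = (values[6 + 2 * i], values[7 + 2 * i])
            # Assumption of this sketch: a 0 allele marks the genotype missing.
            genotypes[name] = None if 0 in pair else pair
        records.append({
            "pedigree": values[0], "subject": values[1],
            "father": values[2], "mother": values[3],
            "gender": values[4], "affection": values[5],
            "genotypes": genotypes,
        })
    return markers, records


def parse_phe(text):
    """Parse a phenotype file: a header of trait names, then one row per
    individual with pedigree id, subject id and one value per trait."""
    rows = [line.split() for line in text.strip().splitlines()]
    traits = rows[0]
    expected = 2 + len(traits)
    records = []
    for fields in rows[1:]:
        if len(fields) != expected:
            raise ValueError(f"expected {expected} fields, got {len(fields)}")
        # A lone hyphen marks a missing trait value in the phenotype file.
        values = [None if f == "-" else float(f) for f in fields[2:]]
        records.append({
            "pedigree": int(fields[0]), "subject": int(fields[1]),
            "traits": dict(zip(traits, values)),
        })
    return traits, records


markers, ped = parse_ped(PED_TEXT)
traits, phe = parse_phe(PHE_TEXT)
print(markers, ped[1]["genotypes"])
print(traits, phe[1]["traits"])
```

Note the two different missing-data codes: 0 in the pedigree file and a hyphen in the phenotype file.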
2.1 Graphical user interface
The main window of the analysis portion of the graphical interface is started with the command pbat() and is shown in Figure 1. Phenotypes, covariates, SNPs/haplotype blocks, stratification variables and other options can be selected from lists within the interface. For instance, one would select gee (FBAT-GEE) to test single traits, pc (FBAT-PC) to test multiple traits simultaneously, and logrank (FBAT-LOGRANK) to test time-to-onset traits.
It is easy to take advantage of PBAT's parallel implementation. To use multiple processors or multiple cores, one can choose the multiple option and specify, for instance, the number of cores on a single-processor machine. To spread PBAT out over a cluster, one can use the cluster option and specify the number of nodes as the number of jobs. If a cluster refresh time of 0 is specified, the jobs are submitted and pbatR does not wait for the output; otherwise, the value specifies the number of seconds to wait between checks of whether the processes have finished. When 0 is specified, additional command line commands can be used to paste the output together at a later time.
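The idea of dividing an analysis into jobs and polling for completion can be sketched as follows. This is not the P2BAT implementation, whose internal partitioning scheme is not described here; it is a minimal Python illustration under the assumption that markers are split into contiguous, near-equal chunks. The function names `split_markers` and `wait_for_jobs` are hypothetical.

```python
import time


def split_markers(markers, n_jobs):
    """Partition a marker list into at most n_jobs contiguous chunks whose
    sizes differ by at most one (an assumed scheme, for illustration)."""
    if n_jobs < 1:
        raise ValueError("n_jobs must be at least 1")
    # Never create more jobs than markers; keep one (empty) job if no markers.
    n_jobs = min(n_jobs, len(markers)) or 1
    base, extra = divmod(len(markers), n_jobs)
    chunks, start = [], 0
    for j in range(n_jobs):
        size = base + (1 if j < extra else 0)
        chunks.append(markers[start:start + size])
        start += size
    return chunks


def wait_for_jobs(all_done, refresh_seconds, sleep=time.sleep):
    """Mimic the refresh-time semantics described in the text.

    A refresh time of 0 means the jobs are only submitted: return False at
    once so the caller knows the output must be collected later. Otherwise,
    poll all_done() every refresh_seconds and return True when it succeeds.
    """
    if refresh_seconds == 0:
        return False
    while not all_done():
        sleep(refresh_seconds)
    return True


if __name__ == "__main__":
    snps = [f"m{i}" for i in range(1, 11)]
    for job, chunk in enumerate(split_markers(snps, 3), start=1):
        print(f"job {job}: {chunk}")
```

Contiguous chunks keep each job's input simple to describe (a first and last marker), which is one reason a scheme like this is a natural fit for a queuing system.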
The power and sample size interface, partially shown in Figure 1, is started with the command pbat.power(). Options to calculate power are available for both binary and continuous traits, in both family-based and population-based studies. Additionally, options to calculate sample size are available for population-based studies.
2.2 Command line interface
For additional control, or as an alternative interface, one can also use the command line. For the analysis portion, the data are read in with read.ped and read.phe, either partially (the default for the GUI) or completely. A partial read loads just the marker names, or no names at all for datasets with millions of SNPs; a complete read loads the entire dataset into objects that extend a data frame. P2BAT is then run with the command pbat.m. The default options and values are identical to the ones shown in the graphical interface in Figure 1. An intuitive formula notation is used to specify the model for the association testing. For instance, suppose that we have phenotypes p1 and p2; covariates c1 to c3; and SNPs m1 to m6. The formula for a single phenotype, a single covariate, and three SNPs is given by
[Formula displayed as an image in the original; not recoverable.]
If instead we wanted to test for an association of the phenotype p1 with one of the two haplotype blocks (m1, m2, m3 and m4, m5, m6) in the presence of the gene–environment interaction with variable c2 and the covariate c1, we would specify
[Formula displayed as an image in the original; not recoverable.]
Finally, if we wanted to do a time-to-onset analysis with FBAT-LOGRANK (Lange et al., 2004b; Jiang et al., 2006) on all the SNPs, we would have time & censor ~ c1. Further examples are available in the documentation. This operation returns an object that works with the standard generic R functions, such as summary and plot (plots are available only for time-to-onset analyses). The time-to-onset plots follow the algorithm developed by Jiang et al. (2006) and are shown in Figure 2. To configure multiple jobs from the command line, use pbat.setmode. Lastly, the power and sample size commands can also be used from the command line, with commands such as pbat.binaryFamily.
| 3 RESULTS |
|---|
To assess the performance of P2BAT, we re-ran the analysis of the 100 K-scan in the Framingham Heart Study (Herbert et al., 2006; Laird and Lange, 2006), using the entire data set with 1400 probands. We analyzed BMI measurements at the six exams of the study as longitudinal data in the FBAT-PC approach (Lange et al., 2004a). Running the analysis in parallel on a cluster with 50 dual-nodes (Xeon™ 3.2 GHz), P2BAT used 70 MB of memory (per node) and took 41 min to complete the analysis. The aggregated results from the program runs are shown in Figure 3. Since P2BAT is able to divide the analysis into as many parallel jobs as there are SNPs, the analysis could have been split up into 100 000 parallel jobs. Assuming that there are ~8 million common SNPs (Carlson, 2006), and given constantly growing cluster sizes, even an analysis of all common SNPs, should the technology become available, will not face running time issues.
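The scaling argument can be made concrete with a back-of-the-envelope calculation. The Python sketch below rests on two simplifying assumptions that are not claims of the paper: that the work grows linearly with the number of SNPs, and that it divides evenly across nodes with no scheduling overhead. The function name `scaled_runtime_minutes` is hypothetical.

```python
def scaled_runtime_minutes(base_minutes, base_snps, base_nodes, snps, nodes):
    """Estimate wall-clock minutes for `snps` SNPs on `nodes` nodes.

    Assumes work is proportional to the SNP count and splits evenly across
    nodes (idealized; real clusters add queuing and I/O overhead).
    """
    node_minutes_per_snp = base_minutes * base_nodes / base_snps
    return node_minutes_per_snp * snps / nodes


# Reported run: 100 000 SNPs on 50 nodes in 41 min.
same_cluster = scaled_runtime_minutes(41, 100_000, 50, 8_000_000, 50)
larger_cluster = scaled_runtime_minutes(41, 100_000, 50, 8_000_000, 4_000)

print(f"8 million SNPs on 50 nodes:   {same_cluster / 60:.1f} h")
print(f"8 million SNPs on 4000 nodes: {larger_cluster:.0f} min")
```

Under these assumptions, 8 million SNPs would take roughly 55 hours on the same 50 nodes, and a cluster 80 times larger would bring the estimate back to the reported 41 min.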
| 4 DISCUSSION/CONCLUSION |
|---|
In the search for genes for complex diseases, genome-wide association studies are increasingly replacing standard linkage studies. For complex diseases with their numerous disease-related phenotypes, the analysis of such studies is cumbersome, error-prone and computationally intensive. In order to translate the wealth of information into the successful identification of novel genes (Herbert et al., 2006), powerful and user-friendly analysis tools are needed. With P2BAT, we have developed such a tool based on the original PBAT program. P2BAT is a software package for the analysis of family-based association studies that is embedded in the R environment, contains a user-friendly GUI, and allows the analysis of genome-wide association studies to be run massively parallel on a cluster, reducing the analysis time for 100 000 SNPs and more to a couple of minutes.
| Acknowledgments |
|---|
The authors thank the participants of the FHS for their contribution and the NHLBI-FHS investigators for providing DNA samples and phenotypic data for our analysis. The authors would also like to thank the reviewers for their suggestions. Funding was provided in part by grant MH17119. Funding to pay the Open Access publication charges for this article was provided by the Department of Biostatistics, Harvard School of Public Health.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Keith A Crandall
Received on July 28, 2006; revised on September 8, 2006; accepted on September 28, 2006
| REFERENCES |
|---|
Carlson, C.S. (2006) Agnosticism and equity in genome-wide association studies. Nat. Genet., 38, 605–606.
Herbert, A., et al. (2006) A common genetic variant is associated with adult and childhood obesity. Science, 312, 279–283.
Hirschhorn, J.N. and Daly, M.J. (2005) Genome-wide association studies for common diseases and complex traits. Nat. Rev. Genet., 6, 95–108.
Jiang, H., et al. (2006) Family-based association test for time-to-onset data with time-dependent differences between the hazard functions. Genet. Epidemiol., 30, 124–132.
Kachergus, J., et al. (2005) Identification of a novel LRRK2 mutation linked to autosomal dominant parkinsonism: evidence of a common founder across European populations. Am. J. Hum. Genet., 76, 672–680.
Klein, R.J., et al. (2005) Complement factor H polymorphism in age-related macular degeneration. Science, 308, 385–389.
Laird, N.M. and Lange, C. (2006) Family-based designs in the age of large-scale gene-association studies. Nat. Rev. Genet., 7, 385–394.
Lange, C., et al. (2003) Using the noninformative families in family-based association tests: a powerful new testing strategy. Am. J. Hum. Genet., 73, 801–811.
Lange, C., et al. (2004a) A family-based association test for repeatedly measured quantitative traits adjusting for unknown environmental and/or polygenic effects. Stat. Appl. Genet. Mol. Biol., 3, Article 17.
Lange, C., et al. (2004b) Family-based association tests for survival and times-to-onset analysis. Stat. Med., 23, 179–189.
Roeder, K., et al. (2005) Analysis of single-locus tests to detect gene/disease associations. Genet. Epidemiol., 28, 207–219.
R Development Core Team (2005) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.
Thomas, D., et al. (2004) Two-stage sampling designs for gene association studies. Genet. Epidemiol., 27, 401–414.
Van Steen, K., et al. (2005) Genomic screening and replication using the same data set in family-based association testing. Nat. Genet., 37, 683–691.
Verzilli, C.J., et al. (2006) Bayesian graphical models for genomewide association studies. Am. J. Hum. Genet., 79, 100–112.