Bioinformatics Advance Access originally published online on October 4, 2006
Bioinformatics 2006 22(24):3103-3105; doi:10.1093/bioinformatics/btl507
© 2006 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

P2BAT: a massive parallel implementation of PBAT for genome-wide association studies in R

Thomas Hoffmann 1,* and Christoph Lange 1,2

1 Department of Biostatistics, Harvard School of Public Health 655 Huntington Avenue, Boston, MA 02115, USA
2 Harvard Medical School, Channing Laboratory 181 Longwood Avenue, Boston, MA 02115, USA

*To whom correspondence should be addressed.


    ABSTRACT

Summary: The software tool P2BAT provides a massively parallel and user-friendly implementation of the PBAT analysis tools for family-based association tests (FBATs) in large-scale studies, including genome-wide association studies with several thousand subjects. Built on the original PBAT implementation of the Lange–Van Steen algorithm to bypass the multiple testing problem in family-based association studies, P2BAT integrates all PBAT analysis tools for binary and complex traits into R and makes them accessible through a user-friendly GUI. The genome-wide analysis tools are fully automated and can be run massively parallel directly through the GUI. P2BAT is fully documented and contains graphical output tools for time-to-onset analysis. P2BAT also features the ability to test for gene–environment/drug interactions.

Availability: The P2BAT package is available as the R package ‘pbatR’ which can be downloaded from http://cran.r-project.org/. The PBAT-software is available at http://www.biostat.harvard.edu/~clange/.

Contact: thoffman{at}hsph.harvard.edu


    1 INTRODUCTION
The era of genome-wide association studies has finally started (Herbert et al., 2006; Kachergus et al., 2005; Klein et al., 2005), offering a unique chance to identify genes for complex traits through an unbiased search at a genome-wide level. The initial fear was that the new wealth of genomic data could not be translated into increased statistical power to detect new genes, but would instead be diluted by the multiple-comparisons problem. This concern about the major statistical roadblock in such studies now seems to be fading as new methodology emerges. For studies of unrelated individuals, several statistical approaches have been suggested (Hirschhorn and Daly, 2005; Thomas et al., 2004; Roeder et al., 2005; Verzilli et al., 2006). For genome-wide association studies in family-based designs, Van Steen et al. (2005) proposed a novel testing strategy that bypasses the multiple testing problem within one study and thereby reduces the impact of study heterogeneity. The approach has been applied successfully to a 100 K-scan in a family sample of the Framingham Heart Study (Herbert et al., 2006; Lange et al., 2003; Laird and Lange, 2006), which is, to date, the only successful genome-wide association study revealing a novel, replicable candidate gene for obesity.

However, so far no software implementation for genome-wide association studies exists that can analyze the vast amount of information produced by such studies in a user-friendly way and that runs massively parallel on clusters, reducing the analysis time to a couple of minutes. With P2BAT, we have developed such a software tool. P2BAT implements all the analysis features of PBAT in R (R Development Core Team, 2005) and makes them accessible through a user-friendly GUI. Further, without requiring any additional effort from the user, P2BAT allows one to run the analysis massively parallel with as many parallel jobs as specified (Fig. 1). Parallelization in P2BAT is achieved by running multiple instances of the original PBAT program, using the queuing system of a cluster. The process is fully automated and monitored by P2BAT. The package P2BAT is available as the R package ‘pbatR’ in conjunction with the software PBAT. The two software packages can be downloaded from http://cran.r-project.org/ and from http://www.biostat.harvard.edu/~clange/, respectively. Detailed instructions are available on the webpage http://people.fas.harvard.edu/~tjhoffm/pbatR.html.


Fig. 1 The P2BAT graphical interface.

 

    2 USAGE/DATA FORMAT
P2BAT can be run either from the command line or through a graphical interface. In both cases, the data must be supplied as a pedigree file and a phenotype file. The first line of the pedigree file contains the names of the markers. Each subsequent line corresponds to one individual and lists the pedigree id, subject id, father id, mother id, gender, affection status and each pair of marker alleles, all separated by spaces. Missing data in the pedigree file are encoded with a ‘0’. Except for the marker names, the pedigree file may not contain any non-numeric characters. The first line of the phenotype file lists the names of all the traits in the phenotype file. Each subsequent line corresponds to one individual and lists the pedigree id, subject id and the values of each trait, all separated by spaces. In contrast to the pedigree file, a hyphen ‘-’ must be used here to indicate missing data.
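As an illustration of the two file layouts described above, the following Python sketch parses a toy pedigree file and a toy phenotype file from in-memory strings. It is not part of pbatR, which reads these files with read.ped and read.phe; the function names parse_ped and parse_phe, the dictionary keys and the toy values are all hypothetical and chosen only for this example.

```python
import io

def parse_ped(handle):
    """Parse a pedigree file: marker names on line 1, then one subject per line."""
    markers = handle.readline().split()
    subjects = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        pid, sid, father, mother, sex, affection = fields[:6]
        alleles = fields[6:]
        if len(alleles) != 2 * len(markers):
            raise ValueError(f"subject {sid}: expected {2 * len(markers)} alleles")
        genotypes = {}
        for i, name in enumerate(markers):
            pair = (alleles[2 * i], alleles[2 * i + 1])
            # In the pedigree file, '0' encodes a missing allele.
            genotypes[name] = None if "0" in pair else pair
        subjects.append({
            "pid": pid, "id": sid, "father": father, "mother": mother,
            "sex": sex, "affection": affection, "genotypes": genotypes,
        })
    return markers, subjects

def parse_phe(handle):
    """Parse a phenotype file: trait names on line 1, then one subject per line."""
    traits = handle.readline().split()
    phenotypes = {}
    for line in handle:
        fields = line.split()
        if not fields:
            continue
        pid, sid, values = fields[0], fields[1], fields[2:]
        if len(values) != len(traits):
            raise ValueError(f"subject {sid}: expected {len(traits)} trait values")
        # In the phenotype file, '-' encodes a missing value.
        phenotypes[(pid, sid)] = {
            t: None if v == "-" else float(v) for t, v in zip(traits, values)
        }
    return traits, phenotypes

# Toy data (hypothetical): one trio, two markers, two traits.
PED_TEXT = """m1 m2
1 1 0 0 1 1 1 2 2 2
1 2 0 0 2 1 1 1 0 0
1 3 1 2 1 2 1 2 2 2
"""
PHE_TEXT = """bmi age
1 1 27.5 54
1 2 - 51
1 3 31.2 23
"""

markers, subjects = parse_ped(io.StringIO(PED_TEXT))
traits, phenotypes = parse_phe(io.StringIO(PHE_TEXT))
```

Note the two different missing-data codes: the second subject has no genotype at m2 (alleles ‘0 0’) and no bmi value (‘-’), and both are mapped to None here.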

2.1 Graphical user interface
The main window of the analysis portion of the graphical interface is started with the command pbat() and is shown in Figure 1. Phenotypes, covariates, SNPs/haplotype blocks, stratification variables and other options can be selected from lists within the interface. For instance, for testing single traits one would select ‘gee’ for FBAT-GEE, for testing multiple traits simultaneously one would select ‘pc’ for FBAT-PC, and for testing time-to-onset traits one would select ‘logrank’ for FBAT-LOGRANK.

It is easy to take advantage of PBAT's parallel implementation. To use multiple processors or multiple cores, one can choose the ‘multiple’ option and specify, for instance, the number of cores on a single-processor machine. To spread PBAT out over a cluster, one can use the ‘cluster’ option and specify the number of nodes as the number of jobs. If a cluster refresh time of ‘0’ is specified, the jobs will be submitted and pbatR will not wait for the output; any other value specifies the number of seconds to wait between checks of whether the processes are done. When ‘0’ is specified, additional command line commands can be used to paste the output together at a later time.
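The two ideas in this paragraph, dividing the markers among a chosen number of jobs and polling with a refresh time, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the way PBAT or pbatR partitions or monitors work internally: the functions split_into_jobs and wait_for_jobs are hypothetical, and the even split of markers is an assumption.

```python
import time

def split_into_jobs(n_markers, n_jobs):
    """Return half-open (start, stop) marker ranges, one per job.

    ASSUMPTION: markers are split as evenly as possible; job sizes
    differ by at most one marker.
    """
    n_jobs = max(1, min(n_jobs, n_markers))
    base, extra = divmod(n_markers, n_jobs)
    ranges, start = [], 0
    for j in range(n_jobs):
        stop = start + base + (1 if j < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges

def wait_for_jobs(all_done, refresh_seconds, sleep=time.sleep):
    """Poll all_done() every refresh_seconds until it returns True.

    A refresh time of 0 mirrors the 'submit and do not wait' behaviour
    described in the text: return False immediately, meaning the output
    has to be collected later.
    """
    if refresh_seconds == 0:
        return False
    while not all_done():
        sleep(refresh_seconds)
    return True

# Ten markers over three jobs.
job_ranges = split_into_jobs(10, 3)
```

Passing the sleep function as a parameter keeps the polling loop easy to exercise without actually waiting.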

The power and sample size interface, partially shown in Figure 1, is started with the command pbat.power(). Options to calculate power are available for both binary and continuous traits, in both family-based and population-based studies. Additionally, options to calculate sample size are available for the population-based studies.

2.2 Command line interface
For additional control, or as an alternative interface, one can also use the command line. For the analysis portion, the data are read in with read.ped and read.phe, either partially (the default for the GUI) or completely. A partial read loads just the marker names, or no names at all for datasets with millions of SNPs; a complete read loads the entire dataset into objects that extend a data frame. P2BAT is then run with the command pbat.m. The default options and values are identical to the ones shown in the graphical interface in Figure 1. An intuitive formula notation is used to specify the model for the association testing. For instance, suppose that we have phenotypes p1 and p2; covariates c1 to c3; and SNPs m1 to m6. The formula for a single phenotype, a single covariate, and three SNPs is given by

Formula

If instead we wanted to test for an association of the phenotype p1 with one of the two haplotype blocks (m1, m2, m3 and m4, m5, m6) in the presence of the gene–environment interaction with variable c2 and the covariate c1, we would specify

Formula
where mi(.) denotes the interaction term. Now, if we wanted to do a multivariate analysis with FBAT-PC, testing our phenotypes p1 and p2 simultaneously, and including the third covariate c3 to second order, we would have

Formula

Finally, if we wanted to do a time-to-onset analysis with FBAT-LOGRANK (Lange et al., 2004b; Jiang et al., 2006) on all the SNPs, we would have ‘time & censor~c1’. Further examples are available in the documentation. This operation returns an object that works with the standard generic R functions, such as summary and plot (plots are available only for time-to-onset analyses). The time-to-onset plots follow the algorithm developed by Jiang et al. (2006); an example is shown in Figure 2. To configure multiple jobs from the command line, use pbat.setmode. Lastly, the power and sample size commands can also be used from the command line, with commands such as pbat.binaryFamily.


Fig. 2 The time-to-onset graph (Jiang et al., 2006) can be saved in the various graphical formats supported by R.

 

    3 RESULTS
To assess the performance of P2BAT, we re-ran the analysis of the 100 K-scan in the Framingham Heart Study (Herbert et al., 2006; Laird and Lange, 2006), using the entire data set with 1400 probands. We analyzed BMI measurements at the six exams of the study as longitudinal data in the FBAT-PC approach (Lange et al., 2004a). Running the analysis in parallel on a cluster with 50 dual-nodes (Xeon™ 3.2 GHz), P2BAT used 70 MB of memory (per node) and took 41 min to complete the analysis. The aggregated results from the program runs are shown in Figure 3. Since P2BAT is able to divide the analysis into as many parallel jobs as there are SNPs, the analysis could have been split up into 100 000 parallel jobs. Given that there are ~8 million common SNPs (Carlson, 2006) and that cluster sizes are constantly growing, even the analysis of all common SNPs, should the technology become available, will not face running time issues.
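The running time reported above can be turned into a rough back-of-envelope estimate for other SNP counts and cluster sizes. The Python sketch below is only that: it assumes that the 50 dual-nodes correspond to 100 worker processes and that run time scales perfectly linearly with the number of SNPs and inversely with the number of workers, with no queuing or I/O overhead. Neither assumption is stated in the text, and the function name estimated_minutes is hypothetical.

```python
REFERENCE_SNPS = 100_000      # size of the 100 K-scan
REFERENCE_WORKERS = 100       # ASSUMPTION: 50 dual-nodes = 100 worker processes
REFERENCE_MINUTES = 41.0      # reported wall-clock time

def estimated_minutes(n_snps, n_workers):
    """Estimate wall-clock minutes under ideal linear scaling (an assumption)."""
    cpu_minutes_per_snp = REFERENCE_MINUTES * REFERENCE_WORKERS / REFERENCE_SNPS
    return cpu_minutes_per_snp * n_snps / n_workers

# The reference run itself, and ~8 million SNPs on two cluster sizes.
same_run = estimated_minutes(100_000, 100)
all_common_small = estimated_minutes(8_000_000, 100)
all_common_large = estimated_minutes(8_000_000, 8_000)
```

Under these assumptions, keeping the wall-clock time constant as the number of SNPs grows requires the number of workers to grow in proportion.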


Fig. 3 P2BAT-analysis results from a 100 K-scan in the Framingham Heart Study: the top 10 SNPs based on the conditional power estimates. After adjusting for selecting 10 comparisons/SNPs, the P-values for SNP SNP_A – 1669246 and SNP_A – ???????? achieve genome-wide significance. SNP SNP_A – 1669246 was previously identified in Herbert et al. (2006).
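The adjustment for the 10 selected SNPs mentioned in the caption of Fig. 3 can be illustrated with a short Python sketch. The caption does not say which correction was used; a Bonferroni correction over the 10 selected SNPs is assumed here. The SNP labels and P-values are hypothetical and are not the study's results, and the function name bonferroni_significant is also hypothetical.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag P-values that remain significant after a Bonferroni correction.

    ASSUMPTION: the adjustment for the selected SNPs is a Bonferroni
    correction, i.e. each P-value is compared with alpha / (number of
    selected SNPs) rather than with alpha itself.
    """
    threshold = alpha / len(p_values)
    return {snp: p <= threshold for snp, p in p_values.items()}

# Hypothetical P-values for 10 SNPs selected in a screening step.
top10 = {
    "snp01": 0.0026, "snp02": 0.0041, "snp03": 0.0120, "snp04": 0.0310,
    "snp05": 0.0870, "snp06": 0.1500, "snp07": 0.2200, "snp08": 0.4100,
    "snp09": 0.6300, "snp10": 0.9000,
}

significant = bonferroni_significant(top10)
n_significant = sum(significant.values())
```

Because only 10 SNPs are carried forward to the test, the threshold is 0.05/10 rather than 0.05 divided by the full number of genotyped SNPs, which is the point of restricting the test to a small selected set.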

 

    4 DISCUSSION/CONCLUSION
In the search for genes for complex diseases, genome-wide association studies are increasingly replacing standard linkage studies. For complex diseases with their numerous disease-related phenotypes, the analysis of such studies is cumbersome, error-prone and computationally intensive. In order to translate the wealth of information into the successful identification of novel genes (Herbert et al., 2006), powerful and user-friendly analysis tools are needed. With P2BAT, we have developed such a tool based on the original PBAT program. P2BAT is a software package for the analysis of family-based association studies that is embedded in the R environment, contains a user-friendly GUI and allows one to run the analysis of genome-wide association studies massively parallel on a cluster, reducing the analysis time for 100 000 SNPs and more to a couple of minutes.


    Acknowledgments
 
The authors thank the participants of the FHS for their contribution and the NHLBI-FHS investigators for providing DNA samples and phenotypic data for our analysis. The authors would also like to thank the reviewers for their suggestions. Funding was provided in part by grant MH17119. Funding to pay the Open Access publication charges for this article was provided by the Department of Biostatistics, Harvard School of Public Health.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Keith A Crandall

Received on July 28, 2006; revised on September 8, 2006; accepted on September 28, 2006

    REFERENCES

    Carlson, C.S. (2006) Agnosticism and equity in genome-wide association studies. Nat. Genet., 38, 605–606.

    Herbert, A., et al. (2006) A common genetic variant is associated with adult and childhood obesity. Science, 312, 279–283.

    Hirschhorn, J.N. and Daly, M.J. (2005) Genome-wide association studies for common diseases and complex traits. Nat. Rev. Genet., 6, 95–108.

    Jiang, H., et al. (2006) Family-based association test for time-to-onset data with time-dependent differences between the hazard functions. Genet. Epidemiol., 30, 124–132.

    Kachergus, J., et al. (2005) Identification of a novel LRRK2 mutation linked to autosomal dominant parkinsonism: evidence of a common founder across European populations. Am. J. Hum. Genet., 76, 672–680.

    Klein, R.J., et al. (2005) Complement factor H polymorphism in age-related macular degeneration. Science, 308, 385–389.

    Laird, N.M. and Lange, C. (2006) Family-based designs in the age of large-scale gene-association studies. Nat. Rev. Genet., 7, 385–394.

    Lange, C., et al. (2003) Using the noninformative families in family-based association tests: a powerful new testing strategy. Am. J. Hum. Genet., 73, 801–811.

    Lange, C., et al. (2004a) A family-based association test for repeatedly measured quantitative traits adjusting for unknown environmental and/or polygenic effects. Stat. Appl. Genet. Mol. Biol., 3, Article17.

    Lange, C., et al. (2004b) Family-based association tests for survival and times-to-onset analysis. Stat. Med., 23, 179–189.

    Roeder, K., et al. (2005) Analysis of single-locus tests to detect gene/disease associations. Genet. Epidemiol., 28, 207–219.

    R Development Core Team (2005) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.

    Thomas, D., et al. (2004) Two-Stage sampling designs for gene association studies. Genet. Epidemiol., 27, 401–414.

    Van Steen, K., et al. (2005) Genomic screening and replication using the same data set in family-based association testing. Nat. Genet., 37, 683–691.

    Verzilli, C.J., et al. (2006) Bayesian graphical models for genomewide association studies. Am. J. Hum. Genet., 79, 100–112.

