Bioinformatics Advance Access originally published online on July 27, 2007
Bioinformatics 2007 23(19):2643-2644; doi:10.1093/bioinformatics/btm376
AssociationDB: web-based exploration of genomic association
¹Institute for Medical Genetics, Charité – Universitätsmedizin Berlin, Augustenburger Platz 1, D-13353 Berlin, ²Max Planck Institute for Molecular Genetics, Ihnestrasse 73, D-14195 Berlin and ³Department of Nephrology and Hypertension, Medical Clinic 4, University of Erlangen-Nuremberg, Breslauer Strasse 201, D-90471 Nürnberg, Germany
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Summary: Genome-wide association studies use hundreds of thousands of markers, making it challenging to present and interpret the results. We developed a graphical, web-based solution for interactive exploration of the results of case-control studies, with tight integration of related gene information and tissue-specific expression data. Association results are presented as vertical bars at their physical positions, with known genes included as horizontal bars at their respective physical positions. The interface allows the specification of filtering criteria for the association data and highlights potentially interesting genes with user-specified terms occurring in their reports or with relevant expression patterns. Pop-up windows and hyperlinks provide drill-down capabilities and quick access to relevant data. AssociationDB can be used either as a stand-alone solution or as a front-end joining association results obtained by other software with genomic information.
Availability: http://genetik.charite.de/AssociationDB
Contact: dominik.seelow{at}charite.de
Supplementary information: The source code, a web-based demo, a step-by-step manual, and an installation guide are available at http://genetik.charite.de/AssociationDB.
Whole-genome association studies generate a huge amount of data, rendering a quick review of results nearly impossible. In a conventional approach, one would start by sorting the results by P-value. However, the genomic context and the information provided by nearby markers are lost this way. In addition, the sheer amount of data makes visualization of the results very difficult, since standard spreadsheet applications cannot handle this task. Further, a hit remains anonymous as long as no candidate gene can be assigned to it.
We developed the web-based, interactive, open source database AssociationDB to address these analysis bottlenecks. The database is primarily intended to provide a user-friendly and fast overview of the results of either genome-wide or locus-specific association studies with a case-control setting. The integration of gene information, gene expression data and hyperlinks to web resources puts the results straight into a genomic and functional context. Due to the client-server architecture, no additional software installation on client computers is required. The web-based interface lets users quickly query and visualize association results and genomic data. It also allows data from different research groups and projects to be kept on the same server. Access to ongoing projects can be granted to the respective data owners and denied to others, while published data can be made completely public.
The backbone of the gene information table was taken from NCBI Entrez Gene; further information comprises NCBI and OMIM reports (Hamosh et al., 2005). Gene and marker positions were taken from the NCBI genetic map, build 36.1. Other integrated data sets are known microRNAs (Griffiths-Jones et al., 2006), selection data (Voight et al., 2006) and GeneAtlas expression data (Su et al., 2004). Adding one's own association data is a two-stage process: first, the analysis groups must be defined and their genotypes imported. Afterwards, basic statistics such as χ² tests for Hardy–Weinberg equilibrium (HWE), genotypic and allelic association can be performed with functions on the database level. To provide reasonable access times, the database stores aggregated genotypes and pre-analysed results instead of re-calculating them for every new query. This restricts reports to predefined cases and controls but permits the easy integration of results obtained by further external statistical tests. We provide Perl scripts for an easy export of genotypes and import of results. The general workflow and typical import, analysis and access times are presented on our website.
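The basic statistics mentioned above can be illustrated with a minimal Python sketch. It is not the database-level implementation of AssociationDB; the function names and the use of plain Pearson χ² tests with one degree of freedom are assumptions made for illustration. It works on aggregated genotype counts (AA, AB, BB), in line with the aggregated storage described above.

```python
import math

def chi2_p_1df(stat):
    """Upper-tail P-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

def hwe_chi2(n_aa, n_ab, n_bb):
    """Pearson chi-square test for HWE from aggregated genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2.0 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Skip cells with zero expectation (monomorphic marker).
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return stat, chi2_p_1df(stat)

def allelic_chi2(case_counts, control_counts):
    """Pearson chi-square on the 2x2 table of allele counts, cases vs controls.

    Each argument is a tuple of genotype counts (AA, AB, BB).
    """
    def allele_counts(genotypes):
        aa, ab, bb = genotypes
        return 2 * aa + ab, 2 * bb + ab

    a, b = allele_counts(case_counts)
    c, d = allele_counts(control_counts)
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0  # test undefined, e.g. marker monomorphic in both groups
    stat = n * (a * d - b * c) ** 2 / denom
    return stat, chi2_p_1df(stat)
```

For example, `hwe_chi2(25, 50, 25)` returns a statistic of 0, because the counts match the HWE expectation exactly.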
The main purpose of AssociationDB is a fast and intuitive interpretation of association results in the context of the respective genomic data; hence, a graphical representation was chosen (Fig. 1). The display comprises three different bar charts: allelic association, genotypic association, and an aggregation of the allelic association of nearby markers. For the aggregation, each SNP is scored for the significance of the association (weak, modest, high); the scores of nearby markers are divided by their distance in SNPs and added. This procedure is repeated for every comparison included. It aggregates vicinity information as well as different controls, making the scoring relatively robust against false positives due to genotyping errors. At the top of the window, genes are represented as horizontal bars reflecting their position and size. Genes with words of interest in their gene information or OMIM reports (Hamosh et al., 2005), or fulfilling certain expression criteria, are highlighted. The gene description is presented in a pop-up window; clicking on a gene also opens a pop-up providing direct links to gene-specific information in our database, ENSEMBL, NCBI Entrez Gene and GeneCards (Rebhan et al., 1997). In our own database, this information comprises relevant OMIM reports, expression data and NCBI GeneRIFs. To add further decision criteria, the location of microsatellite markers and LOD scores obtained in previous linkage analyses can be included as well. The database design allows the storage of multiple maps and hence permits an easy update of the positions as well as the use of older builds if necessary.
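The aggregation of nearby markers described above can be sketched as follows. This is only an illustration under stated assumptions: the P-value thresholds for weak, modest and high significance, the window size, and the decision to count a SNP's own score at full weight are hypothetical, since the text does not specify them.

```python
def significance_score(p, thresholds=(0.05, 0.01, 0.001)):
    """Map a P-value to 0 (none), 1 (weak), 2 (modest) or 3 (high).

    The thresholds are hypothetical placeholders.
    """
    return sum(1 for t in thresholds if p < t)

def aggregated_scores(comparisons, window=5):
    """Aggregate allelic association over nearby markers and comparisons.

    comparisons: one list of P-values per comparison (e.g. cases versus
    each control group), with markers in map order.
    window: how many neighbouring SNPs to consider on each side (assumed).
    """
    n = len(comparisons[0])
    totals = [0.0] * n
    for pvalues in comparisons:
        scores = [significance_score(p) for p in pvalues]
        for i in range(n):
            totals[i] += scores[i]  # the SNP's own score (assumed full weight)
            for d in range(1, window + 1):
                for j in (i - d, i + d):
                    if 0 <= j < n:
                        # Neighbour score divided by its distance in SNPs.
                        totals[i] += scores[j] / d
    return totals
```

With a single comparison `[[0.0005, 0.5, 0.5]]` and `window=2`, the highly significant first SNP contributes 3 to itself, 3 to its direct neighbour and 1.5 to the SNP two positions away.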
For a validation of the results, up to three association studies (e.g. cases versus three different control groups or studies in different populations) can be displayed. Cases can be tested against up to two other control populations. In the graphical representation, P-values shared among different groups are displayed in darker colours. P-values smaller than a user-defined significance threshold are indicated as red bars. Deviation from HWE in controls, which may point to genotyping problems, is indicated as well, unless the user decides to remove those markers from the output completely.
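The display rules for the significance threshold and HWE deviation can be illustrated with a short sketch. The record layout (`name`, `p`, `hwe_p`), the function name and the default thresholds are hypothetical and do not reflect the actual AssociationDB schema or interface.

```python
def prepare_markers(markers, p_threshold=0.001, hwe_threshold=0.01, drop_hwe=False):
    """Annotate markers for display.

    markers: list of dicts with 'name', 'p' (association P-value) and
    'hwe_p' (P-value of the HWE test in controls). All names are assumed.
    """
    shown = []
    for m in markers:
        hwe_deviation = m["hwe_p"] < hwe_threshold
        if hwe_deviation and drop_hwe:
            continue  # user chose to remove markers deviating from HWE
        shown.append({
            "name": m["name"],
            # Red bar if below the user-defined significance threshold.
            "colour": "red" if m["p"] < p_threshold else "default",
            "hwe_flag": hwe_deviation,
        })
    return shown
```

Keeping flagged markers visible by default mirrors the behaviour described above, where removal from the output is an explicit user decision.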
AssociationDB differs from existing data analysis solutions such as PLINK (http://pngu.mgh.harvard.edu/~purcell/plink), Stata (Dufouil et al., 2004), Genomizer (Franke et al., 2006), Scout (Epstein et al., 2005) or R modules (Zhao and Tan, 2006), which offer an exhaustive set of analysis methods but lack the integration of genomic data other than marker positions and are often difficult to use for non-statisticians. On the other hand, AssociationDB is not intended to be a mere data repository such as the Genetic Association Database (Becker et al., 2004). Its aim is to fill the gap between sophisticated data analysis tools and integrated visualization approaches with intuitive access to genomic data. The data analysis capabilities of AssociationDB are limited: it can neither generate haplotypes nor perform extended statistical analyses. However, the results of such analyses carried out by other tools can easily be integrated and explored in their genomic context.
| ACKNOWLEDGEMENTS |
|---|
K.H. is supported by Deutsche Forschungsgemeinschaft grant DFG, SFB 577, project A4, and is a recipient of a Rahel Hirsch Fellowship, provided by the Charité Medical Faculty. T.H.L. is supported by grants from the Deutsche Forschungsgemeinschaft (DFG; LiDFG768/4-1/4-2/6-1).
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Alex Bateman
Received on December 11, 2006; revised on May 31, 2007; accepted on July 14, 2007
| REFERENCES |
|---|
Becker KG, et al. The genetic association database. Nat. Genet. (2004) 36:431–432.
Dufouil C, et al. Analysis of longitudinal studies with death and drop-out: a case study. Stat. Med. (2004) 23:2215–2226.
Epstein MP, et al. Genetic association analysis using data from triads and unrelated subjects. Am. J. Hum. Genet. (2005) 76:592–608.
Franke A, et al. GENOMIZER: an integrated analysis system for genome-wide association data. Hum. Mutat. (2006) 27:583–588.
Griffiths-Jones S, et al. miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Res. (2006) 34:D140–D144.
Hamosh A, et al. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. (2005) 33:D514–D517.
Lamason RL, et al. SLC24A5, a putative cation exchanger, affects pigmentation in zebrafish and humans. Science (2005) 310:1782–1786.
Rebhan M, et al. GeneCards: integrating information about genes, proteins and diseases. Trends Genet. (1997) 13:163.
Su AI, et al. A gene atlas of the mouse and human protein-encoding transcriptomes. Proc. Natl Acad. Sci. USA (2004) 101:6062–6067.
Voight BF, et al. A map of recent positive selection in the human genome. PLoS Biol. (2006) 4:e72.
Zhao JH, Tan Q. Integrated analysis of genetic data with R. Hum. Genomics (2006) 2:258–265.
