Bioinformatics Advance Access originally published online on November 17, 2007
Bioinformatics 2008 24(1):146-148; doi:10.1093/bioinformatics/btm551
© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

SNPtoGO: characterizing SNPs by enriched GO terms

Daniel F. Schwarz 1,*, Oliver Hädicke 1,4, Jeanette Erdmann 3, Andreas Ziegler 1, Daniel Bayer 2 and Steffen Möller 2,*

1Institut für Medizinische Biometrie und Statistik, 2Institut für Neuro- und Bioinformatik, 3Medizinische Klinik II, Universität zu Lübeck, Ratzeburger Allee 160, 23538 Lübeck and 4Current Address: Max Planck Institute for Dynamics of Complex Technical Systems, Sandtorstrasse 1, 39106 Magdeburg, Germany

*To whom correspondence should be addressed.


    ABSTRACT
 

For the analysis of complex polygenic diseases, one does not expect all patients to share the same disease-associated alleles. Disease-causing variations will not even be assigned to identical sets of genes across patients. However, one does expect overlaps in the sets of genes involved, and even more so in the molecular processes assigned to them. Furthermore, the assignment of single nucleotide polymorphisms (SNPs) to genes is highly ambiguous for intergenic SNPs. The tool presented here therefore adds external information, namely Gene Ontology (GO) terms (Gene Ontology Consortium), to the analysis of SNP data.

Availability: A web interface and source code are offered at https://webtools.imbs.uni-luebeck.de/snptogo

Contact: schwarz@imbs.uni-luebeck.de


    1 INTRODUCTION
 
Genome-wide SNP chip experiments comprise hundreds of thousands of features. The challenge is to reduce the number of explanatory variables without losing relevant biological information. For the analysis of SNP data, an abstraction towards haplotype blocks or regions of little recombination is an obvious choice. Integrating external information into that process allows, for example, filtering by the difference in allele frequencies between populations external to the study at hand (Möller et al., 2004). By characterizing dominant features shared by multiple SNPs, the number of features is reduced and the results may gain statistical power.

Equivalent difficulties affect the analysis of gene expression data. That field has advanced considerably in embracing external data, a consequence of the direct availability of an avalanche of manually curated and automatically deduced annotations (The UniProt Consortium, 2007). Such approaches comprise the inspection of molecular pathways (Chung et al., 2004; Mlecnik et al., 2005) and, as is the focus of this work, tools for the analysis of the enrichment (Gentleman, 2004; Wrobel et al., 2005) of GO terms (Gene Ontology Consortium, 2006).

The approach presented here extends this prior work to the analysis of SNPs. A SNP is given the GO terms of the gene with the most proximal chromosomal location. This assignment is ambiguous because genes overlap on chromosomes and because SNPs may be located between genes. The analysis of a set of SNPs determines subsets of SNPs that are each dominated by a particular molecular process, as represented by an entry in GO.


    2 APPROACH
 
This work presents a web interface to analyse the distribution of GO terms associated with a set of SNPs. SNPs are assigned to genes according to the Ensembl database (Hubbard et al., 2007). To include intergenic SNPs in the analysis, the user specifies a maximal distance between SNP and gene in the submission mask. An option restricts the acceptance of neighbouring genes to intergenic SNPs only. Because genes overlap, a SNP may be assigned to several genes, and thus multiple GO terms may be associated with the same SNP. GO terms found to be overrepresented are reported both graphically and in a table. To constrain the search, a minimal distance from the root of GO can be specified. Also, given prior knowledge that, e.g. adhesion is of concern in inflammatory processes (Gierer et al., 2005), the respective terms can be supplied as negative lists.
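The assignment rule described above can be illustrated with a minimal Python sketch. It is not the code of the tool itself, which is written in R against Ensembl; the `Gene` record, the function names and the toy coordinates are assumptions made for illustration only.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    # Hypothetical minimal gene record; coordinates in base pairs, inclusive.
    name: str
    chrom: str
    start: int
    end: int


def distance_to_gene(pos: int, gene: Gene) -> int:
    """Distance in bp from a position to a gene; 0 if the position lies inside it."""
    if pos < gene.start:
        return gene.start - pos
    if pos > gene.end:
        return pos - gene.end
    return 0


def assign_genes(chrom: str, pos: int, genes: list[Gene],
                 max_distance: int = 0, intergenic_only: bool = True) -> list[str]:
    """Return the names of all genes a SNP is assigned to.

    With intergenic_only=True, neighbouring genes are accepted only when the
    SNP lies within no gene at all. Otherwise every gene within max_distance
    is accepted, in addition to the genes that contain the SNP.
    """
    on_chrom = [g for g in genes if g.chrom == chrom]
    containing = [g.name for g in on_chrom if distance_to_gene(pos, g) == 0]
    if containing and intergenic_only:
        return containing
    return [g.name for g in on_chrom if distance_to_gene(pos, g) <= max_distance]


# Toy data: genes A and B overlap, gene C lies far downstream.
toy_genes = [
    Gene("A", "1", 1_000, 2_000),
    Gene("B", "1", 1_500, 3_000),
    Gene("C", "1", 150_000, 160_000),
]
print(assign_genes("1", 1_700, toy_genes))                           # SNP inside A and B
print(assign_genes("1", 100_000, toy_genes, max_distance=60_000))    # intergenic SNP
```

A SNP inside overlapping genes inherits the GO terms of all of them, which is the source of the ambiguity discussed in the text.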

A GO term's dominance is characterized by the ratio of the number of observed appearances in a particular set of selected SNPs to the number of appearances expected for a random selection. Statistically, the problem is addressed with Fisher's exact test. This implementation directly applies the principles of the elim approach described for the topGO tool (Alexa et al., 2006), which uses the tree structure of GO for a top-down hierarchical selection of significant GO terms. A naïve approach that assumes the independence of the assigned GO terms would report too many GO terms as statistically relevant merely because they share a parent term that is already statistically significant.
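For a single GO term, the dominance ratio and the one-sided Fisher's exact test can be sketched as follows. This is an illustrative Python sketch only: the choice of background (all SNPs that can be drawn at random) and all names are assumptions, and the elim-style decorrelation between terms is deliberately left out.

```python
from math import comb


def fisher_enrichment_p(k: int, n: int, K: int, N: int) -> float:
    """One-sided Fisher's exact test for overrepresentation, P(X >= k).

    N: SNPs in the background, K: background SNPs annotated with the term,
    n: SNPs selected by the user, k: selected SNPs annotated with the term.
    """
    upper = min(n, K)
    # Sum the hypergeometric probabilities of all outcomes at least as extreme.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def dominance_ratio(k: int, n: int, K: int, N: int) -> float:
    """Observed count divided by the count expected for a random selection."""
    expected = n * K / N
    return k / expected


# Toy numbers: 20 background SNPs, 5 annotated with the term,
# 4 SNPs selected, 3 of them annotated with the term.
print(dominance_ratio(3, 4, 5, 20))
print(fisher_enrichment_p(3, 4, 5, 20))
```

In this toy case the term is observed three times as often as expected, and the P-value is 155/4845, i.e. about 0.032.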

The significance level for each Fisher's exact test is set to α/(number of GO terms selected), i.e. the Bonferroni correction for multiple testing. That number is calculated in advance as the number of directly assigned terms plus the terms on their minimal paths to the root of GO.
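The count of tested terms and the resulting threshold can be sketched in Python as below. The sketch assumes that the ontology is given as a map from each term to its parent terms, that the root itself is counted, and that one shortest path per directly assigned term is used; the term names are placeholders, not real GO identifiers.

```python
from collections import deque


def shortest_path_to_root(term, parents, root):
    """Breadth-first search over parent links.

    Returns the terms on one shortest path, with the term and the root included.
    """
    queue = deque([[term]])
    seen = {term}
    while queue:
        path = queue.popleft()
        if path[-1] == root:
            return path
        for parent in parents.get(path[-1], ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(path + [parent])
    raise ValueError(f"{term!r} is not connected to {root!r}")


def bonferroni_threshold(direct_terms, parents, root, alpha=0.05):
    """Return (per-test significance level, set of terms counted as tested)."""
    tested = set()
    for term in direct_terms:
        tested.update(shortest_path_to_root(term, parents, root))
    return alpha / len(tested), tested


# Toy ontology: D has two parents, so its shortest path runs via B.
toy_parents = {"A": ["root"], "B": ["root"], "C": ["A"], "D": ["C", "B"]}
level, tested = bonferroni_threshold(["C", "D"], toy_parents, "root")
print(sorted(tested), level)
```

Here five terms are counted (A, B, C, D and the root), so each test is performed at a level of 0.05/5.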

Genes with many historically well-characterised SNPs may be represented by more probes on a SNP chip than others. This bias is taken into account by the statistics. Nevertheless, the tool also offers a gene-based approach that, for each GO term, counts only the number of different genes assigned to it rather than the number of SNPs.
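The difference between the SNP-based and the gene-based count can be made concrete with a short Python sketch. The input mappings and all identifiers are hypothetical and serve only to illustrate the two counting modes.

```python
from collections import defaultdict


def count_per_term(snp_to_genes, gene_to_terms, gene_based=False):
    """Count, per GO term, either distinct SNPs or distinct genes linked to it."""
    snps_by_term = defaultdict(set)
    genes_by_term = defaultdict(set)
    for snp, genes in snp_to_genes.items():
        for gene in genes:
            for term in gene_to_terms.get(gene, ()):
                snps_by_term[term].add(snp)
                genes_by_term[term].add(gene)
    chosen = genes_by_term if gene_based else snps_by_term
    return {term: len(items) for term, items in chosen.items()}


# Toy data: gene G1 is covered by three SNPs, the other genes by fewer.
toy_snp_to_genes = {
    "snp1": ["G1"],
    "snp2": ["G1"],
    "snp3": ["G1", "G2"],
    "snp4": ["G3"],
}
toy_gene_to_terms = {"G1": ["T1"], "G2": ["T1", "T2"], "G3": ["T2"]}

print(count_per_term(toy_snp_to_genes, toy_gene_to_terms))
print(count_per_term(toy_snp_to_genes, toy_gene_to_terms, gene_based=True))
```

With these toy data, term T1 is supported by three SNPs but only two genes, so the densely covered gene G1 weighs less in the gene-based count.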

The calculation is performed in multiple stages as shown in Figure 1. The SNPs of interest are submitted in an entry mask, and the parameters for the maximal distances for intergenic SNPs are set. Results are presented as an HTML table with hyperlinks to GO terms via AmiGO (Gene Ontology Consortium, 2006), to SNPs via Ensembl and dbSNP (Wheeler et al., 2007), and to genes, also via Ensembl. For each GO term, the calculated P-values and test scores are presented alongside the absolute numbers that indicate its dominance. The terms can be sorted by any of these numbers. As additional information, the user is given the list of SNPs assigned to each GO term and the genes that established the link.


Fig. 1. (a) Form to submit a set of SNPs. SNPs are listed by their dbSNP ID. Neither the order of the entries nor further annotation affects the numerical analysis. (b) Presentation of overrepresented GO terms. P-values and the absolute and relative numbers of SNPs that are associated with a GO term via a gene are offered to the user. (c) Heatmap of the results. The intensity represents the P-value. The x-axis represents the SNPs, the y-axis the GO terms. Axis entries are reordered to cluster similar SNPs and GO terms.

 
In a second stage, the degree of overrepresentation is shown graphically. In a heatmap, the x-axis represents the SNPs and the y-axis the GO terms. The intensity represents the P-value. Both axes have their entries reordered to cluster similar SNPs and GO terms.


    3 METHODS
 
The MySQL database of Ensembl version 46 was mirrored on a local Debian Linux server. An assignment of SNPs to GO terms was prepared as a separate table to lower response times. All calculations, graphics and HTML generation are performed by an R script (R Development Core Team, 2006) with support of the libraries RMySQL (James and DebRoy, 2006), CGIwithR (Firth, 2003) and gplots (Warnes et al., 2007).
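The idea of preparing the SNP-to-GO assignment as a separate table can be sketched as follows. This is only an illustration in Python with an in-memory SQLite database standing in for MySQL; the table and column names (`snp_gene`, `gene_go`, `snp_go`) are hypothetical and do not reflect the Ensembl schema or the tool's actual tables.

```python
import sqlite3


def make_demo_db():
    """Create a toy database with hypothetical SNP-gene and gene-GO tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE snp_gene (snp TEXT, gene TEXT);
        CREATE TABLE gene_go (gene TEXT, go_term TEXT);
    """)
    conn.executemany("INSERT INTO snp_gene VALUES (?, ?)",
                     [("snp1", "G1"), ("snp2", "G1"), ("snp2", "G2")])
    conn.executemany("INSERT INTO gene_go VALUES (?, ?)",
                     [("G1", "T1"), ("G2", "T1"), ("G2", "T2")])
    return conn


def build_snp_go_table(conn):
    """Precompute the join once so that later requests avoid it."""
    conn.executescript("""
        DROP TABLE IF EXISTS snp_go;
        CREATE TABLE snp_go AS
            SELECT DISTINCT sg.snp AS snp, gg.go_term AS go_term
            FROM snp_gene AS sg
            JOIN gene_go AS gg ON gg.gene = sg.gene;
        CREATE INDEX snp_go_by_snp ON snp_go (snp);
    """)


def terms_for_snps(conn, snps):
    """Look up the precomputed GO terms for a list of SNP identifiers."""
    snps = list(snps)
    if not snps:
        return []
    placeholders = ",".join("?" for _ in snps)
    query = (f"SELECT snp, go_term FROM snp_go "
             f"WHERE snp IN ({placeholders}) ORDER BY snp, go_term")
    return conn.execute(query, snps).fetchall()


conn = make_demo_db()
build_snp_go_table(conn)
print(terms_for_snps(conn, ["snp2"]))
```

The lookup then touches only the indexed `snp_go` table, which is the kind of shortcut that lowers response times for a web request.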


    4 DISCUSSION
 
This application is the first to apply principles of the analysis of GO terms to SNP data. The key difference between tools previously developed for gene expression data and this approach is in the treatment of ambiguities in the assignment of SNPs to genes.

A major concern for the analysis is the intergenic SNPs. These are accommodated by optionally including neighbouring genes within a specified distance of, e.g. 100 kb. A SNP close to an enhancer region may then (amongst others) be assigned to the gene it controls (Blackwood and Kadonaga, 1998). One may argue that interacting genes are not unlikely to be chromosomal neighbours (Wang et al., 2004), which reduces the error introduced by otherwise including unrelated biological processes.

Intuitively, the statistically most strongly associated SNPs of a polygenic disease are likely to affect different biological processes. If they affected the same process, the defects could not be compensated. One may speculate that the number of SNPs needed before GO terms can be identified is a measure of the polygenicity of the disease investigated.

The approach presented here takes into account neither the linkage disequilibrium between SNPs nor other information in the raw data, such as copy number variation. Also, comparing SNP data with gene expression data from the same individuals may yield additional insights for a selection of genes and associated SNPs. Such investigations were not addressed because of the huge amount of data that would have to be transferred for a complete service. Users are instead advised to submit sets of linked SNPs separately. To support the automation of that process by in-house systems, all source code is made available.

Data from both SNP chips and gene expression microarrays may be analysed in conjunction with GO. SNP data have the advantage of directly indicating the chromosomal location of putative causes of a genetic disease and its cofactors. Furthermore, a variant is detected independently of the tissue analysed. The challenge is to combine both types of data in the analysis.


    5 CONCLUSION
 
The tool may be particularly beneficial for the analysis of defects in polygenic diseases, making use of the redundancy of defects in metabolic or regulatory pathways.


    ACKNOWLEDGEMENTS
 
This work was funded by the DFG (K02250/3–1) and the EU projects KnowARC (032691) and Cardiogenics (037593). The authors thank Thomas Martinetz, Silke Szymczak and the anonymous reviewers for their support and critical reading of the manuscript.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: John Quackenbush

Received on May 31, 2007; revised on October 12, 2007; accepted on October 31, 2007

    REFERENCES
 

    Alexa A, et al. Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics (2006) 22:1600–1607.

    Blackwood E, Kadonaga J. Going the distance: a current view of enhancer action. Science (1998) 281:60–63.

    Chung HJ, et al. ArrayXPath: mapping and visualizing microarray gene-expression data with integrated biological pathway resources using scalable vector graphics. Nucleic Acids Res (2004) 32:W460–W464.

    Firth D. CGIwithR: facilities for processing web forms using R. J. Stat. Softw (2003) 8:1–8.

    Gene Ontology Consortium. The gene ontology (GO) project in 2006. Nucleic Acids Res (2006) 34:D322–D326.

    Gentleman R. Using GO for statistical analyses. In: Antoch J, ed. Compstat 2004: Proceedings in Computational Statistics. (2004) Heidelberg: Physica Verlag. 171–180. ISBN 3-7908-1554-3.

    Gierer P, et al. Gene expression profile and synovial microcirculation at early stages of collagen-induced arthritis. Arthritis Res. Ther (2005) 7:R868–R876.

    Hubbard TJP, et al. Ensembl 2007. Nucleic Acids Res (2007) 35:D610–D617.

    James D, DebRoy S. RMySQL: R interface to the MySQL database. (2006) R package version 0.5–11.

    Mlecnik B, et al. PathwayExplorer: web service for visualizing high-throughput expression data on biological pathways. Nucleic Acids Res (2005) 33:W633–W637.

    Möller S, et al. Selecting SNPs for association studies based on population frequencies: generation of a novel interactive tool and its application to multiple sclerosis. In Silico Biol (2004) 4:0035.

    R Development Core Team. R: A Language and Environment for Statistical Computing. (2006) Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0.

    The UniProt Consortium. The Universal Protein Resource (UniProt). Nucleic Acids Res (2007) 35:D193–D197.

    Wang W, et al. Duplication-degeneration as a mechanism of gene fission and the origin of new genes in Drosophila species. Nat. Genet (2004) 36:523–527.

    Warnes GR, Bolker B, Lumley T. gplots: Various R programming tools for plotting data. (2007) R package version 2.3.2.

    Wheeler DL, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res (2007) 35:D5–D12.

    Wrobel G, et al. goCluster integrates statistical analysis and functional interpretation of microarray expression data. Bioinformatics (2005) 21:3575–3577.

