Bioinformatics Advance Access originally published online on May 29, 2008
Bioinformatics 2008 24(14):1650-1651; doi:10.1093/bioinformatics/btn250
Ontologizer 2.0—a multifunctional tool for GO term enrichment analysis and data exploration
1Institute of Medical Genetics, Charité Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin and 2Max-Planck-Institute for Molecular Genetics, Ihnestr. 73, 14195 Berlin, Germany
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Summary: The Ontologizer is a Java application that can be used to perform statistical analysis for overrepresentation of Gene Ontology (GO) terms in sets of genes or proteins derived from an experiment. The Ontologizer implements the standard approach to statistical analysis based on the one-sided Fisher's exact test, the novel parent–child method, as well as topology-based algorithms. A number of multiple-testing correction procedures are provided. The Ontologizer allows users to visualize data as a graph including all significantly overrepresented GO terms and to explore the data by linking GO terms to all genes/proteins annotated to the term and by linking individual terms to child terms.
Availability: The Ontologizer application is available under the terms of the GNU GPL. It can be started as a WebStart application from the project homepage, where source code is also provided: http://compbio.charite.de/ontologizer
Requirements: Ontologizer requires a Java SE 5.0 compliant Java runtime engine and GraphViz for the optional graph visualization tool.
Contact: sebastian.bauer{at}charite.de; peter.robinson{at}charite.de
| 1 INTRODUCTION |
|---|
The Gene Ontology (GO) is a controlled vocabulary that is structured as a directed acyclic graph, and describes genes and their products (hereafter referred to simply as genes) in any organism (The Gene Ontology Consortium, 2000). Genes from a number of organisms have been annotated to GO terms. A widespread application is the identification of annotation-enriched GO terms in a list of genes that share some biological characteristic, the so-called study set (e.g. genes that are overexpressed in a microarray experiment), compared to a larger list of genes, the population set (e.g. all genes on a microarray). These terms are often interpreted as representing the salient biological features of the genes in the study set.
Since the introduction of GO, many tools have been developed that implement more or less the same approach for identifying GO terms whose annotations are enriched in study sets (Khatri and Drăghici, 2005). Here, we describe an application that implements not only the standard approach to GO analysis, but also the novel parent–child approach (Grossmann et al., 2007) and novel topology-based methods (Alexa et al., 2006; Falcon and Gentleman, 2007). The application also provides a user interface which allows researchers to analyze their datasets and explore the results in an intuitive fashion.
| 2 STATISTICAL FRAMEWORK |
|---|
The standard procedure for determining the annotation enrichment of a GO term consists of calculating the probability of drawing the same or a higher number of study set genes annotated to the term if we selected a list of genes of the same size as the study set randomly from the population set. Formally, we can cast this as a statistical test involving the upper tail of a hypergeometric distribution, which is also known as the one-tailed Fisher's exact test. For this purpose, let $n$ and $m$ be the number of genes within the study and population set, respectively, and $n_t$ and $m_t$ the number of those genes annotated to the term $t$ of GO. Term $t$ is called enriched if its P-value, given by

$$p_t = \sum_{k=n_t}^{\min(m_t,\,n)} \frac{\binom{m_t}{k}\binom{m-m_t}{n-k}}{\binom{m}{n}} \qquad (1)$$

falls below a chosen significance level.
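The upper-tail hypergeometric probability of the standard approach can be illustrated with a minimal Python sketch. It assumes the counts are already known; the function name `enrichment_p_value` and its argument order are hypothetical and are not taken from the Ontologizer source code.

```python
from math import comb


def enrichment_p_value(m, m_t, n, n_t):
    """Upper tail of the hypergeometric distribution, P(X >= n_t).

    m   -- number of genes in the population set
    m_t -- number of population genes annotated to term t
    n   -- number of genes in the study set
    n_t -- number of study genes annotated to term t
    """
    if not (0 <= m_t <= m and 0 <= n <= m and 0 <= n_t <= min(n, m_t)):
        raise ValueError("inconsistent counts")
    # Sum the probabilities of drawing k = n_t, n_t + 1, ... annotated genes
    # when n genes are drawn from the population without replacement.
    tail = sum(
        comb(m_t, k) * comb(m - m_t, n - k)
        for k in range(n_t, min(m_t, n) + 1)
    )
    return tail / comb(m, n)
```

Exact integer arithmetic is used for the binomial coefficients and a single division is performed at the end, which avoids accumulating rounding errors for small sets. For very large populations, a log-space formulation would likely be preferable.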
A disadvantage of this method is that it completely disregards dependencies between annotations that result from the so-called true path rule, which states that a gene annotated to $t$ is also annotated to all parent (less specific) terms of $t$. This leads to a problem if we consider multiple terms simultaneously: the chance of $t$ being enriched is much higher if one or more of its parental terms is enriched. To address this inheritance problem, we developed the parent–child method for detecting GO term enrichment, which avoids artifacts related to such dependencies (Grossmann et al., 2007). The method involves calculation of the sum

$$p_t = \sum_{k=n_t}^{\min(n_{pa(t)},\,m_t)} \frac{\binom{m_t}{k}\binom{m_{pa(t)}-m_t}{n_{pa(t)}-k}}{\binom{m_{pa(t)}}{n_{pa(t)}}} \qquad (2)$$

where $n_{pa(t)}$ and $m_{pa(t)}$ denote the number of genes annotated to the parents of $t$ in the study and population set, respectively.
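The idea behind the parent–child calculation can be sketched in Python as the same upper-tail hypergeometric test, but with the genes annotated to the parents of a term taking the place of the whole population and study set. The sketch assumes that the parent gene set is the union of the genes annotated to all parents, that annotations are already propagated according to the true path rule, and that the study set is a subset of the population set. The function name and the data structures (`parents`, `term2genes`) are hypothetical.

```python
from math import comb


def parent_child_p_value(term, parents, term2genes, study, population):
    """Enrichment of `term` relative to the genes annotated to its parents.

    parents    -- dict: term -> list of direct parent terms
    term2genes -- dict: term -> set of genes annotated to the term
    study, population -- sets of gene names
    """
    # Assumption: the reference set is the union over all parents of the term.
    parent_genes = set().union(*(term2genes[p] for p in parents[term]))
    pop_pa = population & parent_genes
    study_pa = study & parent_genes
    m_pa, n_pa = len(pop_pa), len(study_pa)
    m_t = len(pop_pa & term2genes[term])
    n_t = len(study_pa & term2genes[term])
    # Same sum as in the standard approach, conditioned on the parents.
    tail = sum(
        comb(m_t, k) * comb(m_pa - m_t, n_pa - k)
        for k in range(n_t, min(n_pa, m_t) + 1)
    )
    return tail / comb(m_pa, n_pa)
```

Because the reference sets shrink to the genes of the parents, a term that is enriched only because its parents are enriched no longer receives a small P-value.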
Rather different approaches for decorrelating the graph structure were introduced by Alexa et al. (2006). The authors developed two algorithms that iterate over the levels of the GO graph, starting at the bottom and ending at the top (the root). At each iteration, a score is calculated for every term residing in the current level. The so-called elim method ignores genes mapping to terms that were significant in lower levels but otherwise uses Fisher's exact test for calculating the score as in (1). The more involved weight method assigns real-valued weights to genes as a function of the scores of neighboring terms and uses a modified version of Fisher's exact test to determine the score. An empirical comparison of all outlined approaches is given in Grossmann et al. (2007).
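The bottom-up iteration of the elim method can be illustrated with a simplified Python sketch. It is not the implementation of Alexa et al. or of the Ontologizer. It assumes that the level of a term is its longest distance from the root, that population and study set sizes stay fixed, and that a fixed threshold `alpha` decides significance. All function and variable names are hypothetical.

```python
from collections import defaultdict
from functools import lru_cache
from math import comb


def upper_tail(m, m_t, n, n_t):
    """One-sided Fisher's exact test (upper hypergeometric tail)."""
    tail = sum(comb(m_t, k) * comb(m - m_t, n - k)
               for k in range(n_t, min(m_t, n) + 1))
    return tail / comb(m, n)


def ancestors(term, parents):
    """All terms reachable from `term` by following parent links."""
    seen, stack = set(), list(parents.get(term, ()))
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(parents.get(p, ()))
    return seen


def elim_scores(parents, term2genes, study, population, alpha=0.01):
    """Score terms from the deepest level up to the root."""

    @lru_cache(maxsize=None)
    def level(term):
        direct = parents.get(term, ())
        return 0 if not direct else 1 + max(level(p) for p in direct)

    removed = defaultdict(set)  # genes to ignore, per term
    scores = {}
    for term in sorted(term2genes, key=level, reverse=True):
        genes = (term2genes[term] & population) - removed[term]
        p = upper_tail(len(population), len(genes),
                       len(study), len(genes & study))
        scores[term] = p
        if p < alpha:
            # Genes of a significant term are ignored in all of its ancestors.
            for anc in ancestors(term, parents):
                removed[anc] |= genes
    return scores
```

With this scheme, a parent term remains significant only if it is supported by genes beyond those already explained by its significant descendants.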
| 3 DESIGN AND IMPLEMENTATION |
|---|
The Ontologizer (Robinson et al., 2004) has been completely redesigned to provide a versatile WebStart or desktop application for GO term enrichment analysis whose user interface uses Eclipse's Standard Widget Toolkit (Eclipse Foundation, 2007). It supports all the approaches described in the previous section and allows users to easily explore their results in textual or graphical form.
A project requires the specification of an OBO file, which defines the GO structure, and an association file, which maps the genes to GO terms. Both types of files are available from the Gene Ontology website. In addition, annotation files as provided by the Affymetrix NetAffx Analysis Center (Liu et al., 2003) are supported. It is possible to convert identifiers in the association file into other gene names by supplying a simple text file with mappings.
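How an association file maps genes to GO terms can be illustrated with a short Python sketch for the tab-separated GAF layout distributed by the Gene Ontology website, in which comment lines start with `!`, the third column holds the gene symbol, the fourth the qualifier and the fifth the GO identifier. Keying on the symbol column, skipping `NOT` annotations and passing the identifier mapping as a dictionary are assumptions made for this illustration. They do not describe how the Ontologizer itself reads these files, and the function name is hypothetical.

```python
from collections import defaultdict


def read_associations(lines, id_map=None):
    """Return a dict mapping gene names to sets of GO identifiers.

    lines  -- iterable of lines from a GAF-style association file
    id_map -- optional dict for converting identifiers to other gene names
    """
    id_map = id_map or {}
    gene2terms = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and header comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5:
            continue  # not a complete annotation row
        symbol, qualifier, go_id = cols[2], cols[3], cols[4]
        if "NOT" in qualifier.split("|"):
            continue  # assumption: negated annotations are ignored
        gene2terms[id_map.get(symbol, symbol)].add(go_id)
    return dict(gene2terms)
```

The function accepts any iterable of lines, so an open file handle can be passed directly.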
A project comprises a population set and its study sets. To define these sets, a basic text field is provided where genes can either be entered manually, inserted by copy-and-paste, or imported from external files. Genes with annotations are then highlighted.
In addition to choosing among the three statistical frameworks, the user can choose from a number of multiple-testing correction procedures. Various procedures for controlling the family-wise error rate are implemented, such as the classic Bonferroni correction and the single-step minP procedure of Westfall and Young.
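The two named corrections can be sketched in Python as follows. The Bonferroni adjustment multiplies each P-value by the number of tests. The single-step minP adjustment is estimated here by drawing random study sets of the same size from the population set and recording how often the smallest P-value over all terms is at most the observed one. The resampling scheme, the default number of resamples and all function names are assumptions of this illustration rather than details taken from the Ontologizer.

```python
import random
from math import comb


def upper_tail(m, m_t, n, n_t):
    """One-sided Fisher's exact test (upper hypergeometric tail)."""
    tail = sum(comb(m_t, k) * comb(m - m_t, n - k)
               for k in range(n_t, min(m_t, n) + 1))
    return tail / comb(m, n)


def term_p_values(study, population, term2genes):
    """Unadjusted P-value for every term."""
    m, n = len(population), len(study)
    return {t: upper_tail(m, len(g & population), n, len(g & study))
            for t, g in term2genes.items()}


def bonferroni(p_values):
    """Multiply each P-value by the number of tests, capped at 1."""
    k = len(p_values)
    return {t: min(1.0, p * k) for t, p in p_values.items()}


def single_step_min_p(study, population, term2genes, resamples=1000, seed=0):
    """Resampling estimate of single-step minP adjusted P-values."""
    observed = term_p_values(study, population, term2genes)
    rng = random.Random(seed)
    pool = sorted(population)  # sorted for reproducible sampling
    hits = dict.fromkeys(observed, 0)
    for _ in range(resamples):
        random_study = set(rng.sample(pool, len(study)))
        min_p = min(term_p_values(random_study, population, term2genes).values())
        for t, p in observed.items():
            if min_p <= p:
                hits[t] += 1
    return {t: hits[t] / resamples for t in observed}
```

Because the resampling preserves the annotation structure of the population set, the minP adjustment takes the dependencies between GO terms into account, whereas the Bonferroni correction treats all tests as if they were independent.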
After the analysis is finished, a new window appears with a table showing rows of terms including P-values (or scores), annotation counts and other information (Fig. 1). Enrichment of a term is indicated by color coding according to the sub-ontology to which the term belongs (biological process, molecular function and cellular component), whereby the intensity of the color correlates with the significance of the enrichment. The terms displayed in the table can be restricted to all descendants of any term in GO. This can be used to display terms only in one sub-ontology or, say, to display all terms that are descendants of the term development.
Users can click on any term in the table to display properties and results related to the term such as its parents and children, its description and a list of all genes annotated to the term in the study set. This information is presented as a hypertext in the lower panel with links to parent and child terms. Clicking on a gene's name reveals all the terms directly annotating the gene.
The Ontologizer also provides a tightly integrated graphical view of the results. The graphical functions of the Ontologizer make use of the open source graph visualization package GraphViz (Gansner and North, 1999), which must be installed on the user's computer for the graphical functions of the Ontologizer to work. Within the graph view, GO terms are represented as nodes and the parent–child relationships as directed edges. Clicking on a node in the graph will cause the corresponding term in the table to be activated, thereby displaying information about the term. A node's context menu provides further actions, such as copying the names of the genes annotated to the term to the clipboard.
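The representation of GO terms as nodes and parent–child relationships as directed edges can be sketched by generating a description in GraphViz's DOT language. The Python function below is a hypothetical illustration: the edge direction (parent to child), the box shape and the fill color for enriched terms are assumptions, and the output is not claimed to match what the Ontologizer produces.

```python
def to_dot(parents, labels=None, enriched=()):
    """Build a DOT digraph from a dict mapping each term to its parent terms.

    labels   -- optional dict: term -> display name
    enriched -- collection of terms to highlight
    """
    labels = labels or {}
    terms = set(parents) | {p for ps in parents.values() for p in ps}
    lines = ["digraph go {", "  node [shape=box];"]
    for term in sorted(terms):
        # Escape double quotes so that the label stays a valid DOT string.
        label = labels.get(term, term).replace('"', '\\"')
        attrs = ['label="%s"' % label]
        if term in enriched:
            attrs.append('style=filled, fillcolor="lightblue"')
        lines.append('  "%s" [%s];' % (term, ", ".join(attrs)))
    for child in sorted(parents):
        for parent in sorted(parents[child]):
            lines.append('  "%s" -> "%s";' % (parent, child))
    lines.append("}")
    return "\n".join(lines)
```

The resulting text can be saved to a file and rendered with the GraphViz `dot` program, for example with `dot -Tpng graph.dot -o graph.png`.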
By default, only the graph induced by the enriched terms (i.e. the graph formed by these terms and all of their ancestor terms) is displayed. If the resulting graph contains too many terms for easy visualization, it is possible to restrict the induced graph to a subset of terms, such as all enriched terms in one of the sub-ontologies. It is also possible to add an arbitrary term to the graph-inducing term set, or to remove one from it, by using the checkboxes in the table view.
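The notion of an induced graph can be made concrete with a short Python sketch: starting from the inducing term set, all ancestors are collected by following parent links, and the edges between the collected terms are kept. The function names and the `parents` dictionary are hypothetical.

```python
def induced_terms(inducing_terms, parents):
    """Return the inducing terms together with all of their ancestors.

    parents -- dict: term -> list of direct parent terms
    """
    induced, stack = set(), list(inducing_terms)
    while stack:
        term = stack.pop()
        if term not in induced:
            induced.add(term)
            stack.extend(parents.get(term, ()))
    return induced


def induced_edges(terms, parents):
    """Return the (parent, child) edges among the induced terms."""
    return {(p, c) for c in terms for p in parents.get(c, ())}
```

Adding a term to or removing a term from the inducing set corresponds to calling `induced_terms` again with the modified set.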
Finally, the results of the analysis can be saved in a variety of tabular and graphical formats. For automation purposes, we additionally provide a command line interface version at the project homepage.
| ACKNOWLEDGEMENTS |
|---|
Funding: This work was supported by the Deutsche Forschungsgemeinschaft (SFB 760).
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Dmitrij Frishman
Received on January 30, 2008; revised on April 11, 2008; accepted on May 27, 2008
| REFERENCES |
|---|
Alexa A, et al. Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics (2006) 22:1600–1607.
Eclipse Foundation. Eclipse. (2007) Available at http://www.eclipse.org (last accessed December 15, 2007).
Falcon S, Gentleman R. Using GOstats to test gene lists for GO term association. Bioinformatics (2007) 23:257–258.
Gansner ER, North SC. An open graph visualization system and its applications to software engineering. Software Pract. Exper (1999) 30:1203–1233.
The Gene Ontology Consortium. Gene Ontology: tool for the unification of biology. Nat. Genet (2000) 25:25–29.
Grossmann S, et al. Improved detection of overrepresentation of Gene-Ontology annotations with parent-child analysis. Bioinformatics (2007) 23:3024–3031.
Khatri P, Drăghici S. Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics (2005) 21:3587–3595.
Liu G, et al. NetAffx: Affymetrix probesets and annotations. Nucleic Acids Res (2003) 31:82–86.
Robinson PN, et al. Ontologizing gene-expression microarray data: characterizing clusters with Gene Ontology. Bioinformatics (2004) 20:979–981.