Skip Navigation


Bioinformatics Advance Access originally published online on November 30, 2004
Bioinformatics 2005 21(8):1592-1595; doi:10.1093/bioinformatics/bti169
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/8/1592    most recent
bti169v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (15)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Kestler, H. A.
Right arrow Articles by Buchholz, M.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Kestler, H. A.
Right arrow Articles by Buchholz, M.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2004. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

Generalized Venn diagrams: a new method of visualizing complex genetic set relations

Hans A. Kestler 1,2,*,{dagger}, André Müller 2,{dagger}, Thomas M. Gress 2,{ddagger} and Malte Buchholz 2,{ddagger}

1Neuroinformatics, University of Ulm 89069 Ulm, Germany
2Internal Medicine I, University Hospital Ulm Robert-Koch-Strasse 8, 89081 Ulm, Germany

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 1 INTRODUCTION
 2 METHODS
 3 RESULTS AND DISCUSSION
 REFERENCES
 

Motivation: Microarray experiments generate vast amounts of data. The unknown or only partially known functional context of differentially expressed genes may be assessed by querying the Gene Ontology database via GOMiner. Resulting tree representations are difficult to interpret and are not suited for visualization of this type of data. Methods are needed to effectively visualize these complex set relationships.

Results: We present a visualization approach for set relationships based on Venn diagrams. The proposed extension enhances the usual notion of Venn diagrams by incorporating set size information. The cardinality of the sets and intersection sets is represented by their corresponding circle (polygon) sizes. To avoid local minima, solutions to this problem are sought by evolutionary optimization. This generalized Venn diagram approach has been implemented as an interactive Java application (VennMaster) specifically designed for use with GOMiner in the context of the Gene Ontology database.

Availability: VennMaster is platform-independent (Java 1.4.2) and has been tested on Windows (XP, 2000), Mac OS X, and Linux. Supplementary information and the software (free for non-commercial use) are available at http://www.informatik.uni-ulm.de/ni/mitarbeiter/HKestler/vennm together with a user documentation.

Contact: hans.kestler{at}medizin.uni-ulm.de


    1 INTRODUCTION
 TOP
 Abstract
 1 INTRODUCTION
 2 METHODS
 3 RESULTS AND DISCUSSION
 REFERENCES
 
Microarray technologies are increasingly being used in biological and medical sciences for high throughput analyses of genetic information on the genome, transcriptome and proteome levels. These types of analysis generate vast amounts of data, often in the form of large lists of genes differentially expressed between different sample sets, leaving the researcher with the task of identifying the functional relevance of the observed expression changes. Comprehensive functional annotation of gene products as provided by the Gene Ontology (GO) database (http://www.geneontology.org) is an invaluable resource for performing this task.

Gene lists can be queried for associated functional categories (GO terms) which are significantly over-represented among the differentially expressed genes using query tools such as GOMiner (Zeeberg et al., 2003). However, due to the association of genes with multiple GO terms and the resulting complex interdependencies of categories sharing differentially expressed genes, the results of such an analysis remain hard to interpret and are not easily visualized. Standard tree representations, as e.g. provided with the GOMiner tool, are in many cases an improper choice for this task, especially for representing intersections. Venn diagrams can provide much more information to the researcher. Full containment of one set in another, partial intersections and disjunctness can be seen at a glance with Venn diagrams (Fig. 1). Simple Venn diagrams are already being used in microarray data analysis software packages such as GeneSpring® and SilicoCyte® to visualize intersections of up to three different lists of genes. In the present paper, we propose to extend the use of Venn diagrams to the faithful visualization of the results of GO queries as performed e.g. with the GOMiner tool. We present a method to represent GO terms which have been identified as significantly over-represented among the differentially expressed genes in the form of polygons with areas directly proportional to the true cardinalities of the sets (i.e. the numbers of differentially expressed genes in the categories) and intersections proportional to the numbers of genes shared by two or more categories.



View larger version (25K):
[in this window]
[in a new window]
 
Fig. 1 A generalized Venn diagram with three sets A, B and C and their intersections. From this representation, the different set sizes are easily observed. Furthermore, if individual elements (genes) are contained in more than one set (functional category), the intersection sizes give a direct view on how many genes are involved in possibly related functions. During optimization, the localization of the circles is altered to satisfy the possibly contradictory constraints of circle size and intersection size.

 

    2 METHODS
 TOP
 Abstract
 1 INTRODUCTION
 2 METHODS
 3 RESULTS AND DISCUSSION
 REFERENCES
 
Unfortunately, this visualization problem is not easily solvable. Sometimes no perfect solution exists, as the constraints of set size (polygon or circle size) and intersection set size are in some cases contrary. As a consequence and to avoid local minima, the visualization approach was implemented using an evolutionary strategy for optimization. The goal is to find the best solution possible and mark unavoidable non-intersections in gray.

Efficient algorithms are known for finding intersecting polygons and computing their area: The intersection of two convex polygons with L and M edges can be computed efficiently within O(L + M) steps (O'Rourke, 2000) and the area of a polygon with L edges can be determined in O(L) steps.

2.1 Area calculation
The polygon areas are computed by applying the Gaussian integration theorem in the plane (Harris and Stocker, 1998)

It states that the value of an area integral of the above form (the left side integrates over a scalar field) on a closed domain B R2 can be expressed by a curve integral along the boundary {partial} B (right side). Let be a polygon. After some conversions the area computes to

2.2 Evolutionary optimization
The problem is partitioned into independently solvable subproblems. It is therefore assumed that all sets have at least one intersecting partner.

Let A1,...,Am {subseteq} U be a sequence of intersecting subsets of the element set U and the corresponding polygons at optimization step t = 1,2,.... For each k-subset I {subseteq} {1,..., m} (|I| = k ≥ 2) we observe the cardinality c of the intersection set {cap}i I Ai and the area of the corresponding polygonal intersection A = {eta} area . The factor {eta} > 0 is a predefined constant describing the correspondence between area and cardinality, so for all polygons |Ai| = {eta} area holds (with these assumptions all costs for |I| = 1 are equal to 0). We define the partial cost of an observed intersection of order k = |I| ≥ 2, cardinality c and area A:

The two parameters {alpha}≥ 0 allow the weighting of the different cases. In the current version they are set to {alpha} = 10 and ß = 20. So the unwanted (first case) and missing overlaps (second cases) are weighted stronger than a small area deviation (last case).

The overall cost is defined as the sum over all partial costs:

In the optimum, all areas multiplied by {eta} should be equivalent to the cardinality of the corresponding intersection sets. The number of considered set combinations sometimes needs to be restricted via an upper bound 2 ≤ K ≤ m, which may be necessary for large m due to memory and time limitations. The problem grows exponentially with the number of sets m: A single cost function evaluation requires O(Lm2m–1) steps for all set combinations; in the restricted case this reduces to steps, with L being the number of polygon edges.

The aforementioned error function is optimized over the positions of the polygon centers in the 2D plane. The shape and orientation of the polygons remain fixed (the number of edges can be chosen in advance). An evolutionary strategy (Bäck, 1996) was used to minimize E (inverse of fitness). In this strategy, only mutation was used to modify the population. Following Bäck, we used the self-adaptation of the mutation variances to achieve a better convergence and to eliminate the need for specifying mutation variances.

A generation contains N individuals (this parameter defaults to 100) each consisting of a parameter vector representing the polygon centers and a mutation vector describing the mutation rate for each parameter. For the first population, all centers are set to random values so that the polygons are contained within a bounding box, and the mutation parameters are uniformly drawn from an pre-specified interval [{tau}lower,{tau}upper]. In the mutation step the mutation parameters itself are mutated:

and restricted to the interval [{tau}lower,{tau}upper] ({tau} defaults to 0.5). Then the locations of the polygons are updated as follows:

where N(0,{sigma}) is a normal distributed variate with mean 0 and variance {sigma}. The result of each partial mutation operation (mutation of a single polygon) is restricted to match the following two conditions:

  1. In the case that a polygon is not in contact with at least one other polygon, its position is reset in the direction of the nearest polygon whose corresponding set has a non-empty intersection with the observed set so that the distance of the centers will be the sum of both radii.
  2. The polygons are restricted to stay in the bounding box [0,1]2.
xEvolutionary selection and offspring generation is performed by assigning each individual a rank r = 1,...,N according to its fitness determined by the value of its cost functional (the best individual with the lowest cost has r = 1). Each individual is then duplicated reciprocal to its rank value. So each individual with rank r will have at most {lceil}qN/r{rciel}(0 < q < 1) offsprings. Starting with the individual with the highest rank r = 1 the new population will be filled up until it has size N. The fittest individual is always included in the new population. All but the fittest individual are mutated. The displayed polygon arrangement always shows the fittest individual and will only change if there is a better solution found.

The optimization process stops when a configurable upper number of steps is exceeded or the cost functional has not changed over a certain number of steps.


    3 RESULTS AND DISCUSSION
 TOP
 Abstract
 1 INTRODUCTION
 2 METHODS
 3 RESULTS AND DISCUSSION
 REFERENCES
 
The visualization approach described was implemented in a small and easy-to-use, platform-independent Java application ( VennMaster). It allows an interactive exploration of the data sets and was tested on Windows XP, Linux and Mac OS X using the Java Runtime Environment 1.4.2 (http://www.java.sun.com). VennMaster supports the interactive exploration of sets and intersection sets. When touching the polygons with the cursor, the region will be highlighted and the involved group names and the cardinality of the intersection set will be shown (Fig. 2). The edge number of the polygons is user-configurable. Furthermore, a gene list of the selected intersection set(s) is shown in an information field. Unresolved intersections (for which no corresponding polygon intersection exists) are listed in the field ‘Inconsistencies’. For each set or intersection set, a text label can be attached. Labels and polygons can be moved by drag-and-drop (the cost function will be updated immediately). So the user can interactively modify the configuration and may restart the evolutionary optimization process on the changed arrangement. Set positions can be locked so that they will not be moved by the optimizer. The optimization process can be controlled via a parameter dialog (see Supplementary information). The Venn diagrams may be saved as JPEG files. To analyze functional categories of differentially expressed genes, we included the ability to import files from GOMiner (Zeeberg et al., 2003) in the program, in addition to a simple tab-delimited file format with an element/group pair in each line. For the GOMiner files, a pre-filtering of the genes/categories is included.



View larger version (13K):
[in this window]
[in a new window]
 
Fig. 2 Result of a visualization (with polygons) by importing files from GOMiner. Here, differentially expressed gene sets between a specialized mesenchymal cell type (stellate cells) and normal skin fibroblasts (minimum total: 100; max p-value: 0.05) are shown. Although a total of nine GO categories were reported to be significantly over-represented among the changed genes, the analysis revealed that these categories strongly overlap and form a single large cluster of cell surface/extracellular matrix related categories.

 
To validate our approach, an experiment using microarrays with 23 000 features was processed by VennMaster, the tool that supports the approach presented here. Differences between two types of cells (stellate cells and fibroblasts) involved in diseases associated with extensive fibrosis such as chronic pancreatitis or liver fibrosis were investigated. The GOMiner tool was used to classify the differential genes into functional categories. Due to the fact that one gene may belong to multiple functional categories, this analysis revealed a complex pattern of 29 GO terms. To identify the major functional categories differentiating the two cell types, the VennMaster program was applied to the GOMiner list of 74 genes. The list of differentially expressed genes was compared to the list of all genes exceeding a minimal expression threshold (normalized expression value >0.5 in at least one of the sample sets) to identify GO terms significantly over-represented among the differentially expressed genes. Since the tree format provided by GOMiner to visualize the results of the analysis is not suited to display the overlap of genes in different GO categories resulting from the association of genes with multiple GO terms, we applied the described Venn diagram approach which facilitates visualization of associations between GO categories based on the evaluation of genes mutually represented in different categories (Fig. 2). The Venn diagram representation revealed that the principal GO terms which were significantly over-represented among this set of genes all fell within a single cluster of interconnected categories relating to extracellular and cell surface genes, such as extracellular matrix genes, secreted proteins and cell adhesion genes and their associated functions. In a second experiment, expression profiles of pancreatic ductal carcinoma and normal pancreatic duct cells were compared (Fig. 3). Both analyses gave an instant overview of the involved GO terms.



View larger version (23K):
[in this window]
[in a new window]
 
Fig. 3 Visualization showing different subprocesses (two for left panel and four for right panel): Genes over-expressed (left) or under-expressed (right) in pancreatic ductal carcinoma as compared to normal pancreatic duct cells (minimum total: 50; max p-value: 0.05): the analysis identifies distinct clusters of biologically relevant GO categories over-represented among the over-expressed and under-expressed genes, respectively.

 
To evaluate the performance of the self-adapting evolution strategy, a series of simulations were made with and without self-adaptation of the mutation variances. For non-adaptive mutation rates, the mutation parameter is critical for the convergence of the optimization process (see Supplementary Figures I and II). The self-adapting procedure gave superior results in all but the lowest mutation variances. For 17 different data sets, the overall maximal fitness gained was evaluated on 100 simulations for each configuration, i.e. a total of 3400. For every data set, the self-adapting algorithm resulted in smaller error values than the non-adaptive algorithm. Each of the p-values of the one-sided Wilcoxon rank sum test was below 6.28e-5 (Bonferroni correction requires a p-value <0.0015 to be considered significant; see Supplementary Figure V).

Clearly, the diagrams we used for visualization purposes are not true Venn diagrams according to Grünbaum (1992) and others (see http://www.combinatorics.org/Surveys/ds5/VennEJC.html for an excellent survey) as they allow empty and not connected intersection sets. Apart from this more theoretical aspect, the proposed generalized Venn diagrams proved nevertheless to be of great value for practical purposes requiring the visualization of complex set relationships.


    Acknowledgments
 
This research was funded in part by a ‘Forschungsdozent’ grant through the Stifterverband für die Deutsche Wissenschaft to H.A.K.


    Footnotes
 
{dagger}The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors. Back

{ddagger}M. Buchholz and T.M. Gress made equal contributions to this study and should both be considered senior authors. Back

Received on August 18, 2004; revised on October 15, 2004; accepted on November 18, 2004

    REFERENCES
 TOP
 Abstract
 1 INTRODUCTION
 2 METHODS
 3 RESULTS AND DISCUSSION
 REFERENCES
 

    Bäck, T. Evolutionary Algorithms in Theory and Practice, (1996) , Oxford Oxford University Press.

    Grünbaum, B. (1992) Venn diagrams I. Geombinatorics, 1, 5–12.

    Harris, J.W. and Stocker, H. Handbook of Mathematics and Computational Science, (1998) , New York Springer Verlag.

    O'Rourke, J. Computational Geometry in C, (2000) 2nd edn. , Cambridge Cambridge University Press.

    Zeeberg, B.R., Feng, W., Wang, G., Wang, M.D., Fojo, A.T., Sunshine, M., Narasimhan, S., Kane, D.W., Reinhold, W.C., Lababidi, S., et al. (2003) GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome Biol., 4, R28[CrossRef][Medline].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
B. Lee, K. Brown, Y. Hathout, and J. Seo
GOTreePlus: an interactive gene ontology browser
Bioinformatics, April 1, 2008; 24(7): 1026 - 1028.
[Abstract] [Full Text] [PDF]


Home page
Mol. Endocrinol.Home page
H. Gao, S. Falt, A. Sandelin, J.-A. Gustafsson, and K. Dahlman-Wright
Genome-Wide Identification of Estrogen Receptor {alpha}-Binding Sites in Mouse Liver
Mol. Endocrinol., January 1, 2008; 22(1): 10 - 22.
[Abstract] [Full Text] [PDF]


Home page
J. Bacteriol.Home page
A. L. Zomer, G. Buist, R. Larsen, J. Kok, and O. P. Kuipers
Time-Resolved Determination of the CcpA Regulon of Lactococcus lactis subsp. cremoris MG1363
J. Bacteriol., February 15, 2007; 189(4): 1366 - 1381.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/8/1592    most recent
bti169v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (15)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Kestler, H. A.
Right arrow Articles by Buchholz, M.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Kestler, H. A.
Right arrow Articles by Buchholz, M.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?