Skip Navigation


Bioinformatics Advance Access originally published online on April 10, 2006
Bioinformatics 2006 22(13):1600-1607; doi:10.1093/bioinformatics/btl140
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/13/1600    most recent
btl140v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (51)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Alexa, A.
Right arrow Articles by Lengauer, T.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Alexa, A.
Right arrow Articles by Lengauer, T.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Improved scoring of functional groups from gene expression data by decorrelating GO graph structure

Adrian Alexa *, Jörg Rahnenführer and Thomas Lengauer

Max-Planck-Institute for Informatics Stuhlsatzenhausweg 85, D-66123 Saarbrücken, Germany

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSIONS
 REFERENCES
 

Motivation: The result of a typical microarray experiment is a long list of genes with corresponding expression measurements. This list is only the starting point for a meaningful biological interpretation. Modern methods identify relevant biological processes or functions from gene expression data by scoring the statistical significance of predefined functional gene groups, e.g. based on Gene Ontology (GO). We develop methods that increase the explanatory power of this approach by integrating knowledge about relationships between the GO terms into the calculation of the statistical significance.

Results: We present two novel algorithms that improve GO group scoring using the underlying GO graph topology. The algorithms are evaluated on real and simulated gene expression data. We show that both methods eliminate local dependencies between GO terms and point to relevant areas in the GO graph that remain undetected with state-of-the-art algorithms for scoring functional terms. A simulation study demonstrates that the new methods exhibit a higher level of detecting relevant biological terms than competing methods.

Availability: topgo.bioinf.mpi-inf.mpg.de

Contact: alexa{at}mpi-sb.mpg.de

Supplementary Information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSIONS
 REFERENCES
 
The result of a microarray experiment is a list of genes with corresponding expression profiles. Such a gene list is the starting point for an investigation of the biology manifested by the experimental data. Often, genes are ranked according to differential expression between disease groups or according to correlation of expression values with a phenotype measurement. The result of such an analysis is an ordered list of genes.

In many cases the list of differentially expressed genes is not sufficient for accurate inference of the underlying biology. Additional biological knowledge needs to be included to enhance the interpretation of such a list of genes. With the development of biological knowledge databases (Ashburmer et al., 2000), information for augmenting gene expression data is available. Biologically interesting sets of genes, for example genes that belong to a pathway or genes known to have the same biological function, can now be compiled. A popular choice for gene sets are genes collected under Gene Ontology (GO) terms, see GO Consortium (2004).

Methods for gene set enrichment analyze the positions of genes with a common function in an ordered list of genes. The biological function is expected to be more relevant if the gene set members are among the top-ranked genes in the ordered list obtained in the primary stage of the analysis. The top k genes from the ordered list are selected as interesting genes, and enrichment refers to over-representation of these genes in the group of gene set members. The amount of over-representation is assessed with a statistical score.

Methods that test for enrichment of GO terms have been proposed by Draghici et al. (2003), Zeenerg et al. (2003), Al-Shhrour et al. (2004) and Beissbarth and Speed (2004). A comparative study of commonly used tools for analyzing GO term enrichment was recently presented by Khatri and Draghici (2005). None of the methods compared in this study integrates the knowledge encapsulated in the hierarchical structure of the GO database. Recently Grossmann et al. (2006) discussed scoring enrichment of GO terms in a local sense. The over-representation of a GO term is quantified with respect to its direct less specific neighbors in the GO hierarchy. In the present article we propose algorithms that identify over-represented terms in a more global sense, integrating the whole GO topology in the score.

Several other methods integrating the hierarchical structure of the GO have related but different goals. Balasubramanian et al. (2004) test the association between multiple sources of functional genomics data. Each data source is represented by a single graph nodes representing genes and edges representing functional links. For a group of genes a graph based on GO is obtained by placing edges between all gene pairs that are annotated to GO terms with a small distance in GO graph topology. The proposed tests statistically compare the occurrence of edges observed in different graphs. Joslyn et al. (2004) define distances between nodes in the GO graph for ordering GO terms with respect to a group of genes. A GO term is considered more important if many genes in the group are annotated to GO terms close in graph topology. The resulting rank-ordered list of GO terms is then clustered in order to identify summarizing nodes for the characteristics of the gene group.

The complex structure of the GO introduces strong dependencies among the GO terms. Figure 1 depicts significantly enriched GO terms in a microarray study. The color of the GO terms represents the relative significance of the terms. The interpretation of such results is hampered by the implicit dependence between neighboring GO terms. Even when multiple testing correction procedures are applied the results are biased due to this correlation. The interpretation of the results can be improved if the correlation between the neighboring GO terms is integrated into the score.


Figure 1
View larger version (36K):
[in this window]
[in a new window]
 
Fig. 1 The subgraph induced by the 10 most significant GO terms identified by a current state-of-the-art method for scoring GO terms for enrichment. Boxes indicate the 10 most significant terms. Box color represents significance, ranging from dark red (most significant) to light yellow (least significant). The numbers in the legend represent the magnitude of the raw p-values associated with the respective color. Black arrows indicate is–a relationships and red arrows part–of relationships.

 
In this article we present two novel methods that improve the enrichment analysis of GO terms by integrating GO graph topology on a global scale. The first method, called elim, is intuitive and simple to interpret. It iteratively removes the genes mapped to significant GO terms from more general (higher level) GO terms. In the second method, called weight, genes annotated to a GO term receive weights based on the scores of neighboring GO terms.

When analyzing the GO graph structure, local dependencies between GO terms can be identified and removed. The algorithms are evaluated on two publicly available datasets. However, an evaluation on real datasets always yields results biased towards specifically designed algorithms. To avoid a subjective assessment of the quality of the enrichment methods, we introduce a novel evaluation scheme in which a predefined number of GO terms are artificially enriched and the performance of the methods is quantified with respect to the number of correctly identified enriched terms. The results from both real and simulated data show that the proposed algorithms perform better than current state-of-the-art methods.


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSIONS
 REFERENCES
 
The GO provides an ontology that describes the roles of genes and gene products in various organisms. GO has a hierarchical structure that forms a directed acyclic graph (DAG). For such a graph we can use the notions of child and parent, where a child can have multiple parents. Every GO term is represented by a node in this graph. The nodes are annotated with a set of genes. For an inner node of the GO graph, the corresponding set of genes also comprises all genes annotated to all children of this node.

Various test statistics are used for scoring significance of GO terms. Standard implementations of GO testing compute the significance of a node independent of the significance of the neighboring nodes. Independently of the test statistic used, this standard scoring is referred as the classic method in this article. Two novel approaches as alternatives for this basic method are introduced further.

If a GO term contains the same genes as one of its children the classic method gives the same score to both terms. In such a case the child is biologically more interesting since its associated definition is more specific than the definition of its ancestors. Thus, a promising idea is to compute the significance of a node dependent on the significance of its children. The elim method directly implements this idea by removing all genes that are annotated to a significantly enriched node from all its ancestors. Removing genes from a GO term can be regarded as a weighting scheme where weights are restricted to be either 0 or 1. The weight algorithm generalizes this idea to weights in the interval [0,1]. In order to decide if a GO term u better represents the interesting genes than any other term from its neighborhood, the enrichment score of node u is compared with the scores of its children. Children with a better score than u better represent the interesting genes. The corresponding genes annotated to these children should contribute less to the score of any ancestor of node u. Thus, these genes are assigned small weights in all ancestors of node u. The children with a lower score than u should not be reported as significant. To achieve this, the genes annotated to these children receive small weights, and the score is recomputed based on the newly assigned weights.

Before describing the algorithms in more detail, we introduce some definitions and notations.

2.1 Definitions and notations
Let genes[u] denote the set of genes annotated to a GO term u. When the union of two or more sets of genes is computed, duplicates are removed. With Formula we denote all genes that are not annotated to node u.

The edges of the GO graph are directed from parent to child. The level of a node u is defined as the length of the longest path from the root to node u. Note that there can be leaves at each level (except level 0) and that nodes at the same level never have an edge between them. For a set Formula of nodes, upperInducedGraph(Formula) is defined to be the subgraph induced by all nodes reachable from Formula if all edges in the graph are reversed.

To test the enrichment of a node u we use the degree of independence between the two properties:

Formula
Let sigGenes denote the set of interesting genes in a microarray experiment, e.g. the list of differentially expressed genes. Analogously, Formula denotes all other genes from the microarray. For a node u the contingency table (Table 1) summarizes the counts of genes according to their membership to genes[u] and sigGenes.


View this table:
[in this window]
[in a new window]
 
Table 1 Contingency table for the set of genes annotated to node u and the genes found significant in an expression study

 
Testing the association between the characteristics Formula and Formula for the contingency table (Table 1) corresponds to Fisher's exact test, see Lehmann (1986). The p-value returned by this test is the probability of observing at least the same amount of enrichment when significant genes are randomly selected out of all genes. Thus, a very small p-value gives strong evidence for an association between Formula and Formula.

2.2 First approach: eliminating genes
The elim method investigates the nodes in the GO graph bottom-up. It starts processing the nodes from the highest (bottommost) level and then iteratively moves to nodes from a lower level. Since nodes from the same level share no edge, they can be investigated independently. The bottom-up strategy assures that for a currently investigated node all children have been scored.

When a node u is processed, the genes that have been marked as removed in a previous step are removed from the set genes[u]. The score for the resulting group of genes is the p-value returned by Fisher's exact test, and node u is marked as significant if the p-value is smaller than a previously defined threshold. Typically this threshold is set to be 0.01 divided by the number of nodes in the GO graph with at least one annotated gene. This corresponds to a Bonferroni adjustment of the p-values. If node u is found significant then all genes mapped to it are marked as removed in all nodes of upperInducedGraph(u), that is in all ancestors of node u.

The algorithm finishes when all nodes have been processed. The elim method is summarized in Algorithm 1.


View this table:
[in this window]
[in a new window]
 
Algorithm 1 Formula 1

 
2.3 Second approach: weighting genes
The purpose of the bottom-up strategy implemented in the elim algorithm is to identify the most specific nodes with minimum required significance. A node is considered to be significant if its p-value is below a given threshold. More significant nodes on higher levels in the graph thus can be missed because of the gene removal process.

An alternative is implemented in the weight method. Here, significance scores of connected nodes (a parent and its child) are compared in order to detect the locally most significant terms in the GO graph. This is achieved by down-weighting genes in less significant neighbors. We first describe how weights are assigned to genes and how the significance score of a group of genes with corresponding weights is computed.

Let u be a node in the graph and let v be one of its children. The weight associated with this pair of nodes is defined by

Formula
The function sigRatio(·, ·) has the form

Formula
where f (·) is an arbitrary increasing function that can be adapted to the desired degree of weighting between pairs of nodes. Note that the function sigRatio always attains a value in the interval [0, 1] if both scores are non-negative and if the child node v is less significant than its parent u.

The score for a weighted set of genes is computed by applying Fisher’s exact test on a weighted contingency table. X = |sigGenes {cap} genes[u]| is the number of significant genes that are annotated to node u (Table 1). This quantity is replaced by summing up the weights of the genes and subsequent rounding up to the next integer:

Formula 1(1)

In the same way, all other quantities in Table 1 are computed. The cardinality function |Formula 1| for a set S is always replaced by the function {lceil}Formula 1weight[i]{rciel}. The idea behind using a weighted contingency table is to first decorrelate the GO terms and then to test for enrichment in the transformed set of genes.

In the weight method the nodes are processed bottom-up level by level as in the elim method. However, for each node a vector of gene weights is memorized and updated during the process. Initially, all weights for the genes annotated to a node are set to 1. Let u be the currently processed node in the bottom-up process. The main principle is to reinforce differences in significance between u and its neighbors. If u is more significant, genes contained in the children are down-weighted, yielding a decreased significance of the children. If at least one child is more significant than u, the genes common to the child and u are down-weighted in node u and its ancestors, further decreasing the significance of node u.

The function computeTermSig(u, children) is the core of the weight algorithm. It recursively eliminates children of u that are more significant than u. Each time this function is called, the score of u is recomputed using the updated genes weights. Based on the new score, in turn, the weights for all children are recomputed using the sigRatio() function.

More precisely, if in the iterative process node u has a lower p-value than its children and thus can be regarded as a better candidate, for each child the old weights (assigned in some previous step) are multiplied with the weights given by the sigRatio() function. After updating the scores for all children the processing of node u finishes.

The alternative is that at least one child has a better score than node u. All genes annotated to each of these more relevant children are down-weighted in the ancestors of node u, including u itself. Again, gene weights are multiplied with a factor obtained with the sigRatio() function. By calling computeTermSig() for node u and all its less significant children, the score for u then is recomputed based on the new weights. Owing to the modified weights for node u, the scores of current children can now become more significant than the updated score of u. Thus the process is iterated until all remaining children have lower scores than u. Then the algorithm moves to the next node.

The full details of the algorithm weight are shown in Algorithm 2.


View this table:
[in this window]
[in a new window]
 
Algorithm 2 Formula 1

 
2.4 Combining algorithms
In addition to the presented algorithms and the classic method, we consider a combined algorithm called all.M. For this algorithm, the reported score for a node is defined as the mean of the p-values obtained with classic, elim and weight algorithm, computed on a log scale. If p-Value denotes the vector of the three p-values for a GO term, then the new score is defined by

Formula 1


    3 RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSIONS
 REFERENCES
 
We evaluated the GO scoring algorithms classic, elim and weight on two real gene expression datasets (Chiaretti et al., 2004; Cario et al., 2005) and on simulated data. Both real datasets were collected in order to distinguish subgroups of Leukemia patients. However, the discrimination criterion is quite different. In the first case, patients are split into B-cell and T-cell type leukemias and in the second case into children with or without minimal residual disease (MRD) after therapy.

The algorithms were implemented in the R programming language (www.r-project.org). The results were obtained using R version 2.2 and the libraries provided by the Bioconductor project (www.bioconductor.org), version 1.7.

3.1 Comparison of algorithms on Leukemia dataset I
All methods were run on an Acute Lymphoblastic Leukemia (ALL) dataset (Chiaretti et al., 2004) that was extensively studied in the literature on microarray analysis. The dataset is available in R as a Bioconductor package.

The dataset consists of 128 microarrays from different patients with ALL. The HGU95aV2 Affymetrix chip was used for the experiments. The annotation was performed using GO's biological process (BP) ontology. From a total of 12 625 genes found on the chip, 9231 genes are annotated to at least one GO term from the BP ontology. This list of genes induces a GO graph with 2677 nodes. The root of this graph contains all 9231 genes. All leaves are GO terms that have at least one gene annotated.

It is known that ALL cells are delivered from either B-cell or T-cell precursors. To determine the differentially expressed genes the patients are split according to the type and stage of the disease. There are 95 patients with B-cell ALL and 33 patients with T-cell ALL in the study. A two-sided t-test was applied to these groups and the raw p-values were adjusted using the false discovery rate method from Benjamini and Yekutieli (2001). This process resulted in 515 differentially expressed genes for a level {alpha} = 0.01 test.

Input for the enrichment algorithms are the list of differentially expressed genes and the GO graph. For the weight method two weighting functions were used. The first is the ratio of the natural logs, namely

Formula 1
The weight algorithm using this function is referred to as weight.log. The second weighting function is the simple ratio of the scores

Formula 1
This method is referred to as weight.ratio. The log ratio weights can be seen as a softer weighting.

The results are presented in Table 2. The p-values reported for all algorithms are adjusted p-values using the false discovery rate procedure from Benjamini and Yekutieli 2001. The results are quite different for the compared algorithms. Some GO terms that are found significant by the classic method are completely discarded by other methods. For example, GO:0030333 is reported as non-significant by method elim, but it is significant for all other methods. The list of GO terms that are discarded by the more sophisticated methods includes very general terms like response to biotic stimulus or defense response. The GO term GO:0009596 that is reported significant by all methods is a leaf in the GO graph. In the reported list of GO terms it contains the smallest number of annotated genes, namely 13 genes. The fact that the GO terms identified as significant terms in this study have considerably large numbers of annotated genes gives additional confidence in the relevance of the results.


View this table:
[in this window]
[in a new window]
 
Table 2 Statistics for significant GO terms for the ALL dataset (Chiaretti et al., 2004).

 
An analysis of rank correlations between the most significant GO terms shows large differences between the different scoring algorithms. The largest correlation of 0.46 can be observed between the elim method and the weight.ratio method, see the Supplementary Material, Section 1.

Another insightful way of looking at the results of the analysis is to investigate how the significant GO terms are distributed over the GO graph. For each algorithm the subgraph induced by the most significant GO terms is plotted. In the plots, the significant nodes are represented as boxes. The plotted graph is the upperInducedGraph() generated by these significant nodes. To better emphasize the differences to the classic method (Fig. 1), nodes that are significant for the classic method but not for the presented method are plotted as circles.

In Figure 2 (left graph) the subgraph induced by the top five GO terms for the elim method is shown. Nodes close to the root of the graph that were reported as significant by the classic method are now completely discarded, e.g. the ancestors of node GO:0006955. These nodes represent very general GO terms like response to biotic stimulus or defense response and, in most cases, these nodes are not interesting for the biological interpretation. The same behavior is observed in the case of the weight.ratio method, see Figure 2 (right graph). Compared with elim the significance of the nodes is reduced, especially in the upper levels of the graph.


Figure 2
View larger version (18K):
[in this window]
[in a new window]
 
Fig. 2 The subgraph induced by the five most significant GO terms found by the elim method (left graph) and by the weight.ratio method (right graph). For a detailed description of the figures see the legend of Figure 1.

 
3.2 Comparison of algorithms on Leukemia dataset II
In a second study we analyzed the cDNA dataset published by Cario et al. (2005) with expression measurements of 51 patients with childhood ALL. The patients in this study appear to be biologically and phenotypically homogeneous and are treated with the same therapy. However, they split into two groups of 30 patients without detectable minimal residual disease (MRD-SR) and 21 patients with high MRD load (MRD-HR).

The data were imported and processed in R. All probes on the array with more than missing values for the 51 samples were removed from the analysis. The expression values of multiple probes with the same LocusLink identifier coding for one respective gene were averaged. This preprocessing resulted in 13 236 genes. Only 6853 of theses genes are annotated to at least one GO term from the BP ontology. The induced GO graph contains 2733 nodes. Compared with the first Leukemia study discussed above the genes are annotated to more specific GO terms and they are more widely spread across the hierarchy. Applying a two-sided t-test to the groups defined by patients with MRD-SR and MRD-HR respectively, a total of 682 genes were detected as differentially expressed. p-values were adjusted with the FDR procedure (Benjamini and Yekutieli, 2001).

The induced GO subgraph plots for all methods and a results table in the form of Table 2 are available in the Supplementary Material. A comparison of the results for the two studies shows important differences but also remarkable similarities. Although both studies are related, they address different biological problems and the data are obtained with different experimental technologies. The GO term immune response is found significant for the first dataset but not for the second and cell adhesion is specific only to the second dataset. Further interesting GO terms for the second study are peripheral nervous system development and activation of MAPK activity. These processes should be investigated for obtaining more knowledge on MRD.

On the other hand, in both studies the GO terms GO:0019884 and GO:0019886 consistently are among the most significant terms for all algorithms, thus underlining the general importance of antigen presentation and antigen processing for ALL, Table 2. Similar concordances between the two Leukemia datasets are described in the results section of the article of Cario et al. (2005).

It is common to both studies that very general GO terms are discarded by the more sophisticated algorithms, in the second study specifically the originally high-scoring terms organogenesis and development.

3.3 Comparison of algorithms on simulated data
In order to objectively compare the quality of different GO scoring methods we need to know the true relevant GO terms. On real datasets like the ones introduced above the true significant GO terms are not known. To address this issue the algorithms were also tested on simulated data. In our simulation setup the significant GO terms are known and the performance of an algorithm is measured w.r.t. the number of correctly identified GO terms. To mimic the real data as best possible the GO biological process ontology was chosen as the underlying graph. The complete list of genes contains all 9231 probes on the HGU95aV2 chip that can be annotated to this graph. The resulting graph contains 2677 nodes.

In the first step, a set of truly enriched nodes is determined. These nodes represent the relevant biological functions and are denoted as enriched nodes. We select the enriched nodes randomly from all nodes whose number of annotated genes lies in a fixed pre-specified interval. Such a constraint is imposed in order to overcome two problems. Nodes with too few annotated genes are too specific and cannot synthesize the underlying biology, and nodes with too many annotated genes are too general and close to the root of the graph.

In the second step, given the enriched nodes, the list of interesting genes is obtained by combining all genes annotated to these nodes to a single set. To better model reality, some noise is introduced in the list of interesting genes. A fraction of the interesting genes is randomly selected and replaced with other genes.

The performance of the methods is assessed with the following measure. For each method M, the nodes are sorted in ascending order of their computed p-values. For a fixed cutoff k the score is the number of enriched nodes found among the top k nodes:

Formula 1
topk(M) denotes the set of the k top-scoring nodes for method M, and enriched denotes the set of enriched nodes. Methods that obtain a higher score better retrieve the true relevant nodes.

In the first simulation study 50 enriched nodes are selected, each having between 10 and 50 annotated genes. The noise introduced in the list of interesting genes is set to 10%, and results of 100 simulated datasets are averaged. Table 3 shows that all new methods perform better than the classic method. The weight methods always outperform the classic method by a considerable amount. For small values of k the elim method is the best, outperforming the classic method by a factor of 3. From the top 25 nodes selected by elim an average of 17 nodes are enriched nodes, compared with an average of only 5.5 enriched nodes for the classic method. Combining the p-values of all methods also helps, providing robust results.


View this table:
[in this window]
[in a new window]
 
Table 3 Average numbers of correctly identified enriched nodes over 100 simulation runs with 50 true enriched nodes, 10% noise level and 10 – 50 genes annotated to the enriched nodes, for different values of k

 
For real datasets the list of interesting genes contains much more noise than 10%. In a second simulation study with more stringent conditions we selected 50 enriched nodes from all nodes with 10–1000 annotated genes, and the noise level was set to 40%. With these parameters recovering the enriched nodes is much more difficult. As in the previous simulation study, the methods weight and elim clearly outperform the method classic, see the Supplementary Material, Table 5.

To obtain more insight into how each method accounts for the topology of the graph, the following scores are defined:

Formula 1

Formula 1

Formula 1

Formula 1

The score Formula 1 considers the first k nodes together with all children and parents of these nodes. For Formula 1 only parents and not children are included. Analogously, we define Formula 1 and Formula 1 that also include nodes with a distance of two levels from the first k nodes.

We compute such scores based on the observation that some methods, in particular the elim method, often yield more specific significant GO terms. If for a method M, the difference Formula 1 is large, then many true enriched nodes are among the parents of the detected nodes.

Figure 3 shows plots of scorek as a function of k. In Figure 3a and 3b Formula 1 and Formula 1 are plotted for the simulation study with 10% noise. In Figure 3c Formula 1 is plotted for the simulation with 40% noise. Adding parents leads to no improvement for the classic method, but to a 15–20% improvement for the elim method. The best performing methods are weight.ratio and the combination of all methods all.M for Formula 1 and elim for the Formula 1.


Figure 3
View larger version (8K):
[in this window]
[in a new window]
 
Fig. 3 Comparison of the quality of different GO scoring algorithms in a simulation study. Each curve represents the average of the numbers of preselected GO terms, over 100 simulation runs, that are among the top k GO terms according to the different algorithms. The proposed methods clearly outperform the method classic. In (a) and (b) Formula 1 and Formula 1 are plotted, respectively, with 10% noise introduced in the list of interesting genes, in (c) Formula 1 is plotted, with 40% noise.

 
The results obtained with the elim method show that including the children of the top scoring nodes gives no significant improvement (40 versus 41 enriched nodes for k = 50, see Supplementary Material, Table 4, for a detailed results table). Including only parents Formula 1 improves the results for the elim method. These observations show that for elim the true enriched nodes are among the top scoring nodes and their parents. Thus elim successfully identifies the areas in the graph with the enriched nodes, but sometimes fails to precisely point to the correct nodes.

3.4 Robustness of algorithms
The analysis of high throughput microarray experiments adopted in this article is sometimes referred to as two-stage analysis. In the first step genes receive scores, e.g. quantifying the correlation of a phenotype with the gene expression values. Based on the distribution of the scores a threshold is chosen and genes with a score above the threshold are called differentially expressed genes. In the second step, the enrichment of a set of genes is tested based on test statistics that depend on the list of differentially expressed genes.

It is believed that in many real-life cases the list of differentially expressed genes contains only a small fraction of truly differentially expressed genes. Thus, the result of the enrichment analysis can be hampered by the choice of the threshold. One method addressing this problems has been introduced in Subramanian et al. (2005).

To evaluate the influence of the threshold, all methods were run with different threshold values for the leukemia dataset discussed in Section 3.1. The results summarized in Figure 4 demonstrate the stability of the methods. Modifying the threshold value, and thus considering more or fewer differentially expressed genes, does not significantly change the order of the most significant GO terms. For the elim method adding more genes to the list of differentially expressed genes does not affect the top five GO terms. The weight method is the best in preserving the ordering; only few swaps between the GO terms are observed.


Figure 4
View larger version (40K):
[in this window]
[in a new window]
 
Fig. 4 Analysis of the influence of the threshold on the ranking of GO terms. Including all genes with FDR-adjusted p-value p ≤ 0.01 yields k = 515 differentially expressed genes. Changing the threshold according to an increase or decrease of numbers of genes in steps of size 10 the ranks of the originally 10 top-ranking GO terms are shown. A colored line describes the rank of a fixed GO term for different thresholds.

 

    4 CONCLUSIONS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSIONS
 REFERENCES
 
We addressed the problem of finding enriched functional groups of genes based on gene expression data. We proposed two novel heuristics for integrating the DAG structure of the GO in testing for group enrichment. The key idea is to compute the significance of a GO term based on its neighborhood. The experimental results show that the introduced methods improve over existing methods.

There is no clear measure to tell which method performs better on a real dataset. The evaluation is performed by the researcher and thus is inherently subjective. To eliminate this subjectivity, we evaluate the methods on simulated data.

With the classical approach in which each node is scored independently only few true significant nodes remain undiscovered. However, the dependencies between top scoring nodes yield a high false-positive rate. The simulation results show that the weight algorithm manages to reduce the false-positive rate, while not missing many true enriched nodes. The elim method further reduces the false-positive rate, but with a higher risk of discarding relevant nodes. For finding the important areas in the graph, the elim method is to be preferred, given its simplicity.

Other test statistics for assessing GO term significance than Fisher’s exact test can be used, e.g. a hypergeometric, binomial or {chi}2-test. In particular, test statistics that do not rely on a fixed list of genes, such as the normalized Kolmogorov–Smirnov test introduced in Subramanian et al. (2005), can be implemented in the method elim. In this article we restricted ourselves to a single test statistic in order to better emphasize the differences between the different algorithms.

In conclusion we showed that integrating the dependency structure between nodes is enhancing the inference. Moreover, combining the results of multiple methods even further improves the result of the analysis.


    Acknowledgments
 
Financial support was provided by BMBF grant No. 01GR0453 [AA, JR] and by the IMPRS [AA]. This work has been performed in the context of the BioSapiens Network of Excellence (EU contract no. LSHG-CT-2003-503265).

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Martin Bishop

Received on September 28, 2005; revised on March 30, 2006; accepted on April 4, 2006

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSIONS
 REFERENCES
 

    Al-Shahrour, F., et al. (2004) FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics, 20, 578–580[Abstract/Free Full Text].

    Ashburner, M., et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet, . 25, 25–29.

    Balasubramanian, R., et al. (2004) A graph-theoretic approach to testing associations between disparate sources of functional genomics data. Bioinformatics, 20, 3353–3362[Abstract/Free Full Text].

    Beissbarth, T. and Speed, T.P. (2004) GOstat: find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics, 20, 1464–1465[Abstract/Free Full Text].

    Benjamini, Y. and Yekutieli, D. (2001) The control of the false discovery rate in multiple testing under dependency. Ann. Stat, . 29, 1165–1188[CrossRef].

    Cario, G., et al. (2005) Distinct gene expression profiles determine molecular treatment response in childhood acute lymphoblastic leukemia. Blood, 105, 821–826[Abstract/Free Full Text].

    Chiaretti, S., et al. (2004) Gene expression profile of adult T-cell acute lymphocytic leukemia identifies distinct subsets of patients with different response to therapy and survival. Blood, 103, 2771–2778[Abstract/Free Full Text].

    Draghici, S., et al. (2003) Global functional profiling of gene expression. Genomics, 81, 98–104[CrossRef][Web of Science][Medline].

    GO Consortium. (2004) The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res, . 32, D258–D261[Abstract/Free Full Text].

    Grossmann, S., Bauer, S., Robinson, P.N., Vingron, M. (2006) An improved statistic for detecting over-represented Gene Ontology annotations in gene sets. Proceedings of the Lecture Notes in Computer Science 3909 , pp. 85–98 March 2006.

    Joslyn, C.A., et al. (2004) The gene ontology categorizer. Bioinformatics, 20, Suppl. 1, i169–i177[Abstract].

    Khatri, P. and Draghici, S. (2005) Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics, 21, 3587–3595[Abstract/Free Full Text].

    Lehmann, E.L. Testing Statistical Hypotheses. Springer Texts in Statistics, (1986) 2nd edn , New York Springer-Verlang.

    Subramanian, A., et al. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA, 102, 15545–15550[Abstract/Free Full Text].

    Zeeberg, B.R., et al. (2003) GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome Biol, . 4, R28[CrossRef][Medline].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Am. J. Pathol.Home page
M. Shi, J. Bradner, T. K. Bammler, D. L. Eaton, J. Zhang, Z. Ye, A. M. Wilson, T. J. Montine, C. Pan, and J. Zhang
Identification of Glutathione S-Transferase Pi as a Protein Involved in Parkinson Disease Progression
Am. J. Pathol., July 1, 2009; 175(1): 54 - 65.
[Abstract] [Full Text] [PDF]


Home page
Plant Physiol.Home page
S. Reiland, G. Messerli, K. Baerenfaller, B. Gerrits, A. Endler, J. Grossmann, W. Gruissem, and S. Baginsky
Large-Scale Arabidopsis Phosphoproteome Profiling Reveals Novel Chloroplast Kinase Substrates and Phosphorylation Networks
Plant Physiology, June 1, 2009; 150(2): 889 - 903.
[Abstract] [Full Text] [PDF]


Home page
Eukaryot CellHome page
T. Rossignol, C. Ding, A. Guida, C. d'Enfert, D. G. Higgins, and G. Butler
Correlation between Biofilm Formation and the Hypoxic Response in Candida parapsilosis
Eukaryot. Cell, April 1, 2009; 8(4): 550 - 559.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
N. Rieber, B. Knapp, R. Eils, and L. Kaderali
RNAither, an automated pipeline for the statistical analysis of high-throughput RNAi screens
Bioinformatics, March 1, 2009; 25(5): 678 - 679.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
C. Ortutay and M. Vihinen
Identification of candidate disease genes by integrating Gene Ontologies and protein-interaction networks: case study of primary immunodeficiencies
Nucleic Acids Res., February 1, 2009; 37(2): 622 - 628.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
D. W. Huang, B. T. Sherman, and R. A. Lempicki
Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists
Nucleic Acids Res., January 1, 2009; 37(1): 1 - 13.
[Abstract] [Full Text] [PDF]


Home page
Mol Biol EvolHome page
E. Loire, F. Praz, D. Higuet, P. Netter, and G. Achaz
Hypermutability of Genes in Homo sapiens Due to the Hosting of Long Mono-SSR
Mol. Biol. Evol., January 1, 2009; 26(1): 111 - 121.
[Abstract] [Full Text] [PDF]


Home page
Plant CellHome page
Z. Zhao, W. Zhang, B. A. Stanley, and S. M. Assmann
Functional Proteomics of Arabidopsis thaliana Guard Cells Uncovers New Stomatal Signaling Pathways
PLANT CELL, December 1, 2008; 20(12): 3210 - 3226.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
S. V. Rajagopala, J. Goll, N.D. D. Gowda, K. C. Sunil, B. Titz, A. Mukherjee, S. S. Mary, N. Raviswaran, C. S. Poojari, S. Ramachandra, et al.
MPI-LIT: a literature-curated dataset of microbial binary protein--protein interactions
Bioinformatics, November 15, 2008; 24(22): 2622 - 2627.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
E. Prifti, J.-D. Zucker, K. Clement, and C. Henegar
FunNet: an integrative tool for exploring transcriptional interactions
Bioinformatics, November 15, 2008; 24(22): 2636 - 2638.
[Abstract] [Full Text] [PDF]


Home page
Biol. Reprod.Home page
Y. M. Han, R. Romero, J.-S. Kim, A. L. Tarca, S. K. Kim, S. Draghici, J. P. Kusanovic, F. Gotsch, P. Mittal, S. S. Hassan, et al.
Region-Specific Gene Expression Profiling: Novel Evidence for Biological Heterogeneity of the Human Amnion
Biol Reprod, November 1, 2008; 79(5): 954 - 961.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
Y. Lu, R. Rosenfeld, I. Simon, G. J. Nau, and Z. Bar-Joseph
A probabilistic generative model for GO enrichment analysis
Nucleic Acids Res., October 1, 2008; 36(17): e109 - e109.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
S. Bauer, S. Grossmann, M. Vingron, and P. N. Robinson
Ontologizer 2.0--a multifunctional tool for GO term enrichment analysis and data exploration
Bioinformatics, July 15, 2008; 24(14): 1650 - 1651.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
A. V. Antonov, T. Schmidt, Y. Wang, and H. W. Mewes
ProfCom: a web tool for profiling the complex functionality of gene groups identified from high-throughput data
Nucleic Acids Res., July 1, 2008; 36(suppl_2): W347 - W351.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
Q. Zheng and X.-J. Wang
GOEAST: a web-based software toolkit for Gene Ontology enrichment analysis
Nucleic Acids Res., July 1, 2008; 36(suppl_2): W358 - W363.
[Abstract] [Full Text] [PDF]


Home page
ScienceHome page
K. Baerenfaller, J. Grossmann, M. A. Grobei, R. Hull, M. Hirsch-Hoffmann, S. Yalovsky, P. Zimmermann, U. Grossniklaus, W. Gruissem, and S. Baginsky
Genome-Scale Proteomics Reveals Arabidopsis thaliana Gene Models and Proteome Dynamics
Science, May 16, 2008; 320(5878): 938 - 941.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
D. Shriner, T. M. Baye, M. A. Padilla, S. Zhang, L. K. Vaughan, and A. E. Loraine
Commonality of functional annotation: a method for prioritization of candidate genes from genome-wide linkage studies
Nucleic Acids Res., March 27, 2008; 36(4): e26 - e26.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
J. J. Goeman and U. Mansmann
Multiple testing on the directed acyclic graph of gene ontology
Bioinformatics, February 15, 2008; 24(4): 537 - 544.
[Abstract] [Full Text] [PDF]


Home page
Brief Funct Genomic ProteomicHome page
F. Cordero, M. Botta, and R. A. Calogero
Microarray data analysis and mining approaches
Brief Funct Genomic Proteomic, January 22, 2008; (2008) elm034v1.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
D. Yang, Y. Li, H. Xiao, Q. Liu, M. Zhang, J. Zhu, W. Ma, C. Yao, J. Wang, D. Wang, et al.
Gaining confidence in biological interpretation of the microarray data: the functional consistence of the significant GO categories
Bioinformatics, January 15, 2008; 24(2): 265 - 271.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
A. Schlicker and M. Albrecht
FunSimMat: a comprehensive functional similarity database
Nucleic Acids Res., January 11, 2008; 36(suppl_1): D434 - D439.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
D. F. Schwarz, O. Hadicke, J. Erdmann, A. Ziegler, D. Bayer, and S. Moller
SNPtoGO: characterizing SNPs by enriched GO terms
Bioinformatics, January 1, 2008; 24(1): 146 - 148.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
S. Grossmann, S. Bauer, P. N. Robinson, and M. Vingron
Improved detection of overrepresentation of Gene-Ontology annotations with parent child analysis
Bioinformatics, November 15, 2007; 23(22): 3024 - 3031.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
C. Lottaz, J. Toedling, and R. Spang
Annotation-based distance measures for patient subgroup discovery in clinical microarray studies
Bioinformatics, September 1, 2007; 23(17): 2256 - 2264.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
J. Liu, J. M. Hughes-Oliver, and J. A. Menius Jr
Domain-enhanced analysis of microarray data using GO annotations
Bioinformatics, May 15, 2007; 23(10): 1225 - 1234.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
S. Falcon and R. Gentleman
Using GOstats to test gene lists for GO term association
Bioinformatics, January 15, 2007; 23(2): 257 - 258.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
S. W. Kong, W. T. Pu, and P. J. Park
A multivariate approach for integrating genome-wide expression data and biological knowledge
Bioinformatics, October 1, 2006; 22(19): 2373 - 2380.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/13/1600    most recent
btl140v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (51)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Alexa, A.
Right arrow Articles by Lengauer, T.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Alexa, A.
Right arrow Articles by Lengauer, T.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?