Bioinformatics Advance Access originally published online on July 12, 2006
Bioinformatics 2006 22(18):2249-2253; doi:10.1093/bioinformatics/btl378
© 2006 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
ADGO: analysis of differentially expressed gene sets using composite GO annotation
1 Korean BioInformation Center, Korea Research Institute of Bioscience and Biotechnology, 52 Eoun-dong Yuseong-gu, Daejeon 305-333, Korea
2 Human Genome Laboratory, Genome Research Center, Korea Research Institute of Bioscience and Biotechnology, 52 Eoun-dong Yuseong-gu, Daejeon 305-333, Korea
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Motivation: Genes are typically expressed in a modular manner in biological processes. Recent studies reflect this feature by directly scoring gene sets when analyzing gene expression patterns. Gene annotations have been used to define the gene sets, which have served to reveal specific biological themes from expression data. However, current annotations have limited analytical power, because they are classified by single categories and thus provide only unary information for the gene sets.
Results: Here we propose a method for discovering composite biological themes from expression data. We intersected two annotated gene sets from different categories of Gene Ontology (GO). We then scored the expression changes of all the single and intersected sets. In this way, we were able to uncover, for example, a gene set with the molecular function F and the cellular component C that showed significant expression change, while the changes in the individual gene sets were not significant. We provide an example analysis of the HIV-1 immune response. In addition, we tested the method on 20 public datasets, where we found many filtered composite terms, the number of which reached on average 34% (strong criterion, 5% significance) of the number of significant unary terms. By using composite annotation, we can derive new and improved information about disease and biological processes from expression data.
Availability: We provide a web application (ADGO: http://array.kobic.re.kr/ADGO) for the analysis of differentially expressed gene sets with composite GO annotations. The user can analyze Affymetrix and dual channel array (spotted cDNA and spotted oligo microarray) data for four species: human, mouse, rat and yeast.
Contact: chu{at}kribb.re.kr
Supplementary information: http://array.kobic.re.kr/ADGO
| 1 INTRODUCTION |
|---|
Identifying genes that are differentially expressed between two groups of samples (e.g. disease/normal, treatment/control and poor/good prognosis patient groups) is of central importance in analyses of gene expression patterns from microarray experiments. Many investigators have focused on identifying individual genes using standard or modified statistical methods such as the two-sample t-test, SAM (Tusher et al., 2001), regression modeling (Thomas et al., 2001), the empirical Bayes method (Efron et al., 2001) and the mixture model (Pan et al., 2001). They then examined the annotation terms that are significantly enriched in the selected gene sets to infer the potential mechanisms of the underlying biological processes. However, recent studies suggested that gene expression might be altered in related groups defined by pathways, functions or localizations rather than individually (Mootha et al., 2003; Segal et al., 2004). In such cases, genes with pronounced expression changes can be detected, but many other genes showing coordinated but weak changes may easily be missed. Moreover, the detection of changes in the expression of individual genes is highly affected by measurement noise.
To overcome these shortcomings, Mootha et al. (2003) developed a method, named gene set enrichment analysis (GSEA), to score the expression changes in gene groups that share biological relevance. Various gene annotations and clusters were used to define the gene sets. The underlying principle is that weak but coordinated expression changes of specific gene sets can better represent significant flows of biological processes. Recently, Kim and Volsky (2005) suggested a simpler method using the Z-statistic. The central limit theorem justifies their method of testing statistical significance. The former is a non-parametric method using the ranks of gene scores and permutation tests, while the latter is a parametric method. See Al-Shahrour et al. (2005) for another parametric approach to gene set analysis. The relative merits and demerits of the two types of methods resemble those of the rank-sum test and the two-sample t-test for the analysis of single genes.
These approaches revealed specific biological themes directly from expression data and provided many useful insights into disease or specific phenotypes (Mootha et al., 2003; Goeman et al., 2004; Kim and Volsky 2005; Al-Shahrour et al., 2005; Tian et al., 2005; Subramanian et al., 2005). However, in many cases, if not all, analyses with single annotation categories may not be sufficient to uncover the changes of specific expression patterns. For example, not all of the genes categorized by a molecular function may alter their expression as a whole under an experimental condition; only those with a particular localization or those involved in a particular pathway might do so. Unary annotations cannot reveal such specific expression changes.
To solve this problem, we devised a simple but effective method for discovering composite biological themes from expression data. We employed the Z-statistic (Kim and Volsky 2005) to score the gene sets, but the idea is equally applicable to GSEA (Mootha et al., 2003) and other gene set analyses (Pavlidis et al., 2002; Lee et al., 2005; Tian et al., 2005; Subramanian et al., 2005; Boorsma et al., 2005). We intersected two gene sets belonging to different annotation categories so that the intersection carries composite annotation information. We are particularly interested in the intersection sets with significant expression changes for which the changes in the individual gene sets are not significant. We developed a web-based application, named ADGO, to find such gene sets with composite annotations. We used the three annotation categories (MF: molecular function, BP: biological process and CC: cellular component) of Gene Ontology (Ashburner et al., 2000) to generate gene sets with composite annotation. We tested ADGO on over 20 public expression datasets, where we found numerous interesting gene sets with composite annotations showing significant expression changes in diseases or treatments that could not be found by previous approaches. Such discoveries will provide detailed and improved insights into biology.
The idea of intersecting GO sets has been used for a similar purpose in MEGO (Tu et al., 2005), but MEGO does not provide statistical significance values, which limits its usefulness.
| 2 METHODS |
|---|
2.1 Construction of gene set database
We started with gene sets annotated by one of the three GO (Gene Ontology) categories in our database. Then we intersected every pair of gene sets whose members belonged to different GO categories. In this process, we applied the intersection between two gene sets only if the intersection contained at least 10 (default) genes and each difference set between the two sets contained at least 20% (default) of the original number of genes; we call this the 10–20% rule (Fig. 1). The user can modify these settings. These cut-offs allowed us to remove trivial intersection sets. In other words, we did not apply the intersection if the members of one set nearly coincide with those of the other, or if the two sets share only a small number of genes (<10). Using a total of 6402 terms in the three GO categories that contained 10 or more genes, we generated 14 422 gene sets with composite annotations between BP and MF. Likewise, 6420 and 8550 composite gene sets were generated in CC versus MF and BP versus CC intersections, respectively.
We added the gene sets with composite annotations to the GO set database.
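The construction step can be illustrated with a minimal Python sketch. It is not the ADGO implementation: the function name `composite_sets`, the dictionary layout and the toy gene identifiers are all hypothetical, and the sketch assumes that "each difference set" means both A − B and B − A, each measured against the size of its own original set.

```python
from itertools import product


def composite_sets(cat_a, cat_b, min_overlap=10, min_diff_frac=0.20):
    """Intersect every pair of gene sets taken from two different GO categories.

    cat_a, cat_b: dict mapping a term name to a set of gene identifiers.
    A pair is kept only if the intersection holds at least `min_overlap`
    genes and each difference set (A - B and B - A) holds at least
    `min_diff_frac` of the genes of its original set (assumed reading
    of the cut-offs described in the text).
    """
    kept = {}
    for (term_a, genes_a), (term_b, genes_b) in product(cat_a.items(), cat_b.items()):
        common = genes_a & genes_b
        if len(common) < min_overlap:
            continue  # the two sets share too few genes
        if len(genes_a - genes_b) < min_diff_frac * len(genes_a):
            continue  # A is almost entirely contained in B
        if len(genes_b - genes_a) < min_diff_frac * len(genes_b):
            continue  # B is almost entirely contained in A
        kept[(term_a, term_b)] = common
    return kept


def toy_genes(start, stop):
    """Build a toy gene set with hypothetical identifiers g<start>..g<stop-1>."""
    return {f"g{i}" for i in range(start, stop)}


mf_terms = {"F": toy_genes(0, 30)}
cc_terms = {
    "C": toy_genes(15, 45),         # 15 shared genes, both differences large
    "C_few": toy_genes(25, 60),     # only 5 shared genes
    "C_nested": toy_genes(0, 28),   # F - C_nested has only 2 genes
}
composites = composite_sets(mf_terms, cc_terms)
print(sorted(composites))
```

In this toy input only the pair ("F", "C") passes both cut-offs; the other two pairs are rejected for a small intersection and for near-containment, respectively.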
2.2 The ADGO system and implementation
ADGO is available at http://array.kobic.re.kr/ADGO. On the main page (Fig. 2), the user specifies the properties of the input data. The user can upload Affymetrix or dual channel array (spotted cDNA and spotted oligo microarray) data to analyze the expression changes of gene sets for four species: human, mouse, rat and yeast. The input data should be the population set, each member of which represents the expression change of a gene rather than the expression profile itself. For example, the values can be test statistics (e.g. two-sample t-statistic, SAM statistic), mean differences or fold changes between the two groups of samples examined. See the Supplementary Figure that is available through the Document link of our web page.
We built in a program for treating the replicated genes in Affymetrix data. If the user uploads an Affymetrix dataset and chooses the corresponding Affymetrix platform, ADGO will automatically average the values of the replicated genes and convert the Affymetrix IDs to the gene symbols used for dual channel arrays for further calculations. See Lee et al. (2005) for a discussion of the treatment of replicated genes.
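The averaging of replicated genes can be sketched as follows. This is only an illustration under stated assumptions: the function name `average_replicates`, the probe identifiers and the probe-to-symbol mapping are hypothetical, and probes without a known gene symbol are simply dropped, which the text does not specify.

```python
from collections import defaultdict


def average_replicates(probe_values, probe_to_symbol):
    """Average the values of all probes that map to the same gene symbol.

    probe_values: dict mapping a probe ID to its expression-change value.
    probe_to_symbol: dict mapping a probe ID to a gene symbol.
    Probes missing from the mapping are skipped (an assumption of this sketch).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for probe, value in probe_values.items():
        symbol = probe_to_symbol.get(probe)
        if symbol is None:
            continue  # no gene symbol known for this probe
        totals[symbol] += value
        counts[symbol] += 1
    return {symbol: totals[symbol] / counts[symbol] for symbol in totals}


# Hypothetical probe IDs: two probes for GENE_A, one for GENE_B, one unmapped.
values = {"probe_1": 1.0, "probe_2": 3.0, "probe_3": -2.0, "probe_4": 5.0}
mapping = {"probe_1": "GENE_A", "probe_2": "GENE_A", "probe_3": "GENE_B"}
per_gene = average_replicates(values, mapping)
print(per_gene)
```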
Among several available algorithms (Pavlidis et al., 2002; Mootha et al., 2003; Kim and Volsky 2005; Lee et al., 2005; Tian et al., 2005; Boorsma et al., 2005), we employed the Z-statistic to test the expression changes of gene sets. One clear advantage of the parametric method (Kim and Volsky 2005) over the permutation-based methods, with either gene or sample permutation (Mootha et al., 2003; Lee et al., 2005; Tian et al., 2005; Subramanian et al., 2005), is its fast computation, which is an important factor for a web server. Moreover, in a series of simulation studies, we found that the parametric method performed as well as the permutation-based methods (manuscript in preparation). Each (unary or composite) annotation set was scored by the Z-statistic and the corresponding p-value, and all the gene sets were sorted according to their Z-statistic values. We provided q-values and Bonferroni p-values to correct for multiple hypothesis testing, but we focus on q-values in our analyses. We provided two kinds of Bonferroni corrections. In one correction, we divided each p-value by the number of terms contained in the one of the six categories (MF, BP, CC, MF&BP, MF&CC and BP&CC) to which the term belonged. In the other correction, we used the total number of terms contained in the six categories. The numbers of terms used will be smaller than the numbers shown in Section 2.1, because we only used the terms generated by the input gene list.
In an analysis report, the user will be given all the single and composite terms with significant expression changes. Induced and repressed terms are colored red and blue, respectively. At this step, however, not all the composite terms are necessarily meaningful. We are particularly interested in the composite terms with significant expression changes for which neither of the individual terms involved is significant; we call them strongly filtered composite terms. In addition to the strongly filtered ones, the user can obtain weakly filtered terms, for which the individual terms involved have smaller absolute Z-statistic values than the composite term. Each of the weakly filtered terms can be classified as a strongly filtered term for another choice of the threshold value.
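The two filtering levels can be expressed as a small decision rule. The sketch below is an illustration, not ADGO code: the function name `classify_composite`, the returned labels and the use of a strict `q < alpha` test for significance are assumptions made here.

```python
def classify_composite(z_comp, q_comp, z_a, q_a, z_b, q_b, alpha=0.05):
    """Classify a composite term from its own score and those of its two parents.

    z_*: Z-statistics; q_*: q-values of the composite term and of the two
    single terms (a and b) it was built from. Significance is taken to mean
    q < alpha (an assumption of this sketch).
    """
    if q_comp >= alpha:
        return "not significant"
    if q_a >= alpha and q_b >= alpha:
        # Composite is significant while neither single term is.
        return "strong"
    if abs(z_a) < abs(z_comp) and abs(z_b) < abs(z_comp):
        # Both single terms score lower in absolute value than the composite.
        return "weak"
    return "unfiltered"


print(classify_composite(3.5, 0.01, 1.0, 0.40, 0.8, 0.60))   # strong
print(classify_composite(4.0, 0.001, 3.0, 0.02, 1.0, 0.50))  # weak
```

The second call shows why a weakly filtered term can become strongly filtered under another threshold: with `alpha=0.01` the single term with q = 0.02 is no longer significant, so the same composite term is classified as strong.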
2.3 PAGE algorithm
Here we briefly describe the PAGE algorithm (Kim and Volsky 2005). For each gene, PAGE evaluates the difference of means (DM) between two experimental groups and uses all the DMs as the population set. Each annotated gene set is then regarded as a random collection drawn from the population set. Hence, by the central limit theorem, the Z-score calculated from the DMs of the gene set approximately follows the standard normal distribution. This Z-score provides the test statistic for each gene set. DMs can be replaced by two-sample t-statistics or other difference values, if the numbers of experimental samples are not small.
To assure the reliability of the Z-scores, we adopted the default number 10 for the minimal gene set size (number of genes). In general, 30 samples are known to be sufficient to rely on the Z-statistic for most kinds of distributions, but for bell-shaped distributions that are typical of genome-wide expression data, 10 would supply a fairly good test statistic.
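As a worked illustration of the scoring described in this section, the Z-score of a gene set with m genes can be computed as Z = (mean of the set − population mean) × √m / population standard deviation. The Python sketch below follows that formula; the function name `page_z_score`, the use of the population (rather than sample) standard deviation, the two-sided p-value and the toy data are assumptions of this sketch rather than details confirmed by the text.

```python
import math
from statistics import NormalDist, fmean, pstdev


def page_z_score(population, gene_set, min_size=10):
    """Score one gene set against the population of expression changes.

    population: dict mapping a gene to its expression change (e.g. a DM).
    gene_set: iterable of gene identifiers; genes absent from the
    population are ignored.
    Returns (z, p) with a two-sided normal p-value, or None when fewer
    than `min_size` genes of the set are present in the population.
    """
    all_values = list(population.values())
    mu = fmean(all_values)
    sigma = pstdev(all_values)  # population standard deviation (assumed)
    scores = [population[g] for g in gene_set if g in population]
    m = len(scores)
    if m < min_size:
        return None  # too few genes for a reliable Z-score
    z = (fmean(scores) - mu) * math.sqrt(m) / sigma
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p


# Toy population: 100 hypothetical genes with expression changes 0.0 .. 99.0.
toy_population = {f"g{i}": float(i) for i in range(100)}
top_set = [f"g{i}" for i in range(90, 100)]   # the ten largest values
z_top, p_top = page_z_score(toy_population, top_set)
print(round(z_top, 2), p_top < 0.05)
```

A set drawn from the upper tail of the population, as in this toy example, receives a large positive Z-score; a set of fewer than 10 genes is not scored, in line with the default minimal set size.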
| 3 TESTS AND DISCUSSION |
|---|
We used ADGO to analyze the transcriptional changes induced by gp120, an HIV-1 encoded protein (Cicala et al., 2002). Binding of HIV-1 gp120 to macrophages induces an immune response in the host cell. The main response of the host cell to an HIV-1 infection is the production of soluble immune mediators such as cytokines and chemokines. Of these two terms, chemokine activities received the highest degree of significance in the list of MF terms scored by ADGO, which agrees with the previously reported results (Cicala et al., 2002). The other term, cytokine activities, was detected only through the composite term positive regulation of cell proliferation and cytokine activities (BP and MF). Fisher's exact test, which was used previously, declares a term significant even if only a small number of genes in the term show significant expression changes. Our approach can show which part of the term actually shows the significant change. At 5% significance, we found 13 strongly filtered composite terms (68% of the 19 significant single terms). Among them, we observed that the genes categorized by response to virus & extracellular region (BP and CC) were significantly induced. However, both of the single terms, response to virus and extracellular region, exhibited very high q-values. The term response to virus covers diverse components, including intracellular antiviral responses such as the PKR pathway, membrane-bound receptors such as TLR3, which detects viruses, and extracellular soluble proteins such as interferons and cytokines. The highly expected term response to virus could be detected only when it was restricted to the extracellular region. Figure 3 shows the strongly filtered composite terms for the HIV-1 gp120 response dataset.
To estimate the average proportion of filtered composite terms, we ran ADGO on 20 publicly available gene expression profile datasets. See Table 1 for the summarized results. The datasets were obtained from the GEO (Gene Expression Omnibus) database (http://www.ncbi.nlm.nih.gov/GEO) (Barrett et al., 2005). At 5% significance, we found many strongly filtered composite terms, the number of which reached on average 34% (61% for a weak criterion) of the number of significant unary terms. All the tested datasets are available on our website (Sample and Supplementary Table).
In some cases, the composite terms may refer to two aspects of the same behavior. However, most of such correlated pairs of gene sets, if not all, are screened out by the 10–20% rule (Fig. 1) and the strong filtering process. Under the 10–20% rule, highly correlated gene sets are likely to be filtered out, because the two single terms must differ from each other by at least 20%. Moreover, we allow users to raise the minimum percentage of difference up to 40% to ensure more complete filtering. We emphasize that many composite terms are not mere repetitions of the same behaviors (e.g. see the composite terms in Fig. 3). Even if a strongly filtered composite term represents duplicated information, it is still new information that could not be detected by single terms.
| 4 CONCLUSION AND FUTURE WORK |
|---|
Gene set analyses offer a useful approach to extracting biological information from expression data. However, they cannot show their full analytical power if we merely use unary annotation systems, because many biological processes can be described only by multiple features. As shown in this report, composite annotations can strongly reinforce the analyses and reveal many detailed biological themes from expression data.
ADGO is a useful tool for analyzing the expression changes of gene sets, especially of those with composite annotations. It is widely applicable: the user can analyze Affymetrix and dual channel microarray data from a variety of platforms and gene name systems for four eukaryotic species. Presently, ADGO is implemented for the three GO categories. In the next version, it will be extended with other annotation systems such as pathways, gene clusters, chromosomes or common cis-regulatory elements, with which richer biological information might be derived.
| Acknowledgments |
|---|
The authors thank Sangcheol Kim for providing R code for the PAGE algorithm and Drs C. Cicala (NIAID) and R. Lempicki (NIAID) for providing their microarray data. This work was supported by the Bio-Infrastructure Program of the Korea Ministry of Science and Technology. D.N. was partially supported by the National Institute for Mathematical Science. Funding to pay the Open Access publication charges for this article was provided by the Bio-Infrastructure Program of the Korea Ministry of Science and Technology.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: John Quackenbush
Received on April 18, 2006; revised on July 5, 2006; accepted on July 5, 2006
| REFERENCES |
|---|
Al-Shahrour, F., et al. (2005) Discovering molecular functions significantly related to phenotypes by combining gene expression data and biological information. Bioinformatics, 21, 2988–2993.
Ashburner, M., et al. (2000) Gene Ontology: tool for the unification of biology. The Gene Ontology consortium. Nat. Genet., 25, 25–29.
Barrett, T., et al. (2005) NCBI GEO: mining millions of expression profiles - database and tools. Nucleic Acids Res., 33, D562–D566.
Boorsma, A., et al. (2005) T-profiler: scoring the activity of predefined groups of genes using gene expression data. Nucleic Acids Res., 33, W592–W595.
Cicala, C., et al. (2002) HIV envelope induces a cascade of cell signals in non-proliferating target cells that favor virus replication. Proc. Natl Acad. Sci. USA, 99, 9380–9385.
Efron, B., et al. (2001) Empirical Bayes analysis of a microarray experiment. J. Am. Stat. Assoc., 96, 1151–1160.
Goeman, J.J., et al. (2004) A global test for groups of genes: testing association with a clinical outcome. Bioinformatics, 20, 93–99.
Kim, S.Y. and Volsky, D.J. (2005) PAGE: parametric analysis of gene set enrichment. BMC Bioinformatics, 6, 144–155.
Lee, H.K., et al. (2005) ErmineJ: tool for functional analysis of gene expression data sets. BMC Bioinformatics, 6, 269–276.
Pan, W., Lin, J., Le, C. (2001) A mixture model approach to detecting differentially expressed genes with microarray data. Research report 2001-011. Division of Biostatistics, University of Minnesota, MN.
Pavlidis, P., et al. (2002) Exploring gene expression data with class scores. Pac. Symp. Biocomput., 7, 474–485.
Segal, E., et al. (2004) A module map showing conditional activity of expression modules in cancer. Nat. Genet., 36, 1090–1098.
Subramanian, A., et al. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA, 102, 15545–15550.
Thomas, J.G., et al. (2001) An efficient and robust statistical modeling approach to discover differentially expressed genes using genomic expression profiles. Genome Res., 11, 1227–1236.
Tian, L., et al. (2005) Discovering statistically significant pathways in expression profiling studies. Proc. Natl Acad. Sci. USA, 102, 13544–13549.
Tu, K., et al. (2005) MEGO: gene function module expression based on gene ontology. BioTechniques, 38, 277–283.
Tusher, V.G., et al. (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA, 98, 5116–5121.