

Bioinformatics Advance Access originally published online on April 27, 2007
Bioinformatics 2007 23(13):1697-1699; doi:10.1093/bioinformatics/btm144

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

GAzer: gene set analyzer

Sang-Bae Kim 1, Sungjin Yang 1, Seon-Kyu Kim 1, Sang Cheol Kim 2, Hyun Goo Woo 1, David J. Volsky 3, Seon-Young Kim 4,* and In-Sun Chu 1,*

1Korean BioInformation Center, KRIBB, Daejeon 305-806, 2Department of Applied Statistics, Yonsei University, Seoul, 120-749, Korea, 3Molecular Virology Division, St. Luke's-Roosevelt Hospital Center, 432 West 58th Street, Antenucci Building, Room 709, New York, NY 10019, USA and 4Human Genomics Laboratory, Genome Research Center, KRIBB, Daejeon 305-806, Korea

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 DESCRIPTION
 ACKNOWLEDGEMENTS
 REFERENCES
 

Summary: Gene Set Analyzer (GAzer) is a web-based integrated gene set analysis tool covering previously reported parametric and non-parametric models. Based on a simulation test of the reported algorithms, we classified and implemented three main statistical methods (the z-statistic, gene permutation and sample permutation) for ten gene set categories, including Gene Ontology (GO), for human, mouse, rat and yeast. The tool identifies significantly altered gene sets scored by z-statistics and P-values from the z-test or permutation test, and provides q-values and Bonferroni P-values to correct for multiple hypothesis testing. GAzer allows users to observe the expression changes of each gene in a gene set or to see the significance of the gene sets containing one or more genes of interest, thus allowing interactive data analysis at both the gene and gene set levels. Moreover, GAzer offers extensive annotation for each gene.

Availability: The GAzer gene set analyzer is freely available at http://integromics.kobic.re.kr/GAzer/

Contact: kimsy{at}kribb.re.kr and chu{at}kribb.re.kr

Supplementary information: This can be found on the web page (http://integromics.kobic.re.kr/GAzer/supplement.jsp)


    1 INTRODUCTION
 
A common approach to microarray data analysis is to identify differentially expressed genes between two sample populations and then to derive biologically meaningful interpretations from the selected gene list. In addition to conventional statistical approaches such as the t-test, SAM (Tusher et al., 2001) and other statistical models (Efron and Tibshirani, 2002; Pan, 2002), new approaches based on biologically categorized gene sets, such as Gene Ontology (GO), pathways and chromosomal locations, have recently been introduced (Al-Shahrour et al., 2005; Breslin et al., 2004; Mootha et al., 2003; Tu et al., 2005). Under the assumption that weak but coordinated expression changes of specific gene sets can better represent significant flows of biological processes, the gene set analysis approach has shown good potential for interpreting gene expression data since gene set enrichment analysis (GSEA) was introduced (Mootha et al., 2003).

Gene set analysis methods can be categorized as non-parametric [e.g. GSEA, ErmineJ (Lee et al., 2005) and Catmap (Breslin et al., 2004)] or parametric [e.g. PAGE (Kim and Volsky, 2005) and T-profiler (Boorsma et al., 2005)]. Although these approaches share the advantages and disadvantages of the corresponding conventional tests, such as the rank-sum test and the two-sample t-test, they are useful for obtaining biological insights from gene expression data.

An extensive collection of predefined gene sets is also important to maximize the analytical power of gene set analysis. In our previous studies, we showed that new and improved biological information can be extracted from composite GO (compGO) (Nam et al., 2006) or cis-regulatory element gene set (Kim and Kim, 2006) categories.

Here, we introduce a web-based and integrated gene set analysis tool designed for gene set analysis using various previously reported statistical methods. This tool includes ten functionally annotated gene sets to effectively extract specific biological information; the gene sets comprise pathways, chromosomal locations, InterPro domains, cis-regulatory elements and information about three GO and three compGO subgroups.


    2 DESCRIPTION
 
GAzer consists of a database of predefined gene sets, a utility to evaluate significantly changed gene sets from user input data and a gene annotation tool.

2.1 Gene set database
We constructed a database of ten predefined gene set categories for human, mouse, rat and yeast. To establish the gene sets, we downloaded annotation information from several public resources such as the NCBI Entrez Gene database (http://www.ncbi.nlm.nih.gov), Affymetrix (http://www.affymetrix.com), TRANSFAC (http://www.gene-regulation.com) and others. We describe the gene set construction process in detail on our web site (see Supplementary Material online), from which all the gene sets can also be downloaded. In GAzer, the gene symbol is used as the main identifier, and Affymetrix identifiers are converted into gene symbols so that replicate measurements of the same gene can be averaged.
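As an illustration only, the averaging of replicate probes into one value per gene symbol might look like the following minimal Python sketch. The function name `collapse_probes`, the probe IDs and the probe-to-symbol mapping are hypothetical, and dropping unmapped probes is an assumption of this sketch rather than documented GAzer behavior.

```python
from collections import defaultdict
from statistics import mean


def collapse_probes(expression, probe_to_symbol):
    """Average the rows of all probes that map to the same gene symbol.

    expression:      dict probe_id -> list of per-sample expression values
    probe_to_symbol: dict probe_id -> gene symbol (hypothetical mapping)
    """
    grouped = defaultdict(list)
    for probe, values in expression.items():
        symbol = probe_to_symbol.get(probe)
        if symbol is None:
            # Assumption of this sketch: probes without a symbol are dropped.
            continue
        grouped[symbol].append(values)
    # zip(*rows) walks the replicate rows sample by sample.
    return {
        symbol: [mean(column) for column in zip(*rows)]
        for symbol, rows in grouped.items()
    }


# Toy data: two probes for TP53, one probe for MYC, two samples each.
expression = {
    "probe_a": [1.0, 2.0],
    "probe_b": [3.0, 4.0],
    "probe_c": [5.0, 6.0],
}
probe_to_symbol = {"probe_a": "TP53", "probe_b": "TP53", "probe_c": "MYC"}
collapsed = collapse_probes(expression, probe_to_symbol)
```

Here `collapsed` holds one averaged expression profile per gene symbol, which is the unit that the gene set statistics then operate on.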

2.2 Methods
We applied previously reported methods, each having different features, to the identification of statistically significant gene sets (Table 1). Depending on the statistic and the statistical test used, we organized the methods into six categories: PAGE (z-test), T-profiler (t-test), NT (gene permutation with t-value), ET (sample permutation with t-value), and the old and new GSEA (Mootha et al., 2003; Subramanian et al., 2005).
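To make the parametric category concrete, the sketch below computes a PAGE-style z-score as described by Kim and Volsky (2005): the mean statistic of the gene set is compared with the mean of all genes and scaled by the overall standard deviation and the square root of the set size. It is a minimal Python illustration, not GAzer's R code; the function name `page_z`, the toy scores, the use of the population standard deviation and the two-sided P-value are assumptions of this sketch.

```python
from math import sqrt
from statistics import NormalDist, mean, pstdev


def page_z(scores, gene_set):
    """PAGE-style z-score and two-sided P-value for one gene set.

    scores:   dict gene symbol -> per-gene statistic (e.g. log fold change)
    gene_set: iterable of gene symbols
    """
    all_values = list(scores.values())
    members = [scores[g] for g in gene_set if g in scores]
    mu = mean(all_values)        # mean over all genes in the data set
    sigma = pstdev(all_values)   # SD over all genes (population SD assumed)
    m = len(members)             # number of set members present in the data
    z = (mean(members) - mu) * sqrt(m) / sigma
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p


# Toy data: six genes, with the set {g1, g2} shifted upwards.
scores = {"g1": 2.0, "g2": 2.0, "g3": 0.0, "g4": 0.0, "g5": -2.0, "g6": -2.0}
z, p = page_z(scores, ["g1", "g2"])
```

Because the reference distribution is the standard normal, no resampling is needed, which is why this family of methods is much cheaper to compute than the permutation-based ones.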



 
Table 1. Comparison of GAzer with previously reported gene set analysis tools

 
In a series of simulations (see Supplementary Material online), we investigated the performance of each method with respect to the following variables: the total number of genes in a data set, gene set size, sample size, fold change between two groups, the percentage of coordinately changing genes in a gene set and gene–gene correlation. Our simulation study showed the following points. First, gene permutation or z-test methods generally performed better than sample permutation or t-test methods. Second, when the sample size was small, sample permutation methods were impractical. Third, when the sample size was less than five, using fold change as the input yielded better performance than using the t-statistic. Fourth, among sample permutation-based methods, the method of Tian et al. (2005) performed better than either the old or the new GSEA method. In our study of the effects of increasing gene–gene correlation within gene sets, we found that permutation-based algorithms (PAGE and NT) tended to produce higher z-scores, while sample-based algorithms (ET, GSEA_N and GSEA_O) produced lower z-scores.

We chose to implement PAGE, NT and ET methods to allow testing of two related hypotheses for coordinated association of a group of genes with a phenotype of interest (Tian et al., 2005).
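The difference between the two permutation schemes can be sketched as follows. Gene permutation asks whether the set behaves differently from random gene sets of the same size, whereas sample permutation asks whether the set score would arise if the class labels carried no information. The Python code below is a simplified illustration under stated assumptions: it uses the difference of group means as the per-gene statistic in place of the t-value, the mean of the member statistics as the set score, and hypothetical function names and toy data; it is not the GAzer implementation.

```python
import random
from statistics import mean


def gene_stats(expr, labels):
    """Per-gene statistic: mean(class 1) - mean(class 0) (stand-in for t)."""
    stats = {}
    for gene, values in expr.items():
        group1 = [v for v, c in zip(values, labels) if c == 1]
        group0 = [v for v, c in zip(values, labels) if c == 0]
        stats[gene] = mean(group1) - mean(group0)
    return stats


def set_score(stats, gene_set):
    """Gene set score: average of the member statistics."""
    return mean(stats[g] for g in gene_set)


def gene_permutation_p(expr, labels, gene_set, n_perm=1000, seed=0):
    """Null: the set looks like a random set of genes of the same size."""
    rng = random.Random(seed)
    stats = gene_stats(expr, labels)
    observed = abs(set_score(stats, gene_set))
    genes = sorted(stats)
    hits = sum(
        abs(set_score(stats, rng.sample(genes, len(gene_set)))) >= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)


def sample_permutation_p(expr, labels, gene_set, n_perm=1000, seed=0):
    """Null: the set score is unrelated to the class labels."""
    rng = random.Random(seed)
    observed = abs(set_score(gene_stats(expr, labels), gene_set))
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # reassign samples to classes at random
        if abs(set_score(gene_stats(expr, shuffled), gene_set)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy data: six genes, three samples per class; A and B are up in class 1.
expr = {
    "A": [1.0, 1.2, 0.9, 3.0, 3.1, 2.9],
    "B": [2.0, 2.1, 1.9, 4.0, 4.2, 3.9],
    "C": [5.0, 5.1, 4.9, 5.0, 5.2, 4.8],
    "D": [3.0, 2.9, 3.1, 3.0, 3.1, 2.9],
    "E": [1.0, 1.1, 0.9, 1.0, 0.9, 1.1],
    "F": [4.0, 4.1, 3.9, 4.0, 3.9, 4.1],
}
labels = [0, 0, 0, 1, 1, 1]
p_gene = gene_permutation_p(expr, labels, ["A", "B"])
p_sample = sample_permutation_p(expr, labels, ["A", "B"])
```

With three samples per class there are only 20 distinct label assignments, so the sample permutation P-value cannot fall much below 0.1 in this toy example. This illustrates why sample permutation is of limited use when the sample size is small.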

2.3 GAzer system and implementation
GAzer reads two tab-delimited text files: a gene expression data file and a class information file whose class identifiers (‘0’ or ‘1’) define the two groups. Users can then select several attributes, including one of four species, the array type (Affymetrix or dual-channel array), the platform type in the case of an Affymetrix array, the ID type (gene symbol, RefSeq, Entrez gene or ORF) in the case of a dual-channel array, the minimum number of genes in a gene set, the gene set categories and the analysis method (PAGE_fc, PAGE_t, NT or ET). Users can choose up to ten gene set categories. For Affymetrix data, GAzer converts Affymetrix probe IDs into gene symbols and calculates the average value of replicated genes. After computing the statistics of the chosen method, GAzer summarizes the significant gene sets, filtered and scored by z-statistics and P-values from the z-test or permutation test (Fig. 1A). The system also provides q-values and Bonferroni P-values to correct for multiple hypothesis testing.
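The multiple-testing step can be illustrated with a short Python sketch. The Bonferroni correction multiplies each P-value by the number of gene sets tested. The text does not state how the q-values are estimated, so the Benjamini-Hochberg adjusted P-value is used below purely as a stand-in for a q-value; the function names are hypothetical.

```python
def bonferroni(pvalues):
    """Bonferroni-corrected P-values: p * n, capped at 1."""
    n = len(pvalues)
    return [min(1.0, p * n) for p in pvalues]


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P-values (stand-in for q-values)."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value down so the adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Toy P-values for four gene sets.
pvalues = [0.01, 0.04, 0.03, 0.5]
bonf = bonferroni(pvalues)
fdr = bh_adjust(pvalues)
```

Bonferroni controls the chance of any false positive and is therefore the more conservative of the two, while false discovery rate measures tolerate a small proportion of false positives among the reported gene sets.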


 
Fig. 1. Screen shots of GAzer. (A) Analysis results. Statistical values such as z-scores, P-values, q-values and Bonferroni P-values are calculated, and the significant gene sets are shown in a table ordered by q-value. (B) Gene list of a selected gene set. Users can display the genes of the selected gene set along with a heatmap of their expression values.

 
GAzer allows users to view the individual genes of a gene set with a heatmap of their expression values, and to prioritize each gene set in the web interface gene list to see which genes are significant in the overall behavior of the set (Fig. 1B). GAzer also allows users to cross-compare genes in predefined gene sets. Thus, GAzer can analyze gene expression data interactively at both the gene and gene set levels. Finally, extensive annotation information is provided for each gene, including public database identifiers such as NCBI and SWISSPROT, as well as GO and pathway information.

GAzer is implemented in the R language (http://www.r-project.org, http://www.bioconductor.org) and uses MySQL (http://www.mysql.com) as its DBMS. The program is wrapped in Java (http://www.java.sun.com) to provide a user-friendly web interface. All R scripts used in the simulation study and in GAzer are available on our web site (see Supplementary Material online). Further details, including a User's Guide, are also available on the web site.


    ACKNOWLEDGEMENTS
 
This software development was supported in part by the KRIBB Research Initiative Program and the 21c Frontier Functional Human Genome Project, Korea. We thank Joshua S. Yang and Kwang-Sik Shin for their excellent technical support.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Alfonso Valencia

Received on November 23, 2006; revised on April 1, 2007; accepted on April 6, 2007

    REFERENCES
 

    Al-Shahrour F, et al. Discovering molecular functions significantly related to phenotypes by combining gene expression data and biological information. Bioinformatics (2005) 21:2988–2993.

    Boorsma A, et al. T-profiler: scoring the activity of predefined groups of genes using gene expression data. Nucleic Acids Res. (2005) 33:W592–595.

    Breslin T, et al. Comparing functional annotation analyses with Catmap. BMC Bioinformatics (2004) 5:193.

    Efron B, Tibshirani R. Empirical bayes methods and false discovery rates for microarrays. Genet. Epidemiol. (2002) 23:70–86.

    Kim SY, Volsky DJ. PAGE: parametric analysis of gene set enrichment. BMC Bioinformatics (2005) 6:144.

    Kim SY, Kim Y. Genome-wide prediction of transcriptional regulatory elements of human promoters using gene expression and promoter analysis data. BMC Bioinformatics (2006) 7:330.

    Lee HK, et al. ErmineJ: tool for functional analysis of gene expression data sets. BMC Bioinformatics (2005) 6:269.

    Mootha VK, et al. PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat. Genet. (2003) 34:267–273.

    Nam D, et al. ADGO: analysis of differentially expressed gene sets using composite GO annotation. Bioinformatics (2006) 22:2249–2253.

    Pan W. A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics (2002) 18:546–554.

    Subramanian A, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA (2005) 102:15545–50.

    Tian L, et al. Discovering statistically significant pathways in expression profiling studies. Proc. Natl Acad. Sci. USA (2005) 102:13544–13549.

    Tu K, et al. MEGO: gene functional module expression based on gene ontology. Biotechniques (2005) 38:277–283.

    Tusher VG, et al. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA (2001) 98:5116–5121.





