

Bioinformatics Advance Access originally published online on May 18, 2006
Bioinformatics 2006 22(15):1924-1925; doi:10.1093/bioinformatics/btl196

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Comparative gene marker selection suite

Joshua Gould*, Gad Getz, Stefano Monti, Michael Reich and Jill P. Mesirov

Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA

*To whom correspondence should be addressed.


    ABSTRACT

Motivation: An important step in analyzing expression profiles from microarray data is to identify genes that can discriminate between distinct classes of samples. Many statistical approaches for assigning significance values to genes have been developed. The Comparative Marker Selection suite consists of three modules that allow users to apply and compare different methods of computing significance for each marker gene, a viewer to assess the results, and a tool to create derivative datasets and marker lists based on user-defined significance criteria.

Availability: The Comparative Marker Selection application suite is freely available as a GenePattern module. The GenePattern analysis environment is freely available at http://www.broad.mit.edu/genepattern

Contact: jgould{at}broad.mit.edu


    1 INTRODUCTION
When analyzing genome-wide transcription profiles from microarray data, the first step is often to identify genes that can discriminate between distinct classes of samples (usually defined by a phenotype, such as tumor or normal). This process is commonly referred to as marker (or feature) selection. Many statistical approaches have been developed to estimate the significance of marker genes. The Comparative Marker Selection suite implements many of these approaches, and provides visualization tools for easy comparison of the results that are generated. The suite consists of three modules in the GenePattern (Reich et al., 2006) analysis environment and includes (1) an analytic module that computes the statistical significance of each gene, and includes several methods of correcting for multiple hypothesis testing (MHT), (2) a visualization module to aid in the evaluation of the analytic module results, and (3) a utility module to create derivative datasets and marker lists based on user-defined significance criteria. Complete documentation for GenePattern is provided on the GenePattern website. Additionally, each module in the GenePattern environment contains detailed documentation. We also provide default settings for all module parameters.


    2 ANALYTIC MODULE
The analytic module takes as input a dataset of expression profiles from samples belonging to two phenotypes. Either two-channel (cDNA) or absolute value (Affymetrix) data can be used as input to the module, and any missing values can be imputed (see FAQ on the GenePattern website). If a dataset contains more than two phenotypes, there is the option to perform all pairwise comparisons or all one-versus-all comparisons. A test statistic (e.g. t-test) is chosen to assess the differential expression between the two classes of samples. Note that technical and biological replicates are handled the same way as independent samples. The significance (nominal P-value) of marker genes is computed using a permutation test, a commonly used method for this purpose. Permutation tests have the advantage of not assuming a parametric underlying distribution of expression values and, importantly, they preserve gene–gene correlations, which affect some measurements of significance. To construct a distribution of the test statistic under the null hypothesis of no differential expression, phenotype labels are randomly re-assigned to samples and the test statistic is recomputed for the relabeled dataset. This procedure is repeated for a given number of relabelings to yield the empirical null distribution. Note that the total number of distinct permutations is a function of the number of samples in each class (for example, given 20 samples, 10 of each class, the total number of distinct permutations is 20!/(10! × 10!) = 184 756). The minimum number of permutations necessary is a function of the number of hypotheses tested and the number of rejected hypotheses we expect. In most cases, we suggest a minimum of 10 samples per class. When the total number of permutations is not large enough to estimate a sufficiently accurate P-value, the module provides the option of computing asymptotic P-values based on the t-test.

In addition to reporting the nominal P-value, we also report the estimated 95% confidence interval for the nominal P-value to assess P-value accuracy. We also include an option to perform all possible relabelings to obtain exact P-values.
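
The relabeling procedure described above can be illustrated with a minimal Python sketch. It is not the module's implementation: the function names, the Welch-style t statistic, the two-sided comparison, the add-one correction and the toy expression values are all assumptions made for illustration.

```python
import math
import random
from statistics import mean, variance


def t_statistic(a, b):
    """Two-sample t statistic with unequal variances (Welch form)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def permutation_p_value(values, labels, n_perm=1000, seed=0):
    """Two-sided nominal P-value for one gene, estimated by relabeling.

    values: expression value per sample
    labels: 0/1 class label per sample
    """
    rng = random.Random(seed)

    def abs_stat(lab):
        a = [v for v, l in zip(values, lab) if l == 0]
        b = [v for v, l in zip(values, lab) if l == 1]
        return abs(t_statistic(a, b))

    observed = abs_stat(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # random re-assignment of phenotype labels
        if abs_stat(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate above zero (an assumption of
    # this sketch, not necessarily what the module does).
    return (hits + 1) / (n_perm + 1)


# Number of distinct relabelings for 20 samples, 10 in each class.
n_relabelings = math.comb(20, 10)

# Toy gene (made-up numbers) that is clearly higher in class 1.
values = [1.0, 1.2, 0.9, 1.1, 1.3, 3.0, 3.2, 2.9, 3.1, 3.3]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
p = permutation_p_value(values, labels, n_perm=500)
```

With only five samples per class there are just 252 distinct relabelings, so the smallest attainable two-sided P-value is bounded well away from zero. This is one way to see why a minimum number of samples per class is suggested.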

Selecting class markers is a particular instance of the general MHT problem. Since several thousand hypotheses are usually tested at once, the nominal P-values have to be corrected to account for the increased number of potential false positives. For example, if we test 20 000 genes for differential expression, a nominal P-value threshold of 0.01 would only ensure that the expected number of false positives is <200 (0.01 * 20 000).
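
The arithmetic in this example can be reproduced with a short Python sketch. The function name is hypothetical, and the calculation assumes the worst case in which every tested gene is a true null.

```python
def expected_false_positives(alpha, n_true_nulls):
    """Expected number of true-null genes whose nominal P-value falls below alpha."""
    return alpha * n_true_nulls


n_genes = 20_000

# Worst case: no gene is differentially expressed, so all 20 000 are nulls.
worst_case = expected_false_positives(0.01, n_genes)

# The same bound for a few other nominal thresholds.
bounds = {alpha: expected_false_positives(alpha, n_genes)
          for alpha in (0.05, 0.01, 0.001)}
```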

One approach to adjusting for MHT is to control the false discovery rate (FDR), the expected fraction of false positives among all genes reported as significant. In most cases, controlling the FDR is sufficient because the purpose of most microarray investigations is to generate hypotheses for further study. The FDR cut-off level controls the fraction of false leads that the user is willing to tolerate. We include two methods for computing the FDR: the BH procedure developed by Benjamini and Hochberg (1995) and the q-value method of Storey and Tibshirani (2003). The BH procedure gives a more conservative estimate of the FDR than the q-value. The q-value attempts to gain its extra power from estimating π₀ (0 ≤ π₀ ≤ 1), the fraction of true null hypotheses among all tested hypotheses (the BH procedure always assumes π₀ = 1). When this fraction is large, which is the case for many microarray experiments in which we test thousands of genes of which we expect very few to be differentially expressed, little advantage is gained by using the q-value. However, if we expect a large proportion of the tested genes to be differentially expressed, the q-value might allow for the reduction of false negatives when compared with the BH method. We include a plot of the π₀ estimate versus the tuning parameter λ to evaluate the accuracy of the final π₀ estimate.
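
The following Python sketch shows the BH step-up adjustment and a simple estimate of the fraction of true nulls at a single value of the tuning parameter. The function names are hypothetical, and the fraction estimate is a simplification: the published q-value method combines estimates over a range of tuning-parameter values rather than using a single one.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted P-values, returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def pi0_estimate(p_values, lam=0.5):
    """Estimated fraction of true nulls at one tuning-parameter value lam.

    Nominal P-values above lam are assumed to come mostly from true nulls,
    which are uniformly distributed on [0, 1].
    """
    m = len(p_values)
    above = sum(1 for p in p_values if p > lam)
    return min(1.0, above / (m * (1.0 - lam)))


nominal = [0.01, 0.04, 0.03, 0.005]
fdr_bh = bh_adjust(nominal)
```

Multiplying the BH-adjusted values by an estimate of the null fraction below 1 gives smaller, q-value-like quantities. When the estimate is close to 1, the two methods nearly coincide.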

When a more conservative approach is required, we suggest controlling the family-wise error rate (FWER), the probability of having at least one false positive. For example, the FWER may be preferred when further investigation of any false positive is costly. We include three methods for controlling the FWER. The Bonferroni method is the most conservative, followed by the empirical FWER and the maxT procedure (Westfall and Young, 1993).
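
Two of these adjustments can be sketched in Python as follows. The function names and inputs are hypothetical, and the maxT function shown is the simpler single-step variant, which may differ from the procedure the module implements.

```python
def bonferroni_adjust(p_values):
    """Bonferroni-adjusted P-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def maxt_adjust(observed_stats, permuted_stats):
    """Single-step maxT adjusted P-values.

    observed_stats: absolute test statistic per gene on the real labels
    permuted_stats: one list per relabeling, each holding the absolute
                    test statistic of every gene under that relabeling
    """
    # The largest statistic over all genes in each relabeling forms the
    # null distribution against which every gene is compared.
    max_per_perm = [max(row) for row in permuted_stats]
    n_perm = len(max_per_perm)
    return [sum(1 for mx in max_per_perm if mx >= obs) / n_perm
            for obs in observed_stats]


# Toy inputs (made-up numbers): two genes, four relabelings.
observed = [3.0, 1.0]
permuted = [[0.5, 1.2], [2.0, 0.3], [3.5, 0.1], [0.9, 0.8]]
fwer_maxt = maxt_adjust(observed, permuted)
fwer_bonf = bonferroni_adjust([0.01, 0.5])
```

Because the maxT null distribution is built from the same relabelings for all genes, it reflects correlations between genes, which the Bonferroni correction ignores.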

Sometimes data contain extraneous variables that are not accounted for in the design of the experiment and that can distort the results. For example, selecting markers that distinguish tumor from non-tumor samples might lead to incorrect results if some of the samples are male and some are female. To control for such confounding phenotypes, we provide a restricted permutations option (Good, 1994; Lu et al., 2005), in which the class labels are shuffled only within each confounding phenotype. In the example in which gender is the confounding phenotype, we can restrict the permutations so that female non-tumor labels are only permuted with female tumor labels and male non-tumor labels are only permuted with male tumor labels.
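
A restricted relabeling of this kind can be sketched in Python as below. The function name, label strings and sample layout are hypothetical and chosen only to mirror the gender example.

```python
import random


def restricted_shuffle(labels, strata, rng):
    """Shuffle class labels only within each confounding-phenotype stratum."""
    groups = {}
    for idx, stratum in enumerate(strata):
        groups.setdefault(stratum, []).append(idx)

    shuffled = list(labels)
    for indices in groups.values():
        block = [labels[i] for i in indices]
        rng.shuffle(block)  # labels never leave their stratum
        for i, lab in zip(indices, block):
            shuffled[i] = lab
    return shuffled


labels = ["tumor", "normal", "tumor", "normal", "tumor", "normal"]
gender = ["F", "F", "F", "M", "M", "M"]
relabeled = restricted_shuffle(labels, gender, random.Random(1))
```

Each stratum keeps its original count of tumor and non-tumor labels, so a difference between classes that is driven only by gender cannot appear more extreme in the observed data than in the relabeled data.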


    3 VISUALIZATION AND UTILITY MODULES
The visualization portion of the suite provides a framework for assessing the results from the analytic module. The module includes interactive histograms used to determine null distributions for each measure of significance. Additionally, a plot of the test statistic rank versus the test statistic is included, which is useful for visualizing the number of features that are upregulated in each class (Figure 1). Pairwise comparisons of different significance measures can be plotted to help users assess the relative stringency of the selected hypothesis rejection criteria. Users can visually inspect the profile of each feature across all samples in a heat map or expression profile format.

Users can view features that pass selected filtering criteria and create derivative datasets and feature lists from these filtered features. For example, a user can extract all genes that have a q-value <0.1 and save the corresponding dataset and marker list. We include a utility module to automate this function. All the plots in the viewer are dynamically updated to include only the features that pass the selected filtering criteria.
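
The filtering step in this example can be sketched in Python. The record layout, field names and values below are hypothetical and do not reflect the file formats used by the GenePattern modules.

```python
def filter_markers(results, field="q_value", threshold=0.1):
    """Keep only the features whose significance value is below the threshold."""
    return [row for row in results if row[field] < threshold]


# Made-up per-feature results, one record per gene.
results = [
    {"feature": "gene_A", "score": 5.2, "q_value": 0.01},
    {"feature": "gene_B", "score": 1.1, "q_value": 0.45},
    {"feature": "gene_C", "score": -4.7, "q_value": 0.08},
]

markers = filter_markers(results)
marker_list = [row["feature"] for row in markers]

# Split the markers by the sign of the score, i.e. by the class in which
# each feature is upregulated.
up_in_first_class = [r["feature"] for r in markers if r["score"] > 0]
up_in_second_class = [r["feature"] for r in markers if r["score"] < 0]
```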

The annotation of features is provided by two mechanisms. Affymetrix probe set identifiers can be interactively annotated from genomic databases such as GenBank, UniGene, SwissProt, LocusLink and Gene Ontology using the GeneCruiser Web service (Liefeld et al., 2005). Users can also enter their own annotations of features and view these annotations via a color-coding mechanism.


Fig. 1 Plot of test statistic score and table of significance values for data in Golub et al. (1999).

 

    Acknowledgments
 
The authors wish to thank the following members of the Cancer Program at the Broad Institute: Todd Golub, Ted Liefeld and David Twomey. GenePattern is supported by funding from the National Institutes of Health.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Nikolaus Rajewsky

Received on March 13, 2006; revised on April 21, 2006; accepted on May 15, 2006

    REFERENCES

    Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B (Methodological), 57, 289–300.

    Golub, T., et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression. Science, 286, 531–537.

    Good, P. (1994) Permutation Tests: A Practical Guide for Testing Hypotheses. Springer-Verlag, NY.

    Liefeld, T., et al. (2005) GeneCruiser: a web service for the annotation of microarray data. Bioinformatics, 21, 3681–3682.

    Lu, J., et al. (2005) MicroRNA expression profiles classify human cancers. Nature, 435, 834–838.

    Reich, M., et al. (2006) GenePattern 2.0. Nature Genetics, 38, 500–501.

    Storey, J.D. and Tibshirani, R. (2003) Statistical significance for genomewide studies. PNAS, 100, 9440–9445.

    Westfall, P.H. and Young, S.S. (1993) Resampling-Based Multiple Testing: Examples and Methods for P-Value Adjustment. Wiley Series in Probability and Statistics. Wiley, NY.




