Skip Navigation

This Article
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow FREE Full Text (Screen PDF)
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (17)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Jörnsten, R.
Right arrow Articles by Yu, B.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Jörnsten, R.
Right arrow Articles by Yu, B.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

Bioinformatics Vol. 19 no. 9 2003
Pages 1100-1109
© 2003 Oxford University Press

Simultaneous gene clustering and subset selection for sample classification via MDL

Rebecka Jörnsten 1,* and Bin Yu 2

1 Department of Statistics, Rutgers University, 501 Hill Center, Piscataway, NJ 08854, USA
2 Department of Statistics, UC Berkeley, 367 Evans Hall, Berkeley, CA 94720, USA

Received on May 1, 2002 ; revised on October 29, 2002 ; accepted on November 11, 2002

Motivation: The microarray technology allows for the simultaneous monitoring of thousands of genes for each sample. The high-dimensional gene expression data can be used to study similarities of gene expression profiles across different samples to form a gene clustering. The clusters may be indicative of genetic pathways. Parallel to gene clustering is the important application of sample classification based on all or selected gene expressions. The gene clustering and sample classification are often undertaken separately, or in a directional manner (one as an aid for the other). However, such separation of these two tasks may occlude informative structure in the data. Here we present an algorithm for the simultaneous clustering of genes and subset selection of gene clusters for sample classification. We develop a new model selection criterion based on Rissanen’s MDL (minimum description length) principle. For the first time, an MDL code length is given for both explanatory variables (genes) and response variables (sample class labels). The final output of the proposed algorithm is a sparse and interpretable classification rule based on cluster centroids or the closest genes to the centroids

Results: Our algorithm for simultaneous gene clustering and subset selection for classification is applied to three publicly available data sets. For all three data sets, we obtain sparse and interpretable classification models based on centroids of clusters. At the same time, these models give competitive test error rates as the best reported methods. Compared with classification models based on single gene selections, our rules are stable in the sense that the number of clusters has a small variability and the centroids of the clusters are well correlated (or consistent) across different cross validation samples. We also discuss models where the centroids of clusters are replaced with the genes closest to the centroids. These models show comparable test error rates to models based on single gene selection, but are more sparse as well as more stable. Moreover, we comment on how the inclusion of a classification criterion affects the gene clustering, bringing out class informative structure in the data

Availability: The methods presented in this paper have been implemented in the R language. The source code is available from the first author.

Contact: rebecka{at}stat.rutgers.edu

* To whom correspondence should be addressed.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
J. Zhang, Y. Ji, and L. Zhang
Extracting three-way gene interactions from microarray data
Bioinformatics, November 1, 2007; 23(21): 2903 - 2909.
[Abstract] [Full Text] [PDF]


Home page
Stat Methods Med ResHome page
I. Van Mechelen, H.-H. Bock, and P. De Boeck
Two-mode clustering methods: astructuredoverview
Statistical Methods in Medical Research, October 1, 2004; 13(5): 363 - 394.
[Abstract] [PDF]



Disclaimer: Please note that abstracts for content published before 1996 were created through digital scanning and may therefore not exactly replicate the text of the original print issues. All efforts have been made to ensure accuracy, but the Publisher will not be held responsible for any remaining inaccuracies. If you require any further clarification, please contact our Customer Services Department.