Bioinformatics Advance Access published online on November 30, 2004

Bioinformatics, doi:10.1093/bioinformatics/bti171
Bioinformatics © Oxford University Press 2004; all rights reserved

Received September 16, 2004
Revised November 17, 2004
Accepted November 19, 2004

Article

Optimal number of features as a function of sample size for various classification rules

Jianping Hua 1, Zixiang Xiong 1, James Lowey 2, Edward Suh 2, and Edward R. Dougherty 3*

1 Dept of Electrical Engineering, Texas A&M University, College Station, TX 77843, USA
2 Translational Genomics Research Institute, Phoenix, AZ 85004, USA
3 Dept of Electrical Engineering, Texas A&M University, College Station, TX 77843, USA; Department of Pathology, University of Texas M.D. Anderson Cancer Center, Houston, TX 77030, USA

* To whom correspondence should be addressed.
Edward R. Dougherty, E-mail: e-dougherty{at}ee.tamu.edu


Abstract

Motivation: Given the joint feature-label distribution, increasing the number of features never increases the error of the optimal classifier; however, this is not the case when a classifier is designed via a classification rule from sample data. Typically (but not always), for fixed sample size, the error of a designed classifier decreases and then increases as the number of features grows. The potential downside of using too many features is most critical for small samples, which are commonplace for gene-expression-based classifiers for phenotype discrimination. For fixed sample size and feature-label distribution, the issue is to find an optimal number of features.
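The trade-off described above can be written compactly. The notation below is an assumption introduced here for illustration and is not taken from the abstract: \varepsilon_d is the optimal (Bayes) error using the first d features, \varepsilon_{n,d} is the error of a classifier designed from n sample points on those features, and \Delta_{n,d} is the resulting design cost.

```latex
\begin{align*}
\varepsilon_{d+1} &\le \varepsilon_d
  && \text{(known distribution, nested feature sets)} \\
\mathrm{E}[\varepsilon_{n,d}] &= \varepsilon_d + \mathrm{E}[\Delta_{n,d}],
  \quad \Delta_{n,d} \ge 0
  && \text{(classifier designed from } n \text{ sample points)} \\
d_{\mathrm{opt}}(n) &= \arg\min_{d}\ \mathrm{E}[\varepsilon_{n,d}]
  && \text{(optimal number of features for sample size } n\text{)}
\end{align*}
```

For fixed n, the first term shrinks or stays flat as d grows while the expected design cost typically grows, which is one way to read the decrease-then-increase behavior noted in the Motivation.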

Results: Since only in rare cases is there a known distribution of the error as a function of the number of features and sample size, this study employs simulation for various feature-label distributions and classification rules, and across a wide range of sample and feature-set sizes. To achieve the desired end, finding the optimal number of features as a function of sample size, it employs massively parallel computation. Seven classifiers are treated: 3-nearest-neighbor, Gaussian kernel, linear support vector machine, polynomial support vector machine, perceptron, regular histogram and linear discriminant analysis. Three Gaussian-based models are considered: linear, nonlinear and bimodal. In addition, real patient data from a large breast-cancer study are considered. To mitigate the combinatorial search for finding optimal feature sets, and to model the situation in which subsets of genes are co-regulated and correlation is internal to these subsets, we assume that the covariance matrix of the features is blocked, with each block corresponding to a group of correlated features. Altogether, the many cases yield a large number of error surfaces. These are provided in full on a companion web-site, which is meant to serve as a resource for those working with small-sample classification.
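As a rough illustration of this kind of simulation, the sketch below estimates the error of linear discriminant analysis as a function of the number of features for one fixed sample size, using a Gaussian model with a blocked covariance matrix. It is not the authors' protocol: the function names, block size, correlation `rho`, mean offset `delta`, test-set size and number of repetitions are all assumptions chosen for illustration, features are taken in a fixed order rather than selected, and a pseudo-inverse stands in for the covariance inverse when the pooled estimate is singular.

```python
import numpy as np

def blocked_covariance(num_features, block_size, rho, sigma2=1.0):
    """Covariance with correlation rho inside each block and 0 across blocks."""
    cov = np.zeros((num_features, num_features))
    for start in range(0, num_features, block_size):
        stop = min(start + block_size, num_features)
        cov[start:stop, start:stop] = rho * sigma2
    np.fill_diagonal(cov, sigma2)
    return cov

def sample_two_classes(rng, n_per_class, mean, cov):
    """Draw n_per_class points from N(-mean, cov) (label 0) and N(+mean, cov) (label 1)."""
    x0 = rng.multivariate_normal(-mean, cov, size=n_per_class)
    x1 = rng.multivariate_normal(mean, cov, size=n_per_class)
    return np.vstack([x0, x1]), np.repeat([0, 1], n_per_class)

def lda_fit(x, y):
    """Fit LDA with a pooled covariance estimate; returns weights and bias."""
    m0, m1 = x[y == 0].mean(axis=0), x[y == 1].mean(axis=0)
    centered = np.vstack([x[y == 0] - m0, x[y == 1] - m1])
    pooled = centered.T @ centered / (len(x) - 2)
    # Assumption: pseudo-inverse keeps the sketch usable when pooled is singular.
    w = np.linalg.pinv(pooled) @ (m1 - m0)
    b = -w @ (m0 + m1) / 2
    return w, b

def lda_error(w, b, x, y):
    """Fraction of points in (x, y) misclassified by the linear rule."""
    pred = (x @ w + b > 0).astype(int)
    return float(np.mean(pred != y))

def error_curve(n_per_class, max_features, block_size=5, rho=0.5,
                delta=0.3, n_test=2000, repeats=50, seed=0):
    """Average test error of LDA using the first d features, d = 1..max_features."""
    rng = np.random.default_rng(seed)
    cov = blocked_covariance(max_features, block_size, rho)
    mean = np.full(max_features, delta)
    errors = np.zeros(max_features)
    for _ in range(repeats):
        x_tr, y_tr = sample_two_classes(rng, n_per_class, mean, cov)
        x_te, y_te = sample_two_classes(rng, n_test, mean, cov)
        for d in range(1, max_features + 1):
            w, b = lda_fit(x_tr[:, :d], y_tr)
            errors[d - 1] += lda_error(w, b, x_te[:, :d], y_te)
    return errors / repeats
```

A call such as `curve = error_curve(n_per_class=15, max_features=20)` followed by `int(np.argmin(curve)) + 1` gives an estimated optimal feature count for that sample size under these assumed settings; repeating the call over a grid of sample sizes would trace out one error surface of the kind the abstract describes.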

Availability: For the companion web-site, please visit http://public.tgen.org/tamu/ofs/.




