Bioinformatics Advance Access published online on January 20, 2006
Bioinformatics, doi:10.1093/bioinformatics/btl008
1 Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA
* To whom correspondence should be addressed.
Motivation: Given a large set of potential features, such as the set of all gene-expression values from a microarray, it is necessary to find a small subset with which to classify. The task of finding an optimal feature set of a given size is inherently combinatorial, because assuring optimality requires checking all feature sets of that size. Thus, numerous suboptimal feature-selection algorithms have been proposed. There are strong impediments to evaluating feature-selection algorithms using real data when data are limited, a common situation in genetic classification. The difficulty is threefold. First, there are no class-conditional distributions from which to draw data points, only a single small labeled sample. Second, there are no test data with which to estimate the feature-set errors, and one must depend on a training-data-based error estimator. Finally, there is no optimal feature set with which to compare the feature sets found by the algorithms.
Results: This paper describes a genetic test bed for the evaluation of feature-selection algorithms. It begins with a large biological feature-label data set that is used as an empirical distribution and, using massively parallel computation, finds the top feature sets of various sizes based on a given sample size and classification rule. The user can draw random samples from the data, apply a proposed algorithm, and evaluate the proficiency of the proposed algorithm via three different measures (code provided). A key feature of the test bed is that, once a data set is input, a single command creates the entire test bed relative to the data set. The particular data set used for the first version of the test bed comes from a microarray-based classification study that analyzes a large number of microarrays, prepared with RNA from breast tumor samples from each of 295 patients.
Availability: The software and supplementary material are available at http://public.tgen.org/tgen-cb/support/testbed/.
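The workflow summarized in the abstract (treat a large labeled data set as an empirical distribution, draw a small sample, rank all feature sets of a given size exhaustively, then score a proposed algorithm against that ranking) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the authors' software: the synthetic data, the nearest-mean classification rule, the mean-gap filter standing in for a "proposed algorithm", the single-sample evaluation, and the `regret` measure are hypothetical choices, and the three measures used by the actual test bed are not reproduced.

```python
import itertools
import math
import random


def make_population(n_points=200, n_features=6, seed=0):
    """Synthetic stand-in for the large labeled data set (illustrative only)."""
    rng = random.Random(seed)
    population = []
    for i in range(n_points):
        label = i % 2
        # Only features 0 and 1 carry class information in this toy data.
        x = [rng.gauss(1.5 * label if f < 2 else 0.0, 1.0) for f in range(n_features)]
        population.append((x, label))
    return population


def draw_sample(population, n, rng):
    """Draw a class-balanced random sample of size n; return (sample, held-out rest)."""
    chosen = set()
    for label in (0, 1):
        members = [i for i, (_, y) in enumerate(population) if y == label]
        chosen.update(rng.sample(members, n // 2))
    sample = [population[i] for i in sorted(chosen)]
    rest = [p for i, p in enumerate(population) if i not in chosen]
    return sample, rest


def nearest_mean_error(train, test, features):
    """Fit a nearest-mean classifier on `train` using `features`; return error on `test`."""
    means = {}
    for label in (0, 1):
        rows = [x for x, y in train if y == label]
        means[label] = [sum(r[f] for r in rows) / len(rows) for f in features]
    errors = 0
    for x, y in test:
        dist = {lab: sum((x[f] - m) ** 2 for f, m in zip(features, means[lab]))
                for lab in (0, 1)}
        predicted = 0 if dist[0] <= dist[1] else 1
        errors += predicted != y
    return errors / len(test)


def exhaustive_ranking(sample, rest, n_features, k):
    """Score every feature set of size k; the held-out points play the role of the distribution."""
    scored = [(nearest_mean_error(sample, rest, fs), fs)
              for fs in itertools.combinations(range(n_features), k)]
    return sorted(scored)


def filter_select(sample, n_features, k):
    """Hypothetical 'proposed algorithm': keep the k features with the largest class-mean gap."""
    def gap(f):
        m = [sum(x[f] for x, y in sample if y == lab) /
             sum(1 for _, y in sample if y == lab) for lab in (0, 1)]
        return abs(m[0] - m[1])
    top = sorted(range(n_features), key=gap, reverse=True)[:k]
    return tuple(sorted(top))


rng = random.Random(1)
population = make_population()
sample, rest = draw_sample(population, n=30, rng=rng)

ranking = exhaustive_ranking(sample, rest, n_features=6, k=2)
best_error, best_set = ranking[0]

chosen = filter_select(sample, n_features=6, k=2)
chosen_error = nearest_mean_error(sample, rest, chosen)
regret = chosen_error - best_error  # hypothetical measure: excess error over the best set

print("feature sets checked:", len(ranking), "=", math.comb(6, 2))
print("best set:", best_set, "error:", round(best_error, 3))
print("chosen set:", chosen, "error:", round(chosen_error, 3), "regret:", round(regret, 3))
```

The number of candidate sets grows as the binomial coefficient `math.comb(n_features, k)`, which is why an exhaustive ranking over thousands of genes calls for the massively parallel computation mentioned above, and why it is computed once and reused for every algorithm under evaluation.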
Received September 30, 2005
Revised January 12, 2006
Accepted January 13, 2006
Article
Genetic test bed for feature selection
Ashish Choudhary 1, Marcel Brun 2, Jianping Hua 2, James Lowey 2, Ed Suh 2 and Edward R. Dougherty 3 *
2 TGen, 400 North Fifth Street, Suite 600, Phoenix, Arizona 85004, USA
3 Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA, and TGen, 400 North Fifth Street, Suite 600, Phoenix, Arizona 85004, USA
Edward R. Dougherty, E-mail: edward{at}ece.tamu.edu
Associate Editor: Satoru Miyano