Bioinformatics Advance Access published online on July 26, 2006

Bioinformatics, doi:10.1093/bioinformatics/btl407

© The Author (2006). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org
Received February 2, 2006
Revised July 4, 2006
Accepted July 22, 2006

Article

What should be expected from feature selection in small-sample settings

Chao Sima 1 and Edward R. Dougherty 2 *

1 Dept. of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843
2 Dept. of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843; Computational Biology Division, Translational Genomics Research Institute, Phoenix, AZ 85004

* To whom correspondence should be addressed.
Edward R. Dougherty, E-mail: edward{at}ece.tamu.edu


   Abstract

Motivation: High-throughput technologies for rapid measurement of vast numbers of biological variables offer the potential for highly discriminatory diagnosis and prognosis; however, high dimensionality together with small samples creates the need for feature selection, while at the same time making feature-selection algorithms less reliable. Feature selection must typically be carried out from among thousands of gene-expression features and in the context of a small sample (small number of microarrays). Two basic questions arise: (1) Can one expect feature selection to yield a feature set whose error is close to that of an optimal feature set? (2) If a good feature set is not found, should it be expected that good feature sets do not exist?

Results: The two questions translate quantitatively into questions concerning conditional expectation. (1) Given the error of an optimal feature set, what is the conditionally expected error of the selected feature set? (2) Given the error of the selected feature set, what is the conditionally expected error of the optimal feature set? We address these questions using three classification rules (linear discriminant analysis, linear support vector machine, and k-nearest-neighbor classification) and feature selection via sequential floating forward search and the t-test. We consider three feature-label models and patient data from a study concerning survival prognosis for breast cancer. With regard to the two focus questions, there is similarity across all experiments: (1) One cannot expect to find a feature set whose error is close to optimal, and (2) the inability to find a good feature set should not lead to the conclusion that good feature sets do not exist. In practice, the latter conclusion may be more immediately relevant, since when faced with the common occurrence that a feature set discovered from the data does not give satisfactory results, the experimenter can draw no conclusions regarding the existence or nonexistence of suitable feature sets.
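
The two conditional expectations described above can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the authors' code or models: the feature-label model (independent unit-variance Gaussian features, equal class priors, a mean shift `delta` on the first few features), the nearest-mean classifier (a simplified stand-in for linear discriminant analysis), and all function and parameter names are hypothetical choices made for illustration. The "optimal" feature set is taken here to be the size-k subset whose classifier, designed from the same sample, has the smallest true error, found by exhaustive search, which is only feasible because the number of features is kept small.

```python
import itertools
import math
import random
import statistics


def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def column(rows, j):
    """Values of feature j across all sample points."""
    return [row[j] for row in rows]


def t_scores(x0, x1):
    """Absolute two-sample (Welch) t-statistic for every feature."""
    scores = []
    for j in range(len(x0[0])):
        a, b = column(x0, j), column(x1, j)
        se = math.sqrt(statistics.variance(a) / len(a)
                       + statistics.variance(b) / len(b))
        scores.append(abs(statistics.mean(b) - statistics.mean(a)) / max(se, 1e-12))
    return scores


def true_error(subset, m0, m1, mu0, mu1):
    """Exact error of the nearest-mean classifier built on `subset`.

    m0, m1 are sample means; mu0, mu1 are the (assumed known) model means.
    Valid only for the assumed model: independent N(mean, 1) features,
    equal class priors.
    """
    w = [m1[j] - m0[j] for j in subset]                # discriminant direction
    norm = max(math.sqrt(sum(v * v for v in w)), 1e-12)
    c = sum(v * (m0[j] + m1[j]) / 2.0 for v, j in zip(w, subset))  # threshold
    p0 = sum(v * mu0[j] for v, j in zip(w, subset))    # projected class-0 mean
    p1 = sum(v * mu1[j] for v, j in zip(w, subset))    # projected class-1 mean
    return 0.5 * phi((p0 - c) / norm) + 0.5 * phi((c - p1) / norm)


def simulate(n_trials=200, n_per_class=10, n_features=10, n_informative=3,
             k=2, delta=1.0, seed=0):
    """Return one (optimal error, selected error) pair per simulated sample."""
    rng = random.Random(seed)
    mu0 = [-delta / 2 if j < n_informative else 0.0 for j in range(n_features)]
    mu1 = [delta / 2 if j < n_informative else 0.0 for j in range(n_features)]
    pairs = []
    for _ in range(n_trials):
        x0 = [[rng.gauss(m, 1.0) for m in mu0] for _ in range(n_per_class)]
        x1 = [[rng.gauss(m, 1.0) for m in mu1] for _ in range(n_per_class)]
        m0 = [statistics.mean(column(x0, j)) for j in range(n_features)]
        m1 = [statistics.mean(column(x1, j)) for j in range(n_features)]
        # Feature selection: keep the k features with the largest |t|.
        scores = t_scores(x0, x1)
        ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
        selected = tuple(sorted(ranked[:k]))
        # True error of the designed classifier on every size-k subset.
        errors = {s: true_error(s, m0, m1, mu0, mu1)
                  for s in itertools.combinations(range(n_features), k)}
        pairs.append((min(errors.values()), errors[selected]))
    return pairs


def conditional_mean(pairs, given, lo, hi):
    """Mean of the other error over pairs whose `given` error lies in [lo, hi).

    given=0 conditions on the optimal error, given=1 on the selected error.
    Returns None if no pair falls in the bin.
    """
    vals = [p[1 - given] for p in pairs if lo <= p[given] < hi]
    return statistics.mean(vals) if vals else None


if __name__ == "__main__":
    pairs = simulate()
    print("mean optimal error :", statistics.mean(p[0] for p in pairs))
    print("mean selected error:", statistics.mean(p[1] for p in pairs))
    print("E[selected | optimal in [0.25, 0.30)]:",
          conditional_mean(pairs, given=0, lo=0.25, hi=0.30))
    print("E[optimal | selected in [0.35, 0.40)]:",
          conditional_mean(pairs, given=1, lo=0.35, hi=0.40))
```

Binning the conditioning error, as `conditional_mean` does, is only one crude way to estimate a conditional expectation from simulated pairs; a regression fit over the pairs would serve the same purpose. The bin edges in the usage lines are arbitrary and may be empty for some parameter choices.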

Availability: http://ee.tamu.edu/~edward/feature_regression/.


Associate Editor: Satoru Miyano




