Bioinformatics Advance Access originally published online on November 7, 2007
Bioinformatics 2007 23(24):3343-3349; doi:10.1093/bioinformatics/btm528
Unsupervised feature selection under perturbations: meeting the challenges of biological data
1School of Computer Science and Engineering, The Hebrew University of Jerusalem 91904, 2School of Physics and Astronomy, Tel Aviv University 69978 and 3Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem 91904, Israel
*To whom correspondence should be addressed.
Abstract
Motivation: Feature selection methods aim to reduce the complexity of data and to uncover the most relevant biological variables. In reality, information in biological datasets is often incomplete as a result of untrustworthy samples and missing values. The reliability of selection methods may therefore be questioned.
Method: Information loss is incorporated into a perturbation scheme that tests which features remain stable under it. The scheme is applied to data analysis by unsupervised feature filtering (UFF), a method that has been shown to be very successful in the analysis of gene-expression data.
Results: We find that the quality of UFF degrades smoothly with information loss and that it remains successful even under substantial damage. Our method allows selection of the best imputation method for a dataset treated by UFF. More importantly, scoring features according to their stability under information loss is shown to be correlated with biological importance in cancer studies. This scoring may lead to novel biological insights.
Contact: royke{at}cs.huji.ac.il
Supplementary information and code availability: Supplementary data are available at Bioinformatics online.
Received on July 23, 2007; revised on September 12, 2007; accepted on October 15, 2007
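As an illustration of the idea of scoring features by their stability under information loss, the following is a minimal sketch and not the authors' implementation. It assumes that information loss is simulated by discarding a random fraction of samples, and it uses a simple variance ranking as a stand-in selector, because the UFF score itself is not described in this abstract. All names and parameters (`top_k_by_variance`, `stability_scores`, `drop_fraction`, `rounds`) are hypothetical.

```python
import numpy as np


def top_k_by_variance(data, k):
    """Stand-in selector (NOT the UFF score): the k highest-variance features.

    `data` has one row per feature and one column per sample.
    """
    scores = data.var(axis=1)
    return set(np.argsort(scores)[-k:].tolist())


def stability_scores(data, k, drop_fraction=0.2, rounds=100, seed=0):
    """Fraction of perturbation rounds in which each feature is selected.

    Each round simulates information loss by keeping a random subset of
    samples (assumed perturbation), then reruns the selector on what is left.
    """
    rng = np.random.default_rng(seed)
    n_features, n_samples = data.shape
    n_keep = max(2, int(round(n_samples * (1.0 - drop_fraction))))
    counts = np.zeros(n_features)
    for _ in range(rounds):
        kept = rng.choice(n_samples, size=n_keep, replace=False)
        selected = top_k_by_variance(data[:, kept], k)
        counts[list(selected)] += 1
    return counts / rounds


# Synthetic example: 50 features x 40 samples, where the first 5 features
# are given a much larger spread than the rest.
rng = np.random.default_rng(1)
demo = rng.normal(size=(50, 40))
demo[:5] *= 5.0
scores = stability_scores(demo, k=5)
```

A feature with a score near 1 is selected in almost every perturbed version of the data, while a score near 0 means it is rarely or never selected. Missing values followed by imputation could be simulated in the same loop by masking entries instead of dropping whole samples.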