Bioinformatics Advance Access published online on July 12, 2006
Bioinformatics, doi:10.1093/bioinformatics/btl350
1 Center for Information Science and Technology, Temple University, Philadelphia, PA 19122, USA
* To whom correspondence should be addressed.
Motivation: Attribute selection is a critical step in the development of document classification systems. As a standard practice, words are stemmed and the most informative ones are used as attributes in classification. Due to the high complexity of biomedical terminology, general-purpose stemming algorithms are often conservative and could also remove informative stems. This can reduce accuracy, especially when the number of labeled documents is small. To address this issue, we propose an algorithm that omits stemming and, instead, uses the most discriminative substrings as attributes.
Results: The approach was tested on five annotated sets of abstracts from iProLINK that report experimental evidence about five types of protein post-translational modifications. The experiments showed that Naive Bayes and Support Vector Machine classifiers perform consistently better (with Area Under the ROC Curve (AUC) accuracy in the range 0.92-0.97) when using the proposed attribute selection than when using attributes obtained by the Porter stemmer algorithm (AUC in the range 0.86-0.93). The proposed approach is particularly useful when labeled datasets are small.
Availability: The supplementary data are available from www.ist.temple.edu/PIRsupplement.
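The idea of using discriminative substrings instead of stems as attributes can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not state the scoring criterion, so the score used here (absolute difference in document frequency between the two classes), the substring length bounds, and the function names `substrings` and `select_substrings` are all assumptions made for illustration.

```python
from collections import Counter


def substrings(text, min_len=3, max_len=6):
    """Return the set of character substrings (within words) found in text.

    The length bounds are illustrative assumptions, not values from the paper.
    """
    found = set()
    for word in text.lower().split():
        for n in range(min_len, max_len + 1):
            for i in range(len(word) - n + 1):
                found.add(word[i:i + n])
    return found


def select_substrings(docs, labels, k=10, min_len=3, max_len=6):
    """Return the k substrings whose document frequency differs most by class.

    labels holds 1 for positive documents and 0 for negative ones.
    Assumed score: |fraction of positives containing s - fraction of negatives
    containing s|. Ties are broken alphabetically for determinism.
    """
    pos, neg = Counter(), Counter()
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    for doc, y in zip(docs, labels):
        counter = pos if y == 1 else neg
        # Each document contributes at most once per substring.
        counter.update(substrings(doc, min_len, max_len))
    scores = {
        s: abs(pos[s] / max(n_pos, 1) - neg[s] / max(n_neg, 1))
        for s in set(pos) | set(neg)
    }
    return sorted(scores, key=lambda s: (-scores[s], s))[:k]
```

In this toy form, a substring such as "phos" is shared by "phosphorylation" and "phosphorylated" without any stemming step, which is the kind of attribute the abstract argues a general-purpose stemmer may miss. The selected substrings would then be used as binary or count attributes for a classifier such as Naive Bayes.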
Received March 2, 2006
Revised June 4, 2006
Accepted June 23, 2006
Article
Substring selection for biomedical document classification
Bo Han 1, Zoran Obradovic 1, Zhang-Zhi Hu 2, Cathy H. Wu 2, and Slobodan Vucetic 1 *
2 Department of Biochemistry and Molecular Biology, Georgetown University Medical Center, Washington, DC 20057, USA
Slobodan Vucetic, E-mail: vucetic{at}ist.temple.edu
Abstract
Associate Editor: Thomas Lengauer