


Bioinformatics Advance Access published online on July 12, 2006

Bioinformatics, doi:10.1093/bioinformatics/btl350
All versions of this article: 22/17/2136 (most recent), btl350v1

© 2006 The Author(s)
Received March 2, 2006
Revised June 4, 2006
Accepted June 23, 2006

Article

Substring selection for biomedical document classification

Bo Han 1, Zoran Obradovic 1, Zhang-Zhi Hu 2, Cathy H. Wu 2, and Slobodan Vucetic 1 *

1 Center for Information Science and Technology, Temple University, Philadelphia, PA 19122, USA
2 Department of Biochemistry and Molecular Biology, Georgetown University Medical Center, Washington, DC 20057, USA

* To whom correspondence should be addressed.
Slobodan Vucetic, E-mail: vucetic{at}ist.temple.edu


Abstract

Motivation: Attribute selection is a critical step in the development of document classification systems. As a standard practice, words are stemmed and the most informative ones are used as attributes in classification. Due to the high complexity of biomedical terminology, general-purpose stemming algorithms are often conservative and can also remove informative stems. This can reduce accuracy, especially when the number of labeled documents is small. To address this issue, we propose an algorithm that omits stemming and instead uses the most discriminative substrings as attributes.
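
The idea of using discriminative substrings as attributes can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not state how substrings are scored or which lengths are considered, so the information-gain criterion, the 3 to 6 character length range, and the names `substrings`, `entropy`, and `select_substrings` are all assumptions made for illustration.

```python
import math
from collections import Counter


def substrings(text, min_len=3, max_len=6):
    """Return the set of within-word character substrings of one document."""
    found = set()
    for word in text.lower().split():
        for n in range(min_len, max_len + 1):
            for i in range(len(word) - n + 1):
                found.add(word[i:i + n])
    return found


def entropy(pos, neg):
    """Binary entropy (in bits) of a group with pos/neg document counts."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def select_substrings(docs, labels, k=5, min_len=3, max_len=6):
    """Rank substrings by information gain (an assumed criterion) and keep k."""
    n = len(docs)
    n_pos = sum(labels)
    df_all, df_pos = Counter(), Counter()  # document frequencies
    for doc, label in zip(docs, labels):
        subs = substrings(doc, min_len, max_len)
        df_all.update(subs)
        if label:
            df_pos.update(subs)
    base = entropy(n_pos, n - n_pos)
    scores = {}
    for sub, present in df_all.items():
        pos_in = df_pos[sub]
        neg_in = present - pos_in
        pos_out = n_pos - pos_in
        neg_out = (n - n_pos) - neg_in
        conditional = (present / n) * entropy(pos_in, neg_in) \
            + ((n - present) / n) * entropy(pos_out, neg_out)
        scores[sub] = base - conditional
    # Highest score first; ties broken alphabetically for reproducibility.
    return sorted(scores, key=lambda s: (-scores[s], s))[:k]
```

Called on a list of abstracts and binary labels (1 for a relevant abstract, 0 otherwise), `select_substrings` returns the top-scoring substrings, whose presence or absence in each document could then serve as attributes for a classifier. Because no stemmer is involved, word forms such as "phosphorylation" and "phosphorylated" contribute shared substrings directly.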

Results: The approach was tested on five annotated sets of abstracts from iProLINK that report experimental evidence about five types of protein post-translational modifications. The experiments showed that Naive Bayes and Support Vector Machine classifiers perform consistently better (with Area Under the ROC Curve (AUC) accuracy in the range 0.92-0.97) when using the proposed attribute selection than when using attributes obtained by the Porter stemmer algorithm (AUC in the range 0.86-0.93). The proposed approach is particularly useful when labeled datasets are small.
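
For readers unfamiliar with the AUC measure quoted above, the following short Python sketch shows one standard way to compute it: as the fraction of positive/negative pairs in which the positive example receives the higher classifier score, with ties counted as one half. The function name `auc` and the example inputs are illustrative assumptions and are not taken from the paper.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    Equals the probability that a randomly chosen positive example is
    scored above a randomly chosen negative one (ties count as 0.5).
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of positive from negative documents, which gives some context for the reported ranges.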

Availability: The supplementary data is available from www.ist.temple.edu/PIRsupplement.


Associate Editor: Thomas Lengauer




