Bioinformatics Vol. 19 Suppl. 1 2003
Pages i91-i94
© 2003 Oxford University Press
Combining NLP and probabilistic categorisation for document and term selection for Swiss-Prot medical annotation
1 Swiss Institute of Bioinformatics, CMU, 1 Michel-Servet - CH-1211 Genève 4, Switzerland
2 Xerox Research Centre Europe, 6 ch. de Maupertuis - F-38240 Meylan, France
Received on January 6, 2003; accepted on February 20, 2003
Motivation: Searching for publications relevant to manual database annotation is a tedious task. In this paper, we apply a combination of Natural Language Processing (NLP) and probabilistic classification to re-rank documents returned by PubMed according to their relevance to Swiss-Prot annotation, and to identify significant terms in the documents.
Results: With a Probabilistic Latent Categoriser (PLC) we obtained 69% recall and 59% precision for relevant documents in a representative query. As the PLC technique provides the relative contribution of each term to the final document score, we used the Kullback-Leibler symmetric divergence to determine the most discriminating words for Swiss-Prot medical annotation. This information should allow curators to understand classification results better. It also has great value for fine-tuning the linguistic pre-processing of documents, which in turn can improve the overall classifier performance.
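As an illustration of the term-ranking idea, the symmetric Kullback-Leibler divergence between two term distributions can be written as a sum of per-term contributions, (p_w - q_w) log(p_w / q_w), each of which is non-negative. The sketch below is not the authors' implementation: the function name, the two-class setup (relevant versus irrelevant) and the toy probabilities are all assumptions made for illustration, and it presumes both distributions are smoothed so that no probability is zero.

```python
import math

def symmetric_kl_contributions(p, q):
    """Per-term contributions to the symmetric KL divergence.

    p and q map each term to its probability under two classes
    (hypothetically: relevant and irrelevant documents). Both must
    cover the same vocabulary and contain no zero probabilities.
    """
    return {w: (p[w] - q[w]) * math.log(p[w] / q[w]) for w in p}

# Toy, made-up term distributions (not data from the paper).
p_relevant = {"mutation": 0.5, "patient": 0.3, "structure": 0.2}
p_irrelevant = {"mutation": 0.1, "patient": 0.3, "structure": 0.6}

contrib = symmetric_kl_contributions(p_relevant, p_irrelevant)

# Rank terms from most to least discriminating.
ranking = sorted(contrib.items(), key=lambda item: item[1], reverse=True)
total_divergence = sum(contrib.values())

for term, value in ranking:
    print(f"{term}: {value:.4f}")
```

Terms with equal probability in both classes contribute nothing, while terms that are much more frequent in one class than the other dominate the sum, which is why the ranking can serve as a list of discriminating words.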
Availability: The medical annotation dataset is available from the authors upon request.
Contact: Pavel.Dobrokhotov{at}isb-sib.ch; Cyril.Goutte{at}xrce.xerox.com
* To whom correspondence should be addressed.