Bioinformatics Advance Access published online on August 27, 2004

Bioinformatics, doi:10.1093/bioinformatics/bth496
Bioinformatics © Oxford University Press 2004; all rights reserved

Received April 11, 2004
Revised August 19, 2004
Accepted August 20, 2004

Article

Gene name ambiguity of eukaryotic nomenclatures

Lifeng Chen 1*, Hongfang Liu 2, Carol Friedman 1

1 Department of BioMedical Informatics, Columbia University, New York, NY 10032, USA
2 Department of Information Systems, University of Maryland, Baltimore County, Baltimore, MD 21250, USA

* To whom correspondence should be addressed. E-mail: lifeng.chen{at}dbmi.columbia.edu.


Abstract

Motivation: With more and more scientific literature published online, the effective management and reuse of this knowledge have become problematic. Natural Language Processing (NLP) is a potential solution: it can extract, structure, and organize the biomedical information in online literature in a timely manner. One essential task is to recognize and identify genomic entities in text. "Recognition" can be accomplished using pattern matching and machine learning, but these techniques are not adequate for "identification". To identify genomic entities, NLP needs a comprehensive resource that specifies and classifies genomic entities as they occur in text and that associates them with normalized terms and unique identifiers, so that the extracted entities are well defined. Online organism databases are an excellent source for creating such a lexical resource. However, gene name ambiguity is a serious problem because it affects the appropriate identification of gene entities. In this paper we explore the extent of the problem and suggest ways to address it.
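
To make the distinction between recognition and identification concrete, the following is a minimal sketch of a lexical resource that maps each name as it occurs in text to a normalized term and an identifier. It is not the authors' implementation. The `GeneEntry` type, the `build_lexicon` and `identify` helpers, and the `TOY:` identifiers are all hypothetical, and the two rows are illustrative toy data.

```python
from collections import defaultdict
from typing import NamedTuple


class GeneEntry(NamedTuple):
    species: str
    gene_id: str          # made-up identifier, for illustration only
    official_symbol: str  # the normalized term for this gene


def build_lexicon(rows):
    """Map every surface form (symbol, synonym, full name) to its candidate entries."""
    lexicon = defaultdict(list)
    for entry, names in rows:
        for name in set(names):  # ignore a name listed twice for one gene
            lexicon[name].append(entry)
    return lexicon


def identify(lexicon, mention):
    """Return all entries a mention could denote; more than one means it is ambiguous."""
    return list(lexicon.get(mention, []))


# Toy rows (assumed for this example, not taken from the paper's resource).
ROWS = [
    (GeneEntry("mouse", "TOY:0001", "Trp53"), ["Trp53", "p53"]),
    (GeneEntry("human", "TOY:0002", "TP53"), ["TP53", "p53"]),
]
LEXICON = build_lexicon(ROWS)
```

In this sketch, `identify(LEXICON, "Trp53")` yields a single entry, while `identify(LEXICON, "p53")` yields two candidates from different species. A recognizer alone would flag both mentions as gene names; only the lookup exposes that the second cannot be assigned an identifier without further context.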

Results: We obtained gene information from 21 organisms and quantified naming ambiguities within species, across species, with English words and with medical terms. When the case (of letters) was retained, official symbols displayed negligible intra-species ambiguity (0.02%) and modest ambiguities with general English words (0.57%) and medical terms (1.01%). In contrast, the across-species ambiguity was high (14.20%). The inclusion of gene synonyms increased intra-species ambiguity substantially and full names contributed greatly to gene-medical term ambiguity. A comprehensive lexical resource that covers gene information for the 21 organisms was then created and used to identify gene names by using a straightforward string matching program to process 45,000 abstracts associated with the mouse model organism while ignoring case and gene names that were also English words. We found that 85.1% of correctly retrieved mouse genes were ambiguous with other gene names. When gene names that were also English words were included, 233% additional "gene" instances were retrieved, most of which were false positives. We also found that authors prefer to use synonyms (74.7%) to official symbols (17.7%) or full names (7.6%) in their publications.
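
The kinds of ambiguity quantified above can be illustrated with a short sketch. It assumes each rate is the share of distinct symbols that are ambiguous, which may differ from the exact definition used in the paper. The `ambiguity_rates` function, the toy records, and the word list are hypothetical and do not reproduce the reported figures.

```python
from collections import defaultdict


def ambiguity_rates(records, english_words, case_sensitive=True):
    """Compute toy ambiguity rates over distinct symbols.

    records: iterable of (species, gene_id, symbol) tuples.
    Assumption: a rate is (ambiguous distinct symbols) / (all distinct symbols).
    """
    norm = (lambda s: s) if case_sensitive else str.lower
    genes = defaultdict(set)    # (species, symbol) -> gene ids
    species = defaultdict(set)  # symbol -> species that use it
    for sp, gene_id, symbol in records:
        key = norm(symbol)
        genes[(sp, key)].add(gene_id)
        species[key].add(sp)

    symbols = set(species)
    words = {norm(w) for w in english_words}
    # Same symbol naming several genes within one species.
    intra = {key for (_, key), ids in genes.items() if len(ids) > 1}
    # Same symbol used in more than one species.
    across = {key for key, sps in species.items() if len(sps) > 1}
    total = len(symbols)
    return {
        "intra_species": len(intra) / total,
        "across_species": len(across) / total,
        "english": len(symbols & words) / total,
    }


# Toy data (assumed for this example).
RECORDS = [
    ("mouse", "M1", "Cat"),
    ("mouse", "M2", "Trp53"),
    ("mouse", "M3", "Cat"),  # second mouse gene with the same symbol
    ("fly", "F1", "cat"),
    ("fly", "F2", "w"),
    ("human", "H1", "CAT"),
]
WORDS = ["cat", "was"]

SENSITIVE = ambiguity_rates(RECORDS, WORDS, case_sensitive=True)
INSENSITIVE = ambiguity_rates(RECORDS, WORDS, case_sensitive=False)
```

On this toy data, retaining case keeps `Cat`, `cat`, and `CAT` apart, so no symbol is shared across species; ignoring case collapses them into one symbol used by three species. This mirrors why the abstract reports the case-retained figures separately from the case-insensitive matching run on the abstracts.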


This article has been cited by other articles:


H. Xu, J.-W. Fan, G. Hripcsak, E. A. Mendonca, M. Markatou, and C. Friedman
Gene symbol disambiguation using knowledge-based profiles
Bioinformatics, April 15, 2007; 23(8): 1015 - 1022.


S. Gaudan, H. Kirsch, and D. Rebholz-Schuhmann
Resolving abbreviations to their senses in Medline
Bioinformatics, September 15, 2005; 21(18): 3658 - 3664.


