Bioinformatics Advance Access originally published online on February 21, 2007
Bioinformatics 2007 23(8):1015-1022; doi:10.1093/bioinformatics/btm056

© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Gene symbol disambiguation using knowledge-based profiles

Hua Xu 1, Jung-Wei Fan 1, George Hripcsak 1, Eneida A. Mendonça 1, Marianthi Markatou 2,† and Carol Friedman 1,*,†

1Department of Biomedical Informatics, Columbia University, 622 168th St and 2Department of Biostatistics, Columbia University, 722 168th St, New York City, New York, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 5 CONCLUSION AND FUTURE...
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: The ambiguity of biomedical entities, particularly of gene symbols, is a big challenge for text-mining systems in the biomedical domain. Existing knowledge sources, such as Entrez Gene and the MEDLINE database, contain information concerning the characteristics of a particular gene that could be used to disambiguate gene symbols.

Results: For each gene, we create a profile with different types of information automatically extracted from related MEDLINE abstracts and readily available annotated knowledge sources. We apply the gene profiles to the disambiguation task via an information retrieval method, which ranks the similarity scores between the context where the ambiguous gene is mentioned, and candidate gene profiles. The gene profile with the highest similarity score is then chosen as the correct sense. We evaluated the method on three automatically generated testing sets of mouse, fly and yeast organisms, respectively. The method achieved the highest precision of 93.9% for the mouse, 77.8% for the fly and 89.5% for the yeast.

Availability: The testing data sets and disambiguation programs are available at http://www.dbmi.columbia.edu/~hux7002/gsd2006

Contact: friedman@dbmi.columbia.edu


    1 INTRODUCTION
1.1 Text mining and gene symbol ambiguity
The rapid growth of biomedical data and published literature has made apparent the need to develop automated methods for information retrieval and information extraction in the biomedical domain (Erhardt et al., 2006; Jensen et al., 2006; Krallinger and Valencia, 2005). The first critical step for text mining in this domain is to correctly identify biomedical entities by associating them with standardized identifiers. With the identified entities, more accurate retrieval and extraction could be achieved, and more complex findings, such as relations among entities, could be more accurately extracted.

Because of the long history of biological research and the lack of inter-species naming conventions, identifying biological entities is very difficult, particularly for gene names, which are of primary importance for understanding biological processes. Although guidelines for human gene nomenclature are available from the Human Genome Organization (HUGO), Tamames and Valencia (2006) showed that the scientific community has not widely adopted them and that there is no clear sign that the situation is improving. Currently, one gene may be referred to by several different names (synonymy), and one name may be associated with several different genes or with non-gene meanings, such as English words or other medical terms (homonymy, also called ambiguity).

Studies of gene names have shown that the ambiguity problem is complicated because a gene term: (1) may refer to different genes; (2) may refer to a gene, to another type of biomedical term or to a general English word; (3) may be used to denote an RNA, a protein or a gene; or (4) may refer to different species. If each ambiguous gene symbol in an article were accompanied by its corresponding long form, the disambiguation task would be much easier. However, Schuemie et al. (2004) analyzed 3902 biomedical full-text articles and found that only 30% of the gene symbols in the abstracts were accompanied by their corresponding full names, and only 18% of the gene symbols in the full text were accompanied by their gene names. Sehgal et al. (2004) showed that 1051 human gene symbols also had generic English meanings. Chen et al. (2005) studied the ambiguity of 21 organisms and found that 85.1% of the retrieved mouse genes were ambiguous with gene names from other species, and that 233% additional ‘gene’ instances were retrieved when gene names that were also English words were included in processing a set of 45 000 abstracts associated with mouse genes. A recent study of gene/protein nomenclature in five public databases (Fundel and Zimmer, 2006) showed that ambiguity within gene names, between gene names and common English words, and between gene names and domain-related terms is significant, that the degree of ambiguity varies between organisms, and that solving the problem is not trivial. Xu et al. (2006) studied ambiguity between gene-English and gene-UMLS (NLM, 2000) terms in mining MEDLINE. In a set of 82 922 mouse-related MEDLINE abstracts, 99.7% included an ambiguity between a gene symbol and a general English word, and 99.8% included an ambiguity between a gene symbol and a non-gene UMLS term. Even after removing frequent terms, which occur in more than 1% of all the documents, the percentage of articles containing remaining ambiguous terms was 46.2% for gene-English ambiguity and 68.6% for gene-UMLS ambiguity.

1.2 Word sense disambiguation (WSD)
Gene symbol disambiguation is a particular case of WSD, which is in essence a classification task. Research in automated WSD can be traced back to the 1950s (Yngve, 1955), and a number of WSD methods have been proposed for the general English domain. Many of the WSD algorithms (Harley and Glennon, 1997; Lesk, 1986; Wilks et al., 1990) use established knowledge sources, which are usually manually curated (e.g. dictionary, ontology). In the biomedical domain, Widdows et al. (2003) compared several methods utilizing the knowledge incorporated in the UMLS Metathesaurus and semantic network. The best precision (79%) was achieved when using related UMLS terms based on the relations defined in UMLS. Recently, Humphrey et al. (2006) proposed the Journal Descriptor Indexing (JDI) method to resolve ambiguity in the UMLS Metathesaurus and reported a precision of 0.7873 on NLM's WSD Test Collection. That method resolves ambiguous terms whose senses correspond to different semantic classes, but not terms whose senses share the same semantic class, such as different genes.

More recently, supervised machine learning (ML) technologies have received considerable attention and have shown promising results. Bruce and Wiebe (1994) applied a Bayesian algorithm and chose features based on their ‘informative’ nature. Lee and Ng (2002) evaluated a variety of knowledge sources (including the parts of speech of neighboring words, single words in the surrounding context, local collocations and syntactic relations) and supervised-learning algorithms (including support vector machines (SVM), Naive Bayes, AdaBoost, and decision tree algorithms) for WSD on the SENSEVAL-1 and SENSEVAL-2 data. Mohammad and Pedersen (2004) studied the contribution of lexical features and syntactic features to WSD, and results showed that simple lexical features (words in context and collocation) used in conjunction with part of speech information achieved better results than other feature combinations.

Although supervised ML technologies showed promising results, they are generally less scalable than unsupervised methods because they require annotated training data sets, which are difficult to obtain. There is research on developing automated methods to generate sense-tagged training data (Liu et al., 2002; Pustejovsky et al., 2001). However, even for cases where automated methods are available, it is still time consuming to train and build classifiers for every sense of each ambiguous word. In the biomedical domain, more and more gene-related knowledge sources have been developed, such as Gene Ontology (Ashburner et al., 2000) and Entrez Gene (Maglott et al., 2005), which makes it more feasible to use knowledge-based methods for gene symbol disambiguation.

1.3 Related work
The importance of identification of gene entities is well recognized by researchers in the field of text mining in the biomedical domain. The Gene List Task (Task 1B) (Hirschman et al., 2005) of the BioCreAtivE challenge is targeted to map gene mentions in MEDLINE abstracts to gene identifiers. When a gene mention maps to more than one identifier, disambiguation is required.

Podowski et al. (2004) developed a system that automatically assigns gene names to their LocusLink IDs (LLID) based on supervised-learning methods. For each LLID, a classifier is built using MEDLINE references in the LocusLink and SwissProt databases. A validation on MEDLINE documents for a set of 66 human genes showed varied F-measures (0.18–1.00), depending on the number of available training documents. This study provides a good model for gene symbol disambiguation using existing knowledge resources, but the supervised-learning method may not perform well if the automatically generated training set does not have a distribution similar to that of ‘real-world’ documents.

Schijvenaars et al. (2005) reported on a thesaurus-based method for human gene symbol disambiguation. Ambiguous human gene symbols were extracted from five public databases. A reference description from either Online Mendelian Inheritance in Man (OMIM) annotation or MEDLINE abstracts was built for each gene sense, and a context description of an ambiguous symbol was extracted from the textual context in which the ambiguous symbol occurs. A set of matching scores between the context description and the possible reference descriptions was calculated, and the gene corresponding to the reference description that best matched the context was chosen as the symbol's meaning. The system achieved an accuracy of 92.7% on an automatically generated testing set when five abstracts were used for the reference description. Schijvenaars's study described an effective method for gene symbol disambiguation, but the evaluation results were limited to certain conditions. The automatically generated testing set contained human gene symbols that appeared as long-form and short-form pairs (e.g. prostate specific antigen (PSA)) in articles, where at least six articles were required to be associated with each gene sense. However, when the gene symbol in the article is ambiguous and the long form is not present, the performance of the method is not known. Limiting the testing set to symbols with at least six articles for each sense may also have introduced bias in the testing set. Additionally, the method has not been tested on other organisms.

In this article, we describe a general method to disambiguate gene symbols, which relies on readily available resources. We build gene profiles from established knowledge sources (e.g. Entrez Gene) and apply them to gene symbol disambiguation via an information retrieval method, focusing on disambiguation among different gene senses. Our research differs from previous work in that: (1) two natural language processing (NLP) systems were utilized to obtain normalized concept identifiers and relations for profile generation; (2) our method allows features from text and manually annotated knowledge sources to be combined; (3) the performance of different types of features (including words and ontological concepts) for the profiles was studied; and (4) three organisms (mouse, fly and yeast) were studied, because research has shown that the difficulty of disambiguation varies among organisms (Fundel and Zimmer, 2006).


    2 METHODS
2.1 Knowledge sources and NLP systems
Entrez Gene is a gene-specific database developed at NCBI (National Center for Biotechnology Information). Two files from the Entrez Gene database were downloaded in January 2006 for this study. The ‘gene2pubmed’ file lists the articles in which a gene occurs and was used to obtain articles associated with a particular gene so that information about the gene could be extracted and used for the profiles. The ‘gene2go’ file records annotated GO codes associated with each gene, which is valuable for characterizing the gene.

The simplest type of information about a gene consists of the words in the related abstracts. MeSH terms are another type of information, which is usually very accurate because the terms are manually assigned by curators based on full-text articles. However, MeSH terms are generally limited to several concepts with coarse granularity. Another type of information in the articles consists of ontological concepts, but to obtain these, biomedical terms in the abstracts have to be identified and mapped to concepts. For example, the phrase ‘neuroepithelial cells’ in article (PMID: 9883720) could be represented by a UMLS concept unique identifier (CUI), C1449624. A more complex type of information that can be obtained from the articles consists of explicit relations between the gene and biomedical concepts. In the method we propose, two NLP systems are used to extract UMLS concepts and relations between concepts from the MEDLINE abstracts. The first NLP system is MetaMap, developed at the National Library of Medicine (NLM), which maps terms in biomedical text to UMLS Metathesaurus concepts (Aronson, 2001) but does not capture relations. The second NLP system is BioMedLEE (Biomedical Language Extracting and Encoding system), which extracts and encodes a broad variety of genotypic and phenotypic entities and relations from the biomedical literature (Lussier et al., 2006). BioMedLEE identifies a gene term, lists its corresponding identifier if it is unambiguous or its candidate identifiers if it is ambiguous, and extracts relations between gene terms and other biomedical entities. Figure 1 shows an example of the simplified XML output generated by BioMedLEE (some unrelated information, such as the output for gene ‘Abl’, is not shown), which identified the ambiguous gene symbol ‘Arg’ and listed the two candidate GeneIDs for the mouse. It also identified a phenotypic concept ‘neurulation’ and a relation called ‘essential for’ between the gene ‘Arg’ and the concept ‘neurulation’.


Fig. 1. An example of simplified XML output generated by BioMedLEE. BioMedLEE lists the two candidate GeneIDs of the ambiguous gene symbol ‘Arg’ (11352 and 68703) belonging to the mouse (organism ID: 10090). It also identifies a relation called ‘essential for’ between the gene and a phenotypic concept ‘neurulation’.

 
2.2 Document set and testing data generation
Based on the first column (organism ID) of the ‘gene2pubmed’ file, we downloaded all the MEDLINE abstracts known to be related to each of the three organisms: mouse, fly and yeast. This yielded three sets of documents containing 82 922 articles for the mouse, 10 371 for the fly and 9936 for the yeast. From these sets of documents, we automatically generated a sense-tagged subset, as described below, and used it for testing. The remaining documents were then used for profile generation.

The sense-tagged testing set was generated based on two facts: (1) BioMedLEE identifies gene terms for a specific organism, and if a gene term is ambiguous, it lists the possible candidate gene identifiers (GeneIDs from Entrez Gene); (2) the ‘gene2pubmed’ file lists genes associated with PubMed articles. If BioMedLEE identifies a gene term in an article as ambiguous and one of its candidate gene senses, ‘S1’, is known to be associated with the article according to ‘gene2pubmed’, then ‘S1’ is assumed to be the correct sense and is tagged as such for the ambiguous gene term occurring in that article. For example, the article with PMID 9883720 contains the ambiguous gene term ‘Arg’, which occurs in the sentence ‘Thus, Abl and Arg play essential roles in neurulation and can regulate the structure of the actin cytoskeleton.’ BioMedLEE identifies ‘Arg’ as having two possible gene senses for the mouse: (1) GeneID: 68703 ‘arginine glutamic acid dipeptide (RE) repeats’, and (2) GeneID: 11352 ‘Abelson-related gene’. The ‘gene2pubmed’ file indicates that the article is associated with GeneID 11352, and therefore we assign the gene sense ‘GeneID: 11352 (Abelson-related gene)’ as the correct sense for the gene symbol ‘Arg’ in that article, and use the pair of the gene symbol and the PMID (Arg–9883720) as one of the testing samples. This sense-tagged testing set served as the gold standard for our evaluation and is available online. We made two assumptions when generating the testing set and designing the disambiguation algorithm described in this study: (1) one sense per discourse, which means that an ambiguous gene symbol always has the same sense within the same article; and (2) the organism associated with the article is known.
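The sense-tagging rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `tag_senses` and the input shapes (a list of ambiguous mentions and a PMID-to-GeneIDs mapping) are hypothetical, and skipping mentions for which zero or several candidates are linked to the article is an assumption, since the text does not say how such cases were handled.

```python
def tag_senses(ambiguous_mentions, gene2pubmed):
    """Assign a gold-standard sense to ambiguous gene symbols.

    ambiguous_mentions: iterable of (pmid, symbol, candidate_gene_ids),
        as an NLP system might report them (hypothetical shape).
    gene2pubmed: dict mapping pmid -> set of GeneIDs linked to that article.
    Returns a dict mapping (symbol, pmid) -> GeneID.
    """
    tagged = {}
    for pmid, symbol, candidates in ambiguous_mentions:
        # Candidate senses that the curated file links to this article.
        linked = set(candidates) & gene2pubmed.get(pmid, set())
        # Assumption: keep the sample only when exactly one candidate
        # is linked, so the gold-standard sense is unambiguous.
        if len(linked) == 1:
            tagged[(symbol, pmid)] = next(iter(linked))
    return tagged


mentions = [(9883720, "Arg", [68703, 11352])]
links = {9883720: {11352}}
print(tag_senses(mentions, links))
```

With the article's own example, the pair (Arg, 9883720) is tagged with GeneID 11352.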

2.3 Context and profile vectors
For each testing sample (a pair consisting of an ambiguous gene symbol and a PMID), we created a context vector from the MEDLINE abstract in which the ambiguous gene symbol occurred. For every candidate gene in the testing set, we generated a gene profile vector using the information derived from all MEDLINE articles known to be related to the gene (excluding testing samples) based on the ‘gene2pubmed’ file, and information from other manually annotated knowledge sources, such as ‘gene2go’ (see below for the information used for the profiles). When using articles known to be related to particular genes, we excluded articles that were associated with more than 25 genes. In our observation, an article associated with so many genes usually discusses a high-throughput method that generates data about many genes, such as an expression array, and therefore carries little information about any particular gene.
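As a rough sketch of the article-selection step just described, the snippet below groups articles by gene while dropping testing documents and articles linked to more than 25 genes. The function name `profile_articles` and the `(gene_id, pmid)` row format are assumptions made for illustration; only the threshold of 25 comes from the text.

```python
from collections import defaultdict

MAX_GENES_PER_ARTICLE = 25  # threshold stated in the text


def profile_articles(gene2pubmed_rows, test_pmids):
    """Return {gene_id: set of PMIDs} usable for profile generation.

    gene2pubmed_rows: iterable of (gene_id, pmid) pairs (hypothetical shape).
    test_pmids: set of PMIDs reserved for the testing set.
    """
    genes_by_pmid = defaultdict(set)
    for gene_id, pmid in gene2pubmed_rows:
        genes_by_pmid[pmid].add(gene_id)

    articles_by_gene = defaultdict(set)
    for pmid, genes in genes_by_pmid.items():
        # Skip testing documents and likely high-throughput articles.
        if pmid in test_pmids or len(genes) > MAX_GENES_PER_ARTICLE:
            continue
        for gene_id in genes:
            articles_by_gene[gene_id].add(pmid)
    return dict(articles_by_gene)
```

Genes whose only articles are filtered out simply do not appear in the result, which corresponds to having no profile.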

In this study, a variety of types of information were used as features for the context and profile vectors. Mainly, the information could be divided into two groups.

  1. Information directly derived from text of MEDLINE abstracts: (a) Original words that occur in MEDLINE abstracts; (b) UMLS CUIs extracted from MEDLINE abstracts via MetaMap and (c) Relations between a gene symbol and other biomedical entities extracted by BioMedLEE.
  2. Information from manually curated knowledge sources: (a) MeSH terms of MEDLINE abstracts; and (b) GO annotation of genes based on ‘gene2go’ file.

When using bag-of-words as features, general English stop words were removed and all words were stemmed using the Porter stemming algorithm (Porter, 1980). Relations from BioMedLEE were represented as a pair in the format ‘relation type-the term related to the gene’. For example, the relation feature extracted from the sentence shown in Figure 1 was represented as ‘essential for-bodyfunc:neurulation’ for the profile of GeneID 11352. GO codes from the ‘gene2go’ file were mapped to UMLS CUIs so that it was possible to compare the GO codes in the profile vectors with the CUIs in the context vectors.

2.4 Feature indexing
All the features used in the profile and context vectors were indexed using the TF-IDF weighting scheme (Salton and Buckley, 1988), which is widely used in the vector space model for information retrieval. Given a document d, the term frequency (TF) of term t is defined as the frequency of t occurring in d. The inverse document frequency (IDF) of term t is defined as the logarithm of the number of all documents in the collection divided by the number of documents containing the term t. Term t in document d is then weighted as TF*IDF.

To obtain features for the profiles, such as bag-of-words, CUIs and MeSH terms, we chose the MEDLINE abstract as the document. TF is then defined as the frequency of a term in a given MEDLINE abstract. The total number of documents used for the IDF calculation was the total number of MEDLINE abstracts in the document set (e.g. 82 922 for mouse organism). To form a gene profile, features from all the MEDLINE abstracts related to the gene were combined and the averaged weight was taken as the final weight of a feature in the profile. For features derived from the ‘gene2go’ file, because every GO code is associated with a gene, all GO codes of a particular gene were collected into one file, which was treated as the document. Therefore, TF is defined as the frequency of a GO code in the document formed by all the GO codes of the given gene. The total number of documents used for the IDF calculation was then the number of all the candidate genes. The same indexing was applied to features of relations from BioMedLEE because relations are also associated with particular genes.
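The weighting and averaging described in this section might look as follows in Python. This is a minimal sketch, not the authors' code: the function names are hypothetical, the natural logarithm is assumed (the text does not give the base), and the profile weight is assumed to be the sum of a feature's weights divided by the number of abstracts related to the gene.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """docs: {doc_id: list of features}. Returns {doc_id: {feature: TF*IDF}}."""
    n_docs = len(docs)
    doc_freq = Counter()
    for features in docs.values():
        doc_freq.update(set(features))  # count each feature once per document
    vectors = {}
    for doc_id, features in docs.items():
        tf = Counter(features)
        vectors[doc_id] = {
            f: count * math.log(n_docs / doc_freq[f]) for f, count in tf.items()
        }
    return vectors


def gene_profile(doc_ids, vectors):
    """Average the feature weights over the abstracts related to one gene."""
    if not doc_ids:
        return {}
    totals = defaultdict(float)
    for doc_id in doc_ids:
        for feature, weight in vectors[doc_id].items():
            totals[feature] += weight
    return {f: w / len(doc_ids) for f, w in totals.items()}
```

Note that under this definition a feature occurring in every document receives an IDF of zero and therefore contributes nothing to any similarity score.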

2.5 Similarity measurement
The similarity score between the context vector and profile vector was calculated as the standard measurement of cosine similarity of two vectors. For two vectors a and b, the cosine similarity between them is defined as the inner product of a and b, normalized by the length of two vectors. See the formula below:


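\[ \cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} \]

where

\[ a \cdot b = \sum_{i} a_i b_i, \qquad \lVert a \rVert = \sqrt{\sum_{i} a_i^2} \]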

The cosine similarity was computed for each context vector and each candidate gene profile vector. The gene with the highest similarity between its profile vector and the context vector was selected as the correct sense.

The overall process of the profile-based disambiguation algorithm using a single type of information (CUIs from MetaMap) as features is summarized in Figure 2 with a detailed example. When combining multiple types of information for disambiguation, similarity scores are generated from each individual type of information and are then added together. The aggregated similarity scores are used to determine the correct sense.
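The scoring and selection steps can be illustrated with the short Python sketch below. It assumes sparse vectors stored as `{feature: weight}` dictionaries, one per feature type; the names `cosine`, `combined_scores` and `pick_sense` are hypothetical. Returning `None` when several candidates share the top score is an assumption that mirrors the no-decision case mentioned in Section 2.6.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {feature: weight}."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as sharing nothing
    return dot / (norm_a * norm_b)


def combined_scores(context, profiles):
    """Sum per-feature-type cosine scores for every candidate gene.

    context: {feature_type: vector}
    profiles: {gene_id: {feature_type: vector}}
    """
    return {
        gene_id: sum(cosine(context.get(ftype, {}), vec)
                     for ftype, vec in by_type.items())
        for gene_id, by_type in profiles.items()
    }


def pick_sense(scores):
    """Return the gene with the highest score, or None on a tie."""
    if not scores:
        return None
    best = max(scores.values())
    winners = [g for g, s in scores.items() if s == best]
    return winners[0] if len(winners) == 1 else None
```

With a single feature type the same functions reduce to the plain cosine ranking shown in Figure 2.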


Fig. 2. Overview of the profile-based method for gene symbol disambiguation when using MetaMap to obtain CUIs as features. The gene symbol ‘Arg’ has two possible senses: GeneID 68703 and GeneID 11352. Candidate gene profile vectors were created as follows: (1) articles associated with a specific gene based on the ‘gene2pubmed’ file were collected; (2) the articles were processed by MetaMap to obtain CUIs, which formed the profile; (3) the CUIs were indexed using the TF*IDF weighting scheme. When the ambiguous gene symbol ‘Arg’ occurred in a MEDLINE abstract (PMID 9883720), a context vector was formed using CUIs extracted from that abstract, with the same TF*IDF weighting scheme. Then the cosine similarity between the context vector and each candidate gene profile vector was measured, and the GeneID (11352) with the highest similarity score between the profile and context was selected as the correct sense.

 
2.6 Experiments and evaluation
We studied the performance of the profile-based disambiguation method using various individual knowledge sources as well as combined knowledge sources. To evaluate the improvement offered by the profile-based method, we added a baseline method that takes the majority sense of an ambiguous gene symbol as the correct sense. The majority sense is defined as the gene sense that is associated with the most PubMed articles based on the ‘gene2pubmed’ file. Six different combinations of features were used in the study and performance was measured for each of them. Table 1 lists the details of each run, including the baseline method.
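The majority-sense baseline is simple enough to sketch directly. The function names and the `(gene_id, pmid)` row format below are hypothetical; when two candidates have the same article count, this sketch keeps the first one listed, which is an assumption since the text does not specify tie handling.

```python
from collections import Counter


def count_articles(gene2pubmed_rows):
    """Count distinct PMIDs per GeneID from (gene_id, pmid) rows."""
    return Counter(gene_id for gene_id, _pmid in set(gene2pubmed_rows))


def majority_sense(candidates, counts):
    """Pick the candidate GeneID linked to the most articles."""
    return max(candidates, key=lambda gene_id: counts.get(gene_id, 0))


rows = [(11352, 1), (11352, 2), (68703, 3)]
print(majority_sense([68703, 11352], count_articles(rows)))
```

Because the baseline ignores the context entirely, it always returns the same sense for a given symbol.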


Table 1. Details of experimental testing runs

 
Since our goal was to measure the performance of our profile-based method when given non-empty information, we focused on precision. Precision was defined as the ratio of the number of correctly disambiguated samples to the number of testing samples for which the profile-based method yielded a decision. We ensured that each candidate profile of a testing sample had at least one related article other than the testing document itself. Occasionally, the profile-based method could not make a decision (when the candidate profile vectors had identical similarity scores with the context vector, e.g. zero scores). Therefore, we also report recall, defined as the number of testing samples that could be disambiguated using the profile-based method divided by the total number of testing samples.
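The two measures as defined here can be computed as in the sketch below. The function name and the dictionary inputs are illustrative assumptions; a prediction of `None` stands for a sample on which the method made no decision.

```python
def precision_recall(predictions, gold):
    """Compute precision and recall as defined in this section.

    predictions: {sample: GeneID, or None when no decision was made}
    gold: {sample: correct GeneID} for every testing sample
    """
    decided = {s: g for s, g in predictions.items()
               if g is not None and s in gold}
    correct = sum(1 for s, g in decided.items() if gold[s] == g)
    # Precision: correct decisions over all decisions made.
    precision = correct / len(decided) if decided else 0.0
    # Recall (as used here): decisions made over all testing samples.
    recall = len(decided) / len(gold) if gold else 0.0
    return precision, recall
```

Note that recall in this article measures how often a decision was possible, not how often the correct sense was retrieved.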

Considering that the process of generating the testing set might be biased, we used bootstrapping to evaluate performance. In each round, a fixed number of testing samples (100 for mouse and fly organisms, 50 for yeast) were randomly selected from the testing set of each organism and disambiguated using the profile-based method. By comparing the results generated by the profile-based method with the gold standard, we obtained the precision of the profile-based method. We repeated this step 200 times, which yielded 200 records of precision for each testing run. The mean and standard error of precisions over the 200 repeats are reported as the final result of the profile-based method.
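A possible reading of this resampling procedure is sketched below. Several details are assumptions rather than facts from the text: samples are drawn with replacement, the reported standard error is taken to be the standard deviation of the per-round precisions, and every drawn sample is assumed to have received a decision. The function name and the fixed seed are illustrative.

```python
import random
import statistics


def bootstrap_precision(outcomes, sample_size, rounds=200, seed=0):
    """Estimate mean precision and its spread by repeated resampling.

    outcomes: list of booleans, True when a testing sample was
        disambiguated correctly.
    sample_size: samples drawn per round (100 or 50 in the text).
    rounds: number of repetitions (200 in the text).
    """
    rng = random.Random(seed)  # fixed seed only for reproducibility
    precisions = []
    for _ in range(rounds):
        # Assumption: draw with replacement, as in a standard bootstrap.
        sample = rng.choices(outcomes, k=sample_size)
        precisions.append(sum(sample) / sample_size)
    return statistics.mean(precisions), statistics.stdev(precisions)
```

For example, `bootstrap_precision(outcomes, sample_size=100)` would correspond to the mouse and fly settings, and `sample_size=50` to yeast.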


    3 RESULTS
Using the automatic sense-tagging method described in Section 2.2, we generated a pool of 7844 testing samples for mouse, 1320 for fly and 269 for yeast. Table 2 shows the results of the profile-based methods for gene symbol disambiguation when different knowledge sources are used. The first number in each cell is the mean of the precisions of 200 randomly selected testing sets using the knowledge source(s) given in column one. The standard error of the precision is given in parentheses following the mean.


Table 2. Results of profile-based methods for gene symbol disambiguation when different knowledge sources are used

 
Compared to the baseline, the profile-based method significantly improves precision no matter which knowledge source is used. For the mouse and fly organisms, the highest median precisions of 0.939 (mouse) and 0.778 (fly) were reached when the combined knowledge source ‘CUI + ALL’ was used. For the yeast organism, the highest median precision was 0.895, reached when the combined source ‘CUI + Relation’ was used.

To assess whether there are significant differences in terms of median precision rates among the different knowledge sources, we used Friedman's test (Friedman, 1937). This is a non-parametric test and its use is dictated by the fact that the different methods of extracting knowledge were applied on the same data sets.
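For reference, the usual textbook form of Friedman's statistic is given below. The notation is an assumption introduced here and does not come from the article: n is the number of blocks (presumably the repeated testing samples), k is the number of knowledge sources being compared, and R_j is the sum over all blocks of the within-block rank of source j.

```latex
\chi^2_F = \frac{12}{n\,k\,(k+1)} \sum_{j=1}^{k} R_j^{2} \;-\; 3\,n\,(k+1)
```

Under the null hypothesis of no difference among the sources, this statistic is approximately chi-square distributed with k - 1 degrees of freedom (the form shown omits the correction for ties).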

Our results show that adding knowledge sources generally increases precision. The Friedman test rejected the hypothesis of equal median precision rates among the different knowledge sources in each of the organisms under study (P-value < 0.0001) at both α = 0.05 and α = 0.1 levels. However, there were differences in terms of observed precision rates among the three organisms. Figure 3 presents boxplots of the precision rates of each knowledge source by organism and clearly exemplifies the above results. We note here that a boxplot is a visualization device depicting the distribution of precision rates. The line in the middle indicates the median precision performance.


Fig. 3. Boxplots of the precisions for different knowledge sources and different organisms.

 
In particular, we were interested in comparing the median precision of the ‘word’ approach with the median precision of the ‘CUI-based’ methods. The five comparisons of interest are the pairs (‘Word’, ‘CUI’), (‘Word’, ‘CUI + MeSH’), (‘Word’, ‘CUI + Relation’), (‘Word’, ‘CUI + GO’) and (‘Word’, ‘CUI + ALL’). The family-wise levels of significance used, α = 0.05 and α = 0.1, were both adjusted for multiple comparisons. To compare ‘Word’ and ‘CUI-based’ methods we used Dunn's test (Dunn, 1964), which takes into account the number of multiple comparisons.

In the mouse organism there are significant differences (P-value < 0.001) between (‘Word’, ‘CUI + MeSH’), (‘Word’, ‘CUI + Relation’) and (‘Word’, ‘CUI + ALL’) at both family-wise levels α = 0.05 and α = 0.1, with the ‘CUI-based’ methods exhibiting higher median precision rates. For the fly organism, we observe significant differences between the ‘word’ approach and all ‘CUI-based’ methods at α = 0.05 and α = 0.1. However, the best performance is exhibited by the ‘CUI + ALL’ method at α = 0.1. For the yeast organism, at family-wise α = 0.05 and α = 0.1, there are significant differences between the median precisions of (‘Word’, ‘CUI’), (‘Word’, ‘CUI + MeSH’) and (‘Word’, ‘CUI + Relation’) (P-value < 0.001), with the best performance being exhibited by ‘CUI + Relation’. Those points are also substantiated by the graphical analysis in Figure 3.

Therefore, in all these organisms there is always a ‘CUI’ plus other knowledge source(s) method that outperforms the ‘Word’ approach with ‘CUI + ALL’ exhibiting the best performance in terms of median precision in the mouse and fly organisms and ‘CUI + Relation’ being the best in the yeast organism.

In terms of recall, all profile-based methods for mouse and fly have recalls of 1.00 except for MeSH (0.994/Mouse, 0.997/Fly). For yeast, the range of recalls is from 0.984 (‘MeSH’) to 0.993 (‘CUI + ALL’).


    4 DISCUSSION
In this study, we focus on resolving intra-species gene ambiguity, which is one part of the gene ambiguity problem. To demonstrate that this is important to address, we determined the frequency and impact of ambiguous genes based on the file ‘gene_info’ in the Entrez Gene database, which records the names, symbols and synonyms of each gene. After extracting all the names, symbols and synonyms of all the mouse genes, we obtained 129 615 different gene names/symbols. Among them, 2355 gene names/symbols corresponded to more than one mouse GeneID, indicating an intra-species ambiguity rate of 1.82%. For fly and yeast, the rates of ambiguous gene names/symbols were 3.85% and 1.33%, respectively. To find out how many MEDLINE articles were affected by intra-species gene ambiguity, we counted the number of ambiguous gene identifiers generated by BioMedLEE for the three document sets mentioned in Section 2.2. Among 82 922 MEDLINE abstracts related to the mouse, 19 420 (23.4%) abstracts contained at least one gene name/symbol that mapped to more than one mouse gene. For the fly and yeast organisms, the percentages were 46.2% (4788/10 371) and 4.8% (476/9936), respectively. Therefore, the intra-species ambiguity problem is substantial for the mouse and fly, and less so for the yeast.
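The ambiguity rate reported above is the share of distinct names that map to more than one GeneID. A minimal sketch of that count follows; the function name and the `(gene_id, name)` input shape are hypothetical, and exact, case-sensitive string matching is an assumption, since the text does not describe how names were normalized.

```python
from collections import defaultdict


def ambiguity_rate(gene_names):
    """gene_names: iterable of (gene_id, name) pairs for one organism.

    Returns (ambiguous name count, distinct name count, rate).
    """
    ids_by_name = defaultdict(set)
    for gene_id, name in gene_names:
        ids_by_name[name].add(gene_id)
    # A name is ambiguous when it is attached to more than one GeneID.
    ambiguous = sum(1 for ids in ids_by_name.values() if len(ids) > 1)
    total = len(ids_by_name)
    return ambiguous, total, (ambiguous / total if total else 0.0)
```

The mouse figure quoted in the text follows the same arithmetic: 2355 / 129 615 is about 0.0182, or 1.82%.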

As of January 2006, 78.2% (37 565 of a total of 48 009) of mouse genes in the gene2pubmed file of Entrez Gene were associated with at least one MEDLINE article, signifying that a profile could be generated for at least 78.2% of the mouse genes. On average, 10.8 MEDLINE articles were associated with a specific mouse gene. The coverage for fly and yeast genes was 85.3% (17 708/20 763) and 95.7% (5917/6180), with an average of 5.7 and 6.8 associated MEDLINE articles per gene, respectively.

It is not straightforward to compare our results with other related work. Podowski's work evaluated performance on 66 selected human genes, but did not report the overall performance on gene symbols in randomly selected documents. Schijvenaars's method achieved a similar level of precision (92.7%) for human genes as ours did for mouse genes (93.9%), but the organisms and the selection of testing sets are different, making it difficult to compare results. We did not use Schijvenaars's testing set because it was created from long-form and short-form pairs, which cover just 30% of gene symbols in MEDLINE abstracts (Schuemie et al., 2004) and could be biased. Another difference is the measurement of performance: we used a bootstrapping method to address sampling bias, which others did not.

4.1 Features of profiles
As the results show, most of the performance gains come from information obtained directly from related articles, such as words and CUIs. Other knowledge, such as GO codes and relations, is specific to particular genes, but its coverage is low. By incorporating GO and relations as features along with CUIs, better performance was achieved.

Although the ‘CUI + ALL’ method showed the best performance in the mouse and fly, the ‘CUI’-alone method did not perform as well as we expected. We expected ‘CUI’ to perform well because CUIs map all the words/phrases with the same meaning to the same concept identifier, and CUIs are biomedically relevant whereas some words may not be. By manual analysis of the CUIs in the profiles, we noticed that some important biological entities, such as gene names, were missing from the MetaMap output. MetaMap maps terms to the UMLS Metathesaurus, which is not a complete resource for gene names, so it is not surprising that MetaMap missed many gene names. Because those gene names are likely to be important for disambiguation, we believe the performance of the system was affected. In future work, we plan to integrate a gene name tagger to retrieve gene names that are not recognized by MetaMap.

CUIs are more beneficial than words for several reasons. They facilitate integration of information from annotated knowledge sources such as ‘gene2go’ (annotated GO codes can easily be mapped to CUIs in the UMLS), and annotated sources such as GO are likely to be expanded in the future. Also, for a gene symbol disambiguation system that targets integration with an NLP system, concept identifiers such as CUIs are likely to be available when performing the disambiguation task, because many NLP systems in the biomedical domain map free text to concept identifiers (e.g. CUIs), as MetaMap does.

CUIs also impart useful and understandable knowledge about concepts associated with a particular gene. For example, a gene named ‘RBP1’ (GeneID: 852 513) has a profile in which a few CUIs have the highest weights: C0071728 (‘porin’), C0085177 (‘RNA-Binding Proteins’), C0066598 (‘mitochondrial messenger RNA’) and C0449249 (‘Growth rate’). From this profile, an expert could infer that the gene may be an ‘RNA-Binding Protein’ that is related to the membrane protein ‘porin’, has some effect on ‘mitochondrial messenger RNA’ and may affect ‘growth rate’.

Using CUIs as features for profiles has a further benefit over the word method: more information can be derived from the UMLS semantic network. When there is a fine-grained CUI in the profile, hierarchical information from the semantic network can be exploited to add a broader concept to the profile as well. For example, a profile may contain the concept C0599635 (‘Water Channel Proteins’), while the context may contain the broader concept C0071728 (‘porin’). If we calculate the similarity directly, no connection is found between those two terms. However, if we use the ‘is-a’ relation in the semantic network to obtain C0071728 as the parent of C0599635, we could add it to the profile so that a score can be produced between those two close concepts, which is likely to improve performance. We can also analyze the semantic classes of the CUIs in the profiles and determine which semantic classes are most useful for characterizing a particular gene, so that feature selection can be used when forming the profiles. Furthermore, profiles with semantic information could be more useful for other applications, such as question answering, summarization and knowledge discovery tasks.
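The effect of adding a parent concept can be sketched as follows. This is an illustration under stated assumptions rather than the implemented method: the similarity is taken to be the cosine between sparse weight vectors, the `parents` map is a hand-written stand-in for an ‘is-a’ lookup, and the `damping` factor applied to an inherited weight is hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def expand_with_parents(profile, parents, damping=0.5):
    """Return a copy of the profile with 'is-a' parents added.

    A parent receives the child's weight scaled by `damping` (an assumed
    choice); an existing larger weight for the parent is kept.
    """
    expanded = dict(profile)
    for cui, weight in profile.items():
        for parent in parents.get(cui, ()):
            expanded[parent] = max(expanded.get(parent, 0.0), damping * weight)
    return expanded

# Toy vectors built from the example in the text; the weights are invented.
profile = {"C0599635": 1.0}             # 'Water Channel Proteins'
context = {"C0071728": 1.0}             # 'porin'
parents = {"C0599635": ["C0071728"]}    # hypothetical 'is-a' lookup

before = cosine(profile, context)
after = cosine(expand_with_parents(profile, parents), context)
print(before, round(after, 4))
```

Without expansion the two vectors share no dimension and the score is zero; with the parent added, the score becomes positive. How much weight a parent should inherit is an open design choice that would need to be tuned.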

4.2 Error analysis
We examined some false-positive cases to analyze the causes of errors. One major cause of error was that the candidate genes had very close senses, which made them difficult to disambiguate. For example, the ambiguous gene symbol ‘TTF-1’ has two possible senses: ‘thyroid transcription factor 1’ (GeneID: 21869) and ‘transcription termination factor 1’ (GeneID: 22130). Both senses are related to the cellular process ‘transcription’ and share a considerable amount of common information, and are therefore difficult to disambiguate. Neither profile-based methods nor machine-learning methods are likely to be helpful in these cases.

We also noticed that some ambiguous gene symbols refer to the names of gene families, and the corresponding candidate genes are the members of the gene family. Those gene symbols are difficult to disambiguate because the candidate genes are closely related members within the gene family. For example, the mouse gene symbol ‘TNFR1’ could refer to either of two members of the ‘tumor necrosis factor receptor superfamily’: ‘tumor necrosis factor receptor superfamily, member 1a’ (GeneID: 21937) and ‘tumor necrosis factor receptor superfamily, member 1b’ (GeneID: 21938). In the testing sample of ‘TNFR1’, the two senses had very similar scores: ‘0.1524’ versus ‘0.1521’. Further investigation showed that the ‘gene_info’ file of Entrez Gene usually records a gene family name as one of the synonyms for all the members/subunits of the gene family. Therefore, this type of ambiguity is not a true ambiguity, but an artifact of the knowledge source itself. If this type of ambiguity could be removed from the knowledge source and from the testing set, the precision of the method described in this article would likely be higher. Interestingly, this phenomenon occurred more often in the fly organism and may have been a cause of lower precision for the fly.

In some cases, the similarity score between the context and profile vectors was zero, which indicates that there were no matches between context and profile. In these cases, it is likely that the context is related to some new aspect of a gene or is associated with a completely new sense of the ambiguous gene symbol. Because new biological entities are discovered very quickly, the existing literature may contain no mention of that sense or of that symbol. This is a limitation of our profile-based method, but also of other methods that rely on previous knowledge, including machine-learning methods. A partial solution is to update the profiles regularly.
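The two failure modes discussed in this section, near-tied scores and all-zero scores, could be surfaced at decision time. The sketch below is one assumed way of doing so and is not part of the evaluated system: the function `pick_sense`, its `min_margin` parameter and the sense labels are hypothetical, and only the two scores are taken from the ‘TNFR1’ example.

```python
def pick_sense(scores, min_margin=0.0):
    """Choose the candidate with the highest similarity score.

    scores: dict mapping a candidate sense to its similarity score.
    Returns (best_sense, margin over the runner-up), or None when there is
    no overlap at all (top score is zero) or the margin is below min_margin.
    """
    if not scores:
        return None
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_sense, best_score = ranked[0]
    if best_score <= 0.0:
        return None  # context and profiles share no features
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    margin = best_score - runner_up
    if margin < min_margin:
        return None  # too close to call under the chosen threshold
    return best_sense, margin

# Scores from the 'TNFR1' example; which sense received which score is
# not stated in the text, so the labels are placeholders.
tnfr1_scores = {"sense_a": 0.1524, "sense_b": 0.1521}

print(pick_sense(tnfr1_scores))                    # decided, tiny margin
print(pick_sense(tnfr1_scores, min_margin=0.01))   # abstains
print(pick_sense({"sense_a": 0.0, "sense_b": 0.0}))  # abstains
```

Abstaining trades recall for precision, so any threshold would have to be chosen with that trade-off in mind.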

4.3 Difference among organisms
Our results show that the performance of the profile-based method varies among organisms. Based on our observations, several issues could explain this. First, the intrinsic characteristics of ambiguity in text documents differ among organisms, e.g. the ambiguity in the fly data set is harder to resolve, as stated in Section 4.2. Second, the quality of profiles may vary among organisms. There are more publications related to mouse genes, which means more MEDLINE abstracts were used to generate the mouse gene profiles, resulting in higher performance. The third issue could be related to the different sizes of the available testing samples. The yeast testing set (269) is smaller than the testing sets of mouse (7844) and fly (1320), but its effect on precision rates may be minimal because performance was measured using a bootstrap method with samples of size 50, which statistical theory regards as large. The results for the yeast data seem to depend more on the knowledge source than on the sample size: when MeSH was dropped from ‘CUI + ALL’, the modified ‘CUI + ALL’ method had the highest performance (0.895), which is consistent with the results from mouse and fly. The problem with MeSH for the yeast data could be due to different levels of granularity of MeSH in the context and profile vectors, or to coverage that is not as good as for the other two organisms.
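The idea of measuring precision on bootstrap samples of size 50 can be sketched as below. The details are assumptions made for illustration: samples are drawn with replacement, 1000 resamples are taken, the median is reported, and the outcome list is invented rather than taken from our testing sets.

```python
import random
import statistics

def bootstrap_precisions(outcomes, sample_size=50, n_resamples=1000, seed=0):
    """Precision of each bootstrap sample drawn from per-instance outcomes.

    outcomes: list of 1 (correct decision) or 0 (incorrect decision).
    Sampling is with replacement; the seed makes the run repeatable.
    """
    rng = random.Random(seed)
    precisions = []
    for _ in range(n_resamples):
        sample = rng.choices(outcomes, k=sample_size)
        precisions.append(sum(sample) / sample_size)
    return precisions

# Invented outcomes for a testing set of 269 instances (not real results).
outcomes = [1] * 215 + [0] * 54
precisions = bootstrap_precisions(outcomes)
median_precision = statistics.median(precisions)
print(len(precisions), median_precision)
```

Because every sample has the same size, the spread of the resulting precision values reflects the underlying error rate rather than the size of the testing set, which is why a small testing set such as the yeast one can still be compared with the larger ones.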


    5 CONCLUSION AND FUTURE WORK
 
In this article, we have described a method to generate gene profiles utilizing different knowledge sources in the biomedical domain, and then applied the profiles to disambiguate gene symbols using an information retrieval method. Evaluation on the automatically generated testing sets showed precision of 93.9% for the mouse, 77.8% for the fly and 89.5% for the yeast, which indicated significant improvement compared to the baseline.

In future work, we will explore more knowledge sources in the biomedical domain, such as the SwissProt protein database, more organisms, such as human, and the use of hierarchical information. A more comprehensive gene tagger will be integrated into the system alongside MetaMap to identify gene names, and more sophisticated methods will be developed to combine different knowledge sources when forming the profiles. We will also perform a semantic analysis of the profiles to identify the semantic classes that best characterize the genes. By limiting the features to those belonging to selected semantic classes, we anticipate that disambiguation performance will improve. Finally, we are exploring the use of the semantic profiles of genes for other text-mining tasks, such as summarization and question answering.


    ACKNOWLEDGEMENTS
 
This work was supported in part by Grants R01 LM7659, R01 LM8635 from the National Library of Medicine, and Grants NSF-IIS-0430743, NSF-DMS-0504957 from the National Science Foundation. We would like to thank Lyudmila Shagina for providing technical support. Funding to pay the Open Access publication charges was provided by the National Library of Medicine (grant LM8635).

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Alfonso Valencia

† The authors wish it to be known that, in their opinion, the last two authors should be regarded as joint First Authors.

Received on November 27, 2006; revised on January 22, 2007; accepted on February 11, 2007

    REFERENCES
 

Aronson AR. (2001) Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proceedings AMIA Symp., 17–21.

Ashburner M, et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet., 25, 25–29.

Bruce R, Wiebe J. (1994) Word sense disambiguation using decomposable models. Proceedings of the ACL 1994, 139–146.

Chen L, et al. (2005) Gene name ambiguity of eukaryotic nomenclatures. Bioinformatics, 21, 248–256.

Dunn OJ. (1964) Multiple comparisons using rank sums. Technometrics, 6, 241–252.

Erhardt RA, et al. (2006) Status of text-mining techniques applied to biomedical text. Drug Discov. Today, 11, 315–325.

Friedman M. (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J. Am. Stat. Assoc., 32, 675–701.

Fundel K, Zimmer R. (2006) Gene and protein nomenclature in public databases. BMC Bioinformatics, 7, 372.

Harley A, Glennon D. (1997) Sense tagging in action: combining different tests with additive weightings. Proceedings SIGLEX Workshop "Tagging Text With Lexical Semantics", 74–78.

Hirschman L, et al. (2005) Overview of BioCreAtIvE task 1B: normalized gene lists. BMC Bioinformatics, 6 (Suppl. 1), S11.

Humphrey SM, et al. (2006) Word sense disambiguation by selecting the best semantic type based on Journal Descriptor Indexing: preliminary experiment. J. Am. Soc. Inform. Sci. Tech., 57, 96–113.

Jensen LJ, et al. (2006) Literature mining for the biologist: from information retrieval to biological discovery. Nat. Rev. Genet., 7, 119–129.

Krallinger M, Valencia A. (2005) Text-mining and information-retrieval services for molecular biology. Genome Biol., 6, 224.

Lee YK, Ng HT. (2002) An empirical evaluation of knowledge sources and learning algorithms for word sense disambiguation. Proceedings EMNLP 2002, 41–48.

Lesk M. (1986) Automatic sense disambiguation using machine-readable dictionaries: how to tell a pine cone from an ice cream cone. 1986 SIGDOC Conference, 24–26.

Liu H, et al. (2002) Automatic resolution of ambiguous terms based on machine learning and conceptual relations in the UMLS. J. Am. Med. Inform. Assoc., 9, 621–636.

Lussier Y, et al. (2006) PhenoGO: assigning phenotypic context to gene ontology annotations with natural language processing. Pac. Symp. Biocomput., 11, 64–75.

Maglott D, et al. (2005) Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res., 33, D54–D58.

Mohammad S, Pedersen T. (2004) Combining lexical and syntactic features for supervised word sense disambiguation. Proceedings of the CoNLL 2004, 25–32.

NLM. (2000) UMLS Knowledge Sources. 11th edn.

Podowski RM, et al. (2004) AZuRE, a scalable system for automated term disambiguation of gene and protein names. Proceedings IEEE Comput. Syst. Bioinform. Conf., 415–424.

Porter MF. (1980) An algorithm for suffix stripping. Program, 14, 130–137.

Pustejovsky J, et al. (2001) Automatic extraction of acronym-meaning pairs from MEDLINE databases. Medinfo., 10, 371–375.

Salton G, Buckley C. (1988) Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24, 513–523.

Schijvenaars BJA, et al. (2005) Thesaurus-based disambiguation of gene symbols. BMC Bioinformatics, 6, 149.

Schuemie MJ, et al. (2004) Distribution of information in biomedical abstracts and full-text publications. Bioinformatics, 20, 2597–2604.

Sehgal AK, et al. (2004) Gene terms and English words: an ambiguous mix. In: SIGIR '04 Workshop on Search and Discovery in BioInformatics.

Tamames J, Valencia A. (2006) The success (or not) of HUGO nomenclature. Genome Biol., 7, 402.

Widdows D, et al. (2003) Unsupervised monolingual and bilingual word-sense disambiguation of medical documents using UMLS. Natural Language Processing in Biomedicine ACL 2003 Workshop, 9–16.

Wilks Y, et al. (1990) Providing Machine Tractable Dictionary Tools. Cambridge, MA: MIT Press.

Xu H, et al. (2006) Machine learning and word sense disambiguation in the biomedical domain: design and evaluation issues. BMC Bioinformatics, 7, 334.

Yngve VH. (1955) Syntax and the problem of multiple meaning. In: Machine Translation of Languages. New York: John Wiley & Sons, 208–226.

