

Bioinformatics Advance Access originally published online on July 21, 2005
Bioinformatics 2005 21(18):3658-3664; doi:10.1093/bioinformatics/bti586

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

Resolving abbreviations to their senses in Medline

S. Gaudan *, H. Kirsch and D. Rebholz-Schuhmann

European Bioinformatics Institute Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 1 INTRODUCTION
 2 DICTIONARY OF ABBREVIATIONS
 3 DISAMBIGUATION OF...
 4 ABBREVIATION RESOLUTION
 5 RESULTS
 6 DISCUSSION
 7 CONCLUSION
 REFERENCES
 

Motivation: Biological literature contains many abbreviations with one particular sense in each document. However, most abbreviations do not have a unique sense across the literature. Furthermore, many documents do not contain the long forms of the abbreviations. Resolving an abbreviation in a document consists of retrieving its sense in use. Abbreviation resolution improves accuracy of document retrieval engines and of information extraction systems.

Results: We combine an automatic analysis of Medline abstracts and linguistic methods to build a dictionary of abbreviation/sense pairs. The dictionary is used for the resolution of abbreviations occurring with their long forms. Ambiguous global abbreviations are resolved using support vector machines that have been trained on the context of each instance of the abbreviation/sense pairs, previously extracted for the dictionary set-up. The system disambiguates abbreviations with a precision of 98.9% for a recall of 98.2% (98.5% accuracy). This performance exceeds that of previously reported work.

Availability: The abbreviation resolution module is available at http://www.ebi.ac.uk/Rebholz/software.html

Contact: gaudan{at}ebi.ac.uk


    1 INTRODUCTION
 
Abbreviations are a common feature in scientific literature. They are often used without naming the long form (Fred and Cheng, 2003), resulting in confusion and even in misinterpretations, as soon as the human reader has the wrong long form for the abbreviation in mind (Sentinel Event Alert, 2001).

We distinguish global abbreviations from local abbreviations. Global abbreviations appear in documents without the long form explicitly stated, whereas local abbreviations come together with their long form in the document. Global abbreviations are often ambiguous, meaning that they have different senses in different documents.

In particular, 80% of the abbreviations defined in the unified medical language system (UMLS) have ambiguous occurrences in Medline (Liu et al., 2002a). With respect to human gene symbols from LocusLink, which are morphologically very similar to abbreviations, 40% of the symbols are used in Medline, but many of the occurrences are not related to genes.

Yu and Friedman (2002) also distinguish dynamic and common abbreviations. Common abbreviations become accepted as synonyms (‘AIDS’ and ‘acquired immunodeficiency syndrome’) and represent important terms in their domain, whereas dynamic abbreviations are defined for convenience in only a particular paper. As a result, global abbreviations are mainly common abbreviations since the reader is expected to know or guess the senses of the global abbreviations.

In the case of local abbreviations the long form can be retrieved from the document using the extraction method described in Schwartz and Hearst (2003). This improves the precision of gene and protein identification in biomedical text, which suffers from protein/gene symbols that are identical to ambiguous abbreviations. But this method fails in the case of global ambiguous abbreviations (Dingare et al., 2004).

Furthermore, many errors in named entity identification are explained by variations observed in the long forms of abbreviations. For example, AgNor abbreviates two long forms sharing the same sense: ‘argyrophilic nucleolar organizer region’ and ‘silver-stained nucleolar organizer region’; similarly, ER abbreviates ‘estrogen receptor’ and ‘oestrogen-receptor’. This property of long forms is common and has been exploited by Tsuruoka and Tsujii (2003) to develop a probabilistic string similarity method.

Consequently, resolving local and global abbreviations to their long forms is a valuable step for improving the quality of information extraction and information retrieval systems. If the abbreviation is resolved to a normalized long form, i.e. to a common long form and not to a minor morphological variant, then this leads to even better results; this is the approach we pursued.

The most problematic step in abbreviation resolution is retrieving the sense of a global abbreviation that is ambiguous. Stevenson (2002) gives an overview of the state-of-the-art of solving this problem, also known as ‘Word Sense Disambiguation’.

The Yarowsky observation (Yarowsky, 1995), which states that terms tend to have ‘one sense per discourse’, provides the foundation for retrieving the sense of a polysemic word by using the context of the document.

Various methods have been implemented for the resolution of ambiguous abbreviations, all following a similar schema (Fig. 1): (1) A lexicon is used for collecting the abbreviations and their senses. (2) The method computes the context of use for each sense. (3) A machine-learning algorithm is trained on the context of each sense. (4) The disambiguation of an abbreviation contained in a document consists of computing its context in the document and (5) retrieving the most probable abbreviation sense, given the context, using the machine-learning algorithm.



 
Fig. 1 Disambiguation process.

 
This disambiguation schema has been exploited by Pakhomov (2002), Yu et al. (2003) and Liu et al. (2002b) who use UMLS to collect the abbreviations and their long forms for the lexicon. However, Pakhomov (2002) observed that not more than one-third of the long forms from UMLS appear in the literature. Furthermore, only frequent abbreviations used in the literature are in UMLS. Since the overlap between the UMLS abbreviations and the ones used in Medline is not sufficient, another dictionary has to be considered. Adar (2004) uses a more relevant approach for the lexicon by using a dictionary extracted from Medline abstracts.

Using the same disambiguation schema, Liu et al. (2002b) rely on the UMLS annotations from MetaMap (Aronson, 2001) of the documents as the context of the senses. A naive Bayes algorithm is trained on the annotations and then used for the disambiguation, achieving, after removing rare senses, a precision of 92.9% but a recall of only 47.4%.

Concerning the extraction of the context, Adar (2004) relies on the medical subject heading terms (MeSH terms) of the abstracts. The cosine similarity metric is applied for classifying the abbreviation. The method classified correctly 73% of the test set when disregarding rare senses (<50 occurrences).

Similarly, Pakhomov (2002) compared a local context¹ with a global context² for training a machine-learning algorithm based on maximum entropy. The system achieves an accuracy of 89% on a limited corpus (10 000 rheumatology notes).

Liu et al. (2002b) also experimented with the local context and the bag-of-words technique for training a support vector machine (SVM), reaching an accuracy of 84%.

We present a novel system for the resolution of local and global abbreviations. The resolution of local abbreviations is based on a dictionary of abbreviations, whereas the resolution of global polysemic abbreviations uses a disambiguation process based on the model described in Figure 1.

The first component of the system is a dictionary of abbreviations automatically generated from the literature, inspired by Adar (2004).

Local abbreviations are resolved by looking them up in the dictionary for the most frequent form of the long form found in the text.

Concerning the resolution of polysemic global abbreviations, we describe first the statistical method used for extracting the context of each sense, and then we explore the disambiguation method based on SVMs. Finally, we present the global strategy for the resolution of any abbreviation in arbitrary documents.


    2 DICTIONARY OF ABBREVIATIONS
 
The literature is rich in methods for the automatic extraction of abbreviation/long-form pairs from text. Wren et al. (2005) summed up four methods applied to the creation of online databases of abbreviation/long-form pairs, whereas Yoshida et al. (2000) focused on the construction of a protein name abbreviation dictionary.

Our abbreviation extraction is based on the method described in Adar (2004) which is robust, fast and achieves a precision of ~95% for a recall of 75%.

An abbreviation is explained in a document by the mention of its long form. The general pattern is that the long form is followed by the abbreviation in parentheses; the inverse order of the pair is found at a much lower frequency:

The changes in adrenocorticotropin hormone (ACTH), cortisol and dehydroepiandrosterone (DHEA) in maternal and fetal plasma were estimated in two groups of women.

After the detection of an abbreviation in parentheses, the correct long form has to be assigned to the abbreviation. A limited number of rules formalizes how to build abbreviations from a long form.

The long form is identified automatically using the longest common subsequence (LCS) in conjunction with a set of scoring rules (Taghva and Gilbreth, 1999) that favors the first letter of each word of the long form. For each abbreviation candidate (a word surrounded with parentheses), the algorithm matches the long form in front of the parentheses to the abbreviation and thus determines the boundaries of the long form.
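The matching step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the scoring rules of Taghva and Gilbreth (1999) are replaced here by a much simpler, assumed criterion (the first word of the candidate must start with the first letter of the abbreviation, and the whole abbreviation must be a subsequence of the candidate). The names `lcs_length`, `extract_pairs` and `max_words` are hypothetical.

```python
import re

def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

# An abbreviation candidate is a single short token surrounded by parentheses.
ABBR_IN_PARENS = re.compile(r"\(([A-Za-z][A-Za-z0-9-]{1,9})\)")

def extract_pairs(sentence, max_words=8):
    """Return (abbreviation, long form) pairs for 'long form (ABBR)' patterns."""
    pairs = []
    for match in ABBR_IN_PARENS.finditer(sentence):
        abbr = match.group(1)
        words = sentence[:match.start()].split()
        # Grow the window in front of the parenthesis until it explains the abbreviation.
        for size in range(1, min(len(words), max_words) + 1):
            window = words[-size:]
            text = " ".join(window).lower()
            # Assumed rule: favor the first letter of the first word of the long form.
            starts_alike = window[0][0].lower() == abbr[0].lower()
            if starts_alike and lcs_length(abbr.lower(), text) == len(abbr):
                pairs.append((abbr, " ".join(window)))
                break
    return pairs
```

Applied to the example sentence above, this sketch pairs ACTH with 'adrenocorticotropin hormone' and DHEA with 'dehydroepiandrosterone'. A production extractor would need the full rule set to handle digits, inner letters and the inverse order of the pair.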

After scanning all Medline abstracts available in August 2004, the result of our extraction is 5 250 259 long-form/abbreviation pairs found in 2 857 954 Medline abstracts. In the following text, we refer to this set of abstracts as D.

2.1 Merging morphologically similar long forms
Among all extracted long-form/abbreviation pairs, a number of abbreviations share morphologically similar long forms with the same sense, e.g. ‘oestrogen receptor’ versus ‘estrogen-receptor’. These long forms are matched to each other using a similarity measure (Adar, 2004). An n-gram similarity algorithm is used with a cut-off parameter to merge similar long forms l1 and l2:


Table 1 illustrates long forms that have a high similarity and are therefore clustered into groups of long forms.


 
Table 1 Similar long forms detected with the n-gram similarity and the contextual similarity

 
The cut-off parameter has been estimated from a hand-curated random sample of 250 long-form doublets. We selected a cut-off parameter of 0.8 so that two long forms are merged only if they have the same sense.
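As an illustration of how such a cut-off can be applied, the sketch below computes a character n-gram similarity between two long forms. The exact measure is not reproduced in this text, so the Dice coefficient over character trigrams, the removal of spaces and hyphens, and the names `char_ngrams`, `ngram_similarity` and `should_merge` are all assumptions; only the cut-off value of 0.8 comes from the section above.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of the text, ignoring case, spaces and punctuation."""
    normalized = "".join(ch for ch in text.lower() if ch.isalnum())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}

def ngram_similarity(l1, l2, n=3):
    """Dice coefficient between the n-gram sets of two long forms (assumed measure)."""
    g1, g2 = char_ngrams(l1, n), char_ngrams(l2, n)
    if not g1 or not g2:
        return 0.0
    return 2 * len(g1 & g2) / (len(g1) + len(g2))

def should_merge(l1, l2, cutoff=0.8):
    """Merge two long forms when their similarity reaches the cut-off."""
    return ngram_similarity(l1, l2) >= cutoff
```

With these assumptions, ‘oestrogen receptor’ and ‘estrogen-receptor’ fall above the cut-off and would be merged, whereas two unrelated long forms share no trigrams and stay apart.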

2.2 Context based merging
In contrast to the previous similarity consideration, some long forms can be morphologically quite different (e.g. ‘beta site APP-cleaving enzyme’ versus ‘beta site amyloid precursor protein-cleaving enzyme’) but still code for the same meaning. To identify them as synonyms requires domain knowledge, which is provided through the context of the long forms. Using the context, we merged morphologically diverse long forms coding for the same meaning.

Adar (2004) relies on the MeSH term annotations of the abstracts for representing the context of the long forms. However, the granularity of the information contained in MeSH annotations is coarser than that obtained by extracting relevant words from the text. Furthermore, the MeSH terms approach cannot be applied to arbitrary text. As a result, we developed a new method based on word occurrences.

We assume here that two long forms coding for the same meaning are illustrated by documents that share, on average, more common words³ than documents illustrating different meanings (Table 1).

The similarity between two sets of long forms (g1 and g2), created by grouping morphologically similar long forms together, is computed by considering the number of common words in the sets Dg1 and Dg2 of documents containing the long forms, normalized by the total number of words in the documents of the two sets:

with

if |Dg1| > 1 and |Dg2| > 1, where W(di) is the set of words in the document di.

The cut-off parameter has been estimated from a hand-curated random sample of 150 long-form set doublets. We selected a cut-off parameter of 0.22 so that two sets of long forms are merged only if they have the same sense. Consequently, we use sets (clusters) of long-form/abbreviation pairs which represent the same meaning according to our morphological and contextual similarity estimates. Each cluster contains a number of similar long forms and the links to the documents containing these long forms. We define the different senses of an abbreviation by the long forms found in the different clusters. These abbreviations and their senses are stored in a dictionary.
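The idea of context-based merging can be sketched as follows. The exact normalization used by the authors is not reproduced in this text, so the sketch assumes a simple variant: the words of each document set are pooled, and the number of shared words is divided by the number of distinct words in both sets. The names `pooled_words` and `context_similarity` are hypothetical, and each document is assumed to be given as a set of its relevant words, W(d).

```python
def pooled_words(documents):
    """Union of the word sets W(d) of all documents in a set."""
    words = set()
    for doc in documents:
        words.update(doc)
    return words

def context_similarity(docs_g1, docs_g2):
    """Word overlap between two document sets (assumed normalization).

    Returns None when either set has fewer than two documents, mirroring
    the condition |Dg1| > 1 and |Dg2| > 1 stated above.
    """
    if len(docs_g1) <= 1 or len(docs_g2) <= 1:
        return None
    w1, w2 = pooled_words(docs_g1), pooled_words(docs_g2)
    union = w1 | w2
    return len(w1 & w2) / len(union) if union else 0.0
```

Two clusters would then be merged when the returned value reaches the cut-off (0.22 in the section above); whether that threshold transfers to this simplified measure is untested.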


    3 DISAMBIGUATION OF ABBREVIATIONS
 
Whenever we find no long form associated with an ambiguous abbreviation, we use the context to identify the correct meaning of the abbreviation. In the following we describe how suitable context words are generated to disambiguate abbreviations, and how the classifier is trained.

3.1 Context extraction
The contextual terms used for the disambiguation are extracted using the C-value algorithm (Frantzi and Ananiadou, 1999), a method combining linguistic (adjective–noun patterns⁴) and statistical aspects of terms. The C-value method scores the adjective–noun patterns according to three aspects: the frequency of the adjective–noun patterns (positive correlation), the length of the adjective–noun patterns (positive correlation) and the frequencies of subparts of the adjective–noun patterns (negative correlation):

where w is the adjective–noun pattern candidate, ||w|| is the length (in words) of w, f(w) is the frequency of w in the corpus and Tw is the set of adjective–noun patterns contained in the candidate w.

Only words contained in terms having a high score are kept for representing a document. After prioritization of the words from the context according to the C-value and applying a cut-off to the list of words, we obtain a tuple Ω = (w1,...,wn) of size n (55 on average) of relevant words for every document.
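For orientation, the C-value measure as it is commonly stated in the term-extraction literature is given below. This is the standard published form of Frantzi and Ananiadou's measure, written with the symbols used in this section; it is not necessarily the exact variant applied here. In particular, the standard form takes T_w to be the set of longer candidate terms in which w occurs nested, which may differ from the definition of T_w given above.

```latex
\[
\text{C-value}(w) =
\begin{cases}
\log_2 \lVert w \rVert \cdot f(w), & \text{if } w \text{ is not nested in a longer candidate},\\[4pt]
\log_2 \lVert w \rVert \cdot \Bigl( f(w) - \dfrac{1}{\lvert T_w \rvert} \displaystyle\sum_{b \in T_w} f(b) \Bigr), & \text{otherwise.}
\end{cases}
\]
```

The three aspects listed above are visible in this form: f(w) rewards frequent candidates, the logarithm of the length rewards longer candidates, and the subtracted average discounts candidates that mostly occur as parts of longer ones.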

3.2 The model
Each abbreviation a belonging to the dictionary has a set of senses, denoted by S(a). Each sense s ∈ S(a) is illustrated by a set of documents D_s ⊆ D. D_s is the set of documents containing the abbreviation/long-form pairs previously extracted for the construction of the dictionary.

For each document d, the context words are extracted and the document is described by a vector v = g(d) with g: D → {0,1}^n. The i-th component of v, v_i, is defined as

As a result, we have a function Φ that associates with each sense s a set of vectors Φ(s):

\[
\Phi(s) = \{\, g(d) \mid d \in D_s \,\}
\]
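The model above can be made concrete with a small sketch. It assumes that the i-th component of the vector is 1 when the i-th word of a fixed vocabulary occurs among the context words of the document and 0 otherwise; this reading follows from the binary codomain of g but is an assumption. The names `vectorize`, `build_phi` and `vocabulary` are hypothetical.

```python
def vectorize(context_words, vocabulary):
    """Binary vector g(d): 1 where a vocabulary word is among the context words."""
    present = set(context_words)
    return tuple(1 if word in present else 0 for word in vocabulary)

def build_phi(documents_by_sense, vocabulary):
    """Map each sense s to the set of vectors of the documents illustrating it."""
    return {
        sense: {vectorize(doc, vocabulary) for doc in docs}
        for sense, docs in documents_by_sense.items()
    }
```

Vectors are stored as tuples so that they can be collected in a set, which matches the set-valued definition of Φ(s).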

3.3 Disambiguation
The task of disambiguating an abbreviation a in a document d is to find the sense s ∈ S(a) that minimizes the distance between the vector v = g(d) (context of d) and the class defined by Φ(s).

This problem can also be described as a classification problem of assigning g(d) to one of the classes represented by the vector sets Φ(s) where s ∈ S(a).

SVMs are suitable classifiers for sparse data in high-dimensional spaces and with many relevant features. It has been shown that SVMs achieve substantial improvements over other similar methods for text categorization (Joachims, 1997). An SVM can separate two classes (positive/negative) by a hyperplane with a maximum margin between the border vectors. Each class is described by vectors that the SVM ‘learns’. Using the one-against-all approach, we can separate k classes from each other by combining k SVMs. In the present case we use a linear kernel on binary vectors, with an error penalty of 10 in the L1 norm.

For each sense s of an abbreviation a, we represent the positive class of s by

\[
C^{+}(s) = \Phi(s)
\]

and the negative class by

\[
C^{-}(s) = \bigcup_{s' \in S(a),\; s' \neq s} \Phi(s')
\]

such that C−(s) is the set of vectors describing all the senses of a except s. Note that C−(s) and C+(s) are not necessarily disjoint.

An SVM is created for each sense s and trained with C+(s) and C−(s). The result is a function h_s : {0,1}^n → ℝ where

For each abbreviation a and for each of its senses, we get a classification function h_s (an SVM). The disambiguation of the abbreviation a in a document d consists of selecting the function h_s such that h_s(g(d)) is maximal.

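If the resulting h_s(g(d)) is positive, then sense(a,d) = s is predicted to be the sense of a in d. If the resulting h_s(g(d)) is non-positive, then no sense is predicted:

\[
\mathrm{sense}(a,d) =
\begin{cases}
\arg\max_{s \in S(a)} h_s(g(d)), & \text{if } \max_{s \in S(a)} h_s(g(d)) > 0,\\[4pt]
\text{undefined}, & \text{otherwise.}
\end{cases}
\]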
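The decision rule can be written in a few lines once the per-sense scores are available. The sketch below takes the scores as given and does not train any SVM; the function name `predict_sense` and the example scores are hypothetical.

```python
def predict_sense(scores):
    """Pick the sense with the highest score, or None if no score is positive.

    scores maps each candidate sense s of an abbreviation to the value
    returned by its classifier for the document at hand.
    """
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Returning None for non-positive scores is what allows the system to abstain, which is why recall can be lower than 100% even when every prediction made is correct.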


    4 ABBREVIATION RESOLUTION
 
One aspect of the abbreviation resolution task is the recognition of the abbreviations in the text. Some common English words are also used as abbreviations, which makes locating them difficult. For example, the conjunction ‘if’ abbreviates ‘immunofluorescence’ and ‘for’ abbreviates ‘ferredoxin oxidoreductase’. If the document contains sections in uppercase, identification becomes harder still (‘THE’ abbreviates ‘tetrahydrocortisone’). More than 350 abbreviations have the form of a common English word. This problem can largely be solved by restricting the recognition of abbreviations to adjective–noun patterns, using a part-of-speech tagger.

When an abbreviation has been located, an efficient search for all its possible long forms is run on the document using a deterministic finite automaton (Aho and Corasick, 1975). If a long form is found, its most frequent form is kept. If no long form can be retrieved from the document, then a look-up of the abbreviation in the dictionary is performed. If only one sense is found, then the abbreviation is not ambiguous and the most frequent long form of the unique sense is kept. Finally, if several senses are retrieved, then the disambiguation process is applied.
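The resolution strategy above is a short cascade, sketched below under several assumptions. The dictionary layout (each abbreviation mapped to a list of senses, each sense being a list of long forms ordered from most to least frequent) is hypothetical, a plain substring scan stands in for the Aho-Corasick automaton, and `resolve` and `disambiguate` are illustrative names.

```python
def resolve(abbreviation, document_text, dictionary, disambiguate):
    """Resolve an abbreviation to the most frequent long form of its sense.

    dictionary: abbreviation -> list of senses; each sense is a list of
    long forms with the most frequent one first (assumed layout).
    disambiguate: callable used only for ambiguous global abbreviations.
    """
    senses = dictionary.get(abbreviation, [])
    text = document_text.lower()

    # Local abbreviation: one of the known long forms occurs in the document.
    for long_forms in senses:
        if any(long_form.lower() in text for long_form in long_forms):
            return long_forms[0]

    if not senses:
        return None           # abbreviation unknown to the dictionary
    if len(senses) == 1:
        return senses[0][0]   # global but unambiguous
    return disambiguate(abbreviation, document_text)  # global and ambiguous
```

The order of the checks matters: a long form present in the document always takes priority over the classifier, so disambiguation is only invoked when the text itself offers no explicit definition.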


    5 RESULTS
 
5.1 Dictionary
After mining all Medline abstracts (1965–2004), the dictionary contains 186 641 different abbreviations linked to 623 441 senses, illustrated by 5 250 259 occurrences of an abbreviation with its long form (Table 2). We distinguished three categories: (1) all abbreviation/long-form pairs, (2) abbreviation/long-form pairs with >20 occurrences and (3) pairs occurring at least 40 times. The third category represents 4% of the total number of abbreviations, but covers >72% of the total number of abbreviation/long-form occurrences. The third category also contains the most morphological variants of the long forms. As a result, it profits the most from normalization of long forms.


 
Table 2 Counts (#) and averages () for abbreviation/sense pairs occurring in at least one abstract (1), in at least 20 abstracts (2) and in at least 40 abstracts (3)

 
We also found that the number of abbreviations strongly increased over the past 10 years, which correlates with the increase of new publications per year. More than half of the abbreviations appeared after 1995 (Fig. 2) and last but not least, abbreviation/long-form pairs appear and disappear, similar to the life cycle of gene names in the literature (Hoffmann and Valencia, 2003) (Fig. 3). Abbreviation/long-form pairs disappear either because the named concept is not of interest any more or because the abbreviation becomes a common abbreviation, so that the long form is not provided any more (Fig. 4).



 
Fig. 2 Number of long form/abbreviation pairs occurring in Medline since 1975.

 


 
Fig. 3 Abbreviations in Medline over the past 20 years. The 200 most frequent abbreviations are shown along the horizontal axis, sorted according to their pattern of occurrence in Medline. The color indicates the relative frequency of the abbreviation. Some of the oldest abbreviations (left part) disappear, which is reminiscent of a life cycle. The intensive usage of abbreviations is a recent phenomenon that is growing rapidly.

 


 
Fig. 4 Frequency of the abbreviation ‘TUNEL’ with its long form (dark gray) and with or without its long form (light gray) over the past 10 years (1995–2004). The abbreviation ‘TUNEL’ is not ambiguous in Medline and became a common abbreviation. In 2004, 84% of the occurrences in Medline abstracts of ‘TUNEL’ are without the long form.

 
The statistics on the dictionary show that many abbreviation/sense pairs appear at a low frequency (rare abbreviation/sense pairs), whereas few pairs have high frequencies, which is reminiscent of a Zipf's law distribution (Fig. 5).



 
Fig. 5 The rank of the abbreviation/sense pairs and their frequencies, using logarithmic scales. Zipf's law says that there is a constant k such that frequency(word) · rank(word) = k. The abbreviation/sense pairs on the left-hand side of the vertical line are pairs occurring at least 40 times.

 
The examination of the dictionary shows that some clusters of long forms should be merged with other ones because their meanings are very similar. However, the long forms of these clusters are morphologically very different from each other and their contexts did not allow the long forms to merge. This phenomenon is only observed on clusters of rare long forms. For a random sample of 350 long forms, 42% of the 169 long forms occurring only once should have been merged with another entry of the dictionary. This proportion drops to 18% for the 108 long forms having a frequency between 2 and 40. Finally, none of the 73 long forms occurring at least 40 times share their meaning with other long forms also occurring at least 40 times.

The dictionary contains many protein and gene symbols that belong to the rare abbreviation sense class (Chen et al., 2005). For example, the gene symbol AFM (‘Afamin’) also means ‘Airflowmeter’, ‘Association Française contre les Myopathies’, ‘acute falciparum malaria’, ‘additive factors method’, ‘aflatoxin M1’, ‘antiferromagnetic’ and ‘atomic force microscopy’. Furthermore, in 99% of its occurrences, AFM is used in the sense ‘atomic force microscopy’ and not ‘Afamin’ (Table 3). Protein and gene name identification therefore requires the resolution of these symbols.


 
Table 3 List of the six first ambiguous abbreviations matching a HUGO symbol and the full name

 
5.2 Disambiguation
The disambiguation is required for abbreviations having several senses and occurring without the long form, in other words, for ambiguous global abbreviations. Global abbreviations are also common since they are expected to be known by the reader. As a result, we can apply the disambiguation process using a high quality dictionary by disregarding the rare abbreviation/sense pairs, without changing the nature of the disambiguation problem.

In the following we only consider senses which appear frequently enough to profit from disambiguation (40 documents and more). As a result, we have 7806 abbreviations with 12 330 senses, representing 72% (3 803 758) of all pair occurrences in Medline. Out of these 7806 abbreviations, 1851 are polysemic, having on average 3.4 senses with a maximum of 32 senses for ‘PC’ (Table 2).

The SVMs were trained and tested using a k-fold cross-validation schema (k = 5), which measures the quality of predictions on unseen data. For each abbreviation a, a document set is built by grouping the documents illustrating the different senses of a [all the D_s where s ∈ S(a)]. Each document set is randomly divided into five subsets equal in size; four are used for the training of the SVM (80%) and one for testing (20%), repeating the operation five times so that each subset has been used for testing. In order to avoid the explicit indication of the sense, the abbreviation long forms are removed from the text before the SVMs learn or classify the test documents. The system achieves a precision⁵ of 98.9% for a recall⁶ of 98.2% (98.5% accuracy⁷).
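The splitting scheme described above can be sketched as follows. This is a generic k-fold split, not the authors' code; the function name `k_fold_splits` and the fixed random seed are assumptions made so that the example is reproducible.

```python
import random

def k_fold_splits(documents, k=5, seed=0):
    """Yield (train, test) pairs so that every document is tested exactly once."""
    shuffled = list(documents)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled documents into k folds of (nearly) equal size.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [doc for j, fold in enumerate(folds) if j != i for doc in fold]
        yield train, test
```

With k = 5 each round trains on 80% of the documents and tests on the remaining 20%. The removal of long forms from the text, mentioned above, would have to happen before the documents are turned into feature vectors.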

This accuracy can be compared with a baseline derived from a different disambiguation scheme that consists of always selecting the most frequent sense of the abbreviation, independent of the context. Such an algorithm achieves 70% accuracy on the same data.
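A most-frequent-sense baseline of this kind is straightforward to compute. The sketch below is illustrative only: the function name `most_frequent_sense_accuracy` and the toy labels in the usage note are hypothetical, and it assumes the majority sense is determined on the training portion of the data.

```python
from collections import Counter

def most_frequent_sense_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most frequent training sense."""
    majority, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(1 for label in test_labels if label == majority)
    return correct / len(test_labels)
```

For example, if a sense accounts for seven of ten training documents and three of four test documents, the baseline scores 0.75 on that test set regardless of context.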

The accuracy of the disambiguation module has been compared with the disambiguation methods described by Liu et al. (2002b), Yu et al. (2003) and Pakhomov (2002), using the abbreviations used for their tests. Our disambiguation method performs better than their methods for >80% of these abbreviations, with an average of 98% accuracy. The remaining 20% are related to abbreviations that have either more or fewer senses than their test samples.


    6 DISCUSSION
 
The dictionary of abbreviations, the context extraction and the disambiguation module are the three main components of the abbreviation resolution process.

The dictionary has been generated from Medline so that its content is most suitable for abbreviation resolution in biomedical text. The high quality of the dictionary is crucial to achieve the resolution of abbreviations with a high precision/recall. This quality has been reached by combining statistical and linguistic methods for grouping morphological variants of long forms. Others have also used generated dictionaries, but did not solve the problem of morphological variants for the long forms or have used external resources (UMLS) that are not suitable when applied to the biomedical literature. The high quality of the abbreviation dictionary also has a direct impact on the accuracy of the disambiguation method. Indeed, the entries of the dictionary are properly linked to the senses of the abbreviations occurring at least 40 times because of the one-to-one relationship between the senses and the entries.

A proper representation of the sense's context is a decisive factor for the discrimination of the senses. We use here a method based on the text itself and not based on human annotations, unlike MeSH terms. Furthermore, the C-value method provides a refined granularity for the description of the context, without including irrelevant features. The context of a sense is represented with vectors that have on average 3000 non-empty features. In other words, each sense is represented with a considerable number of words.

The accuracy of the disambiguation method profits from the high performance achieved by SVMs, which have been successfully used in many text classification tasks.

Disambiguation of abbreviations is more accurate than word sense disambiguation on English words because the senses of an abbreviation are on average more distant. The sense of ‘tree’ as a product of nature and the sense of ‘tree’ as a structure of information are very close. The contexts of both senses can contain ‘root’, ‘branch’, ‘leaves’ or even ‘forest’. In contrast, scientific writers tend to avoid creating abbreviations that already exist in their own domain.

Our classifier disambiguates frequent abbreviations in Medline abstracts very accurately. Nevertheless, some misclassifications occur, generally as a result of one of the following reasons:

  1. The misclassification occurs for a rare sense of the abbreviation (Fig. 6), mainly because of the small margin between scores returned by the positive and negative classification functions (Fig. 7). A customized extraction of the context for minor senses could improve the accuracy of the classifier.
  2. The misclassification occurs on senses which are very similar but not necessarily synonymous, e.g. ‘cytotoxic T lymphocyte’ and ‘cytolytic T lymphocyte’. These misclassifications can be solved by further increasing the granularity of the context, which becomes difficult to achieve without integrating irrelevant features.
  3. The misclassification arises because some abbreviations are described as ambiguous by the dictionary, whereas they are not. According to the dictionary, the abbreviation UDPGT can either take the sense ‘UDP-glucuronosyltransferase’ or the sense ‘uridine diphosphate glucuronosyltransferase’, which are the same. Further research is needed to merge these long forms.



 
Fig. 6 Distribution of the probability that a misclassification occurs (y axis), given the probability that the abbreviation takes the sense on which the error occurs (x axis).

 


 
Fig. 7 Scores returned by the SVMs for each sense (class) of ‘HMM’, the positive class (crosses) and the negative class (circles). The distance (margin) between the positive class and the negative class decreases when the senses become rare. The number of abstracts used for the test is given in brackets.

 

    7 CONCLUSION
 
The biomedical literature contains many abbreviations that can be automatically extracted with their different long forms. We generated a dictionary of abbreviation/sense pairs, where different morphological variants of a sense have been grouped together using linguistic and statistical methods.

Using the generated dictionary, local and global abbreviations can be resolved to their sense, using the most frequent long form as the sense's representative. Many of the extracted abbreviations (1851) are ambiguous, meaning that they can take different senses in different contexts. We developed a method that disambiguates the polysemic abbreviations in documents using their context.

For abbreviation/sense pairs that are found at least 40 times in Medline abstracts, our method assigns the sense to the abbreviation with a precision of 98.9% at a recall of 98.2%. The recall is not 100% because, depending on the context, the SVM may not assign a sense at all. We assume that abbreviation/sense pairs found fewer than 40 times are not commonly known and therefore tend to appear with their long form, so that disambiguation is not necessary. There are three reasons for the good performance. First, the senses are, on average, well separated. Second, the method uses a considerable number of relevant words (features) to represent the context of each sense. Third, SVMs have been shown to be the most suitable choice for such data (Joachims, 1997).
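The effect of abstention on recall can be made concrete with a small sketch. It assumes one score per sense, as returned by a one-against-rest classifier, and assigns the best-scoring sense only if its score is positive; this zero threshold, the function names and the toy scores are assumptions for illustration. Precision is computed here over the assigned instances and recall over all instances, which is one common convention and may differ in detail from the evaluation used for the figures above.

```python
def choose_sense(scores, threshold=0.0):
    """Return the best-scoring sense, or None if no score exceeds the threshold."""
    if not scores:
        return None
    best_sense = max(scores, key=scores.get)
    return best_sense if scores[best_sense] > threshold else None


def precision_recall(gold, predicted):
    """Precision over assigned instances, recall over all instances."""
    assigned = [(g, p) for g, p in zip(gold, predicted) if p is not None]
    correct = sum(1 for g, p in assigned if g == p)
    precision = correct / len(assigned) if assigned else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical per-sense scores for four occurrences of 'HMM'.
gold = ["hidden Markov model", "heavy meromyosin",
        "hidden Markov model", "heavy meromyosin"]
score_table = [
    {"hidden Markov model": 1.3, "heavy meromyosin": -0.9},
    {"hidden Markov model": -1.1, "heavy meromyosin": 0.7},
    {"hidden Markov model": -0.2, "heavy meromyosin": -0.4},  # no positive score
    {"hidden Markov model": -0.8, "heavy meromyosin": 0.5},
]

predicted = [choose_sense(scores) for scores in score_table]
precision, recall = precision_recall(gold, predicted)
print(predicted)
print(precision, recall)  # precision 1.0, recall 0.75 on this toy data
```

The third occurrence receives no sense, so it lowers the recall without affecting the precision.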

Abbreviation resolution can help information extraction systems by improving the precision and recall of name recognition in documents. It can also improve the performance of search engines, either by using the resolved abbreviations during the indexing step or by disambiguating the query (query reformulation).
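As an illustration of query reformulation, the following sketch expands each abbreviation in a query with its resolved long form. It assumes that the sense of each abbreviation has already been determined; the function name and the Boolean query syntax are hypothetical and do not correspond to any particular search engine.

```python
def reformulate_query(query, resolved):
    """Expand each resolved abbreviation in the query with its long form."""
    terms = []
    for token in query.split():
        long_form = resolved.get(token)
        if long_form is None:
            terms.append(token)  # not a resolved abbreviation: keep as typed
        else:
            terms.append(f'({token} OR "{long_form}")')
    return " ".join(terms)


# The mapping would come from the abbreviation resolution step.
print(reformulate_query("HMM gene prediction", {"HMM": "hidden Markov model"}))
```

Keeping the abbreviation next to its long form lets the query match documents that use either form.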

The abbreviation dictionary and the abbreviation resolution module are publicly available.


    Acknowledgments
 
S.G. is supported by an ‘E-STAR’ fellowship funded by the EC's FP6 Marie Curie Host fellowship for Early Stage Research Training under contract number MEST-CT-2004-504640.

Conflict of Interest: none declared.


    Footnotes
 
1 Words surrounding the abbreviation.

2 Words found in the section containing the abbreviation.

3 Only words included in the pattern (adjective* (proper-noun|noun)+)+ are considered.

4 The following pattern has been used: (adjective* (proper-noun|noun)*)*.

5 Back

6 Back

7 Back

Received on April 1, 2005; revised on June 20, 2005; accepted on July 14, 2005

    REFERENCES

    Aho, A.V. and Corasick, J.M. (1975) Efficient string matching: an aid to bibliographic search. Commun. ACM, 18, 333–340.

    Aronson, A. (2001) Effective mapping of biomedical text to the UMLS meta-thesaurus: the MetaMap Program. Proc. AMIA Symp., 2001, 17–21.

    Adar, E. (2004) SaRAD: a Simple and Robust Abbreviation Dictionary. Bioinformatics, 20, 527–533.

    Chen, L., et al. (2005) Gene name ambiguity of eukaryotic nomenclature. Bioinformatics, 21, 248–256.

    Dingare, S., Finkel, J., Manning, C., Nissim, M., Alex, B. (2004) Exploring the boundaries: gene and protein identification in biomedical text. Proceedings of the BioCreative Workshop, Granada.

    Frantzi, K. and Ananiadou, S. (1999) The C value domain independent method for multiword term extraction. JNLP, 6, 145–179.

    Fred, H.L. and Cheng, T.O. (2003) Acronymesis: the exploding misuse of acronyms. Tex. Heart Inst. J., 30, 255–257.

    Hoffmann, R. and Valencia, A. (2003) Life cycles of successful genes. Trends Genet., 19, 79–81.

    Joachims, T. (1997) Text categorization with support vector machines: learning with many relevant features. Machine Learning: ECML-98, Tenth European Conference on Machine Learning, pp. 137–142.

    Liu, H., Aronson, A.R., Friedman, C. (2002a) A study of abbreviations in MEDLINE abstracts. Proc. AMIA Symp., 2002, 464–468.

    Liu, H., Johnson, S.B., Friedman, C. (2002b) Automatic resolution of ambiguous terms based on machine learning and conceptual relations in the UMLS. J. Am. Med. Inform. Assoc., 9, 621–636.

    Pakhomov, S. (2002) Semi-Supervised maximum entropy based approach to acronym and abbreviation normalization in medical texts. Proceedings of the 40th Annual Meeting of the ACL, Philadelphia, University of Pennsylvania, pp. 160–167.

    Sentinel Event Alert. (2001) Medication errors related to potentially dangerous abbreviations. JCAHO, Issue 23, September 2001.

    Schwartz, A. and Hearst, M. (2003) A simple algorithm for identifying abbreviation definitions in biomedical text. Proceedings of PSB'03, Kauai, 8, pp. 451–462.

    Stevenson, M. (2002) Word Sense Disambiguation: The Case for Combinations of Knowledge Sources. CSLI Studies in Computational Linguistics. CSLI Publications, Center for the Study of Language and Information, California.

    Taghva, K. and Gilbreth, J. (1999) Recognizing acronyms and their definitions. Int. J. Document Anal. Recogn., 191–198.

    Tsuruoka, Y. and Tsujii, J. (2003) Probabilistic term variant generator for biomedical terms. Proceedings of the 26th ACM SIGIR, Toronto, Canada, pp. 167–173.

    Wren, J.D., et al. (2005) Biomedical term mapping databases. Nucleic Acids Res., 33, D289–293.

    Yarowsky, D. (1995) Unsupervised word sense disambiguation rivaling supervised methods. Proceedings of the 33rd Annual Meeting of the ACL, Cambridge, Massachusetts, USA, pp. 189–196.

    Yu, H. and Friedman, C. (2002) Mapping abbreviations to full forms in biomedical articles. J. Am. Med. Inform. Assoc., 9, 262–272.

    Yu, Z., Tsuruoka, Y., Tsujii, J. (2003) Automatic resolution of ambiguous abbreviations in biomedical texts using support vector machines and one sense per discourse hypothesis. Proceedings of the SIGIR'03, pp. 57–62.

    Yoshida, M., Fukuda, K., Takagi, T. (2000) PNAD-CSS: a workbench for constructing a protein name abbreviation dictionary. Bioinformatics, 16, 169–175.





