Bioinformatics Advance Access originally published online on April 28, 2005
Bioinformatics 2005 21(14):3191-3192; doi:10.1093/bioinformatics/bti475
© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text

Burr Settles

Department of Computer Sciences and Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 52706, USA


    Abstract
 TOP
 Abstract
 1 INTRODUCTION
 2 SOFTWARE FEATURES
 3 ALGORITHMS AND IMPLEMENTATION
 4 EVALUATION
 REFERENCES
 

Summary: ABNER (A Biomedical Named Entity Recognizer) is an open source software tool for molecular biology text mining. At its core is a machine learning system using conditional random fields with a variety of orthographic and contextual features. The latest version is 1.5, which has an intuitive graphical interface and includes two modules for tagging entities (e.g. protein and cell line) trained on standard corpora, for which performance is roughly state of the art. It also includes a Java application programming interface allowing users to incorporate ABNER into their own systems and train models on new corpora.

Availability: ABNER is available as an executable Java archive and source code from http://www.cs.wisc.edu/~bsettles/abner/

Contact: bsettles{at}cs.wisc.edu


    1 INTRODUCTION
 
Interest in developing effective tools for natural language processing (NLP) tasks in biomedical literature has been increasing in recent years. These tasks pose scientific challenges (established NLP techniques do not port easily to the biomedical domain), but there is also a practical need to effectively curate, organize and retrieve information automatically from textual sources. Named entity recognition, the NLP task of identifying words and phrases belonging to certain classes (e.g. protein or cell line names), is an important first step for many larger information management goals. The current state of the art yields F1 scores with exact boundary matching around 70 (Kim et al., 2004; Yeh et al., 2004), but few systems with published results in this range are freely available.

ABNER (A Biomedical Named Entity Recognizer) version 1.0 was released in July 2004 as a free, user-friendly interface to a high-performing system developed for the NLPBA 2004 Shared Task (Settles, 2004). Version 1.5 was released open source in March 2005 with some performance improvements and a customizable application programming interface (API).


    2 SOFTWARE FEATURES
 
ABNER has an intuitive graphical user interface where text can be typed in manually or loaded from a file and automatically tagged for multiple named entities in real time. A screen shot of the interface is shown in Figure 1. Each entity type is highlighted with a unique color for easy visual reference, and tagged documents can be saved in a variety of file formats. The software can also annotate plain text files in batch mode. Users can pre-tokenize input text, or make use of ABNER's built-in tokenization, which is quite robust to wrapped lines and biomedical abbreviations. The bundled ABNER application is platform-independent and has been tested on Linux, Windows XP, Solaris and Mac OS X. The distribution includes two built-in entity tagging modules that are trained and evaluated on the standard NLPBA (Kim et al., 2004) and BioCreative (Yeh et al., 2004; see footnote 1) corpora. Performance details for both modules are presented in Section 4.


Fig. 1 A screen shot of ABNER's graphical user interface.

 
The Java API allows users to write custom interfaces to ABNER modules or incorporate them into larger biomedical NLP systems. The API also includes routines for training new modules on other corpora. (This may be necessary for tasks that are organism-specific or require tagging conventions not reflected by the built-in modules.) The source code is also available under the terms of the Common Public License.


    3 ALGORITHMS AND IMPLEMENTATION
 
Conditional random fields (CRFs) are undirected statistical graphical models, a special case of which corresponds to conditionally trained finite-state machines well suited for labeling and segmenting sequence data (Lafferty et al., 2001). Named entity recognition can be framed as a sequence labeling problem: words in a sentence are tokens to be assigned labels by states in the CRF framework.

Let o = <o_1, o_2,...,o_n> be a sequence of observed words of length n. Let L be a set of labels (entity classes such as protein or cell line, plus a non-entity label) corresponding to states in a finite-state machine. Then l = <l_1, l_2,...,l_n> is a sequence of labels from L assigned to words in the input sequence o. A first-order linear-chain CRF defines the conditional probability of a label sequence given an input sequence to be:

$$P(l \mid o) = \frac{1}{Z_o} \exp\left( \sum_{i=1}^{n} \sum_{j=1}^{k} \lambda_j f_j(l_{i-1}, l_i, o, i) \right)$$

where Z_o is a normalization factor over all possible label sequences, f_j is one of the k binary functions describing a feature at position i in sequence o and λ_j is a weight for that feature. For example, given the text ‘...the ATPase...’, f_j might be the feature WORD=ATPase and have value 1 along the transition where l_{i-1} is the non-entity label state (‘the’ is a non-entity) and l_i is the entity label state assigned to ‘ATPase’. Other features with value 1 along this transition are CAPITALIZED, MIXEDCASE and SUFFIX=ase. The learned weight λ_j should be positive for a feature correlated with the target label, negative for a feature that is anti-correlated and near zero for a relatively uninformative feature. The weights are set to maximize the conditional log-likelihood of m labeled sequences in a training set D = {<o,l>^(1),...,<o,l>^(m)}:

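$$LL(D) = \sum_{i=1}^{m} \log P\left(l^{(i)} \mid o^{(i)}\right) - \sum_{j=1}^{k} \frac{\lambda_j^2}{2\sigma^2}$$

where the second sum is a Gaussian prior over feature weights (with variance σ^2) that helps to prevent overfitting due to sparsity in D. If training sequences are fully labeled, the optimization problem is convex and training is guaranteed to converge to the global optimum. New sequences can then be labeled with the Viterbi algorithm. For more details, see Lafferty et al. (2001).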
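The quantities described in this section can be illustrated on a toy model. The following is a minimal sketch, not ABNER's or MALLET's implementation: the two-label set, the three feature functions, their weights, the start label and the prior variance are all hypothetical values chosen for illustration. It computes the conditional probability by brute-force normalization (feasible only for very short sequences), the penalized log-likelihood, and the most likely labeling with the Viterbi algorithm.

```python
import itertools
import math

LABELS = ["O", "PROTEIN"]   # hypothetical label set: non-entity and one entity class
START = "O"                 # assumed label preceding the first token

# Hypothetical binary features f_j(l_prev, l_cur, o, i), each paired with a weight lambda_j.
def f_word_atpase(l_prev, l_cur, o, i):
    return 1 if o[i] == "ATPase" and l_prev == "O" and l_cur == "PROTEIN" else 0

def f_suffix_ase(l_prev, l_cur, o, i):
    return 1 if o[i].endswith("ase") and l_cur == "PROTEIN" else 0

def f_lowercase_other(l_prev, l_cur, o, i):
    return 1 if o[i].islower() and l_cur == "O" else 0

FEATURES = [(f_word_atpase, 1.5), (f_suffix_ase, 2.0), (f_lowercase_other, 1.0)]

def local_score(l_prev, l_cur, o, i):
    """Weighted sum of the features that fire on one transition."""
    return sum(w * f(l_prev, l_cur, o, i) for f, w in FEATURES)

def sequence_score(labels, o):
    """Unnormalized log-score of a whole label sequence."""
    total, prev = 0.0, START
    for i, cur in enumerate(labels):
        total += local_score(prev, cur, o, i)
        prev = cur
    return total

def probability(labels, o):
    """P(l | o), with Z_o computed by enumerating every label sequence."""
    z = sum(math.exp(sequence_score(c, o))
            for c in itertools.product(LABELS, repeat=len(o)))
    return math.exp(sequence_score(labels, o)) / z

def penalized_log_likelihood(dataset, sigma2=10.0):
    """Sum of log P(l | o) over (o, l) pairs minus the Gaussian prior term."""
    fit = sum(math.log(probability(l, o)) for o, l in dataset)
    penalty = sum(w * w for _, w in FEATURES) / (2.0 * sigma2)
    return fit - penalty

def viterbi(o):
    """Most likely label sequence for a non-empty o, by dynamic programming."""
    best = {l: (local_score(START, l, o, 0), [l]) for l in LABELS}
    for i in range(1, len(o)):
        new_best = {}
        for cur in LABELS:
            score, path = max(
                ((best[p][0] + local_score(p, cur, o, i), best[p][1]) for p in LABELS),
                key=lambda t: t[0],
            )
            new_best[cur] = (score, path + [cur])
        best = new_best
    return max(best.values(), key=lambda t: t[0])[1]

sentence = ["the", "ATPase", "binds"]
print(viterbi(sentence))  # ['O', 'PROTEIN', 'O']
```

Brute-force normalization grows exponentially with sentence length; practical CRF implementations compute Z_o with a forward pass instead, which has the same dynamic-programming structure as the Viterbi function above.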

ABNER's default feature set comprises orthographic and contextual features, mostly based on regular expressions and neighboring tokens. The feature set is slightly modified from previous work (Settles, 2004) for improved performance, and can be viewed/modified in the source code distribution. Note that ABNER currently does not use syntactic or semantic features. Research indicates that such features can improve performance slightly, but presently they are not dynamically generated by ABNER.
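As a rough illustration of what orthographic and contextual features based on regular expressions and neighboring tokens might look like, here is a minimal sketch. It is not ABNER's actual feature set: the feature names CAPITALIZED, MIXEDCASE and SUFFIX follow the examples given earlier in this section, while HASDIGIT, HASDASH, the three-character suffix length and the one-token context window are assumptions made for this example.

```python
import re

# Hypothetical orthographic patterns; only CAPITALIZED and MIXEDCASE are named in the text.
PATTERNS = {
    "CAPITALIZED": re.compile(r"^[A-Z]"),
    "MIXEDCASE": re.compile(r"^(?=.*[a-z])(?=.*[A-Z])"),
    "HASDIGIT": re.compile(r"\d"),
    "HASDASH": re.compile(r"-"),
}

def token_features(tokens, i, window=1):
    """Return the set of binary features that fire for tokens[i]."""
    word = tokens[i]
    feats = {"WORD=" + word}
    # Orthographic features: one per regular expression that matches the token.
    feats.update(name for name, pat in PATTERNS.items() if pat.search(word))
    # Suffix feature (assumed length of three characters).
    if len(word) > 3:
        feats.add("SUFFIX=" + word[-3:])
    # Contextual features: neighboring tokens within the window.
    for offset in range(-window, window + 1):
        j = i + offset
        if offset != 0 and 0 <= j < len(tokens):
            feats.add("WORD@%+d=%s" % (offset, tokens[j]))
    return feats

print(sorted(token_features(["the", "ATPase", "binds"], 1)))
```

For the token ‘ATPase’ in this example, the resulting set includes WORD=ATPase, CAPITALIZED, MIXEDCASE and SUFFIX=ase, together with the two neighboring-word features.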

The system is written entirely in Java using graphical window objects from the Swing library. The CRF models are implemented with the MALLET toolkit (http://mallet.cs.umass.edu/), which uses a quasi-Newton method called L-BFGS (Nocedal and Wright, 1999) to find the optimal feature weights efficiently. Tokenization is performed by a deterministic finite-state scanner built with the JLex tool (http://www.cs.princeton.edu/~appel/modern/java/JLex/).


    4 EVALUATION
 
The NLPBA corpus is a modified version of the GENIA corpus (Kim et al., 2003), containing five entities labeled for 18 546 training sentences and 3856 evaluation sentences. The BioCreative corpus contains only one entity subsuming genes and gene products (proteins, RNA, etc.) labeled for 7500 training sentences and 2500 evaluation sentences. ABNER tagged the NLPBA corpus at a rate of 864 words (33 sentences) per second, and the BioCreative corpus at a rate of 1260 words (48 sentences) per second on a 500 MHz Pentium III running Linux with 512 MB memory (speeds will vary among different tagging modules and machines).
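To put the reported rates in perspective, the arithmetic below estimates how long tagging each evaluation set would take. This is a back-of-the-envelope sketch that assumes the reported rates stay constant over the whole set; the function names are illustrative only.

```python
def seconds_to_tag(num_sentences, sentences_per_second):
    """Estimated wall-clock seconds, assuming a constant tagging rate."""
    return num_sentences / sentences_per_second

def words_per_sentence(words_per_second, sentences_per_second):
    """Average sentence length implied by the two reported rates."""
    return words_per_second / sentences_per_second

nlpba_seconds = seconds_to_tag(3856, 33)        # NLPBA evaluation sentences
biocreative_seconds = seconds_to_tag(2500, 48)  # BioCreative evaluation sentences
print(round(nlpba_seconds), round(biocreative_seconds))  # 117 52
```

Both sets of rates imply an average of roughly 26 words per sentence, so the difference in speed reflects the tagging modules rather than sentence length.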

Table 1 presents evaluation results in terms of recall [R = TP/(TP + FN)], precision [P = TP/(TP + FP)] and F1 score [(2 × R × P)/(R + P)], where TP means true positives, FN means false negatives and FP means false positives. To the author's knowledge, these figures are competitive with the best published results on these corpora at this time. It is important to note that the quality of biomedical NLP systems can vary by organism (Hirschman et al., 2004); thus, training ABNER with novel, organism-specific corpora (with a potentially augmented feature set) may be advisable for some applications.
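The metrics defined above can be computed as in the following minimal sketch, which assumes exact boundary matching: a predicted entity counts as a true positive only if its sentence, start, end and type all match a gold-standard entity. The tuple layout and the example entities are hypothetical and are not taken from either corpus.

```python
def recall_precision_f1(gold, predicted):
    """Exact-match recall, precision and F1 over sets of entity tuples."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)   # predicted entities that match gold exactly
    fp = len(predicted - gold)   # predicted entities with no exact gold match
    fn = len(gold - predicted)   # gold entities that were missed
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * recall * precision / (recall + precision) if recall + precision else 0.0
    return recall, precision, f1

# Each entity is (sentence index, start token, end token, type); values are made up.
gold = {(0, 1, 2, "PROTEIN"), (0, 5, 7, "CELL_LINE"),
        (1, 0, 1, "PROTEIN"), (1, 4, 6, "CELL_LINE")}
predicted = {(0, 1, 2, "PROTEIN"), (0, 5, 6, "CELL_LINE"), (1, 4, 6, "CELL_LINE")}

r, p, f1 = recall_precision_f1(gold, predicted)
print(round(r, 3), round(p, 3), round(f1, 3))  # 0.5 0.667 0.571
```

Note that the second prediction overlaps a gold entity but has the wrong right boundary, so under exact matching it counts as both a false positive and a false negative.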

Table 1 Evaluation of ABNER's two tagging modules

 


    Acknowledgments
 
Thanks to Mark Craven for his support of this project. Research related to the development of this software was supported by NLM grant 5T15LM007359 and NIH grant R01 LM07050–01.


    Footnotes
 
1. This was previously distributed as part of a command-line tool called YAGI (Yet Another Gene Identifier), which has been deprecated.

Received on April 1, 2005; revised on April 21, 2005; accepted on April 26, 2005

    REFERENCES
 

    Hirschman, L., Colosimo, M., Morgan, A., Colombe, J., Yeh, A. (2004) Task 1B: gene list task. Proceedings of the Critical Assessment of Information Extraction Systems in Biology (BioCreAtIvE) Workshop, Granada, Spain.

    Kim, J., et al. (2003) GENIA corpus—a semantically annotated corpus for bio-textmining. Bioinformatics, 19 (Suppl. 1), i180–i182.

    Kim, J., Ohta, T., Tsuruoka, Y., Tateisi, Y., Collier, N. (2004) Introduction to the bio-entity recognition task at JNLPBA. Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA), Geneva, Switzerland, pp. 70–75.

    Lafferty, J., McCallum, A., Pereira, F. (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. Proceedings of the International Conference on Machine Learning, Williamstown, MA. Morgan Kaufmann, San Francisco, CA, pp. 282–289.

    Nocedal, J. and Wright, S.J. (1999) Numerical Optimization. Springer, New York, NY, pp. 224–233.

    Settles, B. (2004) Biomedical named entity recognition using conditional random fields and rich feature sets. Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA), Geneva, Switzerland, pp. 104–107.

    Yeh, A., Hirschman, L., Morgan, A., Colosimo, M. (2004) Task 1A: gene-related name mention finding evaluation. Proceedings of the Critical Assessment of Information Extraction Systems in Biology (BioCreAtIvE) Workshop, Granada, Spain.

