Bioinformatics Advance Access originally published online on February 2, 2005
Bioinformatics 2005 21(9):2138-2139; doi:10.1093/bioinformatics/bti296
© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

BioIE: extracting informative sentences from the biomedical literature

Anna Divoli * and Teresa K. Attwood

Faculty of Life Sciences and School of Computer Science, University of Manchester, Manchester M13 9PT, UK

*To whom correspondence should be addressed.


    Abstract

Summary: BioIE is a rule-based system that extracts informative sentences relating to protein families, their structures, functions and diseases from the biomedical literature. Based on manual definition of templates and rules, it aims at precise sentence extraction rather than wide recall. After uploading source text or retrieving abstracts from MEDLINE, users can extract sentences based on predefined or user-defined template categories. BioIE also provides a brief insight into the syntactic and semantic context of the source-text by looking at word, N-gram and MeSH-term distributions. Important applications of BioIE include the annotation of microarray data and of protein databases.

Availability: http://umber.sbs.man.ac.uk/dbbrowser/bioie/

Contact: divoli{at}bioinf.man.ac.uk


    INTRODUCTION
Owing to the large volume of biomedical literature and its continuing rapid growth, the need for text-mining tools has become increasingly pressing. In recent years, several different systems have been developed: some aim to detect interactions among proteins, genes or both (Wong, 2001; Hoffmann and Valencia, 2004); others specifically detect protein and gene names (Hirschman et al., 2002; Yu et al., 2002); other, more specialized, systems extract information relating to, for instance, gene expression profiling (MedMiner; Tanabe et al., 1999), drugs and genes relevant to cancer (EDGAR; Rindflesch et al., 2000), signal-transduction pathways and associated drugs and diseases (GENIES; Friedman et al., 2001) and cDNA clones (FACTS; Nagashima et al., 2003).

In spite of the range of text-mining tools available, what has been lacking is a tool providing generic, easily customizable information extraction around an entity across a range of subjects of general interest to biologists and specifically to database annotators. Here we describe a new system that allows different types of sentence extraction, using predefined categories of interest relating to proteins, plus custom extraction around different entities and concepts, together with statistical feedback on the source and extracted text.


    IMPLEMENTATION AND ARCHITECTURE
BioIE is a rule-based system, implemented in Perl and CGI, and accessible via the Web. Its architecture is shown in Figure 1. Users may upload their own text corpus or download abstracts from MEDLINE using an embedded PubMed search facility; the corpus must contain relevant data for useful information to be extracted from it. Once a corpus has been loaded, BioIE can provide statistical information or allow extraction of pertinent sentences. BioIE runs on a server that can handle 385 KB/s; for simple queries (e.g. gene names) over 300 abstracts under test conditions (500 simultaneous users), the transfer rate is ~25 KB/s. Its main limitation is the volume of text that PubMed permits.



Fig. 1 Overview of BioIE's architecture.

 
Statistical information
Word distribution options return all words found in the text in descending order of their number of occurrences, together with their frequency per 1000 words (if MEDLINE abstracts are used, the word frequency per abstract and the number of abstracts in which each word is found are also given); a filtering option provides the same information but with commonly used English words, which provide no semantic information, filtered out. N-gram distributions (bigrams, trigrams and tetragrams), including or excluding punctuation marks, are also reported in descending order of the number of times they occur in the text. Finally, for retrieved MEDLINE abstracts, the MeSH terms with which the abstracts are indexed are reported in descending order of the number of times they are found, together with their frequency per 100 MeSH terms. These options provide some insight into the syntactic and semantic content of the source-text and may be valuable for further, more detailed studies.
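The word and N-gram distributions described above can be illustrated with a minimal Python sketch. This is not BioIE's Perl implementation: the function names, the tokenizer and the short stop-word list are all assumptions made for illustration, and the sketch assumes that filtered frequencies are still expressed per 1000 words of the whole text.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the list BioIE actually filters on is not given in the text.
STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is", "are", "by", "with"}


def tokenize(text):
    """Lower-case the text and return its word tokens (punctuation is dropped)."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def word_distribution(text, filter_common=False):
    """Return (word, count, frequency per 1000 words), most frequent first."""
    tokens = tokenize(text)
    total = len(tokens)
    counted = [t for t in tokens if not (filter_common and t in STOP_WORDS)]
    counts = Counter(counted)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(word, n, 1000.0 * n / total) for word, n in ordered]


def ngram_distribution(text, n):
    """Return (n-gram, count) pairs, most frequent first (n=2 gives bigrams)."""
    tokens = tokenize(text)
    grams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(" ".join(gram) for gram in grams)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    sample = "The kinase binds ATP. The kinase is active."
    for row in word_distribution(sample, filter_common=True):
        print(row)
    print(ngram_distribution(sample, 2)[:3])
```

Note that this sketch only covers the punctuation-excluded variant and lets N-grams run across sentence boundaries; the same counting approach would apply to MeSH terms, with the frequency scaled per 100 terms instead of per 1000 words.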

Sentence extraction
Sentence extraction is based on manually defined templates and rules. Templates may be single words, word pairs or small phrases (which may or may not be contiguous). Templates and rules were chosen carefully, using domain-specific knowledge, aiming at high precision rather than recall. Currently, BioIE uses five predefined categories of interest relating to proteins: structure; function; diseases and therapeutic compounds; localization; and familial relationships.

BioIE was designed to extract sentences because they are grammatically complete entities that are usually more informative than windows of words around target words/phrases: e.g. researchers may want to know about an event, but also about the conditions under which the event takes place. Whole sentences are more likely to contain such information, and are small enough to be checked quickly.

Currently, there are three extraction options: the first extracts sentences containing the selected templates; the second extracts sentences containing one or more user-specified terms in addition to the templates; the third allows users to provide their own keywords or phrases, for those interested in information beyond the predefined templates. All returned sentences carry the PMID (when the source is from abstracts) or the line number (when users have uploaded their own text), linking them to the original source.
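The three extraction options can be sketched in Python as follows. This is a hedged illustration rather than BioIE's own code: the templates, the function names and the naive sentence splitter are hypothetical, and the sketch assumes that the words of a non-contiguous template must appear in order.

```python
import re

# Hypothetical templates for illustration; BioIE's real templates are not listed in the text.
# Each template is a tuple of words that must all occur in a sentence, in order,
# but not necessarily next to each other.
TEMPLATES = {
    "structure": [("crystal", "structure"), ("domain",), ("fold",)],
    "function": [("catalyses",), ("binds", "to")],
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def matches(template, sentence):
    """True if the template words occur in order (gaps allowed) in the sentence."""
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    position = 0
    for target in template:
        try:
            position = words.index(target, position) + 1
        except ValueError:
            return False
    return True


def extract(records, templates, required_terms=()):
    """Return (source_id, sentence) pairs for sentences matching any template.

    records: iterable of (source_id, text); source_id stands for a PMID or a line number.
    required_terms: if given, a sentence must also contain at least one of them.
    """
    hits = []
    for source_id, text in records:
        for sentence in split_sentences(text):
            lowered = sentence.lower()
            if required_terms and not any(t.lower() in lowered for t in required_terms):
                continue
            if any(matches(t, sentence) for t in templates):
                hits.append((source_id, sentence))
    return hits


if __name__ == "__main__":
    records = [("12345", "The crystal structure was solved at 2 A. Rhodopsin binds to retinal.")]
    print(extract(records, TEMPLATES["structure"]))                # option 1: templates only
    print(extract(records, TEMPLATES["function"], ["rhodopsin"]))  # option 2: templates plus terms
    print(extract(records, [("retinal",)]))                        # option 3: user keywords
```

Carrying the source identifier alongside each sentence is what allows every hit to be linked back to its abstract or uploaded line.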

The extracted sentences are ranked in order of importance, according to the number and the type of template words and phrases they contain; these, and user-specified words, are highlighted in the output, making it easy for users to evaluate the results. The word, filtered word and N-gram distributions of the extracted text may also be calculated; this makes it easy to compare word usage in template-extracted sentences and the initial source-text, which could be useful for more in-depth linguistic analysis.
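Ranking and highlighting of this kind can be sketched as below. The text does not give BioIE's scoring scheme, so the per-template weights, the additive score and the `**` highlight marker are assumptions chosen only to show how "number and type" of matches could drive the ordering.

```python
import re

# Hypothetical weights: a higher weight stands for a more informative template type.
TEMPLATE_WEIGHTS = {"crystal structure": 3, "domain": 2, "fold": 1}


def score(sentence, weights):
    """Sum the weight of every whole-word template occurrence in the sentence."""
    lowered = sentence.lower()
    total = 0
    for phrase, weight in weights.items():
        occurrences = len(re.findall(r"\b" + re.escape(phrase) + r"\b", lowered))
        total += weight * occurrences
    return total


def rank(sentences, weights):
    """Return sentences sorted by descending score; ties keep their input order."""
    return sorted(sentences, key=lambda s: score(s, weights), reverse=True)


def highlight(sentence, terms, marker="**"):
    """Wrap each whole-word occurrence of a term in the marker, keeping original case."""
    if not terms:
        return sentence
    # Longest terms first so that a phrase wins over any shorter term it contains.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: marker + m.group(0) + marker, sentence)


if __name__ == "__main__":
    sentences = [
        "The fold is conserved.",
        "The crystal structure reveals a kinase domain.",
        "Expression was measured.",
    ]
    for s in rank(sentences, TEMPLATE_WEIGHTS):
        print(score(s, TEMPLATE_WEIGHTS), highlight(s, list(TEMPLATE_WEIGHTS) + ["kinase"]))
```

In a Web interface the marker would typically be replaced by HTML markup; a plain-text marker is used here to keep the sketch self-contained.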


    APPLICATIONS
BioIE provides several extraction options. The predefined categories, based on protein families, their structure, function and disease relationships, together with the custom extraction option, make it useful both as a generic text-mining tool and as an annotation tool, e.g. for microarray data or for protein databases (InterPro, Mulder et al., 2003; PRINTS, Attwood et al., 2003), whose annotators currently have to trawl the literature manually to compose an abstract for each family.


    FUTURE WORK
BioIE has been designed as a decision-support tool rather than a fully automated system; it is thus fast and simple to use. The system is highly interactive, and users can customize it to best suit their needs. For instance, for the top-ranked results of the user-specified extraction, BioIE achieves 100% precision for most of the protein entities. The system does not yet deal adequately with synonyms and homonyms, but we plan to provide options to return synonyms for some protein entities in future. We will also further revise the templates and rules, add new categories of interest, and explore ways to provide summarization options.


    Acknowledgments
 
This work was funded by Inpharmatica Ltd. and BBSRC. We are grateful to Neil Maudling and Alex Mitchell for their suggestions on the system implementation.

Received on September 22, 2004; revised on January 25, 2005; accepted on January 26, 2005

    REFERENCES

    Attwood, T.K., et al. (2003) PRINTS and its automatic supplement, prePRINTS. Nucleic Acids Res., 31, 400–402.

    Friedman, C., et al. (2001) GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics, 17, Suppl. 1, S74–S82.

    Hirschman, L., et al. (2002) Rutabaga by any other name: extracting biological names. J. Biomed. Inform., 35, 247–259.

    Hoffmann, R. and Valencia, A. (2004) A gene network for navigating the literature. Nat. Genet., 36, 664.

    Mulder, N.J., et al. (2003) The InterPro Database, 2003 brings increased coverage and new features. Nucleic Acids Res., 31, 315–318.

    Nagashima, T., et al. (2003) Inferring higher functional information for RIKEN mouse full-length cDNA clones with FACTS. Genome Res., 13, 1520–1533.

    Rindflesch, T., et al. (2000) EDGAR: extraction of drugs, genes and relations from the biomedical literature. Pac. Symp. Biocomput., 517–528.

    Tanabe, L., et al. (1999) MedMiner: an Internet text-mining tool for biomedical information, with application to gene expression profiling. Biotechniques, 27, 1216–1217.

    Wong, L. (2001) PIES, a protein interaction extraction system. Pac. Symp. Biocomput., 2001, 520–531.

    Yu, H., et al. (2002) Automatically identifying gene/protein terms in MEDLINE abstracts. J. Biomed. Inform., 35, 322–330.





