Bioinformatics Advance Access originally published online on February 2, 2005
Bioinformatics 2005 21(9):2138-2139; doi:10.1093/bioinformatics/bti296
BioIE: extracting informative sentences from the biomedical literature
Faculty of Life Sciences and School of Computer Science, University of Manchester, Manchester M13 9PT, UK
*To whom correspondence should be addressed.
| Abstract |
|---|
Summary: BioIE is a rule-based system that extracts informative sentences relating to protein families, their structures, functions and diseases from the biomedical literature. Based on manual definition of templates and rules, it aims at precise sentence extraction rather than wide recall. After uploading source text or retrieving abstracts from MEDLINE, users can extract sentences based on predefined or user-defined template categories. BioIE also provides a brief insight into the syntactic and semantic context of the source-text by looking at word, N-gram and MeSH-term distributions. Important applications of BioIE include, for example, annotation of microarray data and of protein databases.
Availability: http://umber.sbs.man.ac.uk/dbbrowser/bioie/
Contact: divoli{at}bioinf.man.ac.uk
| INTRODUCTION |
|---|
Owing to the large volume of biomedical literature and its continuous fast growth, the need for text-mining tools has become increasingly important. In recent years, several different systems have been developed: some aim to detect interactions among proteins, genes or both (Wong, 2001; Hoffmann and Valencia, 2004); others specifically detect protein and gene names (Hirschman et al., 2002; Yu et al., 2002); other, more specialized, systems extract information relating to, for instance, gene expression profiling (MedMiner; Tanabe et al., 1999), drugs and genes relevant to cancer (EDGAR; Rindflesch et al., 2000), signal-transduction pathways and associated drugs and diseases (GENIES; Friedman et al., 2001) and cDNA clones (FACTS; Nagashima et al., 2003).
In spite of the range of text-mining tools available, what has been lacking is a tool providing generic, easily customizable information extraction around an entity across a range of subjects of general interest to biologists and specifically to database annotators. Here we describe a new system that allows different types of sentence extraction, using predefined categories of interest relating to proteins, plus custom extraction around different entities and concepts, together with statistical feedback on the source and extracted text.
| IMPLEMENTATION AND ARCHITECTURE |
|---|
BioIE is a rule-based system, implemented in Perl and CGI, and accessible via the Web. Its architecture is shown in Figure 1. Users may upload their own text corpus or download abstracts from MEDLINE using an embedded PubMed search facility; the corpus must contain relevant data in order to be able to extract useful information from it. Once a corpus has been loaded, BioIE can provide statistical information or allow extraction of pertinent sentences. BioIE runs on a server that can handle 385 KB/s; for simple queries (e.g. gene names) over 300 abstracts under test conditions (500 simultaneous users), the transfer rate is 25 KB/s; its main limitation is the volume of text that PubMed permits.
Statistical information
Word distribution options return all words found in the text in descending order of their number of occurrences, together with their frequency per 1000 words (if MEDLINE abstracts are used, the word frequency per abstract and the number of abstracts in which each word is found are also given); a filtering option provides the same information but with commonly used English words, which provide no semantic information, filtered out. N-gram distributions (bigrams, trigrams and tetragrams), including or excluding punctuation marks, are also reported in descending order of the number of times they occur in the text. Finally, for retrieved MEDLINE abstracts, the MeSH terms for which the abstracts are indexed are reported in descending order of the number of times they are found, together with their frequency per 100 MeSH terms. These options provide some insight into the syntactic and semantic content of the source-text and may be valuable for further, more detailed studies.
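The word and N-gram distributions described above can be illustrated with a minimal Python sketch. It is not BioIE's Perl implementation: the tokenizer, the `STOP_WORDS` list, the function names, and the choice to compute the frequency per 1000 words against the unfiltered word count are all assumptions made for illustration, and punctuation is simply excluded.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the paper does not say which common
# English words BioIE filters out.
STOP_WORDS = {"the", "of", "and", "in", "a", "to", "is", "that", "with", "for"}


def tokenize(text):
    """Lower-case the text and return its word tokens (punctuation dropped)."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def word_distribution(text, filter_common=False):
    """Return (word, count, frequency per 1000 words), most frequent first."""
    tokens = tokenize(text)
    total = len(tokens)
    if total == 0:
        return []
    if filter_common:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    counts = Counter(tokens)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(word, n, 1000.0 * n / total) for word, n in ordered]


def ngram_distribution(text, n=2):
    """Return (n-gram, count) pairs in descending order of count."""
    tokens = tokenize(text)
    grams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(" ".join(gram) for gram in grams)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

Calling `ngram_distribution(text, 3)` or `ngram_distribution(text, 4)` gives the trigram and tetragram counts in the same way; a MeSH-term distribution could reuse the same counting step with MeSH terms in place of word tokens.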
Sentence extraction
Sentence extraction is based on manually defined templates and rules. Templates may be single words, word pairs or small phrases (which may or may not be contiguous). Templates and rules were chosen carefully, using domain-specific knowledge, aiming at high precision rather than recall. Currently, BioIE uses five predefined categories of interest relating to proteins: structure, function, diseases and therapeutic compounds, localization, and familial relationships.
BioIE was designed to extract sentences because they are grammatically complete entities that are usually more informative than windows of words around target words/phrases: e.g. researchers may want to know about an event, but also about the conditions under which the event takes place; whole sentences are more likely to contain such information, and are small enough to be checked quickly.
Currently, there are three extraction options: the first extracts sentences containing the selected templates; the second extracts sentences containing one or more user-specified terms in addition to the templates; the third allows users to provide their own keywords or phrases, for those interested in information beyond the predefined templates. All returned sentences carry the PMID (when the source is from abstracts) or the line number (when users have uploaded their own text), linking them to the original source.
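As a rough illustration of the three extraction options, the following Python sketch filters sentences and keeps a source identifier with each hit. The `TEMPLATES` entries, the naive sentence splitter, and all function names are hypothetical; BioIE's actual templates and rules are not listed in this paper, and non-contiguous templates are not handled here.

```python
import re

# Hypothetical templates for two categories, for illustration only.
TEMPLATES = {
    "structure": ["crystal structure", "alpha-helix", "domain"],
    "function": ["catalyses", "binds to", "regulates"],
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def contains(sentence, phrase):
    """Case-insensitive match of a whole word or contiguous phrase."""
    pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
    return re.search(pattern, sentence, re.IGNORECASE) is not None


def extract(records, category=None, user_terms=(), keywords=()):
    """Return (source_id, sentence) pairs.

    records: iterable of (source_id, text); source_id stands for a PMID
    or a line number.
    Option 1: category only. Option 2: category plus user_terms.
    Option 3: keywords only (the predefined templates are ignored).
    """
    if not keywords and category is None:
        raise ValueError("choose a category or supply keywords")
    hits = []
    for source_id, text in records:
        for sentence in split_sentences(text):
            if keywords:
                ok = any(contains(sentence, k) for k in keywords)
            else:
                ok = any(contains(sentence, t) for t in TEMPLATES[category])
                if ok and user_terms:
                    ok = any(contains(sentence, u) for u in user_terms)
            if ok:
                hits.append((source_id, sentence))
    return hits
```

Carrying the identifier alongside each sentence is what allows every result to be traced back to its abstract or to a line of the uploaded text.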
The extracted sentences are ranked in order of importance, according to the number and the type of template words and phrases they contain; these, and any user-specified words, are highlighted in the output, making it easy for users to evaluate the results. The word, filtered word and N-gram distributions of the extracted text may also be calculated; this makes it easy to compare word usage in template-extracted sentences and the initial source-text, which could be useful for more in-depth linguistic analysis.
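The paper gives no scoring formula, so the following Python sketch shows only one plausible way to rank by the number and type of matched templates: each template carries a hypothetical weight (its "type"), and a sentence's score is the sum of weight times number of matches. The weights, the `**` highlight markers, and the function names are assumptions for illustration.

```python
import re

# Hypothetical templates with weights standing in for template "type".
WEIGHTED_TEMPLATES = {"crystal structure": 3, "domain": 2, "fold": 1}


def _pattern(phrase):
    """Regex matching a whole word or contiguous phrase."""
    return r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"


def score(sentence):
    """Sum of weight x number of matches over all templates."""
    return sum(
        weight * len(re.findall(_pattern(phrase), sentence, re.IGNORECASE))
        for phrase, weight in WEIGHTED_TEMPLATES.items()
    )


def rank(sentences):
    """Return (score, sentence) pairs, best first, dropping non-matches."""
    scored = [(score(s), s) for s in sentences]
    return sorted((p for p in scored if p[0] > 0), key=lambda p: -p[0])


def highlight(sentence, terms):
    """Wrap each matched term in ** markers (longer terms tried first)."""
    ordered = sorted(terms, key=len, reverse=True)
    pattern = r"(?<!\w)(" + "|".join(re.escape(t) for t in ordered) + r")(?!\w)"
    return re.sub(pattern, lambda m: "**" + m.group(0) + "**",
                  sentence, flags=re.IGNORECASE)
```

Passing both the template phrases and any user-specified words to `highlight` marks them in the ranked output, which mirrors how the highlighted results are meant to be scanned quickly by eye.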
| APPLICATIONS |
|---|
BioIE provides several extraction options. The predefined categories, based on protein families, their structure, function and disease relationships, together with the custom extraction option, make it useful both as a generic text-mining tool and as an annotation tool, e.g. for microarray data or protein databases (InterPro, Mulder et al., 2003; PRINTS, Attwood et al., 2003), whose annotators currently have to trawl the literature manually to compose an abstract for each family.
| FUTURE WORK |
|---|
BioIE has been designed as a decision-support tool rather than a fully automated system, and is thus fast and simple to use. The system is highly interactive and users can customize it to best suit their needs. For instance, for the top-ranked results of the user-specified extraction, BioIE achieves 100% precision for most of the protein entities. The system does not yet deal adequately with synonyms and homonyms, but we plan to provide options to return synonyms for some protein entities in future. We will also further revise the templates and rules, add new categories of interest, and explore ways to provide summarization options.
| Acknowledgments |
|---|
This work was funded by Inpharmatica Ltd. and BBSRC. We are grateful to Neil Maudling and Alex Mitchell for their suggestions on the system implementation.
Received on September 22, 2004; revised on January 25, 2005; accepted on January 26, 2005
| REFERENCES |
|---|
Attwood, T.K., et al. (2003) PRINTS and its automatic supplement, prePRINTS. Nucleic Acids Res., 31, 400–402.
Friedman, C., et al. (2001) GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics, 17, Suppl. 1, S74–S82.
Hirschman, L., et al. (2002) Rutabaga by any other name: extracting biological names. J. Biomed. Inform., 35, 247–259.
Hoffmann, R. and Valencia, A. (2004) A gene network for navigating the literature. Nat. Genet., 36, 664.
Mulder, N.J., et al. (2003) The InterPro Database, 2003 brings increased coverage and new features. Nucleic Acids Res., 31, 315–318.
Nagashima, T., et al. (2003) Inferring higher functional information for RIKEN mouse full-length cDNA clones with FACTS. Genome Res., 13, 1520–1533.
Rindflesch, T., et al. (2000) EDGAR: extraction of drugs, genes and relations from the biomedical literature. Pac. Symp. Biocomput., 517–528.
Tanabe, L., et al. (1999) MedMiner: an Internet text-mining tool for biomedical information, with application to gene expression profiling. Biotechniques, 27, 1216–1217.
Wong, L. (2001) PIES, a protein interaction extraction system. Pac. Symp. Biocomput., 2001, 520–531.
Yu, H., et al. (2002) Automatically identifying gene/protein terms in MEDLINE abstracts. J. Biomed. Inform., 35, 322–330.