Bioinformatics Advance Access originally published online on October 27, 2004
Bioinformatics 2005 21(5):694-695; doi:10.1093/bioinformatics/bti087

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

MedKit: a helper toolkit for automatic mining of MEDLINE/PubMed citations

Jing Ding and Daniel Berleant *

Department of Electrical and Computer Engineering, Iowa State University, Ames, IA 50011, USA

*To whom correspondence should be addressed.


    Abstract

Summary: MEDLINE/PubMed is one of the most important information sources for bioinformatics text mining. However, working with MEDLINE/PubMed citations still has limitations. For example, PubMed imposes an upper limit of 10 000 on downloaded PMIDs or citations, and MEDLINE files are too large for most off-the-shelf XML parsers. We developed a Java package, MedKit, to work around these limitations and to provide other useful functions, e.g. random sampling. Its four modules (querier, sampler, fetcher and parser) can work independently or be pipelined in various combinations. It can be used as a stand-alone GUI application or integrated into other text-mining systems. Text-mining researchers and others may download and use the toolkit free of charge for non-commercial purposes.

Availability: http://metnetdb.gdcb.iastate.edu/medkit

Contact: berleant{at}iastate.edu


    INTRODUCTION
MEDLINE (National Library of Medicine, 2004b, http://www.nlm.nih.gov/pubs/factsheets/medline.html) is a standard literature database for bioinformatics text mining. It can be accessed through annual releases in XML format (with weekly updates) or through its web interface, PubMed (National Library of Medicine, 2004c, http://www.ncbi.nih.gov/entrez/query.fcgi). Both access methods have limitations, however. The MEDLINE release files are intended for automatic processing, and the XML format makes navigation and manipulation within a file easy, but the files are too large (average size: ~100 MB) for most off-the-shelf XML parsers. Moreover, despite the popularity of XML, some text-mining libraries, e.g. the ‘Bow’ toolkit (McCallum, 1996, http://www-2.cs.cmu.edu/~mccallum/bow/), still take plain text files as input. It is also difficult to generate a subset of citations directly from MEDLINE release files in response to a user query, which causes users to turn to PubMed. PubMed, in contrast, is oriented toward human users: its query system is designed to return a manageable set of relevant documents, and its upper limit for downloading query results (currently 10 000 hits) can be a hindrance to automated text-mining systems. In addition, the MEDLINE XML format and the PubMed XML format are not identical.

To work around these limitations, as well as add other useful functionalities (e.g. to randomly sample a subset of MEDLINE/PubMed abstracts), we developed a Java package, MedKit. It can be used as a stand-alone GUI application, or its modules may be integrated into other automated MEDLINE/PubMed mining systems.


    PROGRAM OVERVIEW
MedKit integrates four modules (Java classes): a querier, a sampler, a fetcher and a parser. They may be used together through the interface shown in Figure 1 or incorporated in any combination into other programs. The querier takes a query (keyword terms plus other conditions, such as publication dates, fields, etc.) as input and returns a list of PMIDs. The number of returned PMIDs has no upper limit. The sampler draws random samples from a list of PMIDs. The fetcher retrieves citations from PubMed given a list of PMIDs. As with the querier, the number of retrieved citations has no upper limit. The parser can parse very large MEDLINE/PubMed XML files, split them into smaller ones, extract PMIDs, and/or extract abstracts into plain text files (limited not by the size of the input files, but by the disk space available for output).



Fig. 1 A screenshot of MedKit GUI.

 
Common mining-related tasks in working with MEDLINE/PubMed citations can be broken down into atomic steps and carried out by individual modules one step at a time, saving the output of each step as the input to the next. Alternatively, the modules can be pipelined in various combinations to carry out more sophisticated tasks in a single run. For example, retrieving three random samples, each of 50 PubMed citations in compressed XML format, from the abstracts that mention ‘red blood cell’ in their MeSH terms and were published in the last 5 years can be accomplished by a pipeline of ‘querier->sampler->fetcher’ (Fig. 1). Other valid pipeline workflows, with their inputs and outputs, are shown in Table 1.
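The ‘querier->sampler->fetcher’ pipeline can be illustrated with a minimal Java sketch. It is not MedKit's API: the class PipelineSketch and its methods query, sample, fetch and run are hypothetical names, and the querier and fetcher are stand-ins that fabricate placeholder PMIDs and records instead of contacting PubMed. The sketch only shows how the output of one stage becomes the input of the next, and how one query result can feed several samples.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch of a querier -> sampler -> fetcher pipeline; not MedKit's API. */
final class PipelineSketch {

    /** Stand-in querier: returns made-up PMIDs 1..1000 instead of calling PubMed. */
    static List<Integer> query(String term) {
        List<Integer> pmids = new ArrayList<>();
        for (int i = 1; i <= 1000; i++) {
            pmids.add(i);
        }
        return pmids;
    }

    /** Stand-in sampler: draws up to n PMIDs without replacement. */
    static List<Integer> sample(List<Integer> pmids, int n, Random rng) {
        List<Integer> copy = new ArrayList<>(pmids);
        Collections.shuffle(copy, rng);
        return new ArrayList<>(copy.subList(0, Math.min(n, copy.size())));
    }

    /** Stand-in fetcher: builds placeholder records instead of downloading citations. */
    static List<String> fetch(List<Integer> pmids) {
        List<String> records = new ArrayList<>();
        for (int pmid : pmids) {
            records.add("<PMID>" + pmid + "</PMID>");
        }
        return records;
    }

    /** Queries once, then samples and fetches once per requested sample. */
    static List<List<String>> run(String term, int samples, int sampleSize, long seed) {
        Random rng = new Random(seed);
        List<Integer> hits = query(term);
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < samples; i++) {
            batches.add(fetch(sample(hits, sampleSize, rng)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<List<String>> batches = run("red blood cell[MeSH Terms]", 3, 50, 42L);
        System.out.println(batches.size() + " samples of " + batches.get(0).size() + " records");
    }
}
```

In this sketch each sample is drawn independently from the full hit list, so the same PMID may appear in more than one sample; whether MedKit behaves the same way is not stated in the text.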


 
Table 1 Valid workflows in MedKit

 
The querier and fetcher take advantage of the ESearch and EFetch services, respectively, of the NCBI Entrez Utilities (National Library of Medicine, 2004a, http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html). In other words, MedKit simply passes users' queries to PubMed. Therefore, any legal PubMed Boolean query can be used, for example, red blood cell[text word] AND review[publication type]. It is, however, not our intention to build another MEDLINE interface to compete with PubMed. On the contrary, MedKit is designed to complement PubMed. PubMed also provides other facilities (i.e. Limits, History and Clipboard) to enhance query capability and efficiency. The results of direct PubMed queries (PMID lists and/or XML citations) can be saved locally and then used as input to MedKit for further processing, e.g. parsing and/or sampling.
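The text does not say how MedKit retrieves more than 10 000 hits. One plausible approach, shown here purely as an assumption, is to request the result set in pages using the ESearch parameters retstart and retmax. The sketch below only builds the request URLs and makes no network call; the class ESearchPaging, the method pageUrls and the endpoint constant (the present-day E-utilities address, not the one cited above) are illustrative choices, and NCBI may enforce its own caps on how far a result set can be paged.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: builds paged ESearch URLs; not taken from MedKit's source. */
final class ESearchPaging {

    // Assumed endpoint: the current E-utilities address, newer than the URL cited in the article.
    static final String BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi";

    /** Returns one URL per page needed to cover totalHits results. */
    static List<String> pageUrls(String query, int totalHits, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        String term;
        try {
            // Form-encodes the query: spaces become '+', brackets become %5B and %5D.
            term = URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
        List<String> urls = new ArrayList<>();
        for (int start = 0; start < totalHits; start += pageSize) {
            urls.add(BASE + "?db=pubmed&term=" + term
                    + "&retstart=" + start + "&retmax=" + pageSize);
        }
        return urls;
    }

    public static void main(String[] args) {
        String query = "red blood cell[text word] AND review[publication type]";
        for (String url : pageUrls(query, 25000, 10000)) {
            System.out.println(url);
        }
    }
}
```

A caller would first issue a single ESearch request to learn the total hit count, then fetch each page in turn and concatenate the returned PMIDs.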

The parser works around the file size limitation without sacrificing performance: it parsed medline04n001.xml.gz, which contains 30 000 citations, in 82 s on a 500 MHz Pentium II machine with 384 MB of memory running Windows 2000 and Sun's JRE 1.4.2. This is done by combining a regular Java file reader with the open source XML libraries dom4j (MetaStuff Ltd., 2004, http://www.dom4j.org) and Piccolo (Yuval Oren, 2004, http://piccolo.sourceforge.net/). A MEDLINE/PubMed XML file is first opened as a plain text file by the regular file reader and read into memory in small chunks. A chunk of text, containing exactly one citation unit from start tag to end tag, is then passed to the XML parser. After the citation is processed, the next chunk is passed to the parser, and the previous one is discarded. Thus, the MedKit parser is able to process very large files in a stream-like fashion while retaining the convenience and flexibility of XML within a citation unit. This design is based on the observation that most MEDLINE/PubMed text-mining systems focus on the information contained within single citations; cross talk among citations is rare.
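The chunking idea can be sketched in a few lines of Java. This is not MedKit's code: it uses the JDK's built-in DOM parser in place of dom4j and Piccolo, the class ChunkedCitationReader and its methods are hypothetical names, and it assumes that each citation unit is delimited by MedlineCitation start and end tags. The reader buffers plain text, cuts out one citation at a time, parses only that fragment, and then discards it.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical sketch of citation-by-citation parsing; uses the JDK DOM parser, not dom4j/Piccolo. */
final class ChunkedCitationReader {

    static final String OPEN = "<MedlineCitation";     // assumed start of a citation unit
    static final String CLOSE = "</MedlineCitation>";  // assumed end of a citation unit

    /** Finds a real MedlineCitation start tag, skipping e.g. the MedlineCitationSet root tag. */
    static int findOpen(StringBuilder buf) {
        int i = buf.indexOf(OPEN);
        while (i != -1) {
            int next = i + OPEN.length();
            if (next < buf.length()
                    && (buf.charAt(next) == '>' || Character.isWhitespace(buf.charAt(next)))) {
                return i;
            }
            i = buf.indexOf(OPEN, i + 1);
        }
        return -1;
    }

    /** Streams through the input and returns the first PMID of every citation unit. */
    static List<String> extractPmids(Reader in) {
        List<String> pmids = new ArrayList<>();
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            StringBuilder buf = new StringBuilder();
            char[] block = new char[8192];
            int n;
            while ((n = in.read(block)) != -1) {
                buf.append(block, 0, n);
                int end;
                // Process every complete citation currently held in the buffer.
                while ((end = buf.indexOf(CLOSE)) != -1) {
                    int start = findOpen(buf);
                    if (start == -1 || start > end) {
                        throw new IllegalStateException("end tag without matching start tag");
                    }
                    String citation = buf.substring(start, end + CLOSE.length());
                    Document doc = builder.parse(new InputSource(new StringReader(citation)));
                    pmids.add(doc.getElementsByTagName("PMID").item(0).getTextContent().trim());
                    buf.delete(0, end + CLOSE.length());  // discard the processed chunk
                }
            }
        } catch (IllegalStateException e) {
            throw e;
        } catch (Exception e) {
            throw new IllegalStateException("could not read or parse citation", e);
        }
        return pmids;
    }
}
```

Because only one citation is held as a DOM tree at any time, memory use depends on the size of the largest citation rather than on the size of the file. For a compressed release file, the Reader could wrap a java.util.zip.GZIPInputStream.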

The sampler's random sampling algorithm is backed by the Colt distribution, open source libraries for high-performance scientific and technical computing in Java [European Organization for Nuclear Research (CERN), 2004, http://hoschek.home.cern.ch/hoschek/colt/].
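Drawing a random sample of PMIDs without replacement can be sketched as follows. The sketch deliberately substitutes java.util.Random and a partial Fisher-Yates shuffle for the Colt routines, whose exact use in MedKit is not described here; the class PmidSampler and the method sample are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch of sampling without replacement; uses java.util.Random, not Colt. */
final class PmidSampler {

    /** Returns k distinct entries of pmids, chosen uniformly at random. */
    static List<String> sample(List<String> pmids, int k, Random rng) {
        if (k < 0 || k > pmids.size()) {
            throw new IllegalArgumentException("k must be between 0 and " + pmids.size());
        }
        List<String> pool = new ArrayList<>(pmids);  // leave the caller's list untouched
        // Partial Fisher-Yates shuffle: only the first k positions need to be settled.
        for (int i = 0; i < k; i++) {
            int j = i + rng.nextInt(pool.size() - i);
            Collections.swap(pool, i, j);
        }
        return new ArrayList<>(pool.subList(0, k));
    }

    public static void main(String[] args) {
        List<String> pmids = new ArrayList<>();
        for (int i = 1; i <= 200; i++) {
            pmids.add(String.valueOf(10000000 + i));  // made-up PMIDs for illustration
        }
        System.out.println(sample(pmids, 5, new Random()));
    }
}
```

Passing a seeded Random makes a sample reproducible, which can be useful when an experiment has to be repeated on the same subset of citations.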

The package is freely available at: http://metnetdb.gdcb.iastate.edu/medkit.

Received on May 21, 2004; revised on September 13, 2004; accepted on October 7, 2004

    REFERENCES

    European Organization for Nuclear Research (CERN). (2004) The Colt Distribution: Open Source Libraries for High Performance Scientific and Technical Computing in Java.

    McCallum, A.K. (1996) Bow: A Toolkit for Statistical Language Modeling, Text Retrieval, Classification and Clustering.

    MetaStuff, Ltd. (2004) dom4j: the Flexible XML Framework for Java.

    National Library of Medicine. (2004a) Entrez Utilities.

    National Library of Medicine. (2004b) MEDLINE.

    National Library of Medicine. (2004c) PUBMED.

    Oren, Y. (2004) Piccolo XML parser.

