Bioinformatics Advance Access originally published online on July 1, 2008
Bioinformatics 2008 24(17):1935-1941; doi:10.1093/bioinformatics/btn318
PuReD-MCL: a graph-based PubMed document clustering methodology
1Department of Informatics, Aristotle University of Thessalonica, P.O. Box 54124, Thessalonica, Greece, 2Computational Genomics Unit, Institute of Agrobiotechnology, Centre for Research and Technology Hellas (CERTH), P.O. Box 361, GR–57001, Thessalonica, Greece and 3Centre for Bioinformatics, School of Physical Sciences & Engineering, King's College London, Strand, London WC2R 2LS, UK
*To whom correspondence should be addressed.
Abstract
Motivation: Biomedical literature is the principal repository of biomedical knowledge, with PubMed being the most complete database collecting, organizing and analyzing such textual knowledge. Numerous efforts attempt to exploit this information using text mining and machine learning techniques. We developed a novel approach, called PuReD-MCL (Pubmed Related Documents-MCL), which is based on the graph clustering algorithm MCL and relevant resources from PubMed.
Methods: PuReD-MCL avoids using natural language processing (NLP) techniques directly; instead, it takes advantage of existing resources, available from PubMed. PuReD-MCL then clusters documents efficiently using the MCL graph clustering algorithm, which is based on graph flow simulation. This process allows users to analyse the results by highlighting important clues, and finally to visualize the clusters and all relevant information using an interactive graph layout algorithm, for instance BioLayout Express 3D.
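To make the flow-simulation idea concrete, the following is a minimal Python sketch of the general MCL scheme (alternating expansion and inflation of a column-stochastic matrix). It is an illustration only, not the authors' Perl/R code or the reference MCL implementation. The function names, the added self-loops, the inflation value of 2.0, the convergence tolerance and the toy graph are all assumptions made for this example; in PuReD-MCL the graph would instead be derived from PubMed resources, which the sketch does not attempt to reproduce.

```python
from typing import FrozenSet, List, Tuple

Matrix = List[List[float]]


def build_adjacency(n: int, edges: List[Tuple[int, int]]) -> Matrix:
    """Build a symmetric, unweighted adjacency matrix (hypothetical helper)."""
    m = [[0.0] * n for _ in range(n)]
    for a, b in edges:
        m[a][b] = 1.0
        m[b][a] = 1.0
    return m


def normalize_columns(m: Matrix) -> Matrix:
    """Rescale each column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def expand(m: Matrix) -> Matrix:
    """Expansion: square the matrix, i.e. let flow take two-step walks."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m: Matrix, r: float) -> Matrix:
    """Inflation: raise entries to the power r, then renormalize columns.

    This strengthens strong flows and weakens weak ones.
    """
    return normalize_columns([[x ** r for x in row] for row in m])


def mcl(adjacency: Matrix, inflation: float = 2.0,
        max_iter: int = 100, tol: float = 1e-9) -> List[FrozenSet[int]]:
    """Cluster a graph by simulated flow (sketch; parameters are assumptions)."""
    n = len(adjacency)
    # Assumption: add self-loops, a common convention in MCL descriptions.
    m = normalize_columns(
        [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)])
    for _ in range(max_iter):
        nxt = inflate(expand(m), inflation)
        delta = max(abs(nxt[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = nxt
        if delta < tol:
            break
    # Rows that still carry flow belong to attractor nodes; the columns
    # with flow in such a row form one cluster.
    clusters: List[FrozenSet[int]] = []
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members and members not in clusters:
            clusters.append(members)
    return clusters


# Toy graph: two triangles of "documents" joined by a single edge (2-3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adjacency = build_adjacency(6, edges)
clusters = mcl(adjacency)
print(sorted(sorted(c) for c in clusters))
```

On this toy graph the flow across the single bridging edge decays with each inflation step, so the sketch separates the two triangles. A higher inflation value generally yields finer clusters, a lower one coarser clusters.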
Results: The methodology was applied to two different datasets, previously used for the validation of the document clustering tool TextQuest. The first dataset involves the organisms Escherichia coli and yeast, whereas the second is related to Drosophila development. PuReD-MCL successfully reproduces the annotated results obtained from TextQuest, while at the same time providing additional insights into the clusters and the corresponding documents.
Availability: Source code in Perl and R is available from http://tartara.csd.auth.gr/~theodos/
Contact: theodos{at}csd.auth.gr
Associate Editor: John Quackenbush
Received on November 12, 2007; revised on May 18, 2008; accepted on June 18, 2008