Bioinformatics Advance Access originally published online on July 26, 2006
Bioinformatics 2006 22(19):2444-2445; doi:10.1093/bioinformatics/btl408
ALIBABA: PubMed as a graph
1 Knowledge Management in Bioinformatics, Humboldt-Universität zu Berlin, Unter den Linden 6, 10099 Berlin, Germany
2 Department of Mathematics and Computer Science, Free University Berlin, Arnimallee 2-6, 14195 Berlin, Germany
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
The biomedical literature contains a wealth of information on associations between many different types of objects, such as protein–protein interactions, gene–disease associations and subcellular locations of proteins. When searching for such information with conventional search engines, e.g. PubMed, users see the data only one abstract at a time, hidden in natural language text. ALIBABA is an interactive tool for graphical summarization of search results. It parses the set of abstracts that match a PubMed query and presents extracted information on biomedical objects and their relationships as a graphical network. ALIBABA extracts associations between cells, diseases, drugs, proteins, species and tissues. Several filter options allow for a more focused search. Thus, researchers can grasp at a glance complex networks described across various articles.
Availability: http://alibaba.informatik.hu-berlin.de/
Contact: hakenberg{at}informatik.hu-berlin.de
| 1 INTRODUCTION |
|---|
Most information on biological entities and their interactions is available only in textual form. Since searching a text database is less precise than searching a structured database, many efforts are under way to automatically analyze texts to identify and extract relevant facts. The extracted information may be used to answer precise queries, to summarize multiple texts in a single representation and to connect to other sources of knowledge. Although the level of detail and accuracy of the extracted data currently cannot reach that of the original text, automatic information extraction is a valuable tool for navigating text databases and for offering quick overviews, both important tasks in many stages of research (Jensen et al., 2006).
Since the most complete source of citations in biomedicine is PubMed, a number of projects use PubMed abstracts as input for their analysis. iHOP offers access to the underlying literature by means of a network of co-occurring genes and proteins (Hoffmann and Valencia, 2005). Users access the information by searching for gene names. In contrast, ALIBABA evaluates arbitrary queries. EBIMed provides a quick overview of co-occurrences of a variety of entities: proteins, species, drugs and Gene Ontology (GO) terms. It searches all PubMed abstracts that match an arbitrary user query and presents the resulting associations in tabular form (Kirsch et al., 2005). ALIBABA provides a graphical view and offers more advanced association mining. GoPubMed searches for GO terms in PubMed abstracts and links them to the GO hierarchy, which can then be used to navigate the result set (Doms and Schroeder, 2005).
All of the above applications present their results in the form of hyperlinked texts or tables. With ALIBABA, we present a system that graphically visualizes information on associations between biological entities extracted from a PubMed search result. Another distinctive feature is its text mining method for identifying and classifying associations that yields higher precision than a pure collocation analysis.
ALIBABA uses Java Web Start to launch a client from any web browser. This client handles only the visualization of the network. It sends a user's query to the server, which forwards the query to PubMed, retrieves the matching abstracts and processes them. The server then returns the annotated abstracts to the client, which builds a network out of all annotations.
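The article does not say how the server talks to PubMed. As a minimal sketch of the query-forwarding step, assuming the public NCBI E-utilities ESearch endpoint is used, the following builds the request URL for a user query. The function name `build_pubmed_search_url` and the parameter `max_citations` are hypothetical; `db`, `term` and `retmax` are standard ESearch parameters.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint. That ALIBABA's server uses this
# endpoint is an assumption; the article only says queries go to PubMed.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_search_url(query: str, max_citations: int = 20) -> str:
    """Return an ESearch URL asking PubMed for up to max_citations PMIDs."""
    params = {
        "db": "pubmed",           # search the PubMed database
        "term": query,            # the user's query, passed through unchanged
        "retmax": max_citations,  # upper bound on the number of returned IDs
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_pubmed_search_url("p53 AND apoptosis", max_citations=50)
print(url)
```

Only the URL is constructed here; fetching the matching abstracts and annotating them would be separate server-side steps.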
| 2 USING ALIBABA |
|---|
ALIBABA's screen consists of three regions, as shown in Figure 1. The upper horizontal input field accepts queries to PubMed. As an option, the results can be limited to a maximum number of citations. The ordering within a result set is the same as retrieved from PubMed. It is also possible to append the results of a query to those of previous queries, which can be used to build a network incrementally.
The large window shows the graph that results from parsing the abstracts returned for the query. Nodes represent biological entities, with different colors for different classes. Edges represent associations between two entities. Whenever ALIBABA can assign a source/target dependency to a relation, it draws a directed edge, with the arrow pointing from source to target. Undirected edges represent associations for which ALIBABA could not identify such a dependency (Section 3). The gray value of an edge correlates with its assigned confidence score, where darker edges represent more confident relations. The search field at the bottom helps to find specific objects within the graph.
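The graph model described above can be sketched as a small data structure. This is an illustration only, not ALIBABA's actual (Java) implementation: the class names, the assumption that confidence scores lie in [0, 1], and the linear mapping to a gray level capped at 200 (so that weak edges stay visible on a white background) are all hypothetical choices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str
    entity_class: str  # e.g. "protein", "cell", "disease"


@dataclass(frozen=True)
class Association:
    source: Entity
    target: Entity
    confidence: float       # assumed to lie in [0, 1]
    directed: bool = False  # True if a source/target dependency was found


LIGHTEST_GRAY = 200  # hypothetical cap so low-confidence edges remain visible


def edge_gray(confidence: float) -> int:
    """Map a confidence score to a gray level (0 = black, darker = surer)."""
    clamped = min(max(confidence, 0.0), 1.0)
    return round(LIGHTEST_GRAY * (1.0 - clamped))


p53 = Entity("TP53", "protein")
mdm2 = Entity("MDM2", "protein")
edge = Association(source=mdm2, target=p53, confidence=0.9, directed=True)
print(edge_gray(edge.confidence))
```

A directed association would be drawn with an arrow from `source` to `target`; for an undirected one, the order of the two entities carries no meaning.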
The right-hand window contains more detailed information on all items in the graph. The upper part shows all extracted entities and their interaction partners categorized by biological classes (proteins, cells, tissues, etc.), arranged in a tree. After clicking a node, more information on this entity is shown in the lower part, including synonyms encountered in the abstracts with links to external databases (UniProt, MeSH, NCBI Taxonomy, MedlinePlus, PubMed), as well as sentences from abstracts mentioning this entity.
To view information on a specific relation, users select an associated partner of an entity from the upper tree view. The lower part then shows information on both entities and details of the selected relation, including its specific type and subtype (such as modification or activation). It also shows the textual evidence, i.e. all sentences found in the abstracts that discuss the selected relation. Relevant objects in the text are highlighted, and all sentences are linked back to the respective PubMed abstract.
Filtering result sets
The complete graph view can be altered by setting filter options in the preferences menu. Users can choose which entity classes to display, hide unconnected entities and set a minimum confidence value for visible associations. In addition, the user can choose to aggregate cells associated with proteins. Aggregated cells are visualized as bubbles containing their subcellular proteins, as shown in Figure 2. Furthermore, associations can be restricted to co-occurrences and/or matching patterns, where a pattern match overrides a co-occurrence between the same entities.
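The filter options above can be sketched as a single function over a node and edge list. The names (`Edge`, `filter_network`) and the data layout are hypothetical, and one detail is an assumption: a pattern match suppresses a co-occurrence edge between the same pair even if the pattern edge itself falls below the confidence threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    a: str
    b: str
    confidence: float
    method: str  # "pattern" or "cooccurrence"


def filter_network(nodes, edges, visible_classes,
                   min_confidence=0.0, hide_unconnected=False):
    """nodes maps entity name -> class; edges is a list of Edge objects."""
    # Keep only entities whose class the user chose to display.
    kept_nodes = {n: c for n, c in nodes.items() if c in visible_classes}
    # Entity pairs for which a pattern matched (order-independent).
    pattern_pairs = {frozenset((e.a, e.b)) for e in edges
                     if e.method == "pattern"}
    kept_edges = []
    for e in edges:
        if e.a not in kept_nodes or e.b not in kept_nodes:
            continue  # an endpoint was filtered out by class
        if e.confidence < min_confidence:
            continue  # below the minimum confidence value
        if e.method == "cooccurrence" and frozenset((e.a, e.b)) in pattern_pairs:
            continue  # a pattern match overrides the co-occurrence
        kept_edges.append(e)
    if hide_unconnected:
        connected = {n for e in kept_edges for n in (e.a, e.b)}
        kept_nodes = {n: c for n, c in kept_nodes.items() if n in connected}
    return kept_nodes, kept_edges


nodes = {"TP53": "protein", "MDM2": "protein",
         "aspirin": "drug", "HeLa": "cell"}
edges = [Edge("TP53", "MDM2", 0.9, "pattern"),
         Edge("TP53", "MDM2", 0.4, "cooccurrence"),
         Edge("TP53", "HeLa", 0.2, "cooccurrence")]
shown_nodes, shown_edges = filter_network(
    nodes, edges, {"protein", "cell"},
    min_confidence=0.3, hide_unconnected=True)
```

In this toy example only the pattern-based TP53/MDM2 edge and its two endpoints remain visible.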
| 3 INFORMATION EXTRACTION |
|---|
ALIBABA uses a dictionary-based approach for recognizing biomedical objects (Kirsch et al., 2005). Dictionaries consist of regular expressions describing terms and their spelling variations. We collected the dictionaries from different sources (the aforementioned databases). To find associations between entities, ALIBABA uses two different techniques in parallel: pattern matching and co-occurrence filtering. Pattern matching uses language patterns extracted from annotated, task-specific corpora (Hakenberg et al., 2005). Such language patterns resemble regular expressions over tokens, part-of-speech tags and entity classes. The pattern matching algorithm also provides a confidence score for each relation, depending on the quality of the match between the sentence and a pattern. Furthermore, it identifies the type of the association and, in many cases, the direction. ALIBABA uses such patterns to extract protein–protein interactions and cellular locations of proteins. The extraction module achieves an F1-measure of 61% (maximum recall of 52% at 75% precision), as evaluated on the SPIES corpus (Hao et al., 2005). To ensure higher recall, we also search for co-occurring entities, i.e. entities mentioned in the same sentence. ALIBABA currently finds associations between two proteins (or genes), proteins and cells, diseases, species and tissues, as well as drugs and diseases. On average, ALIBABA parses one abstract per second.
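Dictionary-based recognition followed by sentence-level co-occurrence can be sketched in a few lines. The dictionaries below are toy examples invented for illustration, not ALIBABA's actual dictionaries, and the sentence splitter is deliberately naive (it would also split after abbreviations such as "e.g."). The pattern-matching technique with part-of-speech tags is not covered here.

```python
import re
from itertools import combinations

# Toy dictionaries (hypothetical): class -> canonical name -> regular
# expression covering spelling variants of that name.
DICTIONARIES = {
    "protein": {
        "TP53": r"\b(?:TP53|p53)\b",
        "MDM2": r"\bMdm-?2\b",
    },
    "disease": {
        "breast cancer": r"\bbreast\s+(?:cancer|carcinoma)\b",
    },
}


def find_entities(sentence):
    """Return sorted (canonical name, class) pairs mentioned in sentence."""
    found = set()
    for entity_class, entries in DICTIONARIES.items():
        for name, pattern in entries.items():
            if re.search(pattern, sentence, flags=re.IGNORECASE):
                found.add((name, entity_class))
    return sorted(found)


def cooccurrences(text):
    """Return all pairs of entities that share a sentence in text."""
    pairs = []
    # Naive split: break after '.', '!' or '?' when followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        pairs.extend(combinations(find_entities(sentence), 2))
    return pairs


text = ("p53 is inhibited by Mdm2. "
        "Breast carcinoma often shows TP53 mutations.")
pairs = cooccurrences(text)
```

Each returned pair is an undirected candidate association; the approach cannot assign a type or direction, which is why pattern matches take precedence where both exist.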
| 4 DISCUSSION |
|---|
ALIBABA is an easy-to-use application for browsing biological networks extracted on the fly from the results of PubMed queries. We chose to forward queries to PubMed because of its elaborate search options and the familiarity of many biologists with PubMed searches. The main features of our system are (1) the automatic summarization of information from all abstracts matching a query into a network; (2) the graphical and interactive display of the extracted information; (3) the provision of links to external databases; and (4) text mining methods more sophisticated than co-occurrence filtering. All these features build on an information extraction module that uses dictionaries and pattern matching to identify a variety of biological entities and the relations between them. To cope with the sometimes erroneous results of the extraction module, users can filter results based on objects and relationships, confidence scores and extraction methods.
| Acknowledgments |
|---|
The Knowledge Management in Bioinformatics Group is a member of the Berlin Center for Genome Based Bioinformatics (BCB). This work is supported by the German Federal Ministry of Education and Research (BMBF) under grant contract 0312705B.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Satoru Miyano
Received on April 28, 2006; revised on May 31, 2006; accepted on July 22, 2006
| REFERENCES |
|---|
Doms, A. and Schroeder, M. (2005) GoPubMed: exploring PubMed with the Gene Ontology. Nucleic Acids Res., 33, W783–W786.
Hakenberg, J., Plake, C., Leser, U., Kirsch, H. and Rebholz-Schuhmann, D. (2005) Genic interaction extraction with alignments and finite state automata. Proceedings of the Learning Language in Logic Workshop (LLL'05), Bonn, Germany.
Hao, Y., et al. (2005) Discovering patterns to extract protein–protein interactions from the literature: Part II. Bioinformatics, 21, 3294–3300.
Hoffmann, R. and Valencia, A. (2005) Implementing the iHOP concept for navigation of biomedical literature. Bioinformatics, 21, ii252–ii258.
Jensen, L.J., et al. (2006) Literature mining for the biologist: from information retrieval to biological discovery. Nat. Rev. Genet., 7, 119–129.
Kirsch, H., et al. (2005) Distributed modules for text annotation and IE applied to the biomedical domain. Int. J. Med. Inform., 75, 496–500.