Bioinformatics Advance Access originally published online on July 26, 2006
Bioinformatics 2006 22(19):2444-2445; doi:10.1093/bioinformatics/btl408

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

ALIBABA: PubMed as a graph

Conrad Plake 1, Torsten Schiemann 1, Marcus Pankalla 2, Jörg Hakenberg 1,* and Ulf Leser 1

1 Knowledge Management in Bioinformatics, Humboldt-Universität zu Berlin, Unter den Linden 6, 10099 Berlin, Germany
2 Department of Mathematics and Computer Science, Free University Berlin, Arnimallee 2-6, 14195 Berlin, Germany

*To whom correspondence should be addressed.


    ABSTRACT
 

The biomedical literature contains a wealth of information on associations between many different types of objects, such as protein–protein interactions, gene–disease associations and subcellular locations of proteins. When searching for such information using conventional search engines, e.g. PubMed, users see the data only one abstract at a time and ‘hidden’ in natural language text. ALIBABA is an interactive tool for graphical summarization of search results. It parses the set of abstracts that fit a PubMed query and presents extracted information on biomedical objects and their relationships as a graphical network. ALIBABA extracts associations between cells, diseases, drugs, proteins, species and tissues. Several filter options allow for a more focused search. Thus, researchers can grasp complex networks described in various articles at a glance.

Availability: http://alibaba.informatik.hu-berlin.de/

Contact: hakenberg{at}informatik.hu-berlin.de


    1 INTRODUCTION
 
Most information on biological entities and their interactions is available only in textual form. Since searching a text database is less precise than searching a structured database, many efforts are under way to automatically analyze texts to identify and extract relevant facts. The extracted information may be used to answer precise queries, to summarize multiple texts in a single representation and to connect to other sources of knowledge. Although the level of detail and accuracy of the extracted data currently cannot reach that of the original text, automatic information extraction is a valuable tool for navigating text databases and for offering quick overviews, both important tasks in many stages of research (Jensen et al., 2006).

Since the most complete source of citations in biomedicine is PubMed, a number of projects use PubMed abstracts as input for their analysis. iHOP offers access to the underlying literature by means of a network of co-occurring genes and proteins (Hoffmann and Valencia, 2005). Users access the information by searching for gene names. In contrast, ALIBABA evaluates arbitrary queries. EBIMed provides a quick overview of co-occurrences of a variety of entities: proteins, species, drugs and Gene Ontology (GO) terms. It searches all PubMed abstracts that fit an arbitrary user query and presents the resulting associations in tabular form (Kirsch et al., 2005). ALIBABA provides a graphical view and offers more advanced association mining. GoPubMed searches for GO terms in PubMed abstracts and links them to the GO hierarchy, which can then be used to navigate the result set (Doms and Schroeder, 2005).

All of the above applications present their results in the form of hyperlinked texts or tables. With ALIBABA, we present a system that graphically visualizes information on associations between biological entities extracted from a PubMed search result. Another distinctive feature is its text mining method for identifying and classifying associations that yields higher precision than a pure collocation analysis.

ALIBABA uses Java Web Start to launch a client from any web browser. This client handles only the visualization of the network. It sends a user's query to the server, which forwards the query to PubMed, retrieves the matching abstracts and processes them. The server then returns the annotated abstracts to the client, which builds a network out of all annotations.
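The last step, in which the client merges the annotations of all returned abstracts into one network, can be illustrated with a minimal Python sketch. This is not ALIBABA's actual (Java) client code: the `Network` class, the `build_network` function and the dictionary layout of an annotated abstract (`pmid`, `entities`, `associations`) are hypothetical names chosen for illustration only.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Network:
    # Entity name -> entity class (e.g. "protein", "disease").
    nodes: dict = field(default_factory=dict)
    # Unordered entity pair -> list of (pmid, sentence) evidence entries.
    edges: dict = field(default_factory=lambda: defaultdict(list))

    def add_abstract(self, annotated):
        """Merge the annotations of one abstract into the network."""
        for name, entity_class in annotated["entities"]:
            # Keep the class seen first if an entity appears in several abstracts.
            self.nodes.setdefault(name, entity_class)
        for a, b, sentence in annotated["associations"]:
            # The same pair found in several abstracts accumulates evidence
            # on a single edge instead of creating parallel edges.
            self.edges[frozenset((a, b))].append((annotated["pmid"], sentence))


def build_network(annotated_abstracts, network=None):
    """Build a new network, or append to an existing one (incremental use)."""
    if network is None:
        network = Network()
    for annotated in annotated_abstracts:
        network.add_abstract(annotated)
    return network
```

Passing an existing `Network` as the second argument mimics appending the results of a new query to those of earlier queries; the evidence list on each edge is what a client would need to show the supporting sentences with links back to their abstracts.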


    2 USING ALIBABA
 
ALIBABA's screen consists of three regions, as shown in Figure 1. The upper horizontal input field accepts queries to PubMed. As an option, the results can be limited to a maximum number of citations. The ordering within a result set is the same as retrieved from PubMed. It is also possible to append the results of a query to those of previous queries, which can be used to build a network incrementally.


 
Fig. 1. Part of the graph resulting from five PubMed abstracts for the query ‘FADD’. Information on the selected protein ‘caspase-8’, for instance association partners and evidence texts, is given in the right panel.

 
The large window shows the graph that results from parsing the abstracts returned for the query. Nodes represent biological entities, with different colors for different classes. Edges represent associations between two entities. Whenever ALIBABA is able to assign a source/target dependency to a relation, the edge is directed, with the arrow pointing from source to target. Undirected edges represent associations for which ALIBABA could not identify such a dependency (Section 3). The gray value of an edge correlates with its assigned confidence score: darker edges represent more confident relations. The search field at the bottom helps to find specific objects within the graph.
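The edge rendering described here can be sketched in a few lines of Python. The text only states that the gray value correlates with the confidence score, so the linear mapping, the assumed score range of 0 to 1 and the names `Edge`, `gray_level` and `describe` are illustrative assumptions rather than ALIBABA's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    confidence: float  # assumed to lie in [0, 1]
    directed: bool     # False when no source/target dependency was found


def gray_level(confidence, lightest=200):
    """Map a confidence score to an 8-bit gray value.

    0 is black, so a higher confidence yields a darker edge. The linear
    mapping and the `lightest` cut-off are assumptions of this sketch.
    """
    clamped = min(max(confidence, 0.0), 1.0)
    return round(lightest * (1.0 - clamped))


def describe(edge):
    """Return a textual rendering: '->' for directed, '--' for undirected."""
    arrow = "->" if edge.directed else "--"
    g = gray_level(edge.confidence)
    return f"{edge.source} {arrow} {edge.target} (gray #{g:02x}{g:02x}{g:02x})"
```

For example, `describe(Edge("FADD", "caspase-8", 1.0, True))` renders a fully confident, directed relation as a black arrow, while a score of 0 maps to the lightest gray.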

The right-hand window contains more detailed information on all items in the graph. The upper part shows all extracted entities and their interaction partners categorized by biological classes (proteins, cells, tissues, etc.), arranged in a tree. After clicking a node, more information on this entity is shown in the lower part, including synonyms encountered in the abstracts with links to external databases (UniProt, MeSH, NCBI Taxonomy, MedlinePlus, PubMed), as well as sentences from abstracts mentioning this entity.

To view information on a specific relation, users select an associated partner of an entity from the upper tree view. The lower part then shows information on both entities and detailed information on the selected relation, including its specific type and subtype (such as modification or activation). It also shows the textual evidence, i.e. all sentences found in the abstracts that discuss the selected relation. Relevant objects in the text are highlighted, and all sentences are back-linked to the respective PubMed abstract.

Filtering result sets
The complete graph view can be altered by setting filter options in the preferences menu. Users can choose which entity classes to display, hide unconnected entities and set a minimum confidence value for visible associations. In addition, the user can choose to aggregate cells associated with proteins. Aggregated cells are visualized as bubbles containing their subcellular proteins, as shown in Figure 2. Furthermore, associations can be restricted to co-occurrences and/or matching patterns; when both exist between the same pair of entities, the pattern match overrides the co-occurrence.
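A minimal Python sketch of these filter options follows. It is a sketch under stated assumptions, not the tool's code: the `Association` record, the `filter_network` function and its parameter names are hypothetical, and the sketch assumes that co-occurrences also carry a confidence score, which the text does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Association:
    a: str
    b: str
    method: str        # "pattern" or "cooccurrence"
    confidence: float  # assumed to be available for both methods


def filter_network(entities, associations, classes, min_confidence=0.0,
                   methods=("pattern", "cooccurrence"), hide_unconnected=False):
    """Apply display filters; `entities` maps entity name -> entity class."""
    # 1. Keep only entities of the selected classes.
    visible = {name: cls for name, cls in entities.items() if cls in classes}

    # 2. Keep associations of the selected methods above the threshold,
    #    at most one per entity pair.
    best = {}
    for assoc in associations:
        if assoc.method not in methods or assoc.confidence < min_confidence:
            continue
        if assoc.a not in visible or assoc.b not in visible:
            continue
        key = frozenset((assoc.a, assoc.b))
        current = best.get(key)
        # A pattern match overrides a co-occurrence between the same pair;
        # within one method, the more confident association wins.
        if (current is None
                or (assoc.method == "pattern" and current.method != "pattern")
                or (assoc.method == current.method
                    and assoc.confidence > current.confidence)):
            best[key] = assoc

    # 3. Optionally drop entities that are left without any association.
    if hide_unconnected:
        connected = {name for key in best for name in key}
        visible = {n: c for n, c in visible.items() if n in connected}
    return visible, list(best.values())
```

Applying the confidence threshold before the override step is a design choice of this sketch: a low-scoring pattern match then cannot hide a co-occurrence that passes the threshold.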


 
Fig. 2. A more complex query for ‘ifn gamma signaling’. Cells are represented as bubbles that contain their respective associated partners. For more examples, including explanations, see the AliBaba website.

 

    3 INFORMATION EXTRACTION
 
ALIBABA uses a dictionary-based approach for recognizing biomedical objects (Kirsch et al., 2005). Dictionaries consist of regular expressions describing terms and their spelling variations. We collected the dictionaries from different sources (the databases mentioned above). To find associations between entities, ALIBABA uses two different techniques in parallel: pattern matching and co-occurrence filtering. Pattern matching uses language patterns extracted from annotated, task-specific corpora (Hakenberg et al., 2005). Such language patterns resemble regular expressions over tokens, part-of-speech tags and entity classes. The pattern matching algorithm also provides a confidence score for each relation, depending on the quality of the match between the sentence and a pattern. Furthermore, it identifies the type of the association and, in many cases, the direction. ALIBABA uses such patterns to extract protein–protein interactions and cellular locations of proteins. The extraction module achieves an F1-measure of 61% (maximum recall of 52% at 75% precision), as evaluated on the SPIES corpus (Hao et al., 2005). To ensure higher recall, we also search for co-occurring entities, i.e. entities mentioned in the same sentence. ALIBABA currently finds associations between two proteins (or genes); between proteins and cells, diseases, species or tissues; and between drugs and diseases. On average, ALIBABA parses one abstract per second.
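The dictionary lookup and the sentence-level co-occurrence search can be illustrated with the following minimal Python sketch. The two tiny dictionaries, the naive sentence splitter and the function names `find_entities` and `cooccurrences` are assumptions made for this example; ALIBABA's real dictionaries are far larger, and its pattern matching over tokens and part-of-speech tags is not shown here.

```python
import re
from itertools import combinations

# Hypothetical mini-dictionaries: each entry is a regular expression that
# covers a term together with some of its spelling variations.
DICTIONARIES = {
    "protein": [
        re.compile(r"\bcaspase[- ]?8\b", re.IGNORECASE),
        re.compile(r"\bFADD\b"),
    ],
    "disease": [
        re.compile(r"\bleuka?emia\b", re.IGNORECASE),
    ],
}


def find_entities(sentence):
    """Return (matched text, entity class) pairs in order of appearance."""
    hits = []
    for entity_class, patterns in DICTIONARIES.items():
        for pattern in patterns:
            for match in pattern.finditer(sentence):
                hits.append((match.start(), match.group(0), entity_class))
    return [(text, cls) for _, text, cls in sorted(hits)]


def cooccurrences(abstract):
    """Return (entity, entity, sentence) for entities sharing a sentence."""
    # Naive splitter: breaks after '.', '!' or '?' followed by whitespace,
    # so abbreviations such as 'e.g.' would be split incorrectly.
    sentences = re.split(r"(?<=[.!?])\s+", abstract.strip())
    pairs = []
    for sentence in sentences:
        entities = find_entities(sentence)
        for (a, _), (b, _) in combinations(entities, 2):
            if a.lower() != b.lower():  # skip repeated mentions of one term
                pairs.append((a, b, sentence))
    return pairs
```

Every pair returned by `cooccurrences` is only a candidate association with no type or direction, which is why co-occurrence raises recall while the pattern-based method is the one that supplies types, directions and confidence scores.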


    4 DISCUSSION
 
ALIBABA is an easy-to-use application for browsing biological networks extracted on-the-fly from the results of PubMed queries. We chose to forward queries to PubMed because of its elaborate search options and because many biologists are familiar with PubMed searches. The main features of our system are (1) the automatic summarization of information from all abstracts matching a query into a network; (2) the graphical and interactive display of the extracted information; (3) the provision of links to external databases; and (4) text mining methods more sophisticated than co-occurrence filtering. All these features build on an information extraction module that uses dictionaries and pattern matching to identify a variety of biological entities and the relations between them. To compensate for the sometimes erroneous results of the extraction module, users can filter results based on objects and relationships, confidence scores and extraction methods.


    Acknowledgments
 
The Knowledge Management in Bioinformatics Group is a member of the Berlin Center for Genome Based Bioinformatics (BCB). This work is supported by the German Federal Ministry of Education and Research (BMBF) under grant contract 0312705B.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Satoru Miyano

Received on April 28, 2006; revised on May 31, 2006; accepted on July 22, 2006

    REFERENCES
 

    Doms, A. and Schroeder, M. (2005) GoPubMed: exploring PubMed with the Gene Ontology. Nucleic Acids Res., 33, W783–W786.

    Hakenberg, J., Plake, C., Leser, U., Kirsch, H., Rebholz-Schuhmann, D. (2005) Genic interaction extraction with alignments and finite state automata. Proceedings of the Learning Language in Logic Workshop (LLL'05), Bonn, Germany.

    Hao, Y., et al. (2005) Discovering patterns to extract protein–protein interactions from the literature: Part II. Bioinformatics, 21, 3294–3300.

    Hoffmann, R. and Valencia, A. (2005) Implementing the iHOP concept for navigation of biomedical literature. Bioinformatics, 21, ii252–ii258.

    Jensen, L.J., et al. (2006) Literature mining for the biologist: from information retrieval to biological discovery. Nat. Rev. Genet., 7, 119–129.

    Kirsch, H., et al. (2005) Distributed modules for text annotation and IE applied to the biomedical domain. Int. J. Med. Inform., 75, 496–500.

