Bioinformatics Advance Access originally published online on March 7, 2007
Bioinformatics 2007 23(10):1301-1303; doi:10.1093/bioinformatics/btm088

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

BioGuideSRS: querying multiple sources with a user-centric perspective

Sarah Cohen-Boulakia 1,2,*, Olivier Biton 2, Susan Davidson 2 and Christine Froidevaux 2

1 Laboratoire de Recherche en Informatique, CNRS UMR 8023, Université Paris-Sud XI, 91405 Orsay, France and 2 Department of Computer and Information Science, University of Pennsylvania, 3330 Walnut St, PA-19104, Philadelphia, USA

*To whom correspondence should be addressed.


    ABSTRACT

Summary: Biologists frequently need to integrate information from multiple heterogeneous sources with their own experimental data. Given the large number of public sources, it is difficult to choose which sources to integrate without assistance. When doing this manually, biologists differ in their preferences concerning the sources to be queried as well as in their strategies, i.e. the querying process they follow for navigating through the sources. In response to these findings, we have developed BioGuide to assist scientists in searching for relevant data within external sources while taking their preferences and strategies into account. In this article, we present BioGuideSRS, a user-friendly system which automatically retrieves instances of data by using BioGuide on top of the Sequence Retrieval System (SRS). BioGuideSRS is an applet that can be run from its web page on any system with Java 5.0.

Availability: http://www.bioguide-project.net

Contact: sarahcb{at}seas.upenn.edu


    1 INTRODUCTION
To enable scientific discovery, biological data coming from multiple heterogeneous sources must be combined. When doing this manually, scientists exhibit preferences concerning the sources and the cross-references to use; e.g. they trust a source that is highly curated more than one that is not. Scientists also follow different strategies or querying processes when they navigate through the sources, depending on the kind of answer they are interested in.

Over the past ten years, there has been an exponential increase in the number of public biological sources (Galperin, 2007), and manually choosing which sources to use has become an overwhelming task. BioGuide (Cohen-Boulakia et al., 2005) was therefore designed to assist scientists with data searching, taking into account their preferences and query strategies. BioGuide generates a set of paths to be followed between sources, i.e. a ranked list of sequences of sources and links that can be used to answer a given query. In this article, we introduce BioGuideSRS, which places BioGuide on top of the popular Sequence Retrieval System (SRS; Etzold et al., 1996) to automatically provide instances of data.


    2 MAIN FUNCTIONALITIES
BioGuideSRS's graphical user interface is shown in part (A) of Figure 1. The BioGuideSRS framework consists of a high-level, semantic view of the scientific domain called the Entity graph, as well as a model of the available data sources called the Source-Entity graph. The Entity graph consists of biological entities (e.g. Gene, Disease) and relationships between them (e.g. causes). This graph is then mapped to the Source-Entity graph, which consists of linked data sources (e.g. EntrezGene, Online Mendelian Inheritance in Man (OMIM)) that provide information about the entities of interest and implement the relationships (e.g. EntrezGene provides information about Genes and has a cross-reference (CrossRef) to OMIM implementing causes).
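As a rough illustration of this two-level model, the following Python sketch encodes a toy Entity graph and Source-Entity graph using only the example given above (Gene, Disease, EntrezGene, OMIM and the causes relationship). All variable and function names are hypothetical, and the plain-dictionary layout is an assumption made for readability; it is not BioGuideSRS's internal representation.

```python
# Entity graph: pairs of biological entities linked by a named relationship.
ENTITY_GRAPH = {
    ("Gene", "Disease"): "causes",
}

# Source-Entity graph nodes: each node is a (source, entity) pair.
SOURCE_ENTITIES = [
    ("EntrezGene", "Gene"),
    ("OMIM", "Disease"),
]

# Cross-references between source-entity nodes, each tagged with the
# Entity-graph relationship that it implements.
CROSS_REFS = [
    (("EntrezGene", "Gene"), ("OMIM", "Disease"), "causes"),
]


def sources_for_entity(entity):
    """Return the sources that provide information about an entity."""
    return sorted(src for src, ent in SOURCE_ENTITIES if ent == entity)


def cross_refs_for_relationship(relationship):
    """Return the cross-references that implement a relationship."""
    return [(a, b) for a, b, rel in CROSS_REFS if rel == relationship]
```

With this layout, looking up the sources behind an entity or the cross-references behind a relationship is a simple scan over the mapping, which is the kind of lookup that browsing the Entity graph relies on.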


Fig. 1. (A) BioGuideSRS main graphical interface; (B) list of paths obtained and (C) data obtained from SRS for the first path.

 
2.1 Browsing
By clicking on an entity or relationship in the Entity graph, users can determine which sources implement the entity or are involved in the relationship as well as the cross-references they share. For example, in Figure 1A, the user has selected the ‘causes’ relationship (left-hand side) and can visualize the network formed by cross-references between sources providing information about genes and diseases (right-hand side).

2.2 Basic querying
Users pose queries over the Entity graph by double-clicking on entities and, optionally, on the relationships between them, and by specifying keywords to be searched for each given entity. They are then given a set of ranked paths in the Source-Entity graph representing alternative ways of implementing their query (Fig. 1B). For example, the following question, Q1: What information may I get about narcolepsy and the genes related to this disease?, can be expressed by selecting two entities (GENE and DISEASE, in orange in Fig. 1A) and by specifying ‘narcolepsy’ as a keyword to be searched for the DISEASE entity (the entity name then turns yellow, as Disease in Fig. 1A).

2.3 Filtering and ranking
Since numerous alternative combinations of sources-entities and links can be returned, BioGuide provides advanced functionalities to filter and rank the paths. Default settings are provided based on the most frequent choices of our users.

First, BioGuide provides strategy criteria, which are alternative approaches for characterizing source-entity paths corresponding to selected entities. The strategy criteria can be selected through the graphical user interface (see top of Fig. 1A), and their combination forms the query strategy. Users can specify whether or not they want to (i) follow an order on the entities (ordered entities); (ii) explore other, unspecified entities (only given entities) and (iii) visit a source more than once (source once for all). Selecting one or several criteria ensures that only paths which meet the criteria are returned as a result. These criteria were identified during the user requirements study we conducted on biological data source browsing (Cohen-Boulakia et al., 2005).
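To make the effect of the three strategy criteria concrete, here is a minimal Python sketch of path enumeration over a toy Source-Entity graph. It is not the BioGuide algorithm: the depth-first search, the `max_links` bound, the restriction to paths that never repeat a node, and the reading of 'source once for all' as 'no source appears twice in a path' are all simplifying assumptions, and the links in `LINKS` are illustrative.

```python
def enumerate_paths(links, query, ordered=False, only_given=False,
                    source_once=False, max_links=3):
    """List paths of (source, entity) nodes that cover every queried entity."""
    nodes = set(links) | {n for targets in links.values() for n in targets}
    results = []

    def accept(path):
        # A path must start and end on a queried entity.
        if path[0][1] not in query or path[-1][1] not in query:
            return False
        seen = []  # queried entities in order of first appearance
        for _, entity in path:
            if entity in query and entity not in seen:
                seen.append(entity)
        if set(seen) != set(query):
            return False
        return seen == list(query) if ordered else True

    def extend(path):
        if accept(path):
            results.append(tuple(path))
        if len(path) - 1 >= max_links:
            return
        for nxt in links.get(path[-1], []):
            if nxt in path:
                continue  # never revisit the same node
            if only_given and nxt[1] not in query:
                continue  # no intermediate, unspecified entities
            if source_once and any(nxt[0] == src for src, _ in path):
                continue  # simplified reading of "source once for all"
            path.append(nxt)
            extend(path)
            path.pop()

    for start in sorted(nodes):
        if start[1] in query:
            extend([start])
    return results


# Toy graph (illustrative links only).
LINKS = {
    ("EntrezGene", "Gene"): [("OMIM", "Disease"), ("SwissProt", "Protein")],
    ("SwissProt", "Protein"): [("OMIM", "Disease")],
    ("OMIM", "Disease"): [("EntrezGene", "Gene")],
}
QUERY = ("Gene", "Disease")

all_paths = enumerate_paths(LINKS, QUERY)
strict_paths = enumerate_paths(LINKS, QUERY, ordered=True, only_given=True)
```

On this toy graph, the unconstrained strategy yields three paths, whereas requiring ordered entities and only given entities leaves the single direct path from EntrezGene to OMIM.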

Second, BioGuide considers the user's preferences about the sources to be used. Preference values are used by BioGuideSRS to rank as well as filter the answers according to the wishes of the user (Filters and Sort menus). Examples of preference filters include ‘no more than three cross-references must be followed per path’ and ‘only reliable sources should be consulted’. BioGuideSRS also helps scientists quantify the confidence they have in the sources by providing additional interfaces to adapt the preference values to their needs (Preferences menu, Fig. 1A).

2.4 Adapting BioGuideSRS
BioGuideSRS can be customized by each user: modifying preference values, adding new kinds of preferences, adding/removing/modifying links (relationships and cross-references) and nodes (entities and sources) of the Entity and Source-Entity graphs (Model menu). The resulting configuration can then be saved to an XML file (File menu) for future use, and exchanged between users (see BioGuideSRS user manual for more information).


    3 BENEFIT OF USING ALTERNATIVE PATHS
We present here results obtained for the query Q1. Assume that the user keeps the default strategy (i.e. considering every ordering between entities, allowing intermediate entities and allowing each source to be visited several times), and specifies the following preference filters: no more than three cross-references must be followed per path; only very reliable sources should be consulted (reliability level higher than 7); and only complete sources should be considered for the gene entity (completeness level higher than 5). As a consequence, five alternative paths are found by BioGuideSRS (Fig. 1B). Each path describes which source can be queried to provide a given entity and which cross-reference can be followed. As an example, (1) indicates that gene and disease information can be found in EntrezGene (EntrezGene_Gene) and OMIM (OMIM_Disease), respectively. A cross-reference linking these two sources can be followed.
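The three preference filters used in this example can be sketched in Python as follows. The thresholds (at most three cross-references, reliability higher than 7, gene completeness higher than 5) come from the text; the preference values themselves, the GenBank entry and the ranking key are invented for illustration and do not reflect the values shipped with BioGuideSRS.

```python
# Hypothetical preference values (numbers invented for illustration).
RELIABILITY = {"EntrezGene": 9, "OMIM": 9, "SwissProt": 10, "GenBank": 6}
COMPLETENESS = {("EntrezGene", "Gene"): 8, ("GenBank", "Gene"): 4}


def passes_filters(path, max_cross_refs=3, min_reliability=7,
                   min_gene_completeness=5):
    """Apply the three preference filters of the example to one path."""
    if len(path) - 1 > max_cross_refs:
        return False
    for source, entity in path:
        # "Higher than" is read as a strict inequality.
        if RELIABILITY.get(source, 0) <= min_reliability:
            return False
        if (entity == "Gene"
                and COMPLETENESS.get((source, entity), 0) <= min_gene_completeness):
            return False
    return True


def rank(paths):
    """Assumed ranking: most reliable weakest source first, then shorter paths."""
    return sorted(
        paths,
        key=lambda p: (-min(RELIABILITY[src] for src, _ in p), len(p)),
    )


direct = (("EntrezGene", "Gene"), ("OMIM", "Disease"))
via_protein = (("EntrezGene", "Gene"), ("SwissProt", "Protein"), ("OMIM", "Disease"))
unreliable = (("GenBank", "Gene"), ("OMIM", "Disease"))

kept = rank([p for p in [via_protein, unreliable, direct] if passes_filters(p)])
```

Here the GenBank path is discarded because its made-up reliability and completeness values fall below the thresholds, and the two remaining paths are ordered with the shorter one first.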

By selecting each of these paths, the user obtains the corresponding instances of data (e.g. instances corresponding to path (1) are shown in Fig. 1C). Note that the user did not have to specify the sources to be queried nor indicate the links to be followed; paths were automatically generated. Obtaining instances of data from SRS for each path was also performed automatically by BioGuideSRS. Thus querying is automated from the beginning to the very end.

BioGuideSRS is a multi-strategy approach, in which complementary information is obtained and the scientist is guided in the analysis of the results. Continuing with our example, the entry giving precise knowledge about the general form of narcolepsy is found by path (3), which links genes to diseases by passing through the proteins of SwissProt, but not by path (1). BioGuide thus finds a rich set of information about the disease. On the other hand, path (5) provides a single entry, the HCRT gene, which is well known to be responsible for narcolepsy; the HCRT gene is also found by paths (2) and (3). Knowing that this entry is given by several reliable paths increases the confidence the user has in the results.
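The idea that an entry returned by several paths deserves more confidence can be sketched as a simple count of supporting paths. In the Python fragment below, the path numbers and the HCRT entry follow the narcolepsy example above, while `NARCOLEPSY_GENERAL_ENTRY` is a placeholder label (the text does not name that entry), results of paths (1) and (4) are omitted because their contents are not described, and the function name is hypothetical.

```python
from collections import defaultdict


def support_by_entry(results_by_path):
    """Map each retrieved entry to the sorted list of paths that returned it."""
    support = defaultdict(set)
    for path_id, entries in results_by_path.items():
        for entry in entries:
            support[entry].add(path_id)
    return {entry: sorted(ids) for entry, ids in support.items()}


# Partial transcription of the example; the second label is a placeholder.
RESULTS = {
    2: {"HCRT"},
    3: {"HCRT", "NARCOLEPSY_GENERAL_ENTRY"},
    5: {"HCRT"},
}

support = support_by_entry(RESULTS)
```

An entry backed by three paths, such as HCRT here, can then be presented with higher confidence than one reached by a single path, while single-path entries still contribute complementary information.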

Complete examples of use are provided on the BioGuideSRS site.


    4 CONCLUSION
BioGuideSRS is a path-based system (Cohen-Boulakia et al., 2006) in the same spirit as Biozon (Birkland and Yona, 2006) and BioNavigation (Lacroix et al., 2004). It is the first system to provide a multi-strategy approach, allowing various querying capabilities of path systems to be expressed, and implements on-the-fly queries using SRS rather than a warehouse.

BioGuideSRS has several advantages. First, queries phrased in terms of biological entities are posed through the user-friendly BioGuide interface. Second, preferences on the kind of sources to be accessed can be easily specified. Third, links between the sources are systematically followed according to the strategy of the user, so alternative and complementary ways of finding data are explored. Important information that may have been missed by a user following a single path can be found using multiple paths. Finally, an intermediate level between queries and data is offered: each path yields a given set of data, so the user always knows the origin of the data obtained (the sources and links followed), and paths can be explored one after the other following their ranked order (corresponding to their order of preference).

BioGuideSRS is available from its web site, which has been accessed by more than 1300 visitors since January 2006; several visitors have returned more than twice per month. Current BioGuideSRS users include members of the Children's Hospital of Philadelphia. The default configuration of BioGuideSRS (the design of the graphs, their mapping and the choice of preferences) was developed in close collaboration with its users. Currently, a new SRS source can easily be added to the system using the configuration file and the user interface; the user then has to map at least one entity to the source. In the future, BioGuideSRS may benefit from text-mining tools to automate the mapping between the graphs (mapping between entities/relationships and sources/cross-references).


    ACKNOWLEDGEMENTS
BioGuideSRS would not have come into being without the participation of many scientists, who are acknowledged on the web site. This research is supported by the National Science Foundation under Grants No. 0415810 and 0513778 (see footnote 1).

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Jonathan Wren

1 Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Received on January 15, 2007; revised on March 1, 2007; accepted on March 2, 2007

    REFERENCES

    Birkland A, Yona G. (2006) Biozon: a system for unification, management and analysis of heterogeneous biological data. BMC Bioinformatics, 7, 70.

    Cohen-Boulakia S, et al. (2005) A user-centric framework for accessing biological sources and tools. In: Proc. Data Integration for the Life Sciences (DILS), Lecture Notes in Bioinformatics, Vol. 3615. Springer-Verlag, pp. 3–18.

    Cohen-Boulakia S, et al. (2006) Path-based systems to guide scientists in the maze of biological data sources. Journal of Bioinformatics and Computational Biology (JBCB), 4, 1069–1095.

    Etzold T, et al. (1996) SRS: information retrieval system for molecular biology data banks. Methods Enzymol, 266, 114–128.

    Galperin Y. (2007) The molecular biology database collection: 2007 update. Nucleic Acids Res, 35, D3–D4.

    Lacroix Z, et al. (2004) Efficient techniques to explore and rank paths in life science data sources. In: Proc. Data Integration for the Life Sciences (DILS), Lecture Notes in Bioinformatics, Vol. 2994. Springer-Verlag, pp. 187–202.

