Bioinformatics Advance Access originally published online on February 25, 2007
Bioinformatics 2007 23(7):906-909; doi:10.1093/bioinformatics/btm031
© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Enabling high-throughput data management for systems biology: The Bioinformatics Resource Manager

Anuj R. Shah *, Mudita Singhal , Kyle R. Klicker , Eric G. Stephan , H. Steven Wiley and Katrina M. Waters

Biomolecular Systems Initiative, Pacific Northwest National Laboratory, Richland, WA, USA

*To whom correspondence should be addressed.


    ABSTRACT

Summary: The Bioinformatics Resource Manager (BRM) is a software environment that provides the user with data management, retrieval and integration capabilities. Designed in collaboration with biologists, BRM simplifies the mundane analysis tasks of merging microarray and proteomic data across platforms, facilitates integration of users’ data with functional annotation and interaction data from public sources, and provides connectivity to visual analytic tools through reformatting of the data for easy import or dynamic launching capability. BRM is developed using Java™ and other open-source technologies for free distribution.

Availability: BRM, sample data sets and a user manual can be downloaded from http://www.sysbio.org/dataresources/brm.stm

Contact: anuj.shah@pnl.gov, brm@pnl.gov


    1 INTRODUCTION
Present-day ‘omics’ technologies produce overwhelming amounts of data in a quest to measure multiple parts of a biological system simultaneously (mRNAs, proteins, metabolites, etc). The systems biologist needs tools not only to analyze his or her new data but also to integrate data from previous experiments and functional annotation from public data sources. In the past, software systems have attempted to integrate heterogeneous data at the semantic as well as the data source levels (Goesmann et al., 2003; Lee et al., 2006; Oikawa et al., 2004). Attempts have also been made to standardize the representation of biological data and workflows (Lu et al., 2006). These and other systems biology software solutions either focus on a very specialized workflow or are restricted to a particular set of data sources. The Bioinformatics Resource Manager (BRM) takes into account the disparate nature of biological data, and provides the experimental scientist with an environment suitable for collecting, integrating and mining high-throughput biological data in the context of multiple experiments and the huge array of publicly available data irrespective of the biological system. BRM incorporates emerging software technologies and concepts to facilitate analysis of high-throughput biological data at a systems level.


    2 BASIC FEATURES AND FUNCTIONALITY
Biologists expend a significant amount of time maintaining lists of thousands of gene or protein identifiers in spreadsheets on their file systems. To analyze data from a single experiment, functional annotations about each gene/protein are obtained from public data sources one at a time in order to mine the data for biological interpretation. Attempts to merge data sets across platforms are complicated by inconsistencies in identifiers from multiple sources and there is no mechanism to track the source of data in each column for a merged spreadsheet. Finally, most analytic tools used for data visualization require specialized formats for data import, creating more files for biologists to save and organize. BRM facilitates these routines, tracks data pedigree information and provides integration capabilities with visualization tools allowing scientists to efficiently analyze enormous amounts of data at their desktop. Current functionality includes:

  • Heterogeneous data integration: High-throughput data are retrieved in a common spreadsheet format regardless of the experimental platform, making it possible to merge and integrate the data within BRM. Data merging is facilitated across experiments and platforms via overlapping equivalent column values or internal cross-identifier mappings (e.g. mappings from protein to gene identifiers).
  • Simplified retrieval/mining of public data: Annotation data such as cross-referencing identifiers, pathway information or interacting proteins can be retrieved for long lists of gene and protein identifiers in high-throughput fashion. Currently we support the National Center for Biotechnology Information (NCBI) (Wheeler et al., 2006), the International Protein Index (Kersey et al., 2004), the Universal Protein Resource (Bairoch et al., 2005) and the Biomolecular Interaction Network Database (Bader and Hogue, 2000).
  • Storage of projects and data sets: Organized much like the experiments that generate them, each data set belongs to a project, and a project can have multiple sub-project levels for data management purposes. Data can be imported into BRM as either tab-delimited or comma-separated text files from the user's local machine.
  • Data reformatting: BRM has built-in data conversion tools that facilitate saving data sets on the user's local machine in a variety of formats. BRM exports simple interaction format files compatible with tools such as Cytoscape (Shannon et al., 2003) and PQuad (Havre et al., 2004), or simple delimited formats for upload into most spreadsheet and pathway tools.
  • Launching tools and web-based sources of public data: BRM can dynamically launch web searches of annotation sources such as Entrez Gene, Entrez Protein, TIGR annotations and PubMed, as well as the CDART and BLAST tools, using the selected identifiers. BRM also facilitates direct launching of Cytoscape to analyze interaction networks.
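
The data reformatting feature above can be illustrated with a minimal sketch of an export to the simple interaction format (SIF) read by Cytoscape, in which each line names a source node, an interaction type and a target node. This is not BRM's code: the function name `to_sif`, the column names and the `pp` interaction label are assumptions chosen for illustration.

```python
import csv
import io


def to_sif(rows, source_col, target_col, interaction="pp"):
    """Render interaction rows as SIF lines: source<TAB>type<TAB>target."""
    lines = []
    for row in rows:
        source = row[source_col].strip()
        target = row[target_col].strip()
        if source and target:  # skip rows with a missing partner
            lines.append(f"{source}\t{interaction}\t{target}")
    return "\n".join(lines)


# Hypothetical tab-delimited data set with two interacting-protein columns.
table = "protein_a\tprotein_b\nEGFR\tGRB2\nGRB2\tSOS1\nTP53\t\n"
rows = list(csv.DictReader(io.StringIO(table), delimiter="\t"))

sif_text = to_sif(rows, "protein_a", "protein_b")
print(sif_text)
```

Rows without an interaction partner are dropped here for simplicity; a real exporter might instead write them as unconnected nodes.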


    3 IMPLEMENTATION
BRM is implemented in Java™ and is based on a three-tier client-server architecture. Users install a client, implemented using Java Swing™, on their desktop computer; it provides interfaces to create and manage projects and data sets, retrieve external data from public sources and launch analysis tools. Figure 1 displays snapshots of the BRM user interface.


Fig. 1. The Bioinformatics Resource Manager Client. The Project Browser (top left) displays data sets with metadata information. The Data set Browser (bottom center) tracks provenance of data using colors to distinguish data sources. The data retrieval panels (top right) are used to retrieve data from external data sources.

 
The middle tier consists of the BRM server, developed using Enterprise JavaBeans 2.0 technology and residing in a JBoss application server. Connections to external data sources are pre-configured within the BRM server via the Java Database Connectivity (JDBC) Application Programming Interface (API) and to NCBI via the SeqHound (Michalickova et al., 2002) API. The BRM server is responsible for user authentication, data retrieval from external data sources and management of user sessions, while the client is used only as a front end for all such requests.

The third tier of the BRM architecture is a PostgreSQL (www.postgresql.org) database system. Tables representing users, projects, data sets and data-sources are accessed via the middle tier. All externally retrieved data is also stored along with metadata information in this persistent storage.

Data transfer between the client and the server, and among the individual server components, is kept to a minimum to counter the Internet bandwidth limitations and communication overhead common to integrative environments. Communication is achieved via short messages (Java Messaging API) containing metadata. Each server component can then retrieve data based on its metadata for further processing. A distributed caching data model is implemented in which clients receive data in chunks of 2000 rows.
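
The chunked delivery described above can be sketched as a client-side cache that requests a block of rows only the first time a row in that block is needed. This is an illustrative sketch, not BRM's implementation: the class `ChunkCache`, the `fetch_chunk` callback and the in-memory `server_rows` list standing in for the server are all assumptions; only the chunk size of 2000 rows comes from the text.

```python
CHUNK_SIZE = 2000  # chunk size stated in the text


class ChunkCache:
    """Client-side cache that fetches a chunk only on first access."""

    def __init__(self, fetch_chunk, chunk_size=CHUNK_SIZE):
        self._fetch_chunk = fetch_chunk  # callable(chunk_no, chunk_size) -> list
        self._chunk_size = chunk_size
        self._chunks = {}
        self.fetch_count = 0  # number of round trips made so far

    def row(self, index):
        chunk_no = index // self._chunk_size
        if chunk_no not in self._chunks:
            self._chunks[chunk_no] = self._fetch_chunk(chunk_no, self._chunk_size)
            self.fetch_count += 1
        return self._chunks[chunk_no][index % self._chunk_size]


# Stand-in for a data set held on the server (hypothetical data).
server_rows = [f"row-{i}" for i in range(5000)]


def fetch_chunk(chunk_no, chunk_size):
    start = chunk_no * chunk_size
    return server_rows[start:start + chunk_size]


cache = ChunkCache(fetch_chunk)
first = cache.row(0)
same_chunk = cache.row(1999)  # served from the chunk already cached
later = cache.row(4500)       # triggers a second fetch
print(first, same_chunk, later, cache.fetch_count)
```

Only the chunks that are actually viewed cross the network, which is the point of keeping client-server traffic small.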

Client-server systems require the client to be in sync with the server for optimal operation. BRM provides an automated version-checking mechanism at startup. A pop-up message informs the user of version mismatches and directs them to BRM's web page for updates.


    4 EXAMPLE WORKFLOW
In this section, a common user workflow for BRM is described, which utilizes many of its functionalities. Systems biology research often generates data sets from different experimental techniques that need to be merged but do not have common identifiers. The following steps outline the process of merging an Affymetrix microarray data set that has multiple embedded gene identifiers with a Fourier Transform Ion Cyclotron Resonance (FTICR) mass spectrometry proteomics data set that has IPI numbers for proteins. Figure 2 shows the process graphically. The biological significance of such an exercise is to determine if the proteins identified in an experiment were transcriptionally regulated through concordant or discordant mechanisms.


Fig. 2. Merging of a proteomics experiment file with a microarray experiment file. The figure outlines the steps needed in order to merge files with no common identifiers.

 
Step 1: Data import and extraction

Using the ‘Delimited Data Import’ function, the microarray and proteomics data sets are imported into BRM. Once imported, the regular-expression-based ‘Extract’ feature is used to extract the embedded gene symbols from the Affymetrix file, as shown in Figure 2-Step 1. The same process can be repeated to extract other gene identifiers from the Affymetrix file. The extracted identifiers are appended as additional columns to the data set.
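
A regular-expression extraction of this kind can be sketched as follows. The annotation layout (`/SYM=` and `/UG=` tags inside a description column), the identifiers and the helper `extract_column` are hypothetical and serve only to show how each extraction pass appends one new column; they do not reproduce BRM's ‘Extract’ feature or a real Affymetrix file.

```python
import re


def extract_column(rows, source_col, new_col, pattern):
    """Return copies of rows with the first regex capture added as new_col."""
    extended_rows = []
    for row in rows:
        match = pattern.search(row.get(source_col, ""))
        extended = dict(row)  # leave the input rows untouched
        extended[new_col] = match.group(1) if match else ""
        extended_rows.append(extended)
    return extended_rows


# Hypothetical rows with identifiers embedded in a free-text column.
rows = [
    {"probe_id": "probe_1", "description": "/DB=demo /SYM=EGFR /UG=Hs.0001"},
    {"probe_id": "probe_2", "description": "/DB=demo /UG=Hs.0002"},
]

# First pass extracts gene symbols, second pass extracts a second identifier.
with_symbols = extract_column(
    rows, "description", "gene_symbol", re.compile(r"/SYM=(\w+)")
)
with_unigene = extract_column(
    with_symbols, "description", "unigene", re.compile(r"/UG=(Hs\.\d+)")
)
print(with_unigene)
```

Rows in which the pattern does not occur receive an empty value, so the new column stays aligned with the rest of the data set.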

Step 2: Adding cross-reference identifiers

The cross-reference functionality is now used to obtain gene identifiers for the IPI numbers in the FTICR file. As shown in Figure 2-Step 2, an internal conversion table is used to retrieve gene symbols and UniGene numbers by selecting the ‘Add Cross-Reference Identifiers’ option from the File menu. Also shown is the ‘Identifier Help’ interface with examples of gene and protein identifiers in sample eukaryotic and prokaryotic organisms.
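
Conceptually, adding cross-reference identifiers is a table lookup keyed on the existing identifier. The sketch below assumes a small in-memory conversion table with made-up identifiers; the table contents, the function `add_cross_references` and the column names are hypothetical and do not reflect BRM's internal tables.

```python
# Hypothetical conversion table: protein identifier -> gene identifiers.
CONVERSION_TABLE = {
    "IPI_DEMO_1": {"gene_symbol": "EGFR", "unigene": "Hs.0001"},
    "IPI_DEMO_2": {"gene_symbol": "TP53", "unigene": "Hs.0003"},
}


def add_cross_references(rows, id_col, table, fields):
    """Append one column per requested field, looked up by the row's identifier."""
    extended_rows = []
    for row in rows:
        mapping = table.get(row[id_col], {})
        extended = dict(row)
        for field in fields:
            extended[field] = mapping.get(field, "")  # empty when unmapped
        extended_rows.append(extended)
    return extended_rows


proteomics = [
    {"ipi": "IPI_DEMO_1", "abundance": 12.5},
    {"ipi": "IPI_DEMO_9", "abundance": 3.2},  # not present in the table
]
annotated = add_cross_references(
    proteomics, "ipi", CONVERSION_TABLE, ["gene_symbol", "unigene"]
)
print(annotated)
```

Unmapped identifiers are kept with empty cross-reference values rather than dropped, so no experimental rows are lost at this stage.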

Step 3: Multi-level merge

The next step is to invoke the ‘Merge Datasets’ interface from either the Dataset Browser or the Project Browser.

Using the ‘BRM’ button on this interface, the user selects the two data sets within BRM to be merged. Then the relationships between the fields in these two data sets must be defined. BRM allows multi-level merges so that multiple identifiers in the data sets can be used to accomplish the maximum overlap. The drop-down menus on the merge interface are automatically populated with the column headers from the two data sets. From the drop-down lists, the user selects the gene identifier in the proteomics file and the equivalent gene identifier field for the microarray file. The ‘Relation’ drop-down list provides the necessary translation options from one identifier to another. Since the same identifiers have been retrieved for the two data sets, the ‘Equivalent’ option is chosen. Alternatively, the same tables used to retrieve the cross reference identifiers can be used for internal translation, e.g. ‘IPI -> gene symbol’ in the relation drop-down list. More identifiers can be used to merge by clicking the ‘Add More Columns’ button and choosing another identifier from the two data sets. When finished selecting identifiers, the ‘Next’ button brings up another interface that has options to select the columns desired in the output data set, to select an intersection versus a union, and to select the project folder to store the new merged data set. The ‘Merge’ button initiates the process, and upon successful completion of the merge, the Dataset Browser will launch with the merged dataset and a popup window with the summary statistics. This mapping process is generic enough to be used for merging any two data sets with common identifiers. As shown in Figure 2-Step 3, 109 rows overlapped between the two sample data sets using two gene identifiers, when only 64 rows would have intersected with either identifier alone.
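
The benefit of a multi-level merge can be seen in a small sketch of an intersection merge that tries several identifier pairs in order and keeps the first match for each row. The function `multi_level_merge`, the toy rows and the identifiers are assumptions made for illustration and do not reproduce BRM's merge logic or the sample data in Figure 2.

```python
def multi_level_merge(left_rows, right_rows, key_pairs):
    """Intersection merge: match each left row on the first key pair that hits."""
    indexes = []
    for left_key, right_key in key_pairs:
        index = {}
        for row in right_rows:
            value = row.get(right_key)
            if value:  # empty identifiers never match
                index.setdefault(value, row)
        indexes.append((left_key, index))

    merged = []
    for left in left_rows:
        for left_key, index in indexes:
            match = index.get(left.get(left_key))
            if match is not None:
                # Left values win where both rows share a column name.
                merged.append({**match, **left})
                break
    return merged


# Toy data sets with made-up identifiers; some identifiers are missing.
proteomics = [
    {"ipi": "P1", "gene_symbol": "EGFR", "unigene": ""},
    {"ipi": "P2", "gene_symbol": "", "unigene": "Hs.0002"},
    {"ipi": "P3", "gene_symbol": "TP53", "unigene": "Hs.0003"},
    {"ipi": "P4", "gene_symbol": "XYZ", "unigene": ""},
]
microarray = [
    {"probe": "probe_1", "gene_symbol": "EGFR", "unigene": "Hs.0001", "ratio": 2.1},
    {"probe": "probe_2", "gene_symbol": "", "unigene": "Hs.0002", "ratio": 0.4},
    {"probe": "probe_3", "gene_symbol": "TP53", "unigene": "Hs.0003", "ratio": 1.3},
]

by_symbol = multi_level_merge(proteomics, microarray, [("gene_symbol", "gene_symbol")])
by_unigene = multi_level_merge(proteomics, microarray, [("unigene", "unigene")])
by_both = multi_level_merge(
    proteomics, microarray, [("gene_symbol", "gene_symbol"), ("unigene", "unigene")]
)
print(len(by_symbol), len(by_unigene), len(by_both))
```

In this toy case each identifier alone recovers two rows, while using both recovers three, because rows lacking one identifier can still match on the other.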


    5 CONCLUSION
Today's data-intensive, systems biology research requires the support of a software environment that manages data set storage, communicates with external data sources, integrates heterogeneous data and provides connectivity to analytic tools that transform experimental data into knowledge.

BRM serves as a data management system for experimental scientists and bioinformaticists. Designed in cooperation with biologists, BRM automates many mundane data processing tasks, making integration across platforms and experiments feasible and efficient. One other publicly available software package comparable to BRM in function and scope is the Gaggle (Shannon et al., 2006). Gaggle integrates multiple analysis tools, such as the R statistical environment and the TIGR Multi-Experiment Viewer, allowing two-way communication of data between applications. However, it does not have a data management component similar to BRM's. We are currently working with the Gaggle developers to release a future version of BRM that communicates with Gaggle and adds the much-needed analysis component to BRM.


    ACKNOWLEDGEMENTS
The research described in this article was conducted under the LDRD Program at the Pacific Northwest National Laboratory, a multiprogram national laboratory operated by Battelle for the US Department of Energy under Contract DE-AC06-76RL01830.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Golan Yona

Received on September 8, 2006; revised on January 23, 2007; accepted on January 24, 2007

    REFERENCES

    Bader GD, Hogue CW. BIND: a data specification for storing and describing biomolecular interactions, molecular complexes and pathways. Bioinformatics (2000) 16:465–477.

    Bairoch A, et al. The Universal Protein Resource (UniProt). Nucleic Acids Res (2005) 33(Database issue):154–159.

    Goesmann A, et al. Building a BRIDGE for the integration of heterogeneous data from functional genomics into a platform for systems biology. J. Biotechnol (2003) 106:157–167.

    Havre SL, et al. PQuad: visualization of predicted peptides and proteins. (2004) Proceedings of the Conference on Visualization '04. Washington, DC: IEEE Computer Society. 473–480.

    Kersey PJ, et al. The International Protein Index: an integrated database for proteomics experiments. Proteomics (2004) 4:1985–1988.

    Lee TJ, et al. BioWarehouse: a bioinformatics database warehouse toolkit. BMC Bioinformatics (2006) 7:170–175.

    Lu Q, et al. KDE Bioscience: platform for bioinformatics analysis workflows. J. Biomed. Inform (2006) 39:440–450.

    Michalickova K, et al. SeqHound: biological sequence and structure database as a platform for bioinformatics research. BMC Bioinformatics (2002) 3(1):32–45.

    Oikawa MK, et al. GenFlow: generic flow for integration, management and analysis of molecular biology data. Genet. Mol. Biol (2004) 27:691–695.

    Shannon PT, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res (2003) 13:2498–2504.

    Shannon PT, et al. The Gaggle: an open-source software system for integrating bioinformatics software and data sources. BMC Bioinformatics (2006) 7:176–188.

    Wheeler DL, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res (2006) 34(Database issue):173–180.

