Bioinformatics Advance Access originally published online on February 2, 2006
Bioinformatics 2006 22(8):1024-1026; doi:10.1093/bioinformatics/btl036
The LCB Data Warehouse
1 The Linnaeus Centre for Bioinformatics, Uppsala University and The Swedish University for Agricultural Sciences, Sweden
2 Department of Pharmaceutical Biosciences, Uppsala University, Sweden
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Summary: The Linnaeus Centre for Bioinformatics Data Warehouse (LCB-DWH) is a web-based infrastructure for reliable and secure microarray gene expression data management and analysis that provides an online service for the scientific community. The LCB-DWH is an effort towards a complete system for storage (using the BASE system), analysis and publication of microarray data. Important features of the system include: access to established methods within R/Bioconductor for data analysis, built-in connection to the Gene Ontology database and a scripting facility for automatic recording and re-play of all the steps of the analysis. The service is up and running on a high performance server. At present there are more than 150 registered users.
Availability: An open functional version is available at https://dw.lcb.uu.se/index.phtml?i_login=test. User accounts are created upon request. Additional facilities including plug-ins, user documentation and a password protected data storage system are available from http://www.lcb.uu.se/lcbdw.php
Contact: Jan.Komorowski{at}lcb.uu.se
| 1 INTRODUCTION |
|---|
The aim of the LCB-DWH is to facilitate the management and analysis of two-channel microarray data, and to help non-experts keep up with developments in data analysis by continuously integrating new tools and new sources of biological information. LCB-DWH differs from most other systems in that it also provides secure and reliable storage according to current international standards.
| 2 DESCRIPTION |
|---|
The LCB-DWH is developed from BASE, a widely used open source platform for comprehensive management and analysis of microarray data (Saal et al., 2002). The main difference in system design is that LCB-DWH, as opposed to BASE, enables storage and analysis of data to be performed on separate hardware. The LCB-DWH benefits from features that are inherited directly from the BASE architecture, such as MIAME compliant storage (Brazma et al., 2001), data sharing between groups of researchers, separation of projects, publication through MAGE-ML format (Spellman et al., 2002) and presentation of data analysis in a tree structure.
Development in the LCB-DWH has focused on integrating a wide collection of data analysis tools, and this work has been facilitated by the BASE plug-in architecture. A wrapper to the programming language R (Ihaka and Gentleman, 1996) has been developed, enabling access to the open source packages within Bioconductor (Gentleman et al., 2004), which contains a wide collection of efficient tools for microarray data analysis and visualization. Within the LCB-DWH system, those tools and several new ones are integrated in a single, easy-to-comprehend framework that allows non-expert users to apply sophisticated tools to their data in an intuitive manner. Moreover, we have designed and implemented an interactive Gene Ontology (The Gene Ontology Consortium, 2000) tool, which is invoked from within the user interface. It enables users to explore the biological function of a set of genes with a GO browser, or to test for the statistical over- or under-representation of different GO classes in a set of genes with respect to a reference set, e.g. all genes on the array. The problem of multiple testing is handled in two ways. The default option is to calculate the expected number of significant GO terms at some user-specified cut-off level. If this expected number is considerably lower than the number of significant GO terms actually observed at that level, then the results of the test as a whole are more likely to be correct. Optionally, the user may select a method for multiple test correction. Furthermore, we have implemented a feature for creating customized links from all genes in a dataset to any external web resource of biological knowledge.
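The over-representation test and the default multiple-testing check described above can be illustrated with a minimal sketch. It is written in Python for brevity, not in the R used by the system, and it assumes a hypergeometric test, which the text does not name. All function names are hypothetical and are not part of the LCB-DWH or BASE.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k): chance of seeing at least k annotated genes in a gene set.

    N = genes in the reference set (e.g. all genes on the array)
    K = reference genes annotated with the GO term
    n = genes in the candidate set
    k = candidate genes annotated with the GO term
    Assumption: a hypergeometric model; the source does not state the test.
    """
    total = comb(N, n)
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


def expected_significant(num_terms_tested, alpha):
    """Expected number of terms with p <= alpha if no term is truly enriched."""
    return num_terms_tested * alpha


def observed_vs_expected(p_values, alpha):
    """Return (observed, expected) counts of significant terms at cut-off alpha."""
    observed = sum(1 for p in p_values.values() if p <= alpha)
    return observed, expected_significant(len(p_values), alpha)
```

With 200 GO terms tested at a cut-off of 0.05, roughly 10 terms would be expected to pass by chance alone. Observing far more than that suggests that the result as a whole is not dominated by false positives, which is the reasoning behind the default option.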
Another issue that is addressed in the LCB-DWH is reproduction of data analysis. For this purpose, a facility that enables the user to save all steps of data analysis in a script has been developed. The script may be applied either to a specific path in the data analysis tree or to the complete structure. These scripts form protocols that may be re-used for automating repetitive tasks or by reviewers judging the quality of the analysis.
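The idea of recording analysis steps and replaying them as a protocol can be sketched as follows. This is only an illustration in Python under stated assumptions: the step names, the JSON script format and the `AnalysisRecorder` class are hypothetical and do not describe the actual LCB-DWH scripting facility.

```python
import json

# Hypothetical registry of analysis steps; names and behavior are illustrative.
STEPS = {
    "subtract_background": lambda values, amount: [v - amount for v in values],
    "filter_below": lambda values, threshold: [v for v in values if v >= threshold],
    "scale": lambda values, factor: [v * factor for v in values],
}


class AnalysisRecorder:
    """Applies steps to data and logs each step with its parameters."""

    def __init__(self):
        self.log = []

    def apply(self, data, step, **params):
        result = STEPS[step](data, **params)
        self.log.append({"step": step, "params": params})
        return result

    def to_script(self):
        """Serialize the recorded steps so they can be stored and shared."""
        return json.dumps(self.log, indent=2)


def replay(script, data):
    """Re-run every recorded step, in order, on a new dataset."""
    for entry in json.loads(script):
        data = STEPS[entry["step"]](data, **entry["params"])
    return data
```

A recorded script could then be applied to a second dataset, or handed to a reviewer, with `replay(recorder.to_script(), other_data)`. Storing parameters alongside step names is what makes the replay deterministic.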
Security and reliability issues are given high priority. Data are stored on a server with a double RAID solution; in addition, incremental backups are taken daily. Communication with the server is done through encrypted connections for password protected accounts.
| 3 WORKFLOW |
|---|
LCB-DWH implements the whole dataflow described in the MIAME requirements for microarray experiments. The starting point is data from image quantification software, and data from several arrays are grouped into a single experiment. The experiment may then be shared within a research group, before it is transferred to the data analysis module.
Pre-processing of the data follows, with a number of different methods available for background correction, normalization and filtering. Moreover, spots that are printed at multiple positions on an array may be merged into a single value. At any point the quality of the data may be checked with several data visualization methods, such as array plots, PCA plots and plots for control clones. Such plots can prove helpful, e.g. when selecting an appropriate pre-processing method for a specific dataset. The step after pre-processing is usually the detection of candidate genes, i.e. all genes that were targeted by the particular experiment. For example, this can be done using various methods for detecting differentially expressed genes, or by clustering methods. Once a set of candidate genes has been identified, the Gene Ontology tool can be used to analyze the biological processes, molecular functions and cellular compartments in which those genes are involved. Sample pictures produced within the LCB-DWH in the data analysis process are shown in Figure 1.
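The pre-processing steps named above can be made concrete with a small sketch for two-channel data. It is written in Python and makes several assumptions that the text does not confirm: background correction by simple subtraction, normalization by median centering, and merging of replicate spots by their median. The function names are hypothetical.

```python
from collections import defaultdict
from math import log2
from statistics import median


def log_ratio(fg_ch1, bg_ch1, fg_ch2, bg_ch2):
    """Background-subtracted log2 ratio of channel 1 over channel 2.

    Returns None when either corrected intensity is not positive, so that
    such spots can be filtered out (assumed filtering rule).
    """
    c1 = fg_ch1 - bg_ch1
    c2 = fg_ch2 - bg_ch2
    if c1 <= 0 or c2 <= 0:
        return None
    return log2(c1 / c2)


def median_center(values):
    """Shift log ratios so that their median is zero (assumed normalization)."""
    m = median(values)
    return [v - m for v in values]


def merge_replicates(spots):
    """Merge spots printed at several positions into one value per reporter.

    spots: iterable of (reporter_id, log_ratio) pairs; None ratios are skipped.
    """
    grouped = defaultdict(list)
    for reporter, ratio in spots:
        if ratio is not None:
            grouped[reporter].append(ratio)
    return {reporter: median(ratios) for reporter, ratios in grouped.items()}
```

The median is used here for merging because it is less sensitive to a single bad spot than the mean; a mean or a weighted average would be an equally plausible choice.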
Major journals require that expression data be made available in public repositories such as ArrayExpress (Brazma et al., 2003) at EBI. The LCB-DWH enables export of data in MAGE-ML format, which can be uploaded to such repositories. Manual uploading of data is otherwise a tedious and time-consuming process.
| 4 CURRENT DEVELOPMENTS |
|---|
Ongoing development of the LCB-DWH system includes adaptation to new microarray technologies such as ChIP-chip and array-CGH, which require new methodologies both for data storage and analysis.
| ACKNOWLEDGMENTS |
|---|
The authors thank Hanna Göransson for helpful discussions and Jakub Orzechowski Westholm for help with some implementations. The LCB-DWH is supported by grants from the Wallenberg Consortium North and from the Knut and Alice Wallenberg Foundation. Funding to pay the Open Access publication charges for this article was provided by the Linnaeus Centre for Bioinformatics.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Alfonso Valencia
Received on August 26, 2005; revised on November 14, 2005; accepted on January 31, 2006
| REFERENCES |
|---|
Brazma, A., et al. (2003) ArrayExpress-a public repository for microarray gene expression data at the EBI. Nucleic Acids Res., 31, 68-71.
Brazma, A., et al. (2001) Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat. Genet., 29, 365-371.
Gentleman, R.C., et al. (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol., 5, R80.
Ihaka, R. and Gentleman, R. (1996) R: a language for data analysis and graphics. J. Comput. Graph. Stat., 5, 299-314.
Saal, L.H., et al. (2002) Bioarray Software Environment (BASE): a platform for comprehensive management and analysis of microarray data. Genome Biol., 3, SOFTWARE0003.
Spellman, P.T., et al. (2002) Design and implementation of microarray gene expression markup language (MAGE-ML). Genome Biol., 3, research0046.1-research0046.9.
The Gene Ontology Consortium. (2000) Gene Ontology: tool for the unification of biology. Nat. Genet., 25, 25-29.