
Bioinformatics 2005 21(16):3439-3440; doi:10.1093/bioinformatics/bti525

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis

Steffen Durinck 1,2,*, Yves Moreau 1, Arek Kasprzyk 2, Sean Davis 3, Bart De Moor 1, Alvis Brazma 2 and Wolfgang Huber 2

1Department of Electronical Engineering ESAT-SCD, K.U.Leuven, Kasteelpark Arenberg 10, 3001 Leuven-Heverlee, Belgium
2EBI, Wellcome Trust Genome Campus Hinxton, Cambridge CB10 1SD, UK
3Cancer Genetics Branch, National Human Genome Research Institute, National Institutes of Health, 50 South Drive, Bethesda, MD 20892-8000, USA

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 INTRODUCTION
 DESCRIPTION
 USAGE
 EXAMPLES
 DISCUSSION
 REFERENCES
 

Summary: biomaRt is a new Bioconductor package that integrates BioMart data resources with data analysis software in Bioconductor. It can annotate a wide range of gene or gene product identifiers (e.g. Entrez-Gene and Affymetrix probe identifiers) with information such as gene symbol, chromosomal coordinates, Gene Ontology and OMIM annotation. Furthermore, biomaRt enables retrieval of genomic sequences and single nucleotide polymorphism information, which can be used in data analysis. Fast and up-to-date data retrieval is possible because the package executes direct SQL queries against the BioMart databases (e.g. Ensembl). The biomaRt package tightly integrates large, public or locally installed BioMart databases with data analysis in Bioconductor, creating a powerful environment for biological data mining.

Availability: http://www.bioconductor.org. LGPL

Contact: steffen.durinck{at}esat.kuleuven.ac.be


    INTRODUCTION
Bioconductor is an open source and open development software project that provides a wide range of statistical and graphical tools based on R (Ihaka and Gentleman, 1996) for the analysis and comprehension of genomic data (Gentleman et al., 2004). These tools are distributed as separate but interoperable packages, each specializing in a different subarea of analysis, such as the ‘affy’ package to normalize Affymetrix chip data and the ‘graph’ package to handle graph data structures. BioMart (http://www.ebi.ac.uk/biomart) is a simple, federated query system designed specifically for use with large datasets. One of the major databases providing a BioMart implementation is Ensembl (Hubbard et al., 2005; Kasprzyk et al., 2004). Central to BioMart database systems are the star and reverse-star schemas: the former consists of a single main table linked to different dimension tables, and the latter is a variant of it (Kasprzyk et al., 2004). The overall simplicity of these schemas avoids complex joins and enables fast data retrieval. The biomaRt package is an add-on package for R that provides an interface for querying BioMart databases.
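The star schema can be illustrated with a minimal sketch in base R. It uses small made-up data frames rather than real BioMart tables, and every table name, column name and value here (main, dim_symbol, dim_go, gene_key, the placeholder identifiers) is an assumption chosen for illustration. The point is only that each dimension table links to the main table through one shared key, so a query needs a single simple join per dimension.

```r
# Hypothetical main table: one row per gene, keyed by gene_key.
main <- data.frame(
  gene_key   = c(1L, 2L, 3L),
  chromosome = c("17", "13", "7"),
  start      = c(100L, 200L, 300L),
  stringsAsFactors = FALSE
)

# Hypothetical dimension tables, each linked to the main table by gene_key only.
dim_symbol <- data.frame(
  gene_key = c(1L, 2L, 3L),
  symbol   = c("geneA", "geneB", "geneC"),
  stringsAsFactors = FALSE
)
dim_go <- data.frame(
  gene_key = c(1L, 1L, 3L),
  go_id    = c("GO:placeholder1", "GO:placeholder2", "GO:placeholder3"),
  stringsAsFactors = FALSE
)

# A query touches the main table plus only the dimensions it needs:
# here, GO terms for genes on chromosome 17. One key, one join.
hits   <- main[main$chromosome == "17", ]
result <- merge(hits, dim_go, by = "gene_key")
print(result)
```

Because dim_symbol is not needed for this question, it is never touched; that is the sense in which the schema keeps joins simple.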


    DESCRIPTION
Our package currently covers four BioMart databases: Ensembl (Hubbard et al., 2005), a software system that produces and maintains automatic annotation on selected eukaryotic genomes; VEGA (Ashurst et al., 2005), the manually annotated Vertebrate Genome Annotation database; dbSNP (Sherry et al., 2001), the Single Nucleotide Polymorphism database of NCBI; and the sequence mart, which contains the Ensembl genome sequences. The package depends on the R package RMySQL and has been tested on Windows and Linux. After loading the library, one can connect either to the public BioMart databases or to local installations of them. biomaRt offers several functions that enable the user to query these databases. One set of functions can be used to annotate identifiers such as Affymetrix, RefSeq and Entrez-Gene identifiers with information such as gene symbol, chromosomal coordinates, OMIM and Gene Ontology. Alternatively, one can use a gene symbol as the starting point and query for the corresponding Affymetrix identifiers on a given chip. Queries can also span species: one can use an identifier of one type in species a to look up identifiers of the same or another type corresponding to homologs in species b. A second set of functions allows sequence-related data retrieval. Given a species and chromosome coordinates, one can retrieve genome sequences. In this way, a user can go directly from a set of differentially expressed genes to the upstream promoter sequences. Similarly, single nucleotide polymorphism (SNP) information can be retrieved. The SNP information is derived from dbSNP, which is mapped onto Ensembl.
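Going from gene coordinates to an upstream promoter sequence first requires deciding which interval to request, and that depends on the strand. The following base R sketch shows one way this could be worked out. The helper upstream_window is hypothetical and is not part of biomaRt; it assumes 1-based inclusive coordinates and a strand coded as 1 or -1, and the gene coordinates are made up.

```r
# Hypothetical helper (not a biomaRt function): compute the interval lying
# `width` bases upstream of each gene, assuming 1-based inclusive coordinates
# and strand coded as 1 (forward) or -1 (reverse).
upstream_window <- function(start, end, strand, width = 1000L) {
  stopifnot(all(strand %in% c(1L, -1L)), all(start <= end))
  forward <- strand == 1L
  # Forward strand: upstream lies before `start`; reverse strand: after `end`.
  win_start <- ifelse(forward, start - width, end + 1L)
  win_end   <- ifelse(forward, start - 1L,    end + width)
  # Clip at the chromosome start so no coordinate drops below 1.
  data.frame(start = pmax(win_start, 1L), end = win_end, strand = strand)
}

# Made-up gene coordinates for illustration.
genes <- data.frame(
  start  = c(5000L, 20000L, 300L),
  end    = c(6000L, 21000L, 900L),
  strand = c(1L, -1L, 1L)
)
promoters <- upstream_window(genes$start, genes$end, genes$strand, width = 1000L)
print(promoters)
```

The sketch clips only at the start of the chromosome; a real analysis would also need the chromosome length to clip windows on the reverse strand.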


    USAGE
biomaRt provides documentation in the form of manual pages for every function and a vignette, an interactive document containing executable code chunks that offers a more problem-oriented style of help.


    EXAMPLES
A typical situation arising in the analysis of microarray data is that one has a list of identifiers corresponding to differentially expressed features on the array. In the example below, we first connect to the BioMart databases and retrieve gene information using an Affymetrix identifier as the input. Then we use this information to retrieve the corresponding sequence.

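library(biomaRt)

# Connect to the BioMart databases
mart <- martConnect()

# Retrieve gene information for an Affymetrix identifier on the hg_u95av2 array
gene <- getGene(id = "1939_at", array = "hg_u95av2", mart = mart)

# Use the retrieved gene information to fetch the corresponding sequence
seq <- getSequence(martTable = gene, mart = mart)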

Another example is to sort genes by their chromosome coordinates, which could be used to investigate whether co-localized genes are also co-expressed.
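Once coordinates have been retrieved, the sorting step itself needs only base R. The sketch below uses a made-up annotation table in place of a real query result; the column names and the assumption of human chromosome names (1 to 22, X, Y) are illustrative only.

```r
# Made-up annotation table standing in for the result of an annotation query.
genes <- data.frame(
  id         = c("g1", "g2", "g3", "g4", "g5"),
  chromosome = c("10", "2", "X", "2", "10"),
  start      = c(500L, 9000L, 120L, 150L, 40L),
  stringsAsFactors = FALSE
)

# Order chromosomes naturally (1..22, then X, Y) rather than alphabetically,
# where "10" would sort before "2".
chrom_levels <- c(as.character(1:22), "X", "Y")
chrom_rank   <- match(genes$chromosome, chrom_levels)

# Sort by chromosome first, then by start position within each chromosome.
sorted <- genes[order(chrom_rank, genes$start), ]
print(sorted$id)
```

Neighbouring rows of the sorted table are then neighbouring genes on the chromosome, which makes it straightforward to compare their expression profiles.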

A more advanced example could be a microarray analysis in Drosophila, where we want to focus on genes that have human homologs known to be involved in a certain disease. biomaRt enables one to first look up the human homologs and then use these homologs to query for OMIM identifiers. Drosophila genes that have human homologs with an associated OMIM identifier can then be selected for subsequent analysis.
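The selection logic of this workflow can be sketched in base R, independent of how the homolog and OMIM tables are obtained. Both tables below are made up, and all gene names and identifiers are placeholders rather than real annotation.

```r
# Made-up homolog mapping: Drosophila gene -> human homolog (may be many-to-many).
homologs <- data.frame(
  fly_gene   = c("flyA", "flyB", "flyB", "flyC"),
  human_gene = c("humA", "humB", "humC", "humD"),
  stringsAsFactors = FALSE
)

# Made-up OMIM annotation for human genes; genes without an entry are absent.
omim <- data.frame(
  human_gene = c("humA", "humC"),
  omim_id    = c("placeholder1", "placeholder2"),
  stringsAsFactors = FALSE
)

# Attach OMIM identifiers to the human homologs.
# An inner join drops homologs that have no OMIM entry.
annotated <- merge(homologs, omim, by = "human_gene")

# Keep the Drosophila genes with at least one OMIM-annotated human homolog.
selected <- sort(unique(annotated$fly_gene))
print(selected)
```

Here flyC is dropped because its only homolog has no OMIM entry, while flyB is kept because one of its two homologs does.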


    DISCUSSION
The Bioconductor package biomaRt enables direct access from Bioconductor to BioMart databases such as Ensembl, creating a strong link between data analysis and biological databases. The annotation packages currently available from Bioconductor are complementary to our package. They use precompiled annotation tables derived from the NCBI and stored as hashtables in R (Zhang et al., 2003). Precompiled annotation packages are convenient when working with one or a few array types with relatively constant designs; however, this approach has limitations. When multiple chip designs are used, for example in a meta-analysis study, several metadata packages need to be installed, and these will contain redundant information. Very large gene sets make the metadata packages sizeable, whereas with biomaRt only the annotation of the genes of interest is retrieved. biomaRt is more scalable, and it gathers up-to-date information from BioMart databases. Fast data retrieval is possible because the biomaRt package executes direct SQL queries from R to the BioMart databases. Besides annotation information, biomaRt also enables mapping of homologs and retrieval of sequence and SNP data, which can become part of a microarray data analysis. The biomaRt package will be further developed to include more BioMart databases and to allow more complex types of queries. This tight integration of large public databases with data analysis in R provides a powerful platform for biological data mining.


    Acknowledgments
 
The authors would like to thank Ewan Birney for the fruitful discussions on BioMart. FWO: PhD/postdoctoral grants, projects G.0115.01, G.0413.03, G.0388.03, G.0229.03, research communities (ICCoS, ANMMM, MLDM); IWT: PhD Grants, GBOU-SQUAD, GBOU-ANA, GBOU-McKnow, STWW-Genprom; Belgian Federal Government: DWTC IUAP V-22; EU: FP5, CAGE, ERNSI; German Ministry for Education and Research through National Genome Research Network (NGFN) grant FKZ 01GR0450.

Conflict of Interest: none declared.

Received on April 12, 2005; revised on May 25, 2005; accepted on May 31, 2005

    REFERENCES

    Ashurst, J.L., et al. (2005) The Vertebrate Genome Annotation (VEGA) database. Nucleic Acids Res., 33, D459–D465.

    Gentleman, R., et al. (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol., 5, R80.

    Hubbard, T., et al. (2005) Ensembl 2005. Nucleic Acids Res., 33, D447–D453.

    Ihaka, R. and Gentleman, R. (1996) R: a language for data analysis and graphics. J. Comput. Graph. Stat., 5, 299–314.

    Kasprzyk, A., et al. (2004) EnsMart: a generic system for fast and flexible access to biological data. Genome Res., 14, 160–169.

    Sherry, S.T., et al. (2001) dbSNP: the NCBI database of genetic variation. Nucleic Acids Res., 29, 308–311.

    Zhang, J., et al. (2003) An extensible application for assembling annotation for genomic data. Bioinformatics, 19, 155–156.




