Bioinformatics Advance Access originally published online on June 8, 2009
Bioinformatics 2009 25(16):2092-2094; doi:10.1093/bioinformatics/btp354

© 2009 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Importing ArrayExpress datasets into R/Bioconductor

Audrey Kauffmann 1,*, Tim F. Rayner 2, Helen Parkinson 1, Misha Kapushesky 1, Margus Lukk 1, Alvis Brazma 1 and Wolfgang Huber 1

1 EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD and 2 Cambridge Institute for Medical Research, Addenbrooke's Hospital, Cambridge CB2 0XY, UK

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MIAME
 3 MAGE-TAB
 4 BIOCONDUCTOR CLASSES
 5 RETRIEVING AND CONVERTING...
 6 APPLICATION
 7 CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 

Summary: ArrayExpress is one of the largest public repositories of microarray datasets. R/Bioconductor provides a comprehensive suite of microarray analysis and integrative bioinformatics software. However, easy ways for importing datasets from ArrayExpress into R/Bioconductor have been lacking. Here, we present such a tool that is suitable for both interactive and automated use.

Availability: The ArrayExpress package is available from the Bioconductor project at http://www.bioconductor.org. A user's guide and examples are provided with the package.

Contact: audrey{at}ebi.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 
ArrayExpress is a public database for high-throughput functional genomics data (Parkinson et al., 2009). It consists of a repository, which is a MIAME-supportive (Brazma et al., 2001) public archive of microarray data, and an added-value gene expression Atlas created from the repository data. Currently, nearly 8000 experiments comprising 230 000 arrays are available from ArrayExpress. Retrieving publicly available data for analysis is a repetitive and error-prone task for which automation is desirable. As Bioconductor (Gentleman et al., 2004) contains many widely used tools for data analysis, tools that connect it with public databases are useful. The GEOquery package (Davis and Meltzer, 2007) was developed to load GEO datasets into Bioconductor, and the RMAGEML package (Durinck et al., 2004) was designed to import the MAGE-ML files that ArrayExpress formerly used for data transfer. The ArrayExpress database now supports the MAGE-TAB format (Rayner et al., 2006), a metadata-rich but much simpler and more resource-efficient format based on tab-delimited files, and all data are made available in this format. We have developed the ArrayExpress package for R/Bioconductor to query ArrayExpress and convert MAGE-TAB formatted datasets from the ArrayExpress repository into objects of the Bioconductor class for microarray datasets, eSet.


    2 MIAME
 
MIAME is a guideline that describes the Minimum Information About a Microarray Experiment needed to ensure interpretation of a microarray dataset. It has five elements: (i) the raw data for each hybridization, (ii) the final processed data for the set of hybridizations in the experiment, (iii) the experiment design, including sample–data relationships and the essential sample annotation with experimental factors and their values, (iv) sufficient annotation of the array design and (v) essential laboratory and data processing protocols.


    3 MAGE-TAB
 
MAGE-TAB is a tabular, MIAME-supportive file format. A MAGE-TAB document consists of five different types of files. (i) A ‘raw’ zip archive contains the raw data files, i.e. the files produced by the microarray image analysis software, such as CEL files for Affymetrix GeneChips or GPR files from GenePix. (ii) A ‘data matrix’ file contains processed values, as provided by the data submitter, converted into a common tab-delimited text format representing a matrix of numbers. (iii) The Sample and Data Relationship Format (SDRF) tab-delimited file contains the relationships between samples and arrays, as well as sample properties and experimental factors, as provided by the data submitter. (iv) The Array Design Format (ADF) tab-delimited file describes the design of an array, i.e. the sequence located at each feature on the array and the annotation of the sequences. (v) The Investigation Description Format (IDF) tab-delimited file contains top-level information about the experiment, including title, description, submitter contact details and protocols.
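To make the role of the SDRF file concrete, the following sketch reads a small, made-up SDRF fragment with base R. The column headings imitate the MAGE-TAB naming style, but the sample names, file names and factor values are invented for illustration and do not come from any ArrayExpress experiment.

```r
# A made-up SDRF fragment: one row per hybridization, tab-delimited.
sdrf_text <- paste(
  "Source Name\tArray Data File\tFactor Value [genotype]",
  "sample1\tsample1.CEL\twild type",
  "sample2\tsample2.CEL\tmutant",
  sep = "\n"
)

# check.names = FALSE keeps the headings (with spaces and brackets) intact.
sdrf <- read.delim(text = sdrf_text, check.names = FALSE,
                   stringsAsFactors = FALSE)

# Index the table by data file, so that each row describes one array.
rownames(sdrf) <- sdrf[["Array Data File"]]

print(sdrf)
```

Indexing the table by data file name mirrors the idea that the SDRF links each array to its sample annotation; real SDRF files typically contain many more columns.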


    4 BIOCONDUCTOR CLASSES
 
The Bioconductor class eSet is a different implementation of the MIAME standard. The class has various specializations, or subclasses, that are adapted to specific array technologies; among these are ExpressionSet for generic one-colour datasets, NChannelSet for generic two-colour datasets and AffyBatch for data from Affymetrix GeneChips. Objects of this class contain one or more identically sized numeric matrices as assayData elements. They also include a table describing the sample–array relationship as phenoData and a table describing the array features as featureData. Details of experimental methods are stored in the experimentData component.
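As an illustration of these components, the sketch below builds a toy ExpressionSet by hand with the Biobase package. All values, feature names and sample names are invented; the only point is to show how assayData, phenoData, featureData and experimentData fit together.

```r
library("Biobase")

# Toy intensity matrix: 3 features (rows) x 2 arrays (columns).
intensities <- matrix(c(10, 20, 30, 11, 21, 31), nrow = 3,
                      dimnames = list(c("f1", "f2", "f3"), c("a1", "a2")))

# Sample annotation; row names must match the column names of the matrix.
samples <- data.frame(genotype = c("wild type", "mutant"),
                      row.names = c("a1", "a2"))

# Feature annotation; row names must match the row names of the matrix.
features <- data.frame(gene = c("g1", "g2", "g3"),
                       row.names = c("f1", "f2", "f3"))

eset <- ExpressionSet(assayData = intensities,
                      phenoData = AnnotatedDataFrame(samples),
                      featureData = AnnotatedDataFrame(features),
                      experimentData = MIAME(title = "Toy experiment"))

exprs(eset)           # the assayData matrix
pData(eset)           # the phenoData table
fData(eset)           # the featureData table
experimentData(eset)  # experiment-level description
```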


    5 RETRIEVING AND CONVERTING MAGE-TAB DATA
 
The ArrayExpress package uses the zip archive with either the raw or the processed data to build the assayData component. The SDRF file is used to construct the phenoData table. The ADF file is used to construct the featureData, and the IDF file to fill in the experimentData components.

5.1 Raw data
To import a raw dataset from ArrayExpress, one can use the following R code after loading the package:


[R code listing; not preserved in this copy]
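A minimal sketch of such an import, assuming the ArrayExpress package is installed and network access is available, might look as follows. The variable name `atmx18` is an illustrative choice, and whether this accession still resolves with current releases of the package is not guaranteed.

```r
# Requires the Bioconductor package ArrayExpress and network access.
library("ArrayExpress")

# Download the raw data of E-ATMX-18 and build a Bioconductor object from it.
# `atmx18` is an illustrative variable name.
atmx18 <- ArrayExpress("E-ATMX-18")

# Inspect which eSet subclass was returned.
class(atmx18)
```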

As E-ATMX-18 is a two-colour experiment, the returned R object is of class NChannelSet. If the identifier refers to an Affymetrix experiment, the output is an AffyBatch; if it refers to a one-colour experiment using a platform other than Affymetrix, the output is an ExpressionSet. The ArrayExpress function extracts feature intensity summaries from columns of the raw data files based on the common conventions for the data file sources. If the data source is not recognized, or the file does not have the expected column names, the user is asked to explicitly provide the name of the column(s) to extract, for instance, ‘Cy3 Median’. In some cases, there is a mismatch between the sample or feature annotations and the intensity data files; a warning is then emitted, the phenoData and/or featureData components are left empty and an incomplete (but syntactically valid) object is returned. Tested on the 5298 accessions with raw datasets that were available from the ArrayExpress repository in March 2009, the ArrayExpress function managed to create a complete object in 58% of the cases (Table 1). In the remaining 42% of cases, the function failed or produced an incomplete object for a variety of reasons, including missing or contradictory data in the repository. We are actively working on manually curating these cases and resolving problems as much as possible; however, due to the repository's role as a public record of scientific activity, problems inherent to information submitted by the contributors may persist.

In addition to calling the one-stop function ArrayExpress, it is possible to download the data for local storage using the function getAE and to import a locally stored MAGE-TAB document with the function magetab2bioc.
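The two-step route can be sketched as below. This assumes that getAE accepts `path` and `type` arguments and returns a description of the downloaded files that magetab2bioc can consume; argument names and function names may differ between releases of the package, so the installed documentation should be checked. The directory `local_dir` is an illustrative choice.

```r
# Requires the Bioconductor package ArrayExpress and network access.
library("ArrayExpress")

# Step 1: download the MAGE-TAB files of E-ATMX-18 into a local directory.
# `local_dir` is an illustrative choice.
local_dir <- tempdir()
files <- getAE("E-ATMX-18", path = local_dir, type = "raw")

# Step 2: build the Bioconductor object from the locally stored files.
atmx18_local <- magetab2bioc(files)
```

Separating download from conversion is convenient when the same files are to be re-imported several times, or when the conversion step needs to be repeated with different settings.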

5.2 Processed data
The way processed data are handled in the database is less uniform than for raw data, because processing methods vary more than the microarray image analysis software outputs. To import a processed dataset from ArrayExpress, three steps are required: download the dataset, identify which column is of interest and create the R object. Example code looks as follows:

[R code listing; not preserved in this copy]

Here, cn is a character vector containing the names of all columns in the processed data; after visual inspection, we decided to use the second one.
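The three steps can be sketched as a small helper. The wrapper `import_processed` and its `column` argument are hypothetical names introduced here for illustration; getAE, getcolproc and procset are functions of the ArrayExpress package, but their exact signatures may vary between releases.

```r
# Requires the Bioconductor package ArrayExpress and network access.
library("ArrayExpress")

# Hypothetical helper wrapping the three steps described in the text.
import_processed <- function(accession, column = 2) {
  # 1. Download the processed data files for the given accession.
  files <- getAE(accession, type = "processed")
  # 2. List the column names available in the processed data.
  cn <- getcolproc(files)
  message("Available columns: ", paste(cn, collapse = ", "))
  # 3. Build the R object from the chosen column.
  procset(files, cn[column])
}
```

In interactive use, one would normally inspect the output of getcolproc before choosing a column; the default of 2 only mirrors the choice made in the example described above and is not a general recommendation.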


    6 APPLICATION
 
We used the queryAE function to list all datasets concerned with breast cancer in Homo sapiens. Then, using the ArrayExpress function, we created R objects from all datasets for which raw data were available. We counted, for each dataset, the number of arrays and features. The Supplementary table summarizes the results of this analysis. This could now be followed by an integrative analysis of the data, a complex and open-ended task for which essential tools are provided in the Bioconductor project: the quality of the datasets could be assessed with the help of the arrayQualityMetrics package (Kauffmann et al., 2009), they could be normalized and analysed for differential expression of genes and gene sets (Hahne et al., 2008), and the combination of different datasets is facilitated, for example, by the MergeMaid package (Cope et al., 2004).
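A workflow of this kind could be sketched as follows. The helpers `count_arrays_features` and `summarize_query` are hypothetical names, and the column names `ID` and `Raw` of the queryAE result (with "yes" marking available raw data) are assumptions that should be checked against the installed package version. Import failures are recorded as missing values rather than stopping the loop.

```r
# Requires the Bioconductor packages ArrayExpress and Biobase, and network access.
library("ArrayExpress")
library("Biobase")

# Count features and arrays from the first assayData element of an
# eSet-derived object.
count_arrays_features <- function(x) {
  element <- assayDataElement(x, assayDataElementNames(x)[1])
  c(features = nrow(element), arrays = ncol(element))
}

# Query the repository, import every dataset with raw data, and tabulate sizes.
summarize_query <- function(keywords, species) {
  sets <- queryAE(keywords = keywords, species = species)
  # Assumption: `sets` has columns `ID` and `Raw` ("yes" if raw data exist).
  ids <- as.character(sets$ID[sets$Raw == "yes"])
  counts <- lapply(ids, function(id) {
    tryCatch(count_arrays_features(ArrayExpress(id)),
             error = function(e) c(features = NA, arrays = NA))
  })
  data.frame(accession = ids, do.call(rbind, counts))
}
```

A call such as `summarize_query("breast+cancer", "homo+sapiens")` would then correspond to the query described above, assuming that multi-word search terms are joined with `+`.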


    7 CONCLUSIONS
 
The ArrayExpress package is freely available, open source and easy to use. As most of the Bioconductor tools for microarray analysis process eSet objects, the package facilitates large-scale analyses of public data. A strength of the package is the richness, accuracy and standardized format of the metadata that it imports together with the array intensity data. In fact, the diagnostics produced by the package during dataset import from the ArrayExpress repository are currently used by the curators to decrease the number of problematic experiments and improve the quality of the content delivered. For the end user, the ArrayExpress package eliminates, or at least greatly reduces, the amount of manual intervention needed and helps towards automated processing of large collections of datasets.



 
Table 1. Application of the ArrayExpress package to the ArrayExpress database in March 2009

 

    ACKNOWLEDGEMENTS
 
We would like to thank A. Tikhonov, J. Cho, R. Santamaria and the ArrayExpress group.

Funding: EU FP6 (EMERALD, LSHG-CT-2006-037686); National Institutes of Health (5P41HG003619-05).

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Jonathan Wren

Received on April 21, 2009; revised on May 28, 2009; accepted on May 31, 2009

    REFERENCES
 

    Brazma A, et al. Minimum information about a microarray experiment (MIAME) – toward standards for microarray data. Nat. Genet. (2001) 29:365–371.

    Cope L, et al. MergeMaid: R tools for merging and cross-study validation of gene expression data. Stat. Appl. Genet. Mol. Biol. (2004) 3:29.

    Davis S, Meltzer PS. GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor. Bioinformatics (2007) 23:1846–1847.

    Durinck S, et al. Importing MAGE-ML format microarray data into Bioconductor. Bioinformatics (2004) 20:3641–3642.

    Gentleman RC, et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. (2004) 5:R80.

    Hahne F, et al. Bioconductor Case Studies (2008) New York: Springer.

    Kauffmann A, et al. arrayQualityMetrics – a Bioconductor package for quality assessment of microarray data. Bioinformatics (2009) 25:415–416.

    Parkinson H, et al. ArrayExpress update – from an archive of functional genomics experiments to the atlas of gene expression. Nucleic Acids Res. (2009) 37:D868–D872.

    Rayner TF, et al. A simple spreadsheet-based, MIAME-supportive format for microarray data: MAGE-TAB. BMC Bioinformatics (2006) 7:489.

