Bioinformatics Advance Access originally published online on November 25, 2004
Bioinformatics 2005 21(8):1495-1501; doi:10.1093/bioinformatics/bti157
The ArrayExpress gene expression database: a software engineering and implementation perspective
EMBL Outstation Hinxton, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
*To whom correspondence should be addressed.
| Abstract |
|---|
Motivation: The lack of microarray data management systems and databases is still one of the major problems faced by many life sciences laboratories. While developing ArrayExpress, the public repository for microarray data, we had to find novel solutions to many non-trivial software engineering problems. Our experience will be both relevant and useful for most bioinformaticians involved in developing information systems for a wide range of high-throughput technologies.
Results: ArrayExpress has been online since February 2002, growing exponentially to well over 10 000 hybridizations (as of September 2004). It has been demonstrated that our chosen design and implementation works for databases aimed at storage, access and sharing of high-throughput data.
Availability: The ArrayExpress database is available at http://www.ebi.ac.uk/arrayexpress/. The software is open source.
Contact: ugis{at}ebi.ac.uk
| INTRODUCTION |
|---|
Microarrays have become a mainstream tool for molecular biology studies, and increasing amounts of data are generated. There is a need to build a community-wide infrastructure for microarray data sharing (Brazma et al., 2003), the most important elements of which are public repositories, data representation and communication standards. The ArrayExpress project began in 2000, and the database has been online since the beginning of 2002. Currently data from well over 10 000 hybridization experiments covering over 30 different organisms, submitted by almost 100 different laboratories, have been loaded and are available online. ArrayExpress can store data for all microarray technologies, such as two-channel microarrays or Affymetrix arrays, as well as different types of experiments, such as gene expression data, array-based chromatin immunoprecipitation (so-called ChIP-on-chip) data, or array CGH data.
The ArrayExpress database is at the center of a wider microarray informatics system at the EBI (Brazma et al., 2003), which also includes the experiment annotation/submission tool MIAMExpress, data transfer pipelines from other (external) databases and tools, and the online data analysis tool Expression Profiler (Kapushesky et al., 2004). Data transfer pipelines have been established with a number of major microarray databases and analysis tools, including the Stanford Microarray Database (SMD) (Gollub et al., 2003), the TIGR microarray data management system (Saeed et al., 2003), the J-Express data analysis tool (Dysvik and Jonassen, 2001), external installations of MIAMExpress (e.g. in EMBL Heidelberg and University of Cambridge) and RAD (Manduchi et al., 2004).
Each component of this system has been developed relatively independently, following rather different design principles. In this paper we concentrate exclusively on the ArrayExpress component from a computer science and software engineering perspective; for end-user-oriented information see Brazma et al. (2003). We describe the ArrayExpress requirements analysis, design principles, overall architecture, the engineering challenges that we faced and the solutions that we found. We briefly touch upon the microarray data standardization issues, as well as our experience in using these standards. This paper does not describe the implementation details, but only the basic principles. We believe the described project represents advances in bioinformatics software engineering in the sense that by applying modern (albeit proven) software technologies it has delivered, in a short time and with minimal resources, a complex but robust product which is already widely used.
The paper also gives the essential information for installation of local copies of ArrayExpress software. Although ArrayExpress has been developed primarily with the goal of serving as a public repository, it has already been installed in several laboratories. The ArrayExpress software can be adjusted for the goals of particular laboratories and integrated with various laboratory information management system components relatively easily. We also believe that our experience can be useful in the development of databases and informatics systems for many other high throughput technologies.
| PROJECT REQUIREMENTS |
|---|
ArrayExpress goals
ArrayExpress has three major goals: (1) to serve as an archive for microarray data associated with scientific publications and other research, (2) to provide easy access to microarray data in a standard format for the research community and (3) to facilitate the sharing of microarray designs and experimental protocols. To serve goals (1) and (3), ArrayExpress must store three main classes of objects: experiments, array designs and protocols. Experiments are biologically related logical groupings of raw and processed data together with annotation of biological samples, the material treatment and data processing steps. Often an experiment corresponds to a particular publication.
User requirements
To achieve goal (2) we collected different use-cases and hypothetical example queries from a wide range of potential database users. We concluded that there are five main user categories:
- Experimentalists who are interested in learning about the techniques used by others, in order to reuse and improve their methods, experimental designs and array platforms, and to adjust them to their own biological questions. For them detailed information about the experimental design and procedures is of major importance.
- Biologists focused on studies of particular genes. They are interested in questions such as under what conditions gene A is expressed and what other genes are expressed similarly. For them the technique used to measure the expression is of secondary interest compared with a gene-centric query interface; nevertheless links to the sample annotation are essential, and moreover, in the absence of standard gene expression measurement units, links to the experimental procedures are important.
- Biologists and bioinformaticians focused on genome-wide studies. They are interested in queries such as, which genes are expressed in which cell types, how do their orthologs behave in different organisms, and in which pathways they participate. Access to gene expression matrices combining many experiments and links to other relevant information and databases (such as orthologs and pathways) are among their main requirements.
- Algorithm developers, who typically want to download data to their own analysis tools, either for analysis of combined datasets, or for methods development. For them an overview of the database contents and efficient access to large datasets combining several experiments are of the greatest importance.
- Software developers, who may wish to link their gene expression analysis tools to the central repository. For them a programmatic interface to the database rather than a visual one is needed.
Other considerations
In addition to user requirements we considered the following issues: the nature of microarray data, the existing software technologies relevant to database development and the resources that were available to the project. Three important considerations regarding the microarray technology and data are:
- The signal to noise ratio in microarray experiments is variable and typically much lower than in gene sequencing or even molecular structure determination experiments;
- Gene expression measurements are meaningful only in the context of exact experimental conditions; moreover the expression measurement values are dependent on the experimental protocols and data transformation procedures;
- Microarray technology is developing fast, and flexible data management solutions are required.
To be efficient and scalable the repository must harvest data directly from laboratory information management systems (LIMS), local databases or electronic laboratory notebooks. Also, it cannot hope to provide all possible algorithms and methods for data analysis and visualization; therefore the ability to integrate with data analysis tools is important. This brings us to a wider goal of ArrayExpress: to support microarray data sharing infrastructure by facilitating and implementing the necessary community standards and serving as a pilot project that demonstrates the viability of efficient high throughput data sharing.
| DESIGN PRINCIPLES |
|---|
Our main design principles were: (a) the project should be engaged in the development of community standards and support them; (b) the system should enable the capture of all the necessary information needed to interpret the results in a granular, computer interpretable manner; (c) the development should benefit from the relevant and proven software technologies to cope with minimal resources; and (d) the development should be incremental.
Massive amounts of microarray data were already accumulating, and therefore it was important to implement a prototype quickly and in parallel with the standards development. We decided to separate the data archiving from a query optimized data warehouse, prioritizing the first. Taking into account the scope of the project and developing standards, it was infeasible to do a complete initial requirements analysis. First, we needed to generate functional software that could be used for loading data into the database and provide basic access to the data. Afterwards we could refine the auto-generated software, addressing performance problems, adding new query functionality and better ways of presenting information, while the database had to receive submissions and stay online.
We decided to implement the database queries and database visualization software using the object abstraction layer rather than the underlying relational database. This approach enabled faster development of new features, page layouts and queries, and better programmatic access to the database. There is a performance overhead, but this can be addressed incrementally, by prioritizing which functions should be optimized first.
| MICROARRAY DATA STANDARDIZATION: STATE OF THE ART |
|---|
We participated in the work of the Microarray Gene Expression Data society (MGED, see http://www.mged.org) to develop a set of standards needed for microarray data sharing infrastructure. The Minimum Information About a Microarray Experiment, also known as MIAME (Brazma et al., 2001) was developed to specify the microarray experiment data and metadata that should be reported to enable others to understand and interpret the experiment unambiguously. Note that MIAME is a data content standard, not a format standard.
Software interoperability is possible only if there is a formal standard that specifies the communication protocols. The MicroArray Gene Expression (MAGE) standards, the MAGE object model (MAGE-OM) and the MAGE markup language (MAGE-ML), were developed to ensure software interoperability (Spellman et al., 2002) and to encode all MIAME required information. Only a few of the elements in MAGE-ML are mandatory; therefore MAGE-ML can be used to communicate minimally or fully annotated experiments or their parts. MAGE was finalized in 2002, with a minor revision in 2003.
The third component of MGED standards is the MGED ontology (Stoeckert and Parkinson, 2003). It defines sets of common terms and annotation rules for microarray experiments, enabling unambiguous annotation and efficient queries, data analysis and data exchange without loss of meaning.
| SOFTWARE COMPONENTS |
|---|
Database
Instead of implementing our own object model (Brazma et al., 2002), we decided to rely on the MAGE-OM. This was convenient for fulfilling the repository function and adherence to standards, as we could easily import and represent MAGE-ML documents and thus ensure MIAME compliance. In fact, as the MAGE-OM was under development, we implemented several versions and each time ported the existing data to the new version.
The overall system architecture is shown in Figure 1. The database schema was auto-generated from a modified MAGE-OM by our own tool. The database contains more than 200 tables, derived from around 150 classes in the MAGE-OM. The mapping used is relatively straightforward: classes are mapped to tables one-to-one, each object can be distributed across several tables according to the inheritance hierarchy, 1-to-1 and 1-to-many associations are mapped to foreign keys, while many-to-many associations are mapped to link tables. Some local modifications of the object model were done to improve performance of common queries. For example, an obvious requirement is that experiments should be queryable by species. In the MAGE model, going from the Experiment class to the OntologyEntry class, where species information is contained, can involve a potentially unlimited number of steps, due to recursive constructs in MAGE. Therefore a direct association from Experiment class to BioMaterial class was added to the model, and the data loader creates this link by traversing the above-described path only once.
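The mapping rules described above can be illustrated with a small schema generator. This is a minimal sketch in Python, not the ArrayExpress generator itself: the `ModelClass` and `Association` types, the SQL column types and the four-class `MODEL` are assumptions made for illustration and do not reproduce the real MAGE-OM. For simplicity, a one-to-one or one-to-many association is declared on the class whose table holds the foreign key.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Association:
    name: str                   # role name, used to name the column or link table
    target: str                 # class at the other end of the association
    many_to_many: bool = False  # True -> link table, False -> foreign key


@dataclass
class ModelClass:
    name: str
    parent: Optional[str] = None                               # superclass, if any
    attributes: Dict[str, str] = field(default_factory=dict)   # attribute -> SQL type
    associations: List[Association] = field(default_factory=list)


def generate_ddl(classes: List[ModelClass]) -> List[str]:
    """Emit one CREATE TABLE per class, followed by many-to-many link tables."""
    tables, links = [], []
    for cls in classes:
        if cls.parent:
            # A subclass row shares its id with the row in the parent table,
            # so one object is spread over the tables of its hierarchy.
            columns = [f"id INTEGER PRIMARY KEY REFERENCES {cls.parent}(id)"]
        else:
            columns = ["id INTEGER PRIMARY KEY"]
        columns += [f"{attr} {sql_type}" for attr, sql_type in cls.attributes.items()]
        for assoc in cls.associations:
            if assoc.many_to_many:
                links.append(
                    f"CREATE TABLE {cls.name}_{assoc.name} ("
                    f"{cls.name}_id INTEGER REFERENCES {cls.name}(id), "
                    f"{assoc.target}_id INTEGER REFERENCES {assoc.target}(id))"
                )
            else:
                columns.append(f"{assoc.name}_id INTEGER REFERENCES {assoc.target}(id)")
        tables.append(f"CREATE TABLE {cls.name} ({', '.join(columns)})")
    return tables + links


# Hypothetical four-class model, loosely named after classes mentioned in the text.
MODEL = [
    ModelClass("Identifiable", attributes={"identifier": "TEXT"}),
    ModelClass("OntologyEntry", attributes={"category": "TEXT", "value": "TEXT"}),
    ModelClass("BioMaterial", parent="Identifiable",
               associations=[Association("species", "OntologyEntry")]),
    ModelClass("Experiment", parent="Identifiable",
               associations=[Association("biomaterials", "BioMaterial", many_to_many=True)]),
]

if __name__ == "__main__":
    for statement in generate_ddl(MODEL):
        print(statement + ";")
```

The `Experiment_biomaterials` link table in this toy model plays the role of the shortcut association from Experiment to BioMaterial described above.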
The database software is divided into two categories: data loading and data access.
Data loading
ArrayExpress accepts data in MAGE-ML format. The first tool used for data loading is MAGEvalidator. In addition to a simple check of whether the files comply with the DTD (which could be performed by any generic XML validator), a number of less trivial checks are done. Among them is checking the consistency of object identifiers. When the MAGE-OM, a non-hierarchical structure, is transformed into the MAGE-ML DTD (which as an XML language is hierarchical), some associations between classes are translated into XML parent-child relations, while others are realized by using common identifiers; see Spellman et al. (2002) for details. MAGEvalidator checks these non-hierarchical relations and reports errors; e.g. object defined but not referenced, object referenced but not defined, object defined twice. We have extended MAGEvalidator to check content and MIAME compliance by implementing a set of rules that are aimed at reporting specific cases when essential MIAME information is missing (for more details, see online documentation, http://www.ebi.ac.uk/%7Eele/ext/submitter.html#val).
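The identifier consistency checks can be sketched as follows. This is an illustrative Python fragment, not MAGEvalidator code: it assumes that objects are defined by elements carrying an `identifier` attribute and referenced by elements whose names end in `_ref`, and it reads a small document into memory for brevity, which the real tool does not do. The element names in the usage note are made up.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from typing import Dict, List, Tuple


def collect_ids(xml_text: str) -> Tuple[List[str], List[str]]:
    """Split identifiers into definitions and references (assumed *_ref convention)."""
    defined, referenced = [], []
    for element in ET.fromstring(xml_text).iter():
        identifier = element.get("identifier")
        if identifier is None:
            continue
        if element.tag.endswith("_ref"):
            referenced.append(identifier)
        else:
            defined.append(identifier)
    return defined, referenced


def check_identifiers(defined: List[str], referenced: List[str]) -> Dict[str, List[str]]:
    """Report the three kinds of inconsistency named in the text."""
    counts = Counter(defined)
    defined_ids, referenced_ids = set(counts), set(referenced)
    return {
        "defined_twice": sorted(i for i, n in counts.items() if n > 1),
        "defined_not_referenced": sorted(defined_ids - referenced_ids),
        "referenced_not_defined": sorted(referenced_ids - defined_ids),
    }
```

For a document that defines `S1` twice, defines `S2` and `A1`, and references `S1` and `P9`, the report lists `S1` as defined twice, `A1` and `S2` as defined but not referenced, and `P9` as referenced but not defined.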
The second data management tool is MAGEloader, a tool for uploading a set of MAGE-ML files to the repository. Data loading is done in two distinct phases. First, MAGEloader processes the MAGE-ML documents and loads the information into the database. Second, post-processing of the uploaded information is performed: (1) links between objects are computed to facilitate queries (as described above); (2) array information is translated into a spreadsheet format (see below); and (3) expression data matrices are processed to enable efficient retrieval of sub-matrices (e.g. a subset of all quantitation types, for a subset of bioassays).
The third tool is MAGEunloader, which can be used to unload data from the database.
We have built a curation framework around these tools, providing a working environment for ArrayExpress curators. This framework includes a submission tracking system which manages submitted MAGE-ML files, tracing them through various processing stages and providing reports for data curators.
Data access
Data access components provide query functionality, experiment annotation browsing and expression data retrieval. For queries our first priority was to enable users to search and retrieve the submissions from ArrayExpress; therefore queries can return the three main object types: experiments, array designs or protocols. Experiments can be queried by a number of parameters, including the type, sample species and details of array designs used.
Annotation browsing functionality permits users to explore experiment structure by clicking through hyperlinks. Some of the information has been summarized in the form of spreadsheets. Array Description Format (ADF) is a tab-delimited format that provides a compact representation of each array design. Each row in the ADF corresponds to a single feature: its coordinates, the sequence (if provided by the submitter), sequence or gene annotation, and other information provided by the submitter. A similar format is under development for experiments. This spreadsheet would provide an overview of the experiment structure and sample details, without the need to click through numerous hyperlinks.
For expression data access users can select bioassays (samples), quantitation types (e.g. log ratios), as well as sequence annotation for export. Expression data is exported as a tab-delimited file, which can be directly imported into Expression Profiler or saved for analysis by other means.
| TECHNOLOGY AND IMPLEMENTATION |
|---|
Database
ArrayExpress runs on Oracle RDBMS. However, we use very few Oracle-specific features; therefore porting to other DBMS platforms is possible, and only DDL scripts would have to be adapted to a different syntax. MAGEloader uses an Oracle sequence for generating unique object identifiers; therefore some methods (localized inside a single class) would need to be changed to generate identifiers in some other way where the underlying RDBMS does not provide sequences. We have been contacted by groups who intend to port ArrayExpress to other RDBMSs and we know of several such efforts.
Data loading
MAGEloader/MAGEvalidator is a single Java program with two modes of operation. XML parsing is done using a SAX library, and data is loaded into the database as SAX events come in; therefore no large in-memory structures are created, which makes the solution scalable. We do not use MAGEstk (http://www.sourceforge.net/projects/mged) for this application due to potential memory requirements; although the MAGEstk library is based on SAX, parsing results in holding the complete MAGE structure in memory. Use of DOM would result in the same situation, with its attendant memory issues. JDBC (http://www.java.sun.com/products/jdbc/overview.html) is used for writing data to the database. For the MAGEloader post-processing stage, when loaded data needs further treatment, we use Castor (described below).
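The event-driven loading style can be illustrated with Python's standard `xml.sax` module. This is a sketch of the general technique only, not the Java MAGEloader: the `sink` callable stands in for a JDBC insert, and treating every element with an `identifier` attribute as a loadable object is an assumption made for the example.

```python
import io
import xml.sax


class StreamingLoader(xml.sax.ContentHandler):
    """Acts on each element as its start event arrives; keeps no document tree."""

    def __init__(self, sink):
        super().__init__()
        self.sink = sink  # hypothetical callable standing in for a database insert

    def startElement(self, name, attrs):
        # Only the current element's attributes are in memory at this point.
        if "identifier" in attrs:
            self.sink(name, attrs["identifier"])


def load(stream, sink):
    """Parse a binary XML stream and pass each identified element to `sink`."""
    xml.sax.parse(stream, StreamingLoader(sink))


if __name__ == "__main__":
    rows = []
    document = io.BytesIO(b"<A identifier='a1'><B identifier='b1'/><C/></A>")
    load(document, lambda name, identifier: rows.append((name, identifier)))
    print(rows)
```

Because the handler forwards each record immediately, memory use in this sketch depends on the size of a single element rather than on the size of the document.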
One of the tasks of post-loading is processing gene expression matrices so that it is possible to efficiently retrieve sub-matrices corresponding to, e.g. a single experimental condition. In MAGE, data is represented in 3-dimensional matrices where dimensions correspond to quantitation types (e.g. signal, background, ratios), design element (spots or, for normalized data, reporters) and bioassays (hybridizations or, for processed data, different transformations) (Spellman et al., 2002). In the database we do not store values individually as this would result in a table of unmanageable size, and therefore some other way of efficiently slicing these matrices was needed. We chose NetCDF format (Rew and Davis, 1990), which is appropriate for managing multi-dimensional arrays. Post-loading encodes expression matrices in NetCDF and stores them in the database as BLOBs, for efficient access by data access servlets.
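The benefit of storing a matrix as one array-oriented binary object is that a sub-matrix can be read by computing offsets, without decoding every value. The Python sketch below illustrates that idea on a plain row-major blob of 32-bit floats; it is a stand-in and does not use or reproduce the NetCDF format. The dimension order (bioassay, design element, quantitation type) and the function names are assumptions made for the example.

```python
import struct

FLOAT = struct.Struct("<f")  # one little-endian 32-bit floating point value


def pack_matrix(values, n_bioassays, n_elements, n_qtypes):
    """Flatten values[bioassay][element][qtype] in row-major order into bytes."""
    flat = [values[b][e][q]
            for b in range(n_bioassays)
            for e in range(n_elements)
            for q in range(n_qtypes)]
    return struct.pack(f"<{len(flat)}f", *flat)


def read_column(blob, shape, bioassay, qtype):
    """Return one quantitation type of one bioassay for every design element."""
    n_bioassays, n_elements, n_qtypes = shape
    if not (0 <= bioassay < n_bioassays and 0 <= qtype < n_qtypes):
        raise IndexError("bioassay or quantitation type out of range")
    column = []
    for element in range(n_elements):
        # Row-major position of (bioassay, element, qtype) in the flat array.
        index = (bioassay * n_elements + element) * n_qtypes + qtype
        column.append(FLOAT.unpack_from(blob, index * FLOAT.size)[0])
    return column


if __name__ == "__main__":
    shape = (2, 3, 2)  # 2 bioassays, 3 design elements, 2 quantitation types
    values = [[[b * 100 + e * 10 + q for q in range(2)] for e in range(3)] for b in range(2)]
    blob = pack_matrix(values, *shape)
    print(read_column(blob, shape, bioassay=1, qtype=1))
```

Only the requested values are unpacked here, so in this sketch the cost of a slice grows with the size of the slice rather than with the size of the whole matrix.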
MAGEunloader is implemented differently. When MAGEloader loads data into the database, it creates a deletion log, a sequence of SQL commands required to reverse effects of data loading. MAGEunloader processes these logs in reverse order; it is a script that launches sqlplus on the reversed log. This approach was the simplest to implement, taking into account the complexity of ArrayExpress schema. It would be quite difficult to navigate through the maze of objects, find and delete the right ones. We also decided not to use delete cascade since we considered this an unsafe method for a large and complex database like ArrayExpress.
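The deletion-log approach can be sketched in a few lines. The example below uses Python with SQLite purely as a stand-in for Oracle and sqlplus; the function names and the two-table schema in the usage note are hypothetical. It assumes that every table has an integer `id` as its first column and that table names come from trusted code.

```python
import sqlite3


def load_with_log(conn, inserts, log):
    """Insert rows and append to `log` the SQL statement that undoes each insert."""
    for table, values in inserts:
        placeholders = ", ".join("?" for _ in values)
        # Table names are interpolated directly; they must never be user input.
        conn.execute(f"INSERT INTO {table} VALUES ({placeholders})", values)
        log.append(f"DELETE FROM {table} WHERE id = {int(values[0])}")


def unload(conn, log):
    """Replay the deletion log backwards, so dependent rows go before their parents."""
    for statement in reversed(log):
        conn.execute(statement)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("CREATE TABLE experiment (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE bioassay (id INTEGER PRIMARY KEY, "
                 "experiment_id INTEGER REFERENCES experiment(id))")
    log = []
    load_with_log(conn, [("experiment", (1, "E1")), ("bioassay", (10, 1))], log)
    unload(conn, log)
    print(conn.execute("SELECT COUNT(*) FROM experiment").fetchone()[0])
```

With foreign keys enforced, replaying the log in its original order would try to delete the experiment while a bioassay still refers to it; reversing the log avoids this without any knowledge of the schema.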
The data curation framework is implemented as a web application that runs on the Tomcat Java application container (http://jakarta.apache.org/tomcat/).
Data access
Access to ArrayExpress data is provided through a web interface by a set of Java servlets. We are using several open-source Java libraries. The first is Castor relational-to-object mapping software (see http://castor.exolab.org). It enables making OQL queries instead of SQL ones, as well as automatic population of Java objects from the database tables. Castor implements a limited subset of OQL, but this is sufficient for our purposes. Expressing queries to a database in terms of classes and associations rather than in terms of tables and joins is natural and easier for developers, especially in a case where the database schema has been auto-generated from an object model.
We have implemented a generic query servlet that receives a query formatted according to certain conventions as a URL and retrieves MAGE Java objects from the database. This enables external applications, in addition to our own web application, to query and access the database. Multiple queries can be chained in a single request, which makes it possible to retrieve objects of more than one class.
Additionally we use Velocity (http://jakarta.apache.org/velocity/), which allows us to separate application logic (data retrieval in our case) from user presentation. Velocity templates are essentially HTML files with variable references and some control structures. After the query execution servlet has retrieved Java objects, templates are merged with the retrieved data, producing HTML pages that are returned to the client.
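The separation of data retrieval from presentation can be illustrated with Python's `string.Template`, used here only as a stand-in for Velocity. The template text, the field names and the accession `E-TEST-1` are made up for this sketch.

```python
from html import escape
from string import Template

# Presentation lives entirely in the templates; no retrieval logic appears here.
EXPERIMENT_ROW = Template("<tr><td>$accession</td><td>$species</td></tr>")
PAGE = Template("<table>\n$rows\n</table>")


def render(experiments):
    """Merge already-retrieved records with the templates to produce HTML."""
    rows = "\n".join(
        EXPERIMENT_ROW.substitute({key: escape(str(value)) for key, value in record.items()})
        for record in experiments
    )
    return PAGE.substitute(rows=rows)


if __name__ == "__main__":
    # Hypothetical record, as it might arrive from the query step.
    print(render([{"accession": "E-TEST-1", "species": "Homo sapiens"}]))
```

Changing the page layout in this arrangement means editing a template, while the code that retrieves the records stays untouched.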
Code auto-generation
We have relied extensively on code auto-generation so that we could deal with the complexity of the MAGE object model with minimal resources. We did not find suitable tools; therefore we developed our own code generators. The following components are auto-generated:
- Database schema;
- Java classes responsible for MAGE-ML validation and loading (that correspond to MAGE-ML elements);
- Java classes used by data retrieval servlets (in practice these are merged with the classes for loading);
- Castor relational-to-object mapping description;
- Velocity templates providing default visualization of ArrayExpress contents.
| INSTALLATION |
|---|
All the software is open-source and available for download. Requirements for installing ArrayExpress locally are the following: Oracle server (initially we used Oracle 8i and now we use Oracle 9i), Java 1.4 or later for data validator/loader, Tomcat server (version 4.1 or later) for data access software.
Not all the ArrayExpress components need to be installed to obtain a functional system. For instance, the MAGEvalidator (a Java application) can be installed on its own and used for validating MAGE-ML files (without database support). It is used by pipeline submitters to ensure valid data export. Another minimal installation comprises only the ArrayExpress database schema, with data loading and access software written locally. The next level requires both the database schema and the MAGEloader/MAGEvalidator tool suite, which enables loading data but relies on accessing the data by some other means. Finally, installation of the full ArrayExpress software set is also possible (database schema, data loading application and data access servlets).
| FUNCTIONALITY AND COMPARISON WITH OTHER SYSTEMS |
|---|
The current database functionality satisfies the ArrayExpress goals (1), (2a) and (3) and partly (2c) and (2d), as described in the user requirement section. At the same time all necessary data has been stored in a granular way enabling us to satisfy all other requirements as appropriate software is developed (outlined below in the future developments section).
Several publications comparing gene expression databases (Gardiner-Garden and Littlejohn, 2001; Do et al., 2003) were written before ArrayExpress was available online. Here we concentrate on the features distinguishing ArrayExpress from other microarray databases.
First of all, ArrayExpress is one of the two currently functioning international public gene expression repositories, the other being GEO (Edgar et al., 2002) at the NCBI (the third repository CIBEX is under development in Japan). Most other database projects have been developed to serve needs of a single laboratory or institution; therefore they often limit the scope of supported array technologies, import formats, and specific types of queries and data analysis methods.
An important feature that distinguishes ArrayExpress from most other projects is the support of community standards. ArrayExpress was the first database and the first software able to import and export MAGE-ML, as well as the first MIAME-compliant database. ArrayExpress can deal with all experimental technologies that can be described by MAGE. Experiment annotation capabilities are far richer than for most other database systems where at best limited sets of fields are provided for sample and experiment characteristics, or only free-text annotation is provided. ArrayExpress data validation and loading tools are able to check that ontologies have been used according to MGED guidelines.
The ArrayExpress project makes a strict separation between: (1) the database itself; (2) data annotation/entry tool MIAMExpress, which can export MAGE-ML not only to ArrayExpress, but to any MAGE-ML supporting tool; and (3) data analysis tool Expression Profiler (Kapushesky et al., 2004). All these components can be used independently. Also, we separate the data repository from the data warehouse so that we can support conflicting database requirements: capturing fine details of microarray experiments versus providing flexible, efficient queries, as well as archiving publication related data versus providing biology centric view on gene expression data.
Although the underlying database management system is relational, we use an object layer that makes development easier; to the best of our knowledge, object technology is systematically used only in GIMS (Paton et al., 2000), where a native object-oriented database is used as a backend.
| DISCUSSION |
|---|
One of the software engineering problems that we faced was balancing between the use of modern development techniques to help us to minimize the development time and resources, and the minimization of the risk that the chosen design might not be robust and scalable because of the use of unproven technologies. We have been able to release increasingly functional software throughout the course of the project while the amount of the data has increased over 100-fold; therefore we believe that our choices have been right. ArrayExpress development has been performed by on average two full-time developers, with help from the database curators (database testing and help documentation). To enable development time of months between releases with these limited resources, code auto-generation has been essential. First we auto-generated functional software that could be used for loading data into the database and provide access to the database, and then we refined the software, addressing performance problems, adding new functionality, better ways of presenting information, while the database received submissions and stayed online.
A view has been expressed that object models (and MAGE-OM in particular) are not suitable for generating database schemas, and that schema development should take into account potential queries. Our experience shows that this is not true; a system that is produced by direct mapping from the object model is easy to implement and maintain, and it can still provide good performance. If and when the need arises to improve the performance, the database schema can be modified, for instance by adding extra objects and indices. Using automated relational-to-object mapping libraries helps to make the development and maintenance easier; if the performance overhead turns out to be prohibitive with growing volumes of data, efficiency can be improved as necessary. In our case this happened with the generation of ADF, which we rewrote using JDBC.
Code auto-generation is usually considered useful for tools that have to support multiple projects, but we have demonstrated that it is also useful for a single concrete project, especially as we started ArrayExpress development before the MAGE-OM had stabilized. We have also successfully used the same tools for other projects with different object models, e.g., the data warehouse project (described below). We obtained the core infrastructure (database schema, data loading and access applications) for free, within only a few days of starting the projects, and could then spend time and resources to refine the basic implementation.
| PERFORMANCE AND SCALABILITY |
|---|
Scalability of the system is ensured by using NetCDF binary format for storing expression values, as opposed to, e.g., creating a new database record for every single expression value. This limits the potential queries that ArrayExpress can support, but we believe this is the only viable option that will support growth of the database in the long term. The number of records that the database engine has to manage is proportional to the number of hybridizations in the database, currently in excess of 10 000 and eventually rising to hundreds of thousands, a figure well within the reach of RDBMS engines, especially Oracle with its industry-proven track record of successfully handling very large databases. Essentially ArrayExpress works in two stages: (1) queries on metadata of data sets are performed, managed by the RDBMS engine; (2) subsets of data as requested by the user are extracted, a process performed by NetCDF manipulation libraries. This two level organization allows us to manage billions of data points. Currently ArrayExpress holds more than 10 billion data points, and as data submission continues, the only limitations are those related to disk space, not database architecture.
| FUTURE DEVELOPMENTS |
|---|
MAGE-OM closely mirrors microarray experiment structure, which enables MIAME compliance, but is not appropriate for all required queries, especially gene centric ones. A separate, query optimized data warehouse is under development. Users will be able to combine various search criteria: on genes, samples, experiments and expression values, and retrieve fine-grained objects, such as genes, expression profiles or samples. The data warehouse will enable detailed queries across many experiments, and will satisfy the requirements (2b) to (2d). We are using the BioMart system (Kasprzyk et al., 2004) for implementing such flexible query support.
Such a data warehouse combining experiments performed on different array platforms is possible only if the array design elements representing the same genes can be mapped to a common reference. In principle this can be done in two ways: (a) by using standard gene identifiers; or (b) by mapping array sequences to the genome. Since for many species standard gene identifiers have not yet been established or are still changing, the utility of the first route is limited. We have developed a prototype system that uses the Ensembl database (Birney et al., 2004) for mapping array sequences to Ensembl genes. This is possible only for array descriptions which provide the actual reporter sequences used on the array, such as oligonucleotides. Therefore it is important to understand that only experiments based on such open array platforms can be reliably integrated with the rest of genomics data.
Our aim is to provide tools facilitating automated data loading, to shift the burden of data validation off our curation team. The MAGEvalidator will be extended to validate the MIAME semantics where possible. This will allow us to accept data submissions to the repository efficiently, and to move the human curation efforts from the repository to warehouse, where additional quality control checks can be introduced. Only data passing these will be loaded into the warehouse. This will ensure a clear separation between the functions of the primary archive and the value-added database.
Currently we do not provide programmatic access to ArrayExpress. Although URL conventions can be used for querying, and the resulting HTML can be parsed to extract the necessary details, this is not the best solution, since it depends on an HTML structure that combines data and layout details. We intend to provide a web service that will accept queries for data and/or annotations and return MAGE-ML. This will effectively fulfil requirement (2e).
Currently ArrayExpress does not store raw microarray scans (images). Although image management is essential in laboratory databases, we believe that the utility of images in a public repository would be too limited to justify the necessary investment in developing appropriate image management software. Images would be difficult to use in queries, and they cannot easily facilitate meta-analysis of combined datasets. One of the most important uses of microarray images is in quality control; therefore we encourage submitters to store images in their own databases and provide ArrayExpress with the respective URLs. Effectively this means that these images are available via ArrayExpress, though they are not stored locally. This is the first step towards more distributed data storage (see below). So far we have not had many requests to provide the images; if such a demand arises in the future, then, provided that we have the required resources, we will implement an image management system in ArrayExpress. If standard methods to assess image (and particular spot) quality are introduced, the use of the images may become less important.
Looking into the more distant future, a question arises whether a single site or a few sites can manage worldwide public gene expression data. An area for exploration is a distributed infrastructure where data is stored on many nodes; ArrayExpress (along with a few other centralized resources) could then serve as a query broker, storing not complete datasets but only the meta-information needed to support queries. It would receive and process requests, and point the client to the nodes where the detailed information is available. Such a development would require, in addition to MAGE-ML, a standardized gene expression data query language and a data sharing infrastructure such as GRID. The process of developing such a language has been initiated through OMG (http://www.omg.org/lsr/).
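The broker idea can be made concrete with a small Python sketch. It is a toy model under stated assumptions, not a proposed design: the `Broker` class, the accession strings, the node URLs and the keyword-based matching are all hypothetical, and a real system would need the standardized query language mentioned above rather than simple keyword lookup.

```python
from dataclasses import dataclass, field


@dataclass
class Broker:
    """Holds only meta-information: which node stores which experiment."""

    # experiment accession -> (node URL, set of lower-cased annotation keywords)
    index: dict = field(default_factory=dict)

    def register(self, accession, node_url, keywords):
        """Record where an experiment lives and how it is annotated."""
        self.index[accession] = (node_url, {k.lower() for k in keywords})

    def locate(self, keyword):
        """Return {node_url: [accessions]} for experiments matching a keyword."""
        hits = {}
        for accession, (node_url, keywords) in sorted(self.index.items()):
            if keyword.lower() in keywords:
                hits.setdefault(node_url, []).append(accession)
        return hits


broker = Broker()
broker.register("EXP-1", "http://node-a.example.org", ["yeast", "heat shock"])
broker.register("EXP-2", "http://node-b.example.org", ["mouse", "liver"])
broker.register("EXP-3", "http://node-a.example.org", ["Yeast", "cell cycle"])
```

A client calling `broker.locate("yeast")` receives only the node address and the matching accessions; the expression data itself would then be fetched from that node, so the broker never needs to hold complete datasets.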
| Acknowledgments |
|---|
ArrayExpress development has been largely funded by the TEMBLOR grant from the European Commission, with contributions from the International Life Sciences Institute of the Environmental Health and Safety Institute and from the CAGE grant from the European Commission. The initial funding was provided by Incyte Genomics. We would like to thank Jaak Vilo, Patrick Kemmeren, the MGED MAGE working group, the staff of the Stanford Microarray Database, TIGR and many other external collaborators.
Received on July 2, 2004; revised on October 13, 2004; accepted on November 15, 2004
| REFERENCES |
|---|
Birney, E., Andrews, D., Bevan, P., Caccamo, M., Cameron, G., Chen, Y., Clarke, L., Coates, G., Cox, T., Cuff, J., et al. (2004) Ensembl 2004. Nucleic Acids Res., 32, D468–D470.
Brazma, A., Robinson, A., Cameron, G., Ashburner, M. (2000) One-stop shop for microarray data. Nature, 403, 699–700.
Brazma, A., Hingamp, P., Quackenbush, J., Sherlock, G., Spellman, P., Stoeckert, C., Aach, J., Ansorge, W., Ball, C.A., Causton, H.C., et al. (2001) Minimum information about a microarray experiment (MIAME)–toward standards for microarray data. Nat. Genet., 29, 365–371.
Brazma, A., Sarkans, U., Robinson, A., Vilo, J., Vingron, M., Hoheisel, J., Fellenberg, K. (2002) Microarray data representation, annotation and storage. Adv. Biochem. Eng. Biotechnol., 77, 113–139.
Brazma, A., Parkinson, H., Sarkans, U., Shojatalab, M., Vilo, J., Abeygunawardena, N., Holloway, E., Kapushesky, M., Kemmeren, P., Lara, G.G., et al. (2003) ArrayExpress–a public repository for microarray gene expression data at the EBI. Nucleic Acids Res., 31, 68–71.
Do, H.H., Kirsten, T., Rahm, E. (2003) Comparative evaluation of microarray-based gene expression databases. Proceedings of the 10th Fachtagung Datenbanksysteme für Business, Technologie und Web (BTW 2003), Leipzig, Germany.
Dysvik, B. and Jonassen, I. (2001) J-Express: exploring gene expression data using Java. Bioinformatics, 17, 369–370.
Edgar, R., Domrachev, M., Lash, A.E. (2002) Gene expression omnibus: NCBI gene expression and hybridization array repository. Nucleic Acids Res., 30, 207–210.
Gardiner-Garden, M. and Littlejohn, T.G. (2001) A comparison of microarray databases. Brief. Bioinformatics, 2, 220.
Gollub, J., Ball, C.A., Binkley, G., Demeter, J., Finkelstein, D.B., Hebert, J.M., Hernandez-Boussard, T., Jin, H., Kaloper, M., Matese, J.C., et al. (2003) The Stanford Microarray Database: data access and quality assessment tools. Nucleic Acids Res., 31, 94–96.
Kapushesky, M., Kemmeren, P., Culhane, A.C., Durinck, S., Ihmels, J., Korner, C., Kull, M., Torrente, A., Sarkans, U., Vilo, J., Brazma, A. (2004) Expression Profiler: next generation–an online platform for analysis of microarray data. Nucleic Acids Res., 32, W465–W470.
Kasprzyk, A., Keefe, D., Smedley, D., London, D., Spooner, W., Melsopp, C., Hammond, M., Rocca-Serra, P., Cox, T., Birney, E. (2004) EnsMart: a generic system for fast and flexible access to biological data. Genome Res., 14, 160–169.
Manduchi, E., Grant, G.R., He, H., Liu, J., Mailman, M.D., Pizarro, A.D., Whetzel, P.L., Stoeckert, C.J., Jr. (2004) RAD and the RAD Study-Annotator: an approach to collection, organization and exchange of all relevant information for high-throughput gene expression studies. Bioinformatics, 20, 452–459.
Paton, N.W., Khan, S.A., Hayes, A., Moussouni, F., Brass, A., Eilbeck, K., Goble, C.A., Hubbard, S.J., Oliver, S.G. (2000) Conceptual modelling of genomic information. Bioinformatics, 16, 548–557.
Rew, R.K. and Davis, G.P. (1990) NetCDF: an interface for scientific data access. IEEE Comput. Graphic. Appl., 10, 76–82.
Saeed, A.I., Sharov, V., White, J., Li, J., Liang, W., Bhagabati, N., Braisted, J., Klapa, M., Currier, T., Thiagarajan, M., et al. (2003) TM4: a free, open-source system for microarray data management and analysis. Biotechniques, 34, 374–378.
Spellman, P.T., Miller, M., Stewart, J., Troup, C., Sarkans, U., Chervitz, S., Bernhart, D., Sherlock, G., Ball, C., Lepage, M., et al. (2002) Design and implementation of microarray gene expression markup language (MAGE-ML). Genome Biol., 3, RESEARCH0046.
Stoeckert, C.J., Jr. and Parkinson, H. (2003) The MGED ontology: a framework for describing functional genomics experiments. Comp. Funct. Genomics, 4, 127–132.

