Bioinformatics Advance Access originally published online on January 12, 2005
Bioinformatics 2005 21(9):1754-1757; doi:10.1093/bioinformatics/bti246
AGML Central: web based gel proteomic infrastructure
1Department of Biostatistics, Bioinformatics and Epidemiology 135 Cannon Street, Suite 303, P.O. Box 250835, Charleston, SC 29425, USA
2Division of Nephrology, Department of Medicine, Medical University of South Carolina 135 Cannon Street, Suite 303, P.O. Box 250835, Charleston, SC 29425, USA
*To whom correspondence should be addressed.
| Abstract |
|---|
Summary: AGML Central is a web-based open-source public infrastructure for dissemination of two-dimensional gel electrophoresis (2-DE) proteomics data in AGML (Annotated Gel Markup Language) format. It includes a growing collection of converters from proprietary formats such as those produced by PDQUEST (BioRad), PHORETIX 2-D (Nonlinear Dynamics) and Melanie (GenBio SA). The resulting unified AGML-formatted entry, with or without the raw gel images, is optionally stored in a database for future reference. AGML Central was developed to provide a common platform for data dissemination and development of 2-DE data analysis tools. This resource responds to the increasing use of AGML for public 2-DE data representation, which requires automated tools for conversion from proprietary formats. Conversion and short-term storage are publicly available; permanent storage requires prior registration. A Java applet was developed to visualize the AGML data with cross-reference links. To facilitate automated access, a SOAP web service is also included in the AGML Central infrastructure.
Availability: http://bioinformatics.musc.edu/agmlcentral
Contact: stanisrc{at}musc.edu
| INTRODUCTION |
|---|
Dissemination of information gathered by proteomic methods is particularly challenging due to the sheer volume of data and the characteristic diversity of the data within an experiment. This has become more obvious in recent years with the rapid expansion of technology for high-throughput proteomics, as proteins are often the functional molecules and represent important targets in disease amelioration. The expansion in technology development and the explosion in the data collected increasingly require screening and analysis by computers with minimal human interaction. This is most obvious in the use of two-dimensional gel electrophoresis (2-DE) for proteomics screening. We have recently proposed an extensible markup language (XML) data structure, known as annotated gel markup language (AGML), to describe and index proteomic analytes generated from 2-DE experiments (Stanislaus et al., 2004). AGML is a common format: data acquired through different analysis tools can be converted to a single XML structure that is free of proprietary information that might restrict its usage or dissemination. The specifications of the AGML format are periodically upgraded to accommodate new methods and techniques used in 2-DE.
The publication of the human genome sequence brings scientists a step closer to the next big mapping project, sequencing the human proteome (Abbott, 2001; Humphery-Smith, 2004). This would create an enormous informatics problem in managing diverse databases and differential data dissemination problems. Databases developed on an open-source web platform will have the greatest strength in reaching a wide audience through the internet. The financial burden will also be lessened by using tools that interface with open technology data structures. To this end, databases developed on an open platform, and tools developed for that platform, could greatly facilitate the understanding of the proteome. In addition to the current effort, other institutions and organizations have also tried to address the informatics problems related to proteomics. For example, the Seattle Proteome Center has produced the mzXML schema for marking up mass spectrometry data (http://sashimi.sourceforge.net/), and the Proteomics Standards Initiative has been working toward a data representation standard among the proteomics community (http://psidev.sourceforge.net/). Efforts are also underway to produce a standard that would represent proteomic data and the minimum required level of detail (http://www.iscb.org/ismb2004/posters/christATebi.ac.uk_1.html). The web-based proteome informatics efforts are not limited to academia. Bio/IT companies, such as Ludesi (http://www.ludesi.se/), are actively pursuing better ways to store, analyze and interpret proteomic data collected from multiple sites and formats.
In addition to proteomic data representation, data storage also has its own set of problems. A number of 2-DE repositories are maintained at different institutions (for a list see http://au.expasy.org/ch2d/2d-index.html). AGML Central is unique in this regard in that it saves all the proteomic information in one non-proprietary XML format, AGML (Stanislaus et al., 2004), which provides a comprehensive description of both the experimental settings and the experimental results, including mass spectrometry information for specific spots. AGML Central expands the usefulness of AGML by providing an environment where converters for other, mostly proprietary, formats are implemented and made freely and publicly available. This resource brings much needed interoperability to an area fragmented by proprietary software tools. Moreover, the AGML schema itself was developed to fulfill the federated 2-DE database guidelines of SWISS-2DPAGE (Appel et al., 1996; Stanislaus et al., 2004).
| AGML CENTRAL ORGANIZATION AND IMPLEMENTATION |
|---|
Organization
AGML Central site
AGML Central is organized as shown in Figure 1. The AGML Central database is fed by different converter resources, e.g. PDQUEST. Thus, data need not originally be in AGML format to be submitted to the AGML Central database; the provided converters translate the proprietary input data into an AGML document, which is then submitted to AGML Central. The AGML Central infrastructure discussed here also provides tools (discussed below under AGML Central database and tools) for analysis of the data.
Converting from non-AGML format
In Figure 1 and on the AGML Central web page the converters are designated by the name of the software that generates the data, such as PDQUEST, Phoretix, etc. These links provide the converters that translate the proprietary files into the non-proprietary AGML format. Currently there are three links, namely PDQUEST, Phoretix and Melanie, each with a translator to the AGML format. The site is continually upgraded with additional converters as they are requested by users of the AGML Central infrastructure.
AGML Central database and tools
Database
The database, made up of AGML files, can be accessed in the database view page. The structures of all the submitted files are the same, regardless of whether the original format was Melanie, PDQUEST or Phoretix. At this stage all files are structurally normalized to the AGML format, which makes it possible to compare information generated by different data acquisition software. In addition to this view, a keyword search tool is incorporated into the database. The search tool takes in any value and matches it against the entire contents of the AGML XML files in the database. For example, one could search for a particular molecular weight or protein among all the files in the database.
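The keyword search described above can be illustrated with a minimal sketch. It assumes the AGML files are available as XML strings; the element names (`gel`, `spot`, `mw`, `protein`), the function names and the case-insensitive substring matching are all hypothetical and are not taken from the AGML schema or from the AGML Central implementation.

```python
import xml.etree.ElementTree as ET


def document_values(xml_text):
    """Yield every text node and attribute value found in an XML document."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        if element.text and element.text.strip():
            yield element.text.strip()
        if element.tail and element.tail.strip():
            yield element.tail.strip()
        for value in element.attrib.values():
            yield value


def search_agml(documents, keyword):
    """Return the names of documents in which any value contains the keyword.

    `documents` maps a file name to its XML content (hypothetical layout).
    Matching is case-insensitive substring matching, which is an assumption;
    the actual AGML Central search may behave differently.
    """
    needle = keyword.lower()
    return sorted(
        name
        for name, xml_text in documents.items()
        if any(needle in value.lower() for value in document_values(xml_text))
    )


# Two toy documents with made-up element names, not the real AGML schema.
EXAMPLE_DOCUMENTS = {
    "gel_a.xml": "<gel><spot id='1'><mw>45.2</mw><protein>Albumin</protein></spot></gel>",
    "gel_b.xml": "<gel><spot id='7'><mw>12.9</mw><protein>Cystatin</protein></spot></gel>",
}

if __name__ == "__main__":
    print(search_agml(EXAMPLE_DOCUMENTS, "albumin"))
    print(search_agml(EXAMPLE_DOCUMENTS, "12.9"))
```

Because every file shares one structure after conversion, a single routine of this kind can search entries that originated in different acquisition programs.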
Web services
The AGML Central infrastructure also provides a way to automate access to the database and tools discussed here through a web service described in Web Services Description Language (WSDL; http://www.w3.org/TR/wsdl). The service takes in requests as described by the AGML WSDL document (http://bioinformatics.musc.edu/axis/services/AGML?wsdl) and returns the requested AGML files.
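As a rough illustration of what an automated request to such a service might look like, the sketch below builds a SOAP 1.1 envelope using only the Python standard library and prepares, without sending, the HTTP POST that would carry it. The operation name `getAGMLFile`, the parameter `entryId`, the service namespace `urn:AGML` and the endpoint (the WSDL URL with `?wsdl` removed) are assumptions made for illustration; the actual operations and messages are defined by the AGML WSDL document. The use of SOAP 1.1 is likewise assumed.

```python
import urllib.request
import xml.etree.ElementTree as ET

SOAP_ENV_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
# Assumed endpoint: the WSDL URL from the text with the "?wsdl" query removed.
ENDPOINT = "http://bioinformatics.musc.edu/axis/services/AGML"
# Hypothetical names; the real ones are defined in the AGML WSDL document.
SERVICE_NS = "urn:AGML"
OPERATION = "getAGMLFile"


def build_envelope(entry_id):
    """Serialize a SOAP 1.1 envelope that calls the hypothetical operation."""
    ET.register_namespace("soapenv", SOAP_ENV_NS)
    envelope = ET.Element(f"{{{SOAP_ENV_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{OPERATION}")
    ET.SubElement(call, "entryId").text = str(entry_id)
    return ET.tostring(envelope, encoding="utf-8", xml_declaration=True)


def build_request(entry_id):
    """Prepare (but do not send) the HTTP POST carrying the envelope."""
    return urllib.request.Request(
        ENDPOINT,
        data=build_envelope(entry_id),
        headers={"Content-Type": "text/xml; charset=utf-8", "SOAPAction": '""'},
        method="POST",
    )


if __name__ == "__main__":
    request = build_request(42)
    print(request.data.decode("utf-8"))
    # Sending would be: urllib.request.urlopen(request).read()
```

In practice a client would normally be generated from the WSDL document rather than written by hand; the hand-built envelope is shown only to make the request structure visible.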
AGML Visualizer
The AGML Visualizer tool provides the ability to view XML files in the AGML format. AGML Visualizer is a Java applet (Fig. 2); it is therefore platform independent and can be viewed in any modern browser. It displays all the relevant details of the experiment alongside the 2-DE gel image, thereby putting the data in context. In addition, clicking on a spot, known or unknown, provides more information about the spot, such as the intensity, Mw and pI. If a known protein is found, the AGML Visualizer search tool can be used to search other 2-DE databases such as SWISS-2DPAGE, SIENA-2DPAGE, etc. This allows the spots to be cross-checked against other published values. If the 2-DE gel is not calibrated, the axes are marked with the units identified in the XML file rather than with the standard Mw and pI values.
Image database
Additionally, images of the gel in any format, for example tif or the proprietary PDQUEST gsc format, can be uploaded for storage or distribution of the data.
Implementation
The AGML Central web services infrastructure runs on a dual Intel® 3.2 GHz Xeon™ processor, 8 GB RAM server running the Red Hat Linux operating system. The software is written in PHP (http://www.php.net/) and BASH and interacts with a PostgreSQL (http://www.postgresql.org/) database to save the AGML files and images. The AGML Central implementation is shown in Figure 1. The web architecture of the site enables investigators to generate, with minimal human interaction, an AGML document containing all the information needed to analyze the data. The web services to access the database and tools are provided by the Apache Axis implementation of SOAP (Simple Object Access Protocol) (http://ws.apache.org/axis/, http://www.w3.org/TR/soap/). The operations and messages relating to the web services are described in WSDL format (http://www.w3.org/TR/wsdl).
How to use the AGML Central infrastructure
In order to use the AGML Central website, the gel data needs to be in one of the formats supported, such as PDQUEST, Phoretix or Melanie. Using the appropriate converter (see converter sites above), data has to be uploaded to the corresponding site. More information on how to submit the data can be found in the corresponding converter's website. Once the data is submitted, server side programs convert the data file into the corresponding AGML file and automatically submit it to the AGML Central infrastructure (Fig. 1). Depending on the size of the file and load on the system this could take a couple of minutes. Once in the AGML Central database, one can use a number of tools for further analysis of the data, or to download the AGML file for local storage and analyses of the data. The popularity of the AGML Central infrastructure has been growing recently with 47 registered users, who have so far processed 178 sets of aligned gels, totaling a cumulative number of close to 1800 individual gel runs.
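The conversion step in this workflow can be pictured with a minimal sketch that turns a tab-delimited spot table into a small XML document. Both the input columns (`spot_id`, `pI`, `Mw`, `intensity`) and the output element names are hypothetical stand-ins; real PDQUEST, Phoretix or Melanie exports and the actual AGML schema are considerably richer, and this is not the server-side converter itself.

```python
import csv
import io
import xml.etree.ElementTree as ET


def spots_to_xml(tab_text, source_software):
    """Convert a tab-delimited spot table into a small XML document.

    Both the input columns (spot_id, pI, Mw, intensity) and the output
    element names are hypothetical stand-ins, not the AGML schema.
    """
    root = ET.Element("gel", attrib={"source": source_software})
    reader = csv.DictReader(io.StringIO(tab_text), delimiter="\t")
    for row in reader:
        spot = ET.SubElement(root, "spot", attrib={"id": row["spot_id"]})
        for column in ("pI", "Mw", "intensity"):
            ET.SubElement(spot, column).text = row[column]
    return ET.tostring(root, encoding="unicode")


# Made-up export with two spots, used only to exercise the function.
EXAMPLE_EXPORT = "spot_id\tpI\tMw\tintensity\n1\t5.6\t45.2\t1830\n2\t7.1\t12.9\t412\n"

if __name__ == "__main__":
    print(spots_to_xml(EXAMPLE_EXPORT, "PDQUEST"))
```

One such function per supported input format, all emitting the same output structure, is what allows downstream tools to ignore which program produced the data.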
| DISCUSSION |
|---|
The AGML Central infrastructure provides a common location and format for the storage of 2-DE data. Analysis software currently being developed interacts with the AGML Central database, giving researchers tools to analyze their data (a prototype in development can be viewed at http://bioinformatics.musc.edu/Web2DE/). Tools are currently being developed for principal component analysis (PCA), cluster analysis and normalization of spot intensities in 2-DE gels (Almeida et al., 2005). A major advantage of storing the data in AGML format is the interoperability that comes with having converters from other formats: any tool developed for the analysis of data in AGML format can be used without modification by any other researcher using a format for which an AGML converter is available. Currently, most data analysis tools are proprietary and limited to data generated by the corresponding software. Thus, there is no simple way to compare data across different software platforms. AGML Central was developed to provide interoperability by overcoming proprietary boundaries. This infrastructure not only assists integrative analysis of data obtained at multiple locations, but also facilitates development of format-independent open-source data analysis applications for 2-DE data acquired by a variety of proprietary systems.
In addition to analysis and storage tools, AGML Central provides a tool, AGML Visualizer, to view the 2-DE gels and retrieve experimental and spot identity information. This tool juxtaposes the gel information (such as the origin of the sample), the 2-DE images and the individual spot information, bringing together in a single interface all the information pertaining to the corresponding 2-DE experiment. The Visualizer also integrates a search tool that accesses other 2-DE databases, giving convenient access to, for example, additional information regarding a polypeptide spot found in a 2-DE gel. This linking of 2-DE gel data with other databases fulfills one of the requirements proposed for the federated database (Appel et al., 1996).
With the creation of analysis tools, the AGML Central infrastructure offers a way of comparing 2-DE gels acquired with different software programs (e.g. Phoretix versus PDQUEST). The creation of AGML Central is proposed as a first step in a wider effort to develop open-source public license tools to interconvert, analyze, visualize and mine 2-DE data. It is also worth noting that AGML Central is a database of 2-DE data files submitted by researchers in the field of 2-DE. In contrast, for example, the SWISS-2DPAGE database contains 2-DE gels generated at ExPASy (Hoogland et al., 1999). The AGML Central infrastructure reflects a commitment to an ongoing proteomics initiative by the NHLBI of the NIH (see Acknowledgments section) for development and advancement of proteomics research. Consequently, the initiative discussed here tries to address two major informatics obstacles facing proteomic data analysis, namely a common data structure to store 2-DE data, AGML (Stanislaus et al., 2004), and a central location for storing the data, AGML Central, for perusal by other researchers.
| Acknowledgments |
|---|
This work was supported by the NHLBI Proteomics Initiative through contract N01-HV-28181.
Received on August 17, 2004; revised on December 17, 2004; accepted on December 20, 2004
| REFERENCES |
|---|
Abbott, A. (2001) And now for the proteome. Nature, 409, 747.
Almeida, J.S., Stanislaus, R., Krug, E., Arthur, J. (2005) Normalization and analysis of residual variation in 2D Gel Electrophoresis for quantitative differential proteomics. Proteomics, 5, 1242-1249.
Appel, R.D., Bairoch, A., Sanchez, J.C., Vargas, J.R., Golaz, O., Pasquali, C., Hochstrasser, D.F. (1996) Federated two-dimensional electrophoresis database: a simple means of publishing two-dimensional electrophoresis data. Electrophoresis, 17, 540-546.
Hoogland, C., Sanchez, J.C., Walther, D., Baujard, V., Baujard, O., Tonella, L., Hochstrasser, D.F., Appel, R.D. (1999) Two-dimensional electrophoresis resources available from ExPASy. Electrophoresis, 20, 3568-3571.
Humphery-Smith, I. (2004) A human proteome project with a beginning and an end. Proteomics, 4, 2519-2521.
Stanislaus, R., Jiang, L.H., Swartz, M., Arthur, J., Almeida, J.S. (2004) An XML standard for the dissemination of annotated 2D gel electrophoresis data complemented with mass spectrometry results. BMC Bioinformatics, 5, 9.