

Bioinformatics Advance Access originally published online on November 22, 2006
Bioinformatics 2007 23(2):262-263; doi:10.1093/bioinformatics/btl573

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

ProteomeCommons.org IO Framework: reading and writing multiple proteomics data formats

J. A. Falkner *, J. W. Falkner and P. C. Andrews

Department of Biochemistry, Program in Bioinformatics, Ann Arbor, MI 48109, USA

*To whom correspondence should be addressed.


    ABSTRACT

Motivation: Effective use of proteomics data, specifically mass spectrometry data, relies on the ability to read and write the many mass spectrometer file formats. Even with vendor-specific libraries and vendor-neutral file formats, such as mzXML and mzData, it can be difficult to extract raw data files in a form suitable for batch processing and basic research. Introduced here is the ProteomeCommons.org Input and Output Framework, abbreviated to IO Framework, which is designed to abstractly represent mass spectrometry data. This project is a public, open-source, free-to-use framework that supports most of the mass spectrometry data formats, including current formats, legacy formats and proprietary formats that require a vendor-specific library in order to operate. The IO Framework includes an on-line tool for non-programmers and a set of libraries that developers may use to convert between various proteomics file formats.

Availability: The current source code and documentation for the ProteomeCommons.org IO Framework are freely available at http://www.proteomecommons.org/current/531/

Contact: jfalkner{at}umich.edu


    1 INTRODUCTION
The ProteomeCommons.org IO Framework is a library of freely usable, open-source Java code supported by ProteomeCommons.org (Falkner and Andrews, 2005). It provides users and developers with tools that aid reading and writing various proteomics file formats, primarily mass spectrometry file formats. The IO Framework builds upon and incorporates several existing projects for reading mass spectrometry data, including ReAdW (http://sashimi.sourceforge.net/software_glossolalia.html), wiff2dta (Boehm et al., 2004), CompassXport (Bruker Daltonics Inc., http://www.bdal.com) and the ProteomeCommons.org JAF (Falkner et al., 2006). In addition to providing an abstract interface to several existing tools, the IO Framework also provides code for reading and writing several more formats, including mzData, mzXML, X!Tandem output generated by TheGPM and the .msp format used by the NIST Spectral Library; the respective citations are included in Section 3.1. Finally, the IO Framework provides several practical features: seamlessly reading and writing spectra files that are saved in the ZIP, GZIP, bzip2 and LZMA compression formats, merging multiple files into a single aggregate spectrum, and a framework for filtering peak lists based on intensities or m/z ranges.
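
As an illustration of what seamless reading of compressed spectra files can involve, the following minimal sketch detects GZIP content from its two leading magic bytes (0x1f 0x8b) and decompresses it with the standard java.util.zip classes. The class and method names (CompressionSniffer, open, readAll) are hypothetical and are not taken from the IO Framework; ZIP, bzip2 and LZMA would need analogous checks and, for the latter two, third-party decoders.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

// Hypothetical helper for illustration only; not part of the IO Framework API.
final class CompressionSniffer {
    private CompressionSniffer() {}

    /** Returns a stream of decompressed bytes if the input is GZIP, else the input unchanged. */
    static InputStream open(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(2);                 // remember the start so the peeked bytes can be re-read
        int b0 = in.read();
        int b1 = in.read();
        in.reset();                 // rewind to the start of the stream
        boolean gzip = (b0 == 0x1f && b1 == 0x8b);
        return gzip ? new GZIPInputStream(in) : in;
    }

    /** Reads the whole stream as UTF-8 text (adequate for plain-text peak lists). */
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

A caller would pass any InputStream (for example a FileInputStream over a peak list) to open and read from the returned stream without needing to know whether the file was compressed.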


    2 METHODS
The ProteomeCommons.org IO Framework is hosted by ProteomeCommons.org and coded in the Java programming language. The IO Framework's use of the wiff2dta, ReAdW and CompassXport tools requires the respective program to be installed in order to read the corresponding format, and these programs will not work outside of Microsoft Windows. All Java code included with the IO Framework is free to use both commercially and non-commercially, and the project's source code is made publicly available along with the documentation. Use of code that may be incorporated with the IO Framework, namely the projects mentioned above, is limited by the licences and use terms associated with the respective projects.


    3 RESULTS
The complete IO Framework project includes documentation that references the various API components and user tools available with the project. The abbreviated list below highlights several of the features and tools.

3.1 Mass spectrometer file format support

  • Thermo Finnigan's .raw format commonly acquired from LCQ, LTQ, LTQ-FT and LTQ Orbitrap instruments.
  • Applied Biosystems' .wiff format commonly acquired from the QStar and QTrap mass spectrometers.
  • Applied Biosystems' .t2d format commonly acquired from the 4700 and 4800 MALDI TOFTOF instruments.
  • Bruker's .baf, .fid, .yep and AutoXecute run for LCMALDI file formats representing the data file formats from the flex, APEX, micrOTOF, micrOTOF-Q, esquire, autoFlex and ultraFlex instrument families.
  • mzXML format (Pedrioli et al., 2004) developed by the Institute for Systems Biology.
  • mzData format (Orchard et al., 2005) developed by the HUPO PSI-MS effort (http://psidev.sf.net).
  • Common plain-text peak list formats including the .dta (Sequest), .pkl (Waters, Milford, MA), and .mgf (Matrix Science Ltd, London, England) formats.
  • X!Tandem output files (Craig et al., 2004a) including the files provided in TheGPM Quartz collection (http://www.thegpm.org/quartz/index.html).
  • NIST library of peptide fragmentation spectra .msp files (http://www.nist.gov/srd/nist1a.htm).
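
To make the plain-text peak list formats listed above concrete, the sketch below parses the commonly used .dta layout, in which the first line holds the singly protonated precursor mass and the charge state and each following line holds one m/z and intensity pair. The DtaPeakList class is a hypothetical illustration written for this note; it is not the IO Framework's own reader, and real files may contain variations that this sketch does not handle.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration of a .dta reader; not the IO Framework's implementation.
final class DtaPeakList {
    final double precursorMH;      // precursor mass as (M+H)+
    final int charge;              // precursor charge state
    final List<double[]> peaks;    // each entry is {m/z, intensity}

    private DtaPeakList(double precursorMH, int charge, List<double[]> peaks) {
        this.precursorMH = precursorMH;
        this.charge = charge;
        this.peaks = Collections.unmodifiableList(peaks);
    }

    /** Parses .dta text: a header line "MH+ charge", then "m/z intensity" lines. */
    static DtaPeakList parse(String text) {
        List<double[]> peaks = new ArrayList<>();
        double mh = Double.NaN;
        int z = 0;
        boolean headerSeen = false;
        for (String line : text.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;                               // tolerate blank lines
            }
            String[] fields = trimmed.split("\\s+");
            if (fields.length < 2) {
                throw new IllegalArgumentException("Malformed line: " + line);
            }
            if (!headerSeen) {
                mh = Double.parseDouble(fields[0]);
                z = Integer.parseInt(fields[1]);
                headerSeen = true;
            } else {
                peaks.add(new double[] {
                    Double.parseDouble(fields[0]),
                    Double.parseDouble(fields[1])
                });
            }
        }
        if (!headerSeen) {
            throw new IllegalArgumentException("Empty .dta content");
        }
        return new DtaPeakList(mh, z, peaks);
    }
}
```

Representing every format as the same in-memory peak list is what allows a converter to read one format and write another without format-specific logic in between.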

3.2 Protein/peptide sequence file format support

  • Support for reading protein sequences from FASTA files and GPMDB websites. Support is also provided for automatically shuffling or reversing protein sequences, as commonly done in reverse database searches (Elias et al., 2005).
  • Support for converting protein sequences into peptides using common enzymes, such as trypsin, and custom rules for protein cleavage.
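
The two sequence operations above can be sketched in a few lines of Java. The example assumes the usual trypsin rule (cleave on the C-terminal side of K or R unless the next residue is P) and allows no missed cleavages; SequenceTools and its methods are hypothetical names chosen for illustration, not the IO Framework's API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the IO Framework's API.
final class SequenceTools {
    private SequenceTools() {}

    /** Reverses a protein sequence, as done when building a reverse (decoy) database. */
    static String reverse(String protein) {
        return new StringBuilder(protein).reverse().toString();
    }

    /** Digests a protein with the usual trypsin rule, with no missed cleavages. */
    static List<String> trypticDigest(String protein) {
        List<String> peptides = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < protein.length(); i++) {
            char residue = protein.charAt(i);
            boolean cleavable = (residue == 'K' || residue == 'R');
            boolean blockedByProline =
                i + 1 < protein.length() && protein.charAt(i + 1) == 'P';
            if (cleavable && !blockedByProline) {
                peptides.add(protein.substring(start, i + 1));
                start = i + 1;
            }
        }
        if (start < protein.length()) {
            peptides.add(protein.substring(start));   // trailing peptide without K/R
        }
        return peptides;
    }
}
```

For example, digesting MKRPAKGR yields the peptides MK, RPAK and GR, because the R at position 3 is followed by P and is therefore not cleaved.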

3.3 Peak list filtering framework

  • Intensity filtering based on relative or absolute intensity.
  • m/z filtering based on a particular m/z range or precursor.
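
The filters listed above could be expressed as follows. This is a minimal sketch that represents each peak as a two-element array {m/z, intensity}; the PeakFilters class and its method names are assumptions made for illustration and do not reproduce the IO Framework's filtering interfaces.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; each peak is {m/z, intensity}.
final class PeakFilters {
    private PeakFilters() {}

    /** Keeps peaks whose intensity is at least the given absolute threshold. */
    static List<double[]> byAbsoluteIntensity(List<double[]> peaks, double minIntensity) {
        List<double[]> kept = new ArrayList<>();
        for (double[] peak : peaks) {
            if (peak[1] >= minIntensity) {
                kept.add(peak);
            }
        }
        return kept;
    }

    /** Keeps peaks whose intensity is at least `fraction` of the most intense peak. */
    static List<double[]> byRelativeIntensity(List<double[]> peaks, double fraction) {
        double max = 0.0;
        for (double[] peak : peaks) {
            max = Math.max(max, peak[1]);
        }
        return byAbsoluteIntensity(peaks, fraction * max);
    }

    /** Keeps peaks whose m/z lies in the closed interval [minMz, maxMz]. */
    static List<double[]> byMzRange(List<double[]> peaks, double minMz, double maxMz) {
        List<double[]> kept = new ArrayList<>();
        for (double[] peak : peaks) {
            if (peak[0] >= minMz && peak[0] <= maxMz) {
                kept.add(peak);
            }
        }
        return kept;
    }
}
```

Because each filter takes and returns a peak list, filters can be chained, for example an m/z range filter followed by a relative intensity filter.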

3.4 User tools

3.5 Background and comparison against other tools
The IO Framework aids users and developers who need to work with proteomics-related file formats; however, several other projects also address this need. An up-to-date list of these tools is maintained with the IO Framework's on-line documentation at http://www.proteomecommons.org/current/531/. The key differences between the IO Framework and the majority of these tools are that the IO Framework provides an intuitive, easy-to-use framework for Java developers, is freely available, and seeks to support the broadest range of file formats practicable.

Of the existing tools, the ProteomeCommons.org IO Framework is most similar to DBToolkit (Martens et al., 2005) and the Sashimi Glossolalia project (sashimi.sf.net). DBToolkit is intended to be used in the development of tools that manipulate proteomics data. The Sashimi Glossolalia tools are intended to provide users with a method for converting from a number of vendor-specific formats into the vendor-neutral mzXML format. The IO Framework, in contrast, is specifically focused on rapid development cycles and broad support of proteomics file formats. The IO Framework includes support for many more file formats than DBToolkit, particularly the non-plain-text formats, and, in contrast to Glossolalia, the IO Framework can convert to many formats, including but not limited to mzXML. The IO Framework is also designed to reuse existing code, and it includes wrapper code that encapsulates and uses the Sashimi converters. The Sashimi ReAdW tool is of particular note, as it is currently the default tool used by the IO Framework to read the .raw file format.

Another application of interest is the use of the IO Framework with existing MSMS search engines, such as X!Tandem (Craig et al., 2004a) and subsequently the GPM (Craig et al., 2004b) and Mascot (Perkins et al., 1999). The IO Framework can be used for building new MSMS search engines or for converting data into a format that existing search engines can use. Additionally, the IO Framework provides code for manipulating FASTA files and generating reverse/dummy databases, which new search engines can use directly and which aids in creating FASTA files for use with existing MSMS search engines. In general, the IO Framework will not replace any existing MSMS search engine, but it does provide code that may be of benefit when working with MSMS search engines.

In summary, the IO Framework is coded in the Java programming language, completely free to use, open-source, and provides broad support for proteomics file formats. The wide support for many of the current proteomics and mass spectrometer file formats, including incorporation of other existing tools, makes the IO Framework a valuable new tool for developers.


    Acknowledgments
 
This project is part of the National Resource for Proteomics and Pathways funded by NCRR grant P41-RR018627. Special acknowledgement is also made to the many developers who have contributed code patches to the IO Framework, including Neil Swainston, Panagiotis Papoulias, Tomas Pluskal, David Hancock and Dominic Battre. The authors also wish to recognize the growing community of proteomics developers, particularly those who have released freely available tools, including the Institute for Systems Biology, Andreas Boehm et al. and the Bruker Daltonics Inc. software development team.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Alfonso Valencia

Received on September 21, 2006; revised on November 10, 2006; accepted on November 11, 2006

    REFERENCES

    Boehm, A.M., et al. (2004) ‘Extractor for ESI quadrupole TOF tandem MS data enabled for high throughput batch processing’. BMC Bioinformatics, 5, 162.

    Bruker Daltonics Inc. (2006) ‘Export of Bruker MS-Data to mzData format’. Billerica, MA.

    Craig, R. and Beavis, R. (2004a) ‘TANDEM: matching proteins with tandem mass spectra’. Bioinformatics, 20, 1466–1467.

    Craig, R., et al. (2004b) ‘Open source system for analyzing, validating, and storing protein identification data’. J. Proteome Res., 3, 1234–1242.

    Elias, J.E., et al. (2005) ‘Comparative evaluation of mass spectrometry platforms used in large-scale proteomics investigations’. Nat. Methods, 2, 667–675.

    Falkner, J.A., Ulintz, P.J. and Andrews, P.C. (2006) ‘A Code and Data Archival and Dissemination Tool for the Proteomics Community’. American Biotechnology Laboratory (ABL), in press.

    Falkner, J.A. and Andrews, P.C. (2005) ‘Fast tandem mass spectra-based protein identification regardless of the number of spectra or potential modifications examined’. Bioinformatics, 21, 2177–2184.

    Falkner, J.A., et al. (2006) ‘ProteomeCommons.org JAF: reference information and tools for proteomics’. Bioinformatics, 22, 632–633.

    Martens, L., et al. (2005) ‘DBToolkit: processing protein databases for peptide-centric proteomics’. Bioinformatics, 21, 3584–3585.

    Orchard, S., et al. (2005) ‘Second proteomics standards initiative spring workshop’. Expert Rev. Proteomics, 2, 287–289.

    Pedrioli, P.G.A., et al. (2004) ‘A common open representation of mass spectrometry data and its application to proteomics research’. Nat. Biotechnol., 22, 1459–1466.

    Perkins, D.N., et al. (1999) ‘Probability-based protein identification by searching sequence databases using mass spectrometry data’. Electrophoresis, 20, 3551–3567.





