Bioinformatics Advance Access originally published online on September 25, 2007
Bioinformatics 2007 23(24):3394-3396; doi:10.1093/bioinformatics/btm467
TandTRAQ: an open-source tool for integrated protein identification and quantitation
1Informatics Shared Resource, OHSU Cancer Institute, 2Oregon Clinical and Translational Research Institute, 3Proteomics Shared Resource and 4Department of Anatomical Pathology, Oregon Health & Science University, Portland, Oregon 97212, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Summary: Integrating qualitative protein identification with quantitative protein analysis is non-trivial because the output formats of the tools involved are incompatible. We present TandTRAQ, a standalone utility that integrates results from i-Tracker, an open-source iTRAQ quantitation program, with the search results from X!Tandem, an open-source proteome search engine. The utility runs from the command line and can be easily integrated into a pipeline for automation.
Availability: The TandTRAQ Perl scripts are freely available for download at http://www.ohsucancer.com/isrdev/tandtraq/
Contact: laderast{at}ohsu.edu
| 1 INTRODUCTION |
|---|
Until recently, the focus of most proteomics studies was qualitative identification of proteins in simple and complex mixtures (Aebersold and Mann, 2003). However, protein quantitation using multiplexed samples is rapidly becoming important. Multiplexed sample technologies, such as Difference Gel Electrophoresis (DIGE) and isotope-coded affinity tags (ICAT), allow two or more samples to be compared at once, enabling not only identification of peptides but also quantitation of those peptide levels. In DIGE, two or more samples are labeled with different fluorescent dyes and then electrophoresed (Unlu et al., 1997). Associated peptides in each sample group fluoresce as a fixed ratio of the two signals, allowing the relative abundances to be determined. ICAT tagging relies on tagging different samples with reporter ions that differ by molecular weight, allowing the peptides to be identified using MS analysis and then quantified using MS/MS analysis (Gygi et al., 1999). A recent variation on ICAT is iTRAQ, where the relative abundances of peptides in up to eight samples can be compared by labeling each sample with a different reporter ion (Ross et al., 2004; Tao and Aebersold, 2003).
Open-source programs exist for both identification and quantitation. A major advantage of open-source programs is that the source code and the associated algorithms are readily available and modifiable. This openness is becoming more important for understanding the limitations and capabilities of the software.
The X!Tandem program is an open-source proteome search engine that is frequently used for peptide identification (Craig and Beavis, 2004). X!Tandem can be used as a web-based application at http://thegpm.org or deployed locally using precompiled binaries and FASTA-formatted files. X!Tandem writes its output in a combination of the BioML and GAML XML formats (Fenyo, 1999).
i-Tracker is an open-source peptide quantitation algorithm that allows the user to extract reporter ion peak ratios from non-centroided peak lists (Shadforth et al., 2005). The output of i-Tracker allows for the relative comparison of the iTRAQ labeled peptides.
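The idea of extracting reporter ion ratios from a non-centroided peak list can be illustrated with a minimal Python sketch. This is not i-Tracker's algorithm: it simply sums raw intensity in a window around each reporter m/z and divides by a reference channel. The function names, the `tolerance` window and the nominal reporter m/z values used in the example are all assumptions made for illustration.

```python
def reporter_intensities(peaks, reporter_mz, tolerance=0.2):
    """Sum raw intensity within +/- tolerance of each reporter m/z.

    peaks is a list of (mz, intensity) pairs from a profile (non-centroided)
    spectrum, so several data points may contribute to one reporter ion.
    The window width is an illustrative assumption.
    """
    totals = {mz: 0.0 for mz in reporter_mz}
    for mz, intensity in peaks:
        for target in reporter_mz:
            if abs(mz - target) <= tolerance:
                totals[target] += intensity
    return totals


def reporter_ratios(totals, reference_mz):
    """Express each reporter intensity relative to the reference channel."""
    reference = totals[reference_mz]
    if reference <= 0:
        # No signal in the reference channel: ratios are undefined.
        return {mz: None for mz in totals}
    return {mz: value / reference for mz, value in totals.items()}


# Hypothetical profile data; the reporter m/z values are nominal and illustrative.
example_peaks = [
    (114.05, 10.0), (114.10, 30.0), (114.15, 10.0),
    (115.10, 25.0), (116.10, 100.0), (117.10, 50.0),
    (200.00, 999.0),  # an unrelated fragment ion, ignored by the windows
]
example_totals = reporter_intensities(example_peaks, (114.1, 115.1, 116.1, 117.1))
example_ratios = reporter_ratios(example_totals, reference_mz=114.1)
```

Passing the reporter masses as a parameter keeps the sketch independent of the number of labels used in a given experiment.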
However, combining the output of these two programs by hand is inconvenient. The purpose of TandTRAQ is to provide a standalone command-line utility that combines results from database searching and peptide quantitation to produce a table of protein identifications and associated abundance ratios for each protein, based on an analysis of the peptide-associated reporter ions (Table 1). To our knowledge, TandTRAQ is the first effort to provide a unified, open-source framework for identification and quantitation.
| 2 METHODS AND IMPLEMENTATION |
|---|
TandTRAQ is written in Perl and consists of a Perl script and modified versions of i-Tracker and code included in the Global Proteome Machine. It takes two files as input: a Mascot generic format (MGF) file with uncentroided peaks, and the BioML results from X!Tandem. A number of utilities exist to convert other peak list formats into MGF, including the Mascot.dll plug-in to ABI Analyst as well as tools available at Proteome Commons (http://www.proteomecommons.org/tools.jsp). The output of TandTRAQ is a tab-separated value (TSV) file that is easily imported into R, Excel or other tools for further analysis (Table 1).
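As an illustration of downstream use of a TSV table of this kind, the following Python sketch reads tab-separated peptide rows and summarizes a ratio per protein. The column names `protein` and `ratio_116_114` are hypothetical, since the actual TandTRAQ header is not reproduced here, and taking the median of the peptide ratios is an assumption chosen for simplicity, not necessarily how TandTRAQ derives protein-level ratios.

```python
import csv
import io
import statistics
from collections import defaultdict


def protein_median_ratios(tsv_handle, protein_col, ratio_col):
    """Group peptide-level ratios by protein and return the median per protein."""
    by_protein = defaultdict(list)
    for row in csv.DictReader(tsv_handle, delimiter="\t"):
        value = row[ratio_col].strip()
        if value:  # skip peptides without a usable ratio
            by_protein[row[protein_col]].append(float(value))
    return {name: statistics.median(values) for name, values in by_protein.items()}


# Hypothetical table contents standing in for a real output file.
example_tsv = io.StringIO(
    "protein\tpeptide\tratio_116_114\n"
    "P1\tAAAK\t1.0\n"
    "P1\tCCCR\t2.0\n"
    "P1\tDDDK\t4.0\n"
    "P2\tEEEK\t0.5\n"
    "P2\tFFFR\t\n"
)
example_summary = protein_median_ratios(example_tsv, "protein", "ratio_116_114")
```

With a real file, the `io.StringIO` object would be replaced by an opened file handle, for example `open(path, newline="")`.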
Modifications were made to the i-Tracker output format in order to enable easier integration of these results with the BioML/GAML results. In particular, accurate extraction and parsing of the TITLE line in the MGF format was necessary in order to enable joining these results with the BioML/GAML format.
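The role of the TITLE line as a join key can be sketched in Python as follows. This is a simplified illustration, not the TandTRAQ Perl code: it reads the `BEGIN IONS`/`END IONS` blocks of an MGF peak list, indexes each spectrum by its `TITLE=` value, and joins that index with identifications keyed by the same title. The function names and the shape of the `identifications` dictionary are hypothetical, and extracting the title from the BioML/GAML side is assumed to have been done separately.

```python
def read_mgf_titles(lines):
    """Return {title: [(mz, intensity), ...]} for each spectrum block in an MGF file."""
    spectra = {}
    title, peaks, inside = None, [], False
    for raw in lines:
        line = raw.strip()
        if line == "BEGIN IONS":
            title, peaks, inside = None, [], True
        elif line == "END IONS":
            if inside and title is not None:
                spectra[title] = peaks
            inside = False
        elif inside and line.startswith("TITLE="):
            title = line[len("TITLE="):].strip()
        elif inside and line and line[0].isdigit():
            # Peak lines hold an m/z value followed by an intensity.
            fields = line.split()
            if len(fields) >= 2:
                peaks.append((float(fields[0]), float(fields[1])))
    return spectra


def join_on_title(quantitation, identifications):
    """Inner join of two dictionaries keyed by spectrum title."""
    return {
        title: (identifications[title], quantitation[title])
        for title in quantitation
        if title in identifications
    }


# Hypothetical MGF fragment and identification table.
example_mgf = [
    "BEGIN IONS",
    "TITLE=sample1 scan 12",
    "PEPMASS=500.25",
    "CHARGE=2+",
    "114.10 30.0",
    "115.10 15.0",
    "END IONS",
    "BEGIN IONS",
    "TITLE=sample1 scan 13",
    "PEPMASS=612.30",
    "116.10 8.0",
    "END IONS",
]
example_spectra = read_mgf_titles(example_mgf)
example_ids = {"sample1 scan 12": ("AAAK", "P1")}
example_joined = join_on_title(example_spectra, example_ids)
```

Because the join is exact string matching, any whitespace or formatting difference between the title stored in the peak list and the one reported by the search engine would silently drop a spectrum, which is why accurate parsing of this line matters.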
The Global Proteome Machine is a web-enabled version of X!Tandem and other search engines that includes visualization, summarization and annotation tools for BioML/GAML output. The visualization code was modified in order to parse the joining ID from the BioML/GAML format. In addition, the annotation functionality of the GPM source code was modified in order to include database annotations of peptides and links to UniProt and RefSeq in a peptide list.
TandTRAQ requires the parameters requested by i-Tracker, including purity correction factors and an ion count threshold. The code can be easily used for batch processing of X!Tandem and i-Tracker results and integrated into an automated pipeline.
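To make the role of these two parameters concrete, here is a minimal Python sketch under stated assumptions. It treats purity correction as solving a small linear system in which a contribution matrix (built from the purity correction factors) maps true reporter signals to observed ones, and it treats the ion count threshold as a minimum summed reporter intensity. Both interpretations, and all names used, are assumptions for illustration; i-Tracker's exact definitions are not reproduced here.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the inputs are left untouched.
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("contribution matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution on the upper-triangular system.
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


def correct_reporters(observed, contribution, ion_count_threshold):
    """Return purity-corrected intensities, or None if the spectrum is too weak.

    contribution[i][j] is the assumed fraction of label j's signal that is
    observed in channel i.
    """
    if sum(observed) < ion_count_threshold:
        return None
    corrected = solve_linear(contribution, observed)
    # Small negative values can arise from noise; clamp them to zero.
    return [max(0.0, value) for value in corrected]


# Hypothetical two-channel example: 10% of each label leaks into the other channel.
example_contribution = [[0.9, 0.1], [0.1, 0.9]]
example_corrected = correct_reporters([110.0, 190.0], example_contribution, 20.0)
example_rejected = correct_reporters([3.0, 4.0], example_contribution, 20.0)
```

In this toy case the observed intensities of 110 and 190 correspond to true signals of 100 and 200, so an uncorrected ratio would understate the real difference between the channels.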
| 3 RESULTS |
|---|
To demonstrate the utility of TandTRAQ, we used cerebrospinal fluid (CSF) samples from a childhood acute lymphoblastic leukemia (ALL) study by the Children's Oncology Group and Oregon Health & Science University. The morphologically negative CSF samples had previously been classified as minimal residual disease (MRD) negative (–) or positive (+) using real-time PCR on centrifuged cell lysates. Supernatants from 3 MRD –, 3 MRD + and 2 morphologically + CSF samples were concentrated and desalted using a Microcon filter device. Albumin was removed using a Human Albumin Kit, Albuminomics™. The MRD – and MRD + samples, each containing 15 µg of protein, were pooled separately and processed for protein identification and quantitation using the iTRAQ protocol on a Qstar XL hybrid time-of-flight mass spectrometer (further details are available on the TandTRAQ website). The resulting WIFF file generated by ABI Analyst 1.1 was then converted to MGF format using the Mascot.dll plug-in for ABI Analyst. The MGF file was run through a web-based instance of X!Tandem (the parameters used are listed on the website) to produce the BioML result file. The BioML and MGF files were then run through TandTRAQ to produce the peptide table (see Table 1 for example output).
| 4 FUTURE DIRECTIONS |
|---|
Future directions include merging the TandTRAQ code back into the GPM web interface. This would provide users with a unified browsing and visualization interface that links quantitated and identified peptide results. Because TandTRAQ already requires the quantitation parameters, it could also be extended for reporting purposes, generating the metadata and XML needed to meet the Minimum Information About a Proteomics Experiment (MIAPE) requirements (http://www.psidev.info/).
| ACKNOWLEDGEMENTS |
|---|
The authors extend their thanks to the reviewers for their comments that improved the clarity and flow of the manuscript. This work was funded in part by NIH grants from NCI (5P30 CA69533-09) and NCRR (1 UL1 RR024140-01).
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Anna Tramontano
Received on July 18, 2007; revised on August 30, 2007; accepted on September 8, 2007
| REFERENCES |
|---|
Aebersold R, Mann M. Mass spectrometry-based proteomics. Nature (2003) 422:198–207.
Craig R, Beavis RC. TANDEM: matching proteins with tandem mass spectra. Bioinformatics (2004) 20:1466–1467.
Fenyo D. The biopolymer markup language. Bioinformatics (1999) 15:339–340.
Gygi SP, et al. Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat. Biotechnol (1999) 17:994–999.
Ross PL, et al. Multiplexed protein quantitation in Saccharomyces cerevisiae using amine-reactive isobaric tagging reagents. Mol. Cell Proteomics (2004) 3:1154–1169.
Shadforth IP, et al. i-Tracker: for quantitative proteomics using iTRAQ. BMC Genomics (2005) 6:145.
Tao WA, Aebersold R. Advances in quantitative proteomics via stable isotope tagging and mass spectrometry. Curr. Opin. Biotechnol (2003) 14:110–118.
Unlu M, et al. Difference gel electrophoresis: a single gel method for detecting changes in protein extracts. Electrophoresis (1997) 18:2071–2077.