Bioinformatics Advance Access originally published online on July 24, 2008
Bioinformatics 2008 24(18):2103-2104; doi:10.1093/bioinformatics/btn385
An integrated approach for automating validation of extracted ion chromatographic peaks
1Department of Biology, 2Department of Statistics and 3Department of Chemistry, University of Kentucky, Lexington, KY 40506, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Summary: Accurate determination of extracted ion chromatographic peak areas in isotope-labeled quantitative proteomics is difficult to automate. Manual validation of identified peaks is typically required. We have integrated a peak confidence scoring algorithm into existing tools which are compatible with analysis pipelines based on the standards from the Institute for Systems Biology. This algorithm automatically excludes incorrectly identified peaks, improving the accuracy of the final protein expression ratio calculation.
Contact: wnels2{at}uky.edu
Source and Supplementary Information: http://www.chem.uky.edu/research/lynn/Nelson.pdf
| 1 INTRODUCTION |
|---|
High-throughput quantitative proteomics based on stable isotope labeling strategies has been widely applied to expression profiling of complex protein samples (Bantsheff et al., 2007). With the advent of technologies such as MudPIT (Wolters et al., 2001), thousands of peptides are now analyzed in a single experiment. Such large quantities of data demand highly automated analysis; however, manual validation of automated quantification results is still common with current quantification software.
Peptide/protein quantification software generally performs the same sequence of steps: (1) the theoretical mass-to-charge ratios of the light and heavy isotope partner precursor ions are calculated from the peptide sequence identified by the protein database search engine, (2) the start and end scan limits of the extracted ion chromatographic peaks are identified, (3) the areas under the identified peaks are calculated, (4) the ratio of the heavy and light peak areas is calculated for each peptide and (5) a protein abundance ratio is calculated from the abundance ratios of the peptides contained in that protein.
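As an illustration only, Steps 3 to 5 can be sketched in a few lines of Python. This is not the Xpress code: the function names, the trapezoidal integration with unit scan spacing, and the use of a geometric mean to combine peptide ratios are all assumptions made for the example, and the intensity values are invented.

```python
import math

def peak_area(intensities, start, end):
    """Step 3: trapezoidal area over scans start..end (inclusive), unit scan spacing assumed."""
    window = intensities[start:end + 1]
    return sum((a + b) / 2.0 for a, b in zip(window, window[1:]))

def peptide_ratio(light, heavy, start, end):
    """Step 4: heavy-to-light peak area ratio within shared scan limits."""
    return peak_area(heavy, start, end) / peak_area(light, start, end)

def protein_ratio(peptide_ratios):
    """Step 5: geometric mean of peptide ratios (an assumed combination rule)."""
    logs = [math.log(r) for r in peptide_ratios]
    return math.exp(sum(logs) / len(logs))

# Invented intensities: the heavy trace is exactly twice the light trace.
light = [0, 2, 6, 10, 6, 2, 0]
heavy = [0, 4, 12, 20, 12, 4, 0]
ratio = peptide_ratio(light, heavy, 1, 5)  # areas 48.0 and 24.0, so ratio is 2.0
```

The sketch makes the dependence on Step 2 visible: every later quantity is computed only from the scans between `start` and `end`, so misplaced limits propagate directly into the peptide and protein ratios.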
The most critical step is identifying the limits of the extracted ion chromatographic peak (Step 2). Previous work has shown that the correlation coefficient of the heavy versus the light isotope ion intensities across the extracted ion chromatographic peaks has significant value in determining whether the identified peak limits are valid (MacCoss et al., 2003). The main idea is that peaks that correlate well have similar shapes, which indicates that the expression estimates and the resulting expression ratio are consistent.
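A minimal sketch of such a score, assuming it is the ordinary Pearson correlation computed scan by scan over the shared peak limits, might look as follows. The function name, the handling of flat traces, and the example intensities are assumptions for illustration, not details taken from MacCoss et al. (2003) or from Xpress.

```python
import math

def pearson(light, heavy):
    """Pearson correlation of paired light/heavy intensities across the same scans."""
    n = len(light)
    if n != len(heavy) or n < 2:
        raise ValueError("need two traces of equal length with at least 2 scans")
    mean_l = sum(light) / n
    mean_h = sum(heavy) / n
    sxy = sum((a - mean_l) * (b - mean_h) for a, b in zip(light, heavy))
    sxx = sum((a - mean_l) ** 2 for a in light)
    syy = sum((b - mean_h) ** 2 for b in heavy)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat trace carries no shape information (assumed convention)
    return sxy / math.sqrt(sxx * syy)

# Invented traces: a co-eluting pair and a pair whose shapes disagree.
light = [1, 5, 10, 5, 1]
co_eluting = [2, 10, 20, 10, 2]   # same shape, twice the abundance
mismatched = [10, 1, 1, 1, 10]    # different shape
good = pearson(light, co_eluting)
bad = pearson(light, mismatched)
```

Because the correlation is insensitive to scale, the co-eluting pair scores close to 1 even though its abundance ratio is 2:1, while the mismatched pair scores well below the 0.5 level.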
Protein quantification is just a single step in a typical proteomics pipeline, of which the popular Trans-Proteomic Pipeline (TPP) (http://tools.proteomecenter.org/software.php) is a good example. The proteomics community has cooperatively established standardized file formats (Droit et al., 2006) for the inputs and outputs at each step of the analysis pipeline. These standards greatly reduce the complexity of modifying the pipeline, thereby reducing the time and expense of maintenance and updates. New software that is implemented from scratch and cannot be integrated into standardized pipelines creates a significant obstacle to evaluation by other laboratories.
In this study, we inserted a correlation coefficient calculation, in the form of a confidence score, into the Xpress (Han et al., 2001) protein quantification package of the TPP. The correlation coefficient filter was inserted into XpressPeptideParser and XpressProteinRatioParser, two programs in the Xpress suite that perform peptide and protein quantification.
| 2 METHODS |
|---|
Sample data: the raw data used in this evaluation were obtained from cICAT-labeled rat brain mitochondrial proteins. See Supplementary Material.
Proteomics analysis pipeline: the acquired tandem mass spectra were searched against SwissProt rodent proteins using the Sequest Cluster (ThermoFisher) and the TPP.
Xpress modifications: code for the correlation coefficient filter was inserted into XpressPeptideParser and XpressProteinRatioParser. In brief, after the step that defines the start and end scans of the chromatographic peaks, we inserted a step that calculates the correlation coefficient (http://www.ddj.com/cpp/184401277) of the raw ion intensities for the chromatographic peak pair. In the final step of the XpressPeptideParser sequence, a confidence score attribute was added to the Xpress results written to the pep.xml file. The XpressProteinRatioParser program was modified to exclude peptides with a confidence score <0.5.
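The exclusion rule applied at the protein level can be sketched as below. This is a hypothetical Python illustration of the logic, not the modified XpressProteinRatioParser: the record fields (`protein`, `ratio`, `confidence`) and the example values are invented and do not reflect the actual pep.xml attribute names.

```python
from collections import defaultdict

CUTOFF = 0.5  # peptides with a confidence score below this value are excluded

def ratios_by_protein(peptides, cutoff=CUTOFF):
    """Group peptide ratios by protein, skipping peptides that fail the confidence cutoff."""
    kept = defaultdict(list)
    for pep in peptides:
        if pep["confidence"] < cutoff:
            continue  # excluded from the protein abundance calculation
        kept[pep["protein"]].append(pep["ratio"])
    return dict(kept)

# Invented records with hypothetical field names.
peptides = [
    {"protein": "A", "ratio": 1.0, "confidence": 0.9},
    {"protein": "A", "ratio": 3.0, "confidence": 0.2},
    {"protein": "B", "ratio": 1.1, "confidence": 0.5},
    {"protein": "C", "ratio": 0.4, "confidence": 0.1},
]
result = ratios_by_protein(peptides)
```

In this example protein A loses one of its two peptides, protein B is kept because a score of exactly 0.5 is not below the cutoff, and protein C disappears from quantification because its only peptide fails the filter.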
| 3 RESULTS |
|---|
The effectiveness of the correlation coefficient filter in Xpress was evaluated on a dataset from a control study of rat brain mitochondria. The mitochondrial proteins were labeled with cleavable ICAT reagents such that the heavy and light isotopes were present at a 1:1 ratio. Pipeline analysis of these samples produced estimated expression ratios for 749 peptides.
We focused on defining only a conservative lower threshold for the correlation: a score below which the quantification was almost certainly not valid (see Table 1 for other possibilities). Our suggested threshold is based on identifying those values of the correlation score that (1) resulted in statistically non-valid estimates of the expression ratio and (2) agreed with a subjective assessment by experienced researchers.
Figure 1 shows the logits of the correlation coefficients (used simply to better visualize the region near 1) plotted against the estimated log ratios for each peptide. The plot shows that variation in the estimated log ratios decreases as the correlation coefficient confidence score increases. In fact, the variation appears to decrease sharply once the logit of the correlation score rises above 3, which corresponds to confidence scores above 0.95. Conversely, when the logit of the correlation score was below 0 (corresponding to correlations below 0.5), there was virtually no clustering of the estimated log ratios, indicating that we were simply observing noise. With a conservative cutoff of 0.5, only 15% of peptides with correlations below 0.5 were within 0.25 of the expected log ratio of 0, whereas 65% of peptides with correlations above 0.5 were.
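The correspondence between logit values and correlation scores quoted above can be checked with a short sketch. It assumes the natural-log logit, ln(r / (1 - r)), which is consistent with the stated pairs (0 with 0.5, and roughly 3 with 0.95); the function names are chosen for the example.

```python
import math

def logit(r):
    """Natural-log logit; defined only for 0 < r < 1."""
    if not 0.0 < r < 1.0:
        raise ValueError("logit is undefined outside the open interval (0, 1)")
    return math.log(r / (1.0 - r))

def inv_logit(z):
    """Inverse of logit: maps any real value back into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

at_half = logit(0.5)     # 0.0: the cutoff of 0.5 sits at logit 0
at_095 = logit(0.95)     # ln(19), about 2.94
from_3 = inv_logit(3.0)  # about 0.953
```

The transform stretches the region near 1, so scores of 0.95, 0.99 and 0.999 that would be crowded together on a linear axis become clearly separated. Correlations at or below 0 have no logit under this definition and would need separate handling.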
In addition to the more objective analysis above, we collected subjective assessments from six researchers with previous ICAT experience. The researchers were presented with extracted ion chromatographic peak pairs from a set of 47 survey peptides. The survey peptides were representative peptides chosen at correlation score intervals of 0.01 (some intervals had no representative peptide) from the set of peptides with ProteinProphet scores above 0.99 (382 peptides). The participants accessed the peptide quantification data through the Computational Proteomics Analysis System (CPAS; Rauch et al., 2006) web interface, where the extracted ion chromatograms could be visualized, and noted whether the data were useful for quantification.
Although the survey revealed a great deal of subjectivity in evaluating the quality of peak pairs, pairs with low correlation scores were unanimously rejected. Moving upward from 0, a correlation score of 0.5 was the first at which a peptide was judged acceptable by half of the survey group. Based on both the objective and the subjective analyses, we implemented a 0.5 correlation score cutoff filter in the XpressProteinRatioParser program.
The filter removed 60 peptides (8%) from protein abundance calculations. Of the 133 proteins identified by ProteinProphet with probability scores >0.02, 20 protein abundance ratios (15%) were affected by the correlation filter. Of these 20 proteins, 12 were identified by only one unique peptide; the cutoff filter removed half of these 12 'one-hit wonders' from protein abundance calculations.
One fundamental advantage of adding any algorithm to an established pipeline is the ability to compare and combine multiple measures. For example, consider a simple attempt to combine PeptideProphet scores with the correlation coefficient. Table 1 shows the SDs of the log expression ratios of identified peptides cross-classified by PeptideProphet score and correlation coefficient. We subdivided the correlation scores above 0.5 to show what would happen with a stricter cutoff. The SDs decrease dramatically with increasing correlation coefficient, and this decrease is amplified by the PeptideProphet score. Thus, one can imagine combining these two measures. Implementing the correlation coefficient within the pipeline allows investigations along these lines to be pursued easily.
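A cross-classification of this kind could be computed along the following lines. The sketch is illustrative only: the bin edges are hypothetical and are not the ones used in Table 1, the input tuples are invented, and the sample SD from Python's `statistics` module is an assumed choice.

```python
import bisect
import statistics
from collections import defaultdict

PROB_EDGES = [0.9]             # hypothetical PeptideProphet bins: <0.9, >=0.9
CORR_EDGES = [0.5, 0.8, 0.95]  # hypothetical correlation bins: <0.5, 0.5-0.8, 0.8-0.95, >=0.95

def sd_table(peptides, prob_edges=PROB_EDGES, corr_edges=CORR_EDGES):
    """SD of log ratios per (probability bin, correlation bin) cell.

    Each peptide is a (peptideprophet_probability, correlation, log_ratio) tuple.
    Cells with fewer than two peptides are omitted because the sample SD is undefined.
    """
    cells = defaultdict(list)
    for prob, corr, log_ratio in peptides:
        key = (bisect.bisect_right(prob_edges, prob),
               bisect.bisect_right(corr_edges, corr))
        cells[key].append(log_ratio)
    return {key: statistics.stdev(vals) for key, vals in cells.items() if len(vals) >= 2}

# Invented peptides: tight log ratios at high correlation, scattered ones at low correlation.
peptides = [
    (0.95, 0.97, 0.1), (0.95, 0.97, -0.1),
    (0.95, 0.30, 1.0), (0.95, 0.30, -1.0),
    (0.50, 0.60, 0.2),
]
table = sd_table(peptides)
```

Keys are pairs of bin indices, so `table[(1, 3)]` holds the SD for peptides with probability of at least 0.9 and correlation of at least 0.95. Because both scores are stored with each peptide, trying a different set of bin edges only requires changing the two lists.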
| 4 CONCLUSION |
|---|
These results not only illustrate the value of a correlation coefficient confidence score but also highlight an effective use of resources. By building upon existing, standardized, open source proteomics pipelines, innovations can be implemented easily and their effect on the performance of the entire analysis process can be shown unambiguously.
| ACKNOWLEDGEMENTS |
|---|
Thanks to Mark Lovell and Changxing Shao for allowing us to use their rat mitochondrial ICAT data for software evaluation.
Funding: This research was supported by NCRR(NIH) Grant P 20 RR16481 and Kentucky EPSCoR award CRDG-006-06.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: David Rocke
Received on March 21, 2008; revised on July 18, 2008; accepted on July 22, 2008
| REFERENCES |
|---|
Bantsheff M, et al. Quantitative mass spectrometry in proteomics: a critical review. Anal. Bioanal. Chem. (2007) 389:1017–1031.
Droit A, et al. Bioinformatic standards for proteomics-oriented mass spectrometry. Curr. Proteomics (2006) 3:119–128.
Han DK, et al. Quantitative profiling of differentiation-induced microsomal proteins using isotope-coded affinity tags and mass spectrometry. Nat. Biotechnol. (2001) 19:946–951.
MacCoss MJ, et al. A correlation algorithm for the automated quantitative analysis of shotgun proteomics data. Anal. Chem. (2003) 75:6912–6921.
Rauch A, et al. Computational proteomics analysis system (CPAS): an extensible, open-source analytic system for evaluating and publishing proteomic data and high throughput biological experiments. J. Proteome Res. (2006) 5:112–121.
Wolters D, et al. An automated multidimensional protein identification technology for shotgun proteomics. Anal. Chem. (2001) 73:5683–5690.
