Bioinformatics Advance Access originally published online on June 17, 2008
Bioinformatics 2008 24(14):1652-1653; doi:10.1093/bioinformatics/btn232
© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

On E-values for tandem MS scoring schemes

Mark R. Segal

Department of Epidemiology and Biostatistics, Center for Bioinformatics and Molecular Biostatistics, University of California, San Francisco, USA


    ABSTRACT

Contact: mark{at}biostat.ucsf.edu

In a recent article in this journal, Khatun, Hamlett and Giddings (2008) (KHG) advance a new scoring scheme for use in conjunction with tandem mass spectrometry (MS/MS)-based peptide identification. As they note, such identifications are fundamental to much proteomics research but, due to MS/MS data complexity and the scale of attendant database searches, their accuracy is limited. The scoring technique they propose, which employs a hidden Markov model (HMM) over a set of states that represent key features of MS/MS data, is convincingly motivated and exhibits good performance. The purpose of this brief note is to critique the method chosen for calibrating the HMM scores, rather than the genesis of the scores themselves.

The ubiquity of expectation (E) values, as provided by BLAST sequence-based searches and based on type I extreme value (Gumbel) distributions, prompted efforts to produce analogous summaries for the seemingly similar MS/MS database searches. In particular, Fenyö and Beavis (2003) (FB) devise such a summary, and it is this approach that is employed by KHG. The appropriateness of the type I extreme value distribution (evd) for sequence-based search stems from the selection of maximal scoring segments (Karlin and Altschul, 1990) and has strong theoretic and empiric underpinnings, although low complexity sequence constitutes an exception (Sharon et al., 2005). FB consider four scoring functions and assert that the arguments used to justify the extreme value distribution pertain. However, on theoretic grounds, this does not appear to be the case, nor is it immediate that the evd pertains to KHG's HMM-based scores. Further, FB contend that, for the evd, the tail of the log survival function is linear in log score and propose a corresponding estimation scheme for evd parameters. Regardless of the fact that the claimed linearity does not hold for a type I evd [it holds approximately for the type II (Fréchet) evd], there are existing parameter estimation schemes (e.g. Segal et al., 2000) that enjoy superior efficiency and robustness properties and are readily computable. In their supplementary methods, KHG state that log survival relates linearly to score, which corresponds to an approximately exponential distribution. Again, pursuing E-value determination by linear tail fitting can be highly non-robust.
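
The linearity point can be made explicit with the standard forms of the two distributions. The following is a brief sketch rather than a derivation taken from FB or KHG; the parameter symbols (location \mu and scale \beta for type I, scale s and shape \alpha for type II) are notation assumed here for illustration. It uses only the approximation 1 - e^{-t} \approx t for small t.

```latex
% Type I (Gumbel) evd, location \mu, scale \beta > 0
\[
F_{\mathrm{I}}(x) = \exp\!\left\{-e^{-(x-\mu)/\beta}\right\}, \qquad
S_{\mathrm{I}}(x) = 1 - F_{\mathrm{I}}(x) \approx e^{-(x-\mu)/\beta}
\quad (x \to \infty),
\]
\[
\log S_{\mathrm{I}}(x) \approx -\frac{x-\mu}{\beta}.
\]

% Type II (Frechet) evd, scale s > 0, shape \alpha > 0, support x > 0
\[
F_{\mathrm{II}}(x) = \exp\!\left\{-(x/s)^{-\alpha}\right\}, \qquad
S_{\mathrm{II}}(x) = 1 - F_{\mathrm{II}}(x) \approx (x/s)^{-\alpha}
\quad (x \to \infty),
\]
\[
\log S_{\mathrm{II}}(x) \approx -\alpha \log x + \alpha \log s.
\]
```

Under these approximations the type I log survival is asymptotically linear in the score x itself, which is the exponential-type tail, whereas linearity in log x arises for the type II form.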

Before proceeding to demonstrate these points, some general comments are in order. First, the problem of tail area estimation (underlying E/P-value computation) is challenging, and difficulties associated with using parametric extrapolation for such purposes are not confined to MS/MS peptide/protein identification. Nonetheless, I believe it purposeful to showcase concerns in this arena because of the additional limitations of proposed approaches. Second, it can be argued that non-robustness in E/P-value estimation is not consequential since these values are used for downstream screening or discrimination, rather than direct interpretation in formal probabilistic terms. However, confidence or significance statements based on E/P-values are still commonly proffered. Moreover, improved discrimination can potentially be obtained through improved estimation, since the shortcomings of the FB schema are avoidable.

To illustrate these concerns with the FB approach, I showcase an example using MS/MS data (kindly provided by Robert Chalkley and Aenoch Lynn), with database matching scores obtained using Protein Prospector (Chalkley et al., 2005). It is important to recognize that the concerns transcend the specific scoring scheme employed and are, rather, fundamental to the estimation approach. Figure 1 shows the (smoothed) empiric density (black curve) for these scores, with superimposed fits of type I evd (red) and gamma (green) densities. As confirmed by Q–Q plots (not shown), these densities both provide good fits to the data, with the gamma proving superior. (In the majority of instances examined the evd did not provide a good summary.) The inset P-values under each density pertain to the maximal score, indicated by the blue arrow, that represents a true peptide match per manual verification. Figure 2 depicts the FB estimation process, based on linear fitting of log survival to log score, for two (10%, 1%) candidate tail quantiles. Two features are notable. First, the inherent and theoretically expected non-linearity is evident. Note that this pertains to scores conforming to the type I evd. The same phenomenon is apparent for random samples generated according to this distribution (not shown). Second, and more important, the combination of curvature and differing tail quantile specifications can have a large impact on FB-derived P-values. The two upper tail prescriptions result in P-values that differ by two orders of magnitude. The difference between the P-value obtained using the (recommended) upper 10% quantile (P=5e-06) and that obtained using maximum likelihood or method of moments fitting (P=2.18e-09) exceeds three orders of magnitude. Further, these differences arise despite large attained R² values for the linear fit, exceeding 93% for the 10% quantile.
Now while the P-value disparities may not be critical in and of themselves since, as noted above, they are often used for screening or discrimination purposes rather than for formal probability statements, it remains the case that they unnecessarily result from the non-robust FB estimation scheme. Moreover, even discrimination-based P-value usage can be distorted when applied across different settings due to the interplay of distributional lack-of-fit and non-robust estimation. It is important to note that the showcased disparities arose in the favorable, yet infrequent, situation where the score distribution is well approximated by a type I evd. Finally, P-values serve as inputs into multiple testing correction methods, an inherent component of peptide identification via database search, which can further compound problems due to inaccurate estimates.
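
The comparison described above can be tried on simulated data. The following is a minimal sketch, not the code used for Figures 1 and 2: it draws scores from a type I evd with arbitrarily chosen parameters (location 20, scale 4, both assumptions made purely for illustration), fits a least-squares line of log empirical survival on log score over the upper 10% and 1% tails in the spirit of the FB scheme, and evaluates each line at the maximal score. A method of moments fit of the type I evd is included for comparison. All function names are hypothetical.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def gumbel_sample(n, mu, beta, rng):
    """Draw n type I extreme value (Gumbel) variates by inverse-CDF sampling."""
    draws = []
    for _ in range(n):
        u = max(rng.random(), 1e-300)  # guard against log(0)
        draws.append(mu - beta * math.log(-math.log(u)))
    return draws


def gumbel_survival(x, mu, beta):
    """Exact type I survival function, 1 - exp(-exp(-(x - mu) / beta))."""
    return -math.expm1(-math.exp(-(x - mu) / beta))


def moments_fit(scores):
    """Method of moments estimates (mu, beta) for a type I evd."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    beta = math.sqrt(6.0 * var) / math.pi
    return mean - EULER_GAMMA * beta, beta


def fit_line(xs, ys):
    """Ordinary least-squares (intercept, slope) for y regressed on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def tail_line_pvalue(scores, tail_fraction, target):
    """Fit log(empirical survival) on log(score) over the upper tail and
    evaluate the fitted line at `target`. Tail scores must be positive."""
    ordered = sorted(scores)
    n = len(ordered)
    k = max(2, int(n * tail_fraction))  # number of tail points used
    xs, ys = [], []
    for i in range(n - k, n):
        xs.append(math.log(ordered[i]))
        ys.append(math.log((n - i) / n))  # fraction of scores >= ordered[i]
    intercept, slope = fit_line(xs, ys)
    return math.exp(intercept + slope * math.log(target))


def main():
    rng = random.Random(2008)  # fixed seed so the run is repeatable
    true_mu, true_beta = 20.0, 4.0  # illustrative values, not from the article
    scores = gumbel_sample(5000, true_mu, true_beta, rng)
    top = max(scores)

    mu_hat, beta_hat = moments_fit(scores)
    print(f"maximal score: {top:.2f}")
    print(f"P (true type I evd):  {gumbel_survival(top, true_mu, true_beta):.3g}")
    print(f"P (moments fit):      {gumbel_survival(top, mu_hat, beta_hat):.3g}")
    for frac in (0.10, 0.01):
        p = tail_line_pvalue(scores, frac, top)
        print(f"P (tail line, {frac:.0%}): {p:.3g}")


if __name__ == "__main__":
    main()
```

Varying the seed, the sample size or the tail fractions passed to `tail_line_pvalue` gives a rough sense of how sensitive the extrapolated value is to the tail prescription, relative to the parametric fit obtained from `moments_fit`.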


Fig. 1. Type I extreme value (red curve) and gamma (green) densities fit to the Protein Prospector database matching scores as shown by the rug (teal) on the top axis and smoothed density (black). The blue arrow pinpoints the maximal score to which the inset P-values pertain.

 

Fig. 2. Illustration of the (Fenyö and Beavis, 2003) approach for estimating P-values via linear fitting of log(survival) to log(score) for differing upper tail prescriptions.

 
In summary, I believe that such parametric attempts at inheriting BLAST-style E-values for assessing significance of MS/MS database search scoring schemes should proceed with caution in view of both the difficulties associated with tail estimation and the complexities of MS/MS data. The latter point is exemplified by the framework developed by Shen et al. (2007), wherein score distributions are but one of several components required to assign significance/confidence to putative peptide identifications. Additionally, a recent special issue of the Journal of Proteome Research (2008, Volume 7, Issue 1) was devoted to related concerns. While the focus of several contributions was on competing approaches to multiple testing correction, there was unanimity surrounding the criticality of properly framed null (referent) distributions. Even when so equipped, the illustrations herein demonstrate the need for sound P-value estimation techniques.


    ACKNOWLEDGEMENTS
Helpful comments were provided by Robert Chalkley, the associate editor and three referees.

Funding: This work was supported by NIH Grant Number 1 UL1 RR024131-01.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Alfonso Valencia

Received on March 5, 2008; revised on May 9, 2008; accepted on May 12, 2008

    REFERENCES

    Chalkley RJ, et al. Comprehensive analysis of a multidimensional liquid chromatography mass spectrometry dataset acquired on a quadrupole selecting, quadrupole collision cell, time-of-flight mass spectrometer: II. New developments in Protein Prospector allow for reliable and comprehensive automatic analysis of large datasets. Mol. Cell. Proteomics (2005) 4:1194–1204.

    Fenyö D, Beavis RC. A method for assessing the statistical significance of mass spectrometry-based protein identifications using general scoring schemes. Anal. Chem. (2003) 75:768–774.

    Karlin S, Altschul SF. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl Acad. Sci. USA (1990) 87:2264–2268.

    Khatun J, et al. Incorporating sequence information into the scoring function: a hidden Markov model for improved peptide identification. Bioinformatics (2008) 24:674–681.

    Segal MR, et al. Comparing DNA fingerprints of infectious organisms. Statist. Sci. (2000) 15:27–45.

    Sharon I, et al. Correcting BLAST e-Values for low-complexity segments. J. Comput. Biol. (2005) 12:978–1001.

    Shen C, et al. A hierarchical statistical model to assess the confidence of peptides and proteins inferred from tandem mass spectrometry. Bioinformatics (2007) 24:202–208.

