Bioinformatics Advance Access originally published online on November 17, 2007
Bioinformatics 2008 24(2):202-208; doi:10.1093/bioinformatics/btm555

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

A hierarchical statistical model to assess the confidence of peptides and proteins inferred from tandem mass spectrometry

Changyu Shen 1,*, Zhiping Wang 1, Ganesh Shankar 1, Xiang Zhang 2 and Lang Li 1

1Division of Biostatistics, Department of Medicine, Indiana University School of Medicine, Indianapolis, IN 46202 and 2Department of Chemistry, University of Louisville, Louisville, KY 40292, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: Statistical evaluation of the confidence of peptide and protein identifications made by tandem mass spectrometry is a critical component for appropriately interpreting the experimental data and conducting downstream analysis. Although many approaches have been developed to assign confidence measures from different perspectives, a unified statistical framework that integrates the uncertainty of peptides and proteins is still missing.

Results: We developed a hierarchical statistical model (HSM) that jointly models the uncertainty of the identified peptides and proteins and can be applied to any scoring system. With data sets of a standard mixture and the yeast proteome, we demonstrate that the HSM offers a reliable or at least conservative false discovery rate (FDR) estimate for peptide and protein identifications. The probability measure of HSM also offers a powerful discriminating score for peptide identification.

Availability: The algorithm is available upon request from the authors.

Contact: chashen{at}iupui.edu

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 
Recent developments in mass spectrometry (MS) techniques have substantially improved the scale of analysis of complex protein mixtures (Colinge et al., 2003; Craig et al., 2004; Matthiesen et al., 2005; Perkins et al., 1999). In particular, the ‘shotgun’ or ‘bottom-up’ approach that integrates liquid chromatography (LC), tandem mass spectrometry (MS/MS) and database search allows high-throughput identification of proteins and has been used to profile the proteome of various biological samples including cell lines, tissues and serum/plasma (McCormack et al., 1997; Peng et al., 2003; Qian et al., 2005; Washburn et al., 2001). In a typical LC-MS/MS experiment, proteins in a sample are first digested into peptides, separated in an LC system and then subjected to mass spectrometry analysis. During the MS analysis, each peptide is first ionized and the mass-to-charge ratio (m/z) of the ionized peptide is measured by the first MS. Certain peptide ions are then selected for fragmentation to generate their m/z signatures, which are collected by a second MS scan (MS/MS). The observed signature is compared with the theoretical signature of each candidate sequence in a database (e.g. sequences whose mass matches that measured in the first MS) using a score that characterizes the similarity of the two spectra, and the candidate with the best score is assigned as the identification. The identified peptides are then assembled to identify the proteins. The overall experimental design is illustrated in Figure 1.


 
Fig. 1. Schematic representation of the LC-MS/MS experiment. Solid arrows refer to experimental steps and dashed arrows refer to interpretation steps. P1, P2 and P3 are the proteins in the sample that generate peptides p1 and p2, p3 and p4, and p5 and p6, respectively. The MS/MS of p1 and p4 are falsely interpreted as the signature of px and py that do not exist in the digested sample, which leads to two false proteins Px and Py. Peptide p3 could also have been generated by Px had Px been in the original sample.

 
As powerful as the shotgun approach is, protein identifications are subject to errors. In particular, variation and noise can be introduced into the observed spectra through experimental steps such as sample preparation, injection, fragmentation and mass analysis. In addition, there is uncertainty in the prediction model used to generate the theoretical spectra. Both factors contribute to false assignments, in which the peptide sequence assigned to a spectrum is not the peptide that generated the spectrum. In Figure 1, peptides px and py are the false assignments that lead to two false proteins Px and Py. The situation is further complicated by the so-called degenerate peptides, i.e. peptides that can originate from more than one protein species due to multiple family members, alternatively spliced forms and other processes, which makes it difficult to assign these peptides to their precursor proteins in a deterministic manner. In Figure 1, peptide p3 could be generated by both P2 and Px, and it cannot be inferred with certainty which of the two proteins is present (or whether both are) even if we know p3 is present.

Obviously, accurate peptide identification is critical for protein identification. Much effort has been devoted to this direction by constructing powerful scoring systems such as those based on the number of shared peaks (Eng et al., 1994), the stochastic modeling of the fragmentation process (Bafna and Edwards, 2001), the peak intensity (Havilio et al., 2003), a Bayesian approach (Zhang et al., 2002), Mowse scores (Perkins et al., 1999), hypergeometric distributions (Sadygov and Yates, 2003; Tabb et al., 2007), Poisson distributions (Geer et al., 2004; Xue et al., 2006) and a regression model (Feng et al., 2007). Peptide and protein identifications, however, are still not perfect, which highlights the value of statistical evaluation of their confidence for appropriately interpreting the data. Typical metrics include the P-value or E-value (Fenyo and Beavis, 2003; Geer et al., 2004; Perkins et al., 1999; Sadygov and Yates, 2003; Tabb et al., 2007), or false discovery rate (FDR) (Benjamini and Hochberg, 1995; Elias and Gygi, 2007; Higgs et al., 2007; Keller et al., 2002; Nesvizhskii et al., 2003; Qian et al., 2005). The overall strategy is a two-step procedure in which peptide confidence is evaluated first, followed by protein confidence. These studies approach the statistical significance of peptide/protein identification from a number of angles and yield substantial insight into the nature of the problem and guidance on data interpretation.

In this article, we focus on two issues that have not been adequately addressed regarding the confidence of peptide and protein identifications. First, because a protein can generate multiple peptides and a peptide can be generated by multiple proteins, knowledge of the presence/absence of a peptide/protein might have implications for the likelihood of the presence/absence of other peptides/proteins. For example, in Figure 1, if we know peptide p5 is present, then we know protein P3 is present, which will increase the likelihood of the presence of peptide p6. Therefore, the uncertainty of peptides and proteins should be integrated in a unified model framework to assess their confidence. The two-step procedure adopted by many methods does not, in essence, offer an integrated model and might not be optimal in terms of the reliability of the confidence measure. In addition, such an integrated model can provide extra power to discriminate true from false peptides/proteins by accounting for the network among peptides and proteins. The second issue is how to interpret the scores of a peptide that is assigned to multiple spectra. A common practice in the literature is to take the maximum score. However, such a simplification can be over-optimistic because it ignores the lower scores: a false peptide that is assigned multiple times may, by chance, yield some high scores.

We propose a hierarchical statistical model (HSM) to tackle the two issues described above. The HSM offers a unified framework that integrates the uncertainty of peptides and proteins so that the confidence of peptides and proteins is estimated simultaneously. Furthermore, the HSM has the following advantages: (a) straightforward FDR estimation and control; (b) the framework can be applied to any scoring system and (c) a confidence measure in the form of a probability can potentially provide better discriminating power than the original score. With data sets of a standard mixture and the yeast proteome, we demonstrate that our model can provide a reliable or at least conservative confidence measure for peptides and proteins. In addition, the probability measure from the HSM also offers a powerful discriminating score for peptide identifications.


    2 METHODS
 
2.1 Overview
For the peptide/protein identification problem outlined in Figure 1, we construct a model that reflects the process of generating the score of each peptide assignment to an MS/MS spectrum. The model is composed of four connected layers (marginal/conditional distributions) as shown in Figure 2. In this framework, we start with the identified proteins and peptides and consider three types of unobserved binary variables: presence/absence status for proteins and peptides, and the correctness of each peptide assignment. We derive their probability distribution using the scores of each assignment and the structure of the connections among peptides and proteins (dashed arrows connecting peptides and proteins in Figure 1). The rationale is that the score value suggests some likelihood of a correct assignment, and such likelihood can be translated into the likelihood of the presence/absence of peptides and proteins according to their connections. In doing so, we effectively take into account the correlation among peptides and proteins. In addition, since scores of the same peptide are connected through the unobserved presence/absence status of the peptide, their correlations are also accounted for.


 
Fig. 2. Four layers of the HSM. Except layer 1, which is a marginal distribution, each layer represents a conditional distribution.

 
2.2 Model
In this section, we describe our model by following the structure in Figure 2. Our model is built on an empirical Bayes framework, where the probabilities of the unobserved variables and the model parameter values are inferred from each data set by the maximum-likelihood (ML) principle.

We first introduce some notation. Let N be the number of proteins with at least one peptide hit and M be the number of peptides assigned to at least one spectrum. Let Yi be a binary indicator such that Yi = 1 if protein i is in the sample and 0 otherwise. Let Zj be the binary indicator such that Zj = 1 if peptide j is present in the digested sample and 0 otherwise. Define Wjk to be a binary indicator such that Wjk = 1 if the kth assignment of peptide j to a spectrum is correct and 0 otherwise, with corresponding matching score Sjk (k = 1, 2, ..., Tj). Let Cj be the set of proteins that could potentially generate peptide j. In other words, proteins in set Cj have a subsequence that matches peptide j. Finally, Y = {Yi}, and similar notation applies to the other variables.
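
As an illustration of the notation only, the candidate sets Cj could be built by plain substring matching, as in the sketch below. This is not the authors' implementation; the function name `build_candidate_sets` and the toy sequences are hypothetical.

```python
def build_candidate_sets(proteins, peptides):
    """Return C_j for each peptide j: the proteins whose sequence contains it.

    proteins: dict mapping protein name -> amino acid sequence
    peptides: list of identified peptide sequences
    """
    return {
        peptide: {name for name, sequence in proteins.items() if peptide in sequence}
        for peptide in peptides
    }


# Hypothetical toy sequences, for illustration only.
proteins = {"P2": "MKAAGHRLLSK", "Px": "MTLLSKQQER"}
peptides = ["LLSK", "AAGHR"]
C = build_candidate_sets(proteins, peptides)
# "LLSK" is a degenerate peptide here (it matches both proteins);
# "AAGHR" matches P2 only.
```

A peptide whose candidate set contains more than one protein corresponds to a degenerate peptide in the sense of Section 1.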

2.2.1 Model for Y (marginal probability for proteins)
The marginal model of Yi is described by independent Bernoulli distributions:


$$\Pr[Y_i = 1] = \rho, \qquad i = 1, \ldots, N \tag{1}$$

Hence, ρ is the proportion of the identified proteins that are true positive.

2.2.2 Model for Z|Y (conditional probability for peptides)
The probability that peptide j is present in the digested sample is equal to the probability that at least one protein in Cj generates it. Let Pr[Zj = 1|Yi = 1] = πij(α) for protein i in Cj, where πij is a function that depends on relevant characteristics of protein i and peptide j and α is a parameter vector. Therefore, given Y, we have

$$\Pr[Z_j = 1 \mid Y] = 1 - \prod_{i \in C_j} \left[1 - \pi_{ij}(\alpha)\right]^{Y_i} \tag{2}$$
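
The statement that a peptide is present when at least one protein in Cj generates it can be sketched as follows, under the assumption (made here for illustration) that the candidate proteins generate peptide j independently of one another. The function name and the numerical values are hypothetical.

```python
def prob_peptide_present(y, pi):
    """Pr[Z_j = 1 | Y] under independent generation by the proteins in C_j.

    y:  dict mapping each protein i in C_j -> Y_i (1 if present, 0 if absent)
    pi: dict mapping each protein i in C_j -> pi_ij = Pr[Z_j = 1 | Y_i = 1]
    """
    prob_not_generated = 1.0
    for protein, present in y.items():
        if present:
            # Only present proteins can generate the peptide.
            prob_not_generated *= 1.0 - pi[protein]
    return 1.0 - prob_not_generated


# Hypothetical values for a degenerate peptide shared by two proteins.
pi = {"P2": 0.6, "Px": 0.5}
only_p2 = prob_peptide_present({"P2": 1, "Px": 0}, pi)
both = prob_peptide_present({"P2": 1, "Px": 1}, pi)
neither = prob_peptide_present({"P2": 0, "Px": 0}, pi)
```

When no candidate protein is present the probability is zero, so a peptide can only be present through at least one of its precursor proteins.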
The stochastic nature of the cleavage at an amide bond can depend on many physico-chemical properties at the local or global level. Since a cleavage event can be categorized as either specific (occurring at the protease's preferred location) or non-specific, we consider five types of mechanisms by which a peptide can be generated by a protein:


Formula

Note that peptides located at the two ends of a protein can be generated by one cleavage. For trypsin digestion, we define a specific cleavage to be at the C-terminal side of Arginine (R) or Lysine (K), except when either is followed by Proline (P).
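
The trypsin rule stated above can be written down directly. The sketch below only locates specific cleavage sites and lists the fully specific (tryptic) peptides; it does not attempt to model the five generation mechanisms, and the function names and example sequence are hypothetical.

```python
def tryptic_cleavage_sites(protein):
    """0-based indices i such that a specific cleavage occurs after residue i.

    A site is specific when residue i is K or R and residue i + 1 is not P.
    The last residue is skipped because no bond follows it.
    """
    return [
        i
        for i in range(len(protein) - 1)
        if protein[i] in "KR" and protein[i + 1] != "P"
    ]


def fully_tryptic_peptides(protein):
    """Peptides obtained by cutting at every specific site (no missed cleavage)."""
    peptides, start = [], 0
    for site in tryptic_cleavage_sites(protein):
        peptides.append(protein[start:site + 1])
        start = site + 1
    peptides.append(protein[start:])
    return peptides


# Hypothetical sequence: K at position 1 is a site; R at position 6 is
# followed by P and is therefore not a specific site.
sites = tryptic_cleavage_sites("MKAAGHRPLLSK")
digest = fully_tryptic_peptides("MKAAGHRPLLSK")
```

In this example both resulting peptides sit at an end of the protein, so each requires only one cleavage, as noted in the text.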

2.2.3 Model for W|Z (conditional probability for assignments)
We consider W|Z since W does not depend on Y given Z. Although most of the time Zj = 1 implies Wjk = 1, it is possible that a peptide ion does not generate the spectrum it is assigned to, even though the peptide is indeed in the digested sample. In other words, the assignment of a peptide that exists to a spectrum does not always mean the assignment is correct. This is more likely to happen for complex samples where many peptides can generate similar spectra. To describe the stochastic nature of this process, we consider the following model:


$$\Pr[W_{jk} = 1 \mid Z_j = 1] = \tau, \qquad \Pr[W_{jk} = 1 \mid Z_j = 0] = 0 \tag{3}$$

Hence, the chance that the assignment of a present peptide to a spectrum is correct is described by the parameter τ. The W|Z component is particularly useful for peptides assigned to multiple spectra since it allows flexible interpretation of the assignments without the constraint that assignments of the same peptide are either all correct or all incorrect.

2.2.4 Model for S|W (conditional probability density for scores)
We consider a mixture distribution framework based on the rationale that the population of scores can be divided into two sub-populations, those of the correct assignments and those of the incorrect assignments. The idea is that correct assignments should have, on average, higher scores than incorrect assignments. The less the two distributions overlap, the more discriminating power the score has. Since the distribution of a score could also depend on factors other than the correctness of the assignment, such a mixture framework can be constructed at each level of a factor or each combination of levels of multiple factors. For a single factor Q, with qjk denoting the value of Q for the kth assignment of peptide j, we have


$$f(S_{jk} \mid W_{jk} = w,\, q_{jk} = q) = f_{q,w}(S_{jk};\, \beta_{q,w}), \qquad w \in \{0, 1\} \tag{4}$$

where fq,0 and fq,1 are the probability density functions for the scores of incorrect and correct assignments with value q, and βq,0 and βq,1 are the corresponding parameter vectors.
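
To make the mixture idea concrete, the sketch below computes the posterior probability that a single assignment is correct from two score densities. It is a simplification made for illustration: the prior weight of a correct assignment is a fixed number here, whereas in the HSM it is induced by the upper layers (Y, Z and W). The choice of a gamma density for correct and a normal density for incorrect assignments, all parameter values and all function names are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def gamma_pdf(x, shape, scale):
    """Density of a gamma distribution (zero for non-positive x)."""
    if x <= 0:
        return 0.0
    log_pdf = ((shape - 1.0) * math.log(x) - x / scale
               - math.lgamma(shape) - shape * math.log(scale))
    return math.exp(log_pdf)


def posterior_correct(score, prior_correct, f1, f0):
    """Pr[W = 1 | S = score] for a two-component mixture with fixed prior weight."""
    numerator = prior_correct * f1(score)
    denominator = numerator + (1.0 - prior_correct) * f0(score)
    return numerator / denominator if denominator > 0 else 0.0


# Hypothetical densities: f1 for correct, f0 for incorrect assignments.
f1 = lambda s: gamma_pdf(s, shape=4.0, scale=1.5)
f0 = lambda s: normal_pdf(s, mean=-1.0, sd=1.0)

p_high = posterior_correct(6.0, prior_correct=0.1, f1=f1, f0=f0)
p_low = posterior_correct(1.0, prior_correct=0.1, f1=f1, f0=f0)
```

The less the two densities overlap, the more sharply this posterior separates high from low scores, which is the sense in which overlap determines discriminating power.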

2.2.5 Number of peptide hits
It is well known that the number of peptide hits provides substantial information on the likelihood of the presence of proteins. To account for this, we construct a binary variable Vi for protein i that indicates whether or not the number of peptide hits is beyond a threshold h (see Supplementary Material S4). Specifically, denote by Xi the number of peptides identified for protein i (not necessarily unique peptides) and define Vi = 1 if Xi > h and 0 otherwise. Then we consider the following model:


$$\Pr[V_i = 1 \mid Y_i = 1] = \gamma_1, \qquad \Pr[V_i = 1 \mid Y_i = 0] = \gamma_0 \tag{5}$$

where γ1 and γ0 are the probabilities of observing more than h hits for present and absent proteins, respectively.
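
The indicator Vi is a simple thresholding of the hit count Xi, which can be sketched as below. How a degenerate peptide contributes to Xi is not spelled out in this section; the sketch assumes it is counted once for every candidate protein, and the function name and data are hypothetical.

```python
from collections import Counter


def hit_indicators(hits, h):
    """Return V_i = 1 if X_i > h and 0 otherwise, for every protein with a hit.

    hits: iterable with one entry per identified peptide hit, each entry being
          the set of candidate proteins for that peptide (assumption: a hit is
          counted for every candidate protein).
    h:    threshold on the number of peptide hits.
    """
    counts = Counter()
    for candidate_proteins in hits:
        counts.update(candidate_proteins)
    return {protein: int(x > h) for protein, x in counts.items()}


# Hypothetical hits: P2 has two hits, Px and P3 have one each.
V = hit_indicators([{"P2"}, {"P2", "Px"}, {"P3"}], h=1)
```

With h = 1, Vi = 1 corresponds to the familiar rule of requiring at least two peptide hits per protein.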

By integrating Equations (1–5), the joint distribution of the variables involved can be written as:

$$\Pr(Y, Z, W, S, V;\, \theta) = \Pr(Y;\, \rho)\, \Pr(Z \mid Y;\, \alpha)\, \Pr(W \mid Z;\, \tau)\, f(S \mid W;\, \beta)\, \Pr(V \mid Y;\, \gamma)$$

By treating Y, Z and W as the unobserved variables and θ = (ρ, α, τ, β, γ) as the parameter vector, we use the Expectation-Maximization (EM) algorithm (Dempster et al., 1977), implemented in R, to estimate θ. The EM algorithm iteratively updates the value of θ and the expectation of the log-likelihood (conditional on S and V) until convergence, at which point the parameter estimate $\hat{\theta}$ is obtained (see Supplementary Material S1). Then the confidence of peptides and proteins is


$$\Pr[Z_j = 1 \mid S, V;\, \hat{\theta}] \tag{6}$$

and

$$\Pr[Y_i = 1 \mid S, V;\, \hat{\theta}] \tag{7}$$
Correlations among peptides/proteins and among multiple scores of the same peptide are accounted for in (6) and (7) by conditioning only on the observed values.
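
For a very small network, posterior probabilities of this kind can be computed by brute-force enumeration, which makes the effect of conditioning on all observed scores visible. The sketch below is not the EM procedure of the paper: the parameters are fixed rather than estimated, the V component is omitted, a single scalar π is used for every protein-peptide pair, and the normal score densities and all numerical values are hypothetical.

```python
import itertools
import math


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posteriors(proteins, peptides, rho, pi, tau, f1, f0):
    """Return (Pr[Y_i = 1 | S], Pr[Z_j = 1 | S]) by enumerating Y and Z.

    proteins: list of protein names
    peptides: dict mapping peptide name -> (candidate proteins C_j, scores S_jk)
    """
    names = list(peptides)
    total = 0.0
    protein_mass = {i: 0.0 for i in proteins}
    peptide_mass = {j: 0.0 for j in names}
    for y in itertools.product((0, 1), repeat=len(proteins)):
        present = dict(zip(proteins, y))
        prior_y = math.prod(rho if v else 1.0 - rho for v in y)
        for z in itertools.product((0, 1), repeat=len(names)):
            weight = prior_y
            for j, z_j in zip(names, z):
                candidates, scores = peptides[j]
                # Pr[Z_j = 0 | Y]: no present candidate protein generates peptide j.
                p_absent = math.prod(1.0 - pi for i in candidates if present[i])
                weight *= (1.0 - p_absent) if z_j else p_absent
                for s in scores:
                    # W_jk is summed out: it can equal 1 only when Z_j = 1.
                    weight *= (tau * f1(s) + (1.0 - tau) * f0(s)) if z_j else f0(s)
            total += weight
            for i in proteins:
                if present[i]:
                    protein_mass[i] += weight
            for j, z_j in zip(names, z):
                if z_j:
                    peptide_mass[j] += weight
    return ({i: m / total for i, m in protein_mass.items()},
            {j: m / total for j, m in peptide_mass.items()})


# Hypothetical score densities and parameter values.
f1 = lambda s: normal_pdf(s, 3.0, 1.0)   # correct assignments
f0 = lambda s: normal_pdf(s, 0.0, 1.0)   # incorrect assignments
params = dict(rho=0.5, pi=0.8, tau=0.9, f1=f1, f0=f0)

# Peptide p6 (weak score) analysed alone, then together with p5 (strong score).
_, alone = posteriors(["P3"], {"p6": (["P3"], [1.0])}, **params)
prot, joint = posteriors(
    ["P3"], {"p5": (["P3"], [4.0]), "p6": (["P3"], [1.0])}, **params
)
```

In this toy setting the posterior of p6 is higher in the joint analysis than when p6 is analysed alone, because the strong score of p5 raises the probability that their shared precursor protein is present. This mirrors the p5/P3/p6 example of Section 1. Enumeration grows exponentially with the number of proteins and peptides, so it is only useful as an illustration.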


    3 RESULTS
 
3.1 A standard mixture
Standard mixtures allow one to examine the performance of confidence measures since the protein species in the sample are known. We first apply our model to the MS/MS data of a simple sample composed of 23 stand-alone peptides and a trypsin digest of 12 proteins (see Supplementary Material S3). The experimental details have been described previously (Purvine et al., 2004). SEQUEST (Eng et al., 1994) was used to search a database composed of the 35 peptides/proteins coupled to the protein set from Shewanella oneidensis and typical sample contaminants (4256 entries in total), where non-tryptic peptides are allowed. With three replicates, there are in total 8535 MS/MS spectra assigned to 7128 unique peptides (417 true positive).

We used the score developed in PeptideProphet (Keller et al., 2002) as the score S in our model to discriminate true from false assignments. This score, which is called fval, is a linear combination of a number of parameters output from SEQUEST (e.g. Xcorr and deltaCn). In Figure 3, we show the histograms of fval for each of the three replicates and all three replicates together. It is clear that correct assignments tend to have higher scores than incorrect assignments and the shapes of the distributions are very consistent across replicates.


 
Fig. 3. Histograms of the distributions of fval for each one of the three replicates and combined data (Purvine et al., 2004). Due to the simplicity of the mixture, assignments that hit the 23 peptides or 12 proteins in the mixture are assumed to be correct. The rest are incorrect.

 
We fit our model to the data from all three replicates. We exclude the V|Y component because the 23 peptides will have at most one hit (they are not further digested) and therefore the rationale of ‘more hits implies more confidence’ does not apply. In addition, the chance that a spectrum assigned to a peptide that exists is actually generated by some other peptide is very small (τ ≈ 1) due to the simplicity of the mixture. Hence, we assume Wjk = Zj. Based on the distributions in Figure 3, we fit a two-component mixture model, where the distributions of fval for correct and incorrect assignments are described by gamma and normal distributions, respectively.

We first examine the reliability of the confidence for peptide identification as measured by the probability in (6). One minus such a probability provides an estimate of the local FDR (Efron et al., 2001), which can be used to estimate the FDR for a set of claimed positives by simply taking the average of their local FDR values (Newton et al., 2004). In Figure 4, we compare the FDR estimated by PeptideProphet, ProteinProphet (Nesvizhskii et al., 2003) and the HSM with the empirical FDR (see Supplementary Material S2). It is clear that PeptideProphet tends to underestimate the FDR by over 50% most of the time. ProteinProphet provides the best estimate at the high-FDR end, though it is still over-optimistic at the low-FDR end. The HSM underestimates less at the low-FDR end and tends to be conservative at the high-FDR end. Moreover, the probability measures have more discriminating power than fval itself, and the HSM performs the best (Figure 5). In summary, the probability in (6) can be used both as a reliable or at least conservative confidence measure for peptide identifications and as a powerful discriminating score.
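
The FDR estimate described above, i.e. the average local FDR over the claimed positives, is a one-line calculation. The sketch below assumes that positives are claimed by thresholding the posterior probability; the function name and the probabilities are hypothetical.

```python
def estimated_fdr(probabilities, threshold):
    """Average local FDR (1 - probability) over the identifications selected
    by applying a threshold to their posterior probabilities."""
    selected = [p for p in probabilities if p >= threshold]
    if not selected:
        return 0.0  # no claimed positives, so no false discoveries
    return sum(1.0 - p for p in selected) / len(selected)


# Hypothetical peptide probabilities; three of them pass the 0.8 threshold,
# with local FDR values of 0.01, 0.05 and 0.20.
fdr = estimated_fdr([0.99, 0.95, 0.80, 0.40], threshold=0.8)
```

Lowering the threshold adds identifications with larger local FDR, so the estimated FDR of the selected set can only stay the same or increase.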


 
Fig. 4. Comparison of the FDR estimates for peptide identifications by the HSM, PeptideProphet and ProteinProphet. Positives are selected by applying various thresholds to the peptide probabilities.

 

 
Fig. 5. Receiver Operating Characteristic (ROC) curves for discriminating true from false peptides by using probabilities from HSM, PeptideProphet, ProteinProphet and the fval values.

 
At the protein level, all 12 proteins have a probability close to 1 based on ProteinProphet and the HSM. Two false proteins and one false protein are identified with probability higher than 0.8 by ProteinProphet and the HSM, respectively.

3.2 A yeast data set
Due to the lack of a complex standard sample with a decent number of known proteins, it is rather difficult to evaluate a confidence measure at the protein level. A common practice in the literature is to search an MS/MS data set of species A against a database composed of sequences in the proteomes of A and another species B, where protein hits of A and B are deemed true and false, respectively.
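
Under this two-species convention, the simplest empirical FDR for a selected set is the fraction of selected hits that belong to species B. The sketch below uses that plain ratio as an assumption; the definition actually used in this study is given in Supplementary Material S2 and is not reproduced here. The function name, species labels and probabilities are hypothetical.

```python
def empirical_fdr(hits, threshold, false_species):
    """Fraction of selected protein hits that come from the 'false' species.

    hits: list of (species, probability) pairs, one per identified protein
    """
    selected = [species for species, p in hits if p >= threshold]
    if not selected:
        return 0.0
    n_false = sum(1 for species in selected if species == false_species)
    return n_false / len(selected)


# Hypothetical hits from a search against a combined A + B database.
hits = [("A", 0.99), ("A", 0.90), ("B", 0.85), ("A", 0.50)]
fdr_at_08 = empirical_fdr(hits, threshold=0.8, false_species="B")
```

Because hits to species A are all counted as true, a ratio of this kind can only understate the real error rate when some of the A hits are themselves false.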

We analyzed a yeast MS/MS data set, which was collected on a QSTAR mass spectrometer for three replicates of each of the five gel regions (Elias et al., 2005). SEQUEST was used to conduct a non-constrained search against a sequence database composed of 6473 entries of Saccharomyces cerevisiae and 22 437 entries of Caenorhabditis elegans. All sequences were downloaded from UniprotKB (http://beta.uniprot.org/uniprot/). In total, 9272 MS/MS spectra were assigned to 4148 unique peptides. We excluded 13 singly charged (+1) peptides, which left 6869 doubly charged (+2) and 2363 triply charged (+3) peptides in our analysis. Again, we use fval as the score in the HSM. We fit distinct gamma–gamma mixture models for +2 and +3 peptides because both the correct and incorrect assignments have a skewed distribution and the distributions tend to differ between +2 and +3 peptides. One of the common practices in protein identification is to select proteins with at least two peptide hits to improve identification accuracy. Hence, we set h = 1 for the definition of Vi [see Equation (5)].

In Table 1, we show the number of yeast/C.elegans proteins selected by the HSM and ProteinProphet by applying various cutoffs to the probability measure. It is clear that at any given threshold, ProteinProphet selects more proteins (mostly from yeast) than the HSM. In other words, ProteinProphet on average tends to assign higher confidence than the HSM.



 
Table 1. Number of Yeast and C.elegans proteins selected by applying various thresholds to the probability measure from HSM and ProteinProphet

 
There are two major factors that lead to higher confidence by ProteinProphet. First, protein confidence is evaluated in ProteinProphet as


$$P_i = 1 - \prod_{j:\, i \in C_j} \left(1 - w_{ij}\, p_j\right) \tag{8}$$

where Pi is the probability of presence for protein i, pj is the probability of presence for peptide j and wij is a weight factor. Equation (8) essentially says that the probability of protein i being present is equal to the probability that at least one of its peptides is present. A key underlying assumption of (8) is that the presence/absence of peptides is independent, which in general is not true. For example, suppose there are two peptides j and k that could be generated by protein i only (Cj = Ck = {i}). Given the knowledge that peptide j is absent, the likelihood that peptide k is absent will, in most scenarios, be higher than its marginal probability of being absent, since the knowledge suggests that protein i is likely absent and therefore so is peptide k. The consequence is that many terms in the product of Equation (8) are lower than they should be, which leads to overestimation of the confidence for proteins with multiple peptide hits. This phenomenon is particularly observable at low FDR values and has been reported previously (Feng et al., 2007; Xue et al., 2006). In Table 2, we show an example in the yeast data set.



 
Table 2. An example from the yeast data set
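
The effect of the independence assumption behind Equation (8) can be seen with a small calculation. The sketch below assumes the product form described in the text (one minus the product, over the peptides of a protein, of one minus the weighted peptide probability); the function name, the default weights of 1 and the peptide probabilities are hypothetical.

```python
import math


def protein_probability(peptide_probs, weights=None):
    """Probability that at least one peptide is present, treating the
    peptides as independent: 1 - prod_j (1 - w_j * p_j)."""
    if weights is None:
        weights = [1.0] * len(peptide_probs)  # non-degenerate peptides
    return 1.0 - math.prod(1.0 - w * p for w, p in zip(weights, peptide_probs))


# Two peptides of the same protein, each with probability 0.5.
independent = protein_probability([0.5, 0.5])

# Extreme contrast: if the two peptides were perfectly correlated (both
# present or both absent), the chance that at least one is present would
# simply be 0.5.
perfectly_correlated = 0.5
```

Under independence the protein probability is 0.75, whereas under perfect positive correlation it would be 0.5. Real data lie between these extremes, which is why the product form tends to overstate confidence for proteins with multiple peptide hits.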

 
The second reason is related to peptides assigned to multiple spectra. ProteinProphet selects the assignment with the maximum score and ignores the rest. If the remaining scores are substantially lower than the maximum, this tends to result in a higher confidence measure than that of the HSM, which accounts for the lower scores. For +2/+3 ions, the difference between the maximum and minimum scores of the same peptide is on average 77%/57% of the distance between the means of the distributions of the correct and incorrect assignments. Hence, using the maximum score only could have a profound impact on the confidence estimate. This effect is shown in Table 3, where the probabilities for two unique peptides of a yeast protein are compared between the HSM and ProteinProphet. It is possible that the difference seen in Table 3 is partially due to different parametric distribution forms (e.g. gamma versus normal) and the overall model structure. Nevertheless, we feel that the major driving force lies in the treatment of the lower scores.



 
Table 3. Two unique tryptic peptides of TMA2_YEAST
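
The influence of the lower scores can be illustrated with the likelihood ratio of 'peptide present' versus 'peptide absent' implied by the W|Z and S|W layers, computed once from all scores of a peptide and once from the maximum score only. The normal score densities, the value of τ, the scores and the function names below are hypothetical and are not taken from the yeast data.

```python
import math


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def likelihood_ratio(scores, tau, f1, f0):
    """Pr(scores | Z = 1) / Pr(scores | Z = 0), with each W_jk summed out.

    Given Z = 1 an assignment is correct with probability tau; given Z = 0
    every assignment is incorrect.
    """
    ratio = 1.0
    for s in scores:
        ratio *= (tau * f1(s) + (1.0 - tau) * f0(s)) / f0(s)
    return ratio


f1 = lambda s: normal_pdf(s, 3.0, 1.0)   # correct assignments
f0 = lambda s: normal_pdf(s, 0.0, 1.0)   # incorrect assignments

scores = [4.0, 0.5, 0.2]                  # one high score, two low scores
lr_all = likelihood_ratio(scores, tau=0.9, f1=f1, f0=f0)
lr_max = likelihood_ratio([max(scores)], tau=0.9, f1=f1, f0=f0)
```

Each low score contributes a factor below one, so the evidence from all scores is weaker than the evidence from the maximum alone. The size of the effect depends on τ: the closer τ is to 1, the more strongly a low score counts against the presence of the peptide.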

 
Therefore, ProteinProphet tends to be more optimistic in the calculation of the confidence of certain peptides and proteins through a series of model simplifications. The HSM incorporates various types of correlations through a more rigorous model scheme. In Figure 6, the empirical FDR is plotted against the estimated FDR for protein identifications. It can be seen that ProteinProphet tends to underestimate the FDR at the low-FDR end and then becomes conservative as the FDR goes higher. The HSM is closer to the perfect line at the low-FDR end and is always conservative. Overall, the HSM and ProteinProphet perform similarly at the high-FDR end, and the HSM seems to be more reliable, or at least conservative, at the low-FDR end. Note that the empirical FDR is actually a lower bound of the real empirical FDR since it is possible that some of the yeast proteins are also false positives. Thus, the real empirical FDR versus estimated FDR curves are likely to be above those seen in Figure 6.


 
Fig. 6. Comparison of the FDR estimates for protein identifications by the HSM and ProteinProphet. Positives are selected by applying various thresholds to the protein probabilities.

 
Table 1 suggests better discriminating power for ProteinProphet than for the HSM, i.e. more true positives for the same or a similar number of false positives. As just explained, ProteinProphet tends to be more optimistic than the HSM for proteins with multiple peptide hits or proteins whose peptide hits are assigned multiple times. For the yeast data set, 57.3% and 9.6% of the identified yeast and C.elegans proteins, respectively, have multiple peptide hits. Similarly, 59.3% of the peptides that match a subsequence of some yeast protein are assigned to multiple spectra, as compared with 6.8% for peptides that match only C.elegans proteins. Therefore, the ‘optimistic’ effect of ProteinProphet (compared with the HSM) is much more serious for yeast proteins than for C.elegans proteins, which is the major cause of the difference between ProteinProphet and the HSM shown in Table 1. However, it is possible that some of the yeast proteins are also false positives. Hence, Table 1 is still an approximation of the true sensitivity and specificity. A more accurate assessment will rely on a complex benchmark sample composed of a decent number of known proteins, which is not yet available.


    4 DISCUSSION
 
In this article, we propose a hierarchical statistical model to assess the confidence of peptides and proteins inferred by tandem mass spectrometry. The HSM incorporates correlations among multiple scores of the same peptide and correlations among the uncertainty of peptides and proteins. Through data sets of a standard mixture and the yeast proteome, it was demonstrated that HSM offers a reliable or at least conservative confidence measure for peptide and protein identifications, and a powerful probabilistic discriminating score for inferred peptides.

Degenerate peptides in some scenarios will result in degenerate proteins, i.e. proteins that do not have unique peptides (e.g. they share all their identified peptides with some other protein(s)). ProteinProphet employs the principle of ‘minimum set of proteins to explain the peptides’ to interpret the data. For example, in Figure 1, ProteinProphet will assign non-zero probability to Px and zero probability to P2, since Px by itself can explain the two observed peptides px and p3. The minimal set is conceptually simple, which makes data interpretation straightforward. The downside is that it might be overconfident. In the example above, it increases the certainty about P2 (zero probability means it is absent for sure) and about Px (all confidence in p3 is ‘spent’ on Px only, which gives more confidence in the presence of Px).

The HSM offers some flexibility in dealing with degenerate proteins through two options. In the first option, all candidate proteins are treated as distinct elements and each is assigned a confidence measure, which is the method employed in this study. This option ‘spends’ the confidence in peptides on all candidate proteins by assuming that every protein could be present. It is useful when one needs a quantitative measure of the likelihood of various family members or alternatively spliced forms of the same protein. In contrast, the second option employs the minimal protein set concept by spending the confidence in peptides on proteins in the minimal set only.

One potential solution to the degenerate peptides/proteins problem is to consider peptides that are not identified, since the fact that some peptides are not observed has implications for the likelihood of the presence of their precursor proteins (Tang et al., 2006). This information can be used to discern proteins confounded by degenerate peptides. We plan to extend the HSM in this direction and to account for the mis-cleavage process as well.


    ACKNOWLEDGEMENTS
 
This work is supported by the Fairbanks Institute, Department of Defense grant BC030400, the Indiana Alzheimer's Disease Center (IADC), the Indiana University Cancer Center and NCI grant U24CA126480. We want to thank Dr Eugene Kolker for kindly providing the standard-mixture mass spectrometry data, and Dr Haixu Tang and Mr Yong Li for helpful comments.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Olga Troyanskaya

Received on June 25, 2007; revised on October 4, 2007; accepted on November 1, 2007

    REFERENCES

    Bafna V, Edwards N. SCOPE: a probabilistic model for scoring tandem mass spectra against a peptide database. Bioinformatics (2001) 17(Suppl. 1):S13–S21.

    Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B (1995) 57:289–300.

    Colinge J, et al. OLAV: towards high-throughput tandem mass spectrometry data identification. Proteomics (2003) 3:1454–1463.

    Craig R, et al. Open source system for analyzing, validating, and storing protein identification data. J. Proteome Res. (2004) 3:1234–1242.

    Dempster AP, et al. Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. R. Stat. Soc. Ser. B (1977) 39:1–38.

    Efron B, et al. Empirical Bayes analysis of a microarray experiment. J. Am. Stat. Assoc. (2001) 96:1151–1160.

    Elias JE, Gygi SP. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat. Methods (2007) 4:207–214.

    Elias JE, et al. Comparative evaluation of mass spectrometry platforms used in large-scale proteomics investigations. Nat. Methods (2005) 2:667–675.

    Eng JK, et al. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. (1994) 5:976–989.

    Feng J, et al. Probability model for assessing proteins assembled from peptide sequences inferred from tandem mass spectrometry data. Anal. Chem. (2007) 79:3901–3911.

    Fenyo D, Beavis RC. A method for assessing the statistical significance of mass spectrometry-based protein identifications using general scoring schemes. Anal. Chem. (2003) 75:768–774.

    Geer LY, et al. Open mass spectrometry search algorithm. J. Proteome Res. (2004) 3:958–964.

    Havilio M, et al. Intensity-based statistical scorer for tandem mass spectrometry. Anal. Chem. (2003) 75:435–444.

    Higgs RE, et al. Estimating the statistical significance of peptide identifications from shotgun proteomics experiments. J. Proteome Res. (2007) 6:1758–1767.

    Keller A, et al. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem. (2002) 74:5383–5392.

    Matthiesen R, et al. VEMS 3.0: algorithms and computational tools for tandem mass spectrometry based identification of post-translational modifications in proteins. J. Proteome Res. (2005) 4:2338–2347.

    McCormack AL, et al. Direct analysis and identification of proteins in mixtures by LC/MS/MS and database searching at the low-femtomole level. Anal. Chem. (1997) 69:767–776.

    Nesvizhskii AI, et al. A statistical model for identifying proteins by tandem mass spectrometry. Anal. Chem. (2003) 75:4646–4658.

    Newton MA, et al. Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics (2004) 5:155–176.

    Peng J, et al. Evaluation of multidimensional chromatography coupled with tandem mass spectrometry (LC/LC-MS/MS) for large-scale protein analysis: the yeast proteome. J. Proteome Res. (2003) 2:43–50.

    Perkins DN, et al. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis (1999) 20:3551–3567.

    Purvine S, et al. Standard mixtures for proteome studies. Omics (2004) 8:79–92.

    Qian WJ, et al. Probability-based evaluation of peptide and protein identifications from tandem mass spectrometry and SEQUEST analysis: the human proteome. J. Proteome Res. (2005) 4:53–62.

    Sadygov RG, Yates JR III. A hypergeometric probability model for protein identification and validation using tandem mass spectral data and protein sequence databases. Anal. Chem. (2003) 75:3792–3798.

    Tabb DL, et al. MyriMatch: highly accurate tandem mass spectral peptide identification by multivariate hypergeometric analysis. J. Proteome Res. (2007) 6:654–661.

    Tang H, et al. A computational approach toward label-free protein quantification using predicted peptide detectability. Bioinformatics (2006) 22:e481–e488.

    Washburn MP, et al. Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nat. Biotechnol. (2001) 19:242–247.

    Xue X, et al. Protein probabilities in shotgun proteomics: evaluating different estimation methods using a semi-random sampling model. Proteomics (2006) 6:6134–6145.

    Zhang N, et al. ProbID: a probabilistic algorithm to identify peptides through sequence database searching using tandem mass spectral data. Proteomics (2002) 2:1406–1412.

