Bioinformatics Advance Access originally published online on April 7, 2008
Bioinformatics 2008 24(10):1293-1299; doi:10.1093/bioinformatics/btn123
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Peak bagging for peptide mass fingerprinting
Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, China
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Mass Spectrometry (MS)-based protein identification via peptide mass fingerprinting (PMF) is a key component in high-throughput proteome research. While PMF was the first commonly used protein identification method, provided higher throughput than the tandem MS-based method, its accuracy is lower than that of the tandem MS method. Thus, it is desirable to develop PMF-based algorithm with higher protein identification accuracy to facilitate proteome research.
Results: We propose a peak bagging method for single MS-based protein identification. It combines results from multiple PMF algorithms, where each PMF algorithm takes a random peak subset as input. Evaluation with a set of real MALDI-TOF MS spectra shows that the new peak bagging method provides consistent improvements over the single PMF algorithm.
Contact: eezyhe{at}ust.hk
| 1 INTRODUCTION |
|---|
|
|
|---|
In mass spectrometry (MS)-based proteome research, identifying the existence of proteins in MS data is one critical step toward quantitative modeling of cellular functions at the system level. There are two methods for protein identification: single MS-based peptide mass fingerprinting (PMF) method and tandem MS (MS/MS)-based peptide/protein identification method.
The PMF method is the first commonly used protein identification method. It is particularly suitable for high-throughput analysis of biological samples. The PMF-based protein identification consists of the following steps:
- Protein purification: Protein separation techniques such as liquid-phase chromatography (LC) or in-gel electrophoretic separation (2D-PAGE) are used to obtain purified protein samples.
- Protein digestion: The purified protein sample is digested by trypsin and the masses of resulting peptides are measured by MS.
- Protein identification: The experimentally observed peptide peaks are compared with theoretical peaks of reference protein sequences. Proteins that best match the experimental peptide peaks are reported along with some numeric scores of the match quality.
Many PMF algorithms have been proposed. For instance, ProFound algorithm (Zhang and Chait, 2000), Probity algorithm (Eriksson and Fenyo, 2004) and MOWSE score-based algorithms (Pappin et al., 1993; Perkins et al., 1999; Song et al., 2007) are derived using rigorous statistical frameworks, while the MSA algorithm (Egelhofer et al., 2002) is based on a recalibration technique.
Compared to the MS/MS method, the PMF method is less accurate since it cannot distinguish different peptides with identical mass. The principal advantage of the PMF method over tandem MS-based methods is that it does not need to infer which proteins are present in samples from peptide identification results (the major difficult issue that MudPIT approaches have to deal with) since we usually use the PMF method on single proteins. Moreover, the PMF method provides high-throughput analysis and is less expensive. If we could improve its protein identification accuracy, the PMF method would become more attractive in proteome research. Based on this motivation, we propose a new machine learning-based PMF method in this article.
Recently, many machine learning-based PMF algorithms (Margnin et al., 2004; Siepen et al., 2007; Yang et al., 2008) have been proposed to improve PMF accuracy. All of them use the supervised learning technique to build classification models from a given set of labeled/training data. In learning the classification model, the training dataset should be large enough to fit model parameters. Unfortunately, obtaining such training data is often expensive and time consuming.
The limitation of supervised learning-based PMF algorithms motivates us to examine the potential of applying other machine learning techniques. We find that ensemble learning method is a good choice. The basic idea of ensemble learning is to combine multiple learned models under the assumption that two (or more) heads are better than one. The decisions of multiple hypotheses are combined to produce more accurate results.
Typical ensemble learning methods such as bagging (Breiman, 1996) and boosting (Freund and Schapire, 1997) have been applied successfully in the area of supervised learning to improve prediction accuracy. In the context of PMF, we apply an unsupervised ensemble learning method to increase the accuracy of PMF without using labeled data. More precisely, we propose a new peak bagging algorithm for protein identification. In this algorithm, we first generate multiple peak subsets through random sampling. Then, we combine multiple rankings generated from these peak subsets to obtain a better consensus ranking. Experimental results show that the peak bagging algorithm provides consistent improvement in terms of identification accuracy.
The rest of the article is organized as follows: Section 2 describes the peak bagging algorithm in detail. Section 3 demonstrates the effectiveness of our method in protein identification with experiments. Section 4 concludes this article with some discussions of future work.
| 2 METHODS |
|---|
|
|
|---|
Our peak bagging algorithm is based on sampling and aggregating, which is similar to the popular bagging method (Breiman, 1996) in classification and regression. It consists of the following steps:
- Candidate protein generation: A set of proteins are retrieved from the protein database to serve as candidates for the consequent bagging process.
- Peak sampling: The input peak set is sampled at random to generate multiple peak subsets.
- Local rank list generation: From each peak subset, a rank list is generated using a PMF algorithm. The PMF algorithm in this context is called base PMF algorithm.
- Rank aggregation: Rankings from different peak subsets are aggregated into a consensus ranking. The top ranked proteins are reported as protein identification results.
Figure 1 gives an overview of this algorithm. In the following, we will explain each step in more details.
|
2.1 Candidate protein generation
Since the size of protein sequence database is very large, ranking all the proteins in a database is much more time consuming than ranking only a small subset of proteins. In order to improve the computational efficiency, we first run a PMF algorithm to filter out unlikely proteins and keep only C top ranked proteins as candidates for the bagging process.
2.2 Peak sampling
The peak sampling procedure generates S peak subsets of equal size using a fixed sampling fraction f. Suppose
is the set of M peaks extracted from the spectrum by some peak detection algorithms (e.g. Yasui et al., 2003). Suppose each peak subset
has
peaks. Then,
.
The motivations behind peak sampling for PMF bagging are three-fold:
- The reproducibility of MS spectra is low (Dekker et al., 2005). Of all the peaks in replicate spectra of the same purified protein, only a small portion of peaks are present in all spectra and most peaks are present in only one spectrum. This indicates that we are actually using subsets of all peaks in PMF even before peak sampling. From this viewpoint, it is reasonable to consider peak sampling as a procedure for imitating replicate spectra.
- From the viewpoint of ensemble learning, creating a set of diverse rankings is of primary importance in our peak bagging algorithm. Peak sampling brings randomness into the ranking generation procedure so that certain diversity between different rank list is created. The sampling fraction f controls the degree of randomness. The smaller the sampling fraction, the more the peak subsets will differ, thereby introducing more randomness into the bagging procedure.
- Each mass spectrum contains electronic and chemical noise along with ion signals. Presence of noisy and irrelevant peaks can significantly degrade the performance of the PMF algorithm. The sampling process randomly leaves some peaks out. As a result, some noisy peaks could be deleted. Since most PMF algorithms do not require a complete list of matching peaks in protein identification, missing some meaningful peaks after sampling is not as serious as one may suspect. In contrast, the opportunity of removing noisy peaks may largely improve the accuracy of protein identification.
We have two choices in peak sampling: sampling with replacement and sampling without replacement. Each peak can appear many times in the random subset when we use sampling with replacement. In contrast, each peak can appear at most once when we use sampling without replacement. In Section 3, we will demonstrate their difference empirically.
2.3 Local rank list generation
The base PMF algorithm will return an ordered-rank list of proteins on each peak subset, which is similar to web page search results (Lee, 1997; Dwork et al., 2001).
A rank list (or simply ranking)
= [x1
x2
···
xk] is an ordering of a subset S
U, where xi
S, U is the set of all proteins in the database, and
is an ordering relation on S.
The rank list
is called a full list when
= U; otherwise, it is a partial list. In general, most PMF algorithms will report a partial list of top k (e.g. k = 10) ranked proteins to the end-user. Such lists are also called top k rank lists.
We use s
(xi) to denote the score assigned to protein xi (xi
U) in the rank list
. The database search score reflects the similarity between the acquired MS spectrum and a theoretical spectrum in the protein sequence database. Without loss of generality, we assume that the higher the score, the better the rank
(xi) of protein xi.
2.4 Rank aggregation
Rank aggregation combines multiple rankings into a consensus ranking. Two methods are commonly used for rank aggregation: rank-based method and score-based method. The score-based method is more informative than the rank-based method. Thus, here we use the score-based method to perform rank aggregation.
To make scores across multiple rank lists comparable, we transform the scores into probability estimates. The probability estimation methods generally fall into two categories: parametric method (Nesvizhskii et al., 2003; Manmatha et al., 2001) and non-parametric method. The non-parametric method is highly intuitive and distribution free, which is very suitable in the context of protein identification. Here we use a non-parametric method in (Montague and Aslam, 2001) to estimate the probability. Given a rank list
= [x1
x2
···
xk], the probability of each protein is computed as the ratio between the current protein score and the total score of all proteins:
|
| (1) |
(xi) denotes the estimated probability of protein xi in the rank list
.
Given a set of n rankings R = {
1,
2, ... ,
n}, we can aggregate them using different operations such as unweighted min, max, or sum operation to obtain a consensus ranking
. Since the CombSUM (i.e. sum of the probabilities of identifications) method has shown better performance (Lee, 1997), we also use this method to obtain the aggregated probability
over multiple rankings:
|
| (2) |
Then, we use the aggregated probability as score to rank each protein in the consensus ranking. Finally, we report a set of top ranked proteins to the end user.
| 3 EXPERIMENTS |
|---|
|
|
|---|
We use the same 52 benchmark MALDI-TOF MS spectra as in (Song et al., 2007), where the reader can find more details about sample preparation and data acquisition. Here we only emphasize that each spectrum is generated by using a single purified protein.
We use a recent release of the Swiss-Prot protein database for protein identification. It contains around 260 000 protein sequences (http://www.pir.uniprot.org/database/download.shtml). Since 40 proteins in (Song et al., 2007) are not included in the Swiss-Prot database, we retrieve these 40 proteins from the NCBI database and add them to the Swiss-Prot database.
Throughout the experiments, we use the same set of parameters in all PMF algorithms: trypsin digestion with a maximum of one missed cleavage, monoisotopic peaks, single charge state, unrestricted protein mass and mass tolerance threshold of 50 ppm. We also consider some typical post-translational modifications (PTMs)1 in the searching algorithms: carboxyamidomethyl cysteine (+57.02146 Da) as the fixed modification and oxidation of methionine as the variable modification (+15.9945 Da).
Knowing which database hit is correct for every spectrum, we can quantitatively evaluate the performance of a PMF algorithm.
Given an input spectrum and a top k rank list
, we say that
gives a correct identification if the ground-truth protein is contained in
. Then we can define the protein identification accuracy at rank k in terms of correct identifications:
|
| (3) |
A somewhat more tolerant criterion for evaluating the overall identification performance would be the average identification accuracy, which is the average of acc(i) over all rank i for 1
i
k:
|
| (4) |
In the following, we first demonstrate the improvement of peak bagging algorithm over single PMF algorithm. Then, we investigate how the performance of the peak bagging algorithm is affected by parameters. These parameters include number of peak subsets S, the sampling fraction f, the number of candidate proteins C and sampling schemas.
3.1 Peak bagging versus single PMF algorithm
Many methods can be used as the base PMF algorithms to verify the effectiveness of our peak bagging algorithm. Among them, Mascot (Perkins et al., 1999) is probably the most widely used method. But unfortunately, the technical details of Mascot are not publicly available, making it difficult to directly embed Mascot into our peak bagging method. In addition, the web-based Mascot version only reports no more than 50 hits in a single run of database search. As we often need several hundreds or even one thousand hits in our bagging procedure, using the web-based version as a stand-alone library function in peak bagging also proves to be infeasible.
Alternatively, we decide to use ProFound (Zhang and Chait, 2000) and probability-based scoring function (PBSF) (Song et al., 2007) as the base PMF algorithm. It should be noted that they may not necessarily provide the best performance. The major reason that we choose these two methods is that their details are publicly available. Since our focus is to demonstrate the merit of the peak bagging strategy, we believe that it should be adequate if we can demonstrate the improvement of performance using two different PMF algorithms.
In the experiment, we submit each of the experimental spectra (with known identity) to both peak bagging algorithm and single PMF algorithm for scoring. In performance evaluation, we take the top 10 ranked proteins as potential true positives.
In the peak bagging algorithm, we fix the number of candidate proteins to 500 (C = 500) and use sampling with replacement to generate 2000 peak subsets (S = 2000) with a sampling fraction of 0.2 (f = 0.2). We carry out 10 runs to measure the performance. In each run, only top 10 ranked proteins are included in the rank list for bagging. Note that the parameter specification here looks arbitrary without much reasoning. In consequent sub-sections, we will discuss how to specify these parameters properly.
We first report the results of using ProFound as the base PMF algorithm (see the Appendix 1 for details on our implementation of ProFound algorithm). Figure 2 displays the protein identification accuracies of single ProFound algorithm and peak bagging algorithm up to rank k for 1
k
10. The peak bagging algorithm is consistently and substantially more accurate than the single ProFound algorithm. This means that the peak bagging strategy is able to boost up the performance of ProFound algorithm significantly. Its success comes from the peak sampling strategy, which creates a set of diverse rankings so that the overall performance is improved. Note that our method does not require any prior knowledge or labeled data.
|
We also conduct the performance comparison when PBSF is used as the base PMF algorithm. The implementation of the PBSF algorithm strictly follows the description in Song et al. (2007). Figure 3 displays the identification accuracies of single PBSF algorithm and peak bagging algorithm. The peak bagging strategy improves the performance of the PBSF algorithm. We also note that the improvement in PBSF-based method is not as significant as in ProFound-based method. This fact reveals the importance of choosing a proper base PMF algorithm when the overall identification accuracy is our main concern. In the rest parts of this section, we will use ProFound as the base PMF algorithm to study the properties of peak bagging algorithm.
|
3.2 Bagging performance versus number of peak subsets
The number of randomly sampled peak subsets S has a significant influence on the bagging performance. To clarify this point, we use the average identification accuracy over all rank k (1
k
10) as the performance indicator. We set C = 500 and use sampling with replacement to generate 100, 500, 2000 and 10 000 peak subsets, respectively, with a sampling fraction f = 0.2. Figure 4 illustrates that the overall performance of peak bagging algorithm increases with the increase of subset number. Figure 4 also shows that the variance of accuracies decreases when we use more sampled peak subsets for bagging. In practice, 1000 or 2000 peak subsets are large enough to achieve a stable accuracy.
|
3.3 Bagging performance versus sampling fraction
The sampling fraction is another important parameter in peak bagging algorithm. We first set C = 500 and then use random sampling with replacement to generate 2000 peak subsets with sampling fraction 0.1, 0.2, 0.3, 0.4, 0.6 and 0.8, respectively.
Figure 5 plots the average identification accuracies of peak bagging algorithm when the sampling fraction increases from 0.1 to 0.8. The trend of first increase and then decrease of accuracy is clearly visible. This trend vividly reflects the conflict between increasing diversity and keeping enough informative peaks.
|
Figure 5 shows that the sampling fraction ranging from 0.1 to 0.4 yields better results. To understand the reason, we first plot the histogram of peak matching fractions of ground-truth proteins in Figure 6. We observe that most ground-truth proteins also have a matching fraction between 0.1 to 0.4. This reveals the intrinsic correlation between optimal sampling fractions and peak matching fractions of ground-truth proteins. Unfortunately, we usually do not know the ground-truth proteins, which makes it difficult to estimate the optimal sampling fraction directly.
|
If the experimental setup allows, the simplest approach to finding the optimal sampling fraction might be that we also run the MS scanning on a set of pure proteins with known identity under the same experimental settings. Then, we check the histogram of peak matching fractions (as shown in Fig. 6) of these pure proteins and determine the optimal sampling fraction based on the histogram.
Before we can find a good estimator for the optimal sampling fraction, we like to recommend a value range between 0.1 and 0.4 as a simple rule of thumb in setting up the sampling fraction.
3.4 Bagging performance versus number of candidate proteins
In order to check the effect of different number of candidate proteins on the bagging results, we first use random sampling with replacement to generate 2000 peak subsets at sampling fraction f = 0.2 and then set up the number of candidate proteins as 100, 300, 500 and 1000, respectively.
Figure 7 records the average identification accuracies of peak bagging algorithm when different number of candidate proteins are used. The bagging performance increases when more candidates are used, while the increase is not so significant. At the same time, the increase of protein candidates will not reduce the variance of identification accuracies. This is because more candidates will bring more noisy proteins into the bagging process so that the algorithm is not stable enough.
|
As a rule of thumb, the accuracy is good enough when we choose 500 candidates.
3.5 Sampling with and without replacement
Figure 8 displays the average identification accuracies over all ranks when we use different sampling methods. It suggests that sampling with replacement is better than sampling without replacement in terms of average identification accuracy. Our explanation is that if replacement is allowed, each peak can be selected many times, then more diversity is created among different rankings.
|
The difference between two sampling methods becomes more apparent when the sampling fraction increases. This is because peak subsets generated by sampling without replacement will become similar to each other with the increase of sampling fraction. In contrast, sampling with replacement is less affected in this regard.
3.6 Summary
As a result of the experiments, we suggest the following parameter setting in the peak bagging algorithm: sampling with replacement, 500 candidates (C = 500) and 2000 peak subsets (S = 2000). Moreover, we suggest to use a value ranging from 0.1 to 0.4 to specify the sampling fraction f.
| 4 CONCLUSIONS AND DISCUSSIONS |
|---|
|
|
|---|
Resampling methods such as bagging (Breiman, 1996) and boosting (Freund and Schapire, 1997) have been applied successfully in the context of supervised learning to improve the classification accuracy. Here we propose a peak bagging strategy to improve the performance of single PMF algorithm. We demonstrate that bagging consistently improves the accuracy of protein identification.
Despite the success of peak bagging algorithm in improving the rankings, the issue of assessing the significance of a possible identification is not addressed in the article. The statistical significance is critical in accepting or rejecting a reported protein. In our future work, we plan to: (1) apply some common significance measure (such as P-value and E-value) to evaluate the consensus ranking directly and (2) define new significance measures through aggregating the significance values from the local rank lists.
Another interesting issue is to apply the peak bagging method to identify protein mixtures. In protein mixtures, MS peaks come from multiple proteins. It is interesting and challenging to extend current peak sampling strategy in a way that the majority of peaks in each peak subset is drawn from a single protein.
| Appendix 1 |
|---|
|
|
|---|
ProFound algorithm
The ProFound algorithm is based on Bayesian statistics and takes into account background information and adjacent peptides in its scoring. Here we use our own implementation of ProFound score based on Equation (2) in (Zhang and Chait, 2000) with the empirical term Fpattern set to 1 (the authors did not explain how to set this parameter). As indicated by Margnin et al. (2004), this term only marginally affects the performance. Then, the ProFound score of a target protein xk reads:
|
| (A1) |
i is the standard deviation of matched mass values at mei.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
The comments and suggestions from the anonymous reviewers greatly improved the article. We thank Dr Dong Xu and Dr Zhao Song for providing us the experimental data.
Funding: This work was supported with the Competitive Earmarked Research Grant 621707 from the Hong Kong Research Grant Council.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Jonathan Wren
1There are usually two types of PTMs: fixed modification and variable modification. Fixed modification means that peptides with all applicable residues will be searched with that modification, whereas variable modification means that the modification may or may not be present and the peptide will be analyzed by looking at a possibility of either one or all applicable residues being considered as modified. ![]()
Received on February 4, 2008; revised on March 17, 2008; accepted on April 3, 2008
| REFERENCES |
|---|
|
|
|---|
Breiman L. Bagging predictors. Mach. Learn (1996) 24:123–140.
Dekker LJ, et al. A new method to analyze matrix-assisted laser desorption/ionization time-of-flight peptide profiling mass spectra. Rapid Commun. Mass Spectrom (2005) 19:865–870.[CrossRef][Web of Science][Medline]
Dwork C, et al. Rank aggregation methods for the web. In:. Proceedings of 10th International Conference on the World Wide Web (2001) Hong Kong, China: ACM Press. pp. 613–622.
Egelhofer V, et al. Protein identification by MALDI-TOF-MS peptide mapping: a new strategy. Anal. Chem (2002) 74:1760–1771.[Medline]
Eriksson J, Fenyo D. Probity: a protein identification algorithm with accurate assignment of the statistical significance of the results. J. Proteome Res (2004) 3:32–36.[CrossRef][Web of Science][Medline]
Freund Y, Schapire RE. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Sys. Sci (1997) 55:119–139.[CrossRef]
Lee JH. Analysis of multiple evidence combination. In:. Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (1997) Philadelphia, PA, USA: ACM Press. pp. 267–276.
Manmatha R, et al. Modelling score distributions for combining the outputs of search engines. In:. Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (2001) New Orleans, Louisiana, USA: ACM Press. pp. 267–275.
Montague M, Aslam JA. Relevance score normalization for meta search. In:. Proceedings of the 10th Conference on Information and Knowledge Management (2001) Atlanta, Georgia, USA: ACM Press. pp. 427–433.
Margnin J, et al. OLAV-PMF: a novel scoring scheme for high-throughput peptide mass fingerprinting. J. Proteome Res (2004) 3:55–60.[CrossRef][Web of Science][Medline]
Nesvizhskii AI, et al. A statistical model for identifying proteins by tandem mass spectrometry. Anal. Chem (2003) 75:4646–4658.[Medline]
Pappin DJ, et al. Rapid identification of proteins by peptide-mass fingerprinting. Curr. Biol (1993) 3:327–332.[CrossRef][Web of Science][Medline]
Perkins DN, et al. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis (1999) 20:3551–3567.[CrossRef][Web of Science][Medline]
Siepen JA, et al. Prediction of missed cleavage sites in tryptic peptides aids protein identification in proteomics. J. Proteome Res (2007) 6:399–408.[CrossRef][Web of Science][Medline]
Song Z, et al. Development and assessment of scoring functions for protein identification using PMF data. Electrophoresis (2007) 28:864–970.[CrossRef][Web of Science][Medline]
Yang D, et al. High-accuracy peptide mass fingerprinting using peak intensity data with machine learning. J. Proteome Res (2008) 7:62–69.[CrossRef][Web of Science][Medline]
Yasui Y, et al. A data-analytic strategy for protein biomarker discovery: profiling of high-dimensional proteomic data for cancer detection. Biostatistics (2003) 4:449–463.[Abstract]
Zhang WZ, Chait BT. ProFound: an expert system for protein identification using mass spectrometric peptide mapping information. Anal. Chem (2000) 72:2482–2489.[Medline]
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||








