Bioinformatics Advance Access originally published online on September 28, 2004
Bioinformatics 2005 21(5):563-574; doi:10.1093/bioinformatics/bti044
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
A graph-theoretic approach for the separation of b and y ions in tandem mass spectra


1 Computational Systems Biology Laboratory, Department of Biochemical and Molecular Biology, University of Georgia Athens, GA 30602, USA
2 Computational Biology Institute, University of Tennessee Oak Ridge, TN 37831, USA
3 Genome Science and Technology Graduate School, University of Tennessee Oak Ridge, TN 37831, USA
4 Organic and Biological Mass Spectrometry Group, Chemical Sciences Division, Oak Ridge National Laboratory Oak Ridge, TN 37831, USA
*To whom correspondence should be addressed.
| Abstract |
|---|
|
|
|---|
Motivation: Ion-type identification is a fundamental problem in computational proteomics. Methods for accurate identification of ion types provide the basis for many mass spectrometry data interpretation problems, including (a) de novo sequencing, (b) identification of post-translational modifications and mutations and (c) validation of database search results.
Results: Here, we present a novel graph-theoretic approach for solving the problem of separating b ions from y ions in a set of tandem mass spectra. We represent each spectral peak as a node and consider two types of edges: type-1 edge connecting two peaks probably of the same ion types and type-2 edge connecting two peaks probably of different ion types. The problem of ion-separation is formulated and solved as a graph partition problem, which is to partition the graph into three subgraphs, representing b, y and others ions, respectively, through maximizing the total weight of type-1 edges while minimizing the total weight of type-2 edges within each partitioned subgraph. We have developed a dynamic programming algorithm for rigorously solving this graph partition problem and implemented it as a computer program PRIME (PaRtition of Ion types in tandem Mass spEctra). The tests on a large amount of simulated mass spectra and 19 sets of high-quality experimental Fourier transform ion cyclotron resonance tandem mass spectra indicate that an accuracy level of
90% for the separation of b and y ions was achieved.
Availability: The executable code of PRIME is available upon request.
Contact: xyn{at}bmb.uga.edu
| Introduction |
|---|
|
|
|---|
Tandem mass spectrometry (MS/MS) has become a dominant proteomics technique due to its ability to identify proteins in a high-throughput manner (Aebersold and Mann, 2003). In a typical liquid chromatography (LC)/MS/MS experiment, a protein mixture of interest is digested with proteases and the resulting peptides are separated by one- or multi-dimensional LC. When eluted from the LC column, peptides are transformed into gas phase as positively charged ions using electrospray and then introduced into mass spectrometer in batches. After measuring the mass/charge (m/z) ratio of all ions, MS can precisely isolate each peptide by its m/z and fragment the peptide through collisional-induced dissociation (CID). The resultant fragments from this peptide are then measured for their m/z ratios. This process involves two sequential measurements of m/z for a peptide, thus called tandem mass spectrometry (MS/MS) experiment. Modern mass spectrometers can acquire thousands of high-resolution MS/MS spectra per day. Interpretation of such high-throughput mass spectral data in a reliable and efficient manner represents a highly challenging computational problem.
MS/MS spectra are informative about the composition and the order of amino acids in a peptide sequence as it can reveal the molecular weights of the peptides breakdowns. Several bonds along the backbone of a peptide can be broken using the collisions. If the charge is retained on the N-terminal fragment, the ion is classified as a, b or c, and if the charge is retained on the C-terminal, the ion is classified as x, y or z. It has been observed that in a typical MS/MS spectral dataset, the majority of the N- and C-terminal ions are b and y ions, respectively, and each of these ion types contains the derivatives of neutral loss of water or ammonia (Dancik et al., 1999 Tabb et al., 2003). Note that the sum of the two complementary ions masses should be equal to the mass of the parent ion and the mass difference between two consecutive ions of the same type is exactly the mass of an amino acid.
There are two popular approaches for interpreting tandem mass spectra data for protein identification, the database search and de novo sequencing methods. The database search method compares experimental tandem spectra with theoretical tandem mass spectra of each peptide derived from a protein sequence database, such as Swiss-Prot (Boeckmann et al., 2003) and reports the best match or matches (Eng et al., 1994 Mann and Wilm, 1994 Fenyo et al., 1998 Clauser et al., 1999 Perkins et al., 1999 Fenyo, 2000) assuming that the query peptides exist in the protein sequence database. This approach is highly effective and has been used successfully in several proteomics projects on organisms with well-studied genomes (Washburn et al., 2001 Ho et al., 2002 Lasonder et al., 2002 Andersen et al., 2003). However, it is not applicable when a target sequence is not present in the protein database. This can happen for a number of reasons, including novel proteins, protein mutations, post-translational modifications (PTMs) and DNA sequencing errors. Since the database search method usually generates high-false-positive recognition rates, it remains an open problem as how to validate database search results (Nesvizhskii and Aebersold, 2004).
De novo sequencing method attempts to derive a protein sequence directly from tandem mass spectra (Bartels, 1990 Taylor and Johnson, 1997 Taylor and Johnson, 2001 Dancik et al., 1999 Pevzner et al., 2000 Chen et al., 2001 Lu and Chen, 2003 Ma et al., 2003). Theoretically, full-length peptide sequence could be derived from an ideal tandem mass spectrum by computing the differences in mass between adjacent fragment ions of the same ion type in a complete series of fragment ions. The de novo methodology usually employs a graph-theoretic approach in which tandem mass spectrum is typically represented as a graph. Each spectral peak is represented as a vertex and a pair of spectral peaks that differ precisely by the mass of one amino acid is represented as an edge (we call it type-1 edge). The partial sequence of target peptide is predicted by finding one or a set of longest directed paths in the spectrum graph. In this method, no special attempt was made to distinguish ion types and only the information on type-1 edges was used. However as shown in Figure 1 type-1 edges could be created to connect different type of ions by mistake. That is, although the difference in mass between any two ions of the same type is always equal to the total mass of some amino acids, it is not necessarily true vice versa. Since a tandem mass spectrum generally mixes up various ions of different types, such a kind of misconnection decreases the accuracy of de novo sequencing. Furthermore, currently available de novo sequencing programs are computationally intensive and require high-quality MS/MS data. Owing to these difficulties, de novo sequencing approach has not widely been used.
|
In this paper, we solve the problem of ion-type identification separately from the problem of de novo sequencing. We present a novel graph-theoretic approach for the identification of ion types in a set of high-quality MS/MS data. Since the majority of MS/MS spectral peaks are either b or y ions [e.g. as observed in ion trap instruments by Dancik et al., (1999) and Tabb et al., (2003)], our algorithm attempted to separate a set of MS/MS peaks into (1) b ions and their variants (i.e. loss of water or ammonia), (2) y ions and their variants and (3) the other ion types. We used a spectrum graph to represent a given set of spectral peaks in a similar fashion to some of the previous works (Bartels, 1990 Taylor and Johnson, 1997 Taylor and Johnson, 2001 Dancik et al., 1999 Pevzner et al., 2000 Chen et al., 2001 Lu and Chen, 2003 Ma et al., 2003). The main difference in our graph representation is that we considered two types of edges, one representing the connection between a pair of peaks suspected to be of the same ion type (type-1 edge) and the other representing the connection between a pair of peaks suspected to be of different ion types (type-2 edge), based on the observations that the mass difference between any two ions of the same type must be equal to the combination of some amino acids; and if the mass difference is not equal to the mass of any amino acids, it must arise from different type ions. Edge weights were assigned based on the estimated probabilities of whether the edges truly connect ions of the same or different types of ions. We formulated the ion-type identification problem as a graph partition problem, which is to partition the graph into three subgraphs, B, Y and U, respectively, to maximize the total weight of type-1 edges while minimizing the total weight of type-2 edges within each subgroup. This problem has been rigorously and efficiently solved using a dynamic programming algorithm, with a running time of
in the worst case, where i is the distance from the root in a breadth-first tree (BFT) of the spectrum graph, L is the depth of the BFT and |S i | is the number of vertices on the i-th level of the BFT. For a spectrum graph, |S i | is generally small in value. We have systematically tested our algorithm on 114 851 simulated tandem mass spectra data derived from tryptic digested peptides of proteins in the Yeast genome, and have achieved an accuracy level of >90% for the separation of b and y ions. The tests on 19 sets of high-quality experimental Fourier transform ion cyclotron resonance (FT-ICR) tandem mass spectra indicate an average accuracy of 88%. On a typical dataset with 40 spectral peaks, our identification program PRIME (PaRtition of Ion types in tandem Mass spEctra) generally finds the globally optimal partition within 1 s, on a Dell Workstation with Pentium 4 (2.1 GHz). | Methods |
|---|
|
|
|---|
Spectrum graph representation
Let S = {s 1,s 2, ..., s k } be a set of tandem mass spectral data with k peaks s j = {M j ,I j }, where M j and I j denote the neutral mass and intensity of peak s j . Throughout this paper, we assumed that ion masses are already derived from their m/z ratios based on the analysis of the isotope peaks. We define zero mass peak as s 0 = {0,I k+ 1} and parent mass peak as s k+1 = {M,I k+1}, where I k+ 1 = max{I j ,1
j
k} and M is the neutral mass of the parent peptide. Theoretically, each ion has a complementary ion (e.g. the complementary ion of a b ion is a y ion) in the same spectrum. That is for any ion with a mass X, there should be an ion with a mass Y such that X+Y = M. In an experimental spectrum dataset, some ions may have their complementary ions missing because of various reasons. In such cases, we add their complementary ions back. That is for any j, 1
j
k, if there is no peak {M i ,I i } with M i +M j =M, we add a peak {M M j , I j } to the spectrum. Note that the masses of expanded spectrum S are distributed symmetrically around half of the parent mass.
We shall now examine the mass differences between pairs of spectra peaks. First, we note that the mass difference between two ions of the same type (b or y or others) is always equal to the combination of some amino acids, whereas if the mass difference is not equal to the combination of any amino acids, it must be from different type ions. Our algorithm separates ions into different ion types primarily based on this property. However, two ions with a mass difference of the combined mass of some amino acids do not necessarily belong to the same ion type. To estimate the probability of a given mass difference to be from two ions of the same type, we conducted a simulation of tandem mass spectrometry experiment in silico using tryptic digested peptides from proteins derived from the Yeast genome in Swiss-Prot database (version of March 14, 2004). Each peptide was theoretically fragmented into b, y ions and their chemical variants (i.e. loss of water for the first present residue S, T, E, D or loss of ammonia for the first present residue R, K, Q, N). The mass differences between the ions of same type and between ions of different types were calculated and tabulated (we treated b ions and their variants as the same group and so for y ions and their variants). The conditional probability that a given mass difference
is from two ions of the same type was then estimated by the ratio of the counts of
between two ions of the same type to the total occurrences of
. The results are shown in Figure 1. Apparently, mass differences that do not correspond to the total mass of some amino acids must arise from ions of different types, while mass differences arising from one or two amino acids are highly probable to come from ions of the same type (see the information rich zone, 0200 Da in Fig. 1).
We used the following procedure to construct the spectrum graph. Each peak of tandem mass spectrum is represented as a vertex. To capture two distinct relationships between two vertices, two kinds of edges, type-1 and type-2 edges are considered. A pair of peaks is connected by a type-1 edge if their mass difference is the same as the mass of a single amino acid (suggesting that they are most probably of the same ion type), or by a type-2 edge if their mass difference is not equal to the combination of any amino acids (indicating that they definitely belong to different ion types). In the interest of reducing the complexity of the graph, we currently use the following more conservative definition that keeps the number of type-2 edges small. A pair of peaks is connected by a type-2 edge only if their mass difference is
35 Da (which is smaller than the mass of any amino acid) and not equal to 1, 17 or 18 Da (which correspond to masses of H2ONH3, water loss and ammonia loss, respectively). We named this threshold as
2. To make the calculations more efficient, if a vertex had only type-2 edges, we discarded all these type-2 edges. We call this graph G=(V, E) a spectrum graph, with V and E being the vertex and edge set, respectively.
Each type-1 edge has an assigned weight, representing the possibility that the two peaks involved are of the same ion type. The weight of a type-1 edge E(V m ,V n ) is defined as follows:
![]() |
where I m ,I n in (0, 1] are the relative intensities of ions m and n; Pr (
mn ) is the conditional probability that ions m and n are of the same type, given the mass difference
mn = |M n M m |; aa i is an amino acid type which has the smallest |m(aa i )
mn | value and satisfies the condition |m(aa i )
mn |
1 (
1= 0.01 Da for simulated tandem mass spectra or 0.05 Da for experimental FT-ICR data), and m(aa i ) is the mass of aa i ; F i is the relative frequency of amino acid, aa i , in the target genome. The scaling factor
>0 is determined empirically. This definition validates the observation that a good type-1 edge usually has a small mass deviance, connects two peaks with high intensities and has a high probability to arise from the same type ions.
Since the type-2 edges generally connect different type ions based on the above conservative definition, the weight of a type-2 edge E(V m ,V n ) is simply defined as
![]() |
Problem formulation
If we imagine type-1 edges carrying attractive force and type-2 edges repulsive force, the vertices of the same ion type in a spectrum graph should naturally cluster together, whereas vertices of different ion types repel away. Noise can be disconnected or attached by false type-1 edges to b or y ions. The separation of b and y ions can then be achieved by optimally cutting all the type-2 edges and false type-1 edges.
Let
be the set of all possible tri-partitions of vertices set V of a spectrum graph G. Without loss of generality, we assume that G is a connected graph; otherwise G will be an individual connected component. We define the scoring function Score(P,G) for any tri-partition P = {V B ,V Y ,V U }
as
![]() |
where W i (A,B) represents the total weight of type-i edges between vertices of subsets A, B
V, Q i is a positive factor, i = 1, 2. Our goal is to find the optimal partition P opt
such that
![]() |
Algorithm
We now present a dynamic programming algorithm for solving the optimization problem defined above. For a carefully chosen vertex v 0
V (see below), we construct a BFT(v 0) of the spectral graph G, with v 0 being the root (Cormen et al., 2001). Let S i be a set of vertices on the i-th level of BFT(v 0) (hence their distance to the root is i), where i = 0,1,2,...,L and L is the length of the longest path. We have S 0 = {v 0} and
. We define a set of subgraphs G i = (V i , E i ) such that
consists of edges connecting vertices of V i , i = 0,1,2,...,L. Note that in BFT(v 0), there is no edge between V i and S j for any j > i+1. The definition is illustrated in Figure 2.
|
For each partition P i of vertices of S i , let
/P i be a subset of
satisfying P i . We define the conditional optimal partition of vertices V i of G i , P opt (P i )
/P i , as
![]() |
The following theorem relates the conditional optimal partition of the whole graph with the conditional optimal partitions of its subgraphs. It enables us to divide the problem into smaller segments, to tackle one by one, without compromising the global optimality of the final solution.
THEOREM
For any subgraph G i of G and any partition P i
i where
i is a set of all possible partitions of S i , the conditional optimal partition of vertices V i of G i can be decomposed into a conditional optimal partition of vertices V i 1 of G i 1 and a particular partition of S i , i=1,2, ...,L,
![]() |
S i 1 connected by edges E i E i 1, and P i 1
P i presents the virtual union of P i 1 and P i (Fig. 2). PROOF
Because the scoring function Equation (3) is additive in terms of the weights of edges, and subgraphs G i 1 and g i do not have any common edges based on their definitions, we always have
![]() |
/P i =
/(P i 1
P i ),P i 1
i 1, we have
![]() |
![]() |
Implementation
The following pseudocode describes the dynamic programming algorithm for solving the optimal partition problem defined in Equation (6).
- Select v 0 from G based on a specific rule (see below);
- Build a BFT(v 0) of G;
- For each partition P 0
0 of S 0 Do Score(P opt (P 0),G 0)
0;
- For i = 1 To L Step 1 Do
- For each tri-partition P i
i of S i Do
- max_score
0;
- For each tri-partition P i 1
i 1 of S i 1 Do
- Ifmax_score>Score < (P opt (P i 1),G i 1 ) + (P i 1
P i ,g i ) Do
- max_score
(P opt (P i 1), G i 1)+ (P i 1
P i ,g i );
- P opt (P i )
P opt (P i 1)
P i ;
- Select the partition P opt (P L ) with the maximal (P opt (P L ),G L ) value as the final optimal partition of G;
- Determine the ion types of the partitioned vertices as follows.
The dynamic programming algorithm classifies all the spectrum peaks into three classes, B, Y and U. These three groups are really two large groups of ions of the same type (i.e. B or Y), plus a smaller set for the remaining ions (i.e. U). In the algorithm, we did not specifically use properties that are directly associated with b ions or y ions. Therefore, B may actually be y ions and Y may be b ions. To further determine whether B set contains b ions or y ions (the same for Y set), we need additional information. We used two properties of tandem mass spectra to decide which group contains the b ion set or the y ion set. (1) The b ion group should include a vertex with a mass of 0 and a vertex with a mass of peptide parent mass18 Da (i.e. [M-H2O]), while the y ion group should include a vertex with a mass of 18 Da (the complementary ion of [M-H2O]) and a vertex with parent mass (Figure 4B and Figure 5B. (2) Statistical analysis of tandem mass spectra has shown that the average intensity of y ions is typically more than twice that of b ions although the numbers of ions are almost the same (Dancik et al., 1999 Tabb et al., 2003). Based on such information, PRIME reports the ion type of each ion as output.
|
|
Computational complexity
Let C(G) be the computational complexity for calculating the optimal partition of a spectrum graph G defined by Equation (6) and define C(G) as the number of function (P,G) calls. The computational complexity of our dynamic programming algorithm can be derived from the lines 4, 5 and 7 of the pseudocode directly,
![]() |
where i is the distance from the root v 0 of the (v 0), L is the length of the longest path of the (v 0) and |S i | is the number of vertices on the i-th level of (v 0). To make C(G) as small as possible, we exhaustively searched through all vertices to find the root v 0 that gave the smallest argument of O in (10) before starting the partition procedure.
Generalized algorithmconsidering chemical variants
Neutral losses from fragment ions are common in tandem mass spectra. A loss of water or ammonia will reduce fragment ion masses by 18 or 17 Da. In addition, protein PTMs, a common and important phenomenon in cell functioning, also change the patterns of mass spectra by shifting a portion of peaks by specific masses. To detect and deal with such variants, we introduced a new type of pseudo amino acid with the specific mass for each suspected chemical variant, and treated the resulting ions the same way as b or y ions. For example, for water loss, we added a pseudo amino acid namely water with a mass of 18.011 Da to the amino acid library. Since it is difficult to estimate the frequency of each possible pseudo amino acid occurring in tandem mass spectra in advance, we used a simplified Equation (1) to calculate the weight of type-1 edge involving these new residue types,
![]() |
By doing so, virtually no change is needed in our partition algorithm for dealing with chemical variants. In addition, by introducing a pseudo amino acid for each suspected PTM, we prevented the exhaustive combinatorial search for all possible mass modifications, in which even a small set of modification types will lead to a large combinatorial problem (Yates et al., 1995). Finally, the generalized algorithm can deal with the special case where the neutral loss ions of a spectrum are so prominent and the base b or y ions are sufficiently small that only the neutral loss ions survive the spectral pre-processing procedure. Since we treated b ions and their variants as the same group and dealt with them independently (similar approach for y ions and their variants), therefore the absence of base b or y ions do not cause major problems.
| RESULTS |
|---|
|
|
|---|
We have implemented our algorithm as a computer program PRIME using C++ programming language. To systematically assess its performance on ions partition, we first tested PRIME on a big number of simulated tandem mass spectra data derived from tryptic digested peptides of proteins in the Yeast genome, at various b, y ions coverage and noise/signal ratio. Then, we tested PRIME on 19 sets of high-quality experimental FT-ICR data.
Tests on 114 851 simulated tandem mass spectra derived from Yeast genome
We conducted a simulation of tandem mass spectrometry experiment in silico on fully tryptic digested peptides of proteins in the Yeast genome (Swiss-Prot version of March 14, 2004). Figure 3C shows the composition of amino acids and the length distribution of the tryptic digested peptides. After filtering with a mass scope of 8004000 Da, 114 852 peptides remained for the further study. These peptides had a median length of 8 residues, with the shortest peptide having 5 residues and the longest peptide having 44 residues (Fig. 3C). Each peptide was then fragmented into b, y ions and their chemical variants (i.e. loss of water or ammonia) at given b, y ions coverage and noise/signal ratio. The coverage of b and y ions was defined as the ratio of the total number of b and y ions observed in the spectrum to the product of peptide length and two. The relative intensities for Y group ions, B group ions and noise were set to 1.0, 0.5 and 0.20 arbitrarily (Dancik et al., 1999 Tabb et al., 2003). The simulated tandem mass spectra were then fed to PRIME and the ion type of each ion was reported as output. As an example, Figure 4 shows the simulated tandem mass spectrum of the peptide QYIDQDFILK digested from protein 2A5D_YEAST and its partition results. The virtual spectrum was constructed with the ion coverage of 0.5 and noise/signal ratio of 0.5. The calculation was finished with 0.020 s on a Dell workstation with Pentium 4 (2.1 GHz). It can be seen that all ions were partitioned correctly.
|
We now present the statistical analysis of our test results on 114 851 simulated tandem mass spectra at various conditions. We define the sensitivity of our prediction as the number of correctly partitioned B, Y group ions divided by those observed in spectrum. We define the specificity as the number of correctly partitioned B, Y group ions divided by the number of all predicted B, Y group ions. The term sensitivity measures the ability of PRIME for ion-type partition while the term specificity measures the reliability of the partition result.
Dealing with noise
Table 1 summarizes the partition results when all b, y ions were covered (i.e. no missing peak) when different levels of noise were mixed up in the simulated spectra. Since only b, y ions were considered, any mass difference <57 Da (Gly) should arise from different type ions (Fig. 1A). Therefore, any value <57 Da of mass threshold can be used for type-2 edge construction, but the larger the better. Here we set
2=50 Da. It can be seen that PRIME achieved very good partition results at various noise/signal ratios (ranging from 0 to 1.0): >97% sensitivity and specificity;
95% of the simulated spectra were partitioned with 100% accuracy and
97% of the spectra were predicted with
90% accuracy. The accuracy was defined as the percentage of the total number of b, y ions that were correctly identified over that observed in the spectrum. The average number of ions in each spectrum varied from 29 to 58 and the percentage of the number of type-2 edges over the total number of edges varied from 27 to 32%. The calculations were rather fast and finished within 25 ms. The results show that noise affects the partition result very slightly if all b and y ions are covered in the spectrum.
|
It is worth noting that during the construction of spectrum graph, if we found that one vertex was connected by the type-2 edges only, we discarded all the type-2 edges that this vertex was incident to. Since we found that the ion type of this vertex cannot be determined in such a case and noise accounted for most of this kind of useless connections. By discarding them, most of the false positive assignments (i.e. noise was partitioned to B or Y group) were fixed and the computational complexity was decreased dramatically (data not shown).
Dealing with missing ions
Peptide fragmentation in a tandem mass spectrum often does not produce all b and y ions. Missing ions may cause some serious problems in de novo sequencing applications. Because the spectrum graph approaches attempt to find an optimal path connecting the N and C termini, the absence of ions may break such a path. Previous works typically dealt with this difficulty by introducing gap-spanning edges corresponding to the missing di- or tripeptides. However as shown in Figure 1 when a mass difference increases to 300 Da where majority of the di/tripeptides locate, the probabilities that such a mass difference arising from the same or different types of ions become undifferentiable (especially when neutral mass losses were considered). That is, extra false type-1 edges could be introduced. Our partition approach was very different from the traditional ones: disjointing ion series could be correlated to each other through type-2 edges, which still made the correct partition possible. Table 2 lists the partition results when missing ions were considered.
|
In our simulated spectra, the noise/signal ratio was set to 0.5. No neutral mass losses were considered. It can be seen that missing ions dramatically decreased the partition accuracy. When 50% of b and y ions were missing, only 25.67% of the simulated spectra were partitioned with 100% accuracy. The sensitivity and specificity dropped to 76.84 and 82.32%, respectively. However, when 70% of b and y ions were present, 76.23% of the spectra were partitioned completely correctly, and both the sensitivity and the specificity achieved a value of
93%. When 80% of ions were present, we got the corresponding values of 90.68 and
97.31%, respectively. The results indicate that the missing ions do not cause a major problem in our algorithm.
Dealing with loss of water
When b-H2O, b-NH3, y-H2O, y-NH3 were considered in our simulation study, the probability that a given mass difference between two ions corresponds to an amino acid decreased (Fig. 1C). This is due to the fact that neutral mass losses increased the chance that different type of ions produce this mass difference. For example, at the given mass difference of 57 Da (Gly), the probability that it arises from the same type of ions decreased from 0.88 to 0.70 when loss of water or ammonia were taken into account. Note that the conditional probability profile derived from the experimental spectra should be between the two theoretically derived profiles. Although neutral mass losses are common, not all of them are visible in the experimental spectra.
The consideration of water loss has largely increased computational complexity. The needed CPU time was 24 s for each spectrum. The reason is that a large number of new type-1 and type-2 edges were constructed when the loss of water was considered, almost 47 times larger than the original ones. The percentage of type-2 edges was also increased to
35%, which could represent another important reason as to why considering loss of water is helpful in the case of missing peaks.
The power of type-2 edges
One unique feature of our ion-partition approach is the use of type-2 edges. We show here that even very few type-2 edges made the partition successful. Since the number of type-2 edges was solely dependent on the value of
2, we studied the performance of PRIME at various values of
2. The results are presented in Table 4. The simulated tandem mass spectra were constructed in the condition of 70% ion coverage, 30% noise/signal ratio and with the consideration of loss of water. It can be seen that at the extreme case where
2=0 (i.e. without the use of type-2 edges), our partition algorithm failed: both the sensitivity and specificity were
55% and no spectrum was partitioned completely correct. The reason for this is quite simple: if b ion series was coincidently connected to y ion series by a single false type-1 edge, all the b, y ions would be classified into the same group, since no repelling force (i.e. type-2 edges) can be used to partition them into the correct groups. However, when the information of type-2 edges was used, even with a small value of
2, for example, 5 Da, our partition algorithm worked well: 68.8% of simulated spectra were partitioned with 100% accuracy and both the sensitivity and specificity were
90%. The percentage of the number of type-2 edges over the total number of edges was 7.6%, indicating the power of type-2 edges. The reason is that although type-2 edges represent local information, these local information could be propagated along the peptide chains formed by type-1 edges. Furthermore, this special feature of type-2 edges implicitly prevents an ion and its complement from being assigned to the same group. Therefore, the inclusion of a few sparse type-2 edges makes the correct partition possible.
|
The information-rich zone shown in Figure 1A was also narrowed and shifted to the lower-mass end. Thus, a smaller value of
2 has to be chosen for type-2 edges construction. In addition, since the mass differences at 1 Da (H2ONH3), 17 Da (NH3 loss) and 18 Da (H2O loss) give high probabilities of being from the same group ions, and the values of 39 and 40 Da give
41% probability of being from Gly-H2O and Gly-NH3, in the following calculations, we set
2 to be 35 Da. Type-2 edges were constructed between ions with only a mass difference
35 Da and not equal to 1, 17 or 18 Da. The new probability profile was used to calculate the weights of type-1 edges. Using these parameters, we tested PRIME on the new simulated tandem mass spectra in which losses of water or ammonia were considered. The results are summarized in Table 3. To reduce the computational complexity, only b, y and b/y-18 ions were generated in the simulated spectra. It can be seen that even in the case of 50% coverage of b and y ions and 50% noise/signal ratio, the partition results were quite good: 70% of simulated spectra were partitioned with 100% accuracy; both the sensitivity and specificity were
96%. Compared to the results obtained with the same conditions but when only b and y ions were considered (Table 2), the improvement was significant. The reason is that the gaps in the residue chains resulting from missing ions could be patched by the new ion series formed by the neutral mass losses. This feature is clearly shown in Figure 4B.
|
As shown in Table 4 as
2 increases, the partition accuracy increases and then decreases after reaching a maximum at
2=35 Da. As stated earlier, since the mass difference at 39 Da has 0.40 probability of being from the same type ions (Gly-H2O), choosing a larger value than this to be
2 may introduce false type-2 edges, which consequently diminish their impact.
Tests on 19 sets of high-quality experimental FT-ICR data
Collection of FT-ICR tandem mass spectra
We collected 19 sets of high-quality experimental FT-ICR data from various peptide sources: from various peptide sources: synthetic peptides as well as tryptic peptides obtained from digestion of horse myoglobin, bovine serum albumin and lysozyme. All mass spectra were acquired with an IonSpec (Lake Forest, CA) 9.4-Tesla HiRes electrospray ES-FTICR-MS. Ions were generated with a low-flow rate dynamic electrospray source (flow rate of
400500 nl/min), accumulated in an external hexapole, transferred into the high-vacuum region with a quadrupole lens system, and then detected in the cylindrical analyzer cell of the mass spectrometer. Because ion detection was achieved in an ultra-low vacuum regime (
2 x 1010 Torr), broadband mass resolutions of about 200 000 (full width of peak at half maximum) at m/z 1000 were possible. Calibration was accomplished by using standard proteins (such as ubiquitin and myoglobin), and provided mass accuracies within a few millimass units. Low-energy fragmentation via ion collisional dissociation was accomplished with an FT-ICR-MS instrument. This method of collisional fragmentation was conducted by isolating an ion of interest (either a peptide or a protein) within the analyzer cell of the mass spectrometer, and then accelerating the ion into a nitrogen target gas under sustained off-resonance irradiation collision-activated dissociation (SORI-CAD) (Gauthier et al., 1991). Energetic collisions between the accelerated ion and the target gas generated ionic fragment ions and could be measured at high resolution.
Data preprocessing
For each precursor mass, the raw profile data of its corresponding mass spectrum was loaded using the manufacturers software IonSpec99. The molecular mass/intensity list was then tabulated using the View function and saved into a text file (noise was filtered with the default threshold). The resulting file was fed to an in-house software for further isotopic peak reduction. Since, we found that a few isotopic peaks were not identified and consequently their charges were not determined correctly by IonSpec99. This problem was fixed by re-analyzing isotopic peaks carefully, i.e. checking the consecutive peaks with uniform intervals (1.000 Da for +1, 0.500 Da for +2 and 0.333 Da for +3 and so on, with the mass tolerance of 0.005 Da). Newly identified isotopic peaks were then removed but contributed their intensities to their monoisotopic peak. This pre-processing resulted in a number of ions in each spectrum which varied from 22 to 50, covering 2590% of hypothetical b and y ions. PRIME was then used to determine the ion type of each individual peak. The thresholds
1=0.05 Da and
2=15 Da were used to build type-1 edges and type-2 edges, respectively. In most cases, the calculations were done within 1 s on a Dell workstation with Pentium 4 (2.1 GHz). The prediction results are summarized in Table 5.
|
Overall performance
As shown in Table 5 for each of the 19 test spectra, our partition program identified most of the b and y ions and their chemical variants (i.e. loss of water or ammonia) correctly. Among the 19 datasets, 7 were identified with 100% accuracy. The partition accuracy for each individual spectrum ranged from 57 to 100%, with an average accuracy of 88%. In the sequence column of Table 5 the b and y ions that were correctly differentiated were labeled with underlines (b ions) and overlines (y ions), respectively. We observed that the experimental tandem mass spectra with good coverage of b and y ions always had high identification accuracy (e.g. the peptide with 100% accuracy), while those with poor fragmentation resulted in relatively lower partition accuracy. We also observed that most induced sequence covered only a portion of target peptide sequence. However, as reported by Taylor and Johnson, 1997 those accurately determined subsequences with enough length might be appropriate to perform homology-based sequence database search through homology search tools like PSI-BLAST Altschul et al., 1997.
Figure 5 as an example, shows the FT-ICR tandem mass spectrum of peptide DAFLGSFLYEYSR digested from protein BSA (test no. 10) and its partition result. Several interesting results can be observed from this figure. (1) All the b, y ions and their variants (loss of water here, denoted by X) were identified correctly. (2) The full-length peptide sequence except the first two residues can be readily derived from either b ion series or y ion series (no distinction between the amino acids L and I). The first dipeptide was assigned to tryptophan, since neither b1 nor y12 was observed in the spectrum and coincidently the mass of Trp is equal to the sum of Asp and Ala. It was also clearly shown that the adding back of complementary ions (denoted by open symbols) greatly improved the completeness of spectra for de novo sequencing. (3) There exists a second continuous mass ladder (572.27, 719.34, 832.43, 995.50, 1124.55, 1287.63, 1374.67 and 1530.74) that apparently corresponds to b-H2O ion series starting from position 6 of the peptide. This evidence strongly indicates that Ser-6 lost water during fragmentation. This kind of secondary mass ladder information should be highly useful in future applications for detecting the types and sites of protein PTMs, since X could be any other specified neutral mass losses of PTMs, for example, 98 Da for loss of H3PO4 from phosphopeptides. (4) Three false type-1 edges (labeled with thick teal lines) were formed between b and y ions. Our algorithm recognized all of them correctly based on the global optimization.
| DISCUSSION AND CONCLUSION |
|---|
|
|
|---|
Methods for accurate identification of ion types provided the basis for many mass spectrometry data interpretation problems, including (1) de novo sequencing, (2) identification of PTMs and (3) validation of database search results. Compared to previous de novo sequencing methods (Bartels, 1990 Taylor and Johnson, 1997 Taylor and Johnson, 2001 Dancik et al., 1999 Pevzner et al., 2000 Chen et al., 2001 Lu and Chen, 2003 Ma et al., 2003) the uniqueness of our approach is that we treat the problem of ion-type identification separately from the problem of de novo sequencing. By decoupling the two problems, we arrived at a conceptually clearer framework for solving the two problems separately rather than having them tangled together.
In addition, among these published papers on de novo sequencing algorithms and applications (Bartels, 1990 Taylor and Johnson, 1997 Taylor and Johnson, 2001 Dancik et al., 1999 Pevzner et al., 2000 Chen et al., 2001 Lu and Chen, 2003 Ma et al., 2003) only the information similar to what we call type-1 edges was utilized. As shown in Figure 1 false type-1 edges could arise from different types of ions by accident. To recognize this kind of false connections and to partition ions into correct classes, we introduced a new type of connection, the type-2 edges, which connect two peaks of possibly different ion types; and include the probability that a given mass difference between two ions corresponds to an amino acid into the object function. Since we use a very conservative definition in the construction of type-2 edges, the probability of having a false type-2 edge is intrinsically much lower than that of having a false type-1 edge. These facts imply that our algorithm does utilize more informational context of a given MS/MS experimental data and might describe the real spectrum more completely. The systematic tests of our approach on 114 851 simulated tandem mass spectra derived from Yeast genome and on the 19 sets of experimental FT-ICR data showed that the type-2 edges were powerful. The inclusion of a few sparse type-2 edges can make the partition correctly.
Protein PTMs generated tremendous diversity, complexity and heterogeneity of gene products, and are involved in many important cell functioning and regulation processes (Gooley and Packer, 1997. Till date, there are over 340 entries reported in Delta Mass (http://www.abrf.org/index.cfm/dm.home?AvgMass=all), a database of protein PTMs. Therefore, the verification of PTMs raised a major challenging problem after the human genome was completely sequenced (MacCoss et al., 2002 Mann and Jensen, 2003). The capacity for detecting the types and sites of PTMs efficiently should be a key feature of a good de novo sequencing algorithm. We have shown that our algorithm can be easily extended to consider chemical variants, i.e. loss of water or ammonia, or PTMs, by introducing a new type of pseudo amino acid with the specified mass for each suspected chemical variants, without increasing the computational complexity.
Several de novo sequencing algorithms and software packages, such as Lutefisk (Taylor and Johnson, 1997) and PEAKS (Ma et al., 2003) were reported to perform de novo sequencing successfully and the latter also employed a dynamic programming algorithm. However as we had discussed above, the basis of these approaches are quite different from ours: no special attempt was made to distinguish ion types and solely the information of type-1 edges was used to construct the spectrum graph representation. Since our effort in this paper has been to identify the ion type of each individual peak (a first stage of de novo sequencing), no comparison was made to evaluate the performance of our algorithm with others. However, the tests on the simulated and experimental tandem mass spectra showed that PRIME achieved an accuracy of
90% for the separation of b and y ions. Once the ions are partitioned into correct classes, the de novo sequencing should be easily derivable from one single ion series. Work on extending this algorithm to deal with the de novo sequencing problems by using more robust data and exploring its potential applications in detecting PTMs is on-going and will be presented in the future publications.
In summary, we have developed a novel approach for the separation of b and y ions in tandem mass spectra by introducing a new type of connection, the type-2 edges and implemented it using a dynamic programming algorithm as a program PRIME. The tests on 114 851 simulated tandem mass spectra and on 19 sets of experimental FT-ICR data show that PRIME is powerful with a high partition accuracy and a high computational efficiency. We expect that PRIME will prove to be highly useful for de novo sequencing and the identification of protein PTMs at the post-genomic era.
| Acknowledgments |
|---|
We want to thank the anonymous reviewers whose comments have helped to improve the content and the presentation of this paper. This research was supported in part by National Science Foundation (\#NSF/DBI-0354771, \#NSF/ITR-IIS-0407204), and by the US Department of Energys Genomes to Life program (http://doegenomestolife.org/) under project, Carbon Sequestration in Synechococcus sp.: From Molecular Machines to Hierarchical Modeling (www.genomes2life.org).
| Footnotes |
|---|
The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.
Received on March 19, 2004; revised on July 26, 2004; accepted on September 7, 2004
| REFERENCES |
|---|
|
|
|---|
Aebersold, R. and Mann, M. (2003) Mass spectrometry-based proteomics. Nature, 422, 198207[CrossRef][Medline].
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 33893402
Andersen, J.S., Wilkinson, C.J., Mayor, T., Mortensen, P., Nigg, E.A., Mann, M. (2003) Proteomic characterization of the human centrosome by protein correlation profiling. Nature, 426, 570574[CrossRef][Medline].
Bartels, C. (1990) Fast algorithm for peptide sequencing by mass spectroscopy. Biomed. Environ. Mass Spectrom., 19, 363368[CrossRef].
Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M.C., Estreicher, A., Gasteiger, E., Martin, M.J., Michoud, K., ODonovan, C., Phan, I., Pilbout, S., Schneider, M. (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res., 31, 365370
Chen, T., Kao, M.Y., Tepel, M., Rush, J., Church, G.M. (2001) A dynamic programming approach to de novo peptide sequencing via tandem mass spectrometry. J. Comput. Biol., 8, 325337[CrossRef][ISI][Medline].
Clauser, K.R., Baker, P., Burlingame, A.L. (1999) Role of accurate mass measurement (+/10 ppm) in protein identification strategies employing MS or MS/MS and database searching. Anal. Chem., 71, 28712882[Medline].
Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C. Introduction to Algorithms, (2001) , Cambridge, MA The MIT Press.
Dancik, V., Addona, T.A., Clauser, K.R., Vath, J.E., Pevzner, P.A. (1999) De novo peptide sequencing via tandem mass spectrometry. J. Comput. Biol., 6, , pp. 327342[CrossRef][ISI][Medline].
Eng, J.K., McCormack, A.L., Yates, J.R., III. (1994) An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom., 5, 976989[CrossRef][ISI].
Fenyo, D. (2000) Identifying the proteome: software tools. Curr. Opin. Biotechnol., 11, 391395[CrossRef][ISI][Medline].
Fenyo, D., Qin, J., Chait, B.T. (1998) Protein identification using mass spectrometric information. Electrophoresis, 19, 9981005[CrossRef][ISI][Medline].
Gauthier, J.W., Trautman, T.R., Jacobson, D.B. (1991) Sustained off-resonance irradiation for collision-activated dissociation involving Fourier transform mass spectrometry. Collision-activated dissociation technique that emulates infrared multiphoton dissociation. Anal. Chim. Acta, 246, 211225[CrossRef].
Gooley, A.A. and Packer, N.H. (1997) The importance of co- and post-translational modifications in proteome projects. In Wilkins, W.R., Williams, K.L., Appel, R.D., Hochstrasser, D.F. (Eds.). Proteome Research: New Frontiers in Functional Genomics, , NY Springer-Verlag, pp. 6591.
Ho, Y., Gruhler, A., Heilbut, A., Bader, G.D., Moore, L., Adams, S.L., Millar, A., Taylor, P., Bennett, K., Boutilier, K., et al. (2002) Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature, 415, 180183[CrossRef][Medline].
Lasonder, E., Ishihama, Y., Andersen, J.S., Vermunt, A.M., Pain, A., Sauerwein, R.W., Eling, W.M., Hall, N., Waters, A.P., Stunnenberg, H.G., Mann, M. (2002) Analysis of the Plasmodium falciparum proteome by high-accuracy mass spectrometry. Nature, 419, 537542[CrossRef][Medline].
Lu, B. and Chen, T. (2003) A suboptimal algorithm for de novo peptide sequencing via tandem mass spectrometry. J. Comput. Biol., 10,















