Skip Navigation


Bioinformatics Advance Access originally published online on March 3, 2005
Bioinformatics 2005 21(10):2177-2184; doi:10.1093/bioinformatics/bti362
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/10/2177    most recent
bti362v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (7)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Falkner, J.
Right arrow Articles by Andrews, P.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Falkner, J.
Right arrow Articles by Andrews, P.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

Fast tandem mass spectra-based protein identification regardless of the number of spectra or potential modifications examined

Jayson Falkner * and Philip Andrews

Department of Biological Chemistry and Program in Bioinformatics, University of Michigan 1301 Catherine Street, Ann Arbor, MI 48109, USA

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 1 INTRODUCTION
 2 ALGORITHM
 3 IMPLEMENTATION
 4 SUMMARY AND FUTURE...
 REFERENCES
 

Motivation: Comparing tandem mass spectra (MSMS) against a known dataset of protein sequences is a common method for identifying unknown proteins; however, the processing of MSMS by current software often limits certain applications, including comprehensive coverage of post-translational modifications, non-specific searches and real-time searches to allow result-dependent instrument control. This problem deserves attention as new mass spectrometers provide the ability for higher throughput and as known protein datasets rapidly grow in size. New software algorithms need to be devised in order to address the performance issues of conventional MSMS protein dataset-based protein identification.

Methods: This paper describes a novel algorithm based on converting a collection of monoisotopic, centroided spectra to a new data structure, named ‘peptide finite state machine’ (PFSM), which may be used to rapidly search a known dataset of protein sequences, regardless of the number of spectra searched or the number of potential modifications examined. The algorithm is verified using a set of commercially available tryptic digest protein standards analyzed using an ABI 4700 MALDI TOFTOF mass spectrometer, and a free, open source PFSM implementation. It is illustrated that a PFSM can accurately search large collections of spectra against large datasets of protein sequences (e.g. NCBI nr) using a regular desktop PC; however, this paper only details the method for identifying peptide and subsequently protein candidates from a dataset of known protein sequences. The concept of using a PFSM as a peptide pre-screening technique for MSMS-based search engines is validated by using PFSM with Mascot and XTandem.

Availability: Complete source code, documentation and examples for the reference PFSM implementation are freely available at the Proteome Commons, http://www.proteomecommons.org and source code may be used both commercially and non-commercially as long as the original authors are credited for their work.

Contact: jfalkner{at}umich.edu


    1 INTRODUCTION
 TOP
 Abstract
 1 INTRODUCTION
 2 ALGORITHM
 3 IMPLEMENTATION
 4 SUMMARY AND FUTURE...
 REFERENCES
 
Tandem mass spectra-based protein identification using unique combinations of derived peptide (usually tryptic) sequences, each representing a fragment of a parent protein, is a common practice and it is sometimes referred to as bottom–up proteomics. The process relies on software packages built around tandem mass spectra (MSMS) database search and de novo peptide generation algorithms to identify both potential candidate peptides for a particular spectrum and, ultimately, a source protein that can account for the observed peptides. Several software packages currently exist (Clauser et al., 1999; Craig and Beavis, 2004; Eng et al., 1994; Mann and Wilm, 1994; Perkins et al., 1999; Tabb et al., 2003), but the software's application is often limiting on account of the time it takes to perform a search on a large number of spectra or against a large dataset of known proteins or owing to the inability to accommodate multiple potential residue modifications. This paper describes a remedy to these issues by means of a novel data structure, a ‘peptide finite state machine’ (PFSM), and an algorithm for generation of the proposed data structure. A free, open source PFSM implementation is available, as described at the beginning of the paper, and it is used in all examples.

The main focus of this paper is to illustrate a method of quickly identifying candidate peptide sequences that match MSMS. The data structure described in this paper provides a method of representing a monoisotopic, centroided spectrum (i.e. ‘peak list’) as a graph, where nodes represent peaks and arcs represent valid amino acid residues (with or without modification) that account for the distance between peaks. Initial construction of the data structure is near identical to that of a spectrum graph (Bartels, 1990); also see the SHERENGA paper (Dancik et al., 1999) for an excellent introduction to MSMS and spectrum graphs; however, the finished data structure requires a conversion of the spectrum graph to a non-deterministic finite automata (NDFA) and subsequently, a deterministic finite automata (DFA) (Sipser, 1996). The process of creating a DFA that represents a collection of spectra is both novel and incredibly practical because it allows one to screen amino acid sequences at a rate that is at worst as slow as looking at one state in the DFA for each amino acid in the sequence within the protein sequence. If one attempts the same process with a data structure similar to a spectrum graph, i.e. equivalent to an NDFA, the time it takes to screen an amino acid sequence can be at worst the amount of time it takes to examine every state in the entire graph and at best the same amount of time a DFA takes to examine the sequence. A simple, helpful analogy to individuals experienced in computer science is that a PFSM provides a way to convert spectra to a single regular expression.

The entire thrust of this paper is succinctly summarized in Section 2.4 and by reference to Figure 5B. A PFSM creates a DFA similar to Figure 5B, and the DFA allows one to quickly and accurately determine if a peptide sequence might account for a spectrum as illustrated in Section 2.4.



View larger version (14K):
[in this window]
[in a new window]
 
Fig. 5 (A) The NDFA from Figure 2 that is converted to a DFA (B) An NDFA may have multiple transitions for a single symbol (residue abbreviation) from a single state, note that 2 has two transitions for C. A DFA must have only one transition for each symbol. B is the DFA equivalent of A. By modifying the graph in A it is possible to create a proper DFA with only one transition per symbol. This is the key to the performance benefit of PFSMs. If left as an NDFA, checking a residue sequence might require searching through multiple transitions per residue, in the worst case searching the entire graph; however, if searching with a DFA, there will always be exactly one transition per residue regardless of the number of peak lists multiplexed or potential modifications used to make the graph.

 
The rest of this paper is divided into two main sections. The first section details the precise algorithm for the generation of a PFSM from an arbitrary set of spectra. The second section of the paper demonstrates the feasibility of PFSMs as a tool to perform fast, sensitive protein identification using MSMS from a set of known proteins. All examples of PFSMs are taken from the PFSM project which may be freely obtained with full source code from the Proteome Commons, http://www.proteomecommons.org


    2 ALGORITHM
 TOP
 Abstract
 1 INTRODUCTION
 2 ALGORITHM
 3 IMPLEMENTATION
 4 SUMMARY AND FUTURE...
 REFERENCES
 
The PFSM data structure is an NDFA and an DFA conversion of the NDFA, both of which represent any number of spectra and peptide candidates that may account for each spectrum. Both NDFA and DFA are common computer science concepts associated with the theory of computability or the Church–Turing thesis (Turing, 1936; Church, 1936), and the following section recapitulates the fundamentals.

An NDFA is a computer science concept associated with the expressive power of a language and it represents a finite number of states, each state linked to any of the other states by any number of transitions. Figure 5A illustrates the concept as a graph with a fixed number of nodes and arcs between the nodes, each arc labeled with a single letter. The letters represent valid symbols in the alphabet that makes the language, in this work the complete alphabet is the set of single letter amino acid residue names. All NDFA have one start state, the solid black dot, and at least one finish state, the white–black bullseye. Arrows on the arcs represent the direction that transitions proceed. If there is no arrow for a letter, it is assumed that the transition goes to a state that will never reach a finish state. To use an NDFA, one should start with a sequence of symbols, e.g. a sequence of amino acid residues, and transition from state to state by sequentially using symbols from the sequence. At the end of the sequence, if one is on a finish state, the sequence is valid for the NDFA, if not, the sequence is invalid. In cases where a node has two or more transitions for a particular symbol, all the transitions should be followed simultaneously.

A DFA is as powerful as an NDFA in expressive power and is near identical in structure and use but differs by allowing one and only one transition for every symbol in the alphabet for a given state. This difference makes a DFA more time efficient when implemented by a computer program because the program will never need to follow more than one transition per symbol in a sequence. More formally, this time efficiency is noted as O(n) (Sipser, 1996) where n is the length of the sequence being analyzed. A well-known algorithm exists for converting an NDFA to a DFA (Sipser, 1996), and a succinct example of tracing amino acid sequences through a DFA is provided in Section 2.4.

A PFSM is a convenient name for describing a formal conversion from a set of mass spectra to a DFA, and one of the primary points of this paper is to illustrate how one may convert a spectrum to an NDFA. As explained later, a spectrum cannot be directly converted to a DFA if one accounts for mass accuracy errors of a mass spectrometer and if potential modifications of the standard amino acid residues are allowed. But an NDFA may be directly created from a spectrum. By creating the NDFA a DFA can be made, and with a DFA one can search through sets of protein sequences, such as any FASTA file, very quickly. In addition, as also detailed later, two or more NDFA may be combined into a single NDFA and potential modifications may be accounted for when constructing an NDFA from a spectrum—this allows for a PFSM to search a set of amino acid sequences in the same amount of time regardless of the number of spectra or the number of potential modifications being examined and with equivalent sensitivity as for searching each spectrum individually.

2.1 Conversion of a single tandem mass spectrometry peak list to an NDFA
Any tandem mass spectrometry peak list may be converted to an NDFA by assuming an arbitrary mass error tolerance for the accuracy of the peaks and by assuming an arbitrary tolerance for peaks not represented in the peak list but that should be present in a complete ion series. An MSMS contains the peptide fragment ions observed at a given mass over charge ratio. It has been repeatedly demonstrated and well-reviewed that common peptide fragment patterns are exhibited by peptides fragmented through tandem mass spectrometry (Biemann, 1988; Papayannopoulos, 1995; Roepstorff and Fohlman, 1984). These peptide fragmentation patterns are commonly used to relate unknown peaks from a peak list to a peptide ion fragment in a mass spectrum. A progression of such identified peaks due to one peptide fragmentation pathway but separated by the atomic mass of an amino acid residue (or a number of residues) is referred to as an ion series for the observed peptide fragmentation pathway. A complete ion series is a set of peaks that span across an entire spectrum with a separation of a single residue between each peak. Given a complete ion series, identification of the source peptide is as simple as examining the ion series from start to finish and associating each peak to a sequential residue in the peptide by correlating the mass difference between peaks with the calculated residue masses.

Conversion of a peak list to an NDFA follows the assumption that ion series will be present in any given spectrum, and a complete ion series accurately represents a peptide that could fragment to produce the observed MSMS, i.e. conversion is the same technique utilized to construct a spectrum graph (Bartels, 1990; Dancik et al., 1999). Construction of a PFSM is started by the potential ion series in an observed peak list. For clarity, ion series will be represented in this paper by drawing the ion series as a b-ion series, where peptide sequence may be obtained by reading off-peak distances from left to right, assuming the left represents the N-terminus and the right represents the C-terminus. Figure 1A illustrates a b-ion series, and Figure 1B illustrates how a y-ion series is converted to a b-ion series. In general, any ion series may be converted to a b-ion series by calculating the difference between the ion types while allowing for the mass accuracy of the mass spectrometer [e.g. bIonMZ = parentIonMZ – yIonMZ + 1; see the Lutefisk97 paper's introduction (Taylor and Johnson, 1997) for an excellent example of the general technique]. Conversion of a peak list to an NDFA is accomplished by tracing every valid ion series and converting it to an NDFA with states representing peaks and arcs representing distances between the peaks that correspond to known residue mass over charge ratios. Figure 2A illustrates a peak list of b-ions and y-ions that have been converted to an NDFA, which is illustrated in Figure 2B. Note that the conversion must be to an NDFA and not directly to a DFA because different ion series may result in multiple different but valid peptide sequences based on mass accuracy error of a mass spectrometer or when considering modifications of a standard residue, e.g. carboxyamidomethylation of a cysteinyl residue (an increase of 57 Da) corresponds to the m/z of a cysteinyl residue plus a glycyl residue (57 Da), which may result in an NDFA with two valid transitions for the C symbol (cysteinyl residue) as illustrated in Figure 2B.



View larger version (14K):
[in this window]
[in a new window]
 
Fig. 1 (A) A peak list assumed to be a b-ion series where each m/z gap between ions in the series is matched to an amino acid residue. (B) The same peak list assumed to be a y-ion series and drawn as the conversion of a y-ion series to a b-ion series (i.e. parentIonMZ – yIonMZ + 1). (C) The same peak list but assuming that the ion series may be either a y-ion or a b-ion series, i.e. overlaying both possibilities.

 


View larger version (12K):
[in this window]
[in a new window]
 
Fig. 2 (A) A peak list with an assumed b-ion series and annotated by one letter amino acid residues that match m/z differences between sequential ions. Assume that the C above the dotted line is a cysteinyl residue that has been carboxyamidomethylated and corresponds to the same m/z of C + G, thus bridging the ion series directly from T to K. (B) The NDFA conversion of the annotated peak list. Note that there are two routes from T to K, either CG or C. The CG route represents following the unmodified C with a subsequent G. The C-only route represents a cysteinyl residue that has been carboxyamidomethylated to have m/z of the unmodified CG pair. This is why a peak list cannot be converted directly to a DFA—there may be two or more valid m/z differences between ions that use the same residue, either due to a potential modification or mass accuracy errors.

 
MSMS-derived peak lists generated by analysis software do not normally come with the ion types identified, and multiple, incomplete ion series are almost always present in the observed spectrum. Both of these complexities may be accounted for when constructing a PFSM.

Incomplete ion series may be accounted for by allowing peaks separated by an arbitrary but identifiable m/z ratio to be considered as a part of an ion series. When constructing the NDFA from a peak list, missing residues in an ion series can be accounted for by using the technique in Figure 3, i.e. identify an m/z that corresponds to a known combination of amino acid residues and insert all possibilities of the combination as transitions in the NDFA. This practice will introduce several combinations of possible amino acid sequences with possibly only one being correct. It is shown in the following sections, and assumed in general, that the inaccurate sequences can be distinguished from the correct sequence by using a scoring function such as Mascot or XTandem.



View larger version (18K):
[in this window]
[in a new window]
 
Fig. 3 (A) A peak list similar to Figure 2, but missing the second ion in the ion series of Figure 2. The m/z difference is annotated with [C,G] because it is assumed that the m/z of adding the two residues gives the appropriate m/z to bridge the ion series. (B) The graph conversion of the given peak list allowing arcs to have multiple residues, e.g. [C,G] instead of just one residue as in Figure 2. (C) The NDFA conversion of the graph in part B, illustrating how to convert multiple residue to a single residue NDFA transitions—i.e. how to allow for ions in a series that are not represented in the peak list. The solution is to create all combinations of the residues in the NDFA.

 
Unidentifiable fragment ion types may be accounted for by assuming that all peaks are of the ion type of interest. To examine an ion series of particular interest (e.g. b, y, a, c, x or z), one should use the approach illustrated in Figure 1 where the observed peaks are assumed to be of a particular ion type and all the peaks are converted to a b-ion series. While examining multiple ion series, say y- and b-ions, the ion series as illustrated in Figure 1C should be overlaid. This general technique may also be applied for all the different ion types commonly observed through CID (Papayannopoulos, 1995); however, this paper focuses only on the most easily identifiable ion series, b-series and y-ion series. Use of the technique in Figure 1 may introduce inaccurate states in the subsequently generated NDFA, but the practice is required if ions are of an unknown type. It is also shown in the following sections, and assumed in general, that inaccurate peptide sequences introduced from mislabeled ion series can be distinguished from the correct peptide sequence through the use of a scoring function such as Mascot or XTandem.

It is suggested that sequences identified by a PFSM only be considered as candidate sequences that may describe the observed peak list. Conversion of an MSMS to an NDFA is viable, and an ideal MSMS may be converted to an NDFA that will only identify correct sequences. However, MSMS s often come with ambiguities that make the conversion to an NDFA possible, using the techniques previously described, but at the expense of introducing erroneous sequences to the resulting NDFA. Any sequence identified by a PFSM should be further examined either by hand or with a suitable high-throughput scoring function. Two of such scoring functions are examined in the following subsections of this paper.

2.2 Combining multiple peak lists into one NDFA
In many practical situations, multiple MSMS peaks probably represent one protein (e.g. a single 2D gel plug spotted on a well of a MALDI plate) or a large collection of related peptides (e.g. a MuDPIT experiment). In these situations the peak lists generated from these spectra are often searched together in an attempt to identify the source protein(s). A PFSM may represent more than one peak list and any number of related or unrelated peak lists may be multiplexed by combining all the peak lists of interest into a single NDFA for the PFSM.

By definition, an NDFA state may transition to multiple states using a single symbol. This property allows a collection of NDFA to be combined into a single NDFA by simply combining the start states. This concept is illustrated in Figure 4, and it manifests itself in practical use by allowing a PFSM to search for any number of peptides at the same time, without requiring multiple searches of the same amino acid sequence and with the guarantee that the same sequences will be identified as in a set of single spectrum searches. An entire set of spectra may be analyzed simultaneously by a PFSM with just a single pass, through a set of protein sequences. This feature is inherent in the conversion of an NDFA to a DFA, and it is a technique that is commonly used by regular expression matching algorithms.



View larger version (22K):
[in this window]
[in a new window]
 
Fig. 4 (A) Two NDFA that are assumed to have been created from different peak lists. (B) A single NDFA that represents both the original NDFA, e.g. multiplex of the peak lists. Note the multiplex represents the union sequences that each NDFA accepts. Sensitivity of a PFSM will not change due to multiplexing NDFA, and it is not equivalent to combining multiple peak lists before generating an NDFA.

 
2.3 Conversion of an NDFA to a DFA
A PFSM requires both an NDFA, created by converting any number of peak lists, and a DFA that represents the NDFA. A well-established computer science algorithm exists for conversion of an NDFA to a DFA (Sipser, 1996). The algorithm works by taking the set of states that each transition in the NDFA can reach and treating it as a single state in a DFA. The concept is illustrated in Figure 5. The NDFA in Figure 5A has two transitions from state 2, given the symbol C; therefore it cannot be considered a DFA. In order to create a DFA from Figure 5A, one starts by making a DFA state that represents the set of NDFA states that may be transitions from the NDFA start state to state 1 as shown in Figure 5B. For each subsequent DFA state created, a single DFA transition is made from an existing DFA state to another DFA state that represents the collection of NDFA states which may be transitioned to using a symbol from the alphabet, e.g. state 2 is the transition from state 1, given the symbol T and states 3 and 4 are the transitions from state 2 given the symbol C. The process is repeated for every symbol in the alphabet of NDFA, and new DFA states are created only if the collection of NDFA states does not already exist. Once the entire NDFA has been converted, all DFA states that contain a finish state in the aggregate NDFA states are considered valid DFA finish states.

2.4 Peptide and protein identification through use of a PFSM
The algorithm presented in this work is not intended to be a stand-alone system for identifying the best protein match for a particular MSMS spectrum; however, PFSMs are intended for use as a rapid method for identifying candidate peptide sequences that may account for a given MSMS spectrum. A list of such identified peptide sequences may then be sorted and ranked by an arbitrary scoring function in an attempt to correctly identify which sequence or sequences account for the observed MSMS spectrum. In addition, a partial peptide sequence may be correlated with a full protein sequence in order to identify an unknown protein. This paper does not propose a formal scoring scheme for use with PFSMs, but the concept of using a PFSM in conjunction with two well-known scoring schemes is validated in order to demonstrate that PFSMs are suitable as a foundation for building high-performance MSMS search engines.

A PFSM can rapidly identify valid sequences in a known dataset of proteins by using the amino acid sequence(s) from the dataset as input to the PFSM's DFA. If at the end of a sequence the PFSM's DFA is at the finish state, the sequence is a valid peptide that may have represented one of the observed peak lists. The peak list(s) itself may be identified by checking the collection of NDFA states that represents the DFA finish state. In cases where a DFA state was created from a set of NDFA finish states, the amino acid sequence represents a peptide that may account for any of those peak lists.

As a simple example, let as assume that the PFSM's DFA shown in Figure 5B is being used on the tryptic fragments of the sequence ‘TCGKSGRTCKCAR’. The tryptic fragments of the sequence, assuming no missed cleavages, would be ‘TCGK’, ‘SGR’, ‘TCK’ and ‘CAR’. Using the sequence ‘TCGK’ or ‘TCK’ as input to the PFSM's DFA would result in the DFA reaching the finish state, meaning that these amino acid sequences could represent two of the peak lists or the same peak list used to make the PFSM; however, the sequences ‘SGR’ and ‘CAR’ would not end in the finish state of the PFSM's DFA and therefore, are not valid peptide sequences according to the mass accuracy error and amino acids and amino acid modifications used to construct the PFSM.

Given a PFSM, one may quickly examine a known set of peptides (e.g. a theoretical digest of proteins from a known database) and determine if the peptides could or could not account for an observed spectrum. A total of 20 commercially available tryptic digest standards (Michrom Bioresources Inc., Auburn, CA) were analyzed in order to verify whether PFSMs could correctly identify peptide sequences. Aliquots of 50 fMol of each protein was individually spotted on a MALDI plate with an alpha-cyano-4-hydroxycinnamic acid matrix. Data were acquired using an ABI 4700 MALDI TOFTOF by obtaining MSMS of the 10 most intense MS peaks and Mascot Distiller (Matrix Science, Boston, MA) was used to generate the appropriate peak list for each MSMS. A PFSM representing all the MSMS data was generated, assuming a mass error tolerance of 200 ppm and a potential gap of at most 3 sequential amino acids in an ion series and the protein sequences for each of the 20 proteins were analyzed. For each protein, multiple peptide sequences were identified that correlated well with both the observed MSMS.

Table 1 summarizes the results of the PFSM analysis of the known protein sequences. Three different numbers are shown for the PFSM that represents a PFSM constructed to bridge gaps in an observed ion series up to 1, 2 and 3 residues. The numbers for Mascot and XTandem represent the number of peptides they correctly identified. Comparison with the existing search engines should not be taken as a literal measurement of sensitivity. The set of protein sequences searched is constructed to contain only the known sequences. The comparison is intended to illustrate that a PFSM can identify the same candidate peptide sequences (not just one or two pristine MSMS fragmentations per protein), enforcing that a PFSM is an accurate representation of observed MSMS spectra.


View this table:
[in this window]
[in a new window]
 
Table 1 PFSM peptide identifications of known proteins

 

    3 IMPLEMENTATION
 TOP
 Abstract
 1 INTRODUCTION
 2 ALGORITHM
 3 IMPLEMENTATION
 4 SUMMARY AND FUTURE...
 REFERENCES
 
3.1 Performance analysis of the PFSM
In theory, a PFSM will search through any protein sequence in time directly proportional to the size of the sequence; ‘search’ means transition from state to state in a DFA, which assuming one has already constructed the DFA, does not take more time regardless of the number of peak lists multiplexed or the number of potential modifications used to create the initial NDFA. However, there will always be the practical limitations of the amount of time and computer memory required in order to create the PFSM. These limitations are addressed by this section with the common case of searching a collection of peak lists against a collection of protein sequences. In all examples the PFSM project from the Proteome Commons, http://www.proteomecommons.org, is used on a Pentium 4 2.6 GHz with 1 GB of RAM.

The time required to construct a PFSM from a collection of peak lists and the time required to search through a set of protein sequences is illustrated in Figure 6. The example implementation constructs the complete NDFA and then proceeds to create DFA states as they are needed during the search. The approach of the implementation is modeled after the popular, open source regular expression tool Grep (Free Software Foundation available at http://www.gnu.org/software/grep/grep.html). Search time is illustrated in Figure 6A and it is split into two parts representing the time associated with constructing the DFA and the time required to search through protein sequences. Figure 6B illustrates the same data represented in search time per peak list.



View larger version (25K):
[in this window]
[in a new window]
 
Fig. 6 Performance characteristics of a PFSM searching through the tryptic peptides of NCBI nr, August 2004 release (~79 tryptic peptides, ~1 trillion residues). The time required to perform the search are shown in (A) and (B). The solid line is the complete program execution time and the dashed and dotted lines represent the time required to make the PFSM and the time required to search, respectively. (C) Illustrates the minimal memory requirements in order to search.

 
The searches in Figure 6 were performed using peak lists obtained from an ABI 4700 MALDI TOFTOF that was used to collect spectra from commercially available tryptic digest standards (Michrom Bioresources Inc., Auburn, CA). The top 10 MS ions of each digest were analyzed using MSMS, and each protein was analyzed in triplicate without ignoring previously analyzed MS ions. A total of 682 MSMS were collected with ~227 unique spectra identified. Individual peak lists were generated using the Mascot Distiller software package (Matrix Science Inc., Boston, MA). The spectra and peak lists are freely available in the Michrom tryptic digest spectra and peak list projects at http://www.proteomecommons.org. Searches were performed against the tryptic peptide sequences from the August 2004 NCBI nr release. Tryptic sequences were selected by performing a theoretical trypsin digest of each FASTA entry, allowing up to two missed cleavages. Sequences between 8 and 30 residues long were considered, including C-terminal peptides. In total, just over 72 million tryptic peptides, approximately a trillion residues, were searched. In all cases, the searches done in Figure 6 identified the tryptic peptide sequences of the known proteins as candidate sequences. The Mascot search results are freely available and may be obtained in the aforementioned Michrom tryptic digest peak list project.

The time requirements shown in Figure 6 illustrate a few points. The search time for a PFSM stays reasonably consistent as more peak lists are multiplexed. Clearly, the time to multiplex and search is much less than the time required to search each peak list individually; however, search time does slightly increase with the number of peak lists multiplexed, presumably due to the implementation of the algorithm. The limiting factor is clearly the time required to construct the PFSM, which grows with the number of peak lists multiplexed, albeit if the time growth is relatively minuscule.

Figure 6C illustrates the memory requirements derived from multiplexing the peak lists for each search. The PFSM in each search is performed with a reasonable set of potential modifications, which includes acetylation, carboxyamidomethylation, deamidation, oxidation and pyroglutamyl from either glutamic acid or glutamine. The PFSM in each search is also constructed with allowance of three missed ions in a series and a mass accuracy of 200 ppm. The memory requirements are reasonable, allowing up to 1000 peak lists to be loaded in little over 100 MB of computer memory. Also depicted in Figure 6C is the change in memory requirements as a variable number of potential modifications that are considered. Multiple modifications may be used; however, adding more modifications is analogous to multiplexing more peak lists. Potential modifications will result in longer construction time of the PFSM's NDFA and DFA and memory requirements will increase.

It is important to realize that the information in Figure 6 is only illustrating the time and memory requirements for using a PFSM to identify candidate peptide sequences. There is no attempt to rank candidate sequences in order to claim one peptide as a more suitable match for a particular MSMS spectrum, nor is there any attempt to identify the most likely protein from which the peptide sequences originated. In order to identify the most likely peptide sequences and the originating proteins a scoring function is required. Figure 7 illustrates that a PFSM may be combined with existing scoring functions, Mascot (Matrix Science Inc., Boston, MA) and XTandem (Craig and Beavis, 2004), in order to provide more rapid peptide and protein identifications. Both software packages may search against FASTA files; however, Figure 7 illustrates that significant time improvements are obtained when using a PFSM to first identify candidate sequences and subsequently using the search engine to analyze the subset of FASTA entries that contained candidate sequences. In the small case of analyzing 10 spectra obtained from a single MALDI well, e.g. one protein, the candidate sequences identified by the PFSM belong to 142 of the 2 million FASTA entries in NCBI nr. Searching Mascot or XTandem against the 142 entry FASTA file identified the same, correct entry that is identified by searching all of NBCI nr with Mascot or XTandem. The time difference is slightly favorable when using the PFSM along with either of the search engines instead of the search engine on its own. When searching all 682 peak lists the PFSM Machine identifies 12 634 FASTA entires from NCBI nr, and using the PFSM with either search engine saves a significant amount of time as compared with the use of the search engine by itself.



View larger version (17K):
[in this window]
[in a new window]
 
Fig. 7 Combination of a PFSM with known MSMS search engines. It is shown that time can be saved by first using a PFSM to screen for candidate sequences and then scoring only those sequences. Time comparison is shown for searching a PFSM + Mascot or XTandem (black) against NCBI nr and for searching Mascot or XTandem (gray) directly against nr. Time measured includes total execution time of both algorithms in the cases of PFSM + Mascot and PFSM + XTandem. Known tryptic standards are used and the search results in both cases identify the same peptides. The Mascot and XTandem searches were performed with the same data but on different computers.

 
Clearly, PFSMs do provide a mechanism for rapidly searching protein sequences while considering large amounts of peak lists and optionally considering potential modifications. The time and memory requirements of a PFSM, specifically the example implementation, are reasonable and allow a moderately fast computer to rapidly analyze large amounts of MSMS data against large datasets. The PFSM by itself identifies only candidate peptide sequences, but it was shown that a PFSM can be combined with a conventional MSMS search engine in order to more rapidly identify and rank candidate peptide sequences.


    4 SUMMARY AND FUTURE DIRECTIONS
 TOP
 Abstract
 1 INTRODUCTION
 2 ALGORITHM
 3 IMPLEMENTATION
 4 SUMMARY AND FUTURE...
 REFERENCES
 
The PFSM data structure is a novel, helpful tool for proteomics MSMS data analysis. The most beneficial aspect of a PFSM is the regular expression-like approach that allows any number of peak lists to be simultaneously searched against a set of protein sequences. Multiplexing in such a fashion does not change the number of peptides that will be identified and it requires much less time compared to searching each peak list individually against the entire dataset. Detailed in this paper is the complete algorithm for generating a PFSM from a set of peak lists and few examples of using a PFSM to identify amino acid sequences that may account for an observed peak list(s).

Clearly, an important future direction for PFSMs is an analysis which includes using a PFSM's DFA for rapid sequence identification while scoring candidate sequences in order to determine the best peptide sequence match(es) for a given spectrum, especially since PFSMs may easily account for multiple potential modifications. Another important future direction for PFSMs is that of result-dependent data acquisition and real-time MSMS data analysis. These topics were not explored in the current study, but upon examination of the performance characteristics of PFSMs (Figs 6 and 7), it should be clear that candidate peptide sequences can be identified in near real time using a modest computer. Given a smaller dataset to examine (NCBI nr is one of the largest datasets available) or more computational power, either a cluster of computers or an expensive server, it is reasonable to attempt using PFSMs to complement real-time MSMS data acquisition. A combination of a high-performance scoring function, a PFSM and an interface with MSMS instrumentation is a promising avenue of future research.


    Acknowledgments
 
Special thanks to Angela Walker for her assistance with the ABI 4700 MALDI TOFTOF and data collection, Edward Barbour for his assistance with Mascot Distiller and to everyone else in the Andrews lab for providing a great place to learn and experiment. Special thanks to Tom Blackwell, Pete Ulintz, Gary Rymar, Catherine Grasso and Eric Simon for their accurate comments, prompt feedback, and time and effort spent on editing this paper. Support from the NCRR for the National Resource for Proteomics and Pathways is gratefully acknowledged.

Received on November 21, 2004; revised on February 25, 2005; accepted on February 26, 2005

    REFERENCES
 TOP
 Abstract
 1 INTRODUCTION
 2 ALGORITHM
 3 IMPLEMENTATION
 4 SUMMARY AND FUTURE...
 REFERENCES
 

    Bartels, C. (1990) Fast algorithm for peptide sequencing by mass spectroscopy. Biomed. Environ. Mass Spectrom., 19, 363–368[CrossRef].

    Biemann, K. (1988) Contributions of mass spectrometry to peptide and protein structure. Biomed. Environ. Mass Spectrom., 16, 99–111[CrossRef][Medline].

    Church, A. (1936) A note on the entscheidungsproblem. J. Symbolic Logic, 1, 40–41.

    Clauser, K.R., et al. (1999) Role of accurate mass measurement (±10 ppm) in protein identification strategies employing MS or MS/MS and database searching. Anal. Chem., 71, 2871–2882[Medline].

    Craig, R. and Beavis, R.C. (2004) TANDEM: matching proteins with tandem mass spectra. Bioinformatics, 20, 1466–1467[Abstract/Free Full Text].

    Dancik, V., et al. (1999) De novo peptide sequencing via tandem mass spectrometry. J. Comput. Biol., 6, 327–342[CrossRef][Web of Science][Medline].

    Ducret, A., et al. (1998) High throughput protein characterization by automated reverse-phase chromatography/electrospray tandem mass spectrometry. Protein Sci., 7, 706–719[Web of Science][Medline].

    Eng, J., et al. (1994) An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Mass Spectrom, 5, 976–989[CrossRef].

    Mann, M. and Wilm, M. (1994) Error-tolerant identification of peptides in sequence databases by peptide sequence tags. Anal. Chem., 66, 4390–4399[Medline].

    Papayannopoulos, I.A. (1995) The interpretation of collision-induced dissociation tandem mass spectra of peptides. Mass Spectrom. Rev., 14, 49–73[CrossRef].

    Perkins, D.N., et al. (1999) Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis, 20, 3551–3567[CrossRef][Web of Science][Medline].

    Roepstorff, P. and Fohlman, J. (1984) Proposal for a common nomenclature for sequence ions in a mass spectra of peptides. Biomed. Mass Spectrom, 11, 601[CrossRef][Web of Science][Medline].

    Sipser, M. Introduction to the Theory of Computation, (1996) 1st edn ISBN 053494728X Brooks Cole, pp. 225–226.

    Tabb, D.L., et al. (2003) GutenTag: high-throughput sequence tagging via an empirically derived fragmentation model. Anal. Chem., 75, 6415–6421[Medline].

    Taylor, J.A. and Johnson, R.S. (1997) Sequence database searches via de novo peptide sequencing by tandem mass spectrometry. Rapid. Commun. Mass Spectrom., 11, 1067–1075[CrossRef][Web of Science][Medline].

    Turing, A.M. (1936) On computable numbers, with an application to the Entscheidungs problem. Proceedings of the London Mathematical Society, series 2, 42, , pp. 230–265 (1936–37).


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
J. Khatun, E. Hamlett, and M. C. Giddings
Incorporating sequence information into the scoring function: a hidden Markov model for improved peptide identification
Bioinformatics, March 1, 2008; 24(5): 674 - 681.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
J. A. Falkner, J. W. Falkner, and P. C. Andrews
ProteomeCommons.org IO Framework: reading and writing multiple proteomics data formats
Bioinformatics, January 15, 2007; 23(2): 262 - 263.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
J. A. Falkner, J. W. Falkner, and P. C. Andrews
ProteomeCommons.org JAF: reference information and tools for proteomics
Bioinformatics, March 1, 2006; 22(5): 632 - 633.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/10/2177    most recent
bti362v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (7)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Falkner, J.
Right arrow Articles by Andrews, P.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Falkner, J.
Right arrow Articles by Andrews, P.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?