Bioinformatics Advance Access originally published online on July 19, 2005
Bioinformatics 2005 21(17):3584-3585; doi:10.1093/bioinformatics/bti588

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions{at}oupjournals.org

DBToolkit: processing protein databases for peptide-centric proteomics

Lennart Martens *, Joël Vandekerckhove and Kris Gevaert

Department of Biochemistry, Faculty of Medicine and Health Sciences, Ghent University and Flanders Interuniversity Institute for Biotechnology (VIB09), A. Baertsoenkaai 3, B-9000 Ghent, Belgium

*To whom correspondence should be addressed.


    Abstract
 

Summary: DBToolkit is a user-friendly, easily extensible tool that converts protein sequence databases into peptide-centric sequence databases. This processing is primarily aimed at enhancing the useful information content of these databases, so that they can serve as optimized search spaces for the efficient identification of peptide fragmentation spectra obtained by mass spectrometry. In addition, DBToolkit can be used to reliably solve a range of other typical tasks in processing sequence databases.

Availability: DBToolkit is open source under the GNU GPL license. The source code, full user and developer documentation and cross-platform binaries are freely downloadable from the project website at http://genesis.UGent.be/dbtoolkit/

Contact: lennart.martens{at}UGent.be


    INTRODUCTION
 
As the tool of choice in present-day high-throughput proteomics, mass spectrometry has evolved substantially in recent years. The classical approach of two-dimensional polyacrylamide gel electrophoresis (2D-PAGE) (O'Farrell, 1975) requires merely a mass measurement of the peptides generated from an enzymatic digest of an isolated protein (Cottrell, 1994). A refinement of this approach uses fragmentation spectra of a few peptides as additional information for the identification of the original protein. In the most recent, so-called gel-free techniques (e.g. as reviewed by Zhang et al., 2004 and Gevaert et al., 2005), however, mass spectrometers must be able to generate high-quality fragmentation spectra from extremely complex peptide mixtures, obtained by proteolytic digestion of the unfractionated proteome of a cell or tissue. These peptide-centric methods were primarily developed to deal with the inherent shortcomings of classical 2D-PAGE techniques, and they allow for a greater coverage of the proteome while simultaneously increasing the sensitivity of the analysis (Aebersold and Mann, 2003). The peptide-centric technologies have driven the replacement of the protein by the peptide as the basic unit in proteomics research (Aebersold and Mann, 2003; Kearney and Thibault, 2003).

Sequence databases like SWISS-PROT, IPI and the NCBI non-redundant database remain protein-based, however. Since this discrepancy between the respective fundamental units can lead to a loss of highly interesting identifications, we developed the DBToolkit suite of software tools to allow the conversion of protein sequence databases into peptide sequence databases.


    APPLICATION FUNCTIONALITIES
 
The software can recognize FASTA- and EMBL-formatted databases out of the box, with UniProt and IPI being the most prominent examples of the latter. It is also easy for developers to add automatic recognition of other database formats, as detailed below.

DBToolkit can perform various types of processing on sequence databases. Simple in silico enzymatic digests using a variety of predefined or user-added enzymes are possible, as well as database concatenation and FASTA output of differently formatted databases. The enzymatic digest even allows for ‘dual specificity’ enzymes that generate peptides for which the amino terminus (N-terminus) is the result of a different cleavage pattern than the carboxy terminus (C-terminus). In addition, it is possible to filter databases (the exact filtering options depend on the database format loaded) and to limit output to sequences in a certain mass range. Additional filters by other developers are also readily included in the software (see below). The three most powerful functions of DBToolkit, however, are sequence-based filtering through a simple query language, N-terminal or C-terminal ragging (optionally truncating sequences in the process) and sequence-based redundancy clearing. The ragging process creates a series of subsequences for each ‘mother sequence’, where in the n-th subsequence the first n – 1 residues have been removed from the N-terminal or C-terminal side, respectively.
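The ragging definition above can be illustrated with a minimal Python sketch. This is not DBToolkit code (the suite itself is written in Java); the function name `rag` and the `min_length` cut-off are hypothetical and only serve the illustration.

```python
def rag(sequence, terminus="N", min_length=1):
    """Return the ragged series for one mother sequence.

    The n-th element has n - 1 residues removed from the chosen terminus,
    so the first element is the mother sequence itself. `min_length` is a
    hypothetical cut-off that stops the series at very short subsequences.
    """
    if terminus not in ("N", "C"):
        raise ValueError("terminus must be 'N' or 'C'")
    series = []
    for removed in range(len(sequence) - min_length + 1):
        if terminus == "N":
            # Drop residues from the N-terminal (left) side.
            series.append(sequence[removed:])
        else:
            # Drop residues from the C-terminal (right) side.
            series.append(sequence[:len(sequence) - removed])
    return series


if __name__ == "__main__":
    for subsequence in rag("PEPTIDE", terminus="N", min_length=4):
        print(subsequence)
```

A mother sequence of length L yields L – min_length + 1 subsequences, so ragging a whole database enlarges it considerably; this is one reason to combine it with redundancy clearing.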

These functions are readily applied serially to achieve compound results such as a non-redundant, N-terminally ragged subset of a trypsin digest of the Homo sapiens entries in the UniProt database, all of which have a mass between 600 and 4000 Da.
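As a rough illustration of such a serial application, the Python sketch below chains a tryptic digest, N-terminal ragging, a 600 to 4000 Da mass window and sequence-based redundancy clearing. It is a sketch under stated assumptions rather than DBToolkit's implementation: all function names are hypothetical, the trypsin rule (cleave C-terminal to K or R unless followed by P) and the use of monoisotopic masses are assumptions, and the input is assumed to be already restricted to the desired species.

```python
# Monoisotopic residue masses in Da (assumption: the source does not state
# whether monoisotopic or average masses are used).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per free peptide


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(MONO[residue] for residue in peptide) + WATER


def tryptic_digest(protein):
    """Cleave C-terminal to K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal remainder
    return peptides


def n_terminal_rag(peptide):
    """The n-th subsequence lacks the first n - 1 residues."""
    return [peptide[removed:] for removed in range(len(peptide))]


def process(proteins, low=600.0, high=4000.0):
    """Digest, rag, mass-filter and de-duplicate, keeping first-seen order."""
    seen, result = set(), []
    for protein in proteins:
        for peptide in tryptic_digest(protein):
            for ragged in n_terminal_rag(peptide):
                if ragged in seen:
                    continue  # sequence-based redundancy clearing
                if low <= peptide_mass(ragged) <= high:
                    seen.add(ragged)
                    result.append(ragged)
    return result


if __name__ == "__main__":
    for sequence in process(["MKWVTFISLLLLFSSAYSRGVFRRDTHK"]):
        print(sequence, round(peptide_mass(sequence), 4))
```

Redundancy clearing is reduced here to dropping repeated sequences; how the annotation of merged entries should be retained is left out of the sketch.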

Several applications for these processed databases are outlined below.


    APPLICATION DESIGN
 
DBToolkit is written entirely in the Java programming language and its only requirement is a Java runtime environment, version 1.3 or above. The suite consists of both an intuitive graphical user interface that gives the user interactive control over all processing steps, and an equivalent set of command-line tools for straightforward automation of the processing steps through simple scripting. This latter functionality has allowed us to tie different processing steps in with the automatic database updating of Mascot (http://www.matrixscience.com) for the most popular sequence databases, creating multiple derived databases overnight.

DBToolkit was designed from the start to be easily extensible. The use of robust frameworking allows the addition of novel database loaders or filters without requiring recompilation.

Full user and developer documentation for the suite is available from the project website, along with the cross-platform binaries and CVS repository coordinates.


    DISCUSSION
 
We have applied DBToolkit in the lab for numerous purposes, most notably the generation of specialized databases for use as search databases for protein identification in Mascot. One approach used ragged, non-redundant peptide databases to increase the number of identified spectra in an N-terminal COFRADIC experiment by ~40% (Gevaert et al., 2003). Interestingly, most of the peptides identified only in the ragged databases corresponded to the novel N-termini of their progenitor proteins after in vivo processing (e.g. the N-termini of nuclear-encoded proteins that are imported into mitochondria and have lost their transit peptide). Since these processing sites typically did not conform to standard tryptic sites, they were absent from searches performed solely in the original sequence databases. Another application has been found in picking up peptides from apoptosis substrates, yielding the exact cleavage location in those proteins. For this we created non-redundant, enzymatically digested peptide databases using a bifunctional enzyme that created peptides with an N-terminus derived from caspase activity (i.e. consensus cleavage C-terminal to aspartic acid) and a C-terminus derived from trypsin activity. In this way, a large number of caspase cleavage sites have been confirmed and many tentative new sites have been found that would otherwise have eluded identification (unpublished data). A third application centers on the a priori calculation of the potential success of a given COFRADIC procedure, by rapidly creating non-redundant, comprehensive lists of all detectable peptides containing a specified amino acid. Note that this functionality can be applied to any peptide-centric proteomics approach that can select for sequences by their amino acid content (see Zhang et al., 2004 and Gevaert et al., 2005 for an overview of these techniques).
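The bifunctional (dual-specificity) digest described above might be sketched in Python as follows. This is one possible reading, not DBToolkit's actual algorithm: each peptide is assumed to start directly after an aspartic acid and to run to the next downstream tryptic site (K or R not followed by P), or to the protein C-terminus when no such site exists. The function and parameter names are hypothetical.

```python
def dual_specificity_digest(protein, n_term_residues="D", c_term_residues="KR"):
    """Peptides whose N-terminus follows a caspase-like cleavage (after D)
    and whose C-terminus is a tryptic site (after K/R, not before P).

    Assumption: the C-terminus is the first tryptic site downstream of the
    caspase site, falling back to the protein C-terminus.
    """
    peptides = []
    # The last residue cannot give rise to a peptide, so it is skipped.
    for i, residue in enumerate(protein[:-1]):
        if residue not in n_term_residues:
            continue
        start = i + 1          # peptide begins right after the D
        end = len(protein)     # fallback: protein C-terminus
        for j in range(start, len(protein)):
            next_residue = protein[j + 1] if j + 1 < len(protein) else ""
            if protein[j] in c_term_residues and next_residue != "P":
                end = j + 1    # include the K or R itself
                break
        peptides.append(protein[start:end])
    return peptides


if __name__ == "__main__":
    for peptide in dual_specificity_digest("AAADGGGKLLLDSSSR"):
        print(peptide)
```

Because every aspartic acid is treated as a potential cleavage site, the resulting peptides can be nested; subsequent redundancy clearing and mass filtering would trim the list to a practical search space.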

DBToolkit has proven to be a highly versatile yet very simple tool for routine tasks in sequence database processing. Furthermore, as the applicability and popularity of peptide-centric proteomics experiments expand further, DBToolkit can perform the essential task of complementing proven, probabilistic protein identification software like Mascot with peptide-centric search databases, optimized for the specific conditions and requirements of the research.


    Acknowledgments
 
L.M. would like to thank An Staes, Evy Timmerman, Petra Van Damme, Grégoire Thomas and Luc Krols for their useful suggestions and comments on the DBToolkit software during its development phase. K.G. is a Postdoctoral Fellow and L.M. a Research Assistant of the Fund for Scientific Research, Flanders (Belgium) (FWO, Vlaanderen). The project was supported by research grants from the Fund for Scientific Research, Flanders (Belgium) (project number G.0008.03), the Inter University Attraction Poles (IUAP, project number P5/05), the GBOU-research initiative (project number 20204) of the Flanders Institute of Science and Technology (IWT) and the European Union Interaction Proteome (6th Framework Program).

Conflict of Interest: none declared.

Received on June 10, 2005; accepted on July 14, 2005

    REFERENCES
 

    Aebersold, R. and Mann, M. (2003) Mass spectrometry-based proteomics. Nature, 422, 198–207.

    Cottrell, J.S. (1994) Protein identification by peptide mass fingerprinting. Pept. Res., 7, 115–124.

    Gevaert, K., et al. (2003) Exploring proteomes and analyzing protein processing by mass spectrometric identification of sorted N-terminal peptides. Nat. Biotechnol., 21, 566–569.

    Gevaert, K., et al. (2005) Diagonal reverse-phase chromatography applications in peptide-centric proteomics; ahead of catalogue-omics? Anal. Biochem., in press.

    Kearney, P. and Thibault, P. (2003) Bioinformatics meets proteomics—bridging the gap between mass spectrometry data analysis and cell biology. J. Bioinform. Comput. Biol., 1, 183–200.

    O'Farrell, P.H. (1975) High resolution two-dimensional electrophoresis of proteins. J. Biol. Chem., 250, 4007–4021.

    Zhang, H., et al. (2004) Chemical probes and tandem mass spectrometry: a strategy for the quantitative analysis of proteomes and subproteomes. Curr. Opin. Chem. Biol., 8, 66–75.


This article has been cited by other articles:

P. Van Damme, S. Maurer-Stroh, K. Plasman, J. Van Durme, N. Colaert, E. Timmerman, P.-J. De Bock, M. Goethals, F. Rousseau, J. Schymkowitz, et al.
Analysis of Protein Processing by N-terminal Proteomics Reveals Novel Species-specific Substrate Determinants of Granzyme B Orthologs
Mol. Cell. Proteomics, February 1, 2009; 8(2): 258-272.

M. Lamkanfi, T.-D. Kanneganti, P. Van Damme, T. Vanden Berghe, I. Vanoverberghe, J. Vandekerckhove, P. Vandenabeele, K. Gevaert, and G. Nunez
Targeted Peptidecentric Proteomics Reveals Caspase-7 as a Substrate of the Caspase-1 Inflammasomes
Mol. Cell. Proteomics, December 1, 2008; 7(12): 2350-2363.

K. Helsens, E. Timmerman, J. Vandekerckhove, K. Gevaert, and L. Martens
Peptizer, a Tool for Assessing False Positive Peptide Identifications and Manually Validating Selected Results
Mol. Cell. Proteomics, December 1, 2008; 7(12): 2364-2372.

J. A. Falkner, J. W. Falkner, and P. C. Andrews
ProteomeCommons.org IO Framework: reading and writing multiple proteomics data formats
Bioinformatics, January 15, 2007; 23(2): 262-263.

D. Li, W. Gao, C. X. Ling, X. Wang, R. Sun, and S. He
IndexToolkit: an open source toolbox to index protein databases for high-throughput proteomics
Bioinformatics, October 15, 2006; 22(20): 2572-2573.

J. A. Falkner, J. W. Falkner, and P. C. Andrews
ProteomeCommons.org JAF: reference information and tools for proteomics
Bioinformatics, March 1, 2006; 22(5): 632-633.