

Bioinformatics Advance Access originally published online on September 16, 2008
Bioinformatics 2008 24(22):2632-2633; doi:10.1093/bioinformatics/btn488

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

TESE: generating specific protein structure test set ensembles

Francesco Sirocco and Silvio C. E. Tosatto *

Department of Biology, University of Padova, Viale G. Colombo 3, 35131 Padova, Italy

*To whom correspondence should be addressed.


    ABSTRACT

Summary: TESE is a web server for the generation of test sets of protein sequences and structures fulfilling a number of different criteria. At least three different use cases can be envisaged: (i) benchmarking of novel methods; (ii) test sets tailored for special needs and (iii) extending available datasets. The CATH structure classification is used to control structural/sequence redundancy and a variety of structural quality parameters can be used to interactively select protein subsets with specific characteristics, e.g. all X-ray structures of α-helical repeat proteins with more than 120 residues and resolution <2.0 Å. The output includes FASTA-formatted sequences, PDB files and a clickable HTML index file containing images of the selected proteins. Multiple subsets for cross-validation are also supported.

Availability: The TESE server is available for non-commercial use at URL: http://protein.bio.unipd.it/tese/.

Contact: silvio.tosatto@unipd.it


    1 INTRODUCTION
Creating representative ensembles of sufficiently diverse proteins is a recurring problem in bioinformatics. Any novel method has to be trained and benchmarked on a test set of protein sequences and/or structures ensuring wide coverage of the protein universe and solid statistical evaluation. At least three different use cases can be envisaged: (i) the benchmarking of novel sequence alignment protocols and statistical potentials; (ii) the generation of test sets for specialized protein classes, e.g. transmembrane proteins; and (iii) extending datasets from previous publications with new structures to enhance statistical significance, e.g. for novel repeat proteins. The benchmarking problem has recently been addressed, for instance, in the area of protein–ligand docking (Jain and Nicholls, 2008). Given the exponential growth in available information, it is increasingly necessary to generate representative test sets large enough to allow solid statistical evaluation of the results.

One of the earliest methods for the systematic selection of reduced protein lists from the Protein Data Bank (PDB; Berman et al., 2002) is PDBSELECT (Hobohm and Sander, 1994). It produces a list of protein sequences selected for a maximum percentage of sequence identity and reasonable structural quality. PDB-REPRDB (Noguchi and Akiyama, 2003) and UniqueProt (Mika and Rost, 2003) were developed to automate and facilitate the sequence selection process with more stringent similarity filters. More recently, the PISCES server (Wang and Dunbrack, 2003, 2005) has seen extensive usage for the generation of benchmark sets. PISCES combines both sequence similarity and structure quality filters to produce annotated lists of protein sequences. Structural alignments are used to improve the discrimination of proteins with weak sequence similarities.

One limitation of the currently available services is the lack of an underlying structural classification throughout the selection process. Such a classification becomes increasingly important in the low sequence similarity range, where it is desirable to eliminate homology; its absence limits the usefulness of current methods in, for instance, fold recognition. On the other hand, the structural classification schemes, e.g. CATH (Pearl et al., 2003) and SCOP (Andreeva et al., 2004), are readily used for the selection of similar structures in the absence of sequence similarity. However, only the full classifications are distributed and it is the developer's responsibility to extract meaningful subsets in a similar way to the previously mentioned services (e.g. PISCES). This process can become rather cumbersome in practice, e.g. when selecting structures with short tandem repeats or representatives of the Rossmann fold. A lack of standardization, together with the relevance of many technical details in the selection process, also frequently complicates the unbiased assessment of novel methods and makes it harder to rule out ‘cherry-picking’ of the data. For these reasons, we have developed TESE, a novel server for the automatic generation of large benchmark sets both on the sequence and on the structure level.


    2 FEATURES
TESE is a method to derive meaningful ad hoc test sets from proteins of known structure. The CATH structural classification is used to control sequence/structural redundancy at various levels, e.g. <35% pairwise sequence identity corresponds to the ‘S’ level. Queries may be started in three different ways, as in the schematic overview of Figure 1. Keywords or a small sample of PDB files can be used to seed the TESE search for specific proteins, e.g. for α-helical repeats or oxidoreductases, or to extend previously published datasets. Alternatively, to initiate the search the user may specify parameters related to the desired CATH similarity level (e.g. topology), the experimental method and quality (e.g. maximum X-ray resolution) or protein size (e.g. minimum length). It is possible to select all structures or a randomly chosen subset of any size. For sets of fewer than 600 proteins, a clickable list of protein structures and their CATH classification is produced. New proteins may be selected by directly choosing a different protein subset or by adding further search parameters. When satisfied, the user may save the protein list as a compressed archive containing the relevant FASTA-formatted sequences, PDB files and an HTML index of the selected proteins. The test set may be automatically split to create subsets for cross-validation. Large datasets of more than 600 proteins are treated in a non-interactive way to limit bandwidth usage. Some widely used test sets are available as precompiled archives. Online help is provided to guide the user through the process.
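The selection steps described above (structural filter, optional random subset, split for cross-validation) can be illustrated with a minimal Python sketch. This is not the TESE implementation: the `Entry` record, the function names and the toy entries are assumptions made purely for illustration, and the CATH codes are used only as example strings.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    """Hypothetical record; the field names are assumptions, not the TESE schema."""
    pdb_id: str                  # toy identifier, not a real PDB code
    cath: str                    # CATH code "class.architecture.topology.homology"
    method: str                  # experimental method, e.g. "X-RAY" or "NMR"
    resolution: Optional[float]  # in Angstrom; None when not applicable
    length: int                  # number of residues


def matches_cath(entry, prefix):
    """True if the entry's CATH code starts with the given level prefix."""
    parts = prefix.split(".")
    return entry.cath.split(".")[:len(parts)] == parts


def select(entries, cath_prefix, method="X-RAY", max_resolution=2.0, min_length=120):
    """Keep entries that satisfy the CATH, method, resolution and size filters."""
    return [
        e for e in entries
        if matches_cath(e, cath_prefix)
        and e.method == method
        and e.resolution is not None
        and e.resolution < max_resolution
        and e.length > min_length
    ]


def random_subset(entries, size, seed=0):
    """Draw a reproducible random subset of at most `size` entries."""
    return random.Random(seed).sample(entries, min(size, len(entries)))


def cross_validation_folds(entries, k, seed=0):
    """Shuffle the entries and deal them into k disjoint folds of near-equal size."""
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


# Toy data, invented for this example only.
TOY = [
    Entry("toy1", "1.25.40.10", "X-RAY", 1.8, 250),
    Entry("toy2", "1.25.40.20", "X-RAY", 2.5, 300),   # fails the resolution filter
    Entry("toy3", "1.25.10.10", "NMR", None, 150),    # fails the method filter
    Entry("toy4", "3.40.50.720", "X-RAY", 1.5, 200),  # fails the CATH filter
    Entry("toy5", "1.25.40.10", "X-RAY", 1.9, 100),   # fails the length filter
    Entry("toy6", "1.25.10.10", "X-RAY", 1.2, 180),
]

selected = select(TOY, "1.25")
folds = cross_validation_folds(selected, k=2)
```

Comparing CATH codes component by component, rather than by plain string prefix, avoids accidental matches such as `1.25` against `1.250`. Fixing the random seed makes the subset and the folds reproducible, which is what allows a test set to be shared or regenerated.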


Fig. 1. Overview of TESE. The server has three main modes of operation: structural filters, list of PDB entries or keywords. These produce a dynamically generated, clickable list from which suitable structures can be chosen. The process can be repeated iteratively, refining the search with additional structural filters, before saving the results as a compressed archive containing an HTML index with pictures, sequence and structure information.

 
TESE uses a MySQL database containing information from the latest CATH release and PDBFINDERII (Hooft et al., 1996) to derive the relevant structural parameters, with Perl scripts used for data conversion. The underlying databases are updated weekly and the TAP score (Tosatto and Battistutta, 2007) is calculated locally. Pictures of PDB structures are drawn using PyMol (DeLano Scientific LLC, URL: http://www.pymol.org/). A more extensive server description and examples are available from the web site.
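To make the database-backed selection more concrete, the sketch below shows how classification and quality data might be joined in a relational query. It is only an illustration under stated assumptions: the table and column names (`cath_domain`, `structure_quality`) are hypothetical and not the TESE schema, the rows are toy data, and Python's built-in `sqlite3` module stands in for MySQL so that the example is self-contained.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the MySQL back end.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one table for CATH assignments, one for quality data.
conn.executescript("""
CREATE TABLE cath_domain (pdb_id TEXT, cath_code TEXT);
CREATE TABLE structure_quality (
    pdb_id TEXT PRIMARY KEY,
    method TEXT,
    resolution REAL,   -- NULL when not applicable (e.g. NMR)
    length INTEGER
);
""")

# Toy rows invented for this example; the identifiers are not real PDB codes.
conn.executemany("INSERT INTO cath_domain VALUES (?, ?)", [
    ("toy1", "1.25.40.10"),
    ("toy2", "1.25.40.20"),
    ("toy3", "1.25.10.10"),
    ("toy4", "3.40.50.720"),
    ("toy6", "1.25.10.10"),
])
conn.executemany("INSERT INTO structure_quality VALUES (?, ?, ?, ?)", [
    ("toy1", "X-RAY", 1.8, 250),
    ("toy2", "X-RAY", 2.5, 300),
    ("toy3", "NMR", None, 150),
    ("toy4", "X-RAY", 1.5, 200),
    ("toy6", "X-RAY", 1.2, 180),
])


def query_structures(conn, cath_prefix, max_resolution, min_length):
    """Return IDs of X-ray structures under a CATH prefix that pass the filters."""
    sql = """
        SELECT DISTINCT q.pdb_id
        FROM structure_quality AS q
        JOIN cath_domain AS c ON c.pdb_id = q.pdb_id
        WHERE c.cath_code LIKE ?
          AND q.method = 'X-RAY'
          AND q.resolution < ?
          AND q.length > ?
        ORDER BY q.pdb_id
    """
    # The trailing ".%" keeps "1.25" from matching codes such as "1.250.x.y".
    params = (cath_prefix + ".%", max_resolution, min_length)
    return [row[0] for row in conn.execute(sql, params)]


hits = query_structures(conn, "1.25", 2.0, 120)
```

Parameterized queries keep user-supplied filter values out of the SQL string, which matters for a public web server accepting free-form search parameters.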


    ACKNOWLEDGEMENTS
The authors are grateful to Dr. Ingolf Sommer, Dr. Stefano Toppo and members of the BioComputing UP lab for insightful discussions.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Burkhard Rost

Received on June 5, 2008; revised on August 26, 2008; accepted on September 10, 2008

    REFERENCES

    Andreeva A, et al. SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res. (2004) 32(Database issue):D226–D229.

    Berman HM, et al. The Protein Data Bank. Acta Crystallogr. D Biol. Crystallogr. (2002) 58:899–907.

    Hobohm U, Sander C. Enlarged representative set of protein structures. Protein Sci. (1994) 3:522–524.

    Hooft RW, et al. The PDBFINDER database: a summary of PDB, DSSP and HSSP information with added value. Comput. Appl. Biosci. (1996) 12:525–529.

    Jain AN, Nicholls A. Recommendations for evaluation of computational methods. J. Comput. Aided Mol. Des. (2008) 22:133–139.

    Mika S, Rost B. UniqueProt: creating representative protein sequence sets. Nucleic Acids Res. (2003) 31:3789–3791.

    Noguchi T, Akiyama Y. PDB-REPRDB: a database of representative protein chains from the Protein Data Bank (PDB) in 2003. Nucleic Acids Res. (2003) 31:492–493.

    Pearl FM, et al. The CATH database: an extended protein family resource for structural and functional genomics. Nucleic Acids Res. (2003) 31:452–455.

    Tosatto SC, Battistutta R. TAP score: torsion angle propensity normalization applied to local protein structure evaluation. BMC Bioinformatics (2007) 8:155.

    Wang G, Dunbrack R.L. Jr. PISCES: a protein sequence culling server. Bioinformatics (2003) 19:1589–1591.

    Wang G, Dunbrack R.L. Jr. PISCES: recent improvements to a PDB sequence culling server. Nucleic Acids Res. (2005) 33:W94–W98.





