Skip Navigation


Bioinformatics Advance Access originally published online on September 25, 2008
Bioinformatics 2008 24(23):2780-2781; doi:10.1093/bioinformatics/btn507
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrowOA All Versions of this Article:
24/23/2780    most recent
btn507v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (2)
Google Scholar
Right arrow Articles by Holm, L.
Right arrow Articles by Schenkel, A.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Holm, L.
Right arrow Articles by Schenkel, A.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© 2008 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Searching protein structure databases with DaliLite v.3

L. Holm 1,2,*, S. Kääriäinen 2, P. Rosenström 2 and A. Schenkel 2

1Department of Biological and Environmental Sciences, and 2Institute of Biotechnology, P.O.Box 56 (Viikinkaari 5), 00014 University of Helsinki, Finland

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 REFERENCES
 

The Red Queen said, ‘It takes all the running you can do, to keep in the same place.’ Lewis Carrol

Motivation: Newly solved protein structures are routinely scanned against structures already in the Protein Data Bank (PDB) using Internet servers. In favourable cases, comparing 3D structures may reveal biologically interesting similarities that are not detectable by comparing sequences. The number of known structures continues to grow exponentially. Sensitive—thorough but slow—search algorithms are challenged to deliver results in a reasonable time, as there are now more structures in the PDB than seconds in a day. The brute-force solution would be to distribute the individual comparisons on a massively parallel computer. A frugal solution, as implemented in the Dali server, is to reduce the total computational cost by pruning search space using prior knowledge about the distribution of structures in fold space. This note reports paradigm revisions that enable maintaining such a knowledge base up-to-date on a PC.

Availability: The Dali server for protein structure database searching at http://ekhidna.biocenter.helsinki.fi/dali_server is running DaliLite v.3. The software can be downloaded for academic use from http://ekhidna.biocenter.helsinki.fi/dali_lite/downloads/v3.

Contact: liisa.holm{at}helsinki.fi


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 REFERENCES
 
Comparative analyses of protein sequences and structures are a cornerstone of bioinformatics. When sequence and structure similarities have an evolutionary origin, it is often possible to infer similarities in the biological functions of the proteins, which would be difficult to predict directly. Structure comparisons have a longer look-back time than sequence comparison and have led to the identification of many ‘super-families’ of distantly related proteins.

Many measures have been proposed to quantify structural similarity. The Dali method uses a weighted sum of similarities of intra-molecular distances, which correlates with expert classifications in the sense that the structures of homologous proteins typically get higher similarity scores than the structures of evolutionarily unrelated proteins (Sierk and Pearson, 2004). This property is useful to a biologist using structure comparison to learn more about her query protein: the biologically informative neighbours are found at the top of the match list with relatively few false leads.

The Dali method has been used to systematically scan new structures against the Protein Data Bank (PDB) for some 15 years (Holm and Sander, 1994). The overall strategy is to screen the structure database with many different methods, starting with fast but unreliable ones and ending with the most sensitive but slow methods. This ensures that no significant similarity is missed. The search space is pruned between methods; if a strong match has been found, then subsequent methods only compare the query structure to the neighbours of the strong match. This strategy requires that all the neighbours of the known structures are precomputed in all versus all fashion within a representative subset of structures. The size of the structure set has grown by two decades since the system was introduced, and all versus all comparison is a quadratic problem in the number of structures. Recently, the paradigm of all versus all comparisons became untenable when the weekly PDB updates began to take more than a week to process.

DaliLite is a standalone package of the Dali algorithm. The first release of DaliLite (Holm and Park, 2000) contained all the functionality of the Dali server at EBI except the site-specific, complicated database update protocol. The main DaliLite program is a wrapper that calls a variety of methods for protein structure comparison. New workflows can thus be easily implemented by ‘rewiring the regulatory logic’ but keeping the basic algorithms unchanged. In DaliLite v.3, we introduce new options for database searching (DaliLite –quick) and database updates (DaliLite –update). The new protocols improve server throughput and vastly simplify the updates, making the complete system portable.

The key change from earlier is that we abandon the all versus all matrix of similarities in favour of a connected graph of similarities. The nodes of the graph represent protein structures and edges represent structural alignments. Whereas before each representative structure was directly linked to all its structurally similar neighbours, we now require only that there is a path of continuous structural similarity through the graph. The structural neighbours of a query structure are collected by walks through the graph. Not only need the graph be less densely connected than the all versus all matrix, thus saving computational effort, but also there is the added benefit that the incremental updates of the structural similarity graph and the choice of structural representatives are completely decoupled.


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 REFERENCES
 
2.1 PDB clustering
The PDB is highly redundant. The structures of some proteins and their mutants have been determined in various conditions, though the structures remain the ‘same’ for classification purposes. We use a representative subset at 90% sequence identity level (PDB90), derived from the current set of PDB sequences using CD-HIT (Li and Godzik, 2006). The PDB contains over 100 000 structures (chains), which is reduced to about 20 000 PDB90 representatives. Further clustering of similar folds at lower levels of sequence identity was not cost effective.

2.2 Structural similarity graph
The structural similarity graph and alignment data are stored in a relational database (MySQL). The graph is updated incrementally. If a new structure has strong similarity to structures already in the graph, one edge is sufficient to connect the new structure to the graph in the proper neighbourhood. If there is no strong match, we compare the new structure to all existing structures and add edges for all significant similarities.

Similarity is measured by Dali Z-scores. ‘Significant similarities’ have a Z-score above 2; they usually correspond to similar folds. ‘Strong matches’ have sequence identity above 20% or a Z-score above a cutoff that depends on the size of the query protein. The Z-score cutoff was empirically set to n/10 – 4, where n is the number of residues in the query structure. We additionally require that the complete structure is covered by structural alignments; a segment of the query structure longer than 80 residues without any structural matches always disqualifies a strong match.

2.3 Database searching
The database search option DaliLite –quick compares a query structure to all structures in the PDB, as organized in the structural similarity graph. To initiate a transitive search of structures in the graph, the query structure must be attached to some structural neighbours. Fast feature filters are often successful in finding near neighbours. We currently use sequence comparison by Blast, GTG sequence motifs (Heger et al., 2007) and secondary structure triplets to rank the structures in PDB90. We convert the feature filter scores to Z-scores in order to combine the ranked lists. The top 100 structures are compared using the normal Dali procedures. If a strong match is found, we move to the next step (transitive alignment). Otherwise, the query structure is compared against all 20 000 structures in PDB90.

The entry points connect the query structure to one or more structures in the structural similarity graph. These are direct (first shell) neighbours of the query. Structures in the second shell are compared in batches of 100, selecting those with the strongest connections first. Connection strength is the lesser Z-score along the path from query to the first neighbour to the second neighbour. The transitive alignment (via first neighbour) between the query structure and second neighbour is used as starting point for refinement, skipping the costly alignment optimization from scratch. The expansion is repeated until the connection strength drops below a Z-score cutoff of 2, or a maximum number of matches have been reported (default: MAX_HITS = 500).


    3 RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 REFERENCES
 
The utility of a protein structure database search method (i.e. similarity measure and optimization algorithm) must depend on its ability to report back ‘interesting’ matches. As an illustration, we chose query and target structures representing diverse super-families from the four main structural classes in SCOP: cytochromes c and winged helix DNA-binding domains from the all-alpha class, cupredoxins and PUA-like domains from the all-beta class, metallo-dependent hydrolases and alpha/beta hydrolases from the alpha/beta class, and lysozyme-likes and nucleotidyltransferases from the alpha+beta class (Table 1). Match lists were evaluated using the AUC, where the maximum value of one indicates perfect sensitivity and selectivity. Compared to optimizing the alignment from scratch (DaliLite –list), the new transitive search mode (DaliLite –quick) is about 30 times faster, without affecting AUC much (we removed all pre-existing edges from the query structures to the structural similarity graph). Compared to the SSM server's Q-score, the higher AUC values in Table 1 indicate superior discrimination of homologous proteins from unrelated proteins by Dali's Z-score.


View this table:
[in this window]
[in a new window]

 
Table 1. Comparison of DaliLite v.3, the SSM server and SCOP

 
In conclusion, Dali remains a useful tool for structural bioinformatics. The Dali server has been running DaliLite –quick for a number of months now, with a throughput of 50 user queries—a mixture of redundant and unique structures—per day per CPU.

Funding: Academy of Finland (grants #109849 and #1105210).

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Thomas Lengauer

Received on June 10, 2008; revised on August 14, 2008; accepted on September 22, 2008

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 REFERENCES
 

    Heger A, et al. The global trace graph, a novel paradigm for searching protein sequence databases. Bioinformatics (2007) 23:2361–2367.[Abstract/Free Full Text]

    Holm L, Park J. DaliLite workbench for protein structure comparison. Bioinformatics (2000) 16:566–567.[Abstract/Free Full Text]

    Holm L, Sander C. Searching protein structure databases has come of age. Proteins (1994) 19:165–173.[CrossRef][Web of Science][Medline]

    Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics (2006) 22:1658–1659.[Abstract/Free Full Text]

    Murzin AG, et al. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. (1995) 247:536–540.[CrossRef][Web of Science][Medline]

    Krissinel E, Henrick K. Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions. In: Acta Cryst. (2004) D60:2256–2268.[Web of Science]

    Sierk ML, Pearson WR. Sensitivity and selectivity in protein structure comparison. Protein Sci. (2004) 13:773–785.[CrossRef][Web of Science][Medline]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Nucleic Acids ResHome page
B.-H. Kim, H. Cheng, and N. V. Grishin
HorA web server to infer homology between proteins using sequence and structural similarity
Nucleic Acids Res., July 1, 2009; 37(suppl_2): W532 - W538.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
S. Shi, B. Chitturi, and N. V. Grishin
ProSMoS server: a pattern-based search using interaction matrix representation of protein structures
Nucleic Acids Res., July 1, 2009; 37(suppl_2): W526 - W531.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
S. J. Suhrer, M. Wiederstein, M. Gruber, and M. J. Sippl
COPS--a novel workbench for explorations in fold space
Nucleic Acids Res., July 1, 2009; 37(suppl_2): W539 - W544.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
T. Margraf, G. Schenk, and A. E. Torda
The SALAMI protein structure search server
Nucleic Acids Res., July 1, 2009; 37(suppl_2): W480 - W484.
[Abstract] [Full Text] [PDF]


Home page
Proc. Natl. Acad. Sci. USAHome page
M. Sagermann, A. Ohtaki, and K. Nikolakakis
Crystal structure of the EutL shell protein of the ethanolamine ammonia lyase microcompartment
PNAS, June 2, 2009; 106(22): 8883 - 8887.
[Abstract] [Full Text] [PDF]


Home page
Proc. Natl. Acad. Sci. USAHome page
M.-T. Sung, Y.-T. Lai, C.-Y. Huang, L.-Y. Chou, H.-W. Shih, W.-C. Cheng, C.-H. Wong, and C. Ma
Crystal structure of the membrane-bound bifunctional transglycosylase PBP1b from Escherichia coli
PNAS, June 2, 2009; 106(22): 8824 - 8829.
[Abstract] [Full Text] [PDF]


Home page
J. Biol. Chem.Home page
D. Han, K. Kim, Y. Kim, Y. Kang, J. Y. Lee, and Y. Kim
Crystal Structure of the N-terminal Domain of Anaphase-promoting Complex Subunit 7
J. Biol. Chem., May 29, 2009; 284(22): 15137 - 15146.
[Abstract] [Full Text] [PDF]


Home page
Proc. Natl. Acad. Sci. USAHome page
H. Nishimasu, R. Ishitani, K. Yamashita, C. Iwashita, A. Hirata, H. Hori, and O. Nureki
Atomic structure of a folate/FAD-dependent tRNA T54 methyltransferase
PNAS, May 19, 2009; 106(20): 8180 - 8185.
[Abstract] [Full Text] [PDF]


Home page
J. Biol. Chem.Home page
D. Pakotiprapha, Y. Liu, G. L. Verdine, and D. Jeruzalmi
A Structural Model for the Damage-sensing Complex in Bacterial Nucleotide Excision Repair
J. Biol. Chem., May 8, 2009; 284(19): 12837 - 12844.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
F. Chignola, M. Gaetani, A. Rebane, T. Org, L. Mollica, C. Zucchelli, A. Spitaleri, V. Mannella, P. Peterson, and G. Musco
The solution structure of the first PHD finger of autoimmune regulator in complex with non-modified histone H3 tail reveals the antagonistic role of H3R2 methylation
Nucleic Acids Res., May 1, 2009; 37(9): 2951 - 2961.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
T. Monecke, A. Dickmanns, and R. Ficner
Structural basis for m7G-cap hypermethylation of small nuclear, small nucleolar and telomerase RNA by the dimethyltransferase TGS1
Nucleic Acids Res., April 22, 2009; (2009) gkp249v1.
[Abstract] [Full Text] [PDF]


Home page
J. Biol. Chem.Home page
J. K. Capyk, I. D'Angelo, N. C. Strynadka, and L. D. Eltis
Characterization of 3-Ketosteroid 9{alpha}-Hydroxylase, a Rieske Oxygenase in the Cholesterol Degradation Pathway of Mycobacterium tuberculosis
J. Biol. Chem., April 10, 2009; 284(15): 9937 - 9946.
[Abstract] [Full Text] [PDF]


Home page
J. Biol. Chem.Home page
S. S. Gupta, B. N. Borin, T. L. Cover, and A. M. Krezel
Structural Analysis of the DNA-binding Domain of the Helicobacter pylori Response Regulator ArsR
J. Biol. Chem., March 6, 2009; 284(10): 6536 - 6545.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrowOA All Versions of this Article:
24/23/2780    most recent
btn507v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (2)
Google Scholar
Right arrow Articles by Holm, L.
Right arrow Articles by Schenkel, A.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Holm, L.
Right arrow Articles by Schenkel, A.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?