Bioinformatics Advance Access originally published online on September 25, 2008
Bioinformatics 2008 24(23):2780-2781; doi:10.1093/bioinformatics/btn507
Searching protein structure databases with DaliLite v.3
1Department of Biological and Environmental Sciences, and 2Institute of Biotechnology, P.O.Box 56 (Viikinkaari 5), 00014 University of Helsinki, Finland
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
The Red Queen said, "It takes all the running you can do, to keep in the same place." Lewis Carroll
Motivation: Newly solved protein structures are routinely scanned against structures already in the Protein Data Bank (PDB) using Internet servers. In favourable cases, comparing 3D structures may reveal biologically interesting similarities that are not detectable by comparing sequences. The number of known structures continues to grow exponentially. Sensitive search algorithms, which are thorough but slow, are challenged to deliver results in a reasonable time, as there are now more structures in the PDB than seconds in a day. The brute-force solution would be to distribute the individual comparisons on a massively parallel computer. A frugal solution, as implemented in the Dali server, is to reduce the total computational cost by pruning search space using prior knowledge about the distribution of structures in fold space. This note reports paradigm revisions that make it possible to keep such a knowledge base up to date on a PC.
Availability: The Dali server for protein structure database searching at http://ekhidna.biocenter.helsinki.fi/dali_server is running DaliLite v.3. The software can be downloaded for academic use from http://ekhidna.biocenter.helsinki.fi/dali_lite/downloads/v3.
Contact: liisa.holm{at}helsinki.fi
| 1 INTRODUCTION |
|---|
Comparative analyses of protein sequences and structures are a cornerstone of bioinformatics. When sequence and structure similarities have an evolutionary origin, it is often possible to infer similarities in the biological functions of the proteins, which would be difficult to predict directly. Structure comparisons have a longer look-back time than sequence comparison and have led to the identification of many super-families of distantly related proteins.
Many measures have been proposed to quantify structural similarity. The Dali method uses a weighted sum of similarities of intra-molecular distances, which correlates with expert classifications in the sense that the structures of homologous proteins typically get higher similarity scores than the structures of evolutionarily unrelated proteins (Sierk and Pearson, 2004). This property is useful to a biologist using structure comparison to learn more about her query protein: the biologically informative neighbours are found at the top of the match list with relatively few false leads.
The Dali method has been used to systematically scan new structures against the Protein Data Bank (PDB) for some 15 years (Holm and Sander, 1994). The overall strategy is to screen the structure database with many different methods, starting with fast but unreliable ones and ending with the most sensitive but slow methods. This ensures that no significant similarity is missed. The search space is pruned between methods: if a strong match has been found, subsequent methods compare the query structure only to the neighbours of the strong match. This strategy requires that all the neighbours of the known structures are precomputed in all versus all fashion within a representative subset of structures. The size of the structure set has grown by two orders of magnitude since the system was introduced, and all versus all comparison is a quadratic problem in the number of structures. Recently, the paradigm of all versus all comparisons became untenable when the weekly PDB updates began to take more than a week to process.
DaliLite is a standalone package of the Dali algorithm. The first release of DaliLite (Holm and Park, 2000) contained all the functionality of the Dali server at EBI except the site-specific, complicated database update protocol. The main DaliLite program is a wrapper that calls a variety of methods for protein structure comparison. New workflows can thus be easily implemented by rewiring the regulatory logic but keeping the basic algorithms unchanged. In DaliLite v.3, we introduce new options for database searching (DaliLite –quick) and database updates (DaliLite –update). The new protocols improve server throughput and vastly simplify the updates, making the complete system portable.
The key change from earlier versions is that we abandon the all versus all matrix of similarities in favour of a connected graph of similarities. The nodes of the graph represent protein structures and the edges represent structural alignments. Whereas before each representative structure was directly linked to all its structurally similar neighbours, we now require only that there is a path of continuous structural similarity through the graph. The structural neighbours of a query structure are collected by walks through the graph. The graph can be less densely connected than the all versus all matrix, which saves computational effort. An added benefit is that the incremental updates of the structural similarity graph and the choice of structural representatives are completely decoupled.
| 2 METHODS |
|---|
2.1 PDB clustering
The PDB is highly redundant. The structures of some proteins and their mutants have been determined in various conditions, though the structures remain the same for classification purposes. We use a representative subset at 90% sequence identity level (PDB90), derived from the current set of PDB sequences using CD-HIT (Li and Godzik, 2006). The PDB contains over 100 000 structures (chains), which is reduced to about 20 000 PDB90 representatives. Further clustering of similar folds at lower levels of sequence identity was not cost effective.
2.2 Structural similarity graph
The structural similarity graph and alignment data are stored in a relational database (MySQL). The graph is updated incrementally. If a new structure has strong similarity to structures already in the graph, one edge is sufficient to connect the new structure to the graph in the proper neighbourhood. If there is no strong match, we compare the new structure to all existing structures and add edges for all significant similarities.
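The incremental update rule can be sketched as follows. This is a minimal illustration, not the DaliLite implementation: the graph is held in an in-memory dictionary rather than MySQL, and the names `add_structure`, `candidates`, `compare` and `is_strong` are hypothetical. For brevity `is_strong` is assumed to look only at the Z-score returned by `compare`.

```python
from typing import Callable, Dict, List

# Assumed in-memory stand-in for the database: node -> {neighbour: Z-score}
Graph = Dict[str, Dict[str, float]]


def add_structure(
    graph: Graph,
    new_id: str,
    candidates: List[str],
    compare: Callable[[str, str], float],
    is_strong: Callable[[float], bool],
    z_significant: float = 2.0,
) -> List[str]:
    """Attach new_id to the graph and return the structures it was linked to."""
    existing = [s for s in graph if s != new_id]
    graph.setdefault(new_id, {})
    scores: Dict[str, float] = {}

    def link(other: str, z: float) -> None:
        graph[new_id][other] = z
        graph.setdefault(other, {})[new_id] = z

    # Step 1: a single strong match is enough to place the new structure.
    for other in candidates:
        scores[other] = compare(new_id, other)
        if is_strong(scores[other]):
            link(other, scores[other])
            return [other]

    # Step 2: no strong match, so compare against every existing structure
    # and keep all significant similarities (Z-score above 2).
    linked = []
    for other in existing:
        z = scores[other] if other in scores else compare(new_id, other)
        if z > z_significant:
            link(other, z)
            linked.append(other)
    return linked
```

The cost of an update is one comparison in the favourable case and a full scan only when the new structure has no strong match, which is what keeps the graph sparser than an all versus all matrix.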
Similarity is measured by Dali Z-scores. Significant similarities have a Z-score above 2; they usually correspond to similar folds. Strong matches have sequence identity above 20% or a Z-score above a cutoff that depends on the size of the query protein. The Z-score cutoff was empirically set to n/10 – 4, where n is the number of residues in the query structure. We additionally require that the complete structure is covered by structural alignments; a segment of the query structure longer than 80 residues without any structural matches always disqualifies a strong match.
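The strong-match criteria above can be written as a small predicate. This is a sketch under stated assumptions: the function names are hypothetical, coverage is represented as a per-residue boolean mask over the query (True where some structural match aligns that residue), and the thresholds are treated as strict inequalities, which the text does not specify.

```python
from typing import Sequence


def strong_z_cutoff(n_residues: int) -> float:
    """Empirical cutoff from the text: n/10 - 4 for a query of n residues."""
    return n_residues / 10.0 - 4.0


def longest_uncovered_run(covered: Sequence[bool]) -> int:
    """Length of the longest stretch of query residues with no structural match."""
    longest = current = 0
    for is_covered in covered:
        current = 0 if is_covered else current + 1
        longest = max(longest, current)
    return longest


def is_strong_match(
    z_score: float,
    seq_identity: float,        # fraction in [0, 1]
    covered: Sequence[bool],    # assumed coverage mask, one entry per query residue
    max_gap: int = 80,
) -> bool:
    # An uncovered segment longer than 80 residues always disqualifies.
    if longest_uncovered_run(covered) > max_gap:
        return False
    return seq_identity > 0.20 or z_score > strong_z_cutoff(len(covered))
```

For a 300-residue query the cutoff is 300/10 - 4 = 26, so a fully covered match with a Z-score of 30 qualifies as strong, whereas a Z-score of 20 qualifies only if sequence identity exceeds 20%.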
2.3 Database searching
The database search option DaliLite –quick compares a query structure to all structures in the PDB, as organized in the structural similarity graph. To initiate a transitive search of structures in the graph, the query structure must be attached to some structural neighbours. Fast feature filters are often successful in finding near neighbours. We currently use sequence comparison by Blast, GTG sequence motifs (Heger et al., 2007) and secondary structure triplets to rank the structures in PDB90. We convert the feature filter scores to Z-scores in order to combine the ranked lists. The top 100 structures are compared using the normal Dali procedures. If a strong match is found, we move to the next step (transitive alignment). Otherwise, the query structure is compared against all 20 000 structures in PDB90.
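One way to combine ranked lists from several filters is sketched below. The text states only that filter scores are converted to Z-scores before combination; taking the best Z-score per structure across filters is an assumption made here for illustration, and `to_z_scores` and `shortlist` are hypothetical names.

```python
from statistics import mean, pstdev
from typing import Dict, List


def to_z_scores(raw: Dict[str, float]) -> Dict[str, float]:
    """Standardize one filter's raw scores to zero mean and unit variance."""
    mu = mean(raw.values())
    sigma = pstdev(raw.values())
    if sigma == 0:
        return {sid: 0.0 for sid in raw}
    return {sid: (score - mu) / sigma for sid, score in raw.items()}


def shortlist(filters: Dict[str, Dict[str, float]], top_n: int = 100) -> List[str]:
    """Rank structures by their best Z-score over all filters (assumed rule)."""
    best: Dict[str, float] = {}
    for raw in filters.values():
        for sid, z in to_z_scores(raw).items():
            best[sid] = max(z, best.get(sid, float("-inf")))
    # Highest combined score first; ties broken by identifier for determinism.
    return sorted(best, key=lambda sid: (-best[sid], sid))[:top_n]
```

Standardizing first puts filters with different score scales, such as sequence comparison scores and secondary structure triplet counts, on a common footing before the top 100 are passed on to the full Dali comparison.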
The entry points connect the query structure to one or more structures in the structural similarity graph. These are direct (first shell) neighbours of the query. Structures in the second shell are compared in batches of 100, selecting those with the strongest connections first. Connection strength is the lesser Z-score along the path from query to the first neighbour to the second neighbour. The transitive alignment (via first neighbour) between the query structure and second neighbour is used as the starting point for refinement, skipping the costly alignment optimization from scratch. The expansion is repeated until the connection strength drops below a Z-score cutoff of 2, or a maximum number of matches has been reported (default: MAX_HITS = 500).
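The shell-by-shell expansion can be sketched as a best-first walk ordered by connection strength. This is an interpretation rather than the DaliLite code: `transitive_search` and the `refine(candidate, via)` callback (which stands in for refining the transitive alignment and returning the resulting Z-score) are hypothetical, and each accepted hit is assumed to act as a first neighbour for the next round of expansion.

```python
import heapq
from typing import Callable, Dict, List, Set, Tuple

Graph = Dict[str, Dict[str, float]]  # node -> {neighbour: Z-score}


def transitive_search(
    graph: Graph,
    entry_points: Dict[str, float],        # first-shell neighbour -> Z-score to query
    refine: Callable[[str, str], float],   # (candidate, via) -> refined Z-score
    z_cutoff: float = 2.0,
    max_hits: int = 500,
    batch_size: int = 100,
) -> Dict[str, float]:
    hits = dict(entry_points)
    seen: Set[str] = set(hits)             # structures already compared to the query
    frontier: List[Tuple[float, str, str]] = []  # max-heap via negated strength

    def push_neighbours(node: str, z_node: float) -> None:
        for nbr, z_edge in graph.get(node, {}).items():
            if nbr not in seen:
                # Connection strength is the lesser Z-score along the path.
                heapq.heappush(frontier, (-min(z_node, z_edge), nbr, node))

    for node, z in entry_points.items():
        push_neighbours(node, z)

    while frontier and len(hits) < max_hits:
        batch = []
        while frontier and len(batch) < batch_size:
            neg_strength, cand, via = heapq.heappop(frontier)
            if -neg_strength < z_cutoff:
                frontier.clear()           # everything left is weaker still
                break
            if cand not in seen:
                seen.add(cand)
                batch.append((cand, via))
        for cand, via in batch:
            z = refine(cand, via)
            if z >= z_cutoff and len(hits) < max_hits:
                hits[cand] = z
                push_neighbours(cand, z)
    return hits
```

Because candidates are taken strongest connection first, the walk can stop as soon as the best remaining connection falls below the cutoff, without visiting the rest of the graph.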
| 3 RESULTS |
|---|
The utility of a protein structure database search method (i.e. similarity measure and optimization algorithm) depends on its ability to report interesting matches. As an illustration, we chose query and target structures representing diverse super-families from the four main structural classes in SCOP (Murzin et al., 1995): cytochromes c and winged helix DNA-binding domains from the all-alpha class, cupredoxins and PUA-like domains from the all-beta class, metallo-dependent hydrolases and alpha/beta hydrolases from the alpha/beta class, and lysozyme-like proteins and nucleotidyltransferases from the alpha+beta class (Table 1). Match lists were evaluated using the area under the curve (AUC), where the maximum value of one indicates perfect sensitivity and selectivity. Compared to optimizing the alignment from scratch (DaliLite –list), the new transitive search mode (DaliLite –quick) is about 30 times faster, with little effect on AUC (for this test, we removed all pre-existing edges from the query structures to the structural similarity graph). Compared to the Q-score of the SSM server (Krissinel and Henrick, 2004), the higher AUC values in Table 1 indicate superior discrimination of homologous proteins from unrelated proteins by Dali's Z-score.
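For readers unfamiliar with the metric, the sketch below computes an AUC from a ranked match list, assuming that AUC here denotes the area under a ROC curve and that each match is labelled as homologous to the query or not. The function name `roc_auc` is hypothetical, ties in score are ignored, and the text does not describe how the reported values were actually computed.

```python
from typing import Sequence


def roc_auc(ranked_labels: Sequence[bool]) -> float:
    """AUC of a match list sorted best-first; True marks a homologous match.

    Equals the fraction of (homolog, non-homolog) pairs in which the
    homolog is ranked above the non-homolog.
    """
    positives_seen = 0
    correctly_ordered_pairs = 0
    for is_homolog in ranked_labels:
        if is_homolog:
            positives_seen += 1
        else:
            # Every homolog already seen outranks this non-homolog.
            correctly_ordered_pairs += positives_seen
    n_pos = positives_seen
    n_neg = len(ranked_labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one homolog and one non-homolog")
    return correctly_ordered_pairs / (n_pos * n_neg)
```

A list with all homologs ranked ahead of all unrelated structures gives a value of one, matching the stated maximum, while a random ordering gives about 0.5.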
In conclusion, Dali remains a useful tool for structural bioinformatics. The Dali server has been running DaliLite –quick for a number of months now, with a throughput of 50 user queries—a mixture of redundant and unique structures—per day per CPU.
Funding: Academy of Finland (grants #109849 and #1105210).
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Thomas Lengauer
Received on June 10, 2008; revised on August 14, 2008; accepted on September 22, 2008
| REFERENCES |
|---|
Heger A, et al. The global trace graph, a novel paradigm for searching protein sequence databases. Bioinformatics (2007) 23:2361–2367.
Holm L, Park J. DaliLite workbench for protein structure comparison. Bioinformatics (2000) 16:566–567.
Holm L, Sander C. Searching protein structure databases has come of age. Proteins (1994) 19:165–173.
Krissinel E, Henrick K. Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions. Acta Cryst. (2004) D60:2256–2268.
Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics (2006) 22:1658–1659.
Murzin AG, et al. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. (1995) 247:536–540.
Sierk ML, Pearson WR. Sensitivity and selectivity in protein structure comparison. Protein Sci. (2004) 13:773–785.







