Bioinformatics Advance Access originally published online on January 28, 2008
Bioinformatics 2008 24(6):872-873; doi:10.1093/bioinformatics/btn040
On distance and similarity in fold space
Center of Applied Molecular Engineering, Division of Bioinformatics, Department of Molecular Biology, University of Salzburg, Hellbrunnerstr. 34, 5020 Salzburg, Austria
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Summary: Metric information on similarities and distances in fold space is essential for quantitative work in structural bioinformatics and structural biology. Here we derive a suitable metric for protein structures from the fundamental axioms of similarity. Derivation of the metric also clarifies the relationship between the interrelated concepts of distance and similarity.
Contact: sippl{at}came.sbg.ac.at
Quantitative work in structural bioinformatics requires a suitable metric for the description of similarities and distances in fold space. The primary tools to measure distances or similarities are structure alignment programs. The goal in structure alignment is to characterize the similarity between a query structure a and a target structure b by constructing an alignment. The alignment defines amino acid pairs between the query and the target that are considered to be structurally equivalent. Important parameters of structure alignments are the number of equivalent pairs of residues, Sa,b, also called the alignment length, the root-mean-square (rms) error computed from the superposition of the equivalent residue pairs (e.g. Sippl, 1982), or the percentage of identical amino acid pairs implied by the alignment (e.g. Sippl and Wiederstein, 2008).
William Taylor recently argued that it will be difficult, if not impossible, to find a general metric based on pairwise comparison that will provide a satisfactory classification (Taylor, 2007). To make progress it is advisable to distinguish the notion of a metric in fold space from the possible representations of fold space in terms of classifications. A metric is not a classification but it is quite advantageous to endow classifications with metric information. In particular, quantification of distance and similarity amongst protein structures is essential for the exploration of structural neighborhoods (Suhrer et al., 2007) and for navigation in fold space (Sippl et al., 2008).
To derive a suitable metric, we start with the number of equivalent residues, Sa,b, and call this quantity the similarity of a and b. Similarity thus defined has the following properties:
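$$ S_{a,b} = S_{b,a}, \qquad S_{a,a} \ge S_{a,b}, \qquad S_{a,b} + S_{b,c} - S_{a,c} \le S_{b,b}. $$
The first property is symmetry, the second states that no structure is more similar to a than a itself, where Sa,a equals the sequence length La, and the third is the capacity inequality.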
For example, we may define the distance between the folds a and b as,
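$$ D_{a,b} = S_{a,a} + S_{b,b} - 2 S_{a,b} = L_a + L_b - 2 S_{a,b}, $$
where La and Lb are the sequence lengths of a and b. Thus defined, distance counts the residues of a and b that have no structurally equivalent partner. It is symmetric, it vanishes for a = b, and the capacity inequality for similarity carries over to the triangle inequality for distance.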
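The step from the capacity inequality to the triangle inequality can be written out explicitly. The sketch below assumes that distance has the form $D_{a,b} = S_{a,a} + S_{b,b} - 2S_{a,b}$ and that the capacity inequality reads $S_{a,b} + S_{b,c} - S_{a,c} \le S_{b,b}$; both forms are assumptions made for this illustration.

```latex
\begin{align*}
D_{a,b} + D_{b,c} - D_{a,c}
  &= (S_{a,a} + S_{b,b} - 2S_{a,b}) + (S_{b,b} + S_{c,c} - 2S_{b,c})
     - (S_{a,a} + S_{c,c} - 2S_{a,c}) \\
  &= 2\,\bigl(S_{b,b} - S_{a,b} - S_{b,c} + S_{a,c}\bigr) \\
  &\ge 0 .
\end{align*}
```

The last line is the capacity inequality itself, so under these assumptions $D_{a,c} \le D_{a,b} + D_{b,c}$ holds whenever the similarities are mutually consistent.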
To continue it is necessary to clearly distinguish the inherent properties of distances and similarities from the actual measurement of such quantities. Distances measured by a yardstick do not necessarily satisfy the triangle inequality, since the actual values depend on the reliability and precision of the yardstick used as well as on the skills of the operator. The same considerations apply when we measure distances or similarities between folds using structure alignment programs.
The maximum number of equivalent residues, i.e. the similarity Sa,b, obtained from structure alignment is a convenient parameter for the description of the extent of structure similarity, provided the programs used ensure that the rms error of superposition stays below reasonable bounds (Feng and Sippl, 1996; Lackner et al., 1999; Sippl et al., 2001). The exact value of the rms error and the respective threshold are of secondary importance. Moreover, the number of equivalent residues is a robust estimator of similarity in the sense that for a given pair of folds suitable structure alignment programs are expected to report comparable values. In this way, the complementary measures of distance and similarity we have just derived provide a convenient, intuitive and general metric of similarity amongst protein structures which is applicable across a wide range of structure alignment programs.
As already noted, the precision and reliability of distances and similarities depend on the structure comparison programs used. Alignment programs may yield erroneous and inconsistent results and they may fail to detect existing relationships. In such cases, a metric is an indispensable guide since violations of the capacity or triangle inequalities reveal mutual inconsistencies in the measurements which can be identified and corrected. The metric enforces consistency.
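As a minimal sketch, such a consistency check could be automated along the following lines. The function name `capacity_violations` and the example matrix are hypothetical, and the capacity inequality is assumed to read Sa,b + Sb,c − Sa,c ≤ Sb,b, with the diagonal entries holding the sequence lengths.

```python
from itertools import permutations


def capacity_violations(S):
    """Return triples (a, b, c) with S[a][b] + S[b][c] - S[a][c] > S[b][b].

    S is a symmetric matrix (list of lists) of measured similarities, i.e.
    numbers of equivalent residues; S[i][i] is the length of structure i.
    Each unordered pair (a, c) is reported once per middle structure b.
    """
    n = len(S)
    return [
        (a, b, c)
        for a, b, c in permutations(range(n), 3)
        if a < c and S[a][b] + S[b][c] - S[a][c] > S[b][b]
    ]


# Hypothetical measurements for three structures of lengths 100, 120, 100.
# The alignment program reports 90 equivalent residues for the pairs (0, 1)
# and (1, 2), but only 10 for the pair (0, 2).
S = [
    [100, 90, 10],
    [90, 120, 90],
    [10, 90, 100],
]

print(capacity_violations(S))
```

Here the triple (0, 1, 2) is flagged: with 90 of the 120 residues of structure 1 matched to each of the other two structures, at least 60 residues must be shared, so a reported similarity of 10 between structures 0 and 2 points to a missed relationship or an erroneous alignment.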
A particular difficulty in the classification of protein structures originates from the diversity of sequence lengths. The structure of a small protein a may be contained in a larger protein b, a relationship which is inherently asymmetric. In contrast, the symmetry of distance and similarity is demanded by the axioms and hence, by definition, these quantities are unable to deal with asymmetric relationships. Moreover, there are important constraints which follow from the axioms. In particular, the distance of two structures is bounded below by
$$ D_{a,b} \ge |L_a - L_b|, $$
the difference in sequence length between a and b.
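A lower bound of this kind follows in two lines. The sketch assumes the distance $D_{a,b} = L_a + L_b - 2S_{a,b}$ and uses the fact that an alignment cannot contain more equivalent pairs than the shorter chain has residues, $S_{a,b} \le \min(L_a, L_b)$; the explicit form of $D_{a,b}$ is an assumption of this illustration.

```latex
\begin{align*}
D_{a,b} &= L_a + L_b - 2S_{a,b} \\
        &\ge L_a + L_b - 2\min(L_a, L_b) \\
        &= \max(L_a, L_b) - \min(L_a, L_b) = |L_a - L_b| .
\end{align*}
```

For example, with $L_a = 100$ and $L_b = 250$ the distance is at least 150, even when a is completely contained in b.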
Since variation in size and asymmetry of relationships cannot be captured by distance or similarity alone we need additional concepts. To cope with the variation in protein size we define the relative similarity,
$$ s_{a,b} = \frac{2 S_{a,b}}{L_a + L_b}, $$
and, correspondingly, the relative distance,
$$ d_{a,b} = \frac{D_{a,b}}{L_a + L_b}, $$
so that da,b + sa,b = 1. Again the relationship emphasizes the close kinship of similarity and distance. But normalization has a price since, in general, relative distance and relative similarity do not satisfy the capacity and triangle inequalities. Nevertheless, as long as we are aware of possible violations of the axioms we may use the normalized quantities to our advantage.
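The price of normalization can be illustrated with a small numerical sketch. It assumes the forms D = La + Lb − 2S, s = 2S / (La + Lb) and d = D / (La + Lb); the function names and the three-structure example are hypothetical.

```python
def distance(S_ab, L_a, L_b):
    """Absolute distance: residues of a and b without an equivalent partner."""
    return L_a + L_b - 2 * S_ab


def relative_similarity(S_ab, L_a, L_b):
    """Similarity normalized by the combined sequence length."""
    return 2 * S_ab / (L_a + L_b)


def relative_distance(S_ab, L_a, L_b):
    """Distance normalized by the combined sequence length."""
    return distance(S_ab, L_a, L_b) / (L_a + L_b)


# Hypothetical two-domain protein b (100 residues) whose domains a and c
# (50 residues each) share no equivalent residues with each other.
L = {"a": 50, "b": 100, "c": 50}
S = {("a", "b"): 50, ("b", "c"): 50, ("a", "c"): 0}

D = {pair: distance(S[pair], L[pair[0]], L[pair[1]]) for pair in S}
d = {pair: relative_distance(S[pair], L[pair[0]], L[pair[1]]) for pair in S}

# Absolute distances respect the triangle inequality (here with equality).
print(D[("a", "c")] <= D[("a", "b")] + D[("b", "c")])
# Relative distances do not: d(a, c) = 1 exceeds d(a, b) + d(b, c) = 2/3.
print(d[("a", "c")] <= d[("a", "b")] + d[("b", "c")])
```

In this example the absolute distances are 50, 50 and 100, whereas the relative distances are 1/3, 1/3 and 1, so only the normalized quantities violate the triangle inequality.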
Relative similarity and relative distance are still symmetric in a and b. We capture the asymmetric portion of the relationship by defining the relative cover of a with respect to b as
$$ c_{a,b} = \frac{S_{a,b}}{L_a}. $$
In general ca,b ≠ cb,a, but ca,b = cb,a whenever La = Lb. When multiplied by 100, relative similarity, relative distance and relative cover are expressed as percentages of sequence lengths. Now, when a is completely contained in b then ca,b = 100%, but if at the same time b is twice the size of a then cb,a is only 50%. Thus, distance, similarity, and cover provide a small set of parameters suitable for a quantitative description of the often complex and intricate pairwise structural relationships encountered in fold space (Sippl et al., 2008). Moreover, these parameters form a proper basis for clear and unambiguous communication within the structural biology arena. For example, the question frequently arises whether or not a newly determined structure or perhaps a newly designed protein, a, corresponds to a novel fold. Declarations like "a is a novel fold" are a source of controversy since they represent opinions as opposed to scientific facts. The maximum relative cover ca,b, where b is the most similar structure found in the database of known protein folds, gives a precise quantitative answer, leaving no room for confusion or controversy.
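The containment example and the novelty question can be sketched in a few lines. The relative cover is taken here as Sa,b / La, which is what the 100% and 50% figures imply; the function names, hit identifiers and similarity values are hypothetical.

```python
def relative_cover(S_ab, L_a):
    """Fraction of structure a that is structurally equivalent to b."""
    return S_ab / L_a


# a (100 residues) is completely contained in b (200 residues).
L_a, L_b, S_ab = 100, 200, 100
print(100 * relative_cover(S_ab, L_a))  # cover of a by b, in percent
print(100 * relative_cover(S_ab, L_b))  # cover of b by a, in percent


def max_relative_cover(L_query, similarities):
    """Return the best-covering hit and the cover of the query it achieves.

    `similarities` maps a hit identifier to the number of equivalent
    residues reported for the query and that hit.
    """
    best = max(similarities, key=similarities.get)
    return best, similarities[best] / L_query


# Hypothetical alignment results for a 120-residue query structure.
hits = {"hit_1": 45, "hit_2": 78, "hit_3": 60}
print(max_relative_cover(120, hits))
```

A low maximum cover indicates that no known structure accounts for most of the query, which is the quantitative counterpart of the statement that a fold is novel.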
We finally note that the metric relationships derived here are quite general. A metric is a theoretical framework with a certain logical structure. The essential link between a metric and real objects is the discovery of certain quantifiable and measurable relationships among these objects which behave like distances or similarities. In the present context, the link is provided by equating the theoretical structure of similarity on the one hand with the number of equivalent residue pairs among protein structures on the other. The proper yardsticks are effective and accurate structure alignment programs like TopMatch (Sippl and Wiederstein, 2008).
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Anna Tramontano
Received on December 24, 2007; revised on January 14, 2008; accepted on January 24, 2008
| REFERENCES |
|---|
Feng ZK, Sippl MJ. Optimum superimposition of protein structures: ambiguities and implications. Fold. Des. (1996) 1:123–132.
Lackner P, et al. Automated large scale evaluation of protein structure predictions. Proteins (1999) Suppl 3:7–14.
Sippl MJ. On the problem of comparing protein structures. J. Mol. Biol. (1982) 156:359–388.
Sippl MJ, et al. Assessment of the CASP4 fold recognition category. Proteins (2001) 45:55–67.
Sippl MJ, Wiederstein M. A note on difficult structure alignment problems. Bioinformatics (2008) 24:426–427.
Sippl MJ, et al. A discrete view on fold space. Bioinformatics (2008) 24:870–871.
Suhrer SJ, et al. QSCOP – SCOP quantified by structural relationships. Bioinformatics (2007) 23:513–514.
Taylor WR. Evolutionary transitions in protein fold space. Curr. Opin. Struct. Biol. (2007) 17:354–361.