

Bioinformatics Advance Access originally published online on January 28, 2008
Bioinformatics 2008 24(6):872-873; doi:10.1093/bioinformatics/btn040

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

On distance and similarity in fold space

Manfred J. Sippl *

Center of Applied Molecular Engineering, Division of Bioinformatics, Department of Molecular Biology, University of Salzburg, Hellbrunnerstr. 34, 5020 Salzburg, Austria

*To whom correspondence should be addressed.


    ABSTRACT

Summary: Metric information on similarities and distances in fold space is essential for quantitative work in structural bioinformatics and structural biology. Here we derive a suitable metric for protein structures from the fundamental axioms of similarity. Derivation of the metric also clarifies the relationship between the interrelated concepts of distance and similarity.

Contact: sippl@came.sbg.ac.at

Quantitative work in structural bioinformatics requires a suitable metric for the description of similarities and distances in fold space. The primary tools to measure distances or similarities are structure alignment programs. The goal in structure alignment is to characterize the similarity between a query structure a and a target structure b by constructing an alignment. The alignment defines amino acid pairs between the query and the target that are considered to be structurally equivalent. Important parameters of structure alignments are the number of equivalent pairs of residues, Sa,b, also called the alignment length, the root-mean-square (rms) error computed from the superposition of the equivalent residue pairs (e.g. Sippl, 1982), or the percentage of identical amino acid pairs implied by the alignment (e.g. Sippl and Wiederstein, 2008).

William Taylor recently argued that it will be difficult, if not impossible, to find a general metric based on pairwise comparison that will provide a satisfactory classification (Taylor, 2007). To make progress it is advisable to distinguish the notion of a metric in fold space from the possible representations of fold space in terms of classifications. A metric is not a classification but it is quite advantageous to endow classifications with metric information. In particular, quantification of distance and similarity amongst protein structures is essential for the exploration of structural neighborhoods (Suhrer et al., 2007) and for navigation in fold space (Sippl et al., 2008).

To derive a suitable metric, we start with the number of equivalent residues, Sa,b, and call this quantity the similarity of a and b. Similarity thus defined has the following properties:


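\[
S_{a,a} = L_a, \qquad S_{a,b} \ge 0, \qquad S_{a,b} = S_{b,a}, \qquad S_{a,b} + S_{a,c} - L_a \le S_{b,c},
\]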

where La is the length of the query a. These properties are axioms in the mathematical sense. They express the essential properties of the notion of similarity in an exact form. Hence, the self similarity Sa,a equals sequence length, the similarity Sa,b is positive or zero and is symmetric in a and b. The remaining inequality expresses a simple but nevertheless fundamental property of similarity. Whenever the sum of the similarities Sa,b and Sa,c exceeds the length of a, then b and c necessarily share similarity, where the similarity Sb,c is bounded below by Sa,b + Sa,c − La. We call this relationship the capacity inequality. These basic properties need to be satisfied by any measure that is used to quantify the notion of similarity. From the four axioms of similarity other properties may be deduced.

For example, we may define the distance between the folds a and b as,


Formula

From this definition and the axioms of Sa,b we immediately obtain


\[
D_{a,a} = 0, \qquad D_{a,b} \ge 0, \qquad D_{a,b} = D_{b,a}, \qquad D_{a,b} + D_{a,c} \ge D_{b,c}.
\]

The last relationship is the triangle inequality which is a consequence of the capacity inequality, i.e. insertion yields,


Formula

We have derived the distance as a function of similarity but we may invert our reasoning and derive the properties of similarity from the familiar axioms of distances, where we define similarity as


Formula

This demonstrates that distance Da,b and similarity Sa,b are just two sides of the same coin. Given the similarity, we immediately get the distance and vice versa. Measurement of either quantity yields complete information on distance and similarity at once.
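The interconversion can be illustrated with a minimal sketch. It assumes that the distance takes the form D = La + Lb − 2S, i.e. the number of residues in both structures left unmatched by the alignment. This particular form is an assumption made here for illustration only; it is consistent with the properties stated above (zero self-distance, symmetry, and a triangle inequality that follows from the capacity inequality), but a different scaling would serve equally well. The function names are hypothetical.

```python
def distance_from_similarity(s_ab: int, len_a: int, len_b: int) -> int:
    """Distance under the ASSUMED form D = La + Lb - 2*S.

    s_ab is the number of equivalent residue pairs; len_a and len_b
    are the sequence lengths of the two structures.
    """
    return len_a + len_b - 2 * s_ab


def similarity_from_distance(d_ab: int, len_a: int, len_b: int) -> int:
    """Invert the assumed definition: S = (La + Lb - D) / 2.

    Under the assumed form, La + Lb - D is always even, so integer
    division loses nothing.
    """
    return (len_a + len_b - d_ab) // 2


# Made-up example: structures of 120 and 150 residues, 100 equivalent pairs.
d = distance_from_similarity(100, 120, 150)
print(d)                                      # 70
print(similarity_from_distance(d, 120, 150))  # 100
print(distance_from_similarity(120, 120, 120))  # 0, self-distance
```

Either function recovers the other quantity from the two sequence lengths alone, which is the sense in which one measurement yields both.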

To continue it is necessary to clearly distinguish the inherent properties of distances and similarities from the actual measurement of such quantities. Distances measured by a yardstick do not necessarily satisfy the triangle inequality, since the actual values depend on the reliability and precision of the yardstick used as well as on the skills of the operator. The same considerations apply when we measure distances or similarities between folds using structure alignment programs.

The maximum number of equivalent residues, i.e. the similarity Sa,b, obtained from structure alignment is a convenient parameter for the description of the extent of structure similarity, provided the programs used ensure that the rms error of superposition stays below reasonable bounds (Feng and Sippl, 1996; Lackner et al., 1999; Sippl et al., 2001). The exact value of the rms error and the respective threshold are of secondary importance. Moreover, the number of equivalent residues is a robust estimator of similarity in the sense that for a given pair of folds suitable structure alignment programs are expected to report comparable values. In this way, the complementary measures of distance and similarity we have just derived provide a convenient, intuitive and general metric of similarity amongst protein structures which is applicable across a wide range of structure alignment programs.

As already noted, the precision and reliability of distances and similarities depend on the structure comparison programs used. Alignment programs may yield erroneous and inconsistent results and they may fail to detect existing relationships. In such cases, a metric is an indispensable guide since violations of the capacity or triangle inequalities reveal mutual inconsistencies in the measurements which can be identified and corrected. The metric enforces consistency.
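Such a consistency check might be sketched as follows. The sketch tests every triple of structures against the capacity inequality, Sa,b + Sa,c − La ≤ Sb,c, and reports the triples that violate it. The function name, the data layout (similarities keyed by unordered pairs) and the numbers are hypothetical and serve only as an illustration; how a flagged measurement should be corrected is not addressed here.

```python
from itertools import combinations


def capacity_violations(lengths, similarity):
    """List triples (a, b, c) whose measured similarities break
    S_ab + S_ac - L_a <= S_bc.

    lengths:    dict mapping structure name -> sequence length
    similarity: dict mapping frozenset({x, y}) -> number of equivalent
                residue pairs; every pair of structures must be present
    Returns tuples (a, b, c, required_minimum_S_bc, measured_S_bc).
    """
    def s(x, y):
        return similarity[frozenset((x, y))]

    violations = []
    for a in lengths:
        others = [x for x in lengths if x != a]
        for b, c in combinations(others, 2):
            required = s(a, b) + s(a, c) - lengths[a]
            if s(b, c) < required:
                violations.append((a, b, c, required, s(b, c)))
    return violations


# Made-up measurements: b and c each cover most of a, so they must share
# at least 80 + 70 - 100 = 50 equivalent residues, but only 30 are reported.
lengths = {"a": 100, "b": 90, "c": 85}
measured = {
    frozenset(("a", "b")): 80,
    frozenset(("a", "c")): 70,
    frozenset(("b", "c")): 30,
}
print(capacity_violations(lengths, measured))  # [('a', 'b', 'c', 50, 30)]
```

A reported violation indicates that at least one of the three alignments involved is unreliable, for instance because the program failed to detect an existing relationship between b and c.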

A particular difficulty in the classification of protein structures originates from the diversity of sequence lengths. The structure of a small protein a may be contained in a larger protein b, a relationship which is inherently asymmetric. In contrast, the symmetry of distance and similarity is demanded by the axioms and hence, by definition, these quantities are unable to deal with asymmetric relationships. Moreover, there are important constraints which follow from the axioms. In particular, the distance of two structures is bounded below by


Formula

Hence, two proteins can have a distance of zero only if they have the same size. On the other hand, similarity has an upper bound which is the length of the smaller protein,


\[
S_{a,b} \le \min(L_a, L_b).
\]

If a small protein is completely contained in a much larger protein, then the distance is necessarily large, although at the same time the similarity may attain its maximum. At first sight the combination of these properties may seem counterintuitive, but they are straightforward consequences of the axioms.

Since variation in size and asymmetry of relationships cannot be captured by distance or similarity alone we need additional concepts. To cope with the variation in protein size we define the relative similarity,


Formula

and the relative distance,


Formula

where we note that


\[
s_{a,b} + d_{a,b} = 1
\]

and 0 ≤ sa,b ≤ 1. Again the relationship emphasizes the close kinship of similarity and distance. But normalization has a price since, in general, relative distance and relative similarity do not satisfy the capacity and triangle inequalities. Nevertheless, as long as we are aware of possible violations of the axioms we may use the normalized quantities to our advantage.
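A small numerical sketch may illustrate how normalization can break the triangle inequality. It rests on assumptions that are made here for illustration only: the absolute distance is taken as D = La + Lb − 2S, the relative similarity as s = 2S/(La + Lb), and the relative distance as d = 1 − s. The example structures are hypothetical: a two-domain protein a of 100 residues and its two isolated domains b and c of 50 residues each, which are fully contained in a but share no equivalent residues with each other.

```python
from fractions import Fraction


def absolute_distance(s_ab, len_a, len_b):
    """ASSUMED form of the distance: D = La + Lb - 2*S."""
    return len_a + len_b - 2 * s_ab


def relative_similarity(s_ab, len_a, len_b):
    """ASSUMED normalization: s = 2*S / (La + Lb), kept exact as a Fraction."""
    return Fraction(2 * s_ab, len_a + len_b)


def relative_distance(s_ab, len_a, len_b):
    """ASSUMED normalization: d = 1 - s."""
    return 1 - relative_similarity(s_ab, len_a, len_b)


len_a, len_b, len_c = 100, 50, 50
s_ab, s_ac, s_bc = 50, 50, 0  # satisfies the capacity inequality: 50 + 50 - 100 <= 0

# Triangle inequality D_ab + D_ac >= D_bc for the absolute distance: 50 + 50 >= 100.
abs_ok = (absolute_distance(s_ab, len_a, len_b)
          + absolute_distance(s_ac, len_a, len_c)
          >= absolute_distance(s_bc, len_b, len_c))

# The same test for the relative distance: 1/3 + 1/3 >= 1 fails.
rel_ok = (relative_distance(s_ab, len_a, len_b)
          + relative_distance(s_ac, len_a, len_c)
          >= relative_distance(s_bc, len_b, len_c))

print(abs_ok, rel_ok)  # True False
```

Under these assumptions the absolute distances satisfy the triangle inequality with equality, whereas the normalized values do not satisfy it at all.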

Relative similarity and relative distance are still symmetric in a and b. We capture the asymmetric portion of the relationship by defining the relative cover of a with respect to b as


\[
c_{a,b} = \frac{S_{a,b}}{L_a},
\]

and the relative cover of b with respect to a, as


\[
c_{b,a} = \frac{S_{a,b}}{L_b}.
\]

In general ca,b ≠ cb,a, but ca,b = cb,a whenever La = Lb. When multiplied by 100, relative similarity, relative distance and relative cover are expressed as percentages of sequence lengths. Now, when a is completely contained in b then ca,b = 100%, but if at the same time b is twice the size of a then cb,a is only 50%.
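The containment example can be reproduced with a short sketch that takes the relative cover of a structure to be the similarity divided by that structure's own length, expressed as a percentage. The function name and the numbers are hypothetical.

```python
def relative_cover(s_ab: int, len_a: int, len_b: int) -> tuple[float, float]:
    """Return (c_ab, c_ba) in percent.

    c_ab = 100 * S / La is the fraction of a covered by the alignment,
    c_ba = 100 * S / Lb is the fraction of b covered by the alignment.
    """
    if not 0 <= s_ab <= min(len_a, len_b):
        raise ValueError("similarity must lie between 0 and the shorter length")
    return 100.0 * s_ab / len_a, 100.0 * s_ab / len_b


# Made-up example: a (80 residues) is completely contained in b (160 residues).
print(relative_cover(80, 80, 160))  # (100.0, 50.0)
```

The two values differ whenever the lengths differ, which is how the pair captures the asymmetry that the symmetric quantities cannot express.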

Thus, distance, similarity, and cover provide a small set of parameters suitable for a quantitative description of the often complex and intricate pairwise structural relationships encountered in fold space (Sippl et al., 2008). Moreover, these parameters form a proper basis for a clear and undisturbed communication within the structural biology arena. For example, the question frequently arises whether or not a newly determined structure or perhaps a newly designed protein, a, corresponds to a novel fold. Declarations like ‘a is a novel fold’ are a source of controversy since they represent opinions as opposed to scientific facts. The maximum relative cover ca,b, where b is the most similar structure found in the database of known protein folds, gives a precise quantitative answer, leaving no room for confusion or controversy.

We finally note that the metric relationships derived here are quite general. A metric is a theoretical framework with a certain logical structure. The essential link between a metric and real objects is the discovery of certain quantifiable and measurable relationships among these objects which behave like distances or similarities. In the present context, the link is provided by equating the theoretical structure of similarity on the one hand with the number of equivalent residue pairs among protein structures on the other. The proper yardsticks are effective and accurate structure alignment programs like TopMatch (Sippl and Wiederstein, 2008).

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Anna Tramontano

Received on December 24, 2007; revised on January 14, 2008; accepted on January 24, 2008

    REFERENCES

    Feng ZK, Sippl MJ. Optimum superimposition of protein structures: ambiguities and implications. Fold. Des. (1996) 1:123–132.

    Lackner P, et al. Automated large scale evaluation of protein structure predictions. Proteins (1999) Suppl 3:7–14.

    Sippl MJ. On the problem of comparing protein structures. J. Mol. Biol. (1982) 156:359–388.

    Sippl MJ, et al. Assessment of the CASP4 fold recognition category. Proteins (2001) 45:55–67.

    Sippl MJ, Wiederstein M. A note on difficult structure alignment problems. Bioinformatics (2008) 24:426–427.

    Sippl MJ, et al. A discrete view on fold space. Bioinformatics (2008) 24:870–871.

    Suhrer SJ, et al. QSCOP – SCOP quantified by structural relationships. Bioinformatics (2007) 23:513–514.

    Taylor WR. Evolutionary transitions in protein fold space. Curr. Opin. Struct. Biol. (2007) 17:354–361.





