Skip Navigation


Bioinformatics Advance Access originally published online on February 4, 2007
Bioinformatics 2007 23(7):832-841; doi:10.1093/bioinformatics/btm022
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/7/832    most recent
btm022v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Viksna, J.
Right arrow Articles by Gilbert, D.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Viksna, J.
Right arrow Articles by Gilbert, D.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Assessment of the probabilities for evolutionary structural changes in protein folds

Juris Viksna 1,* and David Gilbert 2

1Institute of Mathematics and Computer Science, University of Latvia, Rainis boulevard 29, Riga LV-1459, Latvia and 2Bioinformatics Research Centre, Glasgow University, A416 Davidson Building, Glasgow G12 8QQ, UK

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: The evolution of protein sequences can be described by a stepwise process, where each step involves changes of a few amino acids. In a similar manner, the evolution of protein folds can be at least partially described by an analogous process, where each step involves comparatively simple changes affecting few secondary structure elements. A number of such evolution steps, justified by biologically confirmed examples, have previously been proposed by other researchers. However, unlike the situation with sequences, as far as we know there have been no attempts to estimate the comparative probabilities for different kinds of such structural changes.

Results: We have tried to assess the comparative probabilities for a number of known structural changes, and to relate the probabilities of such changes with the distance between protein sequences. We have formalized these structural changes using a topological representation of structures (TOPS), and have developed an algorithm for measuring structural distances that involve few evolutionary steps. The probabilities of structural changes then were estimated on the basis of all-against-all comparisons of the sequence and structure of protein domains from the CATH-95 representative set.

The results obtained are reasonably consistent for a number of different data subsets and permit the identification of several ‘most popular’ types of evolutionary changes in protein structure. The results also suggest that alterations in protein structure are more likely to occur when the sequence similarity is >10% (the average similarity being ~6% for the data sets employed in this study), and that the distribution of probabilities of structural changes is fairly uniform within the interval of 15–50% sequence similarity.

Availability: The algorithms have been implemented on the Windows operating system in C++ and using the Borland Visual Component Library. The source code is available on request from the first author. The data sets used for this study (representative sets of protein domains, matrices of sequence similarities and structural distances) are available on http://bioinf.mii.lu.lv/epsrc_project/struct_ev.html.

Contact: juris.viksna{at}mii.lu.lv


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
The evolution of protein sequences can be described by a stepwise mutation process, each step usually consisting of a small number of insertions, deletions or substitutions of amino acids in the protein sequence, or in some cases, permutations or rearrangements of sequence fragments. In most of the cases, such sequence evolution steps lead to very small changes in protein structure and generally do not change the protein fold. Still, it is known that there are a number of evolutionarily related pairs of proteins having different folds (Grishin, 2001), implying that in some cases a small mutation in the protein sequence should lead to a change in the protein fold. In a proportionally large number of such cases the changes in the fold are generally small and, similarly as for sequences, can be described by insertions, deletions or rearrangements of a few secondary structure elements (SSEs). Consequently, the evolution of protein folds can be at least partially described by a stepwise process, each step involving a small change in the protein fold.

Several authors have suggested sets of such possible ‘fold mutations’, either motivated by biologically confirmed examples (Grishin, 2001; Kinch and Grishin, 2002), or, in some cases, hypothetical sets (Matsuda et al., 2003; Przytycka et al., 2002) motivated by the exploration of known fold space.

The types of mutations mentioned earlier presumably are the result of accumulated point-mutations (indels and substitutions) in protein sequence. There are also a number of biologically confirmed cases where the likely cause of structural changes are circular permutations of protein fragments, caused by gene duplication and subsequent truncation of protein (Jung and Lee, 2001; Peisajovic et al., 2006; Weiner et al., 2005). However, the current knowledge about these types of structural changes seems to be insufficient to propose a complete set of mutations of this type. There is also an interesting study (Lupas et al., 2001), which for certain protein groups proposes ‘pathways’ of structure evolution involving both of these types of structural changes.

Nevertheless, the proposed sets of fold changes that are based on pointmutations usually are sufficiently rich to permit the transformation of any fold into another and are the ones usually considered in ‘global’ analyses of fold spaces. However, in contrast to sequence mutations, the existence of a valid fold mutation of such a type does not imply that folds are evolutionary related. The number of biochemically acceptable spatial folds seems to be limited, and there are a number of ‘popular’ folds, which as far as we know, contain several groups of evolutionary unrelated proteins, although there are fold groups within which all proteins are considered to be evolutionarily related (Holm and Sanders, 1997). It is interesting to note that there are also some protein folds (e.g. ß-helices) for which there exist a credible biological explanation how these might have evolved (Jenkins and Pickersgill, 2001), however, the intermediate steps of such evolution appear to be absent from the set of known protein structures.

Thus, even if we were able to find the right set of all naturally occurring fold mutations and estimate the probabilities with which each of these mutations tend to occur, we would be unable to estimate the evolutionary distance of two proteins just from looking at their folds. Another difficulty is that we are unlikely to find such a good model of fold mutations for which probability estimates will be reasonably tight—for the ones suggested so far the probability of mutations is likely to be quite dependent on many other factors not covered by the proposed model. Still, interesting questions in their own right are what could be the comparative probabilities for the suggested fold changes, whether we are able to provide some estimates for them, and whether these probabilities are somehow related to the sequence distance between two proteins. Also, such probability estimates could permit the measurement of the evolutionary distance between two folds, although within large error bounds, under the assumption that they are evolutionarily related.

Regarding the relationship between structure and sequence similarities, the generally accepted point of view is that sequence similarity >25% in most cases implies almost identical structure at fold level. However, it is also known that proteins with 50% sequence similarity could have completely different folds, e.g. the artificially modified Janus protein (Dalal et al., 1997). Also in this work we have found several pairs of proteins with similar but different structures and sequence similarity >60–70%.


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
The idea of our approach is to select one (or several) comparatively large sets of protein structures; for each pair of proteins within a set to compute both a distance between protein sequences and the smallest number of fold mutations that transform one protein into another; and then try to relate sequence distances with the number of fold mutations. If a particular fold mutation tends to occur between proteins with a sequence similarity that is noticeably above average, it could be argued that the mutations of this type are biologically significant. This also permits the computation of comparative probabilities of different mutations that have been found to be of biological significance.

2.1 Topological representation of protein structures
For our study we use a topological representation of protein structures that is largely based on TOPS (Michalapoulos et al., 2004; Westhead et al., 1999) with some features (chiralities, relative orientation of {alpha}-helices as well as ß-strands from different sheets) ignored. Essentially a protein structure is represented as a vertex-ordered graph with two types of vertices (corresponding to {alpha}-helices and ß-strands) and two types of edges between ß-strand vertices (indicating that these strands are connected by hydrogen bonds, and are oriented either in the same or in the opposite directions). As an example, topological representation for CATH domain 1fjgQ00 is shown in Figure 1.


Figure 1
View larger version (22K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. Traditional ribbon representation of CATH domain 1fjgQ00 (above) and the corresponding topological representation (below).

 
Whilst there is no doubt that such a simplified representation of protein folds loses a significant amount of information, for example the size of SSEs and their spatial positions, it permits the construction of a reasonably simple formalization of evolutionary steps that can be applied to protein structures. Also, studies done using TOPS suggest that this representation distinguishes well between different folds, and that the cases when distinct folds have the same representation are quite rare (Viksna et al., 2003), at least for proteins having a ß-sheet rich structure for which this representation is best suited. The representation also corresponds very well to the level of abstraction used by other authors (Grishin, 2001; Matsuda et al., 2003; Przytycka et al., 2002) for describing the possible evolutionary changes of protein structure (i.e. it is well suited to describe such events as insertions/deletions/permutations of SSEs).

The representations of protein structures used in this study were extracted by a simple program from the existing TOPS database. It should be noted that SSE assignments in TOPS are based on DSSP program (Kabsch and Sander, 1983).

2.2 Evolutionary steps of protein folds
Probably the largest well-defined set of evolutionary mutations of protein folds, motivated by biologically confirmed examples, has been proposed by Grishin (2001). The proposed mutations are the following (each of them can occur in both directions):

  1. {alpha}-helix    {leftrightarrow}    ß-strand,
  2. {alpha}-helix    {leftrightarrow}    sheet of 3 ß-strands,
  3. ß-strand    {leftrightarrow}    sheet of 3 ß-strands,
  4. loop        {leftrightarrow}    {alpha}-helix,
  5. loop        {leftrightarrow}    sheet of 3 ß-strands,
  6. loop        {leftrightarrow}    ß-strand (extension of sheet with a new ß-strand at its end),
  7. loop        {leftrightarrow}    ß-strand (insertion of a new ß-strand within a sheet/barrel),
  8. permutations of SSEs (without giving too explicit details).
Matsuda et al. (2003) have studied the connectivity of fold space under the assumption that most of the mutations are of type 6; in this case the known protein structures can be clustered in ~20 evolutionarily related superfamilies. A related work has been done by Przytycka et al. (2002), although the authors do not consider directly fold mutations, but propose four ‘folding motifs’ from which a large part of all known ß-folds can be constructed.

Our choice of set of mutations is largely based on Grishin's proposal. However, partially in order to reduce the amount of computations and partially for reasons discussed later, we have not used ‘substitution type’ rules that can be easily split into subsequent ‘deletion’ and ‘insertion’ (e.g. the rule {alpha}-helix{leftrightarrow}ß-strand can be replaced by two rules: {alpha}-helix{leftrightarrow}loop and loop{leftrightarrow}ß-strand). In addition, we have visually inspected topological representations of proteins from several folds and have tried to detect typical topology variations within each fold. The choice of our rules/modifications is partially motivated by these observations. The set of fold mutations that we have used for our experiments is the following:

(H) loop    {leftrightarrow}    {alpha}-helix,
(S2) loop    {leftrightarrow}    sheet of 2 ß-strands,
(S3) loop    {leftrightarrow}    sheet of 3 ß-strands,
(AE) loop    {leftrightarrow}    ß-strand (extension of sheet at its end with a new ß-strand, sequentially adjacent to the last strand of sheet),
(NE) loop    {leftrightarrow}    ß-strand (extension of sheet at its end with a new ß-strand, sequentially non-adjacent to the last strand of sheet),
(P) ß-strand    {leftrightarrow}    sheet of 3 ß-strands,
(JS) forming a new sheet by joining the ends of two ß-sheets,
(JB) forming a new barrel by joining the ends of a ß-sheet,
(W) swapping two sequentially adjacent ß-strands.
The mutation types P, S3 and H are the same as types 3, 5 and 4 proposed by Grishin; moreover, our types AE and NE are just two subtypes of Grishin's type 6. The motivation behind type S2 is that there are many folds containing sheets with just two ß-strands, which probably are more likely to form directly than by losing a strand from sheets with three elements (which is the only available option if this mutation is not permitted). Also, this mutation corresponds to the ‘hairpin rule’ given by Przytycka et al. (2002). Mutation types JS, JB and W are largely motivated by our observations, type W being a specific subcase of type 8.

Apart from the fact that a mutation of type S3 gives the same fold change as a sequence of two mutations S2 and AE, the proposed mutation types are ‘practically’ independent (i.e. one type cannot be expressed by a short sequence of other types). There are some situations when mutation of type W does not change a protein's structure (e.g. when applied to adjacent ß-strands of a two element sheet), therefore, we only consider this mutation in cases when such a change does occur.

2.3 Algorithm for computing evolutionary distances
The problem of finding the smallest number of evolution steps that transform one given structure into another may be tackled by looking for the largest ‘common part’ of the two structures and then finding the evolutionary changes that explain the differences between these structures. This might appear to be the best option to solve the problem for a given pair of structures; however, finding the largest ‘common part’ in vertex-ordered graphs is already a harder problem than that of finding a largest common subgraph, which itself is a NP-hard problem. Our previous experience with algorithms for structure comparison (Viksna and Gilbert, 2001) using a similar topological representation suggests that such an approach will be infeasible when computing all-against-all comparisons for datasets containing around 1000 structures or more.

Alternatively, one can try to compute all possible 0 to n-step evolutionary modifications for each of the structures and then split the set of all such modifications into classes of equivalent structures. Each pair of these equivalent structures will correspond to an evolutionary process involving from 1 to 2n steps which transforms one structure into another. This method clearly puts strong limitations on the number of evolutionary steps that can be reconstructed; however, for a small number of steps this approach could be far more efficient than solving the problem for each pair of structures separately. In our experiments, we have used a value of n = 1, which permits the computation of distances of up to 2 evolutionary steps. A realistic upper bound for data sets of 1000 structures would appear to be n = 3, which could be achieved by using dedicated hardware.

However, the hardness of the problem actually derives from the fact that in general, the comparison of ß-sheets is a difficult graph problem. Computing just the number of helix inserts/deletions is a trivial string comparison problem, solvable in linear time. Thus, we have subdivided the problem into two parts: we employ an exhaustive search to look for 0–2-step evolutionary modifications involving only ß-strands, and then for each pair of structures that are related when ignoring the positions of helices, we compute the additional number of helix inserts/deletions that are necessary to transform one of the structures into another (there is no upper bound of helix insertions/deletions that can be computed in this way). Such an approach significantly reduces the search space, since by ignoring the positions of helices the number of possible modifications for each structure is considerably reduced.

Note that when computing all possible evolutionary modifications, one needs to use each of the mutation rules in both directions (with the exception of type W, which is identical with its inverse). Also, we consider only those mutations that do not involve {alpha}-helices (i.e. we exclude type H). Thus, in our case, for each protein we have to apply 15 different mutations, each in all the possible positions in which it can be used.

In more detail, the algorithm for finding evolutionary distances dij and corresponding sets of mutations mij between all pairs of protein folds pi, pj isin S is as follows. We assume that each protein pi is represented as a pair <si, hi>, where si is a topological representation for a protein's ß-strands only and hi encodes the information about the positions of {alpha}-helices with respect to strands in si. R' = {S2, S3, AE, NE, P, JS, JB,W } denotes the set of ‘primary’ mutations; the inverse mutation of xisinR' is denoted by inv(x); and R = {x | xisinR' {vee} inv(x)isinR'} is the set of all 15 available mutations (described earlier); 0 denotes the ‘empty’ mutation (which leaves the protein unchanged).

ALGORITHM. Evolutionary distances.

  1. Initialize mij = Ø for all i, j; dij = + {infty} for i != j and dij = 0 for i = j.
  2. For each protein < si,hi > isin S, each mutation rkisinR and each position l in si, in which mutation rk could be applied, compute the pair < rk (si, l), rk >, where rk(si, l) is a protein obtained by applying mutation rk in position l of protein si. Let M = { < rk (si,l), rk, i, > | {forall} i, k, l} {cup} { < si,0,1>|,{forall} i}.
  3. Sort elements of M according to values of s of its elements < s, r, i > (just treating the used encodings as strings of bytes) and then split M into equivalences classes according to values of s.
  4. For each equivalence class CisinM and for all distinct pairs < s, r1, i >, <s, r2, i> isin C
    1. Set the number of strand mutations ds = 1, if r1 = 0 or r2 = 0, otherwise ds = 2.
    2. Compute the number of helix mutations dh from the values of s, hi and hj.
    3. If dij ≥ ds + dh, then set dij = ds + dh and set mij = {r | risinR' {wedge} (risin{r1, r2} {vee} inv(r)isin{r1, r2})}.

For the data sets we used, which contained ~1000 protein domains with up to 70 SSEs in each domain, there were on average ~250 mutations for each domain, and the encoding for each structure required ~600 bytes of memory. The running time of the program was ~15 s on 3 GHz processor with memory requirements of ~150 MB. This implies that finding of up to 4 strand mutations (n = 2) could be done in few hours of time, but would require 40 GB of RAM memory; case n = 3 (up to 6 mutations) would require a couple of weeks and 10 TB of memory.

2.4 The choice of representative setsof protein structures
Our data sets are based on domains from the CATH (Orengo et al., 1997) classification. One motivation for the use of CATH is that CATH domains are relatively small in comparison with those of SCOP. Hence, fewer mutational changes could be expected between two domains. CATH version 2.4.0 was used, for the reason that the TOPS database contains domains precomputed for this version. However, for convenience we use here the CATH version 3.0.0 naming conventions for protein domains.

Since most of the mutation rules involve ß-strands, only those domains with a rich ß-sheet structure have been selected, that is only proteins from CATH class 2 (proteins with mainly ß-strands, usually also containing some {alpha}-helices) and class 3 (mixed {alpha}–ß proteins). The proteins from classes 2 and 3 were treated as separate sets, because very few ‘short’ evolutionary relations are to be expected between these groups, and also to check how similar will be the probability estimates obtained for these two sets.

A more complicated question was the selection of representative sets for these classes. Ideally all existing folds should be contained in the representative set and each fold should be represented proportionally to the number of proteins from this fold that occur in nature. Other issues are the resolution/quality of the structures and whether one should use engineered proteins. Besides that, in order to obtain statistically more significant results, it is desirable to use as large representative sets as possible.

Several versions of representative sets were evaluated. Initially we have tried to use all domains from the corresponding CATH classes (the nice advantage of this approach was much larger sizes of data sets), however, it turned out that inclusion of proteins with too similar sequences severely biased results. There are two reasons for this: practically all structural differences observed for very similar proteins had to be attributed to DSSP ‘noise’, and these sets included groups of similar proteins that are largely overrepresented due to being a subject of intense studies.

Reasonably stable results were obtained by using the CATH-95 representative set (containing proteins with up to 95% sequence similarity) of protein domains, and this set was used for the results described here (having removed structures obtained with NMR technology as well as those with a resolution greater than 3 Å). Also, just one (random) representative protein was selected from sets of proteins having the same topological depiction. This was probably the most problematic decision, since there seem to be two reasons for topologies containing many proteins—either these topologies are ‘preferred’ by nature, or they just contain groups of proteins that are the subject of intense studies (not all such groups could be eliminated by threshold on sequence similarity). Unfortunately, whilst the inclusion of the first type is probably motivated, according to our observations the inclusion of the second type of topologies may severely bias the statistics.

The representative set obtained for CATH class 2 contains 685 domains and the representative set for CATH class 3 contains 1182 domains. In this article, we refer to them as CATH2 and CATH3 respectively.


    3 RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
3.1 Some examples of fold mutations
We have visually inspected a number of proteins pairs, which according to the results of our program were linked by fold mutations. Although a number of such cases where clearly attributable to the ambiguity of splitting a domain into SSEs (especially for protein pairs with high sequence similarity and/or for mutations of type H), in most cases where sequence similarity suggested that the proteins are evolutionary related the discovered fold change also appeared to be biologically motivated.

As it seems that in general, there is no a single authority regarding secondary structure division in strands and helices. As a reference we used ribbon-style representations provided by CATH.

Figures 2 and 3 show examples involving sheet insertions and sheet extensions (these are the types that were detected between protein pairs connected by just one fold mutation).


Figure 2
View larger version (43K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. Example showing sheet insertion of type S2. Domains 1tme200 and 2mev200 are viral proteins and have 63% sequence identity according to BLAST and 73% sequence similarity according to Smith–Waterman score. The place of extension is marked by arrows. Also, the example suggests a helix deletion (type H) in another region of protein.

 

Figure 3
View larger version (26K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. Example showing sheet extension of type AE. Domains 1lilA01 and 1bjmA01 are immunoglobulins, and have 53% sequence identity (BLAST) and 53% sequence similarity (Smith–Waterman score).

 
3.2 Estimations of probabilities for fold mutations
Experiments were performed separately on sets CATH2 and CATH3. For each pair of proteins, we computed both the similarity between protein sequences by using the ssearch implementation of Smith–Waterman algorithm (Smith and Waterman, 1981), and the minimal number of structural changes by using our program. The sequence similarity between two proteins P1 and P2 was represented by a normalized score, computed by


Formula

Normalized scores were rounded up to integer values and for each d from 0 to 100 and for each type of fold mutation X we computed the values e(X, d, 1) (the number of observed mutations of type X between pairs of proteins that differ by one fold mutation and having sequence similarity d) and e(X, d, 2) (the weighted number of observed mutations of type X between pairs of proteins that differ by 2-fold mutations and having sequence similarity d). The weighted number of mutations was computed by increasing the value e(X, d, 2) by 1/m for each mutation of type X observed between a pair of proteins with m different two-step fold mutation paths between them.

Let n(d) denote the total number of protein pairs in a test set with sequence similarity d. Then the values e(X, d, 1)/n(d) and e(X, d, 2)/n(d) give the probability distributions for type X mutation, obtained by taking into account protein pairs with one or two fold mutations between them. The value p(X, d, t) = {sum}i = 0 ... d e(X,i, t)/n(i) is the probability that two random proteins with sequence similarity d will be connected by t-fold mutations, at least one of which involves a mutation of type X.

We start with two observations. First, since the data sets contain few pairs of proteins with large sequence similarities (Fig. 4); the observed mutation probabilities in this region are quite random. Although few structurally different proteins were noticed even with very high sequence similarity (e.g. >90%), these instances were attributable to inaccuracies caused by how the DSSP program splits proteins into SSEs. Reasonably smooth distributions were obtained for sequence similarities of up to 50% (for CATH2 in some cases of up to 70%), thus we restricted our attention to the similarity interval from 0% to 50%. Interestingly this 50% upper bound coincides with 50% bound for ‘Paracelsus challenge’ (Rose and Creamer, 1994), however, this is likely to be a coincidence (we just have too few protein pairs with larger sequence similarities to have meaningful results in this similarity region).


Figure 4
View larger version (17K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4. Number of pairs of proteins with sequence similarity in intervals from 0–10% to 90–100% in sets CATH2 (left) and CATH3 (right).

 
Secondly, the observed number of helix mutations (type H) is much higher than the total number of all other mutation types together (Fig. 5), especially for CATH3; this type is also the most affected by DSSP ‘noise’. As a result, we can just claim that mutations of type H are the most frequent, but cannot give motivated quantitative estimates. Also, this means that when counting probabilities for strand mutations we should consider protein pairs that are additionally separated by few helix mutations as well. Further, by {Delta}s and {Delta}h we correspondingly denote the numbers of strand mutations and helix mutations.


Figure 5
View larger version (17K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5. Distribution of probabilities for mutations of type H and summary probabilities of all other types for CATH2 (left) and CATH3 (right). In this and all following diagrams, x-axis represent sequence similarity, and values on y-axis show the relative probabilities of different types of fold mutations observed for pairs of proteins with this sequence similarity.

 
The overall estimates of probabilities for most of the types of mutations obtained for {Delta}s = 1 and {Delta}s = 2 were roughly consistent. However, the distribution for {Delta}s = 1 is quite noisy (Fig. 6) due to the small number of data points. Nevertheless the results indicate that most frequent types of mutations are AE, NE (sheet extensions) and S2. An interesting observation is that no mutations of types S3 and W were detected for {Delta}s = 1, although these types are equally likely as type S2 for {Delta}s = 2 (a possible explanation might be that their impact on overall structure is so large that they may occur only together with some other changes).


Figure 6
View larger version (13K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 6. Distributions of probabilities for mutation types AE, NE and S2 for CATH2 with {Delta}s = 1 and {Delta}h ≤ 3.

 
Figure 7 depicts the distribution of probabilities of mutation types NE, AE, S2, S3 and W, obtained for CATH2 with {Delta}s = 2 and either {Delta}h ≤ 3 or {Delta}h≤+{infty}. Although it is clear that permitting a high number of helix mutations leads to the inclusion of ‘false homologues’ in the results, overall the probability distribution does not change much between {Delta}h ≤ 3 and {Delta}h≤ + {infty}. The probability of fold mutations becomes noticeable when the sequence similarity exceeds 10%. Because the average similarity between random pairs is 6%, this suggests that mutations are more likely to indicate that the proteins involved are homologous, rather than reflecting random relationships between folds.


Figure 7
View larger version (20K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 7. Distributions of probabilities for mutations of types AE, NE, S2, S3 and W for CATH2 with {Delta}s = 2 and {Delta}h ≤ 3 (above) or {Delta}h≤+{infty} (below).

 
Typical estimates for a sequence similarity threshold above which proteins are likely to be homologous are ~25% (Brenner et al., 1998; Gan et al., 2002). Thus regions above this threshold can be used to estimate the comparative probabilities for different types of mutations. Note, however, that probability values remain reasonably ‘stable’ within the 15–50% sequence similarity range.

The observed probabilities for types P, JS and JB are noticeably smaller than those for types AE, NE, S3, S2 and W, and are shown in Figure 8 (type S2 is included for comparison). These values are also relatively stable within the 15–50% sequence similarity interval.


Figure 8
View larger version (12K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 8. Distributions of probabilities for mutations of types S2, P, JS and JB for CATH2 with {Delta}s = 2 and {Delta}h ≤ 3.

 
Figures 9 and 10 show probability distributions obtained from set CATH3. The qualitative comparisons between different types of mutations are generally the same, with AE and NE being the most frequent, followed by S2, S3 and W, although the quantitative differences are probably somewhat smaller. In general, the curves are more ‘noisy’; this could be explained by the presence of a smaller number of pairs of proteins with larger sequence similarity in CATH3 compared with CATH2.


Figure 9
View larger version (13K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 9. Distributions of probabilities for mutations of types AE, NE, S2, S3 and W for CATH3 with {Delta}s = 1 and {Delta}h ≤ 3.

 

Figure 10
View larger version (11K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 10. Distributions of probabilities for mutations of types S2, P, JS and JB for CATH3 with {Delta}s = 1 and {Delta}h ≤ 3.

 
Table 1 shows the relative probabilities for strand mutations obtained from both data sets CATH2 and CATH 3 by using different values of {Delta}s and {Delta}h. The probabilities have been computed given the assumption that a strand mutation has occurred (i.e. the sum of probabilities for all types is 1) and are based on probability distributions within the interval of 15–50% sequence similarity. The number of data points on which are results are based is also shown.


View this table:
[in this window]
[in a new window]

 
Table 1. Relative probabilities for strand mutations for CATH2 and CATH3 data sets and different values of {Delta}s and {Delta}h

 
In general, just three sets of values (CATH2 with {Delta}s = 2 and {Delta}h ≤ 3 or {Delta}h≤+{infty}, and CATH3 with {Delta}s = 2 and {Delta}h ≤ 3) are based on a sufficient number of data points to be trusted. Nevertheless the probabilities obtained are reasonably consistent. A notable is the difference between the probabilities for sheet extensions, and also for the creation of new sheets, between the CATH2 and CATH3 data sets (CATH3 has smaller probabilities for types AE and NE, and, correspondingly, larger probabilities for types S2 and S3). However, dissimilar results for different classes of protein structures are not particularly surprising, since the model of protein structure considered here is a generalization that ignores many aspects that may affect the real probabilities of fold changes.


    4 DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Our results provide some empirical justification regarding certain types of evolutionary fold changes, based on a model that incorporates a formal description of structural evolutionary modifications, and also shed more light on the details of such changes. In particular they confirm that sheet extensions are the most frequent fold change of those that involve ß-strands (as suggested e.g. in Matsuda et al., 2002) with the overall probability for sheet extension being ~50–65%. Although the probabilities for types AE and NE are very similar (slightly larger for NE), suggesting that it is equally likely that a sheet will be extended by a new strand either in sequentially adjacent or non-adjacent positions, the number of adjacent positions is noticeably smaller (i.e. just two) compared with all potential positions. This implies that it is more likely that a sheet will be extended by a new strand in a sequentially adjacent than in another position. The next most popular types of fold change appear to be S2, S3 and W, occurring in 5–15% of all cases. Types P, JS and JB appear to be the least frequent, occurring in 0–5% of cases.

Some problems with the available protein structure data sets need to be recognized, which are unlikely to be remedied in the near future. The coverage of structures in the PDB is biased for several reasons. First due to limitations of the currently available techniques structures can only be determined for a subset of all proteins. Secondly due to a lack of resources in the scientific community, only a subset of structures that can in principle be determined are in fact solved, the choice of targets being largely influenced by the interests of the scientists involved.

Our future research aims in this area include the development of tractable algorithms to increase the maximum evolutionary distance that we can compute, with the goal of being able to determine the distance between any pair of proteins in a reasonable time. We also plan to develop a quantitative method to estimate the probabilities of helix insertions/deletions in comparison with mutations involving ß-strands. This will then allow us to determine the evolutionary relationships between all the members of a given set of proteins, and hence to obtain further insights into evolutionary fold space.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
The authors would like to thank David Westhead and Gilleain Torrance for useful comments regarding biological aspects of this work as well as an anonymous referee for drawing our attention to studies about the structure changes based on circular permutations. The research was supported by EPRSC grant EP/C00373X/1.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Anna Tramontano

Received on October 10, 2006; revised on January 18, 2007; accepted on January 21, 2007

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Brenner S, et al. Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl. Acad. Sci. USA, ( (1998) ) 95, : 6073–6078.[Abstract/Free Full Text].

    Dalal S, et al. Protein alchemy: changing ß-sheet into {alpha}-helix. Nat. Struct. Biol, ( (1997) ) 4, : 548–552.[CrossRef][ISI][Medline].

    Holm L, Sanders C. Decision support system for the evolutionary classification of protein structures. Proc. ISMB, ( (1997) ) 97, : 140–146..

    Gan H, et al. Analysis of protein sequence/structure similarity relationships. Biophys. J, ( (2002) ) 83, : 2781–2791.[ISI][Medline].

    Grishin N. Fold change in evolution of protein structures. J. Struct. Biol, ( (2001) ) 134, : 167–185.[CrossRef][ISI][Medline].

    Jenkins J, Pickersgill R. The architecture of parallel ß-helices and related folds. Prog. Biophys. Mol. Bio, ( (2001) ) 77, : 111–175.[CrossRef][ISI][Medline].

    Jung J, Lee B. Circularly permuted proteins in the protein structure database. Prot. Sci, ( (2001) ) 10, : 1881–1886.[Abstract/Free Full Text].

    Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, ( (1983) ) 22, : 2577–2637.[CrossRef][ISI][Medline].

    Kinch L, Grishin N. Evolution of protein structures and functions. Curr. Opin. Struct. Biol, ( (2002) ) 12, : 400–408.[CrossRef][ISI][Medline].

    Lupas N, et al. On the evolution of protein folds: are similar motifs in different protein folds the result of convergence, insertion, or relics of an ancient peptide world? J. Struct. Biol, ( (2001) ) 134, : 191–203.[CrossRef][ISI][Medline].

    Matsuda K, et al. Finding evolutionary relations beyond superfamilies: fold based superfamilies. Prot. Sci, ( (2003) ) 12, : 2239–2251.[Abstract/Free Full Text].

    Michalapoulos I, et al. TOPS: an enhanced database of protein structural topology. Nucleic Acids Res, ( (2004) ) 32, : D251–D254.[Abstract/Free Full Text].

    Orengo C, et al. CATH—a hierarchic classification of protein domain structures. Structure, ( (1997) ) 5, : 1093–1108.[Medline].

    Peisajovic S, et al. Evolution of new protein topologies through multistep gene rearrangements. Nat. Genet, ( (2006) ) 38, : 168–174.[CrossRef][ISI][Medline].

    Przytycka T, et al. Recursive domains in proteins. Prot. Sci, ( (2002) ) 11, : 409–417.[Abstract/Free Full Text].

    Rose G, Creamer T. Protein folding: predicting predicting. Proteins: Struct. Func. Genet, ( (1994) ) 19, : 1–3.[CrossRef].

    Smith T, Waterman M. Identification of common molecular subsequences. J. Mol. Biol, ( (1981) ) 147, : 195–197.[CrossRef][ISI][Medline].

    Viksna J, Gilbert D. Pattern matching and pattern discovery algorithms for protein topologies. Lect. Notes Comput. Sci, ( (2001) ) 2149, : 98–111..

    Viksna J, et al. Protein structure comparison based on profiles of topological motifs: a feasible way to deal with information from negative examples. In: Proc. German Bioinfor. Conf., ( (2003) ) 159–165..

    Weiner J, et al. Rapid motif-based prediction of circular permutations in multi-domain proteins. Bioinformatics, ( (2005) ) 21, : 932–937.[Abstract/Free Full Text].

    Westhead D, et al. Protein structural topology: automated analysis and diagrammatic representation. Prot. Sci, ( (1999) ) 8, : 8797–904..


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/7/832    most recent
btm022v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Viksna, J.
Right arrow Articles by Gilbert, D.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Viksna, J.
Right arrow Articles by Gilbert, D.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?