Bioinformatics Advance Access originally published online on January 22, 2007
Bioinformatics 2007 23(6):717-723; doi:10.1093/bioinformatics/btm006
© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

On the relationship between sequence and structure similarities in proteomics

Evgeny Krissinel

European Bioinformatics Institute, Genome Campus, Hinxton, Cambridge CB10 1SD, UK


    ABSTRACT

Motivation: The underlying assumption of many sequence-based comparative studies in proteomics is that different aspects of protein structure and therefore functionality may be linked to particular sequence motifs. This holds true if sequence similarity is sufficiently high, but in general the relationship between protein sequence and structure appears complex and is not well understood.

Results: Statistical analysis of multiple and pairwise structural alignments of protein structures within SCOP folds is performed. The results indicate that multiple conservation of residue identity is not common and that the relationship between sequence and structure may be explained by a model based on the assumption that protein structure is tolerant to residue substitutions that preserve the hydropathic profile of the sequence. This model also explains the origin and specific value of the sequence similarity threshold, noted in many previous studies, below which structural resemblance is not statistically expected.

Contact: keb@ebi.ac.uk


    1 INTRODUCTION
Comparative studies play an important role in bioinformatics. It is widely assumed that structural resemblance of proteins implies their functional similarity. This assumption is important for a range of practical problems addressed by bioinformatics, from a better understanding of biochemical processes in the cell to drug discovery. It is also widely assumed that structural features are closely related to sequence composition. Although a protein with a given sequence may potentially exist in different conformations, the chances that two close sequences will fold into distinctly different structures are so small that they are often neglected in research practice.

There is, however, a limit to which structure and sequence similarities may be equated. As has been established in several studies (Kinjo and Nishikawa, 2004; Rost, 1999), protein pairs with a sequence identity higher than 35–40% are very likely to be structurally similar. Structural similarity in pairs with a sequence identity of 20–35%, often referred to as the ‘twilight zone’, is considerably less common; less than 10% of protein pairs with sequence identity below 25% have similar structures. At the same time, the ‘twilight zone’ is characterized by an explosion of false negatives (Rost, 1999), which means that many dissimilar sequences appear to be structural homologues. Although there are examples of homologous protein pairs with <10% sequence identity (Brenner et al., 1996; Holm and Sander, 1996; Hubbard et al., 1997; Valencia et al., 1991), it has been found in many studies (Chothia, 1992; Chothia and Lesk, 1986; Hubbard and Blundell, 1987; Krissinel and Henrick, 2004) that the likelihood of structural homologues with <20% sequence identity is negligibly small. This is illustrated by Figure 1, which shows the correlation between residue conservation C and structure similarity scores (Krissinel and Henrick, 2004):


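$$
C = \frac{N_{\mathrm{cons}}}{N_{\mathrm{align}}}, \qquad
Q = \frac{N_{\mathrm{align}}^{2}}{\left[1 + \left(\mathrm{RMSD}/R_0\right)^{2}\right] N_1 N_2}, \qquad
N_m = \frac{N_{\mathrm{align}}}{\min(N_1, N_2)} \qquad (1)
$$

Here N1 and N2 stand for the total number of residues in two compared structures, RMSD is r.m.s.d. between Nalign pairs of residues found in geometrically equivalent positions, Ncons of which are occupied by pairs of identical residues, and R0 = 3 Å is an empirical parameter (Krissinel and Henrick, 2004). The correlations are represented in the form of reduced density of probability: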


Formula 2

(2)
where ρ(x,C) is the density of probability that a randomly selected pair of structures in the Protein Data Bank (PDB) (Berman et al., 2000) will have structure similarity score x and residue conservation C.


Fig. 1. Correlations between residue conservation C and structure similarity scores: (A) Q-score, (B) r.m.s.d., (C) normalized alignment length Nm [cf. Equation (1)], represented as contour maps of the reduced probability density of finding the corresponding pairs of similar structures in the PDB [cf. Equation (2)], from Krissinel and Henrick (2004). The outermost contours correspond to the level of 0.05 of the maximum. The data suggest that, on average, structures are dissimilar at C ≤ C0 = 0.2; see discussion in the text.

 
No perfect score for structural similarity has been proposed so far. Neither r.m.s.d. nor alignment length alone is truly indicative, because one may always be improved at the expense of the other. The Q-score represents a balance between r.m.s.d. and Nalign, and was found to be a considerably better score when structural similarity is not self-evident (Krissinel and Henrick, 2004). Q-scores range from 0, where no similarity exists, to 1, where structures are identical. Empirically, close structural similarity is suggested by RMSD ≤ 2 Å, Nm ≥ 0.8 and Q ≥ 0.4.
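The balance between r.m.s.d. and alignment length can be made concrete with a small numerical sketch. It assumes the Q-score definition of Krissinel and Henrick (2004) with R0 = 3 Å, and it assumes that Nm is normalized by the length of the shorter chain; the function names and the example numbers are illustrative only and are not taken from SSM.

```python
def q_score(n_align, rmsd, n1, n2, r0=3.0):
    """Q = Nalign^2 / ((1 + (RMSD/R0)^2) * N1 * N2); r0 = 3 A is an assumed constant."""
    return n_align ** 2 / ((1.0 + (rmsd / r0) ** 2) * n1 * n2)


def normalized_alignment_length(n_align, n1, n2):
    """Nm, assumed here to be Nalign divided by the length of the shorter chain."""
    return n_align / min(n1, n2)


def similarity_report(n_align, rmsd, n1, n2):
    """Evaluate the three empirical criteria quoted in the text, one by one."""
    q = q_score(n_align, rmsd, n1, n2)
    nm = normalized_alignment_length(n_align, n1, n2)
    return {
        "Q": q,
        "Nm": nm,
        "rmsd_ok": rmsd <= 2.0,   # RMSD <= 2 A
        "nm_ok": nm >= 0.8,       # Nm >= 0.8
        "q_ok": q >= 0.4,         # Q >= 0.4
    }


# Hypothetical pair: 90 aligned residues at 1.5 A for chains of 100 and 110 residues.
print(similarity_report(90, 1.5, 100, 110))
```

A longer alignment raises Q while a larger r.m.s.d. lowers it, so neither quantity can be improved at the expense of the other without the score reflecting it.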

As may be seen from Figure 1, C0 = 0.2 represents a threshold: above it all three scores indicate structurally similar proteins, and below it (C < 0.2) dissimilar ones. The particular value of C0 has received little, if any, discussion in the works referenced above or elsewhere. However, a few intriguing questions arise here. Consider a family of structurally similar proteins Pn = {P1, P2, ..., Pn}. If sequence variations within Pn were completely random, then having residue conservation C(Pi,Pj) ≥ 0.2 between proteins Pi and Pj, and C(Pj,Pk) ≥ 0.2, would not necessarily mean that C(Pi,Pk) ≥ 0.2. One may think of two types of sequence relationship within structure families, exemplified in Figure 2, that may provide C(Pi,Pj) ≥ 0.2 for any Pi, Pj from Pn. Multiple residue conservation (type A) is particularly appealing for bioinformatic applications. It suggests that protein features, such as fold and functionality, may be due to the presence of specific sequence motifs in certain structure positions. This promises a discovery of relationships between structure, function and sequence evolution, and, in fact, many bioinformatic studies exploit this sort of largely intuitive hypothesis. Closed pairwise residue conservation (type B) has different implications. Here, protein structure cannot be so unambiguously associated with a particular sequence; therefore, many different sequences may fold into similar structures (which, indeed, is the case). As a consequence, protein structure and function do not appear clearly sequence-related.


Fig. 2. Two possible types of sequence relationships in structure families: (A) multiple residue conservation, (B) closed pairwise residue conservation. Dotted lines denote mapping of identical residues in structurally equivalent positions. Any protein pair in (A) and any protein pair in (B) have the same number of conserved residues; however, the character of relationships within the families is completely different. See discussion in the text.

 
Obviously, multiple residue conservation (Fig. 2A) should not differ significantly from pairwise conservation (Fig. 2B) in structure families with a high sequence similarity. However, this case is of little practical value because the requirement of a high C in comparative studies limits their predictive power. As C decreases, the sequence–structure relationship is expected to gradually shift to type B (Fig. 2B). When C falls below a value of 0.2, as Figure 1 suggests, no sound structure similarity is statistically expected.

Although relatively clear in outline, the above aspects of the relationship between sequence and structure similarities remain largely unstudied in detail. The ‘critical’ value of C0 = 0.2 appears confirmed for a sufficiently large number of different data sets; however, its origin remains unexplained. At the same time, the topic is worth further study because of the importance of the sequence–structure relationship in protein science. This article attempts to provide an insight into the sequence and structure relationship by presenting a statistical analysis of available protein structure families. In particular, the manifestation of multiple and pairwise sequence relationships as a function of residue conservation is studied and explained, and a probable origin of the threshold value C0 is suggested.


    2 APPROACH, DATA SET AND METHODS
Multiple and pairwise residue conservation types in a particular structure family Pn are identified by the corresponding 3D alignment of proteins Pi. Multiple alignment looks for residues that occupy geometrically equivalent positions in all n structures, and pairwise alignment does the same for each of the n(n-1)/2 protein pairs (Pi, Pj) separately. A residue is considered conserved if 3D alignment equivalences it with identical residue(s).
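The difference between the two conservation types can be illustrated with a toy family of three sequences. The sketch below is a simplification: it assumes that the same set of structurally equivalent positions is used for the multiple and for every pairwise comparison, whereas in the study each pair is aligned separately. The sequences are invented for illustration.

```python
from itertools import combinations


def multiple_conservation(columns):
    """Fraction of aligned positions where all structures carry the same residue."""
    conserved = sum(1 for col in columns if len(set(col)) == 1)
    return conserved / len(columns)


def pairwise_conservation(columns, i, j):
    """Fraction of aligned positions where structures i and j carry the same residue."""
    conserved = sum(1 for col in columns if col[i] == col[j])
    return conserved / len(columns)


# Hypothetical family: each pair shares two identical residues,
# but no position is conserved across all three members (type B).
family = ["ALSDVF", "ALTEIY", "GKTEVF"]
columns = list(zip(*family))  # one tuple of residues per aligned position

for i, j in combinations(range(len(family)), 2):
    print(i, j, pairwise_conservation(columns, i, j))
print("multiple:", multiple_conservation(columns))
```

Here every pair has C = 1/3, well above 0.2, while the multiple conservation is zero, which is the situation sketched as closed pairwise conservation.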

Residue conservation may be expressed by the probability distribution ρ(C) of finding Ncons = C·Nalign conserved residues among a total of Nalign aligned residues. Distributions ρA(C) and ρB(C) for multiple and pairwise conservations, respectively, are calculated from the corresponding 3D alignments of structure families found in the PDB. The families should be chosen in a way that allows for sufficient sequence diversity, on the one hand, and assures a reasonable structure similarity, on the other. A few protein structure classifications are available. CATH (Pearl et al., 2003) and FSSP (Holm and Sander, 1996) represent automatic classifications, while SCOP (Murzin et al., 1995) is a manually curated resource. Although no significant differences in results obtained for different classifications should be expected, SCOP was chosen in order to avoid any artifacts that might arise from employing a 3D aligner different from the one used for the classification. SCOP offers a 5-level structure hierarchy (numbers given for release 1.69): 70859 domains grouped into 2845 families, 1539 superfamilies, 945 folds and 7 classes, in order of decreasing structure resemblance. SCOP folds appear to be the most appropriate set of structure families for the purposes of the present study, because SCOP families and superfamilies contain closely related proteins, which implies high sequence similarity, and structure resemblance between different folds inside a class is normally very low.

3D pairwise and multiple alignments were performed with the algorithms used in the SSM web service at the European Bioinformatics Institute. Pairwise alignment in SSM is a two-step procedure (Krissinel and Henrick, 2004). First, seed alignments are generated by matching graphs built on secondary structure elements (SSEs). More than one seed alignment may be generated if a few alternatives with similar scores are found. Then the seed alignment(s) are refined by matching pairs of Cα atoms in the compared structures. A set of different seed alignments and several Cα mappings for each of them are explored in order to maximize the Q-score [cf. Equation (1)]. Multiple alignment in SSM starts with a set of all-to-all pairwise SSM alignments in the structure family. Using these alignments as a first approximation, the algorithm probes alternative mappings for those SSEs that do not have equivalences in all other structures. Each alternative mapping results in a decrease of pairwise scores, but may increase the multiple alignment score if all SSE equivalences are found. At the final stage, multiple Cα mapping is performed using an approach equivalent to the central star method in multiple sequence alignment (Gusfield, 1997). A detailed description of the multiple SSM algorithm is given in Krissinel and Henrick (2005). The quality and performance of the pairwise SSM algorithm have been acknowledged in an independent study (Kolodny et al., 2005). Limited comparisons of multiple SSM and other multiple alignment algorithms [MASS, Dror et al. (2003), and Combinatorial Extension, Guda et al. (2004)], performed in Krissinel and Henrick (2005), show good overall agreement.

Because SSM uses secondary structure for obtaining seed alignments, PDB entries of proteins without secondary structure, and those for which it cannot be calculated (e.g. where only backbone Cα atoms are given), were excluded from the data set. Next, only one structure from those with sequence identity higher than 50% to each other was left per SCOP fold in order to focus on the situation where structure similarity shows a clear dependence on C. The 50% threshold is suggested by the correlation map between C and Q-score in Figure 1A, which shows that at C ≥ 0.5 the Q-score does not depend on C and ranges between 0.8 and 1. The other structure scores, RMSD and Nm (Fig. 1B,C), do not exhibit this feature so clearly. Finally, folds with fewer than three remaining structures were removed from the data set.


    3 RESULTS AND DISCUSSION
The resulting data set included 28 980 protein chains in 506 SCOP folds, which is slightly more than a chain per PDB entry (a total of 25 973 PDB entries are annotated in SCOP release 1.69). This indicates a reasonable coverage of PDB content, taking into account that many PDB entries contain non-crystallographically related copies of the same sequence in the asymmetric unit.

The obtained distributions ρA(C) and ρB(C) are shown in Figure 3. As seen from the figure, the distributions approach each other at residue conservation C ≥ 0.25. A near merge of the curves at C ≥ 0.4 is expected considering that, in families with high sequence identity, a large fraction of residues is conserved in both pairwise and multiple alignments. Consider, as an example, a protein family P3 = {P1, P2, P3}. If residue conservation in pairs (P1, P2) and (P2, P3) is high, then one can expect that residue conservation in (P1, P3) will also be high. Then multiple alignment (P1, P2, P3) reduces to the superposition of all pairwise alignments, thus preserving residue mapping as in Figure 2A. In this case, both types of residue conservation should occur equally frequently, which indeed is confirmed by the data in Figure 3. As follows from the absolute values of ρA(C) and ρB(C) at the high-C end, only a few families with relatively high sequence identity are found in our data set.


Fig. 3. Probability distributions ρA(C) and ρB(C) of finding Ncons = C·Nalign conserved residues among a total of Nalign aligned residues in multiple and pairwise alignments of protein families, respectively. See details in the text.

 
If residue conservation is low, ρA(C) and ρB(C) may manifest more differences. For the above example of family P3, the fewer identical residues found in pairs (P1, P2) and (P2, P3), the lesser the chances of finding them conserved across the whole family (P1, P2, P3). Indeed, as seen from Figure 3, in the region of low C, ρA(C) and ρB(C) behave very differently. As suggested by the curves, the most probable residue conservation in multiple alignment is very low (C ≤ 0.02). In contrast, residue conservation in pairwise alignment reaches its maximum at C ≈ 0.14, while low values of C are considerably less frequent. This implies a higher statistical significance of residue conservation in multiple alignment, which is routinely used in bioinformatic studies. The average residue conservation, calculated as


$$
\bar{C}_{A,B} = \int_0^1 C\,\rho_{A,B}(C)\,\mathrm{d}C \qquad (3)
$$

equals $\bar{C}_A$ ≈ 0.12 and $\bar{C}_B$ ≈ 0.19 in multiple and pairwise alignments, respectively.
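In practice the averages are evaluated from binned distributions. A minimal sketch is given below, assuming equal-width bins and normalizing the histogram explicitly so that unnormalized counts may be passed in; the bin values are invented and do not reproduce the data of Figure 3.

```python
def mean_conservation(bin_centres, density):
    """Mean of a binned distribution: sum(C * rho(C)) / sum(rho(C)), equal-width bins assumed."""
    total = sum(density)
    if total == 0:
        raise ValueError("empty distribution")
    return sum(c * d for c, d in zip(bin_centres, density)) / total


# Hypothetical three-bin histogram, symmetric around C = 0.3.
centres = [0.1, 0.3, 0.5]
counts = [1, 2, 1]
print(mean_conservation(centres, counts))
```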

The pronounced decrease in ρB(C) at decreasing C (cf. Fig. 3) indicates that only a fraction (≈ 16%) of structurally similar sequence pairs have fewer than $\bar{C}_A$ · 100% (≈ 12%) conserved residues, while 50% of multiple alignments fall in this region of low C. Therefore, most multiple alignments at C ≤ $\bar{C}_A$ correspond to structure families with a higher pairwise conservation rate, and the pairwise conservation type (type B, cf. Fig. 2) prevails at lower sequence similarity. These results imply that, in general, the relationship between sequence and structure is ambiguous, as many different sequence motifs may correspond to the same structural feature. Consequently, structural motifs should not be correlated with any particular residues or their sequences. In this connection, it is interesting to see whether structural features may be correlated with particular residue substitutions.

Statistical significance of residue substitutions is given by the odds, calculated as described in Henikoff and Henikoff (1992):


$$
w_{ij} = \frac{q_{ij}}{e_{ij}} \qquad (4)
$$

where qij is the observed probability of substitution of residue type i with residue type j, and eij is the expected probability of such a substitution. Figure 4 shows residue self-odds wii, or conservation odds, calculated from the results of pairwise alignments. As seen from the figure, wii > 1 for all residues, which means that like-residue substitutions are statistically significant in our data set. However, the odds vary greatly with the residue type. Tryptophan appears to be the most conserved residue, which hypothetically might be linked to its size: replacing a big residue with something considerably smaller will more readily provoke structural changes than replacement of similar-size residues. High odds for cysteine could be attributed to the formation of disulphide bonds: breaking such bonds is likely to induce changes in the structure. The third most conserved residue, histidine, plays an important role in catalytic triads and might be better conserved for this reason. Although many different suppositions of this kind may be given here [as has been done elsewhere; see, e.g. Naor et al. (1996)], none of them can be firmly verified in the framework of statistical studies. In a similar study by Naor et al. (1996), the most conserved residue is proline, with cysteine second, while neither tryptophan nor histidine appears to be particularly conserved. In fact, very few similarities may be found between the data in Figure 4 and those obtained in the referenced publication. This disagreement could be attributed to the difference in the data sets: the data set used in Naor et al. (1996) was considerably smaller (257 structures) and composed of structurally non-redundant representatives. The demonstrated dependence on the content of the data set implies that extensive cross-validation studies are required in order to draw conclusions about the representative power of the PDB. 
One should notice in this connection that, according to the analysis of annotation records in PDB entries, ~65% of protein structures have been solved by molecular replacement (Dr. Alexei Vaguine, University of York, UK, personal communication), which implies a substantial bias towards already known folds.
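The odds wij = qij/eij can be computed from counts of aligned residue pairs. The sketch below follows the usual Henikoff and Henikoff (1992) convention, in which the expected probability is p_i² for identical residues and 2·p_i·p_j for unlike ones, with p_i the background frequency of residue i in the aligned pairs; whether exactly this form of eij was used in the present study is an assumption. The counts are invented and restricted to two residue types for brevity.

```python
from collections import defaultdict


def substitution_odds(pair_counts):
    """Return w_ij = q_ij / e_ij for unordered residue pairs.

    pair_counts maps (a, b) -> number of aligned positions pairing a with b;
    (a, b) and (b, a) are merged into one unordered pair.
    """
    total = sum(pair_counts.values())
    q = defaultdict(float)
    for (a, b), n in pair_counts.items():
        q[tuple(sorted((a, b)))] += n / total

    # Background frequency of each residue within the aligned pairs.
    p = defaultdict(float)
    for (a, b), q_ab in q.items():
        if a == b:
            p[a] += q_ab
        else:
            p[a] += q_ab / 2
            p[b] += q_ab / 2

    odds = {}
    for (a, b), q_ab in q.items():
        expected = p[a] ** 2 if a == b else 2 * p[a] * p[b]
        odds[(a, b)] = q_ab / expected
    return odds


# Hypothetical counts for two residue types only.
counts = {("A", "A"): 6, ("A", "G"): 2, ("G", "G"): 2}
for pair, w in sorted(substitution_odds(counts).items()):
    print(pair, round(w, 3))
```

Odds above 1 mark substitutions observed more often than expected from composition alone; in this toy set both self-odds exceed 1 while the A/G odds fall below it.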


Fig. 4. Residue conservation odds wii [cf. Equation (4)], calculated from the results of pairwise alignments. See discussion in the text.

 
Figure 5 shows the full log-odds matrix of residue substitutions M, derived from the results of pairwise alignments. Comparison with the data in Naor et al. (1996) again shows a considerable difference in absolute numbers; however, the results agree perfectly well on how the amino acid residues may be grouped according to their substitution odds. The order of residues in Figure 5 has been obtained with a nearest-neighbour clustering procedure. This procedure starts by creating clusters made of two residues with maximal values of wij, and then iteratively merges clusters with maximal substitution odds wij between their residues. The clustering arrives at two groups of residues, separated by lines in Figure 5. Correspondence with the empirical hydropathy index introduced by Kyte and Doolittle (1982) indicates that, with the exception of tyrosine and tryptophan, the clustering procedure separates residues into hydrophilic and hydrophobic types. Since all alignments were performed between structurally similar proteins, these results indicate that protein structure is quite tolerant to residue substitutions that conserve chemical-physical properties, rather than mere residue identities, in particular sequence positions.
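A procedure of this kind may be sketched as a generic single-linkage (nearest-neighbour) agglomeration on the odds matrix. The sketch starts from single-residue clusters, which is an assumption and may differ in detail from the procedure actually used; the odds values below are invented toy numbers, not the data of Figure 5.

```python
def nearest_neighbour_clusters(items, odds, n_clusters=2):
    """Single-linkage agglomeration: repeatedly merge the two clusters that
    contain the pair of residues with the highest substitution odds."""

    def w(a, b):
        # Odds are symmetric; pairs missing from the table count as 0.
        return odds.get((a, b), odds.get((b, a), 0.0))

    clusters = [[x] for x in items]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                link = max(w(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or link > best[0]:
                    best = (link, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical odds for six residues (made-up values for illustration).
toy_odds = {
    ("I", "V"): 3.0, ("I", "L"): 2.5, ("V", "L"): 2.8,
    ("K", "R"): 2.9, ("K", "E"): 1.8, ("R", "E"): 1.5,
}
print(nearest_neighbour_clusters(["I", "V", "L", "K", "R", "E"], toy_odds))
```

With these toy odds the residues separate into {I, V, L} and {K, R, E}, mirroring the hydrophobic and hydrophilic grouping described in the text.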


Fig. 5. The log-odds matrix of residue substitutions M, derived from the results of pairwise alignments. Matrix elements Mij are calculated by truncating log2(wij), where wij represent the statistical odds of substituting residue type i with residue type j in structurally similar proteins [cf. Equation (4)]. Diagonal elements wii represent the residue conservation odds shown in Figure 4. Numbers in the bottom row indicate the hydropathy index introduced by Kyte and Doolittle (1982). The order of rows and columns has been chosen so as to select two diagonal sub-matrices with the highest substitution odds and two off-diagonal matrices with the lowest values of wij (as divided by vertical and horizontal lines). See discussion in the text.

 
Similarly to the residue conservation, conservation of hydropathy type is given by the probability distribution η(CH) of finding CH·Nalign residue substitutions within the hydrophobic or hydrophilic groups of residues among a total of Nalign aligned sequence positions. Figure 6 shows distributions ηA(CH) and ηB(CH) calculated from the results of multiple and pairwise alignments, respectively. As seen from the figure, hydropathic properties appear very well conserved in pairwise alignment, i.e. between any two structurally similar proteins. The average hydropathic identity is approximately 0.69, which means that, on average, 69% of residue substitutions occur within the same hydropathy type. This figure is in striking contrast with the average residue identity conservation found above (≈ 0.19 in pairwise alignments). A relatively well localized distribution ηB(CH) (so that only a small fraction of alignments conserves <50% of hydropathically equivalent residues and no alignments conserve <32% of them) and the high average hydropathic identity suggest that hydropathic properties are a major factor affecting protein folding.
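The contrast between identity and hydropathy conservation is easy to reproduce on a toy pair of aligned sequences. The sketch below assigns residues to two classes by the sign of the Kyte and Doolittle (1982) index, which is an assumption: the grouping obtained in this study from clustering differs at least for tyrosine and tryptophan. The sequences are invented and assumed to be already structurally aligned, position by position.

```python
# Kyte-Doolittle hydropathy index (Kyte and Doolittle, 1982).
KYTE_DOOLITTLE = {
    "I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9, "A": 1.8,
    "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "Y": -1.3, "P": -1.6,
    "H": -3.2, "E": -3.5, "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9, "R": -4.5,
}


def is_hydrophobic(residue):
    """Two-class split by the sign of the index (an assumed, simplified grouping)."""
    return KYTE_DOOLITTLE[residue] > 0


def identity_conservation(seq_a, seq_b):
    """C: fraction of aligned positions occupied by identical residues."""
    return sum(a == b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def hydropathy_conservation(seq_a, seq_b):
    """CH: fraction of aligned positions whose residues share a hydropathy class."""
    same = sum(is_hydrophobic(a) == is_hydrophobic(b) for a, b in zip(seq_a, seq_b))
    return same / len(seq_a)


# Hypothetical aligned fragments with no identical residues at all.
a, b = "ALSDVF", "VITEIL"
print(identity_conservation(a, b), hydropathy_conservation(a, b))
```

The two fragments share no identical residue (C = 0) yet conserve the hydropathy class at every position (CH = 1), the extreme form of the behaviour seen in Figure 6.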


Fig. 6. Probability distributions ηA(CH) and ηB(CH) of finding CH·Nalign hydropathy-conserving residue positions among a total of Nalign aligned positions in multiple and pairwise alignments of protein families, respectively. See details in the text.

 
Hydropathy conservation in multiple alignments (or within structure families) is less pronounced than that in pairwise comparisons. The corresponding distribution ηA(CH) appears quite irregular. One should note in this connection that ηA was averaged over considerably fewer alignments than ηB. Indeed, as only 506 SCOP folds were left in the data set, the ηA curve was calculated from 506 multiple alignments. The curve spans the region from 0 to 1, divided into 50 bins. This makes just about 10 counts per bin on average, which is barely enough for proper averaging. The number of pairwise alignments used for the calculation of ηB is much higher: each of the 506 folds contributes n(n-1)/2 alignments, where n is the number of structures in the fold (on average n ≈ 57). Besides, the ηB distribution spans from 0.4 to 1, thus populating almost half of all bins. These factors help better averaging of the ηB distribution. The average multiple hydropathic identity reaches approximately 0.51, which is considerably higher than the average multiple residue conservation (≈ 0.12). Just as ρA(C) and ρB(C) (cf. Fig. 3), and for the same reasons, ηA(CH) and ηB(CH) nearly merge at high values of CH > 0.8, but deviate substantially in the region of low hydropathy conservation. The fact that CH may reach very low values in multiple alignment, but stays high in pairwise comparisons, indicates that hydropathic properties are conserved in different structure positions between different members of protein families. It therefore appears that protein structure is relatively insensitive to how the hydropathy-conserving substitutions are spread within the sequence, as long as there is a sufficiently high (≈ 70%) fraction of them. This implies a general prevalence of pairwise hydropathy conservation, similarly to what was found above for the conservation of residue identity.

The obtained results suggest that most structure-preserving residue mutations should occur within two basic hydropathy types, which agrees with the general consideration that hydrophobic interactions play a major role in protein folding. What can be said about the conservation of residue type C in this sort of mutation? Obviously, the statistically expected value of C should reach a minimum in the hypothetical extreme case when all substitutions within hydropathy types are equally frequent. Residue conservation above the statistically expected value may indicate structural similarity, while lower values are not indicative. Let us estimate the statistical expectancy for residue type conservation. Consider an idealized model of sequence mutations, which assumes that protein structure is conserved at any number of any residue substitutions within two groups of 10 amino acids each, but is not fully tolerant to substitutions between the groups. For simplicity, assume also that all residues have an equal occurrence rate and substitution frequency. Then the count matrix of residue substitutions in a sufficiently large number of structure-conserving mutations approaches the following form:


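$$
F_{ij} =
\begin{cases}
f_0, & i \le j,\ \text{$i$ and $j$ in the same group} \\
\alpha f_0, & i \le 10 < j \\
0, & i > j
\end{cases}
\qquad (5)
$$
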
where the first group includes residues numbered 1 ... 10 and the second group residues 11 ... 20; f0 and α f0 stand for the number of each residue substitution within and between the groups, respectively. In this matrix, substitutions i → j and j → i are counted in the same element Fij, j ≥ i. The total number of residue substitutions is then given by:


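$$
F_{\mathrm{tot}} = \sum_{i=1}^{N} \sum_{j=i}^{N} F_{ij} = \frac{N f_0}{4}\left[ N(1+\alpha) + 2 \right] \qquad (6)
$$
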
where N = 20 is the number of amino acid residues, and the fraction of conserved residues (or relative number of identical residue substitutions) equals:


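$$
C_M(\alpha) = \frac{\sum_{i=1}^{N} F_{ii}}{\sum_{i=1}^{N}\sum_{j=i}^{N} F_{ij}} = \frac{4}{N(1+\alpha) + 2} \qquad (7)
$$
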

In the extreme case of α = 0, which corresponds to the assumption that any residue substitution between the two groups results in major structural changes, CM(0) ≈ 0.18. This is very close to the empirical threshold C0 derived from Figure 1. Values of C > CM(0) are achievable with an increasing fraction of like-residue substitutions in evolutionarily related structures. Such substitutions add to the diagonal elements Fii, which results in increasing C. Higher values of C correspond to higher structure conservation; therefore one could expect an increase in ρB(C) at C > CM(0) in Figure 3. However, the higher the residue conservation, the fewer different sequences may be generated. Therefore, the actual decrease in the pairwise probability distribution (cf. Fig. 3) should be attributed to the composition of the data set, where close sequences are underrepresented due to natural reasons as well as a result of our selection procedure.

In the opposite limit of α = 1, where inter- and intra-hydropathy group substitutions are equally frequent, CM(1) ≈ 0.095. As seen from Figure 3, CM(1) is close to the peak in the pairwise probability distribution ρB(C). Since CM(1) corresponds to the situation where identical residue substitutions are as frequent as all others, values of C < CM(1) may be achieved only by enhancement of unlike-residue substitutions of both the same and different hydropathy types. Although such substitutions should result in a larger number of possible sequence variations, contrary to the situation at C > CM(0), ρB(C) shows a sharp decrease at C < CM(1) (cf. Fig. 3). This indicates an increased rate of structural changes, so that fewer of the sequences mutated in this way remain in the same structure family.
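Both limiting values can be checked by building the count matrix of the model explicitly. The sketch below uses only what the text states: an upper-triangular 20 × 20 matrix with f0 for pairs within a group of 10 residues and α·f0 for pairs between the groups, and CM as the ratio of the diagonal sum to the total. The function name is illustrative.

```python
def model_conservation(alpha, n=20, f0=1.0):
    """C_M(alpha): diagonal counts divided by all counts of the upper-triangular
    model matrix (f0 within a hydropathy group, alpha * f0 between groups)."""
    half = n // 2
    total = 0.0
    diagonal = 0.0
    for i in range(n):
        for j in range(i, n):  # i -> j and j -> i share one element, j >= i
            same_group = (i < half) == (j < half)
            count = f0 if same_group else alpha * f0
            total += count
            if i == j:
                diagonal += count
    return diagonal / total


for alpha in (0.0, 0.5, 1.0):
    # The brute-force sum agrees with the closed form 4 / (N(1 + alpha) + 2).
    print(alpha, round(model_conservation(alpha), 4), round(4 / (20 * (1 + alpha) + 2), 4))
```

For α = 0 the matrix holds 110 non-zero elements, 20 of them diagonal, giving 20/110 ≈ 0.18; for α = 1 the 100 inter-group elements are added, giving 20/210 ≈ 0.095.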

Because of the many simplifications employed, the described model cannot claim to reproduce accurately all results of the present study. However, it allows for qualitative conclusions that are in accordance with the results of numerical calculations. The model demonstrates clearly that conservation of residue identity C is a very poor indicator of structure similarity unless the sequences are closely related. The threshold value of C0 ≈ 0.2, below which the chance of finding structural resemblance decreases substantially, is insignificant in terms of sequence identity, because it is statistically expected in the situation when residue substitutions are confined only to hydropathy type. It should be stressed that CM(0) ≈ C0 indicates a principal border line, below which structural similarity cannot be inferred from sequence comparison. Therefore, CM(0) should be viewed as the bottom of the ‘twilight zone’, and indeed its value nearly coincides with the one found in statistical studies (Kinjo and Nishikawa, 2004; Rost, 1999). Despite the possibility, in principle, of correlating sequence and structure similarity in the ‘twilight zone’, the latter is characterized by an explosion of false negatives (Rost, 1999), which poses considerable difficulties for automatic database searches. Remarkably, this agrees with the basic assumptions of the above model of sequence mutations: promiscuous mutations within hydropathy groups preserve structure but result in negative hits in sequence comparison. The importance of hydropathy type conservation at low sequence identity was also acknowledged by Kinjo and Nishikawa (2004), who studied the sequence–structure relationship by means of eigenvalue analysis. Therefore, it may be concluded that the described model generally corresponds to the situation in the ‘twilight zone’ and that the sequence similarity threshold C0 originates from the existence of two basic hydropathy types of residues. 
It is possible that the number of false negatives in the ‘twilight zone’ can be decreased if the residue hydropathy type is taken into account in sequence comparisons. However, this question is outside the scope of the present study and requires further investigation.


    4 CONCLUSION
Sequence alignment, based on the matching of residue identity, is a widely used tool in similarity studies. This is due to many reasons, the most important of which are the wealth of methodology developed over many years, the negligibly small number of solved protein structures (~44 000 PDB entries) in comparison with the number of known protein sequences (some 4 300 000 unique sequences in the UniProt database (Leinonen et al., 2004)), and computational efficiency. While the last reason is becoming less important in view of progress in computing and modern efficient tools for structural alignment, the others will remain for many years to come.

Results obtained in our study indicate that conclusions made on the basis of sequence similarity may give far more false negatives than expected. Although high sequence similarity almost always corresponds to high structure resemblance, the opposite is far from the truth. Obviously, the same should apply to substructure motifs, such as active sites and functional protein interfaces. A striking example is given by nicotinic receptors (Brejc et al., 2001; Celie et al., 2005). These structures represent pentameric complexes that function as ion channels regulated by nicotinic ligands. A few different structures of the receptors are known; they have been found to be highly similar topologically, yet they show only 33% sequence similarity. The most conserved part of the structures appears to be the inner core of the monomeric units, while their surface and, particularly, the pentamer-forming interfaces show no regular sequence motifs that could help one to identify them by means of sequence screening.

The ambiguity between sequence and structure similarities arises from the tolerance of protein structure to residue substitutions within classes of amino acids having similar physicochemical properties. In this study, two classes were selected on the basis of residue hydropathy type. This allowed us to propose a model of sequence mutations that explains the results of structure alignments within protein families selected on the basis of SCOP folds. In this model, structures remain similar at a level of sequence similarity that is insignificant and statistically expected when there are no restrictions on residue substitutions other than by hydropathy type. This level is found to be very close to the sequence similarity threshold, found empirically in many previous studies, below which structure resemblance becomes largely a matter of chance. In the model's terms, this corresponds to the inclusion of residue substitutions between unlike hydropathy types, which induce significant structural changes.
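
The notion of a "statistically expected" identity can be illustrated with a deliberately simplified calculation, which is not the model of this study. It assumes that, at each position, the two residues are drawn independently from the same background distribution restricted to one hydropathy class; the expected identity is then the sum over classes k of (sum of f_a^2 over residues a in k) divided by P(k), where f_a are background frequencies and P(k) is the total frequency of class k. The class membership (sign of the Kyte and Doolittle index) and the uniform frequencies used below are assumptions made for illustration only, so the resulting numbers should not be read as estimates of C0.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumed two-class split by the sign of the Kyte-Doolittle index.
HYDROPHOBIC = "ACFILMV"
POLAR = "DEGHKNPQRSTWY"
# Assumed uniform background frequencies (real proteomes are not uniform).
UNIFORM = {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}


def expected_identity(freqs, classes):
    """Expected residue identity when substitutions stay within a class.

    freqs   -- background frequency of each residue (should sum to 1)
    classes -- iterable of residue groups that partition the alphabet
    """
    total = 0.0
    for members in classes:
        p_class = sum(freqs[a] for a in members)
        if p_class == 0:
            continue  # an empty class contributes nothing
        # P(class) * P(two independent draws within the class coincide)
        total += sum(freqs[a] ** 2 for a in members) / p_class
    return total


if __name__ == "__main__":
    print(expected_identity(UNIFORM, [AMINO_ACIDS]))         # no restriction
    print(expected_identity(UNIFORM, [HYDROPHOBIC, POLAR]))  # two classes
```

Because P(k) never exceeds 1, restricting substitutions to classes can only raise the expected identity above the unrestricted baseline; how far it rises depends on the frequencies and class definitions supplied.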

As a consequence of the relatively high tolerance of protein structure to amino acid substitutions within the hydropathy types, residue identity tends not to be conserved in sufficiently large protein families. This may lead to artifacts in analyses based on multiple sequence alignment, where no structural aspects are taken into account.

Results of this study confirm earlier findings that structures from one protein family tend to conserve hydropathic identity in geometrically equivalent positions (Kinjo and Nishikawa, 2004). It should be stressed that, strictly speaking, this does not mean that any residue substitution within the same hydropathy type would necessarily conserve major structural features. Obviously, hydropathic properties are not the only factor that affects protein folding. Stability of protein structures has been shown to depend on hydrogen bond pattern (Pace et al., 1996), formation of salt bridges (Waldburger et al., 1995), disulfide bonds (Betz, 1993), aromatic interactions (Serrano et al., 1991) and other factors. Therefore, a mutation that breaks, for example, a disulfide bond may induce a considerable structural change even if hydropathy type is preserved.

Protein functionality is known to be highly structure-specific. Although each particular structure is determined by its sequence, the structure–sequence relationship is, as the results obtained show, very ambiguous. Therefore, in general, a firm link between protein functionality and sequence may not exist. From this point of view, the significance of sequence in proteomics is different to that in genomics. In the latter, the nucleotide sequence encodes all aspects of functionality with no relation to structural features of DNA. In the protein world, it is not so much the sequence of residue identities as that of their chemical properties, together with the fold of the residue chain, that really matters. This allows for greater functional diversity and flexibility, which may have important evolutionary implications.


    ACKNOWLEDGEMENTS
This work has been supported by the research grant No. 721/B19544 from the Biotechnology and Biological Sciences Research Council (BBSRC) UK. The author would like to thank Dr Melford John for reading the manuscript and helpful discussion.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Anna Tramontano

Received on November 2, 2006; revised on January 8, 2007; accepted on January 11, 2007

    REFERENCES
    Berman HM, et al. The Protein Data Bank. Nucleic Acids Res. (2000) 28:235–242.

    Betz S. Disulfide bonds and the stability of globular proteins. Protein Sci. (1993) 2:1551–1558.

    Brejc K, et al. Crystal structure of an ACh-binding protein reveals the ligand-binding domain of nicotinic receptors. Nature (2001) 411:269–276.

    Brenner SE, et al. Understanding protein structure: using SCOP for fold interpretation. Methods Enzymol. (1996) 266:635–643.

    Celie PHN, et al. Crystal structure of acetylcholine-binding protein from Bulinus truncatus reveals the conserved structural scaffold and sites of variation in nicotinic acetylcholine receptors. J. Biol. Chem. (2005) 280:26457–26466.

    Chothia C, Lesk AM. The relation between the divergence of sequence and structure in proteins. EMBO J. (1986) 5:823–826.

    Chothia C. One thousand families for the molecular biologist. Nature (1992) 357:543–544.

    Dror O, et al. Multiple structural alignment by secondary structures: algorithm and applications. Protein Sci. (2003) 12:2492–2507.

    Guda C, et al. CE-MC: a multiple protein structure alignment server. Nucleic Acids Res. (2004) 32:W100–W103.

    Gusfield D. Algorithms on Strings, Trees and Sequences. (1997) New York: Cambridge University Press. 348–350.

    Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA (1992) 89:10915–10919.

    Holm L, Sander C. Mapping the protein universe. Science (1996) 273:595–603.

    Hubbard TJP, et al. SCOP: a structural classification of proteins database. Nucleic Acids Res. (1997) 25:236–239.

    Hubbard TJP, Blundell TL. Comparison of solvent-inaccessible cores of homologous proteins – definitions useful for protein modelling. Protein Engng. (1987) 1:159–171.

    Kinjo AR, Nishikawa K. Eigenvalue analysis of amino acid substitution matrices reveals a sharp transition of the mode of sequence conservation in proteins. Bioinformatics (2004) 20:2504–2508.

    Kolodny R, et al. Comprehensive evaluation of protein structure alignment methods: scoring by geometric measures. J. Mol. Biol. (2005) 346:1173–1188.

    Krissinel E, Henrick K. Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions. Acta Cryst. D (2004) 60:2256–2268.

    Krissinel E, Henrick K. Multiple alignment of protein structures in three dimensions. In: Berthold MR, et al., eds. CompLife 2005, LNBI 3695. (2005) Berlin Heidelberg: Springer-Verlag. 67–78.

    Kyte J, Doolittle RF. A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. (1982) 157:105–132.

    Leinonen R, et al. UniProt Archive. Bioinformatics (2004) 20:3236–3237.

    Murzin AG, et al. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. (1995) 247:536–540.

    Naor D, et al. Amino acid pair interchanges at spatially conserved locations. J. Mol. Biol. (1996) 256:924–938.

    Pace C, et al. Forces contributing to the conformational stability of proteins. FASEB J. (1996) 10:75–83.

    Pearl FM, et al. The CATH database: an extended protein family resource for structural and functional genomics. Nucleic Acids Res. (2003) 31:452–455.

    Rost B. Twilight zone of protein sequence alignments. Protein Engng. (1999) 12:85–94.

    Serrano L, et al. Aromatic-aromatic interactions and protein stability. Investigation by double-mutant cycles. J. Mol. Biol. (1991) 218:465–475.

    Valencia A, et al. GTPase domains of ras p21 oncogene protein and elongation factor Tu: analysis of three-dimensional structures, sequence families, and functional sites. Proc. Natl Acad. Sci. USA (1991) 88:5443–5447.

    Waldburger C, et al. Are buried salt bridges important for protein stability and conformational specificity? Nat. Struct. Biol. (1995) 2:122–128.




