Bioinformatics Advance Access originally published online on January 22, 2007
Bioinformatics 2007 23(6):717-723; doi:10.1093/bioinformatics/btm006
On the relationship between sequence and structure similarities in proteomics
European Bioinformatics Institute, Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| ABSTRACT |
|---|
Motivation: The underlying assumption of many sequence-based comparative studies in proteomics is that different aspects of protein structure and therefore functionality may be linked to particular sequence motifs. This holds true if sequence similarity is sufficiently high, but in general the relationship between protein sequence and structure appears complex and is not well understood.
Results: Statistical analysis of multiple and pairwise structural alignments of protein structures within SCOP folds is performed. The results indicate that multiple conservation of residue identity is not common and that the relationship between sequence and structure may be explained by a model based on the assumption that protein structure is tolerant to residue substitutions that preserve the hydropathic profile of the sequence. This model also explains the origin and specific value of the sequence similarity threshold, noted in many previous studies, below which structural resemblance is not statistically expected.
Contact: keb@ebi.ac.uk
| 1 INTRODUCTION |
|---|
Comparative studies play an important role in bioinformatics. It is widely assumed that structural resemblance of proteins implies their functional similarity. This assumption is important for a range of practical problems addressed by bioinformatics, from a better understanding of biochemical processes in the cell to drug discovery. It is also widely assumed that structural features are closely related to sequence composition. Although a protein with a given sequence may potentially exist in different conformations, the chances that two close sequences will fold into distinctly different structures are so small that they are often neglected in research practice.
There is, however, a limit to the extent to which structure and sequence similarities may be treated as equivalent. As has been established in several studies (Kinjo and Nishikawa, 2004; Rost, 1999), protein pairs with a sequence identity higher than 35–40% are very likely to be structurally similar. Structural similarity in pairs with a sequence identity of 20–35%, often referred to as the twilight zone, is considerably less common; fewer than 10% of protein pairs with sequence identity below 25% have similar structures. At the same time, the twilight zone is characterized by an explosion of false negatives (Rost, 1999), which means that many dissimilar sequences appear to be structural homologues. Although there are examples of homologous protein pairs with <10% sequence identity (Brenner et al., 1996; Holm and Sander, 1996; Hubbard et al., 1997; Valencia et al., 1991), it has been found in many studies (Chothia, 1992; Chothia and Lesk, 1986; Hubbard and Blundell, 1987; Krissinel and Henrick, 2004) that the likelihood of structural homologues with <20% sequence identity is negligibly small. This is illustrated by Figure 1, which shows the correlation between residue conservation C and structure similarity scores (Krissinel and Henrick, 2004):
[Equations (1) and (2) are missing from this copy. Equation (1) defines the Q-score.]
Here ρ(x,C) is the density of probability that a randomly selected pair of structures in the Protein Data Bank (PDB) (Berman et al., 2000) will have structure similarity score x and residue conservation C.
No perfect score for structural similarity has been proposed so far. Neither r.m.s.d. nor alignment length alone is truly indicative, because one may always be improved at the expense of the other. The Q-score represents a balance between r.m.s.d. and Nalign, and was found to be a considerably better score when structural similarity is not self-evident (Krissinel and Henrick, 2004). Q-scores range from 0, where no similarity exists, to 1, where structures are identical. From empirical considerations, close structural similarity is suggested by RMSD ≤ 2 Å, Nm ≥ 0.8 and Q ≥ 0.4.
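For orientation, the Q-score in the cited SSM paper (Krissinel and Henrick, 2004) is usually written as Q = Nalign² / ((1 + (RMSD/R0)²)·N1·N2), where N1 and N2 are the lengths of the two compared chains. The sketch below assumes that form with R0 = 3 Å; the formula is quoted from the SSM paper rather than taken from this text, and the function and parameter names are illustrative only.

```python
def q_score(n_align: int, rmsd: float, n1: int, n2: int, r0: float = 3.0) -> float:
    """Q = Nalign^2 / ((1 + (RMSD/R0)^2) * N1 * N2).

    r0 = 3.0 (in Angstrom) is an assumed default taken from the SSM paper.
    """
    if n1 <= 0 or n2 <= 0:
        raise ValueError("chain lengths must be positive")
    return n_align ** 2 / ((1.0 + (rmsd / r0) ** 2) * n1 * n2)


# Identical structures: every residue aligned with zero deviation.
identical = q_score(n_align=120, rmsd=0.0, n1=120, n2=120)

# A partial match: 90 aligned residues at 2 A r.m.s.d. between chains of 120 and 110.
partial = q_score(n_align=90, rmsd=2.0, n1=120, n2=110)

print(identical, round(partial, 3))
```

Under this form, Q falls when either the r.m.s.d. grows or the aligned fraction shrinks, which is the balance between the two quantities described above.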
As may be seen from Figure 1, C0 = 0.2 represents a threshold: above it, all three scores indicate structurally similar proteins, and below it (C < 0.2), dissimilar ones. The particular value of C0 has received little, if any, discussion in the works cited above or elsewhere. However, a few intriguing questions arise here. Consider a family of structurally similar proteins Pn = {P1, P2, ..., Pn}. If sequence variations within Pn were completely random, then having residue conservation between proteins Pi and Pj, C(Pi,Pj) ≥ 0.2, and C(Pj,Pk) ≥ 0.2 would not necessarily mean that C(Pi,Pk) ≥ 0.2. One may think of two types of sequence relationship within structure families, exemplified in Figure 2, that may provide C(Pi,Pj) ≥ 0.2 for any Pi, Pj from Pn. Multiple residue conservation (type A) is particularly appealing for bioinformatic applications. It suggests that protein features, such as fold and functionality, may be due to the presence of specific sequence motifs in certain structure positions. This promises a discovery of relationships between structure, function and sequence evolution, and, in fact, many bioinformatic studies exploit this kind of largely intuitive hypothesis. Closed pairwise residue conservation (type B) has different implications. Here, protein structure cannot be so unambiguously associated with a particular sequence; therefore, many different sequences may fold into similar structures (which, indeed, is the case). As a consequence, protein structure and function do not appear clearly sequence-related.
Obviously, multiple residue conservation (Fig. 2A) should not differ significantly from pairwise conservation (Fig. 2B) in structure families with a high sequence similarity. However, this case is of little practical value because the requirement of a high C in comparative studies limits their predictive power. As C decreases, the sequence–structure relationship is expected to shift gradually to type B (Fig. 2B). When C falls below a value of 0.2, as Figure 1 suggests, no sound structure similarity is statistically expected.
While relatively clear in outline, the above aspects of the relationship between sequence and structure similarities remain largely unstudied in detail. The critical value of C0 = 0.2 appears confirmed for a sufficiently large number of different data sets; however, its origin remains unexplained. At the same time, the topic is worth further study because of the importance of the sequence–structure relationship in protein science. This article attempts to provide an insight into the sequence and structure relationship by presenting a statistical analysis of available protein structure families. In particular, the manifestation of multiple and pairwise sequence relationships as a function of residue conservation is studied and explained, and a probable origin of the threshold value C0 is suggested.
| 2 APPROACH, DATA SET AND METHODS |
|---|
Multiple and pairwise residue conservation types in a particular structure family Pn are identified by the corresponding 3D alignment of proteins Pi. Multiple alignment looks for residues that occupy geometrically equivalent positions in all n structures, and pairwise alignment does the same for each of the n(n-1)/2 protein pairs (Pi, Pj) separately. A residue is considered conserved if the 3D alignment matches it with identical residue(s).
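The difference between the two conservation types can be illustrated with a toy family. The sketch below assumes that a 3D alignment has already been reduced to equal-length strings in which each column is one geometrically equivalent position and "-" marks a residue with no equivalent; the sequences and function names are invented for illustration and are not taken from the study.

```python
from itertools import combinations

GAP = "-"


def pairwise_conservation(a: str, b: str) -> float:
    """Fraction of aligned positions (no gap in either chain) with identical residues."""
    aligned = [(x, y) for x, y in zip(a, b) if x != GAP and y != GAP]
    if not aligned:
        return 0.0
    return sum(x == y for x, y in aligned) / len(aligned)


def multiple_conservation(seqs) -> float:
    """Fraction of columns aligned in all chains whose residues are all identical."""
    columns = [col for col in zip(*seqs) if GAP not in col]
    if not columns:
        return 0.0
    return sum(len(set(col)) == 1 for col in columns) / len(columns)


# Hypothetical family: every pair shares two residues, but no column is shared by all three.
family = {"P1": "ACDEFG", "P2": "ACKLMN", "P3": "QRKLFG"}

pairwise = {
    (i, j): pairwise_conservation(family[i], family[j])
    for i, j in combinations(family, 2)  # n(n-1)/2 pairs
}
multiple = multiple_conservation(list(family.values()))

print(pairwise, multiple)
```

In this constructed case each pair conserves a third of its positions while the multiple alignment conserves none, which is the closed pairwise (type B) pattern in its most extreme form.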
Residue conservation may be expressed by the probability distribution P(C) of finding Ncons = C·Nalign conserved residues out of a total of Nalign aligned residues. Distributions P_A(C) and P_B(C) for multiple and pairwise conservation, respectively, are calculated from the corresponding 3D alignments of structure families found in the PDB. The families should be chosen in a way that allows for sufficient sequence diversity, on the one hand, and assures reasonable structure similarity, on the other. A few protein structure classifications are available. CATH (Pearl et al., 2003) and FSSP (Holm and Sander, 1996) represent automatic classifications, while SCOP (Murzin et al., 1995) is a manually curated resource. Although no significant differences in results obtained for different classifications should be expected, SCOP was chosen in order to avoid any artifacts that might arise from employing a 3D aligner different from the one used for the classification. SCOP offers a five-level structure hierarchy (numbers given for release 1.69): 70 859 domains grouped into 2845 families, 1539 superfamilies, 945 folds and 7 classes, in order of decreasing structure resemblance. SCOP folds appear to be the most appropriate set of structure families for the purposes of the present study, because SCOP families and superfamilies contain closely related proteins, which implies high sequence similarity, and structure resemblance between different folds inside a class is normally very low.
3D pairwise and multiple alignments were performed with the algorithms used in the SSM web service at the European Bioinformatics Institute. Pairwise alignment in SSM is a two-step procedure (Krissinel and Henrick, 2004). First, seed alignments are generated by matching graphs built on secondary structure elements (SSEs). More than one seed alignment may be generated if several alternatives with similar scores are found. Then the seed alignment(s) are refined by matching pairs of Cα atoms in the compared structures. A set of different seed alignments and several Cα mappings for each of them are explored in order to maximize the Q-score (cf. Equation 1). Multiple alignment in SSM starts with a set of all-to-all pairwise SSM alignments in the structure family. Using these alignments as a first approximation, the algorithm probes alternative mappings for those SSEs that do not have equivalences in all other structures. Each alternative mapping results in a decrease of pairwise scores, but may increase the multiple alignment score if all SSE equivalences are found. At the final stage, multiple Cα mapping is performed using an approach equivalent to the central star method in multiple sequence alignment (Gusfield, 1997). A detailed description of the multiple SSM algorithm is given in Krissinel and Henrick (2005). The quality and performance of the pairwise SSM algorithm have been acknowledged in an independent study (Kolodny et al., 2005). Limited comparisons of multiple SSM with other multiple alignment algorithms [MASS, Dror et al. (2003) and Combinatorial Extension, Guda et al. (2004)], performed in Krissinel and Henrick (2005), show good overall agreement.
Because SSM uses secondary structure for obtaining seed alignments, PDB entries of proteins without secondary structure and those for which it cannot be calculated (e.g. where only backbone Cα atoms are given) were excluded from the data set. Next, only one structure from those with sequence identity higher than 50% to each other was left per SCOP fold, in order to focus on the situation where structure similarity shows a clear dependence on C. The 50% threshold is suggested by the correlation map between C and Q-score in Figure 1A, which shows that at C ≥ 0.5 the Q-score does not depend on C and ranges between 0.8 and 1. Other structure scores, RMSD and Nm (Fig. 1A,B), do not exhibit this feature so clearly. Finally, folds with fewer than three remaining structures were removed from the data set.
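The last two selection steps can be sketched as follows. The text does not say how the single representative was chosen among structures with more than 50% mutual identity, so the greedy, order-dependent filter below is an assumption, as are the toy sequences, the identity function and all names.

```python
from typing import Callable, Dict, List


def filter_fold(chains: List[str], identity: Callable[[str, str], float],
                max_identity: float = 0.5) -> List[str]:
    """Greedy filter (assumed): keep a chain only if it is at most
    max_identity identical to every chain already kept."""
    kept: List[str] = []
    for chain in chains:
        if all(identity(chain, other) <= max_identity for other in kept):
            kept.append(chain)
    return kept


def build_dataset(folds: Dict[str, List[str]], identity: Callable[[str, str], float],
                  min_size: int = 3) -> Dict[str, List[str]]:
    """Apply the redundancy filter per fold, then drop folds with fewer than min_size chains."""
    dataset = {}
    for fold, chains in folds.items():
        kept = filter_fold(chains, identity)
        if len(kept) >= min_size:
            dataset[fold] = kept
    return dataset


# Hypothetical chains; identity is the fraction of identical positions.
toy_sequences = {
    "a": "AAAAAAAAAA", "b": "AAAAAAAAAC", "c": "AAAACCCCCC", "d": "GGGGGGGGGG",
    "e": "KKKKKKKKKK", "f": "KKKKKKKKKL", "g": "LLLLLLLLLL",
}


def toy_identity(i: str, j: str) -> float:
    a, b = toy_sequences[i], toy_sequences[j]
    return sum(x == y for x, y in zip(a, b)) / min(len(a), len(b))


folds = {"fold_x": ["a", "b", "c", "d"], "fold_y": ["e", "f", "g"]}
dataset = build_dataset(folds, toy_identity)
print(dataset)
```

Here chain b is dropped as redundant with a, and fold_y is dropped because only two of its chains survive the filter.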
| 3 RESULTS AND DISCUSSION |
|---|
The resulting data set included 28 980 protein chains in 506 SCOP folds, which is slightly more than a chain per PDB entry (a total of 25 973 PDB entries are annotated in SCOP release 1.69). This indicates a reasonable coverage of PDB content, taking into account that many PDB entries contain non-crystallographically related copies of the same sequence in the asymmetric unit.
The obtained distributions P_A(C) and P_B(C) are shown in Figure 3. As seen from the figure, the distributions approach each other at residue conservation C ≥ 0.25. A near merge of the curves at C ≥ 0.4 is expected considering that, in families with high sequence identity, a large fraction of residues is conserved in both pairwise and multiple alignments. Consider, as an example, a protein family P3 = {P1, P2, P3}. If residue conservation in pairs (P1, P2) and (P2, P3) is high, then one can expect that residue conservation in (P1, P3) will also be high. Then multiple alignment (P1, P2, P3) reduces to the superposition of all pairwise alignments, thus preserving residue mapping as in Figure 2A. In this case, both types of residue conservation should occur equally frequently, which indeed is confirmed by data in Figure 3. As follows from the absolute values of P_A(C) and P_B(C) at the high-C end, only a few families with relatively high sequence identity are found in our data set.
If residue conservation is low, P_A(C) and P_B(C) may show more differences. For the above example of family P3, the fewer identical residues found in pairs (P1, P2) and (P2, P3), the lower the chances of finding them conserved across the whole family (P1, P2, P3). Indeed, as seen from Figure 3, in the region of low C, P_A(C) and P_B(C) behave very differently. As suggested by the curves, the most probable residue conservation in multiple alignment is very low (C ≈ 0.02). In contrast, residue conservation in pairwise alignment reaches its maximum at C ≈ 0.14, while low values of C are considerably less frequent. This implies a higher statistical significance of residue conservation in multiple alignment, which is routinely used in bioinformatic studies. The average residue conservation, calculated as

$$\langle C \rangle = \int_0^1 C\,P(C)\,dC \qquad (3)$$

is ⟨C⟩_A ≈ 0.12 and ⟨C⟩_B ≈ 0.19 in multiple and pairwise alignments, respectively.
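For a binned distribution, an average of this kind is the first moment of the histogram. The sketch below assumes equal-width bins over [0, 1] and uses bin centres as representative values; the binning and the function name are assumptions for illustration, not details taken from the study.

```python
def mean_conservation(counts, lo: float = 0.0, hi: float = 1.0) -> float:
    """Mean of a histogram with equal-width bins on [lo, hi], using bin centres."""
    total = sum(counts)
    if total == 0:
        raise ValueError("empty histogram")
    width = (hi - lo) / len(counts)
    centres = [lo + (k + 0.5) * width for k in range(len(counts))]
    return sum(c * n for c, n in zip(centres, counts)) / total


# Toy histograms with 50 bins: a flat one, and one with all counts in the bin 0.12-0.14.
flat = [1] * 50
peaked = [0] * 50
peaked[6] = 7

print(mean_conservation(flat), mean_conservation(peaked))
```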
The pronounced decrease in P_B(C) with decreasing C (cf. Fig. 3) indicates that only a fraction (≈16%) of structurally similar sequence pairs have fewer than ⟨C⟩_A·100% conserved residues, where ⟨C⟩_A is the average conservation in multiple alignments, while 50% of multiple alignments fall in this region of low C. Therefore, most multiple alignments at C ≤ ⟨C⟩_A correspond to structure families with a higher pairwise conservation rate, and the pairwise conservation type (type B, cf. Fig. 2) prevails at lower sequence similarity. These results imply that, in general, the relationship between sequence and structure is ambiguous, as many different sequence motifs may correspond to the same structural feature. Structural motifs should therefore not be correlated with any particular residues or their sequences. In this connection, it is interesting to see whether structural features may be correlated with particular residue substitutions.
Statistical significance of residue substitutions is given by the odds, calculated as described in Henikoff and Henikoff (1992):
[Equation (4) is missing from this copy.]
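As a reminder of how log-odds are obtained in the Henikoff and Henikoff (1992) scheme, the sketch below computes observed pair frequencies q_ij, background frequencies p_i and the score log2(q_ij / e_ij), with e_ii = p_i² and e_ij = 2·p_i·p_j. This is the standard BLOSUM-style construction, assumed here rather than recovered from the lost equation; the three-letter alphabet and the counts are invented.

```python
import math
from itertools import combinations_with_replacement


def log_odds(pair_counts):
    """Log-odds (base 2) from counts of unordered residue pairs.

    Each unordered pair is assumed to appear once in pair_counts.
    """
    total = sum(pair_counts.values())
    q = {tuple(sorted(pair)): n / total for pair, n in pair_counts.items()}

    # Background frequency of each residue: p_i = q_ii + sum_{j != i} q_ij / 2.
    p = {}
    for (a, b), q_ab in q.items():
        if a == b:
            p[a] = p.get(a, 0.0) + q_ab
        else:
            p[a] = p.get(a, 0.0) + q_ab / 2
            p[b] = p.get(b, 0.0) + q_ab / 2

    scores = {}
    for a, b in combinations_with_replacement(sorted(p), 2):
        expected = p[a] ** 2 if a == b else 2 * p[a] * p[b]
        observed = q.get((a, b), 0.0)
        if observed > 0.0:  # unobserved pairs get no score in this sketch
            scores[(a, b)] = math.log2(observed / expected)
    return scores


# Hypothetical counts: I and L interchange often, K rarely pairs with either.
toy_counts = {
    ("I", "I"): 20, ("L", "L"): 20, ("I", "L"): 30,
    ("K", "K"): 26, ("I", "K"): 2, ("K", "L"): 2,
}
scores = log_odds(toy_counts)
print(scores)
```

A positive score marks a substitution seen more often than expected from the background frequencies, and a negative score one seen less often.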
65% of protein structures have been solved by molecular replacement (Dr. Alexei Vaguine, University of York, UK, personal communication), which implies a substantial bias towards already known folds.
Figure 5 shows the full log-odds matrix of residue substitutions M, derived from the results of pairwise alignments. Comparison with data in Naor et al. (1996) again shows a considerable difference in absolute numbers; however, the results agree perfectly well on how the amino acid residues may be grouped according to their substitution odds. The order of residues in Figure 5 has been obtained with a nearest-neighbour clustering procedure. This procedure starts by creating clusters made of two residues with maximal values of wij, and then iteratively merges clusters with maximal substitution odds wij between their residues. The clustering arrives at two groups of residues, separated by lines in Figure 5. Comparison with the empirical hydropathy index introduced by Kyte and Doolittle (1982) indicates that, with the exception of tyrosine and tryptophan, the clustering procedure separates residues into hydrophilic and hydrophobic types. Since all alignments were performed between structurally similar proteins, these results indicate that protein structure is quite tolerant to residue substitutions that conserve physicochemical properties, rather than mere residue identities, in particular sequence positions.
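A nearest-neighbour clustering of this kind can be sketched as single-linkage agglomeration on the odds matrix. The version below starts from single residues and repeatedly merges the two clusters joined by the largest wij until two clusters remain; this is one plausible reading of the procedure, not the author's code, and the six-residue odds table is invented.

```python
from itertools import combinations


def nearest_neighbour_clusters(items, w, n_clusters: int = 2):
    """Single-linkage agglomeration: merge the pair of clusters whose
    closest members have the highest substitution odds w."""
    clusters = [frozenset([x]) for x in items]

    def linkage(c1, c2):
        return max(w[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > n_clusters:
        c1, c2 = max(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return clusters


# Hypothetical odds: +1 within {I, L, V} and within {K, R, D}, -1 between them.
residues = ["I", "L", "V", "K", "R", "D"]
toy_w = {
    frozenset((a, b)): 1.0 if (a in "ILV") == (b in "ILV") else -1.0
    for a, b in combinations(residues, 2)
}

groups = nearest_neighbour_clusters(residues, toy_w)
print(groups)
```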
Similarly to residue conservation, conservation of hydropathy type is given by the probability distribution P(CH) of finding CH·Nalign residue substitutions within the hydrophobic or hydrophilic groups of residues out of a total of Nalign aligned sequence positions. Figure 6 shows distributions P_A(CH) and P_B(CH) calculated from the results of multiple and pairwise alignments, respectively. As seen from the figure, hydropathic properties appear very well conserved in pairwise alignment, i.e. between any two structurally similar proteins. The average hydropathic identity is ⟨CH⟩_B ≈ 0.69, which means that, on average, 69% of residue substitutions occur within the same hydropathy type. This figure is in striking contrast with the average residue identity conservation found above (⟨C⟩_B ≈ 0.19). A relatively well localized distribution P_B(CH) (so that only a small fraction of alignments conserves <50% of hydropathically equivalent residues and no alignments conserve <32% of them) and high value of […]
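The contrast between identity conservation and hydropathy conservation can be shown on a toy pair. The sketch below assigns hydropathy type by the sign of the Kyte and Doolittle (1982) index, which is an assumption: the grouping in the study came from clustering the substitution odds and differs for at least tyrosine and tryptophan. The two sequences and the function names are illustrative.

```python
# Kyte-Doolittle hydropathy index (Kyte and Doolittle, 1982).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def is_hydrophobic(residue: str) -> bool:
    # Assumed two-way split: positive index means hydrophobic.
    return KYTE_DOOLITTLE[residue] > 0.0


def conservation(a: str, b: str, same) -> float:
    """Fraction of aligned positions (no gaps) for which same(x, y) holds."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(same(x, y) for x, y in pairs) / len(pairs)


def identity_conservation(a: str, b: str) -> float:
    return conservation(a, b, lambda x, y: x == y)


def hydropathy_conservation(a: str, b: str) -> float:
    return conservation(a, b, lambda x, y: is_hydrophobic(x) == is_hydrophobic(y))


# Hypothetical aligned pair: no identical residues, hydropathy type kept everywhere.
c = identity_conservation("ILVKDE", "VIAREQ")
c_h = hydropathy_conservation("ILVKDE", "VIAREQ")
print(c, c_h)
```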
Hydropathy conservation in multiple alignments (or within structure families) is less pronounced than that in pairwise comparisons. The corresponding distribution P_A(CH) appears quite irregular. One should note in this connection that P_A was averaged over considerably fewer alignments than P_B. Indeed, as only 506 SCOP folds were left in the data set, the P_A curve was calculated from 506 multiple alignments. The curve spans the region of 0 to 1, divided into 50 bins. This makes just about 10 counts per bin on average, which is barely enough for proper averaging. The number of pairwise alignments used for the calculation of P_B is much higher: each of the 506 folds contributes n(n-1)/2 alignments, where n is the number of structures in the fold (on average n ≈ 57). Besides, the P_B distribution spans from 0.4 to 1, thus populating almost half of all bins. These factors allow better averaging of the P_B distribution. The average multiple hydropathic identity reaches ⟨CH⟩_A ≈ 0.51, which is considerably higher than the average multiple residue conservation ⟨C⟩_A ≈ 0.12. Just as P_A(C) and P_B(C) (cf. Fig. 3), and for the same reasons, P_A(CH) and P_B(CH) nearly merge at high values of CH > 0.8, but deviate substantially in the region of low hydropathy conservation. The fact that CH may reach very low values in multiple alignment, but stays high in pairwise comparisons, indicates that hydropathic properties are conserved in different structure positions between different members of protein families. It therefore appears that protein structure is relatively insensitive to how the hydropathy-conserving substitutions are spread within the sequence, as long as there is a sufficiently high (≈70%) fraction of them. This implies a general prevalence of pairwise hydropathy conservation, similar to what was found above for the conservation of residue identity.
The results obtained suggest that most structure-preserving residue mutations should occur within two basic hydropathy types, which agrees with the general consideration that hydrophobic interactions play a major role in protein folding. What can be said about the conservation of residue identity C under this sort of mutation? Obviously, the statistically expected value of C should reach a minimum in the hypothetical extreme case in which all substitutions within hydropathy types are equally frequent. Residue conservation above the statistically expected value may indicate structural similarity, while lower values are not indicative. Let us estimate the statistical expectation for residue conservation. Consider an idealized model of sequence mutations, which assumes that protein structure is conserved under any number of residue substitutions within two groups of 10 amino acids each, but is not fully tolerant to substitutions between the groups. For simplicity, assume also that all residues have an equal occurrence rate and substitution frequency. Then the count matrix of residue substitutions in a sufficiently large number of structure-conserving mutations approaches the following form:
$$F_{i,j} = \begin{cases} f_1, & i \text{ and } j \text{ in the same group} \\ f_0, & i \text{ and } j \text{ in different groups} \end{cases} \qquad j \ge i \qquad (5)$$

where f1 and f0 stand for the number of each residue substitution within and between the groups, respectively. In this matrix, substitutions i→j and j→i are counted in the same element Fi,j, j ≥ i. The total number of residue substitutions is then given by:

$$N = \sum_{i \le j} F_{i,j} = 110 f_1 + 100 f_0 \qquad (6)$$

and the expected residue conservation, i.e. the fraction of substitutions that fall on the diagonal, is

$$C_M(\alpha) = \frac{\sum_i F_{i,i}}{N} = \frac{20 f_1}{110 f_1 + 100 f_0} = \frac{2}{11 + 10\alpha}, \qquad \alpha = \frac{f_0}{f_1} \qquad (7)$$

In the extreme case of α = 0, which corresponds to the assumption that any residue substitution between the two groups results in major structural changes, CM(0) ≈ 0.18. This is very close to the empirical threshold C0 derived from Figure 1. Values of C > CM(0) are achievable with an increasing fraction of like-residue substitutions in evolutionarily related structures. Such substitutions add to the diagonal elements Fii, which results in increasing C. Higher values of C correspond to higher structure conservation; therefore, one could expect an increase in P_B(C) at C > CM(0) in Figure 3. However, the higher the residue conservation, the fewer different sequences may be generated. Therefore, the actual decrease in the pairwise probability distribution (cf. Fig. 3) should be attributed to the composition of the data set, where close sequences are underrepresented for natural reasons as well as a result of our selection procedure.
In the opposite limit of α = 1, where inter- and intra-hydropathy-group substitutions are equally frequent, CM(1) ≈ 0.095. As seen from Figure 3, CM(1) is close to the peak in the pairwise probability distribution P_B(C). Since CM(1) corresponds to the situation where identical residue substitutions are as frequent as all others, values of C < CM(1) may be achieved only by enhancement of unlike-residue substitutions of both the same and different hydropathy types. Although such substitutions should result in a larger number of possible sequence variations, contrary to the situation at C > CM(0), P_B(C) shows a sharp decrease at C < CM(1) (cf. Fig. 3). This indicates an increased rate of structural changes, so that fewer of the sequences mutated in this way remain in the same structure family.
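The two limiting values quoted above can be checked numerically. The sketch below assumes 20 residue types in two groups of 10, an upper-triangular count matrix (diagonal included) with one count per element within a group and alpha counts per element between groups, and takes the expected conservation as the diagonal share of all counts. The symbol alpha and the function name are choices made for this illustration.

```python
def expected_conservation(alpha: float, group_size: int = 10, n_groups: int = 2) -> float:
    """Diagonal fraction of the idealized substitution count matrix.

    alpha is the between-group count relative to the within-group count.
    """
    f1, f0 = 1.0, alpha
    n = group_size * n_groups
    group = [i // group_size for i in range(n)]

    diagonal = 0.0
    total = 0.0
    for i in range(n):
        for j in range(i, n):  # i->j and j->i share one element, j >= i
            count = f1 if group[i] == group[j] else f0
            total += count
            if i == j:
                diagonal += count
    return diagonal / total


c_m0 = expected_conservation(0.0)  # no substitutions between groups
c_m1 = expected_conservation(1.0)  # all substitutions equally frequent
print(round(c_m0, 3), round(c_m1, 3))
```

With these assumptions the matrix has 110 within-group elements (20 of them diagonal) and 100 between-group elements, so the two limits are 20/110 and 20/210, in line with the values of about 0.18 and 0.095 given in the text.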
Because of the many simplifications employed, the described model cannot claim to reproduce accurately all results of the present study. However, it allows for qualitative conclusions that are in accordance with the results of numerical calculations. The model demonstrates clearly that conservation of residue identity C is a very poor indicator of structure similarity unless the sequences are closely related. The threshold value of C0 ≈ 0.2, below which the chance of finding structural resemblance decreases substantially, is insignificant in terms of sequence identity, because it is statistically expected in the situation when residue substitutions are confined only to hydropathy type. It should be stressed that CM(0) ≈ C0 indicates a fundamental borderline, below which structural similarity cannot be inferred from sequence comparison. Therefore, CM(0) should be viewed as the bottom of the twilight zone, and indeed its value nearly coincides with the one found in statistical studies (Kinjo and Nishikawa, 2004; Rost, 1999). Although it is possible in principle to correlate sequence and structure similarity in the twilight zone, the latter is characterized by an explosion of false negatives (Rost, 1999), which poses considerable difficulties for automatic database searches. Remarkably, this agrees with the basic assumptions of the above model of sequence mutations: promiscuous mutations within hydropathy groups preserve structure but result in negative hits in sequence comparison. The importance of hydropathy type conservation at low sequence identity was also acknowledged by Kinjo and Nishikawa (2004), who studied the sequence–structure relationship by means of eigenvalue analysis. Therefore, it may be concluded that the described model generally corresponds to the situation in the twilight zone and that the sequence similarity threshold C0 originates from the existence of two basic hydropathy types of residues. It is possible that the number of false negatives in the twilight zone can be decreased if the residue hydropathy type is taken into account in sequence comparisons. However, this question is outside the scope of the present study and requires further investigation.
| 4 CONCLUSION |
|---|
Sequence alignment, based on the matching of residue identity, is a widely used tool in similarity studies. This is due to many reasons, the most important of which are the wealth of methodology developed over many years, the negligibly small number of solved protein structures (≈44 000 PDB entries) in comparison with the number of known protein sequences [some 4 300 000 unique sequences in the UniProt database (Leinonen et al., 2004)], and computational efficiency. While the last reason becomes less important in view of computer progress and modern efficient tools for structural alignment, the others will remain for many years to come. Results obtained in our study indicate that conclusions made on the basis of sequence similarity may give far more false negatives than expected. Although high sequence similarity almost always corresponds to high structure resemblance, the converse is far from true. Obviously, the same should apply to substructure motifs, such as active sites and functional protein interfaces. A striking example is given by nicotinic receptors (Brejc et al., 2001; Celie et al., 2005). These structures represent pentameric complexes that function as ion channels regulated by nicotinic ligands. A few different structures of the receptors are known, which have been found to be highly similar topologically, yet they show only 33% sequence similarity. The most conserved part of the structures appears to be the inner core of monomeric units, while their surface and, particularly, the pentamer-forming interfaces show no regular sequence motifs that could help one to identify them by means of sequence screening.
The ambiguity between sequence and structure similarities arises from the tolerance of protein structure to residue substitutions within classes of amino acids having similar physicochemical properties. In this study, two classes were selected on the basis of residue hydropathy type. This allowed us to propose a model of sequence mutations that explains the results of structure alignments within protein families selected on the basis of SCOP folds. In this model, structures remain similar at a level of sequence similarity that is insignificant and statistically expected when there are no restrictions on residue substitutions other than by hydropathy type. This level is found to be very close to the sequence similarity threshold, found empirically in many previous studies, below which structure resemblance becomes largely a matter of chance. In the model's terms, this corresponds to the inclusion of residue substitutions between unlike hydropathy types, inducing significant structural changes.
As a consequence of the relatively high tolerance of protein structure to amino acid substitutions within the hydropathy types, residue identity tends not to be conserved in sufficiently large protein families. This may lead to artifacts in analyses based on multiple sequence alignment, where no structural aspects are taken into account.
Results of this study confirm earlier findings that structures from one protein family tend to conserve hydropathic identity in geometrically equivalent positions (Kinjo and Nishikawa, 2004). It should be stressed that, strictly speaking, this does not mean that any residue substitutions within the same hydropathy type would necessarily conserve major structural features. Obviously, hydropathic properties are not the only factor that affects protein folding. Stability of protein structures has been shown to depend on hydrogen bond patterns (Pace et al., 1996), formation of salt bridges (Waldburger et al., 1995), disulfide bonds (Betz, 1993), aromatic interactions (Serrano et al., 1991) and other factors. Therefore, a mutation that breaks a disulfide bond, for example, may potentially induce a considerable structural change.
Protein functionality is known to be highly structure-specific. Although each particular structure is determined by its sequence, the structure–sequence relationship is, as the results obtained show, very ambiguous. Therefore, in general, a firm link between protein functionality and sequence may not exist. From this point of view, the significance of sequence in proteomics is different from that in genomics. In the latter, nucleotide sequence encodes all aspects of functionality with no relation to structural features of DNA. In the protein world, it is not so much the sequence of residue identities as that of their chemical properties, together with the residue chain fold, that really matters. This allows for greater functional diversity and flexibility, which may have important evolutionary implications.
| ACKNOWLEDGEMENTS |
|---|
This work has been supported by the research grant No. 721/B19544 from the Biotechnology and Biological Sciences Research Council (BBSRC) UK. The author would like to thank Dr Melford John for reading the manuscript and helpful discussion.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Anna Tramontano
Received on November 2, 2006; revised on January 8, 2007; accepted on January 11, 2007
| REFERENCES |
|---|
Berman HM, et al. The Protein Data Bank. Nucleic Acids Res. (2000) 28:235–242.
Betz S. Disulfide bonds and the stability of globular proteins. Protein Sci. (1993) 2:1551–1558.
Brejc K, et al. Crystal structure of an ACh-binding protein reveals the ligand-binding domain of nicotinic receptors. Nature (2001) 411:269–276.
Brenner SE, et al. Understanding protein structure: using scop for fold interpretation. Methods Enzymol. (1996) 266:635–643.
Celie PHN, et al. Crystal structure of acetylcholine-binding protein from Bulinus truncatus reveals the conserved structural scaffold and sites of variation in nicotinic acetylcholine receptors. J. Biol. Chem. (2005) 280:26457–26466.
Chothia C, Lesk AM. The relation between the divergence of sequence and structure in proteins. EMBO J. (1986) 5:823–826.
Chothia C. One thousand families for the molecular biologist. Nature (1992) 357:543–544.
Dror O, et al. Multiple structural alignment by secondary structures: algorithm and applications. Protein Sci. (2003) 12:2492–2507.
Guda C, et al. CE-MC: a multiple protein structure alignment server. Nucleic Acids Res. (2004) 32:W100–W103.
Gusfield D. Algorithms on Strings, Trees and Sequences. (1997) New York: Cambridge University Press. 348–350.
Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA (1992) 89:10915–10919.
Holm L, Sander C. Mapping the protein universe. Science (1996) 273:595–603.
Hubbard TJP, et al. SCOP: a structural classification of proteins database. Nucleic Acids Res. (1997) 25:236–239.
Hubbard TJP, Blundell TL. Comparison of solvent-inaccessible cores of homologous proteins – definitions useful for protein modelling. Protein Engng. (1987) 1:159–171.
Kinjo AR, Nishikawa K. Eigenvalue analysis of amino acid substitution matrices reveals a sharp transition of the mode of sequence conservation in proteins. Bioinformatics (2004) 20:2504–2508.
Kolodny R, et al. Comprehensive evaluation of protein structure alignment methods: scoring by geometric measures. J. Mol. Biol. (2005) 346:1173–1188.
Krissinel E, Henrick K. Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions. Acta Cryst. D (2004) 60:2256–2268.
Krissinel E, Henrick K. Multiple alignment of protein structures in three dimensions. In: Berthold MR, et al., eds. CompLife 2005, LNBI 3695. (2005) Springer-Verlag Berlin Heidelberg. 67–78.
Kyte J, Doolittle RF. A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. (1982) 157:105–132.
Leinonen R, et al. UniProt Archive. Bioinformatics (2004) 20:3236–3237.
Murzin AG, et al. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. (1995) 247:536–540.
Naor D, et al. Amino acid pair interchanges at spatially conserved locations. J. Mol. Biol. (1996) 256:924–938.
Pace C, et al. Forces contributing to the conformational stability of proteins. FASEB J. (1996) 10:75–83.
Pearl FM, et al. The CATH database: an extended protein family resource for structural and functional genomics. Nucleic Acids Res. (2003) 31:452–455.
Rost B. Twilight zone of protein sequence alignments. Protein Engng. (1999) 12:85–94.
Serrano L, et al. Aromatic-aromatic interactions and protein stability. Investigation by double-mutant cycles. J. Mol. Biol. (1991) 218:465–475.
Valencia A, et al. GTPase domains of ras p21 oncogene protein and elongation factor Tu: analysis of three-dimensional structures, sequence families, and functional sites. Proc. Natl Acad. Sci. USA (1991) 88:5443–5447.
Waldburger C, et al. Are buried salt bridges important for protein stability and conformational specificity? Nat. Struct. Biol. (1995) 2:122–128.