Structural Bioinformatics
A tale of two tails: why are terminal residues of proteins exposed?
The Mina and Everard Goodman Faculty of Life Sciences, Bar-Ilan University Ramat-Gan, 52900, Israel
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Motivation: It is widely known that terminal residues of proteins (i.e. the N- and C-termini) are predominantly located on the surface of proteins and exposed to the solvent. However, there is no good explanation as to the forces driving this phenomenon. The common explanation that terminal residues are charged, and charged residues prefer to be on the surface, cannot explain the magnitude of the phenomenon. Here, we survey a large number of proteins from the PDB in order to explore, quantitatively, this phenomenon, and then we use a lattice model to study the mechanisms involved.
Results: The location of the termini was examined for 425 small monomeric proteins (50-200 amino acids) and it was found that the average solvent accessibility of terminal residues is 87.1% compared with 49.2% for charged residues and 35.9% for all residues. Using a cutoff of 50% of the maximal possible exposure, 80.3% of the N-terminal and 86.1% of the C-terminal residues are exposed compared to 32% for all residues. In addition, terminal residues are much more distant from the center of mass of their proteins than other residues. Using a 2D lattice, a large population of model proteins was studied on three levels: structural selection of compact structures, thermodynamic selection of conformations with a pronounced energy gap and kinetic selection of fast folding proteins using Monte-Carlo simulations. Progressively, each selection raises the proportion of proteins with termini on the surface, resulting in similar proportions to those observed for real proteins.
Contact: ron{at}biocom1.ls.biu.ac.il
| 1 INTRODUCTION |
|---|
Quite a few studies have been devoted to understanding the structural features of the first and last residues of proteins (i.e. the termini). Two lines of investigation have been pursued. The first asks whether the two termini of a protein tend to be closer to each other than would be expected for a random distribution of distances. The second asks whether the properties of the N-terminal differ from those of the C-terminal. This is an important question since it bears on the controversial issue of sequential folding, i.e. whether folding, for example on the ribosome, is a sequential process that proceeds from the N-terminal to the C-terminal. In pioneering work, Thornton and Sibanda (1983) evaluated the distances between termini in 52 proteins and concluded that the distances between termini are smaller than expected for random chains. Christopher and Baldwin (1996) examined a much larger set of proteins and reached a different conclusion: that the distance between termini is not statistically different from the random expectation. A recent study (Krishna and Englander, 2005) contributed the interesting observation that proteins that fold with two-state kinetics have their termini close together, while proteins that fold with non-two-state kinetics have their termini separated.
The different environments of the termini were first studied in Thornton and Chakauya (1982), where it was observed that, for proteins which existed at that time in the PDB, the N-terminal regions tend to adopt an extended beta-strand conformation while C-terminal regions are often helical. In Alexandrov (1993) it was argued that N-terminal residues tend to have more intramolecular contacts than the C-terminal, suggesting that the N-terminal folds before the C-terminal. Laio and Micheletti (2006) re-examined the data and did not see this tendency. They did find, however, that the C-terminal is significantly more compact and locally organized than the N-terminal, although they argue that the bias is not due to sequential folding.
All these studies are based on the observation that protein termini tend to be on the surface of proteins and not buried in the core. This fact is critical for all these studies since it supplies the background against which calculations are tested. For example, when comparing the expected distance between termini, it is critical to consider the fact that termini are mostly on the surface, since the average distance of random points on a surface of a sphere is very different from the expected distance between random points found anywhere within its volume.
Surprisingly, the tendency of termini to be located on the surface is commonly taken as a postulate without a sufficient explanation. For example, the paper by Christopher and Baldwin (1996) starts with the following statement: "The terminal regions of proteins differ in several ways from more internal segments. The termini are often surface exposed and flexible."
We are not aware of studies aiming to explore this issue and explain how the terminal residues come to be overwhelmingly located on the surface of proteins. At least for some proteins there is a need to bring the terminal residues to the surface to allow them to participate in post-translational processes (e.g. in N-terminal acetylation or methylation). However, many proteins do not undergo such modifications, and in any case this functional reason does not supply a mechanism to support the tendency of terminal residues to be located on the surface of folded proteins.
A common explanation often given for this tendency is that terminal residues are charged: the first amino group (which is not bonded to a carboxyl group) is positively charged, and likewise the last carboxyl group which is not paired with an amino group is negatively charged. Charged residues would tend to be on the surface of proteins because of their favorable interactions with water which is a polar solvent. However, this argument is valid also for charged amino acids like lysine, arginine, aspartic acid and glutamic acid. While these residues tend indeed to be located on the surface of proteins, we show here that the terminal residues are much more exposed than charged amino acids.
In our study we first use the large collection of protein structures that currently exist in the PDB to measure, by various methods, the extent to which termini are indeed located on the surface and exposed to the solvent. Next, we want to understand the mechanisms leading to this behavior. Since full-atom modeling of the thermodynamics of proteins, and especially of their dynamic properties, is not practical, we have chosen to use lattice models.
Despite the simplification of such models and the very short sequences that are usually used, they have been shown to capture generic properties of real proteins such as: collapse transitions, mutational properties, development of secondary and tertiary structure and folding kinetics (Dill et al., 1995; Sali et al., 1994; Unger and Moult, 1996).
Using a lattice, we generate a large population of model proteins and study their properties by selecting proteins on three levels: structural selection of compact structures, thermodynamic selection of conformations with strong energy preferences and kinetic selection of fast folding proteins using Monte-Carlo (MC) simulations. We show how, progressively, each selection raises the proportions of proteins with termini on the surface, resulting in very similar proportions to what is measured for real proteins.
| 2 METHODS |
|---|
2.1 The PDB dataset
PDB entries were taken from the non-redundant PDB set (http://www.ncbi.nlm.nih.gov/Structure/VAST/nrpdb.html) using the non-redundancy threshold of a p-value of 10^-40. From this list we took only monomeric structures of length between 50 and 200 amino acids that were solved by X-ray crystallography and for which no missing residues were reported. A total of 425 structures were considered.
2.2 Exposure analysis of residues of proteins
Two methods were used to determine the extent to which termini are located on the surface of proteins. The first measure is based on the exposure of termini residues to solvent, and the second on the distance of the termini from the center of mass of each protein.
2.2.1 Exposure calculations
The corresponding DSSP files for the PDB entries were downloaded from ftp://ftp.cmbi.ru.nl/pub/molbio/data/dssp/. We used the solvent accessibility value in the DSSP as the exposure measurement as described in Kabsch and Sander (1983). The relative solvent accessibility of each residue was calculated by normalizing its solvent accessibility to the maximum possible value for that amino acid (Shrake and Rupley, 1973).
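The normalization described above can be sketched as follows. This is only an illustration: the function names are hypothetical, the table of per-residue maxima is passed in by the caller, and the two numbers in `example_max` are made-up placeholders, not the published maximal accessibility values.

```python
def relative_accessibility(acc, residue, max_acc):
    """Solvent accessibility as a fraction of the maximum for that residue type."""
    return acc / max_acc[residue]


def is_exposed(acc, residue, max_acc, cutoff=0.5):
    """A residue counts as exposed when it exceeds the cutoff fraction."""
    return relative_accessibility(acc, residue, max_acc) > cutoff


# Hypothetical maxima, for illustration only (not real reference values).
example_max = {"ALA": 100.0, "LYS": 200.0}
```

With the 50% cutoff used in this study, a residue sitting exactly at half of its maximum is not counted as exposed, because the comparison is strict.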
2.2.2 Distance from center of mass
Whereas solvent exposure is a very common way to measure the extent to which amino acids are on the surface of proteins, there might be a problem in using it for terminal residues. Some of the protection from the solvent is supplied by the main chain and the side chain of the two immediate neighbors of each amino acid. However, terminal residues are truncated and have only one neighboring residue. Thus, to enable independent assessment of the location of terminal residues we suggest measuring the distance of each amino acid to the center of mass of its protein. Residues with the highest distance will be on the surface. Since proteins are of different sizes, and hence expected distances, we normalized this measure for each protein in terms of standard deviation according to:
$$\text{Normalized distance} = \frac{D - \mathrm{Avg}(Ds)}{SDV}$$
… only), D is the absolute distance of a residue from the center of mass, Avg(Ds) is the average distance of all residues from the center of mass and SDV is the standard deviation of these distances.
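A minimal sketch of this normalization is given below. It assumes one coordinate per residue, treats all residues as having equal mass (so the center of mass is the plain centroid), and uses the population standard deviation; these choices, and the function name, are assumptions made for illustration.

```python
import math
import statistics


def normalized_distances(coords):
    """Distance of each residue from the centroid, in standard-deviation units."""
    n = len(coords)
    # Equal-mass assumption: the center of mass is the mean of the coordinates.
    center = tuple(sum(c[k] for c in coords) / n for k in range(3))
    dists = [math.dist(c, center) for c in coords]
    avg = statistics.mean(dists)
    sd = statistics.pstdev(dists)  # undefined (zero) if all distances are equal
    return [(d - avg) / sd for d in dists]
```

Positive values mark residues farther from the center than average, so surface residues are expected to have the largest values.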
2.3 Lattice model of proteins
The polypeptide chains in the simulations are modeled as a linear sequence of residues on a 2D lattice. In order to increase the spectrum of interactions relevant to our study, four different types of residues are used, instead of the common HP model with only two types of residues (Dill et al., 1995; Chan and Dill, 1996). These are hydrophobic (H), neutral polar (P), positively charged (+) and negatively charged (-). Interactions are considered only between residues in neighboring lattice points (diagonal points are not considered neighboring). Interactions between consecutive residues in the sequence are not considered since they are always present and are independent of the conformation.
The energy of sequence S of length N in conformation C is given by
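$$E(S, C) = \sum_{i=1}^{N-2} \sum_{j=i+2}^{N} e(S_i, S_j)\, \Delta(C_i, C_j)$$
where e(S_i, S_j) is the interaction energy between the residue types at positions i and j, and Δ(C_i, C_j) equals 1 if residues i and j occupy neighboring lattice points in conformation C and 0 otherwise.
[Table: pairwise interaction energies e for the residue types H, P, (+) and (-); the values are not recoverable here.]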
Those values were chosen to reflect the average strength of interactions in empirical mean force potentials (Miyazawa and Jernigan, 1993), where H-H interactions are stronger than P-P interactions, H-P and H-(+)/(-) interactions are neutral, P-(+)/(-) interactions are weakly favorable, (+)-(+) or (-)-(-) interactions are repulsive and (+)-(-) interactions are the strongest attractor. Repeating the experiments described here with variations of this potential yielded similar results.
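The contact energy of a chain on the lattice can be sketched as below. The numbers in `ENERGY` are hypothetical placeholders chosen only to respect the qualitative ordering described above; they are not the values used in this study. A conformation is assumed to be a list of (x, y) lattice points, one per residue.

```python
# Hypothetical contact energies (illustration only, not the study's table).
# Keys are the two residue types sorted in ASCII order: '+' < '-' < 'H' < 'P'.
ENERGY = {
    "HH": -3.0, "PP": -1.0, "HP": 0.0,
    "+H": 0.0, "-H": 0.0, "+P": -0.5, "-P": -0.5,
    "++": 2.0, "--": 2.0, "+-": -4.0,
}


def chain_energy(sequence, conformation):
    """Sum contact energies over non-consecutive residues on adjacent points."""
    position = {xy: i for i, xy in enumerate(conformation)}
    total = 0.0
    for i, (x, y) in enumerate(conformation):
        # Looking only right and up counts every adjacent pair exactly once.
        for neighbor in ((x + 1, y), (x, y + 1)):
            j = position.get(neighbor)
            if j is not None and abs(i - j) > 1:
                total += ENERGY["".join(sorted(sequence[i] + sequence[j]))]
    return total
```

For a four-residue chain folded into a unit square, the only counted contact is between the first and last residues, since all other adjacent pairs are consecutive in the sequence.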
2.4 Generation of model sequences
A total of 3342 unique model sequences were generated by random rearrangements of 25 residues, drawn from a distribution of 45% H, 30% P, 12.5% (+) and 12.5% (-). This composition is similar to the composition of amino acid groups in the PDB (http://us.expasy.org/sprot/relnotes/relstat.html, PFB release 49.1) which is 44.3% neutral, 30.7% polar and 12% positively charged amino acids and 13% negatively charged amino acids. To be consistent with the fact that protein termini are charged, oppositely charged amino acids (+/-) were assigned to both termini.
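One way to generate such sequences is sketched below. Two assumptions are made for illustration: interior residues are drawn independently from the stated distribution, and the first residue is always (+) and the last always (-), mirroring the charged amino and carboxyl groups. The function names are hypothetical.

```python
import random


def random_sequence(rng, length=25):
    """Random four-letter sequence with oppositely charged termini."""
    interior = rng.choices("HP+-", weights=[45, 30, 12.5, 12.5], k=length - 2)
    # Assumption: N-terminus positive, C-terminus negative.
    return "+" + "".join(interior) + "-"


def unique_sequences(count, rng, length=25):
    """Keep drawing until the requested number of distinct sequences exists."""
    found = set()
    while len(found) < count:
        found.add(random_sequence(rng, length))
    return sorted(found)
```

Passing an explicitly seeded `random.Random` instance keeps a generated population reproducible across runs.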
2.5 Generation of compact structures
Conformations 25 residues long that fit into a 6 x 6 square are considered compact. For each sequence of 25 residues, all possible 9 646 215 such compact non-symmetric conformations were recursively generated, and the energy values for all these conformations were calculated based on the potential. We assume that the minimal-energy conformation of the chain is one of the 9 646 215 compact structures, and thus the conformation with the minimal energy is considered the native structure of that particular sequence (Fig. 1). If there is more than one conformation with the same minimum, one is arbitrarily chosen. Sequences for which the simulation (to be described later) demonstrates that the minimal-energy conformation is not a compact structure (i.e. the simulations found a conformation that cannot be contained within a 6 x 6 square and which is better than all the compact conformations) were excluded retroactively from consideration. We encountered only very few (<1%) such cases.
2.5.1 Exposure measurement for the model
In the model, a residue is considered exposed if one of its four neighboring lattice points is part of the exterior, or if there is a path from this point to the exterior. Otherwise, it is considered buried. Note that by this definition a residue which is a neighbor only to an internal cavity point is not considered exposed. An example of a native structure is illustrated in Figure 1.
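This definition can be implemented with a flood fill over the empty lattice points, as in the sketch below. The approach (padding the bounding box by one point and filling from a corner) and the function names are illustrative assumptions, not a description of the original implementation.

```python
from collections import deque


def _neighbors(point):
    x, y = point
    return ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1))


def exposed_residues(conformation):
    """Return one boolean per residue: True if it touches the exterior."""
    occupied = set(conformation)
    xs = [x for x, _ in conformation]
    ys = [y for _, y in conformation]
    # Pad the bounding box by one point so the exterior forms a connected frame.
    lo_x, hi_x = min(xs) - 1, max(xs) + 1
    lo_y, hi_y = min(ys) - 1, max(ys) + 1
    start = (lo_x, lo_y)  # a padding corner, never occupied
    exterior = {start}
    queue = deque([start])
    while queue:
        for nxt in _neighbors(queue.popleft()):
            in_box = lo_x <= nxt[0] <= hi_x and lo_y <= nxt[1] <= hi_y
            if in_box and nxt not in occupied and nxt not in exterior:
                exterior.add(nxt)
                queue.append(nxt)
    # Empty points never reached by the fill are internal cavities.
    return [any(n in exterior for n in _neighbors(p)) for p in conformation]
```

Because cavity points are never reached from the padded frame, a residue whose only empty neighbor is a cavity point is reported as buried, in line with the definition above.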
2.6 Compact structures with significant energy gap
There is a large variance of the spectrum of energy values of the conformational space for different proteins. As suggested in Sali et al. (1994), a significant energy gap is important in order to ensure kinetic accessibility of the native structure. For each sequence, we measure the difference between the minimal energy (i.e. the native conformation) and the average energy of all conformations in units of standard deviations of the average energy. The larger the difference between these two numbers, the more pronounced is the energy gap. We selected the 800 sequences (out of the 3342), with the largest difference of their native structure for the simulations in the kinetic accessibility stage.
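The gap measure and the selection step can be sketched as follows. The use of the population standard deviation and the function names are assumptions for illustration; `gaps` is assumed to map each sequence to its precomputed gap.

```python
import statistics


def energy_gap(energies):
    """Gap between the mean and the minimum energy, in standard deviations."""
    native = min(energies)
    return (statistics.mean(energies) - native) / statistics.pstdev(energies)


def top_by_gap(gaps, k):
    """Return the k sequences with the largest energy gap."""
    return sorted(gaps, key=gaps.get, reverse=True)[:k]
```

In the setting described above, `energies` would hold the energies of all enumerated compact conformations of one sequence, and `k` would be 800.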
2.7 Simulation technique
Folding dynamics is simulated using the MC method with the Metropolis criterion (Metropolis et al., 1953). A chain starts as a random conformation and folds by the following algorithm: from a conformation S1 with energy E1, a random change (a move) of conformation to S2 is performed and the energy E2 is evaluated. If E1 ≥ E2, then the move to conformation S2 is accepted; otherwise acceptance of the move depends on the following non-deterministic criterion:
$$Rnd < \exp\left(\frac{E_1 - E_2}{T}\right)$$
where Rnd is a random number drawn uniformly between 0 and 1 and T is the simulation temperature.
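A minimal sketch of the Metropolis acceptance step is shown below, assuming the standard Boltzmann form in which an uphill move is accepted with probability exp(-(E2 - E1)/T). The function name and the explicit random-number-generator argument are illustrative choices.

```python
import math
import random


def accept_move(e1, e2, temperature, rng):
    """Metropolis rule: always accept downhill or equal-energy moves."""
    if e2 <= e1:
        return True
    # Uphill moves are accepted with a probability that decays with the
    # energy increase and grows with the simulation temperature.
    return rng.random() < math.exp(-(e2 - e1) / temperature)
```

Raising the temperature makes uphill moves more likely to be accepted, which helps a chain escape local minima at the cost of a less directed search.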
2.8 Kinetic accessibility
In order to examine and characterize the kinetic accessibility of a model sequence to its pre-calculated native structure, each of the 800 sequences with the largest energy gaps was simulated and analyzed by the following protocol: A single simulation of a model sequence consists of 10^6 MC steps (MCS). The simulation process is terminated once the native conformation is found or after 10^6 MCS. Some flexibility is allowed in reaching the native conformation. We considered the native conformation as found if the simulation reached a conformation within a distance of <0.5 root mean square distance from the native conformation. (This distance is roughly equal to 2 out of the 25 residues being off by one lattice point from the corresponding position in the native conformation.) The number of MCS taken to find the structure is considered as the first passage time (FPT). For each sequence, 50 independent simulations were run with the same folding parameters (simulation temperature, local moves library size and tail moves probability). If a model sequence was folded successfully in more than a defined percentage of the runs (e.g. 80%, 40 out of 50 runs), it is considered a fast folder; otherwise, it is considered a slow folder. This threshold parameter, as well as other simulation parameters such as the tail moves probability and the local moves library size (L), was varied in our simulations.
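The classification step can be sketched as below. It assumes that each run is recorded as its first passage time, with `None` marking a run that did not reach the native conformation, and that a sequence meeting the threshold exactly (e.g. 40 of 50 runs at 80%) counts as a fast folder; both conventions are assumptions made for this illustration.

```python
def is_fast_folder(first_passage_times, threshold=0.8):
    """True if the fraction of successful runs reaches the threshold."""
    runs = len(first_passage_times)
    successes = sum(t is not None for t in first_passage_times)
    return successes / runs >= threshold


def mean_fpt(first_passage_times):
    """Average first passage time over the successful runs only."""
    found = [t for t in first_passage_times if t is not None]
    return sum(found) / len(found)
```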
| 3 RESULTS |
|---|
3.1 Analysis of PDB structures
We start by calculating the exposure of the termini in a dataset of 425 non-redundant monomeric proteins from the PDB. The average normalized solvent accessibility of terminal residues is 87.1% compared with 49.2% for charged residues and 35.9% for all residues. We consider a residue with solvent accessibility of >50% of its maximal surface area as exposed. Figure 3 shows the exposure of residues in the N- and C-terminal regions, i.e. the first and last 10 residues of each protein. It is clearly seen that the terminal residues are highly exposed (80.3 and 86.1% for N- and C-terminal residues, respectively), whereas there is a much smaller effect on the residues adjacent to the termini. When the analysis is done based on amino acid type (Fig. 4) we see, as expected, that charged residues are more exposed than hydrophobic and polar residues but that terminal residues are much more exposed than charged residues.
It might be argued that solvent accessibility of terminal residues is large because they are missing one of their neighboring residues that could have provided additional shield from the solvent. Thus, in order to probe directly the location of the terminal residues we measured the distance of the terminal residues and all other residues from the center of mass of their proteins. The distance was normalized, in units of standard deviation, to the average distance of residues to the center of mass for each protein. The results, shown in Figure 5, indicate that indeed terminal residues are found much more on the exterior of proteins as compared to any other type of residues.
Thus, we can say that protein termini are indeed predominantly located on the surface. Out of the 425 proteins only 132 have one terminus buried (i.e. <50% exposure), and 13 have both termini buried. If we use a cutoff of 25% exposure then there are only 38 proteins with one buried terminus and 2 with both termini buried. With a 10% exposure cutoff, only 14 proteins have one terminus buried and none has both. An example of one of the 14 cases, staphopain, is shown in Figure 6.
3.2 Analysis of model proteins
For extended conformations of model proteins, most residues are exposed. We collected data from 42 450 extended conformations produced by MC simulations and observed (Fig. 7) that all residues are exposed in >80% of the extended structures. For the three terminal residues on each side, >90% are exposed and the very terminal residues are >95% exposed. To gather statistics about compact conformations, 3342 unique random sequences of 25 residues were created. For each sequence, all possible 9 646 215 two-dimensional compact non-symmetric conformations that fit into a 6 x 6 lattice were generated. For these compact structures, the exposure profile of the proteins is quite flat along the structure and all residues have approximately 70% exposure (Fig. 7).
Next we turn to analyze the exposure profile of native structures (i.e. minimal energy structures). We used enumeration of compact structures of the 3342 sequences composed of an alphabet of four types: (H) hydrophobic, (P) polar, (+) positively charged and (-) negatively charged, in proportions similar to what is found in the PDB. For each sequence, using a table of mean force potential [reflecting an average of the strength of interactions between the corresponding amino acids (Miyazawa and Jernigan, 1993)] the energy of every compact conformation was evaluated. The conformation with the lowest energy was considered the native conformation. The percent of exposed residues was calculated for all the native structures.
Figure 8 shows the exposure by residue type and demonstrates that for native structures in our model, terminal residues are more exposed than other types of residues. While these exposures are higher than observed for real proteins (Fig. 4), they do show the same rank between residues type as in real proteins.
The percent of exposed residues was calculated for the entire set and for the 800 proteins for which the native structure has the largest gap in energy from the average energy value. The tendency of the terminal residues to be exposed is slightly higher (89.2%) for those proteins than for the entire set (87%). If we use the top 200 sequences, the tendency rises slightly further to 90.8%.
3.3 Analysis of kinetic folding
Proteins were divided into two groups, fast folders and slow folders. The separation was based on the ability of sequences to fold to their native conformation in an MC simulation of 10^6 steps. Each sequence was run 50 times and proteins that were able to find the native conformation in more than a threshold percentage of the simulations were considered fast folders, and proteins that found their native structure in less than that threshold percentage of runs were considered slow folders. A threshold of 80% (which was used in most simulations) yielded 355 fast folders and 445 slow folders. A comparison of the percent of exposed residues for fast and slow folding proteins is shown in Figure 9, showing a significant difference. The exposure by residue type for the 355 fast folders is shown in Figure 10.
The simulations were performed using different parameters of local move set, percent of tail moves, threshold between fast and slow folders and in all cases the conclusion was similar: in all simulations proteins that fold fast have a higher percentage of their termini exposed than slow folding proteins (Fig. 11).
Furthermore, we performed longer simulations of 6 x 10^6 MC moves for two groups of proteins: 78 proteins for which the native conformation has the two termini on the surface, and 78 proteins for which in the native structure at least one of the termini was not exposed. Again we saw that proteins with exposed termini fold faster: the average folding time (FPT) for proteins with exposed termini was 204 000 MCS compared with 404 000 MCS for proteins with at least one buried terminus.
| 4 DISCUSSION |
|---|
We set out to explain why terminal residues of proteins tend to be located on the surface. We first measured the location of the terminal residues in a dataset of 425 monomeric short proteins. We used two different measurements: first we checked the solvent accessibility of these residues and second we checked the distance of these residues from the center of mass of their proteins. Taken together, the results clearly indicate that terminal residues are indeed overwhelmingly located on the surface of proteins.
Based on this finding, we want to understand the mechanisms that force terminal residues to be on the surface. It is clear that many proteins need to have their termini exposed in order to make them accessible to post-translational modifications, which are common for both termini (Dixon, 1984; Chung et al., 2002). Thus, it can be argued that the location of terminal residues on the surface is a desirable feature that can be selected for by evolution. This feature could have been selected for directly, or, as is common in evolutionary processes, could have been incorporated into other considerations that would have preferred this feature. We suggest that the latter is true, i.e. thermodynamic and kinetic considerations that are known to have an effect on proteins could lead to such a preference.
Using a simple lattice model, we demonstrate that a series of constraints that affect proteins will lead to the preference of terminal residues to be located on the surface. Clearly, for extended conformations of proteins, all residues tend to be exposed (Fig. 7). But even for compact conformations, our analysis shows that the exposure profile is quite flat, and all residues tend to be equally exposed (Fig. 7). When only conformations with minimal energy (i.e. native conformations) are considered, terminal residues start to prefer to be located on the surface. When native conformations with a pronounced energy gap are considered, this tendency increases. If we look at proteins that can fold fast in kinetic simulations, we see that the tendency of terminal residues to be exposed is increased further (Fig. 10). Proteins that require terminal residues to be tucked inside the core may be prohibitively complicated to fold. To conclude, we suggest that the tendency of terminal residues of proteins to be located on the surface is a result of thermodynamic and kinetic selection processes. Indeed, model proteins that have been selected using these considerations (Fig. 10) exhibit a similar exposure profile to real proteins (Fig. 4). The lattice work presented here is based on small monomeric structures. In the future it might be of interest to examine the model on larger oligomeric structures.
| Acknowledgments |
|---|
The authors thank Inna Myslyuk for assistance with the art work, and Tirza Doniger for useful comments on the manuscript.
| REFERENCES |
|---|
Alexandrov, N. (1993) Structural argument for N-terminal initiation of protein folding. Protein Sci., 2, 1989-1991.
Chan, H.S. and Dill, K.A. (1996) A simple model of chaperonin-mediated protein folding. Proteins, 24, 345-351.
Chung, J.J., et al. (2002) Functional diversity of protein C-termini: more than zipcoding? Trends Cell Biol., 12, 146-150.
Christopher, J.A. and Baldwin, T.O. (1996) Implications of N and C-terminal proximity for protein folding. J. Mol. Biol., 257, 175-187.
Dill, K.A., et al. (1995) Principles of protein folding: a perspective from simple exact models. Protein Sci., 4, 561-602.
Dixon, H.B.F. (1984) N-terminal modification of proteins. J. Protein Chem., 3, 99-108.
Kabsch, W. and Sander, C. (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22, 2577-2637.
Krishna, M.M. and Englander, S.W. (2005) The N-terminal to C-terminal motif in protein folding and function. Proc. Natl Acad. Sci. USA, 102, 1053-1058.
Laio, A. and Micheletti, C. (2006) Are structural biases at protein termini a signature of vectorial folding? Proteins, 62, 17-23.
Metropolis, N., et al. (1953) Equations of state calculations by fast computing machines. J. Chem. Phys., 21, 1087-1091.
Miyazawa, S. and Jernigan, R.L. (1993) A new substitution matrix for protein sequence searches based on contact frequencies in protein structures. Protein Eng., 6, 267-278.
Sali, A., et al. (1994) How does a protein fold? Nature, 369, 248-251.
Shrake, A. and Rupley, J.A. (1973) Environment and exposure to solvent of protein atoms. Lysozyme and insulin. J. Mol. Biol., 79, 351-371.
Skolnick, J. and Kolinski, A. (1991) Dynamic Monte Carlo simulations of a new lattice model of globular protein folding, structure and dynamics. J. Mol. Biol., 221, 499-531.
Thornton, J.M. and Chakauya, B.L. (1982) Conformation of terminal regions in proteins. Nature, 298, 296-297.
Thornton, J.M. and Sibanda, B.L. (1983) Amino and carboxy-terminal regions in globular proteins. J. Mol. Biol., 167, 443-460.
Unger, R. and Moult, J. (1996) Local interactions dominate folding in a simple protein model. J. Mol. Biol., 259, 988-994.