Bioinformatics Advance Access originally published online on March 31, 2005
Bioinformatics 2005 21(12):2839-2843; doi:10.1093/bioinformatics/bti421
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Density guided importance sampling: application to a reduced model of protein folding
1Astbury Centre for Structural Molecular Biology, Department of Biochemistry and Microbiology, University of Leeds Leeds LS2 9JT, UK
2Department of Biochemistry, University of Bristol, School of Medical Sciences, University Walk Bristol BS8 1TD, UK
*To whom correspondence should be addressed.
| Abstract |
|---|
|
|
|---|
Motivation: Monte Carlo methods are the most effective means of exploring the energy landscapes of protein folding. The rugged topography of folding energy landscapes causes sampling inefficiencies however, particularly at low, physiological temperatures.
Results: A hybrid Monte Carlo method, termed density guided importance sampling (DGIS), is presented that overcomes these sampling inefficiencies. The method is shown to be highly accurate and efficient in determining Boltzmann weighted structural metrics of a discrete off-lattice protein model. In comparison to the Metropolis Monte Carlo method, and the hybrid Monte Carlo methods, jump-walking, smart-walking and replica-exchange, the DGIS method is shown to be more efficient, requiring no parameter optimization. The method guides the simulation towards under-sampled regions of the energy spectrum and recognizes when equilibrium has been reached, avoiding arbitrary and excessively long simulation times.
Availability: Fortran code available from authors upon request.
Contact: m.j.parker{at}leeds.ac.uk
| INTRODUCTION |
|---|
|
|
|---|
Considerations of protein folding and protein fold evolution argue for a hierarchic mechanism of structure acquisition (Baldwin and Rose, 1999). There is wide interest in comparing the folding of structural homologues and engineered variants to gain an understanding of how folding and stability are affected as the balance between secondary and tertiary interactions changes; an important design principle. Constraining elements of local structure has a profound influence on our ability to predict native tertiary folds. Indeed, one of the most successful approaches to ab initio protein tertiary structure prediction essentially uses predefined elements of local structure as building blocks during global minimization (Simons et al., 1997). Local structure propensities also play a role in protein mis-folding. For example, sequence elements with a high ß-strand propensity may mediate amyloid fibril formation (Prusiner, 1997; Mihara and Takahashi, 1997; Booth et al., 1997). Reliable methods for quantifying local structure propensities have additional medical relevance therefore.
Modern secondary structure prediction algorithms typically rely on associating homologies between the query sequence and template sequences of known structure. While fast, they are necessarily limited to sequences exhibiting discernable homology with a restricted set of template sequences and structures. Moreover, they do not give a thermodynamic measure of a sequence's conformational tendency. This requires knowledge of the temperature-dependent Boltzmann distribution of the conformation space. Reduced models of proteins, which use statistical and/or physical based potentials, are quite successful at predicting low complexity folds (Kolinski and Skolnick, 2004). Even with this reduced complexity however, sampling methods must be employed to estimate Boltzmann weighted parameters. Simple energy minimization may be insufficient in cases of marginal stability, where the time-averaged conformation comprises a diverse range of structures.
Markov chain Monte Carlo methods are commonly used for estimating the canonical potential energy surface (PES) of high-dimensional systems, in particular protein folding (Onuchic et al., 1997). To calculate thermodynamic parameters, these methods must generate a Markov chain of configurations such that each configuration in the chain has a probability determined by the Boltzmann distribution. Averages over the chain only accurately converge to the expectations with respect to the Boltzmann distribution if the sampling is ergodic.
The Metropolis Monte Carlo (MMC) transition probability [p(Ei
Ei+1)] is temperature-dependent (Metropolis et al., 1953):
![]() | (1) |
E = Ei+1 Ei is the difference in energy. The MMC simulations of protein folding are complicated by two primary factors: the vast size of the configuration space and the complexity of the PES. The complexity of the PES arises from the heterogeneous interplay of myriad interactions operating over a range of interatomic distances. The resultant PES exhibits a rugged topography with configurations that may be structurally similar but greatly different in energy. Below critical temperatures, tunnelling problems are encountered and the simulation becomes quasi-ergodic in behaviour. Trajectories between structurally diverse but statistically important energy minima have low transition probabilities. The trapping of the simulation in isolated minima leads to statistical inaccuracy in the sample, as the system does not reach equilibrium. The problem is insidious as it is hard to discern quasi-ergodic behaviour in a Monte Carlo simulation. A number of extended MMC methods have been developed for increasing sampling efficiency, notably jump-walking (J-Walking) (Frantz et al., 1990), smart-walking (S-Walking) (Zhou and Berne, 1997) and replica-exchange (RepEx) (Swendsen and Wang, 1986). Essentially, these access a larger proportion of the configuration space, uncovering routes to alternate minima and facilitating transitions between isolated regions. In the J-Walking method, alternative configurations are sampled periodically from a high temperature simulation run in parallel. This increases the probability of accessing alternative minima in the low temperature simulation. The choice of the high temperature is critical. It must be high enough that the simulation is ergodic but low enough such that the acceptance probability (or jump acceptance ratio) is reasonably high; i.e. there must be sufficient overlap between the two Boltzmann distributions. The use of multiple parallel simulations conducted over a range of temperatures in the RepEx method addresses this issue, although the CPU cost increases linearly with every additional simulation. S-Walking is a computationally more efficient variation of the J-Walking method. Here the high temperature configurations are subjected to a steepest decent minimization before jumps are attempted, resulting in effective sampling to local minima. However, on a complex energy surface a significant proportion of these quenched configurations will have a relatively high energy; thus the acceptance probability will still be low. Again, the choice of the high temperature is critical. In addition to issues of parameter optimization (high temperature and trial frequency), there is no direct way to establish whether equilibrium has been reached, and therefore whether the expectations calculated accurately reflect Boltzmann weighted values.
The flat histogram algorithm of Wang and Landau (2001a) is a novel Monte Carlo method for estimating the density of states, and has been applied successfully to a number of different systems (Wang and Landau, 2001b; Jain and de Pablo, 2002; Rathore and de Pablo, 2002; Yan et al., 2002). The method is based on the principle that if one performs a random walk in energy space with transition probabilities proportional to the reciprocal of the density of states, then a flat energy occupancy histogram is generated. The transition probability is
![]() | (2) |
g(E) x f. If a possible move is rejected and stays at the same energy level, one modifies the existing density of states with the same modification factor. During the random walk an occupancy histogram (H(E)), which counts the number of visits at each energy level, is also accumulated. When H(E) is flat in the energy range of the random walk, the density of states has converged to the true value with an accuracy proportional to f. H(E) is then reset, f reduced to a finer value and another random walk conducted. This is continued until f is smaller than a predefined value. As the method is independent of temperature, it does not suffer from the tunnelling problems encountered above, and is a far more efficient algorithm. To calculate Boltzmann weighted structural parameters, a two-dimensional energy/structural metric histogram is required. The degeneracy inherent in constructing the histogram leads to inaccuracies in these measurements however. Also, the lower the density of the bin, the larger the error associated with it. In a protein system, the low energy bins, which make the largest contribution to the partition function, have the lowest density. Together these two problems cause significant inaccuracies at low temperatures, especially in the two-dimensional case where there are a greater number of low-density bins. While refining the run parameters increases accuracy, the extra computational cost is significant and disproportionate.
Here we present a hybrid Monte Carlo method called density guided importance sampling (DGIS), which combines the best aspects of the MMC and the flat histogram methods. The method is applied to three peptide systems to calculate Boltzmann weighted structural metrics as a function of temperature. The DGIS method is revealed to be efficient and accurate, outperforming J-Walking, S-Walking and RepEx.
| METHODS |
|---|
|
|
|---|
RAFT: a simplified protein model
A simplified off-lattice model for describing proteins has been developed called RAFT (reduced Ramachandran angle and folding force field for tertiary structure prediction) (Gibbs et al., 2001). Central to the method are the observations by Wodak (Rooman et al., 1991) and later by Levitt (Park and Levitt, 1994) that polypeptide conformations may be adequately described using only four or six pairs of

backbone torsion angles. Such representations reduce substantially the conformation space of a polypeptide from the enormous number that can populate unrestricted Ramachandran space, and can represent backbones of known protein structures with overall RMS deviation from the experimental coordinates typically <2.0 Å. RAFT uses a set of six discrete 
angles per residue optimized from the Protein Data Bank (Berman et al., 2000); two cover the
-helical region of Ramachandran space, three cover the ß-sheet region and one covers the reverse
-helical region (P and G each have special 
angle sets). Each residue is represented as a sphere centred on its backbone atoms and the side chain by between zero (for glycine) and three (for lysine) spheres placed along the C
Cß vector, with the total volume approximating that which the real side chain occupies in water. The structure of a protein in our model can thus be described by a conformer descriptor; a sequence of numbers taking values between 1 and 6, each denoting the 
angle for a particular residue. This takes up far less storage than a series of xyz coordinates. The size of the conformation space in our model is thus given by 6nres, where nres is the number of residues.
The side chains are modelled to approximate rotationally averaged real side chains, working on the premise that intimate side chain docking is not an absolute requirement for achieving a native-like fold. Each sphere has an associated polar (P) or non-polar (NP) potential such that the sum of these for a particular residue corresponds to the free energy of transfer for the residue from water to the protein interior calculated from experimental data (Damodaran and Song, 1986; Rees, 1980; Wolfenden et al., 1981). The energy of a particular conformation is given by the sum of these spheresphere potentials according to an approximate LennardJones potential function, such that NPNP interactions are favourable, NPP are unfavourable and PP are null. A backbone hydrogen bond potential derived from experimental data is also incorporated (Makhatadze and Privalov, 1994) to stabilize
and ß secondary structures.
Employing a search strategy based on simulated annealing (Kirkpatrick et al., 1983), RAFT predicts a high percentage of correct folds for peptides up to 40 residues in length containing both
and ß secondary structures, and is similarly effective as a more complex, all-atom model at predicting the structures of proposed independent folding units (IFUs) (Gibbs et al., 2001; Pedersen and Moult, 1997).
To test the DGIS method we chose three peptide systems; PDB identifiers 1alc21-32 (
-helix), 1pga43-54 and 1fkf27-38 (both ß-hairpins). These are purported to be IFUs (Pedersen and Moult, 1997), and are small enough to enable full enumeration using the geometric description employed in RAFT. For our structural metric we calculate the root mean squared deviation between internal C
and Cß distance matrices (dRMS
ß). The minimum energy conformations found for these three systems using RAFT agree reasonably well with their experimental structures (see below).
Density of states histograms
The algorithm used was based on the sampling method developed by Wang and Landau (2001a). For each system the maximum and minimum energy values were obtained by exhaustive enumeration and 100 energy bins used. Random walks in energy space were performed (up to 10% of 
angles changed in any one step) with acceptance probabilities given by Equation (2). Initially, the density of states values (g(E)) and occupancy values (H(E)) were set to 1 and 0 respectively. With each visit to an energy bin, g(E) was modified as g(E)
g(E) x f, where f is a modification factor (f > 1), and H(E) values were incremented by 1. As in the original implementation of Wang and Landau, we use ln[g(E)]
ln[g(E)] + ln[f], allowing one to store all possible values of g(E) as double precision numbers in memory. After 10 000 moves the flatness of H(E) was assessed. If all H(E) values were within 5% deviation of the average value, H(E) was considered flat. Random walks of 10 000 moves were continued until the flatness criterion was met. H(E) was then reinitialized and a new random walk commenced using a finer modification factor value:
. An initial modification factor (f0) was chosen such that
, as suggested by Wang and Landau (2001a). A final value for f of exp(108) was used. This corresponds to 12 iterations.
Construction of a non-redundant library of conformer descriptors
A non-redundant library of conformer descriptors was constructed during a single iteration of the flat histogram algorithm for energy bins having a Boltzmann probability >0.01 at the maximum model simulation temperature. For each energy bin a maximum of 50 000 conformer descriptors were stored.
Density guided importance sampling
For a given temperature the density of states histogram is converted into a Boltzmann probability histogram (p(E)T):
![]() | (3) |
dRMS
ß
over a range of model temperatures.
DGIS comparison to MMC, J-Walking, S-Walking and RepEx
The J-Walking and S-Walking sampling methods were performed as described in Frantz et al. (1990) and Zhou and Berne (1997), respectively. As mentioned in the introduction, the performance of both methods is sensitive to the choice of high temperature (Thigh) and trial frequency (ftrial). To ensure a fair comparison with the DGIS method, these two parameters were optimized using a grid search procedure for each simulation temperature where tunnelling problems are encountered (see Results and Discussion section). Simulations were started from a random conformation and run until the
dRMS
ß
values matched those calculated by full enumeration, within 0.01 Å. The grid used Thigh values of 2.08.0 at intervals of 1.0 and ftrial values of 100, 200, 500, 1000 and 2000 MC steps.
The RepEx method was performed as described previously (Swendsen and Wang, 1986; Kofke, 2002), with the transition probability used for an exchange trial between replica simulations run at model temperatures T0 and T1 given by:
![]() | (4) |
T = T1 T0) and ftrial. For each system these parameters were optimized, again using a grid search procedure. The grid search used
T values of 0.1, 0.25 and 0.5, and ftrial values of 100, 200, 500, 1000 and 2000 MC steps. | RESULTS AND DISCUSSION |
|---|
|
|
|---|
The
dRMS
ß
T values calculated for each system using the DGIS method are plotted in Figure 1ac. Comparing these with the values obtained by full enumeration demonstrates that our method yields accurate results at all temperatures. For both 1pga43-54 and 1fkf27-38, the lowest energy conformer descriptors found using RAFT agree well with the experimental structures (dRMS
ß = 0.90 and 2.22 Å, respectively). For 1alc21-32, however, the lowest energy conformer descriptor bears little resemblance to the experimentally determined structure, with a dRMS
ß of 3.08 Å. The
dRMS
ß
values calculated for 1alc21-32, 1pga43-54 and 1fkf27-38 at a model temperature of 1.0 are 2.54, 2.29 and 2.23 Å, respectively. Boltzmann weighted ensembles of structures for these are shown in Figure 2 (see legend to Fig. 2 for details). These illustrate the correspondence between the time-averaged and experimental structures and emphasize the importance of sampling over minimization in ab initio predictions of protein conformations.
|
|
As mentioned in the introduction, the necessity of grouping values together in bins in the flat histogram method causes inaccuracies in the direct calculation of Boltzmann weighted parameters, particularly at low temperatures. This is illustrated by the data in Figure 1d, which shows a poor correspondence between
dRMS
ß
T values calculated for 1alc21-32 by full enumeration and directly using a hypothetical, zero error 2D density of states [g(Ei, dRMS
ßj)]. Here energy intervals identical to those in the DGIS method and 21 dRMS
ß intervals of 0.25 Å were used and the
dRMS
ß
T values were calculated from the full enumeration data:
![]() | (5) |
ßj)T values here represent the highest accuracy the flat histogram method can possibly accomplish. Although higher accuracies can be achieved by refining the dRMS
ß and energy bins, the extra CPU time is significant. With respect to computation time, the 2D flat histogram method, run using these 2100 energy/dRMS
ß bins for only four out of our prescribed 12 iterations, took approximately 15 times longer than the DGIS method. For all three systems, MMC simulations reveal critical temperatures where tunnelling problems are encountered (1alc21-32, 0.252.0; 1pga43-54, 0.252.50; 1fkf27-38, 0.252.25). Above these temperatures the MMC method yields accurate results on a sensible timescale. The MMC, J-Walking, S-Walking and RepEx methods were performed over these critical temperatures. At each temperature these were run for the length of time that the DGIS method took to reach equilibrium. The J-Walking and S-Walking methods used optimized values for Thigh and ftrial (see Methods). For the RepEx method also, the replica temperature intervals and ftrial values used were optimized for each system (see Methods) and the highest replica simulation temperatures used ensure fully ergodic sampling. All five methods were started from the same random conformation. The data are compared in Figure 1ac. Equilibration times calculated for 1alc21-32 are compared in Figure 3. Here, during the MMC, J-Walking, S-Walking and RepEx simulations, occupancy histograms (H(E)T) were constructed, as in the DGIS method, and the simulations run until the normalized H(E)T matched the pre-calculated p(E)T within 0.001, or a maximum number of MC steps was reached (2.5 x 107). The data in Figures 1 and 3 demonstrate that, in each case, DGIS outperformed MMC, J-Walking, S-Walking and RepEx methods.
|
The results of our grid search procedure for optimizing Thigh and ftrial for J-Walking and S-Walking (see Methods) demonstrate that the optimum value of Thigh is dependent on the sampling temperature. The optimum value for Thigh cannot be determined a priori, and the extra computation time required reduces the overall efficiency. As mentioned in the introduction, a balance must be achieved; Thigh must be high enough so that sampling is ergodic but low enough to yield a reasonable jump acceptance ratio. For certain systems, this balance cannot be achieved for relatively low sampling temperatures. Analogous considerations are relevant to the choice of temperature intervals used in the RepEx method (Kofke, 2002).
Normally, the MMC, J-Walking, S-Walking and RepEx methods are run somewhat arbitrarily, until the sampled property converges. The data in Figure 4 demonstrate that convergence in the MMC, J-Walking, S-Walking and RepEx simulations does not imply equilibrium has been reached. As there is no direct means of establishing equilibrium, one usually resorts to performing several independent simulations. The system is considered to be in equilibrium if the independent trajectories yield equivalent results (Thirumalai et al., 1989). As has been noted (Zhou and Berne, 1997), while this is a necessary condition, it is not sufficient.
|
With regard to the general applicability of the DGIS method, the same considerations as for the flat histogram method are pertinent. In addition, memory limitations will restrict the size of the non-redundant library that can be constructed for a given model complexity. Using RAFT we stored an arbitrary 2.5 x 106 conformer descriptors in memory for each system. A much greater number could certainly have been stored, and we have not looked at the effects of reducing the size of the library on the efficiency of the algorithm. In addition to the conformer descriptor (a sequence of integers), other metrics could be used as a redundancy filter, ensuring comprehensive coverage of the structure/energy spectrum.
Non-ergodic sampling is encountered in a wide range of systems, from relatively simple Ising models to full atom models of protein folding. The DGIS method directs jumps towards configurations that are representative of otherwise under-sampled regions of the energy spectrum, collated during temperature-independent sampling, ensuring ergodicity. Unlike the J-Walking, S-Walking and RepEx methods, it requires no parameter optimization, beyond the choice of the flat histogram parameters. Use of the Boltzmann probability histogram for establishing equilibrium makes for a highly efficient algorithm, providing an internal quality control that prevents arbitrary and excessively long simulation times.
As the results in this study demonstrate, combining the DGIS method with a reduced but effective protein model serves as a useful tool for exploring and visualizing the local conformational propensities of protein sequence fragments. This will have useful applications in protein structure prediction and for understanding protein mis-folding in disease.
| Acknowledgments |
|---|
We thank the Biotechnology and Biological Sciences Research Council, UK for funding. M.J.P. is a BBSRC David Phillips and University of Leeds Research Fellow.
Received on September 3, 2004; revised on March 21, 2005; accepted on March 30, 2005
| REFERENCES |
|---|
|
|
|---|
Baldwin, R.L. and Rose, G.D. (1999) Is protein folding hierarchic? I. Local structure and peptide folding. Trend. Biochem. Sci., 24, 2633[CrossRef][ISI][Medline].
Berman, H.M., et al. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235242
Booth, D.R., et al. (1997) Instability, unfolding and aggregation of human lysozyme variants underlying amyloid fibrilogenesis. Nature, 385, 787793[CrossRef][Medline].
Damodaran, S. and Song, K.B. (1986) The role of solvent polarity in the free-energy of transfer of amino-acid side-chains from water to organic-solvents. J. Biol. Chem., 261, 72207222
Frantz, D.D., et al. (1990) Reducing quasi-ergodic behaviour in Monte Carlo simulations by J-Walking: applications to atomic clusters. J. Chem. Phys, 93, 27692784[CrossRef].
Gibbs, N., et al. (2001) Ab initio protein structure prediction using physicochemical potentials and a simplified off-lattice model. Proteins, 43, 186202[CrossRef][ISI][Medline].
Jain, T.S. and de Pablo, J.J. (2002) A biased Monte Carlo technique for calculation of the density of states of polymer films. J. Chem. Phys., 116, 72387243[CrossRef].
Kirkpatrick, S., et al. (1983) Optimization by simulated annealing. Science, 220, 671680
Kofke, D.A. (2002) On the acceptance probability of replica-exchange Monte Carlo trials. J. Chem. Phys., 117, 69116914[CrossRef].
Kolinski, A. and Skolnick, J. (2004) Reduced models of proteins and their applications. Polymer, 45, 511524[CrossRef].
Makhatadze, G.I. and Privalov, P.L. (1994) Contribution of hydrogen bonding to the stability of proteins. Biophys. J., 66, 177.
Metropolis, N., et al. (1953) Equation of state calculation by fast computing machines. J. Chem. Phys., 21, 10871092[CrossRef].
Mihara, H. and Takahashi, Y. (1997) Engineering peptides and proteins that undergo alpha to beta transition. Curr. Opin. Struct. Biol., 7, 501508[CrossRef][ISI][Medline].
Onuchic, J.N., et al. (1997) Theory of protein folding: the energy landscape perspective. Annu. Rev. Phys. Chem., 48, 545600[CrossRef][ISI][Medline].
Park, B. and Levitt, M. (1994) The complexity and accuracy of discreet state models of protein structure. J. Mol. Biol., 249, 493507.
Pedersen, J.T. and Moult, J. (1997) Protein folding simulations with genetic algorithms and a detailed molecular description. J. Mol. Biol., 269, 240259[CrossRef][ISI][Medline].
Petersen, E.F., et al. (2004) UCSF chimeraa visualization system for exploratory research and analysis. J. Comput. Chem., 25, 16051612[CrossRef][ISI][Medline].
Prusiner, S.B. (1997) Prion diseases and the BSE crisis. Science, 278, 245251
Rathore, N. and de Pablo, J.J. (2002) Monte Carlo simulation of proteins through a random walk in energy space. J. Chem. Phys., 116, 72257230[CrossRef].
Rees, D.C. (1980) Experimental evaluation of the effective dielectric constant of proteins. J. Mol. Biol., 141, 323326[CrossRef][ISI][Medline].
Rooman, M.J., Kocher, J.-P., Wodak, S.J. (1991) Prediction of protein backbone conformation based on 7 structure assignments. J. Mol. Biol., 221, 961979[CrossRef][ISI][Medline].
Simons, K.T., et al. (1997) Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J. Mol. Biol., 268, 209225[CrossRef][ISI][Medline].
Swendsen, R.H. and Wang, J.-S. (1986) Replica Monte Carlo simulation of spin-glasses. Phys. Rev. Lett., 57, 26072609[CrossRef][ISI][Medline].
Thirumalai, D., et al. (1989) Ergodic behavior in supercooled liquids and in glasses. Phys. Rev. A, 39, 35633574[CrossRef][Medline].
Wang, F.G. and Landau, D.P. (2001a) Efficient, multiple-range random walk algorithm to calculate the density of states. Phys. Rev. Lett., 86, 20502053[CrossRef][ISI][Medline].
Wang, F.G. and Landau, D.P. (2001b) Determining the density of states for classical statistical models: a random walk algorithm to produce a flat histogram. Phys. Rev. E, 64, 056101[CrossRef].
Wolfenden, R., et al. (1981) Affinities of amino acid side chains for solvent water. Biochemistry, 20, 849855[CrossRef][Medline].
Yan, Q.L., et al. (2002) Density-of-states Monte Carlo method for simulation of fluids. J. Chem. Phys., 116, 87458749[CrossRef].
Zhou, R. and Berne, B.J. (1997) Smart walking: a new method for Boltzmann sampling of protein conformations. J. Chem. Phys., 107, 91859196[CrossRef].
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||









