Skip Navigation


Bioinformatics Advance Access originally published online on May 16, 2006
Bioinformatics 2006 22(15):1838-1845; doi:10.1093/bioinformatics/btl192
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/15/1838    most recent
btl192v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (3)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Offman, M. N.
Right arrow Articles by Bates, P. A.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Offman, M. N.
Right arrow Articles by Bates, P. A.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Developing a move-set for protein model refinement

Marc N. Offman , Paul W. Fitzjohn and Paul A. Bates *

Biomolecular Modelling Laboratory, Cancer Research UK London Research Institute, Lincoln's Inn Fields Laboratories London, WC2A 3PX, UK

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION AND CONCLUSION
 REFERENCES
 

Motivation: A wide variety of methods for the construction of an atomic model for a given amino acid sequence are known, the more accurate being those that use experimentally determined structures as templates. However, far fewer methods are aimed at refining these models. The approach presented here carefully blends models created by several different means, in an attempt to combine the good quality regions from each into a final, more refined, model.

Results: We describe here a number of refinement operators (collectively, ‘move-set’) that enable a relatively large region of conformational space to be searched. This is used within a genetic algorithm that reshuffles and repacks structural components. The utility of the move-set is demonstrated by introducing a cost function, containing both physical and other components guiding the input structures towards the target structure. We show that our move-set has the potential to improve the conformation of models and that this improvement can be beyond even the best template for some comparative modelling targets.

Availability: The POPULUS software package and the source code are available at http://bmm.cancerresearchuk.org/~offman01/populus.html

Contact: paul.bates{at}cancer.org.uk


    INTRODUCTION
 TOP
 ABSTRACT
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION AND CONCLUSION
 REFERENCES
 
Since the sequencing of the human genome (Lander et al., 2001) and of many other key organisms, the number of publicly available protein sequences is now considerable (approximately 3 million). However, the number of experimentally determined protein structures lags far behind (approximately 35,000) and although there have been improvements in the experimental methods (Berman and Westbrook, 2004), cost and speed limitations dictate that the gap between sequence and structure is likely to grow. The now well-accepted approach is to determine a minimum number of experimental structures (Yan and Moult, 2005; Liu and Rost, 2002) and to use protein modelling methods, particularly comparative modelling (CM) (Marti-Renom et al., 2000), to generate the remainder. At present, these models are not as accurate as experimentally determined structures, but are still useful for qualitative analysis and decision-making in support of a wide range of experimental work (Baker and Sali, 2001). A number of important applications, however, can only use highly accurate models, for example molecular replacement experiments (Giorgetti et al., 2005), function predictions (Skolnick et al., 2000) and virtual drug screening (Lengauer et al., 2004). Consequently methods still need to be improved.

Platforms for the benchmarking of available prediction methods are given by projects such as EVA (Eyrich et al., 2001), LiveBench (Rychlewski and Fischer, 2005) and CASP (Critical Assessment of Techniques for Protein Structure Prediction). These services are very useful for the protein structure prediction community, giving clear observations of the development and the need for further progression within the field. These platforms have also encouraged the exchange of ideas, computer-based algorithms and, in recent years, the sharing of protein models.

An area which has been identified to be in need of further development is that of model refinement (Valencia et al., 2005; Moult, 2005; Kryshtafovych et al., 2005; Tress et al., 2005). Readily available models for a given sequence, such as those generated by automated servers (Bujnicki et al., 2001; Ginalski et al., 2003), often form the basis for the input to these methods. Previous attempts have included molecular dynamics (Lee et al., 2001), Monte Carlo (Bradley et al., 2005) and knowledge-based techniques (Qian et al., 2004).

Here, we use a genetic algorithm (GA) as our conformational space search engine. GAs are well known to be powerful search and optimization techniques (Forrest, 1993), and have been used previously in a number of protein modelling efforts ranging from ab initio folding to model building by homology (Karplus et al., 2005; Damsbo et al., 2004; Contreras-Moreira et al., 2003; John and Sali, 2003; Petersen and Taylor, 2003; Xia and Levitt, 2002; Rabow and Scheraga, 1996; Pedersen and Moult, 1995; May and Johnson, 1994). Other, non-GA-based approaches build and refine models by selection, recombination and assembly of fragments (Kolinski and Bujnicki, 2005; Kosinski et al., 2005; Zhang et al., 2005; Bradley et al., 2005; Jones et al., 2005; Chivian et al., 2005) and were successfully applied in the sixth round of CASP (Moult et al., 2005). In contrast to the previously mentioned refinement protocols, which usually start with models created by a single modelling technique, the method described here uses multiple initial models from a variety of modelling approaches.


    METHODS
 TOP
 ABSTRACT
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION AND CONCLUSION
 REFERENCES
 
Overview
Our GA-based approach consists of the following steps as shown in Figure 1:

  1. Initialization: Model structures are read. Similar structures defined by an RMSD score cutoff and those missing backbone atoms are removed. For fold recognition (FR) and new fold (NF) cases, as defined in CASP (Moult et al., 2005), the input is clustered (Daura et al., 1999) and the two largest clusters are selected for the starting population.
  2. Growth: A series of operators are applied to the current population, creating siblings until the population reaches 250. The four operators are recombination, protein mutation, coil mutation and helix mutation (see below for details) and are applied in a ratio of 2:1:1:1 (empirically derived from numerous runs of the GA). As helix and coil mutations result in only small changes, all new structures are accepted for inclusion into the growing population. The other operators accept siblings according to the following pseudo-Metropolis criterion:

    Formula
    and {sigma} is set to 0.1 for protein mutations and 0.5 for recombinations.

  3. Selection: This step differs slightly between the structural and the energy-based scoring schemes. In the structural scoring scheme, the population is sorted according to the structural score and the best 25 structures were selected as the starting population for the next round. In the energy-based scoring scheme, the population was first sorted by the coarse energy function and the best 50 structures selected. These were then reranked according to the fine energy function and the best 25 structures retained for the next round.

Step 2 and Step 3 are repeated until one of the following convergence criteria is fulfilled:

  • The difference between the average score and the best score is <0.001.
  • The sum of all scores in the current round is the same as the previous.
  • The scores are all within 0.04 of each other.
  • The number of rounds exceeds 20.
  • After termination the top-ranked 20 structures are returned.

Growth operators
The move-set is defined by two global (recombination and protein mutation) and two local (coil and helix mutation) growth operators.

Recombination
There are two recombination types (Fig. 2a), single crossover (X) and double crossover (XX), applied with a default ratio of 1:2. The crossover boundaries are randomly selected outside secondary structure elements. Secondary structure elements are assigned by a fast method, based on values of the {Phi} and {Psi} angles only (Srinivasan and Rose, 1999). For a single crossover only one hot-spot ({omega}-bond between two residues) is set, with a minimum distance of five residues from either the N- or C-terminus. For a double crossover, the two selected hot-spots need to be at least five residues away from each other and both termini. Lower values for all defined boundaries resulted in very little improvement.

Protein mutation
A protein mutation move creates a sibling by subsectioning the initial structure into two parts and repacking them, see Figure 3. A coil region is selected randomly and three pairs of {Phi}/{Psi} angles are mutated randomly within ±90°. The two subsections are repacked using the cyclic coordinate descent (CCD) (Canutescu and Dunbrack, 2003) algorithm, which manipulates torsion angles to satisfy a distance constraint. After calculating the number of steps needed to satisfy the distance constraint d (Fig. 3) ten intermediate models, uniformly distributed during the repacking, are scored. The best of these 10 structures is included into the sibling population according to the pseudo-Metropolis criterion. The CCD algorithm was modified to avoid sparsely populated {Phi}/{Psi} regions, by dividing the Ramanchandran dihedral angle space (Ramachandran and Sasisekhran, 1968) into a 30° x 30° {Phi}/{Psi}-grid with likely and unlikely bins, similar to the approach described by Fitzkee et al. (2005).

Coil mutation
For a coil mutation (Fig. 2b), a random coil region with a minimum size of three residues is selected. Within this region a third of the {Phi}/{Psi} angles are randomly changed within ±90°. These changes cause a break between the coil and the N-terminus of the following secondary structure element, which is closed using the CCD algorithm, thereby generating a new loop conformation. If the selected region includes one of the termini, less movement (±20°) is applied during the torsion angle perturbation and the CCD protocol is not used.

Helix mutation
For a helix mutation (Fig. 2b), a similar method as for the coil mutation is used. A helical region of a minimum length of three residues is randomly selected. A third of the {Phi}/{Psi} angles in the coil region at the N-terminus of the helix are perturbed to shift the helix away from its original position. The resulting break is closed by applying the CCD method on the surrounding coil regions, which causes a change in the helix orientation. Terminal helices are manipulated in the same way as terminal coil regions.

Structural scoring function
The structural scoring function is the mean value of the TM (Zhang and Skolnick, 2004), GDT (Zemla et al., 1999) and maxsub (Siew et al., 2000) scores. The TM-score software (http://bioinformatics.buffalo.edu/tm-score) calculates all three scores, returning normalized values. The number of clashes are also counted and the mean score is reduced by 1.5 and 2.0 for >20 and >50, respectively. Structures containing more than 100 clashes are given a score of 0. In this scoring scheme a meaningful prediction scores between 1.0 and 0.4, with structures scoring below 0.17 being considered random.

Energy-based scoring function
In the GA energy-based method we use both a coarse and fine scoring function.

The coarse scoring function is defined as follows:

Formula

Residue–residue pair potentials, Spp: The internal packing scored according to a statistically derived mixed backbone atom-centroid potential (Robson and Osguthorpe, 1979).

Hydrogen bonds, Shb: The number of hydrogen bonds calculated using the software STRIDE (Frishman and Argos, 1995).

Clash penalty, Scl: A clash between two residues is counted if any two backbone atoms from any pair of non-consecutive residues are closer than 2 Å.

Secondary structure score, Sss: Sum of PSIPRED confidence scores for matches between predicted (Jones, 1999) and assigned secondary structures.

Relative compactness, {Delta}Scomp: The compactness of a protein structure is defined as the average distance of all atoms to the centre of mass of the protein. The relative compactness is the absolute difference between a model compactness and a compactness reference score. This compactness reference score is calculated in the initialization step by taking the average compactness of the top five ranked input models. The input models are initially ranked using the first four terms of the coarse scoring function, but with different weights.

Weights: To achieve an optimal ranking the weights for the components of the function were optimized using simplex optimization (Mead and Nelder, 1965). We used the submissions for the sixth round of CASP, approximately 200 models, plus 400 models created with the move-set, as the training dataset for each target. Targets where the best input structure had a structural score below 0.3 were excluded from this training.

The following weights were assigned, where a low score corresponds to a good structure: w1 = 1.00, w2 = –0.46, w3 = 2.07, w4 = –4.20, w5 = 1.37.

The fine scoring function is defined by the following equation:

Formula

Effective energy function, Seef: SCWRL 3.0 is used to place the side chains (Canutescu et al., 2003) for the energy calculation. All scored models are minimized using the steepest descent and adopted basis Newton–Raphson methods as implemented in CHARMM (Brooks et al., 1983) until the value of the gradient drops below 1.0 kcal Å–2. The models are scored using the effective energy function (EEF) (Lazaridis and Karplus, 2000) in CHARMM.

Solvation energy, Sse: The solvent accessibility is calculated using the software POPS_A (Cavallo et al., 2003). To obtain the solvation energy for each residue, the accessibility is multiplied by the observed amino acid solvation free energy as described by Eisenberg and McLachlan (1986). This energy is also used as a pre-filter, comparing the solvation energy of each structure to the average solvation energy of the previous round. If a model does not fulfil the pseudo-Metropolis criterion (described above), with {sigma} = 0.1, it is highly penalized (Sfine + 2000).

Relative compactness, {Delta}Scomp: As described for the coarse scoring function.

Weights: Assigned as described above for the coarse scoring function: w1 = 1.00, w2 = 0.20, w3 = 0.20.

We developed a program called POPULUS that implements our GA approach. Two sources of input are required: at least two protein structures in PDB format (Berman et al., 2000) and the protein sequence. During processing the protein structures are mainly represented by their backbone torsion angles (Fitzkee et al., 2005).


    RESULTS
 TOP
 ABSTRACT
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION AND CONCLUSION
 REFERENCES
 
Our refinement protocol is split into a well-defined move-set, to search conformational space and a cost function to guide the search. The ability of the move-set to find more native-like structures has been evaluated separately from the cost function in order to highlight the maximum potential of these moves.

Move-set
A generic move-set on the one hand is required to be efficient in sampling the structural conformation space close to the native structure and on the other hand should enable substantial moves for lower quality models. In this analysis, the structural scoring function was used as the fitness criterion, which returns a similarity score for the model compared with the native protein structure. A representative set of output models, built with the guided move-set, was checked using PROCHECK (Laskowski et al., 1993) to make sure the improvement was not made at the expense of the native-like properties of each of the models.

Potential improvement for the CASP6 dataset
Each target of the sixth round of CASP was refined using the structurally driven GA with all the input models, taken from the submissions of the competitors. Figure 4 shows the structural quality of the input (squares) and the refined (circles) models. It can be seen that the structural improvement is relatively low for CM targets (T0268–T0226), whereas lower quality structures are improved significantly. However, the ability of the move-set to improve models is limited, if the best input model has a very low structural score (T0237, T0216).

Recovering from lower quality input—comparing the results for the CASP6 and CAFASP4 submissions
Ideally, a move-set for protein refinement should be able to create comparable results, starting from models with different initial conformations and varying degrees of overall quality. In Figure 5, the results based on the submissions of CASP6 and CAFASP4 (Critical Assessment of Fully Automated Structure Prediction) (Fischer et al., 2003) are compared, showing the potential of the algorithm to achieve this.

Structural analysis of the improvement
Depending on the move-set the improvement can be localized in certain regions of the refined protein structure. As described, specific operators to repack helices and coil regions are implemented in this approach, but strands can only be changed indirectly, using the recombination operators. Hence we analysed the improvement distribution for the different secondary structure elements. Figure 6 shows that the degree of improvement is made in the following descendant order: helix, coil and strand regions. The targets where the improvement mainly appears to be in strand regions (T0274, T0211, T0196, T0227, T0212 and T0239) are either all ß proteins or have a very low {alpha} helical content.

Since protein termini are often more flexible than the rest and also more sensitive to the environment, a better measure of refinement is the improvement in protein core regions, defined here as the region between the N and C terminal secondary structure elements. As seen in Figure 6, a major improvement is indeed achieved by changes in the protein core region, rather than simply by rearranging the terminal regions.

Comparison with the best template for comparative modelling targets
In CASP6 some improvement was reported for the CM section, since a few groups were able to construct models approaching the quality of the best templates (Tress et al., 2005). Figure 7 shows that for 12 out of 25 CM targets the move-set is able to improve the input models beyond the quality of the best template. For all other cases the refined structures are much closer to the best template than the best initial input models.

Energy-based runs
Whilst the development of an efficient move-set is the main focus of this work, it is interesting to assess how well the move-set performs on a realistic modelling scenario. Therefore an energy-based scoring function was introduced and the GA was applied to the CAFASP4 dataset. All models were taken from the meta-server BIOINFO.PL (Ginalski et al., 2003).

Energy-based function: is the native structure a stable global minimum?
When using an energy-based function to score protein structures, it is important that the function is effective within the context for which it is used. An important criterion to fulfil is the stability of the native structure, ideally scored as the global minimum of the function (Misura and Baker, 2005). In 21 out of 23 CM cases, the energy of the native structure is lower than the energy of the lowest scored GA output model. Moreover running the GA with the native structure as the input, 20 of the 23 cases are stable, having a maximum structural score deviation of 0.04 ({approx}0.6 Å RMSD) from the original input structure. The remaining three cases are substantially worse (deviation of 0.10, 0.12, 0.16). In all three cases the native structure was only changed in the terminal regions, which in the full protein complex were stabilized by strong inter-chain interactions.

A realistic modelling scenario: using the CAFASP4 dataset
The same structural analysis as for the move-set was applied to the CAFASP4 dataset, using the energy-based cost function. As can be seen in Figure 8 the algorithm is able to produce reasonable results for the complete range of modelling difficulties. It is known that energy functions can pick up strong signals in structural regions close to the native structure (Kortemme et al., 2003) which can be seen for target T0233, a CM case where improvement beyond the best input model is achieved. However, for FR and NF cases higher quality models are achieved only after pre-selective clustering, reducing the influence of poor models during the selection process. Using this strategy it is possible to improve beyond the best input even for FR cases as shown for target T0249.

Navigating through energy landscapes to find the global minimum
To understand the role of energy-based functions and their quality, we constructed energy landscapes for both the guided and the real-case scenario model refinement. Figure 9 shows the fine energy score plotted against both the GA rounds and the structural score for the surviving models of each round. Two CM cases are shown, T0233 and T0231. Inspecting the structurally guided cases, we can see energy barriers potentially being a problem for the energy driven procedure. For T0233 we were able to improve upon the best input structure (Fig. 10). This can be explained by the fact that in the structural score-driven run, the energy barrier occurs just at the end of the optimization process. However, as can be seen in Figure 9 for T0231, the models would need to climb an energy barrier at the very beginning of the energy landscape. This causes the algorithm to terminate at a local minimum that is slightly inferior to the structural score of the best input.

The GA-achieved improvement for T0233 is shown in Figure 10. The two blue helices highlighted are closer to their native state (green), improving upon the best initial input model (red).

Comparison with the CASP6 manual and automatic submissions
Taking the model with the lowest energy for each target, we compared the overall performance of the GA to the CASP6 submissions. The 201 submitting groups can be divided into three classes: human, meta-server and server. For all groups, the additive structural score for all models, ranked top by the competitors, was calculated. The top ten groups, all of class human, score between 28.94 and 26.38. Including the POPULUS score into this analysis our new method ranks 14th with a score of 25.69. In the server group, there is only one server better than our approach with a score of 25.71.


    DISCUSSION AND CONCLUSION
 TOP
 ABSTRACT
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION AND CONCLUSION
 REFERENCES
 
The Move-set
In this GA-based method several operators are used to reshuffle, repack and recombine fragments of an input population, to obtain a final model. Using these operators, the move-set is potentially able to improve models beyond the best input structure by more than 30%, being limited by an upper and a lower structural score boundary of approximately 0.15 and 0.92, respectively (Fig. 4). This improvement includes recovery from alignment errors (data not shown). The lower boundary can be explained by missing secondary structure elements or the wrong placement of these and by the necessity to unfold models with completely wrong topologies (structural score below 0.2). For the upper boundary, the GA is limited because no fine movements are applied within secondary structure elements to introduce missing kinks and twists.

Looking closer at the distribution of the improvement, we found that helix repacking alone can refine a model by up to 30% (Fig. 6). Moreover this improvement is situated mainly in the core and not in the flexible terminal regions. This underlines the importance of helix repacking in new refinement strategies.

The distributions of the input models appear to be clustered, as seen in Figures 4 and 8, pointing to the use of different templates and techniques, to provide a highly variable starting population of models. From Figure 5, it can be seen that for some cases, using two different input sets (CASP6 and CAFASP4) can lead to dramatically different outputs. For these cases it was found that models contained unique protein fragments that are part of the one dataset, but not the other. Still, in general, an input population covering a wide range of structural diversity seems to be sufficient to refine structures, improving some beyond the best available template (Fig. 7).

The energy-based scoring function
In this approach we have combined different energy functions, such as simple pair potentials, unified-atom energy functions and solvation energies. The combination of these, in a two-level filtering, enables us to obtain good models on the basis of five separate GA runs.

Using this technique for all the CASP6 submissions, we were able to improve upon the best input in 4 out of 25 cases. Since in general, the CAFASP4 submissions are of lower quality, the energy-based functions proved to be less effective. However, as can be seen from Figure 8, the GA output model is usually much closer to the best input model than the average score of the input ensemble, indicating that the GA, at this stage of development, can at least select a good model from an ensemble. Indeed, applying the GA to all targets of CAFASP4, we were able to obtain the overall rank of 14th, compared with the CASP6 submissions. This rank is promising given that our method pushes models substantially away from the templates used to model the initial population.

We showed that our fine energy-based function does in general score the native structure to be the global minimum. However, it is clear that energy-based functions do not always guide the conformational search towards the global minimum. If the way to the global minimum is blocked by a series of high energy barriers the probability of getting stuck in a local minimum is very high. In trying to force the algorithm to jump out of these minima, we run the risk of populating other local minima, which are further away from the global minimum. Nevertheless, the GA is able to jump over some small energy barriers, but at the point where there is no overall gradient available, without climbing high barriers, the approach is very likely to fail. The closer larger energy barriers are to the starting point, the more probable a premature termination will occur (Fig. 9). It is worth noting that alternative energy functions, such as ME score (Summa et al., 2005), Generalized Born (GBSW) (Im et al., 2003) and Wesson–Eisenberg (Wessen and Eisenberg, 1992), were tested in the context of this method. This increased the algorithm runtime substantially and the results were similar or marginally worse than for the function shown here.

Future directions
The future for releasing the potential locked within ensembles of protein models lie within developing better scoring functions. Here we have tried several, including secondary structure matching, degree of compactness, classical pairwise potentials and a unified atom-based energy function. We believe that it is the atom-based potentials that need to be improved the most, particularly for the refinement of models initially constructed from template-based methods. The GA, under the direction of inadequate atomic potentials, can simply drift away from regions close to the target conformation, thereby finding local regions equivalent in energy.

The move-set, too, can be improved upon. For some model ensembles, usually those based on fold recognition (analogy) and ab initio methods, alignments and general protein topologies are so poor relative to the target that more radical moves are called for. This may, e.g. require more single and double crossovers from a wider range of ab initio library fragments, built on the basis of several secondary structure predictions. Ultimately, true atom-based refinement will require a very delicate mutation move-set that will probably rely on co-localized groups of atoms undergoing logical refinement moves.


Figure 1
View larger version (13K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1 POPULUS flowchart. X = single crossover, XX = double crossover, P = protein mutation, C = coil mutation, H = helix mutation.

 


Figure 2
View larger version (21K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2 Illustration of recombination, coil and helix mutation.

 


Figure 3
View larger version (19K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3 Protein mutation-subsection repacking. A coil region is randomly selected. The initial distance, d, is the distance between the start and the end backbone atoms of the selected coil region. (1) A number of {Phi}/{Psi} torsion angles are randomly changed in this region. (2–4) The CCD method is used to bring back the protein into a state where the distant, d, constraint is satisfied. Ten intermediate structures are scored. (5) The model with the best score enters the population. The cross indicates a poor region in the parent model and the tick shows the improvement in the sibling.

 


Figure 4
View larger version (41K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4 Comparison of structural score for CASP input, squares and GA output models, circles. In this move-set evaluation the structural alignment to the native structure was used as the fitness function. Figures 4, 5, 7 and 8 were made using XMGRACE http://plasma-gate.weizmann.ac.il/grace/

 


Figure 5
View larger version (26K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5 Comparison for CASP6 (dark, left) and CAFASP4 (light, right), showing the improvement from the best input models (circles) to the GA output models (triangle).

 


Figure 6
View larger version (52K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 6 Analysis of structural score improvement. Two bars for each target show the improvement divided into either secondary structure (HELIX, STRAND, COIL) or into terminal and core regions (N TERMINUS, CORE, C TERMINUS).

 


Figure 7
View larger version (21K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 7 Comparison of best input (+), best available template (x) and GA output (*), using the GDT score.

 


Figure 8
View larger version (48K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 8 Comparison of the CAFASP input (blue) and the output ensemble (orange) for five separate runs using the energy-based functions. For each run and resulting ensemble the red circle is the model with the lowest energy. The green triangles show the model with the lowest energy over all five runs.

 


Figure 9
View larger version (30K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 9 Energy landscapes for selected targets (T0233, T0231) showing the surviving models per round (small black dots). The top two plots are structurally guided, the bottom two plots energy guided. The red circle indicates the best input model, the green circle the GA output model. The dotted line denotes an energy barrier causing the termination of the energy-driven run into a local minima. For both cases the termination of the energy-guided runs can only be fully understood in context of the structurally guided runs. Figure 9 was made using Gnuplot (http://www.gnuplot.org).

 


Figure 10
View larger version (55K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 10 Target T0233. Superimposition of the native structure (green) (1vqu) the best input model (red) and the refined model from the GA (blue). The highlighted helices show the region of most improvement. Figure 10 was made using VMD (http://www.ks.uiuc.edu/research/vmd/).

 

    Acknowledgments
 
We thank all the members of the Biomolecular Modelling Laboratory for many useful discussions and insights. This work was funded by Cancer Research UK, the Barbara Mary Hill Memorial Fund and a B'nai B'rith Leo Baeck Lodge Scholarship awarded to M.N.O.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Anna Tramontano

Received on March 13, 2006; revised on May 4, 2006; accepted on May 14, 2006

    REFERENCES
 TOP
 ABSTRACT
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION AND CONCLUSION
 REFERENCES
 

    Baker, D. and Sali, A. (2001) Protein structure prediction and structural genomics. Science, 294, 93–96[Abstract/Free Full Text].

    Berman, H.M. and Westbrook, J.D. (2004) The impact of structural genomics on the protein data bank. Am. J. Pharmacogenomics, 4, 247–252[CrossRef][Medline].

    Berman, H.M., et al. (2000) The Protein Data Bank. Nucleic Acids Res, . 28, 235–242[Abstract/Free Full Text].

    Bradley, P., et al. (2005) Free modeling with Rosetta in CASP6. Proteins, 61, Suppl. 7, 128–134.

    Bradley, P., et al. (2005) Toward high-resolution de novo structure prediction for small proteins. Science, 309, 1868–1871[Abstract/Free Full Text].

    Brooks, B., et al. (1983) CHARMM: a program for macromolecular energy, minimization, and dynamics calculation. J. Comp. Chem, . 4, 187–217.

    Bujnicki, J.M., et al. (2001) Structure prediction meta server. Bioinformatics, 17, 750–751[Abstract/Free Full Text].

    Canutescu, A.A. and Dunbrack, R.L., Jr. (2003) Cyclic coordinate descent: a robotics algorithm for protein loop closure. Protein Sci, . 12, 963–972[CrossRef][Web of Science][Medline].

    Canutescu, A.A., et al. (2003) A graph-theory algorithm for rapid protein side-chain prediction. Protein Sci, . 12, 2001–2014[CrossRef][Web of Science][Medline].

    Cavallo, L., et al. (2003) POPS: a fast algorithm for solvent accessible surface areas at atomic and residue level. Nucleic Acids Res, . 31, 3364–3366[Abstract/Free Full Text].

    Chivian, D., et al. (2005) Prediction of CASP6 structures using automated robetta protocols. Proteins, 61, Suppl. 7, 157–166.

    Contreras-Moreira, B., et al. (2003) In silico protein recombination: enhancing template and sequence alignment selection for comparative protein modelling. J. Mol. Biol, . 328, 593–608[CrossRef][Web of Science][Medline].

    Damsbo, M., et al. (2004) Application of evolutionary algorithm methods to polypeptide folding: comparison with experimental results for unsolvated Ac-(Ala-Gly-Gly)5-LysH+. Proc. Natl Acad. Sci. USA, 101, 7215–7222[Abstract/Free Full Text].

    Daura, X., et al. (1999) Peptide folding: when simulation meets experiment. Angew. Chemie Intl, . 38, 236–240[CrossRef].

    Eisenberg, D. and McLachlan, A.D. (1986) Solvation energy in protein folding and binding. Nature, 319, 199–203[CrossRef][Medline].

    Eyrich, V.A., et al. (2001) EVA: continuous automatic evaluation of protein structure prediction servers. Bioinformatics, 17, 1242–1243[Abstract/Free Full Text].

    Fischer, D., et al. (2003) CAFASP3: the third critical assessment of fully automated structure prediction methods. Proteins, 53, Suppl. 6, 503–516.

    Fitzkee, N.C., et al. (2005) The protein coil library: a structural database of nonhelix, nonstrand fragments derived from the PDB. Proteins, 58, 852–854[CrossRef][Web of Science][Medline].

    Forrest, S. (1993) Genetic algorithms: principles of natural selection applied to computation. Science, 261, 872–878[Abstract/Free Full Text].

    Frishman, D. and Argos, P. (1995) Knowledge-based protein secondary structure assignment. Proteins, 23, 566–579[CrossRef][Web of Science][Medline].

    Ginalski, K., et al. (2003) 3D-Jury: a simple approach to improve protein structure predictions. Bioinformatics, 19, 1015–1018[Abstract/Free Full Text].

    Giorgetti, A., et al. (2005) Evaluating the usefulness of protein structure models for molecular replacement. Bioinformatics, 21, ii72–ii76[Abstract].

    Im, W., et al. (2003) Generalized born model with a simple smoothing function. J. Comput. Chem, . 24, 1691–1702[CrossRef][Web of Science][Medline].

    John, B. and Sali, A. (2003) Comparative protein structure modeling by iterative alignment, model building and model assessment. Nucleic Acids Res, . 31, 3982–3992[Abstract/Free Full Text].

    Jones, D.T., et al. (2005) Prediction of novel and analogous folds using fragment assembly and fold recognition. Proteins, 61, Suppl. 7, 143–151.

    Jones, D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol, . 292, 195–202[CrossRef][Web of Science][Medline].

    Karplus, K., et al. (2005) SAM-T04: what is new in protein-structure prediction for CASP6. Proteins, 61, Suppl. 7, 135–142.

    Kolinski, A. and Bujnicki, J.M. (2005) Generalized protein structure prediction based on combination of fold-recognition with de novo folding and evaluation of models. Proteins, 61, Suppl. 7, 84–90.

    Kortemme, T., et al. (2003) An orientation-dependent hydrogen bonding potential improves prediction of specificity and structure for proteins and protein-protein complexes. J. Mol. Biol, . 326, 1239–1259[CrossRef][Web of Science][Medline].

    Kosinski, J., et al. (2005) FRankenstein becomes a cyborg: the automatic recombination and realignment of fold recognition models in CASP6. Proteins, 61, Suppl. 7, 106–113.

    Kryshtafovych, A., et al. (2005) Progress over the first decade of CASP experiments. Proteins, 61, Suppl. 7, 225–236.

    Lander, E.S., et al. (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860–921[CrossRef][Medline].

    Laskowski, R.A., et al. (1993) PROCHECK: a program to check the stereochemical quality of protein structures. J. Appl. Cryst, . 26, 283–291[CrossRef].

    Lazaridis, T. and Karplus, M. (2000) Effective energy functions for protein structure prediction. Curr. Opin. Struct. Biol, . 10, 139–145[CrossRef][Web of Science][Medline].

    Lee, M.R., et al. (2001) Molecular dynamics in the endgame of protein structure prediction. J. Mol. Biol, . 313, 417–430[CrossRef][Web of Science][Medline].

    Lengauer, T., et al. (2004) Novel technologies for virtual screening. Drug Discov. Today, 9, 27–34[CrossRef][Web of Science][Medline].

    Liu, J. and Rost, B. (2002) Target space for structural genomics revisited. Bioinformatics, 18, 922–933[Abstract/Free Full Text].

    Marti-Renom, M.A., et al. (2000) Comparative protein structure modeling of genes and genomes. Annu. Rev. Biophys. Biomol. Struct, . 29, 291–325[CrossRef][Web of Science][Medline].

    May, A.C. and Johnson, M.S. (1994) Protein structure comparisons using a combination of a genetic algorithm, dynamic programming and least-squares minimization. Protein Eng, . 7, 475–485[Abstract/Free Full Text].

    Misura, K.M.S. and Baker, D. (2005) Progress and challenges in high-resolution refinement of protein structure models. Proteins, 59, 15–29[CrossRef][Web of Science][Medline].

    Moult, J., et al. (2005) Critical assessment of methods of protein structure prediction (CASP)–round 6. Proteins, 61, Suppl. 7, 3–7.

    Moult, J. (2005) A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction. Curr. Opin. Struct. Biol, . 15, 285–289[CrossRef][Web of Science][Medline].

    Mead, R. and Nelder, J.A. (1965) A simplex method for function minimization. Comp. J, . 7, 308–313.

    Pedersen, J.T. and Moult, J. (1995) Ab initio structure prediction for small polypeptides and protein fragments using genetic algorithms. Proteins, 23, 454–460[CrossRef][Web of Science][Medline].

    Petersen, K. and Taylor, W.R. (2003) Modelling zinc-binding proteins with GADGET: genetic algorithm and distance geometry for exploring topology. J. Mol. Biol, . 325, 1039–1059[CrossRef][Web of Science][Medline].

    Qian, B., et al. (2004) Improvement of comparative model accuracy by free-energy optimization along principal components of natural structural variation. Proc. Natl Acad. Sci. USA, 101, 15346–15351[Abstract/Free Full Text].

    Rabow, A.A. and Scheraga, H.A. (1996) Improved genetic algorithm for the protein folding problem by use of a Cartesian combination operator. Protein Sci, . 5, 1800–1815[Web of Science][Medline].

    Ramachandran, G.N. and Sasisekharan, V. (1968) Conformation of polypeptides and proteins. Adv. Protein Chem, . 23, 283–438[Medline].

    Robson, B. and Osguthorpe, D.J. (1979) Refined models for computer simulation of protein folding. Applications to the study of conserved secondary structure and flexible hinge points during the folding of pancreatic trypsin inhibitor. J. Mol. Biol, . 132, 19–51[CrossRef][Web of Science][Medline].

    Rychlewski, L. and Fischer, D. (2005) LiveBench-8: the large-scale, continuous assessment of automated protein structure prediction. Protein Sci, . 14, 240–245[CrossRef][Web of Science][Medline].

    Siew, N., et al. (2000) MaxSub: an automated measure for the assessment of protein structure prediction quality. Bioinformatics, 16, 776–785[Abstract/Free Full Text].

    Skolnick, J., et al. (2000) Structural genomics and its importance for gene function analysis. Nat. Biotechnol, . 18, 283–287[CrossRef][Web of Science][Medline].

    Srinivasan, R. and Rose, G.D. (1999) A physical basis for protein secondary structure. Proc. Natl Acad. Sci. USA, 96, 14258–14263[Abstract/Free Full Text].

    Summa, C.M., et al. (2005) An atomic environment potential for use in protein structure prediction. J. Mol. Biol, . 352, 986–1001[CrossRef][Web of Science][Medline].

    Tress, M., et al. (2005) Assessment of predictions submitted for the CASP6 comparative modeling category. Proteins, 61, Suppl. 7, 27–45.

    Valencia, A. (2005) Protein refinement: a new challenge for CASP in its 10th anniversary. Bioinformatics, 21, 277[Free Full Text].

    Wessen, L. and Eisenberg, D. (1992) Atomic solvation parameters applied to molecular dynamcis of proteins in solution. Protein Sci, . 1, 227–235[Web of Science][Medline].

    Xia, Y. and Levitt, M. (2002) Roles of mutation and recombination in the evolution of protein thermodynamics. Proc. Natl Acad. Sci. USA, 99, 10382–10387[Abstract/Free Full Text].

    Yan, Y. and Moult, J. (2005) Protein family clustering for structural genomics. J. Mol. Biol, . 353, 744–759[CrossRef][Web of Science][Medline].

    Zemla, A., et al. (1999) Processing and analysis of CASP3 protein structure predictions. Proteins, Suppl. 3, 22–29.

    Zhang, Y., et al. (2005) TASSER: an automated method for the prediction of protein tertiary structures in CASP6. Proteins, 61, Suppl. 7, 91–98.

    Zhang, Y. and Skolnick, J. (2004) Scoring function for automated assessment of protein structure template quality. Proteins, 57, 702–710[CrossRef][Web of Science][Medline].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/15/1838    most recent
btl192v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (3)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Offman, M. N.
Right arrow Articles by Bates, P. A.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Offman, M. N.
Right arrow Articles by Bates, P. A.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?