Bioinformatics Advance Access originally published online on July 26, 2005
Bioinformatics 2005 21(19):3719-3725; doi:10.1093/bioinformatics/bti595
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
CHORAL: a differential geometry approach to the prediction of the cores of protein structures

Department of Biochemistry, University of Cambridge Tennis Court Road, Cambridge CB2 1GA, UK
*To whom correspondence should be addressed.
| Abstract |
|---|
|
|
|---|
Motivation: Although the cores of homologous proteins are relatively well conserved, amino acid substitutions lead to significant differences in the structures of divergent superfamilies. Thus, the classification of amino acid sequence patterns and the selection of appropriate fragments of the protein cores of homologues of known structure are important for accurate comparative modelling.
Results: CHORAL utilizes a knowledge-based method comprising an amalgam of differential geometry and pattern recognition algorithms to identify conserved structural patterns in homologous protein families. Propensity tables are used to classify and to select patterns that most likely represent the structure of the core for a target protein. In our benchmark, CHORAL demonstrates a performance equivalent to that of MODELLER.
Availability: The algorithm is available via internet on http://www-cryst.bioc.cam.ac.uk/servers.html
Contact: rinaldo{at}cryst.bioc.cam.ac.uk
| 1 INTRODUCTION |
|---|
|
|
|---|
It is possible to classify the current methods used for protein structure prediction into two main group: ab initio and knowledge-based. Ab initio methods compute the properties or simulate the behaviour of a system from first principles, normally derived from a framework of basic physical laws governing it. By contrast, knowledge-based methods construct a model for a target protein from the information derived from the structures of a homologous protein family, together with rules derived from a careful analysis of individual protein structures.
Knowledge-based methods applied to protein modelling programs can be described as satisfaction of spatial restraints or fragment-based assembly. The former starts with a large number of restraints derived from a wide range of sources, including homologous proteins, molecular mechanics force fields and experimental data. The protein target sequence is then folded into a conformation that satisfies those restraints, as well as others derived from the general rules for a protein structure (Sali and Blundell, 1993; Greer, 1980).
The hand fragment-based assembly approaches, however, seek to identify parts of proteins of known structure that are likely to be similar to those of the target protein. Most approaches using fragment-based assembly follow Greer (1980) in dividing proteins into two regions: structurally conserved regions (SCRs) and structurally variable regions (SVRs). The identified fragments from homologues of known structure are then ranked according to their similarity to the target sequence so that a model can be constructed from the best fragments (Johnson et al., 1994).
Both approaches have been successfully used to implement comparative modelling programs. Probably, the most widely used approach based on satisfaction of spatial restraints is MODELLER (Sali and Blundell, 1993), which uses probability density functions derived from homologous structures and general features, obtained from the statistical analysis of a large number of known protein structures. Programs based on the satisfaction of spatial constraints usually produce complete models but with variable quality in different regions. These problems occur because of inadequacies in handling deletions, and especially insertions, where constraints cannot easily be derived from homologues or other proteins of known structure. Also, these programs normally need more computational power and time in order to process and generate the model.
An early example of a program using the fragment-based approach was COMPOSER (Blundell et al., 1988), which is based on an algorithm for selecting and assembling rigid fragments derived from
- the framework defined by multiple least squares superposition of homologous proteins (Sutcliffe et al., 1987a);
- the variable regions by scanning a database of loop substructures using a distance filter and loop templates (Topham et al., 1993);
- the side chains by using a set of rules derived from the analysis of side chain dihedral angles at topologically equivalent positions in homologous structures (Sutcliffe et al., 1987b).
An evaluation of COMPOSER (Srinivasan and Blundell, 1993) demonstrated that it is most reliable where the known structures cluster around that to be predicted and where the percentage sequence identity to the unknown is >30%. The reliability of the protein model tends to be directly proportional to the sequence similarity between the ensemble of protein structures and the target sequence.
Another widely used comparative modelling program is the web hosted Swiss-Model (Peitsch, 1996; Schwede et al., 2003), which builds the model core in a framework constructed using the averaged backbone atoms positions from the template structures, where the weights are based on their sequence similarity to the target. The insertions and deletions are handled by a system that constructs an ensemble of fragments compatible with the anchor regions using constraint space programming and for some cases by a loop library derived from experimental structures. Swiss-Model has a final refinement stage using molecular dynamics with the GROMOS96 forcefield.
MOE-Homology, another comparative modelling system based on rigid-fragment assembly, creates full-atom, energy-minimized 3D protein models using one or more protein structures as templates. It uses a segment-matching procedure (Levitt, 1992) together with the modelling of insertions and deletions using geometric scoring criteria (Fechteler et al., 1995). The final model is refined using molecular dynamics with a choice of two variants of AMBER or the MMFF94 forcefield.
One problem with fragment assembly approaches is that SCRs are usually defined as regions where all proteins in the same family show the same conformation for the main chain atoms independent of their classification in a secondary structure element or loop region. This implies that the length of the SCR tends to be proportional to the family PID and inversely proportional to the number of its members. A further problem is that superposition of C
atoms often leads to equivalencing regions of quite different conformation. In an attempt to overcome the averaging of different conformers, SCORE (Deane et al., 2001) defines an SCR as a continuous stretch of three or more residues conserved in all aligned structures; these residues are said to be conserved when the distance amongst all C
atoms for each position is <3.8 Å and the difference in backbone torsion angles lies within a threshold of 150°. This definition is clearly a necessary condition for any structural conservation of the main chain but it is not sufficient to insure it.
ORCHESTRAR (R.Wander Montalvão, R.Smith, D.Burke, S.Lovell and T.L.Blundell, unpublished data) is a fragment-based program designed to overcome limitations of earlier approaches. It uses CHORAL to predict the protein core, CODA (Deane and Blundell, 2001) for loop modelling and ANDANTE (R.Smith, S.Lovell, R.W.Montalvão and T.L.Blundell, unpublished data) for side-chain modelling. In particular, CHORAL was designed to improve the performance of a fragment-based program for modelling families or superfamilies showing low sequence identity. It introduces the concept of structurally conserved clusters (SCCs), which use as much of the information available as possible for a protein family and also introduce new strong geometric requirements towards a sufficient condition for structural conservation.
| 2 METHODS |
|---|
|
|
|---|
An SCC is defined as a geometric pattern (shape) that may be unique to any single member of the structural alignment or common to any combination of structures. Our approach allows several SCCs to span a single region of the structural alignment where no single SCR would be defined.
2.1 Cluster definition and determination using structural conservation
In CHORAL, aligned residues of superposed structures are defined as aggregated when all C
C
separations in the ensemble are <3.6 Å. A region in two or more proteins, at least three residues long, is considered conserved when all of its residues are aggregated and the difference for all of its differential geometry features (curvature and torsion) are below a specific threshold.
Curvature (1) and torsion (2) (Lipschultz, 1969) are calculated from three parametric spline functions that fit the C
x, y and z coordinates with the residue index used as parameter as shown in Figure 1.
![]() | (1) |
![]() | (2) |
|
Essentially, curvature is the rate of change of the tangent vector T in relation to the parameter s (the arc length) and can be expressed in terms of r (the parametric curve function generated by C
spline fitting functions) and its parameter t. It corresponds to a measure of deviation from a straight line. In addition, torsion is the rate of change of the binormal vector B(B = T x N, where N is the normal vector) and measures the deviation from a plane curve. Using these values one can cluster the residues for an aligned position creating sets of residues that are classed as geometrically conserved. For clustering we apply a modified basic sequential algorithm scheme (Theodoridis and Koutroumbas, 1999) using the Euclidean distance between the points in the curvaturetorsion space as the dissimilarity measure (Figs 1 and 2). As an example, the residues in the position number 61 (Fig. 2) can be represented by a set of sets as [[4ape, 2apr, 1smra], [1mpp]].
|
The next step is to compare each position with its neighbours in order to determine the SCCs. An SCC is defined as a region, three or more residues long, where all subsets are equal. For example, the positions 59, 60 and 61 (Fig. 1) are represented as [[4ape, 2apr, 1smra], [1mpp]] forming two different SCCs than can be represented as outlined below:
- SCC1
- Start: 59
- Stop: 61
- Length: 3
- Members: 4ape, 2apr, 1smra
- Stop: 61
- Start: 59
- SCC2
- Start: 59
- Stop: 61
- Length: 3
- Members: 1mpp
- Stop: 61
- Start: 59
For the positions running from residue 62 until 66, all structures are in the same set making just one large SCC.
2.2 Scoring SCCs using environmentally constrained substitution tables
CHORAL utilizes environmentally constrained amino acid propensity tables (Deane and Blundell, 2001) to score all residues in the protein family template in the alignment. These tables, divided in six
/
areas, indicate the propensity for a residue to be replaced by another in a specific conformation. Using this information, Equation (3) defines the representative score for an SCC. In this equation, s is the environmentally constrained score between the basis structure B and the target sequence T. The index i refers to a protein member of the k-th SCC and runs from 1 to N, where N is the number of proteins in the SCC. Additionally, the index j refers to the residue index and runs from the residue m (start of the SCC) to the residue n (end of the SCC).
![]() | (3) |
2.3 SCC optimal selection
The clustering process creates an ensemble of overlapping SCCs, each with unique geometry, which need to be trimmed to a single solution. To achieve this, CHORAL applies a combinatorial optimization approach based on simulated annealing. The pseudo-energy function used in this process [Equation (4a) and (4b)] takes into account the SCC weight Sk, a penalty G for C
C
separation between the SCC and its neighbours and a penalty P when using an SCC that does not contain the best template. The index i runs only over selected non-overlapping SCCs (the configuration C) and the Greek letters are scaling factors used to normalize the pseudo-energy elements.
![]() | (4a) |
![]() | (4b) |
The value of the parameter
is 1.0. The values of ß(=5.0) and
(=25.0) are chosen in order to bring the values of G and P to the same scale of S. The G penalty was introduced to minimize the C
C
separation for the selected SCCs, therefore avoiding non-regular geometry. The scheme of simulated annealing, in the selection context, is outlined below.
T = Tmax C = initial configuration of non-overlapping SCCs.
While T > Tmin
While equilibrium state is not obtained.
Compute EC.
Produce a new configuration C'.
Compute
EC'.
If
E = EC' EC < 0 then
* C=C'
Else
* C=C' if RAND < EXP(
E/T)
End if
End While
T =
*T with 0 <
<1
End While
This process ensures, for a well-chosen
, a smooth convergence towards the minimum value in the energy landscape and, consequently, it obtains the optimal trace over all non-overlapping SCCs. Figure 3a shows an example of an alignment with assigned SCCs and a possible solution for one of the overlapping regions.
|
2.4 Core generation
The final step in the CHORAL modelling process involves the building of the backbone coordinates for the target protein. CHORAL creates a framework using a weighted averaged C
network in the same way as SCORE (Deane et al., 2001) in order to produce the core model. For this process, Equation (5) gives the normalized score S for a structure i member of the SCC. S is the total score for structure i; t indicates the structure with the highest score amongst the members of the SCC and j runs over all members:
![]() | (5) |
The procedure used to create the framework vector F consists of taking the C
coordinate vector
for each residue r and each structure i within the SCC and multiplying it by the respective normalized score [Equation (6)]:
![]() | (6) |
Applying Equation (6) for all residues in all SCCs creates the complete framework in which the residues of the target protein are built. CHORAL applies this procedure for the five lowest energy configurations, producing five different cores.
2.5 Test sets
The first test set used for benchmarking CHORAL comprises ten families of proteins manually selected by N.Furnham and T.L.Blundell (unpublished data) from the HOMSTRAD database (Mizuguchi et al., 1998). This set includes a wide variety of different protein folds and of pairwise percentage of identities amongst its members (Table 1). For each family, a high-resolution (X-ray) structure was selected as a target and four other high-resolution structures were chosen as templates, carefully including members that are both close and distant related to the target. The CHORAL models comprise four models based on one template, six using two templates, four using three templates and one using four templates, giving 15 combinations of the template structures for each target representing a wide range of information entropy. In order to compare our algorithm with a standard method R.N.Miguel (unpublished data) constructed all 150 models of our test set using MODELLER version 6a. A model for each family was selected from the top five automatically using the average between the minimal energy and minimal violations of the bond and angle parameters.
|
The second test set is composed of 800 CHORAL and MODELLER models produced by a script that selects random targets and templates from randomly selected HOMSTRAD families. First, the script selects a family from a pool of possible HOMSTRAD families. Second, it selects a target structure and four structures for template. The third step uses COMPARER (Sali and Blundell, 1990) to automatically generate the alignment and the superposition of the template structures, and finally it submits a job to CHORAL and another to MODELLER.
| 3 RESULTS |
|---|
|
|
|---|
The SCC definition was specifically designed to accommodate the multimodal nature of protein ensembles of a homologous family (Fig. 3), which is particularly characteristic of the C
spatial distributions in low similarity superfamilies and for loop regions. The differential geometric classification provides a better classification than that obtained by using C
distances alone. The use of a colour-coded alignment (Fig. 3a) and the colour-coded structures (Fig. 3b and c) as supplied by CHORAL can give very helpful insights into the structural conservation patterns of a protein family. Thus, colour-coded structures shown in Figure 3 indicate that there are different conformational patterns in an otherwise reasonably similar region that would be classified as a common framework in many programs.
3.1 MBSAS parameters
The MBSAS clustering algorithm developed for CHORAL has two parameters to be adjusted. The first parameter is the C
C
distance exclusion and the second is a threshold for the dissimilarity measure.
The C
C
exclusion parameter excludes any atom from being classified into a group if its separation from the most distant member of the group is above that value. It enforces the formation of compact clusters in 3D space by eliminating the possible bridges across clusters.
The curvature/torsion dissimilarity parameter is the threshold for clustering atoms with similar differential geometric parameters. Using low values for this parameter ensures a more precise classification of the structural patterns.
Comparing manually selected clusters for 20 control models with those automatically selected, we were able to determine the values that generate the best quality models and minimize geometric anomalies. The selected values that allowed CHORAL to predict the structure for 90.78% of the residues are 2.0 Å for exclusion and 0.25 for the curvature/torsion dissimilarity measure.
3.2 Simulated annealing and performance
Despite the use of simulated annealing CHORAL has impressive performance regarding running times. Typical running times for a Pentium IV 1.6 GHz computer have been <5 min for most cases tested.
To ensure the best performance possible, without missing the system global minimum, we have adjusted the standard cooling schedule parameter
by using 20 models as a control group. Those models were selected to represent several possible situations for SCC distribution and for optimization problems.
The models where initially produced setting
= 0.99 to guarantee that the global minimum will be always reached. The next step is to decrease the
value until CHORAL reaches a state where one or more models are different from the equivalent control set. The last value that produces reliable models is then used as the standard cooling schedule parameter. In our control set the value we obtained was
0.75.
3.3 Model refinement
Owing to the non-physical nature of the weighted average position, the inclusion in the pseudo-energy term of a penalty for C
C
separation between an SCC and its neighbours is necessary to solve some distortions that would, otherwise, arise from deletions and insertions. The quality is further improved by the ab initio program ARPEGGIO (R.W. Montalvão, S. Lovell and T.L. Blundell, unpublished data), which uses accurate bond and angle X-ray protein structure data (Engh and Huber, 1991) and a molecular dynamic algorithm for geometrically constrained molecules. Since the atom movements needed to improve these areas are minimal there is no significant impact on the average RMSD.
3.4 Comparison with MODELLER
All CHORAL and MODELLER models from the first dataset were superposed on the native structures and the RMSD, calculated for the N, C, C
and Cß (for non-Gly residues) atoms present in CHORAL models (Table 2). Any residue in a MODELLER model without a corresponding residue in the CHORAL model does not enter in the RMSD calculation. The RMSD difference column is the difference between CHORAL RMSD to MODELLER RMSD and a negative value indicates a lower average RMSD for CHORAL. The average value of this feature for all 150 models is 0.1 Å, meaning that MODELLER and CHORAL produce equivalent models.
|
Despite being statistically equivalent to MODELLER in this dataset, CHORAL presents better RMSD values for a few families, such as globin and major histocompatibility antigen. In the case of the MODELLER results for some of the MHC complexes, this appears to result from energy minimization of the protomer in highly accessible regions, some of which were involved in interprotomer interaction in the dimer. If the MHC family is omitted from the comparison, the average value of the RMSD difference is 0.04 Å.
3.5 Analysis of a large dataset
The second dataset was used to produce a more complete statistical evaluation of the RMSD for both systems. The average value of the RMSD difference for all 800 models is 0.02 Å and CHORAL predicted an average of 89% of the residues. These results confirm that CHORAL and MODELLER produce statistically equivalent models for a large dataset.
The best results when comparing CHORAL with MODELLER originate from families with an average PID of <40%. A possible explanation for this behaviour is that the CHORAL algorithm is designed to deal with systems having multimodal distributions showing complex pattern formation in the 3D space, such as present in these cases.
3.6 Backbone building quality
The slight but consistent differences in the RMSD between CHORAL and MODELLER as in the aspartic proteinase (asp) family is caused by the different approaches to backbone building. MODELLER builds the main chain atoms in a framework that is the best model according to the structural restraints and CHORAL builds a framework that is the weighted average position based on the SCC's total score. The weighted average position is often equivalent to the MODELLER method for cases where the structures of the templates used are all close to that being modelled and the superposed C
atoms reside in a spherical or ellipsoidal space.
In general, sets of C
atoms with a maximum separation between any of its members of <1.2 Å belong to a spherical or ellipsoidal ensemble; consequently, models made by CHORAL are similar to models made by MODELLER for those regions. This can be illustrated by using a cut-off of 1.0 Å to calculate the C
RMSD for the aspartic proteinase family, where it shows that the average value drops from 0.053 to just 0.010 Å.
3.7 SCC versus SCR
Regions that can be modelled by a mosaic of SCCs cannot be modelled by SCRs as there is a requirement for the presence of all templates within an SCR (e.g. the region in the Fig. 3a between the aligned positions 45 and 61). As can be observed in Table 2, the percentage of residues predicted is fairly high for most families as this approach can build loops when the necessary information is present in the template structures.
| 4 CONCLUSION |
|---|
|
|
|---|
Despite using a very simple strategy, CHORAL can build models with a quality equivalent to MODELLER but using a much faster algorithm. The typical running time for producing five alternative models for a target is <5 min. Typical running time for MODELLER under the same conditions was 2 h.
| Acknowledgments |
|---|
We thank Dr David Burke for his useful suggestions and discussion, Dr Ricardo Nunez Miguel and Nick Furnham for access to models constructed using MODELLER. We acknowledge support from Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) and from Tripos Associates.
Conflict of Interest: none declared.
| Footnotes |
|---|
Present address: Faculty of Life Sciences, University of Manchester, Oxford Road, Manchester M13 9PT, UK
Received on June 2, 2005; revised on July 21, 2005; accepted on July 21, 2005
| REFERENCES |
|---|
|
|
|---|
Blundell, T.L., et al. (1988) Knowledge based protein modelling and design. 18th Sir Hans Krebs Lecture Eur J. Biochem., 173, 513520.
Deane, C.M. and Blundell, T.L. (2001) CODA: A combined algorithm for predicting the structurally variable regions of protein models. Proteins, 40, 135144.
Deane, C.M., et al. (2001) SCORE: predicting the core of protein models. Bioinformatics, 17, 541550
Engh, R. and Huber, R. (1991) Accurate bond and angle parameters for X-ray protein structure refinement. Acta Cryst, A 47, 392400[CrossRef].
Fechteler, T., et al. (1995) Prediction of protein three-dimensional structures in insertion and deletion regions: a procedure for searching data bases of representative protein fragments using geometric scoring criteria. J. Mol. Biol., 253, 114131[CrossRef][ISI][Medline].
Greer, J. (1980) Model for haptoglobin heavy chain based upon structural homology. Proc. Natl. Acad. Sci. USA, 77, 33933397
Johnson, M.S., et al. (1994) Knowledge-based protein modeling. Crit. Rev. Biochem. Mol. Biol., 29, 168[ISI][Medline].
Levitt, M. (1992) Accurate modeling of protein conformation by automatic segment matching. J. Mol. Biol., 226, 507533[CrossRef][ISI][Medline].
Lipschultz, M. (1969) Differential Geometry. Schaum's Outlines Series McGraw-Hill 6164.
Mizuguchi, K., et al. (1998) HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci., 7, 24692471[Abstract].
Peitsch, M.C. (1996) ProMod and Swiss-Model: internet-based tools for automated comparative protein modelling. Biochem. Soc. Trans., 24, 274279[ISI][Medline].
Sali, A. and Blundell, T.L. (1990) Definition of general topological equivalence in protein structures. J. Mol. Biol., 212, 403428[CrossRef][ISI][Medline].
Sali, A. and Blundell, T.L. (1993) Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol., 234, 779815[CrossRef][ISI][Medline].
Schwede, T., et al. (2003) SWISS-MODEL: an automated protein homology-modeling server. Nucleic Acids Res., 31, 33813385
Srinivasan, N. and Blundell, T.L. (1993) An evaluation of the performance of an automated procedure for comparative modelling of protein tertiary structure. Protein Eng., 6, 501512
Sutcliffe, M.J., et al. (1987a) Knowledge based modelling of homologous proteins, Part I: three dimensional frameworks derived from the simultaneous superposition of multiple structures. Protein Eng., 1, 377384
Sutcliffe, M.J., et al. (1987b) Knowledge based modelling of homologous proteins, Part II: rules for replacement of side-chains. Protein Eng., 1, 385392
Theodoridis, S. and Koutroumbas, K. (1999) Pattern Recogniton. Academic Press 391392.
Topham, C.M., et al. (1993) Fragment ranking in modelling of protein structure. Conformationally constrained environmental substitution tables. J. Mol. Biol., 229, 194220[CrossRef][ISI][Medline].
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||







Compute EC.
E = EC' EC < 0 then

