Bioinformatics Advance Access originally published online on October 12, 2007
Bioinformatics 2007 23(23):3131-3138; doi:10.1093/bioinformatics/btm499
Conformational analysis of alternative protein structures
1Max-Planck-Institut Informatik, Stuhlsatzenhausweg 85, 66123 Saarbrücken and 2Fachbereich Statistik, Universität Dortmund, Vogelpothsweg 87, 44221 Dortmund, Germany
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Motivation: Alternative structural models determined experimentally are available for an increasing number of proteins. Structural and functional studies of these proteins need to take these models into consideration as they can present considerable structural differences. The characterization of the structural differences and similarities between these models is a fundamental task in structural biology requiring appropriate methods.
Results: We propose a method for characterizing sets of alternative structural models. Three types of analysis are performed: grouping according to structural similarity, visualization and detection of structural variation and comparison of subsets for identifying and locating distinct conformational states. The alpha carbon atoms are used in order to analyse the backbone conformations. Alternatively, side-chain atoms are used for detailed conformational analysis of specific sites. The method takes into account estimates of atom coordinate uncertainty. The invariant regions are used to generate optimal superpositions of these models. We present the results obtained for three proteins showing different degrees of conformational variability: relative motion of two structurally conserved subdomains, a disordered subdomain and flexibility in the functional site associated with ligand binding. The method has been applied in the analysis of the alternative models available in SCOP. Considerable structural variability can be observed for most proteins.
Availability: The results of the analysis of the SCOP alternative models, the estimates of coordinate uncertainty and the source code of the implementation are available on the STRuster web site: http://struster.bioinf.mpi-inf.mpg.de.
Contact: doming{at}mpi-sb.mpg.de
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
Current progress in structural biology is the result of the efforts of crystallographers and NMR spectroscopists who continuously submit new models to the Protein Data Bank (PDB) (Berman et al., 2000). As a consequence of these efforts, alternative structural models are made available for an increasing number of proteins. Alternative models are usually obtained from a single NMR experiment, from the presence of non-crystallographic symmetry in X-ray crystallography, or from several independent structural determinations of the same protein. These alternative models represent the protein in different complexes or interacting with different ligands, or they result from different physicochemical or experimental conditions. Most importantly, they illustrate conformational changes, at the backbone and at the side-chain level, that are associated with protein function. Therefore, it is of great interest to characterize the differences and similarities between these models.
The backbone conformational changes associated with ligand binding and with catalysis have been characterized for many proteins. The studies in transferrins and protein kinases provide some examples (Jeffrey et al., 1998; Taylor et al., 2004). In addition, the molecular mechanisms of protein function have been investigated by detailed comparison of the structure of catalytic, ligand binding or protein binding sites upon binding different ligands, substrates, substrate analogues and inhibitors, or in different mutant forms. Studies in α-amylase, transferrin and PKA provide many examples (Akamine et al., 2003, 2004; Machius et al., 1996; Madhusudan et al., 2002; Nurizzo et al., 2001; Wu et al., 2005). Structural analysis of specific protein sites also plays an essential role in the investigation of new enzyme inhibitors and protein–protein interaction inhibitors of medical relevance. Previous work in HCV NS5B and in IL-2/IL2Rα provides some examples (Arkin et al., 2003; Biswal et al., 2006; Thanos et al., 2006).
Several established methods are currently available for pairwise and multiple comparison of the protein backbone structure. Some of these have been previously reviewed (Sierk and Kleywegt, 2004), and new methods have been proposed since then (Birzele et al., 2007; Ilyin et al., 2004; Shatsky et al., 2004; Ye and Godzik, 2005; Zhang and Skolnick, 2005). In general, these methods search for an alignment defined via a set of equivalent residues, which maximizes a measure of structure similarity. Methods can be distinguished by the use of different measures of structure similarity and of different alignment search algorithms. Some methods provide multiple solutions, some are optimized for fast computation, others take flexibility explicitly into account. In principle, these methods can be used to compare alternative models of the same protein. Nevertheless, the comparison of different protein structures is not the same as the comparison of alternative models of the same protein; these are different problems that require different solutions.
Finding the set of structurally equivalent residues and generating an alignment is not the goal in the comparison of alternative models. The alignment is predefined in this case because the residue positions in the models correspond to positions in the same protein sequence. In the comparison of different proteins, structurally dissimilar regions are reported as gaps, and in general they are ignored in the computation of the measure of structure similarity. In the comparison of alternative structures these variable regions are generally aligned, and correspond to different conformations. The identification of the location and extent of the conformational change, as well as the identification of the structurally conserved invariant regions are of major interest in the comparison of alternative models.
A number of approaches have been developed for the characterization of alternative structures. These approaches tend to rely on clustering methods (Kaufman and Rousseeuw, 2005), distance matrices (Phillips, 1970), difference distance matrices (Nishikawa et al., 1972) and variance matrices (Kelley et al., 1997). Clustering approaches have been proposed for identifying the representative structures in NMR ensembles (Kelley et al., 1996) and for grouping the alternative models available for each protein (Domingues et al., 2004). A local backbone-superposition method has been proposed for locating and visualizing variable regions (Lema and Echave, 2005). NMRCORE (Kelley et al., 1997) has been proposed for identifying invariant regions (also called local structural domains) in NMR ensembles by clustering, based on the Cα atomic distance variances. ESCET computes differences of Cα atomic distance for visualizing the structural similarities and differences between alternative structures and for identifying the invariant regions (Schneider, 2000, 2002, 2004).
We describe STRuster, a method for characterizing alternative models of a given protein at two levels: backbone conformation and side-chain conformation. There are several stages in the analysis: grouping of models, analysis of variation and comparison. The Cα atoms are used to analyse the backbone conformation. In the first stage the models are grouped according to backbone structural similarity using clustering methods. This first stage has been previously described (Domingues et al., 2004), but all other functionality is new. In the second stage, the structural variation is analysed. In particular, the location and extent of the backbone similarities and differences are identified. The structural differences can be the result of flexibility resulting from thermal motion, or alternatively from collective motions or from triggered conformational changes (Petsko and Ringe, 2003). Thermal and collective motions are usually associated with a continuous distribution of the different conformations, but triggered conformational changes can result in distinct conformational states. Estimates of coordinate uncertainty are indicators of thermal motion. The variance of atomic distances provides a measure of collective motions and of triggered conformational changes. In the third stage, distinct conformational states are identified by comparison of subsets of alternative models. Of particular interest is the identification of invariant and variable parts of the protein. The relative geometry is preserved in the invariant regions, but not in the variable parts. A similar three-stage analysis (grouping, variation analysis and comparison) is performed to compare side-chain conformations in order to characterize in detail specific sites in the protein.
| 2 METHODS |
|---|
First the criteria for selecting the structural data used in the analysis are described. Then we explain the alignment procedure, and review the clustering method. Finally, we describe the different approaches for characterizing backbone and side-chain conformations: analysis of variation, comparison of subsets and identification of invariant and variable regions. The Cα atom coordinates are used to analyse backbone conformations; alternatively, the coordinates of the side-chain atoms are used for more detailed analysis of the side-chain conformations.
2.1 Structural data
Structural models were obtained from ASTRAL SCOP 1.71 (Chandonia et al., 2004), which correspond to the SCOP 1.71 (Andreeva et al., 2004) domain definitions. Each set contains the structural models (entries) classified at the same SCOP species level. There are 75 930 SCOP models in total. The analysis was restricted to sets with two or more entries, but with fewer than 60 entries. It was also restricted to the first seven SCOP classes: all alpha, all beta, alpha and beta (a/b), alpha plus beta (a + b), multi-domain, membrane and cell surface and small proteins. Four classes were excluded: coiled-coil proteins, low-resolution protein structures, peptides and designed proteins. Only entries for which a diffraction-component precision index value could be computed (see below) and which could be aligned were used in the analysis. In total 36 634 SCOP models were used, divided into 5837 sets.
The diffraction-component precision index (DPI) was used to derive estimates of atom coordinate uncertainties for models derived by X-ray crystallography (Cruickshank, 1999). The result is an estimate of coordinate uncertainty ($\sigma_i$) for atom i. The centroid coordinate uncertainty is the mean uncertainty of the atom coordinates used in the computation of the centroid. DPI values were computed for 85% of all PDB entries available in February 2007, and are available on the STRuster web site. See Supplementary Material for more details.
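The centroid uncertainty rule stated above can be illustrated with a minimal sketch. The function names, the atom coordinates and the per-atom uncertainty values below are hypothetical and chosen only for illustration; how the per-atom uncertainties are derived from the DPI is not shown here.

```python
def centroid(coords):
    """Return the arithmetic mean position of a list of (x, y, z) tuples."""
    n = len(coords)
    return tuple(sum(c[axis] for c in coords) / n for axis in range(3))


def centroid_uncertainty(sigmas):
    """Mean of the per-atom coordinate uncertainties (in angstrom)."""
    return sum(sigmas) / len(sigmas)


# Hypothetical residue: three atoms with made-up coordinates and uncertainties.
atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
sigmas = [0.10, 0.14, 0.18]

print(centroid(atoms))
print(round(centroid_uncertainty(sigmas), 3))
```

The same two helpers would apply unchanged to a single Cα atom, in which case the centroid is the atom position and the centroid uncertainty is the atom uncertainty.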
2.2 Alignment
Each set contains models for one type of protein matching a UniProt entry (Bairoch et al., 2005). Each model was aligned to the protein sequence from UniProt, using the mappings between PDB residue number and UniProt sequence position (Martin, 2005). A set was only used in the analysis if every PDB entry was mapped to the same UniProt entry. Sets where different PDB entries were mapped to different UniProt sequences were excluded. The PDB residue-UniProt sequence pairwise alignments were then combined using the UniProt sequence as reference, producing a multiple sequence alignment. Alignments can include substitutions and insertions/deletions relative to the UniProt sequence. We refer to the different residue positions in the protein by the alignment positions (starting at index 0). For the three examples provided in the Results section, the mapping between the alignment positions and the PDB residue numbering is provided on the STRuster web site and can be retrieved by typing the appropriate SCOP or PDB codes.
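The combination of pairwise residue mappings into a reference-anchored alignment can be sketched as follows. This is a simplified illustration, not the STRuster implementation: the data layout (a dictionary from 0-based reference position to PDB residue number), the function names and the model identifiers are assumptions, and insertions relative to the reference sequence are ignored.

```python
def build_alignment(reference_length, mappings):
    """Place each model's residues on the reference sequence positions.

    mappings: {model_id: {reference_position (0-based): pdb_residue_number}}
    Returns {model_id: list of length reference_length}, with None where the
    model has no residue mapped to that position.
    """
    alignment = {}
    for model_id, mapping in mappings.items():
        row = [None] * reference_length
        for position, residue_number in mapping.items():
            if 0 <= position < reference_length:
                row[position] = residue_number
        alignment[model_id] = row
    return alignment


def shared_positions(alignment):
    """Alignment positions that are occupied in every model."""
    rows = list(alignment.values())
    return [i for i in range(len(rows[0])) if all(r[i] is not None for r in rows)]


# Hypothetical mappings for two models of a five-residue reference sequence.
mappings = {
    "model_a": {0: 4, 1: 5, 2: 6, 3: 7},
    "model_b": {1: 1, 2: 2, 3: 3, 4: 4},
}
alignment = build_alignment(5, mappings)
print(alignment["model_a"])
print(shared_positions(alignment))
```

Restricting later distance calculations to the shared positions is one simple way to handle deletions; the dissimilarity measure used for clustering treats insertions and deletions in its own way, as described in the next section.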
2.3 Clustering
A hierarchical clustering method for grouping the models according to structure similarity was applied to each set. Clustering is implemented as previously described (Domingues et al., 2004); see Supplementary Material for a review of the method. The main difference is that the dissimilarity measure was modified to take into account insertions and deletions in the residue mapping between two entries as obtained from the multiple sequence alignment. The Cα atom distances are used when clustering according to backbone structural similarity. When a more detailed analysis sensitive to side-chain conformations is required, the distances between the side-chain centroids (including the Cα atom) are also used. The silhouette width value (Rousseeuw, 1987) is a measure of cluster quality, which is used to identify the best number of groups obtained by hierarchical clustering.
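As an illustration of how the silhouette width can rank alternative partitions, the sketch below computes the average silhouette width from a precomputed dissimilarity matrix and a list of cluster labels, following the standard definition of Rousseeuw (1987). The function name and the toy dissimilarity values are hypothetical; the paper itself uses R for clustering.

```python
def average_silhouette_width(dissim, labels):
    """Average silhouette width for a partition.

    dissim: symmetric matrix (list of lists) of pairwise dissimilarities.
    labels: cluster label of each item. Items in singleton clusters score 0.
    """
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("the silhouette width needs at least two clusters")

    widths = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            widths.append(0.0)
            continue
        # a: mean dissimilarity to the other members of the item's own cluster
        a = sum(dissim[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean dissimilarity to the members of any other cluster
        b = min(
            sum(dissim[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(widths) / len(widths)


# Toy dissimilarities for four models forming two tight pairs.
dissim = [
    [0.0, 1.0, 5.0, 5.0],
    [1.0, 0.0, 5.0, 5.0],
    [5.0, 5.0, 0.0, 1.0],
    [5.0, 5.0, 1.0, 0.0],
]
good = average_silhouette_width(dissim, [0, 0, 1, 1])
poor = average_silhouette_width(dissim, [0, 1, 1, 1])
print(good > poor)
```

Evaluating this value for each candidate number of groups and keeping the partition with the largest average is the usual way to pick the number of clusters.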
2.4 Variation matrices
The variation matrices are used for visualizing the location and extent of structural variability over a set of alternative models. The structural variability is measured at the level of backbone using Cα atom coordinates, or alternatively using the centroid of the side chain (including the Cα atom). Four types of matrices are computed. They provide complementary information. Matrix S is the standard-deviation (SD) matrix, and gives the SDs of the coordinate distances. The total SD matrix (T) accounts not only for the distance variation but also for the coordinate uncertainties. Both the S and T matrices provide measures of distance variability in absolute units (Å). The relative SD matrix (R) provides a measure of structural variability relative to the estimates of uncertainty in order to help in the identification of significant conformational variation. Finally, the maximum relative difference matrix (X) contains the largest pairwise structural differences at each position in the set of models. The matrix rows and columns correspond to the alignment positions (i,j), and all the matrices are symmetric.
Given a set of models $A = \{a_1, \ldots, a_k, \ldots, a_m\}$, for any model $a_k \in A$, the expression $d_{ij}^{a_k}$ denotes the Cα or centroid coordinate distance between residue positions i and j in the alignment. We define:
|
| (1) |
|
| (2) |
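A minimal sketch of the SD matrix described in the text: for every pair of alignment positions, the inter-residue distance is computed in each model and the standard deviation of that distance is taken over the set. The use of the sample standard deviation (division by m − 1) is an assumption, as are the function names and the toy coordinates.

```python
import math
from statistics import stdev


def distance_matrix(coords):
    """All pairwise Euclidean distances for one model's (x, y, z) coordinates."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]


def sd_matrix(models):
    """Per-position-pair standard deviation of the distances over all models.

    Assumption: the sample SD is used, so at least two models are required.
    """
    matrices = [distance_matrix(model) for model in models]
    n = len(matrices[0])
    return [
        [stdev([matrix[i][j] for matrix in matrices]) for j in range(n)]
        for i in range(n)
    ]


# Three toy models of a three-residue chain; only the last residue moves.
models = [
    [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (6.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (7.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (8.0, 0.0, 0.0)],
]
S = sd_matrix(models)
for row in S:
    print([round(value, 2) for value in row])
```

In this toy case the pair (0, 1) has zero SD because the first two residues keep their relative position, whereas every pair involving the last residue shows an SD of 1 Å.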
The total SD matrix $T_A(i, j)$ also takes into account the estimates of coordinate uncertainty. The coordinate uncertainty for residue i in model $a_k$ is denoted by $\sigma_i^{a_k}$. Neglecting the covariance, one can estimate the distance uncertainty $\sigma_{ij}^{a_k}$ as:
|
| (3) |
|
| (4) |
|
| (5) |
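As an illustration of how coordinate uncertainties might enter the total SD, the sketch below assumes standard error propagation without covariance: the uncertainty of a distance is taken as the root sum of squares of the two coordinate uncertainties, and the total SD combines the distance SD with the mean squared distance uncertainty of the set. Both combination rules, as well as the function names and numbers, are assumptions made for illustration and may differ in detail from the definitions used by STRuster.

```python
import math


def distance_uncertainty(sigma_i, sigma_j):
    """Uncertainty of the i-j distance from two coordinate uncertainties.

    Assumption: independent errors, so the variances simply add.
    """
    return math.hypot(sigma_i, sigma_j)


def total_sd(sd_ij, sigma_pairs):
    """Combine the distance SD with the mean squared distance uncertainty.

    sd_ij: SD of the i-j distance over the set of models (angstrom).
    sigma_pairs: one (sigma_i, sigma_j) tuple per model.
    Assumption: the two variance components are added.
    """
    mean_variance = sum(
        distance_uncertainty(si, sj) ** 2 for si, sj in sigma_pairs
    ) / len(sigma_pairs)
    return math.sqrt(sd_ij ** 2 + mean_variance)


# Hypothetical values: two models with identical coordinate uncertainties.
print(round(distance_uncertainty(0.3, 0.4), 3))
print(round(total_sd(1.2, [(0.3, 0.4), (0.3, 0.4)]), 3))
```

Under these assumptions the total SD is never smaller than the distance SD, which agrees with the increased variance reported for T relative to S in the Results section.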
The relative SD matrix RA(i, j) provides a measure of significant variability as the ratio of
SET and
SUS:
|
| (6) |
The maximum relative difference matrix X describes the structural outliers. Matrix X is based on the maximal differences between the distances and it has been previously proposed (Schneider, 2000).
|
| (7) |
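The idea of a maximum relative difference can be sketched for a single pair of positions (i, j). The sketch assumes an error-scaled difference in the spirit of Schneider (2000): each pairwise difference between two models is divided by the combined distance uncertainty of those two models, and the largest ratio over all model pairs is kept. The exact scaling used for X, the function name and the numbers are assumptions.

```python
import math
from itertools import combinations


def max_relative_difference(distances, uncertainties):
    """Largest error-scaled difference of one distance over all model pairs.

    distances[k]: the i-j distance in model k (angstrom).
    uncertainties[k]: the uncertainty of that distance in model k (angstrom).
    Model pairs with zero combined uncertainty are skipped.
    """
    largest = 0.0
    for a, b in combinations(range(len(distances)), 2):
        scale = math.hypot(uncertainties[a], uncertainties[b])
        if scale > 0:
            largest = max(largest, abs(distances[a] - distances[b]) / scale)
    return largest


# Hypothetical distance between positions i and j in three models.
distances = [10.0, 10.2, 13.0]
uncertainties = [0.3, 0.4, 0.4]
x_ij = max_relative_difference(distances, uncertainties)
print(round(x_ij, 2))
```

Because only the most extreme pair contributes, a single outlying model is enough to produce a large value, which is consistent with the description of X as sensitive to structural outliers.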
2.5 Comparison matrices
The comparison matrices are used for identifying the structural differences between two subsets A = {a1, ... , ak, ... , am} and B = {b1, ... , bk, ... , bn}. The method provides three types of comparisons, namely between two subsets, between a single entry and a subset, and between two single entries. The backbone conformations are compared using Cα atom distances, and side-chain conformations are compared using distances between the centroid of the side-chain and Cα atoms.
For each pair of positions (i, j) in the alignment, the extent of agreement between the two distance distributions of the two subsets A and B, relative to the variance of the distributions, is given by the value of the Welch statistic (Welch, 1938). There are two components in estimating the variance, one resulting from the variance of the distances, and one from the distance uncertainties. If only the distance distributions are considered, then:
|
| (8) |
|
| (9) |
|
| (10) |
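For reference, the textbook form of the Welch statistic for one pair of positions, using only the distance distributions of the two subsets, can be sketched as follows. The function name and the sample distances are hypothetical, and whether STRuster stores the signed value or its magnitude in the matrix is not specified here.

```python
import math
from statistics import mean, variance


def welch_statistic(sample_a, sample_b):
    """Welch statistic for two samples with possibly unequal variances.

    Each sample holds the i-j distance in the models of one subset and must
    contain at least two values.
    """
    m, n = len(sample_a), len(sample_b)
    difference = mean(sample_a) - mean(sample_b)
    standard_error = math.sqrt(variance(sample_a) / m + variance(sample_b) / n)
    if standard_error == 0:
        # Both samples are constant: identical means agree perfectly,
        # different means are infinitely far apart on this scale.
        return 0.0 if difference == 0 else math.copysign(math.inf, difference)
    return difference / standard_error


# Hypothetical i-j distances (angstrom) in subsets A and B.
subset_a = [10.0, 11.0, 12.0]
subset_b = [14.0, 15.0, 16.0, 17.0]
print(round(welch_statistic(subset_a, subset_b), 3))
```

A large magnitude indicates that the two subsets have clearly separated distance distributions at that pair of positions, which is the signature of distinct conformational states.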
Extending the formalism to the comparison of a single entry a to a subset B, and not considering the coordinate uncertainties, we define:
|
| (11) |
|
| (12) |
|
| (13) |
|
| (14) |
|
| (15) |
If only two entries a and b are compared, the variance is derived from the distance uncertainties as previously proposed (Schneider, 2000):
|
| (16) |
|
| (17) |
2.6 Identification of hinges, variable and invariant regions
The relative orientation of the backbone is preserved in the invariant regions. Invariant regions can be composed of more than one segment of contiguous residues (invariant segments). Invariant segments in invariant regions are structurally conserved relative to each other. The backbone structure is not preserved in variable segments. Hinge segments are associated with short flexible fragments on the protein backbone and with transitions between different invariant and variable segments.
Invariant backbone regions, as well as hinges and variable segments, are identified from the variation matrices or from the comparison matrices using a data smoothing approach. First the hinge segments are identified, then the remaining inter-hinge segments are classified either as variable or as invariant segments. Finally, invariant segments that preserve the relative orientation to each other are grouped into invariant regions. See Supplementary Material for a detailed explanation.
A clustering approach is used to identify regions with invariant side chains, based on the variation or comparison matrices computed with side-chain centroid distances. The matrix elements are used as distances for hierarchical clustering with group average agglomeration. The resulting tree is cut at a certain cutoff (s_cutoff).
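The clustering step for side chains can be sketched with a naive group-average (average-linkage) agglomeration that stops once the closest pair of groups is further apart than the cutoff, which gives the same groups as cutting the full tree at that height. This is an illustrative pure-Python version with hypothetical matrix values; the paper relies on R for clustering, and the parameter name `s_cutoff` simply mirrors the text.

```python
def average_linkage_groups(dissim, s_cutoff):
    """Group items by average-linkage clustering, cut at s_cutoff.

    dissim: symmetric matrix whose elements are used as distances.
    Returns a sorted list of sorted index lists, one per group.
    """
    groups = [[i] for i in range(len(dissim))]

    def linkage(g1, g2):
        # Group average: mean of all pairwise distances between the groups.
        return sum(dissim[i][j] for i in g1 for j in g2) / (len(g1) * len(g2))

    while len(groups) > 1:
        height, a, b = min(
            (linkage(groups[a], groups[b]), a, b)
            for a in range(len(groups))
            for b in range(a + 1, len(groups))
        )
        if height > s_cutoff:
            break
        merged = groups[a] + groups[b]
        groups = [g for k, g in enumerate(groups) if k not in (a, b)] + [merged]
    return sorted(sorted(g) for g in groups)


# Toy variation values for four residues: two mutually invariant pairs.
matrix = [
    [0.0, 0.2, 2.0, 2.0],
    [0.2, 0.0, 2.0, 2.0],
    [2.0, 2.0, 0.0, 0.3],
    [2.0, 2.0, 0.3, 0.0],
]
print(average_linkage_groups(matrix, s_cutoff=0.5))
print(average_linkage_groups(matrix, s_cutoff=3.0))
```

Because the residues are grouped by matrix values rather than by sequence position, the resulting invariant regions need not be contiguous in the sequence.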
2.7 Superposition
For an invariant backbone region, an optimal superposition between all structural entries and a representative entry is computed. The representative structure is chosen as the structure with the lowest sum of the backbone clustering dissimilarity values in the invariant region. The superposition between each entry and the representative is computed based on the invariant regions. The superpositions were performed using the PDB module of Biopython (http://biopython.org/) (Hamelryck and Manderick, 2003).
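The choice of the representative entry amounts to picking the entry with the smallest summed dissimilarity to all others. A minimal sketch, assuming a precomputed dissimilarity matrix restricted to the invariant region; the function name, the model identifiers and the values are hypothetical.

```python
def choose_representative(model_ids, dissim):
    """Return the model with the lowest sum of dissimilarities to all models.

    model_ids: identifiers in the same order as the rows of dissim.
    dissim: symmetric matrix of pairwise dissimilarity values.
    """
    totals = [sum(row) for row in dissim]
    best = min(range(len(model_ids)), key=totals.__getitem__)
    return model_ids[best]


# Hypothetical dissimilarities between three models in one invariant region.
model_ids = ["m1", "m2", "m3"]
dissim = [
    [0.0, 0.5, 2.0],
    [0.5, 0.0, 1.0],
    [2.0, 1.0, 0.0],
]
print(choose_representative(model_ids, dissim))
```

The selected entry could then serve as the fixed structure when each remaining entry is superimposed on it, for instance with the `Superimposer` class of `Bio.PDB` applied to the atoms of the invariant region.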
2.8 Implementation and visualization tools
The methods were implemented in Python (http://www.python.org/), using the Biopython library. Version 2.1.0 of the R environment for statistical computing (R Development Core Team, 2005) was used for clustering, for data smoothing and for visualization. PyMOL (http://www.pymol.org) was used for molecular rendering and visualization.
| 3 RESULTS AND DISCUSSION |
|---|
The method has been applied to three sets of alternative models, corresponding to different types of conformational variability. The method has also been applied to the models in the SCOP classification database and the results are summarized.
3.1 Backbone conformational analysis
Serum transferrin is responsible for the transport of iron along the bloodstream and into the cells. Transferrin binds to Fe³⁺ through six coordination sites provided by four amino acids and by a synergistic anion (CO₃²⁻) (MacGillivray et al., 1998). In total, 19 models of human transferrin were collected from SCOP (sunid 53899) and aligned. The atom coordinate uncertainties were computed for each model. The original STRuster implementation was already applied to clustering transferrin models (Domingues et al., 2004). The new STRuster implementation is again applied to analysing backbone conformations in transferrin. The results now include in addition the variation and comparison matrices, as well as the identification of invariant regions used for superposition.
3.1.1 Clustering
The transferrin backbone clustering results are given in Figure 1. Two clusters, A and B, are noticeable, with cluster B further subdivided into C and D, as described previously (Domingues et al., 2004). The best clustering according to average silhouette width values corresponds to clusters A, C and D (see Figure S2 in Supplementary Material). All models in cluster A correspond to the apo form of transferrin, while the models in cluster B correspond to the iron-binding form. The two clusters C and D into which B is divided correspond to models obtained from two different crystal forms: P4₁2₁2 (cluster C) and P2₁2₁2₁ (cluster D).
3.1.2 Variation matrices
The variation matrices allow for visualizing and locating the parts of the protein that are structurally conserved and the parts with considerable conformational variability. The results obtained with the different matrices are similar but differ in detail. Matrix T and matrix S are displayed in Figure 1, and are compared in Supplementary Figure S3. Both give similar results, but the total SD matrix shows increased variance in some areas reflecting the distance uncertainty contribution. In both matrices one can observe three large segments of low variance along the diagonal, separated at approximately positions 90 and 250. There is considerable variance between the first and the second segment, as well as between the second and the third segment, but not between the first and the third segment. There is also some variance around position 140 and at the C terminus.
The relative SD matrix R and the maximum relative difference matrix X are shown in Figure 1. Enlarged plots are available in Supplementary Figure S4. They are similar to the other matrices S and T, but the boundaries of the three invariant segments are more clearly visible. X is sensitive to the largest conformational differences in the set, and shows larger relative variation than R around positions 140 and 300–334.
We focus our analysis on the results obtained with T. In this matrix, five invariant segments are detected which are separated by hinge segments. The invariant segments span the positions: 1–59, 67–80, 85–125, 142–235 and 246–303. STRuster groups these segments in two invariant regions. One region includes the segments at the N terminus (up to 80), and at the C terminus (starting at 246). The second region includes the two middle segments between 85 and 235.
3.1.3 Comparison matrices
The variation matrices provide a description of the structural variability of the protein but they do not allow for differentiating between continuous structural variability and alternative conformational states. In order to identify distinct conformations, candidate subsets are first selected according to the clustering results. Then, the Cα distance distributions between pairs of aligned residues from the two subsets are compared. Segments with significantly different distributions indicate distinct conformational states.
To investigate whether the different clusters correspond to distinct conformations, cluster A was compared to cluster B (matrix UAB) and cluster C was compared to cluster D (matrix UCD). The results are shown in Figure 1. The matrix UAB gives a distribution of invariant and variable segments similar to the results obtained with the variation matrices. The two invariant regions are also identified with similar boundaries. These results clearly indicate that the models in cluster A have a distinct conformation relative to the models in cluster B. In particular, the middle region of the protein backbone in models from cluster A tends to have larger distances to both ends of the backbone than in models from cluster B. Matrix UAB is compared to matrix VAB in Figure S5. Unlike U, matrix V does not take into account coordinate uncertainties. An improved signal in U over V is noticeable.
Cluster B was further analysed by comparison of the two subsets C and D into which B is divided. The matrix UCD is shown in Figure 1. Three hinges are identified at the middle and at the C-terminal end of the backbone (positions 133–141, 304–305 and 318–334), with the remaining residues corresponding to a single invariant region. These results indicate that short segments at the middle of the backbone and at the end have different conformations in the subsets C and D, which reflects the differences between the two crystal forms. In particular, the differences in the loop around position 135 result from different inter-molecular contacts in the two crystal forms (MacGillivray et al., 1998).
3.1.4 Superposition
Figure 1 shows the 19 transferrin models with the two invariant regions optimally superimposed. The invariant regions were identified in the variation matrix T. The proteins consist of two subdomains that move relative to each other when binding iron. The first subdomain consists of both the N-terminal and C-terminal parts of the protein backbone. The second subdomain corresponds to the middle part of the protein backbone. The first and second subdomains match the first and the second invariant regions identified in T. The second subdomain also includes the C-terminal helix. The results obtained with T and UAB, and the superpositions reflect the considerable conformational change that occurs when binding iron. In the iron-free form (A), the structure is an open conformation with the two subdomains apart from each other (Jeffrey et al., 1998). In the iron-binding form (B), the two subdomains move closer towards each other (MacGillivray et al., 1998).
3.2 Variable segments
In the previous example, two conserved backbone substructures are observed in alternative relative orientations. However, substructures might not always be conserved and can show different internal structures, as in rabbit fructose 1,6-bisphosphate aldolase (SCOP sunid 51580). This glycolytic enzyme includes a C-terminal region that displays substantial conformational differences. It has been suggested that the backbone conformational mobility plays a functional role in the attachment and release of the reaction product (Blom and Sygusch, 1997). In total 12 models were analysed, and the C-terminal region is identified as a variable segment in the T matrix (see Supplementary Figure S6).
3.3 Analysis of functional site
The STRuster method allows not only for the analysis of backbone conformations, as shown in the previous results, but also for performing detailed analysis of side-chain conformations, as demonstrated in the analysis of the ligand binding site of α-amylase. The α-amylases catalyze the hydrolysis of the α-(1,4) glycosidic bonds in different polysaccharides. The porcine α-amylase consists of two domains according to SCOP, an N-terminal catalytic domain and a C-terminal domain. STRuster was applied to the SCOP models for the porcine catalytic domain (SCOP sunid 51459). The results obtained for the backbone conformational analysis reflect the effect of interactions between different proteins and ligands with the ligand binding site. The analysis reveals two alternative conformations of the loop at position 305 associated with ligand binding. Details are provided in the Supplementary Material.
In order to investigate in more detail the conformational changes associated with ligand binding, a side-chain conformational analysis was performed in the ligand binding site region. Figure 2 shows the clustering results, the T matrix and the superposition of the ligand binding site residues. The clustering result reveals three major clusters (with best average silhouette width): B, E1 and E2. Cluster B corresponds to the ligand-bound form. Cluster E1 corresponds to a complex between the enzyme and an inhibitor antibody. Cluster E2 corresponds to an empty ligand binding site. The only exception is d1ua3a2, with a partially occupied binding site, as described in the Supplementary Material. The T matrix was computed for the binding site residues based on the side-chain centroid distances. Considerable conformational variability is observed for the side chains of residues 151, 163, 238, 240, 300 and 305–308, reflecting the alternative ligand-bound/unbound states at the functional site. The comparison matrix allows the identification of the residues in distinct conformations in B, E1 and E2. In particular, models in clusters E1 and E2 show different side-chain conformations at residues 151 and 163 (see Supplementary Material).
3.4 Comparison to other approaches
Many tools are available for protein structure comparison, and a few have been specifically implemented to compare alternative structures of the same protein. How does STRuster compare to these methods? As mentioned in the Introduction section, the comparison of the structure of different proteins is not the same problem as the comparison of alternative conformers of the same protein. In the comparison of alternative structures, the alignment is not determined by an optimization procedure using measures of structural similarity. Instead the alignment is determined by matching the residues in the structural model to the protein sequence. Regions that are dissimilar in structure are not left out as gaps, as is usual in the comparison of different structures; instead they are aligned and the extent of variation in these regions is measured. The structural differences can be rather small in the comparison of alternative structures, within the range of the atom coordinate uncertainties. Therefore, these uncertainties are taken into account in the comparison of alternative structures using STRuster and ESCET, but they are usually not considered by general structure comparison methods. STRuster is most closely related to ESCET (Schneider, 2002), as both methods rely on the comparison of distance matrices for the identification of invariant regions. ESCET identifies invariant regions based on the X variation matrix using a genetic algorithm. Using X to identify the invariant regions with STRuster should give similar results to ESCET in the same set of proteins; the difference is in the approach for identification of invariant regions. The two methods have been compared in five sets of alternative structures that have been previously proposed (Schneider, 2002). Similar invariant regions are found by the two methods; the results are given in the Supplementary Material. Nevertheless, there are notable differences between STRuster and ESCET.
In contrast to ESCET, STRuster provides additional variation matrices that provide complementary information. STRuster also provides an approach for comparing subsets using U and V matrices, for clustering the alternative models and for identifying subsets of similar structures based on the clustering results. Another distinctive feature of STRuster is that it is applicable not only to the comparison of backbone structures, but also to the comparison of side-chain conformations.
3.5 Application to SCOP
The method was applied to the sets of models available at each SCOP species level in order to assess the extent of structure variation among the sets of alternative models, and to demonstrate that STRuster can be applied to a large number of sets. The analysis was restricted to the PDB entries with computed DPI values and to sets with at least two alternative models. In total 36 634 different models were analysed in 5837 different sets. The largest set includes 56 models. The average set size is 6 models. The sets were aligned and the variation matrices were computed (S, T and R). Hinge segments and invariant regions were identified based on the matrix T. A PDB coordinate file with the models superimposed according to the invariant regions was also computed. The results are available on the STRuster web site (http://struster.bioinf.mpi-inf.mpg.de).
The number of hinges provides an indication of the conformational variability within a set. For most sets (65%), at least one hinge segment is identified when using the matrix T. For 42% of the sets at least two hinges are identified. Only 35% of the sets have no hinge detected with T. The percentage of sets with no hinge is larger if either the S or R matrices are used (43% and 46%, respectively). With the T matrix only one invariant region is identified in most sets (95%), and only 4% of the sets have two invariant regions. Variable segments are identified in few sets (4.4%).
| 4 CONCLUSIONS |
|---|
Two different approaches are currently used in STRuster to identify invariant regions: a smoothing approach is used to identify backbone invariant regions, and a separate clustering approach is used to identify invariant regions in specific sites where the residues are not necessarily contiguous in the sequence. We intend to develop a single common approach for the identification of invariant regions.
In the analysis of side-chain conformation the side-chain centroid is used. The centroid representation is robust regarding residue substitutions or missing atom coordinates, but it is not very sensitive to small conformational differences. We intend to implement alternative approaches that are more sensitive to small differences in the side-chain conformations.
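A minimal sketch of the centroid representation illustrates why it tolerates missing atoms. The input format (a dictionary of atom names to coordinates for one residue), the backbone atom list and the treatment of residues without side-chain atoms are assumptions for illustration, not the STRuster implementation; hydrogen atoms are assumed to be absent, as is typical for X-ray models.

```python
# PDB-style backbone atom names; every other atom is treated as side chain.
BACKBONE_ATOMS = {"N", "CA", "C", "O", "OXT"}


def side_chain_centroid(atoms):
    """Centroid of the side-chain atoms that are present in one residue.

    atoms: dict mapping atom name -> (x, y, z) (hypothetical input format).
    Returns an (x, y, z) tuple, or None when no side-chain atom is
    available (e.g. glycine); how such residues should be handled is
    left open in this sketch.
    """
    coords = [xyz for name, xyz in atoms.items() if name not in BACKBONE_ATOMS]
    if not coords:
        return None
    n = len(coords)
    return tuple(sum(c[k] for c in coords) / n for k in range(3))
```

Because the centroid is an average over whichever side-chain atoms are present, a residue with an unresolved terminal atom, or a substituted residue with a different atom set, still yields a comparable point. The same averaging is what damps small changes, such as a shift of a single atom.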
To demonstrate the capabilities of STRuster, we have focused on the analysis of models determined by X-ray crystallography. They correspond to the majority of the current entries in PDB (85%). Most of the remaining entries in PDB have been obtained by NMR spectroscopy. They were excluded from the current study mostly for practical reasons. Nevertheless, NMR models can also be analysed with STRuster: in particular, one can group the models by clustering, analyse the structural variation with the S matrix or compare models with the V matrix. The T, R, X and U matrices cannot currently be computed for NMR models, as they rely on estimates of coordinate uncertainty. As the method matures, we will consider implementing a measure of coordinate uncertainty for NMR models, so that STRuster can be fully applied to these models.
The main motivation for developing STRuster was to provide a tool for the detailed structural analysis of specific proteins. Drug targets and proteins of medical interest are obvious candidates, as a considerable amount of structural information is available for them. In particular, we are currently applying STRuster to the backbone and side-chain structure analysis of hepatitis C virus (HCV) proteins. For example, there are currently 38 PDB entries for HCV NS5B, an RNA-dependent RNA polymerase, which is the focus of intensive research as a target for new antiviral drugs. For 32 of these entries, the protein is associated with different ligands. The number of actual models is much higher (63), as some entries include two models in the asymmetric unit. The amount of structural data is similar for the HCV protease, and is considerably larger for some of the HIV drug targets.
| ACKNOWLEDGEMENTS |
|---|
We thank Ingolf Sommer for the helpful comments. This is part of the BioSapiens project, which is funded by the European Commission, contract number LSHG-CT-2003-503265. Financial support was also provided by the BMBF (grant No. 01GR0453).
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Burkhard Rost
Received on June 12, 2007; revised on September 13, 2007; accepted on September 28, 2007
| REFERENCES |
|---|
Akamine P, et al. Dynamic features of cAMP-dependent protein kinase revealed by apoenzyme crystal structure. J. Mol. Biol. (2003) 327: 159–171.
Akamine P, et al. Balanol analogues probe specificity determinants and the conformational malleability of the cyclic 3′,5′-adenosine monophosphate-dependent protein kinase catalytic subunit. Biochemistry (2004) 43: 85–96.
Andreeva A, et al. SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res. (2004) 32: D226–D229.
Arkin MR, et al. Binding of small molecules to an adaptive protein-protein interface. Proc. Natl Acad. Sci. USA (2003) 100: 1603–1608.
Bairoch A, et al. The Universal Protein Resource (UniProt). Nucleic Acids Res. (2005) 33: D154–D159.
Berman H, et al. The Protein Data Bank. Nucleic Acids Res. (2000) 28: 235–242.
Birzele F, et al. Vorolign–fast structural alignment using Voronoi contacts. Bioinformatics (2007) 23: e205–e211.
Biswal BK, et al. Non-nucleoside inhibitors binding to hepatitis C virus NS5B polymerase reveal a novel mechanism of inhibition. J. Mol. Biol. (2006) 361: 33–45.
Blom N, Sygusch J. Product binding and role of the C-terminal region in class I D-fructose 1,6-bisphosphate aldolase. Nat. Struct. Biol. (1997) 4: 36–39.
Chandonia J-M, et al. The ASTRAL Compendium in 2004. Nucleic Acids Res. (2004) 32: D189–D192.
Cruickshank D. Remarks about protein structure precision. Acta Crystallogr. D Biol. Crystallogr. (1999) 55: 583–601.
Domingues FS, et al. Automated clustering of ensembles of alternative models in protein structure databases. Protein Eng. Des. Sel. (2004) 17: 537–543.
Hamelryck T, Manderick B. PDB file parser and structure class implemented in Python. Bioinformatics (2003) 19: 2308–2310.
Ilyin VA, et al. Structural alignment of proteins by a novel TOPOFIT method, as a superimposition of common volumes at a topomax point. Protein Sci. (2004) 13: 1865–1874.
Jeffrey P, et al. Ligand-induced conformational change in transferrins: crystal structure of the open form of the N-terminal half-molecule of human transferrin. Biochemistry (1998) 37: 13978–13986.
Kaufman L, Rousseeuw PJ. Finding Groups in Data: An Introduction to Cluster Analysis. (2005) Hoboken: Wiley-Interscience.
Kelley L, et al. An automated approach for clustering an ensemble of NMR-derived protein structures into conformationally related subfamilies. Protein Eng. (1996) 9: 1063–1065.
Kelley L, et al. An automated approach for defining core atoms and domains in an ensemble of NMR-derived protein structures. Protein Eng. (1997) 10: 737–741.
Lema MA, Echave J. Assessing local structural perturbations in proteins. BMC Bioinformatics (2005) 6: 226.
MacGillivray R, et al. Two high-resolution crystal structures of the recombinant N-lobe of human transferrin reveal a structural change implicated in iron release. Biochemistry (1998) 37: 7919–7928.
Machius M, et al. Carbohydrate and protein-based inhibitors of porcine pancreatic alpha-amylase: structure analysis and comparison of their binding characteristics. J. Mol. Biol. (1996) 260: 409–421.
Madhusudan, et al. Crystal structure of a transition state mimic of the catalytic subunit of cAMP-dependent protein kinase. Nat. Struct. Biol. (2002) 9: 273–277.
Martin ACR. Mapping PDB chains to UniProtKB entries. Bioinformatics (2005) 21: 4297–4301.
Nishikawa K, et al. Tertiary structure of proteins. I. Representation and computation of the conformations. J. Phys. Soc. Jpn (1972) 32: 1331–1337.
Nurizzo D, et al. Crystal structures and iron release properties of mutants (K206A and K296A) that abolish the dilysine interaction in the N-lobe of human transferrin. Biochemistry (2001) 40: 1616–1623.
Petsko GA, Ringe D. Protein Structure and Function. (2003) London: Sinauer Associates, New Science Press Ltd.
Phillips D. The development of crystallographic enzymology. Biochem. Soc. Symp. (1970) 30: 11–28.
R Development Core Team. R: A language and environment for statistical computing. (2005) Vienna, Austria: R Foundation for Statistical Computing.
Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. (1987) 20: 53–65.
Schneider TR. Objective comparison of protein structures: error-scaled difference distance matrices. Acta Crystallogr. D Biol. Crystallogr. (2000) 56: 714–721.
Schneider TR. A genetic algorithm for the identification of conformationally invariant regions in protein molecules. Acta Crystallogr. D Biol. Crystallogr. (2002) 58: 195–208.
Schneider TR. Domain identification by iterative analysis of error-scaled difference distance matrices. Acta Crystallogr. D Biol. Crystallogr. (2004) 60: 2269–2275.
Shatsky M, et al. A method for simultaneous alignment of multiple protein structures. Proteins (2004) 56: 143–156.
Sierk ML, Kleywegt GJ. Déjà vu all over again: finding and analyzing protein structure similarities. Structure (2004) 12: 2103–2111.
Taylor SS, et al. PKA: a portrait of protein kinase dynamics. Biochim. Biophys. Acta (2004) 1697: 259–269.
Thanos CD, et al. Hot-spot mimicry of a cytokine receptor by a small molecule. Proc. Natl Acad. Sci. USA (2006) 103: 15422–15427.
Welch B. The significance of the difference between two means when the population variances are unequal. Biometrika (1938) 29: 350–362.
Wu J, et al. Crystal structure of the E230Q mutant of cAMP-dependent protein kinase reveals an unexpected apoenzyme conformation and an extended N-terminal A helix. Protein Sci. (2005) 14: 2871–2879.
Ye Y, Godzik A. Multiple flexible structure alignment using partial order graphs. Bioinformatics (2005) 21: 2362–2369.
Zhang Y, Skolnick J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. (2005) 33: 2302–2309.