Skip Navigation

Bioinformatics 2008 24(16):i98-i104; doi:10.1093/bioinformatics/btn271
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Csaba, G.
Right arrow Articles by Zimmer, R.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Csaba, G.
Right arrow Articles by Zimmer, R.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Protein structure alignment considering phenotypic plasticity

Gergely Csaba , Fabian Birzele and Ralf Zimmer *

Practical Informatics and Bioinformatics Group, Department of Informatics, Ludwig-Maximilians-University, Amalienstr. 17, D-80333, Munich, Germany

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS AND DISCUSSION
 4 CONCLUSION
 REFERENCES
 

Motivation: Protein structure comparison exhibits differences and similarities of proteins and protein families and may help to elucidate protein sequence and structure evolution. Despite many methods to score protein structure similarity with and without flexibility and to align proteins accurately based on their structures, a meaningful evolutionary distance measure and alignment method which models the cost of mutations, insertions and deletions occurring in protein sequences on the structure level is still missing.

Results: Here, we introduce a new measure for protein structure similarity and propose a novel method called phenotypic plasticity method (PPM) which explicitly tries to model the evolutionary distance of two proteins on the structure level by measuring the cost of ‘morphing’ one structure into the other one. PPM aligns protein structures taking variations naturally observed in groups of structures (‘phenotypic plasticity’) into account while preserving the overall topological arrangement of the structures. The performance of PPM in detecting similarities between protein structures is evaluated against well-known structure classification methods on two benchmark sets. The larger set consists of more than 3.6 million structure pairs from the SCOP database which are also consistently classified in CATH. In the current parameterization, PPM already performs comparable or better than other methods such as TM-Align and Vorolign on those two sets according to various evaluation criteria showing that the method is able to reliably classify known protein structures, to detect their similarities and to compute accurate alignments despite phenotypic plasticity.

Availability: Executables are available upon request. Datasets and supplementary data (datasets and superpositions) can be accessed on http://www.bio.ifi.lmu.de/PPM

Contact: ralf.zimmer{at}bio.ifi.lmu.de


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS AND DISCUSSION
 4 CONCLUSION
 REFERENCES
 
The comparison of protein structures and the detection of their similarities and differences is one of the major steps in understanding the interplay of protein sequence and protein structure evolution.

To date, the gold standard for protein structure classification are the (manually) curated databases SCOP (Andreeva et al., 2007) and CATH (Greene et al., 2007). Despite large efforts over many years, these ‘gold standards’ are known to disagree to some extent (Hadley and Jones, 1999), which indicates that defining and identifying structural similarity is difficult and still controversial. For benchmarking purposes of new measures and methods it is therefore suggestive to use consensus sets defined from both classifications. SCOP classifies structures into classes, folds, superfamilies and families in a hierarchic manner. But with the growing number of protein structures in the PDB (Berman et al., 2000) and the arrival of structural genomics projects (Grabowski et al., 2007) there has been a growing need to model expert knowledge in automatic methods for protein structure classification and protein structure analysis. Currently (February 2008), out of the 45 000 protein structures in the PDB, more than 10 000 are not classified in the most recent SCOP version 1.73.

A crucial step in structure analysis is the computation of a structural alignment of two or more protein structures and, in contrast to sequence alignment methods, structure alignment methods aim directly at optimizing the structural similarity of the input proteins.

Sequence alignments are commonly computed using amino acid exchange matrices as well as gap penalties. Those parameters explicitly model the process of sequence evolution by mutations as well as insertions and deletions. It is difficult to propose an equivalent model for protein structure evolution which could be used for structure alignment.

Therefore, to date, all structure alignment methods capture protein structure similarity not by the evolutionary cost of mutating (‘morphing’) one sequence and structural conformation into the other but by a similarity of the 3D objects. The methods differ in their score for protein structure similarity as well as in the algorithms used to optimize this score. Different similarity measures may be differently well suited for specific applications.

Many methods measure structural similarity based on a rigid body superposition of the input structures (given the alignment) and for that purpose several scores have been proposed. Among them are CE (Shindyalov and Bourne, 1998) (optimizing the RMSD) and TM-Align (Zhang and Skolnick, 2005) optimizing the TM-score (Zhang and Skolnick, 2004).

Though being a valid approach for many applications, rigid superposition neglects the fact that proteins are dynamic and flexible objects that allow for (significant) structural variation even within families and much more so within superfamilies and folds. Therefore, in recent years, also methods for flexible protein structure alignment have been published that account for large scale structural flexibilities and can align protein structures even in cases that show large conformational changes. A well known method is FATCAT (Ye and Godzik, 2003).

While the two approaches discussed so far explicitly use the coordinates of the protein structure in the scoring and optimization process, other methods work on different representations like contact matrices (Holm and Sander, 1993), graphs (Harrison et al., 2003) or Voronoi cells (Birzele et al., 2007). Those methods also allow for a certain amount of intrinsic flexibility of the protein structures as long as the contact networks are conserved.

The method proposed here is based on the observation that, despite conformational changes and large scale flexibilities, protein structure families exhibit a high flexibility on a smaller scale which we call phenotypic plasticity (Fig. 1). Phenotypic plasticity of a protein structure comprises the changes in the actual 3D structure, which are to be expected for the given protein sequence (the ‘genotype’) or within a given genotype population (i.e. a group of proteins related according to a family, superfamily or fold relationship). Further, phenotypic plasticity is grounded in evolutionary events that happen on the sequence level, namely mutations, insertions and deletions.


Figure 1
View larger version (49K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. On the top left figure, we show the overall multiple rigid superposition of members of the pheromone binding family (SCOP: a.39.2.1) implied by a manually curated alignment (guided by disulfide bridges and secondary structure elements). While the overall topology remains the same for all members of the family the exact position and orientation of the helices shows high plasticity (A). As can be seen from a flexible multiple superposition (B) the helices are very well conserved. Structural variations in the different helices are shown for each of the six helices separately in the lower part of the figure.

 
In contrast to existing approaches, our method, in the following called phenotypic plasticity method (PPM), explicitly tries to model, score and optimize the changes which naturally occur in protein structure evolution. Under the hypothesis that two protein structures are similar, we assume that they share locally similar substructures (not necessarily restricted to secondary structure elements). For those, single amino acid exchanges (if they occurred) have not led to major structure rearrangements. Nevertheless, some point mutations as well as most insertions and deletions require certain topological rearrangements (movement of elements) on the structure level. In PPM we now try to measure the structural cost of mutating (or ‘morphing’) one structure into the other one. During this process, we score the similarity of locally similar substructures as well as the conservation of their topological arrangement. This approach explicitly allows to map corresponding structural elements of the two structures onto each other and to observe their phenotypic plasticity in a population. This feature of PPM will be especially helpful to study the interplay of sequence and structure evolution in the future.

In the following, we will first describe our model of protein structure similarity based on phenotypic plasticity and an algorithm to optimize this similarity score. We will then evaluate and discuss in detail the performance of PPM in recognizing similar structures on two benchmark sets and will compare PPM against other well known structure classification methods. We will conclude with an outlook on future work and extensions of PPM.


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS AND DISCUSSION
 4 CONCLUSION
 REFERENCES
 
PPM scores the similarity of the topology of locally similar substructures (core blocks), and we determine possible core blocks and mappings in a first step. Having identified core blocks, we need some measure to quantify their topological similarities/differences (i.e. the implied structural mutations). We address this problem in three steps: (i) we introduce a measure quantifying topological difference of core block pairs in two structures, (ii) we introduce the graph (the PPM graph) of topological changes of pairs of core blocks, and define their topological difference on this graph and (iii) we introduce an algorithm that identifies the consistent subgraph with maximal score in the PPM graph. The PPM method is summarized in Figure 2. Please note that all similarity functions used are only presented in the form of examples in the text for space reasons. Details on those functions are available in the supplementary material on http://www.bio.ifi.lmu.de/PPM.


Figure 2
View larger version (31K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. An overview on the steps of PPM. (a) In a first step, mappings of locally similar, ungapped blocks are identified in the two structures. (b) The number of possible core block mappings is reduced due to the criterion described in Section 2.1.1. (c) The computation of the similarity of pairs of core block mappings as described in Section 2.1.1. (d) The computation of the final alignment. The cost of adding block b5 to the alignment consisting of blocks b1,..., b4. The edges e51,..., e54 are sorted with respect to their weights, i.e. e51 represents the edge with the smallest, e54 the edge with the highest cost. Depending on which edge is chosen each step, adding a block becomes more or less expensive, allowing to adjust the degree of phenotypic plasticity in the score. We used Level 3 throughout our experiments.

 
2.1 Phenotypic plasticity measure
2.1.1 Determining possible core blocks
In Figure 2, we show the result of the core block detection and the subsequent filter step. Given two protein structures p1 and p2, let B(p1) and B(p2) be sets of blocks (subsequences of the respective proteins) which are structurally similar to at least one block in the other protein.

DEFINITION [Block]
A subpart (subsequence) b1 in protein p1 is called a block if it has at least length Tl (set to 6 in our experiments) and is rigidly superposable to some subsequence b2 in protein p2, such that all associated distances in the superposition are below Td Å.

The threshold Td is chosen in a length dependent manner and represents a tradeoff between the length of the block pair and the the maximally allowed pairwise distance of the residues. The longer the fragment, the higher the variability allowed. In our experiments Td is for example set to 1.24, 2.37, 2.91 and 3.3 Å for blocks of length 6, 12, 18 and 24, respectively. Please note that using pairwise distances instead of e.g. RMSD results in highly similar local substructures. Each block mapping represents a local, gapless alignment with locally similar structures. In the following steps, we will search for the maximal set of consistent blocks inducing similar topologies in the input protein structures.

In order to reduce the complexity of subsequent steps and to avoid shift errors in the core block alignment, we filter the core block mappings found using global topology information. Therefore, we calculate the superposition of the two proteins induced by a core block mapping and compute the maximum number of residue pairs under 6 Å. When applying this step to every residue of the query protein we can drop many incorrect core block mappings. An example for this approach is shown in Figure 3, while the effect of the filter step is shown in Figure 2b. The resulting set of possible core blocks is denoted by V={vj=(b1,b2)|biisinB(pi)=B(p1) x B(p2).

The similarity of a mapping of two core blocks sim(b1,b2)=sim(vj) is determined by a empirical function representing a tradeoff between the length of the pairs and the corresponding RMSD. The score for a pair of blocks lies between 0 and length with higher scores indicating better pairs. Examples of the score are: length = 10, RMSD = 1.2 leads to a score of 10.0, length = 10, RMSD = 2.0 to a score of 5.46 and length = 20, RMSD = 2.0 results in a score of 17.05.


Figure 3
View larger version (36K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. The effect of shift errors on global superposition. Domains d1qwva_and d1r5ra_(members of the same SCOP family) are locally superposed based on the mapping of the terminal helices. (a) Aligns the helix correctly and clearly all elements of the structures (the other five helices) are topologically close in space to their corresponding counterparts. (b) and (c) align the terminal helix with a shift error of one and two, respectively. The effect is that based on the local match the other five helices are distant in space.

 
2.1.2 Similarity of topologies implied by pairs of core block mappings
Having identified possible mappings of core blocks, we can compare pairs of such mappings using their topological relationships. In general, we represent the topology of a protein structure by the 3D geometry of the blocks. In more detail, having assigned b1,1 from protein one to b2,1 in protein two and b1,2 to b2,2, we calculate the implied structural mutation assuming that the topology of (b1,1, b1,2) changed into the topology (b2,1, b2,2). For this we compute the superposition of the two mappings (b1,1, b1,2) and (b2,1, b2,2) and calculate their RMSD in this superposition. The maximal value of the two RMSDs obtained (b1,1 on b2,1 and b1,2 on b2,2) is used as a ‘structural cost’, of a mutation i.e. a cost of the topological change implied by the joint mapping of the blocks. The maximal RMSD is used in order to avoid effects of large blocks on the mutation cost. A schematic example of the procedure is shown in Figure 2c.

Please note that, while the similarity of a core block pair is measured by an RMSD-based score, the final cost of the structural mutations observed is measured on the PPM graph (Section 2.2). The PPM graph explicitly tries to score the topological costs of a structural mutation.

The function to convert the lengths of the two mappings and their maximal RMSD into a cost function for structural mutations is empirical and represents again a tradeoff between length and allowed RMSD, i.e. a maximal RMSD of 4.3 Å across 33 residues leads to costs of –2.79, while an RMSD of 5.1 on 65 residues is less expensive (due to it's larger size) and costs –2.09.

2.2 The PPM-graph: similarity of the topologies
To measure the similarity of topologies we define the PPM graph. The set of core blocks mappings (Section 2.1.1) defines the nodes V of the graph. In the first step, the graph is fully connected and the edges are weighted by the pairwise similarity of pairs of core blocks as defined in Section 2.1.2. The schematic idea of the scoring process is shown in Figure 2d.

DEFINITION [PPM-graph, GPPM]
Given two proteins p1 and p2, we define the phenotypic plasticity graph as a labeled graph GPPM(p1, p2)=(V, E) as follows: V=B(p1) x B(p2) and E=V x V. Please note that nodes vj represent aligned block pairs, i.e. mappings of two core blocks in an alignment of proteins one and two.

To reduce the number of edges, we only use the ‘coordinating’ set of edges in our experiment, i.e. we only use edges if the distance of the core block pairs is <30 Å in both proteins.

DEFNITION [Linear Phenotypic Plasticity Measure of an alignment, LPPM(A)]
Having defined the similarity of nodes and edges, let A be an alignment of protein structures p1 and p2, A=(v1, e1,...,vn–1, en–1, vn), i.e. an alignment is simply a subset of core blocks connected by topological edges which form a spanning tree of the core blocks in the alignment. The linear LPMM(A) connects the blocks in a linear fashion from the N- to the C-terminus and is defined as: LPPM(A)={sum}Formulasim(vi)+{sum}Formula weight(ej).

The LPPM(A) score defined above always links a block to the alignment by the edge connecting it to the previous block in the sequence. Therefore, the edge weight does not take the overall topology of the protein into account. To overcome that problem, we define a variant of score as follows (see also Fig. 2d). For each aligned block node vi, the graph GPPM contains n–1 incident edges and each could be used to connect vi to the alignment.

Consider the sorted list of edges weights ei1, ei2,...,ein, with ei1 having the smallest weight. A tree T where the nodes are the n block pairs connected by n–1 edges represents a structural alignment/superposition of p1 and p2, T=(v1,..., vn; e1,..., en–1). The level allows to control the plasticity of the alignment: Level 1 connects a node vi via the edge of smallest weight ei1 to the tree, Level k uses the edge eik to score connecting node vi.

The higher the level the more edges are penalized in the alignment. Therefore, Level 1 allows for the largest plasticity of the protein structures, since adding a node (block pairs) to the alignment can always be done at the lowest possible cost. Using level n–1 enforces the largest rigidity onto the protein structures, since always using the worst edge for connecting the nodes (the pair with the largest inconsistency) increases the cost of the alignment and therefore decreases the score.

DEFNITION [Level k Tree Phenotypic Plasticity Measure of an alignment, TkPPM(A)]
Let the tree T represent an alignment of protein structures p1 and 2, T=(v1,..., vn; e1,..., en–1), where ej=ejk (the k-th best edge connecting the nodes) TkPPM(A)={sum}Formula~(vi)+{sum}Formula weight(ej)

DEFNITION Phenotypic Plasticity Measure of an alignment PPM(A)]
In our experiments the level has been set to 3 which accounts for a reasonable tradeoff between flexibility and rigidity allowed for the input structures. Therefore, we define PPM(A)=T3PPM(A).

2.3 Optimizing and Normalizing the PPM score
Having calculated all possible core blocks and all topological edges, we need to identify the consistent subgraph with maximal score in the PPM graph. This subgraph represents the structural alignment of the two input structures. We solve this problem by using the A* algorithm with an appropriate heuristic function. The A* algorithm searches for the path in the graph maximizing the similarity of the two structures. A* guarantees to find the optimal solution (if the corresponding heuristic function is ‘admissible’ which means that the function always overestimates the best possible similarity resulting from a path through the graph). Since it is possible to find such a heuristic function, the algorithm is able to find the optimal solution to our problem in reasonable time (less than a second on average). Please note that, the algorithm does not incorporate sequence gap costs at any time in the optimization procedure. However, for measuring the overall structural similarity it is useful to penalize non-aligned (na) residues and to normalize the score. In our experiment, we simply penalize every non-aligned residue with a constant value of –0.1 and the similarity of a query and target protein is normalized by the length of the query protein. The normalized PPM score is defined as:

DEFNITION [Normalized Phenotypic Plasticity Measure of two proteins]


Formula


    3 RESULTS AND DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS AND DISCUSSION
 4 CONCLUSION
 REFERENCES
 
In the following, we will evaluate the performance of PPM according to its capability to recognize structural similarities of a query protein in a template database of protein structures. Therefore, we will first show the results of PPM on the Vorolign testset (Birzele et al., 2007) and will then perform a larger, more difficult and more detailed evaluation on a comprehensive set of domains from the most recent SCOP version 1.73. PPM is evaluated against the two state-of-the-art structure alignment methods Vorolign and TM-Align. While TM-Align represents a method optimizing a rigid-body similarity criterion, Vorolign represents a method which uses a different, more flexible and more sequence-based representation of the protein structure, namely Voronoi neighborhoods, and allows to compute alignments even in cases where large flexibilities exist.

3.1 Family recognition on the Vorolign set
In order to test the ability of PPM to detect the correct protein family in a database of representative protein structures, we first use the family recognition set used by Birzele et al. (2007). This set contains 979 non-trivial test proteins from the ASTRAL compendium that are contained in version 1.67 (February 2005) but not in the previous version 1.65 (December 2003). Simulating the Vorolign test scenario, those proteins were scanned against the ASTRAL25 database (version 1.65).

The performance of all methods is measured as the fraction of proteins that can be assigned to their correct family, superfamily or fold according to the SCOP classification. Only the hit with the best score as computed by the respective method is taken into account for the evaluation.

PPM is compared against the structure alignment methods Vorolign, CE and TM-Align as well as a sequence-based method, profile–profile alignment (von Ohsen et al., 2003). The results are shown in Table 1 (the performances of further, sequence-based methods are additionally discussed in Birzele et al. (2007).


View this table:
[in this window]
[in a new window]

 
Table 1. Evaluation of the family, superfamily and fold recognition rate

 
In this test scenario, all structure-based methods clearly outperform sequence-based methods like profile–profile alignment. Further, PPM outperforms the second best method (Vorolign) by about 2% on the family and the superfamily level and performs similar to Vorolign on the fold level. The test shows that PPM is able to predict the SCOP classification for a protein structure with a very high accuracy and can improve the family prediction rate on this set by another 2% compared to Vorolign.

3.2 Large-scale evaluation
To evaluate the performance of PPM, Vorolign and TM-Align (the three top-scoring methods on the Vorolign set) on a larger and also more recent test set, we used a comprehensive set of 4933 domains from SCOP 1.73. The domains are a representative subset of all domains classified in SCOP 1.73 which are similarly defined in CATH. We require an overlap of the residues in the two-domain definitions for a protein of >80%. All domains share a sequence identity of <50%. Every domain is used as query and compared against all domains in the set which belong to the same family, superfamily, fold (positive examples) and the same class but not the same fold (negative examples). Additionally, we require a pair of domains to be classified consistently in the CATH database, i.e. pairs from the same family/superfamily in SCOP need to be classified into the same ‘homologous superfamily’ in CATH and pairs belonging to the same SCOP fold into the same CATH topology. This approach assures a high quality of the structural similarity relationships defined by the gold standards. Templates from other classes are not taken into account to limit the number of pairs to be evaluated. In total, we compute >3.6 million pairwise structure comparisons (the set can be obtained from http://www.bio.ifi.lmu.de/PPM).

When evaluating the same scenario as above, namely the family, superfamily and fold-recognition rate of the methods according to the first hit (Table 1) the methods turn out to perform very similar. Vorolign slightly outperforms PPM and TM-Align on family and superfamily level while PPM performs better than the other two methods on fold level. The somewhat different results of the two benchmarks in Tables 1 and 2 on family and superfamily as compared to fold level may be due to a smaller sequence identity allowed for pairs in the Vorolign set (<30%) compared to the larger set (50% at maximum).

In contrast to Table 2, Table 3 evaluates the performance of the methods in a different and more difficult scenario. Here, the performance is evaluated in a situation where the family of the target has been removed from the benchmark set to test a methods ability to detect similarities in the absence of clear sequence similarity. This set-up becomes for example important for folds which contain only one family and need to be extended by a new family (‘protein structure threading’).

Therefore, all hits from the query domain's own fold are regarded as correct hits while hits from the same class but from a different fold are regarded to be errors. The table shows the number of query proteins where templates from wrong folds score better than templates from the own family, superfamily or fold, respectively. The error rates obtained allow for a detailed analysis of the performance of a method in the absence of the true family or superfamily since in those cases wrong folds would be predicted. In this more detailed and also more difficult test scenario (compared to pure family recognition), PPM significantly improves on both, Vorolign and TM-Align, on the superfamily and fold level, significantly reducing the number of errors. On the family level, it performs slightly worse than Vorolign. On this level, Vorolign's more sequence based scoring function seems to be an advantage.


View this table:
[in this window]
[in a new window]

 
Table 2. Evaluation of the family, superfamily and fold recognition rate of PPM, TM-Align and Vorolign on a large set of almost 5000 SCOP domains

 


View this table:
[in this window]
[in a new window]

 
Table 3. Evaluation of the errors made by a method in the presence of the correct family (Family column), the presence of the correct superfamily, but not the correct family (Superfamily column) as well as in the absence of the correct family and superfamily but in the presence of the correct fold (Fold column)

 
3.3 Detailed analysis of examples
We further analyzed the errors made by PPM on the family level (cases where family members are not identified correctly) on both sets. The results for all errors made can be visually inspected in 3D at http://www.bio.ifi.lmu.de/PPM. The errors may be partitioned into three different classes. For some cases, the true family and the other fold found by PPM cannot be distinguished on the structure level. One such example is shown in Figure 4. The figure allows for a fast and detailed analysis of the differences of the two hits found. When analyzing the similarities (i.e. the aligned parts) and differences (i.e. the unaligned parts) of the proteins in the two superpositions one can see that both families share the same common core (though being classified in different folds) of three helices and differ in some other, mostly unstructured, parts. It will be difficult for any purely structure-based method to distinguish those two folds (if they turn out to be truly different).

Further, we find examples where PPM overestimates the cost of phenotypic plasticity in cases where an unexpectedly large plasticity can be observed. Those errors may be solved in future versions by a better understanding of the costs of structural mutations and, may be, even family or fold-specific cost functions.

The third class of errors comprises cases, where e.g. length differences of template and query lead to problems in the score normalization and in the ranking process.


Figure 4
View larger version (23K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4. Example of a PPM misclassification where a member from a different fold scores better than the protein's own family member. The query and both template proteins consist of three helices in the same orientation and topology. Additionally, the member of the same family has a longer helix which slightly changes its orientation. The hit from the other fold on the other hand has no further secondary structure elements. Finally, the flexible superposition shows that the parts aligned by PPM fit very well. See also http://www.bio.ifi.lmu.de/PPM for interactive versions of all superpositions of PPM errors.

 

Figure 5
View larger version (35K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5. An example for the alignments and superpositions computed by PPM and TM-Align of two proteins (d1ooha_, d1qwva_) from family a.39.2.1 (also shown in Fig. 1). The family has six conserved cystein bridges which should be conserved in a good alignment of the two members of the family. Due to phenotypic plasticity of the corresponding helices, TM-Align produces shifts and breaks in the alignment which results in three (out of six) non-conserved cystein bridges (right part). In contrast, all six bridges are conserved in the PPM alignment (left part) and the corresponding, locally similar helices are identified correctly (as shown in the flexible superposition based on the PPM alignment in the middle of the figure).

 
To exemplify the ability of PPM to compute high quality alignments of core regions and to deal with phenotypic plasticity, we show the example of a.39.2.1 which has already been briefly discussed in Figure 1. This family consists of six helices and is stabilized by six cystein bridges. The family shows a high level of phenotypic plasticity which makes it hard for a rigid body method to align the helices correctly, i.e. without shifts and without adding gaps to the helices. Also, TM-Align aligns a number of single residues which are not related but are close to each other in space in a rigid body superposition. In contrast, such errors are avoided by PPM and the core regions (the six helices) are correctly identified and aligned (w.r.t. to the cystein bridges) by our PPM method as shown in Figure 5.


    4 CONCLUSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS AND DISCUSSION
 4 CONCLUSION
 REFERENCES
 
We presented a novel idea and a new method to address the protein structure alignment and protein structure classification problem in the presence of phenotypic plasticity. In contrast to existing approaches, PPM explicitely tries to model the evolutionary cost of the mutations, insertions and deletions that occurred on the structure level during the transformation of one structure to another as implied by the sequence changes. Thereby, it accounts for the natural variance observed in protein structure families, a phenomenon we call phenotypic plasticity. This plasticity can lead to problems and artifacts in the alignment process (e.g. shifting helices or random matches in space as shown in Fig. 5) and those are avoided by PPM since it only aligns highly similar local substructures in a first step and then optimizes the alignment of those core blocks using the A* algorithm. The process takes the overall topology of the structures into account and avoids overestimating the structural similarity due to a too large level of flexibility.

The performance of PPM has been evaluated on two comprehensive benchmark sets. On the Vorolign set, PPM outperforms all other methods in recognizing the correct family and superfamily of a query protein. On the more recent benchmark set of almost 5000 SCOP domains and the corresponding 3.6 million structure pairs PPM performs comparable to Vorolign and TM-Align in the family/superfamily and fold-recognition scenario but significantly outperforms the other methods in a different test set-up where the ability to recognize at least a member from the correct fold in the absence of family or superfamily members is evaluated.

Future development of PPM will be into different directions. First, the method is completely modular according to which scoring functions are used to score the similarity of core blocks as well as their topologies. Our final goal of finding an evolutionary measure for structural mutations similar to the idea behind accepted mutations in matrices like PAM (Dayhoff et al., 1978) is not reached yet. So far, the scoring functions are purely structure based and due to our current lack of knowledge on the true costs of mutations and insertions/deletions on the structure level can only be optimized and parameterized empirically. It is an on-going and iterative process to further refine those parameters to capture the real costs of mutations and insertions/deletions on the structure level, may be even in a family or fold-specific manner.

Second, we would like to extend the currently purely structure-based scoring functions used in the method to a combination of sequence and structure scores to account for both criteria and further improve on the quality of the protein alignments. High-quality alignments with respect to sequence and structure form the basis for a detailed analysis of the interplay of protein sequence and structure evolution.

Third, conserved pairwise core blocks identified by PPM will further be used to identify conserved cores of protein structure families with interesting applications in protein structure prediction and sequence–structure alignment (e.g. threading). For this task, PPM will be extended in order to compute multiple structure alignments or to combine sets of pairwise alignments to a multiple alignment.

Finally, the model of structure evolution underlying PPM may be an interesting starting point toward a novel definition of protein structure similarity and protein structure flexibility within and between groups of structures, a feature that will also help to detect similarities across different folds which may be explored in nature for example by alternative splicing (Birzele et al., 2008) or during structure evolution.


    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS AND DISCUSSION
 4 CONCLUSION
 REFERENCES
 

    Andreeva A, et al. Data growth and its impact on the scop database: new developments. Nucleic Acids Res (2007) 36(Database issue):D419–D425.[CrossRef][Web of Science][Medline]

    Berman H, et al. The protein data bank. Nucleic Acids Res (2000) 28:235–242.[Abstract/Free Full Text]

    Birzele F, et al. Vorolign–fast structural alignment using Voronoi contacts. Bioinformatics (2007) 23:e205–e211.[Abstract/Free Full Text]

    Birzele F, et al. Alternative splicing and protein structure evolution. Nucleic Acids Res (2008) 36:550–558.[Abstract/Free Full Text]

    Dayhoff MO, et al. Amodel of evolutionary change in proteins. Atlas of Protein Sequence and Structure (1978) 5:345–352.

    Grabowski M, et al. Structural genomics: keeping up with expanding knowledge of the protein universe. Curr. Opin. Struct. Biol (2007) 17:347–353.[CrossRef][Web of Science][Medline]

    Greene LH, et al. The cath domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution. Nucleic Acids Res (2007) 35(Database issue):D291–D297.[Abstract/Free Full Text]

    Hadley C, Jones D. A systematic comparison of protein structure classifications: SCOP, CATH and FSSP. Structure (1999) 7:1099–1112.[Medline]

    Harrison A, et al. Recognizing the fold of a protein structure. Bioinformatics (2003) 19:1748–1759.[Abstract/Free Full Text]

    Holm L, Sander C. Protein structure comparison by alignment of distance matrices. J. Mol. Biol (1993) 233:123–138.[CrossRef][Web of Science][Medline]

    Shindyalov IN, Bourne PE. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng (1998) 11:739–747.[Abstract/Free Full Text]

    von Ohsen N, et al. Profile-profile alignment: a powerful tool for protein structure prediction. Pac. Symp. Biocomput (2003) 252–263.

    Ye Y, Godzik A. Flexible structure alignment by chaining aligned fragment pairs allowing twists. Bioinformatics (2003) 19(Suppl. 2):II246–II255.[Medline]

    Zhang Y, Skolnick J. Scoring function for automated assessment of protein structure template quality. Proteins (2004) 57:702–710.[CrossRef][Web of Science][Medline]

    Zhang Y, Skolnick J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res (2005) 33:2302–2309.[Abstract/Free Full Text]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
A. Poleksic
Algorithms for optimal protein structure alignment
Bioinformatics, November 1, 2009; 25(21): 2751 - 2756.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Csaba, G.
Right arrow Articles by Zimmer, R.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Csaba, G.
Right arrow Articles by Zimmer, R.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?