Bioinformatics Advance Access originally published online on August 12, 2004
Bioinformatics 2005 21(1):51-62; doi:10.1093/bioinformatics/bth467
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Bioinformatics vol. 21 issue 1 © Oxford University Press 2005; all rights reserved.
Protein structure alignment by deterministic annealing
1 Department of Electrical Engineering and Electronics, Osaka Sangyo University Osaka 574-8530, Japan
2 School of Mathematics and Computational Mathematics, Zhongshan University Guangzhou 510275, China
3 Department of Mathematical Sciences, Tsinghua University Beijing 100084, China
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Protein structure alignment is one of the most important computational problems in molecular biology and plays a key role in protein structure prediction, fold family classification, motif finding, phylogenetic tree reconstruction and so on. From the viewpoint of computational complexity, a pairwise structure alignment is also a NP-hard problem, in contrast to the polynomial time algorithm for a pairwise sequence alignment.
Results: We propose a method for solving the structure alignment problem in an accurate manner at the amino acid level, based on a mean field annealing technique. We define the structure alignment as a mixed integer-programming (MIP) problem. By avoiding complicated combinatorial computation and exploiting the special structure of the continuous partial problem, we transform the MIP into a reduced non-linear continuous optimization problem (NCOP) with a much simpler form. To optimize the reduced NCOP, a mean field annealing procedure is adopted with a modified Potts model, whose solution is generally identical to that of the MIP. There is no soft constraint in our mean field model and all constraints are automatically satisfied throughout the annealing process, thereby not only making the optimization more efficient but also eliminating many unnecessary parameters that depend on problems and usually require careful tuning. A number of benchmark examples are tested by the proposed method with comparisons to several existing approaches.
Contact: chen{at}elec.osaka-sandai.ac.jp
| 1 INTRODUCTION |
|---|
|
|
|---|
Aligning protein structures is one of the most important computational problems in molecular biology and plays a key role in protein structure prediction, fold family classification, motif finding, phylogenetic tree reconstruction and so on. In contrast to a pairwise sequence alignment that simply compares two strings of characters, a pairwise structure alignment is a more difficult problem that optimally superimposes two structures and further finds the regions of the closest overlap in the three-dimensional (3D) space. From the viewpoint of the computational complexity, even a pairwise structure alignment is a NP-hard problem due to matching implementation of rigid bodies, compared with the polynomial time algorithm Needleman and Wunsch, 1971 of a pairwise sequence alignment by dynamical programming (DP).
Many algorithms have so far been developed for the structure alignment problem Mount, 2001, Orengo and Taylor, 1996, Szustakowski and Weng, 2000 mainly founded on either distance-based methods Gerstein and Levitt, 1996 or vector-based methods Holm and Sander, 1993, Shindyalov and Bourne, 1998, such as iterative dynamical programming Gerstein and Levitt, 1996, Monte Carlo simulation Bryant and Altschul, 1995, mean field approximation Ohlsson et al., 2000, Blankenbecler et al., 2003 and simulated annealing Holm and Sander, 1993. In particular, based on the Potts model, a mean field approximation method (Lund) was recently proposed by Blankenbecler et al., 2003, which is not only very general but also significantly improves computational efficiency as well as quality of the structure alignment. Lund Blankenbecler et al., 2003 is a distance-based iterative method, which includes two major iterative steps with an annealing process, i.e. it first estimates the fuzzy assignment variables with the representation of the Potts model based on the distances between two proteins to be aligned, and then performs exact rotation and translation of protein coordinates weighted with the corresponding fuzzy assignment. Although an efficient computation requires the tuning of parameters and some heuristic settings are introduced, there are many appealing properties of Lund, e.g. the results have a probabilistic interpretation even without tedious stochastic simulation due to the Potts model, and side-chains representation and almost arbitrary constraints can be easily handled in the algorithm.
In this paper, we propose a method for solving the structure alignment problem based on a mean field annealing technique [Peterson and Anderson (1987) and Peterson and Soderberg (1989)], by mainly referring to the approaches of Ohlsson et al., 2000 and Blankenbecler et al., 2003. We define the structure alignment as a mixed integer-programming (MIP) problem Nemhauser and Wolsey, 1988 with the inter-atomic distances between two structures as an objective function Ohlsson et al., 2000. The integer variables represent the matching among structures whereas the continuous variables contain the translation vector and rotation matrix with each protein structure as a rigid body. By exploiting the special structure of the continuous partial problem and avoiding the complicated combinatorial computation, we transform the MIP into a reduced non-linear continuous optimization problem (NCOP) Bazaraa et al., 1993, Luenberger, 2003 with a much simpler form, based on mean field equations. To optimize the reduced NCOP, a mean field annealing procedure that is equivalent to the NCOP, is adopted with a modified Potts model or double-Potts model (D-Potts) Urahama, 1994, Ishii and Sato, 2002. Since all constraints are embedded in the mean field equations, we do not need to add any penalty terms of the constraints to the error function. In other words, there is no soft constraint Ishii and Sato, 2002 in the mean field model and all constraints are automatically satisfied during the annealing process, thereby not only making the optimization more efficient but also eliminating unnecessary parameters that depend on the problems and usually require careful tuning. Besides exact mathematical implementations of the alignment problem, there are several additional advantages to our approach, e.g. no heuristic manipulation is required and the solution of the algorithm is ensured eventually to converge with that of the MIP. Moreover, the proposed algorithm has the ability to identify the circular permutations, which is different from that of Blankenbecler et al., 2003 where no permutation is allowed because the assignment matrix is determined by fuzzy alignment paths.
To demonstrate the proposed method, we use the same benchmark examples as those of Shindyalov and Bourne, 1998 and Blankenbecler et al., 2003 from the Protein Data Bank for numerical simulation with comparisons to several existing approaches.
| 2 FORMULATION OF STRUCTURE ALIGNMENT |
|---|
|
|
|---|
In this section, we formulate the pairwise structure alignment problem as a MIP in an exact manner, adopting the similar notation to that of Ohlsson et al., 2000 and Blankenbecler et al., 2003.
Let n x and n y be the atom numbers of two proteins X and Y to be structurally aligned, and Xi =(x i,1, x i,2, x i,3), Yj =(y j,1, y j,2, y j,3)
3 (i=1, ..., nx ; j=1, ..., ny ) be the atom coordinates of two protein chains, which correspond to the C
atoms along the backbones in this paper although C ß atoms can be considered in the same way. A square distance metric between the chain atoms is adopted, i.e.
is the square distance between the atom i in X and the atom j in Y. Each protein chain is viewed as a rigid geometry body. The coordinate transformation of a rigid body is generally expressed by a translation vector A
3 and a rotation matrix R
3x3, i.e.
for the atom i of the chain X, where there are three independent variables for the translation vector and the rotation matrix, respectively. For a pairwise structure alignment, we fix the coordinates of the second protein chain Y, which is assumed to be longer than the first chain X. Therefore, after coordinate transformation, a square distance between the atom i in X and the atom j in Y is
![]() |
We define a matching matrix S with binary elements s ij to describe matching of two atoms for i = 1, ..., n x ; j = 1, ..., n y :
![]() |
We also consider gaps in the protein chains for i = 1, ..., n x by
![]() |
In the same way, s 0j is defined for j = 1, ..., n y
![]() |
Clearly, gap positions are not expressed by individual elements in s ij but correspond to common sinks. Therefore, S is an (n x + 1) x (n y + 1) matrix with only binary elements. Note that s 00 is not defined.
Since each atom in a chain must match one atom in the other (including gap), the following conditions are satisfied.
![]() |
![]() |
Figure 1 is a simple example that shows the matches and gaps for S. We adopt gap penalties by an affine expression for consecutive gaps. Then, we have an error function for the pairwise structure alignment problem Ohlsson et al., 2000,
|
![]() |
where the first term E c is total square distances between two protein chains defined as
![]() |
and
are position-dependent penalties for opening a gap while
is a position-independent penalty for gap extension proportional to the gap length. The second and third terms, and the fourth and fifth terms in Equation (7) are gap penalties corresponding to the first and the second chains, respectively.
Therefore, a pairwise structure alignment is mapped as a non-linear MIP Nemhauser and Wolsey, 1988 with an objective function Equation (7) and constraints Equations (5) and (6) for binary variables s ij , [i = 0, 1, ..., n x ; j = 0, 1, ..., n y ; (n x + 1) (n y + 1) 1 discrete variables] and continuous real variables (A, R: six continuous variables). Theoretically, an optimal structure alignment can be obtained by solving the MIP. However, the non-linear MIP is a very complicated combinatorial problem that belongs to the NP-hard class due to discrete variables and the non-linear objective function. In this paper, rather than directly solving the MIP, we transform it into an equivalent continuous problem by mean field equations, which can be treated in a much easier way.
| 3 MEAN FIELD APPROXIMATION WITH D-POTTS MODEL |
|---|
|
|
|---|
In this section, we first transform the MIP into a NCOP to avoid the complicated combinatorial computation. Then, by exploiting an analytical expression of the coordinate transformation, a simplified NCOP as well as its first-order conditions of optimality are derived. Finally, we obtain mean field equations with the D-Potts expression, which have much simpler forms, but ensure that all conditions are satisfied.
3.1 Non-linear continuous optimization problem
We adopt mean field annealing Urahama, 1994, Ishii and Sato, 2002 to solve the MIP problem by approximating the binary variable s ij
0, 1 with the real variable v ij
[0, 1].
Define a free energy function
![]() |
where T
is a temperature for the annealing process, and H(V) corresponds to the entropy function for matching between the chains X and Y. Clearly, if T
0, then F(V, A, R)
E(V, A, R), which implies that the free energy is eventually equal to the error function at the end of the annealing.
Then the MIP is changed to the following NCOP
![]() |
![]() |
where the variables are V, A and R. The necessary conditions of optimality with respect to translation A and rotation R for NCOP are
![]() |
where
F/
A =
E/
A =
E c /
A and
F/
R =
E/
R =
E c /
R due to the relations among functions F, E and E c . Note that there are only three independent variables in R, and
E c /
R are the partial derivatives for these three variables.
3.2 Reduced non-linear continuous optimization problem
Clearly Equation (12) is equivalent to such an optimization problem: minimize E c (V, A,R) with respect to variables A and R for a given V, which can actually be solved analytically Ohlsson et al., 2000, Arun et al., 1987 by differentiable functions A = g(V) and R = h(V). Such analytical expressions are
![]() |
where B = 2
X
Y T
2
X Y T
, and
f(X, Y)
is defined as an averaging operation of f, i.e.
with
and p ij = v ij /v tot . Note that (BB T )1/2 is the square root of a positive-definite matrix Higham, 1986. Numerically, R and A can also be obtained by singular value decomposition (SVD) as shown in Appendix A.
Therefore, the NCOP Equations (9)(11) can be further reduced to a simplified form by formally eliminating A and R from Equation (13), i.e.
![]() |
where the variables are V. The conditions of optimality for the KarushKuhnTucker (KKT) Bazaraa et al., 1993, Luenberger, 2003 are
![]() |
where
j and ß i are Lagrange multipliers corresponding to Equations (10) and (11), respectively,
, and
0 = ß0 = 0, v 00 = 1/e. Theoretically, an optimal solution for the structure alignment or the MIP satisfies the KKT conditions provided that each v ij approaches a binary value or s ij , which can actually be achieved by mean field equations with an annealing process.
3.3 Mean field equations
Next, we derive the mean field equations with the D-Potts Urahama, 1994, Ishii and Sato, 2002 form from the KKT conditions, Equations (16) and (17). Define
![]() |
for i = 0, ..., n x ; j = 0, ..., n y , where u 00 = 0. From Equations (11) and (16), we obtain
![]() |
for i = 0, ..., n x ; j = 0, ..., n y , where
. Note w 0 = 1 due to
0 = 0, and v 00 = 1/e due to Equation (19).
Substituting Equation (19) into Equation (10), then for j = 1, ..., n y
![]() |
Therefore, we obtain the mean field Equations (19) and (20) with continuous variables V and W, which are equivalent to the KKT conditions. All constraints are automatically satisfied provided that Equations (19) and (20) hold. If T
0, then
, and each v ij generally approaches a binary value due to Equation (19), which implies that the solution of Equations (19) and (20) with a sufficient small T is also the solution of the original MIP or the structure alignment problem. Strictly speaking, the solution v ij of Equations (19) and (20) is a real number, and identical to s ij only if v ij is rounded off to either 1 or 0.
If all W j s are set to be constant, Equation (19) is a typical Potts model, which however only ensures the constraints Equation (11). Usually, the constraints Equation (10) are taken as soft constraints, which are added into E of Equation (8) as penalty terms. Such a manipulation not only introduces the heuristic penalty parameters that require careful tuning, but may also impede efficient convergence of the computation due to the introduction of additional local minima. In contrast, in our D-Potts model, all constraints are taken as hard constraints and embedded into the mean field equations, which means that all constraints are automatically satisfied even during the annealing process, thereby not only making the optimization more efficient but also eliminating unnecessary parameters related to penalty terms. In addition, formally the translation and rotation are not explicitly involved in the mean field equations.
Note that although the proposed approach has a similar form as that of Ohlsson et al., 2000 using mean field equations, there are two major differences. The first one is that our approach adopts the D-Potts model instead of the conventional Potts model. The second one is that our approach handles the optimization in a more exact manner without heuristic decomposition, which approximately divides the alignment problem into a mean field subproblem for V and a coordinate transformation subproblem for A and R, as in Ohlsson et al., 2000. As indicated in Equation (13) and Equations (14) and (15), we take the translation vector A and rotation matrix R as the functions of the matching variables V, and solve the alignment problem wholly by V in a more accurate manner.
The original non-linear MIP is reduced to the algebraic Equations (19) and (20) with continuous variables V and W. To solve Equations (19) and (20) for V and W, we adopt an iterative computation strategy with an annealing process, i.e. solve W from Equation (20) and V from Equation (19) iteratively with a slow cooling schedule, as described in detail in the next section.
| 4 DETERMINISTIC ANNEALING ALGORITHM |
|---|
|
|
|---|
The iterative algorithm includes two main steps:
- Obtain W from Equation (20) with a fixed V.
- Obtain V from Equation (19) with a fixed W.
The algebraic Equations (19) and (20) can be solved by either updating or converging strategy. The detailed algorithm is stated straightforwardly as
- Initialization
- Set initial values, i.e. T(0);
(0 <
< 1: a cooling coefficient for the annealing);
;
; convergence criteria (
w ,
v ,
E ) for (W, V, E), respectively, and all initial values of variables v ij . Let the iteration index t = 1.
- Normalize the input data, i.e. move the protein chains to their common center of mass and rescale coordinates such that the largest distance between atoms within the chains is unity Blankenbecler et al., 2003. Fix the coordinate y of the longer chain.
- Set initial values, i.e. T(0);
- Deterministic annealing
- Step 0: Update u ij (t) at iteration t by iterative equations for i = 0, ..., n x , j = 0, ..., n y where u 00(t) = 0, i.e.
where the derivatives can be numerically replaced by finite differences Ohlsson et al., 2000, i.e.
to improve the computation efficiency. Note that T, u, v and w are functions of the iteration t.
- Step 1: Solve w j either by the Newton method or by iterative updating from Equation (20) until converged, e.g. iteratively calculate w j (t) for j = 1, ..., n y with the given V(t1)
until
where w 0(t) = 1.
- Step 2: Solve v ij (t) from Equation (19) with the given W(t), where v 00(t) = 1/e, i.e. for i = 0, ..., n x ; j = 0, ..., n y
until

- Step 0: Update u ij (t) at iteration t by iterative equations for i = 0, ..., n x , j = 0, ..., n y where u 00(t) = 0, i.e.
- Convergence check
- If |E(t) E (old)(t)|
E is satisfied, terminate the computation. Otherwise, reduce the temperature T by T(t) =
T(t 1), let t
t + 1 and then go to Step 1. Note that the new coordinate A + RX i of chain X is also updated to each iteration due to implicit expression of h and g for v in Equation (13).
- If |E(t) E (old)(t)|
- Output
- Compute the new coordinate of X i , i.e.
from Equation (13) for all i.
- Round off v ij to either 1 or 0, and then let s ij = v ij for all i and j.
- Compute the number of the aligned residues m, and root-mean-square distance (rms) for the two protein chains
and Y, i.e


- Compute the new coordinate of X i , i.e.
When converged with a sufficient low T, v ij generally approaches either 0 or 1 due to Equations (18) and (19), which are then taken as s ij after the rounding off operation. The initial v ij (0) is randomly chosen between 1 and 0. Since all constraints are embedded in the mean field equations, there is no soft constraint in our mean field model and all constraints are automatically satisfied during the annealing process, which implies that the proposed approach is efficient and can be used for an exact structure alignment.
In addition to pushing v ij to a binary value, the annealing process has a significant effect to prevent the system from being trapped at local minima Chen and Aihara, 1995, Chen and Aihara, 1999, thereby enabling the optimization to converge to a high-quality solution. Moreover, unlike the assignment matrix in Blankenbecler et al., 2003, the matching matrix of the proposed algorithm can be used to detect permutations, as shown in the numerical simulation.
| 5 SIMULATION |
|---|
|
|
|---|
5.1 Pairwise structure alignment
We use the same benchmark examples as those of Shindyalov and Bourne, 1998, Blankenbecler et al., 2003 from Protein Data Bank (http://www.rcsb.org/pdb/) for numerical simulation by comparing with the several existing methods, i.e. Dali Holm and Sander, 1993 (http://www.ebi.ac.uk/dali), CE Shindyalov and Bourne, 1998 (http://cl.sdsc.edu/ce/ce_align.html) and Lund Blankenbecler et al., 2003. There is no post-processing in our simulation, and C
representation is adopted for each protein chain. To simplify the model, we set
and
=
/2 for all i and j in the numerical simulation. Initial temperature and cooling coefficient are set to be T(0) = 8 and
= 0.8 and the convergence criteria are
w = 108,
v = 106 and
E = 0.1 for all examples. Therefore,
is the only parameter in the simulation. The simulation results are shown in Table 1, where dihydrofolate reductases and globins are considered easy for alignment while the other 10 protein pairs are thought to be very difficult to align Shindyalov and Bourne, 1998. From Table 1, there are no significant differences for the easy alignment cases, i.e. dihydrofolate reductases and globins, for all algorithms. However, for the 10 most difficult protein pairs Shindyalov and Bourne, 1998, our algorithm performs effectively and typically produces alignments with lower rms distances or longer chains. In particular, for a pair of proteins 1CRL and 1EDE, our method finds a better alignment than other algorithms. Numerical simulation for each example typically requires 4050 iterations due to the fixed cooling schedule of the annealing process.
|
Figure 2 is an example of the alignment process for a pair of proteins 1DHF a and 8DFR with translation, rotation and matching of atoms. At the start of the procedure, as shown in Figure 2a and b, the protein 1DHF a rotates and translates by approaching to 8DFR relatively fast with the large increment of the rotation angle due to a higher temperature [e.g. T(0) = 8 and T(2) = 5.12]. However, with the slow annealing, such rotation and translation changes are rapidly reduced, and become insignificant at the low temperature [e.g. T(30) = 9.904 x 103 and T(42) = 6.806 x 104], as indicated in Figure 2c and d. Figure 2d is the converged result. Note that the coordinate of 8DFR is fixed due to its longer amino sequence according to the algorithm. Generally, the movement of alignments for other protein pairs has a similar pattern to this example.
|
Starting from a sufficiently high temperature, the converged solution generally is not sensitive to initial conditions. However, to reduce the computation time, the initial temperature T(0) is usually chosen to be as low as possible, which means that the initial conditions may be an important factor for the quality of the solution. The results from different random initial conditions show that the convergence of the algorithm with T(0) = 8 actually depends on the initial conditions, in particular for the cases of the 10 difficult structures, although empirically it is not sensitive to the cases of reductases and globins. Figure 3 shows an example of 1DHF a and 8DFR alignment for a solution with initial conditions different from Figure 2. Clearly although the number of the aligned residues are m = 182 that is the same as that of Figure 2, the quality of the solution is not good due to rms = 0.91 larger than 0.7.
|
5.2 Trade-off relation and convergence properties
Generally, there is a trade-off relation between the rms distance and the number of the aligned atoms m, which depends on the penalty parameters
and
in the algorithm. In other words, a shorter aligned chain (m) has a closer matching (rms), which is quantitatively controlled by
due to linear summation of E c and the gap penalty terms in the error function E. Table 2 shows such a trade-off relation by changing the penalty parameter
for the alignment of two protein chain pairs 1CRL1EDE and 2SIM1NSB, which also demonstrates the robustness of the algorithm. It is easy to see from Table 2 that generally, the larger the gap penalty
, the more the number of the aligned atoms m. The results in Table 2 can also be viewed as a Pareto set for optimization of multi-objective functions rms and m. Therefore, depending on the requirement for the criterion of rms, the proposed algorithm can produce the longest aligned atoms, by automatically changing
.
|
For the computational convergence, the increment of the error function E for each iteration is generally large with a high temperature, but such a change becomes insignificant with a low temperature for the mean field annealing. In other words, the convergence process of the annealing is inefficient for small temperatures, particularly due to fluctuations of W. In this paper, instead of V, we mainly evaluate the increment of the error function to check overall convergence as stated in Convergence Check of the algorithm. Even if the continuous variables still change during the iteration for low temperatures, the error function keeps almost constant. Such a heuristic criterion possibly sacrifices optimality but reduces computational iterations. Figure 4 is an example for the convergent process of a pair of proteins 1DFH a and 8DFR, where m, rms, E, t and T represent the number of the aligned residues, the root-mean-square distance, the error function, the iteration number and temperature, respectively.
|
5.3. Comparisons with manual alignments
To test the quality of the computational alignments, we use several examples to be compared with the manual alignments. For a pair of proteins 1BZG (34 residues) and 1ZWG (35 residues), as shown in Figure 5 of the alignment, the matching number detected by our algorithm is 24 with rms = 0.9, which is identical with the manually aligned result of HOMSTRAD (Mizuguchi et al., 1998, urihttp://www.cryst.bioc.cam.ac.uk/homstrad/). On the other hand, with the HOMSTRAD result as a reference, the number of the correct matching atoms by CE is zero with rms = 3.07 although the number of the aligned atoms is 25 by CE for this example.
|
For a pair of proteins 1HYO a (416 residues) and 1GTT a (429 residues), the number of the aligned atoms for our algorithm is 232 with rms = 1.83, among which 187 residues are correctly matched. The matching number of the manually aligned proteins is 198. In contrast, the number of the aligned atoms is 301 with rms = 1.94 for CE, where the number of the correctly matched atoms is 176.
5.4. Circular permutation
A circular permutation is assumed to be generated by shifting the C-terminal portion of a protein into the N-terminal portion of the sequence during the course of molecular evolution. Since the pairwise structure alignment in our approach is to minimize directly the distance of two structures in terms of atomic 3D coordinates, circular permutations can be detected. A pair of proteins 1AXJ (122 residues, a FMN-binding protein) and 2PIA (321 residues, phthalate dioxygenase reductase) are taken as an example to test the ability to detect the permutations. Figure 6 is the computational result of the proposed algorithm, where (a) and (b) are the structures of 1AXJ and 2PIA, respectively; (c) shows only the aligned atoms between 1AXJ and 2PIA; and (d) is the alignments illustrated by the primary structures, where the protein 1AXJ is aligned to the oxidoreductase FAD-binding domain (1117) of the protein 2PIA. In Figure 6a and b, the identically colored portions indicate the matched circular permutation positions, and the thin lines with black color indicate unaligned posi-tions. The N-terminal portion of 2PIA and the C-terminal portion of 1AXJ are colored in red (24 residues), whereas the C-terminal portion of 2PIA and the N-terminal portion of 1AXJ are colored in blue. Figure 6c shows only the matched alignments of the two proteins detected by our algorithm after the unmatched residues are cut off, where the thick line with red color indicates the permutation region for the N-terminal portion of 2PIA and the C-terminal portion of 1AXJ. The total number of the aligned residues is 95 with rms = 1.2, in contrast to 79 by the local structure alignment based on double dynamic programming (DPP) Hiroike and Toh, 2001. The schematic diagram of such a circular permutation by DPP is also shown in Figure 3 of Hiroike and Toh, 2001. The coordinates of the two proteins are both in the C
representation of the backbone. As indicated in Figure 6d, our algorithm successfully identifies the circularly permuted version of the 2PIA FAD-binding domain, by aligning the portion (530) at the beginning of 2PIA with the portion (80113) at the end of 1AXJ with 24 matches.
|
One possible mechanism for such a permutation may be due to a tandem gene duplication process Hiroike and Toh, 2001, Uliel et al., 1999. Specifically, the two genes are fused after the duplication. Then both the segments encoding the N-terminal portion of the first copy and the C-terminal portion of the second copy are deleted. As a result, a gene encoding the permuted protein is generated. By using structural and functional assays, it was found that the permuted protein basically has the same structure as that of the original, and essentially retains the biological function. The permutation between 1AXJ and 2PIA FAD-binding domain is a possible example. Based only on such a mechanism, there may be multiple permutations in a protein during the evolution, but all permutations have to be in a circularly permuted order. However, due to the complicated genetic events, e.g. insertions, deletions, mutations and substitutions, both genes and proteins may further diverge from each other depending on their evolution processes.
In addition, we also tested several other protein pairs that have potential permutations or have been detected by existing methods, such as, 2PIA/1AXJ; 1LEA/1XGN B 4; 1XGS A /1LEA; 2ADN A /1BOO; ONR/FBA; HGS/GSA; GBG/AJK; and AVD/SWG. The numerical simulation shows that our approach generally performed well. However, it has been found that the quality of the solution for the cases with permutations depends sensitively on the penalty parameter
in contrast to other cases, e.g. those cases in Table 1. In other words,
is not so robust to different protein pairs and generally required carefully tuning.
Moreover, although theoretically our algorithm can also detect multiple permutations, the more the number of permutations, the more difficult it generally is to detect all of the correct permuted regions due to high divergence not only at the sequence level but also at the structure level for such cases. The algorithm was implemented in Fortran language. Although the algorithm aims to provide accurate alignment, the simulation for each structure alignment on average requires only a few seconds on the Pentium-4 2.8 GHz computer, which is considered relatively fast.
| 6 CONCLUSION |
|---|
|
|
|---|
We developed a method for solving the protein structure alignment problem in a more accurate and mathematical manner at the amino acid level. The proposed model is quite general and treats the structure alignment in a more accurate way with implicit complete exploration of the entire space. The original protein alignment problem is formulated as a MIP and further transformed into a reduced NCOP with only continuous variables. Owing to the adoption of the D-Potts, the equivalent mean field equations to the reduced NCOP automatically satisfy all constraints or optimality conditions. In other words, there is no soft constraint, thereby not only making the optimization more efficient but also eliminating unnecessary parameters of penalty that usually require careful tuning. With an annealing process, the obtained solution is generally identical to that of the original MIP. By the matching matrix, the algorithm has the ability to identify permutations. We have tested our approach to several benchmark alignment problems, which verified the efficiency and effectiveness of the algorithm. In future, we will further modify the framework of the algorithm to apply to multiple structure alignments, and also make the software available. Besides the backbone of a protein chain, it is recognized that information on side-chains is also important for structure alignment, which will be included in our algorithm. In addition, there is an instability problem with respect to penalty parameters and cooling scheduling
for our algorithm, which needs to be improved further.
| A APPENDIX |
|---|
|
|
|---|
A.1 Weighted least square problem of two 3D rigid chains by SVD
We numerically solve the following weighted least square problem for two 3D rigid chains
and
by modifying the results of Ohlsson et al., 2000 and Arun et al., 1987, i.e.
![]() |
where the variables are the translation vector A
3 and rotation matrix R
3x3. Note that v ij
is fixed.
Let
and p ij = v ij /v tot .
f(X, Y)
be defined as the average operation of f, i.e.
. Then
![]() |
Hence, from
E c /
A = 0, we have
![]() |
Substituting Equation (A.4) into Equation (A.3),
![]() |
where H =
X
Y T
+
XY T
.
From Equation (A.6), clearly minimizing E c with respect to R is equivalent to maximizing [RH] with respect to R. Performing SVD with H results in
![]() |
where
= (
1,
2,
3) with
1
2
3, and the 3 x 3 orthonormal matrix V = (v 1, v 2, v 3). Notice that V is defined in the appendix as the right vector. According to the Lemma in Arun et al., 1987, if det(VU T ) = 1, the optimal solution to maximize [RH] is:
![]() |
On the other hand, if det(VU T ) = 1 and
3 = 0, the optimal solution is:
![]() |
where V ' = (v 1, v 2, v 3). However, if det (VU T ) = 1 and
3
0, then the algorithm fails and other techniques, such as RANSAC-like technique may be required Arun et al., 1987. Note that det(VU T ) = 1 is a degenerate case, which generally rarely occurs.
Therefore, first performing SVD for H to get V and U T as indicated in Equation (A.7), then we have the optimal solution according to Equations (A.8) or (A.9) for R and Equation (A.4) for A.
Received on April 13, 2004; revised on June 2, 2004; accepted on August 3, 2004
| REFERENCES |
|---|
|
|
|---|
Arun, K.S., Huang, T.S., Blostein, S.D. (1987) Least-squares fitting of two 3-D point sets. IEEE Trans. Pattern Anal. Machine Intell., PAMI-9, 698701.
Bazaraa, M.S., Sherali, H.D., Shetty, C.M. Nonlinear Programming: Theory and Algorithms, (1993) 2nd edn , New York Wiley.
Blankenbecler, R., Ohlsson, M., Peterson, C., Ringner, M. (2003) Matching protein structures with fuzzy alignments. Proc. Natl Acad. Sci. USA, 100, , pp. 1193611940
Bryant, S.H. and Altschul, S.F. (1995) Statistics of sequencestructure threading. Curr. Opin. Struct. Biol., 5, 236244[CrossRef][ISI][Medline].
Chen, L. and Aihara, K. (1995) Chaotic simulated annealing by a neural network model with transient chaos. Neural Networks, 8, 915930.
Chen, L. and Aihara, K. (1999) Global searching ability of chaotic neural networks. IEEE Trans. Circuits Syst., 46, 974993.
Gerstein, M. and Levitt, M. (1996) Using iterative dynamic programming to obtain accurate pairwise and multiple alignments of protein structures. Proceedings of the Fourth International Conference on Intelligent Systems in Molecular Biology (ISMB), June 1215, , IL St Louis, pp. 5967.
Higham, N.J. (1986) Newton's method for the matrix square root. Math. Comput., 46, 537549.
Hiroike, T. and Toh, H. (2001) A local structural alignment method that accommodates with circular permutation. Chem-Bio Inform. J., 1, 103114.
Holm, L. and Sander, C. (1993) Protein structure comparison by alignment of distance matrices. J. Mol. Biol., 233, 123138[CrossRef][ISI][Medline].
Ishii, S. and Sato, M. (2002) Doubly constrained network for combinatorial optimization. Neurocomputing, 43, 239257[CrossRef].
Luenberger, D.G. Linear and Nonlinear Programming, (2003) 2nd edn Kluwer Academic Publishers.
Mizuguchi, K., Deane, C.M., Blundell, T.L., Overington, J.P. (1998) HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci., 7, , pp. 24692471[Abstract].
Mount, DW. Bioinformatics, (2001) , Cold Spring Harbor, New York Cold Spring Harbor Laboratory Press.
Needleman, S.B. and Wunsch, C.D. (1971) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48, , pp. 443453.
Nemhauser, G.L. and Wolsey, L.A. Integer and Combinatorial Optimization, (1988) , New York Wiley.
Ohlsson, M., Peterson, C., Ringner, M., Blankenbecler, R. (2000) A Novel Approach to Structure Alignment. Protein Sci., 7, , pp. 445456.
Orengo, C.A. and Taylor, W.R. (1996) SSAP: sequential structure alignment program for protein structure comparison. Methods Enzymol., 266, 617635[ISI][Medline].
Peterson, C. and Anderson, J.R. (1987) A mean field theory learning algorithm for neural networks. Complex Syst., 1, 9951019.
Peterson, C. and Soderberg, B. (1989) A new method for mapping optimization problems onto neural networks. Int. J. Neural Syst., 1, 322.
Shindyalov, I. and Bourne, P. (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng., 11, 739747
Szustakowski, J. and Weng, Z. (2000) Structural alignment using a genetic algorithm. Prot. Struct. Funct. Genet., 38, 438440.
Uliel, S., Fliess, A., Amir, A., Unger, R. (1999) A simple algorithm for detecting circular permutations in proteins. Bioinformatics, 15, 930936
Urahama, K. (1994) Analog method for solving combinatorial optimization problems. IEICE Trans. Inform. Syst., E77-A, 302308.
| ||||||||||||






























