Bioinformatics Advance Access originally published online on April 21, 2006
Bioinformatics 2006 22(14):1710-1716; doi:10.1093/bioinformatics/btl150
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Molecular complexes at a glance: automated generation of two-dimensional complex diagrams
Center for Bioinformatics (ZBH), University of Hamburg Bundesstrasse 43, 20146 Hamburg, Germany
*To whom correspondence should be adderessed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: In this paper a new algorithmic approach is presented, which automatically generates structure diagrams of molecular complexes. A complex diagram contains the ligand, the amino acids of the protein interacting with the ligand and the hydrophilic interactions schematized as dashed lines between the corresponding atoms. The algorithm is based on a combinatorial optimization strategy which solves parts of the layout problem non-heuristically. The depicted molecules are represented as structure diagrams according to the chemical nomenclature. Due to the frequent usage of complex diagrams in the scientific literature as well as in text books dealing with structural biology, biochemistry and medicinal chemistry, the new algorithm is a key element for computer applications in these areas.
Results: The method was implemented in the new software tool PoseView. It was tested on a representative dataset containing 305 proteinligand complexes in total from the Brookhaven Protein Data Bank. PoseView was able to find collision-free layouts for more than three quarters of all complexes. In the following the layout generation algorithm is presented and, additional to the statistical results, representative test cases demonstrating the challenges of the layout generation will be discussed.
Availability: The method is available as a webservice at http://www.zbh.uni-hamburg.de/poseview
Contact: rarey{at}zbh.uni-hamburg.de
| 1 INTRODUCTION |
|---|
|
|
|---|
The two-dimensional (2D) illustration of molecular complexes has a wide range of application in life sciences. The interaction arrangement between different molecules of a complex gives important insights into protein function and the role of ligands in molecular biology. Methods from computer-aided drug design to high-throughput screening (HTS) are usually characterized by large outputs. Manual investigation of these outputs requires a clearly arranged visualization of the molecular complexes. Since gaining information from highly detailed 3D illustrations turns out to be a time-consuming procedure, the two-dimensional diagrams of molecular complexes, reflecting the key interactions, are often more suitable. To cope with high quantities of input data, the automated generation of such diagrams is highly desired.
Today, only a single software tool for the automatic creation of complex diagrams called Ligplot (Wallace et al., 1995) is available. The Ligplot algorithm heuristically creates a 2D diagram under consideration of the 3D coordinates of the proteinligand complex. The flattening is done by unrolling the structures to the plane beginning with the rings of the complex molecules. During the next step all rotatable bonds are sequentially rotated to unroll the structure. Hydrophilic interactions are treated as rotatable bonds; hence it is possible to unroll the complete complex in one work step. An additional structure clean-up before printing out the diagram as a postscript file rounds off the method.
In this paper we present a new algorithm implemented in the software tool PoseView. It automatically creates structure diagrams of molecular complexes following the chemical structure diagram conventions (Helson, 1999) with a focus on hydrogen bonding. Instead of converting the 3D coordinates of the complex, the algorithm is based on a thorough analysis of the complex topology. By this means, it is possible to generate an overlap-free 2D arrangement of the structure diagrams and interaction lines independent from the original 3D complex coordinates.
In the subsequent methods section, the PoseView algorithm will be described in detail. Then, the statistical results of applying PoseView to a dataset of proteinligand complexes derived from the Protein Data Bank (Berman et al., 2000) are presented. In
74% of all test cases, PoseView was able to create overlap-free layouts near to text book quality. The result section shows several examples and highlights the difficulties of complex diagram generation. Furthermore, examples of complexes with provably no overlap-free diagrams will be identified and drawn.
| 2 METHODS |
|---|
|
|
|---|
Throughout this paper, the following nomenclature will be used. The 2D illustration of a complex generated by PoseView is termed a Pose Diagram, while the arrangement of the components is referred to as a Pose Layout. A component of the Pose Diagram can be either a structure diagram of an amino acid and the ligand, respectively, or a dashed line representing a hydrophilic interaction between two atoms. The aim is to calculate a collision-free Pose Layout that provides the information about the interacting pattern readily identifiable for the user. A layout that fulfills these criteria is called a good Pose Layout.
PoseView performs on both, given complex data, for example from the Protein Data Bank (PDB), and docking results generated by FlexX (Rarey et al., 1995). In case of using complex data, the interactions between ligand and receptor and the positions of absent hydrogen atoms, which are not contained in the PDB file, are computed by FlexX. The ligand and amino acid structure diagrams, which are involved in the complex, are generated by the Structure Diagram Generation Library (LibSDG) as presented in Fricker et al. (2004).
While the structure diagrams of the amino acids are regarded as rigid structures, i.e. no additional layout modifications that change the bond and atom arrangement within the molecule are permitted after their generation, the ligand layout can be modified in order to allow an intersection-free arrangement of interaction lines.
The Pose Layout is computed sequentially. Depending on the previously calculated ligand layout and the resultant convex hull of the ligand atoms, the amino acids are initially placed. During a post-processing step collisions between the components are identified and, if possible, removed.
The amino acids that appear in the pose diagram are those which participate in the interaction pattern between ligand and receptor with an energy contribution to the total interaction energy of at least 33% of the possible maximal energy for this interaction type. The energy is estimated by applying the scoring function of Böhm (1994) with minor changes as it is used in FlexX (Rarey et al., 1996) procedure.
2.1 Data collection
For data collection and preparation, the FlexX molecular docking software is employed. While the protein can be loaded directly from PDB, the ligand molecule should be provided in mol2 format to ensure proper hybridization and protonation states. In case the ligand is extracted from the PDB file, the information about ligand atom properties are derived from the 3D coordinates. In FlexX, the receptor and the ligand are represented as connected graphs, while the interactions are available as matches between interacting atoms. All amino acids involved in the Pose Diagram are represented as individual structure diagrams, even if they are adjacent in the protein sequence. The connections to adjacent amino acids are symbolized by a bonded R.
The initial 2D coordinates of molecule components of a pose diagram are generated de novo by LibSDG (Fricker et al., 2004). Due to the chemical conventions some hydrogen atoms are not explicitly drawn or bonds between hydrogen atoms and heavy atoms are omitted. Since hydrophilic interactions start at such implicit atoms several times, the ligand is searched for these atoms and they are additionally drawn by LibSDG as explicit details.
2.2 Ligand layout
As the layout of the ligand structure diagram determines the arrangement of its interaction atoms, it plays a key role for a collision-free Pose Layout. Once a ligand layout is chosen, a walk around the structure defines the order in which interactions are arranged. Since interactions might be formed to the same amino acid, an interaction order in which such interactions are adjacent is a prerequisite for an overlap-free layout. Enumerating all possible ligand layouts guarantees that the interaction order with a minimal number of interruptions in interaction contiguity is found.
2.3 Initial interaction order
During the preprocessing part, the set of initial orders of the ligand interaction atoms is calculated. The order describes the sequence of the interaction atoms' cyclic occurrence around the ligand (Fig. 1). The aim is to modify the ligand layout such that all atoms which are interacting with the same amino acid are adjacent with respect to the interaction order. If no interaction order providing an intersection-free arrangement of interactions is found, a layout without collisions cannot be created. In this case, the ligand layout providing the smallest number of interruptions with respect to adjacency is selected.
|
Both inner and terminal atoms of the molecule can participate in interactions. If more than one outgoing bond of an inner interaction atom leads to a part of the molecule that contains at least one other interaction atom, the resulting order is no longer unique. From the perspective of layout generation it is advantageous to place an interaction line starting in the direction of the largest opening angle formed by the outgoing bonds of the interacting atom. Since a rigid selection of the geometrically preferred direction eventually does not provide a collision-free layout, all different interaction orders are computed and evaluated in the order of increasing geometrically preferred interaction directions. This search process is stopped as soon as an order is found that allows an overlap-free arrangement of Pose Diagram components.
2.4 Layout modification operators
The ligand layout can be changed by different modification operations. Either a bond is rotated or two bonds which are connected to the same atom are exchanged. Each acyclic bond divides the molecule into two substructures. Rotating this bond causes a flip of one molecule part and changes the neighborhood of the rightmost and leftmost interaction atom of both substructures in the interaction order as shown in Figure 2. The identification of the two substructures for each bond is done by a depth first search traversal, in which bonds are explored according to an increasing angular order with respect to the last visited bond. Thus, for each bond, the leftmost and rightmost interaction atom of its submolecules can be stored. If one of the submolecules of a rotatable bond contains just one or no interaction atom, its rotation has no influence on the interaction order although it possibly changes the layout of the structure diagram. Such bonds are removed from the list of layout changing bonds.
|
The second operation is a so-called bond exchangeor switchof two acyclic bonds which are connected to the same atom. The operation exchanges the whole substructures following the switched bonds. The central atom of the bond exchange is named switch core atom. Exchanging two bonds is only reasonable if four or more bonds are connected to the switch core atom. For all other cases, each switch is equivalent to a sequence of bond rotations and can therefore be omitted.
In the following, only the case for four outgoing bonds will be discussed exemplarily. For the usage of bond exchanges two situations have to be distinguished: either all bonds are acyclic or a subset of them is cyclic, for example the switch core atom is a ring atom, where two substituents are connected to the ring. In case of acyclic bonds, three different atom orders are possible (Fig. 3). Since the arrangement of the exchangeable bonds is circular, it is possible to obtain all different orders by permutation of only three bonds. From the resulting six different orders, three can be achieved by combining bond switching and rotations. Hence the distinction between three different cases for switch possibilities suffers.
|
If two or more of the considered bonds are cyclic, it means that the switch core atom is a ring atom, there arebesides the constraints due to the chemical nomenclaturelimitations given by LibSDG. Therefore it is not possible to perform bond exchanges which include ring bonds currently.
2.5 Enumeration of ligand layouts
Subsequent to the preprocessing part all possible ligand layouts are enumerated and evaluated until a ligand layout enabling an overlap-free Pose Diagram is found. Terminating the enumeration at the stage where the first good layout is found is a compromise between finding the optimal layout and the time-consuming enumeration and evaluation of the full solution space. The algorithm guarantees that a good layout is found if it exists, but an eventual better layout concerning the later collision handling might not be provided.
The evaluation of each new layout is done by checking for each amino acid with multiple interactions, if the ligand interaction atoms are adjacent in the corresponding interaction order. Note that the evaluation up to this point can be done without the time-consuming calculation of 2D coordinates. The current layout score is calculated by adding one penalty point for each interruption in the adjacency of ligand interaction atoms which are interacting with the same amino acid. If the number of interruptions of the new layout is less than the number of interruptions of the best layout found so far, 2D coordinates for all ligand atoms compliant with the list of modification operations are calculated using LibSDG.
An example for the effect of both layout modification operations can be found in Figure 4.
|
2.6 Initial amino acid arrangement
Following the ligand layout calculation the amino acids are initially placed in the Pose Diagram. Each amino acid is positioned independently of the others; possibly emerging collisions are ignored at this stage.
First, for each interaction the initial direction is calculated individually, even if more than one is located between ligand and the current selected amino acid. The direction is determined either by the convex hull of the ligand or the bonds leading to the interacting atom in the following way. If the ligand interaction atom is part of the convex polygon the interaction direction extends the direction of the bond leading to the atom, or rather if more than one bond is connected to the atom, it extends the main direction resulting from all bond directions. Otherwise the interacting atom is situated within the polygon and the interaction direction is orientated perpendicular to an associated edge of the polygon. If no suitable edge is found the direction is computed as the resulting direction of all bond directions that are explored by a depth first search within five recursion steps. Emanating from the individual interaction directions a main interaction direction for each amino acid is calculated by averaging. The centroid of all ligand atoms interacting to the same amino acid is used as starting point for the main interaction direction of that amino acid.
Second, for each amino acid the interaction direction to the ligand has to be computed. For amino acids, all orientations of bonds to potentially interacting atoms are pointing out of the molecule. Therefore it suffices in all cases to extend the bond directions and perform the averaging process as for the ligand side.
The amino acid positioning is done by superimposing the pairwise matching interaction directions of the ligand and the amino acid. After rotating the structure diagram, the distance between the starting points of the interaction directions is set to roughly four times of the regular bond length. Although this distance is too large with regard to the covalent bond length, it is not shortened in order to reduce the number of irresolvable collisions between the Pose Diagram components.
2.7 Collision removal
The initial Pose Layout is searched for three types of collisions. The overlap of different structure diagrams, the intersection of two interaction lines and intersections of interaction lines with a structure diagram. An example for each collision type can be found in Figure 5. The collision handling strategy consists of the rotation of the amino acid coordinates and the amino acid interactions around the main interaction direction starting point at the ligand. For each amino acid eight potential placements are considered apart from the initial one as it is shown in Figure 6. Starting from the initial position, it can be rotated four times to the left and to the right in steps of 18 degrees.
|
|
A collision between two amino acids exists if the pairwise distance of at least one atom of both amino acids is smaller than 1.2 times the bond length of a covalent bond in a structure diagram. Collisions including interaction lines are detected by calculating intersections either between two interaction lines leading to different amino acids or between interaction lines and bonds of another structure diagram. All collisions of these three types are numerically combined to a layout score.
Due to pairwise atom comparisons, examining the Pose Diagrams for collisions is a time-critical step during the layout generation because it shows an exponential behavior. To avoid unnecessary atom comparisons between amino acids which do not touch, the minimal bounding box of each amino acid in each position is computed. If the minimal bounding boxes do not overlap the atom position comparison can be omitted.
The computation of the best Pose Layout is realized by a branch and bound algorithm enumerating and evaluating possible combinations of amino acid positions. The aim is to find a collision-free layout with the minimal possible deviation from the initial one. Within the branch and bound scheme, this is achieved by a destination value that affects the order of occurrence of analyzed layouts. The branch and bound algorithm proceeds on a k-ary tree, k being the number of different possible positions of an amino acid. The height of the tree corresponds to the number of amino acid structure diagrams in the Pose Diagram. The tree is traversed in a depth first manner adding an amino acid structure diagram and its interaction lines to the layout in each level of the tree. The rotation angle of the newly placed amino acid is determined by the tree edge traversed. The newly placed amino acid is scored and added to the score of the predecessor layout. If the score exceeds the current best score that is initially set to the score of the initial layout, the branch can be cut off because the score never decreases.
A second scorethe position scoreevaluates the geometric deviation of the structure diagram positions from the initial ones. The score p for an amino acid in a position that is diverged i steps from the initial layout is calculated as p = 2i 1. For the position score, a target value is defined which is initialized as zero and incremented stepwise by one. For each new result, it is requested that the position score has to reach the target value exactly once all amino acids are placed. With this method, resulting layouts are calculated in the order of increasing deviation from the initial layout.
If no collision-free combination exists, the layout with the lowest score and the smallest deviation from the initial layout is chosen as the best result.
2.8 Additional methods
Besides the presented main algorithm some additional algorithmic steps are required to create a visually appealing Pose Layout.
If an amino acid has more than one interaction with the ligand, it is possible that after the initial placement the interaction lines intersect. This intersection is not handled during the post-processing step, but in an earlier point in time. Before starting the post-processing, the amino acids, whose interactions intersect, are rotated correspondingly. The straight line defined by the centroid and the interaction starting point of the amino acid structure diagram forms the axis of rotation. A more esthetic point is to rearrange the interacting hydrogen atoms next to each other if the interaction lines lead to the same structure diagram. In a last step the amino acid structure diagrams are labeled with their three letter code, their number in the protein sequence and the letter that identifies the chain in the protein.
Finally, the output of PoseView is created. PoseView is equipped with browser functionality in order to interactively analyze molecular complexes. Furthermore, a .pdf-file or .fig-file (xfig program) can be created.
| 3 RESULTS |
|---|
|
|
|---|
In the following the performance of PoseView is demonstrated by applying the algorithm on a dataset of PDB complexes (Nissink et al., 2002).
To measure the quality of the results, each Pose Diagram was classified in one of three different quality categories: overlap-free Pose Layouts, layouts whose ligand topology provides an intersection-free arrangement of interactions but the Pose Diagram is not collision-free, and unsolvable layouts.
The layout generation was tested on the CCDC Astex dataset (Nissink et al., 2002). Not all complexes were included in the evaluation, because for some of them no interactions whose type and energy would fit in the selection criteria were estimated. The CCDC Astex dataset contains 305 complexes in total. Of them, 269 were supplied to PoseView, and the algorithm was able to find 199 overlap-free Pose Layouts, which corresponds to a percentage of 74%. Of all layouts, 58 (21.6%) were of improvable quality and for 6 layouts no intersection-free solution exists in 2D space. The remaining 6 complexes could not be drawn for technical reasons (for example complex ring systems in the ligand molecule).
The computing time on a standard PC (Intel Xeon2 EMT64, 3.0 GHz, 2 GB main memory running under SuSE Linux 9.2) averages out to 0.14 s per generated Pose Diagram. The average computing time differs significantly between the three layout categories. While the good layouts could be generated in
0.02 s, the Pose Diagrams which could not be created collision-free show a higher computing time (Table 1) because collision removal is stopped when an overlap-free layout is found. The data preparation and estimation of interactions between ligand and receptor by FlexX is usually done in
2 s.
|
The statistical results are noted in detail in Table 1.
3.1 Selected examples
Figure 7ad shows four examples for collision-free layouts. All layouts are created by PoseView without manual intervention. In Figure 7e and f two examples for Pose Diagrams containing collisions are shown. For one example (e) the layout is classified as improvable while for the last example (f) no intersection-free arrangement of interactions is possible in a 2D representation. Detailed information about the complexes and the challenges during the layout generation can be found in Table 2.
|
|
| 4 DISCUSSION |
|---|
|
|
|---|
After careful inspection of several hundred Pose Diagrams generated by PoseView, some options for further improvement become apparent. The PoseView algorithm starts with the calculation of the best ligand layout, which is done in a combinatorial way. The chosen method always finds a layout that provides an intersection-free arrangement of interactions if it exists. However, it has been found insufficient to decide only by a topological criterion about the best interaction order because two topological adjacent atoms in the interaction order can be geometrically very distant or situated on opposite sides of the ligand. The second possible improvement would be to choose the best of all possible layouts instead of taking the first result that provides an intersection-free arrangement of interactions. Both presented modifications come along with higher computing times since the number of rotatable and exchangeable bonds would increase.
The dependency of the initial interaction orientation from the convex hull of the ligand proved to be a very useful strategy to avoid collisions ab initio. Independent of ligand shape the interaction lines do not intersect any ligand bond or atom if there exists a straight line from the interaction atom to an edge of the convex hull. The amino acids are evenly distributed around the ligand, because the convex hull ensures a radial arrangement of the interaction lines. Therefore the collision handling step can often be avoided because the initial amino acid arrangement is overlap-free.
During the post-processing step alternative arrangements for the amino acids are calculated if the initial layout is not collision-free. All three collision types can be handled with the same strategy. The chosen rotation angle covers nearly a semicircle. This range offers a suitable position for most amino acids. The more challenging part is that in some cases the best relative placement for an amino acid lies between two possible positions given by the chosen rotation angle. A reduction of the angle size and an increase of different possible positions would improve the results of some pose layouts. Since the branch and bound scheme shows an exponential behavior in the worst case, an increase in the number of amino acid positions would be too time-consuming for complex diagrams which contain a relatively large number of structure diagrams and interaction lines.
Until now, the amino acid diagrams are treated as rigid structures. To avoid some collisions it would be helpful to allow layout modifications like it is done for the ligand.
Like hydrophilic interactions, hydrophobic contacts play an important role in complex formation. One of the next steps will be to introduce an appropriate visualization of these interactions.
| 5 CONCLUSION |
|---|
|
|
|---|
We have presented a multi-phase algorithm for the automatic creation of Pose Diagrams which contain the ligand, the interacting amino acids and the hydrogen bonding pattern. The algorithm is implemented in the software tool PoseView, which is able to generate 74% collision-free diagrams of PDB complexes. The largest part of the remaining 26% of the Pose Diagrams is of improvable quality which means that an increase of success rates by including continuous collision handling strategies and ligand layout evaluation methods is conceivable. The Pose Diagram generation is done in the range of a tenth part of seconds, which makes PoseView applicable for large numbers of complexes and allows the possibility for new additional collision handling methods.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Anna Tramontano
Received on February 21, 2006; revised on April 13, 2006; accepted on April 13, 2006
| REFERENCES |
|---|
|
|
|---|
Berman, H.M., et al. (2000) The Protein Data Bank. Nucleic Acids Res, . 28, 235242
Böhm, H.J. (1994) The development of a simple empirical scoring function to estimate the binding constant for a proteinligand complex of known three-dimensional structure. J. Comput. Aided Mol. Des, . 8, 243256[CrossRef][Web of Science][Medline].
Fricker, P.C., et al. (2004) Automated drawing of structural molecular formulas under constraints. J. Chem. Inf. Comput. Sci, . 44, 10651078[CrossRef][Web of Science][Medline].
Helson, H.E. (1999) Structure diagram generation. In Lipkowitz, B. and Boyd, D.B. (Eds.). Reviews in Computational Chemistry, , NY Wiley-VCH, pp. 313398.
Lauble, H., et al. (1994) Crystal structures of aconitase with trans-aconitate and nitrocitrate bound. J. Mol. Biol, . 237, 437451[CrossRef][Web of Science][Medline].
Muller-Dieckmann, H.J. and Schulz, G.E. (1995) Substrate specificity and assembly of the catalytic center derived from two structures of ligated uridylate kinase. J. Mol. Biol, . 246, 522530[CrossRef][Web of Science][Medline].
Nissink, J.W., et al. (2002) A new test set for validating predictions of proteinligand interaction. Proteins, 49, 457471[CrossRef][Web of Science][Medline].
Quiocho, F.A., et al. (1989) Substrate specificity and affinity of a protein modulated by bound water molecules. Nature, 340, 404407[CrossRef][Medline].
Rarey, M., et al. (1995) Time-efficient docking of flexible ligands into active sites of proteins. Proc. Int. Conf. Intell. Syst. Mol. Biol, . 3, 300308[Medline].
Rarey, M., et al. (1996) A fast flexible docking method using an incremental construction algorithm. J. Mol. Biol, . 261, 470489[CrossRef][Web of Science][Medline].
Veerapandian, B., et al. (1992) Direct observation by X-ray analysis of the tetrahedral intermediate of aspartic proteinases. Protein Sci, . 1, 322328[Web of Science][Medline].
Wallace, A.C., et al. (1995) LIGPLOT: a program to generate schematic diagrams of proteinligand interactions. Protein Eng, . 8, 127134
Weber, P.C., et al. (1989) Structural origins of high-affinity biotin binding to streptavidin. Science, 243, 8588
Xia, Z., et al. (1996) Determination of the gene sequence and the three-dimensional structure at 2.4 angstroms resolution of methanol dehydrogenase from Methylophilus W3A1. J. Mol. Biol, . 259, 480501[CrossRef][Web of Science][Medline].
Xue, Y., et al. (1994) Crystal structure of fructose-1,6-bisphosphatase complexed with fructose 2,6-bisphosphate, AMP, and Zn2+ at 2.0-A resolution: aspects of synergism between inhibitors. Proc. Natl Acad. Sci. USA, 91, 1248212486
This article has been cited by other articles:
![]() |
P. Zhou and Z. Shang 2D molecular graphics: a flattened world of chemistry and biology Brief Bioinform, May 1, 2009; 10(3): 247 - 258. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||







