Skip Navigation


Bioinformatics Advance Access originally published online on October 24, 2006
Bioinformatics 2007 23(1):99-106; doi:10.1093/bioinformatics/btl538
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/1/99    most recent
btl538v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Zhou, R.
Right arrow Articles by Mudur, S.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Zhou, R.
Right arrow Articles by Mudur, S.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

PROTERAN: animated terrain evolution for visual analysis of patterns in protein folding trajectory

Ruhong Zhou 1,*, Laxmi Parida 1, Kush Kapila 2 and Sudhir Mudur 3

1 Computational Biology Center, IBM T. J. Watson Research Center Yorktown Heights, NY 10598, USA
2 Matrox Imaging Inc., Dorval Quebec, Canada
3 Department of Computer Science, Concordia University Montreal, Canada

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 INCREMENTAL PATTERN DISCOVERY...
 3 ON-LINE PATTERN VISUALIZATION
 4 RESULTS
 5 CONCLUSION
 REFERENCES
 

Summary: The mechanism of protein folding remains largely a mystery in molecular biology, despite the enormous effort from many groups in the past decades. Currently, the protein folding mechanism is often characterized by calculating the free energy landscape versus various reaction coordinates such as the fraction of native contacts, the radius of gyration and so on. In this paper, we present an integrated approach towards understanding the folding process via visual analysis of patterns of these reaction coordinates. The three disparate processes (1) protein folding simulation, (2) pattern elicitation and (3) visualization of patterns, work in tandem. Thus as the protein folds, the changing landscape in the pattern space can be viewed via the visualization tool, PROTERAN, a program we developed for this purpose. We first present an incremental (on-line) trie-based pattern discovery algorithm to elicit the patterns and then describe the terrain metaphor based visualization tool. Using two example small proteins, a ß-hairpin and a designed protein Trp-cage, we next demonstrate that this combined pattern discovery and visualization approach extracts crucial information about protein folding intermediates and mechanism.

Contact: ruhongz{at}us.ibm.com

Supplementary information: http://www.research.ibm.com/people/r/ruhong/SI.htm


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 INCREMENTAL PATTERN DISCOVERY...
 3 ON-LINE PATTERN VISUALIZATION
 4 RESULTS
 5 CONCLUSION
 REFERENCES
 
Understanding how protein folds continues to be one of the most challenging problems in computational biology. The interest is in both obtaining the final fold—as witnessed by recent advances in protein structure prediction and structural genomics, and understanding the mechanism and kinetics of the folding process. (Snow et al., 2002; Zhou et al., 2004; Zhou and Karplus, 1999) Many native proteins fold into unique globular structures on a very short time scale. The so-called fast folders can fold into the native structure from random coils in microseconds to milliseconds. Recent advances in experimental techniques that probe proteins at different stages during the folding process have shed light on the nature of the folding kinetics and thermodynamics (Blanco et al., 1994; Munoz et al., 1997). However, due to experimental resolution limits, detailed protein folding pathways remain unknown. Computer simulations performed at various levels of complexity can be used to supplement experiments and fill in some of the gaps in our knowledge about folding mechanisms. Meanwhile, effective analysis and visualization of the trajectory data from the protein folding simulations, either by molecular dynamics (MD) or Monte Carlo (MC), remains yet another challenge due to the large number of degrees of freedom and the huge amount of trajectory data. Currently, the protein folding mechanism is often characterized by calculating the free energy landscape versus the so-called reaction coordinates (Garcia and Sanbonmatsu, 2001; Zhou et al., 2001). We and others have used various reaction coordinates, such as fraction of native contacts and radius of gyration (Garcia and Sanbonmatsu, 2001; Zhou et al., 2001). Searching for better reaction coordinates is still of active interest in protein folding mechanism studies. These analyses have provided important information for a better understanding of protein folding. However, the free energy landscapes usually result in too much information reduction due to their limitation in dimensionality, and worse, the time correlation information about various states is usually not directly reflected in these free energy contour maps. Therefore, better or complementary analysis tools are in great demand (Parida and Zhou, 2005; Lei et al., 2005).

1.1 Background
We had previously proposed a combinatorial pattern discovery approach to recognize the folding intermediates (Parida and Zhou, 2005). We describe that process briefly here to provide a context to this paper. It is known that the folding process of many proteins takes the amino acid coil through different states before stabilizing on the final folded state. Thus, from a pattern discovery point of view, the task is to recognize these intermediate states that the folding process goes through. The procedure can be briefly summarized as a four-step process. The first step involves the conversion of folding trajectories from MD simulations to a time series of characteristic features (reaction coordinates in this case), since working in the full structure space of the protein is extremely complex due to the number of degrees of freedom and the very large number of configurations (easily in millions) (Zhou et al., 2001, 2003a). In the second step, we study these data points to extract the characteristic set of features which we call patterned clusters. In the third step, these patterns are filtered to retain the most significant ones. It is non-trivial to model the significant patterns in this domain, so we have combined the second and third steps and used appropriate parameters to filter out possibly insignificant patterns. For instance, if a pattern occurs less than certain (K) times, then the pattern is possibly not salient. The fourth step is of analyzing the patterns: this involves extracting the structural patterns or structural motifs using the time coordinates and studying the time correlation of the different structures. For instance, one could observe that the hydrophobic core is formed before the beta-strand hydrogen bonds, or vice versa, in a ß-hairpin folding; and one can also interconnect various free energy states in different free energy contour maps by monitoring the high dimensional patterns (Parida and Zhou, 2005).

In this paper, we present an integrated approach towards understanding the folding process via visual analysis of folding patterns in the form of reaction coordinates. The three disparate processes (i) protein folding simulation with replica exchange molecular dynamics (REMD) (Sugita and Okamoto, 1999; Garcia and Sanbonmatsu, 2001; Zhou et al., 2001), (ii) pattern elicitation and (iii) visualization of the patterns, work in tandem. Here the Incremental data generation process is the simulation process that produces the reaction coordinates data with time (x-axis in the Fig. 1). At each time interval t{delta} (in the figure, t{delta}=1), the newly generated data is fed into the Incremental maximal pattern generation module, which extracts the patterns and feeds them to the Visualization module which displays an appropriate pattern landscape (animated over time).


Figure 1
View larger version (17K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1 The incremental augmented trie construction for a sample dataset with K = 2 for Example 1. The dashed edges are (implicitly) labeled with ‘–’.

 
To our knowledge, this is the first time such an integrated approach has been applied to better understanding of a process such as that of the protein folding. We present a trie-based incremental pattern discovery algorithm and the simplicity of the pattern descriptions lead to easy interpretation of and thus better understanding of the underlying processes. The combined pattern discovery and visualization approach is then applied to two small protein systems, a ß-hairpin and a designed 20-residue protein Trp-cage. The approach extracts crucial information about protein folding intermediates and helps to identify structural motifs that are overlooked by the free energy landscape analysis. This success encourages us to postulate that this combined approach will lead to deeper understanding of the folding process.


    2 INCREMENTAL PATTERN DISCOVERY ALGORITHM
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 INCREMENTAL PATTERN DISCOVERY...
 3 ON-LINE PATTERN VISUALIZATION
 4 RESULTS
 5 CONCLUSION
 REFERENCES
 
Consider a scenario where there are n data points with each data point having m attributes that take on real values. Each attribute is denoted by Jj, 1 ≤ j ≤ m.

Let D be an n x m array where D[i, j] isin R that represents the value of property Jj for the i-th data point. A cluster pattern, p is a collection of some l (1 < l ≤ m) attributes or columns denoted as

Formula
Given a fixed set of values {delta}j, for each j, we define the location list Lp of p (which is a collection of rows) as follows:

Formula
This pattern p is said to satisfy the quorum K if Formula. In other words, the pattern p must have support in at least K data points.

A pattern p with l columns is maximal if no more columns can be added without decreasing the support of the pattern. More precisely, p is maximal if there exists no pattern p', with p' sup p and Formula (Parida and Zhou, 2005). Thus maximality is a combinatorial means of reducing the number of cluster patterns, without any loss of information. The patterns under discussion in this paper are maximal. An off-line (i.e. given all the n data points of the input D) algorithm to discover these maximal cluster patterns is discussed in (Parida and Zhou, 2005). It has also been shown that it is possible to define discrete alphabet such that D[i, j] isin {Sigma}, a finite alphabet set. In the rest of the discussion we will assume that D is defined on this finite alphabet {Sigma}.

Here, we discuss an on-line algorithm to solve the same problem. The assumption is that all data points are not available at the same time and either each data point or chunks of data points are made available with time. More precisely, let each row i represent the data in time i and at time i, the rows 1,2, ... , i are available to the algorithm, denoted as D[1...i,m] (i.e. i rows and m columns). Since maximal pattern p generated at time t1 is displayed, it is important to assert that at time t2 > t1, p remains maximal. The following is easy to verify.

LEMMA 1
A pattern p that is maximal in D[1 ... t1, m] is also a maximal pattern in D[1...t2,m], where t2 > t1.

Since the data arrives one complete row at a time, a previously declared maximal pattern cannot become non-maximal later. Since we have shown that this model is viable, now we present an incremental method to discover the patterns.

2.1 Trie-based algorithm
For simplicity, assume that D is defined on a set of time records (reaction coordinates) {Sigma} = {{sigma}1, {sigma}2, ... , {sigma}l}, with an ordering on the alphabet as {sigma}1 < {sigma}2 < ... < {sigma}l. Let a new record (edge node) ‘–’ {notin} {Sigma} and let {Sigma} {cup} {‘–’} be denoted by {Sigma}*. Further, let {sigma}i < ‘–’ for all 1 ≤ i ≤ l.

We present a method based on a modified trie data structure (Aho et al., 1983) which we call the augmented trie. First, as each row of D is read, it is treated as a string of length m and the standard trie data structure is constructed. Recall the following two properties of a standard trie (Aho et al., 1983):

Property 1: There is one root node whose depth is assumed to be 0. Each edge is labeled and the label of an internal node is assumed to be the label of the (unique) incoming edge.

Property 2: The tree is of height m. An internal node at depth j from the root denotes the j-th column of D.

In this standard trie (whose edges are labeled with characters from {Sigma}), each unique path from the root node to the leaf node A represents a pattern pA in D with exactly m columns and has a pointer to LpA or the set of rows where pA occurs.

Augmenting the trie. Consider a trie or tree with labels from {Sigma}*. Now, each unique path from the root node to the leaf node A represents a pattern pA in D: however, if an edge at depth j is labeled with ‘–’, then column j is ignored. Thus in the unique path from the root to the leaf node A if there are l' edges with label ‘–’, then the pattern pA has ml' columns. Further, it also has a pointer to LpA or the set of rows where pA occurs.

We next define a minimal consensus T, given k trees as follows: Given the input data D and trees Ti, i = 1, ... , k, labeled on {Sigma}*, T is a consensus tree of the k trees,

  1. if for each leaf node Ai in Ti, there is a leaf node A in T with Formula and Formula in D,
  2. for every leaf node A in T, there is leaf node Ai in Ti, for some i such that Formula, and.
  3. no two siblings of T have the same label (i.e. T is a trie). Further, T is a minimal consensus tree if it has no subtree T' that is a consensus of Ti Formula.

Also, the following observation is straightforward to verify.

LEMMA 2
The minimal consensus tree T of k trees is unique.

We augment this standard trie to have labels from {Sigma}* as follows. Note that this is a recursive definition. An internal node, B, that has more than one child, denoted as C1, C2, ... , Ck, is augmented to have wild child, say Cwild, whose edge is labeled by ‘–’. The sub-tree rooted at Cwild with label ‘–’ is the minimal consensus tree of sub-trees Ti that correspond to trees rooted at each sibling Ci. This is well-defined since the minimal consensus tree is unique. The algorithm to compute the minimal consensus tree is straightforward and is outlined in Figure 2.


Figure 2
View larger version (34K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2 The minimal consensus tree (recursive) algorithm for two given trees.

 
Pattern discovery algorithm. The algorithm is based on constructing the augmented trie. Recall that at time i, the rows 1,2, ... , i are available to the algorithm, denoted as D[1 ... i, m]. We make the following observation that is straightforward to verify.

LEMMA 3
Let Ti be the augmented trie corresponding to D[1...i,m]. Then Ti is a subtree of Ti2 for i1 < i2.

This shows that as each row is read, the augmented trie grows without any need for backtracking. However, the location lists pointed to by the leaf nodes may be expanded (but not reduced). This is also demonstrated in the simple example below (Fig. 1).

EXAMPLE 1. Consider a scenario with five data points, each with five attributes. For simplicity, let {Sigma} = {a < b < c < s < p < q < r < x < y < z} and the corresponding D with columns J1, J2, J3, J4, J5 is shown below:

Formula
Let T be the trie under construction. When D is empty, T has only the root node. To add a new row i of D to T, this row is treated as a string of length m and added to the trie T in the standard way (Aho et al., 1983). Further, this row is added to the location list pointed to by the leaf node. Let P be a node that has more than one child in T given as Ci, i = 1,2, ... , k. Using the MinConsensus() algorithm, a child with label ‘–’ is either created or updated.

The stepwise construction of the trie is shown in Figure 1. The branches labeled with ‘–’ are shown as dashed lines (the right most child of a node) for convenience. Here quorum is K = 2. The maximal pattern p1 is generated when row 2 is read; maximal pattern p2 when row 4 is read; maximal patterns p3 and p4 when row 5 is read. The bold edges in Figure 1b and c denote the new branches generated at that step.

Since the labels are ordered and the wild edge is always the rightmost child, a pattern p1 on the left is more specific than p2 on the right, the following observation can be verified.

LEMMA 4
Let leaf nodes p1 < p2 in the left-to-right ordering of the augmented trie, (i.e. leaf node p1 is to the left of p2) with Formula. Then p2 must be non-maximal with respect to p1.

This property enables us to display the patterns as they are created (without ever having to backtrack). In the algorithm, as a new location list is generated, it is checked with existing location lists. The location lists are stored in a balanced binary tree to make this checking efficient. If it is a new list, the pattern is output as a maximal pattern.

Time complexity. In the application, D is defined on real values. The implications of converting these real values to discrete characters is discussed in (Parida and Zhou, 2005). We use the same approach here in the on-line algorithm.

Recall in our application, m << n. The augmented trie can be compacted to give a Patricia or a radix tree (Aho et al., 1983; Morrison, 1968) for efficiency in space. Figure 3 shows a compact Patricia tree for the running example.


Figure 3
View larger version (20K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3 The Patricia trie of the cp-trie of Figure 1c.

 
In the worst case there can be O(2md) distinct patterns where each column has at most d different values. Also, the location lists are stored in a balanced binary tree and the time to check if a list already exists takes O(n log n) time. At step i, let N be the number of patterns including the non-maximal patterns. The algorithm then takes O(N n log n) time for each step i.

Reducing the pattern space. It is quite clear that using maximality and quorum K is not adequate to control the number of patterns to be studied. We are currently experimenting with the following as a way of reducing this space without losing important information. Two patterns p1 and p2 are {varepsilon}-equal if Formula, for some fixed 0 < {varepsilon} ≤ 1. When two patterns p1 and p2 are {varepsilon}-equal, the two are replaced with p = p1 {cap} p2, and Formula. Thus in the reduced pattern space, no two distinct patterns p1 and p2 are ß-equal.


    3 ON-LINE PATTERN VISUALIZATION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 INCREMENTAL PATTERN DISCOVERY...
 3 ON-LINE PATTERN VISUALIZATION
 4 RESULTS
 5 CONCLUSION
 REFERENCES
 
It is clear that trying to manually analyze large amounts of data will result in important information being overlooked. Memorizing a few patterns is quite difficult and with the possibility of thousands of patterns getting a global view of the pattern space will be even more difficult. The user is now left with trying to memorize both the patterned clusters and their relative order in time. Hence the need arises for a good visualization tool that can give both a global view of the pattern space and the interaction of the patterned clusters as they evolve over time. When the pattern clusters are fed in a continuous fashion into our visualization tool, PROTERAN, the time correlation of the patterns and therefore the order of the folding events can be easily visualized and recorded (see supplementary material for additional details on the PROTERAN design).

PROTERAN design. We note that the patterns in this application have two important properties. One important aspect is the number of columns in the pattern. The second important aspect is the extent of prevalence of the pattern in the data in terms of its frequency or the number of times the pattern appears in the data. We design the system keeping these two major aspects in mind. The functional requirements of the visualization program were identified as follows:

  1. Global View: The user should be able to at any point see a global view of the total current data.
  2. Zoom-In: The user should be able to identify important clusters and focus on any individual patterned cluster.
  3. Animation: The patterned cluster should be animated over time.
  4. Query: The column and its corresponding values of the patterned cluster should be obtainable at any time.

The terrain metaphor. Here we use the metaphor of a terrain (Chalmers, 1993; Wise et al., 1995) to represent our pattern space. We identify the number of columns defining a pattern cluster as a way to type the patterns. The first step to visualizing as a terrain is to break down the plane into sub-regions corresponding to different types of patterns. Thus if there are m columns, the terrain is broken down into (m – 1) regions: each region corresponds to a pattern with 1 < l ≤ (m – 1) columns. Next, a region corresponding to l columns can further be broken down into Formula distinct patterns. For example, for m = 7 and l = 2 there can be at most 21 different patterns (see Supplementary Material for the layout of patterns).

The frequency of the pattern is represented by the height of the mountain whose base is the assigned territory in the plane. Thus by a quick visual inspection of the heights of the mountain, one gets a fair idea of the major states at play. In addition, because the mountains grow over time it is easy to see a time-line of the states or the patterns. The ability to ‘fly’ through the terrain allows the user to get different perspectives on the data. The zoom in and out buttons give the user the flexibility to either get a global view of the terrain or focus in on a specific pattern. Finally by pointing and clicking on a mountain the values of the respective patterns and all their properties (values of each reaction coordinate and their occurrence times) can be extracted. A sample screen shot is shown in Figure 4 and PROTERAN is available for use at http://www.kapila.ca/Proteran.htm


Figure 4
View larger version (63K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4 Visualization of all patterns in the protein folding trajectory of ß-hairpin at 310K in the PROTERAN layout described above.

 

    4 RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 INCREMENTAL PATTERN DISCOVERY...
 3 ON-LINE PATTERN VISUALIZATION
 4 RESULTS
 5 CONCLUSION
 REFERENCES
 
4.1 Folding of a ß-hairpin
The first example protein is a 16-residue ß-hairpin from the C-terminus of protein G (GEWTYDDATKTFTVTE, 2gb1 [PDB] .pdb). Table 1 lists some representative patterns of size two from the previous study (Parida and Zhou, 2005), which are shown as red patterns in PROTERAN in Figure 4 (See Supplementary Material for the details of simulation and reaction coordinates). The time sequences of each pattern are used to animate the specific patterns with time in PROTERAN. These simple patterns can be directly compared with the previous free energy states in the 3D free energy contour maps. Although more complicated patterns, such as those with up to six or seven reaction coordinates (shown as light blue and cyan patterns in Fig. 4), cannot be directly linked to the free energy contour maps due to the low dimensionality in these free energy landscapes, they have the potential to reveal more interesting results as seen in the following paragraphs.


View this table:
[in this window]
[in a new window]

 
Table 1 Patterns of size two J1, ... , J7 represent reaction coordinates

 
As described above, we can recover the previously found free energy states using the current combinatorial pattern discovery approach. Figure 5a shows a representative structure for the first pattern in Table 1. This structure resembles the partially folded state, P state, in the free energy contour map analysis using reaction coordinates Formula and Formula (Zhou et al., 2001). Similarly, the second pattern of Table 1 mimics very well the structure from the folded state (F state) in the same free energy landscape (Fig. 5b). Thus this pattern resembles the F state of the free energy contour map. In general, there is a high degeneracy in patterns with respect to the (limited) folding states even with pattern reduction (unless a very large {varepsilon} is used in pattern reduction described above). For example, the second, third, and seventh patterns in Table 1 all represent the folded F state in the free energy landscape.


Figure 5
View larger version (54K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5 Representative structures for various patterns (see text for details).

 
More importantly, the new approach reveals important structures overlooked previously, which might help to understand the folding mechanism better. Eaton and co-workers (Munoz et al., 1997) proposed a ‘hydrogen bond zipping’ mechanism for this ß-hairpin in which folding initiates at the turn and propagates toward the tails by making beta-strand hydrogen bonds one by one, so that the hydrophobic core, from which most of the stabilization derives, forms relatively late during the folding. In our previous study, we proposed a different folding mechanism that this ß-hairpin undergoes a hydrophobic core collapse first, then forms native ß-strand hydrogen bonds. Figure 5c shows a representative structure for the first pattern in Table 2, which lists multi-column patterns. The structure shows that all the five native ß-strand H-bonds have been formed, but the hydrophobic core is not completely aligned yet (RMSDs up to 4 Å). The loop region also bends towards the hydrophobic core to somewhat offset the non-perfect hydrophobic core. This implies that the hairpin can also have a path to form ß-strand hydrogen bonds before the core is finalized. The current findings indicate that the final hydrophobic core and ß-strand hydrogen bonds might be formed almost simultaneously. This can also be seen from the low free energy barrier in the previous free energy landscape analysis (Zhou et al., 2001). Interestingly, Thirumalai et al. also found that the lag time between collapse and hydrogen bond formation is very short and the two processes can occur nearly simultaneously (Klimov and Thirumalai, 2000).


View this table:
[in this window]
[in a new window]

 
Table 2 Complex patterns with multi-columns (6 and 7 in this case)

 
4.2 Folding of the Trp-cage
As mentioned in Introduction section, it is also important to study the time correlation between various folding patterns or states. For example, it is extremely useful to know which pattern or state precedes the other and by how much time. Of course, this requires continuous trajectory data, ideally the true folding kinetics data. This subsection will use another protein Trp-cage, a designed 20-residue mini-protein (Neidigh et al., 2002; Simmerling et al., 2002; Snow et al., 2002), as an example to demonstrate PROTERAN on this time correlation feature. The current data were also obtained from the REMD simulations, but the trajectory data were organized according to each replica which can climb up or down in temperature ladder (in the previous ß-hairpin case, the trajectory was obtained at the fixed biological temperature 310K); thus, these trajectories are continuous in the structural space. Even though the time sequences are not truly kinetic, they can still provide useful information on the time correlated or time sequential folding events. The goal here is not trying to compare with experimental kinetic data, but rather to demonstrate the current pattern discovery algorithm and visualization tool. Another point to mention is that the continuum solvent model, Generalized Born (GB) model (Still et al., 1990) is used in the Trp-cage simulation, which might have some artifacts. The GB model has been found to have a tendency to form erroneous salt-bridges between charged residues. (Zhou et al., 2001) The reason why we used the Trp-cage folding data from the GB model is largely because we have access to the REMD data, (Pitera and Swope, 2003) and also because this Trp-cage seems to be a robust folder and others (Simmerling et al., 2002) have also successfully folded this protein from fully stretched initial configuration using the GB model.

Figure 6 shows a few snapshots of all the patterns at t = 0, 1.25, 2.5, and 5 ns for one replica. Some patterns, such as (Rg = 7.914 ± 0.5, Nhelix = 5.5 ± 0.5), show up earlier than other patterns, indicating some structural patterns or intermediate structures develop in early stage of the folding (see Supplementary Material for definitions of all reaction coordinates). For example, this particular pattern, (Rg = 7.914 ± 0.5, Nhelix = 5.5 ± 0.5), is found to be related to the alpha helix formation near residues 2–9 (more details below). The time sequences when one particular pattern appears can be collected for each pattern, and consequently, collective patterns can be obtained at each time sequence window (window size 200–400 frames). For those time sequences with many patterns appearing at the same time, some structural signatures or motifs might be expected. Table 3 lists the top time sequences with most patterns identified in that time sequence window. As found previously, many patterns are redundant, for example, patterns (Rg = 7.914 ± 0.5, {rho} = 0.729 ± 0.15), ({rho} = 0.729 ± 0.15, RMSD = 3.151 ± 0.5, Nhelix = 8.5 ± 0.5,) and (Rg = 7.914 ± 0.5, {rho} = 0.729 ± 0.15, RMSD = 3.151 ± 0.5, Nhelix = 8.5 ± 0.5) all represent the folded state in this particular case.


Figure 6
View larger version (94K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 6 Animation of patterns in the Trp-cage folding versus time for one replica which can climb up and down in the temperature ladder. (a) 0 ns; (b) 1.25 ns; (c) 2.50 ns; and (d) 5.0 ns.

 


View this table:
[in this window]
[in a new window]

 
Table 3 Top time sequences with most patterns collected

 
A closer look at the structures in these time sequences might reveal important intermediate folding events. The representative structures for each top time sequence window can be obtained by clustering (Zhou et al., 2001, 2003). These representative structures for the first five time sequences as well as the initial extended structure are shown in Figure 7. The key hydrophobic residues forming the Trp-cage core, Tyr3, Trp6, Leu7, Pro12, Pro17, Pro18 and Pro19 are represented by sticks, whereas the rest of the proteins are represented by the ribbon view. The first major time sequence (or event) shows that an {alpha}-helix between residues 2–9 starts to develop after ~0.50 ns (sequence 1). Interestingly, the 310-helix near residues 11–14 is also partially formed in the early stage, but it comes and goes with time. The {alpha}-helix keeps developing, as shown in time sequences 2 (t = ~0.76 ns) and 3 (t = ~1.2 ns). During this process, the C-terminal poly-proline II helix has not packed against the {alpha}-helix or central tryptophan residue yet. The 310-helix is reformed at ~1.5 ns (sequence 4). Meanwhile, another important folding event occurs—the C{alpha}-terminal poly-proline II helix forms and packs against the {alpha}-helix. At approximately 2.0 ns (sequence 5), the sidechain of Trp6 has optimized its position inside the hydrophobic cage formed by the {alpha}-helix and poly-proline II helix. Thus, the Trp-cage protein has been folded, with a C{alpha}-RMSD of only 2.4 Å from the native structure. These results indicate that the folding process starts with the formation of the {alpha}-helix near residues 2–9; then the poly-proline II helix (residues 15–20) forms and packs against the {alpha}-helix; and finally the sidechain of Trp6 optimizes its position within the cage formed by the {alpha}-helix and the poly-proline II helix. The 310 near residues 11-14, on the other hand, comes and goes during the folding process. Once the protein is folded, it can stay in the folded state for quite some time before it unfolds again due to its climbing to high temperatures in the REMD simulation.


Figure 7
View larger version (33K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 7 Representative structures of the Trp-cage protein from the major time sequences in Table 3.

 
It should be pointed out that the folding time ~2.0 ns here is much faster than the experimental value of ~4 µs (Qiu et al., 2002). This seemingly much faster folding speed is due to two important factors: one is that in REMD the energy barrier crossings can be tens or hundreds of times faster than the regular MD (Garcia and Sanbonmatsu, 2001; Zhou et al., 2001); and the other is that the folding kinetics in the continuum solvent model GB can be much faster than the experiment as found by others as well (Summerling et al., 2002; Pitera and Swope, 2003; Chowdhury et al., 2003). Nevertheless, we are more interested in the time sequential orders of the folding events here.

As mentioned above, the time correlation function can also be calculated for patterns:

Formula 1(1)
where A(t) and B(t) represent two different patterns as a function of time and <···> indicates the ensemble average. When A(t) is the same as B(t), the correlation function reduces to the normal autocorrelation function:

Formula 2(2)
Here we use two example correlation functions to illustrate the concept (see Supplementary Material for more details and graphs). One is the autocorrelation function of the helix-pattern (representing the helix formation—defined as 1 if eight or more residues are alpha-helical, and 0 otherwise), and the other is the time correlation function between the helix-pattern and the folded-pattern (Rg = 7.914 ± 0.5, {rho} = 0.729 ± 0.15, RMSD = 3.151 ± 0.5, Nhelix = 8.5 ± 0.5, see above, pattern for folded conformation). The helix-pattern autocorrelation function shows a CAA(t) {propto} exp (–1.384t) (t in ns), indicating a helix-pattern lifetime of ~ 0.72 ns; while the helix-folded-patterns correlation function shows a CAB(t) {propto} exp(–0.268t), indicating a correlation characteristic time of 3.73 ns (the helix formation on average leads the folded conformation by 3.73 ns). This is consistent with the above time sequence analysis and the patterns we observe in PROTERAN, even though the exact time differs slightly due to the insufficient data in the correlation function calculations. Finally, it should be noted that the current method is equally applicable to real-time kinetic MD trajectories, once these long-scale simulations at biological temperature become routinely accessible.


    5 CONCLUSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 INCREMENTAL PATTERN DISCOVERY...
 3 ON-LINE PATTERN VISUALIZATION
 4 RESULTS
 5 CONCLUSION
 REFERENCES
 
In this paper, we have presented an integrated approach towards understanding the protein folding via visual analysis of folding patterns, which consists of three processes (1) protein folding simulation, (2) pattern elicitation and (3) pattern visualization. The on-line pattern discovery algorithm uses a trie-based approach and the visualization tool uses the metaphor of a terrain to represent a pattern, which can also animate the changing pattern landscape with time. Two small but important protein systems, a ß-hairpin from C-terminus of protein G and a designed fast folding protein Trp-cage, are then used to demonstrate the effectiveness of this combined approach. It is shown that the method not only reproduces the previously found free energy states (most populated states) in free energy contour maps, but also reveals new information overlooked previously about the intermediate structures and folding mechanism. In addition, the combined approach can reveal the time correlated or time-dependent information about various folding events. Furthermore, it is shown that the method can identify possible force field artifacts by discovering previously overlooked structural motifs. The success with the ß-hairpin and Trp-cage proteins is very encouraging and we are currently exploring the application of this method to other larger proteins, such as lysozyme. For these larger and more complicated protein systems, more structural features and intermediates might be expected in the molten globule and unfolded states (Klein-Seetharaman et al., 2002).


    Acknowledgments
 
We would like to thank Jed Pitera and William Swope for sharing with us the protein Trp-cage folding data for analysis.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Alfonso Valencia

Received on August 18, 2006; revised on October 4, 2006; accepted on October 14, 2006

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 INCREMENTAL PATTERN DISCOVERY...
 3 ON-LINE PATTERN VISUALIZATION
 4 RESULTS
 5 CONCLUSION
 REFERENCES
 

    Aho, A.V., Hopcroft, J.E., Ullman, J.D. Data Structure and Algorithms, (1983) , Massachusetts Addison-Wesley Publishing Company.

    Blanco, F.J., et al. (1994) A short linear peptide that folds in a native stable ß-hairpin in aqueous solution. Nat. Struct. Biol, . 1, 584–590[CrossRef][ISI][Medline].

    Chalmers, M. (1993) Using a landscape metaphor to represent a corpus of documents. In Frank, A. and Campari, I. (Eds.). Proceedings of the European Conference on Spatial Information Theory: A Theoretical Basis for GIS. LNCS, , Berlin Springer-Verlag 716, , pp. 377–390.

    Chowdhury, S., et al. (2003) Ab initio folding simulation of the trp-cage mini-protein approaches NMR resolution. J. Mol. Biol, . 327, 711–717[CrossRef][ISI][Medline].

    Garcia, A.E. and Sanbonmatsu, K.Y. (2001) Exploring the energy landscape of a hairpin in explicit solvent. Proteins, 42, 345–354[CrossRef][ISI][Medline].

    Klein-Seetharaman, J., et al. (2002) Long-range interactions within a nonnative protein. Science, 295, 1719–1722[Abstract/Free Full Text].

    Klimov, D.K. and Thirumalai, D. (2000) Mechanism and kinetics of ß-hairpin formation. Proc. Natl Acad. Sci. USA, 97, 2544–2549[Abstract/Free Full Text].

    Lei, Y., et al. (2005) A wavelet approach for the analysis of folding trajectory of protein trp-cage. J. Bioinf. Comput. Biol, . 3, 1351–1370[CrossRef].

    Morrison, D.R. (1968) Patricia: practical algorithm to retrieve information coded in alphanumeric. J. ACM, 15, 514–534[CrossRef].

    Munoz, V., et al. (1997) Folding dynamics and mechanism of ß-hairpin formation. Nature, 390, 196–199[CrossRef][Medline].

    Neidigh, J.W., et al. (2002) Desiging a 20-residue protein. Nat. Struct. Biol, . 9, 424–429.

    Parida, L. and Zhou, R. (2005) Combinatorial pattern discovery approach for the folding trajectory analysis of a ß-hairpin. PLoS Comput. Biol, . 1, e8.

    Pitera, J. and Swope, W. (2003) Getting to the folded state: parallel-replica simulations of trp-cage. Proc. Natl Acad. Sci. USA, 100, 7587–7592[Abstract/Free Full Text].

    Qiu, L., et al. (2002) Smaller and faster: the 20-residue trp-cage protein folds in 4 ßs. J. Am. Chem. Soc, . 124, 12952–12953[CrossRef][ISI][Medline].

    Simmerling, C., et al. (2002) All-atom structure prediction and folding simulations of a stable protein. J. Am. Chem. Soc, . 124, 11258–11259[CrossRef][ISI][Medline].

    Snow, C.D., et al. (2002a) Absolute comparison of simulated and experimental protein-folding dynamics. Nature, 420, 102[CrossRef][Medline].

    Snow, C.D., et al. (2002b) The trp cage: folding kinetics and unfolded state topology via molecular dynamics. J. Am. Chem. Soc, . 124, 14548–14549[CrossRef][ISI][Medline].

    Still, W.C., et al. (1990) Semianalytical treatment of solvation for molecular mechanics and dynamics. J. Am. Chem. Soc, . 112, 6127–6129[CrossRef].

    Sugita, Y. and Okamoto, Y. (1999) Replica-exchange molecular dynamics method for protein folding. Chem. Phys. Lett, . 314, 141–151[CrossRef].

    Wise, J.A., et al. (1995) Visualizing the non-visual: spatial analysis and interaction with information from text documents. Proc. IEEE Inform. Visual, . 95, 51–58.

    Zhou, R. (2003a) Free energy landscape of protein folding in water: explicit vs. implicit solvent. Proteins, 53, 148–161[CrossRef][ISI][Medline].

    Zhou, R. (2003b) Trp-cage: folding free energy landscape in explicit water. Proc. Natl Acad. Sci. USA, 100, 13280–13285[Abstract/Free Full Text].

    Zhou, Y. and Karplus, M. (1999) Interpreting the folding kinetics of helical proteins. Nature, 401, 400–403[CrossRef][Medline].

    Zhou, R., et al. (2001) The free energy landscape for ß-hairpin folding in explicit water. Proc. Natl Acad. Sci. USA, 98, 14931–14936[Abstract/Free Full Text].

    Zhou, R., et al. (2004) Hydrophobic collapse in multidomain protein folding. Science, 305, 1605–1609[Abstract/Free Full Text].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/1/99    most recent
btl538v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Zhou, R.
Right arrow Articles by Mudur, S.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Zhou, R.
Right arrow Articles by Mudur, S.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?