Bioinformatics Advance Access originally published online on October 5, 2007
Bioinformatics 2007 23(22):3009-3015; doi:10.1093/bioinformatics/btm481

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Four-Body Scoring Function for Mutagenesis

Chris Deutsch and Bala Krishnamoorthy *

Department of Mathematics, Washington State University, USA

*To whom correspondence should be addressed.


    ABSTRACT

Motivation: There is a need for an efficient and accurate computational method to identify the effects of single- and multiple-residue mutations on the stability and reactivity of proteins. Such a method should ideally be consistent and yet applicable in a widespread manner, i.e. it should be applied to various proteins under the same parameter settings, and have good predictive power for all of them.

Results: We develop a Delaunay tessellation-based four-body scoring function to predict the effects of single- and multiple-residue mutations on the stability and reactivity of proteins. We test our scoring function on sets of single-point mutations used by several previous studies. We also assemble a new, diverse set of 237 single- and multiple-residue mutations, from over 24 different publications. The four-body scoring function correctly predicted the changes to the stability of 169 out of 210 mutants (80.5%), and the changes to the reactivity of 17 out of 27 mutants (63%). For the mutants that had the changes in stability/reactivity quantified (using reaction rates, temperatures, etc.), an average Spearman rank correlation coefficient of 0.67 was achieved with the four-body scores. We also develop an efficient method for screening huge numbers of mutants of a protein, called combinatorial mutagenesis. In one study, 64 million mutants of a cold-shock nucleus binding domain protein 1CSQ, with six of its residues being changed to all possible (20) amino acids, were screened within a few hours on a PC, and all five stabilizing mutants reported were correctly identified as stabilizing by combinatorial mutagenesis.

Availability: All lists of mutants scored, and executables of programs developed as part of this study are available from this web page: http://www.wsu.edu/~kbala/Mutate.html

Contact: kbala{at}wsu.edu or bkrishna{at}math.wsu.edu

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION AND PREVIOUS WORK
Mutagenesis is the process of replacing one or more amino acids in a wild-type (WT) protein by alternate amino acids to generate a mutant protein. The goal of the process is to create a protein with certain desirable biochemical properties that are lacking in the WT protein. For instance, a protein that is more reactive, or more stable, in a particular reaction than the WT can often be generated by altering the identity of a single key residue (to generate a single-point mutant). Mutagenesis finds applications naturally in protein design, drug discovery and other similar areas (see the Supplementary Material for a listing of many such applications).

The experimental process of creating mutants can often be expensive and time consuming. To start with, it is often not straightforward to identify the key residue(s) that need to be mutated in order to achieve the desired biochemical properties. Once the critical residue positions are identified, it can still be non-trivial to decide what the new amino acids should be. Hence, biochemists often end up having to create and analyze a large number of mutants in order to identify a handful of desirable ones. In a typical example (1CSQ, one of the proteins included in our research), six amino acid positions were identified as desirable mutation sites. The experimentalists wanted to try all possible alternate amino acid combinations for these six residues, changing at least one amino acid in each case. The total number of single- and multiple-point mutants that they could have considered is 64 million (20^6 − 1 = 63 999 999, to be exact)! Of course, they only tried a few hundred of the mutants, and reported five of them that had the desired properties. As illustrated by this example, it is quite desirable to use computational methods to reduce the number of mutants that need to be generated experimentally.

One of the key steps in most protein structure prediction methods is the screening of multiple candidate conformations to select the best one(s), and a scoring function is used for this purpose. Scoring functions have been used for protein fold recognition for several years; many of them are studied in the following references (Krishnamoorthy and Tropsha, 2003; Miyazawa and Jernigan, 1985; Park and Levitt, 1996; Sippl, 1990). We could study the use of any such scoring function for the purpose of virtual mutagenesis: score the original (WT) protein, and then score the original protein after making the proposed changes to the sequence while keeping the structure unchanged. We could then correlate the changes in score and the effects of the mutations on various properties of the original protein, thus developing a predictor for the effects of mutations. In spite of the apparent simplicity, only a few such studies have been undertaken so far. Carter et al. (2001) obtained high correlations between changes in the four-body scores and the free energy changes (measured as ΔΔG values) resulting from mutations to residues in the hydrophobic cores of five different proteins. More recently, Masso et al. (2006) used the same four-body scoring function to study the structure–function correlations of the mutants of HIV-1 protease and T4 lysozyme.

On the other hand, direct computational strategies (i.e. without any connections to fold recognition) have been used to predict the effects of mutations on the stability of proteins. Gilis and Rooman (1997) developed database-derived potentials based on solvent accessibility to predict the effects of single-point mutations on the stability of proteins. Topham et al. (1997) used tabulated structural propensities of amino acids to predict the changes to the stability of several T4-lysozyme structures. In addition to the 3D structures of the WT proteins, this study also used the 3D structures of mutants. Guerois et al. (2002) developed the energy function called FOLD-X for predicting the ΔΔG values due to mutations. More recently, Cheng et al. (2006) developed support vector machines-based (SVM) models that used both sequence and structure information to predict stability changes due to single-point mutations. The SVM-based method has the best accuracy reported so far, 84%. The common feature of the above methods is that they all employ various sequence and structure interaction terms, and the best way to combine these terms is determined (i.e. various parameters are tuned) using a training set of mutations. The accuracy of these methods mainly stems from the training procedure, and hence is quite dependent on the training set of mutations used.

We believe that the accuracy of the underlying scoring function (or interaction terms) is most critical for predicting the effects of mutations. Our main goal is to develop, and test, an accurate underlying scoring function to predict the changes to the stability and reactivity of proteins due to mutations from scratch, i.e. without having to learn from any mutations. The scoring function we use is developed from the four-body scoring function that is based on the Delaunay tessellation of proteins. The latest (and most accurate) version of the four-body scoring function as used for protein decoy discrimination was proposed by the author previously (Krishnamoorthy and Tropsha, 2003), and has been tested extensively on many test sets of decoys. Though two previous studies (Carter et al., 2001; Masso et al., 2006) used the four-body scoring function to analyze mutagenesis, they both tested the scoring function only on limited sets of proteins. The first study considered mutations that are made only in the hydrophobic core of five proteins. The second study considers most possible mutations for two different proteins. Further, they both use an older version of the scoring function, and the second study uses different settings for scoring the mutants of the two proteins considered. We address these and other shortcomings when defining our four-body scoring function.

We first test our scoring function on 1558 mutants (all single-point mutations except three) considered by the previous studies for stability changes. The overall accuracy was 65.2% (see Section 3 for details). We also assemble a new, comprehensive list of proteins and their mutants (both single and multiple point), along with experimental data that quantifies the change in stability or reactivity of the WT protein. This test set of 237 mutants is collected from 24 different experimental studies. We correctly predict the effects of the mutations on the stability of 169 out of 210 mutants (80.5%), and those on the reactivity of 17 out of 27 (63%) mutants. For the sets of mutants that had the effects on stability or reactivity quantified (reaction rates, free energy changes, etc.), we obtain an average Spearman rank correlation coefficient of 0.67 between the quantifying data and the four-body scores. We also propose an efficient method to evaluate huge numbers of mutants by working with only the Delaunay tetrahedra that the mutated residues participate in (as opposed to scoring the WT protein repeatedly).


    2 METHODS AND MATERIALS
We describe the details of the four-body scoring function used for mutagenesis, and the test set of single- and multiple-residue mutants that we assembled, and then introduce combinatorial mutagenesis.

2.1 Four-body scoring function
The idea of a scoring function for protein fold recognition that is built on the Delaunay tessellation of proteins was first proposed by Tropsha et al. (1996). Singh, Tropsha and co-workers subsequently developed the scoring function (Munson and Singh, 1997; Tropsha et al., 1998), and also explored the possibility of using the same for ab initio protein folding (Gan et al., 2001). The formulation of the scoring function was improved by Krishnamoorthy and Tropsha (2003), and the applicability for decoy discrimination was tested on various decoy sets. Further extensive testing of the scoring function has been conducted recently by Krishnamoorthy et al. (Fowler et al., 2007; Krishnamoorthy and Stratton, 2007). This latest formulation defines the scoring function as the following log-likelihood ratio:


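$$q^{\alpha}_{ijkl} = \log \left( \frac{f^{\alpha}_{ijkl}}{p^{\alpha}_{ijkl}} \right)$$

where $f^{\alpha}_{ijkl}$ is the observed frequency of the residue quadruplet in tetrahedra of type α, and $p^{\alpha}_{ijkl}$ is the frequency expected by chance (Krishnamoorthy and Tropsha, 2003).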

i, j, k, and l represent the residue identities of the four amino acids (20 possibilities) in a Delaunay tetrahedron from the tessellation of the protein. Each amino acid is represented by a single point located at the centroid of the atoms in its side chain (including the Cα atom). α represents the type of the tetrahedron based on the backbone chain connectivity of the four participating amino acids. There are five tetrahedron types possible, and α takes one of the values 0, 1, 2, 3 or 4 corresponding to these types (Krishnamoorthy and Tropsha, 2003). The total score (or simply, the score) of a protein is then defined as the sum of the log-likelihood ratios of all tetrahedra in its Delaunay tessellation. A cutoff value of 10 Å (Angstroms) was used for the length of any edge of the tetrahedra that are scored, thus discarding biochemically irrelevant tetrahedra with huge edge lengths. Simply put, the score of a protein gives a measure of how well-packed its residues are (hence it was also called the Simplicial Neighborhood Analysis of Protein Packing, or SNAPP, score). Further, the correct way to interpret the score is in a relative sense, i.e. we can compare the scores of two otherwise similar conformations to quantify how one of them is packed better than the other.
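The definition above can be illustrated with a minimal sketch. It assumes the Delaunay tetrahedra have already been computed and are given as quadruples of residue numbers from a single chain, and that the log-likelihood ratios are available in a lookup table. The names `tetrahedron_type`, `total_score` and `q_table` are hypothetical, the numbering of the five types is an assumption (the text only states that α runs from 0 to 4), and the base of the logarithm is likewise assumed.

```python
from math import log


def log_likelihood(observed_freq, expected_freq):
    """Log-likelihood ratio of one quadruplet (natural log is an assumption)."""
    return log(observed_freq / expected_freq)


def tetrahedron_type(residue_numbers):
    """Classify a tetrahedron by the backbone connectivity of its residues.

    Assumed numbering, not taken from the text:
      0: no two residues are sequence neighbours
      1: exactly one pair of neighbours
      2: two separate pairs of neighbours
      3: three consecutive residues plus one distant residue
      4: four consecutive residues
    """
    a, b, c, d = sorted(residue_numbers)
    adjacent = (b - a == 1, c - b == 1, d - c == 1)
    count = sum(adjacent)
    if count in (0, 1):
        return count
    if count == 3:
        return 4
    # Two adjacencies: either two separate pairs or a run of three.
    return 2 if adjacent == (True, False, True) else 3


def total_score(tetrahedra, sequence, q_table):
    """Sum the log-likelihood ratios over all tetrahedra.

    tetrahedra: iterable of 4-tuples of residue numbers
    sequence:   dict mapping residue number -> amino acid code
    q_table:    dict mapping (type, sorted residue codes) -> ratio;
                quadruplets missing from the table contribute 0 (assumption)
    """
    score = 0.0
    for tet in tetrahedra:
        alpha = tetrahedron_type(tet)
        residues = tuple(sorted(sequence[r] for r in tet))
        score += q_table.get((alpha, residues), 0.0)
    return score
```

Sorting the residue codes makes the lookup independent of the order in which the four vertices are listed, which matches the idea that a quadruplet is an unordered set of residue identities.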

The backbone chain connectivity of the tetrahedra is not considered by Carter et al. (2001) or in the more recent study by Masso et al. (2006). Further, in both these studies, there is some ambiguity regarding the choice of side chain centers of residues versus backbone Cα atom coordinates that should be used to represent each amino acid. It is not desirable to change the settings and other parameters of the scoring function when scoring different proteins. In fact, Masso et al. use two different settings for the two proteins that they study. Their justification is the robustness of the four-body score under small perturbations of the points representing each amino acid. The authors claim that the total score of a protein does not change by much when the representation of the amino acids is changed from Cα to side chain centers.

The question of robustness of the Delaunay tessellation of proteins (and point sets in general) was addressed by Bandyopadhyay and Snoeyink (2004), who defined the concept of almost Delaunay simplices, where the positions of the points defining the simplex are allowed to vary in a controlled range (as opposed to being fixed). The four-body scoring function was tested under the almost-Delaunay setting to obtain decoy discrimination results that are roughly comparable to those obtained by the original scoring function. Still, the results obtained using the side chain center representation [by Krishnamoorthy and Tropsha (2003)] are markedly better than those obtained using Cα representation, or using almost Delaunay tetrahedra. In fact, Krishnamoorthy and Tropsha did obtain the Cα results for the decoy sets reported in their paper; but they were uniformly inferior to those using side chain centers, and hence not reported at all. They suggested that side chain centers be used always in order to obtain the most accurate results. This suggestion has been further validated by recent results obtained by Krishnamoorthy et al. (Fowler et al., 2007; Krishnamoorthy and Stratton, 2007).

Another aspect of the side chain center versus Cα option is the change in representation between the WT and the actual mutant protein. We use the structure of the WT for the mutant as well (only the sequence is changed). To check the validity of this assumption, we need to compare the 3D structure of mutants (when available) with that of the WT, as represented by the set of Delaunay tetrahedra formed. Naturally, the tetrahedra set could see several more changes with the side chain center representation than with the Cα representation, especially when small residues in the WT are replaced by bulky ones in the mutant (e.g. a GLY replaced by TYR). Topham et al. (1997) provide a list of PDB codes for several WT-mutant pairs of T4 lysozyme. We calculate the edit distance between the sets of tetrahedra in which the mutation sites participate in the WT and in the mutant, i.e. how many residue number substitutions have to be made to get from the tetrahedra set of the WT to that of the mutant, given as a percentage of the total number of residues (counting repetitions) in the tetrahedra in the set of the WT. Under the side chain center representation, the average edit distance is 35%, while under the Cα representation, it is only 12%. At the same time, the above calculation completely ignores the sequence of the residues. Even though the Cα atoms of the WT are a lot closer to the Cα atoms of the actual mutant, the sequence–structure correlations are far more accurate under the side chain center representation (Krishnamoorthy and Tropsha, 2003). To make sure, we scored our set of mutants using Cα atoms, only to obtain an accuracy of < 50%. Hence, we stick with the side chain centers.

On a related note, the claim of Masso et al. that the score of the protein does not change much when Cα atoms are used in place of side chain centers might hold only for the total score of the protein, and not for the case of change in the total score, especially when the change is small. When only a single residue is changed, only a small subset of the full set of Delaunay tetrahedra is affected. The change in the total score in this case might be sensitive to the way the residues are represented, and also to perturbations in the positions of these residues. The residues that are in the inner portions of the protein (buried) participate in many more tetrahedra than those that are on the outside (surface), and hence the robustness result might apply more for the case of the buried residues. All mutations studied by Carter et al. (2001) are performed on hydrophobic core residues. On the other hand, many mutations that we considered involve changes to surface residues. Hence, we suggest the consistent use of side chain centers when scoring mutations using the four-body scoring function. We also use the weights for the scores of different classes of tetrahedra as defined by Krishnamoorthy and Tropsha (2003).

In addition, some key long-range interactions between amino acids are missed out by the use of a 10 Å cutoff on the Delaunay edges, especially for the case of surface residues. Hence, we use an increased cutoff of 12 Å when scoring mutations. Notice that the contacts made by most buried residues remain unchanged, as such contacts are well within the 10 Å range. At the same time, several key interactions of surface residues that are left out by the 10 Å cutoff are now included in the calculations, thus making the scoring function more accurate.
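The effect of the edge-length cutoff can be sketched as a simple filter over the tetrahedra. This is an illustration only: the function names and the coordinate dictionary are hypothetical, and the tetrahedra are assumed to have been computed beforehand, with each residue represented by a single 3D point.

```python
from itertools import combinations
from math import dist


def longest_edge(tetrahedron, coords):
    """Length (in Angstroms) of the longest of the six edges."""
    return max(dist(coords[u], coords[v])
               for u, v in combinations(tetrahedron, 2))


def apply_cutoff(tetrahedra, coords, cutoff=12.0):
    """Keep only the tetrahedra whose edges are all within the cutoff."""
    return [t for t in tetrahedra if longest_edge(t, coords) <= cutoff]


# Toy coordinates (hypothetical residues, not taken from any protein).
coords = {1: (0, 0, 0), 2: (3, 0, 0), 3: (0, 4, 0),
          4: (0, 0, 5), 5: (0, 0, 11)}
tetrahedra = [(1, 2, 3, 4), (1, 2, 3, 5)]

kept_at_10 = apply_cutoff(tetrahedra, coords, cutoff=10.0)
kept_at_12 = apply_cutoff(tetrahedra, coords, cutoff=12.0)
```

In this toy case the second tetrahedron has an edge between 10 and 12 Å, so it is discarded at the 10 Å cutoff but retained at 12 Å, which is the kind of longer-range contact the increased cutoff is meant to capture.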

Under the settings described above, we calculate the change in total score between the WT and the mutant protein (mutant score − WT score). A positive change (i.e. the mutant score is more than the WT score) indicates that the mutant is more stable than the WT, while a negative change indicates lower stability. Instead of using the raw change in total score, we use the fraction (given as percentage) of change to the sum of the scores of the tetrahedra that see any change due to the mutations. Thus, we exclude from the calculations those tetrahedra that are present both in the WT and the mutant. We use a cutoff value of 0.1% to determine if this percentage change is significant (i.e. if the percentage change is below 0.1% in absolute value, we assume there is no change).
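The percentage change and the 0.1% significance threshold can be sketched as follows. The text does not spell out the denominator, so this sketch assumes it is the absolute value of the WT score summed over the affected tetrahedra; that choice, and the function names, are assumptions made for illustration.

```python
def percent_change(wt_scores, mutant_scores):
    """Percentage change in score over the affected tetrahedra only.

    wt_scores, mutant_scores: per-tetrahedron scores of the tetrahedra
    that see a change, in the WT and in the mutant respectively.
    The denominator (absolute WT sum) is an assumption.
    """
    wt_sum = sum(wt_scores)
    mutant_sum = sum(mutant_scores)
    if wt_sum == 0:
        raise ValueError("WT score of affected tetrahedra is zero")
    return 100.0 * (mutant_sum - wt_sum) / abs(wt_sum)


def classify(pct, threshold=0.1):
    """Map a percentage change to a predicted effect on stability."""
    if abs(pct) < threshold:
        return "no change"
    return "stabilizing" if pct > 0 else "destabilizing"
```

Dividing by the absolute value keeps the sign of the percentage equal to the sign of the raw change, so a positive value still indicates a predicted gain in stability even when the WT sum is negative.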

We also correlate increased (decreased) activity with a negative (positive) change in the total score. The intuition behind this definition is that well-packed proteins are typically not highly active, and hence the high total score is correlated with less activity. We must mention, though, that as of now, we could only assemble a limited number of mutants with activity data to test this assumption (see Section 3).

2.2 Test sets of mutants
The ProTherm database (Kumar et al., 2006) lists a huge number of mutations, and some of the previous studies have created mutant data sets from there (Cheng et al., 2006). At the same time, ProTherm typically does not list multi-point mutants, and reactivity data are not listed in all cases either. Hence, we have searched the literature to identify a comprehensive list of single- and multiple-point mutations. Overall, there are 237 mutants taken from 24 different papers. A total of 210 of the mutants are analyzed for changes in stability, while the remaining 27 are analyzed for changes in reactivity. After assembling the data set, we found that ProTherm in fact listed 80 of them. The whole data set, along with the performance of the four-body scoring function on the mutants, is presented in Table 1. We briefly describe the various types of mutations assembled in the Supplementary Material.


Table 1. Test set of mutations studied

 
Apart from our mutant list, we also analyze the two lists of 1096 and 388 single-point mutants considered by Cheng et al. (2006), the 50 single point mutants that were considered outliers in the study by Guerois et al. (2002), and 24 T4 lysozyme mutants (3 being multiple point) considered by Topham et al. (1997).

2.3 Combinatorial mutagenesis
After the potential mutation sites in the WT are identified, it is often not straightforward to decide the new residues to be put in these sites. Experimentalists might want to try several amino acids for each mutation site, and hence are faced with the task of generating a large number of mutants. For example, suppose three potential mutation sites have been identified, and we want to try the residues Ala, Val, Ile for the first site, Val and Leu for the second site and Cys, His, Lys and Arg for the third site. The total number of mutants we have to analyze is then 3 × 2 × 4 = 24. In one of the studies that we used to create our list of mutants, Martin et al. (2001) identified six amino acid sites of the protein 1CSQ to be mutated to all other amino acids. The result is a staggering 20^6 = 64 000 000 proteins to be analyzed (including the WT). Our scoring function is most valuable for such mutagenesis experiments: we could identify (computationally) a relatively small set of mutants that are potentially the most suitable ones, and experimentally generate them before considering others. Since we consider all possible combinations of mutations at the individual sites, we term the process of scoring all possible mutants using the four-body scoring function as combinatorial mutagenesis.
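The counting argument above can be checked with a short sketch that enumerates the combinations for the three-site example and counts them for the six-site case. The helper names are hypothetical.

```python
from itertools import product
from math import prod


def count_mutants(choices_per_site):
    """Number of combinations, one residue chosen per mutation site."""
    return prod(len(choices) for choices in choices_per_site)


def enumerate_mutants(choices_per_site):
    """Lazily yield every combination as a tuple of residue names."""
    yield from product(*choices_per_site)


# The three-site example from the text.
choices = [("Ala", "Val", "Ile"), ("Val", "Leu"), ("Cys", "His", "Lys", "Arg")]
n_small = count_mutants(choices)

# Six sites, each allowed to take any of the 20 amino acids (WT included).
n_six_sites = count_mutants([range(20)] * 6)
```

Because `enumerate_mutants` is a generator, the combinations are produced one at a time and never held in memory together, which matters once the count reaches tens of millions.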

As seen by the example of 1CSQ, the number of combinations could be quite huge, and for such cases, the usual way of scoring the mutants turns out to be highly inefficient. By default, we would score the WT protein once, and then score the same again for each mutant, with the appropriate changes made in the amino acid sequence. Each call to the four-body scoring function involves the computation of the Delaunay tessellation of the protein, which proves to be the bottleneck as far as the overall running time of the algorithm is concerned. The most efficient algorithms for computing Delaunay tessellations have a worst-case running time of O(n log n), where n is the number of points (see Edelsbrunner, 2001, Chapters 1, 5). The average running times in practice also follow the same bounds. At the same time, we notice that the structure of the WT is not altered in any of the mutants, and hence the residue numbers of the four amino acids forming each tetrahedron remain unaltered, even though the identities of some of the amino acids are changed. Hence we calculate the Delaunay tessellation only once as part of combinatorial mutagenesis, when scoring the WT protein. We just need to change the amino acid identities corresponding to each mutant.

As a result, only those tetrahedra that involve one or more of the mutation sites see changes to the sequence identities of the participating amino acids. For instance, the six mutation sites of 1CSQ participate in (i.e. at least one of the six is in) 58 Delaunay tetrahedra, which is only a fraction of the total of 256 tetrahedra formed by the entire protein (see Fig. 1). From the theoretical point of view, there are some bounds for the number of Delaunay triangles that each point (out of the total n points) participates in when we consider the 2D case (Edelsbrunner, 2001), and some more conservative estimates could be derived in 3D. From among the 4000 odd proteins that we analyzed, the maximum number of tetrahedra that a single amino acid participated in is 48 (43-Arg in 2BOQ), but the typical number of tetrahedra is much smaller (average is 17.65). This number is even smaller if the amino acid in question is on the surface of the protein. So we identify the smaller set of tetrahedra that the mutation sites participate in (by searching the Delaunay tessellation of the WT). For each mutant, we calculate the difference in the sum of the log-likelihood ratios for these tetrahedra alone in the WT and in the mutant, and we use this difference to score and rank all the mutants. This implementation of combinatorial mutagenesis proves to be far more efficient than repeated calls to the default four-body scoring function for each mutant (see Section 3.1).
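The incremental scheme described above might be sketched as follows. The sketch assumes the tessellation of the WT is already available as quadruples of residue numbers, and for brevity it looks up scores by residue quadruplet alone, ignoring the tetrahedron type. All names and the toy score table are hypothetical.

```python
def affected_tetrahedra(tetrahedra, mutation_sites):
    """Tetrahedra containing at least one mutation site (found once)."""
    sites = set(mutation_sites)
    return [t for t in tetrahedra if sites.intersection(t)]


def subset_score(tetrahedra, sequence, q_table):
    """Sum of scores over the given tetrahedra (type ignored for brevity)."""
    total = 0.0
    for tet in tetrahedra:
        key = tuple(sorted(sequence[r] for r in tet))
        total += q_table.get(key, 0.0)  # unseen quadruplets score 0 (assumption)
    return total


def score_difference(affected, wt_sequence, substitutions, q_table):
    """Mutant minus WT score, restricted to the affected tetrahedra."""
    mutant_sequence = {**wt_sequence, **substitutions}
    return (subset_score(affected, mutant_sequence, q_table)
            - subset_score(affected, wt_sequence, q_table))


# Toy data: 7 residues, 3 tetrahedra, one mutation site (residue 2).
tetrahedra = [(1, 2, 3, 4), (2, 3, 4, 5), (4, 5, 6, 7)]
wt = {1: "A", 2: "L", 3: "G", 4: "V", 5: "C", 6: "K", 7: "E"}
q_table = {("A", "G", "L", "V"): 0.5, ("C", "G", "L", "V"): 0.2,
           ("C", "C", "G", "V"): 0.9, ("A", "C", "G", "V"): -0.1}

affected = affected_tetrahedra(tetrahedra, [2])
delta = score_difference(affected, wt, {2: "C"}, q_table)
```

The list `affected` is computed once per protein and reused for every mutant, so the cost per mutant depends only on the number of tetrahedra touching the mutation sites and not on the size of the protein.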


Fig. 1. 1CSQ (backbone trace shown as the thick line) along with the Delaunay tetrahedra that the six mutation sites participate in. The six residues considered for mutation are 2-Leu, 3-Glu, 46-Ala, 64-Thr, 66-Glu, and 67-Ala. The spheres are at the side chain centers of these residues, and represent the six residues in the increasing order when viewed from the bottom part of the figure to the top. The thin lines are the edges of the 58 tetrahedra that contain at least one of these six residues. If we consider all 67 amino acids, we get a total of 256 Delaunay tetrahedra (at 12 Å distance cutoff). This image was created using VMD (Humphrey et al., 1996).

 

    3 RESULTS AND DISCUSSION
We say that the four-body scoring function predicts the effect of a mutation correctly if an increase (correspondingly, a decrease) in the four-body total score is observed for mutations that are experimentally observed to be stabilizing or decreasing the reactivity (destabilizing or increasing the reactivity) of the WT protein. Overall on our data set, 78% (186) of the mutants were identified correctly (see Table 1), with 169 out of 210 correct predictions for stability (80.5%) and 17 out of 27 for activity (63%). We are currently undertaking a detailed examination of how the scoring function performed for each of the 24 mutant sets, and especially the cases of de Antonio et al. (2000), Takano et al. (1999) and Kong et al. (1993) (articles #3, #11 and #23 in Table 1), for which the scoring function failed on all the mutants considered from each set. For the human lysozyme mutants studied by Takano et al. only the N-terminal residue is mutated, which does not form enough Delaunay tetrahedra (being on one end of the chain, and on the surface of the protein).
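The correctness criterion and the reported percentages can be restated in a short sketch. The function names and the string labels are hypothetical, and the sketch leaves out the 0.1% significance threshold for simplicity.

```python
def is_correct(delta_score, observed, kind):
    """Check one prediction against the experimental observation.

    delta_score: mutant score minus WT score
    observed:    "increase" or "decrease" of the measured property
    kind:        "stability" or "activity"
    A score increase is taken to predict higher stability or lower activity.
    """
    score_up = delta_score > 0
    if kind == "stability":
        return score_up == (observed == "increase")
    if kind == "activity":
        return score_up == (observed == "decrease")
    raise ValueError(f"unknown kind: {kind}")


def accuracy(n_correct, n_total):
    """Percentage of correct predictions."""
    return 100.0 * n_correct / n_total


stability_acc = round(accuracy(169, 210), 1)   # stability mutants
activity_acc = round(accuracy(17, 27), 1)      # activity mutants
overall_acc = round(accuracy(169 + 17, 210 + 27), 1)
```

The overall figure follows directly from the two subsets, since 169 + 17 = 186 correct predictions out of 210 + 27 = 237 mutants.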

As of now, we have only a limited number (27) of mutants with activity data available. We need more such mutants to test our assumption that high total scores correspond to lower activity. Still, the accuracy of 63% on this set of mutants with activity data is encouraging.

The effects of the mutants are quantified for 10 of the mutant sets, 5 of them being different proteins studied by Carter et al. (2001). Even though the authors calculated linear correlation coefficients between four-body scores and free energy changes of these mutants, there is no clear evidence to suggest that the four-body scores follow a linear relationship with the experimental quantities reported. [Masso et al. (2006) also report linear correlation coefficients, but see Section 3.2 for discussion on their work.] The Spearman rank correlation coefficient between the four-body scores and the experimental values seems more appropriate, with no assumptions of linear relationships involved. We present the rank correlation coefficients for the 10 mutant sets in Table 2. The overall average Spearman rank correlation coefficient is 0.67. The rank correlations for the set of mutants studied by Carter et al. are markedly high—the average for these five mutant sets is 0.77. This result is not surprising, as all these mutations are done on sites in the hydrophobic cores of the proteins in question. In general, the more tetrahedra the mutation sites participate in, the more accurate the prediction is.
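For reference, the Spearman rank correlation coefficient is the Pearson correlation computed on the ranks of the two data series. A minimal pure-Python sketch, with tied values given their average rank, is shown below; the function names are hypothetical and this is not necessarily the procedure used to produce Table 2.

```python
from math import sqrt


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson linear correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Since only the ordering of the values matters, any monotonic relationship between scores and experimental values yields a coefficient of 1 or −1, whether or not the relationship is linear.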


Table 2. Spearman rank correlation coefficients for mutant sets whose change in stability/reactivity has been quantified

 
The performance of our scoring function on mutant data sets compiled by others is as follows (% correct predictions): 66% for the 1096 set and 63% for the 388 set from Cheng et al. (2006), 60% for the 50 outliers from Guerois et al. (2002) and 79% for the 24 mutants from Topham et al. (1997). We also scored our 210 mutants with stability data using the FOLD-X program (Guerois et al., 2002), and 68% of these mutants were identified correctly by this program (compare to our accuracy of 80.5%). While the web interface for the FOLD-X program is handy when scoring a handful of mutations, we found it quite tedious to score all the 210 mutants from our test set (it took us several hours). We believe that researchers should provide executable file(s) for scoring functions that could handle large sets of mutations simultaneously.

The key point to note when comparing our scoring function to others is that unlike the previous methods, we have not trained our scoring function on a set of mutations. Thus, an SVM trained on our scoring function could well have the largest accuracy yet reported—we are currently trying to implement this idea.

3.1 Combinatorial mutagenesis: an example
Our implementation of combinatorial mutagenesis (Section 2.3) scored all 64 000 000 mutants of 1CSQ (including the WT) within 6 h (on a typical PC). In comparison, calling the four-body scoring function separately for each mutant did not finish in 24 h. The original authors reported only five stabilizing mutants. Combinatorial mutagenesis predicted all five of them correctly. Furthermore, they were in the top 17.7% (of 64 million mutants). Analysis of the top-scoring mutants shows that high scores are assigned to mutants with cysteines in the selected sites. As illustrated in Figure 1, the six mutation sites participate in several tetrahedra together, i.e. they are linked to each other. The occurrence of stabilizing disulfide bonds between cysteines is scored among the highest by the four-body scoring function, and hence the mutants with two or more cysteines are naturally scored high (Tropsha et al., 1996, 1998).

3.2 Comments on the work of Masso et al. (2006)
Masso, Lu and Vaisman analyzed mutants reported in three different papers (and some more mutants that were not reported in these papers) using an earlier version of the four-body scoring function. In the first of these papers, Loeb et al. (1989) studied mutants of HIV-1 Protease. They reported a western blot assay analysis as well as an enzyme activity analysis. Most mutations were reported to have ambiguous ‘WT-like’ behavior. Furthermore, only mutations of the WT western blot assay classification could be considered, as they were the only group for which the production of the protein was explicitly proven. This is an important factor, as the authors were mutating an operon, which leads to the production of multiple HIV proteins. Our results on this set were poorer than the other reported results. Out of 56 mutants, only 21 were correctly scored. This lower accuracy may be accounted for by the unreported global destabilization of the enzyme. In another paper considered, Wrobel et al. (1998) analyzed mutants of HIV-1 reverse transcriptase. The analysis as well as the results presented were similar to those presented by Loeb et al. and a subset of appropriate mutations was selected in a similar manner. From among the 105 selected mutants, the four-body scoring function correctly predicted 51 mutants. Once again, the incorrect scoring of the remaining mutants in this study could very well be due to the global destabilization of the protein, which the authors did not report explicitly.


    4 CONCLUSIONS
 
The strengths of the four-body scoring function for predicting the stability and reactivity effects of mutations are wide applicability, consistency (one setting works for all cases), computational efficiency (through combinatorial mutagenesis) and accuracy. Combinatorial mutagenesis can in principle be used even for a single mutation, or a few of them, but the gain in computational efficiency might not be noticeable.

Our test set is comprehensive but not complete; we plan to further explore previous as well as forthcoming literature to add new sets of mutants to the current ones. Even though the same settings are recommended for applying our scoring function to all proteins, we could customize it with different settings for specific classes of proteins, thus increasing its accuracy for those classes (of course, the scoring function will not perform as well under such customized settings for other classes of proteins). Another idea for increasing the accuracy of the scoring function is to use different weights for the various quadruplets, rather than simply adding them all up. We could divide the set of mutants into training and test sets, determine the weights by learning from the training set and then validate them on the test set. We are currently investigating these and other ideas.
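The weighting idea can be sketched as follows. This is a hypothetical illustration, not a method from the paper: each mutant is assumed to be represented by the change in quadruplet counts relative to the WT together with a label (+1 stabilizing, -1 destabilizing), `base_scores` stands in for the per-quadruplet scores, and a simple perceptron-style rule is swapped in as the learning step.

```python
import random


def weighted_score(quad_counts, weights, base_scores):
    """Sum of per-quadruplet scores, each scaled by a weight (default 1.0)."""
    return sum(weights.get(q, 1.0) * base_scores[q] * n
               for q, n in quad_counts.items())


def split_train_test(mutants, test_fraction=0.3, seed=0):
    """Shuffle reproducibly and split into (training set, test set)."""
    rng = random.Random(seed)
    shuffled = list(mutants)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_weights(train, base_scores, epochs=20, rate=0.1):
    """Perceptron-style fit: adjust weights whenever the sign is wrong."""
    weights = {}
    for _ in range(epochs):
        for quad_counts, label in train:
            score = weighted_score(quad_counts, weights, base_scores)
            if score * label <= 0:  # misclassified (or undecided) mutant
                for q, n in quad_counts.items():
                    step = rate * label * base_scores[q] * n
                    weights[q] = weights.get(q, 1.0) + step
    return weights


def accuracy(data, weights, base_scores):
    """Fraction of mutants whose score has the same sign as the label."""
    hits = sum(1 for quad_counts, label in data
               if weighted_score(quad_counts, weights, base_scores) * label > 0)
    return hits / len(data)
```

With an empty `weights` dictionary every quadruplet has weight 1.0, so `weighted_score` reduces to the plain sum. Weights fitted on the training set would then be checked with `accuracy` on the held-out test set.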


    ACKNOWLEDGEMENT
 
Both authors are thankful for the support provided by the NSF UBM Grant DEB 0531870 while working on this research project.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Martin Bishop

Received on April 5, 2007; revised on September 9, 2007; accepted on September 22, 2007

    REFERENCES
 

    Almog O, et al. Structural basis of thermostability: analysis of stabilizing mutations in subtilisin BPN'. J. Biol. Chem. (2002) 277:27553–27558.

    Bandyopadhyay D, Snoeyink J. Almost-Delaunay simplices: nearest neighbor relations for imprecise points. In: SODA '04: Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms (2004) Philadelphia, USA: Society for Industrial and Applied Mathematics. 410–419.

    Bonander N, et al. Crystal structure of the disulfide bond-deficient azurin mutant C3A/C26A. Eur. J. Biochem. (2000) 267:4511–4519.

    Braun P, et al. Alanine insertion scanning mutagenesis of lactose permease transmembrane helices. J. Biol. Chem. (1997) 272:29566–29571.

    Brownlie PD, et al. The three-dimensional structures of mutants of porphobilinogen deaminase: toward an understanding of the structural basis of acute intermittent porphyria. Protein Sci. (1994) 3:1644–1650.

    Carter C.W. Jr, et al. Four-body potentials reveal protein-specific correlations to stability changes caused by hydrophobic core mutations. J. Mol. Biol. (2001) 311:625–638.

    Chen G.-Q, Gouaux E. Reduction of membrane protein hydrophobicity by site-directed mutagenesis: introduction of multiple polar residues in helix D of bacteriorhodopsin. Protein Eng. (1997) 10:1061–1066.

    Cheng J, et al. Prediction of protein stability changes for single-site mutations using support vector machines. Proteins Struct. Funct. Bioinformatics (2006) 62:1125–1132.

    de Antonio C, et al. Assignment of the contribution of the tryptophan residues to the spectroscopic and functional properties of the ribotoxin alpha-sarcin. Proteins Struct. Funct. Genet. (2000) 41:350–361.

    Dvir H, et al. X-ray structure of human acid-β-glucosidase, the defective enzyme in Gaucher disease. EMBO Rep. (2003) 4:704–709.

    Edelsbrunner H. Geometry and Topology for Mesh Generation (2001) England: Cambridge University Press.

    Erwin C, et al. Effects of engineered salt bridges on the stability of subtilisin BPN'. Protein Eng. (1990) 4:87–97.

    Fowler A, et al. A hierarchy of scoring functions for protein decoy discrimination based on Delaunay tessellation of proteins. In: Bioinformatics (2007) under review.

    Funahashi J, et al. Positive contribution of hydration structure on the surface of human lysozyme to the conformational stability. J. Biol. Chem. (2002) 277:21792–21800.

    Gan HH, et al. Lattice protein folding with two and four-body statistical potentials. Proteins Struct. Funct. Genet. (2001) 43:161–174.

    Ge X, et al. Preliminary study on the structural basis of the antifungal activity of a rice lipid transfer protein. Protein Eng. (2003) 16:387–390.

    Gilis D, Rooman M. Predicting protein stability changes upon mutation using database-derived potentials: solvent accessibility determines the importance of local versus non-local interactions along the sequence. J. Mol. Biol. (1997) 272:276–290.

    Guerois R, et al. Predicting changes in the stability of proteins and protein complexes: a study of more than 1000 mutations. J. Mol. Biol. (2002) 320:369–387.

    Hahn M, et al. Crystal structure and site-directed mutagenesis of Bacillus macerans endo-1,3-1,4-beta-glucanase. J. Biol. Chem. (1995) 270:3081–3088.

    Huang H.-B, et al. Site-directed mutagenesis of amino acid residues of protein phosphatase 1 involved in catalysis and inhibitor binding. Am. Soc. Biochem. Mol. Biol. Inc. (1996) 271:2574–2577.

    Humphrey W, et al. VMD – Visual Molecular Dynamics. J. Mol. Graph. (1996) 14:33–38.

    Köditz J, et al. Probing the unfolding region of ribonuclease A by site-directed mutagenesis. Eur. J. Biochem. (2004) 271:4147–4156.

    Kong K.-H, et al. Site-directed mutagenesis study on the roles of evolutionally conserved aspartic acid residues in human glutathione S-transferase P1-1. Protein Eng. (1993) 6:93–99.

    Korkegian A, et al. Computational thermostabilization of an enzyme. Science (2005) 308:857–860.

    Krishnamoorthy B, Stratton K. Ranking CASP predictions using a four-body scoring function. In: Proteins Struct. Funct. Bioinformatics (2007) under review.

    Krishnamoorthy B, Tropsha A. Development of a four-body statistical pseudo-potential to discriminate native from non-native protein conformations. Bioinformatics (2003) 19:1540–1548.

    Kumar MS, et al. ProTherm and ProNIT: thermodynamic databases for proteins and protein-nucleic acid interactions. Nucleic Acids Res. (2006) 34:D204–D206. Database issue. ProTherm link: http://gibk26.bse.kyutech.ac.jp/jouhou/Protherm/protherm.html

    Loeb DD, et al. Complete mutagenesis of HIV-1 protease. Nature (1989) 340:397–400.

    Martin A, et al. In vitro selection of highly stabilized protein variants with optimized surface. J. Mol. Biol. (2001) 309:717–726.

    Masso M, et al. Computational mutagenesis studies of protein structure-function correlations. Proteins Struct. Funct. Bioinformatics (2006) 64:234–245.

    Miyazawa S, Jernigan RL. Estimation of effective inter-residue contact energies from protein crystal structures: a quasi-chemical approximation. Macromolecules (1985) 18:534–552.

    Munson PJ, Singh RK. Statistical significance of hierarchical multi-body potentials based on Delaunay tessellation and their application in sequence-structure alignment. Protein Sci. (1997) 6:1467–1481.

    Oppermann U.C.T, et al. Active site directed mutagenesis of 3β/17β-hydroxysteroid dehydrogenase establishes differential effects on short-chain dehydrogenase/reductase reactions. Biochemistry (1997) 36:34–40.

    Ormö M, et al. Residues important for radical stability in ribonucleotide reductase from Escherichia coli. J. Biol. Chem. (1995) 270:6570–6576.

    Park B, Levitt M. Energy functions that discriminate X-ray and near-native folds from well-constructed decoys. J. Mol. Biol. (1996) 258:367–392.

    Pathange LP, et al. Correlation between protein binding strength on immobilized metal affinity chromatography and the histidine-related protein surface structure. Anal. Chem. (2006) 78:4443–4449.

    Siadat OR, et al. The effect of engineered disulfide bonds on the stability of Drosophila melanogaster acetylcholinesterase. BMC Biochem. (2006) 7:12.

    Sippl M. Calculation of conformational ensembles from potentials of mean force. J. Mol. Biol. (1990) 213:859–883.

    Sun D, et al. Active-site residues are critical for the folding and stability of methylamine dehydrogenase. Protein Eng. (2001) 14:675–681.

    Suresh MV, et al. Role of the property of C-reactive protein to activate the classical pathway of complement in protecting mice from pneumococcal infection. J. Immunol. (2006) 176:4369–4374.

    Takano K, et al. Effect of foreign N-terminal residues on the conformational stability of human lysozyme. Eur. J. Biochem. (1999) 266:675–682.

    Topham CM, et al. Prediction of the stability of protein mutants based on structural environment-dependent amino acid substitution and propensity tables. Protein Eng. (1997) 10:7–21.

    Tropsha A, et al. Delaunay tessellation of proteins: four body nearest neighbor propensities of amino acid residues. J. Comput. Biol. (1996) 3:213–222.

    Tropsha A, et al. Compositional preferences in quadruplets of nearest neighbor residues in protein structures: statistical geometry analysis. In: IEEE Symposia on Intelligence and Systems (1998) 163–168.

    Wrobel JA, et al. A genetic approach for identifying critical residues in the fingers and palm subdomains of HIV-1 reverse transcriptase. Proc. Natl Acad. Sci. USA (1998) 95:638–645.

