Skip Navigation


Bioinformatics Advance Access originally published online on February 5, 2008
Bioinformatics 2008 24(7):916-923; doi:10.1093/bioinformatics/btn050
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
24/7/916    most recent
btn050v3
btn050v2
btn050v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Stout, M.
Right arrow Articles by Krasnogor, N.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Stout, M.
Right arrow Articles by Krasnogor, N.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Prediction of recursive convex hull class assignments for protein residues

Michael Stout 1, Jaume Bacardit 1,2, Jonathan D. Hirst 3 and Natalio Krasnogor 1,*

1Automated Scheduling, Optimization and Planning research group, School of Computer Science, 2Multi-disciplinary Centre for Integrative Biology, School of Biosciences and 3School of Chemistry, University of Nottingham, UK

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATERIALS AND METHODS
 3 LEARNABILITY OF RCH...
 4 DISCUSSION
 5 CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: We introduce a new method for designating the location of residues in folded protein structures based on the recursive convex hull (RCH) of a point set of atomic coordinates. The RCH can be calculated with an efficient and parameterless algorithm.

Results: We show that residue RCH class contains information complementary to widely studied measures such as solvent accessibility (SA), residue depth (RD) and to the distance of residues from the centroid of the chain, the residues’ exposure (Exp). RCH is more conserved for related structures across folds and correlates better with changes in thermal stability of mutants than the other measures. Further, we assess the predictability of these measures using three types of machine-learning technique: decision trees (C4.5), Naive Bayes and Learning Classifier Systems (LCS) showing that RCH is more easily predicted than the other measures. As an exemplar application of predicted RCH class (in combination with other measures), we show that RCH is potentially helpful in improving prediction of residue contact numbers (CN).

Contact: nxk{at}cs.nott.ac.uk

Supplementary Information: For Supplementary data please refer to Datasets: www.infobiotic.net/datasets, RCH Prediction Servers: www.infobiotic.net


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATERIALS AND METHODS
 3 LEARNABILITY OF RCH...
 4 DISCUSSION
 5 CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 
Prediction of the three-dimensional structure of proteins from their constituent amino acid sequences continues to be one of the key goals of structural biology and a wide range of predictive strategies has been investigated. Steady improvements in predictive accuracy have resulted from decomposition of the problem into subproblems, such as prediction of secondary structural elements [approaching a theoretical prediction limit of 80% (Dor and Zhou, 2007; Wood and Hirst, 2005)], of residue coordination number [at over 80% (Bacardit et al., 2006)] and of residue solvent accessibility [at over 77% using consensus predictors (Gianese and Pascarella, 2006)]. Burial of hydrophobic groups within the protein core is a primary driving force for protein structure formation. Characterizations of residue accessibility to solvent are, therefore, important for protein structure prediction (PSP), potentially helping to constrain the search space to be explored using de novo methods (Baldi and Pollastri, 2002). Whilst classifying residue neighbourhood density as high or low will generally assign the high class to residues buried within the structure and the low class to residues exposed on the surface, residues lining cavities in the structure that may be functionally significant (Chen et al., 2007) can have a low coordination number even when located far from the surface. Incorporation of complementary residue solvent accessibility and residue depth information improves fold recognition (Liu et al., 2007). A range of measures of residue location have been studied. Lee and Richards (1971) used a spherical probe method to measure the solvent accessible surface of residues and recently Kawabata and Go (2007) have used adjustable probe parameters to identify putative ligand binding pockets on protein surfaces. Solvent accessibility, however, is difficult to compute and does not distinguish between residues below the surface. Hence, atom/residue depth (RD), the distance of an atom/residue from its nearest solvent accessible neighbour, was introduced (Chakravarty and Varadarajan, 1999) and efficient algorithms are available to compute RD for a given structure (Pintar et al., 2003; Vlahovicek et al., 2005). Whilst SA emphasises burial, RD emphasises exposure and depends on the method used to identify surface atoms/residues. Hence, Half Sphere Exposure (HSE), has been recently proposed (Hamelryck, 2005). HSE, like CN, counts neighbouring residues but distinguishes two regions (half spheres) around each residue based on the C{alpha}Cβ vector, i.e. a 2D measure of residue location. In addition, the distance (exposure) of residues from the chain centroid is a potentially interesting measure being related to the location of catalytic residues in enzyme structures (Ben-shimon and Eisenstein, 2005). Measures of atom/residue location typically depend on specific parameters such as probe size for SA or contact radius for CN.

In this paper, we introduce a new approach to stratifying residues in protein structures by recursively identifying the convex hull layer to which each residue belongs. The convex hull of a set of points is a parameterless, mathematically rigorous and unambiguous approach to identifying the points on the exterior of a point set, analogous to identifying those points that contact the enclosing surface when the point set is tightly wrapped. The convex hull is simple and efficient (O(n * log n)) to compute (Preparata and Hong, 1977). The recursive convex hull (RCH) of a point set is obtained by identification of the minimal point set that generates the convex hull (the vertices) and removal of these points from the point set followed by recursively applying these steps to the remaining points to identify subsequent hulls. Applied to the point set of coordinates of residues in a protein chain, a series of hulls is obtained that groups the residues by their distance from the convex surface of the structure. The recursive convex hulls of a 2D off-lattice protein model are shown in Figure 1 along with a representation of the outer convex hull of a 3D point set derived from the Cβ atomic coordinates of residues in a real protein chain.


Figure 1
View larger version (36K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. Left: RCH of a 2D off-lattice protein model. The backbone is represented by coloured circles joined by solid black lines. Residues on the outermost RCH are coloured red, subsequent recursive convex hulls are coloured blue, green, and yellow, with residues on the innermost recursive convex hull coloured purple. Right: A graphical representation of the outer RCH of residues in a 3D model of a natural protein chain (PDB Id. 1P4X).

 
Convex hulls have found a wide range of applications in studies of molecular structure. Here we give a brief, by no means complete, review. Badel-chagnon and colleagues introduced a notion of the ‘molecular surface convex hull’ to define the depth of any molecular surface point (Badel-chagnon et al., 1994) and Lin and colleagues used convex hulls to align 11 randomly generated bio-active tachykinin peptides, finding that 3D convex hulls can be used to align even these flexible structures (Lin et al., 1999; Lin and Lin, 2001). Meier et al. (1995) proposed a convex hull-based segmentation technique (that makes few assumptions about the underlying surface) to find characteristically shaped regions of molecular surfaces for prediction of possible protein docking sites. Liang and Dill (2001) used convex hulls to define the boundaries of surface pockets and depressions in studies of packing densities in proteins. Holmes and Tsai tackled protein side-chain packing and interactions by measuring variation in convex hulls constructed around these groups (Holmes and Tsai, 2005). Coleman and Sharp, (2006) introduced the notion of travel depth (the physical distance a solvent molecule would have to travel from a surface point to a suitably defined reference surface) using convex hulls of surface points. Recently, Lee and colleagues have employed 3D convex hulls around complementarity regions of antibodies to analyse binding sites (Lee et al., 2006) and Wang et al., (2006) have used convex hulls of protein backbones in neural network-based classification of protein structures. However, dissection of protein structures by recursively assigning convex hull numbers to residues, as we propose here, does not appear to have been previously reported.

This article has two parts. In the first part we analyse RCH as a new computable property of proteins. We compare the information content of RCH to that of residue solvent accessibility (SA), residue depth (RD) and exposure (Exp), and show that although not totally unrelated, these properties are indeed complementary. We show that RCH correlates better with structural conservation than the other measures of residue location and that RCH is also better correlated with changes in protein thermal stability in the presence of cavity forming mutations. We turn, in Part 2, to the question of how easy/difficult it is, in practical terms, to learn to predict these measures. The relative predictability of RCH, RD, SA and Exp using four different machine-learning algorithms was assessed using six different, progressively richer, sets of input attributes at three levels of precision. The relative benefits of using these various inputs are described. C4.5 (Quinlan, 1992), Naive Bayes (John and Langley, 1995), GAssist (Bacardit, 2004) and BioHEL (Bacardit et al., 2007) are the machine-learning methods employed in this article. Finally, we demonstrate the usefulness of RCH by using the predicted RCH class of residues as input for prediction of residue coordination number (CN) showing that, in combination with predicted residue SA and Exp class, predicted RCH information increases predictive accuracy for CN.


    2 MATERIALS AND METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATERIALS AND METHODS
 3 LEARNABILITY OF RCH...
 4 DISCUSSION
 5 CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 
2.1 Datasets and features studied
Next, we describe the datasets and algorithms employed to assess the novelty of RCH and its relation to previously studied measures. All of the measures studied are based on atomic coordinates. Two polypeptides that have similar structures when represented using C{alpha} coordinates may have distinct structures when represented using Cβ coordinate (Eidhammer et al., 2003). Throughout this article Cβ atom coordinates are used (C{alpha} for glycyl residues) as these are sensitive to the orientation of side-chain atoms.

Protein dataset: The dataset used here are those described by Bacardit et al. (2006), originally proposed by Kinjo et al. (2005). Protein chains were selected from PDB-REPRDB [a non-redundant curated subset of the Protein Data Bank (PDB) (Noguchi et al., 2001), covering the space of possible folds] using the following criteria:less than 30% sequence identity, sequence length greater than 50 residues, no membrane proteins, no non-standard residues, no chain breaks, resolution better than 2 Å and a crystallographic R factor better than 20%. Chains that had no entry in the HSSP (Sander and Schneider, 1991) database were discarded. The final dataset contains 1050 protein chains (257 560 residues).

Identification of residue RCH: Convex hulls were identified from the residue Cβ atomic coordinates using the QHull package (Barber et al., 1996). Hulls were iteratively identified, hull residues were assigned a hull number and removed from the point set. This being repeated until all residues had been assigned a hull number. The mean RCH number in this dataset was 2.6 (SD 2.3). Assignment of RCH numbers to the 1050 chains took 52 min. We term this numbering of hulls, from the outermost inward, residue RCH. An alternative numbering scheme, from innermost hull outward, termed RCHr are given in the Supplementary Material (Section 2.1). The mean RCHr number in this dataset was 5.1 (SD 2.7). Assigning RCHr numbers to all chains took 58 min.

Calculation of residue solvent accessibility (SA): Solvent accessible surface values for each residue were extracted from the DSSP (Holm and Sander, 1993) file for each structure. These values were divided by the solvent accessible surface values for each amino acid as defined in Rost and Sander (1994) to obtain the relative solvent accessibility of each residue. The mean SA value in this dataset was 0.27 (SD 0.27).

Calculation of residue exposure (Exp): In this study, we characterize residue exposure as the distance of residues from the centroid of each chain (Ben-shimon and Eisenstein, 2005). The chain centroid was determined from the coordinates of the residues and the euclidean distance of each residue from this point was calculated to obtain the residues exposure value. The mean Exp value in this dataset was 19.1 Å (SD 7.8). Determination of Exp values for the whole dataset took less than 2 minutes.

Calculation of residue depth (RD): Residue depth (RD) values were obtained from the DPX server (Pintar et al., 2003) using default settings. RD values were positively skewed with a mean RD of 0.86 (SD 1.41).

Normalization: In Section 2.2, both unnormalized and normalized values are reported for characterization of the measures studied using box plots (Fig. 2), correlation coefficients (Table 1), structural conservation (Table 2), thermal stability (Table 3) and mutual information between class assignments (Table 4). The value for each residue was divided by the maximum value for that measure in the corresponding chain to obtain the normalized value. Histograms of unnormalized and normalized measures are shown in the Supplementary Materials (Figures 5 and 6). After normalization RCH and RCHr are symmetric.


View this table:
[in this window]
[in a new window]

 
Table 1. Correlation coefficients between measures studied

 

View this table:
[in this window]
[in a new window]

 
Table 2. Conservation of measures

 

View this table:
[in this window]
[in a new window]

 
Table 3. Correlation of structural features with thermal stability

 

View this table:
[in this window]
[in a new window]

 
Table 4. Pairwise mutual information

 
2.2 Comparison between RCH and other measures of residue location
BoxPlots: Figure 2 plots RD versus RCH for each residue in the dataset using the statistically robust Box and Whisker technique. Boxes cover 50% of the data points, whiskers extend to 1.5 times the interquartile range with outliers plotted as blue dots and median values indicated with black dots. Median values for RD are positively correlated with RCH yet RCH makes finer distinctions between degrees of burial and exposure. Further box plots for these measures are available in the Supplementary Materials (Fig. 3).


Figure 2
View larger version (15K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. Box and whisker plots of RD against RCH for 257 560 residues from 1050 proteins. Black dots indicate median values. Values were normalized and rounded to one decimal place.

 
Correlation coefficients: Pairs of measures that have a low correlation coefficient are likely to be unrelated and potentially provide complementary information for PSP. Table 1 shows the Pearson correlation coefficients between the measures studied. RD has low correlation with the other measures.RCH is most highly anti-correlated with SA (–0.62) and has a higher correlation with SA and Exp than RD. RCH is not highly correlated with RD, suggesting that these are distinct characterizations of residue location. RCH appears to be the measure that correlates closely to many of the other measures. Hence, we would like to determine whether it is relatively more learnable than these other measures.

Conservation of RCH: For related proteins, aligned residues are potentially conserved even in the absence of strong sequence homology. Measures that have relatively high correlation for aligned residue pairs potentially reflect conserved aspects of protein structure. We, therefore, assess to what degree these measures are correlated between aligned residues in pairs of superimposed structures from a range of folds. Following Hamelryck (2005), the conservation of RCH and the other measures was calculated for 15 621 aligned residues (BLAST E-value > = 1.0) in 218 pairs of structures from the SABmark version 1.63 Twilight Zone database (Van Walle et al., 2005). This dataset comprises pairs of superimposed structures covering 236 folds. These pairs are structurally similar, yet are without probable common evolutionary origin, effectively, a hard dataset to predict. Table 2 reports the correlation coefficients for both unnormalized and normalized measures. RCH and RCHr have higher conservation correlation coefficients than RD, Exp and SA indicating that, for such aligned residues, RCH is more highly correlated with structurally conserved locations than RD, Exp and (after normalization) SA. As we used Cβ coordinates, values for RD and SA are around 0.1 lower than those previously reported (Hamelryck, 2005).

Relationship of RCH to changes in thermal stability of mutant proteins: Changes in thermal stability of proteins after mutations of core hydrophobic residues (that potentially lead to cavity formation) has been correlated with changes in SA and residue depth [for references see Hamelryck (2005)]. For such residues, measures that correlate relatively high with changes in the proteins thermal stability reflect structurally important features. We, therefore, assess for these residues the degree to which these measures are correlated with changes in protein thermal stability. The correlation of these measures of residue location (both normalized and unnormalized) with changes in the thermal stability ({Delta}{Delta}G in kcal/mol) of 91 Ile/Leu/Val to Ala point mutations was measured. Sixteen protein structures from the Protherm database (Bava et al., 2004; Gromiha et al., 1999; Kumar et al., 2006) were employed, again following the approach of Hamelryck (2005). The correlation coefficients for RD, SA, Exp, RCH, RCHr and {Delta}ASA (related to the change in accessible surface upon folding) are shown in Table 3. RD values were similar to those previously reported. RCH is more highly correlated with changes in thermal stability upon mutation than the other measures. Exp and {Delta}ASA showed higher correlation when the data was normalized. RD showed the lowest correlation of the measures studied. This data indicates that (unnormalized) RCH is correlated more strongly with residues in the hydrophobic core (that are related to structural stability) than are the other measures.

Mutual information: The degree to which the classes assigned to residues using these measures are mutually informative was assessed using mutual information (MI) (Cover and Thomas, 2006). For discrete data, MI is defined as:


Formula 1

(1)
where p(x) and p(y) are the probabilities of x and y occurring in the dataset, and p(x, y) is the probability of the combination of x and y occurring together in the dataset. MI is used here to measure the quantity of information that one measure (e.g. SA) tells us about another (e.g. RCH).

Table 4 shows the MI between pairs of measures for all 257 560 residues studied. When the MI between the class assignments for a pair of measures is high they represent closely related problems (the MI between a measure and itself is maximal, and is 1.00 if the classes assigned to the measure are well balanced). SA shares 0,26 MI with RCH whilst Exp shares 0.38 MI with RCHr and all other pairwise MI values are less than 0.10. This indicates that the RCH class of residues provides information distinct to SA, RD and Exp class information. MI for Q3 and Q5 class assignments is given in the Supplementary Materials (Table 3) along with a detailed pairwise examination of the Q5 class assignments for SA versus RCH, and RCHr versus Exp, where increased levels of MI were observed (Supplementary Materials, Tables 4 and 6) along with RD versus RCH (and in Table 5). Frequent differences in class assignments are observed for measures with greater than 0.20 MI.


View this table:
[in this window]
[in a new window]

 
Table 5. Datasets

 
To further highlight the distinction between RD and RCH, visualisations of two space filling Cβ atom models of protein structures are shown in Figure 3. The values for each measure were normalized and the colour assigned, in both measures, to indicate values from ‘exposed’ (blue) to ‘buried’ (red). These models provide visual confirmation that residue RCH assignments are distinct to those for RD. Further examples are available in the Supplementary Materials (Figs. 1 and 2).


Figure 3
View larger version (63K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. Space filling Cβ atom models of proteins coloured by RCH and RD. ‘Core’ residues are coloured red/yellow and ‘surface’ residues blue/green (rendered using RasMol).

 

    3 LEARNABILITY OF RCH AND OTHER MEASURES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATERIALS AND METHODS
 3 LEARNABILITY OF RCH...
 4 DISCUSSION
 5 CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 
Having demonstrated that residue RCH is a new and distinct characterization of residue location, we turn to the predictability of these measures and assess, in practical terms, which of these characterizations of residue location is easier to learn. Hence, potentially more useful for PSP.

3.1 Prediction experiments
Inputs to predictions: For each measure (RCH, RCHr, RD, SA and Exp) predictions were made using six types of input information and three levels of precision: two, three and five class partitions (Q2, Q3 and Q5). Table 5 summarizes the six different types of input information used for predictions of the measures studied. Combinations of both local (neighbourhood of the target in the chain) and global (protein-wise) information were used. A window of four residues either side of the target residue has been shown to lead to high CN predictive accuracy using LCS (Bacardit et al., 2006) and was used in this study also to facilitate comparison of results. For each representation (RCH, SA etc.) these inputs were labeled 1–6 in the rest of this article, e.g. RCH-3 denotes RCH predicted using input dataset 3. For each measure a total of 18 datasets was evaluated (six sets of input attributes each at three levels of class assignment). A detailed description of these inputs appears in Stout et al. (2007). In order to determine the degree to which RCH and RCHr vary in their learnability and capture properties of protein structures, in what follows we use their unnormalized versions.

Predicted secondary structure information of the target residue was obtained using the PSI-PRED predictor (Jones, 1999). This consists the secondary structure type (helix, strand or coil) and a confidence level of the prediction (ranging from 0 to 9).

For each measure, the average value of that measure was determined for each chain and 10 pairs of training and test folds were prepared. For each instance, the inputs were the chain length (one integer value) and amino acid composition of each chain (20 real values), and the target class was the measured average value for the particular measure (partitioned into 10 classes using a uniform frequency cut-point strategy). Cut-points were determined separately for each training fold and used to assign classes to the values in the corresponding training and test folds. The predicted average value of the measure under consideration (termed PredAveRCH, PredAveRD, etc.) was predicted using the GAssist LCS (details below) prior to preparation of the datasets for the full measure predictions. Nine hundred fifty instances (chains) were used for training and 100 instances for testing. Ten iterations were performed for each prediction using different random number seeds and the 10 rule sets generated were combined as an ensemble using a majority vote to predict the measure.

Class assignments: In order to predict measures using classification techniques, the calculated values for each measure were partitioned into two, three and five classes (bins) here termed Q2, Q3 and Q5, respectively. For imbalanced measures, such as SA, a class boundary that leads to more balanced classes is traditionally chosen, e.g. for SA a cut point of 25% is widely used. We apply class balancing for all measures and levels of discretization (Q2, Q3 and Q5) in this study, adopting a uniform frequency classification procedure. For our data, balanced classes for SA were obtained using, e.g. a cut point of 18%. Class boundaries were determined individually for each training/test set pair using the corresponding training fold. Details of the cut points used are given in the Supplementary Table 1.

Definition of the training and tests sets: Datasets were divided randomly into 10 training and test set pairs (950 chains for training and 100 for testing), using bootstrap (Kohavi, 1995). We have placed a copy of the datasets at www.infobiotic.net

Performance measures: Different protein chains have different lengths and it is prediction accuracy on chains that is typically reported (Kinjo et al., 2005; Jones, 1999). Therefore, prediction accuracies for each chain were averaged to obtain the protein-wise accuracy reported here.

Machine-learning methods: We use four different machine-learning methods. The first two are popular machine-learning systems taken from the WEKA package (Witten and Frank, 2005): C4.5 (Quinlan, 1992), a decision tree rule induction system, Naive Bayes (John and Langley, 1995), a Bayesian-learning algorithm. Learning systems belonging to the Learning Classifier Systems (LCS) (Holland and Reitman, 1978) class of ML techniques were also studied. These systems are rule-based machine-learning systems that employ evolutionary computation (Holland, 1975) as the search mechanism. Two LCS methods have been employed: GAssist (Bacardit, 2004) and BioHEL (Bacardit et al., 2007) that implement different rule induction paradigms. A detailed description of both systems is included in the Supplementary Material (Section 3.1).

Analysis of results: For each experiment, the mean prediction accuracy (as defined previously in this section) over the test sets is reported. Student t-tests were applied to the 10 results from each experiment to determine the best method for each dataset at a confidence level of 95%. Standard deviations and any significant differences are indicated in each table. The conservative Bonferroni correction (Miller, 1981) for multiple pair-wise comparisons was applied. In addition, the contributions of global input information were assessed as follows: for each learning system and precision (Q2, Q3 and Q5), the maximum of (Dataset4, Dataset6)-Dataset2 was computed. As a base for the performance gap, the dataset with predSS was used, because in certain situations the Dataset1 performed poorly, distorting the comparisons. Finally, the contribution of predicted secondary structure was also assessed as follows: for each learning system and number of states the value of the maximum (Dataset2-Dataset1, Dataset4-Dataset3, Dataset6-Dataset5) was determined.

3.2 Prediction results
For each measure studied, Table 6 summarizes the best Q2 predictive accuracy (in descending order) for each measure (using the best possible input dataset in each case). Detailed results for the predictions are given in the Supplementary Materials (Tables 7 –10). Predictive accuracy was higher on the two RCH based representations than on the SA, RD or Exp representation. The predictive accuracies for RCHr being statistically significantly higher than those for the other measures (P-value = 0.5).


View this table:
[in this window]
[in a new window]

 
Table 6. Summary of the highest predictive accuracies for each measure studied, in descending order of accuracy

 
For all representations, higher predictive accuracies were seen when fewer classes were predicted (lower precision—Q2). Q5 predictive accuracy for RCH was between 30% and 40%, Q3 was approximately 20% higher, between 55% and 60% whilst for Q2 prediction accuracies exceeded 77%. The LCS's performed best on the RCHr representation when using input dataset RCHr-4. This dataset combines local information (a window of residues around the target and its predicted secondary structure) with global chain information (chain length and chain residue composition). The more compact RCHr-6 was frequently the most learnable dataset for C4.5 and Naive Bayes. This dataset comprises local information (window and predicted secondary structure) and global information (predicted average RCHr of the chain).

3.3 Predicted RCH improves CN prediction
Finally, we assess the utility of predicted RCH as an input to prediction of other aspects of protein structure, specifically coordination number (CN). For each of the measures studied, the Q5 predictions (using input dataset 4) made by BioHEL (which was, in general, the best performing method) are fed back into prediction of CN (Bacardit et al., 2006). The CN of a residue is a count of the number of other residues from the chain that are located within a certain threshold distance. Specifically, we have used the Kinjo et al. (2005) definition of CN. We predict whether the CN of a residue is above or below the midpoint of the CN domain, using as input information the AA type of a window of ±4 residues around the target (equivalent to the first set of input attributes used to predict the other features), CN1.

The contribution of SA, Exp, RCH and RCHr to CN prediction (individually and in combination with one another) was evaluated by extending the CN1 dataset with 16 combinations of input attributes that correspond to all combinations of these measures. Using predicted RD as input gave the lowest improvement (0.2%) over the CN1 (local window) input alone and was, therefore, not included in predictions made with combinations of inputs. Table 7 shows the results of these experiments. As a baseline, the performance of the original CN1 is included. The table has been sorted by accuracy which helps to identify the combinations of predicted measures that give the biggest performance boost.


View this table:
[in this window]
[in a new window]

 
Table 7. Coordination number prediction (by BioHEL) using amino–acid sequence and various combinations of the predicted measures

 
The results of these experiments were analyzed using paired t-test with 95% confidence level and the Bonferroni correction. Two types of results were identified: the datasets in which BioHEL performed significantly better than when learning from CN1 (marked with a a) and the (statistically indistinguishable) group of datasets that resulted in the highest predictive accuracies are indicated (b).

There are two groups of measures: those that only provide a small performance boost over CN1 (RD, Exp and RCHr), and others that provide a larger boost (SA and RCH). Furthermore, combining Exp and RCHr (together and with other measures) only marginally improves the performance, indicating that these measures are mostly redundant (consistent with the observed increase in MI between these two measures, Table 4). On the other hand, combining SA and RCH resulted in much better accuracy, showing that these measures complement each other. The best combinations of measures (Exp+SA+RCH, SA+RCH+RCHr and Exp+SA+ RCH+RCHr) lead to a performance increase of 2.6% over the baseline CN1 dataset.


    4 DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATERIALS AND METHODS
 3 LEARNABILITY OF RCH...
 4 DISCUSSION
 5 CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 
4.1 Prediction results
The results show some general trends across all measures studied and all learning methods. Predictive accuracy is increased when richer input information is employed. Inclusion of local information in the form of predicted secondary structure typically leads to an increase in Q2 predictive accuracy of 2–3% on most datasets for the learning systems used, whilst using global protein information (chain length and composition) can boost Q2 predictive accuracy by more than 10% (in the case of RCH and RCHr). The type 3 and 4 datasets, containing 21 real-valued global protein attributes, in particular presents a considerably larger search space, and the mixture of real-valued and nominal attributes makes the learning problem more difficult than for purely nominal knowledge representations. The type 5 and 6 datasets use less attributes than the type 3 and 4 datasets enabling the LCS's to generate rule sets that are more easily interpreted. The RCHr representation was the most predictable measure, this resulting from use of input dataset 4 (chain composition and length information) and the BioHEL LCS. C4.5 performed best when using datasets 4 and 6, benefiting from predicted local and global information. Naive Bayes generally performed best when using the more compact dataset 6.

There were interesting differences between these representations of residue location; RCH was easier to learn compared to other representations because this representation correctly assigned classes to residues for anisotropic (elongated) structures. For such structures, the Exp representation may assign some surface residues to the buried class and some buried residues to the exposed class. Similarly, RCH was more predictable than SA; assignment of residues was deep within structures, but on the surface of cavities (high solvent accessibility) to the exposed class may have lowered the predictability of the SA representation.

RD was more predictable than SA and Exp but was not as easily predicted as RCH and RCHr. As SA emphasises buried residues, RD emphasises exposed residues. The highly imbalanced nature of the RD measure (Supplementary Materials, Figs. 5 and 6) leads to imbalanced class assignments even when using a uniform frequency cut-point strategy (Supplementary Materials, Table 2). This imbalance is likely to have made this measure relatively more difficult to learn. Furthermore, when fed back as inputs for prediction of CN, RD in particular provided little additional information over the CN1 inputs resulting in only marginal improvements in CN prediction. Prediction accuracy for all measures is likely to be boosted by including additional (e.g. non-local) input information. For example, 79.3% 10-fold cross validated accuracy has been reported for two-state SA prediction at a 25% cut point using an integrated system of neural networks (Dor and Zhou, 2007) with position specific scoring matrices and a range of other input data.

4.2 ‘White box’ prediction: interpretable analysis and performance considerations
Understanding the basis on which a prediction is made may be more valuable than making relatively accurate predictions in a blind manner. Decision trees can be interpreted, however, on these problems C4.5, produced pruned trees that ranged in mean size from 1068 (SD = 331) to 138 977 (SD = 2735) nodes limiting their explanatory value. Naive Bayes, on the other hand, generates compact solutions, however, these are probabilistic models that have no immediate physico-chemical interpretation. In contrast, it is much easier to relate the compact rule sets evolved by the LCS algorithms to the underlying physical and chemical properties of proteins. The following is an example of a rule set evolved by the GAssist LCS that produced 71.2% two-state predictive accuracy on the RCH-6 dataset.

  1. If PredSSConf < 7.5 and PredAveAtt > 8.64 and Res–4 {notin} {K,x} and Res–2 {notin} x and Res {notin} {D,E,K,N,Q} and Res+1 {notin} {x} and Res+2 {notin} {K} and Res+4 {notin} {x} -> class is 1
  2. If PredSS {notin} {C} and PredAveAtt > 4.8 and Res–4 {notin} {E, Q} and Res–3 {notin} {D, T} and Res–2 {notin} E and Res–1 {notin} {D, P,V} and Res {notin} {D,E,H,K,N,P,Q,R,T} and Res+1 {notin} {x} and Res+2 {notin} {H} and Res+3 {notin} {K} and Res+4 {notin} {E,K,Q,R,x} -> class is 1
  3. If PredSS {notin} {C} and PredAveAtt > 4.8 and Res–4 {notin} {E, R} and Res–3 {notin} {D, E} and Res–1 {notin} {P, W} and Res {notin} {A,D,E,K,N,P,Q,R} and Res+1 {notin} {W, x} and Res+2 {notin} {V, W} and Res+3 {notin} {K, R} and Res+4 {notin} {K, x} -> class is 1
  4. If PredSS isin {E} and PredAveAtt > 4.5 and Res–4 {notin} {C, K, x} and Res–3 {notin} R and Res–2 {notin} {K, R} and Res–1 {notin} {D, Q, S, x} and Res {notin} {E, F, K, M, R, T} and Res+1 {notin} {F, L, x} and Res+2 {notin} {K} and Res+3 {notin} {C} and Res+4 {notin} {x} -> class is 1 .... Default class is 0
The rule set structure is a decision list, rules are tested in order starting with the first and rule evaluation continues until a match is found or the default (last) rule is reached. This rule set comprised 15 rules, the first four rules are shown (the full rule set is shown in the Supplementary Materials, Section 4.3). In the first rule, the confidence level of the secondary structure prediction is tested then the predicted average value for the property under consideration (predicted average RCH > 8.64). Subsequent predicates test the amino acid types of residues surrounding the target in the sequence. AA ± M means the amino acid type of the residue at position ± M in respect to the target residue. Amino acids are represented by their one letter code, plus the symbol x representing positions after the start/end of the chain, for cases where the window of neighbouring residues overlaps the beginning or the end of the chain. For positions relative to the target residue (e.g. Res–2) these predicates restrict the amino acid types for the residue at that position (membership or non-membership of a set). In subsequent rules, the predicted Secondary Structure class (PredSS) of the target residue; either Helix (H), Sheet (E) or Coil (C) is tested. In many cases, the target residue type is largely restricted to hydrophobic residues that are often found buried in the protein core and, therefore, have higher RCH numbers. The LCS has correctly identified this, predicting the above average RCH class (1) for these residues.

The accuracy, simplicity and interpretability of the rule sets generated by the LCS's must, however, be balanced against the computational expense needed to generate them. Run times for the learning phase of the BioHEL algorithm ranged from 3 min on the smaller input datasets to almost 7 h on the larger ones. In contrast, the worst case for C4.5 was 32 min and for Naive Bayes was less than 1 min. Moreover, for each problem set, the LCS algorithms, BioHEL and GAssist, were run multiple times to produce 10 classifiers for input to the ensemble procedure. Once trained, however, run times for the resulting classifiers on the entire test set was around two minutes for C4.5 and Naive Bayes but less then 1 min for the LCS's, an indication, for instance, of how these LCS evolved classifiers would perform as part of a prediction web server.


    5 CONCLUSIONS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATERIALS AND METHODS
 3 LEARNABILITY OF RCH...
 4 DISCUSSION
 5 CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 
In this article, a new measure of residue location in folded protein chains, the RCH, was introduced. RCH is a parameterless, simple to compute and mathematically rigorous method that situates residues in layers within protein structures. We show that RCH is distinct to other widely studied measures of residue location and that RCH distinguishes a range of degrees of residue burial/exposure, correlates better with residue conservation and changes in protein stability under mutation than measures such as solvent accessibility, residue depth or residue distance from chain centroid. Further, we assess the predictability of these measures using three types of ML technique: decision trees (C4.5), Naive Bayes and LCS, employing a range of predictive inputs. We show that an LCS that employs iterative rule learning, BioHEL, predicts RCH at 77.3, 60.6 and 39.0% accuracy for Q2, Q3 and Q5, respectively. We present examples of the competent yet simple and interpretable LCS classification rules, showing how they relate to the underlying physical and chemical properties of the residues. As an exemplar application of predicted RCH class (in combination with other measures), we show that prediction of contact number can be improved up to 2.6%.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATERIALS AND METHODS
 3 LEARNABILITY OF RCH...
 4 DISCUSSION
 5 CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 
We acknowledge the support of the UK Engineering and Physical Sciences Research Council (EPSRC) under grant GR/ T07534/01. We are grateful for the use of the University of Nottingham's High-Performance Computer. We are grateful to the anonymous reviewers whose insightful comments have helped to improve this manuscript.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Burkhard Rost

Received on November 6, 2007; revised on January 28, 2008; accepted on January 30, 2008

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATERIALS AND METHODS
 3 LEARNABILITY OF RCH...
 4 DISCUSSION
 5 CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Bacardit J, et al. Coordination number predication using learning classifier systems: Performance and interpretability. (2004) Proceedings of the 8th Anuual Conference on Genetic and Evolutionary Computation (GECCO 06). ACM Press, New York, pp. 247–254.

    Bacardit J, et al. Automated alphabet reduction method with evolutionary algorithms for protein structure prediction. In: GECCO '07: Proceedings of the 9th annual conference on Genetic and evolutionary computation (2007) Acm Press, London, Vol 1. pp. 346–353.

    Bacardit J. Pittsburgh Genetics-Based Machine Learning in the Data mining era: Representations, generalization, and run-time. (2004) PhD Thesis. Ramon Llull University, Barcelona. Catalonia Spain.

    Badel-chagnon A, et al. "Iso-depth contour map" of a molecular surface. J. Mol. Graph (1994) 12:162–168. 193.[CrossRef][Web of Science][Medline]

    Baldi P, Pollastri G. A machine-learning strategy for protein analysis. IEEE Intel. Sys (2002) 17:28–35.

    Barber CB, et al. The quickhull algorithm for convex hulls. ACM Trans. Math. Software (1996) 22:469–483.[CrossRef]

    Bava KA, et al. Protherm, version 4.0: thermodynamic database for proteins and mutants. Nucl. Acids Res (2004) 32(Database issue):D120–D121.[Abstract/Free Full Text]

    Ben-shimon A, Eisenstein M. Looking at enzymes from the inside out: the proximity of catalytic residues to the molecular centroid can be used for detection of active sites and enzyme-ligand interfaces. J. Mol. Biol (2005) 351:309–326.[CrossRef][Web of Science][Medline]

    Chakravarty S, Varadarajan R. Residue depth: a novel parameter for the analysis of protein structure and stability. Structure (1999) 7:724–732.

    Chen BY, et al. Cavity scaling: automated refinement of cavity-aware motifs in protein function prediction. J. Bioinform. Comput. Biol (2007) 5:353–382.[CrossRef][Medline]

    Coleman RG, Sharp KA. Travel depth, a new shape descriptor for macromolecules: application to ligand binding. J. Mol. Biol (2006) 362:441–458.[CrossRef][Web of Science][Medline]

    Cover T, Thomas JA. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). (2006) John Wiley & Sons Inc.

    Dor O, Zhou Y. Achieving 80% ten-fold cross-validated accuracy for secondary structure prediction by large-scale training. Proteins (2007) 66:838–45.[CrossRef][Web of Science][Medline]

    Eidhammer I, et al. Protein Bioinformatics: An Algorithmic Approach to Sequence and Structure Analysis. (2003) Hoboken, NJ: Wiley-Interscience.

    Dor O, Zhou Y. Achieving 80% ten-fold cross-validated accuracy for secondary structure prediction by large-scale training. Proteins (2007) 66:838–845.[CrossRef][Web of Science][Medline]

    Eidhammer I, et al. Protein Bioinformatics (2003) New York: An Algorithmic Approach to Sequence and Structure Analysis. J. Wiley & Sons Ltd. Chichester.

    Gianese G, Pascarella S. A consensus procedure improving solvent accessibility prediction. J. Comput. Chem (2006) 27:621–626.[CrossRef][Web of Science][Medline]

    Gromiha MM, et al. Protherm: thermodynamic database for proteins and mutants. Nucl. Acids Res (1999) 27:286–288.[Abstract/Free Full Text]

    Hamelryck T. An amino acid has two sides: a new 2d measure provides a different view of solvent exposure. Proteins (2005) 59:38–48.[CrossRef][Web of Science][Medline]

    Holland JH. Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control and Artificial Intelligence (1975) University of Michigan Press.

    Holland JH, Reitman JS. Cognitive systems based on adaptive algorithms. In: Pattern directed Inference Systems—Waterman DA, Hayes-Roth F, eds. (1978) Cambridge, MA: MIT Press. 313–329.

    Holmes JB, Tsai J. Characterizing conserved structural contacts by pair-wise relative contacts and relative packing groups. J. Mol. Biol (2005) 354:706–721.[CrossRef][Web of Science][Medline]

    Holm L, Sander C. Protein structure comparison by alignment of distance matrices. J. Mol. Biol (1993) 233:123–138.[CrossRef][Web of Science][Medline]

    John GH, Langley P. Estimating continuous distributions in Bayesia classifiers. In: In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence (1995) San Mateo: Morgan Kaufmann Publishers. 338–345.

    Jones DT. Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol (1999) 292:195–202.[CrossRef][Web of Science][Medline]

    Kawabata T, Go N. Detection of pockets on protein surfaces using small and large probe spheres to find putative ligand binding sites. Proteins (2007) 68:516–529.[CrossRef][Web of Science][Medline]

    Kinjo AR, et al. Predicting absolute contact numbers of native protein structure from amino acid sequence. Proteins (2005) 58:158–165.[CrossRef][Web of Science][Medline]

    Kohavi R. A study of cross-validation and bootstrap for accuracy estimation and model selection. Mellish CS, ed. (1995) Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence. San Mateo: Morgan Kaufmann. 1137–1145.

    Kumar MD, et al. Protherm and pronit: thermodynamic databases for proteins and protein-nucleic acid interactions. Nucl. Acids Res (2006) 34(Database issue):D204–D206.[Abstract/Free Full Text]

    Lee B, Richards FM. The interpretation of protein structures: estimation of static accessibility. J. Mol. Biol (1971) 55:379–400.[CrossRef][Web of Science][Medline]

    Lee M, et al. Shapes of antibody binding sites: qualitative and quantitative analyses based on a geomorphic classification scheme. J Org Chem (2006) 71:5082–5092.[CrossRef][Web of Science][Medline]

    Liang J, Dill KA. Are proteins well-packed? Biophys. J (2001) 81:751–766.[Web of Science][Medline]

    Lin TH, Lin JJ. Three-dimensional quantitative structure-activity relationship for several bioactive peptides searched by a convex hull-comparative molecular field analysis approach. Comput. Chem (2001) 25:489–498.[CrossRef][Web of Science][Medline]

    Lin TH, et al. A comparative molecular field analysis study on several bioactive peptides using the alignment rules derived from identification of commonly exposed groups. Biochim Biophys Acta (1999) 1429:476–485.[CrossRef][Medline]

    Liu S, et al. Fold recognition by concurrent use of solvent accessibility and residue depth. Proteins (2007) 68:636–645.[CrossRef][Web of Science][Medline]

    Meier R, et al. Segmentation of molecular surfaces based on their convex hull. In: ICIP 95: Proceedings of the 1995 International Conference on Image Processing (1995) 3. Washington DC: IEEE Computer Society. 552–555.[CrossRef]

    Miller RG. Simultaneous Statistical Inference (Springer Series in Statistics). (1981) New York: Springer-Verlag Inc.

    Noguchi T, et al. Pdb-reprdb: a database of representative protein chains from the protein data bank (pdb). Nucl. Acids Res (2001) 29:219–220.[Abstract/Free Full Text]

    Pintar A, et al. Dpx: for the analysis of the protein core. Bioinformatics (2003) 19:313–314.[Abstract/Free Full Text]

    Preparata FP, Hong SJ. Convex hulls of finite sets of points in two and three dimensions. Commun. ACM (1977) 20:87–93.[CrossRef]

    Quinlan JR. C4.5: Programs for Machine Learning (1992) San Mateo, CA: Morgan Kaufmann Publishers.

    Rost B, Sander C. Conservation and prediction of solvent accessibility in protein families. Proteins (1994) 20:216–226.[CrossRef][Web of Science][Medline]

    Sander C, Schneider R. Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins (1991) 9:56–68.[CrossRef][Web of Science][Medline]

    Stout M, et al. Prediction of topological contacts in proteins using learning classifier systems. Soft Computing (2008) Special Issue on Evolutionary and Metaheuristic-based Data Mining (EMBDM), (in press).

    Van Walle I, et al. Sabmark–a benchmark for sequence alignment that covers the entire known fold space. Bioinformatics (2005) 21:1267–1268.[Abstract/Free Full Text]

    Vlahovicek K, et al. Cx, dpx and pride: Www servers for the analysis and comparison of protein 3d structures. Nucl. Acids Res (2005) 33(Web Server issue):W252–W254.[Abstract/Free Full Text]

    Wang Y, et al. Automatic classification of protein structures based on convex hull representation by integrated neural network. (2006) Theory and Applications of Models of Computation, Third International Conference, {TAMC} 2006, May 15–20, 2006: Beijing, China. Lecture Notes in Computer Science. Springer, Vol. 3959. pp. 505–514.

    Witten IH, Frank E. Data Mining: Practical Machine Learning Tools and Techniques. In: Second Edition: (The Morgan Kaufmann Series in Data Management Systems) (2005) Amsterdam, Boston, MA: Morgan Kaufmann.

    Wood MJ, Hirst JD. Protein secondary structure prediction with dihedral angles. Proteins (2005) 59:476–481.[CrossRef][Web of Science][Medline]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
24/7/916    most recent
btn050v3
btn050v2
btn050v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Stout, M.
Right arrow Articles by Krasnogor, N.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Stout, M.
Right arrow Articles by Krasnogor, N.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?