Bioinformatics Advance Access originally published online on November 7, 2007
Bioinformatics 2007 23(24):3320-3327; doi:10.1093/bioinformatics/btm527
© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Support Vector Machine-based classification of protein folds using the structural properties of amino acid residues and amino acid residue pairs

Mohammad Tabrez Anwar Shamim, Mohammad Anwaruddin and H.A. Nagarajaram*

Laboratory of Computational Biology, Centre for DNA Fingerprinting and Diagnostics, Hyderabad 500 076, India

*To whom correspondence should be addressed.


    ABSTRACT

Motivation: Fold recognition is a key step in the protein structure discovery process, especially when traditional sequence comparison methods fail to yield convincing structural homologies. Although many methods have been developed for protein fold recognition, their accuracies remain low. This can be attributed to insufficient exploitation of fold discriminatory features.

Results: We have developed a new method for protein fold recognition using structural information of amino acid residues and amino acid residue pairs. Since protein fold recognition can be treated as a protein fold classification problem, we have developed a Support Vector Machine (SVM) based classifier approach that uses secondary structural state and solvent accessibility state frequencies of amino acids and amino acid pairs as feature vectors. Among the individual properties examined, secondary structural state frequencies of amino acids gave an overall accuracy of 65.2% for fold discrimination, which is better than the accuracy of any method reported so far in the literature. Combination of secondary structural state frequencies with solvent accessibility state frequencies of amino acids and amino acid pairs further improved the fold discrimination accuracy to more than 70%, which is ~8% higher than the best available method. In this study we have also tested, for the first time, an all-together multi-class method known as the Crammer and Singer method for protein fold classification. Our studies reveal that the three multi-class classification methods, namely one versus all, one versus one and the Crammer and Singer method, yield similar predictions.

Availability: Dataset and stand-alone program are available upon request.

Contact: han@cdfd.org.in

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
The gap between the number of proteins with and without 3-dimensional (3D) structural information has been increasing alarmingly owing to the successful completion of many genome-sequencing projects. Since 3D structure is essential for understanding protein function, and as not all proteins are amenable to experimental structure determination, computational prediction of 3D structures has become a necessary alternative to experimental determination. Among the computational approaches, fold recognition/threading methods have taken centre stage. In instances where detection of homology becomes difficult even when using the best sequence comparison methods such as PSI-BLAST (Altschul et al., 1997), structure-based fold recognition approaches are often employed. Many methods have been developed for assigning folds to protein sequences. These can be broadly classified into three categories: (a) sequence–structure homology recognition methods such as FUGUE (Shi et al., 2001) and 3DPSSM (Kelley et al., 2000), (b) threading methods such as THREADER (Jones et al., 1992) and (c) taxonomic methods such as PFP-Pred (Shen and Chou, 2006).

Sequence–structure homology recognition methods align the target sequence onto known structural templates and calculate their sequence–structure compatibilities using either profile-based scoring functions (Kelley et al., 2000) or environment-specific substitution tables (Shi et al., 2001). The scores obtained for different structural templates are then ranked, and the template that gives the best score is assumed to represent the fold of the target sequence. Unfortunately, these methods, although widely used, have not been able to achieve accuracies >30% at the fold level (Cheng and Baldi, 2006), which could be attributed to the fact that these methods use substitutions to detect folds that are evolutionarily related. Threading methods, which use pseudo-energy based functions (Jones et al., 1992) to calculate sequence–structure compatibilities, also yield poor accuracies, perhaps due to the difficulty of formulating reliable and general scoring functions.

On the other hand, taxonomic methods for protein fold recognition, such as the one developed by Ding and Dubchak (2001) and PFP-Pred (Shen and Chou, 2006), give prediction accuracies of ~60%. These methods assume that the number of protein folds in the universe is limited, so that protein fold recognition can be viewed as a fold classification problem in which a query protein is classified into one of the known folds. In this classification scheme one needs to identify fold-specific features that can discriminate between different folds. Available taxonomic methods for protein fold recognition use amino acid composition, pseudo amino acid composition, and selected structural and physico-chemical propensities of amino acids as fold discriminatory features. Ding and Dubchak (2001) used amino acid composition and features extracted from structural and physico-chemical propensities of amino acids to train the discriminatory classifier. The Ensemble classifier approach for protein fold recognition developed by Shen and Chou (2006) used different orders of pseudo amino acid composition and structural and physico-chemical propensities of amino acids as features. In general, the taxonomic approach appears very promising for protein fold recognition, and it can be explored further to obtain higher prediction accuracies by investigating new fold discriminatory features. In this study, we investigate the discriminatory potential of the secondary structural and solvent accessibility state information of amino acid residues and amino acid residue pairs. As shown, our approach gives a fold recognition accuracy that is ~8% higher than the best published fold recognition method.


    2 METHODS
2.1 Datasets for training and testing
The investigations were performed on two datasets: (a) the Ding and Dubchak dataset (D–B dataset), which is the same as that used in earlier studies (Ding and Dubchak, 2001; Shen and Chou, 2006) and (b) the extended D–B dataset, which is formed by further populating the D–B dataset with additional protein examples.

2.1.1 D–B dataset
The D–B dataset contains 311 and 383 proteins for training and testing, respectively (http://crd.lbl.gov/~cding/protein/) (Supplementary Table 1). This dataset has been formed such that, in the training set, no two proteins have more than 35% sequence identity to each other and each fold has seven or more proteins; and in the test set, proteins have <40% sequence identity to each other and have not more than 35% identity to the proteins of the training set (Ding and Dubchak, 2001). According to SCOP classification (Murzin et al., 1995), the proteins used for training and testing belong to 27 different folds representing all major structural classes: all α, all β, α/β, α + β and small proteins.

2.1.2 Extended D–B dataset
This dataset (Supplementary Table 1) was formed by merging the training and testing datasets of the D–B dataset and further populating each fold with additional protein examples chosen from ASTRAL SCOP 1.71 (Chandonia et al., 2004; http://astral.berkeley.edu), where sequences have <40% identity to each other. This dataset comprises 2554 proteins belonging to 27 folds.

2.2 Fold classifier method
For fold classification we have used Support Vector Machine (SVM), a supervised machine-learning method first developed by Vapnik (1995) that is extensively used for classification and regression problems. Literature abounds with technical details of SVM (Larranaga et al., 2006; Vapnik, 1995; Yang, 2004).

SVM has been designed primarily for binary classification. Many methods have been developed to extend SVM to multi-class classification (Crammer and Singer, 2000; Kreßel, 1999). Currently, there are two kinds of methods: (a) the Binary classification-based method (Ding and Dubchak, 2001; Hsu and Lin, 2002; Kreßel, 1999), which constructs and combines several binary classifiers and (b) the All-together method (Crammer and Singer, 2000; Vapnik, 1998), which directly considers all data in one big optimization formulation. In general, a multi-class problem is computationally more expensive than a binary problem. Since protein fold recognition is typically a multi-class problem, we used multi-class methods, namely, the All-together method (referred to as the Crammer and Singer method) and the two Binary classification-based methods: one versus all and one versus one. The one versus all and one versus one methods have been used earlier for protein fold recognition (Ding and Dubchak, 2001).
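The two Binary classification-based schemes differ mainly in how the binary outputs are combined. The following Python sketch illustrates one common way of doing this, assuming that the binary classifiers have already been trained: the highest decision value wins for one versus all, and majority voting is used for one versus one. The function names and the input dictionaries are hypothetical and are not part of LIBSVM or BSVM.

```python
from collections import Counter


def n_binary_classifiers(n_classes):
    """Return (one-versus-all count, one-versus-one count) for n_classes."""
    return n_classes, n_classes * (n_classes - 1) // 2


def predict_one_vs_all(scores):
    """Pick the class whose 'class versus rest' classifier scores highest.

    scores: dict mapping class label -> decision value (assumed input).
    """
    return max(scores, key=scores.get)


def predict_one_vs_one(pairwise_winners):
    """Pick the class that wins the most pairwise contests.

    pairwise_winners: dict mapping (class_a, class_b) -> winning label.
    Ties are resolved in favour of the label seen first (an arbitrary choice).
    """
    votes = Counter(pairwise_winners.values())
    return votes.most_common(1)[0][0]
```

For the 27 folds considered here, `n_binary_classifiers(27)` gives 27 one versus all classifiers and 351 one versus one classifiers.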

All SVM computations were carried out using LIBSVM (Chang and Lin, 2001). We used the one versus one implementation of the LIBSVM 2.83 main code, the one versus all implementation of the LIBSVM error-correcting code and the Crammer and Singer method implementation of BSVM 2.0. Although LIBSVM provides a choice of in-built kernels, such as linear, polynomial, radial basis function (RBF) and sigmoid, we used the RBF kernel for this study as it gave the best results (data not shown). The SVMs were trained using different values of the cost parameter $C = [2^{11}, 2^{10}, \ldots, 2^{-3}]$ and kernel parameter $\gamma = [2^{-3}, 2^{-2}, \ldots, 2^{-11}]$, and only those that gave rise to the best results were retained.

2.3 Fold discriminatory features
The sequence- and structure-based features extracted for this study are listed in Table 1.


Table 1. Different features along with their dimensions, used for training SVM classifiers

 
2.3.1 Sequence-based features
Amino acid composition: amino acid composition compresses the protein information into a fixed length vector in 20-dimensional space. This feature has been used with significant success for predicting subcellular localization of proteins (Garg et al., 2005; Guo et al., 2006), classification of nuclear receptors (Karchin et al., 2002) and protein fold recognition (Ding and Dubchak, 2001). The composition of an amino acid i in a protein is calculated using the formula:


$$f_i = \frac{N_i}{L}$$

where $f_i$ is the frequency of amino acid $i$; $N_i$ is the number of occurrences of amino acid $i$ in the protein; $L$ is the total number of amino acid residues in the protein; and $i$ = 1 to 20.
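As a minimal sketch of the composition described above, the following Python function computes the 20-dimensional vector of $N_i / L$ values. The function name, the alphabetical ordering of the one-letter codes and the handling of non-standard residues (counted in $L$ but not assigned to any component) are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter

# Assumed ordering of the 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the 20-dimensional vector f_i = N_i / L for a protein sequence."""
    sequence = sequence.upper()
    length = len(sequence)  # L: total number of residues
    if length == 0:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence)  # N_i for every residue type present
    return [counts[aa] / length for aa in AMINO_ACIDS]
```

For a sequence made up only of standard residues, the 20 components sum to 1.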

Amino acid pair composition: amino acid pair composition, or an nth order amino acid pair, encapsulates the interaction between the ith and (i + n)th (n > 0) amino acid residues and gives the local order information as well as the composition of amino acids in a protein. Amino acid pair composition is a 400 (20 × 20) dimensional representation of protein information, which has been shown to work well for many problems, such as subcellular localization of proteins (Garg et al., 2005; Guo et al., 2006) and classification of G-protein-coupled receptors (Karchin et al., 2002). The nth order of amino acid pair composition in a protein is calculated using the formula:


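$$f(D_{i,i+n})_j = \frac{N(D_{i,i+n})_j}{L - n}$$

where $f(D_{i,i+n})_j$ is the frequency of an nth order amino acid pair $j$; $N(D_{i,i+n})_j$ is the number of nth order amino acid pair $j$; $n$ is the order of amino acid pair; $L - n$ is the number of nth order pair positions in a protein of $L$ residues; and $j$ = 1 to 400.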
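The nth order pair composition can be sketched in Python as follows. The sketch assumes that the counts are normalised by the number of pair positions, $L - n$, and that pairs containing non-standard residues are skipped. The function name and the pair ordering are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs, e.g. "AA", "AC", ..., "YY" (assumed ordering).
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
PAIR_INDEX = {pair: j for j, pair in enumerate(PAIRS)}


def pair_composition(sequence, n=1):
    """Return the 400-dimensional nth order amino acid pair composition."""
    sequence = sequence.upper()
    total = len(sequence) - n  # number of (i, i + n) positions
    if n < 1 or total < 1:
        raise ValueError("need n >= 1 and a sequence longer than n")
    counts = [0] * len(PAIRS)
    for i in range(total):
        j = PAIR_INDEX.get(sequence[i] + sequence[i + n])
        if j is not None:  # skip pairs with non-standard residues
            counts[j] += 1
    return [c / total for c in counts]
```

Calling `pair_composition(seq, n=1)` and `pair_composition(seq, n=2)` gives the first and second order vectors, respectively.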

2.3.2 Structure-based features
Secondary structural state (H, E, C) frequencies of amino acids: these are the frequencies of amino acids found in helices (H), β-strands (E) and coils (C) in a given protein and are collectively represented as a 60 (20 x 3) dimensional vector. The frequencies are calculated using the formula:


$$f_i^k = \frac{N_i^k}{L}$$

where k = (H, E, C); $f_i^k$ is the frequency of amino acid $i$ occurring in the secondary structural state $k$ and $N_i^k$ is the number of amino acid $i$ found in the secondary structural state $k$. In this study, we have used predicted secondary structural information as the basis for all the calculations. The predictions were made using PSIPRED (McGuffin et al., 2000) and only those with confidence level ≥1 were considered for calculations.
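A minimal Python sketch of this 60-dimensional feature is given below. It assumes that the predicted states are supplied as a string of H/E/C characters aligned with the sequence, that per-residue confidence values are optional, and that counts are normalised by the sequence length $L$. The function name and vector layout are hypothetical, and the sketch does not call PSIPRED itself.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
STATES = "HEC"  # helix, strand, coil


def secondary_state_frequencies(sequence, states, confidences=None, min_conf=1):
    """Return the 60-dimensional vector of N_i^k / L values (k in H, E, C)."""
    length = len(sequence)
    if length == 0 or len(states) != length:
        raise ValueError("sequence and states must be non-empty and equally long")
    if confidences is None:
        confidences = [min_conf] * length  # treat every prediction as usable
    elif len(confidences) != length:
        raise ValueError("confidences must match the sequence length")
    counts = {(aa, k): 0 for k in STATES for aa in AMINO_ACIDS}
    for aa, k, conf in zip(sequence, states, confidences):
        # Only predictions at or above the confidence threshold are counted.
        if conf >= min_conf and (aa, k) in counts:
            counts[(aa, k)] += 1
    # Layout: 20 helix values, then 20 strand values, then 20 coil values.
    return [counts[(aa, k)] / length for k in STATES for aa in AMINO_ACIDS]
```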

Secondary structural state frequencies of amino acid pairs: these collectively represent a 1200 (400 × 3) dimensional vector. An amino acid pair was considered as found in helix or β-strand only if both the residues were found in helix or strand, respectively; otherwise the pair was considered as found in coil. The secondary structural state frequency of an nth order amino acid pair is calculated using the formula:

$$f(D^k_{i,i+n})_j = \frac{N(D^k_{i,i+n})_j}{L - n}$$

where k = (H, E, C); $f(D^k_{i,i+n})_j$ is the frequency of an nth order amino acid pair $j$ in secondary structural state $k$; $N(D^k_{i,i+n})_j$ is the number of nth order amino acid pair $j$ found in secondary structural state $k$; and $L - n$ is the number of nth order pair positions in a protein of $L$ residues.
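The pair-state rule described above (helix or strand only when both residues share that state, coil otherwise) can be sketched in Python as follows. The sketch returns a sparse dictionary instead of the full 1200-dimensional vector and assumes normalisation by the number of pair positions, $L - n$. All names are hypothetical.

```python
from collections import Counter


def pair_secondary_state(state_a, state_b):
    """Return 'H' or 'E' only if both residues share that state, else 'C'."""
    if state_a == state_b and state_a in ("H", "E"):
        return state_a
    return "C"


def pair_state_frequencies(sequence, states, n=1):
    """Map (pair, state) -> frequency for nth order amino acid pairs."""
    if len(sequence) != len(states):
        raise ValueError("sequence and states must be equally long")
    total = len(sequence) - n  # number of (i, i + n) positions
    if n < 1 or total < 1:
        raise ValueError("need n >= 1 and a sequence longer than n")
    counts = Counter()
    for i in range(total):
        pair = sequence[i] + sequence[i + n]
        counts[(pair, pair_secondary_state(states[i], states[i + n]))] += 1
    return {key: c / total for key, c in counts.items()}
```

Keys that are absent from the returned dictionary correspond to zero entries of the dense vector, which can be obtained by enumerating the 400 pairs over the three states.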

Solvent accessibility state (B, E) frequencies of amino acids: solvent accessibility state frequency of amino acids is a 40-dimensional representation of protein structural information and is calculated as follows:


$$f_i^k = \frac{N_i^k}{L}$$

where k = (B, E); $f_i^k$ is the frequency of amino acid $i$ in solvent accessibility state $k$ and $N_i^k$ is the number of amino acid $i$ in solvent accessibility state $k$. We used predicted solvent accessibility states for calculating these frequencies. ACCpro (Cheng et al., 2005) was used for predicting the solvent accessibility states of amino acid residues [cut-off values for relative solvent accessibility were ≤10% and >10% for buried (B) and exposed (E), respectively].
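The two-state assignment and the resulting 40-dimensional vector can be sketched in Python as follows. The sketch assumes that relative solvent accessibilities are available as percentages, one per residue, and that counts are normalised by $L$. It does not call ACCpro, and all names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def accessibility_state(rsa_percent, cutoff=10.0):
    """Return 'B' (buried) for RSA <= cutoff, otherwise 'E' (exposed)."""
    return "B" if rsa_percent <= cutoff else "E"


def accessibility_frequencies(sequence, rsa_values, cutoff=10.0):
    """Return the 40-dimensional vector of N_i^k / L values (k in B, E)."""
    length = len(sequence)
    if length == 0 or len(rsa_values) != length:
        raise ValueError("sequence and rsa_values must be non-empty and equally long")
    counts = {(aa, k): 0 for k in "BE" for aa in AMINO_ACIDS}
    for aa, rsa in zip(sequence, rsa_values):
        key = (aa, accessibility_state(rsa, cutoff))
        if key in counts:  # skip non-standard residues
            counts[key] += 1
    # Layout: 20 buried values followed by 20 exposed values.
    return [counts[(aa, k)] / length for k in "BE" for aa in AMINO_ACIDS]
```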

Solvent accessibility state frequencies of amino acid pairs: these comprise a 1200-dimensional representation of protein structural information. An amino acid pair was considered as buried (B) or exposed (E) only if both the residues were found buried or exposed, respectively. All other pairs were considered as partially buried (I). The solvent accessibility state frequency of an nth order amino acid pair is calculated using the formula:


$$f(D^k_{i,i+n})_j = \frac{N(D^k_{i,i+n})_j}{L_n}$$

where k = (B, E, I); $f(D^k_{i,i+n})_j$ and $N(D^k_{i,i+n})_j$ are the frequency and number of the nth order amino acid pair $j$ found in solvent accessibility state $k$, and $L_n$ is the total number of nth order amino acid pairs.
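The three-state rule for pairs (buried or exposed only when both residues agree, partially buried otherwise) can be sketched in Python as follows. The per-residue states are assumed to be given as a string of B/E characters, and the result is returned as a sparse dictionary instead of the full 1200-dimensional vector. All names are hypothetical.

```python
from collections import Counter


def pair_accessibility_state(state_a, state_b):
    """Return 'B' or 'E' if both residues agree, otherwise 'I'."""
    return state_a if state_a == state_b else "I"


def pair_accessibility_frequencies(sequence, states, n=1):
    """Map (pair, state) -> frequency, normalised by the pair count L_n."""
    if len(sequence) != len(states):
        raise ValueError("sequence and states must be equally long")
    n_pairs = len(sequence) - n  # L_n: total number of nth order pairs
    if n < 1 or n_pairs < 1:
        raise ValueError("need n >= 1 and a sequence longer than n")
    counts = Counter()
    for i in range(n_pairs):
        pair = sequence[i] + sequence[i + n]
        counts[(pair, pair_accessibility_state(states[i], states[i + n]))] += 1
    return {key: c / n_pairs for key, c in counts.items()}
```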

2.4 Performance measures
The performance of fold classification by SVM was evaluated by computing overall accuracy (Q), sensitivity (Sn) and specificity (Sp). Overall accuracy is the most commonly used parameter for assessing the global performance of a multi-class problem (Ding and Dubchak, 2001; Pierleoni et al., 2006), and is defined as the number of instances correctly predicted over the total number of instances in the test set:


$$Q = \frac{1}{N}\sum_i z_i$$

where $N$ is the total number of proteins (instances) in the test set, and $z_i$ are the true positives.
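As a small illustration, the overall accuracy can be computed from paired lists of true and predicted fold labels. The function below is a sketch with hypothetical names; it returns a fraction, which can be multiplied by 100 to give a percentage.

```python
def overall_accuracy(true_folds, predicted_folds):
    """Return Q: correctly predicted instances over all test instances."""
    if not true_folds or len(true_folds) != len(predicted_folds):
        raise ValueError("label lists must be non-empty and equally long")
    correct = sum(t == p for t, p in zip(true_folds, predicted_folds))
    return correct / len(true_folds)
```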

Sensitivity and specificity were calculated using formulae:


$$Sn = \frac{TP}{TP + FN}, \qquad Sp = \frac{TP}{TP + FP}$$

where TP, FN and FP are the number of true positives, false negatives and false positives, respectively.
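Because the measures are defined here only in terms of TP, FN and FP, the following Python sketch assumes Sn = TP / (TP + FN) and Sp = TP / (TP + FP), computed separately for each fold. The function name and the convention of returning 0.0 when a denominator is zero are illustrative choices.

```python
def fold_sensitivity_specificity(true_folds, predicted_folds, fold):
    """Return (Sn, Sp) for one fold from paired label lists."""
    tp = fn = fp = 0
    for t, p in zip(true_folds, predicted_folds):
        if t == fold and p == fold:
            tp += 1  # member of the fold, predicted as the fold
        elif t == fold:
            fn += 1  # member of the fold, predicted as something else
        elif p == fold:
            fp += 1  # non-member predicted as the fold
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp
```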

The n-fold cross-validation is generally used to check the generalization and stability of a method (Bhasin and Raghava, 2004; Goutte, 1997; Wang et al., 2006). In this study, we performed 2-fold cross-validation using the D–B dataset and 5-fold cross-validation using the extended D–B dataset.

We also checked the classification performance of input features using a naïve Bayes classifier. The classifier was downloaded from http://www.borgelt.net/bayes.html, and trained with default parameters using the same input features as used in SVM. The performance was evaluated using 2-fold and 5-fold cross-validation for D–B and extended D–B dataset, respectively.


    3 RESULTS AND DISCUSSION
We conducted preliminary studies to test the usefulness of different orders (n) of amino acid pairs, ranging from 1 to 12. Our studies revealed that only the amino acid pairs of the first (n = 1) and second (n = 2) orders give good fold prediction accuracies. The prediction accuracy declined from the third order pair onwards (see Supplementary Fig. 1). The decrease in prediction accuracy is due to increased uncertainty in backbone conformation as the spacing between the amino acids, i.e. 'n' in the pair, increases. Therefore, for further studies we considered only the first and second orders of amino acid pairs.

3.1 Twofold cross-validation studies using D–B dataset
We analyzed the individual fold discriminatory potentials of the sequence- and structure-based features given in Table 1. The prediction accuracies yielded by the various features for the three multi-class methods and their corresponding values of C and γ are given in Supplementary Table 2. Among the nine individual features used in this study, secondary structural state frequencies of amino acids (Feature 4) gave the best overall Qcv (2-fold cross-validation accuracy) value of 57% (Table 2). We also examined the fold discriminatory potential of different combinations of the features. Of these, Feature10, the combination of secondary structural state and solvent accessibility state frequencies of amino acids, gave the best 2-fold accuracy of 60% (Table 2).


Table 2. The overall 2-fold cross-validation accuracies obtained for three multi-class methods—(a) one versus all, (b) one versus one and (c) Crammer and Singer

 
The sensitivity and specificity of the best classifier set Feature10 as obtained by the three multi-class methods are shown in Figure 1. As is evident from the figure, the sensitivity and specificity values do not remain the same for all the folds. In general, the folds that are mostly α-helical, such as globin-like and cytochrome c, show high sensitivity and specificity. The average prediction accuracy (i.e. sensitivity) obtained for 'all α class' folds is ~78% as compared to ~56% obtained for 'all β class' folds. This difference in prediction accuracies between the two classes of folds can be attributed to the accuracies associated with the prediction of secondary structures and solvent accessibilities of the amino acids and amino acid pairs in these folds. In general, α-helices are predicted with better accuracies than β-strands (Rost and Sander, 1993 and Table 3) and, therefore, any prediction approach that uses predicted secondary structural information, such as the one presented here, is already biased towards better prediction of folds in the α-class than in the β-class.


Fig. 1. Fold-wise sensitivity and specificity obtained for the best classifier feature—Feature10 for three multi-class methods: (A) one versus all (OVA), (B) one versus one (OVO) and (C) Crammer and Singer (C&S). As mentioned in Table 1, Feature10 is the combination of secondary structural and solvent accessibility state frequencies of amino acids. Overall prediction accuracy using Feature10 is 60.5% for one versus all (OVA), 59.5% for one versus one (OVO) and 59.5% for Crammer and Singer (C&S) method.

 

Table 3. Discrepancy (error) in secondary structural state (helix, strand) prediction and solvent accessibility state (buried) prediction for different folds

 
In this study, as mentioned earlier, the methods used for predicting secondary structure and solvent accessibility are PSIPRED and ACCpro, respectively, and their prediction accuracies are ~78% (McGuffin et al., 2000) and ~77% (Cheng et al., 2005), respectively. As the structures for the protein domains in the D–B dataset are known, we identified the secondary structural states using SSTRUC (Smith, 1989) and calculated the solvent accessibilities using PSA (Sali, 1991), and these were compared with the predictions (Table 3). It is interesting to note that most of the low performing folds show marked errors in their predicted secondary structural states. For example, the OB-fold shows ~22% error in strand prediction (Es); the trypsin-like serine proteases fold, ~25% error in strand prediction; and the ribonuclease H-like motif fold, ~15% error in helix prediction and ~23% error in strand prediction.

Similarly, the low performing folds show significant errors in the prediction of solvent accessibility states of the residues. Failure to predict the correct number of buried residues can arise in the case of domains that form parts of multi-domain proteins. In such cases, the solvent accessibility prediction program does not give a correct prediction, as contact residues between domains, which are actually buried, are predicted as exposed. We calculated, for each fold, the percentage of domains that are part of multi-domain proteins (Table 3). It turns out that most of the folds characterized by domains from multi-domain proteins give rise to low accuracies.

In addition to the influence of incorrectly predicted features, SVM training can also become error prone due to the sparseness of the dataset used. It is known that the performance of an SVM depends on the size of the dataset used for training because it learns from the examples. The greater the number of examples (both positives and negatives) available for learning, the better the model. A look at the D–B dataset (Supplementary Table 1) reveals that many folds are sparsely represented. Folds such as the immunoglobulin-like β-sandwich and TIM-barrel, which are the most populated folds in the D–B dataset, show good sensitivity but poor specificity. In such cases, training generally becomes skewed towards the populous folds labeled as positive rather than the less populated folds labeled as negative; as a result, many proteins that do not belong to the populous folds get classified as positives.

3.2 Fivefold cross-validation studies using extended D–B dataset
In order to remove any bias due to inadequate data, the D–B dataset was populated by adding representatives taken from ASTRAL SCOP 1.71 (Chandonia et al., 2004). The new dataset, referred to as the extended D–B dataset, is almost four times larger than the D–B dataset. This dataset was used to perform 5-fold cross-validation by randomly dividing it into five equal-sized sets (I, II, III, IV and V). In each round of cross-validation, training was carried out using four sets and testing using the remaining set. The prediction accuracies achieved by the various features for the three multi-class methods and their corresponding values of C and γ are given in Supplementary Table 3.
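The random five-way partition described above can be sketched in Python as follows. The sketch assumes a simple unstratified shuffle with a fixed seed for reproducibility; the function name, the seed and the interleaved assignment of indices to sets are illustrative choices, not details taken from the paper.

```python
import random


def k_fold_indices(n_items, k=5, seed=0):
    """Yield (train, test) index lists for k roughly equal random sets."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # random assignment to sets
    folds = [indices[i::k] for i in range(k)]  # set sizes differ by at most one
    for i in range(k):
        test = folds[i]
        # Train on the remaining k - 1 sets.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

For the 2554 proteins of the extended D–B dataset, this gives four test sets of 511 proteins and one of 510.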

Among the individual features tested, the secondary structural state frequencies of amino acids (Feature 4) gave the best 5-fold accuracy of 65% (Table 4), which is higher than the best accuracy (62%) reported in the literature, by the Ensemble classifier approach PFP-Pred (Shen and Chou, 2006). Among the feature combinations, Feature15—combination of secondary structural state and solvent accessibility state frequencies of amino acids and first-order amino acid pairs—gave the highest accuracy of 70.5% (Table 4). This corresponds to the highest accuracy reported in the literature. The feature combination, Feature10, which showed the highest accuracy in 2-fold cross-validation studies, achieved a 5-fold accuracy of around 69%.


Table 4. The overall 5-fold cross-validation accuracy (Qcv) along with SD (enclosed within parentheses) obtained for all the features using three multi-class methods—one versus all, one versus one and Crammer and Singer

 
The sensitivities and specificities obtained for the different folds are shown in Figure 2. As can be seen from the figure, the prediction accuracies for many folds (e.g. EF Hand-like, immunoglobulin-like β-sandwich, TIM-barrel and trypsin-like serine proteases) have improved significantly as compared to the results obtained from 2-fold cross-validation studies. This shows that dataset size influences the quality of the training of SVM and hence the accuracy of prediction.


Fig. 2. Fold-wise sensitivity and specificity for the classifier feature—Feature10 using three multi-class methods: (A) one versus all (OVA), (B) one versus one (OVO) and (C) Crammer and Singer (C&S). Overall prediction accuracy using Feature10 is 68.7% for one versus all (OVA), 68.7% for one versus one (OVO) and 68.8% for Crammer and Singer (C&S) method.

 
Another interesting result is the increase in the specificity values of the populous folds, which showed poor specificity in 2-fold cross-validation studies. This indicates that poor specificity in 2-fold cross-validation studies is due to sparseness of the D–B dataset. Furthermore, poor specificity of the populous folds can also be attributed to their wide spread in the ‘Fold-space’ as revealed by phylogenetic studies (data not shown).

We computed the classification accuracy at the superfamily level (Supplementary Table 4). Superfamilies having at least 20 proteins in the extended D–B dataset were selected for the study. There are a total of 33 such superfamilies in the extended D–B dataset. A 5-fold accuracy of 74.2% was obtained. The sensitivities and specificities obtained for different superfamilies are also given in Supplementary Table 4.

Finally, we estimated the generalization performance of SVM using the leave-one-out error estimate that is commonly used for this purpose. In the literature, in addition to the leave-one-out error estimate, the Xi-Alpha error estimate has also been used. However, it has been shown that the Xi-Alpha estimator overestimates the true error rate. In fact, the Xi-Alpha estimator was developed as an alternative to the leave-one-out estimator as the latter is computationally very expensive (Joachims, 2000). The leave-one-out error estimate was calculated for Feature10 and Feature15 using the extended D–B dataset and the results are shown in Supplementary Figure 2. The leave-one-out error estimate is very similar to the average error (100 − Qcv) obtained for the 5-fold cross-validation. This result shows that the SVM is generally working well. Furthermore, the number of support vectors in the model (Supplementary Table 5) supports the view that the SVM is not over-trained for any specific dataset.

As mentioned earlier, we also calculated the prediction accuracies using a naïve Bayes classifier for the same input features. We found that the SVM performance is much superior to that of the naïve Bayes classifier (Supplementary Table 6).

3.3 Comparison of multi-class methods
It has been argued that the one versus one method performs better than the one versus all multi-class method (Allwein et al., 2000; Furnkranz, 2002; Hsu and Lin, 2002). However, the present study reveals that all three multi-class methods yield similar overall accuracies, sensitivities and specificities (see Tables 2 and 4; and Figs 1 and 2), indicating that the performance of SVM for the present set of features is independent of the type of multi-class method used, but dependent on the types of discriminatory features as well as the size of the dataset used for training. It is, however, worth noting that the one versus all method is slower than the one versus one and Crammer and Singer methods, especially for large dimensional features such as the ones used in the present study, and hence either of the latter methods is more useful in terms of execution time. To the best of our knowledge, this is the first time the Crammer and Singer multi-class method has been tested for the protein fold classification problem.

3.4 Comparison with the other fold recognition methods
We compared the performance of our approach with that of other taxonomic fold recognition methods reported in the literature; details are shown in Figure 3. For the sake of completeness, we have also shown the prediction accuracies of the template-based fold recognition methods as reported in the literature. As is evident from Figure 3, the prediction accuracy of our approach is ~8% higher than that of the best available method, PFP-Pred. The strikingly better performance of our approach can be attributed to the more sensitive and specific fold discriminatory features as well as better trained fold-specific SVMs.


Fig. 3. The best prediction accuracy (Q) for protein fold recognition reported by different fold recognition methods. The Q-value for the template-based methods (#) corresponds to the percentage of top 1 hits that match the correct folds. #Cheng and Baldi (2006), *Shen and Chou (2006) and $Ding and Dubchak (2001).

 

    4 CONCLUSIONS
In this study, we have investigated the fold discriminatory potential of a number of sequence- and structure-based features using SVM. Our studies have revealed that the secondary structural and solvent accessibility state frequencies of amino acids and amino acid pairs collectively give rise to the best fold discrimination. The newly developed SVM-based approach presented in this study is stable and outperforms the other available methods, and can therefore be used for fold-wise classification of unknown proteins discovered in various genomes.


    ACKNOWLEDGEMENTS
H.A.N. gratefully acknowledges the core funding from CDFD. M.T.A.S. and M.A. are thankful to the Council of Scientific and Industrial Research (CSIR) for their research fellowships. Computational facilities of the SUN Centre of Excellence, CDFD are gratefully acknowledged. The authors thank Prof. Sir Tom Blundell for critically going through the manuscript and colleagues Sridhar and Pankaj for their wise help throughout this course of study. Finally, the authors gratefully acknowledge the two anonymous referees for their critical comments.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Burkhard Rost

Received on July 5, 2007; revised on September 25, 2007; accepted on October 15, 2007

    REFERENCES
 

    Allwein EL, et al. Reducing multi-class to binary: a unifying approach for margin classifiers. J. Mach. Learn. Res (2000) 1:113–141.

    Altschul SF, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res (1997) 25:3389–3402.

    Bhasin M, Raghava GP. Classification of nuclear receptors based on amino acid composition and dipeptide composition. J. Biol. Chem (2004) 279:23262–23266.

    Chandonia JM, et al. The ASTRAL compendium in 2004. Nucleic Acids Res (2004) 32:D189–D192.

    Chang CC, Lin CJ. LIBSVM: a library for support vector machines. (2001) Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

    Cheng J, Baldi P. A machine learning information retrieval approach to protein fold recognition. Bioinformatics (2006) 22:1456–1463.

    Cheng J, et al. SCRATCH: a protein structure and structural feature prediction server. Nucleic Acids Res (2005) 33:W72–W76.

    Crammer K, Singer Y. On the learnability and design of output codes for multiclass problems. (2000) Proceedings of the Thirteenth Annual Conference on Computational Learning Theory (COLT-2000). 35–46.

    Ding CH, Dubchak I. Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics (2001) 17:349–358.

    Furnkranz J. Round robin classification. J. Mach. Learn. Res (2002) 2:721–747.

    Garg A, et al. Support vector machine-based method for subcellular localization of human proteins using amino acid compositions, their order, and similarity search. J. Biol. Chem (2005) 280:14427–14432.

    Goutte C. Note on free lunches and cross-validation. Neural Comput (1997) 9:1211–1215.

    Guo J, et al. GNBSL: a new integrative system to predict the subcellular location for Gram-negative bacteria proteins. Proteomics (2006) 6:5099–5105.

    Hsu C, Lin C. A comparison of methods for multi-class support vector machines. IEEE Trans. Neural Netw (2002) 13:415–425.

    Joachims T. Estimating the generalization performance of an SVM efficiently. (2000) Proceedings of the Seventeenth International Conference on Machine Learning (ICML-2000). 431–438.

    Jones DT, et al. A new approach to protein fold recognition. Nature (1992) 358:86–89.

    Karchin R, et al. Classifying G-protein coupled receptors with support vector machines. Bioinformatics (2002) 18:147–159.

    Kelley LA, et al. Enhanced genome annotation using structural profiles in the program 3D-PSSM. J. Mol. Biol (2000) 299:499–520.

    Kreßel U. Pairwise classification and support vector machines. Advances in Kernel Methods - Support Vector Learning (1999) Cambridge, MA: MIT Press. 255–268.

    Larranaga P, et al. Machine learning in bioinformatics. Brief. Bioinformatics (2006) 7:86–112.

    McGuffin LJ, et al. The PSIPRED protein structure prediction server. Bioinformatics (2000) 16:404–405.

    Murzin AG, et al. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol (1995) 247:536–540.

    Pierleoni A, et al. BaCelLo: a balanced subcellular localization predictor. Bioinformatics (2006) 22:e408–e416.

    Rost B, Sander C. Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol (1993) 232:584–599.

    Sali A. Ph.D. Thesis (1991) University of London.

    Shen HB, Chou KC. Ensemble classifier for protein fold pattern recognition. Bioinformatics (2006) 22:1717–1722.

    Shi J, et al. FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J. Mol. Biol (2001) 310:243–257.

    Smith D. SSTRUC: A Program to Calculate Secondary Structural Summary (1989) Department of Crystallography, Birkbeck College, University of London.

    Vapnik V. The Nature of Statistical Learning Theory (1995) New York: Springer.

    Vapnik V. Statistical Learning Theory (1998) New York, NY: Wiley.

    Wang Y, et al. Better prediction of the location of α-turns in proteins with support vector machine. Proteins Struct. Funct. Bioinformatics (2006) 65:49–54.

    Yang ZR. Biological applications of support vector machines. Brief. Bioinformatics (2004) 5:328–338.





