Bioinformatics Advance Access originally published online on June 28, 2007
Bioinformatics 2007 23(23):3125-3130; doi:10.1093/bioinformatics/btm324
Analysis and identification of β-turn types using multinomial logistic regression and artificial neural network
1Department of Biophysics, Faculty of Basic Sciences and 2Department of Biostatistics, Faculty of Medical Sciences, Tarbiat Modares University, Tehran, Iran
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Motivation: So far, various statistical and machine learning techniques have been applied to the prediction of β-turns. Most of these techniques have focused only on predicting the location of β-turns in proteins. We developed a hybrid approach for the analysis and prediction of different types of β-turn.
Results: A two-stage hybrid model was developed to predict β-turn Types I, II, IV and VIII. Multinomial logistic regression was used, for the first time, to select significant parameters for the prediction of β-turn types using a self-consistency test procedure. The extracted parameters consisted of 80 amino acid positional occurrences and 20 amino acid percentages in the β-turn sequence. The most significant parameters were then selected using the multinomial logistic regression model. Among these, the occurrences of glutamine, histidine, glutamic acid and arginine in positions i, i + 1, i + 2 and i + 3 of the β-turn sequence, respectively, had an overall relationship with the five β-turn types. A neural network model was then constructed and fed with the parameters selected by multinomial logistic regression to build a hybrid predictor. The networks were trained and tested on a non-homologous dataset of 565 protein chains by 9-fold cross-validation. The hybrid model gives a Matthews correlation coefficient (MCC) of 0.235, 0.473, 0.103 and 0.124 for β-turn Types I, II, IV and VIII, respectively. Our model also distinguished the different types of β-turn in the embedded binary logit comparisons, which have not been carried out so far.
Availability: Available on request from the authors.
Contact: parviz{at}modares.ac.ir
| 1 INTRODUCTION |
|---|
The protein architecture is characterized by repetitive motif elements, such as α-helices and β-sheets, and non-repetitive motif elements, such as tight turns, bulges and random coil structures (Richardson, 1981). β-Turns are the most numerous category of tight turns and represent about 25% of all residues in proteins (Kabsch and Sander, 1983). They are made of four consecutive residues (denoted by i to i + 3), with a distance between the Cα atoms of residues i and i + 3 that has to be smaller than 7 Å (Chou, 2000). They can be classified into nine different types according to the φ, ψ angles of the two central residues. β-Turns play many biological roles in proteins and peptides. They are responsible for the compact globular shape of proteins because of their ability to reverse the direction of the protein chain. Also, β-turn formation is an important stage in protein folding (Takano et al., 2000). Furthermore, the occurrence of β-turns on solvent-exposed surfaces makes them suitable candidates for molecular recognition processes and interactions between peptide substrates and receptors (Rose et al., 1985). Therefore, it is useful to develop an accurate method for identifying the type of β-turns within a protein sequence. Such a method would not only be a small step toward the overall prediction of the 3D structure of a protein from its amino acid sequence but would also be helpful in fold-recognition studies and in the identification of structural motifs such as the β-hairpin.
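The distance criterion in the definition above can be illustrated with a short sketch. It applies only the Cα(i) to Cα(i + 3) cutoff of 7 Å stated in the text; the function names and the toy coordinates are hypothetical, and a real assignment program such as PROMOTIF applies further criteria that are not modelled here.

```python
import math

def ca_distance(a, b):
    """Euclidean distance (in Å) between two Cα coordinates given as (x, y, z)."""
    return math.dist(a, b)

def candidate_turns(ca_coords, cutoff=7.0):
    """Return start indices i where Cα(i) and Cα(i + 3) are closer than `cutoff` Å."""
    return [i for i in range(len(ca_coords) - 3)
            if ca_distance(ca_coords[i], ca_coords[i + 3]) < cutoff]

# Toy coordinates (hypothetical, not taken from a real structure).
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.0, 3.6, 0.0),
          (1.5, 5.0, 0.0), (-2.0, 6.0, 0.0), (-5.5, 7.0, 0.0), (-9.0, 8.0, 0.0)]
print(candidate_turns(coords))  # indices of four-residue windows passing the cutoff
```

Only the first window in the toy chain folds back on itself closely enough to pass the cutoff; the remaining windows are extended.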
There have been several attempts to predict and analyze β-turns in proteins. They can be divided into two categories: those based on statistical methods and those based on machine-learning methods. The majority of statistical methods empirically employed the knowledge of amino acid preferences at individual positions in β-turns (Chou and Fasman, 1974; Fuchs and Alix, 2005; Hutchinson and Thornton, 1994; Lewis et al., 1973; Wilmot and Thornton, 1988; Zhang and Chou, 1997). Machine learning-based methods applied to the prediction of β-turns include the artificial neural network (ANN) approach (Kaur and Raghava, 2003, 2004; McGregor et al., 1989; Shepherd et al., 1999) as well as the support vector machine (SVM) approach (Cai et al., 2003; Pham et al., 2005; Zhang et al., 2005). An evaluation of six β-turn prediction methods based on a common dataset has been published (Kaur and Raghava, 2002). It showed that the neural network approach of Shepherd et al. (1999) gave the best prediction performance among the evaluated methods. Consequently, the neural network was considered to be one of the best-performing classification methods so far.
The rationale underlying this study was to use multinomial logistic regression to build the most effective set of parameters, which were then fed into a well-established neural network. The multinomial logistic regression method, which has not been applied to β-turn analysis so far, belongs to the generic class of regression imputation methods with sufficient capability for separating distinct sets when the dependent variable is polytomous and the independent variables are continuous and/or discrete. The distinction is performed by establishing discriminant rules. The rules are estimated during the training procedure and can be used to allocate new cases to the previously defined classes (Hosmer and Lemeshow, 2000).
Multinomial logistic regression, as the first stage of this hybrid modeling procedure, can increase the accuracy and reliability of the neural networks, as the last stage, in predicting β-turn types.
The hybrid approach proposed in this article is aimed at the analysis and prediction of different types of β-turn. We have focused mainly on the prediction of β-turn Types I, II, IV and VIII. The remaining Types I', II', VIa1, VIa2 and VIb occur too rarely for a reliable prediction. Thus, these turn types have been combined into one set, called the NS (non-specific) turn type.
| 2 SYSTEMS AND METHODS |
|---|
2.1 The dataset
The dataset used in this study comprised 565 non-homologous protein chains. These protein chains were selected using the PAPIA system (Noguchi et al., 2001). In this dataset, no two protein chains have more than 25% sequence identity, and none of the chains has chain breaks. The structures of these proteins were determined by X-ray crystallography at 2.0 Å resolution or better. The PROMOTIF program was used to assign the β-turns in the protein chains (Hutchinson and Thornton, 1996). Sequence parameters, comprising 80 amino acid positional occurrences and 20 amino acid percentages (of existence) in the β-turn sequence, were generated using the IF and COUNTIF functions of Microsoft Excel, respectively.
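As an illustration of how the 100 sequence parameters could be generated, the following minimal sketch builds 80 positional occurrences (20 amino acids at each of the 4 turn positions) and 20 percentages for one tetrapeptide. It assumes that each percentage is the share of the four turn positions occupied by a given amino acid; the function name, the ordering of the features and the example sequence are hypothetical and stand in for the spreadsheet formulas mentioned above.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes

def turn_features(tetrapeptide):
    """Build 80 positional occurrences followed by 20 percentages for one turn."""
    if len(tetrapeptide) != 4:
        raise ValueError("a β-turn sequence has exactly four residues")
    # 80 positional occurrences: 1 if amino acid `aa` sits at position i + k, else 0
    occurrences = [1 if tetrapeptide[k] == aa else 0
                   for k in range(4) for aa in AMINO_ACIDS]
    # 20 percentages: share of the four positions occupied by each amino acid
    # (assumed definition of the "percentage of existence")
    percentages = [100.0 * tetrapeptide.count(aa) / 4 for aa in AMINO_ACIDS]
    return occurrences + percentages

features = turn_features("PDGS")  # hypothetical turn sequence
print(len(features))
```

Each tetrapeptide sets exactly four of the 80 occurrence flags, one per position, which is why many of these parameters are sparse and why a parameter selection step is useful.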
2.2 Model development
In the first stage of this hybrid method, multinomial logistic regression serves as a non-linear model on the dataset to select significant parameters through the self-consistency test. This test examines the self-consistency of a prediction method: each tetrapeptide in the dataset is identified in turn using the rule parameters derived from the same dataset, the so-called training dataset. The ANNs, which act non-linearly in the last stage, were then fed with the outputs of the multinomial logistic regression to predict β-turn types. The jackknife technique (individual testing of each protein in the dataset) was not applied to train and test the networks, because it is too time-consuming to be feasible. Instead, the ANN method was trained and tested using 9-fold cross-validation, whereby the whole set is divided into nine sets, each containing an equal number of proteins. The method was trained on eight sets and its performance was measured on the remaining ninth set. This procedure was repeated nine times so that every member of the dataset was used once in the testing procedure. In this way, the conclusions drawn reflect the whole dataset.
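The 9-fold cross-validation scheme described above can be sketched as follows. This is a minimal illustration and not the authors' MATLAB code: the chain identifiers, the random seed and the function name are hypothetical. Since 565 is not divisible by 9, the folds in this sketch differ in size by at most one chain.

```python
import random

def k_fold_splits(protein_ids, k=9, seed=0):
    """Yield (train, test) lists of protein IDs; each ID is tested exactly once."""
    ids = list(protein_ids)
    random.Random(seed).shuffle(ids)          # fixed seed for reproducibility
    folds = [ids[i::k] for i in range(k)]     # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, test

chains = [f"chain_{n:03d}" for n in range(565)]  # placeholder identifiers
splits = list(k_fold_splits(chains))
for train, test in splits:
    print(len(train), len(test))
```

Splitting by protein chain rather than by individual turn keeps all turns of one chain on the same side of the train/test boundary.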
2.3 Crosstabs
The purpose of a crosstabulation is to show the individual relationship (or lack thereof) between the independent variables and the dependent variable (Hosmer and Lemeshow, 2000). Since the number of parameters in this study was very high (100), a crosstabs test was performed to reduce the number of parameters by omitting the non-effective ones. As a result, the established model is simpler and its performance is improved.
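A crosstabs test of one occurrence parameter against the five turn types amounts to a Pearson χ² test on a 2 × 5 contingency table. The sketch below computes the statistic by hand and uses the closed-form tail probability that holds for an even number of degrees of freedom (here 4). The counts are invented for illustration, the function names are hypothetical, and the study itself used SPSS for this step.

```python
import math

def pearson_chi2(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for r, row in enumerate(table):
        for c, observed in enumerate(row):
            expected = row_totals[r] * col_totals[c] / total
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof

def chi2_sf_even_dof(x, dof):
    """P(X > x) for a chi-square variable; closed form valid only for even dof."""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("closed form requires a positive, even number of degrees of freedom")
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(dof // 2))

# Rows: residue present / absent at one position; columns: Types I, II, IV, VIII, NS.
# Hypothetical counts, not taken from the dataset.
table = [[30, 10, 25, 8, 7],
         [370, 140, 375, 92, 93]]
chi2, dof = pearson_chi2(table)
p_value = chi2_sf_even_dof(chi2, dof)
print(dof, p_value > 0.05)  # a parameter with p > 0.05 would be dropped
```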
2.4 Multinomial logistic regression model
The multinomial logistic regression model used here is a generalization of the logistic regression model. It is commonly used for data in which the dependent variable is polytomous and the independent variables are numerical or categorical predictors. As a binary dependent variable can always be interpreted as the occurrence or non-occurrence of a characteristic, the logistic regression model is an expression of the form
|
| (1) |
|
| (2) |
|
| (3) |
|
| (4) |
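For orientation, the standard textbook forms of the binary and the multinomial logistic models (as in Hosmer and Lemeshow, 2000) can be written as follows. This is a sketch in generic notation, which is an assumption here and is not necessarily the notation of Equations (1)–(4): x_1, ..., x_p are the independent variables, J is the reference category (Type IV in this study) and β_jk is the coefficient of variable k in the logit that compares category j with J.

```latex
% Binary logistic regression: log-odds of the characteristic occurring
\ln\frac{P(Y=1\mid x)}{1-P(Y=1\mid x)} = \beta_0 + \sum_{k=1}^{p}\beta_k x_k

% Multinomial logistic regression: one logit per non-reference category j
g_j(x) = \ln\frac{P(Y=j\mid x)}{P(Y=J\mid x)} = \beta_{j0} + \sum_{k=1}^{p}\beta_{jk} x_k,
\qquad j \neq J

% Resulting class probabilities, with g_J(x) = 0 for the reference category
P(Y=j\mid x) = \frac{e^{g_j(x)}}{1 + \sum_{l \neq J} e^{g_l(x)}}
```

With five turn types and one reference category, this gives four embedded binary logits, each with its own set of coefficients.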
2.5 Artificial neural network model
The ANN was used as a powerful non-linear predictor in the hybrid with multinomial logistic regression. The variables selected by the multinomial logistic regression model were used as input nodes for the ANN. This is expected to reduce the number of input nodes, simplify the network structure and shorten the model-building time. We used feed-forward back-propagation networks with a single hidden layer. With this algorithm, the parameters of the training cases were fed into the networks. The final outputs estimated by the networks were compared with the real types of the cases, producing a mean of the sum-of-squares error (MSE). The MSE was propagated back into the networks to adjust the randomly chosen weights. The training cases were then tested with the new weights and the process was repeated. Through this process, the MSE was minimized.
We used three-layer networks. Each unit in the input layer was fed by one independent variable selected by the multinomial logistic regression model. The output layer contained five units, which represented 10000, 01000, 00100, 00010 and 00001 for Types I, II, IV, VIII and NS of β-turn, respectively. We used the MSE as an index of network efficiency in optimizing the number of hidden units in the networks (Hayatshahi et al., 2005). To do so, the number of hidden units was varied in every network in order to find the networks generating the minimal MSE. After this optimization procedure, the number of hidden layer units was set to 30.
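The five-unit output coding described above is a one-hot encoding of the turn type. The following minimal sketch shows the encoding and one possible way of decoding network outputs back to a turn type; the decoding rule (taking the unit with the largest output) and the function names are assumptions, since the text does not state how the outputs were mapped to a predicted type.

```python
TURN_TYPES = ["I", "II", "IV", "VIII", "NS"]

def encode(turn_type):
    """One-hot target vector, e.g. Type I -> [1, 0, 0, 0, 0]."""
    return [1 if t == turn_type else 0 for t in TURN_TYPES]

def decode(outputs):
    """Assumed winner-take-all rule: pick the type whose output unit is largest."""
    best = max(range(len(outputs)), key=lambda i: outputs[i])
    return TURN_TYPES[best]

print(encode("IV"))
print(decode([0.10, 0.70, 0.10, 0.05, 0.05]))
```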
The final neural network architecture consisted of 41 units in the input layer, 30 units in the hidden layer and 5 units in the output layer. The activation function of the hidden layer units was logsig. Also, the Quasi-Newton training function was used for the first time. This training function is superior to simple batch gradient descent and leads to significantly better solutions requiring fewer training steps. In addition, this method does not suffer from the problem of specifying the learning rate parameter, which is crucial for the performance of the gradient-descent method (Likas and Stafylopatis, 2000).
Training was performed for 5000 epochs for each of the nine networks. The value of the learning rate parameter was set to 0.2. The software used to build the neural networks was written in-house in the MATLAB programming language.
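The 41-30-5 architecture can be sketched as a plain forward pass. This illustration is written in Python rather than the authors' MATLAB and makes several assumptions: the weights are random, the output layer also uses logsig (the text specifies the activation only for the hidden layer) and the Quasi-Newton training step is omitted entirely. All function names are hypothetical.

```python
import math
import random

def logsig(x):
    """Logistic sigmoid, the function MATLAB calls logsig."""
    return 1.0 / (1.0 + math.exp(-x))

def init_layer(n_in, n_out, rng):
    """One row per unit: n_in weights followed by a bias term."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]

def forward_layer(layer, inputs):
    return [logsig(sum(w * x for w, x in zip(unit[:-1], inputs)) + unit[-1])
            for unit in layer]

def forward(network, inputs):
    """Propagate the inputs through every layer in turn."""
    for layer in network:
        inputs = forward_layer(layer, inputs)
    return inputs

def mse(outputs, target):
    """Mean squared error between network outputs and a one-hot target."""
    return sum((o - t) ** 2 for o, t in zip(outputs, target)) / len(target)

rng = random.Random(0)
network = [init_layer(41, 30, rng), init_layer(30, 5, rng)]  # 41-30-5 units
x = [0.0] * 41
x[3] = 1.0                      # hypothetical input: one selected parameter set
y = forward(network, x)
print(len(y), mse(y, [1, 0, 0, 0, 0]))
```

A training routine would adjust the weights in `network` so as to reduce the value returned by `mse` over all training cases.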
2.6 Performance measures
Five different parameters have been used to measure the performance of the prediction methods. These five parameters can be derived from four scalar indices: TPi (true positives: number of correctly classified β-turn type i), TNi (true negatives: number of correctly classified non-β-turns), FPi (false positives: number of non-β-turns incorrectly classified as β-turn type i) and FNi (false negatives: number of β-turn type i incorrectly classified as non-β-turns or some other turn type), where i = I, II, IV, VIII and NS. Using the following formulas, which have been reported previously in the literature, we calculated the prediction accuracy, sensitivity, specificity, probability of correct prediction and Matthews correlation coefficient for the output of the multinomial logistic regression and ANN models.
- Prediction Accuracy (Acci) = [(TPi + TNi)/t] x 100, where t = TPi + TNi + FPi + FNi is the total number of cases including β-turn types and non-β-turns.
- Sensitivity = [TPi/(TPi + FNi)] x 100 is the percentage of observed β-turn types that are predicted correctly.
- Specificity = [TNi/(TNi + FPi)] x 100 is the percentage of observed non-β-turns that are predicted correctly.
- Probability of correct prediction = [TPi/(TPi + FPi)] x 100 is the percentage of predicted β-turn types that are predicted correctly.
- Matthews correlation coefficient (MCCi): we used MCC as a more robust measure to evaluate the reliability of the established method (Matthews, 1975). The MCC for each β-turn type is defined by

$$\mathrm{MCC}_i = \frac{\mathrm{TP}_i \, \mathrm{TN}_i - \mathrm{FP}_i \, \mathrm{FN}_i}{\sqrt{(\mathrm{TP}_i + \mathrm{FP}_i)(\mathrm{TP}_i + \mathrm{FN}_i)(\mathrm{TN}_i + \mathrm{FP}_i)(\mathrm{TN}_i + \mathrm{FN}_i)}}$$

Statistical analysis was performed using SPSS 13 for Windows (SPSS Inc., Chicago, USA).
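The five performance measures defined in this section can be computed from the four counts as in the sketch below. The function names and the example counts are hypothetical; returning 0 when a denominator is zero is a convention chosen for this sketch and is not stated in the text.

```python
import math

def ratio(numerator, denominator):
    """Safe division; returns 0.0 when the denominator is zero (sketch convention)."""
    return numerator / denominator if denominator else 0.0

def performance(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, probability of correct prediction (all in %) and MCC."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": 100.0 * ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": 100.0 * ratio(tp, tp + fn),
        "specificity": 100.0 * ratio(tn, tn + fp),
        "prob_correct": 100.0 * ratio(tp, tp + fp),
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
    }

# Hypothetical counts for one turn type.
scores = performance(tp=40, tn=45, fp=5, fn=10)
for name, value in scores.items():
    print(name, round(value, 3))
```

Unlike accuracy, the MCC stays close to zero for a predictor that always outputs the majority class, which is why it is the more informative measure for rare turn types.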
| 3 RESULTS |
|---|
3.1 Distribution of β-turn types
From the dataset of 565 protein chains, 11 838 β-turns were assigned and categorized into the different types. Type IV was the most frequent turn type (34.9%), followed by Type I (34.8%); both were two to three times more common than Type II (12.2%). Type VIII accounted for 9.3% of the turns in the dataset. The mirror-image Types I' and II' were rare, comprising only 4.2 and 2.6%, respectively. The other β-turn types (i.e. Types VIa1, VIa2 and VIb) were very few (1.9%).
3.2 Crosstabs result
After running the crosstabs test on the dataset, we found that 18 parameters had no relationship with the dependent variable (five β-turn types). The significance values of the Pearson χ² for these parameters were greater than 0.05 (the significance level). Therefore, these parameters were omitted from the dataset, and the analysis and prediction procedures were carried out using the remaining 82 parameters. The omitted parameters were Glu (%), His (%), Met (%), Phe (%), Trp (%), Arg (i), Cys (i), Trp (i), Arg (i + 1), Cys (i + 1), Gln (i + 1), Met (i + 1), Trp (i + 1), Cys (i + 2), Cys (i + 3), His (i + 3), Trp (i + 3) and Tyr (i + 3).
3.3 Multinomial logistic regression analysis
We ran a multinomial logistic regression model on the dataset using the self-consistency test. The final model χ² was 8946.224 (P-value = 0.001). Since the probability of the final model χ² was less than the level of significance (0.05), the existence of a relationship between the independent variables (82 parameters) and the dependent variable (five β-turn types) was supported.
Using the Likelihood Ratio Tests table in the output of the multinomial logistic regression model, we found that only 4 of the 82 parameters had an overall relationship with the dependent variable. These parameters were the occurrence of glutamine in position i of the β-turn sequence (P-value = 0.019), the occurrence of histidine in position i + 1 (P-value = 0.007), the occurrence of glutamic acid in position i + 2 (P-value = 0.050) and the occurrence of arginine in position i + 3 (P-value = 0.019).
The multinomial logistic regression model also compared multiple groups of the dependent variable (i.e. Types I, II, IV, VIII and NS) and provided specific information for each of the embedded binary logit comparisons. Tables 1–3 show the parameter estimates (β), standard errors, Wald statistics and corresponding odds ratios for the selected parameters among the 82 parameters, for Types I, II and VIII of β-turn, respectively, in contrast to Type IV as the reference group of the self-consistency multinomial logistic regression procedure. Only the parameters that were significant (i.e. P-value < 0.05) have been included in the tables.
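The quantities reported for each logit are related in a simple way: the odds ratio is exp(β), and the Wald statistic is commonly computed as (β/SE)², which is referred to a χ² distribution with one degree of freedom. The sketch below uses these standard relations with a hypothetical coefficient and standard error; the values are not taken from Tables 1–3, and the function names are illustrative.

```python
import math

def odds_ratio(beta):
    """Odds ratio of a parameter in a binary logit: exp(beta)."""
    return math.exp(beta)

def wald_statistic(beta, std_error):
    """Wald chi-square statistic with one degree of freedom."""
    return (beta / std_error) ** 2

def wald_p_value(beta, std_error):
    """Tail probability of a chi-square(1) variable: erfc(sqrt(W / 2))."""
    return math.erfc(math.sqrt(wald_statistic(beta, std_error) / 2.0))

beta, std_error = 1.2, 0.4      # hypothetical estimate and standard error
print(odds_ratio(beta), wald_statistic(beta, std_error), wald_p_value(beta, std_error))
```

An odds ratio above 1 means that the presence of the residue raises the odds of the given turn type relative to the reference type, and an odds ratio below 1 means that it lowers them.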
The selected parameters in the different comparisons (or logits) are not identical. For instance, the occurrence of asparagine in position i of the β-turn sequence was significant only in the Type I to Type IV logit. On the other hand, some parameters are significant in all comparisons, such as the occurrence of isoleucine in position i of the β-turn sequence.
As can be seen in Table 1, only 22 of the 82 parameters were selected in this comparison. Among these, the occurrences of proline in positions i, i + 1 and i + 2 of the β-turn sequence, which had the highest logit coefficients (or parameter estimates), were considered the strongest indicators.
Table 2 shows that only 17 of the 82 parameters were significant in this logit. Among these, the occurrence of proline in position i and the occurrence of glycine in position i + 2 of the β-turn sequence were considered the strongest indicators, with the highest parameter estimates. However, these logit coefficients are meaningful only for these two types (II in contrast to IV), and no general conclusion can be drawn from them for the whole dataset.
Table 3 shows that the number of significant parameters in this comparison is 13. In this logit, the percentages of valine, proline, isoleucine, lysine and leucine were the strongest indicators.
The result of the self-consistency test was evaluated by the performance measures. The results shown in Table 4 were obtained from the output of the model.
3.4 Established neural network architecture
We fed our neural networks with the 41 parameters selected in the self-consistency multinomial logistic regression procedure to build the two-stage hybrid model. The number of units in the hidden layer was optimized with respect to the lowest MSE (as mentioned in the Systems and Methods section). Neural networks with 41 input units and a single hidden layer of 30 units performed best for all β-turn types considered. A 9-fold cross-validation procedure was used for the prediction of β-turn types. The performance of the model was evaluated by averaging the performance measures over the nine sets.
3.5 Prediction with neural network
The neural network prediction results are presented in Table 4. Types I and II of β-turn were predicted with an average accuracy of 63.9 and 89.1%, respectively, and overall their performance was better than that of the other β-turn types. The corresponding MCC values were 0.235 and 0.473, respectively. The MCC of Type II was almost twice that of Type I, whereas their sensitivities were similar. The sensitivity and the probability of correct prediction for Type IV were higher than those of the two other types (i.e. VIII and NS); however, the other performance values for Type IV were lower than those for Types VIII and NS. The corresponding MCC values for Types IV, VIII and NS were 0.103, 0.124 and 0.241, respectively. These results indicate that the prediction performance for Type NS is better than that for Types IV and VIII. Among all β-turn types, Types IV and VIII show the poorest performance. Type VIII has a sensitivity of 9.1% and a probability of correct prediction of 29%, the lowest among the β-turn types, and Type IV has the lowest MCC.
| 4 DISCUSSION |
|---|
We have developed a hybrid approach for the analysis and prediction of β-turn types in proteins. From both a structural and a functional point of view, β-turns perform important biological tasks, as reflected in the following facts: (1) a polypeptide chain cannot fold into a compact globular structure without β-turns, (2) β-turns usually occur on the exposed surface of proteins and are therefore likely to be involved in molecular recognition processes between proteins and in interactions between peptide substrates and receptors and (3) β-turns play an important role in protein folding and stability. Thus, the β-turn is an important component of protein structure, and its prediction can provide valuable information to researchers working in the fields of protein secondary structure prediction, protein modeling and protein engineering.
We used the multinomial logistic regression model for the first time to select significant parameters for the prediction of β-turn types. One of the most important advantages of this model is its ability to clarify the weight of each selected significant parameter, which highlights its usefulness in determining the sequence-structure relationship. Our study shows that the parameter selection ability of multinomial logistic regression, in a hybrid with other powerful predictors such as neural networks, leads to the development of more accurate models.
We also showed that it is possible to build an accurate method for the prediction of β-turn types with simple concepts such as the occurrences of amino acids in different positions of the β-turn sequence, with the help of multiple alignments. We obtained MCC values of 0.235, 0.473, 0.103, 0.124 and 0.241 for Types I, II, IV, VIII and NS of β-turn, respectively.
One of the most important advantages of the hybrid approach is that it can predict different types of β-turn using only 41 significant parameters. In general, the results show that this hybrid approach provides better information for a precise prediction while using fewer parameters, which include only the percentages and positional occurrences of amino acids. It should be emphasized that the selected parameters are meaningful only for discriminating between two specific types (for example, Type II and Type IV). Therefore, these local data do not support any general conclusion about the occurrence of a particular amino acid at a specific position.
We used the PDB profile to prepare our database in order to analyze and predict the β-turn type using a novel hybrid model. The justification for this data preparation was based on previously published reports (Kaur and Raghava, 2002, 2003, 2004; Shepherd et al., 1999). There are reasonable concerns that the PDB profile shows only a static picture of the crystal structure, taken from the extremely diverse conformational dynamics that a protein exhibits. However, this idea needs further investigation. We intend to address this point by checking the stability of different β-turn conformations using alternative X-ray or NMR structures sharing the same sequence.
This study clarified the efficiency of using the statistical model of multinomial logistic regression as a preprocessor for determining effective parameters. Moreover, a preprocessor in the first stage of the hybrid approach simplifies the optimal structure of the neural network. This reduces the time needed for neural network training in the second stage, decreases the probability of overfitting and yields high precision and reliability.
| ACKNOWLEDGEMENTS |
|---|
The authors are indebted to Mr. Hamed Sadat Hayatshahi for his useful comments.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Anna Tramontano
Received on February 25, 2007; revised on May 20, 2007; accepted on June 12, 2007
| REFERENCES |
|---|
Cai YD, et al. Prediction of β-turns with learning machines. Peptides (2003) 24:665–669.
Chou KC. Prediction of tight turns and their types in proteins. Anal. Biochem (2000) 286:1–16.
Chou PY, Fasman GD. Conformational parameters for amino acids in helical, β-sheet and random coil regions calculated from proteins. Biochemistry (1974) 13:211–222.
Fuchs PFJ, Alix AJP. High accuracy prediction of β-turns and their types using propensities and multiple alignments. Proteins (2005) 59:828–839.
Hayatshahi SHS, et al. Non-linear quantitative structure-activity relationship for adenine derivatives as competitive inhibitors of adenosine deaminase. Biochem. Biophys. Res. Commun (2005) 338:1137–1142.
Hosmer DW, Lemeshow S. Applied logistic regression (2000) New York: John Wiley & Sons Inc.
Hutchinson EG, Thornton JM. A revised set of potentials for β-turn formation in proteins. Protein Sci (1994) 3:2207–2216.
Hutchinson EG, Thornton JM. PROMOTIF: a program to identify and analyze structural motifs in proteins. Protein Sci (1996) 5:212–220.
Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers (1983) 22:2577–2637.
Kaur H, Raghava GPS. An evaluation of β-turn prediction methods. Bioinformatics (2002) 18:1508–1514.
Kaur H, Raghava GPS. Prediction of β-turns in proteins from multiple alignment using neural network. Protein Sci (2003) 12:627–634.
Kaur H, Raghava GPS. A neural network method for prediction of β-turn types in proteins using evolutionary information. Bioinformatics (2004) 20:2751–2758.
Lewis PN, et al. Chain reversals in proteins. Biochim. Biophys. Acta (1973) 303:211–229.
Likas A, Stafylopatis A. Training the random neural network using quasi-Newton methods. Eur. J. Oper. Res (2000) 126:331–339.
Matthews BW. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta (1975) 405:442–451.
McGregor MJ, et al. Prediction of β-turns in proteins using neural network. Protein Eng (1989) 2:521–526.
Noguchi T, et al. PDB_REPRDB: a database of representative protein chains from the Protein Data Bank (PDB). Nucleic Acids Res (2001) 29:219–220.
Pham TH, et al. Support vector machines for prediction and analysis of beta and gamma-turns in proteins. J. Bioinform. Comput. Biol (2005) 3:343–358.
Richardson JS. The anatomy and taxonomy of protein structure. Adv. Protein Chem (1981) 34:167–339.
Rose GD, et al. Turns in peptides and proteins. Adv. Protein Chem (1985) 37:100–109.
Shepherd AJ, et al. Prediction of the location and type of β-turns in proteins using neural networks. Protein Sci (1999) 8:1045–1055.
Takano K, et al. Role of amino acid residues at turns in the conformational stability and folding of human lysozyme. Biochemistry (2000) 39:8655–8665.
Wilmot CM, Thornton JM. Analysis and prediction of the different types of β-turns in proteins. J. Mol. Biol (1988) 203:221–232.
Zhang CT, Chou KC. Prediction of β-turns in proteins by 1–4 & 2–3 correlation model. Biopolymers (1997) 41:673–702.
Zhang Q, et al. Improved method for predicting β-turn using support vector machine. Bioinformatics (2005) 21:2370–2374.