Bioinformatics Advance Access originally published online on February 15, 2006
Bioinformatics 2006 22(10):1207-1210; doi:10.1093/bioinformatics/btl055
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
An ensemble of K-local hyperplanes for predicting proteinprotein interactions
DEIS, IEIIT, CNR, Università di Bologna Viale Risorgimento 2, 40136 Bologna, Italy
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Prediction of proteinprotein interaction is a difficult and important problem in biology. In this paper, we propose a new method based on an ensemble of K-local hyperplane distance nearest neighbor (HKNN) classifiers, where each HKNN is trained using a different physicochemical property of the amino acids. Moreover, we propose a new encoding technique that combines the amino acid indices together with the 2-Grams amino acid composition. A fusion of HKNN classifiers combined with the Sum rule enables us to obtain an improvement over other state-of-the-art methods. The approach is demonstrated by building a learning system based on experimentally validated proteinprotein interactions in human gastric bacterium Helicobacter pylori and in Human dataset.
Contact: lnanni{at}deis.unibo.it
| 1 INTRODUCTION |
|---|
|
|
|---|
Proteinprotein interactions are of interest in biology because they regulate a variety of cellular processes. Finding interactions between proteins involved in common cellular functions is a way to get a broader view of how they work cooperatively in a cell. In order to find out proteinprotein interactions based on genomic sequence analysis several approaches have been developed that search sequence information for patterns that might be expected to occur based on a priori biological knowledge. (Valencia and Pazos, 2002) describe the most important computational techniques available for the prediction of proteinprotein interactions and examine their range of applicability. An approach is to use experimental data and machine learning methods to guide the discovery of inherent patterns in the sequence data, Bock and Gough (2003), Martin et al. (2005) and Nanni (2005a) use this approach. Bock and Gough (2003) create protein structural and physiochemical descriptors based on the primary sequence information, and then train a support vector machine (SVM) learning system to identify proteinprotein binding interactions from these descriptors. Martin et al. (2005) introduces a novel descriptor called signature product, which is a product of subsequences and an expansion of the signature descriptor from chemical informatics. Nanni (2005a,b) creates a feature vector based on 2-Grams and then trains a classifier, moreover in Nanni (2005a) the author studies the fusion of classifiers based on different methodologies. Using Helicobacter Pylori dataset and Human dataset (Martin et al., 2005) we benchmark our algorithm and provide comparisons with other state-of-the-art works (Bock and Gough, 2003; Martin et al., 2005; Nanni, 2005a,b).
Since the structure and hence the function of proteins is dictated by the different interacting physicochemical properties of the amino acids making up the protein, we present a new algorithm which uses multiple physicochemical properties of amino acids for building an ensemble of classifiers. An ensemble classifier consists of a set of base classifiers; its prediction rule is based on a summary measure of individual classifications by the base classifiers. In bagging, one samples the training set, generating random independent bootstrap replicates (Valentini, 2005), constructs the classifier on each of these and aggregates them by a simple majority vote in the final decision rule. In Peng (2005) classifiers are constructed in random subspaces of the data feature space. In our ensemble a different K-local hyperplane distance nearest neighbor (HKNN) (Vincent and Bengio, 2001) is trained for each physicochemical property of the amino acids, finally the classifiers are combined with the Sum rule (Kuncheva, 2005). Best physicochemical properties to combine are chosen by running sequential forward floating selection (SFFS) (Pudil et al., 1994). Our ensemble outperforms the other state-of-the-art methods in H.pylori dataset and obtains performances similar to those obtained by (Martin et al., 2005) in the Human dataset. Moreover, we propose a new encoding technique that combines the amino acid indices together with the 2-Gram amino acid composition.
| 2 METHODS |
|---|
|
|
|---|
The set of physicochemical properties are obtained by the Amino Acid index (Kawashima and Kanehisa, 2000) database available at http://www.genome.jp/dbget/aaindex.html. An amino acid index is a set of 20 numerical values representing any of the different physicochemical properties of amino acids. This database currently contains 494 such indices and 83 substitution matrices.
One of the main computational challenges in using a machine learning approach for the prediction of proteinprotein interactions is a suitable encoding of the protein information in some vector space. In our case, we have the problem of representing variable length amino acid sequences as vectors containing the information necessary to distinguish between binding and non-binding proteinprotein pairs. Our data points are in fact pairs (proteinprotein pairs) of amino acid sequences. In this paper we extract the features from each protein as detailed in Section 2. Our encoding technique combines the amino acid indices together with the 2-Gram amino acid composition. We called this feature extraction method Physicochemical 2-Grams.
Then, the representation of an interaction pair is given by the sum of the feature vectors of the proteins that belong to the protein pair. To reduce the dimensionality of this representation we projected it onto a lower D-dimensional space by principal component analysis (in our tests, as suggested in (Nanni, 2005b), we set D = 70 (the retained variance is 65%) for H.pylori dataset and D = 30 (the retained variance is 50%) for Human Dataset.
As far as the classification is concerned, we adopt a multiclassifier system: a different HKNN is trained for each physicochemical property of the amino acids, finally the classifiers are combined with the Sum rule. Best physicochemical properties to combine are chosen by running SFFS (where the features are the physicochemical properties).
In this paper we propose four different ensembles of classifiers obtained by selecting four different sets of features by SFFS. In fact, the difference among these ensembles is the fitness function adopted for the SFFS feature selection (and optimized using a ten-fold cross-validation). The parameters used in the fitness functions are: accuracy [(TP + TN)/(TP + FP + TN + FN)]; precision [TP/(TP + FP)]; and sensitivity [TP/(TP + FN)]; where TP, FP, TN and FN are true and false positive and negative predictions, respectively. The ensembles have been named as follows:
- I_ACC, the fitness function is the average accuracy
- I_PR, the fitness function is the average precision
- I_SEN, the fitness function is the average sensitivity
- SUM the fitness function is the sum among accuracy, precision and sensitivity.
2.1 Physicochemical 2-Grams
The n-Grams or k-tuples (Nanni, 2005b) are a pair of values (vi, ci), where vi is the feature and ci are the counts of this feature in a protein sequence. An example is shown in Figure 1. These features are all the possible combinations of n letters from the set T of 20 amino acids. The total number of possible features is 20n. This representation has several advantages: it is sequence length invariant and it allows classification based on the localized regions of similarity. In this work we use 2-grams, in this way vi is a pair of amino acids. Since there are 20 amino acids, we have 400 2-grams for each protein.
|
In order to model the physiochemical properties of the amino acids we introduce a new encoding technique that combines the value of a given property for an amino acid together with its 2-Grams representation. Given a physiochemical property p, we define index(p, d) the function returning the value of the property p of the amino acid d. Then, we define Fp(i, j) for a given property p as
![]() |
The feature vector of a protein for a given physiochemical property p is made by the concatenation between all Fp(i, j) for i = 1, ... , 400 and j = 1, 2. So we obtain a 800-dimensional super-vector where the first 400 features are the values of Fp(i, 1) and the last 400 features are the values of Fp(i, 2).
2.2 K-local hyperplane distance nearest neighbor algorithm (HKNN)
HKNN is a modified k-nearest neighbour (KNN) algorithm, it computes distances of each test point x to H local hyperplanes, where H is the number of different classes. The l-th hyperplane is composed of KNN of x in the training set, belonging to the l-th class. A test point x is associated with the class whose hyperplane is closest to x. HKNN has two parameters, K and a penalty term L used to find the hyperplane. The parameters of HKNN have not been optimized in this work and are the same used in Nanni (2005b) (in H.pylori dataset K = 3 and L = 0.25; in Human dataset K = 9 and L = 0.5).
| 3 RESULTS AND DISCUSSION |
|---|
|
|
|---|
We have used exactly the same datasets used in Bock and Gough (2003), Martin et al. (2005) Nanni (2005a) and Nanni (2005b). From H.pylori (www.cs.sandia.gov/~smartin/software.html) dataset a set of 1458 positives (interacting pairs) and 1458 negatives (non-interacting pairs) for a total of 2916 protein pairs has been selected at random. From Human dataset (Shared by S. Martin) a set of 941 positives (interacting pairs) and 941 negatives (non-interacting pairs) for a total of 1882 protein pairs has been selected at random. We use as performance indicators accuracy, precision and sensitivity averaged over the 10 experiments conducted using 10-fold cross-validation. In a cross-validation run (Duda et al., 2000) the 10 folds of approximately the same size are randomly created. Please note, the 10 folds used in the training are different from the 10 folds used in the testing.
In Figures 2 and 3, we show a different plot for each performance parameter (accuracy, precision and sensitivity) reporting the results obtained by the methods I_ACC, I_PR, I_SEN, SUM as a function of the number k of the physicochemical properties selected by SFFS. The five best properties selected by the SFFS algorithm for the ensemble SUM are shown in Table 1. Please note, SFFS select the most independent [from the point of view of the classification (Kuncheva, 2005)] features extraction methods.
|
|
|
In Tables 2 and 3 the performance of our ensembles at a fixed value of selected physicochemical properties (k = 5) has been compared with other state-of-the-art approaches.
|
|
Obviously, the best results for each performance indicator are obtained by the ensemble that optimizes the indicator itself: I_ACC, I_PR or I_SEN, respectively. (i.e. in Table 2 the value of 0.866 of the accuracy, obtained by I_ACC, is obtained selecting the physicochemical properties that maximize the value of the accuracy, while the value 0.885 of the sensitivity, obtained by I_SEN, is obtained selecting the physicochemical properties that maximize the sensitivity). It is more interesting to note that the ensemble SUM outperforms the other state-of-the-art methods in H.pylori dataset and obtains performances similar to those obtained by (Martin et al., 2005) in the Human dataset. Moreover, it is remarkable that the best physicochemical properties selected for H.pylori dataset are different from the best physicochemical properties selected for Human dataset (Table 1).
| 4 CONCLUSIONS |
|---|
|
|
|---|
We investigated fusion of classifiers for predicting proteinprotein interactions, and tested it on two real-world datasets. To enforce the diversity (Kuncheva, 2005) we combine classifiers based on different physicochemical properties extracted from the proteins. It is believed that classifiers based on different features offer complementary information about the patterns to be classified. Our method encodes and compares protein pairs by concatenating the features extracted from each protein, and hence use a global representation of a protein pair. Differently, in Martin et al. (2005) the authors look for amino acid subsequence pairs which occur together when two proteins interact; we expect that the fusion between our SUM method and the system proposed in Martin et al. (2005) permits to obtain a very reliable system.
| Acknowledgments |
|---|
The authors would like to thank S. Martin at Sandia National Laboratories for sharing the datasets. The authors would like to thank O. Okun (University of Oulu, Finland) for sharing the HKNN code.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Anna Tramontano
Received on December 5, 2005; revised on January 30, 2006; accepted on February 10, 2006
| REFERENCES |
|---|
|
|
|---|
Bock, J. and Gough, D. (2003) Whole-proteome interaction mining. Bioinformatics, 19, 125135
Duda, R.O., Hart, P.E., Stork, D.G. Pattern Classification, (2000) 2nd edn , New York Wiley.
Kawashima, S. and Kanehisa, M. (2000) AAindex: amino acid index database. Nucleic Acids Res, . 28, 374
Kuncheva, L.I. (2005) Diversity in multiple classifier systems. Inform. Fusion, 6, 34[CrossRef].
Martin, S., et al. (2005) Predicting proteinprotein interactions using signature products. Bioinformatics, 21, 21826
Nanni, L. (2005a) Fusion of classifiers for predicting proteinprotein interactions. Neurocomputing, 68, 289296[CrossRef][ISI].
Nanni, L. (2005b) Hyperplanes for predicting proteinprotein interactions. Neurocomputing, 69, 257263[CrossRef].
Peng, Y. (2005) A novel ensemble machine learning for robust microarray data classification. Comput. Biol. Med, . doi:10.1016/j.compbiomed.2005.04.001.
Pudil, P., et al. (1994) Floating search methods in feature selection. Pattern Recogn. Lett, . 15, 11191125[CrossRef].
Valencia, A. and Pazos, F. (2002) Computational methods for the prediction of protein interactions. Curr. Opin. Struct. Biol, . 12, 368373[CrossRef][ISI][Medline].
Valentini, G. (2005) An experimental bias-variance analysis of SVM ensembles based on resampling techniques. IEEE Trans. Syst. Man. Cybern. B Cybern, . 35, 12521271[Medline].
Vincent, P. and Bengio, Y. (2001) K-local hyperplane and convex distance nearest neighbour algorithms. Adv. Neural Inf. Process. Syst, . 14, 985992.
This article has been cited by other articles:
![]() |
C.-W. Tung and S.-Y. Ho POPI: predicting immunogenicity of MHC class I binding peptides by mining informative physicochemical properties Bioinformatics, April 15, 2007; 23(8): 942 - 949. [Abstract] [Full Text] [PDF] |
||||
![]() |
E. Ferraro, A. Via, G. Ausiello, and M. Helmer-Citterich A novel structure-based encoding for machine-learning applied to the inference of SH3 domain specificity Bioinformatics, October 1, 2006; 22(19): 2333 - 2339. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||




