Skip Navigation


Bioinformatics Advance Access originally published online on February 15, 2006
Bioinformatics 2006 22(10):1207-1210; doi:10.1093/bioinformatics/btl055
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/10/1207    most recent
btl055v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (8)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Nanni, L.
Right arrow Articles by Lumini, A.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Nanni, L.
Right arrow Articles by Lumini, A.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

An ensemble of K-local hyperplanes for predicting protein–protein interactions

Loris Nanni and Alessandra Lumini

DEIS, IEIIT, CNR, Università di Bologna Viale Risorgimento 2, 40136 Bologna, Italy

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS AND DISCUSSION
 4 CONCLUSIONS
 REFERENCES
 

Prediction of protein–protein interaction is a difficult and important problem in biology. In this paper, we propose a new method based on an ensemble of K-local hyperplane distance nearest neighbor (HKNN) classifiers, where each HKNN is trained using a different physicochemical property of the amino acids. Moreover, we propose a new encoding technique that combines the amino acid indices together with the 2-Grams amino acid composition. A fusion of HKNN classifiers combined with the ‘Sum rule’ enables us to obtain an improvement over other state-of-the-art methods. The approach is demonstrated by building a learning system based on experimentally validated protein–protein interactions in human gastric bacterium Helicobacter pylori and in Human dataset.

Contact: lnanni{at}deis.unibo.it


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS AND DISCUSSION
 4 CONCLUSIONS
 REFERENCES
 
Protein–protein interactions are of interest in biology because they regulate a variety of cellular processes. Finding interactions between proteins involved in common cellular functions is a way to get a broader view of how they work cooperatively in a cell. In order to find out protein–protein interactions based on genomic sequence analysis several approaches have been developed that search sequence information for patterns that might be expected to occur based on a priori biological knowledge. (Valencia and Pazos, 2002) describe the most important computational techniques available for the prediction of protein–protein interactions and examine their range of applicability. An approach is to use experimental data and machine learning methods to guide the discovery of inherent patterns in the sequence data, Bock and Gough (2003), Martin et al. (2005) and Nanni (2005a) use this approach. Bock and Gough (2003) create protein structural and physiochemical descriptors based on the primary sequence information, and then train a support vector machine (SVM) learning system to identify protein–protein binding interactions from these descriptors. Martin et al. (2005) introduces a novel descriptor called signature product, which is a product of subsequences and an expansion of the signature descriptor from chemical informatics. Nanni (2005a,b) creates a feature vector based on 2-Grams and then trains a classifier, moreover in Nanni (2005a) the author studies the fusion of classifiers based on different methodologies. Using Helicobacter Pylori dataset and Human dataset (Martin et al., 2005) we benchmark our algorithm and provide comparisons with other state-of-the-art works (Bock and Gough, 2003; Martin et al., 2005; Nanni, 2005a,b).

Since the structure and hence the function of proteins is dictated by the different interacting physicochemical properties of the amino acids making up the protein, we present a new algorithm which uses multiple physicochemical properties of amino acids for building an ensemble of classifiers. An ensemble classifier consists of a set of base classifiers; its prediction rule is based on a summary measure of individual classifications by the base classifiers. In bagging, one samples the training set, generating random independent bootstrap replicates (Valentini, 2005), constructs the classifier on each of these and aggregates them by a simple majority vote in the final decision rule. In Peng (2005) classifiers are constructed in random subspaces of the data feature space. In our ensemble a different K-local hyperplane distance nearest neighbor (HKNN) (Vincent and Bengio, 2001) is trained for each physicochemical property of the amino acids, finally the classifiers are combined with the ‘Sum rule’ (Kuncheva, 2005). Best physicochemical properties to combine are chosen by running sequential forward floating selection (SFFS) (Pudil et al., 1994). Our ensemble outperforms the other state-of-the-art methods in ‘H.pylori’ dataset and obtains performances similar to those obtained by (Martin et al., 2005) in the ‘Human’ dataset. Moreover, we propose a new encoding technique that combines the amino acid indices together with the 2-Gram amino acid composition.


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS AND DISCUSSION
 4 CONCLUSIONS
 REFERENCES
 
The set of physicochemical properties are obtained by the Amino Acid index (Kawashima and Kanehisa, 2000) database available at http://www.genome.jp/dbget/aaindex.html. An amino acid index is a set of 20 numerical values representing any of the different physicochemical properties of amino acids. This database currently contains 494 such indices and 83 substitution matrices.

One of the main computational challenges in using a machine learning approach for the prediction of protein–protein interactions is a suitable encoding of the protein information in some vector space. In our case, we have the problem of representing variable length amino acid sequences as vectors containing the information necessary to distinguish between binding and non-binding protein–protein pairs. Our data points are in fact pairs (protein–protein pairs) of amino acid sequences. In this paper we extract the features from each protein as detailed in Section 2. Our encoding technique combines the amino acid indices together with the 2-Gram amino acid composition. We called this feature extraction method Physicochemical 2-Grams.

Then, the representation of an interaction pair is given by the sum of the feature vectors of the proteins that belong to the protein pair. To reduce the dimensionality of this representation we projected it onto a lower D-dimensional space by principal component analysis (in our tests, as suggested in (Nanni, 2005b), we set D = 70 (the retained variance is 65%) for H.pylori dataset and D = 30 (the retained variance is 50%) for Human Dataset.

As far as the classification is concerned, we adopt a multiclassifier system: a different HKNN is trained for each physicochemical property of the amino acids, finally the classifiers are combined with the ‘Sum rule’. Best physicochemical properties to combine are chosen by running SFFS (where the features are the physicochemical properties).

In this paper we propose four different ensembles of classifiers obtained by selecting four different sets of features by SFFS. In fact, the difference among these ensembles is the fitness function adopted for the SFFS feature selection (and optimized using a ten-fold cross-validation). The parameters used in the fitness functions are: accuracy [(TP + TN)/(TP + FP + TN + FN)]; precision [TP/(TP + FP)]; and sensitivity [TP/(TP + FN)]; where TP, FP, TN and FN are true and false positive and negative predictions, respectively. The ensembles have been named as follows:

  • I_ACC, the fitness function is the average accuracy
  • I_PR, the fitness function is the average precision
  • I_SEN, the fitness function is the average sensitivity
  • SUM the fitness function is the sum among accuracy, precision and sensitivity.

2.1 Physicochemical 2-Grams
The n-Grams or k-tuples (Nanni, 2005b) are a pair of values (vi, ci), where vi is the feature and ci are the counts of this feature in a protein sequence. An example is shown in Figure 1. These features are all the possible combinations of n letters from the set T of 20 amino acids. The total number of possible features is 20n. This representation has several advantages: it is sequence length invariant and it allows classification based on the localized regions of similarity. In this work we use 2-grams, in this way vi is a pair of amino acids. Since there are 20 amino acids, we have 400 2-grams for each protein.


Figure 1
View larger version (11K):
[in this window]
[in a new window]
 
Figure 1 An example of 2-Grams.

 
In order to model the physiochemical properties of the amino acids we introduce a new encoding technique that combines the value of a given property for an amino acid together with its 2-Grams representation. Given a physiochemical property p, we define index(p, d) the function returning the value of the property p of the amino acid d. Then, we define Fp(i, j) for a given property p as

Formula
where L is the length of the protein sequence; amn(vi, j) returns the j-th amino acid of the pair vi;

The feature vector of a protein for a given physiochemical property p is made by the concatenation between all Fp(i, j) for i = 1, ... , 400 and j = 1, 2. So we obtain a 800-dimensional ‘super-vector’ where the first 400 features are the values of Fp(i, 1) and the last 400 features are the values of Fp(i, 2).

2.2 K-local hyperplane distance nearest neighbor algorithm (HKNN)
HKNN is a modified k-nearest neighbour (KNN) algorithm, it computes distances of each test point x to H local hyperplanes, where H is the number of different classes. The l-th hyperplane is composed of KNN of x in the training set, belonging to the l-th class. A test point x is associated with the class whose hyperplane is closest to x. HKNN has two parameters, K and a penalty term L used to find the hyperplane. The parameters of HKNN have not been optimized in this work and are the same used in Nanni (2005b) (in H.pylori dataset K = 3 and L = 0.25; in Human dataset K = 9 and L = 0.5).


    3 RESULTS AND DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS AND DISCUSSION
 4 CONCLUSIONS
 REFERENCES
 
We have used exactly the same datasets used in Bock and Gough (2003), Martin et al. (2005) Nanni (2005a) and Nanni (2005b). From ‘H.pylori (www.cs.sandia.gov/~smartin/software.html) dataset a set of 1458 positives (interacting pairs) and 1458 negatives (non-interacting pairs) for a total of 2916 protein pairs has been selected at random. From ‘Human’ dataset (Shared by S. Martin) a set of 941 positives (interacting pairs) and 941 negatives (non-interacting pairs) for a total of 1882 protein pairs has been selected at random. We use as performance indicators accuracy, precision and sensitivity averaged over the 10 experiments conducted using 10-fold cross-validation. In a cross-validation run (Duda et al., 2000) the 10 folds of approximately the same size are randomly created. Please note, the 10 folds used in the training are different from the 10 folds used in the testing.

In Figures 2 and 3, we show a different plot for each performance parameter (accuracy, precision and sensitivity) reporting the results obtained by the methods I_ACC, I_PR, I_SEN, SUM as a function of the number k of the physicochemical properties selected by SFFS. The five best properties selected by the SFFS algorithm for the ensemble SUM are shown in Table 1. Please note, SFFS select the most independent [from the point of view of the classification (Kuncheva, 2005)] features extraction methods.


Figure 2
View larger version (13K):
[in this window]
[in a new window]
 
Figure 2 Plot of the results obtained combining an increasing number of systems for the H.pylori dataset.

 

Figure 3
View larger version (11K):
[in this window]
[in a new window]
 
Figure 3 Plot of the results obtained combining an increasing number of systems for the Human dataset.

 

View this table:
[in this window]
[in a new window]
 
Table 1 The five best physicochemical properties automatically selected by the SFFS algorithm for the ensemble SUM

 
In Tables 2 and 3 the performance of our ensembles at a fixed value of selected physicochemical properties (k = 5) has been compared with other state-of-the-art approaches.


View this table:
[in this window]
[in a new window]
 
Table 2 Comparison of state-of-the-art approaches in the ‘H.pylori’ dataset (Nanni, 2005a) is the best ‘stand-alone’ classifier proposed in Nanni (2005a)

 

View this table:
[in this window]
[in a new window]
 
Table 3 Comparison of state-of-the-art approaches in the ‘Human’ dataset [the method (Bock and Gough, 2003) does not report results for this dataset]

 
Obviously, the best results for each performance indicator are obtained by the ensemble that optimizes the indicator itself: I_ACC, I_PR or I_SEN, respectively. (i.e. in Table 2 the value of 0.866 of the accuracy, obtained by I_ACC, is obtained selecting the physicochemical properties that maximize the value of the accuracy, while the value 0.885 of the sensitivity, obtained by I_SEN, is obtained selecting the physicochemical properties that maximize the sensitivity). It is more interesting to note that the ensemble SUM outperforms the other state-of-the-art methods in ‘H.pylori’ dataset and obtains performances similar to those obtained by (Martin et al., 2005) in the ‘Human’ dataset. Moreover, it is remarkable that the best physicochemical properties selected for ‘H.pylori’ dataset are different from the best physicochemical properties selected for ‘Human’ dataset (Table 1).


    4 CONCLUSIONS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS AND DISCUSSION
 4 CONCLUSIONS
 REFERENCES
 
We investigated fusion of classifiers for predicting protein–protein interactions, and tested it on two real-world datasets. To enforce the diversity (Kuncheva, 2005) we combine classifiers based on different physicochemical properties extracted from the proteins. It is believed that classifiers based on different features offer complementary information about the patterns to be classified. Our method encodes and compares protein pairs by concatenating the features extracted from each protein, and hence use a global representation of a protein pair. Differently, in Martin et al. (2005) the authors look for amino acid subsequence pairs which occur together when two proteins interact; we expect that the fusion between our SUM method and the system proposed in Martin et al. (2005) permits to obtain a very reliable system.


    Acknowledgments
 
The authors would like to thank S. Martin at Sandia National Laboratories for sharing the datasets. The authors would like to thank O. Okun (University of Oulu, Finland) for sharing the HKNN code.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Anna Tramontano

Received on December 5, 2005; revised on January 30, 2006; accepted on February 10, 2006

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS AND DISCUSSION
 4 CONCLUSIONS
 REFERENCES
 

    Bock, J. and Gough, D. (2003) Whole-proteome interaction mining. Bioinformatics, 19, 125–135[Abstract/Free Full Text].

    Duda, R.O., Hart, P.E., Stork, D.G. Pattern Classification, (2000) 2nd edn , New York Wiley.

    Kawashima, S. and Kanehisa, M. (2000) AAindex: amino acid index database. Nucleic Acids Res, . 28, 374[Abstract/Free Full Text].

    Kuncheva, L.I. (2005) Diversity in multiple classifier systems. Inform. Fusion, 6, 3–4[CrossRef].

    Martin, S., et al. (2005) Predicting protein–protein interactions using signature products. Bioinformatics, 21, 218–26[Abstract/Free Full Text].

    Nanni, L. (2005a) Fusion of classifiers for predicting protein–protein interactions. Neurocomputing, 68, 289–296[CrossRef][Web of Science].

    Nanni, L. (2005b) Hyperplanes for predicting protein–protein interactions. Neurocomputing, 69, 257–263[CrossRef].

    Peng, Y. (2005) A novel ensemble machine learning for robust microarray data classification. Comput. Biol. Med, . doi:10.1016/j.compbiomed.2005.04.001.

    Pudil, P., et al. (1994) Floating search methods in feature selection. Pattern Recogn. Lett, . 15, 1119–1125[CrossRef].

    Valencia, A. and Pazos, F. (2002) Computational methods for the prediction of protein interactions. Curr. Opin. Struct. Biol, . 12, 368–373[CrossRef][Web of Science][Medline].

    Valentini, G. (2005) An experimental bias-variance analysis of SVM ensembles based on resampling techniques. IEEE Trans. Syst. Man. Cybern. B Cybern, . 35, 1252–1271[Medline].

    Vincent, P. and Bengio, Y. (2001) K-local hyperplane and convex distance nearest neighbour algorithms. Adv. Neural Inf. Process. Syst, . 14, 985–992.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
C.-W. Tung and S.-Y. Ho
POPI: predicting immunogenicity of MHC class I binding peptides by mining informative physicochemical properties
Bioinformatics, April 15, 2007; 23(8): 942 - 949.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
E. Ferraro, A. Via, G. Ausiello, and M. Helmer-Citterich
A novel structure-based encoding for machine-learning applied to the inference of SH3 domain specificity
Bioinformatics, October 1, 2006; 22(19): 2333 - 2339.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/10/1207    most recent
btl055v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (8)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Nanni, L.
Right arrow Articles by Lumini, A.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Nanni, L.
Right arrow Articles by Lumini, A.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?