Bioinformatics Advance Access first published online on June 28, 2007
This version published online on June 30, 2007
Bioinformatics, doi:10.1093/bioinformatics/btm342
A statistical approach using network structure in the prediction of protein characteristics

Department of Statistics, University of Oxford, Oxford, OX1 3TG, UK.
*To whom correspondence should be addressed. Pao-Yang Chen, E-mail: pchen{at}stats.ox.ac.uk
Abstract
Motivation: The Majority Vote approach has demonstrated that protein-protein interactions can be used to predict the structure or function of a protein. In this paper we propose a novel method for the prediction of such protein characteristics based on frequencies of pairwise interactions. In addition, we study a second new approach using the pattern frequencies of triplets of proteins, thus for the first time taking network structure explicitly into account. Both these methods are extended to jointly consider multiple organisms and multiple characteristics.
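For readers unfamiliar with the baseline, the neighbour-voting idea behind Majority Vote can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' program: the function name `majority_vote`, the toy protein identifiers, the category labels, and the alphabetical tie-breaking rule are all hypothetical and do not come from the paper.

```python
from collections import Counter

def majority_vote(protein, interactions, annotations):
    """Return the most common category among the annotated interaction
    partners of `protein`, or None if no partner is annotated.

    Assumptions (illustrative only): interactions are undirected pairs,
    each protein has at most one category, and ties are broken
    alphabetically so the result is deterministic.
    """
    # Collect interaction partners, ignoring self-interactions.
    neighbours = set()
    for a, b in interactions:
        if a == protein and b != protein:
            neighbours.add(b)
        elif b == protein and a != protein:
            neighbours.add(a)

    # Count the categories of the partners that have an annotation.
    votes = Counter(annotations[n] for n in neighbours if n in annotations)
    if not votes:
        return None

    best = max(votes.values())
    return min(cat for cat, count in votes.items() if count == best)

# Hypothetical toy network; P1 and P5 are unannotated.
interactions = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P4", "P5")]
annotations = {"P2": "transport", "P3": "transport", "P4": "metabolism"}

prediction_p1 = majority_vote("P1", interactions, annotations)
prediction_p5 = majority_vote("P5", interactions, annotations)
```

In this toy network, P1 has two partners annotated "transport" and one annotated "metabolism", so the vote favours "transport". The sketch looks only at the categories of direct partners and uses no information about how those partners are connected to one another.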
Results: Compared to the standard non-network-based method, namely the Majority Vote method, our predictions tend to be more accurate in large networks. For structure prediction, the frequency-based method reaches up to 71% accuracy and the triplet-based method reaches up to 72% accuracy, whereas for function prediction both methods reach up to 90% accuracy. Function prediction on proteins without homologs showed slightly lower but comparable accuracy. Including partially annotated proteins substantially increases the number of proteins whose characteristics our methods predict with reasonable accuracy. We find that the enhanced triplet-based method does not currently yield significantly better results than the enhanced frequency-based method, suggesting that triplets of interactions do not contain substantially more information about protein characteristics than interaction pairs. Our methods offer two main improvements over current approaches: first, multiple protein characteristics are considered simultaneously, and second, data are integrated from multiple species. In addition, the triplet-based method includes network structure more explicitly than the Majority Vote and frequency-based methods.
Availability: The program is available upon request.
Associate Editor: Dr. Jonathan Wren
Funded in part by MMCOMNET Grant No. FP6-2003-BEST-Path-012999.
Received on February 15, 2007; revised on June 8, 2007; accepted on June 22, 2007