Bioinformatics Advance Access published online on June 23, 2009
Bioinformatics, doi:10.1093/bioinformatics/btp386
SOLpro: accurate sequence-based prediction of protein solubility
Institute for Genomics and Bioinformatics, School of Information and Computer Sciences, University of California, Irvine, CA, USA.
*To whom correspondence should be addressed. Pierre Baldi, E-mail: pfbaldi{at}ics.uci.edu
Abstract
Motivation: Protein insolubility is a major obstacle for many experimental studies. A sequence-based prediction method able to accurately predict the propensity of a protein to be soluble on overexpression could be used, for instance, to prioritize targets in large-scale proteomics projects and to identify mutations likely to increase the solubility of insoluble proteins.
Results: Here we first curate a large, non-redundant and balanced training set of more than 17,000 proteins. Next, we extract and study twenty-three groups of features that are either computed directly from the primary sequence or predicted from it (e.g. secondary structure). The data and the features are used to train a two-stage support vector machine (SVM) architecture. The resulting predictor, SOLpro, is compared directly to existing methods and shows significant improvement according to standard evaluation metrics, with an overall accuracy of over 74% estimated using multiple runs of ten-fold cross-validation.
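As an illustration only, and not SOLpro's actual implementation, the following minimal Python sketch shows what a feature group computed directly from the primary sequence and a ten-fold split might look like. The choice of amino acid composition as the feature group, and the names `composition_features` and `k_fold_indices`, are assumptions made for this example; the abstract does not specify them.

```python
from collections import Counter
import random

# The 20 standard amino acids, in a fixed order so that feature
# vectors from different proteins are comparable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_features(sequence):
    """Return the fraction of each standard amino acid in the sequence.

    Hypothetical example of one sequence-derived feature group;
    non-standard residue codes are ignored.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle item indices and split them into k near-equal folds.

    Each fold serves once as the held-out test set; changing the seed
    gives a different partition, which is how multiple runs of
    cross-validation can be generated.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]
```

In a two-stage design of the kind described, one classifier per feature group would be trained on vectors such as these, and a second-stage classifier would combine their outputs; that training step is omitted here because the abstract gives no details of it.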
Availability: SOLpro is integrated in the SCRATCH suite of predictors and is available for download as a stand-alone application and as a web server at: http://scratch.proteomics.ics.uci.edu.
Contact: pfbaldi{at}ics.uci.edu
Associate Editor: Prof. Burkhard Rost
Received on April 13, 2009; revised on June 9, 2009; accepted on June 17, 2009