Bioinformatics Advance Access originally published online on June 24, 2004
Bioinformatics 2004 20(17):3080-3098; doi:10.1093/bioinformatics/bth369
Bioinformatics vol. 20 issue 17 © Oxford University Press 2004; all rights reserved.
Developing optimal non-linear scoring function for protein design
Department of Bioengineering, SEO, MC-063, University of Illinois at Chicago, 851 S. Morgan Street, Room 218, Chicago, IL 60607-7052, USA
Received on February 29, 2004; revised on June 9, 2004; accepted on June 10, 2004
Advance Access Publication June 24, 2004
Motivation: Protein design aims to identify sequences compatible with a given protein fold but incompatible with any alternative fold. To select the correct sequences and to guide the search process, a design scoring function is critically important. Such a scoring function should be able to characterize the global fitness landscape of many proteins simultaneously.
Results: To find optimal design scoring functions, we introduce two geometric views and propose a formulation using a mixture of non-linear Gaussian kernel functions. We aim to solve a simplified protein sequence design problem. Our goal is to distinguish each native sequence for a major portion of representative protein structures from a large number of alternative decoy sequences, each a fragment from proteins of different folds. Our scoring function perfectly discriminates a set of 440 native proteins from 14 million sequence decoys. We show that no linear scoring function can succeed in this task. In a blind test of unrelated proteins, our scoring function misclassifies only 13 native proteins out of 194. This compares favorably with about three to four times more misclassifications when optimal linear functions reported in the literature are used. We also discuss how to develop protein folding scoring functions.
Availability: Available on request from the authors.
Contact: jliangATuicDOTedu
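
As an illustration of the kind of formulation described in the Results, the following is a minimal sketch of a scoring function built from a mixture of Gaussian kernels. It is not the authors' implementation: the abstract does not give the functional form, so the sign convention (negative scores for native-like descriptors), the unit weights, the kernel width `sigma`, and the names `gaussian_kernel` and `kernel_score` are all assumptions made for this example. The toy data are four 2-D points in an XOR arrangement, which no linear function can separate, standing in for the much higher-dimensional sequence descriptors a real design problem would use.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel K(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def kernel_score(c, native_centers, decoy_centers,
                 native_weights, decoy_weights, sigma=1.0):
    """Hypothetical mixture-of-Gaussians score for a descriptor vector c.

    Assumed convention: kernels centered on decoys add to the score and
    kernels centered on natives subtract from it, so a negative score
    means "native-like".
    """
    decoy_term = sum(w * gaussian_kernel(c, d, sigma)
                     for w, d in zip(decoy_weights, decoy_centers))
    native_term = sum(w * gaussian_kernel(c, n, sigma)
                      for w, n in zip(native_weights, native_centers))
    return decoy_term - native_term

# Toy XOR-like descriptors (illustrative only, not real protein data).
natives = [(0.0, 0.0), (1.0, 1.0)]
decoys = [(0.0, 1.0), (1.0, 0.0)]
weights = [1.0, 1.0]  # assumed unit weights; in practice these would be fitted

native_scores = [kernel_score(c, natives, decoys, weights, weights, sigma=0.5)
                 for c in natives]
decoy_scores = [kernel_score(c, natives, decoys, weights, weights, sigma=0.5)
                for c in decoys]

print("native scores:", native_scores)
print("decoy scores:", decoy_scores)
```

In this toy setting every native point receives a negative score and every decoy a positive one, even though the two classes cannot be separated by any linear function of the two coordinates. In a realistic setting the weights and kernel width would have to be fitted to training data rather than fixed by hand.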