Bioinformatics Advance Access originally published online on June 26, 2009
Bioinformatics 2009 25(18):2404-2410; doi:10.1093/bioinformatics/btp397

© The Author 2009. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

The impact of incomplete knowledge on evaluation: an experimental benchmark for protein function prediction

Curtis Huttenhower 1,2,†, Matthew A. Hibbs 3,†, Chad L. Myers 4,†, Amy A. Caudy 2, David C. Hess 2 and Olga G. Troyanskaya 1,2,*

1 Department of Computer Science, Princeton University, 35 Olden Street, Princeton, NJ 08540-5233, 2 Lewis-Sigler Institute for Integrative Genomics, Carl Icahn Laboratory, Princeton University, Princeton, NJ 08544, 3 Jackson Laboratory, 600 Main Street, Bar Harbor, ME 04609 and 4 Department of Computer Science, University of Minnesota, 200 Union Street SE, Minneapolis, MN 55455, USA

*To whom correspondence should be addressed.


Abstract

Motivation: Rapidly expanding repositories of highly informative genomic data have generated increasing interest in methods for protein function prediction and inference of biological networks. The successful application of supervised machine learning to these tasks requires a gold standard for protein function: a trusted set of correct examples, which can be used to assess performance through cross-validation or other statistical approaches. Since gene annotation is incomplete for even the best studied model organisms, the biological reliability of such evaluations may be called into question.

Results: We address this concern by constructing and analyzing an experimentally based gold standard through comprehensive validation of protein function predictions for mitochondrion biogenesis in Saccharomyces cerevisiae. Specifically, we determine that (i) current machine learning approaches are able to generalize and predict novel biology from an incomplete gold standard and (ii) incomplete functional annotations adversely affect the evaluation of machine learning performance. While computational approaches performed better than predicted in the face of incomplete data, relative comparison of competing approaches—even those employing the same training data—is problematic with a sparse gold standard. Incomplete knowledge causes individual methods' performances to be differentially underestimated, resulting in misleading performance evaluations. We provide a benchmark gold standard for yeast mitochondria to complement current databases and an analysis of our experimental results in the hopes of mitigating these effects in future comparative evaluations.

Availability: The mitochondrial benchmark gold standard, as well as experimental results and additional data, is available at http://function.princeton.edu/mitochondria

Contact: ogt@cs.princeton.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

†The authors wish it to be known that, in their opinion, the first three authors should be regarded as joint First Authors.

Associate Editor: Jonathan Wren


Received on April 10, 2009; revised on June 5, 2009; accepted on June 23, 2009
