Bioinformatics Advance Access published online on September 11, 2006

Bioinformatics, doi:10.1093/bioinformatics/btl475

© The Author (2006). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org
Received April 22, 2006
Revised September 1, 2006
Accepted September 3, 2006

Article

Support vector machine learning from heterogeneous data: an empirical analysis using protein sequence and structure

Darrin P. Lewis 1, Tony Jebara 1, and William Stafford Noble 2 *

1 Department of Computer Science, Columbia University, New York, NY, 10027
2 Department of Genome Sciences, Department of Computer Science and Engineering, University of Washington, Seattle, WA, 98195

* To whom correspondence should be addressed.
William Stafford Noble, E-mail: noble{at}gs.washington.edu


   Abstract

Motivation: Drawing inferences from large, heterogeneous sets of biological data requires a theoretical framework that is capable of representing, for example, DNA and protein sequences, protein structures, microarray expression data and various types of interaction networks. Recently, a class of algorithms known as kernel methods has emerged as a powerful framework for combining diverse types of data. The support vector machine (SVM) algorithm is the most popular kernel method, due to its theoretical underpinnings and strong empirical performance on a wide variety of classification tasks. Furthermore, several recently described extensions allow the SVM to assign relative weights to various data sets, depending on their utility in performing a given classification task.

Results: In this work, we empirically investigate the performance of the SVM on the task of inferring gene functional annotations from a combination of protein sequence and structure data. Our results suggest that the SVM is quite robust to noise in the input data sets. Consequently, in the presence of only two types of data, an SVM trained on an unweighted combination of data sets performs as well as or better than a more sophisticated algorithm that assigns weights to individual data types. Indeed, for this simple case, we can demonstrate empirically that no solution is significantly better than the naive, unweighted average of the two data sets. On the other hand, when multiple noisy data sets are included in the experiment, the naive approach fares worse than the weighted approach. Our results suggest that for many applications, a naive unweighted sum of kernels may be sufficient.
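
To make the distinction between the unweighted and weighted combinations concrete, the following is a minimal sketch in Python, not the authors' implementation. The feature vectors, the choice of a linear and an RBF kernel, the cosine normalization step, and the weights 0.8 and 0.2 are all hypothetical and chosen only for illustration; the abstract does not specify the kernels or weights used in the paper.

```python
import math

def linear_kernel(X):
    """Gram matrix of dot products between the rows of X."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]

def rbf_kernel(X, gamma=0.5):
    """Gram matrix of exp(-gamma * squared Euclidean distance)."""
    return [[math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(xi, xj)))
             for xj in X] for xi in X]

def normalize(K):
    """Cosine-normalize a Gram matrix so that every diagonal entry is 1."""
    n = len(K)
    return [[K[i][j] / math.sqrt(K[i][i] * K[j][j]) for j in range(n)]
            for i in range(n)]

def combine(kernels, weights=None):
    """Weighted sum of Gram matrices.

    With weights=None, each kernel gets weight 1/len(kernels),
    i.e. the naive unweighted average.
    """
    if weights is None:
        weights = [1.0 / len(kernels)] * len(kernels)
    n = len(kernels[0])
    return [[sum(w * K[i][j] for w, K in zip(weights, kernels))
             for j in range(n)] for i in range(n)]

# Hypothetical toy data: three proteins, each described by two data types.
seq_features = [[1.0, 0.0, 2.0], [0.5, 1.0, 1.5], [3.0, 0.2, 0.1]]
struct_features = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1]]

K_seq = normalize(linear_kernel(seq_features))
K_struct = normalize(rbf_kernel(struct_features))

# Naive approach: unweighted average of the two kernels.
K_unweighted = combine([K_seq, K_struct])

# Weighted approach: weights fixed by hand here (hypothetical values);
# a weighted method would instead learn them from the training data.
K_weighted = combine([K_seq, K_struct], weights=[0.8, 0.2])
```

Either combined matrix is itself a valid kernel matrix, because a non-negative weighted sum of kernels is a kernel. It could therefore be passed to any SVM implementation that accepts a precomputed Gram matrix, for example scikit-learn's `SVC(kernel="precomputed")`.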

Availability:


Associate Editor: Alfonso Valencia


This article has been cited by other articles:


T. Damoulas and M. A. Girolami
Probabilistic multi-class multi-kernel learning: on protein fold recognition and remote homology detection
Bioinformatics, May 15, 2008; 24(10): 1264 - 1270.


G. Valentini and N. Cesa-Bianchi
HCGene: a software tool to support the hierarchical classification of genes
Bioinformatics, March 1, 2008; 24(5): 729 - 731.


