Skip Navigation


Bioinformatics Advance Access originally published online on November 12, 2008
Bioinformatics 2009 25(1):30-35; doi:10.1093/bioinformatics/btn583
This Article
Right arrow Full Text Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrowOA All Versions of this Article:
25/1/30    most recent
btn583v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Google Scholar
Right arrow Articles by Wu, J.
Right arrow Articles by Sun, X.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Wu, J.
Right arrow Articles by Sun, X.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© 2008 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Prediction of DNA-binding residues in proteins from amino acid sequences using a random forest model with a hybrid feature

Jiansheng Wu , Hongde Liu , Xueye Duan , Yan Ding , Hongtao Wu , Yunfei Bai and Xiao Sun *

State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, P. R. China

*To whom correspondence should be addressed.


   Abstract

Motivation: In this work, we aim to develop a computational approach for predicting DNA-binding sites in proteins from amino acid sequences. To avoid overfitting with this method, all available DNA-binding proteins from the Protein Data Bank (PDB) are used to construct the models. The random forest (RF) algorithm is used because it is fast and has robust performance for different parameter values. A novel hybrid feature is presented which incorporates evolutionary information of the amino acid sequence, secondary structure (SS) information and orthogonal binary vector (OBV) information which reflects the characteristics of 20 kinds of amino acids for two physical–chemical properties (dipoles and volumes of the side chains). The numbers of binding and non-binding residues in proteins are highly unbalanced, so a novel scheme is proposed to deal with the problem of imbalanced datasets by downsizing the majority class.

Results: The results show that the RF model achieves 91.41% overall accuracy with Matthew's correlation coefficient of 0.70 and an area under the receiver operating characteristic curve (AUC) of 0.913. To our knowledge, the RF method using the hybrid feature is currently the computationally optimal approach for predicting DNA-binding sites in proteins from amino acid sequences without using three-dimensional (3D) structural information. We have demonstrated that the prediction results are useful for understanding protein–DNA interactions.

Availability: DBindR web server implementation is freely available at http://www.cbi.seu.edu.cn/DBindR/DBindR.htm.

Contact: xsun{at}seu.edu.cn

Supplementary information: Supplementary data are available at Bioinformatics online.

Associate Editor: Ivo Hofacker


Received on February 29, 2008; revised on October 10, 2008; accepted on November 7, 2008

Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?




Disclaimer: Please note that abstracts for content published before 1996 were created through digital scanning and may therefore not exactly replicate the text of the original print issues. All efforts have been made to ensure accuracy, but the Publisher will not be held responsible for any remaining inaccuracies. If you require any further clarification, please contact our Customer Services Department.