Bioinformatics Advance Access published online on March 29, 2005
Bioinformatics, doi:10.1093/bioinformatics/bti404
1 Department of Computer Science, Exeter University, UK
Motivation: Although the outbreak of severe acute respiratory syndrome (SARS) is currently over, it is expected to return. A critical challenge for scientists across disciplines worldwide is to study the cleavage specificity of the SARS-related coronavirus (SARS-CoV) protease and to use that knowledge to design effective inhibitors against the disease. The most commonly used inductive programming methods for knowledge discovery from data assume that the elements of an input pattern are orthogonal to each other. Suppose a sub-sequence is denoted P2-P1-P1'-P2'. A conventional inductive programming method may then produce a rule such as "if P1 = Q, then the sub-sequence is cleaved, otherwise non-cleaved". If the site P1 is not orthogonal to the other sites (for instance P2, P1', and P2'), the predictive power of such rules may be limited. This study therefore develops a novel method for constructing non-orthogonal decision trees for mining protease data.
Results: Eighteen coronavirus polyprotein sequences were downloaded from NCBI (http://www.ncbi.nlm.nih.gov). Among these sequences, 252 cleavage sites have been experimentally determined. The sequences are scanned with a sliding window of size k to generate about 50,000 k-mer sub-sequences (k-mers for short). The value of k ranges from 4 to 12 in steps of two. The bio-basis function proposed in (Thomson et al., 2003) is used to transform the k-mers into a high-dimensional numerical space, to which an inductive programming method is applied to derive a decision tree for decision-making. This transformation is referred to as bio-mapping. The constructed decision trees select about ten of the 50,000 k-mers, and this small set of selected k-mers is regarded as a set of decisive templates. In this way, non-orthogonal decision trees are constructed from the selected templates, and the prediction accuracy is significantly improved.
Availability: The program for bio-mapping can be obtained by request to the author.
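The sliding-window scan and the bio-mapping step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's program: the function names, the toy sequence, the two templates, and the value of gamma are all hypothetical. The similarity score here simply counts matching residues, where a real implementation would use an amino-acid substitution matrix, and the exponential form of the bio-basis function is assumed rather than taken from this abstract.

```python
import math


def sliding_kmers(sequence, k):
    """Return every length-k sub-sequence (k-mer) of `sequence`, left to right."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def toy_similarity(a, b):
    """Hypothetical stand-in for a substitution-matrix score:
    the number of positions at which the two k-mers agree."""
    return sum(1 for x, y in zip(a, b) if x == y)


def bio_basis(kmer, template, gamma=1.0, similarity=toy_similarity):
    """Assumed bio-basis form: exp(gamma * (s(x, t) - s(t, t)) / s(t, t)).
    The value is 1 when the k-mer equals the template and smaller otherwise."""
    s_tt = similarity(template, template)
    return math.exp(gamma * (similarity(kmer, template) - s_tt) / s_tt)


def bio_map(kmers, templates, gamma=1.0):
    """Map each k-mer to one numerical feature per template."""
    return [[bio_basis(x, t, gamma) for t in templates] for x in kmers]


# Illustrative toy input, not taken from the paper's data.
sequence = "AVLQSGFRKM"
kmers = sliding_kmers(sequence, 4)
templates = ["VLQS", "SGFR"]
features = bio_map(kmers, templates)

for kmer, row in zip(kmers, features):
    print(kmer, [round(v, 3) for v in row])
```

Each row of `features` is the numerical representation of one k-mer. A decision-tree learner applied to such rows splits on similarity to a whole template rather than on a single residue position, which is the sense in which the resulting tree is non-orthogonal.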
Received November 14, 2004
Revised February 7, 2005
Accepted March 22, 2005
Article
Mining SARS-CoV protease cleavage data using non-orthogonal decision trees, a novel method for decisive template selection