Bioinformatics Advance Access originally published online on November 29, 2005
Bioinformatics 2006 22(4):407-412; doi:10.1093/bioinformatics/bti806
Application of compression-based distance measures to protein sequence classification: a methodological study
1Research Group on Artificial Intelligence of the Hungarian Academy of Sciences and University of Szeged, Aradi vértanúk tere 1., H-6720 Szeged, Hungary
2Bioinformatics Group, International Centre for Genetic Engineering and Biotechnology, Padriciano 99, I-34012 Trieste, Italy
3Bioinformatics Group, Biological Research Centre, Hungarian Academy of Sciences, Temesvári krt. 62, H-6701 Szeged, Hungary
*To whom correspondence should be addressed.
Motivation: Distance measures built on the notion of text compression have been used for the comparison and classification of entire genomes and mitochondrial genomes. The present study was undertaken in order to explore their utility in the classification of protein sequences.
Results: We constructed compression-based distance measures (CBMs) using the Lempel-Ziv and the PPMZ compression algorithms and compared their performance with that of the Smith-Waterman algorithm and BLAST, using nearest neighbour or support vector machine classification schemes. The datasets included a subset of the SCOP protein structure database to test distant protein similarities, 3-phosphoglycerate kinase sequences selected from archaean, bacterial and eukaryotic species, as well as low- and high-complexity sequence segments of the human proteome. CBM values show a dependence on the length and the complexity of the sequences compared. In classification tasks CBMs performed especially well on distantly related proteins, where the performance of a combined measure, constructed from a CBM and a BLAST score, approached or even slightly exceeded that of the Smith-Waterman algorithm and two hidden Markov model-based algorithms.
Contact: kocsor{at}inf.u-szeged.hu
Supplementary information: http://www.inf.u-szeged.hu/~kocsor/CBMO5
Received on August 30, 2005; revised on November 27, 2005; accepted on November 27, 2005
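The idea of a compression-based distance used with nearest neighbour classification can be illustrated with a minimal sketch. The abstract does not state the formula used in the study, so the code below assumes the widely used normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. It also substitutes Python's `zlib` (DEFLATE, an LZ77-derived compressor) for the Lempel-Ziv and PPMZ compressors named above. The function names and the toy sequences are hypothetical and are not taken from the study.

```python
import zlib


def compressed_size(seq: str) -> int:
    """Number of bytes in the zlib-compressed sequence (maximum compression level)."""
    return len(zlib.compress(seq.encode("ascii"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two sequences.

    Assumption: this is the generic NCD formula, not necessarily the exact
    measure constructed in the study.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)  # compressed size of the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def nearest_neighbour(query: str, labelled: list[tuple[str, str]]) -> str:
    """Return the label of the labelled sequence closest to the query under NCD."""
    _, label = min(labelled, key=lambda item: ncd(query, item[0]))
    return label


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    group_a = "ACDEFGHIKLMNPQRSTVWY" * 10
    group_b = "WYVTSRQPNMLKIHGFEDCA" * 10
    training = [(group_a, "A"), (group_b, "B")]
    print(ncd(group_a, group_a), ncd(group_a, group_b))
    print(nearest_neighbour(group_a[:100], training))
```

Because the compressed length includes fixed overhead and depends on how repetitive the input is, values from a sketch like this vary with sequence length and complexity, which is in line with the dependence reported in the Results.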