Bioinformatics Advance Access originally published online on September 6, 2005
Bioinformatics 2005 21(22):4125-4132; doi:10.1093/bioinformatics/bti658
Optimal word sizes for dissimilarity measures and estimation of the degree of dissimilarity between DNA sequences
1Department of Statistics, National Cheng-Kung University Tainan, Taiwan 70101
2Institute of Statistical Science, Academia Sinica Taipei, Taiwan 11529
3Institute of Bioinformatics, National Yang-Ming University Taipei, Taiwan 11221
*To whom correspondence should be addressed.
Motivation: Several measures of DNA sequence dissimilarity have been developed. The purpose of this paper is threefold. Firstly, we compare the performance of several word-based or alignment-based methods. Secondly, we give a general guideline for choosing the window size and determining the optimal word sizes for several word-based measures at different window sizes. Thirdly, we use a large-scale simulation method to simulate data from the distribution of SKLD (symmetric Kullback–Leibler discrepancy). These simulated data can be used to estimate the degree of dissimilarity β between any pair of DNA sequences.
Results: Our study shows that (1) for whole-sequence similarity/dissimilarity identification, the window size should be as large as possible, but probably not >3000, as restricted by CPU time in practice; (2) for each measure, the optimal word size increases with window size; (3) when the optimal word size is used, SKLD performance is superior in both simulation and real data analysis; (4) the estimate of β based on SKLD can be used to quickly filter out a large number of dissimilar sequences and speed up alignment-based database searches for similar sequences; and (5) this estimate is also applicable in local similarity comparison situations. For example, it can help in selecting oligo probes with high specificity and, therefore, has potential in probe design for microarrays.
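To make the word-based idea concrete, the following is a minimal Python sketch of a symmetric Kullback–Leibler discrepancy between the k-word frequency profiles of two sequences. It is an illustration under stated assumptions, not the authors' MATLAB implementation: the function names `word_frequencies` and `skld`, the pseudocount smoothing, and the choice to sum the two directed divergences without further normalization are all assumptions, and the measure defined in the paper may differ in these details.

```python
from collections import Counter
from itertools import product
from math import log

ALPHABET = "ACGT"


def word_frequencies(seq, k, pseudocount=1.0):
    """Smoothed relative frequencies of all 4**k overlapping k-words in seq.

    Assumption: a positive pseudocount is added to every word so that no
    frequency is zero; words containing letters outside ACGT are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(w) for w in product(ALPHABET, repeat=k)]
    total = sum(counts[w] for w in words) + pseudocount * len(words)
    return {w: (counts[w] + pseudocount) / total for w in words}


def skld(seq_a, seq_b, k, pseudocount=1.0):
    """Symmetric KL discrepancy KL(p||q) + KL(q||p) between k-word profiles.

    Assumption: the two directed divergences are simply added, which
    simplifies to sum((p - q) * log(p / q)) over all words.
    """
    p = word_frequencies(seq_a, k, pseudocount)
    q = word_frequencies(seq_b, k, pseudocount)
    return sum((p[w] - q[w]) * log(p[w] / q[w]) for w in p)


# Hypothetical toy sequences: b differs from a by one base, c is unrelated.
a = "ACGTACGTACGTTTGACA"
b = "ACGTACGAACGTTTGACA"
c = "GGGGCCCCGGGGCCCCGG"
print(skld(a, b, k=2), skld(a, c, k=2))
```

In this sketch the word size `k` plays the role of the word size discussed above, and the sequence passed in plays the role of a window. A larger window supplies more k-word counts, which is one intuitive reason a larger word size can become usable as the window grows.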
Availability: The SKLD algorithm, the estimate of β and the simulation software are implemented in MATLAB code and are available at http://www.stat.ncku.edu.tw/tjwu
Contact: tjwu{at}stat.ncku.edu.tw
Supplementary information: Tables A1–A3 and Remarks 1–11 at http://www.stat.ncku.edu.tw/tjwu
Received on May 25, 2005; revised on August 24, 2005; accepted on August 31, 2005