Bioinformatics Advance Access published online on September 6, 2005
Bioinformatics, doi:10.1093/bioinformatics/bti658
Optimal word sizes for dissimilarity measures and estimation of the degree of dissimilarity between DNA sequences

1 Department of Statistics, National Cheng-Kung University, Tainan, Taiwan 70101
2 Institute of Statistical Science, Academia Sinica, Taipei, Taiwan 11529; Institute of Bioinformatics, National Yang-Ming University, Taipei, Taiwan 11221
3 Institute of Statistical Science, Academia Sinica, Taipei, Taiwan 11529

* To whom correspondence should be addressed.
Tiee-Jian Wu, E-mail: tjwu{at}stat.ncku.edu.tw

Received May 25, 2005
Revised August 24, 2005
Accepted August 31, 2005

Abstract

Motivation: Several measures of DNA sequence dissimilarity have been developed. The purpose of this paper is threefold. Firstly, we compare the performance of several word-based or alignment-based methods. Secondly, we give a general guideline for choosing the window size and determine the optimal word sizes for several word-based measures at different window sizes. Thirdly, we use a large-scale simulation to simulate data from the distribution of SK-LD (symmetric Kullback-Leibler discrepancy). These simulated data can be used to estimate the degree of dissimilarity between any pair of DNA sequences.

Results: Our study shows that (i) for whole-sequence similarity/dissimilarity identification, the window size should be taken as large as possible, but probably not larger than 3000, as restricted by CPU time in practice; (ii) for each measure, the optimal word size increases with window size; (iii) when the optimal word size is used, SK-LD gives superior performance in both simulation and real data analysis; (iv) the estimate of the degree of dissimilarity based on SK-LD can be used to quickly filter out a large number of dissimilar sequences and speed up alignment-based database searches for similar sequences; and (v) the estimate is also applicable in local similarity comparison situations. For example, it can help in selecting oligo probes with high specificity and therefore has potential in probe design for microarrays.

Availability: The SK-LD algorithm, the estimate, and the simulation software are implemented in MATLAB code and are available at http://www.stat.ncku.edu.tw/tjwu.

Supplementary information: Table A1 to Table A3 and Remark 1 to Remark 11 are available at http://www.stat.ncku.edu.tw/tjwu.
Part of the research was done while the first author was visiting the Institute of Statistical Science, Academia Sinica, Taipei, Taiwan.
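
To make the idea of a word-based measure concrete, the following is a minimal Python sketch of a symmetric Kullback-Leibler discrepancy between the word (k-mer) frequency distributions of two DNA sequences. It is an illustration only and not the authors' MATLAB implementation: the function names, the additive pseudocount used to avoid zero frequencies, and the unhalved form of the symmetric sum are all assumptions, and the paper's exact definition of SK-LD may differ.

```python
from collections import Counter
from itertools import product
from math import log

ALPHABET = "ACGT"


def word_frequencies(seq, k, pseudocount=1.0):
    """Relative frequencies of all 4**k overlapping words of length k.

    The pseudocount (additive smoothing) is an assumption of this sketch;
    it keeps every frequency positive so the logarithms below are defined.
    Words containing characters outside ACGT are ignored.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(p) for p in product(ALPHABET, repeat=k)]
    total = sum(counts[w] for w in words) + pseudocount * len(words)
    return {w: (counts[w] + pseudocount) / total for w in words}


def symmetric_kl(seq_a, seq_b, k, pseudocount=1.0):
    """Symmetric Kullback-Leibler discrepancy between two word distributions.

    Computes KL(p||q) + KL(q||p), which simplifies to
    sum over words of (p - q) * log(p / q).
    """
    p = word_frequencies(seq_a, k, pseudocount)
    q = word_frequencies(seq_b, k, pseudocount)
    return sum((p[w] - q[w]) * log(p[w] / q[w]) for w in p)


if __name__ == "__main__":
    a = "ACGTACGTACGTACGT"
    b = "AAAAAAAACCCCCCCC"
    for k in (1, 2, 3):
        print(k, symmetric_kl(a, b, k))
```

In this sketch, `k` plays the role of the word size and the length of each input sequence plays the role of the window size; since the number of possible words grows as 4**k, longer words need longer windows to be estimated reliably, which is in line with the abstract's finding that the optimal word size increases with window size.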