Bioinformatics Advance Access published online on September 6, 2005

Bioinformatics, doi:10.1093/bioinformatics/bti658
© The Author (2005). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org
Received May 25, 2005
Revised August 24, 2005
Accepted August 31, 2005

Article

Optimal word sizes for dissimilarity measures and estimation of the degree of dissimilarity between DNA sequences

Tiee-Jian Wu 1*, Ying-Hsueh Huang 2, and Lung-An Li 3

1 Department of Statistics, National Cheng-Kung University, Tainan, Taiwan 70101
2 Institute of Statistical Science, Academia Sinica, Taipei, Taiwan 11529; Institute of Bioinformatics, National Yang-Ming University, Taipei, Taiwan 11221
3 Institute of Statistical Science, Academia Sinica, Taipei, Taiwan 11529

* To whom correspondence should be addressed.
Tiee-Jian Wu, E-mail: tjwu{at}stat.ncku.edu.tw


   Abstract

Motivation: Several measures of DNA sequence dissimilarity have been developed. The purpose of this paper is threefold. First, we compare the performance of several word-based or alignment-based methods. Second, we give a general guideline for choosing the window size and determine the optimal word sizes for several word-based measures at different window sizes. Third, we use a large-scale simulation to generate data from the distribution of SK-LD (symmetric Kullback-Leibler discrepancy). These simulated data can be used to estimate the degree of dissimilarity β between any pair of DNA sequences.

Results: Our study shows that (i) for whole-sequence similarity/dissimilarity identification, the window size should be taken as large as possible, but probably not larger than 3000, as restricted by CPU time in practice; (ii) for each measure, the optimal word size increases with window size; (iii) when the optimal word size is used, SK-LD performs best in both simulation and real data analysis; (iv) the estimate β̂ of β based on SK-LD can be used to quickly filter out a large number of dissimilar sequences and speed up alignment-based database searches for similar sequences; and (v) β̂ is also applicable in local similarity comparison situations. For example, it can help in selecting oligo probes with high specificity and therefore has potential in probe design for microarrays.

Availability: The SK-LD algorithm, the estimate β̂ and the simulation software are implemented in MATLAB code and are available at http://www.stat.ncku.edu.tw/tjwu.

Supplementary information: Tables A1 to A3 and Remarks 1 to 11 are available at http://www.stat.ncku.edu.tw/tjwu.
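
Illustration: To make the idea of a word-based measure concrete, the following is a minimal Python sketch of a symmetric Kullback-Leibler discrepancy between the word (k-mer) frequency distributions of two DNA sequences. It is not the authors' MATLAB implementation. The function names, the use of overlapping words, the pseudocount that keeps every frequency positive, and the choice to sum the two directed divergences without a scaling factor are all assumptions made for this example; the paper's exact definition of SK-LD may differ.

```python
from collections import Counter
from itertools import product
from math import log


def word_frequencies(seq, k, pseudocount=1.0):
    """Relative frequencies of all 4**k DNA words of length k.

    Words are counted at overlapping positions. Words containing
    characters other than A, C, G, T are ignored. The pseudocount is
    an assumption of this sketch: it keeps every frequency positive
    so that the logarithms below are always defined.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(w) for w in product("ACGT", repeat=k)]
    total = sum(counts[w] for w in words) + pseudocount * len(words)
    return {w: (counts[w] + pseudocount) / total for w in words}


def symmetric_kl(seq_a, seq_b, k, pseudocount=1.0):
    """Sum of the two directed Kullback-Leibler divergences.

    KL(p||q) + KL(q||p) simplifies to sum((p - q) * log(p / q)).
    """
    p = word_frequencies(seq_a, k, pseudocount)
    q = word_frequencies(seq_b, k, pseudocount)
    return sum((p[w] - q[w]) * log(p[w] / q[w]) for w in p)
```

A call such as `symmetric_kl(window_a, window_b, k=3)` returns zero for identical sequences and grows as the two word distributions diverge. Here `window_a` and `window_b` are hypothetical sequence windows, and `k` plays the role of the word size whose optimal value the paper studies as a function of window size.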


Part of the research was done while the first author was visiting the Institute of Statistical Science, Academia Sinica, Taipei, Taiwan.