Bioinformatics Advance Access originally published online on September 6, 2005
Bioinformatics 2005 21(22):4125-4132; doi:10.1093/bioinformatics/bti658

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oxfordjournals.org

Optimal word sizes for dissimilarity measures and estimation of the degree of dissimilarity between DNA sequences

Tiee-Jian Wu 1,*, Ying-Hsueh Huang 2,3 and Lung-An Li 2

1 Department of Statistics, National Cheng-Kung University, Tainan, Taiwan 70101
2 Institute of Statistical Science, Academia Sinica, Taipei, Taiwan 11529
3 Institute of Bioinformatics, National Yang-Ming University, Taipei, Taiwan 11221

*To whom correspondence should be addressed.

Motivation: Several measures of DNA sequence dissimilarity have been developed. The purpose of this paper is 3-fold. Firstly, we compare the performance of several word-based or alignment-based methods. Secondly, we give a general guideline for choosing the window size and determining the optimal word sizes for several word-based measures at different window sizes. Thirdly, we use a large-scale simulation method to simulate data from the distribution of SK–LD (symmetric Kullback–Leibler discrepancy). These simulated data can be used to estimate the degree of dissimilarity β between any pair of DNA sequences.

Results: Our study shows that (1) for whole-sequence similarity/dissimilarity identification, the window size should be as large as possible, but probably not >3000, as restricted by CPU time in practice; (2) for each measure, the optimal word size increases with window size; (3) when the optimal word size is used, SK–LD performance is superior in both simulation and real data analysis; (4) the estimate of β based on SK–LD can be used to quickly filter out a large number of dissimilar sequences and speed up alignment-based database searches for similar sequences; and (5) the estimate is also applicable in local similarity comparison situations. For example, it can help in selecting oligo probes with high specificity and, therefore, has potential in probe design for microarrays.
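
To make the idea of a word-based measure concrete, the following is a minimal Python sketch (not the authors' MATLAB implementation) of a symmetric Kullback–Leibler discrepancy between the word-frequency profiles of two DNA sequences. The abstract does not give the exact formula, so the code assumes the textbook symmetric form, the sum over all words of (p - q) log(p / q), and adds a pseudocount so that words absent from one sequence do not produce a division by zero. The function names, the `pseudocount` parameter and its default value are illustrative assumptions.

```python
import math
from collections import Counter
from itertools import product


def word_frequencies(seq, k, pseudocount=0.5):
    """Smoothed relative frequencies of all 4**k DNA words of length k.

    Overlapping words are counted; words containing letters other than
    A, C, G, T are ignored. The pseudocount (an assumption of this sketch,
    it must be > 0) keeps every frequency strictly positive.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(w) for w in product("ACGT", repeat=k)]
    total = sum(counts[w] for w in words) + pseudocount * len(words)
    return {w: (counts[w] + pseudocount) / total for w in words}


def symmetric_kl(seq_a, seq_b, k, pseudocount=0.5):
    """Symmetric Kullback-Leibler discrepancy KL(p||q) + KL(q||p).

    p and q are the smoothed k-word frequency profiles of the two
    sequences (or of two windows taken from them).
    """
    p = word_frequencies(seq_a, k, pseudocount)
    q = word_frequencies(seq_b, k, pseudocount)
    return sum((p[w] - q[w]) * math.log(p[w] / q[w]) for w in p)
```

Calling `symmetric_kl(a, b, k)` with two sequences and a word size `k` returns zero when the two word profiles coincide and grows as they diverge. Because the number of possible words is 4**k, longer windows can support larger `k` before the profile becomes too sparse, which is consistent with the finding that the optimal word size increases with window size.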

Availability: The SK–LD algorithm, the estimate and the simulation software are implemented in MATLAB and are available at http://www.stat.ncku.edu.tw/tjwu

Contact: tjwu{at}stat.ncku.edu.tw

Supplementary information: Tables A1–A3, and Remarks 1–11 at http://www.stat.ncku.edu.tw/tjwu


Received on May 25, 2005; revised on August 24, 2005; accepted on August 31, 2005
