
Bioinformatics 2005 21(Suppl 3):iii3-iii11; doi:10.1093/bioinformatics/bti1201

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oxfordjournals.org

The choice of optimal distance measure in genome-wide datasets

Galina Glazko 1,*, Alexander Gordon 2 and Arcady Mushegian 1,3

1 Stowers Institute for Medical Research, 1000 E 50th Street, Kansas City, MO 64110, USA
2 Department of Biostatistics and Computational Biology, University of Rochester Medical Center, Rochester, NY 14642, USA
3 Department of Microbiology, Molecular Genetics, and Immunology, University of Kansas Medical Center, Kansas City, KS 66160, USA

*To whom correspondence should be addressed.

Motivation: Many types of genomic data are naturally represented as binary vectors. Numerous tasks in computational biology can be cast as analysis of relationships between these vectors, and the first step is, frequently, to compute their pairwise distance matrix. Many distance measures have been proposed in the literature, but there is no theory justifying the choice of distance measure.

Results: We examine approaches to measuring distances between binary vectors, and study the characteristic properties of various distance measures and their performance in several genome analysis tasks. Most distance measures between binary vectors turn out to belong to a single parametric family, namely generalized average-based distances with different exponents. We show that descriptive statistics of the distance distribution, such as skewness and kurtosis, can guide the choice of the exponent. In contrast, the more familiar distance properties, such as metricity and additivity, appear to have much less effect on the performance of distances.
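The abstract does not state the formula for the generalized average-based distance, so the following Python sketch is only an illustration under an assumed form: one minus the number of shared 1s divided by the power mean, with exponent beta, of the number of 1s in each vector. Under that assumption, beta = 1 gives the Dice distance, beta = 0 the Ochiai (cosine) distance, and beta = -1 the Kulczynski distance. The names `power_mean`, `ga_distance` and `skewness_kurtosis` are hypothetical and are not taken from the authors' GADIST code.

```python
import math


def power_mean(x, y, beta):
    """Generalized (power) mean of two positive numbers with exponent beta."""
    if beta == 0:
        return math.sqrt(x * y)      # limit beta -> 0: geometric mean
    if beta == math.inf:
        return max(x, y)             # limit beta -> +infinity
    if beta == -math.inf:
        return min(x, y)             # limit beta -> -infinity
    return ((x ** beta + y ** beta) / 2) ** (1 / beta)


def ga_distance(u, v, beta):
    """Assumed form: 1 - (shared ones) / power_mean(ones in u, ones in v)."""
    shared = sum(a & b for a, b in zip(u, v))
    ones_u, ones_v = sum(u), sum(v)
    if ones_u == 0 or ones_v == 0:
        raise ValueError("each vector must contain at least one 1")
    return 1 - shared / power_mean(ones_u, ones_v, beta)


def skewness_kurtosis(values):
    """Sample skewness and excess kurtosis from central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    if m2 == 0:
        raise ValueError("all values are identical; moments are undefined")
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3
```

To compare exponents, one could compute `ga_distance` for every pair of vectors in a dataset at several values of `beta` and pass each resulting list of distances to `skewness_kurtosis`. How those statistics should then guide the choice of exponent is the subject of the paper and is not reproduced here.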

Availability: R code GADIST and Supplementary material are available at http://research.stowers-institute.org/bioinfo/

Contact: gvg{at}stowers-institute.org


Received on June 1, 2005; accepted on August 16, 2005




