The choice of optimal distance measure in genome-wide datasets
1 Stowers Institute for Medical Research, 1000 E 50th Street, Kansas City, MO 64110, USA
2 Department of Biostatistics and Computational Biology, University of Rochester Medical Center, Rochester, NY 14642, USA
3 Department of Microbiology, Molecular Genetics, and Immunology, University of Kansas Medical Center, Kansas City, KS 66160, USA
*To whom correspondence should be addressed.
Motivation: Many types of genomic data are naturally represented as binary vectors. Numerous tasks in computational biology can be cast as the analysis of relationships between these vectors, and the first step is frequently to compute their pairwise distance matrix. Many distance measures have been proposed in the literature, but there is no theory justifying the choice of distance measure.
Results: We examine approaches to measuring distances between binary vectors and study the characteristic properties of various distance measures and their performance in several tasks of genome analysis. Most distance measures between binary vectors turn out to belong to a single parametric family, namely the generalized average-based distance with different exponents. We show that descriptive statistics of the distance distribution, such as skewness and kurtosis, can guide the appropriate choice of the exponent. In contrast, the more familiar distance properties, such as metricity and additivity, appear to have much less effect on the performance of distances.
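As a rough illustration of the parametric family described above, the sketch below assumes that the distance between two binary vectors is one minus the number of shared ones divided by the generalized (power) mean, with exponent alpha, of the number of ones in each vector. That formula is an assumption made for illustration, since the abstract does not state it, and the function names (power_mean, ga_distance, skewness_kurtosis) are hypothetical and are not taken from the GADIST code.

```python
import math


def power_mean(x, y, alpha):
    """Generalized (power) mean of two positive numbers with exponent alpha."""
    if alpha == 0:
        return math.sqrt(x * y)  # limit alpha -> 0 is the geometric mean
    if math.isinf(alpha):
        return max(x, y) if alpha > 0 else min(x, y)  # limits alpha -> +/- infinity
    return ((x ** alpha + y ** alpha) / 2) ** (1 / alpha)


def ga_distance(u, v, alpha):
    """ASSUMED form: 1 - (shared ones) / power_mean(ones in u, ones in v)."""
    shared = sum(1 for p, q in zip(u, v) if p and q)
    n_u, n_v = sum(u), sum(v)
    if n_u == 0 or n_v == 0:
        raise ValueError("each vector needs at least one nonzero entry")
    return 1 - shared / power_mean(n_u, n_v, alpha)


def skewness_kurtosis(values):
    """Moment-based sample skewness and excess kurtosis of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    if m2 == 0:
        raise ValueError("all values are identical; moments are undefined")
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3


def pairwise_distances(vectors, alpha):
    """Upper-triangle pairwise distances for a list of binary vectors."""
    return [
        ga_distance(vectors[i], vectors[j], alpha)
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
    ]


if __name__ == "__main__":
    # Toy presence/absence profiles, invented purely for illustration.
    profiles = [
        [1, 1, 1, 1, 0, 0],
        [1, 1, 0, 0, 0, 0],
        [0, 1, 1, 0, 1, 0],
        [1, 0, 0, 0, 1, 1],
    ]
    for alpha in (-math.inf, -1, 0, 1, math.inf):
        dists = pairwise_distances(profiles, alpha)
        print(alpha, skewness_kurtosis(dists))
```

Under this assumed form, alpha = 1 (arithmetic mean) gives a Dice-type distance and alpha = 0 (geometric mean) gives an Ochiai-type distance, so comparing the skewness and kurtosis of the resulting distance distributions across exponents is one way to explore the kind of guidance the abstract describes.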
Availability: R code GADIST and Supplementary material are available at http://research.stowers-institute.org/bioinfo/
Contact: gvg@stowers-institute.org
Received on June 1, 2005; accepted on August 16, 2005