Vol. 20 no. 1 2004, pages 67-74
Bioinformatics © Oxford University Press 2004; all rights reserved.
Distribution of words with a predefined range of mismatches to a DNA probe in bacterial genomes

1 Stowers Institute for Medical Research, Kansas City, MO 64110, USA and 2 Department of Microbiology, Molecular Genetics and Immunology, University of Kansas Medical Center, Kansas City, KS 66160, USA
Received on March 25, 2003; revised on June 11, 2003; accepted on July 22, 2003
Motivation: Hybridization of oligonucleotides with longer nucleotide sequences is an essential step in nucleic acid biosynthesis in vitro and in vivo, in oligonucleotide-based diagnostics, and in therapeutic applications of oligonucleotides. A major factor determining sensitivity and selectivity of hybridization is the number of base pair mismatches that occur in an ungapped alignment of the oligonucleotide (probe) and a longer sequence (target).
Results: The k-distance match count between the probe and the target is defined as the number of ungapped alignments between the two sequences that have exactly k mismatches, and the k-neighbor match count is defined as the sum of the j-distance match counts for j between 0 and k. We derive a novel formula for the probability of a k-distance match. This formula is based on the assumption that the target is a strand-symmetric Bernoulli text (i.e. nucleotides are independently and identically distributed in the target and satisfy Chargaff's second parity rule). Our model predicts that the GC-content of both the probe and the target significantly affects the match count expectation. The ratio of the k-neighbor match counts in two distinct genomes for a given probe is a measure of the probe's specificity. We calculated such ratios for pairs of bacterial genomes with different combinations of length, GC-content and phylogenetic distance. Examination of the extreme values of these ratios indicates that probes with a high discriminative power exist for each tested pair.
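The definitions above can be illustrated with a minimal Python sketch. It is not the authors' C++ code, and the probability function is not claimed to be the formula derived in the paper; it is only what follows directly from the stated model under extra assumptions made here: sequences are uppercase strings over A, C, G, T, only one strand of the target is scanned, and strand symmetry is taken to mean P(A) = P(T) = (1 - gc)/2 and P(C) = P(G) = gc/2. All function and parameter names are hypothetical.

```python
from math import comb


def k_distance_counts(probe, target):
    """counts[k] = number of ungapped alignments of probe to target
    (single strand, assumed) that have exactly k mismatches."""
    m = len(probe)
    counts = [0] * (m + 1)
    for start in range(len(target) - m + 1):
        window = target[start:start + m]
        mismatches = sum(1 for a, b in zip(probe, window) if a != b)
        counts[mismatches] += 1
    return counts


def k_neighbor_count(counts, k):
    """Sum of the j-distance match counts for j between 0 and k."""
    return sum(counts[:k + 1])


def k_distance_probability(probe, gc, k):
    """Probability that one alignment has exactly k mismatches when the
    target is i.i.d. with GC-content gc (assumed model, see lead-in)."""
    m = len(probe)
    n_gc = sum(1 for base in probe if base in "GC")
    n_at = m - n_gc
    p_gc = gc / 2        # chance a target base equals a given G or C
    p_at = (1 - gc) / 2  # chance a target base equals a given A or T
    matches = m - k
    total = 0.0
    # Split the matches between GC positions (i) and AT positions (j).
    for i in range(max(0, matches - n_at), min(n_gc, matches) + 1):
        j = matches - i
        total += (comb(n_gc, i) * p_gc**i * (1 - p_gc)**(n_gc - i)
                  * comb(n_at, j) * p_at**j * (1 - p_at)**(n_at - j))
    return total


counts = k_distance_counts("ACG", "ACGTACG")
print(counts)                       # [2, 0, 0, 3]
print(k_neighbor_count(counts, 1))  # 2
print(k_distance_probability("ACG", 0.5, 0))
```

Under these assumptions, linearity of expectation gives the expected k-distance match count for a target of length N and a probe of length m as (N - m + 1) times `k_distance_probability(probe, gc, k)`. The dependence on the GC-content of both sequences enters through `n_gc` and `gc`.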
Supplementary information: Stowers Institute Technical Report No. 0002, C++ source code, Mathematica notebooks and other information are available at http://www.stowers-institute.org/labs/bioinformatics/omm/index.htm
Contact: omm{at}stowers-institute.org
* To whom correspondence should be addressed.
Present address: Northern State University, 1200 S. Jay Street, NSU Box 713, Aberdeen, SD 57401, USA