DISTAN: a program which detects significant distances between short oligonucleotides
1National Institutes of Health, National Cancer Institute, Laboratory of Mathematical Biology FCRF Bldg. 469, Rm. 151
2Advanced Scientific Computing Laboratory, Frederick Cancer Research Facility Bldg. 430 Rm 210, Frederick, MD 21701, USA
*To whom reprint requests should be sent.
We present an algorithm to detect significant distances between oligonucleotides in large collections of nucleic acid sequences. The ratios of the actual frequencies of occurrence of short oligonucleotides at a given distance to the corresponding expected frequencies were analyzed in four categories of DNA sequences (eukaryotic exons, bacterial genes, introns and non-Alu repeated DNAs). Three-base periodic occurrences (independent of the reading frame) of all combinations of mononucleotides and repeats of all dinucleotides were characteristic of protein coding regions. This was also the case for the majority of trinucleotides (including translational stop signals) in these regions. Mirror-symmetric trinucleotides (except GCG and CGC) displayed a strong tendency to be repeated with a two-base periodicity in introns. Some two- and three-base periodic motifs were also observed in repeated DNAs. The possible biological implications of the outstanding three-base periodicities in bacterial genes and eukaryotic exons are discussed.
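The observed-to-expected ratio described above can be illustrated with a minimal Python sketch. It is not the DISTAN implementation: the function name `distance_ratios` is hypothetical, and the expected frequency is assumed here to follow from independent occurrence of the two oligonucleotides, a model the abstract does not specify.

```python
def distance_ratios(seq, a, b, max_dist):
    """Ratio of observed to expected counts of oligo `a` followed by
    oligo `b` at start-to-start distances 1..max_dist.

    Assumption (not from the paper): the expected count is
    p(a) * p(b) * number of valid start positions, i.e. the two
    oligonucleotides are treated as occurring independently.
    """
    seq = seq.upper()
    a, b = a.upper(), b.upper()
    n = len(seq)

    # Start positions of each oligonucleotide in the sequence.
    starts_a = [i for i in range(n - len(a) + 1) if seq.startswith(a, i)]
    starts_b = {i for i in range(n - len(b) + 1) if seq.startswith(b, i)}

    # Per-position occurrence frequencies.
    p_a = len(starts_a) / (n - len(a) + 1) if n >= len(a) else 0.0
    p_b = len(starts_b) / (n - len(b) + 1) if n >= len(b) else 0.0

    ratios = {}
    for d in range(1, max_dist + 1):
        # Observed: `a` starts at i and `b` starts at i + d.
        observed = sum(1 for i in starts_a if i + d in starts_b)
        # Number of positions i where both oligos fit inside the sequence.
        positions = n - max(len(a), d + len(b)) + 1
        expected = p_a * p_b * positions if positions > 0 else 0.0
        ratios[d] = observed / expected if expected > 0 else float("nan")
    return ratios
```

In this sketch, a ratio well above 1 at distances 3, 6, 9, ... for a given pair would correspond to the three-base periodicity reported for protein coding regions. Judging whether a ratio is significant would need a statistical test, which is not shown here.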
Received on March 2, 1987; accepted on May 5, 1987