Middle-range clustering of nucleotides in genomes
Institute of Biophysics, Academy of Sciences of the Czech Republic, CZ-61265 Brno, Czech Republic
¹To whom correspondence should be addressed
We propose a novel, transparent and very simple algorithm to analyze middle-range correlations in genomic nucleotide sequences. Applying this algorithm to the EMBL Nucleotide Sequence Database demonstrates that all four nucleotides cluster in the genomic nucleotide sequences of eukaryotes on the scale of several hundred base pairs. In prokaryotes, the clustering is weak but still evident. Within a cluster, the three non-dominant bases are deficient: A is the most deficient nucleotide in the clusters of C, and vice versa, and G is the most deficient nucleotide in the clusters of T, and vice versa. The algorithm also detects CG islands, extending over 1 kb, in vertebrate sequences. In plants, the CG islands are shown to be much smaller, if they exist at all. A clustering tendency is also exhibited by the TA doublet. Other doublets do not cluster. We observe no strong correlation between nucleotides separated in genomes by > 1 kb.
Received on September 20, 1994; revised on December 17, 1994; accepted on January 13, 1995