Bioinformatics Advance Access originally published online on April 15, 2004
Bioinformatics 2004 20(15):2421-2428; doi:10.1093/bioinformatics/bth266
Bioinformatics 20(15) © Oxford University Press 2004; all rights reserved.
How independent are the appearances of n-mers in different genomes?
1 Department of Computer Science and 2 Department of Chemistry, University of Houston, 4800 Calhoun Road, Houston, TX 77204-3010, USA, 3 Vitruvius Biosciences, The Woodlands, TX, USA and 4 Department of Physics, University of Guadalajara, Guadalajara, Mexico
Received on October 16, 2003; revised on March 9, 2004; accepted on April 1, 2004
Advance Access Publication April 15, 2004
Motivation: Analysis of the statistical properties of DNA sequences is important for evolutionary biology as well as for DNA probe and PCR technologies. These technologies, in turn, can be used for organism identification, with applications in the diagnosis of infectious diseases, environmental studies, etc.
Results: We present results of a correlation analysis of the distributions of the presence/absence of short nucleotide subsequences of different lengths (n-mers, n = 5–20) in more than 1500 microbial and virus genomes, together with five genomes of multicellular organisms (including human). We determine whether a given n-mer is present or absent in a given genome (frequency of presence), rather than the more commonly calculated number of appearances of n-mers in one or more genomes (frequency of appearance). For organisms that are not close relatives of each other, the presence/absence of different 7–20-mers in their genomes is not correlated. For close biological relatives, some correlation in the presence of n-mers in this range appears, but it is not as strong as expected. The suppressed correlations among the n-mers present in different genomes make it possible to use random sets of n-mers (with appropriately chosen n) to discriminate the genomes of different organisms, and possibly individual genomes of the same species (including human), with a low probability of error.
Supplementary information: Supplementary data is available at http://www.bioinfo.uh.edu/publications/independence_genomes/.
Contact: yfofanov{at}uh.edu.
* To whom correspondence should be addressed.
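The distinction drawn in the Results between presence/absence and the number of appearances can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `kmers_present` and `presence_correlation` are hypothetical, only one strand is scanned, windows containing letters other than A, C, G, T are skipped, and the correlation used here is the ordinary Pearson (phi) coefficient between two binary presence indicators taken over all 4^n possible n-mers. The abstract does not state which correlation measure the paper uses, so that choice is an assumption.

```python
def kmers_present(seq, n):
    """Return the set of n-mers that occur at least once in seq.

    Only presence/absence is recorded; how many times an n-mer
    appears is deliberately ignored.
    """
    seq = seq.upper()
    alphabet = set("ACGT")
    return {
        seq[i:i + n]
        for i in range(len(seq) - n + 1)
        if set(seq[i:i + n]) <= alphabet  # skip windows with N or other symbols
    }


def presence_correlation(seq_a, seq_b, n):
    """Phi (Pearson) correlation between the presence indicators of two sequences.

    Each of the 4**n possible n-mers contributes one observation:
    (present in A?, present in B?). This measure is an assumption
    made for illustration only.
    """
    a = kmers_present(seq_a, n)
    b = kmers_present(seq_b, n)
    total = 4 ** n
    pa = len(a) / total        # fraction of n-mers present in A
    pb = len(b) / total        # fraction of n-mers present in B
    pab = len(a & b) / total   # fraction present in both
    denom = (pa * (1 - pa) * pb * (1 - pb)) ** 0.5
    if denom == 0:
        # Correlation is undefined when one indicator is constant
        # (no n-mers or all n-mers present); report 0.0 by convention.
        return 0.0
    return (pab - pa * pb) / denom
```

For real genomes and larger n, the set of present n-mers would normally be stored as a bit array or computed in a streaming fashion; the set-based version above is only meant to make the definition concrete. A value near zero corresponds to the "not correlated" case described for unrelated organisms.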