Bioinformatics Vol. 18 no. 4 2002
Pages 513-528
© 2002 Oxford University Press
Distribution patterns of over-represented k-mers in non-coding yeast DNA
Department of Information and Computer Science, Institute for Genomics and Bioinformatics, University of California, Irvine, Irvine, CA 92697-3425, USA
Received on May 24, 2001; revised on November 30, 2001; accepted on December 5, 2001
Motivation: Over-represented k-mers in genomic DNA regions are often of particular biological interest. For example, over-represented k-mers in co-regulated families of genes are associated with the DNA binding sites of transcription factors. To measure over-representation, we introduce a statistical background model based on single-mismatches, and apply it to the pooled 500 bp ORF Upstream Regions (USRs) of yeast. More importantly, we investigate the context and spatial distribution of over-represented k-mers in yeast USRs.
Results: Single- and double-stranded spatial distributions of most over-represented k-mers are highly non-random, and predominantly cluster into a small number of classes that are robust with respect to over-representation measures. Specifically, we show that the three most common distribution patterns can be related to DNA structure, function, and evolution and correspond to: (a) homologous ORF clusters associated with sharply localized distributions; (b) regulatory elements associated with a symmetric broad hill-shaped distribution in the 50–200 bp USR; and (c) runs of As, Ts, and ATs associated with a broad hill-shaped distribution also in the 50–200 bp USR, with extreme structural properties. Analyses of over-representation, homology, localization, and DNA structure are essential components of a general data-mining approach to finding biologically important k-mers in raw genomic DNA and understanding the lexicon of regulatory regions.
Contact: hampson{at}ics.uci.edu; kibler{at}ics.uci.edu; pfbaldi{at}ics.uci.edu
* To whom correspondence should be addressed. Also at Department of Biological Chemistry, College of Medicine.
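As a rough illustration of the two ideas summarized in the abstract (a single-mismatch background and the spatial distribution of a k-mer within upstream regions), the following is a minimal Python sketch. The abstract does not give the exact statistic, so the ratio used here (a k-mer's count divided by the mean count of its single-mismatch neighbors, with a pseudocount) is an assumption made for illustration, not the authors' measure. All function names, the pseudocount, and the bin size are hypothetical, and each input sequence is assumed to end at the ORF start.

```python
from collections import Counter

ALPHABET = "ACGT"


def count_kmers(seqs, k):
    """Count every overlapping k-mer across a pool of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def single_mismatch_neighbors(kmer):
    """Return all k-mers that differ from `kmer` at exactly one position."""
    neighbors = []
    for i, base in enumerate(kmer):
        for b in ALPHABET:
            if b != base:
                neighbors.append(kmer[:i] + b + kmer[i + 1:])
    return neighbors


def mismatch_ratio(kmer, counts, pseudocount=1.0):
    """Assumed over-representation score: observed count relative to the
    mean count of the single-mismatch neighbors (not the paper's formula)."""
    neighbors = single_mismatch_neighbors(kmer)
    mean_neighbor = sum(counts[n] for n in neighbors) / len(neighbors)
    return (counts[kmer] + pseudocount) / (mean_neighbor + pseudocount)


def position_histogram(seqs, kmer, bin_size=50):
    """Histogram of k-mer occurrences binned by distance (in bp) from the
    3' end of each sequence, assumed here to be the ORF start."""
    k = len(kmer)
    hist = Counter()
    for s in seqs:
        start = s.find(kmer)
        while start != -1:
            distance = len(s) - (start + k)
            hist[distance // bin_size] += 1
            start = s.find(kmer, start + 1)  # allow overlapping matches
    return hist


if __name__ == "__main__":
    # Toy sequences, not real yeast upstream regions.
    toy = ["ACGTACGT", "TTACGTAA"]
    counts = count_kmers(toy, 4)
    print(mismatch_ratio("ACGT", counts))
    print(position_histogram(toy, "ACGT", bin_size=2))
```

With real data, the pooled 500 bp upstream regions would replace the toy sequences, and the histogram for a given k-mer could be inspected for the sharply localized or broad hill-shaped patterns described above.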