Bioinformatics Advance Access published online on January 2, 2008
Bioinformatics, doi:10.1093/bioinformatics/btm610
Natural Similarity Measures between Position Frequency Matrices with an Application to Clustering
1 Computational Biology, Max Planck Institute for Molecular Genetics, Ihnestr. 73, 14195 Berlin, Germany.
2 Mathematics and Computer Science, Free University of Berlin, Takustr. 9, 14195 Berlin, Germany.
3 COMET group, Genome Informatics, Universität Bielefeld, 33594 Bielefeld, Germany.
4 Bioinformatics for High-Throughput Technologies, Computer Science 11, Dortmund University, 44221 Dortmund, Germany.
*To whom correspondence should be addressed. Utz J. Pape, E-mail: utz.pape{at}molgen.mpg.de
Abstract
Motivation: Transcription factors (TFs) play a key role in gene regulation by binding to target sequences. In silico prediction of potential binding of a TF to a binding site is a well-studied problem in computational biology. The binding sites for one TF are represented by a position frequency matrix (PFM). The discovery of new PFMs requires the comparison to known PFMs to avoid redundancies. In general, two PFMs are similar if they occur at overlapping positions under a null model. Still, most existing methods compute similarity according to probabilistic distances of the PFMs. Here we propose a natural similarity measure based on the asymptotic covariance between the number of PFM hits incorporating both strands. Furthermore, we introduce a second measure based on the same idea to cluster a set of the Jaspar PFMs.
Results: We show that the asymptotic covariance can be efficiently computed by a two-dimensional convolution of the score distributions. The asymptotic covariance approach correlates strongly with simulated data and outperforms three alternative methods. The Jaspar clustering yields distinct groups of TFs of the same class. Furthermore, a representative PFM is given for each class. In contrast to most other clustering methods, PFMs with low similarity automatically remain singletons.
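As a rough illustration of the convolution idea, the following Python sketch builds the joint score distribution of two equal-length, fully aligned matrices under an i.i.d. background and derives the covariance of their hit indicators at a single position. It is a simplified example and not the authors' implementation: the function names, the pseudocount, the integer score scaling, the thresholds and the toy matrices are all hypothetical, and the measure described in the abstract additionally accounts for shifted overlaps and both strands.

```python
from collections import defaultdict
from math import log2


def log_odds(pfm, background, pseudocount=1.0, scale=10):
    """Turn a PFM (list of per-position count dicts) into integer log-odds scores.

    The pseudocount and the integer scaling are illustrative assumptions.
    """
    scores = []
    for column in pfm:
        total = sum(column.values()) + pseudocount * len(background)
        scores.append({
            base: round(scale * log2(
                ((column.get(base, 0) + pseudocount) / total) / background[base]))
            for base in background
        })
    return scores


def joint_score_distribution(scores_a, scores_b, background):
    """Joint distribution of (score_a, score_b) for two aligned matrices.

    Each position contributes one pair of scores per base; the per-position
    distributions are combined by a two-dimensional convolution.
    """
    dist = {(0, 0): 1.0}
    for col_a, col_b in zip(scores_a, scores_b):
        nxt = defaultdict(float)
        for (sa, sb), p in dist.items():
            for base, q in background.items():
                nxt[(sa + col_a[base], sb + col_b[base])] += p * q
        dist = dict(nxt)
    return dist


def hit_covariance(dist, threshold_a, threshold_b):
    """Covariance of the two hit indicators at one aligned position."""
    p_a = sum(p for (sa, _), p in dist.items() if sa >= threshold_a)
    p_b = sum(p for (_, sb), p in dist.items() if sb >= threshold_b)
    p_ab = sum(p for (sa, sb), p in dist.items()
               if sa >= threshold_a and sb >= threshold_b)
    return p_ab - p_a * p_b


# Toy data (hypothetical): a uniform background and two 2-column matrices.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
pfm_a = [{"A": 8}, {"C": 8}]   # strongly prefers "AC"
pfm_b = [{"G": 8}, {"T": 8}]   # strongly prefers "GT"

scores_a = log_odds(pfm_a, background)
scores_b = log_odds(pfm_b, background)

dist_same = joint_score_distribution(scores_a, scores_a, background)
dist_diff = joint_score_distribution(scores_a, scores_b, background)

cov_same = hit_covariance(dist_same, 32, 32)  # identical matrices: positive
cov_diff = hit_covariance(dist_diff, 32, 32)  # exclusive matrices: negative
```

In this toy setting a matrix compared with itself has positive covariance, while two matrices that cannot both match the same aligned sequence have a slightly negative one, which is the kind of signal a covariance-based similarity can exploit.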
Availability: A website to compute the similarity and to perform clustering, the source code and Supplementary Material are available at http://mosta.molgen.mpg.de.
Contact: utz.pape{at}molgen.mpg.de
Associate Editor: Prof. Alfonso Valencia
Received on August 30, 2007; revised on December 5, 2007; accepted on December 6, 2007