Bioinformatics Advance Access originally published online on February 18, 2007
Bioinformatics 2007 23(8):917-925; doi:10.1093/bioinformatics/btm048
HomologMiner: looking for homologous genomic groups in whole genomes
Department of Computer Science & Engineering, Penn State University, PA, USA
*To whom correspondence should be addressed.
Abstract
Motivation: Complex genomes contain numerous repeated sequences, and genomic duplication is believed to be a major evolutionary mechanism for acquiring new functions. Several tools are available for de novo repeat sequence identification, and many approaches exist for clustering homologous protein sequences. We present an efficient new approach that identifies and clusters homologous DNA sequences with high accuracy at the level of whole genomes, excluding low-complexity repeats, tandem repeats and annotated interspersed repeats. We also determine the boundaries of each group member so that it closely represents a biological unit, e.g. a complete gene, or the part of a gene that encodes a protein domain.
Results: We developed a program called HomologMiner that identifies homologous groups in genome sequences that have been properly marked for low-complexity repeats and annotated interspersed repeats. We applied it to the whole genomes of human (hg17), macaque (rheMac2) and mouse (mm8). The groups obtained include gene families (e.g. the olfactory receptor gene family and zinc finger families), unannotated interspersed repeats and additional homologous groups that resulted from recent segmental duplications. Our program incorporates several new methods: a new abstract definition of consistent duplicate units, a new criterion for removing moderately frequent tandem repeats, and new algorithmic techniques. We also provide a preliminary analysis of the output for the three genomes mentioned above, and show several applications, including identifying the boundaries of tandem gene clusters and identifying novel interspersed repeat families.
Availability: All programs and datasets are downloadable from www.bx.psu.edu/miller_lab
Contact: mhou{at}cse.psu.edu
Associate Editor: Limsoon Wong
Received on November 21, 2006; revised on January 22, 2007; accepted on February 6, 2007