Bioinformatics Advance Access published online on February 18, 2007
Bioinformatics, doi:10.1093/bioinformatics/btm048
HomologMiner: looking for homologous genomic groups in whole genomes



Dept. Computer Science & Engineering, Penn State University
*To whom correspondence should be addressed: Ms. Minmei Hou, E-mail: mhou{at}cse.psu.edu, mhou{at}bx.psu.edu
Abstract
Motivation: Complex genomes contain numerous repeated sequences, and genomic duplication is believed to be a major evolutionary mechanism for acquiring new functions. Several tools are available for de novo repeat sequence identification, and many approaches exist for clustering homologous protein sequences. We present an efficient new approach to identify and cluster homologous DNA sequences with high accuracy at the level of whole genomes, excluding low-complexity repeats, tandem repeats and annotated interspersed repeats. We also determine the boundaries of each group member so that it closely represents a biological unit, e.g. a complete gene, or a partial gene coding for a protein domain.
Results: We developed a program called HomologMiner to identify homologous groups. It is applicable to genome sequences that have been properly marked for low-complexity repeats and annotated interspersed repeats. We applied it to the whole genomes of human (hg17), macaque (rheMac2), and mouse (mm8). The groups obtained include gene families (e.g. the olfactory receptor gene family and zinc finger families), unannotated interspersed repeats, and additional homologous groups that resulted from recent segmental duplications. Our program incorporates several new methods: a new abstract definition of consistent duplicate units, a new criterion to remove moderately frequent tandem repeats, and new algorithmic techniques. We also provide a preliminary analysis of the output on the three genomes mentioned above, and show several applications, including identifying the boundaries of tandem gene clusters and novel interspersed repeat families.
Availability: All programs and datasets are downloadable from www.bx.psu.edu/miller_lab.
Supported by NIH grant HG02238.
Associate Editor: Dr. Limsoon Wong
Received on November 21, 2006; revised on January 22, 2007; accepted on February 6, 2007