Bioinformatics Advance Access originally published online on May 31, 2007
Bioinformatics 2007 23(15):2013-2014; doi:10.1093/bioinformatics/btm282
Tree Gibbs Sampler: identifying conserved motifs without aligning orthologous sequences
1Division of Biostatistics and 2Center for Computational Biology and Bioinformatics, School of Medicine, Indiana University, 410 West 10th Street, Indianapolis, IN 46202, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Summary: Tree Gibbs Sampler is software for identifying motifs by simultaneously exploiting motif overrepresentation and motif evolutionary conservation. It identifies motifs without depending on pre-aligned orthologous sequences, which makes it useful for extracting regulatory elements from multiple genomes of both closely related and distant species.
Availability: The Tree Gibbs Sampler software is freely downloadable at https://compbio.iupui.edu/xiaomanli/LiSoftware/retrieve.php?ID=tgs
Contact: shawnli{at}iupui.edu
| 1 INTRODUCTION |
|---|
A transcription factor can bind to short DNA segments in the regulatory regions of many different genes to control their expression. The common pattern of these short DNA segments bound by a transcription factor is called a motif. Recently, many computational methods have been developed to identify motifs by finding overrepresented and conserved DNA segments (putative motif instances) in the regulatory regions of a set of candidate genes in multiple related species (Liu, 2004; Moses, 2004; Prakash, 2004, 2005; Sinha, 2004; Wang, 2003). Most of these methods first align orthologous sequences and then identify motifs from the alignments, often without taking species divergence time into account. However, motif instances are not always aligned with their counterpart motif instances in multiple alignments of orthologous sequences (Li, 2005). Moreover, without taking divergence time into account, one often cannot distinguish segments that are conserved because of a short divergence time from segments that are conserved because they are functional. Here we describe Tree Gibbs Sampler (TGS), software that identifies motifs from unaligned orthologous sequences while properly taking divergence time into account.
| 2 SOFTWARE |
|---|
We briefly describe the method implemented in the TGS software here, since the detailed algorithm has been published elsewhere (Li, 2005). Given a set of putative co-regulated genes from one species, we first collect their orthologous genes in other species and assume that the evolution of the non-coding sequences of all groups of orthologous genes shares the same phylogenetic tree (for instance, the species tree). On each branch of the phylogenetic tree, TGS then uses different evolution models to describe the evolution of the non-functional segments and the evolution of the functional segments (motif instances), respectively. By starting from randomly generated motif weight matrices (same motif weight matrix for each species at the beginning), for each group of orthologous sequences, TGS identifies ancestral motif instances first and then identifies motif instances in the current species. Then, TGS updates the motif weight matrices and other parameters such as motif length and motif occurrence frequencies. With these updated parameters, TGS updates the motif instances in every group of orthologous sequences again. TGS repeats this cycle of updating parameters and motif instances several thousand times and then outputs the best motifs and motif instances. Because of this novel top–down strategy of identifying motif instances (ancestral motif instances first and then motif instances in the current species), TGS not only can identify motifs without aligning orthologous sequences but also can identify divergent motifs in distant species that cannot be found by other methods (Li, 2005).
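The cycle described above, in which motif instances are resampled and the weight matrix is then re-estimated, can be illustrated with a minimal sketch of a plain single-species Gibbs site sampler. This is not the TGS algorithm: it omits the phylogenetic tree, the ancestral motif instances, the per-branch evolution models and the motif length updates. The function names `build_pwm` and `gibbs_site_sampler`, the uniform background and the pseudocount are assumptions made only for illustration.

```python
import random

BASES = "ACGT"


def build_pwm(sequences, starts, width, skip=None, pseudocount=1.0):
    """Estimate per-column base frequencies from the current motif instances."""
    counts = [{b: pseudocount for b in BASES} for _ in range(width)]
    for i, (seq, start) in enumerate(zip(sequences, starts)):
        if i == skip:
            continue  # leave out the sequence that is being resampled
        for j in range(width):
            counts[j][seq[start + j]] += 1.0
    pwm = []
    for col in counts:
        total = sum(col.values())
        pwm.append({b: col[b] / total for b in BASES})
    return pwm


def gibbs_site_sampler(sequences, width, iterations, seed=0):
    """Alternate between resampling one instance per sequence and re-estimating the matrix."""
    rng = random.Random(seed)
    # Random starting positions play the role of a random initial weight matrix.
    starts = [rng.randrange(len(s) - width + 1) for s in sequences]
    for _ in range(iterations):
        for i, seq in enumerate(sequences):
            pwm = build_pwm(sequences, starts, width, skip=i)
            weights = []
            for pos in range(len(seq) - width + 1):
                # Likelihood ratio of the segment under the motif vs. a uniform background.
                w = 1.0
                for j in range(width):
                    w *= pwm[j][seq[pos + j]] / 0.25
                weights.append(w)
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts, build_pwm(sequences, starts, width)


if __name__ == "__main__":
    toy = ["ACGTACGTCACGTGAA", "TTCACGTGACGTAACC", "GGGTCACGTGTTTTAA"]
    positions, matrix = gibbs_site_sampler(toy, width=6, iterations=50, seed=1)
    print(positions)
```

The sketch expects uppercase sequences over A, C, G and T. Because the sampler is stochastic, the positions it reports depend on the seed and on the number of iterations.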
The method implemented in the TGS software uses a linear-time, linear-space algorithm to identify motifs in the input sequences. Given k species, n genes in each species, average sequence length L and m iterations for each motif, TGS runs in O(kmnL) time for each motif. The total space requirement of the algorithm is O(knL). The time cost of running TGS with different parameters is shown in Figure 1.
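As a rough illustration of what these bounds imply, the sketch below evaluates the products kmnL and knL for one parameter setting and shows that doubling the number of genes doubles the estimated work. The function names are hypothetical, constant factors are ignored, and the sequence length L = 500 is an assumed value; k = 4, n = 12 and m = 2000 are taken from the CBF1 example.

```python
def tgs_work_units(k, n, L, m):
    """Work per motif implied by the O(kmnL) bound, up to a constant factor."""
    return k * m * n * L


def tgs_space_units(k, n, L):
    """Storage implied by the O(knL) bound, up to a constant factor."""
    return k * n * L


if __name__ == "__main__":
    # L=500 is an assumed average sequence length, not a value from the text.
    base = tgs_work_units(k=4, n=12, L=500, m=2000)
    doubled = tgs_work_units(k=4, n=24, L=500, m=2000)
    print(base)            # 48000000
    print(doubled / base)  # 2.0
    print(tgs_space_units(k=4, n=12, L=500))  # 24000
```

These counts are only proportional to running time and memory use; they are not predictions of wall-clock time.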
Here, we use the upstream sequences of yeast CBF1 target genes as an example to show how to use the software in command line mode. To run TGS, one has to specify a sequence file, an evolution parameter file, the maximal number of genes, the species number, the range of motif lengths (6–20 bp), the number of motifs and the number of iterations for each motif. All these parameters should be put in a file that serves as input for TGS. We explain only the first three parameters here, since the others are fairly straightforward. The sequence file (CBF1.FASTA) contains the orthologous sequences for every gene group. The order of the sequences within each gene group is the same as that specified in the parameter file. For instance, CBF1.FASTA contains 12 gene groups with four sequences each, always in the order Saccharomyces cerevisiae, S. mikatae, S. kudriavzevii and S. bayanus, as specified according to the phylogenetic tree. The evolution parameter file (evolutionOfNCYeastSampler.txt) describes the evolution models of the functional and non-functional segments on each branch of the phylogenetic tree. The evolution models for non-functional segments are constructed with the Phylip software (http://evolution.genetics.washington.edu/phylip.html) from the alignments of upstream sequences of orthologous genes in the four yeast species under consideration. The corresponding evolution models for functional segments are then constructed by halving the corresponding branch lengths. In this way, any functional segment that evolves at least two times slower than the non-functional segments is expected to be identified. For convenience, we also provide sample evolution parameter files for commonly used species groups. The third parameter, the maximal number of genes, should be at least as large as the number of genes in the sequence file (12 in CBF1.FASTA). Although there is no upper bound on this parameter, it is limited in practice by the RAM of the user's computer.
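The input conventions described above can be checked before a run with a small script. The sketch below assumes that the records of each gene group appear consecutively in the sequence file, in the fixed species order; it groups them, verifies that the maximal number of genes is large enough, and halves a set of branch lengths as described for the functional segments. All function names, the header format and the dictionary representation of branch lengths are hypothetical; they do not reflect the actual TGS parameter or evolution file formats, which are documented in the readme file.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def group_orthologs(records, species_order, max_genes):
    """Split consecutive records into gene groups, one sequence per species."""
    k = len(species_order)
    if len(records) % k != 0:
        raise ValueError("record count is not a multiple of the species number")
    n_groups = len(records) // k
    if n_groups > max_genes:
        raise ValueError("maximal number of genes is smaller than the number of gene groups")
    groups = []
    for g in range(n_groups):
        block = records[g * k:(g + 1) * k]
        groups.append({sp: seq for sp, (_, seq) in zip(species_order, block)})
    return groups


def halve_branch_lengths(branch_lengths):
    """Derive functional-segment branch lengths by halving the non-functional ones."""
    return {branch: length / 2.0 for branch, length in branch_lengths.items()}


if __name__ == "__main__":
    # Toy two-species input; headers and branch names are made up for illustration.
    toy = ">g1_cer\nACGT\nAC\n>g1_mik\nACGA\n>g2_cer\nTTTT\n>g2_mik\nTTTA\n"
    groups = group_orthologs(read_fasta(toy), ["S.cerevisiae", "S.mikatae"], max_genes=12)
    print(len(groups))
    print(halve_branch_lengths({"branch_to_S.mikatae": 0.4}))
```

For the CBF1 example, the species list would hold the four yeast species in the order given in the text, and the grouping step would be expected to return 12 gene groups.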
With these parameters defined, TGS identifies motifs in a top–down fashion by identifying conserved segments within orthologous sequences and overrepresented similar segments across different genes in the same species. During this process, TGS updates the motifs and motif instances. It also automatically adjusts the motif length based on the frequency of neighboring nucleotides around the identified motif instances. Finally, TGS outputs the best motifs, motif instances and motif significance in every species. Finding five motifs, with 2000 iterations for each motif, on the upstream sequences of the CBF1 target genes takes 165 min on an Intel Pentium 3.6 GHz processor. We provide both a command line mode of the TGS programs, which can be run in the DOS, Linux and OS environments, and a GUI mode of the Windows version of TGS. The usage of the GUI mode (Windows version) is similar to what has been described above. Detailed information is available in the readme file in the download package.
| ACKNOWLEDGEMENTS |
|---|
This work was supported by the Showalter Trust award (X.L.) and the Indiana Genomics Initiative (INGEN) (X.L. and H.H.), which is funded in part by the Lilly Endowment.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Alfonso Valencia
Received on February 3, 2007; revised on April 17, 2007; accepted on May 18, 2007
| REFERENCES |
|---|
Li X, Wong WH. Sampling motifs on phylogenetic trees. Proc. Natl Acad. Sci. USA (2005) 102:9481–9486.
Liu Y, et al. Eukaryotic regulatory element conservation analysis and identification using comparative genomics. Genome Res. (2004) 14:451–458.
Moses A, et al. Phylogenetic motif detection by expectation-maximization on evolutionary mixtures. Pac. Symp. Biocomput. (2004) 9:325–335.
Prakash A, Tompa M. Discovery of regulatory elements in vertebrates through comparative genomics. Nat. Biotechnol. (2005) 23:1249–1256.
Prakash A, et al. Motif discovery in heterogeneous sequence data. Pac. Symp. Biocomput. (2004) 9:348–359.
Sinha S, et al. PhyME: a probabilistic algorithm for finding motifs in sets of orthologous sequences. BMC Bioinformatics (2004) 5:170.
Wang T, Stormo GD. Combining phylogenetic data with co-regulated genes to identify regulatory motifs. Bioinformatics (2003) 19:2369–2380.
