

Bioinformatics Advance Access originally published online on May 31, 2007
Bioinformatics 2007 23(15):2013-2014; doi:10.1093/bioinformatics/btm282

© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Tree Gibbs Sampler: identifying conserved motifs without aligning orthologous sequences

Xiaohui Cai 1,2, Haiyan Hu 2 and Xiaoman Shawn Li 1,2,*

1Division of Biostatistics and 2Center for Computational Biology and Bioinformatics, School of Medicine, Indiana University, 410 West 10th Street, Indianapolis, IN 46202, USA

*To whom correspondence should be addressed.


    ABSTRACT

Summary: Tree Gibbs Sampler is a software tool that identifies motifs by simultaneously exploiting two properties: motif overrepresentation and motif evolutionary conservation. It does not depend on pre-aligned orthologous sequences, which makes it useful for extracting regulatory elements from multiple genomes of both closely related and distant species.

Availability: The Tree Gibbs Sampler software is freely downloadable at https://compbio.iupui.edu/xiaomanli/LiSoftware/retrieve.php?ID=tgs

Contact: shawnli{at}iupui.edu


    1 INTRODUCTION
A transcription factor can bind to short DNA segments in the regulatory regions of many different genes to control their expression. The common pattern of the short DNA segments bound by a transcription factor is called a motif. Recently, many computational methods have been developed to identify motifs by finding overrepresented and conserved DNA segments (putative motif instances) in the regulatory regions of a set of candidate genes in multiple related species (Liu, 2004; Moses, 2004; Prakash, 2004, 2005; Sinha, 2004; Wang, 2003). Most of these methods first align orthologous sequences and then identify motifs from the alignments, often without taking species divergence times into account. However, motif instances are not always aligned with their counterpart motif instances in multiple alignments of orthologous sequences (Li, 2005). Moreover, without taking divergence time into account, one often cannot distinguish segments that are conserved because of a short divergence time from segments that are conserved because they are functional. Here we present Tree Gibbs Sampler (TGS), software that identifies motifs from unaligned orthologous sequences while properly taking divergence time into account.


    2 SOFTWARE
We describe the method implemented in TGS only briefly here, since the detailed algorithm has been published elsewhere (Li, 2005). Given a set of putatively co-regulated genes from one species, we first collect their orthologous genes in other species and assume that the non-coding sequences of all groups of orthologous genes evolved along the same phylogenetic tree (for instance, the species tree). On each branch of the phylogenetic tree, TGS uses separate evolution models for the non-functional segments and for the functional segments (motif instances). Starting from randomly generated motif weight matrices (the same motif weight matrix for every species at the beginning), TGS identifies, for each group of orthologous sequences, the ancestral motif instances first and then the motif instances in the current species. TGS then updates the motif weight matrices and other parameters such as motif length and motif occurrence frequencies. With these updated parameters, TGS updates the motif instances in every group of orthologous sequences again. TGS repeats this cycle of updating parameters and motif instances several thousand times and then outputs the best motifs and motif instances. Because of this novel top–down strategy of identifying motif instances (ancestral motif instances first and then motif instances in the current species), TGS can not only identify motifs without aligning orthologous sequences but also identify divergent motifs in distant species that cannot be found by other methods (Li, 2005).
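The alternation between updating the weight matrix and resampling motif instances can be illustrated with a deliberately simplified sketch. This is not the TGS algorithm: it is a plain single-species site sampler that ignores the phylogenetic tree and the ancestral motif instances, fixes the motif width, assumes exactly one instance per sequence, a uniform background, and random initial positions rather than random initial matrices. All function names and the toy sequences are hypothetical.

```python
import random
from typing import Dict, List

ALPHABET = "ACGT"


def weight_matrix(seqs: List[str], starts: List[int], width: int,
                  skip: int, pseudo: float = 0.5) -> List[Dict[str, float]]:
    """Per-column base frequencies of the current instances, omitting sequence `skip`."""
    counts = [{b: pseudo for b in ALPHABET} for _ in range(width)]
    for i, (seq, start) in enumerate(zip(seqs, starts)):
        if i == skip:
            continue
        for j in range(width):
            counts[j][seq[start + j]] += 1.0
    return [{b: c / sum(col.values()) for b, c in col.items()} for col in counts]


def gibbs_sample(seqs: List[str], width: int, iterations: int,
                 seed: int = 0) -> List[int]:
    """Return one sampled motif start per sequence after `iterations` sweeps."""
    rng = random.Random(seed)
    # Assumption: start from random positions (TGS starts from random matrices).
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for i, seq in enumerate(seqs):
            # Update step: rebuild the matrix from all other sequences.
            pwm = weight_matrix(seqs, starts, width, skip=i)
            # Sampling step: weight every window by motif/background likelihood ratio.
            weights = []
            for p in range(len(seq) - width + 1):
                ratio = 1.0
                for j in range(width):
                    ratio *= pwm[j][seq[p + j]] / 0.25  # uniform background assumed
                weights.append(ratio)
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


# Toy input: four short sequences that share a planted hexamer.
toy = ["TTGACACGTGATCA", "CACGTGTTAGGCAT", "GGATTCACGTGACT", "ATCGGATCACGTGA"]
starts = gibbs_sample(toy, width=6, iterations=200, seed=1)
for seq, p in zip(toy, starts):
    print(p, seq[p:p + 6])
```

Because the procedure is stochastic, the sampled positions depend on the seed; in practice one would keep the best-scoring state seen across many iterations rather than the last one.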

The algorithm implemented in TGS is linear in both time and space with respect to the size of the input. Given k species, n genes in each species, an average sequence length L and m iterations for each motif, TGS runs in O(kmnL) time for each motif. The total space requirement of the algorithm is O(knL). The time cost of running TGS with different parameters is shown in Figure 1.
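As a rough illustration of what the O(kmnL) bound implies, the sketch below counts abstract work units for two settings. The function name is hypothetical, the constant factor hidden by the O-notation is unknown, and the average length of 480 bp is only an approximation obtained by dividing the ~23 kb CBF1 dataset by its 48 sequences (12 gene groups in four species), so the numbers indicate relative scaling only, not seconds.

```python
def relative_cost(k: int, n: int, L: int, m: int, motifs: int = 1) -> int:
    """Abstract work units proportional to k * m * n * L per motif."""
    return motifs * k * m * n * L


# Approximate CBF1 setting: 4 species, 12 genes, ~480 bp per sequence (assumed).
short_run = relative_cost(k=4, n=12, L=480, m=200, motifs=5)
long_run = relative_cost(k=4, n=12, L=480, m=2000, motifs=5)

# Ten times more iterations should cost about ten times more work.
print(long_run / short_run)  # 10.0
```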


Fig. 1. Time cost of the TGS software on an Intel Pentium 3.6 GHz CPU. (a) Time costs for different gene groups (total sequence length) when finding five motifs with 200 iterations for each motif. (b) Time costs for different numbers of iterations when finding five motifs on the CBF1 gene dataset, whose total sequence length is ~23 kb. (c) Time costs for finding different numbers of motifs with 200 iterations on the CBF1 dataset. Detailed data can be found in the Readme file in the download package.

 
Here, we use the upstream sequences of yeast CBF1 target genes as an example to show how to use the software in command line mode. To run TGS, one has to specify a sequence file, an evolution parameter file, the maximal number of genes, the number of species, the range of motif lengths (6–20 bp), the number of motifs and the number of iterations for each motif. All these parameters should be put in a file that serves as input for TGS. We explain only the first three parameters here, since the others are fairly straightforward. The sequence file (CBF1.FASTA) contains the orthologous sequences for every gene group. The order of the sequences within each gene group is the same as that specified in the parameter file. For instance, CBF1.FASTA contains 12 gene groups with four sequences each, and the sequences of each gene group are always given in the order Saccharomyces cerevisiae, S. mikatae, S. kudriavzevii and S. bayanus, as specified according to the phylogenetic tree. The evolution parameter file (evolutionOfNCYeastSampler.txt) describes the evolution models of the functional segments and the non-functional segments on each branch of the phylogenetic tree. The evolution models for non-functional segments are constructed with the Phylip software (http://evolution.genetics.washington.edu/phylip.html) based on alignments of the upstream sequences of orthologous genes in the four yeast species under consideration. The corresponding evolution models for functional segments are then constructed by halving the corresponding branch lengths. In this way, any functional segment that evolves at least two times slower than the non-functional segments is expected to be identified. For convenience, we also provide sample evolution parameter files for commonly used species groups.
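To give a feel for what halving a branch length does, the sketch below uses the Jukes-Cantor (JC69) substitution model. This is an assumption made purely for illustration: the text does not say which substitution model the evolution parameter file encodes, and the branch length of 0.2 substitutions per site is a made-up value.

```python
import math


def jc69_prob_same(t: float) -> float:
    """JC69 probability that a base is unchanged after branch length t."""
    return 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)


def jc69_prob_other(t: float) -> float:
    """JC69 probability of ending at one specific different base."""
    return 0.25 - 0.25 * math.exp(-4.0 * t / 3.0)


background_t = 0.2               # hypothetical branch length, non-functional segments
functional_t = background_t / 2  # halved branch length for functional segments

print("non-functional, base kept:", round(jc69_prob_same(background_t), 4))
print("functional, base kept:    ", round(jc69_prob_same(functional_t), 4))
```

Under this model a site on the halved branch is more likely to retain its ancestral base, which is how a slower-evolving functional segment can be told apart from its surroundings.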
The third parameter, the maximal number of genes, must be at least as large as the number of genes in the sequence file (12 in CBF1.FASTA). Although there is no upper bound for this parameter, it is limited in practice by the RAM of the user's computer. With these parameters defined, TGS identifies motifs in a top–down fashion by identifying conserved segments within orthologous sequences and overrepresented similar segments across different genes in the same species. During this process, TGS updates the motifs and motif instances. It also automatically adjusts the motif length based on the frequency of neighboring nucleotides around the identified motif instances. Finally, TGS outputs the best motifs, motif instances and motif significance in every species. Finding five motifs with 2000 iterations for each motif on the upstream sequences of the CBF1 target genes takes ~165 min on an Intel Pentium 3.6 GHz processor.
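Before a run, the sequence-file layout and the maximal-number-of-genes constraint described above could be checked with a small script such as the following sketch. It assumes that the records of each gene group are stored consecutively in the FASTA file in the fixed species order; the function names, header names and toy sequences are hypothetical and not part of the TGS package.

```python
from typing import Dict, List, Tuple

SPECIES_ORDER = ["S. cerevisiae", "S. mikatae", "S. kudriavzevii", "S. bayanus"]


def read_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA text into (header, sequence) pairs, preserving file order."""
    records: List[Tuple[str, str]] = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first header")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def group_orthologs(records: List[Tuple[str, str]],
                    species_order: List[str]) -> List[Dict[str, str]]:
    """Split consecutive records into gene groups, one sequence per species."""
    k = len(species_order)
    if len(records) % k != 0:
        raise ValueError(f"{len(records)} records is not a multiple of {k} species")
    return [
        {sp: seq for sp, (_, seq) in zip(species_order, records[i:i + k])}
        for i in range(0, len(records), k)
    ]


def check_max_genes(n_groups: int, max_genes: int) -> None:
    """The maximal number of genes must not be smaller than the gene count."""
    if max_genes < n_groups:
        raise ValueError(f"max genes {max_genes} < {n_groups} gene groups in file")


# Toy file with two gene groups of four sequences each (made-up data).
demo = """\
>geneA_Scer
ACGTCACGTGAA
>geneA_Smik
ACGACACGTGAT
>geneA_Skud
TCGTCACGTGCA
>geneA_Sbay
ACTTCACGTGGA
>geneB_Scer
GGCACGTGTTAC
>geneB_Smik
GACACGTGTTAA
>geneB_Skud
GGCACGTGATAC
>geneB_Sbay
CGCACGTGTTGC
"""

groups = group_orthologs(read_fasta(demo), SPECIES_ORDER)
check_max_genes(n_groups=len(groups), max_genes=12)
print(len(groups), "gene groups;", groups[0]["S. cerevisiae"])
```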

We provide both a command line mode of the TGS programs, which can be run in the DOS, Linux and OS environments, and a GUI mode for the Windows version of TGS. The usage of the GUI mode (Windows version) is similar to what has been described above. Detailed information is available in the readme file in the download package.


    ACKNOWLEDGEMENTS
This work was supported by the Showalter Trust award (X.L.) and the Indiana Genomics Initiative (INGEN) (X.L. and H.H.), which is funded in part by the Lilly Endowment.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Alfonso Valencia

Received on February 3, 2007; revised on April 17, 2007; accepted on May 18, 2007

    REFERENCES

    Li X, Wong WH. Sampling motifs on phylogenetic trees. Proc. Natl Acad. Sci. USA (2005) 102:9481–9486.

    Liu Y, et al. Eukaryotic regulatory element conservation analysis and identification using comparative genomics. Genome Res. (2004) 14:451–458.

    Moses A, et al. Phylogenetic motif detection by expectation-maximization on evolutionary mixtures. Pac. Symp. Biocomput. (2004) 9:325–335.

    Prakash A, Tompa M. Discovery of regulatory elements in vertebrates through comparative genomics. Nat. Biotechnol. (2005) 23:1249–1256.

    Prakash A, et al. Motif discovery in heterogeneous sequence data. Pac. Symp. Biocomput. (2004) 9:348–359.

    Sinha S, et al. PhyME: a probabilistic algorithm for finding motifs in sets of orthologous sequences. BMC Bioinformatics (2004) 5:170.

    Wang T, Stormo GD. Combining phylogenetic data with co-regulated genes to identify regulatory motifs. Bioinformatics (2003) 19:2369–2380.

