Skip Navigation


Bioinformatics Advance Access originally published online on August 14, 2008
Bioinformatics 2008 24(19):2256-2257; doi:10.1093/bioinformatics/btn408
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
24/19/2256    most recent
btn408v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Ng, P.
Right arrow Articles by Keich, U.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Ng, P.
Right arrow Articles by Keich, U.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

GIMSAN: a Gibbs motif finder with significance analysis

Patrick Ng and Uri Keich *

Department of Computer Science, Cornell University, Ithaca, NY 14853, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 ACKNOWLEDGEMENTS
 REFERENCES
 

Summary: We present GIMSAN (GIbbsMarkov with Significance ANalysis): a novel tool for de novo motif finding. GIMSAN combines GibbsMarkov, our variant of the Gibbs Sampler, described here for the first time, with our recently introduced significance analysis.

Availability: GIMSAN is currently available as a web application and a stand-alone application on Unix and PBS (Portable Batch System) cluster through links from http://www.cs.cornell.edu/~keich.

Contact: keich{at}cs.cornell.edu


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 ACKNOWLEDGEMENTS
 REFERENCES
 
A reliable significance evaluation should be considered as an essential component of any motif finder. Indeed, it is often the only information available to the users before they decide on whether to invest significant resources in further exploration or verification of the reported motifs. We recently demonstrated inherent flaws in the significance analysis based on the E-value of the information content (Ng et al., 2006) as well as on the empirical normal approximation (Ng and Keich, 2008). In contrast, we recently introduced a biologically realistic and reliable method to estimate the reported motif's statistical significance based on a novel 3-Gamma approximation scheme (Keich and Ng, 2007). Exploiting the robustness of this technique, we showed how we can further improve its reliability by factoring in local GC content (Ng and Keich, 2008).

Here, we present a novel de novo motif finding tool called GIMSAN (GIbbsMarkov with Significance ANalysis). GIMSAN combines GibbsMarkov, our variant of the Gibbs Sampler (Lawrence et al., 1993), described here for the first time, with the aforementioned significance analysis. This tool is currently publicly available as a web application and a stand-alone application on Linux and PBS (Portable Batch System) cluster. In addition to reporting the best detected motif1 and its statistical significance, GIMSAN tests whether the putative sites exhibit any pairwise positional dependencies (this last feature will be described in details elsewhere).

GIMSAN allows the user to specify a range of motif widths. We previously showed in similar such cases that selecting the optimal width based on our significance analysis can improve the results of de novo motif finding (Ng and Keich, 2008). Note, however, that choosing the best (i.e. lowest) P-value among several candidates amounts to multiple testing and a necessary correction should be employed by the user.

The performance of GIMSAN on real biological data is demonstrated in (Ng and Keich, 2008).


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 ACKNOWLEDGEMENTS
 REFERENCES
 
2.1 3-Gamma-based P-values
GIMSAN reports two figures that indicate the significance of the reported motif as outlined next. Based on the user selected reference set, GIMSAN generates null sets of sequences that preserve the dimensions and local GC content of the input set. It then runs GibbsMarkov with the user selected parameters on these null sets thereby creating a small sample of the finder's null distribution. Assuming this sample comes from a 3-Gamma distribution, GIMSAN reports a maximum likelihood point estimator of the P-value of the reported motif. Since the latter can significantly over-estimate the significance of the motif, GIMSAN augments it with a, roughly, 95% confidence interval of the P-value of the motif. For more details see Keich and Ng (2007) and Ng and Keich (2008).

2.2 Hybrid Gibbs sampler
By GibbsMarkov we refer here to our variant of a Gibbs Sampler finder (Lawrence et al., 1993). Currently it handles an OOPS (one occurrence per sequence) or a ZOOPS (zero or one occurrence per sequence) model (Bailey and Elkan, 1995). Its scoring function and sampling steps follow the techniques presented in Liu et al. (1995) and Jensen et al. (2004). There are a couple of distinctions between these works and our implementation which merits the following presentation. First, neither of the above papers specifically addresses the ZOOPS model described here. Second, these papers use a complete Bayesian framework which includes a prior on the matrices. Instead, we use a hybrid model which incorporates a prior on the percentage of sequences that include sites, but we use a maximum likelihood approach for the matrix. While the latter is fairly similar to using the Stirling approximation to the full Bayesian model (Jensen et al., 2004), it is not exactly the same. The ZOOPS model is specifically used in Narlikar et al. (2007) but, again, there are some differences between the functions optimized there and ours.2

Our generative probabilistic ZOOPS model of the input set is defined as follows. The given input to the model is: the number of sequences N, their lengths li, the (Markov) background model B and the motif modeled by a 4 x w PSFM (position-specific frequency matrix) {Theta}. We denote by p the probability that sequence Si contains a site. We determine p by randomly drawing from a prior β(a, b)–distribution. In practice, we choose a=b={alpha}N where {alpha} is a parameter that reflects the strength of your prior. Note that this choice indicates our prior belief that on average half the sequences should contain a site3 and it can readily be changed. We next draw N independent samples {Zi} Formula from a Bernoulli(p) distribution. Each Zi is the indicator function of the event‘sequence i contains a site’. Sequences i for which Zi=0 are not containing sites. Therefore, they are generated according to our background model B with probability PB(Si).4 Alternatively, if Zi=1 we first choose Yi, the site location for sequence i, uniformly from {1,2,...,liw+1}.5 We then generate the two background pieces SFormula and SFormula, independently and according to the background model B. The site itself, SFormula is generated according to the product of multinomials parametrized by the PSFM {Theta}: rodFormula{Theta}Formula, where {alpha}(j):=SFormula+j–1Formula. Below we sloppily refer to the product of these last three probabilities/likelihoods as PB, {Theta}(Si| Yi).

The score we are trying to optimize is the model's joint likelihood:


Formula

where β(a, b)=isintFormulaxa – 1(1 – x)b – 1 dx is the beta function. Specifically, we view Z and Y as missing parameters and we try to find


Formula

Note that since the sites are generated according to a product of multinomials, maximizing over {Theta} given Z and Y, is the standard multinomial MLE (maximum likelihood estimation) deduced from the sites' letter counts.

It is convenient to divide by the constant Formula so that our target function simplifies to:


Formula

As in the original work of Lawrence et al. (1993), our sampler iteratively resamples the site location, one sequence at a time. When resampling the location, {Theta} is estimated from the sites chosen in the rest of the sequences. We resample Yi with probabilities proportional to:


Formula

for jisin{1,2,..., liw+1}, where {sum}'Zk={sum}k!=iZk. We allow for Zi=0 which, following the convention mentioned in Narlikar et al. (2007), we denote by Yi=0. Therefore, with the same proportionality constant as above: P(Yi=0) {propto} β({sum}'Zk+a,N{sum}'Zk+a). As usual, we apply a plateau period condition (Lawrence et al., 1993) to stop a run and we use multiple runs, each with its own random starting locations. The configuration of Z and Y that maximizes {Psi}(Z, Y,{Theta}|S) is reported together with its score.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 ACKNOWLEDGEMENTS
 REFERENCES
 
We thank Robert Bukowski for deploying GIMSAN as a web application. Part of this work was carried out by using the resources of the Computational Biology Service Unit from Cornell University which is partially funded by Microsoft Corporation. Finally, we thank anonymous reviewers for very constructive comments.

Funding: National Science Foundation (0644136).

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Limsoon Wong

1A sequence logo is generated using the popular WebLogo (Crooks et al., 2004). Back

2Our target function is different than theirs even in the case of uninformative prior they consider. Back

3The expected number of sites is NE(p)=Na/(a+b) which is N/2 if a=b. Back

4We adopt the common abuse of notations of failing to distinguish between the random variables and their actual values or instances. Back

5We ignore the issue of reverse complement sites in this discussion. Back

Received on June 16, 2008; revised on July 25, 2008; accepted on July 29, 2008

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Bailey T, Elkan C. The value of prior knowledge in discovering motifs with meme. In: Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology. (1995) Menlo Park, California: AAAI Press. 21–29.

    Crooks G. Weblogo: a sequence logo generator. (2004).

    Harbison CT, et al. Transcriptional regulatory code of a eukaryotic genome. Nature (2004) 431:99–104.[CrossRef][Web of Science][Medline]

    Jensen S, et al. Computational discovery of gene regulatory binding motifs: a Bayesian perspective. Stat. Sci (2004) 19:188–204.[CrossRef][Web of Science]

    Keich U, Ng P. A conservative parametric approach to motif significance analysis. In: The 18th International Conference on Genome Informatics (GIW 2007). (2007) Singapore.

    Lawrence C, et al. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science (1993) 262:208–214.[Abstract/Free Full Text]

    Liu J, et al. Bayesian models for multiple local sequence alignment and gibbs sampling strategies. J. Am. Stat. Assoc (1995) 90:1156–1169.[CrossRef][Web of Science]

    Macisaac K, et al. An improved map of conserved regulatory sites for saccharomyces cerevisiae. BMC Bioinformatics (2006) 7.

    Narlikar L, et al. Nucleosome occupancy information improvese novo motif discovery. RECOMB (2007) 107–121.

    Ng P, et al. Apples to apples: improving the performance of motif finders and their significance analysis in the Twilight Zone. Bioinformatics (2006) 22:e393–e401.[Abstract/Free Full Text]

    Ng P, Keich U. Factoring local sequence composition in motif signifance analysis. In: The 19th International Conference on G, enome Informatics (GIW 2008). (2008) inpress.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
24/19/2256    most recent
btn408v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Ng, P.
Right arrow Articles by Keich, U.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Ng, P.
Right arrow Articles by Keich, U.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?