Bioinformatics Advance Access originally published online on June 5, 2007
Bioinformatics 2007 23(19):2619-2621; doi:10.1093/bioinformatics/btm288

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

dPattern: transcription factor binding site (TFBS) discovery in human genome using a discriminative pattern analysis

Seung-Hee Bae 1, Haixu Tang 2, Jing Wu 4, Jun Xie 4 and Sun Kim 2,3,*

1Department of Computer Science, 2School of Informatics, 3Center for Genomics and Bioinformatics, Indiana University – Bloomington, IN 47404 and 4Department of Statistics, Purdue University, West Lafayette, IN, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MIXTURE MODEL GENERATION
 3 MODEL REFINEMENT PROCEDURE
 4 COMPARISON WITH REFINEMENT
 5 MODEL REFINEMENT VERSUS...
 6 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: Transcription factor binding sites (TFBSs) are typically short, so searching with a profile model built from known TFBSs produces many false positives. Combining the search with additional information (gene expression data, in this article) can significantly improve the sensitivity and specificity of TFBS search.

Results: By modifying our previous REFINEMENT approach, we developed dPattern, which searches for occurrences of TFBSs in the promoter regions of up-/down-regulated or random genes.

Availability: http://platcom.org/projects/dpattern

Contact: sun.kim{at}acm.org or sunkim2{at}indiana.edu


    1 INTRODUCTION
The problem we consider is how to increase the specificity of transcription factor binding site (TFBS) prediction by incorporating gene expression information. From gene expression data produced by an experiment designed to study the effect of a certain condition, e.g. interferon stimulation, we can obtain sets of up- and down-regulated genes. Promoters of the up-regulated genes, say by interferon, constitute a positive sequence set in which more TFBS occurrences are expected. Promoters of the down-regulated or random genes constitute a negative sequence set in which fewer TFBS occurrences are expected. Given a set of known TFBSs that we want to search for, e.g. interferon stimulated regulatory elements (ISREs), we previously developed a computational method, called REFINEMENT (Tsukahara et al., 2006), that refines the original motif model iteratively to discriminate occurrences in the positive set from those in the negative set. This discriminative method was successful in reducing false positives in the promoter regions of up-/down-regulated genes. Here we present a new program, called dPattern, and its web server. The main differences between REFINEMENT and dPattern are: (1) dPattern uses Patser (Hertz and Stormo, 1999) instead of a profile hidden Markov model, assuming no gaps in the TFBS, which improves the computational efficiency enough that a general web server can be provided; (2) dPattern uses a mixture model approach: it first computes a background model using human promoter sequences identified from alignments between human and mouse/rat/chimpanzee from the UCSC web site at http://genome.ucsc.edu, and then combines the background model with a TFBS model (user input) to produce an initial mixture model that is general enough to predict de novo TFBSs; and (3) dPattern also uses a simple rank-based discriminative model refinement procedure.

The inputs to dPattern are: (1) a set of known TFBSs, (2) a set of Ensembl gene identifiers for up-regulated genes and (3) a set of Ensembl gene identifiers for down-regulated or random genes. dPattern then searches for occurrences of the TFBS in the promoter regions of these genes.

TFBS search is performed in two steps: mixture model generation and iterative model refinement.


    2 MIXTURE MODEL GENERATION
The goal is to make the motif model general enough to detect TFBSs that are not included in the set of known TFBSs. Given a set of m known TFBSs of length n, {x1,...,xm}, we can build a profile Piy for 1 ≤ i ≤ n and y ∈ {A, T, G, C}. Using human promoter sequences identified from the alignments at UCSC, a kth order Markov model Ky|x1,...,xk is computed. Then a mixture model Miy is computed by combining Piy and Ky|x1,...,xk. The values for column i of Miy are computed as follows.

  1. (1 ≤ i ≤ k)
    A background stationary-profile Siy of length k is computed using the weighted character frequencies of the length-k prefix of each known TFBS xj. The weight of each prefix is determined by its stationary probability from Ky|x1,...,xk using estim_m, a tool from the seq++ package (Miele et al., 2005). Then

    Formula

    where α is a preset value between 0 and 1 (0.75 by default) and all three matrices, Miy, Piy and Biy, are of size 4 × k.

  2. (k < i ≤ n)
    The probability of y, By, is computed by averaging the Markov model probabilities Formula over the m sequences, x1,...,xm. Then

    Formula
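
The formula images for the two cases are not reproduced in the text. One plausible reading, offered only as a sketch, is a convex combination of the TFBS profile and the background. It assumes that α weights the profile Piy, that the second case starts at column k + 1, and that the averaged Markov probability conditions on the k characters preceding column i in each known site xj; none of these details is confirmed by the text.

```latex
% Assumed reconstruction; the weighting by \alpha and the conditioning context are guesses.
\begin{align}
M_{iy} &= \alpha\, P_{iy} + (1-\alpha)\, S_{iy}, && 1 \le i \le k, \\
B_{y}  &= \frac{1}{m} \sum_{j=1}^{m} K_{y \mid x^{j}_{i-k},\ldots,x^{j}_{i-1}}, && k < i \le n, \\
M_{iy} &= \alpha\, P_{iy} + (1-\alpha)\, B_{y}, && k < i \le n.
\end{align}
```

Under this reading, the default α = 0.75 would keep each column dominated by the known sites while leaving a quarter of the probability mass to promoter background.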


The Markov model Ky|x1,...,xk is computed using a set of alignments between human and mouse/rat/chimpanzee from the UCSC web site, and a background model B is computed as described above. Because B is derived from promoter regions conserved between human and a reference genome, say the mouse genome, and TFBSs are typically searched for in promoter regions, we expect B to perform better for TFBS prediction than a simple background model with pseudocounts. The order k of the Markov model is a parameter to be specified by the user. We recommend using the smallest value of k that discriminates between the alignment of promoters of up-regulated genes (a positive set) and the alignment of promoters of down-regulated or random genes (a negative set); users can inspect the differences in model parameters using gnuplot (http://gnuplot.org) at our web server.
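
As an illustration of the background computation, the sketch below estimates a kth order Markov model by counting and averages its conditional probabilities over a set of sites for one column. It is a minimal stand-in, not the seq++ estim_m tool: the function names `train_markov` and `background_column`, the pseudocount, the uniform fallback for unseen contexts and the toy sequences are all assumptions made for the example.

```python
from collections import defaultdict

ALPHABET = "ATGC"

def train_markov(sequences, k, pseudocount=1.0):
    """Estimate K[context][y] = P(y | previous k characters) by counting."""
    counts = defaultdict(lambda: dict.fromkeys(ALPHABET, 0))
    for seq in sequences:
        for i in range(k, len(seq)):
            context, y = seq[i - k:i], seq[i]
            # Skip windows containing ambiguous characters such as N.
            if set(context + y) <= set(ALPHABET):
                counts[context][y] += 1
    model = {}
    for context, row in counts.items():
        total = sum(row.values()) + pseudocount * len(ALPHABET)
        model[context] = {y: (row[y] + pseudocount) / total for y in ALPHABET}
    return model

def background_column(model, sites, i, k):
    """Average P(y | k preceding characters) over the sites at 0-based column i >= k."""
    # Assumption: contexts never seen in training fall back to a uniform row.
    uniform = dict.fromkeys(ALPHABET, 1 / len(ALPHABET))
    rows = [model.get(s[i - k:i], uniform) for s in sites]
    return {y: sum(r[y] for r in rows) / len(rows) for y in ALPHABET}

# Toy data, invented for illustration only.
K = train_markov(["ATATATGC"], k=2)
B = background_column(K, ["ATAG", "TAGC"], i=2, k=2)
```

A real background would be trained on the conserved promoter alignments described above rather than on a single toy string.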

After constructing a mixture model, dPattern searches for statistically significant occurrences of the new model in the promoter regions of the two target sequence sets using the Patser program (Hertz and Stormo, 1999) with a prior in which each nucleotide is equally likely. The system also generates a WebLogo (Crooks et al., 2004) for each set of occurrences.
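
To make the search step concrete, the sketch below scans a sequence with a log-odds weight matrix against a uniform prior of 0.25 per nucleotide. It is not Patser and does not reproduce Patser's significance calculation; the functions `log_odds_matrix` and `scan`, the probability floor, the score threshold and the toy model are assumptions chosen for illustration.

```python
import math

ALPHABET = "ATGC"
UNIFORM_PRIOR = 0.25  # each nucleotide equally likely, as stated in the text

def log_odds_matrix(model, floor=1e-3):
    """Convert column probabilities into log-odds weights against the uniform prior."""
    return [{y: math.log(max(col[y], floor) / UNIFORM_PRIOR) for y in ALPHABET}
            for col in model]

def scan(sequence, weights, threshold):
    """Return (position, site, score) for every ungapped window scoring >= threshold."""
    n = len(weights)
    hits = []
    for start in range(len(sequence) - n + 1):
        window = sequence[start:start + n]
        if set(window) <= set(ALPHABET):  # skip windows with ambiguous characters
            score = sum(weights[i][window[i]] for i in range(n))
            if score >= threshold:
                hits.append((start, window, score))
    return hits

# Toy three-column model preferring "ATG"; invented for illustration only.
toy_model = [
    {"A": 0.7, "T": 0.1, "G": 0.1, "C": 0.1},
    {"A": 0.1, "T": 0.7, "G": 0.1, "C": 0.1},
    {"A": 0.1, "T": 0.1, "G": 0.7, "C": 0.1},
]
hits = scan("CCATGCC", log_odds_matrix(toy_model), threshold=2.0)
```

Because the model is ungapped, each window is scored independently in time proportional to the motif length, which is consistent with the efficiency argument made for replacing the profile hidden Markov model.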


    3 MODEL REFINEMENT PROCEDURE
The goal is to refine the model Miy to a new model M'iy so that the number of the model occurrences in the negative sequence set is reduced while the number of occurrences in the positive sequence set is increased. We modified the refinement procedure of REFINEMENT (Tsukahara et al., 2006) as follows.

  1. Ranking candidates: dPattern ranks occurrences of Miy using the score from Patser. We use a negative ranking strategy to leverage the discriminative information (up versus down/random genes), i.e. the profile model of negative sequences is used to rank binding site occurrences in the positive sequence set. More formally, let O+ be the occurrences in the positive set and O− be the occurrences in the negative set. We build a new profile using O− and rank O+ by the scores from that profile; thus we use a positive model to collect candidates and a negative model to rank them.
  2. Selecting candidates: given a list of TFBS occurrences in the positive sequence set, the user can select candidates either manually, using the checkboxes on the web page, or by simply specifying the top N sequences.
  3. Constructing new model: a new refined model M'iy is constructed from the selected candidate sequences. The TFBS search is then performed again with the new model, and the search result is displayed on the web with a summary and logos for the model and its occurrences in the positive and negative sequence sets. The entire procedure can be repeated as many times as the user wants.
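
The three steps above can be sketched as follows. This is a minimal illustration rather than the dPattern implementation: the helper names, the pseudocount and the toy sites are hypothetical, and the sort direction (occurrences that look least like the negative profile come first) is an assumption, since the text does not state how the negative-model scores are ordered.

```python
import math

ALPHABET = "ATGC"

def profile_from(sites, pseudocount=1.0):
    """Column-wise nucleotide frequencies of equal-length sites, with pseudocounts."""
    n = len(sites[0])
    total = len(sites) + pseudocount * len(ALPHABET)
    return [{y: (sum(s[i] == y for s in sites) + pseudocount) / total
             for y in ALPHABET} for i in range(n)]

def log_score(site, profile):
    """Log-probability of a site under a profile."""
    return sum(math.log(profile[i][c]) for i, c in enumerate(site))

def negative_rank(positive_hits, negative_hits):
    """Step 1: rank positive-set occurrences by their score under the negative profile."""
    neg_profile = profile_from(negative_hits)
    # Assumption: lowest negative-model score first.
    return sorted(positive_hits, key=lambda s: log_score(s, neg_profile))

def refine(positive_hits, negative_hits, top_n):
    """Steps 2 and 3: keep the top N candidates and rebuild the model from them."""
    selected = negative_rank(positive_hits, negative_hits)[:top_n]
    return selected, profile_from(selected)

# Toy occurrences, invented for illustration only (not real ISRE hits).
o_plus = ["GAAACT", "TTTTCT", "GAAAGT"]
o_minus = ["TTTTCT", "TTTTCA", "TTTACT"]
ranked = negative_rank(o_plus, o_minus)
selected, new_model = refine(o_plus, o_minus, top_n=2)
```

In this toy run the positive-set occurrence that also appears in the negative set is ranked last and is excluded from the refined model when N = 2.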
Figure 1 shows an experiment with ISREs in the human genome after a single refinement step. dPattern increased the number of predicted TFBSs in the genes up-regulated by interferon while reducing the number of predicted TFBSs in the down-regulated genes. As a result, the occurrence ratio (up to down) increased from 3.14 to 4.33.


 
Fig. 1. An experiment with ISREs in the human genome. Sequence logos compare the original model and the refined model, together with their occurrences in the positive and negative sequence sets. The original model had 88 and 28 occurrences in the positive and negative sequence sets, respectively, giving an occurrence ratio of 3.14 (88/28). The refined model had 91 and 21 occurrences in the positive and negative sequence sets, respectively, giving an occurrence ratio of 4.33 (91/21).

 

    4 COMPARISON WITH REFINEMENT
For the comparison, we used the REFINEMENT web server and the dPattern web server with a Markov model of order k = 6. We ran one iteration step on each server, measured the running times, and counted the number of predicted ISRE elements in both the up-regulated and the down-regulated promoter region sequence sets.

In terms of running time, the REFINEMENT web server took 116 s to produce the result of one refinement iteration, whereas the dPattern web server took only 33 s. In terms of predicted binding sites, the REFINEMENT web server predicted 87 ISRE sites in the promoter regions of genes up-regulated by interferon and 68 sites in the promoter regions of genes down-regulated by interferon. In contrast, dPattern predicted 91 sites in the up-regulated promoter region sequences and only 21 sites in the down-regulated promoter region sequence set.
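
For reference, the occurrence ratios and the running-time ratio implied by the reported counts can be worked out directly. The snippet below uses only the numbers stated in this section and in Fig. 1; the dictionary labels are ours, and the REFINEMENT ratio is derived here rather than quoted from the text.

```python
# (up-regulated hits, down-regulated hits) as reported in the text
results = {
    "original model (Fig. 1)": (88, 28),
    "dPattern, one refinement step": (91, 21),
    "REFINEMENT, one refinement step": (87, 68),
}

# Occurrence ratio = hits in the positive set / hits in the negative set
ratios = {name: up / down for name, (up, down) in results.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")

# Running times of one refinement iteration, in seconds
speedup = 116 / 33
print(f"REFINEMENT / dPattern running time: {speedup:.1f}x")
```

On these counts the up-to-down ratio is about 1.28 for REFINEMENT against 4.33 for dPattern, and dPattern is roughly 3.5 times faster for this single run.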


    5 MODEL REFINEMENT VERSUS DE NOVO PREDICTION
dPattern is designed to refine an original model that was constructed using known transcription factor binding sites. Prediction of de novo binding sites is a challenging problem that differs from the refinement problem addressed in this article. Recently, there have been interesting developments on the de novo binding site problem using discriminative approaches (Kawada and Sakakibara, 2005; Sinha, 2006). Use of discriminative approaches can make binding site prediction algorithms less sensitive to model parameter settings, as shown by Sinha (2006).

We ran DIPS, Discriminative PWM Search (Sinha, 2006), on our ISRE sequence data to measure how well de novo binding site prediction would perform. DIPS was not able to find the ISRE binding sites correctly after running for more than 2 days on a Linux machine (3.0 GHz Intel processor and 4 GB main memory). The main reason is that our sequences (each 7 kb in length) are significantly longer than the data used in Sinha (2006).

There is growing evidence that some binding sites can be very far away, say >100 kb, from the transcription start site (TSS). For example, a genome-wide analysis of binding sites of the estrogen receptor, the master transcriptional regulator of the breast cancer phenotype and the archetype of a molecular therapeutic target, showed that binding sites were detected as far as 206 kb from the TSS (Carroll et al., 2006).

Thus, a discriminative method that combines de novo binding site prediction and model refinement techniques may be an interesting approach to deal with binding sites that are far away from TSS.


    6 CONCLUSION
dPattern is a web server where users can search for occurrences of a specific TFBS in the human genome with gene expression data. The accuracy of TFBS search, which often produces many false positives and negatives, can be improved significantly by incorporating gene expression data. Since each gene expression experiment is designed with a specific goal, dPattern will be a valuable tool for TFBS search with gene expression data.


    ACKNOWLEDGEMENTS
This work is supported by a US National Science Foundation Career DBI-0237901 grant (to Kim), US National Institutes of Health National Cancer Institute Grant CA113001 (to Kim) and a Collaboration in Life Sciences and Informatics Research (CLSIR) grant from the state of Indiana (to Xie, Wu, Tang and Kim).

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Burkhard Rost

Received on January 9, 2007; revised on May 20, 2007; accepted on May 21, 2007

    REFERENCES

    Carroll JS, et al. Genome-wide analysis of estrogen receptor binding sites. Nat. Genet. (2006) 38:1289–1297.

    Crooks GE, et al. WebLogo: a sequence logo generator. Genome Res. (2004) 14:1188–1190.

    Hertz GZ, Stormo GD. Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics (1999) 15:563–577.

    Kawada Y, Sakakibara Y. Discriminative detection of cis-acting regulatory variation from location data. In: Proceedings of the 4th Asia-Pacific Bioinformatics Conference (2005) 89–98.

    Miele V, et al. seq++: analyzing biological sequences with a range of Markov-related models. Bioinformatics (2005) 21:2783–2784.

    Sinha S. On counting position weight matrix matches in a sequence, with application to discriminative motif finding. Bioinformatics (2006) 22:454–463.

    Tsukahara T, et al. REFINEMENT: a search framework for the identification of interferon-responsive elements in DNA sequences – a case study with ISRE and GAS. Comput. Biol. Chem. (2006) 30:134–147.

