

Bioinformatics Advance Access originally published online on August 27, 2004
Bioinformatics 2005 21(1):31-38; doi:10.1093/bioinformatics/bth471

Bioinformatics vol. 21 issue 1 © Oxford University Press 2005; all rights reserved.

DWE: Discriminating Word Enumerator

Pavel Sumazin 1,2,*, Gengxin Chen 1, Naoya Hata 1, Andrew D. Smith 1, Theresa Zhang 3 and Michael Q. Zhang 1,*

1 Cold Spring Harbor Laboratory 1 Bungtown Road, Cold Spring Harbor, NY 11724, USA
2 Computer Science Department, Portland State University P.O. Box 751, Portland, OR 97207, USA
3 Bioinformatics, Merck Research Laboratories Rahway, NJ 07065, USA

*To whom correspondence should be addressed.


    ABSTRACT

Motivation: Tissue-specific transcription factor binding sites give insight into tissue-specific transcription regulation.

Results: We describe a word-counting-based tool for de novo tissue-specific transcription factor binding site discovery using expression information in addition to sequence information. We incorporate tissue-specific gene expression through gene classification to positive expression and repressed expression. We present a direct statistical approach to find overrepresented transcription factor binding sites in a foreground promoter sequence set against a background promoter sequence set. Our approach naturally extends to synergistic transcription factor binding site search.

We find putative transcription factor binding sites that are overrepresented in the proximal promoters of liver-specific genes relative to proximal promoters of liver-independent genes. Our results indicate that binding sites for hepatocyte nuclear factors (especially HNF-1 and HNF-4) and CCAAT/enhancer-binding protein (C/EBPβ) are the most overrepresented in proximal promoters of liver-specific genes. Our results suggest that HNF-4 has strong synergistic relationships with HNF-1, HNF-4 and HNF-3β and with C/EBPβ.

Availability: Programs are available for use over the Web at http://rulai.cshl.edu/tools/dwe

Contact: ps{at}cs.pdx.edu; mzhang{at}cshl.edu

Supplementary information: Data and omitted results are available at http://rulai.cshl.edu/tools/dwe/supp


    INTRODUCTION
One of the main goals of modern genetics is to decipher the mechanisms of gene expression and regulation. Recent years have seen the generation of a significant volume of data that will help to probe expression mechanisms. Microarray techniques and chromatin immunoprecipitation (ChIP) techniques allow for genome-scale investigation of gene expression and DNA-binding protein localization. These techniques can be used to classify expression by cell environment and transcription factor binding.

Completed or nearly completed genome sequences are publicly available for a growing number of vertebrate species including human, mouse, rat and chicken. Increasingly accurate methods for detecting transcription start sites (TSSs), such as Davuluri et al. (2001) and Scherf et al. (2000), enable localization of promoter regions. Coupled together, sequence information and TSS location can be used to identify proximal promoter sequences. Proximal promoter sequences have already been well identified for a large number of genes in human, mouse and rat.

We are interested in methods that combine gene expression and sequence information for de novo discovery of transcription factor binding sites (TFBSs) in proximal promoters of co-expressed tissue-specific genes. The annotation of proximal promoters for such genes will advance the understanding of tissue-specific transcription regulation.

We describe a discriminant word-counting algorithm, Discriminating Word Enumerator (DWE), which can be used to discover motifs in promoters of co-regulated genes. We use DWE to find overrepresented gapped degenerate words (motifs) in proximal promoters of liver-specific genes taken from the Liver-Specific Promoter Database (LSPD) (Zhang and Zhang, 2000, http://cgsigma.cshl.org/LSPD) against vertebrate promoters from the Eukaryotic Promoter Database (EPD), release 78 (Perier et al., 1998). We use TSS data from DBTSS (Suzuki et al., 2002) and sequence data from GenBank to collect promoter sequences.

Related literature
Classical sequence-based motif discovery algorithms include CONSENSUS (Hertz et al., 1990), MEME (Bailey and Elkan, 1995) and the Gibbs sampler (Lawrence et al., 1993; Liu et al., 1995). Other motif discovery algorithms that use word-counting methods have been reported previously (Van Helden et al., 1998, 2000; Sinha and Tompa, 2002). Recent motif search algorithms that use sequence and microarray data from expression or ChIP analysis include REDUCE (Bussemaker et al., 2001), MDscan (Liu et al., 2002), DMOTIFS (Sinha, 2003) and YMF (Sinha and Tompa, 2000, 2002; Blanchette and Sinha, 2001). REDUCE relates motif occurrence counts to gene expression ratio; MDscan iteratively constructs matrix representations of TFBSs that are overrepresented in the foreground set against a Markov background model that can be estimated from a background sequence set; DMOTIFS searches for overrepresented motifs in a foreground set against a background set while maintaining a maximum count per sequence; YMF searches for overrepresented motifs in a foreground set against a third-order Markov model estimated from a background sequence set. Beer and Tavazoie (2004) describe a method for predicting expression from TFBS abundance; this method could be extended to include motifs found by DWE. We extend recent work that uses a P-value statistic to search for overrepresented ungapped motifs of length 7 in Saccharomyces cerevisiae promoters.


    SYSTEMS AND METHODS
We searched for overrepresented motifs in a set of non-orthologous proximal promoters of genes that are known to have high expression in liver. We also searched for motifs in the consensus sequences of these proximal promoters. We measured the overrepresentation of motifs in these sets against the set of all vertebrate proximal promoters in EPD78, and the set of EPD78 vertebrate proximal promoters whose corresponding genes are not known to be strongly expressed in liver. We report the most overrepresented motifs in these comparisons, and infer the transcription factors most likely to bind to the corresponding TFBSs.

Statistical evaluation
We use three methods to evaluate the significance of motif overrepresentation.

P-value. The fixed marginal contingency table P-value follows the multiple hypergeometric distribution given in Equation (1); for a review see Agresti (1992). The P-value for the table is the sum of the probabilities of all tables that are at least as extreme. In this application we compute a P-value for the overrepresentation of a motif in the foreground set against the background set, so that N_f and N_b are the numbers of potential occurrences in the foreground and background sets (trials), and n_f and n_b are the numbers of observed occurrences in the respective sets (successes).
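A minimal sketch of this computation, assuming the standard 2x2 fixed-margin (hypergeometric) form and a one-sided tail in the direction of foreground overrepresentation. The function names are illustrative and are not taken from DWE.

```python
from math import comb


def table_probability(N_f, N_b, k, n):
    """Probability of a table with k foreground successes out of n total
    successes, with all margins (N_f, N_b, n) held fixed."""
    return comb(N_f, k) * comb(N_b, n - k) / comb(N_f + N_b, n)


def overrepresentation_p_value(n_f, N_f, n_b, N_b):
    """Sum the probabilities of all tables with at least n_f foreground
    successes (assumed meaning of 'at least as extreme')."""
    n = n_f + n_b                 # total observed occurrences (fixed margin)
    k_max = min(N_f, n)           # largest possible foreground count
    return sum(table_probability(N_f, N_b, k, n)
               for k in range(n_f, k_max + 1))
```

For example, `overrepresentation_p_value(3, 3, 0, 3)` considers only the single most extreme table and returns 1/20.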


Z-test. The Z-test (Student, 1908) is represented by the following Equation (2).
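As a hedged illustration only, a pooled two-proportion Z statistic over the same counts could look as follows. The pooled-variance form is an assumption made for this sketch; the exact expression used by DWE is the one in Equation (2).

```python
from math import sqrt


def z_score(n_f, N_f, n_b, N_b):
    """Pooled two-proportion Z statistic (assumed form, for illustration).
    Positive values indicate a higher occurrence rate in the foreground."""
    p_f = n_f / N_f                       # foreground occurrence rate
    p_b = n_b / N_b                       # background occurrence rate
    p = (n_f + n_b) / (N_f + N_b)         # pooled occurrence rate
    se = sqrt(p * (1 - p) * (1 / N_f + 1 / N_b))
    if se == 0:
        return 0.0                        # no variation: rates are identical
    return (p_f - p_b) / se
```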


Log frequency ratio. The log frequency ratio (LFR) is as follows:
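A minimal sketch, assuming the LFR is the logarithm of the foreground occurrence frequency divided by the background occurrence frequency. The natural logarithm and the error on zero counts are choices made for this example, not details confirmed by the text.

```python
from math import log


def log_frequency_ratio(n_f, N_f, n_b, N_b):
    """log((n_f / N_f) / (n_b / N_b)); natural log is an assumption."""
    if n_f == 0 or n_b == 0:
        raise ValueError("LFR is undefined when either count is zero")
    return log((n_f / N_f) / (n_b / N_b))
```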


From TFBS to transcription factor
We searched through TRANSFAC (Knuppel et al., 1994) for position frequency matrices (PFMs) that match the motifs found by DWE and PFMs found by MDscan. Transcription factors that are known to bind to the TRANSFAC PFMs are likely to bind to the matching DWE motifs and MDscan PFMs. To facilitate the search, we converted consensus-based motifs into PFMs using the maximum entropy principle of Jaynes (1957a,b); each IUPAC symbol was converted into a maximum-entropy column with total count equal to the number of foreground occurrences n_f. For example, M = {A, C} was converted into [n_f/2, n_f/2, 0, 0]^T and D = {A, G, T} was converted into [n_f/3, 0, n_f/3, n_f/3]^T. We used a χ² test to compare discovered-motif PFMs to TRANSFAC PFMs following the methodology proposed by Schones et al. (2004); PFMs are iid observations from a product multinomial distribution and were compared column by column, with the smaller PFM compared at each possible position to a submatrix of the larger PFM and the best match reported. PFMs were said to match when the normalized probability that they are occurrences from the same product-multinomial distribution was better than 0.05.
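The conversion from a consensus motif to a PFM can be sketched as follows. The row order (A, C, G, T) follows the two examples in the text; the function name and the dictionary are illustrative.

```python
# Standard IUPAC nucleotide codes and the bases each one represents.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def motif_to_pfm(motif, n_f):
    """Return one column per motif position, ordered (A, C, G, T).
    The n_f counts are spread evenly over the bases a symbol allows."""
    pfm = []
    for symbol in motif.upper():
        allowed = IUPAC[symbol]
        share = n_f / len(allowed)
        pfm.append([share if base in allowed else 0.0 for base in "ACGT"])
    return pfm
```

With n_f = 10, `motif_to_pfm("M", 10)` gives the single column [5, 5, 0, 0], matching the M = {A, C} example.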

Dataset and consensus set
We selected LSPD genes that have at least one known ortholog, a known TSS, and sequence information covering the [–299, 100] region relative to the TSS. With the objective of collecting promoters with known sequence information covering the [–499, 100] region relative to the TSS, we selected a longest promoter from each set of orthologs, breaking ties arbitrarily. The resulting Liver-Specific Promoter Subset (LSPS) includes 35 promoters with mean length 549. In contrast, the vertebrate promoter subset of EPD78 includes 2380 promoters with average length 579, and the promoter subset of liver expressed genes in EPD78 includes 103 promoters with average length 558. LSPS includes four promoters that are subsequences or orthologs of Krivan and Wasserman (2001) promoters, including RATAADC01, HUMVITDBP, MMILGF and HUMGLUT201. Promoters of selected LSPD genes, LSPS, mapping from LSPS to EPD78 and mapping from promoters of liver expressed genes in EPD78 to LSPS are provided in the Supplementary information.

We generated a consensus sequence for each ortholog set, and used those consensus sequences to check for the conservation of motifs found in LSPS. To generate a consensus sequence, we first aligned orthologs using CLUSTALW (Thompson et al., 1994) with default parameters. We selected a consensus element for each aligned position according to the following procedure. Collect the set of nucleotides that appear at least twice at this position across the aligned sequences; if any of the sequences contains a gap at this position or if the nucleotide set is empty output a ‘-’, otherwise output an IUPAC symbol that corresponds to the collected nucleotide set. To measure conservation, we report the number of occurrences of each discovered motif and motif pair in the consensus set.
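The column rule can be sketched as below, taking already-aligned sequences of equal length as input (the alignment step itself is left to CLUSTALW). The function and variable names are illustrative.

```python
from collections import Counter

# Map each set of bases to its IUPAC symbol.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
SET_TO_IUPAC = {frozenset(bases): symbol for symbol, bases in IUPAC.items()}


def consensus(aligned):
    """Build a consensus from aligned sequences, one symbol per column."""
    out = []
    for column in zip(*aligned):
        column = [c.upper() for c in column]
        if "-" in column:                 # any gap in the column yields '-'
            out.append("-")
            continue
        counts = Counter(column)
        # Keep nucleotides seen at least twice at this position.
        kept = frozenset(b for b, c in counts.items()
                         if c >= 2 and b in "ACGT")
        out.append(SET_TO_IUPAC.get(kept, "-"))   # empty set yields '-'
    return "".join(out)
```

For instance, `consensus(["AC", "AC", "GT", "GT"])` returns "RY", since both A and G (and both C and T) appear twice in their columns.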

We searched for overrepresented motifs in the consensus set against vertebrate promoters in EPD78 (Table 3). To accommodate motif discovery programs that do not accept degenerate nucleotide input, we modified the consensus generation procedure to output the majority nucleotide in a column (and a ‘-’ in case of a tie) instead of a degenerate IUPAC symbol. The modified consensus sequence set has four sequences that are different from the original. Both consensus sequence sets are provided in the Supplementary information.


Table 3 Motifs that are overrepresented (by occurrence count) in the consensus set against EPD vertebrate promoters

 

    ALGORITHM
Given a motif structure, including motif length, gaps and maximum number of degenerate positions, we enumerate all matching motifs using a method similar to that of Waterman et al. (1984). Each non-degenerate motif is mapped to an integer by stripping away gaps and converting the resulting word of length ℓ over an alphabet of size 4 into an integer ranging from 0 to 4^ℓ – 1. Each motif position and integer representation are recorded, and the operation is repeated for the reverse complement if so specified. Position information is compiled for each permitted degenerate word. The representation of each word and each degenerate word in the foreground is compared with its representation in the background, and the words with foreground overrepresentation above threshold are reported. DWE disregards substrings with characters other than the case-insensitive A, C, G, T in the background and foreground sequence sets.
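The word-to-integer mapping and position recording can be sketched as follows. This is an illustration under stated assumptions, not the DWE implementation: the base-4 digit assignment (A=0, C=1, G=2, T=3) is arbitrary, and only the non-gap letters are checked for validity.

```python
from collections import defaultdict

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # assumed digit assignment


def word_to_int(word):
    """Read a gap-free word over {A, C, G, T} as a base-4 integer."""
    value = 0
    for base in word.upper():
        value = value * 4 + CODE[base]
    return value


def reverse_complement(word):
    return word.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def index_words(sequence, left, gap, right):
    """Record start positions of every (left + gap + right) word, keyed by
    the integer of its non-gap letters. Windows whose non-gap letters
    include anything other than A, C, G, T are skipped."""
    positions = defaultdict(list)
    span = left + gap + right
    sequence = sequence.upper()
    for i in range(len(sequence) - span + 1):
        letters = sequence[i:i + left] + sequence[i + left + gap:i + span]
        if all(base in CODE for base in letters):
            positions[word_to_int(letters)].append(i)
    return positions
```

A word of length 3 maps to an integer between 0 (AAA) and 63 (TTT); degenerate words can then be assembled by merging the position lists of the non-degenerate words they cover.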

Thresholds are set for P-values, LFRs and Z-values as described in the Systems and methods section. Comparison conditions such as self-overlap, counting method and motif independence are user specified. When self-overlap is disallowed, the number of potential occurrences (trials) in each sequence set will be set to the maximum number of non-overlapping occurrences. The counting method can be set to word counting or sequence counting. The former refers to counting occurrences independently of their distribution across sequences, and the latter refers to counting sequences that contain at least one motif occurrence. When motif independence is not required, DWE reports all overrepresented motifs above the specified threshold. Such reporting may include similar words that have related sets of occurrences. For example, occurrence sets for degenerate words CTNTGD and CTVTGD will have a large intersection. When motif independence is required, we use the χ²-test suggested by Schones et al. (2004) to suppress the reporting of lower-quality dependent words.
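The two counting methods and the self-overlap option can be sketched as below. The regular-expression matching and the trial counts (|s| - w + 1 windows with overlap, floor(|s| / w) without) are assumptions made for this illustration; the names are hypothetical.

```python
import re

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def find_occurrences(motif, sequence, allow_overlap=True):
    """Start positions of an ungapped IUPAC motif, scanning left to right."""
    pattern = re.compile("".join("[" + IUPAC[c] + "]" for c in motif.upper()))
    sequence = sequence.upper()
    starts, pos = [], 0
    while True:
        match = pattern.search(sequence, pos)
        if match is None:
            break
        starts.append(match.start())
        # Without self-overlap, resume after the end of this occurrence.
        pos = match.start() + 1 if allow_overlap else match.end()
    return starts


def word_count(motif, sequences, allow_overlap=True):
    """Occurrences counted regardless of how they spread over sequences."""
    return sum(len(find_occurrences(motif, s, allow_overlap)) for s in sequences)


def sequence_count(motif, sequences):
    """Number of sequences that contain at least one occurrence."""
    return sum(1 for s in sequences if find_occurrences(motif, s))


def potential_occurrences(sequences, width, allow_overlap=True):
    """Number of trials for word counting (assumed window counts)."""
    if allow_overlap:
        return sum(max(len(s) - width + 1, 0) for s in sequences)
    return sum(len(s) // width for s in sequences)
```

On the two sequences CTATGA and CTTTGT, CTNTGD occurs twice and CTVTGD once, which illustrates how the occurrence sets of similar degenerate words intersect.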

Finding synergistic motifs
Given a list of IUPAC motifs and an integer k, DWE will search for motif k-tuples that occur in the same sequences and are overrepresented in the foreground. In the case that overlap is not allowed, the counting procedure is more intricate. When sequence counting is used, the number of trials (potential number of occurrences for a tuple in a promoter set) is the number of sequences in that set, and the number of successes (occurrences of that tuple) is the number of sequences containing at least one set of non-overlapping occurrences of each x ∈ X_k. When word counting is used, the number of trials for a motif k-tuple X_k is given in Equation (4), where S = {s} is the set of sequences and |s| is the length of s. We calculate the number of successes for each tuple using a recursion on k. For k = 2, the number of successes for X_2 = {x_1, x_2} over S is Σ_{s ∈ S} x_1(s) x_2(s) – O(X_2), where O(X_2) is the number of overlapping occurrences of x_1 and x_2, and x(s) is the number of occurrences of x in s. For k > 2, the number of overlapping occurrences O(X_k) = Σ_{s ∈ S} O(X_k, s) is given in Equation (5), where L(X_k, s) is the number of distinct motif k-tuple occurrences that share at least one position in s. The total running time is in the order of |S| + k log k O(X_k).
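For the k = 2 case, a rough sketch of pair counting from precomputed occurrence positions is given below. It assumes that the number of successes equals the number of co-occurring pairs in a sequence minus the pairs that overlap; the quadratic overlap check is chosen for clarity and is not the procedure used by DWE. All names are hypothetical.

```python
def overlapping_pairs(starts_1, width_1, starts_2, width_2):
    """Count (p, q) pairs whose intervals [p, p + width_1) and
    [q, q + width_2) share at least one position."""
    return sum(1 for p in starts_1 for q in starts_2
               if p < q + width_2 and q < p + width_1)


def pair_word_successes(occ_1, width_1, occ_2, width_2):
    """Word counting: non-overlapping pairs summed over all sequences.
    occ_1 and occ_2 hold one list of start positions per sequence."""
    total = 0
    for starts_1, starts_2 in zip(occ_1, occ_2):
        pairs = len(starts_1) * len(starts_2)
        total += pairs - overlapping_pairs(starts_1, width_1, starts_2, width_2)
    return total


def pair_sequence_successes(occ_1, width_1, occ_2, width_2):
    """Sequence counting: sequences with at least one non-overlapping pair."""
    return sum(
        1 for starts_1, starts_2 in zip(occ_1, occ_2)
        if len(starts_1) * len(starts_2)
        > overlapping_pairs(starts_1, width_1, starts_2, width_2)
    )
```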




    EXPERIMENTS
We used DWE and MDscan to find the most overrepresented motifs in LSPS against EPD. We did not use REDUCE because it is less suitable for discriminating against a background set. Our results on synthetic data suggest that YMF does not perform as well as DWE or MDscan when searching for overrepresented motifs in a foreground set against a background set.

Performance on synthetic sequence data
The sensitivity of motif finding algorithms depends on the total size of the sequence set, motif width and motif degeneracy. We tested the algorithms on synthetic data with dimensions similar to those of LSPS. Foreground and background sets were made of 35 sequences of length 550. We implanted motifs of increasing number and degeneracy in the foreground sets and measured the ability of each algorithm to detect these motifs against background sets. Background sets and non-motif elements in the foreground sets were generated from a background vector with 60% CG. Motifs were generated from position weight matrices (PWMs) that correspond to uniformly selected IUPAC words with specified number of degenerate positions.
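A sketch of how such synthetic sets could be generated is shown below. It assumes the 60% CG content is split evenly between C and G (and the remainder between A and T), and that each implanted instance is drawn uniformly from the nucleotides allowed at each IUPAC position. Implanted sites may overwrite one another in this simplified version.

```python
import random

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def random_sequence(length, rng, gc=0.6):
    """Draw an i.i.d. sequence from a background vector with the given GC."""
    weights = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]   # A, C, G, T
    return "".join(rng.choices("ACGT", weights=weights, k=length))


def sample_instance(iupac_word, rng):
    """Draw one concrete site from a degenerate IUPAC word."""
    return "".join(rng.choice(IUPAC[c]) for c in iupac_word.upper())


def implant(sequences, iupac_word, n_sites, rng):
    """Place n_sites motif instances uniformly at random across the set."""
    seqs = [list(s) for s in sequences]
    width = len(iupac_word)
    for _ in range(n_sites):
        target = rng.choice(seqs)
        start = rng.randrange(len(target) - width + 1)
        target[start:start + width] = sample_instance(iupac_word, rng)
    return ["".join(s) for s in seqs]


rng = random.Random(0)                      # fixed seed for repeatability
background = [random_sequence(550, rng) for _ in range(35)]
foreground = implant([random_sequence(550, rng) for _ in range(35)],
                     "CTNTGD", 20, rng)
```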

We constructed foreground sets with 10–40 uniformly-at-random implanted occurrences of motifs with width six, and 0, 1 and 2 degenerate positions. For each motif type and motif number, new foreground and background sets were constructed and the experiment was repeated 100 times. We selected the top five motifs found by DWE when counting motif occurrences (denoted by DWE-W), DWE when counting the number of sequences containing the motif (denoted by DWE-S), MDscan and YMF. We did not remove dependences between the motifs found by the algorithms, potentially allowing for similar motifs in the top-5 set. We report the proportion of trials where the implanted motif matched a top-5 motif. When matching motifs, we matched a degenerate element using all of the nucleotides it represents. Our results suggest that DWE outperforms MDscan on non-degenerate motifs, MDscan outperforms DWE on degenerate motifs, and YMF performs worse than DWE and MDscan (Fig. 1).



Fig. 1 Detection-quality comparison of DWE, MDscan and YMF when attempting to discover an implanted motif with width six against a vector-generated background sequence set. We plot the frequency (from 0 to 1) of the correct detection in the top five found motifs for each method as a function of the number of implanted motifs (from 10 to 40). Foreground and background sets contained 35 sequences of length 550; motifs are implanted uniformly at random across the set; each data point corresponds to 100 runs of the corresponding algorithm; and DWE-W counts the number of motif occurrences in each set and DWE-S counts the number of sequences containing the motif. We report results for implanted motifs with no degenerate positions (top), one degenerate position (middle) and two degenerate positions (bottom).

 
We tested the ability of the algorithms to discover implanted motifs that are strongly underrepresented in the background set. We augmented the randomly constructed background sets in our initial experiments with 35 additional sequences of length 550 that do not include any occurrences of the implanted motif. The detection quality of the algorithms when using the augmented background sets is reported in Figure 2. The performance of DWE improved dramatically, while the performance of MDscan and the performance of YMF did not improve substantially.



Fig. 2 Detection-quality comparison of DWE, MDscan and YMF when attempting to discover an implanted motif with width six against an augmented background sequence set that is created by adding 35 additional sequences that do not contain the motif to the background set used in the experiments reported in Figure 1. We plot the frequency (from 0 to 1) of the correct detection in the top five found motifs for each method as a function of the number of implanted motifs (from 10 to 40). Each data point corresponds to 100 runs of the corresponding algorithm; DWE-W counts the number of motif occurrences in each set and DWE-S counts the number of sequences containing the motif. We report results for motifs with no degenerate positions (top), one degenerate position (middle) and two degenerate positions (bottom).

 
Liver-Specific Promoter Database
We used DWE to discover motifs that are overrepresented in LSPS against the vertebrate promoter subset of EPD78 (Table 1), and against that set excluding promoters of liver-expressed genes (Table 2). We searched for (3+gap+3)mers and (4+gap+4)mers, with rigid gaps ranging from 0 to 7 bp and at most two degenerate positions. We also searched for the motifs that are overrepresented in the consensus set against the vertebrate promoter set from EPD78 (Table 3). We repeated these searches using MDscan and report the top three motifs of lengths 6, 8 and 10 in each experiment (Tables 4–6).
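To give a sense of the size of this search space, the sketch below lists the motif structures and counts the candidate words per structure. It assumes that any of the 11 degenerate IUPAC symbols (including N) may fill a degenerate position; the names are illustrative.

```python
from math import comb

DEGENERATE_SYMBOLS = "RYSWKMBDHVN"   # assumed set of allowed degenerate symbols


def motif_structures(half_widths=(3, 4), max_gap=7):
    """(left, gap, right) triples for the (3+gap+3) and (4+gap+4) searches."""
    return [(h, gap, h) for h in half_widths for gap in range(max_gap + 1)]


def words_per_structure(n_letters, max_degenerate=2):
    """Number of IUPAC words with at most max_degenerate degenerate letters."""
    return sum(
        comb(n_letters, d) * 4 ** (n_letters - d) * len(DEGENERATE_SYMBOLS) ** d
        for d in range(max_degenerate + 1)
    )


def render(left, gap, right, gap_char="•"):
    """Write a gapped motif in the notation used in the tables."""
    return left + gap_char * gap + right
```

Under these assumptions there are 16 structures, and each (3+gap+3) structure admits 4096 + 67584 + 464640 = 536320 candidate words.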


Table 1 Motifs that are strongly overrepresented (by occurrence count) in promoters of liver-expressed genes (LSPS) against promoters of liver-expression independent genes (EPD)

 

Table 2 Motifs that are strongly overrepresented (by occurrence count) in LSPS against EPD vertebrate promoters of genes that are not known to be expressed in liver

 

Table 4 Top three motifs of lengths 6,8 and 10 found by MDscan to be over-represented in LSPS against EPD78 vertebrate promoters

 

Table 6 Top three motifs of lengths 6,8 and 10 found by MDscan to be over-represented in the consensus set against vertebrate promoters in EPD78

 
Initially, MDscan reported poly(A) and alternating C–T motifs. These motifs are found to be strongly overrepresented by DWE when motif autocorrelation is not considered. However, the number of occurrences of these motifs decreases substantially when self-overlap is not permitted, and they are not reported in the top 50. In order to use MDscan more effectively, we masked all substrings that correspond to cycles of periods 1 and 2 and length 8 or greater. The results by MDscan still differ substantially from the results of DWE, but both identify binding sites that are similar to known binding sites for hepatocyte nuclear factors HNF-4 and HNF-1.

Because the consensus set allows for a very small number of trials for each word structure, and because of the high false-negative rate when using a consensus, we did not find motifs with P-values <0.001 when searching in the consensus against EPD vertebrate promoters. Instead, we report motifs by Z-test score (Table 3).
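The masking step can be sketched as follows. Replacing masked positions with N is an assumption made for this example; the text does not state which mask character was used.

```python
def mask_periodic(sequence, periods=(1, 2), min_len=8, mask_char="N"):
    """Mask every maximal run that repeats with one of the given periods
    and spans at least min_len characters."""
    masked = list(sequence)
    n = len(sequence)
    for p in periods:
        start = 0
        while start < n:
            end = start + p
            # Extend the run while each character equals the one p before it.
            while end < n and sequence[end] == sequence[end - p]:
                end += 1
            if end - start >= min_len:
                masked[start:end] = mask_char * (end - start)
            # The next run can begin no earlier than p - 1 characters
            # before the position where this one stopped.
            start = end - p + 1
    return "".join(masked)
```

For example, a run of eight A's or the substring CTCTCTCT is masked in full, while CTCTCTC (length 7) is left unchanged.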

For each motif x with n f occurrences in the foreground set and n c occurrences in the consensus set, we found all degenerate words having the same structure and the same count in the foreground set, and counted the number of occurrences of these words in the consensus set. Our results suggest that the majority of these words are strongly conserved in the consensus set. These results are reported in the Supplementary information.


    DISCUSSION AND CONCLUSION
DWE is a fast word-counting-based tool for discovering overrepresented motifs in one set of promoters relative to another. Our results on synthetic data suggest that DWE outperforms existing methods on a large class of motifs, and is best suited for finding overrepresented motifs against carefully selected background sets. However, the accuracy of DWE decreases with increasing motif degeneracy. In addition to single motifs, DWE can find overrepresented motif tuples. A feature of DWE's P-value motif comparison method is that it allows comparisons of motifs with different structures, and motifs that are found using different foreground or background sets.

We used DWE to search for overrepresented motifs in proximal promoters of liver-specific genes, and found that HNF binding sites and binding sites for CCAAT/enhancer-binding protein (C/EBPβ) are the most overrepresented. This conclusion is largely supported by experiments with MDscan, and agrees with the results of Baumhueter et al. (1988), Costa et al. (1989), Xanthopoulos et al. (1991), Thomas et al. (2001) and Krivan and Wasserman (2001). Our results on synthetic data suggest that DWE has a high degree of accuracy when searching for motifs with structures and frequencies characteristic of the majority of motifs reported.

When searching for co-occurring motif pairs, we found that HNF-4 binding sites have strong synergistic relationships with other HNF-4 binding sites and with binding sites of HNF-1, HNF-3β and C/EBPβ. These relationships are supported by high conservation ratios (number of occurrences in LSPS versus number of occurrences in the promoter consensus set), and agree with the results of Miura and Tanaka (1993), Antes and Levy-Wilson (2001) and Hatzis and Talianidis (2002).

Our results suggest that the majority of top motifs found by DWE are conserved, but a few motifs, such as CWGT•••CABA and ATAGTYTV of Tables 7 and 8, have low conservation ratios and may be false positives. The majority of motif pairs in Tables 9 and 10 have weak conservation ratios, but the motif pairs GWTA••••TTDA MWG•TTA, GWTA••••TTDA AAMRGT, GWTA••••TTDA TTGBAA and GDTA••••TTRA TTGBAA have relatively high conservation ratios, which may indicate a more biologically significant relationship (Tables 11 and 12). We note that motifs found by DWE have relatively higher conservation ratios than motifs found by MDscan.


Table 7 Motifs that are strongly overrepresented (by sequence count) in promoters of liver-expressed genes (LSPS) against promoters of liver-expression independent genes (EPD)

 

Table 8 Motifs that are strongly overrepresented (by sequence count) in promoters of liver-expressed genes (LSPS) against promoters of genes that are not known to be expressed in liver

 

Table 9 Top pairs (by sequence count) of the motifs from Table 7

 

Table 10 Top pairs (by sequence count) of the motifs from Table 8

 

Table 11 Top pairs (by occurrence count) of the motifs from Table 1

 

Table 12 Top pairs (by occurrence count) of the motifs from Table 2

 
We also examined motifs that had a large number of occurrences in LSPS but were not overrepresented against EPD vertebrate promoters. We found that many of these motifs have high conservation ratios. These motifs are reported in the Supplementary information.

Our consensus construction method can be used to filter out false-positive detections, but in its current state it is error-prone. Consensus construction through ortholog alignment requires promoter alignment tools and consensus construction tools that are not yet perfected. Our method is very conservative when aligning ortholog promoters from distant species, and has little impact on false-positive filtration when aligning ortholog promoters from close species. Moreover, by using CLUSTALW we impose a colinearity constraint and do not consider inversions or TFBS birth and death events.

We used DWE to discover liver-specific cis-regulatory elements. Of course, DWE can be used to discover motifs in promoters of any co-regulated genes. To improve its performance in detecting more degenerate motifs, DWE should be modified to use PWM scores instead of occurrence counts.


Table 5 Top three motifs of lengths 6,8 and 10 found by MDscan to be over-represented in LSPS against EPD78 vertebrate promoters that are not known to be strongly expressed in liver

 


    Acknowledgments
 
We thank Zhenyu Xuan, Debopriya Das and Saurabh Sinha for useful discussions. This work is supported by NIH grant GM060513 and NSF grants DBI-0306152 and EIA-0324292.

Received on September 11, 2003; revised on July 26, 2004; accepted on August 7, 2004

    REFERENCES

    Agresti, A. (1992) A survey of exact inference for contingency tables. Stat. Sci., 7, 131–177.

    Antes, T.J. and Levy-Wilson, B. (2001) HNF-3 beta, C/EBP beta, and HNF-4 act in synergy to enhance transcription of the human apolipoprotein B gene in intestinal cells. DNA Cell Biol., 20, 67–74.

    Bailey, T.L. and Elkan, C. (1995) Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Machine Learning, 21, 51–80.

    Baumhueter, S., Courtois, G., Crabtree, G.R. (1988) A variant nuclear protein in dedifferentiated hepatoma cells binds to the same functional sequences in the beta fibrinogen gene promoter as HNF-1. EMBO J., 7, 2485–2493.

    Beer, M.A. and Tavazoie, S. (2004) Predicting gene expression from sequence. Cell, 117, 185–198.

    Blanchette, M. and Sinha, S. (2001) Separating real motifs from their artifacts. Proceedings of the Annual International Symposium on Intelligent Systems for Molecular Biology, Copenhagen, Denmark, pp. 30–38.

    Bussemaker, H.J., Li, H., Siggia, E.D. (2001) Regulatory element detection using correlation with expression. Nat. Genet., 27, 167–171.

    Costa, R.H., Grayson, D.R., Darnell, J.E., Jr. (1989) Multiple hepatocyte-enriched nuclear factors function in the regulation of transthyretin and alpha 1-antitrypsin genes. Mol. Cell. Biol., 9, 1415–1425.

    Davuluri, R., Grosse, I., Zhang, M.Q. (2001) Computational identification of promoters and first exons in the human genome. Nat. Genet., 29, 412–417.

    Hatzis, P. and Talianidis, I. (2002) Dynamics of enhancer-promoter communication during differentiation-induced gene activation. Mol. Cell, 10, 1467–1477.

    Hertz, G., Hartzell, G., III, Stormo, G. (1990) Identification of consensus patterns in unaligned DNA sequences known to be functionally related. Comput. Appl. Biosci., 6, 81–92.

    Jaynes, E.T. (1957a) Information theory and statistical mechanics. Phys. Rev., 106, 620–630.

    Jaynes, E.T. (1957b) Information theory and statistical mechanics. II. Phys. Rev., 108, 171–190.

    Knuppel, R., Dietze, P., Lehnberg, W., Frech, K., Wingender, E. (1994) TRANSFAC retrieval program: a network model database of eukaryotic transcription regulating sequences and proteins. J. Comput. Biol., 1, 191–198.

    Krivan, W. and Wasserman, W.W. (2001) A predictive model for regulatory sequences directing liver-specific transcription. Genome Res., 11, 1559–1566.

    Lawrence, C., Altschul, S., Boguski, M., Liu, J., Neuwald, A., Wootton, J. (1993) Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262, 208–214.

    Liu, J.S., Lawrence, C.E., Neuwald, A. (1995) Bayesian models for multiple local sequence alignment and its Gibbs sampling strategies. J. Am. Stat. Assoc., 90, 1156–1170.

    Liu, X.S., Brutlag, D.L., Liu, J.S. (2002) An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nat. Biotechnol., 20, 835–839.

    Miura, N. and Tanaka, K. (1993) Analysis of the rat hepatocyte nuclear factor (HNF) 1 gene promoter: synergistic activation by HNF4 and HNF1 proteins. Nucleic Acids Res., 21, 3731–3736.

    Perier, R.C., Junier, T., Bucher, P. (1998) The eukaryotic promoter database EPD. Nucleic Acids Res., 26, 353–357.

    Scherf, M., Klingenhoff, A., Werner, T. (2000) Highly specific localization of promoter regions in large genomic sequences by PromoterInspector: a novel context analysis approach. J. Mol. Biol., 297, 599–606.

    Schones, D., Sumazin, P., Zhang, M.Q. (2004) Similarity of position frequency matrices for transcription factor binding sites. Bioinformatics, doi: 10.1093/bioinformatics/bth480.

    Sinha, S. (2003) Discriminative motifs. J. Comput. Biol., 10, 599–615.

    Sinha, S. and Tompa, M. (2000) A statistical method for finding transcription factor binding sites. Proceedings of the Annual International Symposium on Intelligent Systems for Molecular Biology, Vol. 8, Copenhagen, Denmark, pp. 344–344.

    Sinha, S. and Tompa, M. (2002) Discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Res., 30, 5549–5560.

    Student (1908) The probable error of a mean. Biometrika, 6, 1–25.

    Suzuki, Y., Yamashita, R., Nakai, K., Sugano, S. (2002) DBTSS: DataBase of human Transcriptional Start Sites and full-length cDNAs. Nucleic Acids Res., 30, 328–331.

    Thomas, H., Jaschkowitz, K., Bulman, M., Frayling, T.M., Mitchell, S.M., Roosen, S., Lingott-Frieg, A., Tack, C.J., Ellard, S., Ryffel, G.U., Hattersley, A.T. (2001) A distant upstream promoter of the HNF-4alpha gene connects the transcription factors involved in maturity-onset diabetes of the young. Hum. Mol. Genet., 10, 2089–2097.

    Thompson, J.D., Higgins, D.G., Gibson, T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680.

    Van Helden, J., Andre, B., Collado-Vides, J. (1998) Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J. Mol. Biol., 281, 827–842.

    Van Helden, J., Andre, B., Collado-Vides, J. (2000) Discovering regulatory elements in non-coding sequences by analysis of spaced dyads. Nucleic Acids Res., 28, 1808–1818.

    Waterman, M.S., Arratia, R., Galas, D.J. (1984) Pattern recognition in several sequences: consensus and alignment. Bull. Math. Biol., 46, 515–527.

    Xanthopoulos, K.G., Prezioso, V.R., Chen, W.S., Sladek, F.M., Cortese, R., Darnell, J.E., Jr (1991) The different tissue transcription patterns of genes for HNF-1, C/EBP, HNF-3, and HNF-4, protein factors that govern liver-specific transcription. Proc. Natl Acad. Sci. USA, 88, 3807–3811.

    Zhang, T. and Zhang, M.Q. (2000) Liver specific promoter database.

