Skip Navigation


Bioinformatics Advance Access originally published online on February 2, 2006
Bioinformatics 2006 22(9):1047-1054; doi:10.1093/bioinformatics/btl037
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
22/9/1047    most recent
btl037v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (1)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Hon, L. S
Right arrow Articles by Jain, A. N
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Hon, L. S
Right arrow Articles by Jain, A. N
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

A deterministic motif finding algorithm with application to the human genome

Lawrence S Hon 1 and Ajay N Jain 1,*

1 UCSF Cancer Research Institute and Comprehensive Cancer Center, University of California San Francisco, CA, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATERIALS AND METHODS
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 

Motivation: We present a novel algorithm, MaMF, for identifying transcription factor (TF) binding site motifs. The method is deterministic and depends on an indexing technique to optimize the search process. On common yeast datasets, MaMF performs competitively with other methods. We also present results on a challenging group of eight sets of human genes known to be responsive to a diverse group of TFs. In every case, MaMF finds the annotated motif among the top scoring putative motifs. We compared MaMF against other motif finders on a larger human group of 21 gene sets and found that MaMF performs better than other algorithms. We analyzed the remaining high scoring motifs and show that many correspond to other TFs that are known to co-occur with the annotated TF motifs. The significant and frequent presence of co-occurring transcription factor binding sites explains in part the difficulty of human motif finding. MaMF is a very fast algorithm, suitable for application to large numbers of interesting gene sets.

Availability: The software is available for academic research use free of charge by email request.

Contact: ajain{at}jainlab.org

Supplemental information: Data comprising the benchmarks used in the paper may be downloaded from http://www.jainlab.org/downloads.html.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATERIALS AND METHODS
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 
Transcriptional regulation is governed by protein transcription factors (TFs) that bind to specific sequences in the promoter region of genes. These DNA sequences are short (5–15 bp) and can be highly variable. While many subtly different copies of a DNA motif may appear by chance in the genome, transcription factors selectively bind to a specific subset of these sequences. Understanding the determinants of site-specific binding to genomic DNA is critical to elucidate the logic and mechanisms of transcriptional regulation.

Traditional experimental methods to derive binding sites for a given transcription factor are labor intensive; consequently many computational techniques have arisen to identify binding sites within promoter regions (Workman and Stormo, 2000; Liu et al., 2001; Hertz and Stormo, 1999; Guhathakurta and Stormo, 2001; Bussemaker et al., 2001; Bailey and Elkan, 1994; Bussemaker et al., 2000; Roth et al., 1998; Sinha and Tompa, 2003; Van Helden, 2003; Buhler and Tompa, 2002; Pavesi et al., 2001). The general form of the problem is, given a set of promoters corresponding to genes either postulated or known to be responsive to a particular TF, to find a sequence motif that represents the true biological target of the TF.

While many motif finding algorithms have been shown to work successfully in yeast and other lower organisms, most perform significantly worse in higher organisms (Tompa et al., 2005). To boost the sensitivity of motif finding, other algorithms integrate additional information into the motif finding process, such as comparative genomics (Mccue et al., 2001; Wang and Stormo, 2003; Liu et al., 2004), expression microarray data (Conlon et al., 2003; Keles et al., 2002; Zhu et al., 2002), chromatin immunoprecipitation data (Liu et al., 2002) and domain knowledge, such as knowledge about the binding site structure (Li et al., 2002) or knowledge about cis-regulatory modules (Zhou and Wong, 2004; Gupta and Liu, 2005). Despite increasing amounts of human genomic and expression data, most of these algorithms still focus on lower organisms, emphasizing the difficulty of the problem in higher organisms. In this paper, we focus on the problem of motif finding without using auxiliary information.

Motif finding in the human case is particularly difficult because the human genome is more complicated than lower organisms. While human binding sites remain small and variable, human intergenic regions are vastly larger, of the order of 10 000 bp long, with binding sites embedded in regions beyond the promoter, including introns and even 3' regions. By comparison, intergenic regions in yeast tend to be <1000 bp long. The larger intergenic regions in human allow for enhancers to exist more than a megabase away from the promoter, such as in the case of DACH (Nobrega et al., 2003). With the number of genes not substantially more in human compared with other species, aspects of human functional complexity are probably because of more sophisticated regulatory regions, in part by having more complicated transcription factor complex interactions and a greater amount of cis-regulatory DNA sequence (Levine and Tjian, 2003).

A recent survey by Tompa et al. (2005) comparing currently available motif finding tools has demonstrated the difficulty of addressing the human case and developing methodologies for motif prediction assessment in light of the complexities of mammalian transcriptional machinery. In this survey, the authors presented a methodology in which an algorithm predicts a single motif and is measured by its ability to identify the positions of the annotated binding sites. They found that all 13 algorithms surveyed were shown to perform significantly worse on human gene sets than on yeast gene sets. From this assessment, two of their key observations helped motivate this paper: (1) The results suggest that efforts in modeling binding sites in yeast have been more successful than in metazoans, which further suggest that there is opportunity in developing a motif finder that is targeted at higher organisms and (2) The assessment methodology should consider the top N predicted motifs to increase sensitivity and account for binding sites for multiple transcription factors that may work in concert with the annotated one. To that end, we present a motif finder targeted at higher organisms, and then assess this algorithm using methodologies that address motif assessment challenges in the Tompa et al's paper that are inherent in motif finding in higher organisms. Our approach is to couple a scoring function of empirical design to a fast, deterministic search strategy in order to yield a ranked list of probable solutions.

We present MaMF (Mammalian Motif Finder), which addresses the motif finding problem so as to yield practically useful results in the metazoan case of motif discovery. The search process is accelerated through the use of an index of the input sequences, which allows MaMF to generate large numbers of aligned motifs quickly, maximizing its search depth without significantly increasing its running time. The fast search procedure is coupled to a very simple scoring function that combines a preference for conservation among input sequences with a preference for under-represented sequences relative to the genome.

We assessed MaMF by answering two questions. The first is a comparative one about algorithm performance relative with other widely used methods. After showing comparable performance in lower organisms, we showed that MaMF performs better than other motif finders in humans, using a systematic algorithm assessment method that uses a motif similarity metric to assess the top N motifs predicted by a motif finder [30 in the data presented, in accordance with the suggestion of Tompa et al. (2005)].

The second question seeks to determine the biological significance of the high scoring but unannotated motifs. While these motifs are nominal false positives, the transcriptional complexity of mammalian systems suggests that they may in fact be motifs that bind other TFs that work in concert with the annotated TF. To test this, we introduce an assessment technique that employs expression microarray data to determine the degree to which a putative motif is found to be enriched among coexpressed genes relative to non-coexpressed genes (the enrichment ratio), which can then be used to develop biological support for high scoring but putative false positive motifs. Using this technique on several human datasets, we showed that microarray expression data can be used to support and rank hypothesized motifs, and that many putative false positive motifs predicted by MaMF are probably bonafide TF motifs that co-occur with the annotated motifs. The result is interesting in that it underscores the ubiquity of multiple-TF regulation of gene expression in human biology, and it also offers a way to make use of easily obtained high-throughput biological data to help triage the results of motif-finding exercises.


    2 MATERIALS AND METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATERIALS AND METHODS
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 
2.1 MaMF algorithm summary
Given a set of N promoters and an input motif width w, MaMF seeks to maximize the value of a scoring function that prefers motifs that are conserved across the different promoters and are under-represented in the target genome. MaMF yields a ranked list of motifs, where each motif contains exactly N sequences of length w (zero or more sites for each input promoter). We will first describe the search algorithm, then the scoring function (the Supplementary Material contains additional details as well as an extensive diagram).

MaMF's search algorithm is deterministic, and it depends on a simple yet effective indexing strategy to optimize performance. Indexing techniques to speed searches have been used widely and are nicely described in the original report of the BLAST algorithm (Altschul et al., 1990). In the case of MaMF, we can create an index of all n-mers (defined as a short sequence of length n, typically 4–6 bp long) found per input sequence, which makes identifying locations within a sequence that have a given n-mer a constant time operation. Given indices of two sequences and an n-mer, we can identify all alignments between the two sequences that share that particular n-mer in constant time. Therefore, we can efficiently generate a lookup table for all sequence pair alignments of width w that share an n-mer and meet an identity cut-off t. Larger n-mer sizes increase speed at the expense of search depth.

Enumerating all sequence pairs from the lookup table and scoring them using the scoring function described below, we keep the top 1000 high scoring sequence pairs to be used as seeds in the motif generation step. The motif generation step employs a greedy search strategy that builds motifs from the high scoring sequence alignment seeds, iteratively adding sequences to the growing motif that maximize the motif score. At each iteration, we obtain potential sequences that align with any of the sequences already part of the motif via the lookup table, caching those sequences that do not maximize the motif score for reconsideration in subsequent iterations. The motif is complete when it reaches a size threshold equal to N, the number of input sequences. The 1000 motifs generated are resorted according to their motif score. To make the results easily viewable, the 1000 motifs are filtered to remove highly similar motifs at the 75% similarity level (see Motif Similarity below), and the top 30 remaining motifs are presented.

2.2 Scoring function
We define our scoring function to be the following:

Formula 1(1)
where w is the motif width, R is the number of sequences in motif S, S1, ..., SR are the individual sequences, m(Sji, Ski) is a matching function that returns 1 if sequences Sj and Sk match at position i and 0 otherwise, and p0(Si) returns the probability of seeing sequence Si using a background model. The first term measures motif conservation and is equivalent to the ungapped sum-of-pairs function (Altschul, 1989; Gupta et al., 1995). The second term measures the uniqueness of the motif relative to the relevant genome, obtained by calculating the average background probability of seeing the various sequences in the motif. The product of the sequence similarity term and the motif frequency term form the scoring function. Motifs that have sequences that are similar to each other and unique relative to the background maximize the score.

2.3 Background model
For the yeast and Escherischia coli gene sets, the scoring function used a third order Markov chain background model [similar to Liu et al. (2001)] derived from the relevant organism to assess sequence uniqueness. In human we used a frequency-based background distribution that counts actual sequence occurrences for a width w. To account for infrequent sequences that may skew probabilities, for each sequence we defined an exhaustive set of all related sequences that differ from the original sequence by 1 bp. Because the number of occurrences of these related sequences should be much greater than that of the lone sequence, we created a composite background frequency score for a sequence equal to the total number of occurrences of related sequences divided by the total number of sequences of width w in the dataset. In addition, in the comparison of different background models, we also used an equal weighting model where each nucleotide is equally probable to occur (equivalent to having no model at all), a GC-weighted model where Gs and Cs are less probable to occur (38%), and four Markov chain background models of the first, third, fifth and seventh orders. The human backgrounds were based on the promoters of all Refseq genes.

2.4 Data
We created two independent human datasets, one called the TRANSFAC dataset and the other the Tompa dataset. The TRANSFAC dataset is a group of eight well-annotated gene sets obtained from TRANSFAC (Wingender et al., 2001) that was used to illustrate parameter sensitivity of MaMF and also for preliminary algorithm comparison. We exhaustively searched TRANSFAC using strict criteria for human gene sets with sufficient size and verified binding sites, resulting in eight gene sets, containing promoters of length 1200 obtained from DBTSS (Ota et al., 2004) and repeat masked (Smit et al., 1996–2004, http://www.repeatmasker.org). The Tompa dataset is a larger benchmark of 21 gene sets (that do not overlap with the TRANSFAC dataset), obtained from the Tompa et al.'s survey (2005) and used for the purposes of rigorously comparing algorithm performance. Additional details of the data curation are found in the Supplementary Material.

2.5 Motif similarity
We define a motif similarity function that compares two motifs A and B of lengths n and m, respectively, finding the best possible alignment between the two. A motif is an ungapped alignment of a set of sequences, which can be represented as a position weight matrix (PWM). For the purposes of finding the best alignment, we require the second motif B to contain sequences for which surrounding sequence is known, such that we can create an extended motif B' containing this surrounding sequence, of width m' = 2m. We calculate the raw similarity score by taking the motif A and aligning it against the extended motif B', finding the best alignment that maximizes the score, defined to be the following:

Formula 2(2)
where motif A and extended motif B' are the PWMs of the two motifs, i and j are the positions within the respective weight matrices, k is the nucleotide, n and m' are the widths of the weight matrices (with the requirement that m' ≥ n), x is the offset used to align motif A to motif B, and the notation Aik denotes the number of occurrences of nucleotide k at position i. The similarity score is the raw score scaled between 0 and 1 by dividing by the maximum possible raw score, obtained by calculating the score of the consensus sequence of motif A multiplied by the number of sequences in motif B.

2.6 Algorithm comparison
We used defaults for AlignACE, Bioprospector, Weeder and Consensus unless otherwise specified. To make the comparison as similar as possible, we specified a motif width of 11 (if there was the option of choosing a motif width), used the most complex background model available and used the same repeat masked human sequence as the input. Weeder was parameterized to use the launcher method in the large setting, which tries finding motifs of width 6–12 containing 1–4 mismatches, removes duplicates and returns the best scoring motifs. We set it to use the provided human background, to search both strands of the input sequences, and to return 30 motifs. Bioprospector was set to use a width of 11, a third order Markov model based on the same Refseq promoters MaMF used and to report the top 30 motifs; it finds zero or more sites per sequence and the number of sites in a motif is determined internally by the algorithm. Consensus was parameterized to operate in a way similar to MaMF: a width of 11, a background model containing 40% GC, n sites per motif for n genes (depending on the gene set), zero or more sites per sequence and the top 30 motifs to be returned. AlignACE used default parameters, and it automatically chooses varying motif lengths and sizes to maximize its motif score. The scoring function in our copy of Bioprospector uses an updated version of the scoring function found in Liu et al. (2002) (X. Liu, personal communication).

2.7 Enrichment Ratio
We developed a method to employ expression microarray data in order to provide biological validation for computationally discovered motifs (see Results). The method defines an enrichment ratio, which, informally, is the degree to which a motif discovered from a target gene set preferentially occurs in the promoters of genes that are co-expressed with the target gene set relative to the presence of the motif in genes that are not co-expressed with the target gene set. For the calculation of the enrichment ratio, we measured coexpression based on a collection of human expression microarray data reported in Stuart et al. (2003). This combined dataset contains >1200 samples, with strong overlap with genes in DBTSS. The coexpression of a gene against the annotated gene set is calculated if two conditions are met: (1) pairwise Pearson's correlations can be computed between that gene and at least six genes in the annotated gene set and (2) for each of these six gene pairs, expression values are available for both genes in at least 10 samples. Coexpressed genes have a high average Pearson's correlation when compared with the annotated gene set, while non-coexpressed genes have a near zero average Pearson's correlation. For the MaMF output of a given gene set, we computed enrichment ratios for all motifs using the top 20 coexpressed genes and 100 non-coexpressed genes (those with correlations closest to zero).

Formally, the enrichment ratio is defined as follows. Call the highest-scoring alignment of a motif against a gene promoter the alignment score. We compute the enrichment ratio to be the mean alignment score of coexpressed genes divided by the mean alignment score of non-coexpressed genes. Additional details are presented in the Supplemental Material. Motifs with an enrichment ratio greater than one are considered interesting. Because we are considering large promoters, the computed enrichment ratios, even biologically meaningful ones, will not have values significantly >1.0. This is because any particular motif is likely to yield a high-scoring match among a large set of non-coexpressed genes by chance, which increases the denominator of the ratio.

We computed the null distribution of enrichment factors by generating 10 000 random PWMs of width 11 and computing their enrichment ratios against 10 000 random orderings of the expression data above (to yield different numerator and denominator gene sets). The probability of observing an enrichment ratio greater or equal to 1.035 was 0.05, 1.05 was 0.01, and 1.07 was 0.001. Note, however, that the composition of a motif also affects the distribution of ER given permuted data, so this distribution serves as a guideline for the scale of the observation of ER as opposed to a formal P-value. Rather than focusing on absolute enrichment ratios, we have focused on the probability of observing skew in populations of enrichment ratios. Since the populations are large, population skew is a reliable measure and yields P-values by exact binomial computation.

2.8 Known TRANSFAC TF motifs
To find known TF motifs that might co-occur with our annotated motifs, we generated a large list of TF motifs curated from TRANSFAC. These motifs were required to have at least two annotated binding sites obtained from either human, mouse or rat. This yielded 146 well-described TF motifs that belonged to higher organisms.


    3 RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATERIALS AND METHODS
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 
We tested MaMF on data from lower organisms as a necessary condition of performance prior to proceeding to the more complicated human case. For both sets of data, we compared performance directly to Weeder (Pavesi et al., 2001), Bioprospector (Liu et al., 2001), Consensus (Hertz and Stormo, 1999) and AlignAce (Roth et al., 1998), all of which are well-established motif finding algorithms. Weeder was shown to perform the most competitively in multiple organisms in the Tompa et al.'s survey (Tompa et al., 2005).

3.1 Lower organisms: MaMF performance
We verified that the MaMF algorithm worked by testing it on common test cases in lower organisms, including RAP1, MCM1 and URS1 in yeast, and CRP in E.coli. The yeast examples contained about 10 genes each and 1000 bp upstream of each gene was considered. CRP contained 33 genes, with 200 bp upstream. MaMF output was defined to be correct if the highest scoring motif matched the reported consensus sequence. In all four cases, the top motif was correct. The motifs found, the consensus, and details of each motif are given in the Supplementary Material. Performance of the other approaches was comparable, with Bioprospector and Weeder finding the correct motif for all the four datasets, and Consensus and AlignAce both finding correct motifs in three out of the four datasets (both failing on RAP1), reflecting the fact that these commonly tested cases now form a low bar for evaluation of motif finding algorithms.

3.2 Human data: MaMF performance
Given these positive results, we approached the TRANSFAC dataset. Binding site widths for different transcription factors vary, but we standardized on width 11. The n-mer size was set at 4 and an identity cutoff at 8 (8/11 = 73% identity). Recall from above that the top scoring motifs matched the correct consensus in the yeast and E.coli sets, but of the eight human gene sets, only E2F yielded the correct consensus as the top ranked motif. Tompa et al. (2005) noted this issue in their paper, suggesting the use of the top N motifs in algorithm assessment.

Table 1 shows that MaMF finds the correct motif in all eight TFs, in most cases within the top 10. In several cases, the top-ranked correct motif only approximately matched the annotated consensus, but in general a slightly lower ranking motif could be found that more closely matched the annotated consensus. We examined the sensitivity of parameter settings (see Supplementary Materials for details), and found that parameters specific to MaMF yielded little performance variation across different settings. Parameters common to multiple algorithms (e.g. motif width) yielded somewhat more variation, but in line with their expected behavior.


View this table:
[in this window]
[in a new window]
 
Table 1 Selected motifs found by MaMF that are highly similar to the annotated motif for the eight TRANSFAC gene sets

 
3.2.1 Algorithm comparison: TRANSFAC dataset
We used Weeder, Bioprospector, Consensus and AlignACE to search for motifs and compare their predicted motifs with those from MaMF, using the same assessment methodology as above. Using the eight TRANSFAC gene sets, we chose parameters for the different algorithms that were similar to those used for MaMF but optimized their performance if an identical parameter could not be used (see Methods). Bioprospector performed best when an explicit background model was not provided (background estimated from the input sequences). Results are presented for Bioprospector both with and without an explicit background model. In these experiments, at the 0.75 motif similarity threshold, MaMF found six of the eight motifs; Bioprospector with and without background found three of the eight and six of the eight motifs, respectively; Weeder found five of eight motifs; Consensus found one of eight motifs; and AlignACE found no motifs. MaMF's performance on this set of eight TFs was within its plateau across the parameter sensitivity analysis. For this reason, and since the size of the set was small, we made a second test with the identical parameters on a larger set of human TFs (the Tompa dataset).

3.2.2 Algorithm comparison: Tompa dataset
The Tompa dataset comprised 21 human gene sets. Using the same metric for success and identical parameters for the algorithms as before, we ran the algorithms on this new dataset. At the 75% motif similarity threshold, MaMF found 12/21 motifs, Weeder 8/21 motifs, Bioprospector 7/21 (without background) and 4/21 (with background), Consensus 1/21 and AlignACE 0/21. Additional details of the motif rankings for the different algorithms are presented in the Supplementary Materials. Under various motif similarity thresholds, MaMF did consistently best, shown in Table 2. The rank order of the performance of algorithms was very similar to the results on the TRANSFAC dataset, with Weeder doing better than Bioprospector, but worse than MaMF. The single exception involved the use of Bioprospector without an explicit genomic background model. Note that we did not evaluate multiple parameter settings for any algorithm on the Tompa dataset.


View this table:
[in this window]
[in a new window]
 
Table 2 Comparison of motif finding performance between algorithms using the Tompa dataset

 
The success rates were dramatically higher than levels of correctness reported in the Tompa paper, emphasizing the importance of looking at the top N motifs where N is significantly >1. The results we report in Table 2 can be related to the approach presented in Tompa et al. (2005). At each level of motif similarity, we are defining a different threshold of correctness of a predicted motif. Tompa's approach is to consider correctness of predicted sites and defines them as correct when a predicted set of nucleotides overlap a true TF binding site by at least one-quarter in length. In our case, there are exactly 21 true positives to consider, and each algorithm yielded 30 motifs, most of which constituted nominal false positives. Tompa's results showed strong correlation among positive predictive value, sensitivity and the ‘average site performance’. In our analysis, sensitivity [TP/(TP + FN)] is monotonically related to the motif similarity threshold, ranging from a sensitivity of 0.71 down to 0.43 for MaMF. Issues involving quantification of specificity in real datasets are complicated by the probable presence of true functional motifs distinct from the annotated ones, as also noted by the Tompa group, which we discuss below. Nominal sensitivities [TP/(TP+FP)] for all methods in our analysis are low, by construction of the assessment, since we considered 30 top-scoring motifs for each promoter set, most of which are false positives by definition if not in reality.

We believe that MaMF's performance advantage is real for three reasons. First, in the case of the motifs for the lower organisms we observed excellent performance for all of the algorithms, reflecting that the algorithms were run properly. Second, the comparison approaches had similar or longer running times (all were similar, with Weeder being ~100x slower), so there was no bias toward MaMF with respect to computational burden. Third, the parameters used in the Tompa dataset evaluation were fixed without knowledge of that dataset. So, with respect to the first question posed in the introduction, we believe that MaMF offers performance advantages over other widely used methods on non-synthetic human data.

3.3 Biological significance of high scoring incorrect motifs
Given that only a fraction of the top motifs returned by MaMF were true positives, we considered the second question posed in the introduction: what the other motifs might be (referred to below as ‘incorrect’ motifs). There are several possibilities. First, the remaining motifs could really be incorrect and without biological significance, a reflection of a difficult and noise-ridden problem, where the signal from human binding motifs may be too weak to strongly distinguish true positives from false positives. Second, genome-wide effects such as repetitive element distribution or GC composition may encourage composition-based motifs to appear. Indeed we have previously shown a relationship between coregulated genes and the quantity of repetitive elements (Hon and Jain, 2003), but in the above experiments, we removed repetitive elements. As for GC composition, it has been previously shown that gene expression variation is related to isochore content (Bernardi, 1995; Pesole et al., 1999) which may also yield composition-based motifs. The third possibility is that many of the remaining motifs are biologically active, either as targets of cooperatively acting transcription factors, targets of basal transcriptional machinery, or structural elements involved in DNA packaging and access. This third possibility is the most interesting, and our data supported this interpretation in many cases.

Given a target gene set known to be responsive to some specific TF, MaMF yielded a list of motifs, which included the TF in question, but also included a large number of incorrect motifs. We hypothesized that if an incorrect motif were biologically active, the motif would more likely be found in promoters of genes coexpressed with the target gene set versus the promoters of genes not coexpressed with the target gene set. We defined the enrichment ratio of a motif to reflect the degree to which strong motif matches preferentially occurred within coexpressed genes versus non-coexpressed genes. Motifs with an enrichment ratio >1 are considered interesting, since the expectation value is exactly 1.0 in the case where no enrichment exists. The Methods section contains additional details about computation and interpretation of the enrichment ratios, particularly as regards magnitude.

Figure 1 shows a scatterplot of enrichment ratios versus similarity to the correct motif (using the motif similarity metric) for 1000 MaMF generated motifs for E2F. We considered the unfiltered MaMF output of 1000 motifs to take into account the frequency of motifs similar to an annotated motif. There was a separation into two clusters, one representing correct E2F motifs (dark diamonds), the other incorrect motifs (hollow diamonds). The E2F cluster has enrichment ratios nearly all >1.0, reflecting the expected enrichment of E2F motifs within genes that are observed to co-vary their expression with known E2F responsive genes. Analyzing all eight TRANSFAC gene sets, we found that correct motifs from every gene set had an average enrichment ratio greater than one; they were also greater than the median enrichment ratio of all motifs (see Table 3). More surprising was the large number of incorrect motifs that nevertheless had enrichment ratios >1.0. This skew towards ratios >1.0 was statistically significant in six of the eight cases (P << 0.0001 for those six cases), and suggests that that many of those motifs may have biological functions. We also made the enrichment ratio analysis on the 21 Tompa gene sets (with TFs of unknown identity). The results largely paralleled the analysis for the eight motifs presented in Table 3 (see Supplementary Materials for details), but owing to the limited amount of annotation in the Tompa dataset, we limited analysis of co-occurring motifs to the TRANSFAC datasets.


Figure 1
View larger version (32K):
[in this window]
[in a new window]
 
Fig. 1 Scatterplot of enrichment ratio to motif similarity for the E2F gene set.

 

View this table:
[in this window]
[in a new window]
 
Table 3 Enrichment ratio (ER) skew of MaMF motifs and elevated enrichment ratio of correct motifs on the eight TRANSFAC gene sets

 
3.4 Co-occurring transcription factor motifs
In light of the enrichment ratio analysis, we hypothesized that the incorrect motifs with high enrichment ratios could belong to transcription factors that co-occur with the annotated one in a biologically meaningful manner. To test this, we looked for sets of motifs predicted by MaMF that matched a motif in the TRANSFAC database (using motif similarity at the 0.75 threshold). For a given set of motifs that matched a TRANSFAC motif, we computed the average enrichment ratio of these motifs. Ordering these sets of motifs by number of matches, we kept the top 10 TFs that had an average enrichment ratio greater than the mean enrichment ratio of all motifs from the MaMF output, and had at least four matching motifs. For each case, this resulted in a list of well-studied motifs that were suggested to co-occur with the target TF based on the combination of MaMF output and expression microarray data.

We validated the predicted co-occurring TFs by looking for literature evidence that both the annotated TF and the predicted co-occurring TF were observed to bind to the same gene promoter. Of the predicted TFs across all eight gene sets, 70% indeed had been shown in the literature to co-occur with the annotated TF (details of literature support for TF co-occurrence are found in the Supplementary Materials). In addition, the top 10 predicted co-occurring TF motifs for a given gene set account for up to 40% of the ‘incorrect’ high scoring motifs generated by MaMF, explaining a large portion of the results. These results suggest that multiple TF regulation of human genes is very common and that databases of TF interactions are probably only scratching the surface.

Given that MaMF predicted a number of TFs to co-occur with the annotated ones, we present two detailed examples where sufficient literature data exist to show that these predictions held. The first uses the AHR/HIF dataset. Because MYC is predicted by MaMF to co-occur in AHR/HIF target genes, we checked to see if the 11 AHR/HIF target genes taken from TRANSFAC were also responsive to MYC. We used the MYC Target Gene Database (Zeller et al., 2003) and found that 7/11 were indeed annotated as MYC targets (Table 4). Since ~10% of all human genes are expected to be MYC targets (Fernandez et al., 2003), the likelihood that this is by chance is very low (P < 2.3x10–5 by exact binomial). So it appears that MYC and AHR/HIF indeed do commonly co-occur, which supports the idea that the nominal MYC motifs found by MaMF in the AHR/HIF genes are biologically relevant. Here we have taken a single predicted co-occurring TF (MYC) derived from the targets of AHR/HIF and established that many of the targets of AHR/HIF are also targets of MYC.


View this table:
[in this window]
[in a new window]
 
Table 4 AHR/HIF target genes those are also responsive to MYC

 
The second example uses the annotated targets of E2F and asks whether the 10 different predicted co-occurring TFs from MaMF bind to MYC's promoter, which is an annotated target of E2F. Results are shown in Table 5. In 8/10 cases, we were able to find literature evidence that there was direct binding of the predicted TF to the MYC promoter. While it is difficult to make a strong statistical statement about this result, it seems improbable that such a high proportion of TFs would specifically target MYC.


View this table:
[in this window]
[in a new window]
 
Table 5 Transcription factors predicted to bind onto the MYC promoter

 

    4 DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATERIALS AND METHODS
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 
Despite the evidence that many of the unannotated high scoring motifs belong to TFs that co-occur with the annotated TF, the relationships and mechanisms between the co-occurring motifs and the annotated TFs remain unclear. It is possible that these co-occurring motifs have no functional relationship with the annotated TF and simply belong to other transcriptional programs that use similar genes. However, given the importance of interactions between transcription factors to modulate transcription particularly in humans, we find it more likely that many of these co-occurring TFs interact with the annotated TF in some manner. Some of the difficulty of human motif finding can be attributed to these co-occurring motifs that comprise a large portion of MaMF output. In lower organisms where the transcriptional programs are simpler, the likelihood of finding co-occurring motifs that score higher than the annotated motif is smaller. The successful human motif finder, therefore, needs to be able to search deeper and faster to enumerate both the annotated TF motif as well as co-occurring motifs.

MaMF's performance is largely dependent on two features: the use of a scoring function that incorporates a genomic background model and the use of indexing to accelerate sequence comparisons. With respect to the background model, higher order background models have been shown to improve motif detection (Thijs et al., 2001), preventing spurious common sequences such as short simple repeats (that remain despite repeat masking) from being the dominant signal. In the parameter analysis we validated this by showing that as the complexity of the background model increased, MaMF was increasingly able to find the correct motifs (see Supplementary Materials for details). Furthermore, in the algorithm comparisons, we observed that algorithm performance corresponded closely with the choice of the background model used. The worst performing algorithm, AlignACE, uses no background model, whereas MaMF and Weeder both use frequency-based backgrounds. Interestingly, Bioprospector does best when the background is estimated from the input sequence, suggesting that a local background may be more effective in general than a background estimated from the entire genome. We did not examine if such a strategy would also improve MaMF's performance.

The second major feature that MaMF employs is the indexing of sequences to speed searches. The benefit of indexing is most evident when comparing MaMF with Weeder. While both algorithms use frequency-based backgrounds, and Weeder is competitive in performance with MaMF, MaMF runs about 100 times faster than Weeder. A large portion of this speed differential results from the exponential time penalty that Weeder incurs when searching for longer motifs, with running times of >1 h for motif widths of 10 and above. MaMF is able to search for longer motifs with no such performance penalty, which makes this algorithm amenable to high-throughput experiments as well as longer and more complicated binding motifs.

MaMF builds upon previous algorithms by its use of indexing to optimize performance. The generation of the lookup table is similar to the enumerative approaches used by van Helden (2003) and Sinha and Tompa (2003). Whereas they enumerate all possible motifs, we enumerate all possible n-mers to generate sequence pairs that share an exact n-mer. Since the n-mer size is significantly shorter than the motif width, our approach minimizes the polynomial time penalty that enumeration incurs. The motif generation step uses a greedy approach similar to Consensus (Hertz and Stormo, 1999), except that our approach uses the indexed lookup table both to generate sequence pair seeds and to build motifs. In Consensus, each motif is generated from a seed containing a single sequence and additional sequences to be added to the motif are exhaustively searched from the input sequences. The indexed lookup table in MaMF minimizes the redundancy of generating motifs from similar or identical seeds.

We chose an assessment strategy that differed from that used in Tompa et al. (2005) in order to better measure performance in the human case. The strategy differed in two ways. Instead of looking at the top motif, we looked at the top 30 motifs predicted by a motif finder. It is important to increase sensitivity in this way because, as we have shown, many of the highest scoring but unannotated motifs are also likely to be biologically relevant. If multiple TFs are involved in a given gene set, the annotated TF motif might not be expected to score the highest.

The second way our strategy differed from Tompa et al was that whereas Tompa measured a motif finder's ability to identify the annotated binding sites (i.e. at the nucleotide and binding site levels), we measured the degree of similarity of the predicted motif to the annotated motif using a quantitative similarity metric (i.e. at the motif level). While measuring nucleotide and site level accuracy is an important metric in identifying biologically active binding sites, we feel that it is more important to analyze motif level accuracy of the motifs predicted by a motif finder. Given confidence of a particular motif, individual binding sites can be predicted from this motif. Furthermore, measuring motif similarity is a natural extension of the traditional metric of comparing a predicted motif to the annotated consensus sequence, only that we are comparing the predicted motif with the annotated consensus motif, a richer representational form of the consensus sequence. As a result, our assessment methodology may be more straightforward to understand.

The enrichment ratio concept offers some level of biological validation of motifs. We believe that there may be ways to use the enrichment ratio in motif finding. For instance, one could potentially use the enrichment ratio in a post-processing step to score motifs generated by MaMF and thus yield a biologically motivated ranking, which might ameliorate the need for a statistical model of motif score likelihood. This is attractive because with the proliferation of publicly available expression microarray data (Ball et al., 2005) it is easy to calculate enrichment ratios for motifs from any gene set. More ambitiously, a motif finding algorithm could be designed around the enrichment ratio, e.g. by finding motifs that maximize the enrichment ratio. An advantage of this approach is that the genomic background distribution is not required since such information is embedded within the enrichment ratio.

In conclusion, we have presented a novel motif finding method that uses indexing to optimize the search process. Using a quantitative motif similarity metric, the method is shown to work better than other methods in predicting human motifs. Using expression microarray data to build support for putative motifs, we also show that the difficulty of human gene sets is partially because of the frequent co-occurrence of other transcription factor binding sites.


    Acknowledgments
 
The authors would like to acknowledge NIH/NCI for partial funding of the work (GM070481 and CA64602) and Mark Segal for insightful discussions and collaboration.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Nikolaus Rajewsky

Received on October 14, 2005; revised on December 1, 2005; accepted on February 1, 2006

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATERIALS AND METHODS
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 

    Altschul, S.F. (1989) Gap costs for multiple sequence alignment. J. Theor. Biol, . 138, 297–309[CrossRef][Web of Science][Medline].

    Altschul, S.F., et al. (1990) Basic local alignment search tool. J. Mol. Biol, . 215, 403–410[CrossRef][Web of Science][Medline].

    Bailey, T.L. and Elkan, C. (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc. Int. Conf. Intell. Syst. Mol. Biol, . 2, 28–36[Medline].

    Ball, C.A., et al. (2005) The Stanford Microarray Database accommodates additional microarray platforms and data formats. Nucleic Acids Res, . 33, D580–D582[Abstract/Free Full Text].

    Bernardi, G. (1995) The human genome: organization and evolutionary history. Annu. Rev. Genet, . 29, 445–476[CrossRef][Web of Science][Medline].

    Buhler, J. and Tompa, M. (2002) Finding motifs using random projections. J. Comput. Biol, . 9, 225–242[CrossRef][Web of Science][Medline].

    Bussemaker, H.J., et al. (2000) Building a dictionary for genomes: identification of presumptive regulatory sites by statistical analysis. Proc. Natl Acad. Sci. USA, 97, 10096–10100[Abstract/Free Full Text].

    Bussemaker, H.J., et al. (2001) Regulatory element detection using correlation with expression. Nat. Genet, . 27, 167–171[CrossRef][Web of Science][Medline].

    Conlon, E.M., et al. (2003) Integrating regulatory motif discovery and genome-wide expression analysis. Proc. Natl Acad. Sci. USA, 100, 3339–3344[Abstract/Free Full Text].

    Fernandez, P.C., et al. (2003) Genomic targets of the human c-Myc protein. Genes Dev, . 17, 1115–1129[Abstract/Free Full Text].

    Guhathakurta, D. and Stormo, G.D. (2001) Identifying target sites for cooperatively binding factors. Bioinformatics, 17, 608–621[Abstract/Free Full Text].

    Gupta, M. and Liu, J.S. (2005) De novo cis-regulatory module elicitation for eukaryotic genomes. Proc. Natl Acad. Sci. USA, 102, 7079–7084[Abstract/Free Full Text].

    Gupta, S.K., et al. (1995) Improving the practical space and time efficiency of the shortest-paths approach to sum-of-pairs multiple sequence alignment. J. Comput. Biol, . 2, 459–472[Medline].

    Hertz, G.Z. and Stormo, G.D. (1999) Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics, 15, 563–577[Abstract/Free Full Text].

    Hon, L.S. and Jain, A.N. (2003) Compositional structure of repetitive elements is quantitatively related to co-expression of gene pairs. J Mol Biol, 332, 305–310[CrossRef][Web of Science][Medline].

    Keles, S., et al. (2002) Identification of regulatory elements using a feature selection method. Bioinformatics, 18, 1167–1175[Abstract/Free Full Text].

    Levine, M. and Tjian, R. (2003) Transcription regulation and animal diversity. Nature, 424, 147–151[CrossRef][Medline].

    Li, H., et al. (2002) Identification of the binding sites of regulatory proteins in bacterial genomes. Proc. Natl Acad. Sci. USA, 99, 11772–11777[Abstract/Free Full Text].

    Liu, X., et al. (2001) BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac. Symp. Biocomput, . 127–138.

    Liu, X.S., et al. (2002) An algorithm for finding protein–DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nat. Biotechnol, . 20, 835–839[Web of Science][Medline].

    Liu, Y., et al. (2004) Eukaryotic regulatory element conservation analysis and identification using comparative genomics. Genome Res, . 14, 451–458[Abstract/Free Full Text].

    Mccue, L., et al. (2001) Phylogenetic footprinting of transcription factor binding sites in proteobacterial genomes. Nucleic Acids Res, . 29, 774–782[Abstract/Free Full Text].

    Nobrega, M.A., et al. (2003) Scanning human gene deserts for long-range enhancers. Science, 302, 413[Free Full Text].

    Ota, T., et al. (2004) Complete sequencing and characterization of 21,243 full-length human cDNAs. Nat. Genet, . 36, 40–45[CrossRef][Web of Science][Medline].

    Pavesi, G., et al. (2001) An algorithm for finding signals of unknown length in DNA sequences. Bioinformatics, 17, Suppl. 1, S207–S214[Abstract].

    Pesole, G., et al. (1999) Isochore specificity of AUG initiator context of human genes. FEBS Lett, . 464, 60–62[CrossRef][Web of Science][Medline].

    Roth, F.P., et al. (1998) Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat. Biotechnol, . 16, 939–945[CrossRef][Web of Science][Medline].

    Sinha, S. and Tompa, M. (2003) YMF: A program for discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Res, . 31, 3586–3588[Abstract/Free Full Text].

    Smit, A.F.A., Hubley, R., Green, P. (1996–2004) RepeatMasker Open-3.0.

    Stuart, J.M., et al. (2003) A gene-coexpression network for global discovery of conserved genetic modules. Science, 302, 249–255[Abstract/Free Full Text].

    Thijs, G., et al. (2001) A higher-order background model improves the detection of promoter regulatory elements by Gibbs sampling. Bioinformatics, 17, 1113–1122[Abstract/Free Full Text].

    Tompa, M., et al. (2005) Assessing computational tools for the discovery of transcription factor binding sites. Nat. Biotechnol, . 23, 137–144[CrossRef][Web of Science][Medline].

    van Helden, J. (2003) Regulatory sequence analysis tools. Nucleic Acids Res, . 31, 3593–3596[Abstract/Free Full Text].

    Wang, T. and Stormo, G.D. (2003) Combining phylogenetic data with co-regulated genes to identify regulatory motifs. Bioinformatics, 19, 2369–2380[Abstract/Free Full Text].

    Wingender, E., et al. (2001) The TRANSFAC system on gene expression regulation. Nucleic Acids Res, . 29, 281–283[Abstract/Free Full Text].

    Workman, C.T. and Stormso, G.D. (2000) ANN-Spec: a method for discovering transcription factor binding sites with improved specificity. Pac. Symp. Biocomput, . 467–478.

    Zeller, K.I., et al. (2003) An integrated database of genes responsive to the Myc oncogenic transcription factor: identification of direct genomic targets. Genome Biol, . 4, R69[CrossRef][Medline].

    Zhou, Q. and Wong, W.H. (2004) CisModule: de novo discovery of cis-regulatory modules by hierarchical mixture modeling. Proc. Natl Acad. Sci. USA, 101, 12114–12119[Abstract/Free Full Text].

    Zhu, Z., et al. (2002) Computational identification of transcription factor binding sites via a transcription-factor-centric clustering (TFCC) algorithm. J. Mol. Biol, . 318, 71–81[CrossRef][Web of Science][Medline].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
C. Yanover, M. Singh, and E. Zaslavsky
M are better than one: an ensemble-based motif finder and its application to regulatory element prediction
Bioinformatics, April 1, 2009; 25(7): 868 - 874.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
22/9/1047    most recent
btl037v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (1)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Hon, L. S
Right arrow Articles by Jain, A. N
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Hon, L. S
Right arrow Articles by Jain, A. N
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?