Bioinformatics Advance Access originally published online on August 12, 2008
Bioinformatics 2008 24(20):2288-2295; doi:10.1093/bioinformatics/btn420
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
MotifVoter: a novel ensemble method for fine-grained integration of generic motif finders
1School of Computing, National University of Singapore, Singapore 119260, 2Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613, 3Department of Computer Science, The University of Hong Kong, Pokfulam Road, Hong Kong and 4Genome Institute of Singapore, 60 Biopolis Street, #02-01 Genome, Singapore 138672
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Locating transcription factor binding sites (motifs) is a key step in understanding gene regulation. Based on Tompa's benchmark study, the performance of current de novo motif finders is far from satisfactory (with sensitivity
0.222 and precision
0.307). The same study also shows that no motif finder performs consistently well over all datasets. Hence, it is not clear which finder one should use for a given dataset. To address this issue, a class of algorithms called ensemble methods have been proposed. Though the existing ensemble methods overall perform better than stand-alone motif finders, the improvement gained is not substantial. Our study reveals that these methods do not fully exploit the information obtained from the results of individual finders, resulting in minor improvement in sensitivity and poor precision.
Results: In this article, we identify several key observations on how to utilize the results from individual finders and design a novel ensemble method, MotifVoter, to predict the motifs and binding sites. Evaluations on 186 datasets show that MotifVoter can locate more than 95% of the binding sites found by its component motif finders. In terms of sensitivity and precision, MotifVoter outperforms stand-alone motif finders and ensemble methods significantly on Tompa's benchmark, Escherichia coli, and ChIP-Chip datasets. MotifVoter is available online via a web server with several biologist-friendly features.
Availability: http://www.comp.nus.edu.sg/~bioinfo/MotifVoter
Contact: ksung{at}comp.nus.edu.sg
supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
|
|
|---|
Understanding the regulatory mechanism of genes is one of the major challenges faced by biologists today. Central to this problem is the identification of recurring patterns (motifs) in regulatory sequences, which represent binding sites of a transcription factor. In general, one would like to identify transcription factors whose binding sites are statistically over-represented in the promoter of a set of co-regulated genes. Many motif finders have been proposed to undertake this problem using different approaches, such as profile-based methods (Nimwegen, 2007) and consensus-based methods (Eskin and Pevzner, 2002; Pavesi et al., 2001; Wijaya et al., 2007). However the existing tools are still not effective for discovering motifs (Das and Dai, 2007; Tompa et al., 2005). For example, as shown in Tompa et al.'s evaluation, even the best performing algorithm has sensitivity <13% and precision <35% (sensitivity is percentage of true nucleotides that are predicted and precision is the percentage of predicted nucleotides that are true).
To deal with this issue, some literatures (Harbison et al. 2004; Hu and Kihara, 2005; MacIsaac and Fraenkel, 2006) hinted that assembling the results of multiple motif finders can provide better results for motif finding. For example, Harbison et al. (2004) observed different motif finders have different strengths. They successfully identified more binding sites by combining results of six motif finders compared to using only single finder. In fact, the benchmark datasets from Tompa et al. (2005) also support this. By simply taking the union of all binding sites predicted by 10 selected motif finders, the sensitivity can be increased by more than double over each selected motif finder. However, the union of all predicted sites could contain a lot of noise. It is not trivial to distinguish the real binding sites from the noise.
Six ensemble methods have been developed. SCOPE (Chakravarty et al., 2007) and BEST (Dongsheng et al., 2005; Jensen and Liu, 2006) rerank all motifs predicted by individual finders based on a certain scoring function and report the top motif. WebMotifs (Gordon et al., 2005; Romer et al., 2007), MultiFinder (Huber and Bulyk, 2006) and RGSMiner (Huang et al., 2004) assume motifs given by the consensus of several motif finders are likely to be the real motifs. They cluster the motifs and report one motif from the best cluster. All above methods select one representative motif among the predicted motifs from the individual finders. If none of the motifs from the finders can capture the binding sites accurately, the performance of these methods will suffer. EMD (Hu et al., 2006) goes a step further and considers the binding sites predicted by the motifs. Motifs of the same rank from different component motif finders are grouped together. For each group, based on the binding sites predicted by all the motifs in the group, new motifs are derived. Though these ensemble methods indeed help to improve the performance of motif finding, the improvement is not significant. For example, in Tompa's benchmark and Escherichia coli datasets, the average sensitivity is only improved by 62% but the average precision is reduced by 15%. We believe that some important aspects have been overlooked by existing ensemble methods:
- Existing ensemble methods make use of the observation that true motifs are shared by multiple motif finders, i.e. the coincidence of the predictions from multiple finders. However, the (negative) information from the false positive (spurious) motifs is not exploited in existing approaches. These spurious motifs, which are usually only reported by a few or even one motif finder, provide additional information to guide the identification of good motifs.
- Although existing ensemble methods assume that true motifs will be shared by multiple motif finders, they do not explicitly use the fact that the motifs (binding sites) predicted by more motif finders have a higher chance to be the correct ones.
These observations are also supported by Hu and Kihara (2005) and MacIsaac and Fraenkel (2006). In this article, we show how to formalize these observations and derive two criteria, namely discriminative and consensus criteria, to combine results from multiple motif finders.
Extensive evaluation shows that our proposed ensemble method, MotifVoter, can extract almost all information from multiple motif finders. On 186 datasets for Tompa's benchmark (Tompa et al., 2005) and E.coli (Salgado et al., 2004), MotifVoter retains more than 95% of true binding sites predicted by at least one individual motif finder. This implies that MotifVoter improves the sensitivity by 120% and precision by 77% when compared with the best individual motif finder and it also shows an improvement of 90% in sensitivity and 135% in precision when compared to the existing ensemble methods. Applying MotifVoter on 140 yeast and mammalian ChIP-Chip datasets, we demonstrate that MotifVoter is scalable. Furthermore, in these ChIP-Chip datasets, we notice that the motifs reported by MotifVoter are more similar to true motifs when compared with those reported by existing stand-alone motif finders and ensemble methods.
| 2 APPROACH |
|---|
|
|
|---|
Figure 1 shows the overall framework of MotifVoter. Given a set of sequences, MotifVoter executes m different motif finders, each reporting n motifs. Note that each motif also defines a set of predicted binding sites. MotifVoter comprises two stages. (1) Motif filtering: in the first stage, MotifVoter processes the candidate motifs predicted by the m motif finders and attempts to remove the spurious motifs (Fig. 1a). The main idea is to find a cluster of motifs with high conformity based on a motif similarity measure (defined below). (2) Sites extraction: based on the candidate motifs retained in Stage 1, MotifVoter then identifies a set of sites with high confidence that they are real (Fig. 1b). At the end, for visualization purposes, we generate the PWM and the Weblogo of the motifs (Fig. 1c).
|
It may be noted that, unlike other ensemble methods that only work on the predicted motifs, MotifVoter and EMD work on both the predicted motifs and binding sites. MotifVoter differs from EMD in the discriminative and consensus criteria used for grouping (clustering) the motifs. First, the discriminative criterion requires the selected cluster to of motif share as many binding sites as possible, while requiring that motifs outside the cluster share none or few binding sites. Second, the consensus criterion requires the motifs in the selected cluster to be contributed by as many motif finders as possible. We propose a scoring function that captures the two criteria and derive an efficient algorithm to select a cluster that optimizes this scoring function. In addition to these two criteria, MotifVoter selects only those binding sites of high confidence to construct the new motif, rather than using all binding sites predicted by motifs in the same group (cluster).
Compared to existing ensemble schemes, MotifVoter has the following advantages:
- Existing ensemble methods infer the final motif from sites of either one motif or a set of motifs of the same ranking. MotifVoter, on the other hand, infers the final motif using sites of high confidence gathered from all predicted motifs of multiple motif finders. Hence, MotifVoter has potential to extract more true sites than existing ensemble methods.
- MotifVoter can potentially yield higher precision than existing ensemble methods by the following reasoning. The discriminative criterion helps to select a cluster of motif that are not only similar (i.e. share as many binding sites as possible), but also have the property that motifs outside the cluster are distant from one another (i.e. share none or few binding sites). At the same time, the consensus criterion ensures that the selected cluster of motifs is predicted by as many motif finders as possible. These two criteria help to remove almost all spurious motifs. MotifVoter further improves the precision by removing spurious sites in the selected cluster.
| 3 METHODS |
|---|
|
|
|---|
MotifVoter uses 10 basic motif finders. The first group consists of three motif finders based on (l,d) model, namely MITRA (Eskin and Pevzner, 2002), Weeder (Pavesi et al., 2001), and SPACE (Wijaya et al., 2007). The second group consists of seven motif finders based on PWM model, namely AlignACE (Hughes et al., 2000), ANN-Spec (Workman and Stormo, 2000), BioProspector (Liu et al., 2001), Improbizer (Ao et al., 2004), MDScan (Liu et al., 2002), MEME (Bailey and Elkan, 1995) and MotifSampler (Thijs et al., 2001). The details of these component motif finders and parameter descriptions can be found in Supplementary Material 10.
Section 3.1 formally describes the discriminative and consensus criteria used in the motif filtering step. Then, in Section 3.2 , we present the heuristic algorithm for finding a cluster of motifs that satisfy the two criteria. The details of the sites extraction step are given in Section 3.3. Finally, Section 3.4 describes the performance evaluation measures we used for benchmarking MotifVoter against other existing methods.
3.1 The discriminative and consensus criteria
Given two motifs x and y, we first describe how to measure their similarity based on the binding sites defined by these motifs. Let I(x) be the set of binding sites defined by motif x. Let I(x)
I(y) be the set of regions covered by at least one site in x and one site in y. Let I(x)
I(y) be the set of regions covered by any site of x or y. We denote the total number of nucleotides of all the regions in I(x)
I(y) and I(x)
I(y) , as |I(x)
I(y)| and |I(x)
I(y)|, respectively. The similarity of x and y, denoted sim(x, y) , is expressed as |I(x)
I(y)|/|I(x)
I(y)|. Note that 0
sim(x, y)
1 and sim(x, x)=1.
Now, we define the scoring function to capture the discriminative criterion. Given m motif finders and each motif finder reports its top-n candidate motifs, there will be a set P of mn candidate motifs. Among all the candidate motifs in P, some of them will approximate the real motif while the other will not. We would like to identify the subset X of P such that the candidate motifs in X are likely to approximate the real motif. The principle idea is that if the candidate motifs in X can model the real motif, motifs in X should be highly similar while motifs outside X should be distant from one another (discriminative criteria).
Let X be some subset of candidate motifs of P. The mean similarity among the candidate motifs in X, denoted as sim(X) , is defined as:
|
|
|
|
The function w(X) measures the degree of similarity among the candidate motifs in X. If many of the candidate motifs in X approximate the real motif, we should expect to have a high w(X). On the other hand, we expect the complement of X, that is P–X, should have a low w(P–X). Thus, w(P–X) constitutes the discriminative criterion in the clustering procedure. In other words, if X is the set of candidate motifs which approximates the real motif, we expect to have a high A(X) score, where:
|
|
Note that there may be multiple sets of X with the same A(X) score. Among those X's, we would select one that contains maximum number of motif finders. The latter constitutes the consensus criterion.
In summary, this stage aims to find X
P which maximizes A(X) while X contains the candidate motifs predicted by maximum number of motif finders. The naive method to identify X is to enumerate all possible X and check if they satisfy the above two criteria. However, this approach is computationally infeasible. In the next section, we describe our proposed heuristics to identify X to overcome this difficulty.
3.2 Heuristics for motif filtering
In this subsection, we describe the heuristic algorithm for identifying X, the set of similar candidate motifs in the first stage.
Let P be all motifs found by m motif finders, where each motif finder returns n motifs. Steps 1–3 compute the pairwise similarity scores for all pairs of motifs. Based on these similarity scores, we apply the following heuristics approach to find X (Steps 4–9).
Instead of enumerating all possible subsets in P which takes exponential time, we only consider subsets Xz,j={z, p1,...,pj} for every z
P and for every 1
j
|P|–1 where sim(z, pi)>sim(z,pi+1) and pi
P. The rationale behind examining only these subsets is as follows. For each z
P, let a and b be two other motifs such that sim(a, z)>sim(b, z). That is, a is more similar to z than b. Then, it is likely that A({z, a,b})>A({z, b}). By including pj in the subset, we also include all pi with i<j.
For each Xz,j, we compute the value of A(Xz,j) and the number of motif finders contributing to Xz,j, denoted as Q(Xz,j). Finally, we set X to Xz,j with maximum A(Xz,j) value. If there are more than one such subset, pick the one with the largest Q(Xz,j). In case of a tie again, we pick one randomly.
The time complexity of heuristics can be shown to be O(m2n2) as follows. There are |P|=mn motifs. For each motif z, we consider mn different Xz,j subsets. For each subset, we need to compute the value of A(Xz,j) for all j. Note that we can obtain the value of A(Xz,j) from the value of A(Xz,j–1) in constant time. So, for each motif z, to compute all values of A(Xz,j) , it takes O(mn) time. Therefore, the overall time complexity is O(m2n2). For the actual running time of the heuristics, please refer to Supplementary Material 5.
3.3 Sites extraction
From X, we obtain the list of sites using two requirements. First, we accept all regions which are covered by sites defined by at least two motifs x and y in X, where x and y are predicted by two different motif finders. The reason behind is that it is unlikely that several motif finders predict the same spurious binding sites.
Second, we accept all the sites of the motif x in X with the highest confidence in approximating real motif. The confidence score is defined as follows. Let B(x) be the total number of nucleotides covered by the sites defined by motif x. Let O(x) be the total number of nucleotides covered by the sites defined by motif x and all other motifs in X. The confident score of motif x is defined as O(x)/B(x).
For the selected sites that are covered by more than one motif finder, we further apply a post-processing procedure to refine each site by removing the nucleotides that are only covered by a single finder to increase the precision of our prediction as these nucleotides are likely to be noise.
Given all the sites predicted by MotifVoter, we generate a PWM motif to model the motif. To achieve this, we first align these sites using MUSCLE (Edgar, 2004), then a PWM is generated from this alignment to model the motif. Figure 1c provides an illustration of this process.
3.4 Performance evaluation measures
For performance evaluation, we use the same four measures proposed in (Tompa et al., 2005) namely, sensitivity (nSN), positive predictive value1 (nPPV), performance coefficient (nPC) and correlation coefficient (nCC). Index n is used to denote that the assessment is done at the nucleotide level instead of site level. The definitions of these performance measures can be found in Supplementary Material 4.
For ChIP-Chip datasets which do not have sites position information, we use sum of squared distance (SSD) to evaluate the performance of motif finders. It is one of the most widely used methods for comparing PWMs (Mahony et al., 2007). Let L be length of two aligned motifs X and Y, SSD is defined as:
|
|
where pxi(b) and pyi(b) are the probabilities of base b occurring at position i in motif X and Y, respectively. Note that 0
SSD(X, Y)
2 and SSD(X ,X)=2. If the motifs are of different length, we use the shorter motif to align different regions of the longer motif and obtain the maximum SSD value.
| 4 EXPERIMENTAL RESULTS |
|---|
|
|
|---|
There is no universal benchmark dataset for evaluating a motif finder. We thus performed an extensive evaluation on MotifVoter over 326 datasets from Tompa's benchmark, E.coli and ChIP-Chip datasets. In our experiments, we consider the top-30 (i.e. n=30) motifs reported by each individual motif finder. For the effect of different values of n, please refer to Supplementary Material 6.
4.1 Comparison using Tompa's benchmark dataset
This section compares MotifVoter with existing methods using Tompa's benchmark datasets. The datasets are constructed based on real transcription factor binding sites drawn from four different organisms (human, fruitfly, mouse and yeast). The background sequences are based on three categories: (1) real upstream sequences (real), (2) randomly chosen upstream sequences with binding sites inserted (generic) and (3) sequences randomly generated by a Markov chain with binding sites inserted (markov). It consists of 56 datasets. The number of sequences per dataset ranges from 1 to 35 and the sequence lengths are up to 3000 bp (Tompa et al., 2005). Such diverse characteristics of the benchmark make it a good candidate for evaluating the robustness of a motif finder.
The graph in Figure 2a shows the average performance of all stand-alone and ensemble motif finders based on four performance evaluation measures [see also Table 1 for the actual figures of sensitivity (nSN) and precision (nPPV)] using Tompa's benchmark dataset. Among the stand-alone motif finders, Weeder and SPACE perform somewhat similar and outperform the other 15 stand-alone motif finders by all four measures. However, integration of motif finders enables MotifVoter to further enhance the performance. The improvement gained over SPACE is 215% in sensitivity and 45% in precision. We believe that MotifVoter is able to piece together different subregions of sites identified by the stand-alone motif finders, thus yielding significant improvement in sensitivity. Furthermore, since MotifVoter imposes strict discriminative and consensus criteria on the clustering procedure, we get significant improvement in precision also.
|
|
For ensemble methods, MotifVoter outperforms the sensitivity of BEST and precision of SCOPE by 54.1% and 226.2%, respectively. We cannot directly evaluate RGSMiner and MultiFinder because of their limited applicability. For WebMotifs, it only takes probe names as input, so we only evaluate WebMotifs on ChIP-Chip dataset (see Section 4.3). In principle, these three ensemble methods select one motif out of all motifs reported by the component motif finders. Although we are not able to run these finders on the dataset, to predict the upper bound of the performance of these finders, we can select the most sensitive and precise motif for each dataset. We found that even if we do so, the sensitivity is at least 48.6% lower than MotifVoter and the precision is at least 41.2% lower.
For EMD, a direct evaluation is done in the next subsection on E.coli dataset. However, in this benchmark if we cluster the motifs of the same rank given by 10 motif finders and pick the most sensitive cluster, the highest possible sensitivity is 21.3% lower than MotifVoter. The main reason why MotifVoter yields better performance is because apart from the ability to capture true sites from various motif finders, it also includes true sites that may come from different ranks.
4.2 Comparison using E.coli dataset
Despite its comprehensive assortments of datasets, Tompa's benchmark does not fully represent real biological settings since it still contains synthetic sequences (markov) and mixture of real and synthetic (generic) sequences. Here we evaluate MotifVoter using real E.coli datasets. This species is not included in Tompa's benchmark. The datasets are taken from RegulonDB (Salgado et al., 2004) (http://regulondb.ccg.unam.mx/), and the sequences are generated from the intergenic regions of E.coli genome. In total, there are 62 datasets. The average number of sequences per dataset is 12 and the average sequence length is 300 bp.
The graph in Figure 2b shows the average performance of all stand-alone and ensemble motif finders based on four performance evaluation measures [see also Table 1 for the actual figures of sensitivity (nSN) and precision (nPPV)] using the E.coli dataset. MotifVoter consistently outperforms the stand-alone motif finders by about 168% in terms of sensitivity. For precision, although the improvement is not so much as for sensitivity, MotifVoter still yields a significant improvement over stand-alone motif finders exceeding the highest precision by 80.9%.
Three ensemble methods (SCOPE, BEST, EMD) evaluated on this dataset outperform the stand-alone algorithms in terms of sensitivity. However, even for the most sensitive ensemble method (EMD) among the three, the improvement is not significant. On the other hand, MotifVoter outperforms the EMD ensemble method by 130.2% in sensitivity. As for precision, MotifVoter also consistently performs better with a 45.9% improvement over the best ensemble method (SCOPE).
4.3 Application on ChIP-Chip datasets
In this section, we evaluate MotifVoter on yeast and mammalian ChIP-Chip datasets. These datasets provide an ideal platform for assessing the scalability and applicability of MotifVoter to the entire genome.
The yeast ChIP-Chip datasets and TF profiles are obtained from Harbison et al. (2004) (http://fraenkel.mit.edu/Harbison/). For mammalian datasets, we evaluate nine transcription factors: CREB (Zhang et al., 2005), E2F (Ren et al., 2002), HNF4/HNF6 (Odom et al., 2004), MYOD/MYOG (Cao et al., 2006), NFKB (Schrieber et al., 2006), NOTCH (Palomero et al., 2006), SOX (Boyer et al., 2005). Further details about the yeast and mammalian datasets sources can be found in Supplementary Material 1.
Out of 65 cases, MotifVoter can identify 56 motifs. As for the missing ones, we found that the individual finders do not perform well. Figure 3 shows the evaluation using SSD values of the stand-alone and ensemble motif finders based on the 56 successful cases. Note that SSD values can give an estimation of how close the prediction of one motif finder is to the real PWM in each dataset. Table 2 shows the average SSD performance of each motif finder.
|
|
On yeast ChIP-Chip datasets, among all stand-alone and ensemble motif finders, Improbizer has the highest average SSD value (1.639). MotifVoter outperforms Improbizer with an average SSD value of 1.919. Weblogos of the actual motifs predicted by MotifVoter on these datasets can be found in Supplementary Material 3.
Figure 4 shows the nine PWMs found by MotifVoter in agreement with canonical motifs in mammalian datasets. Evaluation on MotifVoter shows that it has an average SSD value of 1.824, while the best performing existing motif finder achieves an average SSD value of 1.530. SSD of MotifVoter on mammalian datasets is lower than that on yeast since the datasets of mammalian genome are much more complex than yeast.
|
| 5 DISCUSSION |
|---|
|
|
|---|
This section further investigates the effectiveness of MotifVoter in terms of the following aspects: (1) the optimality of MotifVoter; (2) the effects of the discriminative and consensus criteria used in MotifVoter and (3) the robustness of MotifVoter and the running time of MotifVoter.
5.1 Optimality of MotifVoter
In general, the performance of any ensemble method is bounded by its component motif finders. The total number of true binding sites that are covered by any of its component motif finders can be regarded as a rough upper bound on the performance of an ensemble method. We say an ensemble motif finder is
-optimal if it can find
fraction of the true sites covered by at least one of its component motif finders. Evaluation on Tompa's benchmark shows that the highest possible sensitivity that can be achieved all by the component motif finders is 0.440. The sensitivity of MotifVoter is 0.419 which is 95.2% of this upper bound. In E.coli dataset, the highest possible sensitivity is 0.643 and MotifVoter achieves 95.7%. This empirical study shows that MotifVoter is a near optimal ensemble method.
5.2 Analysis of the effects of MotifVoter's criteria
The discriminative criterion of MotifVoter makes use of the information provided by the false positive (spurious) motifs/sites to guide the clustering of true motifs/sites. This information is not fully utilized by existing ensemble methods. The consensus criterion in MotifVoter requires the cluster to be globally contributed by as many motif finders as possible. This can enhance the confidence that the cluster contains good motifs/sites. These two key ideas distinguish our ensemble method from the existing ones. This subsection experimentally shows that both criteria in MotifVoter are equally important in their own right.
Figure 5a shows the evaluation results on Tompa's benchmark datasets. Sensitivity of MotifVoter without both criteria (Case IV) is 80% lower than BEST and precision is 15% lower than Weeder. Similarly in Figure 5b for E.coli, when neither discriminative nor consensus criteria are implemented, the precision of MotifVoter is lower than that of SCOPE. In contrast, by including both criteria MotifVoter can improve the sensitivity by 163.2% and precision by 45.9%.
|
5.3 Robustness and time complexity MotifVoter
We also evaluate the performance of MotifVoter on various background sequences and various species in Tompa's benchmark. MotifVoter achieved the highest nSN and nPPV on all three backgrounds. The major sensitivity improvement is on real datasets (275%), followed by generic datasets (128%). As for evaluation on three species, MotifVoter made major sensitivity improvement on human dataset (314%) followed by fruitfly (263%), while the least improvement was made on yeast datasets (84%).
MotifVoter relies on its component motif finders. Thus, a natural question is whether the performance of MotifVoter will degrade if we include some motif finders with poor performance. The performance of MotifVoter does degrade as more random motif finders (representing motif finders with poor performance) are included. However, even if we include five random motif finders (that is half of the real motif finders we used), the performance of MotifVoter is still greater than that of the best individual motif finder by 183% in sensitivity and by 10.4% in precision. These results imply that MotifVoter is robust even if some of the component motif finders perform badly. We remark that it is possible that MotifVoter may predict wrong motifs if almost every individual finder reports similar wrong motifs. However, MotifVoter can reduce the chance of finding such wrong motifs significantly in comparison to individual finders because the chance of having many individual finders to report the same (similar) spurious motif is low since every finder has its own scheme of predicting the motifs.
Running all 10 component motif finders for MotifVoter is not always practical. We show that the sensitivity and precision of MotifVoter are still significantly better than the best motif finder when we run only the five fastest motif finders. The details of the above robustness evaluations are shown in Supplementary Material 5. In Supplementary Material 8, we evaluate and discuss the characteristics of binding sites missed by MotifVoter. We also show that MotifVoter can identify motifs in motif modules that contain multiple motifs work in groups, provided that these motifs have signals strong enough to be detected by the individual motif finders used in MotifVoter. Details can be found in Supplementary Material 9.
| 6 CONCLUSIONS |
|---|
|
|
|---|
We have presented MotifVoter, a simple yet effective ensemble method that utilizes both positive and negative information in the motifs returned by the basic motif finders and achieves near-optimal sensitivity. In extensive evaluations on many ChIP-Chip and benchmark datasets, we observed that MotifVoter outperforms all stand-alone and existing ensemble motif finders significantly. It is also robust with respect to the inclusion of random motif finders and number of component motif finders.
A key advantage of MotifVoter is its flexibility in incorporating new algorithms. That is, if a novel superior motif finding algorithm is made available, it can be readily incorporated into the MotifVoter platform. In the website version, we provide an option for users to provide results from the component motif finders that are not included in our main list. Furthermore, MotifVoter, as a web application, has several biologist friendly features such as a parameter-free interface, ability for users to choose their preferred component motif finders and option for speedier return of results using the fastest motif finders.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
We are grateful to the authors of the component motif finders used by MotifVoter for making their tools available for public use. We thank C. Dongsheng for providing command line version of BEST. Finally, the authors would like to thank the reviewers for all their useful and constructive comments.
Funding: National University of Singapore (grant R-252-000-326-112); Research Output Prize (Faculty of Engineering) of the University of HongKong to S.M.Y.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Alex Bateman
1Throughout the article, we use the term precision instead of positive predictive value. ![]()
Received on May 9, 2008; revised on August 3, 2008; accepted on August 7, 2008
| REFERENCES |
|---|
|
|
|---|
Ao W, et al. Environmentally induced foregut remodeling by PHA-4/FoxA and DAF-12/NHR. Science (2004) 305:1743–1746.
Bailey T, Elkan C. Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Mach. Learn. (1995) 21:51–80.
Boyer L, et al. Core transcriptional regulatory circuitry in human embryonic stem cells. Cell (2005) 122:947–956.[CrossRef][Web of Science][Medline]
Cao Y, et al. Global and gene-specific analyses show distinct roles for myod and myog at a common set of promoters. EMBO J. (2006) 25:502–511.[CrossRef][Web of Science][Medline]
Chakravarty A, et al. A parameter-free algorithm for improved de novo identification of transcription factor binding sites. BMC Bioinformatics (2007) 8:29.[CrossRef][Medline]
Das M, Dai HK. A survey of DNA motif finding algorithms. BMC Bioinformatics (2007) 8(Suppl.7).
Dongsheng C, et al. BEST: binding-site estimation suite tools. Bioinformatics (2005) 21:2909–2911.
Edgar R. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. (2004) 32:1792–1797.
Eskin E, Pevzner P. Finding composite regulatory patterns in DNA sequences. Bioinformatics (2002) 18(suppl. 1):S354–S363.[Abstract]
Gordon D, et al. TAMO: a flexible, object oriented framework for analyzing transcriptional regulation using DNA-sequences motifs. Bioinformatics (2005) 21:3164–3165.
Harbison C, et al. Transcription regulatory code of a eukaryotic genome. Nature (2004) 431:99–104.[CrossRef][Web of Science][Medline]
Hu J, et al. EMD: an ensemble algorithm for discovering regulatory motifs in DNA sequences. BMC Bioinformatics (2006) 7:342.[CrossRef][Medline]
Hu J, Kihara D. Limitations and potentials of current motif discovery algorithms. Nucleic Acids Res. (2005) 33:4899–4913.
Huang H.-D, et al. Identifying transcriptional regulatory sites in the human genome using an integrated system. Nucleic Acids Res. (2004) 32:1948–1956.
Huber BR, Bulyk M. Meta-analysis discovery of tissue specific DNA sequence motifs from mammalian gene expression data. BMC Bioinformatics (2006) 7:229–254.[CrossRef][Medline]
Hughes J, et al. Computational identification of cis-regulatory elements associated with functionally coherent groups of genes in saccharomyces cerevisiae. J. Mol. Biol. (2000) 296:1205–1214.[CrossRef][Web of Science][Medline]
Jensen S, Liu J. BioOptimizer: a Bayesian scoring function approach to motif discovery. Bioinformatics (2006) 20:1557–1564.[CrossRef]
Liu X, et al. BioProspector: discovering DNA motifs in upstream regulatory regions of co-expressed genes. In: Proceedings of the Seventh Pacific Symposium of Biocomputing (PSB) (2001) Singapore: World Scientific Press. 127–138.
Liu X, et al. An algorithm for finding protein-DNA binding sites with application to chromatin-immunoprecipitation microarray experiments. Nat. Biotechnol. (2002) 20:835–839.[Web of Science][Medline]
MacIsaac K, Fraenkel E. Practical strategies for discovering regulatory DNA sequence motifs. PloS Comput. Biol. (2006) 2:201–210.[Web of Science]
Mahony S, et al. DNA familial binding profiles made easy: comparison of various motif alignment and clustering strategies. PLoS Comput. Biol. (2007) 3:e61+1.
Nimwegen E. Finding regulatory elements and regulatory motifs a general probabilistic framework. BMC Bioinformatics (2007) 8(Suppl. 6):S4.
Odom D, et al. Control of pancreas and liver gene expression by HNF transcription factors. Science (2004) 303:1378–1381.
Palomero T, et al. NOTCH1 directly regulates c-MYC and activates a feed-forward-loop transcriptional network promoting leukemic cell growth. Proc. Natl Acad. Sci. USA (2006) 103:18261–18266.
Pavesi G, et al. An algorithm for finding signals of unknown length in DNA sequencess. Bioinformatics (2001) 17:S207–S214.[Abstract]
Ren B, et al. CE2F integrates cell cycle progression with DNA repair, replication and G2/M checkpoints. Genes Dev. (2002) 16:245–256.
Romer K, et al. WebMOTIFS: automated discovery, filtering, and scoring of DNA sequence motifs using multple programs and bayesian approaches. Nucleic Acids Res (2007) 35:W217–W220.
Salgado H, et al. Regulon DB (version 4.0): transcriptional regulation, operon organization and growth conditions in escherichia coli k-12. Nucleic Acids Res (2004) 32:D303.
Schrieber J, et al. Coordinated binding of NFKB family members in the response of human cells to lipopolysaccharide. Proc. Natl Acad. Sci. USA (2006) 103:5899–5904.
Thijs G, et al. A higher-order background model improves the detection of promoter regulatory elements by gibbs sampling. Bioinformatics (2001) 17:1113–1122.
Tompa M, et al. Assessing computational tools for the discovery of transcription factor binding sites. Nat. Biotechnol. (2005) 23:137–144.[CrossRef][Web of Science][Medline]
Wijaya E, et al. Detection of generic spaced motifs using submotif pattern mining. Bioinformatics (2007) 23:1476–1485.
Workman C, Stormo G. ANN-Spec: a method for discovering transcription factor binding sites with improved specificity. In: Proceedings of Pacific Symposium of Biocomputing (PSB) (2000) Singapore: World Scientific Press. 467–478.
Zhang X, et al. Genome-wide analysis of camp-response element binding protein occupancy, phosphorylation, and target gene activation in human tissues. Proc. Natl Acad. Sci. USA (2005) 102:4459–4464.
This article has been cited by other articles:
![]() |
S. A. F. T. van Hijum, M. H. Medema, and O. P. Kuipers Mechanisms and Evolution of Control Logic in Prokaryotic Transcriptional Regulation Microbiol. Mol. Biol. Rev., September 1, 2009; 73(3): 481 - 509. [Abstract] [Full Text] [PDF] |
||||
![]() |
C. Yanover, M. Singh, and E. Zaslavsky M are better than one: an ensemble-based motif finder and its application to regulatory element prediction Bioinformatics, April 1, 2009; 25(7): 868 - 874. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||








