Bioinformatics Advance Access originally published online on May 5, 2007
Bioinformatics 2007 23(12):1476-1485; doi:10.1093/bioinformatics/btm118
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Detection of generic spaced motifs using submotif pattern mining
1Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613, 2School of Computing, National University of Singapore, Singapore 119260, 3Genome Institute of Singapore, 60 Biopolis Street, #02-01 Genome, Singapore 138672 and 4Department of Computer Science, The University of Hong Kong, Pokfulam Road, Hong Kong
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Identification of motifs is one of the critical stages in studying the regulatory interactions of genes. Motifs can have complicated patterns. In particular, spaced motifs, an important class of motifs, consist of several short segments separated by spacers of different lengths. Locating spaced motifs is not trivial. Existing motif-finding algorithms are either designed for monad motifs (short contiguous patterns with some mismatches) or have assumptions on the spacer lengths or can only handle at most two segments. An effective motif finder for generic spaced motifs is highly desirable.
Results: This article proposes a novel approach for identifying spaced motifs with any number of spacers of different lengths. We introduce the notion of submotifs to capture the segments in the spaced motif and formulate the motif-finding problem as a frequent submotif mining problem. We provide an algorithm called SPACE to solve the problem. Based on experiments on real biological datasets, synthetic datasets and the motif assessment benchmarks by Tompa et al., we show that our algorithm performs better than existing tools for spaced motifs with improvements in both sensitivity and specificity and for monads, SPACE performs as good as other tools.
Availability: The source code is available upon request from the authors.
Contact: ksung{at}comp.nus.edu.sg
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
|
|
|---|
One of the major challenges facing biologists today is understanding the regulatory mechanism of genes. This challenge includes detection of transcription factor binding sites involved in regulation and discovery of regulatory networks. The problem of de novo identification of transcription factor binding site motifs has been widely studied and a number of motif-finding algorithms have been proposed under categories such as profile-based methods e.g. Gibbs sampler (Lawrence et al., 1993), MotifSampler (Thijs et al., 2002), SeSiMCMC (Favorov et al., 2005), GAME (Wei and Jensen, 2006), Improbizer (Ao et al., 2004), consensus-based methods e.g. Weeder (Pavesi et al., 2001), MITRA (Eskin and Pevzner 2002), Gemoda (Jensen et al., 2006) and hybrid methods that use a combination of the above two methods, e.g. (Hertz and Stormo, 1999).
However, motif-finding continues to be a difficult problem. Recently Tompa (Tompa et al., 2005) conducted an assessment of 13 popular motif discovery algorithms over 56 datasets drawn from Homo sapiens, Mus musculus, Drasophila melanogaster and Saccharomyces cerevisiae genomes, and found that all the algorithms performed unimpressively overall (barring yeast datasets).
One reason is that motifs can have complicated patterns. As pointed out by Eisen in a recent survey (Eisen, 2005), regulatory motifs could be highly complex in the biological context. Many motifs are known to be composite patterns which are groups of monad patterns (short contiguous patterns with some mismatches) that occur relatively near each other (Harbison et al., 2004). For example, the binding site for ArcA-P, a transcription factor for regulating gene related to the respiratory metabolism in Escherichia coli (Liu and DeWulf, 2004), can be regarded as two conserved segments, separated by a spacer of length approximately 6 (McGuire et al., 1999). Another example is Mcm1 (Kato et al., 2004) or often called as the early cell cycle box (ECB) (Tavazoie et al., 1999) which has three segments and two spacers. Note that a spacer does not necessarily mean that the characters in the spacer are completely random and arbitrary, but these characters are not very conserved in different instances.
In fact, in some regulatory mechanisms, a single transcription factor may bind to two or more sites that are relatively close to each other—as is frequently the case, for instance, of RNA polymerase (Record et al., 1996). Identifying these sites is similar to finding a spaced motif. Spaced motifs may also be associated with co-regulated genes that share two or more transcription factors and the binding sites are often recognized by different macromolecular complexes that make contact with one another (Owen and Zelent., 2000; Werner, 1999). Our focus in this article is to find such complex motifs that could contain spacers.
Most of the existing algorithms are mainly designed for monad motifs. Applying these algorithms to locate spaced motifs may not be effective. By treating a spaced motif as a single monad pattern, the motif instances may not be very similar, i.e. the signal may not be strong to be detected, due to the many random (non-conserved) characters in the spacers. Or if we try to locate the individual segments of a spaced motif using these algorithms, some of the segments may be too short and may not be easily detected.
On the other hand, there are algorithms designed for spaced motifs. The methods used by existing algorithms can be classified into the following approaches. The first and the most common approach is to assume that all the spacers in the same motif are all of the same fixed length [e.g. SesiMCMC (Favorov et al., 2005), OligoDyad (van Helden et al., 2000)]. However, in real cases, this is not the case. Another approach to handle spacers is to enumerate all possible spacer lengths between two composite segments (e.g. YMF developed in (Sinha and Tompa, 2000) and BioProspector (Liu et al., 2001)). Although this approach can find motifs with spacers of varying length, it is inherently inefficient and is difficult to extend to more than two segments. And it may not be practical for long motifs. The third approach to locate spaced motifs is to find the monad segments first (e.g. MITRA in (Eskin and Pevzner, 2002)), then based on the locations of monad segments, locate a set of possible dyads (spaced motifs with two segments). The algorithm of MITRA relies on a specially designed data structure (mismatch tree data structure) to quickly identify possible monad segments. There are other methods [e.g. (Carvalho et al., 2005; Marsan and Sagot, 2000)] that make use of data structures such as suffix tree to store the regularly spaced motif before finally identifying the motif pairs to speed up the process. Almost all existing approaches only handle spaced motifs with two segments.
In this article we propose a new approach for finding spaced motifs, and develop a novel motif-finding algorithm that offers flexibility in handling spacers with different lengths, the number of segments and variations in segment lengths. We formulate the motif finding problem as a frequent itemset mining and present an algorithm called SPACE for finding these motifs. Experimental results show that the approach is promising.
Our approach is similar to TEIRESIAS (Rigoutsos and Floratos, 1998) in building longer motif from shorter blocks. Yet, TEIRESIAS is computationally expensive. It uses a convolution strategy to stitch the shorter blocks exhaustively to find maximal patterns. It also does not handle mismatches. On the other hand, our novel approach provide further advantages. We allow flexibilities in terms of allowing mismatches and provide an efficient method to find the pattern.
| 2 SPACED MOTIFS AND THE SUBMOTIFS |
|---|
|
|
|---|
In this section, we provide the formal definition of a generic spaced motif and discuss the notion of submotif which is the core concept of our approach. We generalize the string representation of motifs as follows.
DEFINITION 1
For some pre-defined coverage ratio, a spaced motif (or simply a motif) is a length-L string formed by characters of {A, C, G, T, n} with at least
characters in {A, C, G, T}. Each maximal substring of consecutive n represents a spacer and each maximal substring of other characters represents a segment.
Figure 1 shows an example of a spaced motif M which is of length 20 and has three segments separated by two spacers. Note that the segments, as well as the spacers, can be of different lengths. The number of segments is also not fixed. Let
be the substring of Z starting at position i and ending at position j. Any length-
substring
within any segment of M is called its submotif. Below, using submotifs, we define an instance of a spaced motif (see Fig 1 for an example).
DEFINITION 2
Consider a length-L spaced motif M and any length-L string I formed by characters of {A, C, G, T}. I is called an instance of M if, for every submotifand
have at most d mutations, for some pre-defined constant d.
|
Now, we define the spaced motif-finding problem. Let
|
By formulating the motif-finding problem in this way, we have the following advantages:
- The lengths of the segments in the motif need not be known even if we pre-fix the length of the submotif. This follows because union of an overlapping set of submotifs can represent an arbitrary length segment. This property implies that motifs with segments of arbitrary lengths could be found. Note that this does not depend on whether the motif has spacers or not.
- The spaced motif uses multiple segments to model the functional parts, which are more conserved, and the spacers to model the non-functional parts. However, monad motif (or dyads) only has one segment (or two segments) for modeling both conserved and non-conserved regions. Hence, spaced motif can fit the conserved regions better. In other words, it yields higher specificity. We confirm this in our experiments on several datasets including the Motif Assessment Benchmark and some real biological datasets.
- It provides a natural extension for finding motifs with multiple spacers, in which neither the spacer length nor the number of spacers (and segments) is known.
However, there could be too many submotifs (many of them are spurious) and the challenge is in how effectively the submotif-compositing can be done to return good motifs. To tackle this situation, we formulate this task as a constrained frequent submotif mining problem and propose a new algorithm for solving it.
| 3 OUR SOLUTION |
|---|
|
|
|---|
Let
We do not assume any knowledge about the number and the locations of the spacers in the motif. To identify a possible candidate for the motif, we look at each length-L substring u in S, based on the definition of a spaced motif, we define an occurrence of u as follows. Let
be the Hamming distance of two equal-length strings x and y.
DEFINITION 3
Let u be a length-L substring in S. Consider another substring w of the same length in S. For some pre-defined constants d and, for every i, the substring
is called a submotif occurrence of
if
. The number of characters spanned by all submotif occurrences is called the coverage of w on u. The substring w is called an occurrence of u if the coverage of w on u is at least
.
The length-L substring u is called a motif candidate if there exist at least q occurrences of u in S. Figure 3 shows an example of a motif candidate.
|
Step 1 tries to find all motif candidates. A straightforward implementation is given in Section 3.1. Note that a spaced motif is highly correlated with a motif candidate. For a motif candidate that is a real spaced motif, the locations of submotif occurrences in each occurrence of the motif candidate define the locations of the segments for the candidate. A different occurrence may define a different set of segments for the same candidate. By finding the set of common segments defined by the occurrences, we can generate a spaced motif. The refinement process is done in Step 2 based on frequent itemset mining, which is detailed in Section 3.2. Step 3 and our scoring function is discussed in Section 3.3. The naive implementation shown in Section 3.1 is a bit slow, Section 3.4 shows how to speed up the process.
3.1 Generation of motif candidates
To find all motif candidates and their occurrences, a straightforward implementation is as follows. Fix a constant L for the motif length, for each sequence Si, for each substring u of length L in Si, check the coverage of all other substrings in S of length L on u. If there are q occurrences of u, report u and all its occurrences. This naive procedure runs in
time where n is the length of a sequence. The actual running time is about 2 min for a dataset of 5K bp with 10 sequences on a 3.6GHz Xeon Linux workstation with 4 processors and 8GB RAM.
At the end of this step we have a set of motif candidates, each associated with a set of occurrences. Recall that some of these occurrences may be noise or some of the submotif occurrences in them may be spurious. Our next step is to eliminate these noise to identify the spaced motifs.
3.2 Refining motif candidate into spaced motif
Given a motif candidate u and its occurrences
, this section discusses the way to refine u into a spaced motif. Our idea is to transform the problem into frequent itemset mining (Han and Kamber, 2000).
Before describing the transformation, recall that, by Definition 3, u and wi share a set of submotif occurrences.
DEFINITION 4
Suppose that w is an occurrence of u. Then,is called the itemset of w with respect to u.
Figure 4 demonstrates the itemset concept. For an itemset J, we can construct a spaced motif Mu,J of length L such that
if
for some
; otherwise,
. An itemset is called a frequent pattern if it has at least q occurrences. The following lemma states the relationship between frequent itemset and spaced motif. Note that there is an assumption behind this transformation. While we allow different gaps to have different lengths in the same motif, for a gap in the motif, the length of this gap is the same in all instances.
LEMMA 1
Let J be a frequent pattern of u with at least q support. If Mu,J has coverage at least, Mu,J is a spaced motif.
|
PROOF
Since J is a frequent pattern with at least q support,has at least q instances. Also, Mu,J has coverage at least
, so Mu,J is a spaced motif.
Hence, given a motif candidate u and its occurrences, we can refine u as a spaced motif as follows.
- Generate the itemsets for all occurrences of u.
- Find the frequent itemsets F which appear at least q times.
- Report the spaced motif corresponding to F with sufficient coverage.
Algorithm 1 shows the complete scheme of the algorithm.
|
3.3 Significance testing and scoring
We adapt the motif-scoring technique introduced in Weeder (Pavesi et al., 2001) to compute the significance of spaced motifs. Intuitively, a motif is significant if (1) the total number of its occurrences in all input sequences is a lot more than expected with respect to the background and (2) the pattern is either very conserved or occurs in quite a number of the input sequences. So, Weeder's scoring mechanism computes two values to capture these two properties.
Let M be the motif, E(M, e) be the expected frequency of M with at most e mutations based on a set of background sequences (we will show how to compute E later in this section). Then,
, where len (s_i) denotes the length of i-th sequence si, represents the expected frequency of M with at most e mutations in all input sequences. To capture property (1), we count the total number of observed occurrences of M (with at most e mutations),
, in all input sequences and compute the occurrence score, ß (M) as follows.
|
| (1) |
To capture property (2), for a sequence
with an occurrence of M, we consider the most conserved pattern of M and let
be the number of mutations of this best pattern. The value of
represents the expected frequency of the occurrences of this motif in
. This value is smaller if the motif is more conserved. Then, we compute the sequence-specific score,
(M) as follows. If the pattern is very conserved and/or occurs in many sequences,
(M) is large.
|
| (2) |
Finally the score of each motif, Motif Score(M), is
.
The value of E (M, e) is computed by summing the expected frequency
of
in the background sequences for all
with at most e mutations from M. When
contains no spacer and is of length shorter than or equal to 8, the expected frequency value
is pre-computed from background sequences obtained from Regulatory Sequence Analysis Tool (RSAT) database site1 (van Helden, 2003). These background sequences of the organisms are taken from 1000 bp upstream regions of all their annotated genes.
When
contains spacers and is of length shorter than or equal to 8,
equals the sum of
among all possible
with the spacers n's replacing by
.
When
is of length longer than 8 and with or without spacers, we are unable to precompute the frequency values since it is long. Instead, the expected frequency of
is modelled using seventh order Markov chain. Suppose
with k greater than 8.
can be computed as follows:
|
|
The conditional probability
of having nucleotide pi preceded by nucleotides
, is computed by using the expected frequency of 8-mers:
|
|
3.4 Efficient generation of motif candidates
This section shows how to speed up Step 1, the motif candidate generation step. The observations that lead to the speed up are as follow. Recall that
is the length of a submotif.
LEMMA 2
Let the coverage ofon
be
. Then, the coverage
of
on
can be computed as follows.
where
if the prefix
of
is a submotif occurrence, that is,
![]()
, otherwise
. Similarly,
if the suffix
of
is a submotif occurrence, that is,
![]()
, otherwise
.
PROOF
Note that when considering all length-substrings of
and
, the only substrings that they are different are
which is in
, but not in
, and
which is in
, but not in
. If
, it means that
is a submotif of
with respect to
. This submotif will not be in
. If
, then
is a submotif of
with respect to
which is not in
. So, the result follows.
Based on Lemma 1, once we have calculated the coverage of
on
, to calculate the coverage of
on
, it only takes O(1) time. To calculate the coverage of all substrings on one sequence against all potential motif candidates in another sequence, the time complexity can then be reduced to
.
Since we are only interested in the substrings that can have a coverage at least
, we can further prune the computation according to the following lemma.
LEMMA 3
Let the coverage ofon
be
. Let y be the length of the longest suffix of
that is not covered by a submotif occurrence. The coverage
of
on
is upper bounded by
for any p > 0.
PROOF
We try to upper bound the value ofas follows. Comparing
with
, There are p new characters. Assuming that all these characters are covered by submotifs, the coverage can be increased at most by p. For the suffix of
that is not covered by any submotif occurrence. If
, then when considering
, these y characters may all be covered by a submotif, so the coverage can be increased by at most y. On the other hand, if
, then at most the last
characters, which can form a submotif with one new character, can be covered by a submotif occurrence when considering
, so the coverage can be increased by at most
. So, the result follows.
By Lemma 2, after computing the coverage of
on
, based on the upper bound calculation, we can skip the computation of coverage for some substrings and jump to the substrings
and
with the smallest p such that
. From our experiments, we found that the running time for generating the motif candidates and their occurrences have been reduced from 2 min to <1 s on the same dataset of 5K bp with 10 sequences on a 3.6GHz Xeon Linux workstation with 4 processors and 8GB RAM. So, it is feasible for large datasets.
| 4 EXPERIMENTAL RESULTS |
|---|
|
|
|---|
We perform experiments on four classes of datasets namely, (1) nine biological datasets that are known to contain spacers, (2) four synthetic test cases consisting of different variations of spaced motifs, (3) The datasets from four different species, proposed by (Tompa et al., 2005) for the assessment of motif discovery algorithms and (4) 10 real biological datasets consisting monad motifs. The assessment results are reported below.
For performance evaluation, we use the same four measures proposed in (Tompa et al., 2005) namely, sensitivity (nSn), positive predictive value (nPPV), performance coefficient (nPC), and correlation coefficient (nCC). Index n is used to denote that the assessment is done at the nucleotide level instead of site level (please refer to Appendix B for the definitions of these measures).2 All experiments have been performed on a 3.6GHz Xeon Linux workstation with 4 processors and 8GB RAM.
For each dataset, we run SPACE using 12 parameter settings: motif length L = 8, 15, 20; submotif length
= 5; maximum number of mismatches allowed in each submotif instance d = 1; the minimum support
or 0.5t, where t is the number of input sequences; the coverage ratio r is set to be 0.5 or 0.8. Similar to what Weeder does, we collect the top 10 motifs from each run. For each motif, we count the number of related redundant motifs in the collection. Based on the observation that a real motif should have more related redundant motifs, we rank the motifs in decreasing order of the total number of related redundant motifs in the collection. Please refer to Appendix A for a more detailed description of this step.
4.1 Results on datasets with spaced motifs
We first evaluate the performance of SPACE for spaced motifs, that is, motifs with at least one spacer.
4.1.1 Real biological datasets
In the literature, we found nine transcription factor binding sites whose motifs have gaps. They are: GAL4P (Johnston and Carlson, 1992), ARCA-P (Liu and DeWulf, 2004), MCM1 (Kato et al., 2004) or ECB (Tavazoie et al., 1999) and six transcriptional regulators of C6 Zinc cluster family (Schjerling and Holmberg, 1996).
Comparison is done with MITRA3 (Eskin and Pevzner, 2002) and BioProspector,4 both of which can handle motifs with spacers. We let MITRA search for motifs up to 12 bp (the maximum possible) and we require it to find the motif on the given strands only. For BioProspector we allow the algorithm to search for motifs with block size ranging from 4 to 10 and gap size ranging from 0 to 12. In the comparison, instead of picking the first motif among the top 20 that can give a better nSn and nPPV than the motif of rank 1. Table 1 summarizes the comparison results. From the table, we see that the selected motifs of SPACE are usually of higher rank than the other two and the averaged performance is better across all measures.
|
4.1.2 Synthetic datasets
We consider four synthetic test cases for spaced motifs using randomly created sequences with the base pairs uniformly distributed. For each case, we create three datasets, each containing 10 sequences of length 300bp. We run SPACE and report the averaged performance. For each dataset the motifs are implanted in five of them at random positions. And the motifs are as follow:
- A 7 bp length motif with no spacer and 1 mismatch.
- A 15 bp length motif containing two segments of length 5 and 7 with a spacer of length 3, with 1 mismatch for each segment.
- A 21bp length motif containing two segments of length 5 with spacers of length 3, with 1 mismatch for each segment.
- A 15bp length motif containing two segments of length 4 and 5 bp with a spacer of length 6, with 1 mismatch for each segment.
Figure 5 shows the averaged performance of SPACE, MITRA and BioProspector on the synthetic datasets with Table 3 giving the detailed statistics on one particular dataset of each test case. It also shows that SPACE performs better than the other tools over all four measures. The result is consistent with the one for real biological datasets. More information about the comparison results can be obtained in Appendix L of the Supplementary Material.
|
|
|
4.2 Results on datasets with monad motifs
We are also interested in the performance of SPACE for motifs without spacers. We have performed two sets of experiments, one on Tompa's benchmark datasets and the other on 10 real biological datasets.
4.2.1 Tompa's benchmark data
Tompa's benchmark dataset has been constructed based on real transcription factor binding sites drawn from four different organisms (Tompa et al., 2005). It consists of 56 datasets in total. The number of sequences ranges from 1–35 and the sequence lengths are up to 3000 bp. In this assessment, following Tompa's approach, the motif ranked number 1 by the algorithm is used for comparison. The detailed experimental results can be found in Appendix C of Supplementary Material.
The performance of SPACE averaged over all datasets is shown in Figure 6. SPACE performs better than other tools based on the comparison measures.5 As an example, we show the binding sites (see Fig. 7) identified (in green) by our algorithm and Weeder on the dataset
together with the actual binding sites (in blue). Weeder is reported to perform the best among other tools in this dataset (Tompa et al., 2005). Similar to Weeder, SPACE is able to identify almost all actual binding sites.
|
|
We also analysed the performance of SPACE across the four organisms. Figure 8 shows the average performance of SPACE for each organism, compared with the best algorithm among the other tools for the respective organism. The figure shows that the performance of SPACE is similar to that of the best performing algorithm for each organism. On the other hand, the averaged performance of SPACE for all four organisms is better than other tools (Fig. 6), indicating that SPACE is more robust and organism independent.
|
4.2.2 Real biological datasets
We also performed experiments on 10 real biological datasets whose binding sites are known to be monads from literature. Comparison is done with Weeder (Pavesi et al., 2001) and MEME (Bailey and Elkan, 1995), both of which are well-known monad motif-finding algorithms. We set MEME to use two-component mixture mode and find motifs of length ranging from 8 to 20 bp. And for Weeder we use the large mode. summarizes the comparison results. From the table, for sensitivity, MEME shows a better performance than Weeder while SPACE is better than MEME. The reason for SPACE to have a higher sensitivity is due to the submotif modelling. Since the measure nSn focuses on predicting the binding sites, if there is a region in the motif that is not strongly conserved over all binding sites, the use of submotifs may still be able to identify most of these binding sites based on the regions that are strongly conserved, thus predicting more true binding sites. However, it does not mean that missing these binding sites, the software will predict a wrong motif pattern.
Unlike the case for spaced motifs, SPACE is not a clear winner for all the measures except sensitivity. But, the experiments indicate that for monads, SPACE is as good as other tools. This shows that SPACE can be used as a standard tool for finding monads as well as spaced motifs.
The real biological datasets of the spaced and monad motifs are constructed from –1000 to –1 upstream region of all the genes co-regulated by respective transcription factor, truncating the region if it overlaps with an upstream open reading frame (ORF). They are obtained from RSAT (van Helden et al., 2000) and ABS (Blanco et al., 2006) database respectively. More details on the results can be found in Appendices E and F.
| 5 CONCLUSIONS |
|---|
|
|
|---|
In this article, we have proposed a new approach for finding spaced motifs based on the notion called submotifs. We developed a novel motif-finding algorithm SPACE that detects the target motif by first finding submotifs and then strategically compositing them using an efficient frequent submotif pattern-mining approach. In finding motif with generic spacers, this framework provides the following novelties: the spacers could appear in more than two parts of the motif and their lengths need not be fixed. In experiments on real biological datasets, synthetic datasets and Tompa's motif assessment benchmarks, we observed that our algorithm performs better than existing tools for spaced motifs with improvements in both sensitivity and specificity and for monads, SPACE performs as good as other tools.
However, based on the submotif notion we define, we implicitly assume that the mismatches are uniformly distributed in the motif instances. If that is not the case, SPACE may fail to capture these instances, and thus may miss the motif or the regions of the motif that contain these mismatches. On the other hand, in real biological datasets, it seems that mismatches are usually not clustered for most of the motif instances. Hence, SPACE can perform well in most cases.
For future work, we are enhancing our algorithm to handle motifs for which the same gap may have different lengths across different instances by exploring the idea of allowing tolerances in the gap lengths across different instances during the mining process. Other directions include applying our approach for motif finding on drosophila s data where the sites maybe overlapping and with fluctuating positions (Makeev, 2003), discovery of motif modules (co-operating binding factors) (GuhaThakurta and Stormo, 2001).
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
The authors would like to thank the reviewers for all their useful and constructive comments.
S.M.Y. was supported in part by the funding from the Research Output Prize (Faculty of Engineering) of the University of Hong Kong.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: John Quackenbush
2 There is no consensus on what measures are the most appropriate to evaluate all different motif finders. The selected measures focus on the accuracy of predicting the locations of actual binding sites. Note that the consensus of the motifs reported may be the same for different algorithms, but the predicted binding sites may be different, thus yielding a difference in the performance measures. ![]()
3 http://fluff.cs.columbia.edu:8080/domain/mitra.html ![]()
4 http://ai.stanford.edu/~xsliu/BioProspector/ ![]()
5 In our comparison, we did not include the new motif finder, MotifSeeker (Peng et al., 2006). The experiments in their paper are based on a different subset of Tompa's datasets, so a direct comparison is not appropriate. ![]()
Received on December 5, 2006; revised on March 16, 2007; accepted on March 16, 2007
| REFERENCES |
|---|
|
|
|---|
Ao W, et al. Environmentally induced foregut remodeling by PHA-4/FoxA and DAF-12/NHR. Science, ( (2001) ) 305, : 1743–1746.[CrossRef].
Bailey T, Elkan C. Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Machine Learning, ( (1995) ) 21, : 51–80.[ISI].
Balzi E, Goffeau A. Yeast multidrug resistance: the PDR network. J. Bioenerg. Biomembr, ( (1995) ) 27, : 71–6.[CrossRef][ISI][Medline].
Becker B, et al. A nonameric core sequence is required upstream of the LYS genes of Saccharomyces cerevisiae for Lys14p-mediated activation and apparent repression by lysine. Mol. Microbiol, ( (1998) ) 29, : 151–63.[CrossRef][ISI][Medline].
Blanchette M, Tompa M. Discovery of regulatory elements by a computational method for phylogenetic Footprinting. Genome Res, ( (2002) ) 12, : 739–748.
Blanco E, et al. ABS: a database of annotated regulatory binding sites from orthologous promoters. Nucleic Acid Res, ( (2006) ) 34, : D63–D67.
Carvalho AM, et al. A highly scalable algorithm for the extraction of cis-regulatory regions. ( (2003) ) Proceedings of the Third Asia-Pacific Bioinformatics Conference (APBC). 273–282..
Dermitzakis ET, Clark AG. Evolution of transcription factor binding sites in Mammalian gene regulatory regions: conservation and turnover. Mol. Biol. Evol, ( (2002) ) 19, : 1114–1121.
Eisen MB. All motifs are not created equal: structural properties of transcription factor - DNA interaction and the inference of sequence specificity. Genome Biol, ( (2005) ) 6, : P7.[CrossRef].
Eskin E, Pevzner P. Finding composite regulatory patterns in DNA sequences. Bioinformatics, ( (2002) ) 18, : S354–S363.[Abstract].
Favorov AV, et al. A Gibbs sampler for identification of symmetrically structured, spaced DNA motifs with improved estimation of the signal length. Bioinformatics, ( (2005) ) 21, : 2240–2245.
GuhaThakurta D, Stormo GD. Identifying target sites for cooperatively binding factors. Bioinformatics, ( (2001) ) 17, : 608–621.
Han J, Kamber M. Data Mining: concepts and techniques. In: Morgan Kaufmann., ( (2000) ) San Diego, CA. 230–245..
Han TH, et al. Mapping of epidermal growth factor-, serum-, and phorbol ester-responsive sequence elements in the c-jun promoter. Mol. Cell. Biol, ( (1992) ) 12, : 4472–4477.
Harbison CT, et al. Transcription regulatory code of a eukaryotic genome. Nature, ( (2004) ) 431, : 99–104.[CrossRef][Medline].
Hermeking H, et al. Identification of CDK4 as a target of c-MYC. Proc. Natl Acad. Sci. USA, ( (2005) ) 97, : 2229–2234.[CrossRef].
Hertz GZ, Stormo GD. Identifying DNA and protein patterns with statistically significant alighments of multiple sequences. Bioinformatics, ( (1999) ) 15, : 563–577.
Jensen KL, et al. A generic motif discovery algorithm for sequential data. Bioinformatics, ( (2006) ) 22, : 21–28.
Johnston M, Carlson M. Regulation of carbon and phosphate utilisation, In Molecular and Cellular Biology of the Yeast Saccharomyces: Gene Expression. ( (1992) ) Woodbury NY, USA: CSHL Press. 193–281..
Kato M, et al. Identifying combinatorial regulation of transcription factors and binding motifs. Genome Biol, ( (2004) ) 5, : R56.[CrossRef][Medline].
Lawrence C, et al. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, ( (1993) ) 1993, : 133–154..
Lenhard B, et al. Identification of conserved regulatory elements by comparative genome analysis. J. Biol, ( (2003) ) 2, : 13.[CrossRef][Medline].
Liu X, et al. BioProspector: discovering DNA motifs in upstream regulatory regions of co-expressed genes. ( (2001) ) Proceedings of the Seventh Pacific Symposium of Biocomputing (PSB). 127–138..
Liu X, Wulf P. Probing ArcA-P modulon of Escherichia coli by whole genome transcriptional analysis and sequence recognition profiling. J. Biol. Chem, ( (2004) ) 279, : 12588–12597.
Makeev VJ, et al. Distance preferences in the arrangement of binding motifs and hierarchical levels in organization of transcription regulatory information. Nucleic Acids Res, ( (2003) ) 31, : 6016–6026.
Marsan L, Sagot M-F. Algorithms for extracting structured motifs using a suffix tree with an application to promoter and regulatory site consensus identification. J. Comp. Biol, ( (2000) ) 7, : 345–360.[CrossRef].
McGuire AM. A weight matrix for binding recognition by the redox-response regulator ArcA-P of Escherichia coli. Molecular Microbiology, ( (1999) ) 32, : 219–221.[CrossRef][ISI][Medline].
Owen G, Zelent A. Origins and evolutionary diversification of nuclear receptor superfamily. Cell Mol. Life. Sci, ( (2000) ) 57, : 809–827.[CrossRef][ISI][Medline].
Pavesi G, et al. An algorithm for finding signals of unknown length in DNA sequences. Bioinformatics, ( (2001) ) 17, : S207–S214.[Abstract].
Peng C-H, et al. Identification of degenerate motifs using position restricted selection and hybrid ranking combination. Nucleic Acids Res, ( (2006) ) 34, : 6379–6391.
Record MT, et al. Escherichia coli. RNA polymerase
70 promoters, and the kinetics of the stepstranscription initiation. Escherichia Coli and Salmonella, ( (1996) ) 1, : 792–820..
Rigoutsos I, Floratos A. Combinatorial pattern discovery in biological sequences. Bioinformatics, ( (1998) ) 14, : 55–67.
Schjerling P, Holmberg S. Comparative amino acid sequence analysis of the C6 zinc cluster family of transcriptional regulators. Nucleic Acid Research, ( (1996) ) 24, : 4599–4607.
Sinha S, Tompa M. A statistical method for finding transcription factor binding sites. ( (2000) ) Proceedings of the 8th International Conference on Intelligent Systems for Molecular ISMB-00. 344–354..
Svetlov VV, Cooper TG. Compilation and characteristics of dedicated transcription factors in Saccharomyces cerevisiae. Yeast, ( (1995) ) 11, : 1439–1484.[CrossRef][ISI][Medline].
Tavazoie S, et al. Systematic determination of genetic network architecture. Nat. Genet, ( (1999) ) 22, : 281–285.[CrossRef][ISI][Medline].
Thijs G, et al. A Gibbs sampling method to detect over-represented motifs in the upstream regions of co-expressed genes. J. Comput. Biol, ( (2002) ) 9, : 447–464.[CrossRef][ISI][Medline].
Tompa M, et al. Assessing computational tools for the discovery of transcription factor binding sites. Nat. Biotechnol, ( (2005) ) 23, : 137–144.[CrossRef][ISI][Medline].
van Helden J. Regulatory sequence analysis tools. Nucleic Acids Res, ( (2003) ) 31, : 3539–3596..
van Helden J, et al. Discovering regulatory elements in non-coding sequences by analysis of spaced dyads. Nucleic Acids Res, ( (2000) ) 28, : 1808–1818.
Wasserman WW, Ficket JW. Identification of regulatory regions which confer muscle-specific gene expression. J. Mol. Biol, ( (1998) ) 278, : 167–181.[CrossRef][ISI][Medline].
Wei Z, Jensen ST. GAME: detecting cis-regulatory elements using a genetic algorithm. Bioinformatics, ( (2006) ) 22, : 1577–1584.
Werner T. Models for prediction and recognition of eukaryotic promoters. Mamm. Genome, ( (1999) ) 10, : 168–175.[CrossRef][ISI][Medline].
Yagi H, et al. Regulation of the mouse histone H2A.X gene promoter by the transcription factor E2F and CCAAT binding protein. J. Biol. Chem, ( (1995) ) 270, : 18759–18765.









