Skip Navigation


Bioinformatics Advance Access originally published online on January 9, 2008
Bioinformatics 2008 24(4):484-491; doi:10.1093/bioinformatics/btm629
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
24/4/484    most recent
btm629v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Gunewardena, S.
Right arrow Articles by Zhang, Z.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Gunewardena, S.
Right arrow Articles by Zhang, Z.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

A hybrid model for robust detection of transcription factor binding sites

Sumedha Gunewardena * and Zhaolei Zhang *

Banting and Best Department of Medical Research, Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, 160 College St, Toronto ON, Canada M5S 3E1, Canada

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODOLOGY
 3 RESULTS AND DISCUSSION
 4 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: The short and degenerate nature of transcription factor (TF) binding sites contributes towards a low signal to noise ratio making it very difficult to separate them from their background. In order to tackle this problem one needs to look at ways of capturing the underlying biophysical properties that best discriminates TF binding sites from their background DNA. One such discriminatory property lies in the observed compositional differences in the nucleotide levels of TF binding sites and background DNA which are a result of processes such as purifying selection and selective preferences of TF binding sites for particular nucleotides or a combination of nucleotides over others.

Results: In this article, we present a hybrid model, referred to as a MonoDi-nucleotide model for robustly detecting TF binding sites. It incorporates both mono- and dinucleotide statistics to optimally partition the base positions of an aligned set of TF binding sites (motif) into a non-redundant sequence of mono and/or dinucleotide segments that maximizes the odds ratio of the binding sites relative to their background DNA. We tested the MonoDi-nucleotide model on the benchmark dataset compiled by Tompa et al. (2005) for assessing computational tools that predict TF binding sites. The performance of the MonoDi-nucleotide model on this data set compares well to, and in many cases exceeds, the performance of existing tools. This is in part attributed to the significant role played by dinucleotides in discriminating TF binding sites from background DNA.

Availability: A Matlab implementation of the MonoDi-nucleotide model can be found at http://www.utoronto.ca/zhanglab/MonoDi/.

Contact: sumedha{at}cantab.net, Zhaolei.Zhang{at}utoronto.ca

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODOLOGY
 3 RESULTS AND DISCUSSION
 4 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Transcriptional regulation is an important means of controlling gene expression in humans and other higher eukaryotic organisms. Essential to transcriptional regulation is the binding of various modulator transcription factors (TF) onto cis-regulatory elements (binding sites) in the vicinity of the regulated gene. Characterizing these cis-regulatory elements remain an important but difficult task, made difficult by the short (on average ≤ 10 bp) and highly degenerate nature of these sequences. Early methods for representing TF binding sites were based on consensus sequences (Schug et al., 1997), and later, position specific weight matrices (PSWM) (Quandt et al., 1995). PSWMs continue to be the most widely used method of representing TF binding sites and have been used in many programs such as Signal Scan (Prestridge, 1991) and MatInspector (Quandt et al., 1995). While PSWMs are an improvement from consensus sequences in representing TF binding sites and in most cases considered adequate for characterizing them (Benos et al., 2002), they are at an intrinsic disadvantage from the base independence assumption, i.e. the assumption that individual bases of a binding site exist independently of each other, underlying their construction. In a PSWM model, the binding affinity of a site is assumed to be proportional to the sum of the independent contributions of the individual interactions at each nucleotide position. However, now, there is an increasing number of evidence suggesting the existence of correlations between nucleotide positions (Berger et al., 2006; Man et al., 2001; O'Flanagan et al., 2005; Udalova et al., 2002; Wolfe et al., 1999). Man and Stormo (Man et al., 2001), for example, showed that the interaction of Salmonella bacteriophage repressor Mnt with its operator DNA at positions 16 and 17 were not independent. Barash et al. (Barash et al., 2003) described dependencies within the binding sites of the Arabidopsis ABA responsive element. Udalova et al. (Udalova et al., 2002) demonstrated the existence of such interdependent effects in binding sites of the NF-{kappa} B protein. All this evidence suggests an inherent weakness in the general base independence assumption of nucleotides in TF binding sites. This issue has been addressed by various authors in programs dedicated for finding TF binding sites with techniques ranging from the use of oligonucleotide frequency matrices (Ponomarenko et al., 1999), improved weight matrices with prior information on correlated nucleotide positions (Stormo et al., 1986), principal coordinates analysis (Udalova et al., 2002), phylogenetic footprinting (Qian et al., 2005) to ‘template’ based approaches (Gunewardena et al., 2006).

Although the exact reasons for the association between nucleotide positions of transcription factor binding sites are not well understood, it can be postulated to the different biophysical constraints influencing these sites. An example is the restrictions caused in the mobility of DNA by steric hindrances (clashes) that arise as a result of unequal base sizes of adjacent bases twisted in opposite directions (Dickerson et al., 1981). It is likely that such clashes, at critical points along the promoter, interfere in the DNAs ability to interact with the binding protein, and hence modulate the efficiency in which the protein binds to the DNA. Evidence of this has been shown by Nussinov (1984), in terms of the local structure of the DNA in the analysis of E.Coli polymerase binding sites. The binding of protein to DNA causes a large number of distortions to the regular twist of the double helix resulting in a departure of the DNA from the classic linear B-DNA to a conformation with a distorted axis which curves towards the protein's recognition helix. The ability of the DNA to tolerate such structural distortions from the classical B-form will, at least partly, determine how well the DNA binds to the protein. Such structural constraints are bound to impose implicit nucleotide dependencies on the DNA. As Calladine and Dickerson (Dickerson, 1983) pointed out, the structural conformation of the DNA is very much dependent on base steps rather than isolated base pairs. Further more, every step in which a steric clash occur, it is projected to its neighboring base pair steps.

Other reasons for associations between nucleotide positions to arise may include processes such as DNA methylation, purifying selection, DNA replication and repair mechanisms and biases in DNA modification (Gentles et al., 2001). Further more, the authors have carried out a study that shows significant differences in the dinucleotide content between TF binding sites and background DNA (see Supplementary Note 1) which provides a compelling reason to look beyond the base independence assumption when discriminating TF binding sites from background DNA. We can express the degree of discrimination of a motif (a collection of fixed length aligned putative binding sites) from its background as an odds ratio between the motif and background models. The odds ratio quantifies the extent to which the motif model derived from its putative binding sites differ from the a priori probabilities of the individual bases of those sites. It can be seen as a statistical measure of the quality of the motif.

In this article we describe a hybrid model referred to as the MonoDi-nucleotide model that discriminates TF binding sites based on their odds ratios. The MonoDi-nucleotide model maximizes the odds ratio of a set of binding sites by selecting an optimal, complete and non-redundant (disjoint) partitioning of the base positions of the sites, into base independent and non-independent segments. This partitioning describes the statistically, independent and non-independent positions of a TF binding site that maximizes its odds ratio over the background. The MonoDi-nucleotide model works in the principle that if dinucleotide dependencies are present in a set of binding sites the model identifies and accommodates it. In the absence of such dependencies, the model only works on mononucleotides. It is a flexible model whose topology adapts to the best discriminatory topology of the binding sites based on mono and/or dinucleotide combinations of nucleotide positions of the sites. As a ‘de novo’ prediction tool of regulatory elements, the raw data does not provide any prior knowledge of the presence or absence of any dinucleotide dependencies in the sites or their locations within the sites. The MonoDi-nucleotide model also takes advantage of the differences that exist between the non-independent dinucleotide distributions of TF binding sites and background DNA to home in on TF binding sites (see Supplementary Note 4).

We tested the MonoDi-nucleotide model on the benchmark dataset compiled by Tompa et al. (Tompa et al., 2005) for assessing computational tools that predict TF binding sites. The performance of the MonoDi-nucleotide model on this data set compares well or exceeds the performance of existing tools. This is in part attributed to the significant role played by dinucleotides in discriminating TF binding sites from background DNA (see Supplementary Note 4).


    2 METHODOLOGY
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODOLOGY
 3 RESULTS AND DISCUSSION
 4 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
2.1 Motif model
The motif model is a hybrid model that incorporates both mono- and dinucleotide statistics to optimally partition the base positions of a set of TF binding sites into a non-redundant sequence of mono and/or dinucleotide segments that maximize the odds ratio of the binding sites relative to the background nucleotide distribution. The derivation of the motif model is described below.

2.1.1 Mononucleotide model
In the mononucleotide model, bases of a DNA sequence are assumed to exist independently of each other. This enables the probability of the sequence to be expressed as a product of the probabilities of its individual bases.

DEFINITION 2.1. Let S = < x1, x2, ... , xL >, be a nucleotide sequence of length L.

In a mononucleotide model, the probability of S, P(S) = p(x1) p(x2) ... p(xL).

If S is a TF binding site, we can compute its odds ratio, O(S), relative to the background nucleotide distribution by dividing its probability, P(S|binding site model), here forth referred to as P(S), by the product of the background probability of its individual bases, P(S|background model), here forth referred to as Q(S) The background model is simply the a priori probabilities of the individual bases. It is computed from the mono- and di-nucleotide counts of the input sequences after removing the sub-sequences of the motif.


Formula

DEFINITION 2.2. Let Formula be a motif of length L of the N aligned sites, {S1, ... ,SN}, where Formula refers to the ith nucleotide of the jth sequence of motif Formula .

DEFINITION 2.3. Let Formula , be any arbitrary sub-alignment, Formula , from position L to position l' of the N sequences, {S1, ... , SN}, of motif Formula , where 1 ≤ l ≤ l' ≤ L.

If Formula is a sub-alignment of base independent sub-sequences of Formula , we can compute the normalized odds ratio, Formula , of Formula as follows:


Formula

where Formula is the count of the number of nucleotides equal to {delta} along position i of the N sequences of Formula . The fraction Formula is the probability of seeing nucleotide {delta} in position i, i.e. p(xi = {delta}). We can convert the normalized odds ratio to a normalized log odds ratio, Im, giving:


Formula

This is the ‘relative entropy’ of Formula . In information theory, the inner summation is referred to as the ‘Kullbacl-Leibler distance’.

The probability p(xi = {delta}) can be computed directly from the available binding sites, as the fractionFormula . In many cases though, the available data does not provide a full picture of the actual nucleotide distributions of the sites. The general means of circumventing this problem is to add pseudo counts to account for unseen instances of the data. Hertz and Stormo (Hertz et al., 1999) proposed adding pseudo counts proportional to the background distribution of the nucleotide bases, which is the method we follow in this article, therefore:


Formula

.

2.1.2 Dinucleotide model
In the dinucleotide model, we assume that nucleotide correlations do occur, however due to computational complexities involved with modelling higher order nucleotide correlations in TF binding sites we only consider correlations between neighboring base pairs. Under this model, the probability of a sequence S can be expressed as:


Formula

If Formula is a sub-alignment of Formula whose nucleotide positions are correlated between neighboring base pairs, we can compute its normalized odds ratio, Formula , using the dinucleotide model as follows:


Formula

where Formula is the count of the number of dinucleotides that are equal to {delta}1 in position i and {delta}2 in position i+1 of the N sequences of Formula . The fraction Formula is the probability of seeing nucleotide {delta}1 in position i and {delta}2 in position i+1, i.e. p(xi = {delta}1,xi+1 = {delta}2). Like the mononucleotide model, we can compute the normalized log odds ratio, Id for the dinucleotide model giving:


Formula

If an independent background model is assumed, the joint background probability q(xi, xi+1) can be replaced by product q(xi) q(xi+1). Similar to the mononucleotide model, pseudo counts are added when computing the probability p(xi = {delta}1,xi+1 = {delta}2) giving:


Formula

where Formula .

In order to account for the difference in the degrees of freedom between the more complex dinucleotide model and the equivalent less complex mononucleotide model we scaled down the dinucleotide model (at the dinucleotide level) by subtracting a factor {lambda} from the normalized log odds ratio (see Supplementary Note 2). We set {lambda} = log2(1.23) (derived from Gentles et al., 2001) which ensures at least a 1.23-fold increase in odds of a dinucleotide under the dinucleotide model relative to an equivalent mononucleotide model.

2.1.3 MonoDi-Nucleotide model
Transcription factor binding sites can have a combination of both base independent and non-independent segments of nucleotide positions. The odds ratio of such sequences can be computed as a product of the odd ratios of the individual segments computed using the appropriate model.

DEFINITION 2.4. Let Formula be a set of, M, non-overlapping base independent sub-alignments of Formula where 1 ≤ lmi ≤l';mi ≤ L.

DEFINITION 2.5. Let Formula be a set of, D, non-overlapping base dependent (between neighboring bases) sub-alignments of Formula . Formula where1 ≤ ldi < l';di ≤ L.

We can compute the odds ratio of the union of the two sets. It is simply the product of the odd ratios of the individual sets i.e.


Formula

The normalized log odds ratio of the combined sets, is given by


Formula

The MonoDi-nucleotide model enables us to combine the odd ratios of base independent and non-independent nucleotide sequences.

2.2 Optimizing the odds ratio of a set of transcription factor binding sites
We can optimize the normalized log odds ratio of a motif by optimally segmenting its nucleotide positions into base independent and/or non-independent segments. This can be done with the aid of a MonoDi-nucleotide network described below. An optimal segmentation of the positions of the sites is a complete (entails all positions), disjoint (no overlaps) partitioning of the nucleotide positions into segments (which are sub-alignments of the motif), with each segment being either base independent under the mononucleotide model or non-independent under the dinucleotide model, such that the combined odds ratio of the segments is optimal taken over all possible partitions. Now, if Formula is a motif representing an aligned set of TF binding sites, we can obtain its optimal segmentation and resulting normalized log odds ratio as follows.

DEFINITION 2.6. Let Formula be a partitioning of motif Formula into M complete and disjoint segments (sub-alignments), with the positions of each segment being either strictly independent or non-independent. Then Formula where {frown} is the concatenation operator.

We use a MonoDi-nucleotide network to obtain an optimal partitioning, Formula , of the nucleotide positions of Formula . A MonoDi-nucleotide network is a directed acyclic network with three parallel layers, and a layer for the Start and End nodes referred to as the 0th layer (see Figure 1). The network flows from the Start node to the End node. Layer 1 of the network represents the mononucleotide (base independent) model. Layers 2 and 3 represent the dinucleotide (base dependent) model. Transitions between Layers 1 and 2 provides for the MonoDi-nucleotide (hybrid base independent and non-independent) model. In order to better elucidate the explanations that follow, imagine a motif, Formula , as an un-gapped fixed length alignment of a set of binding sites. Also note that, Formula refers to the nucleotides along the ith column of Formula .


Figure 1
View larger version (22K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. A MonoDi-nucleotide network. Each node in Layer 1 holds the normalized log odds ratio values of the individual bases, {A, C, G, T}, corresponding to its motif position. Each node in Layer 2 holds the normalized log odds ratio values of the base pairs, {AA, AC,..., TT}, corresponding to its two adjacent motif positions.

 
DEFINITION 2.7. Let Formula refer to the ith node of Layer k of the MonoDi-nucleotide network.

DEFINITION 2.8. Let Formula represent a directed edge from node i of Layer k1 to node j of Layer k2.

Each position of the binding sites (column of the alignment) maps to a single node in Layer 1 of the MonoDi-nucleotide network such that the first node maps to the first column, the second node, to the second column, and so on.


Formula

Hence, the number of nodes in Layer 1 of the network is equal to the length of the binding sites (L nodes for L nucleotide positions). These nodes represent the normalized log odds ratios of the nucleotides of the columns they map to. Nodes of Layer 1 are ordered monotonically in the order of the nucleotide positions they map to. Within Layer 1, nodes are connected by directed edges, going from left to right, between neighboring nodes


Formula

Nodes of Layer 2 map to dinucleotides with a node for each positional pair (adjacent columns of the alignment), giving L–1 nodes for L nucleotide positions.


Formula

Positions are taken in successive overlapping pairs (i.e. pairs of the form [1, 2], [2, 3], ... , [L–1, L]). Nodes of Layer 2 represent the normalized log odds ratios of the dinucleotides of the adjacent pairs of columns they map to. Like Layer 1, Layer 2 is also ordered monotonically in the order of the first nucleotide position in the pair of nucleotide positions they map to. Nodes of Layer 2 are connected to each other only through a unique node in Layer 3. This node in Layer 3 represents the negative normalized log odds ratios of the intersecting column of the two, Layer 2 nodes that it connects.


Formula

Nodes of Layer 1 connect to nodes of Layer 2 whose first position is the immediate successor to its position. Likewise, nodes of Layer 2 connect to nodes of Layer 1 whose position is the immediate successor to its second position.


Formula

These connections enable the network to alternate between the mononucleotide and dinucleotide models.

Finally, The Start node of the network connects to the first node of Layer 1 and the first node of Layer 2. The last node of Layer 1 and Layer 2 connects to the End node of the network.



Formula

We can use the MonoDi-nucleotide network to obtain an optimal partitioning of the nucleotide positions. In order to do this, we use a dynamic programming algorithm to systematically fill the MonoDi-nucleotide network. It proceeds as follows:

DEFINITION 2.9. Let Formula , i isin {1..., End} and k isin{0, 1, 2, 3}, be the maximum cumulative normalized log odds ratio, along the optimum path, at node Formula , starting from the Start node Formula

  • We initialize the network by setting Formula .
  • For any other node Formula , for valid i isin {1, ... , End} and kisin{0, 1, 2, 3}, of the network, we have


    Formula

    where


    Formula


The function Formula returns the parent nodes of Formula (i.e. the nodes that connect to Formula ).

  • We keep track of the direction of the optimal path using the function Formula defined as


    Formula

    (In case there is more then one path, we select the path along the mononucleotide model giving it priority over the dinucleotide model.)

  • The optimal partition, Formula , of the nucleotide positions of Formula is obtained by tracing back the path from the End node to the Start node.
  • The optimal normalized log odds ratio, Formula , of motif Formula partitioned by Formula is given by Formula .

The network can be completed by traversing the nodes in the order


Formula

For the benefit of the readers, an illustrative example is provided in Supplementary Material Worked Example.

2.3 Motif significance score
The motif significance (MS) score is used to rank motifs. The MS score, E, of a motif Formula with an optimal normalized log odds ratio of Formula is given by


Formula

where N is the number of sites in motif Formula and T, the total number of possible sites given the sequences. The MS score incorporates the posterior probability of sampling motif Formula . This is similar to the posterior probability of a model defined in Neuwald et al. (1995). However we do not add weighted pseudosites to the actual number of sites but instead weight this probability by a fraction w isin (0,1). The default setting for w is 0.8. The MS score described here does not include a term for variable motif lengths as only fixed length motifs are considered.

2.4 Sampling the sequence space
Computational techniques such as Gibbs sampling that rely on randomly sampling a given sequence space for finding statistically significant motifs depend heavily on the accuracy and efficiency of the particular sampling strategy used. It is generally done uniformly over the given sequence space. This is however an inefficient approach that exacerbates as the number and, more so, the length of the sequences being sampled increase.

We have observed that the placement of TF binding sites in the upstream regions of co-regulated genes for some transcription factors tend to locate within relatively similar distances from the transcription start site of those genes regardless of the absolute distance from it (see Supplementary Note 3). It is an observation relative to the length of the upstream region considered. This is illustrated in Figure 2. Given this observation, sampling the sequence space can be preformed more efficiently if it is concentrated more on the regions where TF binding sites tend to locate over regions where they are less likely to be found instead of uniformly scanning the entire space. However, sampling should not be prejudicial to any particular spread of TF binding sites.


Figure 2
View larger version (5K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. The figure illustrates a non-uniform distance distribution of the placement of TF binding sites (from the transcription start site) of a transcription factor along the upstream regions of a set of co-regulated genes.

 
If we assume that the probability of a TF binding site, s, being at a distance, l, from the transcription start site is proportional to its placement on the sequence which is given by the distance to s from either end of the sequence i.e l and l–l, we can express the probability of s being located at a distance l from the transcription start site as a function of the distances l and L–l as P(l|s) = k (l)p (Ll)q, where L is the length of the sequence and k, a normalizing constant given by k= {int} (l)p (Ll)q.

This distribution can be converted to a standard beta distribution by normalizing the distances l and Ll to lie in the range (0,1) by dividing by L+1 and replacing the parameters p and q by p–1 and q–1, respectively (where p > 0, q > 0). The parameters p and q, or more precisely, the location parameter {alpha} = p/(p+q) and the scale parameter β = p+q, determines the shape of the distribution. Given that parameters {alpha} and β describe the distribution of the placement of binding sites from the transcription start site we can use this distribution to sample the sequence space much more efficiently than sampling it uniformly. For a given set of observed distances, we can compute the ML estimates of parameters {alpha} and β very efficiently (see Hahn et al., 1994; Scarborough, 1955).

The sequence space is sampled using a dynamic beta distribution, Formula , which at any given time represents the distribution of distances of the sites in the motif with maximum MS score found to that point. We start by computing the ML estimates of {alpha} and β representing the distribution that fits the observed distances of the sites in the initial motif (one can also start with the uniform distribution by setting Formula and Formula ). This distribution is used to sample the sequence space until a new motif is found with greater MS score. The distances of the sites in this motif is used to re-compute Formula and Formula , and the resulting distribution is used to sample the sequence space. This process continues until the final motif is obtained. If we start with a uniform sampling distribution, the dynamic update step of the sampling distribution can be delayed until a good initial motif configuration is found (e.g. a motif configuration with a positive MS score).

2.5 Motif Sampler
Motifs are sampled using a simulated annealing based Gibbs sampling algorithm. The algorithm is fully described in Supplementary Algorithm 1. The algorithm updates the motif configuration by visiting sites randomly according to the proposal distribution described in Section 2.4. Motifs are scored using the motif model described in Section 2.1.


    3 RESULTS AND DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODOLOGY
 3 RESULTS AND DISCUSSION
 4 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
In 2005, Tompa et al. introduced a collection of datasets for assessing the performance of computational tools that predict transcription factor binding sites. These datasets comprised of eukaryotic binding sites belonging to 52 transcription factors representing four different species, 6 belonging to fly, 26 belonging to human, 12 belonging to mouse and 8 belonging to yeast. The binding sites of each transcription factor were presented in three different background models, ‘real’, ‘generic’ and ‘Markov’. For each of the different type of background model, the known positions and orientations of the binding sites were kept unchanged. Four additional datasets of type ‘Markov’ containing no planted binding sites were included as negative controls.

Using these datasets Tompa et al. evaluated the performance of 13 different computational tools for de novo prediction of regulatory elements. We evaluated the MonoDi nucleotide model on these datasets according to the performance measures described in Tompa et al. and compared it to the 13 tools evaluated in that paper. The comparative results of this evaluation are described below. More details of the datasets and the different statistics used to assess tool performance quality can be found in Tompa et al. (2005).

Figure 3 shows the comparative results for the seven statistics (see Supplementary Materials for details), nucleotide-level sensitivity (nSn), nucleotide-level positive predictive value (nPPV), nucleotide-level performance coefficient (nPC), nucleotide-level correlation coefficient (nCC), site-level sensitivity (sSn), site-level positive predictive value (sPPV) and site-level average site performance (sASP)—summarized over all 56 datasets (regardless of species or background model). The MonoDi nucleotide model outperforms the other tools in all indicators except for the xPPV in which it falls behind Weeder. However, as explained in Tompa et al. (2005), the xPPV values tend to be exaggerated for those programs that make no predictions on datasets. The MonoDi-nucleotide model had a motif prediction on every dataset it attempted a prediction while Weeder on the other hand had 17 datasets which it predicted no motif (see Supplementary Table 1) giving it a biased advantage on the xPPV scale.


Figure 3
View larger version (39K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. Combined measures of correctness over all 56 datasets.

 
Figure 4 gives the breakdown of the performance, highlighted by the correlation coefficient, nCC, of each tool on the datasets of the different species (regardless of the background model). From Figure 4, we can see that the MonoDi-nucleotide model does better than the other tools in all species except for yeast. The three tools, Weeder, MotifSampler and MEME do better than the MonoDi-nucleotide model on the yeast datasets. We believe that the reason that Mono-di model works better for fly, mouse and human and less so for yeast is because the yeast genome is less biased in di nucleotide frequencies (Gentles et al., 2001).


Figure 4
View larger version (31K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4. Correlation coefficient (nCC) by species.

 
Figure 5 breaks down the datasets according to the different background models, ‘real’, ‘generic’ and ‘Markov’ (regardless of the species). The MonoDi-nucleotide model performs better than the other tools in predicting motifs in the ‘real’ and ‘Markov’ background models. It does better than the other tools for the ‘generic’ background model as well, except for Weeder which performance slightly better.


Figure 5
View larger version (30K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5. Correlation coefficient (nCC) by background model.

 
Further comparative evaluations of the MonoDi-nucleotide model is presented in the Supplementary Material. Supplementary Figure 1 shows the comparative performance of the MonoDi-nucleotide model on the datasets with real background models as measured by the seven statistics of Figure 3. Supplementary Figure 2 replicates the same seven statistics for the analyzed tools over just the datasets with generic and Markov background models. Supplementary Figure 3 provides an idea of the performance of the individual tools when considered in absence of the yeast datasets. In all these evaluations, we see that the MonoDi-nucleotide model outperforms the other tools.

Significant motifs of the MonoDi-nucleotide model were selected based on their MS score. We considered motifs with an MS score cutoff of 4.5 or greater for the above evaluation. The value 4.5 was selected because it yielded comparatively similar amounts of predicted motifs as the other tools analyzed. Increasing this value increases the quality of the predicted motifs and hence the performance of the MonoDi-nucleotide model however it reduces the number of motifs predicted. Reducing the MS cutoff score on the other hand results in more datasets with predicted motifs but with a less average quality. Supplementary Figures 4–9 shows the performance indicators of Figures 3–5 and Supplementary Figures 1–3 with the predictions from the MonoDi-nucleotide model taken with a conservative MS cutoff of 5 and a more liberal MS cutoff of 3.5.


    4 CONCLUSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODOLOGY
 3 RESULTS AND DISCUSSION
 4 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
We have introduced the a new model for detecting TF binding sites based on an optimal combination of mono- and/or di-nucleotide log likelihood scores. The model outperforms many of the contemporary motif detection tools as revealed in Figure 3 and others. The MonoDi-nucleotide model uses a simple construction that extends the well established mononucleotide weight matrix model to incorporate dinucleotide information from both the motif and background. The MonoDi-nucleotide model described here does ‘de novo’ prediction of binding sites, but can be easily adapted with training data to scan for binding sites of known transcription factors.

Identifying TF binding sites is a challenging task, due mainly to our lack of understanding of the true nature of the signal(s) that discriminate these sites from background DNA. It is evident that models such as position specific weight matrices are too restrictive in many instances in representing the degenerate nature of TF binding sites. In order to better discriminate TF binding sites, we need to look at more subtle signals that distinguish these sites from background DNA. The relative difference in dinucleotide content that manifest as strong nucleotide associations in TF binding sites and background DNA is one such signal which the MonoDi-nucleotide model exploits when looking for TF binding sites. Discriminative signals such as nucleotide associations are not universal and hence are not evident in the binding sites of all transcription factors. Neither are they homogeneous in the nature of their occurrence across TF binding sites of different transcription factors. This is why it is necessary to develop flexible models such as the MonoDi-nucleotide model whose topology can adapt to the best discriminatory topology of the binding sites based on the optimum partitioning of the base positions of the sites into independent mono nucleotide and/or correlated dinucleotide segments.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODOLOGY
 3 RESULTS AND DISCUSSION
 4 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
This work was supported in part by new faculty start-up funds from the University of Toronto to Z.Z. and grants from Genome Canada through the Ontario Genomics Institute.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: John Quackenbush

Received on March 23, 2007; revised on November 30, 2007; accepted on December 17, 2007

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODOLOGY
 3 RESULTS AND DISCUSSION
 4 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Benos PV, et al. Additivity in protein-DNA interactions: how good an approximation is it? Nucleic Acids Res (2002) 30:4442–4451.[Abstract/Free Full Text]

    Barash Y, et al. Modeling dependencies in protein-DNA binding sites. In: In Proc. of RECOMB-03. (2003) 28–37.

    Berger MF, et al. Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities. Nat. Biotechnol (2006) 24:1429–1435.[CrossRef][Web of Science][Medline]

    Dickerson RE, Drew HR. Structure of a B-DNA dodecamer. II. Influence of base sequence on helix structure. J. Mol. Biol (1981) 149:761–786.[CrossRef][Web of Science][Medline]

    Dickerson RE. Base sequence and helix structure variation in B and A DNA. J. Mol. Biol (1983) 166:419–441.[Web of Science][Medline]

    O'Flanagan RA, et al. Non-additivity in protein-DNA binding. Bioinformatics (2005) 21:2254–2263.[Abstract/Free Full Text]

    Gunewardena SSA, et al. Enhancing the prediction of transcription factor binding sites by incorporating structural properties and nucleotide co-variations. J. Comput. Biol (2006) 13:929–945.[CrossRef][Web of Science][Medline]

    Gentles AJ, Karlin S. Genome-scale compositional comparisons in eukaryotes. Genome Res (2001) 11:540–546.[Abstract/Free Full Text]

    Hahn GJ, Shapiro SS. Statistical Models in Engineering. (1994) New York: John Wiley & Sons. 95.

    Hertz GZ, Stormo GD. Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics (1999) 15:563–577.[Abstract/Free Full Text]

    Man TK, Stormo GD. Non-independence of Mnt repressor-operator interaction determined by a new quantitative multiple fluorescence relative affinity (QuMFRA) assay. Nucleic Acids Res (2001) 29:2471–2478.[Abstract/Free Full Text]

    Nussinov R. Promoter helical structure variation at the Escherichia coli polymerase interaction sites. J. Biol. Chem (1984) 259:6798–6805.[Abstract/Free Full Text]

    Neuwald AF, et al. Gibbs motif sampling: Detection of bacterial outer membrane protein repeats. Protein Sci (1995) 4:1618–1632.[Web of Science][Medline]

    Prestridge DS. Signal Scan: a computer program that scans DNA sequences for eukaryotic transcriptional elements. Comput. Applic. Biosci (1991) 7:203–206.[Abstract/Free Full Text]

    Ponomarenko MP, et al. Oligonucleotide frequency matrices addressed to recognizing functional DNA sites. Bioinformatics (1999) 15:631–643.[Abstract/Free Full Text]

    Qian J, et al. Identification of regulatory targets of tissue-specific transcription factors: application to retina-specific gene regulation. Nucleic Acids Res (2005) 33:3479–3491.[Abstract/Free Full Text]

    Quandt K, et al. MatInd and Matinspector: new fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucleic Acids Res (1995) 23:4878–4884.[Abstract/Free Full Text]

    Scarborough JB. Numerical Mathematical Analysis. (1955) 3rd. Baltimore: JJohns Hopkins Press. 203–206.

    Schug J, Overton GC. TESS: Transcription Element Search Software on the WWW. Technical Report CBIL-TR-1997-1001-v0.0. Computational Biology and Informatics Laboratory, (1997) School of Medicine, University of Pennsylvania.

    Stormo GD, et al. Quantitative analysis of the relationship between nucleotide sequence and functional activity. Nucleic. Acids Res (1986) 14:6661–6679.[Abstract/Free Full Text]

    Tompa M, et al. Assessing computational tools for the discovery of transcription factor binding sites. Nat. Biotechnol (2005) 23:137–144.[CrossRef][Web of Science][Medline]

    Udalova IA, et al. Quantitative prediction of NF-kB DNA-protein interactions. Proc. Natl Acad. Sci. USA (2002) 99:8167–8172.[Abstract/Free Full Text]

    Wolfe SA, et al. Analysis of zinc fingers optimized via phage display: evaluating the utility of a recognition code. J. Mol. Biol (1999) 285:1917–1934.[CrossRef][Web of Science][Medline]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
24/4/484    most recent
btm629v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Gunewardena, S.
Right arrow Articles by Zhang, Z.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Gunewardena, S.
Right arrow Articles by Zhang, Z.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?