Bioinformatics Advance Access originally published online on April 7, 2005
Bioinformatics 2005 21(11):2636-2643; doi:10.1093/bioinformatics/bti402
A boosting approach for motif modeling using ChIP-chip data
1Department of Statistics, Harvard University, Cambridge, MA 02138, USA
2Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, USA
*To whom correspondence should be addressed.
| Abstract |
|---|
Motivation: Building an accurate binding model for a transcription factor (TF) is essential to differentiate its true binding targets from spurious ones. This is an important step toward understanding gene regulation.
Results: This paper describes a boosting approach to modeling TF-DNA binding. Unlike the widely used weight matrix model, which predicts TF-DNA binding based on a linear combination of position-specific contributions, our approach builds a TF binding classifier by combining a set of weight matrix based classifiers, thus yielding a non-linear binding decision rule. The proposed approach was applied to the ChIP-chip data of Saccharomyces cerevisiae. When compared with the weight matrix method, our new approach showed significant improvements in specificity in a majority of cases.
Contact: wwong{at}hsph.harvard.edu
Supplementary information: The software and the Supplementary data are available at http://biogibbs.stanford.edu/~hong2004/MotifBooster/.
| 1 INTRODUCTION |
|---|
With the continuing explosive growth of sequenced genomes and genome-wide mRNA expression data, scientists are increasingly interested in modeling regulatory motifs and predicting binding targets of transcription factors (TFs). In this paper, we propose a discriminant approach that builds models to distinguish positive sequences (i.e. binding targets of a TF) from negative sequences (i.e. non-targets of a TF). Several approaches for this discriminant task have been proposed previously. DMotifs applies an enumerative search of the motif space and reports the best motif as a feature of the sequences that best differentiates positive from negative sequences (Sinha, 2002). Vilo et al. (2000) used a binomial formula for significance test to evaluate the occurrences of a motif in positive sequences against those in negative sequences. Similar to the approach of Vilo et al. (2000) the random selection null hypothesis approach in Barash et al. (2001) tests the significance of a motif against negative sequences based on a hypergeometric distribution. Takusagawa and Gifford (2004) extended the works of Vilo et al. (2000) and Barash et al. (2001) to consider the effects of the lengths of sequences. The above approaches report motifs as consensus words, which are arguably less sensitive and precise than the corresponding weight matrix representations (Stormo et al., 1982).
Since the pioneering work of Stormo et al. (1982) the weight matrix model has become one of the most widely used models for representing motifs. A popular approach to estimating the parameters of a weight matrix de novo is to find a statistically enriched motif in positive sequences with respect to a background model (Stormo and Hartzell, 1989; Lawrence and Reilly, 1990; Lawrence et al., 1993; Liu et al., 1995; Barash et al., 2001). The background model, which usually is defined as an n-th order Markov model (n = 0, 1, 2 or 3), tries to capture all information in the non-binding sites that are much more heterogeneous than the binding sites. Such a background model is so general that the weight matrix model tends to have very low specificity. To better identify the non-binding sites that are very similar to the binding sites, Workman and Stormo (2000) proposed a discriminant method called ANN-Spec, which uses a Perceptron model and Gibbs sampling to train the weight matrix. They showed that the weight matrix models output by ANN-Spec have higher specificity than those built by non-discriminant approaches, such as MEME (Bailey and Elkan, 1994).
A motif reported as a weight matrix assumes that different positions of the motif are independent. Under this assumption, a weight matrix is essentially a linear classifier when used with a cutoff value to predict binding sites in sequences. Recent biological studies have demonstrated that individual positions of binding sites are not always independent (Bulyk et al., 2001, 2002; Man and Stormo, 2001), and suggested that some TFs recognize their targets in a non-linear fashion. Barash et al. (2003) adopted Bayesian networks to model dependencies in binding motifs as trees and mixtures of trees. The Bayesian tree model is similar to the one used in an early work by Agarwal and Bafna (1998) to model the dependency between bases. It was recently reported (Zhou and Liu, 2004) that a simpler pair-correlation model can largely account for all observed correlations among motif positions, and that using such a model in conjunction with the Gibbs sampling method does not suffer from overfitting. However, such a model still cannot accommodate some non-linear factors in discriminating positive and negative sequences.
It is widely accepted that a TF participates in controlling the mRNA levels of its target genes through its binding sites in the corresponding promoter regions. Hence, the REDUCE method (Bussemaker et al., 2001) and Motif Regressor (Conlon et al., 2003) were proposed to discover motifs by associating motif abundances with real-valued changes in genome-wide expression data. The REDUCE method enumerates all K-mers (DNA segments of length K) and checks whether the combinatorial effects of a set of K-mers can be used to explain changes of gene-expression data in a regression manner. Motif Regressor first uses MDSCAN (Liu et al., 2002) to generate a large set of matrix-based motif candidates that are enriched in the promoter regions of genes with the highest fold changes in gene expression data. Then it uses regression analyses to select motif candidates that are most relevant to the change of gene expressions. Nevertheless, neither approach exploits the potential of using negative sequences to change the parameters of a motif so as to increase the specificity of the model.
We propose a novel discriminant approach to enhance TF-DNA binding models using the boosting technique. First, we use the ChIP-chip data to select positive and negative sequences. In ChIP-chip experiments, DNA is crosslinked in vivo to proteins at sites of DNA-protein interaction and sheared to fragments of 500 bp to 2 kb. The DNA-protein complexes are precipitated by antibodies specific to the TF of interest. The precipitated protein-bound DNA fragments are PCR amplified, fluorescently labeled and hybridized to microarrays containing every promoter (sometimes also every ORF) in the genome. DNA fragments that are consistently enriched by ChIP-chip over repeated experiments are identified as positive sequences containing the protein-DNA interacting loci at approximately 1 kb resolution. When compared with the gene-expression data, the ChIP-chip data provide much more accurate information about the genome-wide location of in vivo TF-DNA interactions, which enables us to assign definitive class labels to some promoter sequences with high confidence. Consequently, we can model the TF-DNA binding problem as a classification problem. We modify the confidence-rated boosting (CRB) algorithm (Schapire and Singer, 1999) to train a TF-DNA binding classifier as an ensemble model, which is a weighted combination of a set of base classifiers. The modified CRB algorithm automatically decides the number of base classifiers to be used so as to avoid overfitting. A key aspect of the boosting technique is that it forces some of the base classifiers to focus on the boundary between positive and negative samples, thus effectively reducing classification errors. We demonstrate the power of this approach by its performance on the ChIP-chip data of Saccharomyces cerevisiae (Lee et al., 2002).
| 2 METHODS |
|---|
2.1 The ensemble model
We define a TF-DNA binding model as a weighted combination of a set of base classifiers $\{q_m(\cdot)\}$:

$$Q(S_i) = \sum_{m} \alpha_m \, q_m(S_i) \qquad (1)$$

where $\alpha_m$ is the weight of $q_m(\cdot)$. The model weights can be normalized so that they sum up to 1. The class label of a DNA sequence $S_i$ is decided by $\mathrm{sign}(Q(S_i))$, with +1 denoting that $S_i$ is a positive sequence. The base classifier has its root in the weight matrix method (Stormo et al., 1982). Let $f_m(\cdot)$ be the weight matrix model on which $q_m(\cdot)$ is based, and let the set $\{s_{ij}\}$ represent all K-mers in a DNA sequence $S_i$. The score of a K-mer $s_{ij}$, given $f_m(\cdot)$, is:

$$f_m(s_{ij}) = \sum_{k=1}^{K} \sum_{b \in \{A,C,G,T\}} \theta^{m}_{k,b} \, I_{k,b}(s_{ij}) - t \qquad (2)$$

where (1) $\theta^{m}_{k,b}$ is the parameter (in the logarithm scale) of the model $f_m(\cdot)$ for the nucleotide b at position k; (2) $I_{k,b}(s_{ij}) = 1$ if the k-th base of $s_{ij}$ is b and $I_{k,b}(s_{ij}) = 0$ otherwise; (3) t is a threshold decided by some criterion (e.g. P-value). The higher the score, the more likely a site will be bound by the TF. The weight matrix model decides $s_{ij}$ as a target of the TF if $f_m(s_{ij}) > 0$ and as a non-target site otherwise. We will show later that the threshold can be embedded into the parameter matrix $\theta^{m}$.
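As a minimal sketch of the site-level scoring rule in Equation (2), the snippet below scores every K-mer of a sequence against a toy log-scale weight matrix and reports the sites whose score exceeds zero. The matrix values, the names `THETA`, `kmers`, `site_score` and `predicted_sites`, and the threshold used in the example are illustrative assumptions, not values from the paper.

```python
# Hypothetical log-scale weight matrix for a motif of width K = 3.
# THETA[k][b] is the contribution of base b at position k (made-up values).
THETA = [
    {"A": 1.2, "C": -0.8, "G": -0.5, "T": -1.0},
    {"A": -0.9, "C": 1.5, "G": -0.7, "T": -0.6},
    {"A": -1.1, "C": -0.4, "G": 1.0, "T": 0.3},
]


def kmers(sequence, k):
    """Return every length-k window of the sequence, left to right."""
    return [sequence[j:j + k] for j in range(len(sequence) - k + 1)]


def site_score(site, theta, t):
    """Sum of position-specific contributions minus the threshold t."""
    return sum(theta[k][base] for k, base in enumerate(site)) - t


def predicted_sites(sequence, theta, t):
    """Sites called as targets: those whose score is strictly positive."""
    k = len(theta)
    return [s for s in kmers(sequence, k) if site_score(s, theta, t) > 0]
```

With the toy matrix and a threshold of 2.0, only the window `ACG` in `TACGT` scores above zero, so it is the only predicted site.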
In many situations (e.g. ChIP-chip experiments), we only have information about whether a DNA sequence is bound by a TF, but do not know which sites in the sequence the TF binds to. Hence, given a weight matrix, we need to derive a scoring function to assess the likelihood of a DNA sequence as a target of a TF. This score should be affected by: (1) the number of matching sites in the sequence; and (2) the degree of the match for each matching site. The following function takes both factors into account and scores a sequence as:
![]() | (3) |
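The base classifier $q_m(\cdot)$ transforms the score of a sequence with a hyperbolic tangent function to a soft class prediction. Writing $h_m(S_i)$ for the sequence score of Equation (3):

$$q_m(S_i) = \tanh\big(h_m(S_i)\big) \qquad (4)$$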
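The sketch below illustrates how site scores could be turned into a soft sequence-level prediction and then combined as in Equation (1). Because the formula of Equation (3) is not preserved in this copy, the aggregate used here (the logarithm of the summed exponentials of the r best site scores) is an assumption chosen only because it grows with both the number and the strength of matching sites; the function names are likewise hypothetical.

```python
import math


def sequence_score(site_scores, r):
    """Assumed aggregate: log of the summed exp of the r best site scores.

    site_scores must be non-empty.
    """
    best = sorted(site_scores, reverse=True)[:r]
    peak = best[0]
    # Subtract the peak before exponentiating for numerical stability.
    return peak + math.log(sum(math.exp(s - peak) for s in best))


def base_classifier(site_scores, r):
    """Soft class prediction in (-1, 1) via the hyperbolic tangent."""
    return math.tanh(sequence_score(site_scores, r))


def ensemble(predictions, weights):
    """Weighted combination of base-classifier outputs, as in Equation (1)."""
    return sum(a * q for a, q in zip(weights, predictions))
```

Under this assumed aggregate, two sites with score 1 give a higher sequence score than one such site plus a poor match, which matches the two requirements stated for the scoring function.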
2.2 Learn the ensemble model via boosting
We adopt the CRB algorithm (Schapire and Singer, 1999) to perform the following tasks in building an ensemble model $Q(\cdot)$: (1) deciding the number of linear classifiers $q_m(\cdot)$ in $Q(\cdot)$ and (2) learning the parameters of each $q_m(\cdot)$ and its weight $\alpha_m$. Loosely speaking, in the first round, the CRB algorithm assigns equal weights to all samples and trains the first base classifier. In each of the rounds that follow, the boosting procedure gives higher weights to previously misclassified samples and learns a new base classifier with its weight using the reweighted samples. The final classifier is a linear assembly of weighted base classifiers from each round.
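The reweighting loop described above can be sketched as follows. This is a generic confidence-rated boosting loop over a fixed pool of candidate classifiers, not the modified algorithm of Figure 1: the candidate pool, the function names, and the closed-form weight `0.5 * log((1 + r) / (1 - r))` (the choice given by Schapire and Singer for predictions in [-1, 1]) are assumptions made for illustration, whereas the paper refines the weight by a gradient-like method.

```python
import math


def weighted_edge(h, samples, labels, weights):
    """Weighted correlation between the labels and the outputs of h."""
    return sum(w * y * h(x) for w, y, x in zip(weights, labels, samples))


def boost(samples, labels, candidates, rounds):
    """Confidence-rated boosting over a fixed pool of candidate classifiers.

    Each candidate maps a sample to a real value in [-1, 1]; labels are +1/-1.
    """
    n = len(samples)
    weights = [1.0 / n] * n              # first round: equal sample weights
    model = []                           # list of (alpha, classifier) pairs
    for _ in range(rounds):
        h = max(candidates,
                key=lambda c: weighted_edge(c, samples, labels, weights))
        r = weighted_edge(h, samples, labels, weights)
        if r <= 0.0 or r >= 1.0:         # no useful edge, or alpha would diverge
            break
        alpha = 0.5 * math.log((1.0 + r) / (1.0 - r))
        model.append((alpha, h))
        # Samples the new classifier gets wrong (y * h(x) < 0) gain weight.
        weights = [w * math.exp(-alpha * y * h(x))
                   for w, y, x in zip(weights, labels, samples)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model


def classify(model, x):
    """Sign of the weighted sum of base-classifier outputs."""
    return 1 if sum(alpha * h(x) for alpha, h in model) > 0 else -1
```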
We made some modifications to the CRB algorithm to serve our purpose better. The modified CRB algorithm is outlined in Figure 1. Our first change accommodates the unbalanced training set (the number of negative samples is much larger than that of positive ones) by assigning larger initial weights to the positive samples. Second, to prevent overfitting, we reserve some training sequences for internal testing during training. The details of our implementation are explained in the next section.
| 3 IMPLEMENTATION |
|---|
3.1 Initialize the weights of sequences
In our study, the number of negative sequences (usually in thousands) is often much larger than the number of positive ones (usually <100). Without proper adjustments, negative sequences would overwhelm a classifier and reduce its capability of recognizing positive sequences. As a remedy, we constrain the total weight of the positive sequences to be equal to that of the negative sequences (step b in Fig. 1). The sequences within each class have equal weights. This in effect imposes a higher penalty for misclassifying a positive sequence than for misclassifying a negative one. Note that this heuristic is not equivalent to increasing the number of positive observations.
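A small sketch of this class-balanced initialization is shown below. The normalization (each class receives a total weight of 0.5, so that all weights sum to 1) and the function name are assumptions; the text only requires the two class totals to be equal and the weights within a class to be uniform.

```python
def initial_weights(labels):
    """Give each class a total weight of 0.5, split evenly within the class.

    labels holds +1 for positive and -1 for negative sequences, and both
    classes must be present.
    """
    n_pos = sum(1 for y in labels if y == +1)
    n_neg = len(labels) - n_pos
    return [0.5 / n_pos if y == +1 else 0.5 / n_neg for y in labels]
```

With one positive and four negative sequences, the positive sequence starts with weight 0.5 and each negative sequence with weight 0.125, so misclassifying the positive costs four times as much.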
3.2 Learn base classifiers
The CRB algorithm (Schapire and Singer, 1999) is a Newton-like algorithm that constructs an ensemble model to minimize the upper bound on misclassification error

$$\sum_{i} w_i^{(0)} \exp\big(-y_i \, Q(S_i)\big) \qquad (5)$$

where $w_i^{(0)}$ is the initial weight of $S_i$ and $y_i$ is the class label of $S_i$. Friedman et al. (2000) discuss in detail the rationale for choosing this criterion. In the m-th round, the CRB algorithm trains $q_m(\cdot)$ and its weight $\alpha_m$ to minimize the weighted error:

$$\sum_{i} w_i^{(m)} \exp\big(-\alpha_m \, y_i \, q_m(S_i)\big) \qquad (6)$$

where $w_i^{(m)}$ is the weight of $S_i$ in the m-th round. In our case, the parameters to be estimated in each round include the weight $\alpha_m$, the number of representative sites r and the weight matrix parameters $\theta^{m}$. Basically, at step c.1 in Figure 1, we increase r from 1 to R (currently R = 5) by the step size 1. For each value of r, the parameters $\alpha_m$ and $\theta^{m}$ are initialized and refined to minimize the weighted error. Finally, the m-th round reports the values of r, $\alpha_m$ and $\theta^{m}$ that correspond to the minimum weighted error.
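To make the relation between the overall criterion and the per-round weighted error concrete, the sketch below assumes the standard exponential-loss form of confidence-rated boosting: the overall loss is the weighted sum of exp(-y * Q), and the round weights are the initial weights multiplied by exp(-y * score so far). The function names are hypothetical, and the exact form of the paper's equations is an assumption since the equation images are not preserved.

```python
import math


def exponential_loss(weights, labels, scores):
    """Weighted sum of exp(-y * score); an upper bound on the weighted error."""
    return sum(w * math.exp(-y * s) for w, y, s in zip(weights, labels, scores))


def weighted_error_rate(weights, labels, scores):
    """Total weight of samples whose score has the wrong sign (or is zero)."""
    return sum(w for w, y, s in zip(weights, labels, scores) if y * s <= 0)


def round_weights(initial, labels, scores_so_far):
    """Sample weights entering a round, given the ensemble scores so far."""
    return [w * math.exp(-y * s)
            for w, y, s in zip(initial, labels, scores_so_far)]
```

Under these assumptions, the loss of the ensemble after adding a new classifier with weight alpha equals the round-weighted loss of `alpha * q` alone, which is why each round can be optimized separately.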
3.2.1 Initialization
Since the motif must be an enriched pattern in the positive sequences, we take advantage of Motif Regressor (Conlon et al., 2003) to generate a good seed weight matrix for initializing the weight matrix parameters $\theta^{m}$. The seed weight matrix, reported by Motif Regressor, has the best correlation between the logarithm of ChIP-chip P-value and motif-matching score of all training sequences. Given a value of r, we initialize the classifier weight as $\alpha_m^{(0)} = 1$ and initialize $\theta^{m}$ from the seed weight matrix, a random perturbation for each position k and base b generated in the range [-0.2, 0.2], and the threshold t of Equation (2). The value of t is determined as follows. We first use the seed matrix to score all sites in the training sequences and obtain the minimum and maximum site scores as $t_{min}$ and $t_{max}$. Then, we increase t from $t_{min}$ to $t_{max}$ by the step size 0.1 and select the value that corresponds to the minimum weighted error under the current values of r and $\alpha_m$.
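The two initialization ingredients described above, a small random perturbation of the seed matrix and a grid search over the threshold, can be sketched as follows. The additive form of the perturbation, the data layout (one dictionary per motif position) and the names `perturb_seed`, `choose_threshold` and `error_fn` are assumptions; `error_fn` stands in for whatever routine evaluates the weighted error at a given threshold.

```python
import random


def perturb_seed(seed_matrix, rng, spread=0.2):
    """Add uniform noise in [-spread, spread] to every entry of the seed matrix."""
    return [{b: v + rng.uniform(-spread, spread) for b, v in column.items()}
            for column in seed_matrix]


def choose_threshold(error_fn, t_min, t_max, step=0.1):
    """Scan t from t_min to t_max in fixed steps; return the t with the lowest error."""
    n_steps = int(round((t_max - t_min) / step))
    grid = [t_min + i * step for i in range(n_steps + 1)]
    return min(grid, key=error_fn)
```

A call such as `choose_threshold(lambda t: (t - 1.3) ** 2, 0.0, 3.0)` returns the grid point closest to 1.3, which shows the scan on a toy error curve.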
3.2.2 Refinement
The weight matrix parameters $\theta^{m}$ and the classifier weight $\alpha_m$ are iteratively refined by a gradient-like method. In the n-th iteration (n ≥ 1), the weight matrix from the previous iteration is used to find the best r sites in each sequence as its representative sites, and both $\theta^{m}$ and $\alpha_m$ are updated based on the corresponding gradients of the weighted error, i.e.:
![]() | (7) |
The two step sizes in Equation (7) are set to 0.05 and 0.1 based on our experience. The iteration stops if (1) the weighted error increases, (2) the improvement of error is <0.0001 or (3) the maximum number of iterations (currently 100) is reached. Note that a site $s_{ij}$ is now scored by the weight matrix term alone, which is slightly different from Equation (2): the threshold t in Equation (2) is absorbed by $\theta^{m}$ and is updated implicitly.
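The stopping rules above can be illustrated with a deliberately simplified sketch that refines only the classifier weight by gradient descent on a weighted exponential error, holding the base-classifier outputs fixed. The update of the weight matrix is omitted, the step size of 0.1 is assigned to the weight by assumption, and the function name and error form are illustrative rather than the paper's Equation (7).

```python
import math


def refine_alpha(weights, labels, outputs, alpha=1.0, step=0.1,
                 tol=1e-4, max_iter=100):
    """Gradient descent on E(alpha) = sum_i w_i * exp(-alpha * y_i * q_i)."""

    def error(a):
        return sum(w * math.exp(-a * y * q)
                   for w, y, q in zip(weights, labels, outputs))

    def gradient(a):
        return sum(-w * y * q * math.exp(-a * y * q)
                   for w, y, q in zip(weights, labels, outputs))

    current = error(alpha)
    for _ in range(max_iter):            # rule (3): cap on iterations
        candidate = alpha - step * gradient(alpha)
        new_error = error(candidate)
        if new_error > current:          # rule (1): error increased, keep old alpha
            break
        improvement = current - new_error
        alpha, current = candidate, new_error
        if improvement < tol:            # rule (2): negligible improvement
            break
    return alpha, current
```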
3.3 Prevent overfitting
A main challenge with the small number of positive samples is that one can easily overtrain the classifiers. Our strategy to alleviate this effect is to reserve a subset of the negative training sequences (5% in our current setting) and one positive training sequence for internal validation during training. The sequences are randomly selected. The weight of each reserved sequence is set as the initial weight of a training sequence with the same class label. Overfitting is checked using the reserved data at step c.3 in Figure 1. The boosting procedure will stop, if adding one more base classifier increases the error [as defined in Equation (5)] for the reserved sequence set.
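The stopping check on the reserved sequences can be sketched as below. The sketch assumes that the outputs of each round's base classifier on the reserved sequences are already available, always keeps the first classifier, and stops as soon as adding the next one would raise the exponential loss on the reserved set; the function names and data layout are hypothetical.

```python
import math


def holdout_loss(weights, labels, scores):
    """Exponential loss of the current ensemble on the reserved sequences."""
    return sum(w * math.exp(-y * s) for w, y, s in zip(weights, labels, scores))


def rounds_to_keep(alphas, round_outputs, weights, labels):
    """Count how many base classifiers to keep.

    round_outputs[m][i] is the output of the m-th base classifier on
    reserved sequence i; alphas[m] is its weight in the ensemble.
    """
    scores = [0.0] * len(labels)
    best = None
    kept = 0
    for alpha, outputs in zip(alphas, round_outputs):
        trial = [s + alpha * q for s, q in zip(scores, outputs)]
        loss = holdout_loss(weights, labels, trial)
        if best is not None and loss > best:
            break                        # one more classifier hurts: stop
        scores, best, kept = trial, loss, kept + 1
    return kept
```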
Sometimes, the ensemble model may have only one base classifier, say $q_1(\cdot)$. In that case, we also build an alternative base classifier from the seed weight matrix, with its number of representative sites and its threshold decided by the initialization method (without the random perturbation) described in Section 3.2. The weight of this alternative classifier is set as 1. We compare it with $q_1(\cdot)$ and choose the one with a smaller weighted error as defined in Equation (5). The rationale for this step is that the current way of training base classifiers may not find the best one. This limitation can be amended by a weighted combination of multiple base classifiers. If the final model has only one base classifier, the alternative classifier could be a better choice.
| 4 RESULTS |
|---|
4.1 Data
We used the ChIP-chip data reported in Lee et al. (2002). Positive sequences are selected using ChIP-chip P-value 0.001 as the cutoff. At this cutoff, the false positive rate is 6-10% and the false negative rate is approximately 33% (Lee et al., 2002). Although the data are still noisy, they are the best genome-wide data of in vivo TF-DNA binding localization so far. To avoid having too few positive samples, we also required that each selected TF should have at least 25 positive sequences. Forty TFs (Lee et al., 2002) satisfy these criteria. Negative sequences were selected as those with ChIP-chip ratio ≤ 1 and ChIP-chip P-value ≥ 0.05. Each selected TF has on the order of 3000 negative sequences. For each gene, we take its upstream sequence, up to 800 bp, not overlapping with the previous gene.
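The selection rules in this section can be sketched as a small labeling routine. The record layout (name, ChIP-chip ratio, P-value), whether the comparisons are strict or inclusive, the forward-strand half-open coordinates, and the function names are all assumptions made for illustration; promoters that satisfy neither rule are simply left unlabeled.

```python
def label_promoters(records, pos_p=0.001, neg_ratio=1.0, neg_p=0.05):
    """Split promoters into positive and negative sets.

    records is a list of (name, chip_ratio, p_value) tuples.
    """
    positives = [name for name, ratio, p in records if p <= pos_p]
    negatives = [name for name, ratio, p in records
                 if ratio <= neg_ratio and p >= neg_p]
    return positives, negatives


def upstream_region(gene_start, previous_gene_end, max_len=800):
    """Half-open interval of up to max_len bases upstream of a gene.

    The region is clipped so that it does not overlap the previous gene.
    """
    start = max(previous_gene_end, gene_start - max_len)
    return start, gene_start
```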
4.2 Boosting improves the specificity of motif models
To evaluate our method, we used the following cross-validation procedure. In each run, we leave one positive sequence and 5% of randomly selected negative sequences as the test data and train a classifier on the remaining data. This procedure is repeated 10 times for each positive sequence. The cross-validation error of each run is calculated as the number of false positives if the number of the false negatives is zero. The results are then averaged for all runs and compared. The detailed data, which include the sequence data, the ensemble models of the TFs, the logos of the ensemble models and all the test results, are available as the Supplementary data at http://biogibbs.stanford.edu/~hong2004/MotifBooster/.
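One reading of the per-run error measure is sketched below: the decision cutoff is placed at the lowest score among the held-out positive sequences, so that no positive is missed, and the error is the number of held-out negative sequences scoring at least as high. This interpretation, the tie-handling, and the function name are assumptions rather than details given in the text.

```python
def false_positives_at_full_sensitivity(positive_scores, negative_scores):
    """Count negatives scoring at or above the lowest-scoring positive."""
    cutoff = min(positive_scores)
    return sum(1 for s in negative_scores if s >= cutoff)
```

For a single held-out positive with score 2.0 and negatives scoring 0.1, 2.5, 1.9 and 3.0, the error of the run is 2.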
We used Motif Regressor (Conlon et al., 2003) to find the seed weight matrix. For each TF, Motif Regressor called MDSCAN (Liu et al., 2002) to find candidate motifs of width 6-17 bases. At each width, MDSCAN reported the best 20 weight matrices enriched in the positive training sequences. Each weight matrix was used to score the training sequences. Motif Regressor then performed simple linear regression between the logarithm of ChIP-chip P-values and sequence scores. We chose the motif corresponding to the best regression P-value as our seed motif. We observed that Motif Regressor did not find significant enough motifs for nine TFs (DIG1, GAL4, GAT3, GCR2, IME4, IXR1, NDD1, PHO4 and ROX1). It is possible that under the asynchronous growth condition, these TFs were not activated, or that the modified tagged TFs have changed their binding characteristics. Table 1 summarizes the results for the remaining 31 TFs. Compared with the weight matrix reported by Motif Regressor, the ensemble models performed markedly better in 27 cases and equally well in 4 cases (FKH1, FKH2, RLM1 and YAP6). A closer examination of the four even cases reveals that each ensemble model has only one base classifier, which is a direct conversion from the initial weight matrix.
The boosting approach also reported final models with a single base classifier in 5 of the 27 cases that performed better. These five TFs are CIN5, MBP1, NRG1, SKN7 and STE12. Since the base classifier is equivalent to a weight matrix model, these results indicate that using negative information can help discover better weight matrices in many cases. This is consistent with the findings of Workman and Stormo (2000). However, the first base classifier does not always perform better than the initial weight matrix. Table 2 summarizes the contributions of the base classifiers for the cases where the boosting method selected more than one base classifier. The base classifiers in the final models are arranged in descending order of their weights. The performances of 13 first base classifiers, i.e. the ones with the largest weights, are worse than those of the weight matrices reported by Motif Regressor. This may suggest that when the binding sites of a TF are heterogeneous and may be grouped into clusters, our boosting method finds base classifiers corresponding to different cluster profiles, whereas Motif Regressor reports an average profile. Thus, a single base classifier may be too specific to a particular cluster and may not discriminate well globally.
| 5 DISCUSSION |
|---|
For some cases, the ensemble model can reveal dependencies among motif positions. For example, Figure 2a displays the weight matrix found by Motif Regressor for RAP1, from which we can see that C and T dominate in position 5, and A and G dominate in position 8. But there is no further information on how these two positions might correlate with each other. In contrast, our boosting approach selected three base classifiers (Fig. 2b-d) to compose the final model. Two base classifiers favored C and A in positions 5 and 8, respectively, whereas the third one preferred T and G in those positions, respectively. This observation implies that positions 5 and 8 may cooperate in a certain way such that the change in one position correlates with the change in the other. As another example, we observe that positions 1, 10 and 13 of the REB1 motif (Fig. 3) can be decomposed in a similar way. In its first base classifier, position 13 strongly prefers G; positions 1 and 10 are ambivalent about G and C, respectively. In the second base classifier, however, position 13 strongly disfavors G, and positions 1 and 10 strongly favor G and C, respectively. This suggests that the three positions may cooperate to facilitate the protein-DNA binding.
The boosting approach terminates with an ensemble of 2-3 base classifiers for most cases. This is atypical for applications of the boosting technique, which can usually boost for hundreds to thousands of base classifiers. The small number of base classifiers could be due to three reasons. The first reason might be the unbalanced training data (on the order of 100 positive versus 3000 negative sequences). We examined the sensitivity and specificity of each base classifier alone using the training samples (Fig. 4a). The sensitivity of base classifiers spreads out in the range of 40-90%, while their specificity concentrates in the range of 75-95%. This suggests that it is easier to train base classifiers to recognize negative samples in our case although the negative samples are more heterogeneous than the positive ones. We modify the boosting algorithm by adding more initial weight to the positive samples such that the initial total weights of the two classes are equal. We note that although this method helps to bring out a less biased classifier, it is not equivalent to increasing the number of positive observations. As shown in Figure 4b, base classifiers with higher sensitivity tend to have lower generalization errors. A similar trend can be observed for the specificity of base classifiers in Figure 4c. Figure 5a shows that it is more likely to train base classifiers with relatively low training sensitivity and specificity when the number of positive sequences is small. Moreover, base classifiers trained with fewer positive samples are more likely to have higher generalization errors (Fig. 5b). Based on the above analyses, we reason that (1) base classifiers hardly overfit the training data in most cases and (2) the small number of positive samples does not provide enough information to boost for more base classifiers.
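For reference, the sensitivity and specificity of a single classifier on labeled samples can be computed as in the short sketch below. The function name and the +1/-1 encoding of labels and predictions are assumptions, and both classes are assumed to be present.

```python
def sensitivity_specificity(labels, predictions):
    """Return (sensitivity, specificity) for +1/-1 labels and predictions."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, predictions) if y == -1 and p == -1)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    return tp / n_pos, tn / n_neg
```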
Second, the binding mechanisms of some TFs may indeed be almost linearly dependent on the nucleotide types of the motif positions. For example, ABF1 has a much larger positive sample size (176) when compared with other TFs. Both the weight matrix and the ensemble model of ABF1 have low and comparable generalization errors (Table 1). The ensemble model has two base classifiers. The training sensitivity/specificity of the base classifiers are 93.18/94.66% and 90.34/95.58%. These results suggest that the binding mechanism of ABF1 may have little non-linearity because its samples can be well classified by linear decision rules including the weight matrix and the base classifiers. The base classifier becomes a strong learner (i.e. it can explain most of the training data) in such a case. On the other hand, the mild performances of many other base classifiers suggest that the binding mechanisms of some other TFs could have relatively high non-linearity.
Finally, our approach initializes a base classifier using a seed matrix. The successive refining step may only explore a limited subspace around the seed matrix. The training of base classifiers can be improved by a sampling-based de novo motif finding algorithm that is capable of exploring a wider range of the solution space (e.g. by sampling at multiple temperature levels). Alternatively, we can replace the base learner with a simpler one, e.g. a simple decision tree that uses rules such as whether a position should be C or not. With the above modifications, the ensemble model could have more base classifiers and capture more comprehensive features that lead to better classification performance. Nonetheless, the resultant base classifiers could be very diverse, and some could represent highly degenerate motifs. One potential drawback of this alternative is the loss of biological interpretability of the ensemble model. Although it is still not perfectly understood why the number of base classifiers is small, our approach provides a good balance between the interpretability and the performance of the boosted models. Another option for improving the boosted models is to train each base classifier on a randomly selected subset of the full training set, as suggested by Friedman (2002). It was reported that this kind of randomness has advantages in situations with small samples and powerful weak learners.
| 6 CONCLUSION |
|---|
We introduce a boosting-based method for modeling TF-DNA binding. By repeatedly fitting weight matrix based classifiers to weighted samples that focus on erroneous classifications, the boosting approach can build a more accurate TF-DNA binding model as a weighted combination of the base classifiers. The proposed approach was applied to the ChIP-chip data of S.cerevisiae and showed significant improvements in specificity in many cases. Like many recent studies that use mRNA microarray data to help refine regulatory binding motifs and infer combinatorial rules of transcription regulation (W. Wang et al., submitted for publication; Beer and Tavazoie, 2004), we found that ChIP-chip data can be used to further refine motif models and reveal novel features of TF-DNA interactions. Currently, we use Motif Regressor to generate the seed motif for boosting. However, our algorithm is not limited to working with Motif Regressor and can be used to boost weight matrices reported by any motif finding algorithm.
| Acknowledgments |
|---|
The work of W.H.W. is supported by NIH-HG02341. The work of J.S.L. is supported by NIH-P20-CA96470 and NSF DMS-0244638. The work of P.H. is supported by NIH-GM67250. We thank the anonymous reviewers for constructive suggestions that helped us to unify the way to initialize and train base classifiers and inspired us to think hard on the overfitting issue of the ensemble models.
Received on July 30, 2004; revised on January 10, 2005; accepted on March 21, 2005
| REFERENCES |
|---|
Agarwal, P.K. and Bafna, V. (1998) Detecting non-adjoining correlations with signals in DNA. Proceedings of the Second Annual International Conference on Research in Computational Molecular Biology, March 22-25, 1998, New York, USA. ACM Press, pp. 2-8.
Bailey, T.L. and Elkan, C. (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc. Int. Conf. Intell. Syst. Mol. Biol., 2, 28-36.
Barash, Y., et al. (2001) A simple hyper-geometric approach for discovering putative transcription factor binding sites. Algorithms in Bioinformatics: Proceedings of the 1st International Workshop, LNCS 2149, pp. 278-293.
Barash, Y., et al. (2003) Modeling dependencies in protein-DNA binding sites. Proceedings of the 7th Annual International Conference on Computational Molecular Biology (RECOMB 2003), Berlin, Germany. ACM Press, NY, pp. 28-37.
Beer, M.A. and Tavazoie, S. (2004) Predicting gene expression from sequence. Cell, 117, 185-198.
Bulyk, M.L., et al. (2001) Exploring the DNA-binding specificities of zinc fingers with DNA microarrays. Proc. Natl Acad. Sci. USA, 98, 7158-7163.
Bulyk, M.L., et al. (2002) Nucleotides of transcription factor binding sites exert interdependent effects on the binding affinities of transcription factors. Nucleic Acids Res., 30, 1255-1261.
Bussemaker, H.J., et al. (2001) Regulatory element detection using correlation with expression. Nat. Genet., 27, 167-171.
Conlon, E.M., et al. (2003) Integrating regulatory motif discovery and genomewide expression analysis. Proc. Natl Acad. Sci. USA, 100, 3339-3344.
Friedman, J.H. (2002) Stochastic gradient boosting. Comput. Stat. Data Anal., 38, 367-378.
Friedman, J.H., et al. (2000) Additive logistic regression: a statistical view of boosting (With discussion and a rejoinder by the authors). Ann. Statist., 28, 337-407.
Lawrence, C.E., et al. (1993) Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262, 208-214.
Lawrence, C.E. and Reilly, A.A. (1990) An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins, 7, 41-51.
Lee, T.I., et al. (2002) Transcriptional regulatory networks in Saccharomyces cerevisiae. Science, 298, 799-804.
Liu, J.S., et al. (1995) Bayesian models for multiple local sequence alignment and Gibbs sampling strategies. J. Am. Stat. Assoc., 90, 1150-1170.
Liu, X.S., et al. (2002) An algorithm for finding protein-DNA binding sites with applications to chromatin immunoprecipitation microarray experiments. Nat. Biotechnol., 20, 835-839.
Man, T.K. and Stormo, G.D. (2001) Non-independence of Mnt repressor-operator interaction determined by a new quantitative multiple fluorescence relative affinity (QuMFRA) assay. Nucleic Acids Res., 29, 2471-2478.
Schapire, R. and Singer, Y. (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37, 297-336.
Schneider, T.D. and Stephens, R.M. (1990) Sequence logos: a new way to display consensus sequences. Nucleic Acids Res., 18, 6097-6100.
Segal, E., et al. (2002) From promoter sequence to expression: A probabilistic framework. Proceedings of the 6th International Conference on Research in Computational Molecular Biology (RECOMB'02), Washington, DC. ACM Press, pp. 263-272.
Sinha, S. (2002) Discriminative motifs. Proceedings of the 6th International Conference on Research in Computational Molecular Biology (RECOMB'02), Washington, DC. ACM Press, pp. 291-298.
Stormo, G.D. and Hartzell, G.W., III. (1989) Identifying protein-binding sites from unaligned DNA fragments. Proc. Natl Acad. Sci. USA, 86, 1183-1187.
Stormo, G.D., et al. (1982) Use of the Perceptron algorithm to distinguish translational initiation sites in E.coli. Nucleic Acids Res., 10, 2997-3011.
Takusagawa, K. and Gifford, D. (2004) Negative information for motif discovery. Pac. Symp. Biocomput., 360-371.
Vilo, J., et al. (2000) Mining for putative regulatory elements in the yeast genome using gene expression data. Proc. Int. Conf. Intell. Syst. Mol. Biol., 8, 384-394.
Workman, C.T. and Stormo, G.D. (2000) ANN-Spec: a method for discovering transcription factor binding sites with improved specificity. Pac. Symp. Biocomput., 467-478.
Zhou, Q. and Liu, J. (2004) Modeling within-motif dependence for transcription factor binding site predictions. Bioinformatics, 20, 909-916.