Bioinformatics Advance Access originally published online on August 19, 2004
Bioinformatics 2005 21(3):307-313; doi:10.1093/bioinformatics/bth480
Bioinformatics vol. 21 issue 3 © Oxford University Press 2005; all rights reserved.
Similarity of position frequency matrices for transcription factor binding sites
1 Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY 11724, USA
2 Department of Physics and Astronomy, State University of New York, Stony Brook, NY 11794, USA
3 Computer Science Department, Portland State University, PO Box 751, Portland, OR 97207, USA
*To whom correspondence should be addressed.
| Abstract |
|---|
Motivation: Transcription-factor binding sites (TFBS) in promoter sequences of higher eukaryotes are commonly modeled using position frequency matrices (PFM). The ability to compare PFMs representing binding sites is especially important for de novo sequence motif discovery, where it is desirable to compare putative matrices to one another and to known matrices.
Results: We describe a PFM similarity quantification method based on product multinomial distributions, demonstrate its ability to identify PFM similarity and show that it has a better false positive to false negative ratio compared to existing methods.
We grouped TFBS frequency matrices from two libraries into matrix families and identified the matrices that are common and unique to these libraries. We identified similarities and differences between the skeletal-muscle-specific and non-muscle-specific frequency matrices for the binding sites of Mef-2, Myf, Sp-1, SRF and TEF of Wasserman and Fickett. We further identified known frequency matrices and matrix families that were strongly similar to the matrices given by Wasserman and Fickett. We provide methodology and tools to compare and query libraries of frequency matrices for TFBSs.
Availability: Software is available to use over the Web at http://rulai.cshl.edu/MatCompare
Contact: dschones{at}cshl.edu
Supplementary information: Database and clustering statistics, matrix families and representatives are available at http://rulai.cshl.edu/MatCompare/Supplementary
| INTRODUCTION |
|---|
Transcription-factor binding site (TFBS) discovery in promoter sequences is important for predicting transcription regulation. These binding sites are often represented as matrices, which are known in the literature under a variety of names: position weight matrices, position frequency matrices, alignment matrices, profiles, etc. (Knuppel et al., 1994; Sandelin et al., 2004; Lenhard and Wasserman, 2002). We refer to a matrix consisting of nucleotide counts per position as a position frequency matrix (PFM). Schneider et al. (1982, 1986) and Staden (1984) were among the first to use PFMs to characterize DNA-binding site specificity. Berg and von Hippel (1987, 1988), Hertz et al. (1990), Hertz and Stormo (1999) and Stormo and Hartzell (1989) refined the method to allow quantitative discrimination of sites with calculated site scores approximating the binding energy of the profiled transcription factor.
Comparison tools for TFBS PFMs are important for testing newly discovered matrices against existing matrices, reducing redundancy in databases and increasing the quality of the matrices. Previous approaches for quantifying PFM similarity include the average log-likelihood ratio method proposed by Wang and Stormo (2003), the Pearson correlation coefficient method described by Pietrokovski (1996) and Hughes et al. (2000) and a method recently introduced by Sandelin and Wasserman (2004).
We describe a column-by-column method for PFM similarity quantification based on the likelihood that aligned columns are independent and identically distributed observations from the same multinomial distribution. We compared the performance of this method to the average log-likelihood ratio method and the Pearson correlation coefficient method on simulated data. Our method outperforms the other methods in each of our tests. However, we did not compare with the method of Sandelin and Wasserman (2004), because it is fundamentally different from ours as they allow for gapped PFM alignment.
We used this PFM similarity quantification to classify TFBSs by PFM similarity. We grouped PFMs in TRANSFAC (Knuppel et al., 1994) and JASPAR (Sandelin et al., 2004; Lenhard and Wasserman, 2002) into PFM-families and generated representatives for each family. We found that PFM-families are likely to include TFBS PFMs for related transcription factors. PFM-families and their representatives are useful for reducing the error when searching a PFM library. By comparing the similarity of a novel PFM to a PFM-family representative, as opposed to all other PFMs, we lowered the false positive rate while increasing the computational efficiency. Once a PFM-family is chosen, similarity between its family members and the novel PFM is computed with greater accuracy.
We compared the matrices present in the Transcription Factor Database (TRANSFAC) to those in JASPAR, and vice versa. With a similarity threshold of 0.05, 16 of the PFMs from JASPAR were found to have no counterpart in TRANSFAC, including binding site matrices for EN-1, Elk-1, FREAC-3, GATA-2, Gfi, Gklf, HMG-1, MYB.ph3, Pax-2, SAP-1, SQUA, Tal1beta-E47S, c-FOS, c-MYB_1, p50 and Spz1. With a similarity threshold of 0.01, six of the PFMs from JASPAR had no counterpart in TRANSFAC, including binding site matrices for Elk-1, FREAC-3, GATA-2, HMG-1, SAP-1 and Tal1beta-E47S.
We compared the skeletal muscle binding site PFMs given by Wasserman and Fickett (1998) with the independently curated matrices, and with PFM-families and individual PFMs in TRANSFAC. We found that the muscle-specific and the non-muscle-specific binding site matrices for Mef-2 are strongly similar in eight core positions and different in the remaining positions; the PFMs for Myf are similar in seven core positions; the PFMs for TEF are similar; and the PFMs for SRF are weakly similar.
In the remainder of the paper, we introduce methods for calculating PFM similarity distances and compare these to PFM similarity measures described previously. We use the methods that are introduced in this paper to build PFM families in TRANSFAC and JASPAR. We demonstrate the effectiveness of these techniques by producing conclusive comparisons of the PFMs given by Wasserman and Fickett (1998) and by identifying similar PFMs and PFM families in TRANSFAC and JASPAR.
| SYSTEMS AND METHODS |
|---|
In this section, we present and compare the performance of four methods for comparing PFMs: the Pearson correlation coefficient, the average log-likelihood ratio, the Pearson χ²-test and the Fisher–Irwin exact test. We conclude the section with a description of the clustering methodology used to build PFM-families and with a description of the PFM libraries themselves. It is important to note that, as expected, all methods perform less effectively when PFMs are built from alignments of few sequences. The Pearson χ²-test and Fisher–Irwin exact test allow for power quantification that can be used to determine the confidence level in the similarity of the PFMs.
Distance measures
We adopt the methodology of Liu et al. (1995), where PFMs follow a product multinomial distribution. Each column is a set of independent and identically distributed observations, and matrix comparisons reduce to column-by-column comparisons. The overall similarity score for a matrix pair is derived from the individual column scores.
Methods for comparing frequency matrices have been described by Pietrokovski (1996), Hughes et al. (2000), Wang and Stormo (2003) and Sandelin and Wasserman (2004). Pietrokovski tested four different methods for comparing multiple alignments of protein sequences and determined that the Pearson correlation coefficient is the most effective statistic of the four. Hughes et al. employed the Pearson correlation coefficient to compare PFMs. Wang and Stormo introduced the average log-likelihood ratio statistic, based on the information content of the binding sites.
We use a statistical test for determining the likelihood that two columns are generated from the same multinomial distribution. This likelihood can be computed using the Fisher–Irwin exact test or approximated using the Pearson χ²-test.
Pearson correlation coefficient
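A general similarity measure between two columns X and Y can be written as in Eisen et al. (1998):

$$S(X,Y) = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{X_i - X_{\mathrm{off}}}{\Phi_X}\right)\left(\frac{Y_i - Y_{\mathrm{off}}}{\Phi_Y}\right) \qquad (1)$$

where $\Phi_Z = \sqrt{\sum_{i=1}^{N}(Z_i - Z_{\mathrm{off}})^2 / N}$. For TFBS matrices, we have an alphabet of size four (N = 4). When $Z_{\mathrm{off}}$ is set to the mean of $Z$, this similarity measure is the Pearson correlation coefficient (PCC) given in Equation (2). To compare matrices consisting of multiple columns, the scores of the individual column comparisons are summed.

$$\mathrm{PCC}(X,Y) = \frac{\sum_{i=1}^{N}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{N}(X_i-\bar{X})^2\,\sum_{i=1}^{N}(Y_i-\bar{Y})^2}} \qquad (2)$$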
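As an illustration of the column-wise PCC score, the following is a minimal sketch, assuming each PFM is stored as a list of columns and each column as a list of four counts in a fixed base order. The function names and the convention of returning 0.0 for a zero-variance column are choices made for this example, not part of the published method.

```python
from math import sqrt


def column_pcc(x, y):
    """Pearson correlation coefficient between two PFM columns of counts."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A perfectly uniform column has zero variance, so the PCC is
        # undefined. Returning 0.0 is an assumption of this sketch.
        return 0.0
    return cov / sqrt(var_x * var_y)


def matrix_pcc(pfm_a, pfm_b):
    """Sum of the column PCC scores of two aligned, equal-width PFMs."""
    return sum(column_pcc(x, y) for x, y in zip(pfm_a, pfm_b))
```

Because the score is a sum over columns, its range grows with the number of aligned columns, so scores from windows of different widths are not directly comparable.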
Average log-likelihood ratio
The average log-likelihood ratio statistic (ALLR), introduced by Wang and Stormo (2003), is a weighted sum of two log-likelihood ratios. The ALLR of two column vectors X and Y is given in Equation (3), where $n_b$ is the number of occurrences, $f_b = n_b/N$ is the frequency and $p_b$ is the prior for base b; a second subscript, X or Y, indicates the column. Again, to compare matrices consisting of multiple columns, the scores of the individual column comparisons are summed.

$$\mathrm{ALLR}(X,Y) = \frac{\sum_{b} n_{bX}\,\ln\dfrac{f_{bY}}{p_b} + \sum_{b} n_{bY}\,\ln\dfrac{f_{bX}}{p_b}}{\sum_{b}\left(n_{bX}+n_{bY}\right)} \qquad (3)$$
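The ALLR of two columns can be sketched as follows. The pseudocount added to the frequencies is an assumption of this example (it keeps the logarithm finite when a base is absent from a column); the uniform prior and the function name are likewise illustrative.

```python
from math import log


def allr(x, y, prior=(0.25, 0.25, 0.25, 0.25), pseudocount=0.5):
    """Average log-likelihood ratio of two PFM columns of counts.

    Frequencies are smoothed with a pseudocount (an assumption of this
    sketch) so that log(f / p) stays finite for unobserved bases.
    """
    k = len(x)
    total_x, total_y = sum(x), sum(y)
    freq_x = [(c + pseudocount) / (total_x + k * pseudocount) for c in x]
    freq_y = [(c + pseudocount) / (total_y + k * pseudocount) for c in y]
    # Counts of one column are weighted by the log-ratio of the other column.
    term_x = sum(n * log(f / p) for n, f, p in zip(x, freq_y, prior))
    term_y = sum(n * log(f / p) for n, f, p in zip(y, freq_x, prior))
    return (term_x + term_y) / (total_x + total_y)
```

With these settings, two columns that conserve the same base score above zero and two columns that conserve different bases score below zero.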
Pearson χ²-test
The probability that two unnormalized frequency vectors of length 4 are selected from the same multinomial distribution can be described by the P-value of the 2 x 4 contingency table as seen in Table 1.
The χ²-statistic of Equation (4) can be used to test the hypothesis that the columns are samples from the same multinomial distribution, where $N^{o}_{bj}$ is the observed number of base b at position j, and $N^{e}_{bj}$ is the expected number of base b at position j, calculated as $N^{e}_{bj} = N_{b\cdot}N_{\cdot j}/N$, the product of the two corresponding marginal totals of the contingency table divided by the grand total (Fleiss et al., 2003). The P-value is calculated from this χ²-value with 3 degrees of freedom, and the P-value for multiple columns is the product of the P-values of the individual columns. In our discussion, we use the geometric mean of the column P-values, which allows for comparing different size matrices and setting column-based P-value thresholds.

$$\chi^2 = \sum_{j=1}^{2}\;\sum_{b\in\{A,C,G,T\}} \frac{\left(N^{o}_{bj}-N^{e}_{bj}\right)^2}{N^{e}_{bj}} \qquad (4)$$
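A minimal sketch of the χ² column comparison and the geometric-mean matrix score is given below, using only the Python standard library. The closed-form upper tail used here is specific to 3 degrees of freedom; skipping bases that are absent from both columns is a simplification of this example, and the function names are illustrative.

```python
from math import erfc, exp, pi, sqrt


def chi2_statistic(x, y):
    """Pearson chi-square statistic of the 2 x 4 table formed by two columns."""
    total_x, total_y = sum(x), sum(y)
    grand = total_x + total_y
    stat = 0.0
    for count_x, count_y in zip(x, y):
        base_total = count_x + count_y
        if base_total == 0:
            # Base absent from both columns: its expected counts are zero.
            # Skipping it (and still using 3 df) is a simplification.
            continue
        for observed, column_total in ((count_x, total_x), (count_y, total_y)):
            expected = base_total * column_total / grand
            stat += (observed - expected) ** 2 / expected
    return stat


def chi2_sf_3df(stat):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    return erfc(sqrt(stat / 2)) + sqrt(2 * stat / pi) * exp(-stat / 2)


def column_p_value(x, y):
    """P-value for the hypothesis that both columns share one distribution."""
    return chi2_sf_3df(chi2_statistic(x, y))


def matrix_similarity(pfm_a, pfm_b):
    """Geometric mean of the column P-values of two aligned PFMs."""
    product = 1.0
    for x, y in zip(pfm_a, pfm_b):
        product *= column_p_value(x, y)
    return product ** (1 / min(len(pfm_a), len(pfm_b)))
```

Taking the geometric mean rather than the raw product keeps the score on a per-column scale, which is what makes thresholds comparable across windows of different widths.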
Fisher–Irwin exact test
The χ²-test is an approximation of the Fisher–Irwin exact test. The approximation does not hold when the marginal frequencies are small, specifically when at least one of the marginals is <5, a condition that occurs often in PFMs of TFBSs (Fleiss et al., 2003). The fixed marginal contingency table P-value follows the multiple hypergeometric distribution given in Equation (5) (Agresti, 1992), where $N_{bj}$ is the count of base b in column j, $N_{b\cdot}$ and $N_{\cdot j}$ are the marginal totals and N is the grand total. The two-sided P-value for the table is the sum of the probabilities of all tables that are at least as extreme. As in the χ²-test, the P-value for multiple columns is the product of the P-values of the individual columns.

$$P = \frac{\prod_{b} N_{b\cdot}!\;\prod_{j} N_{\cdot j}!}{N!\;\prod_{b}\prod_{j} N_{bj}!} \qquad (5)$$
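The exact test for a pair of columns can be sketched by enumerating every 2 x 4 table with the observed marginals. This example assumes that "at least as extreme" means "no more probable than the observed table", which is one common convention and may differ from the one used in the paper. For two rows, the multiple hypergeometric probability reduces to a product of binomial coefficients divided by one binomial coefficient, so the comparison can be done in exact integer arithmetic.

```python
from itertools import product
from math import comb


def exact_p_value(x, y):
    """Two-sided exact P-value for the 2 x k table formed by columns x and y."""
    base_totals = [cx + cy for cx, cy in zip(x, y)]
    n_first = sum(x)
    grand = n_first + sum(y)

    def weight(first_row):
        # Numerator of the table probability; the shared denominator
        # comb(grand, n_first) is applied once at the end.
        w = 1
        for count, total in zip(first_row, base_totals):
            w *= comb(total, count)
        return w

    observed = weight(x)
    extreme = 0
    # Enumerate the first k - 1 cells of the first row; the last cell is
    # fixed by the row total, and the second row by the base totals.
    for head in product(*(range(t + 1) for t in base_totals[:-1])):
        last = n_first - sum(head)
        if 0 <= last <= base_totals[-1]:
            w = weight(head + (last,))
            if w <= observed:
                extreme += w
    return extreme / comb(grand, n_first)
```

For columns built from about 30 sequences each, the enumeration visits at most a few hundred thousand tables, so brute force is adequate for a sketch; larger counts would call for a network algorithm or a statistics library.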
Distance measure comparisons
We generated PFM libraries from product multinomial distributions of a given information content range, and tested the effectiveness of the four methods in separating PFM pairs generated from the same distribution and PFM pairs generated from different distributions (Fig. 1). Each library contains 20 PFMs generated from each of 10 distributions with six independent vectors with total information content ranging from 1.9 to 10.4 bits. Each PFM was generated by sampling from a Dirichlet distribution with a sample size of 30. We generated 220 libraries for each sample in order to achieve suitable power. We chose distributions with six vectors and PFMs with 30 sequences to match with the average characteristics of the extended-core libraries of TRANSFAC and JASPAR. We controlled the false positive rate and compared the power (selectivity) of the four methods. When the false positive rate is set to 0.001 and information content is 3.5 or lower, the hypothesis that the power of the Pearson χ²-test and the Fisher–Irwin exact test is no greater than the power of the other two tests can be rejected with error probability α = 0.01 and 99% power. Our experiments suggest that the χ²-method is as good as the exact test method in detecting PFM similarity for the majority of PFMs in TRANSFAC, as can be seen in Figure 1.
PFM-family construction
Two of the most widely used databases of TFBS matrices are TRANSFAC and JASPAR. JASPAR has a much smaller dataset than TRANSFAC, and is manually curated with the goal of eliminating redundancy (Sandelin et al., 2004; Lenhard and Wasserman, 2002).
We describe the clustering of TRANSFAC; the clustering of JASPAR follows in similar lines. TRANSFAC version 7.2 includes 636 matrices, a small subset of which lack sufficient information for PFM construction. We selected all matrices for which we could estimate the correct frequency of each base at each position. This set consisted of 609 matrices.
A matrix core is identified in each PFM by TRANSFAC as the five most-conserved contiguous columns (highest confidence) (Knuppel et al., 1994). Extended cores were constructed to include columns that are adjacent to the cores and whose information content is greater than the information content of the highest entropy column in the core. We used matrix cores and extended cores to measure distances between PFMs.
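The core and extended-core definitions can be sketched as follows. TRANSFAC supplies its own core annotation; the `find_core` function here is only a stand-in that picks the five contiguous columns with the highest summed information content. The information content formula assumes a uniform background and no small-sample correction, and all names are illustrative.

```python
from math import log2


def column_ic(column):
    """Information content in bits of a column of counts (uniform background)."""
    total = sum(column)
    ic = 2.0
    for count in column:
        if count > 0:
            freq = count / total
            ic += freq * log2(freq)
    return ic


def find_core(pfm, width=5):
    """Return (start, end) of the contiguous window with the highest total IC."""
    ics = [column_ic(col) for col in pfm]
    starts = range(len(pfm) - width + 1)
    start = max(starts, key=lambda s: sum(ics[s:s + width]))
    return start, start + width


def extend_core(pfm, start, end):
    """Grow the core over adjacent columns whose IC exceeds the core's lowest IC."""
    ics = [column_ic(col) for col in pfm]
    floor = min(ics[start:end])  # IC of the highest-entropy core column
    while start > 0 and ics[start - 1] > floor:
        start -= 1
    while end < len(pfm) and ics[end] > floor:
        end += 1
    return start, end
```

Both functions return half-open index ranges, so `pfm[start:end]` yields the core or extended-core columns directly.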
We compared all PFM core pairs and all extended-core pairs using a sliding window of five columns. The comparisons were ranked according to P-value, and a similarity threshold was set so that two PFMs are deemed incompatible when their P-value is below the threshold and similar when it is above the threshold. We chose the threshold by estimating the associated rate of false positives and false negatives, where the expected number of false positives is the sum of the P-values of the incompatible pairs and the expected number of false negatives is the sum of the q-values (q = 1 − P) of the similar pairs.
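The expected error counts used to choose the threshold can be sketched directly from the definitions in the paragraph above. The function name is illustrative, and treating a P-value exactly equal to the threshold as "similar" is an assumption of this example, since the text does not specify that case.

```python
def expected_errors(p_values, threshold):
    """Expected false positives and false negatives at a P-value threshold.

    Follows the definitions in the text: the sum of P over pairs called
    incompatible, and the sum of q = 1 - P over pairs called similar.
    """
    incompatible = [p for p in p_values if p < threshold]
    similar = [p for p in p_values if p >= threshold]  # tie handling assumed
    expected_fp = sum(incompatible)
    expected_fn = sum(1 - p for p in similar)
    return expected_fp, expected_fn
```

Scanning candidate thresholds and evaluating this function for each gives the trade-off curve from which a threshold with an acceptable error rate can be read off.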
We set the rate of false positive comparisons to 0.05 and used this to set a P-value threshold. The comparisons with similarity above threshold were then used as input for the partitioning around medoids (PAM) clustering algorithm of Kaufman and Rousseeuw (1990) in the S-PLUS software package.
Some of the clusters produced by PAM include pairs with a similarity P-value lower than the threshold. These clusters were modified to eliminate pairs with probabilities below the similarity threshold, a process generally resulting in the breaking of a cluster into two or more smaller clusters. For the TRANSFAC cores, this process increased the number of clusters from 135 to 156. The resulting clusters can be described as cliques in the subgraph induced by edges with P-value greater than the threshold. PFM-families are given in the Supplementary information.
Matrices in the JASPAR database are not annotated with a core section as the TRANSFAC matrices are. In order to search in an unbiased manner for JASPAR matrices in TRANSFAC, we defined cores and extended cores in JASPAR matrices in a manner consistent with TRANSFAC. Statistics about these matrix sets are given in the Supplementary information.
| IMPLEMENTATION |
|---|
In this section, we describe the construction of PFM-families using core and extended core sections from matrices in both the TRANSFAC and JASPAR databases. A representative matrix is constructed for each PFM-family. We conclude with a study of the PFMs given by Wasserman and Fickett (1998), describing the similarities and differences between the collected muscle-specific PFMs and independently selected PFMs and comparing these with TRANSFAC and JASPAR PFMs and PFM-families.
PFM-family construction
The clustering procedure described above was used to group PFMs into families of matrix similarity. We organized PFMs into PFM-families for the TRANSFAC core, TRANSFAC extended core, JASPAR core and JASPAR extended core PFM sets. We outline the implementation for each of the sets. Statistics and a complete list of the PFM-families in each PFM set are available in the Supplementary information.
We generate a representative matrix for each PFM-family by first aligning the matrices using a comparison window of five bases, and then summing all the elements across the aligned columns. The summing operation is consistent with the product multinomial model, where each column is a set of observations and the representative column is the combination of the categorical datasets (Liu et al., 1995).
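The summing step can be sketched as below, assuming the alignment has already been found and is supplied as one non-negative offset per family member. Keeping columns that are covered by only some members is a choice made for this example; the function name and argument layout are illustrative.

```python
def representative(pfms, offsets):
    """Sum aligned PFM columns into one representative matrix.

    pfms    -- list of PFMs, each a list of [A, C, G, T] count columns
    offsets -- offsets[i] is the position of the first column of pfms[i]
               in the shared coordinate frame (assumed already computed)
    """
    length = max(off + len(pfm) for pfm, off in zip(pfms, offsets))
    rep = [[0, 0, 0, 0] for _ in range(length)]
    for pfm, off in zip(pfms, offsets):
        for j, column in enumerate(pfm):
            for b, count in enumerate(column):
                rep[off + j][b] += count
    return rep
```

Because counts rather than frequencies are added, members built from many sequences contribute more to the representative than members built from few, which is consistent with treating each column as a pooled set of observations.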
TRANSFAC cores
The largest PFM-family with high internal similarity is given in Table 2. This PFM-family contains 12 matrices that have an average similarity of 0.94. The ATF, CREB, bZIP910 and bZIP911 factors present in this PFM-family are all members of the bZIP family of proteins. Other CREB matrices exist in TRANSFAC, but are sufficiently distinct and do not appear in this cluster. Relaxing the constraints leads to the inclusion of additional bZIP PFMs.
Another interesting result from the clustering of the TRANSFAC core matrices is the presence of multiple E2F-binding site PFM-families, as shown in Table 3. PFM-families 124, 141 and 156 have identical consensus sequences, but there are differences in the relative strength of the signal at various positions. This can be seen from the sequence logos in Figure 2; the logos program is described by Schneider and Stephens (1990).
TRANSFAC extended cores
We grouped the extended-core PFMs into 145 PFM-families. An example of PFM-families that are formed when using the extended cores and not formed when using the cores is given in Table 4. The factors MyoD, E47, E12, E2A and myogenin belong to the bHLH (basic region + helix–loop–helix) factor class. MyoD, E47, E12, E2A and myogenin are known to interact, and the Lmo2 complex transcription factor is known to bind to E2A and E47 (Mitsui et al., 1993).
JASPAR cores and extended cores
We identified 61 similar JASPAR PFM core pairs and 80 similar JASPAR PFM extended-core pairs out of the 6431 possible pairs and produced 23 and 36 PFM-families, respectively. The average similarity within clusters and the size of clusters are considerably smaller for the JASPAR database than for the TRANSFAC database. The clustering statistics are given in the Supplementary information.
Representative matrices
Through increasing the number of observations for each high information column and decreasing the number of PFMs in the initial search, representative matrices are used to increase the accuracy and the efficiency of similarity searches. An example of this is given in the following section.
Similarity designations are often difficult to make for TRANSFAC PFMs because PFMs are often constructed from the alignment of few sequences. PFM-family representatives allow for increased accuracy since they represent richer alignments.
Representative matrices can also be used to validate PFM-families. The representatives can be used to search the original database for related matrices. When doing this, matrices of the PFM-family corresponding to the representative matrix should be found with high similarity, followed by members of related families. The result of a query using the representative of the bZIP family introduced in Table 2 is shown in Table 5. Results for the other representative matrices created from the TRANSFAC extended cores are in the Supplementary information.
Novel PFM comparison
Wasserman and Fickett (1998) curated a set of PFMs for skeletal muscle-specific TFBSs and compared them to PFMs from independently selected promoter segments. They wanted to know if the two resulting PFMs in each pair differ substantially and they offered observations about the differences. We describe the difference in quantifiable terms and compare the PFMs to general PFMs from TRANSFAC.
We used the Wasserman and Fickett (1998) classification of PFMs into muscle-specific and independent. We compared the analogous PFMs from the muscle-specific and independent set. A summary of the results follows.
- Mef-2 PFMs: The muscle-specific and independent PFMs match well from position 4 to 11 (with similarity 0.21) and match weakly from position 1 to 4 (0.05).
- Myf PFMs: Match well in 7 positions starting at position 4 of the muscle-specific PFM and 5 of the independent PFM.
- SRF PFMs: Match weakly in 10 positions starting at position 3 of the muscle-specific PFM and 5 of the independent PFM (0.06).
- Sp-1 PFMs: Match well in 10 positions starting at position 1 of the muscle-specific PFM and 2 of the independent PFM (0.49). However, the power of the comparison is <90%.
- TEF PFMs: Match well in 8 positions starting at position 2 of the muscle-specific PFM and 1 of the independent PFM (0.43). However, the power of the comparison is <85%.
We compared the skeletal muscle-specific PFMs to the representative PFMs of TRANSFAC extended-core PFM-families. A summary of the results follows.
- Mef-2 PFM: Matches best with the representative of the sixth PFM-family (M06), with a similarity of 0.45 for a window of size 7 starting at position 4. M06 includes binding site matrices for a MEF-2, MADS-B and MEF-2. The PFM also matches M00006 (MEF2) in 10 positions, starting at its second position. However, M00006 is constructed from the alignment of five sequences and the similarity has considerably lower power than the similarity of the PFM with the representative of M06.
- Myf PFM: Myf PFMs in TRANSFAC did not meet our requirements and were removed from all analysis. The Myf PFM did not match any PFM-families.
- SRF PFM: Matches the representative of PFM-family 75 (M075) with a window of size 7 and similarity 0.14. The PFM-family includes a binding site matrix for BR-C_Z2 and two binding site matrices for SRF. The PFM matches, with window size 7, M00186 (SRF) with similarity 0.323518, M00404 (MADS-B) with similarity 0.249068 and M00810 (SRF) with similarity 0.248197. The match with M00404 is most likely false. M00404 is constructed from the alignment of seven sequences and its match with the SRF PFM begins at position 5 instead of position 3. Interestingly, the muscle-independent SRF PFM strongly matches M00152 (SRF) with a window of size 10 and similarity 0.75. Thus, the muscle-specific and independent SRF PFMs match different SRF binding site matrices in TRANSFAC.
- Sp-1 PFM: Matches the representative of PFM-family 28 (M028) with a window of size 6 and similarity 0.31. The PFM-family includes a binding site matrix for Muscle initiator sequences-19 and Muscle initiator sequences-20. The PFM matches, with window size 7, M00221 with similarity 0.30 and M00749 with similarity 0.24. Both these matches are likely to be false. M00221 is constructed from the alignment of seven sequences and M00749 from six sequences. Their alignments with the Sp-1 PFM start at different positions.
- TEF PFM: Does not match any PFM-family.
| DISCUSSION AND CONCLUSION |
|---|
We present a technique to identify similarity between PFM profiles for TFBSs. This similarity method is deeply rooted in the theory of PFMs and is experimentally shown to outperform existing methods. It allows for a statistical quantification of errors and is used to facilitate PFM queries in TFBS PFM libraries.
We used our technique to classify PFMs in TRANSFAC and JASPAR into PFM-families, which were then used to increase the accuracy of PFM queries. An examination of these families reveals a strong correlation between PFM similarity and the function of the corresponding transcription factors, but there are examples of similar PFMs that profile binding sites of transcription factors that are not likely to be related functionally.
The analysis of the TRANSFAC and JASPAR databases reveals that the JASPAR database is less redundant, but almost all of the JASPAR matrices are represented in TRANSFAC. By grouping the TRANSFAC PFMs into PFM-families we build a higher quality PFM set that is also less redundant.
We also show that the cores in the TRANSFAC database do not always capture the whole signal. For example, the JASPAR PFM MA0003 is strongly related to the TRANSFAC M00075 and both are binding sites of an E2F transcription factor. However, the similarity between the PFMs cannot be detected when using the M00075 core alone. Another example is the similarities between the bHLH factors listed in Table 4. These PFMs only group together as similar when the extended cores are considered.
Sandelin and Wasserman (2004) use the Needleman–Wunsch algorithm (Needleman and Wunsch, 1970) to align matrices before comparing them. This method is attractive for comparing binding site PFMs that are composed of strongly conserved position clusters that are separated by non-conserved positions, such as binding sites for dimers like the leucine zippers. We chose to concentrate on the simpler configuration of adjacent, conserved positions as advocated by TRANSFAC. However, the extension to allow for gapped PFM alignments is possible and would be useful.
We compared the skeletal muscle-specific PFMs curated by Wasserman and Fickett (1998) with their corresponding independently curated PFMs. We show that the muscle-specific SRF binding site matrix is different from the independent SRF binding site matrix; these matrices match different binding site matrices in TRANSFAC. All other muscle-specific binding site matrices in the Wasserman and Fickett study are similar.
Finally, our major contribution is a methodology for comparing PFMs and for searching for PFMs in a PFM library. Our techniques can be used for classifying PFM-families and for investigating novel PFM binding site matrices.
| Acknowledgments |
|---|
This work was supported by NSF grants EIA-0324292 and DBI-0306152.
Received on May 18, 2004; revised on July 28, 2004; accepted on August 13, 2004
| REFERENCES |
|---|
Agresti, A. (1992) A survey of exact inference for contingency tables. Stat. Sci., 7, 131–177.
Berg, O.G. and von Hippel, P. (1987) Selection of DNA binding sites by regulatory proteins. Statistical-mechanical theory and application to operators and promoters. J. Mol. Biol., 193, 723–750.
Berg, O.G. and von Hippel, P. (1988) Selection of DNA binding sites by regulatory proteins II: the binding specificity of cyclic AMP receptor protein to recognition sites. J. Mol. Biol., 200, 709–723.
Eisen, M., Spellman, P., Brown, P., Botstein, D. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA, 95, 14863–14868.
Fleiss, J.L., Levin, B., Paik, M.C. (2003) Statistical Methods for Rates and Proportions. John Wiley & Sons, NY.
Hertz, G., Hartzell, G., III, Stormo, G. (1990) Identification of consensus patterns in unaligned DNA sequences known to be functionally related. Comput. Appl. Biosci., 6, 81–92.
Hertz, G. and Stormo, G. (1999) Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics, 15, 563–577.
Hughes, J.D., Estep, P.W., Tavazoie, S., Church, G.M. (2000) Computational identification of Cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J. Mol. Biol., 296, 1205–1214.
Kaufman, L. and Rousseeuw, P.J. (1990) Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, NY.
Knuppel, R., Dietze, P., Lehnberg, W., Frech, K., Wingender, E. (1994) TRANSFAC retrieval program: a network model database of eukaryotic transcription regulating sequences and proteins. J. Comput. Biol., 1, 191–198.
Lenhard, B. and Wasserman, W.W. (2002) TFBS: computational framework for transcription factor binding site analysis. Bioinformatics, 18, 1135–1136.
Liu, J.S., Lawrence, C.E., Neuwald, A. (1995) Bayesian models for multiple local sequence alignment and its Gibbs sampling strategies. J. Am. Stat. Assoc., 90, 1156–1170.
Mitsui, K.K., Shirakata, M., Paterson, B.M. (1993) Phosphorylation inhibits the DNA-binding activity of MyoD homodimers but not MyoD-E12 heterodimers. J. Biol. Chem., 268, 24415–24420.
Needleman, S. and Wunsch, C. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48, 443–453.
Pietrokovski, S. (1996) Searching databases of conserved sequence regions by aligning protein multiple-alignments. Nucleic Acids Res., 24, 3836–3845.
Sandelin, A., Alkema, W., Engström, P., Wasserman, W.W., Lenhard, B. (2004) JASPAR: an open access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res., 32, D91–D94.
Sandelin, A. and Wasserman, W.W. (2004) Constrained binding site diversity within families of transcription factors enhances pattern discovery bioinformatics. J. Mol. Biol., 338, 207–215.
Schneider, T.D. and Stephens, R.M. (1990) Sequence logos: a new way to display consensus sequences. Nucleic Acids Res., 18, 6097–6100.
Schneider, T.D., Stormo, G.D., Gold, L., Ehrenfeucht, A. (1982) Use of the Perceptron algorithm to distinguish translational initiation sites in E.coli. Nucleic Acids Res., 10, 2997–3011.
Schneider, T.D., Stormo, G.D., Gold, L., Ehrenfeucht, A. (1986) Information content of binding sites on nucleotide sequences. J. Mol. Biol., 188, 415–431.
Staden, R. (1984) Computer methods to locate signals in nucleic acid sequences. Nucleic Acids Res., 12, 505–519.
Stormo, G.D. and Hartzell, G., III. (1989) Identifying protein-binding sites from unaligned DNA fragments. Proc. Natl Acad. Sci. USA, 86, 1183–1187.
Wang, T. and Stormo, G.D. (2003) Combining phylogenetic data with co-regulated genes to identify regulatory motifs. Bioinformatics, 19, 2369–2380.
Wasserman, W.W. and Fickett, J.W. (1998) Identification of regulatory regions which confer muscle-specific gene expression. J. Mol. Biol., 278, 167–181.