Bioinformatics Advance Access originally published online on July 10, 2007
Bioinformatics 2007 23(18):2449-2454; doi:10.1093/bioinformatics/btm348
An approach to predict transcription factor DNA binding site specificity based upon gene and transcription factor functional categorization


1CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, 2Graduate School of the Chinese Academy of Sciences, 19 Yuquan Road, Beijing 100039, 3Bioinformatics Center, Key Lab of Molecular Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, 320 Yueyang Road, Shanghai 200031, China, 4Department of Mathematics, University of Manchester, Institute of Science and Technology, P.O. Box 88, Manchester M60 1QD, UK, 5Shanghai Center for Bioinformation Technology, 100 Qinzhou Road, 200235 Shanghai, 6College of Life Science & Biotechnology, Shanghai Jiao Tong University, China and 7Molecular Physiology Laboratory, Centre for Cardiovascular Science Queen's Medical Research Institute, 47 Little France Crescent, Edinburgh, EH16 4TJ, UK
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Motivation: To understand transcription regulatory mechanisms, it is indispensable to investigate transcription factor (TF) DNA binding preferences. We noted that the generally acknowledged information of functional annotations of TFs as well as that of their target genes should provide useful hints in determining TF DNA binding preferences.
Results: In this contribution, we developed an integrative method based on the Nearest Neighbor Algorithm, to predict DNA binding preferences through integrating both the functional/structural information of TFs and the interaction between TFs and their targets. The accuracy of cross-validation tests on the dataset consisting of 3430 positive samples and 7000 negative samples reaches 87.0 % for 10-fold cross-validation and 87.9 % for jackknife cross-validation test, which is a much better result than that in our previous work. The prediction result indicates that the improved method we developed could be a powerful approach to infer the TF DNA preference in silico.
Contact: cyd{at}picb.ac.cn
Supplementary information: Supplementary data are available at Bioinformatics online
| 1 INTRODUCTION |
|---|
Transcription factors (TFs), the major regulators of transcription, bind preferentially to specific regions of DNA to modulate the expression of nearby genes. Given a newly identified TF, how do we know which kinds of DNA sequences it prefers? Typically, we resort either to extensive experiments, such as DNA footprinting (Fox, 1997), to find direct evidence of the interaction between the TF and its target genes, or to statistical models (Stormo, 2000) that describe the TF's DNA binding preferences. However, neither approach is ideal, because both consume considerable time and money. With the advent of large integrative biological databases and data processing tools, such as InterPro (Apweiler et al., 2001), a database of protein families, domains and functional sites, and the gene regulation database TRANSFAC (Matys et al., 2006; Wingender et al., 1996), an alternative way of investigating the problem becomes available. We now possess not only the experimental techniques but also a body of existing knowledge that can be put to good use to give at least some indications concerning the TF of interest. Thus, a strategy that integrates existing transcription regulatory knowledge with the generally acknowledged functional annotations of TFs and of their target genes is now within reach.
In our previous work (Qian et al., 2006b), we tried to answer the question of whether a query pair of TF and potential transcription factor binding site (TFBS) interacts, so as to infer the TF's DNA binding preferences. That work was based on the InterPro (Apweiler et al., 2001) annotations of TFs, which cover a large number of functional domains/sites found in known proteins, and on TRANSFAC (Matys et al., 2006; Wingender et al., 1996), which provides well-understood interacting pairs of TF and TFBS. The success rate of cross-validation on the collected dataset reached 76.6 % (Qian et al., 2006b), so more accurate predictions of DNA binding preferences are still required. If the targets regulated by the query TF are known, better predictions of DNA binding preferences can be expected. Targets of TFs can now be extracted from the large experimental datasets generated by high-throughput techniques such as ChIP-chip. Therefore, in this contribution, by integrating both the InterPro (Apweiler et al., 2001) annotations of TFs and the TF-target relationships, an improved strategy based on the Nearest Neighbor Algorithm (NNA) is introduced to predict the DNA binding preferences of novel TFs.
To examine the performance of our predictor, it was tested on a dataset consisting of 3430 true TF-TFT-TFBS triplets (TFT: transcription factor target gene) and 7000 artificial TF-TFT-TFBS triplets, and achieved a success rate of 87.0 %. This result indicates that the method we developed could be promising in TF DNA binding preference research.
| 2 MATERIALS AND METHODS |
|---|
2.1 Positive dataset
For TFs as well as their targets and binding sites, the original dataset came from TRANSFAC v7.0 (Matys et al., 2006; Wingender et al., 1996). The original dataset was then filtered in the following steps: (1) 327 TFs and 113 TFTs without SwissProt accessions were removed, and the 407 associated TFBSs were filtered out. (2) 743 TFBSs shorter than 5 bp or longer than 25 bp were removed, since the lengths of most TFBSs fall within this range. (3) Finally, a positive dataset of 3430 TF-TFT-TFBS triplets, covering 143 TFs, 1416 TFTs and 571 TFBSs, was built (see Supplementary Material 1, Table 1).
2.2 Negative dataset
A negative dataset was randomly generated by shuffling the TFBS column of the collected positive dataset according to the following steps: (1) Each TF-TFT-TFBS triplet is assigned a random number. (2) The TFBSs are shuffled according to the random numbers, while the TFs and TFTs are left unchanged. The details are illustrated in Figure 1. (3) Any shuffled records that already exist in the positive dataset are removed. (4) Steps 1, 2 and 3 are repeated twice. (5) Finally, we obtain a negative dataset of about 7000 records, about twice as large as the positive dataset.
To minimize the sampling bias, we sampled 20 different negative datasets according to the above algorithm.
Fig. 1. Each record is assigned a random number. Then the TFBS column is sorted according to the random number assigned, while the TF and TFT columns remain unchanged.
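The shuffling procedure of this section can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `make_negative_set`, the toy triplets and the fixed seed are hypothetical, "repeated twice" is read here as two shuffling rounds, and `random.shuffle` stands in for sorting the TFBS column by per-record random numbers (both yield a random permutation).

```python
import random

def make_negative_set(positives, rounds=2, seed=0):
    """Build decoy triplets by permuting the TFBS column (hypothetical helper)."""
    rng = random.Random(seed)
    positive_set = set(positives)
    negatives = []
    for _ in range(rounds):  # assumption: "repeated twice" means two rounds
        # Steps 1-2: permute only the TFBS column; TF and TFT stay in place.
        tfbs_column = [tfbs for _, _, tfbs in positives]
        rng.shuffle(tfbs_column)
        for (tf, tft, _), tfbs in zip(positives, tfbs_column):
            candidate = (tf, tft, tfbs)
            # Step 3: drop shuffled records that reproduce a known positive.
            if candidate not in positive_set:
                negatives.append(candidate)
    return negatives

# Toy positive triplets (identifiers are made up for illustration).
positives = [("TF1", "G1", "S1"), ("TF1", "G2", "S2"),
             ("TF2", "G3", "S3"), ("TF3", "G4", "S4")]
negatives = make_negative_set(positives)
```

Calling the function with 20 different seeds would give 20 alternative negative datasets, in the spirit of the resampling described above.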
2.3 Numeric representation system for TFs and TFTs
The prerequisite for building a predictor is a feasible numeric representation system for proteins as well as for nucleotide sequences. In our contribution, InterPro (Apweiler et al., 2001) annotations were used to represent the protein samples (TF, TFT), and 125D binary vectors were used to represent the nucleotide sequences of DNA binding sites (TFBS). In the following sections, we discuss these two numeric representation systems in detail.
First, we extract the InterPro (Apweiler et al., 2001) annotations of each TF/TFT using the Protein2Ipr mapping provided by InterPro (Apweiler et al., 2001), which contains 8151 entries altogether.
Then, each TF is encoded as an 8151D (dimensional) vector. Each component of the 8151D vector, for example the ith, is set to 0 or 1, indicating whether the protein hits the ith InterPro (Apweiler et al., 2001) entry. This can be formulated as,
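$$T_u = \left[\, t(u,1),\; t(u,2),\; \ldots,\; t(u,8151) \,\right]^{\mathrm{T}} \qquad (1)$$

where

$$t(u,i) = \begin{cases} 1, & \text{if the } u\text{th TF hits the } i\text{th InterPro entry} \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$

Each TFT is encoded in the same way, giving a vector $G_v$ with components $g(v, j)$.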
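As an illustration of this 0/1 encoding, the following sketch maps a protein's InterPro hits onto a binary vector. The function name `encode_protein` and the four-entry index are hypothetical stand-ins for the full 8151-entry Protein2Ipr mapping; the first three accessions are the shared steroid hormone receptor domains mentioned in Section 3, and the fourth is included only as an entry that is not hit.

```python
def encode_protein(hit_entries, entry_index):
    """Return a 0/1 list with a 1 at every InterPro entry the protein hits."""
    vec = [0] * len(entry_index)
    for entry in hit_entries:
        pos = entry_index.get(entry)
        if pos is not None:  # ignore accessions outside the index
            vec[pos] = 1
    return vec

# Toy index standing in for the 8151 InterPro entries (assumption).
entry_index = {"IPR000536": 0, "IPR001628": 1, "IPR008946": 2, "IPR000001": 3}
tf_vector = encode_protein({"IPR000536", "IPR001628", "IPR008946"}, entry_index)
print(tf_vector)  # [1, 1, 1, 0]
```

With the real mapping, `entry_index` would have 8151 keys and every TF or TFT would become an 8151D vector.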
2.4 Numeric representation system for nucleotide sequences
Without information loss, the TFBS was encoded with a 0-1 system (Bhasin et al., 2005; Jia et al., 2006) according to the following steps:
(1) First, TFBSs shorter than 25 bp are extended to exactly 25 bp by appending N as a suffix; e.g. the 21 bp nucleotide sequence TTCGATCGATCGATCGATCGT is extended to TTCGATCGATCGATCGATCGTNNNN, whereas TFBSs of exactly 25 bp remain unchanged.
(2) Then, these TFBSs can be represented as a 25 × 5 = 125D (dimensional) vector, e.g. TTCGATCGATCGATCGATCGTNNNN is represented as a 125D binary vector as,
|
|
|
| (3) |
|
| (4) |
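The padding and 0-1 encoding steps can be sketched as follows. This is a hedged illustration: the five-symbol alphabet (A, C, G, T plus the padding symbol N) follows from the 25 × 5 layout, but the order of the symbols within each 5-bit group is an assumption, as are the names `ALPHABET`, `MAX_LEN` and `encode_tfbs`.

```python
ALPHABET = "ACGTN"  # assumed symbol order within each 5-bit group
MAX_LEN = 25        # longest TFBS kept in the dataset

def encode_tfbs(site):
    """Pad a binding site with N to 25 bp and one-hot encode it (125 bits)."""
    if not 5 <= len(site) <= MAX_LEN:
        raise ValueError("TFBS length must be between 5 and 25 bp")
    padded = site.upper().ljust(MAX_LEN, "N")
    vec = []
    for base in padded:
        bits = [0] * len(ALPHABET)
        bits[ALPHABET.index(base)] = 1  # exactly one bit set per position
        vec.extend(bits)
    return vec

site_vector = encode_tfbs("TTCGATCGATCGATCGATCGT")  # the 21 bp example above
print(len(site_vector), sum(site_vector))  # 125 25
```

Because every position sets exactly one of five bits, the sequence can be read back from the vector, which is the sense in which the encoding loses no information.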
2.5 The hybridization space
DNA binding preferences can be inferred by predicting the interactions among TF, TFT and TFBS, as mentioned above. To facilitate this, a numeric representation was developed to cover TF-TFT-TFBS triplets, as follows. Suppose $T_x$, $G_y$ and $D_z$ are the xth TF, the yth TFT and the zth TFBS, respectively. The TF-TFT-TFBS triplet TGD (x, y, z) can be expressed as
|
| (5) |
where ⊕ represents the orthogonal sum, k is the weight used to remove the bias caused by the difference in contribution between the two encoding systems, T is the transpose operator, and $T_x$, $G_y$, $D_z$ and their components have the same meanings as in the preceding formulae. With this hybridization approach, every TF-TFT-TFBS triplet can be represented in the simple form of a 16427D (8151D + 8151D + 125D) vector. Similar approaches have been used in several previous works, such as the prediction of protein–protein interactions (Chou and Cai, 2006), where they performed very well, which indicates that the hybridization approach provides an efficient encoding system for the numeric representation of interacting pairs/triplets.
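A minimal sketch of the hybridization step is given below. The orthogonal sum is implemented as plain concatenation of the three vectors. Which block the weight k multiplies is an assumption of this sketch: it is applied to the TFBS block, on the reading that the "two encoding systems" are the InterPro encoding and the nucleotide encoding. The function name `hybridize` is hypothetical.

```python
def hybridize(tf_vec, tft_vec, tfbs_vec, k=0.5):
    """Concatenate TF, TFT and weighted TFBS vectors into one triplet vector.

    Assumption: k scales only the TFBS block.
    """
    return list(tf_vec) + list(tft_vec) + [k * bit for bit in tfbs_vec]

# Dimension check with all-zero placeholders of the sizes used in the text.
triplet = hybridize([0] * 8151, [0] * 8151, [0] * 125)
print(len(triplet))  # 16427
```

Under this reading, a smaller k reduces the influence of sequence mismatches on the distance between two triplets relative to mismatches in the InterPro annotations.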
2.6 The nearest neighbor algorithm
The NNA is one of the most widely used classifiers. It finds the training point closest (according to some distance metric) to the unknown point and predicts the category of that training point. Put more formally, given a collection of points S and a query point q, what is the point x in S closest to q? The nearest-neighbor classifier predicts that the category of q is the same as the category of x. It is particularly useful when the underlying distribution is unknown, and it has been widely used in previous works, such as the prediction of protein–protein interactions (Chou and Cai, 2006), protein quaternary structure classification (Yu et al., 2006) and TF classification (Qian et al., 2006a), where it achieved good performance.
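The nearest-neighbor rule described above can be sketched as follows. The text leaves the distance metric open ("some distance metric"), so the squared Euclidean distance used here is an assumption, and the names `squared_distance` and `nearest_neighbor_predict` are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_neighbor_predict(training, query):
    """Return the label of the training point closest to the query.

    `training` is a list of (vector, label) pairs.
    """
    best_label, best_dist = None, float("inf")
    for vector, label in training:
        dist = squared_distance(vector, query)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Toy example: label 1 marks an interacting triplet, 0 a shuffled one.
training = [([1, 1, 0, 0.5], 1), ([0, 0, 1, 0.0], 0)]
print(nearest_neighbor_predict(training, [1, 0, 0, 0.5]))  # 1
```

In the toy call, the query lies at squared distance 1 from the first training point and 2.25 from the second, so it inherits label 1.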
| 3 RESULTS AND DISCUSSION |
|---|
To some extent, TFs with similar biological functions may share similar DNA binding sites. For example, data from TRANSFAC v7.0 show that the TFs T00333 (SwissProt accession: P06536) and T00511 (SwissProt accession: P22199) both belong to the steroid hormone receptor family and share very similar InterPro annotations (Table 1; the common domains are IPR000536, IPR001628 and IPR008946). Moreover, according to TRANSFAC v7.0, both bind to the same site, R12371. If this observation holds in general, the DNA binding information of functionally related TFs should help to predict a TF's DNA binding sites (TFBS). On this assumption, we propose a knowledge-based method to infer TF DNA binding sites.
The prediction should improve as more biological information is taken into consideration. In this contribution, the functional annotations of both the TFs and their targets are integrated into our prediction system. The method is evaluated below on the datasets described in Sections 2.1 and 2.2.
To test the performance of our predictor, the jackknife cross-validation test (Bhasin et al., 2005; Cai and Chou, 2006; Chou and Cai, 2004, 2006; Jia et al., 2006; Qian et al., 2006a; Qian et al., 2006b) and the 10-fold cross-validation test were adopted. In our implementation, the jackknife cross-validation test was carried out as follows: each TF-TFT-TFBS triplet in the dataset was left out in turn, and our predictor, built on the remaining triplets, was applied to predict its property. A prediction succeeds when the predicted property agrees with the truth. Finally, the success rate can be calculated as
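$$\text{success rate} = \frac{\text{number of correctly predicted triplets}}{\text{total number of triplets}} \times 100\,\% \qquad (6)$$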
On our dataset of 3430 true TF-TFT-TFBS triplets and 7000 artificial TF-TFT-TFBS triplets, with k set to 0.5 (formula 5), the success rates on the positive and negative datasets are 84.7 and 89.3 %, respectively, and the overall success rate reaches 87.9 % (Table 2).
The 10-fold cross-validation tests were carried out as follows: (1) The dataset, including both positive and negative data, was randomly split into 10 portions. (2) Each portion was held out in turn, and the category of each of its samples was predicted from the remaining nine portions; a prediction succeeds if the category of the sample is predicted correctly. (3) Finally, the success rates of the 10-fold cross-validation test on the positive and negative datasets are 83.0 and 89.1 %, respectively, and the overall success rate reaches 87.0 % (Table 2).
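Both validation schemes can be expressed with one routine, since the jackknife test is the special case in which the number of folds equals the number of samples. The sketch below is illustrative only: it uses a compact nearest-neighbor rule with squared Euclidean distance (an assumption), the names `nn_label` and `success_rate` are hypothetical, and the six toy samples are made up.

```python
import random

def nn_label(training, query):
    """Label of the training (vector, label) pair nearest to the query."""
    nearest = min(training,
                  key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)))
    return nearest[1]

def success_rate(samples, n_folds, seed=0):
    """Percentage of held-out samples whose label is predicted correctly."""
    order = list(range(len(samples)))
    random.Random(seed).shuffle(order)               # random split into folds
    folds = [order[i::n_folds] for i in range(n_folds)]
    correct = 0
    for held_out in folds:
        held = set(held_out)
        training = [samples[i] for i in order if i not in held]
        for i in held_out:
            vector, label = samples[i]
            if nn_label(training, vector) == label:
                correct += 1
    return 100.0 * correct / len(samples)

# Two well-separated toy clusters (made-up data).
samples = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0),
           ([10, 10], 1), ([10, 11], 1), ([11, 10], 1)]
jackknife = success_rate(samples, n_folds=len(samples))  # leave-one-out
three_fold = success_rate(samples, n_folds=3)
```

On real data the jackknife requires one nearest-neighbor search per triplet against all others, which is why the cheaper 10-fold test is often reported alongside it.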
Support Vector Machine (SVM) algorithm with polynomial kernel (Joachims, 1999) as a classification method was also adopted to perform the prediction.
Overall, the results show that the DNA binding preferences of TFs are closely correlated with both their own functions and the functions of their targets. Compared with our previous work, which used only the TF and TFBS information (Qian et al., 2006b), the prediction performance with TF-TFT-TFBS triplets increases, as expected, when TFTs are taken into consideration.
We also show the improvement in the proportion of correct predictions compared with our previous work (Qian et al., 2006b). Because the negative dataset is randomly generated, only the positive dataset was compared. The comparison is given in Table 3.
In Table 3, we can see that 211 duplexes are correctly predicted with TF-TFT-TFBS triplets but fail with TF-TFBS duplexes, whereas only 41 duplexes are correctly predicted with TF-TFBS duplexes but fail with TF-TFT-TFBS triplets. This means that our current method does improve the prediction performance. The detailed list of the duplexes concerned in Table 3 can be found in Supplementary Material 2.
However, several difficulties remain with this integrative approach. First, it is difficult to determine which TFBS candidates should be chosen. When the length of a TFBS is L, the number of TFBS candidates reaches $4^L$, which is a huge number and requires a great deal of computation time. More and more statistical methods are being developed to identify sequence motifs (D'Haeseleer, 2006). In our future work, these statistical methods will be integrated, so that the motifs they generate, instead of the $4^L$ enumerated sequence fragments, serve as the inputs of this approach. The second bottleneck is the limited number of known TF-TFBS duplexes in TRANSFAC (Matys et al., 2006; Wingender et al., 1996). More prior information would improve the predictions.
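A short calculation makes the size of the candidate space concrete. The helper name `candidate_count` is hypothetical; the 5 to 25 bp range is the TFBS length range kept in the positive dataset.

```python
def candidate_count(length):
    """Number of distinct DNA sequences of the given length (4 bases)."""
    return 4 ** length

print(candidate_count(5))    # 1024
print(candidate_count(25))   # 1125899906842624

# Total number of candidates over all lengths from 5 to 25 bp.
total_candidates = sum(candidate_count(n) for n in range(5, 26))
```

Even the 25 bp case alone exceeds 10^15 sequences, which is why exhaustive enumeration is impractical and motif-finding methods are attractive as a source of candidates.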
| 4 CONCLUSIONS |
|---|
Specific DNA binding is a fundamental issue in understanding transcription regulatory mechanisms, so predicting DNA binding preferences is a critical problem in this research area. In this contribution, we applied a family-wise approach that predicts DNA binding preferences from functionally related TFs by integrating functional domain compositions. The success rate of our predictor reached 87.0 %. This contribution provides a novel way to investigate TFs whose DNA binding preferences are unknown, and a useful tool for TF engineering, which may have important implications for the discovery of new drugs (Jamieson et al., 2003).
| ACKNOWLEDGEMENTS |
|---|
This work was supported by the National Basic Research Program of China (No. 2006CB910700, No. 2003CB715900 and No. 2004CB518606) and the National High-Tech Research and Development Program of China (863 Program, No. 2006AA02Z320).
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Alfonso Valencia
The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.
Received on December 24, 2006; revised on June 4, 2007; accepted on June 27, 2007
| REFERENCES |
|---|
Apweiler R, et al. The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res (2001) 29:37–40.
Attwood TK. The PRINTS database: a resource for identification of protein families. Brief. Bioinformatics (2002) 3:252–263.
Bhasin M, et al. Prediction of methylated CpGs in DNA sequence using a support vector machine. FEBS Lett (2005) 579:4302–4308.
Cai YD, Chou KC. Predicting membrane protein type by functional domain composition and pseudo-amino acid composition. J. Theor. Biol (2006) 238:395–400.
Chou KC, Cai YD. Using GO-PseAA predictor to predict enzyme sub-class. Biochem. Biophys. Res. Commun (2004) 325:506–509.
Chou KC, Cai YD. Predicting protein-protein interactions from sequences in a hybridization space. J. Proteome Res (2006) 5:316–322.
D'Haeseleer P. How does DNA sequence motif discovery work? Nat. Biotechnol (2006) 24:959–961.
Finn RD, et al. Pfam: clans, web tools and services. Nucleic Acids Res (2006) 34:D247–D251.
Fox KR. DNase I footprinting. Methods Mol. Biol (1997) 90:1–22.
Jamieson AC, et al. Drug discovery with engineered zinc-finger proteins. Nat. Rev. Drug Discov (2003) 2:361–368.
Jia P, et al. Demonstration of two novel methods for predicting functional siRNA efficiency. BMC Bioinformatics (2006) 7:271.
Joachims T. Making Large-Scale SVM Learning Practical. (1999) Cambridge, MA, USA: MIT Press.
Matys V, et al. TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res (2006) 34:D108–D110.
Qian Z, et al. Automatic transcription factor classifier based on functional domain composition. Biochem. Biophys. Res. Commun (2006a) 347:141–144.
Qian Z, et al. A novel computational method to predict transcription factor DNA binding preferences. Biochem. Biophys. Res. Commun (2006b) 348:1034–1037.
Stormo GD. DNA binding sites: representation and discovery. Bioinformatics (2000) 16:16–23.
Wingender E, et al. TRANSFAC: a database on transcription factors and their DNA binding sites. Nucleic Acids Res (1996) 24:238–241.
Yu X, et al. Classification of protein quaternary structure by functional domain composition. BMC Bioinformatics (2006) 7:187.
This article has been cited by other articles:
T. M. Alleyne, L. Pena-Castillo, G. Badis, S. Talukder, M. F. Berger, A. R. Gehrke, A. A. Philippakis, M. L. Bulyk, Q. D. Morris, and T. R. Hughes. Predicting the binding preference of transcription factors to individual DNA k-mers. Bioinformatics, April 15, 2009; 25(8): 1012–1018.








