Bioinformatics Advance Access originally published online on June 29, 2006
Bioinformatics 2006 22(18):2210-2216; doi:10.1093/bioinformatics/btl329
© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

A mixture model-based discriminate analysis for identifying ordered transcription factor binding site pairs in gene promoters directly regulated by estrogen receptor-α

Lang Li 1,*, Alfred S. L. Cheng 2, Victor X. Jin 2, Henry H. Paik 3, Meiyun Fan 3, Xiaoman Li 1, Wei Zhang 1, Jason Robarge 1, Curtis Balch 3, Ramana V. Davuluri 2, Sun Kim 3, Tim H.-M. Huang 2 and Kenneth P. Nephew 3

1 Division of Biostatistics, Department of Medicine, Indiana University School of Medicine, Indianapolis, IN 47405, USA
2 Division of Human Cancer Genetics, Department of Molecular Virology, Immunology, and Medical Genetics, Comprehensive Cancer Center, Ohio State University, Columbus, OH 43210, USA
3 Medical Sciences, Indiana University School of Medicine, Bloomington, IN 47405, USA

*To whom correspondence should be addressed.


    ABSTRACT

Motivation: To detect and select patterns of transcription factor binding sites (TFBSs) that distinguish genes directly regulated by estrogen receptor-α (ERα), we developed an innovative mixture model-based discriminate analysis for identifying ordered TFBS pairs.

Results: Biologically, our proposed algorithm suggests that TFBSs are not randomly distributed within ERα target promoters (P-value < 0.001). The up-regulated targets significantly (P-value < 0.01) possess the TFBS pairs (DBP, MYC), (DBP, MYC/MAX heterodimer), (DBP, USF2) and (DBP, MYOGENIN), and the down-regulated ERα target genes significantly (P-value < 0.01) possess TFBS pairs such as (DBP, c-ETS1-68), (DBP, USF2) and (DBP, MYOGENIN). Statistically, our proposed mixture model-based discriminate analysis can simultaneously perform TFBS pattern recognition, TFBS pattern selection and target class prediction; such integrative power cannot be achieved by current methods.

Availability: The software is available on request from the authors.

Contact: lali@iupui.edu

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
1.1 Biological background
Identifying transcription factors (TFs) that regulate gene expression is an important step in understanding disease-related biological processes. Transcriptional regulation by steroid receptors and ligand-inducible transcription factors plays crucial biological roles in normal physiological and pathological processes (McDonnell and Norris, 2002). Steroid receptor activity has recently been shown to depend on combinatorial interactions with multiple nuclear proteins (Geserick et al., 2005). One of the best studied steroid receptors is that of the female sex hormone estrogen, designated estrogen receptor-α (ERα). Bioinformatics studies, including ours (Jin et al., 2004), have revealed DNA-binding sites for many transcription factors to be adjacent to estrogen response elements (EREs), the DNA sequence motifs recognized by ERα. ERα may form complexes with nearby transcription factors, and such interactions may strengthen or attenuate transcriptional activity, thus defining target gene specificity (Geserick et al., 2005). To our knowledge, no study has examined the patterns of transcription factor binding sites (TFBSs) in promoters that are directly activated or repressed by ERα.

In the present study, we focused on 10 known TFs that may play roles in regulating ERα target genes. AP1, GATA3 and SMAD3 are known to be involved in estrogen signaling in reproductive tissues and breast carcinogenesis (Lacroix and Leclercq, 2004; Lincoln, 2005; Rushton et al., 2003), while DBP, MYOGENIN and USF2 are associated with multiple ERα-regulated processes (Jin, 2004). The c-MYC proteins (MYC and its binding partner MAX) are important regulators of many cellular processes, including proliferation and apoptosis (Pelengaris et al., 2002). In addition, MYC and MAX are over-expressed in human breast carcinomas (Bland et al., 1995), are rapidly up-regulated in response to estrogen, and are known mediators of the stimulatory effects of this hormone (Rodrik et al., 2005). The ETS family is one of the largest families of transcriptional regulators that activate or repress the expression of genes involved in various biological processes, including human cancer (Seth and Watson, 2005). c-ETS1-68, a splice variant of c-ETS1, is over-expressed in breast cancer (Myers et al., 2005; Lincoln and Bove, 2005; Seth and Watson, 2005) and has recently been shown to play a role in ERα-mediated tumor invasion and angiogenesis (Lincoln et al., 2003).

Key insight into the gene networks underlying the role of ERα in breast cancer has come from microarray studies profiling up- and down-regulation of ERα target gene expression in breast cancer cells (Lin et al., 2004). These and other studies further suggest that breast cancer growth is regulated by the coordinated action of ERα and multiple signaling pathways (Osborne et al., 2005). In this regard, the combination of ERα with transcription factors, such as those identified above, may provide an important mechanism to coordinate hormone-elicited signals for proper regulation of target genes. The overall focus of the present study was up- and down-regulated ERα target genes. In particular, we examined TFBS patterns in target promoters that are either activated or repressed by ERα. We also investigated whether pairs of TFBSs were ordered and whether distinct TFBS patterns could distinguish up- and down-regulated ERα targets.

1.2 Computational algorithm background
Many computational algorithms (i.e. de novo TFBS discovery methods) have been developed for large-scale TFBS discovery. Importantly, these approaches can identify candidate TFBSs for experimental validation. Three major challenges face de novo discovery methods: (1) the identification of TFBSs from a set of known co-regulated gene promoters, without knowing their locations or position specific weight matrices (PSWMs) (Bailey, 1994; Bussemaker et al., 2001; Gupta and Liu, 2003; Liu et al., 1995, 2001; Roth et al., 1998); (2) the inference of TFBSs and their modules along a promoter sequence, based on a set of known TF PSWMs (Bailey and Nobel, 2003; Berman et al., 2002; Crowley et al., 1997; Firth et al., 2001, 2002; Frech et al., 1997; Kondrakhin et al., 1995; Prestridge, 1995; Sinha et al., 2003) and (3) the discovery of patterns within the detected TFBSs that not only characterize a regulatory mechanism, but also distinguish between mechanisms. For example, the TFBS patterns that differ between ERα up- and down-regulated promoters have yet to be identified and characterized.

In this paper, we focused on challenge three: finding patterns within the detected TFBSs that not only characterize a regulatory mechanism but also distinguish between mechanisms. Several methods have previously been proposed for this objective, including logistic regression analysis (Krivan and Wasserman, 2001), logic regression (Keles et al., 2004) and classification trees (Jin et al., 2004). However, these methods merely select TFBSs that distinguish different regulation mechanisms, and cannot identify specific TFBS patterns within each regulation mechanism. To address both parts of challenge three, we propose a mixture model-based discriminate analysis. This approach uses a random background TFBS distribution component and a position specific weight matrix for any TFBS pair (adjacent or non-adjacent). A mixture model was established for each of the ERα up- and down-regulated target sets, and the two models together form a discriminate model for prediction of up/down classes. If corresponding parameters in the two mixture models differed only slightly and the difference did not improve prediction, they were ‘annealed’ to a common value to yield our optimal parsimonious model.


    2 METHODS
2.1 A mixture model for ordered TFBS pairs
Before investigating the order of TFBSs, the actual presence of specific TFBSs was predicted from a supervised TFBS identification method (Jin et al., 2004), as described in detail in Section 3.

The distribution of TFBSs on promoters is assumed to be either random or ordered. If all K TFBSs are randomly distributed, their relative frequencies are described by a parameter vector V0 = (V0,1, ..., V0,k, ..., V0,K)T, where V0,k represents the probability of the presence of TFBS k at any position on any promoter. Therefore, under this random model V0, the probability of observing a TFBS pair (k1, k2) is V0,k1 × V0,k2.

Conversely, if TFBSs are not random, all ordered TFBS pairs (adjacent or non-adjacent) are modeled by a TFBS position weight matrix (TFPWM). It is a K × 2 matrix, V1 = (V1,1, V1,2), where V1,1 is the frequency vector for all K TFBSs at position 1, and V1,2 is the frequency vector at position 2. If a TFBS pair (k1, k2) follows the TFBS pair model V1, it has a probability of V1,1,k1 × V1,2,k2, where 1 ≤ k1, k2 ≤ K. This TFBS pair model can identify non-adjacent TFBS pairs, which can be either truly non-adjacent, or truly adjacent but disrupted by falsely predicted TFBSs.
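The two pair probabilities described above can be illustrated with a minimal Python sketch. It assumes that the two members of a pair are independent given the model, so each probability is a product of two frequencies. The number of TFBS types, the frequency values and the function names (`p_random`, `p_ordered`) are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical example with K = 3 TFBS types (indices 0, 1, 2).
# v0: background frequencies (length K); v1: ordered-pair matrix (K x 2),
# column 0 for position 1 and column 1 for position 2. Values are made up.
v0 = np.array([0.5, 0.3, 0.2])
v1 = np.array([[0.7, 0.1],
               [0.2, 0.3],
               [0.1, 0.6]])

def p_random(k1, k2, v0):
    """P(pair | V0): both members drawn independently from the background."""
    return v0[k1] * v0[k2]

def p_ordered(k1, k2, v1):
    """P(pair | V1): first member from column 0, second member from column 1."""
    return v1[k1, 0] * v1[k2, 1]

print(p_random(0, 2, v0))    # 0.5 * 0.2
print(p_ordered(0, 2, v1))   # 0.7 * 0.6
print(p_ordered(2, 0, v1))   # 0.1 * 0.1: the ordered model is not symmetric
```

Under the background model the order of the pair does not matter, whereas under the ordered model swapping the two members generally changes the probability.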

The distribution of TFBS pairs is modeled by a mixture of the random (V0) and ordered (V1) models. Given that sequence i has li TFBSs, it possesses mi = li × (li – 1)/2 TFBS pairs. Let Xij = (Xij1, Xij2) denote the j-th TFBS pair from sequence i. If Xij is generated by the ordered (TFPWM) model V1, the indicator variable Zij = 1; if Xij is generated by V0, Zij = 0. With λ denoting the proportion of TFBS pairs that follow V1, the log-likelihood function is given by

[Equation (1)]
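For orientation, a two-component mixture log-likelihood consistent with the definitions above would take the following form. This is a reconstruction from the surrounding text rather than the published typesetting; N (the number of promoter sequences) and the subscript convention V_{1,u,k} (the k-th entry of the frequency vector at position u) are notational assumptions.

```latex
\ell(\lambda, V_1, V_0)
  = \sum_{i=1}^{N} \sum_{j=1}^{m_i}
    \log\Bigl[\, \lambda \, V_{1,1,X_{ij1}} \, V_{1,2,X_{ij2}}
      + (1-\lambda) \, V_{0,X_{ij1}} \, V_{0,X_{ij2}} \Bigr]
```

The first term inside the logarithm is the ordered-pair probability weighted by λ, and the second is the background probability weighted by 1 − λ.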

We can then implement an E-M algorithm by treating Zij as missing data. The E-step of the algorithm is specified as

[Equation (2)]

[Equation (3)]
where I(Xiju) is a K × 1 indicator vector that specifies the TFBS Xiju, Xij = (Xij1, Xij2), and (λ(t), V1(t), V0(t)) are the current parameter estimates.

In the M-step, the parameters in the matrices V1 and V0 are estimated by

[Equation (4)]
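The E- and M-steps can be sketched in Python as follows. This is one natural reading of an EM algorithm for a two-component mixture of a background model and an ordered-pair model, not a transcription of the published update equations: the posterior weights and the weighted relative-frequency updates are assumptions, as are the function names, the random initialization and the fixed iteration count.

```python
import numpy as np
from itertools import combinations

def pairs_from_sites(sites):
    """All l*(l-1)/2 pairs (earlier site, later site) from one promoter."""
    return list(combinations(sites, 2))

def fit_pair_mixture(pairs, K, n_iter=200, seed=0):
    """EM estimates of lambda, V1 (K x 2) and V0 (K,) from a list of pairs."""
    x = np.asarray(pairs)
    rng = np.random.default_rng(seed)
    eps = 1e-12                                    # guards against division by zero
    lam = 0.5
    v0 = np.bincount(x.ravel(), minlength=K) / x.size
    v1 = rng.dirichlet(np.ones(K), size=2).T       # random start, columns sum to 1
    for _ in range(n_iter):
        # E-step: posterior probability that each pair was generated by V1
        p1 = lam * v1[x[:, 0], 0] * v1[x[:, 1], 1]
        p0 = (1 - lam) * v0[x[:, 0]] * v0[x[:, 1]]
        z = p1 / (p1 + p0 + eps)
        # M-step: mixing proportion and weighted relative frequencies
        lam = z.mean()
        for u in (0, 1):
            counts = np.bincount(x[:, u], weights=z, minlength=K)
            v1[:, u] = counts / (counts.sum() + eps)
        c0 = (np.bincount(x[:, 0], weights=1 - z, minlength=K)
              + np.bincount(x[:, 1], weights=1 - z, minlength=K))
        v0 = c0 / (c0.sum() + eps)
    mix = (lam * v1[x[:, 0], 0] * v1[x[:, 1], 1]
           + (1 - lam) * v0[x[:, 0]] * v0[x[:, 1]])
    return lam, v1, v0, float(np.log(mix + eps).sum())

# Hypothetical promoters: TFBS type indices listed in promoter order.
sites = [[2, 0, 1], [2, 1, 0, 1], [2, 0]]
pairs = [p for s in sites for p in pairs_from_sites(s)]
lam, v1, v0, loglik = fit_pair_mixture(pairs, K=3)
print(len(pairs), round(lam, 3), round(loglik, 3))
```

Because mixture likelihoods can have several local maxima, a practical implementation would probably run the fit from multiple starting values and keep the solution with the highest log-likelihood.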

Our ordered TFBS pair model [Equation (1)] is therefore similar to a two-component mixture (TCM) model (Bailey, 1994). In our model, we assume that any ordered TFBS pair can be present multiple times within any promoter sequence. The unique feature of our model is that it considers any pair of TFBSs, even if they are not adjacent, while the Bailey TCM model describes only a DNA sequence segment whose nucleotides are all adjacent to one another. Computationally, the parameter estimates for V1 and V0 therefore differ from those of the previously proposed approach (Bailey, 1994).

2.2 Testing and selecting significant ordered TFBS pairs
We first test the hypothesis that the predicted TFBSs contain ordered TFBS pairs. The likelihood ratio test is given by

[Equation (5)]

The P-value for this type of likelihood ratio test was previously discussed by Bailey and Gribskov (1998), who proposed an approximate test. Although this approximation worked well in a TCM model, it was unsuccessful in our ordered TFBS pair model, in which TFBS pairs are not constrained to be adjacent to one another. The difficulty of the asymptotic likelihood ratio test for mixture models is well documented by McLachlan and Peel (2000), who showed that the theoretical difficulties lie in the parameter identifiability problem when the data follow the model V0. Until these theoretical issues are solved, the most reliable likelihood ratio test is through a bootstrap analysis. To establish an empirical distribution of the test statistic, multiple simulated datasets are generated under the background model V0, in which TFBSs are randomly distributed along the sequences. With this simulation scheme, 10 000 replicate datasets are sufficient to generate the empirical distribution under the null model V0.
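The bootstrap procedure can be sketched in Python as below. To keep the block short and self-contained, the test statistic is a simplified stand-in (position-specific frequencies versus pooled frequencies, without the mixing proportion λ) and is not the statistic of Equation (5); the function names, the toy promoters and the default number of replicates are likewise assumptions.

```python
import numpy as np
from itertools import combinations

def site_pairs(sites):
    """(n_pairs, 2) array of (earlier, later) TFBS pairs from one promoter."""
    return np.array(list(combinations(sites, 2)))

def lr_statistic(promoters, K):
    """Stand-in statistic: 2 * [loglik(position-specific) - loglik(pooled)]."""
    x = np.vstack([site_pairs(s) for s in promoters if len(s) > 1])
    pooled = np.bincount(x.ravel(), minlength=K) / x.size
    ll0 = np.log(pooled[x]).sum()
    ll1 = 0.0
    for u in (0, 1):
        f = np.bincount(x[:, u], minlength=K) / len(x)
        ll1 += np.log(f[x[:, u]]).sum()
    return 2.0 * (ll1 - ll0)

def simulate_null(lengths, v0, rng):
    """Promoters whose sites are drawn independently from the background V0."""
    return [rng.choice(len(v0), size=l, p=v0) for l in lengths]

def bootstrap_pvalue(promoters, K, n_boot=1000, seed=0):
    """Empirical P-value of the observed statistic under the random model."""
    rng = np.random.default_rng(seed)
    observed = lr_statistic(promoters, K)
    all_sites = np.concatenate([np.asarray(s) for s in promoters])
    v0 = np.bincount(all_sites, minlength=K) / all_sites.size
    lengths = [len(s) for s in promoters]
    null = np.array([lr_statistic(simulate_null(lengths, v0, rng), K)
                     for _ in range(n_boot)])
    # add-one correction keeps the estimate away from exactly zero
    return observed, (1 + np.sum(null >= observed)) / (n_boot + 1)

# Hypothetical promoters in which type 0 always precedes types 1 and 2.
promoters = [[0, 1, 2], [0, 2, 1], [0, 1, 1, 2], [0, 2]] * 5
stat, pval = bootstrap_pvalue(promoters, K=3, n_boot=200)
print(round(stat, 2), pval)
```

The text uses 10 000 replicate datasets; the smaller `n_boot` here only keeps the example fast.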

To declare an ordered TFBS pair significant, its score and P-value must be calculated. We propose a score based on the log-odds between the TFBS pair model and the background model [Equation (6)]. The significance of this score [Equation (7)] is calculated as the tail percentile of an empirical score distribution built from TFBS pairs generated under the background model V0. A total of 1000 TFBS pairs were generated under the null model V0; x denotes a TFBS pair.

[Equation (6)]

[Equation (7)]
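A minimal Python sketch of the score and its empirical P-value follows. It assumes that the score is the log of the ratio between the pair probability under V1 and under V0, and that the P-value is the fraction of null pairs scoring at least as high as the observed pair; the function names and the example frequencies are hypothetical.

```python
import numpy as np

def score_table(v1, v0):
    """K x K table of log-odds scores; entry (k1, k2) scores the pair (k1, k2)."""
    ordered = np.outer(v1[:, 0], v1[:, 1])
    background = np.outer(v0, v0)
    return np.log(ordered) - np.log(background)

def empirical_pvalue(pair, v1, v0, n_null=1000, seed=0):
    """Fraction of pairs simulated under V0 that score at least as high as `pair`."""
    table = score_table(v1, v0)
    rng = np.random.default_rng(seed)
    null_pairs = rng.choice(len(v0), size=(n_null, 2), p=v0)
    null_scores = table[null_pairs[:, 0], null_pairs[:, 1]]
    return float(np.mean(null_scores >= table[pair]))

# Made-up frequencies for K = 3 TFBS types.
v0 = np.array([0.5, 0.3, 0.2])
v1 = np.array([[0.7, 0.1],
               [0.2, 0.3],
               [0.1, 0.6]])
table = score_table(v1, v0)
print(table[0, 2])                          # log(0.42 / 0.10)
print(empirical_pvalue((0, 2), v1, v0))     # highest-scoring pair
print(empirical_pvalue((2, 0), v1, v0))     # lowest-scoring pair
```

Looking scores up in one precomputed table means the observed and simulated scores are compared on identical floating-point values.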

2.3 Discriminate analysis
To perform discriminate analyses, we denote φ1 as a (3 × K + 1) × 1 parameter vector that contains all parameters in (λ, V1, V0) from the TFBS pair model for the ERα up-regulated promoter sequences. Denoting φ2 as the corresponding set of parameters from the ERα down-regulated promoter sequences, Equation (8) describes a discriminate function that predicts the ERα regulation class based on a set of TFBS pairs X* from a new sequence.

[Equation (8)]
where m denotes the number of TFBS pairs, and X*j denotes the j-th pair. If δ(X*|φ1, φ2) = 1, the discriminate function classifies the new sequence as an up-regulated target; if δ(X*|φ1, φ2) = –1, it classifies the new sequence as a down-regulated target. The prediction accuracy is thus evaluated by the agreement between the value predicted by δ(•) and the true class of the new sequence.
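A discriminate function of this kind can be sketched in Python as a comparison of mixture log-likelihoods under the two fitted models. The rule below (sign of the log-likelihood difference summed over the m pairs, with ties assigned to the up-regulated class) is an assumption consistent with the description above, not a transcription of Equation (8); the parameter values are made up.

```python
import numpy as np

def mixture_loglik(pairs, lam, v1, v0):
    """Sum over pairs of log[lam * P(pair | V1) + (1 - lam) * P(pair | V0)]."""
    x = np.asarray(pairs)
    p = (lam * v1[x[:, 0], 0] * v1[x[:, 1], 1]
         + (1 - lam) * v0[x[:, 0]] * v0[x[:, 1]])
    return float(np.log(p).sum())

def classify(pairs, phi_up, phi_down):
    """Return +1 (up-regulated) or -1 (down-regulated); ties go to +1 (assumed)."""
    diff = mixture_loglik(pairs, *phi_up) - mixture_loglik(pairs, *phi_down)
    return 1 if diff >= 0 else -1

# Hypothetical parameters (lambda, V1, V0) for K = 3 TFBS types.
v0 = np.full(3, 1 / 3)
phi_up = (0.5, np.array([[0.8, 0.1], [0.1, 0.8], [0.1, 0.1]]), v0)
phi_down = (0.5, np.array([[0.8, 0.1], [0.1, 0.1], [0.1, 0.8]]), v0)

print(classify([(0, 1), (0, 1)], phi_up, phi_down))   # pairs typical of the up model
print(classify([(0, 2)], phi_up, phi_down))           # pair typical of the down model
```

Each φ is passed as a tuple (λ, V1, V0), which together hold the 3 × K + 1 parameters mentioned in the text.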

In Equation (8), some corresponding parameters of the two models φ1 and φ2 may be very close to each other, while others may be very different. However, the discriminate function [Equation (8)] by itself does not indicate which TFBSs or TFBS pairs are more important for predicting the up/down classes. Therefore, we developed an ‘annealing’ procedure, which sets two corresponding parameters to the same value, provided that they are numerically close and that their difference does not improve the prediction of the discriminate function. Consequently, the TFBSs and TFBS pairs that retain different frequencies in either the background TFBS model, V0, or the TFBS pair model, V1, are those responsible for the best regulation class prediction. The annealing is controlled by a shrinkage parameter, Δ, as follows:

[Equation (9)]

To select the optimal Δ, i.e. the one giving the best prediction accuracy, a 3-fold cross-validation is implemented. Parameters (φ1, φ2) are estimated from the training sample and annealed using Δ, and prediction accuracies are then calculated for both the training and testing samples. In practice, 20 equally spaced values between 0 and 2.0 are tested for Δ in the cross-validation. The largest Δ providing the best prediction in the validation sample is then used to anneal the parameters of the two mixture models. The major advantage of this method is its flexibility in selecting highly correlated predictors. Regression-based variable selection procedures typically do not select all such correlated predictors, even though each of them adequately predicts the outcome (or class). This point is further addressed in our simulation studies (Section 3).
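The annealing step and the choice of Δ can be illustrated with the Python sketch below. The closeness rule is an assumption made only for illustration: two corresponding frequencies are merged when the absolute difference of their logarithms is below Δ, the merged value is their average, and the remaining entries are rescaled so that each vector still sums to one. The published criterion in Equation (9) may differ. The function names and the example numbers are hypothetical.

```python
import numpy as np

def anneal(freq_up, freq_down, delta):
    """Merge entries of two frequency vectors whose |log ratio| is below delta."""
    up = np.asarray(freq_up, dtype=float).copy()
    down = np.asarray(freq_down, dtype=float).copy()
    close = np.abs(np.log(up) - np.log(down)) < delta   # assumed closeness rule
    shared = 0.5 * (up + down)
    up[close] = shared[close]
    down[close] = shared[close]
    if (~close).any():
        # rescale the entries that stay distinct so each vector sums to one
        rest = 1.0 - shared[close].sum()
        up[~close] *= rest / up[~close].sum()
        down[~close] *= rest / down[~close].sum()
    return up, down, close

def pick_delta(deltas, accuracies):
    """Largest delta among those reaching the best validation accuracy."""
    deltas = np.asarray(deltas, dtype=float)
    accuracies = np.asarray(accuracies, dtype=float)
    best = np.isclose(accuracies, accuracies.max())
    return float(deltas[best].max())

# Made-up background frequencies for four TFBS types in the two classes.
up_a, down_a, merged = anneal([0.30, 0.30, 0.25, 0.15],
                              [0.05, 0.32, 0.27, 0.36], delta=0.5)
print(merged, up_a.round(3), down_a.round(3))

grid = np.linspace(0.0, 2.0, 20)     # one reading of "20 equally spaced values"
print(grid[:3])
print(pick_delta([0.0, 0.5, 1.0, 1.5], [0.80, 0.87, 0.87, 0.85]))
```

Choosing the largest Δ among equally accurate candidates merges as many parameters as possible, which matches the stated aim of a parsimonious model.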


    3 IMPLEMENTATION
3.1 Data analysis
3.1.1 ERα target sequences
ERα targets were previously identified by ‘ChIP-on-chip’, a microarray technique in which DNA sequences bound to specific proteins are immunoprecipitated and hybridized to a microarray (Cheng et al., 2006). Following estrogen treatment of MCF7 cells, up- or down-regulated ERα targets were identified using antibodies against acetylated or methylated histone H3 lysine 9 (H3K9) (Roh et al., 2005). Loci identified as ERα targets were classified as activated or repressed, based on their ratios of acetylation/methylation (Ac/Me) at H3K9. Using those criteria, 42 target genes had Ac/Me > 1.5 and were defined as up-regulated, while 35 loci had Ac/Me < 0.67 and were classified as down-regulated. Every target probe sequence was extended by 2000 bp upstream and 1000 bp downstream.
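The Ac/Me classification rule amounts to two thresholds, as in the short Python sketch below. The function name and the label for loci that fall between the two cut-offs are assumptions; the cut-off values are those stated above.

```python
def classify_locus(ac_me_ratio, up_cut=1.5, down_cut=0.67):
    """Label a locus by its H3K9 acetylation/methylation (Ac/Me) ratio."""
    if ac_me_ratio > up_cut:
        return "up"
    if ac_me_ratio < down_cut:
        return "down"
    return "unclassified"   # hypothetical label for ratios between the cut-offs

# Made-up ratios for illustration.
for ratio in (2.1, 0.4, 1.0):
    print(ratio, classify_locus(ratio))
```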

3.1.2 TFBS module identifications
In the supervised TFBS identification, we used MATCH and PSSMs in the TRANSFAC database (http://www.gene-regulation.com/cgi-bin/pub/databases/transfac/search.cgi) to scan orthologous pairs of human and mouse promoters for the following TFs: AP1, c-ETS1-68, DBP, ERα, GATA3, MYC, MYC/MAX, MYOGENIN, SMAD3 and USF2, using the ‘minFN_good71.prf’ profile (profile of cut-off values with minimum number of false-negative predictions) of MATCH. A predicted binding site was considered conserved if the sequence similarity of the scanned TFBS was ≥60%, using ClustalW sequence alignment (Thompson et al., 1994). Among the 42 up-regulated targets, 39 (Supplementary Table S1) possessed predicted ERα-binding sites; among the 35 down-regulated targets, 27 (Supplementary Table S1) did. These 66 targets were subjected to the data analysis described below, and the other 11 targets (3 up- and 8 down-regulated) were dropped from the analysis. The predicted TFBSs, and their relative positions, in the 39 up- and 27 down-regulated promoter sequences are plotted in Figure 1 and Supplementary information (file 2).


Fig. 1 TFBS distributions in up- (a) and down- (b) regulated promoter targets. c-ETS1-68 is denoted by blue dots; DBP is labeled with gold dots; ER is labeled with red dots; MYC and MYC/MAX are labeled with black dots; MYOGENIN is labeled with green dots; USF2 is labeled with brown dots and GATA3 is labeled with gray dots. The other TFBSs for AP1 and SMAD3 are labeled with open circles. The x-axis represents the relative positions (not physical positions) among TFBSs, and they are not equally spaced.

 
3.1.3 TFBS pattern identification for up-regulated ERα targets
Figure 2a shows the mixture model for TFBS pairs among up-regulated promoter targets. V0 denotes the background model, and V1 denotes the ordered pair model for a TFBS pair Xij = (Xij1, Xij2). The color intensity represents the relative frequencies of the TFBSs; the darker the color, the higher the frequency. The relative frequencies of c-ETS1-68 and ERα in a TFBS pair Xij are very similar in V0 and V1, which suggests that these two TFBSs follow the random model V0. In contrast, the relative frequencies of the other eight TFs in V1 differ to various degrees between positions 1 and 2 of a TFBS pair, and all differ from their frequencies in the random model V0.


Fig. 2 Heatmap (a) is the mixture model for TFBS pairs among up-regulated promoter targets. V0 denotes the background model and V1 denotes the ordered pair model for a TFBS pair (Xij1, Xij2). The higher the probability, the darker the color; the smaller the probability, the lighter the color. Heatmap (b) is the mixture model for TFBS pairs among down-regulated promoter targets. Heatmaps (c) and (d) display the mixture models for up- and down-regulated targets respectively, after annealing by the discriminate analysis. It appears in heatmaps (c) and (d) that both target sets have comparable heatmaps, except that up-regulated targets have more MYC and MYC/MAX, while down-regulated targets possess more c-ETS1-68 sites.

 
Following determination of the TFBS frequencies, a log-likelihood ratio test [Equation (5)] was conducted, which significantly (P-value = 0.0003) rejected the null hypothesis, demonstrating that the order of the 10 TFBSs was not purely random. Based on our score and P-value calculation schemes [Equations (6) and (7)], the following TFBS combinations had P-values < 0.01: (DBP, MYC), (DBP, MYC/MAX), (DBP, USF2) and (DBP, MYOGENIN). In Figure 2a, DBP clearly has a higher frequency in position 1 than in position 2 of a TFBS pair, whereas MYC, MYC/MAX, USF2 and MYOGENIN all have lower frequencies in position 1 than in position 2. In Figure 1a, 33 of the 39 up-regulated sequences share a common pattern: TFBS combinations starting with DBP and ending with a combination of USF2, MYC/MAX, MYOGENIN and/or MYC. ERα and other TFBSs were found to occur before, between or after these specifically ordered TFBS pairs.

3.1.4 TFBS pattern identification for down-regulated ERα targets
The V1 and V0 models for down-regulated ERα targets are shown in Figure 2b. The relative frequencies of the TFBSs for ERα, MYC, MYC/MAX and SMAD3 in a TFBS pair are very similar in V0 and V1, which suggests that these TFBSs follow the random model V0. Conversely, the relative frequencies of the other six TFBSs differ to various degrees between the two positions of a TFBS pair, and all differ from their frequencies in the random model V0.

Similar to the result for up-regulated targets, the log-likelihood ratio test [Equation (5)] strongly (P-value = 0.0007) rejected the null hypothesis, demonstrating that the arrangements of the 10 TFBSs were not purely random. Based on our score and P-value calculation schemes [Equations (6) and (7)], the following TFBS combinations had a P-value < 0.01: (DBP, c-ETS1-68), (DBP, USF2) and (DBP, MYOGENIN). In Figure 2b, DBP clearly has a higher frequency in position 1 than in position 2 of a TFBS pair, whereas USF2, c-ETS1-68 and MYOGENIN all have lower frequencies in position 1 than in position 2. In Figure 1b, 26 of the 27 down-regulated sequences share a common pattern: combinations of TFBSs starting with DBP and ending with a combination of USF2, MYOGENIN and c-ETS1-68. Similar to up-regulated targets, the ERα-binding sites (and the other TFBSs) could occur before, between or after these specifically ordered TFBS pairs.

3.1.5 Mixture model-based discriminate analysis
Comparing the TFBS patterns of the ERα up- and down-regulated targets shown in Figure 2a and 2b, the frequencies of MYC, MYC/MAX and c-ETS1-68 differed markedly between the two target sets in both V0 and V1, while the other TFBS frequencies exhibited much smaller differences. We therefore expected those three TFBSs to possess much higher predictive power than the others.

Using a 3-fold cross-validation, the mixture model-based discriminate analysis achieved 87% prediction accuracy in the validation dataset. When the shrinkage parameter Δ was set to 1.0, three TFBSs (MYC, MYC/MAX and c-ETS1-68), when paired with DBP, demonstrated different distribution probabilities between the two target sets. Specifically, MYC and MYC/MAX were up-regulation specific (neither was predicted in the down-regulated targets), while c-ETS1-68 was specific for down-regulation (not predicted in the up-regulated targets). Figure 2c and 2d show our final discriminate analysis model, constructed from the two mixture models fitted to the full dataset and annealed with the optimal Δ selected by cross-validation. Only those three TFBSs (MYC, MYC/MAX and c-ETS1-68) showed differential frequencies between the two target classes in both V0 and V1; all the other TFBSs had the same distributions.

3.1.6 Alternative discriminate analysis
Using TFBS presence/absence as a predictor, logistic regression selected only MYC and MYC/MAX to predict the up/down target classes, with 87% prediction accuracy. This analysis was conducted with the R function glm(), using stepwise forward variable selection with a 3-fold cross-validation. Using the same binary TFBS predictors, we constructed a classification tree model. Here, the ‘Gini’ algorithm was selected as the splitting method for growing the tree, with 3-fold cross-validation then used to obtain the minimal tree achieving optimal prediction (Breiman et al., 1984). Classification tree analysis was performed with the R function tree() and selected a tree that contained MYC, MYC/MAX and c-ETS1-68, with a prediction accuracy of 87%. Logic regression (Keles et al., 2004; Ruczinski et al., 2003) was then implemented to select binary predictors and predict up/down-regulation classes. This analysis was conducted in the R package LogicReg. A 3-fold cross-validation, used to select the optimal predictor subset that achieved the best prediction, selected a logic model that contained MYC, MYC/MAX and c-ETS1-68, achieving a prediction accuracy of 87%.

3.2 Statistical simulation and validation studies
3.2.1 Promoter sequence simulation
In our analysis, TFBS predictions were made by comparing binding sites between human and mouse sequences. Among the promoter sequences that we investigated in this paper, ~70% were conserved, a conservation rate very close to the rate reported for comparative analysis of the mouse genome (Consortium, 2002). In our simulation studies, a 70% overlap of human and mouse sequences was simulated. Consensus sequences for 10 TFBSs were generated, using TRANSFAC position weight matrices, and inserted into both human- and mouse-simulated sequences. Additional simulation detail can be found in our Supplementary information (S3.doc). In the following simulation studies, the TFBSs MYC, c-ETS1-68, DBP, MYOGENIN, USF2, MYC/MAX, GATA3, AP1, SMAD3 and ERα are denoted by TF1–TF10, respectively.

3.2.2 Ordered TFBS pair detection performance in mixture model
To investigate whether our mixture model could select ordered TFBS pairs, and how sensitive it is to the initial TFBS prediction, we performed a simulation study. Based on our analysis of ERα up-regulated promoter targets (Fig. 1), TF3 binding sites usually occur before TF5, TF4 or TF1. In our simulated sequences, these three ordered pairs were inserted into the sequence using a pre-specified order, while the other six TFBSs were randomly inserted.

The matrix similarity score (mSS) proposed by MATCH (2003) was then employed to predict TFBSs. The sequence similarity threshold of the predicted TFBSs between simulated human and mouse promoters was defined as 60%. The cut-offs for the predicted TFBS were selected based on the number of false positives in a random sequence with n bps, and the false positive thresholds 0.25, 0.5, 1, 2 and 3 were evaluated. In addition, the proportion of simulated sequences that contain those three ordered TFBS pairs was varied among (0, 30, 50, 80 and 100%). All the other sequences, however, possessed only randomly distributed TFBSs. In our mixture model-based TFBS pair selection, the P-value threshold for the TFBS pair was set at 5%. The P-value calculation is illustrated in our Supplementary information (S3.doc).

Figure 3 displays the average probability (power) of selecting the three ordered TFBS pairs. First, the shorter the sequences, the higher the probability that the model detects sequentially ordered TFBSs. Second, a set of sequences with a smaller proportion of ordered TFBS pairs has a lower probability of detecting them than a set with a higher proportion. Third, in our simulated sequence length range (500–4000 bp), the probability of detecting the three ordered TFBS pairs appears to be highest if 0.5–1 false positives are allowed in predicting TFBSs. Although a more stringent false positive control, such as 0.25, does not greatly reduce the probability, a less stringent false positive control, such as 2, lowers it. Finally, in our ERα promoter target TFBS analysis, false positives in TFBS prediction are confined to the range 0.5–1.0. Consequently, our method should retain the highest power to detect ordered TFBS pairs.


Fig. 3 Probabilities in selecting ordered motif pairs in simulation studies. The x-axis represents the number of false positive motif predictions with respect to the sequence length; the y-axis represents the probability of selecting pre-specified ordered motif pairs. Different lines represent various proportions of sequences that contain the ordered motif pairs: line ‘a’ denotes 100% of sequences containing the motif pairs, line ‘b’, 80%; line ‘c’, 50%; line ‘d’, 30% and line ‘e’, 0%.

 
3.2.3 Ordered TFBS pair mixture model-based discriminate analysis performance
One simulation study was conducted to compare our mixture model-based discriminate analysis with logistic, logic and classification tree analyses. In this study, two sets of sequences were simulated. Among the first sequence set, a subset contained the ordered TFBS pairs (TF3, TF1) and (TF3, TF4). The number of sequences containing these pairs followed a binomial distribution with parameter 0.80. TF2 was excluded and the other six TFBSs did not display ordered appearances. The second sequence set was nearly the same as the first, except that its sequences possessed the order (TF3, TF2), rather than (TF3, TF1). Here, TF1 was excluded and the number of sequences possessing these pairs independently followed a binomial distribution with parameter 0.80. We further expected that either (TF3, TF1) or (TF3, TF2) could accurately distinguish the two sequence classes. Here, every sequence was ~3000 bp in length, with 30 sequences simulated for each set, and a total of 1000 sets analyzed. The prediction accuracy was based on 3-fold cross-validation.
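The design of this simulation can be sketched in Python at the level of TFBS lists rather than nucleotide sequences, which is a simplification of the procedure described above (no 3000 bp sequences, no human and mouse conservation, no TFBS prediction step). TF1 to TF10 are coded as the integers 1 to 10; the function name, the number of background sites per sequence and the seeds are hypothetical.

```python
import numpy as np

def simulate_tfbs_lists(n_seq, class_tf, p_ordered=0.8, n_background=6, seed=0):
    """TFBS lists for one class; TF3 precedes `class_tf` and TF4 when ordered."""
    rng = np.random.default_rng(seed)
    background = np.arange(5, 11)          # TF5..TF10 carry no ordering signal
    sequences = []
    for _ in range(n_seq):
        sites = [int(tf) for tf in rng.choice(background, size=n_background)]
        if rng.random() < p_ordered:       # each sequence is ordered with prob. 0.8
            later = [int(tf) for tf in rng.permutation([class_tf, 4])]
            p1, p2, p3 = sorted(int(p) for p in
                                rng.integers(0, len(sites) + 1, size=3))
            # insert from the back so that TF3 ends up before the other two
            sites.insert(p3, later[1])
            sites.insert(p2, later[0])
            sites.insert(p1, 3)
        sequences.append(sites)
    return sequences

# Set one carries (TF3, TF1) and (TF3, TF4); set two carries (TF3, TF2) instead.
set_one = simulate_tfbs_lists(30, class_tf=1, seed=1)
set_two = simulate_tfbs_lists(30, class_tf=2, seed=2)
print(set_one[0])
print(sum(3 in s for s in set_one), sum(3 in s for s in set_two))
```

Because each sequence receives the ordered pairs independently with probability 0.8, the number of ordered sequences per set follows a binomial distribution, as in the description above.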

In our simulation study, our analyses from the four TFBS discovery methods started from the ideal situation that every TFBS prediction was 100% accurate (i.e. 100% sensitivity and specificity, with no false positive prediction). We then proceeded to analyze simulated data with falsely predicted TFBSs.

Figure 4 shows several results. First, with every TFBS accurately predicted, and no false positives, all four methods have comparable prediction accuracies. However, if false positive TFBS predictions exist, the mixture model-based discriminate analysis has the highest prediction accuracy, which remains stable as the number of false positives increases. Second, both TF1 and TF2 are selected for predicting the two classes in the mixture model, with a probability >75%; this probability was not sensitive to the controlled number of false positives. Third, logistic regression usually selects either TF1 or TF2, but not both. Finally, logic regression and classification tree methods selected both TF1 and TF2 with probability >75% when the false positive number was controlled at a low level (0.5). However, the ability of these methods to select TF1 and TF2 was reduced to 24% with the number of false positives set at two.


Fig. 4. Mixture model discriminate analysis, simulation 1. In this simulation study, the ordered motif pairs of sequence set one contain TF1, and the ordered motif pairs of sequence set two contain TF2. In both sets, ~80% of sequences carry ordered motif pairs. The x-axis represents the number of false positive motif predictions per 3000 bp; in the upper left plot, the y-axis denotes the prediction accuracy for all four methods; in the other three plots, the y-axis represents the TF selection probabilities for either TF1 or TF2. Each line represents a different TFBS discovery method: line ‘a’, mixture model; line ‘b’, logistic regression; line ‘c’, logic regression and line ‘d’, classification tree.

 
Logistic regression, logic regression and classification tree all use TFBS presence or absence to predict class. A large false positive threshold can lead to a TFBS being predicted in up to 100% of sequences, so that its presence loses its power to predict the class. In contrast, the mixture model uses the random or ordered frequencies of a TFBS to predict the class, and it is not sensitive to the false positive threshold. The same reasoning applies to TF selection: as the false positive threshold rises and the presence of a true TFBS loses its power to predict the class, all three of these methods fail. Only the mixture model-based approach is insensitive to the false positive threshold and consistently selects the true TFBSs.
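
The loss of power of presence/absence predictors can be illustrated with a short calculation. The sketch below assumes, for illustration only, that the number of false positive hits for a TFBS in a sequence is Poisson with mean equal to the controlled number of false positives, so that a sequence shows the TFBS as present with probability 1 - (1 - r) exp(-m), where r is the fraction of sequences with a true site and m is the false positive mean. The function name `presence_probability` is hypothetical.

```python
import math


def presence_probability(true_rate, fp_mean):
    """Probability that a TFBS is scored as present in a sequence.

    true_rate: fraction of sequences carrying a true site.
    fp_mean:   assumed Poisson mean of false positive hits per sequence.
    """
    return 1.0 - (1.0 - true_rate) * math.exp(-fp_mean)


print("fp_mean  with_site  without_site  gap")
for fp_mean in (0.0, 0.5, 2.0):
    with_site = presence_probability(0.80, fp_mean)    # class with true sites
    without_site = presence_probability(0.0, fp_mean)  # class without them
    gap = with_site - without_site
    print(f"{fp_mean:7.1f}  {with_site:9.3f}  {without_site:12.3f}  {gap:.3f}")
```

Under these assumptions the difference in presence frequency between the two classes equals 0.80 exp(-m). It shrinks from 0.80 with no false positives to about 0.49 at a mean of 0.5 and about 0.11 at a mean of 2, which is consistent with the decline of the presence-based methods described above.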


    CONCLUSIONS
 
In this study, we demonstrate, for the first time, a method for predicting the presence and order of TFBSs in the proximity of estrogen receptor elements, within ER{alpha}-responsive loci. Two significant aspects emerge from this analysis. The first is the biological significance of our findings: our proposed TFBS pattern identification algorithm clearly suggests that TFBSs are not randomly distributed within ER{alpha} target promoters (P-value < 0.001), but form distinctive patterns within target promoters both up- and down-regulated by ER{alpha}. The up-regulated targets contain ordered TFBS pairs such as (DBP, MYC), (DBP, MYC/MAX), (DBP, USF2) and (DBP, MYOGENIN). The down-regulated targets contain TFBS pairs such as (DBP, c-ETS1-68), (DBP, USF2) and (DBP, MYOGENIN). Specifically, the TFBS combinations (DBP, USF2) and (DBP, MYOGENIN) appear in both up- and down-regulated ER{alpha} targets. These two transcription factors may act in concert with ER{alpha} to mediate target gene regulation, and their presence is not specific for either up- or down-regulation. MYOGENIN also appears in the work by Jin et al. (2004), which specifies ER{alpha} direct targets. For promoters up-regulated by ER{alpha}, the TFBS pairs (DBP, MYC) and (DBP, MYC/MAX) were found to be specific; further work by our group also demonstrated a previously unknown co-regulatory role of c-MYC within a subset of ER{alpha}-responsive genes (Cheng et al., 2006). Analogously, ETS factors are known to act as either positive or negative regulators of gene expression (Seth and Watson, 2005), and we found the (DBP, c-ETS1-68) TFBS pair to be down-regulation-specific. In support of our findings, c-ETS1-68 was previously shown to function as either a transcriptional repressor or activator, depending on the promoter context (Goldberg et al., 1994).
In hormone-responsive breast cancer cells, the activity of the ER{alpha} signaling pathway is inversely correlated with the activities of growth factor signaling pathways (Schiff and Osborne, 2005). Our finding that the c-ETS1-68 binding site is a negative regulatory element in ER{alpha}-target genes suggests that ETS proteins might serve as points of molecular ‘crosstalk’ between estrogen and growth factor signaling pathways in hormone-dependent breast cancer cells. Overall, our mixture model-based discriminate analysis demonstrates prediction accuracy as high as 87% for the above-described patterns.

Second, our proposed mixture model-based discriminate analysis can perform both TFBS pattern recognition and target class prediction. No currently available method offers both capabilities, as logistic regression, logic regression and classification tree algorithms are limited to class prediction analysis. Based on extensive statistical simulation studies, the performance of our method is not sensitive to false positives in the initial TFBS prediction, whereas the other methods do not perform adequately when the number of false positive TFBS predictions is high. Furthermore, our mixture model-based discriminate analysis can select, with relatively high probability, all TFBSs that are differentially distributed between two classes, even when false positive TFBS predictions are present at a high level. In contrast, logistic regression can detect only a small subset of these TFBSs. While classification tree and logic regression methods can select the full set of differentially distributed TFBSs with high probability, this probability quickly diminishes as the number of false positive TFBS predictions increases.

In summary, our mixture model-based discriminate method holds great potential for TFBS pattern identification and classification in the study of estrogen-responsive promoter elements. Elucidation of regulatory proteins within such elements, and their association, will allow greater insight into biological processes regulated by estrogen, including sexual development, menopause and hormone-dependent cancers. The approach can be readily extended to the targets of any TF, in order to search for the TFBS patterns of that TF and its co-regulating TFs.


    Acknowledgments
 
This research was supported by National Cancer Institute grants CA085289 (K.P.N.) and CA113001 (T.H-M.H.).

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Martin Bishop

Received on April 13, 2006; revised on May 24, 2006; accepted on June 9, 2006

    REFERENCES
 

    Bailey, T.L. and Gribskov, M. (1998) Combining evidence using P-value: application to sequence homology search. Bioinformatics, 14, 48–54.

    Bailey, T.L. and Nobel, W.S. (2003) Searching for statistically significant regulatory modules. Bioinformatics, 19, ii16–ii25.

    Bailey, T.L. and Elkans, C. (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymer. Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, Menlo Park, CA, pp. 28–36.

    Berman, B.P., et al. (2002) Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome. Proc. Natl Acad. Sci. USA, 99, 757–762.

    Bland, K.I., et al. (1995) Oncogene protein co-expression. Value of Ha-ras, c-myc, c-fos, and p53 as prognostic discriminants for breast carcinoma. Ann. Surg., 221, 708–718.

    Bussemaker, H.J., et al. (2001) Regulatory element detection using correlation with expression. Nat. Genet., 27, 167–174.

    Breiman, L., et al. (1984) Classification and Regression Trees. Chapman & Hall, New York, NY.

    Cheng, A.S.L., et al. (2006) Combinatorial analysis of transcription factor partners reveals recruitment of c-MYC to estrogen receptor {alpha}-responsive promoters. Mol. Cell, 21, 393–404.

    Consortium, M.G.S. (2002) Initial sequencing and comparative analysis of the mouse genome. Nature, 420, 521–562.

    Crowley, E.M., et al. (1997) A statistical model for locating regulatory regions in genomic DNA. J. Mol. Biol., 268, 8–14.

    Firth, M.C., et al. (2001) Detection of cis-element clusters in higher eukaryotic DNA. Bioinformatics, 17, 878–889.

    Firth, M.C., et al. (2002) Statistical significance of clusters of motifs represented by position specific scoring matrices in nucleotide sequences. Nucleic Acids Res., 30, 3214–3224.

    Frech, K., et al. (1997) A novel method to develop highly specific models for regulatory units detects a new LTR in GenBank which contains a functional promoter. J. Mol. Biol., 270, 674–687.

    Geserick, C., et al. (2005) The role of DNA response elements as allosteric modulators of steroid receptor function. Mol. Cell Endocrinol., 236, 1–7.

    Goldberg, Y., et al. (1994) Repression of AP-1-stimulated transcription by c-Ets-1. J. Biol. Chem., 269, 16566–16573.

    Gupta, M. and Liu, J.S. (2003) Discovery of conserved sequence patterns using a stochastic dictionary model. J. Am. Stat. Assoc., 98, 55–66.

    Jin, V.X., et al. (2004) Identifying estrogen receptor alpha target genes using integrated computational genomics and chromatin immunoprecipitation microarray. Nucleic Acids Res., 32, 6627–6635.

    Keles, S., et al. (2004) Regulatory motif finding by logic regression. Bioinformatics, 20, 2799–2811.

    Kondrakhin, Y.V., et al. (1995) Eukaryotic promoter recognition by binding sites for transcription factors. Comp. Appl. Biosci., 11, 477–488.

    Krivan, W. and Wasserman, W.W. (2001) A predictive model for regulatory sequences directing liver-specific transcription. Genome Res., 11, 1559–1566.

    Lacroix, M. and Leclercq, G. (2004) About GATA3, HNF3A, and XBP1, three genes co-expressed with the oestrogen receptor-alpha gene (ESR1) in breast cancer. Mol. Cell Endocrinol., 219, 1–7.

    Lin, C.Y., et al. (2004) Discovery of estrogen receptor {alpha} target genes and response elements in breast tumor cells. Genome Biol., 5, R66.

    Lincoln, D.W., 2nd and Bove, K. (2005) The transcription factor Ets-1 in breast cancer. Front. Biosci., 10, 506–511.

    Lincoln, D.W., 2nd, et al. (2003) Estrogen-induced Ets-1 promotes capillary formation in an in vitro tumor angiogenesis model. Breast Cancer Res. Treat., 78, 167–178.

    Liu, J.S., et al. (1995) J. Am. Stat. Assoc., 90, 1156–1170.

    Liu, X., et al. (2001) BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac. Symp. Biocomput., 127–138.

    McDonnell, D.P. and Norris, J.D. (2002) Connections and regulation of the human estrogen receptor. Science, 296, 1642–1644.

    McLachlan, G. and Peel, D. (2000) Finite Mixture Model. John Wiley & Sons, New York, NY.

    Myers, E., et al. (2005) Associations and interactions between Ets-1 and Ets-2 and coregulatory proteins, SRC-1, AIB1, and NCoR in breast cancer. Clin. Cancer Res., 11, 2111–2122.

    Osborne, C.K., et al. (2005) Crosstalk between estrogen receptor and growth factor receptor pathways as a cause for endocrine therapy resistance in breast cancer. Clin. Cancer Res., 11, 865s–870s.

    Pelengaris, S., et al. (2002) c-MYC: more than just a matter of life and death. Nat. Rev. Cancer, 2, 764–776.

    Prestridge, D. (1995) Predicting Pol II promoter sequences using transcription factor binding sites. J. Mol. Biol., 249, 923–932.

    Ripley, B.D. (1996) Pattern recognition and neural network. Cambridge University Press, Cambridge, UK.

    Rodrik, V., et al. (2005) Survival signals generated by estrogen and phospholipase D in MCF-7 breast cancer cells are dependent on Myc. Mol. Cell Biol., 25, 7917–7925.

    Roh, T.Y., et al. (2005) Active chromatin domains are defined by acetylation islands revealed by genome-wide mapping. Genes Dev., 19, 542–552.

    Roth, F.R., et al. (1998) Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat. Biotechnol., 16, 939–945.

    Ruczinski, I., et al. (2003) Logic Regression. J. Comput. Graphical Stat., 12, 475–511.

    Rushton, J.J., et al. (2003) Distinct changes in gene expression induced by A-Myb, B-Myb and c-Myb proteins. Oncogene, 22, 308–313.

    Schiff, R. and Osborne, C.K. (2005) Endocrinology and hormone therapy in breast cancer: new insight into estrogen receptor-alpha function and its implication for endocrine therapy resistance in breast cancer. Breast Cancer Res., 7, 205–211.

    Seth, A. and Watson, D.K. (2005) ETS transcription factors and their emerging roles in human cancer. Eur. J. Cancer, 41, 2462–2478.

    Sinha, S., et al. (2003) A probabilistic method to detect regulatory modules. Bioinformatics, 19, 1–10.

    Thompson, J.D., et al. (1994) Clustal W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680.

