Skip Navigation


Bioinformatics Advance Access originally published online on September 15, 2005
Bioinformatics 2005 21(22):4169-4175; doi:10.1093/bioinformatics/bti680
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/22/4169    most recent
bti680v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (5)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Zhong, W.
Right arrow Articles by Zhu, Y.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Zhong, W.
Right arrow Articles by Zhu, Y.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oxfordjournals.org

RSIR: regularized sliced inverse regression for motif discovery

Wenxuan Zhong 1,{dagger}, Peng Zeng 2,{dagger}, Ping Ma 3,{dagger}, Jun S. Liu 1,* and Yu Zhu 4,*

1Department of Statistics, Harvard University Cambridge, MA 02138, USA
2Department of Mathematics and Statistics, Auburn University Auburn, AL 36849, USA
3Department of Statistics, University of Illinois at Urbana-Champaign Champaign, IL 61820, USA
4Department of Statistics, Purdue University West Lafayette, IN 47907, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 INTRODUCTION
 METHODS
 IMPLEMENTATION
 RESULTS
 CONCLUSION
 REFERENCES
 

Motivation: Identification of transcription factor binding motifs (TFBMs) is a crucial first step towards the understanding of regulatory circuitries controlling the expression of genes. In this paper, we propose a novel procedure called regularized sliced inverse regression (RSIR) for identifying TFBMs. RSIR follows a recent trend to combine information contained in both gene expression measurements and genes' promoter sequences. Compared with existing methods, RSIR is efficient in computation, very stable for data with high dimensionality and high collinearity, and improves motif detection sensitivities and specificities by avoiding inappropriate model specification.

Results: We compare RSIR with SIR and stepwise regression based on simulated data and find that RSIR has a lower false positive rate. We also demonstrate an excellent performance of RSIR by applying it to the yeast amino acid starvation data and cell cycle data.

Availability: Matlab programs are available upon request from the authors.

Contact: jliu{at}stat.harvard.edu; yuzhu{at}stat.purdue.edu


    INTRODUCTION
 TOP
 ABSTRACT
 INTRODUCTION
 METHODS
 IMPLEMENTATION
 RESULTS
 CONCLUSION
 REFERENCES
 
Gene transcription is regulated by proteins called transcription factors (TFs) binding to their recognition sites located mostly in upstreams of the genes (promoter regions), but also not infrequently in downstreams or intronic regions. The common pattern of the recognition sites of a specific TF is referred to as the transcription factor binding motif (TFBM). Discovering binding sites and motifs of specific TFs of an organism is an important first step towards the understanding of gene regulation circuitry. This problem has attracted much attention from both experimentalists and quantitative researchers. Experimental techniques such as electrophoretic mobility shift assays and DNase footprinting have been used to locate TFBMs on a gene-by-gene and site-by-site basis, but these methods are laborious and time-consuming, and thus unsuitable for genome-wide studies. Recent years have seen a rapid adoption of the ChIP-on-chip technology (Ren et al., 2000; Lieb et al., 2001; Lee et al., 2002), where chromatin immunoprecipitation (ChIP) is carried out in conjunction with mRNA microarray analysis (chip) to identify genome-wide interaction sites of a DNA-binding protein. However, this method only yields a resolution of hundreds to thousands of base pairs, whereas the actual binding sites are 10–15 bp long. Computational methods developed over the past 10–15 years have proven extremely helpful for pinning down the exact binding site locations (see Stormo, 2000 and Jensen et al., 2004 for a review).

The latest attempt to improve upon computational TFBM finding methods is to incorporate information from gene expression data. An innovative method is REDUCE, proposed in Bussemaker et al. (2001), which utilizes the correlation between gene expression values and the occurrences of certain ‘words’ in the promoter regions of genes. The method has been extended in Keles et al. (2002, 2004). Conlon et al. (2003) proposed another method, motif regressor, which selects ‘functional’ TFBMs from a large pool of motif candidates by regressing the genes' mRNA expression levels against their promoter regions' matching scores to each of the candidate motifs. Along a similar line of thought, Beer and Tavazoie (2004) built Bayesian models to predict a gene's cluster membership based on the motif features in its promoter region.

A fundamental assumption to facilitate motif discovery using gene expression information is that a gene's mRNA copy number is associated with the gene's upstream matching score (or more intuitively, number of TFBM copies) corresponding to a functional TFBM. The simplest association is the linear relationship as considered by both Bussemaker et al. (2001) and Conlon et al. (2003). Hence, to find the TFBMs, they look for motif candidates that have the strongest linear association with the gene expression values. To identify all the functional TFBMs, Conlon et al. (2003) used the classic stepwise regression technique in their motif regressor. Though the idea of motif regressor is promising, the linear model combined with stepwise regression may not be completely satisfactory due to the following drawbacks. First, the relationship between gene expression values and motif scores is likely to be more complex than the simple linear relationship; second, the number of motif candidates in consideration is usually in hundreds, however most stepwise regression algorithms can only explore a small portion of all the possible models and third, motif scores are often highly correlated, which makes regression fitting unstable and may lead to falsely inflated regression coefficients that contribute to high false positives and negatives. In order to avoid the first drawback, more sophisticated methods using non-linear basis functions, such as polynomial basis and spline basis, were recently proposed for motif discovery (Sinisi and van der Laan, 2004; Das et al., 2004). But these methods also face the other two drawbacks.

In this paper, we propose a more flexible model that consists of an unspecified (non-linear) link function and linear combinations of motif scores for the relationship between gene expression values and motif matching scores. Based on the model, we develop a novel procedure, called regularized sliced inverse regression (RSIR), to directly identify the linear combinations and further the functional TFBMs while avoiding the estimation of the link function. RSIR is an extension of sliced inverse regression (SIR) proposed by Li (1991) for data of high dimensionality, high collinearity and relatively small sample size.

After demonstrating the performance of RSIR in a non-linear model setting via simulation, we applied it to two real datasets: the amino-acid starvation data (Gasch et al., 2000) and the cell cycle data (Spellman et al., 1998). For the former data, we found 16 motif patterns that are functionally active for amino acid starvation, 11 of which are known in the literature. We further explored the results and found two biologically interpretable modules of motif patterns. For the cell cycle data, we successfully identified all the known cell cycle regulating TFBMs. The correlations between the gene expression values and the motif scores of the identified TFBMs reveal periodical patterns, which are consistent with the underlying biological mechanism.


    METHODS
 TOP
 ABSTRACT
 INTRODUCTION
 METHODS
 IMPLEMENTATION
 RESULTS
 CONCLUSION
 REFERENCES
 
Model assumptions
Let g1, g2, ..., gN denote the genes and let y = (y1, ..., yN) be their corresponding mRNA expression levels measured by a microarray experiment under some specific condition. The initial set of TFBM candidates is determined as follows. First, a subset (10–100) of the genes with the highest absolute expression values is obtained. Second, a motif finding algorithm, such as MDscan (Liu et al., 2002) in our case, is used to search the promoter regions of the selected genes to give rise to the initial motif set. We denote these TFBM candidates by m1, m2, ..., mp and their consensus matrices by {theta}1, {theta}2, ..., {theta}p, respectively. For gene gi (1 ≤ i ≤ N), the matching score of the motif candidate mj in gi's promoter region is calculated as

(1)
where ni is the size (length) of promoter region of gi, wj is the width of the motif candidate mj, {theta}0 is the third-order Markov model parameter estimated from intergenic sequences and si,k is the sequence segment of width wj starting at the k-th position in the promoter region of gi. The matching score xij describes the abundance and intensity of mj in the entire promoter region of gi. Let xi = (xi1, xi2, ..., xip)'. Then form the gene expression and motif matching score data, or in short, the expression-score data.

We use a toy example to motivate our general model for identifying TFBMs. Assume that the initial set contains p = 9 motif candidates, and that the gene expression value is related to the motif scores as

(2)
where {varepsilon} is a random error independent of x, and the vectors of coefficients are ß1 = (1, 0, 0, 0, 0, 0, 0, 0, 0)', ß2 = (0, 0, 0.3, 0.4, 0, 0, 0, 0, 0)' andß3 = (0, 0, 0, 0, 0, 0, 0, 0.8, 0.9)'.

Model (2) states that the gene expression value depends on the motif matching scores through three of their linear combinations. Since only x1, x3, x4, x8 and x9 have nonzero coefficients in these linear combinations, motifs m1, m3, m4, m8 and m9 are the true functional TFBMs whereas the rest are not. Model (2) further suggests that, to identify the true TFBMs using the expression-score data, we need to estimate the linear combinations first and then identify the motif scores with nonzero contributions. Note that the link function f is left unspecified in (2).

In general, we assume that the expression value of a gene depends on the motif scores through k (unknown) linear combinations,

(3)
Since f is not specified, (3) can accommodate a wide variety of models including the linear model. The expression-score data are usually high-dimensional and noisy, so a direct fitting of f using non-parametric methods is impractical. It is thus desirable to estimate the linear combinations without fitting f. This task can be accomplished by SIR (Li, 1991), which was originally developed for dimension reduction and data visualization. After having obtained , we can identify xi's with nonzero contributions to the linear combinations and their corresponding functional TFBMs. Because k is much smaller than p, if further desirable, non-parametric methods can be used to fit f only using , and the fitted model can be used to predict the gene expression value y. In this paper, we only focus on the identification of functional TFBMs and their linear combinations with biological interpretation. The subspace spanned by ß1,...,ßk is defined to be the sufficient dimension reduction subspace, which is denoted by . Model (3) and the subspace can also be defined from the perspective of conditional independence, and other methods exist for estimating ß1,...,ßk; see Cook (1998) for details.

Regularized sliced inverse regression
When x satisfies a linearity condition, Li (1991) showed that ß1,...,ßk in (3) can be estimated by solving the following optimization problem sequentially,

(4)
where ß is a p-dimensional vector, M = cov[E(x|y)] and {Sigma} is the variance covariance matrix of x. Once ß1 is obtained, the constraint in (4) is updated to ß'{Sigma}ß1 = 0 and ß'{Sigma}ß = 1, solving the updated (4) gives ß2. This procedure continues till ß1,...,beta;k are obtained. In application, {Sigma} is estimated by the sample covariance matrix , where is the sample mean. A slicing procedure is proposed in Li (1991) for estimating M. First, divide the range of into a number of disjoint intervals, say H intervals, which are denoted by S1, S2,...,SH; second, for 1 ≤ h ≤ H, calculate , where nh is the number of yi in Sh; third,

(5)
is used as an estimate of M. One then proceeds to solve (4) with {Sigma} and M replaced by their estimates. And the obtained directions are denoted by .

SIR is not as successful for our TFBM identification task as in some other applications due to the high dimensionality and high multicollinearity of the expression-score data, which makes nearly degenerate in a number of directions. Rewrite the sample version of (4) in an equivalent expression,

(6)
It is easy to see that the target function in (6) depends on both the norm and the orientation of ß, and the maximization is over ß on the ellipsoid . Along the directions in which is degenerate, ß's on E have extremely large norms, so that they may be falsely selected as the estimates of ß1,...,ßk, which makes very unstable. In order to mitigate the variability caused by the near-degeneracy of , we add a positive definite matrix sI to , where I is the p x p identity matrix and s is a prescribed non-negative constant. Thus the ellipsoid E is changed to , and (6) becomes

(7)
which can be solved sequentially as (4) to generate k directions denoted by . We refer to (7) as RSIR and as the RSIR directions. Note that when s = 0, (7) is the sample version of (4) and for j = 1,...,k.

In fact, can be obtained by solving the following system of equations and constraints,

(8)
where i,j = 1,...,k.

Similar to other regularized methods, RSIR reduces estimation variability at the cost of inducing estimation bias. The success of RSIR depends on its proper choice of s especially for data of high dimensionality, high multicollinearity and relative small sample size such as the expression-score data. The same argument and method were originally used by Hastie et al. (1995) and Yu et al. (1999) when proposing penalized linear discriminant analysis.


    IMPLEMENTATION
 TOP
 ABSTRACT
 INTRODUCTION
 METHODS
 IMPLEMENTATION
 RESULTS
 CONCLUSION
 REFERENCES
 
The flow chart of our procedure is given in Figure 1. Note that we choose MDscan as the tool to search for the motif candidates. Since MDscan is publicly available, we do not discuss it in detail here; see Liu et al. (2002) for more information. Readers can choose their favorite motif searching algorithms when using our procedure. Some computational issues related to RSIR are discussed in this section. The entire Matlab code can be requested from the authors.



View larger version (34K):
[in this window]
[in a new window]
 
Fig. 1 Flow chart for the procedure of motif discovery.

 
Computing scheme for RSIR directions
Suppose k is known and the regularization parameter s is given. First, we calculate using (5) with fixed H. Second, we solve (8) to obtain . The linear combinations are referred to as the RSIR variates in the rest of the paper. For visualization, we further generate k plots of y against , 1 ≤ {ell} ≤ k, respectively, which reveal the relationships between the gene expression values and the k RSIR variates. Similar to SIR (Li, 1991), RSIR is insensitive to the choice of H.

Motif selection based on RSIR variates
For motif candidate mj, its coefficients in the RSIR variates are , . According to model (3), intuitively, if all the , are close to zero, then mj is not a functional TFBM. Next we propose a scoring procedure to determine functional TFBMs.

Let , and let {Sigma}*j(s) be the covariance matrix of . We use the squared Mahalanobis distance between and the origin as a significance score for mj, which is

The exact sampling distribution of is difficult to derive. When s = 0, Li (1991) stated in his Remark 5.1 that is asymptotically normal with mean and covariance . For s > 0, the same result still holds, i.e. asymptotically follows a normal distribution with mean (s) and covariance . Therefore, follows a chi-squared distribution where k is the degree of freedom. When ß*j(s) = 0, {Gamma}j follows asymptotically. Note that the advantage of regularization is to trade bias for significant reduction in variation, and usually the introduced bias is very small when s is properly chosen, especially in our setting. This was also observed in penalized discriminant analysis by Yu et al. (1998). Although ß*j(s) is different from ß*j(0), their difference is expected to be small. Hence, we can use {Gamma}j as a scoring function for checking whether mj is a functional motif or not.

There exists another difficulty in using {Gamma}j directly, because we do not have an analytic expression for {Sigma}*j(s). We use the following bootstrap procedure to derive an estimate of {Sigma}*j(s). For m = 1,...,B, we draw with replacement N pairs from the original sample completely at random. Each such formed new sample is called a bootstrap sample. We apply RSIR to each bootstrap sample to obtain the estimates of ß1(s),...,ßk(s), and estimate {Sigma}*j(s) by the sample covariance matrix of , denoted by . Finally, we approximate {Gamma}j by . If , where {alpha} is a significance level determined by the user, mj is declared to be a functional TFBM. When the number of motif candidates is large, multiple comparison procedures can be incorporated.

Choice of regularization parameter
Because regularization controls the tradeoff between the bias and the variability of the estimate, it is important to determine the amount of regularization s. For a given s, the total mean squared error (MSE) of the RSIR directions is

(9)
where tr is the trace of a matrix. L(s) can be interpreted as the average distance between the RSIR directions and the true directions ß1,...,ßk, and it consists of two parts. The first part is the sum of the variances of , and the second part is the sum of the squared biases. When s increases from 0, L(s) first decreases. After L(s) reaches its minimum at s*, it begins to increase as s further increases. We choose s* to be the amount of regularization in RSIR. Using s*, RSIR can benefit from the significant reduction in variance while the increase of bias is limited to a minimum.

Note that both V(s) and B(s) in L(s) are not directly computable. We use generated by the bootstrap procedure discussed in the previous subsection to approximate V(s) and . The true directions ßi are approximated by the bootstrap mean of with s0 being a suitably small positive value. Although the proposed procedure for determining s is approximate, our experience from simulation and real data suggests that it works well.

Determining the number of directions
Recall that k is the number of true directions and it is equal to the rank of M = cov[E(x|y)]. So, in RSIR, the choice of k shall not depend on the choice of s. A graphical method for determining k is to plot y against the derived RSIR variates for 1 ≤ i ≤ p, and pick up the variates that generate plots with visible patterns. A more formal yet conservative approach as proposed by Li (1991) is to sequentially test a series of hypotheses: H0 : k = d versus H1 : k > d, for d = 0,1,2,...,p – 1. Let be the eigenvalues calculated from (8). Define . When s = 0, Li (1991) proved that {Lambda}d(0) follows a {chi}2 distribution with (p k)(Hk – 1) degrees of freedom asymptotically. Using the test statistic and its asymptotic distribution, we start with d = 0 and sequentially test the subsequent hypotheses until H0 is accepted. As indicated by Li (1991), the sequential test is conservative and sometimes underestimates k. In this paper, we combine both the graphical method and the sequential test to choose k.


    RESULTS
 TOP
 ABSTRACT
 INTRODUCTION
 METHODS
 IMPLEMENTATION
 RESULTS
 CONCLUSION
 REFERENCES
 
Simulation
Suppose that there are 20 motif candidates and 500 genes. Let U be a fixed 20 x 20 orthogonal matrix, and let {Lambda}0 be a diagonal matrix with diagonal entries (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 2, 3, 4, 5, 6, 7, 8, 25, 50, 100). We generate the expression-score data using the following scheme: (1) Generate 500 iid xi from N(0,U{Lambda}0U'), and use them as the ‘motif matching scores’; (2) Generate 500 iid random errors {varepsilon}i from N(0,1); (3) The 500 gene expression values are computed as , where ß1 = (1, 1, 1, 1, 0,...,0)' and ß2 = (0,...,0, 1, 1, 1, 1)'. Thus, the functional TFBMs in this case are m1, m2, m3, m4, m17, m18, m19 and m20, and the covariance matrix of xi is nearly degenerate in some directions.

We generate 1000 expression-score datasets following the above procedure. For each data, we apply RSIR, SIR and the stepwise regression procedure used by Conlon et al. (2003) to identify functional TFBMs, and record false negative and false positive rates at various levels, which are used to generate the ROC curves in Figure 2. We observed that RSIR outperformed the SIR and stepwise regression at all sensitivity levels for this example.



View larger version (17K):
[in this window]
[in a new window]
 
Fig. 2 ROC curves for RSIR, SIR and stepwise regression. The false negative rate is defined as the number of false rejections over the number of motifs, and the false positive rate is defined as the number of false selection over the number of motifs.

 
Amino acid starvation
DNA microarrays were used to obtain the expression values of 5970 yeast genes before and after 0.5 h of amino acid starvation; see Gasch et al. (2000) for details. Conlon et al. (2003) extracted the upstream sequence up to 800 bp of each gene, and used MDscan to find 414 motif candidates of width between 5 and 15 bp from the upstream sequences of the 100 most induced and 100 most repressed genes. The motif-matching scores of the selected 414 motif candidates were assigned according to (1). So expression-score data were generated for 5970 genes and 414 motif candidates. The 414 motif matching scores are highly correlated due to their generating mechanism. For example, a large number of pairwise correlations between motif candidates are >0.9.

Using RSIR (H = 80) and the sequential test, we identified two significant RSIR directions (k = 2) at {alpha} = 5%. We further used the bootstrap procedure to approximate L(s) at various s, and found that L(s) is minimized at s* = 1.0. Figure 3 displays the plots of the logarithm-transformed gene expression values against the two RSIR variates separately. Figure 3A shows that the gene expression values decrease when the first RSIR variate increases, while Figure 3B demonstrates that the dispersion of gene expression values increases as the second RSIR variate increases.



View larger version (26K):
[in this window]
[in a new window]
 
Fig. 3 (A) Logarithm-transformed gene expression values versus the first RSIR variate with s = 1. (B) Logarithm-transformed gene expression values versus the second RSIR variate with s = 1.

 
Following the motif selection procedure described in the implementation, we have identified 28 active TFBMs from the 414 motif candidates at {alpha} = 1%. So the original expression-score data can be reduced to include only 28 TFBMs. We further calculate the sample correlation matrix of the motif scores of the 28 TFBMs. The correlation between any pair of them is <0.4, indicating that the reduced data do not have high multicollinearity as the original data do.

According to their position weight matrices, the 28 TFBMs can be classified into 16 motif patterns (Fig. 4). Eight motif patterns (STRE, GCN4, M3A, M3B, MET4, PHO4, RAP1 and URS1) are regulators known to respond to amino acid starvation and were also identified in Conlon et al. (2003). Another three patterns (REB1, LEU3 and RFX1) we have found also respond to amino acid starvation, but they were not identified in Conlon et al. (2003). It is known that REB1, which stimulates or inhibits transcription, is essential for cell growth (Morrow et al., 1993); LEU3, a zinc-finger transcription factor, regulates genes involved in amino acid biosynthesis; RFX1 plays an important role in the regulation of protein synthesis activities. Our findings are consistent with results from the more recent study by Harbison et al. (2004). The biological functions of the remaining five patterns are currently unknown and invite further biological study.



View larger version (42K):
[in this window]
[in a new window]
 
Fig. 4 The first column is motif names. The second column is the label of the groups. The third column is motif logos of the 28 selected TFBMs.

 
To further investigate the identified TFBMs, we apply RSIR directly to the reduced expression-score data that only include the 28 TFBMs. As we discussed above, the multicollinearity is not serious in the reduced data. Therefore we use RSIR with s = 0, i.e. SIR, to avoid bias. Two RSIR directions were identified at {alpha} = 5% with H = 80. We further explore the possible linear combinations of the two RSIR directions in order to find the directions that have more interesting biological interpretations. Two directions were identified using an oblique rotation and a promax criterion. The two new directions represent two disjoint modules of TFBMs with the first involving 16 TFBMs marked (1) in Figure 4, and the second involving the remaining 12 TFBMs marked (2) in Figure 4. The plot of the logarithm of gene expression values along the first module shows the same negative linear trend as in Figure 3A, and that along the second module demonstrates the same heteroscedastic pattern as in Figure 3B. Hence, the two modules well preserve the information regarding the relationship between gene expression values and motif matching scores.

The TFBMs in the first module, which are marked (1) in Figure 4, have a significant effect on cell growth and protein synthesis. From the downstream genes of the motifs in the first module, we use GeneMerge (Castillo-Davis and Hartl, 2003) to identify significantly enriched gene function categories defined in Gene Ontology Consortium (http://www.geneontology.org/). The identified genes are closely associated with ribosome functioning, translation factors, selenoamino acid metabolism and aminoacyl-tRNA biosynthesis. In the case of amino acid starvation, processes of transcription, translations and protein synthesis should be slowed down. Thus the genes associated with these processes are expected to be downregulated. The negative trend between gene expression values and motif scores of the first module is consistent with the biological mechanism just described.

The TFBMs in the second module respond to environmental stress and mediate transcription activity. Using GeneMerge again, we identified the significantly enriched functions among the genes downstream of the TFBMs in the second module. We found that many metabolism pathways are enriched, such as arginine and proline metabolism, propanoate metabolism, pyruvate metabolism, and alanine and aspartate metabolism. It is expected that the effects of the TFBMs on these metabolism pathways are quite different during amino acid starvation. Thus, the genes could be up- or downregulated. This phenomenon is clearly demonstrated by the heteroscedastic pattern.

In summary, our analysis suggests that two primary sources are responsible for the transcription response to amino acid starvation. The first module slows down ‘inessential activities’ and the second module changes metabolism patterns, e.g. increases the uptake of certain amino acids.

Cell cycle regulation
Transcriptional regulation is one of the crucial regulation mechanisms for the cell cycle clock. We applied RSIR to the yeast cell-cycle data of 18 time points over two complete cell cycles, starting from release from alpha-factor arrest in the M/G1 phase (Spellman et al., 1998). We repeated the same procedure as in the previous example to generate the expression-score datasets at each time point. TFBMs were selected using RSIR with H = 80 and s = 0.1 at {alpha} = 5% with false discovery rate correction (Storey, 2002). A total of 143 TFBMs were obtained by combining the selected TFBMs at each time point.

Among the TFBMs we have identified, SWI5, SCB, MCB, SFF and MCM1 are well-known cell cycle regulator TFBMs. Their active phases and P-values are summarized in Table 1. We find that SWI5 is active during the M/G1 phase, and SCB and MCB are active during the G1 phase. Our findings are consistent with the results in the literature. The complex of Mcm1, Fkh2 and Ndd1 proteins plays a key role in activating G2/M genes (Harbison et al., 2004). We also find that SFF and MCM1 are active during the G2/M phase, which agrees with recent experimental findings (e.g. Simon et al., 2001). In addition to the five TFBMs above, we further discover three other TFBMs, STE12, PHO4 and STRE. Because the Ste12 protein is an important transcriptional activator that has a pheromone inductive effect, STE12 is active during many phases of the cell cycle. PHO4 is active during the S phase, which is consistent with Makhnevych et al. (2003). STRE is active at the beginning of the cell cycle after release from alpha-factor arrest, and it responds to the stress resulting from cell cycle release.


View this table:
[in this window]
[in a new window]
 
Table 1 Some well-known cell cycle regulator TFBMs identified using RSIR along with their active phases and P-values

 
The expression values of the cell-cycle-associated genes vary periodically over cell cycles. The correlations between gene expression values and the motif scores are expected to demonstrate the same pattern. Using K-means clustering, we clustered the 143 TFBMs into 10 clusters similarly as in Conlon et al. (2003). Eight of the ten clusters contain TFBMs that have strong effects on the cell cycle. Figure 5 shows the heatmaps for the correlations between the gene expression values and the matching scores of the TFBMs in each cluster. It is clear that the peaks of the correlations for the first eight clusters shift slowly to the right, which suggests that the TFBMs in different clusters are active one after another during the cell cycle. The correlations for the last two clusters do not fluctuate in the cell cycles, which indicates that their motifs are not necessarily cell cycle related. We notice that most of these motifs were identified in the first cell cycle, and their presence may reflect the limitation of the cell cycle experiment. The yeast cell cycle program was blocked by the alpha-factor at G1 phase in this experiment. After release, this tight synchrony decayed gradually due to the diversity of individual cell growth rates. In general, motifs identified in the second cycle are not always the same motifs identified during the first cycle.



View larger version (47K):
[in this window]
[in a new window]
 
Fig. 5 Motif clusters during cell cycle. The 143 significant TFBMs selected by RSIR are clustered into 10 clusters by the correlation coefficients between the gene expression values and the upstream sequence motif matching scores.

 
For 4 of the 10 clusters, we plot the correlations between gene expression values and matching scores along time (Fig. 6). All the plots demonstrate cyclic patterns in two periods, which clearly reflect the effects of the motifs, identified by RSIR, on the gene expression values over time. However, the pattern in the second cycle is not as sharp as that in the first cycle, indicating the correlation between gene expression values and their upstream matching scores of TFBMs weakens as time increases. This phenomenon is also caused by the limitation of the experiment as stated in the preceeding paragraph.



View larger version (38K):
[in this window]
[in a new window]
 
Fig. 6 Correlation coefficients between the gene expression values and the upstream sequence motif matching scores during cell cycle. From top to bottom: the first cluster contains STRE and SWI5 motifs; the second contains the MCB/SCB motifs; the third contains PHO4; the fourth contains the SFF motif.

 

    CONCLUSION
 TOP
 ABSTRACT
 INTRODUCTION
 METHODS
 IMPLEMENTATION
 RESULTS
 CONCLUSION
 REFERENCES
 
The identification of TFBMs is an important research topic in computational biology. In this paper, we have proposed a novel procedure, namely RSIR, to identify TFBMs using microarray gene expression information. Unlike REDUCE, which uses the counts of motif appearances, we follow an idea of motif regressor by using motif matching scores. We do not assume the rigid linear relationship between gene expression values and motif matching scores; instead, a semi-parametric model with an unspecified (and non-linear) link function is used. The RSIR algorithm proposed in this article can identify TFBMs, while avoiding an explicit estimation of the link function. RSIR improves upon the SIR algorithm proposed earlier (Li, 1991) by introducing a regularization term, which allows the user to make a conscious tradeoff between bias and variance. This strategy is especially well suited and important for analyzing data with high dimensionality and high collinearity, which are typical in biological applications. An interesting and potentially useful by-product of RSIR for expression-motif data analysis is also to estimate modules consisting of functionally coherent motifs. The computation cost of RSIR is minimum, and our simulation study and real data applications show that RSIR outperforms other existing procedures. Lastly, RSIR is not limited to motif discovery, and can be applied to many other variable selection and feature identification problems with high dimensional data.


    Acknowledgments
 
The authors are indebted to Erin Conlon, Lihua Zou and Xiaole S. Liu for their help with MDscan and motif regressor. We thank Jun-Yi Leu and Cristian Castillo-Davis for clarifying the biological results. We also thank the three anonymous referees for their constructive suggestions and comments that have helped improve an early version of the paper.

Conflict of Interest: none declared.


    FOOTNOTES
 
{dagger}The authors wish it to be known that, in their opinion, the first three authors should be regarded as joint First Authors. Back

Received on May 17, 2005; revised on September 3, 2005; accepted on September 13, 2005

    REFERENCES
 TOP
 ABSTRACT
 INTRODUCTION
 METHODS
 IMPLEMENTATION
 RESULTS
 CONCLUSION
 REFERENCES
 

    Beer, M.A. and Tavazoie, S. (2004) Predicting gene expression from sequence. Cell, 117, 185–198[CrossRef][ISI][Medline].

    Bussemaker, H.J., et al. (2001) Regulatory element detection using correlation with expression. Nat. Genet., 27, 167–171[CrossRef][ISI][Medline].

    Castillo-Davis, C. and Hartl, D. (2003) Genemerge: post-genomic analysis, data-mining and hypothesis. Bioinformatics, 19, 891–892[Abstract/Free Full Text].

    Conlon, E., et al. (2003) Integrating regulatory motif discovery and genome-wide expression analysis. Proc. Natl Acad. Sci. USA, 100, 3339–3344[Abstract/Free Full Text].

    Cook, R.D. Regression Graphics: Ideas for Studying Regressions Through Graphics, (1998) , New York John Wiley & Sons.

    Das, D., et al. (2004) Interacting models of cooperative gene regulation. Proc. Natl Acad. Sci. USA, 101, 16234–16239[Abstract/Free Full Text].

    Gasch, A., et al. (2000) Genomic expression programs in the response of yeast cells to environmental changes. Mol. Biol. Cell, 11, 4241–4257[Abstract/Free Full Text].

    Harbison, C., et al. (2004) Transcriptional regulatory code of a eukaryotic genome. Nature, 431, 99–104[CrossRef][Medline].

    Hastie, T., et al. (1995) Penalized discriminant analysis. Ann. Stat., 23, 73–102.

    Jensen, S., et al. (2004) Computational discovery of gene regulatory binding motifs: a bayesian perspective. Stat. Sci., 19, 188–204[CrossRef].

    Keles, S., et al. (2002) Identification of regulatory elements using a feature selection method. Bioinformatics, 18, 1167–1175[Abstract/Free Full Text].

    Keles, S., et al. (2004) Regulatory motif finding using logic regression. Bioinformatics, 20, 2799–2811[Abstract/Free Full Text].

    Lee, T., et al. (2002) Transcriptional regulatory networks in Saccharomyces cerevisiae. Science, 298, 799–804[Abstract/Free Full Text].

    Li, K.C. (1991) Sliced inverse regression for dimension reduction. J. Am. Stat. Assoc., 86, 316–327[CrossRef].

    Lieb, J.D., et al. (2001) Promoter-specific binding of rap1 revealed by genome-wise maps of protein-DNA association. Nat. Genet., 28, 327–334[CrossRef][ISI][Medline].

    Liu, X., et al. (2002) An algorithm for finding protein-DNA interaction sites with applications to chromatin immunoprecipitation microarray experiments. Nat. Biotechnol., 27, 835–839.

    Makhnevych, T., et al. (2003) Cell cycle regulated transport controlled by alterations in the nuclear pore complex. Cell, 115, 813–823[CrossRef][ISI][Medline].

    Morrow, B.E., et al. (1993) A bipartite DNA-binding domain in yeast reb1p. Mol. Cell Biol., 13, 1173–1182[Abstract/Free Full Text].

    Ren, B., et al. (2000) Genome-wide location and function of DNA binding proteins. Science, 290, 2306–2309[Abstract/Free Full Text].

    Simon, I., et al. (2001) Serial regulation of transcriptional regulators in the yeast cell cycle. Cell, 106, 697–708[CrossRef][ISI][Medline].

    Sinisi, S.E. and van der Laan, M.J. (2004) Deletion/substitution/addition algorithm in learning with applications in genomics. Stat. Appl. Genet. Mol. Biol., 3, 2799–2811.

    Spellman, P.T., et al. (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell, 9, 3273–3297[Abstract/Free Full Text].

    Storey, J.D. (2002) A direct approach to false discovery rates. J. R. Stat. Soc. Ser. B, 64, 479–498[CrossRef].

    Stormo, G.D. (2000) DNA binding sites: representation and discovery. Bioinformatics, 16, 16–23[Abstract/Free Full Text].

    Yu, B., et al. (1999) Penalized discriminant analysis of in situ hyperspectral data for conifer species recognition. IEEE Trans. Geosci. Remote Sensing, 37, 2569–2577[CrossRef].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
D. Datta and H. Zhao
Statistical methods to infer cooperative binding among transcription factors in Saccharomyces cerevisiae
Bioinformatics, February 15, 2008; 24(4): 545 - 552.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/22/4169    most recent
bti680v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (5)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Zhong, W.
Right arrow Articles by Zhu, Y.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Zhong, W.
Right arrow Articles by Zhu, Y.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?