Skip Navigation


Bioinformatics Advance Access originally published online on September 10, 2007
Bioinformatics 2007 23(22):3039-3047; doi:10.1093/bioinformatics/btm457
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
23/22/3039    most recent
btm457v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Yuan, S.
Right arrow Articles by Li, K.-C.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Yuan, S.
Right arrow Articles by Li, K.-C.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions.org

Context-dependent clustering for dynamic cellular state modeling of microarray gene expression

Shinsheng Yuan 1 and Ker-Chau Li 1,2,*

1Institute of Statistical Science, Acadmia Sinica, 128, Section 2, Academia Road, Nankang, Taipei 115, Taiwan, ROC and 2Department of Statistics, UCLA, 8125 Math Sciences Bldg. Box 951554 Los Angeles, CA 90095-1554, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 DATA
 3 METHOD
 4 RESULTS
 5 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: High-throughput expression profiling allows researchers to study gene activities globally. Genes with similar expression profiles are likely to encode proteins that may participate in a common structural complex, metabolic pathway or biological process. Many clustering, classification and dimension reduction approaches, powerful in elucidating the expression data, are based on this rationale. However, the converse of this common perception can be misleading. In fact, many biologically related genes turn out uncorrelated in expression.

Results: In this article, we present a novel method for investigating gene co-expression patterns. We assume the correlation between functionally related genes can be strengthened or weakened according to changes in some relevant, yet unknown, cellular states. We develop a context-dependent clustering (CDC) method to model the cellular state variable. We apply it to the transcription regulatory study for Saccharomyces cerevisiae, using the Stanford cell-cycle gene expression data. We investigate the co-expression patterns between transcription factors (TFs) and their target genes (TGs) predicted by the genome-wide location analysis of Harbison et al. Since TF regulates the expression of its TGs, correlation between TFs and TGs expression profiles can be expected. But as many authors have observed, the expression of transcription factors do not correlate well with the expression of their target genes. Instead of attributing the main reason to the lack of correlation between the transcript abundance and TF activity, we search for cellular conditions that would facilitate the TF-TG correlation. The results for sulfur amino acid pathway regulation by MET4, respiratory genes regulation by HAP4, and mitotic cell cycle regulation by ACE2/SWI5 are discussed in detail. Our method suggests a new way to understand the complex biological system from microarray data.

Availability: The program is written in ANSI C. The source code could be downloaded from http://kiefer.stat.sinica.edu.tw/CDC/index.php

Contact: kcli{at}stat.ucla.edu

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 DATA
 3 METHOD
 4 RESULTS
 5 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
High-throughput expression profiling allows researchers to study gene activities globally. Many clustering, classification and dimension reduction approaches have proved powerful in elucidating the expression data (Alon et al., 1999; Alter et al., 2000; Brown et al., 2000; Chow et al., 2001; Eisen et al., 1998; Golub et al., 1999; Marcotte et al., 1999; Tamayo et al., 1999). These methods stem from the assumption that genes with similar expression profiles are likely to encode proteins that may participate in a common structural complex, metabolic pathway or biological process (Eisen et al., 1998; Marcotte et al., 1999).

While co-expressed genes are likely to be functionally associated, the profiles of most functionally associated genes turn out uncorrelated (Zhou et al., 2002, 2005). Existence of examples such as the thyroid hormone receptor that can act either as a repressor or as an activator, depending on the absence or presence of thyroid hormone (Weaver 2002), suggests that the more delicate patterns of association may not be easy to detect without considering the underlying cellular states. Most proteins have multiple functions. The overall/marginal correlation between the expression profiles of two genes participating in a common process under certain conditions are likely to be weakened if they disengage and embark on activities of their own under other conditions. A direct characterization of the involved cellular states is a real challenge, however. To bypass the hurdle, liquid association (LA) was introduced in Li (2002). The method searches for a third gene that might mediate the change in correlation between two genes. Li et al. (2004) further generalize LA from two genes to multiple genes via a statistical dimension reduction approach. In the derivation of the LA statistic, change of cellular states is assumed to be correlated with the expression profiles of a third gene.

In this article, to help the study of complex patterns of gene coregulation for functionally associated genes, we propose the use of latent state variables for clustering the conditions of gene expression profiling. Each cluster of conditions may represent a cellular activity state so that within the state, the pattern of gene regulation becomes simpler. More specifically, denote a set of p functionally related gene expression profiles by a p by n matrix Y, each row being the expression of one gene under a total of n conditions and each column being the expression of p genes under one condition. We assume there are K states and each condition belongs to one of the K states. The state variable is considered latent in the sense that the cluster membership for each condition is unknown and has to be estimated from the data. It is important for us to differentiate two different types of gene activity patterns, static cellular state versus dynamic cellular state (Fig. 1).


Figure 1
View larger version (68K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. A schematic illustration for static states versus dynamic states. Three static cellular states are shown. Within each state, each gene (represented by one row) is homogeneous in expression. For instance, the first gene is consistently high (red) for all five conditions at state, A, and is consistently low (green) for all five conditions at state C. Dynamic states (D,E,F) show inhomogeneous patterns of within expression. However, each state co-expression between genes can still be detected with the aid of an additional variable X.

 
1.1 Homogeneous/static cellular activity state
Many clustering methods can be applied directly to cluster the columns of Y; e.g., K-means, principal component analysis, hierarchical clustering, self organization map, mixture model clustering methods and hybrid methods such as tight clustering. (Alon et al., 1999; Alter et al., 2000; Brown et al., 2000; Chow et al., 2001; Eisen et al., 1998; Golub et al., 1999; Liu et al., 2007; Marcotte et al., 1999; Medvedovic and Sivaganesan 2002; Tamayo et al., 1999; Tseng and Wong 2005; Yeung et al., 2001) These general-purpose context-free methods aim at finding clusters of conditions with homogenous expression. The level of expression of a gene is expected to remain unchanged as the conditions vary within one cluster. Consequently, while these methods are adequate for characterizing the homogeneous cellular states under which gene activities are static or in equilibrium, they cannot be used to model other types of cellular states.

1.2 Dynamic cellular activity state
Our main interest in this artcile is to characterize the cellular states that allow for dynamic gene activities. As conceptualized in Figure 1, a dynamic state differs from a homogeneous/static state in that each individual gene in the gene set Y can be expressed at a wide range of levels under the conditions within the state. Dynamic cellular states are harder to characterize because the coordination in expression between genes in Y is less transparent and the pattern of coherence in the gene activities often depends on the cellular context that is relevant to the function of the gene set Y. To facilitate the modeling of the proper cellular context, an additional variable X, which is biologically related to Y, will be utilized. For example, X could be the expression profile of a transcription factor and Y be a set of its target genes. We can also use external variables such as drug sensitivity profiles as X for studying a group of genes in the molecular target pathway (Li and Yuan 2004; Ross et al., 2000; Scherf et al., 2000). We do not require X to have an overall correlation with Y for all n conditions. Rather, we assume that the association between X and Y may be more transparent for the smaller set of conditions within a relevant cellular state. Our aim is to develop a context-dependent clustering (CDC) method for identifying the dynamic states that are connected to the regulation of a given set of genes.

1.3 Global gene expression analysis for biological interpretation of the detected cellular activity states
There are several ways to look for biological interpretation of the detected states. For instance, one may examine the experimental setup (e.g. the temporal or spatial variables, tissue category, growth phase, temperature, the levels of nutrients from carbon sources, amino acids and so on) to find out if there are any features shared by the conditions within each cluster. However, the available external variables are often limited by the expert knowledge in compiling a comprehensive list of relevant variables to measure. Consequently, the available external variables are often insufficient for characterizing the molecular details of various cellular states that are relevant to study of each gene set of interest.

We propose that a rich source of information is in fact contained in the gene expression data themselves. We hypothesize that a biologically meaningful state is likely to be accompanied by the activation or repression of some other genes in the genome whose biological functions are connected to the gene set under study. We use genome-wide one-way analysis of variance to find out which genes Z in the entire genome show major differential expression between the detected clusters of conditions. By studying the cellular functions of these differentially expressed genes, we may infer the biological meaning of the detected cellular activity states.

We shall apply CDC to the transcription regulatory study for Saccharomyces cerevisiae. In particular, we investigate the co-expression patterns between transcription factors (TFs) and their target genes (TGs). Two sets of data are required, micoarray gene expression data and transcription factor binding data.


    2 DATA
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 DATA
 3 METHOD
 4 RESULTS
 5 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
2.1 Microarray dataset
The microarray gene expression profiling comes from the Stanford yeast cellcycle data (Spellman et al., 1998). There are a total of 73 experimental conditions and 5818 genes in this dataset. As well documented in (Eisen et al., 1998), this expression dataset has been used not only in finding cell cycle-regulated genes but also in detecting many functionally associated genes that are not cell cycle regulated. (Shi et al., 2007; Yarragudi et al., 2007).

2.2 Transcription factor binding dataset
The binding dataset from the original ChIP dataset contains many false positives (Harbison et al., 2004; Lee et al., 2002). In their follow-up work, a list of TF-TG pairs is obtained, based on the conservation of the binding elements across several yeast species. (Gordon et al., 2005) In our study, we use the ChIP data with binding P-value cutoff at 0.001 and require the binding elements be conserved in at least two other yeast species. This gives us 101 transcription factors and a total of 4890 TF-TG pairs. We further exclude those TFs with less than five TGs. Therefore, there are 70 TFs remained for our CDC analysis. The original binding dataset could be downloaded from http://jura.wi.mit.edu/fraenkel/regcode/release_v24/bound_by_factor/.


    3 METHOD
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 DATA
 3 METHOD
 4 RESULTS
 5 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
3.1 Context-dependent clustering method
Denote the columns of gene set Y by y1, y2, ... , yn. Assume that there are K clusters. The likelihood of a general mixture model takes the form of


Formula 1

(1)
Here, pk is the probability for a column to belong to the k-th cluster. Given a column from the k-th cluster, the conditional probability density of yi is represented by a parametric function fk(yi|{theta}k). A well-known case of fk(yi|{theta}k) is the multivariate normal density with mean vector µk and covariance matrix proportional to the identity {sigma}2I; {theta}k = (µk, {sigma}2). This special case gives the Bayesian model for K-means clustering. Our context-dependent clustering method differs from the standard model-based approach in the requirement of an additional variable, x. The likelihood function is represented by


Formula 2

(2)

We consider a Gaussian mixture model with linear regression between x and Y. More specifically, the conditional density function fk(yi, xi|{theta}k) for the k-th cluster is


Formula

where the {Delta}yi = yi{alpha}k – β'kxi and {theta}k = ({alpha}k, βk, {sigma}k).

We considered two models for the covariance structure: (1) the unequal volume spherical model, {Sigma}k = {sigma}kI where I is the identity matrix and the scalar {sigma}k depends on k; and (2) the diagonal covariance model {Sigma}k = Dk where Dk is a diagonal covariance matrix.

3.2 Parameter estimation via EM
We apply the EM algorithm (Dempster et al., 1977) for parameter estimation. In the E-step, we impute the missing cluster indicators by the posterior probability qk, i = E(Ii = k|yi, xi, {theta}). Here, Ii is the cluster indicator for i-th row. The formula for qk,i is given by


Formula

In the M-step, the maximization of the likelihood is equivalent to finding the weighted least square estimators for {alpha}k, βk, {Sigma}k. The updated formula for these parameters Formula can be described by


Formula

where, {Delta}yi,j is the j-th element in the vector {Delta}yi and Formula is the j-th diagonal element in Formula .

3.3 The number of clusters
In order to determine the number of clusters for CDC, we use Bayesian Information Criterion (BIC). BIC selects the model by minimizing – 2L = k log n. Here, L, k and n are respectively, the overall likelihood for the model, the number of parameters in the model and number of observations. Since there are only a moderate number of conditions (73) in the cell-cycle dataset, we further restrict that no more than five clusters are allowed.

3.4 P-value for CDC models
A permutation test is conducted to give a P-value for the statistical significance of CDC models:

  1. Randomly permute the vector x.
  2. Apply the CDC algorithm to gene set Y and the randomly generated x and record the value of final likelihood.
  3. Repeat the above steps N times and count the number C of permutations that yield likelihood values larger than the likelihood value from the real data.
  4. The estimated P-value is the ratio C/N.
  5. Reject the null hypothesis when C/N is less than {alpha}.

3.5 P-value for conditional correlations
A permutation test is conducted to provide a P-value for the conditional correlations between TF and TGs within each cluster. For a given cluster, the expression of TF are permuted 100 000 times to generate 100 000 correlations. The P-value is estimated by n/100 000. Here, n is the number of permuted conditional correlations whose values are larger than positive observed correlation or smaller than the negative one. In the Results section (Tables 1 and 4), the correlations footnoted by * means that the P-value is less than 0.05. Similarly, those footnoted by ** means that P-value is less than 0.01.


View this table:
[in this window]
[in a new window]

 
Table 1. Correlation between MET4 and its target genes

 

    4 RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 DATA
 3 METHOD
 4 RESULTS
 5 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
4.1 Gene expression study on transcription factors and binding targets in yeast
We apply CDC to the transcription regulatory study for S.cerevisiae. In particular, we investigate the co-expression patterns between transcription factors and their target genes. As illustrated by Figure 2, we first obtain a list of high confidence TF/TGs from Harbison et al. (2004). We examine their expression profiles using Stanford cell-cycle data. Since a transcription factor regulates the expression of its target genes, it seems reasonable to expect correlation between each TF and its TGs. But as it turns out, for the majority of cases, the correlation is rather weak. Only 4 (out of 70) transcription factors have correlation of 0.4 (or higher) with at least 20% of their target genes. To investigate possible hidden co-expression patterns, the CDC algorithm is applied to each transcription factor (as X) and its target gene set (as Y) to cluster the conditions (columns). After that, a P-value is computed to assess the significance of model fitting. If the model is found significant, then two follow-up steps will be conducted. Step A is to examine more closely the co-expression pattern between TF and its TGs separately for each estimated cellular state. At Step B, we look for genes that may be associated with the change of the hidden activity state by a genome-wide one-way Analysis of variance (ANOVA). Genes with significant differential expression patterns that are compatible with the estimated activity states can be obtained from the whole genome.


Figure 2
View larger version (16K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. A diagram for applying content-dependent clustering (CDC) to transcription regulatory study. First, a transcription factor and its target genes are found from Harbison et al. The expression profiles of this TF and its TGs will be used as X and Y to estimate the hidden activity states. The P-value is estimated through permutation test and only those transcription factors with small P-value (≤0.01) are kept for follow-up analysis. Step A and Step B. Step A is to investigate the co-expression patterns within each state, and Step B is to conduct a genome-wide search for the differentially expressed genes associated with the estimated activity states.

 
4.2 CDC analysis with unequal volume model
We applied CDC analysis to each of the 70 TFs and their binding targets. Using the unequal volume spherical model, we found that there were 65 TFs for which the CDC analysis found two or more clusters of conditions. To see which of these CDC analysis results may be false positives, we applied the permutation test as described in the Method section (P-value for CDC) and found out that 50 out of the 65 TFs are significant (P-values less than 0.01; false discovery rate < 2%). These 50 transcription factors are subject to the Step A and Step B follow-up analyses of Figure 2. The results of a few cases are discussed next. For the full list, please visit our website http://kiefer.stat.sinica.edu.tw/CDC/index.html.

4.3 Sulfur amino acid pathway regulation by MET4
The transcription factor MET4 is a leucine-zipper transcriptional activator that is responsible for the regulation of the sulfur amino acid pathway. According to the ChIP binding dataset, MET4 has seven target genes, of which five genes (MET2, MET6, STR3, CYS3 and SAM1) are found in the sulfur amino acid pathway of SGD. The other two genes do not have known functions. The CDC analysis (unequal volume covariance) with the expression profile of MET4 as x and the seven target gene expression profiles as gene set Y shows three clusters of conditions. Cluster CC1 has 30 conditions, cluster CC2 has 22 conditions and cluster CC3 has 21 conditions.

To shed light on the biological interpretation of the clusters of conditions found by CDC analysis, we conduct a genome-wide search for genes that may show differential expression under the three clusters of conditions we detected. This is done through one-way Analysis of Variance (ANOVA), using the three clusters as the group variable. Briefly, for each gene in the cell-cycle dataset, we compute the mean expression within each of the three clusters of conditions identified by CDC. Then the F-value of ANOVA test is computed. A list of 20 genes with highest F-value (smallest P-value) is listed in Table 2. The overly represented keyword ‘methionine’ is observed from the annotations of these 20 genes. In fact, among the 14 genes in sulfur amino acid biosynthesis, nine are contained in our list (Fig. 3) (hypergeometric P-value = 1.45E–20). In addition to these nine genes, five more genes, MMP1 (high affinity S-methylmethionine permease), MHT1 (S-Methylmethionine Homocysteine methylTransferase), MUP1 (high affinity methionine permease), MET28 (transcriptional activator in the Cbf1p-Met4p-Met28p complex) and MET1 (S-adenosyl-L-methionine uroporphyrinogen III transmethylase), are also involved in closely related biological processes. See Supplementary Material Section 4 for more discussion.


Figure 3
View larger version (22K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. The simplified sulfur amino acid biosynthesis pathways are presented here. The genes in boldface are included in ChIP data.

 

View this table:
[in this window]
[in a new window]

 
Table 2. Genes with most significant differential expression under the three clusters of conditions found by CDC for MET4

 
We examined the co-expression patterns between MET4 and all seven target genes. We found all correlations very weak, ranging from 0.23 to –0.17; see the last column (CC) of Table 1. However, within each cluster, the correlation patterns become more visible; see columns CC1, CC2 and CC3 of Table 1. For instance, in the first cluster, the correlations for MET2, STR3, MET6 and CYS3 are significant. The negative correlation for SAM1, though insignificant (P-value 0.17) is consistent with the fact that SAM1 plays a different role from those of MET2, STR3 and CYS3 — degradation of methioine; (see Fig. 3).

4.4 Mitotic cell-cycle regulation by ACE2 and SWI5
Both ACE2 and SWI5 regulate the expressions of several target genes involved in mating type switching, exit from mitosis and cell wall function (Doolin et al., 2001). The number of target genes from ChIP dataset are 13 and 37 for ACE2 and SWI5, respectively. We applied our CDC method to find the hidden cellular states. The number of clusters found by the CDC method is 4 for ACE2 and 3 for SWI5.

We conducted genome-wide one-way ANOVA to find out which genes in the genome show differential expression under the detected clusters. We inspect the genes with large F-values and find out that there are significant overlaps in the output for ACE2 and the output for SWI5. For instance, all 10 genes highlighted in the paper by Doolin et al. (2001) as novel targets, appear within the list of top 41 genes with highest F-values in the ANOVA output for ACE2 and within the list of top 54 genes for SWI5 (see Table 3).


View this table:
[in this window]
[in a new window]

 
Table 3. The F-values of ANOVA are listed for the 10 highlighted genes from Doolin et al.

 
Since ACE2 and SWI5 are cell-cycle regulated genes, it is of interests to find out if the clusters of conditions detected by our CDC method are related to cell-cycle progression or not. We take the gene expression profiles of the top 50 genes in the ANOVA output. We conduct a principal component analysis for these genes. We then display this component in a time plot, which shows the major trend in expression for these genes; see Figure 4a for the ACE2 output, and Figure 4b for SWI5 output. Both figures show similar periodic trends as would be expected in the expression pattern of cell cycle regulated genes. By color coding, we can easily find the conditions associated with each state. Although there are four states found for ACE2 and only three states found for SWI5, there is a substantial overlapping between them. (Percentage of agreement is 86%) By comparing Figure 4a and b, it is clear that the last two clusters for ACE2 correspond to one cluster detected for SWI5.


Figure 4
View larger version (31K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4. In both graphs, the black lines are the profile of the first PCA component from top 50 genes in ANOVA output for ACE2 and SWI5, respectively. Different clusters are assigned by different shaped dots. The height of the dots is the average of PCA1 over the conditions within the same clusters.

 
4.5 Respiratory gene regulation by HAP Family
The heme-activated glucose-repressed protein complex contains several subunits encoded by Hap2p, Hap3p, Hap4p and Hap5p. These transcription factors activate and regulate many respiratory genes. The CDC method finds three clusters for HAP4. In Table 4, the conditional correlations of a subset of genes out of 29 predicted target genes from ChIP dataset is shown. These genes are members of the mitochondrial electron transport chain (COX5A, COX6, COX8, COX9, COX12, COX13, QCR7, COR1, SDH1, RIP1, ND1) and ATP coupled proton transport (ATP1, ATP7, ATP15, ATP17, ATP20). A strong coherent pattern of positive correlation is observed in the column CC1, for the aforementioned respiratory genes. To yield biological explanation for the detected states, we conduct genome-wide one-way ANOVA. Table 5 gives the top 20 genes with highest F-values. We conduct a GO study for these genes. The enriched GO terms, GO:0042773 [ATP synthesis coupled electron transport] (2.21E–9), GO:0005746 [mitochondrial respiratory chain] (3.66E–7) and so on, are related to the biological functions of the target genes of HAP4. (See Supplementary Table S5 for a full list.) We also conduct the principal component analysis for the top 20 genes obtained by ANOVA. (See Fig. S4 in Supplementary Material.) As expected, we do not see any cyclic pattern. It is interesting to observe that the second cluster consists of the beginning time points of CDC28 and CDC15 experiments. These two experiments have one thing in common: the activation of the cell-cycle arrested yeast cells require change of temperatures.


View this table:
[in this window]
[in a new window]

 
Table 4. The conditional correlations between HAP4 and ATP and COX family genes

 

View this table:
[in this window]
[in a new window]

 
Table 5. Top 20 most significant genes by one way ANOVA for HAP4

 
4.6 CDC analysis with diagonal covariance model
The increased number of model parameters in diagonal covariance model, though providing more modeling flexibility, may also increase model uncertainty. We applied the diagonal model and compare the results with those from unequal volume model. The comparison is done by finding the percentage of conditions which share the same cluster ids (percentage of agreement). Percentage of agreement is defined based on the contingency table between two cluster ids. Given a contingency table, we first search for a best re-labeling of clusters so that the sum of diagonal elements (the total number of matches) will be maximized. This sum is further divided by the total number of conditions to yield the percentage of agreement in clustering the conditions between the CDC method with the diagonal covariance model and the CDC method with the unequal volume model. We computed the percentage of agreement for each of the 65 TFs and the distribution of the results was given in Figure 5. Expected for two cases, other values are all above 85%. The degree of agreement is very high.


Figure 5
View larger version (11K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5. The histogram of the percentage of agreement between unequal volume model and diagonal volume model.

 
4.7 CDC analysis with unrestricted covariance model
Another option is to allow for the completely unstructured covariance matrices. We did not take this option because the increased number of parameters would be overwhelming. Instead, in the Supplementary Material Section 1, through a simulation, we demonstrate that unequal volume modeling can still be effective in dealing with general covariance structure.


    5 DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 DATA
 3 METHOD
 4 RESULTS
 5 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Gene expression patterns across a wide range of conditions are complex. This is in part because they can be highly dependent on some hidden cellular states that may change the dynamics of gene regulation mechanisms. When the total set of profiling conditions contains a heterogeneous mix of several cellular states, it would be difficult to recognize the gene regulation pattern without grouping the conditions according to the relevant states first. We made the assumption that if the underlying cellular state could be well characterized and we could group the conditions properly according to their proper cellular state, then the gene expression pattern under conditions with the same cellular state should become simpler to describe and a linear regression would be appropriate. The challenge is of course that the state variable is unknown. In this article, we presented CDC as a new clustering method for helping the detection of the state variables that are biologically relevant to the mediation of a group of functionally related genes.

We apply the CDC method to study the transcription factors and its target gene regulation in yeast. Many authors have reported the lack of overall correlation between the expression profiles of transcription factors and their target genes. Indeed, the activity of a transcription factor in modulating its target gene expression may not correlate with the expression of the gene coding for the transcription factor for reasons such as post-translation modification by phosphorylation or nucleus/cytoplasm shuffling, protein degradation by ubiquitinization and so on. Several authors have conceptualize a measure of transcription factor activity that purportedly should reflect the net gain or loss of protein activity of the TF. Such TF activity profiles, if accurately measured, are expected to have a better correlation with the TFs target gene expression profiles (Liao et al., 2003; Yu and Li, 2005). TF activity profiles are hard to measure directly, however. They can only be estimated through mathematical modeling. Furthermore, the relationship between the estimated TF activity profiles and the TF gene expression profiles remains difficult to characterize.

The CDC analysis helps find subsets of profiling conditions under which the relationship between the expression profile of a TF and those of its target genes becomes more revealing. As illustrated in the case of MET4, three clusters of conditions are detected. Correlation, positive or negative, between MET4 expression and its target gene expression is significantly strengthened when conditions are confined within a cluster. In interpreting the correlation, it is still important to bear in mind the difference between the gene expression of a TF and the TFs protein activity. For instance, the significant positive correlations in Table 1 for the first cluster only reflect the coherence in the regulation of gene expression of MET2, STR3, MET6 and CYS3 by the protein activity of MET4 and the regulation of MET4 gene expression. Because according to the known literature, MET4 (as a protein) serves as transcription activators, we would expect a positive correlation between the MET4 protein activity profile and the expression profiles of MET2, STR3, MET6 and CYS3. Consequently, under the conditions within this cluster, we may infer that the MET4 protein activity is positively correlated with the gene expression of MET4. The second cluster shows significant negative correlations between the expression profile of MET4 and the expression profiles of STR3, MET6 and CYS3. Because there is no literature showing that the molecular function of MET4 protein can change from an activator to a repressor, we would still expect the protein activity profile of MET4 to be positively correlated with the gene expression profiles of STR3, MET6 and CYS3. Now following the same logic used before, we should infer that the MET4 protein activity may be negatively correlated with the gene expression of MET4 under the conditions of the second cluster. This negative association may reflect the feedback gene regulation in preventing the continued activation of the sulfur metabolic pathway. Consistent with this argument, we found that as shown in Figure 6, the expression levels of the genes in this pathway under this cluster of conditions are already much higher than those under the first cluster of conditions.


Figure 6
View larger version (24K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 6. The heatmap of nine genes in Sulfur amino acid pathway. These nine genes are found by genome-wide ANOVA based on the detected clusters from CDC analysis on the transcription factor MET4.

 
In addition to the cell-cycle gene expression data, our method can be applied to other large-scale microarray data. In particular, one referee brought up the yeast mutation data of Hughes et al. (2000). This is a very large collection of expression profiles corresponding to 300 diverse mutations and chemical treatments in S.cerevisiae. It would be interesting to see if our method can find specific clusters of biologically meaningful conditions (corresponding to chemical treatments, e.g., or some molecular markers) under which a small group of genes are strongly co-regulated. Nowadays, as the cost of microarray gets cheaper, biologists can afford to conduct larger-scale studies with more and more conditions. Many commonly used correlation methods might become less effective because of the more complex composition of cellular states. We believe that CDC can help in uncovering hidden cellular states and understand better about the complicated biological system.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 DATA
 3 METHOD
 4 RESULTS
 5 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
This work is supported in part by NSF Grants DMS-0201005 and DMS-0406091, and in part by MIB, Institute of Statistical Science, Academia Sinica.

The authors are grateful to two anonymous referees for their suggestions that help improve the presentation.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Joaquin Dopazo

Received on April 24, 2007; revised on August 25, 2007; accepted on August 27, 2007

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 DATA
 3 METHOD
 4 RESULTS
 5 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Alon U, et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl Acad. Sci. USA (1999) 96:6745–6750.[Abstract/Free Full Text]

    Alter O, et al. Singular value decomposition for genome-wide expression data processing and modeling. Proc. Natl Acad. Sci. USA (2000) 97:10101–10106.[Abstract/Free Full Text]

    Brown MP, et al. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl Acad. Sci. USA (2000) 97:262–267.[Abstract/Free Full Text]

    Chow ML, et al. Identifying marker genes in transcription profiling data using a mixture of feature relevance experts. Physiol. Genomics (2001) 5:99–111.[Abstract/Free Full Text]

    Dempster A, et al. Maximum-likelihood from incomplete data via the em algorithm. J. R. Stat. Soc. B (1977) 39:1–38.

    Doolin M.-T, et al. Overlapping and distinct roles of duplicated yeast transcription factors ace2p and swi5p. Mol. Microbiol. (2001) 40:422–432.[CrossRef][Web of Science][Medline]

    Eisen MB, et al. Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA (1998) 95:14863–14868.[Abstract/Free Full Text]

    Golub TR, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science (1999) 15:531–537.

    Gordon DB, et al. Tamo: a flexible, object-oriented framework for analyzing transcriptional regulation using DNA-sequence motifs. Bioinformatics (2005) 21:3164–3165.[Abstract/Free Full Text]

    Harbison CT, et al. Transcriptional regulatory code of a eukaryotic genome. Nature (2004) 431:99–104.[CrossRef][Medline]

    Hughes T, et al. Functional discovery via a compendium of expression profiles. Cell (2000) 102:109–126.[CrossRef][Web of Science][Medline]

    Lee T, et al. Transcriptional regulatory networks in Saccharomyces cerevisiae. Science (2002) 298:799–804.[Abstract/Free Full Text]

    Li K.-C. Genome-wide coexpression dynamics: theory and application. Proc. Natl Acad. Sci. USA (2002) 99:16875–16880.[Abstract/Free Full Text]

    Li K.-C, Yuan S. A functional genomic study on nci's anticancer drug screen. Pharmacogenomics J. (2004) 4:127–135.[CrossRef][Web of Science][Medline]

    Li K.-C, et al. A system for enhancing genome-wide coexpression dynamics study. Proc. Natl Acad. Sci. USA (2004) 101:15561–15566.[Abstract/Free Full Text]

    Liao JC, et al. Network component analysis: reconstruction of regulatory signals in biological systems. Proc. Natl Acad. Sci. USA (2003) 100:15522–15527.[Abstract/Free Full Text]

    Liu X, et al. Including probe-level uncertainty in model-based gene expression clustering. BMC Bioinformatics (2007) 8:98.[CrossRef][Medline]

    Marcotte EM, et al. A combined algorithm for genome-wide prediction of protein function. Nature (1999) 402:83–86.[CrossRef][Medline]

    Medvedovic M, Sivaganesan S. Bayesian infinite mixture model based clustering of gene expression profiles. Bioinformatics (2002) 18:1194–1206.[Abstract/Free Full Text]

    Ross DT, et al. Systematic variation in gene expression patterns in human cancer cell lines. Nat. Genet. (2000) 24:227–235.[CrossRef][Web of Science][Medline]

    Scherf U, et al. A gene expression database for the molecular pharmacology of cancer. Nat. Genet. (2000) 24:236–244.[CrossRef][Web of Science][Medline]

    Shi Y, et al. Inferring pairwise regulatory relationships from multiple time series datasets. Bioinformatics (2007) 23:755–763.[Abstract/Free Full Text]

    Spellman PT, et al. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell (1998) 9:3273–3297.[Abstract/Free Full Text]

    Tamayo P, et al. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc. Natl Acad. Sci. USA (1999) 96:2907–2912.[Abstract/Free Full Text]

    Tseng GC, Wong WH. Tight clustering: a resampling-based approach for identifying stable and tight patterns in data. Biometrics (2005) 61:10–16.[CrossRef][Web of Science][Medline]

    Weaver RF. Molecular Biology (2002) 2nd ed. Boston: McGraw-Hill.

    Yarragudi A, et al. Genome-wide analysis of transcriptional dependence and probable target sites for abf1 and rap1 in Saccharomyces cerevisiae. Nucleic Acids Res. (2007) 35:193–202.[Abstract/Free Full Text]

    Yeung KY, et al. Model-based clustering and data transformations for gene expression data. Bioinformatics (2001) 17:977–987.[Abstract/Free Full Text]

    Yu T, Li K.-C. Inference of transcriptional regulatory network by two-stage constrained space factor analysis. Bioinformatics (2005) 21:4033–4038.[Abstract/Free Full Text]

    Zhou X, et al. Transitive functional annotation by shortest-path analysis of gene expression data. Proc. Natl Acad. Sci. USA (2002) 99:12783–12788.[Abstract/Free Full Text]

    Zhou XJ, et al. Functional annotation and network reconstruction through cross-platform integration of microarray data. Nat. Biotechnol. (2005) 23:238–243.[CrossRef][Web of Science][Medline]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
23/22/3039    most recent
btm457v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Yuan, S.
Right arrow Articles by Li, K.-C.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Yuan, S.
Right arrow Articles by Li, K.-C.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?