Bioinformatics Advance Access originally published online on December 22, 2006
Bioinformatics 2007 23(4):473-479; doi:10.1093/bioinformatics/btl640
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
The discovery of transcriptional modules by a two-stage matrix decomposition approach
Bioinformatics Unit, Branch of Research Resources, National Institute on Aging NIH, Baltimore, MD 21224, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: We address the problem of identifying gene transcriptional modules from gene expression data by proposing a new approach. Genes mostly interact with each other to form transcriptional modules for context-specific cellular activities or functions. Unraveling such transcriptional modules is important for understanding biological network, deciphering regulatory mechanisms and identifying biomarkers.
Method: The proposed algorithm is based on two-stage matrix decomposition. We first model microarray data as non-linear mixtures and adopt the non-linear independent component analysis to reduce the non-linear distortion and separate the data into independent latent components. We then apply the probabilistic sparse matrix decomposition approach to model the hidden expression profiles of genes across the independent latent components as linear weighted combinations of a small number of transcriptional regulator profiles. Finally, we propose a general scheme for identifying gene modules from the outcomes of the matrix decomposition.
Results: The proposed algorithm partitions genes into non-mutually exclusive transcriptional modules, independent from expression profile similarity measurement. The modules contain genes with not only similar but different expression patterns, and show the highest enrichment of biological functions in comparison with those by other methods. The usefulness of the algorithm was validated by a yeast microarray data analysis.
Availability: The software is available upon request to the authors.
Contact: zhanmi{at}mail.nih.gov
| 1 INTRODUCTION |
|---|
|
|
|---|
Genes mostly interact with each other to form transcriptional modules for context-specific cellular activities or functions (Brunet et al., 2004; Hughes et al., 2000; Segal et al., 2004). Unraveling such transcriptional modules is important for understanding biological network, deciphering regulatory mechanisms and identifying biomarkers.
Various methods have been proposed for identifying gene transcriptional modules from microarray data. Most of them are clustering-based, such as hierarchical clustering (Eisen et al., 1998), K-means (Tavazoie et al., 1999) and self-organizing map (SOM) (Tamayo et al., 1999). The clustering-based methods mostly measure correlative and linear relationship among genes, and partition genes into mutually exclusive clusters. The methods identify transcriptional modules by assuming that genes with similar expression profiles share similar functions or the same pathway. However, genes involved in the same biological process or pathway can have different expression patterns (Zhou et al., 2005). It is also important to capture contextual modularity that might result from highly non-linear interactions among genes. Moreover, one gene may be part of several biological processes and thus should belong to different modules. Apparently, the clustering-based methods are not suitable to address these problems.
Recently, matrix decomposition methods have been introduced for uncovering transcriptional modules from microarray data. These methods treat microarray data as a mixture of unknown signals that may correspond to specific biological sources. Since matrix decomposition methods do not cluster genes based on pair-wise similarity measurement, functional related genes showing different expression patterns can be clustered together (Dueck et al., 2005; Lee and Batzoglou, 2003; Zhou et al., 2005). The methods can also partition genes into non-mutually exclusive modules to reflect the fact that genes may have multiple functions or are active in multiple biological processes. A variety of matrix decomposition methods have been proposed for microarray data analysis. Some of them take non-probabilistic approaches, which include singular value decomposition (Alter et al., 2000; Holter et al., 2001), independent components analysis (Chiappetta et al., 2004; Frigyesi et al., 2006; Lee and Batzoglou, 2003; Liebermeister, 2002), non-negative matrix factorizations (Brunet et al., 2004; Carmona-Saez et al., 2006; Gao and Church, 2005; Kim and Tidor, 2003; Lee and Seung, 1999; Wang et al., 2006) and network component analysis (Liao et al., 2003). Others take probabilistic approaches that account for noise, such as probabilistic sparse matrix factorization (Dueck et al., 2005). Most of these methods however use linear models that describe gene expression data as linear combinations of latent biological sources. In fact, non-linear interactions among genes are often observed in gene regulatory networks, such as negative feedback events, two consecutive biological events of threshold and saturation, etc. (Atkinson et al., 2003; Yuh et al., 1998). Thus, non-linear mixtures would provide a more realistic model for gene relationships and transcriptional modules. Recently, progresses have been made for general non-linear mixed models in matrix decomposition (Jutten and Karhunen, 2004). A promising method among these advances is the non-linear independent component analysis (NICA) by variational Bayesian learning (Jutten and Karhunen, 2004; Lappalainen and Honkela, 2000). This method uses a multilayer perceptron network to model the non-linear mapping from sources to observed data, and infers statistically independent sources and parameters of the mapping that most likely generate the observed data (Lappalainen and Honkela, 2000). Other non-linear mappings include kernel-based functions (Muller et al., 2001), radial basis functions (RBF) (Tan et al., 2001), etc. These statistical methods however have not yet been used in microarray data analysis for transcriptional module discovery.
In this study, we introduce a novel approach, which is based on two-stage matrix decomposition, to model gene expression data and uncover the modular structure. Considering that interactions among many genes, particularly between genes and transcription factors, are possibly non-linear, we first model microarray data as non-linear mixtures and adopt a NICA decomposition to reduce the non-linear distortion in the data and separate the data into independent latent components, which may correspond to specific biological sources unknown or hidden. We then apply a probabilistic sparse matrix factorization (PSMF) approach to model the hidden expression profile of genes across the independent latent components as linear weighted combinations of profiles of a small number of predominant transcriptional regulators (e.g. transcription factors). The NICA transformation allows capture the non-linear structure of transcriptional profiles in our method. The PSMF procedure is appropriate for modeling gene expression data, in which while many genes are involved in gene regulation, only a small number of regulators (e.g. transcription factors) have predominant impact to the expression of any specific gene. From the model, we propose a general approach for identifying transcriptional modules. The newly proposed algorithm takes into account that the non-linear structure existed in the data. Besides clustering together genes with similar expression profiles and similar function categories, our algorithm can also cluster genes, which show different expression profiles but related functions. One gene can be assigned into multiple modules. Compared with other methods [e.g. hierarchical clustering, K-means clustering, self-organizing maps, PSMF or independent components analysis (ICA)], the newly proposed algorithm appears to have improved performance in identifying biologically meaningful transcriptional modules. The usefulness of our algorithm in transcriptional module discovery was validated by a yeast microarray data analysis.
| 2 METHODS |
|---|
|
|
|---|
Framework of the algorithm
Figure 1 shows the general schema of the proposed algorithm. The algorithm is based on two-stage matrix decomposition. First, the NICA stage de-non-linearizes microarray data into independent latent components. Then, the PSMF stage models the gene hidden expression profile across the independent latent components as linear weighted combinations of a small number of regulator profiles.
|
The NICA stage The NICA method we adopt is based on a variational Bayesian learning approach (Jutten and Karhunen, 2004; Lappalainen and Honkela, 2000). This approach assumes that the microarray data can be described by the noisy non-linear mixing model
|
| (1) |
|
| (2) |
Assuming that the source signals S at the input layer of the MLP network have simple Gaussian distributions, we obtain a non-linear principal component analysis solution for S based on variational Bayesian learning for blind estimation and separation in non-linear mixture data model in Equation (1). This solution can usually model quite well non-linear mixtures (observed data), but it does not yet provide estimates of the independent source signals. To find independent components from S, we apply a standard linear ICA to S using the FastICA algorithm (Hyvarinen and Oja, 2000). The goal of the linear ICA is to decompose
so that columns (components) of
are statistically as independent as possible.
The PSMF stage. The PSMF approach is a probabilistic extension of sparse matrix factorization, which can account for uncertainties due to different levels of noise in the data (Dueck et al., 2005). Given a matrix
in which element
represents the hidden expression of gene i under the independent latent component j, PSMF is to find
and
such that
. Each row of
has at most K non-zero entries. Row vectors of Z contain unobserved L factor profiles. More specifically, let ki be the number of non-zero entries (ki
K) of the row vector
in Y and
be the vector that contains column indices of non-zero entries of yi, we model each gene hidden expression profile across the independent latent component
, as a linear combination of ki of the factor profiles
, plus noise:
|
| (3) |
|
| (4) |
Identification of gene modules Given the non-linear independent latent component matrix
and its composition matrices Y and Z, we propose a general approach for identifying gene modules. Gene modules are identified by assigning genes with the indices of non-zero element of the column vector
of Y (
) into the cluster associated with the factor l, l = 1,
, L. Based on the assumption that Z contains unobserved L factor profiles, we compute correlations between row vectors
of
and row vectors zl of Z and then select the genes that hidden expression profiles are significantly correlated with zl as potential transcription factors. The significant level of the observed correlation is dependent on the Pearson correlation value, Ril (i = 1,
, N; l = 1,
, L) and the sample size M'. The p-value of the correlation is obtained by treating
as coming from a t-distribution with M' 2 degrees of freedom (SAS, 2004).
The proposed algorithm is summarized as follows.
Input
- Microarray data matrix X = [Xij], where the element Xij represents the expression level of gene i associated with the jth sample, i = 1,
, N, j = 1,
, M.
- Number of modules (determined by L, the number of possible regulators).
- Maximum number of effective regulators, K. K should be less or equal to L.
Output
- Gene set in each of L modules.
- Potential transcription regulators.
Algorithm
- De-non-linearize X using the NICA approach.
- Find the non-linear principle component matrix S from Equation (1) and Equation (2) by variational Bayesian learning.
- Decompose
by the linear FastICA algorithm to obtain linear independent components in
.
- Find the non-linear principle component matrix S from Equation (1) and Equation (2) by variational Bayesian learning.
- Decompose
using the PSMF method with constraint
0.- If prior partial connection information is known, initialize the elements yil = 1 in Y; else set the value of yil in Y randomly.
- Obtain Y and Z by factorized variational inference.
- If prior partial connection information is known, initialize the elements yil = 1 in Y; else set the value of yil in Y randomly.
- Cluster gene modules and predict key transcription regulators based on
, Y and Z.- For l = 1,
, L, find the indices of non-zero element of column vector
in Y where
. Genes with these indices are the members in module l.
- For i = 1,
, N and l = 1,
, L, compute the correlation coefficient Ril between
and zl where
and zl
1 x M'; sort Ril for a given l and obtain the indices of top Ril's that satisfy cutting threshold and p-value.
- For l = 1,
Evaluation of the algorithm
We compare our algorithm with other algorithms using the ClusterJudge method based on the Gene Ontology (GO) (Gibbons and Roth, 2002). In the GO, each gene is described by a set of GO attributes of molecular function, biological process, or cellular component that the gene is associated to (www.geneontology.org). Based on the list of genes and associated GO attributes, a contingency table is constructed for each cluster-attribute pair, from which the entropy is computed for each cluster-attribute pair (HAiC), for the clustering result independent of attributes (HC), and for each of the NA attributes in the table independent of clusters (HAi). The total mutual information between the cluster result C and all attributes Ai is computed as:
|
| (5) |
We also examine statistical over-representation of GO terms in each transcriptional module identified by our algorithm. The Fisher's exact test is conducted to calculate the hypergeometric probability of observing a GO term as enriched in a transcriptional module. In specific, the probability p that a GO term is significantly enriched in a module is calculated as:
|
| (6) |
Software and experimental validation
We implemented a Matlab-based computational tool, ModulePro, for the algorithm that we developed. We applied ModulePro to a publicly available yeast cDNA microarray dataset for method validation. The yeast dataset contains 6285 genes and 156 samples, including different experimental conditions such as temperature shocks, amino acid starvation, nitrogen source deletion, progression into stationary phase, etc. (Gasch et al., 2000). The normalization of the data was conducted by Zero-Transform (Gollub et al., 2006), and the missing data were filled using the KNNimpute approach (Troyanskaya et al., 2001).
| 3 RESULTS AND DISCUSSION |
|---|
|
|
|---|
Feature of the algorithm
The essence of our algorithm is to improve the recovery of transcriptional modules in the form of functional links among genes, made possible by incorporating NICA transformation into PSMF decomposition on microarray data. The NICA method we proposed is based on a variational Bayesian learning (Jutten and Karhunen, 2004; Lappalainen and Honkela, 2000). The method uses a multilayer perceptron (MLP) network as a non-linear mapping to model non-linear mixtures of data. Theoretically, given enough nodes in the hidden layer, a MLP network can model any non-linear mapping from sources to observed data with certain accuracy (Haykin, 1999). In addition, a MLP network is a flexible non-linear mapping because its model complexity scales linearly with the dimension of the latent source space (Lappalainen and Honkela, 2000). To our best knowledge, this NICA method has never been used in microarray data analysis.
PSMF decomposition models gene expression as a linear weighted combination of a small number of prototypes that represent the influence of either different biological or experimental factors (Dueck et al., 2005). The PSMF model is based on the assumption that the expression of each gene is influenced by only a small set of the hidden regulatory variables to various degrees. Prior to PSMF decomposition, we apply NICA transformation. This allows non-linear mixtures and thus provides a more realistic model for gene expression data, since many interactions between genes (e.g. between transcription factors and target genes) are not necessarily linear. The proposed algorithm projects the gene expression under different experimental conditions into corresponding meta-expression in the latent independent components that may correspond to specific biological sources. The algorithm can thus overcome the singularity problem, which often causes PSMF to fail. In comparison with other methods, our algorithm can dramatically improve the quality of gene clustering (described later).
When applying our algorithm, the choice of the parameters L and K at the PSMF stage is important for the structure of decomposition. As described in the Methods section, K is the maximum number of effective regulators. L is the predefined number of possible regulators that determines the number of modules in our algorithm. L is in general much smaller than the number of variables (genes) N since gene expression is thought to be regulated by a small set of genes (compared with the total number of genes). These genes act in combination as key regulators to maintain the steady-state abundance of specific mRNAs. For the choice K = 1, each row in the data matrix is associated with only a single factor, and the sparse matrix factorization is a clustering of the data rows. For the choice K = L, the factorization is simply a low-rank approximation. Since we assume that the expression of each gene is determined by only a small subset of the possible regulators we heuristically set K = 3 in our study. It is worthy to note that although our algorithm is unsupervised, prior biological information such as binding transcriptional factor data could be incorporated into the initialization of Y (also see Methods section for details).
Performance of the algorithm
To evaluate the performance, we compared our algorithm with average linkage hierarchical clustering (HC), K-means clustering, SOM, ICA (Lee and Batzoglou, 2003) and PSMF (Dueck et al., 2005) methods. We assert that the best clustering method is that with the strongest tendency to bring genes of similar function together when applied to diverse expression datasets, as we claim in our algorithm. For each method, we examined the relationship between identified clusters and the GO attributes of the genes in the clusters, using the z-score. The z-score is based on mutual information between cluster membership and GO attributes, derived from comparing real clustering with random clustering of genes. A higher z-score indicates that a clustering result is further from random so that more biologically meaningful.
Figure 2 shows the z-scores from clustering by our algorithm and the five other clustering methods on a yeast microarray dataset. The yeast data contains 156 samples from 20 different conditions. The clustering by each method was made under different cluster numbers, ranging from 5 to 60, and the calculated z-scores were plotted as a function of cluster numbers (l) (Fig. 2). In our algorithm, we set the number of independent latent components equal to the number of experimental conditions for simplicity. For getting more accurate non-linear mapping, we set the number of hidden neurons in the MLP network as twice as the number of independent latent components. We also set K = 3 in both our algorithm and PSMF. We selected clusters for ICA as: clusters containing 10% of all genes with either high or low expression levels within the independent components (Lee and Batzoglou, 2003). As illustrated, our algorithm consistently performed well with high z-scores at different choices of l, showing the best enrichment of functional attributes in the gene clusters, in comparison with HC, SOM, PSMF and ICA. The performance of K-means clustering, which was based on the Pearson correlation as the distance metric, was comparable with our algorithm. However, our algorithm partition genes into non-mutually exclusive modules, reflecting the fact that genes may have multiple functions and are active in multiple biological processes, while K-means clustering of genes is mutually exclusive. Strikingly, HC and ICA performed significantly poorer than our algorithm, SOM, K-means and PSMF. We also observed that when applied to gene expression data directly, PSMF often failed on the matrix decomposition. This is because of matrix singularity caused by strong correlation among experimental conditions in gene expression data. Our algorithm, by incorporating NICA into PSMF decomposition and projecting the gene expression of different experimental conditions into corresponding meta-expression in independent latent components, overcomes the singularity problem.
|
For further evaluation of our algorithm, we examined the enriched GO categories in transcriptional modules uncovered from the same yeast microarray dataset. We subsequently identified 30 transcriptional modules by setting L = 30 and K = 3 in our algorithm. Among them, eight modules were significantly enriched by GO terms (corrected P-values
0.001). Each gene module appeared to be dominated by only a few functional categories (Table 1). The proposed approach is effective in clustering together genes with similar expression profiles and similar function categories. Module C16, for example, is strongly associated with the ribosomal subunit assembly process (P-value < 1.39E-16). The heatmap indicated similar expression patterns of genes under different experimental conditions (Fig. 3). More significantly, our algorithm can cluster genes which show different expression profiles but similar functions. Module C11, for example, is significantly enriched by proteolysis (P-value < 2.45E-33). The heatmap revealed at least three distinctive expression patterns in this module (denoted as A, B and C, Fig. 4). This clearly places our algorithm above similarity-based methods (e.g. HC and SOM) that are under the assumption that genes with similar expression profiles should have similar functions. We also compared statistical significance of common enriched functional categories in transcriptional modules uncovered by our algorithm, HC, SOM, K-means, ICA and PSMF (Table 2). Among those common functional categories detected significantly by all methods, there are five out of eight functional categories that our method produced significantly lower corrected p-values than other methods did.
|
|
|
|
| 4 SUMMARY |
|---|
|
|
|---|
In this study, we present a new method for transcriptional module discovery from microarray data. The new approach is based on two-stage decomposition of microarray data, firstly by NICA and then by PSMF. Our method offers several advantages: (1) it takes into account the non-linear structure existed in the data; (2) the approach can identify genes of the similar functions yet without similar expression profiles; (3) the method can assign one gene into different modules; and (4) the method avoids the singularity problem in matrix decomposition. In comparison with other methods, our method leads to a significant improvement in uncovering biologically relevant transcriptional modules.
| Acknowledgments |
|---|
This study is supported by the Intramural Research Program, National Institute on Aging, NIH.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Golan Yona
Received on September 18, 2006; revised on November 14, 2006; accepted on December 14, 2006
| REFERENCES |
|---|
|
|
|---|
Alter, O., et al. (2000) Singular value decomposition for genome-wide expression data processing and modeling. Proc. Natl Acad. Sci. USA, 97, 1010110106
Atkinson, M.R., et al. (2003) Development of genetic circuitry exhibiting toggle switch or oscillatory behavior in Escherichia coli. Cell, 113, 597607[CrossRef][Web of Science][Medline].
Brunet, J.P., et al. (2004) Metagenes and molecular pattern discovery using matrix factorization. Proc. Natl Acad. Sci. USA, 101, 41644169
Carmona-Saez, P., et al. (2006) Biclustering of gene expression data by non-smooth non-negative matrix factorization. BMC Bioinformatics, 7, 78[CrossRef][Medline].
Chiappetta, P., et al. (2004) Blind source separation and the analysis of microarray data. J. Comput. Biol, . 11, 10901109[CrossRef][Web of Science][Medline].
Dueck, D., et al. (2005) Multi-way clustering of microarray data using probabilistic sparse matrix factorization. Bioinformatics, 21, Suppl 1, i144i151[Abstract].
Eisen, M.B., et al. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA, 95, 1486314868
Frigyesi, A., et al. (2006) Independent component analysis reveals new and biologically significant structures in microarray data. BMC Bioinformatics, 7, 290[CrossRef][Medline].
Gao, Y. and Church, G. (2005) Improving molecular cancer class discovery through sparse non-negative matrix factorization. Bioinformatics, 21, 39703975
Gasch, A.P., et al. (2000) Genomic expression programs in the response of yeast cells to environmental changes. Mol. Biol. Cell, 11, 42414257
Gibbons, F.D. and Roth, F.P. (2002) Judging the quality of gene expression-based clustering methods using gene annotation. Genome Res, . 12, 15741581
Gollub, J., Ball, C.A., Sherlock, G. (2006) The Stanford Microarray Database: a user's guide. Methods Mol. Biol, . 338, 191208[Medline].
Haykin, S. Neural Networks: A Comprehensive Foundation, (1999) , Prentice Hall McMaster University.
Holter, N.S., et al. (2001) Dynamic modeling of gene expression data. Proc. Natl Acad. Sci. USA, 98, 16931698
Hughes, J.D., et al. (2000) Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J. Mol. Biol, . 296, 12051214[CrossRef][Web of Science][Medline].
Hyvarinen, A. and Oja, E. (2000) Independent component analysis: algorithms and applications. Neural Networks, 13, 411430[CrossRef][Web of Science][Medline].
Jordan, M.I., Ghahramani, Z., Jaakkola, T.S., Saul, L.K. (1999) An introduction to variational methods for graphical models. Learn. Graph. Models, In Jordan, M.I. (Ed.). , Cambridge MIT Press.
Jutten, C. and Karhunen, J. (2004) Advances in blind source separation (BSS) and independent component analysis (ICA) for nonlinear mixtures. Int. J. Neural. Syst, . 14, 267292[CrossRef][Medline].
Kim, P.M. and Tidor, B. (2003) Subsystem identification through dimensionality reduction of large-scale gene expression data. Genome Res, . 13, 17061718
Lappalainen, H. and Honkela, A. (2000) Bayesian nonlinear independent component analysis by multi-layer perceptrons. In Girolami, M. (Ed.). Advances in Independent Component Analysis, Springer-Verlag, pp. 93121.
Lee, D.D. and Seung, H.S. (1999) Learning the parts of objects by non-negative matrix factorization. Nature, 401, 788791[CrossRef][Medline].
Lee, S.I. and Batzoglou, S. (2003) Application of independent component analysis to microarrays. Genome Biol, . 4, R76[CrossRef][Medline].
Liao, J.C., et al. (2003) Network component analysis: reconstruction of regulatory signals in biological systems. Proc. Natl Acad. Sci. USA, 100, 1552215527
Liebermeister, W. (2002) Linear modes of gene expression determined by independent component analysis. Bioinformatics, 18, 5160
Muller, K., et al. (2001) An introduction to kernel-based learning algorithms. IEEE Trans. Neural Networks, 12, 181201.
SAS. Base SAS 9.1 Procedures Guide, (2004) SAS Publishing.
Segal, E., et al. (2004) A module map showing conditional activity of expression modules in cancer. Nat. Genet, . 36, 10901098[Web of Science][Medline].
Tamayo, P., et al. (1999) Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc. Natl Acad. Sci. USA, 96, 29072912
Tan, Y., et al. (2001) Nonlinear blind source separation usign a radial basis function network. IEEE Trans Neural Networks, 12, 124134[CrossRef].
Tavazoie, S., et al. (1999) Systematic determination of genetic network architecture. Nat. Genet, . 22, 281285[CrossRef][Web of Science][Medline].
Troyanskaya, O., et al. (2001) Missing value estimation methods for DNA microarrays. Bioinformatics, 17, 520525
Wang, G, et al. (2006) LS-NMF: a modified non-negative matrix factorization algorithm utilizing uncertainty estimates. BMC Bioinformatics, 7, 175[CrossRef][Medline].
Yuh, C.H., et al. (1998) Genomic cis-regulatory logic: experimental and computational analysis of a sea urchin gene. Science, 279, 18961902
Zhou, X.J., et al. (2005) Functional annotation and network reconstruction through cross-platform integration of microarray data. Nat. Biotechnol, . 23, 238243[CrossRef][Web of Science][Medline].
This article has been cited by other articles:
![]() |
H. Li and M. Zhan Unraveling transcriptional regulatory programs by integrative analysis of microarray and transcription factor binding data Bioinformatics, September 1, 2008; 24(17): 1874 - 1880. [Abstract] [Full Text] [PDF] |
||||
![]() |
C. Chang, Z. Ding, Y. S. Hung, and P. C. W. Fung Fast network component analysis (FastNCA) for gene regulatory network reconstruction from microarray data Bioinformatics, June 1, 2008; 24(11): 1349 - 1358. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||





