Bioinformatics Advance Access originally published online on April 9, 2008
Bioinformatics 2008 24(11):1349-1358; doi:10.1093/bioinformatics/btn131
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Fast network component analysis (FastNCA) for gene regulatory network reconstruction from microarray data
1Department of Electrical and Electronic Engineering, The University of Hong Kong, Pokfulam Road, Hong Kong, 2Department of Electrical and Computer Engineering, University of California, Davis, CA 95616, USA and 3Department of Medicine, The University of Hong Kong, Pokfulam Road, Hong Kong
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Recently developed network component analysis (NCA) approach is promising for gene regulatory network reconstruction from microarray data. The existing NCA algorithm is an iterative method which has two potential limitations: computational instability and multiple local solutions. The subsequently developed NCA-r algorithm with Tikhonov regularization can help solve the first issue but cannot completely handle the second one. Here we develop a novel Fast Network Component Analysis (FastNCA) algorithm which has an analytical solution that is much faster and does not have the above limitations.
Results: Firstly FastNCA is compared to NCA and NCA-r using synthetic data. The reconstruction of FastNCA is more accurate than that of NCA-r and comparable to that of properly converged NCA. FastNCA is not sensitive to the correlation among the input signals, while its performance does degrade a little but not as dramatically as that of NCA. Like NCA, FastNCA is not very sensitive to small inaccuracies in a priori information on the network topology. FastNCA is about several tens times faster than NCA and several hundreds times faster than NCA-r. Then, the method is applied to real yeast cell-cycle microarray data. The activities of the estimated cell-cycle regulators by FastNCA and NCA-r are compared to the semi-quantitative results obtained independently by Lee et al. (2002). It is shown here that there is a greater agreement between the results of FastNCA and Lee's, which is represented by the ratio 23/33, than that between the results of NCA-r and Lee's, which is 14/33.
Availability: Software and supplementary materials are available from http://www.eee.hku.hk/~cqchang/FastNCA.htm
Contact: cqchang{at}eee.hku.hk
| 1 INTRODUCTION |
|---|
|
|
|---|
High-throughput microarray technology is an important advancement in the post-genome period. Much effort has been devoted to obtain information, particularly functions of living organisms using data from DNA microarrays. The starting point of the new field of systems biology, which has emerged in the past decade, is the awareness of the fact that cells must respond and adapt to environmental perturbation by adjusting their genomic expression program in short-term and long-term measures via positive or negative regulation of the DNA binding transcription regulators/factors (TF). The TF activity (TFA), which can be defined as the concentration of its subpopulation capable of DNA binding, is therefore an important parameter, but usually very difficult to be measured directly in an in vivo system (see e.g. Jorgensen and Tyers, 2000; Toone et al., 1997; Simon et al., 2001). A key aspect of systems biology is to develop mathematical networks to reconstruct the hidden dynamics of the transcription regulatory activities based on DNA microarray data. In the terminology of engineering, the biological system is taken as an information processing system and relevant mathematical techniques in the general field of information processing have been applied, developed with specific adaptation treatment, to analyze biological system in recent years. Some integrative or novel mathematic methods have also emerged along this line to obtain informative signals from a low-dimensional regulatory network system which generates a set of high-dimensional data.
Essentially such investigation amounts to solving the inverse problem under constraints. The principal component analysis (PCA) and independent component analysis (ICA) have been developed to solve the problem under the mathematical assumptions that the signal sources are mutually orthogonal and statistically independent, respectively (Alter et al., 2000; Liebermeister, 2002). However, neither PCA nor ICA is suitable in biological systems in general because the orthogonality property and the statistical independent property are not usually valid (Chang et al., 2006, 2007). Other types of constraints have to be imposed in systems biology.
The following linear model of gene regulatory network, with and without noise, is considered (Liao et al., 2003):
|
| (1) |
|
| (2) |
is the inevitable measurement noise and the unknown N x M matrix A specifies the connectivity strength (transcription regulation strength) between the regulatory domain (TFA) and output domain (expression dynamics), while N, M and K are the number of genes, number of TFs and number of time points, respectively. X is the ideal gene-expression matrix where there is no noise in the measurement. This model can be used to represent any linear regulatory networks in systems biology other than transcriptional regulatory network. In general, gene regulation processes are dynamic and non-linear. The static linear model given in Equation (1) is derived under the following assumptions. Gene expression can be considered as the product of the regulating TFAs using a combinatorial power-law model, which can be viewed as a log-linear approximation of non-linear kinetic systems in multiple dimensions (Savageau, 1976; Torrest and Voit, 2002; Voit and Almeida, 2004). It is assumed that the time scale of TFA change is much longer than that of mRNA. Therefore, mRNA levels at most time are in a quasi-steady state, and thus in this quasi-steady state the dynamic model becomes approximately static. Furthermore, DNA microarrays are commonly expressed in log-ratio, and for this log-ratio the product of power-law model for the raw mRNA level becomes a linear model. This explains why we have a static linear model as in Equation (1). A detailed derivation of this model can be found in Liao et al. (2003) and Galbraith et al. (2006).
Our problem is to recover A and S from X according to Equation (2). The method developed in Liao et al. (2003) to solve such inverse problem is called network component analysis (NCA), which is based on the following three constraints referred to as NCA criteria: (i) the connectivity matrix A must be full-column rank; (ii) when an element in the regulatory domain is removed along with all the output elements connected to it, the connectivity matrix of the resulting network is still of full-column rank; (iii) S must have full-row rank.
The NCA criteria were shown in Liao et al. (2003) to render unique solution of the inverse problem if there is no noise. When NCA is employed to analyze a problem, first the identifiability property of A, i.e. criterion (i) and criterion (ii), must be checked. Then the bi-linear objective function is minimized under the constraint that the network structure (the non-zero pattern of A) is conserved.
|
| (3) |
Experimental validation of this NCA method has been carried out using a network of seven hemoglobin solutions which contain three hemoglobin components. Absorption spectra measurement was carried out the result of which was compared with that deduced theoretically using NCA, PCA and ICA methods. The analysis of Liao et al. (2003) demonstrates the superiority of NCA over the other two methods. In the same article, the NCA approach was also applied to study the dynamics of the TFA for 11 TF involved in cell-cycle regulation of Saccharomyces cerevisiae. The network topology, or connectivity between TF and genes was taken from genome wide location analysis (Lee et al., 2002) Microarray datasets pertaining to S.cerevisiae were taken from cultures synchronized by elutriation,
-factor arrest, and arrest of a cdc15 temperature sensitive mutant. The estimates of TFA can therefore be obtained using Equation (3).
It was later pointed out by Tran et al. (2005) that the bilinear optimization scheme in the NCA algorithm has two potential limitations: (1) ill-conditioned matrices generated during the iteration may make numerical algorithm unstable and suggest physically unreasonable solution; (2) there may exist multiple local minima. The Tikhonov regularization method was then applied to the optimization problem given in Equation (3) in order to overcome the first limitation, and it turned out that it was also helpful in solving the multiple solution issue. This Tikhonov regularization method is referred to as NCA-r.
The NCA approach to gene-regulatory network reconstruction is promising as evidenced by its successful applications (Kao et al., 2004, 2005). However, the limitations of NCA are only partially overcome by NCA-r, and the regularization makes NCA-r computationally much more complicated (and time consuming) than NCA whose computational complexity is already very high due to the inevitable occurrence of iterations. Therefore, there is a need to develop a fast, stable and globally convergent algorithm to reconstruct gene-regulatory networks that satisfy the three NCA criteria.
| 2 FAST NETWORK COMPONENT ANALYSIS |
|---|
|
|
|---|
2.1 Analytical solution in the ideal case
In Equation (1), since we can always re-order the rows of X and A, we assume without loss of generality that the first column of A can be partitioned as
|
| (4) |
= 0, we partition Equation (1) accordingly as
|
| (5) |
|
| (6) |
|
| (7) |
|
| (8) |
|
| (9) |
|
| (10) |
2.2 Numerical algorithm in the noisy case
The analytical solution given in Section 2.1 cannot be applied directly to the measured data Y because of noise. Noise can be coped with by singular value decomposition (SVD). Let the SVD of Y in Equation (2) be
|
| (11) |
1
2
·
K
0, and uk and vk are the left and right singular vectors. If the noise term
is small and X is not ill-conditioned, the first M singular values are contributed mainly by X, and the remaining insignificant singular values are contributed mainly by
. Therefore, we can remove the insignificant singular values to increase signal-to-noise (SNR) ratio (Scharf, 1991). Depending on the noise level, the number of singular values to be removed can be flexible, and this provides a trade-off between bias and variance (Scharf, 1991; Scharf and Tufts, 1987).
Assume that we keep the first L (
M) singular values, where L may be different from M. Let
L and
n be two diagonal matrices with diagonal entries be the first L leading singular values and the remaining K – L smaller singular values of Y, respectively, then we can write the SVD of Y in Equation (11) as
|
| (12) |
|
| (13) |
We further ortho-normalize YL and work on the left singular vectors UL rather than YL itself. Let W = UL, we have
|
| (14) |
replaced by |
| (15) |
|
| (16) |
|
| (17) |
|
| (18) |
|
| (19) |
1 and
0 are the first M – 1 leading singular values and the remaining L – M + 1 smaller singular values of Wr, respectively.
Since
is of full row rank M – 1 according to the partition Equation (15) and NCA condition (iii), an estimate of
can be
|
| (20) |
|
| (21) |
is small. Then a rank-1 EYM approximation of |
| (22) |
Now we summarize the algorithm, which will be referred to as Fast Network Component Analysis (FastNCA), since it is free of iterations and thus fast in computation.
FastNCA algorithm:
- Perform a rank-L EYM approximation of Y by SVD as
, and let W = UL.
- Estimation of A:
For i = 1, ... , M
Re-order rows of W so that the ith column of A has the structure.
Partitionconformally with this structure.
Do SVD for Wr and get the last L – M + 1 right singular vectors, denoted as a matrix V0.
Compute ãi = the first left singular vector of.
end
- Estimate the TFA matrix by S = A
YL.
For the choice of L, it has been shown by our numerical simulation experiments that in the case of random Gaussian noise we can choose L = M with good performance and superior computational efficiency. If L = M, due to the column dimension of Wr (which is M) V0 in Equation (20) becomes a 1-dimensional vector v, which is the Mth (the last) right singular vector of Wr, so that
|
| (23) |
|
| (24) |
|
| (25) |
In the synthetic and experimental study presented in Sections 3 and 4, we shall adopt L = M. Compared to the NCA algorithm described by Equation (3), FastNCA is not only much faster but also not subjected to the problems of instability and multiple local solutions. FastNCA copes with measurement noise by subspace decomposition through SVD, different from NCA.
| 3 SYNTHETIC AND EXPERIMENTAL DATA |
|---|
|
|
|---|
3.1 Synthetic data
First, synthetic data are used to demonstrate the performance of the FastNCA and its comparison to the original iterative NCA algorithm. Many issues can be controlled in the network analysis model Equation (1). We consider the following aspects: (1) the size of the network; (2) the complexity of the network; (3) the correlation between input signals, i.e. TFAs; (4) the length of the data; (5) the level of noise and (6) inaccuracies of the network topology.
Three networks with different sizes and complexity are denoted as Networks A (321 x 25, 321 genes, 25 TFs), B (400 x 25) and C (100 x 15). Figure 1 shows the in-degree distribution of these networks, which represents the complexity of a network and is defined for each gene as the number of TFs regulating this specific gene (Boscolo et al., 2005). Note that Networks A and B are similar to the example networks in Boscolo et al. (2005), and Network C is similar to an example network in Tran et al. (2005). The non-zero entries in the connectivity matrix A are randomly generated with standard Gaussian distribution.
|
Input signals are zero-mean white Gaussian with an identical pairwise correlation coefficient denoted as
. Input data with
= 0, 0.5 and 0.9 are generated to represent uncorrelated, moderately correlated and strongly uncorrelated signals, respectively. Noises are assumed to be white Gaussian, while the noise level is defined as the ratio between the power of the noise (
) and the true gene expressions (A S). A variety of noise levels, 0.05, 0.1, 0.2, 0.3 and 0.75 are used to test the robustness of the algorithms to noise.
3.2 Experimental data
As in Liao et al. (2003), here the cell-cycle regulation in S.cerevisiae is used as an example to test the applicability and assess the performance of the NCA algorithms. The data are downloaded from the website of Liao et al., http://www.ee.ucla.edu/~riccardo/NCA/nca.html, where the time-course microarray data of S.cerevisiae are from Spellman et al. (1998) and the network topology are from Lee et al. (2002). There are three experiments, based on three different kinds of synchronization. The first set of cell-cycle microarray data are taken from cultures synchronized by elutriation. It contains one cell cycle, taken every 30 min from 0 min to 390 min, 14 time points. The second set are from synchronization by
-factor arrest, two cell cycles, 18 time points, taken every 7 min from 0 min to 119 min. The third set are from synchronization by arrest of a cdc15 temperature sensitive mutant, three cell cycles, 24 time points, taken every 20 min from 10 min to 70 min, and then every 20 min to 250 min, and then at 270 and 290 min. In order to reconstruct the TF-gene network, these three sets of microarray data are then cascaded as a whole (totally 56 time points) for network component analysis. There are 11 TFs which are related to yeast cell-cycle regulation and are of interest in this study. The names of these TFs are Ace2, Fkh1, Fkh2, Mbp1, Mcm1, Ndd1, Skn7, Stb1, Swi4, Swi5 and Swi6.
We focus on these 11 TFs and their influenced genes. The original network contains 570 genes and 44 TFs. To see the relevance of the NCA criteria to the present problem, we will re-state the NCA criteria in the context of cell-cycle regulation as: (1) the TFs are structurally independent so that the structure of A supports a full-column rank matrix, (2) if a TF is removed along with all the genes regulated by the removed TF, the remaining TFs are still structurally independent and (3) the TF activities are independent. Unfortunately, the original structure of A of the original 570 x 44 network does not naturally satisfy NCA criterion (ii). However, we may trim A by removing some selected TFs along with all the genes regulated by these TFs, and get a trimmed network that satisfies the NCA criteria (i) and (ii). We are able to find a trimmed network containing 441 genes and 33 TFs, in which the above mentioned 11 cell-cycle related TFs are included. The dimension of S for this trimmed network is 33 x 56, which generically satisfies the last NCA criterion, and thus all three NCA criteria are satisfied.
The above mentioned Spellman's wild type yeast cell-cycle microarray data (Spellman et al., 1998), combined with a dataset from a mutant that lacks two forkhead TFs (Zhu et al., 2000), Fkh1 and Fkh2 synchronized by
-factor arrest, have also been processed in line with the procedure described in Yang et al. (2005). Missing values of the gene-expression data are imputed by knn (Troyanskaya et al., 2001). After knn imputation, common genes without missing data points between the two datasets were selected for network component analysis. We have adopted the same sub-network analysis as represented in Figure 2 of Yang et al. (2005) to select the four sub-networks, each containing the same set of TFs as in the corresponding sub-networks in Yang et al. (2005). For each sub-network, the network topology required by network component analysis is determined from the ChIP-chip data by Lee et al. (2002) with the P-value threshold set to 0.001. Our four sub-networks are found to contain 1247, 943, 1114 and 921 genes, respectively. Judging from the statement of Yang et al. (2005) in the caption of Figure 2, we notice that their networks contain 1110, 847, 1015 and 793 genes, respectively. Even though the corresponding number of genes of the two sets of sub-networks are not identical, it is still fruitful to compare the reconstructed TFAs obtained from these two sets of networks. The slight difference in the number of genes of the two sets of networks may be caused by possibly different versions of the ChIP-chip dataset. It can be shown that these network topologies also satisfy the NCA criteria and thus allow the application of FastNCA algorithm. The data of our four sub-networks can be downloaded from our website.
|
| 4 RESULTS |
|---|
|
|
|---|
4.1 Results on synthetic data
In this section, we assess and compare the performance of three algorithms, where the MATLAB implementation of NCA and NCA-r are obtained from the website of Liao et al., http://www.ee.ucla.edu/~riccardo/NCA/nca.html. Our FastNCA is also implemented in MATLAB. In order to assess the performance, the estimated connectivity matrix is properly scaled so that the sum of the absolute value of any column of the estimated matrix is equal to that of the corresponding column of the true connectivity matrix. The performance index is defined as the relative fitting error (in dB) of the TFAs.
We first test the performance of the algorithms under settings with a combination of a variety of noise level (0.05, 0.1, 0.2, 0.3, 0.6), signal length (K = 50, 200, 1000), and correlation between input signals (
= 0.0, 0.5, 0.9). We take 100 Monte Carlo experiments for NCA and FastNCA, but only 30 experiments for NCA-r because it is too time consuming. The mean and median of the fitting error (in dB) can then be calculated. It is found that for FastNCA and NCA-r the mean is almost equal to the median, but for NCA the mean is very much greater than the median because that the solution of NCA is not stable and may be far away from the true solution if the initial point of the algorithm is not properly selected. Therefore the median of the fitting error is used as the summary statistics to be presented in this article. Results are shown in Figure 2 for uncorrelated and highly correlated input signals (TFAs). The results for
= 0.5 is very similar to that for
= 0 and so is not shown here.
It is shown that for all three network examples in general (except some cases for NCA), the performance of all three algorithms decreases when the noise level increases or the signal length decreases. Though these algorithms do not require the input signals to be uncorrelated, in practice high correlation between input signals will degrade the performance of the algorithms. In Figure 2(b) we can see that if the input signals are strongly correlated, for each example network there are cases where the median of the fitting error NCA is unacceptably high (>10 dB). This means that the original NCA algorithm can become extremely unstable when the input signals are strongly correlated. In all cases the performance of FastNCA is at least comparable to NCA (note that if the mean of fitting error is considered, the performance of NCA will be greatly degraded) but much better than NCA-r especially when the noise level is low. In the case of noise level equal to 0.05, the performance of FastNCA is about 5 dB better than NCA-r, despite the complexity of the network and the level of input signal correlation. Only when the noise level is as high as 0.6, the performance of NCA-r becomes comparable to FastNCA. Due to the convergence problem, the performance of NCA becomes inferior to FastNCA when the input signals are highly correlated.
Now we compare computation complexity of the three algorithms using the above Monte Carlo simulations. The simulations are performed in MATLAB 7.3 installed on a Windows XP system with a 2.4G Hz CPU. The average computation time (in seconds) is shown in Table 1. It shows that in general FastNCA is several tens times faster than NCA and several hundreds times faster than NCA-r.
|
In practice, the connectivity topology of the TF-gene network is conventionally built from biological experiments, and such experimentally derived gene-TF interactions are often prone to errors. To test the robustness of the algorithms to the inaccuracy of the a prior knowledge on the network topology, in the previous set of simulations we introduce random errors in the hypothesized connectivity topology when applying the reconstruction algorithms. As in Boscolo et al. (2005), a number of TF-gene pairs in the network are randomly selected and changed to be connected if there is no connection or to be disconnected if there is already a connection. The number is equal to a percentage of the number of non-zero entries in the connectivity matrix. To simulate the topology error, such percentage is chosen to be 2, 5 and 10% in our simulation. In this simulation we take 100 Monte Carlo runs, and only FastNCA and NCA are tested since NCA-r is too time consuming. Here we just look at a representive example for Network A since it is not possible to present all the results. In the case where K = 200,
= 0.5 and noise level is 0.1, a topology error of 0% (no error), 2%, 5% and 10% will cause FastNCA a median fitting error of –20.2 dB, –19.8 dB, –19.0 dB and –17.2 dB respectively, while for NCA the fitting error are –20.2 dB, –19.9 dB, –19.0 dB, –17.1 dB, respectively. We see that small topology error does not greatly degrade the performance since a 10 percent error only degrades the performance by 3 dB. According to the median of the performance indices, there is no significant difference between FastNCA and NCA.
4.2 Results on experimental data
All the three algorithms are used to reconstruct the cell-cycle TF-gene-regulation network from the time course S.cerevisiae (yeast) microarray data. Since we are only interested in those TFs related to cell-cycle regulation, we show only the results for the 11 cell-cycle related TFAs. The results for FastNCA and NCA-r [according to Equation (27) of Tran et al., 2005] are shown in Figure 3. Bootstrap is used in Liao et al. (2003) to assess the variability of the estimates. However, it is not always justified to use bootstrap to assess the variability of an estimate (Efron and Tibshirani, 1998). Therefore, here we do not adopt this bootstrap procedure, and instead in another experiment we perform a sub-network analysis to assess the variability as in Yang et al. (2005).
|
Ren et al. (2000) carried out a genome-wide location analysis experiment for each of the 106 yeast strains that expressed epitope-tagged TFs/regulators. Then chromatin immunoprecipitation (ChIP) experiment was performed on each of these 106 tagged strain samples. Promoter (of the associated TF) regions which were enriched through the ChIP experiment could be identified by hybridization to microarrays containing a genome-wide set of yeast promoter regions (Lee et al., 2002). Such genome-wide location data were analyzed and filtered by a special error model (Hughes et al., 2000). Identification of a set of promoter regions that are bound by certain TFs led Lee et al. (2002) to predict sequence motifs being bound by the TFs, forming network motifs. Based on six types of regulating network motifs thus found, an algorithm was developed to build up a model for the yeast cell-cycle transcriptional regulatory network.
The yeast S.cerevisiae has a popular model system for cell-cycle analysis. Centrifugal elutriation can be employed to obtain cells with uniform size. Since cell size is correlated to cell-cycle phase, the elutriated cells are synchronized with respect to position in the cycle. However, even though this method introduces relatively small degree of stress to the cells, the synchrony is generally poor, showing only one cell-cycle as demonstrated in our FastNCA and NCA-r results (the samples marked I). Another method to prepare samples for cell-cycle analysis is the block-and-release method, where cells are synchronized initially by blocking them at a certain phase (e.g. the phase using the mating pheromone
-factor or using a temperature-sensitive cdc 15 mutation), and then releasing them for growth. Samples are taken after the release at different times. These two arrests methods pertain to the samples (II), (III), in the analysis of this article, respectively. Both these block-and-release type of methods give cells synchronized for two cycles, but since biochemical agents have to be used, certain artifacts might be introduced in the samples. A detailed analysis of all the related factors is needed to have a good estimate of the TFs activities involved in this or other network component analysis, but for a general check on the consequences of the Fast NCA and NCA-r studies, the following analysis based on a comparison of these consequences with the high-throughput study of Lee et al. (2002) is a useful step at this stage.
Lee et al. (2002) gave the phase(s) of the cell-cycle during which each TF is activated. The four phases are roughly divided equally to span the circle which represents one cell-cycle. The combined experimental and computational method at that time only provided the periods of activities without knowledge about waveform variations of the activities. Depending on the method of arrest preparation, not only the total cell-cycle time varies in general, the length of each phase also varies. Notably the S phase is particularly short in real time. According to Figure 3 of this article, we have measured the four phases of the first cycle pertaining to each of the three preparations (I, II, III) and have determined the relative lengths of phases which span the circle in Figure 4a–c, respectively. We have then transformed the TFA data of Lee et al. (2002) according to our scales and have drawn the arcs to represent the extends of the TFA spans of 33 specimens in Figure 4a–c. Entries to Table 2 (Panel a–c) correspond, respectively to the three methods of sample preparation for cell synchronization and arrest stated in the text above. Row three of Table 2 (Panel a–c) [i.e. (iii)] give the phase(s) during which each TF is active according to Figure 4a–c. The entry (
G1,
S) in Table 2 represents the information that the activity period covers about the second half of the G1 to the first half of the S phase. The symbol (narrow) simply emphasizes that the activity period covers only a small proportion of the related phase. To simplify the discussion, we only enter the peak position of the activity-time trace for each TF with respect to difference phases based on the result calculated using the FastNCA and NCA-r methods (i.e. Fig. 3). The term noise indicates that we cannot identify clear cyclic variation of activity within the phase domain based on established evidence. The symbol with a question mark represents an ambiguous situation of agreement. We observe from the entries in Table 2 (Panel a–c) that there are 23/33 clear agreements between the results, as represented by peak positions, calculated using FastNCA and the phase period deduced by Lee et al. (2002). The total clear agreements between rows (ii) and (iii) is only 14/33. We must remark that the above analysis only provides a rough comparison.
|
|
For the sub-network analysis, two NCA algorithms, gNCA (Tran et al., 2005) and FastNCA, are applied to estimate the TFAs for each of the four sub-networks. The result by gNCA (figure not shown here but can be found on our website) for most TFAs is very similar to the result presented in Yang et al. (2005). Figure 5 represents the such TFA plots applying the FastNCA method. When we compare this Figure 5 and Figure 3 of Yang et al. (2005), we notice that the charts of Ace2, Mbp1, Ndd1, Stb1, Swi4 and Swi5 from these two figures are basically in line. The other charts pertaining to Fkh1, Fkh2, Mcm1, Skn7 and Swi6 do not agree very well. Such a result is explainable: (1) because there is a slight difference in gene numbers of the four sub-networks used in these two studies and (2) the FastNCA and gNCA methods are different and it is expected some results are different also. Such a rough comparison however supports the statement that the two algorithms are workable in the type of analysis demonstrated and we might turn our attention in this article on the merits of the methodology introduced here. We can also see that for most plots in Figure 5, reconstructions from different sub-networks show good agreement, which suggests that the estimate by FastNCA is quite reliable.
|
| 5 DISCUSSIONS AND CONCLUSIONS |
|---|
|
|
|---|
Expression of genes are regulated by TFs which themselves are also products of the expression of some genes. Microarray technologies enable the simultaneous measurement of all mRNA transcripts in a cell and thus make it possible to reconstruct the gene regulatory network from microarray data (Gardner and Faith, 2005). The network can be approximated by a static linear multi-input multi-output (MIMO) model. Efforts using such model for gene network analysis include singular value analysis (Alter et al., 2000), independent component analysis (Lee and Batzoglou, 2003; Liebermeister, 2002), and many others (Li et al., 2007; Sabatti and James, 2006; Yu and Li, 2005). However, in general SVD and ICA do not correctly reconstruct the gene-regulatory network since the underlying assumptions cannot be satisfied (Liao et al., 2003; Chang et al., 2006, 2007). It is then noted that some a prior information can be utilized to help solve the inverse problem. The fact that in a gene-regulatory network each gene is regulated by only a limited subset of the TFs is exploited by Liao et al. (2003) to develop the network component analysis (NCA) algorithm, with the finding that such a sparse network topology may satisfy the NCA criteria required by the algorithm. The prior information on the network topology can be obtained by genome wide location analysis through ChIP experiments (Lee et al., 2002) and if such ChIP results are not available, it is still possible to infer the topology from the microarray data (Brynildsen et al., 2007).
However, the NCA formulation in Liao et al. (2003) is a bilinear least squares problem and thus is non-convex and has local solutions, and it is not uncommon to have convergence problem. Though in Tran et al. (2005) Tikhonov regularization is introduced to make the problem well conditioned, it degrades the performance, further slows down the convergence, and does not completely overcome the limitation of multiple local solutions. In contrast, as demonstrated by theoretical analysis and computer simulations, our novel FastNCA algorithm does not suffer from any convergence problem, has superior performance, and is much faster. The speed is important for large networks and for those networks where a large number of subnetworks need to be analyzed.
Application to experimental data also shows the advantage and importance of our FastNCA algorithm. Based on genome-wide location analysis, it has been found that cell-cycle transcriptional activators in one phase regulate transcriptional activators of the next phase. Not only these activators in all the four phases form a connected cyclic regulatory network, the expression levels of about 800 genes also vary in a periodic manner during the yeast cell-cycle from microarray analysis (Cho et al., 1998; Simon et al., 2001; Spellman et al., 1998). It is important to deduce, based on experimental data and mathematical analysis, the cyclic structure of the activities of the key activators over the cell-cycle (Jorgensen and Tyers, 2000). The eukaryote S.cerevisiae is a wonderful sample for such investigation because a rich amount of data has been accumulated. Deducing the peak activity of each key activator with respect to the phase position over the cycle accurately is therefore a desirable endeavor before cell-cycle regulatory network models can be cross-examined, hopefully leading to understanding the physiology of phase–phase transition over the cycle. We have demonstrated in this article, using the novel FastNCA method, that the peak activities of the key activators are quite in line with that deduced from an independent study, which provided only broad activity periods over the cycle. Further improvement of the accuracy of FastNCA, together with close examination of the properties of phase arrest over the cycles in samples prepared by different techniques (like I, II, III mentioned in this and other articles) might clarify the meaning of the peculiar activator activity variation of some of the key activators (like Skn 7). At the present stage, we can only describe some of the results from FastNCA and NCA-r as noises, as listed in Table 2 (Panel a-c) here. Just for cell-cycle analysis alone, it is of fundamental importance to know in which part of the cycle, and further, in which portion of a certain phase, that the activity of each activator peaks.
We must notice that to apply FastNCA for TFA reconstruction from microarray data, as for the original NCA algorithm in Liao et al. (2003), the three NCA criteria must be strictly satisfied. It implies that the S matrix must be of full-row rank and when we remove any one TF and its influenced genes the resulting sub-network must have a connectivity matrix of full-column rank. The FastNCA algorithm cannot account for simultaneous constraints in A and S with additional assumptions on the structure of S, as was accounted for in gNCA (Tran et al., 2005). The requirement on the rank of S has also been conditionally relaxed in Galbraith et al. (2006), which cannot be accounted for by FastNCA in its current form. How to account for the aforementioned relaxed conditions in FastNCA would be an area for further studies.
| APPENDIX |
|---|
|
|
|---|
Given a matrix H, the projection matrix PH, which projects a vector to the subspace spanned by the columns of H, is given by
|
| (A1) |
is the pseudo inverse of H, and if H is a tall matrix having full-column rank, then
|
| (A2) |
Denote by
be the projection matrix that projects a vector to the space orthogonal to the subspace spanned by the columns of H. Then, we have
|
| (A3) |
|
| (A4) |
|
| (A5) |
Furthermore, if D has full-row rank,
|
| (A6) |
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
The authors would like to thank the anonymous reviewers for their helpful comments.
Funding: This work is supported in part by the University of Hong Kong CRCG under Small Project Funding.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: John Quackenbush
Received on October 19, 2007; revised on February 25, 2008; accepted on April 8, 2008
| REFERENCES |
|---|
|
|
|---|
Alter O, et al. Singular value decomposition for genome-wide expression data processing and modeling. Proc. Natl Acad. Sci. USA (2000) 97:10101–10106.
Boscolo R, et al. A generalized framework for network component analysis. IEEE-ACM Trans. Comput. Biol. Bioinform (2005) 2:289–301.[CrossRef]
Brynildsen M, et al. Biological network mapping and source signal deduction. Bioinformatics (2007) 23:1783–1791.
Chang CQ, et al. Network component analysis for blind source separation. In: Proceedings of the 2006 International Conference on Communications, Circuits and Systems. (2006) 1. Guilin, China: IEEE (Institute of Electrical and Electronics Engineers). 323–326.
Chang CQ, et al. Fast network component analysis for gene regulation networks. In: Proceedings, 2007 IEEE International Workshop on Machine Learning for Signal Processing. (2007) Greece, Thessaloniki: IEEE (Institute of Electrical and Electronics Engineers).
Cho RJ, et al. A genome-wide transcriptional analysis of the mitotic cell-cycle. Mol. Cell (1998) 2:65–73.[CrossRef][Web of Science][Medline]
Efron B, Tibshirani R. An introduction to the bootstrap. (1998) Boca Raton, FL: Chapman and Hall/CRC.
Galbraith SJ, et al. Transcriptome network component analysis with limited microarray data. Bioinformatics (2006) 22:1886–1894.
Gardner TS, Faith JJ. Reverse-engineering transcription control networks. Phys. Life Rev (2005) 2:65–88.[CrossRef]
Golub GH, van Loan CF. Matrix Computation. (1996) 3rd. Baltimore, London: The Johns Hopkins University Press.
Hughes TR, et al. Functional discovery via a compendium of expression profiles. Cell (2000) 102:109–126.[CrossRef][Web of Science][Medline]
Jorgensen P, Tyers M. The fork'ed path to mitosis. Genome Biol (2000) 1. REVIEWS1022.
Kao KC, et al. Transcriptome-based determination of multiple transcription regulator activities in escherichia coli by using network component analysis. Proc. Natl Acad. Sci. USA (2004) 101:641–646.
Kao KC, et al. A global regulatory role of gluconeogenic genes in escherichia coli revealed by transcriptome network analysis. J. Biol. Chem (2005) 280:36079–36087.
Lee SI, Batzoglou S. Application of independent component analysis to microarrays. Genome Biol (2003) 4:R76.[CrossRef][Medline]
Lee TI, et al. Transcriptional regulatory networks in saccharomyces cerevisiae. Science (2002) 298:799–804.
Li H, et al. The discovery of transcriptional modules by a two-stage matrix decomposition approach. Bioinformatics (2007) 23:473–479.
Liao JC, et al. Network component analysis: reconstruction of regulatory signals in biological systems. Proc. Natl Acad. Sci. USA (2003) 100:15522–15527.
Liebermeister W. Linear modes of gene-expression determined by independent component analysis. Bioinformatics (2002) 18:51–60.
Ren B, et al. Genome-wide location and function of DNA binding proteins. Science (2000) 290:2306–2309.
Sabatti C, James GM. Bayesian sparse hidden components analysis for transcription regulation networks. Bioinformatics (2006) 22:739–746.
Savageau MA. Biochemical Systems Analysis: a Study of Function and Design in Molecular Biology. (1976) Reading, MA: Addison-Wesley.
Scharf LL. Statistical Signal Processing: Detection, Estimation, and Time Series Analysis. (1991) New York: Addison-Wesley.
Scharf LL, Tufts DW. Rank reduction for modeling stationary signals. IEEE Trans. Acoust. Speech Processing (1987) ASSP-35:350–355.[CrossRef]
Simon I, et al. Serial regulation of transcriptional regulators in the yeast cell-cycle. Cell (2001) 106:697–708.[CrossRef][Web of Science][Medline]
Spellman PT, et al. Comprehensive identification of cell-cycle-regulated genes of the yeast saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell (1998) 9:3273–3297.
Toone WM, et al. Getting started: regulating the initiation of DNA replication in yeast. Annu. Rev. Microbiol (1997) 51:125–149.[CrossRef][Web of Science][Medline]
Torrest NV, Voit EO. Pathway Analysis and Optimization in Metabolic Engineering. (2002) New York: Cambridge University Press.
Tran LM, et al. gnca: a framework for determining transcription factor activity based on transcriptome: identifiability and numerical implementation. Metab. Eng (2005) 7:128–141.[CrossRef][Web of Science][Medline]
Troyanskaya O, et al. Missing value estimation methods for DNA microarrays. Bioinformatics (2001) 17:520–525.
Voit EO, Almeida J. Decoupling dynamic systems for pathway identification from metabolic profiles. Bioinformatics (2004) 20:1670–1681.
Yang YL, et al. Inferring yeast cell-cycle regulators and interactions using transcription factor activities. BMC Genomics (2005) 6. Article 90.
Yu TW, Li KC. Inference of transcriptional regulatory network by two-stage constrained space factor analysis. Bioinformatics (2005) 21:4033–4038.
Zhu G, et al. Two yeast forkhead genes regulate the cell-cycle and pseudohyphal growth. Nature (2000) 406:90–94.[CrossRef][Medline]
This article has been cited by other articles:
![]() |
W.-P. Lee and W.-S. Tzou Computational methods for discovering gene networks from expression data Brief Bioinform, July 1, 2009; 10(4): 408 - 423. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||





