Bioinformatics Advance Access originally published online on December 7, 2004
Bioinformatics 2005 21(8):1538-1541; doi:10.1093/bioinformatics/bti197
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Characterizing the dynamic connectivity between genes by variable parameter regression and Kalman filtering based on temporal gene expression data
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences Beijing 100080, People's Republic of China
*To whom correspondence should be addressed.
| Abstract |
|---|
|
|
|---|
Motivation: One popular method for analyzing functional connectivity between genes is to cluster genes with similar expression profiles. The most popular metrics measuring the similarity (or dissimilarity) among genes include Pearson's correlation, linear regression coefficient and Euclidean distance. As these metrics only give some constant values, they can only depict a stationary connectivity between genes. However, the functional connectivity between genes usually changes with time. Here, we introduce a novel insight for characterizing the relationship between genes and find out a proper mathematical model, variable parameter regression and Kalman filtering to model it.
Results: We applied our algorithm to some simulated data and two pairs of real gene expression data. The changes of connectivity in simulated data are closely identical with the truth and the results of two pairs of gene expression data show that our method has successfully demonstrated the dynamic connectivity between genes.
Contact: jiangtz{at}nlpr.ia.ac.cn
| INTRODUCTION |
|---|
|
|
|---|
With the ability to simultaneously measure the activity of thousands of genes under different conditions (Iyer et al., 1999; Eisen et al., 1998; Cho et al., 1998; Spellman et al., 1998; Bozdech et al., 2003), DNA microarray technology has attracted tremendous interest in both the scientific community and industry during the past several years. This has led to a dramatic increase in microarray data and reliable and efficient tools are needed urgently to mine useful information from these data. One of the applications of microarray technology is to characterize the functional connectivity between genes. A basic assumption of this application is that genes with similar expression profiles have similar functions in cells. The most popular metrics used to evaluate the similarity (or dissimilarity) between gene expression profiles may be Pearson's correlation (Eisen et al., 1998). Linear regression coefficient and Euclidean distance are two metrics very similar to Pearson's correlation.
One of the main limitations of these metrics is that their values are constant and stationary. However, for many gene time-series expression profiles, the connectivity between genes is variable and dynamic. Hence, constant and stationary metrics cannot always characterize the variable and dynamic connectivity between genes. So far, there has been no study on this dynamic relationship. We believe that variable parameter regression is an appropriate tool for characterizing this time-dependent correlation relationship. It happened that Buchel and Friston (1998) used variable parameter regression and Kalman filtering to characterize the dynamic relationship between two fMRI signals. We believe that they can also be used to model the dynamic relationship between genes, although our problem is very different from that studied by Buchel and Friston (1998). This idea was tested on some simulated data and real gene expression data. All the results demonstrate that this method can detect successfully the changes of connectivity between genes (or other signals).
| METHODS |
|---|
|
|
|---|
Materials
In this paper, we apply our algorithm to a simulated dataset and some real data. As shown in Figure 1, we generated two simulated signals x (a) and y (b). Both signals have 286 points along the time line. Signal x has six half-sine curves and the content between any two half-sine curves is Gaussian noise. Signal y is similar to signal x. The main difference between x and y is that the half-sine curves in signal y added uniformly distributed noise. We also selected four similar gene expression profiles from the dataset of Cho et al. (1998) and grouped them into two pairs randomly. One pair is YNL309w and YML060w, as shown in Figure 2. Figure 3 shows another pair, YDL164c and YLR383w. Cho et al. collected cells at 17 time points at 10 min intervals, covering nearly two full cell cycles. The time course was divided into five phases: early G1, late G1, S, G2 and M based on the size of the buds. In order to weaken the effect of system error, we first normalized the raw dataset of Cho et al. such that the mean is 0 and the variance is 1.
|
|
|
Variable parameter regression
Variable parameter regression can be described as follows:
![]() | (1) |
![]() | (2) |
standard deviation. As described in Buchel and Friston (1998) the dynamic evolution of ß over time is assumed to follow the following equations:
![]() | (3) |
![]() | (4) |
2Q is the stationary covariance matrix of the innovation pt. From Equation (4) we can see that if Q = 0, then parameter ßt does not change along time and the variable parameter regression reduces to the stationary coefficient linear regression problem. We can see that Equation (3) is in fact a random walk model for ßt. The innovations ut and pt are uncorrelated.
Parameter estimation using Kalman filtering
Given two gene expression profiles x1, ..., xT and y1, ..., yT, we are interested in the corresponding regression coefficients ß1, ..., ßT. In this paper, we use Kalman filtering to estimate these regression coefficients. Kalman filtering is a recursive solution to the optimal linear filtering problem. We define
to be the prior estimate of regression coefficient at time t given knowledge of the process prior to time t, and
to be the posteriori estimate of regression coefficient at time t given the expression value of y at time t. Let
be the prior estimate error variance and Pt be the posteriori estimate error variance. We define Kt to be the gain that minimizes the posteriori error variance. Then the first step of Kalman filtering is to obtain the prediction that updates
and its error variance for the passage of time t 1 to t:
![]() | (5) |
![]() | (6) |
![]() | (7) |
![]() | (8) |
![]() | (9) |
be the smoothed estimate of ßt, and then the third step can be depicted as follows:
![]() | (10) |
is set to
. Then Equation (10) is also a recursive process and
can be solved by this process. We take
as the final estimate of ßt. From the process of Kalman filtering, we can see that the regression coefficients are determined not only by y/x but also by its historical information. Then ß changes, dependent on the time-dependent correlation relationship between genes. Therefore, ß can characterize the dynamic connectivity between genes with time. | RESULTS |
|---|
|
|
|---|
We first applied our algorithm to the simulated data. From the simulated data shown in Figure 1, we can see that signals x and y are strongly correlated (connectivity or strong connectivity) at corresponding sine regions and weakly correlated (non-connectivity or weak connectivity) at the random noise regions. The result on the simulated data is shown in Figure 4. From Figure 4, we can see that the regression coefficients change dynamically along the time axis. This regression coefficient curve is perfectly consistent with the curves of signal x and signal y. The peaks of this curve correspond to the sine regions and the valleys of this curve correspond to the Gaussian noise regions. This means that the sine regions are more correlated than the noise regions. From this result, we can see that our algorithm depicts dynamic connectivity between simulated signals very well.
|
Subsequently, we applied our algorithm to two pairs of genes selected from the dataset of Cho et al. Figure 5 shows the result of genes YNL309w and YML060w. From Figure 5, we can see that the regression coefficients (connectivity) between YNL309w and YML060w have two peaks, which means that the two genes profiles are more time-dependent near these peaks and then have strong connectivity near these peaks. According to Cho et al.'s information, we mapped the time points of these two peaks back to the cell cycles and then deduced that these peaks were near G and S phases. This means that YNL309w and YML060w are more correlated near G and S phases and interact with each other during G and S phases with a high probability. YNL309w takes part in the process of G1/S transition in the mitotic cell cycle. YML060w takes part in the processes of DNA repair and base-excision repair, which are also strongly related to G1 and S phases. The result of genes YDL164c and YLR383w is shown in Figure 6. Two peaks located near G and S phases means YDL164c and YLR383w have strong connectivity during G and S phases. YDL164c takes part in the processes of DNA ligation, DNA recombination, base-excision repair, lagging strand elongation and nucleotide-excision repair. YLR383w takes part in the processes of DNA repair and cell proliferation. These processes take place mainly during G and S phases. These results mean that our algorithm successfully shows the dynamic connectivity between these pairs of genes.
|
|
| CONCLUSIONS AND DISCUSSIONS |
|---|
|
|
|---|
Main contribution of the current work is that we introduce a novel insight for characterizing the relationship between genes and suggest a proper mathematical tool to model it. The results have demonstrated that this technique successfully assesses the dynamic connectivity between two signals on both simulated data and real data. Connectivity can be regarded as the influence that one signal (or gene) exerts over another. Dynamic connectivity depicts the changes of connectivity strength. Moreover, some detailed information about these genes supports our results well.
Apart from these advantages of our method, there are still some limitations. First, an implicit assumption of gene expression data analysis is that genes with similar expression profiles have similar functions in cells. However, this assumption is not always right (Zhou et al., 2002). Almost all gene expression data analysis methods have this limitation. Second, we assumed that the connectivity between two genes is linear, which may not be true. Third, the time points of gene expression profiles are too small; therefore, more time points are needed in order to get better results. Then, if we are interested in some particular genes, we can use real-time quantitative PCR (RTQPCR) to analyse these genes for more time points. The result of RTPCR data for more time points will demonstrate the dynamic connectivity better.
| Acknowledgments |
|---|
We thank Meng Liang and Chaozhe Zhu for some valuable discussions. We are grateful to Dr Elizabeth E. Budy for some language corrections. This work was partially supported by the Natural Science Foundation of China, Grant nos 30425004 and 60121302, and the National Key Basic Research and Development Program (973) Grant no. 2004CB318107.
Received on July 28, 2004; revised on October 12, 2004; accepted on November 29, 2004
| REFERENCES |
|---|
|
|
|---|
Bozdech, Z., Llinas, M., Pulliam, B.L., Wong, E.D., Zhu, J., De Risi, J.L. (2003) The transcriptome of the intraerythrocytic developmental cycle of plasmodium falciparum. PLoS Biol., 1, E5[Medline].
Buchel, C. and Friston, F.K. (1998) Dynamic changes of effective connectivity characterized by variable parameter regression and Kalman filtering. Hum. Brain. Mapp., 6, 403408[CrossRef][Web of Science][Medline].
Cho, R.J., Campbell, M.J., Winzeler, E.A., Steinmetz, L., Conway, A., Wodicka, L., Wolfsberg, T.G., Gabrielian, A.E., Landsman, D., Lockhart, D.J., Davis, R.W. (1998) A genome-wide transcriptional analysis of the mitotic cell cycle. Mol. Cell, 2, 6573[CrossRef][Web of Science][Medline].
Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA, 95, 1486314868
Iyer, V.R., Eisen, M.B., Ross, D.T., Schuler, G., Moore, T., Lee, J.C.F., Trent, J.M., Staudt, L.M., Hudson, J.J., Boguski, M.S., et al. (1999) The transcriptional program in the response of human fibroblasts to serum. Science, 283, 8387
Spellman, P.T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Anders, K., Eisen, M.B., Brown, P.O., Botstein, D., Futcher, B. (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell, 9, 32733297
Zhou, X., Kao, M.C., Wong, W.H. (2002) Transitive functional annotation by shortest-path analysis of gene expression data. Proc. Natl Acad. Sci. USA, 99, 1278312788
This article has been cited by other articles:
![]() |
J. Z. Kelemen, A. Kertesz-Farkas, A. Kocsor, and L. G. Puskas Kalman filtering for disease-state estimation from microarray data Bioinformatics, December 15, 2006; 22(24): 3047 - 3053. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

t), t = 0 : 0.04 : 1. Every segment of noise has 26 time points and sampled from a Gaussian distribution with mean 0 and standard variance 0.4. (b) Signal y is constructed by the way similar to that of signal x. The main difference is the six segments of half-sine waves of y are corrupted by five segments of additive uniform distributed noise in the interval [0, 0.4]. Then, there are 286 time points all together in signal x and signal y.














