Bioinformatics Advance Access originally published online on January 24, 2006
Bioinformatics 2006 22(8):959-966; doi:10.1093/bioinformatics/btl017
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Polynomial model approach for resynchronization analysis of cell-cycle gene expression data
1 Department of Electrical and Computer Engineering, University of Maryland College Park, USA
2 Department of Electrical and Computer Engineering, University of British Columbia Canada
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Identification of genes expressed in a cell-cycle-specific periodical manner is of great interest to understand cyclic systems which play a critical role in many biological processes. However, identification of cell-cycle regulated genes by raw microarray gene expression data directly is complicated by the factor of synchronization loss, thus remains a challenging problem. Decomposing the expression measurements and extracting synchronized expression will allow to better represent the single-cell behavior and improve the accuracy in identifying periodically expressed genes.
Results: In this paper, we propose a resynchronization-based algorithm for identifying cell-cycle-related genes. We introduce a synchronization loss model by modeling the gene expression measurements as a superposition of different cell populations growing at different rates. The underlying expression profile is then reconstructed through resynchronization and is further fitted to the measurements in order to identify periodically expressed genes. Results from both simulations and real mircorarray data show that the proposed scheme is promising for identifying cyclic genes and revealing underlying gene expression profiles.
Availability: Contact the authors.
Contact: qiupeng{at}umd.edu
Supplementary information: Supplementary data are available at: http://dsplab.eng.umd.edu/~genomics/syn/
| 1 INTRODUCTION |
|---|
|
|
|---|
A central goal of molecular biology is to use genetic data in order to understand the fundamental cyclic systems, such as regulatory network in yeast cell-cycle (Lee et al., 2002). Recent advances in high-throughput gene expression data acquisition technologies, such as microarrays, provide a rich opportunity for achieving this goal (Moor, 2001). The first critical task in understanding such cyclic systems is to identify the related genes which are periodically expressed during the cell-cycle. In the current technologies, most expression data are measured based on a population of cells which are synchronized to exhibit similar behaviors (Spellman et al., 1998). However, even with the most advanced synchronization method, maintaining a tight synchronization population even over a couple of cycles is a challenging research issue, since continuous synchronization loss is gradually observed owing to the diversity of individual cell growth rates (Shedden and Cooper, 2002). Therefore, in addition to the noise effect on the measurements, a significant difficulty in identifying cell-cycle regulated genes by analyzing microarray gene expression data arises from synchronization loss. Direct periodicity testing on the expression measurements could be misleading or fail due to the fact that the expression values measured are contributed by a mixed cell populations growing at different rates.
Several approaches for identifying cell-cycle regulated genes, when taking into consideration the issue of synchronization loss, have been proposed in the literature. They can be divided into two major categories, differentiated by the absence or presence of other complementary information besides gene expression data. Most studies in the literature belong to the former category, which relies solely on expression data. Fourier analysis algorithm was employed for synchronization test in Shedden and Cooper (2002), Johansson et al. (2003) and Whitfield et al. (2002). The authors presented an exact statistical test to identify periodically expressed genes by distinguishing periodicity from random processes in Wichert et al. (2004). In Lu et al. (2004), a periodic-normal mixture (PNM) model was proposed to fit transcription profiles of periodically expressed (PE) genes and a principled statistical estimation approach was developed for estimating the periodicity of gene expressions. In the second category, an algorithm combining budding index and gene expression data was proposed recently to deconvolve expression profiles in Bar-Joseph et al. (2004). Regardless of these developments, efforts are still needed to accurately identify cyclic genes and recover a more accurate gene profile compared with the current expression measurements.
The goal of this paper is to develop an efficient scheme for identifying periodically expressed genes and reconstructing the underlying gene expression profiles by estimating the effects of synchronization loss. The main contributions of this paper are 2-fold.
- We propose a synchronization loss model by representing the gene expression measurements as a superposition of different cell populations growing at different rates, because the model can mimic the synchronization loss observed in microarray experiments and is easy to implement. Also, we develop a model-based estimation algorithm to reconstruct the underlying single-cell gene expression profiles. In previous studies, the single-cell expression profile is often assumed to be sinusoids. However, the proposed algorithm does not require that assumption. It is able to handle a much larger variety of single-cell expression profiles.
- Using the fitting residue error as criteria, we explore a supervised learning scheme for identifying the cell-cycle regulated genes. The performance of the proposed scheme are examined via both simulations and real microarray gene expression data of Saccharomyces cerevisiae.
The organization of the rest paper is as follows. We start by introducing a synchronization loss model and our formulation. After that, a cyclic gene identification scheme is proposed. In Sections 4 and 5, the proposed scheme is examined and compared with two previous studies. From the results, we conclude that the proposed scheme is promising in improving quality of gene expression time series data.
| 2 SYSTEM MODEL AND FORMULATION |
|---|
|
|
|---|
2.1 A mixture model for synchronization loss
Even with the best synchronization method currently available, cells begin to lose their synchronization in a short time. Therefore, we propose to model the observed gene expression data as a superposition from a mixed population of cells growing at slightly different rates as
![]() | (1) |
m represents the relative growth rates of cells with respect to standard cell-cycle; ßm represents the percentage of cells with a growth rate
m, and it is assumed to be constant in one series of measurements. Although
m can take continuous values in experiment, due to the limited size of microarray data,
m is approximated by N + 1 components. Because of the different growth rates in experiment cell population, even if a gene is cell-cycle regulated, the measured expression may not exhibit clear periodicity. Therefore, it is difficult to accurately detect periodically expressed genes and distinguish them from non-periodically expressed genes based on the noisy microarray data.
Note that in Equation (1), for gene i, from the underlying expression profile xi(t) to the observation yi(t), the distortion is dictated by ßm and
m, which describes the synchronization loss status of the whole cell populations. We propose to utilize some common properties of all genes to extract underlying expression profiles from the observations.
2.2 An inverse model for synchronization loss
Since the underlying expression profile xi(t) is unknown, we propose to re-write Equation (1) into the following form,
![]() | (2) |
It is worth mentioning that the parameters ams and cms depend solely on ßms and
ms. They are common constants for all genes. Thus, we propose to utilize this common property of all genes to extract underlying expression profiles.
Equations (1) and (2) are not mathematically equivalent in general. However, if xi(t) is polynomial, Equations (1) and (2) can be equivalently represented. In this paper, we are particularly interested in the case of polynomials, since polynomial is a common tool for data fitting (Stoer and Bulirsch, 1991). As shown in the literature, polynomials are often successfully used to fit the time-series gene expression data (Bar-Joseph et al., 2004).
Suppose xi(t) is a polynomial of order K such that
![]() | (3) |
![]() | (4) |
![]() | (5) |
![]() | (6) |
![]() | (7) |
![]() | (8) |
In the above argument, the underlying expression xi(t) does not assume periodicity. However, in this study, the most interested expression signal is cell-cycle regulated, i.e. periodic,
![]() | (9) |
m is not large. With cm carefully chosen, although periodic, the above argument holds for most of the cell-cycle data. In the following section, we will demonstrate that under such a periodic condition Equation (2) is a fine approximation of the inverse of Equation (1).
2.3 Formulation for estimating ams
For cell-cycle regulated genes, because of periodicity, xi(t) = xi(t + T), from Equation (2), the observations and parameters ams and cms are related as follows:
![]() | (10) |
|
| (11) |
|
| (12) |
![]() | (13) |
m is not large. In later simulations, we will show that, it is accurate enough to choose cm to cover the range from 0.6 to 1.4. The fixed-summation constraint in Equation (13) is chosen to avoid the trivial 0-vector solution, i.e. a = 0.
2.4 Fitting residue criterion
After estimating ams, the model in (2) is used to reconstruct the underlying periodical component xi(t) for every gene. In order to detect cell-cycle regulated genes, a criterion is needed to justify the question whether the extracted signal is the underlying periodical expression profile of a cell-cycle regulated gene, or it is the periodical component from a non-cell-cycle regulated gene. We propose a criteria based on the model in (1), using the extracted periodical signal to fit the observations. The fitting residue will serve as the criterion in detecting cell-cycle regulated genes. For a particular gene, if the fitting residue is sufficiently small, compared with a threshold, then the extracted signal could lead to the measurements owing to synchronization loss, which means the gene is highly likely to be cell-cycle regulated. On the other hand, if the fitting residue is large, then the extracted periodical signal is not closely related to experimental observation, which means the gene is more likely to be non-cell-cycle regulated. In the proposed identification scheme, the threshold of fitting residue is dynamically determined during iterations. Details are described in Sections 3 and 5.
| 3 THE IDENTIFICATION SCHEME |
|---|
|
|
|---|
Based on the synchronization loss model and estimation approach described in Section 2, we further proceed to identify the cyclic genes. The scheme described in this section is a supervised learning scheme, since it requires an initial training set which consists of cell-cycle regulated genes previously identified by traditional biology experiments. Specifically, we propose an iterative framework to purify the training set and detect cyclic genes simultaneously. The main steps in the proposed iterative framework is described as follows:
- Define initial training set as cell-cycle regulated genes previously identified by traditional methods.
- Apply the proposed model on training set to estimate the parameters ams, and extract the underlying periodical signal for every gene in the training set.
- Based on the extracted signal xi(t), fit it to the observation model in (1). According to the fitting residue criterion, remove some non-periodically expressed genes from the training set. Then, re-estimate the parameters ams using the training set and use the estimated ams to extract periodical signal for every gene in the testing set.
- According to the fitting residue criterion, include some periodically expression genes into the training set. Then go back to Step 2.
Note that, under this framework, in order to purify the training set and detect the periodically expressed genes correctly, the criteria for removing and including genes in Steps 2 and 4 should be carefully designed and fine tuned for each dataset. The ultimate goal of Steps 2 and 4 is to find a set of cyclic genes as prior knowledge, such that the cyclic genes identified by the proposed scheme do not violate the prior knowledge, or maximally support the prior knowledge. It is a difficult optimization problem with numerous possible solutions. The proposed scheme, although heuristic in updating the training set, yields satisfactory results as will be demonstrated in what follows.
| 4 SIMULATIONS |
|---|
|
|
|---|
In this section, we simulate time-series expression data with synchronization loss for both periodically expressed genes and non-periodically expressed genes. The proposed method is used to resynchronize the simulated data and identify periodically expressed genes. To evaluate the performance of the proposed method, we compare it with the methods studied in Spellman et al. (1998) and Lu et al. (2004). We also perform sensitivity analysis to examine the robustness of the proposed method.
4.1 Simulations based on sinusoids
In this subsection, we simulate time-series expression data for 100 periodically expressed genes and 600 non-periodically expressed genes. The underlying single-cell periodical expression profile for cyclic gene i is generated by a linear combination of four sinusoids with random phases,
![]() | (14) |
ij is randomly chosen, different for each gene.
ij represents the random phase, which is uniformly distributed on [0,2
). For the 600 non-cyclic genes, their underlying expressions are obtained through random permutations of expressions of cyclic genes.
For each gene, we simulate the synchronization loss by
![]() | (15) |
In the simulation, 50 cyclic genes are assumed known, in order to form the initial training set. The testing set contains the left 650 genes. For a particular choice of cm, by applying the proposed model, am parameters are estimated, the underlying periodical signals for all genes are extracted, and the fitting residue criterion is examined.
The parameter M is set to be M = 7. As mentioned in Section 2.2, we need to choose M to be larger than or equal to K. Since the exact value of K does not affect the proposed method as long as K
M, therefore, with M = 7, the proposed method can handle all polynomials with K
7. And we know, that the seventh order polynomials can generate a large variety of curves, with up to six peaks and valleys. We believe the current parameters-setting can sufficiently model gene expression profiles.
As mentioned earlier, cm should be chosen properly, in order to extract underlying expression profiles accurately. In Table 1, different choices of cm are examined. To ensure a fair comparison, with M set to be 7, the values of cm are chosen to be uniformly spaced in tested range. In the fitting residue criterion,
m is set to be [0.7, 0.8, ... , 1.3]. From Table 1, we can see that, different choice of cm leads to different fitting residues for both cyclic genes and non-cyclic genes. As the range of cm increases, the fitting residues for cyclic genes tend to decrease first, and then increase. This observation can be intuitively explained by the trade-off between errors owing to the model-complexity and the data size. In one hand, from the implication of FIR and IIR filters, owing to the larger range of cm considered, the truncation error will be smaller. However, if the range of cm is too large, because of the limited size of time series data, the number of available time points n in Equation (11) will be small. Less training data will cause the fitting residues to increase. Therefore, based on Table 1, we choose the range of cm to be [0.6, 1.4], since with this choice the average fitting residue for cyclic genes is small and the difference between cyclic genes and non-cyclic genes is large, resulting in a good detection performance.
|
After determining the choice of cms, the proposed model is applied to estimate parameters ams based on the training set, and extract the underlying periodical signals for genes in the training set. Figure 1 gives a typical example of genes in the training set. Although there is clear difference between the underlying periodical expression and the simulated observation, based on the proposed method, the extracted expression is quite similar to the underlying periodical expression.
|
Based on the am and cm parameters, the proposed model is applied to extract periodical signal components for all genes, and the fitting residue criterion is examined. In Figure 2a, the histogram of all genes fitting residues is shown, where the shaded part corresponds to the 100 cyclic genes. We can see that the cyclic genes have smaller fitting residues, while non-cyclic genes yield larger fitting residues statistically. Therefore, this clear separation between these two groups of genes leads to the accurate identifications of cyclic genes.
|
In order to examine the identification performance of the proposed method, we compare it with two previous works, Spellman et al. (1998) and Lu et al. (2004), by applying them to the same simulated time series data. In Spellman et al. (1998), Fourier analysis is applied to calculate the energy of the periodical components for each gene. The energy serves as a metric to identify cyclic genes. From Figure 2b, we can see that, this method can identify cyclic genes with small outage. However, its performance is worse than that of the proposed method. In Lu et al. (2004), a PNM model is proposed, where a probabilistic (Gaussian) distribution and Fourier analysis are combined to model the synchronization loss. Before identifying cyclic genes, the parameters of the Gaussian distribution have to be estimated. In our implementation, we skip the parameter estimation step by feeding the actual parameter values into the PNM model. Therefore, Figure 2c shows the performance upper method in Lu et al. (2004), which is close to that of the proposed method. However, it is worth mentioning that the PNM-based method is admittedly sensitive to the parameter estimates of the Gaussian distribution.
In Table 2, we present the results in Figure 2 in a more quantitative fashion. We employ the NeymanPearson framework in detection theory (Poor, 1994). During comparison, we fix the probability of correctly detecting cyclic genes and examine the probability of false positive of different methods. That is, under the condition that certain amount of cyclic genes are correctly detected, how many non-cyclic genes will be falsely detected as cyclic. From Table 2, we can see that, when fixing the probability of detection, the proposed method has much less false positives, compared with the two previous studies.
|
In this subsection, the time series are simulated with the underlying signal xi(t) being sinusoids. Together with the fact that Fourier analysis is employed, both previous works have nice performance in identifying cyclic genes. However, if the underlying signal is based on polynomials, the result could be different.
4.2 Simulation based on polynomials
In this subsection, we simulate time-series expression data based on polynomial models. Again, 100 cyclic genes and 600 non-cyclic genes are simulated. The underlying periodical expression profile for cyclic gene i is generated by polynomials of order K = 6,
![]() | (16) |
For each gene, we simulate the synchronization loss by Equation (15). All parameters are set to be the same as previous subsection. A total of 50 cyclic genes are assumed known, forming the training set. For a particular choice of cm, by applying the proposed model, am parameters are estimated based on the training set, the underlying periodical signals for all genes are extracted, and the fitting residue criterion is examined. Again, M is set to be 7, and different choices of cm are examined. From Table 3, similar result is observed. We choose the range of cm to be [0.6, 1.4], because the average fitting residue for cyclic genes is small, and the difference between cyclic genes and non-cyclic genes is large.
|
Figure 3 is a typical example of genes in the training set. We can see that the simulated observations is quite different from the underlying periodical expression profile. Owing to synchronization loss, the observed time-series does not exhibit a clear periodicity, especially in the second cycle. From poorly synchronized observations, the proposed method can successfully recover the underlying periodical expression profile.
|
Based on the estimates of ams and cms, the proposed model is applied to extract periodical signal components for all genes, and the fitting residue criterion is examined. In Figure 4 a the histogram of residues shows that the cyclic genes and non-cyclic genes are well separated, meaning that the proposed method can successfully identify the cyclic genes.
|
The methods in Spellman et al. (1998) and Lu et al. (2004) are also examined in this subsection, with results shown in Figure 4b and 4c. From these figures, we note that both previous methods failed to separate cyclic and non-cyclic genes in the case that underlying expression profiles being polynomials. Similar to the previous subsection, the result is shown in a more quantitative way, in Table 4. As it is easy to see, the proposed method outperforms previous studies in the simulation based on polynomials. It is encouraging to see that the proposed method works well for a much larger variety of the underlying single-cell expressions.
|
4.3 Sensitivity analysis
In our discussions so far, the standard cell-cycle duration T is assumed to be known as a prior knowledge. However, the cell-cycle duration may vary because of various environmental and experimental factors. In this subsection, we examine the performance of the proposed method when inexact prior knowledge of the cell-cycle duration T is considered.
The sensitivity analysis is conducted based on the simulated data by sinusoids. In the simulated data, the true cell-cycle length is T = 60. However, when applying the proposed method, we do not know the cell-cycle length exactly as prior knowledge. In Figure 5, we can see that, when the prior knowledge is inexact, the separation of fitting residues between cyclic and non-cyclic genes is not affected much. In Table 5, we quantitatively examine the sensitivity of the proposed method in terms of probability of detection and false positive. In Table 5, each row corresponds to a certain requirement of probability of detection; each column corresponds to a case where certain value of T is taken as prior knowledge; and each element is the probability of false positive. From this table, as long as we do not require probability of detection to be extremely high (i.e. 100%), only when the prior knowledge is significantly different from the truth (i.e. the prior T
40 or T
70), will the performance degrade severely. This simulation result demonstrates the robustness of the proposed method with respect to the cell-cycle duration.
|
|
| 5 REAL DATASETS |
|---|
|
|
|---|
In this study, three real datasets are investigated, alpha, cdc15 in (Spellman et al., 1998) and cdc28 in (Cho et al., 1998). From Spellman et al. (1998), 93 cell-cycle regulated genes previously identified by traditional methods are selected as initial training set. Since there is no guarantee that all those 93 genes will behave periodically in a particular experiment, we employ the iterative framework to purify the training set and identify cyclic genes simultaneously. During each iteration, we adopt simple removing and including criterion in Steps 2 and 4. In Step 2, the size of training set is reduced to half in order to purify the training set. In Step 4, 200 genes with smallest fitting residues are included into the training set. In this way, we hope to purify the training set. As mentioned before, the ultimate goal of Steps 2 and 4 is to find a set of cyclic genes as prior knowledge, such that the identified cyclic genes identified by the proposed method do not violate the prior knowledge, or maximally support the prior knowledge. It is a quite difficult optimization problem, with numerous possible solutions. However, with the simple criterion we adopted, this goal can be achieved within several iterations (510). As results, the histograms of fitting residues for the alfa, cdc15 and cdc28 datasets are shown in Figure 6, where the identified cyclic genes in the training set have small fitting residues.
|
To make a fair comparison with Spellman et al. (1998) and Lu et al. (2004), 800 genes with smallest fitting residues are identified as cyclic genes. In Figure 7, a Venn diagram showing the overlap of genes identified by different studies. The proposed method and Spellman et al. (1998) is 403; the intersection between proposed method and Lu et al. (2004) is 433; the intersection between Spellman et al. (1998) and Lu et al. (2004) is 541; the intersection among all three studies is 355. It is encouraging to see the large overlaps illustrated in Figure 7, an indication of consistency of the proposed method to the previous researches. In Supplementary Material, we show some examples of genes identified by both the proposed method, (Spellman et al., 1998), and traditional experimental methods. Both the observed expression and extracted expression are shown. We can see that, for the cyclic genes that already exhibit periodical expression, the extracted expression is closed to experiment observed expression. And for the cyclic genes that do not exhibit periodical expression, the proposed method can recover the periodicity.
|
Although the genes identified by the proposed method have large overlap with those of the previous studies, it is interesting to examine the non-overlapping genes identified by the proposed method, but not identified in the previous studies, neither Spellman et al. (1998) nor Lu et al. (2004). In the Supplementary Material, some examples are shown. Since both previous studies relied on Fourier analysis, genes without clear periodicity may not be identified. However, the proposed method may be able to identify them, because synchronization loss is estimated and recovered. We need to further investigate the genes identified by the proposed method only, and to validate the identified genes through biology experiments or previous biology knowledge. One possible validation method is to validate the biological relevance of such identified cell-cycle genes by semantic analysis based on the Gene Ontology (GO) terms. To achieve this purpose, an online tool is applied, the SGD GO Term Finder (http://db.yeastgenome.org/cgi-bin/GO/goTermFinder). We analyzed the set of non-overlapping genes which are identified by one method, but not by the other two methods. The top GO terms associated with each methods results can be found in the Supplementary Material. For the proposed method, in the top 25 GO terms, there are several cell-cycle related terms, such as M phase, cell-cycle, mitotic cell cycle and M phase of mitotic cell cycle, including 84 genes. It suggests that some genes identified by the proposed method but not by the other two methods are cell-cycle related. For the sets of non-overlapping genes identified by the two reference methods, it is noted that none of the above four cell-cycle related GO terms appears in the top 25 GO terms. Details of top GO terms associated with results of each method can be found in the Supplementary Materials. Theseencouraging observations demonstrate that the proposed method is promising for identifying cyclic genes.
| 6 CONCLUSION |
|---|
|
|
|---|
Synchronization loss is a major concern in identifying cyclic genes to understand the fundamental cyclic systems. We developed a model-based framework for identifying cell-cycle regulated genes through resynchronization and reconstructing the underlying gene expression profiles, which representing a single-cell behavior more accurately. We consider a simple synchronization loss model where the gene expression measurements are regarded as superposition of mixed cell populations with different growth rates. The proposed scheme is shown feasible, promising and robust via simulations. Results from real mircoarray data analysis reveal that the reconstructed profiles represent a more accurate expression profiles and improve our ability to identify cyclic genes. We will further investigate the proposed method by combining complementary information such as budding index.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: David Rocke
Received on October 31, 2006; revised on January 19, 2006; accepted on January 21, 2006
| REFERENCES |
|---|
|
|
|---|
Bar-Joseph, Z., et al. (2004) Deconvolving cell cycle expression data with complementary information. Bioinformatics, 20, Suppl. 1, i23i30[Abstract].
Cho, R.J., et al. (1998) A genome-wide transcriptional analysis of the mitotic cell cycle. Mol. Cell, 2, 6573[CrossRef][ISI][Medline].
Johansson, D., et al. (2003) A multivariate approach applied to microarray data for identification of genes with cell cycle-coupled transcription. Bioinformatics, 19, 467473
Lee, T.I., et al. (2002) Transcriptional regulatory networks in Saccharomyces cerevisiae. Science, 298, 799804
Lu, X., et al. (2004) Statistical resynchronization and Bayesian detection of peroidically expressed genes. Nucleic Acids Res, . 32, 447455
Moore, S. (2001) Making chips to probe genes: biotechnology". IEEE Spectrum, . 38, 5460.
Shedden, K. and Cooper, S. (2002a) Analysis of cell-cycle gene expression in Saccharmoyces cerevisiae using microarray and multiple synchronization methods. Nucleic Acids Res, . 30, 29202929
Shedden, K. and Cooper, S. (2002b) Analysis of cell-cycle-specific gene expression in human cells as determined by microarrays and double-thymidine block synchronization. Proc. Natl Acad. Sci. USA, 99, 43794384
Spellman, P.T., et al. (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell, 9, 32733297
Stoer, J. and Bulirsch, R. Introduction to Numerical Analysis, (1991) Springer.
Poor, H.V. An Introduction to Signal Detection and Estimation, (1994) Springer.
Whitfield, M.L., et al. (2002) Identification of genes periodically expressed in the human cell cycle and their expression in tumors. Mol. Biol. Cell, 13, 19772000
Wichert, S., et al. (2004) Identifying periodically expressed transcripts in microarray time series data. Bioinformatics, 20, 520
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

















