Bioinformatics Advance Access originally published online on January 24, 2006
Bioinformatics 2006 22(7):843-848; doi:10.1093/bioinformatics/btl016
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Novel technique for preprocessing high dimensional time-course data from DNA microarray: mathematical model-based clustering
Graduate School of Systems Life Sciences, Kyushu University 6-10-1 Hakozaki, Higashi-ku, Fukuoka 812-8581, Japan
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation. Classifying genes into clusters depending on their expression profiles is one of the most important analysis techniques for microarray data. Because temporal gene expression profiles are indicative of the dynamic functional properties of genes, the application of clustering analysis to time-course data allows the more precise division of genes into functional classes. Conventional clustering methods treat the sampling data at each time point as data obtained under different experimental conditions without considering the continuity of time-course data between time periods t and t +1. Here, we propose a method designated mathematical model-based clustering (MMBC).
Results: The proposed method, designated MMBC, was applied to artificial data and time-course data obtained using Saccharomyces cerevisiae. Our method is able to divide data into clusters more accurately and coherently than conventional clustering methods. Furthermore, MMBC is more tolerant to noise than conventional clustering methods.
Availability: Software is available upon request.
Contact: taizo{at}brs.kyushu-u.ac.jp
| 1 INTRODUCTION |
|---|
|
|
|---|
Genes with similar expression profiles are likely to be functionally related (Lockhart et al., 2000), and thus classifying genes into clusters depending on their expression profiles is one of the most important and widely distributed analysis techniques for DNA microarray data. First-generation clustering methods, such as hierarchical clustering (Eisen et al., 1998), k-means clustering (Somogyi et al., 1999; Hartigan, 1975) and self-organizing maps (Kohonen et al.,), are now widely applied to DNA microarray analysis (Tamayo et al., 1999; Tavazoie et al., 1999). Furthermore, second-generation clustering methods, such as fuzzy k-means (Gash and Eisen, 2002) and Fuzzy ART (Tomida et al., 2002), have since been proposed in order to consider the fuzziness of cluster members.
Time-course data of mRNA has recently become available using microarray. Because temporal gene expression profiles of microarray data are indicative of the dynamic functional properties of genes, application of clustering analysis to time-course data facilitates more accurate division of genes into functional classes when compared with non-time-course data. However, the application of clustering methods to the analysis of time-course data remains under investigation; Spellman et al. (1998) have applied hierarchical clustering to DNA microarray data related to the cell cycle of Saccharomyces cerevisiae, while Tomida et al. (2002) have applied Fuzzy ART to temporal microarray data related to the sporulation of S.cerevisiae. They treated the sampling data at each time point as data obtained under different experimental conditions without considering the continuity of time-course data between time periods t and t + 1. Can the consideration of the continuity of time-course data between the time period t and t + 1 divide data into more precise functional classes? A simple method for considering the time-continuity is exploring the parameter values of the best-fitted polynomial function, which realizes time-course data. However, biological information, such as onset and cessation period of genes, cannot be extracted using the best-fitted polynomial function. Thus, building mathematical kinetic models (simultaneous differential equations models) of gene expression followed by exploring the kinetic parameter values realizing time-course data is one of the most effective techniques for considering the time-continuity of data. In this manner, numerous researchers have proposed effective methods for inferring gene interactions using mathematical models (Uematsu et al., 2003; Tamada et al., 2003; Hakamada et al., 2004; Maki et al., 2004).
Maki et al. (2004) recently proposed an integrated genetic inferring system comprising mathematical modeling of genetic interactions and estimation of parameter values that realizes temporal profiles of gene expression. The obtained parameter values not only represent information regarding gene interactions, but also include information on onset and cessation of each gene. By applying cluster analysis to these parameters, it is expected that more important biological information can be extracted from the results of clustering. We designated such methods as mathematical model-based clustering (MMBC).
In order to validate MMBC, we applied the technique to artificial time-course data and temporal gene expression profiles related to the sporulation of S.cerevisiae. The clustering results obtained by MMBC were then compared with those obtained by conventional clustering (k-means).
| 2 MATERIALS AND METHODS |
|---|
|
|
|---|
2.1 Mathematical model
Based on the linear transcription model by Chen et al. (1999), Sasik et al. (2002) proposed a mathematical kinetic model of gene expression related to the sporulation of Dictyostelium. The mathematical equations of this model can be written as follows:
![]() | (1) |
i is the i-th mRNA decay rate,
is onset time and
is cessation time. This model contains four parameters, Si,
i, t1 and t2, but most of these remain unknown for each gene. In this study, we adapted this model and developed a procedure for estimating the values of the four kinetic parameters to realize time-course data. We used the following objective function (Sasik et al., 2002) and explored the kinetic parameters that can minimize the value of this function:
![]() | (2) |
|
2.2 Clustering and evaluation
The k-means clustering technique is simple and widely used for microarray data analysis (Quackenbush, 2001). In MMBC, k-means clustering are applied to the estimated parameters. The k-means clustering technique was also used for the same time-course data. In both methods, Euclidean distance was used as the dissimilarity and we evaluated results obtained by the two methods (MMBC and k-means) using silhouette values (Rousseeuw, 1987) and the Rand statistic (Rand, 1971).
2.3 Silhouette value
The silhouette value S(i) was defined as the dissimilarity between clusters. For example, take any object i in the dataset and this object belongs to cluster A. When cluster A contains objects apart from i, we calculate
![]() | (3) |
![]() | (4) |
A,b(i) is defined as follows:
![]() | (5) |
![]() | (6) |
2.4 Rand statistic
Consider a dataset, a priori divided and labeled into n groups. Suppose this dataset is clustered into m clusters by a suitable clustering method. Take two arbitrary data points (x, y) from m clusters, check the affiliation of clusters and original groups, and then divide them into four categories:
- SS: if both data belong to the same cluster and to the same group.
- SD: if both data belong to the same cluster but to different groups.
- DS: if data belong to different clusters but to the same group.
- DD: if data belong to different clusters and to different groups.
- SD: if both data belong to the same cluster but to different groups.
Let a, b, c and d be the number of SS, SD, DS and DD data points, respectively. The Rand statistic is defined as the degree of similarity:
![]() | (7) |
2.5 Time-course data to be analyzed
2.5.1 Artificial data
We prepared artificial data using the following equation consisting of the Hill equation as the synthetic term and first-order decay as the degradation term:
![]() | (8) |
is degradation rate. Artificial data were prepared by the following procedures: (1) values of S and
were fixed in Equation (8), and five types of pair matching for (t1j, t2j), where j=1, ... , 5, were set; (2) each pair matching (t1j, t2j) was assumed to be the centroid of cluster j (j=1, ... , 5), and 20 types of Ai(t) in each cluster were prepared with changing values for t1 and t2 in accordance with normal distribution and (3) in cluster j, j was labeled to every Ai(t) and these labels were regarded as correct. As the artificial data Ai(t) (i = 1, ... , 100) were created by Equation (8) and have characteristic patterns, we prepared an additional 100 random patterns as unregulated ones. Thus, we prepared 200 artificial time-course data; 100 data having characteristic patterns with labels and 100 data having random patterns without labels. Table 1 shows the parameter values used in this study. Figure 2 shows the time-course of artificial data.
|
|
2.5.2 Experimental data
Time-course data from DNA microarray analysis of S.cerevisiae were used and these data were measured at seven sampling points (0, 0.5, 2, 5, 7, 9 and 11.5 h). Thus, the whole time-course data have seven dimensions. Preprocessed temporal microarray data for genes related to the sporulation of S.cerevisiae were based on the criteria of Chu et al. (1998); 1051 genes were extracted from 5626 genes. Depending on their original expression patterns, Chu et al. (1998) categorized 428 genes into 7 labels (Early-I, Early-II, Early-Mid, Middle, Mid-Late, Late and Metabolic); the remaining 623 genes were not labeled.
Data were standardized as follows:
![]() | (9) |
As shown in Equation (1), the estimated gene expression ratio at time t (Ai(t)) can be represented by four kinetic parameters; Si (mRNA transcription rate),
(mRNA decay rate), t1 (onset time) and t2 (cessation time). Because the magnitude of gene expression ratio (expression level) varies in genes after introducing the standardization function [Equation (9)], the values of Si and
do not reflect the actual dynamic behavior of the genes. In this study, we focused on onset time (t1) and cessation time (t2) for each gene, as these values reflect actual dynamic behavior regardless of standardization. The k-means clustering was applied to these two parameters (t1, t2).
2.6 Noise tolerance
Generally, because microarray experiments include several types of noise and/or perturbation, it is difficult to obtain the same quantitative microarray data in each experiment. Because such noise may seriously affect the results, it is important to investigate noise tolerance to clustering analysis. For this purpose, we added noise as follows:
![]() | (10) |
is the noise depending normally on mean Ii(t). Variance
2 is given as follows:
![]() | (11) |
-value, noise was randomly added to each data point. We prepared 10 types of dataset to include noise at every
-value and noise tolerance was evaluated by the Rand statistic. | 3 RESULTS AND DISCUSSION |
|---|
|
|
|---|
3.1 Artificial data
As described above, we prepared 100 Ai(t) with labels and 100 random patterns without labels. In the case of MMBC, k-means clustering (k = 37) were applied to the 200 artificial data after estimating t1 and t2 in Equation (1). The clustering results for the artificial data are shown in Figure 3 (closed circle). The abscissa and the ordinate represent the number of clusters and the Rand statistic, respectively. The open circles in Figure 3 represent the results of k-means. The correct number of clusters is five, and MMBC gave a higher Rand statistic (0.782) than k-means (0.676) at this point. As shown in Figure 3, even when changing the number of clusters, we obtained the same results; the difference in Rand statistic between MMBC and k-means was maximal when the number of clusters was five. Since correct numbers of clusters are known in advance, to investigate the change of Rand statistic, we calculated them around correct number of cluster (e.g. from 3 to 7 clusters). Although we calculated Rand statistic till 20 clusters, it did not change significantly (data not shown). This demonstrates that MMBC gives more accurate clustering results than k-means.
|
Figure 4 shows the distribution of silhouette value when the number of clusters is five. The abscissa and the ordinate represent the silhouette value and relative ratio of distribution, respectively. Closed and open circles show the results for MMBC and k-means, respectively. In MMBC, 91.5% of 200 artificial data had silhouette values >0.9, whereas in k-means, only 49.5% of 200 artificial data had silhouette values >0.9. This suggests that MMBC gives larger distances between the clusters than k-means. In addition, most of the data within a cluster were gathered near the center in MMBC, while in k-means, as the second peak of distribution is seen between 0.1 and 0.3 (silhouette value), data belonging to this region are far from the center of the clusters.
|
We also investigated the locations of the 100 labeled data points in the distribution of silhouette values. In k-means, 40% of labeled data belonged to the value range of <0.3, whereas in MMBC, 93% of labeled data belonged to the range of >0.9 and only one data was involved in the range of <0.3. This shows that MMBC is able to divide data points more coherently than k-means.
Noise tolerance was then investigated in MMBC. Table 2 shows the influence of noise level [
-values in Equation (11)] on the average Rand statistic with the number of clusters ranging from three to seven. As shown in Table 2, in every case, MMBC gave higher values for the Rand statistic than k-means; Wilcoxon signed rank test (P = 0.005, one side) showed that these differences were significant, except where the number of clusters is seven and the noise level is 0.5. This demonstrates that MMBC is tolerant to noise and accurate clustering can be achieved with MMBC, even with large noise levels.
|
Based on the analysis of artificial time-course data, MMBC gave preferable clustering results and divided the clusters more clearly than k-means.
3.2 Experimental data
While optimizing these parameters, we examined t1 and t2 starting from t = 0 to the end of sampling (time step, 0.01) and explored Si and
i in the range 010. This range was selected as each mRNA value was normalized between 0 and 1.0, and thus there is no way but to set the maximum and minimum values of these parameters arbitrarily. In this study, we set 0.01 for Smin and
min, and 10 for Smax and
max. According to Equation (1), the maximum value of A(t) is Si/
i, and thus the searching range of Si/
i is between 0.001 and 1000. Furthermore, in order to validate the searching range, we confirmed the profile of objective function, error, as represented by Equation (2), against the parameters to be estimated (Si,
i) for each mRNA. Every profile was approximated by the unimodal quadratic function rather than by the multimodal function within the searching range and the optimal value of Si and
i could be estimated individually (data not shown). Thus, only one optimal value of Si and
i for each mRNA exists according to 0.001 < (Si/
i) < 1000. However, because these ranges depend on experimental data, appropriate ranges for the parameters should be selected.
In order to validate MMBC with experimental data, we prepared time-course data for genes related to the sporulation of S.cerevisiae (Chu et al., 1998). When we applied standardization to these time-course data, the values for t1 and t2 in Equation (1) were estimated for each time-course data point followed by k-means clustering. In MMBC, we examined which parameter combinations gave the most coherent clustering results; (1) clustering using only t1, (2) clustering using only t2 and (3) clustering using both t1 and t2. With the number of clusters fixed at seven, we examined the silhouette values for each case. Table 3 shows the ratio of genes with high silhouette values for MMBC (t1), MMBC (t2) and MMBC (t1, t2), which depict the values for only t1, only t2 and both t1 and t2, respectively. In MMBC (t1), 87.3% of genes had silhouette values >0.9, which is the highest when compared with all other variations. This suggests that t1 is the best parameter for clustering.
|
Figure 5 shows the Rand statistic values for the experimentally observed time-course data. The abscissa and the ordinate represent the number of clusters and the Rand statistic, respectively. Closed and open circles indicate the Rand statistic in the case of MMBC and k-means, respectively. Because the time-course data were divided into seven clusters by Chu et al. (1998), seven was regarded as the correct number of clusters. At seven clusters, the Rand statistic for MMBC (0.768) was higher than that for k-means (0.702). This demonstrates that MMBC can is able to divide data into clusters more accurately than k-means.
|
In k-means, the Rand statistic increased significantly from seven to eight clusters, which is slightly higher than that of MMBC. As shown in Equation (7), the Rand statistic is represented by the positive integers of a, b, c and d, and we examined these values by k-means with seven and eight clusters; the value of a [number of cases where two arbitrary Ai(t) [Equation (1)] belong to the same cluster with the same label] did not change in either case, whereas the value of d [number of cases where two arbitrary Ai(t) belong to different clusters with different labels] increased with eight clusters, thus giving a marked increase in Rand statistic values. In contrast, in MMBC, the values for a and d did not change for seven or eight clusters.
The Rand statistic tends to increase depending on the number of clusters. Because d increases depending on the number of clusters, the Rand statistic values increase with the number of clusters. In k-means, the Rand statistic increased depending on the number of clusters, whereas, in MMBC, it did not increase regardless of the number of clusters. When the number of clusters is eight, in MMBC, the Rand statistic did not change because the new cluster had only one gene (data not shown). This suggests that 1050 of the 1051 genes are clustered into seven clusters, and thus it is meaningless to increase the number of clusters further. It is also notable that this change number of clusters (= 7) is the same as the number classified by Chu et al. (1998) based on expression patterns. And furthermore, we calculated Rand statistic till 20 clusters and it did not change significantly (data not shown).
As described in Section 2, Chu et al. (1998) extracted 428 genes from 1051 genes related to the sporulation of S.cerevisiae, and categorized them into seven labels (Early-I, Early-II, Early-Mid, Middle, Mid-Late, Late and Metabolic). We investigated the silhouette values of 428 genes, as shown in Table 4. Of these, 94.6 and 6.1% had silhouette value >0.9 in MMBC and k-means, respectively. This suggests that MMBC gives larger distances among the clusters than k-means, and that MMBC is able to divide more coherent clusters than k-means. Furthermore, in MMBC, most genes within a cluster were gathered near the center of the cluster, whereas, in k-means, few genes were located near the center of the cluster. Figure 6 shows the distribution of the silhouette values in MMBC (closed circles) and k-means (open circles). As shown in Figure 6, in MMBC, the main peak of the distribution was found at 1.0 (silhouette value), whereas, in k-means, the main peak was found at 0.6, which indicates that most of the genes were located far from the center of the cluster in k-means.
|
|
In k-means, cluster analysis was performed by using all sampling points (20 points for artificial time-course data and 7 points for experimental time-course data from S.cerevisiae). In MMBC, cluster analysis was performed by building a suitable kinetic model [Equation (1)] followed by extracting and estimating the values of most influential parameters; two parameters [t1, t2 in Equation (1)] and one parameter [t1 in Equation (1)] were extracted in the case of artificial time-course data and in the case of experimental time-course data, respectively. As shown in Figures 36, in all cases, MMBC showed superior properties when using limited numbers of kinetic parameters in comparison with k-means. Furthermore, Table 2 demonstrates that MMBC is tolerant to noise and gives accurate clustering results, even with substantial noise.
In this study, we analyzed the time-course of mRNA related to the sporulation of S.cerevisiae. Time-courses were measured at seven sampling points. Thus, each time-course data has seven dimensions, although we applied mathematical models with two dimensions or one dimension (t1 and/or t2). Generally speaking, reducing dimensions is considered to result in loss of information. However, high dimensional data may interfere with cluster structure. According to Bellman, one of the risks is that high dimensional data tend to be sparse. Furthermore, Beyer et al. (1999) pointed out that as dimensions increase, the differences in distance between a given point and its nearest neighbor becomes negligible. Therefore, it becomes difficult to identify clustering structure based on distance. In addition, some features are redundant, some are irrelevant and some interfere with clustering. There often exist irrelevant and/or noisy features that mislead the clustering process (Dy and Brodley, 2004). Because these reports suggest, in many cases, that whole time-course data do not give the best clustering results, it is important to utilize only indispensable dimensions. As shown in Figures 5 and 6, Rand statistic and silhouette values in MMBC are higher than in k-means. This demonstrates that MMBC use only indispensable dimensions to obtain coherent and accurate clustering results. Furthermore, these indispensable dimensions can be interpreted biologically. Another advantage of MMBC is the consideration of time-continuity of time-variant sampled data in addition to the reducing' dimensions. There are two wide-distributed criteria for evaluating clustering ability; silhouette value and Rand statistic. Using the same sampled data (artificial data and gene expression profile related to the sporulation), we compared MMBC with k-means clustering from the views of these two criteria. As shown in Figures 36, MMBC showed the superiority of both the criteria to k-means clustering in every case.
Applying MMBC to the various kinds of practical data, we have to consider the following two problems: how to process noise involving practical raw data and how to set up the most appropriate mathematical model. The former problem should be solved by applying smoothing effect to the practical raw data; introduction of the interpolation such as spline interpolation to the raw data as a preprocessing method. In this study we did not apply such an interpolation technique, however, using such an interpolation, we can highly expect the reduction of noise from the practical raw data followed by more precise and clear-cut clustering. The latter problem will be the most crucial in this study. Applying Equation (1), we obtained better clustering results; however, it is uncertain whether Equation (1) represents the most suitable mathematical kinetic model. After adding perturbation or stress to the system, most of transient time-courses show monotonous and saturable patterns, but also show oscillatory patterns in some cases. Thus, with monotonous patterns, the kinetic model shown in Equation (1) is the most appropriate, but other kinetic models are required when considering feedback or looping interactions between genes or when time-courses show an oscillatory pattern. We should develop generalized kinetic model according to the temporal message profiles of the target genes.
| Acknowledgments |
|---|
Our research in this study was supported by Grant-in-Aid for Scientific Research on Priority Areas (C) "Genome Informatics Science" (No. 13208008 and 12208008) from the Ministry of Education, Science, Sports and Culture of Japan and The Project for Development of a Technological Infrastructure for Industrial Bioprocesses on R&D of New Industrial Science and Technology Frontiers by Ministry of Economy, Trade & Industry (METI), and entrusted by New Energy and Industrial Technology Development Organization (NEDO).
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Martin Bishop
Received on November 25, 2005; revised on January 4, 2006; accepted on January 21, 2006
| REFERENCES |
|---|
|
|
|---|
Bellman, R. Adaptive Control Processes: A Guided Tour, (1961) , NJ Princeton University Press.
Beyer, K., Goldstein, J., Ramakrishanan, R., Shaft, U. (1999) When is nearest neighbor meaningful? Proceedings of the International Conference on Database Theories, , London, UK Springer-Verlag, pp. 217235.
Chen, T., et al. (1999) Modeling gene expression with differential equations. Pac. Symp. Biocomput, . 4, 2940.
Chu, S., et al. (1998) The transcription program of sporulation in budding yeast [Erratum (1998) Science, 282, 1421.]. Science, 282, 699705
Dy, J.G. and Brodley, C.G. (2004) Feature selection for unsupervised learning, feature selection for unsupervised learning. J. Mach. Learn. Res, . 5, 845889.
Eisen, M.B., et al. (1998) Clustering analysis and display of genome-wide expression pattern. Proc. Natl Acad. Sci. USA, 95, 1486314868
Gasch, A.P. and Eisen, M.B. (2002) Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering. Genome Biol, . 3, RESEARCH0059.
Hakamada, K., Hanai, T., Honda, H., Kobayashi, T. (2004) A preprocessing method for inferring genetic interaction from gene expression data using Boolean algorithm. J. Biosci. Bioeng, . 98, 457463[Medline].
Hartigan, J.A. Clustering Algorithm, (1975) , New York Wiley.
Kohonen, T. Self-Organizing Maps, (1997) 2nd Edn , New York Springer.
Lockhart, D.J. and Winzeler, E.A. (2000) Genomics, gene expression and DNA arrays. Nature, 405, 827836[CrossRef][Medline].
Maki, Y., et al. (2004) An integrated comprehensive workbench for inferring genetic networks: voyagene. J. Bioinfo. Comput. Biol, . 2, 533550[CrossRef].
Quackenbush, J. (2001) Computational analysis of microarray data. Nat. Rev. Genet, . 2, 418427[CrossRef][Web of Science][Medline].
Rand, W.M. (1971) Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc, . 66, 846850[CrossRef][Web of Science].
Rousseeuw, P.J. (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Compt. Appl. Math, . 20, 5365.
Sasik, R., et al. (2002) Extracting transcriptional events from temporal gene expression patterns during Dictyostelium development. Bioinformatics, 18, 6166
Somogyi, R. (1999) Making sense of gene-expression data. Pharmainformatics, 1724.
Spellman, P.T., et al. (1998) Comprehesive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell, 9, 32733297
Tamada, Y., et al. (2003) Estimating gene networks from gene expression data by combining Bayesian network model with promoter element detection. Bioinformatics, 19, Suppl. 2, II227II236.
Tamayo, P., et al. (1999) Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc. Natl Acad. Sci. USA, 96, 29072912
Tavazoie, S., Hughes, J.D., Campbell, M.J., Cho, R.J., Church, G.M. (1999) Systematic determination of genetic network architecture. Nat. Genet, . 22, 281285[CrossRef][Web of Science][Medline].
Tomida, S., et al. (2002) Analysis of expression profile using fuzzy adaptive resonance theory. Bioinformatics, 18, 10731083
Uematsu, N., et al. (2003) Analysis of genetic networks using time-course data of gene expression profiling. Cytometry Res, . 13, 4553.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
















