Bioinformatics Advance Access originally published online on November 19, 2007
Bioinformatics 2008 24(2):184-191; doi:10.1093/bioinformatics/btm568

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Clustering of change patterns using Fourier coefficients

Jaehee Kim 1,* and Haseong Kim 2

1Department of Statistics, Duksung Women's University and 2Bioinformatics and Biostatistics Laboratory, Seoul National University, Seoul, S. Korea

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 THE MODEL
 3 TRIGONOMETRIC FOURIER SERIES...
 4 CLUSTERING CURVES OF...
 5 RESULTS
 6 CONCLUDING REMARKS
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: To understand the behavior of genes, it is important to explore how the patterns of gene expression change over a time period, because biologically related gene groups can share the same change patterns. Many clustering algorithms have been proposed to group observation data. However, because of the complexity of the underlying functions, there have not been many studies on grouping data based on change patterns. In this study, the problem of finding similar change patterns is reduced to clustering the derivative Fourier coefficients. The sample Fourier coefficients not only provide information about the underlying functions, but also reduce the dimension. In addition, as their limiting distribution is multivariate normal, a model-based clustering method incorporating statistical properties is appropriate.

Results: This work is aimed at discovering gene groups with similar change patterns that share similar biological properties. We developed a statistical model using derivative Fourier coefficients to identify similar change patterns of gene expression. We used a model-based method to cluster the Fourier series estimation of derivatives. The model-based method is advantageous over other methods in our proposed model because the sample Fourier coefficients asymptotically follow the multivariate normal distribution. Change patterns are automatically estimated with the Fourier representation in our model. Our model was tested in simulations and on real gene data sets. The simulation results showed that the model-based clustering method with the sample Fourier coefficients has a lower clustering error rate than K-means clustering. Even when the number of repeated time points was small, the same results were obtained. We also applied our model to cluster change patterns of yeast cell cycle microarray expression data with alpha-factor synchronization. It showed that, as the method clusters with the probability-neighboring data, the model-based clustering with our proposed model yielded biologically interpretable results. We expect that our proposed Fourier analysis with suitably chosen smoothing parameters could serve as a useful tool in classifying genes and interpreting possible biological change patterns.

Availability: The R program is available upon request.

Contact: jaehee{at}duksung.ac.kr

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 
One of the most common approaches in genome science is the analysis of gene expression patterns or change patterns. If the expressions of genes are measured at various time points during the course of an experimental study, each gene can be characterized by its pattern of change. Grouping genes that share similar expression profiles into clusters is usually the first step in understanding the huge amount of DNA microarray data associated with complicated biological networks. However, most research on gene clustering has been performed with the observed expression data, while ignoring the change patterns. The motivation of this research is to derive an efficient and robust statistical method for an area where little research has been done yet the needs from a biological standpoint are numerous.

Researchers are frequently interested in studying gene expression changes over time and in evaluating trend differences between experimental groups. For example, in a study of the effect of a corn oil or fish oil diet on colon cancer, rather than grouping genes by their observed values alone, the change values and change patterns need to be investigated. The aim is then to detect biologically meaningful gene expression trends and to spot differences between the experimental groups.

Because of differences in the initial levels of background noise in the experiment, difference values or derivatives need to be used as a measure of change. A basic premise is also that genes sharing similar change profiles may be functionally related or co-regulated. As such, microarray derivative data provide further insight into gene–gene interactions, gene functions and pathways. Derivative functions also provide statistical convenience in that: (1) functions that differ by a constant have the same derivative; and (2) difference values give information about the changes as well as about the original functions. Nevertheless, few of the previous methods took derivative functions into account.

We propose to use Fourier coefficients in clustering expression patterns and change patterns. Fourier coefficients have several advantages over other methods: (1) the dimension of a data set can be reduced to several Fourier coefficients; (2) the estimated Fourier coefficients give information about the underlying function and enable automatic estimation of the change or pattern function; (3) the Fourier coefficient estimation does not depend strongly on the covariance structure; and (4) as the sample Fourier coefficients asymptotically follow a multivariate normal distribution, a Gaussian mixture model that incorporates underlying probability distributions can be effective.

There has been considerable research about discovering patterns using clustering and testing. Serban and Wasserman (2005) proposed clustering after transformation and smoothing as a technique for non-parametrically estimating and clustering a large number of curves. To discover change patterns in gene expressions, Ernst et al. (2005) clustered short time series gene expression data by selecting a set of potential expression profiles. Li and Wong (2002) proposed an effective discretization and gene selection method using the concept of emerging patterns. Park et al. (2003) proposed a method to test the statistical significance of time-dependent gene expression data and to identify genes with significant change based on an ANOVA model. Lai et al. (2004) proposed a method for selecting genes that have differential gene–gene co-expression patterns with the idea of correlation difference.

Model-based clustering is a clustering approach that takes the probability distribution of the data into account. Yeung et al. (2001) showed the performance of model-based clustering on several simulated and real gene expression data sets. Murtagh and Raftery (1984) successfully applied model-based hierarchical clustering in character recognition problems using a multivariate normal model. Fraley and Raftery (2002) suggested model-based hierarchical agglomerative clustering based on computing an approximate maximum for the classification likelihood.

Smoothing away noise-induced wiggles with Fourier series has been studied by several researchers. Zhang et al. (2003) used the first harmonic of the discrete Fourier transform to translate multi-dimensional time series microarray expression data into a two-dimensional scatter plot. Murthy and Hua (2004) proposed an improved Fourier method that considers the irregular or monotonic component of cell-cycle expression. Kim et al. (2006) suggested a two-step procedure for clustering periodic patterns of gene expression profiles. They used least squares non-linear curve fitting based on a Fourier series approximation with frequency and amplitude of order one. Though they considered the periodicity and a mixture model-based likelihood for the estimated parameters, change patterns of the gene expression were not taken into account.

There has been much research on clustering microarray data, mostly on grouping common expression patterns. However, there are many cases in gene study in which grouping change patterns is of interest. In this research, we propose a new method for clustering change patterns with derivative Fourier coefficients. The proposed method consists of four main steps. The first and second steps consist of representing a gene profile with sample Fourier coefficients, and then the calculation of derivatives from the Fourier coefficients. The third step is to cluster the derivative Fourier coefficients using model-based clustering. In the final step, genes with the same change pattern are clustered and the underlying change pattern is automatically estimated using the Fourier representation.

We demonstrate the usefulness of the Fourier analysis and model-based clustering by applying the method to simulated data. We also apply the method to real gene expression data, where it yields biologically interpretable gene clusters.


    2 THE MODEL
 
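Consider the data $Y_{iu}$, the $u$th observation on the $i$th curve, of the form

\[ Y_{iu} = f_i(t_{iu}) + \varepsilon_{iu}, \qquad i = 1, \ldots, N, \quad u = 1, \ldots, m, \tag{1} \]

where $E(\varepsilon_{iu}) = 0$ and $\mathrm{Var}(\varepsilon_{iu}) = \sigma^2$. In the microarray experiment, $Y_{iu}$ is the log gene expression of gene $i$ at time $t_{iu}$.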

We assume that the curve fi belongs to a class of smooth functions F as defined below:


Formula 2

(2)
where {bj} is an orthonormal basis system and


Formula 3

(3)
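We can estimate $f_i$ using the estimated Fourier coefficients $\hat\theta_{ij}$ by

\[ \hat f_i(t) = \sum_{j=1}^{J} \hat\theta_{ij}\, b_j(t), \tag{4} \]

which is the projection onto the first $J$ basis functions, where $J$, $1 \le J \le m$, is a smoothing parameter to be chosen based on the data.

The sample Fourier coefficients are computed as

\[ \hat\theta_{ij} = \frac{1}{m} \sum_{r=1}^{m} Y_{ir}\, b_j(t_r), \tag{5} \]

with $t_r = r/m$ and $t \in [0,1]$.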

With regard to changes, the difference data

\[ Y_{iu} - Y_{i,u-1}, \qquad u = 2, \ldots, m, \tag{6} \]
can be approximated by Formula , the derivative of fi at tiu and tiu ti,u – 1, assuming that the first order derivative exists. Therefore, the following model can be considered:


Formula 7

(7)
where


Formula

This setup can be extended to the cases where the design or time points are not the same for all curves. We want to classify the same patterns with differences or derivatives that give information about the underlying change pattern.


    3 TRIGONOMETRIC FOURIER SERIES ESTIMATION OF DERIVATIVES
 
The function represented with a Fourier series with the cosine bases is given as


Formula 8

(8)

We can estimate fi with J terms of Fourier coefficients as


Formula 9

(9)
where Fourier coefficients are estimated as


Formula 10

(10)
We also estimate the derivative of fi as


Formula 11

(11)
Note that the Fourier coefficients of the derivatives are calculated by weighting the coefficients from the original functions. The coefficients of the derivative have more weight j on the latter terms of the Fourier coefficients. This means that the higher frequency terms have more information about the derivative pattern of ups and downs.
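The weighting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' R program: the basis is taken to be the orthonormal cosine system $b_0(t) = 1$, $b_j(t) = \sqrt{2}\cos(\pi j t)$ with the sum starting at $j = 0$, and the function names are hypothetical. The exact normalization and sign convention of the derivative coefficients in the original formulas may differ.

```python
import math


def cosine_basis(j, t):
    """Assumed orthonormal cosine basis on [0, 1]: b_0 = 1, b_j = sqrt(2) cos(pi j t)."""
    return 1.0 if j == 0 else math.sqrt(2.0) * math.cos(math.pi * j * t)


def sample_fourier_coefficients(y, J):
    """Sample coefficients (1/m) * sum_r y_r * b_j(t_r) with t_r = r/m, for j = 0..J."""
    m = len(y)
    return [
        sum(y[r - 1] * cosine_basis(j, r / m) for r in range(1, m + 1)) / m
        for j in range(J + 1)
    ]


def derivative_estimate(theta_hat, t):
    """Term-by-term derivative of the fitted cosine series at time t.

    d/dt [sqrt(2) cos(pi j t)] = -sqrt(2) * pi * j * sin(pi j t),
    so the j-th coefficient enters with weight j and the constant term vanishes.
    """
    return sum(
        -math.sqrt(2.0) * math.pi * j * theta_hat[j] * math.sin(math.pi * j * t)
        for j in range(1, len(theta_hat))
    )


def derivative_weights(theta_hat):
    """Coefficients weighted by their frequency index j (j = 1, 2, ...)."""
    return [j * theta_hat[j] for j in range(1, len(theta_hat))]
```

Because each coefficient is multiplied by its index $j$, a high-frequency coefficient of the same size as a low-frequency one contributes more to the estimated change pattern, which is the point made in the paragraph above.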

The model in (7) can be expressed as


Formula 12

(12)
where Formula . Therefore the Fourier coefficients of change can be estimated by Formula . Since $\psi_{ij}$ is a Fourier coefficient of the derivative function, we call $\psi_{ij}$ the derivative Fourier coefficient and $\hat\psi_{ij}$ the estimated derivative Fourier coefficient from the sample.

With the independent $\varepsilon_{ij}$'s, $\mathrm{var}(\eta_{ij}) = 2\sigma^2$, and


Formula
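The variance stated above can be checked in one line. The derivation below assumes that the error of the difference model is the difference of consecutive errors, written here with time index $u$ as $\eta_{iu} = \varepsilon_{iu} - \varepsilon_{i,u-1}$; the defining formula is not reproduced in this text, so that definition is an assumption.

```latex
\begin{align*}
\mathrm{Var}(\eta_{iu})
  &= \mathrm{Var}(\varepsilon_{iu}) + \mathrm{Var}(\varepsilon_{i,u-1})
     - 2\,\mathrm{Cov}(\varepsilon_{iu}, \varepsilon_{i,u-1})
   = \sigma^2 + \sigma^2 - 0 = 2\sigma^2, \\
\mathrm{Cov}(\eta_{iu}, \eta_{i,u+1})
  &= \mathrm{Cov}(\varepsilon_{iu} - \varepsilon_{i,u-1},\;
                  \varepsilon_{i,u+1} - \varepsilon_{iu})
   = -\mathrm{Var}(\varepsilon_{iu}) = -\sigma^2 .
\end{align*}
```

Under this assumption, neighboring differences are negatively correlated even when the original errors are independent, while differences two or more steps apart are uncorrelated.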

To estimate Fourier coefficients, the covariance structure is not considered in our approach since the covariance matrix of a finite set of estimated Fourier coefficients is asymptotically proportional to the identity matrix.

The parameter J controls the amount of smoothing and should be determined based on the data. Even though the optimal choice for J varies from function to function, we choose to use a single smoothing parameter that operates reasonably well for all of the curves. There has been some research on optimal choices for J. For example, to find a global smoothing parameter, Serban and Wasserman (2005) calculated J as the minimizer of the total regret. Eubank and Hart (1992) also suggested choosing the smoothing parameter J that minimizes the risk or mean-squared error.

With a large number of gene curves and various functional shapes, a universal rule for an optimal choice for J does not exist. Therefore, instead, we capitalize on the convergence property of Fourier transforms. Since the Fourier estimator converges to the true function, usually the first few Fourier coefficients contribute to the estimation of the whole function. In practice, we can select a smaller J for linear or smooth functions and a larger J for wigglier functions.


    4 CLUSTERING CURVES OF THE SAME CHANGE
 
The similarity of cluster derivatives Formula and Formula can be measured with Euclidean distance or other distance measures. It may be of interest to check the equivalence of the similarity of the estimated Fourier coefficients with the similarity of the estimated functions. As Beran and Dumbgen (1998) demonstrated, a reasonable coordinate system via a Fourier transform of data has as much correct asymptotic coverage probability as the untransformed data. As such, the sample Fourier coefficients can be used instead of the underlying functions.

After clustering with the estimated Fourier coefficients $\hat\theta_{ij}$'s for the original function and with $\hat\psi_{ij}$'s for the derivative function, we can estimate the function of each gene with these estimated Fourier coefficients using (4). The change pattern can also be estimated with derivative Fourier coefficients using (11). This automatic estimation is another capability of Fourier representation. These estimated periodic functions show the functional shape and periodicity.

4.1 Mixture model of derivative Fourier coefficients
Clustering using a mixture model assumes that each group of the data is generated by an underlying probability distribution. Suppose that data X1, ... , Xn are multivariate observations.

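In a Gaussian mixture model, each group $k$ is modeled by the multivariate normal distribution with parameters $\mu_k$ (mean vector) and $\Sigma_k$ (covariance matrix):

\[ f_k(X_i \mid \mu_k, \Sigma_k) = \frac{\exp\left\{-\tfrac{1}{2}(X_i - \mu_k)^{\mathsf T} \Sigma_k^{-1} (X_i - \mu_k)\right\}}{\sqrt{\det(2\pi\Sigma_k)}}. \tag{13} \]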

Geometric features (shape, volume, orientation) of each group $k$ are determined by the covariance matrix $\Sigma_k$. Banfield and Raftery (1993) proposed a general framework for exploiting the representation of the covariance matrix in terms of its eigenvalue decomposition. Each elliptical model is implemented in Mclust (Fraley and Raftery, 1999).

We consider model-based clustering with the estimated Fourier coefficients of change $\hat\psi_{ij}$. The sample Fourier coefficient $\hat\theta_{ij}$ in (5) is a form of weighted average of random variables with variance $O(m^{-1})$. Freedman and Lane (1980) showed that the empirical distribution of Fourier coefficients is normal. By the Central Limit Theorem for independent and identically distributed samples, the sample Fourier coefficient $\hat\theta_{ij}$ is asymptotically normally distributed. As $m \to \infty$ with $J$ fixed, the vector $(\hat\theta_{i1}, \ldots, \hat\theta_{iJ})$ has an asymptotically $J$-dimensional multivariate normal distribution. Therefore $(\hat\psi_{i1}, \ldots, \hat\psi_{iJ})$ also has an asymptotically multivariate normal distribution, as a linear function of the $\hat\theta_{ij}$'s. With this asymptotic property, we can use the Gaussian mixture model for clustering.

Model-based hierarchical agglomerative clustering is an approach to compute an approximate maximum of the classification likelihood,

\[ L_C(\mu_1, \ldots, \mu_K, \Sigma_1, \ldots, \Sigma_K; g_1, \ldots, g_n \mid X) = \prod_{i=1}^{n} f_{g_i}(X_i \mid \mu_{g_i}, \Sigma_{g_i}), \]

where the $g_i$'s are labels indicating a unique classification for each observation and $f_{g_i}(X_i \mid \mu_{g_i}, \Sigma_{g_i})$ is the probability function of the estimated Fourier coefficients of the $i$th gene. In the Gaussian mixture likelihood, each component is weighted by the probability that a sample Fourier coefficient belongs to that component. Our clustering strategy is model-based agglomerative hierarchical clustering, with selection of the model and the number of clusters using approximate Bayes factors with the BIC approximation.
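To make the model-selection step concrete, the sketch below scores a hard partition by a classification log-likelihood and a BIC value. It is a simplified illustration, not the Mclust procedure: it assumes spherical Gaussian clusters with one shared variance, uses the classification likelihood rather than the mixture likelihood fitted by EM, and the function names are hypothetical. The BIC sign convention (larger is better) is $2\log L - p \log n$.

```python
import math


def classification_loglik(points, labels):
    """Log classification likelihood under spherical Gaussians with a shared variance.

    Returns (log-likelihood, number of free parameters). Assumes the points do not
    all coincide with their cluster means (otherwise the variance estimate is zero).
    """
    n, d = len(points), len(points[0])
    clusters = {}
    for x, g in zip(points, labels):
        clusters.setdefault(g, []).append(x)
    # Cluster means, one d-dimensional vector per label.
    means = {g: [sum(col) / len(xs) for col in zip(*xs)] for g, xs in clusters.items()}
    # Pooled within-cluster sum of squares and maximum-likelihood variance.
    sse = sum(
        sum((a - b) ** 2 for a, b in zip(x, means[g])) for x, g in zip(points, labels)
    )
    var = sse / (n * d)
    loglik = -0.5 * n * d * math.log(2.0 * math.pi * var) - sse / (2.0 * var)
    n_params = len(clusters) * d + 1  # all mean entries plus one variance
    return loglik, n_params


def bic(loglik, n_params, n_obs):
    """BIC in the 'larger is better' convention: 2 log L - p log n."""
    return 2.0 * loglik - n_params * math.log(n_obs)
```

In use, one would compute `bic` for each candidate number of clusters (and, in a fuller implementation, each covariance structure) on the vectors of estimated derivative Fourier coefficients and keep the candidate with the largest value.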

4.2 Cluster validity
A major challenge in cluster analysis is the estimation of the optimal number of clusters. The silhouette method, proposed by Rousseeuw (1987), is a cluster validity technique for identifying the partition of clusters for which a measure of quality is optimal.

The silhouette width for the $i$th sample in the $j$th cluster is defined as:

\[ s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}, \]

where $a(i)$ is the average distance between the $i$th sample and all other samples included in the $j$th cluster, and $b(i)$ is the minimum, over $k \neq j$, of the average distance between the $i$th sample and all of the samples clustered in the $k$th cluster. A point is regarded as well clustered if $s(i)$ is large. The overall average silhouette value can be used as an effective validity index for any partition. Kaufman and Rousseeuw (1990) proposed choosing the optimal number of clusters as the value maximizing the average $s(i)$ over the data set. We can consider the overall average silhouette in selecting the number of Fourier coefficients and the optimal number of clusters. A silhouette is generally known to work best with roughly spherical clusters. If the clustering algorithm does not result in this shape of cluster, the overall average silhouette width tends to become very low.
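The silhouette definition above translates directly into code. The following is a minimal sketch with Euclidean distance; the function name is hypothetical, it assumes at least two clusters, and it sets $s(i) = 0$ for singleton clusters, which is a common convention rather than something stated in the text.

```python
import math


def silhouette_widths(points, labels):
    """Silhouette width s(i) = (b(i) - a(i)) / max(a(i), b(i)) for every sample."""

    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    # Map each cluster label to the indices of its members.
    clusters = {}
    for idx, g in enumerate(labels):
        clusters.setdefault(g, []).append(idx)

    widths = []
    for i, g in enumerate(labels):
        own = [j for j in clusters[g] if j != i]
        if not own:  # singleton cluster: width set to 0 by convention
            widths.append(0.0)
            continue
        # a(i): average distance to the other members of the same cluster.
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b(i): smallest average distance to the members of any other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for h, members in clusters.items()
            if h != g
        )
        denom = max(a, b)
        widths.append(0.0 if denom == 0.0 else (b - a) / denom)
    return widths
```

Averaging the returned list gives the overall silhouette value that can be compared across choices of J and of the number of clusters.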

Azuaje (2002) studied the assessment of expression cluster validity with 18 measures and remarked that there is no universal validity paradigm to predict consistent results across different clustering techniques. Evaluation of biologically relevant results may support the cluster validity.


    5 RESULTS
 
5.1 Simulated data set
Since real expression data sets are generally noisy and their clusters may not be fully reflective of the class information, we first evaluate the performance of our method with simulated data, where the classes are known.

We simulate data according to the regression model

\[ Y_{iu} = f_i(t_{iu}) + \varepsilon_{iu}, \]

with $i = 1, 2, \ldots, 9$, $u = 1, 2, \ldots, m$, and $t_{iu} = u/m$. The regression functions $f_i$ are:


Formula

The simulated data consist of 2000 curves originating from nine different functions, 1200 curves of $f_1$ and 100 curves of each of $f_2, \ldots, f_9$, to reflect typical gene expression data. There are only five different change patterns: $f_1$, $f_2$, $f_4$, $f_6$ and $f_8$. We assume that the noise is normally distributed, with $\sigma \sim \mathrm{Unif}(0.4, 0.7)$ for low noise and $\sigma \sim \mathrm{Unif}(1.0, 1.2)$ for high noise. We consider $m = 5, 10, 20, 30$ and $50$ repeated design points.

Pollard (1982) showed that under weak conditions, as the sample size tends to infinity, the set of cluster centers as the minimizer of the distance of samples in each cluster converges almost surely to the population cluster centers and converges in distribution to the multivariate normal distribution. Since the means with K-means clustering can satisfy this property, we consider K-means clustering in the comparison study.

Let T be a clustering map defined as


Formula

Regarding the estimation error, the clustering estimation error rate $\eta(K)$ is defined as


Formula

where $C = \{f_1, \ldots, f_N\}$ denotes the true curves and $\hat C = \{\hat f_1, \ldots, \hat f_N\}$ denotes the estimated curves. $T$ and $\hat T$ represent the corresponding cluster maps and $K$ denotes the number of clusters. $\eta(K)$ is then the fraction of all pairs that are incorrectly put in separate clusters depending on $K$ clusters, as described in Serban and Wasserman (2005).
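A pairwise error rate of this kind can be sketched as follows. The exact formula is not reproduced in this text, so the version below is an assumption: it counts the fraction of curve pairs on which the true and estimated partitions disagree about co-membership (together in one, apart in the other). The function name is hypothetical.

```python
from itertools import combinations


def pairwise_error_rate(true_labels, est_labels):
    """Fraction of pairs whose co-membership differs between two partitions.

    A pair (r, s) is counted when the curves share a cluster under one labeling
    but not under the other. Label names themselves do not matter.
    """
    n = len(true_labels)
    pairs = list(combinations(range(n), 2))
    disagreements = sum(
        (true_labels[r] == true_labels[s]) != (est_labels[r] == est_labels[s])
        for r, s in pairs
    )
    return disagreements / len(pairs)
```

Because only co-membership is compared, the measure is invariant to relabeling of the clusters, so an estimated partition that matches the truth up to a permutation of cluster names has error rate zero.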

Table 1 shows the clustering estimation error rates for the model-based and K-means methods, with Fourier coefficients and with difference data, for various numbers of Fourier terms and repeated design points. The clustering estimation error is smaller for the model-based method with the Fourier coefficients than for K-means with the Fourier coefficients. The clustering estimation error is also smaller for the model-based method with Fourier coefficients than for the model-based method with the difference data. With Fourier coefficients, the clustering estimation error becomes smaller as m becomes larger. The number of clusters is determined to be 5 in accordance with the Bayesian Information Criterion (BIC). Once J exceeds 5, the clustering estimation error rate does not change appreciably. Therefore, we suggest using J around 5 for dimension reduction and for biological interpretation. Optimal J values are highlighted in Table 1. Supplementary Table S1 shows results similar to those in Table 1 for the high-noise data. Figure 1 shows the functions grouped in each cluster with J = 2 for the low-noise simulated data. It shows the true functions that have the same derivatives.


Fig. 1. Functions in 5 clusters based on derivative Fourier coefficients using J = 2 via model-based clustering with m = 10 and the low noise level simulated data.

 

Table 1. Comparison of clustering estimation error rate (%) of model-based versus K-means cluster method with derivative Fourier coefficients (FC) and difference data (dif) in 5 clusters of the simulated data with low noise level in 100 repetitions and m repeated design points

 
5.2 Yeast cell cycle microarray expression data example
5.2.1 Yeast cell cycle data
We also applied our method to yeast cell cycle data. The cell cycle is important in understanding cell replication, malignancy and reproductive diseases that are associated with genomic instability and abnormal cell division. Biologists have been studying the cell cycle with the budding yeast Saccharomyces cerevisiae, a free-living, single-celled eukaryote that is nevertheless a highly complex organism.

Spellman et al. (1998) created a comprehensive catalog of yeast genes whose transcript levels vary periodically within the cell cycle. They used DNA microarrays and samples from yeast cultures synchronized by three independent methods: α-factor arrest, elutriation and the arrest of a cdc15 temperature-sensitive mutant.

We applied our method of clustering to yeast cell cycle data downloaded from http://genome-www.stanford.edu/cellcycle/. We used the yeast alpha data collected at 18 time points over 120 min during two full cell cycles. After removing genes with missing values, 4489 genes remained.

5.2.2 Choice of Fourier coefficients and clusters
To determine the J value and the number of clusters, we considered several J values and Bayesian Information Criterion (BIC) with the assumption that each cluster covariance has the same elliptical volume and shape. Since we found that the optimal J value varied for each function, we surmised that a true optimal J value may not exist. As such, we experimented with the model-based clustering using various numbers of clusters and J values.

Table 2 shows the median and average silhouette values, computed with Euclidean distance between samples, for model-based and K-means clustering with various J values and 5 clusters. Although J = 1 yields higher overall silhouette widths with both K-means and model-based clustering, we think a value larger than 1 is needed to extract enough information about the underlying change patterns. Judging from the highest overall silhouette value, model-based clustering with 4 Fourier coefficients and 5 clusters was considered most appropriate. With K-means, the silhouette value with J = 1 is the largest. The silhouette value of K-means with J = 3 is larger than that of model-based clustering with J = 4. It should therefore be noted that silhouette values based on Euclidean distance may not be the only criterion for comparing two clustering models. Rather, as in the gene ontology analysis that follows, biological interpretation should be used to validate the clustering. Note also that the model-based method, which incorporates density, connects probability-neighboring data, while the K-means method measures intra-cluster homogeneity as cluster compactness.


Table 2. Median and average silhouette values for 5 clusters with derivative Fourier coefficients of yeast data using Euclidean distance

 
Using model-based clustering with J = 4, the 5 clusters contain 3032, 401, 164, 400 and 492 genes, respectively. Figure 2 shows the means and the 5% and 95% values of the Fourier estimated gene scores in the 5 clusters obtained with sample derivative Fourier coefficients. The graph in the bottom right-hand corner of Figure 2 shows the estimated change patterns of the 5 clusters together. Supplementary Figure S1 shows the means of the 4 derivative Fourier coefficients as a cluster profile and gives the variation between clusters. Supplementary Figure S2 shows a chi-square plot of each cluster for checking multivariate normality with a dimension of 4. If the four derivative Fourier coefficients follow a multivariate normal distribution, the points scatter around the line with a slope of 1. Even though the coefficients satisfy asymptotic multivariate normality, this assumption can also be checked with chi-square plots. Except for cluster 4, the clusters appear to have a slightly heavier tail than a normal distribution.


Fig. 2. Fourier estimated gene score mean, 5% and 95% in 5 clusters based on derivative Fourier coefficients with J = 4 of yeast data and Fourier estimated change patterns.

 
Supplementary Figure S3 and S4 show plots of Formula and Formula hinting at the shape of clusters using both the model-based and K-means method. The clusters are elliptical with the different size and shape combined with the probability distributions in Figure S3. The clusters are grouped closer within cluster in Figure S4. Gaussian mixture model clustering allows clusters to have different orientation or sizes while preserving some common features, such as an ellipsoidal shape. Cluster 5 in particular has a wide elliptical shape incorporated with the probability distribution.

Owing to noise and the high dimensionality of data, careful consideration of statistical and biological validity is needed when analyzing the real microarray data.

5.2.3 Gene ontology analysis
In order to evaluate the result of the clustering analysis, we obtained Gene Ontology (GO) information for the clustered genes’ biological processes, molecular functions and cellular components. The GO database provides a useful tool to annotate and analyze the functions of a large number of genes. We searched statistically overrepresented GO annotations using GOstat for evaluating statistical significance of overrepresented functional and molecular mechanisms (Beissbarth and Speed, 2004). GOstat allows us to identify which annotations are typical for the group of genes. GOstat simply derives the statistical significance between expected and observed functional categories based on the Fisher's exact test.

In order to compare our method with other clustering methods, we also applied K-means clustering (MacQueen, 1967) to the yeast cell cycle data. Table 3 shows some results of the overrepresented biological processes from the proposed method and from the K-means clustering method for values of k from 5 to 15.


Table 3. Result of comparison between overrepresented biological processes by using proposed method and K-means method

 
In Table 3, the first column shows the cluster number of the proposed method. The second column summarizes the list of the selected overrepresented biological processes that had their children GO terms in the same cluster. For example, we first selected a total of 81 GO terms in cluster 1 by using GOstat and then selected 6 GO terms that had as many children nodes as possible in cluster 1. The K-means clustering results were obtained in the same way. We compared the list of the overrepresented GO terms from the proposed method (second column) with that from the K-means clustering method. The black dots in Table 3 represent the GO terms that were selected by both methods. In summary, there are some GO terms that can only be detected by the proposed method, such as GO:0000209, GO:0000079, GO:0009086 and GO:0005978. In particular, all GO terms in cluster 5 of our proposed method are closely related to biosynthesis. The three GO terms in cluster 5, GO:0009086, GO:0006537 and GO:0005978, are rarely overrepresented by the K-means clustering method. Our proposed method not only found the GO terms that were not identified by the K-means method but also grouped them in the same cluster.

Furthermore, the genes in cluster 5 are closely related to the glucose metabolic pathway. For example, GLC3 (GO:0005978) encodes 1,4-glucan-6(1,4-glucano)-transferase, involved in glycogen accumulation. Glycogen in turn serves as a major storage carbohydrate (glucose) (Rowen et al., 1992). Free glucose is oxidized to pyruvate. The other genes from GO:0006537, GO:0006526 and GO:0009086 are related to the synthesis of amino acids. In the citric acid cycle, 15 ATPs and 3 CO2 are produced from one pyruvate molecule. IDP1 (GO:0006526) catalyzes the oxidation of isocitrate to alpha-ketoglutarate (Haselbeck and McAlister-Henn, 1993). GLT1 (GO:0006537) synthesizes glutamate from glutamine and alpha-ketoglutarate (Valenzuela et al., 1998). ARG1, ARG3 and ARG4 in GO:0006526 are involved in the synthesis of arginine from glutamate (Crabeel et al., 1988; Jauniaux et al., 1978). Oxaloacetate, an intermediate in the citric acid cycle, is the entry point for the metabolism of the underlying carbon structure of the amino acids aspartate and asparagine. MET2 (GO:0009086) is involved in the synthesis of methionine from aspartate (Masselot and De Robichon-Szulmajster, 1975). It catalyzes the conversion of homoserine to O-acetyl homoserine using one molecule of acetyl coenzyme A (acetyl-CoA) (Thomas and Surdin-Kerjan, 1997). These findings illustrate that our proposed methodology can identify genes that are biologically interpretable.


    6 CONCLUDING REMARKS
 
The method proposed in this study provides an efficient tool for clustering curves that share the same change pattern, based on Fourier estimation.

Because Fourier coefficients carry information on both the original underlying functions and their derivatives, we used the sample Fourier coefficients of the derivatives to summarize the change patterns. We demonstrated the effectiveness of our approach using model-based clustering of change patterns. We assumed that the residuals within each curve were independent over time and had constant variance; however, owing to the large number of repetitions, we found that it is not necessary to assume independence between curves.
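
As a rough illustration of why derivative coefficients capture the change pattern rather than the level of a curve, the sketch below computes sample Fourier coefficients and converts them into the coefficients of the derivative. It assumes equally spaced time points t_j = j/n on [0, 1) and the period-1 trigonometric basis, which may differ from the basis and normalization used in Section 3; the function names and the two toy curves are hypothetical.

```python
import math


def fourier_coefficients(y, n_terms):
    """Sample Fourier coefficients of y observed at t_j = j/n, j = 0..n-1.

    Assumes the expansion f(t) = a0 + sum_k [a_k cos(2 pi k t) + b_k sin(2 pi k t)].
    """
    n = len(y)
    a0 = sum(y) / n
    a, b = [], []
    for k in range(1, n_terms + 1):
        a.append(2.0 / n * sum(y[j] * math.cos(2 * math.pi * k * j / n) for j in range(n)))
        b.append(2.0 / n * sum(y[j] * math.sin(2 * math.pi * k * j / n) for j in range(n)))
    return a0, a, b


def derivative_coefficients(a, b):
    """Coefficients of f'(t): cosine part 2*pi*k*b_k, sine part -2*pi*k*a_k."""
    da = [2 * math.pi * k * bk for k, bk in enumerate(b, start=1)]
    db = [-2 * math.pi * k * ak for k, ak in enumerate(a, start=1)]
    return da, db


# Two toy curves with the same shape but different expression levels.
n = 18
t = [j / n for j in range(n)]
low = [1.0 + math.sin(2 * math.pi * x) for x in t]
high = [5.0 + math.sin(2 * math.pi * x) for x in t]

for name, curve in (("low", low), ("high", high)):
    a0, a, b = fourier_coefficients(curve, n_terms=2)
    da, db = derivative_coefficients(a, b)
    print(name, round(a0, 3), [round(v, 3) for v in da + db])
```

The constant term a0 differs between the two curves, but it does not enter the derivative, so the derivative coefficients agree. Clustering on these coefficients therefore groups curves by how they change over time, not by their overall level.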

Several areas deserve further research. Determining the number of Fourier coefficients and selecting the number of clusters are topics that many researchers are actively pursuing. A study is also needed on how to handle the instability of Fourier estimation caused by outliers when only a small number of repeated time points is available. Another topic for future research is the development of validity measures that incorporate the probability framework for clusters. When further information about correlations is available, a time series analysis approach would also be worth considering.


    ACKNOWLEDGEMENTS
 
We are grateful to Dr Carroll, Dr Hart and Dr Vannucci for their motivation and advice. We also thank the referees for their constructive comments. This work took place during J.K.’s visit to the Bioinformatics Training Program at the Department of Statistics at Texas A&M University and her research was supported by the Korea Research Foundation (R04-2004-000-10138-0). The work of H.K. was supported by the National Research Laboratory Program of Korea Science and Engineering Foundation (M10500000126).

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Martin Bishop

Received on March 21, 2007; revised on November 7, 2007; accepted on November 8, 2007

    REFERENCES
 

    Azuaje F. A cluster validity framework for genome expression data. Bioinformatics (2002) 18:319–320.

    Banfield JD, Raftery AE. Model-based Gaussian and non-Gaussian clustering. Biometrics (1993) 49:803–821.

    Beissbarth T, Speed TP. GOstat: find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics (2004) 20:1464–1465.

    Beran R, Dumbgen L. Modulation of estimators and confidence sets. Ann. Stat. (1998) 26:1826–1856.

    Crabeel M, et al. Arginine repression of the Saccharomyces cerevisiae ARG1 gene. Comparison of the ARG1 and ARG3 control regions. Curr. Genet. (1988) 3:113–124.

    Ernst J, et al. Clustering short time series gene expression data. Bioinformatics (2005) 21:159–168.

    Eubank R, Hart JD. Testing goodness-of-fit via order selection criteria. Ann. Stat. (1992) 20:1412–1425.

    Fraley C, Raftery AE. MCLUST: software for model-based cluster analysis. J. Classif. (1999) 16:297–306.

    Fraley C, Raftery AE. Model-based clustering, discriminant analysis, and density estimation. J. Am. Stat. Assoc. (2002) 97:611–631.

    Freedman D, Lane D. The empirical distribution of Fourier coefficients. Ann. Stat. (1980) 8:1244–1251.

    Haselbeck RJ, McAlister-Henn L. Function and expression of yeast mitochondrial NAD- and NADP-specific isocitrate dehydrogenases. J. Biol. Chem. (1993) 268:12116–12122.

    Jauniaux JC, et al. Arginine metabolism in Saccharomyces cerevisiae: subcellular localization of the enzymes. J. Bacteriol. (1978) 133:1096–1107.

    Kaufman L, Rousseeuw PJ. Finding Groups in Data: An Introduction to Cluster Analysis (1990) New York: Wiley.

    Kim B, et al. Clustering periodic patterns of gene expression based on Fourier approximations. Curr. Genomics (2006) 7:197–203.

    Lai Y, et al. A statistical method for identifying differential gene-gene co-expression patterns. Bioinformatics (2004) 20:3146–3155.

    Li J, Wong L. Identifying good diagnostic gene groups from gene expression profiles using the concept of emerging patterns. Bioinformatics (2002) 18:725–734.

    Masselot M, De Robichon-Szulmajster H. Methionine biosynthesis in Saccharomyces cerevisiae. I. Genetical analysis of auxotrophic mutants. Mol. Gen. Genet. (1975) 139:121–132.

    MacQueen JB. Some methods for classification and analysis of multivariate observations. In: Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability (1967) 1. Berkeley: University of California Press. 281–297.

    Murtagh F, Raftery AE. Fitting straight lines to point patterns. Pattern Recognit. (1984) 17:479–483.

    Murthy K.RK, Hua LJ. Improved Fourier transform method for unsupervised cell-cycle regulated gene prediction. Proc. IEEE Comput. Syst. Bioinform. Conf. (2004) 194–203.

    Park T, et al. Statistical tests for identifying differentially expressed gene in time-course microarray experiments. Bioinformatics (2003) 19:694–703.

    Pollard D. A central limit theorem for K-means clustering. Ann. Stat. (1982) 10:919–926.

    Rousseeuw PJ. Silhouettes: graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. (1987) 20:53–65.

    Rowen DW, et al. GLC3 and GHA1 of Saccharomyces cerevisiae are allelic and encode the glycogen branching enzyme. Mol. Cell Biol. (1992) 12:22–29.

    Serban N, Wasserman L. CATS: clustering after transformation and smoothing. J. Am. Stat. Assoc. (2005) 471:990–999.

    Spellman PT, et al. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell (1998) 9:3273–3297.

    Thomas D, Surdin-Kerjan Y. Metabolism of sulfur amino acids in Saccharomyces cerevisiae. Microbiol. Mol. Biol. Rev. (1997) 61:503–532.

    Valenzuela L, et al. Regulation of expression of GLT1, the gene encoding glutamate synthase in Saccharomyces cerevisiae. J. Bacteriol. (1998) 180:3533–3540.

    Yeung KY, et al. Model based clustering and data transformations for gene expression data. Bioinformatics (2001) 17:977–998.

    Zhang L, et al. Fourier harmonic approach for visualizing temporal patterns of gene expression data. Proc. IEEE Comput. Syst. Bioinform. Conf. (2003) 2:137–147.

