Bioinformatics Advance Access originally published online on August 20, 2008
Bioinformatics 2008 24(21):2467-2473; doi:10.1093/bioinformatics/btn375
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
An unsupervised conditional random fields approach for clustering gene expression time series
Department of Computer Science, University of Warwick, Coventry, UK
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: There is a growing interest in extracting statistical patterns from gene expression time-series data, in which a key challenge is the development of stable and accurate probabilistic models. Currently popular models, however, would be computationally prohibitive unless some independence assumptions are made to describe large-scale data. We propose an unsupervised conditional random fields (CRF) model to overcome this problem by progressively infusing information into the labelling process through a small variable voting pool.
Results: An unsupervised CRF model is proposed for efficient analysis of gene expression time series and is successfully applied to gene class discovery and class prediction. The proposed model treats each time series as a random field and assigns an optimal cluster label to each time series, so as to partition the time series into clusters without a priori knowledge about the number of clusters and the initial centroids. Another advantage of the proposed method is the relaxation of independence assumptions.
Contact: ctli{at}dcs.warwick.ac.uk
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 BACKGROUND |
|---|
|
|
|---|
A key challenge in genomic signal research is the development of efficient and reliable probabilistic models for analysing gene expression data. As gene expression is a temporal process, it is necessary to record a time course of gene expression in order to determine the set of genes that are expressed under certain conditions, the gene expression levels and the interactions between these genes. While static data are assumed to be time independent, time-course data have strong correlations between successive points. This allows us to fully utilize the information from the experiments, as it reveals the pathway that leads from one state to the next, instead of a stable state, under given conditions.
Nevertheless, as time-course data are obtained from multiple microarrays produced at different time points with replicates, the large amount of data causes difficulties for analysis. This requires appropriate models to identify patterns in such large-scale datasets. To this end, gene expression clustering provides an efficient way to discover groups in large-scale datasets. The underlying assumption in clustering gene expression data is that co-expression indicates co-regulation or relation in functional pathways, so an efficient clustering method should identify genes that share similar functions (Boutros and Okey, 2005). Quite often, the ground truth about the number of clusters in the dataset and the cluster memberships of the genes are unknown, making classifier training infeasible. As such, unsupervised clustering methods requiring no training and no a priori knowledge about the number of clusters are desirable.
Many gene clustering methods for gene class discovery have been proposed in the literature (Culotta et al., 2005; Heard et al., 2006; Husmeier, 2003; Schliep et al., 2005; Wu et al., 2005). One research focus is the development of stable and accurate probabilistic models. The advantage of probabilistic methods over some alternatives is that they allow measurements of uncertainty (Culotta et al., 2005; Heard et al., 2006; Husmeier, 2003; Luan and Li, 2003; Ng et al., 2006). Generative probabilistic models, such as dynamic Bayesian networks (DBNs), are trained to maximize the joint probability of a set of observed data and their corresponding labels (Dojer et al., 2006; Husmeier, 2003). Very often, model generality comes at the cost of computational inefficiency. For instance, mixture models work well for static experiments. However, when it comes to time-series clustering, there are k x n2parameters to be estimated for the k covariance matrices, n being the number of time points (Medvedovic et al., 2004). In addition, mixture models are not effective for time-course data, as they treat time points as discrete and ignore their time order (Ma et al., 2006). Hidden Markov models (HMMs), a special case of DBNs, are another popular method for probabilistic sequence data modelling and have been applied in the fields such as speech recognition and language sequence modelling. As a relatively straightforward machine learning method, they have also been successfully applied to gene expression time-course clustering (Ji et al., 2003; Schliep et al., 2005). A first-order HMM model can be expressed as a joint probability distribution of the observed data x and the corresponding label/state y, such that
|
| (1) |
Conditional probabilistic models such as conditional random fields (CRFs) (Culotta et al., 2005; Lafferty et al., 2001), on the other hand, define a conditional probability over labels given observations. They label a pattern/time series by selecting the label sequence that maximizes the conditional probability, without relying on independence assumptions. They are efficient and typically outperform generative probabilistic models such as HMMs in a number of tasks such as language sequence modelling and gene sequence prediction (Culotta et al., 2005; Lafferty et al., 2001) because they allow arbitrary dependencies on observations. Let X be a random variable over the observations and Y be a random variable over the corresponding labels. A CRF model can be expressed as a joint distribution of Y given X
|
| (2) |
| 2 METHODS |
|---|
|
|
|---|
This section explains how the gene expression time series are transformed into a multi-dimensional feature space, with the temporal correlation and time intervals taken into account, and how the proposed CRF model is formulated. In order not to confuse the reader with details, we first present the whole picture of the proposed algorithm as follows.
2.1 Data transformation
Given a set of
-time-point data, to take the temporal relationship and interval between time points into account, a simple linear transformation is carried out to transform the dataset into a (
–1)-dimensional feature space. Let n denote the number of time series, i.e. the number of genes. Each original time series Ti of length
, i=1,2,...,n is now mapped to a point xi in the (
–1)-dimensional feature space with the t-th component, xi(t), defined as
|
| (3) |
2.2 Model formulation
As mentioned in Section 1, involving the whole dataset incurs high computational complexity and low convergence rate. To alleviate this problem, we exploit the local characteristic (also known as Markovianity) of Markov random fields to condition the learning process of our model. Let X={x1, x2,...,xn} denote a set of n observed time series and Y={y1, y2, ..., yn} the set of the corresponding labels. The objective of the learning is to assign an optimal class label yi to time series i conditioned on observed data xi, observed data xj and class labels yj, for all j in a small neighbourhood Ni, which we call the voting pool (See Section 2.2.1 for details). Therefore, instead of taking the general form of Equation (2), the proposed model can be specifically expressed as
|
| (4) |
) and P(yi|yNi,
) are the conditional probability distribution and the prior, respectively, and
is a set of parameters, as described later. For the sake of conciseness, we will use xNi and yNi to represent the observed data and class labels of all the members in Ni, respectively. We will also drop
from the writing of the model.
The CRF model of Equation (4) can also be expressed in the Gibbs form (Geman and Geman, 1984) in terms of cost functions U
(xi,xNi|yi,yNi) and U
(yi|yNi), which are associated with the conditional probability and the prior of Equation(4), respectively, as
|
| (5) |
- Not using cluster centroids in the labelling process, but the relative similarity between each time series and the member time series in a randomly formed voting pool. By random we mean, for each time series, the members in its voting pool are different in different iterations.
- Allowing each individual time series to be a singleton cluster and to interact with the time series in the voting pool to find its own identity progressively.
Before the first iteration of the labelling process starts, each time series is assigned a label randomly picked from the integer range [1, n], where n is the number of samples. Therefore, the algorithm starts with a set of n singleton clusters without the user specifying the number of clusters.
2.2.1 Voting pool
In each iteration, when a time series i is visited, its voting pool Ni is formed with k time series. The voting pool includes s most similar (MS) time series in Euclidean space, one most different (MD) time series and k–s–1 time series selected at random. The MS and the MD time series are selected from the voting pool once the learning process starts. That is, in any iteration, if a voting time series, j, selected at random is more similar to time series i than the s-th MS member is to i, the s-th MS member is replaced by j. By the same token, if a voting time series, j, selected at random is more different from time series i than the current MD member is from i, the current MD member is replaced by j. Subsequently, each time series is assigned an optimal label decided by a cost function defined later. The MD member plays the role of informing the algorithm of what label (i.e. the one associated with the MD member) to avoid, while the MS members are for maintaining local information and informing the algorithm what labels to choose. The k–s–1 members selected at random are for introducing new information, which includes global and local information depending on what new members (intra-class or inter-class) are selected, in the learning process, and for replacing the MD and MS members, if more appropriate ones are selected. Therefore, we can see that, without involving the shear number of time series of the whole dataset, as the algorithm iterates, the local information becomes more accurate and more information are infused into the voting process. It is this progressive manner of information infusion that reduces the computational complexity.
To allow the algorithm to work without the user specifying the number of clusters, when each time series is visited, only the class labels currently assigned to the members of its voting pool are taken as candidates. The advantages of randomizing the initial label configuration with a label space of n labels and using the proposed voting pool are now clear. First, globally the number of labels allowed is equal to the number of time series, n. Second, the computational complexity is kept low because locally the optimal label of each individual time series is sampled from a small label space with its cardinality less than k because there are k members in the voting pool and some may have the same labels. The cardinality of this small label space will decrease in due course as the homogeneous clusters progressively form. Although starting from a random initial label configuration, through the interaction with the members of the voting pool, an unique optimal label for each cluster can be quickly identified.
The voting pool size k, as one of the parameter in
of Equation (4), is determined by a few factors such as the size of dataset and desired computing efficiency. The impact of k is analysed in Section 3.1 and tabulated in Table 1.
|
The reason clusters merge and eventually converge to a certain number of final clusters is 2-fold. First, the s MS time series of each time series in question, i, encountered since the initial iteration is always kept in i's voting pool and the algorithm tends to assign the label possessed by the majority of the s similar members according to the cost function defined in the next sub-section. Second, at each iteration, k–s–1 random time series are selected to be new members of i's voting pool and any of the s MS members in the voting pool may be replaced if more similar time series are among the new members. These new members allow diversity to be introduced into the voting pool. It is the majority of the s MS members and the diversity brought in by the new members that facilitate cluster merge and eventually converge to a certain number of final clusters.
2.2.2 Cost function
Like the objective functions in other optimization problems, the two cost functions U
(·) and U
(·), of Equation (5) encodes the observed data and labels of the members of the voting pool, and aims to encourage good labelling with low cost and penalize poor labelling with high cost. Defining a feasible cost function is important, but by no means trivial. First, since the two cost functions in Equation (5) are dependent on the same set of arguments, by combining the two, we obtain a new model as
|
| (6) |
Second, as the proposed algorithm does not require the user to specify the number of clusters and requires no information about cluster centroids, therefore instead of using the distances between time series and cluster centroids, the observed data is encoded in the form of pair-wise potential, Wi,j, between paired time series i and j, defined as
|
| (7) |
|
| (8) |
in Equation (4). Many values of m have been tried and we found that any values greater than or equal to 3 virtually make no difference in estimating D. Therefore, we let m equal 3 in our experiments conducted in Section 3.
The cost function, Ui(·), is defined as the sum of the pairwise potentials, Wi,j, between time series i and the voting members in its voting pool, Ni
|
| (9) |
The assignment of a label
to a time series i is determined based on the cost function Ui(·) as follows:
|
| (10) |
| 3 RESULTS AND DISCUSSION |
|---|
|
|
|---|
3.1 Evaluation on toy cases
Since the voting pool plays a key part in determining the performance of the proposed algorithm, to demonstrate how the constituent members of the voting pool affect the performance, we first tested the algorithm on various toy datasets of three-dimensional patterns consisting of five clusters, each having 30 patterns. The algorithm is tested with the size k of the voting pool set to 4, 6, 10 and 18, while the number of MS samples s is always fixed at 1. That is to say, for each value of k, there are one MS, one MD and (k–2) random samples in the voting pool. In order to get objective statistics during each run of the algorithm, a new sample set is used. For each sample set, the five clusters Ci, i=1, 2, 3, 4, 5, are randomly created with the five centroids fixed at µ1=(40, 45, 100), µ2=(85, 60, 100), µ3=(65, 100, 80), µ4=(55, 120, 120), µ5=(120, 50, 120), respectively, and the same variance of 64. Supplementary Figure 1 shows one of the sample set used in our experiments. The clustering performance after repeating the algorithm for 100 times is listed in Table 1. Three cases have been investigated:
- MS and MD included: This is the case in which both the MS and the MD samples are included in the voting pool.
- MD excluded: This is the case in which the MS is included, while the MD is not. An additional random sample is included in place of the MD.
- MS excluded: This is the case in which the MS is not included, while the MD is. An additional random sample is included in place of the MS.
|
The second case (MD excluded) is intended to evaluate the performance of the proposed algorithm when the MD sample is not involved in the voting process. When the size of the voting pool is not big enough (k=4 and 6 in Table 1) to yield reasonable clustering in terms of error rate, excluding the MD sample from the voting pool tends to have better performance in terms of average iteration and error rate. That is, because the MD sample is only updated when a more distant sample is found, while the additional sample in place of the MD (when the MD is excluded) is picked at random, therefore providing more chances for interaction. However, when the size of the voting pool is big enough (k= 10 and 18 in Table 1) to yield reasonable clustering in terms of clustering error rate, including the MD sample in the voting pool becomes beneficial in terms of computational cost (average iteration). The third case (MS excluded) is to demonstrate the importance of the MS sample. Note that the MD sample can only tell the algorithm which particular label should not be assigned to the sample in question, while the MS sample points out which label to be assigned. Even with the assistance of the MD, without the MS, the algorithm tends to guess which label is the correct one. As indicated in Table 1, when k equals 4 and 6 the algorithm never converges. When k is increased to either 10 or 18, successful clustering become possible, but the computational cost is unacceptably higher than the cases when the MS is included.
We have applied k-means clustering to the same five-cluster dataset of our toy case with k varying from 3 to 6. As expected, the k-means clustering algorithm always under-cluster the dataset when k is mistakenly set to 3 or 4, and over-cluster the dataset when k is mistakenly set to 6. When correct value of k (i.e. 5) is provided, the k-means clustering algorithm is still sensitive to the initial centroids. We have run k-means clustering algorithm with k = 5 for 100 times and found that the average clustering error rate is 12.9%, which suggests that the proposed CRF algorithm outperforms the k-means clustering algorithm when comparing to the average clustering error rate in the last two entries of Table 1.
3.2 Experiments on time-series datasets
From the evaluation on the above datasets, we can see that the MS sample plays the most important role in the clustering process. To exploit the power of similar samples in the following experiments, we set the number of MS, s, to
, where
·
is the floor function. We use both biological datasets and simulated datasets for experimental evaluation. Biological data can only provide anecdotal evidence in clustering validation, since the knowledge of gene regulation is far from complete. Annotation information is too inconsistent to support a large-scale evaluation. Simulated datasets are therefore necessary because they can provide more controllable conditions to test an algorithm and a standard for benchmarking (Ernst et al., 2005; Ma et al., 2006). However, to obtain meaningful results, the simulated data need to share statistical characteristics with biological data.
A popular similarity measure of two partitions, the adjusted Rand index (Hubert and Arabie, 1985), is used to assess the clustering accuracy of our algorithm. The adjusted Rand index gives the degree of agreement (in the range of 0–1) between two partitions of a dataset. A high adjusted Rand index indicates a high level of agreement.
3.2.1 Simulated datasets
Following Medvedovic et al. (2004) and Schliep et al. (2005), we synthesized five classes of 60, 60, 60, 80, 60 gene expression profiles, respectively, across 20 time points according to Equation (11), giving 320 profiles in total. Four classes are generated from sine waves, each with different frequency and phase, and the other is generated with a linear function. Gaussian noise
i is added to all of the data as demonstrated in Figure 2 in the Supplementary Material.
|
| (11) |
i and βi are frequency and phase parameters for the sine wave functions and ai and bi are parameters for the linear function, respectively. As discussed in Medvedovic et al. (2004) and Schliep et al. (2005), the mathematical model used to generate the data has been shown to have similar biological and statistical characteristics.
|
The simulated dataset allows us to carry out a more objective evaluation of our proposed algorithm, since the data characteristics and ground truth are known. Figure 1a shows that the adjusted Rand index first climbs to a peak value of 0.9903 when the pool size, k, is 12, then it decreases and is stabilized at around 0.8 when the pool size is greater than 20. The performance of our algorithm with the voting pool size as small as 5 is already good, except that it takes the longest time to converge. A closer look at the clustering results also tells us that, with a too big voting pool, the algorithm arrives at sub-optimum results. In other words, an over-sized voting pool involves too much redundant information. As a result, distinct clusters get merged. On the other hand, the time complexity as shown in Figure 1b tends to decline steadily. As noted earlier, with bigger voting pools, the iteration number drops dramatically, while the time complexity increases for each iteration. Indeed, the impact on the overall time complexity due to the increased time complexity for each iteration is far less than that due to the reduced iteration count. Based on this observation, we arrive at the conclusion that the clustering accuracy and efficiency of the proposed method depend on the choice of the pool size.
Further experiments were carried out for determining appropriate voting pool size. Consider six simulation set-ups: three datasets of 320 gene profiles consisting of 4, 5 and 6 clusters and four datasets of 400 gene profiles consisting of 4, 5 and 6 clusters. The data in Figure 2 are averages of 10 experimental results. In both plots, we can see that the curves of the datasets with a smaller number of samples generally fall earlier than the curves of the datasets with more samples. Moreover, the performance on the dataset with five clusters decreases earlier than that on the dataset with six clusters. Since, for a dataset with only four clusters, the performance of our system is generally good, this may explain why the curves for the dataset with four clusters do not fall before the curves for the datasets with five clusters. However, there is no obvious pattern indicating that the actual number of clusters is an important factor in deciding the optimal voting pool size. The range of the voting pool size that gives good performance is 6–12 for datasets with 320 samples and 8–16 for dataset with 400 samples, respectively. We propose to select the voting pool size in proportion to the size of the dataset. Our experience suggests that a pool size in the range of 2–4% of the dataset size is reasonable.
3.2.2 Yeast galactose dataset
The Yeast galactose dataset (Ideker et al., 2001) consists of gene expression measurements in galactose utilization in Saccharomyces cerevisiae. A subset of 205 measurements of this dataset whose expression patterns reflect four functional categories as illustrated in Supplementary Figure 3 in the Gene Ontology (GO) listings (Ashburner et al., 2000) is used. The experiments are conducted with four repeated measurements across 20 time points. Note that the data in Supplementary Figure 3 is the mean of the four replicates. Thus, we compute the adjusted Rand index of clustering outcomes with the four functional categories as the ground truth for system validation.
|
We compare the adjusted Rand index obtained by our algorithm with that of the algorithm CORE proposed in a recent work (Tjaden, 2006), in which the Yeast galactose dataset is also used in experimental evaluation. Our model has a peak Rand index value of 0.9478 at the pool size of 8 as shown in Figure 3a, otherwise the performance is good with the pool size ranging from 5 to 11. All of them are greater than the peak value of the adjusted Rand index, roughly 0.7, reported in Tjaden (2006). Most importantly, our unsupervised CRFs model captures the correct number of clusters in the dataset, while CORE is essentially a supervised k-means clustering method. In terms of clustering efficiency, the clustering processes of our algorithm finishes within 30 iterations and the consumed time is<10 s in most cases, as shown in Figure 3b.
The performance of our method and Ng et al.'s (2006) mixture model-based method in terms of ARI are 0.948 and 0.978, respectively. Although Ng et al.'s method performs marginally better than ours, comparing to our method, the imitations of their method are: first, it assumes that the observed vectors come from a finite number of clusters/components and thus requires the user to specify the number of clusters/components. This makes the method's performance somehow dependent on the specified number of clusters. Second, as Ng et al. have mentioned that, in their model, the gene profile vectors are not all independently distributed, they circumvent this problem by conditioning the clustering on the unobservable random gene effects and tissue effects, as they believe that given these terms, the gene profiles are all conditionally independent (Ng et al., 2006). Albeit helpful, to some extent, the performance of their method is sensitive to the specification of the random effects, which is by no means trivial. However, they did not suggest ways of specifying them.
3.2.3 Yeast cell-cycle dataset
In the yeast cell-cycle (Y5) (Cho et al., 1998) dataset there are more than 6000 yeast genes measured during two cell cycles at 17 time points. In Schliep et al. (2005), a subset of 384 genes is identified based on their peak times of five phases of the cell cycle: early G1(G1E), late G1(G1L), S, G2 and M phase. The expression levels of each gene were standardized to enhance the performance of model-based methods. In our experiment, we use the same dataset in order to compare the results. Figure 4a in the supplementary Material is the 384 time-series data put together.Figure 4a in the supplement (b)-(f) depict the five individual groups; we can see the subtle distinctions of different peak times among groups.
|
For the Y5 dataset, we evaluated the system performance with different voting pool sizes and depict the results in Figure 4. The best result is obtained at the pool size of 21 with a Rand index of 0.4808. As one can see from Figure 4a, when the pool size of our algorithm is in the range from 13 to 20, the adjusted Rand indexes are all higher than the best result, 0.467, of all of the methods without labelled data, including different HMM models (Schliep et al., 2005), k-means and splines (Bar-Joseph et al., 2002) evaluated in Schliep et al. (2005). Note that the proposed algorithm starts with a random label configuration, without the user providing prior knowledge. An advantage of the type of unsupervised method is that it is independent of annotation data, which are notoriously incomplete (Slonim, 2002). Meanwhile, not all unsupervised methods are able to separate genes from Late G1 and S or M and Early G1. MCLUST (Fraley and Raftery, 2002) is a widely used mixture model-based clustering method. It is unsupervised, not only in determining the number of clusters, but also in selecting the type of model that best fits the data. An R implementation of MCLUST (Fraley and Raftery, 1999) is used in our experiment. Using the EEE (Equal volume, shape and orientation) model, MCLUST yields either four or eight as the optimal number of clusters.
To our knowledge, the clustering results on the Y5 dataset are generally not good for current methods in the literature, in contrast to those of Yeast galactose dataset by the same methods. This is because of the ambiguities among the Y5 groups. As Figure 4a indicates, large pool sizes do not necessarily facilitate better performance. On the contrary, the Rand index starts to decline after a certain point. This is because of noise and outliers that jeopardize the accuracy of the system. Moreover, bigger pool sizes give rise to higher computational cost for each iteration in our CRF model. Despite this, the decline in time complexity with increasing pool size, as shown in Figure 4b, is not surprising, because bigger pool sizes facilitate more global information, thus taking fewer iterations to converge.
3.3 Evaluation of class prediction
Another desirable function of a clustering algorithm, other than class discovery, is its ability to assign genes to known classes, which is also known as class prediction. Class prediction starts with the training of a class predictor, which can then be tested for its accuracy in predicting the classes of the test data. After dividing each of the above three datasets into training and testing sets as shown in Table 2, we trained our proposed clustering algorithm and then used it to predict the classes of the test data. The average prediction accuracies of 27 experiments in terms of Rand index and prediction error rates are cross-tabulated in Table 2.
|
The high-prediction error rate for dataset Y5 is due to the high rate of misclassification of the training set, which is confirmed by a low Rand index. In contrast, the class predictions for the Yeast galactose dataset and the simulated dataset are more accurate when the predictors are well trained as indicated in Figure 5.
|
| 4 CONCLUSIONS |
|---|
|
|
|---|
We presented an efficient unsupervised CRFs model for both gene class discovery and class prediction, which does not require the user to provide a priori knowledge, such as the cluster number and the centroids. The small voting pool (compared to the size of the datasets) progressively updated in each iteration effectively facilitates both local and global interactions and overcomes the high computational complexity problem of most CRF schemes. CRFs offer a desirable set of properties in gene expression time-series analysis. They have great flexibility in capturing arbitrary, overlapping and agglomerative attributes of the data. Most importantly, the conditional nature of CRFs directly results in their efficiency in terms of the relaxation of independence assumptions needed by HMMs.
The biological relevance of our method was demonstrated through statistical analysis of experimental results on the Yeast galactose dataset and the Yeast cell-cycle dataset. By comparing this with the results of recent work, we demonstrated the effectiveness of our CRF system.
Despite its popularity in many applications, the Rand index makes no use of the biological annotation data. This implies the development of new measure for validating gene clustering results. Although, in most cases, the accuracy and completeness of gene annotation is by no means sufficient, these annotation can serve as useful prior knowledge for validating clustering results. It is our intention to take this as a new line of investigations in the near future.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Joaquin Dopazo
Received on March 12, 2008; revised on July 8, 2008; accepted on July 15, 2008
| REFERENCES |
|---|
|
|
|---|
Ashburner M, et al. Gene ontology: tool for the unification of biology. Nat. Genet. (2000) 25:25–29.[CrossRef][Web of Science][Medline]
Bar-Joseph Z, et al. A new approach to analyzing gene expression time series data. In: Proceedings of the Annual International Conference on Computational Molecular Biology. (2002) RECOMB. 39–48.
Boutros PC, Okey AB. Unsupervised pattern recognition: an introduction to the whys and wherefores of clustering microarray data. Brief. Bioinform. (2005) 6:331–343.
Cho RJ, et al. A genome-wide transcriptional analysis of the mitotic cell cycle. Mol. Cell (1998) 2:65–73.[CrossRef][Web of Science][Medline]
Culotta A, et al. Gene prediction with conditional random fields. (2005).
Dojer N, et al. Applying dynamic Bayesian networks to perturbed gene expression data. BMC Bioinformatics (2006) 7.
Ernst J, et al. Clustering short time series gene expression data. Bioinformatics (2005) 21(Suppl. 1).
Fraley C, Raftery A. Mclust: software for model-based cluster and discriminant analysis. (1999).
Fraley C, Raftery A. Model-based clustering, discriminant analysis, and density estimation. J. Am. Stat. Assoc. (2002) 97:611–631.[CrossRef][Web of Science]
Geman S, Geman D. Stochastic relaxation, Gibbs distribution, and Bayesian restoration of images (1984) 6:721–741.
Heard NA, et al. A quantitative study of gene regulation involved in the immune response of anopheline mosquitoes: an application of bayesian hierarchical clustering of curves. J. Am. Stat. Assoc. (2006) 101:18–29.[CrossRef][Web of Science]
Hubert L, Arabie P. Comparing partitions. J. Classif. (1985) 2:193–218.[CrossRef]
Husmeier D. Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic Bayesian networks. Bioinformatics (2003) 19:2271–2282.
Ideker T, et al. Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science (2001) 292:929–934.
Ji X, et al. Mining gene expression data using a novel approach based on hidden markov models. FEBS Lett. (2003) 542:125–131.[CrossRef][Web of Science][Medline]
Lafferty JD, et al. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the Eighteenth International Conference on Machine Learning. (2001) San Francisco, CA, USA: Morgan Kaufmann Publishers Inc. 282–289.
Luan Y, Li H. Clustering of time-course gene expression data using a mixed-effects model with B-splines. Bioinformatics (2003) 19:474–482.
Ma P, et al. A data-driven clustering method for time course gene expression data. Nucleic Acids Res. (2006) 34:1261–1269.
Medvedovic M, et al. Bayesian mixture model based clustering of replicated microarray data. Bioinformatics (2004) 20:1222–1232.
Ng SK, et al. A mixture model with random-effects components for clustering correlated gene-expression profiles. Bioinformatics (2006) 22:1745–1752.
Schliep A, et al. Analyzing gene expression time-courses. IEEE/ACM Trans. Comput. Biol. Bioinform. (2005) 2:179–193.[CrossRef]
Slonim DK. From patterns to pathways: gene expression data analysis comes of age. Nat. Genet. (2002) 32(Suppl):502–508.[CrossRef][Web of Science][Medline]
Tjaden B. An approach for clustering gene expression data with error information. BMC Bioinformatics (2006) 7:17.[CrossRef][Medline]
Wu FX, et al. Dynamic model-based clustering for time-course gene expression data. J. Bioinform. Comput. Biol. (2005) 3:821–836.[CrossRef][Medline]
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||




