Bioinformatics Advance Access originally published online on June 27, 2007
Bioinformatics 2007 23(17):2247-2255; doi:10.1093/bioinformatics/btm320
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Penalized and weighted K-means for clustering with scattered objects and prior information in high-throughput biological data
Department of Biostatistics, University of Pittsburgh, Pittsburgh, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Cluster analysis is one of the most important data mining tools for investigating high-throughput biological data. The existence of many scattered objects that should not be clustered has been found to hinder performance of most traditional clustering algorithms in such a high-dimensional complex situation. Very often, additional prior knowledge from databases or previous experiments is also available in the analysis. Excluding scattered objects and incorporating existing prior information are desirable to enhance the clustering performance.
Results: In this article, a class of loss functions is proposed for cluster analysis and applied in high-throughput genomic and proteomic data. Two major extensions from K-means are involved: penalization and weighting. The additive penalty term is used to allow a set of scattered objects without being clustered. Weights are introduced to account for prior information of preferred or prohibited cluster patterns to be identified. Their relationship with the classification likelihood of Gaussian mixture models is explored. Incorporation of good prior information is also shown to improve the global optimization issue in clustering. Applications of the proposed method on simulated data as well as high-throughput data sets from tandem mass spectrometry (MS/MS) and microarray experiments are presented. Our results demonstrate its superior performance over most existing methods and its computational simplicity and extensibility in the application of large complex biological data sets.
Availability: http://www.pitt.edu/~ctseng/research/software.html
Contact: ctseng{at}pitt.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
|
|
|---|
Cluster analysis has been widely applied in various unsupervised data mining problems including microarray analysis, sequence analysis, image segmentation and marketing research. The general problem considers clustering data X = {x1, ... , xn} into k clusters, where each object xi is d-dimensional and k is estimated a priori. The large variety of available methods in the literature seems like a jungle, e.g. Bishop (1995), Gordon (1999), Grabmeier and Rudolph (2002), Jain and Dubes (1988), Jobson (1992), Kaufman and Rousseeuw (1990), McLachlan and Basford (1987), Ripley (1996) and Spaeth (1984), to mention only a few. Although tremendous efforts have been made to benchmark clustering algorithms, it is generally agreed that clustering evaluation measures and performance of each method heavily depend on the original application problem (Messatfa and Zait, 1997; Milligan and Cooper, 1985). In this article, we focus our discussion and comparison on high-throughput biological data, especially in the analysis of microarray expression profiles and mass spectrometry proteomic data. In the literature, classical clustering methods, including hierarchical clustering (Eisen et al., 1998), K-means and its variants, self-organizing maps (Conrads et al., 2003; Tamayo et al., 1999) and mixture model approaches (McLachlan et al., 2002; Yeung et al., 2001), as well as modern techniques, such as gene shaving (Hastie et al., 2000), a graph-theoretical method CLICK (Sharan et al., 2003) and tight clustering (Tseng and Wong, 2005), have been widely applied. A recent comparative study for gene clustering in expression profiles (Thalamuthu et al., 2006) suggests that clustering methods allowing scattered objects not being clustered, with explicit or implicit model assumptions, and with resampling evaluations seem to perform better.
Many clustering methods are based on global optimization of a criterion that measures compatibility of the clustering result to the data. K-means and mixture Gaussian model-based clustering are examples of this category. In the following paragraphs, we will elucidate the connections between the two methods and introduce the motivation for our proposed method. In statistical literature, clustering is often obtained through likelihood-based inference including the mixture maximum-likelihood (ML) approach and the classification maximum-likelihood (CML) approach (Celeux and Govaert 1992; Ganesalingam 1989). In the ML approach, the Rd-valued vectors x1, ... , xn are sampled from a mixture of densities,
, where
j is the probability that the data is generated from cluster j and each f (x|
j) is the density for cluster j from the same parametric family with parameter
j. The log-likelihood to be maximized is
|
|
In the CML approach, the partition C = {Ci, ... , Ck}, where Cj's are disjoint subsets of X = {xi, ... , xn}, is considered as an unknown parameter and is directly pursued in the optimization. This criterion samples data of n1, ... , nk observations in each cluster, where nj's are fixed and unknown and
. Following the convention of Celeux and Govaert (1992), it takes the form
|
|
|
| (1) |
j) is the Gaussian distribution. In addition to likelihood-based inference, many clustering methods have utilized heuristic global optimization criteria. K-means (Hartigan and Wong, 1979) is an effective clustering algorithm in this category and is applied in many applications due to its simplicity. In the K-means criterion, objects are assigned to clusters so that the within cluster sum of squared distance is minimized. It will be shown later (Example 1 in Section 2.1) that K-means is actually a simplified form of the CML sampling scheme under the Gaussian assumption. In this article, we propose a class of loss functions extended from K-means, namely penalized and weighted K-means (PW-K-means). A penalty term is added to allow clustering with scattered objects not being clustered and a weight term is introduced to incorporate prior information.
In the analysis of high-throughput biological data, the importance of allowing a set of scattered objects in cluster analysis has been demonstrated (Dasgupta and Raftery, 1998; Fraley and Raftery, 2002; Thalamuthu et al., 2006; Tseng and Wong, 2005). The resulting clustering assignment is represented as C = {C1, ... , Ck, S}, where clusters Cj are disjoint subsets of X, S is the set of scattered objects and
. In Section 2.1, it will be shown that the penalty term in PW-K-means automatically assigns outlying objects to the S set and the formulation is equivalent to the CML model with a noise set uniformly distributed over the space. For prior information incorporation, most existing methods applied to high-throughput biological data ignored such information in the process of clustering except for a few recent papers: Hanisch et al. (2002), Cheng et al. (2004), Pan (2006) and Huang and Pan (2006). Compared to these approaches, PW-K-means provides a more general and flexible formulation to incorporate prior information in cluster formation. It should be noted that the prior information incorporation we discuss here is different from the well-established field, semi-supervised machine learning (Basu et al., 2004; Pan et al., 2006; Segal et al., 2003). In semi-supervised machine learning, a subset of objects come with known class labels (supervised) while the remaining objects have unknown class labels (unsupervised). The algorithm determines whether the unlabeled objects should be assigned to existing classes or a new cluster(s) should be formed. In prior information incorporation for clustering, the problem is more from Bayesian perspective. Groups of objects (e.g. genes of the same functional annotation) are suggested a priori that they likely but not always cluster together in the data. This is especially true in microarray analysis because gene functional annotations often contain many errors or genes of the same function may not reflect coregulation in the data. The prior information should be used to enhance, instead of force, cluster formation.
The article is structured as the following. In Section 2.1, formulation of PW-K-means is described. Insights and properties of the method are discussed. Three examples of K-means and K-medoids, penalized K-means (P-K-means) without the weighting term, and an explicit formulation of PW-Kmeans for gene clustering in microarray data are then presented. Section 2.2 discusses computational issues of the method including implementation and parameter selection. In Section 3, two sets of simulated data, an application to tandem mass spectrometry data, and an application to gene clustering of microarray data are explored to demonstrate our proposed method. Section 4 contains conclusion and discussion.
| 2 METHODS |
|---|
|
|
|---|
2.1 Formulation of general PW-K-means
A general class of loss function extended from K-means is proposed for clustering purposes:
|
| (2) |
is a tuning parameter representing the degree of penalty of each noise point. The weighting factor w(·;·) is used to incorporate prior knowledge of preferred or prohibited patterns of cluster selections. Minimizing Equation (2) with given weight function w(·;·), distance measure d (·;·), k and
produces a clustering solution. We denote by
). Proposition 1 shows several useful properties of this formulation. In particular,
is inversely related to the number of scattered objects (i.e. |S|). This is a desirable property to control tightness of the resulting clusters in practice. Proof of Proposition 1 is left to Supplementary Material.
Proposition 1. (a) Similar to K-means, if k1 > k2, then W(C*(k1,
); k1,
)
W(C*(k2,
); k2,
). (b) If
1 >
2, then |S*(k,
1)|
|S*(k,
2)|. (c) If
1 >
2, then W(C*(k,
1); k,
1) > W(C*(k,
2); k,
2).
In the remaining of this subsection, three specific examples are illustrated:
Example 1: K-means and K-medoids. The K-means algorithm has been widely used in various applications of clustering (Hartigan and Wong, 1979). The loss function to be minimized is:
|
| (3) |
and
It is known that the K-means algorithm is equivalent to maximizing the classification likelihood under a mixture of identical spherical Gaussian distributions (Proposition 1 in Celeux and Govaert, 1992); namely when
(j = 1, ... , k), maximizing (1) is equivalent to minimizing (3).
Example 2: P-K-means. In this example we discuss a version of penalized K-means:
|
| (4) |
0 is a tuning parameter,
0 is invariant under data scaling and different k. In contrast to K-means above, P-K-means provides flexibility of not assigning all points into clusters and allows a set of scattered points, S. Following Tseng and Wong (2005), scattered points are defined as noises that do not tightly share common patterns with any of the clusters in the data. For clustering problems in complex data such as gene clustering in expression profiles, ignoring scattered genes has been found to dilute identified patterns, make more false positives and even distort cluster formation and interpretation.
We can similarly find relationships between penalized K-means and classification likelihood. If the scattered points in S are uniformly distributed in the hyperspace V (i.e. generated from a homogeneous Poisson process), then the C1-CML criterion becomes
|
| (5) |
Example 3: PW-K-means. In this example we further extend (4) to consider prior information incorporation by adding weights in the cost function:
|
| (6) |
|
| (7) |
and
are tuning parameters and
and
are plotted. The parameters
and
are similar to hyper-parameters in the specification of Bayesian priors, which often reflect the confidence of the prior information. In our experience,
and
. The weight function can be modified to suit different structures of prior information and to serve for different purposes. For example, the trimmed mean can be used in h(·;·) to accommodate the problem of possible outliers and errors in the prior information P.
2.2 Computation and implementation
2.2.1 Implementation
For implementation we modified the K-means routine in a published C clustering library (De Hoon et al., 2004). The optimization is performed through a classification EM algorithm with multiple numbers (denoted by M) of independent random initial values to search for the global optimum of the loss function. Depending on the complexity and structure of the data, a suitably large M is needed to obtain the global optimum solution with high probability. In the simulatioins, we will explore this problem in simulated data and will further show that good prior information in PW-K-means helps to alleviate computational difficulty.
2.2.2 Parameter estimation
In the formulation presented above, the number of clusters k and
(or
0) are both considered known a priori. In cluster analysis, estimation of k received tremendous attention in the literature and still remains a difficult problem in the analyses of most real data sets. Milligan and Cooper (1985) conducted a comprehensive evaluation for more than 30 methods. None showed superior performance in all situations. Some methods outperform the others under certain data distribution assumptions but may perform poorly in other settings.
In this article, we follow the prediction-based resampling method proposed by Tibshirani and Walther (2005) (Breckenridge, 1989; Dudoit and Fridlyand, 2002) for selecting k and
. Given k and
the whole data set is first split into two equal parts: the first is the training set Xtr and the second is the testing set Xte. The main idea involves three steps: (a) cluster the training data Xtr, (b) cluster the testing data Xte and (c) measure how well the training set clustering result predicts co-memberships in the testing data. The correct parameter selection should generate consistent clustering results in training and testing data and produce a good prediction in step (c).
We denote by C(Xtr;k,
) the clustering operation on the training data. Following the convention in Tibshirani and Walther (2005), we denote by D[C(Xtr;k,
), Xte] the (n/2) x (n/2) co-membership matrix in the testing data Xte judged by the clustering result from training data, C(Xtr;k,
). The nearest centroid criterion is used for such judgment, i.e. each point in the testing data is assigned to the nearest cluster centroid of C(Xtr;k,
). For any pair of points i and i' in the testing data, the i–i'-th element of the co-membership matrix
will take the value 1 if both i and i' fall into the same cluster under C(Xtr;k,
) judgment and zero otherwise. We denote by
the resulting cluster indexes from clustering the test data in step (b) such that
and n1, ... , nk+1 are the number of observations in each cluster. The prediction strength of the training and testing data split is defined as
|
| (8) |
as possible with reasonably high prediction strength (| 3 RESULTS |
|---|
|
|
|---|
3.1 Simulation
We perform two simulations to evaluate P-K-means and PW-K-means. The first data contain k = 3 clusters normally distributed in 2D space. A number of n = 50 points is simulated for each cluster from N((–10, 10)T,
2I), N((0, –10)T ,
2I) and N((10, 10)T,
2I) (
= 2). Another m = 50 noise points is uniformly generated in [–20, 20] x [–20, 20] with the restriction that they are at least three SDs from the centers. Under this restriction, no confusion of clustered points and noise points should exist. For the second data set, we apply a hierarchical log-normal model proposed by Thalamuthu et al. (2006) to simulate gene cluster structure in microarray data. Supplementary Figure 3 shows a heatmap presentation of the data with 392 clustered genes of 9 clusters and 392 noise genes in 50 samples/dimensions.
The prediction strength method in the Section 2.2 is used to estimate k and
0 in P-K-means. Figure 1A shows the averaged prediction strength of 10 independent repeated resamplings for various k and
0 in the first data set (M = 100 000). Clearly k = 3 and
0 = 0.2–0.3 give the highest prediction strength. In Supplementary Figure 4 clustering results of P-K-means with k = 3 and
0 = 0.1, 0.2 and 0.8 (Supplementary Fig. 4B, C and D) as well as the original true clustering structure (Supplementary Fig. 4A) are presented. P-K-means with
0 = 0.2 gives clustering identical to the underlying truth. When a small
0 = 0.1 (Fig. B) is selected, a few true cluster points are identified as noises. Conversely when a too large
0 = 0.8 is selected, many true noises are grouped into the clusters. The result demonstrates the inverse relationship of
0 and the size of the noise set in Proposition 1. Figure 1B shows the result of the estimation of K and
0 for the second simulation (M = 100 000). The parameters are correctly estimated (K = 9 and
0 = 0.4–0.5) and the underlying true clustering structure is recovered under this parameter selection.
|
In the above results, large M is applied to guarantee correct global optimization. Next, we investigate effects of various M in the optimization of clustering. We apply the adjusted Rand index (ARI) (Hubert and Arabie, 1985) to evaluate the clustering results. ARI calculates the similarity of a clustering result to the underlying true clustering structure. It takes a maximum value 1 if the clustering result is identical to the underlying truth and takes expected value 0 if the clustering result is obtained from random partitioning. Figure 2 shows the means and standard errors of ARI over 30 independent tries under different numbers of M for the two simulation data sets (
0 = 0.2 for the first data set and
0 = 0.4 for the second). We found that around M = 105–106 independent random initial values are needed to guarantee the global optimum for P-K-means with high probability (triangles) in both sets of simulated data. Suppose good prior information P containing three clustered points randomly selected from each cluster respectively is available and is applied to PW-K-means. The M needed to acquire global optimum becomes
103-fold less than P-K-means in the first simulated data set. Similar result is found in the second simulated data set where good prior information reduces the M needed by
104-fold. This analysis suggests that good prior information for PW-K-means helps to alleviate the local minimum issue in implementation. We also test a prior information randomly selected from the data (bad prior) for PW-K-means. The result is similar to P-K-means as if no prior information is given in the first simulation while the result of bad prior seems to slightly improve P-K-means in the second simulation, possibly because the weights have smoothed the surface of the cost function in the higher dimensional and complex situation.
|
3.2 Clustering MS/MS spectra
Mass spectrometry is an important experimental technique in proteomic research which involves large-scale study of proteins in a tissue. In this example, we discuss mining of data sets from tandem mass spectrometry (MS/MS). Each peptide is a sequence of amino acids (AA) composed of 20 possible choices (AAi, 1
i
20: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y) at each location. Protonated peptides are randomly dissociated in the gas phase under low energy collision-induced dissociation (CID). For example, a peptide sequence AAIANAPR has a certain probability to dissociate at the I-A position, breaking into two sub-sequences, AAI and ANAPR. Similarly it has a different probability to break at the N-A position, resulting in AAIAN and APR. In the experiment, the output fragmentation intensities are measured and are assumed to be proportional to dissociation probabilities. The intensities are all normalized between 0 and 1. We denote by xijk the fragmentation intensity at position AAj-AAk of peptide i (i = 1, ... , 28330, 1
j, k
20, 0
xijk
1). For example, in peptide AAIANAPR we observe intensities at six dissociation positions {(A-A), (A-I), (I-A), (A-N), (N-A) (A-P)} to be {0.15, 0.11, 0.69, 0.49, 0.76 and 0.51} [(P-R) is not observed in the experiment] and the remaining 394 out of 20 x 20 = 400 possible fragmentation positions are not observable (missing). In a given set of peptides X, we observe an empirical distribution of fragmentation intensities {xijk: It has been reported recently that different peptide sequence contents and chemical backgrounds, which we will call motifs hereafter, may contribute to different fragmentation patterns (Huang et al., 2005). For example, in the upper left plot of Figure 3 a quantile map of 674 peptides having one positive charge, ending in R and containing P in the middle is shown. The map clearly shows that the dissociation probabilities at most positions of D-X and E-X (X represents any AA) are significantly higher than other positions. In the lower left plot another motif of 2182 peptides having two positive charges, ending in R and containing P in the middle shows a different fragmentation pattern. X-P positions are easier to break in this motif.
|
Learning the patterns of dissociation strength under different motifs is essential to improve protein identification models and algorithms in the analysis of MS/MS data. A natural question then arises: is it possible to cluster the full 28 830 peptides into clusters of similar fragmentation patterns and in turn learn the underlying motifs contributing to these fragmentation patterns? Since each peptide is only 8–20 AA long, only 1.8% (7/400) to 4.8% (19/400) of fragmentation intensities are observed for each peptide. Calculating distances between any two peptides is usually impossible and most clustering methods become inapplicable to such a data set. On the other hand, the distance from a peptide to a set of peptides is still computable; in fact, we can calculate the distance of a peptide to the center of a peptide set where the center is the simple dimension-wise average ignoring missing values. We let {xi = xijk:1
j,k
20} and define the distance between a peptide xi and a set of peptides Y = {yi} to be |
|
|
|
Note that the fraction in the above distance definition is to adjust for different numbers of missingness. With the above distance definition, the cost function in K-means Equantion (3) and P-K-means Equation (4) can be used for clustering peptides. For demonstration purposes in this article, we only pooled peptides of the two sets shown in the left panel of Figure 3 and applied K-means and P-K-means clustering respectively to the combined 2856 peptides. In the K-means result at the middle panel, we find cluster 1 is identified to have a similar fragmentation pattern as the charge 1 set but with diluted information in D-X and E-X and contaminated information in many positions including L-X and V-X by the many more peptides included (1184 versus 674). P-Kmeans, on the other hand, almost perfectly recovers the two patterns on the right panel. The result shows the importance of excluding scattered peptides in clustering and superior performance of P-K-means. For details of how P-K-means was applied to cluster the complete data set of 28 830 peptides and how a data mining scheme of integrated P-K-means and regression tree was further developed for learning sequence motifs, see Huang et al. (2007).
3.3 Clustering genes in expression profile
An expression profile of the yeast cell cycle from Spellman et al. (1998) is analyzed to evaluate performance of clustering results by different methods. Seventy-seven samples of yeast cells synchronized under various time points of different cell cycle stages were applied to cDNA microarray to analyze patterns of expression changes during the cell cycle. Expression data of 6179 genes are examined of which 1663 genes are used for gene clustering after pre-filtering procedures (deleting genes with missing value more than 20% or SD at log-2 scale less than 0.4). Missing values are imputed and each gene vector is standardized to mean 0 and SD 1 so that expression patterns of differential expression levels become comparable. The data matrix is then input to K-means and P-K-means for gene clustering. Other popular clustering methods are also evaluated.
In Figure 4, Bayesian information criterion (BIC) for Gaussian mixture model (mclust) is used to estimate K (left panel) and prediction strength introduced in Section 2.2 is used to determine K and
(right panel). It is easily seen that the parameter selection in this real data is not as clear as in the simulated examples. BIC can hardly determine a suitable selection of K. The best prediction strength appears at K = 2 and
= 0.55, which does not seem a reasonable selection in this example. If we try to select as many tight clusters as possible with reasonably high prediction strength, we might settle with K = 5 and
=0.5. The ambiguous parameter selection is commonly seen in almost any real-world microarray data set.
|
Genes with similar expression patterns often imply co-regulated biological functions. A common application of gene clustering is to predict functional annotation for novel (unannotated) genes. For a functional annotation F, we suppose a total genome size of G (G = 1663 in this example) genes are analyzed among which D(F) genes are known to be categorized in F from a given biological database. We use the hypergeometric distribution (Tavazoie et al., 1999) to assess the probability of observing at least d (F) genes belonging to F in a cluster C. We denote by n(C) the total number of genes in cluster C. The resulting P-value is given by
|
|
D(F)·n(C)/G when the cluster C is not enriched in functional category F. Intuitively, an unusually large d (F) which corresponds to a small P-value implies that the majority of the genes in this cluster belong to the functional category F. In general, it is difficult to evaluate the performance of a gene clustering in a real-world data set due to lack of underlying truth or gold standard. Biological databases such as Gene Ontology and KEGG can serve for this purpose, however, information from this famous yeast cell cycle data may have been incorporated to the databases and the usage of the databases may cause tautological bias. Here, we cite 104 cell cycle regulated genes (see Supplementary Table 1) from Spellman et al. (1998) that were verified by traditional experimental methods and treat them as the gold standard. Mapping them to the 1663-gene data set and discarding duplicate genes, 87 genes belonging to six disjoint functional categories were obtained: F1(M/G1), F2(late G1 SCB regulated), F3(late G1 MCB regulated), F4(S-phase), F5(S/G2-phase) and F6(G2/M-phase). The remaining 1576 unannotated genes are denoted by F7. Since estimating the number of clusters k in gene clustering is usually difficult, we performed clustering with k = 5, ... , 20 for each method. The resulting clusters are denoted by Cki, where k = 5, ... , 20 and i = 1, ... , k for a given clustering method. For convenience, we denote by dkij the number of genes in cluster i and functional category j when k is number of clusters and Pkij = P(G, D(Fj), n(Cki), dkij) are the corresponding P-values. A contingency table of d5ij (i = 1, ... , 6; j = 1, ... , 7) in P-K-means clustering (k = 5) and functional category distributions is demonstrated in Table 1 with their corresponding P-values (P5ij).
|
In practice, P-values should be adjusted for multiple comparisons in any clustering result inference. Here, we take a different evaluation criterion used in Thalamuthu et al. (2006) (Wu et al., 2002) to compare performance of different clustering methods. To avoid sensitivity of estimation of k in the following evaluation, we pool all clustering results Cki (k = 5 – 20, i = 1 – k) to draw a prediction-accuracy plot. Given a P-value threshold
, we assign all genes in the cluster Cki to functional annotation Fj when the corresponding P-value Pkij is less than
. Thus among n(Cki) annotation predictions made, dkij genes are verified as correct by the biological database. We define predictions made =
= 0.01, then the predictions made = (42 + 98 + 98) = 238 and accuracy = (10 + 4 + 23)/238 = 15.55%. Figure 5 shows the prediction-accuracy plot with predictions made on the x-axis and accuracy on the y-axis. For each clustering method, varying the threshold
produces a different number of predictions made and corresponding accuracy, thus generating a curve.
= 1E – 4, 1E – 5, ... , 1E – 20 was used in the figure and, in general, a more stringent threshold (smaller
) corresponds to fewer number of predictions made and better accuracy. The horizontal dotted line shows the average accuracy under random annotation assignment: 87/1663 = 5.23%. Methods producing curves on the above have better performance. P-values were not calculated and no annotation prediction was made for genes in the noise set. Note, in this approach a gene may be predicted multiple times because multiple clustering results (different k) were empirically assessed together. In Figure 5, we have selected seven methods that are popularly applied in the literature of microarray analysis: hierarchical clustering, SOM, K-means, CLICK, GIMM, model-based (mclust) and tight clustering. Description of the methods and detailed implementation are provided in Supplementary Material. The result shows that P-K-means outperforms K-means due to its ability to leave scattered genes unclustered. Intuitively, P-K-means is almost equivalent to the simplified form of identical spherical covariance structure in mclust except that P-K-means follows CML sampling scheme and mclust is an ML approach. It is found that mclust also has better performance than K-means and is similar to P-K-means (
= 0.4 and 0.35). Tight clustering has similar performance to P-K-means with
= 0.35. It should, however, be noted that tight clustering relies on resampling assessment and the computation is much heavier than P-K-means. P-K-means with
= 0.3 generates the tightest clusters with highest accuracy. The number of predictions made is, however, much fewer than the other methods. All other methods seem to provide worse prediction accuracy.
|
Finally, we examine the usefulness of PW-K-means. We notice from Supplementary Figure 1 that the eight genes in F4 share a tightly co-regulated pattern. In fact, it is widely known that the eight genes are core histone genes required for chromatin assembly and chromosome function, and their modifications play central roles in a variety of chromosomal processes during S phase. Unfortunately in Table 1 we find that all eight genes are left in the noise set S in the P-K-means clustering result. To demonstrate the ability of PW-K-means to incorporate prior information, we randomly select three F4 genes as the prior information P. As shown in Table 2 for the PW-Kmeans result (
= 0.4 and
= 2), we were able to recover and identify cluster C3 containing all the eight histone genes. In Table 3, we compare P-K-means and PW-K-means with various
and
and show the frequencies of the eight histone genes (F4) being assigned to scattered gene set in 100 trials with independent starting seeds. We find that P-K-means usually assign the eight histone genes to the scattered set. In PW-K-means, smaller
and smaller
give stronger prior information to increase the probabilities of identifying these important genes in the clusters.
|
|
| 4 DISCUSSION |
|---|
|
|
|---|
In this article, we propose a general class of penalized and weighted K-means for clustering. The method allows a noise set without being clustered and provides the flexibility for prior information incorporation through the weights. We have shown the success and flexibility of this method in high-throughput biological data through simulations, MS/MS and microarray applications. In the literature, many different sampling types of the Gaussian mixture model-based approach have been developed and applied. Different forms of the Bayesian hierarchical models are also proposed for clustering and efficient Markov chain Monte Carlo (MCMC) or Gibbs sampling techniques are used to simulate the posterior distributions (Jain and Neal, 2004; Medvedovic and Sivaganesan, 2002). However, probably none except for Pan (2006) and Huang and Pan (2006) have demonstrated the use of prior information in clustering. Pan (2006) applied a stratified model-based approach to incorporate gene annotation as prior information. The annotation information, however, was not fully utilized to improve the formation of clusters. The weights in our PW-K-means provides a more general and flexible alternative. Huang and Pan (2006) incorporated prior information in a modified shrinkage distance for clustering. The distance of any two genes with the same known gene annotation is shrunken to a fixed smaller proportion. On the other hand, our proposed weight function shrinks all gene distances in the spatial neighborhood of a clustered set of same annotated genes and is expected to be more general and effective.
K-means is known to be a fast algorithm among many clustering methods. As an extension from K-means, PW-K-means only adds limited extra computation in calculating the weight function and the penalty term. Like all optimization-based clustering methods, global optimization of PW-K-means is NP-hard. Multiple random initial values and other stochastic methods have been proposed for such purposes (Biernacki et al., 2003). In general as the number of clusters and noises increase and the data become more complex, computation becomes heavier to approach the global optimization with high probability. In this article, we showed that good prior information reduces the computation efforts required for global optimization.
In the example of MS/MS (Section 3.2), we showed that PW-K-means was capable of clustering a large data set with a high percentage of missing values (>95%), to which most other clustering methods could not be applied. In the microarray example in Section 3.3, we demonstrated that the clustering result of P-K-means had one of the best prediction accuracies. In our experience, P-K-means usually has comparable performance to the resampling-based tight clustering method (Tseng and Wong, 2005) while P-K-means has the advantage of fast computation and the flexibility of choosing different
to easily control the tightness of clusters. On the other hand, PW-K-means can suffer from local optimization results and tight clustering usually provides a more stable solution through the resampling evaluation. As a result, these two methods are complementary to serve different data mining purposes in different types of data sets.
In the cell cycle data, we have shown that both BIC and prediction strength cannot give very clear parameter estimation as in the simulation examples. In tight clustering (Tseng and Wong, 2005), the algorithm attempts to sequentially retrieve tight patterns from the data, which makes the estimation of K a secondary task. This provides certain degree of robustness while the input parameter of the method still needs approximate knowledge of K for the method to perform well. An alternative direction may be to take clustering results of multiple K and
to show a sequential multi-resolution outlook and different degree of tightness of the cluster patterns (Supplementary Fig. 6). In general the parameter estimation (especially estimation of K) is still an open, although old, question in the field and the solution seems to be very data and goal dependent. More research efforts are still needed in the future.
Several issues for PW-K-means can be further pursued in future research. The current formulation does not consider the covariance structure of clusters such as variable dispersion (SD) and non-spherical cluster structures. The weight function in PW-K-means needs further development and improvement for different types of data. For example, we have encountered some applications with prior information of cluster locations (centroids) or information that some points are unlikely to be noise points. We expect further development of PW-K-means will help better data mining and analyses of many high-throughput biological data sets.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
G.C.T. is supported by grant 1 KL2 RR024154-01. The author is indebted to E. Feingold for valuable discussion and comments. The author wishes to thank the editor and the reviewers for many insightful suggestions and comments.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Chris Stoeckert
Received on February 27, 2007; revised on May 11, 2007; accepted on June 8, 2007
| REFERENCES |
|---|
|
|
|---|
Basu S, et al. A probabilistic framework for semi-supervised clustering. (2004) Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 59–68.
Biernacki C, et al. Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models. Comput. Stat. Data Anal (2003) 41:561–575.[CrossRef]
Bishop C. Neural Networks for Pattern Recognition (1995) Oxford, UK: Oxford University Press.
Breckenridge JN. Replicating cluster analysis: method, consistency, and validity. Multivariate Behav. Res (1989) 24:147–161.[CrossRef][Web of Science]
Celeux G, Govaert G. A classification EM algorithm for clustering and two stochastic versions. Comput. Stat. Data Anal (1992) 14:315–332.[CrossRef]
Cheng J, et al. A knowledge-based clustering algorithm driven by Gene Ontol-ogy. J. Biopharm. Stat (2004) 14:687–700.[CrossRef][Medline]
Conrads TP, et al. Cancer diagnosis using proteomic patterns. Expert Rev. Mol. Diagn (2003) 3:411–420.[CrossRef][Web of Science][Medline]
Dasgupta A, Raftery AE. Detecting features in spatial point processes with clutter via model-based clustering. J. Am. Stat. Assoc (1998) 93:294–302.[CrossRef][Web of Science]
De Hoon MJL, et al. Open source clustering software. Bioinformatics (2004) 20:1453–1454.
Dudoit S, Fridlyand J. A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biol (2002) 3:0036.1–0036.21.
Fraley C, Raftery AE. Model-based clustering, discriminant analysis, and density estimation. J. Am. Stat. Assoc (2002) 97:611–631.[CrossRef][Web of Science]
Ganesalingam S. Classification and mixture approach to clustering via maximum likelihood. Appl. Stat (1989) 38:455–566.[CrossRef]
Gordon AD. Classification (1999) 2nd. Chapman & Hall/CRC, Boca Raton, USA.
Grabmeier J, Rudolph A. Techniques of cluster algorithms in data mining. Data Mining Knowl. Discov (2002) 6:303–360.[CrossRef]
Hanisch D, et al. Co-clustering of biological networks and gene expression data. Bioinformatics (2002) 18:145–154.
Hartigan JA, Wong MA. A K-means clustering algorithm. Appl. Stat (1979) 28:100–108.[CrossRef]
Hastie T, et al. Gene shaving as a method for identifying distinct sets of genes with similar expression patterns. Genome Biol (2000) 1:0003.1–0003.21.
Huang D, Pan W. Incorporating biological knowledge into distance-based clustering analysis of microarray gene expression data. Bioinformatics (2006) 22:1259–1268.
Huang Y, et al. Statistical characterization of charge state and residue dependence of low energy CID peptide dissociation patterns. Anal. Chem (2005) 77:5800–5813.[Medline]
Huang Y, et al. A data mining scheme for identifying peptide structural motifs responsible for different MS/MS fragmentation intensity patterns. In: Journal of Proteomic Research (2007) (in revision).
Hubert J, Arabie P. Comparing partitions. J. Classific (1985) 2:193–218.[CrossRef]
Jain AK, Dubes RC. Algorithms for Clustering Data (1988) New York: Wiley.
Jain S, Neal RM. A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. J. Comput. Graph. Stat (2004) 13:158–182.[CrossRef]
Jobson JD. Applied Multivariate Data Analysis (1992) New York: Springer Bd.I and II.
Kaufman L, Rousseeuw PJ. Finding Groups in Data (1990) New York: Wiley.
McLachlan GJ, Basford KE. Mixture Models (1987) New York: Marcel Dekker.
McLachlan GJ, et al. A mixture model-based approach to the clustering of microarray expression data. Bioinformatics (2002) 18:413–422.
Medvedovic M, Sivaganesan S. Bayesian infinite mixture model-based clustering of gene expression profiles. Bioinformatics (2002) 18:1194–1206.
Messatfa H, Zait M. A comparative study of clustering methods. Future Generation Comput. Syst (1997) 13:149–159.[CrossRef]
Milligan GW, Cooper MC. An examination of procedures for determining the number of clusters in a data set. Psychometrika (1985) 50:159–179.[CrossRef][Web of Science]
Pan W. Incorporating gene functions as priors in model-based clustering of microarray gene expression data. Bioinformatics (2006) 22:795–801.
Pan W, et al. Semi-supervised learning via penalized mixture model with application to microarray sample classification. Bioinformatics (2006) 22:2388–2395.
Ripley BD. Pattern Recognition and Neural Network (1996) Oxford, UK: Cambridge University Press.
Segal E, et al. Discovering molecular pathways from protein interaction and gene expression data. Bioinformatics (2003) 19:i264–i271.[Abstract]
Sharan R, et al. CLICK and EXPANDER: a system for clustering and visualizing gene expression data. Bioinformatics (2003) 19:1787–1799.
Spaeth H. Cluster Analysis Algorithms (1984) Chicester: Ellis Horwood Limited.
Spellman PT, et al. Comprehensive identification of cell cycle-regulated genes of the yeast saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell (1998) 9:3273–3297.
Tamayo P, et al. Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. PNAS (1999) 96:2907–2912.
Tavazoie S, et al. Systematic determination of genetic network architecture. Nat. Genet (1999) 22:281–285.[CrossRef][Web of Science][Medline]
Thalamuthu A, et al. Evaluation and comparison of gene clustering methods in microarray analysis. Bioinformatics (2006) 22:2405–2412.
Tibshirani R, Walther G. Cluster validation by prediction strength. J. Comput. Graph. Stat (2005) 14:511–528.[CrossRef]
Tseng GC, Wong WH. Tight clustering : a resampling-based approach for identifying stable and tight patterns in data. Biometrics (2005) 61:10–16.[CrossRef][Web of Science][Medline]
Wu LF, et al. Large-scale prediction of Saccharomyces cerevisiae gene function using overlapping transcriptional clusters. Nat. Genet (2002) 31:255–265.[CrossRef][Web of Science][Medline]
Yeung KY, et al. Model-based clustering and data transformations for gene expression data. Bioinformatics (2001) 17:977–987.
This article has been cited by other articles:
![]() |
L. Ni, C. Bruce, C. Hart, J. Leigh-Bell, D. Gelperin, L. Umansky, M. B. Gerstein, and M. Snyder Dynamic and complex transcription factor binding during an inducible response in yeast Genes & Dev., June 1, 2009; 23(11): 1351 - 1363. [Abstract] [Full Text] [PDF] |
||||
![]() |
W. Pan Network-based multiple locus linkage analysis of expression traits Bioinformatics, June 1, 2009; 25(11): 1390 - 1396. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. A. Shabalin, H. Tjelmeland, C. Fan, C. M. Perou, and A. B. Nobel Merging two gene-expression studies via cross-platform normalization Bioinformatics, May 1, 2008; 24(9): 1154 - 1160. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||








