Bioinformatics Advance Access originally published online on December 6, 2005
Bioinformatics 2006 22(3):369-370; doi:10.1093/bioinformatics/bti817
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Clusterv: a tool for assessing the reliability of clusters discovered in DNA microarray data
DSI, Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano Via Comelico 39, Italy
| ABSTRACT |
|---|
|
|
|---|
Summary: We present a new R package for the assessment of the reliability of clusters discovered in high-dimensional DNA microarray data. The package implements methods based on random projections that approximately preserve distances between examples in the projected subspaces.
Availability: http://homes.dsi.unimi.it/~valenti/SW/clusterv/download/clusterv_1.0.tar.gz
Contact: valentini{at}dsi.unimi.it
Supplementary information: http://homes.dsi.unimi.it/~valenti/SW/clusterv
Three essential aspects of clustering problems for DNA microarray data analyses are as follows (1) the estimate of the number of clusters (Azuaje, 2002); (2) the assessment of the reliability of each individual cluster (McShane et al., 2002) and (3) the assessment of the confidence of cluster assignments for individual samples.
We present a tool that addresses all the above problems.
Some recent approaches to estimate the reliability of the discovered clusters are based on the concept of stability with respect to data perturbation (e.g. through sub-sampling or noise injection) (Ben-Hur et al., 2002; McShane et al., 2002). According to this general framework, reliable clusters are those maintained across multiple perturbations of the data; that is, a clustering is reliable if it is stable with respect to the perturbations. For instance, the model explorer algorithm (Ben-Hur et al., 2002) estimates the number of clusters by perturbing the data with sub-sampling techniques, and a clustering is considered reliable if it is stable with respect to sub-sampling. We instead introduce data perturbations through randomized projections µ :
d
d', from the original d-dimensional data to lower d'-dimensional subspaces; in this case, a clustering is considered reliable if it is stable with respect to randomized dimensionality reduction, that is, if the clusterings obtained from running a clustering algorithm with multiple instances of randomly projected data are similar. In this context, a key issue consists in maintaining the metric properties of the original data in the projected data instances, because clustering algorithms use distance measures to search for structure in the data. To this end, we introduce random projections that approximately preserve the Euclidean distances between examples, according to the JohnsonLindenstrauss theory (Johnson and Lindenstrauss, 1984) (see Supplementary information for technical details).
To evaluate the similarity between multiple clusterings performed with multiple instances of randomly projected data, we used a n x n symmetric similarity matrix M, whose elements Mij store the memberships of examples pairs i, j to the same cluster (Dudoit and Fridlyand, 2003):
![]() | (1) |
{1, 2, ..., n}, Cs
{ 1, 2, ..., n} is a cluster returned by a clustering algorithm, k the number of clusters and
{0, 1}n is the characteristic vector of Cs, i.e.
, [i] = 1 if i
Cs, otherwise
, [i] = 0. In other words Mij denotes if elements i and j belong to the same cluster. Using multiple random projections of the data we generate multiple instances of projected data that are used by a clustering algorithm to provide multiple sets of clusters (clusterings). We then build multiple similarity matrices (one for each clustering) and averaging between them, we obtain a similarity matrix
that stores the memberships of example pairs i, j to the same cluster across multiple clusterings.
Using the previously computed similarity matrix, the stability index, s, for an individual cluster, C, is
![]() | (2) |
C x C occur together in the same cluster in the projected subspaces. The stability index has values between 0 and 1: low values indicate no reliable clusters, high values denote stable clusters. Note that stability indices for singleton clusters are computed apart, by counting the occurrences of clusters with only one member across multiple clusterings. An overall measure of the stability of the clustering may be obtained averaging between the stability indices:
![]() | (3) |
S(k)
1, where k is the number of clusters. Note that in Equation (3) k must be larger than 1: indeed if k = 1 then always S = 1. Finally, the assignment-confidence (AC) index estimates the confidence of the assignment of an example i to a cluster C, by measuring the frequency by which i appears with the other elements of the cluster:
![]() | (4) |
In particular, clusterv provides a set of functionalities to assess the reliability of clusters in high-dimensional data (Table 1). The stability indices of the clusters, the overall stability index and the AC index for each example can be computed by calling a single high-level function for a large set of clustering algorithms (e.g. hierarchical, k-means, fuzzy k-means or prediction around medoids). For instance the function Random.fuzzy.kmeans.validity applies the fuzzy k-means clustering algorithm to the data, computes the similarity matrix using multiple random subspace projections and then computes the stability indices for each cluster, the overall stability index of the clustering and the set of AC indices for each example. Moreover, using the functions clustering-algorithm-independent, the same reliability analysis can be performed for any distance-based clustering algorithm, as long as its output (the clustering) can be coded as an R vector or a list. For instance, the function Cluster.validity accepts as input the list of the clusters whose validity indices need to be computed and a list of the sets of clusters obtained from the randomly projected subspaces, and returns the stability indices described by Equations (24).
|
The stability indices appear to be well suited for unsupervised gene expression data analysis. Indeed, stability measures based on random projections exploit the high dimensionality and the redundancy of information owing to the correlation between gene expression levels that characterize DNA microarray data. Moreover, this method can also be naturally applied to other biomedical and physical data, especially if characterized by high dimensionality.
There are some limitations about the usage of the clusterv package: JohnsonLindenstrauss theory is proved only with Euclidean distances, even if our experimental results show that random projections and the stability indices may be successfully applied to the analysis of cluster reliability using Manhattan and correlation-based distances. However, we need more theoretical insights and empirical results to extend the usage of the proposed stability measures in the context of more general metric spaces (e.g. with Chebychev and general Ln-norm-based distances).
Moreover, the reliability measures implemented in clusterv, on one hand, present some advantages with respect to classical indices based on intercluster and intracluster distances, such as no bias toward particular shapes or notion of compactness of the clusters; on the other hand, their values depend on the choice of the clustering algorithm: the stability indices of the clusters may vary if different clustering algorithms are used to compute them.
Examples on how to apply the stability indices to high-dimensional synthetic and DNA microarray data are available online, as well as a tutorial and a reference manual (see Supplementary information).
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Joaquin Dopazo
Received on September 8, 2005; revised on November 16, 2005; accepted on December 2, 2005
| REFERENCES |
|---|
|
|
|---|
Azuaje, F. (2002) A cluster validity framework for genome expression data. Bioinformatics, 18, 319320
Ben-Hur, A., et al. (2002) A stability based method for discovering structure in clustered data. Pac. Symp. Biocomput, . 7, , pp. 617.
Dudoit, S. and Fridlyand, J. (2003) Bagging to improve the accuracy of a clustering procedure. Bioinformatics, 19, 10901099
Johnson, W. and Lindenstrauss, J. (1984) Extensions of Lipshitz mapping into Hilbert space. Contemporary Mathematics, , Providence, RI American Mathematical Society 26, , pp. 189206.
McShane, L.M., et al. (2002) Method for assessing reproducibility of clustering patterns observed in analyses of microarray data. Bioinformatics, 18, 14621469
This article has been cited by other articles:
![]() |
B. Andreopoulos, A. An, X. Wang, and M. Schroeder A roadmap of clustering algorithms: finding a match for a biomedical application Brief Bioinform, May 1, 2009; 10(3): 297 - 314. [Abstract] [Full Text] [PDF] |
||||
![]() |
L. Brehelin, O. Gascuel, and O. Martin Using repeated measurements to validate hierarchical gene clusters Bioinformatics, March 1, 2008; 24(5): 682 - 688. [Abstract] [Full Text] [PDF] |
||||
![]() |
Z. Yu, H.-S. Wong, and H. Wang Graph-based consensus clustering for class discovery from gene expression data Bioinformatics, November 1, 2007; 23(21): 2888 - 2896. [Abstract] [Full Text] [PDF] |
||||
![]() |
G. Valentini Mosclust: a software library for discovering significant structures in bio-molecular data Bioinformatics, February 1, 2007; 23(3): 387 - 389. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||





