Skip Navigation


Bioinformatics Advance Access originally published online on December 6, 2005
Bioinformatics 2006 22(3):369-370; doi:10.1093/bioinformatics/bti817
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/3/369    most recent
bti817v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (6)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Valentini, G.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Valentini, G.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Clusterv: a tool for assessing the reliability of clusters discovered in DNA microarray data

Giorgio Valentini

DSI, Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano Via Comelico 39, Italy


    ABSTRACT
 TOP
 ABSTRACT
 REFERENCES
 

Summary: We present a new R package for the assessment of the reliability of clusters discovered in high-dimensional DNA microarray data. The package implements methods based on random projections that approximately preserve distances between examples in the projected subspaces.

Availability: http://homes.dsi.unimi.it/~valenti/SW/clusterv/download/clusterv_1.0.tar.gz

Contact: valentini{at}dsi.unimi.it

Supplementary information: http://homes.dsi.unimi.it/~valenti/SW/clusterv

Three essential aspects of clustering problems for DNA microarray data analyses are as follows (1) the estimate of the number of clusters (Azuaje, 2002); (2) the assessment of the reliability of each individual cluster (McShane et al., 2002) and (3) the assessment of the confidence of cluster assignments for individual samples.

We present a tool that addresses all the above problems.

Some recent approaches to estimate the reliability of the discovered clusters are based on the concept of stability with respect to data perturbation (e.g. through sub-sampling or noise injection) (Ben-Hur et al., 2002; McShane et al., 2002). According to this general framework, reliable clusters are those maintained across multiple perturbations of the data; that is, a clustering is reliable if it is stable with respect to the perturbations. For instance, the model explorer algorithm (Ben-Hur et al., 2002) estimates the number of clusters by perturbing the data with sub-sampling techniques, and a clustering is considered reliable if it is stable with respect to sub-sampling. We instead introduce data perturbations through randomized projections µ : Rd -> Rd', from the original d-dimensional data to lower d'-dimensional subspaces; in this case, a clustering is considered reliable if it is stable with respect to randomized dimensionality reduction, that is, if the clusterings obtained from running a clustering algorithm with multiple instances of randomly projected data are similar. In this context, a key issue consists in maintaining the metric properties of the original data in the projected data instances, because clustering algorithms use distance measures to search for structure in the data. To this end, we introduce random projections that approximately preserve the Euclidean distances between examples, according to the Johnson–Lindenstrauss theory (Johnson and Lindenstrauss, 1984) (see Supplementary information for technical details).

To evaluate the similarity between multiple clusterings performed with multiple instances of randomly projected data, we used a n x n symmetric similarity matrix M, whose elements Mij store the memberships of examples pairs i, j to the same cluster (Dudoit and Fridlyand, 2003):

Formula 1(1)
where i, j isin {1, 2, ..., n}, Cs sub { 1, 2, ..., n} is a cluster returned by a clustering algorithm, k the number of clusters and Formula 1 isin {0, 1}n is the characteristic vector of Cs, i.e. Formula 1, [i] = 1 if i isin Cs, otherwise Formula 1, [i] = 0. In other words Mij denotes if elements i and j belong to the same cluster. Using multiple random projections of the data we generate multiple instances of projected data that are used by a clustering algorithm to provide multiple sets of clusters (clusterings). We then build multiple similarity matrices (one for each clustering) and averaging between them, we obtain a similarity matrix Formula 1 that stores the memberships of example pairs i, j to the same cluster across multiple clusterings.

Using the previously computed similarity matrix, the stability index, s, for an individual cluster, C, is

Formula 2(2)
The index s(C) estimates the stability of a cluster C by measuring how much the projections of the pairs (i, j) isin C x C occur together in the same cluster in the projected subspaces. The stability index has values between 0 and 1: low values indicate no reliable clusters, high values denote stable clusters. Note that stability indices for singleton clusters are computed apart, by counting the occurrences of clusters with only one member across multiple clusterings. An overall measure of the stability of the clustering may be obtained averaging between the stability indices:

Formula 3(3)
In this case also we have that 0 ≤ S(k) ≤ 1, where k is the number of clusters. Note that in Equation (3) k must be larger than 1: indeed if k = 1 then always S = 1. Finally, the assignment-confidence (AC) index estimates the confidence of the assignment of an example i to a cluster C, by measuring the frequency by which i appears with the other elements of the cluster:

Formula 4(4)
The R package clusterv (that stands for cluster-validity) implements a three steps methodology to estimate the reliability of clusters: (1) generation of multiple random projections that approximately preserve the distances between examples; (2) application of suitable clustering algorithms to the original and projected data and (3) computation of a set of measures of reliability and stability.

In particular, clusterv provides a set of functionalities to assess the reliability of clusters in high-dimensional data (Table 1). The stability indices of the clusters, the overall stability index and the AC index for each example can be computed by calling a single high-level function for a large set of clustering algorithms (e.g. hierarchical, k-means, fuzzy k-means or prediction around medoids). For instance the function Random.fuzzy.kmeans.validity applies the fuzzy k-means clustering algorithm to the data, computes the similarity matrix using multiple random subspace projections and then computes the stability indices for each cluster, the overall stability index of the clustering and the set of AC indices for each example. Moreover, using the functions clustering-algorithm-independent, the same reliability analysis can be performed for any distance-based clustering algorithm, as long as its output (the clustering) can be coded as an R vector or a list. For instance, the function Cluster.validity accepts as input the list of the clusters whose validity indices need to be computed and a list of the sets of clusters obtained from the randomly projected subspaces, and returns the stability indices described by Equations (24).


View this table:
[in this window]
[in a new window]
 
Table 1 Clusterv: main functionalities

 
The stability indices appear to be well suited for unsupervised gene expression data analysis. Indeed, stability measures based on random projections exploit the high dimensionality and the redundancy of information owing to the correlation between gene expression levels that characterize DNA microarray data. Moreover, this method can also be naturally applied to other biomedical and physical data, especially if characterized by high dimensionality.

There are some limitations about the usage of the clusterv package: Johnson–Lindenstrauss theory is proved only with Euclidean distances, even if our experimental results show that random projections and the stability indices may be successfully applied to the analysis of cluster reliability using Manhattan and correlation-based distances. However, we need more theoretical insights and empirical results to extend the usage of the proposed stability measures in the context of more general metric spaces (e.g. with Chebychev and general Ln-norm-based distances).

Moreover, the reliability measures implemented in clusterv, on one hand, present some advantages with respect to classical indices based on intercluster and intracluster distances, such as no bias toward particular shapes or notion of compactness of the clusters; on the other hand, their values depend on the choice of the clustering algorithm: the stability indices of the clusters may vary if different clustering algorithms are used to compute them.

Examples on how to apply the stability indices to high-dimensional synthetic and DNA microarray data are available online, as well as a tutorial and a reference manual (see Supplementary information).

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Joaquin Dopazo

Received on September 8, 2005; revised on November 16, 2005; accepted on December 2, 2005

    REFERENCES
 TOP
 ABSTRACT
 REFERENCES
 

    Azuaje, F. (2002) A cluster validity framework for genome expression data. Bioinformatics, 18, 319–320[Abstract/Free Full Text].

    Ben-Hur, A., et al. (2002) A stability based method for discovering structure in clustered data. Pac. Symp. Biocomput, . 7, , pp. 6–17.

    Dudoit, S. and Fridlyand, J. (2003) Bagging to improve the accuracy of a clustering procedure. Bioinformatics, 19, 1090–1099[Abstract/Free Full Text].

    Johnson, W. and Lindenstrauss, J. (1984) Extensions of Lipshitz mapping into Hilbert space. Contemporary Mathematics, , Providence, RI American Mathematical Society 26, , pp. 189–206.

    McShane, L.M., et al. (2002) Method for assessing reproducibility of clustering patterns observed in analyses of microarray data. Bioinformatics, 18, 1462–1469[Abstract/Free Full Text].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Brief BioinformHome page
B. Andreopoulos, A. An, X. Wang, and M. Schroeder
A roadmap of clustering algorithms: finding a match for a biomedical application
Brief Bioinform, May 1, 2009; 10(3): 297 - 314.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
L. Brehelin, O. Gascuel, and O. Martin
Using repeated measurements to validate hierarchical gene clusters
Bioinformatics, March 1, 2008; 24(5): 682 - 688.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
Z. Yu, H.-S. Wong, and H. Wang
Graph-based consensus clustering for class discovery from gene expression data
Bioinformatics, November 1, 2007; 23(21): 2888 - 2896.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
G. Valentini
Mosclust: a software library for discovering significant structures in bio-molecular data
Bioinformatics, February 1, 2007; 23(3): 387 - 389.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/3/369    most recent
bti817v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (6)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Valentini, G.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Valentini, G.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?