Bioinformatics Advance Access originally published online on November 24, 2006
Bioinformatics 2007 23(3):387-389; doi:10.1093/bioinformatics/btl600
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Mosclust: a software library for discovering significant structures in bio-molecular data
DSI, Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano Via Comelico 39, Italy
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Summary: The R package mosclust (model order selection for clustering problems) implements algorithms based on the concept of stability for discovering significant structures in bio-molecular data. The software library provides stability indices obtained through different data perturbations methods (resampling, random projections, noise injection), as well as statistical tests to assess the significance of multi-level structures singled out from the data.
Availability: http://homes.dsi.unimi.it/~valenti/SW/mosclust/download/mosclust_1.0.tar.gz
Contact: valentini{at}dsi.unimi.it
Supplementary information: http://homes.dsi.unimi.it/~valenti/SW/mosclust
The identification and validation of structures underlying complex bio-molecular data is a fundamental step in post-genomic data analysis. Different clustering validation techniques (see Handl et al., 2005 for a recent review), and software tools implementing classical validity indices (such as the Dunn's index and the Silhouette index) have been recently proposed (Bolshakova et al., 2005).
The mosclust software library implements methods based on the concept of stability to estimate the natural number of clusters, or to select a clustering solution from a set generated by a range of different algorithms. Moreover the library may be used to detect multiple structures simultaneously present in complex bio-molecular data, such as sub-clusters inside higher level clusters or multi-level hierarchical structures discovered in gene expression data. In this conceptual framework multiple clusterings are obtained by introducing perturbations into the original data, and a clustering is considered reliable if it is stable, that is if it is approximately maintained across multiple perturbations. The main logical steps of a high-level algorithm for a stability-based evaluation of clustering solutions may be summarized as follows:
- Perform multiple random perturbations of the data.
- Supply the perturbed data to a given clustering algorithm.
- Apply a given clustering similarity measure to multiple pairs of k-clusterings obtained according to steps (1) and (2).
- Use appropriate stability indices, based on the similarity measures applied at step (3), to score the reliability of a given k-clustering.
- Repeat steps (1) to (4) for different numbers k of clusters and elect the most stable clustering(s) as the most reliable.
The proposed library provides similarity measures, stability indices and different random perturbations procedures to assess the reliability of clustering solutions. Moreover statistical tests to assess the significance of the solutions, and to discover multiple structures simultaneously present in the data are also implemented.
More precisely, let Sk (0
Sk
1) be a random variable given by a similarity measure (e.g. Fowlkes and Mallows coefficient) between two k-clusterings obtained from pairs of randomly perturbed data. Using its cumulative distribution function Fk(s), we can obtain the following integral to estimate stability of a given k-clustering:
|
| (1) |
k, for a sufficiently large number n of evaluations of Sk, we have that the following random variables have respectively the normal and
2 distribution:
|
| (2) |
2-based test to find significant clustering solutions (Bertoni and Valentini, 2006). Relaxing the assumption that some probability distributions are normal, we implemented also another statistical test based on the classical Bernstein inequality. In all cases, using these statistical tests, we are able to detect statistically significant structures, comprising also multiple structures (e.g. hierarchical structures) simultaneously present in bio-molecular data. The R package mosclust implements stability methods for unsupervised structure discovery in bio-molecular data through a set of functionalities that may be summarized as follows:
- Data perturbation through random projections, bootstrap resampling techniques or noise injection.
- Functions to compute similarity measures between pairs of perturbed clusterings.
- Functions to compute stability indices for different number of clusters.
- Integrated functions to compute stability indices with widely used clustering algorithms (k-means, hierarchical clustering, Prediction Around Medoids, fuzzy-c-means).
- Functions to perform
2-based and Bernstein inequality-based tests of hypothesis to select k-clustering solutions significant at a given significance level (as well as to estimate the associated P-value).
- Graphical functions to plot histograms and empirical cumulative distribution functions of the similarity measures for different number of clusters, and to plot the p-values for different tests of hypothesis.
As an applicative example of the mosclust R package, we present a stability analysis of the DNA microarray clusters obtained from the well known data set composed by 72 leukemia samples (Golub et al., 1999). With the classical k-means algorithm, using the
2-based statistical test, two and three clusters are predicted at
= 0.01 significance level. The same results are obtained with both subsampling techniques and random projections to lower dimensional subspaces, while using noise injection techniques only two clusters are detected. Figure 1 shows the empirical cumulative distribution functions (ecdfs) of the similarity measures obtained with subsampling and random projections techniques using different numbers k of clusters. Note that the ecdfs for two and three clusters are well separated and below the others, showing that the corresponding clusterings are more stable and reliable. Even if the two-clustering is more stable than the three-clustering [because g(2) < g(3), Equation (1)], their ecdf graphs are quite close, showing that that there is not a well-defined winner: indeed the statistical test does not find any significant difference about the reliability of the two clusterings (at 0.01 significance level), revealing a very likely two-level structure in the data
|
Even if stability-based techniques do not suffer from biases towards particular algorithms, as sometimes is the case with traditional validation techniques based on inter or intra-clusters distances, they may nevertheless exhibit other kinds of bias under certain circumstances. Indeed, as shown by Handl et al. (2005), a given clustering may converge to a suboptimal solution owing to the shape of the data manifold and not to the real structure of the data. Even if this is an open problem for stability-based methods, the usage of different clustering algorithms and a comparison with the results obtained with different techniques (Bolshakova et al., 2005) may in part overcome these problems.
The reference manual and the Supplementary information available at the associated web-site (see Supplementary information) describe in detail the usage, the functionalities and the theoretical background underlying the proposed software library.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: David Rocke
Received on September 10, 2006; revised on November 15, 2006; accepted on November 21, 2006
| REFERENCES |
|---|
|
|
|---|
Ben-Hur, A., et al. (2002) A stability based method for discovering structure in clustered data. Pacific Symp. Biocomput, . 617.
Bertoni, A. and Valentini, G. (2006) Model order selection for clustered bio-molecular data. Probabilistic Modeling and Machine Learning in Structural and Systems BiologyTuusula, Finland, pp. 8590.
Bolshakova, N., et al. (2005) An integrated tool for microarray data clustering and cluster validity assessment. Bioinformatics, 21, 451455
Golub, T.R., et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531537
Handl, J., et al. (2005) Computational cluster validation in post-genomic data analysis. Bioinformatics, 21, 32013215
Jain, A.K. and Dubes, R.C. Algorithms for Clustering Data, (1988) , NJ Prentice Hall.
McShane, L.M., et al. (2002) Method for assessing reproducibility of clustering patterns observed in analyses of microarray data. Bioinformatics, 18, 14621469.
Monti, S., et al. (2003) Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Mach. Learn, . 52, 91118[CrossRef].
Valentini, G. (2006) Clusterv: a tool for assessing the reliability of clusters discovered in DNA microarray data. Bioinformatics, 22, 369370
This article has been cited by other articles:
![]() |
Z. Yu, H.-S. Wong, and H. Wang Graph-based consensus clustering for class discovery from gene expression data Bioinformatics, November 1, 2007; 23(21): 2888 - 2896. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

