Bioinformatics Advance Access originally published online on April 10, 2006
Bioinformatics 2006 22(12):1538-1539; doi:10.1093/bioinformatics/btl129
ArrayCluster: an analytic tool for clustering, data visualization and module finder on gene expression profiles
1 Human Genome Center, Institute of Medical Science, University of Tokyo 4-6-1 Shirokanedai, Minato-ku, Tokyo 108-8639, Japan
2 Research Organization of Information and Systems, The Institute of Statistical Mathematics 4-6-7 Minami-Azabu, Minato-ku, Tokyo 106-8569, Japan
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Summary: One of the significant challenges in gene expression analysis is to find unknown subtypes of several diseases at the molecular level. This task can be addressed by grouping the gene expression patterns of the collected samples on the basis of a large number of genes. However, applying commonly used clustering methods to such a dataset is likely to fail owing to over-learning, because the number of samples to be grouped is much smaller than the data dimension, which equals the number of genes in the dataset. To overcome this difficulty, we developed a novel model-based clustering method, referred to as the mixed factors analysis. The ArrayCluster is freely available software that performs the mixed factors analysis. It provides analytic tools for clustering DNA microarray experiments, data visualization and automatic detection of transcriptional modules of genes that are relevant to the calibrated molecular subtypes.
Availability: The ArrayCluster can be used free of charge for non-commercial and academic purposes and can be downloaded from http://www.ism.ac.jp/~higuchi/arraycluster.htm
Contact: yoshidar{at}ims.u-tokyo.ac.jp
| 1 OVERVIEW |
|---|
Typical microarray data have a fairly small sample size, fewer than 100, whereas the number of genes involved is more than several thousand. One purpose of cluster analysis in microarray studies is to find the molecular subtypes present in a set of collected samples. One major difficulty in this problem is that the number of samples to be clustered is much smaller than the dimension of the data, which equals the number of genes of interest. Owing to the high dimensionality of the data, the applicability of most ready-made clustering techniques, e.g. k-means, Gaussian mixture clustering and hierarchical clustering, is limited by over-learning.
The mixed factors analysis was originally proposed by Yoshida et al. (2004) to overcome the curse of dimensionality faced in gene expression analysis. The key idea of this approach is to establish a parametric model, referred to as the mixed factors model, which is an extension of the classical factor analyzer. This model provides a parsimonious parameterization of the Gaussian mixture distribution. Consequently, we can avoid over-learning of the Gaussian mixture model even if the dimension of the data is quite large relative to the sample size, e.g. when the number of genes is more than several thousand and the sample size is less than 100.
The mixture of factor analyzers (MFA; McLachlan and Peel, 2000; McLachlan et al., 2002), which is an extension of the mixture of probabilistic principal component analyzers (MPPCA; Tipping and Bishop, 1999), is closely related to our model. These models characterize more flexible geometric features of clusters than the mixed factors model does. However, in microarray studies they may still suffer from over-fitting in practice. Whereas the number of free parameters of MFA or MPPCA grows quickly as the number of clusters increases, the mixed factors model limits this growth. Some researchers may wish to group tissue samples into a large number of clusters across several thousand genes. In such a situation, our method can be used without modification.
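The difference in parameter growth can be made concrete with a rough count. The sketch below is illustrative only: the counting conventions are assumptions rather than formulas from the article. For the mixed factors model it assumes a shared d x q loading matrix with orthonormal columns, one isotropic noise variance, and cluster-specific factor means and diagonal factor variances. For MFA it assumes cluster-specific means, loadings and diagonal noise variances. The function names are hypothetical.

```python
def n_params_mixed_factors(d, q, G):
    """Rough free-parameter count for the mixed factors model (assumed conventions)."""
    loadings = d * q - q * (q + 1) // 2  # shared d x q matrix with orthonormal columns
    mixing = G - 1                       # mixing proportions sum to one
    means = G * q                        # one q-dimensional factor mean per cluster
    variances = G * q                    # diagonal factor covariance per cluster
    noise = 1                            # single isotropic noise variance
    return loadings + mixing + means + variances + noise


def n_params_mfa(d, q, G):
    """Rough free-parameter count for a mixture of factor analyzers (assumed conventions)."""
    mixing = G - 1
    means = G * d                                # one d-dimensional mean per cluster
    loadings = G * (d * q - q * (q - 1) // 2)    # per-cluster loadings, up to rotation
    noise = G * d                                # per-cluster diagonal noise variances
    return mixing + means + loadings + noise


if __name__ == "__main__":
    d, q = 2000, 3  # hypothetical numbers of genes and factors
    for G in (2, 4, 8):
        print(G, n_params_mixed_factors(d, q, G), n_params_mfa(d, q, G))
```

Under these assumptions, each additional cluster adds 2q + 1 parameters to the mixed factors model, whereas the MFA count grows by a term proportional to d.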
| 2 MIXED FACTORS MODEL |
|---|
Suppose that we have d-dimensional data of sample size N, denoted by $x_j \in \mathbb{R}^d$, $j = 1, \ldots, N$. In our situation, N and d ($N \ll d$) are given by the number of microarrays and genes, respectively. The starting point of the mixed factors analysis is to establish a linear equation that relates the data vector to the lower-dimensional vector of factor variables, $f_j \in \mathbb{R}^q$ with $q < d$, in the following way:
$$x_j = A f_j + \varepsilon_j, \tag{1}$$
where the noise $\varepsilon_j$ is independently distributed according to $\varepsilon_j \sim N(0, \lambda I)$ and $I$ denotes the identity matrix. The $d \times q$ matrix $A$ contains the factor loadings.
Our primary intention is to describe the group structure of the data parsimoniously on the basis of the factor variables. To this end, we devise the mixed factors that follow a G-component Gaussian mixture as
$$f_j \sim \sum_{g=1}^{G} \alpha_g \, \phi(f_j; \mu_g, \Sigma_g), \tag{2}$$
where $\alpha_g$, $g = 1, \ldots, G$, are the mixing proportions, and each $\phi(\cdot\,; \mu_g, \Sigma_g)$ denotes the Gaussian density with the mean $\mu_g$ and the diagonal covariance matrix $\Sigma_g$ for $g = 1, \ldots, G$. We define the mixed factors model by combining (1) and (2).
Given these assumptions, the unconditional distribution of the data vector is represented by the Gaussian mixture in the form of
$$x_j \sim \sum_{g=1}^{G} \alpha_g \, \phi\left(x_j; A \mu_g, A \Sigma_g A^{\mathsf{T}} + \lambda I\right).$$
As in conventional factor analysis, the mixed factors model has a parameter redundancy due to the rotational ambiguity of the factor vector. To avoid it, we impose orthogonality on the q columns of the factor loading matrix, i.e. $A^{\mathsf{T}} A = I$. This constraint leads to a canonical representation of the mixed factors model as
![]() | (3) |
where the factor vectors in $\mathbb{R}^q$ are distributed according to
![]() | (4) |
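The generative structure of this section can be illustrated with a small simulation. The following is a minimal sketch, not the ArrayCluster implementation: it draws factors from a Gaussian mixture with diagonal covariances, maps them through a loading matrix with orthonormal columns, adds isotropic noise, and then assigns samples to clusters by the Bayes rule using the unconditional Gaussian mixture with component means A µg and covariances A Σg Aᵀ + (noise variance) I. The true parameters are used in place of EM estimates, and all names (`simulate_mixed_factors`, `posterior_probs`, `lam` for the noise variance, `alphas` for the mixing proportions) and all numerical values are hypothetical.

```python
import numpy as np


def simulate_mixed_factors(N, A, alphas, mus, sigmas, lam, rng):
    """Draw N samples x_j = A f_j + eps_j with f_j from a Gaussian mixture."""
    d, q = A.shape
    labels = rng.choice(len(alphas), size=N, p=alphas)
    # f_j given cluster g is N(mu_g, diag(sigma_g)); sigmas holds diagonal variances
    f = mus[labels] + rng.standard_normal((N, q)) * np.sqrt(sigmas[labels])
    eps = rng.standard_normal((N, d)) * np.sqrt(lam)  # isotropic noise
    return f @ A.T + eps, labels


def posterior_probs(X, A, alphas, mus, sigmas, lam):
    """Bayes-rule cluster probabilities under the unconditional Gaussian mixture."""
    N, d = X.shape
    log_post = np.empty((N, len(alphas)))
    for g, alpha in enumerate(alphas):
        mean = A @ mus[g]
        cov = A @ np.diag(sigmas[g]) @ A.T + lam * np.eye(d)
        chol = np.linalg.cholesky(cov)
        z = np.linalg.solve(chol, (X - mean).T)      # whitened residuals, shape (d, N)
        maha = np.sum(z ** 2, axis=0)                # squared Mahalanobis distances
        logdet = 2.0 * np.sum(np.log(np.diag(chol)))
        log_post[:, g] = np.log(alpha) - 0.5 * (d * np.log(2 * np.pi) + logdet + maha)
    log_post -= log_post.max(axis=1, keepdims=True)  # stabilise before exponentiating
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
d, q, G, N = 200, 2, 3, 60                           # N much smaller than d
A, _ = np.linalg.qr(rng.standard_normal((d, q)))     # orthonormal columns
alphas = np.array([0.5, 0.3, 0.2])
mus = np.array([[-6.0, 0.0], [6.0, 0.0], [0.0, 8.0]])
sigmas = np.ones((G, q))
lam = 1.0

X, labels = simulate_mixed_factors(N, A, alphas, mus, sigmas, lam, rng)
post = posterior_probs(X, A, alphas, mus, sigmas, lam)
assigned = post.argmax(axis=1)                       # Bayes-rule assignment
print("fraction correctly assigned:", (assigned == labels).mean())
```

The dense d x d covariance is formed here only for readability. For several thousand genes, one would likely exploit the low-rank-plus-isotropic structure instead of factorizing the full matrix.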
| 3 ANALYTIC TOOLS |
|---|
The ArrayCluster provides users with an environment for performing the following tasks:
- Parameter estimation of the mixed factors model: The ArrayCluster computes the maximum likelihood estimators by using the EM algorithm.
- Determination of the number of clusters and the factor dimension (the number of group-related modules): These are selected on the basis of the Bayesian information criterion (BIC).
- Clustering based on the Bayes rule
- Dimension reduction of data: This task is addressed in the same way as in the classical factor analysis. However, the mixed factors analysis explicitly reflects the existing group structure of the original data, whereas the classical factor analysis ignores it during the dimension reduction.
- Identification of the group-related genes: In the ArrayCluster, the relevant genes in each module are selected as the top L genes (L is specified by the user) with the highest positive (or negative) correlation with each element of the factor vector.
- Identification of the modules: By separating the genes that are positively and negatively correlated with each element of the factor vector, we identify 2q modules in total.
- Missing data imputation
- Data preprocessing: The methods include normalization and gene filtering.
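The gene selection and module identification steps listed above can be sketched as follows. This is an illustrative reading of those two items, not the ArrayCluster code: it assumes that factor scores for each sample are already available (how they are computed is not specified here), uses the Pearson correlation between each gene and each factor element, and returns the top L positively and top L negatively correlated genes per factor, giving 2q modules. The function name `find_modules` and the synthetic data are hypothetical.

```python
import numpy as np


def find_modules(X, F, top_l):
    """Select top-L positively and negatively correlated genes per factor.

    X: (N, d) expression matrix (samples by genes).
    F: (N, q) factor scores (samples by factors), assumed given.
    Returns a dict mapping (factor index, "+" or "-") to gene indices.
    """
    Xc = X - X.mean(axis=0)
    Fc = F - F.mean(axis=0)
    x_norm = np.linalg.norm(Xc, axis=0)
    f_norm = np.linalg.norm(Fc, axis=0)
    x_norm[x_norm == 0] = np.inf                    # constant genes get correlation 0
    corr = (Xc.T @ Fc) / np.outer(x_norm, f_norm)   # Pearson correlations, shape (d, q)
    modules = {}
    for k in range(F.shape[1]):
        order = np.argsort(corr[:, k])              # ascending correlation
        modules[(k, "+")] = order[::-1][:top_l]     # most positively correlated genes
        modules[(k, "-")] = order[:top_l]           # most negatively correlated genes
    return modules


# Synthetic check: genes 0-4 rise with factor 0, genes 5-9 fall with it.
rng = np.random.default_rng(1)
N, d, q = 40, 100, 2
F = rng.standard_normal((N, q))
X = 0.1 * rng.standard_normal((N, d))
X[:, :5] += F[:, [0]]
X[:, 5:10] -= F[:, [0]]

modules = find_modules(X, F, top_l=5)
print(len(modules), "modules;", sorted(modules[(0, "+")].tolist()))
```

Each factor element contributes one positive and one negative module, which is where the total of 2q comes from.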
| 4 SOFTWARE DESCRIPTION |
|---|
The current version (ArrayCluster 1.0) runs only on Windows and requires Windows 98 or later. The executable files that perform the mixed factors analysis and the other analytic tools are written in FORTRAN. As its GUI, the ArrayCluster 1.0 uses a web browser, Lunascape (developed by Lunascape Co., Ltd, http://www.lunascape.jp/), which supports an efficient knowledge discovery process.
Funding to pay the Open Access publication charges for this article was provided by the Human Genome Center, Institute of Medical Science, University of Tokyo.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Alvis Brazma
Received on August 19, 2005; revised on March 28, 2006; accepted on March 30, 2006
| REFERENCES |
|---|
McLachlan, G.J. and Peel, D. (2000) Finite Mixture Models. Wiley, New York.
McLachlan, G.J., et al. (2002) A mixture model-based approach to the clustering of microarray expression data. Bioinformatics, 18, 413-422.
Tipping, M.E. and Bishop, C.M. (1999) Mixtures of probabilistic principal component analyzers. Neural Comput., 11, 443-482.
Yoshida, R., et al. (2004) A mixed factors model for dimension reduction and extraction of gene expression data. Proc. IEEE Comput. Syst. Bioinform. Conf., 161-172.