Bioinformatics Advance Access originally published online on April 10, 2006
Bioinformatics 2006 22(12):1538-1539; doi:10.1093/bioinformatics/btl129

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions@oxfordjournals.org

ArrayCluster: an analytic tool for clustering, data visualization and module finder on gene expression profiles

Ryo Yoshida 1,*, Tomoyuki Higuchi 2, Seiya Imoto 1 and Satoru Miyano 1

1 Human Genome Center, Institute of Medical Science, University of Tokyo 4-6-1 Shirokanedai, Minato-ku, Tokyo 108-8639, Japan
2 Research Organization of Information and Systems, The Institute of Statistical Mathematics 4-6-7 Minami-Azabu, Minato-ku, Tokyo 106-8569, Japan

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 OVERVIEW
 2 MIXED FACTORS MODEL
 3 ANALYTIC TOOLS
 4 SOFTWARE DESCRIPTION
 REFERENCES
 

Summary: One of the significant challenges in gene expression analysis is to find unknown subtypes of diseases at the molecular level. This task can be addressed by grouping the gene expression patterns of the collected samples on the basis of a large number of genes. However, applying commonly used clustering methods to such a dataset is likely to fail owing to over-learning, because the number of samples to be grouped is much smaller than the data dimension, which equals the number of genes in the dataset. To overcome this difficulty, we developed a novel model-based clustering method, referred to as mixed factors analysis. ArrayCluster is freely available software that performs mixed factors analysis. It provides analytic tools for clustering DNA microarray experiments, data visualization, and automatic detection of transcriptional modules of genes that are relevant to the calibrated molecular subtypes.

Availability: ArrayCluster can be used free of charge for non-commercial and academic purposes and can be downloaded from http://www.ism.ac.jp/~higuchi/arraycluster.htm

Contact: yoshidar{at}ims.u-tokyo.ac.jp


    1 OVERVIEW
 
Typical microarray data have a fairly small sample size, usually fewer than 100, whereas the number of genes involved is several thousand or more. One purpose of cluster analysis in microarray studies is to find the molecular subtypes present in a set of collected samples. A major difficulty is that the number of samples to be clustered is much smaller than the dimension of the data, which equals the number of genes of interest. Owing to this high dimensionality, the applicability of most ready-made clustering techniques, e.g. k-means, Gaussian mixture clustering and hierarchical clustering, is limited by over-learning.

Mixed factors analysis was originally proposed by Yoshida et al. (2004) to overcome the curse of dimensionality faced in gene expression analysis. The key idea of this approach is to establish a parametric model, referred to as the mixed factors model, which extends the classical factor analyzer. This model provides a parsimonious parameterization of the Gaussian mixture distribution. Consequently, over-learning of the Gaussian mixture model can be avoided even if the dimension of the data is large relative to the sample size, e.g. when the number of genes exceeds several thousand and the sample size is less than 100.

The mixture of factor analyzers (MFA; McLachlan and Peel, 2000; McLachlan et al., 2002), which is an extension of the mixture of probabilistic principal component analyzers (MPPCA; Tipping and Bishop, 1999), is closely related to our model. These models can characterize more flexible geometric features of clusters than the mixed factors model can. In microarray studies, however, they may still suffer from over-fitting. Whereas the number of free parameters of MFA or MPPCA grows quickly as the number of clusters increases, the mixed factors model limits this growth. Some researchers may wish to group tissue samples into a large number of clusters across several thousand genes. In such a situation, our method can be used without modification.


    2 MIXED FACTORS MODEL
 
Suppose that we have d-dimensional data of sample size N, denoted by $x_j \in \mathbb{R}^d$, $j = 1, \ldots, N$. In our setting, N and d ($N \ll d$) are the numbers of microarrays and genes, respectively. The starting point of mixed factors analysis is a linear equation that relates the data vector to a lower-dimensional vector of factor variables, $f_j \in \mathbb{R}^q$ with $q < d$, in the following way:

\[ x_j = A f_j + \varepsilon_j, \qquad j = 1, \ldots, N. \tag{1} \]
The Gaussian noise $\varepsilon_j$ is independently distributed according to $\varepsilon_j \sim N(0, \lambda I)$, where $I$ denotes the identity matrix. The $d \times q$ matrix $A$ contains the factor loadings.

Our primary intention is to describe the group structure of the data parsimoniously on the basis of the factor variables. To this end, we devise mixed factors that follow a G-component Gaussian mixture:

\[ f_j \sim \sum_{g=1}^{G} \alpha_g \, \varphi(f_j; \mu_g, \Sigma_g). \tag{2} \]
Here, the mixing proportions are given by $\alpha_g$, $g = 1, \ldots, G$, and each $\varphi(\,\cdot\,; \mu_g, \Sigma_g)$ denotes the Gaussian density with mean $\mu_g$ and diagonal covariance matrix $\Sigma_g$. We define the mixed factors model by combining (1) and (2).
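To make the generative structure concrete, the following is a minimal simulation sketch, not part of ArrayCluster. It assumes the model reads x_j = A f_j + ε_j with f_j drawn from the G-component mixture and ε_j independent of f_j. The function name `simulate_mixed_factors`, the toy sizes and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np

def simulate_mixed_factors(n, A, lam, alpha, mu, sigma_diag, seed=0):
    """Draw n samples x_j = A f_j + eps_j with f_j from a Gaussian mixture.

    A          : (d, q) loading matrix
    lam        : scalar noise variance (lambda in the text)
    alpha      : (G,) mixing proportions
    mu         : (G, q) component means of the factors
    sigma_diag : (G, q) diagonal entries of each component covariance
    """
    rng = np.random.default_rng(seed)
    d, q = A.shape
    # Latent group label of each sample, drawn with probabilities alpha.
    labels = rng.choice(len(alpha), size=n, p=alpha)
    # Factors: component mean plus noise with diagonal covariance.
    f = mu[labels] + np.sqrt(sigma_diag[labels]) * rng.standard_normal((n, q))
    # Isotropic observation noise with variance lam.
    eps = np.sqrt(lam) * rng.standard_normal((n, d))
    x = f @ A.T + eps
    return x, f, labels

# Toy sizes (assumed): N = 60 arrays, d = 2000 genes, q = 2 factors, G = 3 groups.
rng = np.random.default_rng(1)
A, _ = np.linalg.qr(rng.standard_normal((2000, 2)))  # orthonormal columns
alpha = np.array([0.5, 0.3, 0.2])
mu = np.array([[-3.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
sigma_diag = np.full((3, 2), 0.5)
X, F, z = simulate_mixed_factors(60, A, lam=0.1, alpha=alpha, mu=mu,
                                 sigma_diag=sigma_diag)
print(X.shape, F.shape)
```

Although each sample has 2000 coordinates, all group information lives in the two factor coordinates, which is the situation the model is designed for.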

Given these assumptions, the unconditional distribution of the data vector is a Gaussian mixture of the form

\[ x_j \sim \sum_{g=1}^{G} \alpha_g \, \varphi(x_j; A\mu_g, A\Sigma_g A^T + \lambda I). \]
For the unrestricted Gaussian mixture, the number of free parameters grows quickly as the dimension of the data becomes larger, mainly owing to over-parameterization of the covariance matrices. With the mixed factors model, by contrast, over-fitting of the Gaussian mixture can be avoided by choosing an appropriate factor dimension, regardless of the high dimensionality of the data. Once the model has been fitted to a given dataset, clustering is carried out by the Bayes rule, and data compression amounts to computing the posterior expectation of the mixed factors.
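The difference in parameter growth can be checked with simple arithmetic. The sketch below uses one plausible counting convention, which is an assumption here and not necessarily the count ArrayCluster uses: loadings with orthonormal columns are counted as dq minus q(q+1)/2 constraints, plus a single noise variance, G - 1 mixing proportions, and G means and G diagonal covariances of dimension q. Both function names are hypothetical.

```python
def n_params_full_gmm(d, G):
    # (G - 1) mixing proportions + G mean vectors + G full symmetric covariances
    return (G - 1) + G * d + G * d * (d + 1) // 2

def n_params_mixed_factors(d, q, G):
    # Loadings with orthonormal columns: d*q entries minus q*(q+1)/2 constraints.
    loadings = d * q - q * (q + 1) // 2
    noise = 1  # the single variance lambda
    # Mixing proportions, factor means, diagonal factor covariances.
    mixture = (G - 1) + G * q + G * q
    return loadings + noise + mixture

d, q = 2000, 3
for G in (2, 4, 8):
    print(G, n_params_full_gmm(d, G), n_params_mixed_factors(d, q, G))
```

Under this convention, each additional cluster costs 2q + 1 parameters in the mixed factors model, whereas in the unrestricted mixture it costs a full mean vector and covariance matrix in d dimensions.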

As in conventional factor analysis, the mixed factors model has a parameter redundancy due to the rotational ambiguity of the factor vector. To remove it, we impose orthogonality on the q columns of the factor loading matrix, i.e. $A^T A = I$. This constraint leads to a canonical representation of the mixed factors model:

\[ A^T x_j = f_j + A^T \varepsilon_j. \tag{3} \]
From this equation, it follows that the q canonical variates $A^T x_j \in \mathbb{R}^q$ are distributed according to

\[ A^T x_j \sim \sum_{g=1}^{G} \alpha_g \, \varphi(A^T x_j; \mu_g, \Sigma_g + \lambda I). \tag{4} \]
From (4), the loading matrix should be chosen so that the resulting canonical variates are well described by the Gaussian mixture (4). The canonical variates can be regarded as q modules of genes that are relevant to the existing molecular subtypes. This process yields a feature selection that constructs good discriminators for the existing groups as linear combinations of the d genes. Given the estimated groups, the q transcriptional modules are constructed automatically as follows: if the (i, j)th element of $A^T$ lies far from zero, the jth gene has a large effect on the ith module. In contrast, the influence of genes whose loading coefficients lie close to zero is removed.
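The canonical variates also give a compact way to apply the Bayes rule for clustering. The sketch below assumes that the canonical variates A^T x_j follow a mixture with means µ_g and diagonal covariances Σ_g + λI, and that A has orthonormal columns. Under those assumptions the part of x_j orthogonal to the columns of A has the same distribution in every group, so the group posterior can be computed from A^T x_j alone. The function name `posterior_groups` and the toy values are hypothetical, and the parameters are taken as given rather than estimated.

```python
import numpy as np

def posterior_groups(X, A, lam, alpha, mu, sigma_diag):
    """Posterior probability of each group for each sample, via A^T x.

    X: (N, d) data; A: (d, q) with A^T A = I; mu, sigma_diag: (G, q).
    """
    Y = X @ A                              # canonical variates, shape (N, q)
    var = sigma_diag + lam                 # diagonal of Sigma_g + lambda*I, (G, q)
    diff = Y[:, None, :] - mu[None, :, :]  # shape (N, G, q)
    # Log Gaussian density with diagonal covariance, per sample and group.
    log_dens = -0.5 * np.sum(diff**2 / var + np.log(2 * np.pi * var), axis=2)
    log_post = np.log(alpha) + log_dens    # unnormalised log posterior
    log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

# Toy check (assumed values): two samples placed exactly at the group centres.
rng = np.random.default_rng(0)
A, _ = np.linalg.qr(rng.standard_normal((50, 2)))
alpha = np.array([0.5, 0.5])
mu = np.array([[-5.0, 0.0], [5.0, 0.0]])
sigma_diag = np.ones((2, 2))
X = mu @ A.T
post = posterior_groups(X, A, 0.1, alpha, mu, sigma_diag)
labels = post.argmax(axis=1)  # Bayes rule: assign to the most probable group
print(labels)
```

Working in q dimensions rather than d keeps the computation cheap even when d is in the thousands.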


    3 ANALYTIC TOOLS
 
ArrayCluster provides a user-friendly environment for performing the following tasks:

  • Parameter estimation of the mixed factors model: ArrayCluster computes the maximum likelihood estimates using the EM algorithm.
  • Determination of the number of clusters and the factor dimension (the number of group-related modules): These are selected based on the Bayesian information criterion (BIC).
  • Clustering based on the Bayes rule
  • Dimension reduction of data: This task is addressed in the same way as in classical factor analysis. However, mixed factors analysis explicitly reflects the group structure of the original data, whereas classical factor analysis ignores it during dimension reduction.
  • Identification of the group-related genes: In ArrayCluster, the relevant genes in each module are the top L genes (L is specified by the user) with the highest positive (or negative) correlation with each element of the factor vector.
  • Identification of the modules: By separating the genes that are positively and negatively correlated with each element of the factor vector, we identify 2q modules in total.
  • Missing data imputation
  • Data preprocessing: The methods include normalization and gene filtering.
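The BIC-based choice of the number of clusters G and the factor dimension q listed above can be sketched as follows. This is a generic illustration that assumes the common convention BIC = -2 log L + k log N (smaller is better). The source does not state which sign convention ArrayCluster uses. The log-likelihoods and parameter counts in `fits` are made-up numbers; in practice they would come from EM fits of each candidate model.

```python
import math

def bic(log_likelihood, n_params, n_samples):
    # Smaller is better under this sign convention.
    return -2.0 * log_likelihood + n_params * math.log(n_samples)

def select_model(fits, n_samples):
    """fits maps (G, q) -> (maximised log-likelihood, number of free parameters)."""
    scores = {key: bic(ll, k, n_samples) for key, (ll, k) in fits.items()}
    best = min(scores, key=scores.get)
    return best, scores

# Hypothetical candidates, for illustration only.
fits = {
    (2, 1): (-1500.0, 105),
    (3, 2): (-1200.0, 212),
    (4, 2): (-1198.0, 217),
}
best, scores = select_model(fits, n_samples=60)
print(best)
```

In this toy grid, moving from three to four clusters improves the log-likelihood only slightly, so the penalty term outweighs the gain and the smaller model is preferred.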
Some outputs are displayed graphically in the GUI (Fig. 1). ArrayCluster visualizes the computed factor scores using a box plot matrix, which aids graphical understanding of the group structure. A causal link from the calibrated clusters to biological knowledge can be explored by inspecting the group-related modules. ArrayCluster displays the expression patterns of these modules. Examining the genes listed in these modules, together with their visualization, helps to investigate where the calibrated clusters come from.
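The selection of group-related genes and the resulting 2q modules can be illustrated with a short sketch. It assumes Pearson correlation between each gene's expression profile and each factor score, and it assumes that no gene is constant across samples (gene filtering would normally remove such genes). The source does not specify the correlation measure, and `select_modules` is a hypothetical name, not an ArrayCluster function.

```python
import numpy as np

def select_modules(X, F, top_l):
    """Return 2q gene-index lists: for each factor, the top_l genes with the
    highest positive and the top_l with the most negative correlation.

    X: (N, d) expression matrix; F: (N, q) factor scores.
    """
    Xc = X - X.mean(axis=0)
    Fc = F - F.mean(axis=0)
    # Pearson correlation between every gene and every factor, shape (d, q).
    norms = np.outer(np.linalg.norm(Xc, axis=0), np.linalg.norm(Fc, axis=0))
    corr = (Xc.T @ Fc) / norms
    modules = {}
    for k in range(F.shape[1]):
        order = np.argsort(corr[:, k])  # ascending correlation
        modules[(k, "negative")] = order[:top_l].tolist()
        modules[(k, "positive")] = order[::-1][:top_l].tolist()
    return modules, corr

# Toy data (assumed): gene 0 tracks the factor, gene 1 mirrors it.
rng = np.random.default_rng(0)
F = rng.standard_normal((30, 1))
X = rng.standard_normal((30, 6))
X[:, 0] = F[:, 0]
X[:, 1] = -F[:, 0]
modules, corr = select_modules(X, F, top_l=1)
print(modules)
```

With q factors the dictionary has 2q entries, one positive and one negative module per factor, matching the count given in the task list.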


 
Fig. 1 A snapshot of ArrayCluster.

 

    4 SOFTWARE DESCRIPTION
 
The current version (ArrayCluster 1.0) runs on Windows only and requires Windows 98 or later. The executables that perform mixed factors analysis and the other analytic tools are written in FORTRAN. For its GUI, ArrayCluster 1.0 uses a web browser, Lunascape (developed by Lunascape Co., Ltd, http://www.lunascape.jp/), which supports an efficient knowledge discovery process. Funding to pay the Open Access publication charges for this article was provided by the Human Genome Center, Institute of Medical Science, University of Tokyo.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Alvis Brazma

Received on August 19, 2005; revised on March 28, 2006; accepted on March 30, 2006

    REFERENCES
 

    McLachlan, G.J. and Peel, D. (2000) Finite Mixture Models. Wiley, New York.

    McLachlan, G.J., et al. (2002) A mixture model-based approach to the clustering of microarray expression data. Bioinformatics, 18, 413–422.

    Tipping, M.E. and Bishop, C.M. (1999) Mixtures of probabilistic principal component analyzers. Neural Comput., 11, 443–482.

    Yoshida, R., et al. (2004) A mixed factors model for dimension reduction and extraction of gene expression data. Proc. IEEE Comput. Syst. Bioinform. Conf., 161–172.

