Bioinformatics Advance Access originally published online on September 14, 2007
Bioinformatics 2007 23(21):2888-2896; doi:10.1093/bioinformatics/btm463
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Graph-based consensus clustering for class discovery from gene expression data
Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Consensus clustering, also known as cluster ensemble, is one of the important techniques for microarray data analysis, and is particularly useful for class discovery from microarray data. Compared with traditional clustering algorithms, consensus clustering approaches have the ability to integrate multiple partitions from different cluster solutions to improve the robustness, stability, scalability and parallelization of the clustering algorithms. By consensus clustering, one can discover the underlying classes of the samples in gene expression data.
Results: In addition to exploring a graph-based consensus clustering (GCC) algorithm to estimate the underlying classes of the samples in microarray data, we also design a new validation index to determine the number of classes in microarray data. To our knowledge, this is the first time in which GCC is applied to class discovery for microarray data. Given a pre specified maximum number of classes (denoted as Kmax in this article), our algorithm can discover the true number of classes for the samples in microarray data according to a new cluster validation index called the Modified Rand Index. Experiments on gene expression data indicate that our new algorithm can (i) outperform most of the existing algorithms, (ii) identify the number of classes correctly in real cancer datasets, and (iii) discover the classes of samples with biological meaning.
Availability: Matlab source code for the GCC algorithm is available upon request from Zhiwen Yu.
Contact: yuzhiwen{at}cs.cityu.edu.hk and cshswong{at}cityu.edu.hk
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
|
|
|---|
Recently, the problem of discovering the underlying classes from microarray data receives more and more attention due to its important applications in cancer diagnosis (Golub et al., 1999), gene expression analysis (Alizadeh et al., 2000) and related areas (Su et al., 2002). The adoption of microarray techniques allow the acquisition, through a series of experiments, the expression profile of a genome in a number of different experiment conditions (Baldi and Hatfield, 2002). Information acquired through microarray is usually characterized through two dimensions: the gene dimension and the sample dimension. We can either observe and characterize the expression level distribution of the complete set of genes included in the array under a particular experimental condition, or track the variation of the expression level of a single gene or a selected subset of genes across the different samples. In this article, we mainly focus on the categorization of the samples through our newly proposed graph-based consensus clustering (GCC) approach.
Most of the existing approaches in cancer classification can be categorized into two types: class discovery and class prediction. In general, class discovery consists of two steps: (1) a clustering algorithm is first adopted to partition the samples into K parts, when given a new set of microarray data with unknown number of classes, (2) a cluster validity index is then applied to determine the optimal K value, which corresponds to the final number of classes. Our new approach belongs to this category. For the class discovery problem, the researchers mainly focus on discovering the underlying classes from the samples. In Golub et al. (1999), two types of human acute leukemia were discovered with the help of self organizing feature maps and neighborhood analysis. In Alizadeh et al. (2000), two subtypes of diffused large B-cell lymphoma that are distinct at the molecular level are identified with centroid average hierarchical clustering. In Yeung et al. 2001, the Gaussian mixture model is applied to discover the underlying distribution of the data in microarray. Wigle et al. 2002, identified lung cancer cases are distinguished from the normal cases through statistical analysis and clustering approaches. In Dudoit and Fridlyand (2002), a new prediction-based resampling approach is designed to estimate the number of clusters in microarray data. In Dudoit and Fridlyand (2003), two new resampling approaches based on bagging are proposed to improve the accuracy of the clustering procedure. The silhouette index is adopted as a cluster validity index to determine the optimal number of clusters. In Smolkin and Ghosh (2003), a cluster stability score for gene expression data is proposed to assess the stability of individual clusters based on the random subspace techniques, which can be used to identify the number of clusters by combining with any clustering algorithm. In Handl et al. (2005), the authors performed a survey of cluster validity indices when applied to gene expression data analysis. In particular, they pointed out the benefits and the problems of computational cluster validation techniques. In Datta and Datta (2006), the authors designed two performance measures called biological homogeneity index (BHI) and biological stability index (BSI). BHI measures the biological homogeneity of the clusters, while BSI measures the stability of the clusters. In their work, the main focus is on the functionality of the genes, not the categories of the samples. In addition, since some of the genes in their dataset are labeled using biological tools, their approach belongs to the semi-supervized learning category. In Bertoni and Valentini (2006), randomized maps based on the Johnson–Lindenstrauss theory are designed to project the high-dimensional gene expression data on to a lower dimensional subspace. Then, a stability measure based on the projected data is proposed to discover the underlying structure from microarray data. In Bertoni and Valentini (2007a), a general framework for the assessment of clustering solutions is designed based on the random subspace technique. The true number of clusters is estimated by a
2-based statistical test, which makes use of the distribution of the similarity measure between pairs of clusterings.
Recently, researchers are paying more attention to class discovery based on the consensus clustering approaches. Consensus clustering approaches consist of two major steps: generating a cluster ensemble based on a clustering algorithm, and finding a consensus partition based on this ensemble. The existing consensus clustering approaches that are applied to gene expression data can be categorized into five types: (i) using different clustering algorithms as the basic clustering algorithms to obtain different solutions (Strehl and Ghosh, 2002). (ii) using random initializations of a single clustering algorithm (Grotkjaer et al., 2006), e.g. adopting different initial centers for K-means or EM. (iii) sub-sampling, re-sampling or adding noise to the original data (McShane, 2002, Monti et al., 2003, Valentini, 2007). (iv) using selected subsets of features (Bertoni and Valentini, 2005, Bertoni and Valentini, 2007a, Bertoni and Valentini, 2007b, Smolkin and Ghosh, 2003, Topchy et al., 2005, Valentini, 2006). (v) using different K values to generate different clustering solutions, where K is the number of clusters.
The current consensus clustering approaches have limitations. Specifically, two aspects of the current consensus clustering approaches can be improved, namely, the diversity of the ensemble, and the accuracy of the partitioning results obtained from the consensus matrix. In this article, we propose a new consensus clustering approach, known as GCC, to discover biologically meaningful classes automatically from gene expression data. Our new approach belongs to category (iv), in which the cluster ensemble is generated using different gene subsets. Compared with the previous approaches, in particular the approaches proposed by Bertoni and Valentini (2007a) and Smolkin and Ghosh (2003), the main features of the GCC approach include (1) the adoption of the normalized cut algorithm (Shi and Malik, 2000) to partition the consensus matrix, (2) the adoption of the random subspace technique, combined with the correlation clustering algorithm or K-means, to enhance the diversity of the cluster ensemble and (3) the design of a new cluster validity index called the Modified Rand Index, which measures the degree of agreement between two consensus matrices based on a penalty term.
The work most related to ours is described in Monti et al. (2003). They proposed a consensus clustering approach to identify the underlying types of cancers in a number of datasets. Unlike previous consensus clustering approaches, their approach can estimate the number of classes. However, although the results obtained using six cancer datasets are acceptable, there is a need to further improve the estimation performance to allow more accurate diagnosis. Compared with this approach, our proposed algorithm has two differences: (1) GCC only considers subsets of genes in the process of generating the clustering solution. (2) GCC adopts the new Modified Rand Index, which allows more accurate characterization of the class structure. Specifically, GCC first generates a set of clustering solutions in the gene subspace. Then, it creates a consensus matrix that integrates the partitions coming from the different clustering solutions. Finally, a new validation index, called the Modified Rand Index, is designed to estimate the number of classes automatically from gene expression data. Our experiments show that GCC successfully identifies the number of classes in real cancer datasets.
| 2 METHODS |
|---|
|
|
|---|
We formulate the GCC problem for cancer classification as follows: Given a set of samples S = {S1,S2,... ,Sns} where ns is the number of samples, the GCC constructor first randomly selects a subset of genes and obtains a clustering solution Iu by partitioning the samples into K disjoint classes (
{1,...,K}) based on the sampled subset of genes G [G = {gi1,gi2,...,gings} (where gih (1
h
ngs)] represents the hth sampled gene from the original set of genes, and ngs is the total number of sampled genes]. Then, the above process is repeated B times to create a cluster ensemble I which consists of B clustering solutions Iu (Iu
I = {I1,I2,...,IB}). Finally, a consensus function
is applied to obtain the final clustering solution Ifinal based on the cluster ensemble I and the pre specified parameter K. Figure 1 provides an overview of the framework for GCC algorithm. Specifically, GCC first selects a subset of genes from the gene space. Then, the clustering algorithm is applied to partition the samples into K disjoint classes. GCC repeats the first two steps B times to obtain B clustering solutions. In the third step, it constructs a consensus matrix, partitions the consensus matrix by the normalized cut algorithm (Shi and Malik 2000) and obtains the final results. Finally, GCC estimates a suitable K value by a new cluster validation index.
|
2.1 Subspace generation
In the first step, GCC selects a subset of genes G by random sampling. Specifically, a constant ngs (n_min
ngs
n_max), which represents the number of genes in the subspace, is randomly generated by the following equation:
|
| (1) |
(
[ < xrefref – type="bibr"rid="B0" > 0</xref>,<xrefref – type="bibr"rid="B1" > 1</xref > ]) is a uniform random variable. n_min and n_max, which are pre specified parameters by the user, controls the number of dimensions of the subspace. The default settings for n_min and n_max are 0.75ng and 0.85ng respectively, where n_max
ng. We have followed the settings of n_min = 0.75ng and n_max = 0.85ng in Smolkin and Ghosh (2003).
Then, it selects the gene one by one until ngs genes are obtained. The index of each randomly selected gene is determined as follows:
|
| (2) |
' is a uniform random variable. Finally, the randomly selected ngs genes are used to construct a subspace.
2.2 Subspace clustering
GCC performs clustering in the selected subspace by two approaches: correlation clustering and K-means. The characteristics of the two approaches are summarized as follows:
Correlation clustering combines correlation analysis and graph partition. We first calculate the correlation matrix (CM) whose entries rij (i,j
{1,...,ns}, ns is the number of samples) are determined as follows:
|
| (3) |
Then, the normalized cut algorithm (Shi and Malik, 2000) is applied to partition the samples into K classes based on the CM. We can construct a graph (G = (S,CM)) whose vertices correspond to the samples, and whose edges denote the correlation between the samples. The normalized cut approach is applied to partition the graph recursively until K classes are obtained. We assume that the normalized cut first partitions the vertex set S of the graph G into two subsets P and Q. The cost function Ncut(P,Q), which represents a disassociation measure between P and Q, is defined as:
|
| (4) |
|
| (5) |
|
| (6) |
where rij denotes the weight of the edge between the vertices si and sj, which is the value of the entry in the CM. An alternative representation for the above cost measure is as follows:
|
| (7) |
i =
jrij is the sum of the weights from the vertex si to all other vertices. In this way, the normalized cut problem can be formulated as an optimization problem, in which Ncut(v) is minimized as follows:
|
| (8) |
|
| (9) |
|
| (10) |
|
| (11) |
i (i
{1,...,ns}) on its diagonal, CM is an ns x ns symmetric matrix with the elements rij, I denotes the identity matrix and
i is the ith component of
.
Although finding the normalized cut is an NP-complete problem, an approximate discrete solution can be found efficiently by extending the domain of the variables from discrete to continuous. Based on this constraint relaxation, the above optimization problem can be solved using the following generalized eigenvalue system:
|
| (12) |
denotes the eigenvalue. If
can take real values, the second smallest eigenvector of the generalized eigenvalue system is the solution to the normalized cut problem (Shi and Malik, 2000).
K-means is a popular clustering algorithm which partitions the samples into K classes by maximizing an objective function
(Z,C).
|
| (13) |
|
| (14) |
(si,ck) denotes the cosine distance between the sample si and the center ck of the kth cluster, which is defined as:
|
| (15) |
Through subspace clustering, GCC obtains the predicted labels of the samples. The adjacency matrix M is constructed by the predicted labels (Dudoit and Fridlyand 2003), whose elements mij are defined as:
|
| (16) |
2.3 Cluster ensemble
Given the number of classes K, GCC repeats the above two steps (subspace generation and subspace clustering) B times and obtains B clustering solutions (
) with B adjacency matrices (
). Then, GCC constructs an ns x ns consensus matrix (MK) by merging the adjacency matrix
(u
{1,...,B}) as follows:
|
| (17) |
Afterwards, GCC constructs a graph (GK = (S,MK)) whose vertices correspond to the samples in S, and whose edges represent the probability that two samples appear in the same class, which is denoted as mK,ij in the consensus matrix MK. Then, the normalized cut algorithm is used to partition the samples into K classes based on MK by minimizing the following cost function:
|
| (18) |
2.4 Cluster discovery
We further define an aggregated consensus matrix R as follows:
|
| (19) |
GCC further converts the aggregated consensus matrix to a binary matrix Rb:
|
| (20) |
We propose a new cluster validity index
, known as the Modified Rand Index, as follows:
|
| (21) |
The optimal number of classes K* is selected as follows:
|
| (22) |
In general, cluster validity indices can be categorized into three types: internal indices, external indices and information-theory-based criteria (Sergios and Konstantinos, 2006). The Modified Rand Index can be considered a modified form of the external index. Unlike conventional external indices, the Modified Rand Index does not measure the discrepancy between the ground-truth partition and the partition obtained by the clustering algorithm, but the average of the discrepancies between the partition obtained by the consensus clustering approach and the partitions obtained by the clustering algorithm in a cluster ensemble. It also belongs to the category of indices for optimizing the predictive power/stability of the resulting solution (Handl et al., 05). With the help of the random subspace technique and the consensus clustering approach, the Modified Rand Index can be used to estimate the number of clusters.
| 3 RESULTS |
|---|
|
|
|---|
3.1 Experiment setting
In the experiment, we compare GCC with the consensus clustering (CC) approach described in Monti et al. (2003). Based on the adoption of different feature spaces, sampling approaches, clustering algorithms and consensus functions, we consider the following four combinations: GCCcorr (subspace + non-resampling + correlation clustering + normalized cut), GCCK – means (subspace + non-resampling + K-means +normalized cut), CCHC [complete space + resampling + hierarchical clustering with average linkage (HC) + HC] and CCSOM [complete space + resampling + self organizing map (SOM) + SOM]. The experiment settings for the CC approaches are the same as those in Monti et al. (2003). The experiment settings for the GCC approaches are shown in Table 1.
|
To evaluate the performance of these different approaches, we adopt the Adjusted Rand Index (ARI) (Milligan and Cooper, 1986) to measure the degree of agreement between different partitions with different numbers of clusters. If the partitions are identical, the value of the ARI is 1. If the partitions are drawn independently from one another, the index takes on an expected value of 0. Let (1) LT with KT classes be the true partition of the samples S, and LP with KP classes be the predicted partition of the same set of samples S. (2)
|
| (23) |
|
| (24) |
Table 2 provides a summary of the datasets used in our experiments. We generate three synthetic data sets in the space [ < xrefref – type="bibr"rid="B0" > 0</xref>,<xrefref – type="bibr"rid="B1" > 1</xref > ]d (d is the number of features) using a set of Gaussian distributions with randomly selected centers, and with the covariance matrices of all distributions set to 0.25I (I denotes the identity matrix). The points in Synthetic1, Synthetic2 and Synthetic3 originate from 3, 4, 7 Gaussian clusters, respectively. Synthetic1 is a 1000-gene by 75-sample dataset, in which each class contains 25 samples. Synthetic2 is a 1000-gene by 100-sample dataset, which is used to simulate microarray data with noisy genes. Specifically, 200 noisy genes are included among the 1000 genes. Similar to Synthetic1, each class consists of 25 samples. Synthetic3 is 1000-gene by 100-sample dataset, which is used to simulate microarray data with unequal-size clusters and noisy genes. In addition to including the noisy genes, we vary the number of samples in each class to include 8, 12, 16, 20, 24, 28, 32 samples in the seven classes, respectively. For the real datasets, the data preprocessing procedure is the same as that described in Monti et al. (2003). In Table 2, the abbreviation SRBCT stands for small round blue cell tumors, while CNS tumors refers to embryonal tumors of the central nervous system. In the Leukemia dataset, bone marrow samples are obtained from acute leukemia patients at the time of diagnosis, while diagnostic bone marrow samples in the St.Jude dataset are from pediatric patients with acute leukemia. The ranges of K for Synthetic3 and the St.Jude leukemia dataset are both set to {2,...,15}, while the ranges of K for the other datasets are set to {2,...,9}.
|
In the experiments, we first explore the relationship between
(the Modified Rand Index) and ARI. Then, the optimal K value is selected based on
. Finally, we compare the performances of the different approaches.
3.2 Relationship between ARI and 
We observe an interesting relationship between ARI and
, by which we can estimate the optimal K value. Figure 2 illustrates the change of ARI with respect to K in the different datasets, while Figure 3 shows the change of the Modified Rand Index
corresponding to the different K values in the various datasets. An interesting observation is that the trend of the curve for ARI and that of the curve for
are very similar. We perform correlation analysis on the curve of ARI and the corresponding curve of
in the same dataset. Table 3 lists the results of the correlation analysis between ARI and
in all datasets corresponding to the different approaches. All the correlation values in Table 3 are greater than 0.65. This implies that the degree of dependence between ARI and
is high.
|
|
|
To estimate the true K value of a dataset, we further study the relationship between
, ARI and the K values. If the predicted number of classes is the same as the true number of classes, the value of ARI attains its maximum as shown in Figure 2, since the value of ARI is directly related to the true number of clusters in the dataset. It can also be seen that the peak of the ARI curve in Figure 2 corresponds to that of
in the same dataset as shown in Figure 3. In other words, the maximum value of
also corresponds to the optimal K value. Given a new set of microarray data with unknown clusters, ARI cannot be calculated, while
can be computed when Kmax is given. As a result,
can be considered as an alternative measure to discover the underlying clusters.
3.3 Experiment results
In general, the GCC approaches GCCcorr and GCCK – means outperform the consensus clustering approaches CCHC and CCSOM when applied to the gene expression data as shown in Table 4.
|
GCCcorr outperforms CCHC and CCSOM and correctly discovers the true number of clusters in the three synthetic datasets and the following five real datasets: Breast (BRCA1-mutation-positive samples, BRCA2-mutation-positive samples, Sporadic samples), CNS tumors (medulloblastomas, primitive neuroectodermal tumors, atypical teratoid/rhabdoid tumors, malignant gliomas and normal cerebellum), leukemia (acute myeloid leukemia samples, T-lineage acute lymphoblastic leukemia samples and B-lineage ALL samples), Lung cancer (adenocarcinomas, squamous cell carcinomas, carcinoids and normal lung) and SRBCT [neuroblastoma (NB), nonHodgkin lymphoma (in this case Burkitt's; lymphoma (BL)), rhabdomyosarcoma (RMS) and Ewing's; family of tumors (EWS)]. In the St.Jude dataset, the estimated number of GCCcorr is 5, while the true K value is 6. In fact, the values of
corresponding to K = 5 and K = 6 are nearly the same. When K = 5, GCCcorr identifies four important leukemia sub-types in the St.Jude dataset: T-lineage ALL, E2A-PBX1, TELAML1 and MLL rearrangements, while the two sub-types BCR-ABL and hyperdiploid > 50 chromosomes are merged into a single class by GCCcorr. The performance of GCCK – means is also better than those of CCHC and CCSOM, and is comparable with that of GCCcorr, since GCCK – means successfully discovers the underlying classes in three synthetic datasets and the following five real datasets: Breast cancer, CNS tumors, leukemia, Lung cancer and the St. Jude leukemia dataset (T-lineage ALL, E2A-PBX1, BCR-ABL, TELAML1, MLL rearrangements and hyperdiploid > 50 chromosomes, where ALL denotes acute lymphoblastic leukemia). GCCK – means correctly discovers three classes in the SRBCT dataset: NB, nonHodgkin lymphoma (in this case BL) and RMS, while subdividing the class called EWS into two subtypes.
CCHC estimates the correct K value for the following datasets: Synthetic1, Synthetic2, Breast, CNS tumor and SRBCT, while CCSOM discovers the true underlying classes in Synthetic1, Synthetic2, Synthetic3, Breast and CNS tumor.
In general, to address the problems associated with the high-dimensional and small sample nature of gene expression data, together with their high noise levels and large biological variabilities, the GCC approaches adopt the random subspace technique, together with the correlation clustering algorithm or K-means, to generate a more diverse set of clustering solutions when compared with existing CC approaches, such that a more accurate solution can be obtained. In addition, the new consensus function in GCC performs better than those in existing CC methods due to the adoption of the normalized cut algorithm, which results in a more accurate partition of the consensus matrix.
Table 5 lists the corresponding values of ARI with respect to the estimated K value in Table 4. The GCC approaches clearly outperform the CC approaches, especially in the Leukemia dataset and the Lung cancer dataset. To provide a further comparison of the results in Table 5, we design a statistical table as shown in Table 6. If the ARI value obtained by the first approach is significantly better/worse [the level of significance is set at a difference of 0.05 in the ARI value, as adopted in Kuncheva and Vetrov (2006)] than that obtained by the second approach in one of the datasets, the win/lose count of the first approach is incremented once. If the difference between the ARI values based on the two approaches is smaller than the level of significance, the tie count is incremented instead. As shown in Table 6, GCC approaches GCCcorr and GCCK – means clearly outperform the other approaches and achieve good results in most of the datasets.
|
|
In the following experiments, we adopt GCCcorr to illustrate the properties of our proposed consensus clustering approaches, since (1) GCCcorr achieves good performance in most of the datasets, and (2) there is a higher correlation betweeen ARI and
in the case of GCCcorr for most of the datasets, as shown in Table 3. To further explore the robustness and the stability of our approach against different numbers of noisy components, we generate three more synthetic datasets (Synthetic4, Synthetic5 and Synthetic6) by varying the number of noisy genes in Synthetic2. The numbers of noisy genes in Synthetic4, Synthetic5 and Synthetic6 are 250, 300 and 350, respectively. The performance of GCC when applied to the three new datasets is illustrated in Figure 4. It can be seen that GCC is robust and stable against the changes in the number of noisy components, and can successfully estimate the four clusters in the datasets.
|
We further investigate the effect of the maximum K value (Kmax) on the Modified Rand Index, and the relationship between Kmax and the sparseness of the discretized aggregated consensus matrix Rb. The GCC algorithm GCCcorr is applied to the Synthetic2 dataset and the Leukemia dataset using different Kmax values. It is observed that when Kmax increases, GCCcorr still correctly estimates the number of clusters in the Synthetic2 dataset and the Leukemia dataset, as indicated in Tables 7 and 8. In addition, we observe that the ARI associated with GCCcorr is not affected by Kmax, while the value of the peak
decreases slightly when Kmax increases. In general, when Kmax is large, the number of entries in Rb whose values exceed the threshold in Equation (20) are small, which leads to sparseness of the binary matrix. We further observe that this sparseness causes a slight decrease of the peak
value for both datasets. However, this does not affect the capability of
to identify the correct number of clusters, since the set of
values corresponding to the different K values change in such a way that the position of the maximum point on the
versus K curve is maintained, as indicated in Figures 5 and 6. Although the peaks of the curves in Figures 5 and 6 become less prominent as Kmax increases, our proposed approach still correctly estimates the optimal K value from each dataset.
|
|
|
|
| 4 CONCLUSION |
|---|
|
|
|---|
In this article, we investigate the problem of class discovery in gene expression data. The major contribution of this article is in the design of a new framework, known as GCC, to discover the underlying classes of the samples in gene expression data. Our new approach can successfully estimate the true number of classes for the datasets in our experiments. In addition, based on our experiment results, we also observe that our new approach outperforms the consensus clustering approaches proposed in Monti et al. (2003) when applied to the characterization of gene expression data.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: David Rocke
Received on April 11, 2007; revised on August 15, 2007; accepted on September 8, 2007
| REFERENCES |
|---|
|
|
|---|
Alizadeh AA, et al. Distinct types of diffuse large b-cell lymphoma identified by gene expression profiling. Nature (2000) 403:503–511.[CrossRef][Medline]
Baldi P, Hatfield GW. DNA Microarrays and Gene Expression: From Experiments to Data Analysis and Modeling (2002) Cambridge: Cambridge University Press.
Bertoni A, Valentini G. Ensembles based on random projections to improve the accuracy of clustering algorithms. Neural Nets, (WIRN 2005), LNCS (2005) 3931:31–37.
Bertoni A, Valentini G. Randomized maps for assessing the reliability of patients clusters in DNA microarray data analyses. Artif. Intell. in Med. (2006) 37:85–109.[CrossRef]
Bertoni A, Valentini G. Model order selection for biomolecular data clustering. BMC Bioinformatics (2007a) 8. (Suppl. 2), S7.
Bertoni A, Valentini G. Randomized Embedding Clustering Ensembles for gene expression data analysis. In SETIT 2007 – Proceedings of IEEE International Conference on Sciences of Electronic (2007b) Technologies of Information and Telecommunications, Hammamet, Tunisia.
Bhattacharjee A, et al. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinomas sub-classes. Proc. Natl Acad. Sci. (2001) 98:13790–13795.
Datta S, Datta S. Methods for evaluating clustering algorithms for gene expression data using a reference set of functional classes. BMC Bioinformatics (2006) 7:397.[CrossRef][Medline]
Dudoit S, Fridlyand J. A prediction-based resampling method to estimate the number of clusters in a dataset. Genome Bio. (2002) 3:0036.1–0036.21.
Dudoit S, Fridlyand J. Bagging to improve the accuracy of a clustering procedure. Bioinformatics (2003) 19:1090–1099.
Golub TR, et al. Molecular classification of cancer: class discovery and class prediction by gene expression. Science (1999) 286:531–537.
Grotkjaer T, et al. Robust multi-scale clustering of large DNA microarray datasets with the consensus algorithm. Bioinformatics (2006) 22:58–67.
Handl J, et al. Computational cluster validation in post-genomic data analysis Bioinformatics. Bioinformatics (2005) 21:3201–3212.
Hedenfalk I, et al. Gene-expression profiles in hereditary breast cancer. New Engl. J. of Med. (2001) 344:539–548.[CrossRef]
Khan J, et al. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat. Med. (2001) 7:673–679.[CrossRef][Web of Science][Medline]
Kuncheva LI, Vetrov DP. Evaluation of stability of k-means cluster ensembles with respect to random initialization. IEEE Trans. Pattern Anal. Mach. Intell. (2006) 28:1798–1808.[CrossRef][Medline]
Mc Shane LM. Method for assessing reproducibility of clustering patterns observed in analyses of microarray data. Bioinformatics (2002) 18:1462–1469.
Milligan G, Cooper M. A study of the comparability of external criteria for hierarchical cluster analysis. Multivar. Behav. Res. (1986) 21:441–458.[CrossRef]
Monti S, et al. Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Mach. Learn. (2003) 52:91–118.[CrossRef]
Pomeroy S, et al. Gene expression-based classification and outcome prediction of central nervous system embryonal tumors. Nature (2002) 415:436–442.[CrossRef][Medline]
Sergios T, Konstantinos K. Pattern Recognition (2006) 3rd. UK: Academic press, Elsevier. 733–765.
Shi J, Malik J. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. (2000) 22:888–905.[CrossRef]
Strehl A, Ghosh J. Cluster ensembles — a knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. (2002) 3:583–617.[CrossRef]
Smolkin M, Ghosh D. Cluster stability scores for microarray data in cancer studies. BMC Bioinformatics (2003) 4:36.[CrossRef][Medline]
Su AI, et al. Large-scale analysis of the human and mouse transcriptomes. Proc. Natl Acad. Sci. (2002) 99:4465–4470.
Topchy A, et al. Clustering ensembles: models of consensus and weak partitions. IEEE Trans. Pattern Anal. Mach. Intell. (2005) 27:1866–1881.[CrossRef][Medline]
Valentini G. Clusterv: a tool for assessing the reliability of clusters discovered in DNA microarray data. Bioinformatics (2006) 22:369–370.
Valentini G. Mosclust: a software library for discovering significant structures in bio-molecular data. Bioinformatics (2007) 23:387–389.
Wigle D, et al. Molecular profiling of non-small cell lung cancer and correlation with disease-free survival. Cancer Res. (2002) 62:3005–3008.
Yeoh E-J, et al. Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell (2002) 1:133–143.[CrossRef][Web of Science][Medline]
Yeung KY, et al. Model-based clustering and data transformations for gene expression data. Bioinformatics (2001) 17:977–987.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||






