Skip Navigation


Bioinformatics Advance Access originally published online on September 14, 2007
Bioinformatics 2007 23(21):2888-2896; doi:10.1093/bioinformatics/btm463
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary data
Right arrow All Versions of this Article:
23/21/2888    most recent
btm463v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (3)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Yu, Z.
Right arrow Articles by Wang, H.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Yu, Z.
Right arrow Articles by Wang, H.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Graph-based consensus clustering for class discovery from gene expression data

Zhiwen Yu *, Hau-San Wong and Hongqiang Wang

Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSION
 REFERENCES
 

Motivation: Consensus clustering, also known as cluster ensemble, is one of the important techniques for microarray data analysis, and is particularly useful for class discovery from microarray data. Compared with traditional clustering algorithms, consensus clustering approaches have the ability to integrate multiple partitions from different cluster solutions to improve the robustness, stability, scalability and parallelization of the clustering algorithms. By consensus clustering, one can discover the underlying classes of the samples in gene expression data.

Results: In addition to exploring a graph-based consensus clustering (GCC) algorithm to estimate the underlying classes of the samples in microarray data, we also design a new validation index to determine the number of classes in microarray data. To our knowledge, this is the first time in which GCC is applied to class discovery for microarray data. Given a pre specified maximum number of classes (denoted as Kmax in this article), our algorithm can discover the true number of classes for the samples in microarray data according to a new cluster validation index called the Modified Rand Index. Experiments on gene expression data indicate that our new algorithm can (i) outperform most of the existing algorithms, (ii) identify the number of classes correctly in real cancer datasets, and (iii) discover the classes of samples with biological meaning.

Availability: Matlab source code for the GCC algorithm is available upon request from Zhiwen Yu.

Contact: yuzhiwen{at}cs.cityu.edu.hk and cshswong{at}cityu.edu.hk

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSION
 REFERENCES
 
Recently, the problem of discovering the underlying classes from microarray data receives more and more attention due to its important applications in cancer diagnosis (Golub et al., 1999), gene expression analysis (Alizadeh et al., 2000) and related areas (Su et al., 2002). The adoption of microarray techniques allow the acquisition, through a series of experiments, the expression profile of a genome in a number of different experiment conditions (Baldi and Hatfield, 2002). Information acquired through microarray is usually characterized through two dimensions: the gene dimension and the sample dimension. We can either observe and characterize the expression level distribution of the complete set of genes included in the array under a particular experimental condition, or track the variation of the expression level of a single gene or a selected subset of genes across the different samples. In this article, we mainly focus on the categorization of the samples through our newly proposed graph-based consensus clustering (GCC) approach.

Most of the existing approaches in cancer classification can be categorized into two types: class discovery and class prediction. In general, class discovery consists of two steps: (1) a clustering algorithm is first adopted to partition the samples into K parts, when given a new set of microarray data with unknown number of classes, (2) a cluster validity index is then applied to determine the optimal K value, which corresponds to the final number of classes. Our new approach belongs to this category. For the class discovery problem, the researchers mainly focus on discovering the underlying classes from the samples. In Golub et al. (1999), two types of human acute leukemia were discovered with the help of self organizing feature maps and neighborhood analysis. In Alizadeh et al. (2000), two subtypes of diffused large B-cell lymphoma that are distinct at the molecular level are identified with centroid average hierarchical clustering. In Yeung et al. 2001, the Gaussian mixture model is applied to discover the underlying distribution of the data in microarray. Wigle et al. 2002, identified lung cancer cases are distinguished from the normal cases through statistical analysis and clustering approaches. In Dudoit and Fridlyand (2002), a new prediction-based resampling approach is designed to estimate the number of clusters in microarray data. In Dudoit and Fridlyand (2003), two new resampling approaches based on bagging are proposed to improve the accuracy of the clustering procedure. The silhouette index is adopted as a cluster validity index to determine the optimal number of clusters. In Smolkin and Ghosh (2003), a cluster stability score for gene expression data is proposed to assess the stability of individual clusters based on the random subspace techniques, which can be used to identify the number of clusters by combining with any clustering algorithm. In Handl et al. (2005), the authors performed a survey of cluster validity indices when applied to gene expression data analysis. In particular, they pointed out the benefits and the problems of computational cluster validation techniques. In Datta and Datta (2006), the authors designed two performance measures called biological homogeneity index (BHI) and biological stability index (BSI). BHI measures the biological homogeneity of the clusters, while BSI measures the stability of the clusters. In their work, the main focus is on the functionality of the genes, not the categories of the samples. In addition, since some of the genes in their dataset are labeled using biological tools, their approach belongs to the semi-supervized learning category. In Bertoni and Valentini (2006), randomized maps based on the Johnson–Lindenstrauss theory are designed to project the high-dimensional gene expression data on to a lower dimensional subspace. Then, a stability measure based on the projected data is proposed to discover the underlying structure from microarray data. In Bertoni and Valentini (2007a), a general framework for the assessment of clustering solutions is designed based on the random subspace technique. The true number of clusters is estimated by a {chi}2-based statistical test, which makes use of the distribution of the similarity measure between pairs of clusterings.

Recently, researchers are paying more attention to class discovery based on the consensus clustering approaches. Consensus clustering approaches consist of two major steps: generating a cluster ensemble based on a clustering algorithm, and finding a consensus partition based on this ensemble. The existing consensus clustering approaches that are applied to gene expression data can be categorized into five types: (i) using different clustering algorithms as the basic clustering algorithms to obtain different solutions (Strehl and Ghosh, 2002). (ii) using random initializations of a single clustering algorithm (Grotkjaer et al., 2006), e.g. adopting different initial centers for K-means or EM. (iii) sub-sampling, re-sampling or adding noise to the original data (McShane, 2002, Monti et al., 2003, Valentini, 2007). (iv) using selected subsets of features (Bertoni and Valentini, 2005, Bertoni and Valentini, 2007a, Bertoni and Valentini, 2007b, Smolkin and Ghosh, 2003, Topchy et al., 2005, Valentini, 2006). (v) using different K values to generate different clustering solutions, where K is the number of clusters.

The current consensus clustering approaches have limitations. Specifically, two aspects of the current consensus clustering approaches can be improved, namely, the diversity of the ensemble, and the accuracy of the partitioning results obtained from the consensus matrix. In this article, we propose a new consensus clustering approach, known as GCC, to discover biologically meaningful classes automatically from gene expression data. Our new approach belongs to category (iv), in which the cluster ensemble is generated using different gene subsets. Compared with the previous approaches, in particular the approaches proposed by Bertoni and Valentini (2007a) and Smolkin and Ghosh (2003), the main features of the GCC approach include (1) the adoption of the normalized cut algorithm (Shi and Malik, 2000) to partition the consensus matrix, (2) the adoption of the random subspace technique, combined with the correlation clustering algorithm or K-means, to enhance the diversity of the cluster ensemble and (3) the design of a new cluster validity index called the Modified Rand Index, which measures the degree of agreement between two consensus matrices based on a penalty term.

The work most related to ours is described in Monti et al. (2003). They proposed a consensus clustering approach to identify the underlying types of cancers in a number of datasets. Unlike previous consensus clustering approaches, their approach can estimate the number of classes. However, although the results obtained using six cancer datasets are acceptable, there is a need to further improve the estimation performance to allow more accurate diagnosis. Compared with this approach, our proposed algorithm has two differences: (1) GCC only considers subsets of genes in the process of generating the clustering solution. (2) GCC adopts the new Modified Rand Index, which allows more accurate characterization of the class structure. Specifically, GCC first generates a set of clustering solutions in the gene subspace. Then, it creates a consensus matrix that integrates the partitions coming from the different clustering solutions. Finally, a new validation index, called the Modified Rand Index, is designed to estimate the number of classes automatically from gene expression data. Our experiments show that GCC successfully identifies the number of classes in real cancer datasets.


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSION
 REFERENCES
 
We formulate the GCC problem for cancer classification as follows: Given a set of samples S = {S1,S2,... ,Sns} where ns is the number of samples, the GCC constructor first randomly selects a subset of genes and obtains a clustering solution Iu by partitioning the samples into K disjoint classes (Formula , Formula , kisin{1,...,K}) based on the sampled subset of genes G [G = {gi1,gi2,...,gings} (where gih (1 ≤ h ≤ ngs)] represents the hth sampled gene from the original set of genes, and ngs is the total number of sampled genes]. Then, the above process is repeated B times to create a cluster ensemble I which consists of B clustering solutions Iu (IuisinI = {I1,I2,...,IB}). Finally, a consensus function {varphi} is applied to obtain the final clustering solution Ifinal based on the cluster ensemble I and the pre specified parameter K.

Figure 1 provides an overview of the framework for GCC algorithm. Specifically, GCC first selects a subset of genes from the gene space. Then, the clustering algorithm is applied to partition the samples into K disjoint classes. GCC repeats the first two steps B times to obtain B clustering solutions. In the third step, it constructs a consensus matrix, partitions the consensus matrix by the normalized cut algorithm (Shi and Malik 2000) and obtains the final results. Finally, GCC estimates a suitable K value by a new cluster validation index.


Figure 1
View larger version (20K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. The framework for GCC.

 
2.1 Subspace generation
In the first step, GCC selects a subset of genes G by random sampling. Specifically, a constant ngs (n_min ≤ ngs ≤ n_max), which represents the number of genes in the subspace, is randomly generated by the following equation:


Formula 1

(1)
where {nu}({nu}isin[ < xrefreftype="bibr"rid="B0" > 0</xref>,<xrefreftype="bibr"rid="B1" > 1</xref > ]) is a uniform random variable. n_min and n_max, which are pre specified parameters by the user, controls the number of dimensions of the subspace. The default settings for n_min and n_max are 0.75ng and 0.85ng respectively, where n_max ≤ ng. We have followed the settings of n_min = 0.75ng and n_max = 0.85ng in Smolkin and Ghosh (2003).

Then, it selects the gene one by one until ngs genes are obtained. The index of each randomly selected gene is determined as follows:


Formula 2

(2)
where h denotes the hth gene in microarray data, ng is the total number of the genes and {nu}' is a uniform random variable. Finally, the randomly selected ngs genes are used to construct a subspace.

2.2 Subspace clustering
GCC performs clustering in the selected subspace by two approaches: correlation clustering and K-means. The characteristics of the two approaches are summarized as follows:

Correlation clustering combines correlation analysis and graph partition. We first calculate the correlation matrix (CM) whose entries rij (i,jisin{1,...,ns}, ns is the number of samples) are determined as follows:


Formula 3

(3)
where si and sj denotes the ith and jth samples, respectively.

Then, the normalized cut algorithm (Shi and Malik, 2000) is applied to partition the samples into K classes based on the CM. We can construct a graph (G = (S,CM)) whose vertices correspond to the samples, and whose edges denote the correlation between the samples. The normalized cut approach is applied to partition the graph recursively until K classes are obtained. We assume that the normalized cut first partitions the vertex set S of the graph G into two subsets P and Q. The cost function Ncut(P,Q), which represents a disassociation measure between P and Q, is defined as:


Formula 4

(4)



Formula 5

(5)



Formula 6

(6)

where rij denotes the weight of the edge between the vertices si and sj, which is the value of the entry in the CM. An alternative representation for the above cost measure is as follows:


Formula 7

(7)
where v = [v1,...,vns]T is an ns-dimensional indicator vector (ns = |S|, |S| is the cardinality of the sample set), and {phi}i = {sum}jrij is the sum of the weights from the vertex si to all other vertices. In this way, the normalized cut problem can be formulated as an optimization problem, in which Ncut(v) is minimized as follows:


Formula 8

(8)



Formula 9

(9)



Formula 10

(10)

with the constraints:


Formula 11

(11)
where D is an ns x ns diagonal matrix with {phi}i (iisin{1,...,ns}) on its diagonal, CM is an ns x ns symmetric matrix with the elements rij, I denotes the identity matrix and {gamma}i is the ith component of {gamma}.

Although finding the normalized cut is an NP-complete problem, an approximate discrete solution can be found efficiently by extending the domain of the variables from discrete to continuous. Based on this constraint relaxation, the above optimization problem can be solved using the following generalized eigenvalue system:


Formula 12

(12)
where {lambda} denotes the eigenvalue. If {gamma} can take real values, the second smallest eigenvector of the generalized eigenvalue system is the solution to the normalized cut problem (Shi and Malik, 2000).

K-means is a popular clustering algorithm which partitions the samples into K classes by maximizing an objective function {psi}(Z,C).


Formula 13

(13)
subject to


Formula 14

(14)
where Z is an ns x K partition matrix, and zi, k is an indicator variable: If zi, k = 1, the sample si belongs to the kth cluster. C is a set of cluster centers (C = {c1,...,cK}), and {chi}(si,ck) denotes the cosine distance between the sample si and the center ck of the kth cluster, which is defined as:


Formula 15

(15)
where ngs is the number of selected genes in the subspace. We adopt the cosine distance as the distance metric since the cosine distance can eliminate the effect of different magnitudes among the genes.

Through subspace clustering, GCC obtains the predicted labels of the samples. The adjacency matrix M is constructed by the predicted labels (Dudoit and Fridlyand 2003), whose elements mij are defined as:


Formula 16

(16)
where yi and yj denote the predicted labels of the samples si and sj, respectively.

2.3 Cluster ensemble
Given the number of classes K, GCC repeats the above two steps (subspace generation and subspace clustering) B times and obtains B clustering solutions (Formula ) with B adjacency matrices (Formula ). Then, GCC constructs an ns x ns consensus matrix (MK) by merging the adjacency matrix Formula (uisin{1,...,B}) as follows:


Formula 17

(17)
where B is the number of adjacency matrices, and the element mij in the consensus matrix MK denotes the frequencies that the ith sample si and the jth sample sj appear in the same class.

Afterwards, GCC constructs a graph (GK = (S,MK)) whose vertices correspond to the samples in S, and whose edges represent the probability that two samples appear in the same class, which is denoted as mK,ij in the consensus matrix MK. Then, the normalized cut algorithm is used to partition the samples into K classes based on MK by minimizing the following cost function:


Formula 18

(18)
where the definitions of the parameters are the same as the above subsection, except that the consensus matrix MK replaces the CM.

2.4 Cluster discovery
We further define an aggregated consensus matrix R as follows:


Formula 19

(19)
where Formula denotes the adjacency matrix obtained by the uth clustering solution corresponding to a particular K value. Each entry in the aggregated consensus matrix denotes the probability of two samples appearing in the same class.

GCC further converts the aggregated consensus matrix to a binary matrix Rb:


Formula 20

(20)
where rij and Formula denotes the entry in the ith-row and the jth column of R and Rb, respectively. By the same way, GCC converts the consensus matrix MK to a binary matrix Formula with entries Formula .

We propose a new cluster validity index Formula , known as the Modified Rand Index, as follows:


Formula 21

(21)
where Formula and Formula are the entries of Formula and Rb, respectively, and ns is the number of samples. This index balances the degree of agreement between the two matrices Formula and Rb against the term Formula , which penalizes a large set of clusters.

The optimal number of classes K* is selected as follows:


Formula 22

(22)

In general, cluster validity indices can be categorized into three types: internal indices, external indices and information-theory-based criteria (Sergios and Konstantinos, 2006). The Modified Rand Index can be considered a modified form of the external index. Unlike conventional external indices, the Modified Rand Index does not measure the discrepancy between the ground-truth partition and the partition obtained by the clustering algorithm, but the average of the discrepancies between the partition obtained by the consensus clustering approach and the partitions obtained by the clustering algorithm in a cluster ensemble. It also belongs to the category of indices for optimizing the predictive power/stability of the resulting solution (Handl et al., 05). With the help of the random subspace technique and the consensus clustering approach, the Modified Rand Index can be used to estimate the number of clusters.


    3 RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSION
 REFERENCES
 
3.1 Experiment setting
In the experiment, we compare GCC with the consensus clustering (CC) approach described in Monti et al. (2003). Based on the adoption of different feature spaces, sampling approaches, clustering algorithms and consensus functions, we consider the following four combinations: GCCcorr (subspace + non-resampling + correlation clustering + normalized cut), GCCKmeans (subspace + non-resampling + K-means +normalized cut), CCHC [complete space + resampling + hierarchical clustering with average linkage (HC) + HC] and CCSOM [complete space + resampling + self organizing map (SOM) + SOM]. The experiment settings for the CC approaches are the same as those in Monti et al. (2003). The experiment settings for the GCC approaches are shown in Table 1.


View this table:
[in this window]
[in a new window]

 
Table 1. The experiment setting for the GCC approaches (ngs denotes the number of genes of the subspace and B denotes the number of clustering solutions)

 
To evaluate the performance of these different approaches, we adopt the Adjusted Rand Index (ARI) (Milligan and Cooper, 1986) to measure the degree of agreement between different partitions with different numbers of clusters. If the partitions are identical, the value of the ARI is 1. If the partitions are drawn independently from one another, the index takes on an expected value of 0. Let (1) LT with KT classes be the true partition of the samples S, and LP with KP classes be the predicted partition of the same set of samples S. (2) Formula be the number of samples in the k-th class in the partition LT, Formula be the number of samples in the l-th class in the partition LP and Formula be the number of samples in both class k in LT and class l in LP. (3) ns be the number of samples. ARI is computed as follows:


Formula 23

(23)



Formula 24

(24)

Table 2 provides a summary of the datasets used in our experiments. We generate three synthetic data sets in the space [ < xrefref type="bibr"rid="B0" > 0</xref>,<xrefref type="bibr"rid="B1" > 1</xref > ]d (d is the number of features) using a set of Gaussian distributions with randomly selected centers, and with the covariance matrices of all distributions set to 0.25I (I denotes the identity matrix). The points in Synthetic1, Synthetic2 and Synthetic3 originate from 3, 4, 7 Gaussian clusters, respectively. Synthetic1 is a 1000-gene by 75-sample dataset, in which each class contains 25 samples. Synthetic2 is a 1000-gene by 100-sample dataset, which is used to simulate microarray data with noisy genes. Specifically, 200 noisy genes are included among the 1000 genes. Similar to Synthetic1, each class consists of 25 samples. Synthetic3 is 1000-gene by 100-sample dataset, which is used to simulate microarray data with unequal-size clusters and noisy genes. In addition to including the noisy genes, we vary the number of samples in each class to include 8, 12, 16, 20, 24, 28, 32 samples in the seven classes, respectively. For the real datasets, the data preprocessing procedure is the same as that described in Monti et al. (2003). In Table 2, the abbreviation SRBCT stands for small round blue cell tumors, while CNS tumors refers to embryonal tumors of the central nervous system. In the Leukemia dataset, bone marrow samples are obtained from acute leukemia patients at the time of diagnosis, while diagnostic bone marrow samples in the St.Jude dataset are from pediatric patients with acute leukemia. The ranges of K for Synthetic3 and the St.Jude leukemia dataset are both set to {2,...,15}, while the ranges of K for the other datasets are set to {2,...,9}.


View this table:
[in this window]
[in a new window]

 
Table 2. Summary of the datasets

 
In the experiments, we first explore the relationship between {zeta} (the Modified Rand Index) and ARI. Then, the optimal K value is selected based on {zeta}. Finally, we compare the performances of the different approaches.

3.2 Relationship between ARI and {zeta}
We observe an interesting relationship between ARI and {zeta}, by which we can estimate the optimal K value. Figure 2 illustrates the change of ARI with respect to K in the different datasets, while Figure 3 shows the change of the Modified Rand Index {zeta} corresponding to the different K values in the various datasets. An interesting observation is that the trend of the curve for ARI and that of the curve for {zeta} are very similar. We perform correlation analysis on the curve of ARI and the corresponding curve of {zeta} in the same dataset. Table 3 lists the results of the correlation analysis between ARI and {zeta} in all datasets corresponding to the different approaches. All the correlation values in Table 3 are greater than 0.65. This implies that the degree of dependence between ARI and {zeta} is high.


View this table:
[in this window]
[in a new window]

 
Table 3. The correlation analysis between ARI and {zeta} in all datasets

 

Figure 2
View larger version (27K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. The change of ARI with respect to different K values.

 

Figure 3
View larger version (26K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. The change of {zeta} (named zeta, which is the Modified Rand Index) with respect to different K values.

 
To estimate the true K value of a dataset, we further study the relationship between {zeta}, ARI and the K values. If the predicted number of classes is the same as the true number of classes, the value of ARI attains its maximum as shown in Figure 2, since the value of ARI is directly related to the true number of clusters in the dataset. It can also be seen that the peak of the ARI curve in Figure 2 corresponds to that of {zeta} in the same dataset as shown in Figure 3. In other words, the maximum value of {zeta} also corresponds to the optimal K value. Given a new set of microarray data with unknown clusters, ARI cannot be calculated, while {zeta} can be computed when Kmax is given. As a result, {zeta} can be considered as an alternative measure to discover the underlying clusters.

3.3 Experiment results
In general, the GCC approaches GCCcorr and GCCKmeans outperform the consensus clustering approaches CCHC and CCSOM when applied to the gene expression data as shown in Table 4.


View this table:
[in this window]
[in a new window]

 
Table 4. Estimated optimal K value by different approaches

 
GCCcorr outperforms CCHC and CCSOM and correctly discovers the true number of clusters in the three synthetic datasets and the following five real datasets: Breast (BRCA1-mutation-positive samples, BRCA2-mutation-positive samples, Sporadic samples), CNS tumors (medulloblastomas, primitive neuroectodermal tumors, atypical teratoid/rhabdoid tumors, malignant gliomas and normal cerebellum), leukemia (acute myeloid leukemia samples, T-lineage acute lymphoblastic leukemia samples and B-lineage ALL samples), Lung cancer (adenocarcinomas, squamous cell carcinomas, carcinoids and normal lung) and SRBCT [neuroblastoma (NB), nonHodgkin lymphoma (in this case Burkitt's; lymphoma (BL)), rhabdomyosarcoma (RMS) and Ewing's; family of tumors (EWS)]. In the St.Jude dataset, the estimated number of GCCcorr is 5, while the true K value is 6. In fact, the values of {zeta} corresponding to K = 5 and K = 6 are nearly the same. When K = 5, GCCcorr identifies four important leukemia sub-types in the St.Jude dataset: T-lineage ALL, E2A-PBX1, TELAML1 and MLL rearrangements, while the two sub-types BCR-ABL and ‘hyperdiploid > 50’ chromosomes are merged into a single class by GCCcorr.

The performance of GCCKmeans is also better than those of CCHC and CCSOM, and is comparable with that of GCCcorr, since GCCKmeans successfully discovers the underlying classes in three synthetic datasets and the following five real datasets: Breast cancer, CNS tumors, leukemia, Lung cancer and the St. Jude leukemia dataset (T-lineage ALL, E2A-PBX1, BCR-ABL, TELAML1, MLL rearrangements and ‘hyperdiploid > 50’ chromosomes, where ALL denotes acute lymphoblastic leukemia). GCCK means correctly discovers three classes in the SRBCT dataset: NB, nonHodgkin lymphoma (in this case BL) and RMS, while subdividing the class called EWS into two subtypes.

CCHC estimates the correct K value for the following datasets: Synthetic1, Synthetic2, Breast, CNS tumor and SRBCT, while CCSOM discovers the true underlying classes in Synthetic1, Synthetic2, Synthetic3, Breast and CNS tumor.

In general, to address the problems associated with the high-dimensional and small sample nature of gene expression data, together with their high noise levels and large biological variabilities, the GCC approaches adopt the random subspace technique, together with the correlation clustering algorithm or K-means, to generate a more diverse set of clustering solutions when compared with existing CC approaches, such that a more accurate solution can be obtained. In addition, the new consensus function in GCC performs better than those in existing CC methods due to the adoption of the normalized cut algorithm, which results in a more accurate partition of the consensus matrix.

Table 5 lists the corresponding values of ARI with respect to the estimated K value in Table 4. The GCC approaches clearly outperform the CC approaches, especially in the Leukemia dataset and the Lung cancer dataset. To provide a further comparison of the results in Table 5, we design a statistical table as shown in Table 6. If the ARI value obtained by the first approach is significantly better/worse [the level of significance is set at a difference of 0.05 in the ARI value, as adopted in Kuncheva and Vetrov (2006)] than that obtained by the second approach in one of the datasets, the win/lose count of the first approach is incremented once. If the difference between the ARI values based on the two approaches is smaller than the level of significance, the tie count is incremented instead. As shown in Table 6, GCC approaches GCCcorr and GCCKmeans clearly outperform the other approaches and achieve good results in most of the datasets.


View this table:
[in this window]
[in a new window]

 
Table 5. The corresponding values of ARI w.r.t the estimated K-values

 

View this table:
[in this window]
[in a new window]

 
Table 6. Statistical results by comparing different approaches

 
In the following experiments, we adopt GCCcorr to illustrate the properties of our proposed consensus clustering approaches, since (1) GCCcorr achieves good performance in most of the datasets, and (2) there is a higher correlation betweeen ARI and {zeta} in the case of GCCcorr for most of the datasets, as shown in Table 3. To further explore the robustness and the stability of our approach against different numbers of noisy components, we generate three more synthetic datasets (Synthetic4, Synthetic5 and Synthetic6) by varying the number of noisy genes in Synthetic2. The numbers of noisy genes in Synthetic4, Synthetic5 and Synthetic6 are 250, 300 and 350, respectively. The performance of GCC when applied to the three new datasets is illustrated in Figure 4. It can be seen that GCC is robust and stable against the changes in the number of noisy components, and can successfully estimate the four clusters in the datasets.


Figure 4
View larger version (4K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4. Effect of different numbers of noisy genes.

 
We further investigate the effect of the maximum K value (Kmax) on the Modified Rand Index, and the relationship between Kmax and the sparseness of the discretized aggregated consensus matrix Rb. The GCC algorithm GCCcorr is applied to the Synthetic2 dataset and the Leukemia dataset using different Kmax values. It is observed that when Kmax increases, GCCcorr still correctly estimates the number of clusters in the Synthetic2 dataset and the Leukemia dataset, as indicated in Tables 7 and 8. In addition, we observe that the ARI associated with GCCcorr is not affected by Kmax, while the value of the peak {zeta} decreases slightly when Kmax increases. In general, when Kmax is large, the number of entries in Rb whose values exceed the threshold in Equation (20) are small, which leads to sparseness of the binary matrix. We further observe that this sparseness causes a slight decrease of the peak {zeta} value for both datasets. However, this does not affect the capability of {zeta} to identify the correct number of clusters, since the set of {zeta} values corresponding to the different K values change in such a way that the position of the maximum point on the {zeta} versus K curve is maintained, as indicated in Figures 5 and 6. Although the peaks of the curves in Figures 5 and 6 become less prominent as Kmax increases, our proposed approach still correctly estimates the optimal K value from each dataset.


View this table:
[in this window]
[in a new window]

 
Table 7. The effect of Kmax on the Synthetic2 dataset

 

View this table:
[in this window]
[in a new window]

 
Table 8. The effect of Kmax on the Leukemia dataset

 

Figure 5
View larger version (7K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5. The relationship between {zeta} and Kmax (Synthetic2 dataset).

 

Figure 6
View larger version (7K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 6. The relationship between {zeta} and Kmax (Leukemia dataset).

 

    4 CONCLUSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSION
 REFERENCES
 
In this article, we investigate the problem of class discovery in gene expression data. The major contribution of this article is in the design of a new framework, known as GCC, to discover the underlying classes of the samples in gene expression data. Our new approach can successfully estimate the true number of classes for the datasets in our experiments. In addition, based on our experiment results, we also observe that our new approach outperforms the consensus clustering approaches proposed in Monti et al. (2003) when applied to the characterization of gene expression data.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: David Rocke

Received on April 11, 2007; revised on August 15, 2007; accepted on September 8, 2007

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSION
 REFERENCES
 

    Alizadeh AA, et al. Distinct types of diffuse large b-cell lymphoma identified by gene expression profiling. Nature (2000) 403:503–511.[CrossRef][Medline]

    Baldi P, Hatfield GW. DNA Microarrays and Gene Expression: From Experiments to Data Analysis and Modeling (2002) Cambridge: Cambridge University Press.

    Bertoni A, Valentini G. Ensembles based on random projections to improve the accuracy of clustering algorithms. Neural Nets, (WIRN 2005), LNCS (2005) 3931:31–37.

    Bertoni A, Valentini G. Randomized maps for assessing the reliability of patients clusters in DNA microarray data analyses. Artif. Intell. in Med. (2006) 37:85–109.[CrossRef]

    Bertoni A, Valentini G. Model order selection for biomolecular data clustering. BMC Bioinformatics (2007a) 8. (Suppl. 2), S7.

    Bertoni A, Valentini G. Randomized Embedding Clustering Ensembles for gene expression data analysis. In SETIT 2007 – Proceedings of IEEE International Conference on Sciences of Electronic (2007b) Technologies of Information and Telecommunications, Hammamet, Tunisia.

    Bhattacharjee A, et al. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinomas sub-classes. Proc. Natl Acad. Sci. (2001) 98:13790–13795.[Abstract/Free Full Text]

    Datta S, Datta S. Methods for evaluating clustering algorithms for gene expression data using a reference set of functional classes. BMC Bioinformatics (2006) 7:397.[CrossRef][Medline]

    Dudoit S, Fridlyand J. A prediction-based resampling method to estimate the number of clusters in a dataset. Genome Bio. (2002) 3:0036.1–0036.21.

    Dudoit S, Fridlyand J. Bagging to improve the accuracy of a clustering procedure. Bioinformatics (2003) 19:1090–1099.[Abstract/Free Full Text]

    Golub TR, et al. Molecular classification of cancer: class discovery and class prediction by gene expression. Science (1999) 286:531–537.[Abstract/Free Full Text]

    Grotkjaer T, et al. Robust multi-scale clustering of large DNA microarray datasets with the consensus algorithm. Bioinformatics (2006) 22:58–67.[Abstract/Free Full Text]

    Handl J, et al. Computational cluster validation in post-genomic data analysis Bioinformatics. Bioinformatics (2005) 21:3201–3212.[Abstract/Free Full Text]

    Hedenfalk I, et al. Gene-expression profiles in hereditary breast cancer. New Engl. J. of Med. (2001) 344:539–548.[CrossRef]

    Khan J, et al. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat. Med. (2001) 7:673–679.[CrossRef][Web of Science][Medline]

    Kuncheva LI, Vetrov DP. Evaluation of stability of k-means cluster ensembles with respect to random initialization. IEEE Trans. Pattern Anal. Mach. Intell. (2006) 28:1798–1808.[CrossRef][Medline]

    Mc Shane LM. Method for assessing reproducibility of clustering patterns observed in analyses of microarray data. Bioinformatics (2002) 18:1462–1469.[Abstract/Free Full Text]

    Milligan G, Cooper M. A study of the comparability of external criteria for hierarchical cluster analysis. Multivar. Behav. Res. (1986) 21:441–458.[CrossRef]

    Monti S, et al. Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Mach. Learn. (2003) 52:91–118.[CrossRef]

    Pomeroy S, et al. Gene expression-based classification and outcome prediction of central nervous system embryonal tumors. Nature (2002) 415:436–442.[CrossRef][Medline]

    Sergios T, Konstantinos K. Pattern Recognition (2006) 3rd. UK: Academic press, Elsevier. 733–765.

    Shi J, Malik J. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. (2000) 22:888–905.[CrossRef]

    Strehl A, Ghosh J. Cluster ensembles — a knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. (2002) 3:583–617.[CrossRef]

    Smolkin M, Ghosh D. Cluster stability scores for microarray data in cancer studies. BMC Bioinformatics (2003) 4:36.[CrossRef][Medline]

    Su AI, et al. Large-scale analysis of the human and mouse transcriptomes. Proc. Natl Acad. Sci. (2002) 99:4465–4470.[Abstract/Free Full Text]

    Topchy A, et al. Clustering ensembles: models of consensus and weak partitions. IEEE Trans. Pattern Anal. Mach. Intell. (2005) 27:1866–1881.[CrossRef][Medline]

    Valentini G. Clusterv: a tool for assessing the reliability of clusters discovered in DNA microarray data. Bioinformatics (2006) 22:369–370.[Abstract/Free Full Text]

    Valentini G. Mosclust: a software library for discovering significant structures in bio-molecular data. Bioinformatics (2007) 23:387–389.[Abstract/Free Full Text]

    Wigle D, et al. Molecular profiling of non-small cell lung cancer and correlation with disease-free survival. Cancer Res. (2002) 62:3005–3008.[Abstract/Free Full Text]

    Yeoh E-J, et al. Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell (2002) 1:133–143.[CrossRef][Web of Science][Medline]

    Yeung KY, et al. Model-based clustering and data transformations for gene expression data. Bioinformatics (2001) 17:977–987.[Abstract/Free Full Text]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary data
Right arrow All Versions of this Article:
23/21/2888    most recent
btm463v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (3)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Yu, Z.
Right arrow Articles by Wang, H.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Yu, Z.
Right arrow Articles by Wang, H.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?