Skip Navigation


Bioinformatics Advance Access originally published online on December 6, 2006
Bioinformatics 2007 23(4):450-457; doi:10.1093/bioinformatics/btl624
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/4/450    most recent
btl624v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (1)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Liu, J.
Right arrow Articles by Kahveci, T.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Liu, J.
Right arrow Articles by Kahveci, T.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Markers improve clustering of CGH data

Jun Liu *, Sanjay Ranka and Tamer Kahveci

Computer and Information Science and Engineering, University of Florida Gainesville, FL 32611, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 DETECTION OF MARKERS
 3 PROTOTYPE-BASED CLUSTERING
 4 PAIRWISE SIMILARITY BASED...
 5 EXPERIMENTAL RESULTS
 6 CONCLUSIONS
 REFERENCES
 

Motivation: We consider the problem of clustering a population of Comparative Genomic Hybridization (CGH) data samples using similarity based clustering methods. A key requirement for clustering is to avoid using the noisy aberrations in the CGH samples.

Results: We develop a dynamic programming algorithm to identify a small set of important genomic intervals called markers. The advantage of using these markers is that the potentially noisy genomic intervals are excluded during the clustering process. We also develop two clustering strategies using these markers. The first one, prototype-based approach, maximizes the support for the markers. The second one, similarity-based approach, develops a new similarity measure called RSim and refines clusters with the aim of maximizing the RSim measure between the samples in the same cluster. Our results demonstrate that the markers we found represent the aberration patterns of cancer types well and they improve the quality of clustering significantly.

Availability: All software developed in this paper and all the datasets used are available from the authors upon request.

Contact: juliu{at}cise.ufl.edu


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 DETECTION OF MARKERS
 3 PROTOTYPE-BASED CLUSTERING
 4 PAIRWISE SIMILARITY BASED...
 5 EXPERIMENTAL RESULTS
 6 CONCLUSIONS
 REFERENCES
 
Numerical and structural chromosomal imbalances are one of the most prominent and pathogenetically relevant features of neoplastic cells Mitelman et al. (1972) One method for measuring genomic aberrations is Comparative Genomic Hybridization (CGH) Kallioniemi et al. (1992). CGH is a molecular-cytogenetic analysis method for detecting regions with genomic imbalances (gains or losses of DNA segments). Applying microarray technology to CGH measures thousands of copy number information distributed throughout the genome simultaneously Pinkel and Albertson (2005). Raw data from array CGH experiments is expressed as the ratio of normalized fluorescence of tumor and reference DNA. Normalized CGH ratio data surpassing predefined thresholds is considered indicative for genomic gains or losses, respectively. In contrast to the array CGH, the chromosomal CGH results (on which this paper is based) are annotated in a reverse in situ karyotype format (Mitelman, 1995) describing imbalanced genomic regions with reference to their chromosomal location. CGH data of an individual tumor can be considered as an ordered list of status values, where each value corresponds to a genomic interval (e.g. a single chromosomal band). The status can be expressed as a real number (positive, negative or zero for gain, loss or no aberration, respectively). We use this strategy and represent gain, loss and no change with +1, –1 and 0, respectively. Chromosomal and array CGH recently account for the majority of published analyses in cancer cytogenetics Gray et al. (1994); Bentz et al. (1996); Desper et al. (1999); Hoglund et al. (2005); Mattfeldt et al. (2001); Vandesompele et al. (2005); Joos et al. (2002).

A lot of different strategies for structural analysis of CGH data have been developed. Some of these analysis have been aimed at the spatial coherence of genomic segments with different copy number levels. Many attempts have been made on this area, such as a Gaussian model-based segmentation method Picard et al. (2005b), unsupervised hidden Markov model approach Fridlyand et al. (2004), hierarchical clustering tree Wang et al. (2005) and segmentation-clustering approach Picard et al. (2005a) etc. Willenbrock and Fridlyand (2005) pointed out that the smoothed (segmented) CGH data aids the downstream analyses, such as classification.

Unsupervised clustering methods are often employed to discover previously unknown sub-categories of cancer and to identify genetic biomarkers associated with the differentiation. Towards this end, many attempts have been done on the studies of gene microarray Jiang et al. (2004). However, to the best of our knowledge, so far there has been very limited study on the clustering of smoothed CGH data, especially on the heterogeneous sets of data that include a large (more than several hundreds) population of samples. In Mattfeldt et al. (2001), used a self-organizing map as a tool for cluster analysis of CGH data. However, their studies only focus on several tens of cases from a single cancer type.

Existing clustering algorithms Jain et al. (1999) cannot be applied to CGH data directly. This is because CGH data are structurally different from ordinary high-dimensional data since consecutive dimensions (i.e. genomic intervals) may be correlated. A point-like genomic aberration may expand to the neighboring intervals and result in a contiguous run of gain or loss status in CGH data.

In our earlier work Liu et al. (2006), we developed a similarity measure, Sim, for a pair of CGH samples. Sim measures the similarity as the number of overlapping segments (contiguous runs of aberrations of the same type). Although Sim leads to better clustering than existing measures over a large population of CGH samples, it has a drawback. Sim is affected from noisy aberrations in CGH data since it cannot distinguish random aberrations from the aberrations that actually lead to the underlying cancer.

Given a set of samples belonging to the same cancer type, we observe that a small number of point aberrations are common to a subset of samples. We say that these point aberrations are supported by these samples. (See Fig. 1 in Appendix for an example). Such aberrations with high-support (we will use the term markers) represent the recurrent genomic imbalances and can be used to characterize the imbalance profile of the sample set. Some recent analysis of CGH data discovered important genes or chromosomal regions that are similar to our notion of markers. For example, Rouveirol et al. (2006) proposed algorithms for computing recurrent minimal genomic alterations from discretized CGH data. Fu and Fu-Liu (2005) proposed a probability model of selection to evaluate the importance of genes in microarray data and choose a small set of biologically significant genes for classifier design. Beattie and Robinson (2006) used a digital paradigm to discretize the gene microarray and transfered the data into binary state patterns. A clustering based on Hamming distance was applied to discover biomarkers. However, these works have not discussed the usefulness of markers for the clustering of CGH data.


Figure 1
View larger version (5K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1 Two CGH samples X and Y with the values of genomic intervals listed in the order of positions. The segments are underlined.

 
The aim of this paper is to demonstrate that better clusters can be found for CGH datasets with the help of markers. The main thesis is that markers represent the key imbalances between all the patients and the non-marker areas are patient specific and hence negatively impact the clustering process.

Towards this end, we develop a dynamic programming technique to detect important markers in a group of samples. The resulting markers can be seen as the prototype of these samples. Clustering is an unsupervised learning process by its definition. The quality of a clustering can be measured using several criteria. Two important such criteria are coverage and the NMI measure. These are internal and external measures, respectively. We develop two methods, each working well for different measure. Depending on the purpose of clustering either of the two methods can be appropriate. In particular:

  1. The first one is a prototype-based approach. It starts with a random initial clustering and iteratively improves the clusters in two steps. The first step identifies a set of markers for each cluster that have sufficient support. The second step assigns each sample to the cluster whose markers are supported by that sample the most. This algorithm is guaranteed to converge.
  2. The second one is a similarity-based algorithm. It starts by identifying a set of markers that capture the aberration patterns globally. We develop a new similarity measure, RSim, by focusing on the aberrations at these marker positions. It then computes the average similarity between that sample and all the samples in each cluster using RSim. It assigns each sample to the cluster with the highest average similarity.

Our results demonstrate the utility of finding markers. Experimental results show that the prototype-based method has better internal measure than the competing methods. The similarity-based method has better external measure.

The rest of the paper is organized as follows. Section 2 proposes the technique of finding markers in a group of samples. Section 3 describes the prototype-based algorithm and proves its convergence. Section 4 describes the refined Sim measure and the similarity-based algorithm. Section 5 presents the experimental results. Section 6 concludes our work in this paper.


    2 DETECTION OF MARKERS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 DETECTION OF MARKERS
 3 PROTOTYPE-BASED CLUSTERING
 4 PAIRWISE SIMILARITY BASED...
 5 EXPERIMENTAL RESULTS
 6 CONCLUSIONS
 REFERENCES
 
Most cancers result from genomic instability and display various genomic aberrations. A recurrent alteration is a set of aberrations common to sufficiently many CGH samples. The recurrent alterations can be used to characterize the aberration pattern of samples. Due to the correlation between adjacent genomic intervals, recurrent regions can be represented using a small number of genomic intervals within these regions. We call these genomic intervals markers. Each marker is represented by two numbers <p, q>, where p and q denote the position and the type of aberration, respectively.

Given a set S of N CGH samples {s1, s2, ···, sN}. Let Formula denote the status value for sample j at genomic interval d, {forall}d, 1 ≤ d ≤ D, where D is the number of intervals. Let sj[u, v] be the segment of sj that starts at the uth interval and ends at the vth interval. We use the term segment to represent a contiguous block of aberrations of the same type. Formally, a list of status values Formula , for 1 ≤ u ≤ v ≤ D define a segment if genomic intervals u through v are in the same chromosome, the values from Formula to Formula are all gains or all losses, and Formula and Formula are different than Formula . For example, in Figure 1, sample X contains four segments. The first and third segments are gain type while the second and fourth segment are loss type.

Let {mt = < pt, qt >|1 ≤ t ≤ R}, be a set of markers that are ordered along the genomic intervals, i.e. p1 < p2 < ··· < pR. We define the support of sj to mt as {sigma}(sj, mt). Here, {sigma}(sj, mt) = 1 only if both of the following conditions are satisfied:

  1. Support: There exists a segment sj[u, v] overlapping with mt, i.e. u ≤ pt ≤ v and the type of sj[u, v] is the same as that of mt, i.e. Formula .
  2. Uniqueness: There exists no other marker Formula , t' < t, in the same chromosome band such that u ≤ pt' ≤ v, Formula and {sigma}(sj, mt') = 1.
Otherwise, {sigma}(sj, mt) = 0. We say sj supports mt or mt covers sj if {sigma}(sj, mt) = 1. We define the support value of marker mt as the sum of its support from all the samples. Formally, Formula .

The intuition behind these two conditions is as follows. (1) Support: The support value of a marker counts the number of samples that share the same aberration status as the type of this marker. Large support value for a marker implies that that marker is important in characterizing the aberration pattern of samples. The first condition ensures that a sample supports a marker only if it has the same aberration as that marker in the position specified by that marker. (2) Uniqueness: The aberrations in the same segment may correspond to a single aberration that spread to neighboring genomic intervals Liu et al. (2006). The second condition forces a segment overlapping with multiple markers of the same aberration type to support only one of those markers.

Marker detection problem can be formally defined as follows. Given a set of CGH samples S = {s1, s2, ···, sN} and a positive integer R, the goal is to find the set of R markers M = {m1, m2, ···, mR}, p1 < p2 < ... < pR, such that the sum of support of markers in M, i.e. Formula , is maximized. Next, we develop a dynamic programming algorithm to solve this problem optimally.

Let Formula , for 1 ≤pt ≤ d ≤ D, {forall}t isin[1, r] denote the largest possible support that r markers can get from the genomic intervals in the range [1.d]. Here, [1.d] denote the integers 1, 2, ... , d. O(1, 1) = support of the single marker at the first genomic interval. The value of O(d, r) in general, where 1 ≤ r ≤ R, r ≤ d ≤ D, can be computed as the maximum of two possible cases.

  • Case 1: There exists a marker mr at the dth genomic interval. In this case, O(d, r) can be computed as the maximum sum of support of mr and the optimal value of locating r – 1 markers. Formally, it can be written as O(d, r) = maxb–1≤d' ≤ d – 1 {O(d', r–1) + Support(mr)}, where b denotes the least genomic interval that is in the same chromosome as the dth genomic interval. Note that O(d', r–1) may correspond to different set of r – 1 markers for different values of d'. The type of marker mr can be either gain or loss. We choose the marker type as the one that leads to a larger Support(mr) value.
  • Case 2: There is no marker at the dth genomic interval. O(d, r) = O (d – 1, r) in this case.

Thus, O(d, r) can be computed using the following recursive equation, O(d, r) =



Formula

The marker set M that leads to O(D, R) corresponds to R markers with the largest sum of supports. We call the above approach a fixed number of markers approach.

An important feature of the dynamic programming approach is that optimal solutions to subproblems are retained so as to avoid recomputing their values Horowitz et al. (1998). We construct a D x R matrix with cell (d, r) storing the optimal value of O(d, r). An iterative program is implemented to fill this matrix. For each cell (d, r), we need to revisit cell (g, r–1), b – 1 ≤ g ≤ d –1, which takes constant time that is proportional to the average length of chromosome. Besides, we need O(N) time to compute Support(mr). So the time complexity of filling one cell is O(N). The overall time complexity of filling the whole matrix would be O(DNR).


    3 PROTOTYPE-BASED CLUSTERING
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 DETECTION OF MARKERS
 3 PROTOTYPE-BASED CLUSTERING
 4 PAIRWISE SIMILARITY BASED...
 5 EXPERIMENTAL RESULTS
 6 CONCLUSIONS
 REFERENCES
 
Given a set of CGH samples, our marker detection technique finds a set of markers that characterize the aberration pattern of the samples in that set and can be considered as the prototype or model representing the samples. In this section, we develop a prototype-based clustering algorithm to partition the dataset into subsets such that each subset has a prototype common to the samples in that subset. It is similar in spirit to the k-means algorithm. The user specifies the number of clusters, k, and the number of markers per cluster, R. It starts with a random partitioning of the samples. It then iteratively maximizes an objective function, which we call cohesion function, in two steps. We discuss the cohesion function later. The two steps are described as follows.

  • Refinement step: For each cluster, the dynamic programming technique developed in Section 2 is used to identify the optimal set of R markers. These markers serve as the prototype of this cluster.
  • Reassignment step: Each sample is assigned to the cluster whose prototype is supported the most by that sample.

Essentially, both steps optimize a cohesion function alternatively until the value of this function does not change, i.e. converges to a stable value. We formally prove this below.

Given a set of CGH samples S = {s1, s2, ... , sN}. Let K denote the number of clusters. Let ci denote the ith cluster. A clustering of samples can be represented by a K-partition, C = c1 {cup} c2 {cup} cK, that divides samples into K disjoint clusters. Let R be the number of markers in each cluster. Let Mi = {mi,t| 1 ≤ t ≤ R} denote the set of markers (i.e. prototype) for cluster i, where 1 ≤ i ≤ K, and mi,t denote the tth marker in the ith cluster. Each marker mi,t is a tuple <pi,t, qi,t>, where pi,t and qi,t denote the position and type of this marker respectively. Let M denote the set of prototypes M = {M1, M2, ... , MK}. We define a cohesion function of clustering as below


Formula

where {sigma}(sj, mi,t) is defined as same as in Section 2. Essentially, the cohesion function computes the intra-cluster similarity between samples and cluster prototypes.

In the refinement step, we optimize the cohesion function by refining the prototype (markers) of clusters given that samples are partitioned into K clusters. For cluster i, let Formula denote the largest support that t markers can get from the genomic intervals in the range [1.d]. Thus the optimal cohesion function can be written as Formula where M* denotes the set of refined prototypes and M* = ArgmaxM(cohesion(C, M)). The dynamic programming technique described in Section 2 is used to compute Oi(D, R) for 1 ≤ i ≤ K and, in this way, the cohesion function is optimized.

In the reassignment step, we reassign the sample sj to the ith cluster whose prototype Mi is supported the most by sj, i.e. Mi covers the largest number of segments in sj. This is because, otherwise, the cohesion function could always be improved by moving sj into the ith cluster. Formally, C* = ArgmaxC(cohesion(C, M)) where C* denotes the new partition of samples after reassignment step.

Proof of convergence: It can be seen that both steps, refinement and reassignment step, are connected through the cohesion function they alternatively optimize. The dynamic programming in refinement step optimize the cohesion with respect to M. At iteration (h) we have


Formula

On the other hand, the reassignment step optimize the cohesion based on C, i.e.


Formula

Put together, our algorithm generates a sequence (C(h), M(h)), h ≥ 0, that increases the cohesion function as


Formula

Since the maximum value of cohesion function is finite given a set of samples, our algorithm converges to a stable value at the end. It is worth noting that our algorithm, similar to k-means, does not guarantee to converges to a global maxima.

Note that it is possible to generalize this method to variable number of markers for different clusters with two modifications to the algorithm. Instead of fixing the number of markers, we fix the segment coverage of markers for each cluster in the refinment step. The segment coverage is defined as the ratio of segments that are covered by the markers to the total number of segments in the cluster. The coverage ratio is a normalized value between 0 and 1. It increases as the number of markers increases. The reassignment step needs to be modified slightly as well as follows. A sample is moved to a new cluster only if this sample supports the prototype of new cluster more than that of old cluster and the coverage ratio of the clusters do not drop to less than the given coverage ratio cutoff. We tested the method with variable number of markers. It did not produce better results than the prototype-based method with fixed number of markers. Therefore, we do not report any experimental results.


    4 PAIRWISE SIMILARITY BASED CLUSTERING
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 DETECTION OF MARKERS
 3 PROTOTYPE-BASED CLUSTERING
 4 PAIRWISE SIMILARITY BASED...
 5 EXPERIMENTAL RESULTS
 6 CONCLUSIONS
 REFERENCES
 
Pairwise similarity based algorithms partition the samples into clusters so that the similarity between samples from the same cluster are larger than the similarity between samples of different cluster. This generally requires calculating the similarity between two samples. In our previous work Liu et al. (2006), we proposed a segment-based similarity measure, called Sim, for CGH data. In this section, we develop a new similarity measure, called RSim, which avoids noisy aberrations with the help of markers.

Let D denote the number of genomic intervals of each sample. Let Formula and Formula be two CGH samples. Here, Formula and Formula denote the value or status of the dth genomic interval of si and sj, respectively.

We, first, summarize the Sim measure we developed for computing the similarity of two CGH data. We call two segments from two samples overlapping if they have at least one common genomic interval of the same type. Sim constructs maximal segments by combining as many contiguous aberrations of the same type as possible. Thus, each sample translates into a sequence of segments. For example, in Figure 1, si and sj are two samples that have four and two segments, respectively. After this transformation, Sim computes the similarity between two CGH samples as the number of overlapping segment pairs. This is because each overlap may indicate a common point-like aberration in both samples which potentially led to the overlapping segments. In Figure 1, the first segment of si is overlapping with the first segment of sj. Similarly the third segment of si is overlapping with the second segment of sj. Sim computes the similarity between two samples based on the genomic aberrations local to these samples. Thus, Sim cannot distinguish the true aberrations from noisy ones. As a result, Sim is a local measure that is easily biased by the noise. Next, we develop a new approach that addresses this limitation.

We propose to employ markers to eliminate the contribution of noise to the pairwise similarity. We develop a refined Sim measure, called RSim, as follows. Let M = {m1, m2, ···, mR}, p1 < p2 < ··· < pR, denote the set of markers that are globally identified in all samples. The markers imply the important genomic intervals that are associated with the aberration patterns of samples. Given two CGH samples si and sj, RSim computes the similarity between them as the number of overlapping segments pairs, such that these segments satisfy both of the following two conditions:

  1. At least one of the markers in M is contained in both segments.
  2. Both of the overlapping segments have the same aberration type as the marker they both contain.

Formally, let Formula and Formula be a pair of segment from samples si and sj, respectively. RSim counts this pair of segments as one only if

  1. There exists a marker mt = <pt, qt>, mt isin M, such that u ≤ pt ≤ v and u' ≤ pt ≤ v', and
  2. The aberration type of both segments is the same as that of mt, i.e. Formula .
Unlike Sim, RSim does not consider the overlapping segments that do not intersect with any marker. This is because RSim considers such segments as noise. For example, assume that there are two markers in Figure 1, one at the 3rd and the other at the 11th genomic interval. Then RSim measure for si and sj is computed as one, whereas Sim measure is two.

An important observation on RSim is as follows. As the number of markers approaches to the number of genomic intervals in the CGH data, RSim becomes equivalent to the Sim measure. This is because all segments in the CGH data will overlap with a marker and contribute to the similarity. Thus Sim is a special case of RSim when noisy aberrations are not eliminated.

Our previous work Liu et al. (2006) showed that Sim works best when combined with topdown algorithm, compared to other popular clustering algorithms, such as bottom-up and k-means. In this paper, we propose to use the topdown clustering method with RSim as the pairwise similarity measure for pairs of CGH samples.

Note that it is possible to extend the Raw measure in our previous work Liu et al. (2006) by taking the markers into account. The extended Raw measure works as follows. For each pair of samples, we compute the similarity between them as the number of genomic intervals that meet the following two conditions. First, both samples have the same aberration type (gain or loss) at this interval. Second, one marker of the same type as both samples appears at this interval. Our experiments (results ommited due to space limit) show that although the markers improve the original Raw measure, RSim is always superior to this measure.


    5 EXPERIMENTAL RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 DETECTION OF MARKERS
 3 PROTOTYPE-BASED CLUSTERING
 4 PAIRWISE SIMILARITY BASED...
 5 EXPERIMENTAL RESULTS
 6 CONCLUSIONS
 REFERENCES
 
Dataset. With >12 000 cases Baudis (2006), the largest resource for published CGH data can be found in the Progenetix database Baudis and Cleary (2001) (http://www.progenetix.net). We use three different datasets, dissimilarDS, interDS and similarDS, taken from Progenetix databases. Each dataset contains >800 CGH samples (i.e. cytogenetic imbalance profiles of tumor samples) from four different histopathological cancer types. Each sample has been coded according to the ICD-O-3 system Fritz et al. (2000) and consists of 862 ordered genomic intervals extracted from 24 chromosomes. In principle, each dataset can be mapped to an integer matrix of size N x 862, where N denotes the number of samples. The difference of these three datasets are the divergence of aberration patterns in distinct cancer types. In dissimilarDS, the samples of different cancer types contain diverse aberration patterns that are easily distinguished from each other. The samples in similarDS contain similar aberration patterns. The interDS dataset is at an intermediate degree. The choices of these datasets are based on a visual inspection of the matrices (as in Fig. 1 in appendix) for each of the cancer types.

System specifications. Our experimental simulations were run on a system with dual 2.59 GHz AMD Opteron Processors, 8 GB of RAM, and an Linux operating system.

5.1 Quality of clustering
Measuring the quality of clustering is a challenging task as it is an unsupervised learning task. There are a number of internal and external cluster validation techniques that are described in the literature. In the following, we describe two measures that can be used for evaluating the clustering.

  1. Coverage Measure (CM): An internal measure evaluates the quality of clusters if the class labels of samples are not known a priori. One possible internal measure is the cohesion function defined in Section 3. This function measures the total support of markers over each cluster. We use the term Coverage Measure (CM) to denote this measure. Markers with high-support potentially convey some meaningful biological information and potentially can serve as the first step for further analysis, such as the identification of new oncogenes and tumor suppressor genes. A group of markers can be considered as biologically relevant if they cover most of the segments in all the patients.
  2. Normalized Mutual Information (NMI): One way to measure the quality of clustering is to see if each cluster predominantly consists of samples from a single cancer type. This clearly makes the assumption that this information is available (as was the case for datasets used in this paper) and those samples from the same cancer type will generally be similar to each other. The external measure, known as the Normalized Mutual Information, described below can be used for this purpose. Let N, U and K denote the total number of samples, the number of different cancer types and the number of clusters, respectively. Let a1, a2, ··· ,am denote the number of samples that belong to each cancer type. Similarly, let b1, b2, ··· ,bk be the number of samples that belong to each cluster. Let ci,j, {forall}i, j, 1 ≤ i ≤ U and 1 ≤ j ≤ K, denote the number of samples in jth cluster that belong to the ith cancer type. The NMI function is computed as:


    Formula

    NMI function takes a value in [0.1] interval. Higher values indicate better clustering qualities. Unlike other external measures, NMI was computed based on mutual information I(X; Y) between a random variable X, governing the cluster labels and a random variable Y, governing the cancer types. It has been argued that the mutual information is a superior measure than purity or entropy Strehl and Ghosh (2002). Moreover, NMI is quite impartial to the number of clusters Zhong and Ghosh (2005).

5.2 Quality of markers
Measuring the quality (or the biological relevance) of the markers identified for the clusters is also important. Here, we develop a measure to address this. We combine all the markers found in each cluster. For each marker in combined marker set, we first compute the ratio of the samples that support it among the samples in each cancer type. Thus, if there are T cancer types in the dataset, T values are computed for each marker. We define the maximum of these ratios as the biological relevance of this marker. This is because a larger ratio of support from one cancer type indicates that the marker better captures the aberration pattern of that cancer type. Therefore, this marker is biologically relevant in that cancer type. We use the term Global Maximum Support (GMS) to represent this measure. Formally, let Formula denote the set of markers. Let Ci, 1 ≤ i ≤ T denote the set of samples that belong to ith cancer type, where T is the number of different cancer types in the dataset. GMS measure for marker Formula is computed as:


Formula

Note that GMS differs greatly from CM for the CM measure is computed over clusters identified by the underlying clustering strategy, whereas GMS is computed over the cancer types.

5.3 Evaluation
We tested the prototype-based approach and two similarity-based approaches, RSim and Sim Liu et al. (2006) over three datasets, dissimilarDS, interDS and similarDS. For each clustering method and dataset, we created 6, 8 and 10 clusters. For each number of clusters, we tried different number of markers, i.e. 4, 6, 8, 10 and 12 markers, per cluster in the prototype-based approach. This is because biologists have pointed out that a total of 4–7 genetic alterations can be estimated for the malignant transformation of a cell Renan (1993). Thus, we estimated that the number of aberrations common to the samples of one cluster could be around 10. For the consistency, we also employed different numbers of markers for each clustering of 6, 8 and 10 clusters in RSim. Here, the number of markers is determined as the product of number of clusters and the number of markers per cluster used in the prototype-based approach. For example, 24, 36, 48, 60 and 72 markers were found to create six clusters using RSim. We compared three methods according to the qualities of clusters. We evaluated the cluster qualities using both NMI and CM measures. To evaluate the cluster qualities using CM, for both RSim and Sim, we identified the markers for each resulting cluster. The number of markers per cluster is the same as that used in the prototype-based approach. We compute the error bars for part of the results. We also used GMS to evaluate the biological relevance of markers found in prototype-based approach and RSim approach.

The CM results are shown in Table 1. The CM values monotonically increase as the number of clusters or number of markers increase. Thus, we compare three clustering methods for the same number of clusters and markers. We observe that prototype-based approach has 8–34 % better coverage than RSim and 15–41% better coverage than Sim. This is because the cohesion function optimized in prototype-based approach is the same as the Coverage Measure. RSim and Sim do not optimize CM directly. The results also show that RSim is superior to Sim most of the time. This is because the use of markers in RSim filters out the noise that are irrelevant to the aberration patterns. As a result, the markers in each cluster are supported by more samples in RSim as compared to Sim.


View this table:
[in this window]
[in a new window]

 
Table 1 The coverage measure for the three clustering methods applied over each of three datasets

 
The NMI results are shown in Table 2. Since Sim produces clusters without finding markers, we list its results on a separate column. The results show that all three methods perform better on dissimilarDS than interDS and similarDS. This is because the aberration patterns of distinct cancer types in dissimilarDS are divergent. Thus, it is harder to cluster interDS and similarDS datasets. From the results, we observe that RSim and Sim always beat prototype-based approach in terms of the NMI values. This observation, together with the conclusion from results in Table 1, indicates that NMI measure has no apparent relationship with CM. This is because NMI computes the quality based on the class labels of samples. On the other hand, CM evaluates the compactness of samples in each cluster based on chromosomal aberration patterns and completely ignores the class labels. Therefore, we conclude that the pairwise similarity-based clustering approaches are more suitable to external measures, such as NMI, while the prototype-based approach works better for the CM. RSim usually has the best NMI results using 10 markers. When compared to Sim, RSim usually has better NMI values. This indicates that the use of markers in refining the pairwise similarity also leads to a better clustering in terms of NMI. Given that RSim has better CM (Table 1) and NMI values than Sim (Table 2), we conclude that RSim is a better pairwise similarity measure than Sim.


View this table:
[in this window]
[in a new window]

 
Table 2 The NMI values of the three clustering methods are applied over each of three datasets

 
We compute the error bars of experimental results as follows. We randomly sample 50% of each dataset and cluster it using three methods, prototype-based, RSim and Sim, described in the paper. To reduce the amount of computations, we choose one configuration from different combinations of parameters in the experiments. For each dataset, we create eight clusters. We identify 10 markers per cluster in the prototype-based approach and 80 markers in RSim approach. We then compute both NMI and CM values for the resulting clusters using 10 markers per cluster. We repeat this process 100 times and compute the error bar for the values of NMI and CM. The error bar indicates the interval where 5–95% of the results lie. Table 3 shows the results with error bars. Note that the results of CM are roughly half the results shown in Table 1 because the calculation of CM depends on the dataset and we sample 50% of each dataset in the experiments. The results show that RSim is superior to Sim most of the time in terms of both NMI and CM. Moreover, among the three clustering methods, RSim and prototype-based approach works the best for NMI and CM measures, respectively. These observations are compatible to those we obtained from Tables 1 and 2. Therefore, the error bars confirm our earlier conclusions.


View this table:
[in this window]
[in a new window]

 
Table 3 The percentage error for three clustering methods, prototype-based (denoted as Proto), RSim and Sim, over each of three datasets, dissimilarDS, interDS and similarDS, to create eight clusters and identify 10 markers per cluster

 
Next experiment compares the GMS values for RSIM and prototype-based clustering approaches. For each dataset, we created eight clusters and identified 10 markers per cluster. This is because these results are among the best results of each dataset in Table 2. We then sort the markers in descending GMS value order. We plot the sorted results of both RSim and prototype-based approach (Fig. 2). The plots show that the maximum global support of markers found by RSim is always comparable to or better than those found by the prototype-based approach.


Figure 2
View larger version (10K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2 Plots of Global Maximum support of markers found in (a) dissimilarDS, (b) interDS and (c) similarDS, respectively, using both prototype-based approach and RSim. The results of prototype-based approach (denoted as Proto) and RSim are plotted in solid line and dashed line, respectively.

 

    6 CONCLUSIONS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 DETECTION OF MARKERS
 3 PROTOTYPE-BASED CLUSTERING
 4 PAIRWISE SIMILARITY BASED...
 5 EXPERIMENTAL RESULTS
 6 CONCLUSIONS
 REFERENCES
 
We considered the problem of clustering CGH data of a population of cancer patient samples. There are three main contributions of our work:
  1. We developed a dynamic programming algorithm to identify the optimal set of important genomic intervals called markers. The advantage of using these markers is that the potentially noisy genomic intervals are excluded in the computation of pairwise similarity between samples.
  2. We developed two clustering strategies using these markers. The first one, prototype-based approach, maximizes the support for the markers. The second one, similarity-based approach, develops a new similarity measure called RSim. It iteratively refines clusters with the aim of maximizing the RSim measure between the samples in the same cluster. We demonstrated the utility of such a measure in improving the quality of clustering using the classified disease entities from the Progenetix database. Our results show that the markers we found represent the aberration patterns of cancer types well.
  3. We developed several measures for comparing markers and different clustering methods. Our experimental results show that optimizing for the coverage measure may not lead to better values of NMI and vice versa.


    Acknowledgments
 
The authors would like to thank Dr Michael Baudis for confirming the selection of datasets. The work of authors is supported in part by the National Science Foundation under Grant ITR 0325459. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: John Quackenbush

Received on September 24, 2006; revised on December 4, 2006; accepted on December 4, 2006

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 DETECTION OF MARKERS
 3 PROTOTYPE-BASED CLUSTERING
 4 PAIRWISE SIMILARITY BASED...
 5 EXPERIMENTAL RESULTS
 6 CONCLUSIONS
 REFERENCES
 

    Baudis, M. and Cleary, M.L. (2001) Progenetix.net: an online repository for molecular cytogenetic aberration data. Bioinformatics, 17, 1228–1229[Abstract/Free Full Text].

    Baudis, M. (2006) An online database and bioinformatics toolbox to support data mining in cancer cytogenetics. Biotechniques, 40, 269–270, 272[Web of Science][Medline].

    Beattie, B.J. and Robinson, P.N. (2006) Binary state pattern clustering: A digital paradigm for class and biomarker discovery in gene microarray studies of cancer. J. Comput. Biol, . 13, 1114–1130[CrossRef][Web of Science][Medline].

    Bentz, M., et al. (1996) High incidence of chromosomal imbalances and gene amplifications in the classical follicular variant of follicle center lymphoma. Blood, 88, 1437–1444[Abstract/Free Full Text].

    Kallionieni, A., et al. (1992) Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors. Science, 258, 818–821[Abstract/Free Full Text].

    Fridlyand, J., et al. (2004) Hidden markov models approach to the analysis of array CGH data. J. Multivar. Anal, . 90, 132–153[CrossRef].

    Desper, R., et al. (1999) Inferring tree models for oncogenesis from comparative genome hybridization data. J. Comput. Biol, . 6, 37–52[Web of Science][Medline].

    Fritz, A., Percy, C., Jack, A., Sobin, L. In Parkin, M. (Ed.). International Classification of Diseases for Oncology (ICD-O), (2000) , Geneva 3rd edn. World Health Organization.

    Fu, L.M. and Fu-Liu, C.S. (2005) Evaluation of gene importance in microarray data based upon probability of selection. BMC Bioinformatics, 6, 67[CrossRef][Medline].

    Gray, J.W., et al. (1994) Molecular cytogenetics of human breast cancer. Cold Spring Harb. Symp. Quant. Biol, . 59, 645–652[Abstract/Free Full Text].

    Hoglund, M., et al. (2005) Statistical behavior of complex cancer karyotypes. Genes Chromosomes Cancer, 42, 327–341[CrossRef][Web of Science][Medline].

    Horowitz, E., Sahni, S., Rajasekaran, S. Computer Algorithms/C++, (1998) , New York, NY, USA Computer Science Press.

    Jain, A.K., et al. (1999) Data clustering: a review. ACM Comput. Surv, . 31, 264–323[CrossRef].

    Jiang, D., et al. (2004) Cluster analysis for gene expression data: a survey. IEEE Trans. Knowl. Data Eng, . 16, 1370–1386[CrossRef].

    Joos, S., et al. (2002) Classical hodgkin lymphoma is characterized by recurrent copy number gains of the short arm of chromosome 2. Blood, 99, 1381–1387[Abstract/Free Full Text].

    Liu, J., et al. (2006) Distance-based clustering of CGH data. Bioinformatics, 22, 1971–1978[Abstract/Free Full Text].

    Mattfeldt, T., et al. (2001) Cluster analysis of comparative genomic hybridization CGH data using self-organizing maps: application to prostate carcinomas. Anal. Cell. Pathol, . 23, 29–37[Web of Science][Medline].

    Mitelman, F., et al. (1972) Tumor etiology and chromosome pattern. Science, 176, 1340–1341[Abstract/Free Full Text].

    In Mitelman, F. (Ed.). International System for Cytogenetic Nomenclature, (1995) , Karger, Basel.

    Picard, F., Robin, S., Lebarbier, E., Daudin, J.-J. (2005a) A segmentation-clustering problem for the analysis of array CGH data. In Applied Stochastic Models and Data Analysis, .

    Picard, F., et al. (2005b) A statistical approach for array CGH data analysis. BMC Bioinformatics, 6, 27[CrossRef][Medline].

    Pinkel, D. and Albertson, D.G. (2005) Array comparative genomic hybridization and its applications in cancer. Nat. Genet, . 37, Suppl., S11–S17.

    Renan, M.J. (1993) How many mutations are required for tumorigenesis? implications from human cancer data. Mol. Carcinog, . 7, 139–146[Web of Science][Medline].

    Rouveirol, C., et al. (2006) Computation of recurrent minimal genomic alterations from array-cgh data. Bioinformatics, 22, 849–856[Abstract/Free Full Text].

    Strehl, A. and Ghosh, J. (2002) Cluster ensembles—a knowledge reuse framework for combining partitionings. Proceedings of AAAI 2002, AAAIEdmonton, Canada, pp. 93–98.

    Vandesompele, J., et al. (2005) Unequivocal delineation of clinicogenetic subgroups and development of a new model for improved outcome prediction in neuroblastoma. J. Clin. Oncol, . 23, 2280–2299[Abstract/Free Full Text].

    Wang, P., et al. (2005) A method for calling gains and losses in array CGH data. Biostatistics, 6, 45–58[Abstract].

    Willenbrock, H. and Fridlyand, J. (2005) A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics, 21, 4084–4091[Abstract/Free Full Text].

    Zhong, S. and Ghosh, J. (2005) Generative model-based document clustering: a comparative study. Knowl. Inf. Syst, . 8, 374–384[CrossRef].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
J. Liu, N. Bandyopadhyay, S. Ranka, M. Baudis, and T. Kahveci
Inferring progression models for CGH data
Bioinformatics, September 1, 2009; 25(17): 2208 - 2215.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
L. Y. Wu, H. A. Chipman, S. B. Bull, L. Briollais, and K. Wang
A Bayesian segmentation approach to ascertain copy number variations at the population level
Bioinformatics, July 1, 2009; 25(13): 1669 - 1679.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
S. P. Shah, K-J. Cheung Jr, N. A. Johnson, G. Alain, R. D. Gascoyne, D. E. Horsman, R. T. Ng, and K. P. Murphy
Model-based clustering of array CGH data
Bioinformatics, June 15, 2009; 25(12): i30 - i38.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/4/450    most recent
btl624v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (1)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Liu, J.
Right arrow Articles by Kahveci, T.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Liu, J.
Right arrow Articles by Kahveci, T.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?