Bioinformatics Advance Access published online on May 16, 2006
Bioinformatics, doi:10.1093/bioinformatics/btl185
1 Computer and Information Science and Engineering, University of Florida, Gainesville, FL, 32611
Motivation: We consider the problem of clustering a population of Comparative Genomic Hybridization (CGH) data samples. The goal is to develop a systematic way of placing patients with similar CGH imbalance profiles into the same cluster. Our expectation is that patients with the same cancer types will generally belong to the same cluster, as their underlying CGH profiles will be similar.
Results: We focus on distance-based clustering strategies, which proceed in two steps: 1) distances between all pairs of CGH samples are computed; 2) CGH samples are clustered based on these distances. We develop three pairwise distance/similarity measures, namely raw, cosine, and sim. The raw measure disregards correlation between contiguous genomic intervals and compares the aberrations in each genomic interval separately. The remaining measures assume that consecutive genomic intervals may be correlated. Cosine maps pairs of CGH samples into vectors in a high-dimensional space and measures the angle between them. Sim measures the number of independent common aberrations. We test our distance/similarity measures on three well-known clustering algorithms, bottom-up, top-down, and k-means, with and without centroid shrinking. Our results show that sim consistently performs better than the remaining measures. This indicates that the correlation of neighboring genomic intervals should be considered in the structural analysis of CGH data sets. The combination of sim with top-down clustering emerged as the best approach.
Availability: All software developed in this paper and all the datasets are available from the authors upon request.
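The abstract does not give formal definitions of the three measures, so the following Python sketch is only one plausible reading of them, not the authors' implementation. It assumes each CGH sample is encoded as one integer per genomic interval (+1 for gain, -1 for loss, 0 for no aberration); the function names, the encoding, and the plain-vector form of the cosine measure are all assumptions made for illustration. The sketch covers step 1 of the pipeline, the computation of pairwise similarities.

```python
import math
from itertools import combinations

# Assumed encoding (not taken from the paper): one integer per genomic
# interval, +1 = gain, -1 = loss, 0 = no aberration.

def raw_similarity(a, b):
    """Count intervals where both samples carry the same aberration.

    Each interval is compared on its own, so a long shared aberration
    contributes once per interval it covers.
    """
    return sum(1 for x, y in zip(a, b) if x != 0 and x == y)

def cosine_similarity(a, b):
    """Cosine of the angle between the two samples viewed as vectors.

    Assumption: the raw status vectors are used directly; the paper's
    actual mapping into a high-dimensional space may differ.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def sim_similarity(a, b):
    """Count maximal runs of contiguous intervals with a shared aberration.

    A shared aberration spanning several neighboring intervals is counted
    once, which is one way to read "independent common aberrations".
    """
    count = 0
    prev = 0  # shared status of the previous interval (0 = none)
    for x, y in zip(a, b):
        common = x if (x != 0 and x == y) else 0
        if common != 0 and common != prev:
            count += 1  # a new shared run starts here
        prev = common
    return count

def pairwise(samples, measure):
    """Symmetric matrix of measure(s_i, s_j); the diagonal is left at 0."""
    n = len(samples)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = measure(samples[i], samples[j])
    return matrix

# Two hypothetical samples over eight genomic intervals.
sample_a = [1, 1, 1, 0, -1, -1, 0, 0]
sample_b = [1, 1, 0, 0, -1, -1, 1, 0]
```

On these two hypothetical samples the raw measure counts the four shared intervals individually, while the run-based measure sees only two shared aberrations (one gain, one loss). That difference is the point the abstract makes about accounting for correlation between neighboring intervals. The resulting matrix from `pairwise` could then be passed to any of the clustering algorithms named above.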
Received February 6, 2006
Revised April 20, 2006
Accepted May 10, 2006
Distance-based clustering of CGH Data
Jun Liu 1 *, Jaaved Mohammed 1, James Carter 1, Sanjay Ranka 1, Tamer Kahveci 1, and Michael Baudis 2
2 Institut fuer Humangenetik, Rheinisch-Westfaelische Technische Hochschule, Aachen, Germany
Abstract
Associate Editor: Martin Bishop

