Skip Navigation


Bioinformatics Advance Access originally published online on July 27, 2007
Bioinformatics 2007 23(18):2463-2469; doi:10.1093/bioinformatics/btm359
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary data
Right arrow All Versions of this Article:
23/18/2463    most recent
btm359v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (8)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Huang, J.
Right arrow Articles by Pawitan, Y.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Huang, J.
Right arrow Articles by Pawitan, Y.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Robust smooth segmentation approach for array CGH data analysis

Jian Huang 1, Arief Gusnanto 2, Kathleen O'Sullivan 1, Johan Staaf 3, Åke Borg 3 and Yudi Pawitan 4,*

1Statistical Laboratory, Department of Statistics, University College Cork, Ireland, 2MRC Biostatistics Unit, Institute of Public Health, Cambridge CB2 2SR, UK, 3Department of Oncology, University Hospital, SE-221 85 Lund and 4Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODOLOGY
 3 SIMULATION STUDY
 4 REAL DATA EXAMPLES
 5 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: Array comparative genomic hybridization (aCGH) provides a genome-wide technique to screen for copy number alteration. The existing segmentation approaches for analyzing aCGH data are based on modeling data as a series of discrete segments with unknown boundaries and unknown heights. Although the biological process of copy number alteration is discrete, in reality a variety of biological and experimental factors can cause the signal to deviate from a stepwise function. To take this into account, we propose a smooth segmentation (smoothseg) approach.

Methods: To achieve a robust segmentation, we use a doubly heavy-tailed random-effect model. The first heavy-tailed structure on the errors deals with outliers in the observations, and the second deals with possible jumps in the underlying pattern associated with different segments. We develop a fast and reliable computational procedure based on the iterative weighted least-squares algorithm with band-limited matrix inversion.

Results: Using simulated and real data sets, we demonstrate how smoothseg can aid in identification of regions with genomic alteration and in classification of samples. For the real data sets, smoothseg leads to smaller false discovery rate and classification error rate than the circular binary segmentation (CBS) algorithm. In a realistic simulation setting, smoothseg is better than wavelet smoothing and CBS in identification of regions with genomic alterations and better than CBS in classification of samples. For comparative analyses, we demonstrate that segmenting the t-statistics performs better than segmenting the data.

Availability: The R package smoothseg to perform smooth segmentation is available from http://www.meb.ki.se/~yudpaw

Contact: yudi.pawitan{at}ki.se


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODOLOGY
 3 SIMULATION STUDY
 4 REAL DATA EXAMPLES
 5 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
The DNA-sequence copy number is the number of copies of DNA at a region of a genome. In humans, the normal copy number is two for all the autosomes. DNA copy number alterations are key genetic events in the development and progression of human cancers (Lengauer et al., 1998). Recently, array comparative genomic hybridization (aCGH) (Snijders et al., 2001) was developed to measure DNA copy-number variations across a whole genome. In aCGH, genomic DNA isolated from tumor and normal cells are differentially labeled by two fluorescent dyes and co-hybridized to a microarray whose spots contain DNA sequences that map to chromosomal regions within the human genome. Subsequently, the ratio of the intensities of the two fluorochrome is computed at each spot. An obvious way to visualize copy-number variation is to plot the spot fluorescence ratios as a function of their location within the genome.

Statistical methods for analyzing copy number data are aimed at identifying regions of genomic alteration that correlate with clinical phenotypes such as survival. Here, one faces hypothesis testing problems with thousands or tens of thousands of regions to be considered simultaneously. Some work has been done to address this problem (e.g. Pawitan et al., 2005; Pollack et al. 2002). However, the immediate problem of applying these methods to aCGH data is that the genomic spatial structure has not been taken into consideration.

Segmentation is one approach to make use of the spatial information. Various segmentation methods have been proposed for analyzing aCGH data. Olshen et al. (2004) proposed a circular binary segmentation (CBS) algorithm that identifies the change points through successive comparison of segments of the chromosome, with evaluation of local significance via permutation. This is followed by a pruning algorithm to control the number of change points. Picard et al. (2005) proposed a likelihood-based method to identify the change points for the sequence of log2-ratios. Hupe et al. (2004) likewise presented a Gaussian-based likelihood approach (GLAD). Change points are identified using a piecewise-constant regression model that employs adaptive weights smoothing (AWS), in which the errors are assumed to be independent and normally distributed. Fridlyand et al. (2004) proposed a discrete-state hidden Markov Model (HMM) approach. This model assumes that, given the genetic states at all previous regions, the genetic state at a given region depends only on the true state at the immediately previous region. Wang et al. (2005) used an agglomerative clustering technique (CLAC) to identify the change points.

All of those segmentation methods are based on modeling data as a series of discrete segments, with unknown boundaries and unknown heights. However, it is obvious from observed data (e.g. Fig. 1) that most segments are not discrete. Although the underlying biological process is discrete, in reality a variety of biological and experimental factors cause the signal under study to deviate from a stepwise function (Engler et al. 2006; Picard et al. 2005). One possible factor is the cell-to-cell variability in genomic instability (Kronenwett et al., 2004), so the observed pattern is only an average of a large number of cells with potentially different copy-number patterns. This averaging process tends to smooth out the discrete boundaries.


Figure 1
View larger version (11K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. Results of smooth segmentation based on the Cauchy model (using smoothseg). Data shown corresponds to chromosome 1 – 4 of a breast cancer sample.

 
To allow for the possibility of smooth transition between segments, we propose a smooth segmentation (smoothseg) method to analyze aCGH data. Another promising algorithm is a denoising by wavelets (Hsu et al., 2005). To get a robust segmentation, our model is based on a doubly heavy-tailed random-effect model, where the first heavy-tailed structure deals with outliers and the second with jumps in the underlying pattern associated with different segments. Smoothness is achieved because the jumps are regularized by correlated random-effect assumption. Using simulated and real data sets, we demonstrate how smoothseg is utilized to aid in identifying regions of genomic alteration that correlate with clinical phenotypes and in classification of samples.

The performance of smoothseg in identification and classification is evaluated using the false discovery rates (FDR) and classification error rates. For the real data, the classification error rates are calculated using the leave-one-out cross-validation procedure. For comparison, we also computed the FDR and classification error rates based on ‘DNAcopy’, which is the R package implementing the circular binary segmentation (CBS) algorithm. A recent comparison study (Lai et al., 2005) of CBS and 10 other approaches concluded that CBS is one of the two methods that perform consistently well. Willenbrock and Fridlyand (2005) in another comparison study also concluded that DNAcopy has the best operating characteristics in terms of sensitivity and FDR.

To summarize our novel contributions: (i) we developed a method to perform robust segmentation of aCGH data, and implemented this in a software package. (ii) We showed that smoothseg compares favorably with one of the best existing methods. For the real data sets, smoothseg leads to smaller false discovery and classification error rates than the CBS algorithm. In a realistic simulation setting, smoothseg is better than CBS. Last but not least, smoothseg performs much faster than DNAcopy (version 1.8.1). (iii) For comparative analyses, we demonstrated that segmenting the t-statistics performs better than segmenting the data.


    2 METHODOLOGY
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODOLOGY
 3 SIMULATION STUDY
 4 REAL DATA EXAMPLES
 5 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Consider Formula regions located on a genome at which the relative copy number is measured by the log2-ratios of fluorescence intensities between tumor and reference (normal) samples. Denote by Formula the observed log2-ratios of the Formula th region at location Formula for Formula , and define the vector Formula . The genomic locations Formula are fixed and known, with Formula .

Our statistical model is based on correlated random-effects for the unobserved pattern. As shown in Ruppert et al. (2003) and more recently in Lee et al. (2006), the random-effect model allows flexible specification for the unobserved pattern. Particularly, a heavy-tailed random-effect model would provide a good fit to data containing jump discontinuities as would be expected from a segmented pattern, but at the same time allowing for a smooth transition. Additionally, to achieve robustness against outliers we also specify a heavy-tailed model for the error term. So, overall we have a doubly heavy-tailed model. Specifically, we model


Formula

where the errors {varepsilon} {equiv} ({varepsilon}1,...,{varepsilon}n) are iid t-distribution with location 0, and unknown dispersion {sigma} and k degrees of freedom. Hence, the problem is how to estimate f(xi) based on observations y.

In matrix form, we can write a general model


Formula

where X is the model matrix determined by the observed xis and the choice of basis functions, e.g. B-spline basis functions (Lee et al., 2006, Chapter 9). This approach requires careful placement of knots to allow sudden changes. To allow for the greatest freedom, we use the observed xis as the knots and use only zero-order B-splines. This is equivalent to specifying


Formula 1

(1)
where fi {equiv} f (xi) is an unknown random-effect parameter. The error {varepsilon} is assumed independent of f {equiv} (f1,...,fn).

The smoothness of f can be imposed on the scaled differences {Delta}fi/{Delta}xi {equiv} (fifi – 1) / (xixi – 1). It is most common to specify that the scaled second-order differences Formula {equiv} {Delta}2fi / ({Delta}xi)2 are iid with some distribution. Because f is mostly smooth, the size of ai {equiv} {Delta}2fi = ({Delta}xi)2Formula is very small relative to the local noise. This means there will be little difference whether we specify the model on Formula or ai. For convenience we shall use the latter. Traditionally, the normal distribution is assumed, but smoothing based on the normal distribution converts jumps into gradual changes and tends to round plateaus, so the normal distribution is not a good choice for aCGH data. It is well known that the normal distribution leads to a quadratic loss function. Eilers and Menezes (2003) have also demonstrated that a smoothing algorithm based on the quadratic loss function does not work well for aCGH data.

To adapt to segmented data, the model must allow jumps, which means we must consider a heavy-tailed distribution for ai. We propose a smooth segmentation (smoothseg) approach based on the assumptions that the second-order differences ai are iid Cauchy, a heavy-tailed model with a simple parametrization. For a quick illustration, Figure 1 shows the smoothseg segmentation of CGH data from chromosomes 1 to 4 from a breast cancer sample. The biggest difference between this and the standard smoothing is that in the latter the choice of smoothing parameter is paramount, and in its application to aCGH data it is easy to oversmooth the data.

The estimation of the random effects is carried out by the maximum likelihood method. The log-likelihood based on the observation vector y and the random effects f is


Formula 2

(2)
where the first term comes from the t-density with k degrees of freedom:


Formula

and Formula , and the random effects contribution comes from the Cauchy model with location 0 and scale factor Formula :


Formula

2.1 Estimation of f given {sigma} and {sigma}f
Maximizing the likelihood of the non-Gaussian random-effect model is described in Pawitan (2001, pp. 464–466), and we will derive an iterative weighted least-squares (IWLS) algorithm. The first derivative of the Formula with respect to f can be written as


Formula

with diagonal weight matrix W = diag[wi] and


Formula

Then, taking the first derivative of l(ai), we have


Formula

or in vector form


Formula

where


Formula

denoting by {Delta}2 the (n – 2) x n matrix that represents the second-order difference operator of n-dimension vectors, and


Formula

Combining the above results, the first derivative of Formula with respect to f is


Formula

To solve S(f) = 0, we can apply a simple iterative scheme that works like IWLS: using a starting value f 0, compute the matrices W and D, and solve the updating equation


Formula 3

(3)
where Formula . Given an updated f, recompute W and D, and continue to iterate; at convergence we have the estimate Formula .

Note that finding the solution of (3) is potentially highly demanding due to its huge dimension. To overcome this problem, we exploit the band-limited property of (W + {lambda}({Delta}2)'D– 1{Delta}2). We use well-tested fortran subroutines available in Linpack (e.g. Dongarra et al., 1979, Chapter 2) and get a very fast inversion procedure.

2.2 Estimation of {sigma} and {sigma}f
The dispersion parameter {sigma} can be estimated directly from the error term Formula . For simplicity, we estimate {sigma} robustly by the median absolute deviation


Formula

Given {sigma}, there is a single parameter {lambda} that controls the smoothness of the model, hence the commonly used smoothing parameter estimation methods such as the AIC can be used to estimate it. First, the degrees of freedom parameter is defined as (Pawitan, 2001, p. 448)


Formula

where WandD are computed using Formula . Using the same method as in Pawitan (1996), the degree of freedom can be approximated as


Formula

where Formula and Formula are the average diagonals of W and D, and Formula is the jth eigenvalue of the second derivative matrix {Delta}2. Then, we estimate {lambda} by minimizing the AIC criterion


Formula

2.3 Analysis of comparative studies
In many CGH studies, we will be interested not only in the copy-number alterations in single samples, but rather in the comparison of groups of samples, i.e. whether there is a consistent change across the samples. All our examples will deal with comparison of two independent groups; extension to several groups is straightforward. In principle, we can extend the previous model by allowing an extra term to capture group effects:


Formula 4

(4)
where gi (xk) represents the mean pattern of group i at location xk, and fij(xk) is the j 's individual deviation from the mean pattern in group i and {varepsilon}ijk is the error term. We then proceed by specifying heavy-tailed correlated random-effect models for gis and fijs. Joint estimation of these parameters can be done using the iterative back-fitting algorithm (e.g. Pawitan 2001, page 445). On the first step, set fij {equiv} 0, so the estimation of gi(xk) involves segmentation of the mean profile. Then define the corrected data


Formula

and perform individual segmentation to estimate fijs using the corrected data. Then define another corrected data


Formula

and use this to update gi (xk), and so on until convergence.

The estimation of individual fijs can use the method we have developed, but the estimation of the mean profile gi (xk) from a collection of aCGH arrays requires a non-trivial extension. Furthermore, for a medium to large number of samples, the joint computation becomes unwieldy and likely slow to converge. Instead, here we make a compromise by separating the segmentation and testing steps. It is not immediately obvious whether we should segment the data prior to testing, or vice versa. In our simulations we will compare two strategies:

  1. Segmenting original data for each array followed by calculating t-statistics.
  2. Calculating t-statistics from the original (unsegmented) data, followed by segmenting the t-statistics.
Note that, if we use simple averaging across arrays in the iterative backfitting algorithm, then the required computation to get gi (xk) is closer to the second strategy, so we can expect it to be better.

An example of the t-statistics pattern from the breast cancer data set is given in Figure 2, showing that segmenting the data and segmenting the t-statistics can produce different results.


Figure 2
View larger version (10K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. Comparison of t-statistics using the unsegmented data (scattered points), segmented data (dashed line) and segmented t-statistic (solid line). This plot is based on the comparison of two breast cancer groups.

 
In both strategies, the standard t-statistic with pooled variance is computed for each probe. To deal with the multiple comparison problem, we asses the significance of the test statistics using the FDR (Benjamini and Hochberg, 1995; Storey, 2002). The FDR computation using the original log2-ratio, segmented log2-ratio or segmented t, is calculated by adding the segmentation step to the EOC function in the R package OCplus (Pawitan et al., 2005).

In this computation, P-values are obtained using the permutation test. To account for the dependence between the probes, a single permutation of the group labels is applied to the whole array, so the data from an array is always kept together and the dependence structure is preserved. Let Formula be the ordered P-values from the permutation test. For the top k most significant probes, the FDR is computed as


Formula

where {pi}0 is the proportion of probes with non-differential copy number between the two groups, and it is estimated by


Formula

for some cutoff value c. The estimate Formula is typically stable for a wide range of c between 0.6 and 0.9, and in our computations we use c = 0.7.

2.4 Software
We have built a public-domain software smoothseg to perform the procedures described in this article. It can be accessed in http://www.meb.ki.se/~yudpaw. The function smoothseg performs the smooth segmentation for a collection of aCGH data, and the function FDRcgh() performs the estimation of FDR for comparative studies, including paired-sample, two-sample and multi-sample comparisons.


    3 SIMULATION STUDY
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODOLOGY
 3 SIMULATION STUDY
 4 REAL DATA EXAMPLES
 5 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
We conducted a simulation study to evaluate the performance of smoothseg in identifying regions of genomic alterations and classifying samples. To get a realistic simulation, Willenbrock and Fridlyand (2005) proposed a scheme that generates genomic profiles of comparable complexity to real data. The first step is choosing a template from the empirical profile constructed from the CBS segmentation of a primary breast tumor data set of 145 patients. Each template consists of 500 regions placed on one chromosome, containing segments of normal and altered copy numbers, and there is one probe associated with each region. It is then assumed that genetic alteration regions in a tissue can be measured with 70 % chance, so we apply 30 % chance for the segments with copy-number changes to be re-assigned a normal copy number of 2. Finally, to take into account the sample contamination, each sample was assigned a proportion Pt of tumor cells and (1 – Pt) normal cells. Consequently, the expected log2-ratio profile for each sample was calculated as Formula , where c was the assigned copy number in the sample.

The log2-ratio profile data was generated by adding more realistic t-distribution noise of mean 0 and dispersion SD to the expected log2-ratios. In Willenbrock and Fridlyand's; simulation setting, Pt and SD are drawn from a uniform distribution (0.3, 0.7) and (0.1, 0.2), respectively.

3.1 Identification of regions with genomic alterations
In this simulation study, we generated 150 samples from one genomic template. The ‘true regions’ with copy-number gains or losses were recorded, allowing the computation of sensitivity and specificity. The performance of smoothseg, wavelets smoothing and DNAcopy were evaluated by the operating characteristic (OC) curve. We calculated the sensitivity as the number of regions with copy-number gains or losses whose absolute value are above the threshold level divided by the number of regions with copy-number gains or losses. We calculated the specificity as the number of regions with copy-number 2 whose absolute values are below the threshold level divided by number of regions with copy-number 2. In order to calculate OC curve, we varied the threshold value from 4.5 to 0.0.

Figure 3a shows that the resulting OC curves for smoothseg, wavelets smoothing and DNAcopy are comparable. The problem is perhaps too easy as the signal-to-noise ratio is very high, so all procedures have areas under curve close to one, although smoothseg performs marginally better. We then modified Willenbrock and Fridlyand's; setting to make the problem harder by increasing the error rate. The modification consisted of dividing the length of each segment with copy-number alteration by 3 and using fixed noise level SD = 0.3. In this setting, the data are more noisy and have smaller copy-number alteration segments. Now, smoothseg clearly performs better than wavelet smoothing and DNAcopy segmentation.


Figure 3
View larger version (10K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. OC curves illustrating results from identification of regions with copy number gains and losses. For varying thresholds, it shows the sensitivity versus 1-specificity. Results are based on 150 samples. (a) Based on Willenbrock and Fridlyand's; simulation setting. (b) Based on some modification of Willenbrock and Fridlyand's; simulation setting (see text).

 
3.2 Testing for differential copy number
To simulate realistic data sets with samples from two tumor classes, two genomic templates were used and 20 samples were randomly drawn from each of the two templates. The proportion of regions with differential copy number between the two templates is determined implicitly by the choice of templates. On average, Willenbrock and Fridlyand's; setting generates 42 % regions with differential copy number.

The simulation was repeated 500 times. For each simulated data set, as described in the Methodology Section, segmenting t and segmenting data was carried out. From the 500 simulations, we calculated the sensitivity as the number of regions with copy-number difference between two templates whose P-values are below the threshold level (i.e. significant regions) divided by the number of regions with copy-number difference between two templates. We calculated the specificity as the number of regions with the same copy number between two templates whose P-values are above the threshold level divided by number of regions with same copy number between two templates. In order to calculate the OC curve, we varied the threshold value from 0.0 to 0.6.

Figure 4a shows the resulting OC curves for the different procedures. For comparison, the OC curves based on the DNAcopy segmentation are also shown in Figure 4a. Compared to unsegmented analysis, both smoothseg and DNAcopy using both segmentation strategies have improved identification of regions with differential copy number, which is evident by higher sensitivity for any given specificity. For smoothseg, segmenting the t-statistic is better than segmenting the data. In this simulation setting, DNAcopy performs better than smoothseg. This is probably not surprising, given that the simulation model is based on empirical distribution constructed from DNAcopy segmentation. Again, in this original setting, all procedures perform with relatively little errors compared to what we might expect with real data.


Figure 4
View larger version (12K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4. OC curves illustrating results from identification of regions with differential copy number between two groups. For varying thresholds, it shows the sensitivity versus 1-specificity. Results are based on 500 simulations. (a) Based on Willenbrock and Fridlyand's; simulation setting. (b) Based on some modification of Willenbrock and Fridlyand's; simulation setting (see text).

 
For the modified setting, the resulting OC curves in Figure 4b show that the level of error rate is now more realistic, where the naive unsegmented method performs poorly. We now observe that smoothseg performs better than DNAcopy. Figure 4b also shows that segmenting the t-statistics is better than segmenting the data.


    4 REAL DATA EXAMPLES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODOLOGY
 3 SIMULATION STUDY
 4 REAL DATA EXAMPLES
 5 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
In this section, we demonstrate the performance of smoothseg for real data analysis. The performance in identifying the differential copy number is evaluated using the FDR, while the performance in supervised classification is evaluated using the classification error rates, calculated using the leave-one-out-cross validation (LOOCV).

The first data set (unpublished data) consists of 14 BRCA1 and 8 BRCA2 mutation-classified breast cancer patients, analyzed on tiling-32K BAC arrays. For each patient, fluorescence ratios are collected from 31 890 BAC clones across the genome. The data were collected at Department of Oncology, Lund University, Sweden. The second data set consists of 75 oral squamous-cell carcinoma (OSCC) samples from Snijders et al. (2001), comprising 14 p53-mutant and 61 wild-type samples. The data set was downloaded from http://www.cbs.dtu.dk/~hanni/aCGH.

4.1 Identification of differential copy number
Regions with differential copy number between two cancer subtypes are identified by testing if a region had a significantly different log2-ratio in samples from one subtype compared to the log2-ratio for the same region in samples from the other subtype. For the breast-cancer data, the profile of the t-statistics across the chromosomes is given in Figure 5.


Figure 5
View larger version (27K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5. The profile of the t-statistic for comparison of genomic alterations between BRCA1 versus BRCA2 breast tumors; the scattered points are raw values. The shaded regions correspond to 5 % FDR.

 
Existence of genomic regions with differential gain or loss between BRCA1 and BRCA2 tumors have previously been reported by conventional CGH (van Beers et al., 2005) and aCGH (Jonsson et al., 2005). Applying FDR cutoff of 5 % yielded 5851 significant BAC clones, and the segmented t-statistic profile generated several distinct genomic regions differentiating between the tumor types (Fig. 5). Identified regions included 4p15.32–p14, 4q31.3, 4q32.1–q34, 5q, 17q23.3–q24.2 and 20q12–q13.12, previously reported as highly typical for the BRCA1 or BRCA2 genotype (Jonsson et al., 2005). These differential alterations suggest different development pathways for BRCA1 and BRCA2, and might explain any differences in the clinical progression.

The FDR corresponding to varying proportion of declared significant regions from 0 to 10 % are shown in Figure 6, both for the breast cancer and OSCC data. For comparison, FDR curves based on the DNAcopy segmentation are also calculated and shown in Figure 6. We can see that segmentation-based procedures yield smaller FDR than the segmentation-free procedure. Overall, consistent with the simulated data, segmenting t-statistic performs better than segmenting the data. Finally, for the breast cancer data, smoothseg performs better than DNAcopy, but for the OSCC data they are comparable.


Figure 6
View larger version (14K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 6. The FDR estimate in the identification of regions with differential copy number using the original data, segmented data and segmented t. (a) For the breast cancer data. (b) For the oral squamous-cell carcinoma data.

 
4.2 Classification
We carried out classification using original log2-ratios and segmented data as input variables to the classifier. For simplicity, we considered a diagonal linear discriminant classifier (DLDA), which was previously demonstrated to have good performance in microarray studies (Dudoit et al., 2002). The classification was carried out using the R package sma (Dudoit et al., 2002). The classification error rates were calculated using the leave-one-out cross-validation (LOOCV) for a varying number of predictors. The predictors were ranked by the ratio of between-group to within-group sum of squares of the training data within each cross-validation run. Molinaro et al. (2005), in the context of classification of genomics data, performed extensive simulations that showed that the LOOCV has the smallest bias and mean-square error for linear discriminant analysis.

The classification error rates based on the breast cancer and OSCC data sets are shown in Figure 7. We can see that classification using segmented data has smaller classification error rates than those found using the original unsegmented data. Furthermore, for both data sets, smoothseg performs better than DNAcopy.


Figure 7
View larger version (11K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 7. Misclassification error rate from the leave-one-out cross-validation procedure for DLDA classifier as a function of the number of predictors. (a) For the breast cancer data. (b) For the oral squamous-cell carcinoma data.

 
Consistent with Jonsson et al. (2005), the BRCA1 and BRCA2 breast tumors can be distinguished extremely well, achieving zero error rate in the classification even with very few predictor genes. The biological differences between these two groups are described more meaningfully in the previous subsection. However, this result is based on a small sample and needs to be validated in a larger sample. The separation between the OSCC groups are not as successful. In fact the best error rate (12 %) is only slightly better than the rate of a naive classification of all cases as wild-type (14/75 = 18 %). This could be partly due to the low resolution of the aCGH platform, containing only 1979 regions.


    5 DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODOLOGY
 3 SIMULATION STUDY
 4 REAL DATA EXAMPLES
 5 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
We have described a smooth segmentation (smoothseg) approach to the analysis of aCGH data, and illustrated its performance using simulated and real data sets. Smoothseg is derived from the Cauchy distribution assumption, as it is expected to handle jumps in the copy-number pattern, while at the same time allowing some smooth transition. The latter can occur naturally because of cellular variability in genomic instability (e.g. Kronenwett et al., 2004) or because of other factors such as variation in GC content. However, while our method is motivated by this possibility, in principle it does not rely on it for its performance. In the simulation studies presented above, the templates have discrete segments.

For processing a large number of today's; high-resolution CGH arrays, speed is an important issue, especially when segmentation is used as pre-data-processing for downstream analyses such as testing and classification. Compared to the current implementation of DNAcopy (version 1.8.1), smoothseg segmentation is about 500 times faster (as measured by the R function system.time).

Using simulated and real data, we demonstrated how smoothseg is utilized to aid in the downstream analysis. Use of segmented data can improve the performance of testing and classification, as demonstrated by Willenbrock and Fridlyand (2005). From their comparative studies, they concluded that DNAcopy has the best operating characteristics in terms of its sensitivity and FDR. Our simulation studies indicate that the relative performance of DNAcopy and smoothseg depends on the data structure: for data with smaller genomic aberrations smoothseg gives better results than DNAcopy, while for larger aberrations the performance is about the same. The comparative study of Lai et al. (2005) also showed that discrete segmentation performs poorly for data with small alterations and high noise level. For the real data sets used in this article smoothseg performs better than DNAcopy.

When using segmentation for group comparisons, a natural question arises: should we segment the data or the t-statistics? Willenbrock and Fridlyand (2005) considered this problem, and in their simulation setting the two segmentation strategies performed similarly (see Fig. 4a in this article, or Fig. 4 in Willenbrock and Fridlyand, 2005). However, the results from a more realistic simulation setting and the real data analyses (Fig. 4b and Fig. 6 in this article) indicate that segmenting t-statistics performs better than segmenting the log-ratios. We described briefly a model-based estimation of group effects, but its implementation would require an extension of the existing computations.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODOLOGY
 3 SIMULATION STUDY
 4 REAL DATA EXAMPLES
 5 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
This research is partially supported by the Swedish Cancer Society.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Thomas Lengauer

Received on January 23, 2007; revised on June 12, 2007; accepted on July 9, 2007

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODOLOGY
 3 SIMULATION STUDY
 4 REAL DATA EXAMPLES
 5 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B (1995) 57:289–300.

    Dongarra JJ, et al. LINPACK Users' Guide. (1979) Philadelphia: SIAM.

    Dudoit S, et al. Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association (2002) 97:77–87.[CrossRef][Web of Science]

    Eilers PHC, de Menezes RX. Quantile smoothing of array CGH data. Bioinformatics (2003) 21:1146–1153.[CrossRef]

    Engler DA, et al. A pseudolikelihood approach for simultaneous analysis of array comparative genomic hybridiztions. Bioinformatics (2006) 7:339–421.[CrossRef]

    Fridlyand J, et al. Hidden markov models approach to the analysis of array CGH data. J. Multivar. Anal (2004) 90:132–153.[CrossRef]

    Hsu L, et al. Denoising array-based comparative genomic hybridization data using wavelets. Biostatistics (2005) 6:211–226.[Abstract]

    Hupe P, et al. Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics (2004) 20:3413–3422.[Abstract/Free Full Text]

    Jonsson G, et al. Distinct genomic profiles in hereditary breast tumors identified by array-based comparative genomic hybridization. Cancer Res (2005) 65:7612–7621.[Abstract/Free Full Text]

    Kronenwett U, et al. Improved grading of breast adenocarcinomas based on genomic instability. Cancer Res (2004) 64:904–909.[Abstract/Free Full Text]

    Lengauer C, et al. Genetic instabilities in human cancers. Nature (1998) 396:643–649.[CrossRef][Medline]

    Lai WR, et al. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics (2005) 21:3763–3770.[Abstract/Free Full Text]

    Lee YJ, et al. Generalized Linear Models with Random Effects. (2006) London: Chapman & Hall/CRC.

    Molinaro AM, et al. Prediction error estimation: a comparison of resampling methods. Bioinformatics (2005) 21:3301–3307.[Abstract/Free Full Text]

    Olshen AB, et al. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics (2004) 5:557–572.[Abstract]

    Pawitan Y. Automatic estimation of coherence of bivariate time series. Biometrika (1996) 83:419–432.[Abstract/Free Full Text]

    Pawitan Y. In All Likelihood: Statistical Modelling and Inference Using Likelihood. (2001) Oxford: Oxford University Press.

    Pawitan Y, et al. FDR, sensitivity and sample size for microarray studies. Bioinformatics (2005) 21:3017–3024.[Abstract/Free Full Text]

    Picard F, et al. A statistical approach for array CGH data analysis. BMC Bioinformatics (2005) 21:6–27.

    Pollack JR, et al. Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proc. Natl Acad. Sci. USA (2002) 99:12963–12968.[Abstract/Free Full Text]

    Ruppert D, et al. Semiparametric Regression. (2003) Cambidge: Cambridge University Press.

    Snijders AM, et al. Assembly of microarrays for genome-wide measurement of DNA copy number. Nat. Genet (2001) 29:263–264.[CrossRef][Web of Science][Medline]

    Storey JD. A direct approach to false discovery rates. J. R. Stat. Soc. Ser. B (2002) 64:479–498.[CrossRef]

    van Beers EH, et al. Comparative genomic hybridization profiles in human BRCA1 and BRCA2 breast tumors highlight differential sets of genomic aberrations. Cancer Res (2005) 65:822–827.[Abstract/Free Full Text]

    Wang P, et al. A method for calling gains and losses in array CGH data. Biostatistics (2005) 61:45–58.

    Willenbrock H, Fridlyand J. A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics (2005) 21:4084–4091.[Abstract/Free Full Text]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
E. Budinska, E. Gelnarova, and M. G. Schimek
MSMAD: a computationally efficient method for the analysis of noisy array CGH data
Bioinformatics, March 15, 2009; 25(6): 703 - 713.
[Abstract] [Full Text] [PDF]


Home page
Cancer Res.Home page
D. J. Nancarrow, H. Y. Handoko, B. M. Smithers, D. C. Gotley, P. A. Drew, D. I. Watson, A. D. Clouston, N. K. Hayward, D. C. Whiteman, and for the Australian Cancer Study and the Study of D
Genome-Wide Copy Number Analysis in Esophageal Adenocarcinoma Using High-Density Single-Nucleotide Polymorphism Arrays
Cancer Res., June 1, 2008; 68(11): 4163 - 4172.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
C. Klijn, H. Holstege, J. de Ridder, X. Liu, M. Reinders, J. Jonkers, and L. Wessels
Identification of cancer genes using a statistical framework for multiexperiment analysis of nondiscretized array CGH data
Nucleic Acids Res., February 2, 2008; 36(2): e13 - e13.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary data
Right arrow All Versions of this Article:
23/18/2463    most recent
btm359v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (8)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Huang, J.
Right arrow Articles by Pawitan, Y.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Huang, J.
Right arrow Articles by Pawitan, Y.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?