Bioinformatics Advance Access originally published online on July 29, 2008
Bioinformatics 2008 24(19):2143-2148; doi:10.1093/bioinformatics/btn404
A fast Bayesian change point analysis for the segmentation of microarray data
Department of Statistics, Yale University, 24 Hillhouse Avenue, New Haven, CT 06511, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: The ability to detect regions of genetic alteration is of great importance in cancer research. These alterations can take the form of large chromosomal gains and losses as well as smaller amplifications and deletions. The detection of such regions allows researchers to identify genes involved in cancer progression, and to fully understand differences between cancer and non-cancer tissue. The Bayesian method proposed by Barry and Hartigan is well suited for the analysis of such change point problems. In our previous article we introduced the R package bcp (Bayesian change point), an MCMC implementation of Barry and Hartigan's method. In a simulation study and real data examples, bcp is shown to both accurately detect change points and estimate segment means. Earlier versions of bcp (prior to 2.0) are O(n2) in speed and O(n) in memory (where n is the number of observations), and run in
45 min for a sequence of length 10 000. With the high resolution of newer microarrays, the number of computations in the O(n2) algorithm is prohibitively time-intensive.
Results: We present a new implementation of the Bayesian change point method that is O(n) in both speed and memory; bcp 2.1 runs in
45 s on a single processor with a sequence of length 10 000—a tremendous speed gain. Further speed improvements are possible using parallel computing, supported in bcp via NetWorkSpaces. In simulated and real microarray data from the literature, bcp is shown to quickly and accurately detect aberrations of varying width and magnitude.
Availability: The R package bcp is available on CRAN (R Development Core Team, 2008). The O(n) version is available in version 2.0 or higher, with support for NetWorkSpaces in versions 2.1 and higher.
Contact: chandra.erdman{at}yale.edu
| 1 INTRODUCTION |
|---|
|
|
|---|
Numerous publications have stated the importance of accurately detecting regions of genetic alteration. In particular, many methods for the analysis of array comparative genomic hybridization (CGH) data (Kallioniemi et al., 1992) have been developed. In Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data, Lai et al. (2005) compared 11 such methods, dividing them into the categories of estimation algorithms and smoothing algorithms. In this article, we present a fast Bayesian alternative—an MCMC implementation of the method proposed in A Bayesian analysis for change point problems (Barry and Hartigan, 1993).
Circular binary segmentation (CBS; Olshen and Venkatraman, 2004) is a modification of binary segmentation (Sen and Srivastava, 1975). It is an estimation algorithm which uses a likelihood ratio statistic to test the null hypothesis of no change points in a sequence. If the null hypothesis is rejected, the sequence is split and the test is recursively applied to the resulting sub-segments until no additional changes are detected. Jong et al. (2003) introduced a genetic local search algorithm (GA) that searches for the most probable partition of a given number of change points. After smoothing the data, the procedure uses maximum likelihood with a penalty term that increases with the number of change points to select a partition. CGH segmentation (CGHseg Picard et al., 2005)) also uses a penalized likelihood function to estimate the number of change points in a sequence. (Hupe et al., 2004) implement an adaptive weights smoothing procedure (in the R package GLAD) based on the local-likelihood model of Polzehl and Spokiony (2000). The quantile smoothing algorithm (Quantreg; Eilers and Menezes, 2005) is based on the minimization of the sum of absolute errors. The smoothing and estimation algorithm clustering along chromosomes (CLAC; Wang et al., 2005)) uses hierarchical clustering to detect change points, and false discovery rate (FDR) for model selection. The estimation algorithm of Fridlyand et al. (2004) is based on a hidden Markov model (HMM). Hsu et al. (2005) propose a non-parametric smoothing algorithm that uses wavelets to denoise data. chromosomal aberration region miner (ChARM; Myers et al., 2004) uses three edge detection filters for smoothing, followed by an expectation-maximization (EM) algorithm for estimation. Lingjarde (2005) presents CGH-Explorer, a program which implements several threshold methods for segmentation, and introduces the smoothing and estimation algorithm ACE (analysis of copy errors). Locally weighted regression and smoothing scatterplots (Lowess), introduced in Cleveland (1979), is also frequently applied to microarray data.
Several issues arise in the comparison of these algorithms in Lai et al. (2005). Some of the estimation algorithms return only the location(s) of the detected aberration(s), and may not provide estimated log-ratios. Some of the smoothing algorithms return estimates of log-ratios without indicating the location(s) of aberration(s). Some algorithms require additional normal versus normal control samples. Some of the algorithms are slow, and some have not been implemented in publicly available software.
The Bayesian approach estimates both the posterior means and the posterior probability of a change for each postition in a sequence. It is generally applicable and can be used to segment gene expression data, array CGH data, and any other data with contiguous segments of equal mean. The R implementation of the Bayesian approach is available on CRAN as the bcp package for Macintosh, Windows and Linux. Version 2.0 (or higher) is O(n) and is often faster than the new O(n) implementation of CBS (Venkatraman and Olshen, 2007a), particularly for long sequences containing at least a few detectable change points. Version 2.1 (or higher) supports parallel MCMC computations via NetWorkSpaces (R package nws) for additional speed improvements.
In Section 2, we describe the Bayesian change point model and the MCMC implementation; we also provide a detailed description of the R package bcp. In Section 3, we apply the Bayesian approach to the simulated chromosomes used by Lai et al. (2005) to compare the various change point methods, and to the brain tumor sample from Bredel et al. (2005) (also presented in Lai et al. (2005). Section 4 concludes with a discussion of speed considerations, improvements provided by bcp version 2.0, and directions for future work.
| 2 METHODS |
|---|
|
|
|---|
2.1 Implementing Barry and Hartigan's Bayesian procedure
Barry and Hartigan (1993) propose a Bayesian model for the change point problem—when there is an unknown partition,
, of a set into contiguous blocks such that the means are equal within blocks. The model assumes that observations are independent N(µi,
2), and that the probability of a change point at a position i is p, independently at each i. However, the assumption of independent observations could be weakened because all that is required is that, given the partition and the parameters, observations in different blocks are mutually independent (Barry and Hartigan, 1993, p. 310). The prior distribution of µij (the mean of the block beginning at position i+1 and ending at position j) is chosen as N(µ0, 
/(j–i). Note that the variance of this prior changes with the length of the block; specifically, larger deviations from µ0 are expected in shorter blocks. That is, we do not expect to detect a small change in mean if it persists for a short time. However, this choice of prior does allow weak signals provided that there are sufficient data to estimate them (Barry and Hartigan, 1993, p. 311). Barry and Hartigan also select independent priors for µ0, p,
2 and w=
2/(
2+
) (the ratio of signal to error variance). Specifically, |
|
|
|
|
|
|
|
|
|
Although an exact implementation of Barry and Hartigan's Bayes procedure is possible, the calculations are O(n3); we implement an MCMC approximation that is O(n). The reader is encouraged to refer to Barry and Hartigan (1993) for the full theoretical framework, and our presentation will conform to their notation.
The algorithm begins with the partition
= (U1, U2,...,Un), where n is the number of observations and Ui = 1 indicates a change point at position i+1; we initialize Ui to 0 for all i < n, with Un
1. In each step of the Markov chain, at each position i, a value of Ui is drawn from the conditional distribution of Ui given the data and the current partition. Following Barry and Hartigan, we let b denote the number of blocks obtained if Ui = 0, conditional on Uj, for i
j. The transition probability, p, for the conditional probability of a change at the position i+1, is obtained from the simplified ratio presented in Barry and Hartigan:
|
|
The new MCMC implementation benefits from careful updating of the various sums of squares (W0, B0, W1 and B1) in consecutive steps of the algorithm. For example, instead of storing means and variances, the algorithm stores sums and sums of squares within each block of the current partition. When considering the addition (or deletion) of a change point, all quantities within the affected block (or adjacent blocks) are updated directly from these sums and sums of squares.
2.2 R package bcp
The bcp package now contains the main bcp() function, five methods (summary(), print(), plot(), fitted() and residuals()), a new function, interval.prob(), for estimating the probability of a change point in an interval, and two datasets. The function bcp() performs the analysis, taking six arguments:
- x: a numerical vector of data.
- p0 and w0: optional values for Barry and Hartigan's hyperparameters; these default to the value 0.2, which has been found to work well (Barry and Hartigan,1993, Yao, 1984).
- burnin: optional number of burn-in iterations that are excluded from the estimation of the posterior means and probabilities of changes. The chain settles very quickly in practice, and the default is 50.
- mcmc: optional number of iterations used in the estimation of the posterior means; following Barry and Hartigan, the default is 500, although longer chains are recommended in practice.
- return.mcmc: if TRUE, returns the partition and the associated conditional posterior means for each iteration; the default is set to FALSE and returns a summary of the chain.
- nwssleigh: NULL by default. If a sleigh is provided, the mcmc steps are run in parallel (see package nws for more information).
After completing the analysis, bcp() returns an object of class bcp containing the following components:
- data: a copy of the data.
- mcmc.means: if return.mcmc=TRUE, contains the posterior means conditional on the current partition at the end of every iteration, otherwise mcmc.means is NA.
- mcmc.rhos: if return.mcmc=TRUE, contains the partition after each iteration, otherwise mcmc.rhos is NA.
- blocks: a vector of the number of blocks after each iteration.
- posterior.mean: a vector containing the posterior means.
- posterior.var: a vector containing the naive posterior variance for each postition, over the mcmc iterations.
- posterior.prob: a vector containing the posterior probability of a change point at each position.
- p0, w0, burnin, mcmc and return.mcmc: contain the specified values.
- object: the result of a call to bcp().
- start: the starting index of the interval.
- end: the ending index of the interval.
It returns the estimated probability of a change point in the interval [start, end]. Package bcp also contains five methods and two datasets:
Methods:
- The plot method provides two plots summarizing the analysis. The first figure displays the data along with the posterior mean of each position. The second figure shows the proportion of iterations resulting in a change point at each position—the posterior probability of a change.
- The summary and print methods, modeled after the summary method for mcmc objects (Plummer et al., 2007), print a table of the posterior probability of a change in mean, along with the posterior mean and SD for each position. After removing the burnin iterations, the mcmc.means component of a bcp object may be converted into an mcmc object to view a full mcmc summary, and to perform convergence tests with the coda package (Plummer et al., 2007). An example of this conversion is given in the bcp package documentation (Erdman and Emerson, 2008).
- The fitted method returns the vector of posterior means.
- The residuals method returns the data minus the posterior means.
Data:
- Coriell: two array CGH studies of Coriell cell lines, also appears in the DNAcopy package (Venkatraman and Olshen, 2007a) that performs CBS, taken originally from Snijders et al. (2001).
- RealInt: US interest rate data, considered by Bai and Perron (2003) and Garcia and Perron (1996), also appears in the strucchange package that performs dynamic programming algorithm of Bai and Perron (2003) (Zeileis et al., 2002; Zeileis et al., 2003; Zeileis et al., 2007).
| 3 RESULTS |
|---|
|
|
|---|
3.1 Simulations
We apply bcp() to the simulated chromosomes of Lai et al. (2005). Figure 1 shows the results of bcp() applied to a particular artificial chromosome with five aberrations of lengths 2, 5, 10, 20 and 40, each with a magnitude of 1, and N(0, .252) noise. Figure 1 of (Lai et al., 2005, p. 3765) summarizes the results of the 11 aforementioned change point algorithms applied to this simulated chromosome. Of the 11 algorithms, only 3 clearly identify all five aberrations, and two of these three also falsely detect an amplification near position 400. The Bayesian change point algorithm clearly identifies all 5 aberrations and assigns high posterior probability to each, without identifying any false positives.
|
To examine the tradeoff between detecting true positives and incorrectly identifying false positives, Lai et al. (2005) consider the receiver operating curves (ROC) for the 11 algorithms for aberration widths of 5, 10, 20 and 40, and signal-to-noise ratios (SNRs) of 1, 2, 3 and 4. SNR, as defined by Lai et al. (2005, p. 3765), is the mean magnitude of the aberration divided by the standard deviation of the superimposed Gaussian noise. For each combination of aberration width and SNR, they generate 100 artificial chromosomes of length 100, with the aberrations of the various widths added in the center. True positive rates (TPR) are calculated as the number of observations inside the region of amplification whose fitted value (posterior mean) is above a threshold, divided by the total number of observations inside the aberration. False positive rates (FPR) are calculated as the number of observations outside the region of amplification whose fitted value is above a threshold, divided by the total number of observations outside the aberration. The threshold is varied from the minimum observed value to the maximum observed value.
Figure 2 displays the ROC for bcp() applied to the simulated chromosomes from Lai et al. (2005) for SNRs ratios of 1 and 2. When the SNR is 3 or 4—regardless of aberration width—the points of the ROC from bcp() are highly concentrated in the extreme upper left corner (indicating a large percentage of true positives, and a small percentage of false positives). As such, we restrict our attention to the more challenging cases with low SNRs. When the SNR drops to a value of 2, there is still a fairly high concentration of points in the upper left corner for aberration widths of 40, 20 and 10, and slightly less for the aberration of width 5. With this low SNR, the TPR for several of the other algorithms takes a fall (see Fig. 2 of Lai et al. (2005, p. 3766). It is only when the signal is matched by the noise (SNR=1) that the Bayesian change point procedure loses its ability to detect the amplifications. However, it is still no worse than the alternative procedures in this difficult case.
|
3.2 Glioblastoma Multiforme example
We apply bcp() to a Glioblastoma Multiforme (GBM) sample with a low SNR. GBM is a particularly lethal type of brain tumor, with patients surviving a median of just 1 year (Lai et al., 2005). Lai et al., (2005) use normalized GBM data from Bredel et al. (2005), and apply the 11 change point algorithms to two GBM samples. The first has a few short amplifications of large magnitude, and the second has one long loss of low magnitude. Because the Bayesian change point method easily detects large amplifications of any length in the simulated chromosomes, we instead present the result of bcp() applied to the more challenging GBM sample with a low-magnitude loss (chromosome 13 of GBM31) in Figure 3.
|
The posterior means in Figure 3 are clearly divided into two regions—the low-amplitude loss from position 0 to about 550 (note that these numbers do not indicate genomic location), and a normal region from position 550 to the end of the chromosome. CGHseq, GLAD, CBS and GA also detect the loss, dividing the chromosome into these two regions, and detecting up to a dozen outliers (seeFig. 3 of Lai et al. (2005), p. 3767). The outliers can either indicate a real focal aberration, some type of polymorphism, or an experimental artifact (Lai et al., 2005, p. 3768). ACE and CLAC identify the low-amplitude loss as a series of losses. The estimates provided by the wavelet, quantreg and lowess smoothing algorithms are generally lower in the first region and higher in the second, but include many local shifts obscuring the view of the global loss. HMM and ChARM do not separate the chromosome into two regions.
| 4 DISCUSSION |
|---|
|
|
|---|
As shown in the simulated and real examples presented here and in Erdman and Emerson (2007), the Bayesian change point analysis is sensitive enough to catch both larger aberrations that are short lived, and longer aberrations that are of very low magnitude, without significantly increasing the FDR. This is important in the detection of chromosomal aberrations when some may be represented by a single probe or, as in the GBM example (and many tumor samples), the signal is diluted by sample heterogeneity.
In addition to its accuracy, the new implementation of the Bayesian methodology also has a speed advantage. In the simualtion study of Erdman and Emerson (2007), we compare the performance of bcp() to that of the original implementation of CBS in DNAcopy version 1.1.0 (Venkatraman and Olshen, 2007a). Although the Bayesian approach is shown to consistently outperform CBS in accuracy, the O(n2) version of bcp() was significantly slower than CBS. Venkatraman and Olshen (2007b) present a faster, modified version of CBS that involves a hybrid P-value and an early stopping rule. The O(n) implementation of bcp() is now faster than both the original and modified CBS, analyzing a sequence of length 10 000 in
45 s, compared to 10 min for the original CBS algorithm and several minutes for the modified CBS algorithm (depending on the number of segments in the sequence). Further speed improvements may be realized using bcp's support for parallel MCMC computations.
The package version 2.0 or higher also includes improved methods for summarization and visualization of the results. The plot() method produces figures identical to 1 and 3; the summary() method produces a detailed statistical summary reflecting the degree of uncertainty in the change points; the interval.plot() function estimates the probability of at least one change point in a specified region of interest. The Bayesian methodology also provides a natural framework for combining samples. The ability to analyze multiple samples (optionally, in parallel in a multi-core or distributed computing environment) and combine the results will be provided in a future version of the bcp package.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Martin Bishop
Received on June 16, 2008; revised on July 27, 2008; accepted on July 28, 2008
| REFERENCES |
|---|
|
|
|---|
Bai J, Perron P. Computation and analysis of multiple structural change models. J. Appl. Econom (2003) 18:1–22.[CrossRef]
Barry D, Hartigan JA. A Bayesian analysis for change point problems. J.Am. Stat. Assoc (1993) 88:309–319.[CrossRef]
Bredel M, et al. High-resolution genome-wide mapping of genetic alterations in human glial brain tumors. Cancer Res (2005) 65:4088–4096.
Cleveland WS. Robust locally weighted regression smoothing scatterplots. J. Am. Stat. Assoc (1979) 74:829–836.[CrossRef][Web of Science]
Eilers PHC, Menezes RX. Quantile smoothing of array CGH data. Bioinformatics (2004) 21:1146–1153.[CrossRef][Web of Science][Medline]
Erdman C, Emerson JW. bcp: an R package for performing a Bayesian analysis of change point problems. J. Stat. Software (2007) 23:1–13.
Erdman C, Emerson JW. An R Package for Performing a Bayesian Analysis of Change Point Problems. In: R package version 1–2 (2008) Available athttp://cran.r-project.org/(last accessed date June 16, 2008).
Fridlyand J, et al. Hidden Markov models approach to the analysis of array CGH data. J. Multivar. Anal (2004) 90:132–153.[CrossRef]
Garcia R, Perron P. An analysis of the real interest rate under regime shifts. Rev. Econom. Stat (1996) 78:111–125.[CrossRef]
Hsu L, et al. Denoising array-based comparative genomic hybridization data using wavelets. Biostatistics (2005) 6:211–226.[Abstract]
Hupe P, et al. Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics (2004) 20:3413–3422.
Jong K, et al. Chromosomal breakpoint detection in human cancer. Lect. Notes Comput. Sci (2003) 2611:54–65.[CrossRef]
Kallioniemi A, et al. Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors. Science (1992) 258:818–821.
Lai WR, et al. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics (2005) 21:3763–3770.
Lingjaerde OC, et al. CGH-Explorer: a program for analysis of array-CGH data. Bioinformatics (2005) 21:821–822.
Myers C, et al. Accurate detection of aneuploidies in array CGH and gene expression microarray data. Bioinformatics (2004) 20:3533–3543.
Picard F, et al. A statistical approach for array CGH data analysis. BMC Bioinformatics (2005) 6:1471–2105.
Olshen AB, Venketraman ES. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics (2004) 5:557–572.[Abstract]
Plummer M. coda: Output Analysis and Diagnostics for MCMC. In: R package version 0.12–1 (2007) Available athttp://cran.r-project.org/(last accessed date June 16, 2008).
Polzehl J, Spokiony V. Adaptive weights smoothing with applications to image restoration. J. R. Stat. Soc. B (2000) 62:335–354.[CrossRef]
R Development Core Team. R: a language and environment for statistical computing. (2008) Available athttp://www.R-project.org.
REvolution Computing with support and contributions from Pfizer. nws: R~functions for NetWorkSpaces and Sleigh. In: R package version 1.6.3 (2008) Available athttp://nws-r.sourceforge.net.(last accessed date June 16, 2008).
Sen A, Srivastava M. On tests for detecting change in mean. Ann. Stat (1975) 3:98–108.[CrossRef]
Snijders A, et al. Assembly of microarrays for genome-wide measurement of DNA copy number. Nat. Genet (2001) 29:263–246.[CrossRef][Web of Science][Medline]
Venkatraman ES, Olshen AB. DNAcopy: A Package for Analyzing DNA Copy Data}. In: R package version 1.6.0 (2007a) Available athttp://www.bioconductor.org/packages/2.2/bioc/html/DNAcopy.html(last accessed date June 16, 2008).
Venkatraman ES, Olshen AB. A faster circular binary segmentation for the analysis of array CGH data. Bioinformatics (2007b) 23:657–683.
Wang P, et al. A method for calling gains and losses in array CGH data. Biostatistics (2005) 6:45–58.[Abstract]
Yao YC. Estimation of a noisy discrete-time step function: Bayes and empirical Bayes approaches. Ann. Stat (1984) 12:1434–1447.[CrossRef]
Zeileis A, et al. strucchange: an R package for testing for structural change in linear regression models. J. Stat. Software (2002) 7:1–38.
Zeileis A, et al. Testing and dating of structural changes in practice. Comput. Stat. Data Anal (2003) 44:109–123.[CrossRef]
Zeileis A. strucchange: Testing, Monitoring and Dating Structural Changes}. In: R package version 1.3-2 (2007) Available athttp://cran.r-project.org/(last accessed date June 16, 2008).
This article has been cited by other articles:
![]() |
O. M. Rueda and R. Diaz-Uriarte RJaCGH: Bayesian analysis of aCGH arrays for detecting copy number changes and recurrent regions Bioinformatics, August 1, 2009; 25(15): 1959 - 1960. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||




