Bioinformatics Advance Access originally published online on February 19, 2007
Bioinformatics 2007 23(8):1006-1014; doi:10.1093/bioinformatics/btm059
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Continuous-index hidden Markov modelling of array CGH copy number data
1Centre for Mathematical Sciences, Lund University, Box 118, 221 00 Lund and 2Department of Oncology, University Hospital, 221 85 Lund, Sweden
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: In recent years, a range of techniques for analysis and segmentation of array comparative genomic hybridization (aCGH) data have been proposed. For array designs in which clones are of unequal lengths, are unevenly spaced or overlap, the discrete-index view typically adopted by such methods may be questionable or improved.
Results: We describe a continuous-index hidden Markov model for aCGH data as well as a Monte Carlo EM algorithm to estimate its parameters. It is shown that for a dataset from the BT-474 cell line analysed on 32K BAC tiling microarrays, this model yields considerably better model fit in terms of lag-1 residual autocorrelations compared to a discrete-index HMM, and it is also shown how to use the model for e.g. estimation of change points on the base-pair scale and for estimation of conditional state probabilities across the genome. In addition, the model is applied to the Glioblastoma Multiforme data used in the comparative study by Lai et al. (Lai,W.R. et al. (2005) Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics, 21, 3763–3370.) giving result similar to theirs but with certain features highlighted in the continuous-index setting.
Contact: susann.stjernqvist{at}matstat.lu.se
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
|
|
|---|
Array comparative genomic hybridization (aCGH) is a technique to perform genome-wide scans of DNA copy numbers (Snijders et al., 2001). In this technique, a sample and reference DNA are differentially labelled with two different fluorescent dyes (typically red, Cy5, and green, Cy3). This mixture is then hybridized on a microarray prepared with so-called clones. A clone is a short segment of DNA derived from e.g. bacterial artificial chromosomes (BACs) or synthesized short oligonucleotides, starting and ending at predefined positions in the genome. A laser scanner is used to detect the remaining fluorescent signal on the slide after post-hybridization washes, and the signal intensities obtained per clone provides information on the relative copy numbers of the sample DNA versus the reference DNA. Normal DNA, with the exception of sex chromosomes, has two copies, yielding the ratio 2/2, while the ratio is 1/2 or 0/2 if the sample DNA misses one or both copies, respectively. If instead copies are gained, the ratios are 3/2, 4/2 and so on.
Genome-wide copy number screens are often performed on cells that express a disease putatively caused by DNA copy number alterations; commonly tumour cells but also different monogenic diseases (Vissers et al., 2005). The ultimate purpose of the aCGH experiment is then to find copy number alterations that are related to the disease in question. At a more basic level, however, the aim of the analysis of aCGH data is to find, or reconstruct, the copy number of the sample DNA throughout the genome. This problem can be split into two: to find the breakpoints in the genome, i.e. where the number of copies changes, and to find the DNA copy number in each of these regions. The intensity ratios obtained from an aCGH experiment are transformed into log2-scale, putting the normal 2/2 ratio at the level zero. The log2-ratios are only observed with error however, as the sampling and measurement procedures give rise to both systematic and random errors. The systematic error originates from several different sources. One is that when extracting DNA from cancer tumours for instance, it is difficult to completely separate these cells from surrounding tissue. Thus, part of the sample DNA may indeed be normal cell DNA. Another possible source of error is that of intra-sample heterogeneity; all sample cells interrogated may not have identical alterations, for instance for reasons of being in different stages of disease development. Both of these errors decrease the amplitude of the sample alterations, shrinking the log2-ratios towards zero from the theoretical levels. A further possible source of systematic error is cross-hybridization of repeated sequences of DNA in the genome, meaning that the same sequence of base-pairs may occur at more than one location in the genome. We will still refer to the ratios including systematic errors as log2-ratios, even though they do not exactly correspond to the numbers log2(k/2) for integer k. The random errors arise from different kinds of measurement errors in the experimental procedure. Summing up, the data can be described as log2-ratios observed with additive noise.
As described above, a typical basic aCGH data analysis tries to reconstruct the log2-copy number ratios from the observed data by segmenting the genome into regions of constant copy number. This problem has previously been studied using several different methods. Many authors assume a normal distribution model for which the mean, and possibly variance, change at an unknown number of change points. For a fixed number of change points, dynamic programming solves the problem of optimally (in likelihood sense) locating them; their number is then typically chosen by subtracting from the likelihood a penalty that increases with the number of change points, and choosing as the final model the one optimizing the overall score. Among these articles are Jong et al. (2003) who used BIC/AIC penalty and Daruwala et al. (2004) who used a Poisson probability for the number of change points. Picard et al. (2005) provides a nice summary of these methods and also propose estimating the penalty adaptively from data. Other methods based on a normal segmented likelihood include Olshen et al. (2004) who proposed a modified binary segmentation algorithm which they called circular binary segmentation, Hupé et al. (2004) who employed local regression smoothing to reconstruct copy numbers and then inserted change points at abrupt large changes, and Myers et al. (2004) who applied an edge detection filter to find preliminary change points and and then fined-tuned them with an EM algorithm. Hsu et al. (2005) employed wavelets to find change points, Eilers and de Menezes et al. (2005) used quantile or L1-norm smoothing to reconstruct copy numbers, and Wang et al. (2005) suggested building a cluster-tree to gradually merge regions with identical copy numbers. Finally, hidden Markov models were used by Fridlyand et al. (2004), Guha et al. (2006) and Marioni et al. (2006). The latter authors took a fully Bayesian approach, assigning prior distributions to all model parameters.
The purpose of this article is to device a continuous-index hidden Markov model (HMM) approach to the analysis of aCGH data. This approach is motivated below, but first we briefly recall the HMM approach of Fridlyand et al. (2004). They associate to each clone k an unobserved state variable Xk representing the copy number, and the sequence Xk of hidden states is assumed to be a (discrete-index) Markov chain. Letting yk be the observed (measured) log2-ratio for clone k, this number is an observation of a random variable Yk whose conditional distribution given Xk = i is assumed to be normal, N(µi,
2), where µi is the log2-ratio for state i, including systematic errors and
2 is the noise variance. The transition probabilities of the hidden Markov chain are denoted pij. To estimate the unobserved Xk from data, Fridlyand et al. first estimated the model parameters (pij), (µi) and
2 through maximum likelihood using the EM algorithm. This procedure was repeated for models with r = 1, ... ,5 states, and for each model the maximal log-likelihood penalized with an BIC or AIC term was computed. Finally the model with optimal penalized log-likelihood was selected. For the final model, the a posteriori most likely realization of the Markov chain Xk given the data (yk) was reconstructed using the Viterbi algorithm.
Clone positions are given in units of base pairs (bp). A particular feature of the data used in the present study, not found in any of the references cited above, is that a substantial part of the clones overlap, that is, there are clones that start before the previous one has ended, and they do so with as much as 30% of their length; see Figure 4 for an illustration. In a discrete-index method that assigns one state to each clone, it is unclear how to interpret the result if two overlapped clones are assigned to different states; this remark indeed applies to any discrete-index method, regardless of whether it is an HMM or not. Another feature of our data is that the clones differ considerably in length (see Section 2) and then it does not seem reasonable to assume that the transition probability of moving from one hidden state at one clone to another state at the next clone, remains constant across all clones. Neither are the clones equally spaced, which also makes such an assumption questionable. These considerations motivated us to extend the model of Fridlyand et al. by replacing the discrete-index hidden Markov chain by a continuous-index Markov chain that may thus jump at change points continuously distributed at any base-pair along the genome, not only at clone edges. We remark that the need for developing analysis methods tailored for these kinds of array designs has been noticed by other authors (e.g. Picard et al., 2005, p. 11). The unit of base-pairs is indeed discrete too, but since the number of base-pairs in a chromosome is of the order of 109 and the number of jumps (change points) is many orders of magnitude smaller, the approximation error done by modelling jump locations by continuous variables is negligible. In a discrete-index HMM on base-pair scale, the transition probabilities pij would be either extremely small or extremely close to one; the latter for probabilities pii of remaining in a state from one base-pair to the next. In a continuous-index model there are no transition rates qii, removing this potential numerical problem. For parameter estimation in the continuous-index HMM, it is tempting to derive the EM algorithm, just as in the discrete-index case, but because of the clone overlaps this is infeasible; the E-step does not allow any simple implementation. Instead, we chose to use a Monte Carlo EM (MCEM) algorithm.
The article is organized as follows. We begin with a description of the dataset used (Section 2) and then detail the continuous-index method (Section 3). In Section 4, we compare the model to the discrete-index HMM of Fridlyand et al. in terms of residual correlations, discuss model selection using the continuous-index model and estimation of exact locations of jumps (change points), and we also apply the model to the Glioblastoma multiforme data used in the comparative study by Lai et al. (2005) and compare our result to theirs. Some concluding remarks are given in Section 5.
| 2 DATA |
|---|
|
|
|---|
The data used in this study is from DNA of the human breast cancer cell line BT-474, analyzed on 32K BAC tiling microarrays (Jönsson et al., 2006). This cell line is well characterized, and genomic aberrations identified by aCGH for this cell line have previously been reported by Hyman et al. (2002) and Shadeo and Lam (2006). Data includes measured and then normalized log2-ratios for each clone, as well as the genomic location. Locations are given as the start and stop position of each clone in base-pairs (bp). There are 29 682 clones in total over 24 chromosomes, yielding an average of 1237 clones per chromosome with a minimum and maximum of 213 and 2444, respectively. Clone lengths range from 330 bp to 0.298 Mb, so that the longest clone is 900 times longer than the shortest one. Such extreme differences are rare however; 95% of the clones have lengths in the range 0.098–0.212 Mb. Shorter lengths arise because some BAC clones are mapped to the genome by e.g. short STS-markers, not reflecting the true length of the actual BAC clone. A further feature of the data is that the clones often overlap, with up to 30% of their length. The reason for this selection of clones is that it increases the resolution of the clone set (Krzywinski et al., 2004).
| 3 THE CONTINUOUS-INDEX MODEL |
|---|
|
|
|---|
In the continuous-index model, DNA copy number changes are allowed to be located continuously over the chromosome. The copy number is modelled by a continuous-index Markov chain X = (X(t))0
t
T, where T is the length of the chromosome in bp. Here, briefly recall the basic properties of continuous-index Markov chains; fuller treatments are found in many excellent textbooks, e.g. Ross (2006). The process X = (X(t))0
t
T is a continuous-index random process, taking values in a finite state space, {1,2, ... ,r} say, where r is the number of states. The dynamics of the Markov chain are given by transition rates qij for j
i—rather than transition probabilities—where qij is the rate with which the chain moves from state i to state j; its unit here is bp–1. A probabilistic interpretation of this rate is that given X(t) = i, the probability that the chain jumps to a different state j in the small interval (t,t+h] equals qijh up to a term which is of smaller order than h, independently of the past trajectory of the chain. Thus, the probability that, given X(t) = i, the chain does not jump at all in the small interval (t,t+h] is 1–qi h, again up to a term of order smaller than h, where qi=
j
iqij is the total rate out of state i. An alternative way of characterizing the evolution of the process is to describe the distribution of the random waiting times between jumps, and the jump probabilities. Given that a jump to state i has occurred, the chain remains there for a sojourn whose length has a negative exponential distribution with rate qi (mean 1/qi). At the end of this sojourn, the chain jumps to a different state j with probability qij/qi. After this, the procedure starts over from state j, etc. The chain is initialized at index t = 0 by drawing the initial state X(0) from an initial distribution
=(
1, ... ,
r), i.e.
i is the probability that X(0) =i.
Returning to our model, the number of states r is initially selected using the discrete-index method as described in Section 1. As in that model, µ =(µ1, ... ,µr) is the set of log2-ratios for the respective states and
2 is the variance of the measurement error. Following Fridlyand et al. (2004) we do not set the means equal to log2(k/2) for different k since, as noticed above, observed log2-ratios do not follow these numbers but are rather shrunk towards zero. Instead we estimate the µi along with the rest of the model parameters, so that for each of the r states of the hidden Markov chain there is an unknown log2-ratio that we wish to estimate.
The measured log2-ratio yk for clone k is an observation of a random variable Yk whose conditional distribution given the hidden chain X = (X(t))0
t
T is assumed normal,
|
| (1) |
The parameters of the model are estimated with an MCEM algorithm (Tanner, 1996, Chapter 4). Such an algorithm works like an ordinary EM algorithm but with the E-step replaced, or approximated, using several simulated realizations of the missing data—the Markov chain in the present setting—given the observed data and current parameter estimates. Let
=(
,Q,µ ,
2) be the model parameter vector, with the model size r fixed and Q = (qij; i,,j = 1, ... ,r, i
j) being the collection of all transition rates. Instead of computing the central quantity
of the EM algorithm, with
c the complete log-likelihood and
0 the current parameter estimate, we simulate M realizations x1, ... , xM of the missing data X = (X(t))0
t
T conditionally on the observed data
and given the current estimate
0, and then approximate
. The parameter update
of the MCEM algorithm is given by an ordinary M-step applied to
:
. We now make explicit these derivations for the present model. The complete log-likelihood in our case is (cf. Rydén, 1996, Equation (7))
|
|
t
T is a Markov trajectory, mij is the number of jumps from state i to state j in realization x, ti is the total length of all sojourns spent in state i for realization x and tik is the total length of sojourns spent in state i within clone k. Because the complete likelihood forms an exponential family of statistical models, the above expression is linear in the sufficient statistics mij, ti and tik. Thus, assuming that we have at hand a total of M simulated trajectories drawn from the conditional distribution of X given y and parameters
0, we can approximate the conditional expectation E
0(nij, | y), which is part of the E-step, by the sample mean
All in all, the principle of MCEM is thus as follows: starting from an initial estimate
0, generate a set of simulated trajectories from the conditional distribution of X given y and parameter
0. Use these trajectories to compute an approximation
and an updated estimate
1 with the M-step. Then simulate a new set of trajectories from the conditional distribution of X given y and parameter
1, etc.
For simulating the realizations of the Markov chain we employed a Markov chain Monte Carlo (MCMC) methodology developed by this algorithm, the trajectory is gradually modified by a Markov chain mechanism whose stationary distribution is precisely the desired conditional distribution of X given data y and model parameters. Thus, after a burn-in period, the trajectories may be assumed to be drawn from the desired distribution. The simulation of the Markov chain realizations is further described in the Supplementary Material.
After 20 MCEM iterations, the parameters had converged (Fig. 1). With these parameter estimates, we simulated another 1 000 000 realizations of the Markov chain x and kept every 10 000-th, thus obtaining 100 essentially independent realizations that we used for further calculations. Following Fridlyand et al. (2004), we did not attempt to analyse the whole genome as one sequence, but rather did separate analyses for each chromosome.
|
| 4 RESULTS |
|---|
|
|
|---|
4.1 Examples of reconstruction
For discrete-index HMMs, the Viterbi algorithm is used to find the most likely realization of the hidden Markov chain conditional on data and given a specified set of parameters. With continuous index there is no counterpart to this algorithm. Instead, we did a reconstruction of the hidden chain by, for each index t, selecting the state that was most frequent among the 100 realizations mentioned in Section 3. The results for chromosomes 9 and 10 are shown in Figure 2; these chromosomes were analysed also by Picard et al. (2005, Fig. 5), with a different set of clones. On the whole, our results are similar to those of Picard et al. One may notice though that in our reconstruction for chromosome 9, there is a visit to state 3 at
10.5x107 bp that does not appear in Picard et al.; this difference is probably due to the finer resolution in our data. We also notice that, in particular for chromosome 9, there are a number of very short excursions to state 3 that may be considered artificial. They appear because of large variations in the data, and could be eliminated or at least decreased in number by restricting the transition rates of the hidden Markov chain, as discussed in Section 5.
|
4.2 Residual autocorrelation
Since the continuous-index model can change copy number anywhere along the genome, we expect this model to provide a better fit to the data at hand. Residual analysis is a standard tool in time series analysis, and one way to verify the potentially better fit is to check if the residual correlation between adjacent clones is smaller than for the discrete-index HMM. Indeed, consider the situation of a true change point located within the overlap of two adjacent clones k and k + 1. With the discrete-index HMM, each clone can only be assigned to a single state, and hence parts of both clones will be misclassified in any realization of the hidden Markov chain. This creates potentially large (relative to the noise variance) residuals y·k– µxk and yk + 1 – µxk + 1 that contribute to the lag-1 autocorrelation of the series of residuals. The continuous-index HMM can eliminate this problem, although large residuals may still occur as an effect of outliers among the yk. For this model the residuals are y·k– µ(k) with µ(k) the mean of the normal distribution specified in (1). In contrast to e.g. autoregressive models, neither of the HMMs allow the residuals to be computed from the data and estimated model parameters alone, as the hidden Markov chain can never be reconstructed with certainty. Rather one has to simulate trajectories conditionally on data and with estimated model parameters, from which residuals are computed. For the continuous-index HMM, we used the 100 independent realizations of the Markov chain mentioned in Section 3. For the discrete-index HMM, we simulated 100 independent trajectories using forward-filtering backward-sampling (Cappé et al., 2005, Algorithm 6.1.1). For each simulated trajectory, the sample lag-1 residual autocorrelation was computed, thus yielding 100 simulated correlations for both HMMs.
The lag–1 residual correlation was found to be smaller with the continuous-index HMM in all chromosomes investigated, except chromosome 8, for which there is no significant difference (Fig. 3). In most cases, the correlation was also very close to 0, with the exception of chromosome 15 for which it was around 0.2 (but with a larger correlation for the discrete-index method), and chromosome 17 for which similar results were obtained. A possible reason for the high correlation in these chromosomes is that more than five states may be needed to model the data appropriately, at least in terms of achieving low residual correlation, for instance because of outliers. Increasing the number of states beyond r = 5 decreased the residual correlation for the discrete-index HMM, but caused the MCMC sampler used with the continuous-index HMM to show very poor mixing; the problem appears as sections of the genome being allocated to a totally wrong state. The problem lies with the MCMC algorithm, which proposes updates in the Markov trajectory at random without taking the data into account; proposing changes that are likely with respect to the data thus becomes increasingly unlikely as the number of states increases.
|
4.3 Model selection with the continuous-index method
The results presented for the continuous-time HMM so far were obtained with model sizes r found with Bayesian information criterion (BIC) for the discrete-index HMM. In this section, we consider how to choose r using the latter model directly. To employ penalized likelihood methods, it is necessary to compute the log-likelihood
(y;
) of the data for a given parameter
. For a discrete-index HMM, this likelihood is given as a by-product of the forward algorithm. For continuous-index HMMs, one can in some cases devise similar formulae, but in our case this is not possible due to the overlap of the clones. Instead we took a Monte Carlo approach. Let |
| (2) |
|
| (3) |
As they are being computed from Monte Carlo simulations, the estimates
possess random variation. To investigate the size of this variation, and in particular to find if the
,
, etc. are different with statistical significance, we also computed bootstrap confidence intervals for
. Since the Monte Carlo samples (trajectories) used to compute
are dependent, we used blockwise bootstrap (Davison and Hinkley, 1997, pp. 396–408). The block length was set to 10 000 and each sample initiated a new block; thus the number of blocks equalled the number M' of MCMC samples. The MCMC sequence was wrapped at the end points (the first sample considered to follow the last) in order to have all samples appear in the same number of blocks. Using these blocks, 1000 bootstrap replications of length M' were constructed through independent sampling, yielding a set of bootstrapped versions of
from which empirical 2.5 and 97.5% quantiles were extracted; the result is a Monte Carlo confidence interval for
r with coverage
95%.
Results for three different chromosomes are reported in Table 1. For all the three chromosomes, the largest penalized likelihood is significantly different from the other ones; we thus select 5, 4 and 3 states for chromosomes 2, 8 and 17, respectively. For the discrete-index HMM, the corresponding model orders are 4, 5 and 4, respectively. Hence, the continuous-index HMM selects different order in all cases.
|
Comparing the estimated means
–0.16 respectively –0.02
–0.05), and the mean 0.37 in the smaller model is between the means 0.11 and 0.43 in the larger model.
4.4 Locations of jumps
One advantage of the continuous-index model is that we can estimate the exact base-pair position of a jump (change point) from one state to another. Figure 4 shows a short sequence of data for chromosome 4. Overlaps are clearly visible and it is also obvious that a part of the data is in a different state than the rest. It is however not obvious exactly where the two jumps are. The discrete-index model assigns whole clones to different states, but the continuous-index model can be more precise. We estimated the location of the jump in two different ways. First we used the 100 independent realizations of the hidden Markov chain in a model with two states (r = 2), all of which jumped twice in the region in question, to compute kernel density estimates of the two locations (Fig. 4). The second method uses properties of continuous-index Markov chains. Assume that [T1,T2] is an interval such that we know that X(T1) = i, X(T2) = j and that there is exactly one jump in the interval; assume also that the model parameters are given. Then the conditional density of the jump location, t say, is proportional to
|
| (4) |
|
The two methods give similar results (Fig. 4). We notice that in particular the second jump is clearly indicated to be located within a clone. In situations like this, when there are two groups of clones with well-separated log2-ratios but a few clones with log2-ratios in between, the discrete-index method sometimes assigns the intermediate clones to a third possibly superficial state while the continuous-index method rather locates jumps within clones to allocate only parts of them to the two respective states. This is a further advantage of the continuous-index method.
4.5 Analysis of Glioblastoma Multiforme data
We also applied our algorithm to the Glioblastome Multiforme (GBM) data used by Lai et al. (2005) to compare the performances of various algorithms for aCGH data analysis. This data is publicly available at www.chip.org/~ppark/arrayCGH_comparison/. Figure 5 shows the results obtained for data around the EGFR locus of chromosome 7, together with the log2-ratios and physical extensions of clones. The number of states chosen by BIC estimated as in Section 4.3 was r = 2; we notice that the discrete-index HMM chooses r = 1 (Lai et al., 2005). With the fitted parameters, we run another 1 000 000 iterations of the MCMC sampler and recorded every 2000-th, thus obtaining a total of 500 realizations of the hidden Markov chain. Then for each position (base pair) we estimated the conditional probability, given the data, of the Markov chain being in either state as the proportion of realizations being there.
|
Figure 5 is directly comparable to Lai et al. (2005, Fig. 4), apart from the continuous-index plotting used here. The figure indicates three amplified regions. The first two of these have sharp boundaries between regions with probabilities essentially equal to 0 and 1, respectively, and are the same as those detected by several methods in Lai et al. (2005). Some of the methods employed in that article detected these two as a single amplification, but the authors noted that mapping the probes to their physical positions suggested that they are likely to be two separate aberrations. This mapping is indeed done in Figure 5 (apart from the suppression of regions without data), and is also explicitly taken into account by our model. The third amplification in Figure 5 is more blurred in the sense that it does not have sharp boundaries between 0 and 1 regions. To explain this we notice that there is one very long clone that extends over the whole of this amplification, and that the measured ratio for this clone lies rather close to the normal level. Some of the Markov chain realizations jumped to the amplified level already at the start of this clone, yielding an estimated probability of
0.2 for being in the amplified state at this point. Further down the genome there are other clones with larger log2-ratios, and the estimated probability of being in the amplified state then rises to close to 1. Two parts of this amplified region contain no data (clones) apart from the long clone however, and many realizations return to the normal state there; consequently, the probability of the amplified state decreases within these parts. Our model encourages realizations to return to the normal state since the expected log2-ratio for a clone is modelled as the average of means visited along the clone [Equation (1)], so that these jumps make the expected log2-ratio for the long clone come closer to its measured ratio and hence increase the likelihood. Different realizations jump at different points, causing smooth changes of the probabilities. We stress that these details can only be seen in a continuous-index setting. Indeed, in a discrete-time model it is not at all obvious how to order the long clone relative to the other ones. Typically clones are sorted by starting position, but the long clone could equally well be placed elsewhere. | 5 CONCLUDING REMARKS |
|---|
|
|
|---|
We have described a continuous-index hidden Markov model for modelling aCGH data arising from array designs with clones that are of unequal length, are unevenly spaced and/or overlap, as well as a Monte Carlo approach to parameter estimation in this model. In contrast to copy number restoration algorithms that do not specify a statistical model for the copy number sequence—a Markov chain in the present case—this modelling framework is capable of providing figures of uncertainty for essentially all quantities of interest. The jump locations in Section 7.4 is one example, while other relevant examples are the probability of a certain state at a given (continuous) location (as in Fig. 5) and the distribution of the number of excursions (aberrations) away from the normal state. All such quantities, and their conditional distribution given data, can be estimated using the MCMC samples of the hidden chain.
The foremost drawback of the model in its present form is the computational demands required for using it. As an example, running 20 iterations of the MCEM algorithm plus the additional 1 000 000 MCMC iterations, in total 5 000 000 iterations, took
25 CPU hours for chromosome 12 and a model with four states using a Matlab code running under Linux on a Pentium 4 3.06 GHz PC. Models with five states took somewhat longer, whereas models with fewer states were faster. The dependence of the running time on the number of clones was smaller. These running times include the computation of the conditional likelihood L(y|x,
) of Equation (2) in each iteration, which contributes a significant part of the computation time and may be omitted if model order selection is not to be performed. Also, results that are less precise than those reported here although still very indicative, can be obtained with <5 000 000 iterations. As mentioned above, we were able to carry out estimation for models with up to five states, whereas mixing of the MCMC algorithm showed too slow convergence (mixing) for larger models. Thus, it is necessary to devise a more intelligent MCMC sampler to work efficiently with larger models. One such possible modification could be to use a more informative distribution for proposing changes to the trajectory of the hidden Markov chain, for instance one based on a discrete-index restoration, rather than the present one that proposed changes blindly without taking data into account; cf. Section 4.2. Of course such an improved sampler would be beneficial to smaller models as well since a smaller number of replications, and hence less computation time, would be required for reliable results. A different strategy for decreasing the computational demands would be various model simplifications, for instance specifying a common rate for jumps away from the normal state, and another common rate for jumps in the reverse direction, or to abandon the assumption of a finite state space and draw the mean level of each excursion away from the normal level from some prior distribution; cf. Daruwala et al. (2004) who take a similar approach. We attempt to elaborate with such parsimonious parametrizations in future work. We do remark however that a five-state model is reasonably flexible since these states can model e.g. homozygotic loss, low copy loss, normal, low copy gain and amplification.
In our analyses we have also estimated all model parameters, including the transition rates of the hidden Markov chain. With the BT-474 cell line data, for which the clone tiling covered the whole genome, this worked well. The GBM data analysed in Section 4.5 contains substantial gaps in the tiling however, i.e. parts of the genome not covered by clones (cf. caption of Fig. 5). Within the gaps there is no information about the hidden Markov chain, so its simulated realizations may jump quite frequently during long gaps. This in turn causes the estimated jump rates to increase in the M-step. For the data displayed in Figure 5 we obtained the transition rate estimates 6.5x 10–7 bp–1 and 5.1x 10–6 bp–1 respectively for jumps from the normal state to the amplified one and vice versa, and these numbers seem overly large. A remedy would be to put a prior distribution on the transition rates, thus turning to a Bayesian model. Alternatively, one could argue that it is not reasonable to estimate the rates at all from a set of data that contains just a few number of jumps and that they should be fixed apart from possible tuning adjustments (cf. Daruwala et al., 2004).
The BioHMM package (Marioni et al., 2006) is an HMM designed to take unevenly spaced clones into account, as the transition probabilities are allowed to depend on distance between clones. This is done in a rather ad-hoc fashion however, in that there is no underpinning statistical model for the transition probabilities. Indeed, let P(t) denote the transition probability matrix between states for two clones separated by distance t (denoted Ai by Marioni et al.) in the BioHMM model specification. Then the Chapman–Kolmogorov equations P(t1)P(t2) = P(t1+t2) do not hold, so that this specification does not constitute a consistent family of transition probabilities for a (discrete-time) Markov chain. Put simply, there is no Markov chain with the BioHMM-transition probabilities. There are no principal difficulties in incorporating a continuous-index hidden Markov chain into the BioHMM framework; we just let P(t) be the exponential matrix expQt. Overlapping clones would still not fit into the model however.
The new oligonucleotide CGH arrays do also fit well into a continuous-index framework. Such arrays feature a large number of very short clones, e.g. up to >500 000 different single-nucleotide polymorphisms (SNPs) probed by 25-bp clones with GeneChip arrays from Affymetrix, >300 000 SNPs probed by 50-bp clones with the HumanHap300 BeadChip from Illumina, or >244 000 clones of 60 bp with Human Genome CGH Microarray Kit products from Agilent Technologies. The average distance between clones is typically in the range 5–10 kb. Clones do not overlap but are unevenly spaced, so that a continuous-index model, for instance a Markov chain, is suitable for modelling the dependence between clones as a function of distance. Presently, it is common to use a moving average filter, possibly weighted, to average nearby clones (Barrett et al., 2004; Peiffer et al., 2006). Estimating copy numbers with an explicit statistical model for change points would make implicit use of a similar averaging, but making it less ad-hoc, and would also, as remarked above, allow construction of measures of uncertainty of any quantity of interest.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
This research was partially funded by a grant from the Swedish Research Council.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Keith Crandall
Received on November 8, 2006; revised on December 20, 2006; accepted on February 12, 2007
| REFERENCES |
|---|
|
|
|---|
Barrett MT, et al. Comparative genomic hybridization using oligonucleotide microarrays and total genomic DNA. Proc. Natl Acad. Sci. USA (2004) 101:17765–17770.
Cappé O, et al. Inference in Hidden Markov Models. (2005) New York: Springer.
Daruwala R, et al. A versatile statistical analysis algorithm to detect genome copy number variation. Proc. Natl Acad. Sci. USA (2004) 101:16292–16297.
Davison AC, Hinkley DV. Bootstrap Methods and their Application. (1997) Cambridge: Cambridge University Press.
Eilers PHC, de Menezes RX. Quantile smoothing of array CGH data. Bioinformatics (2005) 21:1146–1153.
Fridlyand J, et al. Hidden Markov models approach to the analysis of array CGH data. J. Multivar. Anal (2004) 90:132–153.[CrossRef]
Guha S, et al. Bayesian hidden Markov modeling of array CGH data. J. Amer. Statist. Assoc (2006) Submitted.
Hsu L, et al. Denoising array-based comparative genomic hybridization data using wavelets. Biostatistics (2005) 6:211–226.[Abstract]
Hupé P, et al. Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics (2004) 20:3413–3422.
Hyman E, et al. Impact of DNA amplification on gene expression patterns in breast cancer. Cancer Res (2002) 62:6240–6245.
Jong K, et al. Chromosomal breakpoint detection in human cancer. In: Lecture Notes in Computer Science.—Cagnoni S, ed. (2003) Berlin: Springer. 54–65. Vol. 2611.
Jönsson G, et al. High-resolution genomic profiles of breast cancer cell lines assessed by tiling BAC array comparative genomic hybridization. Genes, Chromosomes and Cancer (2007) doi:10.1002/gcc. 20438.
Kass RE, Raftery AE. Bayes factors. J. Am. Statist. Assoc (1995) 90:773–795.[CrossRef][Web of Science]
Krzywinski M, et al. A set of BAC clones spanning the human genome. Nucleic Acids Res (2004) 32:3651–3660.
Lai WR, et al. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics (2005) 21:3763–3370.
Marioni JC, et al. BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data. Bioinformatics (2006) 22:1144–1146.
Myers CL, et al. Accurate detection of aneuploidies in array CGH and gene expression microarray data. Bioinformatics (2004) 20:3533–3543.
Olshen AB, et al. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics (2004) 5:557–572.[Abstract]
Peiffer DA, et al. High-resolution genomic profiling of chromosomal aberrations using Infinium whole-genome genotyping. Genome Res (2006) 16:1136–1148.
Picard F, et al. A statistical approach for array CGH data analysis. BMC Bioinformatics (2005) 6:27.[CrossRef][Medline]
Ross SM. Introduction to Probability Models. (2006) 9th edn. San Diego, CA: Academic Press.
Rydén T. An EM algorithm for estimation in Markov-modulated Poisson processes. Comput. Statist. Data Anal (1996) 21:431–447.[CrossRef]
Shadeo A, Lam WL. Comprehensive copy number profiles of breast cancer cell model genomes. Breast Cancer Res (2006) 8:R9.[CrossRef][Medline]
Snijders AM, et al. Assembly of microarrays for genome-wide measurement of DNA copy number. Nat. Genet (2001) 29:263–264.[CrossRef][Web of Science][Medline]
Tanner MA. Tools for Statistical Inference. (1996) 3rd edn. New York: Springer.
Vissers LE, et al. Identification of disease genes by whole genome CGH arrays. Hum. Mol. Genet (2005) 14:R215–R223.
Wang P, et al. A method for calling gains and losses in array CGH data. Biostatistics (2005) 6:45–48.[Abstract]
This article has been cited by other articles:
![]() |
C. D. Greenman, G. Bignell, A. Butler, S. Edkins, J. Hinton, D. Beare, S. Swamy, T. Santarius, L. Chen, S. Widaa, et al. PICNIC: an algorithm to predict absolute allelic copy number variation with microarray cancer data Biostat., October 15, 2009; (2009) kxp045v1. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. Stjernqvist and T. Ryden A continuous-index hidden Markov jump process for modeling DNA copy number data Biostat., October 1, 2009; 10(4): 773 - 778. [Abstract] [Full Text] [PDF] |
||||
![]() |
P. Nicolas, A. Leduc, S. Robin, S. Rasmussen, H. Jarmer, and P. Bessieres Transcriptional landscape estimation from tiling array data using a model of signal shift and drift Bioinformatics, September 15, 2009; 25(18): 2341 - 2347. [Abstract] [Full Text] [PDF] |
||||
![]() |
E. Budinska, E. Gelnarova, and M. G. Schimek MSMAD: a computationally efficient method for the analysis of noisy array CGH data Bioinformatics, March 15, 2009; 25(6): 703 - 713. [Abstract] [Full Text] [PDF] |
||||
![]() |
L.-y. Wang, A. Abyzov, J. O. Korbel, M. Snyder, and M. Gerstein MSB: A mean-shift-based approach for the analysis of structural variation in the genome Genome Res., January 1, 2009; 19(1): 106 - 117. [Abstract] [Full Text] [PDF] |
||||
![]() |
R. Andersson, C. E. G. Bruder, A. Piotrowski, U. Menzel, H. Nord, J. Sandgren, T. R. Hvidsten, T. Diaz de Stahl, J. P. Dumanski, and J. Komorowski A segmental maximum a posteriori approach to genome-wide copy number profiling Bioinformatics, March 15, 2008; 24(6): 751 - 758. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||










