Skip Navigation

Bioinformatics 2007 23(13):i450-i458; doi:10.1093/bioinformatics/btm221
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Google Scholar
Right arrow Articles by Shah, S. P.
Right arrow Articles by Murphy, K. P.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Shah, S. P.
Right arrow Articles by Murphy, K. P.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Modeling recurrent DNA copy number alterations in array CGH data

Sohrab P. Shah 1,*, Wan L. Lam 2, Raymond T. Ng 1 and Kevin P. Murphy 1

1Department of Computer Science, University of British Columbia, 201-2366 Main Mall Vancouver, BC V6T 1Z4 Canada and 2BC Cancer Research Centre, 675 W 10th Ave Vancouver, BC V5Z 1L3, Canada

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 RELATED WORK
 3 METHODS
 4 QUANTITATIVE RESULTS ON...
 5 QUALITATIVE RESULTS ON...
 6 DISCUSSION AND FUTURE...
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: Recurrent DNA copy number alterations (CNA) measured with array comparative genomic hybridization (aCGH) reveal important molecular features of human genetics and disease. Studying aCGH profiles from a phenotypic group of individuals can determine important recurrent CNA patterns that suggest a strong correlation to the phenotype. Computational approaches to detecting recurrent CNAs from a set of aCGH experiments have typically relied on discretizing the noisy log ratios and subsequently inferring patterns. We demonstrate that this can have the effect of filtering out important signals present in the raw data. In this article we develop statistical models that jointly infer CNA patterns and the discrete labels by borrowing statistical strength across samples.

Results: We propose extending single sample aCGH HMMs to the multiple sample case in order to infer shared CNAs. We model recurrent CNAs as a profile encoded by a master sequence of states that generates the samples. We show how to improve on two basic models by performing joint inference of the discrete labels and providing sparsity in the output. We demonstrate on synthetic ground truth data and real data from lung cancer cell lines how these two important features of our model improve results over baseline models. We include standard quantitative metrics and a qualitative assessment on which to base our conclusions.

Availability: http://www.cs.ubc.ca/~sshah/acgh

Contact: sshah{at}cs.ubc.ca


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 RELATED WORK
 3 METHODS
 4 QUANTITATIVE RESULTS ON...
 5 QUALITATIVE RESULTS ON...
 6 DISCUSSION AND FUTURE...
 ACKNOWLEDGEMENTS
 REFERENCES
 
Genetic alterations are a hallmark of numerous human diseases such as cancer and mental retardation. Recent advances in the genome-wide identification and localization of genetic alterations by high resolution array comparative genomic hybridization (aCGH) technologies (Ishkanian et al., 2004, Pinkel and Albertson, 2005) have furthered our understanding of the effect of genetic alterations on disease. Array CGH measures genetic changes as DNA copy number alterations (CNAs) of an individual's DNA against a reference at a fixed set of locations in the genome (Pinkel and Albertson, 2005). A recurrent CNA in a cohort of patients is a CNA found at the same location in multiple samples. Therefore, recurrent CNAs define a pattern that provides a molecular characterization of the cohort's phenotype, potentially identifying disrupted molecular processes, molecular targets for diagnosis, and development of novel therapeutics. A recent example is the identification of recurrent CNAs across different subtypes of lung cancer, reported in Coe et al. (2006). This led to the elucidation of molecular mechanisms that contribute to the distinct phenotypes in small cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC), which we show in this article (see Section 5).

In this study, we describe the development of statistical models for the detection of recurrent CNAs across multiple aCGH experiments from a cohort of individuals. Figure 1a–c shows an example of five aCGH NSCLC samples over small regions containing three different types of recurrent CNAs on chromosomes 8, 9 and 1. For each of the ~30 000 probes in an aCGH experiment, a log ratio of the hybridization level of the sample, relative to the reference, is produced. The log ratios have a noisy correspondence to CNAs: deletions (losses) result in negative log ratios, amplifications (gains) result in positive log ratios and no change (neutral) regions in zero log ratios.


Figure 1
View larger version (15K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. aCGH profiles of three different types of recurrent CNAs for five NSCLC cell lines (labeled on the right). Horizontal red lines indicate the 0 log ratio level for each sample. Vertical black lines indicate the position of a known gene of interest in NSCLC. (a) a high level shared amplification of a region spanning ~3 Mb containing the MYC oncogene on chromosome 8 (shown with vertical line). (b) a single clone shared aberration at the CA9 locus on chromosome 9. (c) a low-level amplification on chromosome 1 including TNFRSF4 and TP73—both implicated in NSCLC. Both (b) and (c) illustrate examples of recurrent CNAs that may be undetectable if each sample is pre-processed separately.

 
Figure 1a shows a recurrent CNA harboring the MYC oncogene. One common strategy to identify such a recurrent CNA is to first pre-process individual samples to make calls of losses and gains, and then to infer recurrent CNAs using a threshold frequency of occurrence (de Leeuw et al., 2004; Garnis et al., 2006, Pollack et al., 2002). We call this process AF for alteration frequency (see Section 3 for details). While AF may detect signals as shown in Figure 1a, pre-processing or discretizing the sequences separately may remove information by smoothing over short or low-amplification CNAs. However, by jointly considering all the data without pre-processing, we can borrow statistical strength (Gelman et al., 2004) across the samples and identify locations where the signal is shared in the raw data. For example, in Figure 1b, we show data at the locus containing an important NSCLC gene, carbonic anhydrase IX (CA9) (Kim et al., 2004; Swinson et al., 2003). Log ratios of probes overlapping the gene are shown as blue stars and are indicated with arrows. This shared CNA may be hard to detect using AF because when processing individual samples, single probe CNAs are often indistinguishable from experimental noise (Veltman and de Vries, 2006). With high-dimensional arrays, many investigators require CNAs to span at least two consecutive probes (Baldwin et al., 2005; Garnis et al., 2006; Veltman and de Vries, 2006). However, if a single probe CNA is shared across many samples, it may correspond to an important biological feature.

Figure 1c shows a third type of signal that is a low-level or subtle shared CNA. The region includes two known lung cancer related genes, TNFRSF4 (Kawamata et al., 1998) and TP73. When compared to the MYC region in Figure 1a, the level of amplification for TNFRSF4 is much lower and is more difficult to distinguish from noise. However, cell lines H2122, HCC193 and HCC366 appear to share the low-level amplification. Furthermore, the TP73 [a putative tumor suppressor involved in cell death (Alarcon-Vargas et al., 2000)] loci exhibits low-level negative signals in three of the samples. These signals may be lost if each sample is pre-processed in isolation due to premature thresholding.

Most of the genome will not exhibit shared CNA patterns. Figure 1a (right end) shows a region from ~135 to 140 Mb (bounded by blue vertical lines) that is heterogeneous across the samples. One sample (HCC827) has an amplification while two are neutral (HCC193, H2087) and two are deletions (HCC366, H2122). This ambiguity in the signal across samples will be important when we develop our model in Section 3.

In this article, we present statistical models to infer recurrent CNAs from aCGH data. We extend the single sample hidden Markov model (HMM) (Fridlyand et al., 2004; Shah et al., 2006) to the multiple sample case. We consider three different ways to do this. The first simply modifies the observation model of the HMM so that at each location, a vector of observations is generated, one per sample. We call the state sequence of the HMM the ‘master’ sequence. It represents a classification of each probe location into a loss, neutral or gain state and hence it represents the canonical signal that encodes recurrent CNAs.

The second model augments this by allowing each observation in each sample to either be generated from the master sequence, or from its own private sequence. This allows for sample-specific random effects to be superimposed on the canonical signal. We demonstrate that this improves performance significantly. Finally, the third model augments the state space of the master sequence to allow undefined states, which represent locations which are highly ambiguous (such as the 135–140 MB region in Fig. 1a). This allows the master to focus on the highly conserved regions, and to ignore heterogeneous locations. We will show that the resulting output we infer is comparatively sparse, making it easier to create a short list of candidate locations for experimental follow up.

The remainder of the article is organized as follows. We discuss related work in Section 2. We describe our models in Section 3. In Section 4, we demonstrate our results on simulated data, where we know the ground truth, and in Section 5, we demonstrate results on well-studied lung cancer cell line data (Coe et al., 2006). In Section 6, we conclude the article and discuss future work.


    2 RELATED WORK
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 RELATED WORK
 3 METHODS
 4 QUANTITATIVE RESULTS ON...
 5 QUALITATIVE RESULTS ON...
 6 DISCUSSION AND FUTURE...
 ACKNOWLEDGEMENTS
 REFERENCES
 
There has been surprisingly little work on the automatic discovery of shared patterns from aCGH data. Rouveirol et al. (2006) propose an algorithm that takes as input a set of discretized sequences, and which outputs a set of minimal recurrent regions. This method works by converting the sequences to a S x T binary matrix (focusing on losses and gains separately), where S is the number of samples and T is the number of probes on the array. It then tries to find short blocks that are shared by some specified fraction of the samples. The disadvantages of their approach are that it requires discretized data, and that its running time is O(T2). (They also present an O(T) version, but this tended to not work as well.) Diskin et al. (2006) also take a binary S x T matrix as input, and use a greedy search procedure to find regions (‘stacks’) that are shared across samples with statistically significant frequency. Lipson et al. (2006) is the only previous work we are aware of that tries to find shared patterns using the raw data (to avoid problems with premature thresholding discussed above). They present an O(T2) time algorithm for finding intervals with maximal score, although they present two other versions of the algorithm which they say in practice are O(T1.5) and O(T) complexity. An important difference between this approach and ours is that Lipson et al. assume the log ratios are identically and independently distributed across the chromosome, when in fact they are spatially correlated, suggesting that Markovian dynamics, encoded by HMMs, should be used. In addition, our approach is O(T) and model based.


    3 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 RELATED WORK
 3 METHODS
 4 QUANTITATIVE RESULTS ON...
 5 QUALITATIVE RESULTS ON...
 6 DISCUSSION AND FUTURE...
 ACKNOWLEDGEMENTS
 REFERENCES
 
We consider analyzing multiple aCGH samples from multiple chromosomes, although to simplify notation, we will concentrate on modeling a single chromosome. The data is Formula , where Formula is the observed log ratio at location t in sample s, T is the number of probes and S is the number of samples. The goal is to identify patterns in the data which capture the pertinent CNAs that recur across the samples. A pattern consists (roughly speaking) of a list of locations which are highly conserved (either loss, neutral or gain). Thus, the pattern can be represented as a ‘master’ sequence of states M1:T where Mt isin {L,N,G} (loss, neutral, gain) is a multinomial random variable and t isin (1,2,...,T). Since we will often be uncertain about what the pattern should be at any given location, we will summarize our uncertainty using the (marginal) posterior distributions Formula , which we call a profile. When we have data from different groups (as in our lung cancer data), we learn a different profile for each group, Formula . As shown in Section 5, we analyze four different phenotypic groups of lung cancer (i.e. g=1:4). However, we will drop the g superscript for brevity.

Our task is related to learning profile HMMs for multiple sequence alignment (Durbin et al., 1998), but it is harder because the raw data is noisy and continuous-valued. Below, we describe four different approaches to the problem. The first is the method most widely used in current practice, and the remaining three are novel methods that we propose.

3.1 Alteration frequency (AF) model
In the simplest approach, AF, we first process each sample Formula into a discrete sequence Formula , where Formula , using a standard method for discretizing single-sample aCGH data. In this article, we use the robust HMM approach described in Shah et al. (2006). We chose the HMM-based single-sample method for this implementation of AF to allow a more direct algorithmic comparison to the multiple sample HMMs we describe below. Note that other algorithms could be used for this step. For example, Coe et al. (2006) used aCGH-smooth (Jong et al., 2004) to preprocess the lung cancer data presented in Section 5. After preprocessing, we compute the empirical distribution over each state in each location to yield the profile Formula , which can be represented as a KxT stochastic matrix, where K=3 is the number of states, T is the length of the sequence and each column sums to one. This can be further simplified to just compute the empirical probability of a recurrent CNA at each location, to yield a 1 x T vector. The disadvantage of this method is that the mapping from Ys to Zs is done on each sample separately, so information cannot be shared across samples. Thus the method may smooth over important signals, as we will see.

3.2 Factored likelihood HMM (FL-HMM)
The second model, which we call ‘factored likelihood HMM’ (FL-HMM), is a standard HMM model for M1:T (modeling the fact that CNAs tend to occur in runs), but where we modify the likelihood function to generate multiple samples instead of a single sample. Specifically, we assume the samples are conditionally independent given Mt and use a Gaussian observation model, yielding


Formula

The observation model is a product over the emission densities of the samples, hence the term ‘factored likelihood’. We have one mean and variance parameter for each of the three states of the HMM. The mean and variance are sample specific, to model the fact that different samples often have quite different noise characteristics, due to quality of hybridization, different ploidy, tumor/normal admixture coefficients, etc. Note that in a given chromosome some samples may not contain any CNAs, in which case the estimates of Formula and Formula may be poor if j is an aberrated state (L,G). Hence we share (pool) these parameters across chromosomes for statistical strength, as described in Shah et al. (2006). The variable Mt has Markovian dynamics with transition matrix AM, representing the probability of switching between the L/N/G states. The starting state distribution is denoted {pi}M. The model is shown as a directed graphical model in Figure 2a. Please see Bishop, (2006) for details on directed graphical models and how they relate to state transition diagrams for HMMs.


Figure 2
View larger version (23K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. The three models (a) FL-HMM, (b) BFL-HMM and (c) H-HMM shown as directed graphical models (Bayesian networks). Circles represent random variables and rounded squares represent parameters. We only show the models for 3 probes, but in reality, the number of random variables is proportional to the number of probes on the chromosome, Tc. Unknown quantities are unshaded and observed quantities are shaded. Figure 2 represents the observed log ratio of sample s in chromosome c at location t, Mc,t is the hidden master state and Figure 2 is the hidden ‘slave’ state. The shaded square nodes represent fixed hyper-parameters. Arrows between nodes indicate probabilistic dependencies. Boxes around variables are called ‘plates’ and represent repetition of the contents inside. Thus we see that the observation parameters µs and {sigma}s are shared (tied) across chromosomes (since they are outside the c plate) but are specific to each sample (since they are inside the s plate), while the HMM parameters AM, {pi}M are shared across chromosomes and samples. The differences between BFL-HMM and H-HMM are that Mct isin {L,G,N,U} in H-HMM whereas Mct isin {L,G,N} in BFL-HMM and FL-HMM. Also, H-HMM has Markovian dynamics on the Figure 2 process (see the horizontal links and the new AZ, {pi}Z parameters).

 
We add standard conjugate priors to all the parameters (Gelman et al., 2004). Specifically, for the multinomial distributions we use Dirichlet priors, AM ~ Dir ({delta}M) and Formula , where the matrix of pseudocounts {delta}M encourages self-transitions, and Formula encourages the neutral state. For the observation variance, we use Formula , where Formula is the precision, and Ga is a gamma density. We set the hyper-parameters in a data driven way as follows. We set Formula to encode the expected variance, where {sigma}s is the empirical variance of sample s, and Formula to reflect that this is a weak prior.

For the observation mean, we use a Gaussian prior, Formula . We set the hyper-parameters as follows: Formula , Formula , and Formula . We set the mean of state 2 to 0 based on the assumption the data has been normalized so that the neutral state usually corresponds to a log ratio of 0. We set the mean of the aberrated states to reflect the typical deviations expected in this sample. This allows the model to adapt to automatically different noise levels coming from different samples or even different platforms. We set the prior variance on the mean to Formula . This was chosen to reflect that our method for choosing Formula was appropriate in the majority of the data we have observed.

To maintain identifiability of the hidden states (i.e. to maintain state 1 means loss, 2 means neutral and 3 means gain), we use a truncated Gaussian on Formula , to ensure Formula . The truncation bounds are set in a similar way to the hyper-parameters.

Let Formula be all the parameters of the model. We can estimate the parameters of this model, Formula , using a Markov chain Monte Carlo (MCMC) algorithm called blocked Gibbs sampling (Scott, 2002). This entails alternating between sampling M1:T as a block using the forwards-filtering backwards-sampling (FFBS) algorithm, and sampling the parameters individually conditioned on M1:T and the data: see Algorithm 1 for details. Alternatively, we can compute a point estimate, Formula , using the EM (expectation-maximization) algorithm.
Algorithm 1. Blocked Gibbs sampling algorithm for H-HMM. We omit the {pi}M and {pi}Z terms for brevity. FFBS stands for forwards-filtering backwards sampling

Table 1

Parameter initialization is done by setting Formula , Formula using a heuristic method analogous to one iteration of K-means clustering. Using the prior means Formula as initial values for the centroids, the data in each sample are assigned to the nearest centroid based on the Gaussian probability density function. Based on these label assignments, Formula , Formula are inferred using maximum likelihood estimation. We initialize AM with 0.9 on the diagonals and 0.1 spread over the remaining entries. We initialize {pi}M to favor neutral states. We can then run EM/MCMC.

3.3 Buffered factored likelihood HMM (BFL-HMM)
The problem with the FL-HMM model is that Mt is summarizing the raw data Formula . If any single sample at a given position has a large deviation from neutral, the master is likely to think that location is aberrated (because the neutral state cannot generate large aberrations). Thus large but rare deviations will be added to the profile. [This problem was also noticed by Lipson et al. (2006).] A simple fix to this is to add a ‘buffer’ to each observation, Formula , which is responsible for generating the observation Formula . Now the master will summarize these discrete states rather than the raw data. A key point is that in contrast to the AF model, we estimate Z and M simultaneously. See Figure 2b.

In more detail, the BFL-HMM can be defined as follows. The ‘slave’ Formula processes are modeled as noisy versions of the master process: Formula , where


Formula

Here {varepsilon} is the probability that the slave copies the master state. If we set {varepsilon}=0, the slaves never copy the master, so the posterior profile will equal the prior profile, i.e. we will not have learned anything, since Mt will be disconnected from the data Formula . As we increase {varepsilon}, each slave is influenced by the master with increased strength. Thus more of the samples will get reflected in the profile. If we set {varepsilon}=1, we are requiring that the slaves perfectly copy the master. This reduces to the FL-HMM model. In practice, we find it best to set {varepsilon}~0.8. See Section 5 for further discussion on the effect of {varepsilon}. We can estimate the parameters in this model using MCMC or EM. We simply modify the algorithm to handle the fact that the observation model is now (a product of) a mixture of Gaussians, with mixing weights Formula . We omit details due to lack of space.

3.4 Hierarchical HMM (H-HMM)
The problem with the BFL-HMM is that the slaves have to copy the master with probability {varepsilon} at every location, even if this location is highly variable. We extend the model by adding an undefined (do not-care) state U to the master. Now if Mt=U, the slaves follow their own private Markovian dynamics, modeling local runs which are not shared, as shown in Figure 2c. If Mt != U, they copy the master with probability {varepsilon} as before:


Formula

The effect of this is that only highly conserved regions are stored in the profile; in the highly variable regions, the profile says ‘undefined’. This makes the profile sparser, and easier to interpret. This is shown in Section 5. The degree of sparsity is controlled by {varepsilon}: as we increase {varepsilon}, the sparsity decreases, since more of the slaves influence the master.

Estimating the parameters in this model is harder, since all the Zs chains become coupled due to explaining away [cf. factorial HMMs (Ghahramani and Jordan, 1997)]. However, conditioned on M1:T, the Formula are independent and can be sampled in parallel using FFBS, so blocked Gibbs sampling is still easy. See Algorithm 1 for details. An interesting feature of this model is that there are competing processes to explain the slaves. If the slave copies the master, its conditional probability distribution (CPD) is determined by A{varepsilon}, otherwise the CPD is determined by AZ. Since AZ is potentially estimated from a large subset of the data, it tends to converge to have diagonal values near 1. In contrast, A{varepsilon} is fixed and therefore can be overwhelmed by the slave process. To avoid this, we use a strong prior on AZ to discourage it from reaching near 1 on the diagonals, but still allowing it to be estimated from the data. This results in a ‘fairer’ competition between the AZ process and the A{varepsilon} process.

We initialize the parameters as in the FL-HMM model. To initialize the states, we first sample each Formula using FFBS, with the master process turned off. We then initialize Mt to be the consensus majority state across Formula . The EM algorithm for the H-HMM is similar to the MCMC algorithm except we maximize the parameters instead of sampling them. Preliminary comparisons indicate that EM tends to converge faster but gives poorer results, perhaps because it is more prone to getting stuck in local optima.

3.5 Running time
Parameter estimation in all four models takes O(T) time. This makes the technique scalable for use in high density oligonucleotide arrays or SNP arrays, frequently used for DNA copy number analysis, that may contain 500 000 or more probes per experiment. In practice, the running time depends on the number of EM/MCMC iterations. For EM on the H-HMM model, we find the system converges within about 10 steps and takes ~90 min to learn a model from 20 samples with 32 000 probes each. [All experiments were performed in Matlab 7.2.0.29 [EC] 4 (R2006a) on a Intel Xeon CPU ©2.4 GHz.] EM for the BFL-HMM and FL-HMM is much faster, since the E step can be performed exactly using the forwards–backwards algorithm, avoiding a Monte Carlo approximation.


    4 QUANTITATIVE RESULTS ON SYNTHETIC DATA
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 RELATED WORK
 3 METHODS
 4 QUANTITATIVE RESULTS ON...
 5 QUALITATIVE RESULTS ON...
 6 DISCUSSION AND FUTURE...
 ACKNOWLEDGEMENTS
 REFERENCES
 
Real data sets rarely have fully verified ground truth locations of recurrent CNAs. Thus, applying standard metrics to assess accuracy on real data is difficult. To overcome this, we created a synthetic data set derived from real data. We used eight mantle cell lymphoma samples originally published in de Leeuw et al. (2004) and used for a qualitative assessment in Rouveirol et al. (2006) and modified it to give us ground truth CNAs. We used the data for chromosome 20 (672 probes) which was reported to be relatively free of CNAs. We permuted the order of the data for each sample so as to remove any undetected shared signals that may be present across samples. We then inserted a recurrent CNA gain and a recurrent CNA loss at fixed positions of width w, in a fraction f of the samples. The clones within the region were shifted up/down (for gain/loss) by {sigma}s {tau} where s is one of the chosen samples, {sigma}s is the empirical variance of that sample and {tau} is the signal to noise ratio (SNR). Thus, {sigma}s {tau} preserves the sample-specific heterogeneity of the noise. In order to ‘soften’ the borders of the aberrations, we extended the borders by {gamma} probes, where {gamma} ~ Gam ({alpha},1) ({alpha} proportional to w—see text below). Here, {gamma} was sampled independently for each sample to ensure the exact borders of the aberrations were not shared. Finally, for each sample, we randomly sampled a location outside of the ground truth recurrent CNA and inserted a gain or loss (randomly chosen) of width w'. Figure 3a shows an example of the synthetic data for w=50, f=0.75, {tau}=0.9, w'=100 (~15% of the chromosome). The recurrent loss is at position 100-149 and the recurrent gain is at position 450-499. Comparing this figure to the real data in Figure 1, we see that the synthetic data is quite realistic and challenging.


Figure 3
View larger version (18K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. (a) Example of the simulated data for w = 50, {tau} = 0.9 and f = 0.75. Green lines (on the right) bound an inserted CNA gain, and red lines (on the left) bound an inserted CNA deletion. (b) ROC plot for the synthetic data for H-HMM (green stars), BFL-HMM (blue crosses), FL (red triangles) and AF (purple circles). TPR and FPR were calculated using results for all data, and therefore represent a summary of how the models compare over w, {tau}, f and {varepsilon}. AUC for each model is indicated in brackets in the legend. H-HMM had the best performance overall (AUC = 0.87), followed by BFL-HMM (AUC = 0.82), AF (AUC = 0.77) and FL (0.55). (c) Distributions of AUC for H-HMM, BFL-HMM, FL-HMM and AF over all values of w, {tau}, f and {varepsilon}. H-HMM and BFL-HMM had statistically significantly better performance than AF and FL-HMM (one way ANOVA (p <<0.01). Whiskers indicate standard error bars. The mean and standard error AUC for the models was 0.84±0.01 for H-HMM, 0.82±0.01 for BFL-HMM, 0.76±0.01 for AF and 0.59±0.01 for FL-HMM.

 
We evaluated AF, FL-HMM, BFL-HMM and H-HMM on synthetic data for w = (1,10,50), f = (1/2, 3/4, 1), {alpha} = (1,5,10) and {tau} = (0.3,0.6,0.9,1.2). For the BFL-HMM and the H-HMM, we set {varepsilon} = (0.8). Note for this large scale experiment we used (Monte Carlo) EM instead of MCMC for inference, to save time. However, preliminary results suggest that MCMC does work better, despite its increased cost.

We computed receiver operator characteristic (ROC) curves based on p(Mt = A) = p(Mt = L) + p(Mt = G) where p(Mt = A) is the probability that a recurrent CNA is predicted at position t. Using the ground truth labeling of the data, the false positive rate (FPR) is defined as Formula the number of probes incorrectly predicted as a CNA (FP) over the total number of non-CNA probes. The true positive rate is defined as Formula , the number of correctly predicted CNA probes (TP) over the true number of CNA probes. We plotted TPR versus FPR curves and calculated area under this curve (AUC) as a measure of accuracy to test the effect over w,f,{tau} and {varepsilon} across the various models.

Figure 3b shows a single summary ROC plot combining results for all values of w,f,{tau} and depicts the overall accuracy performance of the models. H-HMM had the highest accuracy (AUC = 0.87) followed by BFL-HMM (AUC = 0.82), AF (AUC = 0.77) and FL (0.55). Figure 3c shows the mean AUC over for every setting of w,f,{tau} (repeated three times). The mean and standard error AUC for the models was 0.84 ± 0.01 for H-HMM, 0.82 ± 0.01 for BFL-HMM, 0.76 ± 0.01 for AF and 0.59 ± 0.01 for FL-HMM. H-HMM and BFL-HMM were significantly more accurate than AF and FL-HMM (one way ANOVA, p <<0.01). Although H-HMM had slightly higher mean of AUC than BFL-HMM, the result was not statistically significant. However, we show in the next section on lung cancer data that in practice, the H-HMM is considerably more useful to the investigator as it returns sparser, yet accurate predictions.


    5 QUALITATIVE RESULTS ON LUNG CANCER DATA
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 RELATED WORK
 3 METHODS
 4 QUANTITATIVE RESULTS ON...
 5 QUALITATIVE RESULTS ON...
 6 DISCUSSION AND FUTURE...
 ACKNOWLEDGEMENTS
 REFERENCES
 
Ultimately, we are interested in applying a model to aCGH data from clinically relevant samples. To compare the output characteristics of the various models, we ran the algorithms on aCGH samples from 39 well-studied lung cancer cell lines, originally published in (Coe et al., 2006; Garnis et al., 2006). This data is particularly relevant since phenotype-specific patterns of recurrent CNAs have been experimentally validated. The samples can be subdivided into four groups: NSCLC adenocarcinoma (NA), NSCLC squamous cell carcinoma (NS), SCLC classical (SC) and SCLC variant (SV). Eighteen samples are NA, seven are NS, nine are SC and five are SV. This data has been rigorously studied and discordant shared patterns validated using PCR and gene expression have been identified across the major and minor groups (Coe et al., 2006; Garnis et al. 2006). We fit separate profiles Formula , one per group, using each of the four models and we qualitatively assess the characteristics and biological relevance of the output, using results reported in Coe et al. (2006) as a guide.

The experiments on synthetic data showed H-HMM and BFL-HMM are the best models. In this section, we show how the explicit modeling of the ambiguity in the data by H-HMM displays a clear advantage over the other models. Recall that in Figure 1 we showed parts of chromosomes 8, 9, and 1 to illustrate different types of recurrent CNAs at important locations. Figures 4 and 5 show the output of H-HMM ({varepsilon} = 0.8), BFL-HMM, FL-HMM and AF on the full chromosome 8 and the p-arm of chromosome 9. p(Mt = gain) is plotted in green and p(Mt = loss) is plotted in red. The clear trend is that H-HMM has sparser output and clearly predicts important regions in isolation. Note arrows at MYC and CA9 for comparison to Figure 1. A similar result was seen for chromosome 1 at the TPFRSF4 and TP73 loci (see Fig. 6 for results of H-HMM). Notice that BFL-HMM and FL-HMM also predict CNAs at these important genes. However it is quite evident that they both overpredict, making it hard for an investigator to discern biologically relevant CNAs from spurious predictions. From Figure 4, we also see that AF has a peak at the MYC locus, but is unable to detect the recurrent CNA at CA9 (Fig. 5) with high frequency. In all three generative models, the signal is clearly predicted.


Figure 4
View larger version (35K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4. Output from top to bottom of H-HMM, BFL-HMM, FL-HMM and AF for the NA group, chromosome 8. The x-axis is the chromosomal position and the y-axis is predicted probability. Red plots indicate p(Mt = L) and green plots indicate p(Mt = G). Note the sparse, yet accurate predictions for the H-HMM at the MYC locus (recall Fig. 1 c) and the p-arm loss prediction which recapitulates known results (Garnis et al., 2006). The other models either overpredict (BFL-HMM, FL-HMM) or underpredict (AF) the shared aberrations.

 

Figure 5
View larger version (26K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5. Output from top to bottom of H-HMM, BFL-HMM, FL-HMM and AF for the NA group for the p-arm of chromosome 9. (see Fig. 4 for axes description). Similar to Fig. 4, notice the sparse, yet accurate predictions for the H-HMM especially at the single probe CA9 locus (recall Fig. 1b). The AF method does not predict CA9. BFL-HMM and FL-HMM both predict CA9, however, they are over-predicting many other regions not likely to be shared CNAs.

 

Figure 6
View larger version (31K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 6. Output from H-HMM on chromosome 1 for different values of {varepsilon} for SC group. For {varepsilon}≥0.8, gain probability ‘peaks’ correspond to locations of several genes (annotated with arrow) implicated in lung or other cancers.

 
Considering Figure 4 in more detail, the p-arm (left) has a relatively high frequency of deletion and this is cleanly predicted by all models. In contrast, the centromeric half of the q-arm shows ambiguity in the AF plot. Both BFL-HMM and FL-HMM are unable to resolve the ambiguity as they are forced into a {L,N,G} state, while the H-HMM can ‘opt-out’ of making a consensus prediction at these locations, choosing only to predict a CNA when the data cleanly support one (e.g. MYC locus). This illustrates the sparsity of the H-HMM compared to the other models.

The combination of sparsity due to modeling ambiguity and the ability to tune {varepsilon} allows the user to effectively set the FPR of the H-HMM. An example of the value of this is shown in Figure 6, displaying the results for group SC for various values of {varepsilon}. The sparse output for {varepsilon}=0.8 reveals isolated peaks of high probability at locations of genes (TNFRSF4, TP73, TNFRSF9, ZNF151, E2F2, FGR, EIF3S2, DMAP1, FUBP1, RAB13, HDGF, PPCC, NTRK1, TRAF5), whose expression is known to be altered in lung or other cancers. For example, ZNF151 and E2F2 were found to have copy number induced gene expression changes in Coe et al. (2006). Interestingly, the H-HMM predicts the TP73 region as a narrow loss embedded within the gain region harboring TNFRSF4 shown in Figure 1c. TP73 was detected at only 22% frequency in AF and was not detected at all in BFL-HMM. Additional relatively narrow but high probability peaks correspond to the EIF3S2 locus, which mediates the TGF-ß pathway, FUBP1 a transcriptional activator of MYC and the coamplification of TNFRSF4 and TRAF5, which are known interactors and activators in the NF-{kappa}B pathway (Kawamata et al., 1998). These results are computational predictions, yet many provide compelling evidence that they merit experimental follow up.

To investigate whether H-HMM recapitulates the results in (Coe et al., 2006), we examined a subset of genes reported to be differentially disrupted in the two major groups, NSCLC and SCLC. These 22 genes are involved in key lung cancer pathways and therefore represent a highly relevant set of markers as a reference to assess our output. The H-HMM predicted shared aberrations in regions harboring 14 of the 22 genes in at least 1 of the subgroups of NSCLC and SCLC. We counted a prediction if p(Mt=L)>0.5, or p(Mt=G)>0.5 for losses and gains, respectively. The predicted genes included STMN1, E2F2, SC, ZNF151, ID2, MAPK9, EGFR, CDK2NA, KNTC1, HMGB1, HSPH1, JJAZ1, NLK, JUNB, TIAM1, DSCAM. Five of the regions were detected at {varepsilon}≤0.7, eleven at {varepsilon}≤0.9 and the remaining regions at {varepsilon}=0.95. This gives us a reasonable estimate for how to calibrate {varepsilon} in order to predict relevant CNAs. The H-HMM did not predict recurrent CNAs harboring the remaining genes PRDM2, SOX11, MAP3K4, ING1, SMAD4, CCDC5, TCF4.

We assessed if the H-HMM could determine differences in the profiles of the phenotypic groups (NA, NS, SC, SV), as this was part of the focus of the study of Coe et al. (2006) and Garnis et al. (2006). Figure 7 shows that the H-HMM produces very different profiles for chromosomes 9 across the different subgroups. This example chromosome was chosen as it was previously shown to have different patterns of CNA (Coe et al., 2006; Garnis et al., 2006). Although anecdotal, our qualitative results give us confidence that the H-HMM is predicting biologically relevant recurrent CNAs. Combined with the result that the H-HMM is sparser in its output, we believe the H-HMM has the right characteristics of presenting biologically meaningful results to the investigator while maintaining a low FPR.


Figure 7
View larger version (30K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 7. H-HMM output for chromosome 9 showing discordant patterns among the lung cancer groups (NA, NS, SC, SV).

 

    6 DISCUSSION AND FUTURE WORK
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 RELATED WORK
 3 METHODS
 4 QUANTITATIVE RESULTS ON...
 5 QUALITATIVE RESULTS ON...
 6 DISCUSSION AND FUTURE...
 ACKNOWLEDGEMENTS
 REFERENCES
 
We developed three novel methods that extend the single sample HMM for aCGH to the multiple sample case in order to infer recurrent CNAs. Our results indicate that the H-HMM, which simultaneously infers discrete labels for the samples and promotes sparsity by modeling ambiguity in the data is quantitatively and qualitatively better than simpler models and standard methods. In an informal qualitative assessment, we showed that the H-HMM produces meaningful biological output when compared to a list of experimentally validated genes. The H-HMM was able to detect previously reported discordant patterns among the lung cancer groups—a key requirement to determine phenotype specific CNA patterns.

6.1 Subgroup discovery
A natural extension to the H-HMM model is to consider the case where the samples in the data come from two or more groups. This suggests unsupervised clustering of the data, a problem recently examined by Liu et al. (2006), who compared distance metrics in a hierarchical clustering framework. We will investigate this problem by considering a mixture of master sequences that determine subgroups and simultaneously infer recurrent CNAs specific to each subgroup.

6.2 Copy number variations
A fraction of the recurring CNAs identified by our algorithm can be attributed to segmental copy number variations (CNV) in the human population (Redon et al. 2006). These variations are not the consequence of somatic alteration in tumor DNA but are naturally occurring copy number states. This suggests they should be filtered out of the output, but it is important to consider the potential contribution of CNVs to disease susceptibility, as such segmental variations overlap with genes associated with phenotypes in humans (Wong et al., 2007). The frequent occurrence of a given CNV in a cohort of cancer patients relative to healthy individuals would raise the possibility of an association between the copy number state and cancer susceptibility. The H-HMM will facilitate the detection of such occurrences, and the identification of susceptibility genes through CNV status will become feasible as databases of aCGH profiles of tumors expand.

6.3 Epigenomic arrays
In addition to gene dosage alterations in CNAs, epigenetic alterations such as changes in DNA methylation status of gene may result in aberrant silencing or inappropriate activation of genes in cancer. Recent development of microarray-based methods for whole genome analysis of methylation status (or methylome profiling) has enabled a new approach to cancer gene discovery (Weber et al., 2005). Future adaptation of the statistical strategy described in this article will expedite the detection of recurring epigenetic alterations, and facilitate the integration of genetic and epigenetic data.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 RELATED WORK
 3 METHODS
 4 QUANTITATIVE RESULTS ON...
 5 QUALITATIVE RESULTS ON...
 6 DISCUSSION AND FUTURE...
 ACKNOWLEDGEMENTS
 REFERENCES
 
This work is supported by Genome BC/Genome Canada. S.P.S. is supported by the Michael Smith Foundation for Health Research. We thank Bradley Coe (BC Cancer Research Centre) for a critical review of this manuscript.

Conflict of Interest: none declared.


    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 RELATED WORK
 3 METHODS
 4 QUANTITATIVE RESULTS ON...
 5 QUALITATIVE RESULTS ON...
 6 DISCUSSION AND FUTURE...
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Alarcon-Vargas D, et al. p73 transcriptional activity increases upon cooperation between its spliced forms. Oncogene (2000) 19:831–835.[CrossRef][Web of Science][Medline]

    Baldwin C, et al. Multiple microalterations detected at high frequency in oral cancer. Cancer Res (2005) 65:7561–7567.[Abstract/Free Full Text]

    Bishop CM. Pattern Recognition and Machine Learning (2006) Springer.

    Coe BP, et al. Differential disruption of cell cycle pathways in small cell and non-small cell lung cancer. Br. J. Cancer (2006) 94:1927–1935.[CrossRef][Web of Science][Medline]

    de Leeuw RJ, et al. Comprehensive whole genome array CGH profiling of mantle cell lymphoma model genomes. Hum. Mol. Genet (2004) 13:1827–1837.[Abstract/Free Full Text]

    Diskin SJ, et al. STAC: A method for testing the significance of DNA copy number aberrations across multiple array-CGH experiments. Genome Res (2006) 16:1149–1158.[Abstract/Free Full Text]

    Durbin R, et al. Biological Sequence Analysis. Probabilistic Models of Proteins and Nucleic Acids (1998) Cambridge University Press.

    Fridlyand J, et al. Hidden Markov models approach to the analysis of array CGH data. J. Multivariate Stat (2004) 90:132–153.[CrossRef]

    Garnis C, et al. High resolution analysis of non-small cell lung cancer cell lines by whole genome tiling path array CGH. Int. J. Cancer (2006) 118:1556–1564.[CrossRef][Web of Science][Medline]

    Gelman A, et al. Bayesian Data Analysis (2004) 2nd edition. Chapman and Hall.

    Ghahramani Z, et al. Factorial hidden Markov models. Mach. Learn (1997) 29:245–273.[CrossRef]

    Ishkanian A, et al. A tiling resolution DNA microarray with complete coverage of the human genome. Nat. Genet (2004) 36:299–303.[CrossRef][Web of Science][Medline]

    Jong K, et al. Breakpoint identification and smoothing of array comparative genomic hybridization data. Bioinformatics (2004) 20:3636–3637.[Abstract/Free Full Text]

    Kawamata S, et al. Activation of OX40 signal transduction pathways leads to tumor necrosis factor receptor-associated factor (TRAF) 2- and TRAF5-mediated NF-kappaB activation. J. Biol. Chem (1998) 273:5808–5814.[Abstract/Free Full Text]

    Kim SJ, et al. Carbonic anhydrase IX in early-stage non-small cell lung cancer. Clin. Cancer Res (2004) 10:7925–7933.[Abstract/Free Full Text]

    Lipson D, et al. Efficient calculation of interval scores for DNA copy number data analysis. J. Comput. Biol (2006) 13:215–228.[CrossRef][Web of Science][Medline]

    Liu J, et al. Distance-based clustering of CGH data. Bioinformatics (2006) 22:1971–1978.[Abstract/Free Full Text]

    Pinkel D, et al. Array comparative genomic hybridization and its applications in cancer. Nat. Genet (2005) 37(Suppl.):11–17.[CrossRef][Web of Science][Medline]

    Pollack JR, et al. Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proc. Natl Acad. Sci. USA (2002) 99:12963–12968.[Abstract/Free Full Text]

    Redon R, et al. Global variation in copy number in the human genome. Nature (2006) 444:444–454.[CrossRef][Medline]

    Rouveirol C, et al. Computation of recurrent minimal genomic alterations from array-CGH data. Bioinformatics (2006) 22:849–856.[Abstract/Free Full Text]

    Scott S. Bayesian methods for hidden Markov models: recursive computing in the 21st century. J. Am. Stat. Assoc (2002).

    Shah SP, et al. Integrating copy number polymorphisms into array CGH analysis using a robust HMM. Bioinformatics (2006) 22:431–439.[CrossRef]

    Swinson DE, et al. Carbonic anhydrase IX expression, a novel surrogate marker of tumor hypoxia, is associated with a poor prognosis in non-small-cell lung cancer. J. Clin. Oncol (2003) 21:473–482.[Abstract/Free Full Text]

    Veltman JA, et al. Diagnostic genome profiling: unbiased whole genome or targeted analysis? J. Mol. Diagn (2006) 8:534–537.[Free Full Text]

    Weber M, et al. Chromosome-wide and promoter-specific analyses identify sites of differential DNA methylation in normal and transformed human cells. Nat. Genet (2005) 37:853–862.[CrossRef][Web of Science][Medline]

    Wong KK, et al. A comprehensive analysis of common copy-number variations in the human genome. Am. J. Hum. Genet (2007) 80:91–104.[CrossRef][Web of Science][Medline]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
O. M. Rueda and R. Diaz-Uriarte
RJaCGH: Bayesian analysis of aCGH arrays for detecting copy number changes and recurrent regions
Bioinformatics, August 1, 2009; 25(15): 1959 - 1960.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
H. Choi, A. I. Nesvizhskii, D. Ghosh, and Z. S. Qin
Hierarchical hidden Markov model with application to joint analysis of ChIP-chip and ChIP-seq data
Bioinformatics, July 15, 2009; 25(14): 1715 - 1721.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
Z. Barutcuoglu, E. M. Airoldi, V. Dumeaux, R. E. Schapire, and O. G. Troyanskaya
Aneuploidy prediction and tumor classification with heterogeneous hidden conditional random fields
Bioinformatics, May 15, 2009; 25(10): 1307 - 1313.
[Abstract] [Full Text] [PDF]


Home page
BloodHome page
K.-J. J. Cheung, S. P. Shah, C. Steidl, N. Johnson, T. Relander, A. Telenius, B. Lai, K. P. Murphy, W. Lam, A. J. Al-Tourah, et al.
Genome-wide profiling of follicular lymphoma by array comparative genomic hybridization reveals prognostically significant DNA copy number imbalances
Blood, January 1, 2009; 113(1): 137 - 148.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Google Scholar
Right arrow Articles by Shah, S. P.
Right arrow Articles by Murphy, K. P.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Shah, S. P.
Right arrow Articles by Murphy, K. P.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?