Bioinformatics Advance Access originally published online on March 13, 2006
Bioinformatics 2006 22(9):1144-1146; doi:10.1093/bioinformatics/btl089
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data
1 Hutchison-MRC Research Centre, Department of Oncology, Computational Biology Group, University of Cambridge Hills Road, Cambridge
2 Department of Applied Mathematics and Theoretical Physics,University of Cambridge Wilberforce Road, Cambridge
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Summary: We have developed a new method (BioHMM) for segmenting array comparative genomic hybridization data into states with the same underlying copy number. By utilizing a heterogeneous hidden Markov model, BioHMM incorporates relevant biological factors (e.g. the distance between adjacent clones) in the segmentation process.
Availability: BioHMM is available as part of the R library snapCGH which can be downloaded from http://www.bioconductor.org/packages/bioc/1.8/html/snapCGH.html
Contact: J.Marioni{at}damtp.cam.ac.uk
Supplementary information: Supplementary information is available at http://www.damtp.cam.ac.uk/user/jcm68/BioHMM.html
| INTRODUCTION |
|---|
|
|
|---|
Array Comparative Genomic Hybridization (aCGH) is an experimental technique used to detect DNA copy number changes across the genome. Test and reference samples are dyed with Cy3 and Cy5 respectively and hybridized to an array spotted with clones containing small regions of normal genomic DNA. The array is then scanned and image analysis software is used to measure the signal emitted from each clone. After preprocessing and normalization, the log2 ratio of the signal in the test and reference channels is determined for each clone. Subsequently, the aim is to segment these ratios into states, where all of the clones in a state have the same underlying copy number. Finally, a post-processing step is necessary to assign biological meaning to the different states.
From a statistical perspective, segmentation has received most attention and many different schemes have been proposed. These include calling clones as gained or lost if the ratio is above or below a set threshold, genetic local search algorithms (Jong et al., 2004), circular binary segmentation (Olshen et al., 2004), adaptive weights smoothing (Hupé et al., 2004) and hidden Markov models (HMMs) (Fridlyand et al., 2004). Two recent studies comparing the performance of segmentation schemes (Lai et al., 2005, Willenbrock and Fridlyand, 2005) have identified problems with existing methods.
Despite the large number of methods, models have yet to be developed that take account of additional biological covariates such as the distance between adjacent clones or clone quality that are likely to affect the segmentation. In this article we describe a new segmentation scheme, BioHMM, which extends the HMM approach of Fridlyand et al. (2004) to take account of such information.
BioHMM is an integral part of the R library, snapCGH (Segmentation, Normalisation And Processing of aCGH data), which is a developer package obtainable from the BioConductor website (see Abstract for the address). This library also lets the user apply other segmentation schemes using common input and output data objects. In addition, snapCGH works seamlessly with limma objects (Smyth, 2005), enabling the use of pre-processing (and other) functions therein.
| APPROACH |
|---|
|
|
|---|
A hidden Markov model (HMM) is used to partition observations y1, y2, ... , yn made at locations t1, t2, ... , tn into K (hidden) states, where K < < n. In the context of aCGH data, the observations are (normalized) log2 ratios associated with clones arranged along a chromosome. For each clone t1, t2, ... , tn we know the start and end positions measured in base pairs and we assume that they are ordered along each chromosome by their start position. Subsequently, using the notation of Fridlyand et al. (2004) and Rabiner (1989), we can characterize a discrete (homogeneous) HMM where the observations are continuous as follows:
- The number of states, K, in the model. The states are denoted S1, ... , SK and the chain is assumed to be irreducible; qi denotes the actual state at position i (1
i
n).
- The initial state distribution,
, where
k = P(q1 = Sk).
- The transition matrix, A, giving the probability of moving from one state to another, where
for 1
i
n 1 and 1
l, m
K.
- The distribution, bk, of log2 ratios for clones in state Sk is assumed to be Gaussian with unknown mean and variance, i.e.
where we allow the possibility that the variance terms are equal.
Given the number of states, we can write the parameters to be estimated in the vector
= (A, B,
) where A, B and
denote the parameters in the transition matrix and the emission and initial distributions respectively. The optimal value of
can be found by maximizing the likelihood, L(
|Y), where Y is a vector containing the observed log2 ratios. The likelihood is calculated by computing the forwardbackward equations and the optimal values of the parameters are estimated using the EM algorithm, a Bayesian framework or a suitable numerical optimizer. Models with different numbers of states are (in general) distinguished using a criterion based upon the likelihood and the number of parameters, such as the AIC (Akaike, 1974). Finally, the Viterbi algorithm (Viterbi, 1967), or a similar method, is used to allocate clones to particular states.
We now describe a heterogeneous HMM, BioHMM, where the transition probabilities depend upon the distance, defined as the difference in midpoints, between adjacent clones. These distances are contained in a vector, x, which is of length n 1 since there are only n 1 transitions; information about the state to which the first clone is allocated is captured in the initial distribution. After constructing x, the information it contains can be incorporated into the transition probabilities of an HMM by considering a modified transition matrix, Ai, defined for 1
i
n 1 as
![]() |
for 1
i
n 1 and r
. We can rewrite this as
![]() |
![]() |
We include the parameter r in the model because we expect there to be non-linear differences in the probabilistic structure of the transition matrix depending upon whether adjacent clones are very close or very far apart. We note that p1 and p2 have to be estimated subject to the constraint 0
p1, p2
1.
To estimate the value of these parameters (and the other parameters contained in
) we used a numerical method to optimize the likelihood. To find initial parameter estimates, we used the methods described in Fridlyand et al. (2004) where applicable (for more information see Section 1.1.2 of the Supplementary Material); the initial estimate of r was set to be 1.
BioHMM is also capable of taking account of other biological information when segmenting the log2 ratios. This might include the length of each clone or a quality measure associated with each clone (perhaps obtained from the image analysis software). Some discussion of how we have incorporated additional covariates is given in Section 1.1.1 of the Supplementary Material.
We have only described the structure of the model when there are two underlying states. For details of the model when there are more states and an example of the application of BioHMM see the Supplementary Material.
| DISCUSSION |
|---|
|
|
|---|
In order to check that BioHMM yielded a sensible segmentation of real data, we applied it to the Coriell cell lines (Snijders et al., 2001). This is a well-known cell line for which independently verified karyotype information is available. BioHMM correctly identified all of the changes which have been previously catalogued. For more information see Section 3.1 of the Supplementary Material, where we also provide an example of how transition probabilities vary depending upon the distance between the clones.
To further investigate the efficacy of BioHMM, and in particular to assess its performance relative to other segmentation schemes, a large amount of analysis will need to be carried out; this is beyond the scope of this application note.
One of the aims of both ourselves and Fridlyand et al. (2004) is to extend the HMM-based approach so that the whole genome can be segmented at once. Theoretically this is straightforward, but choosing suitable initial parameter estimates is difficult, and at present it is unclear how this should be done.
In conclusion, the approach adopted by BioHMM has a number of advantages over current segmentation schemes. Most importantly, by taking account of the distance between adjacent clones, BioHMM better models chromosomes where some regions are densely covered and others are covered at lower resolutions. In addition, BioHMM can be extended to take account of other biological covariates. We believe that these features, and the ease with which the model can be implemented, will ensure that BioHMM plays a useful role in the analysis of aCGH data.
| Acknowledgments |
|---|
The authors thank Jane Fridlyand, Nigel Carter and Heike Fiegler for useful discussions on segmenting aCGH data, Michael Smith for helpful suggestions on the programming of snapCGH, and Jessica Pole and Paul Edwards for allowing us to use unpublished data. JCM, NPT and ST are supported by Cancer Research UK. ST is a Royal Society/Wolfson Research Merit Award holder. Funding to pay the Open Access publication charges for this article was provided by CRUK.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Christos Ouzounis
Received on November 24, 2005; revised on February 7, 2006; accepted on March 7, 2006
| REFERENCES |
|---|
|
|
|---|
Akaike, H. (1974) A new look at statistical model identification. IEEE Trans. Autom. Control, 716722.
Fridlyand, J., et al. (2004) Hidden Markov models approach to the analysis of array CGH data. J. Multivar. Anal, . 90, 132153[CrossRef].
Hupé, P., et al. (2004) Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics, 20, 34133422
Jong, K., et al. (2004) Breakpoint identification and smoothing of array comparative genomic hybridisation data. Bioinformatics, 20, 36363637
Lai, W.R., et al. (2005) Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics, 21, 37633770
Olshen, A.B., et al. (2004) Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics, 5, 557572[Abstract].
Rabiner, L.R. (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 77, 257286[CrossRef].
Smyth, G.K. (2005) Limma: linear models for microarray data. In Gentleman, R., Carey, V., Dudoit, S., Irizarry, R., Huber, W. (Eds.). Bioinformatics and Computational Biology Solutions using R and Bioconductor, , New York Springer, pp. 397420.
Snijders, A.M., et al. (2001) Assembly of microarrays for genome-wide measurement of DNA copy number. Nat. Genet, . 29, 42814286.
Viterbi, A.J. (1967) Error bounds for convolution codes and an asymptotically optimal decoding algorithm. IEEE Trans. Inform. Theory, 13, 260269[CrossRef].
Willenbrock, H. and Fridlyand, J. (2005) A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics, 21, 40844091
This article has been cited by other articles:
![]() |
H.-I H. Chen, F.-H. Hsu, Y. Jiang, M.-H. Tsai, P.-C. Yang, P. S. Meltzer, E. Y. Chuang, and Y. Chen A probe-density-based analysis method for array CGH data: simulation, normalization and centralization Bioinformatics, August 15, 2008; 24(16): 1749 - 1756. [Abstract] [PDF] |
||||
![]() |
K. Wang and M. Bucan Copy Number Variation Detection via High-Density SNP Genotyping CSH Protocols, June 1, 2008; 2008(7): pdb.top46 - pdb.top46. [Abstract] [Full Text] |
||||
![]() |
P. Cahan, L. E. Godfrey, P. S. Eis, T. A. Richmond, R. R. Selzer, M. Brent, H. L. McLeod, T. J. Ley, and T. A. Graubert wuHMM: a robust algorithm to detect DNA copy number variation using long oligonucleotide microarray data Nucleic Acids Res., April 1, 2008; 36(7): e41 - e41. [Abstract] [Full Text] [PDF] |
||||
![]() |
R. Andersson, C. E. G. Bruder, A. Piotrowski, U. Menzel, H. Nord, J. Sandgren, T. R. Hvidsten, T. Diaz de Stahl, J. P. Dumanski, and J. Komorowski A segmental maximum a posteriori approach to genome-wide copy number profiling Bioinformatics, March 15, 2008; 24(6): 751 - 758. [Abstract] [Full Text] [PDF] |
||||
![]() |
R. Pique-Regi, J. Monso-Varona, A. Ortega, R. C. Seeger, T. J. Triche, and S. Asgharzadeh Sparse representation and Bayesian detection of genome copy number alterations from microarray data Bioinformatics, February 1, 2008; 24(3): 309 - 318. [Abstract] [Full Text] [PDF] |
||||
![]() |
W. Raffelsberger, Y. Krause, L. Moulinier, D. Kieffer, A.-L. Morand, L. Brino, and O. Poch RReportGenerator: automatic reports from routine statistical analysis using R Bioinformatics, January 15, 2008; 24(2): 276 - 278. [Abstract] [Full Text] [PDF] |
||||
![]() |
K. Wang, M. Li, D. Hadley, R. Liu, J. Glessner, S. F.A. Grant, H. Hakonarson, and M. Bucan PennCNV: An integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data Genome Res., November 1, 2007; 17(11): 1665 - 1674. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. B. Gerstein, C. Bruce, J. S. Rozowsky, D. Zheng, J. Du, J. O. Korbel, O. Emanuelsson, Z. D. Zhang, S. Weissman, and M. Snyder What is a gene, post-ENCODE? History and updated definition Genome Res., June 1, 2007; 17(6): 669 - 681. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. Stjernqvist, T. Ryden, M. Skold, and J. Staaf Continuous-index hidden Markov modelling of array CGH copy number data Bioinformatics, April 15, 2007; 23(8): 1006 - 1014. [Abstract] [Full Text] [PDF] |
||||
![]() |
P. J. Park Integrated Analysis of Array-CGH and Expression Data Am. Assoc. Cancer Res. Educ. Book, April 14, 2007; 2007(1): 209 - 212. [Full Text] [PDF] |
||||
![]() |
S. Colella, C. Yau, J. M. Taylor, G. Mirza, H. Butler, P. Clouston, A. S. Bassett, A. Seller, C. C. Holmes, and J. Ragoussis QuantiSNP: an Objective Bayes Hidden-Markov Model to detect and accurately map copy number variation using SNP genotyping data Nucleic Acids Res., March 27, 2007; (2007) gkm076v3. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. Du, J. S. Rozowsky, J. O. Korbel, Z. D. Zhang, T. E. Royce, M. H. Schultz, M. Snyder, and M. Gerstein A supervised hidden markov model framework for efficiently segmenting tiling array data in transcriptional and chIP-chip experiments: systematically incorporating validated biological knowledge Bioinformatics, December 15, 2006; 22(24): 3016 - 3024. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||







