Skip Navigation


Bioinformatics Advance Access originally published online on March 13, 2006
Bioinformatics 2006 22(9):1144-1146; doi:10.1093/bioinformatics/btl089
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrowOA All Versions of this Article:
22/9/1144    most recent
btl089v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (15)
Google Scholar
Right arrow Articles by Marioni, J. C.
Right arrow Articles by Tavaré, S.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Marioni, J. C.
Right arrow Articles by Tavaré, S.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions@oxfordjournals.org

BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data

J. C. Marioni 1,2,*, N. P. Thorne 1,2 and S. Tavaré 1,2

1 Hutchison-MRC Research Centre, Department of Oncology, Computational Biology Group, University of Cambridge Hills Road, Cambridge
2 Department of Applied Mathematics and Theoretical Physics,University of Cambridge Wilberforce Road, Cambridge

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 INTRODUCTION
 APPROACH
 DISCUSSION
 REFERENCES
 

Summary: We have developed a new method (BioHMM) for segmenting array comparative genomic hybridization data into states with the same underlying copy number. By utilizing a heterogeneous hidden Markov model, BioHMM incorporates relevant biological factors (e.g. the distance between adjacent clones) in the segmentation process.

Availability: BioHMM is available as part of the R library snapCGH which can be downloaded from http://www.bioconductor.org/packages/bioc/1.8/html/snapCGH.html

Contact: J.Marioni{at}damtp.cam.ac.uk

Supplementary information: Supplementary information is available at http://www.damtp.cam.ac.uk/user/jcm68/BioHMM.html


    INTRODUCTION
 TOP
 ABSTRACT
 INTRODUCTION
 APPROACH
 DISCUSSION
 REFERENCES
 
Array Comparative Genomic Hybridization (aCGH) is an experimental technique used to detect DNA copy number changes across the genome. Test and reference samples are dyed with Cy3 and Cy5 respectively and hybridized to an array spotted with clones containing small regions of normal genomic DNA. The array is then scanned and image analysis software is used to measure the signal emitted from each clone. After preprocessing and normalization, the log2 ratio of the signal in the test and reference channels is determined for each clone. Subsequently, the aim is to segment these ratios into states, where all of the clones in a state have the same underlying copy number. Finally, a post-processing step is necessary to assign biological meaning to the different states.

From a statistical perspective, segmentation has received most attention and many different schemes have been proposed. These include calling clones as gained or lost if the ratio is above or below a set threshold, genetic local search algorithms (Jong et al., 2004), circular binary segmentation (Olshen et al., 2004), adaptive weights smoothing (Hupé et al., 2004) and hidden Markov models (HMMs) (Fridlyand et al., 2004). Two recent studies comparing the performance of segmentation schemes (Lai et al., 2005, Willenbrock and Fridlyand, 2005) have identified problems with existing methods.

Despite the large number of methods, models have yet to be developed that take account of additional biological covariates such as the distance between adjacent clones or clone quality that are likely to affect the segmentation. In this article we describe a new segmentation scheme, BioHMM, which extends the HMM approach of Fridlyand et al. (2004) to take account of such information.

BioHMM is an integral part of the R library, snapCGH (Segmentation, Normalisation And Processing of aCGH data), which is a developer package obtainable from the BioConductor website (see Abstract for the address). This library also lets the user apply other segmentation schemes using common input and output data objects. In addition, snapCGH works seamlessly with limma objects (Smyth, 2005), enabling the use of pre-processing (and other) functions therein.


    APPROACH
 TOP
 ABSTRACT
 INTRODUCTION
 APPROACH
 DISCUSSION
 REFERENCES
 
A hidden Markov model (HMM) is used to partition observations y1, y2, ... , yn made at ‘locations’ t1, t2, ... , tn into K (hidden) states, where K < < n. In the context of aCGH data, the observations are (normalized) log2 ratios associated with clones arranged along a chromosome. For each clone t1, t2, ... , tn we know the start and end positions measured in base pairs and we assume that they are ordered along each chromosome by their start position. Subsequently, using the notation of Fridlyand et al. (2004) and Rabiner (1989), we can characterize a discrete (homogeneous) HMM where the observations are continuous as follows:

  1. The number of states, K, in the model. The states are denoted S1, ... , SK and the chain is assumed to be irreducible; qi denotes the actual state at position i (1 ≤ i ≤ n).
  2. The initial state distribution, {pi}, where {pi}k = P(q1 = Sk).
  3. The transition matrix, A, giving the probability of moving from one state to another, where

    Formula
    for 1 ≤ i ≤ n – 1 and 1 ≤ l, m ≤ K.

  4. The distribution, bk, of log2 ratios for clones in state Sk is assumed to be Gaussian with unknown mean and variance, i.e.

    Formula
    where we allow the possibility that the variance terms are equal.

Given the number of states, we can write the parameters to be estimated in the vector {lambda} = (A, B, {pi}) where A, B and {pi} denote the parameters in the transition matrix and the emission and initial distributions respectively. The optimal value of {lambda} can be found by maximizing the likelihood, L({lambda}|Y), where Y is a vector containing the observed log2 ratios. The likelihood is calculated by computing the forward–backward equations and the optimal values of the parameters are estimated using the EM algorithm, a Bayesian framework or a suitable numerical optimizer. Models with different numbers of states are (in general) distinguished using a criterion based upon the likelihood and the number of parameters, such as the AIC (Akaike, 1974). Finally, the Viterbi algorithm (Viterbi, 1967), or a similar method, is used to allocate clones to particular states.

We now describe a heterogeneous HMM, BioHMM, where the transition probabilities depend upon the distance, defined as the difference in midpoints, between adjacent clones. These distances are contained in a vector, x, which is of length n – 1 since there are only n – 1 transitions; information about the state to which the first clone is allocated is captured in the initial distribution. After constructing x, the information it contains can be incorporated into the transition probabilities of an HMM by considering a modified transition matrix, Ai, defined for 1 ≤ i ≤ n – 1 as

Formula
where Formula for 1 ≤ i ≤ n – 1 and r isin R. We can rewrite this as

Formula
where

Formula

We include the parameter r in the model because we expect there to be non-linear differences in the probabilistic structure of the transition matrix depending upon whether adjacent clones are very close or very far apart. We note that p1 and p2 have to be estimated subject to the constraint 0 ≤ p1, p2 ≤ 1.

To estimate the value of these parameters (and the other parameters contained in {lambda}) we used a numerical method to optimize the likelihood. To find initial parameter estimates, we used the methods described in Fridlyand et al. (2004) where applicable (for more information see Section 1.1.2 of the Supplementary Material); the initial estimate of r was set to be 1.

BioHMM is also capable of taking account of other biological information when segmenting the log2 ratios. This might include the length of each clone or a quality measure associated with each clone (perhaps obtained from the image analysis software). Some discussion of how we have incorporated additional covariates is given in Section 1.1.1 of the Supplementary Material.

We have only described the structure of the model when there are two underlying states. For details of the model when there are more states and an example of the application of BioHMM see the Supplementary Material.


    DISCUSSION
 TOP
 ABSTRACT
 INTRODUCTION
 APPROACH
 DISCUSSION
 REFERENCES
 
In order to check that BioHMM yielded a sensible segmentation of real data, we applied it to the Coriell cell lines (Snijders et al., 2001). This is a well-known cell line for which independently verified karyotype information is available. BioHMM correctly identified all of the changes which have been previously catalogued. For more information see Section 3.1 of the Supplementary Material, where we also provide an example of how transition probabilities vary depending upon the distance between the clones.

To further investigate the efficacy of BioHMM, and in particular to assess its performance relative to other segmentation schemes, a large amount of analysis will need to be carried out; this is beyond the scope of this application note.

One of the aims of both ourselves and Fridlyand et al. (2004) is to extend the HMM-based approach so that the whole genome can be segmented at once. Theoretically this is straightforward, but choosing suitable initial parameter estimates is difficult, and at present it is unclear how this should be done.

In conclusion, the approach adopted by BioHMM has a number of advantages over current segmentation schemes. Most importantly, by taking account of the distance between adjacent clones, BioHMM better models chromosomes where some regions are densely covered and others are covered at lower resolutions. In addition, BioHMM can be extended to take account of other biological covariates. We believe that these features, and the ease with which the model can be implemented, will ensure that BioHMM plays a useful role in the analysis of aCGH data.


    Acknowledgments
 
The authors thank Jane Fridlyand, Nigel Carter and Heike Fiegler for useful discussions on segmenting aCGH data, Michael Smith for helpful suggestions on the programming of snapCGH, and Jessica Pole and Paul Edwards for allowing us to use unpublished data. JCM, NPT and ST are supported by Cancer Research UK. ST is a Royal Society/Wolfson Research Merit Award holder. Funding to pay the Open Access publication charges for this article was provided by CRUK.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Christos Ouzounis

Received on November 24, 2005; revised on February 7, 2006; accepted on March 7, 2006

    REFERENCES
 TOP
 ABSTRACT
 INTRODUCTION
 APPROACH
 DISCUSSION
 REFERENCES
 

    Akaike, H. (1974) A new look at statistical model identification. IEEE Trans. Autom. Control, 716–722.

    Fridlyand, J., et al. (2004) Hidden Markov models approach to the analysis of array CGH data. J. Multivar. Anal, . 90, 132–153[CrossRef].

    Hupé, P., et al. (2004) Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics, 20, 3413–3422[Abstract/Free Full Text].

    Jong, K., et al. (2004) Breakpoint identification and smoothing of array comparative genomic hybridisation data. Bioinformatics, 20, 3636–3637[Abstract/Free Full Text].

    Lai, W.R., et al. (2005) Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics, 21, 3763–3770[Abstract/Free Full Text].

    Olshen, A.B., et al. (2004) Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics, 5, 557–572[Abstract].

    Rabiner, L.R. (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 77, 257–286[CrossRef].

    Smyth, G.K. (2005) Limma: linear models for microarray data. In Gentleman, R., Carey, V., Dudoit, S., Irizarry, R., Huber, W. (Eds.). Bioinformatics and Computational Biology Solutions using R and Bioconductor, , New York Springer, pp. 397–420.

    Snijders, A.M., et al. (2001) Assembly of microarrays for genome-wide measurement of DNA copy number. Nat. Genet, . 29, 4281–4286.

    Viterbi, A.J. (1967) Error bounds for convolution codes and an asymptotically optimal decoding algorithm. IEEE Trans. Inform. Theory, 13, 260–269[CrossRef].

    Willenbrock, H. and Fridlyand, J. (2005) A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics, 21, 4084–4091[Abstract/Free Full Text].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
H.-I H. Chen, F.-H. Hsu, Y. Jiang, M.-H. Tsai, P.-C. Yang, P. S. Meltzer, E. Y. Chuang, and Y. Chen
A probe-density-based analysis method for array CGH data: simulation, normalization and centralization
Bioinformatics, August 15, 2008; 24(16): 1749 - 1756.
[Abstract] [PDF]


Home page
CSH ProtocolsHome page
K. Wang and M. Bucan
Copy Number Variation Detection via High-Density SNP Genotyping
CSH Protocols, June 1, 2008; 2008(7): pdb.top46 - pdb.top46.
[Abstract] [Full Text]


Home page
Nucleic Acids ResHome page
P. Cahan, L. E. Godfrey, P. S. Eis, T. A. Richmond, R. R. Selzer, M. Brent, H. L. McLeod, T. J. Ley, and T. A. Graubert
wuHMM: a robust algorithm to detect DNA copy number variation using long oligonucleotide microarray data
Nucleic Acids Res., April 1, 2008; 36(7): e41 - e41.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
R. Andersson, C. E. G. Bruder, A. Piotrowski, U. Menzel, H. Nord, J. Sandgren, T. R. Hvidsten, T. Diaz de Stahl, J. P. Dumanski, and J. Komorowski
A segmental maximum a posteriori approach to genome-wide copy number profiling
Bioinformatics, March 15, 2008; 24(6): 751 - 758.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
R. Pique-Regi, J. Monso-Varona, A. Ortega, R. C. Seeger, T. J. Triche, and S. Asgharzadeh
Sparse representation and Bayesian detection of genome copy number alterations from microarray data
Bioinformatics, February 1, 2008; 24(3): 309 - 318.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
W. Raffelsberger, Y. Krause, L. Moulinier, D. Kieffer, A.-L. Morand, L. Brino, and O. Poch
RReportGenerator: automatic reports from routine statistical analysis using R
Bioinformatics, January 15, 2008; 24(2): 276 - 278.
[Abstract] [Full Text] [PDF]


Home page
Genome ResHome page
K. Wang, M. Li, D. Hadley, R. Liu, J. Glessner, S. F.A. Grant, H. Hakonarson, and M. Bucan
PennCNV: An integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data
Genome Res., November 1, 2007; 17(11): 1665 - 1674.
[Abstract] [Full Text] [PDF]


Home page
Genome ResHome page
M. B. Gerstein, C. Bruce, J. S. Rozowsky, D. Zheng, J. Du, J. O. Korbel, O. Emanuelsson, Z. D. Zhang, S. Weissman, and M. Snyder
What is a gene, post-ENCODE? History and updated definition
Genome Res., June 1, 2007; 17(6): 669 - 681.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
S. Stjernqvist, T. Ryden, M. Skold, and J. Staaf
Continuous-index hidden Markov modelling of array CGH copy number data
Bioinformatics, April 15, 2007; 23(8): 1006 - 1014.
[Abstract] [Full Text] [PDF]


Home page
aacredbookHome page
P. J. Park
Integrated Analysis of Array-CGH and Expression Data
Am. Assoc. Cancer Res. Educ. Book, April 14, 2007; 2007(1): 209 - 212.
[Full Text] [PDF]


Home page
Nucleic Acids ResHome page
S. Colella, C. Yau, J. M. Taylor, G. Mirza, H. Butler, P. Clouston, A. S. Bassett, A. Seller, C. C. Holmes, and J. Ragoussis
QuantiSNP: an Objective Bayes Hidden-Markov Model to detect and accurately map copy number variation using SNP genotyping data
Nucleic Acids Res., March 27, 2007; (2007) gkm076v3.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
J. Du, J. S. Rozowsky, J. O. Korbel, Z. D. Zhang, T. E. Royce, M. H. Schultz, M. Snyder, and M. Gerstein
A supervised hidden markov model framework for efficiently segmenting tiling array data in transcriptional and chIP-chip experiments: systematically incorporating validated biological knowledge
Bioinformatics, December 15, 2006; 22(24): 3016 - 3024.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrowOA All Versions of this Article:
22/9/1144    most recent
btl089v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (15)
Google Scholar
Right arrow Articles by Marioni, J. C.
Right arrow Articles by Tavaré, S.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Marioni, J. C.
Right arrow Articles by Tavaré, S.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?