Bioinformatics Advance Access originally published online on March 23, 2007
Bioinformatics 2007 23(11):1424-1426; doi:10.1093/bioinformatics/btm096
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Unsupervised segmentation of continuous genomic data


1Department of Computer Science and Engineering, 2Division of Medical Genetics, and 3Department of Genome Sciences, University of Washington, Seattle, WA, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Summary: The advent of high-density, high-volume genomic data has created the need for tools to summarize large datasets at multiple scales. HMMSeg is a command-line utility for the scale-specific segmentation of continuous genomic data using hidden Markov models (HMMs). Scale specificity is achieved by an optional wavelet-based smoothing operation. HMMSeg is capable of handling multiple datasets simultaneously, rendering it ideal for integrative analysis of expression, phylogenetic and functional genomic data.
Availability: http://noble.gs.washington.edu/proj/hmmseg
Contact: rthurman{at}u.washington.edu
| 1 INTRODUCTION |
|---|
|
|
|---|
The convergence of the genomic era and the advent of high-throughput biological and chemical assays has created a wealth of genomic data, much of which is presented in continuous, time-series-like fashion across the genome. Often, it is desirable to extract simplifying summary information from such data. One summarization approach involves segmenting the data into a small number of discrete states based on the continuous output values. This segmentation may be accomplished in an unsupervised fashion using hidden Markov models (HMMs). For example, a chromosome-wide continuous profile of bulk RNA output generated using tiling DNA microarrays may be partitioned by segmenting the chromosomal coordinates into three states, corresponding to regions of low, medium and high transcription levels. This type of categorization is often desirable in the context of elucidating broad, large-scale trends in the data. In this case, it may be preferable to smooth the data to a specified scale before segmentation, in order to eliminate spurious state transitions resulting from isolated fine-scale features (see Fig. 1).
|
HMMSeg is a tool for segmenting continuous genomic datasets on a scale-specific basis using HMMs. Scale specificity is achieved by an optional smoothing step using wavelets (see below). HMMSeg provides multivariate capability, computing a single segmentation based on multiple datasets simultaneously defined on a common set of genomic coordinates.
As a platform for segmenting a wide variety of genomic data, HMMSeg is distinguished from existing programs using HMMs, which typically fall under two categories: toolboxes for applications in any field, such as htk (Young et al., 1995) and GHMM (http://ghmm.org); or biological application-specific tools that use HMMs, such as glimmerHMM (Majoros et al., 2004), for gene finding and HMMer (Eddy, 1995) for sequence analysis.
1.1 Hidden Markov models
An HMM is a statistical model in which data are assumed to be generated by a stochastic process defined by a predetermined number of hidden states (Rabiner, 1989). Each state is defined by an emission distribution, from which data values are generated. The model also specifies probabilities for transitioning between states. The parameters defining the emission and transition probabilities are typically learned from the data by expectation maximization (EM). Given the learned parameters, there are two common methods for determining the state labels for each observation: the Viterbi algorithm, which finds the single most probable path (sequence of states); and posterior decoding, which computes the most likely state at each point of the sequence. HMMSeg uses Gaussian emission distributions, with diagonal covariance for multiple datasets (assuming independence between variables), and supports both the Viterbi and posterior decoding methods for state assignments.
1.2 Wavelet smoothing
Wavelets are a mathematical tool for multi-scale analysis (Percival and Walden, 2000). Though first used in practical applications in the fields of engineering and signal processing, in recent years wavelets have found many applications in computational biology (Liò 2003). We apply scale-specific smoothing using a variant of the discrete wavelet transform (DWT) called the maximal overlap discrete wavelet transform (MODWT) (Percival and Walden, 2000). Both the DWT and MODWT can be used to decompose a given signal via multiresolution analysis into a sum of scale-specific signals. In contrast with other smoothing techniques, wavelet smoothing is essentially the process of subtracting out the small-scale behavior rather than averaging it. HMMSeg uses the LA(8) family of wavelets for all wavelet transforms. The choice of wavelet scale is application dependent, and can be informed by prior biological information about the scale of features of interest, or by trial-and-error to achieve, say, a desired segment length distribution. See the website for further details on wavelet smoothing.
| 2 DESCRIPTION OF FUNCTIONALITY |
|---|
|
|
|---|
HMMSeg provides a command-line interface. The input to HMMSeg is one or more collections of files in either single column or tab-delimited BED format. Each collection represents a different dataset; multiple collections trigger a multivariate segmentation. After reading the data from the input files, HMMSeg optionally smooths the data at a user-specified scale using the MODWT. In this case, wavelets require that the input data be evenly spaced. There is also an option to smooth the data without HMM training.
HMMSeg proceeds to train a completely connected HMM on the data by using EM. By default, the HMM has two states; models with more states may also be specified. The Gaussian parameters and transition probabilities are initialized randomly, although the user may provide model parameters to replace or initialize EM training. Training may be repeated multiple times from different random starts, in which case the model with the highest total likelihood is selected. Based on the final model, observations are assigned to states using the Viterbi algorithm or posterior decoding.
The program outputs the trained model plus the state assignment for each observation. If the user provides input data in BED format, then the segmentation is output in wiggle format, suitable for display in the UCSC Genome Browser (Kent et al., 2000). The wiggle file contains separate tracks for the original data, smoothed data, the state assignments and (for the posterior decoding method) the probabilities of each data point belonging to each state.
HMMSeg is implemented in Java for platform independence. It has been successfully tested on Windows and UNIX-style systems. Validation and accuracy test results are available on the website.
| 3 EXAMPLE |
|---|
|
|
|---|
In a recent study (Thurman et al., 2006), we analyzed a number of independently generated experimental datasets produced under the NHGRI ENCODE project (ENCODE Consortium, 2004), whose ultimate goal is to identify all of the functional elements in the human genome. Currently the ENCODE project is in its pilot phase, analyzing 44 regions spanning 30MB (
1%) of the genome (ENCODE Consortium, 2006). Our aim was to integrate multiple functional datatypes to create a functional domain map of the ENCODE regions. We used HMMSeg to segment the data at the 64 kb scale into two states, interpreted a posteriori as functionally active or inactive. (Note, however, that in the study wavelet smoothing was performed on all data except for replication timing, as a pre-processing step.) We successfully applied this technique to individual datatypes and up to five datasets simultaneously. Examples of large-scale domains delineated by HMMSeg using MODWT smoothing on all datasets are pictured in Figure 2 (see website for details). Here we see the advantages of using HMMs over simple thresholding techniques, the logic of which breaks down in scenarios with multiple datasets and few states. This approach highlights the potential for integrating multiple functional genomic datatypes with widely varying experimental resolution.
|
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
We thank Charles Grant and Tobias Mann for helpful advice. Funding for this work and Open Access publication charges was provided by NIH grants U01 HG003161, R01 GM071923 and R01 GM71852.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
The authors wish it to be know that, in their opinion, the first two authors should be regarded as First Authors. Associate Editor: Alex Bateman
Received on November 22, 2006; revised on February 9, 2007; accepted on March 7, 2007
| REFERENCES |
|---|
|
|
|---|
Eddy SR. Multiple alignment using hidden Markov models. Rawlings C, ed. (1995) Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology. AAAI Press. 114–120.
ENCODE Consortium. The ENCODE (ENCyclopedia Of DNA Elements) project. Science (2004) 306:636–640.
ENCODE Consortium. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature (2007) in press.
Kent WJ, et al. The human genome browser at UCSC. Genome Res (2002) 12:996–1006.
Liò P. Wavelets in bioinformatics and compuataional biology: state of art and perspectives. Bioinformatics (2003) 19:2–9.
Majoros WH, et al. Tigrscan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics (2004) 20:2878–2879.
Percival DB, Walden AT. Wavelet Methods for Time Series Analysis (2000) Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.
Rabiner LR. A tutorial on hidden Markov models and selected applications in speech recognition. (1989) 77. Proceedings of the IEEE. 257–286.
Thurman RE, et al. Stamatoyannopoulos. Identification of higher-order functional domains in the human genome. Genome Research (2007) in press.
Young S, et al. The HTK Book (1995) Cambridge University.
This article has been cited by other articles:
![]() |
L. Winchester, C. Yau, and J. Ragoussis Comparing CNV detection methods for SNP arrays Brief Funct Genomic Proteomic, September 8, 2009; (2009) elp017v1. [Abstract] [Full Text] [PDF] |
||||
![]() |
F. M. Pauler, M. A. Sloane, R. Huang, K. Regha, M. V. Koerner, I. Tamir, A. Sommer, A. Aszodi, T. Jenuwein, and D. P. Barlow H3K27me3 forms BLOCs over silent genes and intergenic regions and specifies a histone banding pattern on a mouse autosomal chromosome Genome Res., February 1, 2009; 19(2): 221 - 233. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||



