Skip Navigation


Bioinformatics Advance Access originally published online on July 10, 2007
Bioinformatics 2007 23(18):2485-2487; doi:10.1093/bioinformatics/btm350
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/18/2485    most recent
btm350v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (4)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Lunter, G.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Lunter, G.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

HMMoC—a compiler for hidden Markov models

Gerton Lunter

MRC Functional Genetics Unit, Department of Physiology, Anatomy and Genetics, South Parks Road, OX1 3TG, University of Oxford, UK


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 OVERVIEW
 ACKNOWLEDGEMENTS
 REFERENCES
 

Summary: Hidden Markov models are widely applied within computational biology. The large data sets and complex models involved demand optimized implementations, while efficient exploration of model space requires rapid prototyping. These requirements are not met by existing solutions, and hand-coding is time-consuming and error-prone. Here, I present a compiler that takes over the mechanical process of implementing HMM algorithms, by translating high-level XML descriptions into efficient C++ implementations. The compiler is highly customizable, produces efficient and bug-free code, and includes several optimizations.

Availability: http://genserv.anat.ox.ac.uk/software

Contact: gerton.lunter{at}dpag.ox.ac.uk


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 OVERVIEW
 ACKNOWLEDGEMENTS
 REFERENCES
 
Hidden Markov models (HMMs) are very suitable for sequence analysis, and have found wide application in computational biology. These applications include gene finding (Burge and Karlin, 1997), motif discovery and searching (Bailey and Elkan, 1995; Baldi et al., 1994; Bateman et al., 2002; Eddy, 1998), identification of CpG islands (Durbin et al., 1998), genotyping (Scheet and Stephens, 2006), detection of recombination events (Hobolth et al., 2007), sequence alignment (Holmes, 2003) and many more.

Applications in computational biology typically involve large data sets, making high-level toolkits such as R or MatLab less suitable. Another common approach, of which GHMM (Schliep et al., 2003) is but one example, is to use a library implementation which operate on HMMs defined by some data structure. In practice, either the flexibility or the efficiency of such libraries often falls short of practical requirements, especially for pair- and higher-dimensional HMMs. For these reasons, researchers often resort to hand-coding their algorithms. While straightforward, hand-coding remains tedious and error-prone, particularly when models become more complex and optimized code is required.

Code generation could, in principle, combine flexibility, usability and efficiency. An early effort in this direction is the PROLOG-based language GenLang, which is capable of handling context-free grammars as well as HMMs (Searls, 1993). Another approach was taken by the dynamic-programming code generator Dynamite (Birney and Durbin, 1997), which produces highly efficient C code for Viterbi-like algorithms, at the cost of some flexibility. Dynamite later developed into the gene annotation toolset Exonerate, which is based on the code generator C4 (Slater and Birney, 2005).

These code generators, being somewhat tied to their original application domain, lack direct support for probabilistic algorithms. Here I present an HMM compiler, HMMoC, that aims to fill this gap. HMMoC is both efficient and flexible, and provides support for all standard HMM algorithms. The input to the compiler consists of a succinct XML file that defines the topology of the HMM. The probabilities associated to transitions and emissions are defined in the same file, using arbitrarily parameterized C code fragments. From these, the compiler produces C++ header and source files that implement the required HMM algorithms.


    2 OVERVIEW
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 OVERVIEW
 ACKNOWLEDGEMENTS
 REFERENCES
 
2.1 Supported algorithms
The compiler supports the Forward and Backward algorithms, which compute posterior probabilities conditional on one or more input sequences. These algorithms can be configured to return the dynamic programming (DP) table if desired. Baum–Welch parameter estimation is also implemented, when required, as part of either the Forward or Backward algorithm. Parameter estimates are computed in two steps, by first computing a standard Forward DP table followed by a Baum–Welch backward iteration (or vice versa). For efficiency, either or both transition and emission parameters can be computed.

The Viterbi algorithm is implemented as two separate procedures, which compute the Viterbi table (and likelihood) and the most likely path, respectively. Finally, a sampling algorithm may be generated, which samples paths from the posterior distribution. Unconditional sampling from the HMM is not directly supported, but can be implemented by defining a ‘silent’ HMM producing no output, and computing a conditional sample from that.

2.2 Supported HMM architectures
HMMoC supports HMMs with any number of outputs, including pair HMMs, triple HMMs and phylogenetic HMMs. Emissions, from a user-defined alphabet, may be associated to states (‘Moore machines’) or to transitions (‘Mealy machines’), and mixtures are also possible. Higher-order Markov chains (where probabilities depend on a limited number of previously emitted symbols) are also supported, and the order or states may vary over the network. HMMoC includes support for inhomogeneous Markov chains, allowing probabilities to depend on the position within the sequence. This can be used to implement, e.g. inference algorithms for continuous-time Markov chains with variable intervals between measurements, or may be used to incorporate prior knowledge. Finally, HMMoC supports any number of silent states, and the required transition probabilities are computed by a matrix inversion step (‘wing retraction’, (Eddy, 1998)] whenever necessary.

2.3 Extended-exponent real-number type
Underflows are a frequent problem for HMM implementations and large data sets. Working in log space only partially solves this problem, since most algorithms use addition as well as multiplication, necessitating slow conversions to and from log space. HMMoC provides an 8-byte extended-exponent real type, termed BFloats (for ‘buoyant floats’), which combines the precision of a float with an essentially unbounded exponent. An efficient template library implementation provides fast in-line code, resulting in comparable running time for algorithms that use doubles or BFloats. In addition, a logspace type is provided, which is slightly more efficient than BFloats when used in the Viterbi algorithm.

2.4 Efficiency and optimizations
For efficiency, states can be grouped into cliques, over which the HMM induces an acyclic graph. This structure is used by HMMoC to efficiently allocate memory for the DP table. For instance, the start and end states may always form their own clique, and need not be represented in the main DP table. Some algorithm-specific optimizations are provided; for instance, when no DP table is required as output from a Forward or Backward iteration, a lower-dimensional DP table is allocated.

Several time-optimizations are implemented as well. HMMoC interprets the emission patterns to determine the range of coordinates at which states may be visited, and uses this information to limit access to accessible states only. In the main loops, HMMoC chooses an order of computation that minimizes accesses to DP table entries, resulting in considerable improvements, particularly for sparse DP tables where such accesses are relatively expensive (see below). In addition, constant transition probabilities are pre-computed outside the main loop whenever possible. Finally, all loops over transitions and states are unrolled, reducing the number of run-time decisions and lookups. As a result of these optimizations, HMMoC produces code approaching the efficiency of the hand-optimized HMMER package (Eddy, 1998), see Figure 1.


Figure 1
View larger version (35K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. Execution times for performing a database search using HMM profiles of the TK and ZnF_C2HC domains (38 and 56 states, respectively), in a database of 4.84 million amino acids on a desktop computer (2.33 GHz E5345 Xeon, 4 MB cache). For this comparison, HMMER and HMMoC performed a Viterbi and a Forward recursion. The HMMoC-generated algorithm uses logspace real numbers for the Viterbi recursion, and BFloats for the Forward recursion.

 
2.5 Banding
Banding refers to the technique of traversing only a user-defined portion of a DP table. When used correctly, this can result in dramatic improvements in time and memory usage, with small to negligible loss of accuracy. HMMoC allows the user to specify an arbitrary iterator to traverse the DP table. When banding is used, HMMoC uses a sparse DP table, backed by a hash map storing only entries that are visited. All DP tables provide accessor functions to hide implementation details from the user.

2.6 Robustness
HMMoC provides several consistency checks, both at compile and run time. For instance, HMMoC checks and determines the consistency of state orders, by insisting that orders can increase by one only following an emission, and the induced graph on cliques is checked for cycles. At run time, warnings are issued when iterators follows an inconsistent or out-of-bounds path, and when negative probabilities are encountered. Care has been taken to produce clean and relatively readable C++ code which should compile without warnings, allowing potential problems to be easily identified.

2.7 Examples and documentation
The package includes four examples demonstrating the use of HMMoC: the classic ‘dishonest casino’ HMM (Durbin et al., 1998), a CpG-island finder, a probabilistic global pairwise aligner with Baum-Welch parameter optimization and a conversion script for HMMER profile HMMs that was used to produce the test results mentioned before. Efforts are underway to develop a graphical programming interface abstracting the XML layer to simplify model design (Dombai and Miklós, Personal communication); a prototype web implementation is available at http://dbalazs.web.elte.hu/Phyl4/.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 OVERVIEW
 ACKNOWLEDGEMENTS
 REFERENCES
 
HMMoC's design was inspired by Ian Holmes’ Telegraph, an XML-based language to define HMMs. Helpful discussions with Jotun Hein and István Miklós are gratefully acknowledged.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Alex Bateman

Received on May 30, 2007; revised on June 26, 2007; accepted on June 27, 2007

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 OVERVIEW
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Bailey TL, Elkan C. The value of prior knowledge in discovering motifs with MEME. Proc. Int. Conf. Intell. Syst. Mol. Biol (1995) 3:21–29.[Medline]

    Baldi P, et al. Hidden Markov models of biological primary sequence information. Proc. Natl Acad. Sci. USA (1994) 91:1059–1063.[Abstract/Free Full Text]

    Bateman A, et al. The Pfam protein families database. Nucleic Acids Res (2002) 30:276–280.[Abstract/Free Full Text]

    Birney E, Durbin R. Dynamite: a flexible code generating language for dynamic programming methods used in sequence comparison. In: Fifth International Conference on Intelligent Systemts in Molecular Biology. (1997) Menlo Park: IAAA Press. 56–64.

    Burge C, Karlin S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol (1997) 268:78–94.[CrossRef][Web of Science][Medline]

    Durbin R, et al. Biological Sequence Analysis. (1998) Cambridge: Cambridge University Press.

    Eddy SR. Profile hidden Markov models. Bioinformatics (1998) 14:755–763.[Abstract/Free Full Text]

    Hobolth A, et al. Genomic relationships and speciation times of human, chimpanzee, and gorilla inferred from a coalescent hidden Markov model. PLoS Genet (2007) 3:e7.[CrossRef][Medline]

    Holmes I. Using guide trees to construct multiple-sequence evolutionary HMMs. Bioinformatics (2003) 19(Suppl. 1):i147–i157.[Abstract]

    Scheet P, Stephens M. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet (2006) 78:629–644.[CrossRef][Web of Science][Medline]

    Schliep A, et al. Using hidden Markov models to analyze gene expression time course data. Bioinformatics (2003) 19(Suppl. 1):i255–i263.[Abstract]

    Searls DB. String variable grammar: a logic grammar formalism for the biological language of DNA. J. Log. Program (1993) 24:73–102.[CrossRef]

    Slater GS, Birney E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics (2005) 6:31.[CrossRef][Medline]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Stat Methods Med ResHome page
I. Miklos, A. Novak, R. Satija, R. Lyngso, and J. Hein
Stochastic models of sequence evolution including insertion--deletion events
Statistical Methods in Medical Research, October 1, 2009; 18(5): 453 - 485.
[Abstract] [PDF]


Home page
Nucleic Acids ResHome page
T. Y. Lam and I. M. Meyer
HMMCONVERTER 1.0: a toolbox for hidden Markov models
Nucleic Acids Res., September 8, 2009; (2009) gkp662v1.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
L. Barquist and I. Holmes
xREI: a phylo-grammar visualization webserver
Nucleic Acids Res., July 1, 2008; 36(suppl_2): W65 - W69.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/18/2485    most recent
btm350v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (4)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Lunter, G.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Lunter, G.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?