Bioinformatics Advance Access originally published online on September 5, 2007
Bioinformatics 2007 23(23):3263-3264; doi:10.1093/bioinformatics/btm432
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Phylocomposer and phylodirector: analysis and visualization of transducer indel models
Department of Bioengineering, University of California, Berkeley CA, USA
To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Summary: Finite-state string transducers are probabilistic tools similar to Hidden Markov Models that can be systematically extended to large number of sequences related by indel and substitution processes on phylogenetic trees. The number of states in such models grows exponentially with the number of nodes in the tree, with the consequence that even quite small trees can be difficult to analyze or visualize. Here, we present two tools, phylocomposer and phylodirector, for working with string transducers. The former tool implements previously described composition algorithms for extending transducers to arbitrary tree topologies, while the latter generates short animations for arbitrary input alignments and phylogenetic trees, illustrating the state path through the composed transducer.
Availability: Phylocomposer and phylodirector are freely available at http://biowiki.org/PhyloComposer and http://biowiki.org/PhyloDirector
Contact: ihh{at}berkeley.edu
String transducers are finite-state automata with input and output tapes. They are closely related to Hidden Markov Models (HMMs) and indeed may be regarded as conditionally-normalized Pair HMMs. Given a string transducer and a phylogenetic guide tree, one can systematically derive a multi-sequence HMM that correctly handles nested indels and neighbor-dependent substitutions (Holmes, 2003).
The algorithm to do this, known as transducer composition, has considerable application throughout bioinformatics. These applications include statistical alignment (Holmes and Bruno, 2001), comparative genome annotation (Lunter et al., 2006), phylogeny (Lunter et al., 2005) and measurement of evolutionary rates (Holmes, 2005).
Figure 1 and Table 1 illustrate why the composition problem is nontrivial, prompting automation: the number of states in the composed machine increases exponentially with number of nodes in the tree. Even though the single-branch transducer (A) corresponds to the simplest possible indel model (Thorne et al., 1991), the largest composed transducer (E) has 620 transitions.
|
|
These data were generated with phylocomposer. The program takes as input a file describing a phylogenetic tree, one or more transducers (represented as Moore machines) and a mapping from branches to transducers. The file format is based on Lisp S-expressions (Rivest, 1997). The output is a description of the composed phylo-transducer, including the algebraic structure of transition and emission probabilities. In addition, the program generates dotfiles suitable for input to GraphViz, which was used to produce Figure 1. It can also be used for dynamic programming alignment and inference. The program is written in C++ and requires the GNU gcc and make utilities.
The phylocomposer distribution includes the examples of Figure 1 along with more realistic transducers, incorporating geometric and mixture-of-geometric length distributions (i.e. affine and general convex gap penalties).
Applications of the HMMs generated by the program include multiple alignment [including progressive (Higgins and Sharp, 1989), multidimensional (Sankoff and Cedergren, 1983) and MCMC (Holmes and Bruno, 2001)]; simultaneous phylogeny and alignment (Lunter et al., 2005); inference of ancestral protosequences; measurement of evolutionary rates and analysis of indel models.
Figure 2 illustrates the output of phylodirector. The program takes as input a Stockholm-format alignment, including a Newick-format phylogenetic tree. The output is an MPEG-format animation illustrating the state path through the composed phylo-transducer required to generate the tree, along with a still image for each column of the alignment. The program requires Perl, the GD graphics library (and GD.pm Perl module) and the Berkeley MPEG Encoder.
|
Animations produced by phylodirector are available at biowiki.org/PhyloFilm.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
IH was funded in part by NIH/NHGRI grant 1R01GM076705-01.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Martin Bishop
Received on April 9, 2007; revised on August 1, 2007; accepted on August 17, 2007
| REFERENCES |
|---|
|
|
|---|
Higgins DG, Sharp PM. Fast and sensitive multiple sequence alignments on a microcomputer. Comput. Appl. Biosci (1989) 5:151–153.
Holmes I. Using guide trees to construct multiple-sequence evolutionary HMMs. Bioinformatics (2003) 19(Suppl. 1):i147–i157.[Abstract]
Holmes I. Using evolutionary expectation maximization to estimate indel rates. Bioinformatics (2005) 21:2294–2300.
Holmes I, Bruno WJ. Evolutionary HMMs: a Bayesian approach to multiple alignment. Bioinformatics (2001) 17:803–820.
Lunter G, et al. Bayesian coestimation of phylogeny and sequence alignment. BMC Bioinformatics (2005) 6:83.[CrossRef][Medline]
Lunter G. Genome-wide identification of human functional DNA using a neutral indel model. PLoS Comput. Biol (2006) 2.
Rivest R. S-expressions. Internet Working Draft. (1997) http://theory.lcs.mit.edu/rivest/sexp.txt.
Sankoff D, Cedergren RJ. Simultaneous comparison of three or more sequences related by a tree. In: Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison—Sankoff D, Kruskal JB, eds. (1983) chapter 9, Addison-Wesley, Reading, MA, pp. 253–264.
Thorne JL, et al. An evolutionary model for maximum likelihood alignment of DNA sequences. J. Mol. Evol (1991) 33:114–124.[CrossRef][Web of Science][Medline]
This article has been cited by other articles:
![]() |
R. Satija, L. Pachter, and J. Hein Combining statistical alignment and phylogenetic footprinting to detect regulatory elements Bioinformatics, May 15, 2008; 24(10): 1236 - 1242. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||


