Bioinformatics Advance Access originally published online on August 27, 2008
Bioinformatics 2008 24(20):2403-2404; doi:10.1093/bioinformatics/btn457
StatAlign: an extendable software package for joint Bayesian estimation of alignments and evolutionary trees
Department of Statistics, University of Oxford, 1 South Parks Road, OX1 3TG Oxford, UK
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Motivation: Bayesian analysis is one of the most popular methods in phylogenetic inference. The most commonly used methods fix a single multiple alignment and consider only substitutions as phylogenetically informative mutations, although alignments and phylogenies should be inferred jointly because insertions and deletions also carry informative signals. Methods addressing these issues have been developed only recently, and so far there has been no user-friendly program with a graphical interface that implements them.
Results: We have developed an extendable software package in the Java programming language that samples from the joint posterior distribution of phylogenies, alignments and evolutionary parameters by applying the Markov chain Monte Carlo method. The package also offers tools for efficient on-the-fly summarization of the results. It has a graphical interface to configure, start and supervise the analysis, to track the status of the Markov chain and to save the results. The background model for insertions and deletions can be combined with any substitution model. It is easy to add new substitution models to the software package as plugins. The samples from the Markov chain can be summarized in several ways, and new postprocessing plugins may also be installed.
Availability: The code is available from http://phylogeny-cafe.elte.hu/StatAlign/
Contact: miklosi{at}ramet.elte.hu
| 1 INTRODUCTION |
|---|
The fundamental types of mutations that change biological sequences are substitution, insertion and deletion. Although insertions and deletions play an important role in the evolution of sequences, phylogenetic inference is often carried out taking only substitutions into account. Furthermore, analyses based on a single multiple alignment can be misleading, and multiple alignment methods tend to introduce more variation into the phylogenetic analysis than tree-building methods do (Goldman, 1998; Wong et al., 2008). Therefore, it is desirable to incorporate insertions and deletions in the phylogenetic analysis and to co-estimate phylogeny and alignment from their joint posterior distribution.
Time-continuous Markov models have been the standard for modelling substitutions in biological sequences (Felsenstein, 1981; Jukes and Cantor, 1969; Whelan et al., 2001), and most of the popular phylogenetic inference methods are based on such models (Ronquist and Huelsenbeck, 2003). Thorne et al. (1991, 1992) developed time-continuous Markov models for insertions and deletions that can also be used in phylogenetic analysis. Such analyses can highlight homoplasy and alignment uncertainty (Lunter et al., 2005; Redelings and Suchard, 2005) and can be applied to protein structure prediction (Miklós et al., 2008) or to phylogeny estimation of rapidly emerging pathogens (Redelings and Suchard, 2007).
Software packages published so far that fulfil some of the purposes described above (Fleißner et al., 2005; Holmes and Bruno, 2001; Suchard and Redelings, 2006) lack a graphical interface and cannot easily be extended with further model and data summarization plugins. We have implemented a package in the Java programming language that has an easy-to-use user interface and is dynamically extendable through post-processing and substitution model plugins, without any need for recompilation.
| 2 THE STATALIGN SOFTWARE PACKAGE |
|---|
2.1 The main features of the program
The StatAlign software package performs joint Bayesian analysis of multiple alignments, phylogenetic trees and evolutionary parameters. The background model for insertions and deletions is a modified version of the TKF92 model (Thorne et al., 1992), as described by Miklós et al. (2008). The indel model can be coupled with an arbitrary substitution model. We provide a wide selection of substitution models for both protein and nucleotide sequence data, ranging from the Jukes–Cantor model to the general reversible nucleotide substitution model, and from the Dayhoff model to the Whelan and Goldman model (WAG) (see the Model menu and/or the documentation).
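As an illustration of the simplest model in this range, the following minimal sketch computes Jukes–Cantor transition probabilities for a branch of length d, measured in expected substitutions per site. The class and method names are hypothetical and are not taken from the StatAlign source.

```java
// Sketch only: names are hypothetical, not StatAlign's API.
final class JukesCantorSketch {

    /** Probability that a nucleotide is unchanged after branch length d. */
    static double probSame(double d) {
        return 0.25 + 0.75 * Math.exp(-4.0 * d / 3.0);
    }

    /** Probability of ending in one specific different nucleotide. */
    static double probDiff(double d) {
        return 0.25 - 0.25 * Math.exp(-4.0 * d / 3.0);
    }

    /** Full 4x4 transition matrix over A, C, G, T. */
    static double[][] transitionMatrix(double d) {
        double[][] p = new double[4][4];
        for (int i = 0; i < 4; i++) {
            for (int j = 0; j < 4; j++) {
                p[i][j] = (i == j) ? probSame(d) : probDiff(d);
            }
        }
        return p;
    }

    public static void main(String[] args) {
        double[][] p = transitionMatrix(0.1);
        System.out.printf("P(A->A)=%.4f  P(A->C)=%.4f%n", p[0][0], p[0][1]);
    }
}
```

Each row of the matrix sums to one, and as d grows every entry tends to the uniform equilibrium frequency 1/4.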
The Bayesian analysis is based on Markov chain Monte Carlo (MCMC), employing the transition kernels described by Miklós et al. (2008). We made a few algorithmic improvements to the transition kernel for proposing tree topologies, which speed up the analysis by a factor of 3–5. The new method proposes better alignments faster, so it both increases the acceptance ratio and decreases the time needed for an MCMC step.
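The role of the acceptance ratio can be seen in a generic Metropolis–Hastings step. The sketch below is a toy example on a one-dimensional standard normal target with a symmetric random-walk proposal; it does not reproduce StatAlign's transition kernels, and all names in it are hypothetical.

```java
import java.util.Random;

// Toy sketch of a Metropolis-Hastings sampler; not StatAlign's kernels.
final class MetropolisHastingsSketch {

    /**
     * Accepts with probability min(1, exp(logPostNew - logPostOld + logHastings)),
     * where logHastings = log q(old | new) - log q(new | old).
     */
    static boolean accept(double logPostOld, double logPostNew,
                          double logHastings, Random rng) {
        double logRatio = logPostNew - logPostOld + logHastings;
        return logRatio >= 0.0 || Math.log(rng.nextDouble()) < logRatio;
    }

    /** Assumed toy target: unnormalized log-density of a standard normal. */
    static double logTarget(double x) {
        return -0.5 * x * x;
    }

    /** Runs the chain from x = 0 and returns the state after every step. */
    static double[] run(int steps, double stepSize, long seed) {
        Random rng = new Random(seed);
        double[] samples = new double[steps];
        double x = 0.0;
        for (int i = 0; i < steps; i++) {
            // Symmetric proposal, so the Hastings correction is zero.
            double proposal = x + stepSize * rng.nextGaussian();
            if (accept(logTarget(x), logTarget(proposal), 0.0, rng)) {
                x = proposal;
            }
            samples[i] = x; // a rejected proposal repeats the current state
        }
        return samples;
    }

    public static void main(String[] args) {
        double[] s = run(100_000, 1.0, 42L);
        double mean = 0.0;
        for (double v : s) mean += v;
        System.out.println("Sample mean: " + mean / s.length);
    }
}
```

A proposal mechanism that produces better candidates raises the fraction of accepted steps, so fewer iterations are wasted on rejected moves.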
StatAlign has a graphical interface for choosing the input sequences, selecting the preferred substitution model and input–output formats, and setting MCMC parameters. During the analysis, users can follow the progress through tabbed panels showing the log-likelihood trace of the Markov chain to verify its convergence, the multiple alignment and phylogeny represented by the current state of the Markov chain, and the current maximum posterior decoding estimate (Durbin et al., 1998; Holmes and Durbin, 1998) for the consensus alignment based on the sampled multiple alignments (Fig. 1).
2.2 Modularity of the software package
Our aim was to build a software package with a fixed insertion–deletion model that can be coupled with an arbitrary substitution model. Owing to its modularity, it is very easy to develop additional substitution models. We provide a detailed description for developers who wish to implement and plug in new substitution models.
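One common way to support plugins without recompilation in Java is to define an interface and instantiate implementations by class name at run time. The sketch below illustrates that general pattern only: the interface, the loader and the example model with its base frequencies are assumptions made for illustration, and StatAlign's actual plugin interface is described in its developer documentation.

```java
// Sketch of run-time plugin loading; all names here are hypothetical.
final class PluginSketch {

    /** Hypothetical plugin contract; StatAlign's real interface may differ. */
    interface SubstitutionModel {
        String name();

        /** Transition probabilities P[i][j] over A, C, G, T for branch length d. */
        double[][] transitionMatrix(double d);
    }

    /** Example plugin: Felsenstein (1981) model with illustrative base frequencies. */
    static final class Felsenstein81 implements SubstitutionModel {
        private final double[] pi = {0.1, 0.2, 0.3, 0.4}; // assumed values
        private final double beta; // scales time to expected substitutions per site

        public Felsenstein81() {
            double sumSq = 0.0;
            for (double p : pi) sumSq += p * p;
            beta = 1.0 / (1.0 - sumSq);
        }

        @Override
        public String name() {
            return "Felsenstein 1981";
        }

        @Override
        public double[][] transitionMatrix(double d) {
            double decay = Math.exp(-beta * d);
            double[][] p = new double[4][4];
            for (int i = 0; i < 4; i++) {
                for (int j = 0; j < 4; j++) {
                    p[i][j] = pi[j] + ((i == j ? 1.0 : 0.0) - pi[j]) * decay;
                }
            }
            return p;
        }
    }

    /** Instantiates a model from its class name via its no-argument constructor. */
    static SubstitutionModel load(String className) {
        try {
            return Class.forName(className)
                    .asSubclass(SubstitutionModel.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException("Cannot load plugin " + className, e);
        }
    }

    public static void main(String[] args) {
        SubstitutionModel model = load(Felsenstein81.class.getName());
        System.out.println("Loaded model: " + model.name());
    }
}
```

Because the host program refers only to the interface, a new model compiled separately and placed on the class path can be used without rebuilding the host.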
StatAlign generates random samples from the joint posterior distribution of sequence alignments, evolutionary trees and model parameters. This high-dimensional joint distribution can be analysed in several ways: the possibilities range from simple statistics of marginalized single dimensions (e.g. the posterior distribution of a single rate parameter) to covariation analysis of multiple dimensions. In addition, the convergence of the Markov chain can be investigated by methods ranging from plotting the log-likelihood trace to sophisticated analysis of autocorrelations. We provide detailed descriptions for developers who wish to implement further post-processing modules that perform such analyses and visualize the results.
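Two of the simplest summaries mentioned above, the posterior mean of a single sampled parameter and the autocorrelation of a trace, can be sketched as follows. This is a stand-alone illustration with hypothetical names, assuming the trace is available as an array of doubles; it is not a StatAlign post-processing plugin.

```java
// Sketch of basic trace summaries; names are hypothetical.
final class TraceSummarySketch {

    /** Mean of the samples that remain after discarding the first burnIn values. */
    static double meanAfterBurnIn(double[] trace, int burnIn) {
        double sum = 0.0;
        for (int i = burnIn; i < trace.length; i++) {
            sum += trace[i];
        }
        return sum / (trace.length - burnIn);
    }

    /**
     * Lag-k sample autocorrelation of a trace.
     * Returns NaN for a constant trace, whose variance is zero.
     */
    static double autocorrelation(double[] trace, int lag) {
        double mean = meanAfterBurnIn(trace, 0);
        double num = 0.0;
        double den = 0.0;
        for (int i = 0; i < trace.length; i++) {
            double d = trace[i] - mean;
            den += d * d;
            if (i + lag < trace.length) {
                num += d * (trace[i + lag] - mean);
            }
        }
        return num / den;
    }

    public static void main(String[] args) {
        double[] trace = {-1250.3, -1190.8, -1175.2, -1174.9, -1176.1, -1173.5};
        System.out.println("Mean after burn-in of 2: " + meanAfterBurnIn(trace, 2));
        System.out.println("Lag-1 autocorrelation: " + autocorrelation(trace, 1));
    }
}
```

The trace values in `main` are made-up numbers used only to exercise the methods. Autocorrelations that decay slowly with the lag indicate that consecutive samples carry little independent information.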
| 3 DISCUSSION |
|---|
In this article, we introduced a new software package with a graphical interface for the joint Bayesian estimation of alignments, phylogenies and evolutionary parameters. With a new transition kernel for proposing tree topologies, our program is significantly faster than the previously published version (Miklós et al., 2008): one million MCMC steps on 13 sequences of papain family cysteine proteinases with an average length of 223 can be performed in less than 20 h. Furthermore, the convergence of the Markov chain is relatively fast: based on the log-likelihood trace, 10 000 steps were sufficient for convergence on this dataset. Another novel feature of the program is that it is easily extendable: new substitution models as well as post-processing modules can be plugged into the package without recompilation.
Bayesian phylogenetic inference is one of the most popular methods for analysing biological sequences. The standard protocol so far has been to align sequences with an alignment tool, such as Clustal-W or T-COFFEE, and then use the alignment as the input for a program that considers only substitutions, e.g. MrBayes (Ronquist and Huelsenbeck, 2003). In contrast, our program allows joint inference of alignment, phylogeny and model parameters. This eliminates some artefacts of previous protocols, for example the influence of the alignment program's guide tree on the tree estimated from the resulting alignment.
Funding: BBSRC (grant BB/C509566/1); Bolyai postdoctoral fellowship (to I.M.); OTKA (grant F61730 to I.M.)
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Martin Bishop
Received on July 18, 2008; revised on August 21, 2008; accepted on August 21, 2008
| REFERENCES |
|---|
Durbin R, et al. Biological Sequence Analysis. Probabilistic Models of Proteins and Nucleic Acids. (1998) Cambridge: Cambridge University Press.
Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. (1981) 17:368–376.
Fleißner R, et al. Simultaneous statistical multiple alignment and phylogeny reconstruction. Syst. Biol. (2005) 54:548–561.
Goldman N. Phylogenetic information and experimental design in molecular systematics. Proc. R. Soc. Lond. B (1998) 265:1779–1786.
Holmes I, Bruno WJ. Evolutionary HMMs: a Bayesian approach to multiple alignment. Bioinformatics (2001) 17:803–820.
Holmes I, Durbin R. Dynamic programming alignment accuracy. J. Comp. Biol. (1998) 5:493–504.
Jukes TH, Cantor CR. Evolution of protein molecules. In: Munro HN, ed. Mammalian Protein Metabolism. (1969) New York: Academic Press. 21–132.
Lunter G, et al. Bayesian coestimation of phylogeny and sequence alignment. BMC Bioinformatics (2005) 6:83.
Miklós I, et al. A long indel model for evolutionary sequence alignment. Mol. Biol. Evol. (2004) 21:529–540.
Miklós I, et al. How reliably can we predict the reliability of protein structure predictions? BMC Bioinformatics (2008) 9:137.
Redelings BD, Suchard MA. Joint Bayesian estimation of alignment and phylogeny. Syst. Biol. (2005) 54:401–418.
Redelings BD, Suchard MA. Incorporating indel information into phylogeny estimation for rapidly emerging pathogens. BMC Evol. Biol. (2007) 7:40.
Ronquist F, Huelsenbeck JP. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics (2003) 19:1572–1574.
Suchard MA, Redelings BD. BAli-Phy: simultaneous Bayesian inference of alignment and phylogeny. Bioinformatics (2006) 22:2047–2048.
Thorne JL, et al. An evolutionary model for maximum likelihood alignment of DNA sequences. J. Mol. Evol. (1991) 33:114–124.
Thorne JL, et al. Inching toward reality: an improved likelihood model of sequence evolution. J. Mol. Evol. (1992) 34:3–16.
Whelan S, et al. Molecular phylogenetics: state of the art methods for looking into the past. Trends Genet. (2001) 17:262–272.
Wong KM, et al. Alignment uncertainty and genomic analysis. Science (2008) 319:473–476.
