Skip Navigation


Bioinformatics Advance Access originally published online on November 14, 2007
Bioinformatics 2008 24(1):56-62; doi:10.1093/bioinformatics/btm532
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
24/1/56    most recent
btm532v3
btm532v2
btm532v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (5)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Rodrigue, N.
Right arrow Articles by Lartillot, N.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Rodrigue, N.
Right arrow Articles by Lartillot, N.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Uniformization for sampling realizations of Markov processes: applications to Bayesian implementations of codon substitution models

Nicolas Rodrigue 1,*, Hervé Philippe 1 and Nicolas Lartillot 2

1Canadian Institute for Advanced Research, Département de Biochimie, Université de Montréal, C.P. 6821, Succ. Centre-ville, Montréal, Québec Canada, H3C 3J7 and 2Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier, URM 5506, CNRS-Université de Montpellier 2, Montpellier, France

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 DATA
 4 RESULTS AND DISCUSSION
 5 CONCLUSIONS AND FUTURE...
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: Mapping character state changes over phylogenetic trees is central to the study of evolution. However, current probabilistic methods for generating such mappings are ill-suited to certain types of evolutionary models, in particular, the widely used models of codon substitution.

Results: We describe a general method, based on a uniformization technique, which can be utilized to generate realizations of a Markovian substitution process conditional on an alignment of character states and a given tree topology. The method is applicable under a wide range of evolutionary models, and to illustrate its usefulness in practice, we embed it within a data augmentation-based Markov chain Monte Carlo sampler, for approximating posterior distributions under previously proposed codon substitution models. The sampler is found to be more efficient than the conventional pruning-based sampler with the decorrelation times between draws from the posterior reduced by a factor of 20 or more.

Contact: nicolas.rodrigue{at}umontreal.ca


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 DATA
 4 RESULTS AND DISCUSSION
 5 CONCLUSIONS AND FUTURE...
 ACKNOWLEDGEMENTS
 REFERENCES
 
Data augmentation schemes are increasingly employed in statistical frameworks (Gelman, 2004; Robert and Casella, 2004; van Dyk and Meng, 2001). The overall method consists of introducing latent, or unobserved data, normally integrated over in the closed form of the likelihood function. Somewhat paradoxically, even in cases where closed-form likelihoods are available, such demarginalization techniques often enable computationally attractive sampling devices (van Dyk and Meng, 2001). Within probabilistic phylogenetic contexts, a general data augmentation approach amounts to conditionally mapping specific instances of a Markovian substitution process running over the branches of a given phylogeny. The end result is a detailed substitution history compatible with the sequences at the leaves of the tree, with the timing and nature of all substitutions specified. Beyond the general interest of characterizing patterns of state changes along a phylogeny, these approaches, or variations thereof, have proved useful for implementing novel evolutionary models (Hobolth, 2008; Jensen and Pedersen, 2000; Mateiu and Rannala, 2006; Pedersen and Jensen, 2001; Robinson et al., 2003; Rodrigue et al., 2005; Yu and Thorne, 2006), for accelerating computations (Krishnan et al., 2004; Lartillot, 2006), and as means of conducting posterior predictive assessments (Bollback, 2006; Dimmic et al., 2005; Nielsen, 2002; Rodrigue et al., 2006).

Under common evolutionary models assuming independence between sites, Nielsen (2002) has proposed a scheme for drawing substitution mappings directly from their posterior distribution. The algorithm proceeds in two basic steps: (1) sample internal node states from their joint distribution, conditional on the data and the parameters of the Markov process; (2) sample the series of substitution events along each branch, conditional on parameters of the Markov process, and the states at both ends such as determined in step (1). The second step of the algorithm is an accept/reject approach: starting from the state at the ancestral node, run the Markov process—sampling the timing and nature of events to the end of the branch—and accept the resulting substitution mapping if the last event is consistent with the state of the descendant node; if not, reject the mapping and start over, until a consistent mapping is drawn. ‘The simulation scheme is efficient assuming that the rates of change between all nucleotides [states] is large [...],’ (Nielsen, 2002) and a provision is made to enforce the sampling of at least one event in cases where the ancestral and descendant states differ.

For some models, however, the rates of change between certain states are not large. For instance, the commonly used codon substitution models have infinitesimal rates of change of 0 between states differing by 2 or 3 nt. This has the effect of inducing stricter constraints on the possible coherent mappings, and under some conditions, Nielsen's second step stalls, entering an prolonged while-loop in attempting to sample an acceptable mapping.

In this work, we investigate an alternative that allows for a directed sampling of substitution mappings along each branch. The procedure is based on a uniformization technique previously used by Fearnhead and Sherlock (2006) to explore the occurrence of rare DNA motifs. Uniformization approaches have already been introduced in phylogenetic contexts by Mateiu and Rannala (2006), who used the method to approximate transition probability matrices, embedded within a Markov chain Monte Carlo (MCMC) sampler. Here, the method is adapted to directly sample a complete series of events conditional on the states at the beginning and end of a branch of a phylogeny, without the extra rejection step used by Nielsen (2002). It thus guaranties that a coherent mapping will be drawn every time, regardless of the particularities of the substitution model. The methodology is applicable in general, but we focus on practical advantages in the present context. Specifically, we have used the methods to implement a data augmentation (DA)-based Bayesian MCMC sampler, approximating posterior distributions under codon substitution models. Using real data sets, we show that the resulting sampling device yields approximations of posterior distributions that match those obtained using the standard pruning-based sampler, while displaying much greater computational efficiency.


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 DATA
 4 RESULTS AND DISCUSSION
 5 CONCLUSIONS AND FUTURE...
 ACKNOWLEDGEMENTS
 REFERENCES
 
2.1 Data augmentation
Let Q = [Qab] be the infinitesimal generator of a Markov process. Each off-diagonal entry in Q specifies the instantaneous rate of substitution from one state to another, and diagonal entries are set such that the sum of each row equals 0 (i.e. Qaa = –{sum}b != a Qab). The first step of the algorithm proposed by Nielsen (2002) is to sample the states at the internal nodes of the phylogenetic tree. This is accomplished by first sampling the state at the root node according to the product of the stationary probability and the conditional likelihood of each state, followed by a recursive sampling along the tree, with each case conditioned on the state at the immediate ancestral node (Bollback, 2006; Nielsen, 2002).

The second step of the algorithm consists of sampling the series of events between each node, for each site. For a branch of length {lambda} (where Q has been scaled, such that, for instance, {lambda} represents the expected number of substitutions per site), suppose that the starting state at a given position is a and that the ending state is b. The sampling strategy builds on the relation


Formula 1

(1)
where {varphi} represents the detailed substitution mapping leading from a to b, and {Phi} represents all possible mappings. The demarginalizing approach is based on sampling directly from p(b, {varphi} | a, {lambda}), which has the following density:


Formula 2

(2)
where

  • z stands for the total number of state changing events along the branch;
  • tk represents the timing of substitution event k (t0 = 0, 0 < t1 < ... < tz < {lambda});
  • sk – 1 and sk represent the states before and after substitution event k (sk != sk – 1, s0 = a, and sz = b);
  • qa = – Qaa represents the rate away from state a.

To sample according to (2), the accept/reject method proposed by Nielsen (2002) consists of first drawing from an exponential distribution of rate qa. If the draw is less than {lambda}, the state after the first event, written as s1 (s1 != a), is sampled according to s1 ~ Qas1/qa. From the new time point t1, the procedure is repeated, until the time drawn leads a value beyond {lambda}. If the final state sampled is not consistent with the state at the end of the branch (i.e. if sz != b), the simulation is scrapped and started anew, until sz = b. As previously mentioned, even when sampling conditional on there being at least one event along the branch (in the case a != b), this latter loop can block the algorithm under certain settings (e.g. with short branches, and where a and b differ by say 3 nt).

2.2 Uniformization for sampling mappings under constraints
As noted above, under the Markov process defined by Q, the waiting time until the next event depends on the current state of the process. The uniformization procedure (Gross and Miller 1984; Jensen 1953; Lartillot 2006; Mateiu and Rannala 2006) transforms the process defined by Q into a process allowing for virtual events, or self-substitutions (from a to a). Let R be the matrix of this process, obtained from


Formula 3

(3)
where µ ≥ max{qa} is the uniformization rate, and I is the identity matrix. Note here that the sum of each row in R equals 1; this is also called a stochastic matrix. Under the uniformized process, the waiting time until an event no longer depends on the current state, and the probability of having n events (including self-substitutions) over a branch length {lambda} is given by a Poisson distribution:


Formula 4

(4)
Also, taking powers of the matrix R yields the probability of starting in state a and ending at state b after n events:


Formula 5

(5)

We will suppose that we now want to draw a mapping along a branch, and that the states at the ends of the branch (a and b) have already been sampled (using Nielsen's first step). The overall method, described in Fearnhead and Sherlock (2006), can be summarized as three stage progressive demarginalization: (1) sample the number of events (always including virtual events) marginalized over their nature and timing; (2) sample the nature of events in order, marginalized over their exact timing and (3) sample the timing of events.

We first begin by drawing the number of events from the distribution p(n | a, b, {lambda}), which can be calculated from (4) and (5) according to Bayes' theorem:


Formula 6

(6)
Note that the denominator in (6) can be developed as follows:


Formula 7

(7)


Formula 8

(8)


Formula 9

(9)


Formula 10

(10)


Formula 11

(11)


Formula 12

(12)
This development highlights the fact that Q and R are different representation of the same underlying process, and hence both representations may be exploited in the overall sampling scheme. In particular, the form in (12) can be calculated employing a matrix diagonalization routine for matrix exponentiation, and thus used to draw the number of events: first sample g = p(b | a, {lambda}) x U, where U is a random number on the unit interval; and next, starting from n = 0, cumulate p(b | n, a)p(n | {lambda}) over successive values of n, until surpassing g; the final n is thus the number of events sampled.

Having sampled the number of events n, we now wish to sample the specific series of events leading from a to b. The state after the first event (s1) is sampled from


Formula 13

(13)
Then, having sampled the state after first event, the state after the second event (s2) is sampled from


Formula 14

(14)
and so on, until the n events have been sampled. Note that the second factor on the right hand side of (13) and (14) ensures that the state sampled will not ‘trap’ the mapping into a state sk which could not lead to sn = b in nk events. For instance, if n = 3, s2 = m, and m differs with b by two nucleotides, then Formula , and thus this particular state m could not have been sampled in (14).

Finally, if needed, we can draw n values uniformly distributed on [0, {lambda}], and sort the values to obtain the timing of events. Virtual events can then be removed so as to obtain a substitution history directly sampled from the posterior distribution.

2.3 Codon substitution models
We used a widely known codon substitution model modified from Goldman and Yang (1994), with the entries of Q given as


Formula 15

(15)
where {omega} is the non-synonymous/synonymous rate ratio, {kappa} is the transition/transversion ratio and {pi}b is the equilibrium frequency of codon b. This specification of {pi} as a full 61-dimensional vector is often denoted as F61, and we therefore refer to this model as GY-F61.

We also tried a very recent approach proposed by Huelsenbeck et al. (2006), where, rather than using a single global selection coefficient ({omega}), a mixture of K different selection regimes is used. The number of regimes (K) is treated as a random variable under a Dirichlet Process (DP) framework, which we indicate as GY-F61-DP. For this model, we scale the K Markov generators such that branch lengths are in terms of expected number of synonymous substitutions per site, as in Huelsenbeck et al. (2006).

We used the same prior structure as Huelsenbeck et al. (2006), except that we endowed hyperparameters {alpha} and β—respectively, the ‘graininess’ parameter of the Dirichlet Process and the mean of the exponential prior on branch lengths—with exponential priors of mean 1.

2.4 MCMC sampling and decorrelation time estimates
For pruning-based sampling, we used analogous MCMC operators to those used in Huelsenbeck et al. (2006). However, we have kept tree topologies fixed in the present work, and applied multiplicative update operators on branch lengths.

For DA-based sampling, we used an augmented likelihood function based on the product of (2) over all nodes and all sites, as well as the stationary probability of the Markov process. The details of these sampling approaches have been described previously (Lartillot, 2006; Rodrigue et al., 2007). The modification in the present context is that mappings are sampled based on the uniformization method. Our overall strategy is to begin by drawing a mapping conditional on particular parameter configuration, as described above. Then, conditional on the mapping drawn, the sampler cycles over a series of parameter update operators. Following the sequence of updates, matrices are diagonalized, internal node states are re-sampled, the stochastic matrix (and its powers) is computed and the detailed series of events between nodes is re-sampled, thus initiating the next cycle. Note that under GY-F61-DP settings, a set of K stochastic matrices (and their powers) are constructed from the global parameters and the K selection regimes, and that site-specific mappings are drawn under the current affiliation configuration of the Dirichlet process.

To evaluate the mixing behavior of the sampling methods, we estimated the decorrelation time of MCMC runs based on the first zero-crossing of the autocorrelation function of the log-likelihood. More precisely, we repeatedly computed the autocorrelation of the (pruning-based) log-likelihood, on samples of 100, but with an increasing number of cycles, or lag, between each draw. The lag required for the first zero-crossing of the autocorrelation multiplied by the time needed per cycle gives a rough estimate of the decorrelation time. As before, note that under the GY-F61-DP model the log-likelihood for a particular position is computed based on the selection regime currently affiliated to it.


    3 DATA
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 DATA
 4 RESULTS AND DISCUSSION
 5 CONCLUSIONS AND FUTURE...
 ACKNOWLEDGEMENTS
 REFERENCES
 
We explored the methods using three previously studied data sets, which we refer to here using a shorthand indicating the gene type, the number of sequences and their length in number of codons:

  • GLOBIN17-144: seventeen vertebrate sequences of the β-globin gene, described in Yang et al., (2000);
  • LYSIN25-134: twenty-five abalone sperm lysin sequences, described in Yang and Swanson (2002) and
  • HIV22-99: twenty-two sequences of the human immunodeficiency virus type 1 protease, described in Doron-Faigenboim and Pupko (2006).
In all cases, we worked under the same tree topologies as those used in the respective works cited above.


    4 RESULTS AND DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 DATA
 4 RESULTS AND DISCUSSION
 5 CONCLUSIONS AND FUTURE...
 ACKNOWLEDGEMENTS
 REFERENCES
 
4.1 Checking the uniformization procedure
In our preliminary analyses, we used a simplified codon substitution model, with {kappa} = 1, {omega} = 1, and {pi} set according to the nucleotide frequencies observed at each codon position in the GLOBIN17-144 data set—a setting commonly referred to as F3 x 4. Under this model, we repeatedly observed stalls utilizing Nielsen's data augmentation procedure, resulting from a prolonged while-loop in the second step of the algorithm. As a specific example of the problem, we encountered an excursion of the MCMC where the states sampled at the ends of a branch for a position were GGC for the ancestral node, and TCA for the descendant node. The branch length separating the two nodes in this particular instance was 0.02 expected substitutions per codon. Sampling the series of events from GGC to TCA using the accept/reject approach proposed by Nielsen typically required over 5 million attempts.

In contrast, employing the uniformization technique described above, a coherent mapping is sampled every time. To highlight these properties of the sampling procedure, we first computed the probability of going from GGC to TCA over the distance 0.02 using matrix exponentiation [Equation (12)], still using the simplified F3 x 4 model, giving a value of 2.31902 x 10 8. As noted above, using the uniformized process given by R, this same probability can be computed from


Formula 16

(16)
Setting µ = max{qa} as the uniformization rate, we computed the sum in (16) from n = 0 to n = 9, reported in Table 1. The sum converges to the same value computed by matrix exponentiation in about n = 7 events. Note, however, that for n = 0, n = 1 and n = 2, no contribution is made to the sum, simply because the states require at least 3 events. Nielsen's method, which does not take this point into account, routinely samples only 1 or 2 events in this example, and even when 3 events are sampled, these are only rarely the three minimum substitutions needed for going from GGC to TCA. Under the uniformized process, a minimum of 3 events must be sampled, in which case the sampling scheme will ensure that each event corresponds to 1 of 3 minimal events.


View this table:
[in this window]
[in a new window]

 
Table 1. Convergence of the sum in equation (16) to the value evaluated by matrix exponentiation

 
In an independent check, we note that the successive powers of R should converge to the stationary probability of the target codon:


Formula 17

(17)
Figure 1 displays this check in three instances. These and all other instances were found to converge properly. It is also interesting to note that the uniformized process has no memory of the initial state beyond about 20 events.


Figure 1
View larger version (9K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. Convergence of the successive powers of R to the limiting distribution using three examples. In each case, a = AGC; the stationary probability of the target codon (b = GTG, b = CAG and b = AGG) is displayed using a dashed line.

 
4.2 Posterior distributions
Next, we ran both pruning-based and DA-based MCMC samplers to approximate the posterior under the GY-F61 model. The posterior expectations of several statistics are summarized in Table 2, and the full posterior distributions of substitution parameters are displayed graphically in Figures 2 and 3.


Figure 2
View larger version (25K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. Posterior distributions of {kappa} and {omega}. Pruning-based are displayed with a full line and DA-based results are displayed with a dashed line. The top two panels (a and b) refer to the GLOBIN17-144 data set, followed by LYSIN25-134 (c and d), and HIV22-99 (e and f).

 

Figure 3
View larger version (29K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. Pruning-based (full line) and DA-based (dashed line) posterior 95% credibility intervals of codon stationary probabilities, sorted according to amino acids. For each codon state, a marking x indicates the stationary probability implied by an F3 x 4 empirical setting. The leftmost panel (a) refers to the GLOBIN17-144 data set, followed by LYSIN25-134 (b) and HIV22-99 (c).

 

View this table:
[in this window]
[in a new window]

 
Table 2. Posterior expectations under the GY-F61 model

 
As expected, the two sampling schemes produce matching results. We note the marked differences in {kappa} and {omega} across data sets, and the greater uncertainty associated with higher parameter values for these parameters (Table 2, Fig. 2). Also noteworthy, we find that the codon stationary probabilities implied by an F3 x 4 empirical setting are often well outside of the 95% credibility interval (Fig. 3), suggesting that the practice of pre-fixing these parameters in this way may not be ideal (Huelsenbeck and Dyer, 2004). On the other hand, small data sets may lead to large uncertainties with regard to the full 61-dimensional probability vector. Indeed, as can be appreciated graphically from Figure 3, the 95% credibility intervals of stationary probabilities are broadest for the smallest data set (HIV22-99). More work is needed to compare the statistical merits of different parametric choices at this level.


View this table:
[in this window]
[in a new window]

 
Table 3. MCMC decorrelation times

 
We next ran both sampling devices under the GY-F61-DP model. Huelsenbeck et al. (2006) performed analyses under the Dirichlet process prior by systematically fixing {alpha} to predefined values. Here, given that we treat {alpha} as a free parameter, we inspected its posterior distribution, as well as that of the number of selection coefficients (K). The results using both pruning and DA sampling methods are displayed in Figure 4, and, as before, are found to match well with each other. Worth noting are the distributions of K; the distributions are situated at relatively low values, in comparison with the overall length of alignments, but are still at consistently higher values than the common usage of finite mixture models of the same type, which are typically fixed at K = 3 (Yang et al., 2000). Here again, however, more work is needed to quantify the statistical merits of the Dirichlet process framework in this context.


Figure 4
View larger version (18K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4. Pruning-based (full line) and DA-based (dashed line) posterior distributions of {alpha} and K under the GY-F61-DP model. The top two panels (a and b) refer to the GLOBIN 17-144 data set, followed by LYSIN25-134 (c and d), and HIV22-99 (e and f)

 
Finally, as described in Huelsenbeck et al. (2006), we computed site-specific probabilities of positive selection p({omega} > 1) under the Dirichlet process, simply as the proportion of draws from the posterior in which a site is found to be affiliated to a class having {omega} > 1. The results obtained using DA-based sampling are summarized for all positions and all three data sets in Figure 5, and results under pruning-based sampling are graphically indistinguishable. The results corroborate previous studies, although without assuming a specific parametric form for heterogeneities across positions. The complexity of spike patterns for each data set (Fig. 5) relates intuitively to the posterior distribution of K; a data set with a pronounced heterogeneity of non-synonymous substitution rates across positions (e.g. LYSIN25-134, Fig. 5b) induces high values of K (Fig. 4d). In contrast, most positions of the GLOBIN17-144 data set appear to be under strong purifying selection, such that the overall number of components required for adequately capturing non-synonymous heterogeneity is lower (Fig. 4b).


Figure 5
View larger version (19K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5. Posterior probability of each site being under positive selection. Top panel (a) refers to the GLOBIN17-144 data set, followed by LYSIN25-134 (b), and HIV22-99 (c).

 
4.3 Computational comparisons
Although both pruning and DA-based samplers yield matching approximations of posterior distributions, the efficiency of the two methods differs significantly. The estimated decorrelation times under both models, and utilizing both sampling devices, are reported in Table 3. The DA-based sampler yields a much lower decorrelation time, meaning that quality samples from the posterior distribution can be obtained more quickly.

The computational improvements of the DA-based sampler mainly stem from having factored out the most expensive calculations. In particular, the pruning algorithm for computing the likelihood (Felsenstein et al., 1981) need only be called once per cycle, before drawing internal node states. Also, because the DA-based sampler relies on an augmented likelihood function [based on Equation (2)], calls to matrix diagonalization routines are not needed when proposing parameter updates, which can become expensive when manipulating K different 61 x 61 matrices (under the GY-F61-DP model) with the pruning-based sampler.

Nonetheless, the uniformization technique for sampling mappings is computationally demanding. Under GY-F61-DP model specifically, the DA-based MCMC sampler can spend as much as 20% of its time in this step. Calculation of the successive powers of stochastic matrices is the rate-limiting operation of the overall re-sampling of mappings. Always setting µ = max{qa} as the uniformization rate, the highest number of events sampled along a branch for a codon site was observed with the GLOBIN17-144 data set under the GY-F61-DP model, at n = 60, which implies as many powers of the stochastic matrix. The highest number of actual state changing events was z = 46.The highest number of virtual events was nz = 39. These are the extremes reached in the course of all DA-based MCMC runs, and in practice most sites were observed to undergo 0, 1 or 2 events along a branch, with cases z > 3 being only 2.7% of all branch-wise mappings (z > 10 being 0.042%). Given the difficulty in anticipating the number of events before running the procedure, we found it important to design our implementation so as to only compute the number of powers needed, in accordance with the sampling scheme described in Section 2.2.


    5 CONCLUSIONS AND FUTURE DIRECTIONS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 DATA
 4 RESULTS AND DISCUSSION
 5 CONCLUSIONS AND FUTURE...
 ACKNOWLEDGEMENTS
 REFERENCES
 
The uniformization method proposed here generalizes previous methods for drawing substitution mappings (Lartillot, 2006; Nielsen, 2002; Robinson et al., 2003). While we have focused on using the method for designing fast implementations of flexible codon substitution models, it should also prove useful in studying other models involving sparse or highly peaked Markov generators, such as covarion approaches (Guindon et al., 2004; Huelsenbeck, 2001; Tuffley and Steel, 1998), or site-heterogeneous models (Lartillot and Philippe, 2004; Pagel and Meade, 2004).

The DA-based MCMC device used in this work required considerable trial-and-error tuning of update operators, as well as the relative frequency of mapping re-sampling, and greater computational improvements may be possible with further refinements. We are currently investigating the exploitation of conjugacy properties in order to implement a Gibbs sampling scheme, which would simplify such an empirical tuning (Lartillot 2006). Other possibilities for further computational improvements include the combined use of different methods for generating mappings. Hobolth (2008) has recently proposed another method showing promise, and even Nielsen's algorithm appears stable in simple cases. Indeed, one might imagine including mechanisms for choosing between different techniques, depending on the difficulty anticipated in sampling a mapping. This may require significant empirical explorations, in order to determine adequate rules for calling one method or another.

While we have kept the topology fixed in the present work, applications of the techniques used here could be extended to cases where the tree topology is treated as a free parameter. More precisely, by combining the standard pruning-based MCMC operators on tree topologies with the DA-based operators on parameters, one can envisage devising an efficient sampler across tree space under a broad set of evolutionary models (Lartillot, 2006). This could provide an attractive framework in the context of codon substitution models, where topology explorations are generally considered computationally too prohibitive in practice (Ren et al., 2005).

Other potential applications of the uniformization method could include instantiations of the posterior predictive model assessment framework (Bollback, 2006; Nielsen, 2002), the simulation of mappings from more sophisticated proposal densities for site-interdependent contexts (Robinson et al., 2003), or for designing marginal likelihood estimation devices (Rodrigue et al., 2007). The latter research perspective is of particular interest in the context of codon substitution models, where the statistical merit of several models remains unresolved (Huelsenbeck et al., 2006; Robinson et al., 2003).


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 DATA
 4 RESULTS AND DISCUSSION
 5 CONCLUSIONS AND FUTURE...
 ACKNOWLEDGEMENTS
 REFERENCES
 
We wish to thank John Huelsenbeck for correspondence regarding implementation issues, the Réseau Québécois de calcul de haute performance for computational resources and Asger Hobolth as well as two anonymous reviewers for comments on the manuscript. This work was supported by Génome Québec, the biT fellowships for excellence (a Canadian Institutes of Health Research strategic training program grant in bioinformatics), the Robert Cedergren Centre for bioinformatics and genomics, the Canadian Institute for Advanced Research, the Canadian Research Chair Program, the Centre National de la Recherche Scientifique (through the ACI-IMPBIO Model-Phylo funding program) and the 60ème commission franco-québécoise.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Keith Crandall

Received on May 29, 2007; revised on September 9, 2007; accepted on October 16, 2007

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 DATA
 4 RESULTS AND DISCUSSION
 5 CONCLUSIONS AND FUTURE...
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Bollback JP. Simmap: stochastic character mapping of discrete traits on phylogenies. BMC Bioinformatics (2006) 7:88.[CrossRef][Medline]

    Dimmic MW, et al. Detecting coevolving amino acid sites using Bayesian mutational mapping. Bioinformatics (2005) 21:S126–S135.[CrossRef]

    Doron-Faigenboim A, Pupko T. A combined empirical and mechanistic codon model. Mol. Biol. Evol (2006) 24:388–397.[CrossRef][Web of Science][Medline]

    Fearnhead P, Sherlock C. An exact Gibbs sampler for the Markov-modulated Poisson process. J. R. Statist. Soc. B (2006) 68:767–784.[CrossRef]

    Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol (1981) 17:368–376.[CrossRef][Web of Science][Medline]

    Gelman A. Parameterization and Baysian modeling. J. Am. Stat. Assoc (2004) 99:537–545.[CrossRef][Web of Science]

    Goldman N, Yang Z. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol (1994) 11:725–736.[Abstract]

    Gross D, Miller DR. The randomization technique as a modeling tool and solution procedure for transient Markov processes. Oper. Res (1984) 32:343–361.[Abstract/Free Full Text]

    Guindon S, et al. Modeling the site-specific variation of selection patterns along lineages. Proc. Natl. Acad. Sci. USA (2004) 101:12957–12962.[Abstract/Free Full Text]

    Hobolth A. A Markov chain Monte Carlo expectation maximization algorithm for statistical analysis of DNA sequence evolution with neighbour-dependent substitution rates. J. Comput. Graph. Stat (2008) In press.

    Huelsenbeck JP. Testing the covariotide model of DNA substitution. Mol. Biol. Evol (2001) 19:698–707.[Web of Science]

    Huelsenbeck JP, Dyer KA. Bayesian estimation of positively selected sites. J. Mol. Evol (2004) 58:661–672.[CrossRef][Web of Science][Medline]

    Huelsenbeck JP, et al. A Dirichlet process model for detecting positive selection in protein-coding DNA sequences. Proc. Natl. Acad. Sci. USA (2006) 103:6263–6268.[Abstract/Free Full Text]

    Jensen A. Markoff chains as an aid in the study of Markoff processes. Skandinavisk Aktuarietidskriff (1953) 36:87–91.

    Jensen JL, Pedersen A-MK. Probabilistic models of DNA sequence evolution with context dependent rates of substitution. Adv. Appl. Probab (2000) 32:499–517.[CrossRef]

    Krishnan NM, et al. Ancestral sequence reconstruction in primate mitochondrial DNA: compositional bias and effect on functional inference. Mol. Biol. Evol (2004) 21:1871–1883.[Abstract/Free Full Text]

    Lartillot N. Conjugate sampling for phylogenetic models. J. Comput. Biol (2006) 13:1701–1722.[CrossRef][Web of Science][Medline]

    Lartillot N, Philippe H. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol. Biol. Evol (2004) 21:1095–1109.[Abstract/Free Full Text]

    Mateiu L, Rannala B. Inferring complex DNA substitution processes on phylogenies using uniformization and data augmentation. Syst. Biol (2006) 55:259–269.[CrossRef][Web of Science][Medline]

    Nielsen R. Mapping mutations on phylogenies. Syst. Biol (2002) 51:729–739.[CrossRef][Web of Science][Medline]

    Pagel M, Meade A. A phylogenetic mixture model for detecting pattern-heterogeneity in gene sequence or character-state data. Syst. Biol (2004) 53:561–581.

    Pedersen A-MK, Jensen JL. A dependent rates model and MCMC based methodology for the maximum likelihood analysis of sequences with overlapping reading frames. Mol. Biol. Evol (2001) 18:763–776.[Abstract/Free Full Text]

    Ren F, et al. An empirical examination of the utility of codon substitution models in phylogeny reconstruction. Syst. Biol (2005) 54:808–818.[Abstract/Free Full Text]

    Robert CP, Casella G. Monte Carlo Statistical Methods. (2004) New York: Springer.

    Robinson DM, et al. Protein evolution with dependence among codons due to tertiary structure. Mol. Biol. Evol (2003) 18:1692–1704.

    Rodrigue N, et al. Site interdependence attributed to tertiary structure in amino acid sequence evolution. Gene (2005) 347:207–217.[CrossRef][Web of Science][Medline]

    Rodrigue N, et al. Assessing site-interdependent phylogenetic models of sequence evolution. Mol. Biol. Evol (2006) 23:1762–1775.[Abstract/Free Full Text]

    Rodrigue N, et al. Exploring fast computational strategies for probabilistic phylogenetic analysis. Syst. Biol (2007) 56:711–726.[CrossRef][Web of Science][Medline]

    Tuffley C, Steel M. Modeling the covarion hypothesis of nucleotide substitution. Math. Biosci (1998) 147:63–91.[CrossRef][Web of Science][Medline]

    van Dyk DA, Meng XL. The art of data augmentation. J. Comput. Graph. Stat (2001) 10:1–50.[CrossRef]

    Yang Z, Swanson WJ. Codon substitution models to detect adaptive evolution that account for heterogeneous selective pressures among site classes. Mol. Biol. Evol (2002) 19:49–57.[Abstract/Free Full Text]

    Yang Z, et al. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics (2000) 155:431–449.[Abstract/Free Full Text]

    Yu J, Thorne JL. Dependence among sites in RNA evolution. Mol. Biol. Evol (2006) 23:1525–1537.[Abstract/Free Full Text]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Mol Biol EvolHome page
N. Rodrigue, C. L. Kleinman, H. Philippe, and N. Lartillot
Computational Methods for Evaluating Phylogenetic Models of Coding Sequence Evolution with Dependence between Codons
Mol. Biol. Evol., July 1, 2009; 26(7): 1663 - 1676.
[Abstract] [Full Text] [PDF]


Home page
Mol Biol EvolHome page
M. Anisimova and C. Kosiol
Investigating Protein-Coding Sequence Evolution with Probabilistic Codon Substitution Models
Mol. Biol. Evol., February 1, 2009; 26(2): 255 - 271.
[Abstract] [Full Text] [PDF]


Home page
Brief BioinformHome page
W. Delport, K. Scheffler, and C. Seoighe
Models of coding sequence evolution
Brief Bioinform, January 1, 2009; 10(1): 97 - 109.
[Abstract] [Full Text] [PDF]


Home page
Phil Trans R Soc BHome page
V. N Minin and M. A Suchard
Fast, accurate and simulation-free stochastic mapping
Phil Trans R Soc B, December 27, 2008; 363(1512): 3985 - 3995.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
24/1/56    most recent
btm532v3
btm532v2
btm532v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (5)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Rodrigue, N.
Right arrow Articles by Lartillot, N.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Rodrigue, N.
Right arrow Articles by Lartillot, N.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?