Skip Navigation


Bioinformatics Advance Access originally published online on November 30, 2004
Bioinformatics 2005 21(9):1797-1806; doi:10.1093/bioinformatics/bti151
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/9/1797    most recent
bti151v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (1)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Husmeier, D.
Right arrow Articles by Milne, I.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Husmeier, D.
Right arrow Articles by Milne, I.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2004. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

Detecting interspecific recombination with a pruned probabilistic divergence measure

Dirk Husmeier 1,*, Frank Wright 2 and Iain Milne 2

1Biomathematics and Statistics Scotland, JCMB The King's Buildings, Edinburgh, EH9 3JZ, UK
2Biomathematics and Statistics Scotland, SCRI Invergowrie, Dundee, DD2 5DA, UK

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 INTRODUCTION
 METHOD
 DATA AND SIMULATIONS
 RESULTS
 DISCUSSION
 REFERENCES
 

Motivation: A promising sliding-window method for the detection of interspecific recombination in DNA sequence alignments is based on the monitoring of changes in the posterior distribution of tree topologies with a probabilistic divergence measure. However, as the number of taxa in the alignment increases or the sliding-window size decreases, the posterior distribution becomes increasingly diffuse. This diffusion blurs the probabilistic divergence signal and adversely affects the detection accuracy. The present study investigates how this shortcoming can be redeemed with a pruning method based on post-processing clustering, using the Robinson–Foulds distance as a metric in tree topology space.

Results: An application of the proposed scheme to three synthetic and two real-world DNA sequence alignments illustrates the amount of improvement that can be obtained with the pruning method. The study also includes a comparison with two established recombination detection methods: Recpars and the DSS (difference of sum of squares) method.

Availability: Software, data and further supplementary material are available at the following website: http://www.bioss.sari.ac.uk/~dirk/Supplements/

Contact: dirk{at}bioss.ac.uk


    INTRODUCTION
 TOP
 Abstract
 INTRODUCTION
 METHOD
 DATA AND SIMULATIONS
 RESULTS
 DISCUSSION
 REFERENCES
 
The underlying assumption of most phylogenetic tree reconstruction methods is that there is one set of hierarchical relationships among the taxa. While this is a reasonable approach when applied to most DNA sequence alignments, it can be violated in certain bacteria and viruses due to interspecific recombination. The resulting transfer or exchange of DNA subsequences can lead to a change of the branching order (topology) in the affected region, which results in conflicting phylogenetic information from different regions of the alignment. If undetected, the presence of these so-called mosaic sequences can lead to systematic errors in phylogenetic tree estimation. Their detection, therefore, is a crucial prerequisite for consistently inferring the evolutionary history of a set of DNA sequences.

Recently, various methods for detecting evidence of interspecific recombination in DNA sequence alignments have been developed; see Posada et al. (2002) for a review. The objective of the present article is to propose an improved phylogenetic window method, which has been motivated by and is based on the ideas of PLATO (partial likelihood assessed through tree optimization), the DSS (difference of sum of squares) method, and the PDM (probabilistic divergence measure) method.

PLATO, proposed by Grassly and Holmes (1997) and illustrated in the left panel of Figure 1, computes the likelihood of various subregions of the DNA sequence alignment from a reference tree and searches for subregions with significantly low likelihood scores. The idea is that if the reference tree is the ‘true’ tree, then recombinant regions will show a low likelihood due to the change of the tree topology. However, the ‘true’ tree is not known and is approximated by a tree estimated from the whole sequence alignment. This includes the recombinant regions, which perturb the parameter estimation for the reference tree. Consequently, the method becomes increasingly unreliable as the recombinant regions grow in length.



View larger version (9K):
[in this window]
[in a new window]
 
Fig. 1 Different methods for detecting recombination. The figure illustrates three window-based phylogenetic methods for detecting recombination in DNA sequence alignments. Left: PLATO. A window of varying size is moved along the DNA sequence alignment. The average log likelihood is computed for both the window and the flanking regions of the sequence alignment, and the Q statistic is defined as the ratio of these values. Large Q values are taken as indications of recombinant regions. Centre: The DSS method. A window consisting of two subwindows is moved along the DNA sequence alignment. A tree is estimated from the left subwindow, and a goodness-of-fit score is computed for both subwindows. The DSS statistic is defined as the difference between these scores. When the window is centred on or near the breakpoint of a recombinant region, the tree estimated from the left subwindow is not an adequate model for the data on the right, which leads to a large DSS value. Right: The PDM method. The figure shows the posterior distribution P(S|Dt) of tree topologies S conditional on two subsets of columns, Dt and Dt', selected by a moving window. When the window is moved into a recombinant region, the posterior distribution P(S|Dt) can be expected to change significantly, which leads to a high probabilistic divergence score. Further details are given in the text.

 
To overcome this shortcoming, the DSS method (McGuire et al., 1997; McGuire and Wright, 2000), illustrated in the middle of Figure 1, replaces the global by a local reference tree. A window of typically about 500 nucleotides is slid along the DNA sequence alignment. On the first half of the window, a distance matrix is calculated according to some Markov model of nucleotide substitution, and a phylogenetic tree is estimated using the least squares method. A distance matrix is then calculated for the second half of the window, and the topology estimated from the first half is fitted to it, again using least squares. Obviously, when the topology in the right window has changed as a result of recombination, the topology in the left window will be a poor fit to the distance matrix from the right, and the difference of the two sum-of-squares statistics – termed the ‘DSS’ statistic – will be large. The motivation for using a score based on pairwise sequence distances rather than the likelihood is the application of a rescaling method to distinguish between recombination and rate variation, as described in McGuire and Wright (2000). However, it is well known that DNA sequences contain more information than pairwise distances, and that estimation methods based on pairwise distances suffer from an inevitable information loss, which is aggravated by the fact that the reference tree is computed from a small section of the sequence alignment.

The PDM method (Husmeier and Wright, 2001b), illustrated in the right panel of Figure 1, is akin to the DSS method, but addresses some of its shortcomings. First, by using a likelihood score, the intrinsic information loss associated with a score based on pairwise distances is prevented. Second, a single optimized reference tree is replaced by a distribution over trees, which captures the intrinsic uncertainty of tree estimation from short subsections of sequence alignments. Third, the method focuses on topology changes, thereby distinguishing true recombination events from the confounding effect of rate heterogeneity. An illustration of the concepts is given in the right panel and the caption of Figure 1.

For a formal introduction, consider a DNA sequence alignment, , where we follow the usual convention that rows represent DNA sequences, and columns represent sites. From , we select a subset of W consecutive columns (sites), centred on the tth site of the alignment; we refer to this as a ‘window’ of width W. Let S be an integer label for tree topologies, and consider the marginal posterior probability of tree topology S conditional on the ‘window’ ,

(1)
From Bayes rule we have

(2)
where is the likelihood, which is computed with the pruning algorithm (Felsenstein, 1981), and P(S, w, {theta}) is the prior, which is chosen uniform, as in Larget and Simon (1999). Note that Equation (1) includes a marginalization over the branch lengths w of the phylogenetic tree and the parameters of the nucleotide substitution model, {theta}. Since the integral in Equation (1) is analytically intractable, it has to be solved numerically by means of a Markov chain Monte Carlo (MCMC) simulation, following, for instance, Larget and Simon (1999). This leads to

(3)
where M is the MCMC sample size, and Mt(S) denotes the number of times the MCMC simulation applied to subalignment visits topology S. The basic idea of the PDM method for detecting recombinant regions is to move the window along the alignment, as illustrated in the right panel of Figure 1, and to monitor the distribution . We would then, obviously, expect a substantial change in the shape of this distribution as we move the window into a recombinant region. To quantify the degree of change, a probabilistic divergence measure D(t) is computed. Define

(4)
which is based on the Kullback–Leibler (KL) divergence:

(5)
The divergence measure D(t) is obtained by averaging in Equation (4) over different degrees of window overlap, typically 0.1 W ≤ 2{Delta}t ≤ 0.9W. A more detailed discussion is available in the supplementary material and in Husmeier and Wright (2001b).


    METHOD
 TOP
 Abstract
 INTRODUCTION
 METHOD
 DATA AND SIMULATIONS
 RESULTS
 DISCUSSION
 REFERENCES
 
The shortcoming of the PDM method is illustrated in Figure 2. A substantial change in the tree topology caused by a recombination event is not distinguished from changes in the clade configurations of closely related taxa. As the number of sequences in the alignment increases, so does the number of such within-clade reconfigurations. This is because for an increased number of taxa the posterior distribution over tree topologies, , becomes increasingly disperse unless the size of the data set is increased. An increased amount of data , however, corresponds to an increased length of the sliding window, which compromises the spatial resolution of the detection and is not an option for short alignments.



View larger version (4K):
[in this window]
[in a new window]
 
Fig. 2 Shortcoming of the PDM method. A substantial change in the branching structure of a phylogenetic tree, illustrated on the left, is indicative of a recombination event. Changes of the branching structure that only involve closely related strains, shown on the right, may result as a mere consequence of statistical fluctuations. The PDM method of Husmeier and Wright (2001b) does not distinguish between these two types of changes in the tree topology, which renders the approach suboptimal.

 
A simple pruning scheme
A possible remedy for this ‘dispersion’ problem is to include information on the amount of change between different tree topologies and thereby to distinguish between the scenarios depicted in Figure 2. Since branch lengths have been marginalized over for the reason mentioned in the previous section, the difference has to be estimated solely on the basis of topological information. A possible metric in the space of tree topologies was introduced by Robinson and Foulds (1981). The idea is illustrated in Figure 3. Consider the two complimentary operations of merging two existing nodes into one, and splitting an existing node into two. The Robinson–Foulds (RF) distance between tree topologies S1 and S2 is defined as the minimum number of operations required to transform S1 into S2 (or S2 into S1). Robinson and Foulds (1981) have derived a simple algorithm for the practical computation, which has been implemented in various phylogeny software packages; see, for instance, program ‘treedist’ in Felsenstein (1996, http://www.evolution.genetics.washington.edu/phylip.html). Define to denote the average posterior distribution of tree topologies, averaged over all sliding window positions:

(6)
in which NW denotes the total number of window positions. The support of the average posterior distribution is the set of those tree topologies for which , that is, the set of all tree topologies visited during the MCMC simulations. Now, an application of the PDM method to a DNA sequence alignment of 10 strains of Hepatitis-B virus, discussed below, was found to lead to a support of 126 distinct tree topologies. Usually, we do not expect to find over hundred recombination events in an alignment of about 3000 nucleotides. Hence most of topology changes correspond to within-clade reconfigurations resulting from diffuse posterior distributions, which degrades the reliability and efficacy of the PDM method.



View larger version (18K):
[in this window]
[in a new window]
 
Fig. 3 RF-distance based pruning. Top: The RF distance between tree topologies is defined as the minimum number of contraction and decontraction operations needed to transform one topology into the other. Bottom: On the average posterior distribution of tree topologies, averaged over all sliding window positions, a cutoff threshold is defined. Tree topologies above this threshold are kept as ‘principal topologies’. Tree topologies below the threshold are assigned to the principal topology with the minimum RF distance. After this re-assignment, the posterior distribution is re-normalized.

 
A way to proceed, then, is to reduce the dispersion of the posterior distribution by reducing the cardinality of the support of . This can be effected with a pruning scheme based on the RF distance. First, identify a set of principal tree topologies, for instance, those that maximize . Next, assign each non-principal tree topology to the principal topology with the minimum RF distance. Finally, renormalize the posterior distributions and recompute the PDM signal. An illustration is given in Figure 3. The reduction in the dispersion of is likely to reduce the noise in the PDM signal. Note that similar pruning methods are used in machine learning to improve the generalization performance of a predictor (Bishop, 1995). Also note that the pruning of the support of has some similarity with a Bayesian approach in that it brings the data-based prediction in line with our prior assumption about the expected frequency of recombination events.

Improved pruning by clustering
The pruning scheme described in the previous subsection has an obvious disadvantage: when the recombinant region is short, the set of principal topologies may not contain any topology that reflects the recombination event. Also, the assignment of non-principal to principal topologies can be interpreted as the first step of an incomplete K-means clustering procedure. Such topology-based clustering of phylogenetic trees was proposed by Stockham et al. (2002) to reduce the information loss incurred by a single consensus tree approach. It here offers the obvious method to be used for pruning. Based on the RF distance, the MCMC sample of tree topologies, {S1,...,SM}, is clustered. For a given number of clusters K, tree topologies are assigned to their respective cluster :

(7)
Now, define the posterior distribution over clusters,

(8)
which is computed from the MCMC sample of tree topologies, {S1, ..., SM}:

(9)
An improved pruning scheme can be effected by using the posterior distribution over classes (9) instead of the original posterior distribution over topologies (3) in the computation of the divergence measure D(t) in Equation (4). This novel approach of computing D(t) from rather than will henceforth be referred to as the pruned PDM method.

Note that while this method addresses the shortcomings of the original PDM method and the simple pruning scheme of Figure 3, it is still heuristic in that the choice of the clustering algorithm is arbitrary. The present work is based on the findings of Stockham et al. (2002). When applying MCMC to sample trees from the posterior distribution, as described in Larget and Simon (1999), one is faced with the problem of summarizing the information contained in the MCMC sample succinctly. A widely applied method is to resolve the conflicts within the obtained sample of tree topologies by computing a consensus tree, which, however, may incur a substantial information loss. Stockham et al. (2002) investigated a post-processing alternative, by which trees are first divided into subsets with some clustering algorithm based on the RF distance, and each cluster is then characterized by its own consensus tree. The authors assessed various clustering algorithms with different measures of information loss. They found that bottom-up average linkage clustering outperformed top-down K-means clustering and the method of phylogenetic islands (Maddison, 1991) in terms of complexity versus information content, and the former scheme will therefore be used in the present study.

The pruning-by-clustering procedure proposed in the present paper thus works as follows. First, generate an exhaustive list of all distinct tree topologies sampled in the complete set of MCMC simulations, that is, the MCMC simulations carried out for all window positions. Next, compute all pairwise RF distances between topologies, and infer a dendogram of topologies with agglomerative average linkage clustering from these distances. Cut this dendogram at a certain height such that it disintegrates into a pre-defined number of clusters. Finally, assign each tree topology to its respective cluster and use this assignment to compute the posterior distribution over clusters by application of Equation (8). This posterior distribution over clusters supersedes the original posterior distribution over topologies in the computation of the divergence measure (4). A summary of the proposed algorithm is given in Figure 4.



View larger version (41K):
[in this window]
[in a new window]
 
Fig. 4 Pruning by clustering. The figure summarizes the various steps involved in the computation of the pruned probabilistic divergence signal.

 

    DATA AND SIMULATIONS
 TOP
 Abstract
 INTRODUCTION
 METHOD
 DATA AND SIMULATIONS
 RESULTS
 DISCUSSION
 REFERENCES
 
The evaluation of a recombination detection method is best carried out through a combined analysis of real-world and simulated DNA sequence alignments. For simulated data, the true location of the recombinant regions and their breakpoints is known; hence a straightforward assessment of the detection accuracy is possible. The disadvantage, however, is that the models used in the simulation study are simplifications of reality. Real-world DNA sequence alignments are obviously not affected by this shortcoming. However, the assessment of the detection accuracy is impeded by the fact that the true location of the recombinant regions and their breakpoints is ultimately unknown. We therefore carried out the present evaluation on three synthetic data sets, for which the level of detection difficulty varied substantially, and two real-world DNA sequence alignments. These sequence alignments can be obtained from the accompanying website.

Simulated recombination
We tested the proposed detection method on two of the simulated data sets used in Husmeier and Wright (2001b). Here, evolution in a population of eight taxa was simulated with the Kimura model (Kimura, 1980), from which a DNA sequence alignment of 5500 nucleotides was generated. Two recombination events were simulated: an ancient event, affecting the region between sites 1000 and 1500, and a recent event, affecting the region between sites 2500 and 3000. A mutational hotzone of the same length, located between sites 4000 and 4500, was evolved at an increased nucleotide substitution rate (factor 3); this is to test whether the detection method can successfully distinguish between recombination and rate variation.

For the present study, we used the data sets B1 and B3 in Husmeier and Wright (2001b). For B1, henceforth referred to as simulated alignment 1, the average branch length of the underlying phylogenetic tree was w = 0.1. For B3, henceforth referred to as simulated alignment 2, this value was reduced to w = 0.01. The ensuing reduction in the number of polymorphic sites renders the second detection problem harder, because recombination tends to be easier to detect with increasing levels of divergence (see also Posada, 2002 p. 714).

We also generated a new sequence alignment, which followed the simulation procedure described in Husmeier and Wright (2001b) but reduced the lengths of the tree branches even further, drawing them from a uniform distribution on the interval [0.003, 0.005]. The motivation for this data set, henceforth referred to as simulated alignment 3, was to increase the detection difficulty deliberately to a level where all alternative detection methods investigated in this study were found to fail.

Hepatitis-B virus
Hepatitis B is caused by a DNA virus with a short genome of 3200 nucleotides. Evidence for recombination was found in Bollyky et al. (1996). The present study investigates a subset of two recombinant strains (Genbank accession numbers D00329 and X68292) and eight non-recombinant strains (V00866 [GenBank] , M57663 [GenBank] , D00330 [GenBank] , M54923 [GenBank] , X01587 [GenBank] , D00630 [GenBank] , M32138 [GenBank] and L27106 [GenBank] ). The sequences were aligned with ClustalW (Thompson et al., 1994), using the default parameters. Columns with gaps were discarded, giving a total alignment length of 3049 nucleotides.

Maize actin genes
Gene conversion is a process equivalent to recombination. It occurs in multigene families, where a DNA subsequence of one gene can be replaced by the DNA subsequence from another. Indication of gene conversion between two pairs of maize actin genes has been reported in Moniz de Sa and Drouin (1996). For the present evaluation, we used the eight maize sequences studied by Moniz de Sa and Drouin (1996) (GenBank/EMBL accession numbers are in brackets): Mac1 (J01238 [GenBank] ), Maz56 (U60514 [GenBank] ), Maz63 (U60513 [GenBank] ), Maz81 (U60511 [GenBank] ), Maz83 (U60510 [GenBank] ), Maz87 (U60509 [GenBank] ), Maz89 (U60508 [GenBank] ) and Maz95 (U60507 [GenBank] ). The sequences were aligned with ClustalW, using the default parameter settings. Columns with more than three gaps were discarded, resulting in a total alignment length of 1281 nucleotides.

Methods
The PDM and pruned PDM methods were applied as follows. Windows of different lengths were moved along the alignment with a fixed step size of {Delta}t = 10 nucleotides. The MCMC simulations were carried out with BAMBE1 described in Larget and Simon (1999). For each new window position, the system was equilibrated over Meq = 100,000 Metropolis–Hastings steps, starting from an initial tree obtained with Neighbour Joining (Saitou and Nei, 1987) and using global2 tree manipulations for the proposal moves. This was followed by a sampling period of M = 100,000 Metropolis–Hastings steps, using local tree manipulations and sampling tree configurations in intervals of {Delta}M = 200 Metropolis–Hastings steps. Full details of the MCMC simulations are available from the accompanying website (see abstract), which includes the input file for BAMBE. Note that rather large values for Meq and M were chosen to ensure sufficient mixing and convergence of the Markov chains. When applied sequentially, this leads to a total execution time of several hours, and one may therefore want to reduce the size of M and Meq. In fact, rerunning the algorithm with M = 10,000, Meq = 10,000, and {Delta}M = 20 was found to lead to results very similar to those presented in this article. However, as discussed in the Discussion section, the algorithm can easily be parallelized, in which case the total execution time is reduced to the order of minutes without having to compromise the convergence and mixing of the Markov chains.

The DSS signals were computed as described in McGuire and Wright (2000), using the same step size and window size as for the PDM and pruned PDM methods. We used the default options of the program, except for the computation of the pairwise distances, where we replaced the Jukes–Cantor model of nucleotide substitution by the Kimura 2-parameter model.

For a further comparison, Recpars (Hein, 1993) was applied. This method requires the prior specification of a recombination and a nucleotide substitution penalty parameter, for which various ratios {Psi} were chosen.


    RESULTS
 TOP
 Abstract
 INTRODUCTION
 METHOD
 DATA AND SIMULATIONS
 RESULTS
 DISCUSSION
 REFERENCES
 
Figure 5 (top left panel) shows the unpruned PDM signal obtained from the first simulated sequence alignment, using a window size of W = 500. All four recombinant breakpoints are clearly detected, and the differently diverged region is successfully suppressed. No pruning is required, although it improves the spatial resolution slightly (Fig. 5, top panel). The resolution is more dramatically improved by reducing the window size to W = 200; see the bottom panel of Figure 5. However, without pruning this reduction in W substantially increases the signal stemming from the confounding, differently diverged region. This shortcoming is avoided when pruning is used (Fig. 5, bottom panel).



View larger version (25K):
[in this window]
[in a new window]
 
Fig. 5 Detection of recombination in the first synthetic DNA sequence alignment: PDM method. The graphs show probabilistic divergence signals; the dashed horizontal lines show the 99 percentiles under the null hypothesis of no recombination, computed with parametric bootstrapping, as discussed in the supplementary material. Top row: window size 500. Bottom row: window size 200. From left to right: no pruning (resulting in 16 tree topologies when the window size is 500, and 104 topologies for a window size of 200), pruning down to 7, 5 and 3 clusters. The true mosaic structure of the sequence alignment is as follows. Ancient recombination event: sites 1000–1500; recent recombination event: sites 2500–3000; differently diverged region: sites 4000–4500.

 
Figure 6 (top left panel) shows the unpruned PDM signal obtained from the second simulated sequence alignment, again using a window size of W = 500. The four recombinant breakpoints are still detected, but at a poor spatial resolution. This deterioration is a consequence of the significantly decreased sequence divergence, as discussed above. Pruning improves the resolution slightly (top panel of Fig. 6). However, a decrease of the window size to W = 200 no longer seems to be a viable option: the bottom left panel of Figure 6 demonstrates that the relevant peaks get lost in a noisy, erratic signal. Interestingly, when applying the pruning method, the true peaks of the recombinant breakpoints re-emerge, resulting in a spatial resolution that is equivalent to the one obtained from the first simulated sequence alignment (bottom panel of Figure 6).



View larger version (28K):
[in this window]
[in a new window]
 
Fig. 6 Detection of recombination in the second synthetic DNA sequence alignment: PDM method. The graphs show probabilistic divergence signals; the dashed horizontal lines show the 99 percentiles under the null hypothesis of no recombination, computed with parametric bootstrapping, as discussed in the supplementary material. Top row: window size 500. Bottom row: window size 200. From left to right: no pruning (resulting in 50 tree topologies when the window size is 500, and 406 topologies for a window size of 200), pruning down to 7, 5 and 3 clusters. The true mosaic structure of the sequence alignment is as follows. Ancient recombination event: sites 1000–1500; recent recombination event: sites 2500–3000; differently diverged region: sites 4000–4500.

 
Figures 6 and 8 show the performance of both the DSS method and Recpars on the first two simulated sequence alignments. With one exception, the DSS method predicts all recombinant breakpoints correctly when a window size of W = 500 is used, while successfully suppressing the confounding, differently diverged region. However, when the window size is decreased to W = 300 and W = 200, the true breakpoints get lost among spurious peaks. The prediction with Recpars depends critically on the recombination versus nucleotide substitution cost ratio, {Psi}. Note that this parameter cannot be inferred from the data, but has to be selected rather arbitrarily by the user. For the first simulated sequence alignment, the correct mosaic structure is predicted when setting {Psi} = 10 (Figure 7, left panel). For the second sequence alignment, {Psi} = 2 leads to a satisfactory prediction (Figure 8, left panel). However, setting the value of {Psi} with the benefit of hindsight after inspecting the outcome of the prediction is methodologically incorrect, and for suboptimal values of {Psi} poor predictions with too many or too few breakpoints are obtained.



View larger version (16K):
[in this window]
[in a new window]
 
Fig. 8 Detection of recombination in the second synthetic DNA sequence alignment: alternative methods. Left: Prediction obtained with Recpars. The three panels show segmentations predicted for different recombination versus nucleotide substitution cost ratios {Psi}. Top: {Psi} = 20; middle: {Psi} = 10; bottom: {Psi} = 2. Centre left: DSS signal obtained with a window size of 500. Centre right: DSS signal obtained with a window size of 300. Right: DSS signal obtained with a window size of 200. The horizontal dashed line shows the 99 percentile under the null hypothesis of no recombination, estimated with parametric bootstrapping. The true mosaic structure of the sequence alignment is as follows. Ancient recombination event: sites 1000–1500; recent recombination event: sites 2500–3000; differently diverged region: sites 4000–4500.

 


View larger version (14K):
[in this window]
[in a new window]
 
Fig. 7 Detection of recombination in the first synthetic DNA sequence alignment: alternative methods. Left: Prediction obtained with Recpars. The three panels show segmentations predicted for different recombination versus nucleotide substitution cost ratios {Psi}. Top: {Psi} = 20; middle: {Psi} = 10; bottom: {Psi} = 5. Centre left: DSS signal obtained with a window size of 500. Centre right: DSS signal obtained with a window size of 300. Right: DSS signal obtained with a window size of 200. The horizontal dashed line shows the 99 percentile under the null hypothesis of no recombination, estimated with parametric bootstrapping. The true mosaic structure of the sequence alignment is as follows. Ancient recombination event: sites 1000–1500; recent recombination event: sites 2500–3000; differently diverged region: sites 4000–4500.

 
On the third simulated sequence alignment, none of the alternative methods achieved a satisfactory performance; see the left panel of Figure 9 (unpruned PDM method), and the two subfigures on the left of Figure 10 (DSS method and Recpars). Interestingly, all four breakpoints are correctly identified when the pruning method with a sufficiently small number of clusters K is used (Fig. 9).



View larger version (12K):
[in this window]
[in a new window]
 
Fig. 9 Detection of recombination in the third synthetic DNA sequence alignment with the PDM method. The graphs show probabilistic divergence signals; the dashed horizontal lines show the 99 percentiles under the null hypothesis of no recombination, computed with parametric bootstrapping, as discussed in the supplementary material. The window size of the PDM method was set to 500. From left to right: no pruning (leading to 361 tree topologies in total), pruning down to 7, 5 and 3 clusters. The sequence alignment contains the following true mosaic structure: an ancient recombination event between sites 1000 and 2000, a recent recombination event between sites 3000 and 4000, and a differently diverged region (factor 3) affecting the last 1000 nucleotides.

 


View larger version (13K):
[in this window]
[in a new window]
 
Fig. 10 Prediction of recombination in the third simulated sequence alignment and the Hepatits B virus alignment with alternative methods. Left: Recpars applied to the third simulated sequence alignment. The three panels show segmentations predicted for different recombination versus nucleotide substitution cost ratios {Psi}. Top: {Psi} = 10; middle: {Psi} = 5; bottom: {Psi} = 3. Centre left: DSS signal obtained from the third simulated sequence alignment. The horizontal dashed line shows the 99 percentile under the null hypothesis of no recombination, estimated with parametric bootstrapping. For a location of the true recombinant regions, see the caption of Figure 9. Centre right: Recpars applied to the Hepatitis-B virus sequence alignment. Top: {Psi} = 20; middle: {Psi} = 10; bottom: {Psi} = 5. Right: DSS signal obtained from the Hepatitis-B virus sequence alignment, with the 99 percentile under the null hypothesis shown as a horizontal dashed line.

 
The centre and right panels of Figure 11 show two DSS signals obtained from the maize actin gene sequence alignment, using the window sizes of W = 200 and W = 300. A clear breakpoint is detected around site 1000. This breakpoint is also found with Recpars for {Psi} = 10 (left panel of Fig. 11), and it concurs with the findings reported in Moniz de Sa and Drouin (1996). The unpruned PDM method also detects this breakpoint; however, it is obscured by various spurious peaks (Fig. 12, left panel). This poor performance is rectified with the pruning method. Figure 12 shows that out of six predictions resulting from combining two window sizes (W = 200 and W = 300) with three threshold values (K = 3, 5, 7), five concur with the DSS signals.



View larger version (11K):
[in this window]
[in a new window]
 
Fig. 11 Gene conversion in maize actin genes: prediction with alternative methods. Left: Prediction with Recpars. The three panels show segmentations predicted for different recombination versus nucleotide substitution cost ratios {Psi}. Top: {Psi} = 10; middle: {Psi} = 5; bottom: {Psi} = 3. Centre: DSS signal obtained with a window size of 300. Right: DSS signal obtained with a window size of 200. The horizontal dashed line shows the 99 percentile under the null hypothesis of no recombination, estimated with parametric bootstrapping.

 


View larger version (16K):
[in this window]
[in a new window]
 
Fig. 12 Gene conversion in maize actin genes: prediction with the PDM method. The graphs show probabilistic divergence signals; the dashed horizontal lines show the 99 percentiles under the null hypothesis of no recombination, computed with parametric bootstrapping, as discussed in the supplementary material. Top: window size 300. Bottom: window size 200. The various columns refer to different pruning thresholds. From left to right: no pruning, pruning down to 7, 5 and 3 clusters. No pruning led to a support of 52 (window size 300) and 136 (window size 200) topologies.

 
On the Hepatitis-B virus DNA sequence alignment, the PDM method was applied with two window sizes: W = 300 and W = 500. Without pruning, the probabilistic divergence signals, shown in the left column of Figure 13, contain erratic oscillations that obscure any clear patterns. This is dramatically improved with the pruning method. The top row of Figure 13 shows the results for the larger window size. Three breakpoints emerge, for all chosen values of the threshold K. These breakpoints are also clearly predicted with the DSS method (Fig. 10, right). When decreasing the window size, these breakpoints are predicted at a higher spatial resolution (bottom row of Fig. 13), and two further breakpoints occur (at positions 1000 and 1250). Interestingly, these breakpoints also appear in the DSS signal—at the border of the significance threshold (Fig. 10, right). The predictions with Recpars (Fig. 10, centre right) vary too much to be conclusive.



View larger version (26K):
[in this window]
[in a new window]
 
Fig. 13 Recombination in Hepatitis-B virus: prediction with the PDM method. The graphs show probabilistic divergence signals; the dashed horizontal lines show the 99 percentiles under the null hypothesis of no recombination, computed with parametric bootstrapping, as discussed in the supplementary material. Top: window size 500. Bottom: window size 300. The various columns refer to different pruning thresholds. From left to right: no pruning, pruning down to 7, 5 and 3 clusters. No pruning led to a support of 126 (window size 500) and 439 (window size 300) topologies.

 

    DISCUSSION
 TOP
 Abstract
 INTRODUCTION
 METHOD
 DATA AND SIMULATIONS
 RESULTS
 DISCUSSION
 REFERENCES
 
The present paper has proposed a novel phylogenetic window method for detecting interspecific recombination in DNA sequence alignments. The new approach combines the PDM method of Husmeier and Wright (2001b) with the post-processing clustering procedure of Stockham et al. (2002). The idea is to reduce the dispersion of the posterior distribution of tree topologies, on which the computation of the PDM signal is based, by a pruning procedure. Pruning is effected by clustering trees with similar topologies together, and replacing the posterior distribution over tree topologies by the posterior distribution over topology clusters. The scheme requires the prior specification of the number of topology clusters, K, which reflects our prior assumption about the maximum permissible number of recombination events. For sufficiently small values of K the prediction accuracy was found to increase considerably. Also, for sufficiently small values of K, the results are rather robust with respect to a variation of K; hence this variation is far less critical than the variation of, say, {Psi} in Recpars. The pruning scheme may also allow for more flexibility in the choice of the window size. While a smaller window increases the spatial resolution of the breakpoint detection, it also increases the dispersion of the posterior distribution, reflected by the total number of tree topologies sampled in the MCMC simulations. This dispersion aggravates the inherent shortcoming of the PDM method, as discussed above. By projecting the total topology space onto a small set of topology clusters, the dispersion of the posterior distribution is drastically reduced. The simulation studies discussed in the previous section have demonstrated that a decrease of the window size may lead to erratic and noisy PDM signals when no pruning is used, while the application of the novel pruning method was found to recover the true recombinant breakpoints at an improved spatial resolution. This finding suggests that the pruning scheme proposed in the present article may lead to a substantial improvement on the PDM method, rendering it a more viable tool for the detection of interspecific recombination.

Like the PDM method, the pruned PDM method requires several MCMC simulations to be carried out—one for each window position. When applied sequentially, this procedure is slow and may take several hours to complete. However, each individual MCMC simulation is independent of all the other MCMC simulations, and the method therefore lends itself to a straightforward parallelization. Note that each individual MCMC simulation is fast due to the small length of the selected sequence alignment (selected with the sliding window). Consequently, a parallelization of the algorithm can reduce the total execution time to the order of minutes.

An obvious disadvantage of the proposed scheme is the use of a sliding window, which gives the method an intrinsically heuristic flavour. The simulation studies have demonstrated that the performance of the pruned PDM scheme seems to depend less critically on the window size than the DSS method. This finding reflects the fact that the PDM method is a likelihood method and, consequently, extracts more information from the data than the DSS method, which is based on pairwise sequence distances. However, a method like Recpars, which avoids the use of a window altogether, is in principle superior. The principal shortcoming of Recpars is that, as a parsimony method, it depends on the arbitrary parameter {Psi}, which cannot be estimated from the data. Husmeier and Wright (2001a) and Husmeier and McGuire (2003) have developed a likelihood method equivalent to Recpars, in which all parameters are estimated from the data, and in which the recombinant breakpoints can be determined at a higher resolution than with a window method. However, this approach is restricted to small sequence alignments of typically only four taxa. This restriction calls for the prior identification of putative recombinant sequences, using, for example, window-based methods—like the one discussed in the present article.

As a final remark, note that the objective of the PDM method is to detect changes in the tree topology, that is, systematic deviations from the assumption that a single phylogenetic tree is consistent with the whole DNA sequence alignment. The focus of this paper has been on interspecific recombination and gene conversion. Other processes that may lead to topology changes are gene duplication and lineage sorting. Conversely, certain recombination events may only affect the branch lengths of a phylogenetic tree while leaving the topology unchanged. Consequently, not all recombination events can be detected with the PDM method, while certain breakpoints detected with the PDM method may have been caused by events different from recombination. While, on the face of it, this looks like a shortcoming, we note that the motivation for the PDM method does not come from population genetics, where one is interested in recombination events per se, but from phylogenetics. Most standard phylogenetic methods are based on the implicit assumption that the given alignment is consistent with a single phylogenetic tree. When a recombination event only affects the branch lengths of a phylogenetic tree, this assumption is not particularly critical: the inferred topology will still be correct (as it has not been changed), and the inferred branch lengths will show some average value. However, when recombination (or any other event, like lineage sorting) affects the tree topology, the inference of an average tree is no longer reasonable: tree topologies are cardinal entities, for which an average value is not defined. In fact, it is well-known that in this case the application of a standard phylogenetic inference method may lead to a distorted tree that is unrepresentative of the underlying evolutionary history of the sequences. By detecting systematic changes of the tree topology along the sequence alignment, the pruned PDM method proposed in the present article promises to offer a useful pre-screening tool for improving the consistency of phylogenetic analysis.


    Acknowledgments
 
This work was supported by the Scottish Executive Environmental and Rural Affairs Department (SEERAD). We would like to thank Chris Glasbey for helpful discussions.


    Footnotes
 
1Available from http://www.mathcs.duq.edu/larget/bambe.html Back

2In BAMBE, two different kinds of proposal moves—local and global—are used, which are described in Larget and Simon (1999). Back

Received on July 10, 2004; revised on November 2, 2004; accepted on November 11, 2004

    REFERENCES
 TOP
 Abstract
 INTRODUCTION
 METHOD
 DATA AND SIMULATIONS
 RESULTS
 DISCUSSION
 REFERENCES
 

    Bishop, C.M. Neural Networks for Pattern Recognition, (1995) , New York Oxford University Press.

    Bollyky, P.L., Rambaut, A., Harvey, P.H., Holmes, E.C. (1996) Recombination between sequences of Hepatitis B virus from different genotypes. J. Mol. Evol., 42, 97–102[CrossRef][Web of Science][Medline].

    Felsenstein, J. (1981) Evolution trees from DNA sequences: A maximum likelihood approach. J. Mol. Evol., 17, 368–376[CrossRef][Web of Science][Medline].

    Grassly, N.C. and Holmes, E.C. (1997) A likelihood method for the detection of selection and recombination using nucleotide sequences. Mol. Biol. Evol., 14, 239–247[Abstract].

    Hein, J. (1993) A heuristic method to reconstruct the history of sequences subject to recombination. J. Mol. Evol., 36, 396–405.

    Husmeier, D. and McGuire, G. (2003) Detecting recombination in 4-taxa DNA sequence alignments with Bayesian hidden Markov models and Markov chain Monte Carlo. Mol. Biol. Evol., 20, 315–337[Abstract/Free Full Text].

    Husmeier, D. and Wright, F. (2001a) Detection of recombination in DNA multiple alignments with hidden Markov models. J. Comput. Biol., 8, 401–427[CrossRef][Web of Science][Medline].

    Husmeier, D. and Wright, F. (2001b) Probabilistic divergence measures for detecting interspecies recombination. Bioinformatics, 17, S123–S131[Abstract].

    Kimura, M. (1980) A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol., 16, 111–120[CrossRef][Web of Science][Medline].

    Larget, B. and Simon, D.L. (1999) Markov chain Monte Carlo algorithms for the Bayesian analysis of phylogenetic trees. Mol. Biol. Evol., 16, 750–759[Web of Science].

    Maddison, D. (1991) The discovery and importance of multiple islands of most parsimonious trees. Systema. Zoology, 40, 315–328[CrossRef].

    McGuire, G. and Wright, F. (2000) TOPAL 2.0: improved detection of mosaic sequences within multiple alignments. Bioinformatics, 16, 130–134[Abstract/Free Full Text].

    McGuire, G., Wright, F., Prentice, M. (1997) A graphical method for detecting recombination in phylogenetic data sets. Mol. Biol. Evol., 14, 1125–1131[Abstract].

    Moniz de Sa, M. and Drouin, G. (1996) Phylogeny and substitution rates of angiosperm actin genes. Mol. Biol. Evol., 13, 1198–1212[Abstract].

    Posada, D. (2002) Evaluation of methods for detecting recombination from dna sequences: Empirical data. Mol. Biol. Evol., 19, 708–717[Abstract/Free Full Text].

    Posada, D., Crandall, K.A., Holmes, E.C. (2002) Recombination in evolutionary genomics. Ann. Rev. Genet., 36, 75–97[CrossRef][Web of Science][Medline].

    Robinson, D. and Foulds, L. (1981) Comparison of phylogenetic trees. Mathemat. Biosci., 53, 131–147[CrossRef].

    Saitou, N. and Nei, M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol., 4, 406–425[Abstract].

    Stockham, C., Wang, L.-S., Warnow, T. (2002) Statistically based postprocessing of phylogenetic analysis by clustering. Bioinformatics, 18, S285–S293[Abstract].

    Thompson, J.D., Higgins, D.G., Gibson, T.J. (1994) CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680[Abstract/Free Full Text].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/9/1797    most recent
bti151v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (1)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Husmeier, D.
Right arrow Articles by Milne, I.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Husmeier, D.
Right arrow Articles by Milne, I.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?