

Bioinformatics Advance Access originally published online on August 7, 2006
Bioinformatics 2006 22(20):2493-2499; doi:10.1093/bioinformatics/btl427

© 2006 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Robust inference of positive selection from recombining coding sequences

Konrad Scheffler *, Darren P. Martin and Cathal Seoighe *

Computational Biology Group, Institute of Infectious Disease and Molecular Medicine, University of Cape Town, Private Bag, Rondebosch 7701, South Africa

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATERIALS AND METHODS
 3 ALGORITHM
 4 RESULTS AND DISCUSSION
 5 CONCLUSIONS
 REFERENCES
 

Motivation: Accurate detection of positive Darwinian selection can provide important insights to researchers investigating the evolution of pathogens. However, many pathogens (particularly viruses) undergo frequent recombination and the phylogenetic methods commonly applied to detect positive selection have been shown to give misleading results when applied to recombining sequences. We propose a method that makes maximum likelihood inference of positive selection robust to the presence of recombination. This is achieved by allowing tree topologies and branch lengths to change across detected recombination breakpoints. Further improvements are obtained by allowing synonymous substitution rates to vary across sites.

Results: Using simulation we show that, even for extreme cases where recombination causes standard methods to reach false positive rates >90%, the proposed method decreases the false positive rate to acceptable levels while retaining high power. We applied the method to two HIV-1 datasets for which we have previously found that inference of positive selection is invalid owing to high rates of recombination. In one of these (env gene) we still detected positive selection using the proposed method, while in the other (gag gene) we found no significant evidence of positive selection.

Availability: A HyPhy batch language implementation of the proposed methods and the HIV-1 datasets analysed are available at http://www.cbio.uct.ac.za/pub_support/bioinf06. The HyPhy package is available at http://www.hyphy.org, and it is planned that the proposed methods will be included in the next distribution. RDP2 is available at http://darwin.uvigo.es/rdp/rdp.html.

Contact: konrad{at}cbio.uct.ac.za, cathal{at}science.uct.ac.za


    1 INTRODUCTION
 
The standard phylogenetic approach to inferring positive Darwinian selection in protein-coding sequences is based on the codon models first proposed by Muse and Gaut (1994) and Goldman and Yang (1994), which have since been developed into a set of robust methods that detect positive selection while allowing for selective pressure to vary across sites (Nielsen and Yang, 1998; Yang et al., 2000; Wong et al., 2004). These methods, however, assume that the phylogenetic tree topology and branch lengths are constant across all sites in the sequence, an assumption which is invalid when the sequences have been affected by recombination. Indeed, it has been shown (Anisimova et al., 2003; Shriner et al., 2003) that the presence of recombination can cause these methods to fail with type I (false positive) error rates as high as 90%. In a recent study (Scheffler and Seoighe, manuscript submitted), we quantified the percentage of false positive inferences as a function of recombination rate and demonstrated that inferred positive selection on two example HIV datasets is invalidated by the presence of recombination.

Recombination can contribute to false inference of positive selection by causing the branch lengths (Fig. 1a) and tree topologies (Fig. 1b) to differ between sites. In order to devise a robust method of inferring positive selection we investigated the impact of allowing tree topology and branch length parameters to change across recombination breakpoints. In a real analysis we anticipate that a subset of recombination breakpoints might be undetected. In order to improve the performance of our method in the presence of a subset of undetected recombination breakpoints, we included a variable synonymous substitution rate in our models, which allows the total tree length to vary from site to site. Sequences can evolve under a variable synonymous substitution rate owing to mutation rate variation or owing to site-specific selection acting on synonymous changes, but synonymous rate variation could also be detected as a result of recombination events that alter branch lengths. Incorporating synonymous rate variation in the model can therefore account for some of the misestimated branch lengths that result from recombination events that alter branch lengths but not tree topology. In general, we expect these recombination events to be more difficult to detect than those that cause a substantial change in tree topology. We evaluated the performance of the method by simulation and applied it to investigate whether the HIV datasets mentioned above can be inferred to be evolving under positive selection when recombination is taken into account.


 
Fig. 1 Recombination graphs (Hudson, 1983) (above) and corresponding trees (below) illustrating (a) a recombination event that changes the tree length but not the topology and (b) a recombination event that changes both the tree length and the topology. In the recombination graphs, the letter C indicates coalescent events while the letter R indicates recombination events.

 

    2 MATERIALS AND METHODS
 
We generated a number of datasets using the Codonrecsim program written by Rasmus Nielsen (Anisimova et al., 2003) that simulates recombined coding sequence alignments. It does this by simulating under a phylogenetic model of evolution using the discrete model (M3) of site-to-site rate variation proposed by Yang et al. (2000), but with the evolution taking place along genealogies simulated under the coalescent model with recombination (Hudson, 1983). This means that sites that have a recombination breakpoint between them do not evolve along the same phylogenetic tree. Barton and Etheridge (2004) have shown that selection has little effect on genealogies, which justifies neglecting selection when simulating genealogies under the coalescent model with recombination.

We performed two suites of simulation experiments, one using 10-taxon and one using 30-taxon datasets (Table 1). In each suite we simulated neutrally evolving datasets (i.e. ω = 1, mimicking pseudogene evolution) to estimate false positive rates and datasets evolving with site-to-site rate variation and positive selection [using the parameters inferred by Anisimova et al. (2003) on their hepatitis D antigen dataset under the 3-component discrete model (Yang et al., 2000)] to estimate power. Each simulated alignment was 500 codons long, and each dataset consisted of 100 replicates. The transition/transversion rate ratio (κ) and the codon equilibrium frequencies were set to the values estimated for the hepatitis D antigen dataset. We chose mutation and recombination rate parameters that produced high false positive rates when using the standard method (see below) to infer positive selection on the neutral datasets. For the 30-taxon datasets the population-scaled recombination rate (ρ) was 0.01 and the population-scaled mutation rate (θ) was 0.36, resulting in an average of 43.2 recombination events in the entire genealogy and an expected number of 1.43 mutations per codon. For the 10-taxon datasets ρ was 0.05 and θ was 3.6, resulting in an average of 247.11 recombination events in the entire genealogy and an expected number of 10.18 mutations per codon (the very high values for the 10-taxon datasets serve to illustrate that the method works well even in extreme cases). To verify that the proposed method does not have an adverse effect when used on unrecombined data, we also simulated datasets with exactly the same parameters but with zero recombination rate.



 
Table 1 Simulation parameters used to create datasets

 
Finally, we analysed the HIV-1 subtype C env and gag data of our recent study (Scheffler and Seoighe, manuscript submitted). These datasets contain 10 taxa each, with GenBank accession numbers AY118165-AY118166, AF286227, AY158533-AY158535, AF411967, AF391234-AF391235 and AF391238 for the env sequences (1053 codons in length) and AY118165-AY118166, AF286227, AY158533-AY158535, AF411967, AY162223-AY162224 and AF391254 for the gag sequences (590 codons in length).


    3 ALGORITHM
 
In this study we report results for four methods of detecting positive selection, using different combinations of the two strategies investigated:

Standard: This is the baseline method, which assumes that topology, relative branch lengths and total tree length are constant over all sites.
Synonymous rate variation: This method assumes that topology and relative branch lengths are constant over all sites, but allows total tree length to vary from site to site.
Partitioning: This method uses recombination breakpoints (either detected or the actual simulated breakpoints) to divide the alignment into partitions, each of which is assumed to include no further recombination breakpoints. Topology, relative branch lengths and total tree length are forced to be constant over all sites within a partition, but allowed to vary between partitions.
Synonymous rate variation with partitioning: This method combines the previous two methods: topology and relative branch lengths are assumed constant over all sites within a partition, but allowed to vary between partitions. Total tree length is allowed to vary from site to site irrespective of partitioning.
We implemented the above methods using the batch language of the HyPhy package (Kosakovsky Pond et al., 2005).

3.1 Baseline (standard) method
We detected positive selection by comparing the discrete ‘nearly neutral’ and ‘selection’ models M1a and M2a of Wong et al. (2004). We used the PAUP* program (Swofford, 2002) to estimate the maximum likelihood topologies under the HKY85 model (Hasegawa et al., 1985). To save computation time, we did not estimate the branch lengths separately for each model, but instead used the branch lengths estimated under the M0 (single rate) model (Yang et al., 2000). We report a sequence as being under positive selection at the 5 or 1% level if model M2a provides a significantly better fit than model M1a as measured by a likelihood ratio test with the appropriate significance level.
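The likelihood ratio test described above can be sketched as follows. This is a minimal illustration rather than the HyPhy implementation used in the study. It assumes the test statistic is referred to a chi-squared distribution with 2 degrees of freedom (a common convention for comparing M1a with M2a, which the text does not state explicitly). The function name and the log-likelihood values are hypothetical.

```python
import math

def lrt_m1a_vs_m2a(lnl_m1a, lnl_m2a):
    """Likelihood ratio test of M2a (selection) against M1a (nearly neutral).

    Assumption: the statistic 2 * (lnL_M2a - lnL_M1a) is compared with a
    chi-squared distribution with 2 degrees of freedom, whose survival
    function has the closed form exp(-x / 2).
    """
    # Clamp at zero in case numerical optimisation leaves M2a slightly worse.
    statistic = max(0.0, 2.0 * (lnl_m2a - lnl_m1a))
    p_value = math.exp(-statistic / 2.0)
    return statistic, p_value

# Hypothetical log-likelihoods, for illustration only.
stat, p = lrt_m1a_vs_m2a(lnl_m1a=-5012.4, lnl_m2a=-5006.1)
print(f"2*dlnL = {stat:.2f}, p = {p:.4g}")
print("significant at 5%:", p < 0.05, "| at 1%:", p < 0.01)
```

Using the closed form for 2 degrees of freedom keeps the sketch free of external dependencies; a different reference distribution would need a different survival function.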

3.2 Allowing synonymous rate variation
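In the methods that model synonymous rate variation we added a synonymous substitution rate parameter to the baseline method. We treat the synonymous rate s as belonging to one of a number of discrete rate classes, similar to the treatment of the non-synonymous/synonymous rate ratio ω, so that the expression for the instantaneous substitution rate from codon i to codon j at site h becomes:

\[
q_{ij}(h) =
\begin{cases}
0 & \text{if } i \text{ and } j \text{ differ at more than one nucleotide position} \\
s(h)\,\pi_j & \text{for a synonymous transversion} \\
s(h)\,\kappa\,\pi_j & \text{for a synonymous transition} \\
s(h)\,\omega(h)\,\pi_j & \text{for a non-synonymous transversion} \\
s(h)\,\omega(h)\,\kappa\,\pi_j & \text{for a non-synonymous transition}
\end{cases}
\tag{1}
\]

Here, κ is the transition/transversion rate ratio and π_j is the codon equilibrium frequency of codon j. ω(h) and s(h) denote, respectively, the non-synonymous/synonymous rate ratio and synonymous rate at site h.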
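As a rough illustration of the rate expression described above, the following sketch computes a single off-diagonal rate. It assumes the usual Goldman and Yang style codon rate multiplied throughout by the site's synonymous rate, as the surrounding text implies, and it assumes the standard genetic code with stop codons excluded. The function and parameter names are hypothetical and are not taken from the HyPhy implementation.

```python
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code, with codons enumerated in TCAG order.
GENETIC_CODE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
TRANSITIONS = {frozenset("AG"), frozenset("CT")}

def substitution_rate(codon_i, codon_j, kappa, omega_h, s_h, pi_j):
    """Off-diagonal rate from codon_i to codon_j at a site with
    non-synonymous/synonymous ratio omega_h and synonymous rate s_h.
    pi_j is the equilibrium frequency of the target codon."""
    if "*" in (GENETIC_CODE[codon_i], GENETIC_CODE[codon_j]):
        return 0.0  # stop codons are excluded from the state space
    diffs = [(x, y) for x, y in zip(codon_i, codon_j) if x != y]
    if len(diffs) != 1:
        return 0.0  # identical codons, or more than one nucleotide change
    rate = s_h * pi_j  # s_h scales every allowed change
    if frozenset(diffs[0]) in TRANSITIONS:
        rate *= kappa  # transition
    if GENETIC_CODE[codon_i] != GENETIC_CODE[codon_j]:
        rate *= omega_h  # non-synonymous change
    return rate

# Hypothetical parameter values, for illustration only.
print(substitution_rate("TTT", "TTC", kappa=2.0, omega_h=0.5, s_h=1.5, pi_j=0.02))
```

Because s_h multiplies every allowed change, it rescales the whole tree at that site, which is the property the text relies on.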

The synonymous rate is drawn from a discrete distribution with three rate categories (we obtained no noticeable difference in results when using four categories, data not shown), with rates scaled such that the average synonymous rate over all sites is 1. This distribution is identical to that used for the ω parameter in the discrete model M3 of Yang et al. (2000), except that the latter is unscaled. Thus each site, in addition to belonging to one of the ω categories, also belongs to a synonymous rate category. This can also be viewed as providing three different tree scales: the evolution at each site is modelled as following the same tree topology and relative branch lengths, but the tree may be scaled differently for different sites.
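The scaling constraint on the synonymous rate categories can be illustrated with a short sketch. It assumes the constraint is imposed by dividing each category rate by the proportion-weighted mean, which is one straightforward way to obtain an average rate of 1. The function name and the example values are hypothetical.

```python
def scale_to_unit_mean(rates, proportions):
    """Rescale rate categories so the proportion-weighted mean rate is 1."""
    if len(rates) != len(proportions):
        raise ValueError("rates and proportions must have the same length")
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1")
    mean = sum(r * p for r, p in zip(rates, proportions))
    if mean <= 0:
        raise ValueError("mean rate must be positive")
    return [r / mean for r in rates]

# Hypothetical three-category distribution of synonymous rates.
raw_rates = [0.5, 2.0, 6.0]
proportions = [0.5, 0.3, 0.2]
scaled = scale_to_unit_mean(raw_rates, proportions)
print(scaled)
print(sum(r * p for r, p in zip(scaled, proportions)))  # weighted mean after scaling
```

The scaling preserves the ratios between categories, so only the relative tree scales are free parameters.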

Note that our parameterisation of site-to-site rate variation is different from that used by Kosakovsky Pond and Muse (2005), which uses the synonymous rate only for synonymous changes and hence is not a direct measure of total tree length (s(h) is absent from the expression for the instantaneous rate of non-synonymous transitions and transversions). Whereas Kosakovsky Pond and Muse (2005) apply parametric models to the distribution of synonymous and of non-synonymous rates, our parameterisation applies the same parametric models to the distribution of synonymous rates and of selective strengths.

3.3 Detecting recombination breakpoints
For the methods using partitioning by detected recombination we estimated the positions of recombination breakpoints using the non-parametric RDP (Martin and Rybicki, 2000), GENECONV (Padidam et al., 1999) and MAXIMUM CHI SQUARED (Maynard Smith, 1992) methods as implemented in RDP2 (Martin et al., 2005). See Poke et al. (2006) for a description of how these methods work. Default program settings were used throughout except that a Bonferroni corrected P-value cutoff of 0.01 was used to minimize the probability of falsely inferring recombination. All breakpoints detected by any of the three methods were taken into consideration.

3.4 Allowing different tree topologies for different sequence fragments
Once the recombination breakpoints have been detected, we use them to partition the alignment into separate segments (Fig. 2). When the number of segments exceeds a preset maximum N (20 in this study), we use only the N longest unbroken segments and discard the remaining data. The rationale behind this is that when the segments between recombination breakpoints are very short, they contain very little phylogenetic information and therefore the tree topology and branch length parameters cannot be estimated accurately for the partitions. Moreover, such small partitions contribute very little information so that discarding them should be less costly than introducing additional uncertainty resulting from estimating additional branch length and topology parameters for the partition. In the present study, data were discarded only for the simulated data, which had very high rates of recombination. The number of breakpoints detected in the real datasets we examined was lower than the maximum in both cases.
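The segment selection described above can be sketched as follows. This is a minimal illustration and not the authors' implementation. It assumes breakpoints have already been mapped to codon indices and that a breakpoint b means a new segment starts at codon b; the function name is hypothetical.

```python
def largest_unrecombined_segments(n_codons, breakpoints, max_segments=20):
    """Split codon positions 0..n_codons-1 at the given breakpoints and keep
    the max_segments longest pieces, returned in alignment order.

    Assumption: a breakpoint b means a new segment starts at codon index b.
    """
    # Ignore duplicates and breakpoints at or beyond the alignment ends.
    cuts = sorted({b for b in breakpoints if 0 < b < n_codons})
    starts = [0] + cuts
    ends = cuts + [n_codons]
    segments = list(zip(starts, ends))  # half-open [start, end) intervals
    if len(segments) > max_segments:
        longest = sorted(segments, key=lambda seg: seg[1] - seg[0], reverse=True)
        segments = sorted(longest[:max_segments])  # discard the shortest pieces
    return segments

# Hypothetical example: 500 codons, four breakpoints, keep the two longest.
print(largest_unrecombined_segments(500, [40, 210, 230, 450], max_segments=2))
```

Each retained interval would then be given its own topology and branch lengths, while codons outside the retained intervals are dropped from the analysis.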


 
Fig. 2 Strategy for partitioning sequences according to recombination breakpoints. Full sequence (top): the entire sequence is used and described by a single tree topology and set of branch lengths, ignoring recombination breakpoints. Largest unrecombined segments, unpartitioned (middle): only codons in the largest N segments that contain no recombination breakpoints (vertical lines) are used (illustrated here by the white regions, with N = 2). As for the full sequence analysis, these codons are described by a single tree topology and set of branch lengths. Partitioning using largest unrecombined segments (bottom): each of the largest N segments that contain no recombination breakpoints is modelled using a separate topology and set of branch lengths.

 
Next, topologies and branch lengths are estimated as in the baseline method, except that a separate topology and set of branch lengths is used for each segment. The remaining model parameters, however, are shared across all segments. In particular, the parameters of models M1a and M2a describing the rate categories are estimated only once for all segments.
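One way to write down the parameter sharing described above is as a site-independent mixture likelihood. The notation is introduced here for illustration and is an assumption, not the authors' own: S_k is the set of sites in segment k, τ_k is the topology and branch lengths estimated for that segment, D_h is the alignment column at site h, and p_c and ω_c are the proportions and rate ratios of the M1a or M2a rate categories, which are shared by all K segments.

```latex
\[
\ln L \;=\; \sum_{k=1}^{K} \; \sum_{h \in S_k}
\ln \left( \sum_{c} p_c \, P\!\left(D_h \mid \tau_k, \omega_c, \kappa \right) \right)
\]
```

Only τ_k depends on the segment index, so adding a segment adds tree parameters but no additional selection parameters.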

To allow fairer comparison with the unpartitioned methods, we present the results for the simulation experiments not only for the full unpartitioned sequence (Fig. 2, top), but also for an unpartitioned analysis of the sites in the largest unrecombined segments only (Fig. 2, middle). This latter result provides a more direct comparison with the partitioned analysis (Fig. 2, bottom) which uses the same subset of the codons.


    4 RESULTS AND DISCUSSION
 
4.1 Simulation experiments
We investigated power and false positive rates using the simulated datasets (summarized in Table 1). For each dataset we first considered the effect of allowing the synonymous rate to vary across sites and of separating the tree topology and branch length parameters between the segments defined by recombination breakpoints, given that the locations of the recombination breakpoints are known. This was done by retrieving the recombination breakpoints used in the simulations. We then present the power and false positive rates for the more realistic case in which the breakpoint locations are not known, but are instead inferred using a set of breakpoint detection algorithms (Martin et al., 2005).

4.1.1 True breakpoints
The neutral simulations provide a worst case (but nevertheless realistic) scenario with which to investigate false positive rates. We found (Table 2) that allowing the synonymous substitution rate to vary across sites brought about a large decrease in false positives relative to the standard method, but still left the false positive rate unacceptably high. Partitioning according to the true breakpoints (Table 2), on the other hand, brought false positive levels down to close to the desired rate. In this case, synonymous rate variation with partitioning did not give further improvement over partitioning alone. The decrease in false positives when partitioning has two causes. First, the fact that some data are discarded inevitably leads to a reduction in power: this can be seen by comparing the full sequence results with the largest unrecombined segments (LUS) results for the unpartitioned methods. Second, the partitioning itself causes a further reduction, which is the desired effect: the magnitude of this effect can be seen by comparing the results for the partitioning methods with the LUS results of the corresponding unpartitioned methods. Therefore, in order to see the effect of partitioning the phylogeny parameters between unrecombined segments or allowing the synonymous rate to vary on the false positive rates, the results obtained using these methods should be compared with those obtained by applying the standard method to the LUS.
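To judge whether an observed count of false positives is "close to the desired rate", one simple check is a binomial tail probability. The sketch below assumes that the 100 replicates are independent and that a well-calibrated test would reject with probability exactly equal to the nominal level under neutrality. That is an idealisation, and this check is not part of the analysis reported here. The counts used are illustrative only.

```python
from math import comb

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(
        comb(n, i) * p**i * (1 - p) ** (n - i)
        for i in range(max(k, 0), n + 1)
    )

# Illustrative counts of false positives out of 100 neutral replicates.
for false_positives in (2, 11, 30):
    tail = binomial_upper_tail(false_positives, 100, 0.05)
    print(f"{false_positives:3d} of 100 at the 5% level: P(X >= k) = {tail:.3g}")
```

A small tail probability suggests that the false positive rate exceeds the nominal level; a large one means the count is compatible with it.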



 
Table 2 Number of false positive inferences out of 100 replicates obtained at the 5% (1%) significance level by different methods on the simulated neutral data sets when using the true recombination breakpoints

 
The positive selection simulations provide a means to investigate power (Table 3). Again, some caution is required here because positive results could be artefacts of recombination rather than instances where the method detected the signal of positive selection. Nevertheless, when the false positive rate obtained on the corresponding set of neutral simulations is low, we can conclude that the result obtained on the positive selection simulations is a good indication of power.



 
Table 3 Power (number of true positive inferences out of 100 replicates) obtained at the 5% (1%) significance level by different methods on the simulated positive selection data sets when using the true recombination breakpoints

 
For the case in which we assume that the true recombination breakpoints are known, the power was higher on the large dataset than on the small dataset. This was partly because the recombination levels were so high in the small dataset that the average segment length (for the 20 largest unrecombined segments) was <8 codons. In fact, given that tree topologies and branch lengths were inferred on such short segments, it is surprising that the method retains any power to discriminate between datasets with and without positive selection (as demonstrated by the higher rate of positives in the positive selection datasets than in the neutral datasets). This shows that, even when recombination creates what might appear to be a hopelessly fragmented evolutionary history, it can still be possible to perform reasonable inferences provided recombination is taken into account.

Inferring trees and branch lengths on very short segments for the partitioning method caused a large decrease in power for the small datasets, and possibly also a small increase in false positives. This is particularly noticeable for the partitioning method (without synonymous rate variation) applied to the small positive selection dataset, on which we obtained only 6% power at the 5% significance level. To confirm that this severe drop in power was caused by misestimation of tree topologies and branch lengths on the short segments we repeated the analysis, but with the topology and branch lengths for each segment fixed to the true (simulated) values. This resulted in 99% power (at both 5 and 1% significance levels), which is, as expected, identical to the result obtained for the corresponding unrecombined simulations. When the true topology was fixed but the branch lengths estimated as usual, the power was 14% (10%) at the 5% (1%) significance level. Thus the decrease in power can be attributed to inaccurate estimation of the branch lengths, which appears to become particularly acute when the segment lengths are this short. We caution that extremely short segment lengths (e.g. resulting from extremely high recombination rates such as in this simulation) may cause the proposed method to lack power.

4.1.2 Detected breakpoints
In real data, the true breakpoints are unknown and have to be detected by a recombination detection method. This has the disadvantage that there may be inaccuracy in the breakpoints detected, but may also have advantages in that recombination events that have little or no effect (for instance because they occur between closely related taxa and do not change the tree topology, as in Fig. 1a) will remain undetected, and thus will not have any negative effect on the power of the method. This could explain the results in Tables 4 and 5 where we found that using the detected breakpoints resulted in better performance (both a lower rate of false positives and higher power) than using the true breakpoints. In particular, the average segment lengths for the small datasets were longer, owing to the suppression of many presumably unimportant (and difficult to detect) recombination breakpoints. The longer segment lengths yielded improvements of the results obtained by methods using partitioning on these datasets.



 
Table 4 Number of false positive inferences out of 100 replicates obtained at the 5% (1%) significance level by different methods on the simulated neutral datasets when using the detected recombination breakpoints

 



 
Table 5 Power (number of true positive inferences out of 100 replicates) obtained at the 5% (1%) significance level by different methods on the simulated positive selection datasets when using the detected recombination breakpoints

 
Using the detected breakpoints, the power obtained using partitioning with synonymous rate variation on the small dataset was even higher than that obtained on the large dataset. This can be explained by the fact that the diversity in this dataset was much higher so that, once the false signal caused by recombination has been compensated for, the dataset contains more information that can be used to obtain inferences about selective pressure.

It is reassuring that modelling synonymous rate variation had very little effect on the recombination-free sequences: false positives were essentially unchanged while power decreased slightly. Partitioning had no effect: trivially, when the true breakpoints were used, there were no breakpoints to take into account so that the partitioning methods were identical to the corresponding unpartitioned methods. Recombination detection resulted in only a few falsely detected breakpoints (in 3 and 8 of the 100 replicates for the small neutral and small positive selection datasets respectively, and in none of the large datasets), but the inference of positive selection after partitioning gave a different result from that obtained without partitioning only for one replicate in the small positive selection dataset, and only at the higher of the two significance levels listed. Hence the proposed methods do not have negative effects when applied to unrecombined data.

4.2 Analysis of viral datasets
Next, we used the four methods to analyse the HIV-1 subtype C datasets for which we have previously shown (Scheffler and Seoighe, manuscript submitted) that the recombination levels are high enough to cause false inference of positive selection. Indeed, the standard method inferred positive selection on both data sets at very high levels of significance.

For the env data (Table 6) we detected 12 recombination breakpoints. We found that both modelling synonymous rate variation and partitioning (using 13 segments and discarding no data) caused reductions both in the significance level of the result and in the magnitude of positive selection inferred under the M2a model (as seen from the value of the ω2 parameter), but that even when using both synonymous rate variation and partitioning we still detected positive selection at a highly significant level. We conclude that these sequences are likely to have evolved under both recombination and positive selection.



 
Table 6 Results for env data

 
For the gag data (Table 7) we detected only four recombination breakpoints. This time, although partitioning (using five segments and discarding no data) without modelling synonymous rate variation did not remove the evidence of positive selection, the result was no longer significant when the synonymous rate was allowed to vary and even less so when synonymous rate variation and partitioning were combined. We conclude that, when recombination is taken into account, there is no convincing evidence that these sequences have evolved under positive selection.



 
Table 7 Results for gag data

 

    5 CONCLUSIONS
 
Our simulation results reveal that modelling synonymous rate variation tends to make inference of positive selection more conservative: both false positives and power go down. However, the levels of false positives observed in these simulations were still unacceptably high despite being much lower than when constant synonymous rates were assumed.

Using tree topology and branch lengths inferred separately for segments defined by detected recombination breakpoints caused a dramatic reduction in the false positive rate. For example, in the 10-taxon dataset we obtained an improvement from 94% false positives on the neutral simulations and 73% power on the positive selection simulations to 11% false positives on the neutral simulations and 91% power on the positive selection simulations. By combining partitioning with synonymous rate variation the false positive rate dropped further to an acceptable 2%, albeit at the cost of some reduction in power. The final power of 83% was nevertheless higher than the original power of 73%.

One of the most encouraging aspects of the simulation results was the performance of the partitioning methods using the detected recombination breakpoints. In the current set of simulations these methods performed better than the methods that used the simulated breakpoints, most likely because of the small segment lengths obtained when all of the recombination breakpoints were used. These results imply that the method we propose is not highly susceptible to inaccuracy in the detected breakpoints and that the majority of the benefit derived from partitioning appears to be obtained from the subset of most easily detectable recombination breakpoints.

We have not investigated the accuracy of site-specific selection detection using the proposed methods. In their simulation studies, Anisimova et al. (2003) and Shriner et al. (2003) found that site-specific analyses using standard phylogenetic methods are much more robust to recombination than whole-sequence analyses. This is consistent with our preliminary investigations (data not shown), in which we failed to find high levels of site-specific false positive inference using standard methods. More recently, Kosakovsky Pond et al. (2006) have found that under some conditions site-specific inference using a fixed effects likelihood method can also give highly misleading results in the presence of recombination. These authors found that the effects of recombination on site specific inference can be alleviated by analysing unrecombined segments separately and we therefore recommend that the method presented here should also be used for site-specific inference of positive selection when recombination is suspected.

Our results indicate that the proposed methods are able to filter out false inferences of positive selection on recombined sequences, but also have the power required to infer positive selection in such sequences when the signal of positive selection does exist. Furthermore we show that there is no evidence of a disadvantage of applying partitioning to sequences when the sequences have not in fact undergone recombination. In such cases few, if any, recombination breakpoints were detected and inferring the tree topology and branch length parameters separately for the resulting large unrecombined segments appeared to have no effect on the power or false positive rates. We therefore recommend that a method such as the one we describe, which includes a screen for recombination and separation of phylogeny parameters between recombination breakpoints, be applied routinely when phylogenetic methods are used to infer positive selection in sequences for which recombination is possible.


    Acknowledgments
 
The authors thank Rasmus Nielsen for making his Codonrecsim program available to them; Fourie Joubert and David Posada for use of the Linux clusters at the University of Pretoria, South Africa, and the University of Vigo, Spain; and Sergei Kosakovsky Pond for help with the HyPhy package and for offering to incorporate the proposed methods into future distributions of HyPhy. This study was supported by the South African National Bioinformatics Network and by the National Institute of Allergy and Infectious Diseases and the National Institutes of Health through the Centre for the AIDS Programme of Research in South Africa (grant no. 1U19AI51794). Funding to pay the Open Access publication charges was provided by the South African National Bioinformatics Network.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Keith A Crandall

Received on June 26, 2006; revised on July 31, 2006; accepted on August 1, 2006

    REFERENCES

    Anisimova, M., et al. (2003) Effect of recombination on the accuracy of the likelihood method for detecting positive selection at amino acid sites. Genetics, 164, 1229–1236.

    Barton, N.H. and Etheridge, A.M. (2004) The effect of selection on genealogies. Genetics, 166, 1115–1131.

    Goldman, N. and Yang, Z. (1994) A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol., 11, 725–736.

    Hasegawa, M., et al. (1985) Dating the human-ape split by a molecular clock of mitochondrial DNA. J. Mol. Evol., 22, 160–174.

    Hudson, R. (1983) Properties of a neutral allele model with intragenic recombination. Theor. Popul. Biol., 23, 183–201.

    Kosakovsky Pond, S.L. and Muse, S.V. (2005) Site-to-site variation of synonymous substitution rates. Mol. Biol. Evol., 22, 2375–2385.

    Kosakovsky Pond, S.L., et al. (2005) HyPhy: hypothesis testing using phylogenies. Bioinformatics, 21, 676–679.

    Kosakovsky Pond, S.L., et al. (2006) Automated phylogenetic detection of recombination using a genetic algorithm. Mol. Biol. Evol., msl051.

    Martin, D.P. and Rybicki, E. (2000) RDP: detection of recombination amongst aligned sequences. Bioinformatics, 16, 562–563.

    Martin, D.P., et al. (2005) RDP2: recombination detection and analysis from sequence alignments. Bioinformatics, 21, 260–262.

    Maynard Smith, J. (1992) Analysing the mosaic structure of genes. J. Mol. Evol., 34, 126–129.

    Muse, S. and Gaut, B. (1994) A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol. Biol. Evol., 11, 715–724.

    Nielsen, R. and Yang, Z. (1998) Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics, 148, 929–936.

    Padidam, M., et al. (1999) Possible emergence of new geminiviruses by frequent recombination. Virology, 265, 218–225.

    Poke, F., et al. (2006) The impact of intragenic recombination on phylogenetic reconstruction at the sectional level in Eucalyptus when using a single copy nuclear gene (cinnamoyl CoA reductase). Mol. Phylogenet. Evol., 39, 160–170.

    Shriner, D., et al. (2003) Potential impact of recombination on sitewise approaches for detecting positive natural selection. Genet. Res., 81, 115–121.

    Swofford, D. (2002) PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). Version 4. Sinauer Associates, Sunderland, Massachusetts.

    Wong, W.S.W., et al. (2004) Accuracy and power of statistical methods for detecting adaptive evolution in protein coding sequences and for identifying positively selected sites. Genetics, 168, 1041–1051.

    Yang, Z., et al. (2000) Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics, 155, 431–449.





