Skip Navigation

Bioinformatics 2007 23(13):i319-i327; doi:10.1093/bioinformatics/btm176
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Google Scholar
Right arrow Articles by Mayrose, I.
Right arrow Articles by Pupko, T.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Mayrose, I.
Right arrow Articles by Pupko, T.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Towards realistic codon models: among site variability and dependency of synonymous and non-synonymous rates

Itay Mayrose {dagger}, Adi Doron-Faigenboim {dagger}, Eran Bacharach and Tal Pupko *

The Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel- Aviv University, Tel Aviv 69978, Israel

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 

Codon evolutionary models are widely used to infer the selection forces acting on a protein. The non-synonymous to synonymous rate ratio (denoted by Ka/Ks) is used to infer specific positions that are under purifying or positive selection. Current evolutionary models usually assume that only the non-synonymous rates vary among sites while the synonymous substitution rates are constant. This assumption ignores the possibility of selection forces acting at the DNA or mRNA levels. Towards a more realistic description of sequence evolution, we present a model that accounts for among-site-variation of both synonymous and non-synonymous substitution rates. Furthermore, we alleviate the widespread assumption that positions evolve independently of each other. Thus, possible sources of bias caused by random fluctuations in either the synonymous or non-synonymous rate estimations at a single site is removed. Our model is based on two hidden Markov models that operate on the spatial dimension: one describes the dependency between adjacent non-synonymous rates while the other describes the dependency between adjacent synonymous rates. The presented model is applied to study the selection pressure across the HIV-1 genome. The new model better describes the evolution of all HIV-1 genes, as compared to current codon models. Using both simulations and real data analyses, we illustrate that accounting for synonymous rate variability and dependency greatly increases the accuracy of Ka/Ks estimation and in particular of positively selected sites. Finally, we discuss the applicability of the developed model to infer the selection forces in regulatory and overlapping regions of the HIV-1 genome.

Contact: talp{at}post.tau.ac.il


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 
Likelihood methods combined with probabilistic models of sequence evolution are considered the state-of-the-art methods for phylogeny inference and allow robust parameter estimation and vigorous testing of evolutionary hypotheses (Whelan et al., 2001; Yang, 2006). While amino-acid evolutionary models are restricted to computing the degree of purifying selection acting on each site (e.g. Mayrose et al., 2004), codon evolutionary models can be used to detect both the purifying and the positive Darwinian selection forces (reviewed in Yang, 2005). The selection type and intensity are inferred by contrasting the ratio of non-synonymous (amino-acid altering; Ka) to synonymous (silent; Ks) substitution rates, {omega}. Sites showing {omega} significantly lower than 1 are regarded as undergoing purifying selection and may have a functionally or structurally important role. Sites with {omega} > 1 are indicative of positive Darwinian selection, suggesting adaptive evolution.

The most widely used codon evolutionary models assume that the purifying selection acting on protein-coding DNA sequences is the result of selection that operates at the protein level only (Yang et al., 2000). These models assume that synonymous substitutions are free from selection pressure and represent neutral evolution. In such a case, Ks is constant across codon positions, Ka is heterogeneous and the inference of site-specific {omega} values is based solely on Ka variations. Accumulating lines of evidence, however, suggest that this assumption is biologically unrealistic and that synonymous substitutions are regularly subjected to purifying selection (reviewed in Chamary et al., 2006). This may be the result of, for example, constraints for maintaining mRNA secondary structure or cis regulatory motifs (e.g. exonic splicing enhancers) superimposed on the coding sequence of the gene.

Recently, Pond and Muse (2005) presented an evolutionary model that allows for site-to-site variation of both the synonymous and the non-synonymous substitution rates. This model was shown to better fit 9 out of 10 datasets analyzed. In this study it was also illustrated that sites inferred with a significant support to be positively selected under the synonymous constant model may be inferred as being under purifying selection when synonymous rate variation is included in the model (and vice versa). Notwithstanding, a worrying consequence of including site-to-site synonymous rate variation in the model is that now the inference of {omega} relies on the ratio of two inferred parameters. As such, random fluctuations in Ka and Ks estimates may more easily lead to an erroneous {omega} inference. Consider for example, a neutrally evolving site with both Ka and Ks equal to 1. Random fluctuations in the underlying sequences or in their sampling can easily shift the inference of Ka and Ks to 1.2 and 0.8, respectively, leading to an inferred positively selected site with {omega} = 1.5. This shortcoming may be bypassed by including in the model our biological understanding that rates of evolution are correlated between adjacent sites.

A few models were previously developed that relaxed the unrealistic assumption that sites in DNA or protein sequences evolve independently (Felsenstein and Churchill, 1996; Stern and Pupko, 2006; Yang, 1995). Hidden Markov models (HMM) were used to account for the correlation between the rates of adjacent positions. These models were shown to better fit DNA and protein data and to improve the accuracy of site-specific evolutionary rate inference. When codon models are considered, dependencies in either Ka, Ks or both may exist. For example, in a functional region of a protein, consecutive positions with low non-synonymous rates are often observed. Similarly, exonic regulatory elements are characterized by a stretch of sites showing a reduction of both Ka and Ks rates (Goren et al., 2006).

Here, we present models in which both the synonymous and non-synonymous rates vary among sites. Models assuming dependencies between either the Ka, the Ks or both are developed and studied. We show that the inference of positively selected sites is greatly influenced by considering synonymous rate variation, and that a model in which rate dependencies are accounted for greatly increases the accuracy of positive selection inference. The models are applied to study human immunodeficiency virus type 1 (HIV-1), a medically important test case for positive selection studies (e.g. de Oliveira et al., 2004; Yang et al., 2003). We demonstrate the usefulness of our novel models in analyzing overlapping viral genes, in detecting viral cis-regulatory elements, and in studying patterns of selection in specific viral genes and sites.


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 
2.1 The basic evolutionary model (KaC–KsC)
Our codon models are represented as a continuous time Markov process, defined by the instantaneous rate matrix Q. In the simplest model, the rate of changing from codon i to codon j (Qij) is defined as follows:


Formula

where {lambda}s denotes the synonymous substitution rate, {lambda}a denotes the non-synonymous substitution rate, {kappa} denotes the transition versus transversion bias, and {pi}j is the target codon frequency calculated using the product of the observed nucleotide frequencies at the three codon positions (the F3 x 4 model of Yang et al., 2000).

Because of the confounding effects between evolutionary rates and divergence times (Felsenstein, 1981), we can arbitrarily set {lambda}s = 1. The ratio of non-synonymous to synonymous rates, {omega} = {lambda}a/{lambda}s, provides an indication of the magnitude of selective pressure. This model is similar to the codon models suggested by Nielsen and Yang (1998) and Muse and Gaut (1994). We refer to this model as KaC–KsC as both the Ka and Ks rates are constant across the gene (i.e. do not vary).

2.2 Variable Ka with constant Ks (KaV–KsC)
The KaV–KsC model expands the basic model above by allowing Ka to vary across sites. Thus, {lambda}s is shared by all sites and equals 1, while {lambda}a is treated as a random variable drawn independently for each site according to a pre-defined distribution (e.g. Gamma). This is the most commonly used model for detecting positive selection (Nielsen and Yang, 1998; Yang et al., 2000).

2.3 Variable Ka and variable Ks (KaV–KsV)
In this model, both the synonymous and non-synonymous substitution rates are allowed to vary across sites. Thus, both the {lambda}a and {lambda}s parameters are treated as random variables sampled independently for each site from two given rate distributions, with the distribution of {lambda}s restricted to have mean 1. Here, we use the Gamma distribution {Gamma}({alpha}a, ßa) to model the non-synonymous rate variation. The synonymous rate variation is modeled by the unit mean Gamma distribution {Gamma}({alpha}s, {alpha}s). The Gamma distributions are approximated by Ca and Cs equally probable discrete categories (Yang, 1994), representing the non-synonymous and synonymous distributions, respectively. This model was previously suggested by Pond and Muse (2005).

2.4 Modeling site-dependencies (KaD–KsD)
In this model, rate dependencies between codon sites are assumed. The non-synonymous substitution rate Formula at site i is dependent on the non-synonymous substitution rate Formula at site i–1. Similarly, we also assume that the synonymous substitution rate Formula and Formula are dependent. Once the sites are assigned to their rates, each position evolves independently. In each site there are Ca possible non-synonymous rate categories and Cs synonymous rate categories. The dependency between rates at adjacent positions is modeled using two Markov chains that operate on the spatial dimension: one describes the dependency between non-synonymous rates while the other describes the dependency between synonymous rates. The two Markov chains are assumed to be independent of each other. Site dependency within each chain is modeled using the parameters {rho}a, {rho}s isin (0, 1) for the non-synonymous and synonymous chains, respectively. Larger {rho} values indicate higher correlation between adjacent sites.

Practically, instead of modeling two independent chains with Ca and Cs states we can use a combined Markov chain with Ca x Cs states. The state Sjk denotes a site with a non-synonymous rate category j and synonymous rate category k.

The data are then analyzed using an HMM, in which the rates at each position are the hidden states (Durbin et al., 1998). The HMM is characterized by the transition probabilities between the states of the combined Markov chain, by the initial probabilities of the hidden states and by the emission probabilities, which represent the probability of the observations given assignments to the hidden states.

Let T represent the transition matrix between any two states. The transition probability between the states Sjk and Slp is


Formula 1

(1)
where (Tj->l |{alpha}a, ßa, {rho}a) is the transition probability between the non-synonymous states j and l. (Tk->p |{alpha}s, {rho}s) is computed similarly for the synonymous transition. These probabilities are calculated using the correlated bivariate gamma distribution (Stern and Pupko, 2006; Yang, 1995).

The initial probabilities of the hidden states (the rates) in the combined chain are computed using the probabilities obtained by the discrete gamma approximation. Thus, the initial probability of being at the non-synonymous states j and the synonymous state k is


Formula 2

(2)
where P(Sj | {alpha}a, ßa) and P(Sk | {alpha}s) are the initial probabilities of the non-synonymous category j and the synonymous category k, respectively. Since each rate category is given equal probability (Yang, 1994) the initial probabilities are simply 1/(Ca x Cs).

The emission probabilities are the likelihoods of the data at each position. Specifically, P(di|Sjk) is the probability of observing the data at position i, given a certain rate assignment Sjk. This probability is computed using standard approaches (Felsenstein, 1981). Note that all computations are done assuming the phylogenetic tree topology and its branch lengths are given, but for simplicity, we omit them from the equations.

To summarize, the parameters of the KaD–KsD model are {theta} = {{alpha}a, ßa, {alpha}s, {rho}a, {rho}s}. Two variants of this model exist: the KaV–KsD model which assumes dependencies between synonymous rates only, and the KaD–KsV model which assumes dependencies between non-synonymous rates only. These two models can be obtained from the KaD–KsD model by constraining {rho}a = 0, or {rho}s = 0, respectively. The parameters of the model are estimated by maximizing the likelihood function, P(d | {theta}). The likelihood can be calculated using the backward dynamic programming algorithm (Durbin et al., 1998). Let Formula be the probability of observing the partial data from sites i + 1 through n, given that the combined non-synonymous and synonymous state at site i is Sjk. where n is the sequence length. Then


Formula 3

(3)
with bjk(n) = 1. The likelihood is thus


Formula 4

(4)

2.5 Estimating site-specific synonymous and non-synonymous rates
Selective pressure at individual sites can be inferred by calculating the Ka/Ks ratio at each site and the posterior probability of a site to evolve under positive selection pressure. The posterior probability of site i belonging to state Sjk is


Formula 5

(5)
where fjk(i) = P(d1, ..., di|Formula ) is the joint probability of observing sites 1 through i given that the combined state at site i is Sjk. fjk(i) can be calculated using the forward dynamic algorithm (Durbin et al., 1998)


Formula 6

(6)
with fjk (1) = P(d1, Sjk) = {pi}jkP(d1 | Sjk)

Point estimates of Formula and Formula can then be obtained by calculating the expectation over the empirical posterior distribution. Formula is obtained by


Formula 7

(7)
where {lambda}a(Sjk) is the non-synonymous rate assignment in state Sjk. Formula is similarly obtained.

The posterior probability of site i to evolve under positive selection pressure is then the sum of posterior probabilities of states, in which the non-synonymous rate is higher than the synonymous rate


Formula 8

(8)

2.6 Model comparison
The likelihood ratio test (LRT) can be used in order to determine which model best fits the data. The LRT is applicable since the models are nested (except for KaD–KsV and KaV–KsD): when {rho}a, {rho}s = 0, KaD–KsD collapses to KaV–KsV. KaV–KsV collapses to KaV–KsC when {alpha}s approaches infinity. The differences in log-likelihood between the models are compared to a {chi}2 distribution to obtain a P-value (Yang, 2006). Hence, the additional parameters are statistically justified if the log-likelihood improvement between KaD–KsD and KaV–KsV is at least 4.6 or 3.32 between KaV–KsV and KaV–KsC; P-value <0.01 according to the Formula and Formula asymptotic distributions, respectively. Model comparisons using the 2nd order Akaike Information Criterion (AICc) (Burnham and Anderson, 2002) were also performed and gave essentially identical results (not shown).

2.7 HIV-1 datasets
To test the utility of the proposed model we analyzed the nine coding genes of HIV-1. Aligned nucleotide sequences for each gene were retrieved from the Los Alamos HIV sequence database (http://www.hiv.lanl.gov/). The multiple sequence alignment (MSA) of each gene was then modified according to the following steps: (i) A reference sequence for the gene, taken from the HIV-1 complete genome (accession number AF033819 [GenBank] ) was added to the alignment using the profile to profile alignment option of CLUSTALW (Thompson et al., 1994); (ii) Sequences containing missing data and stop codons inside the reading frame, or sequences not starting with ATG, were removed from the analysis. This last criterion was not applied to the pol dataset since it naturally does not start with the ATG codon; (iii) In order to eliminate the number of gaps in the alignment, sequences that opened insertion positions that are shared by less than 25% of the sequences were removed and (iv) In order to save computational time only the 80 most divergent sequences of the remaining alignment were used. The most divergent sequences were chosen using the following procedure: first, all pairwise distances were calculated. Next, the sequence with the maximal distance to one of the sequences already selected was iteratively added. The phylogenetic tree for each gene was reconstructed using the neighbor-joining algorithm (Saitou and Nei, 1987) with pairwise distances estimated using the Jukes and Cantor distance for codons. The parameters of each model were then optimized using the maximum likelihood (ML) paradigm. The gamma distribution was approximated using three categories. In models that include both Ka and Ks variation, three categories are assumed for Ka and three for Ks, resulting in nine possible rate states. The approximation of 3 categories was used to determine the ML estimates of the model parameters. In order to increase accuracy when computing site-specific Formula and Formula , the number of discrete rate categories was increased from three to eight using the same ML parameter estimates.

2.8 Simulation study
Simulations were conducted in order to examine the accuracy of site-specific Formula and Formula under the different models. We simulated each site with a specific ‘true’ rate. An MSA was thus generated based on a vector of true rates. Subsequently, Formula and Formula rates for each column were inferred using the different models. For the simulations, one must provide a true rate for each site. In order to obtain true rates that are biologically relevant, characteristic rates were computed based on the empirically inferred rates of the gag and vif datasets (Table 1). For each dataset, 30 and 15 sequences were simulated. The model tree for each dataset was obtained using the 30 (15) most divergent sequences of the original dataset. For each dataset, three vectors of true Ka and Ks rates were obtained: those inferred with KaV–KsC, KaV–KsV or KaD–KsD. Simulating with true rates inferred with a specific model may bias the results towards this model. Thus, for each simulated dataset, we conducted three separate analyses, corresponding to the three vectors of true rates. In total, 12 simulation scenarios were performed (gag and vif genes, 15 and 30 sequences, each with three vectors of true rates). For each simulated scenario 10 independent and identical runs were conducted. The accuracy of inference was measured by the mean absolute deviation (MAD) distance between the simulated and inferred Ka/Ks values


Formula 9

(9)
where the estimated {omega}i is Formula .


View this table:
[in this window]
[in a new window]

 
Table 1. Log-likelihood (LL) values for nine HIV-1 coding genes under the five models analyzed

 

    3 RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 
3.1 HIV-1 dataset analysis
To test the utility of the proposed model the nine coding genes of HIV-1 were analyzed. For all genes, models that account for the synonymous rate variation were significantly superior to the commonly used KaV–KsC model, which assumes that all positions evolve with the same Ks rate (Table 1). The minimal log-likelihood difference between the KaV–KsC model and the models accounting for Ks rate variation was 130, and the maximal difference was more than 1500. The hypothesis of constant synonymous rate was thus rejected at P << 0.0001 for all HIV-1 genes. The coefficient of variation (CV), measured by the SD divided by the mean, was used to compare the dispersion of the estimated Ka and Ks distributions. The CV of the non-synonymous rate distribution was found to be in the same order of magnitude as that of the synonymous rate distribution (Table 2), further supporting the essentiality of correctly modelling Ks variation.


View this table:
[in this window]
[in a new window]

 
Table 2. The correlation between adjacent Ka rates ({rho}a) and adjacent Ks rates ({rho}s) and the coefficient variation (CV) of the Ka and Ks distributions inferred by the KaD–KsD model for the nine HIV-1 genes

 
A strong pattern of rate dependency among adjacent positions was found for all HIV-1 genes. This was true for both the Ka and the Ks rates. When comparing a model that accounts for dependencies of both Ka and Ks rates (the KaD–KsD model) with a model that does not (KaV–KsV), a significant increase in log-likelihood was found for all genes analyzed (P < 0.001 for all genes). As expected, the difference in log-likelihood is correlated with the sequence length (Table 1). For example, for pol, the longest gene, the difference was in the order of hundreds. We next tested what factor contributed most to the superiority of KaD–KsD over KaV–KsV: the dependency among synonymous rates or the dependency among non-synonymous rates. Dependency of Ka rates corresponds to regional selection pressure at the protein level, while Ks dependency reflects regional selection on the coding DNA or mRNA. Taking into account Ka dependency only (the KaD–KsV model) significantly increased the log-likelihood as compared to the independent model (KaV–KsV) in six out of nine HIV-1 genes (Table 1). This result is in agreement with previously published works emphasizing the importance of taking into account dependencies at the protein level (e.g. Stern and Pupko, 2006). Taking into account dependency among Ks rates only (comparing the KaV–KsV and the KaV–KsD models), resulted in a significant increase in log-likelihood in all datasets. Interestingly, the contribution of Ks rate dependency was higher than the contribution of Ka rate dependency for all but the rev dataset. The higher Ks rate dependency was also evident in the comparison of the models' correlation parameters: the inferred correlation between Ks rates was higher than the Ka rate correlation in all genes aside from rev (Table 2). Thus, our results strongly indicate that Ks rates vary in a spatial manner along the genes.

3.2 Accuracy of rate estimation: a simulation study
Simulations were used to compare the inference accuracy of site-specific {omega} values under the various models. This comparison tested the accuracy of inference across the whole {omega} range (i.e. both positive and purifying selection) and did not concentrate only on inferring positive selection. As shown in Table 3, the inference obtained using the KaD–KsD model was constantly more accurate compared to the inference obtained using either the KaV–KsV or KaV–KsC models. These results were significant for 11 out of 12 simulated scenarios (P < 0.01; paired t-test). The comparison between KaD–KsD and KaV–KsD was inconclusive as none of the two was consistently more accurate than the other across the 12 simulated scenarios tested.


View this table:
[in this window]
[in a new window]

 
Table 3. Simulation results: accuracy of site-specific {omega} inference based on the different models under different simulation scenarios

 
We next concentrated on the success of various models to specifically infer sites that are under positive selection. The receiver operating characteristic (ROC) curve was used in order to compare the precision and sensitivity of inference under the different models. The closer the curve to the upper left corner, the more successful the prediction is. An example of two ROC curves for simulation scenarios 1 and 2 (Table 3) is shown in Figure 1. These results demonstrate that {omega} values inferred with the KaV–KsC model are consistently the least accurate. The accuracy of the KaD–KsD was greater than that inferred with KaV–KsV when the simulated rates were taken from an empirical rate vector inferred with KaD–KsD (see methods for simulation procedures), while the two models achieved comparable accuracy when simulating with KaV–KsV. In both simulated scenarios, however, the highest precision and accuracy was achieved with the KaV–KsD model. The superiority of the KaV–KsD model over KaD–KsD seen in Figure 1 can be attributed to the small number of sequences (15) simulated. The parameter-rich KaD–KsD model may overfit such a small dataset. Indeed, when the number of sequences is large (the 80 HIV-1 sequences, see below) no significant differences in the inferences of the two models are observed.


Figure 1
View larger version (20K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. ROC curves for predicting positively selected sites. (A) ROC curve under simulation scenario 1 and (B) ROC curve under simulation scenario 2. Positively selected sites are regarded as those that were simulated with {omega} > 1.5 and purifying selected sites (true negatives) are those that were simulated with {omega} < 0.9. Sites simulated with 0.9 < {omega} < 1.5 are on the boundary between positive, purifying and neutral evolution and were excluded from the analysis. Other cutoff values gave similar results.

 
3.3 Inferring positive selection across the HIV-1 genome
The number of positively selected sites for each HIV-1 gene analyzed is given in Table 4. In agreement with previous studies (de Oliveira et al., 2004; Yang et al., 2003) positively selected sites were abundantly found in all nine coding genes under all models. However, the number of positively selected sites was found to be highly sensitive to the model applied. Accounting for heterogeneous synonymous rates substantially reduced the number of inferred positively selected sites (Table 4). Accounting for dependency among Ks rates further reduced the number of positively selected sites inferred (Table 4). Taken together, more than twice as many positive sites are inferred by KaV–KsC compared to that by KaD–KsD.


View this table:
[in this window]
[in a new window]

 
Table 4. The number of inferred positively selected sites for each HIV-1 gene under the different modelsa

 
It is expected that the positively selected sites inferred using the KaD–KsD model would be a subset of the sites inferred under the KaV–KsC model. This, however, was not generally the case. The three models together inferred 623 distinct positively selected sites. Only 135 sites (22% of the total) were shared among all three models (Figure 2). The KaD–KsD model exhibited the lowest proportion of uniquely inferred sites and the highest proportion of sites that are shared by all three models. Furthermore, the agreement between KaV–KsC and the other two models was the lowest. Hence, sites inferred under the KaD–KsD model seem to be the most consistent with sites inferred using the other two models. Of course, this consistency does not ensure correct results, but its absence in the other models is alarming. The high sensitivity of the KaV–KsC and KaV–KsV models questions the validity of their assumptions for the data at hand.


Figure 2
View larger version (26K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. The total number of positively inferred sites (Posterior probability > 0.9) across the nine genes of HIV-1 for the KaV–KsC, KaV–KsV and KaD–KsD models.

 
Across all HIV-1 genes, the results obtained with KaV–KsD were highly similar to those obtained with KaD–KsD. The results obtained with KaD–KsV highly overlapped those obtained with KaV–KsV (Table 4). This further indicated that accounting for the dependence among Ks rates is more significant than the dependence among Ka rates.

3.4 Positive selection in overlapping regions
The existence of overlapping genes is a widespread phenomenon in the genomes of small viruses (Pavesi, 2006). In overlapping regions, we expect constraints over synonymous and non-synonymous substitutions that are different from those in non-overlapping regions. In fact, overlapping regions are expected to display low and correlated Ks and Ka rates. Our KaD–KsD is especially useful for analyzing a gene that overlaps with another gene, as both Ks variability and spatial dependence is expected.

Here we illustrate that the KaD–KsD model accounts for the selection forces acting on the overlapping region between the vif and pol genes. The overlapping region spans 18 codons (sites 1–18 of vif with sites 986–1004 of pol). Figure 3 shows that this region (the 5' of vif and the 3' of pol) exhibits a substantial reduction of Ks rates as compared to the rest of the gene. Site 12 of the vif gene shows a peculiar selection pattern. Although part of the overlapping region, it exhibits a very high Ks rate and low Ka rate (Figure 3A). What can explain such a high Ks variation in this site? The explanation is provided in Figure 3B. Site 12 of vif corresponds to site 999 in the pol gene. Focusing on this site in pol, the high Ka/Ks ratio ({omega} = 11.4) suggests intensive positive selection acting on this site. Thus, the positive selection in pol drives the high Ks variability of the corresponding site in vif. The purifying selection at the protein level, extracted on this position in vif does not permit any non-synonymous changes, thus increasing the Ks rate only. Position number 8, which overlaps position 994 in pol that has the second highest {omega} value does not exhibit a very large peak in its Ks rate. Alternatively, both its Ka and Ks rates are elevated. The purifying selection extracted on position number 8 is probably less significant than that extracted on position number 12, thus allowing for both substitution types. Finally, it is important to note that these observations cannot be obtained when analyzing the data with the commonly used KaV–KsC model. Under this model, the Ks peak at site 12 of vif cannot be detected. Furthermore, the assumption of a constant Ks rate across the entire pol coding region would have obscured the signal for positive selection in site 999 ({omega} = 1.06, posterior probability = 0.81 in KaV–KsC versus {omega} = 11.4 and posterior probability > 0.99 in KaD–KsD).


Figure 3
View larger version (27K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. Ka and Ks values for a region of 50 codon sites of (A) vif and (B) pol as inferred by the KaD–KsD model. The overlapping regions are marked by circles. The Ka and Ks rates are shown by gray and black lines, respectively.

 
3.5 Identifying regulatory elements in coding regions
The KaD–KsD model allows the identification of coding regions, in which a spatial purifying selection operates at the nucleic acid level. Using this model, we searched for stretches with low synonymous rates across the pol sequence. In addition to the overlapping region described above, a second region with a low synonymous rate was revealed (codon positions 900–947, Figure 4A). Part of this sequence (positions 900–907, Figure 4B) exhibits exceptional low rate of Ks (as well as a low Ka rate), suggesting that the primary sequence of the DNA, or the RNA, has a functional role. This segment is enriched with purines. Indeed, HIV and other lentiviruses have two polypurine tracts (PPTs) with a known functional role: both PPTs serve as RNA primers for the reverse transcriptase in the synthesis of the plus-strand DNA. One is located at the 3' untranslated region of the viral genome. The second is located at the center of the genome, hence called the central PPT (cPPT) (Charneau and Clavel, 1991). This cPPT locus is mapped exactly to the ultraconserved Ks region in positions 902–906 of pol. A short stretch of T-rich motif (TTTT) immediately upstream of the cPPT is correlated with the ultraconservation found in positions 900–901. These codon positions were experimentally shown to be important for reverse transcription (Ilyinskii and Desrosiers, 1998).


Figure 4
View larger version (31K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4. (A) Ks variability over the pol gene. The segment of the pol gene in which the Ks rate is substantially reduced is rectangled. (B) A ‘zoom-in’ on positions 900–950 showing the locations of the ultraconserved Ks regions. Region 1 contains the cPPT and T-rich elements. Region 2 contains the CTS element.

 
Among the 48 conserved codons, a second region (positions 934–947) exhibits exceptionally high level of Ks conservation (Figure 4B). This region nicely overlaps a cis-acting signal named the central termination sequence (CTS), located at positions 930–939. During reverse transcription, a DNA structure, called DNA flap, is formed adjacent to this site (Charneau et al.,1994). This structure has been shown to be involved in the nuclear import of the HIV genome (Zennou et al., 2000). Altogether, our results demonstrate that functionally important cis-acting signals can be identified using our model.


    4 DISCUSSIONS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 
Estimating the selection pattern at the codon level is at the heart of evolutionary research. Various methods and techniques were developed to increase the accuracy of Ka/Ks inference. Earlier methods were based on counting (e.g. Nei and Gojobori, 1986). In these methods, the number of synonymous and non-synonymous substitutions in each codon is inferred and normalized by the number of synonymous and non-synonymous site. While usually very fast, these counting techniques are statistically problematic as no less than four parameters are estimated for each site (the numbers of synonymous and non-synonymous substitutions and the numbers of synonymous and non-synonymous sites). Muse (1996) further demonstrated that in these methods the estimation of non-synonymous and synonymous rates is not independent of each other. The alternative to these counting methods is to use an explicit model of codon substitution. The inference of Ka/Ks ratio is a by-product of these models. These models tend to be computationally intensive as they involved exponentiating a large (61 by 61) rate matrix. Consequently, current models use oversimplified assumptions such as fixed Ks variation and that each site evolves independently of each other. Thus, when the goal is to estimate Ks variability, one has to resort to counting techniques. In this study, we showed how Ks variability and dependency can be incorporated into probabilistic evolutionary models. We have demonstrated that these models better fit biological data and can significantly increase the accuracy of Ka/Ks estimates.

Yang (2006) and Hurst (2002) have discussed the difficulty of a reliable inference of {omega} per site. This difficulty, which results in a high number of false positive predictions, is due to the large number of inferences being made at a single analysis, and to the inference of {omega} as a ratio of two estimated parameters (Anisimova et al., 2002). An extensive research was aimed at reducing the number of false positive predictions. Specifically, effort was concentrated on model comparisons, on accurate estimation of model parameters, and on finding the best prior distribution of the non-synonymous rates (Anisimova et al., 2001, 2002; Swanson et al., 2003; Wong et al., 2004; Yang et al., 2000, 2005). For example, Anisimova et al. (2001, 2002) recommended to search for positively selected sites using several alternative models and choose those sites that are common to all, thus ensuring the robustness of the results. Noteworthy, all these competing models differ only in the assumed distribution of the {omega} parameter. Recently, the Bayes–Empirical–Bayes approach was developed (Yang et al., 2005), taking into account the sampling errors of the ML estimates of model parameters. This approach was shown to reduce the number of falsely inferred positively selected sites, particularly when small datasets are analyzed. Here we show that the number of falsely inferred positive sites can also be sharply decreased by incorporating the dependence between adjacent synonymous and non-synonymous rates. Using such models as KaD–KsD has the effect of eliminating random noise in site-specific Ka and Ks estimates.

Characterizing the selection forces acting on viral genomes at the DNA level is of great interest. For example, there are well-known RNA secondary structures along the HIV-1 genome, such as the Rev responsive element, that play important functional roles during the virus life cycle (Malim et al., 1990). Another obvious example is the existence of overlapping genes, in which a strong selection on both synonymous and non-synonymous substitutions is expected. However, since the commonly used codon models fail to account for Ks variation, such regions are usually excluded from the analysis, despite their being of high medical interest. For example, in order to avoid overlapping regions, Yang et al. (2003) excluded the whole tat, rev and vpu genes when analyzing selection forces in HIV-1. These three genes, however, are the ones exhibiting the largest fraction of sites that are positively selected (Table 4). Furthermore, assuming a constant synonymous rate for the whole gene is likely to overestimate the actual synonymous rate at the overlapping region and to underestimate the rate at the non-overlapping region. Consequently, the number of positively selected sites is expected to be too low for the overlapping region and too high for the non-overlapping region. The models suggested in this study provide a statistically robust approach to study selection at such overlapping and regulatory elements.

Our analysis revealed a 50-codon-long sequence in the pol open reading frame with exceptionally low synonymous and non-synonymous rates. This sequence roughly starts and ends with sequences known to be functionally important, namely the TTTT-rich box, the cPPT and the CTS signals, all act in cis during the process of HIV-1 reverse transcription (Charneau and Clavel, 1991; Charneau et al., 1994 Ilyinskii and Desrosiers, 1998). This further indicates that our model has the ability to identify important regulatory sequences embedded in open reading frames. Not all conserved segments in this region have a known function. Yet, one can speculate that such segments serve as specific binding sites for cellular and/or viral factors. Support for this hypothesis comes from the fact that these sequences form a DNA flap structure that was suggested to mediate the transport of the HIV-1 genome into the nucleus (Zennou et al., 2000).

Positively selected positions are classically defined as those in which the Ka/Ks ratio is significantly larger than 1. This definition, fundamental to evolutionary research, is based on the assumption that the synonymous rates reflect neutral rate of evolution. However, others and we have shown that synonymous rates vary considerably over sites. This indicates that purifying or maybe even positive, rather than neutral evolution, characterizes the synonymous substitutions to a certain extent. If, for example, synonymous sites are under purifying selection, average Ks values are an underestimate to the neutral rate of evolution, and hence, the Ka/Ks > 1 criterion is too liberal. If, however, positive selection is acting on synonymous sites, then Ks is an overestimate of the neutral rate, and the Ka/Ks criterion might be too conservative. Furthermore, when selection operates at the DNA or mRNA level, the Ka and Ks values are expected to be correlated. Altogether, the variability of Ks questions the validity of the standard Ka/Ks tests for detecting positively selected sites. This Ks variability might vary among different proteins and organisms. In non-viral sequences, an alternative approach is to compare the non-synonymous rate with the evolutionary rate of intronic regions or with the evolutionary rates of pseudogenes. In viral sequences that lack such neutral sequences, how to correct the test for positive selection in a way that accounts for the synonymous variability is an open question that calls for additional research.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 
We thank Adi Stern, Julien Dutheil and Eyal Privman for critically reading the manuscript. This study is supported by a grant to E.B. and T.P. from the Israeli Ministry of Science. T.P. and E.B. were also supported by the Israeli Science Foundation grants number 1208/04 and 1184/05. A.D.F. is an Israeli Ministry of Science Eshkol fellow.

Conflict of Interest: none declared.


    FOOTNOTES
 
{dagger}The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors. Back


    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Anisimova M, et al. Accuracy and power of the likelihood ratio test in detecting adaptive molecular evolution. Mol. Biol. Evol (2001) 18:1585–1592.[Abstract/Free Full Text]

    Anisimova M, et al. Accuracy and power of Bayes prediction of amino acid sites under positive selection. Mol. Biol. Evol (2002) 19:950–958.[Abstract/Free Full Text]

    Burnham KP, Anderson DR. Model Selection and Multimodel Inference: a Practical Information-theoretic Approach (2002) New York, USA: Springer-Verlag.

    Chamary JV, et al. Hearing silence: non-neutral evolution at synonymous sites in mammals. Nat. Rev. Genet (2006) 7:98–108.[CrossRef][Web of Science][Medline]

    Charneau P, Clavel F. A single-stranded gap in human immunodeficiency virus unintegrated linear DNA defined by a central copy of the polypurine tract. J. Virol (1991) 65:2415–2421.[Abstract/Free Full Text]

    Charneau P. HIV-1 reverse transcription. A termination step at the center of the genome. J. Mol. Biol (1994) 241:651–662.[CrossRef][Web of Science][Medline]

    de Oliveira T, et al. Mapping sites of positive selection and amino acid diversification in the HIV genome: an alternative approach to vaccine design? Genetics (2004) 167:1047–1058.[Abstract/Free Full Text]

    Durbin R, et al. Biological Sequence Analysis (1998) Cambridge: Cambridge University Press.

    Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol (1981) 17:368–376.[CrossRef][Web of Science][Medline]

    Felsenstein J, Churchill GA. A hidden Markov model approach to variation among sites in rate of evolution. Mol. Biol. Evol (1996) 13:93–104.[Abstract]

    Goren A, et al. Comparative analysis identifies exonic splicing regulatory sequences – the complex definition of enhancers and silencers. Mol. Cell (2006) 22:769–781.[CrossRef][Web of Science][Medline]

    Hurst LD. The Ka/Ks ratio: diagnosing the form of sequence evolution. Trends Genet (2002) 18:486.[CrossRef][Web of Science][Medline]

    Ilyinskii PO, Desrosiers RC. Identification of a sequence element immediately upstream of the polypurine tract that is essential for replication of simian immunodeficiency virus. EMBO J (1998) 17:3766–3774.[CrossRef][Web of Science][Medline]

    Malim MH, et al. HIV-1 structural gene expression requires binding of the Rev trans-activator to its RNA target sequence. Cell (1990) 60:675–683.[CrossRef][Web of Science][Medline]

    Mayrose I, et al. Comparison of site-specific rate-inference methods for protein sequences: empirical Bayesian methods are superior. Mol. Biol. Evol (2004) 21:1781–1791.[Abstract/Free Full Text]

    Muse SV, Gaut BS. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol. Biol. Evol (1994) 11:715–724.[Abstract]

    Muse SV. Estimating synonymous and nonsynonymous substitution rates. Mol Biol Evol (1996) 13:105–114.[Abstract]

    Nei M, Gojobori T. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol (1986) 3:418–426.[Abstract]

    Nielsen R, Yang Z. Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics (1998) 148:929–936.[Abstract/Free Full Text]

    Pavesi A. Origin and evolution of overlapping genes in the family Microviridae. J. Gen. Virol (2006) 87:1013–1017.[Abstract/Free Full Text]

    Pond SK, Muse SV. Site-to-site variation of synonymous substitution rates. Mol. Biol. Evol (2005) 22:2375–2385.[Abstract/Free Full Text]

    Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol (1987) 4:406–425.[Abstract]

    Stern A, Pupko T. An evolutionary space-time model with varying among-site dependencies. Mol. Biol. Evol (2006) 23:392–400.[Abstract/Free Full Text]

    Swanson WJ, et al. Pervasive adaptive evolution in mammalian fertilization proteins. Mol. Biol. Evol (2003) 20:18–20.[Abstract/Free Full Text]

    Thompson JD, et al. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res (1994) 22:4673–4680.[Abstract/Free Full Text]

    Whelan S, et al. Molecular phylogenetics: state-of-the-art methods for looking into the past. Trends Genet (2001) 17:262–272.[CrossRef][Web of Science][Medline]

    Wong WS, et al. Accuracy and power of statistical methods for detecting adaptive evolution in protein coding sequences and for identifying positively selected sites. Genetics (2004) 168:1041–1051.[Abstract/Free Full Text]

    Yang W, et al. Widespread adaptive evolution in the human immunodeficiency virus type 1 genome. J. Mol. Evol (2003) 57:212–221.[CrossRef][Web of Science][Medline]

    Yang Z. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol (1994) 39:306–314.[CrossRef][Web of Science][Medline]

    Yang Z. A space-time process model for the evolution of DNA sequences. Genetics (1995) 139:993–1005.[Abstract]

    Yang Z. The power of phylogenetic comparison in revealing protein function. Proc. Natl Acad. Sci. USA (2005) 102:3179–3180.[Free Full Text]

    Yang Z. Computational Molecular Evolution (2006) Oxford: Oxford University Press.

    Yang Z, et al. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics (2000) 155:431–449.[Abstract/Free Full Text]

    Yang Z, et al. Bayes empirical Bayes inference of amino acid sites under positive selection. Mol. Biol. Evol (2005) 22:1107–1118.[Abstract/Free Full Text]

    Zennou V, et al. HIV-1 genome nuclear import is mediated by a central DNA flap. Cell (2000) 101:173–185.[CrossRef][Web of Science][Medline]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Mol Biol EvolHome page
M. Anisimova and C. Kosiol
Investigating Protein-Coding Sequence Evolution with Probabilistic Codon Substitution Models
Mol. Biol. Evol., February 1, 2009; 26(2): 255 - 271.
[Abstract] [Full Text] [PDF]


Home page
J. Virol.Home page
D. C. Tully and M. A. Fares
Shifts in the Selection-Drift Balance Drive the Evolution and Epidemiology of Foot-and-Mouth Disease Virus
J. Virol., January 15, 2009; 83(2): 781 - 790.
[Abstract] [Full Text] [PDF]


Home page
Brief BioinformHome page
W. Delport, K. Scheffler, and C. Seoighe
Models of coding sequence evolution
Brief Bioinform, January 1, 2009; 10(1): 97 - 109.
[Abstract] [Full Text] [PDF]


Home page
Proc R Soc BHome page
S. Kryazhimskiy, G. A Bazykin, J. Plotkin, and J. Dushoff
Directionality in the evolution of influenza A haemagglutinin
Proc R Soc B, November 7, 2008; 275(1650): 2455 - 2464.
[Abstract] [Full Text] [PDF]


Home page
Genome ResHome page
R. A. Studer, S. Penel, L. Duret, and M. Robinson-Rechavi
Pervasive positive selection on duplicated and nonduplicated vertebrate protein coding genes
Genome Res., September 1, 2008; 18(9): 1393 - 1402.
[Abstract] [Full Text] [PDF]


Home page
J. Virol.Home page
S. Kryazhimskiy, G. A. Bazykin, and J. Dushoff
Natural Selection for Nucleotide Usage at Synonymous and Nonsynonymous Sites in Influenza A Virus Genes
J. Virol., May 15, 2008; 82(10): 4938 - 4945.
[Abstract] [Full Text] [PDF]


Home page
Brief BioinformHome page
D. S. Horner, W. Pirovano, and G. Pesole
Correlated substitution analysis and the prediction of amino acid structural contacts
Brief Bioinform, January 1, 2008; 9(1): 46 - 56.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Google Scholar
Right arrow Articles by Mayrose, I.
Right arrow Articles by Pupko, T.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Mayrose, I.
Right arrow Articles by Pupko, T.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?