Bioinformatics Advance Access originally published online on October 5, 2007
Bioinformatics 2007 23(22):2978-2986; doi:10.1093/bioinformatics/btm472
Annotation of selection strengths in viral genomes
Department of Statistics, University of Oxford, 1 South Parks Road, OX1 3TG, UK
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Viral genomes tend to code in overlapping reading frames to maximize informational content. This may result in atypical codon bias and particular evolutionary constraints. Due to the fast mutation rate of viruses, there is additional strong evidence for varying selection between intra- and intergenomic regions. The presence of multiple coding regions complicates the concept of Ka/Ks ratio, and thus begs for an alternative approach when investigating selection strengths. Building on the paper by McCauley and Hein, we develop a method for annotating a viral genome coding in overlapping reading frames. We introduce an evolutionary model capable of accounting for varying levels of selection along the genome, and incorporate it into our prior single sequence HMM methodology, extending it now to a phylogenetic HMM. Given an alignment of several homologous viruses to a reference sequence, we may thus achieve an annotation both of coding regions as well as selection strengths, allowing us to investigate different selection patterns and hypotheses.
Results: We illustrate our method by applying it to a multiple alignment of four HIV2 sequences, as well as of three Hepatitis B sequences. We obtain an annotation of the coding regions, as well as a posterior probability for each site of the strength of selection acting on it. From this we may deduce the average posterior selection acting on the different genes. Whilst we are encouraged to see in HIV2, that the known to be conserved genes gag and pol are indeed annotated as such, we also discover several sites of less stringent negative selection within the env gene. To the best of our knowledge, we are the first to subsequently provide a full selection annotation of the Hepatitis B genome by explicitly modelling the evolution within overlapping reading frames, and not relying on simple Ka/Ks ratios.
Availability: The Matlab code can be downloaded from http://www.stats.ox.ac.uk/mccauley/
Contact: degroot{at}stats.ox.ac.uk
| 1 INTRODUCTION |
|---|
|
|
|---|
The pressure on many viruses to minimize genome size often results in their coding in overlapping reading frames in order to compactify information. The nucleotides involved in coding in such multiple coding regions tend to exhibit atypical behaviour, both in codon bias as well as in evolutionary constraints.
There have been several approaches to investigate codon usage particular to multiple reading frames. Kozlov (2000) explored the variability of nucleotide and codon usage, specific to overlapping regions. Pavesi et al. (1997) examined the informational content within overlapping coding regions using information theory indices, as well as discovering an overrepresented motif common to such regions. McCauley and Hein (2006) developed an HMM, which modelled the nucleotide usage of a sequence separately for single, double and triple coding regions, thus being able to annotate full viral genomes based on the sequence composition.
Various studies have also been made into the evolutionary behaviour of overlapping regions. Generally, these regions have been thought to be under more stringent selection than singly coding sequences, since a mutation may result in a non-synonymous change in more than one gene. Indeed, higher than expected negative selection has even been used for the de novo detection of overlapping genes, as described by Spiropoulou and Nichol (1993) and Walewski et al. (2001).
Recently, however, several papers investigating overlapping coding regions—amongst them Mizokami et al. (1997), Guyader and Ducray (2002), Hughes et al. (2005), Pavesi (2006) and Osiowy et al. (2006)—attempt to determine selection acting on overlapping reading frames by comparing non-synonymous to synonymous substitution rates. Within a pair of overlapping genes, they report having detected purifying selection acting on one reading frame, and directional on the other. Unfortunately, using the concept of Ka/Ks ratio brings intrinsic problems with it, when applied to overlapping reading frames. Counting the ratio of non-synonymous to synonymous substitutions in one reading frame only makes sense, when the synonymous substitutions are unconstrained. In overlapping regions, however, a synonymous substitution in one reading frame could be non-synonymous in the other, thus biasing the analysis towards an underestimation of the true synonymous substitution rate in one reading frame and thus the potentially false appearance of positive selection.
Anticipating this problem, Rogozin et al. (2002) thus restricted their analysis on selection in prokaryotic genes, to positions where the synonymous substitutions were unconstrained. Hein and Støvlbæk (1995) also improved on the Ka/Ks approach, in their investigation on the evolution of multiple coding regions in viral genomes. By developing a context-specific evolutionary model and a maximum-likelihood method, they were able to estimate the actual strength of selection on different parts of the genome as well as to test various hypotheses about the nucleotide evolution within overlapping regions.
In this article we introduce a model for the evolution of a sequence, which accounts for variability in selection along the genome. Each site may choose from a given set of selection strengths, ranging from directional to purifying, and sites are independent up to an auto correlation factor. We make no assumptions about the difference in selectional behaviour between overlapping and single coding regions, and thus leave the model free reign. We incorporate our evolutionary model into the HMM introduced in McCauley and Hein (2006), extending it to a phylogenetic HMM, as described by Pedersen and Hein (2003) and Siepel and Haussler (2004). Given a reference sequence and a multiple alignment, we use the phylogenetic HMM to annotate the former with protein coding regions, as well as different levels of selection—what we call a selection annotation. We so do not rely on a prior annotation off GenBank (all data used is publicly released on the Gen Bank database, see http://www.ncbi.nlm.nih.gov), thus making this method readily available for the coding and selection annotation of new virus strains. If however the annotation is available, we may use it as a prior and run our method solely for the purpose of selection annotation.
| 2 METHODS |
|---|
|
|
|---|
2.1 Basic structure of our model
Initially, our set of coding states C is identical to those described in our previous paper (McCauley and Hein, 2006). Since we are dealing with overlapping reading frames, each nucleotide may be coding in up to three different reading frames, and within each of these for three different codon positions. This results in there being eight different coding states (see Fig. 1), on top of which we specify a set of m different selection strengths S. Now let X be an n x L matrix, giving the gapped alignment of n sequences of length L, and T be the underlying phylogenetic tree, as shown in Figure 2. Let the row vector X1 be our reference sequence and furthermore, let Xk:l, j denote the kth to lth entries of the jth column vector. Let c = (cj) (j = 1, ... , L) be a vector of coding states—where cj denotes the coding state at nucleotide position j—and similarly let s = (sj) be a vector of selection states. Then, we emit the joint probability distribution of an alignment column Xj, a coding state cj and a selection state sj with probability P(Xj, cj, sj), Our model however is complicated by the fact, that both transition and emission probabilities are actually dependent on the nucleotides in their neighbourhood as well as their coding and selection states, as described more detail in Section 2.2:
- The transition probability from one coding state cj to another is dependent on the prior three nucleotides.
- The emission of a nucleotide in the reference sequence is conditional on the prior two nucleotides.
- The emission of the alignment column vector X2:n, j is conditional on the nucleotide context, by which we mean X1, j – 2:j + 2.
|
| (1) |
- the first product term relates to the emission of the ancestral nucleotide conditional on the prior 2 nt and the coding state,
- the second product term relates to the emission of the descendant column conditional on the ancestral nucleotide, its context and the coding and selection states
- the third product term relates to the probability of transitioning from one coding state to another dependent on the prior 3 nt and finally,
- the fourth product term captures the probability of transitioning from one selection state to another.
|
|
Due to the various emission and transition probabilities having a two-sided context dependency of up to 3 nt, this holds for all but the first and last three positions of the sequence. These we fix to be in the non-coding state, which coincides with us wishing to only annotate entire genes. In the case of being in the non-coding state, the column probability is independent of the neighbouring nucleotide context, i.e. P(Xj | ci = NC, sj, X1, j – 2:j + 2) = P(Xj | cj = NC, sj). We may then, using Equation (1), formulate our likelihood L(X, c, s) of observing an alignment X together with a coding and selection state annotation, by simply multiplying the above expression over all columns j, (j = 4, ... , n – 3).
Note, that while our model does not quite fit into the usual HMM framework, it is trivial to adapt the standard Forward, Backward and Viterbi dynamic programming algorithms to it.
2.2 Model parameters
We will now elaborate on how to calculate the factors of Equation (1).
Our transition probability from a non-coding to a coding state is conditional on observing a start codon in the reference sequence. All other coding transition probabilities are deterministic—once in a coding state in a particular reading frame, one rotates in a three periodicity around the codon-positions until one meets a stop codon. This accounts for the coding state transition probabilities being dependent on the previous 3 nt in the reference sequence (see Fig. 1).
Concerning selection states, each one represents a different selection strength. These can be either independent parameters, or related by a grid to ease computational demands. Let our grid of size m be given by a fixed vector g = (g1, ... , gm). We then introduce a scaling parameter
, to obtain a new vector
· g of m different selections strengths. The parameter
is merely an artificial tool to decrease our parameterspace and has no direct biological significance. The product
· g is interpretable as the selection factor of a given site. Transitions from one selection state to itself occur with a certain probability
, which we call the autocorrelation factor, and states are switched with uniform probability
. With this model we cannot estimate single site selection, but only averages over subsets of sites. Without the autocorrelation, but keeping the grid of selection factors, we would essentially be modelling selection as a mixture model, similar to Yang and Swanson (2002). The justification for including the autocorrelation is that we believe, that the structure of the underlying protein implies a dependency in selection of closely situated loci. With our model we will not pick up selective strengths of single nucleotide positions, but rather general trends in evolutionary behaviour of subdomains of proteins, such as recognizing important reactive sites.
The emission probability P(X1, j | X1, j – 2: j – 1, cj) of the reference nucleotide in a certain coding state is dependent on the prior 2 nt. It is drawn from independent multinomial distributions for each of the conditionals, thus resulting in a total of 66 free parameters. For further details see McCauley and Hein (2006).
We let mutations in our sequences occur according to the general time reversible model (GTR), as described in Tavaré (1986), and obtain the base equilibrium frequencies from the nucleotide distribution of the reference sequence.
Given a reference sequence X1, together with a set of aligned sequences Xi we now consider the effect a mutation at a certain position might have on the amino acid composition. Since we are dealing with overlapping reading frames, we must consider the reference nucleotide being—potentially simultaneously—at the first, second and third locus of a codon. The emission probability of the descendant nucleotide is thus dependent on the nucleotide context in the reference sequence of two to the left and two to the right of it. Let Ex,y be the event of the mutation from nucleotide x to y, given a certain context Xj – 2:j + 2 about x, causing at least one non-synonymous substitution. Notice that x is not necessarily the nucleotide X1, j —the reference sequence is used to define the context of x only. With sj being the selection strength on the sequence at position j, that mutation then gets accepted by a factor
, where
is the indicator function. If a substitution is synonymous this will assign it with a selection factor of 1. Additionally mutations that result in the loss or gain of a true stop codon get penalized by an additional factor sstop.
We construct the substitution matrix Aj at column j as follows: define F j to be the matrix with entries
. That is to say, the (x, y)th entry in F j is sj if and only if there is at least one non-synonymous substitution between x and y in the context of X1, j – 2: j + 2, else the entry equals 1. Let the rate matrix associated with our evolutionary process be Q. If
denotes the entry by entry product between two matrices, (A
B)kl = Akl · Bkl, the instantaneous evolutionary rate matrix Aj for position j is then given by
|
| (2) |
|
| (3) |
We may subsequently obtain the expression P j (t) = exp(Ajt) for the substitution matrix of our model along a branch of length t, and from that calculate the probability of emitting a given column using Felsenstein's Pruning Algorithm on our tree topology T, with T re-rooted at X1. Assuming column independence, this then gives us the likelihood of the data given the model. Given a set of seed parameters, we may thus estimate all free parameters using the EM algorithm, combining Baum–Welch and numerical optimization. For the seed parameters, we first obtain a seed annotation by marking every open reading frame longer than 200 nt as coding. From this we estimate our state transition and evolutionary seed parameters, where initial input values to kick off the optimization procedure make little difference to our final estimates.
One problem underlying our above approach is the fixed nucleotide context over time, and ideally we would let it vary. Pedersen and Jensen (2001) investigated this by developing an exact evolutionary model, from which they sampled using Markov Chain Monte Carlo techniques, and observed markedly improved results in contrast to the stationary context model. However, as noted by them, the increase in computational complexity is unmanageable, and thus we must for now suffice with this slightly cruder version.
| 3 RESULTS |
|---|
|
|
|---|
To evaluate our method with respect to gene prediction, we use annotations from GenBank. We define true positives to be the sum
jC + (xj) where xj is the jth nucleotide and C + (xj) is the number of reading frames it is coding in. Similarly, we define the true negatives to be
jC – (xj) where C–(xj) is the number of reading frames the nucleotide is not coding in. Then we may as usual define- Sensitivity
- Specificity
3.1 Simulated data
We initially wish to test our method on simulated data to see whether we can actually discern regions evolving under distinct selection as such. We construct a 10 000-nt long reference sequence containing one single coding region using the nucleotide composition of the single coding regions of the Hepatitis B genome. We subsequently fit a sinusoidal selection pattern along the genome with low and high peak values at 0.1 and 0.8, respectively, and wave length of 250 nt.
We evolve the reference genome accordingly along a tree into three descendant sequences, using the same evolutionary model as described in Methods Section. Subsequently, we run our method to compare our selection annotation to the true one. As shown in Figure 3, we are not annotating perfectly, especially as far as catching the peaks is concerned. However, although selection is rapidly changing along the genome, we still manage to capture the main essence of these changes and deliver a good representation of the true annotation. We present the selection annotation sj for position j as a weighted posterior across all eight selection factors
, where pj (m) is the posterior probability of being in selection class m at position j, and gm is the grid value for class m as described above.
|
Note also that even on simulations with smaller wavelengths, our model still recovers a good approximation to the true selection annotation.
We also wish to see what effect sequence divergence and length of consecutive coding regions have on our estimations. We simulate data with a differing percentage of average column identity across the alignment. In Figure 4, we plot the column identity against our estimation error and observe the expected U-shaped plot with errors for the evolutionary distances that we most likely would be dealing with being very low.
|
We also test our method on a double coding region of varying length acting under constant selection, to discern how long a region has to be for us to pick up a signal. Figure 5 shows that, as expected, very small lengths do not quite suffice for us to make out a signal, but we do very well, the longer the regions get. Since our method is not designed to pick up the selectional behaviour of individual nucleotides, but rather depict trends over genomic regions, this however is only to be expected.
|
3.2 HIV2
We illustrate our method on a small HIV2 data set, where we annotate a reference sequence U27200 [GenBank] based on three descendant sequences AY530889 [GenBank] , M30502 [GenBank] and DQ307022 [GenBank] all of which are at a reasonable evolutionary distance to one another. To avoid complications due to recombination, we have chosen sequences that are believed not to be recombinants of one another, since unaccounted for recombination is known to bias the estimation towards the detection of positive selection (Scheffler et al., 2006). We download the sequences from GenBank and align them using CLUSTALW (ClustalW Software can be found on the web at http://www.ebi.ac.uk/alustalw/) (Thompson et al., 1994). We subsequently obtain our phylogenetic tree using PHYLIP (Felsenstein, 1989). The identity between the sequences ranges significantly from
50% to
90% across the genome. Due to CLUSTALW and PHYLIP's topology not accounting for the presence of overlapping reading frames, our use of them may of course be slightly problematic and we return to this in the Discussion Section. However, a goodness of fit test to the model implies that—apart from in the structurally conserved 5' and 3' UTR regions—our model fits the data well.
3.2.1 HIV2 gene annotation
We first examine the gene annotation obtained. In comparison to the results presented in our earlier paper (McCauley and Hein, 2006), the inclusion of phylogenetic information improves slightly on the single sequence method, in particular by predicting the first half of the two intronic rev and tat genes. Since we require a start codon for the beginning of a coding region, our method is incapable of annotating the latter half of these intronic genes. Several small ORFs, both in the 5' and 3' UTR region, are falsely predicted as coding, presumably due to the secondary structure conservation in these regions. Our method uses no prior knowledge about the annotation of the descendant sequences. In the scenario, however, of wanting to annotate an unknown strain with respect to a set of known ones, such knowledge is easily incorporated with a prior on the hidden state path.
As stated in McCauley and Hein (2006), the single sequence method alone already does a good job at gene prediction and improves on GeneMark.hmm, the state of the art method for viral genome prediction for single sequences (Besemer and Borodovsky, 1999; Besemer et al., 2001; Mills et al., 2003). Once extended to a phylogenetic HMM, we note a marked improvement in our method achieving an average prediction sensitivity and specificity of
97%. This improves on the results in the recently published pairwise method presented by de Groot et al. (2007).
3.2.2 HIV2 selection annotation
Next we consider the selection annotation. We choose to model our set of selection factors according to a grid of size 8 as shown in Table 1, scaled by the parameter
, see Methods Section. All selection parameter estimates are given in Table 2. In this analysis, the maximum-likelihood estimate for
was 0.5512, scaling our grid to 0.0055–1.1024. Note here, that our model explicitly chose
, so that the majority of the states pertain to negative selection. Since it had the choice to accommodate for both positive and negative selection this is a marked decision, suggesting that overall selection is largely of a purifying nature.
|
|
The maximum-likelihood estimate for the additional stop codon selection factor sstop was 0.347, indicating that substitutions resulting in a change in stop codon are under three times as strong negative selection as normal non-synonymous substitutions. The autocorrelation parameter of transitioning from one selection state to another was estimated at 0.932, suggesting a strong dependence between selection strengths at adjacent locations. This fits our expectation that neighbouring nucleotides are going to be under similar selection, since they will tend to belong to a common domain of tertiary structure.
Figure 6 illustrates the selection strengths along the genome. The darker the shading, the stronger the negative selection on that site. In the bottom part of the picture we present the selection annotation as given by the weighted posterior, as described in Section 3.1. Interestingly enough, as shown in Figure 7, the overlapping regions seem to be under less stringent selection than the single coding regions. In this figure, as in the following ones, the y axis shows the average posterior probability. We obtain it by, for each of the eight states, averaging the posterior probability of being in that state over all nucleotides in the two types of region. Opinion on the selectional behaviour of single versus double coding regions is relatively split—Spiropoulou and Nichol (1993) and Walewski et al. (2001) assume constrained selection in double coding regions to identify de novo overlapping reading frames, whereas de Oliveira et al. (2004) seem to suggest more positive selection in double coding regions of HIV1 than in single coding ones. Since there is very little data in our analysis, however, we cannot draw strong conclusions, but as Figure 7 illustrates our method highlights less stringent selection in the overlapping regions of HIV2.
|
|
Figure 8 shows the weighted frequencies of the selection classes on the individual genes. The gag and pol genes are well documented to be under strong negative selection (Seibert et al., 1995; Yang and Swanson, 2002), due to their housekeeping role in the viral organism. It is therefore reassuring to see that our results support this fact with the selection strength distribution being biased towards the strongly purifying selection classes. The only surprise is maybe the fact that the gag gene is not under slightly more stringent conservation than in our observation. Our model cannot pick up on exact sites of positive selection, but would rather be inclined to label a certain region under negative selection, which was sprinkled with many spots of positive selection, as being under closer to neutral selection. We find it thus unsurprising to see the other genes being under generally less stringent negative selection than gag and pol, due to the presence of precisely these sites.
|
One region in particular, around nucleotide position 7000, stands out in Figure 6 as being very lightly shaded. Upon further analysis, this apparently highly divergent part of the env gene corresponds to a well-known hypervariable region between two β sheets in the gp120 protein, as cited by Simmonds et al. (1990), thus providing further indication of our method providing reasonable results. The tat and rev genes are very small intronic genes, implying that we have very little data on them to draw strong statistical conclusions. However, we do see a tendency towards more neutral selection in both of them, especially rev, such behaviour having also been documented by de Oliveira et al. (2004).
3.3 Hepatitis B
More than a third of the Hepatitis B genome is double coding, making it another excellent candidate to test our method. Several studies have been made on the selection within this virus, most recently by Osiowy et al. (2006). However, the fact that each one attempts to mold the concept of Ka/Ks ratio to their purpose, provokes serious difficulties when attempting to analyse their results. We provide a full selection annotation of the Hepatitis B genome by explicitly modelling selection in overlapping reading frames, thus being to our knowledge the first to provide easily interpretable objective results on the evolutionary behaviour of this virus. We run our method on an alignment of three separate Hepatitis B strands X04615
[GenBank]
, M18752
[GenBank]
and K02715
[GenBank]
.
3.3.1 Hepatitis B gene annotation
Both the single sequence and the phylogenetic method predict the same gene annotation. We falsely predict a short overlapping ORF as being coding as well as miss out on the first part of the S gene, resulting in an accuracy of 89.7% sensitivity and 98.2% specificity. Restricting our attention purely to the overlapping regions, we recover 84.7% of these. The complexity of the Hepatitis B virus particularly lends itself to highlighting the strengths of our method. For example GeneMark.hmm, used by Mills et al. (2003) to compile the VIOLIN database, only recovers 37% of the true coding regions and only 14% of the overlapping ones, demonstrating the true need for a more sophisticated approach. The pairwise comparative method by de Groot et al. (2007) achieve similar annotation results to ours, however, does not provide a full selection annotation.
3.3.2 Hepatitis B selection annotation
Figure 9 shows our selection annotation of the Hepatitis B genome. Due to the large amount of overlapping regions, it lends itself ideally to the study of selection acting on these. We can see that double coding regions are on average under much stronger conservation than single coding ones, as demonstrated further in Figure 10. Although, as shown in Table 2, the grid factor
is estimated at 0.6 and the grid thus reaches from 0.006 to 1.2, there is effectively no posterior weight on the eight rate class, and thus the entire Hepatitis B genome appears to be under negative selection. Interestingly, the separate selection factor for stop codons was estimated at 0.68—nearly twice as high as in HIV2, suggesting that gene structure is more conserved in the Hepatitis B virus. Though, since we have a very limited data set this difference may well not be significant.
|
|
When looking at the selection strengths in Figure 11 across the different Hepatitis B genes, we see that the C gene is certainly the most conserved, with 81% of the nucleotides coding for it being under extremely stringent negative selection of 0.06. There are a few light blocks within the genome, notably within the X and the C genes, potentially highlighting hypervariable regions within the Hepatitis B genome. Chain and Myers (2005) investigate the C gene using Ka/Ks ratio and find an annotation compatible to ours. Osiowy et al. (2006), however, conclude in their analysis of the X protein that it is under relatively stringent negative selection, thus seemingly contradicting our results, which quite clearly highlight a region of raised variability. Our method is not devised to explain, but merely to capture a tendency in selectional behaviour, so further investigation into the biological nature of these particular regions would be necessary.
|
| 4 DISCUSSION |
|---|
|
|
|---|
We have introduced a novel method for the comparative annotation of viruses containing overlapping reading frames. Most importantly it accounts for intra- and intergenic differences in selection strength along the genome, thus additionally providing a selection annotation of the virus. Having a multiple alignment as an input, it attempts to draw information from the vast amount of sequence data available. We do not rely on a prior GenBank annotation, thus making our method readily applicable to novel virus strands. However, in the case of the gene annotation already being known, we may easily fix it to solely obtain the selection annotation.
Up till now, methods for selection analysis have been restricted to simple studies based on the concept of the Ka/Ks ratio, which in the presence of overlapping reading frames however loses its power. Mizokami et al. (1997), Guyader and Ducray (2002), Hughes and Hughes (2005), Pavesi, (2006) and Osiowy et al. (2006) use it nonetheless, in an attempt to understand the evolutionary pressures underlying multiple coding regions. Their results are subsequently difficult to interpret, since synonymous substitutions in one reading frame may indeed be non-synonymous and thus restricted in another. This may lead to an underestimation of the true synonymous substitution rate and the subsequent false suggestion of positive selection. The fact that we have incorporated varying selection into a model that accounts for overlapping reading frames is a step towards being able to make truly statistically significant statements about the nature of selection on multiple coding regions. Of course, it is hard for us to make claims on the accuracy of our method, since no true selection annotation exists. Our simulation results however do show that we can successfully infer parameters for sequences evolved under the same model.
We demonstrate our method on an alignment of 4 HIV2 strains. We discover strong purifying selection acting on non-synonymous substitutions of stop codons, as well as on the majority of the gag and pol. Several sites within the HIV2 virus appear to be under less stringent negative selection, one in particular corresponding to a known hypervariable region in the env gene.
Interestingly enough, in HIV2 the overlapping regions appear to be under less stringent selection, though due to the small sample space it is difficult to draw significant conclusions from this. When analysing the Hepatitis B virus, we can draw stronger conclusions about the evolutionary behaviour of multiple coding regions, due to the large amount of coding sites in overlapping reading frames. To the best of our knowledge, we are the first to provide a full selection annotation of the Hepatitis B virus genome. We note that here selection in overlapping regions is more conserved. We also investigate differences in the evolutionary pressures underlying the coding regions of Hepatitis B and highlight several regions that appear to be under increased mutational pressure, notably in the S and C genes.
Recombination is a complicating factor that is not wisely neglected in fast evolving viral genomes. As shown in Scheffler et al. (2006), not accounting for recombination can have a dramatic effect on estimated selection strengths. An obvious extension to our method would be to take a similar approach as Scheffler et al. (2006) and computationally detect recombination breakpoints and let the tree vary between these points. This would allow us to obtain different tree topologies for different parts of the genome. An alternative approach would be to include tree topologies as hidden states in our HMM, and take an approach similar to Husmeier and McGuire (2003) or Minin et al. (2005).
A more serious problem is that neither CLUSTALW or PHYLIP take non-coding and coding, yet alone overlapping reading frames, into account, when constructing the alignment and inferring a tree. How much this influences our results is difficult to say, but ideally we would need to use packages that do account for such features, or better yet incorporate alignment and tree building into our method. Indeed, in our future work we will focus on jointly estimating selection factors and alignment in a statistical alignment framework.
There are several drawbacks to our model. Initially, the fact that we are fixing nucleotide context over time brings severe problems with it, that are not easily rectified. As mentioned before, Pedersen and Jensen (2001) develop an accurate evolutionary model but need to resort to complicated MCMC methods in order to estimate parameters. Since complexity is already an issue with our method, fixing the nucleotide context over time is currently our only viable option. Having to exponentiate a large number of matrices results in us running into CPU usage problems and having long runtimes. We could either simplify our model to attain closed form solutions to the matrix exponents, or investigate Taylor expansions in order to find approximate closed form solutions. Letting selection vary along the genome is an improvement to most evolutionary models, but we still model transition and transversion rates as constant. Ideally we would let these change as well, in order to account for constraining elements such as RNA secondary structure.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Alex Bateman
Received on May 9, 2007; revised on August 3, 2007; accepted on September 11, 2007
| REFERENCES |
|---|
|
|
|---|
Besemer J, Borodovsky M. Heuristic approach to deriving models for gene finding. Nucleic Acids Res. (1999) 27:3911–3920.
Besemer J, et al. GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res. (2001) 29:2607–2618.
Chain BM, Myers R. Variability and conservation in hepatitis B virus core protein. BMC Microbiol. (2005) 5.
de Groot S, et al. Comparative annotation of viral genomes with non-conserved gene structure. Bioinformatics (2007) 23:1080–1089.
de Oliveira T, et al. Mapping sites of positive selection and amino acid diversification in the HIV genome. Genetics (2004) 167:1047–1058.
de Zanotto P, et al. Genealogical evidence for positive selection in the nef gene of HIV-1. Genetics (1999) 153:1077–1089.
Ding SW, et al. New overlapping gene encoded by the cucumber mosaic virus genome. Virology (1994) 198:593–601.[CrossRef][Web of Science][Medline]
Durbin R, et al. Biological Sequence Analysis (1998) Cambridge University Press.
Felsenstein J. PHYLIP – Phylogeny inference package (Version 3.2). Cladistics (1989) 5:164–166.
Firth AE, Brown CM. Detecting overlapping coding sequences with pairwise alignments. Bioinformatics (2005) 21:282–292.
Firth AE, Brown CM. Detecting overlapping coding sequences in virus genomes. BMC Bioinformatics (2006) 7.
Fukuda Y, et al. On dynamics of overlapping genes in bacterial genomes. Gene (2003) 323:181–187.[CrossRef][Web of Science][Medline]
Guyader S, Ducray DG. Sequence analysis of Potato leafroll virus isolates reveals genetic stability, major evolutionary events and differential selection pressure between overlapping reading frame products. J. Gen. Virol. (2002) 83:1799–1807.
Hein J, Støvlbæk J. A maximum-likelihood approach to analyzing nonoverlapping and overlapping reading frame. J. Mol. Evol. (1995) 40:181–189.[CrossRef][Web of Science][Medline]
Hughes AL, Hughes M.AK. Patterns of nucleotide difference in overlapping and non-overlapping reading frames of papillomavirus genomes. Virus Res. (2005) 113:81–88.[CrossRef][Web of Science][Medline]
Husmeier D, McGuire G. Detecting recombination in 4-taxa DNA sequence alignments with Bayesian Hidden Markov models and Markov chain Monte Carlo. Mol. Biol. Evol. (2003) 20:315–337.
Johnson ZI, Chisholm S. Properties of overlapping genes are conserved across microbial genomes. Genome Res. (2006) 14:2268–2272.[CrossRef]
Kozlov NN. Overlapping genes and variability of the genetic code. Dokl. Biol. Sci. (2000a) 375:677–680.[CrossRef][Medline]
Kozlov NN. Analysis of a set of overlapping genes. Dokl. Biochem. (2000b) 373:119–122.[Medline]
Makalowska I, et al. Overlapping genes in vertebrate genomes. Comput. Biol. Chem. (2005) 29:1–12.[CrossRef][Web of Science][Medline]
McCauley S, Hein J. Using HMMs and observed evolution to annotate viral genomes. Bioinformatics (2006) Advance Access published online on April 13, 2006.
Mills R, et al. Improving gene annotation of complete viral genomes. Nucleic Acids Res. (2003) 31:7041–7055.
Minin VN, et al. Dual multiple change-point model leads to more accurate recombination detection. Bioinformatics (2005) 21:3034–3042.
Mizokami M, et al. Constrained evolution with respect to gene overlap of Hepatitis B Virus. J. Mol. Evol. (1997) 44(Suppl. 1):83–90.[CrossRef]
Osiowy C, et al. Molecular evolution of hepatitis B virus over 25 Years. J. Virol. (2006) 80:10307–10314.
Pavesi A. Detection of signature sequences in overlapping genes and prediction of a novel overlapping gene in hepatitis G virus. J. Mol. Evol. (2000) 50:284–295.[Web of Science][Medline]
Pavesi A. Origin and evolution of overlapping genes in the family Microviridae. J. Gen. Virol. (2006) 87:1013–1017.
Pavesi A, et al. On the informational content of overlapping genes in prokaryotic and eukaryotic viruses. J. Mol. Evol. (1997) 44:625–631.[CrossRef][Web of Science][Medline]
Pedersen AM, Jensen JL. A dependent-rates model and an MCMC-based methodology for the maximum-likelihood analysis of sequences with overlapping reading frames. Mol. Biol. Evol. (2001) 18:763–776.
Pedersen JS, Hein J. Gene finding with a hidden Markov model of genome structure and evolution. Bioinformatics (2003) 19:219–227.
Rogozin I, et al. Purifying and directional selection in overlapping prokaryotic genes. Trends Genet. (2002) 18:228–232.[CrossRef][Web of Science][Medline]
Scheffler K, et al. Robust inference of positive selection from recombining coding sequences. Bioinformatics (2006) 22:2493–2499.
Seibert S, et al. Natural selection on the gag, pol, and env genes of human immunodeficiency virus 1 (HIV-1). Mol. Biol. Evol. (1995) 12:803–813.[Abstract]
Siepel A, Haussler D. Combining phylogenetic and hidden Markov models in biosequence analysis. J. Comput. Biol. (2004) 11:413–428.[CrossRef][Web of Science][Medline]
Simmonds P, et al. Analysis of sequence diversity in hypervariable regions of the external glycoprotein of human immunodeficiency virus type 1. J. Virol. (1990) 64:5840–5850.
Spiropoulou CF, Nichol ST. A small highly basic protein is encoded in overlapping reading frame within the P gene of vesicular stomatitis virus. J. Virol. (1993) 67:3103–3110.
Tavaré S. Some probabilistic and statistical problems in the analysis of DNA sequences. In: Some Mathematical Questions in BiologyDNA Sequence Analysis—Miura RM, ed. (1986) Providence: American Mathematic Society. 57–86.
Thompson JD, et al. CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. (1994) 22:4673–4680.
Yang ZH, Swanson WJ. Codon-substitution models to detect adaptive evolution that account for heterogeneous selective pressures among site classes. Mol. Biol. Evol. (2002) 19:49–57.
Walewski JL, et al. Evidence for a new hepatitis C virus antigen encoded in an overlapping reading frame. RNA (2001) 7:710–721.[Abstract]
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||











