Skip Navigation


Bioinformatics Advance Access originally published online on April 13, 2006
Bioinformatics 2006 22(11):1308-1316; doi:10.1093/bioinformatics/btl092
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/11/1308    most recent
btl092v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (2)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by McCauley, S.
Right arrow Articles by Hein, J.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by McCauley, S.
Right arrow Articles by Hein, J.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Using hidden Markov models and observed evolution to annotate viral genomes

Stephen McCauley * and Jotun Hein

Department of Statistics, Oxford University Oxford, UK

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODOLOGY
 3 RESULTS
 4 DISCUSSION AND FURTHER...
 REFERENCES
 

Motivation: ssRNA (single stranded) viral genomes are generally constrained in length and utilize overlapping reading frames to maximally exploit the coding potential within the genome length restrictions. This overlapping coding phenomenon leads to complex evolutionary constraints operating on the genome. In regions which code for more than one protein, silent mutations in one reading frame generally have a protein coding effect in another. To maximize coding flexibility in all reading frames, overlapping regions are often compositionally biased towards amino acids which are 6-fold degenerate with respect to the 64 codon alphabet. Previous methodologies have used this fact in an ad hoc manner to look for overlapping genes by motif matching. In this paper differentiated nucleotide compositional patterns in overlapping regions are incorporated into a probabilistic hidden Markov model (HMM) framework which is used to annotate ssRNA viral genomes. This work focuses on single sequence annotation and applies an HMM framework to ssRNA viral annotation. A description of how the HMM is parameterized, whilst annotating within a missing data framework is given. A Phylogenetic HMM (Phylo-HMM) extension, as applied to 14 aligned HIV2 sequences is also presented. This evolutionary extension serves as an illustration of the potential of the Phylo-HMM framework for ssRNA viral genomic annotation.

Results: The single sequence annotation procedure (SSA) is applied to 14 different strains of the HIV2 virus. Further results on alternative ssRNA viral genomes are presented to illustrate more generally the performance of the method. The results of the SSA method are encouraging however there is still room for improvement, and since there is overwhelming evidence to indicate that comparative methods can improve coding sequence (CDS) annotation, the SSA method is extended to a Phylo-HMM to incorporate evolutionary information. The Phylo-HMM extension is applied to the same set of 14 HIV2 sequences which are pre-aligned. The performance improvement that results from including the evolutionary information in the analysis is illustrated.

Availability: We implement the SSA method in the MATLAB programming language and provide the source code at http://www.stats.ox.ac.uk/Qmccauley. Additional supplementary material referred to in the text is available on the same webpage.

Contact: mccauley{at}stats.ox.ac.uk

Supplementary Information: Supplementary data are available at http://www.stats.ox.ac.uk/Qmccauley


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODOLOGY
 3 RESULTS
 4 DISCUSSION AND FURTHER...
 REFERENCES
 
This paper presents a hidden Markov model (HMM) topology designed to allow for overlapping CDS regions. This HMM topology is incorporated into an SSA program which is used successfully to annotate ssRNA viral genomes. Overlapping CDS regions are of interest because there are multiple evolutionary coding constrains operating on them. These multiple constraints complicate any statistical inference procedure that might be applied to the genomic analysis of ssRNA viral genomes.

This work focuses on annotating ssRNA viral genomes because they are known to harbour overlapping CDS regions in relative abundance. The issues of modelling the evolutionary constraints on these regions, and making general inference as to their abundance, is confounded by the fact they are difficult to identify. This is often because they are small and nested within longer CDS regions in alternative open reading frames (ORFs). This is illustrated by the fact that new overlapping CDS regions within viral genomes are still being identified (Brocchieri et al., 2005). Whilst existing CDS-finding methodologies such as GeneMarkS (Besemer et al., 1999, 2001) have been successful in identifying the majority of ssRNA viral CDS regions, it remains probable that the identification of all overlapping CDS regions within viral genomes is incomplete. It is of obvious biological importance to identify any additional CDS within viral genomes that have been missed, and so any methodologies likely to further such identification are well motivated. When an evolutionary model is included to improve the viral annotation, the estimates of the parameters of the evolutionary model are themselves interesting and worthy of consideration.

There are very many documented patterns of overlaps which fall into the three main classes of unidirectional (CDS overlap in same reading direction), convergent (CDS 3' ends overlap in opposing reading directions) and divergent (CDS 5' starts overlap in opposing reading directions). Some examples of these include Lartey et al. (1996), Shmulevitz et al. (2002), Walewski et al. (2001) and Zajanckauskaite et al. (1997). Study has been made into these overlapping regions and the nature of the evolutionary pressures that they are subject to (Guyader and Ducray, 2002; Hughes et al., 2001; Mizokami et al., 1997; Rogozin et al., 2002). The success of the single sequence annotation (SSA) methodology presented in this paper is based on an HMM topology which allows for any given nucleotide to code in multiple reading frames as well as allowing for differentiated nucleotide compositional patterns in overlapping CDS with respect to non-overlapping CDS and non-CDS regions. These differentiated nucleotide compositional patterns are a result of the particular evolutionary pressures that overlapping CDS are subjected to, some of which are described in the papers cited above. Pavesi et al. (1997) and Pavesi (2000) exploited information on common overlapping CDS motifs to search for as yet unknown overlapping CDS with success. More theoretical work modelling overlapping CDS includes Kozlov (2000a, b) and Krakauer (2000).

There has been some research in the identification and evolution of overlapping CDS but it remains an understudied area. Recently Firth and Brown (2005) proposed a methodology for analysing overlapping CDS with pairwise alignments where they model differentiated mutation rates between aligned overlapping CDS and non-overlapping CDS. Hein and Støvlbæk (1995) also investigated evolutionary modelling of overlapping CDS and applied their methods to a pairwise alignment of HIV sequences. Further work on estimating evolutionary parameters between pairwise alignments was undertaken by Pedersen and Jensen (2001). These papers clearly illustrate that evolutionary methods are useful for analysing overlapping CDS. Although Firth and Brown (2005) refer to HMM methods as being complex to model and difficult to parameterize, this work has employed an HMM framework and has achieved significant success at annotating overlapping CDS in single sequences.

As a first step towards an evolutionary extension of this model, a Phylogenetic-hidden Markov model (Phylo-HMM) framework has been developed and applied to an alignment of 14 HIV2 sequences using a parameterized evolutionary model. The potential for improved annotation based on the additional evolutionary signal within the Phylo-HMM framework is illustrated. Because these methodologies are applied at the nucleotide level, there is the potential to extend them to account for additional evolutionary constraints imposed upon the ssRNA nucleotides (such as secondary structure conservation). This has already been considered (albeit from a different perspective) by Pedersen et al. (2004a) who examined differentiated mutation rates in coding and non-coding regions depending upon the secondary structure constrains imposed upon the nucleotide (Pedersen et al., 2004b). Their methods only consider non-overlapping coding regions (whose CDS annotation is known), and a combination of approaches may serve to give a more generalized and accurate model of the evolutionary mutational processes operating on ssRNA viral genomes thereby improving both the CDS annotation and secondary structure predictions.


    2 METHODOLOGY
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODOLOGY
 3 RESULTS
 4 DISCUSSION AND FURTHER...
 REFERENCES
 
The SSA method takes as input a genomic sequence and annotates the genome into regions of CDS and non-CDS within the HMM framework discussed below. The method parameterizes this HMM concurrently with the annotation in an iterative Expectation Maximization procedure. For a given set of parameters, the method maximizes with respect to an annotation, the overall likelihood of observing the genome. This is accomplished using Viterbi's procedure (Durbin et al., 1998). The essence of the HMM framework is that the observed nucleotides are distributed according to a probability distribution that depends on the coding characteristics of the various reading frames. In order to describe this particular HMM, it is necessary to describe the topology of the HMM by defining the various states and emissions and how they are related. A detailed discussion of the transition and emission parameters as well as a description of how the SSA procedure is implemented follows.

2.1 Defining the HMM topology
The HMM topology is designed to accommodate unidirectional overlaps which are the most common form of overlapping CDS regions. This methodology allows for any given nucleotide to code in up to a maximum of three overlapping frames reading in one direction. By making certain assumptions regarding the emission distributions of the nucleotides we can accommodate an overlapping CDS model in eight states. The HMM topology is defined by describing the states in the HMM, the possible transitions between these states and the emission distributions from each of these states.

2.1.1 HMM states
The States are illustrated in Figure 1.

  1. State NC: Describes a nucleotide which is not coding in the genome.
  2. State S1: Describes a nucleotide coding at the first codon locus in a codon within a region of the genome encoding only one CDS.
  3. State S2: Describes a nucleotide coding at the second codon locus in a codon within a region of the genome encoding only one CDS.
  4. State S3: Describes a nucleotide coding at the third codon locus in a codon within a region of the genome encoding only one CDS.
  5. State D1,2: Describes a nucleotide coding at the first codon locus in a codon in one reading frame and at the second in another within a region of the genome coding in two distinct reading frames.
  6. State D2,3: Describes a nucleotide coding at the second codon locus in a codon in one reading frame and at the third in another within a region of the genome coding in two distinct reading frames.
  7. State D3,1: Describes a nucleotide coding at the third codon locus in a codon in one reading frame and at the first in another within a region of the genome coding in two distinct reading frames.
  8. State T1,2,3: Describes those rare nucleotides coding at the first, second and third codon loci in three distinct reading frames.


Figure 1
View larger version (10K):
[in this window]
[in a new window]
 
Fig. 1 The HMM starts in the NC state and cycles around until it moves into a coding state. It can cycle around the singly coding states or make transitions to a double coding state, or back to the NC state.

 
2.1.2 HMM state transitions
For the state transitions again please refer Figure 1. Those transitions leaving each of the eight states are described, two for each state, giving a total of 16 labelled as T1–T16. The single parameter {tau} represents the probability of entering a CDS region conditional on the previous three emissions coding AUG (a potential START codon).
  1. Transitions from State NC will generally be the self-transitioning cycle (T1). In the case where the previous three emissions are AUG there is a probability {tau} that the HMM will also make a transition to State S1 (T2).
  2. Transitions from State S1 will generally be to S2 (T3) but if the previous three nucleotides were AUG then the HMM goes to state D1,2 with probability {tau} (T4). Note the preceding CDS region is now coding at the second codon locus but the new CDS region is coding at the first codon locus.
  3. Transitions from State S2 will generally be to S3 (T5) but if the previous three nucleotides were AUG then the HMM transitions to state D3,1 with probability {tau} (T6).
  4. Transitions from State S3 will generally be to S1 (T7) but if the previous three nucleotides represent a STOP codon then the HMM transitions to the NC state with unitary probability (T8). Even if the previous three nucleotides encode an AUG there is no transition to a double CDS state since there already exists a CDS in this reading frame.
  5. Transitions from State D1,2 will generally be to D2,3 (T9) but if the previous three nucleotides were AUG then the HMM transitions to the triple coding state T1,2,3 with probability {tau} (T10). Since neither of the CDS regions has emitted their third codon locus nucleotide, there is no possibility of terminating either of the CDS regions.
  6. Transitions from State D2,3 will generally be to the D3,1 state (T11), but if the previous three nucleotides represent a STOP codon then the HMM transitions to S3 (T12).
  7. Transitions from State D3,1 will generally be to the D1,2 state (T13), but if the previous three nucleotides represent a STOP codon then the HMM transitions to S2 (T14).
  8. Transitions from State T1,2,3 will generally be the self-transitioning cycle (T15) but if the three previous nucleotides represent a STOP codon then the HMM transitions to D2,3 (T16).

2.1.3 HMM conditional emission distributions
There are eight conditional emission distributions (CEDs) to define within the SSA procedure each corresponding one to one with the eight states. Table 1 illustrates the nature of these emission distributions.

  1. CEDNC: Describes the emissions in non-coding regions which are modelled by sampling from a multinomial distribution across the four-letter nucleotide alphabet.
  2. CEDS1: Describes the first three periodic coding locus in a CDS codon. In this methodology the level of conditioning is only according to the present codon so in fact this CEDS1 can be described as sampling from a multinomial distribution across the four-letter nucleotide alphabet.
  3. CEDS2: Describes the second three periodic coding locus in a CDS codon. This CEDS2 can be represented as conditional sampling from one of four multinomial distributions across the four-letter nucleotide alphabet.
  4. CEDS3: Describes the third three periodic coding locus in a CDS codon. This CEDS3 can be represented as conditional sampling from one of 16 multinomial distributions across the four-letter nucleotide alphabet.
  5. CEDD1,2: Describes the first three periodic coding locus in one CDS codon and the second coding locus in another. CEDD1,2 can be described by conditional sampling from one of four multinomial distributions across the four-letter nucleotide alphabet.
  6. CEDD3,1: Describes the first three periodic coding locus in one CDS codon and the third coding locus in another CDS. CEDD3,1 can be represented as conditional sampling from one of 16 multinomial distributions across the four-letter nucleotide alphabet.
  7. CEDD2,3: Describes the second coding locus in one CDS codon and the third coding locus in another CDS. CEDD2,3 can also be represented as conditional sampling from one of 16 multinomial distributions across the four-letter nucleotide alphabet.
  8. CEDT1,2,3: Represents the first coding locus in one CDS and the second coding locus in another CDS and the third coding locus in the final CDS. CEDT1,2,3 can be represented as conditional sampling from one of 16 multinomial distributions across the four-letter nucleotide alphabet.


View this table:
[in this window]
[in a new window]
 
Table 1 Summarising the conditional emission distributions

 
2.2 Expectation maximisation joint parameterisation and annotation procedure
Viral genomes differ substantially in their overall nucleotide composition, therefore it is unrealistic to expect one set of parameters to best describe all viral genomes. Thus it is required to parameterize the HMM according to the specific viral genome at hand.

If P is the multidimensional parameter space, and A is the annotation space, then for every set of parameters Formula there is a maximally likely annotation Formula as calculated using Viterbi. The stepwise Baum–Welsh iterative procedure's convergence to the global maximum Amax (with an associated set of parameters Pmax) is dependent on the global properties of the multidimensional likelihood function and its smoothness as a function of P. Related to these points is the distance of Pinitial from Pmax. If the likelihood function is not a smooth function of P then there are likely to be local maxima where the EM procedure may get stuck and the closer Pinitial to Pmax the less likely this is to happen.

The first step in the SSA procedure is feeding the SSA algorithm some seed HMM parameters (which are derived from a seed annotation). At first every ORF of nucleotide length greater than some threshold (typically 200 nt) is labelled as coding and this seed annotation is used to calculate the Maximum Likelihood estimates of the HMM parameters accordingly (the convergence of the Baum–Welsh training procedure is quite robust to this threshold value). The most likely estimates of the parameters are identified by summing over possible state annotations in a probabilistic fashion (using the old estimates of the parameters). The old parameters are updating in an iterative loop until the change in the probability of the data falls below a certain threshold. The most likely state annotation is then calculated using Viterbi's algorithm with these maximally likely parameters estimates.

One difficulty parameterizing according to this method is that any initial seed annotation is unlikely to be correct, and will undoubtedly render more information regarding non-CDS and single CDS regions than overlapping CDS regions. It is a point of concern that the nucleotide composition of overlapping regions is likely to vary too much in the iterative procedure because they typically cover a smaller percentage of the genome and ssRNA viral genomes in general tend to be short, therefore there is little data to train on. Various strategies were tested including utilizing Bayesian priors for the overlapping nucleotide compositions. The simplest approach, namely deriving the overlapping nucleotide compositional emission parameters as an arithmetic average of the relevant single coding emission parameters, performed on par with more complex prior models. Consequentially this is the method implemented in the SSA procedure. Since the singly coding parameters are updated in successive Baum–Welsh iterations, so too are the derived overlapping emission parameters. This allows for differentiated nucleotide compositions in overlapping regions, but does not necessarily skew the overlapping emission probability towards emitting a higher overall fraction of degenerate codons (which we may or may not suspect a priori). There are means of skewing the parameters in such a manner (such as weighting the third codon positions lower in the arithmetic averaging procedure), but this involves introducing another set of parameters (weights) and again in the short genomes examined this did not seem to affect the overall probability or annotation significantly.

In the scenario where a viral genome has several triple coding regions the SSA model will fit the same emission distribution to each of these regions (because they are assigned the same state in the HMM topology). However there might be particular local phenomenon which perturb the emission distributions of these regions so they are in fact quite different from a nucleotide compositional perspective. They will be modelled as having the same emission distribution in this SSA model, and to the extent that this is incorrect the accuracy of the model and therefore the overall annotation will suffer. The SSA method has limited data to train on and while there are known scenarios which it does not model, it nevertheless generates some encouraging results.

2.3 An illustration of a phylogenetic-HMM extension to the SSA methodology
The SSA method has been extended to incorporate comparative information via a Phylo-HMM. Phylo-HMMs incorporate the time dimension into the model by utilizing another set of Markov processes which describe how each genomic locus tends to evolve through time (Siepel and Haussler, 2004). Phylo-HMMs allow for the capture of both nucleotide sequence and nucleotide evolution information and these can be related back to the underlying hidden state of the HMM. This particular Phylo-HMM can be understood in terms of extending the single sequence HMM which emits a single nucleotide at any locus, to the case where this emitted nucleotide then generates the other nucleotides in the alignment according to a set of general time reversible (GTR) mutational models and a phylogenetic tree. For the mutational model we chose an extension of Hein and Støvlbæk's implementation of the Kimura two parameter model which applies different selection factors to nucleotides coding in multiple reading frames [Hein and Støvlbæk (1995)]. This model has the advantage of being relatively simple to implement because it requires only a few additional parameters. Both the multiple alignment and the phylogenetic tree are taken as extrinsic to the modelling procedure for now, as is the parameterization of the mutational model (a detailed description of the application of the mutational processes can be found at the following webpage http://www.stats.ox.ac.uk/Qmccauley).

The essence of the mutational model is that constrained selection factors are applied to those substitutions which would represent non-synonymous substitutions with respect to a given reading frame and state. Whilst the mutational processes are modelled from the perspective of the individual nucleotides independently of the evolution of other nucleotides, the starting context and state determines which parameters within the mutational model space are applied for a given nucleotide locus. The single sequence is annotated in light of its nucleotide composition and the progeny which it gives rise to. From the perspective of the Phylo-HMM the single sequence to be annotated is considered to be the ancestral sequence and it represents the top node in a transposed phylogenetic tree. This is theoretically consistent because only GTR Markov models of substitutions are used. Other sequences are considered to be leaf progeny and the probability of observing those leaf emissions is calculated using the mutational model (based on the ancestral nucleotide's context) and phylogenetic information using Felsenstein's pruning algorithm. Additional selection factors are applied to the mutational models of those nucleotide loci whose state assignment and context imply they constitute one of the 3 nt in a STOP codon. This makes allowance for the possibility that the CDS structure may change due to STOP codon mutations (defaulting to a later STOP codon in the same reading frame). It is worth noting the selection factor applied to these situations is stringent and so biases any structural CDS changes towards more distantly related sequences.

The Phylo-HMM will provide greater information with respect to the hidden states within the Phylo-HMM only insofar as the inputs to the model are accurate. Thus there is emphasis on the accuracy of the alignment, as well as the parameterized mutation models. In general, sequences of sufficient genetic divergence to warrant such an application are necessary, as is a means of constructing an accurate phylogeny.

2.4 Measures of annotation accuracy
Measures of sensitivity and specificity were implemented to compare model annotations with the GenBank (and other) annotations. The frequency of overlapping CDS in viral genomes poses the question of how to classify nucleotides whose annotation in one reading frame is correct but incorrect in another. Ideally some measurement which scales according to the degree of accuracy in the three reading frames would be desirable. The measurements of sensitivity and specificity quoted are based on an effective coding genome size where nucleotides which code in two reading frames count twice in the sensitivity measurement and so on. The denominator of the sensitivity statistic is the sum over the entire genome of the number of CDS reading frames each nucleotide codes in. The numerator is the sum of the correctly predicted CDS reading frames each nucleotide codes in. The denominator of the specificity statistic is the sum over the genome, of the number of CDS reading frames each nucleotide does not code in. The numerator is the sum over the genome of the number of correctly predicted non-coding reading frames for each nucleotide. It should be noted that in the case where the model correctly predicts a new CDS region that is not in the GenBank database, this will show up as a decrease in the specificity measurement.


    3 RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODOLOGY
 3 RESULTS
 4 DISCUSSION AND FURTHER...
 REFERENCES
 
3.1 SSA as applied to HIV2 genomes
The SSA method was applied to 14 strains of HIV2. The SSA method had an average sensitivity of 90% overall on these genomes (see Table 4, last row, first numerical column) and the performance on one strain (GenBank Accession Number J04498) is illustrated in detail. The SSA predicted annotation for strain J04498 [GenBank] is illustrated in Figure 2B with the GenBank annotation above (Fig. 2A) and the extended Phylo-HMM prediction below (Fig. 2C). A cursory comparison indicates general agreement between the GenBank and the SSA annotations which is in line with the overall sensitivity of 93.1% (Table 4 second row first numeric column) and overall specificity of 99.5% (Table 5 second row first numeric column). With respect to the GenBank annotation the SSA annotation misses two of the smaller CDS regions, one in global reading frame 3 (gRF3) and the other in gRF2 as well as incorrectly predicting the start of the final CDS region (these account for the 10% of missed annotations). With respect to the GenBank annotation the SSA annotation predicts two very small CDS regions in gRF1 which are not labelled in the GenBank annotation, and this accounts for the non-unitary specificity.


Figure 2
View larger version (20K):
[in this window]
[in a new window]
 
Fig. 2 (A–C) These plots illustrate the actual and predicted coding regions in HIV2 (Strain Accession Number J04498). Actual CDS regions are illustrated as defined in GenBank, and the predictions are made by the SSA and the Phylo-HMM procedures. The y-axis represents the three gRFs and the x-axis the genome nucleotides. In this particular sequence the UTR at the beginning of the genome is not sequenced. The x-axis of the middle and bottom plots has an additional description of the different regions that the SSA and Phylo-HMM methods partition the genome into. There are 19 regions which share a continuous CDS assignment as defined by the SSA methodology and 23 regions in the Phylo-HMM prediction. The numbers with arrows represent the mean posterior probability of that CDS assignation being correct within that contiguous region (as calculated by Posterior Decoding).

 
Considering the 19 regions within the SSA annotation which account for the different continuous CDS regions in the maximally likely HMM path, it is possible to analyse these regions and consider the mean probability that they are indeed correctly annotated. This is accomplished using Posterior Decoding (Durbin et al., 1998) at the nucleotide level and averaging over all nucleotides in that given region. These regions are represented as intervals on a line parallel to the x-axis in Figure 2B and the mean posterior decoding probabilities for these regions are indicated by arrows pointing to these regions. The two small incorrectly predicted CDS regions observed in gRF1 have mean posterior probabilities of 0.6 and 0.55 (which may give some indication that they are predictions not to be relied upon). Those nucleotide loci of the missed CDS regions have very high posterior probabilities even though the annotation is incorrect which indicates that the modelling procedure is not sufficiently sensitive to detect these incorrect annotations.

3.2 SSA as applied to a selection of ssRNA genomes
The SSA method has been applied to the annotation of other ssRNA viral genomes and the following subset of nine illustrate the successes and limitations of the methodology. The subset of nine genomes are as follows:

  1. Marburg: Marburg of the family Filoviridae. Sequenced by Feldmann et al. (1992) and Bukreyev et al. (1995) this 19 112 nt genome does not contain overlapping CDS (although it does contain two overlapping gene products as delineated by conserved transcriptional signals, beginning with a start site at the 3' genome end and a terminating transcriptional polyadenylation site) and so represents the least complex genome to annotate from the perspective of CDS overlaps.
  2. Ebola: Reston Ebola of the family Filoviridae. Sequenced and analysed by Sanchez et al. (1993, 1996), Volchkov et al. (1999) and Groseth et al. (2002) this 18 891 nt genome contains a region which encodes for two proteins. The smaller sGP protein starts and terminates in the same gRF, but the longer GP protein has the same start point but ends in a different gRF (this is accomplished via a process known as transcriptional editing).
  3. Hep G: Hepatitis G of the family Flaviviridae. Sequenced by Linnen et al. (1996) this 9392 nt genome contains only one CDS region which encodes a precursor polyprotein and from the perspective of this analysis represents a simple genome.
  4. EIAV: Equine infectious anemia virus is a lentivirus of the family Retroviridae. The genome we considered was published by Petropoulos (1997) and is 8359 nt long with two areas of overlapping CDS.
  5. SHIV: Simian-Human immunodeficiency virus is a lentivirus of the family Retroviridae. The genome we consider was published by Reimann et al. (1996) and is 10 000 nt long with 6 areas of CDS overlap and two CDS regions (tat and rev) which are further complicated by the presence of introns. This represents a difficult genome to annotate according to our methodology.
  6. Rous S: Rous Sarcoma is an Alpharetrovirus of the family Retroviridae. The genome we considered was again published in Petropoulos (1997) and the genome is 9392 nt long with 3 areas of overlapping CDS. There are several introns interspersing these CDS regions and this genome would be considered difficult to annotate.
  7. HIV2: Another lentivirus of the family Retroviridae. The genome we consider was published by Kirchhoff et al. (1990) and is similar in structure to Simian-Human immunodeficiency virus but with longer overlaps in the CDS.
  8. SHFV: Simian hemorrhagic fever virus is an arterivirus of the family Arteriviridae and the 15 717 nt genome we consider was sequenced and analysed by Smith et al. (1997), Godeny et al. (1995) and Zeng et al. (1995). The genome contains many CDS regions, introns and many overlaps. Many of the overlaps are short and the large number of them makes this a difficult genome to annotate.
  9. HTLV: Human T-lymphotropic virus is a deltaretrovirus of the family retroviridae. The 8507 nt genome we consider is again published in Petropoulos (1997). This is quite a complex genome to annotate with multiple examples of ribosomal slippage and introns.

In this section the annotations of the SSA method are compared with those returned by the GeneMarkS annotation procedure [Besemer et al. (1999) (2001)]. Comparisons of overall sensitivity and specificity measurements are illustrated, as are those measurements applied only to the overlapping regions (if any) in the applicable genome.

When submitting the genomes to GeneMarkS, since all genomes were <100 000 nt in length, the following program was automatically utilized: GeneMark.hmm PROKARYOTIC Version 2.2. The GeneMark.hmm program used to annotate the submitted genomes depends on heuristic methods to parameterize the HMM models. These heuristics relate the observed relationships between the positional frequencies and the global frequencies as well as relationships between the amino acid frequencies and the global GC percentage. The parameterized models are three periodics for coding sequence of orders zero, one and two and a single zero order model for non-coding sequence [Besemer et al. (1999)].

If we compare the overall sensitivities of the two methods in Table 2 (first two numeric columns) then we immediately observe that the overall sensitivities of the SSA method is generally superior to those predicted by GeneMark.hmm. In all but one genome the SSA model outperforms GeneMark.hmm, and in that case the results are within a percent or two so the performance is fairly similar. As with the majority of CDS finding methods the SSA method has a very high overall specificity as observed in Table 3 (first two numeric columns) and those high numbers compare favourably with Genemark.hmm.


View this table:
[in this window]
[in a new window]
 
Table 2 Comparing both the overall and overlapping sensitivities of the SSA method with Genemark.hmm

 

View this table:
[in this window]
[in a new window]
 
Table 3 Comparing both the overall and overlapping specificities of the SSA method with Genemark.hmm

 
If we compare the overlapping sensitivities in Table 2 (final two numeric columns) then the improvement that the SSA model brings to the analysis of ssRNA viral genomes is highlighted. (Note that these numbers are based on far fewer nucleotides so a large percentage difference can represent a fairly small number of correct/incorrect nucleotide annotations.) The first three genomes in the table do not have CDS encoded in different overlapping gRFs (therefore we would consider them to be non-complex from an annotation perspective). There is an instance of transcriptional editing in the Ebola genome, where one smaller secreted non-structural glycoprotein sGP is encoded in one reading frame, and where a longer structural glycoprotein GP uses the same start sequence as sGP but its coding is extended beyond the end of the sGP CDS in a different reading frame by means of an inserted nucleotide at the transcriptional stage. The SSA model is not designed to pick up on such processes as transcriptional editing, but the model does flag both regions as coding, and indeed highlights these different products as partially overlapping CDS. From a perspective of overlapping CDS in different reading frames however, these first three genomes do not contain any, so they have no overlapping sensitivity statistics since there are no overlaps to identify. The SSA model does predict a small overlap however and so has a zero specificity (as does the GeneMark.hmm program) but this result is not as bad as it sounds, because by virtue of not predicting a small overlap, a large coding region that encodes the post-transcriptional editing site of the GP CDS would be missed entirely.

There is a sequential unidirectional overlap in Equine Infectious Anemia Virus of 200 or so nucleotides which only the SSA model annotates correctly, so from a sensitivity viewpoint the SSA model performs much better. Neither of the models predict any incorrectly labelled overlapping states so the specificity statistics are both unitary for EIAV. Simian-Human Immunodeficiency Virus is an even more complex virus with considerably more overlaps and a similar pattern is observed, in that the SSA model has a high sensitivity and low specificity, whereas GeneMark.hmm has low overlapping sensitivity. Rous Sarcoma is an interesting example where the SSA model again outperform on a sensitivity basis, however the model has a low specificity which results from a prediction of 2 nested frames of 189 and 183 nt nested within the p60-SRC phosphoprotein CDS.

Of the remaining three genomes the SSA methodology clearly outperforms at predicting overlapping CDS. In the complex HIV 2 genome it predicts ~45% of the overlaps and in Simian Hemorrhagic Fever Virus it predicts ~60%. Human T-Lymphotropic Virus is an extremely complex genome with multiple overlaps and multiple examples of ribosomal slippage and introns which are exceptions to the behaviour captured in our model. Comparatively the SSA method performs well but there is still room for improvement in the more complex viral genomes.

3.3 Phylo-HMM applied to HIV2 genomes
The Phylo-HMM method is applied to each of the 14 HIV2 example sequences which were downloaded from GenBank and aligned with ClustalW [Thompson et al. (1994)]. Results are illustrated in Tables 4, 5 and Figure 2. A simple phylogeny was constructed using Felsenstein's Phylip program [Felsenstein (1981)] which is illustrated in Figure 3. Each of the 14 sequences was annotated according to its own nucleotide composition and the observed evolution (the significance of which is interpreted using the mutational models and the phylogenetic tree).


View this table:
[in this window]
[in a new window]
 
Table 4 Overall and overlapping sensitivity comparisons for 3 of the 14 HIV2 sequences (only 3 are highlighted for ease of reading)

 

View this table:
[in this window]
[in a new window]
 
Table 5 Overall and overlapping specificity comparisons for 3 of the 14 HIV2 sequences (only 3 are highlighted for ease of reading)

 

Figure 3
View larger version (11K):
[in this window]
[in a new window]
 
Fig. 3 An illustration of the HIV2 Phylogeny used in the Phylo-HMM. When examining each sequence using the evolutionary information to improve the annotation, the root of the tree is considered to be the sequence we are annotating. This affects a tree transposition where the tree is pruned of a leaf and a branch. This methodology considers only general time reversible mutational processes to describe the evolution, so such transpositions are permitted.

 
The mutational models were parameterized using a manual iterative procedure which involved searching through the multidimensional parameter space for the Maximum Likelihood estimates of the multidimensional likelihood function. It would be ideal for these inputs to the Phylo-HMM to be calculated within the Phylo-HMM process, in which case it might be expected to improve performance over the present static inputs. Irrespective of these concerns, an overall average sensitivity boost of 6.73% is observed (Table 4, last row, numerical column 3) for an average cost of 0.92% in specificity (Table 5, last row, numerical column 3). When considering the overlapping results the improvement is even more apparent where the overlapping sensitivity improves by 22% when the Phylo-HMM is utilized as compared with the SSA. In fact when the Phylo-HMM is employed the overlapping sensitivity scores only lag the overall sensitivity scores by 4%. Switching to the Phylo-HMM appears to have around the same specificity cost irrespective of whether we are comparing the overall or overlapping numbers.

Figure 2C is the Phylo-HMM annotation for one of the 14 sequences (GenBank accession number J04498) which has 23 continuous coding regions in the Viterbi most likely annotation. For this specific strain the Phylo-HMM sensitivity is 99.39% and the specificity is 99.28%. This represents an improvement of 6.31% in sensitivity for a cost of 0.23% in specificity as compared to the SSA methodology.

Again the mean posterior probability of correct annotation for each region is plotted just above the x-axis. The Phylo-HMM annotation can be compared to the GenBank annotation (top plot) and shows better agreement than the SSA annotation (middle plot). There are two small false predictions (one in gRF3 and one in gRF2) but it is interesting to note that these false predictions have a very low mean Posterior Probability of being correct. This is also true for some small regions which are correctly predicted, so although there is an overall increase in sensitivity for the Phylo-HMM, the modelling procedure is still lacking with respect to the false prediction of small CDS regions. Although there appears to be quite a few regions with very low mean Posterior probability, in fact only 5% of sequence loci have a mean posterior probability of <50%, and only 8.6% have posterior probabilities of <98%.


    4 DISCUSSION AND FURTHER WORK
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODOLOGY
 3 RESULTS
 4 DISCUSSION AND FURTHER...
 REFERENCES
 
The results of the SSA methodology as applied to HIV2 ssRNA viral genomes are encouraging. Of the 14 strains of HIV2 examined in this work, the overall mean SSA sensitivity was ~90% which indicates that the method is performing well on average. The SSA mean specificity was ~99% which is further evidence that the method is performing well. The results of the SSA method on the nine ssRNA genomes indicate that this level of sensitivity and specificity are consistent with other results.

There is a limit to the accuracy of the annotation which is related to the complexity of the genome CDS structures and the degree to which the modelling procedure describes the genome. It is clear that integrating some comparative information into this approach improves the sensitivity of the method, particularly with respect to the overlapping regions that are missed by the SSA method. It is not always possible to run such a comparative analysis as there are not always additional sufficiently divergent sequences to run a comparative annotation with. However present comparative results indicate that where there is extra sequence information, the Phylo-HMM methodology will outperform the SSA. The SSA model is flexible enough to accommodate the differentiated nucleotide compositional signal that overlapping CDS provide to the HMM methodology. One aim of this work was to develop a method that maximizes the information that a single sequence can provide with respect to overlapping CDS, and this has been achieved to a large extent.

Insofar as a genome presented to this methodology violates the general assumptions or design specifications of the method, it will obviously not perform as well. The SSA method is specifically designed to annotate single stranded viral genomes and does not look for overlapping CDS in any opposing strand. In addition this method is not designed to deal with circular genomes although this could be accommodated with relative ease. This method does not identify introns readily and this is something which could possibly be addressed in further extensions. The model does not explicitly model instances of ribosomal slippage and transcriptional editing, but as the results on the Ebola genome indicate, sometimes this slippage is predicted as a CDS overlap and the main coding regions are still correctly annotated.

The Phylo-HMM framework for incorporating comparative information into the modelling procedure has been discussed and results indicate that adding evolutionary information does improve the annotation. An average increase of 6.73% in the sensitivity measurements was observed for a 0.92% cost in specificity when applied to the HIV2 allignment of 14 sequences. This represents a significant improvement over the SSA method, and it is observable that the main improvement occurs in the overlapping regions where there is a 22% sensitivity boost. The authors are thus motivated to pursue this line of research further and intend to automate the parameterization of the mutational models and apply it to a series of ssRNA alignments. It is also interesting to consider whether incorporating some prior annotative knowledge (of which there is now plenty) into the process might help identify any hidden CDS regions. This could be achieved by placing restrictions on the Viterbi maximally likely search space, e.g. constraining the HMM to take a path consistent with a known CDS region in a given reading frame.


    Acknowledgments
 
The authors would like to thank the members of our group who kindly read the manuscript including Gerton Lunter, Naila Mimouni and Saskia deGroot. Particular thanks to Gerton Lunter for suggestions which have improved the methodological description. We would also like to thank the anonymous referees for their helpful comments.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Thomas Lengauer

Received on August 19, 2005; revised on March 9, 2006; accepted on March 9, 2006

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODOLOGY
 3 RESULTS
 4 DISCUSSION AND FURTHER...
 REFERENCES
 

    Besemer, J. and Borodovsky, M. (1999) Heuristic approach to deriving models for gene finding. Nucleic Acids Res, . 27, 3911–3920[Abstract/Free Full Text].

    Besemer, J., et al. (2001) GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res, . 29, 2607–2618[Abstract/Free Full Text].

    Brocchieri, L., et al. (2005) Predicting coding potential from genome sequence:application to betaherpesviruses infecting rats and mice. J. Virol, . 79, 7570–7596[Abstract/Free Full Text].

    Bukreyev, A.A., et al. (1995) The complete nucleotide sequence of the Popp (1967) strain of Marburg virus: a comparison with the Musoke (1980) strain. Arch. Virol, . 140, 1589–1600[CrossRef][ISI][Medline].

    Durbin, R., Eddy, S., Krogh, A., Mitchison, G. Biological Sequence Analysis, (1998) , Cambridge, UK Cambridge University Press.

    Feldmann, H., et al. (1992) Marburg virus, a filovirus: messenger RNAs, gene order, and regulatory elements of the replication cycle. Virus Res, . 24, 1–19[CrossRef][ISI][Medline].

    Felsenstein, J. (1981) Evolutionary trees from DNA sequences:a maximum likelihood approach. J. Mol. Evol, . 17, 368–376[CrossRef][ISI][Medline].

    Firth, A.E. and Brown, C.M. (2005) Detecting overlapping coding sequences with pairwise alignments. Bioinformatics, 21, 282–292[Abstract/Free Full Text].

    Groseth, A., et al. (2002) Molecular characterisation of an isolate from the 1989/90 epizootic of Ebola virus Reston among macaques imported into the United States. Virus Res, . 87, 155[CrossRef][ISI][Medline].

    Godeny, E.K., et al. (1995) Molecular characterisation of the 3' terminus of the simian hemorrhagic fever virus genome. J. Virol, . 69, 2679–2683[Abstract].

    Guyader, S. and Ducray, D.G. (2002) Sequence analysis of Potato leafroll virus isolates reveals genetic stability, major evolutionary events and differential selection pressure between overlapping reading frame products. J. Gen. Virol, . 83, 1799–1807[Abstract/Free Full Text].

    Hein, J. and Støvlbæk, J. (1995) A maximum-likelihood approach to analyzing nonoverlapping and overlapping reading frames. J. Mol. Evol, . 40, 181–189[CrossRef][ISI][Medline].

    Hughes, A.L., et al. (2001) Simultaneous positive and purifying selection on overlapping reading frames of the tat and vpr genes of simian immunodeficiency virus. J. Virol, . 75, 7966–7972[Abstract/Free Full Text].

    Kirchhoff, F., et al. (1990) A novel proviral clone of HIV-2: biological and phylogenetic relationship to other primate immunodeficiency viruses. Virology, 177, 305–311[CrossRef][ISI][Medline].

    Kozlov, N.N. (2000a) Overlapping genes and variability of the genetic code. Dokl. Biol. Sci, . 375, 677–680[CrossRef][Medline].

    Kozlov, N.N. (2000b) Analysis of a Set of Overlapping Genes. Dokl. Biochem, . 373, 119–122[Medline].

    Krakauer, D.C. (2000) Stability and evolution of overlapping genes. Evolution, 54, 731–739[ISI][Medline].

    Lartey, R.T., et al. (1996) Tobamovirus evolution: gene overlaps, recombination, and taxonomic implications. Mol. Biol. Evol, . 13, 1327–1338[Abstract].

    Linnen, J., et al. (1996) Molecular cloning and disease association of hepatitis G virus: a transfusion-transmissible agent. Science, 271, 505–508[Abstract].

    Mizokami, M., et al. (1997) Constrained evolution with respect to gene overlap of hepatitis Bvirus. J. Mol. Evol, . 44, Suppl. 1, S83–S90.

    Pavesi, A. (2000) Detection of signature sequences in overlapping genes and prediction of a novel overlapping gene in hepatitis G virus. J. Mol. Evol, . 50, 284–295[ISI][Medline].

    Pavesi, A., et al. (1997) On the informational content of overlapping genes in prokaryotic and eukaryotic viruses. J. Mol. Evol, . 44, 625–631[CrossRef][ISI][Medline].

    Pedersen, A.M. and Jensen, J.L. (2001) A dependent-rates model and an MCMC-based methodology for the maximum-likelihood analysis of sequences with overlapping reading frames. Mol. Biol. Evol, . 18, 763–776[Abstract/Free Full Text].

    Pedersen, J.S., et al. (2004a) An evolutionary model for protein-coding regions with conserved RNA structure. Mol. Bio. Evol, . 21, 1913–1922[Abstract/Free Full Text].

    Pedersen, J.S., et al. (2004b) A comparative method for finding and folding RNA secondary structures within protein-coding regions. Nucleic Acids Res, . 32, 4925–4936[Abstract/Free Full Text].

    Petropoulos, C.J. (1997) Appendix 2: retroviral taxonomy, protein structure, sequences, and genetic maps. RETROVIRUSES:757, , Cold Spring Harbor, New York, NY, USA Cold Spring Harbor Laboratory Press.

    Reimann, K.A., et al. (1996) An env gene derived from a primary human immunodeficiency virus type 1 isolate confers high in vivo replicative capacity to a chimeric simian/human immunodeficiency virus in rhesus monkeys. J. Virol, . 70, 3198–3206[Abstract].

    Rogozin, I.B., et al. (2002) Purifying and directional selection in overlapping prokaryotic genes. Trends Genet, . 18, 228–232[CrossRef][ISI][Medline].

    Sanchez, A., et al. (1993) Sequence analysis of the Ebola virus genome: organisation, genetic elements, and comparison with the genome of Marburg virus. Virus Res, . 29, 215–240[CrossRef][ISI][Medline].

    Sanchez, A., et al. (1996) The virion glycoproteins of Ebola viruses are encoded in two reading frames and are expressed through transcriptional editing. Proc. Natl Acad. Sci., USA, 93, 3602–3607[Abstract/Free Full Text].

    Shmulevitz, M., et al. (2002) Sequential partially overlapping gene arrangement in the tricistronic S1 genome segments of avian reovirus and nelson bay reovirus: implications for translation initiation. J. Virol, . 76, 609–618[Abstract/Free Full Text].

    Siepel, A. and Haussler, D. (2004) Combining phylogenetic and hidden Markov models in biosequence analysis. J. Comput. Biol, . 11, 413–28[CrossRef][ISI][Medline].

    Thompson, J.D., et al. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Res, . 22, 4673–4680[Abstract/Free Full Text].

    Smith, S.L., et al. (1997) Sequence of the 3' end of the simian hemorrhagic fever virus genome. Gene, 191, 205–210[CrossRef][ISI][Medline].

    Volchkov, V.E., et al. (1999) Characterisation of the L gene and 5' trailer region of Ebola virus. J. Gen. Virol, . 80, 355–362[Abstract].

    Walewski, J.L., et al. (2001) Evidence for a new hepatitis C virus antigen encoded in an overlapping reading frame. RNA, 7, 710–721[Abstract].

    Zajanckauskaite, A., et al. (1997) A rare type of overlapping genes in bacteriophage T4: gene 30.3' is completely embedded within gene 30.3 by one position downstream. Gene, 194, 157–162[CrossRef][ISI][Medline].

    Zeng, L., et al. (1995) Analysis of simian hemorrhagic fever virus (SHFV) subgenomic RNAs,junction sequences, and 5' leader. Virology, 207, 543–548[CrossRef][ISI][Medline].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
W. S.W. Wong and R. Nielsen
Finding cis-regulatory modules in Drosophila using phylogenetic hidden Markov models
Bioinformatics, August 15, 2007; 23(16): 2031 - 2037.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/11/1308    most recent
btl092v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (2)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by McCauley, S.
Right arrow Articles by Hein, J.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by McCauley, S.
Right arrow Articles by Hein, J.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?