Bioinformatics Advance Access originally published online on September 3, 2004
Bioinformatics 2005 21(3):282-292; doi:10.1093/bioinformatics/bti007
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Bioinformatics vol. 21 issue 3 © Oxford University Press 2005; all rights reserved.
Detecting overlapping coding sequences with pairwise alignments
Department of Biochemistry, University of Otago P.O. Box 56, Dunedin, New Zealand
*To whom correspondence should be addressed.
| Abstract |
|---|
|
|
|---|
Motivation: Overlapping gene coding sequences (CDSs) are particularly common in viruses but also occur in more complex genomes. Detecting such genes with conventional gene-finding algorithms can be difficult for several reasons. If an overlapping CDS is on the same read-strand as a known CDS, then there may not be a distinct promoter or mRNA. Furthermore, the constraints imposed by double-coding can result in atypical codon biases. However, these same constraints lead to particular mutation patterns that may be detectable in sequence alignments.
Results: In this paper, we investigate several statistics for detecting double-coding sequences with pairwise alignmentsincluding a new maximum-likelihood method. We also develop a model for double-coding sequence evolution. Using simulated sequences generated with the model, we characterize the distribution of each statistic as a function of sequence composition, length, divergence time and double-coding frame. Using these results, we develop several algorithms for detecting overlapping CDSs.
The algorithms were tested on known overlapping CDSs and other overlapping open reading frames (ORFs) in the hepatitis B virus (HBV), Escherichia coli and Salmonella typhimurium genomes. The algorithms should prove useful for detecting novel overlapping genesespecially short coding ORFs in viruses.
Availability: Programs may be obtained from the authors.
Contact: chris.brown{at}otago.ac.nz
Supplementary information: http://biochem.otago.ac.nz/double.html
| 1 INTRODUCTION |
|---|
|
|
|---|
Overlapping gene coding sequences (CDSs) occur where a nucleotide sequence codes for two different amino acid sequences in different read-frames. There are five possible read-frames for a secondary open reading frame (ORF) relative to a primary ORF that we label 3, 2, 1, +1 and +2 (Fig. 1). In terms of nucleotide mutation patterns, +1 and +2 are equivalent upon interchanging the primary and secondary ORFs.
|
Overlapping CDSs occur frequently in viruses (Normark et al., 1983) where they serve as a mechanism for fitting more genetic information into a small genome and for coregulating gene expression. In the hepatitis B virus (HBV), for example,
50% of the genome comprises overlapping CDSs (Mizokami et al., 1997). Overlapping CDSs also occur in prokaryotes (Rogozin et al., 2002; Fukuda et al., 2003) and, more rarely, in eukaryotes (Sharpless and DePinho, 1999; Poulin et al., 2003). Ribosomal frameshifting (Farabaugh, 1996) can also lead to partially overlapping CDSs in the +1 and +2 frames. The majority of current gene-finding algorithms are optimized for finding non-overlapping protein-coding genes. Such methods are well-developed and generally make use of combinations of the following signatures of protein-coding genes: (1) signal (e.g. splice sites, ORFs), (2) content (e.g. codon bias), (3) similarity to known sequences or conservation between species and (4) expression in cDNA/expressed sequence tag (EST) libraries (Stormo, 2000; Snyder and Gerstein, 2003).
Although standard gene-finding algorithms can also be used to find overlapping CDSs, there are a variety of problems which can lead to a decrease in sensitivity. Owing to the double-coding constraints, overlapping CDSs often display an atypical codon bias (Pavesi, 2000). Extending training set methods, such as hidden Markov models (HMMs), to overlapping CDSs is made difficult by several different frames (each requiring its own model) and limited training data. Similarity to known sequences or conservation between species may only point to the existence of one of an overlapping pair. Furthermore, overlapping genes on the same read-strand (e.g. at ribosomal frameshifting sites) may have the same promoter and mRNA, so that looking for promoters or expression may only identify one of the two genes.
Nonetheless overlapping CDSs have their own signatures resulting from the mutational constraints imposed by the requirement of simultaneously maintaining protein function in both genes. For example, previous studies have investigated (1) relative substitution rates in N 1, N 2 and N 3, i.e. the 1st, 2nd and 3rd nucleotide positions in each codon (Bilsel et al., 1990); (2) information theory indices (Pavesi et al., 1997); (3) low rate of synonymous mutations relative to other sites (Mizokami et al., 1997; Pavesi, 2000); and (4) codon usage (Pavesi et al., 1997; Pavesi, 2000). Perhaps the simplest method is to measure the mutation frequency in N 3 relative to N 1 and N 2. In single-coding sequences, many N 3 mutations give rise to synonymous amino acids and so N 3 is relatively unconstrained. In contrast, in double-coding sequences (except the 2 frame) N 3 is generally constrained since it corresponds to N 1 or N 2 in the alternative frame.
Given a known CDS in a sequence alignment and a potential overlapping CDS (i.e. a conserved or semi-conserved ORF), one may proceed to measure the conservation in N 1, N 2, N 3 within the overlap region and compare it with the background (i.e. single-coding) statistics measured in the non-overlap region of the known CDS. However, it is not clear how the results should be interpreted. For example: How significant are any observed deviations? What are the dependences on sequence divergence, composition and double-coding frame? What if the non-overlap region of the known CDS is too short to calculate background N 1, N 2, N 3 statistics?
In this paper, we investigate several simple statistics that may be used on pairwise alignments, namely the mutation rate in N 1, N 2, N 3 (N123) and the rate of synonymous and non-synonymous mutations (NsNn). We also develop a more sensitive and less frame-dependent, maximum-likelihood statistic (MLOGD) that makes use of a nucleotide substitution matrix, a codon usage table (CUT) and an amino acid substitution matrix. We develop a Monte Carlo sequence evolution algorithm that can produce simulated sequence alignments subject to either single-coding or double-coding constraints. Then we use the simulated sequences to characterize the distributions of the N123, NsNn and MLOGD statistics, as a function of sequence divergence, composition, length and frame, for both single-coding and double-coding sequences. By comparing observations for a real sequence alignment with such simulations, we may attempt to determine whether a potential overlapping CDS is real (i.e. functionally constrained) or not. We find that the new MLOGD statistic,
s,d, is the most sensitive statistic for detecting overlapping CDSs.
| 2 METHODS |
|---|
|
|
|---|
In Section 2.1 we describe the various statistics that we use to test for double-coding. In Section 2.3 we describe the Monte Carlo sequence evolution algorithm. In Section 2.3 we describe how we combine the two to classify a given sequence alignment as single-coding or double-coding. Further details are given in the Supplementary Material.
2.1 Double-coding test statistics
We investigate three test statistics designed to detect the mutation signature of overlapping CDSs in pairwise alignments of two sequences, S 1 and S 2.
2.1.1 N123: mutation rate in N 1, N 2, N 3
For the N123 method, we simply count the number of nucleotide differences between S 1 and S 2 in each of N 1, N 2 and N 3 (i.e. the 1st, 2nd and 3rd nucleotide positions in codons in the primary read-frame) and express each count as a fraction of the total number of N 1, N 2 and N 3 loci. We label these statistics f N1, f N2 and f N3, respectively.
2.1.2 NsNn: Synonymous and nonsynonymous mutation rates
For the NsNn method, we step through aligned codon pairs in the primary read-frame of S 1 and S 2 and count which codon pairs are identical, non-identical but synonymous and non-synonymous. The numbers of synonymous and non-synonymous codon pairs are expressed as a fraction of the total number of codon pairs. We label these statistics f syn and f non, respectively.
2.1.3 MLOGD: maximum-likelihood method
The MLOGD (Maximum -Likelihood Overlapping Gene Detector) method is an attempt to estimate the relative probabilities of S 1 mutating to S 2 under single-coding and double-coding models.
The evolution of a single-coding sequence is often modelled as a Markov process Goldman and Yang, 1994. The probability of S 1 mutating to S 2 after time t may be expressed as
![]() |
where
,
are the k-th codons of S 1, S 2 and [p ij (t)] = P(t) = exp(Q t), where Q is a 64 x 64 matrix of instantaneous codon mutation probabilities.
In the case of overlapping CDSs, cross-talk between adjacent codons prevents the factorization-by-codon seen in (1). In theory, instead of a 64 x 64 codon matrix Q, we must define a single 4 N x 4 N matrix (where N is the sequence length in nucleotides) that describes the entire sequence at once. Clearly, for typical values of N, this is computationally impractical. Instead, we use the following simplified approach.
For each nucleotide pair
,
in S 1, S 2, we estimate the probability that
mutates to
for each of the single-coding and double-coding models. Specifically, we first define
, i = U, C, A, G, by
![]() |
for the single-coding (m = s) model, and
![]() |
for the double-coding (m = d) model. Here X 1 and X 2 are the original and final amino acids or codons in the primary read-frame and X'1 and X'2 are the original and final amino acids or codons in the secondary read-frame for the nucleotide mutation
. Also P(t) = exp(Q t), and Q, C and A are nucleotide, codon and amino acid substitution matrices, respectively (described in Section 2.2).
The probability that
mutates to
, after time t, is then given by
![]() |
As in (1), we sum over the sequence alignment as follows:
![]() |
We maximize (5) with respect to t for each of the single-coding and double-coding models, giving t s and t d , respectively. Then the log-likelihood ratio of the two models is
![]() |
If
s,d is positive, then the observed mutations between S 1 and S 2 are more consistent with double-coding. If
s,d is negative, then the observed mutations are more consistent with single-coding.
The above methodology involves several approximations. For each component of the probability sum (5), we consider only a single nucleotide and the single codon in each of the primary and secondary read-frames containing that nucleotide; longer-range dependences are ignored. In addition, as far as codon and amino acid weightings are concerned, we consider only the start and end points of the unknown mutation pathway connecting an aligned codon pair in S 1 and S 2. It seems reasonable that these simplifications are justified provided S 1 and S 2 are not too divergent (so that mutation pathways are short and inter-codon cross-talk is low). Tests with simulated sequences (which are not subject to these simplifications) show that the model provides useful results over a wide range of circumstances (see Sections 3.1.13.1.4).
2.2 Monte Carlo sequence evolution model
The problems outlined above do not arise when generating simulated sequences. Random nucleotide mutations may be applied sequentially and the neighbourhood of any mutating nucleotide is known at the time of mutation, thus allowing appropriate amino acid and codon substitution weights to be applied.
Given a starting sequence, a frame, a mutation rate, a 4 x 4 nucleotide mutation matrix Q, a 64-entry CUT C and a 20 x 20 amino acid substitution matrix A, our Monte Carlo simulation applies nucleotide mutations one-by-one until the required total number of mutations have occurred. Nucleotide mutations are chosen randomly with probabilities determined by Q and are accepted or discarded with probabilities determined by C and A (for full details see Supplementary Material). Under the double-coding model, the codon and amino acid weights are applied in both read-frames. In order to match the simulations to real sequence data, the mutation rate
is tied to the number of observed mutations rather than the number of accepted mutations (e.g. A
C
G counts as two accepted mutations but only one observed mutation).
By default, we use a
= 3 Kimura, (1980) nucleotide matrix, null CUT (i.e. equal codon frequencies) and the Henikoff and Henikoff, (1992) BLOSUM40 amino acid substitution matrix (for details see Supplementary Material).
2.3 Classification
Having calculated the statistics
s,d, f N1, f N2, f N3, f syn, f non for a particular pairwise sequence alignment S 1 + S 2, as described in Section 2.1, we then wish to classify it as either single-coding or double-coding according to which of the two models is most consistent with the observed statistics. For the MLOGD method, classification is straightforward: if
s,d > 0 we classify S 1 + S 2 as double-coding, while if
s,d
0 we classify S 1 + S 2 as single-coding. For the N123 and NsNn methods, we use the simulations to find out what range of values we would expect to observe for the two models. We may then find the log-likelihood ratios
![]() |
and
![]() |
We generate simulated sequences (Section 2.2), starting with the sequence S 1 and with the mutation rate fixed by the total number of point differences between S 1 and S 2. We generate 100 simulated sequences each for the single-coding and double-coding models. We then calculate f N1, f N2, f syn, f non for each of the 200 sequence pairs comprising S 1 and one of the simulated sequences, and calculate the means and SD,
,
, where m = s or d is the model, and x is one of the statistics f N1, f N2, f syn, f non. Note that we omit f N3 since, having fixed
= f N1 + f N2 + f N3, f N3 adds no new information.
Assuming normal distributions and assuming independence of f N1, f N2 and of f syn, f non, it is straightforward to calculate
and
(see Supplementary Material). Since, in fact, these two assumptions are only approximations,
and
are not strictly speaking likelihood ratios, but may still be used as classifiers: the sequence pair, S 1 + S 2, is classified as single-coding if
or
and as double-coding if
or
. The magnitudes of
s,d,
and
also give a measure of the confidence of classifications (see Supplementary Material).
| 3 RESULTS |
|---|
|
|
|---|
In Section 3.1 we use simulated sequences to characterize the statistics introduced in Section 2.1. We investigate their dependence on sequence divergence, double-coding frame (Section 3.1.1) and sequence length (Section 3.1.2). We also investigate how robust the statistics are with respect to choice of input nucleotide, codon and amino acid matrices (Section 3.1.3) and with respect to sequencing errors (Section 3.1.4). In Sections 3.2 and 3.3 we test the algorithms on real sequence data.
3.1 Tests on Monte Carlo simulations
3.1.1 Dependence on frame
Simulated sequences (Section 2.2) were generated from initial random nucleotide sequences with a range of mutation rates 0
0.5, where
is the number of observed mutations per nucleotide. Each simulated sequence S 2, together with the corresponding initial sequence S 1, makes an aligned sequence pair for which the statistics
s,d, f N1, f N2, f syn, f non of Section 2.1 may be calculated.
Figure 2 shows the distribution of the MLOGD statistic,
s,d, as a function of
, for the different frames. MLOGD clearly separates the single-coding from the double-coding simulations, except at the smallest
values (see Section 3.1.2), irrespective of frame. Figures 3 and 4 show the distribution of the N123 and NsNn statistics, f N1, f N2, f syn, f non, for the different frames. Here the distinction between single-coding and double-coding is often less clear, and much more frame-dependent.
|
|
|
3.1.2 Dependence on sequence length
It is apparent that the ability of any method to distinguish single-coding from double-coding regions will decrease for lower mutation rates
and for shorter sequence lengths. Figures analogous to Figures 24 for sequence lengths 300, 50 and 20 codons are given in the Supplementary Material. Figure 5 shows the variation in the spread of
s,d values for sequence lengths 1000, 300, 50 and 20 codons for the +2 frame (e.g. the frame of the P, C and X genes relative to the S, P and P genes, respectively, in HBV and also the frame of 1 ribosomal frameshifts). The longest overlap in HBV is
440 codons, though overlap regions are generally much shorter. Even for a sequence length of 20 codons,
s,d performs fairly well for
0.2. In contrast f syn, f non and, to some extent, f N1, f N2, are much less useful for such short sequences (see below for details).
|
It is difficult to tell from such plots which statistics are most usefulespecially for small
, where identifying double-coding regions is most challenging. To more precisely quantify the utility of the three methods of Section 2, we use the procedure described in Section 2.3 to classify each of 100 single-coding and 100 double-coding simulations (for a particular initial sequence, frame and mutation rate
), calculating
,
, from the other 99 single-coding and 99 double-coding simulations. The number of single-coding simulations correctly classified as single-coding and the number of double-coding simulations correctly classified as double-coding give a measure of the power of each method.
These values are summarized in Figure 6 for the +2 frame (see Supplementary Material for other frames). For long sequences, MLOGD and N123 give good results. For short sequences, NsNn becomes unusable. The MLOGD statistic,
s,d, is consistently the most discriminating classifier, especially for small
. For a sequence of 300 codons, MLOGD has a mean success rate (averaged over all five frames) of 97% for
as low as 0.03 (cf. 74% for NsNn and 88% for N123). For a sequence of 20 codons, MLOGD has a mean success rate of 83% for
= 0.2 (cf. 66% for NsNn and 77% for N123). The NsNn results are little better than a random classifier (success rate 50%). These results are summarized in Figure 7.
|
|
3.1.3 Effect of model parameters
We tested how robust the algorithms are with respect to the choice of input nucleotide, codon and amino acid matrices (Section 2.2). Simulated sequences were generated with (1) the default 50% GC-content nucleotide matrix replaced with a 70% GC matrix, (2) the default null CUT replaced with a human CUT or (3) the BLOSUM40 amino acid matrix replaced with the BLOSUM62 matrix. These sequences were classified as in Section 3.1.2, with the
,
model values derived from simulated sequences using the default matrices. Figures analogous to Figure 6 for (1), (2) and (3) are available in the Supplementary Material. MLOGD, NsNn and N123 are all robust with respect to reasonable changes in the model matrices.
3.1.4 Effect of sequencing errors
We also tested how robust the algorithms are with respect to sequencing errors. Random nucleotide substitution errors were added to simulated sequences at a rate of 0.3, 1 and 3% and the resulting sequences were classified as in Section 3.1.2, with the
,
model values derived from simulated sequences without errors. Figures analogous to Figure 6 for the different error rates are available in the Supplementary Material. MLOGD, NsNn and N123 are all similarly affected by sequencing errors. Averaging over all five frames, the decrease in classification success is about
(e.g. for
= 0.1 and a 1% sequencing error rate, the decrease in classification success is typically 3%). When the error rate is of the order of a third of the mutation rate, or greater, all methods are significantly affected.
3.2 Tests on hepatitis B (HBV) genome
We have tested the algorithms on known overlapping CDSs and non-coding overlapping ORFs in the HBV genome. HBV has a circular partially double-stranded DNA genome. The long strand comprises
3215 nt and encodes four genes (P, C, S, X) read in the forward direction. The S gene is completely contained in the P gene and read in the +1 frame relative to P, while the C and X genes both overlap the ends of the P gene and are read in the +2 frame (Fig. 8).
|
We used 10 human strains (GenBank accession numbers NC_003977 [GenBank] , X70185 [GenBank] , D00329 [GenBank] , X75665 [GenBank] , AB074755 [GenBank] , D50520 [GenBank] , X02496 [GenBank] , X75664 [GenBank] , X75663 [GenBank] and AB056514 [GenBank] ), illustrating a range of diversity [genotypes C2, A, B, C, C1, C2, D, E, F and G; see Huy et al. (2004)], together with strains from woolly monkey (AF046996 [GenBank] ) and woodchuck (J02442 [GenBank] ). A phylogenetic tree is given in the Supplementary Material.
Multiple sequence alignments were made separately for each of the P, C, S, X ORFs using CLUSTALW (Higgins et al. 1994) on the translated sequences. In the interests of demonstrating a fully automated approach, no manual adjustments were made to the alignments. ORFs were detected in the reference sequence NC_003977 [GenBank] using the EMBOSS program getorf (Rice et al., 2000). For each potential overlap region in the reference sequence, MLOGD, NsNn and N123 statistics were calculated for each of the other 11 sequences paired with the reference sequence.
3.2.1 Hepatitis Bknown overlaps
The results for the overlap between the S and P genes are shown in Figure 9. This overlap region contains
440 codons, providing a good signal for all three methods. The overlaps between C and P and between X and P are shorter (
45 and
80 codons, respectively) and the signal is correspondingly reduced. All three methods give incorrect classifications for some sequence pairs, with the fraction of sequence pairs correctly classified being 86, 59 and 77% for MLOGD, NsNn and N123, respectively (see Supplementary Material). For MLOGD and N123, nearly all the incorrect classifications have low confidence (i.e. scores close to zero).
|
Note that the observed statistics are not infrequently outside the confidence limits predicted by the simulations. One reason for this is that the mutational patterns in the compact genomes of viruses are subject to many other constraints that can produce deviations from the models.
Tests with known overlapping CDSs in Lentiviruses, Luteoviruses and Poleroviruses produced similar results (Firth A.E. and Brown C.M., manuscript in preparation).
3.2.2 Hepatitis Bpotential overlaps
We also tested the algorithms on all other ORFs in NC_003977
[GenBank]
that have at least 30-codon overlaps with one or more of the known P, C, S, X genes (Fig. 8). There are 43 such overlaps, involving 29 new ORFs. These ORFs are presumed to be non-coding, however short functional ORFs have been found in other viruses. Unlike the four known genes, many of these ORFs are in the reverse read-direction. Some are conserved in only a few strains and, although we skip stop
non-stop codon transitions in the calculation of the MLOGD statistic, their presence in an alignment would commonly be taken as an evidence that an ORF is non-functional. Figures analogous to Figure 9 for all 43 overlaps, together with a summary table of statistics, are given in the Supplementary Material.
When scores are summed over all 11 sequence pairs, MLOGD, NsNn and N123 identify 79, 77 and 74%, respectively of the 43 overlaps as single-coding. One reason for the
2126% failures may be due to the short length of some of the overlap regionsgiving a reduced signal. Another important factor is that in viruses there are often several constraints, besides maintaining protein function, that contribute to the pattern of sequence conservation at any site. In particular, a large number of these ORFs overlap two known genes (e.g. S and P) thus giving an extra frame-dependent constraint on N 1, N 2, N 3, which essentially invalidates the N123 method. The MLOGD method is more robust with respect to this effect and in fact there are only a few (namely 4 out of 20) particular combinations of frameseasily determined from simulations (see Supplementary Material)that are liable to give rise to a false double-coding signal in a tertiary ORF overlapping two coding sequences. In fact such a combination of frames explains seven of the nine false positives identified by the MLOGD method. These false positives are easily removed since all seven ORFs give a negative signal relative to the other of the two overlapping known genes. Taking such double-overlaps into account, MLOGD has a 93% success rate on the 29 (presumed) non-functional ORFs, many of which are very short.
3.3 Tests on bacterial genomes
The much larger genomes of bacteria are not subject to the same size constraints as virus genomes, and the vast majority of short ORFs are expected to be non-coding. In addition, real overlaps may often be the result of random events (e.g. stop codon mutation) and may not be subjected to strong functional constraints (Fukuda et al., 2003). Hence, testing the algorithms on bacterial genomes can provide a good estimate of the false positive rate.
We used a table of 3682 pairwise symmetrical best hits between annotated protein-coding genes in Escherichia coli (K12; NC_000913 [GenBank] .2) and Salmonella typhimurium (LT2; NC_003197 [GenBank] .1), obtained from NCBI, as the set of primary ORFs. The set of secondary ORFs was taken to be all ORFs detected in E.coli. MLOGD scores were calculated (using the alignment between the primary ORFs in E.coli and S.typhimurium and an E.coli CUT) for all overlaps with length at least 20 codons and sequence identity within the overlap of at least 70%. A total of 18 081 overlaps were analysed. Only 19 of these involved two annotated genes in E.coli, and only three of these pairs were conserved as an overlapping pair in S.typhimurium.
Figure 10 shows the distribution of MLOGD scores for the different frames. Except for the 2 frame, the number of MLOGD scores greater than zero is very low: 2 and 0.5% for overlap lengths
20 and 50 codons, respectively. Some of these overlaps may be functionale.g. the three conserved annotated overlaps marked, plus possibly others which, even if not conserved in S.typhimurium, may still have been subject to double-coding constraints over part of the evolutionary divergence between the two species.
|
The false positive rate in the 2 frame is unacceptably high if a threshold score of zero is used. In this frame, N 3 in the primary ORF opposes N 3 in the secondary ORF, leaving N 3 relatively unconstrained. Thus evolution mimics evolution in single-coding sequences. Hence, a higher threshold needs to be usede.g. using the variation in calculated MLOGD scores, as a function of overlap length, as a guide (see also Supplementary Material). Note also that overlaps in the 2 frame are expected to be relatively rare (Rogozin et al., 2002).
| 4 DISCUSSION |
|---|
|
|
|---|
We have presented a new algorithm (MLOGD) for predicting whether or not a particular region in a pairwise sequence alignment is likely to be double-coding. We have also used simulations to formalize and put probabilities on two simple, but rather ill-defined, previously used test statistics (N123 and NsNn). All three methods have been fully characterized using simulated data, and tested on viral and bacterial genomes.
We find that no method gives perfect predictions all the time, but MLOGD consistently performs better than N123, while NsNn performs rather poorly. MLOGD is particularly useful when there are extra constraints on sequence evolution (e.g. known double-coding with a potential tertiary ORF, or known single-coding with two alternative potential secondary ORFs). MLOGD also has the advantage that the distribution of the
s,d statistic is relatively frame-independent and easy to interpret (namely,
s,d
0
single-coding and
s,d > 0
double-coding). In contrast, the N123 and NsNn statistics require time-consuming simulations to interpret their values. Simulations suggest that MLOGD can identify double-coding regions at
90% confidence for pairwise sequence alignments that differ at just
20 nt sites (Fig. 7), though results with complex real sequence data are less good (possibly due to suboptimal alignments, Section 3.2).
Although MLOGD produced the best results, we suggest using all three methods and comparing results. By doing this, one can efficiently identify candidate overlapping CDSs for further investigation. In addition, comparison with the model simulations can provide valuable insights about the mutational constraints operating within a given sequence. For an input sequence alignment, our software package automatically produces plots similar to Figure 9 and a table of sequence statistics, allowing a quick visual inspection of double-coding constraints in any input primary ORF.
For many pairs of overlapping genes, one of the two is fairly unconstrained, e.g. in viruses many code for structural proteins whose primary sequence may not be highly conserved. In evolutionary terms, such a situation is desirable, since too many constraints can be a severe limitation on evolution (Rogozin et al., 2002). Our software also produces nucleotide-by-nucleotide plots for MLOGD. This allows one to quickly identify particularly conserved regions in an overlap. Such analyses can be useful for selecting targets for vaccines or antiviral drugs since, if functional constraints are important in both genes of an overlapping pair, then mutations will be more restricted.
Although the statistics
s,d,
and
(Section 2.3) may be summed along the branches of a phylogenetic treefor increased sensitivityby calculating, for example, maximum parsimony sequences at intermediate nodes, we have preferred to display statistics for each pairwise alignment. When only a small numbers of species are available, it is informative to view each comparison individually. Even when large alignments are available, certain overlapping CDSs may exist in only some lineages, so again it is important to view the pairwise, rather than summed, statistics.
The algorithms should prove useful for detecting novel genes in viruseswhere overlapping gene structures are common and often conserved over large evolutionary distances. They are expected to be particularly useful for analysing short ORFs and ribosomal frameshift sites. In addition, many overlapping CDSs have been annotated in yeast and prokaryotic genomes, and these algorithms could be used to test their functionality.
| Acknowledgments |
|---|
Thanks to Peter Stockwell, Amonida Zadissa and Rob Pollock for many interesting discussions. A.E.F. gratefully acknowledges funding from the Foundation for Research, Science and Technology, grant number UOOX0304.
Received on May 20, 2004; revised on July 29, 2004; accepted on August 26, 2004
| REFERENCES |
|---|
|
|
|---|
Bilsel, P.A., Rowe, J.E., Fitch, W.M., Nichol, S.T. (1990) Phosphoprotein and nucleocapsid protein evolution of vesicular stomatitis virus New Jersey. J.Virol., 64, 24982504
Farabaugh, P.J. (1996) Programmed translational frameshifting. Annu. Rev. Genet., 30, 507528[CrossRef][Web of Science][Medline].
Fukuda, Y., Nakayama, Y., Tomita, M. (2003) On dynamics of overlapping genes in bacterial genomes. Gene, 323, 181187[CrossRef][Web of Science][Medline].
Goldman, N. and Yang, Z. (1994) A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol., 11, 725736[Abstract].
Henikoff, S. and Henikoff, J.G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA, 89, 1091510919
Higgins, D., Thompson, J., Gibson, T. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 46734680
Huy, T.T., Ushijima, H., Quang, V.X., Win, K.M., Luengrojanakul, P., Kikuchi, K., Sata, T., Abe, K. (2004) Genotype C of hepatitis B virus can be classified into at least two subgroups. J. Gen. Virol., 85, 283292
Kimura, M. (1980) A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol., 16, 111120[CrossRef][Web of Science][Medline].
Mizokami, M., Orito, E., Ohba, K., Ikeo, K., Lau, J.Y., Gojobori, T. (1997) Constrained evolution with respect to gene overlap of hepatitis B virus. J. Mol. Evol., 44, S83S90[CrossRef][Web of Science][Medline].
Normark, S., Bergstrom, S., Edlund, T., Grundstrom, T., Jaurin, B., Lindberg, F.P., Olsson, O. (1983) Overlapping genes. Ann. Rev. Genet., 17, 499525[CrossRef][Web of Science][Medline].
Pavesi, A. (2000) Detection of signature sequences in overlapping genes and prediction of a novel overlapping gene in hepatitis G virus. J. Mol. Evol., 50, 284295[Web of Science][Medline].
Pavesi, A., Iaco, B., Granero, M.I., Porati, A. (1997) On the informational content of overlapping genes in prokaryotic and eukaryotic viruses. J. Mol. Evol., 44, 625631[CrossRef][Web of Science][Medline].
Poulin, F., Brueschke, A., Sonenberg, N. (2003) Gene fusion and overlapping reading frames in the mammalian genes for 4E-BP3 and MASK. J. Biol. Chem., 278, 5229052297
Rice, P., Longden, I., Bleasby, A. (2000) EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet., 16, 276277[CrossRef][Web of Science][Medline].
Rogozin, I.B., Spiridonov, A.N., Sorokin, A.V., Wolf, Y.I., Jordan, I.K., Tatusov, R.L., Koonin, E.V. (2002) Purifying and directional selection in overlapping prokaryotic genes. Trends Genet., 18, 228232[CrossRef][Web of Science][Medline].
Sharpless, N.E. and DePinho, R.A. (1999) The INK4A/ARF locus and its two gene products. Curr. Opin. Genet. Dev., 9, 2230[CrossRef][Web of Science][Medline].
Snyder, M. and Gerstein, M. (2003) Defining genes in the genomics era. Science, 300, 258260
Stormo, G.D. (2000) Gene-finding approaches for eukaryotes. Genome Res., 10, 394397
This article has been cited by other articles:
![]() |
B. Y.-W. Chung, W. A. Miller, J. F. Atkins, and A. E. Firth An overlapping essential gene in the Potyviridae PNAS, April 15, 2008; 105(15): 5897 - 5902. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. McCauley, S. de Groot, T. Mailund, and J. Hein Annotation of selection strengths in viral genomes Bioinformatics, November 15, 2007; 23(22): 2978 - 2986. [Abstract] [Full Text] [PDF] |
||||
![]() |
R. Belshaw, O. G. Pybus, and A. Rambaut The evolution of genome compression and genomic novelty in RNA viruses Genome Res., October 1, 2007; 17(10): 1496 - 1504. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. de Groot, T. Mailund, and J. Hein Comparative annotation of viral genomes with non-conserved gene structure Bioinformatics, May 1, 2007; 23(9): 1080 - 1089. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. McCauley and J. Hein Using hidden Markov models and observed evolution to annotate viral genomes Bioinformatics, June 1, 2006; 22(11): 1308 - 1316. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||













for normal distributions) based on 100 simulations at each of 
perfect classification) while the lower three lines show the fraction of double-coding simulations classified as single-coding (0
random initial sequences. The rms error is of the order 0.010.03.)





