Bioinformatics Advance Access originally published online on January 31, 2007
Bioinformatics 2007 23(7):825-831; doi:10.1093/bioinformatics/btm024
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
A simulation test bed for hypotheses of genome evolution

1Faculty of Computer Science, Dalhousie University, Halifax, Nova Scotia, Canada, 2Institute for Molecular Bioscience and ARC Centre for Bioinformatics, Brisbane, Australia and 3GenomeAtlantic, Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, Nova Scotia, Canada
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Microbial genomes undergo evolutionary processes such as gene family expansion and contraction, variable rates and patterns of sequence substitution and lateral genetic transfer. Simulation tools are essential for both the generation of data under different evolutionary models and the validation of analytical methods on such data. However, meaningful investigation of phenomena such as lateral genetic transfer requires the simultaneous consideration of many underlying evolutionary processes.
Results: We have developed EvolSimulator, a software package that combines non-stationary sequence and gene family evolution together with models of lateral genetic transfer, within a customizable birth–death model of speciation and extinction. Here, we examine simulated data sets generated with EvolSimulator using existing statistical techniques from the evolutionary literature, showing in detail each component of the simulation strategy.
Availability: Source code, manual and other information are freely available at www.bioinformatics.org.au/evolsim
Contact: beiko{at}cs.dal.ca
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
|
|
|---|
Sequenced genomes provide large sets of predicted genes and proteins, allowing the development and testing of genome-level evolutionary hypotheses. Such tests have been used to assess the effect of genome G + C content on protein composition (Echols et al., 2002; Singer and Hickey, 2000), modes of sequence evolution under different regimes such as parasitism and symbiosis (Itoh et al., 2002; Rispe et al., 2004), and the extent of codon usage bias (Merkl, 2003; Sharp et al., 2005). Genes that deviate significantly from the global compositional pattern of a genome may be evidence of other evolutionary phenomena such as lateral genetic transfer (LGT) (Lawrence and Ochman, 1997, 1998; Nakamura et al., 2004; Wang et al., 2001). Like the mutational and substitutional processes described above, the evolutionary phenomena that have been discovered and elucidated in genome-scale analyses need to be examined in detail with simulation studies. These studies are vital both for assessment of the robustness and sensitivity of analytical techniques, and to generate test data sets that can be compared to equivalent empirical data. LGT is a particularly interesting case, as there are many hypotheses about the extent and biases of LGT events (Lawrence and Hendrickson, 2003). Inference of LGT is confounded by several evolutionary effects: sequence saturation, cryptic paralogy, non-stationary sequence substitution and amelioration of acquired genes to reflect the mutational biases of the recipient genome (Beiko et al., 2005; Jordan et al., 2001; Kunin and Ouzounis, 2003; Lawrence and Ochman, 1997; Ragan, 2001; Stiller and Hall, 1999).
Empirical analyses make use of data that were generated through actual evolutionary processes, but in most cases, empirical conclusions cannot be validated since the true history of the sequences under consideration is not known. Evolutionary simulations allow the investigation of evolutionary processes and hypotheses where the real history of sequences is known, but the value of simulations is constrained by limitations in our understanding and implementation of evolutionary processes. A simple example would be the use of a grossly oversimplified substitution model to compare the accuracy of different phylogenetic methods: any conclusions derived from simulations of a simple Poisson substitution process with no among-site rate variation (for instance) would not be applicable to many real sequence data sets (Sullivan and Swofford, 1997; Yang et al., 1994). To model the evolution of genes in different lineages, a genome simulation package should include non-stationary evolutionary models that perpetuate a drift in mutational biases across lineages through evolutionary time. Existing sequence simulation packages offer good conceptual starting points: packages such as evolver from PAML (Yang, 1997) and Seq-Gen (Rambaut and Grassly, 1997) offer multiple substitution models and among-site rate variation. Recent simulation packages that incorporate the important idea of non-equilibrium substitution rates include HETERO (Jermiin et al., 2003) and covTree (Blouin et al., 2005). These models must then be combined with models of organismal speciation and extinction to simulate the vertical inheritance of sets of genes. Bi-De (Rambaut et al., 1996) offers birth and death models for lineages, with subsequent simulation of sequence evolution along the generated lineages. Gene family expansion and contraction has been assessed with parsimony (Kunin and Ouzounis, 2003) and with birth, death and innovation models (BDIM) (Karev et al., 2002, 2004), but there has been no explicit linkage with sequence substitution and possible losses of selective pressure due to gene duplication. Little has been done to date in the simulation of LGT events: part of the reason for this is that LGT simply yields a modified tree topology when evolutionary models are static. The amelioration of transferred genes can furthermore occur only when the donor and recipient genomes have different substitution regimes.
We have developed EvolSimulator, a new program which performs the essential integration of gene-, protein- and genome-scale evolutionary processes that occur in prokaryotes, with user-configurable parameters to regulate each component of the simulation. EvolSimulator simultaneously creates lineages, gene families and sequence substitutions according to probabilistic models whose parameters are allowed to drift through evolutionary time. An important principle in the design of EvolSimulator is the partitioning of summary models (such as substitution matrices and rates of gene duplication) into component forces that can act specifically on the evolutionary units of residues, genes and genomes. An example of this (presented in more detail below) is the decoupling of traditional substitution models into genome-specific mutational biases and substitution probabilities (Yampolsky and Stoltzfus, 2005) that are controlled at the genome, gene and residue level. Data generated under these decoupled processes can then be summarized and compared to similar observations from empirical data. Several different functional and evolutionary hypotheses pertaining to LGT are implemented as optional models in EvolSimulator: LGT can be forbidden, or can be permitted as a consequence of one or more of these models. If non-stationary evolutionary models are used, then transferred genes may be gradually ameliorated following their uptake by the recipient genome. EvolSimulator has already been used to generate null models for a detailed study of LGT in the cyanobacteria (Zhaxybayeva et al., 2006); below we give further examples of questions that can be addressed using this software.
| 2 SYSTEM AND METHODS |
|---|
|
|
|---|
The following subsections describe the features available in EvolSimulator in principle and relative to existing theoretical and empirical models: further details about the implementation of all evolutionary parameters can be found in the EvolSimulator manual. Validation techniques and results can be found in Section 3. We finish with a brief conclusion in Section 4.
2.1 The global simulation level and creation of the cenancestor
A single cenancestral genome is first constructed by specifying the number of genes for each of one or more compositional classes of proteins. These are constructed by sampling from user-specified amino acid frequency tables and then deriving a cognate nucleotide sequence by selecting appropriate codons for the amino acid sequence. Several gene- and protein-level characteristics are assigned from probability distributions, including a serially correlated degree of resistance to residue substitution along its sequence (rates across sites). Instead of sampling the evolutionary rate of each site at random from a frequency distribution such as a continuous gamma, e.g. Yang (1993) with mean and variance parameters, the evolutionary rate at each amino acid site is set equal to the rate at the previous site, adjusted proportionally by a value between zero and a user-specified maximum amount. This approach allows rate variation across sites, with contiguous regions of a protein potentially showing conservation patterns similar to protein domains, and addresses the same principles as the models described by Yang (1995) and Felsenstein and Churchill (1996).
Two pseudo-random number generators are employed within EvolSimulator. The first controls genome-level events such as speciation and mutational biases, while the second directs the simulation's other stochastic events. One can therefore reproduce every detail of a previous simulation if desired, or reproduce a given phylogenetic history of genomes with consistent mutational characteristics in separate runs, while changing the parameters of, e.g. amino acid substitution. Evolutionary forces of different magnitudes can therefore be applied to a set of genomes with the same history of speciation, extinction and shifting mutational biases, allowing direct comparisons across runs.
2.2 Genome evolution and ecology
In our model, the total number of evolving genomes is kept reasonably constant over evolutionary time after an initial speciation phase, with one extinction event compensating each speciation event on average. Extinct lineages may easily outnumber extant ones in such a model; this is important when LGT occurs, because then-contemporary donor lineages can subsequently go extinct. Birth-and-death processes (Kendall, 1948) are frequently used to model speciation and extinction events; the basic principle in EvolSimulator is the same, though many parameters can modify the underlying probability that a given lineage will speciate or go extinct. We simulate balanced speciation and extinction in a fashion similar to the epidemic process described in Raup et al. (1973) and implemented in Bi-De (Rambaut et al., 1996), with speciation and extinction probabilities that are nominally equal. However, in EvolSimulator there is a user-specified upper boundary on species richness, which if reached will preclude further speciation events unless the population is lowered by one or more extinctions. As in e.g. Gould et al. (1977), there is also a soft lower boundary on the number of extant species: the population can drop below this level through extinction, but the extinction rate is consequently decreased by half to reflect reduced competition in a species-poor scenario.
Genomes modeled by EvolSimulator are assigned to niches and habitats, and can occupy more than one habitat during the course of their evolution. The number of habitats, niches and niche spaces are specified by the user. If a single niche with sufficient spaces is specified, then all genomes will evolve in this context, removing any effect of niche specificity from the simulation. Conversely, the user can specify a highly heterogeneous environment, where a large number of niches and habitats (relative to the number of genomes) each contain a small number of spaces. The assignment of genomes to niches within habitats is restricted by available space and the presence of essential genes for that niche, which confer properties (in an abstract sense) that are necessary for survival of the organism. Genes are also assigned varying degrees of usefulness within each niche, each habitat and globally. The usefulness of a gene in the genome's current environment will affect its tendency to resist chance deletion. The sum of the usefulness of all genes, again in the context of the current niche, can furthermore direct speciation and extinction probabilities, if desired, as well as the amount of time typically spent in any given niche.
In EvolSimulator, there is an underlying neutral model in which no particular genome size or strategy is preferred, but this can be modified to favor genomes that are small or large, or to prefer genomes that adopt specialized or generalized lifestyles. Generalists and specialists are known to be favored in different types of environment (Futuyma and Moreno, 1988; Kassen, 2002), so the interaction between niche heterogeneity (implemented as described in the previous paragraph) and genome lifestyle can be examined in this context. One can imagine a number of additional types of selective pressure on genomes that could be implemented, such as selection for different compositional properties in different habitats (Cambillau and Claverie, 2000; Hickey and Singer, 2004).
2.3 Gene content
Karev et al. (2002) developed a BDIM to describe changes in gene content over time. In EvolSimulator, changes in the number of genes are achieved via processes that are compatible with this model: gene duplications (births), gene losses (deaths) and the introduction of genes from other genomes via LGT (one component of innovation). While the models developed in Karev et al. (2002) and (2003) are intended to model the gross behavior of all gene families, we assign specific probabilities of duplication (= duplicability) and loss (= usefulness) to each gene. The evolutionary properties of a paralog and the parent that spawned it are modified in recognition of possible redundancies or functional enhancements provided by the duplication, but either descendant gene may be sufficient in its own right to satisfy a habitat- or niche-specific requirement. This reflects the tendency of retained paralogs to specialize or gain new functions (Zhang, 2003), but can also recreate phenomena such as differential gene loss (Doolittle and Handy, 1998).
One or more LGT events may be proposed in each iteration of the simulation. A gene is randomly drawn from one of the genomes for potential introduction of a copy into a second genome. If the ortholog of an incoming gene is already present in the target, it may be replaced, otherwise the transfer represents an insertion. If such a gene gain provides a selective advantage, it may increase the probability of loss of less-important genes in the genome (Lawrence and Roth, 1999). LGT events can be allowed to succeed automatically, or they can be evaluated under one or more of a set of criteria. More closely related genomes are generally more likely to participate in successful LGT events, due to factors such as low levels of sequence divergence, compatibility of DNA transfer mechanisms, and compatibility of transcriptional regulation and DNA replication mechanisms (Gogarten et al., 2002; Gogarten and Townsend, 2005; Hendrickson and Lawrence, 2006; Jain et al., 2003). In EvolSimulator, three specific criteria can be used in addition to, or in place of, genome divergence time: genome G + C similarity, gene content similarity or niche compatibility. It is also possible to disable all of the relatedness criteria, leading to a random model of LGT in which the success of transfer is independent of any similarity between the participating genomes. The frequency at which LGT events are attempted corresponds to a Poisson distribution with a mean value specified by the user: this is similar to the technique used in Addario-Berry et al. (2003) to simulate LGT, but we distinguish between attempted LGT events and successful ones according to the criteria listed above. An attempted event can be thought of as the acquisition of foreign DNA by a recipient cell, whereas the criteria for success reflect barriers to recombination of the foreign DNA into the genome and spread of the acquired gene through the population of recipient cells. More-sophisticated criteria concerning recombination and functional interactions between genes (see e.g. Jain et al., 1999 and Milkman and Bridges, 1990) could be developed to model recombination in future versions of EvolSimulator.
2.4 Mutation and substitution
Prokaryotic genomes show a wide variation in residue content, due to mutational biases and environmental influences (Bentley and Parkhill, 2004; Foerstner et al., 2005). The rate of substitution also varies across microbial genomes, with obligate intracellular organisms such as Blochmannia floridanus representing extreme cases of accelerated rates alongside biased substitution regimes (Itoh et al., 2002; Moran, 1996). The proposal of mutations in EvolSimulator is based on genome-wide mutational biases rather than an explicit four-by-four nucleotide substitution model such as GTR (Tavaré, 1986), but the combination of mutation parameters allows the simulation of reversible and non-reversible mutation models. Successful mutation depends upon the global mutation rate within a genome, and is quantitatively biased in favor of transitions or transversions, purines or pyrimidines, and towards strongly (G + C) or weakly (A + T) bonding residues. The parameters affecting mutation can drift over time, leading to mutational biases that differ between lineages.
Acceptance of a mutation event also requires that any resulting amino acid change be permitted by selection. Every site in a protein has a current as well as a reference residue: while proteins can be highly tolerant of sequence change (Sinha and Nussinov, 2001), the set of reference residues is meant to represent the global fitness optimum of the current protein. While the reference and current residue for a given site are identical when a protein sequence is first created, substitutions can change the current residue without changing the reference, and it is the reference residue that is used to judge proposed substitutions. Suppressors and other types of mutation can be advantageous rather than neutral or deleterious, triggering the redefinition of a site's reference residue, thereby providing a new vantage point from which subsequent substitutions are evaluated.
Each protein has its own overall selective pressure between 0.0 and 1.0, which can be increased or decreased by the autocorrelated rates-across-sites substitution model described in Section 2.1. The selective pressure s and site-specific rate r are combined as follows to yield an overall propensity toward amino acid substitution
:
|
| (1) |
: when p = 1.0, the change in
is linear with respect to r for 0
r < 0.5, and 0.5 < r
1.0. Values of p < 1.0 diminish the influence of r; the default setting of p in EvolSimulator is currently 0.5. This setting will tend to concentrate
values in the vicinity of the original s, with only extreme site-specific rates causing dramatic changes in the overall propensity. This function can consequently recreate L-shaped distributions (similar to
values <1.0) if s is very close to 0.0 or 1.0 (since one tail of the distribution is near the extreme value), and bell-shaped distributions for intermediate values of s.
While amino acids share structural and chemical properties that influence substitutions, an amino acid that is present at two different sites within a protein may play two entirely different roles (e.g. participation in an
-helix versus a loop region, in real proteins), and consequently selection will act differently on one residue versus the other. We model this by using two 20 x 20 amino acid substitution matrices to assign residue specificity to specific substitution events. A global, user-specified amino acid matrix (M) is applied universally, but is modified at each site by a specific auxiliary matrix. A site's auxiliary matrix (A), chosen at random from among many when the protein is first created (in the cenancestor, or as the result of a gene duplication), modifies the global matrix idiosyncratically, extending or restricting the range of specific substitutions in compensation for the necessarily simplifying assumptions of a global substitution model. Together, the two matrices are intended to represent residue-specific selection due to protein functional and structural constraints. The default matrix in EvolSimulator is a reversible version of the EX matrix (Yampolsky and Stoltzfus, 2005), because this model is a pure representation of amino acid exchangeability, without reference to codon distances or mutational probabilities. An exchange from amino acid i to amino acid j is accepted if the following resistance calculation is less than a number sampled randomly from the uniform distribution in the range [0.0, 1.0]:
|
| (2) |
| 3 VALIDATION |
|---|
|
|
|---|
We carried out a series of simulations using EvolSimulator, both to illustrate the range of scenarios that can be modeled, and to illustrate the way in which EvolSimulator results can be examined in light of existing evolutionary theory. In describing the simulations below, we highlight only those parameters of greatest relevance to the simulation, but complete parameter sets (as defined in the EvolSimulator configuration file) are readily available upon request.
3.1 Genomic lineages: speciation and extinction
There are many parameters that can be extracted from phylogenetic trees that represent different evolutionary processes (Mooers and Heard, 1997). Pybus and Harvey (2000) presented the gamma statistic as a tool for testing the equivalence of birth and death rates, while the tree balance statistic of Colless (Colless, 1995; Heard, 1992) expresses the tendency of some lineages to speciate more frequently than others. We generated genomic histories under six combinations of speciation/extinction rate and simulation length. In each case, the hard maximum and soft minimum population sizes were set to 100. Example organismal histories are shown in Supplementary Figure 1a–f, while tree statistics, calculated from the relationships among extant taxa, are summarized in Supplementary Table 1.
The tree statistics obtained indicate that balanced tree topologies are produced by EvolSimulator: values of Colless imbalance statistic range between 0.051 and 0.333, with shorter runs and fewer taxa tending to produce greater imbalance in trees. None of the computed gamma values was sufficiently large to reject the null hypothesis of equal speciation and extinction rates, with the minimum threshold of significance at p = 0.05 equal to –1.65. Shorter runs with lower speciation/extinction rates yield gamma values that are much closer to this threshold. Since these runs often did not reach 100 individuals, the speciation rate would have been much higher than the effective extinction rate, leading to a greater imbalance in these two quantities when evaluated with the gamma statistic. In fact, if the soft minimum on population size is large relative to the speciation/extinction rate and the number of iterations in the simulation, then a regime much closer to a pure-birth model (Yule, 1924) will be achieved. High speciation and extinction rates are reflected by a relatively large number of late-emerging external branches, as illustrated by low ratios of external to internal branch lengths, and low proportions of surviving taxa relative to the total number that have existed in the simulation.
3.2 Gene content evolution: duplications, losses and lateral genetic transfer
To illustrate the evolution of gene content, we first simulated gene gains and gene losses in the absence of LGT. The random number generators were seeded identically in every run, with the maximal rate of change of genome size (sizeAdjFactor) the only parameter changed between runs. The initial duplicability and global, habitat and niche usefulness values of each gene were both assigned random values between 0.0 and 1.0, with no drift over the course of simulation, though new paralogs were assigned random duplicability values. A genome was chosen at random from the set of extant genomes after 1000 iterations, and the size distribution of paralogous gene families within this genome compared across runs (Supplementary Figure 2). When genome size was not allowed to change in the simulation, there was no opportunity for duplication, so each family only had one member, each corresponding to a gene that was present in the cenancestral genome. An extremely high rate of genome size change (up to 100 genes per iteration) produced paralogous families with up to 26 members, but also led to the loss of 72% of the initial ancestral genome, with 0 as the mode of the family size distribution. Intermediate values of genome size drift (1 or 10 genes per iteration) led to tighter distributions of paralogous family sizes, with maximum family sizes of 9 and 11, respectively.
We also examined the correlation between the size of paralogous gene families across the entire set of 98 genomes, and the initial values of duplicability, essentiality and global, habitat and niche usefulness. Regressions of gene family size versus these factors yielded p-values <2.2 x 10–16, with all of the above listed factors making significant contributions (p < 2.2 x 10–16) to the model as well. In contrast, selective pressure, which is also sampled initially from a uniform random distribution but affects purifying selection rather than gene abundance, showed no relationship for any of the three tested values of sizeAdjFactor > 0. R-squared values were highest for lower values of sizeAdjFactor, with R2 = 0.77, 0.71 and 0.54 for sizeAdjFactor = 1, 10 and 100, respectively. The unexplained component of variation in gene family size is largely due to the stochastic nature of the simulation, and the assignment of new duplicability and usefulness values to new paralogs.
An important goal of EvolSimulator is to explore the interaction between LGT and niche occupancy. A brief example of this was constructed by performing three simulations with identical parameter values for speciation and duplication probabilities (10%/generation), maximum population size (50), genome size range (400–600 genes), maximum rate of change (5 genes/10 generations) and rate of attempted LGT (500 events/iteration). Five habitats were simulated, with a small number of genes (
5) necessary for survival in each. A genome living in habitat A would therefore be able to lose genes that were necessary for survival in any or all of habitats B, C, D and E, and in the absence of LGT would forever lose the potential to migrate to any of B–E. The only difference among the three simulations was the choice of criterion for LGT acceptance: in one simulation events were never successful (no LGT), in another they were always successful (the random LGT criterion), while in the third, acceptance was proportional to the number of iterations since divergence of the donor and recipient genomes. We examined the hypothesis that LGT facilitates habitat switching by examining the habitat history of each extant genome after 1000 iterations. A simple criterion of habitat specificity was used, which we defined as the sum of squares of the proportion of the simulation spent in each habitat. A genome that remained in the one habitat for all 1000 iterations would have a habitat specificity of ((1.0)2)1 + ((0.0)2)4 = 1.0, while one that spent 200 iterations in each habitat would have a specificity of ((0.2)2)5 = 1 x 10–7. The trends in Supplementary Figure 3 suggest that increasing LGT does facilitate niche switching, with median habitat specificity lowest when rates of LGT acceptance are high (no LGT, 0.93; divergence-based LGT, 0.87; random LGT, 0.68).
3.3 Sequence substitution: mutational pressures
We tested the mutational parameters of EvolSimulator described in Section 2.4 by simulating the evolution of nucleotide sequences under evolutionary processes that are independent and identically distributed (i.i.d.), reversible, stationary and homogeneous (see e.g. Lio and Goldman, 1998): except for the identical distribution of processes across residues, these conditions are assumed by most currently used phylogenetic methods, with a few notable exceptions (e.g. LogDet: Lockhart et al., 1994). Ignoring the start codon, the initial distribution of nucleotides is determined by the starting amino acid frequencies. For instance, if the starting frequency of tryptophan (W) is 1.0 and all other amino acids 0.0, then the starting frequencies of A, C, G and T would be 0, 0, 2/3 and 1/3 respectively, reflecting the exclusive usage of the codon TGG. In the absence of selection at the amino acid level, the initial frequencies would be expected to converge on the mutational biases of the genome, with a small bias induced by avoidance of the stop codons TAA, TAG and TGA. The expected stationary frequencies of A, C, G and T in the absence of bias are
0.240, 0.262, 0.251 and 0.246 respectively, and the stationary G + C frequency is expected to be 0.514.
In the absence of selection against stop codons, the mutational scheme described above is equivalent to a Jukes–Cantor model, with all substitution rates equal to one another, and identical stationary frequencies for all nucleotides. If stop codons are selected against, then the substitution regime remains homogeneous and reversible, because the unequal stationary distribution of residues, when multiplied by the unequal substitution rates, will be constant for every non-diagonal element in the substitution matrix. We examined the behavior of EvolSimulator in the presence and the absence of selection against stop codons, and under a range of mutation rates, to see if (a) convergence on the expected stationary distribution occurred, and (b) the expected substitution models could be recovered from sequences that diverged after stationarity was reached. Three starting distributions of amino acids were used: one where every starting residue was a lysine (coded by AAA and AAG), one composed exclusively of glycine (coded by GGA, GGC, GGG and GGT), and a third consisting of 50% K and 50% G. Supplementary Figure 4 shows the progression of G + C content, with each simulation represented by a single lineage, chosen randomly from among those that persisted for the entire 1000 iterations of the simulation. All simulations showed a tendency to converge on the appropriate stationary distribution for sequences with or without selection against stop codons (51.3 and 50.0% genomic G + C, respectively), with the rate of convergence proportional to the mutation rate. The simulations that began with equal numbers of G and K codons (not shown) both had an initial G + C content of 50%, with convergence on 51.4% G + C when stop codons were selected against. Also, when nucleotide frequencies converged on the appropriate stationary distribution, the frequencies of the encoded amino acids was equal to that amino acid's frequency in the codon table: for instance, the frequency of leucine residues was equal to 0.0938 (6/64) when all codons were equally fit, and 0.0984 (6/61) when stop codons were selected against.
We then examined the effect of allowing mutational biases to change over time during the simulation process. Cenancestral genomes were constructed with a mean G + C content of 50%, then in separate simulations the G + C bias was allowed to change by a maximum of either 0, 0.01, 0.1, 1, 10 or 100% after every ten iterations of the simulation. Supplementary Figure 5 shows the distribution of mean ORF G + C content across 98 genomes after 1000 iterations of the simulation. The range of genomic G + C compositions is smallest when the rate of drift was lowest (0.01%/10 generations). However, the maximum range of genomic G + C was observed when the maximum rate of drift was 1%/10 generations; larger values of drift tended to yield a more homogeneous population. In these cases, the increased potential for acquisition of radical G + C content bias (due to increased rates of drift) is offset by the high probability that the genome will quickly drift back to a lesser G + C content bias. Since actual ORF G + C content lags behind the genomic bias in a manner proportional to the mutation rate, genomes that undergo radical shifts in G + C content bias will not spend enough iterations at the extremes for their composition to reflect the extreme bias.
Two sister genomes were chosen from the set of 98 to examine the effects of asymmetrical substitution regimes on sequence composition. The Bowker and Stuart tests produce p-values that reflect the extent to which sequence substitutions between the two genomes are symmetric (Ababneh et al., 2006), with low p-values indicating unbalanced reciprocal substitutions between the genomes (e.g. A to G substitutions are not balanced by G to A). Supplementary Figure 6 shows the effect of different rates of G + C bias drift on substitutional asymmetry: when the maximum rate of drift was 0 or 0.01%/10 generations, the median p-values obtained from comparisons of all orthologs between the pair of genomes was below the usual threshold of significance (
= 0.05). Increasing values of drift yield dramatic increases in observed asymmetry, with peaks at 1 and 10%/10 generations. In most cases the median p-value is less for proteins than it is for DNA, which may reflect the greater robustness of amino acids to changes in G + C composition due to interchangeability of residues in the third position of many codons.
3.4 Sequence substitution: selective pressure on amino acids and reference residues
The idea of a reference residue has an important influence on protein evolution. If this feature is included in a simulation, then the evolutionary process will not be Markovian because there will be some memory of previous amino acid states. To illustrate the effects of reference residues, we carried out two parallel simulations involving a single lineage: in one case, the probability of setting the reference residue equal to the current residue was 1.0, yielding a Markov process. In the other, the probability of resetting the reference residue was set to 0.0, such that the original amino acid would always be the originating amino acid from which subsequent substitutions would be judged. Each run simulated the evolution of a single protein with a length of 1 000 000 amino acids, for 1000 generations, with an extremely high mutation rate (0.9) and no preference for purines/pyrimidines or strong/weak residues. Starting residue frequencies were equal to those used to generate the WAG matrix (Whelan and Goldman, 2001). Supplementary Figure 7 shows the final amino acid frequencies from these simulations: in the Markov process, the final distribution of each amino acid was within 1.0% of the frequency of its corresponding codons in the codon table (e.g. Alanine = 4/61). Conversely, the final amino acid frequencies in the non-Markovian simulation deviated in many cases from the underlying codon frequencies, due to the persistent influence of the original amino acids on subsequent substitutions at each site.
| 4 CONCLUSION |
|---|
|
|
|---|
EvolSimulator permits many tailored analyses of gene and genome evolution. We have shown examples of how subsets of evolutionary phenomena can be considered in any particular simulation to address specific questions. For instance, if one is only interested in mutation and sequence substitution, then phenomena such as gene gain, loss and LGT can be inactivated. Replication allows the simulated evolution of sets of genes at different evolutionary rates or under different LGT regimes, while preserving the same organismal history. A common use of simulation techniques is in the validation of phylogenetic methods: while we have not examined this in depth here, Zhaxybayeva et al. (2006) provides a good illustration of this application.
Many of the evolutionary phenomena implemented in EvolSimulator are still the subject of intensive research, and it is probable that some of the simple models here (such as gene family expansion and contraction) will be supplanted by more-sophisticated approximations of these processes. For instance, the basic habitat/niche competition and adaptation model is useful for examining the role of habitat partitioning and selection in the spread of genes via LGT, but does not capture the continuous nature of many habitat parameters such as temperature and nutrient availability. Models where speciation and extinction rates can change independently of one another (Rabosky, 2006) or in response to innovation (Ree, 2005) are examples of ways in which the present speciation/extinction model could be modified. We have not yet attempted to include simulations of insertions and deletions, due to the complexity of generating a distribution for these phenomena (Pang et al., 2005; Stoye et al., 1998), synchronizing this phenomenon with the rate of sequence substitution (Thorne et al., 1991), and determining what specific sequence should be inserted when such a mutation occurs. Finally, Azad and Lawrence (2005) introduced methods to build simulated genomes that reflect biases in oligonucleotide composition such as codon models. Integration of these static models in an evolutionary context is a significant challenge, but would permit meaningful simulation-based comparisons of phylogenetic and parametric or surrogate (Ragan et al., 2006) methods. While these and other phenomena will allow more detailed evolutionary simulations, the current capacity of EvolSimulator to simulate gene- and genome-level processes permits a detailed examination of many key hypotheses in sequence and genome evolution, particularly concerning the role of lateral genetic transfer.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
We thank John Logsdon, W. Ford Doolittle, Mark Ragan, Ingrid Jakobsen and Nicholas Hamilton for helpful discussions, and the Australian Research Council and Genome Atlantic for financial support.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Present address: sanofi pasteur, 1755 Steeles Ave. W., Toronto ON, M2R 3T4 Associate Editor: Keith Crandall
Received on August 10, 2006; revised on November 23, 2006; accepted on January 22, 2007
| REFERENCES |
|---|
|
|
|---|
Ababneh F, et al. Matched-pairs tests of homogeneity with applications to homologous nucleotide sequences. Bioinformatics (2006) 22:1225–1231.
Addario-Berry L, et al. Ancestral maximum likelihood of evolutionary trees is hard. Lect. N. Bioinformatics (2003) 2812:202–215.
Azad RK, Lawrence JG. Use of artificial genomes in assessing methods for atypical gene detection. PLoS Comput. Biol (2005) 1:e56.[CrossRef][Medline]
Beiko RG, et al. Highways of gene sharing in prokaryotes. Proc. Natl. Acad. Sci. USA (2005) 102:14332–14337.
Bentley SD, Parkhill J. Comparative genomic structure of prokaryotes. Annu. Rev. Genet (2004) 38:771–792.[CrossRef][Web of Science][Medline]
Blouin C, et al. Impact of taxon sampling on the estimation of rates of evolution at sites. Mol. Biol. Evol (2005) 22:784–791.
Cambillau C, Claverie JM. Structural and genomic correlates of hyperthermostability. J. Biol. Chem (2000) 275:32383–32386.
Colless DH. Relative symmetry of cladograms and phenograms – an experimental-study. Syst. Biol (1995) 44:102–108.
Doolittle RF, Handy J. Evolutionary anomalies among the aminoacyl-tRNA synthetases. Curr. Opin. Genet. Dev (1998) 8:630–636.[CrossRef][Web of Science][Medline]
Echols N, et al. Comprehensive analysis of amino acid and nucleotide composition in eukaryotic genomes, comparing genes and pseudogenes. Nucleic Acids Res (2002) 30:2515–2523.
Felsenstein J, Churchill GA. A Hidden Markov Model approach to variation among sites in rate of evolution. Mol. Biol. Evol (1996) 13:93–104.[Abstract]
Foerstner KU, et al. Environments shape the nucleotide composition of genomes. EMBO Rep (2005) 6:1208–1213.[CrossRef][Web of Science][Medline]
Futuyma DJ, Moreno G. The evolution of ecological specialization. Annu. Rev. Ecol. Syst (1988) 19:207–233.[CrossRef][Web of Science]
Gogarten JP, Townsend JP. Horizontal gene transfer, genome innovation and evolution. Nat. Rev. Microbiol (2005) 3:679–687.[CrossRef][Web of Science][Medline]
Gogarten JP, et al. Prokaryotic evolution in light of gene transfer. Mol. Biol. Evol (2002) 19:2226–2238.
Gould SJ, et al. The shape of evolution: a comparison of real and random clades. Paleobiology (1977) 3:23–40.[Abstract]
Heard SB. Patterns in tree balance among cladistic, phenetic, and randomly generated phylogenetic trees. Evolution (1992) 46:1818–1826.[CrossRef][Web of Science]
Hendrickson H, Lawrence JG. Selection for chromosome architecture in bacteria. J. Mol. Evol (2006) 62:615–629.[CrossRef][Web of Science][Medline]
Hickey DA, Singer GA. Genomic and proteomic adaptations to growth at high temperature. Genome Biol (2004) 5:117.[CrossRef][Medline]
Itoh T, et al. Acceleration of genomic evolution caused by enhanced mutation rate in endocellular symbionts. Proc. Natl. Acad. Sci. USA (2002) 99:12944–12948.
Jain R, et al. Horizontal gene transfer among genomes: the complexity hypothesis. Proc. Natl. Acad. Sci. USA (1999) 96:3801–3806.
Jain R, et al. Horizontal gene transfer accelerates genome innovation and evolution. Mol. Biol. Evol (2003) 20:1598–1602.
Jermiin LS, et al. Hetero: a program to simulate the evolution of DNA on a four-taxon tree. Appl. Bioinformatics (2003) 2:159–163.[Medline]
Jordan IK, et al. Lineage-specific gene expansions in bacterial and archaeal genomes. Genome. Res (2001) 11:555–565.
Karev GP, et al. Birth and death of protein domains: a simple model of evolution explains power law behavior. BMC Evol. Biol (2002) 2:18.[CrossRef][Medline]
Karev GP, et al. Simple stochastic birth and death models of genome evolution: was there enough time for us to evolve? Bioinformatics (2003) 19:1889–1900.
Karev GP, et al. Gene family evolution: an in-depth theoretical and simulation analysis of non-linear birth-death-innovation models. BMC Evol. Biol (2004) 4:32.[CrossRef][Medline]
Kassen R. The experimental evolution of specialists, generalists, and the maintenance of diversity. J. Evol. Biol (2002) 15:173–190.[CrossRef][Web of Science]
Kendall DG. On the generalized birth-and-death process. Ann. Math. Stat (1948) 19:1–15.[CrossRef]
Kunin V, Ouzounis CA. The balance of driving forces during genome evolution in prokaryotes. Genome. Res (2003) 13:1589–1594.
Lawrence JG, Ochman H. Amelioration of bacterial genomes: rates of change and exchange. J. Mol. Evol (1997) 44:383–397.[CrossRef][Web of Science][Medline]
Lawrence JG, Ochman H. Molecular archaeology of the Escherichia coli genome. Proc. Natl. Acad. Sci. USA (1998) 95:9413–9417.
Lawrence JG, Roth JR. Genomic flux: genome evolution by gene loss and acquisition. In: Organization of the Prokaryotic Genome.—Charlebois RL, ed. (1999) Washington, D.C: ASM Press.
Lio P, Goldman N. Models of molecular evolution and phylogeny. Genome. Res (1998) 8:1233–1244.
Lockhart PJ, et al. Recovering evolutionary trees under a more realistic model of sequence evolution. Mol. Biol. Evol (1994) 11:605–612.[Web of Science]
Merkl R. A survey of codon and amino acid frequency bias in microbial genomes focusing on translational efficiency. J. Mol. Evol (2003) 57:453–466.[CrossRef][Web of Science][Medline]
Milkman R, Bridges MM. Molecular evolution of the Escherichia coli chromosome. III. Clonal frames. Genetics (1990) 126:505–517.[Abstract]
Mooers AO, Heard SB. Evolutionary process from phylogenetic tree shape. Q. Rev. Biol (1997) 72:31–54.[CrossRef]
Moran NA. Accelerated evolution and Muller's rachet in endosymbiotic bacteria. Proc. Natl. Acad. Sci. USA (1996) 93:2873–2878.
Nakamura Y, et al. Biased biological functions of horizontally transferred genes in prokaryotic genomes. Nat. Genet (2004) 36:760–766.[CrossRef][Web of Science][Medline]
Pang A, et al. SIMPROT: using an empirically determined indel distribution in simulations of protein evolution. BMC Bioinformatics (2005) 6:236.[CrossRef][Medline]
Pybus OG, Harvey PH. Testing macro-evolutionary models using incomplete molecular phylogenies. Proc. Biol. Sci (2000) 267:2267–2272.
Rabosky DL. Likelihood methods for detecting temporal shifts in diversification rates. Evolution (2006) 60:1152–1164.[CrossRef][Web of Science][Medline]
Ragan MA. Detection of lateral gene transfer among microbial genomes. Curr. Opin. Genet. Dev (2001) 11:620–626.[CrossRef][Web of Science][Medline]
Ragan MA, et al. Do different surrogate methods detect lateral genetic transfer events of different relative ages? Trends Microbiol (2006) 14:4–8.[CrossRef][Web of Science][Medline]
Rambaut A, Grassly NC. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput. Appl. Biosci (1997) 13:235–238.
Rambaut A, et al. Bi-De: an application for simulating phylogenetic processes. Comput. Appl. Biosci (1996) 12:469–471.
Raup DM, et al. Stochastic models of phylogeny and the evolution of diversity. J. Geol (1973) 81:525–542.[Web of Science]
Ree RH. Detecting the historical signature of key innovations using stochastic models of character evolution and cladogenesis. Evolution (2005) 59:257–265.[CrossRef][Web of Science][Medline]
Rispe C, et al. Mutational and selective pressures on codon and amino acid usage in Buchnera, endosymbiotic bacteria of aphids. Genome. Res (2004) 14:44–53.
Sharp PM, et al. Variation in the strength of selected codon usage bias among bacteria. Nucleic Acids Res (2005) 33:1141–1153.
Singer GA, Hickey DA. Nucleotide bias causes a genomewide bias in the amino acid composition of proteins. Mol. Biol. Evol (2000) 17:1581–1588.
Sinha N, Nussinov R. Point mutations and sequence variability in proteins: redistributions of preexisting populations. Proc. Natl. Acad. Sci. USA (2001) 98:3139–3144.
Stiller JW, Hall BD. Long-branch attraction and the rDNA model of early eukaryotic evolution. Mol. Biol. Evol (1999) 16:1270–1279.[Web of Science][Medline]
Stoye J, et al. Rose: generating sequence families. Bioinformatics (1998) 14:157–163.
Sullivan J, Swofford DL. Are guinea pigs rodents? the importance of adequate models in molecular phylogenetics. J. Mammal. Evol (1997) 4:77–86.[CrossRef]
Tavaré S. Some probabilistic and statistical problems on the analysis of DNA sequences. In: Lectures in Mathematics in the Life Sciences. (1986) New York City, USA: American Mathematical Society. 57–86.
Thorne JL, et al. An evolutionary model for maximum likelihood alignment of DNA sequences. J. Mol. Evol (1991) 33:114–124.[CrossRef][Web of Science][Medline]
Wang HC, et al. Analysis of codon usage patterns of bacterial genomes using the self-organizing map. Mol. Biol. Evol (2001) 18:792–800.
Whelan S, Goldman N. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol. Biol. Evol (2001) 18:691–699.
Yampolsky LY, Stoltzfus A. The exchangeability of amino acids in proteins. Genetics (2005) 170:1459–1472.
Yang Z. Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol (1993) 10:1396–1401.[Abstract]
Yang Z. A space-time process model for the evolution of DNA sequences. Genetics (1995) 139:993–1005.[Abstract]
Yang Z. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci (1997) 13:555–556.
Yang Z, et al. Comparison of models for nucleotide substitution used in maximum-likelihood phylogenetic estimation. Mol. Biol. Evol (1994) 11:316–324.[Abstract]
Yule GU. A mathematical theory of evolution, based on the conclusions of Dr J.C. Willis. Phil. Trans. R. Soc. Lond. B (1924) 213:21–87.
Zhang J. Evolution by gene duplication: an update. Trends. Ecol. Evol (2003) 18:292–298.[CrossRef]
Zhaxybayeva O, et al. Phylogenetic analyses of cyanobacterial genomes: quantification of horizontal gene transfer events. Genome Res (2006) 16:1099–1108.
This article has been cited by other articles:
![]() |
R. G. Beiko, W. F. Doolittle, and R. L. Charlebois The Impact of Reticulate Evolution on Genome Phylogeny Syst Biol, December 1, 2008; 57(6): 844 - 856. [Abstract] [Full Text] [PDF] |
||||
![]() |
T.-Y. Wong, S. Fernandes, N. Sankhon, P. P. Leong, J. Kuo, and J.-K. Liu Role of Premature Stop Codons in Bacterial Evolution J. Bacteriol., October 15, 2008; 190(20): 6718 - 6725. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

