Polyploids, genome halving and phylogeny
1Department of Mathematics and Statistics, 2Department of Biology and 3Department of Biochemistry, University of Ottawa, Ottawa K1N 6N5, Canada
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Autopolyploidization and allopolyploidization events multiply the number of chromosomes and genomic content. Genome rearrangement phylogenetics requires that all genomes analyzed have the same set of orthologs, so that it is not possible to include diploid and polyploid genomes in the same phylogeny.
Results: We propose a framework for solving this difficulty by integrating the rearrangement median and genome halving algorithms. Though the framework is general, some problems remain open. We implement a heuristic solution to the prototypical case of a tree with one tetraploid and two diploid genomes, and apply it to study the evolution of cereals and of yeast.
Contact: sankoff{at}uottawa.ca
| 1 INTRODUCTION |
|---|
|
|
|---|
Phylogenomics based on cross-species comparisons of synteny block order (henceforward rearrangement phylogenetics) provides an approach to phylogenetics independent of that based on nucleotide or amino acid sequence divergence. The order-based approach takes advantage of the periodic and cumulative rearrangement of genomic material by evolutionary processes, such as inversion, reciprocal translocation and transposition. The basic methods require that the genomic content be roughly the same in all the organisms being compared, so that every chromosomal segment in one genome be identified with a single orthologous counterpart in each of the others, though adjustments can be made for a limited amount of deletion, insertion and duplication of segments.
Many genomes have been shown to result from an ancestral doubling, or tetraploidization, event, after which meiosis is characterized not by the normal pairings of one maternal and one paternal chromosome, but by quadrivalent alignment of chromosomes or other combinations. Tetraploidization is followed by a period of re-diploidization, where distinct pairings again emerge, though in twice the original number, a process mediated by sequence divergence and by genome rearrangement through intra- and interchromosomal movement of genetic material. The present-day genome (often still referred to loosely as a tetraploid) can be decomposed into a set of duplicated synteny blocks dispersed among the chromosomes. There is usually no obvious way of partitioning the blocks into two sets according to which ones were together in the original tetraploid.
Rearrangement phylogeny algorithms are not applicable since there is a two-to-one relationship between blocks in the former tetraploid and those in related diploid species, whereas these algorithms require a one-to-one correspondence.
Tetraploidization may also occur as a fusion of two distinct but related genomes (allotetraploidy) instead of the doubling of a genome (autotetraploidy), and both types of polyploidization may recur during evolution, so that instead of a 2n diploid number, the descendant (polyploid) genome will have 2rn, where
.1 These genomes will be constituted not by duplicated blocks, but by a set of blocks with r homologous copies each, dispersed among the chromosomes.
In this article, we provide an overall strategy for rearrangement phylogeny for sets of related genomes that include some that have undergone polyploidization, including allopolyploidization. We specifically attack the small phylogenetic problem, i.e. identifying the ancestral genomes for a given phylogeny that jointly minimize the sum of the rearrangement distances along the branches of that phylogeny. To take into account allopolyploidy, the phylogeny must be reticulated.
In Section 2, we outline a model for generating an arbitrary pattern of polyploidy observed at the tips of a reticulate phylogeny. Based on this model, we then present an algorithm for inferring the ploidy of the ancestral genomes in terms of an economical set of autopolyploidization and allopolyploidization events along the edges of the phylogeny graph. Once we have the ancestral ploidies, we can approach the actual rearrangement problem. We identify three kinds of component of this problem, one a calculation of the genomic distance between two given genomes with clearly identified orthologs, i.e. the minimum number of rearrangements necessary to transform one genome into another; the second a de-ploidization calculation for inferring the genome of an ancestral polyploid based on internal evidence from its modern descendant only and the third a medianizing process for inferring an ancestral genome from its three neighboring genomes in a binary branching tree. In Section 3, we show how to integrate algorithms for the three components into an overall procedure for inferring the ancestral polyploids in a given phylogeny, and we describe in particular detail the prototypical case of one tetraploid and two related diploids. In Sections 4 and 5, we apply our method to a small data set on maize and a large data set on yeast, respectively.
| 2 MODEL, INFERENCE AND DECOMPOSITIONS |
|---|
|
|
|---|
The simplified assumptions we will adopt in this abstract are that polyploidization occurs either by tetraploidization of a genome, namely replacing each of its chromosomes by two identical chromosomes, so the diploid number goes from 2n to 4n, or by the fusion of two different genomes of diploid numbers 2n and 2m, respectively, merging the two sets of chromosomes, and producing a
We will assume the evolutionary histories to be binary branching trees, with allopolyploidy events represented by horizontal reticulations between branches of the tree, as illustrated in Figure 1. The model imposes the equations in the illustration: each autopolyploid must have ploidy equal to a non-negative exponent of 2, times the ploidy of its immediate ancestor. Each allopolyploid must have ploidy equal to the sum of its contributing genomes. The allopolyploidy events are given, though not the ploidy of the participating genomes, which must be inferred, and the autopolyploidy events are to be inferred.
|
This model is simplified and cannot account for all possible observations of even-numbered ploidies at the leaves of the phylogeny; a full model of polyploidy in phylogenetic context would allow for events such as the fusions of a polyploid with an earlier diploid version of itself. Such a model, worked out in the full version of this article, can account for all possible observations of even-numbered ploidies at the leaves of the phylogeny, but can also give rise to a great multiplicity of solutions.
Because our restricted version of this problem here does not generate all possible combinations of observations at the leaves of the tree, the solution to the ancestral ploidy assignment problem does not always exist for an arbitrary data set of present-day ploidies. When it does exist, it can be obtained by solving a system of equations such as that in Figure 1, with the objective of minimizing the sum of the exponents in the autopolyploidization equations. Generally, the ploidy of the root is as high as possible, consistent with a minimum of autopolyploidization events along all the branches.
Once we have inferred the ploidy of the ancestral genomes, how are we to approach our original problem: to reconstruct the synteny block order of the ancestral genomes and thus infer the cost of the phylogeny in terms of rearrangement events? Elements of the solution are discussed in Section 3.1.1 below. The first point to stress is that the rearrangement distance can only be directly calculated between two genomes that have a common polyploidization history. Thus, we can calculate the rearrangement distance between the genomes labeled a and b in Figure 1, but not between a and c. What is required is to take account of the inferred transition from diploid to tetraploid, the autopolyploidization event h, on the path between q and k. We add the distance between the tetraploids at h and c to the distance between the diploids at h and a. To be able to do this, we first find the synteny block order at h using the genome halving algorithm.
We may further ask, even if we can calculate h, how can we know the synteny block order for an ancestor like that labeled m in Figure 1? This requires a median algorithm. Other questions to be answered before all kinds of ancestral genomes can be inferred, and the total branch length of the phylogeny evaluated, are listed in Section 3.1.4.
| 3 THE ALGORITHMS |
|---|
|
|
|---|
In this section, we discuss a local search heuristic for the solution to a prototypical phylogeny problem involving one genome descended from a tetraploid and two related diploids. The main focus of this work is to produce an accurate initialization. It is based on integrating three existing algorithms, which we can only cite in this abstract.
3.1 Existing and missing resources
3.1.1 Genomic distance
Distance based on genomic structure d(X, Y) is calculated by linear-time rearrangement algorithms for finding the minimum number of operations necessary to convert one genome X into another Y. Each genome is composed of a (possibly different) number of chromosomes containing linearly ordered terms. Comparison of the two genomes induces a decomposition of each into a set of synteny blocks. The set of blocks is the same for each genome, but it is differently partitioned among the chromosomes, differently ordered within the chromosomes, and the left-right orientation of a block may also differ in the two genomes.
The biologically motivated rearrangement operations we consider include inversions (implying as well change of orientation) of chromosomal segments containing one or more blocks, reciprocal translocations (of telomere-containing segments—suffixes or prefixes—of two chromosomes) and chromosome fission or fusion. Here we make use of a versatile rearrangement algorithm recently introduced by Bergeron et al. (2006), which we constrain to allow only the operations we have listed.
3.1.2 Genome halving
Given a genome T that can be decomposed into a set of synteny blocks, each of which appears twice on the genome, on the same or on different chromosomes, how can we construct a genome A containing only one copy of each block, and such that the genome
consisting of two copies of each chromosome in A minimizes
? Here we use the linear-time algorithm for solving this problem due to El-Mabrouk and Sankoff (2003).
3.1.3 Rearrangements median
Given three genomes X,Y and Z, how can we find the median genome M such that
is minimized. For this NP-hard problem, we implement a heuristic using the principles of the [Bourque and Pevzner, 2002] MGR (multiple genome rearrangement) algorithm, but based on the constrained version of the Bergeron et al. (2006) distance algorithm.
3.1.4 Open questions
To fully solve the inference problem as stated, even within the limitations imposed by the heuristic implementation of the median problem and the heuristic steps in the main algorithm in Section 3.2 below, we would have to generalize the genome halving problem in several directions:
- Given two tetraploid (or
-ploid) genomes X and Y, i.e. with two (or
) copies each of every syntenic block, find the matching of each pair (or set of
) paralogs between the two genomes that minimizes the rearrangement distance.
- Given a genome P with ploidy
, find the 2r-ploid and 2s-ploid genomes R and S, respectively, such that the distance
is minimized.
- Given a genome Q with ploidy
, find the
-ploid A such that the distance
is minimized.
3.2 Strategy for the problem of one tetraploid and two diploids
Let T be a genome with diploid number 4n, i.e. 2n pairs of (identically ordered) maternal and paternal chromosomes, and 2m syntenic blocks,
, dispersed in any order on the 2n different chromosomes. For each i, we call
and
duplicates, and the subscript 1 or 2 is assigned arbitrarily. A potential ancestral tetraploid of T is written
, and consists of
chromosomes, where some half (
) of the chromosomes contains exactly one of each of
or
for each
. The remaining
chromosomes are each identical to one in the first half, in that where
appears on a chromosome in the first half,
appears on the corresponding chromosome in the second half, and vice versa. We define A to be either of the two halves of
, where the subscript 1 or 2 is suppressed from each
or
. These
chromosomes, and the m syntenic blocks they contain,
, constitute a potential ancestral diploid of T.
A solution of the genome halving problem for T is any A such that
is minimal. There are generally many different solutions to this problem.
Consider an unrooted tree with three leaves, T and two diploid genomes R1 and R2 with blocks orthologous to
, as in Figure 2a. Our central problem is to find a diploid genome A and a median genome M of
and R2 that minimize
|
| (1) |
|
There is no requirement that A be a solution to the genome halving problem, but since they already minimize one term of D, some of these solutions might be good initial guesses for an optimal A. Let S be the set of solutions of the genome halving algorithm for T. Initially in our heuristic, schematized in Figure 2b, we confine our search to
For each solution
, we calculate the median distance
, as in Figure 2c. This is the bottleneck in our procedure, since
may be very large, and an accurate calculation of the median is costly for each element of
. When the number of markers m is small, say a few dozen, as to be illustrated in Section 4 below, it is possible to do evaluation of
. When m is in the hundreds, as to be illustrated in Section 5 below, we resort to a random sample of the genomes in
.
|
| (2) |
|
| (3) |
|
| (4) |
| 4 A SMALL DATA SET ON MAIZE |
|---|
|
|
|---|
It is generally agreed that the maize (Zea mays) genome underwent a genome doubling event some 11–16 million years ago (Gaut and Doebley, 1997). While some duplicated regions clearly attest to this event, there is no consensus on the exact inventory of such regions. Here we apply our method to infer the ancestor of the maize genome, with the rice (Oryza sativa) and sorghum (Sorghum bicolor) genomes as the two related diploids. For this purpose, we are concerned only with duplicated blocks in maize, and their single-copy counterparts in rice and sorghum, as extracted from the Gramene database (Jaiswal et al., 2006), and not the remainder of each of the genomes.
In a previous study (Zheng et al., 2006), we used Gramene to identify 34 syntenic blocks with two copies in maize and one copy each in sorghum and rice, though the partial nature of the maize genome sequence and the relative absence of sorghum sequence means that this genetic marker-based construction must be considered preliminary.
The genome halving algorithm usually involves some arbitrary choices in constructing the optimal ancestral tetraploid. In the case of the maize genome, this leads to more than 1 500 000 distinct solutions in
. The original data set not being very large (34 blocks in two genomes, 68 in maize), this exemplifies the extreme lack of uniqueness in the results of genome halving.
When we bring the diploid genomes to bear using Equation (2), however, testing all 1 500 000 elements of
, the set
contains only nine solutions. Thus there is a massive reduction of non-uniqueness induced by carrying out de-ploidization in phylogenetic context.
Searching for A and M(A) along a trajectory from
towards the median using the criterion in inequality (4) led directly to the solution in Figure 3, which is not improved by local searching. Other trajectories from
towards the median gave three other solutions, with almost identical component distances. And other search methods (along trajectories to R1 or R2) provided a fifth solution, at a much greater distance,
, from T.
|
For the schema in Figure 3, the given and inferred genomes, with synteny blocks evident, are depicted in Figure 4.
|
| 5 TETRAPLOIDIZATION OF YEAST |
|---|
|
|
|---|
Wolfe and Shields (1997) convincingly demonstrated an ancient tetraploidization of Saccharomyces cerevisiae a decade ago. Recently, the post-tetraploidization evolution of S.cerevisiae has been studied by comparison to the diploid genomes of Ashbya gossypii (Dietrich et al., 2004) and of Kluyveromyces waltii (Kellis et al., 2004), though without recourse to genome rearrangement or genome halving algorithms.
Each of these studies located a set of synteny blocks on the diploid genome, each block homologous to a pair of duplicate synteny blocks on the S.cerevisiae genome. These blocks were explicitly listed in the case of K.waltii, for which we could confirm 239 blocks, but only portrayed diagrammatically in the case of A.gossypii. We developed a protocol to tabulate the A.gossypii blocks based on this visual information, and obtained 409 blocks.
We then established a second protocol to align the blocks on S.cerevisiae corresponding to K.waltii blocks and those corresponding to A.gossypii blocks, sometimes dividing a long block from one diploid into shorter blocks corresponding to the other, and allowing
extra ORFs on a block without throwing a correspondence into doubt. This protocol produced 263 blocks in both K.waltii and A. gossypii, each corresponding to a pair of duplicate blocks in S.cerevisiae.
Applying our method to this large data set produced the solutions in Figure 5. Because the time required for the median heuristic increases drastically with m, where we could handle 1.5
runs with m = 34 in the case of maize, we could only sample 2506 elements from
with m = 263, and found an
with only one element. To compensate for the sketchy coverage of
, we also examined several solutions of the genome halving algorithm where D was slightly suboptimal. Furthermore, we used the criterion in inequality (3) instead of the computationally more costly inequality (4) to locate A. Of interest is that one of the solutions has
, though this was not one of the sampled genomes, but was found in the trajectory between a suboptimal solution B and M(B).
|
How different are these two solutions, summarized in Figure 5? If we calculate the rearrangement distance between them and compare it with randomly chosen pairs of genomes in
|
|
| 6 CONCLUSION |
|---|
|
|
|---|
Among orthology assignment problems, the case of tetraploidy (and autopolyploidy in general) is rather unique in that DNA sequence information cannot help in partitioning the duplicate blocks into two sets, one from one copy of the original diploid, and the other set from the identical second copy, precisely because they were identical. This is not always the case with allopolyploidy since paralogs coming from one contributing polyploid would be more similar in DNA sequence amongst themselves than to paralogs from the other contributing polyploid. Thus our methodology could be made more precise in such cases by incorporating DNA sequence evidence insofar as allopolyploidy is concerned, but not autopolyploidy.
As mentioned in Section 3, there are many open problems to be solved before a general solution, even a heuristic one, is feasible for our simple model of polyploidy. And there are many more problems for a general model allowing for autopolyploidy by means other than tetraploidization.
Algorithmically, a difficult problem would be to replace our sequential procedure by a single algorithm searching for the pair
that minimizes
. This would be a hard problem, given that the median problem itself is NP-hard. Modifying the halving algorithm so that it could take account of both R1 and R2, while retaining optimality of the ancestral diploid, might be a good strategy for avoiding the construction of the entire set
, but would not mitigate the complexity of the median step.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
Research supported in part by a grant to D.S. from the Natural Sciences and Engineering Research Council of Canada (NSERC). D.S. holds the Canada Research Chair in Mathematical Genomics and is a Fellow of the Evolutionary Biology Program of the Canadian Institute for Advanced Research.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
1 Genomes with odd ploidy are generally deemed to be infertile because of the impossibility of segregating into haploids containing equal numbers of chromosomes during meiosis.
| REFERENCES |
|---|
|
|
|---|
Bergeron A, et al. A unifying view of genome rearrangements. Algorithms in Bioinformatics. Proceedings of WABI 2006. Lect. Notes Comput. Sci, —Bücher P, Moret BME, eds. ( (2006) ) 4175, : 163–173.[CrossRef].
Bourque G, Pevzner P. Genome-scale evolution: reconstructing gene orders in the ancestral species. Genome Res, ( (2002) ) 12, : 26–36.
Dietrich FS, et al. The Ashbya gossypii genome as a tool for mapping the ancient Saccharomyces cerevisiae genome. Science, ( (2004) ) 304, : 304–307.
El-Mabrouk N, Sankoff D. The reconstruction of doubled genomes. SIAM J. Comput, ( (2003) ) 32, : 754–792.[CrossRef].
Gaut BS, Doebley JF. DNA sequence evidence for the segmental allotetraploid origin of maize. Proc. Natl Acad. Sci. USA, ( (1997) ) 94, : 6809–6814.
Jaiswal P, et al. Gramene: a bird's eye view of cereal genomes. Nucleic Acids Res, ( (2006) ) 34, : D717–D723. URL: http://www.gramene.org.
Kellis M, et al. Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature, ( (2004) ) 428, : 617–624.[CrossRef][Medline].
Wolfe KH, Shields DC. Molecular evidence for an ancient duplication of the entire yeast genome. Nature, ( (1997) ) 387, : 708–713.[CrossRef][Medline].
Zheng C, et al. Genome halving with an outgroup. Evol. Bioinformatics, ( (2006) ) 2, : 319–326..
This article has been cited by other articles:
![]() |
C. Zheng, Q. Zhu, Z. Adam, and D. Sankoff Guided genome halving: hardness, heuristics and the history of the Hemiascomycetes Bioinformatics, July 1, 2008; 24(13): i96 - i104. [Abstract] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||







