Bioinformatics Advance Access originally published online on October 28, 2004
Bioinformatics 2005 21(7):869-879; doi:10.1093/bioinformatics/bti107
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Theoretical and practical advances in genome halving
Department of Computer Science, Duke University Box 90129, Durham, NC 27708-0129, USA
*To whom correspondence should be addressed.
| Abstract |
|---|
|
|
|---|
Motivation: Duplication of an organism's entire genome is a rare but spectacular event, enabling the rapid emergence of multiple new gene functions. Over time, the parallel linkage of duplicated genes across chromosomes may be disrupted by reciprocal translocations, while the intra-chromosomal order of genes may be shuffled by inversions and transpositions. Some duplicate genes may evolve unrecognizably or be deleted. As a consequence, the only detectable signature of an ancient duplication event in a modern genome may be the presence of various chromosomal segments containing parallel paralogous genes, with each segment appearing exactly twice in the genome. The problem of reconstructing the linkage structure of an ancestral genome before duplication is known as genome halving with unordered chromosomes.
Results: In this paper, we derive a new upper bound on the genome halving distance that is tighter than the best known, and a new lower bound that is almost always tighter than the best known. We also define the notion of genome halving diameter, and obtain both upper and lower bounds for it. Our tighter bounds on genome halving distance yield a new algorithm for reconstructing an ancestral duplicated genome. We create a software package GenomeHalving based on this new algorithm and test it on the yeast genome, identifying a sequence of translocations for halving the yeast genome that is shorter than previously conjectured possible.
Availability: GenomeHalving is available upon email request.
Contact: py{at}cs.duke.edu; amink{at}cs.duke.edu
| 1 INTRODUCTION |
|---|
|
|
|---|
1.1 Biological motivation
In the course of evolution, gene duplications are extremely significant events, enabling the emergence of new gene functions (Ohno, 1970). The presence of one copy of each gene is normally sufficient for the survival of the species, allowing other (redundant) copies to evolve with less selective pressure. Beyond the duplication or multiplication of individual genes, it is possible for the entire genome of a species to be duplicated in a process known as tetraploidization. Although tetraploidization is normally lethal, in rare cases a tetraploid can become a stabilized diploid with two sets of identical chromosomes. The functionalities of the genes in one set are usually preserved, while the genes in the other set are now free to evolve into novel functional units, presenting the species with a tremendous opportunity for new evolutionary possibilities. A potentially more important consequence of whole-genome duplication is the combinatorial number of possibilities for the co-evolution of a group of genes in concert (Fryxell, 1996).
Evidence supporting the occurrence of whole-genome duplication has been adduced in numerous plant genomes (Ahn and Tanksley, 1993; Gaut and Doebley, 1997; Moore et al., 1995; Scheffler et al., 1997; Shoemaker et al., 1996; Paterson et al., 1996), as well as in vertebrate genomes (Nadeau, 1991; Lundin, 1993; Gibson and Spring, 2000; Gu et al., 2002; McLysaght et al., 2002). A particularly convincing example of whole-genome duplication is found in the yeast genome. Wolfe and Shields (1997) provided early strong evidence that the genome of Saccharomyces cerevisiae is the product of an ancient tetraploidization, which has been further supported by subsequent studies (Vision and Brown, 2000; Seoighe et al., 2000; Langkjær et al., 2003; Dietrich et al., 2004; Kellis et al., 2004). However, we note that there exist alternative views on whole-genome duplication in yeast (Mewes et al., 1997; Coissac et al., 1997; Llorente et al., 2000a, b) and that it remains a somewhat controversial issue. Evidence also exists to suggest the flowering plant Arabidopsis may have undergone whole-genome duplication, but this is not conclusive (Ku et al., 2000; Paterson et al., 2000; Lynch and Conery, 2000). For surveys on whole-genome duplication, see Wolfe (2001) and Durand (2003).
During the course of evolution subsequent to genome duplication, the parallel linkage of genes across chromosomes may be disrupted by reciprocal translocations, while the intra-chromosomal order of genes may be modified by inversions and transpositions. Some duplicate genes may evolve unrecognizably or be deleted. As a consequence, sometimes the only extant evidence of an ancient duplication in a modern genome is the presence of various duplicate chromosomal segments containing parallel paralogous genes dispersed throughout the genome.
The genome halving problem is to construct a (minimal) sequence of operationstranslocations, inversions or transpositionsthat transform an ancestral genome immediately after a genome duplication event into a modern genome; or conversely but equivalently, a minimal sequence of operations that transform a modern genome G into an ancestral duplicated genome G'. In the latter interpretation of the problem, the modern genome G is said to be halved by these transformations, since G' consists of two identical copies of each chromosome, representing the ancestral genome immediately after duplication.
El-Mabrouk et al. (1998) propose two formulations of the genome halving problem. The problem of genome halving with ordered chromosomes considers a chromosome as an ordered sequence of gene blocks, and aims to construct a sequence of operations that transform an ancient duplicated genome to that of a modern species via translocations and intra-chromosomal operations, like inversions and transpositions. Seoighe and Wolfe (1998) study this problem using a computer simulation and a heuristic analytical method. El-Mabrouk et al. propose an exact algorithm to solve this problem El-Mabrouk et al., 1999; El-Mabrouk, 2000; El-Mabrouk and Sankoff, 2002).
We are here interested in the related problem of genome halving with unordered chromosomes, which considers a chromosome as an unordered collection of gene blocks, and aims to construct a sequence of only translocations that transform the synteny or linkage structure of the genome of an ancestral duplicated genome to that of a modern species. Both the ordered and unordered problems can provide insight to understanding the possible evolutionary path that leads from the ancestral duplicated genome to that of a modern species. However, one aspect of the comparative importance of the unordered version of the genome halving problem resides in the possibility that intra-chromosomal operations, such as inversions and transpositions, alter the order of gene blocks within a chromosome repeatedly between translocations, as suggested by El-Mabrouk et al. (1998). In such a context, the intra-chromosomal order of the gene blocks and the intra-chromosomal operations are of only marginal significance in exploring the possible optimal sequence of translocation events that transform an ancient genome to its current state; as a result, the unordered formulation of the problem as discussed in this paper is of greater relevance. Furthermore, a potential practical constraint on the application of genome halving with ordered chromosomes in some cases might be the unavailability of data on the intra-chromosomal order of gene blocks for a species.
El-Mabrouk et al. (1998) provide both upper and lower bounds for the problem of genome halving with unordered chromosomes, and give a heuristic algorithm for computing the ancestral genome. We improve both of their bounds, and then design and implement an algorithm for reconstructing an ancestral duplicated genome. We create a software package GenomeHalving and apply it to the yeast genome to obtain a shorter halving path than was previously conjectured possible. In addition, we define the notion of genome halving diameter, and offer an upper bound and a lower bound that almost always match for genomes with a realistic number of chromosomes.
1.2 Definitions and notation
For the remainder of the paper, we refer to the problem of genome halving with unordered chromosomes as simply genome halving, for brevity. In this formulation of the problem, a genome G is a set of chromosomes and a chromosome Si is a collection of gene blocks, or blocks. Since we are interested in studying the translocation history of the ancient duplicated genome, we can ignore gene blocks that occur only once in the genome (due to subsequent gene deletion or mutation) because they contribute no useful information in reconstructing the translocation history of the genome. Thus, we restrict our attention to gene blocks that appear exactly twice in the genome. If a gene block happens to appear twice in the same chromosome, it is called a 2-block. A genome G = {S1, S2, ..., Sn} can be represented by an equivalent intersection graph as follows (El-Mabrouk et al., 1998). Create a vertex vi for each chromosome Si; connect vi and vj with an undirected edge e(vi,vj) if and only if Si
Sj
and i
j; connect vi to itself with a loop if and only if Si contains a 2-block. A vertex with (without) a loop to itself is called a loop-vertex (non-loop-vertex). Note that a loop-vertex is adjacent to itself. Denote by h(v) the number of vertices adjacent to v, including v itself. A vertex v with h(v) = k will sometimes be called a k-vertex. A pair of adjacent (non-loop) 1-vertices are referred to as a perfectly matched vertex pair. A graph consisting of only perfectly matched vertex pairs is called a perfect matching graph. A duplicated genome with two identical sets of chromosomes corresponds to a perfect matching graph. To simplify notation, we use the symbol G interchangeably to denote both a genome and the intersection graph derived from that genome.
The basic operation allowed in the genome halving problem is translocation, or the exchange of gene blocks between two chromosomes. We represent a translocation between vi and vj with the quadruplet
= (vi, vj, Bi, Bj), where Bi
Si and Bj
Sj, indicating the movement of block set Bi from Si to Sj and of block set Bj from Sj to Si. In the above formulation, neither fission nor fusion of chromosomes are allowed: Si
; Sj
; when Bi =
, we require that Bj
Sj; when Bj =
, we require that Bi
Si. Sometimes, we omit Bi and Bj and just write
(vi, vj). After the translocation
(vi,vj), vertex vi is denoted by
vi and vertex vj is denoted by
vj. A vertex vk that is adjacent to vi or vj before
(vi, vj) must be adjacent to
vi or
vj, provided i
k and j
k. If vi is adjacent to vj before
(vi, vj), either or both of
vi and
vj may be loop-vertices. To make these notions more concrete, Figure 1 shows an example of a sequence of translocations that transform a particular genome into an ancestral genome immediately after duplication.
|
1.3 Problem definition
The genome halving problem requires finding the minimum number d(G) of translocations that are sufficient to transform a given genome G into an ancestral duplicated genome G' containing two identical sets of chromosomes. We call d(G) the genome halving distance of G. Let |G| be the size of genome G. Since |G'| is even and |G| = | G'|, we require that |G| be even.
We define the genome halving diameter for genomes of size n, D(n), as the maximum value of the genome halving distance for any genome with n chromosomes:
![]() |
The rest of the paper is organized as follows. We first give an upper and a lower bound for the genome halving diameter problem in Section 2. Then in Section 3, we give a new upper bound for the genome halving distance d(G) that is tighter than the best known, and a new lower bound that is almost always tighter than the best known. Based on the insight obtained in the analysis of these tighter bounds for genome halving distance, we report a novel algorithm to reconstruct ancestral duplicated genomes in Section 4. In Section 5, we analyze the yeast genome with a software package GenomeHalving we have developed to implement our algorithm, and identify a sequence of translocations for halving the yeast genome that is of shorter length than was previously conjectured possible. We close with a discussion of our results.
| 2 GENOME HALVING DIAMETER |
|---|
|
|
|---|
In this section, we obtain an upper bound and a lower bound for the genome halving diameter problem. For genomes with a realistic number of chromosomes, the upper bound almost always matches the lower bound.
2.1 Genome halving diameter: upper bound
El-Mabrouk et al. (1998) studied the diameter problem in which chromosomes can be merged and split, and offered an upper bound of n to construct a trivial duplicated genome. Their construction simply merges all chromosomes into one big chromosome using n 1 fusion translocations, and then divides the resultant chromosome into two identical chromosomes using a fission translocation. By examining this problem with a bit more scrutiny, we are able to derive a tighter upper bound without resorting to either fusions or fissions.
For ease of exposition, we introduce a little more notation. Denote by I(vi,vj) one copy of each of the blocks shared by vertices vi and vj; note that if vi is a loop-vertex we permit vi = vj, in which case I(vi, vi) contains only one copy of each 2-block contained in vi. Denote by
the collection of blocks shared between v and all its adjacent vertices, including itself. Note that
is just Si. Finally, a genome with n chromosomes that has either n or n 1 loop-vertices is defined to be a loopy genome.
THEOREM 2.1.
D(n)n 1; if we restrict our attention to non-loopy genomes, we have D(n)
n 2.
PROOF
We give a constructive proof. Color all perfectly matched vertices black, and color the remaining vertices white. Now select a vertex pair (v1, v2) as follows.
- If there exists a white loop-vertex, select it as v1. If there exists another white loop-vertex, select it as v2; otherwise select an arbitrary white vertex as v2.
- If there is no white loop-vertex, since each white vertex must have least one white neighbor, we can arbitrarily select a pair of neighboring white vertices as v1 and v2.
Color v1 and v2 black. Then perform translocation
1(v1,v2,B1,
) where
. Note that one of I(v1,v1) or I(v1,v2) could be empty, but not both. Also note that if v1 contains a 2-block, B1 will contain only one copy of that 2-block. After translocation
1, vertex
1v1 is a non-loop 1-vertex whose only neighbor is
1v2.
Next, select another white loop-vertex, if it exists, as v3; otherwise choose an arbitrary white vertex as v3. Perform translocation
2(
1v2, v3, B2,
) where
. Note that both copies of the 2-blocks in
1v2, if they exist, will be passed to v3. Now, after two translocations, we have produced a perfectly matched vertex pair, (
1v1,
2
1v2), and each is newly colored black.
Repeat the above operations until we are left with only four white vertices. Since every two translocations generate a pair of perfectly matched vertices, we have performed at most n 4 translocations to this point. It is easy to verify that three more translocations are sufficient to transform any set of four vertices into two perfectly matched vertex pairs. Hence, the total number of translocations needed to halve any genome with n chromosomes is at most n 1.
Now we restrict our attention to non-loopy genomes and show that D(n)
n 2. We discuss two cases.
- If there exist perfectly matched vertices in the initial graph, we observe that these perfectly matched vertices require no translocations and hence we must have D(n)
(n 1) 2
n 2.
- If there are no perfectly matched vertices in the initial graph, we observe that the final four white vertices must contain at least two white non-loop-vertices, since we start with at least two non-loop-vertices by definition, and take care to exhaust all the white loop-vertices before considering any white non-loop-vertex. In such a case, it is easy to verify that two more translocations are sufficient to transform the remaining four vertices into two perfectly matched pairs and hence we must have D(n)
n 2.
2.2 Genome halving diameter: lower bound
Before proceeding to this section, we note that obtaining a lower bound on genome halving diameter is the most technically challenging problem addressed in this paper. As a result, the proof is unavoidably more involved, and may at certain points be tedious. We preface the proof with an overview intuition, but readers uninterested in the details can safely skip ahead to the statement of the lower bound itself in Section 2.2.5.
2.2.1 Proof intuition and overview
To derive a lower bound on the number of translocations required to transform an arbitrary graph into a perfect matching graph, it is easier to study the reverse process in which a perfect matching graph is transformed into an arbitrary graph. More specifically, we study the special case of transforming a perfect matching graph G(V,E) into a clique K(V), a graph whose vertices are all pairwise adjacent.
One critical observation is that if a vertex v is adjacent to either v1 or v2 after a translocation
(v1, v2), then v must have been adjacent to either v1 or v2 before the translocation. Let
(v1,v2) be the last of any series of translocations that transform a graph with vertex set V = {v1,v2,...,vn} into an n-vertex clique. We observe that if we view v1 and v2 as one vertex
such that
's neighbors are the union of the neighbors of v1 and v2, then the n1 vertices
, v3,...,vn are pairwise adjacent before translocation
(v1,v2). In other words, the graph before translocation
(v1,v2) can be viewed as an (n 1)-clique, if v1 and v2 are considered as a single vertex. For an example illustration, see Figure 2.
|
Based on this observation, an induction proof is constructed along the following lines. We first introduce the concept of a pseudo-clique and show that the size of the largest pseudo-clique in a graph can be increased by at most one with each translocation, providing us with an inductional device (Lemma 2.4). Then by analyzing the base case to find the largest pseudo-clique in a perfect matching graph (Lemma 2.6), we have a proof by induction to obtain a lower bound for the halving diameter (Corollary 2.8).
2.2.2 Additional notation and definitions
A central device used in this section is the pseudo-graph, which is derived from an intersection graph and provides an alternative view thereof. Given an intersection graph G(V,E), a pseudo-node
is defined as a non-empty subset of the vertices V. Two pseudo-nodes are disjoint if their intersection is empty. Given two disjoint pseudo-nodes
and
, if no vertex in
has an adjacent vertex in
, then
and
are non-adjacent; otherwise, they are adjacent. Given a particular set of disjoint pseudo-nodes, we can connect each pair of adjacent pseudo-nodes with a pseudo-edge and get a pseudo-graph
(see Fig. 3). For readability, we sometimes omit the pseudo description for a pseudo-edge when it is clear from context. We emphasize a pseudo-graph
exists only in the context of an underlying intersection graph G, and the adjacencies in
are completely determined by the adjacencies in G, given a particular set of pseudo-nodes. In this sense,
is said to be derived from G. In particular, if translocations performed on an underlying graph G change the adjacencies in G, the adjacencies in the derived graph
may change correspondingly. We also note that multiple pseudo-graphs can be derived from the same underlying intersection graph G by choosing different sets of vertices to be the pseudo-nodes.
|
We define adjacency rules between a vertex and a pseudo-node in an analogous manner: given a vertex vi and a pseudo-node
, where
, if vertex vi has no adjacent vertex in pseudo-node
, then vertex vi and pseudo-node
are non-adjacent; otherwise, they are adjacent. A pseudo-graph is complete if all the pseudo-nodes in it are pairwise adjacent. Now we provide two definitions that apply to complete pseudo-graphs, expanding vertex pair and pseudo-clique, which will be important later in the proofs.
DEFINITION 2.2.
Given a complete pseudo-graphwith a set of k disjoint pseudo-nodes
, and a pseudo-node
containing a vertex pair (vi,vj), the vertex pair (vi,vj) is called an expanding vertex pair for pseudo-graph
if there is a translocation
(vi, vj) such that the vertices contained in
can be split into two new disjoint pseudo-nodes
and
satisfying the following two conditions:
The translocation
- the k + 1 pseudo-nodes in
and their induced edges after translocation
(vi,vj) form a complete pseudo-graph
.
- either
or
contains an expanding vertex pair for the newly formed complete pseudo-graph
, and the same holds for
.
(vi,vj) together with the split of
into
and
is referred to as an expansion and is denoted by
(vi,vj).
DEFINITION 2.3.
A complete pseudo-graphwith a set of k disjoint pseudo-nodes
is called a pseudo-clique if for all
either
or
contains an expanding vertex pair for pseudo-graph
.
2.2.3 Inductional device
We now prove a lower bound on D(n) by induction. We begin with the following lemma, which shows that each translocation can increase the size of the largest pseudo-clique in a pseudo-graph by at most one.
LEMMA 2.4.
Given an intersection graph G, if translocationresults in a new intersection graph G' and a k-pseudo-clique can be derived from G', then a (k 1)-pseudo-clique can be derived from G.
PROOF
Denote the k-pseudo-clique derived from G' asand let its pseudo-nodes be
. For concreteness, suppose translocation
is between vertices vi and vj. We study two cases.
- If vi and vj belong to two distinct pseudo-nodes in
, suppose w.l.o.g. that
and
. Define a new pseudo-node
, and let
. We claim that the k 1 pseudo-nodes in
together with the set of induced edges connecting them form a (k 1)-pseudo-clique
that can be derived from G.
Indeed, we have that the pseudo-nodes inare pairwise disjoint, which follows immediately from the fact that the pseudo-nodes in
are disjoint. Furthermore, any translocation between vi and vj adds no new edge between pseudo-nodes in
and does not affect the connectivity between
and any pseudo-node in
. Therefore, pseudo-nodes in
must all be pairwise adjacent before performing
(vi, vj), showing that
is complete. We still need to show that each pseudo-node in
either has cardinality 1 or contains an expanding vertex pair.
By definition,contains an expanding vertex pair (vi, vj). For any other pseudo-node
with
, since
and its induced edges in G' form a k-pseudo-clique
, pseudo-node
contains an expanding vertex pair, say (va, vb), for
. For any vertex
, if vertex vt is adjacent (non-adjacent) to any pseudo-node
in
in G', it must be adjacent (non-adjacent) to
in G. Furthermore, if vt is a loop-vertex (non-loop-vertex) in G', then it is a loop-vertex (non-loop-vertex) in G. Therefore, by Definition 2.2, it is straightforward to verify that (va, vb) is an expanding vertex pair for the pseudo-graph that is derived from G by
, together with its induced edges.
- If, on the other hand, vi and vj belong to at most one pseudo-node in
, there exists a set
of k 1 pseudo-nodes in
containing neither vi nor vj and thus unaffected by translocation
. More precisely, if vertex
and vertex
are adjacent (non-adjacent) in G', then they are adjacent (non-adjacent) in G; if vs is a loop-vertex (non-loop-vertex) in G', then it is a loop-vertex (non-loop-vertex) in G, and the same is true for vt. Therefore, the pseudo-nodes in
and the induced edges connecting them form a (k 1)-pseudo-clique
that can be derived from G.
Putting everything together proves the lemma.
2.2.4 Base case for the induction
Lemma 2.4 provides us with an inductional device to derive a lower bound. We next study the base case by finding the largest pseudo-clique that can be derived from a perfect matching graph, but this requires some further machinery. Given a vertex v (pseudo-node
), the pseudo-degree of
is the number of pseudo-nodes adjacent to
and is denoted by
(
). Given a pseudo-graph
, if a pseudo-node
only contains vertices of pseudo-degree 0 or 1,
is referred to as a singly-adjacent-pseudo-node. Note that though a \scpn
contains only vertices of pseudo-degree 0 or 1, the pseudo-degree of
itself may be greater than 1.
LEMMA 2.5
Given a singly-adjacent-pseudo-nodein a complete pseudo-graph
, if
contains an expanding vertex pair, we must have
PROOF
We prove by induction. When, the lemma is trivially true. Now suppose that the lemma holds for
; we show that it also holds for k + 1.
Denote the k + 1 pseudo-nodes adjacent to
by
. For concreteness, let (va,vb) be the expanding vertex pair contained in
. Since
is a singly-adjacent-pseudo-node, the pseudo-degree of va and vb is at most 1. Assume w.l.o.g. that
is the pseudo-node adjacent to vb, if such a pseudo-node exists. After expansion
(va,vb),
is split into two new pseudo-nodes,
and
, in the newly formed complete pseudo-graph
. The k + 1 pseudo-nodes
,
together with their induced edges also form a complete pseudo-graph
. We claim that
is a singly-adjacent-pseudo-node in
(though it may not be a singly-adjacent-pseudo-node in
). The claim follows from the fact that expansion
(va,vb) cannot connect any vertex in
to any of the pseudo-nodes
, though such expansion might connect a vertex in
to
or
.
According to the definition of expansion,
must contain an expanding vertex pair for
and such a vertex pair is necessarily an expanding vertex pair for
. Therefore,
is a singly-adjacent-pseudo-node with pseudo-degree k in the complete pseudo-graph
and it contains an expanding vertex pair. According to the induction hypothesis, we have
. Symmetrically, we have
. Thus we have
. This completes the proof.
LEMMA 2.6.
Letbe a k-pseudo-clique derived from a perfect matching graph G. For k > 2, we must have |G|
k x 2k.
PROOF.
We prove the lemma by showing that each of the k pseudo-nodes incontains at least 2k vertices. Consider any such pseudo-node
. Because
is derived from a perfect matching graph, when k > 2, each pseudo-node must contain an expanding vertex pair. For concreteness, let (va, vb) be the expanding vertex pair in
. After expansion, pseudo-node
is split into pseudo-node
and pseudo-node
. Denote the resulting pseudo-graph by
. Since there is no loop-vertex in a perfect matching graph,
must contain a \pmvp (vc, vd), which connects
to
in
. Assume w.l.o.g.
and
.
We discuss three possible expansion cases as depicted in Figure 4 and show in each case that
.
- |{vc,vd}
{va, vb|} = 2. In other words, va = vc and vb = vd. After the expansion,
is a singly-adjacent-pseudo-node with pseudo-degree k in the resulting pseudo-graph
. According to Lemma 2.5 we have
. Symmetrically,
. Hence
.
- |{vc,vd}
{va, vb}| = 1. Assume w.l.o.g. va = vc. After the expansion, it is possible that va has pseudo-degree 2 in the resulting graph
. To reduce its pseudo-degree to 1, we merge the pseudo-node(s) adjacent to va into one single pseudo-node, and obtain a pseudo-graph
. Now
is a singly-adjacent-pseudo-node in
with pseudo-degree at least k 1. Since there is no loop vertex in
and
contains an expanding vertex pair,
must contain a perfectly matched vertex pair vaa and vab. By expanding
and merging the pseudo-node(s) adjacent to vaa into a single pseudo-node as before, we can derive a complete pseudo-graph in which
is a singly-adjacent-pseudo-node with pseudo-degree k1 that contains an expanding vertex pair for the pseudo-graph. According to Lemma 2.5, we have
. Hence
. Similarly, we can show
. Thus we have
.
- |{vc,vd}
{va, vd} | = 0. By an argument similar to that of case 2, we can show
.
View larger version (20K):
[in this window]
[in a new window]
Fig. 4 Panel I depicts pseudo-node before expansion. Pseudo-node
contains a perfectly matched vertex pair (vc,vd). Panels II, III and IV illustrate cases 1, 2 and 3 discussed in the proof of Lemma 2.6, respectively. In panels III and IV, when pseudo-nodes are merged as discussed in the proof, the resultant pseudo-nodes are represented in bold.
This completes the proof.
We note that for |G| < 24, the largest pseudo-clique that can be derived from a perfect matching graph G is a 2-pseudo-clique.
2.2.5 Statement of the lower bound
Lemmas 2.4 and 2.6 complete the induction and lead to the following theorem and corollary, the straightforward proofs of which are omitted for brevity.
THEOREM 2.7.
Given a perfect matching graph G with n vertices, it takes at least n k translocations to transform G into an n-clique, where k = 2 when n < 24; when n24, k is the largest integer that satisfies k x 2k
n.
COROLLARY 2.8.
D(n)n k, where k = 2 when n < 24; when n
24, k is the largest integer that satisfies k x 2k
n.
| 3 GENOME HALVING DISTANCE |
|---|
|
|
|---|
While the diameter problem attempts to find the maximum halving distance for all genomes of size n, the distance problem attempts to find the halving distance for a particular genome of size n. By definition, the halving distance for a particular genome is less than or equal to the diameter.
3.1 Genome halving distance: upper bound
We can obtain a tighter upper bound on the genome halving distance by analyzing the algorithm presented in the proof for Theorem 2.1 more closely. In the worst case, it may take two translocations to obtain each perfectly matched pair of vertices. However, if the intersection graph contains a non-loop 1-vertex, a perfectly matched vertex pair can be produced using just one translocation. In some sense, the existence of a non-loop 1-vertex has the potential to save one translocation in transforming the intersection graph to a perfect matching graph. This observation leads to the following lemma.
LEMMA 3.1.
Given a genome G of size n, d(G)n 2 +
(G) min{s, (n4)/2}, where s is the number of non-loop 1-vertices in G;
(G) = 1 if G is a loopy genome, and
(G) = 0 otherwise.
PROOF
During the transformation of the final four vertices, the existence of a non-loop 1-vertex does not necessarily help to save translocations; for example, two translocations are still required to transform a star graph with four vertices to a perfect matching graph (a star graph is one in which one central vertex is adjacent to all the other 1-vertices). In contrast, when the number of remaining white vertices is greater than 4, the existence of a non-loop 1-vertex can always save one translocation. However, to achieve the potential savings of a non-loop 1-vertex, we need to consume two vertices. More precisely, after one translocation, the non-loop 1-vertex and its neighbor become a perfectly matched pair and thus cannot be used in future translocations. The claim then follows.
By extending the intuition behind Lemma 3.1 we can get an even better upper bound on d(G). For readability, in the remainder of Section 3.1, we sometimes omit the non-loop description for a vertex when it is clear from context.
Given a graph G(V,E), a well-separated vertex set W
V is a set of non-loop-vertices such that:
- for any vi, vj
W, vi and vj are not adjacent and share no common neighbor if h(vi) > 1 and h(vj) > 1; and
vi
Wh(vi)
(n 4)/2.
THEOREM 3.2.
Given a genome G of size n, d(G)n 2 +
(G) |W*|, where W* denotes the maximum well-separated vertex set contained in G, and
(G) is defined as in Lemma 3.1.
PROOF
Observe that we can create two 1-vertices by performing a translocation between the two neighbors of a 2-vertex. For example, consider the case depicted in Figure 5 in which v1 is a 2-vertex: after the translocation(v2, v3,
, B3) where
, we obtain two 1-vertices,
v1 and
v3. Therefore by Lemma 3.1 we have that the existence of a 2-vertex can also save one translocation. However, four vertices (v1, v2, v3 and v4) are consumed to achieve the potential savings of a 2-vertex. In general, we can create a (k1)-vertex from a k-vertex with one translocation. By applying the above procedure recursively, any k-vertex has the potential to save one translocation at a cost of consuming 2k vertices.
View larger version (5K):
[in this window]
[in a new window]
Fig. 5 A translocation between two neighbors of a 2-vertex, v1, can produce two 1-vertices (in this case v1 and v3).
In addition, to realize the potential savings of a non-loop 1-vertex v1 whose only neighbor is v2, we first find a vertex v'V\W* that will not be consumed during the processing of the vertices in W*. Note that the existence of v' is guaranteed by the well-separatedness of W*: the total number of vertices that will be consumed will be at most 2 x
vi
W* h(vi)
n 4 < n. Then perform translocation
(v2, v', B2,
) where
. This leaves (v1,
v2) as a perfectly matched pair.
Label the vertices in W* as
1,
2, ...,
|W*| such that h(
i)
h(
j) for all i < j. If h(
j) increases, we say
j is destroyed. We now show that if we process the vertices
1,
2, ...,
|W*| in order, then we neither consume nor destroy any
j
W* while processing
i
W*.
- If h(
i) = 1, we realize the potential savings of
i by touching only its neighbor and v'. By the well-separatedness of W*, no
j
W* is consumed or destroyed.
- If h(
i) > 1, realizing the potential savings of
i may affect
i, its neighbors, v', and possibly some vertices adjacent to
i's neighbors. But by this point, all the potential savings of 1-vertices must have already been realized (since the vertices are processed in order), and any
j
W* with h(
j) > 1 shares no neighbor with
i by the well-separatedness of W*. Therefore, no
j
W* is consumed or destroyed.
Thus, we can fully realize the potential savings of all the
i
W*, resulting in the upper bound as claimed.
3.2 Genome halving distance: lower bound
By studying the so-called fan structure of the intersection graph induced by genome G, El-Mabrouk et al. obtain a lower bound of
log2([(e n/2)/p] + 1)
for d(G), where n is the number of chromosomes in G, e is the number of edges in the intersection graph representing G, and p is the largest number of neighbors shared by any two vertices in the intersection graph. Their strategy is to count the maximum number of edges that can be reduced with one translocation. This strategy is also at the core of their greedy algorithm to find the optimal number of translocations. In this section, we use a different strategy to derive a lower bound that is almost always tighter than the above lower bound; some experimental evidence for this claim comes in the analysis of the yeast genome later in the paper.
We have the following lemma.
LEMMA 3.3., for h(v*) > 1, where v* is the vertex with the maximum degree in G.
PROOF 3.3.
When h(v*) > 1, label edges initially incident to v* as bad. A bad edge can disappear by being merged into another bad edge. Alternatively, it can become an edge connecting the two vertices of a perfectly matched vertex pair, in which case we say the edge becomes good. Let b(G) be the number of bad edges in G. Initially, b(G)=h(v*). Since there are no bad edges in the final perfect matching graph, we must remove b(G) bad edges to arrive at the final graph. We enumerate below all possible types of translocations and their influence upon b(G).As a single translocation decreases b(G) by at most two, the total number of translocations required is at least
- A translocation between v* and a neighbor v1 decreases b(G) by at most two. Such a translocation can happen when v* is a loop-vertex with a 1-vertex neighbor, v2. A translocation between v* and v1 turns the edge e(v*,v2) into a good edge, and merges edges e(v*,v*) and e(v*,v1) into one bad edge. See Figure 6 for an illustration.
- A translocation between v* and a vertex v1 not adjacent to v* decreases b(G) by at most two. Such a translocation can happen when v* is a 2-vertex, v1 is a 1-vertex, and v* and v1 have a common neighbor 2-vertex, v2, as illustrated in Figure 7. A translocation between v* and v1 turns both of the bad edges incident to v* into good edges.
- A translocation between two neighbors of v* decreases b(G) by at most two. This case can happen when v* is a 2-vertex with a neighboring 1-vertex, as illustrated in Figure 8. A translocation between v1 and v2 merges the two bad edges incident to v* into one good edge.
- A translocation between a neighbor of v* and a vertex not adjacent to v* or between two vertices neither of which is adjacent to v* does not decrease b(G) (though it may increase b(G) by one).
h(v*)/2
.
View larger version (7K):
[in this window]
[in a new window]
Fig. 6 There exists a translocation between v* and v1 that decreases the number of bad edges by two. Bad edges are depicted as bold solid segments; good edges as dashed segments.
View larger version (4K):
[in this window]
[in a new window]
Fig. 7 There exists a translocation between v* and v1 that decreases the number of bad edges by two. Bad edges are depicted as bold solid segments; good edges as dashed segments.
View larger version (4K):
[in this window]
[in a new window]
Fig. 8 There exists a translocation between v1 and v2 that decreases the number of bad edges by two. Bad edges are depicted as bold solid segments; good edges as dashed segments.
Note that when h(v*) = 1, our lower bound is simply d(G)
0, since in a perfect matching graph, h(v*) = 1 and d(G) =0.
By extending Lemma 3.3, we get the following theorem.
THEOREM 3.4
Given a graph G(V,E), let V1 and V2 be two disjoint subsets of V, such that no vertex in V1V2 is part of any perfectly matched vertex pair, and every vertex in V1 has a neighbor in V2 and vice versa. We have
PROOF
Assume w.l.o.g. |V1||V2|. Let us redefine the initial selection criteria for bad edges from before. For each v
V1, choose an arbitrary edge incident to v and label it as a bad edge. Again, since there are no bad edges in the final perfect matching graph, the total change in the number of bad edges must be | V1 |. Similar to the analysis in Lemma 3.3 we can show that any translocation decreases the number of bad edges by at most two. This proves the theorem.
| 4 ALGORITHM TO RECONSTRUCT ANCESTRAL DUPLICATED GENOMES |
|---|
|
|
|---|
We now present an algorithm to reconstruct ancestral duplicated genomes based on the intuition behind the proof of Theorem 3.2. The algorithm GENOME-HALVING first colors perfectly matched vertices black and other vertices white. Then it processes the white vertices until either (1) there are only white loop-vertices left, at which point it calls procedure LOOP-VERTICES; or (2) there are at most four white vertices left, at which point it calls procedure 4-VERTICES.
We describe the algorithm in detail below. We first present the main routine GENOME-HALVING and then describe the procedures LOOP-VERTICES and 4-VERTICES called by GENOME-HALVING. Finally, we describe a routine 1-VERTEX that is called by both GENOME-HALVING and LOOP-VERTICES.
The main algorithm GENOME-HALVING(G(V,E)) is presented below.
- Color all perfectly matched vertices in V black, and color the remaining vertices white.
- If no white non-loop-vertex exists, call the procedure LOOP-VERTICES(V), which will terminate.
- If the number of white vertices is less than or equal to four, call the procedure 4-VERTICES(V), which will terminate.
- Find a white non-loop-vertex v1 of the smallest degree. If h(v1) = 1, call the procedure 1-VERTEX(v1, V). Otherwise, label any two of its neighbors as v2 and v3. Since vertex v2 is not a non-loop 1-vertex, it must have another neighbor different from v1. Label it as v4. Note that it is possible v4 = v2 or v4 = v3. Perform translocation
(v2, v3, B2,
) where
. Then v2 becomes a 1-vertex, and h(v1) is decreased by one. Call the procedure 1-VERTEX(
v2,
V). Note that at this point,
V contains at least four white vertices.
- Repeat Steps 2, 3 and 4 until termination.
For a vertex set V whose white vertices are all loop-vertices, we define the procedure LOOP-VERTICES(V) as follows:
- Arbitrarily select a white vertex v1 and a white vertex v2. Perform translocation
(v1,v2, B1, B2) where
and B2 = I(v2,v2). Then v1 becomes a non-loop 1-vertex.
- If v2 also becomes a non-loop 1-vertex, color v1 and v2 black and go to Step 4.
- If, on the other hand, v2 is not a non-loop 1-vertex, call the procedure 1-VERTEX(
v1,
V). Note that at this point,
V contains at least four white vertices.
- If no white vertex remains, terminate; otherwise repeat Steps 1, 2 and 3.
For a vertex set V that contains at most four white vertices, the procedure 4-VERTICES(V) transforms the white vertices in V into perfectly matched vertex pairs with two or three translocations and terminates. We omit the details here for brevity.
For any non-loop 1-vertex v1
V, where V contains at least four white vertices, we define the procedure 1-VERTEX(v1, V) as follows:
- Label the neighbor of v1 as v2.
- Find the white vertex with the maximum degree in V \ {v1, v2} and label it as v3. Note that the existence of v3 is guaranteed by the fact that V has at least four white vertices, including v1 and v2.
- Perform translocation
(v2, v3, B2,
) where
. Color v1 and v2 black.
We have implemented the above algorithm in Java. A user-friendly graphical interface is provided for illustrating the sequence of translocations used to reconstruct an ancestral duplicated genome. For example, the result of halving a genome represented by an 8-clique graph with our program is shown in Figure 9.
|
| 5 GENOME HALVING DISTANCE AND ANCESTRAL GENOME FOR YEAST |
|---|
|
|
|---|
To compare the results of our bounds analysis and of our algorithm with those reported by El-Mabrouk et al. (1998) we use the same yeast genome data set. The data was initially drawn from Wolfe and Shields (1997) and is reproduced here in Table 1.
|
The analysis of El-Mabrouk et al. (1998) gives the bounds, 3
d(G)
16. Their program reconstructs a duplicated yeast genome using thirteen translocations, and they conjectured this value to be optimal based on a series of experiments they performed to find a lower halving distance. In comparison, our analysis yields the bounds, 6
d(G)
12, and our algorithm halves the yeast genome using only eleven translocations, as shown in Figure 10 and the Appendix. In Table 2 we present the allocation of gene blocks among the chromosomes of one possible ancestral yeast genome immediately after genome duplication.
|
|
| 6 CONCLUSIONS AND FUTURE DIRECTIONS |
|---|
|
|
|---|
In this paper, we define the concept of genome halving diameter D(n) and obtain both upper and lower bounds for it. We also derive a tighter upper bound for the genome halving distance d(G) than the best known. In addition, we develop a new strategy for computing a lower bound for genome halving distance; the lower bound we get is almost always tighter than the best known, and in particular, is tighter for the yeast genome halving problem. Furthermore, we design and implement a software package GenomeHalving to reconstruct possible ancestral duplicated genomes. The same yeast data set used by El-Mabrouk et al. (19




, and a pseudo-node
containing a vertex pair (vi,vj), the vertex pair (vi,vj) is called an expanding vertex pair for pseudo-graph
and
satisfying the following two conditions:
and their induced edges after translocation
.
or
(vi,vj).
and let its pseudo-nodes be
. For concreteness, suppose translocation
, suppose w.l.o.g. that
and
. Define a new pseudo-node
, and let
. We claim that the k 1 pseudo-nodes in
that can be derived from G.
and does not affect the connectivity between
with
, since
contains an expanding vertex pair, say (va, vb), for
, if vertex vt is adjacent (non-adjacent) to any pseudo-node
in
in G', it must be adjacent (non-adjacent) to
and vertex
are adjacent (non-adjacent) in G', then they are adjacent (non-adjacent) in G; if vs is a loop-vertex (non-loop-vertex) in G', then it is a loop-vertex (non-loop-vertex) in G, and the same is true for vt. Therefore, the pseudo-nodes in 

, the lemma is trivially true. Now suppose that the lemma holds for
; we show that it also holds for k + 1.
. For concreteness, let (va,vb) be the expanding vertex pair contained in
is the pseudo-node adjacent to vb, if such a pseudo-node exists. After expansion
and
, in the newly formed complete pseudo-graph
,
together with their induced edges also form a complete pseudo-graph
. We claim that
or
. Symmetrically, we have
. Thus we have
. This completes the proof.
and
.
.
. Now
is a singly-adjacent-pseudo-node with pseudo-degree k1 that contains an expanding vertex pair for the pseudo-graph. According to Lemma 2.5, we have
. Hence
.
(G) min{s, (n4)/2}, where s is the number of non-loop 1-vertices in G;
, we obtain two 1-vertices, 
1,
, for h(v*) > 1, where v* is the vertex with the maximum degree in G.


V2 is part of any perfectly matched vertex pair, and every vertex in V1 has a neighbor in V2 and vice versa. We have 

