Bioinformatics Advance Access originally published online on February 15, 2005
Bioinformatics 2005 21(10):2171-2176; doi:10.1093/bioinformatics/bti327
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Divide-and-conquer approach for the exemplar breakpoint distance
1School of Computing, National University of Singapore Singapore 117543
2Department of Mathematics, National University of Singapore Singapore 117543
*To whom correspondence should be addressed.
| Abstract |
|---|
|
|
|---|
Motivation: A one-to-one correspondence between the sets of genes in the two genomes being compared is necessary for the notions of breakpoint and reversal distances. To compare genomes where there are paralogous genes, Sankoff formulated the exemplar distance problem as a general version of the genome rearrangement problem. Unfortunately, the problem is NP-hard even for the breakpoint distance.
Results: This paper proposes a divide-and-conquer approach for calculating the exemplar breakpoint distance between two genomes with multiple gene families. The combination of our approach and Sankoff's branch-and-bound technique leads to a practical program to answer this question. Tests with both simulated and real datasets show that our program is much more efficient than the existing program that is based only on the branch-and-bound technique.
Availability: Code for the program is available from the authors.
Contact: matzlx{at}nus.edu.sg
| 1 INTRODUCTION |
|---|
|
|
|---|
In the theory of genome rearrangement, a genome is usually modeled as a permutation (or equivalently a linear order) on a set of genes and the breakpoint or reversal distances are used for evolutionary inference (Nadeau and Taylor, 1984; Watterson et al., 1982; International Mouse Genome Consortium, 2002; Moret et al., 2005). The breakpoint and signed reversal distances were introduced by Watterson et al. (1982) and Sankoff (1989) respectively. Since then, efficient algorithms for these and other measures have been studied extensively [see Kececioglu and Sankoff (1994); Sankoff and Blanchette (1998) and El-Mabrouk (2005) for a complete bibliography]. The breakpoint distance is easily calculated in linear time. However, the permutations is polynomial-time computable only for the signed reversal distance (Hannenhalli and Pevzner, 1999; Caprara, 1997; Kaplan et al., 2000).
A one-to-one correspondence between the sets of genes in the two genomes is necessary for the notions of both breakpoint and reversal distances. While this may be appropriate for small viruses and mitochondria genomes, it becomes problematic when applied to eukaryotic genomes where paralogous genes exist. Hence, several approaches have been proposed for comparison of genomic sequences with gene families. In inferring phylogeny from genome rearrangement data, Tang and Moret (2003) assumed that the size of each gene family is the same in each genome and simply treated each gene member in a gene family as a distinct gene. Similarly, Chen et al. (2005) also assumed that each gene family has the same size in each genome by removing from a gene family those members that were the least homologous to the members of the counterpart family. Under the assumption of equal content, they proposed a graph-based heuristic algorithm for computing the assignment of orthologous genes that minimizes the reversal distance between two genomes. Finally, a more generalized version of the genome rearrangement problem was formulated by Sankoff (1999) where the restriction of the identical size of each gene family is removed. His idea is to delete all but one member of a gene family in each of the two genomes, so as to minimize the breakpoint or reversal distance between the two reduced genomes, called exemplars. Such an approach is clearly relevant in the phylogenetic context where the object is to reconstruct the ancestral gene order and the size of each gene family had been changed due to gene duplication/deletion (Sankoff and El-Mabrouk, 2000; El-Mabrouk, 2005).
In the current paper, we study efficient algorithms for the exemplar problem for genomes with gene families. Formally, the exemplar problem with the breakpoint distance is, given two genomes
and
, to find the minimum breakpoint distance between g and h overall choices of examplars g and h of
and
respectively. Even though it is almost trivial to calculate the breakpoint distance between two genomes with single-gene families, the exemplar breakpoint distance problem is not only NP-hard (Bryant, 2000), but also unlikely to have polynomial-time constant-ratio approximation algorithm unless NP = P (C.T.Nguyen, Unpublished manuscript). By observing the monotonicity of the exemplar distances, Sankoff proposed a branch-and-bound method to tackle this problem. The idea is to work on one gene family separately, choosing the pair of exemplars that least increases the distance when inserted into the partial exemplar genomes already constructed.
Motivated by the above idea, we propose to calculate the exemplar breakpoint distance by partitioning the gene families into disjoint subsets such that two gene families in different subsets are independent. Then, we find the pair of exemplars of gene families in each subset at a time, and finally merge all the exemplars together to obtain two optimal exemplars of the given genomes. Tests with both simulated and real datasets show that our divide-and-conquer approach is much more efficient than the branch-and-bound approach.
The paper is organized into six parts. Section 2 provides the needed definitions and notations. Section 3 introduces the concept of independence of gene families and proves a theorem that characterizes the relationship between the independence of gene families and the breakpoint distance. Section 4 presents our divide-and-conquer approach that is based on the characterization theorem established in Section 3, and Section 5 reports our test results. Section 6 concludes with a few remarks on future research.
| 2 THE EXEMPLAR DISTANCE PROBLEM |
|---|
|
|
|---|
In the study of genome rearrangement with gene duplication, a genome
is modeled as a string of signed (+ or ) symbols from an alphabet A, which represents the gene families. For a symbol a
A, all its occurrences in genome
constitute a gene family; and the number
of its occurrences is the size of the gene family.
If
for all a
A, then
is a permutation on the symbols in A. Consider two permutation genomes G' = g1,g2,...,gn and H' = h1,h2,...,hn. We say gi precedes gi+1 in G' and hi precedes hi+1 in H' for i
n 1. If gene a precedes gene b in G', but neither a precedes b nor b precedes a in H', they determine a breakpoint in G'. Additional breakpoints are posited if g1
h1 and if gn
hn. The number of breakpoints, dB(G', H'), in G' is called the breakpoint distance between G' and H' since dB(G', H') = dB(H', G').
A reversal operation on a segment ab
cd of a genome
ab
cd
transforms the genome to
dc
ba
. For example, genome 3214 can be transformed into 4 3 1 2 in two reversals as follows: 3214
3 4 1 2
4 3 1 2. The reversal distance} between G' and H', denoted by dR(G', H'), is the minimum number of reversals necessary to transform G' into H' or vice versa.
Given two genomes
and
(on alphabet A) satisfying
and
for all a
A. An exemplar of a genome
is obtained by deleting all but one occurrence of each gene family. Note that an exemplar string is a permutation. Let
denote the set of all exemplars of
. Then, we define the exemplar breakpoint distance (EBD) between
and
as
![]() |
![]() |
In the rest of the paper, we shall focus on deriving an efficient algorithm for finding
given
and
over an alphabet.
| 3 INDEPENDENCE OF GENE FAMILIES |
|---|
|
|
|---|
Let
and
be two genome strings with gene family set A. For gene family a, each occurrence
in
and each occurrence
in
form an occurrence pair. If
, a is called a single-pair family; otherwise, it is called a multi-pair family. For example, in genomes
![]() |
and
to denote the i-th copy of g in
and
respectively.
A permutation
on a subset A' of gene families is called a partial exemplar of
if it can be obtained by deleting all the occurrences of each gene family in A\A' and deleting all but one occurrence of each family in A'. For instance, in the above example, the six single-pair gene families a, b, c, d, e, l and r form a partial exemplar to la bc der of
and lb a ed cr of
. For an occurrence
of a gene a
A\A' in
, we use
to denote the resulting partial exemplar of
after
is inserted into
. Similarly, we define
for a set S of occurrences of different gene families. We let
if a
A'. In the rest of this section, we assume that
and
are partial exemplars of
and
, respectively.
For two gene families a and b, we say their consecutive occurrences
and
in
are adjacent in
and
if and only if they do not form a breakpoint in
and
.
DEFINITION 1
Assume a and b are two gene families at least one of which does not occur inand
.
- Two occurrence pairs
and
in
and
are said to be possibly adjacent with respect to
and
if and only if
and
are adjacent in
and
and
are adjacent in
.
- Let c and d be an adjacent pair in
and
. An occurrence pair
of a kills (c, d) if and only if (c, d) is no longer adjacent in
and
.
- Let r and s be two possibly adjacent pairs with respect to
and
. A pair
blocks r and s with respect to
and
if and only if r and s are not possibly adjacent with respect to exemplars
and
.
Take the partial exemplars
a b c d e r of
and
b a e d c r of
as an example. Pairs
and
are possibly adjacent with respect to
and
;
and
kill adjacent pair (c,d) in
; and
blocks
and (d, d) with respect to
and
.
DEFINITION
Assume a and b are two different gene families that do not occur inand
.
- Two occurrence pairs of a and b are independent with respect to
and
if and only if (1) they are not possibly adjacent with respect to
and
, (2) they do not block each other from forming an adjacency with another pair with respect to
and
and (3) they do not kill the same adjacenct pair in
and
.
- Two gene families a and b are independent with respect to
and
if and only if any occurrence pair of a and occurrence pair of b are independent with respect to
and
.
For a gene pair
of a gene family a that does not occur in
and
, we use
and
to denote
and
, respectively. Similarly, we can define
and
for set P of occurrence pairs that are from different gene families.
LEMMA 1
Let P and Q be sets of occurrence pairs of different gene families that are not inand
. If any p
P and q
Q are independent with respect to
and
,then
where
, i.e. it denotes the change in the number of breakpoints after all the occurrence pairs in P are inserted.
PROOF
Let a and b be two consecutive signed symbols in. Let
ab(X) denote the change in the number of breakpoints between a and b after all the occurrence pairs in X are inserted in
and
for X = P, Q, P
Q. Obviously, for each X,
(X) =
(a,b)
ab(X). Thus we only need to prove the following equation:
for any consecutive symbols a and b in
(1) . To this end, we consider the following two cases.
Case 1. The pair (a,b) is adjacent in
. After inserting all pairs in P
Q, (a, b) is still adjacent in both
and
or separated by some occurrence pairs in
or
. In the former case, we have
ab(P
Q) =
ab(P) =
ab(Q) = 0 and hence Equation (1) holds. In the latter case, we have the following facts.
CLAIM 1
a and b can only be separated by either some pairs in P or some pairs in Q, but not both inand
PROOF
Assume that a and b are separated by some pairs in P and some pairs in Q. Supposeand
such that
and
are inserted between a and b in
in the following configuration:
. Then, both p and q kill the same adjacency (a, b) in
and
, contradicting the independence hypothesis.
Suppose
and
such that only
is inserted between a and b in
and
is inserted between a and b (or b and a) in
. Again, both p and q kill the same adjacency (a, b) in
, contradicting the independence hypothesis. This finishes the proof of Claim 1.
We may assume that a and b are separated by k pairs
(1
i
k) in P after P
Q are inserted: a
b is a substring of
. For simplicity, we define
and
. Then, we have the following.
CLAIM 2and
are adjacent in
and
if and only if they are adjacent in
and
.
PROOF
Assume thatand
are adjacent in
and
. If they are not adjacent in
and
, there must be an occurrence pair
in Q such that
separates
and
. In other words,
blocks
and
. Since at least one of
and
is in P, this contradicts the independence hypothesis. Thus
and
are adjacent in
and
.
On the other hand, it is obvious that if
and
are adjacent in
and
, they are adjacent in
and
. This finishes the proof of Claim 2.
By Claim 2,
ab(Q) = 0 and
ab(P
Q) =
ab(P). Hence, Equation (1) holds.
Case 2. The pair (a,b) is a breakpoint in
. Denote by tX the number of gene occurences between a and b in
, and by uX the number of adjacent pairs between a and b in
. Obviously, tPQ = tP + tQ. Moreover, uPQ = uP + uQ since a pair in P can neither block a pair in Q to form adjacency with other pairs nor form an adjacency with a pair in Q and vice versa. Therefore, we have that
ab(X) = (tX + 1 uX) 1 = tX uX, and thus
![]() |
Using the above lemma, we obtain the following.
THEOREM 1
Let A' and A'' be sets of gene families that do not occur inand
such that any a'
A' and a''
A'' are independent with respect to
and
. If {pi|i
| A'|} and {qi|i
|A''|} are occurrence pairs of gene families in A' and A'' such that
and
reach the minimum values over all the choices of occurrence pairs, then
has the minimum value over all the occurrence pairs of A' and A''.
The occurrence pairs pi are called the optimal representative pairs of A' with respect to
and
.
| 4 ALGORITHMS |
|---|
|
|
|---|
Divide-and-Conquer approach. Following the discussion in the previous section, we propose to calculate
in two steps:- Form partial exemplars
and
with all the single-pair gene families, and then divide multi-pair gene families into disjoint subsets that are independent with respect to
and
.
- For each gene family subset found in step (1), find its optimal representative pairs. Obtain the optimal exemplars of
and
by taking the union of all the optimal representative pairs.
To partition the gene families into disjoint subsets that are independent, we define a graph
, called the dependence graph of the gene families. Each vertex in
represents a multi-pair family. Two vertices are joined by an edge if and only if they are not independent with respect to
and
. By Definition, one can efficiently determine whether two families are independent or not. Noting that gene families in different connected components are independent, we just partition the gene families according to the connected components of
in Step (1).
In Step (2), we use Sankoff's program (1999) to find optimal exemplar occurrence pairs for each gene family subset and call it SimplBB.
Approximating algorithm. Dividing multigene families into disjoint subsets does not always solve the problem. For large genomes, these subsets are still too large and finding their optimal representative pairs is forbidding.
To speed up the procedure, we delete a small subset X of vertices to split each connected component into smaller ones with an appropriate size in the graph. By these deletions, we trade accuracy for efficient computation. Hence, we only obtain tight lower and upper bounds of
. Assume that
,
are the partial genomes obtained by removing the gene families corresponding to vertices in X. Since the breakpoint distance is monotone (Sankoff, 1999),
is a lower bound of
. On the other hand, we also obtain an upper bound of
by finding the breakpoint distance between
and
, where
and
are the optimal exemplars of
and
, respectively. When X is small, these bounds are tight.
The deletion can be done solely on
since removing a vertex (or a family) does not affect the others. However, the problem of deleting the smallest subset of vertices to reduce all connected components to a desired size is NP-hard since it contains the independent set problem as a special case. Hence, we simply delete the vertices with the largest degree until the size of each component is small enough for SimplBB to run fast. On the other hand, the size should not be too small to make the gap between the upper and lower bounds large. Because of this, we determine the value of the size parameter according to the input dataset.
| 5 EXPERIMENTAL TESTS |
|---|
|
|
|---|
In Sankoff (1999) the computational cost of SimplBB is measured in the number of calls to subroutine BD which calculates the breakpoint distance between two partial exemplar strings. In our program, most of the computation is done by calling SimplBB on each independent subset of gene families. Thus, we also measure the computational cost of our program FastEBD in the number of calls to BD for comparing it with SimplBB. We implemented both SimplBB and FastEBD in C++ and ran them on a PC with 1.6 GHz Pentium processor and 512 MB RAM.
5.1 Results on simulated genomes
We use the simulation method proposed by Sankoff (1999) to generate genome pairs for our validation test. Each genome pair is characterized by two parameters: the number of (both single-pair and multi-pair) gene families and their configuration, which indicates the number of copies of each multi-pair gene family in each of the two genomes. In the configurations given in Table 1, each row corresponds to a genome, each column corresponds to a multi-pair gene family and each entry gives the number of genes of the corresponding gene family in the corresponding genome. In each genome pair generated according to the configuration threes, a genome contains roughly about three copies of each multi-pair family. However, in each genome pair generated according to the second configuration, the number of gene copies in each multi-pair family varies. Some multi-pair families contain as many as 10 gene copies in each genome while others contain only one or two copies in each genome. Hence, the second configuration is named unbalanced.
|
We run both SimplBB and FastEBD to find the exemplar breakpoint distance between two genomes with the same configuration described in Table 1. In Figure 1, we plot the numbers of calls to the subroutine BD made by SimplBB and FastEBD in log scale against the length of exemplars. The graph indicates that FastEBD is much more efficient than SimplBB.
|
We also note that when the number of single-pair gene families increases, the running cost of SimplBB increases while the cost of FastEBD increases for a narrow range of length and then decreases. The reason for FastEBD running faster when there is a large number of gene families is that when more single-pair gene families are included, the multi-pair gene families become less dependent on each other and hence the large set of gene families could be partitioned into small unrelated subsets.
Another property of our program FastEBD is that it allows a trade-off between accuracy and computational efficiency. This is vital when working with real genome datasets that contain hundreds or thousands of genes. Thus we conducted experiments to examine the accuracy of the approximating version of our program. By setting the threshold T on the number of calls to BD, our algorithm outputs the lower and upper bounds on the exemplar breakpoint distance between the input genomes. If T is small, we have to remove more gene families so that the remaining gene families can be partitioned into independent subsets on which SimplBB can finish by calling BD at most T times. Symmetrically, if T is large, we just need to remove less gene families and hence obtain better lower and upper bounds on the exemplar breakpoint distance between the input genomes. We measured accuracy using the ratio of the difference of the lower and upper bounds to the upper bound of the exemplar breakpoint distance output from the program.
To generate genomic datasets that are close to biological data, we adopt a common approach in benchmarking algorithms for calculating genomic distance. Starting with two identical genomes, we let them undergo a series of mutations such as duplication, transposition and reversal with certain probabilities. Since duplication, transposition and reversal of long segments happen less frequently than those of short segments (Pinter and Skiena, 2002), we use a Poisson distribution P
(k) to calculate the probabilities of mutations on a segment of length k with the mean
as a parameter.
We generated 30 genome pairs in which each genome contains 3000 genes and undergoes 500 cycles of mutations. The mean of probabilities for each type of mutation is 0.1. The resulting genome pairs contain about 38 copies in each gene family, giving rise to about 1025010350 exemplar pairs. We ran FastEBD on each pair of genomes with T taking values 104, 2x 104, 3x 104, 4x 104 and 5x 104. The accuracy ratios in these tests are summarized in Figure 2. From this graph, it is clear that accuracy is improved when T is bigger. It also shows that even with T=104, the error is <5% in most of our experiments.
|
5.2 Bacterial genomes
We ran the algorithms on five bacteria genome pairs studied in Earnst-De Young et al. (2004): Buchnera aphidicola and W. brevipalpis (BaphiWigg), Pasteurella multocida and Hamophilus influenzae (PmultHinfl), Escherechia coli and Salmonella typhimurium (EcoliStyphi), Xanthomonas axonopodis and Xanthomonas campestris (XaxoXcamp) and Yersinia pestis-KIM and Yersinia pestis-CO92 (Ypes).
Each genome in the five genome pairs above contains gene families that are not in the other. There are two ways to deal with these genes for our purpose. The easy way is to delete all such genes and consider the reduced genomes. Alternatively, we first put a pivot gene x at the right end of both genomes. Then, we construct two genomes from
and
by doing the following: (1) for each block B of consecutive occurrences of gene families that are in
but not in
, we introduce a new gene family gB, replace B with a copy of gB and append a copy at the right end of
. (2) We deal with the gene families that appear in
, but not in
, in the same way. The purpose of this preprocessing method is to take gene deletions into accounts. But it may introduce redundant hypothetical gene deletion events. Our computation results show that we might underestimate the exemplar breakpoint distance between two given genomes by using the first method and overestimate it by using the second method. By using these two methods to deal with gene losses, we obtain two genomic datasets from each of the five bacteria genome pairs, denoted by BaphiWigg1, BaphiWigg2, PmultHinfl1, PmultHinfl2, EcoliStyphi1, EcoliStyphi2, XaxoXcamp1, XaxoXcamp2, Ypes1 and Ypes2 respectively.
We ran both SimplBB and FastEBD on these datasets. SimplBB could finish only on BaphiWigg1 and made about 105 calls to BD. FastEBD with T=104 finished on all the genome pairs, returning either an exact EBD or lower and upper bounds. (See Section 5.1 for threshold T.) In the latter case, the bounds differ by only 1. Furthermore, we are able to obtain the exact EBD for all pairs by increasing T. For example, with T=3x 106, FastEBD output 119 given Xaxo-Xcamp1 as input. The results are summarized in Table 2.
|
| 6 CONCLUSION |
|---|
|
|
|---|
In this paper, we have proposed an efficient algorithm for calculating the exemplar breakpoint distance between two genomes with gene families. It is not only fast, but also outputs accurate breakpoint distances. Our idea can be incorporated into the existing methods to derive an efficient algorithm for computing the exemplar reversal distance because of the close relation of the reversal distance and the breakpoint distance. Future research topics also include how to incorporate our idea into the existing methods such as the branch-and-bound approach to derive efficient algorithms for the exemplar problem with other distance measures such as translocation distance (Hannenhalli and Pevzner, 1995) or syntenic distance (Feretti et al., 1996).
As for applications, investigating whether our approach can be extended into an efficient method for genomic comparison or for reconstructing phylogeny from gene-content data is also interesting.
| Acknowledgments |
|---|
The authors would like to thank the reviewers for the careful reading of the manuscript and the useful suggestions for revising it. They would also like to thank D. Sankoff for the helpful discussions on the exemplar problem, E. Lerat for sending us her genomic datasets, and J. Tang for the comments on the first draft of this manuscript. L.Z. was partially supported by BMRC Research Grant BMRC01/1/21/19/140.
Received on December 3, 2004; revised on February 5, 2005; accepted on February 10, 2005
| REFERENCES |
|---|
|
|
|---|
Bryant, D. (2000) The complexity of calculating exemplar distances. In Sankoff, D. and Nadeau, J.H. (Eds.). Comparative Genomics, , Dordrecht Kluwer Academic Publishers, pp. 207212.
Caprara, A. (1997) Sorting by reversals is difficult. Proceedings of the First International Conference on Computational Molecular Biology (RECOM97) CM Press, pp. , pp. 7583.
Chen, X., Zheng, J., Fu, Z., Nan, P., Zhong, Y., Lonardi, S., Jiang, T. (2005) Computing the assignment of orthologous genes via genome rearrangement. Proceedings of the Third Asia-Pacific Bioinformatics Conference Imperial College Press, pp. , pp. 363378.
Earnst-De Young, J., Lerat, E., Moret, B. (2004) Reversing gene erosionreconstructing ancestral bacterial genomes from gene content and order data. Proceedings of the 4th Annual Workshop on Algorithms in Bioinformatics (WABI'04) , Berlim Springer, pp. , pp. 113.
El-Mabrouk, N. (2005) Genome rearrangements with gene families. In Gascuel, O. (Ed.). Mathematics of Evolution and Phylogeny, (In press) Oxford University Press.
Feretti, V., Nadeau, J.H., Sankoff, D. (1996) Original synteny. Proceedings of the 7th Annual Symposium on Combinatorial Pattern Matching , Berlin Lecture Notes in Computer Science, Vol. 1075 Springer, pp. , pp. 7887.
Hannenhalli, S. and Pevzner, P. (1995) Transforming men into mice (polynomial algorithm for genomic distance problem). Proceedings of IEEE 36th Annual Symposium on Foundations of Computer Science The IEEE Computer Society, pp. , pp. 581592.
Hannenhalli, S. and Pevzner, P. (1999) Transforming cabbage into turnip: polynomial algorithm for sorting signed permutations by reversals. Journal of the Association for Computing Machinery, 46, 127.
International Mouse Genome Consortium. (2002) Initial sequencing and comparative analysis of the mouse genome. Nature, 420, 520562[CrossRef][Medline].
Kaplan, H., et al. (2000) Faster and simpler algorithm for sorting signed permutations by reversals. SIAM J. Comput., 29, 880892[CrossRef].
Kececioglu, J. and Sankoff, D. (1994) Efficient bounds for oriented chromosome-inversion distance. Proceedings of the Fifth Symposium Combinatorial Pattern Matching , Berlin Lecture Notes in Computer Science, Vol. 807 Springer, pp. 307325.
Moret, B., et al. (2005) Reconstructing phylogenies from gene-content and gene-order data. In Gascuel, O. (Ed.). Mathematics Of Evolution and Phylogeny, Oxford University Press.
Nadeau, J.H. and Taylor, B.A. (1984) Lengths of chromosomal segments conserved since divergence of man and mouse. Proc. Natl. Acad. Sci. USA, 81, 814818
Pinter, R.Y. and Skiena, S. (2002) Genomic sorting with length-weighted reversals. Proceedings of the Thirteenth International Conference on Genomics Informatics , Tokyo, Japan Universal Academy Press, pp. 103111.
Sankoff, D. (1989) Mechanisms of genome evolution: models and inference. Bull. Int. Stat. Institut., 47, 461475.
Sankoff, D. and Blanchette, M. (1998) Multiple genome rearrangement and breakpoint phylogeny. J. Comput. Biol., 5, 555570[Web of Science][Medline].
Sankoff, D. and El-Mabrouk, N. (2000) Duplication, rearrangement and reconcilliation. In Sankoff, D. and Nadeau, J.H. (Eds.). Comparative Genomics, , Dordrecht Kluwer Academic Press, pp. 537550.
Sankoff, D. (1999) Genome rearrangement with gene families. Bioinformatics, 15, 909917
Tang, J. and Moret, B. (2003) Phylogenetic reconstruction from gene rearrangement data with unequal gene contents. Proc. 8th Workshop on Algorithms and Data Structures (WADS'03) , Berlin Lecture Notes in Computer Science, vol. 2748 Springer Verlag, pp. 3746.
Watterson, G.A., et al. (1982) The chromosome inversion problem. J. Theor. Biol., 99, 17[CrossRef].
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||



in
and
are adjacent in
.
, i.e. it denotes the change in the number of breakpoints after all the occurrence pairs in P are inserted.
(X) =
(a,b)
and
such that
and
are inserted between a and b in
. Then, both p and q kill the same adjacency (a, b) in
is inserted between a and b (or b and a) in
(1
b is a substring of
and
. Then, we have the following.
and
are adjacent in
in Q such that
separates
and
. In other words,
. Since at least one of 
and
reach the minimum values over all the choices of occurrence pairs, then
has the minimum value over all the occurrence pairs of A' and A''.
