Evolution and Phylogenetics
Inferring phylogeny from whole genomes
Górecki *Institute of Informatics, Warsaw University Banacha 2, 02-678 Warsaw, Poland
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Inferring species phylogenies with a history of gene losses and duplications is a challenging and an important task in computational biology. This problem can be solved by duplication-loss models in which the primary step is to reconcile a rooted gene tree with a rooted species tree. Most modern methods of phylogenetic reconstruction (from sequences) produce unrooted gene trees. This limitation leads to the problem of transforming unrooted gene tree into a rooted tree, and then reconciling rooted trees. The main questions are What about biological interpretation of choosing rooting?, Can we find efficiently the optimal rootings?, Is the optimal rooting unique?.
Results: In this paper we present a model of reconciling unrooted gene tree with a rooted species tree, which is based on a concept of choosing rooting which has minimal reconciliation cost. Our analysis leads to the surprising property that all the minimal rootings have identical distributions of gene duplications and gene losses in the species tree. It implies, in our opinion, that the concept of an optimal rooting is very robust, and thus biologically meaningful. Also, it has nice computational properties. We present a linear time and space algorithm for computing optimal rooting(s). This algorithm was used in two different ways to reconstruct the optimal species phylogeny of five known yeast genomes from approximately 4700 gene trees. Moreover, we determined locations (history) of all gene duplications and gene losses in the final species tree. It is interesting to notice that the top five species trees are the same for both methods.
Availability: Software and documentation are freely available from http://bioputer.mimuw.edu.pl/~gorecki/urec
Contact: gorecki{at}mimuw.edu.pl
| 1 INTRODUCTION |
|---|
|
|
|---|
The relationships between species cannot always be inferred from a single gene family. This important property is crucial in the species phylogeny reconstruction. In modern phylogenetics, gene trees are mostly computed by computer programs from gene sequences. Most of them produce unrooted gene trees (ML, MP, NJ, etc.) in which the common ancestor relation is undefined. Sometimes an outgroup species is added to root the unrooted trees. In this paper, we use a different approach. The best rooting is determined by cost minimality criterion using the duplication-loss model (DL-model) and reconciled trees (Page and Charleston, 1997). The model explains the differences in terms of gene duplications and gene losses (Goodman et al., 1979; Page, 1994; Guigo et al., 1996). It is known that the reconciled tree gives the minimal number of these events (Bonizzoni et al., 2003; Górecki and Tiuryn, 2005) required to reconcile a given gene tree with a given species tree.
The problem of rooting an unrooted gene tree, subject to minimization of a certain cost function, has been addressed in Chen et al. (2000). This is, to our knowledge, the only paper which deals with reconciliation of an unrooted gene tree with a (rooted) species tree. However, the authors of Chen et al. (2000) are not specific enough in explaining the algorithm which they use to solve this problem. Hence it is difficult to compare our approach with theirs. In particular they do not discuss the important problem of how the distribution of costs assigned to the nodes of a reconciled tree looks when we have two or more optimal rootings. As we will see, it follows from the theory presented in our paper that these distributions are in fact identical. Hence it is enough to find just one optimal rooting. Also a discussion of the role of weight constants in the cost function, which express relative importance of a duplication cost versus gene loss cost, is omitted in Chen et al. (2000). One of the main results of our paper states that these constants are not important (as long as they are positive), i.e. exactly the same edges are optimal for each choice of a cost function.
The problem of reconstructing the species phylogeny in DL-model is NP-complete (Ma et al., 2000). In our experiments, we enumerate all possible species trees. It should be noted that it is possible for a small number of species.
The paper is organized as follows. In Sections 2 and 3, we introduce the duplication-loss model and the general procedures of phylogeny reconstruction. Section 4 provides a detailed exposition of reconciling of unrooted gene tree and a rooted species tree. The main result of this paper is included in Theorem 7. Then, we present the algorithm of finding an optimal rooting. Finally, we present the results of experiments for five yeast genomes.
| 2 PHYLOGENY RECONSTRUCTION |
|---|
|
|
|---|
2.1 Preliminaries
Let I be the set of species consisting of N > 0 elements. We consider a set P of protein sequences from species in I. The first step is the classification of proteins into families. As a result a division C of P is obtained. Formally, C is a family of disjoint non-empty sets, called clusters, such that
C = P. A cluster represents a set of homologous proteins. In particular, the process consists of computing a similarity measure between sequences (e.g. by SmithWaterman algorithm or BLAST) and then applying a clustering algorithm (e.g. MCL-method). In general, it can be more complex: in most cases an additional sophisticated analysis is employed to incorporate specific features of the clustered set (Dujon and Sherman, 2004). In our approach, we assume that the clustering is given.
2.2 Species and (un)rooted gene trees
Formally, the unrooted gene tree is an undirected acyclic graph in which each node has degree 3 (internal nodes) or 1 (leaves), and the leaves are labeled by the elements from I.
The terminology for rooted trees is partially taken from Ma et al. (2000). The model of evolutionary species tree is a rooted binary tree with N leaves uniquely labeled by the elements from I. Sometimes, we refer to the nodes of S by cluster of labels of its leaves. For instance, in a species tree (a,(b,c)) (Fig. 1), we have five nodes: a, b, c, bc and abc.
|
A rooted gene tree is a rooted binary tree with leaves labeled by the elements from I. By int(T) we denote the set of internal nodes in a tree T. By Lf(T) we denote the set of leaf labels in T.
2.3 Reconstructing species treegeneral schema
We present two general approaches to the problem of phylogeny reconstruction. Here, we assume that the primary step of computing cost c(G,S) between the given unrooted gene tree G and a given species tree S is known.
The first is based on the cost criterionchoose the tree S* which minimizes the total cost
.
The second is similar to votingchoose the species tree which is the most popular as optimal in local optimizations:
- For every unrooted gene tree G
- - Let
G be the set of all species trees with the minimal c(G,S).
- - For every S
G, S gets 1/SG point(s) (i.e. a point from G is distributed uniformly over all species trees from
G).
- Return species tree with the maximal number of points.
| 3 DUPLICATION-LOSS MODEL |
|---|
|
|
|---|
We now shortly introduce the concepts of the reconciled trees and the duplication-loss model. For a broader approach we refer reader to Goodman et al. (1979); Page (1994); and Guigo et al. (1996).
Let S =
VS, ES
be a species tree. We view S as an upper semilattice with + a binary least upper bound operation and
the top element (i.e. the root). In particular for a,b
VS, a
b means that a and b are on the same path from the root, with b being closer to the root than a (unless a
b). We will use the comparability predicate D(a,b) = 1, when a
b or b
a and D(a,b) = 0, when a and b are incomparable. The distance function
(a,b) denotes the number of edges on the unique path connecting a and b.
Nodes a,b
VS are called siblings when a + b is a parent of both a and b. For a,b
VS let Sb(a,b) denote the set of siblings defined as follows. Sb(a,b) = {c}
Sb(a + c,b), where c is a sibling of a and either a < b or a + c < a + b. In a similar way, we define Sb(a,b) = {d}
Sb(a,d + b), where d is a sibling of b and either b < a or d + b < a + b. The remaining cases are covered by the definition: Sb(a,b) =
, when a = b, or a and b are siblings. It can be easily proved that the above definition is correct (i.e. it uniquely defines a set) and that Sb(a,b) = Sb(b,a).
Let L(a,b) denote the number of elements in Sb(a,b). The functions L and
are close to each other. When a and b are comparable, then L(a,b) =
(a,b). While when a and b are incomparable, then we have L(a,b) =
(a,b) 2. These observations are easy to prove by induction. The connection between L and
can be written concisely as follows:
![]() | (1) |
VS be the least common ancestor mapping from rooted G into S which preserves the labeling of the leaves. Clearly when w1w2
VG are children of v, then we have
![]() | (2) |
which for all nodes v
VG, a
VS assigns a real
(v, a) representing a contribution to node a which comes from v when reconciling G with S. Having
we can derive several values of interest. For a
VS,
is the total cost assigned to a. The function
will be called a cost distribution function. Dually, for v
VG we can define
which is a total contribution from v due to reconciliation of G with S. The function
will be called a contribution function. Finally,
is the total cost of reconciliation of G with S.
Let us give a few examples of a cost function which we will use later. Below we denote by w1 and w2 the children of a node v
VG [if v
int(G)]
Duplication cost function is defined as follows:
![]() |
Loss cost function:
![]() |
LEMMA 1
Let vVG and let w1w2 be children of v. Then,
|
| (3) |
|
| (4) |
It can be proved that for a gene node v,
L (v) = l(v), where l(v) is the number of gene losses associated to v (Ma et al., 2000; Eulenstein et al., 1998). Also,
D (v) = 1 if v is a duplication node (Ma et al., 2000; Eulenstein et al., 1998). Distributions
D and
L show the location of gene duplications and gene losses, respectively, in the species tree. The example is shown in Figure 1.
The reconciliation model is known to be biologically meaningful (Mirkin et al., 1995). The crucial notion is the reconciled tree which can be visualized as an embedding of the gene tree into the species tree. It follows easily that
D and
L are the minimal number of gene duplications and gene losses (respectively) required to reconcile G with S. Here we omit the details. Refer Figure 1 for example of embedding.
| 4 RECONCILING UNROOTED GENE TREE WITH A SPECIES TREE |
|---|
|
|
|---|
4.1 Unrooted gene trees
From now on, let G =
VG, EG
be unrooted gene tree. Choosing a position of the root in G amounts to selecting an edge e
EG on which the root is to be placed. Such a rooted tree we denote by Ge, with v* a new node denoting the root. To distinguish between rootings of G all symbols defined in previous section for a rooted gene tree will be extended by inserting index e, for instance Me is the mapping from Ge to S, etc. Without loss of generality, we assume
(A1) S and G have at least one internal node,
(A2) Me(v*) =
, i.e. the root of every rooting is mapped into the root of S (it is independent of e).
Note that from these conditions |Lf(G)|> 1.
The proof of the next lemma follows immediately from the definition of duplication and gene loss costs.
LEMMA 2
Let e and e' be any two edges in G. If for any pair of nodes a,bVS the number of siblings w1,w2 in Ge which are mapped to a and b, respectively, is the same as the number of siblings u1, u2 in Ge' which are mapped to a and b, respectively, then
.
4.2 Orientation and labeling of G
We transform G into a directed graph
, where
. In other words each undirected edge
in G is replaced in
by a pair of directed edges
v, w
and
w, v
.
Edges in
are labeled by nodes of S as follows. If v
VG is a leaf labeled by a, then the edge
is labeled by a. For the induction step let us assume that
w1, v
and
w2, v
are labeled by b1 and b2, respectively, then the edge
, such that w3
w1 and w3
w2 is labeled by b1 + b2. The intuition behind the label a of an edge
v, w
of
is that in the rooted tree Ge, where e is either {v, w}, or it is located in the part of G which is past w, the node v is mapped to a. Equation (2) justifies this intuition.
4.3 Classification of stars
Every internal node v
VG defines a star with center v as indicated in Figure 2. The edges
v, wi
will be called outgoing, while the edges
wi, v
will be called incoming. We will refer to the undirected edge {v, wi} as ei, for i = 1,2,3.
|
LEMMA 3
Let bi![]()
, for all i = 1,2,3. Then for exactly one ai say for a1, we have a1
![]()
(i.e. a2 = a3 =
).
PROOF. It follows from (A2) that for every edge {v, w} of G, if the label of
v, w
is a and the label of
w, v
is b, then a + b =
. Thus, we have b1 + b2 + b3 = a3 + b3 =
. Since none of bi's is
, it follows that if x, y
VS are the children of
, then two elements (say b2 and b3) are below x, and the third (i.e. b1) is below y. Hence a1 = b2 + b3
x and therefore a1
. On the other hand we have a2 = b1 + b2 =
and a3 = b1 + b2 =
. This completes the proof.
Observe that placing the root on the edge ei results in a rooted tree as presented in Figure 2. In square bracket next to a node u we indicate the node of S to which u is mapped from Ge. The trees below each node wi are the same in all trees of Figure 2, for each i = 1,2,3, and therefore we omit them.
We will use the following classification of stars (Fig. 3). A star is said to be of type S1, if it has exactly one incoming edge labeled
and exactly two outgoing edges labeled
and these edges are connected to the three siblings of the center. A star is said to be of type S2, if it has no incoming edges labeled
(it follows from Lemma 2 that it must have exactly two outgoing edges labeled
). A star is said to be of type S3, if it has exactly one incoming edge labeled
and all outgoing edges labeled
. Finally, a star is said to be of type S4, if it has all outgoing edges labeled
and at least two incoming edges labeled
. It is easy to show that there are no other types of stars.
|
LEMMA 4
For every gene tree [satisfying (A1) and (A2)] G we have the following mutually exclusive situations.
has exactly one star of type S2 and all other stars are of type S1.
has exactly two stars of type S2. They share a common edge. All other stars are of type S1.
has exactly two stars of type S3. They share a common edge. All other stars are of type S1.
has an occurrence of a star of type S4. All other stars are of type S1 or S3.
PROOF. Suppose
does not contain stars of type S3 or S4. We will show that it has to contain a star of type S2. Suppose on the contrary that
has only stars of type S1. Take any such a star and follow a path which starts at its center and goes against the direction of the edges labeled
, i.e., move from node v to v' whenever
v', v
is labeled
. Clearly this path is unique and does not contain a cycle. So we have to arrive at a node, say u, from which we cannot continue. This must be a leaf. This is a contradiction since then
contains an edge starting from a leaf and labeled
. This proves that G must contain a star of type S2, and in fact the node u where we got stuck is the center of a star of type S2. Let u' be a node of this star such that neither
u, u'
nor
u', u
is labeled
. Observe that every node v
VG is reachable either from u or from u' by a directed path along edges labeled
. Hence all internal nodes different from u and u' are centers of stars of type S1. Clearly if u or u' is leaf, then
has only one occurrence of a star of type S2. Otherwise
has exactly two occurrences of stars of type S2 and these stars share the same edge {u, u'}.
Observe that if
contains a star of type S3 or S4, then it cannot contain a star of type S2. Suppose now that
contains a star of type S3 with center u, and no star of type S4. Let u' be such that both
u, u'
and
u' u
are labeled
. We will show that
has exactly two occurrences of a star of type S3 and these stars share the edge {u, u'}. Clearly neither u nor u' is a leaf and the outgoing edges leaving u or u' must be labeled
. This implies that we have two occurrences of a star of type S3 and they share the same edge {u, u'}. If
contains another star of type S3 and v is its center, then all edges on the path connecting u and v have labels in both directions equal
. It follows that on this path there must be at least one occurrence of a center of a star of type S4. This yields contradiction and completes the proof of the Lemma.
4.4 Duplication and loss cost analysis
In this section we will analyze how the total cost changes when we move a position of the root in G. The analysis will be done for loss cost, as well as for the gene duplication cost. We will use throughout this section the notation of a star introduced in Figure 2.
PROPOSITION 5. Consider the following two cases.
- b1 =
.
- bi
, for all i = 1, 2, 3 and let a1
(see Lemma 2).
Then in each case we have
![]() |
![]() |
and either b2 =
, or b3 =
, then
![]() |
PROOF. We start with Case (I) and use notation of Figure 2. We have b1 = a2 = a3 =
and by Lemma 1 we obtain
and thus
. Moreover, when a1 =
and b2 =
or b3 =
, then again by Lemma 2 we obtain
which implies
L = 0. It remains to compute
when b1 = a2 = a3 =
. We have
,
, and
,
. Then
![]() |
The second equality above follows from a1 = b2 + b3. Thus
L
0 and therefore
. This completes the proof of Case (I). For the proof of Case (II) we have a1
and a2 = a3 =
. Then
,
and similarly
,
. Since we have
![]() |
and
coincide on all other nodes, we conclude that
.
On the other hand we have
and
. Therefore
![]() |
L > 0 we have
. This completes the proof of Case (II).
It follows from Proposition 5 that for stars of type S1, S2 or S3 we have
L > 0, and for stars of type S4 we have
L = 0.
PROPOSITION 6. Consider the following two cases.
- b1 =
.
- bi
, for all i = 1,2,3 and let a1
(see Lemma 3).
Then in each case we have
![]() |
![]() |
and either b2 =
, or b3 =
, then
![]() |
PROOF. The proof is similar to the previous one.
It follows from Proposition 6 that for stars of type S4 we have
D = 0 and the cost distribution function is the same for any of the three undirected edges. It also follows that for a star of type S3 we have
D = 1.
4.5 Finding minimal edges
Here we present the main theoretical results of our paper. First we introduce some definitions and notation. A subset X of edges of EG is called a full subtree of G if any two nodes incident to edges of X (these nodes do not have to be incident to the same edge) are connected by unique path consisting entirely of edges from X, and every internal node of X (i.e. incident to at least two edges of X) has degree 3 (i.e. it is incident to three edges of X).
Let
, ß
R be two positive reals. In this section we will consider a weighted combination of duplication and loss costs of reconciliation. This is represented by the following cost function called a weighted mutation cost.
![]() |
= ß = 1.
THEOREM 7
Let S and G be species and gene trees (S rooted and G unrooted) and let, ß
R be any positive reals. Let
Then
- |MinG|> 1 iff
contains a star of type S4. Moreover, if |MinG|> 1, then MinG consists of all and only edges which occur in stars of type S4. All others stars in this case are of type S1 or S3.
- |MinG|> 1 iff
has exactly one pair of edges
v, w
and
w, v
whose labels are either both equal to
or both are not equal to
. For such edges we have that the unordered pair {v, u} is the only element of MinG. All stars of G which do not contain the edge {v, w} are of type S1.
- For all edges e, e'
MinG we have
- The set MinG is independent of the choice of
and ß, as long as they are positive.
- MinG is a full subtree of G.
- The edges of MinG can be found by a greedy method of gradient descent: leave a star with
through an edge {e} with a smaller
. Arriving at a star with
means that we have arrived at MinG.
PROOF. It follows from Propositions 5 and 6 that the only stars in
which have
are of type S4, whenever
. Hence |MinG| = 1 iff
has no stars of type S4. The second part of (i) follows from Lemma 4. The only if part of (ii) is equivalent, by Lemma 4, to the statement that
has no stars of type S4. This proves (ii). (iii) follows from (i), Propositions 5 and 6. Condition (iv) follows from (i), since the types of stars of
are independent of
and ß. Condition (v) follows from (i) and Lemma 4. Finally (vi) follows immediately from (i) and (ii). This completes the proof.
It follows from (ii) in the above theorem that if
has only one element, then the star which contains the edge {v, w} is either of type S2, or of type S3.
The above result can be nicely explained in terms of a landscape of a cost function. It follows that in case of more than one optimal edge, all such edges are clumped together and form a flat valley which consists of all and only stars of type S4. The boundary of this valley is built of stars of type S3 and the slopes of the surrounding hills are stars of type S1. There are no local minima in this landscape, except the global minimum located in the valley (Fig. 4). A similar picture can be drawn for the case of a unique optimal edge.
|
In Figure 4 we present an illustration of unrooted gene tree with an example of embedding one of the optimal edges. It follows from our results that there are five different optimal rootings. Moreover, all of them have the same distributions
L and
D. Also, all differences between optimal embeddings are located in the hat (Fig. 4). This fact can be formally proved for the reconciled trees. We omit the details which follow quite easily from the model. This property is important from biological point of view. It ensures that the embeddings of optimal rootings are almost identical, and thus the concept of searching optimal rooting is biologically correct also when |MinG| > 1.
4.6 Algorithm
As proved in Theorem 7 it is enough to use a greedy method of gradient descent to find an optimal edge. Thus, it is sufficient to know only the starred topology of unrooted gene tree, i.e. the labels of edges in
. First, we need an efficient algorithm for computing least common ancestors in S. There is an algorithm which requires O(N) time for preprocessing and then O(1) time for LCA-query (Bender and Farach-Colton, 2000). Note that there is no cycle between computing the labels. When a value is evaluated then it is stored in the attribute. Thus, when we refer to already computed value, we do not have to recompute it. Thus, all labels in
can be computed in O(n) steps, where n is the size of G if LCA-preprocessing is given.
Computing cost of a rooted reconciliation is linear (Ma et al., 2000). For computing number of losses, we use depth of nodes in S, which can be computed in O(1) after O(N) preprocessing, where N is the size of S. For duplications use the labels of edges in
.
Finally, computing costs of all rootings has time complexity O(max(n, N)). The same time complexity is for computing an optimal cost.
Note also that starting from an optimal edge the set of all optimal edges (and also optimal gene trees) can be easily enumerated.
| 5 EXPERIMENTS |
|---|
|
|
|---|
5.1 Optimal species tree by total cost
The algorithms described in previous sections were used to compute optimal species tree for the set of gene families of five yeasts from Dujon and Sherman (2004). In our example I = {c, k, d, y, s} where c, Candida glabrata; k, Kluyveromyces lactis; d, Debaryomyces hanseni; y, Yarrowia lipolytica and s, Saccharomyces cerevisiae. We started from the clusterings taken from Dujon and Sherman (2004). We aligned them by ClustalW. The ClustalW trees were used as unrooted gene trees. The total number of gene trees equals 4708. For our small set of species we computed all possible species trees (there are 105 rooted trees). Then, we calculate costs of reconciling 4708 unrooted gene trees with every species tree (we started our algorithm 4708 x 105 = 4 94 340 times which consumed
9 s on Athlon 1800+; program was written in C++. The summary of computations is presented in Figure 5. Here, we show only the best nine species trees. The total weighted mutation cost is shown (with weights 1).
|
The optimal species tree is (y, (d, (k, (s, c)))). It was reconciled with 4708 unrooted gene trees with the mutation cost 14 266. This breaks into 6214 gene duplications and 8052 gene losses. The optimal tree agrees with the tree computed for 25S rDNA sequences in Dujon and Sherman (2005).
Now, we can shortly compare the major evolutionary events in the hemiascomycetes shown in Dujon and Sherman (2004) with the results obtained in the present experiment.
- between s S.cerevisiae and sc (i.e. between the least common ancestor of S.cerevisiae and C.glabrata) there is 1110 (
7.7% of 14 266) of gene losses; this agrees with the duplicated gene loss as indicated in Dujon and Sherman (2004)
- 282 (4.5%) duplications between sc and ksc; the authors of Dujon and Sherman (2004) detected massive duplications and suggest whole-genome duplication; our analysis does not support this hypothesis;
- 983 (15.8%) duplications between sck and sckd; this is more probable whole-duplication location (this hypothesis requires an additional analysis);
- only 95 (1.5%) gene duplications between k and ksc; in Dujon and Sherman (2004) authors detected few duplicated blocks;
- 531 (10.7%) gene losses between k and ksc, and 1429 (10.0%) gene losses between c and sc may suggest low level of genome redundancy in C.glabrata and K.lactis; this agrees with Dujon and Sherman (2004).
5.2 Optimal species tree by voting
We applied voting algorithm (see Section 2.3) to trees from previous section. The result is summarized in Table 1. Again, the optimal species tree is the 25S rDNA tree. Moreover, the first five species trees are the same for both algorithms.
|
| 6 DISCUSSION |
|---|
|
|
|---|
This paper contains several theoretical and practical results on the problem of phylogeny reconstruction. First, we show that the model reconciled trees can be applied to unrooted gene trees. In particular, we formally prove that the concept of choosing an optimal rooting leads to biologically correct scenarios even if more than one optimal solution exists. This property, which is a contribution of the present paper, together with the efficient algorithm of computing an optimal cost or reconciliations allow us to perform fast experiments on large datasets available for yeast genomes. Our experimental results confirm (with some exceptions as shown in the previous section) the known facts from literature. We propose, therefore, in the present paper that the model of unrooted reconciliation can be used to infer correctly phylogeny of species.
In the basic approach, unrooted gene family trees generated from publicly available phylogenetic software can be treated as input to the unrooted reconciliation in order to obtain a set of optimal rootings. This can be viewed as an alternative for species outgroup methodology used commonly when the rooting is unknown.
| Acknowledgments |
|---|
The financial support for the study is provided by KBN Grant 3 T11F 021 28.
Conflict of Interest: none declared.
| REFERENCES |
|---|
|
|
|---|
Bender, M.A. and Farach-Colton, M. (2000) The LCA problem revisited. LATIN '00: In Proceedings of the 4th Latin American Symposium on Theoretical InformaticsBerlin, Germany, pp. 8894.
Bonizzoni, P., Vedova, G.D., Dondi, R. (2003) Reconciling gene trees to a species tree. Algorithms and Complexity. Proceedings of the 5th Italian Conference, LNCS CIAC 2003, Springer 2653, , pp. 120131.
Chen, K., Durand, D., Farach-Colton, M. (2000) Notung: dating gene duplications using gene family trees. Proceedings of RECOMB 2000Tokyo, Japan, pp. 96106.
Dujon, B and Sherman, D. (2004) Genome evolution in yeasts. Nature, 430, 3544[CrossRef][Medline].
Eulenstein, O., et al. (1998) Duplication-based measures of difference between gene and species trees. J. Comput. Biol, . 5, 135148[Web of Science][Medline].
Goodman, M., et al. (1979) Fitting the gene lineage into its species lineage. A parsimony strategy illustrated by cladograms constructed from globin sequences. Syst. Zool, . 28, 132163[CrossRef].
Górecki, P. and Tiuryn, J. (2005) On the structure of reconciliations. Proceedings of RCG Bertinoro'04, LNCS Springer 3388, , pp. 4254.
Guigo, R., et al. (1996) Reconstruction of ancient molecular phylogeny. Mol. Phy. Evol, . 6, 189213[CrossRef][Web of Science][Medline].
Ma, B., et al. (2000) From gene trees to species trees. SIAM J. Comput, . 30, 792852.
Mirkin, B., et al. (1995) A biologically consistent model for comparing molecular phylogenies. J. Comput. Biol, . 2, 493507[Medline].
Page, R.D.M. and Charleston, M.A. Reconciled trees and incogruent gene and species trees. Mathematical Hierarchies and Biology, DIMACS Series in Mathematics and Theoretical Computers Science, Vol. 37, 5770.
Page, R.D.M. (1994) Maps between trees and cladistic analysis of historical associations among genes, organisms, and areas. Syst. Biol, . 43, 5877[CrossRef].
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||




.













,i.e. the total number of gene duplications and gene losses required to reconcile S with Ge. MinG has five edges with cost 15 (shown in red). Below are the embedding of one of the optimal rootings obtained from the marked edge and its distributions.