Bioinformatics Advance Access originally published online on January 18, 2008
Bioinformatics 2008 24(6):826-832; doi:10.1093/bioinformatics/btn024
Inferring horizontal transfers in the presence of rearrangements by the minimum evolution criterion
1School of Computer Science, Tel Aviv University, Tel Aviv, Israel and 2School of Computer Science and Communication, Royal Institute of Technology, Sweden
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: The evolution of viruses is very rapid and in addition to local point mutations (insertion, deletion, substitution) it also includes frequent recombinations, genome rearrangements and horizontal transfer of genetic materials (HGTS). Evolutionary analysis of viral sequences is therefore a complicated matter for two main reasons: First, due to HGTs and recombinations, the right model of evolution is a network and not a tree. Second, due to genome rearrangements, an alignment of the input sequences is not guaranteed. These facts encourage developing methods for inferring phylogenetic networks that do not require aligned sequences as input.
Results: In this work, we present the first computational approach which deals with both genome rearrangements and horizontal gene transfers and does not require a multiple alignment as input. We formalize a new set of computational problems which involve analyzing such complex models of evolution. We investigate their computational complexity, and devise algorithms for solving them. Moreover, we demonstrate the viability of our methods on several synthetic datasets as well as four biological datasets.
Availability: The code is available from the authors upon request.
Contact: tamirtul{at}post.tau.ac.il
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
|
|
|---|
Many groups of eukaryotes (e.g. mammals) evolve largely through vertical lineal descent driven by local point mutations and genome rearrangements. Unlike these cases, bacteria usually also acquire genetic materials through the transfer of DNA segments across species boundaries—a process known as Horizontal Gene Transfer (HGT) (Doolittle et al., 2003). In the presence of HGTs, the evolutionary history of a set of organisms is modeled by a phylogenetic network, which is a directed acyclic graph obtained by inferring a set of edges between pairs of edges in the organismal tree to model the horizontal transfer of genetic material (Guohua et al., 2007) (Fig. 1). We call such a network a rooted phylogenetic network.
|
The evolution in viruses is even more complicated. Viral genomes are usually compact and evolve very rapidly by all the aforementioned mutations in addition to a large number of recombinations (Sinkovics et al., 1998), HGTs (Koonin et al., 2006) and rearrangements (e.g. Gompels et al., 1995). Furthermore, in this case an organismal tree usually does not exist (Sinkovics et al., 1998), thus the right model is an unrooted tree with an additional small set of undirected edges (between pairs of edges in the initial tree). We call such networks unrooted phylogenetic networks.
There are many strategies and models for dealing with non-tree-like evolution, here we briefly describe some of them. For example, Splits networks (see, e.g. Huson and Bryant, 2006) are graphical models that capture incompatibilities in the data due to various factors, not necessarily HGT or hybrid speciation. Some works describe phylogenetic networks as probabilistic models and use maximum likelihood for analyzing them (Guohua et al., 2007; Strimmer and Moulton, 2000), while others use maximum parsimony (Guohua et al., 2007), or deal with the problem by a graph–theoretic approach of reconciling species and gene trees into phylogenetic networks (Addario-Berry et al., 2003). None of the aforementioned works deal with rearrangements.
In this work, we devise a distance-based method for inferring evolution under complicated models that can involve substitutions, insertions and deletions of single nucleotides, rearrangement, HGT and recombination. We believe that in our case, where the models of evolution are complex, distance methods have advantages for three main reasons: first, sometimes the appropriate probabilistic model is not completely clear, thus, using ML is not feasible. Second, by our experience (Guohua et al., 2006, 2007), when considering a non i.i.d model ML and MP are more time consuming than distance methods. If the models include HGTs together with rearrangements these methods are not feasible. Finally, MP and ML require multiple alignment as an input, while we want to separate our method from this requirement.
The multiple alignment problem is NP-hard (Elias, 2003), and to the best of our knowledge, at some stage of the processing most methods for inferring evolutionary networks require a multiple alignment. We believe that this requirement is problematic, especially with regard to complete viral genomes. Thus, our method takes unaligned sequences as input and uses both the alignment-free Average Common Substring (ACS) (Ulitsky et al., 2006) measure and pairwise alignment as distance measures between pairs of sequences.
Boc and Makarenkov suggested a distance based method for detecting HGTs (Boc and Makarenkov, 2003), however, there are two main differences between this research and the work of Boc and Makarenkov: first, as opposed to their method, our models allow for rearrangements and recombination. Second, our models are sequence oriented (i.e. the input in our case is a set of sequences), while the approach of Boc and Makarenkov (2003) requires distance matrices as input. Consequently, our work considers a more general and realistic setting.
Our methods are based on the following basic biological observations: (i) In phylogenetic networks each nucleotide evolves according to a tree (which may be different from the organismal tree) (Hein, 1993). (ii) Closely positioned nucleotides are more likely to have evolved according to the same tree than distantly positioned nucleotides (Koonin et al., 2006). Therefore, our method infers different trees to different subset of sequences, and partitions the genomes into subsequences (each of at least a few dozens bp) and constrain the nucleotides in each subsequence to have the same evolution.
Given an organismal tree and a set of sequences (If the organismal is not part of input we estimate it from the input sequences), our method finds families of homolog subsequences and reconstructs their evolutionary history by adding reticulation edges to the organismal tree while optimizing the minimum evolution criteria.
This work does not handle gene duplication or deletion; dealing with these operations has been deferred to future works.1 However, in viruses rearrangements and recombinations are more dominant than gene deletion; these two operations that do not change the number of genes (Worobey and Holmes, 1999). Therefore, the model we use here is a good approximation of the actual one. As we demonstrate in this work, there are many interesting datasets that do not involve events such as duplication or deletion. For example, our method is useful for inferring (complete or partial) HGT events that are related to a given set of genes or proteins.
| 2 DEFINITIONS |
|---|
|
|
|---|
Let T = (V, E) be a tree, where V and E are the tree nodes and tree edges, respectively, and let L(T) denote its leaf set. Further, let X be a set of taxa (species). Then T is a phylogenetic tree over X if there is a bijection between X and L(T). Henceforth, we identify the taxa with their associated leaves and denote the set of leaf-labels with [n] = {1, ... , n}. A tree T is said to be rooted if the set of edges E is directed and there is a single distinguished internal vertex r with in-degree 0. Let S = [s1, s2, s3, ... , sn] be the sequences corresponding to the n taxa (note that these sequences may be of different lengths). A family over the set of sequences S is a set of sequences
Let D(· , ·) denote a distance measure between pairs of sequences. In this article, D(· , ·) is either the cost of the pairwise alignment or the ACS distance. Two sequences, s1 and s2, are considered d-homologous with respect to a block length L, if each sequence is longer than L, and every window of length L in their pairwise alignment has evolutionary distance < d; we denote this property DL(s1, s2) < d.
A family of d-homologous subsequences is defined as a set of subsequences S' with the following property:
. A non-overlapping set of families is a set of families such that in each sequence, subsequences from different families do not overlap. The subsequence s'
s that is part of the family f is denoted by f (s). We call the set of subsequences that are induced by a set of families a partitioning.
A rooted phylogenetic network (Guohua et al., 2006, 2007) N = N(T ) = (V',E') over the taxa set X is derived from a rooted tree T by inferring reticulation edges between pairs of edges in T. That is, each reticulation edge is inferred by adding two new vertices on two edges of E and thereafter joining the two new vertices with the directed reticulation edge. A tree edge can take part in more than one reticulation event. In a similar way, an unrooted phylogenetic network is derived from an unrooted tree by adding undirected edges to the tree. Each family f of subsequences is related to a subset of the reticulation edges, denoted M(f), which describes the evolution of the family. If a family, f, evolves along the organismal tree then M(f) =
.
A rooted phylogenetic network must satisfy additional temporal constraints, such as acyclicity (Guohua et al., 2006, 2007). Such temporal constraints do not exist in an unrooted network. Finally, we denote the set of all trees contained inside the network N (rooted or unrooted) by
(N). In the case of rooted network, each such tree is obtained, by the following two steps: (i) for each node of in-degree 2, remove one of the incoming edges, and then (ii) for every node x of in-degree and out-degree 1, whose parent is u and its child is v, remove node x and its two adjacent edges and add a new edge from u to v. In the case of unrooted networks, a tree is obtained by removing an edge from each cycle of the tree, removing each node, x, with exactly two neighbors u and v, removing the two edges that include the node x, and adding a new undirected edge, (u,v). In our setting the tree, Tf 
(N) that includes exactly all the reticulation edges in M(f) describes the evolution of the family f.
In this work, we deal with the Minimum Evolution (ME) criteria (Kidd and Sgaramella-Zonata, 1971). It is known to be consistent when using the least-squares criterion (Rzhetsky and Nei, 1993) (as in this work); meaning that it converges to the correct tree for long enough sequences. In the case of evolutionary trees, the decision variant of the problem of finding the minimum evolution tree is defined as follows:
PROBLEM 1
Input: Set of n sequences, S, that induces a distance matrix B and a real number e.Output: A tree, T, with total edge lengths less than e, while the edge lengths are least squares estimated from B. The sum of edge lengths of a tree T is the ME score of a tree; Let E(T,S,D) denote the ME score for a tree T, with a set of sequences S corresponding to its leaves, and when D is used as a distance measure between pair of sequences.
In our setting, we use the minimum evolution criterion to select the additional reticulation edges that best explain the evolution of each family of subsequences. That is, given a set of sequences S and a phylogenetic tree T, our goal is to find a set of non-overlapping families, a set of reticulation edges and a mapping relating each family to a subset of the reticulation edges (i.e. one tree for each family). These are selected with the objective of minimizing the sum of minimum evolution scores for each family and associated tree. If the set of families is F = [ f1, f2, ... , fh], the set of reticulation edges is H, the mapping is M and the pairwise distance measure between sequences is D, then we denote this score by
.
Let
and
denote two subsequences of the sequence s. We say that
precedes
(![]()

) if
ends before
begins. Under a non-rearrangement assumption, there is an order of the families, f1, f2, ... , fh, such that in each sequence, si: f1(si)
f2(si), ... , fh – 1(si)
fh(si) (Fig. 2), but this assumption is not always justified.
|
Here we deal with three variants of the problem, each related to different assumptions about the input:
- The first variant, Non-Rearrangement Given Tree (NRGT), assumes an organismal tree and that subsequences have not been rearranged. An example of such input is a set of proteins and an organismal tree of bacteria.
- The second variant, Rearrangement Given Tree (RGT), assumes an organismal tree and that subsequences may be rearranged. An example of such input is a set of genomes and an organismal tree of bacteria.
- The third variant, Rearrangement No Tree (RNT), does not assume an organismal tree and subsequences may be rearranged. An example of such input is a set of viral genomes.
| 3 HARDNESS ISSUES |
|---|
|
|
|---|
Our analysis suggests that most of the variants of the problems that were mentioned in the previous section are NP-hard. Due to lack of space the proofs appear in the supplementary while in this section we only describe the main results.
PROBLEM 2 [RGT]: Input
A set of binary sequences S, a phylogenetic tree T, two integers h, and k, a real number c, and a distance measure between pairs of sequences, D.Question: Is there a set, F, of h non-overlapping families, a set, H, of k reticulation edges, and a mapping, M, from each family to subset of H, such that the score E(T, F, H, M, D)
c.The problem RNT is defined in a similar way while the input does not include a tree T, the problem NRGT is defined in a similar way but the families must be in order. We show that the RGT and the RNT problems are NP-hard even when the number of reticulation edges is not part of the input (there are 0 reticulation edges).
THEOREM 1
RGT and RNT are NP-hard.
As mentioned, in this work we deal with minimum evolution criteria (minimum evolution tree, or MET, see Problem 1). We also suggest the following observation.
OBSERVATION 1
NP-hardness of the problem MET implies NP-hardness of NRGT (when there is no rearrangement and the tree is given).
| 4 ALGORITHMS AND PARAMETERS |
|---|
|
|
|---|
In this section, we describe our method, Find Net-Families. As can be seen in Figure 3 this method consists of three stages, each of which solves a separate computational problem and is described in one of the following subsections. As was just shown, most of the problems we deal with are NP-hard and consequently the algorithms presented here are heuristics. The input to Find Net-Families is an organismal tree. This tree is either provided by the user or generated by computing a distance matrix with the ACS (Ulitsky et al., 2006) method and then building the associated neighbor joining tree (Saitou and Nei, 1987). See Sections 3 and 4 in the Supplementary Material for more details about inferring organismal trees by the ACS method.
|
Due to running time considerations we used a parameter L that constrains the length of each subsequence in each family to be around c · L, where c is an integer (the adjusting procedure allows lengths that are up to 10% different than this constraint). The average gene length in bacteria is nearly 1 kb, and in eukaryotes it is about 1.3 kb (Xu et al., 2006). Usually, only complete genes are horizontally transferred (Delwiche and Palmer, 1996).
In the case of partial HGT or recombination, the lengths are in the order of magnitude of at least half a gene (Bergthorsson et al., 2003; Kalinina et al., 2004). Indeed, using L of few hundreds nucleotides gave good results (usually changing L from 100 to few hundreds does not change the results dramatically).
4.1 Finding d-Homologous families in two sequences
Given two sequences s1 and s2 of length
, our goal is to find d-homologous families where each block should be longer than L, and such that d is minimal. Namely, we wish to match each block in one sequence with exactly one as similar as possible block in the other sequence. In this work, we assume that there is one such unique matching. In practice, for large enough L (i.e. more than few dozen characters) and when duplications are not present this is indeed the case.
The procedure has two stages; in the first stage we search for common subsequences of length at least W, where W has to be tuned with respect to the input sequences. In general, too small W (e.g. W = 2) is not specific enough, since we expect that both non-homolog and homolog sequences share common subsequences of length W. On the other hand, too large W (e.g. W = 100) is also problematic, since even homologue sequences do not share such long subsequences. In practice, 12 < W < 18 gave good results for nucleotides and 7 < W < 12 gave good results in the case of amino acids.
Let Si (s1, s2) denote the longest substring that starts in position i in s1 and appears in the two sequence (we assume that |Si(s1, s2)| = O(1)). In the first stage, we performed the following steps: (i) Generate the suffix array for s1. (ii) Scan s2, in each position i, find the longest subsequence that starts in that position and appears in both sequence, Si(s1, s2). (iii) If |Si(s1, s2)| > W keep the position and the length of the matching substring.
In our implementation, we used the lightweight suffix array of (Burkhardt and Kärkkäinen, 2003; Ulitsky et al., 2006) which is constructed in time O(
log(
)). Step (ii) of the algorithm above, for each position, i, can be accomplished in log(
) time by performing lexicographic binary search for s2(i) in the suffix array of s1.
After the first stage we have a set of position-pairs for each common substring longer than W. In the second stage, we map each overlapping window of length L in the first sequence with the window in the second sequence which has the maximal sum of lengths of common substrings. We call each such match the core of a family f
F. Finally, we greedily adjust the boundaries of each family by adding/removing small blocks at the ends of the windows while optimizing minF; s'(i),s'(j)
FDL(s'(i),s'(j)), such that in the end of this stage the two strings have been partitioned into families that cover all of the sequences. The runtime complexity of this stage is O(
2) for each pair of sequences. Thus, the total runtime complexity for n sequences is O(
2 ·
).
4.2 Finding a family of d-homologous subsequences
From the previous stage we have a d-homologous partitioning for each
pairs of sequences. In this stage, the aim is to expand these pairwise matchings to families of d-homologous subsequences with minimal d that cover all the n input sequences. As mentioned before, we assume that each window of length close to L in a sequence has exactly one homolog in each of the other sequences, an assumption which is supported by our biological inputs.
We examine the expansion of each of the
partitionings of pairs of sequences to a partitioning over all the n sequences. This is done by the following steps: (1) For each of the
partitionings of pairs of sequences.
- Start with one partitioning.
- The k-th (k
n – 1) step: Greedily add another sequence to the partitioning of k – 1 sequences that was generated in the previous step. This is done by checking consecutive overlapping windows of length L, and for each family choosing a non-overlapping window(s) (i.e. a subsequence of the new sequence) that includes the maximal sum of lengths of common subsequences that appear in the other members of this family in the k – 1 previous sequences.
(2) Chose the expansion that minimizes minF; s'(i),s'(j)
FDL(s'(i),s'(j)).
The runtime complexity of this stage is O(
· n2) for each pair of sequences. Thus the total runtime complexity for n sequences is O(
· n2 ·
).
4.3 Adding reticulation edges and refining the partitioning to families
In this section, we describe how to find the set of reticulation edges that are related to each family. In this stage, we assume a given initial (organismal) tree and a set of d-homologous families. Each family induces a distance matrix. Our procedure greedily chooses one of the families and adds a new reticulation edge that is related to that family. In each such step the size of the set of reticulation edges that is related to one of the families is increased by one.
We plot a graph of the improvement in the ME score after each such step. Such a graph can help biologists to decide the actual number of reticulation edges. Our simulation study usually shows dramatic improvement in the ME score after adding each of the right reticulation edges, while the improvement in the ME score when adding additional reticulation edges (after adding the right ones) is relatively insignificant (see, e.g. Fig. 4E).
|
We use least-square estimation to calculate the edge lengths of a given tree topology (an organismal tree and a set of reticulation edges) with a set of sequences at its leaves (a family that induces a distance matrix). This can be done in the time complexity of an n x n matrix inversion (Saitou and Nei, 1987), less than O(n3). By using the more sophisticated method of (Gascuel 1997) the least square estimation of the edge lengths of a given tree and distance matrix can be done in O(n2).
After each stage of adding a reticulation edge, we perform a stage of greedily adjusting the boundaries of the families (by increasing or decreasing the boundaries of each subsequences in each family) while improving the ME criteria. Since after each such stage the ME criteria is improved, a convergence to a local optima is guaranteed. The time complexity of this stage is (h2 · f ·n) · n2.
4.4 Total time complexity
Suppose the input includes n sequences of length
, and the result includes h families each with f reticulation edges. The total runtime complexity of our method is
·
2 +
· n2 ·
+ (h2 · f · n) · n2 = O(n2 · (
2 + n · h2 · f + n2 ·
)).
| 5 EXPERIMENTAL RESULTS |
|---|
|
|
|---|
For evaluating our methods we performed three experiments; first (Sections 5.1 and 5.2), we implemented our method on two biological dataset (bacterial rbcL proteins, and the plants gene rps11). Previous works (Bergthosson et al., 2003; Boc and Makarenkov, 2003; Delwiche and Palmer, 1996; Guohua et al., 2006, 2007) suggested that these sets underwent horizontal gene transfer (since these two sets include genes/proteins we should not expect rearrangements). The goal of this experiment was to compare the outputs of our method the HGT events suggested by previous works. In the second experiment (Section 5.3), we simulated evolution that included HGTs/recombinations, rearrangements, and local point mutations. For this dataset, a quantitative evaluation of our method can be supplied. Finally (Section 5.4), we applied our method on two datasets of viral genomes. However, these two datasets had not previously been analyzed.
5.1 Proteins of bacteria
The first input includes the rubisco gene rbcL of a group of 14 plastids, cyanobacteria, and proteobacteria, see Supplementary Figure 1. This dataset was first analyzed by Delwiche and Palmer (1996), they and other suggest that it includes HGTs. This dataset consists of amino acid sequences, part of them are from Form I of rubisco, and the other six are from Form II of rubisco. We used exactly the same sequences that Delwiche and Palmer used in their paper. The species tree was based on information from the ribosomal database project (http://rdp.life.uiuc.edu) and the work of (Delwiche and Palmer, 1996). We checked two distance matrices, PAM250 and Blosum62, both with gap penalty—8. Since this dataset includes a set of proteins we constrained the families to be ordered (the NRGT problem). We checked various sizes of L, but due to lack of space we will describe only the results for L = 250 (the results in the other cases were similar). The results are described in Supplementary Figure 1. We got similar results for the two distance matrices; which indicates that our method is robust to changes in the distance matrix. In general, our results support previous results that analyzed this dataset (Boc and Makarenkov, 2003; Delwiche and palmer 1996; Guohua et al., 2006, 2007). For example, the reticulation edge between
and β proteobacteria, and the edge between the proteobacteria and the plastid, were discovered by previous methods (see Supplementary Fig. 1).
5.2 Genes of plants
The second database includes the ribosomal protein gene rps11 of a group of 47 flowering plants, which was first analyzed by Bergthorsson et al. (2003) (they and others suggest that this dataset includes partial HGT). See Supplementary Figure 2.
The species tree was reconstructed based on various sources, including the work of Michelangeli et al. (2003) and Judd and Olmstead (2004). We used exactly the same sequences that Bergthorsson et al. used in their paper. The results for L = 150 are described in Supplementary Figure 2. (We also checked L = 100 and L = 50 and got similar results.) By Bergthorsson et al., these species underwent chimeric HGT (e. g. partial HGT), this conjecture is supported by our results which relate all the HGT to the family in positions 150 through 300. In general, our HGTs suggest transfer of genetic material between Liliopsida and Dipsacales, Liliopsida and Papaveraceae, and Ranunculales and Dipsacales. The first two HGTs are similar to HGTs reported in previous works (e.g. in Guohua et al., 2007), while the third is new and suggests further biological research (See Supplementary Fig. 2).
5.3 Simulating HGT/recombination, rearrangements, and local point mutations
Here we evaluate the accuracy of our method on simulated data. The data consists of sequences which have evolved through substitutions, insertions, deletions and lateral transfers. We checked two models of nucleotide evolution: The Jukes–Cantor (JC) model and the Hasegawa, Kishino and Yano (HKY) model. For each model, we generated 20 datasets with 10 leaves and 20 datasets with 20 leaves using the following recipe (Fig. 4). (1) The species tree was generated using a regular birth death process from the Beep software package (Arvestad et al., 2006). These trees are ultra metric 2 with a root to leaf distance of 1. (2) Three transfer trees were independently created from the species tree by applying two random lateral transfers. Each transfer event was chosen to occur at time t
[0,1] with probability
, i.e the probability increases linearly with the number of concurrent lineages. Once t was selected the transfer was selected uniformly at random from the possible transfer events at time t. (3) Species sequences, the sequences which have evolved according to the species tree, of expected length 4000 (similar to the typical length of genome virus, which is few kbp) were generated using the ROSE (Stoye et al., 1998) software package. We checked two models of nucleotide evolution: The JC model and the HKY model with a transition transvertion ratio of 2. The substitution probability was 0.2 from the root down to any leaf. Moreover, in each nucleotide insertions and deletions of up to seven nucleotides (the standard insertion and deletion functions in ROSE were used) occurred with probability 0.01 from the root down to any leaf. (4) Transfer blocks, the sequences which have evolved according to the transfer trees, of expected length 500 were generated using the same process as for the species sequences (The typical length of a gene is few hundreds nucleotides, usually complete gene are horizontally transferred (Delwiche and Palmer, 1996). In the case of partial HGT or recombination, the lengths are in the order of magnitude of at least half a gene (Bergthorsson et al., 2003; Kalinina et al., 2004), i.e. few hundreds bp. Thus, we transfers blocks with similar size) (5) The combined sequences, the sequences containing both the species sequences and the transfer blocks, where created by inserting the transfer blocks uniformly at random into the species sequences such that no evolutionary block in the sequences was shorter than 500.
We ran our algorithm with 380 < L < 600, and with W = 15 (the results for 12
W
18 where similar). For each of the 20 datasets of each size, there are three blocks of length about 500 that were transferred (while the rest of the sequences evolve in the original tree). Thus, there were seven families for each dataset with a total of 140 families (for both the 10 and 20 leaf test sizes). Moreover, each family had been affected by two HGT events. Thus, there was a total of 120 HGT events (for both the 10 and 20 leaf test sizes).
5.4 Simulation results
In all cases the algorithm did not completely miss any family, about 5% of the families were shifted by more than 50 positions and no more than 300 positions.
Let X1, Y1, X2, Y2 denote branches in the organismal tree, and let (Xi,Yi) denote a HGT from branch Xi to branch Yi (i = 1 or 2). Let d(X1, X2) denote distance (measured as the number of branches) between the X1 and X2 in the organismal tree. For evaluating the ability of the algorithm to infer HGTs, we defined the a distance between pair of HGTs (X1,Y1) and (X2,Y2) (one appear in the model network and one is inferred) as D((X1,Y1),(X2,Y2)) = d(X1,X2) + d(Y1,Y2)/2.
For example, if (X1, Y1) and (X2, Y2) are identical D((X1, Y1), (X2, Y2)) = 0, and if X1 is one branch away from X2 and Y1 = Y2 D((X1, Y1), (X2, Y2)) = 0.5. Figure 5 depicts the distribution of these distances between the true and the inferred HGTs for the four datasets (JC/HKP model with 20 or 10 leaves). The graphs show that in all the four datasets the algorithm found more than 80% of the HGTs. In rest of the cases, the inferred HGTs were very similar to the real ones. For example, in all the four datasets more than 50% of inferred HGTs that were non-identical to the true ones had distance
1 from the true HGTs. Considering the complexity of the problem we think that these are very encouraging results.
|
Most of the cases where the performances of the method were relatively poor involved organism trees with long branches. This phenomena is not surprising, and was studied before in the context phylogenetic tree reconstruction by distance methods or by maximum parsimony (Felsenstein, 1978; Susko et al., 2004). It is clear that the phenomena is also relevant in our case.
One important goal of the method is its ability to infer the right number of HGT events. By the results, our method performed very well in achieving this goal. Usually after adding the correct number of reticulation edges the improvement in the score is negligible (e.g. see Fig. 4). This is a major advantage compared to methods such as MP or ML when sites are independent, where usually there is less clearer change in the slope of the score graph (Guohua et al., 2006, 2007).
5.5 Genome of viruses
Our last datasets include complete genomes of two RNA viruses, one of HIV the and other of Hepatitis C. We checked our method on these two typical inputs. The genomes were downloaded from Kuiken et al., (2005), and each dataset included 10 genomes (See Supplementary Fig. 3).
We used our method to check if the datasets include HGTs/recombination and/or rearrangement, since part of the viruses in the datasets are relatively evolutionary close (all from the same type, some of the same subtype), it is not clear whether the datasets include such events. The data was downloaded from Kuiken et al., (2005) (see Supplementary Fig. 3 for more details about the genomes), the average length of each genome was close to 10 000 bp. In both cases, the initial tree was generated by ACS and NJ (Felsnstein, 1993), we show here the results for L = 700, and W = 15 (we got similar results for other sets of parameters). For the HIV dataset, our method did not find HGT events nor did it find rearrangement events. In the case of Hepatitis C we found two possible reticulation edges that may suggest an ancient recombination or horizontal gene transfer events (see Supplementary Fig. 3). The difference between these two datasets can be explained by the fact the HIV is a relatively new virus (AIDS was first discovered in 1981), while by estimations Hepatitis C has existed for thousands of years (Smith et al., 1997).
| 6 CONCLUDING REMARKS AND FURTHER RESEARCH |
|---|
|
|
|---|
In general, genomic material evolves through local point mutations (insertion, deletion, substitution), genome rearrangements, horizontal gene transfers, recombinations, duplications and deletions. This work is a step towards developing a method for inferring evolution under all these types of operations, and it is mainly a proof of concept. We showed that our method, which is based on the ME criterion, is useful for inferring partial or complete HGT events, and can infer rearrangements together with HGTs or recombinations.
One work on this new topic is clearly not enough for solving all the problems. Further research in this direction will include: extending the set of operations to include duplications, deletions, and inversions; developing a more sophisticated simulator of genome evolution; investigating the hardness of NRGT (in this work we proved the hardness RGT and RNT); and improving the running time of our heuristic. We are currently aiming at using our approach for exploring the evolution of various groups of viruses and bacteria.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
We would like to thank Prof. Benny Chor for helpful discussions. T.T. was supported by the Edmond J. Safra Bioinformatics program at Tel Aviv University.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Keith Crandall
Preliminary version of this work was presented in WABI07. ![]()
1 In bacteria there are cases where up to 40% of genes are duplicated. Our method can be used in these cases only if it is restricted to known single-copy genes. ![]()
2 We used ultrametric trees since it allows the correct and easy modeling the HGT. ![]()
Received on June 7, 2007; revised on October 16, 2007; accepted on January 14, 2008
| REFERENCES |
|---|
|
|
|---|
Addario-Berry L, et al. Towards identifying lateral gene transfer events. In: PSB03. (2003) 279–290.
Arvestad L, et al. Beep software package (2006).
Bergthorsson U, et al. Widespread horizontal transfer of mitochondrial genes in flowering plants. Nature (2003) 424:197–201.[CrossRef][Medline]
Boc A, Makarenkov V. New efficient algorithm for detection of horizontal gene transfer events. In: WABI. (2003) 190–201.
Burkhardt S, Kärkkäinen J. Fast lightweight suffix array construction and checking. Baeza-Yates R, Chávez E, Crochemore M, eds. (2003) Proceedings of 14th Symposium on Combinatorial Pattern Matching (CPM'03). Berlin Heidelberg: Springer-Verlag. 55–69.
Delwiche C, Palmer J. Rampant horizontal transfer and duplicaion of rubisco genes in eubacteria and plastids. Mol. Biol. Evol (1996) 13:873–882.[Abstract]
Desper R, Gascuel O. Fast and accurate phylogeny reconstruction algorithms based on the minimum-evolution principle. J. Comp. Biol (2002) 9:687–705.[CrossRef]
Doolittle WF, et al. How big is the iceberg of which organellar genes in nuclear genomes are but the tip? Phil. Trans. R. Soc. Lond. B. Biol. Sci (2003) 358:39–57.
Elias I. Settling the intractability of multiple alignment. In: ISAAC. (2003) 352–363.
Felsenstein J. Cases in which parsimony and compatibility methods will be positively misleading. Syst. Zool (1978) 27:27–33.
Felsenstein J. Phylip (phylogeny inference package) version 3.5c. Distributed by the author. (1993) Seattle: Department of Genetics, University of Washington.
Gascuel O. Concerning the NJ algorithm and its unweighted version UNJ. In: Mathematical hierarchies and Biology.—Mirkin B, et al, eds. (1997) DIMACS Series in Discrete Mathematics and Theoretical Computer Science 37. Amerrican Mathematical Society, Providence, R.I.
Gompels UA, et al. The DNA sequence of human herpesvirus-6: structure, coding content, and genome evolution. Virology (1995) 209:29–51.[CrossRef][Web of Science][Medline]
Hein J. A heuristic method to reconstruct the history of sequences subject to recombination. J. Mol. Evol (1993) 36:396–405.[Web of Science]
Huson DH, Bryant D. Application of phylogenetic networks in evolutionary studies. Mol. Biol. Evol (2006) 23:254–267.
Guohua J, et al. Maximum likelihood of phylogenetic networks. Bioinformatics (2006) 22:2604–2611.
Guohua J, et al. Inferring phylogenetic networks by the maximum parsimony criterion: A case study. Mol. Biol. Evol (2007) 24:324–337.
Judd WS, Olmstead RG. A survey of tricolpate (eudicot) phylogenetic relationships. Am. J. Bot (2004) 91:1627–1644.
Kalinina O, et al. Full-length open reading frame of a recombinant Hepatitis C virus strain from St Petersburg: proposed mechanism for its formation. J. Gen. Virol (2004) 85:1853–1857.
Kidd KK, Sgaramella-Zonta LA. Phylogenetic analysis: concepts and methods. Am. J. Hum. Genet (1971) 23:235–252.[Web of Science][Medline]
Koonin EV, et al. The ancient virus world and evolution of cells. Biol. Direct (2006) 1.
Kuiken C, et al. The los alamos HCV sequence database. Bioinformatics (2005) 21:379–84.
Michelangeli FA, et al. Phylogenetic relationships among Poaceae and related families as inferred from morphology, inversions in the plastid genome, and sequence data from mitochondrial and plastid genomes. Am. J. Bot (2003) 90:93–106.
Rzhetsky A, Nei M. Theoretical foundation of the minimum-evolution method of phylogenetic inference. Mol. Biol. Evol (1993) 10:1073–1095.[Abstract]
Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol (1987) 4:406–25.[Abstract]
Sinkovics J, et al. The origin and evolution of viruses (a review). Acta Microbiol Immunol Hung (1998) 45:349–390.[Medline]
Smith DB, et al. The origin of hepatitis C virus genotypes. J. Gen. Virol (1997) 78:321–328.[Abstract]
Stoye J, et al. Rose: generating sequence families. Bioinformatics (1998) 14:157–163.
Strimmer K, Moulton V. Likelihood analysis of phylogenetic networks using directed graphical models. Mol. Biol. Evol (2000) 17:875–881.
Susko E, et al. On inconsistency of the neighbor-joining, least squares, and minimum evolution estimation when substitution processes are incorrectly modeled. Mol. Biol. Evol (2004) 21:1629–1642.
Ulitsky I, et al. The average common substring approach to phylogenomic reconstruction. J. Comp. Biol (2006) 13:336–350.[CrossRef]
Worobey M, Holmes EC. Evolutionary aspects of recombination in RNA viruses. J. Gen. Virol (1999) 80:2535–2543.
Xu L, et al. Average gene length is highly conserved in prokaryotes and eukaryotes and diverges only between the two kingdoms. Mol. Biol. Evol (2006) 23:1107–1108.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||


