In search of lost introns
rös 1,*1Department of Computer Science and Operations Research, Université de Montréal, Québec, Canada, 2Department of Computer Science, Saint John's University and the College of St. Benedict, Collegeville, MN, USA and 3National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Many fundamental questions concerning the emergence and subsequent evolution of eukaryotic exon–intron organization are still unsettled. Genome-scale comparative studies, which can shed light on crucial aspects of eukaryotic evolution, require adequate computational tools.
We describe novel computational methods for studying spliceosomal intron evolution. Our goal is to give a reliable characterization of the dynamics of intron evolution. Our algorithmic innovations address the identification of orthologous introns, and the likelihood-based analysis of intron data. We discuss a compression method for the evaluation of the likelihood function, which is noteworthy for phylogenetic likelihood problems in general. We prove that after O(n
) preprocessing time, subsequent evaluations take O(n
/log
) time almost surely in the Yule–Harding random model of n-taxon phylogenies, where
is the input sequence length.
We illustrate the practicality of our methods by compiling and analyzing a data set involving 18 eukaryotes, which is more than in any other study to date. The study yields the surprising result that ancestral eukaryotes were fairly intron-rich. For example, the bilaterian ancestor is estimated to have had more than 90% as many introns as vertebrates do now.
Availability: The Java implementations of the algorithms are publicly available from the corresponding author's site http://www.iro.umontreal.ca/~csuros/introns/. Data are available on request.
Contact: csuros{at}iro.umontreal.ca
| 1 INTRODUCTION |
|---|
|
|
|---|
Typical eukaryotic protein-coding genes contain introns, which are removed prior to translation. Key constituents of the spliceosome, which is the RNA–protein complex that performs the intron excision, can be traced back (Collins and Penny, 2005) to the last common ancestor of extant eukaryotes (LECA). Even deep-branching lineages (Nixon et al., 2002; Va
á
ová et al., 2005) have introns. It is thus almost certain that spliceosomal introns were present in LECA. Moreover, when comparing distant eukaryotes, intron positions often agree (Rogozin et al., 2003). The similarity is likely due more to conservation of early introns than to parallel gains (Roy and Gilbert, 2005; Sverdlov et al., 2005). It is thus compelling to use genome-scale comparisons to study intron evolution in different lineages, and even to estimate the exon–intron organization in extinct ancestors. One of the first such studies, by Rogozin et al. (2003), involved orthologous gene sets in eight eukaryotes. The same data set was re-analyzed by different authors (Carmel et al., 2005; Cs
rös, 2005; Nguyen et al., 2005; Roy and Gilbert, 2005), using novel methods developed for intron data. Subsequent inquiries (Coulombe-Huntington and Majewski, 2007; Nielsen et al., 2004; Roy and Penny, 2006, 2007) attest to a renewed interest in understanding the specifics of intron evolution within different eukaryotic lineages. This article introduces novel computational techniques for the analysis of spliceosomal intron evolution, anticipating more large-scale studies to come. Section 2 describes an alignment method for intron-annotated protein sequences, as well as a segmentation method for identifying conserved portions of a multiple alignment. Section 3 describes a likelihood framework in which intron evolution can be analyzed in a theoretically sound manner. Section 4 scrutinizes a compression technique that accelerates the evaluation of the likelihood function. Fast evaluation is particularly important when the likelihood is maximized in a numerical procedure that computes the likelihood function with many different parameter settings. Section 5 describes two applications of our methods. In one application, intron-aware alignment was used to validate some unexpected features of intron evolution before LECA. In the second application, we analyzed intron evolution in 18 eukaryotic species. We found evidence of intron-rich early eukaryotes and a prevalence of intron loss over intron gain in recent times.
| 2 IDENTIFICATION OF ORTHOLOGOUS INTRONS |
|---|
|
|
|---|
2.1 Intron-aware alignment
Orthologous introns can be identified by using whole genome alignments in the case of closely related genomes (Coulombe-Huntington and Majewski, 2007). For distant eukaryotes, however, intron orthology can only be established through protein alignments (Rogozin et al., 2005). The usual procedure is to project intron positions onto an alignment of multiple orthologous proteins. If introns in different species are projected to the same alignment position, then the introns are assumed to be orthologous.
Introns can also serve as checkpoints in a protein alignment. If the protein sequences are annotated with the intron positions, then the alignment can exploit intron homology by rewarding aligned introns and penalizing the juxtaposition of an intron with an intronless codon. In addition to appropriate scoring, alignment algorithms need to be modified to cope with phase-0 introns, which fall between codons. Incorporating intron annotation should lead to better alignments at the amino acid level (Fig. 1), and to a more reliable identification of orthologous introns.
|
First, consider the case of aligning two protein sequences, S and T, which are annotated with the intron positions. Every residue has two associated intron sites (after the first and second nucleotides of their codons), and there is an intron site between consecutive amino acid positions (phase-0 introns). Intron sites may or may not contain introns. We use the notation for S[i: 0] for the phase-0 site preceding the codon for amino acid i, and S[i: 1], S[i: 2] for phase-1 and phase-2 sites within the codon. Intron presence is encoded by 1, and intron absence is encoded by 0 throughout this article. The intron annotation is specified by the variables S[i: j ]
{1, 0}.
Phase-1 and phase-2 intron sites are automatically placed by their associated amino acids, but the placement of phase-0 introns is not fixed with respect to gaps. It is not possible to simply add a new character representing phase-0 introns because they affect gaps differently from amino acids. Scores for aligned intron sites are given by a 2 x 2 matrix
. If M is the amino acid scoring matrix, then the alignment of S[i ] and T[ j ] entails a score of
with
. Similarly, aligning S[i] with an indel implies a score of
, in addition to possible gap opening and closing penalties.
The intron scores
can be based on log-likelihood ratios in a probabilistic model (Durbin et al., 1998). For an aligned intron site, let s, t
{1, 0} denote the intron states in S and T, respectively. The score
should be proportional to
, where p denotes the joint distribution of states at orthologous sites, and
S,
T are the prior intron state probabilities.
In order to assess the strength of intron-match signals, and to choose reasonable scores, we used the data set of Rogozin et al. (2003). We estimated p and
using relative frequencies of intron absence-presence patterns in the 488 157 possible intron sites, and then computed log-odds scores for intron matches. The intron score is asymmetric and varies with evolutionary distance and intron conservation. Intron-less sites have an insignificant score, but shared introns have a high score, such as 304 (human–Plasmodium), 317 (human–Arabidopsis) or 515 (Drosophila–Anopheles) on a 1/60-bit scale. Shared introns thus give a signal comparable to strong amino acid matches: in the 1/60-bit scaled version of the VTML240 matrix (Müller et al., 2002), a tryptophan match scores 289. Discordant intron sites have similar scores to amino acid mismatches.
We added intron scoring into a multiple alignment framework, using a sum-of-pairs scoring policy with affine gap– scoring. It is NP-hard to find the alignment of two multiple alignments under these optimization criteria (Ma et al., 2003). Even the alignment of a single sequence to a multiple alignment necessitates sophisticated techniques (Kececioglu and Zhang, 1998). Our solution therefore uses a gap-counting heuristic: namely, a gap-open penalty is triggered for an indel aligned with an amino acid if the indel is preceded by an amino acid or a phase-0 intron. Gap opening thus corresponds to a pattern
or
. Here,
are intron states for the phase-0 site, and ? denotes either state. In addition,
denotes an arbitrary amino acid,
is the indel character, and
is either of the latter two. We implemented affine gap-scoring by separate gap-open and -close penalties, so that gaps at the alignment extremities can be penalized less severely. Gap closing is counted for the patterns
and
.
Table 1 gives the recurrences for a dynamic programming algorithm that aligns an intron-annotated sequence S to a multiple alignment P of h intron-annotated protein sequences. In order to simplify the presentation, we represent the sequences in such a way that every odd position of S and P is a regular residue or alignment column, annotated with information on the presence of phase-1 and phase-2 introns, whereas every even position is a phase-0 intron site. We use
to denote the sum of intron-match scores for the intron sites associated with the positions S[i ], P[ j ]. We use also the shorthand
to denote scoring for the alignment of an amino acid x with an amino acid profile y. The algorithm uses three types of variables,
and
, which correspond to partial prefix alignments ending with aligned residues, gaps in S, or gaps in P, respectively. In case of
, the last indel must be aligned with an amino acid column, and, thus j must be odd; for
, i must be odd.
|
Gaps are scored by using affine penalties, with gap-open, -extend, and -close scores, denoted by
(<),
(–),
(>). The gap-counting heuristic implies that gap scores in the equations of Table 1 are defined by the number of certain patterns in up to three consecutive alignment columns. Table 2 lists the patterns for the gap-counting heuristic.
|
Table 1 does not show the initialization of the variables, nor the final gap-counting: they employ a logic analogous to the recurrences. At the end of the algorithm, the best of
We implemented the algorithm in Java. The program iteratively realigns one sequence at a time to the rest of the sequences in a multiple alignment. Instead of sequence-dependent intron match-mismatch scores, the implementation uses only two parameters: one for intron conservation and another for intron-state mismatch.
2.2 Identification of conserved blocks
In order to reliably identify orthologs, we need to be able to distinguish regions of the alignment that are highly conserved from those that are less conserved. In poorly conserved regions, we cannot confidently infer intron orthology.
Zhang et al. (1999) proposed postprocessing pairwise sequence alignments into alternating blocks that score above a threshold parameter
or below (–
). We attained a similar goal by adapting algorithmic techniques from Cs
rös (2004). The procedure separates a multiple alignment into alternating high- and low-scoring regions. Using a complexity penalty
, a segmentation with k high-scoring regions has a segmentation score of A – k
where A is the sum of scores for the aligned columns. Column scores are computed without gap-open and -close penalties. The best segmentation of an alignment of length
can be found in O(
) time after the column scores are computed.
The result of this computation is that the total of column scores in each selected high-scoring region will be greater than
. There may be small subregions of negative scores, but the total score of such a subregion cannot be less than (–
). Conversely, unselected regions score below (–
) and cannot have subregions scoring above
.
| 3 A LIKELIHOOD FRAMEWORK |
|---|
|
|
|---|
3.1 Markov models of evolution
We use a Markov model for intron evolution, as in previous studies (Cs
rös, 2005). For the sake of generality, we describe the Markov model (Felsenstein, 2004; Steel, 1994) over an arbitrary alphabet
(u), which is called its state or label, that takes values over
(u). Along every path away from the root, the node states form a Markov chain. The distribution is thus determined by the root probabilities (
(a) : a
b) : a, b
= (
(u) : u
X). An input data set (or sample) consists of independent and identically distributed (iid) characters:
= (
i : i = 1, ... ,
).
In case of intron evolution, introns are generated by a two-state continuous-time Markov process with gain and loss rates
e, µe
0 along each edge e. The edge length is denoted by te. Using standard results, the transition probabilities on edge e with rates
e =
, µe = µ and length te = t can be written as
, and pe(0
0) = 1 – pe(0
1), pe(1
1) = 1 – pe(1
0). In the absence of established edge lengths, we fix the edge length scaling in such a way that
e + µe = 1. Independent model parameters are thus
(1), µe and te for all edges e. It is important to allow for branch-dependent rates, since loss and gain rates vary considerably between lineages (Jeffares et al., 2006; Roy and Gilbert, 2006).
In a maximum likelihood approach, model parameters are set by maximizing the likelihood of the input sample. Let x = (x1, ... , x
) be the input data. Every xi is a vector of n states, and we write xi(u) to denote the observed state of leaf u. By independence, the likelihood of x is the product
. Each character's likelihood can be computed in O(n) time, using a dynamic programming procedure introduced by Felsenstein (1981). The procedure relies on a pruning technique, which consists of computing the conditional likelihoods Lu(a) for every node u and letter a. Lu(a) is the probability of observing the leaf labelings in the subtree of u, given that u is labeled with a. The likelihood for the character x equals
.
Intron data are somewhat unusual in that an all-0 character is never observed: the input does not include sites in which introns are absent in all of the organisms. To resolve this difficulty, we employ a correction technique proposed by Felsenstein (1992) for restriction sites. (Cs
rös (2005) describes an alternative technique based on expectation maximization.) We compute the likelihood under the condition that the input does not include all-0 characters. We use therefore the corrected likelihood
and maximize L' =
i L'(xi).
3.2 Ancestral events in intron evolution
Our goal is to give a reliable characterization of the dynamics of intron evolution. In particular, we aim to give estimates for intron density in ancestral species, and for intron loss and gain events on the edges. Notice that the estimation method needs to account for ancestral introns even if they got eliminated in all descendant lineages. It is possible to do that with the help of conditional expectations, which fit naturally into a likelihood framework.
For an observed character x, we define upper conditional likelihoods Uu(a) so that Uu(a)Lu(a) =
{
= x,
(u) = a}. Upper conditional likelihoods are computed with dynamic programming, from the root towards the leaves, in O(n
) time (Cs
rös, 2005), even for non-binary trees. Similar computations are routinely used in DNA and protein likelihood maximization programs (Adachi and Hasegawa, 1995; Guindon and Gascuel, 2003). Here we allow irreversible probabilistic models, which explains why Uu(a) must be computed in a top-down fashion in (1).
The posterior probability for the state at node u is
|
|
|
|
Now, the number of ancestral introns is estimated as the conditional expectation
. The formula takes into consideration unobserved intron sites, by estimating their number
as the mean of a negative binomial random variable. The number of intron state changes is estimated as Nv(a
b) =
. In particular, Nv(1
0) is the number of introns lost, and Nv(0
1) is the number of introns gained on the edge leading to v.
In order to compute U, we initialize Uroot(a) =
(a). On every edge uv,
|
| (1) |
| 4 RAPID COMPUTATION OF THE LIKELIHOOD |
|---|
|
|
|---|
There are many heuristics that accelerate likelihood-based phylogenetic reconstruction (Friedman et al., 2002; Guindon and Gascuel, 2003), which mostly concentrate on the exploration of the tree space. We propose an improvement to the evaluation of the likelihood function, which normally takes linear time in the input size. Our evaluation method yields an O(log
) speedup for typical trees. We use it to optimize the parameters of intron evolution on a known tree, but the method is generally applicable to phylogenetic problems where the likelihood is numerically optimized. The key idea is that it is enough to carry out the pruning algorithm once for every different labeling within a subtree. Different labelings within each subtree can be computed in a preprocessing step. Subsequent evaluations of the likelihood function with different model parameters are faster, and depend on the number of different labelings in the data set.
Here we describe how the preprocessing step can be carried out in O(n
) time. Secondly, we analyze the computational complexity of subsequent evaluations, and show that an O(log
) speedup is achieved for almost all random trees in the Yule–Harding model. The latter analysis produces some novel concentration results on the number of subtrees with a fixed size k.
The idea of identifying different subtree labelings was articulated by Larget and Simon (1998), without linear-time preprocessing or a theoretical result on the improvement in computation time. Kosakovsky Pond and Muse (2004) proposed a preprocessing method based on heuristic optimization of a traveling salesman problem arising from the comparison of subtree labelings between different characters. Linear-time preprocessing is proposed in a similar context by Stamatakis et al. (2002). Specifically, they proposed identifying characters in which leaves in a subtree have identical labels. The technique relies entirely on the fact that alignments of closely related sequences exhibit high levels of identity, and cannot be extended to non-identical labelings.
4.1 Yule–Harding model
The Yule–Harding distribution is encountered in random birth and death models of species and in coalescent models (Felsenstein, 2004), and is thus one of the most adequate random models for phylogenies. In one of the equivalent formulations, a random tree is grown by adding leaves one by one in a random order. The leaves are first numbered by using a random uniform permutation of the integers 1, 2, ... , n. Leaves are joined to the tree in an iterative procedure. In Step 1, the tree is just leaf 1 on its own. In Step 2, the tree is a cherry with leaves 1 and 2. In each subsequent step i = 3, ... , n, a random leaf Yi is picked uniformly from the set {1, 2, ... , i – 1}. The new leaf i is added to the tree as the sibling of Yi, forming a cherry: a new node is placed on the edge leading to Yi and i is connected to it. The resulting random tree in iteration n has the Yule–Harding distribution (Harding, 1971).
4.2 Preprocessing
The likelihood can be computed faster by first identifying the different xi values, along with their multiplicity in the data. For large trees, the input typically consists of many different labelings, but for small trees with n > logr
), the compression is useful, since the number of different xi values is bounded by rn <
. In order to exploit the benefits of compression, we extend it to every subtree.
DEFINITION 1
Define the multiset of observed labelings for every node u as follows. Let n' denote the number of leaves in the subtree rooted at u, and let u1, ... , un' denote those leaves. Defineu(y) for every labeling y
![]()
n' of the leaves in the subtree of u as the number of times y is observed in the data:
where X{·} denotes the indicator function that takes the value 1 if its argument is true, otherwise it is 0. Define also the set of observed labelings
The multisets of observed labelings are computed in the preprocessing step. The likelihood function is evaluated subsequently by computing the conditional likelihoods at each node u for the labelings of
u only, in O(|
u|) time.
In order to compute
u, we use a recursive procedure. It is important to avoid working with the O(n)-dimensional vectors y of Definition 1 directly, otherwise the preprocessing may take superlinear time in n. For that reason, every labeling y
u is represented by the index i for which xi is the first occurrence of y. Accordingly, we compute an auxiliary array hi(u), which stores the first occurrence of each labeling xi in u's subtree. In particular, hi(u) = i' if i' is the smallest index such that xi(uk) = xi' (uk) for all k where uk are the leaves in u's subtree. Figure 2 shows that the values hi(u) can be computed in a postorder traversal. After hi(u) are computed for all i and u, the multiplicities
u and observed labelings
u are straightforward to calculate in linear time. The map H in Lines C7–C11 is sparse with at most
entries, and can be implemented as a hash table so that accessing and updating it takes O(
) time. Consequently, Algorithm COMPRESS takes O(n
) time.
|
4.3 Evaluating the likelihood function
After the preprocessing step, the conditional likelihoods are computed only for the different labelings within each subtree. Figure 3 shows the evaluation of the likelihood function. The running time for the algorithm is O(s) where s is the total number of different labelings within all subtrees: s =
u ||
| (2) |
different labelings for n – O(log
) subtrees. Caterpillar trees, however, are rare in phylogenetic analysis. Typical phylogenies have fairly balanced subtrees (Aldous, 2001).
|
In what follows, we examine the bound of (2) more closely in the Yule–Harding model. The analysis relies on a characterization of the random number of subtrees with a given size, as expressed in Theorems 1 and 2.
THEOREM 1
Consider random evolutionary trees with n leaves in the Yule–Harding model. Let Ck denote the number of subtrees with k = 1, ... , n – 1 leaves in a random tree. The expected value of Ck isTrivially,
(3) Cn =1:
PROOF
Heard (1992) derives Equation (3) by appealing to a Pólya urn model. An equivalent result is stated by Devroye (1991).
THEOREM 2
For all> 0,
(4a)
(4b)
PROOF
Consider the random construction of the tree. Let Yi denote the random leaf picked in step i to which leaf i gets connected, for i = 3, 4, ... , n. Each random variable Yi is uniformly distributed over the set {1, 2, ... , i – 1}, and Y3, Y4, ... , Yn are independent. Moreover, they completely determine the tree T at the end of the procedure. Consequently, Ck is a function of (Yi : i = 3, ... , n): Ck = f(Y3, Y4, ... , Yn). The key observation for the concentration result is that if we change the value of only one of the Yi in the series, then Ck changes by at most two. In order to see this, consider what happens to the tree T, if we change the value of exactly one of the Yi from y to y'. Such a change corresponds to a subtree prune and regraft transformation (Felsenstein, 2004). Specifically, the subtree Ti, defined as the child tree of the lowest common ancestor u of y and i containing the leaf i, is cut from T, and is reattached to one of the edges on the path from the root to y'. Now, such a transformation does not affect Ck by much. Notice that subtree sizes are strictly monotone decreasing from the root on every path. On the path from the root to u, subtree sizes decrease by the sizeof Ti, and u disappears, contributing a change of +1, 0 or (–1) to Ck. (At most one subtree of size
+ k that contains Ti now has size k, and at most one subtree of original size k is not counted anymore: it may be u's subtree itself, or a subtree above it.) An analogous argument shows that regrafting contributes a change of +1, 0, or (–1). Hence, the function f(·) is such that by changing one of its arguments, its value changes by at most 2. As a consequence, McDiarmid's inequality (1989) can be applied to bound the probabilities of large deviations for Ck. In particular, for all
> 0,
where
and
Since ci = 2 for all i,
, implying Equation (4a). An identical bound holds for the right-hand tail of the distribution.
REMARK
The particular case of k = 2 was considered by McKenzie and Steel (2000). They showed that the distribution of C2 is asymptotically normal with mean n/3 and variance 2n/45. The result suggests that the best constant factor in the exponent of Equation (4) is 45/8 for k = 2, instead of. Rosenberg (2006) gave exact formulas for the variance of Ck. He showed that the variance of Ck is
where 8/45
vk < 2 is a constant that depends on k. The variance formulas were given earlier in a different context by Devroye (1991). (See also the discussion by Blum and François (2005).) The result suggests that by analogy with the cherries, the best constant factor in the exponent is
. It is thus plausible that the probability is properly bounded by 1–o(n log–2
) in Theorem 3.
THEOREM 3
With probability, the likelihood function can be evaluated for random trees in theYule–Harding model in
time after an initial preprocessing step that takes O(n
) time. Evaluating the likelihood function takes O(n
) time in the worst case, and O(n
log–1
) time on average.
We need the following lemma for the proof of Theorem 3.
LEMMA 4
For all t4,
. For all t
1 and
.
PROOF
The proof is straightforward by induction in t. Notice that the right order of magnitude is![]()
but we need a bound for all t.
PROOF OF THEOREM 3
The preprocessing (Fig. 2) takes O(n) time as discussed in Section 4.2. The evaluation of the likelihood function (Fig. 3) takes O(s) time. By Equation (2),
Let
(5) . By Theorem 1 and Lemma 4, if
![]()
16 or r
3,
which proves our claim about the average running time. Now, let
(6) . Plugging
into Theorem 2, we get that
Let
(7) k denote the event that |Ck –
Ck| <
for k = 1, ... , t, and let
t + 1 denote the event that
Since
implies
t + 1. By (7),
where
denotes the complementary event to
k. Now,
also implies that
. Since the likelihood computation takes O(s) time, the theorem holds.
Theorem 3 underestimates the actual speedups that the compression method brings about. Table 3 shows that compression results in a 50–500-fold speedup of the likelihood evaluation in practice. Notice also that the constants hidden behind the asymptotic notation are quite different between the preprocessing and evaluation steps: costly floating point operations are avoided in the preprocessing step. It is important to stress that the theoretical analysis does not rely on similarities in the input data: Theorem 3 and the bound of (2) hold for any sample of size
. Real-life data are expected to behave even better, as Table 3 illustrates.
|
Even though the theorem holds for arbitrary alphabets, the compression is less effective for large alphabets (amino acids for example) in practice. For DNA sequences, however, it should still be valuable: we conjecture that compression would accelerate likelihood optimization by at least an order of magnitude.
We implemented LOGLIKELIHOOD, along with likelihood maximization and the posterior calculations of Section 3.2 in a Java package. Likelihood is maximized by setting loss and gain parameters for the edges, by using mostly the Broyden–Fletcher–Goldfarb–Shanno method (Press et al., 1997) whenever possible, and occasionally Brent's line minimization (Press et al., 1997) for each parameter separately. It is noteworthy that the algorithmic improvements reduced the likelihood optimization time to a few minutes on a common desktop computer for the data set of Section 5.2, and made the execution of a large number of simulations possible.
| 5 APPLICATIONS |
|---|
|
|
|---|
5.1 Ancient paralogs
In our first example, intron-aware alignment was used to reject the hypothesis of misleading protein alignments.
We used intron-aware alignment in a study about ancient eukaryotic paralogs (Sverdlov et al., 2007). In the study, 157 homologous gene families were examined across six species (Arabidopsis thaliana, Homo sapiens, Cxnorhabditis elegans, Drosophila melanogaster, Saccharomyces cerevisx and S. pombe). These families were notable because they contained paralogous members in multiple eukaryotic species, but not in prokaryotes, and, thus, presumably underwent duplication in the lineage leading to LECA. Ancient paralogs within and across species share very few (in the order of a few percentages) introns. The finding is quite surprising, as recent paralogs, resulting from lineage-specific duplications, and also orthologs between human and Arabidopsis agree much more in their intron sites (Rogozin et al., 2003). Consequently, ancient paralogs either lacked introns at the time of their duplication, or their duplication involved removal of introns.
In one of the data validation steps for the study, we used intron-aware alignment with very high intron match rewards. With larger rewards for intron matches, more introns line up in the alignment, but the protein alignment gets worse. Even by using huge intron match rewards at the expense of corrupted protein alignments, we were not able to achieve intron sharing levels similar to that of human-Arabidopsis orthologs. The lack of intron agreement is therefore not an artifact of the protein alignments.
5.2 Intron-rich ancestors
We compiled a data set with 18 eukaryotic species to give a comprehensive picture of spliceosomal evolution among Eukaryotes.
Data preparation
Genbank flatfiles and protein sequences were downloaded from RefSeq (Pruitt et al., 2007) and Ensembl (Hubbard et al., 2007). Exon-intron annotation was extracted from the Genbank flatfiles. The C. briggsx protein sequences and genome annotation were downloaded from WormBase (Bieri et al., 2007), and intron annotation was extracted from the GFF annotation file. Table 4 lists the data sources and the species abbreviations. We used the 684 ortholog sets, each corresponding to a cluster of orthologous groups, or KOG (Tatusov et al., 2003), from the study of Rogozin et al. (2003) as seeds for compiling a set of putative orthologs for our species. For each seed (consisting of homologous protein sequences for eight species), we performed a position-specific iterated BLAST search (Altschul et al., 1997). In case of Plasmodium species, we used three iterations against a database of all protozoan peptide sequences in RefSeq. We used an E-value cutoff of 10–9 for retaining candidates. Each candidate hit was then used as a query in a reversed position-specific BLAST (rpsblast) search (Marchler-Bauer et al., 2007) against the KOG database. Candidates were retained after this point if they had the highest scoring hit (by rpsblast) against the same KOG as the KOG of the seed data, and they scored within 80% of the best such hit for the species.
|
From the resulting set of paralogs, we selected a putative ortholog set in the following manner. For each KOG, we selected all possible triples of human–Arabidopsis–Saccharomyces paralogs and kept the triple that had the highest score in alignments built by MUSCLE (Edgar, 2004). Alignment score was computed using the VTML240 matrix (Müller et al., 2002). Additional putative orthologs were added for one species at a time, by aligning each paralog individually to the current profile, and keeping the one that gave the largest alignment score. At this iterative addition, scoring was done with the VTML240 matrix, by summing the five-highest-pairwise scores between a candidate and already included sequences.
The resulting sets were then realigned using MUSCLE, and then realigned again using our intron-aware alignment with a gap penalty of 300, gap-extend penalty of 11, VTML240 amino acid scoring, intron-match scores of 300 and intron-mismatch penalties of 20. (The latter were established using different intron-match and -mismatch scores, and selecting the ones that gave the fewest number of intron sites while decreasing the score of the implied protein alignment by less than 0.1%.)
Conserved portions were extracted using our segmentation program with a complexity penalty of
= 400 (larger values gave identical segmentation results, and lower values resulted in too many scattered blocks). We penalized indels with an infinitely large value to exclude gap columns. Phase-0 introns falling on the boundaries of conserved blocks were excluded. Intron presence and absence in the aligned data was then extracted to produce the data for the likelihood programs.
The final data set comprises 8044 intron sites in 483 sets of orthologs.
Results
By extending the data of Rogozin et al. (2003), we wanted to examine the effects of increased taxon sampling in the analysis of intron evolution. In the course of that scrutiny, we were also able to refine the findings of previous studies concerning the dynamics of intron evolution.
Figure 4 shows the intron densities for ancestral species. Ancestral nodes such as the bilaterian ancestor (node 4), the ecdysozoan ancestor (node 5), the opisthokont ancestor (node 15) and the bikont ancestor (node 16) all have high intron densities, exceeding previous estimates of Rogozin et al. (2003) and Cs
rös (2005). Honey-bee genes (IHBSC, 2006) indicate a more intron-rich ecdysozoan ancestor (7 introns per gene) than what was known before, and surpass even the generous estimate (6.1 introns per gene) of Roy and Gilbert (2005). (By removing the honey-bee data, our estimate drops by 25%.) Intron density in the bilaterian ancestor is estimated to be almost as high as in humans. Sequences of a handful genes from a certain marine annelid have already indicated that the bilaterian ancestor had at least two-thirds as many introns as vertebrates (Raible et al., 2005).
|
Most branches are characterized by net intron loss. Figure 5 shows the estimated changes in a few lineages. Intron evolution has been much slower in the vertebrate lineage than in most other lineages: in more than 380 million years since the divergence with fishes, only about 3% of our introns got lost. Fungi, for example, massively trimmed their introns in many lineages as can be seen by the substantial decreases from node 12 onwards, covering a comparable time period. A notable exception is C. neoformans, which seems to have gained introns, but that assessment may change if more basidiomycetes are included.
|
| 6 CONCLUSION |
|---|
|
|
|---|
We presented an alignment technique for establishing intron orthology, and a likelihood framework in which intron evolutionary events can be quantified. We described a compression method for the evaluation of the likelihood function, which has been extremely valuable in practice. We also showed that the compression leads to sublinear running times for likelihood evaluation.
We illustrated our methods for analyzing intron evolution with a large and diverse set of eukaryotic organisms. The data set is more comprehensive than any used in other studies published to this day. The data indicate that ancestral eukaryotic genomes were more intron-rich than previous studies suggested.
Many circumstances influence intron loss (Jeffares et al., 2006; Roy and Gilbert, 2006), and realistic likelihood models need to introduce rate variation (Carmel et al., 2005). Usual rate variation models (Felsenstein, 2004) entail multiple evaluations of the likelihood function, and, thus, underline the importance of computational efficiency. We believe that the proposed methods will help to produce and analyze large data sets even within complicated likelihood models.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
M.Cs. is supported by a grant from the Natural Sciences and Engineering Research Council of Canada. I.B.R. was supported by the NLM/NIH/DHHS Intramural Research Program.
Conflict of Interest: none declared.
| REFERENCES |
|---|
|
|
|---|
Adachi J, Hasegawa M. MOLPHY version 2.3: programs for molecular phylogenetics based on maximum likelihood. In: Vol. 28 of Computer Science Monographs (1995) Tokyo, Japan: Institute of Statistical Mathematics. 1–150.
Aldous D. Stochastic models and descriptive statistics for phylogenetic trees, from Yule to today. Stat. Sci (2001) 16:23–34.[CrossRef][Web of Science]
Altschul SF, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res (1997) 25(17):3389–3402.
Bieri T, et al. WormBase: new content and better access. Nucleic Acids Res (2007) 35(Suppl. 1):D506–510.
Blum MGB, Francois O. On statistical tests of phylogenetic tree imbalance: the Sackin and other indices revisited. Math. Biosci (2005) 195:141–153.[CrossRef][Web of Science][Medline]
Carmel L, et al. An expectationmaximization algorithm for analysis of evolution of exon-intron structure of eukaryotic genes. Lec. Notes in Comput. Sci (2005) 3678:35–46.[CrossRef]
Collins L, Penny D. Complex spliceosomal organization ancestral to extant eukaryotes. Mol. Biol. Evol (2005) 22(4):1053–1066.
Coulombe-Huntington J, Majewski J. Characterization of intron loss events in mammals. Genome Res (2007) 17(1):23–32.
Cs
rös M. Maximum-scoring segment sets. IEEE/ACM Transactions on Computational Biology and Bioinformatics (2004) 1(4):139–150.[CrossRef]
Cs
rös M. Likely scenarios of intron evolution. Lec. Notes in Comput. Sci (2005) 3678:47–60.[CrossRef]
Devroye L. Limit laws for local counters in random binary search trees. Random Struct. Algor (1991) 2(1):303–315.
Durbin R, et al. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids (1998) UK: Cambridge University Press.
Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res (2004) 32(5):1792–1797.
Felsenstein J. Evolutionary trees from DNAsequences: a maximum likelihood approach. J. Mol. Evol (1981) 17:368–376.[CrossRef][Web of Science][Medline]
Felsenstein J. Phylogenies from restriction sites, a maximum likelihood approach. Evolution (1992) 46:159–173.[CrossRef][Web of Science]
Felsenstein J. Inferring Pylogenies (2004) Sunderland, Massacheusetts, USA: Sinauer Associates.
Friedman N, et al. A structural EM algorithm for phylogenetic inference. J. Comput. Biol (2002) 9(2):331–353.[CrossRef][Web of Science][Medline]
Guindon S, Gascuel O. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol (2003) 52(5):696–704.[CrossRef][Web of Science][Medline]
Harding E. The probabilities of rooted tree-shapes generated by random bifurcation. Adv. Appl. Probab (1971) 3:44–77.[CrossRef]
Heard SB. Patterns in tree balance among cladistic, phenetic, and randomly generated phylogenetic trees. Evolution (1992) 46(6):1818–1826.[CrossRef][Web of Science]
Hubbard TJP, et al. Ensembl 2007. Nucleic Acids Res (2007) 35(Suppl. 1):D610–617.[CrossRef][Web of Science][Medline]
IHBSC. Insights into social insects from the genome of the honey bee Apis mellifera. Nature (2006) 443:931–949.[CrossRef][Medline]
Jeffares DC, et al. The biology of intron gain and loss. Trends Genet (2006) 22(1):16–22.[CrossRef][Web of Science][Medline]
Kececioglu J, Zhang W. Aligning alignments. Farach-Colton M, ed. (1998) Proceedings of CPM, Vol. 1448 of LNCS. Springer. 189–208.
Kosakovsky Pond SL, Muse SV. Column sorting: rapid calculation of the likelihood function. Syst. Biol (2004) 53(5):685–692.[CrossRef][Web of Science][Medline]
Larget B, Simon DL. Faster likelihood calculations on trees. In: Technical Report 98-02 (1998) Pittsburgh, Pennsylvania, USA: Department of Mathematics and Computer Science, Duquesne University.
Ma B, et al. Alignment between two multiple alignments. Baeza-Yates RA, Chávez E, Crochemore M, eds. (2003) Proceedings of CPM, Vol. 2676 of LNCS. Springer. 254–265.
Marchler-Bauer A, et al. CDD: a conserved domain database for interactive domain family analysis. Nucleic Acids Res (2007) 35(Suppl. 1):D237–240.
McDiarmid C. On the method of bounded differences. In: Surveys in Combinatorics—Siemons J, ed. (1989) Cambridge University Press. 148–184.
McKenzie A, Steel M. Distributions of cherries for two models of trees. Mathe. Biosci (2000) 164:81–92.[CrossRef]
Müller T, et al. Estimating amino acid substitution models: a comparison of Dayhoff's estimator, the resolvent approach and a maximum likelihood method. Mol. Biol. Evol (2002) 19(1):8–13.
Nguyen HD, et al. New maximum likelihood estimators for eukaryotic intron evolution. PLoS Comput. Biol (2005) 1(7):e79.[CrossRef][Medline]
Nielsen CB, et al. Patterns of intron gain and loss in fungi. PLoS Biol (2004) 2(12):e422.[CrossRef][Medline]
Nixon JEJ, et al. A spliceosomal intron in Giardia lamblia. Proc. Nat. Acad. Sci. USA (2002) 99(6):3701–3705.
Press WH, et al. Numerical Recipes in C: The Art of Scientific Computing (1997) 2nd. Cambridge University Press.
Pruitt KD, et al. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res (2007) 35(Suppl. 1):D61–65.
Raible F, et al. Vertebrate-type intron-rich genes in the marine annelid Platynereis dumerilii. Science (2005) 310:1325–1326.
Rogozin IB, et al. Remarkable interkingdom conservation of intron positions and massive, lineagespecific intron loss and gain in eukaryotic evolution. Curr. Biol (2003) 13:1512–1517.[CrossRef][Web of Science][Medline]
Rogozin IB, et al. Analysis of evolution of exon-intron structure of eukaryotic genes. Brief. Bioinformat (2005) 6(2):118–134.
Rosenberg NA. The mean and variance of r-pronged nodes and r-caterpillars in Yule-generated genealogies. Ann. Combinatorics (2006) 10:129–146.[CrossRef]
Roy SW, Gilbert W. Complex early genes. Proc. Nat. Acad. Sci. USA (2005) 102(6):1986–1991.
Roy SW, Gilbert W. The evolution of spliceosomal introns: patterns, puzzles and progress. Nat. Rev. Genet (2006) 7:211–221.[CrossRef][Web of Science][Medline]
Roy SW, Penny D. Large-scale intron conservation and order-ofmagnitude variation in intron loss/gain rates in apicomplexan evolution. Genome Res (2006) 16(10):1270–1275.
Roy SW, Penny D. Patterns of intron loss and gain in plants: Intron loss-dominated evolution and genome-wide comparison of O. sativa and A. thaliana. Mol. Biol. Evol (2007) 24(1):171–181.
Stamatakis AP, et al. AxML: Afast program for sequential and parallel phylogenetic tree calculations based on the maximum likelihood method. (2002) Proceedings of the IEEE Computer Society Bioinformatics Conference (CSB). 21–28.
Steel MA. Recovering a tree from the leaf colourations it generates under a Markov model. Appl. Math. Lett (1994) 7:19–24.
Sverdlov AV, et al. Conservation versus parallel gains in intron evolution. Nucleic Acids Res (2005) 33(6):1741–1748.
Sverdlov AV, et al. A glimpse of a putative pre-intron phase of eukaryotic evolution. Trends Genet (2007) 23(3):105–108.[CrossRef][Web of Science][Medline]
Tatusov RL, et al. The COG database: an updated version includes eukaryotes. BMC Bioinformatics (2003) 4:441.
Va
ácová
, et al. Spliceosomal introns in the deep-branching eukaryote Trichomonas vaginalis. Proc. Nat. Acad. Sci. USA (2005) 102(12):4430–4435.
Zhang Z, Berman P, Wiehe T, Miller W. Post-processing long pairwise alignments. Bioinformatics (1999) 15(12):1012–1019.
This article has been cited by other articles:
![]() |
M. Csuros Malin: maximum likelihood analysis of intron evolution in eukaryotes Bioinformatics, July 1, 2008; 24(13): 1538 - 1539. [Abstract] [Full Text] [PDF] |
||||
![]() |
R. Tarrio, F. J. Ayala, and F. Rodriguez-Trelles From the Cover: Alternative splicing: A missing piece in the puzzle of intron gain PNAS, May 20, 2008; 105(20): 7223 - 7228. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. Csuros, I. B. Rogozin, and E. V. Koonin Extremely Intron-Rich Genes in the Alveolate Ancestors Inferred with a Flexible Maximum-Likelihood Approach Mol. Biol. Evol., May 1, 2008; 25(5): 903 - 911. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. W. Roy and M. Irimia Rare Genomic Characters Do Not Support Coelomata: Intron Loss/Gain Mol. Biol. Evol., April 1, 2008; 25(4): 620 - 623. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. Irimia and S. W. Roy Spliceosomal introns as tools for genomic and evolutionary analysis Nucleic Acids Res., March 1, 2008; 36(5): 1703 - 1712. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||





> 0,
of Ti, and u disappears, contributing a change of +1, 0 or (–1) to Ck. (At most one subtree of size 
vk < 2 is a constant that depends on k. The variance formulas were given earlier in a different context by Devroye (




