Skip Navigation

Bioinformatics 2007 23(13):i87-i96; doi:10.1093/bioinformatics/btm190
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Google Scholar
Right arrow Articles by Csurös, M.
Right arrow Articles by Rogozin, I. B.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Csurös, M.
Right arrow Articles by Rogozin, I. B.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

In search of lost introns

Miklós Csurös 1,*, J. Andrew Holey 2 and Igor B. Rogozin 3

1Department of Computer Science and Operations Research, Université de Montréal, Québec, Canada, 2Department of Computer Science, Saint John's University and the College of St. Benedict, Collegeville, MN, USA and 3National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 IDENTIFICATION OF ORTHOLOGOUS...
 3 A LIKELIHOOD FRAMEWORK
 4 RAPID COMPUTATION OF...
 5 APPLICATIONS
 6 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

Many fundamental questions concerning the emergence and subsequent evolution of eukaryotic exon–intron organization are still unsettled. Genome-scale comparative studies, which can shed light on crucial aspects of eukaryotic evolution, require adequate computational tools.

We describe novel computational methods for studying spliceosomal intron evolution. Our goal is to give a reliable characterization of the dynamics of intron evolution. Our algorithmic innovations address the identification of orthologous introns, and the likelihood-based analysis of intron data. We discuss a compression method for the evaluation of the likelihood function, which is noteworthy for phylogenetic likelihood problems in general. We prove that after O(n{ell}) preprocessing time, subsequent evaluations take O(n{ell}/log {ell}) time almost surely in the Yule–Harding random model of n-taxon phylogenies, where {ell} is the input sequence length.

We illustrate the practicality of our methods by compiling and analyzing a data set involving 18 eukaryotes, which is more than in any other study to date. The study yields the surprising result that ancestral eukaryotes were fairly intron-rich. For example, the bilaterian ancestor is estimated to have had more than 90% as many introns as vertebrates do now.

Availability: The Java implementations of the algorithms are publicly available from the corresponding author's site http://www.iro.umontreal.ca/~csuros/introns/. Data are available on request.

Contact: csuros{at}iro.umontreal.ca


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 IDENTIFICATION OF ORTHOLOGOUS...
 3 A LIKELIHOOD FRAMEWORK
 4 RAPID COMPUTATION OF...
 5 APPLICATIONS
 6 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Typical eukaryotic protein-coding genes contain introns, which are removed prior to translation. Key constituents of the spliceosome, which is the RNA–protein complex that performs the intron excision, can be traced back (Collins and Penny, 2005) to the last common ancestor of extant eukaryotes (LECA). Even deep-branching lineages (Nixon et al., 2002; Vanácová et al., 2005) have introns. It is thus almost certain that spliceosomal introns were present in LECA. Moreover, when comparing distant eukaryotes, intron positions often agree (Rogozin et al., 2003). The similarity is likely due more to conservation of early introns than to parallel gains (Roy and Gilbert, 2005; Sverdlov et al., 2005). It is thus compelling to use genome-scale comparisons to study intron evolution in different lineages, and even to estimate the exon–intron organization in extinct ancestors. One of the first such studies, by Rogozin et al. (2003), involved orthologous gene sets in eight eukaryotes. The same data set was re-analyzed by different authors (Carmel et al., 2005; Csurös, 2005; Nguyen et al., 2005; Roy and Gilbert, 2005), using novel methods developed for intron data. Subsequent inquiries (Coulombe-Huntington and Majewski, 2007; Nielsen et al., 2004; Roy and Penny, 2006, 2007) attest to a renewed interest in understanding the specifics of intron evolution within different eukaryotic lineages. This article introduces novel computational techniques for the analysis of spliceosomal intron evolution, anticipating more large-scale studies to come.

Section 2 describes an alignment method for intron-annotated protein sequences, as well as a segmentation method for identifying conserved portions of a multiple alignment. Section 3 describes a likelihood framework in which intron evolution can be analyzed in a theoretically sound manner. Section 4 scrutinizes a compression technique that accelerates the evaluation of the likelihood function. Fast evaluation is particularly important when the likelihood is maximized in a numerical procedure that computes the likelihood function with many different parameter settings. Section 5 describes two applications of our methods. In one application, intron-aware alignment was used to validate some unexpected features of intron evolution before LECA. In the second application, we analyzed intron evolution in 18 eukaryotic species. We found evidence of intron-rich early eukaryotes and a prevalence of intron loss over intron gain in recent times.


    2 IDENTIFICATION OF ORTHOLOGOUS INTRONS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 IDENTIFICATION OF ORTHOLOGOUS...
 3 A LIKELIHOOD FRAMEWORK
 4 RAPID COMPUTATION OF...
 5 APPLICATIONS
 6 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
2.1 Intron-aware alignment
Orthologous introns can be identified by using whole genome alignments in the case of closely related genomes (Coulombe-Huntington and Majewski, 2007). For distant eukaryotes, however, intron orthology can only be established through protein alignments (Rogozin et al., 2005). The usual procedure is to project intron positions onto an alignment of multiple orthologous proteins. If introns in different species are projected to the same alignment position, then the introns are assumed to be orthologous.

Introns can also serve as checkpoints in a protein alignment. If the protein sequences are annotated with the intron positions, then the alignment can exploit intron homology by rewarding aligned introns and penalizing the juxtaposition of an intron with an intronless codon. In addition to appropriate scoring, alignment algorithms need to be modified to cope with phase-0 introns, which fall between codons. Incorporating intron annotation should lead to better alignments at the amino acid level (Fig. 1), and to a more reliable identification of orthologous introns.


Figure 1
View larger version (16K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. Fragment of a multiple alignment before and after realignment using intron annotation. Shaded rectangles show the intron positions projected to the protein sequences.

 
First, consider the case of aligning two protein sequences, S and T, which are annotated with the intron positions. Every residue has two associated intron sites (after the first and second nucleotides of their codons), and there is an intron site between consecutive amino acid positions (phase-0 introns). Intron sites may or may not contain introns. We use the notation for S[i: 0] for the phase-0 site preceding the codon for amino acid i, and S[i: 1], S[i: 2] for phase-1 and phase-2 sites within the codon. Intron presence is encoded by 1, and intron absence is encoded by 0 throughout this article. The intron annotation is specified by the variables S[i: j ] isin {1, 0}.

Phase-1 and phase-2 intron sites are automatically placed by their associated amino acids, but the placement of phase-0 introns is not fixed with respect to gaps. It is not possible to simply add a new character representing phase-0 introns because they affect gaps differently from amino acids. Scores for aligned intron sites are given by a 2 x 2 matrix {Lambda}. If M is the amino acid scoring matrix, then the alignment of S[i ] and T[ j ] entails a score of Formula with Formula . Similarly, aligning S[i] with an indel implies a score of Formula Formula , in addition to possible gap opening and closing penalties.

The intron scores Formula can be based on log-likelihood ratios in a probabilistic model (Durbin et al., 1998). For an aligned intron site, let s, t isin {1, 0} denote the intron states in S and T, respectively. The score Formula should be proportional to Formula , where p denotes the joint distribution of states at orthologous sites, and {pi}S, {pi}T are the prior intron state probabilities.

In order to assess the strength of intron-match signals, and to choose reasonable scores, we used the data set of Rogozin et al. (2003). We estimated p and {pi} using relative frequencies of intron absence-presence patterns in the 488 157 possible intron sites, and then computed log-odds scores for intron matches. The intron score is asymmetric and varies with evolutionary distance and intron conservation. Intron-less sites have an insignificant score, but shared introns have a high score, such as 304 (human–Plasmodium), 317 (human–Arabidopsis) or 515 (Drosophila–Anopheles) on a 1/60-bit scale. Shared introns thus give a signal comparable to strong amino acid matches: in the 1/60-bit scaled version of the VTML240 matrix (Müller et al., 2002), a tryptophan match scores 289. Discordant intron sites have similar scores to amino acid mismatches.

We added intron scoring into a multiple alignment framework, using a sum-of-pairs scoring policy with affine gap– scoring. It is NP-hard to find the alignment of two multiple alignments under these optimization criteria (Ma et al., 2003). Even the alignment of a single sequence to a multiple alignment necessitates sophisticated techniques (Kececioglu and Zhang, 1998). Our solution therefore uses a gap-counting heuristic: namely, a gap-open penalty is triggered for an indel aligned with an amino acid if the indel is preceded by an amino acid or a phase-0 intron. Gap opening thus corresponds to a pattern Formula or Formula . Here, Formula are intron states for the phase-0 site, and ? denotes either state. In addition, Formula denotes an arbitrary amino acid, Formula is the indel character, and Formula is either of the latter two. We implemented affine gap-scoring by separate gap-open and -close penalties, so that gaps at the alignment extremities can be penalized less severely. Gap closing is counted for the patterns Formula and Formula .

Table 1 gives the recurrences for a dynamic programming algorithm that aligns an intron-annotated sequence S to a multiple alignment P of h intron-annotated protein sequences. In order to simplify the presentation, we represent the sequences in such a way that every odd position of S and P is a regular residue or alignment column, annotated with information on the presence of phase-1 and phase-2 introns, whereas every even position is a phase-0 intron site. We use Formula to denote the sum of intron-match scores for the intron sites associated with the positions S[i ], P[ j ]. We use also the shorthand Formula to denote scoring for the alignment of an amino acid x with an amino acid profile y. The algorithm uses three types of variables, Formula and Formula , which correspond to partial prefix alignments ending with aligned residues, gaps in S, or gaps in P, respectively. In case of Formula , the last indel must be aligned with an amino acid column, and, thus j must be odd; for Formula , i must be odd.


View this table:
[in this window]
[in a new window]

 
Table 1. Recurrences for intron-aware alignment

 
Gaps are scored by using affine penalties, with gap-open, -extend, and -close scores, denoted by {gamma}(<), {gamma}(–), {gamma}(>). The gap-counting heuristic implies that gap scores in the equations of Table 1 are defined by the number of certain patterns in up to three consecutive alignment columns. Table 2 lists the patterns for the gap-counting heuristic.


View this table:
[in this window]
[in a new window]

 
Table 2. Patterns for gap counting

 
Table 1 does not show the initialization of the variables, nor the final gap-counting: they employ a logic analogous to the recurrences. At the end of the algorithm, the best of Formula is selected, and the actual alignment is found by standard traceback techniques (Durbin et al., 1998).

We implemented the algorithm in Java. The program iteratively realigns one sequence at a time to the rest of the sequences in a multiple alignment. Instead of sequence-dependent intron match-mismatch scores, the implementation uses only two parameters: one for intron conservation and another for intron-state mismatch.

2.2 Identification of conserved blocks
In order to reliably identify orthologs, we need to be able to distinguish regions of the alignment that are highly conserved from those that are less conserved. In poorly conserved regions, we cannot confidently infer intron orthology.

Zhang et al. (1999) proposed postprocessing pairwise sequence alignments into alternating blocks that score above a threshold parameter {alpha} or below (–{alpha}). We attained a similar goal by adapting algorithmic techniques from Csurös (2004). The procedure separates a multiple alignment into alternating high- and low-scoring regions. Using a complexity penalty {alpha}, a segmentation with k high-scoring regions has a segmentation score of A k{alpha} where A is the sum of scores for the aligned columns. Column scores are computed without gap-open and -close penalties. The best segmentation of an alignment of length {ell} can be found in O({ell}) time after the column scores are computed.

The result of this computation is that the total of column scores in each selected high-scoring region will be greater than {alpha}. There may be small subregions of negative scores, but the total score of such a subregion cannot be less than (–{alpha}). Conversely, unselected regions score below (–{alpha}) and cannot have subregions scoring above {alpha}.


    3 A LIKELIHOOD FRAMEWORK
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 IDENTIFICATION OF ORTHOLOGOUS...
 3 A LIKELIHOOD FRAMEWORK
 4 RAPID COMPUTATION OF...
 5 APPLICATIONS
 6 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
3.1 Markov models of evolution
We use a Markov model for intron evolution, as in previous studies (Csurös, 2005). For the sake of generality, we describe the Markov model (Felsenstein, 2004; Steel, 1994) over an arbitrary alphabet Formula of fixed size Formula . The intron alphabet is Formula . A phylogeny over a set of species X is defined by a rooted tree T and a probabilistic model. The leaves are bijectively mapped to elements of X. Each tree node u has an associated random variable {xi}(u), which is called its state or label, that takes values over Formula . The tree T with its parameters defines the joint distribution for the random variables {xi}(u). Along every path away from the root, the node states form a Markov chain. The distribution is thus determined by the root probabilities ({pi}(a) : a isin Formula ) and edge transition probabilities (pe(a -> b) : a, b isin Formula ) assigned to every edge e. The root probabilities give the distributon of the root state. Edge transition probabilities define the conditional distributions Formula . The leaf states form the character {xi} = ({xi}(u) : u isin X). An input data set (or sample) consists of independent and identically distributed (iid) characters: {xi} = ({xi}i : i = 1, ... , {ell}).

In case of intron evolution, introns are generated by a two-state continuous-time Markov process with gain and loss rates {lambda}e, µe ≥ 0 along each edge e. The edge length is denoted by te. Using standard results, the transition probabilities on edge e with rates {lambda}e = {lambda}, µe = µ and length te = t can be written as Formula Formula , and pe(0 -> 0) = 1 – pe(0 -> 1), pe(1 -> 1) = 1 – pe(1 -> 0). In the absence of established edge lengths, we fix the edge length scaling in such a way that {lambda}e + µe = 1. Independent model parameters are thus {pi}(1), µe and te for all edges e. It is important to allow for branch-dependent rates, since loss and gain rates vary considerably between lineages (Jeffares et al., 2006; Roy and Gilbert, 2006).

In a maximum likelihood approach, model parameters are set by maximizing the likelihood of the input sample. Let x = (x1, ... , x{ell}) be the input data. Every xi is a vector of n states, and we write xi(u) to denote the observed state of leaf u. By independence, the likelihood of x is the product Formula . Each character's likelihood can be computed in O(n) time, using a dynamic programming procedure introduced by Felsenstein (1981). The procedure relies on a ‘pruning’ technique, which consists of computing the conditional likelihoods Lu(a) for every node u and letter a. Lu(a) is the probability of observing the leaf labelings in the subtree of u, given that u is labeled with a. The likelihood for the character x equals Formula .

Intron data are somewhat unusual in that an all-0 character is never observed: the input does not include sites in which introns are absent in all of the organisms. To resolve this difficulty, we employ a correction technique proposed by Felsenstein (1992) for restriction sites. (Csurös (2005) describes an alternative technique based on expectation maximization.) We compute the likelihood under the condition that the input does not include all-0 characters. We use therefore the corrected likelihood Formula and maximize L' = prodi L'(xi).

3.2 Ancestral events in intron evolution
Our goal is to give a reliable characterization of the dynamics of intron evolution. In particular, we aim to give estimates for intron density in ancestral species, and for intron loss and gain events on the edges. Notice that the estimation method needs to account for ancestral introns even if they got eliminated in all descendant lineages. It is possible to do that with the help of conditional expectations, which fit naturally into a likelihood framework.

For an observed character x, we define upper conditional likelihoods Uu(a) so that Uu(a)Lu(a) = Formula {{xi} = x, {xi}(u) = a}. Upper conditional likelihoods are computed with dynamic programming, from the root towards the leaves, in O(n{ell}) time (Csurös, 2005), even for non-binary trees. Similar computations are routinely used in DNA and protein likelihood maximization programs (Adachi and Hasegawa, 1995; Guindon and Gascuel, 2003). Here we allow irreversible probabilistic models, which explains why Uu(a) must be computed in a top-down fashion in (1).

The posterior probability for the state at node u is


Formula

The posterior probabilities for state changes on an edge uv are


Formula

Now, the number of ancestral introns is estimated as the conditional expectation Formula . The formula takes into consideration unobserved intron sites, by estimating their number Formula as the mean of a negative binomial random variable. The number of intron state changes is estimated as Nv(a -> b) = Formula . In particular, Nv(1 -> 0) is the number of introns lost, and Nv(0 -> 1) is the number of introns gained on the edge leading to v.

In order to compute U, we initialize Uroot(a) = {pi}(a). On every edge uv,


Formula 1

(1)


    4 RAPID COMPUTATION OF THE LIKELIHOOD
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 IDENTIFICATION OF ORTHOLOGOUS...
 3 A LIKELIHOOD FRAMEWORK
 4 RAPID COMPUTATION OF...
 5 APPLICATIONS
 6 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
There are many heuristics that accelerate likelihood-based phylogenetic reconstruction (Friedman et al., 2002; Guindon and Gascuel, 2003), which mostly concentrate on the exploration of the tree space. We propose an improvement to the evaluation of the likelihood function, which normally takes linear time in the input size. Our evaluation method yields an O(log {ell}) speedup for typical trees. We use it to optimize the parameters of intron evolution on a known tree, but the method is generally applicable to phylogenetic problems where the likelihood is numerically optimized. The key idea is that it is enough to carry out the pruning algorithm once for every different labeling within a subtree. Different labelings within each subtree can be computed in a preprocessing step. Subsequent evaluations of the likelihood function with different model parameters are faster, and depend on the number of different labelings in the data set.

Here we describe how the preprocessing step can be carried out in O(n{ell}) time. Secondly, we analyze the computational complexity of subsequent evaluations, and show that an O(log {ell}) speedup is achieved for almost all random trees in the Yule–Harding model. The latter analysis produces some novel concentration results on the number of subtrees with a fixed size k.

The idea of identifying different subtree labelings was articulated by Larget and Simon (1998), without linear-time preprocessing or a theoretical result on the improvement in computation time. Kosakovsky Pond and Muse (2004) proposed a preprocessing method based on heuristic optimization of a traveling salesman problem arising from the comparison of subtree labelings between different characters. Linear-time preprocessing is proposed in a similar context by Stamatakis et al. (2002). Specifically, they proposed identifying characters in which leaves in a subtree have identical labels. The technique relies entirely on the fact that alignments of closely related sequences exhibit high levels of identity, and cannot be extended to non-identical labelings.

4.1 Yule–Harding model
The Yule–Harding distribution is encountered in random birth and death models of species and in coalescent models (Felsenstein, 2004), and is thus one of the most adequate random models for phylogenies. In one of the equivalent formulations, a random tree is grown by adding leaves one by one in a random order. The leaves are first numbered by using a random uniform permutation of the integers 1, 2, ... , n. Leaves are joined to the tree in an iterative procedure. In Step 1, the tree is just leaf 1 on its own. In Step 2, the tree is a ‘cherry’ with leaves 1 and 2. In each subsequent step i = 3, ... , n, a random leaf Yi is picked uniformly from the set {1, 2, ... , i – 1}. The new leaf i is added to the tree as the sibling of Yi, forming a cherry: a new node is placed on the edge leading to Yi and i is connected to it. The resulting random tree in iteration n has the Yule–Harding distribution (Harding, 1971).

4.2 Preprocessing
The likelihood can be computed faster by first identifying the different xi values, along with their multiplicity in the data. For large trees, the input typically consists of many different labelings, but for small trees with n > logr {ell}), the compression is useful, since the number of different xi values is bounded by rn < {ell}. In order to exploit the benefits of compression, we extend it to every subtree.

DEFINITION 1
Define the multiset of observed labelings for every node u as follows. Let n' denote the number of leaves in the subtree rooted at u, and let u1, ... , un' denote those leaves. Define {nu}u(y) for every labeling y isin Formula n' of the leaves in the subtree of u as the number of times y is observed in the data:


Formula

where X{·} denotes the indicator function that takes the value 1 if its argument is true, otherwise it is 0. Define also the set of observed labelings


Formula

The multisets of observed labelings are computed in the preprocessing step. The likelihood function is evaluated subsequently by computing the conditional likelihoods at each node u for the labelings of Formula u only, in O(|Formula u|) time.

In order to compute {nu}u, we use a recursive procedure. It is important to avoid working with the O(n)-dimensional vectors y of Definition 1 directly, otherwise the preprocessing may take superlinear time in n. For that reason, every labeling y isin Formula u is represented by the index i for which xi is the first occurrence of y. Accordingly, we compute an auxiliary array hi(u), which stores the first occurrence of each labeling xi in u's subtree. In particular, hi(u) = i' if i' is the smallest index such that xi(uk) = xi' (uk) for all k where uk are the leaves in u's subtree. Figure 2 shows that the values hi(u) can be computed in a postorder traversal. After hi(u) are computed for all i and u, the multiplicities {nu}u and observed labelings Formula u are straightforward to calculate in linear time. The map H in Lines C7–C11 is sparse with at most {ell} entries, and can be implemented as a hash table so that accessing and updating it takes O(Formula ) time. Consequently, Algorithm COMPRESS takes O(n{ell}) time.


Figure 2
View larger version (17K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. Computing the auxiliary arrays from which the multiplicities {nu}u are obtained.

 
4.3 Evaluating the likelihood function
After the preprocessing step, the conditional likelihoods are computed only for the different labelings within each subtree. Figure 3 shows the evaluation of the likelihood function. The running time for the algorithm is O(s) where s is the total number of different labelings within all subtrees: s = {sum}u |Formula u|. If nu denotes the number of leaves in the subtree rooted at u, then Formula u has at most Formula elements. Hence, s is bounded as


Formula 2

(2)
Observe that by sheer number of arithmetic operations, it is always worth evaluating the likelihood function this way. The worst situation is that of a caterpillar tree (where every inner node has a leaf child). In that case, there are only a few Formula non-leaf nodes for which we can compress the data, and it is possible to construct an artificial data set in which there are {ell} different labelings for nO(log {ell}) subtrees. Caterpillar trees, however, are rare in phylogenetic analysis. Typical phylogenies have fairly balanced subtrees (Aldous, 2001).


Figure 3
View larger version (17K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. Computing the log-likelihood using the observed labelings.

 
In what follows, we examine the bound of (2) more closely in the Yule–Harding model. The analysis relies on a characterization of the random number of subtrees with a given size, as expressed in Theorems 1 and 2.

THEOREM 1
Consider random evolutionary trees with n leaves in the Yule–Harding model. Let Ck denote the number of subtrees with k = 1, ... , n – 1 leaves in a random tree. The expected value of Ck is


Formula 3

(3)
Trivially, Formula Cn =1:

PROOF
Heard (1992) derives Equation (3) by appealing to a Pólya urn model. An equivalent result is stated by Devroye (1991).{square}

THEOREM 2
For all {varepsilon} > 0,


Formula 4

(4a)


Formula 5

(4b)

PROOF
Consider the random construction of the tree. Let Yi denote the random leaf picked in step i to which leaf i gets connected, for i = 3, 4, ... , n. Each random variable Yi is uniformly distributed over the set {1, 2, ... , i – 1}, and Y3, Y4, ... , Yn are independent. Moreover, they completely determine the tree T at the end of the procedure. Consequently, Ck is a function of (Yi : i = 3, ... , n): Ck = f(Y3, Y4, ... , Yn). The key observation for the concentration result is that if we change the value of only one of the Yi in the series, then Ck changes by at most two. In order to see this, consider what happens to the tree T, if we change the value of exactly one of the Yi from y to y'. Such a change corresponds to a ‘subtree prune and regraft’ transformation (Felsenstein, 2004). Specifically, the subtree Ti, defined as the child tree of the lowest common ancestor u of y and i containing the leaf i, is cut from T, and is reattached to one of the edges on the path from the root to y'. Now, such a transformation does not affect Ck by much. Notice that subtree sizes are strictly monotone decreasing from the root on every path. On the path from the root to u, subtree sizes decrease by the size {tau} of Ti, and u disappears, contributing a change of +1, 0 or (–1) to Ck. (At most one subtree of size {tau} + k that contains Ti now has size k, and at most one subtree of original size k is not counted anymore: it may be u's subtree itself, or a subtree above it.) An analogous argument shows that regrafting contributes a change of +1, 0, or (–1). Hence, the function f(·) is such that by changing one of its arguments, its value changes by at most 2. As a consequence, McDiarmid's inequality (1989) can be applied to bound the probabilities of large deviations for Ck. In particular, for all {varepsilon} > 0,


Formula

where Formula and


Formula

Since ci = 2 for all i, Formula , implying Equation (4a). An identical bound holds for the right-hand tail of the distribution.{square}

REMARK
The particular case of k = 2 was considered by McKenzie and Steel (2000). They showed that the distribution of C2 is asymptotically normal with mean n/3 and variance 2n/45. The result suggests that the best constant factor in the exponent of Equation (4) is 45/8 for k = 2, instead of Formula . Rosenberg (2006) gave exact formulas for the variance of Ck. He showed that the variance of Ck is Formula where 8/45 ≤ vk < 2 is a constant that depends on k. The variance formulas were given earlier in a different context by Devroye (1991). (See also the discussion by Blum and François (2005).) The result suggests that by analogy with the cherries, the best constant factor in the exponent is Formula . It is thus plausible that the probability is properly bounded by 1–o(n log–2 {ell}) in Theorem 3.

THEOREM 3
With probability Formula , the likelihood function can be evaluated for random trees in theYuleHarding model in Formula time after an initial preprocessing step that takes O(n{ell}) time. Evaluating the likelihood function takes O(n{ell}) time in the worst case, and O(n{ell} log1 {ell}) time on average.

We need the following lemma for the proof of Theorem 3.

LEMMA 4
For all t ≥ 4, Formula . For all t ≥ 1 and Formula .

PROOF
The proof is straightforward by induction in t. Notice that the right order of magnitude is Formula Formula but we need a bound for all t.{square}

PROOF OF THEOREM 3
The preprocessing (Fig. 2) takes O(n{ell}) time as discussed in Section 4.2. The evaluation of the likelihood function (Fig. 3) takes O(s) time. By Equation (2),


Formula 6

(5)
Let Formula . By Theorem 1 and Lemma 4, if {ell} ≥ 16 or r ≥ 3,


Formula 7

(6)
which proves our claim about the average running time. Now, let Formula . Plugging {varepsilon} into Theorem 2, we get that


Formula 8

(7)
Let Formula k denote the event that |CkFormula Ck| < {varepsilon} for k = 1, ... , t, and let Formula t + 1 denote the event that Formula Since Formula implies Formula t + 1. By (7),


Formula

where Formula denotes the complementary event to Formula k. Now, Formula also implies that Formula . Since the likelihood computation takes O(s) time, the theorem holds.{square}

Theorem 3 underestimates the actual speedups that the compression method brings about. Table 3 shows that compression results in a 50–500-fold speedup of the likelihood evaluation in practice. Notice also that the constants hidden behind the asymptotic notation are quite different between the preprocessing and evaluation steps: costly floating point operations are avoided in the preprocessing step. It is important to stress that the theoretical analysis does not rely on similarities in the input data: Theorem 3 and the bound of (2) hold for any sample of size {ell}. Real-life data are expected to behave even better, as Table 3 illustrates.


View this table:
[in this window]
[in a new window]

 
Table 3. Effect of the compression on three data sets

 
Even though the theorem holds for arbitrary alphabets, the compression is less effective for large alphabets (amino acids for example) in practice. For DNA sequences, however, it should still be valuable: we conjecture that compression would accelerate likelihood optimization by at least an order of magnitude.

We implemented LOGLIKELIHOOD, along with likelihood maximization and the posterior calculations of Section 3.2 in a Java package. Likelihood is maximized by setting loss and gain parameters for the edges, by using mostly the Broyden–Fletcher–Goldfarb–Shanno method (Press et al., 1997) whenever possible, and occasionally Brent's line minimization (Press et al., 1997) for each parameter separately. It is noteworthy that the algorithmic improvements reduced the likelihood optimization time to a few minutes on a common desktop computer for the data set of Section 5.2, and made the execution of a large number of simulations possible.


    5 APPLICATIONS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 IDENTIFICATION OF ORTHOLOGOUS...
 3 A LIKELIHOOD FRAMEWORK
 4 RAPID COMPUTATION OF...
 5 APPLICATIONS
 6 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
5.1 Ancient paralogs
In our first example, intron-aware alignment was used to reject the hypothesis of misleading protein alignments.

We used intron-aware alignment in a study about ancient eukaryotic paralogs (Sverdlov et al., 2007). In the study, 157 homologous gene families were examined across six species (Arabidopsis thaliana, Homo sapiens, Cxnorhabditis elegans, Drosophila melanogaster, Saccharomyces cerevisx and S. pombe). These families were notable because they contained paralogous members in multiple eukaryotic species, but not in prokaryotes, and, thus, presumably underwent duplication in the lineage leading to LECA. Ancient paralogs within and across species share very few (in the order of a few percentages) introns. The finding is quite surprising, as recent paralogs, resulting from lineage-specific duplications, and also orthologs between human and Arabidopsis agree much more in their intron sites (Rogozin et al., 2003). Consequently, ancient paralogs either lacked introns at the time of their duplication, or their duplication involved removal of introns.

In one of the data validation steps for the study, we used intron-aware alignment with very high intron match rewards. With larger rewards for intron matches, more introns line up in the alignment, but the protein alignment gets worse. Even by using huge intron match rewards at the expense of corrupted protein alignments, we were not able to achieve intron sharing levels similar to that of human-Arabidopsis orthologs. The lack of intron agreement is therefore not an artifact of the protein alignments.

5.2 Intron-rich ancestors
We compiled a data set with 18 eukaryotic species to give a comprehensive picture of spliceosomal evolution among Eukaryotes.

Data preparation
Genbank flatfiles and protein sequences were downloaded from RefSeq (Pruitt et al., 2007) and Ensembl (Hubbard et al., 2007). Exon-intron annotation was extracted from the Genbank flatfiles. The C. briggsx protein sequences and genome annotation were downloaded from WormBase (Bieri et al., 2007), and intron annotation was extracted from the GFF annotation file. Table 4 lists the data sources and the species abbreviations. We used the 684 ortholog sets, each corresponding to a cluster of orthologous groups, or KOG (Tatusov et al., 2003), from the study of Rogozin et al. (2003) as ‘seeds’ for compiling a set of putative orthologs for our species. For each seed (consisting of homologous protein sequences for eight species), we performed a position-specific iterated BLAST search (Altschul et al., 1997). In case of Plasmodium species, we used three iterations against a database of all protozoan peptide sequences in RefSeq. We used an E-value cutoff of 10–9 for retaining candidates. Each candidate hit was then used as a query in a reversed position-specific BLAST (rpsblast) search (Marchler-Bauer et al., 2007) against the KOG database. Candidates were retained after this point if they had the highest scoring hit (by rpsblast) against the same KOG as the KOG of the seed data, and they scored within 80% of the best such hit for the species.


View this table:
[in this window]
[in a new window]

 
Table 4. Data sources and species abbreviations

 
From the resulting set of paralogs, we selected a putative ortholog set in the following manner. For each KOG, we selected all possible triples of human–Arabidopsis–Saccharomyces paralogs and kept the triple that had the highest score in alignments built by MUSCLE (Edgar, 2004). Alignment score was computed using the VTML240 matrix (Müller et al., 2002). Additional putative orthologs were added for one species at a time, by aligning each paralog individually to the current profile, and keeping the one that gave the largest alignment score. At this iterative addition, scoring was done with the VTML240 matrix, by summing the five-highest-pairwise scores between a candidate and already included sequences.

The resulting sets were then realigned using MUSCLE, and then realigned again using our intron-aware alignment with a gap penalty of 300, gap-extend penalty of 11, VTML240 amino acid scoring, intron-match scores of 300 and intron-mismatch penalties of 20. (The latter were established using different intron-match and -mismatch scores, and selecting the ones that gave the fewest number of intron sites while decreasing the score of the implied protein alignment by less than 0.1%.)

Conserved portions were extracted using our segmentation program with a complexity penalty of {alpha} = 400 (larger values gave identical segmentation results, and lower values resulted in too many scattered blocks). We penalized indels with an infinitely large value to exclude gap columns. Phase-0 introns falling on the boundaries of conserved blocks were excluded. Intron presence and absence in the aligned data was then extracted to produce the data for the likelihood programs.

The final data set comprises 8044 intron sites in 483 sets of orthologs.

Results
By extending the data of Rogozin et al. (2003), we wanted to examine the effects of increased taxon sampling in the analysis of intron evolution. In the course of that scrutiny, we were also able to refine the findings of previous studies concerning the dynamics of intron evolution.

Figure 4 shows the intron densities for ancestral species. Ancestral nodes such as the bilaterian ancestor (node 4), the ecdysozoan ancestor (node 5), the opisthokont ancestor (node 15) and the bikont ancestor (node 16) all have high intron densities, exceeding previous estimates of Rogozin et al. (2003) and Csurös (2005). Honey-bee genes (IHBSC, 2006) indicate a more intron-rich ecdysozoan ancestor (7 introns per gene) than what was known before, and surpass even the generous estimate (6.1 introns per gene) of Roy and Gilbert (2005). (By removing the honey-bee data, our estimate drops by 25%.) Intron density in the bilaterian ancestor is estimated to be almost as high as in humans. Sequences of a handful genes from a certain marine annelid have already indicated that the bilaterian ancestor had at least two-thirds as many introns as vertebrates (Raible et al., 2005).


Figure 4
View larger version (19K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4. Phylogeny for the 18 species and estimated intron counts at ancestors. Ancestral nodes are identified by the numbers 1–16. Double lines denote significant net gains along branches directed away from node 16. Framed terminal taxa appear in the data set of Rogozin et al. (2003). Error bars denote 95% confidence intervals computed from 1000 simulated data sets. The scale for introns per gene on the right-hand side is based on the average of 8.82 introns per gene in humans (Jeffares et al., 2006).

 
Most branches are characterized by net intron loss. Figure 5 shows the estimated changes in a few lineages. Intron evolution has been much slower in the vertebrate lineage than in most other lineages: in more than 380 million years since the divergence with fishes, only about 3% of our introns got lost. Fungi, for example, massively trimmed their introns in many lineages as can be seen by the substantial decreases from node 12 onwards, covering a comparable time period. A notable exception is C. neoformans, which seems to have gained introns, but that assessment may change if more basidiomycetes are included.


Figure 5
View larger version (20K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5. Intron gains and losses in a few evolutionary paths. Estimated and actual intron counts are in parentheses. Gains and losses (top and bottom) are indicated on each branch.

 

    6 CONCLUSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 IDENTIFICATION OF ORTHOLOGOUS...
 3 A LIKELIHOOD FRAMEWORK
 4 RAPID COMPUTATION OF...
 5 APPLICATIONS
 6 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
We presented an alignment technique for establishing intron orthology, and a likelihood framework in which intron evolutionary events can be quantified. We described a compression method for the evaluation of the likelihood function, which has been extremely valuable in practice. We also showed that the compression leads to sublinear running times for likelihood evaluation.

We illustrated our methods for analyzing intron evolution with a large and diverse set of eukaryotic organisms. The data set is more comprehensive than any used in other studies published to this day. The data indicate that ancestral eukaryotic genomes were more intron-rich than previous studies suggested.

Many circumstances influence intron loss (Jeffares et al., 2006; Roy and Gilbert, 2006), and realistic likelihood models need to introduce rate variation (Carmel et al., 2005). Usual rate variation models (Felsenstein, 2004) entail multiple evaluations of the likelihood function, and, thus, underline the importance of computational efficiency. We believe that the proposed methods will help to produce and analyze large data sets even within complicated likelihood models.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 IDENTIFICATION OF ORTHOLOGOUS...
 3 A LIKELIHOOD FRAMEWORK
 4 RAPID COMPUTATION OF...
 5 APPLICATIONS
 6 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
M.Cs. is supported by a grant from the Natural Sciences and Engineering Research Council of Canada. I.B.R. was supported by the NLM/NIH/DHHS Intramural Research Program.

Conflict of Interest: none declared.


    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 IDENTIFICATION OF ORTHOLOGOUS...
 3 A LIKELIHOOD FRAMEWORK
 4 RAPID COMPUTATION OF...
 5 APPLICATIONS
 6 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Adachi J, Hasegawa M. MOLPHY version 2.3: programs for molecular phylogenetics based on maximum likelihood. In: Vol. 28 of Computer Science Monographs (1995) Tokyo, Japan: Institute of Statistical Mathematics. 1–150.

    Aldous D. Stochastic models and descriptive statistics for phylogenetic trees, from Yule to today. Stat. Sci (2001) 16:23–34.[CrossRef][Web of Science]

    Altschul SF, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res (1997) 25(17):3389–3402.[Abstract/Free Full Text]

    Bieri T, et al. WormBase: new content and better access. Nucleic Acids Res (2007) 35(Suppl. 1):D506–510.[Abstract/Free Full Text]

    Blum MGB, Francois O. On statistical tests of phylogenetic tree imbalance: the Sackin and other indices revisited. Math. Biosci (2005) 195:141–153.[CrossRef][Web of Science][Medline]

    Carmel L, et al. An expectationmaximization algorithm for analysis of evolution of exon-intron structure of eukaryotic genes. Lec. Notes in Comput. Sci (2005) 3678:35–46.[CrossRef]

    Collins L, Penny D. Complex spliceosomal organization ancestral to extant eukaryotes. Mol. Biol. Evol (2005) 22(4):1053–1066.[Abstract/Free Full Text]

    Coulombe-Huntington J, Majewski J. Characterization of intron loss events in mammals. Genome Res (2007) 17(1):23–32.[Abstract/Free Full Text]

    Csurös M. Maximum-scoring segment sets. IEEE/ACM Transactions on Computational Biology and Bioinformatics (2004) 1(4):139–150.[CrossRef]

    Csurös M. Likely scenarios of intron evolution. Lec. Notes in Comput. Sci (2005) 3678:47–60.[CrossRef]

    Devroye L. Limit laws for local counters in random binary search trees. Random Struct. Algor (1991) 2(1):303–315.

    Durbin R, et al. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids (1998) UK: Cambridge University Press.

    Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res (2004) 32(5):1792–1797.[Abstract/Free Full Text]

    Felsenstein J. Evolutionary trees from DNAsequences: a maximum likelihood approach. J. Mol. Evol (1981) 17:368–376.[CrossRef][Web of Science][Medline]

    Felsenstein J. Phylogenies from restriction sites, a maximum likelihood approach. Evolution (1992) 46:159–173.[CrossRef][Web of Science]

    Felsenstein J. Inferring Pylogenies (2004) Sunderland, Massacheusetts, USA: Sinauer Associates.

    Friedman N, et al. A structural EM algorithm for phylogenetic inference. J. Comput. Biol (2002) 9(2):331–353.[CrossRef][Web of Science][Medline]

    Guindon S, Gascuel O. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol (2003) 52(5):696–704.[CrossRef][Web of Science][Medline]

    Harding E. The probabilities of rooted tree-shapes generated by random bifurcation. Adv. Appl. Probab (1971) 3:44–77.[CrossRef]

    Heard SB. Patterns in tree balance among cladistic, phenetic, and randomly generated phylogenetic trees. Evolution (1992) 46(6):1818–1826.[CrossRef][Web of Science]

    Hubbard TJP, et al. Ensembl 2007. Nucleic Acids Res (2007) 35(Suppl. 1):D610–617.[CrossRef][Web of Science][Medline]

    IHBSC. Insights into social insects from the genome of the honey bee Apis mellifera. Nature (2006) 443:931–949.[CrossRef][Medline]

    Jeffares DC, et al. The biology of intron gain and loss. Trends Genet (2006) 22(1):16–22.[CrossRef][Web of Science][Medline]

    Kececioglu J, Zhang W. Aligning alignments. Farach-Colton M, ed. (1998) Proceedings of CPM, Vol. 1448 of LNCS. Springer. 189–208.

    Kosakovsky Pond SL, Muse SV. Column sorting: rapid calculation of the likelihood function. Syst. Biol (2004) 53(5):685–692.[CrossRef][Web of Science][Medline]

    Larget B, Simon DL. Faster likelihood calculations on trees. In: Technical Report 98-02 (1998) Pittsburgh, Pennsylvania, USA: Department of Mathematics and Computer Science, Duquesne University.

    Ma B, et al. Alignment between two multiple alignments. Baeza-Yates RA, Chávez E, Crochemore M, eds. (2003) Proceedings of CPM, Vol. 2676 of LNCS. Springer. 254–265.

    Marchler-Bauer A, et al. CDD: a conserved domain database for interactive domain family analysis. Nucleic Acids Res (2007) 35(Suppl. 1):D237–240.[Abstract/Free Full Text]

    McDiarmid C. On the method of bounded differences. In: Surveys in Combinatorics—Siemons J, ed. (1989) Cambridge University Press. 148–184.

    McKenzie A, Steel M. Distributions of cherries for two models of trees. Mathe. Biosci (2000) 164:81–92.[CrossRef]

    Müller T, et al. Estimating amino acid substitution models: a comparison of Dayhoff's estimator, the resolvent approach and a maximum likelihood method. Mol. Biol. Evol (2002) 19(1):8–13.[Abstract/Free Full Text]

    Nguyen HD, et al. New maximum likelihood estimators for eukaryotic intron evolution. PLoS Comput. Biol (2005) 1(7):e79.[CrossRef][Medline]

    Nielsen CB, et al. Patterns of intron gain and loss in fungi. PLoS Biol (2004) 2(12):e422.[CrossRef][Medline]

    Nixon JEJ, et al. A spliceosomal intron in Giardia lamblia. Proc. Nat. Acad. Sci. USA (2002) 99(6):3701–3705.[Abstract/Free Full Text]

    Press WH, et al. Numerical Recipes in C: The Art of Scientific Computing (1997) 2nd. Cambridge University Press.

    Pruitt KD, et al. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res (2007) 35(Suppl. 1):D61–65.[Abstract/Free Full Text]

    Raible F, et al. Vertebrate-type intron-rich genes in the marine annelid Platynereis dumerilii. Science (2005) 310:1325–1326.[Abstract/Free Full Text]

    Rogozin IB, et al. Remarkable interkingdom conservation of intron positions and massive, lineagespecific intron loss and gain in eukaryotic evolution. Curr. Biol (2003) 13:1512–1517.[CrossRef][Web of Science][Medline]

    Rogozin IB, et al. Analysis of evolution of exon-intron structure of eukaryotic genes. Brief. Bioinformat (2005) 6(2):118–134.[Abstract/Free Full Text]

    Rosenberg NA. The mean and variance of r-pronged nodes and r-caterpillars in Yule-generated genealogies. Ann. Combinatorics (2006) 10:129–146.[CrossRef]

    Roy SW, Gilbert W. Complex early genes. Proc. Nat. Acad. Sci. USA (2005) 102(6):1986–1991.[Abstract/Free Full Text]

    Roy SW, Gilbert W. The evolution of spliceosomal introns: patterns, puzzles and progress. Nat. Rev. Genet (2006) 7:211–221.[CrossRef][Web of Science][Medline]

    Roy SW, Penny D. Large-scale intron conservation and order-ofmagnitude variation in intron loss/gain rates in apicomplexan evolution. Genome Res (2006) 16(10):1270–1275.[Abstract/Free Full Text]

    Roy SW, Penny D. Patterns of intron loss and gain in plants: Intron loss-dominated evolution and genome-wide comparison of O. sativa and A. thaliana. Mol. Biol. Evol (2007) 24(1):171–181.[Abstract/Free Full Text]

    Stamatakis AP, et al. AxML: Afast program for sequential and parallel phylogenetic tree calculations based on the maximum likelihood method. (2002) Proceedings of the IEEE Computer Society Bioinformatics Conference (CSB). 21–28.

    Steel MA. Recovering a tree from the leaf colourations it generates under a Markov model. Appl. Math. Lett (1994) 7:19–24.

    Sverdlov AV, et al. Conservation versus parallel gains in intron evolution. Nucleic Acids Res (2005) 33(6):1741–1748.[Abstract/Free Full Text]

    Sverdlov AV, et al. A glimpse of a putative pre-intron phase of eukaryotic evolution. Trends Genet (2007) 23(3):105–108.[CrossRef][Web of Science][Medline]

    Tatusov RL, et al. The COG database: an updated version includes eukaryotes. BMC Bioinformatics (2003) 4:441.

    Vanácová S, et al. Spliceosomal introns in the deep-branching eukaryote Trichomonas vaginalis. Proc. Nat. Acad. Sci. USA (2005) 102(12):4430–4435.[Abstract/Free Full Text]

    Zhang Z, Berman P, Wiehe T, Miller W. Post-processing long pairwise alignments. Bioinformatics (1999) 15(12):1012–1019.[Abstract/Free Full Text]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
M. Csuros
Malin: maximum likelihood analysis of intron evolution in eukaryotes
Bioinformatics, July 1, 2008; 24(13): 1538 - 1539.
[Abstract] [Full Text] [PDF]


Home page
Proc. Natl. Acad. Sci. USAHome page
R. Tarrio, F. J. Ayala, and F. Rodriguez-Trelles
From the Cover: Alternative splicing: A missing piece in the puzzle of intron gain
PNAS, May 20, 2008; 105(20): 7223 - 7228.
[Abstract] [Full Text] [PDF]


Home page
Mol Biol EvolHome page
M. Csuros, I. B. Rogozin, and E. V. Koonin
Extremely Intron-Rich Genes in the Alveolate Ancestors Inferred with a Flexible Maximum-Likelihood Approach
Mol. Biol. Evol., May 1, 2008; 25(5): 903 - 911.
[Abstract] [Full Text] [PDF]


Home page
Mol Biol EvolHome page
S. W. Roy and M. Irimia
Rare Genomic Characters Do Not Support Coelomata: Intron Loss/Gain
Mol. Biol. Evol., April 1, 2008; 25(4): 620 - 623.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
M. Irimia and S. W. Roy
Spliceosomal introns as tools for genomic and evolutionary analysis
Nucleic Acids Res., March 1, 2008; 36(5): 1703 - 1712.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Google Scholar
Right arrow Articles by Csurös, M.
Right arrow Articles by Rogozin, I. B.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Csurös, M.
Right arrow Articles by Rogozin, I. B.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?