Bioinformatics Advance Access originally published online on November 15, 2006
Bioinformatics 2007 23(3):289-297; doi:10.1093/bioinformatics/btl578
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Indelign: a probabilistic framework for annotation of insertions and deletions in a multiple alignment
Department of Computer Science, University of Illinois, Urbana-Champaign Urbana, IL, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: A quantitative study of molecular evolutionary events such as substitutions, insertions and deletions from closely related genomes requires (1) an accurate multiple sequence alignment program and (2) a method to annotate the insertions and deletions that explain the gaps in the alignment. Although the former requirement has been extensively addressed, the latter problem has received little attention, especially in a comprehensive probabilistic framework.
Results: Here, we present Indelign, a program that uses a probabilistic evolutionary model to compute the most likely scenario of insertions and deletions consistent with an input multiple alignment. It is also capable of modifying the given alignment so as to obtain a better agreement with the evolutionary model. We find close to optimal performance and substantial improvement over alternative methods, in tests of Indelign on synthetic data. We use Indelign to analyze regulatory sequences in Drosophila, and find an excess of insertions over deletions, which is different from what has been reported for neutral sequences.
Availability: The Indelign program may be downloaded from the website http://veda.cs.uiuc.edu/indelign/
Supplementary information: Supplementary material is available at Bioinformatics online.
Contact: sinhas{at}uiuc.edu
| 1 INTRODUCTION |
|---|
|
|
|---|
The recent publication of genome sequences of several closely related species (National Human Genome Research Institute, http://www.genome.gov/11008080, 2006) provides a great opportunity to study the molecular events underlying evolution. It is now possible to do large-scale comparison of homologous sequences and measure the extent of sequence evolution due to various categories of change, such as substitutions, insertions and deletions, and then look for interesting patterns. For instance, neutral sequences in Drosophila, such as non-functional transposons or pseudogenes, show an excess of deletions over insertions (Petrov and Hartl, 1998), while our recent work (Sinha and Siggia, 2005) found evidence for a different pattern, namely, an excess of insertions over deletions, in regulatory sequences. These preliminary findings have raised the question whether there is a clear difference in patterns of evolutionary events between neutral and functional non-coding sequences, either due to mechanistic reasons or due to selection. An answer to this question is of fundamental scientific significance, and an affirmative answer may help pinpoint functionality of certain kinds in non-coding sequence.
The basic requirement for a study of evolutionary events are as follows: (1) an alignment program and (2) an annotation program to identify the insertions and deletions in the alignment. Existing alignment tools such as Blastz (Schwartz et al., 2004) may be assumed to return broad areas of homology (
1 kb or longer), which can be accurately aligned by programs such as MLagan (Brudno et al., 2003) and TBA (Blanchette et al., 2004a). The annotation program should then identify the substitutions, insertions and deletions so as to explain the gaps in the alignment, and be efficient enough to do this on genomic scales, for modest number of species. This paper describes a new algorithm with this functionality in a probabilistic framework.
The standard way to quantify evolutionary events currently is a two step approach: (1) obtain a multiple alignment of the sequences, and (2) annotate the insertions and deletions in the alignment using maximum parsimony criteria (Blanchette et al., 2004b; Fredslund et al., 2003; Sinha et al., 2004). This approach has the following problems, however:
- The maximum parsimony approach does not afford a principled way to weigh insertions and deletions of various lengths against each other, and against substitutions, when counting events. Its results are expected to be inferior to that from a more general likelihood-based approach that incorporates substitutions, insertions and deletions in a unified evolutionary model.
- The alignment step is completely decoupled from the annotation step. However, if there is a prior expectation about rates of various events, it may be possible to obtain an alignment that is more in tune with the prior expectation, if alignment and annotation were integrated. The converse is also possible, i.e. to estimate various evolutionary parameters based on the observed alignment (over large number of samples).
- The alignment step is often based on an underlying dynamic programming algorithm to optimize a scoring function that does not explicitly use a model of evolution, and makes no distinction between insertions and deletions.
In this paper, we propose a new algorithm, called Indelign, that addresses the above problems in current strategies for recording evolutionary events. Let the term evolutionary history refer to a multiple alignment and an annotation of insertions/deletions along the branches of the phylogenetic tree, consistent with the gaps in the alignment. Indelign evaluates an evolutionary history by its likelihood of being generated by a model, whose parameters are the rates of substitution, insertion and deletion, and length distributions of insertions and deletions. The Indelign program can be used for any of the following goals, for three or more species, given their phylogeny with relative branch lengths:
- Given a multiple alignment, annotate the insertions and deletions on each branch of the phylogeny so as to maximize the likelihood of the resulting evolutionary history.
- Given a multiple alignment that is close to optimal, make limited changes to the alignment so that the resulting evolutionary history is consistent with the assumed model parameters.
- Given a set of multiple alignments produced by similar evolutionary processes, learn the values of the model parameters and the most likely evolutionary history simultaneously.
An interesting application of Indelign will be to fit the model parameters on multiple alignment of closely related species, such as the seven sequenced strains of Drosophila simulans, where the alignments are accurate, and then apply the trained model to infer evolutionary histories of more diverged species. This application has been proposed earlier in Keightley and Johnson (2004). The accurate annotation of insertions and deletions, which may involve inferring the sequences at the internal nodes of the phylogeny, will also play an important role in the emerging field of ancestral genome reconstructions (Blanchette et al., 2004b).
We prefer to designate our program as an indel annotation and realignment program. Its goal is not to search the entire space of multiple alignments; rather to start from a reasonably good alignment, obtained from existing multiple alignment programs, and improve it to get accurate estimates of insertions and deletions. The major thrust of our work is on accurate annotation of indels.
1.1 Previous work
Our work has similarities to the MCAlign algorithm of Keightley and Johnson (2004), which finds a maximum likelihood alignment of non-coding sequences using an evolutionary model, whose parameters include rates of substitutions and indels, and the length distribution of indels. The possibility of multiple indels happening at the same locus is ignored in MCAlign, as in Indelign. However, MCAlign is implemented for at most three species, whereas Indelign is completely general in terms of number of species, and scales well with the number of species. The separate modeling of insertions and deletions is an important feature of Indelign, which is absent in MCAlign. Moreover, MCAlign assumes that the model parameters are known, e.g. from multiple alignments of more closely related species, whereas Indelign allows for the estimation of the parameters from input data using an iterative algorithm. We note that MCAlign is an alignment program whereas Indelign is developed primarily as an indel annotation program.
Fredslund et al., (2003) and Blanchette et al. (2004b) have proposed parsimony-based algorithms for annotating insertions and deletions. Indelign, on the other hand, finds the best annotation by maximum likelihood using an evolutionary model that integrates insertions, deletions and substitutions in a principled manner.
Statistical approaches to phylogenetic inference have a rich history going back to Jukes and Cantor (1969), Kimura (1980) and Felsenstein (1981), where the process of nucleotide substitution was modeled at various levels of biological realism. Indels were first included in a rigorous probabilistic model by Thorne et al. (1991), who defined a continuous-time, time-reversible evolutionary process with single nucleotide indels. Later developments by Holmes and Bruno (2001), Metzler (2003) and Miklós et al. (2004) built on this model, or its extension to longer indels [TKF92 (Thorne et al., 1992)], to provide statistical alignment algorithms that can allow for accurate inference of evolutionary histories, but these methods are unlikely to scale to genome-wide analysis. Indelign takes a pragmatic approach to the problem, with an explicit goal of summarizing evolutionary event statistics in three or more species, for the restricted but extensively studied domain of closely related genomes, e.g. the mammalian genomes (Blanchette et al., 2004b) or the fruitfly genomes. Importantly, Indelign, similar to MCAlign, provides the advantage of allowing arbitrary length distributions of indels (which may optionally be trained from the data). This aspect of the model makes the computational tasks of training the model and evaluating likelihoods challenging, and Indelign implements new ideas for solving the problem efficiently.
Several interesting ideas on evolutionary models that have inspired our work may be found in simulation programs developed by Cartwright (2005), Stoye et al. (1998) and Pollard et al. (2004). These programs incorporate fairly sophisticated evolutionary models, including separate and flexible treatment of insertions and deletions, but differ from our work in that their goal is to generate synthetic sequence data related by a phylogeny, rather than to annotate existing alignments with their evolutionary events.
| 2 MODEL AND LIKELIHOOD |
|---|
|
|
|---|
We begin with an input of (1) a multiple alignment of sequences, each sequence being a string in
L, where
= {A,C,G,T,} and L is the alignment length, as well as (2) a phylogenetic tree T = (V, E), where V = VL
VI, VL is the set of leaf nodes, VI is the set of internal nodes and E is the set of branches. Let Sv be the sequence at node v, and is known for all v
VL.
2.1 Formalizing evolutionary histories
A complete evolutionary history (CEH) assigns to each node v
VI a sequence Sv
L (Fig. 1A). An indel evolutionary history (IEH) assigns to each node v
VI a sequence Sv
{*,}L, i.e. it only specifies whether each position of Sv is a gap () or an unknown nucleotide (*) (Fig. 1A). An IEH can therefore correspond to a large number of CEHs (which may be exponential in the sequence length), obtained by instantiating each of the *s at the internal nodes in one of four ways. An IEH (or CEH) uniquely specifies the insertion, deletion and alignment (orthology) events on all branches e
E, as per the following rules (Fig. 1B). [P(e) and C(e) denote the parent and child nodes of edge e, respectively.]
- A deletion (D, e, l, r), on branch e
E, located at position l to r, is possible only if (1) the child node sequence has only gaps in those positions (i.e. SC(e)[l...r] is all gaps) and (2) the parent node sequence has non-gaps at positions l and r (i.e. SP(e)[l]
and SP(e)[r]
).
- An insertion (I, e, l, r), on branch e
E, located at position l to r, is possible only if (1) the parent node sequence has only gaps in those positions (i.e. SP(e)[l...r] is all gaps) and (2) the child node sequence has non-gaps at positions l and r (i.e. SC(e)[l]
and SC(e)[r]
).
- An alignment (A, e, p) exists at position p on branch e iff SC(e)[p] and SP(e)[p] are both non-gaps. Additionally, for a CEH, if SP(e)[p]
SC(e)[p], a substitution is said to have occurred.
- For every branch e, and every position p, either (1) there exists either an alignment (A, e, p) or an indel (D/I, e, l, r) such that p
[l, r], or (2) SP(e) = SC(e) = .
|
2.2 Evolutionary model
Overview. The evolutionary model describes the probabilities of various event types (substitution, insertion and deletion) along each edge of the phylogeny. Under the model, every base in the ancestral sequence is prescribed a certain probability of (1) being substituted (or remaining conserved), (2) being the start position of a deletion or (3) having an insertion occur immediately preceding it. The probabilities of these different event types depend on the branch length (evolutionary time), event type and various model parameters. Important properties of the model are noted at the end of the following paragraph. For computational tractability, we impose two intuitively justified restrictions on the evolutionary histories that we will consider (and evaluate under the model); these are presented in Section 2.3.
The likelihood of a CEH
is defined recursively as follows (
is the sequence assigned in
to node v):
|
| (1) |
SC(e)) is prescribed by an evolutionary model that has a hidden Markov model (HMM) structure defined by the parent sequence SP(e). This model is illustrated in Figure 1C. SP(e) is first trimmed to remove all gaps, and for each position i in the trimmed sequence
SC(e)) on branch e. We next note a few important features of the evolutionary model. (1) The length distributions of insertions and deletions are unconstrained parameters of the model. (2) The indel process is in discrete time, i.e. an insertion (or deletion) is created with a fixed probability that depends on the time spanned by the entire branch. (3) As in Dawg (Cartwright, 2005), there are no restrictions on the relative rates of insertions and deletions, which implies that the model is not time reversible. (4) Indels do not happen inside other indels on the same branch. This is a realistic assumption for species not too far diverged, and leads to a greatly simplified and tractable model. Note, however, that the HMM model does allow multiple indels adjacent to each other.
The model prescribes computation of the probability Pr(SP(e)
SC(e)) [from Equation (1)] as
|
| (2) |
2.2.1 Substitution model
The term Ae in Equation 2 may use any time-dependent single nucleotide substitution model. Our current implementation uses the continuous-time Felsenstein model (Felsenstein, 1981). (This choice was arbitrary, and it is straightforward to implement other substitution models if necessary.) This model has a single parameter u which, along with the stationary distribution of nucleotides {
A,
C,
G,
T} defines the transition probabilities for any branch of length t (in arbitrary units):
|
| (3) |
|
| (4) |
2.3 Restrictions on evolutionary histories
Here, we describe two additional restrictions imposed by Indelign on the evolutionary histories considered. We first introduce some terminology regarding multiple alignments. (Fig. 2A, illustration.) A sequence at a leaf node is called an extant sequence; sequences at internal nodes are called ancestral sequences. Any contiguous stretch of gaps in any extant sequence is called a hole. Any column of the multiple alignment where a hole begins or ends in any sequence is called a hole-boundary. Any stretch of alignment columns that is bordered by two successive hole-boundaries, or by the alignment boundary and an adjacent hole-boundary, is called a block. Thus, a typical multiple alignment is a concatenation of multiple blocks. Two adjacent blocks are said to be mutually-dependent if both blocks contain a hole in the same sequence, e.g. in Figure 2A, blocks DE and EF are mutually dependent, and so are blocks EF and FG.
|
The following two constraints are imposed on any valid IEH or CEH by Indelign. (IEH and CEH were defined at the beginning of Section 2.1.)
Assumption 1: Any two aligned nucleotides in the IEH must share a common ancestral nucleotide. In other words, neither of the two aligned nucleotides may be the result of an insertion event. (See Fig. 2B, cases 13, for some examples of evolutionary histories prohibited by this assumption.) This assumption respects the implicit semantics of an aligned position being an evolutionarily orthologous position, and has been used in other related work (Blanchette et al., 2004b).
Assumption 2: An indel event begins and ends at hole boundaries. This implies that if we align the ancestral sequences with the given alignment of the extant sequences, the hole boundaries and block locations will remain unchanged. (Fig. 2B, case 4, for an example. The single hole BD in the second species is not allowed to result from two separate deletions BC and CD, as per Assumption 2.)
Claim 1: If two adjacent blocks are not mutually dependent, an IEH cannot have an indel event spanning the two blocks, on any branch of the tree. (This follows from the two assumptions above; see Supplementary Material.)
2.4 Block level representation of sequences
Let l and r be the boundary positions of a block. By Assumption 2and Claim 1, the subsequence Sv[l...r], for any node v, is either a string of gaps or a string of nucleotides (known for extant sequences, unknown for ancestral sequences). This allows us to rewrite all sequences (assigned to nodes in an IEH) at a block-level, as follows. Let B = (l, r) denote the block from position l to r, {Bi}i = 1...k be the sequence of blocks in the multiple alignment, and Sv[Bi]
Sv[li...ri], which implies Sv = Sv[B1]·Sv[B2]
Sv[Bk]. We define
- Rv[Bi] = * if Sv[Bi] is all *s, (* is an unknown nucleotide.)
- Rv[Bi] = if Sv[Bi] is all gaps, and
- Rv[Bi] = N if Sv[Bi] is all nucleotides.
Rv[Bk]. Thus,- an IEH is simply an assignment of each Rv[Bi] as * or ,
v
VI,
Bi, and
- a CEH
is a pair (
, µ
), where 
is an IEH, and µ
is a mapping from each Rv[Bi] = * to a string Sv[Bi]
{A, C, G, T}rili+1 (Fig. 2C).
2.5 Enforcing Assumption 1
To find the maximum-likelihood IEH, we must only find the optimal labeling of each block at each internal node as a * or and do not need to compute the substitution terms (Ae) of Equation 2. This is a consequence of Assumption 1, as proved in the Supplementary Material. It implies that in finding the maximum-likelihood IEH, we do not need to sum over all possible CEHs consistent with a candidate IEH. It is therefore a key result for the efficiency of our algorithm. The substitution terms of the likelihood (Equation 2) can be computed separately for each column of the alignment, using Felsenstein's post-order traversal algorithm, thereby completing the likelihood computation.
Assumption 1 also restricts the space of IEHs that we have to search in order to find the maximum-likelihood solution. Two specific restrictions that follow from the assumption are as follows: (1) For any block B, for any internal node v, if its two child subtrees each have a node labeled as non-gap, then v itself must be labeled with a non-gap. (2) For any block B, if two nodes are labeled as non-gaps, and one of them is an ancestor of the other, then there may be no gap-labeled node on the path between them. These two restrictions are enforced by running algorithm FIXSTARSONTREE to fix which of the internal nodes must be labeled as non-gap (*). (see Supplementary material for details.)
| 3 ALGORITHM |
|---|
|
|
|---|
The Indelign program has the following components.
- ANNOTATE: Given a multiple alignment, this obtains an IEH with maximum likelihood.
- SEARCH: Given an initial multiple alignment, this makes limited changes to the alignment so as to obtain an IEH with maximum likelihood.
Overview: We describe below how the ANNOTATE component can find the maximum-likelihood IEH either by naively evaluating all possible IEHs (Section 3.1.1), or by using an algorithmic technique called dynamic programming (Section 3.1.2). A crucial step here is to identify each block (or sequence of blocks) that can be annotated independently of other blocks, and to use a divide and conquer strategy that solves each subproblem (annotation of a sequence of blocks) separately before merging the solutions into an overall annotation. This is done without sacrificing correctness of the solution. The main idea behind the SEARCH component (Section 3.3) is to explore all alignments that maybe obtained by making a limited class of modifications to the current alignment, find the best scoring among such neighboring alignments, and update the current alignment to this best scoring neighbor. This update is then repeated as long as improvements are obtained.
3.1 Annotate
The ANNOTATE algorithm can be run with or without complete knowledge of the model parameters. In Sections 3.1.1 and 3.1.2, we assume that the parameter values are known a priori. If not known, they are learned from the data, as described in Section 3.2.
3.1.1 Enumerative algorithm
This simple algorithm enumerates all possible IEHs, using the algorithm FIXSTARSONTREE (Supplementary material) to enforce Assumption 1 and restrict the search space. It computes the likelihood of each IEH, and outputs the maximum-likelihood IEH. For any block B, at any internal node v, there are at most two possibilities for the label Rv[B]( or *). In a full binary tree with n leaves, there are n 1 internal nodes. Hence there are at most 2n1 possible IEHs for the block. Now consider a maximal sequence of blocks where every adjacent pair of blocks is mutually-dependent (e.g. blocks DE, EF, FG in Fig. 2A). The maximum-likelihood IEH of the entire multiple alignment must include the maximum-likelihood IEH of this sequence of blocks, since no other blocks are mutually dependent with them. (Claim 1 in Section 2.3 ensures that there can be no indel on any branch that straddles across the boundaries of this sequence of blocks.) There are at most 2k(n1) IEHs, where k is the number of blocks, and therefore the enumerative algorithm has complexity O(2k(n1) n). (The extra factor of n is due to the O(n) time complexity of evaluating each IEH.) The time complexity of the finding an IEH of the entire multiple alignment is the sum of the contributions from each such maximal sequence of mutually dependent blocks.
3.1.2 Dynamic programming algorithm
Indelign also implements a dynamic programming algorithm to find the maximum-likelihood annotation. As in the enumerative algorithm, it operates on each maximal sequence of blocks where every adjacent pair of blocks is mutually dependent. (This follows from Claim 1 in Section 2.3.) Let k be the number of blocks in such a sequence. The algorithm must assign to each internal node v, a string Rv
{*,}k. The dynamic programming algorithm associates with each node v
VI, a table dpv with 2k entries. Each entry of dpv is indexed by a string Rv
{*,}k, and records the likelihood of the maximum-likelihood IEH for the subtree rooted at v if node v was labeled with Rv. Let e1 and e2 be the edges from v to its two child nodes c1 and c2, respectively. Given the labels r1 = Rc1 and r2 = Rc2 assigned for these k blocks to the two child nodes, it is straightforward to compute the insertion and deletion events on each branch, and their likelihood contributions Ie1(Rv, r1)De1(Rv, r1) and Ie2(Rv, r2)De2(Rv, r2) as functions of the start label Rv and end label (r1 or r2). The dynamic programming recurrence then is:
|
| (5) |
3.2 Estimation of parameters
In the previous section, we saw how a maximum-likelihood IEH is computed, given the model parameters. These parameters include (1) the single parameter u in the F81 substitution model; (2) the insertion and deletion probabilities pI and pD, respectively (for each branch e), and (3) the probability distributions on lengths of insertions and deletions. Here, we describe how the model parameters are estimated from an IEH. If the parameter values are not known a priori, ANNOTATE starts with an estimate based on the input alignment, then iteratively computes the maximum-likelihood IEH and uses the computed IEH to re-estimate the parameter values, until convergence.
To estimate u, we take the two closest related sibling species from the phylogeny, and count the number of times (N
ß) that residue
in the first species aligns with residue ß in the second, for all
, ß. Let t be the total branch length between the two species. The contribution of the substitution events to the log likelihood is log 
ß Pr(
ß|t)N
ß. Using Equations 3 and 4, and differentiating with respect to u, we get
|
| (6) |
Estimation of the indel length distributions is done by fitting the assumed form of the distributions on observed histograms of the lengths. For instance, for a Zipf (or power-law) distribution of lengths, the observed histogram is converted to a loglog scale and a linear regression is performed to obtain the parameters of the distribution. The current implementation of Indelign supports three indel length distributions: Zipf, Poisson and Geometric.
To estimate the insertion and deletion probabilities pI and pD, we assume these probabilities for any branch e with length te to be proportional to te, i.e. pI = cIte and pD = cDte, where cI and cD are constants of proportionality independent of the branch. (Other, non-linear relationships may also be implemented in the future.) We then count the numbers of insertions and deletions NI and ND from the parent of the two closest related sibling species to either species, and estimate the constants cI and cD using the equations: (See Supplementary material for rationale of this heuristic.)
|
| (7) |
3.3 SEARCH
The SEARCH component of Indelign uses the ANNOTATE component to score a multiple alignment by its maximum-likelihood IEH, and searches the space of multiple alignments for the highest scoring IEH. It begins with an initial alignment obtained from an existing multiple alignment program, fixes the highly conserved parts of the alignment to be immutable, and performs a hill climbing search by changing the locations of holes locally. Any existing multiple alignment programs such as MLagan (Brudno et al., 2003), MAFFT (Katoh et al., 2005) or TBA (Blanchette et al., 2004a) can be used to generate an initial alignment. Recall from Section 2.3 that a hole may straddle multiple blocks, e.g. in Figure 2A, the hole DF in the second sequence straddles blocks DE and EF. The part of a hole that belongs to one particular block is called a block-hole. We next describe each step of the SEARCH algorithm in detail.
- Start with an initial alignment of the input sequences.
- Identify statistically significant two-sequence anchors in the alignment. These are ungapped blocks of alignment (between any two sequences) with statistically significant percent identity. Any pair of aligned nucleotides included in an anchor is forced to being aligned throughout the search. Entire columns that are immutable in this sense partition the multiple alignment into independent inter-anchor regions that are processed independently.
- Annotate the initial alignment based on a weighted maximum parsimony (minimum number of indel events scaled by inverse branch length, see Section 4.2) criterion to get an initial IEH, and obtain an initial estimate of model parameters from this IEH.
- For each inter-anchor region, repeat until convergence:
- Choose a leaf node to process, in a round-robin fashion.
- In the sequence at the chosen leaf node, pick a block-hole at random. Consider alignments that may be obtained by sliding the block-hole to the left or right (without crossing the adjacent hole or the inter-anchor boundary). Such alignments form the neighborhood of the current alignment.
- Compute the maximum-likelihood IEH for each of these neighboring alignments, using the ANNOTATE component, and use the (maximum) likelihood as the score of an alignment.
- Move to the highest scoring among the neighboring alignments. This is steepest ascent or greedy algorithm. (We also implemented a MetropolisHastings move, and found the performance to be significantly inferior to this greedy heuristic, for comparable running times.)
- Choose a leaf node to process, in a round-robin fashion.
- Compute IEH from current alignment, estimate model parameters, and loop to Step 4.
| 4 EXPERIMENTS ON SYNTHETIC DATA |
|---|
|
|
|---|
We first performed evaluation studies on synthetic data generated with the simulation program Dawg (Cartwright, 2005). An ancestral sequence of length 4000 bp was chosen at random, and evolved along the branches of an input phylogenetic tree t with n leaves. The parameters for the simulation included an indel to substitution ratio, set to 0.1, and an insertion to deletion ratio, set to 1:3. The substitution model was Felsenstein 81, and indel lengths followed a Zipf (power-law) distribution. (See Supplementary material for details.) The synthetic data experiments allowed us to evaluate and compare different methods under controlled settings.
4.1 Evaluation measures
We used the following measures for evaluating the accuracy of any given indel annotation by comparing with the true annotation.
- Annotation Agreement: This is the fraction of indels in the true alignment that are correctly positioned and annotated (as insertion or deletion). (Optimal score: 1.)
- Indel Agreement: This captures how close the output numbers of insertions and deletions are to the true numbers. It is defined by the formula
where NIt and NDt are true numbers of insertions and deletions and NIe and NDe are the output numbers. (Optimal score: 0.)
- Indel Ratio Ratio: This is the true insertion/deletion ratio, divided by the estimated insertion/deletion ratio, given by the formula
. (Optimal score: 1.)
These evaluation measures were computed over all species excluding any outgroup species that is diverged directly from the root species.
4.2 Baseline annotation method
We used a column-wise maximum parsimony heuristic as a baseline annotation method. Each column of the multiple alignment was independently annotated so as to minimize the total number of 1 bp events (substitutions, insertions and deletions) on the phylogenetic tree, each event being weighted by the inverse of the length of the branch on which it happens. This maximum parsimony annotation was achieved by a post-order traversal algorithm, using an extended alphabet (A, C, G, T, ). Finally, we merged adjacent 1 bp events of the same type (insertion or deletion) on the same branch to form longer indel events.
4.3 Evaluation of the ANNOTATE algorithm
We first performed experiments to assess the performance of the ANNOTATE component, if the true alignment is given. Annotation was done with (mode T) and without (mode U) the knowledge of model parameters. The second mode uses automatic parameter estimation. These two modes of ANNOTATE were compared with the baseline method based on column-wise maximum weighted parsimony. A three-node phylogenetic tree was used (roughly based on the phylogenetic relationship of cis-regulatory sequences of Drosophila melanogaster, Drosophila yakuba, Drosophila ananassae), 20 datasets were simulated and the various evaluation measures (Section 4.1) were computed for the outputs of each annotation method. This entire procedure was repeated five times, with the phylogenetic tree being shrunk (the length of each branch was shortened) by a scaling factor
= {0.5, 0.6...0.9}. Scaling the tree had the effect of decreasing the divergence time among the species simulated.
Figure 3 shows the averages of the different evaluation measures, for ANNOTATE (modes T and U) and the baseline, at different values of the scaling factor
. The ANNOTATE algorithm shows a significant improvement in performance over the baseline method. Although the performance in mode T (known parameters) is better than in mode U as expected, the difference is marginal, demonstrating that the algorithm is correctly able to learn the parameter values from the data. We also note that the actual values of the evaluation measures for Indelign (ANNOTATE) are very close to the optimal, demonstrating the practicality of the algorithm on datasets that were designed to mimic the evolutionary divergence of the sequenced Drosophila species.
|
We repeated similar experiments on phylogenies with four and five species, respectively (Figs 8 and 9 in Supplementary material). We again find a significant improvement in performance of ANNOTATE over the baseline method, close-to-optimal values of the actual evaluation measures, and an insignificant difference between the U and T modes. (The phylogenies used for these experiments were based on inferred divergence distances of sequenced Drosophila genomes.)
4.4 Evaluation of the SEARCH algorithm
We deployed the experimental set-up of the previous section to assess the improvement made by the SEARCH component of Indelign over the output of a popular multiple alignment program, MLagan (Brudno et al., 2003). The initial alignment was done with MLagan, run with default parameters and a gap opening penalty of 400. This alignment was annotated using Indelign's ANNOTATE component, run in the T mode (parameters known). The SEARCH component was then invoked to refine the initial alignment, and the the optimal alignment reported was annotated using ANNOTATE.
Figure 4AD shows the average values of the Indel Agreement score and the Indel Ratio Ratio score, for three and four species simulations, for a range of values of the scale factor
. We notice a substantial improvement in average Indel Ratio Ratio when using Indelign to realign MLagan output, especially for four species simulations (Fig. 4D). In this case, this evaluation measure, whose optimal value is 1.0, goes from an average of 1.38 (using MLagan) to an average of 1.23 (using Indelign) at
= 1.
|
We also observe a significant improvement in the Indel Agreement score, especially for four species data (Fig. 4C). For instance, at
= 0.5, the average of this evaluation measure, whose optimal value is 0, moves from 0.10 (using MLagan) to 0.07 (after realignment). Thus, we see a clear improvement in annotation of insertions and deletions once the input multiple alignment has been realigned with Indelign. We also note that the per-position alignment accuracy is not changed in the results of the SEARCH component from its value in the initial MLagan alignment (Fig. 4E).
4.5 Efficiency of algorithms
The ANNOTATE component implements two different algorithmsan enumerative strategy and a dynamic programming (DP) strategy. We used our experimental set-up to compare the running time of these two strategies, and as seen in Figure 4F, the DP algorithm leads to a significant improvement in efficiency. In absolute terms, the ANNOTATE component using the DP algorithm takes (on a 3 GHz IntelR Xeon workstation)
45 s (and
18 MB memory) on a dataset of 100 kb sequence for each of four extant species.
| 5 EXPERIMENTS ON CIS-REGULATORY MODULES IN DROSOPHILA |
|---|
|
|
|---|
Our earlier work (Sinha and Siggia, 2005) documented that a set of 76 cis-regulatory modules (CRMs) in Drosophila show an excess of insertions over deletions, while it is known from prior work (Petrov and Hartl, 1998) that neutral sequences in Drosophila have a clear excess of deletions over insertions (in the ratio of 8:1). The results of Sinha and Siggia (2005) were based on alignments of D.melanogaster and D.yakuba, with Drosphila pseudoobscura as outgroup, using MLagan as the alignment program, and an in-house implementation of maximum parsimony heuristics for indel annotation. We first repeated the same experiments using Indelign for the annotation step, and confirmed that the insertion/deletion count ratio was indeed >1 (data not shown).
Next, we tested the same phenomenon on a larger dataset of 448 CRMs (total length: 474 054 bp) obtained from the REDfly database, with D.ananassae serving as a less diverged out-group than D.pseudoobscura. The phylogeny used in Indelign is ((D.mel:0.08, D.yak:0.08):0.13, D.ana:0.22), and the insertion/deletion count ratio, over all CRMs, is plotted as a function of the gap opening penalty of MLagan, in Figure 5. We find a ratio >2 regardless of the gap penalty parameter of the alignment program. The same conclusion is reached even when ignoring all 1bp indels (Fig. 10B in Supplementary material). The predominance of insertions over deletions was also observed when considering the total lengths of indels rather than their numbers (Fig. 10A and C in Supplementary material). We interpret these results as a reliable confirmation of the claims made in Sinha and Siggia (2005), i.e. regulatory sequences in Drosophila are enriched in insertions over deletions, in stark contrast to the situation in neutral sequences. This translates into an evolutionary trend of expansion in the regulatory sequences (and perhaps in functional non-coding sequences in general), at least since the D.melanogasterD.yakuba split. The earlier observation (Sinha and Siggia, 2005) that D.yakuba had a much higher insertion/deletion ratio than D.melanogaster was also confirmed in the more comprehensive assessment done here (Fig. 5).
|
| 6 DISCUSSION AND CONCLUSIONS |
|---|
|
|
|---|
A limitation of the proposed probabilistic framework is that multiple indels at the same site are not allowed in the same branch. Such an extension would require major changes to the model and incur heavy computational overheads, and prohibit genome-wide applications. Moreover, we believe that such multiple hits of indels will be relatively uncommon when analyzing closely related genomes.
A direction for future research is to improve the SEARCH component of Indelign. Although the current algorithm is able to provide improvements over the initial alignment, we noticed that it is unable to make highly non-local changes to the input alignment.
We have presented a probabilistic framework for annotation of insertions and deletions in an alignment of three or more species, and demonstrated its performance on synthetic datasets. The framework is also able to realign the given multiple alignment, thereby improving the annotation accuracy significantly. The probabilistic model allows for arbitrary distributions of indel lengths, and a dynamic programming algorithm leads to an efficient implementation that can scale to genome-wide applications.
| Acknowledgments |
|---|
S.S. would like to acknowledge the contribution of Eric Siggia, who played a major role in initiating this research. Several very useful discussions with Mathieu Blanchette are also acknowledged. J.K. wishes to thank Younhee Ko and Yoonkyong Lee for their contributions in the initial development of the project.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: John Quackenbush
Received on August 24, 2006; revised on October 19, 2006; accepted on November 13, 2006
| REFERENCES |
|---|
|
|
|---|
Blanchette, M., et al. (2004a) Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res, . 14, 708715
Blanchette, M., et al. (2004b) Reconstructing large regions of an ancestral mammalian genome in silico. Genome Res, . 14, 24122423
Brudno, M., et al. (2003) LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res, . 13, 721731
Cartwright, R.A. (2005) DNA assembly with gaps (Dawg): simulating sequence evolution. Bioinformatics, 21, Suppl. 3, iii31iii38
Felsenstein, J. (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol, . 17, 368376[CrossRef][Web of Science][Medline].
Fredslund, J., Hein, J., Scharling, T. (2003) A large version of the small parsimony problem. Proceedings of the 3rd Workshop on Algorithms in Bioinformatics (WABI)September 1520Budapest, Hungary.
Holmes, I. and Bruno, W.J. (2001) Evolutionary HMMs: a Bayesian approach to multiple alignment. Bioinformatics, 17, 803820
Jukes, T.H. and Cantor, C.R. (1969) Evolution of protein molecules. In Munro, H.N. (Ed.). Mammalian Protein Metabolism, , New York Academic Press, pp. 21123.
Katoh, K., et al. (2005) MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res, . 33, 511518
Keightley, P.D. and Johnson, T. (2004) MCALIGN: stochastic alignment of noncoding DNA sequences based on an evolutionary model of sequence evolution. Genome Res, . 14, 442450
Kimura, M. (1980) A simple method for estimating evolutionary rate of base substitution through comparative studies of nucleotide sequences. J. Mol. Evol, . 16, 111120[CrossRef][Web of Science][Medline].
Metzler, D. (2003) Statistical alignment based on fragment insertion and deletion models. Bioinformatics, 19, 490499
Miklós, I., et al. (2004) A Long indel model for evolutionary sequence alignment. Mol. Biol. Evol, . 21, 529540
National Human Genome Research Institute. (2006) Fruitfly Genome Sequencing. [http://www.genome.gov/11008080].
Petrov, D.A. and Hartl, D.L. (1998) High rate of DNA loss in the Drosophila melanogaster and Drosophila virilis species groups. Mol. Biol. Evol, . 15, 293302[Abstract].
Pollard, D., et al. (2004) Benchmarking tools for the alignment of functional noncoding DNA. BMC Bioinformatics, 5, 73[CrossRef][Medline].
Schwartz, S., et al. (2004) Human-mouse alignments with BLASTZ. Genome Res, . 14, 786
Sinha, S. and Siggia, E.D. (2005) Sequence turnover and tandem repeats in cis-regulatory modules in Drosophila. Mol. Biol. Evol, . 22, 874885
Sinha, S., et al. (2004) Cross-species comparison significantly improves genome-wide prediction of cis-regulatory modules in Drosophila. BMC Bioinformatics, 5, 129[CrossRef][Medline].
Stoye, J., et al. (1998) Rose: generating sequence families. Bioinformatics, 14, 157163
Thorne, J.L., et al. (1991) An evolutionary model for maximum likelihood alignment of DNA sequences. J. Mol. Evol, . 33, 114124[CrossRef][Web of Science][Medline].
Thorne, J.L., et al. (1992) Inching toward reality: an improved likelihood model of sequence evolution. J. Mol. Evol, . 34, 316[CrossRef][Web of Science][Medline].
This article has been cited by other articles:
![]() |
R. A. Cartwright Problems and Solutions for Estimating Indel Rates and Length Distributions Mol. Biol. Evol., February 1, 2009; 26(2): 473 - 480. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. S. Halfon, S. M. Gallo, and C. M. Bergman REDfly 2.0: an integrated database of cis-regulatory modules and transcription factor binding sites in Drosophila Nucleic Acids Res., January 11, 2008; 36(suppl_1): D594 - D598. [Abstract] [Full Text] [PDF] |
||||
![]() |
R. K. Bradley and I. Holmes Transducers: an emerging probabilistic framework for modeling indels on trees Bioinformatics, December 1, 2007; 23(23): 3258 - 3262. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||





