Bioinformatics Advance Access originally published online on December 4, 2008
Bioinformatics 2009 25(3):295-301; doi:10.1093/bioinformatics/btn630
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sequence progressive alignment, a framework for practical large-scale probabilistic consistency alignment
1Department of Engineering, University of California, Santa Cruz CA, USA and 2EMBL European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Multiple sequence alignment is a cornerstone of comparative genomics. Much work has been done to improve methods for this task, particularly for the alignment of small sequences, and especially for amino acid sequences. However, less work has been done in making promising methods that work on the small-scale practically for the alignment of much larger genomic sequences.
Results: We take the method of probabilistic consistency alignment and make it practical for the alignment of large genomic sequences. In so doing we develop a set of new technical methods, combined in a framework we term sequence progressive alignment, because it allows us to iteratively compute an alignment by passing over the input sequences from left to right. The result is that we massively decrease the memory consumption of the program relative to a naive implementation. The general engineering of the challenges faced in scaling such a computationally intensive process offer valuable lessons for planning related large-scale sequence analysis algorithms. We also further show the strong performance of Pecan using an extended analysis of ancient repeat alignments. Pecan is now one of the default alignment programs that has and is being used by a number of whole-genome comparative genomic projects.
Availability: The Pecan program is freely available at http://www.ebi.ac.uk/
bjp/pecan/ Pecan whole genome alignments can be found in the Ensembl genome browser.
Contact: benedict{at}soe.ucsc.edu
supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
|
|
|---|
The problem of multiple sequence alignment is a long standing one in bioinformatics. In essence, it can be expressed as the problem of grouping (aligning) together sets of residues from a set of two or more input sequences, under some objective function. Generally, these sets are created with the aim of aligning homologous residues, i.e. those that have arisen (recognizably) from a common ancestor. Many theoretical sub-problems can be derived from the general problem, depending on the type of mutations (changes) affecting the sequences and the objective function used. In this article, we present a method which deals with a variant of perhaps the most classical problem in this domain, that of finding an alignment under the assumption that the sequences have changed only by mutations that insert (add), delete (remove) and substitute (replace) residues, this has typically been called the problem of global alignment. See Gusfield (1997) and Durbin et al. (1998) for detailed reviews of the problem and Notredame (2007) for a review of new methods.
Despite the age of the global multiple sequence alignment problem there has been no clear agreement on the exact objective function to use for this problem. This substantially relates to the fact that the problem for most standard objective functions, such as the sum-of-pairs function, is NP-hard (Elias, 2006; Wang and Jiang, 1994;). For pairs of sequences so called pairwise models, for example, pair hidden Markov models (pair-HMMs) (Durbin et al., 1998), can be used to probabilistically model the evolution of the sequences and are computationally tractable. For multiple sequences and a given phylogenetic tree, models, such as transducers (Bradley and Holmes, 2007), give precise generalizations of pairwise models for multiple sequences. However, predictably, the standard forms of probabilistic algorithms for multi-sequence transducer models scale exponentially with the number of sequences involved.
One way around the hardness of this problem is to use so called progressive alignment methods, which greedily break the problem down into a series of tractable, generally pairwise, sub-problems, which are recursively combined up the branches of a given guide tree (Feng and Doolittle, 1987). The obvious problem being that at each stage the computation of the sub-problems are not able to account for all the global information available, and so produce sub-optimal solutions. Consistency alignment algorithms (Notredame et al., 2000; Rausch et al., 2008), of the type implemented by the program introduced in this article, seek to partially overcome the problems of progressive alignment. They essentially work by computing information for each of a set of pairwise sequence comparisons, then transitively relating this information between the different pairwise comparisons, before subsequently computing a progressive alignment from the transformed pairwise information in a greedy fashion. Thus information from sequences across the tree is used in computing each progressive step, and so fewer correctable errors are made.
Most progressive multiple alignment methods have used variants of the Viterbi algorithm at each stage, which gives the single most probable alignment for the sub-problem in hand. An alternative to this is to search for the alignment that, while not being the single most probable, contains the maximum expected number of correct features, for example, maximizing the number of correctly aligned (grouped) pairs. Such an alignment can be computed by taking a Bayesian approach to the problem, by first computing the probability of features in the alignment and then applying a second stage of optimization to compute the alignment that maximizes the sum of these feature's probability, something originally investigated in (Kececioglu, 1993).
In this article, we use the objective function introduced in Do et al. (2005), which was the first to combine a consistency alignment technique with the use of posterior pairwise probabilities to create the initial pairwise information. This methodology has been coined probabilistic consistency alignment. The calculation of pairwise posterior probabilities and use of pair-HMMs makes it easy to train models via unsupervised expectation-maximization and incorporate more complex gap models. We have previously described (Paten et al., 2008a) how, similarly to Probcons, Pecan utilizes two sets of affine parameters (a bi-phasic model) and has been trained via Baum–Welch (Baum, 1972).
Until now probabilistic consistency alignment has been impractical for the alignment of very large sequences. Computing alignments of very large sequences, while not theoretically a different problem, is nevertheless a considerable engineering challenge because of the polynomial scaling factors associated with the average lengths of the sequences. In general this therefore necessitates the pre-computation of constraints on the alignment problem to hence limit the search space of the optimization. The other efficient methods for computing large-scale multiple sequence alignments have all combined progressive alignment with variants on the Viterbi algorithm (Blanchette et al., 2004; Bray and Pachter, 2004; Brudno et al., 2003).
Combining the methods of probabilistic consistency alignment with multi-sequence alignment with constraints (Myers et al., 1996), we present a novel methodology for the alignment of large genomic sequences and describe its implementation in our alignment program, Pecan. In brief, we have made two major technical advances. First, we have developed a banded form of the standard forward–backward algorithm (Durbin et al., 1998) for pair-HMMs, to allow posterior probabilities of pairwise alignment to be generated with heuristics on large sequences. Second, we conceived a novel technique to handle the computation of multiple pairwise posterior probability calculations during the generation of a single consistency alignment across very large genomic sequences, which we call sequence progressive alignment. Although we have combined these two methodologies into a single program for large-scale genomic alignment, each of these technical advances are applicable to more areas than this specific use case. Previously, we and others have assessed Pecan against existing methods (Margulies et al., 2007; Paten et al., 2008a). Here, we show further validation of Pecan's performance by performing a detailed mammalian whole-genome ancient repeat analysis comparing Pecan to another popular alignment program, MLAGAN (Brudno et al., 2003).
| 2 APPROACH |
|---|
|
|
|---|
Let
represent the basic symbol alphabet (i.e. {A, C, T, G} for DNA). A sequence represents a member of 

i, where
i represents the set of all strings of length i comprised of characters from
. Denote – as the gap symbol and let
'=
{–}.
We define an alignment
as a 2D matrix whose cells all contain a symbol from
'. Each row represents the symbols of a sequence interleaved with gaps. Each column represents an aligned group of
', the gaps representing sequences missing due to insertion or deletion. Positions from two sequences are aligned if they occur in the same alignment column. Let xi denote position i in sequence x and let xi
yj denote that position xi and position yj are aligned. We define an indicator function I
such that
|
|
Pecan optimizes a sum-of-pairs function based upon pairwise posterior probabilities computed using a pair-HMM
and identical in principle to that used by the Probcons program (Do et al., 2005). Formally, for Pecan the score of an alignment S(
|
) of a set of leaf sequences
1 ... N is given by:
|
| (1) |
yj |
) defines the transformed probability: |
|
|
| (2) |
Equation (2) represents the aforementioned probabilistic consistency transformation. The optimization of this objective function is broken down into four stages.
- Finding a high specificity constraint map using a local alignment program and recursive, greedy strategy.
- Calculating pairwise posterior match probabilities of alignment for each pair of distinct sequences. This is implemented using a constrained form of the forward–backward algorithms for pair-HMMs that utilizes the constraint map calculated in the previous stage.
- Transforming the pairwise match probabilities using Equation (2).
- Progressively aligning the input sequences using the transformed pairwise match probabilities to optimize Equation (1).
The obvious way to implement such a set of procedures is serial, with stage 2 being performed after stage 1 is complete, and so forth. In such a serial implementation the middle two stages accumulate and transform the complete set of all match probabilities used, before the final stage combines them into an alignment. However, assuming we only keep the posterior match probabilities above a prespecified threshold, by default P>0.01, the total number of these posterior match probabilities grows about linearly with the average length of the sequences involved and quadratically with the number of sequences. This is acceptable for smaller alignments but, as we will show, is impractical for general larger scale alignment. We solve this problem by effectively parallelizing the three stage procedure.
Such a strategy depends upon two related observations:
- That every position in every sequence of a multiple alignment is only likely to be aligned to a single restricted contiguous range of positions in each of the other sequences.
- That these ranges are not independent of one another, but indeed possess a high degree of transitivity.
We will show it is thus generally unnecessary to store the complete set of posterior probabilities, such that memory can be effectively recycled throughout the process. We call this overall efficiency strategy sequence progressive alignment, so called because it works iteratively along the set of input sequences progressively combining them into a final alignment from left to right.
In overview this is achieved by:
- Computing the constraint map independently of the subsequent stages.
- Breaking up the pairwise forward–backward calculations into a series of small fragments using principled and testable heuristics.
- Ordering the computation of these fragments to facilitate their incremental computation along the alignment implied by the constraint map.
- Incrementally (from left to right) inferring the final alignment.
- Progressively freeing memory for positions in the sequences once combined into a final alignment.
We break the methodological description of this process into three parts. First, the calculation of the constraint map. Second, the computing, transforming and ordering of the pairwise posterior probabilities. Finally, the process of incrementally combining the transformed posterior probabilities to compute the constraint map. Some of the sub-methods have previously been described in brief overview in Paten et al. (2008a), here we are able to expand upon these descriptions, and show their importance for the novel sequence progressive framework.
| 3 METHODS |
|---|
|
|
|---|
3.1 Building the constraint map
Let xi
yj denote the constraint that residue xi must appear before residue yj in an alignment of x and y. Similarly, let
yj used to state that residues xi and yj are aligned. We also talk about alignment anchors. In the context of constrained alignment, alignment anchors are continuous ungapped series of one or more aligned pairs (see green edges in Fig. 1), normally derived from local alignments. For each pairwise alignment Pecan must first select a colinear (monotonically increasing) and non-overlapping chain of anchor constraints around which to construct the alignment band (Chao et al., 1993). The alignment band being a relaxed set of constraints derived from the initial set of anchors (black lines around anchors in Fig. 1). The prime constraint map (see Section 3.1.1) is constructed greedily, thus, starting from an initially empty set, constraints are added one by one provided they are consistent (see Fig. 2A, B for the importance of maintaining the partial ordering of the constraint map) with the existing set.
|
|
The procedure has the following outline:
- Create an empty prime constraint map, C'.
- For each set of pairwise alignment parameters (progressively more sensitive):
- For each pair of distinct sequences, in ascending order of pairwise distance:
- For each non-zero length gap between existing anchors (or the ends of the sequences) compute a chain of monotonically increasing local alignments using the Exonerate program (Slater and Birney, 2005).
- For each local alignment, if it is consistent with the existing constraint map, add it to C'.
- For each non-zero length gap between existing anchors (or the ends of the sequences) compute a chain of monotonically increasing local alignments using the Exonerate program (Slater and Birney, 2005).
- For each pair of distinct sequences, in ascending order of pairwise distance:
- Relax the constraint map. At this stage C' will consist of a set constraints each of the form xi
yj. Denote this fixed set of pair constraints B.
xi
yj
B remove xi
yj from C' and replace it with two constraints xi–k
=yj+k and yj-k
=xi+k, where k is a positive integer parameter. This in effect makes an envelope of some constant width around the anchors in B.
- Add extra boundary constraints to C'. These are extra constraints that ensure certain bounds on the alignment process (see Section 3.1.2).
The method employs two nested loops. The outer loops recurses over sets of local alignment parameters. The use of progressively more sensitive parameters is used successfully by a number of other programs (Brudno et al., 2003; Schwartz et al., 2003). Biological sequences are frequently quite heterogeneous in their pattern of conservation. It is sensible therefore to search initially for strong anchors (such as exons) that can be found quickly before narrowing the search.
The Exonerate program is used to generate the local alignments, using the simplest affine-gap local alignment model. We chose Exonerate because it is very fast and we found it to be reasonably sensitive. Before adding anchors to the constraint map the ends are trimmed (by default 3 matched residues at each end). Even where the core (middle most positions) of an anchor is correct, edge wander effects (Holmes and Durbin, 1998) will often cause misalignments at the ends. By trimming both ends of each anchor the method reduces the risk of introducing misaligned constraints that might greedily prevent correct anchors from being later added to the set. Anchors smaller than the total trim length are discarded at this point.
3.1.1 A brute force method for constraint map building
For efficiency, it is necessary to work with the minimum possible prime set of constraints when building the constraint map. For
1 ...N and a set of pairwise constraints C Myers et al. described an O(N2+N|C|) algorithm for computing the set of all prime constraints, where the notation |C| denotes the total number of input constraints. This algorithm requires complete re-computation when new constraints are added to the set of existing prime constraints. We use a simpler, essentially brute force online algorithm that can be progressively updated with new constraints.
In a nutshell, the procedure starts with an existing set of prime constraints, C', and a new constraint xi
yj. For simplicity, we omit the details concerning constraint type. If xi
yj is consistent with this set C' is updated by checking if for every ordered pair of sequences (w, z) a new prime constraint is implied by a chain of prime constraints of the form wh
xi
yj
zk. If so, the new constraint is added to C' and any existing redundant (now non-prime) constraints are removed. The worst case running time of the method in the context of the pipeline given in the previous section is approximately O(N2Mlog(L)), where L is the average length of a sequence and M is the total length of all the sequences. Proof that this algorithm is correct, details of adaptations and runtime analysis can be found in Appendix A in Supplementary Material.
3.1.2 Constraint guarantees
The sequence progressive method proceeds by iterating through the computation of pairwise alignments, the precise strategy being described below. For this process to be robust we need to provide guarantees. For a pair of sequences (x, y), let min(x, i, y) denote the index in y such that ymin(x, i, y)
=xi. Similarly, let max(x, i, y) denote the index in y such that xi
= ymax(x, i, y). We define the window of alignment for residue xi with y by the range ymin(x, i, y) ... ymax(x, i, y). We observe that the majority of cases, where the window of alignment for position grows larger than that desirable, occur because of large deletions and/or sequence gaps in the input sequence. This is illustrated in Figure 2c. We call these cases over-long gap optimizations, because they force the program to optimize the placement of the positions surrounding a gap over an oversized window of alignment. We can guarantee that the sizes of the windows of alignment are always less than the maximum by applying extra constraints, effectively closing these over-long gap optimisations, see Appendix B in Supplementary Material.
3.2 Computing posterior match probabilities
After computing the constraint map Pecan computes banded posterior probabilities for each pair of distinct sequences in the multiple alignment. Normally, to compute posterior match probabilities the entire matrix of forward and/or backward variables must be held during the two passes of the forward and backward algorithms. Even when the size of the matrix is reduced using banding constraints calculation of posterior probabilities is in practical terms unrealistic for very large sequences. To get around this issue we break up the computation of an alignment band into a series of smaller alignment fragments.
Once we have a method to break up the computation for pairs of sequences into a series of smaller pieces we find a more efficient way to structure the computation for multiple separate pairs of sequences. First, we describe this break up of the computation for a pair of sequences, then we discuss how we can utilise this method for multiple sequences.
3.2.1 Fragmented posterior match probability computation
An anti-diagonal line in a pairwise edit graph is a line defined by the equation i+j=k, where i and j are indices in each of the respective sequences and k is the index of the line. The insight upon which our method is based is that one can generate a very good estimate of the backward distribution along an anti-diagonal line, termed a cut-line, in the edit graph. This is achieved by choosing a cut-line at near the end of a high confidence anchor, initializing the anchor cut-point with a probability 1 of alignment and propagating this pseudo-sum over all paths from this point back to the cut line. In effect, without forcing any position to be definitively on the final alignment, we suggest that we can estimate the probability of a line abutting an anchor cut-point using a relatively local area of alignment, this is shown visually in Figure 1 and described below. For a pair of sequences {x, y}
the banded local posterior algorithm has the following basic outline:
- From the global constraint map C' derive the alignment band for x1...n and y1...m.
- Derive a subset B'xy of approximately evenly spaced anchor cut-points points from {(i, j) : xi
yj
B}, additionally including the end point (n, m).
- Set pk=0 and initialize the forward variables for the point (0, 0) with appropriate starting values.
- For each member (i, j) of B'x y in ascending order:
- Compute the backward algorithm between the cut-point (i, j) and the anti-diagonal cut–line with index pk+1=i+j – l, where the size of l defines the gap between the cut-point and the cut-line.
- Calculate the total probability of the sub matrix up to (i, j) by treating (i, j) as the end point of the alignment and using the forward variables for the anti–diagonal line with index pk.
- Calculate posterior probabilities for all points between the lines with indices pk+1 and pk. Store only those aligned pairs for which P(xi
yj |
) is greater than a given threshold (by default 0.01).
- Free all forward variables up to but excluding the line with index pk+1, which are used in the computation of the next fragment.
- Set pk=pk+1.
- Compute the backward algorithm between the cut-point (i, j) and the anti-diagonal cut–line with index pk+1=i+j – l, where the size of l defines the gap between the cut-point and the cut-line.
Thus, starting with an alignment band the method breaks up the dynamic programming matrix into a series of small fragments. The cut-points in B'xy are chosen to intersect the middle portions of anchors (where the confidence will on average be highest) using a greedy (and non-symmetric) parse of the anchors that for efficiency ensures an even spacing between cut-points. It should be noted that the use of cut-points makes the forward–backward calculations non-symmetric, in that inverting the sequences would potentially produce a different result. It is possible to avoid using this non-symmetric cut-line approximation by performing an initial extra backward pass over the matrix to calculate and store the correct backward cut-line distributions, which would take up only minimal further memory for most problems. It is easy, however, to test the heuristic used to show that in reality, at least for simple pair-HMMs, the approximation has little if any effect on the final alignment, and only a tiny effect on actual posterior probability values (see Supplementary Table S1).
3.2.2 Coordinating the posterior match probability calculations for multiple pairs of sequences
For multiple sequences the method requires posterior match probabilities for each pair of distinct sequences. Without parallel processing this would normally be achieved by iterating through each of the pairs in turn and computing the complete set of match probabilities for the pair. However, this would not facilitate the stated aim of progressively iterating across the computation of the complete alignment.
Given the method for breaking up the computation of posterior match probabilities into a series of fragments, we reorder the computation of all the fragments in the set of all distinct pairs of sequences.
This is achieved by doing two things, as follows.
- Topologically ordering (Cormen, 2001) the computation of all the pairwise posterior probability alignment fragments.
- Tracking the completed computation of posterior probability pairs for the positions in the sequences. For the case of multiple sequences and the consistency transformation this involves accounting for first-order transitive projections between the sequences.
To order all the fragments we need simply to create a combined super-set B'=
{{x,y}
: x
y}B'x,y and sort B' in topologically ascending order according to the constraint map C'.
For the second item, note when computing a consistency transformation, each pair-probability P(xi
yj) is modified by projections of the form P(xi
zk)xP(zk
yj), through all intermediate sequence positions zk for which P(xi
zk) and P(zk
yj) are greater than a certain threshold. Consequently, pair-probabilities for a sequence position cannot be counted as properly transformable until the positions in other sequences to which the position projects have also been fully aligned. To track these projections we precompute a first order reachability relation on the constraint map (which can be thought of as a directed acyclic graph) and maintain a list of completed indices to track the completion of the different pairwise alignments.
Given the procedures for performing the above two modifications to the basic procedure, alignment proceeds by repeatedly selecting the fragment of the next pair of sequences in the list to align and perform the following.
- Computing the pairwise forward–backward alignment for the fragment.
- Updating the coordinates of the completed alignment.
- Checking if residues in any sequence are complete with respect to alignment pair computation and if so, pushing the fragments of sequence to the final multiple alignment procedure (described below).
Using the constraint framework, and in particular the extra constraints added to the constraint map at the end of the process, we can provide a guarantee on the maximum number of fragments that need to be computed in order from a subsection of the alignment fragment list to calculate all the relevant posterior probabilities for a sequence residue. In outline, if the sizes of the pairwise alignment fragments are always less than a maximum (i.e. pk+1–pk is always less than a constant) then given that the windows of alignment are bounded for every base, the number of fragments relevant to a position in a sequence in the alignment is bounded.
3.3 Computing the final alignment
While computing the sets of pairwise posterior match probabilities, to achieve its stated aim of recycling memory, Pecan must incrementally infer a final alignment. In common with Probcons, we use the same basic method of progressive, sparse dynamic programming with a guide tree to compose the final alignment in N–1 stages. Each progressive step involves the alignment of two sub-alignments (which may be leaf sequences) to produce a complete alignment of the relevant subtree. We first describe this pairwise step, then describe how these pairwise steps are composed to compute the entire final alignment.
3.3.1 Incremental pairwise alignment
For pairs of sequences x, y, Pecan attempts to find the alignment
, given the constraints, that maximizes:
|
|
This form of alignment is sometimes called maximal expected-accuracy alignment (Durbin et al., 1998). The alignment that optimizes the value of this function can be found by a simple dynamic programming algorithm. As the program addresses only residue pairs whose posterior probability is greater than a threshold, the dynamic programming implemented is sparse. Figure 3A graphically describes an example procedure for this purpose. In it computation proceeds by moving the sweep line (fine dotted line) progressively along one dimension (in this case the y dimension), while updating a monotonically increasing record of the ends of the highest scoring chains along the other dimension (in this case the x dimension). Each new aligned pair xi1
yj1 (pictured on the course dotted line) is attached to the rightmost pair xi2
yj2 on the sweep-line for which i2<i1, with a total chain-score equal to that of the preceding aligned pair plus the probability of the new pair. An aligned pair xi1
yj1 remains on the sweep line, while no pair xi2
yj2 with higher chain-score exists with i2
i1 (as an illustration, the pair labelled B is shown deleted from the line, but the pair A survives). Assuming searching for rightmost points can be done in log(n) time, where n is the total number of pairs, it should be clear that the total computation cost is O(nlog(n)). Note that this method requires no gap function, because the posterior probabilities already factor in the cost of gaps.
|
To detect the creation of the optimal alignment incrementally, it is necessary to define conditions that when met are sufficient to infer that the previous alignment (which is to the top and to the left in an edit graph) has collapsed down to a single possibility. Figure 3B shows an alternative process which adapts the algorithm of Figure 3A. In this algorithm two sweeping lines are used, one along each axis, to track the maximum scoring chains. The key observation being that the point at which the maximum scoring chain from each sweeping line merges must lie on the optimal alignment path. Computation for each sweep line proceeds as in Figure 3(A). To determine if any optimum alignment has been created it is sufficient to trace-back the connections from the highest scoring chains on each sweep-line (furthest most points along the sweep lines, connections shown as dotted arrows), any pairs contained in the intersection of these two chains are optimal (grey box shown in top-left corner).
3.3.2 Incremental multiple alignment
It is possible to combine multiple instances of the pairwise procedure just described to perform multiple sequence incremental progressive alignment, by ordering these processes in a configuration that mirrors the internal nodes of a guide tree. Intuitively, sub-sequences are fed in at the leaf nodes of the tree, and then as optimal alignments are found, alignment fragments are passed up from the internal nodes of the tree and in turn aligned, etc., until they form completed sections of the multiple alignment at the root node.
3.4 Summary
We have outlined modifications to allow the optimization of Equation (1) to be computed incrementally from left to right over an alignment, and so progressively allow the recycling of memory. The Figure S2 in the Supplementary Material gives an example of this process for the first part of a three-way sequence alignment.
| 4 RESULTS |
|---|
|
|
|---|
To demonstrate the utility of the sequence progressive method, we modified Pecan to near continuously report (i) the cumulative number of posterior match probabilities above threshold calculated, and (ii) the total number actually stored in memory. For an alignment of the CFTR locus using data from the Encode dataset (Margulies et al., 2007), we plot the numbers of posterior match probabilities stored over time in Figure 4. This dataset covers 1.87 Mb of the human genome sequence. The result is that peak memory usage, in terms of numbers of match probabilities, was 100 times smaller than the total cumulative memory usage. In practice this meant that the alignment (which in this case only included 14 sequences), could just be completed on a machine with 8 GB of memory when the memory recycling of the sequence progressive procedure was disabled, but was alignable on a machine with less than a gigabyte of memory when memory recycling was enabled.
|
|
We have previously demonstrated using simulations and biological assessment metrics the high relative performance of the Pecan program versus even its nearest direct competitors (Margulies et al., 2007; Paten et al., 2008a). Here, we sought further, detailed biological validation of Pecan's accuracy. Using the Ensembl (Flicek et al., 2007) whole-genome multiple alignment segmentation sets [derived using the Mercator (Dewey, 2007) program] and considering sequences from the human, mouse, rat, dog and cow genomes, we compare Pecan and MLAGAN. This comparison is particularly pertinent because the simulations indicate that MLAGAN is the best performing competitor to Pecan capable of producing a global alignment of a set of sequences. This comparison is, therefore, indicative of only the differences between the two alignment programs. For the assessment we use the ancestral repeat scheme fully described in (Paten et al., 2008a). This scheme relies on the presence of ancestral repeats due to a burst of transposon activity in the ancestral mammalian lineage. The repeat copies, therefore, have two independent alignment methods: first, via the extant homologous sequences without reference to the repeat consensus, and second, via the set of pairwise alignments of each copy of the repeat to the consensus. We classified alignment columns containing ancestral repeats into three classes as follows.
- Full matches, being all species are consistent with the consensus alignment.
- Partial matches, being that at least two species are consistent and the other species have other, non-repeat matches.
- Mismatches, being that the homologous alignment places two ancestral positions from different consensus positions in the same column.
Table 1 shows these results, the supplementary Material gives details of the exact methodology. Pecan matches a significantly higher proportion of residues fully, and partially or fully than MLAGAN for every class of ancient repeat considered. Expanded in the supplement, Figure S3(A) in the supplement shows the base pair coverage of the various classes of considered ancient repeat by the two alignment programs. Overall Pecan covers (aligns to) a slightly higher number of ancient repeat residues, though the two aligners are very similar in this regard.
| 5 DISCUSSION AND CONCLUSION |
|---|
|
|
|---|
This article has introduced an overall framework for incrementally computing an all-pairwise (consistency) alignment. Although many of the details of this method are quite specific to Pecan, we believe that the component methods could prove useful for a number of other computationally expensive multi-sequence methods. In general, it is probable that there is considerable space to improve these methods. For example, the heuristic employed for closing overlong gaps on a pairwise basis does not use all the available information it could, and will certainly lead to a fraction (however small) of undesirable cases.
In the near feature we will report results on integrating Pecan with the promising AMAP program (Schwartz and Pachter, 2007), which implements an alternative objective function based upon pairwise posterior probabilities that explicitly accounts for gap probabilities. In short, we expect further significant improvements in Pecan's performance. However, one critical problem with the objective functions used by Pecan, AMAP and, in fact, all currently available large-scale alignment programs is that they are not phylogenetically realistic. Recent work has highlighted the importance of addressing the bias associated with this lack of realism (Löytynoja and Goldman, 2008). We have recently reported a new program for inferring phylogenetic alignments (including ancestor sequences; Paten et al., 2008b), which relies on an initial alignment from which to work. In the future we view the niche of programs like Pecan to be in computing initial difficult alignment optimization tasks, which can then be refined with more complex phylogenetic models.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
We would like to thank Daniel Zerbino and Michael Hoffman for their helpful comments on the article. We also thank the three anonymous referees for their helpful comments and suggestions.
Funding: European Molecular Biology Laboratory.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Alfonso Valencia
Received on September 11, 2008; revised on November 28, 2008; accepted on December 2, 2008
| REFERENCES |
|---|
|
|
|---|
Baum LE. An equality and associated maximisation technique in statistical estimation for probabilistic functions of Markov processes. Inequalities (1972) 3:1–8.
Blanchette M, et al. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. (2004) 14:708–715.
Bradley RK, Holmes I. Transducers: an emerging probabilistic framework for modeling indels on trees. Bioinformatics (2007) 23:3258–3262.
Bray N, Pachter L. MAVID: constrained ancestral alignment of multiple sequences. Genome Res. (2004) 14:693–699.
Brudno M, et al. LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res. (2003) 13:721–731.
Chao KM, et al. Constrained sequence alignment. Bull. Math. Biol. (1993) 55:503–524.[Web of Science][Medline]
Cormen T. Introduction to Algorithms. (2001) The MIT Press.
Dewey CN. Aligning multiple whole genomes with mercator and mavid. Methods Mol. Biol. (2007) 395:221–236.[Medline]
Do CB, et al. Probcons: Probabilistic consistency-based multiple sequence alignment. Genome Res. (2005) 15:330–340.
Durbin R, et al. Biological Sequence Analysis. (1998) Cambridge University Press.
Elias I. Settling the intractability of multiple alignment. J. Comput. Biol. (2006) 13:1323–1339.[CrossRef][Web of Science][Medline]
Feng D.F., Doolittle RF. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. (1987) 25:351–360.[Web of Science][Medline]
Flicek P, et al. Ensembl 2008. Nucleic Acids Res (2007).
Gusfield D. Algorithms on Strings, Trees, and Sequences. (1997) Cambridge University Press.
Holmes I, Durbin R. Dynamic programming alignment accuracy. J. Comput. Biol. (1998) 5:493–504.[Web of Science][Medline]
Kececioglu J. The maximum weight trace problem in multiple sequence alignment. LNCS (1993) 684:106–119.
Löytynoja A, Goldman N. Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science (2008) 320:1632–1635.
Margulies EH, et al. Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome. Genome Res. (2007) 17:760–774.
Myers G, et al. Progressive multiple alignment with constraints. J. Comput. Biol. (1996) 3:563–572.[Web of Science][Medline]
Notredame C. Recent evolutions of multiple sequence alignment algorithms. PLoS Comput. Biol. (2007) 3:e123.[CrossRef][Medline]
Notredame C, et al. T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. (2000) 302:205–217.[CrossRef][Web of Science][Medline]
Paten B, et al. Enredo and pecan: Genome-wide mammalian consistency-based multiple alignment with paralogs. Genome Res. (2008a) 18:1814–1828.
Paten B, et al. Genome-wide nucleotide-level mammalian ancestor reconstruction. Genome Res. (2008b) 18:1829–1843.
Rausch T, et al. Segment-based multiple sequence alignment. Bioinformatics (2008) 24:i187–i192.
Schwartz AS, Pachter L. Multiple alignment by sequence annealing. Bioinformatics (2007) 23:e24–e29.
Schwartz S, et al. Human-mouse alignments with BLASTZ. Genome Res. (2003) 13:103–107.
Slater GSC, Birney E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics (2005) 6:31.[CrossRef][Medline]
Wang L, Jiang T. On the complexity of multiple sequence alignment. J. Comput. Biol. (1994) 1:337–348.[Medline]
This article has been cited by other articles:
![]() |
P. Flicek, B. L. Aken, B. Ballester, K. Beal, E. Bragin, S. Brent, Y. Chen, P. Clapham, G. Coates, S. Fairley, et al. Ensembl's 10th year Nucleic Acids Res., November 11, 2009; (2009) gkp972v1. [Abstract] [Full Text] [PDF] |
||||
![]() |
C. Kemena and C. Notredame Upcoming challenges for multiple sequence alignment methods in the high-throughput era Bioinformatics, October 1, 2009; 25(19): 2455 - 2465. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. Hamada, K. Sato, H. Kiryu, T. Mituyama, and K. Asai Predictions of RNA secondary structure by combining homologous sequence information Bioinformatics, June 15, 2009; 25(12): i330 - i338. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||





