Bioinformatics Advance Access originally published online on March 1, 2007
Bioinformatics 2007 23(9):1073-1079; doi:10.1093/bioinformatics/btm076
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
COBALT: constraint-based alignment tool for multiple protein sequences
National Center for Biotechnology Information, National Institutes of Health, Department of Health and Human Services, Bldg. 38A, Room 10–03N, 8600 Rockville Pike, Bethesda, MD 20894, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: A tool that simultaneously aligns multiple protein sequences, automatically utilizes information about protein domains, and has a good compromise between speed and accuracy will have practical advantages over current tools.
Results: We describe COBALT, a constraint based alignment tool that implements a general framework for multiple alignment of protein sequences. COBALT finds a collection of pairwise constraints derived from database searches, sequence similarity and user input, combines these pairwise constraints, and then incorporates them into a progressive multiple alignment. We show that using constraints derived from the conserved domain database (CDD) and PROSITE protein-motif database improves COBALT's alignment quality. We also show that COBALT has reasonable runtime performance and alignment accuracy comparable to or exceeding that of other tools for a broad range of problems.
Availability: COBALT is included in the NCBI C++ toolkit. A Linux executable for COBALT, and CDD and PROSITE data used is available at: ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/cobalt
Contact: richa{at}helix.nih.gov
| 1 INTRODUCTION |
|---|
|
|
|---|
The simultaneous alignment of multiple sequences (multiple alignment) serves as a building block in several fields of computational biology (Gotoh, 1999), such as phylogenetic studies (Fleissner et al., 2005), detection of conserved motifs (Frith et al., 2004), prediction of functional residues and secondary structure (Livingstone and Barton, 1996), prediction of correlations (Socolich et al., 2005) and even quality assessment of protein sequences (Bianchetti et al., 2005). The development of algorithms that can automatically produce biologically plausible multiple alignments is a subject of very active research (Edgar and Batzoglou, 2006; Notredame, 2002). Unfortunately, finding a multiple alignment that rigorously optimizes the commonly used sum-of-pairs scoring measure is computationally hard (Wang and Jiang, 1994) and not practical when more than a few sequences are involved (Li et al., 2000). This has led to an arsenal of approximation techniques from graph theory (Gupta et al., 1995; Kobayashi and Imai, 1998) combinatorial optimization (Notredame and Higgins, 1996; Zhong, 2002) and probability theory (Do et al., 2005).
Most of the popular modern algorithms designed for multiple alignment of more than a few sequences, such as ClustalW (Thompson et al., 1994), MUSCLE (Edgar, 2004a,b), ProbCons (Do et al., 2005) and PCMA (Pei et al., 2003), employ a progressive alignment technique (Feng and Doolittle, 1987) that aligns pairs of sequences, and then pairs of sequence collections, starting from the most similar sequences and continuing until all sequences contribute to the alignment. These algorithms detect sequence similarity as a preliminary step and use the results to construct a guide tree that drives the actual alignment process. Once an initial solution containing all sequences is available, a refinement stage (Wallace et al., 2005) attempts to use the extra information embodied in the alignment to improve that solution.
Several algorithms take the set of sequences to align as their only input, whereas others incorporate information from multiple heterogeneous sources (Notredame et al., 2000). Even the latter primarily restrict themselves to observations of the input dataset, for example, the secondary structure locations on the inputs (O'Sullivan et al., 2004). When the number of sequences is small or the collection has low pairwise similarity, less information is available for these algorithms to construct an alignment. The information content can be increased by turning sequences into position-specific profiles based on the similarity of each sequence to members of a database, and then aligning the profiles instead of the original sequences (Simossis and Heringa, 2004). Alignment to a profile is significantly more sensitive to subtle relationships between sequences (Gribskov et al., 1987; Marti-Renom et al., 2004). The traditional drawback to use of profiles has been the computational expense of constructing them, for example, via iterated PSI-BLAST searches against a large protein database.
We see three relatively underexplored opportunities for further development in the field of multiple alignment:
- The use of biologically relevant information encoded in databases such as the conserved domain database (Marchler-Bauer et al., 2005) (CDD) and PROSITE (Hulo et al., 2006) for improving the quality of multiple alignments. CDD is a curated collection of profiles derived from aligned protein families, whereas PROSITE is a database of regular expressions representing motifs. Using independent, curated, biologically significant databases has the additional advantage of potentially improving the alignment quality automatically as and when these databases are updated to represent a larger number of protein families or motifs. PROSITE patterns were used by Du and Lin (2005) to constrain multiple alignments, but we are not aware of any multiple alignment tool that has attempted to utilize CDD.
- The use of local pairwise similarity present in multiple sequence pairs to highlight similar regions in otherwise divergent sequences. Local alignments can also constrain global alignment to improve performance, because the presence of a constraint reduces the size of the space that a dynamic programming implementation must search for an optimal pairwise alignment. Some algorithms, such as T-Coffee (Notredame et al., 2000) and DbClustal (Thompson et al., 2000), do use libraries of pairwise alignments, but they do not attempt to explicitly choose alignments present in multiple pairs.
- An easy way for users to specify regions they want to see aligned in any multiple alignment computed. User input can be particularly useful when user knowledge is not reflected in sequence similarity. Such a capability was added to a semi-automatic version of DIALIGN (Morgenstern et al., 1998) by Morgenstern et al. (2006). Other methods (for example, SALIGN in MODELLER (Marti-Renom et al., 2004)) include the option for a user to specify constraints when aligning sequences and structures.
COBALT has a general framework that uses progressive multiple alignment to combine pairwise constraints from different sources into a multiple alignment. COBALT does not attempt to use all available constraints (for example, via algorithms used by Myers et al. (1996)) but uses only a high-scoring consistent subset that can change as the alignment progresses, where a set of constraints is called consistent if all of the constraints in the set can be simultaneously satisfied by a multiple alignment. Using the RPS-BLAST tool (the core search algorithm for the service described by Marchler-Bauer and Bryant (2004)), we can quickly search for domains in CDD that match to regions of input sequences. When the same domain matches to multiple sequences, we can infer several potential pairwise constraints based on these domain matches. Furthermore, CDD also contains auxiliary information that allows COBALT to create partial profiles for input sequences before progressive alignment begins, and this avoids computationally expensive procedures for building profiles. We use PROSITE patterns for making constraints only in the refinement stage since using them in the initial stages gave worse performance (data not shown). This is likely because PROSITE patterns are much shorter than domains from CDD and as such are more likely to give spurious matches. COBALT also retains a maximum consistent subset of any user-specified pairwise constraints by giving these constraints a priority higher than that for any constraint derived by other means.
We tested COBALT on five different multiple alignment benchmark sets. Compared with ClustalW, MUSCLE, ProbCons and PCMA, COBALT achieves the highest or close to highest average alignment quality, although all five programs perform similarly on many of these benchmarks. Here, the figure of merit is the percentage of letter pairs in computed alignments that match those in the conserved regions of reference alignments. Use of CDD searches improves COBALT's average alignment quality by
3%, and use of local alignments significantly improves alignment quality in benchmarks such as Implanted Rose Motifs base (IRMbase) (see Table 1). We also show that the alignments reported by various alignment algorithms differ significantly, and this is an important consideration when making conclusions based on multiple alignment produced by any tool (Ogden and Rosenberg, 2006).
|
The runtime performance of COBALT is highly data driven, but we find empirically that our implementation is about two times slower than MUSCLE and comparable to ProbCons and PCMA unless the number of sequences exceeds about a dozen, in which case COBALT is about five times faster than ProbCons. COBALT, therefore, represents a good compromise between alignment quality and runtime requirements and may be a good choice when one does not want to try multiple tools. We expect to incorporate COBALT into various NCBI resources and make further enhancements to improve COBALT's speed and/or accuracy.
In the next section, we describe the methods used by COBALT to find constraints and generate a guide tree, along with the heuristics for aligning subsets of sequences presented by the guide tree. This section also describes five benchmarks used for evaluating COBALT. The Results section compares alignment quality achieved by COBALT with that of ClustalW, MUSCLE, ProbCons and PCMA on the five benchmarks. We close with a discussion of related work and open problems.
| 2 MATERIALS AND METHODS |
|---|
|
|
|---|
COBALT is included in the NCBI C++ Toolkit. Numerous auxiliary programs were written in C, C++ and Perl to automate testing and summarize results. Splus version 6.0 was used to run Friedman's rank sum test.
2.1 Scoring a multiple alignment
The traditional method of scoring an alignment between a pair of sequences is to use a score matrix based on log-odds scores, such as the PAM or BLOSUM series of matrices. When there are more than a few sequences, the information content in a multiple alignment can be measured by its entropy. A small number of sequences, or correlations between residues, can have a harmful effect on an entropy-based scoring measure. Partitioning residues into a smaller number of residue classes can sometimes dampen this effect.
The intuition used by Tharakaraman et al. (2005) is to combine ideas behind log-odds and entropy-based scoring. This method scores a multiple alignment by measuring how well individual sequences agree with the overall alignment. We extend these ideas to account for gaps in multiple alignments. We also partition the residues into ten classes when scoring a multiple alignment. The score assigned to a column i, denoted by CS(i), within a multiple alignment is given by:
|
| (1) |
j pj for classes with non-zero cij and K is a pseudocount set to two by default. The score for a multiple alignment is the sum of the scores for all of its columns. Equation (1) was derived using ideas from Tharakaraman et al. (2005) as follows.
The individual score in Equation (7) from Tharakaraman et al. (2005) is the measure of how similar a residue j in a given sequence S is to column i of a given profile:
|
|
For scoring multiple alignments, we can use this formulation by checking how each sequence S scores against the alignment for the rest of the sequences. Because cij includes the count for S in the multiple alignment for S, the count for cij should be decremented by one while scoring S. Also, since columns with gaps will not have
j cij = N , we explicitly replace c by N – 1 so as to downweight the score for columns with gaps. These two substitutions give:
|
|
|
|
|
| (2) |
|
|
j|cij0 pj} , and substituting values for aj and a, Equation (2) becomes: |
|
2.2 Algorithm
COBALT is a flexible tool for simultaneously aligning a given set of protein sequences, where users can directly specify pairwise constraints and/or ask COBALT to generate the constraints using sequence similarity, (optional) CDD searches and (optional) PROSITE pattern searches. COBALT will optionally create partial profiles for input sequences based on any CDD search results. Aside from these features, the COBALT algorithm is similar to that of other progressive multiple alignment tools:
|
We do not regenerate the guide tree in the refinement phase, because we found that the guide tree generated in Step 3 provides a branching order for progressive alignment that can lead to the desired benchmark solution, and do not expect that regenerating the tree will improve result quality. Next, we briefly describe the implementation of the above steps. In the following, we denote the given set of input protein sequences by S = {S1, S2, ... , SN} , the residues of sequence Si by
where m is the length of Si, and the profile for Si at position j by
for the protein alphabet of size k. We use
to represent the frequency of gaps for Si at position j. From now on, we overload the term residue to mean an index in a scoring matrix (BLOSUM62 by default), and also the actual amino acid letter in the sequence.
2.2.1 Finding alignments for generating constraints (Step 1)
RPS-BLAST is used to align each Si to each domain in the CDD database. For each domain, CDD contains a position-specific score matrix used for RPS-BLAST alignment, the residue frequencies that produced the score matrix, and a list of both highly conserved (core block) and divergent (loop) regions within the domain. For each Si, we divide each domain match that meets a minimum expect-value threshold (0.01 by default) into a collection of core blocks, and realign each individual core block to Si using dynamic programming and the portion of the domain's score matrix appropriate for the block. During realignment, a block may shift position from its location on the original match up to half the size of the loop region to either side of the block, and may have gaps added or removed compared to the original match. Because sequences can be expected to align on block boundaries, we use only the block alignments for inferring a potential constraint between Si and Sj, and only if both of them align to the same portion of a domain.
The BLASTP module of BLAST is used after RPS-BLAST to find local pairwise sequence similarity in regions where RPS-BLAST fails to detect possible structural similarities to CDD domains. Each sequence Si is partitioned into regions that participate in a constraint based on CDD alignments (domain regions) and regions that do not (filler regions). Filler regions of Si are aligned to set S– {Si} using BLASTP, and any match found that exceeds a minimum expect-value (0.01 by default) becomes a potential pairwise constraint.
Matches against protein motifs from the PROSITE database are found using PHI-BLAST (Zhang et al., 1998). Each occurrence of a pattern on Si makes a potential pairwise constraint with each occurrence of the same pattern on Sj when i
j .
The initial set of potential constraints is the set of CDD alignments, pairwise local alignments and any user-specified residue pairs between sequences. Matches against protein motifs are used only in the refinement stage.
2.2.2 Consistent constraints and profiles (Step 2)
Pairwise constraints generated in Step 1 may conflict with each other. For each pair of sequences with constraints, we find a consistent subset by determining the maximum-scoring collection of pairwise constraints between the pair of sequences, such that the sequence ranges that appear in all constraints are disjoint. Here, the score for a constraint derived from BLASTP or RPS-BLAST is the alignment score; user-defined constraints are all given an artificially high score to preserve the maximum number of user-specified constraints.
If Si aligns to one or more domains in CDD, then we use the position-specific matrix of residue frequencies for those domains to create a partial profile for Si in the locations dictated by block alignments. The domain matches that will contribute residue frequencies are chosen in a greedy manner, where the next domain chosen does not overlap any domain matches chosen for Si so far, and has the highest-scoring set of block alignments. Two alignments are said to overlap in this context, if their range on Si overlaps. We add only residue frequencies from block alignment regions, boosting the frequency of the actual letter in each position of Si as follows:
|
|
2.2.3 Finding a guide tree (Step 3)
The pairwise consistent constraints found in Step 2 are used to generate a distance matrix. Let the score between sequence Si and Sj be scoreij, corresponding to the combined alignment score of all constraints found in Step 2 for this pair of sequences. The distance between two sequences Si and Sj is
|
|
2.2.4 Computing the multiple sequence alignment (Step 4)
The progressive multiple alignment does a depth-first traversal of the tree generated in Step 3. At each node of the tree, we generate profiles for both subtrees and align these profiles to produce a multiple alignment for all of the sequences seen thus far.
The profile of a subtree is computed by adding the contribution from each sequence in the subtree. For the subtree containing Si, the contribution of sequence Si to position k of column j in the profile is
, where wi is the weight for Si, computed by normalizing the reciprocals of the distance from each sequence to the subtree root. In practice, this formulation reduces the contribution of more distantly related sequences but tends to produce equal weights for all sequences when the tree is ambiguous.
A variant of ordinary Needleman–Wunsch dynamic programming computes a global alignment of two profiles. The alignment process uses well-known techniques to reduce memory consumption (Edgar, 20004a) and includes two variations that are specific to profile alignments. The first variation is the choice of profile–profile score function (Edgar and Sjölander, 2004; Wang and Dunbrack, 2004). Given an amino acid score matrix M and two vectors p and q of residue frequencies, where vector element i is the frequency of occurrence of amino acid i, a straightforward generalization for the pairwise score Mij for aligning amino acid i with amino acid j is
|
|
However, this scoring function has the effect of diluting the profile–profile score because of the influence of cross-terms even when two profiles are exact copies of each other. A more appropriate scoring function should assign a high score to pairs of profile columns that contain similar residue frequency distributions. This suggests a modification to the above formula:
|
| (3) |
The second variation involves modifying the Needleman–Wunsch recurrence relations to account for the frequency of gaps in vector p. Affine gap scoring means that the first gap character incurs a gap-open penalty Go, while remaining consecutive gaps only incur a gap-extension penalty Ge. When opening a gap in vector p, we scale the penalty by the fraction of non-gap characters in q; i.e. the gap open penalty is:
|
| (4) |
|
| (5) |
Each alignment of two sequence collections may also incorporate pairwise constraints that reduce the size of the dynamic programming lattice to explore. We tabulate constraints that cross from one subtree to the other, and the highest-scoring consistent subset constrains the dynamic programming procedure. We also merge identical constraints in the same profile neighborhood, scaling the constraint score by the number of constraints merged. The merge process makes it more likely that constraints appropriate to multiple sequences will influence the complete profile–profile alignment.
2.2.5 Alignments by bipartition (Step 5)
The initial guide tree is rooted at its longest edge. In an attempt to undo errors committed by this arbitrary choice for the root, we isolate a fraction of the longest edges in the guide tree as possible choices for the root. For each choice, we partition the current multiple alignment into two sequence collections separated by that edge and then perform a profile–profile alignment between the two collections. Each column in the realignment is scored using Equation 1. For each edge, bipartition repeats up to five times, or as long as the best score from the current iteration improves the best score from the previous iteration by at least 2%.
2.2.6 Refinement by finding new constraints (Step 6)
The second refinement phase begins by finding conserved columns in the output from Step 5. A column in a multiple alignment is considered to be high scoring if its score exceeds a cutoff (set to 0.67 by default), and groups of at least two adjacent high-scoring columns are considered conserved. Iteration continues as long as the number of conserved columns increases. Before iterating, the set of constraints found in Step 2 is replaced by constraints that encompass alignment decisions based on conserved columns, pattern matches and user-specified pairs.
COBALT uses an all-against-all collection of pairwise constraints to represent each group of conserved columns. Conserved columns may contain gaps, but sequences that contain gaps in a conserved column do not participate in pairwise constraints for that column. This exception allows conserved columns to be used for most profile–profile alignments, while generating pressure on slightly misaligned sequences to shift position.
2.3 Assessment
Most of the development of COBALT was done using BaliBase 2.0 (Bahr et al., 2001), containing 265 alignments divided into eight sets according to sequence length and percent similarity. These sets represent a wide variety of multiple alignment problems. We used the following benchmarks for our tests:
- BaliBase 3.0 (Thompson et al., 2005), containing 218 alignments organized in the same way as BaliBase 2.0, but with larger sequence collections that contain more outlier sequences.
- HOMSTRAD (Stebbings and Mizuguchi, 2004), containing 1032 alignments that represent a large series of known protein families.
- IRMbase (Stoye et al., 1998), containing 180 alignments. In this set, a highly conserved motif is inserted into large, randomly generated protein sequences, and then edit operations are performed that simulate evolutionary events on the collection of sequences. The objective is to recover the conserved motif.
- PREFAB 4.0 (Edgar, 2004a), containing 1682 alignments that consist of two sequences surrounded by a collection of other similar sequences found by PSI-BLAST. The objective is to produce a multiple alignment that contains the known structural alignment of the original pair of sequences.
- SABmark (Walle et al., 2005) has 634 sets of sequence pairs, and the objective is to produce a multiple alignment that simultaneously preserves the known structural alignments of all pairs of sequences in each dataset.
Using these benchmarks, we compared COBALT to ClustalW 1.83, MUSCLE 3.6, ProbCons 1.10 and PCMA 2.0. Default settings were used for all programs except for the results presented for COBALT without some or any additional information. CDD version 2.05 and PROSITE release 19.0 were used for the COBALT results reported here. The quality assessment score (Q-score) is an average over all datasets in a benchmark, where for each dataset we find the percentage of the letter pairs in the reference alignment that are also aligned in the computed alignment. BaliBase, PREFAB and IRMbase benchmarks mark core regions in their reference alignments; for these benchmarks, we also calculate the Q-score for letter pairs in only the core regions while considering the whole alignment as a core region for HOMSTRAD and SABmark.
| 3 RESULTS |
|---|
|
|
|---|
COBALT was developed using BaliBase 2.0 and tested on BaliBase 3.0, HOMSTRAD, PREFAB, IRMbase and SABmark multiple alignment benchmarks. Table 1 shows the Q-score for alignments computed by ClustalW, MUSCLE, ProbCons, PCMA and COBALT on five reference benchmarks and their running time. The Q-score restricted to core regions gives an indication of how well each algorithm finds these regions. These results show that COBALT achieves the best score for HOMSTRAD and SABmark. COBALT also achieves a score comparable to the best score on BaliBase 3.0 (achieved by ProbCons), PREFAB (achieved by ProbCons) and IRMbase (achieved by PCMA). Table 1 also shows that COBALT is significantly faster than ProbCons. The results in Table 1 for IRMbase show that the isolated nature of all conserved regions defeats the similarity-detecting heuristics in, MUSCLE and ClustalW. The datasets from each benchmark, where COBALT performs the best compared to all algorithms are bali_20002 (94.3 versus best of 89.4 by PCMA), hom_SpoU_methylase_N (97.2 versus best of 43.1 by ProbCons), irm_1_400_4_30_7 (100.00 versus best of 48.9 by PCMA), 1h9jA_1g291 (63.2 versus best of 21.6 by MUSCLE) and twi_156 (84.3 versus best of 22.7 by ProbCons).
The average Q-score for a benchmark hides the fact that there is usually significant variation on any given set. We quantify the variation in each benchmark, reported in Table 2, by finding the root mean square deviation in Q-scores for a pair of tools, and then finding the significance in the difference in results using Friedman's rank sum test. We note that although the average performance of all four programs is quite similar on HOMSTRAD, the tools show large variations in alignment quality for any particular dataset. Because of this variation, we think that users should consider using more than one tool, but if users want to pick one tool, then COBALT provides a good balance between alignment quality and running time.
|
As shown in Table 1, using CDD improves the Q-score for COBALT by a few percentage points, whereas PROSITE patterns have a negligible effect on the results. This shows that there are gains to be made by utilizing resources containing information about protein domains, and also shows that the algorithm used by COBALT to make an alignment, even without using any additional information (Table 1, row labeled No info.), is comparable to that of current state-of-the-art multiple alignment algorithms.
| 4 DISCUSSION |
|---|
|
|
|---|
COBALT is a general framework for transforming pairwise constraints among multiple protein sequences into a multiple sequence alignment. The constraints may arise from several unrelated sources, and in particular may include constraints derived from direct user input. We believe that by making COBALT more aware of what is already known about proteins and captured in publicly available resources, COBALT has a better chance of producing a biologically meaningful multiple alignment compared to tools that do not utilize this information. The alignment process itself includes several heuristics for combining constraints and aligning sequences represented as frequency profiles. The result is an algorithm whose performance matches or exceeds that of the best current methods and still achieves reasonable running time.
Our current implementation uses searches against the PROSITE database of protein-motif regular expressions and against the CDD of protein domains. We expect COBALT alignment quality to improve as the underlying resources continue to evolve. Future efforts will investigate the applicability of additional information such as secondary structure alignments computed with recent algorithms (Shindyalov and Bourne, 1998; Zhou and Zhou, 2005) and the detection of short highly conserved motifs found with de novo methods (Neuwald et al., 1997; Rigoutsos and Floratos, 1998). We are particularly interested in finding robust and computationally inexpensive motif-finding tools, as we find that PROSITE patterns longer than three letters are highly selective: performing the pattern search procedure on the datasets comprising BaliBase 2.0 shows that over 90% of the resulting constraint positions agree exactly with the reference alignment.
Progressive multiple alignment algorithms all have difficulty with highly divergent sequence inputs, and so COBALT may also benefit from incorporating alignment algorithms that explicitly process more than two sequences or sequence collections at a time (Kececioglu and Starrett, 2004; Schroedl, 2005; Zhang and Kahveci, 2006). Unfortunately, preliminary investigations show that these measures invariably require excessive computational resources.
We are also examining ways to improve performance in the case where the similarity between input sequences is high. It is especially desirable to avoid the need for RPS-BLAST against CDD if there is no need to deduce subtle structural relationships between inputs, and COBALT should be able to detect this situation and avoid unnecessary work that accounts for a large portion of the algorithm's runtime. Because RPS-BLAST easily runs in parallel, we are also considering parallelizing at least some computations in COBALT as part of continuing development and optimization.
Finally, we have implemented COBALT as a library in the NCBI C++ Toolkit and expect to incorporate the algorithm into NCBI resources and into user tools such as alignment editors.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
Thanks to David Lipman, Jim Ostell and Tom Madden for advice and encouragement. Discussions with John Spouge, Teresa Przytycka, Maricel Kann, Anna Panchenko and Aron Marchler-Bauer were also helpful. A web page developed by Irena Zaretskaya to run the COBALT algorithm has been helpful in jump-starting activity on linking COBALT with other applications at NCBI. We thank Aravind Iyer for testing COBALT with his datasets and providing valuable feedback. This research was supported by the Intramural Research Program of the NIH, NLM. Funding to pay the Open Access charges was provided by the Intramural Research program of the NIH, National Library of Medicine.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Charlie Hodgman
Received on September 15, 2006; revised on January 23, 2007; accepted on February 26, 2007
| REFERENCES |
|---|
|
|
|---|
Bahr A, et al. BAliBASE (Benchmark Alignment dataBASE): enhancements for repeats, transmembrane sequences and circular permutations. Nucleic Acids Res, ( (2001) ) 29, : 323–326.
Bianchetti L, et al. vALId: validation of protein sequence quality based on multiple alignment data. J. Bio. Comput. Biol, ( (2005) ) 3, : 929–947..
Clarke GDP, et al. Inferring genome trees by using a filter to eliminate phylogenetically discordant sequences and a distance matrix based on mean normalized BLASTP scores. J. Bacteriol, ( (2002) ) 184, : 2072–2080.
Desper R, Gascuel O. Fast and accurate phylogeny reconstruction algorithms based on the minimum-evolution principle. J. Comput. Biol, ( (2002) ) 9, : 687–705.[CrossRef][ISI][Medline].
Do CB, et al. ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res, ( (2005) ) 15, : 330–340.
Du Z, Lin F. Pattern-constrained multiple polypeptide sequence alignment. Comput. Biol. Chem, ( (2005) ) 29, : 303–307.[CrossRef][ISI][Medline].
Edgar RC. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics, ( (2004a) ) 5, : 113.[CrossRef][Medline].
Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res, ( (2004b) ) 32, : 1792–1797.
Edgar RC, Batzoglou S. Multiple sequence alignment. Curre. Opin. Struct. Biol, ( (2006) ) 16, : 368–373.[CrossRef].
Edgar RC, Sjölander K. A comparison of scoring functions for protein sequence profile alignment. Bioinformatics, ( (2004) ) 20, : 1301–1308.
Feng D-F, Doolittle RF. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol, ( (1987) ) 25, : 351–360.[ISI][Medline].
Fleissner R, et al. Simultaneous statistical multiple alignment and phylogeny reconstruction. Sys. Biol, ( (2005) ) 54, : 548–561.[CrossRef].
Frith MC, et al. Finding functional sequence elements by multiple local alignment. Nucleic Acids Res, ( (2004) ) 32, : 189–200.
Gotoh O. Multiple sequence alignment: algorithms and applications. Adv. Biophys, ( (1999) ) 36, : 159–206.[CrossRef][ISI][Medline].
Gribskov M, et al. Profile analysis: Detection of distantly related proteins. Proc. Nat. Acad. Sci. USA, ( (1987) ) 84, : 4355–4358.
Gupta SK, et al. Improving the practical space and time efficiency of the shortest-paths approach to sum-of-pairs multiple sequence alignment. J. Comput. Biol, ( (1995) ) 2, : 459–472.[Medline].
Hillis DM, et al. Molecular Systematic. Sinauer Associates. ( (1996) ) second edition. Sunderland, M..
Hulo N, et al. The PROSITE database. Nucleic Acids Res, ( (2006) ) 34, : D227–D230.
Kececioglu JD, Starrett D. Aligning alignments exactly. ( (2004) ) Proceedings of the 8th ACM Conference Research in Computational Molecular Biology. 85–96..
Kobayashi H, Imai H. Improvement of the A* algorithm for multiple sequence alignment. ( (1998) ) 9, . Proceedings of the Genome Informatics Workshop. 120–130..
Li M, et al. Near optimal multiple alignment within a band in polynomial time. ( (2000) ) Proceedings of the 32nd ACM Symposium on Theory of Computing. 425–434..
Livingstone CD, Barton GJ. Identification of functional residues and secondary structure from protein multiple sequence alignment. Meth. Enzymol, ( (1996) ) 266, : 497–512.[ISI][Medline].
Marchler-Bauer A, Bryant SH. CD-Search: protein domain annotations on the fly. Nucleic Acids Res, ( (2004) ) 32, : W327–W331.
Marchler-Bauer A, et al. CDD: A conserved domain database for protein classification. Nucl. Acids Res, ( (2005) ) 33, : D192–D196.
Marti-Renom MA, et al. Alignment of protein sequences by their profiles. Protein Sci, ( (2004) ) 13, : 1071–1087.
Morgenstern B, et al. DIALIGN: finding local similarities by multiple sequence alignment. Bioinformatics, ( (1998) ) 14, : 290–294.
Morgenstern B, et al. Multiple sequence alignment with userdefined anchor points. Algorithms Mol. Biol, ( (2006) ) 1, . http://www.almob.org/content/1/1/6..
Myers G, et al. Progressive multiple alignment with constraints. J. Comput. Biol, ( (1996) ) 3, : 563–572.[ISI][Medline].
Neuwald AF, et al. Extracting protein alignment models from the sequence database. Nucleic Acids Res, ( (1997) ) 25, : 1665–1677.
Notredame C. Recent progresses in multiple sequence alignment: a survey. Pharmacogenomics, ( (2002) ) 3, : 131–144.[CrossRef][ISI][Medline].
Notredame C, Higgins DG. SAGA: sequence alignment by genetic algorithm. Nucleic Acids Res, ( (1996) ) 24, : 1515–1524.
Notredame C, et al. T-coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol, ( (2000) ) 302, : 205–217.[CrossRef][ISI][Medline].
Ogden TH, Rosenberg MS. Multiple sequence alignment accuracy and phylogenetic inference. Systematic Biol, ( (2006) ) 55, : 314–328.[CrossRef].
Sullivan O, et al. 3DCoffee: combining protein sequence and structures within multiple sequence alignments. J. Mol. Biol, ( (2004) ) 340, : 385–395.[CrossRef][ISI][Medline].
Pei J, et al. PCMA: fast and accurate multiple sequence alignment based on profile consistency. Bioinformatics, ( (2003) ) 19, : 427–428.
Rigoutsos I, Floratos A. Combinatorial pattern discovery in biological sequences: the TEIRESIAS algorithm. Bioinformatics, ( (1998) ) 14, : 55–67.
Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. biol. evol, ( (1987) ) 4, : 406–425.[Abstract].
Schroedl S. An improved search algorithm for optimal multiple sequence alignment. Journal of Artifical Intelligence Research, ( (2005) ) 23, : 587–623..
Shindyalov IN, Bourne PE. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng, ( (1998) ) 11, : 739–747.
Simossis VA, Heringa J. PSI-PRALINE: A novel algorithm for multiple sequence alignment. In: Poster in 12th International Conference on Intelligent Systems for Molecular Biology, Glasgow, Scotland, ( (2004) )..
Socolich M, et al. Evolutionary information for specifying a protein fold. Nature, ( (2005) ) 437, (7058): 512–518.[CrossRef][Medline].
Stebbings LA, Mizuguchi K. HOMSTRAD: recent developments of the homologous protein structure alignment database. Nucleic Acids Res, ( (2004) ) 32, : D203–D207.
Stoye J, et al. Rose: generating sequence families. Bioinformatics, ( (1998) ) 14, : 157–163.
Tharakaraman K, et al. Alignments anchored on genomic landmarks can aid in the identification of regulatory elements. Bioinformatics, ( (2005) ) 21, (Suppl. 1): i440–i448.[Abstract].
Thompson, et al. CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res, ( (1994) ) 22, : 4673–4680.
Thompson JD, et al. DbClustal: rapid and reliable global multiple alignments of protein sequences detected by database searches. Nucleic Acids Res, ( (2000) ) 28, : 2919–2926.
Thompson JD, et al. BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark. Proteins: Structure, Function and Bioinformatics, ( (2005) ) 61, : 127–136.[CrossRef][ISI].
Wallace IM, et al. Evaluation of iterative alignment algorithms for multiple alignment. Bioinformatics, ( (2005) ) 21, : 1408–1414.
Walle IV, et al. SABmark: a benchmark for sequence alignment that covers the entire known fold space. Bioinformatics, ( (2005) ) 21, : 1267–1268.
Wang G, Dunbrack RL Jr. Scoring profile-to-profile sequence alignments. Protein Science, ( (2004) ) 13, : 1612–1626.
Wang L, Jiang T. On the complexity of multiple sequence alignment. J. comput. biol, ( (1994) ) 1, (4): 337–348.[Medline].
Zhang X, Kahveci T. A New Approach for Alignment of multiple proteins. Pac. Symp. Biocomput, ( (2006) ) 11, : 339–350..
Zhang Z, et al. Protein sequence similarity searches using patterns as seeds. Nucleic Acids Res, ( (1998) ) 26, : 3986–3990.
Zhong W. Using travelling salesman problem algorithms to determine multiple sequence alignment orders. In: M.S. Thesis, ( (2002) ) Univeristy of Georgia..
Zhou H, Zhou Y. SPEM: improving multiple sequence alignment with sequence profiles and predicted secondary structures. Bioinformatics, ( (2005) ) 21, : 3615–3621.
This article has been cited by other articles:
![]() |
K. Katoh and H. Toh Recent developments in the MAFFT multiple sequence alignment program Brief Bioinform, July 1, 2008; 9(4): 286 - 298. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||


