Bioinformatics Advance Access originally published online on January 31, 2007
Bioinformatics 2007 23(7):802-808; doi:10.1093/bioinformatics/btm017
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
PROMALS: towards accurate multiple sequence alignments of distantly related proteins
1Howard Hughes Medical Institute and 2Department of Biochemistry, The University of Texas Southwestern Medical Center at Dallas, 6001 Forest Park Road, Dallas, TX 75390-9050, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Accurate multiple sequence alignments are essential in protein structure modeling, functional prediction and efficient planning of experiments. Although the alignment problem has attracted considerable attention, preparation of high-quality alignments for distantly related sequences remains a difficult task.
Results: We developed PROMALS, a multiple alignment method that shows promising results for protein homologs with sequence identity below 10%, aligning close to half of the amino acid residues correctly on average. This is about three times more accurate than traditional pairwise sequence alignment methods. PROMALS algorithm derives its strength from several sources: (i) sequence database searches to retrieve additional homologs; (ii) accurate secondary structure prediction; (iii) a hidden Markov model that uses a novel combined scoring of amino acids and secondary structures; (iv) probabilistic consistency-based scoring applied to progressive alignment of profiles. Compared to the best alignment methods that do not use secondary structure prediction and database searches (e.g. MUMMALS, ProbCons and MAFFT), PROMALS is up to 30% more accurate, with improvement being most prominent for highly divergent homologs. Compared to SPEM and HHalign, which also employ database searches and secondary structure prediction, PROMALS shows an accuracy improvement of several percent.
Availability: The PROMALS web server is available at: http://prodata.swmed.edu/promals/
Contact: jpei{at}chop.swmed.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
|
|
|---|
Multiple sequence alignments have broad applications in sequence similarity searches, structure modeling and phylogenetic analysis (Altschul et al., 1997; Eddy, 1998; Ginalski and Rychlewski, 2003; Phillips et al., 2000). They also aid in experimental design by revealing conserved residues with potential functional importance. A variety of alignment methods that rely on different algorithms and scoring functions have been developed (Edgar and Batzoglou, 2006). A rigorous method that aligns all sequences simultaneously (Lipman et al., 1989) is computationally prohibitive for large sets of sequences. In contrast, a progressive method that aligns pairs of sequences and sequence groups along a tree is algorithmically simpler and much faster, requiring only N–1 steps of pairwise alignments for N sequences. However, in progressive methods, alignment errors made at each step are propagated to subsequent steps. Many progressive methods use a scoring function called sum-of-pairs, i.e. a sum of amino acid substitution scores for pairs of amino acids between two positions (Edgar and Batzoglou, 2006; Thompson et al., 1994). Such a scoring function yields reasonable alignment quality for closely related sequences (identity above 40%). However, alignment quality drops rapidly with decreasing sequence similarity (Thompson et al., 1999).
Effective construction of multiple alignments with respect to accuracy and speed has been extensively researched in recent years. Refinement and consistency-based scoring are two major techniques to improve classical progressive methods. MUSCLE (Edgar, 2004) and MAFFT (Katoh et al., 2005) represent two recent methods that use extensive refinement to correct errors made in progressive steps. They both implement sum-of-pairs scores, which are easy to compute and offer the advantage of great speed. In T-COFFEE (Notredame et al., 2000), the scoring is derived by finding consistently aligned residue pairs in a library of pairwise alignments. Such consistency-based scoring functions can give better alignment quality than sum-of-pairs scores. Further improvement comes with a probabilistic treatment of consistency via pairwise hidden Markov models (HMMs), as first implemented in ProbCons (Do et al., 2005). MUMMALS (Pei and Grishin, 2006) builds on the success of probabilistic consistency by introducing HMMs with more states that capture local structural information. Consistency transformation requires operations on sequence triplets, and therefore is computationally intensive. By aligning similar sequences with general substitution matrices and aligning divergent sequence groups with profile-based consistency, PCMA (Pei et al., 2003) is able to achieve a balance between alignment accuracy and speed.
Even with refinement and consistency-based scoring, current methods still have difficulty in obtaining high-quality alignments when sequence identity drops below 20%. As homologous proteins can have very low sequence similarity while maintaining similar structures and functions (Murzin, 1998), aligning distantly related sequences is an important task. A recent trend in the multiple alignment field is to recruit various sources of sequence and structural information to improve alignment accuracy (Edgar and Batzoglou, 2006). Such sources include homologs detected in database searches (Katoh et al., 2005; Simossis and Heringa, 2005; Thompson et al., 2000), predicted secondary structure (Simossis and Heringa, 2005; Zhou and Zhou, 2005), and known 3D structures (O'Sullivan et al., 2004). Since additional homologs improve the quality of sequence profiles, and structural features such as secondary structure are generally more conserved than sequences, their usage can lead to improved alignment quality.
Here, we describe PROMALS, a multiple sequence alignment method that combines recent advances in computational approaches to tackle the difficult task of aligning divergent sequences. PROMALS improves probabilistic consistency-based scoring of profiles by utilizing predicted secondary structures and additional homologs found in database searches. To effectively combine these additional data, we developed and implemented a new hidden Markov model for profile-profile comparison, which scores both amino acid similarity and secondary structure similarity, and has local structure-dependent transition and emission probabilities. Like PCMA, PROMALS is made more computationally efficient by treating similar and divergent sequences with different alignment strategies. On several difficult data sets, we show that PROMALS gives the best alignment accuracy among leading methods such as SPEM, HHalign (Soding, 2005), MUMMALS, ProbCons and MAFFT.
| 2 METHODS |
|---|
|
|
|---|
2.1 A hidden Markov model of profile–profile alignment
A classical pairwise HMM for aligning two sequences has three types of hidden states: a match state M emitting a residue pair, an X state emitting a residue in the first sequence and a Y state emitting a residue in the second sequence (Durbin et al., 1998). X and Y states correspond to insertions or deletions in the two sequences. Our hidden Markov model for aligning two alignments (having profile representations) has the same architecture as a pairwise sequence HMM. In our model, an M state emits a pair of positions instead of a pair of residues. For an X or Y state, a single position in the first alignment or in the second alignment is emitted, respectively. The emitted objects (observations) are amino acid frequency vectors and predicted secondary structure types.
We adopt a representation of amino acid sequence profile similar to the ones in PSI-BLAST (Altschul et al., 1997) and COMPASS (Sadreyev and Grishin, 2003). Two profile components are estimated for a position in an alignment: (i) effective frequencies of amino acids, and (ii) target frequencies of amino acids. The effective frequencies serve as the emitted objects (observations) in a position for the hidden Markov model. They are estimated from the position-specific independent counts (PSIC) of amino acids (Pei and Grishin, 2001; Sunyaev et al., 1999), which is a sequence-weighting scheme that corrects for biased similarities between sequences. If an amino acid is not present in a position, it has an effective frequency of zero. The target frequencies serve as the hidden amino acid probabilistic generator for a position. The target frequencies are estimated from the effective frequencies, taking into account prior knowledge of amino acid substitution characteristics. The target frequency is a mixture (weighted average) between effective frequency and the pseudocount frequency (Altschul et al., 1997; Tatusov et al., 1994). Defined in this way, the target frequency of any amino acid, even if it is not present in a position, is always greater than zero. Details on derivation of the two profile components are in Supplementary Data.
For an M state, the probability of emitting the observed amino acids for a position pair (i, j) is the product of two probabilities: (i) the probability of generating the effective frequencies of position i using the target frequencies of position j, and (ii) the probability of generating the effective frequencies of position j using the target frequencies of position i. For an X or Y state, the probability of emitting the observed amino acids in a position k is the probability of generating the effective frequencies of position k using the background amino acid frequencies in insertion regions. Besides amino acids, an M state also emits a pair of predicted secondary structures, and an X or Y state also emits a single predicted secondary structure. The emission probability in a hidden state (M, X or Y) is a weighted product of amino acid emission probability and secondary structure emission probability. The relative weights for the scoring terms of amino acids and predicted secondary structures have been optimized to increase the alignment accuracy of the training sequence pairs. Details on emission probability formulas, parameter estimation and the algorithm for aligning two profiles with optimal posterior probabilities of position matches are described in Supplementary Data.
2.2 PROMALS multiple sequence alignment procedure
PROMALS (PROfile Multiple Alignment with predicted Local Structure) is a progressive method (Fig. 1). The alignment order is set by a tree built using a k-mer count method (Edgar, 2004). Like PCMA (Pei et al., 2003) and MUMMALS (Pei and Grishin, 2006), PROMALS has two alignment stages for easy and difficult alignments. In the first stage, highly similar sequences are progressively aligned in a fast way with a weighted sum-of-pairs measure of BLOSUM62 scores (Henikoff and Henikoff, 1992) (step 2 in Fig. 1). If two neighboring groups on the tree have an average sequence identity higher than a certain threshold (default: 60%), they are aligned in this fast way. The result of the first alignment stage is a set of sequences or pre-aligned groups that are relatively divergent from each other. In the second alignment stage, one representative sequence (the longest one) is selected from each pre-aligned group. For each representative, PSI-BLAST is used to search for homologs from sequence database UNIREF90 (Wu et al., 2006) with three iterations and an E-value cutoff of 0.001. Hits with <20% identity to the query are removed and up to 300 hits are selected. The PSI-BLAST checkpoint file after three iterations is used to predict secondary structures by PSIPRED (Jones, 1999). For each pair of representatives, profiles are derived from the PSI-BLAST alignments and PSIPRED secondary structure prediction, and a matrix of posterior probabilities of matches between positions is obtained by forward and backward algorithms of the profile-profile HMM (see Supplementary Data for details). These matrices are used to calculate the probabilistic consistency scores as described in Do et al. (2005). The representatives are then aligned progressively according to the consistency-based scoring function, and the pre-aligned groups obtained in the first stage are merged to the multiple alignment of the representatives. Finally, gap placement is refined to make the gap patterns more realistic. For that, we define a core block as a set of consecutive positions with gap content less than 0.5 at each position. A highly gapped (gappy) region is defined as a set of consecutive positions with gap contents no less than 0.5 at each position. A gappy region is either bound by two adjacent core blocks, or is at the start or the end of the alignment. If there are l amino acid residues in a gappy segment, gap refinement introduces continuous gap characters in between the [l/2]th residue and the (l–[l/2])th residue, with the exceptions for any gappy segment in N- or C-terminus, where a single run of continuous gap characters is introduced at the sequence start or end.
|
2.3 Assessment of alignment methods
The following methods were tested: SPEM (Zhou and Zhou, 2005), HHalign (Soding, 2005), MUMMALS (Pei and Grishin, 2006), ProbCons (version 1.10) (Do et al., 2005), MAFFT (version 5.667) (Katoh et al., 2005), MUSCLE (version 3.52) (Edgar, 2004) and ClustalW (version 1.83) (Thompson et al., 1994). For MAFFT, we report two alignment options (-linsi and -ginsi) that show the best results. HHalign is an enhanced version of HHsearch (Soding, 2005) that performs pairwise profile–profile alignment with predicted secondary structures (J. Soding, personal communication). Several parameters (score shift, secondary structure weight, pseudocount weight) of HHalign were selected that gave optimal performance on SCOP domain pairs with identity <20%.
For pairwise alignment tests, we used divergent SCOP superfamily domain pairs that were divided into three identity bins: below 10%, 10–15% and 15–20%. For multiple alignment tests, we added up to 24 homologs to each sequence in the testing cases of pairwise alignments. Details on construction of these testing data sets were given in our previous work (Pei and Grishin, 2006). Two large benchmark data sets compiled by other researchers were used as well. One is the SABmark database (version 1.65) (Van Walle et al., 2005), which contains two sets of multiple protein domains related at SCOP fold or superfamily level. The other is PREFAB database (version 4.0) (Edgar, 2004), which is based on structural alignments in FSSP database (Holm and Sander, 1998b) and homologous sequences from database searches. Reference-dependent alignment quality scores (Q-scores) were calculated using the built-in programs in SABmark and PREFAB packages. The Q-score is the number of correctly aligned residue pairs in the test alignment divided by the number of aligned residue pairs in the reference alignment. The value of the Q-score is between 0 and 1. Wilcoxon signed-ranks tests were performed to calculate the statistical significance of comparisons between alignment methods.
In addition to Q-score, we applied reference-independent evaluation of alignment quality to SCOP domain pairs, as described in our previous work (Pei and Grishin, 2006). We calculated several scores reflecting structural similarity of two SCOP domains compared according to aligned residues in a test alignment: DALI Z-score (Holm and Sander, 1998a), GDT-TS score (Zemla et al., 1999), TM-score (Zhang and Skolnick, 2004), 3D-score (Rychlewski et al., 2003) and two LiveBench contact scores (Rychlewski et al., 2003). These scores were scaled by taking into account self-comparison scores, random scores and alignment coverage (scaled scores are no larger than 1 and usually above 0). We also calculated two reference-independent sequence similarity scores: sequence identity and BLOSUM62 scores of aligned positions in a test alignment. These scores were also calculated for DaliLite (Holm and Sander, 1998a) structure-based alignments as a positive control.
| 3 RESULTS |
|---|
|
|
|---|
PROMALS is a progressive multiple alignment method based on probabilistic consistency of profile-profile comparison, with enhanced profile information from homologs detected by PSI-BLAST and secondary structures predicted by PSIPRED (Fig. 1). SPEM and HHalign are comparable methods as they also use these two sources of extra data. While PROMALS and SPEM can align two or more sequences, HHalign performs only pairwise alignments. The other tested methods (MUMMALS, ProbCons, MAFFT, MUSCLE and ClustalW) are stand-alone multiple sequence methods that do not resort to other data sources or programs.
3.1 Reference-dependent evaluation of methods
3.1.1 Tests on weakly similar SCOP domain pairs
We tested our profile-profile HMM on 1207 divergent SCOP domain pairs (Pei and Grishin, 2006) with <20% sequence identity (Table 1, first numbers in columns under SCOP). The three methods that use extra data (PROMALS, SPEM and HHalign) produce substantially better results than stand-alone methods (MUMMALS, ProbCons, MAFFT, MUSCLE and ClustalW) that align a pair of sequences without using additional homologs or predicted secondary structures. For sequence pairs with identity below 10%, the average Q-score of PROMALS (0.431) is almost three times higher than that of MUMMALS (0.156). For alignments with identity ranges 10–15% and 15–20%, PROMALS also gives substantial accuracy increases over MUMMALS of 0.272 and 0.176, respectively. PROMALS shows about 3–4% accuracy increases over SPEM and HHalign, suggesting that our profile-profile HMM utilizes homologs and predicted secondary structures in a better way.
|
We also tested the methods (except HHalign, which is a pairwise alignment program) on data sets of multiple sequences constructed by adding up to 48 homologs to each SCOP domain pair (Table 1, second numbers in columns under SCOP). With multiple sequences, PROMALS and SPEM both show slight improvement (1–2% for PROMALS and 2–3% for SPEM) over their pairwise profile–profile alignments. PROMALS outperforms SPEM by
2% on multiple sequences. With added homologs, stand-alone methods all yield better accuracies than pairwise sequence alignments, among which MUMMALS is the best method. PROMALS outperforms MUMMALS by 0.13, 0.1, and 0.05 for data sets with identities <10%, 10–15% and 15–20%, respectively.
3.1.2 Tests on SABmark database
SABmark database (version 1.65) has two multiple alignment benchmark sets. The twilight zone set contains 209 tests of SCOP (version 1.65) fold-level domains with very low similarity, and the superfamily set contains 425 tests of SCOP superfamily-level domains with low to intermediate similarity. PROMALS achieves the best results among all methods for both sets. Its accuracy is
6% and 4% higher than SPEM on twilight zone set and superfamily set, respectively. For the most difficult twilight zone set, PROMALS doubles the accuracy of the best stand-alone method (MUMMALS). Nevertheless, only
40% residues were correctly aligned on average by PROMALS for the twilight zone set, suggesting that homology modeling of extremely divergent domains remains a difficult problem with regard to alignment quality.
3.1.3 Tests on and PREFAB database
PREFAB 4.0 database consists of 1682 alignments averaging 45.2 sequences per alignment. Each alignment consists of two sequences with known structures and their homologs found by PSI-BLAST database searches. The reference structural alignment in each test is based on the consensus of FSSP (Holm and Sander, 1998b) and CE (Shindyalov and Bourne, 1998) alignments. We have used the performances of pairwise profile–profile alignments of PROMALS and SPEM as an indicator of their multiple alignment performances. The three methods that use additional data (PROMALS, SPEM and HHalign) give similar results, each with an average Q-score above 0.75. Their accuracies are higher than those on the two SCOP data sets with identity <15% and the two SABmark sets, suggesting that PREFAB 4.0 is an easier testing data set. PROMALS, SPEM and HHalign are more accurate than MUMMALS by 4–6%. PROMALS is statistically more accurate (P-value <0.000001) than SPEM and HHalign despite small differences in their average Q-scores. Results on PREFAB 4.0 confirm that alignment quality differences between methods become smaller on easier tests.
3.2 Reference-independent evaluation of methods
On our data sets of 1207 SCOP domain pairs with identity below 20%, we evaluated alignment quality using reference-independent scores that reflect the similarity between two structures compared according to aligned residue pairs in the test alignment (Pei and Grishin, 2006). These structural similarity scores are DALI Z-score, TM-score, GDT-TS score, 3D-score, and two LiveBench contact scores (Table 2). Consistent with reference-dependent evaluation, PROMALS produces significantly higher average structural similarity scores than other methods. Used as a positive control, structural alignment method DaliLite yields higher structural similarity scores than any sequence-based alignment method (Table 2). Interestingly, DaliLite alignments have the lowest reference-independent sequence similarity scores (sequence identity and BLOSUM62 scores). PROMALS also shows lower sequence similarity scores than several other sequence-based methods. These observations suggest that for distantly related sequences (sequence identity <20%), sequence similarity scores, such as identity or BLOSUM62, may not correlate with alignment quality measured by 3D structural comparison, and maximization of these scores may not improve structural models based on sequence alignments.
|
3.3 Pairwise comparisons of alignment methods
To gain further understanding of the differences between alignment methods, we compared their performance on individual domain pairs from the SCOP sets (identity <20%). Table 3 shows the number of pairs, for which one method performs better than another method by a relatively large margin of 0.1 or more (measured by scaled TM-score or Q-score, both scores are between 0 and 1). Although PROMALS clearly leads by a large margin, it does not offer the best alignment in each and every case. For example, PROMALS gives a TM-score increase of 0.1 or more over SPEM on 197 alignments, while producing significantly inferior alignments for 109 pairs. Even stand-alone methods (MUMMALS, ProbCons, MAFFT, MUSCLE and ClustalW) outperform PROMALS by a TM-score of 0.1 or more on a small number of pairs (
5%, i.e. 49–67 out of 1207 alignments). These comparisons suggest that alignments constructed by different methods can vary much for divergent sequences, and a method with an overall inferior performance is capable of generating better alignments in some cases. Careful inspection of alignments produced by several programs could help improve alignment quality for divergent sequences.
|
| 4 DISCUSSION |
|---|
|
|
|---|
Judging by its performance, PROMALS is a definite advance compared to our previous alignment programs MUMMALS (Pei and Grishin, 2006). MUMMALS derives probabilistic consistency from pairwise HMMs with built-in local structural information (secondary structure and/or solvent accessibility), and shows slight but significant improvement (a few percent) over other stand-alone methods such as ProbCons (Do et al., 2005) and MAFFT (Katoh et al., 2005). However, since no additional homologs are used, the local structure prediction implicitly performed by MUMMALS is of low accuracy compared to advanced methods such as PSIPRED (Jones, 1999). In contrast, PROMALS incorporates database searches and more accurate secondary structure prediction, and derives probabilistic consistency from profile–profile HMMs. Moreover, the HMM in PROMALS has a two-track structure (Karchin et al., 2003) that treats both amino acids and predicted secondary structures as emitted objects, while MUMMALS HMMs only emit amino acids. Owing to additional data sources and the advanced profile–profile HMM, PROMALS shows significant improvement over MUMMALS and other stand-alone methods, especially for highly divergent sequences.
The HMM in PROMALS adopts a numerical representation of sequence profile (see Supplementary Data for details) that successfully works in other profile-sequence or profile–profile alignment methods such as PSI-BLAST (Altschul et al., 1997) and COMPASS (Sadreyev and Grishin, 2003). A recent comprehensive study also supported the effectiveness of this profile–profile scoring scheme (Wang and Dunbrack, 2004). To adequately use predicted secondary structures, we not only score them as emitted objects, but also use transition and emission probabilities that are dependent on predicted secondary structure types (Supplementary Data). Unlike HHalign, which treats each alignment as a classical profile HMM (Eddy, 1998), our HMM has a simpler structure similar to the classical 3-state pairwise HMM (Durbin et al., 1998). SPEM (Zhou and Zhou, 2005) does not use HMMs, but applies an empirical profile–profile alignment method (SP2) that identifies the optimal alignment path. In contrast, the HMM in PROMALS allows estimation of posterior probabilities of matches between positions. As a result, PROMALS has a probabilistic treatment of consistency similar to the one in ProbCons and MUMMALS, while simple consistency measures are used in SPEM, T-COFFEE (Notredame et al., 2000) and PCMA (Pei et al., 2003). PROMALS performs significantly better than SPEM and HHalign on difficult tests, suggesting the advantages of our profile–profile comparison scheme.
Since PROMALS relies on PSI-BLAST and PSIPRED to collect additional homologs and predicted secondary structures, the speed of PROMALS is considerably slower than that of stand-alone progressive methods. Our strategy for improving speed is to use different algorithms for easy and difficult alignments (Pei et al., 2003). By aligning highly similar sequences in a fast way, the number of sequences subject to the time-consuming steps (running PSI-BLAST, PSIPRED and consistency transformation) could be substantially reduced. For example, for 1207 SCOP domain pairs with up to 48 added homologs, the average number of sequences in an alignment is 41.6. After PROMALS aligns similar sequences with identity above 60% in the first stage, only
24 sequences on average require database searches, secondary structure prediction, and consistency transformation. For these tests, the median CPU time of PROMALS is
30 min per alignment, as compared to 67 min for SPEM (on Redhat Enterprise Linux 3, AMD Opteron 2.0 GHz). The stand-alone methods (MUMMALS, PROBCONS, MAFFT, MUSCLE and ClustalW) are much faster, all with a median CPU time <1 min.
As in our previous work (Pei and Grishin, 2006), we demonstrated the effectiveness of reference-independent evaluation of alignment quality in this study. First, we observed a good correlation between reference-dependent and reference-independent evaluations, suggesting that it may not be necessary to spend significant efforts on development of reference alignment databases. Second, reference-independent techniques solve the problem of reference alignment ambiguity, which becomes significant when similarity is low. Third, reference-independent evaluation helps answer general questions such as whether alignments can be further improved for sequences with low similarity, and whether such improvements will help structure modeling. For several structural similarity measures (GDT-TS, 3Dscore, TM-score, LB contact scores), the ratio between the average score of PROMALS sequence-based alignment and the average score of DaliLite structure-based alignment is
0.6 on domain pairs with <20% sequence identity (Table 2), suggesting that we are still 40% below what can be achieved with structures in hand. Notably, for these divergent sequences, DaliLite structural alignments have lower sequence similarity scores (identity and BLOSUM62 scores) than alignments produced by any sequence method, suggesting that scoring functions based only on amino acid sequence similarity may not be suitable for aligning divergent sequences for the purpose of homology modeling. This observation further justifies the use of alternative scoring schemes, such as the ones that recruit structural information.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
We would like to thank Bong-Hyun Kim for the reference-independent evaluation routine, and Johannes Soding for providing the HHalign program. We would like to thank Lisa Kinch, Ruslan Sadreyev and James Wrabl for critical reading of the manuscript and helpful comments. This work was supported in part by NIH grant GM67165 to NVG.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Alex Bateman
Received on December 4, 2006; revised on January 12, 2007; accepted on January 17, 2007
| REFERENCES |
|---|
|
|
|---|
Altschul SF, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res (1997) 25:3389–3402.
Do CB, et al. ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res (2005) 15:330–340.
Durbin R, et al. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. (1998) Cambridge University Press.
Eddy SR. Profile hidden Markov models. Bioinformatics (1998) 14:755–763.
Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res (2004) 32:1792–1797.
Edgar RC, Batzoglou S. Multiple sequence alignment. Curr. Opin. Struct. Biol (2006) 16:368–373.[CrossRef][Web of Science][Medline]
Ginalski K, Rychlewski L. Detection of reliable and unexpected protein fold predictions using 3D-Jury. Nucleic Acids Res (2003) 31:3291–3292.
Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA (1992) 89:10915–10919.
Holm L, Sander C. Dictionary of recurrent domains in protein structures. Proteins (1998a) 33:88–96.[CrossRef][Web of Science][Medline]
Holm L, Sander C. Touring protein fold space with Dali/FSSP. Nucleic Acids Res (1998b) 26:316–319.
Jones DT. Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol (1999) 292:195–202.[CrossRef][Web of Science][Medline]
Karchin R, et al. Hidden Markov models that use predicted local structure for fold recognition: alphabets of backbone geometry. Proteins (2003) 51:504–514.[CrossRef][Web of Science][Medline]
Katoh K, et al. MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res (2005) 33:511–518.
Lipman DJ, et al. A tool for multiple sequence alignment. Proc. Natl. Acad. Sci. USA (1989) 86:4412–4415.
Murzin AG. How far divergent evolution goes in proteins. Curr. Opin. Struct. Biol (1998) 8:380–387.[CrossRef][Web of Science][Medline]
Notredame C, et al. T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol (2000) 302:205–217.[CrossRef][Web of Science][Medline]
O'Sullivan O, et al. 3DCoffee: combining protein sequences and structures within multiple sequence alignments. J. Mol. Biol (2004) 340:385–395.[CrossRef][Web of Science][Medline]
Pei J, Grishin NV. AL2CO: calculation of positional conservation in a protein sequence alignment. Bioinformatics (2001) 17:700–712.
Pei J, Grishin NV. MUMMALS: multiple sequence alignment improved by using hidden Markov models with local structural information. Nucleic Acids Res (2006) 34:4364–4374.
Pei J, et al. PCMA: fast and accurate multiple sequence alignment based on profile consistency. Bioinformatics (2003) 19:427–428.
Phillips A, et al. Multiple sequence alignment in phylogenetic analysis. Mol. Phylogenet. Evol (2000) 16:317–330.[CrossRef][Web of Science][Medline]
Rychlewski L, et al. LiveBench-6: large-scale automated evaluation of protein structure prediction servers. Proteins (2003) 53(Suppl. 6):542–547.[CrossRef][Web of Science][Medline]
Sadreyev R, Grishin N. COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance. J. Mol. Biol (2003) 326:317–336.[CrossRef][Web of Science][Medline]
Shindyalov IN, Bourne PE. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng (1998) 11:739–747.
Simossis VA, Heringa J. PRALINE: a multiple sequence alignment toolbox that integrates homology-extended and secondary structure information. Nucleic Acids Res (2005) 33:W289–294.
Soding J. Protein homology detection by HMM-HMM comparison. Bioinformatics (2005) 21:951–960.
Sunyaev SR, et al. PSIC: profile extraction from sequence alignments with position-specific counts of independent observations. Protein Eng (1999) 12:387–394.
Tatusov RL, et al. Detection of conserved segments in proteins: iterative scanning of sequence databases with alignment blocks. Proc. Natl. Acad. Sci. USA (1994) 91:12091–12095.
Thompson JD, et al. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res (1994) 22:4673–4680.
Thompson JD, et al. A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res (1999) 27:2682–2690.
Thompson JD, et al. DbClustal: rapid and reliable global multiple alignments of protein sequences detected by database searches. Nucleic Acids Res (2000) 28:2919–2926.
Van Walle I, et al. SABmark—a benchmark for sequence alignment that covers the entire known fold space. Bioinformatics (2005) 21:1267–1268.
Wang G, Dunbrack R.L. Jr. Scoring profile-to-profile sequence alignments. Protein Sci (2004) 13:1612–1626.[CrossRef][Web of Science][Medline]
Wu CH, et al. The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res (2006) 34:D187–191.
Zemla A, et al. Processing and analysis of CASP3 protein structure predictions. Proteins (1999) (Suppl. 3):22–29.
Zhang Y, Skolnick J. Scoring function for automated assessment of protein structure template quality. Proteins (2004) 57:702–710.[CrossRef][Web of Science][Medline]
Zhou H, Zhou Y. SPEM: improving multiple sequence alignment with sequence profiles and predicted secondary structures. Bioinformatics (2005) 21:3615–3621.
This article has been cited by other articles:
![]() |
C. Kemena and C. Notredame Upcoming challenges for multiple sequence alignment methods in the high-throughput era Bioinformatics, October 1, 2009; 25(19): 2455 - 2465. [Abstract] [Full Text] [PDF] |
||||
![]() |
T.-C. Chu, T. Liu, D. T. Lee, G. C. Lee, and A. C.-C. Shih GR-Aligner: an algorithm for aligning pairwise genomic sequences containing rearrangement events Bioinformatics, September 1, 2009; 25(17): 2188 - 2193. [Abstract] [Full Text] [PDF] |
||||
![]() |
R. D. Morgan, E. A. Dwinell, T. K. Bhatia, E. M. Lang, and Y. A. Luyten The MmeI family: type II restriction-modification enzymes that employ single-strand modification for host protection Nucleic Acids Res., August 1, 2009; 37(15): 5208 - 5221. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. Loytynoja and N. Goldman Uniting Alignments and Trees Science, June 19, 2009; 324(5934): 1528 - 1529. [Abstract] [Full Text] [PDF] |
||||
![]() |
I. A. Johnston, H.-T. Lee, D. J. Macqueen, K. Paranthaman, C. Kawashima, A. Anwar, J. R. Kinghorn, and T. Dalmay Embryonic temperature affects muscle fibre recruitment in adult zebrafish: genome-wide changes in gene and microRNA expression associated with the transition from hyperplastic to hypertrophic growth phenotypes J. Exp. Biol., June 15, 2009; 212(12): 1781 - 1793. [Abstract] [Full Text] [PDF] |
||||
![]() |
D. A. Morrison Why Would Phylogeneticists Ignore Computerized Sequence Alignment? Syst Biol, March 25, 2009; (2009) syp009v1. [Full Text] [PDF] |
||||
![]() |
Y. Lu and S.-H. Sze Improving accuracy of multiple sequence alignment algorithms based on alignment of neighboring residues Nucleic Acids Res., February 1, 2009; 37(2): 463 - 472. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. Bateman, R. D. Finn, P. J. Sims, T. Wiedmer, A. Biegert, and J. Soding Phospholipid scramblases and Tubby-like proteins belong to a new superfamily of membrane tethered transcription factors Bioinformatics, January 15, 2009; 25(2): 159 - 162. [Abstract] [Full Text] [PDF] |
||||
![]() |
R. D. Morgan, T. K. Bhatia, L. Lovasco, and T. B. Davis MmeI: a minimal Type II restriction-modification system that only modifies one DNA strand for host protection Nucleic Acids Res., November 1, 2008; 36(20): 6558 - 6570. [Abstract] [Full Text] [PDF] |
||||
![]() |
V. Ahola, T. Aittokallio, M. Vihinen, and E. Uusipaikka Model-based prediction of sequence alignment quality Bioinformatics, October 1, 2008; 24(19): 2165 - 2171. [Abstract] [Full Text] [PDF] |
||||
![]() |
K. Katoh and H. Toh Recent developments in the MAFFT multiple sequence alignment program Brief Bioinform, July 1, 2008; 9(4): 286 - 298. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. Pei, M. Tang, and N. V. Grishin PROMALS3D web server for accurate multiple protein sequence and structure alignments Nucleic Acids Res., July 1, 2008; 36(suppl_2): W30 - W34. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. Roovers, K. H. Kaminska, K. L. Tkaczuk, D. Gigot, L. Droogmans, and J. M. Bujnicki The YqfN protein of Bacillus subtilis is the tRNA: m1A22 methyltransferase (TrmK) Nucleic Acids Res., June 1, 2008; 36(10): 3252 - 3262. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. White, Z. Li, R. Sardana, J. M. Bujnicki, E. M. Marcotte, and A. W. Johnson Bud23 Methylates G1575 of 18S rRNA and Is Required for Efficient Nuclear Export of Pre-40S Subunits Mol. Cell. Biol., May 15, 2008; 28(10): 3151 - 3161. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. F. Bolliger, J. Pei, S. Maxeiner, A. A. Boucard, N. V. Grishin, and T. C. Sudhof Unusually rapid evolution of Neuroligin-4 in mice PNAS, April 29, 2008; 105(17): 6421 - 6426. [Abstract] [Full Text] [PDF] |
||||
![]() |
Y. Liu, R. Tewari, J. Ning, A. M. Blagborough, S. Garbom, J. Pei, N. V. Grishin, R. E. Steele, R. E. Sinden, W. J. Snell, et al. The conserved plant sterility gene HAP2 functions after attachment of fusogenic membranes in Chlamydomonas and Plasmodium gametes Genes & Dev., April 15, 2008; 22(8): 1051 - 1068. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. Pei, B.-H. Kim, and N. V. Grishin PROMALS3D: a tool for multiple protein sequence and structure alignments Nucleic Acids Res., April 1, 2008; 36(7): 2295 - 2300. [Abstract] [Full Text] [PDF] |
||||
![]() |
I. A. Cymerman, I. Chung, B. M. Beckmann, J. M. Bujnicki, and G. Meiss EXOG, a novel paralog of Endonuclease G in higher eukaryotes Nucleic Acids Res., March 27, 2008; 36(4): 1369 - 1379. [Abstract] [Full Text] [PDF] |
||||
![]() |
T. C. Hoopman, W. Wang, C. A. Brautigam, J. L. Sedillo, T. J. Reilly, and E. J. Hansen Moraxella catarrhalis Synthesizes an Autotransporter That Is an Acid Phosphatase J. Bacteriol., February 15, 2008; 190(4): 1459 - 1472. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. Tartari, C. Gissi, V. Lo Sardo, C. Zuccato, E. Picardi, G. Pesole, and E. Cattaneo Phylogenetic Comparison of Huntingtin Homologues Reveals the Appearance of a Primitive polyQ in Sea Urchin Mol. Biol. Evol., February 1, 2008; 25(2): 330 - 338. [Abstract] [Full Text] [PDF] |
||||
![]() |
K. M. Szymanski, D. Binns, R. Bartz, N. V. Grishin, W.-P. Li, A. K. Agarwal, A. Garg, R. G. W. Anderson, and J. M. Goodman The lipodystrophy protein seipin is found at endoplasmic reticulum lipid droplet junctions and is important for droplet morphology PNAS, December 26, 2007; 104(52): 20890 - 20895. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||











