Bioinformatics Advance Access originally published online on May 3, 2005
Bioinformatics 2005 21(13):2950-2956; doi:10.1093/bioinformatics/bti462
Enhanced statistics for local alignment of multiple alignments improves prediction of protein function and structure
1Department of Molecular Genetics, Weizmann Institute of Science Rehovot 76100, Israel
2Agriculture Faculty, Hebrew University Rehovot 76100, Israel
*To whom correspondence should be addressed.
| Abstract |
|---|
Motivation: Improved comparisons of multiple sequence alignments (profiles) with other profiles can identify subtle relationships between protein families and motifs significantly beyond the resolution of sequence-based comparisons.
Results: The local alignment of multiple alignments (LAMA) method was modified to estimate alignment score significance by applying a new measure based on Fisher's combining method. To verify the new procedure, we used known protein structures, sequence annotations and cyclical relations consistency analysis (CYRCA) sets of consistently aligned blocks. Using the new significance measure improved the sensitivity of LAMA without altering its selectivity. The program performed better than other profile-to-profile methods (COMPASS and Prof_sim) and a sequence-to-profile method (PSI-BLAST). The testing was large scale and used several parameters, including pseudo-counts profile calculations and local ungapped blocks or more extended gapped profiles. This comparison provides guidelines to the relative advantages of each method for different cases. We demonstrate and discuss the unique advantages of using block multiple alignments of protein motifs.
Availability: http://bioinformatics.weizmann.ac.il/blocks/LAMA
Contact: shmuel.pietrokovski{at}weizmann.ac.il
| INTRODUCTION |
|---|
Prediction of protein function and structure using sequence information is a key problem in computational biology. The most successful approaches rely on homology, the descent of proteins with related function and structure from a common ancestor. Homologous proteins can be grouped into families, which themselves can be super- and subgrouped. The main advantage of using homologous proteins is the transfer of data known about some family member or members to other members and the identification of conserved regions.
Multiple sequence alignments also offer better description of the sequence features of the aligned families. Instead of specific examples provided by each sequence, a multiple sequence alignment presents the general sequence characteristics for all available family members. Moreover, such alignments can be better extrapolated to describe these characteristics for the whole family, including yet unknown members (Gribskov et al., 1987; Tatusov et al., 1994; Henikoff and Henikoff, 1996). These transformations convert the raw multiple sequence alignments into position-specific scoring matrices, commonly known as profiles.
Profile comparison with single sequences has been found to be more informative than sequence-to-sequence comparisons (Gribskov et al., 1987; Henikoff and Henikoff, 1994; Tatusov et al., 1994; Altschul et al., 1997). This results from the improved description of sequence constraints and the reduction in the search space as a result of combining individual sequences into profiles. Profile-to-profile comparisons benefit from the same advantages over sequence-to-profile comparisons.
The first profile-to-profile database searching approach was the local alignment of multiple alignments (LAMA) method (Pietrokovski, 1996). LAMA searches databases of local ungapped (block) multiple protein sequence alignments with block queries. It identifies ungapped alignments between block pairs.
Several algorithms and implementations of profile-to-profile comparisons are available for local and global pairwise alignments of profiles (Gotoh, 1993; Panchenko, 2003; Rychlewski et al., 2000; Sadreyev and Grishin, 2003; Taylor, 1988; Thompson et al., 1994; Yona and Levitt, 2002). These methods differ in the type of profiles being compared, in the methods used to generate the profiles, in their column-to-column similarity (column score) metric and in their estimation of the alignment significance. The Prof_sim comparison method of Yona and Levitt (2002) uses a dynamic programming algorithm to find a gapped local alignment of two profiles. The method's column-to-column similarity score is the Jensen-Shannon measure for the divergence between two probability distributions. A shift transformation is performed on these scores to make them suitable for a dynamic programming algorithm to identify local alignments. COMPASS is another gapped local profile-to-profile alignment method (Sadreyev and Grishin, 2003). This method produces analytical statistical significance estimations (E-values) of local alignments similar to those of the PSI-BLAST method. Log-odds ratios are used to calculate the column-to-column comparison scores. COMPASS uses the extreme value distribution of gapped local profile alignment scores to estimate the E-value of the optimal alignments.
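As an illustration of the divergence measure mentioned above, the following minimal Python sketch computes the plain Jensen-Shannon divergence between two column distributions. It is not the Prof_sim implementation: the function names are hypothetical, the logarithm base (2) is an assumption, and the shift transformation and any additional terms used by Prof_sim are not reproduced.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence between two probability vectors of equal length.

    The result is symmetric, 0 for identical distributions and 1 (in bits)
    for distributions with no shared support.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Midpoint distribution; m_i > 0 whenever p_i > 0 or q_i > 0.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Here each argument would be the residue frequency vector of one profile column (20 entries for amino acids); two-element vectors are enough to check the behaviour.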
Beyond profile-to-profile comparison is the cyclical relations consistency analysis (CYRCA) method for multiple alignments of profiles (Kunin et al., 2001). This detects weak protein sequence similarities within sets of profiles by analyzing ungapped pairwise protein profile alignments identified by LAMA. These are clustered into consistently aligned sets of profiles with varying degrees of connectivity within each set. CYRCA is a very sensitive method for detecting subtle relationships between protein families and motifs (Kunin et al., 2001).
In this study, we introduce improvements to the LAMA method. These include incorporation of a new significance measure based on the probabilities of the column scores making up the alignments and the use of pseudo-counts in profile calculations (Henikoff and Henikoff, 1996). The performance of the new version of LAMA was compared with that of the old version and of the PSI-BLAST, COMPASS and Prof_sim methods. The number of CYRCA sets found and the completeness of connectivity within each set (the fraction of all possible intra-set alignments) were used as comparison measures. This comparison provides guidelines on the relative advantages of each method for different cases. Using the new significance measure improved the sensitivity of LAMA without altering its selectivity.
| DATA AND METHODS |
|---|
Multiple sequence alignment databases
Two releases of the Blocks database (Henikoff and Henikoff, 1991), from January 2000 and August 2001, were used in this study, including 9300 and 10 200 compositionally unbiased blocks (Pietrokovski, 1996), respectively.
To find alignment scores expected by chance, blocks with reversed column order were used. This keeps the local composition of the multiple alignments while removing other information (Luthy et al., 1994).
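The reversal step can be sketched in a few lines of Python. This is only an illustration, assuming a block is stored as a list of equal-length aligned sequence segments; the function names are hypothetical and this is not the code used in the study.

```python
def reverse_block(block):
    """Return the block with its column order reversed (each aligned row read right to left)."""
    return [row[::-1] for row in block]


def columns(block):
    """Return the columns of a block as strings, in left-to-right order."""
    return ["".join(col) for col in zip(*block)]


block = ["ACDE", "ACDF"]
reversed_block = reverse_block(block)
# Every column keeps its residue composition; only the column order changes.
assert columns(reversed_block) == columns(block)[::-1]
```

Because each column is kept intact, the local composition of the alignment is preserved while the order-dependent signal is removed.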
Gapped multiple sequence alignments corresponding to Block entries were identified from the Pfam database (Bateman et al., 2004) using an entries cross-index. Each corresponding pair was compared by LAMA to determine whether it is also aligned in the same way (column identities ≥0.7 across the whole block region).
New significance measure using Fisher's combining method
The significance measure of LAMA alignments was originally calculated using empirical distributions of random-profile alignments. Mean and standard deviation values are calculated for the scores of these alignments after the scores are clustered by profile width (Pietrokovski, 1996). The original z-score (z-score1) calculated from these values is thus based on the random distribution of profile-to-profile alignment scores. We have developed a new significance measure for LAMA alignments based on the empirical distribution of profile-to-profile column scores. Our model assumes independent occurrences of column scores in alignments expected by chance. This approximation is the same as the one taken in modeling sequence-to-sequence alignments (Altschul and Gish, 1996).
The combined column scores were evaluated using Fisher's combining method. Given k independent outcomes and their p-values $p_1, \ldots, p_k$, Fisher's procedure uses the product $p_1 p_2 \cdots p_k$ to combine the p-values (Hedges and Olkin, 1985). In any hypothesis-testing situation, if a null hypothesis $H_0$ holds, then the p-value, p, is distributed uniformly between 0 and 1. If p has a uniform distribution between 0 and 1, then $-2\log(p)$ has a $\chi^2$-distribution with two degrees of freedom. Since the sum of independent $\chi^2$-variables has a $\chi^2$-distribution, then, if $p_1, \ldots, p_k$ are the p-values in k independent studies, under $H_0$ the expression

$$-2\sum_{i=1}^{k}\log(p_i)$$

has a $\chi^2$-distribution with 2k degrees of freedom. The mean of the $\chi^2$-distribution with 2k degrees of freedom is 2k, and the standard deviation is the square root of 4k. Therefore, under $H_0$ a column-score p-value product of a profile alignment of width k can be transformed into a theoretically normally distributed variable (z-score2 or $z_2$) by

$$z_2 = \frac{-2\sum_{i=1}^{k}\log(p_i) - 2k}{\sqrt{4k}}$$
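The transformation described above (a Fisher statistic with mean 2k and standard deviation equal to the square root of 4k) can be sketched in Python as follows. The function name is hypothetical, natural logarithms are used, and the input is assumed to be the list of column-score p-values of one ungapped alignment.

```python
import math


def fisher_z_score(p_values):
    """Standardize Fisher's combined statistic for k column-score p-values.

    X = -2 * sum(ln p_i) is chi-square distributed with 2k degrees of
    freedom under the null hypothesis, so (X - 2k) / sqrt(4k) has mean 0
    and standard deviation 1.
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("at least one p-value is required")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    chi_square = -2.0 * sum(math.log(p) for p in p_values)
    return (chi_square - 2 * k) / math.sqrt(4 * k)
```

Small column p-values give a large positive score; for example, four columns that each have p = 1 give the minimum possible value for that width, minus the square root of 4.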
LAMA uses the Pearson correlation coefficient to compare profile columns. This measure was found to give superior performance to other possible column score measures in several studies (Edgar and Sjolander, 2004; Panchenko, 2003; Pietrokovski, 1996; Wang and Dunbrack, 2004). Another study suggested that Pearson's correlation coefficient has the same performance as the COMPASS and Prof_sim significance measures for more conserved sequences (>15% identity) and lower performance for sequences of <15% identity (Mittelman et al., 2003). Such results were obtained for short ungapped alignments 5–7 residues long. In our study, most of the data are alignments 9–15 residues long.
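For concreteness, a column score of this kind can be computed as in the following sketch, which assumes each profile column is given as a vector of 20 residue weights. The function name and the handling of constant columns (returning 0) are assumptions for illustration, not details taken from LAMA.

```python
import math


def pearson_column_score(col_a, col_b):
    """Pearson correlation coefficient between two profile columns.

    Each column is a sequence of residue weights of the same length.
    Returns 0.0 when either column has no variance (an assumption made
    here to avoid division by zero).
    """
    if len(col_a) != len(col_b):
        raise ValueError("columns must have the same length")
    n = len(col_a)
    mean_a = sum(col_a) / n
    mean_b = sum(col_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(col_a, col_b))
    var_a = sum((a - mean_a) ** 2 for a in col_a)
    var_b = sum((b - mean_b) ** 2 for b in col_b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)
```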
An empirical cumulative distribution for $2 \times 10^7$ Pearson correlation coefficient comparisons was constructed by comparing random columns from two different profiles. For a given correlation coefficient r, the p-value was calculated as

$$p(r) = 1 - F(r) + \varepsilon$$

where F is the empirical cumulative distribution. An error correction of $\varepsilon = 1/(n+1)$, where n is the number of comparisons, was used to avoid zero p-values (for r = 1).
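A minimal sketch of such an empirical p-value is given below. It assumes that the p-value is the upper-tail fraction of the empirical distribution plus the correction 1/(n+1); the exact form of the original equation is not reproduced here, so the formula, the cap at 1.0 and the function name are assumptions.

```python
import bisect


def empirical_p_value(r, sorted_null_scores):
    """Upper-tail empirical p-value of correlation r against a sorted null sample.

    Assumed form: p = 1 - F(r) + eps, with F the empirical cumulative
    distribution and eps = 1 / (n + 1), so that p is never zero.
    The result is capped at 1.0 (an assumption of this sketch).
    """
    n = len(sorted_null_scores)
    if n == 0:
        raise ValueError("the null sample must not be empty")
    eps = 1.0 / (n + 1)
    # Fraction of null scores that are <= r.
    cdf = bisect.bisect_right(sorted_null_scores, r) / n
    return min(1.0, 1.0 - cdf + eps)


null_scores = sorted([-0.5, 0.0, 0.2, 0.9])  # toy null sample, for illustration only
p_top = empirical_p_value(1.0, null_scores)  # r = 1 still gets a non-zero p-value
```

Sorting the null sample once and using binary search keeps each lookup logarithmic in n, which matters for a null sample of the size described above.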
Integrating alignment significance measures
The z-score1 and z-score2 significance measures are not fully independent, but they do evaluate different aspects of alignments. This can be seen from the low correlation of the two scores (r = 0.354) (Fig. 1). A new significance procedure was tested using these two significance measures as a double filter for genuine profile alignments (Fig. 1).
To estimate a p-value for the combination of the z-score1 and z-score2 measures, p(z1, z2), we analyzed alignments between real and reversed unbiased and non-palindromic blocks derived from the Blocks database in August 2001.
Empirical two-dimensional percentiles for z-score1 and z-score2 were calculated using percentile(z1, z2), which is the proportion of z-score pairs (w1, w2) such that w1 ≤ z1 and w2 ≤ z2. Next, the p-value for the pair of z-scores was calculated as

$$p(z_1, z_2) = 1 - \mathrm{percentile}(z_1, z_2)$$
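The two-dimensional percentile can be illustrated with the short Python sketch below. The comparison direction (w1 ≤ z1 and w2 ≤ z2) and the complement form of the p-value are assumptions, since the corresponding symbols are not preserved in this text; the function names are hypothetical.

```python
def percentile_2d(z1, z2, null_pairs):
    """Proportion of null (w1, w2) pairs with w1 <= z1 and w2 <= z2."""
    if not null_pairs:
        raise ValueError("the null sample must not be empty")
    below = sum(1 for w1, w2 in null_pairs if w1 <= z1 and w2 <= z2)
    return below / len(null_pairs)


def combined_p_value(z1, z2, null_pairs):
    """Assumed p-value for a z-score pair: the complement of its 2D percentile."""
    return 1.0 - percentile_2d(z1, z2, null_pairs)


# Toy null sample standing in for alignments of real versus reversed blocks.
null_pairs = [(1, 1), (2, 3), (4, 2), (5, 6)]
```

With this toy sample, three of the four null pairs lie at or below the pair (4, 3) in both coordinates, so its percentile is 0.75.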
Optimization of LAMA parameters
To identify selective z-score cutoffs, so that only a few false positive (FP) results will be found, we found optimal alignments between pairs of real and reversed blocks. A cumulative two-dimensional distribution of z-score1 and z-score2 of these alignments was constructed. The cutoffs corresponding to the 1% level of significance were ≥5.0 for z-score1 and ≥6.4 for z-score2.
To enhance the sensitivity starting from these selective cutoffs, we systematically examined different cutoff combinations: 4.5–5.5 with an interval of 0.1 for z-score1, and 6.0–7.0 with an interval of 0.1 for z-score2. For each pair of cutoffs on real data, the total number of resulting CYRCA sets was calculated together with the percentage of true positive (TP) sets found and percentage of FP sets avoided. Values of z-score1 ≥5.0 and z-score2 ≥6.5 maximized the sensitivity of LAMA without altering its selectivity. In ~13 000 random pairwise relations above these cutoffs, only seven small CYRCA sets (each of three nodes) were identified. This was significantly fewer than the 129 sets observed in the analysis of the Blocks database from August 2001. The appearance of CYRCA sets from the random data indicates the FP rate in the analysis of real data. These optimal cutoffs were used in the LAMA program to identify genuine block alignments.
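The cutoff scan can be pictured with the following sketch, which enumerates the grid of cutoff pairs described above and counts how many alignments pass both filters. It is a simplification: the study scored each cutoff pair by the resulting CYRCA sets, whereas this sketch only counts pairwise alignments, and it assumes that an alignment passes when both z-scores are at or above their cutoffs. All names are hypothetical.

```python
def cutoff_grid(start, stop, step=0.1):
    """Evenly spaced cutoffs from start to stop inclusive, rounded to one decimal."""
    n_steps = round((stop - start) / step)
    return [round(start + i * step, 1) for i in range(n_steps + 1)]


def count_passing(alignments, cut1, cut2):
    """Count alignments whose (z1, z2) pair meets both cutoffs."""
    return sum(1 for z1, z2 in alignments if z1 >= cut1 and z2 >= cut2)


def scan_cutoffs(real, reversed_):
    """For every cutoff pair, report (cut1, cut2, real hits, reversed-block hits)."""
    results = []
    for cut1 in cutoff_grid(4.5, 5.5):
        for cut2 in cutoff_grid(6.0, 7.0):
            results.append((cut1, cut2,
                            count_passing(real, cut1, cut2),
                            count_passing(reversed_, cut1, cut2)))
    return results
```

Each axis has 11 values, so the scan covers 121 cutoff combinations; hits among the reversed-block alignments give a rough indication of the false positives admitted by each pair.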
Comparing the LAMA, Prof_sim, COMPASS and PSI-BLAST methods
All blocks from the Blocks database, August 2001 release, were aligned with each other using LAMA, with and without pseudo-counts, and the results were used to construct CYRCA sets. Maximal connectivity of each set was calculated as the number of all possible intra-alignments [N(N−1)/2 for a set of N blocks] which were at least four columns long. Next, Prof_sim and COMPASS alignments for each pair in a set were constructed using Blocks (local, ungapped alignments) and Pfam profiles (gapped alignments). The number of edges below the significance thresholds [p-value <0.01 for Prof_sim (Yona and Levitt, 2002) and E-value <0.001 for COMPASS (Sadreyev and Grishin, 2003)] was calculated for each set. This was done without testing the correspondence of each edge to a CYRCA consistent region. Thus, we used a permissive criterion for identifying corresponding profile-to-profile hits.
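The connectivity measure can be made concrete with a short sketch. It assumes a set is described by its number of blocks and a list of undirected edges (significant pairwise alignments); the requirement that alignments be at least four columns long is not modelled, and the names are hypothetical.

```python
def max_edges(n_blocks):
    """Number of possible pairwise alignments in a set of N blocks: N(N-1)/2."""
    if n_blocks < 2:
        return 0
    return n_blocks * (n_blocks - 1) // 2


def connectivity(n_blocks, edges):
    """Fraction of all possible intra-set alignments that were actually found.

    Edges are unordered pairs of block identifiers; duplicates and
    self-pairs are ignored.
    """
    total = max_edges(n_blocks)
    if total == 0:
        raise ValueError("a set needs at least two blocks")
    unique = {frozenset(edge) for edge in edges if len(set(edge)) == 2}
    return len(unique) / total


# A three-block set with two detected edges has connectivity 2/3.
two_of_three = connectivity(3, [("A", "B"), ("B", "C")])
```

A set of four blocks has six possible edges, so finding five of them corresponds to a connectivity of 5/6.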
To compare the sensitivity of LAMA relative to PSI-BLAST, we used the procedure of Kunin et al. (2001): for each CYRCA set, we tested whether each sequence in every block is similar to some sequence in other blocks using the PSI-BLAST method (Altschul et al., 1997).
| RESULTS |
|---|
Using CYRCA sets to analyze LAMA performance
The new version of LAMA was compared with the original version by analyzing the CYRCA sets identified from LAMA intra-comparison of the Blocks database from January 2000. Genuine (TP) sets were identified by family annotation and structural similarity, and erroneous (FP) sets were identified as sets with no structural similarity, using previously described criteria (Pietrokovski, 1996; Kunin et al., 2001). All other sets were considered as sets with unknown status (unknown).
The fraction and total number of TP sets increased using our new significance measure (Fig. 2). Almost all genuine sets previously identified now include more blocks and have higher connectivity (78.3%). More than one-third (20/56) of the new LAMA TP sets were previously unidentified, but 10 of the previous TP sets were now missing. When these now missing sets were examined, it was found that some of those relationships relied on blocks with weakly informative alignments, i.e. alignments including only a few (2–4) or almost completely identical (99%) sequences. One of these missing sets, previously considered true, was of short (5–9 amino acids) alpha-helices. Such structural similarity could probably occur by chance, and the set perhaps should not have been considered a TP. Sets with these types of short structural similarities across single helices or strands were not considered true in this study.
Most of the new and previously identified sets are TP sets. The total fraction of TP sets increased from 58% (46/79) to 74% (56/76). One old FP set and three previously unidentified FP sets were found using the new statistics, giving a 5.3% (4/76) FP rate. We expected to find a few FP sets with the selectivity limit we used. The number of sets with unknown status dropped from 39% (31/79) to 21% (16/76) (Fig. 2). We attribute the increase in the TP fraction and the decrease in the unknown fraction to the better performance of the new LAMA statistics.
Comparison between CYRCA sets found with and without pseudo-counts
Using pseudo-counts, we detected 238 new CYRCA sets (Table 1), almost doubling the number of CYRCA sets found without pseudo-counts. Nevertheless, there was no decrease in the CYRCA sets' TP rate, and the sets' FP rate even improved (decreased) (Table 1). The large increase in the absolute number of TP sets and the small decrease in the FP rate are significant advantages of using pseudo-counts.
Pseudo-counts are added to the residue counts in inverse proportion to the number of sequences in the multiple alignments and to the alignment conservation (Henikoff and Henikoff, 1996). Thus, pseudo-counts have a small effect on conserved multiple alignments with many sequences, mainly affecting those with fewer than six sequences. Confirming this, we found that about half (26/49 = 53%) of the sets found only with pseudo-counts had blocks with a small number of sequences. The remaining 23 sets were relatively weakly conserved, with 19/23 = 82.6% of them being sets of transmembrane regions.
Comparison among the LAMA, Prof_sim and COMPASS programs
The sensitivity of the LAMA program using the new statistics was compared with that of the COMPASS (Sadreyev and Grishin, 2003) and Prof_sim (Yona and Levitt, 2002) profile-to-profile alignment programs and with that of the PSI-BLAST sequence-to-profile comparison program (Altschul et al., 1997). We used the connectivity within CYRCA sets of blocks to assess the programs' sensitivity. CYRCA sets are composed of three or more blocks that are fully or partially connected (aligned) to each other (Kunin et al., 2001). For each set, we counted how many of all possible alignments were identified by each program. Since the programs compared with LAMA were developed for aligning gapped profiles, we also examined their performance in identifying alignments of gapped Pfam profiles corresponding to the blocks and aligned in the same way. The LAMA program was also compared with and without using pseudo-counts. We tabulated the number of sets where each of the programs found more, the same, or fewer alignments between the blocks, or gapped profiles, relative to the LAMA alignments (Fig. 3 and Table 2).
None of the programs was more sensitive than LAMA. COMPASS had the best results among the programs compared. Using COMPASS with blocks found better connectivity than LAMA in 19.4% of the sets and the same connectivity in 30.2% of the sets (Table 2). PSI-BLAST could not identify more alignments than LAMA in any of the sets and found the same number in just 15.5% of the sets (Table 2). The performance of COMPASS and Prof_sim relative to LAMA did not change significantly when LAMA used pseudo-counts (Fig. 3 and Table 2). The relative sensitivity and selectivity of the profile-to-profile comparison methods were also examined using receiver operating characteristic curve analyses of specific CYRCA sets with methylase motifs. LAMA and COMPASS performed markedly better than Prof_sim, and LAMA had 5–12% more true hits than COMPASS for low false-hit rates (<5%) among the sets analyzed (Supplementary Figure 1).
Both COMPASS and Prof_sim performed better using gapped profiles. Adding to the analysis Pfam alignments aligned in a different way from the blocks in the corresponding region worsened the performance of both methods (data not shown). Thus, the results we found did not depend on the alignment type (gapped or ungapped) or on the alignment itself (same or different from blocks).
Examples of CYRCA sets found using new LAMA statistics
(1) Kinases and cell division protein. Using the new version of LAMA, we found a cyclic CYRCA set of three blocks from the shikimate kinase (SK), phosphoglycerate kinase (PGK) and cell division FtsA families. The LAMA block alignment was confirmed by structure superposition of all three discovered regions (RMSDs 2.0–2.6 Å over 7–10 amino acids). The structural similarity was local, confined to these regions. The proteins in the set belong to different SCOP families, but all proteins are ATPases. The FtsA and PGK aligned regions bind ATP molecules (Fig. 4a). The SK region in this set corresponds to the ADP binding site of adenylate kinase (AK) subunit F (RMSD = 3.1 Å over 158 amino acids between 2SHK and 1NKS). AK also has an AMP binding site (also on subunit F) that corresponds to the known SK ADP binding site, which has a bound ADP in the determined structure. This suggests that SK has two nucleotide binding sites, one of which corresponds to the site we found similar to the FtsA and PGK nucleotide binding sites (Fig. 4a). This prediction can be tested experimentally.
The COMPASS program found this set using block alignments with two edges (66% connectivity), with an E-value threshold of 0.001. Prof_sim identified only the PGK to SK edge (33%) using block alignments with a p-value threshold of 0.001. PSI-BLAST did not find any of these sequence similarities.
(2) A phosphate binding site. Another example of identifying a subtle common ligand binding site is a CYRCA set of four blocks from the dihydrofolate reductase (DHFR), PGK, purine nucleoside phosphorylase (PNP) and phenol hydrolase reductases families. Structures are currently available for the first three families. Examination of the structures of the aligned regions showed them to be nucleotide diphosphate binding sites in PGK and DHFR, and a phosphate binding site in PNP (Fig. 4b). Phosphates from all structures interact with the same aligned residues (PGK:Gly21, DHFR:Gly114, PNP:Gly32 and PGK:Ala218, DHFR:Ala115, PNP:Ser33). The aligned regions also have very similar peptide backbone structures (RMSDs 1.5–2.7 Å over 9–13 amino acids; Fig. 4b). These families belong to different SCOP folds, but all include a parallel β-sheet in their core. The CYRCA identified site is in the middle of these β-sheets and the structural similarity can be extended to a few strands on either side (data not shown). This finding predicts the protein structure and a phosphate binding site for the phenol hydrolase reductases family.
None of the block and gapped-profile sequence similarities between the four families was found by the PSI-BLAST and Prof_sim methods. COMPASS identified two of the local similarities between the families, whereas LAMA identified five of the possible six.
(3) Macrophage migration inhibitory factor (MIF) and caspase-3. A fully connected set of three blocks from caspase proteases, interleukin-1B converting enzymes and MIF proteins was identified by the CYRCA method using LAMA output with the new statistics. The first two families belong to the peptidase C14 family (InterPro ID: IPR002398).
The LAMA alignment between the MIF proteins and caspase proteases is of special interest. The aligned regions have marginal structure similarity (RMSD = 3.5 Å for 15 amino acids; Fig. 4c). However, a domain structural similarity over a 61 amino acid region, including the LAMA aligned segments, is found by the CE structural superposition method (Shindyalov and Bourne, 1998) (RMSD = 4.8 Å, z-score = 3.5 and sequence identity = 11.5%). An even more extensive structural similarity can be observed over 215 amino acids comprising the protein cores (Fig. 4c). These proteins have different catalytic activities. However, LAMA aligned the catalytic isoleucine of the MIF active site with the catalytic histidine of a caspase catalytic dyad. It is thus likely that these families diverged from a common evolutionary origin. The similarity between MIF proteins and caspase proteases was not detected by PSI-BLAST (five iterations with E = 0.001), or by COMPASS and Prof_sim using either block or gapped-profile alignments.
| DISCUSSION |
|---|
In this study, we present a new statistical procedure for the LAMA method that improves the sensitivity of the method without compromising its selectivity. The new statistic neutralizes most of the effects of the alignment width. It can be used for different column comparison measures, not only for the Pearson correlation coefficient.
CYRCA analysis can be used to compare the relative sensitivity and selectivity of ungapped profile-to-profile alignment methods. Sensitivity is measured by the number of TP CYRCA sets, the number of profiles they include and their connectivity. Selectivity is similarly measured with FP CYRCA sets.
We used very permissive criteria for identifying COMPASS, Prof_sim and PSI-BLAST hits, testing the performance of COMPASS and Prof_sim for both block and gapped profiles and not requiring consistent hits. The method parameters we examined were those recommended for use (Sadreyev and Grishin, 2003; Yona and Levitt, 2002; Altschul et al., 1997). Nevertheless, most of the consistent LAMA alignments could not be identified by these methods, and only a few CYRCA sets could be found by them with better connectivity than LAMA.
The methods that we compared with LAMA were developed for aligning gapped profiles. Using such alignments did improve the sensitivity of the methods, but not above that of LAMA (Table 2). Using all block-corresponding gapped profiles or only those aligned in the same way did not improve the compared methods beyond LAMA. Thus, the use of different alignments to describe the families is not the cause of the performance difference between LAMA and the compared methods.
We emphasize that the comparisons presented here were for identifying only those genuine and false hits found by our method. We did not examine whether our method would find all the genuine hits of the other methods, nor whether our method would manage to avoid any possible false hits found by the other methods.
The main advantages of using local multiple alignments (blocks) are their modularity and accuracy. Many blocks represent local protein features that independently appear in different contexts. These features include binding and modification sites (e.g. phosphorylation and glycosylation) and structural motifs that can also repeat in different amounts. When using more extended alignments (gapped profiles), the similarity of these short regions might go undetected if they are embedded in different contexts. Conversely, different short regions that are embedded in similar long contexts might be erroneously aligned. For these reasons, short conserved regions are also more difficult to align within profiles than to align as blocks. Sequences with different numbers of repeating short regions, which might be interspersed with variable linker segments, can also be better aligned by blocks than by longer and gapped profiles.
Using pseudo-counts improved LAMA, identifying about twice as many genuine hits without increasing its FP rate. We attribute this improvement to better treatment of weakly conserved blocks and blocks with a small number of sequences. This supports previous findings showing that the implicit use of diverse sequences from each family and adding pseudo-counts to family profiles improves the profiles' performance (Henikoff and Henikoff, 1996; Sadreyev and Grishin, 2004).
We improved the sensitivity and selectivity of the LAMA method by using a new statistic and pseudo-counts in the profile calculation. The improvement is over the previous version of LAMA and other methods for profile comparison. CYRCA sets of multiply aligned blocks provide a novel and convenient means to test the accuracy of profile and sequence comparison methods. Block-to-block comparison is shown to identify genuine structural and functional relationships that are difficult to identify by other means and methods.
| Acknowledgments |
|---|
This research was supported by The Israel Science Foundation, founded by The Israel Academy of Sciences and Humanities, and the Weizmann Institute of Science Crown Human Genome, and Leon and Julia Forscheimer Molecular Genetics centers. S.P. holds the Ronson and Harris Career Development Chair.
Received on January 9, 2005; revised on March 21, 2005; accepted on April 21, 2005
| REFERENCES |
|---|
Altschul, S.F. and Gish, W. (1996) Local alignment statistics. Methods Enzymol., 266, 460–480.
Altschul, S.F., et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
Bateman, A., et al. (2004) The Pfam protein families database. Nucleic Acids Res., 32, D138–D141.
Edgar, R.C. and Sjolander, K. (2004) A comparison of scoring functions for protein sequence profile alignment. Bioinformatics, 20, 1301–1308.
Gotoh, O. (1993) Optimal alignment between groups of sequences and its application to multiple sequence alignment. Comput. Appl. Biosci., 9, 361–370.
Gribskov, M., et al. (1987) Profile analysis: detection of distantly related proteins. Proc. Natl Acad. Sci. USA, 84, 4355–4358.
Hedges, L.V. and Olkin, I. (1985) Statistical Methods for Meta-Analysis. Academic Press, New York.
Henikoff, J.G. and Henikoff, S. (1996) Using substitution probabilities to improve position-specific scoring matrices. Comput. Appl. Biosci., 12, 135–143.
Henikoff, S. and Henikoff, J.G. (1991) Automated assembly of protein blocks for database searching. Nucleic Acids Res., 19, 6565–6572.
Henikoff, S. and Henikoff, J.G. (1994) Position-based sequence weights. J. Mol. Biol., 243, 574–578.
Kunin, V., et al. (2001) Consistency analysis of similarity between multiple alignments: prediction of protein function and fold structure from analysis of local sequence motifs. J. Mol. Biol., 307, 939–949.
Luthy, R., et al. (1994) Improving the sensitivity of the sequence profile method. Protein Sci., 3, 139–146.
Mittelman, D., Sadreyev, R. and Grishin, N. (2003) Probabilistic scoring measures for profile-profile comparison yield more accurate short seed alignments. Bioinformatics, 19, 1531–1539.
Panchenko, A.R. (2003) Finding weak similarities between proteins by sequence profile comparison. Nucleic Acids Res., 31, 683–689.
Pietrokovski, S. (1996) Searching databases of conserved sequence regions by aligning protein multiple-alignments. Nucleic Acids Res., 24, 3836–3845.
Rychlewski, L., et al. (2000) Comparison of sequence profiles. Strategies for structural predictions using sequence information. Protein Sci., 9, 232–241.
Sadreyev, R. and Grishin, N. (2003) COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance. J. Mol. Biol., 326, 317–336.
Sadreyev, R.I. and Grishin, N.V. (2004) Quality of alignment comparison by COMPASS improves with inclusion of diverse confident homologs. Bioinformatics, 20, 818–828.
Shindyalov, I. and Bourne, P.E. (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng., 11, 739–747.
Tatusov, R.L., et al. (1994) Detection of conserved segments in proteins: iterative scanning of sequence databases with alignment blocks. Proc. Natl Acad. Sci. USA, 91, 12091–12095.
Taylor, W.R. (1988) A flexible method to align large numbers of biological sequences. J. Mol. Evol., 28, 161–169.
Thompson, J.D., et al. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680.
Wang, G. and Dunbrack, R.J. (2004) Scoring profile-to-profile sequence alignments. Protein Sci., 13, 1612–1626.
Yona, G. and Levitt, M. (2002) Within the twilight zone: a sensitive profile-profile comparison tool based on information theory. J. Mol. Biol., 315, 1257–1275.
atoms of ligand binding residues are also shown as spheres on the protein chains. (c) Macrophage migration inhibitory factor and caspase-3 structural superposition. Macrophage migration inhibitory factor (1CA7, chain E with hydroxyphenylpyruvate) is shown in magenta with the region found by LAMA (52–66) and active site Ile64 in red; caspase-3 (1CP3, chain A with inhibitor) is shown in cyan with the region found by LAMA (109–123) and catalytic His121 in blue. The figure was prepared using the PyMol program (