Bioinformatics Advance Access originally published online on December 13, 2005
Bioinformatics 2006 22(4):413-422; doi:10.1093/bioinformatics/bti828
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Improved pairwise alignments of proteins in the Twilight Zone using local structure predictions
Center for Bioinformatics, Department of Biology, Rensselaer Polytechnic Institute Troy, NY 12180, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: In recent years, advances have been made in the ability of computational methods to discriminate between homologous and non-homologous proteins in the twilight zone of sequence similarity, where the percent sequence identity is a poor indicator of homology. To make these predictions more valuable to the protein modeler, they must be accompanied by accurate alignments. Pairwise sequence alignments are inferences of orthologous relationships between sequence positions. Evolutionary distance is traditionally modeled using global amino acid substitution matrices. But real differences in the likelihood of substitutions may exist for different structural contexts within proteins, since structural context contributes to the selective pressure.
Results: HMMSUM (HMMSTR-based substitution matrices) is a new model for structural context-based amino acid substitution probabilities consisting of a set of 281 matrices, each for a different sequencestructure context. HMMSUM does not require the structure of the protein to be known. Instead, predictions of local structure are made using HMMSTR, a hidden Markov model for local structure. Alignments using the HMMSUM matrices compare favorably to alignments carried out using the BLOSUM matrices or structure-based substitution matrices SDM and HSDM when validated against remote homolog alignments from BAliBASE. HMMSUM has been implemented using local Dynamic Programming and with the Bayesian Adaptive alignment method.
Availability: Matrices, source codes and programs are available at http://www.bioinfo.rpi.edu/~bystrc/downloads.html.
Contact: bystrc{at}rpi.edu, huangy2{at}rpi.edu
| 1 INTRODUCTION |
|---|
|
|
|---|
Amino acid substitution matrices are evolutionary models that seek to explain the cost of mutating any one of the twenty amino acids to any other, relative to the cost of not mutating. The most widely used substitution matrices, such as PAM (Dayhoff et al., 1978) and BLOSUM (Henikoff and Henikoff, 1992), are derived from high confidence multiple sequence alignments (MSAs). Counting statistics are used to estimate the frequency of each possible mutation, and a ratio of the observed mutation probability to the probability one would expect by chance is calculated. The logarithm of this likelihood ratio is the number we use when scoring pairwise alignments such as those generated by Dynamic Programming algorithms [LALIGN (Huang and Miller, 1991), etc.] and by database search algorithms [SSEARCH, FASTA (Pearson, 1995, 1996, 1998, 2000) and BLAST (Altschul et al., 1990, 1997)].
The current models ignore the possible effect of the structural context on the frequencies of mutation, instead summing the frequencies over all positions in the training set alignments equally. But the structural context may influence the cost of making a mutation. For example, glycine is known to be strongly conserved when it occurs in a tight turn, where the presence of any side chain would sterically hinder the formation of the turn. In other structural contexts, glycine is more likely to mutate. As another example, consider the selective pressures felt by a leucine in the core of a protein versus a leucine on the surface. A hydrophobic side chain on the surface may sometimes be conserved for functional reasons, but a hydrophobic side chain in the core of a protein is probably critical for folding, and this is arguably a stronger selective pressure.
In general, we do not know the structural context of each of the positions in a sequence at the point when we want to use the evolutionary modelthat is, when we want to align the sequence or search a database. A structure-based evolutionary model such as the one we propose in this paper is useful for alignments and database searches only if we can first predict the structural context of each residue. The prediction of residue context falls far short of predicting the 3D structure, but adds information that can be used to improve the quality of alignments, as we show in this paper.
Here we use local structure predictions from HMMSTR [hidden Markov model for protein structure (Bystroff et al., 2000)] to improve the quality of the most difficult pairwise alignments, those with <25% sequence identityin the so-called Twilight Zone of homology detection (Doolittle, 1981). The new method compliments remote homology detection methods, such as RAPTOR (Xu et al., 2003), SVM-HMMSTR (Hou et al., 2004), 3D-SHOTGUN (Sasson and Fischer, 2003) and others, that can accurately identify remote structural homology but do not contribute to improved alignments. We show that distant homologs can be aligned more accurately when the substitution matrix considers the local structure as predicted by HMMSTR.
In general, profileprofile alignment methods have a greater chance of success in aligning Twilight Zone sequences (Yona and Levitt, 2002; Jaroszewski and Godzik, 2002), but obtaining the profile requires first performing pairwise alignments. Therefore, improvements in pairwise alignments will eventually improve MSAs and profileprofile alignments.
The new model, called HMMSUM (HMMSTR-based substitution matrices), is a set of matrices, one for each of the 281 local structure contexts defined by HMMSTR. The matrices were summed from a training set of MSAs in the same way as the previous matrices PAM and BLOSUM were summed, but the training set in this case contained HMMSTR descriptors (Markov state probabilities) for each position, based on known structures. When HMMSUM is used in an alignment, the target and template HMMSTR descriptors are first predicted, then an alignment matrix (Smith and Waterman, 1981) is calculated using the HMMSUM model. In the alignment matrix, each match score is a linear combination of the substitution scores from the 281 matrices. The weights are the target and template HMMSTR descriptors [
, Equation (1)].
The alignments were carried out using standard SmithWaterman local Dynamic Programming (Smith and Waterman, 1981; Smith et al., 1981), or using Bayesian Adaptive Alignment (BAA) (Zhu et al., 1998). The results were validated against BAliBASE (Thompson et al., 1999; Bahr et al., 2001), a curated database of structure-based MSAs. Alignment accuracy was assessed as the number of correctly aligned positions (correct matches in Table 1). Pairwise alignments produced by the HMMSUM model had significantly more correct matches than alignments using the BLOSUM35, 40, 50 or 55 matrices. These results were also a significant improvement over two previously described structure-based alignment matrices, SDM and HSDM (Prli
et al., 2000), which are in turn significantly better than the best BLOSUM matrix. However, SDM and HSDM are single substitution matrices, whereas our model consists of many matrices.
|
In broad terms, an increase in the number of variable parameters improved the model. An apparent increase in accuracy can result from simple over-fitting of the data, but this is not the case here. None of the test alignments were part of the training data, which did not include twilight-zone alignments or structure-based alignments. Also, jackknife statistical tests were performed to show that the gap parameter optimization was valid.
Analysis of the numbers (log-likelihood ratios, LLRs) in the HMMSUM matrices, when compared with SDM, hints at the reasons for the improved accuracy. Different local structures have different amino acid preferences. These differences are significant and cannot be captured in a single substitution matrix.
| 2 METHODS |
|---|
|
|
|---|
2.1 Construction of MSAs and structure descriptors
The observed amino acid substitution frequencies were summed from a non-redundant training set of MSAs. A modified PDBselect-25 list (Hobohm and Sander, 1994) was used as a source of parent sequences. Sequences in the original PDBselect-25 list were excluded if the parent structure was not well determined, if the protein was membrane-bound, or if it contained a large number of disulfide bonds. Disordered loop regions were also omitted. Each of the remaining 657 parent sequences has at most 25% sequence identity with any other sequence in the list. MSAs were produced by searching the nr protein database using two iterations of PSI-BLAST (Altschul et al., 1997), with an E-value cut-off of 0.001. In practice, this cut-off produces alignments that very rarely have <25% identity, and the training set contained none of the Twilight Zone alignments that were later used for validation in the testing phase.
HMMSTR predictions
HMMSTR (Bystroff et al., 2000) was used to assign/predict structure descriptors for each position in each MSA during the training phase, and also to predict the structure descriptors in target sequences during the testing phase. HMMSTR takes as input the sequence profile and, optionally, the protein structure expressed as backbone (
,
) angles. We say the descriptors were assigned if the true structure was used as input to HMMSTR in addition to the sequence profile. Otherwise, we say the descriptors were predicted if only the sequence profile was the input to HMMSTR. A sequence profile is a probability distribution over the 20 amino acids at each MSA position. It is calculated by summing the frequency of each amino acid in each MSA column, after accounting for sequence redundancy. PSI-BLAST produces a profile as part of its function, and the profile was used as input to HMMSTR.
HMMSTR is a hidden Markov model containing 281 Markov states (q). Each state emits a single descriptor from a state-specific probability distribution. HMMSTR states have probability distributions for amino acid type, secondary structure, backbone angle region and structural context. States are connected to each other via transitions to model the likelihood of different emission distributions being adjacent to each other in the protein chain. Strongly linked states in HMMSTR represent the I-sites local structure motifs (Bystroff and Baker, 1998), such as helix capping motifs and beta turns, from which the HMMSTR model was derived. Some studies have shown that I-sites motifs can fold independently and/or initiate folding (Yi et al., 1998; Bystroff and Garde, 2003; Northey et al., 2002). The forward/backward algorithm (Rabiner, 1989) was used to predict the conditional probabilities [
, Equation (1)] of each state q given each position t in the sequence.
![]() | (1) |
The details of this calculation were described previously (Bystroff et al., 2000).
The
matrix may be viewed as a probabilistic representation of all local structure motifs (I-sites) given the sequence and, optionally, the structure. Loosely speaking,
![]() |
In practice, when the optional structure data were included the gamma distribution was sharply peaked at most positions, having probability = 1 for one state and 0 for all others. However, some local structure types are missing or redundantly modeled by HMMSTR. For these positions the gamma distribution was less sharp. In the testing phase of this study, when the structure data were not included in the forward/backward calculation, the motif predictions were more often ambiguous, and the gamma distribution was sharply peaked at fewer positions. Because of the known ambiguity of motif predictions, an early decision was made to treat all local structure assignments/predictions probabilistically [i.e. Equation (1)] rather than as a string of discrete descriptors.
2.2 Summing structure-dependent substitution frequencies from training set MSAs
The matrices of observed and background frequencies for amino acid substitutions were summed in a manner similar to the one described earlier for BLOSUM (Henikoff and Henikoff, 1992), but in our case probabilistic weights were used in order to separate the training set positions into HMMSTR states.
Sequence weights, calculated using the mismatch method (Vingron and Argos, 1989), were used to eliminate the redundancy within an given MSA. For example, in a MSA with L aligned sequences, an L x L mismatch matrix is constructed where each matrix element is the number of mismatches between two sequences. The mismatch distance of a sequence a is calculated as the sum over all the elements in row a, divided by sum over the whole matrix. Sequences with lower mismatch distances (w) were more redundant and should therefore contribute less to the sum of substitutions. The contribution of sequences a and b to any observed substitution frequency was the product of the sequence weights (wa x wb).
As in previous studies, no assumption was made about the direction of the substitution; therefore, the substitution matrices are symmetrical.
Likelihood ratios are the ratio of the observed substitution probability and the expected substitution probability. The expected substitution probability was calculated as the product of two background amino acid probabilities. The background frequencies were summed from the same training set MSAs, with each sequence a contributed wa to this sum.
The observed counts of state-dependent substitutions F was calculated from the training set of MSAs. Each column (t) in each MSA (m) was given a probability of being state q (
qt) using HMMSTR. F(i, j|q), the state-specific observed counts of i
j substitutions, was the sum of the
-weighted sequence weights (w) over all sequence pairs a, b where the amino acid at t in a was i and the amino acid at t in b was j, as follows:
![]() | (2) |
Two strategies for expected frequencies
For the expected frequencies of substitution, we considered two models, called Dayhoff (D) and Lipman (L) in the spirit of two classic bioinformatics experiments in estimating expectation values for sequence alignments. Margaret Dayhoff's model (Dayhoff et al., 1978, 1983) for random alignments was to align scrambled sequences, whereas David Lipman aligned randomly selected non-homologous but unscrambled sequences (Lipman et al., 1984). The scores for random natural sequences were generally higher than for scrambled sequences, showing the non-random nature of protein sequence space.
Our D-model assumes that the expected frequency of substitution is not dependent on the structure but only on the overall background frequencies of the amino acids. Our L-model uses the structural context to estimate the expected frequency of substitution, assigning a different background frequency distribution to each state. The strategy applied in the L-model should give us a better estimate of the significance of a substitution given the structure, but with a finite-sized database we run the risk of sparse or missing data for some of the 210 possible substitutions and 281 states. And the poorer results for the more finely-divided L-model suggest that sparse data were a problem. The D-model suffers less from the sparse data. The following formulas were used to sum the background counts [Equation (3)] or state-dependent background counts [Equation (4)] of each amino acid from the training set.
![]() | (3) |
![]() | (4) |
A small pseudo-count (
) was added to the expected and observed substitution frequencies to overcome the problem of sparse and missing data. The choice of pseudo-count does not greatly a effect the resulting alignments but prevents certain run-time errors from occurring.
The frequencies F were normalized to probabilities, P. The observed conditional probability of a structure-specific substitution P(i, j|q) is as follows:
![]() | (5) |
The normalized, structure-specific background amino acid probabilities are
![]() | (6) |
Structure-independent substitution probabilities P(i, j) were also calculated, in order to directly compare the added value of structure-specific models.
![]() | (7) |
Structure-independent background amino acid probabilities P(i) are
![]() | (8) |
2.3 Generating the alignment matrix
Pairwise alignment programs generally start by calculating the alignment matrix (A), where each element Atu is the score for the two amino acids at and bu, where a and b are the two sequences being aligned and t and u are two sequence positions. In traditional Dynamic Programming algorithm, Atu is assigned the appropriate LLR from the global substitution matrix (Smith and Waterman, 1981).
To calculate the alignment matrix using HMMSUM, we first calculate the gamma distribution for each sequence using the HMMSTR model, as described in the previous section. Then, we sum the substitution probabilities, P(at, bu|q), over all values of q, weighted by the joint probability that positions t and u are in state q. The joint probability is simply the product of the
-values. P(at, bu|Q) is defined as the weighted sum over all states q of the substitution probability.
![]() | (9) |
We also need to find the weighted sum of the expected substitution frequencies if we are using the L-model. We define P(at|Q) to be the weighted sum of the conditional background frequencies over all states q,
![]() | (10) |
Now we can define the LLRs in the A matrix in the terms presented above in Equations (5)(10). Three models are shown. The M-model is independent of structure, a single substitution matrix like BLOSUM and PAM. The D-model depends on the predicted structure and assumes a structure-independent background substitution probability, which is the product of the background amino acid probabilities. The L-model depends on predicted structure and assumes a structure-dependent background substitution probability.
![]() | (11) |
![]() | (12) |
![]() | (13) |
Alignments were carried out using SmithWaterman local Dynamic Programming and gap penalties were optimized as described below. An overview of construction and testing of the HMMSUM model is shown in Figure 1.
|
2.4 Measuring accuracy in alignment
To assess the performance of the HMMSUM model, we used a documented benchmark database, called BAliBASE (Bahr et al., 2001) as a reference set. BAliBASE consists of 167 reference MSAs, with >2100 sequences. All alignments are based on 3D structural superpositions (except reference set 7) which we believe to be as close as possible to the true orthologous relationships. Currently, BAliBASE is categorized into eight different reference sets for different purposes. Within reference set 1 there are several subcategories: short sequences (<100 residues); medium sequences (200300 residues) and long sequences (>500 residues), and each subcategory is further divided into different levels of percent identity (<25, 2040 and >35%). For this study, we extracted all 99 of the MSAs in the twilight-zone range (from 7 to 25% identify) from reference set 1 of BAliBASE 2.0. A total of 147 available reference pairwise alignments are obtained from those sequences, containing a total of 27 930 true matches.
The performance of the alignment methods was evaluated by assessing the accuracy (the fraction of predicted matches that were true matches) and coverage (the fraction of true matches that were predicted matches) over the 27 930 matches in the 147 alignments. The statistical significance (P-value) of the differences in true matches and coverage was evaluated using the Wilcoxon signed-rank test (Wilcoxon, 1945) over the individual alignments. Alignments were ranked by the number of correct matches or by accuracy, and the P-value was calculated for the likelihood that the differences between the methods were the result of chance.
Alignment algorithms
The sequence alignments were performed in two ways. Local Dynamic Programming alignment is an implementation of the SmithWaterman algorithm, which returns the optimal alignment given the scoring scheme. In the SmithWaterman algorithm, we utilized the affine gap penalty method (Altschul and Erickson, 1986). Optimized gap opening penalties and gap extension penalties were determined for each scoring method by applying a grid search and evaluating the results by counting the total correct matches.
Another algorithm, BAA computes the Bayesian posterior probability of all possible at, bu matches. In BAA, likelihood ratios, rather than LLRs are used. For traditional substitution matrices like BLOSUM, we used the exponent of the reported LLRs. BAA is carried out by first performing a forward sum, similar to the forward step in the forward/backward algorithm, over a maximum of K-aligned blocks. We used K = min (20, n/10), where n is the length of the shorter of the two proteins being aligned. Instead of a trace-back through the optimal alignment, a sample-back approach is taken in BAA. Probabilistic sampling of the alignment space produces a probability distribution over all alignments, expressed as a matrix of probabilities S, where Stu is the probability of seeing positions t, u matched in an alignment.
Different scoring schemes were compared by calculating the probability of the true alignment, i.e. the BAliBASE alignment, given the sample-back distribution, S.
![]() | (14) |
Using BAA allows us to evaluate alignment quality on all possible suboptimal alignments, and also avoids the use of parameters such as gap opening penalties and extension penalties, which are known to highly affect the result of alignments.
| 3 RESULTS AND DISCUSSION |
|---|
|
|
|---|
3.1 Qualitative analysis of the substitution matrices
Substitution matrices such as BLOSUM, PAM and SDM are expressed as symmetrical 20 x 20 matrices of LLRs. HMMSUM on the other hand, consists of 281 matrices and is not so easily viewed and understood. We may inspect the new model by looking at its marginal properties, and also by asking whether different structural contexts had significantly different substitution probabilities, even when the amino acid composition was similar.
3.1.1 Marginal characteristics of HMMSUM
By using marginal probabilities of substitutions [Equation (11)], a single substitution matrix (denoted as HMMSUM-M) can be generated in the 20 x 20 format. A comparison of HMMSUM-M and SDM scoring matrices is shown in Figure 2. All of the structure information was ignored in the HMMSUM-M model. Overall, we see a higher score for changes relative to identities in the HMMSUM-M model when compared with SDM, as indicated by the relatively higher numbers off the diagonal (after correcting for differences in the mean and standard deviation). This increases the importance of mutations in the alignment score, relative to conserved residues. HMMSUM-M singles out Gly, Asp, Pro and Asn for the highest substitution penalties, while Thr, Gln and Met are the most mutable. SDM, on the other hand, gives Gly, Pro, Ile and Trp the highest penalty for substitution, and finds Gln, Thr, and Arg to be the most mutable. The normalized mean mismatch penalties for the amino acids correlate poorly (R = 0.60) between HMMSUM-M and SDM. The correlation over all elements of HMMSUM-M with SDM (R = 0.72) and with BLOSUM50 (R = 0.72) is weaker than these two correlate with each other (R = 0.84). Overall, the diversity between models suggests that selection pressure has multiple solutions that cannot all be modeled by a single matrix.
|
3.1.2 Analysis of two structural contexts that favor glycine
Two conserved glycine states were chosen to illustrate the effect of structural context on the substitution probabilities. For comparison purposes only, observed and background frequencies of each state q were transformed to LLRs according to Equation (15) to generate substitution matrices
![]() | (15) |
= 0.25. Note, however, that the matrices were not used in this form in making alignments. Instead the alignment matrices are calculated directly from the substitution probabilities, as described in Equations (913), not from LLRs. The substitution matrix for HMMSTR state 30, a conserved glycine in a ß-turn motif (Figure 3, lower left) was found to be dramatically different from the substitution matrix for state 102, a conserved glycine in a ß-strand conformation (Figure 3, upper right). The LLRs were calculated using structure-dependent expected substitution values, as in HMMSUM-L. Both states represent common structural contexts.
|
The ß-strand state has exactly the expected probability of GG substitutions (i.e. conservations), given the glycine-rich composition of that state (LLR = 0). But the ß-turn state, which is also glycine-rich, has a much higher than expected rate of GG substitution (LLR = 5). This reflects the energetic preference for Gly in a ß-turn.
Residues other than glycine are less likely to be found in states 30 and 102, but if they are found in these contexts, they are scored very differently. State 102 (ß-strand) favors a mismatch of almost any sort, almost as much as an identity match. This means that residues other than Gly mutate more often than expected in this context. This is not surprising since the expected substitution probability would be very small for all non-Gly amino acids, and it means that those rare occurrences of non-glycine residues in this context occurred together in the training set. For state 30 (ß-turn) the situation is different. Glycine mutates more often than expected to certain residues, C, D, P, W and Y. In particular the preferred Gly to Pro mutation makes sense for a ß-turn. But other substitutions in this turn state are either not observed (large negative LLRs) or they occur more often than expected, for the same reasons as mentioned before.
By modeling the structural contribution to evolutionary selective pressure, HMMSUM should add biological meaning to the alignment of protein sequences. If so, an improvement should be seen in the alignment accuracy.
3.2 Evaluation of alignment accuracy
To test whether the HMMSUM improves the alignments of twilight-zone sequences, we compared the matrix performance of the HMMSUM models with BLOSUM matrices or structure-based substitution matrices SDM/HSDM, which has been reported as the best choice for aligning proteins with low percent identity (Prli
et al., 2000). Three methods were used in our study to construct the Atu matrix [Equations (1113)] and alignments were performed using local Dynamic Programming.
Gap opening/extension penalties are optimized for each matrix before the evaluation of accuracy. An 8-fold jackknife cross-validation was used for all methods tested when determining the optimal penalties in order to avoid over-fitting. Evaluations are reported here using the alignment parameters that gave us the maximum correct matches (Table 1). The average gap opening/extension penalties after cross-validations for SDM (7.1/0.4) and HSDM (19.5/1.0) are comparable to previous reports (6.6/0.6 and 19.0/0.8, respectively).
3.2.1 HMMSUM-M results
Using optimal gap parameters (15.0/0.9), the structure-independent model HMMSUM-M finds
6% more of the correct matches than BLOSUM40 (P = 0.053), the best performing BLOSUM matrix in our evaluation. This result cannot be attributed to the use of structure, since neither structure assignment nor prediction was used, but it may be attributed to using a different sequence weighting scheme in summing substitution frequencies. Our weighting scheme increased the contribution of distant homologs to the substitution scores, and the results are seen in the increased LLRs for changes relative to conservations.
However, the increase in the number of correct matches was accompanied by an increase in the number of incorrect matches, and the accuracy in alignments remained unimproved overall. The Wilcoxon test indicated that the change of accuracy in individual alignments may have resulted by chance when compared with BLOSUM40 (P = 0.887).
3.2.2 HMMSUM-L results
As mentioned in section 2, two sets of structure-dependent matrices, HMMSUM-D and HMMSUM-L, were made based on two different assumptions about the expected substitutions. HMMSUM-L, with optimized gap parameters (11.6/1.4), found
5% more correct matches than BLOSUM40 (P = 0.065). This was fewer than the structure-independent HMMSUM-M model. We believe that the L-model, although theoretically more sound, suffered from small counts problems and therefore did not manifest its theoretical potential.
However, the small increase in correct matches was not accompanied by an increase in incorrect matches completely, and the overall accuracy in alignments improved slightly. This improvement was found to be possibly the result of chance when compared with BLOSUM40 (P = 0.371).
3.2.3 HMMSUM-D results
HMMSUM-D, a structure-dependent model with structure-independent background frequencies, performed much better than expected, finding 24% more correct matches than the BLOSUM40 matrix. Both the accuracy and the coverage of correct matches were significantly improved when optimal gap parameters were used (21.0/0.4). Of the 147 alignments 61% were more accurate using this model, an unlikely chance occurrence (P < 0.001).
When compared with SDM and HSDM, the state-of-the-art matrices derived from structures, HMMSUM-D still showed a 16% and an 18% improvement in correct matches, respectively (P < 0.001). About 54% out of the 147 alignments were more accurate when using HMMSUM-D in both comparisons. The alignment accuracy and coverage of HMMSUM-D were significantly better than SDM (P = 0.025 and P < 0.001, respectively).
3.2.4 Results using structure predictions and structure assignments
To investigate the effects of structural information on difficult alignment problems, several more HMMSUM models were constructed. A structure-prediction dependent model, HMMSUM-DNS, was summed from the training set using the same method as that described from HMMSUM-D, except that the calculation of
was performed without using the structural information. The
-values in HMMSUM-DNS can be expressed loosely as follows:
![]() |
The
weights are therefore structure predictions rather than assignments. Surprisingly, the accuracy significantly improved over the results when structure information was used in HMMSUM-D (P = 0.002). When measured using correct matches, alignment accuracy or coverage, this prediction-based model consistently performed best among methods tested. Figure 4 illustrates this result using the two structurally similar proteins shown in Figure 5.
|
|
To further understand the effect of structural information on the alignment accuracy, we used the true structural information for both the training phase and the testing phase. Although a pairwise sequence alignment would not be the method of choice for aligning two known structures, this test was thought to simulate a best case scenario where the structure predictions were perfect for both sequences. As expected, the alignment accuracy for this model, indicated as HMMSUM-DS, was better than that of HMMSUM-D, where structure was ignored in the testing phase, but surprisingly it was not as good as HMMSUM-DNS, where the structure was ignored in both phases (Table 1).
3.2.5 Results using three structural states
To address the importance of precision in structure predictions and assignments, bare minimum structure information was adapted into the HMMSUM model. A simple three-state (helix, strand and loop) HMMSUM-D model (denoted HMMSUM-D3) was made and analyzed in an analogous fashion. The performance of HMMSUM-D3 was comparable with other HMMSUM-D models in correct matches, alignment accuracy or coverage (Table 1). The same improvement was also found in another model, HMMSUM-D3+NS, which was built in the same way as HMMSUM-DNS, using structure predictions rather than structure assignments, but using only three structure states. However, both three-state models are better than the best reported single-matrix model, SDM.
3.3 Evaluation of suboptimal alignments
By examining the details of alignments from HMMSUM models, we found that some segments were aligned to the wrong regions by shifting a few positions (Fig. 6a). This suggested that the true alignment was suboptimal but nearby in parameter space, perhaps accessible by a slight change in the substitution matrices.
|
Alignments that are off by a little are just as bad as alignments that are off by a lot, according to our measure of accuracy for the Dynamic Programming algorithm (Table 1) since this method can deliver only the one optimal alignment, and arbitrarily chooses one alignment if there are other equal-scoring possibilities. There are several modifications to Dynamic Programming that consider suboptimal alignments (Waterman and Eggert, 1987; Huang et al., 1990; Chao et al., 1994), however, in this study we chose to use BAA, which models all alignments.
The BAA algorithm was also used to perform each of the 147 pairwise alignments using each substitution model, giving a posterior probability distribution over all alignments for each case. To evaluate the performance of HMMSUM-D when compared to SDM using BAA, the probabilities of reference alignments were compared using Ptrue [Equation (14)]. Around 88% out of 147 reference alignments have higher Ptrue when using HMMSUM-D than with SDM. Figure 6b shows the results of BAA for the sequences in Figure 6a.
| 4 CONCLUSIONS |
|---|
|
|
|---|
Amino acid substitution matrices that are local structure-specific significantly improve the accuracy of pairwise alignments when the sequences are distant homologs, even if only a simple three-state structural model is used. The improvement can be seen in the accuracy and coverage of correct matches when compared with curated structure-based alignments.
Substitution matrices based on structure predictions were found to be significantly more accurate (P < 0.001) than substitution matrices based on structure assignments. This result runs counter-intuitive since it suggests that the sequences alone contain more information than the structure, even though we know that structural homology extends beyond the reach of sequence-based alignment methods. One possible explanation is that structure predictions contain information about weak or multiple secondary structure propensities that is lost when a unique structure is assigned to each position. Weak propensities may be conserved in some parts of the structure in order to enforce the order of events in the folding pathway. However, experimental studies have shown no conservation of propensity across proteins of the same family (Muñoz et al., 1995).
An alternative explanation is that substitution data were sparse when unique structure assignments were used but less sparse when predictions were used, leading to less random error in substitution scores when indexed to predicted structures, and more random error in substitution scores when indexed to assigned structures. In some cases the structure predictions were wrong, but perhaps the added data more than compensated for the inclusion of some false data. It was also considered that homologous proteins might tend to have correlated wrong predictions, and that would improve alignments. The sparse data issue also partially explains why the more finely divided HMMSUM-L model did not perform as well as expected. Meanwhile, the three-state model HMMSUM-D3 may have suffered less from sparse data, since using structure predictions (HMMSUM-D3+NS) did not further improve alignments.
An improvement was also found in the scores of true suboptimal alignments using the Bayesian Adaptive method. An additional 34% of 147 individual alignments were improved when using the BAA algorithm in comparing HMMSUM-D and SDM. Accuracy in finding true suboptimal alignments is important since optimal alignments for twilight zone sequences are only 4050% accurate.
| Acknowledgments |
|---|
We thank Bobbie-Jo Webb-Robertson for contributing to the Bayes Aligner coding and for helpful conversations. This work was supported by NSF award DBI-0343206 to C.B.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Christos Ouzounis
Received on March 9, 2005; revised on November 2, 2005; accepted on December 8, 2005
| REFERENCES |
|---|
|
|
|---|
Altschul, S.F. and Erickson, B.W. (1986) Optimal sequence alignment using affine gap costs. Bull. Math. Biol, . 48, 603616[ISI][Medline].
Altschul, S.F., et al. (1990) Basic local alignment search tool. J. Mol. Evol, . 215, 403410.
Altschul, S.F., et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, . 25, 33893402
Bahr, A., et al. (2001) BAliBASE (Benchmark Alignment dataBASE): enhancements for repeats, transmembrane sequences and circular permutations. Nucleic Acids Res, . 29, 323326
Bystroff, C. and Baker, D. (1998) Prediction of local structure in proteins using a library of sequence-structure motifs. J. Mol. Biol, . 281, 565577[CrossRef][ISI][Medline].
Bystroff, C. and Garde, S. (2003) Helix propensities of short peptides: molecular dynamics versus bioinformatics. Proteins, 50, 552562[CrossRef][ISI][Medline].
Bystroff, C., et al. (2000) HMMSTR: a hidden Markov model for local sequence-structure correlations in proteins. J. Mol. Biol, . 301, 173190[CrossRef][ISI][Medline].
Chao, K.M., et al. (1994) Recent developments in linear-space alignment methods: a survey. J. Comput. Biol, . 1, 271291[Medline].
Dayhoff, M.O., et al. (1983) Establishing homologies in protein sequences. Methods Enzymol, . 91, 524545[ISI][Medline].
Dayhoff, M.O., Schwartz, R.M., Orcutt, B.C. (1978) A model of evolutionary change in proteins. In Dayhoff, M. (Ed.). Atlas of Protein Sequence and structure, , Maryland, USA. National Biomedical Research Foundation, Silver Spring Vol. 5, Suppl. 3, , pp. 345352.
Doolittle, R.F. (1981) Similar amino acid sequences: chance or common ancestry? Science, 214, 149159
Henikoff, S. and Henikoff, J.G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA, 89, 1091510919
Hobohm, U. and Sander, C. (1994) Enlarged representative set of protein structures. Protein Sci, . 3, 522524[Abstract].
Hou, Y., et al. (2004) Remote homolog detection using local sequence-structure correlations. Proteins, 57, 518530[CrossRef][ISI][Medline].
Huang, X.Q., et al. (1990) A space-efficient algorithm for local similarities. Comput. Appl. Biosci, . 6, 373381
Huang, X.Q. and Miller, W. (1991) A time-efficient, linear-space local similarity algorithm. Adv. Appl. Math, . 12, 337357[CrossRef].
Jaroszewski, L., et al. (2002) In search for more accurate alignments in the twilight zone. Protein Sci, . 11, 17021713
Lipman, D.J., et al. (1984) On the statistical significance of nucleic acid similarities. Nucleic Acids Res, . 12, 215226[ISI][Medline].
Munoz, V., et al. (1995) The distribution of alpha-helix propensity along the polypeptide chain is not conserved in proteins from the same family. Protein Sci, . 4, 15771586[Abstract].
Northey, J.G., et al. (2002) Protein folding kinetics beyond the phi value: using multiple amino acid substitutions to investigate the structure of the SH3 domain folding transition state. J. Mol. Biol, . 320, 389402[CrossRef][ISI][Medline].
Pearson, W.R. (1995) Comparison of methods for searching protein sequence databases. Protein Sci, . 4, 11451160[Abstract].
Pearson, W.R. (1996) Effective protein sequence comparison. Methods Enzymol, . 266, 227258[ISI][Medline].
Pearson, W.R. (1998) Empirical statistical estimates for sequence similarity searches. J. Mol. Biol, . 276, 7184[CrossRef][ISI][Medline].
Pearson, W.R. (2000) Flexible sequence similarity searching with the FASTA3 program package. Methods Mol. Biol, . 132, 185219[Medline].
Prli
, A., et al. (2000) Structure-derived substitution matrices for alignment of distantly related sequences. Protein Eng, . 13, 545550
Rabiner, L.R. (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 77, 257286[CrossRef].
Sasson, I. and Fischer, D. (2003) Modeling three-dimensional protein structures for CASP5 using the 3D-SHOTGUN meta-predictors. Proteins, 53, Suppl. 6, 389394.
Smith, T.F. and Waterman, M.S. (1981) Identification of common molecular subsequences. J. Mol. Biol, . 147, 195197[CrossRef][ISI][Medline].
Smith, T.F., et al. (1981) Comparative biosequence metrics. J. Mol. Evol, . 18, 3846[CrossRef][ISI][Medline].
Thompson, J.D., et al. (1999) BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics, 15, 8788
Vingron, M. and Argos, P. (1989) A fast and sensitive multiple sequence alignment algorithm. Comput. Appl. Biosci, . 5, 115121
Waterman, M.S. and Eggert, M. (1987) A new algorithm for best subsequence alignments with application to tRNA-rRNA comparisons. J. Mol. Biol, . 197, 723728[CrossRef][ISI][Medline].
Wilcoxon, F. (1945) Individual Comparisons by Ranking Methods. Biometrics, 1, 8083[CrossRef][ISI].
Xu, J., et al. (2003) RAPTOR: optimal protein threading by linear programming. J. Bioinform. Comput. Biol, . 1, 95117[CrossRef][Medline].
Yi, Q., et al. (1998) Prediction and structural characterization of an independently folding substructure in the src SH3 domain. J. Mol. Biol, . 283, 293300[CrossRef][ISI][Medline].
Yona, G. and Levitt, M. (2002) Within the twilight zone: a sensitive profile-profile comparison tool based on information theory. J. Mol. Biol, . 315, 12571275[CrossRef][ISI][Medline].
Zhu, J., et al. (1998) Bayesian adaptive sequence alignment algorithms. Bioinformatics, 14, 2539
This article has been cited by other articles:
![]() |
P. Fariselli, I. Rossi, E. Capriotti, and R. Casadio The WWWH of remote homolog detection: The state of the art Brief Bioinform, March 1, 2007; 8(2): 78 - 87. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. Pei and N. V. Grishin MUMMALS: multiple sequence alignment improved by using hidden Markov models with local structural information Nucleic Acids Res., September 11, 2006; 34(16): 4364 - 4374. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||





















)] and SDM matrix (
). True alignments from BAliBASE (#) are highlighted in gray color. Actual secondary structures of 4ENL are indicated in cylinders as
-helices and arrows as ß-strands. Two sequences have 17.1% identity.


