Bioinformatics Advance Access originally published online on August 25, 2005
Bioinformatics 2005 21(19):3704-3710; doi:10.1093/bioinformatics/bti616
Pairwise alignment incorporating dipeptide covariation


1Department of Plant and Microbial Biology 111 Koshland Hall #3102 University of California, Berkeley, CA, USA
2Department of Molecular and Cell Biology, University of California Berkeley, CA 94720-3102, USA
*To whom correspondence should be addressed at Physical Biosciences Division, Lawrence Berkeley Natl Lab., Berkeley, CA 94720, USA
| Abstract |
|---|
Motivation: Standard algorithms for pairwise protein sequence alignment make the simplifying assumption that amino acid substitutions at neighboring sites are uncorrelated. This assumption allows implementation of fast algorithms for pairwise sequence alignment, but it ignores information that could conceivably increase the power of remote homolog detection. We examine the validity of this assumption by constructing extended substitution matrices that encapsulate the observed correlations between neighboring sites, by developing an efficient and rigorous algorithm for pairwise protein sequence alignment that incorporates these local substitution correlations and by assessing the ability of this algorithm to detect remote homologies.
Results: Our analysis indicates that local correlations between substitutions are not strong on the average. Furthermore, incorporating local substitution correlations into pairwise alignment did not lead to a statistically significant improvement in remote homology detection. Therefore, the standard assumption that individual residues within protein sequences evolve independently of neighboring positions appears to be an efficient and appropriate approximation.
Availability: Sequence data, software and matrices are freely available from http://compbio.berkeley.edu/
Contact: gec{at}compbio.berkeley.edu
Supplementary information: Supplementary data for this paper is available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
Among the most commonly used tools in computational biology are the pairwise protein sequence alignment methods, such as SSEARCH, FASTA and BLAST (Smith and Waterman, 1981; Pearson and Lipman, 1988; Altschul et al., 1990; Durbin et al., 1998). These algorithms are elegant, efficient and effective methods of detecting similarity between closely related protein sequences. However, the ability of fast pairwise methods to detect homology deteriorates as the divergence between the sequences increases. Past the twilight zone (20–30% pairwise sequence identity), only a small fraction of related proteins can be found (Sander and Schneider, 1991; Doolittle, 1992; Brenner et al., 1998; Green and Brenner, 2002). Therefore, in order to make better use of the vast and increasing amount of available biological sequence data, there is an immediate need for more sensitive, fast database search methods.
For the sake of computational efficiency, current pairwise alignment methods make several simplifying assumptions. First, amino acid substitutions are assumed to be homogeneous between protein families. The most commonly used substitution matrices [BLOSUM (Henikoff and Henikoff, 1992) and PAM (Dayhoff et al., 1978)] are thus generic models of protein sequence evolution across all protein sequence families at various evolutionary distances. Second, substitutions at a given site are assumed to be uncorrelated with those at neighboring sites, i.e. the likelihood of substituting an amino acid X for amino acid Y is assumed to be independent of the sequence context of X. It is known that both of these simplifying assumptions introduce errors into homology searching. Relaxing the assumption of homogeneous substitution across protein families can significantly improve the performance of pairwise alignment methods (Yu et al., 2003). Furthermore, alignment methods that remove the assumption of homogeneity among different positions in the sequence, and instead model the heterogeneity of the given protein family, have been found to be dramatically superior for remote homology detection (Park et al., 1998; R. E. Green and S. E. Brenner, Unpublished data). Unfortunately, these profile methods [e.g. PSI-BLAST (Altschul et al., 1997), HMMER (http://hmmer.wustl.edu/) (Eddy, 2001), SAM (Karplus et al., 1998)] are not tractable for all query sequences. They require the presence, identification and correct alignment of homologous sequences in order to generate a model of the query sequence's family. Therefore, the fast and universally applicable pairwise methods remain widely used for database searching, despite their lower sensitivity.
One proposed strategy for increasing the sensitivity of pairwise alignment is to use a more sophisticated scoring function for amino acid substitutions, namely one that is sensitive to the sequence context in which the residue resides. For example, amino acid sequences are correlated with secondary structural features, such as helixes and loops, which can directly lead to structure-dependent substitution patterns (Thorne et al., 1996; Topham et al., 1997; Goldman et al., 1998). Similarly, one might intuitively expect structurally and functionally important residues, such as cysteines and prolines, to be more or less conserved depending on their local sequence environment and the prevalence of particular motifs.
The first large-scale exploration of the effect of sequence context on amino acid evolution was performed by Gonnet et al. (1994), who examined the frequencies of dipeptide substitutions and compared them with the dipeptide substitution frequencies expected assuming no sequence-dependent correlations. Despite the fact that nearly half of the elements of the 400 × 400 observed dipeptide matrix were vacant (owing to the sparsity of data), several interesting patterns were evident. The chief trend was that amino acids are generally more likely to be conserved if they are adjacent to positions that are also conserved.
More recently, Jung and Lee (2000) have taken advantage of the large increase in available data to reexamine trends in dipeptide evolution. They used the observed patterns of substitution within a large set of structure-based alignments to generate dipeptide substitution matrices. Furthermore, they developed an extension to the standard Smith–Waterman alignment algorithm that incorporates a term from these dipeptide matrices. By using sequence and structure context information, they show some improvement in homolog detection in a limited test set. However, their method could not be extensively tested or practically utilized because an efficient dynamic programming method for finding the optimal alignment was not known to the authors. Instead, they adopted a heuristic search that is not guaranteed to find optimal alignments.
In this study, we have extended the work described above by examining the strength of local, dipeptide substitution correlations using the massive amount of alignment data within the BLOCKS database. We have also extended the standard Smith–Waterman algorithm to include local dipeptide correlation information over a user-defined distance. Similar to Smith–Waterman, this new polynomial time algorithm, doublet, finds the optimal alignment under the scoring scheme described. Using a standard remote homolog detection evaluation strategy, we have tested doublet against the Smith–Waterman algorithm to measure the impact of including this extra information. Perhaps surprisingly, we found that incorporating doublet substitution correlations leads to a statistically insignificant difference in homology detection.
| 2 METHODS |
|---|
2.1 Quantifying substitution correlations
Consider two aligned, ungapped sequences, x = x1, x2, …, xn and y = y1, y2, …, yn, both of length n, where each element represents one of the 20 canonical amino acids. We wish to use the patterns of conservation and variation between these sequences to estimate the log odds that the sequences are homologous (i.e. that both sequences have descended from a common ancestor).
![]() | (1) |
Except for very short segments, the background and target probability distributions are large and cannot be directly measured. Therefore, Equation (1) is typically simplified by assuming that substitution probabilities are homogeneous (independent of the location in the fragment) and that both the substitutions and the sequences themselves are uncorrelated from one position to the next. Consequently, the total similarity score is now a sum of independent parts,
![]() | (2) |
This approximation of the full similarity by a sum of singlet substitution scores requires that we neglect all intersite correlations. We can perform a more controlled approximation by noting that a homogeneous multivariate probability can be expanded into a product of single component distributions, pairwise correlations, triplet correlations and so on.
![]() | (3) |
![]() | (4) |
![]() | (5) |
By truncating the expansion of the full similarity score at doublet terms [Equation (4)], we are assuming that triplet and higher order correlations between substitutions are relatively uninformative. For reasons discussed below, this is probably a reasonable approximation. Furthermore, the most important intersite correlations are between residues neighboring on the chain (Fig. 3). Therefore, we can restrict the maximum distance over which doublet interactions are scored without serious error.
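As a hedged sketch of the forms that this description suggests, the full log-odds score, its singlet approximation and a doublet correction at sequence separation l can be written as below. The symbols p and p_l for the background single-residue and residue-pair distributions are notational assumptions, as is the choice of base 2 (bits); the published expressions may differ in detail.

```latex
\begin{align*}
S &= \log_2 \frac{q(x_1,\ldots,x_n;\, y_1,\ldots,y_n)}{p(x_1,\ldots,x_n)\, p(y_1,\ldots,y_n)}, \\
S &\approx \sum_{k=1}^{n} s(x_k; y_k), \qquad
s(i;j) = \log_2 \frac{q(i;j)}{p(i)\,p(j)}, \\
d_l(i,i';j,j') &= \log_2 \frac{q_l(i,i';j,j')}{p_l(i,i')\,p_l(j,j')} - s(i;j) - s(i';j').
\end{align*}
```

Under these assumed definitions, d_l vanishes when substitutions at the two sites are independent, i.e. when q_l(i,i';j,j') = q(i;j) q(i';j') and p_l(i,i') = p(i) p(i'), so the doublet term measures only the departure from the singlet model.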
The average similarity score is the interhomolog mutual information I (Cover and Thomas, 1991), a measure of the intersequence correlations. A high mutual information value indicates strong correlation, whereas a mutual information value of zero indicates uncorrelated variables. Mutual information has various advantages as a correlation measure: it is firmly grounded in information theory, it is additive for independent contributions and it has consistent, intuitive units (bits).
![]() | (6) |
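To make the link between average score and mutual information concrete, the following minimal Python sketch computes the singlet mutual information, in bits, from a joint target distribution. The function name and the dictionary layout are illustrative assumptions, and the background probabilities are taken to be the marginals of the joint distribution.

```python
import math

def mutual_information(q):
    """Average singlet log-odds score, in bits, for a joint distribution q[(a, b)]."""
    # Background (marginal) probabilities implied by the joint distribution.
    px, py = {}, {}
    for (a, b), prob in q.items():
        px[a] = px.get(a, 0.0) + prob
        py[b] = py.get(b, 0.0) + prob
    # I = sum over (a, b) of q(a;b) * log2[q(a;b) / (p(a) p(b))].
    # Pairs with zero probability contribute nothing and are skipped.
    return sum(prob * math.log2(prob / (px[a] * py[b]))
               for (a, b), prob in q.items() if prob > 0.0)
```

For a two-letter toy alphabet, a uniform joint distribution gives zero bits (no correlation), whereas a distribution concentrated on the two conserved pairs gives one bit.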
The preceding analysis applies to contiguously aligned sequence segments. However, in addition to substitutions, protein sequences are modified by the insertion and deletion of residues. Since it is not obvious how to capture the existence of indels in doublet scores, in the following discussion we assume that dipeptide correlations do not extend across gaps, and we adopt the simple and standard affine model of gap lengths. This approximation should have little impact, since aligned detectably homologous sequences tend to have relatively few indels, particularly in regions that are significantly similar.
2.2 Alignment algorithm
We have extended the standard Smith–Waterman optimal local sequence alignment algorithm (Smith and Waterman, 1981) to incorporate doublet substitution scores (Fig. 1). The time complexity of Smith–Waterman is O(nm), where n and m are the lengths of the two sequences. Adding doublet scores increases the complexity to O(nmL), where L is the distance over which substitution correlations are scored. This efficient dynamic programming alignment is possible because, although we are scoring correlations between residues that are not directly aligned, these correlations are local along the chain. The space complexity of our implementation is also O(nmL), which could be improved using standard techniques (Durbin et al., 1998).
The additional similarity score associated with adding the final match pair xi,yj to the alignment contains singlet (S) and doublet (D) substitution scores:
![]() | (7) |
![]() | (8) |
The optimal, highest scoring alignment between two sequences (x = x1, x2, …, xn and y = y1, y2, …, ym) is found by populating a series of score tables, also known as dynamic programming matrices. The entries of the match table, M(i,j,r), are the maximum alignment score for an alignment that terminates with an ungapped segment of length r, ending at the i-th position of x and the j-th position of y. Similarly, the gap tables Gx(i,j) and Gy(i,j) contain the maximum alignment similarity given that the alignment ends with xi or yj gapped. The entries of these tables can be efficiently computed starting from the following boundary conditions: M(i,0,l), M(0,j,l), Gx/y(i,0), Gx/y(0,j) = −∞. A single aligned amino acid pair may signal the beginning of a new local alignment, or it may occur immediately after any alignment gap.
![]() | (9) |
![]() | (10) |
![]() | (11) |
The largest score within the match table marks the last aligned position of the optimal alignment. The full alignment can be found by backtracking through the table, according to the choices previously made during the scoring step.
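The table-filling step can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the doublet implementation itself: boundary cells are set to minus infinity, run lengths beyond L are pooled into a single layer (which keeps the cost at O(nmL)), a gap costs `gap_open` for its first residue and `gap_extend` for each further one, a gap in one sequence is not allowed to follow a gap in the other directly, and only the optimal score is returned (no backtracking). The callables `singlet` and `doublet` and all parameter names are hypothetical.

```python
import math

NEG_INF = -math.inf

def doublet_score(x, y, singlet, doublet, L=1, gap_open=11.0, gap_extend=1.0):
    """Best local alignment score with doublet terms up to separation L.

    singlet(a, b) scores one aligned pair; doublet(l, a, a2, b, b2) scores the
    aligned pairs (a, b) and (a2, b2) lying l positions apart in one ungapped run.
    """
    n, m = len(x), len(y)
    R = L + 1  # run lengths 1..R are tracked; layer R stands for "L + 1 or longer"
    # M[i][j][r]: best alignment ending in an ungapped run of length r at (i, j).
    M = [[[NEG_INF] * (R + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    # Gx[i][j] / Gy[i][j]: best alignment ending with x_i / y_j opposite a gap.
    Gx = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Gy = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            gain = singlet(x[i - 1], y[j - 1])
            for r in range(1, min(R, i, j) + 1):
                if r == 1:
                    # Start a new local alignment, or resume after a gap.
                    prev = max(0.0, Gx[i - 1][j - 1], Gy[i - 1][j - 1])
                else:
                    # A run of length r also gains the doublet term at separation r - 1.
                    l = r - 1
                    gain += doublet(l, x[i - 1 - l], x[i - 1], y[j - 1 - l], y[j - 1])
                    prev = M[i - 1][j - 1][r - 1]
                if r == R:
                    # Runs already longer than L stay in the pooled layer.
                    prev = max(prev, M[i - 1][j - 1][R])
                M[i][j][r] = prev + gain
            Gx[i][j] = max(max(M[i - 1][j]) - gap_open, Gx[i - 1][j] - gap_extend)
            Gy[i][j] = max(max(M[i][j - 1]) - gap_open, Gy[i][j - 1] - gap_extend)
            best = max(best, max(M[i][j]))
    return best
```

With L = 0 the doublet callable is never invoked and the sketch reduces to an affine-gap local alignment score; because doublet terms are accumulated only within an ungapped run, correlations do not extend across gaps.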
We used the method of Bailey and Gribskov (2002) to fit an extreme value distribution to the results of aligning a query sequence against a database of possible homologs. The maximum-likelihood parameters are then used to assign E-values to each alignment.
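As a minimal illustration of the final step, the sketch below converts a score into an E-value under a Gumbel (extreme value) null distribution. It assumes that the scale `lam` and location `mu` have already been fitted; the maximum-likelihood fitting and length normalization of Bailey and Gribskov (2002) are not reproduced here, and the function and parameter names are hypothetical.

```python
import math

def evd_e_value(score, lam, mu, n_targets):
    """E-value of `score` under a Gumbel null with scale lam and location mu."""
    # P(S >= score) = 1 - exp(-exp(-lam * (score - mu))); expm1 keeps small
    # tail probabilities accurate.
    p_value = -math.expm1(-math.exp(-lam * (score - mu)))
    # Expected number of chance hits at least this good among n_targets comparisons.
    return n_targets * p_value
```

The E-value falls roughly exponentially with increasing score, so high-scoring alignments receive E-values far below one.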
2.3 Doublet BLOcks SUbstitution matrix
A doublet substitution matrix [Equation (5)] contains 20⁴ = 160 000 entries, of which 20² × (20² + 1)/2 = 80 200 are unique as a result of the underlying symmetry, dl(i,i';j,j') = dl(j,j';i,i'). To accurately estimate these scores we require a very large collection of reliably aligned protein sequences. The BLOCKS database is one such resource (Henikoff and Henikoff, 1992; Henikoff et al., 2000). Each database block consists of a reasonably reliable, ungapped multiple sequence alignment of a core protein region. BLOCKS version 13+ contains 11 853 blocks, each containing, on average, 56 segments with an average length of 26 residues. Overall, about 10⁹ pairwise amino acid comparisons are available for study.
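The entry counts quoted above can be checked with a few lines of Python (the variable names are illustrative only):

```python
ALPHABET = 20                               # canonical amino acids
pairs = ALPHABET ** 2                       # ordered dipeptides (i, i')
total_entries = pairs ** 2                  # all (i, i'; j, j') combinations
unique_entries = pairs * (pairs + 1) // 2   # unordered pairs of dipeptides, diagonal included
print(total_entries, unique_entries)        # 160000 80200
```

The symmetry exchanges the dipeptide (i,i') in one sequence with the dipeptide (j,j') in the other, so the unique entries are the unordered pairs drawn from 400 dipeptides.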
The widely used canonical BLOcks SUbstitution Matrices (BLOSUM) were generated from version 5 of the BLOCKS database (Henikoff and Henikoff, 1992). In order to generate a series of matrices representing different evolutionary divergences, the sequences in each block are clustered at a given level of sequence identity and the intercluster sequence correlations are collected. Thus BLOSUM100 (where only 100% identical sequences are clustered) represents a wide range, including low levels, of evolutionary divergence, whereas BLOSUM30 represents only correlations between very diverged sequences.
In principle, we should match the divergence inherent in the substitution matrix to the divergence of the pair of sequences we wish to align (Bishop and Thompson, 1986; Thorne et al., 1991, 1992; Altschul, 1993). However, this is computationally expensive, and, in practice, a single matrix is chosen based on its ability to align remote homologs, on the grounds that matching close homologs is relatively easy (Brenner, 1996, 1998; Crooks and Brenner, 2005). In a recent evaluation of remote pairwise homology detection efficacy (Green and Brenner, 2002; Zachariah et al., 2005), we discovered that the BLOSUM65 substitution matrix, reparameterized from the BLOCKS 13+ database, was more effective than any other reparameterized BLOSUM (BLOCKS 13+), classic BLOSUM (BLOCKS 5) or PAM (Dayhoff et al., 1978) substitution matrix, and was comparable to the most effective VTML matrix (Müller et al., 2002). Consequently, we have built singlet and doublet substitution matrices from the BLOCKS 13+ database at 65% clustering, using an adaptation of the original BLOSUM clustering code (Henikoff and Henikoff, 1992). This provides approximately 10⁷–10⁸ independently aligned doublets, depending on the sequence separation l.
The estimated doublet target frequencies, ql(i,i';j,j'), were smoothed and regularized by adding a pseudocount for each doublet (i,i';j,j') to the raw count data, n(i,i';j,j'). These pseudocounts are taken to be proportional to the marginal singlet target probabilities, ql(i;j) ql(i';j').
![]() | (12) |
![]() | (13) |
2 × 10⁶, which can be compared with the 10⁷–10⁸ actual observations. The full details are given in the Supplementary materials. A representative subset of a doublet substitution matrix is shown in Figure 2.
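The regularization described above can be sketched as follows. This is a hedged illustration only: the function name, the single `total_pseudocount` weight and the way the smoothed counts are normalized are assumptions, and the exact procedure is the one given in the Supplementary materials.

```python
from itertools import product

def smoothed_doublet_frequencies(counts, singlet_q, total_pseudocount):
    """Blend raw doublet counts with pseudocounts proportional to q(i;j) * q(i';j').

    counts[(i, i2, j, j2)] holds raw doublet counts; singlet_q[(i, j)] holds the
    singlet target probabilities, assumed to sum to one.
    """
    letters = sorted({a for pair in singlet_q for a in pair})
    n_observed = sum(counts.values())
    q = {}
    for i, i2, j, j2 in product(letters, repeat=4):
        # Pseudocount for this doublet: its expected share if the two sites were uncorrelated.
        pseudo = total_pseudocount * singlet_q.get((i, j), 0.0) * singlet_q.get((i2, j2), 0.0)
        raw = counts.get((i, i2, j, j2), 0.0)
        q[(i, i2, j, j2)] = (raw + pseudo) / (n_observed + total_pseudocount)
    return q
```

Because the pseudocounts follow the uncorrelated expectation, doublets that are rarely or never observed are pulled towards a doublet score of zero rather than towards an extreme value.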
Standard statistical errors were estimated by non-parametric Bayesian bootstrap resampling on sequence blocks (Efron, 1979; Rubin, 1981). Instead of assigning equal weight to every sequence block, each block is instead given a random weight drawn from a Dirichlet distribution. This random reweighting induces random changes in the estimated scores, thereby providing an estimate of the statistical errors caused by the finite size and inhomogeneity of the training data.
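A minimal Python sketch of this reweighting scheme is given below. It assumes, for illustration only, that the quantity of interest is a weighted mean of one value per block; the function names, the flat Dirichlet prior and the number of resamples are assumptions, not details taken from the study.

```python
import random
import statistics

def dirichlet_weights(n, rng):
    """Draw n block weights from a flat Dirichlet distribution."""
    # Independent Exp(1) draws, normalised to sum to one, are Dirichlet(1, ..., 1).
    draws = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]

def bayesian_bootstrap_se(block_values, n_resamples=500, seed=0):
    """Standard error of the block-weighted mean under random Dirichlet reweighting."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        weights = dirichlet_weights(len(block_values), rng)
        estimates.append(sum(w * v for w, v in zip(weights, block_values)))
    # The spread of the re-estimated statistic approximates its statistical error.
    return statistics.stdev(estimates)
```

In practice the re-estimated statistic would be a substitution score or a coverage value recomputed from the reweighted blocks, rather than a simple mean.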
2.4 Evaluation of remote homology detection
We have previously developed and applied a sensitive strategy for evaluation of database search methods (Brenner et al., 1998; Green and Brenner, 2002; Zachariah et al., 2005; Price et al., 2005). This strategy is made possible by the availability of a large collection of protein sequences whose evolutionary interrelations are known (primarily from structural information). In our approach, each sequence is aligned against every other sequence, and the alignment scores are used to determine putative homologs. We then consider the proportion of correctly identified homologs as a function of erroneous matches. Since the homology information derives from sequence-independent data, we avoid the circularity inherent in other evaluation approaches.
The collection of related sequences is derived from the structural classification of proteins (SCOP) database (Murzin et al., 1995). We use the ASTRAL compendium (Chandonia et al., 2004) of representative subsets of SCOP release 1.61 (Sept. 2002), filtered so that no two domains share more than 40% sequence identity. We partition every other SCOP fold into separate test and training subsets of approximately equal size, each containing approximately 550 superfamilies, 2500 sequences and 50 000 homologous sequence pairs. To avoid overfitting, adjustable parameters are optimized using the training set. Results of an all-versus-all comparison of the test set, using these optimized parameters, are reported as a plot of coverage (fraction of true relations found) versus errors per query (EPQ), the total number of false relations divided by the number of sequences (Fig. 4). The raw, unnormalized coverage is the fraction of all true relations that are found.
Since the number of relations within a superfamily scales as the square of the size of the superfamily, and because SCOP superfamilies vary greatly in size, this reported coverage is dominated by the ability to detect relations within the largest superfamilies. To compensate for this unwarranted dependence, we also report the average fraction of true relations per sequence (linear normalization) and the average fraction of true relations per superfamily (quadratic normalization). In general, large superfamilies are more diverse, and the relationships within them are harder to discover (Green and Brenner, 2002). Thus, unnormalized coverage is typically less than the linearly normalized coverage, which in turn is less than quadratically normalized coverage. One important point of comparison for search results is 0.01 EPQ rate for linearly normalized results, the average fraction of true relations per database query at a false positive rate of 1 in 100. We report the observed difference in coverage of two methods at this selected EPQ, and determine standard statistical errors and confidence intervals using Bayesian bootstrap resampling (Rubin, 1981; Price et al., 2005).
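The coverage measures described above can be sketched in a few lines of Python. This is an illustrative simplification, not the evaluation code used in the study: hits are assumed to be (E-value, query, is_homolog) tuples, ties in E-value are broken by input order, the quadratic (per-superfamily) normalization is omitted and all names are hypothetical.

```python
from collections import Counter

def coverage_at_epq(hits, true_per_query, epq=0.01):
    """Return (raw, linearly normalized) coverage at a given errors-per-query rate.

    hits: iterable of (e_value, query, is_homolog) for every reported pair.
    true_per_query: number of true relations for each query in the database.
    """
    max_errors = epq * len(true_per_query)
    found = Counter()
    errors = 0
    # Walk down the hit list from most to least significant.
    for e_value, query, is_homolog in sorted(hits, key=lambda h: h[0]):
        if is_homolog:
            found[query] += 1
        else:
            errors += 1
            if errors > max_errors:
                break  # the error budget for this EPQ is exhausted
    raw = sum(found.values()) / sum(true_per_query.values())
    # Linear normalization: average, over queries, of the fraction of each
    # query's own true relations that were found.
    scored = [q for q, t in true_per_query.items() if t > 0]
    linear = sum(found[q] / true_per_query[q] for q in scored) / len(scored)
    return raw, linear
```

Because each query contributes equally to the linearly normalized value, a few very large superfamilies no longer dominate the result as they do in the raw coverage.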
| 3 RESULTS |
|---|
3.1 Doublet substitution correlations
Various trends are evident within the doublet score matrix, as illustrated in Figure 2. Notably, exact conservations, such as AA→AA, AD→AD and DD→DD, generally have positive scores. This is expected because the pairs of sequences used to build the BLOSUM have a variety of intersequence similarity, ranging from most conserved to very diverged. Thus the observation of a conserved residue suggests that the sequences are relatively undiverged, and therefore, that other aligned residues are also more likely than average to be conserved.
Also notable is that many (but far from all) exact swaps, such as DA→AD, are significantly more likely than expected. Possibly, this is because the effect of a deleterious mutation X→Y can sometimes be ameliorated by the occurrence of the corresponding mutation Y→X in the immediate sequence neighborhood. Partial swaps, where only one of the substitution pair is conserved, are also often positive. This might reflect alignment errors in the original dataset. The most highly positive scores (and therefore those events that are most overrepresented in the data relative to uncorrelated substitutions) are associated with the substitutions PC→Cx, i.e. a translocation of a cysteine, replacing a proline. The most relatively uncommon substitutions involve the mutation of one cysteine in the cysteine pair CxxC (second column), a widespread and important motif found, for example, in the thioredoxin family. However, these interesting particular cases are atypical. Most of the doublet substitution matrix is similar to the ET→Ax substitutions displayed in the third column; the majority of the scores are not significantly different from zero, indicating that most possible substitution doublets are essentially uncorrelated.
We can place the above observations on a quantitative footing by considering the intersequence mutual information [Equation (6)], a measure of the correlation strength between aligned homologous sequences. The first order contribution is equal to the average singlet score, which is 0.31 bits per aligned residue for BLOSUM65 (BLOCKS13+). The corresponding average doublet score, the additional information encoded in intersite substitution covariation, is approximately 0.04 bits at modest sequence separations (illustrated in Fig. 3). Thus, the intersite substitution correlations carry relatively little information. These correlations appear to persist to non-local neighbors, which might suggest that the total information from interactions at all sequence separations is substantial. However, Figure 3 also displays the contributions to this total information from various categories of substitution. The largest contribution, and the only contribution to persist above a sequence separation of four residues, represents exactly conserved pairs of residues. This is a rather trivial correlation, which is persistent because all parts of two homologous sequences have the same chronological divergence. All other substitution classes, summing over all sequence separations, contribute no more than 0.1 bits per residue. This is not entirely insignificant, but it is still small compared with the singlet mutual information. Thus non-trivial correlations between substitutions are relatively weak.
3.2 Homology detection
The primary use for pairwise alignment methods is to search databases of previously characterized biological sequences for homologs of the sequence of interest. Therefore, the most powerful methods will perform this task most effectively by assigning true homologs significant statistical scores and assigning unrelated sequences low statistical scores. Our assessment methodology compares database search methods on this criterion.
We compared the doublet alignment algorithm against the standard Smith–Waterman algorithm. To perform a fair test, we converted raw scores to statistical scores for both algorithms using the same length-normalized maximum-likelihood EVD parameter determination method (Bailey and Gribskov, 2002). Optimal parameters for gapping, matrix scaling and distance over which to consider dipeptide correlations were found using the training database described above. Then, the algorithms were evaluated by comparing the relative ability to detect remote homologs within the test dataset, using the parameters optimized on the training dataset (Inset, Fig. 4).
The results of a database search for Smith–Waterman and doublet, using only nearest neighboring dipeptide covariations, are shown in Figure 4a. Both the Smith–Waterman and doublet methods performed remarkably similarly over all error rates and normalization schemes. The linearly normalized coverage at 0.01 EPQ was slightly higher for Smith–Waterman than doublet (Inset, Fig. 4). From this, we conclude that including dipeptide covariation information does not improve remote homology detection and, in fact, slightly degrades performance at this error rate. We also performed the same coverage versus EPQ analysis using only sequences with <30% sequence identity (Fig. 4b), as it was previously reported that dipeptide covariation information may be useful only for detecting these extremely remote evolutionary relationships (Jung and Lee, 2000). Our results, however, show that even at this evolutionary distance, dipeptide covariation scoring does not improve homology detection.
We used Bayesian bootstrap resampling to estimate statistical errors and to determine if the observed coverage difference was statistically significant. We found that a 95% confidence interval for the coverage difference at 0.01 EPQ comfortably contained zero difference. Therefore, we cannot distinguish between the remote homolog detection abilities of Smith–Waterman and doublet.
We also evaluated the effect of including covariation information over larger sequence separations. As can be seen in the inset table of Figure 4, incorporating this additional information into alignment scores actually results in a slow degradation of homology detection efficacy.
| 4 DISCUSSION |
|---|
We have developed, implemented and tested an alignment algorithm, doublet, that generates the optimal pairwise protein sequence alignment under a scoring scheme that includes dipeptide covariation information. Perhaps surprisingly, and in marked contrast to previous reports, we found that using this information provides no benefit to remote homolog detection. The performance of the doublet algorithm for detecting remote homologs is statistically indistinguishable from the standard Smith–Waterman algorithm.
The underlying explanation for this indifference of alignment to dipeptide covariation is that substitution correlations are weak on the average (Figs 2 and 3). Therefore, the average effect of these interactions is insignificant and including covariation in sequence alignment makes very little material difference to remote homology detection.
We might reasonably question if the training data are at fault. Indeed, the slight degradation of homology detection, as more distant correlations are included (Inset table, Fig. 4), does indicate that the doublet substitution matrices contain anomalies, perhaps owing to the training or alignment of the BLOCKS sequences, or perhaps because of the different sampling of sequences included in BLOCKS compared with those included in SCOP. The BLOCKS database that we use to train the doublet substitution matrices contains ungapped alignments, many of shorter length than the average SCOP protein domain. Fukami-Kobayashi et al. (2002) showed that the covariation signal is strongest within single secondary structure elements. The poor performance of doublet, then, may be the result of its applying the covariation model too bluntly across the entire protein sequences when it is only applicable within secondary structure elements. However, we note that the BLOCKS database has been used to derive very effective singlet substitution matrices (Green and Brenner, 2002), and therefore, it is implausible that the substitution signals within the BLOCKS database are substantially erroneous. On the contrary, the observed degradation simply reinforces the idea that neighboring substitutions are weakly correlated, particularly when compared with single substitution correlations, and therefore, the doublet signal is readily degraded by minor anomalies in the data.
Another line of evidence comes from examining the intersite amino acid correlation of single protein sequences (Yčas, 1958; Weiss et al., 2000; Crooks and Brenner, 2004; Crooks et al., 2004). Neighboring amino acids are almost entirely uncorrelated; the nearest neighbor mutual information has been estimated as only 0.006 bits (Crooks and Brenner, 2004). This lack of sequence correlation is consistent with (but does not require) small intersite substitution correlations.
It should be emphasized, however, that the observation of weak average dipeptide covariation does not negate the possibility of strong, interesting covariation in particular instances, such as CP→Cx, or within particular families. Moreover, it is conceivable that covariation information could be used more judiciously, thereby improving alignment results. For example, as previously discussed, one might include doublet-type scoring information only for residue pairs that are likely to be within the same secondary structural element. Similarly, one might examine the covariation of residues that are proximate in the tertiary structure, rather than along the sequence (Rodionov and Johnson, 1994; Lin et al., 2003). However, residues that are proximate in space are also only weakly correlated (Cline et al., 2002; Crooks et al., 2004), and the interresidue mutual information is not improved by foreknowledge of the local structure environment (Crooks and Brenner, 2004; Crooks et al., 2004). Therefore, we suspect that such approaches will also not have dramatic effects on protein sequence alignment.
In conclusion, the ubiquitous assumption that neighboring sites along a protein sequence evolve independently appears to be generally appropriate. This leads to fast, elegant and effective algorithms for protein sequence alignment and homology detection.
| Acknowledgments |
|---|
R.E.G. and G.E.C. jointly conceived and designed the doublet alignment algorithm and co-wrote this paper, with guidance from S.E.B.; G.E.C. was responsible for creating the doublet BLOSUM substitution matrices and R.E.G. for the statistical comparison of doublet to Smith–Waterman. The authors would like to thank Emma Hill, Sandrine Dudoit and Jeff Thorne for their helpful discussions and suggestions. This work was supported by the National Institutes of Health (1-K22-HG00056) and an IBM Shared University Research grant. G.E.C. received funding from the Sloan/DOE postdoctoral fellowship in computational molecular biology. S.E.B. is a Searle Scholar (1-L-110). Funding to pay the Open Access publication charges for this article was provided by the N.I.H.
Conflict of Interest: none declared.
| Footnotes |
|---|
The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.
Received on March 16, 2005; revised on July 28, 2005; accepted on August 4, 2005
| REFERENCES |
|---|
Altschul, S.F. (1991) Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol., 219, 555-565.
Altschul, S.F. (1993) A protein alignment scoring system sensitive at all evolutionary distances. J. Mol. Evol., 36, 290-300.
Altschul, S.F. and Erickson, B.W. (1986) Optimal sequence alignment using affine gap costs. Bull. Math. Biol., 48, 603-616.
Altschul, S.F., et al. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403-410.
Altschul, S.F., et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389-3402.
Bailey, T.L. and Gribskov, M. (2002) Estimating and evaluating the statistics of gapped local-alignment scores. J. Comput. Biol., 9, 575-593.
Bishop, M.J. and Thompson, E.A. (1986) Maximum likelihood alignment of DNA sequences. J. Mol. Biol., 190, 159-165.
Brenner, S.E. (1996) Molecular propinquity: evolutionary and structural relationships of proteins. PhD thesis, Cambridge University.
Brenner, S.E., et al. (1998) Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl Acad. Sci. USA, 95, 6073-6078.
Chandonia, J.-M., et al. (2004) The ASTRAL Compendium in 2004. Nucleic Acids Res., 32, 189-192.
Cline, M.S., et al. (2002) Information-theoretic dissection of pairwise contact potentials. Proteins, 49, 7-14.
Cover, T.M. and Thomas, J.A. (1991) Elements of Information Theory. Wiley, New York.
Crooks, G.E. and Brenner, S.E. (2004) Protein secondary structure: entropy, correlations and prediction. Bioinformatics, 20, 1603-1611.
Crooks, G.E. and Brenner, S.E. (2005) An alternative model of amino acid replacement. Bioinformatics, 21, 975-980.
Crooks, G.E., et al. (2004) Measurements of protein sequence-structure correlations. Proteins, 57, 804-810.
Dayhoff, M.O., et al. (1978) A model of evolutionary change in proteins. Atlas of Protein Sequences and Structure, 5, Suppl 3, 345-352.
Doolittle, R.F. (1992) Reconstructing history with amino acid sequences. Protein Sci., 1, 191-200.
Durbin, R., Eddy, S., Krogh, A. and Mitchison, G. (1998) Biological Sequence Analysis. Cambridge University Press.
Eddy, S.R. (2001) HMMER: Profile hidden Markov models for biological sequence analysis.
Efron, B. (1979) Bootstrap methods: another look at the jackknife. Ann. Stat., 7, 1-26.
Fukami-Kobayashi, K., et al. (2002) Detecting compensatory covariation signals in protein evolution using reconstructed ancestral sequences. J. Mol. Biol., 319, 729-743.
Goldman, N., et al. (1998) Assessing the impact of secondary structure and solvent accessibility on protein evolution. Genetics, 149, 445-458.
Gonnet, G.H., et al. (1994) Analysis of amino-acid substitution during divergent evolution: the 400 by 400 dipeptide substitution matrix. Biochem. Biophys. Res. Comm., 199, 489-496.
Green, R.E. and Brenner, S.E. (2002) Bootstrapping and normalization for enhanced evaluations of pairwise sequence comparison. Proc. IEEE, 90, 1834-1847.
Henikoff, J.G., et al. (2000) Increased coverage of protein families with the blocks database servers. Nucleic Acids Res., 28, 228-230.
Henikoff, S. and Henikoff, J.G. (1992) Amino-acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA, 89, 10915-10919.
Jung, J.S. and Lee, B. (2000) Use of residue pairs in protein sequence-sequence and sequence-structure alignments. Protein Sci., 9, 1576-1588.
Karplus, K., et al. (1998) Hidden Markov models for detecting remote protein homologies. Bioinformatics, 14, 846-856.
Lin, K., et al. (2003) Testing homology with Contact Accepted mutatiOn (CAO). Comput. Biol. Chem., 27, 93-102.
Müller, T., et al. (2002) Estimating amino acid substitution models: a comparison of Dayhoff's estimator, the resolvent approach and a maximum likelihood method. Mol. Biol. Evol., 19, 8-13.
Murzin, A.G., et al. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536-540.
Park, J., et al. (1998) Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J. Mol. Biol., 284, 1201-1210.
Pearson, W.R. and Lipman, D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444-2448.
Price, G.A., et al. (2005) Statistical evaluation of pairwise protein sequence comparison with the Bayesian bootstrap. Bioinformatics, doi:10.1093/bioinformatics/bti627.
Rodionov, M.A. and Johnson, M.S. (1994) Residue-residue contact substitution probabilities derived from aligned three-dimensional structures and the identification of common folds. Protein Sci., 3, 2366-2377.
Rubin, D.B. (1981) The Bayesian bootstrap. Ann. Stat., 9, 130-134.
Sander, C. and Schneider, R. (1991) Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins, 9, 56-68.
Smith, T.F. and Waterman, M.S. (1981) Identification of common molecular subsequences. J. Mol. Biol., 147, 195-197.
Thorne, J.L., et al. (1996) Combining protein evolution and secondary structure. Mol. Biol. Evol., 13, 666-673.
Thorne, J.L., et al. (1991) An evolutionary model for maximum likelihood alignment of DNA sequences. J. Mol. Evol., 33, 114-124.
Thorne, J.L., et al. (1992) Inching toward reality: an improved likelihood model of sequence evolution. J. Mol. Evol., 34, 3-16.
Topham, C.M., et al. (1997) Prediction of the stability of protein mutants based on structural environment-dependent amino acid substitution and propensity tables. Protein Eng., 10, 7-21.
Weiss, O., et al. (2000) Information content of protein sequences. J. Theor. Biol., 206, 379-386.
Yu, Y.K., et al. (2003) The compositional adjustment of amino acid substitution matrices. Proc. Natl Acad. Sci. USA, 100, 15688-15693.
Yčas, M. (1958) The protein text. Symposium on Information Theory in Biology, Pergamon Press, New York, pp. 70-101.
Zachariah, M.A., et al. (2005) A generalized affine gap model significantly improves protein sequence alignment accuracy. Proteins, 58, 329-338.