Bioinformatics Advance Access originally published online on March 4, 2005
Bioinformatics 2005 21(10):2315-2321; doi:10.1093/bioinformatics/bti347
REVCOM: a robust Bayesian method for evolutionary rate estimation
Molsoft LLC 3366 North Torrey Pines Court, Suite 300, La Jolla, CA 92037, USA
*To whom correspondence should be addressed at Computer Science and Mathematics Division, Oak Ridge National Laboratory, PO Box 2008, MS6173, Oak Ridge, TN 37831, USA
| Abstract |
|---|
Motivation: Evolutionary conservation estimated from a multiple sequence alignment is a powerful indicator of the functional significance of a residue and helps to predict active sites, ligand binding sites, and protein interaction interfaces. Many algorithms that calculate conservation work well, provided an accurate and balanced alignment is used. However, such a strong dependence on the alignment makes the results highly variable. We attempted to improve the conservation prediction algorithm by making it more robust and less sensitive to (1) local alignment errors, (2) overrepresentation of sequences in some branches and (3) occasional presence of unrelated sequences.
Results: A novel method is presented for robust constrained Bayesian estimation of evolutionary rates that avoids overfitting independent rates and satisfies the above requirements. The method is evaluated and compared with an entropy-based conservation measure on a set of 1494 protein interfaces. We demonstrated that 62% of the analyzed protein interfaces are more conserved than the remaining surface at the 5% significance level. A consistent method to incorporate alignment reliability is proposed and demonstrated to reduce arbitrary variation of calculated rates upon inclusion of distantly related or unrelated sequences into the alignment.
Contact: bordner{at}ornl.gov
Supplementary information: The protein–protein interface dataset, multiple sequence alignments and corresponding phylogenetic trees are available at http://www.molsoft.com/~bordner/REVCOM/
| 1 INTRODUCTION |
|---|
Evolutionary conservation varies among amino acid sites owing to the differing degrees of functional constraint on them (Fitch and Margoliash, 1967; Uzzell and Corbin, 1971; Holmquist et al., 1983). Sites that are important for the protein's tertiary structure and folding, enzymatic activity, ligand binding or interaction with other proteins are generally more conserved. The greater conservation of residues in binding sites compared with other surface residues has been exploited by some methods that predict small ligand or protein–protein interfaces by mapping residue conservation values to the query protein surface (Landgraf et al., 2001; Lichtarge and Sowa, 2002; Pupko et al., 2002). The identification of conserved residues may be useful for identifying functionally important residues even in the absence of structural information. Binding site prediction results have applications in structure-based drug design as well as protein functional assignment, and suggest important residues for mutation analysis studies.
Models that account for the evolutionary relationship of the sequences through a phylogenetic tree should be less sensitive to the choice of sequences than methods based on residue frequencies in a corresponding multiple sequence alignment column. In particular, a large number of closely related sequences may give erroneously high residue conservation. Evolutionary tracing is one such method that utilizes a phylogenetic tree to identify residues that are identically conserved in a subtree (Lichtarge et al., 1996). The maximum tree depth at which a residue remains unchanged is used to rank the degree of conservation. This analysis was later modified to incorporate a quantitative model of residue substitutions (Landgraf et al., 1999). Another algorithm, ConSurf, used a maximum parsimony tree to calculate a site conservation score as the number of substitutions weighted by their physicochemical distance (Armon et al., 2001). A study by Pupko et al. (2002) describes a method for a maximum-likelihood estimation of independent evolutionary rates assuming a homogeneous Markov model of residue substitution. The maximum-likelihood method (Neyman, 1971; Felsenstein, 1981) is not subject to errors present in the maximum parsimony principle, including systematic overestimation of conservation due to neglect of backward and parallel substitutions and no dependence on branch lengths. It also has the advantage that a fast recursive algorithm may be used to calculate likelihoods for a given tree (Felsenstein, 1981).
However, a straightforward estimation of independent site rates suffers from overfitting since there are as many parameters as sites (Felsenstein, 2001). One solution to this problem is to assume that the rates come from a probability distribution, usually assumed to be a gamma distribution. Yang (1993, 1994) used a discrete approximation to the gamma distribution to incorporate rate variation in phylogeny determination using the maximum-likelihood method. Although there is no evolutionary process that supports it, a gamma rate distribution is often used since it is defined on the correct interval (positive rates). It also leads to a negative binomial distribution for total substitution frequencies in a simple Poisson model of site substitution probabilities. An early paper by Uzzell and Corbin (1971) supported a gamma rate distribution by showing that the negative binomial distribution fits sequence data better than a Poisson distribution. On account of this evidence and mathematical convenience we also use the gamma distribution, but as a Bayesian prior rates distribution.
A lack of robustness, or excessive sensitivity of a conservation calculation to the input multiple sequence alignment, is a general problem for the existing methods. This sensitivity may be owing to overrepresentation of a particular subfamily or to alignment errors. While distantly related sequences have the potential to provide strong evidence of residue conservation, they also generally introduce more noise. Three sources of error at large evolutionary distances may affect the site rate calculation: (1) local alignment errors due to ambiguities in divergent sequence segments, (2) alignment to a non-homologous sequence mistakenly picked out by a database search and (3) uncertainties in the residue substitution matrix owing to alignment errors or extrapolation to large distances. All these errors are expected to become more probable at larger distances and are neglected in most conservation prediction methods. This is particularly important for automatically generated alignments, which may include non-homologous sequences or alignment errors for distantly related sequences. While iterative alignment methods, such as PSI-BLAST, have high sensitivity in detecting remote homologs (Park et al., 1998), they are particularly susceptible to inclusion of non-homologous sequences owing to overly permissive parameters or profile drift (Sjölander, 2004). We introduce the Robust EVolutionary COnservation Measure (REVCOM), which consistently incorporates the alignment reliability into the likelihood calculation in order to render the method more robust to the inclusion of distantly related sequences. The resulting likelihood for each site may be interpreted as a sum of likelihoods for all possible trees with a subset of sequences removed, weighted by the probability that the excluded sequences are unreliable and the included ones are reliable.
A large non-redundant dataset of 1494 protein–protein interfaces with available X-ray crystal structures was also compiled in order to test the conservation algorithm. Since protein–protein interfaces are in general larger than ligand binding interfaces, the variance in their conservation statistics is expected to be lower. We compared the REVCOM method with a simple entropy-based method that does not use evolutionary trees. Both the ability of the methods to detect the higher conservation in protein–protein interfaces and their stability upon inclusion of distantly related sequences in the alignment were evaluated.
| 2 SYSTEMS AND METHODS |
|---|
2.1 REVCOM method overview
First, pairwise evolutionary distances and a phylogenetic tree are estimated from a multiple sequence alignment without assuming rate heterogeneity. Next, the site rate distribution is inferred using a maximum-likelihood estimate that accounts for alignment reliability. Finally, the individual site rates are calculated using this distribution as a Bayesian prior distribution.
2.2 Residue substitution model
Amino acid substitutions at sites are assumed to follow independent time-homogeneous Markov processes, with the Jones–Taylor–Thornton (JTT) matrix used to calculate substitution probabilities (Jones et al., 1992). The substitution matrix for a given evolutionary distance was extrapolated from the matrix given in this reference for a distance corresponding to one percent accepted point mutation (one PAM).
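For a time-homogeneous Markov process, extrapolating a one-step matrix to a larger distance amounts to raising it to a power. The following is a minimal sketch of that idea, assuming integer PAM distances and a hypothetical 3-state matrix `ONE_PAM_TOY` that merely stands in for the 20 × 20 JTT matrix; the function names are illustrative and not taken from the REVCOM implementation.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, t):
    """Return m raised to the non-negative integer power t (repeated squaring)."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = [row[:] for row in m]
    while t > 0:
        if t & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        t >>= 1
    return result


# Hypothetical one-step matrix: each row sums to 1 and every state has a
# 1% chance of changing per step (an assumption, not the JTT values).
ONE_PAM_TOY = [
    [0.990, 0.006, 0.004],
    [0.003, 0.990, 0.007],
    [0.005, 0.005, 0.990],
]

# Substitution probabilities at a distance of 250 steps.
p250 = mat_pow(ONE_PAM_TOY, 250)
```

Non-integer distances would need a matrix exponential or an eigendecomposition of the rate matrix instead of integer powers.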
2.3 Multiple sequence alignments
Multiple sequence alignments were generated automatically for each protein sequence in the dataset. First, the BLAST program (Altschul et al., 1990) with an E-value cutoff of 0.1 was used to collect similar sequences from the NCBI nr database. This relatively permissive cutoff was chosen such that distantly related homologous sequences that make a disproportionately large contribution to the evolutionary rates estimation are included. Sequences with >90% identity to another sequence in the set were then iteratively removed. The ClustalW program (Thompson et al., 1994) with default alignment parameters (Gonnet 250 scoring matrix with gap opening = 10 and gap extension = 0.1) was used to align the remaining sequences.
2.4 Phylogenetic trees
Phylogenetic trees were then generated for each alignment using the neighbor-joining algorithm (Saitou and Nei, 1987) as implemented in the Quicktree (Howe et al., 2002) program. Thus the trees for every protein in a complex were determined independently. A weighted least squares fit using pairwise PAM distances between sequences was then used to recalculate the branch lengths. The PAM distance t was calculated by inverting the expression for the expected fraction of identical residues q(t)
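| $q(t) = \sum_{i} \pi_i \, P_{ii}(t)$ | (1) |
where $\pi_i$ is the equilibrium frequency of amino acid $i$ and $P_{ii}(t)$ is the probability, under the substitution model of Section 2.2, that residue $i$ is unchanged at PAM distance $t$. The variance $\sigma_t^2$ in the distance was approximated from the binomial variance using the delta method
| $\sigma_t^2 \approx \dfrac{q(t)\,[1 - q(t)]}{N\,[q'(t)]^2}$ | (2) |
where $N$ is the number of aligned residue pairs and $q'(t)$ is the derivative of $q$ with respect to $t$. With $\mathbf{d}$ the vector of pairwise distances, $\mathbf{C}$ the matrix whose entries are 1 if a branch lies on the path between a sequence pair and 0 otherwise, and $\mathbf{V}$ the diagonal matrix of the variances from Equation (2), the weighted least squares fit minimizes
| $(\mathbf{d} - \mathbf{C}\mathbf{b})^{T}\,\mathbf{V}^{-1}\,(\mathbf{d} - \mathbf{C}\mathbf{b})$ | (3) |
and the estimates of the branch lengths are $\mathbf{b} = (\mathbf{C}^{T}\mathbf{V}^{-1}\mathbf{C})^{-1}\mathbf{C}^{T}\mathbf{V}^{-1}\mathbf{d}$.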
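The weighted least squares recalculation of branch lengths can be illustrated on a four-sequence tree. This is only a sketch under stated assumptions: the variance matrix is taken to be diagonal, the tree ((A,B),(C,D)) and all numbers are hypothetical, and a small Gaussian elimination replaces the conjugate gradient solver mentioned in the Implementation section.

```python
def solve(a, y):
    """Solve a x = y by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [y[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def wls_branch_lengths(c, variances, d):
    """Solve the normal equations (C^T V^-1 C) b = C^T V^-1 d, V = diag(variances)."""
    n_pairs, n_br = len(c), len(c[0])
    w = [1.0 / v for v in variances]
    normal = [[sum(c[k][i] * w[k] * c[k][j] for k in range(n_pairs))
               for j in range(n_br)] for i in range(n_br)]
    rhs = [sum(c[k][i] * w[k] * d[k] for k in range(n_pairs))
           for i in range(n_br)]
    return solve(normal, rhs)


# Rows: pairs AB, AC, AD, BC, BD, CD. Columns: branches to A, B, C, D and
# the internal branch. An entry is 1 if the branch lies on the pair's path.
C = [
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 1],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 0, 1],
    [0, 1, 0, 1, 1],
    [0, 0, 1, 1, 0],
]
true_b = [0.10, 0.20, 0.30, 0.40, 0.05]           # hypothetical branch lengths
d = [sum(cij * bj for cij, bj in zip(row, true_b)) for row in C]
variances = [0.01, 0.02, 0.02, 0.02, 0.02, 0.03]  # hypothetical distance variances
b_hat = wls_branch_lengths(C, variances, d)
```

Because the toy distances are exactly additive, the fit recovers `true_b` whatever the weights; with noisy distances the weights decide which pairs dominate the fit.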
2.5 Simple residue conservation model
We compared our rate prediction model with a simple entropy-based model in which the site conservation is calculated for each alignment column using the Shannon entropy S
| $S = -\sum_{a} f_a \log f_a$ | (4) |
where $f_a$ is the fraction of residues of type $a$ in the column.
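As a concrete illustration of the entropy measure, the sketch below computes the Shannon entropy of a single alignment column. The natural logarithm and the decision to skip gaps and non-standard symbols are assumptions made for the example; `column_entropy` is a hypothetical helper name.

```python
import math
from collections import Counter

STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def column_entropy(column):
    """Shannon entropy of the standard residues in one alignment column.

    Returns None when the column holds no standard residue at all.
    """
    residues = [c for c in column.upper() if c in STANDARD_RESIDUES]
    if not residues:
        return None
    n = len(residues)
    counts = Counter(residues)
    return -sum((k / n) * math.log(k / n) for k in counts.values())


conserved = column_entropy("LLLLLL")   # a fully conserved column
variable = column_entropy("LIVMFA")    # six different residues
```

A fully conserved column gives zero entropy, and a column with six equally frequent residues gives log 6, so lower values mean higher conservation.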
2.6 Protein–protein interface datasets
A dataset of protein intermolecular interfaces in complexes was compiled from biological unit information in the Protein Data Bank (PDB) (Berman et al., 2000) archive using the ICM scripting language (Molsoft, LLC, 2004). Only the first biological unit was included, and mmCIF format files were utilized since they contain biological unit information for all entries. Only X-ray structures of proteins with at least 20 residues were considered, and pairs of interacting proteins were clustered at 30% sequence identity. Complexes whose PDB information conflicted with their Swiss-Prot (Apweiler et al., 2004) subunit annotations were corrected or removed after consulting the literature. Alignments of the protein sequences in each cluster to a representative sequence were then used to compare interface residues. Two interfaces on a protein were considered distinct if their residue sets, referred to the representative sequence, overlapped by <20%. The highest quality structure, with the fewest missing coordinates and the highest resolution, for each unique protein–protein interface was then included in the dataset. Interfaces including immune system proteins that are highly polymorphic or undergo somatic mutation, namely MHC, T-cell receptors and antibodies, were excluded. Finally, only interfaces containing at least 10 residues were included in the set because of the large variance of the residue-based statistics as well as difficulties in validating the interface prediction for smaller interfaces. The statistical analyses were performed for each distinct protein with its interface residues determined from the structure of the complex. The resulting set had 1494 protein–protein interfaces (1143 unordered pairs), of which 518 were in homodimers, 114 were in heterodimers and the remaining 862 were in multimers.
2.7 Statistical tests
The Mann–Whitney–Wilcoxon rank sum test was used to compare the evolutionary rates for the interface and non-interface regions. This non-parametric test was used since the underlying distribution of the posterior rates is unknown and is not a normal distribution.
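A rank sum comparison of this kind can be sketched with the standard library alone. The example assumes a one-sided alternative (interface rates tend to be lower, i.e. more conserved), a normal approximation to the null distribution and no tie correction in the variance; the function name and the sample rates are hypothetical.

```python
import math


def rank_sum_test(interface, other):
    """One-sided rank sum test that `interface` values tend to be smaller.

    Returns (U statistic of the interface sample, approximate P-value).
    """
    pooled = sorted((v, g) for g, vals in ((0, interface), (1, other))
                    for v in vals)
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        midrank = (i + j) / 2 + 1          # tied values share the mean rank
        for k in range(i, j + 1):
            ranks[k] = midrank
        i = j + 1
    n1, n2 = len(interface), len(other)
    r1 = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u1 = r1 - n1 * (n1 + 1) / 2
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean) / sd
    p_value = 0.5 * math.erfc(-z / math.sqrt(2))   # P(Z <= z)
    return u1, p_value


# Hypothetical site rates: interface residues evolve more slowly.
interface_rates = [0.1, 0.2, 0.3, 0.4, 0.5]
surface_rates = [1.0, 1.1, 1.2, 1.3, 1.4]
u_stat, p_val = rank_sum_test(interface_rates, surface_rates)
```

For samples as small as this toy one, an exact null distribution would be preferable to the normal approximation.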
| 3 ALGORITHM |
|---|
3.1 Evolutionary site rates estimation
The REVCOM method calculates evolutionary rates for sites in a reference protein sequence using the following steps:
- Perform a BLAST search of the sequence database using the reference sequence as a query.
- Remove redundant sequences (>90% identity).
- Construct a multiple alignment of all sequences.
- Calculate all pairwise PAM distances between sequences from fractional identities using Equation (1).
- Determine the phylogenetic tree from pairwise distances using neighbor-joining algorithm.
- Recalculate branch lengths using weighted least squares method.
- Find the maximum-likelihood estimate of the shape parameter $\alpha$ in the prior gamma distribution.
- Calculate site rates by averaging over the posterior rates distribution.
P-values from the BLAST search in the first step are used as estimates of alignment reliability in the final two steps.
3.2 Maximum-likelihood estimate of the gamma distribution parameter
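The likelihood that is maximized for the $\alpha$ parameter estimate is
| $L(\alpha) = \prod_{m} \int_{0}^{\infty} g(\alpha, r)\, L_m(\mathbf{x}_m \mid r)\, dr$ | (5) |
where $g(\alpha, r)$ is the gamma distribution with mean 1
| $g(\alpha, r) = \dfrac{\alpha^{\alpha}}{\Gamma(\alpha)}\, r^{\alpha - 1} e^{-\alpha r}$ | (6) |
$L_m(\mathbf{x}_m \mid r)$ is the likelihood for site $m$, given the rate $r$, and $\mathbf{x}_m$ is a vector of amino acid types in column $m$ of the multiple sequence alignment. Non-standard amino acid symbols, e.g. X, and gaps are excluded from the calculation by removing the corresponding branches in the tree for that site.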
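Integrating a site likelihood over a gamma prior with mean 1 can be checked numerically on a case with a known answer. In the sketch below the shape parameter is called `alpha` (an assumed name), a plain midpoint rule stands in for the Gauss–Laguerre quadrature described in the Implementation section, and the site likelihood `exp(-c * r)` is a hypothetical stand-in whose integral has the closed form `(alpha / (alpha + c)) ** alpha`.

```python
import math


def gamma_mean1_pdf(r, alpha):
    """Gamma density with shape alpha and rate alpha, so that its mean is 1."""
    if r <= 0.0:
        return 0.0
    log_pdf = (alpha * math.log(alpha) + (alpha - 1.0) * math.log(r)
               - alpha * r - math.lgamma(alpha))
    return math.exp(log_pdf)


def marginal_site_likelihood(site_likelihood, alpha, r_max=40.0, steps=40000):
    """Midpoint-rule estimate of the integral of prior(r) * L(r) over r > 0."""
    h = r_max / steps
    total = 0.0
    for i in range(steps):
        r = (i + 0.5) * h
        total += gamma_mean1_pdf(r, alpha) * site_likelihood(r)
    return h * total


def log_likelihood(alpha, site_likelihoods):
    """Sum of log marginal likelihoods over sites, to be maximized over alpha."""
    return sum(math.log(marginal_site_likelihood(f, alpha))
               for f in site_likelihoods)


# Toy site likelihood exp(-2 r); the exact integral is (2 / (2 + 2)) ** 2 = 0.25.
approx = marginal_site_likelihood(lambda r: math.exp(-2.0 * r), alpha=2.0)
```

The midpoint rule loses accuracy for shape parameters below 1, where the density is singular at zero, which is one reason a Laguerre-type quadrature is the more natural choice in practice.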
3.3 Accounting for alignment reliability
One new feature in the REVCOM method is the use of an alignment reliability probability to prevent distantly related or non-homologous sequences from inordinately affecting the likelihood in Equation (5) and consequently the predicted rates. For each sequence, the method uses the probability that it is misaligned with the reference sequence, i.e. the sequence for which the rates are to be calculated. The misalignment probabilities for different sequences are assumed to be independent, which allows them to be easily incorporated into the maximum-likelihood calculation.
Misalignment probabilities from progressive alignment methods such as ClustalW are generally correlated since early alignment errors are retained and early accurate profiles yield subsequent alignments of greater accuracy. Misalignment probabilities in adjacent columns are also expected to be correlated, both, because errors are more likely to occur in segments corresponding to solvent-exposed loops than in segments corresponding to core helices or strands, and because errors in, e.g. gap placement, will adversely affect the alignment of the nearby columns. Both correlations yield more extreme misalignment probabilities, i.e. low probabilities are lower and high probabilities are higher; however, these deviations from independence are not expected to cause large errors in our method.
BLAST P-values, $1 - \exp(-E_i)$, where $E_i$ is the E-value from the database search with the reference sequence of the protein onto which the rates will be projected, are used as a rough global estimate of the misalignment probabilities. This underestimates the misalignment probabilities since it only explicitly accounts for error (2) mentioned above; however, this is a serious global alignment error whose statistical distribution is known (Altschul and Gish, 1996). Our algorithm can be used with more accurate local, or column-dependent, misalignment probabilities that account for other sources of alignment error since the probabilities are defined for each individual site. This is an area of future investigation since we are unaware of any existing method to calculate such probabilities. Previously developed methods for local alignment reliability measures (Vingron and Argos, 1990; Chao et al., 1993; Mevissen and Vingron, 1996; Abagyan and Batalov, 1997; Schlosshauer and Ohlsson, 2002; Löytynoja and Milinkovitch, 2003) may provide useful starting points.
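The standard relation between a BLAST E-value and the corresponding P-value, P = 1 − exp(−E), is easy to evaluate directly. The helper name below is hypothetical, and `math.expm1` is used only to keep precision for very small E-values.

```python
import math


def misalignment_probability(e_value):
    """Global misalignment estimate from a BLAST E-value: 1 - exp(-E)."""
    if e_value < 0.0:
        raise ValueError("E-values are non-negative")
    return -math.expm1(-e_value)


# Close homologs contribute almost nothing; hits near the cutoffs used in
# this study (0.1 and 10.0) are down-weighted noticeably or almost entirely.
p_strong = misalignment_probability(1e-30)
p_cutoff = misalignment_probability(0.1)
p_weak = misalignment_probability(10.0)
```

For small E-values the P-value is nearly equal to E itself, so only sequences close to the search cutoff are affected appreciably.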
The site likelihood is calculated recursively using a modification of the method of Felsenstein (1981).
![]() | (7) |
![]() | (8) |
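One recursion that reproduces the interpretation given in the Introduction (a sum over trees with subsets of sequences removed, weighted by the misalignment probabilities) is sketched below. It is not claimed to be the paper's Equations (7) and (8): it assumes a hypothetical two-state symmetric substitution model in place of JTT, and it treats a removed leaf as missing data, i.e. a conditional likelihood of 1 for every state.

```python
import math

N_STATES = 2
EQUILIBRIUM = [0.5, 0.5]


def transition(a, b, t):
    """Two-state symmetric model: probability of going from a to b in time t."""
    e = math.exp(-2.0 * t)
    return 0.5 + 0.5 * e if a == b else 0.5 - 0.5 * e


def leaf(state, p_mis, length):
    """Leaf with observed state, misalignment probability and branch length."""
    return {"state": state, "p_mis": p_mis, "length": length, "children": None}


def internal(children, length=0.0):
    """Internal node; `length` is the branch leading to its parent."""
    return {"children": children, "length": length}


def partial(node, rate):
    """Conditional likelihoods of the data below `node` for each state."""
    if node["children"] is None:
        p = node["p_mis"]
        # Reliable with probability 1 - p (indicator of the observed state),
        # unreliable with probability p (missing data, likelihood 1).
        return [(1.0 - p) * (1.0 if a == node["state"] else 0.0) + p
                for a in range(N_STATES)]
    out = [1.0] * N_STATES
    for child in node["children"]:
        below = partial(child, rate)
        t = rate * child["length"]
        for a in range(N_STATES):
            out[a] *= sum(transition(a, b, t) * below[b]
                          for b in range(N_STATES))
    return out


def site_likelihood(root, rate):
    """Likelihood of one alignment column at the given relative rate."""
    return sum(EQUILIBRIUM[a] * x for a, x in enumerate(partial(root, rate)))


seq_a = leaf(0, 0.01, 0.3)
seq_b = leaf(0, 0.05, 0.2)
seq_c = leaf(1, 0.60, 1.5)   # distant hit with a high misalignment probability
tree = internal([internal([seq_a, seq_b], 0.1), seq_c])
lik = site_likelihood(tree, rate=1.0)
```

Because the likelihood is linear in each leaf vector, mixing the observed state with missing data at every leaf gives the same value as summing the likelihoods of all pruned trees with their probability weights, while costing only a single pass over the tree.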
3.4 Bayesian rates estimate
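The site rate distribution with the estimated parameter $\hat{\alpha}$ may then be used as a prior distribution to calculate the posterior rate distribution using Bayes formula
| $p(r \mid \mathbf{x}_m) = \dfrac{g(\hat{\alpha}, r)\, L_m(\mathbf{x}_m \mid r)}{Z_m}$ | (9) |
where $g(\hat{\alpha}, r)$ is the gamma prior, $L_m(\mathbf{x}_m \mid r)$ is the likelihood for site $m$ at rate $r$ and the normalization is
| $Z_m = \int_{0}^{\infty} g(\hat{\alpha}, r')\, L_m(\mathbf{x}_m \mid r')\, dr'$ | (10) |
The site rate, $\bar{r}_m$, is then calculated as the average of this posterior distribution
| $\bar{r}_m = \int_{0}^{\infty} r\, p(r \mid \mathbf{x}_m)\, dr$ | (11) |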
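The posterior averaging step can be checked on a toy case with a conjugate answer. The sketch assumes a hypothetical Poisson site likelihood (k substitutions observed over total tree length T), for which a gamma prior with shape and rate `alpha` gives a posterior mean of (alpha + k) / (alpha + T); a midpoint rule again stands in for the quadrature used in the actual implementation.

```python
import math


def gamma_mean1_pdf(r, alpha):
    """Gamma density with shape alpha and rate alpha, so that its mean is 1."""
    if r <= 0.0:
        return 0.0
    log_pdf = (alpha * math.log(alpha) + (alpha - 1.0) * math.log(r)
               - alpha * r - math.lgamma(alpha))
    return math.exp(log_pdf)


def posterior_mean_rate(site_likelihood, alpha, r_max=40.0, steps=40000):
    """Mean of the posterior proportional to prior(r) * L(r), by midpoint rule."""
    h = r_max / steps
    numerator = 0.0
    normalization = 0.0
    for i in range(steps):
        r = (i + 0.5) * h
        weight = gamma_mean1_pdf(r, alpha) * site_likelihood(r)
        numerator += r * weight
        normalization += weight
    return numerator / normalization   # the step size h cancels


def poisson_site_likelihood(k, tree_length):
    """Toy likelihood: k substitutions on a tree of the given total length."""
    def likelihood(r):
        mu = r * tree_length
        return math.exp(-mu) * mu ** k / math.factorial(k)
    return likelihood


# Three substitutions on a tree of total length 4, prior shape 2:
# the conjugate result is (2 + 3) / (2 + 4) = 5/6.
rate = posterior_mean_rate(poisson_site_likelihood(3, 4.0), alpha=2.0)
```

The prior pulls the raw estimate k / T = 0.75 toward the prior mean of 1, which is the shrinkage that protects against overfitting independent rates.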
We note that our method uses an empirical Bayes approach since the parameter $\alpha$ in the prior rates distribution is estimated first. This approach has been used by Yang and co-workers to calculate DNA evolutionary rates using maximum-likelihood estimates of additional model parameters, including nucleotide substitution rates and the tree topologies (Yang and Wang, 1995), as well as to calculate non-synonymous/synonymous DNA substitution rate ratios (Yang et al., 2000).
| 4 IMPLEMENTATION |
|---|
Several numerical approximations were used to speed up the calculation. First, PAM distances were estimated using linear interpolation based on distances that were precalculated using Equation (1) for discrete fractional identity values. A conjugate gradient iterative method was used to solve the linear system of equations resulting from the least squares fit of PAM distances. Also, components of the JTT substitution matrix, which are continuous functions of the evolutionary PAM distance, were approximated by Chebyshev polynomials for fast calculation (Pupko and Graur, 2002). Finally, the integral in Equation (5) for the total likelihood may be efficiently calculated using Gaussian quadrature based on Laguerre polynomials (Press et al., 1992), as discussed in the study by Felsenstein (2001).
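The first of these approximations, looking up a precalculated distance table and interpolating linearly, can be sketched as follows. The identity-to-distance relation `exact_distance` is a hypothetical closed form chosen so the interpolation error can be checked; in the method itself the table would be filled by inverting Equation (1).

```python
import math
from bisect import bisect_right


def exact_distance(q):
    """Hypothetical inverse of an identity curve q(t) = 0.05 + 0.95 exp(-t/100)."""
    return -100.0 * math.log((q - 0.05) / 0.95)


# Precalculated table on a grid of fractional identities 0.10, 0.11, ..., 1.00.
GRID_Q = [i / 100 for i in range(10, 101)]
GRID_T = [exact_distance(q) for q in GRID_Q]


def interpolated_distance(q):
    """Distance for fractional identity q by linear interpolation in the table."""
    if not GRID_Q[0] <= q <= GRID_Q[-1]:
        raise ValueError("fractional identity outside the tabulated range")
    i = min(bisect_right(GRID_Q, q), len(GRID_Q) - 1)
    q0, q1 = GRID_Q[i - 1], GRID_Q[i]
    t0, t1 = GRID_T[i - 1], GRID_T[i]
    return t0 + (t1 - t0) * (q - q0) / (q1 - q0)


t_fast = interpolated_distance(0.555)
t_exact = exact_distance(0.555)
```

The table costs one inversion per grid point up front, after which every pairwise distance is a binary search plus one multiplication.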
| 5 DISCUSSION |
|---|
5.1 Comparison with entropy based conservation measure
The site rates calculated using the REVCOM method were compared with the simple entropy conservation measure by examining their ability to detect statistically significant differences in evolutionary conservation between protein–protein interface residues and non-interface surface residues. Statistical tests with an alternative hypothesis of higher residue conservation in the interface were performed at three different significance levels, 5%, 1% and 0.1%, using both conservation measures. All interfaces in the dataset were analyzed. The number of proteins with significantly higher conservation in their interaction interfaces as well as the median P-values are given in Table 1 for each method. It is apparent that the REVCOM method discriminates conservation differences better than the entropy measure since it detects significant conservation differences in more interfaces at all significance levels. Overall 62% of the interfaces have significantly higher residue conservation than the non-interface surface residues at the 5% level.
5.2 Robustness to inclusion of distantly related sequences
Site rate predictions may be compared with predictions using alignments containing additional distantly related or non-homologous sequences in order to verify that including misalignment probabilities renders the method more robust. First, the NCBI nr database was searched with two different BLAST E-value cutoffs, 0.1 and 10.0, using sequences for each of the 1494 proteins in the dataset as queries. Next, rates were calculated for the 820 proteins for which additional sequences were found at the higher E-value cutoff. The average absolute deviation in rates was only 0.23 for the REVCOM method as compared with 0.33 for the same method without misalignment probabilities. A histogram of the differences in the rates as well as in the entropy measures with an expansion of the multiple alignment is shown in Figure 3. The greater robustness from accounting for alignment reliability is apparent from the histogram of the REVCOM method, which is more strongly peaked about zero rate difference. The histogram also shows that expansion of the multiple alignment causes a large systematic increase in the entropy conservation measure whereas the differences for the REVCOM method are more symmetrical about zero.
Next, we illustrate with a specific example how the misalignment probabilities render the method more robust to the inclusion of non-homologous sequences. Site rates are calculated for the SH2 domain using a reliable alignment and an alignment with additional non-homologous sequences, and the two sets of rates are compared. The Pfam seed alignment (Bateman et al., 2002) with 58 sequences is used for the reliable alignment; for the test alignment, 23 (about 40%) random non-homologous sequences are added and aligned to a profile of the original sequences using ClustalW (Thompson et al., 1994). The median absolute site rate difference between the reliable and test alignments is 0.33 without including alignment reliability and only 0.065 when it is included. This is because the non-homologous sequences are decoupled in the site likelihood calculation when misalignment probabilities are accounted for.
5.3 Conservation in protein–protein interfaces
It is important to compare conservation of interface residues with that of other surface residues rather than with all protein residues. Because surface residues are less conserved than buried residues, restricting the comparison to the surface yields a greater contrast. In fact, one study compared the evolutionary conservation of protein–protein interface residues with the entire protein sequence and found little difference (Grishin and Phillips, 1994). However, other studies, which compared interface residues with surface residues, found significant differences in conservation (Valdar and Thornton, 2001; Caffrey et al., 2004). Our results corroborate this observation but with a much larger dataset than those used in these earlier papers.
The site rates may be displayed on the protein surface to identify protein–protein interaction interfaces and active sites. An example of Escherichia coli malate dehydrogenase is shown in Figure 4. This homodimeric enzyme uses NAD as a cofactor to reversibly oxidize malate to oxaloacetate. It is an essential component of the citric acid cycle with orthologs in both prokaryotes and eukaryotes. The figure shows that residues in a surface region extending from the dimer interface to the cofactor and substrate binding sites have lower rates, i.e. higher conservation, in general than the remaining surface residues. The lower evolutionary rates for the interface are also evident from the low P-value of $5.2 \times 10^{-8}$ for the statistic comparing interface and non-interface residue rates.
Although the majority of protein–protein interfaces in our dataset are more conserved than the remainder of the surface, Caffrey et al. (2004) concluded that conservation alone is insufficient to predict the interfaces when the surface is divided into small patches. Extending their analysis to a larger dataset or using different conservation measures, such as the one presented here, may yield better results. However, including physicochemical, geometric or residue distribution properties, such as those considered by Jones and Thornton (1997) in their patch analysis of protein–protein interfaces, should further improve the prediction performance. Thus an accurate protein–protein interface prediction method should not only account for residue conservation but also include other discriminating properties.
5.4 Future directions
There are several possible extensions of the REVCOM method presented here. A local or column-dependent alignment reliability measure (Vingron and Argos, 1990; Chao et al., 1993; Mevissen and Vingron, 1996; Abagyan and Batalov, 1997; Schlosshauer and Ohlsson, 2002; Löytynoja and Milinkovitch, 2003) may be used for calculating the misalignment probabilities instead of the global measure. However, the most appropriate measure, not described in these references, would be the probability of a local misalignment error between a given sequence and the reference sequence in the context of the multiple sequence alignment. Since misalignment probabilities are required by the method described in this paper, a local alignment reliability index must be converted into a probability, e.g. by calibrating it against a structural alignment database (Mevissen and Vingron, 1996). It would also be interesting to investigate the use of different prior rates distributions. The distribution should not have many parameters because estimation of the gamma distribution parameter is currently the most computationally intensive step in the rates calculation. Finally, since there is evidence that alignment gaps are less frequent in protein–protein interfaces, at least for permanent dimers (Caffrey et al., 2004), accounting for gaps in the stochastic evolutionary process, rather than treating them as missing data, may result in more accurate site rates.
It would be informative to divide the interface dataset into permanent and transient protein–protein interactions to investigate the differences in interface residue conservation. Caffrey et al. (2004) found that central interface residues were more conserved than peripheral interface residues in permanent interfaces but not transient ones, but their dataset was considerably smaller than the one used here. The difficulty is in automatically assigning binding affinities to a large number of protein interactions. Another issue that has not been thoroughly investigated is whether residues that contribute the most to the binding free energy are more conserved than other interface residues. Previous studies, using alanine-scanning mutagenesis results, have shown that a small number of interface residues contribute the majority of the binding energy (Clackson and Wells, 1995; Bogan and Thorn, 1998). Hu et al. (2000) identified polar residues that are conserved in a structural alignment of protein–protein interfaces and speculated, based on the agreement with alanine scanning data, that these may be hot spot residues that strongly influence binding.
| 6 CONCLUSION |
|---|
We have presented a method, REVCOM, to calculate residue conservation that accounts for the evolutionary relationships of the sequences and avoids overfitting by using a Bayesian framework. This method also accounts for alignment errors by summing the likelihoods of truncated phylogenetic trees, each weighted by the probability that the missing sequences were unreliable and the included ones were reliable. The resulting conservation measure was shown to discriminate conservation differences for protein–protein interfaces better than a column entropy measure and to prevent distantly related sequences from overly affecting the rates. The evolutionary rates provided by the REVCOM method are useful for identifying surface residues in enzyme active sites and small ligand or protein binding sites.
| SUPPLEMENTARY DATA |
|---|
Supplementary data for this paper are available on Bioinformatics online.
| Acknowledgments |
|---|
We thank J. Thorne for pointing out an error in an earlier version of this paper. This work was funded by a grant from the Department of Energy (No. DE-FG03-01ER83282).
Received on January 25, 2005; accepted on February 19, 2005
| REFERENCES |
|---|
Abagyan, R. and Batalov, S. (1997) Do aligned sequences share the same fold? J. Mol. Biol., 273, 355–368.
Altschul, S. and Gish, W. (1996) Local alignment statistics. Methods Enzymol., 266, 460–480.
Altschul, S., et al. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
Apweiler, R., et al. (2004) UniProt: the universal protein knowledgebase. Nucleic Acids Res., 32, D115–D119.
Armon, A., et al. (2001) ConSurf: an algorithmic tool for the identification of functional regions in proteins by surface mapping of phylogenetic information. J. Mol. Biol., 307, 447–463.
Bateman, A., et al. (2002) The Pfam protein families database. Nucleic Acids Res., 30, 276–280.
Berman, H., et al. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235–242.
Bogan, A. and Thorn, K. (1998) Anatomy of hot spots in protein interfaces. J. Mol. Biol., 280, 1–9.
Bulmer, M. (1991) Use of the method of generalized least-squares in reconstructing phylogenies from sequence data. Mol. Biol. Evol., 8, 868–883.
Caffrey, D., et al. (2004) Are protein–protein interfaces more conserved in sequence than the rest of the protein surface? Protein Sci., 13, 190–202.
Chao, K., et al. (1993) Locating well-conserved regions within a pairwise alignment. Comput. Appl. Biosci., 9, 387–396.
Clackson, T. and Wells, J. (1995) A hot spot of binding energy in a hormone–receptor interface. Science, 267, 383–386.
Felsenstein, J. (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol., 17, 368–376.
Felsenstein, J. (2001) Taking variation of evolutionary rates between sites into account in inferring phylogenies. J. Mol. Evol., 53, 447–455.
Fitch, W. and Margoliash, E. (1967) A method for estimating the number of invariant amino acid coding positions in a gene using cytochrome c as a model case. Biochem. Genet., 1, 65–71.
Grishin, N. and Phillips, M. (1994) The subunit interfaces of oligomeric enzymes are conserved to a similar extent to the overall protein sequences. Protein Sci., 3, 2455–2458.
Hall, M. and Banaszak, L. (1993) Crystal structure of a ternary complex of Escherichia coli malate dehydrogenase citrate and NAD at 1.9 Å resolution. J. Mol. Biol., 232, 213–222.
Hogg, R. and Craig, A. (1978) Bayesian estimation. Introduction to Mathematical Statistics, 5th edn. Prentice-Hall, Englewood Cliffs, NJ, pp. 363–372.
Holmquist, R., et al. (1983) The spatial distribution of fixed mutations within genes coding for proteins. J. Mol. Evol., 19, 437–448.
Howe, K., et al. (2002) QuickTree: building huge neighbour-joining trees of protein sequences. Bioinformatics, 18, 1546–1547.
Hu, Z., et al. (2000) Conservation of polar residues as hot spots at protein interfaces. Proteins, 39, 331–342.
Jones, D., et al. (1992) The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci., 8, 275–282.
Jones, S. and Thornton, J. (1997) Analysis of protein–protein interaction sites using surface patches. J. Mol. Biol., 272, 121–132.
Landgraf, R., et al. (1999) Analysis of heregulin symmetry by weighted evolutionary tracing. Protein Eng., 12, 943–951.
Landgraf, R., et al. (2001) Three-dimensional cluster analysis identifies interfaces and functional residue clusters in proteins. J. Mol. Biol., 307, 1487–1502.
Lichtarge, O. and Sowa, M. (2002) Evolutionary predictions of binding surfaces and interactions. Curr. Opin. Struct. Biol., 12, 21–27.
Lichtarge, O., et al. (1996) An evolutionary trace method defines binding surfaces common to protein families. J. Mol. Biol., 257, 342–358.
Löytynoja, A. and Milinkovitch, M. (2003) A hidden Markov model for progressive multiple alignment. Bioinformatics, 19, 1505–1513.
Mevissen, H. and Vingron, M. (1996) Quantifying the local reliability of a sequence alignment. Protein Eng., 9, 127–132.
Mihalek, I., et al. (2004) A family of evolution–entropy hybrid methods for ranking protein residues by importance. J. Mol. Biol., 336, 1265–1282.
Molsoft, LLC (2004) ICM Software Manual, Version 3.0.
Neyman, J. (1971) Molecular studies of evolution: a source of novel statistical problems. In Gupta, S. and Yackel, J. (eds), Statistical Decision Theory and Related Topics. Academic Press, NY, pp. 1–27.
Park, J., et al. (1998) Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J. Mol. Biol., 284, 1201–1210.
Press, W., Teukolsky, S., Vetterling, W. and Flannery, B. (1992) Numerical Recipes in C. Cambridge University Press.
Pupko, T. and Graur, D. (2002) Fast computation of maximum likelihood trees by numerical approximation of amino acid replacement probabilities. Comput. Stat. Data An., 40, 285–291.
Pupko, T., et al. (2002) Rate4Site: an algorithmic tool for the identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues. Bioinformatics, 18, S71–S77.
Saitou, N. and Nei, M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol., 4, 406–425.
Schlosshauer, M. and Ohlsson, M. (2002) A novel approach to local reliability of sequence alignments. Bioinformatics, 18, 847–854.
Sjölander, K. (2004) Phylogenomic inference of protein molecular function: advances and challenges. Bioinformatics, 20, 170–179.
Thompson, J., et al. (1994) Clustal W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680.
Uzzell, T. and Corbin, K. (1971) Fitting discrete probability distributions to evolutionary events. Science, 172, 1089–1096.
Valdar, W. (2002) Scoring residue conservation. Proteins, 48, 227–241.
Valdar, W. and Thornton, J. (2001) Protein–protein interfaces: analysis of amino acid conservation in homodimers. Proteins, 42, 108–124.
Vingron, M. and Argos, P. (1990) Determination of reliable regions in protein sequence alignments. Protein Eng., 3, 565–569.
Yang, Z. (1993) Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol., 10, 1396–1401.
Yang, Z. (1994) Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol., 39, 306–314.
Yang, Z. and Wang, T. (1995) Mixed model analysis of DNA sequence evolution. Biometrics, 51, 552–561.
Yang, Z., et al. (2000) Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics, 155, 431–449.