Bioinformatics Advance Access originally published online on September 24, 2007
Bioinformatics 2007 23(22):3001-3008; doi:10.1093/bioinformatics/btm470
The Poisson Index: a new probabilistic model for protein–ligand binding site similarity
1School of Mathematics and 2Institute of Molecular and Cellular Biology, University of Leeds, Leeds LS2 9JT, UK
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Motivation: The large-scale comparison of protein–ligand binding sites is problematic, in that measures of structural similarity are difficult to quantify and are not easily understood in terms of statistical similarity that can ultimately be related to structure and function. We present a binding site matching score, the Poisson Index (PI), based upon a well-defined statistical model. PI requires only the number of matching atoms between two sites and the size of the two sites, the same information used by the Tanimoto Index (TI), a comparable and widely used measure for molecular similarity. We apply PI and TI to a previously automatically extracted set of binding sites to determine the robustness and usefulness of both scores.
Results: We found that PI outperforms TI; moreover, site similarity is poorly defined for TI at values around the 99.5% confidence level for which PI is well defined. A difference map at this confidence level shows that PI gives much more meaningful information than TI. We show individual examples where TI fails to distinguish either a false or a true site pairing, in contrast to PI, which performs much better. TI cannot handle large or small sites very well, or the comparison of large and small sites, in contrast to PI, which is shown to be much more robust. Despite the difficulty of determining a biological ground truth for binding site similarity, we conclude that PI is a suitable measure of binding site similarity and could form the basis for a binding site classification scheme comparable to existing protein domain classification schema.
Availability: PI is implemented in SitesBase www.modelling.leeds.ac.uk/sb/
Contact: r.m.jackson{at}leeds.ac.uk
| 1 INTRODUCTION |
|---|
It is often argued that protein structure is better conserved than protein sequence; for instance, many evolutionarily related members of the globin family share considerable structural similarity but little sequence similarity (Bashford et al., 1987). However, it is a non-trivial problem to produce a structural similarity score that, like sequence similarity methods such as BLAST and the Smith–Waterman algorithm, contains well-understood models of statistical similarity and can be applied to examine comparisons on a large scale. Nevertheless, the construction of structural classification schemes for protein domains, such as SCOP (Murzin et al., 1995), CATH (Orengo et al., 1997) and FSSP (Holm et al., 1992), has had a profound impact on our ability to understand structural and functional relationships where sequence similarity is not statistically significant. The large-scale determination of new protein structures further increases the need for tools for understanding structure–function relationships.
There are many instances where the structural comparison of functional sites on proteins helps in understanding functional relationships. In response, many methods are now available for binding site comparison or functional site prediction (Laurie and Jackson, 2006; Watson et al., 2005), all of which include some measure of site similarity. In many cases the score represents the size or percentage match between the query site and other sites (Gold and Jackson, 2006b; Schmitt et al., 2002; Shulman-Peleg et al., 2004), which is ideal for ranking hits to a query site. Alternatively, the statistical significance of a hit with respect to a database can be represented at the sequence or structural level via a Z-score (Kinoshita et al., 2002), or by E- and p-values (Binkowski et al., 2003; Gold and Jackson, 2006b; Laskowski et al., 2005). However, most methods lack the rigorous statistical basis needed to discriminate between true structural matches and noise. Stark et al. (2003) have come closest to doing this, by developing a geometrical model that allows an estimate of significance, a priori, based on both spatial and sequence conservation at the residue level.
Measures of similarity are clearly important for protein binding site classification (Khun et al., 2006; Najmanovich et al., 2007) or network construction (Zhang and Grigorov, 2006) and have also been used to improve the specificity of the annotation of genes using the Gene Ontology (GO) (Kang et al., 2004). All these recent studies use the Tanimoto Index (TI) as a measure of binding site similarity for input into cluster analysis or network construction. Apart from the use in bioinformatics applications the TI has a long history of use in chemoinformatics (Willett et al., 1986), and has been applied successfully to a wide variety of problems, including describing the similarity between chemical fingerprints (Arimoto et al., 2005; Brown and Martin, 1997). A particular strength of the TI is that it uses only the size of the two objects to be compared and their match score. The objects can be described in terms of their defining features (e.g. atoms or labelled points) and the method is fast to calculate and therefore appropriate for large-scale comparison and a wide number of applications. However, there is no obvious way to determine a critical value for TI. This is particularly important in matching binding sites where it is difficult to discern true structural matches from noise. As we shall see below, TI is not entirely scale independent and in protein binding site comparison has particular difficulties with small or large sites or matching between the two.
We propose a statistical model-based index, the Poisson Index (PI), that uses the same information as TI. The model assumes a superpopulation from which all the points are drawn randomly within a given volume. Further, two subsets of the points are sampled from this population with a probability structure that imposes similarity and dissimilarity characteristics. The main advantage of using PI over TI is that it is based on an intuitive model and has a natural probability distribution. This distribution can further be used to obtain p-values that indicate whether the match could have occurred by chance. It is independent of any assumptions based on protein sequence or structure and, like the TI, the method is highly flexible and can be applied to a wide variety of potential applications.
We compare TI and PI in the context of SitesBase (Gold and Jackson, 2006a). The SitesBase extraction process is automated and produces over 30 000 binding sites, generating a wealth of matching information for understanding structural relationships.
| 2 METHODS |
|---|
2.1 Derivation of index
Let X and Y be two binding sites that consist of a set of co-ordinates in three-dimensional space. The two sites are superimposed using a linear transformation consisting of a rotation and translation such that the maximum number of points coincide. Let L be the number of matching points, defined such that the distance between matching pairs x ∈ X and y ∈ Y, after translation and rotation, is less than a cutoff of 1 Å; m the number of points in X; and n the number of points in Y, with m ≤ n. The Tanimoto Index T is

$$T = \frac{L}{m + n - L}. \qquad (1)$$
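As a quick check of Equation (1), the following minimal Python sketch computes TI from the three quantities L, m and n. The function name `tanimoto_index` is ours, chosen for illustration; the two example pairings use match scores and site sizes quoted later in the paper (Sections 3.5.1 and 4).

```python
def tanimoto_index(L: int, m: int, n: int) -> float:
    """Tanimoto Index T = L / (m + n - L) for L matched points
    between two sites containing m and n points."""
    if m < 1 or n < 1:
        raise ValueError("site sizes m and n must be positive")
    if not 0 <= L <= min(m, n):
        raise ValueError("L must lie between 0 and min(m, n)")
    return L / (m + n - L)


# Two pairings quoted elsewhere in the paper.
print(round(tanimoto_index(14, 32, 32), 2))    # 14 / 50  -> 0.28
print(round(tanimoto_index(63, 73, 257), 3))   # 63 / 267 -> 0.236
```

The second example shows the scale effect discussed later: a 63-atom match between large sites scores lower than a 14-atom match between small ones.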
The null model considers a set of N locations in the superpopulation consisting of a homogeneous Poisson process with rate λ over a region of volume v. Each location may belong to the set X, Y, neither, or both. We assume the probabilities of each location belonging to each set are p_x, p_y, 1 − p_x − p_y − ρ p_x p_y and ρ p_x p_y, respectively, where ρ is the tendency of points to match a priori. Under this model, the number of locations that belong to X, Y, neither and both are independent Poisson random variables with counts m − L, n − L, N − m − n + L, L and means λ v p_x, λ v p_y, λ v (1 − p_x − p_y − ρ p_x p_y), λ v ρ p_x p_y, respectively. Hence the probability of observing L matches, conditional on m, n and N, is proportional to d^L / [(m − L)! (n − L)! L!], that is,

$$P(L \mid m, n, d) = K \, \frac{d^{L}}{(m - L)! \, (n - L)! \, L!}, \qquad L = 0, 1, \ldots, m,$$

where d = ρ/(λv) represents the propensity of two points to match, and K is a normalizing constant, calculated numerically so that the probabilities sum to 1. This distribution of L, which depends only on d, has been derived in Green and Mardia (2006). This parameterization gives a distribution with only one parameter that takes into account the volume, the intensity and the probability of a match.
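Thus if we have a null hypothesis that the matches are due to chance, we can calculate the P-value to indicate matches that are not random, e.g. for sites from proteins with recent common ancestors. The P-value is the tail probability of finding a match as good as or better than the observed L_obs given m, n and d:

$$\mathrm{PI} = P(L \ge L_{\mathrm{obs}} \mid m, n, d) = \sum_{L = L_{\mathrm{obs}}}^{m} P(L \mid m, n, d), \qquad (2)$$

where P(L | m, n, d) is the null distribution of the number of matches derived above.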
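The tail probability can be sketched in a few lines of Python. This is a minimal illustration, not the SitesBase implementation. It assumes the null weights take the form d^L / [(m − L)! (n − L)! L!] that follows from the Poisson means above, and it finds the normalizing constant K by summing over L = 0, …, min(m, n). The function names and the value d = 0.01 are hypothetical.

```python
import math


def log_weight(L, m, n, d):
    """log of d^L / ((m-L)! (n-L)! L!), the unnormalised probability of L matches."""
    return (L * math.log(d)
            - math.lgamma(m - L + 1)
            - math.lgamma(n - L + 1)
            - math.lgamma(L + 1))


def match_distribution(m, n, d):
    """Return [P(L=0), ..., P(L=min(m, n))]; K is found numerically by normalising."""
    logs = [log_weight(L, m, n, d) for L in range(min(m, n) + 1)]
    peak = max(logs)                      # subtract the peak to avoid underflow
    weights = [math.exp(x - peak) for x in logs]
    total = sum(weights)
    return [w / total for w in weights]


def poisson_index(L_obs, m, n, d):
    """Tail probability P(L >= L_obs | m, n, d)."""
    return sum(match_distribution(m, n, d)[L_obs:])


# d = 0.01 is an arbitrary illustrative value, not an estimate from the paper.
for L_obs in (5, 10, 15):
    print(L_obs, poisson_index(L_obs, 30, 30, 0.01))
```

Working with log-factorials via `math.lgamma` keeps the calculation stable for sites with a few hundred atoms, where the factorials themselves would overflow a float.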
2.2 Data
The data used in this study were taken from the binding site database SitesBase (Gold and Jackson, 2006a). SitesBase matches binding sites using an all-atom representation, and defines a binding site by a set of protein atoms within a 5 Å radius of any ligand atom. Over 33 000 binding sites in the database have been matched together in a pairwise fashion using a geometric hashing algorithm (Brakoulias and Jackson, 2004). The SCOP (Structural Classification Of Proteins) (Murzin et al., 1995) code of the corresponding protein of that site is also recorded. A reduced dataset was created by considering only matches between sites from proteins found in the SCOP40 database annotated in the ASTRAL compendium (Chandonia et al., 2004). The data were further refined by filtering out any site for which the ligand was (all or partly) a modified amino acid since it was found to usually represent a false binding site. Redundancy between sites on the same protein was reduced by clustering binding sites with TI greater than 0.8 and then removing sites in the same cluster that were taken from the same protein.
2.3 Maximum likelihood estimation
To find a suitable estimate for d, we used maximum likelihood estimation (MLE). To gain insight into the maximum likelihood estimate and the P-values, we fixed the site sizes to be m = n = 30. We only looked at pairs of sites with unrelated recent evolutionary backgrounds (taken from different SCOP classes), since these best represent false matches. In general, matches below a specified threshold L_min = min(20, 0.3m) are not given in SitesBase and are treated as missing data; for m = 30, match scores less than 9 are not given. To account for the missing data, the following censored distribution was used to calculate the maximum likelihood:

$$P(L \mid L \ge L_{\min}, m, n, d) = \frac{P(L \mid m, n, d)}{\sum_{l = L_{\min}}^{m} P(l \mid m, n, d)}, \qquad L = L_{\min}, \ldots, m, \qquad (3)$$

where P(L | m, n, d) is the uncensored distribution of Section 2.1 and L_min = 9 here.
We obtained a point estimate of d for m = n = 30. We checked the validity of the distribution using this point estimate by superimposing the empirical distribution of the data on the theoretical distribution from Equation (3) for all values of L from 9 to 30. The theoretical distribution fits the data reasonably well (Fig. 1).
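The estimation step can be illustrated with a short Python sketch. It assumes the censored distribution is the null distribution renormalised over the reported range L ≥ L_min, and it uses a hand-written golden-section search over log d in place of whatever optimiser the authors used. The function names and the counts in `toy_counts` are hypothetical, not SitesBase data.

```python
import math


def log_weight(L, m, n, d):
    """log of d^L / ((m-L)! (n-L)! L!)."""
    return (L * math.log(d) - math.lgamma(m - L + 1)
            - math.lgamma(n - L + 1) - math.lgamma(L + 1))


def truncated_loglik(d, counts, m, n, L_min):
    """Log-likelihood of observed match counts when only L >= L_min is reported.

    counts maps each observed L (>= L_min) to the number of site pairs with that L.
    """
    logs = [log_weight(L, m, n, d) for L in range(L_min, min(m, n) + 1)]
    peak = max(logs)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logs))
    return sum(k * (log_weight(L, m, n, d) - log_norm) for L, k in counts.items())


def fit_d(counts, m, n, L_min, lo=1e-6, hi=10.0, iters=100):
    """Golden-section search for the maximiser over log d in [log lo, log hi]."""
    a, b = math.log(lo), math.log(hi)
    g = (math.sqrt(5) - 1) / 2
    for _ in range(iters):
        x1, x2 = b - g * (b - a), a + g * (b - a)
        f1 = truncated_loglik(math.exp(x1), counts, m, n, L_min)
        f2 = truncated_loglik(math.exp(x2), counts, m, n, L_min)
        if f1 < f2:
            a = x1      # maximum lies to the right of x1
        else:
            b = x2      # maximum lies to the left of x2
    return math.exp((a + b) / 2)


# Hypothetical counts for m = n = 30, for illustration only.
toy_counts = {9: 40, 10: 25, 11: 12, 12: 5, 13: 2}
print(fit_d(toy_counts, 30, 30, 9))
```

The search is done on log d because the model is an exponential family in log d, so the log-likelihood has a single maximum there and a bracketing search is sufficient.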
2.4 General model
We repeated the MLE estimation for other values of m and n (including m ≠ n) and found that the estimate of d varied with site size. This is to be expected, since d = ρ/(λv) and so v will be dependent upon m and n, since larger sites will occupy larger volumes. We proceeded to search for a link between d and m, n. To do this, we obtained point estimates of d for every pair m, n with 30 ≤ m ≤ n ≤ 200 and fitted a linear model of the form d = a(mn)^(−b), which is linear on the log scale: log d = log a − b log(mn). We found a good fit, with R² approximately equal to 0.99, a ≈ 2.8 and b ≈ 0.8 (Fig. 2).
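A fit of this form can be sketched as ordinary least squares on the log scale. The paper does not state exactly how the initial linear fit was carried out, so the approach below is an assumption, and `fit_power_law` and the synthetic input are ours, for illustration only.

```python
import math


def fit_power_law(estimates):
    """Least-squares fit of log d = log a - b * log(m * n).

    estimates is a list of (m, n, d_hat) triples; returns (a, b, r_squared).
    """
    xs = [math.log(m * n) for m, n, _ in estimates]
    ys = [math.log(d) for _, _, d in estimates]
    k = len(xs)
    x_bar, y_bar = sum(xs) / k, sum(ys) / k
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    syy = sum((y - y_bar) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r_squared = (sxy * sxy) / (sxx * syy)
    return math.exp(intercept), -slope, r_squared


# Synthetic estimates generated from d = 2.8 (mn)^-0.8, so the fit should
# recover a and b close to those values; these are not SitesBase data.
synthetic = [(m, n, 2.8 * (m * n) ** -0.8)
             for m in range(30, 201, 10) for n in range(m, 201, 10)]
a, b, r2 = fit_power_law(synthetic)
print(a, b, r2)
```

With real point estimates the residuals would not vanish, and R² would summarise how well the two-parameter power law captures the dependence of d on site size.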
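To find the optimal values for a and b we used the maximum likelihood estimators for a and b, which maximize the log-likelihood (up to an additive constant that does not depend on a or b)

$$\ell(a, b) = \sum_{j=1}^{T} N_j \left( \log K_j + \bar{L}_j \log d_j \right), \qquad d_j = a \, (m_j n_j)^{-b},$$

where $\bar{L}_j$ is the average number of matches for the given set of (m_j, n_j), N_j is the number of pairs of sites of size m_j, n_j, and K_j is the normalizing constant dependent on d_j, m_j, n_j. Finally, T is the number of (m_j, n_j) pairs of any size (m, n) from 30 to 200 (here T = 14 706).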
A grid search of possible combinations of a, b showed a clearly identifiable single best estimate that was confirmed to be the estimate produced by MLE, â = 2.732 and b̂ = 0.797. PI is then calculated using Equation (2) with d = 2.732(mn)^(−0.797).
| 3 RESULTS |
|---|
3.1 Calibration of the PI
To check the calibration of the PI, we took a random sample of 700 non-related site pairs (defined as site pairs from different SCOP classes) and produced an empirical cumulative frequency conditional distribution of PI. To do this, we calculated the conditional PI using Equation (3) for all site pairs and then plotted several threshold values of the conditional PI against the percentage of site pairs that fell below that particular value. For a well-calibrated score, the plotted values would lie close to the line y = x. The distribution is required to be the conditional distribution since PI tail probabilities cannot be calculated for censored pairings, and so a large number of chance pairings are inaccessible. Censoring takes into account the fact that SitesBase does not include all low values of L (see Section 2.3). It can be clearly seen that the calibration is good for values of PI less than 0.15 (Fig. 3).
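The calibration check amounts to comparing an empirical cumulative distribution with the line y = x. A minimal Python sketch follows; `calibration_curve` is a hypothetical helper, and uniform random numbers stand in for the conditional PI values of unrelated site pairs, since that is what a perfectly calibrated score would produce.

```python
import random


def calibration_curve(p_values, thresholds):
    """For each threshold t, the fraction of p-values that are <= t."""
    total = len(p_values)
    return [(t, sum(p <= t for p in p_values) / total) for t in thresholds]


# Stand-in data: uniform random numbers play the role of PI values for
# unrelated site pairs, which is what perfect calibration would produce.
rng = random.Random(0)
fake_p_values = [rng.random() for _ in range(700)]
for t, frac in calibration_curve(fake_p_values, [0.01, 0.05, 0.10, 0.15]):
    print(f"threshold {t:.2f}: fraction at or below = {frac:.3f}")
```

For a well-calibrated score each printed fraction should be close to its threshold, up to sampling noise from the 700 pairs.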
3.2 The PI is a better discriminator of site similarity than TI
We proceeded to test the capability of PI to distinguish between false matching pairs and true matching pairs. A random sample of 499 site pairs derived from similar recent evolutionary backgrounds (both belong to the same SCOP superfamily) was taken, along with 499 site pairs with no identifiably similar recent evolutionary background (both belong to different SCOP classes). PI and TI values for these data were calculated using Equation (2) for PI and Equation (1) for TI. By varying the threshold score at which site pairings are considered true positives or false positives, it was possible to generate specificity and sensitivity data for each metric at different threshold levels. The specificity and sensitivity values were plotted on Receiver Operating Characteristic (ROC) curves (Fig. 4).
There is a clear improvement to be found in using PI: a sensitivity of 0.8 can be attained with a false positive rate of approximately 0.09. In contrast, TI is only able to reach such a high level of sensitivity with a much higher false positive rate of 0.22. With a false positive rate of 0.005 (i.e. at the 99.5% confidence level), PI obtains a sensitivity of 0.71 while TI obtains only 0.32.
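The threshold sweep behind a ROC curve can be sketched as follows. This is an illustrative Python outline with made-up scores, not the analysis code used for Fig. 4; the function `roc_points` and its `smaller_is_better` flag are our own naming. The flag reflects that a small PI indicates a match, whereas a large TI does.

```python
def roc_points(related, unrelated, thresholds, smaller_is_better):
    """Return (false positive rate, sensitivity) for each threshold.

    related / unrelated hold scores for true and false site pairs.
    Set smaller_is_better=True for p-value-like scores such as PI
    and False for similarity scores such as TI.
    """
    def called_match(score, t):
        return score <= t if smaller_is_better else score >= t

    points = []
    for t in thresholds:
        tp = sum(called_match(s, t) for s in related)
        fp = sum(called_match(s, t) for s in unrelated)
        points.append((fp / len(unrelated), tp / len(related)))
    return points


# Made-up TI-like scores, purely to show the call.
related_ti = [0.45, 0.31, 0.27, 0.22, 0.18]
unrelated_ti = [0.26, 0.21, 0.15, 0.12, 0.10]
print(roc_points(related_ti, unrelated_ti, [0.30, 0.25, 0.20], smaller_is_better=False))
```

Reading off the sensitivity at a fixed false positive rate, as done above for 0.005, then only requires finding the threshold whose first coordinate matches that rate.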
3.3 At the 99.5% confidence level TI values are uninformative
We next consider the capacity of TI to distinguish pairings at the 99.5% confidence level. The ROC analysis revealed that the TI value required to obtain a false positive rate threshold of 0.005 was > 0.2365. Pairs with TI values near this threshold (0.23–0.24) were further investigated. Based upon the above criteria, pairings at this threshold should be significant matches in the majority of cases. The matches found were ranked by PI scores and found to include matches that are clearly significant, matches that are clearly not significant and those that fall somewhere in between (see Table 1).
The matches in the first part of the table have highly significant PI scores; notably, they all come from matches between nucleotide binding sites and are derived from proteins related at the family or superfamily level according to SCOP. On inspection, the superimpositions of these matched sites appear to be genuine. The middle group contains matches that show moderate to low levels of similarity; they consist of sites that are, in general, smaller than those in the first group. Of note is the match between 1ecf and 1nh8: these sites are from different folds, yet on inspection there is a clear region of similarity between them that is found exclusively in close proximity to a phosphate group in both sites. Matches in the final group are between sites that are from different folds or classes and are again smaller than sites from the second group. Several of these pairings are too poor to fall within the top 5000 matches ranked by score that are listed on SitesBase, and on inspection all of these matches are found to be unconvincing. Under TI all of these matches would be considered equivalent in merit, which is undoubtedly not the case; PI, in contrast, is able to distinguish true from false positives and accounts well for the effect of scale.
3.4 A difference map for comparison of TI and PI
We examined a difference map of a set of non-redundant sites (see Methods section) taken from the alpha and beta (a/b) SCOP class under equivalent TI and PI thresholds. The values at the 99.5% confidence level for PI and TI were chosen as the thresholds at which matches were deemed significant. For PI this was ≤ 7 × 10⁻⁴ and for TI this was ≥ 0.2365. Figure 5 shows sites found within the threshold limits for both TI and PI (blue) and sites found with only one of the two methods in different halves of the map, PI (orange), TI (green). Prominent members of the a/b class such as those containing the Rossmann fold (II) and those containing the P-loop (VI) are well represented. Within superfamilies, PI picks up many matches between members that TI does not. The thioredoxin-like (VIII), PLP-dependent transferases (X), formate dehydrogenase/DMSO reductase, domains 1–3 (XII) and the S-adenosyl-L-methionine-dependent methyltransferases (IX) are all good examples of superfamilies where PI produces many more intra-superfamily matches than TI and PI combined.
Off-diagonal matches are also more numerous with PI. While off-diagonal matches that are significant under the TI alone are sparse and scattered (traits consistent with random noise), with PI there are prominent off-diagonal matches between superfamilies which are not seen under TI, which warrant further investigation.
Not surprisingly the Rossmann fold (Rao and Rossmann, 1973) (II) has many significant off-diagonal matches to other superfamilies, including to its nearest neighbours on the map the FAD/NAD(P) binding domains (III) and the nucleotide binding domains (IV) that are both closely associated with nucleotide binding. These three superfamilies also match in a significant part to both the DHS-like NAD/FAD binding domain superfamily (V) (see intersection 1a) and to the S-adenosyl-L-methionine (SAM)-dependent methyltransferases superfamily (IX) (see intersection 2). The former is interesting since DHS-like domains are Rossmann-like but bind molecules in the opposite direction (Dym and Eisenberg, 2001) while the latter similarity between the Rossmann fold and SAM-dependent methyltransferases has been noted previously (Schubert et al., 2003). The DHS-like NAD/FAD binding domain is also seen to have significant off-diagonal matches with the SAM-dependent methyltransferases superfamily (see intersection 1b) and with domains 1–3 of the formate dehydrogenase/DMSO reductase superfamily (see intersection 1c). However, interestingly the off-diagonal matches between the formate dehydrogenase/DMSO reductase superfamily (XII) and the DHS-like superfamily (V) do not extend to the Rossmann fold superfamily (II) or to the SAM-dependent methyltransferase superfamily (IX).
Other prominent off-diagonal matches involve the well-known P-loop containing nucleoside triphosphate hydrolases (VI). Off-diagonal matches between this superfamily and the PRTase-like superfamily (VIII) (see intersection 3), the MurD-like peptide ligases (catalytic domain) superfamily (XI) (see intersection 4), the PEP carboxykinase-like superfamily (XIII) (see intersection 5) and the adenine nucleotide alpha hydrolases-like superfamily that is represented by a single site (see intersection 6) are of note. The MurD-like peptide ligases (c.72.2.x) are known to bind ADP during their enzymatic action in a mononucleotide binding site containing the ATP/GTP binding P-loop (Bertrand et al., 1999). The PEP carboxykinase-like superfamily is noted to contain a P-loop motif in SCOP itself (Matte et al., 1996). The phosphoribosyltransferases (PRTases) are noted for their dissimilarity with each other (Sinha and Smith, 2001), a fact that is reflected by the small increase in within-superfamily significant matches detected by PI (VIII). Since PRTase function is sometimes involved in nucleotide synthesis or purine salvage pathways, it is plausible that some members would match to the nucleoside triphosphate hydrolases. The single member of the adenine nucleotide alpha hydrolases-like superfamily is a GMP synthetase (pdb:1gpm) that is known to possess a conserved P-loop in the ATP phosphatase domain (Tesmer et al., 1996).
Binding sites that share the same protein (and by proxy SCOP code) may often possess different functions, sometimes contradicting the definition of ground truth given earlier. This is well illustrated by two sites found upon the protein 1e6u. The acetylphosphate binding site (see intersection 7) bears more similarity to the P-loop superfamily than its parent superfamily, matching to a mere three other sites in (II, III and IV). In contrast the NADP binding site adjacent to it on the map matches very well to superfamilies (II, III and IV) but poorly to superfamily (VI). Both binding sites share the same SCOP code and would be considered similar, an issue considered further in the Discussion section.
3.5 Case studies
We looked at some interesting binding site matches that have been found by PI. The three matches below are not detected using the standard sequence similarity methods of BLAST, FASTA or PSI-BLAST.
3.5.1 Sugar binding protein and hydrolase
TI has difficulty dealing with small sites, often producing significant matches between sites that we would not conclude are similar on closer inspection. For example, matching the nicotinamide binding site on the hydrolase 1isi with the galactose binding site found on the sugar binding protein 1ule gives a TI of 0.28 (matching score of 14 between two sites of size 32). Given that the 99.5% threshold value for TI is 0.23, this would indicate that the match is probably a significant one. However, on inspection of the superimposition of the two binding sites, there appears to be no similarity save for the side chain group of a single tryptophan residue. The corresponding PI score of 1.5 × 10⁻³ is not significant in terms of the PI threshold of 7 × 10⁻⁴.
3.5.2 FAD binding sites
TI also has difficulty with larger sites for the opposite reason: genuine matches are deemed insignificant because the size of one or both sites is large. For example, two FAD binding sites taken from 1f8r and 1kf6, both of which come from the FAD/NAD(P) binding domain (c.3.1.x) superfamily and from proteins with common recent evolutionary history, match with a score of 59. Because both sites involved are large, the TI for this match is only 0.18 and below the significance threshold, while the PI score of 8.1 × 10⁻¹⁸ correctly identifies the match as highly significant.
3.5.3 Subtilisin and trypsin
The similarity of the subtilisin and trypsin families, despite their disparate evolutionary backgrounds, is well documented in the literature (Eder et al., 1993). Since the binding sites involved come from different folds (and SCOP classes), it is not possible to detect this similarity at the domain level, i.e. the local similarity due to functional convergence of the two serine proteases is not well reflected in global similarity. The benefits of restricting analysis to the binding sites are evident in this case in picking out similarities that are difficult to find when considering protein domains. While there are better examples of matches between sites taken from subtilisin-like and trypsin-like proteins, which are found by both PI and TI, there are cases where PI picks up this convergent similarity when TI does not. For example, when comparing an active site taken from 1dx5 (SCOP code: b.47.1.2) with the active site taken from 1kdv (SCOP code: c.41.1.2), with a matching score of 27, it can be seen that significant regions of the binding sites superimpose well. The PI score for this pair is 4.3 × 10⁻⁷ and can be considered significant, whilst the corresponding TI score of 0.23 is just below the TI significance threshold.
| 4 DISCUSSION |
|---|
We have developed a successful probabilistic model for protein–ligand binding site matching. By testing it against the TI, a widely used method of similarity measurement, we have found it to be a robust and sufficiently accurate method of determining similarity, considering that only three values are required to compute the score.
It is particularly evident in Table 1 that Tanimoto scores are dependent upon the size of the sites involved. In this particular case very similar TI values are found for different matches, despite the fact that we would hopefully find more merit in a partial match of 63 atoms between two sites that are 73 and 257 atoms large, than in 12 atoms between two sites that are 32 atoms large. The case studies show two matches where a clearly false match has a much higher TI score than a clearly significant match simply because smaller site matches are favoured and larger site matches are unfavoured.
For any application of TI, consideration of its semi-scale dependency should be taken into account. However, the usage of TI in any given situation is contextual and rarely transferable. All evidence above points to TI values between binding site matches certainly as low as 0.3 being of some significance in this study, while in other implementations a score of 0.3 would be indicative of a poor result (Brown and Martin, 1997). Therefore, the impact of scale dependency will likely depend on the given application.
PI measures the probability of a single pairwise match to calculate similarity, but in practice we would compare a single site to all other candidates in the database to find significant matches. Like any pairwise test extended to a large number of comparisons, the effect of multiple testing must be taken into account. Experience has shown that scores less than or equal to 1 × 10⁻⁶ are generally reliable.
A notable difficulty in binding site matching is the definition of what we would consider a true match between two binding sites. Throughout the study we have used the simple assumption that sites from the same SCOP family or superfamily are related and sites from different superfamilies are not related. While this definition may hold for protein domains, it is less convincing at the binding site level. First, proteins often possess more than one binding site, and these may provide totally different functions; this is well demonstrated by the acetylphosphate binding site in Section 3.4. Second, but less critically, sites from different superfamilies have been shown to occasionally possess the same convergent function (see case studies). Some of the examples shown here could, however, have been improved with a more comprehensive knowledge of the functional connectivity between binding sites or a revised ground truth.
| ACKNOWLEDGEMENT |
|---|
We would like to thank Nicola Gold and Vysaul Nyringo for helpful comments and the EPSRC for funding a studentship for J.R.D.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Burkhard Rost
Received on June 13, 2007; revised on September 7, 2007; accepted on September 10, 2007
| REFERENCES |
|---|
Arimoto R, et al. Development of CYP3A4 inhibition models: comparisons of machine-learning techniques and molecular descriptors. J. Biomol. Screen. (2005) 10:197–205.
Bashford D, et al. Determinants of a protein fold. Unique features of the globin amino acid sequences. J. Mol. Biol. (1987) 196:199–216.
Bertrand JA, et al. Determination of the MurD mechanism through crystallographic analysis of enzyme complexes. J. Mol. Biol. (1999) 289:579–590.
Binkowski TA, et al. Inferring functional relationships of proteins from local sequence and spatial surface patterns. J. Mol. Biol. (2003) 332:505–526.
Brakoulias A, Jackson RM. Towards a structural classification of phosphate binding sites in protein-nucleotide complexes: an automated all-against-all structural comparison using geometric matching. Proteins (2004) 56:250–260.
Brown RD, Martin YC. The information content of 2D and 3D structural descriptors relevant to ligand-receptor binding. J. Chem. Inf. Comput. Sci. (1997) 37:1–9.
Chandonia JM, et al. The ASTRAL compendium in 2004. Nucleic Acids Res. (2004) 32:D189–D192.
Dym O, Eisenberg D. Sequence-structure analysis of FAD-containing proteins. Protein Sci. (2001) 10:1712–1728.
Eder J, et al. Folding of subtilisin BPN: characterization of a folding intermediate. Biochemistry (1993) 32:18–26.
Gold N, Jackson RM. Sitesbase: a database for structure-based protein ligand binding site comparisons. Nucleic Acids Res. (2006a) 34:D231–234.
Gold ND, Jackson RM. Fold independent structural comparisons of protein-ligand binding sites for exploring functional relationships. J. Mol. Biol. (2006b) 355:1112–1124.
Green PJ, Mardia KV. Bayesian alignment using hierarchical models with applications in protein bioinformatics. Biometrika (2006) 93:235–254.
Holm L, et al. A database of protein structure families with common folding motifs. Protein Sci. (1992) 1:1691–1698.
Kang T, et al. Learnability-based further prediction of gene functions in gene ontology. Genomics (2004) 84:922–928.
Khun D, et al. From the similarity analysis of protein cavities to the functional classification of protein families using Cavbase. J. Mol. Biol. (2006) 359:1023–1044.
Kinoshita K, et al. Identification of protein functions from a molecular surface database, eF-site. J. Struct. Funct. Genomics (2002) 2:9–22.
Laskowski RA, et al. Protein function prediction using local 3D templates. J. Mol. Biol. (2005) 351:614–626.
Laurie AT, Jackson RM. Methods for the prediction of protein-ligand binding sites for structure-based drug design and virtual ligand screening. Curr. Protein Pept. Sci. (2006) 7:395–406.
Matte A, et al. Crystal structure of Escherichia coli phosphoenolpyruvate carboxykinase: A new structural family with the p-loop nucleoside triphosphate hydrolase fold. J. Mol. Biol. (1996) 256:126–143.
Murzin AG, et al. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. (1995) 247:536–540.
Najmanovich RJ, et al. Analysis of binding site similarity, small molecule similarity and experimental binding profiles in the human cytosolic sulfotransferase family. Bioinformatics (2007) 23:e104–e109.
Orengo CA, et al. CATH a hierarchic classification of protein domain structures. Structure (1997) 5:1093–1108.
Rao ST, Rossmann MG. Comparison of super-secondary structures in proteins. J. Mol. Biol. (1973) 76:241–256.
Schmitt S, et al. A new method to detect related function among proteins independent of sequence and fold homology. J. Mol. Biol. (2002) 323:387–406.
Schubert HL. Many paths to methyltransfer: a chronicle of convergence. Trends Biochem. Sci. (2003) 28:329–335.
Shulman-Peleg A, et al. Recognition of functional sites in protein structures. J. Mol. Biol. (2004) 339:607–633.
Sinha SC, Smith JL. The PRT protein family. Curr. Opin. Struct. Biol. (2001) 11:733–739.
Stark A, et al. A model for statistical significance of local similarities in structure. J. Mol. Biol. (2003) 326:1307–1316.
Tesmer JJG, et al. The crystal structure of GMP synthetase reveals a novel catalytic triad and is a structural paradigm for two enzyme families. Nat. Struct. Biol. (1996) 3:74–86.
Watson JD, et al. Predicting protein function from sequence and structural data. Curr. Opin. Struct. Biol. (2005) 15:275–284.
Willett P, et al. Implementation of nearest-neighbor searching in an online chemical structure search system. J. Chem. Inf. Comput. Sci. (1986) 26:36–41.
Zhang Z, Grigorov MG. Similarity networks of protein binding sites. Proteins (2006) 62:470–478.