Bioinformatics Advance Access originally published online on November 23, 2007
Bioinformatics 2008 24(2):225-233; doi:10.1093/bioinformatics/btm580
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Genome scale enzyme–metabolite and drug–target interaction predictions using the signature molecular descriptor
1Sandia National Laboratories, Computational Biosciences Department, P.O. Box 5800, Albuquerque, NM 87185-1413, 2Sandia National Laboratories, Computer Science & Informatics Department, P.O. Box 5800, Albuquerque, NM 87185-1316 and 3Sandia National Laboratories, Biosystems Research, P.O. Box 969, Livermore, CA 94551-9291, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Identifying protein enzymatic or pharmacological activities are important areas of research in biology and chemistry. Biological and chemical databases are increasingly being populated with linkages between protein sequences and chemical structures. There is now sufficient information to apply machine-learning techniques to predict interactions between chemicals and proteins at a genome scale. Current machine-learning techniques use as input either protein sequences and structures or chemical information. We propose here a method to infer protein–chemical interactions using heterogeneous input consisting of both protein sequence and chemical information.
Results: Our method relies on expressing proteins and chemicals with a common cheminformatics representation. We demonstrate our approach by predicting whether proteins can catalyze reactions not present in training sets. We also predict whether a given drug can bind a target, in the absence of prior binding information for that drug and target. Such predictions cannot be made with current machine-learning techniques requiring binding information for individual reactions or individual targets.
Availability and Contact: For questions, paper reprints, please contact Jean-Loup Faulon at jfaulon{at}sandia.gov. Additional information on the signature molecular descriptor and codes can be downloaded at: http://www.cs.sandia.gov/~jfaulon/publication-signature.html
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
|
|
|---|
Molecular recognition is the primary event involved in the interaction of proteins with other proteins and with small molecules such as metabolites and therapeutics. The ability to predict these interactions on a genome-wide scale is vital in determining the enzymes that catalyze a given metabolic reaction, the reactions a given enzyme catalyzes, the targets a particular drug binds and the drugs that will bind a given target. Predicting these interactions has direct application when completing genome annotations, finding enzymes for synthetic chemistry, and predicting drug specificity, promiscuity and polypharmacology.
Molecular recognition has been studied using a variety of approaches. At the molecular level, biophysical simulation techniques ranging from ab inito quantum calculations (Fukuzawa et al., 2005) to rapid virtual screening (Warren et al., 2006) have been used. These techniques can be very accurate but require knowledge of the three dimensional structure of the chemical–protein interface and can be computationally expensive, giving them limited applicability at the whole genome level. On the other hand, statistical machine-learning approaches can be applied on a much larger scale. The increasing amount of information linking chemical and biological data (Austin et al., 2004; Brooksbank et al., 2005; Kanehisa et al., 2006; Wishart et al., 2006) has provided the required input for these machine-learning tools. Machine-learning approaches are computationally efficient, and they do not necessarily require three-dimensional structural information.
To date, machine-learning methods have been applied to molecular recognition questions based on whether they are used with biological sequences or with chemical connectivity data. In bioinformatics, for instance, there are techniques making use of Enzyme Commission (EC) numbers to predict the metabolites a given sequence can catalyze (Borgwardt et al., 2005; Cai et al., 2003; Kunik et al., 2005). There are also sequence and structure-based methods for locating homologous ligand binding sites (Johnson and Church, 2000; Kalinina et al., 2004). In cheminformatics, methods have been developed to find compounds binding to a given target (Gasteiger and Engel, 2003), as well as to predict the EC number of a given metabolic reaction (Kotera et al., 2004). However, none of these techniques take advantage of the increasing linkage between biological sequences and molecular information. Further, there are no methods available that can predict when an unclassified (in the EC nomenclature) chemical reaction will be catalyzed by a given enzyme, or when a chemical will bind to a sequence in the absence of binding information for either the sequence or the chemical. To overcome these problems, we have developed an approach making use of a Support Vector Machine (Noble, 2006) (SVM) kernel-based method that operates on both protein and chemical information.
Our approach uses a graph-based representation of molecules known as signature (Faulon, 1994). This representation describes a molecular graph by decomposition into canonical subgraphs (Faulon et al., 2004). Using this type of representation, the similarity between two molecules can be evaluated by comparing their subgraphs (Bender et al., 2004). Similar comparisons have been previously used by exploiting graph kernel SVMs (Gartner et al., 2003; Kashima et al., 2003; Mahe et al., 2006; Swamidass et al., 2005) for the prediction of mutagenecity, toxicity and anti-cancer activities (Swamidass et al., 2005).
In our case, we are interested in chemicals as well as proteins. Perhaps unsurprisingly, functional predictions using protein sequences are also based on pair-wise comparison. These comparisons are typically made using sequence homology, as computed by sequence alignment algorithms, or, in the case of SVMs, remote homology detection with string kernels (Leslie et al., 2002). Protein sequence- based approaches and in particular sequence alignment are routinely used to assign EC numbers in newly sequenced genomes (White, 2006).
The similarities between the chemical-based and sequence-based approaches described above provided the motivation for combining the different methods into a general framework. To accomplish this goal, we used molecular signature as a common representation for both chemicals and protein sequences. Chemicals cannot be coded as linear sequences of well-defined units, but protein sequences are molecules and as such can be represented by molecular graphs. A common representation should thus come from cheminformatics. Our choice of molecular signature (among the myriad cheminformatics descriptors) is motivated by the fact that signatures are molecular fragments. When applied to proteins, these fragments correspond to short sequences, and short sequences correspond to protein domains, which are extensively used in bioinformatics applications (Mulder et al., 2007).
We implemented our approach within the framework of SVMs by developing a kernel that uses the signature descriptor in the context of protein–chemical pairs. Our kernel is based on the signature product kernel developed for protein–protein interaction (Martin et al., 2005) and encompasses ideas related to string kernels (Leslie et al., 2002), graph kernels (Gartner et al., 2003; Kashima et al., 2003; Mahe et al., 2006; Swamidass et al., 2005) and pairwise kernels (Ben-Hur and Noble, 2005; Martin et al., 2005). Details of our techniques are found in the Methods section. The Results section presents enzyme-metabolites binding predictions based on metabolic reaction information, sequence information and the product of both. Next, enzyme–metabolite and drug–target interactions are predicted for independent test sets. In these two latter cases, we investigate the possibilities of predicting enzyme–metabolite and drug–target interactions in the absence of prior binding information in the training sets for the metabolites, enzymes, drugs or targets of the test sets.
| 2 METHODS |
|---|
|
|
|---|
2.1 Molecular signature
The signature of a molecule is a vector whose components correspond to atomic signatures (Faulon, 1994). Each component of a molecular signature counts the number of occurrences of a particular atomic signature in the molecule. An atomic signature is a canonical representation of the subgraph surrounding a particular atom. This subgraph includes all atoms and bonds up to a predefined distance from the given atom. This distance is called the signature height. Formally, we let hS = ZN (h), where N (h) is the number of possible atomic signatures of height h and the unit basis vectors ZN (h) correspond uniquely to the possible atomic signatures. If G = (V, E) is a molecular graph, with vertex (atom) set V, and edge (bond) set E, then the molecular signature of G is given by
|
| (1) |
G (x) is the unit basis vector of hS corresponding to the atomic signature in G rooted at x of height h.
Algorithms and codes to compute atomic and molecular signatures have been previously documented (Faulon et al., 2004). Examples of atomic and molecular signatures are provided in Supplementary Figures S1 and S2. The signature heights used for chemicals typically range between 0 and 6. The signature height used for proteins ranges between 6 and 18 and corresponds roughly to amino acid strings with 1–7 residues. Computationally, the most expensive step in computing a molecular signature is the subgraph canonization step (Faulon et al., 2004). While the computational complexity for canonizing subgraphs (and graphs in general) is unknown, a computational running times study for various types of chemical structures indicate the running time to be linearly proportional to the input size (Faulon et al., 2004). This computational efficiency allows us to process large chemical structures such as proteins with more than 100 000 atoms. We find the computational complexity to only slightly increase with the signature height as on average it takes 6 s CPU time (on a Dual 2.3 GHz PowerPC Macintosh) to compute protein signatures for height 6 and 8 s for height 18. We were thus able to process datasets of 1000 proteins (which was the maximum training and test set size used in this study) in less than 3 h. While not applied in the present study, further saving in running time could potentially be gained by computing signature only for specific atoms along the protein backbone (only C-
atoms, for instance).
2.2 Reaction signature
Reaction signatures are computed for enzymatic reactions. We assume that all enzymatic reactions take the general form R: s1 S1 + s2 S2 + ... + sn Sn
p1 P1 + p2 P2 + ... + pm Pm, where si and pj are the stoichiometric coefficients of substrates Si and products Pj. The height h signature of reaction R is then defined by
|
| (2) |
(Pj) and h
(Si) are the height h molecular signatures of substrate Si and product Pj computed using Equation (1).
2.3 Support vector machines
Support Vector Machines (Noble, 2006) (SVMs) are classifiers that have been widely used in both bioinformatics and cheminformatics. Relevant to this study, SVMs have been used to predict the EC numbers of protein sequences (Borgwardt et al., 2005; Cai et al., 2003; Kunik et al., 2005). In the chemistry literature, SVMs have been used to make property predictions based on molecular connectivity (Kashima et al., 2003; Swamidass et al., 2005).
A SVM can be described as follows. Suppose our data are given as pairs {(xi , yi)}
Rn { ± 1}. In other words, suppose our data consists of two classes with labels 1 and –1. Using this notation, a SVM assumes the form: f(x) =
i
i yi k(xi , x) + b where f: Rn
R is a decision function [x belongs to class 1 if f (x) is greater than some threshold b, or to class –1 otherwise], k: Rn x Rn
R is a kernel function, otherwise known as a dot product in some vector space, and the constants b and
i are obtained by solving a quadratic programming problem.
A kernel function should measure the similarity between two inputs as a dot product. For sequences, the most commonly used kernels are string kernels (Leslie et al., 2002). A string kernel is a dot product between two vectors containing occurrence numbers of short subsequences within the main sequence. In addition to string kernels, kernels based on feature vectors are also used (Bock and Gough, 2001). These vectors compile physico-chemical parameters collected for each amino acid in a sequence. Various kernels for chemicals have been proposed including path and tree kernels (Swamidass et al., 2005), marginalized kernels (Kashima et al., 2003) and pharmacophore kernels (Mahe et al., 2006). Path and tree kernels count the occurrence of paths and trees and in a molecular graph, the marginalized kernel is based on random walks in a graph, and the pharmacophore kernel takes into account the three dimensional structure of the chemicals. The above kernels have been used to classify chemicals for mutagenicity, toxicity and anti-cancer activities (Swamidass et al., 2005). While kernels already exist for protein sequences and chemicals, our goal in this article is to develop a kernel that will process both simultaneously.
2.4 Signature kernels
Different kernels based on signatures are used for comparing chemicals, reactions and sequences. In all cases, kernels are scalars while signatures are vectors. In the case of chemicals, the kernel function computes the similarity between two chemicals represented by their molecular signatures. In this article, to measure similarity we use a simple dot product between the molecular signature vectors. In the case of chemical reactions, the kernel function computes the similarity between the signatures of the reactions. In the case of protein sequences, the sequences are first transformed into molecular graphs by replacing the amino acids with their chemical structures. The proteins are then processed as chemicals.
When using both chemicals and proteins, as in the case of a drug binding to a target, or a metabolic reaction catalyzed by an enzyme, the kernel is defined for two pairs of chemical–protein interactions. It is the product of the signature kernel between the two chemicals and the signature kernel between the two proteins. We define all of these kernels in Equations. (3–4) below.
For a given height h, the signature kernel between two chemical structures A and B is defined by:
|
| (3) |
(A) and h
(B) are computed using Equation (1), and | h
(A)| denotes the norm of h
(A). This kernel also applies to proteins considered as atomic structures and reactions, where reaction signatures are used instead of molecular signatures. Examples of signature kernels are given in Supplementary Figure S2.
Let P be a protein and C be a chemical. The height l signature of P can be expressed as: l
(P) = (p1 , p2 , ... , pn) where n = | l
|. Similarly the height h signature of C can be written as: h
(C) = (c1 , c2 , ... , cm)where m = | hS |. Following the definition found in Martin et al. (2005) and Ben-Hur and Noble (2005) for protein–protein interaction kernel, we define the signature of the complex P
C as the tensor product of the signatures of P and C:
|
|
If we let (P,C) and (Q,D) be two pairs of chemical–protein interactions, using simple algebra we have:
|
|
Similarly it can easily be shown that:
|
|
|
| (4) |
2.5 Cross-validation
Several metabolite-enzyme and drug-target datasets were processed in this study, these are detailed in the Result sections. Once a dataset was generated, prediction accuracy was assessed using cross-validation. X-fold cross-validation is performed by dividing a dataset into X equal non-intersecting subsets. A given subset is treated as a test set, and the complement serves as a training set. A classifier is trained using the training set and predictions are made on the test set. This procedure is repeated for each of the original subsets to obtain predictions on the entire dataset. In this work, we used either leave-one-out (LOO) or 5-fold cross-validation. LOO cross-validation uses each element of the dataset separately as a test set and 5-fold cross-validation splits the dataset into five equal subsets.
Statistics were compiled for each dataset using the cross-validation predictions. Accuracy, sensitivity, specificity and precision were computed. Using TP, FP, TN and FN to denote true positive, false positives, true negatives and false negatives, accuracy is defined by (TP + TN)/(TP + FP + TN + FN), sensitivity is defined by TP/(TP + FN), specificity is defined by TN/(TN + FP) and precision is defined by TP/(TP + FP). In addition, we computed the Jaccard coefficient J = TP/(TP + FP + FN) and the area under the receiver-operator-characteristic (ROC) curve. The area under the curve (auc) is obtained by integrating under the ROC curve. The ROC curve is obtained by varying the threshold b (see SVM section) separating positives from negatives and plotting the TP rate (sensitivity) versus the FP rate (1-specificity). For all of these statistics, a larger number indicates a better result.
| 3 RESULTS |
|---|
|
|
|---|
Our approach rests on the assumption that we can replace protein homology calculations based on sequence with similar calculations based on the underlying atomistic representation. Our first result is therefore related to the investigation of this claim. Next, we compare our approach to existing methods using only chemical or only sequence information. Then, we combine both sequence and chemical information for the prediction of reaction–enzyme and drug–target interactions. Finally, we investigate the efficacy of our method for predictions in the absence of prior binding information for reaction, enzyme, drug or target.
3.1 Signature similarity measurements and sequence alignment scores
Sequence alignment algorithms such as Altschul et al. (1997) use substitution matrices to compute similarity scores. The procedure is straightforward and consists of summing up the scores found in substitution matrices (such as BLOSUM62) for the amino acids that are aligned. BLOSUM62 is a 20 x 20 matrix of amino acids substitution scores (Henikoff and Henikoff, 1992). A score between two amino acids is the logarithm of the ratio of the likelihoods of these two amino acids to be substituted. In BLOSUM62 the substitution likelihood between two given amino acids was originally computed from the frequencies observed for finding the two amino acids aligned in a large set of trusted alignments having at least 62% identity.
To investigate the relationship between sequence alignment scores and similarity obtained using the signature kernel [c.f. Equation (3)], the kernel was computed for every pair of amino acids found in BLOSUM62. The results shown in Figure 1 demonstrate a clear correlation, indicating that residues that are chemically similar generally have a high BLOSSUM62 value and conversely. For example, while the signature kernel value between Leucine and Isoleucine is 1.0 (c.f. Supplementary Fig. S2) and its BLOSUM62 value is 2, the BLOSUM62 value for Glycine and Isoleucine is –4 and the signature kernel value is 0.42 (c.f. Supplementary Fig. S2). Figure 1 confirms previous findings relating amino acids substitution scores and their physico–chemical properties (Atchley et al., 2005). Indeed, since properties are structure dependent, if BLOSUM62 scores are related to amino acid properties, some relationship should be found between BLOSUM62 scores and amino acid atomistic structures. Our relationship is interesting because it occurs despite the fact that BLOSUM62 scores were obtained without regard to the amino acid chemical structures.
|
3.2 Chemicals classification
To benchmark our signature kernel with other state-of-the-art kernel we used the Predictive Toxicology Challenge (PTC) dataset (Helma et al., 2001), which reports the carcinogenicity of 417 chemical compounds for male mice (MM), female mice (FM), male rats (MR) and female rats (FR). Because toxicology is generally hard to predict, this dataset has been used previously to evaluate various kernel methods (Swamidass et al., 2005). Supplementary Table S1 gives a comparison of the signature kernel with a frequent pattern discovery approach (Kashima et al., 2003), a marginalized kernel (Kramer and De Raedt, 2001), SMILES string kernels, pathcount-based graph kernels and a kernel based on three dimensional atomic distances (Swamidass et al., 2005). Our kernel compares well with these state-of-the-art kernels for chemical heights ranging between 1 and 4.
3.3 EC number classification using metabolic reactions
Enzymes are organized according to the Enzyme Commission (EC) system, a hierarchical classification that assigns unique four-field numbers to different enzymatic activities (Webb, 1992). The first field of an EC number indicates the general class of catalyzed reaction: 1 denotes oxidoreductases, 2 denotes transferases, 3 denotes hydrolases, 4 denotes lyases, 5 denotes isomerases and 6 denotes ligases. The second and third fields depend on different criteria related to the chemical features of the substrate and the product of the reaction. The fourth field is substrate and product specific. As an example, the tripeptide aminopeptidases have the number EC 3.4.11.4. Level 1 EC 3 enzymes are hydrolases. Level 2 EC 3.4 enzymes are hydrolases that act on peptide bonds. Level 3 EC 3.4.11 enzymes are hydrolases that cleave off the amino-terminal amino acid from a polypeptide, and level 4 EC 3.4.11.4 enzymes are those that cleave off the amino-terminal end from a tripeptide.
To predict the EC numbers of metabolic reactions, we created a training set by downloading the KEGG database. As of 21 November 2006, this database contained 10 951 compounds and 6556 reactions involving compounds with molecular structures stored in the database. Positive examples were compiled using all reactions in the KEGG database having a specified EC number. Wild cards were allowed in order to generate datasets at various EC levels. As an example, EC level 2 1.1.*.* consisted of 409 reactions having an EC number starting with 1.1. Positive example sets having less than 50 elements were not processed. If the positive set included more than 500 examples, then excess examples were removed at random. Next, the datasets were completed by taking equal numbers of negative examples at random. Negative examples were composed of reactions not present in the positive class. The fourth EC level could not be processed because it usually contains only one reaction. The signature reaction kernel [Equation (2)] was applied in order to classify metabolic reactions. Cross-validation results are presented in Supplementary Figure S4a and Table S2. Maximum accuracies are 91% for level 1, 84% for level 2 and 88% for level 3. These accuracies, obtained for heights ranging between 1 and 6, are lower than those published in Kotera et al. (Kotera et al., 2004). However, the signature kernel can process a larger number of reactions (c.f. Supplementary Table S2).
3.4 EC number classification using protein sequence
To predict EC numbers using protein sequence, we downloaded protein sequences from the KEGG database having one or more EC number assignments, resulting in 308 094 sequences. Training sets were constructed for all four EC levels as described in Section 3.3 using protein sequences instead of reactions. The signature kernel [Equation (1)] was computed using several heights for each protein sequence using an atomistic description of the protein. Cross-validation results are reported in Supplementary Figure S4b. We find the accuracies peak at height 8 for levels 1, 2 and 3 (values are 60%, 79% and 81%, respectively), and at height 6 for level 4 (value is 95%). These accuracies are generally lower than those reported using protein structure graph kernels (Borgwardt et al., 2005), feature vectors [i.e. SVM-PROT (Cai et al., 2003)] or motifs (Kunik et al., 2005). Nevertheless, the accuracies are high enough that the signature kernel can be used with protein sequence alone to predict EC numbers (except for EC level 1). When both the sequence and the reaction are used, the accuracy improves as shown in the next section.
3.5 EC number classification using both protein sequences and metabolic reactions
Using the reactions and sequences with assigned EC numbers in the KEGG database, a set of 855 772 pairs (out of 3905 reactions and 255 304 enzymes) was compiled. Training sets were constructed for all four EC levels as described in Section 3.3 taking reaction-protein pairs instead of reaction only. Signatures of various heights were computed for both the reactions and the atomistic representations of the proteins. The signature product kernel [c.f. Equation (4)] was applied for each EC level. Cross-validation results are reported for classes 1, 1.1, 1.1.1 and 1.1.1.1 in Supplementary Figure S5. We find the accuracy for class 1 peaks at height 4 for chemicals and height 8 for proteins. For class 1.1, the largest accuracy is obtained at heights 1 and 8 for reactions and proteins; for class 1.1.1, the accuracy peaks at heights (3,10) and for class 1.1.1.1, maximum accuracy is reached at heights (3,6). Taking the above optimum heights, the signature product kernel is compared with the protein structure graph kernel (Borgwardt et al., 2005) in Supplementary Table S3, the SVM-PROT feature vectors method (Cai et al., 2003) in Supplementary Table S4, the motifs technique of Kunik et al. (2005) in Supplementary Tables S5. Additional results are reported for all classes 1.1.1.* in Supplementary Table S6. Table 1 (which summarized results reported in Supplementary Tables S3–S6) shows that the signature product performs comparably with the other competing techniques. Better sensitivity (accuracy on positives) and Jaccard coefficient are obtained with the signature product because the competing techniques were trained on unbalanced training sets comprising more negatives than positives. This imbalance appears to increase the number of false negative predictions thus reducing sensitivity and Jaccard coefficient.
|
3.6 Predicting new enzyme–metabolite interactions
To test the ability of the signature product kernel to predict enzyme–metabolite interactions not present in training sets, all enzymes and reactions corresponding to EC numbers accepted in September 2006 by the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB) were removed from the KEGG database. Test sets composed of removed enzymes and reactions were created for each selected EC numbers. For each test set, a training set composed of 500 positive examples was constructed from the remaining KEGG database. An equal number of negative interactions were added to both training and test sets by choosing at random interactions between reactions and enzymes not present in the reduced KEGG database. SVMs were trained with the signature product kernel using height 3 for reactions, and heights 6, 8 and 10 for proteins. These heights were selected because they lead to high accuracies in cross-validation studies (c.f. Section 3.5). Results are given in Table 2 using height 3 for reactions and 10 for proteins, similar results were obtained for heights (3,6) and (3,8). The results show that it is possible to predict (with accuracies above 80%) whether or not a given enzyme will catalyze a given reaction, even when the EC class of the enzyme reaction pair is not present in the training set.
|
3.7 Predicting new drug–target interactions
A dataset linking drugs with protein targets was created from the KEGG database. This dataset contained 873 drug–target pairs taken from 121 targets and 551 drugs. A training set was constructed using all 873 positive pairs. An equal number of negatives were added by selecting random pairs that did not belong to the set of positives. Cross-validation results were obtained using 5-fold cross-validation with the signature product kernel, as reported in Supplementary Figure S6 for various drug and target signature heights. Best accuracies (>85%) are obtained when the drug height is in the range 2–3 and the target height in the range 10–14.
The same training set was used to predict if any of the drug target binding pairs stored in the DRUGBANK database could be predicted using the SVM trained on the KEGG database. As of 1 December 2006, DRUGBANK was composed of 1133 drugs and 509 targets, with 1849 drug–protein pairs. A test set was created from DRUGBANK by including drugs having a KEGG reference name. Drugs having no KEGG references were removed to avoid discrepancies in chemical structure representations, e.g. KEGG stores hydrogen suppressed structures in mol files, while DRUGBANK has full hydrogen structures. From the initial 1113 drugs, 124 had a KEGG reference. These 124 drugs were bound to 117 targets resulting in 298 pairs. Out of the 124 drugs and 117 targets, 24 drugs and 50 targets were present in the training set. Further classification of the DRUGBANK test pairs, depending on whether or not the target or the drug were in the KEGG training set, is given in the caption of Table 3. Negatives were constructed forming pairs at random between the 124 drugs and 117 targets of the DRUGBANK test set. Calculations were run using the signature product kernel [Equation (4)] with a height 2 for drugs and 10 for targets, these heights were selected because of their high cross-validation accuracies (c.f. Supplementary Fig. S6). The prediction efficiencies on this test set are presented in Figure 2 and Table 3.
|
|
Despite the small number of common interactions between KEGG and DRUGBANK (32), 231 – 32 = 199 additional interactions not in KEGG predicted by the signature product kernel were found in DRUGBANK, including 67 interactions between drugs and targets not present in KEGG (c.f. Table 3).
Some of the predictions made by the signature product kernel even when the drugs and the targets are not present in the training set (class V in Table 3) are rather obvious. As shown in Figure 3, predicting that Quetiapine binds serotonin receptor 2B could have easily been guessed from the training set pair Quetiapine Fumarate binds serotonin receptor 2A. However, this is not the case when predicting Tretinoin binds to retinoic acid receptor alpha since the closest homolog in the training set (Baclofen binds GABAB1-receptor agonist) is rather dissimilar to the tested pair. To systematically evaluate all class V predictions made by the signature product kernel, we computed for each of the 67 pairs its closest homolog in the training set. Precisely, for each pair we searched the pair in the training leading to the highest Tanimoto coefficient (Tc) between the drugs, and the lowest BLAST E-value between the targets. Tc is a widely used measure of similarity between two chemicals (Gasteiger and Engel, 2003); Tc ranges between 0 and 1 and the closer to 1 the more similar are the chemicals. In this study, Tc was evaluated using a standard procedure (Swamidass and Baldi, 2007), which consists of computing 512 bits fingerprints counting paths up to 7 atoms. BLAST E-values were returned running BLASTP (Altschul et al., 1997) using BLOSUM62 as the scoring matrix. Two examples of homologous pairs are given in Figure 3. Results for all 67 tested pairs are listed in Supplementary Table S7. Setting up rather weak thresholds of Tc
0.50 and E-value
10, to maximize the number of potential homologs, we find that 17 of the 67 predictions could have been guessed from similarity and homology calculations. The fact that the signature product kernel can find interactions even when homologs cannot be found in the training set should not come as a surprise. First, the signature product kernel is used in a supervised learning context, where learning is performed on protein–chemical pairs, while homology and similarity calculations are unsupervised. Second, related kernels such as the string kernel (Leslie et al., 2002) and the motif kernel (Kunik et al., 2005) are known to perform well to classify remote homologues (c.f. Table S5 where motifs kernel is compare with sequence alignment).
|
| 4 DISCUSSION |
|---|
|
|
|---|
We have proposed a unified method for predicting protein–chemical interactions based on the representation of a protein using its atomistic structure. We found by a comparison with sequence alignment scores that our method could be used to detect protein homology, even though (strictly speaking) we ignore sequence information in favor of molecular structure. Looking at the atomistic structures of the amino acids, it can be demonstrated that a height 2 atomic signature can uniquely characterize each amino acid, every dimer can be characterized using a height 8 atomic signature, every trimer with a height 9, every tetramer with a height 11 and so on. Thus, when two proteins have many signatures in common, they share many amino acid strings and thus should have some degree of homology. Sharing strings certainly increases alignment score, but there are some instances where strings made of different amino acids can lead to high homology scores. Sequence alignment algorithms incorporate this information in substitution matrices such as BLOSUM62. The surprising implication of our comparison with the sequence alignment score is that an atomistic structure representation of proteins encompasses information stored in substitution matrices. This is the key observation that underlies our method.
Because of the key observation just discussed, we find that the signature kernel and the product kernel can be used to predict whether a given enzyme sequence will catalyze a given reaction. This problem can be solved by other machine-learning methods depending on the database available to construct the training sets. There are five cases. In the first case, both the sequence and the reaction are in the database and they have the same EC number; then the problem is solved and no prediction is needed. In the second case, the sequence and the reaction are in the database albeit with different EC numbers. Note that this situation can arise because enzymes can process several reactions. In such an instance, a training set can be collected taking all the enzymes known to catalyze the given reaction and testing using machine learning the given sequence of interest. If the given reaction has several known enzymes then sequence-based bioinformatics techniques (Borgwardt et al., 2005; Cai et al., 2003; Kunik et al., 2005) can perform the task. The third case arises when the reaction is in the database but not the sequence; as in second case, sequence-based bioinformatics techniques can be used, provided several enzymes are known to catalyze the reaction. The last two cases correspond to situations where the sequence is in the database set but not the reaction, or neither the sequence nor the reaction are in the database. In such instances, training sets cannot be constructed because no enzymes are known to catalyze the reaction, and classical bioinformatics techniques cannot be used. The signature product kernel can process all five cases, it has accuracies comparable to other techniques (c.f. Table 1) in the first three cases and reaches accuracies above 80% in cases where other techniques cannot be used (c.f. Table 2).
The traditional method for constructing a metabolic map of a newly sequenced organism is to assign EC numbers to its proteins. Many proteins remain un-annotated not only because their sequences have not been mapped to an already classified enzyme, but also because the reactions catalyzed by the proteins have not been characterized in the EC nomenclature. EC number assignment requires published evidence and full characterization of the enzymatic reaction. For this reason many reactions, although occurring in various pathways, do not have an assigned EC number. Again as shown in Table 2 our method can predict when an enzyme will catalyze a metabolic reaction, even in the absence of any EC nomenclature information.
A similarly important result, again following from our key observation, is that signature and signature product kernels can be used to predict drug–target interactions. The fact that the signature kernel can predict when a new drug will bind to a given target when other binders are known is not unexpected. Indeed, it has previously been demonstrated that signature and signature-like descriptors can be used to derive structure–activity relationships, and thus predict other drugs binding to a specific target (Churchwell et al., 2004; Faulon et al., 2003). The real strength of the signature product kernel is its ability to detect new interactions when the targets and the drugs are not present in the training set (c.f. Table 3 and Fig. 3). Such predictions cannot be made with classical cheminformatics techniques such as chemical similarity and structure–activity relationships.
As already mentioned, several cheminformatics and bioinformatics methods exist to predict protein-chemical binding. When used in a machine-learning context, traditional cheminformatics methods require training sets composed of chemicals binding to the same protein; bioinformatics methods require sequences catalyzing the same chemicals. The technique we have presented here does not outperform these traditional methods provided training sets can be constructed and the methods can be applied (c.f. Sections 3.1 and 3.2 for cheminformatics methods and Sections 3.4 and 3.5 for bioinformatics methods). The real strength of the proposed technique is its ability to handle both proteins and chemicals using a common representation (e.g. the signature). This common representation allows us to train machine learning directly on protein–chemical pairs, rather than on protein sequences and chemical structures alone.
While the technique we have proposed deals with situations where traditional bioinformatics and cheminformatics methods fail, it has its own limitation. First, the technique requires training, and its accuracy depends on the quality and completeness of the training set. Second, as we have shown, the technique makes use of the signature product kernel for measuring similarity between protein–chemical pairs. The similarity between two chemicals (or two proteins) is based on the number of signatures the chemicals (or proteins) have in common. Thus, a given protein–chemical pair can be predicted accurately only when proteins and chemicals can be found in the training set having signatures in common with the tested pair.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
This work was funded by Sandia Laboratory Directed Research and Development. Sandia is a multi-program laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Olga Troyanskaya
Received on June 21, 2007; revised on October 21, 2007; accepted on November 19, 2007
| REFERENCES |
|---|
|
|
|---|
Altschul SF, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. (1997) 25:3389–3402.
Atchley WR, et al. Solving the protein sequence metric problem. Proc. Natl Acad. Sci. USA (2005) 102:6395–6400.
Austin CP, et al. NIH molecular libraries initiative. Science (2004) 306:1138–1139.
Ben-Hur A, Noble WS. Kernel methods for predicting protein-protein interactions. Bioinformatics (2005) 21(Suppl. 1):i38–i46.[Abstract]
Bender A, et al. Similarity searching of chemical databases using atom environment descriptors (MOLPRINT 2D): evaluation of performance. J. Chem. Inf. Comput. Sci. (2004) 44:1708–1718.[CrossRef][Web of Science][Medline]
Bock JR, Gough DA. Predicting protein-protein interactions from primary structure. Bioinformatics (2001) 17:455–460.
Borgwardt KM, et al. Protein function prediction via graph kernels. Bioinformatics (2005) 21(Suppl. 1):i47–i56.[Abstract]
Brooksbank C, et al. The European Bioinformatics Institute's data resources: towards systems biology. Nucleic Acids Res. (2005) 33(Database issue):D46–D53.
Cai CZ, et al. SVM-Prot: web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic Acids Res. (2003) 31:3692–3697.
Churchwell CJ, et al. The signature molecular descriptor. 3. Inverse-quantitative structure-activity relationship of ICAM-1 inhibitory peptides. J. Mol. Graph. Model. (2004) 22:263–273.[CrossRef][Web of Science][Medline]
Faulon JL. Stochastic generator of chemical structure. 1. Application to the structure elucidation of large molecules. J. Chem. Inf. Comput. Sci. (1994) 34:1204–1218.[CrossRef][Web of Science]
Faulon JL, et al. The signature molecular descriptor. 2. Enumerating molecules from their extended valence sequences. J. Chem. Inf. Comput. Sci. (2003) 43:721–734.[CrossRef][Web of Science][Medline]
Faulon JL, et al. The signature molecular descriptor. 4. Canonizing molecules using extended valence sequences. J. Chem. Inf. Comput. Sci. (2004) 44:427–436.[CrossRef][Web of Science][Medline]
Fukuzawa K, et al. Ab initio quantum mechanical study of the binding energies of human estrogen receptor alpha with its ligands: an application of fragment molecular orbital method. J. Comput. Chem. (2005) 26:1–10.[CrossRef][Web of Science][Medline]
Gartner T, et al. On graph kernels: hardness results and efficient alternatives. In: Proceedings of the Sixteenth Annual Conference on Computational Learning Theory and Seventh Kernel Workshop (2003) New York: Springer Verlag.
Gasteiger J, Engel T. Chemoinformatics (2003) Weinheim: Wiley-VCH.
Helma C, et al. The predictive toxicology challenge 2000–2001. Bioinformatics (2001) 17:107–108.
Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA (1992) 89:10915–10919.
Johnson JM, Church GM. Predicting ligand-binding function in families of bacterial receptors. Proc. Natl Acad. Sci. USA (2000) 97:3965–3970.
Kalinina OV, et al. SDPpred: a tool for prediction of amino acid residues that determine differences in functional specificity of homologous proteins. Nucleic Acids Res. (2004) 32(Web Server issue):W424–W428.
Kanehisa M, et al. From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res. (2006) 34(Database issue):D354–D357.
Kashima H, et al. Marginalized kernels between labeled graphs. In: Proceedings of the Twentieth International Conference on Machine Learning (2003) Washington DC: AAAI Press.
Kotera M, et al. Computational assignment of the EC numbers for genomic-scale analysis of enzymatic reactions. J. Am. Chem. Soc. (2004) 126:16487–16498.[CrossRef][Web of Science][Medline]
Kramer S, De Raedt L. Feature construction with version spaces for biochemical applications. In: Eighteenth International Conference on Machine Learning Table of Contents (2001) San Francisco, CA: Morgan Kaufmann Publishers Inc.
Kunik V, et al. Motif extraction and protein classification. Proc. IEEE Comput. Syst. Bioinform. Conf. (2005) 4:80–85.
Leslie C, et al. The spectrum kernel: a string kernel for SVM protein classification. Pac. Symp. Biocomput. (2002) 564–575.
Mahe P, et al. The pharmacophore kernel for virtual screening with support vector machines. J. Chem. Inf. Model. (2006) 46:2003–2014.[CrossRef][Web of Science][Medline]
Martin S, et al. Predicting protein-protein interactions using signature products. Bioinformatics (2005) 21:218–226.
Mulder NJ, et al. New developments in the InterPro database. Nucleic Acids Res. (2007) 35(Database issue):D224–D228.
Noble WS. What is a support vector machine? Nat. Biotechnol. (2006) 24:1565–1567.[CrossRef][Web of Science][Medline]
Swamidass SJ, Baldi P. Mathematical correction for fingerprint similarity measures to improve chemical retrieval. J. Chem. Inf. Model. (2007) 47:952–964.[CrossRef][Web of Science][Medline]
Swamidass SJ, et al. Kernels for small molecules and the prediction of mutagenicity, toxicity and anti-cancer activity. Bioinformatics (2005) 21(Suppl. 1):i359–i368.[Abstract]
Warren GL, et al. A critical assessment of docking programs and scoring functions. J. Med. Chem. (2006) 49:5921–5931.
Webb EC. Enzyme Nomenclature 1992: Recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology on the Nomenclature and Classification of Enzymes (1992) San Diego, CA: Academic Press.
White RH. The difficult road from sequence to function. J. Bacteriol. (2006) 188:3431–3432.
Wishart DS, et al. DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res. (2006) 34(Database issue):D668–D672.
This article has been cited by other articles:
![]() |
D. A. R. S. Latino, Q.-Y. Zhang, and J. Aires-de-Sousa Genome-scale classification of metabolic reactions and assignment of EC numbers with self-organizing maps Bioinformatics, October 1, 2008; 24(19): 2236 - 2244. [Abstract] [Full Text] [PDF] |
||||
![]() |
N. Panayotatos Drug-binding Cavities in Long-Lived Biologics: Cause for Concern but Also Potential Benefit J. Clin. Pharmacol., October 1, 2008; 48(10): 1208 - 1211. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||





