Bioinformatics Advance Access originally published online on May 17, 2007
Bioinformatics 2007 23(15):2004-2012; doi:10.1093/bioinformatics/btm266
Statistical prediction of protein–chemical interactions based on chemical structure and mass spectrometry data
Department of Biosciences and Informatics, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama, 223-8522, Japan
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Motivation: Prediction of interactions between proteins and chemical compounds is of great benefit in drug discovery processes. In this field, 3D structure-based methods such as docking analysis have been developed. However, the genome-wide application of these methods is not feasible because 3D structural information is limited in availability.
Results: We describe a novel method for predicting protein–chemical interaction using a support vector machine (SVM). We utilize very general protein data, i.e. amino acid sequences, and combine these with chemical structures and mass spectrometry (MS) data. MS data can be of great use in finding new chemical compounds in the future. We assessed the validity of our method on a dataset of the binding of existing drugs and found that more than 80% accuracy could be obtained. Furthermore, we conducted comprehensive target protein predictions for MDMA, and validated the biological significance of our method by successfully finding proteins relevant to its known functions.
Availability: Available on request from the authors.
Contact: yasu{at}bio.keio.ac.jp
Supplementary information: Appendix–technical details of method, Supplementary Table 1–7 and Supplementary Figure 1.
| 1 INTRODUCTION |
|---|
In the early stages of drug discovery processes, the prediction of protein–chemical interactions, or the binding of a chemical compound to a specific protein, can be of great benefit in the identification of lead compounds (candidates for a new drug). Moreover, the effective screening of potential drug candidates at an early stage leads to large cost savings at a later stage of the overall drug discovery process.
In the field of drug discovery, docking analysis has been the principal method used to elucidate interactions between proteins and small molecules (Jones et al., 1997; Morris et al., 1998; Shoichet et al., 1992). This technique is a 3D structure-based method in which the potential energy for a small molecule to bind to the target protein is evaluated according to a set of equations that model the physical interactions between the receptor and the potential ligand. Because predictions based upon valid free energy calculations are relatively reliable, many docking software tools are now available, such as AutoDock (Morris et al., 1998), DOCK (Shoichet et al., 1992) and GOLD (Jones et al., 1997). However, the requirement of these programs for 3D structural information is a severe disadvantage, as these data are extremely limited in availability. Hence, the genome-wide application of docking analyses is not feasible. For example, among the GPCRs (G-protein coupled receptors), the modulation of which underlies the actions of 30% of the best known commercial drugs (Klabunde and Hessler, 2002), the structure of only one mammalian member, bovine rhodopsin (Palczewski et al., 2000), is known.
To achieve more comprehensive protein–chemical interaction predictions, it is essential to utilize more readily available biological data and more generally applicable methods that do not depend on 3D structural data. In this regard, recent developments in statistical learning and prediction methods hold the promise of very accurate prediction performance when large quantities of learning data are available. In particular, the support vector machine (SVM) has been applied to the prediction of putative protein–protein interactions and has been shown to be effective (Bock and Gough, 2001; Gomez et al., 2003; Martin et al., 2005). In addition, the classification of chemical compounds into drugs and non-drugs using SVM has been proposed (Swamidass et al., 2005; Zernov et al., 2003).
The most prevalent data available for proteins are undoubtedly their amino acid sequences. For chemical compounds, formulas and structures are also available in most cases. Moreover, comprehensive metabolite analyses have now been undertaken using mass spectrometry such as CE-MS (Soga et al., 2002), and these have also generated valuable and available data. Based upon the availability of these data, we herein propose a more comprehensively applicable protein–chemical interaction prediction method than previously described, which is based upon SVM analysis of amino acid sequence data, chemical structure data and mass spectrometry data (Fig. 1). Unlike the previous approaches described above, which assess chemical compounds only and classify them according to their pharmacological effects, a distinct and novel feature of our proposed approach is the classification of protein and chemical compound pairs into binding and non-binding pairs. We further show from our computational experiments that this framework improves the prediction accuracy of the pharmaceutical effects of chemical compounds. In particular, we demonstrate that our current approach using SVM successfully identifies target proteins of chemical compounds that standard similarity-based methods such as BLAST fail to detect. Another notable feature of our proposed method is the use of mass spectra to encode chemical compounds. In addition, we highlight the effectiveness of using mass spectral data by comparing it with, and integrating it with, existing chemical compound structure data (Fig. 1).
Finally, it is known that interactions of molecules have much more information than the evidence of binding. Protein–protein interactions, for instance, contribute to the elucidation of protein functions (Schwikowski et al., 2000) and transcriptional regulations (Nagamine et al., 2005). Therefore, we propose the utilization of predicted protein–chemical interactions to describe properties of chemical compounds.
| 2 METHODS |
|---|
2.1 Sample representation
For a protein–chemical compound pair, the protein is represented by its amino acid sequence and the compound is denoted by either its mass spectrum or its chemical structure. The combination of a feature vector for a protein and that for a chemical compound constitutes a sample.
2.2 Feature representation
In order to apply statistical methods to non-numerical data such as character strings, this type of data must be converted into some numerical data. The feature representation is one way to realize this in which we evaluate whether a feature, such as a specific character in strings, exists in a sample or how many times a feature appears in a sample. As a result of the feature representation, a non-numerical sample is converted into a numerical vector, or a feature vector, whose ith value corresponds to the existence or the frequency of the ith feature considered. Many statistical methods, including SVM that is mentioned later, utilize the similarity between feature vectors to solve the problem.
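As a minimal sketch of the feature representation described above, the following Python snippet converts a character string into a frequency vector whose ith value counts the ith feature. The choice of the 20 standard amino acids as the feature set and the name `composition_vector` are illustrative assumptions and are not the descriptors used in this study.

```python
from collections import Counter

# Assumed feature set: the 20 standard amino acids in a fixed order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_vector(sequence, alphabet=AMINO_ACIDS):
    """Return a vector whose i-th value is the count of alphabet[i] in sequence."""
    counts = Counter(sequence)
    return [counts.get(ch, 0) for ch in alphabet]


# The five-letter peptide NGMGN becomes a 20-dimensional numerical vector.
vec = composition_vector("NGMGN")
```

Replacing the counts with 0/1 indicators would give the existence-based variant mentioned in the text.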
2.2.1 Protein description
We define description as mapping non-numerical data such as amino acid sequences into an n-dimensional numerical vector space so that we can utilize these data in statistical learning.
The amino acid composition, the n-peptide composition and derivatives of these are generally used to represent protein sequences in many bioinformatics applications (Bhasin and Raghava, 2004; Martin et al., 2005; Xiao et al., 2006; Yu et al., 2006).
In our current study, an amino acid sequence is divided into trimers, referred to as the height 1 signatures, as described by Martin et al. (2005). A signature consists of an amino acid and its neighbors. For example, the five-letter amino acid sequence NGMGN produces three signatures: G(MN), M(GG) and G(MN).
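The signature extraction can be sketched in a few lines of Python. The sketch assumes that the two neighbors are written in sorted order, which is consistent with the NGMGN example and with the count of 4200 possible signatures quoted in the text (20 central residues times 210 unordered neighbor pairs). The function name `height1_signatures` is hypothetical.

```python
from itertools import combinations_with_replacement

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def height1_signatures(sequence):
    """List the height 1 signature of every interior residue.

    Each signature is written as centre(neighbours); the two neighbours
    are sorted so that their order in the sequence does not matter
    (an assumption consistent with the NGMGN example).
    """
    signatures = []
    for i in range(1, len(sequence) - 1):
        neighbours = "".join(sorted(sequence[i - 1] + sequence[i + 1]))
        signatures.append(f"{sequence[i]}({neighbours})")
    return signatures


signatures = height1_signatures("NGMGN")

# Number of distinct signatures: 20 centres x 210 unordered neighbour pairs.
neighbour_pairs = list(combinations_with_replacement(AMINO_ACIDS, 2))
n_possible = len(AMINO_ACIDS) * len(neighbour_pairs)
```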
Each signature a01(a11a12) is then mapped into a vector space, giving a signature vector s (the mapping equation is not reproduced here). In this mapping, each amino acid a is represented by a 5D property vector based on 237 physical–chemical properties calculated previously (Venkatarajan and Braun, 2001). All of the possible 4200 signature vectors are clustered into 199 groups based on s by using the variational Bayesian mixture modelling implemented in the R package vabayelMix (Teschendorff et al., 2005) (http://www.cran.r-project.org).
According to these 199 clusters (see Supplementary Table 3), a feature vector for protein p, C(p), is calculated as follows,
|
| (1) |
For example, the five-letter amino acid sequence NGMGN can be represented as follows,
|
|
2.2.2 Chemical description by mass spectrometry data
A mass spectrum of a compound generates information about its structure and physical–chemical properties, and can thus be used to represent it.
In this study, two types of feature vectors, fragment vector F (c) and gap vector G(c), are produced from the mass spectrometry data showing m/z values and intensities for each m/z value, which are scaled 1–999 in a chemical compound.
A fragment vector for a chemical c, F (c), is defined as follows,
|
| (2) |
|
For example, a spectrum in Figure 2 can be represented as follows,
|
|
A gap is defined between two peaks, or m/z values, and reflects the substructure that is represented by the bigger m/z value and not by the smaller m/z value. A gap vector, G(c), is calculated by
|
|
|
| (3) |
2.2.3 Chemical description by chemical structures
Substructures, or paths, extracted from chemical structures, which are regarded as graphs with atoms as nodes and bonds as edges, can be effective descriptors for chemical compounds (Clark, 2005; Swamidass et al., 2005; Merlot et al., 2005).
In this study, we followed the method described in (Swamidass et al., 2005), and a feature vector based on the 2D structure is thus defined as follows,
|
| (4) |
l) and which appears at least once in chemical structures in the dataset and
For example, methane (CH4),
|
|
|
|
2.3 Support vector machine
Classification is an important data mining task in bioinformatics. Many model-generating classification methods, which first learn a model from the training dataset and then use it to assign class labels to the unlabeled objects, have been proposed.
For example, logistic regression analysis (LRA) constructs a linear separating hyperplane between classes (Hosmer and Lemeshow, 2000). The artificial neural network (ANN), which consists of several layers of neurons (i.e. input layer, hidden layer and output layer), can deal with arbitrary data distributions (Ripley, 1996).
Among these, the SVM (Cristianini and Shawe-Taylor, 2000; Vapnik, 1998) is one of the most successful learning algorithms. The SVM has been widely used and has been shown to be effective in many bioinformatics applications (Bhasin and Raghava, 2004; Martin et al., 2005; Swamidass et al., 2005; Yu et al., 2006).
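Given n samples, each of which has an m-dimensional feature vector $x_i$ and one of two classes such as binding and non-binding ($y_i \in \{1, -1\}$), an SVM produces the classifier

$$f(x) = \mathrm{sign}\left(\sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b\right) \qquad (5)$$

where $K$ is a kernel function, and $\alpha = (\alpha_1, \ldots, \alpha_n)$ and $b$ are the parameters learned.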
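For illustration, a kernel SVM classifier of this kind can be evaluated as follows. This is a sketch of the standard decision function, sign(sum_i alpha_i y_i K(x_i, x) + b); the bias term, the RBF kernel, the value of `gamma` and the hand-picked toy model are all assumptions made for the example, not parameters from this study.

```python
import math


def rbf_kernel(u, v, gamma=0.5):
    """RBF kernel exp(-gamma * ||u - v||^2); gamma is an assumed value."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def svm_decide(x, support_vectors, labels, alphas, bias, kernel=rbf_kernel):
    """Return +1 or -1 from the sign of sum_i alpha_i * y_i * K(x_i, x) + bias."""
    score = sum(
        a * y * kernel(sv, x)
        for sv, y, a in zip(support_vectors, labels, alphas)
    ) + bias
    return 1 if score >= 0 else -1


# Hypothetical, hand-picked model (not trained): one training sample per class.
support_vectors = [[0.0, 0.0], [2.0, 2.0]]
labels = [1, -1]  # +1 = binding, -1 = non-binding
alphas = [1.0, 1.0]
bias = 0.0

near_positive = svm_decide([0.1, 0.0], support_vectors, labels, alphas, bias)
near_negative = svm_decide([1.9, 2.0], support_vectors, labels, alphas, bias)
```

In practice the coefficients are obtained by training, for example with LIBSVM; the sketch only shows how a trained model is applied to a new sample.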
The output of an SVM can be regarded as a probability using the following formula (Platt, 2000),
|
|
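The probability mapping attributed to Platt is commonly written as a sigmoid of the SVM decision value, P(y = 1 | f) = 1 / (1 + exp(A f + B)). The sketch below assumes this standard form; the values of `a` and `b` are placeholders, since in practice they are fitted to held-out decision values.

```python
import math


def platt_probability(decision_value, a=-1.0, b=0.0):
    """Map an SVM decision value f to P(y = 1 | f) = 1 / (1 + exp(a * f + b)).

    a and b are normally fitted on held-out data; the defaults here are
    placeholder values chosen only for illustration.
    """
    return 1.0 / (1.0 + math.exp(a * decision_value + b))


p_boundary = platt_probability(0.0)   # a sample lying on the decision boundary
p_positive = platt_probability(2.0)   # a sample well inside the positive side
p_negative = platt_probability(-2.0)  # a sample well inside the negative side
```

With a negative `a`, larger decision values map to probabilities closer to 1, so the ordering of samples by decision value is preserved.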
In our present report, the LIBSVM 2.81 program (Chang and Lin, 2001) was employed to construct the SVM model.
2.4 Representation of a protein–chemical interaction using kernel functions
In our current study, a sample or a pair of a protein and chemical compound, is represented by several types of feature vectors; C, D, F and G expressed in Equations (1–4).
A straightforward way to represent protein–chemical bindings is to concatenate feature vectors for proteins and compounds, and then to treat the concatenated vector as one feature vector. For example, a sample S1, an interaction between peptide NGMGN and methane, can be represented as follows,
|
|
When the RBF kernel, $K(S_1, S_2) = \exp(-\gamma \|S_1 - S_2\|^2)$, is utilized in Equation (5), the concatenation means that the similarity between $C_1$ and $C_2$ and that between $D_1$ and $D_2$ are independently evaluated by the same measure and then multiplied to give the overall similarity, because $K(S_1, S_2) = K(C_1, C_2) \cdot K(D_1, D_2)$. However, it may well be the case that the appropriate measure to evaluate the similarity for one feature vector type differs from that for another feature vector type. Moreover, to represent and predict protein–chemical interactions, combination effects of different feature vector types can be significant.
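The factorization K(S1,S2) = K(C1,C2) · K(D1,D2) for concatenated vectors can be checked numerically. The sketch below assumes an RBF kernel of the form exp(-gamma * ||u - v||^2), for which the squared distance of a concatenation is the sum of the squared distances of its parts; the toy vectors and the value of `gamma` are arbitrary.

```python
import math


def rbf(u, v, gamma):
    """RBF kernel exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


gamma = 0.1  # assumed kernel width

# Toy protein (c) and compound (d) feature vectors for two samples.
c1, d1 = [1.0, 0.0, 2.0], [0.5, 1.5]
c2, d2 = [0.0, 1.0, 2.0], [1.0, 1.0]

# Concatenation of protein and compound vectors gives one vector per sample.
s1 = c1 + d1
s2 = c2 + d2

k_concat = rbf(s1, s2, gamma)
k_product = rbf(c1, c2, gamma) * rbf(d1, d2, gamma)
```

Both values agree up to floating-point error, which is why concatenation under a single RBF kernel cannot weight or measure the two vector types differently.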
Therefore, we used the following formula to determine similarities between two samples in Equation (5),
|
| (6) |
IJ for a pair of feature vector types are empirically selected to give maximum accuracy. In order to obtain proper inner products, the dimensions, or the number of features in different feature vector types, need to be equivalent. To achieve this, the features are ordered according to the mean squared error calculated among all of the different proteins or chemical compounds in the dataset, and the upper 199 features are used for each feature vector type. Here, 199 is the number of protein clusters in Equation (1), that is independent of the datasets and that is smaller than the number of features for other feature vector types.
Moreover, in order to equalize the influence of each feature vector type and each feature in the feature types, a normalization scaling was applied. A value for the jth feature of the sample i was scaled as follows.
|
|
2.5 Evaluation of the prediction performances
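We evaluated the prediction performances of our method using 10-fold cross-validation based on the following measurements: precision (pre.), sensitivity (sen.), accuracy (acc.) and Matthews correlation coefficient (MCC). Hence,

$$\mathrm{pre.} = \frac{TP}{TP + FP}, \quad \mathrm{sen.} = \frac{TP}{TP + FN}, \quad \mathrm{acc.} = \frac{TP + TN}{TP + TN + FP + FN},$$

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP is the number of true positives, TN true negatives, FP false positives and FN false negatives.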
For all of these measurements, the higher the value, the better the prediction is.
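These four measurements can be computed directly from the confusion-matrix counts using their standard definitions. In the sketch below the function name `binary_metrics` and the counts in the example call are hypothetical; they are not results from this study.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Return precision, sensitivity, accuracy and MCC from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "precision": precision,
        "sensitivity": sensitivity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts chosen only for illustration.
metrics = binary_metrics(tp=40, tn=45, fp=5, fn=10)
```

Unlike accuracy, MCC stays informative when the two classes are of unequal size, as is the case when negative samples outnumber positive ones.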
2.6 Similarity measure based on target proteins
Target protein sets generated by comprehensive target protein predictions for chemical compounds can reveal biological and functional similarities among these chemical compounds.
It is generally assumed that the more target proteins two drugs have in common, the more biologically or functionally similar they are, because effects of drugs are largely determined by target proteins to which they bind.
Therefore, in this study, we define the similarity between two chemical compounds α and β as follows,
|
|
Here, to overcome the problem that most statistical learning methods depend on limited training data, several prediction results made by models with different negative samples are combined for the sake of higher confidence. The higher the s(α, β) value, the more biologically similar α and β are thought to be. Principal component analysis (PCA) was then applied to the similarity matrix S, whose element sij represents the similarity between the compounds i and j.
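To make the idea of a target-based similarity matrix concrete, the sketch below builds one from predicted target sets. The overlap measure used here, the Jaccard index, is an assumption standing in for the definition used in this study, and the sketch does not combine predictions from several models. All compound and protein names are placeholders.

```python
def target_overlap(targets_a, targets_b):
    """Jaccard index of two predicted target sets (an assumed overlap measure)."""
    a, b = set(targets_a), set(targets_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical predicted target sets; the identifiers are placeholders.
predicted_targets = {
    "compound_1": {"P1", "P2", "P3"},
    "compound_2": {"P2", "P3", "P4"},
    "compound_3": {"P9"},
}

names = sorted(predicted_targets)
similarity_matrix = [
    [target_overlap(predicted_targets[i], predicted_targets[j]) for j in names]
    for i in names
]
```

A matrix of this shape is what a method such as PCA would then take as input.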
2.7 Mass spectrometry and protein sequence data
The mass spectra used in this study were obtained from the NIST/EPA/NIH mass spectral library (NIST 05) (http://www.nist.gov/) incorporating 190 825 EI (Electron Impact) spectra for 163 198 chemical compounds. For protein sequence data, the UniProtKB/Swiss-Prot protein knowledgebase release 49.0 (Apweiler et al., 2004), containing 13 487 human proteins, was used as our amino acid sequence resource.
2.8 Experimental datasets
We constructed two experimental datasets, an adrenergic receptor (AR) drug dataset and a DrugBank dataset. The AR drug dataset was based on ARDB (http://ardb.bjmu.edu.cn/default.htm) as of February, 2006 and comprises 48 AR drugs, including 22 agonists and 26 antagonists, and 9 human ARs. Out of the total possible number (9 × 48 = 432) of protein–chemical compound pairs, 142 were found to be positive samples, or interacting protein–chemical pairs (see Supplementary Table 1), and the remaining 290 are considered negative or non-binding protein–chemical pairs. We regarded AR α1 targeted drugs as binding to three receptors in the AR α1 family (α1A, α1B and α1D). For example, if a drug x is known to bind only to AR β1, a pair (x, AR β1) is regarded as positive, and the other eight pairs such as (x, AR β2) and (x, AR α2A) are treated as negative samples.
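The labeling rule for the AR drug dataset can be sketched as follows: every drug is paired with every receptor, and a pair is positive only if the binding is known. The receptor names follow the usual nomenclature of the nine human ARs, while the drugs `x` and `y` and their binding annotations are hypothetical.

```python
from itertools import product

# The nine human adrenergic receptors (names written out in ASCII).
RECEPTORS = [
    "alpha1A", "alpha1B", "alpha1D",
    "alpha2A", "alpha2B", "alpha2C",
    "beta1", "beta2", "beta3",
]


def label_pairs(drugs, receptors, known_binding):
    """Label every (drug, receptor) pair: 1 if known to bind, otherwise -1."""
    return {
        (d, r): (1 if r in known_binding.get(d, set()) else -1)
        for d, r in product(drugs, receptors)
    }


# Hypothetical annotations: x binds only beta1; y is an alpha1-targeted drug
# and is therefore treated as binding to all three alpha1 receptors.
known_binding = {
    "x": {"beta1"},
    "y": {"alpha1A", "alpha1B", "alpha1D"},
}

pair_labels = label_pairs(["x", "y"], RECEPTORS, known_binding)
n_positive = sum(1 for v in pair_labels.values() if v == 1)
n_negative = len(pair_labels) - n_positive
```

Applied to all 48 drugs, the same rule yields the 9 × 48 = 432 pairs described above.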
The DrugBank dataset was constructed from Approved Drug Target Protein Sequences data, downloaded in February, 2006, from the DrugBank database (Wishart et al., 2006). These data consist of 519 approved drugs and their 291 associated target proteins, constituting 980 interacting pairs (see Supplementary Table 2). Examples of target proteins in this dataset are the dopamine receptor, COX2 and the sodium-dependent serotonin transporter. In this dataset, n random pairs of drugs and proteins, excluding positive pairs, are regarded as negative samples (n = 1000–8000).
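Random negative sampling of this kind can be sketched as below. The sketch assumes that negatives are drawn uniformly and without duplicates, which the text does not specify; the function name `sample_negatives`, the fixed seed and the toy identifiers are illustrative.

```python
import random


def sample_negatives(drugs, proteins, positives, n, seed=0):
    """Draw n distinct random (drug, protein) pairs that are not known positives."""
    rng = random.Random(seed)
    positives = set(positives)
    max_negatives = len(drugs) * len(proteins) - len(positives)
    if n > max_negatives:
        raise ValueError("not enough non-positive pairs to sample from")
    negatives = set()
    while len(negatives) < n:
        pair = (rng.choice(drugs), rng.choice(proteins))
        if pair not in positives:
            negatives.add(pair)
    return sorted(negatives)


# Toy example with placeholder identifiers.
drugs = ["d1", "d2", "d3", "d4"]
proteins = ["p1", "p2", "p3", "p4", "p5"]
positives = [("d1", "p1"), ("d2", "p3"), ("d4", "p5")]

negatives = sample_negatives(drugs, proteins, positives, n=10)
```

Changing the seed produces a different negative set, which is one way to obtain several models trained on different negative samples.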
| 3 RESULTS AND DISCUSSION |
|---|
3.1 Specific binding prediction
3.1.1 Evaluation of the method
We define the 'specific binding prediction problem' as the prediction of all possible interactions between the chemical compounds being tested and a specific family of proteins. We compare and contrast this with general binding prediction at a later stage in the text. It has often been observed that compounds designed against one protein target also demonstrate useful activities against other members of the same protein family. This suggests that the members of a particular protein family may often share a common essential binding mechanism. The aim of our specific binding model is to elucidate this shared mechanism and exploit it in the classification of protein–chemical pairs as binding and non-binding.
In our computational assessments of specific binding predictions, a prediction model for the human AR family was constructed from the AR drug dataset. The prediction performance of this model was then evaluated using 10-fold cross-validation and several prediction performance measurements (Table 1A).
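A 10-fold cross-validation split can be sketched as follows. The sketch assumes a simple shuffled split without stratification by class, which the text does not specify; 432 is the number of protein–chemical pairs in the AR drug dataset, and the function name `k_fold_indices` is hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


folds = k_fold_indices(432, k=10)

# Each fold serves once as the test set; the other nine form the training set.
splits = []
for i, test_idx in enumerate(folds):
    train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
    splits.append((train_idx, test_idx))
```

A model is trained on each training set and scored on the corresponding test set, and the performance measurements are then aggregated over the ten folds.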
Two main features of our proposed method are the representative description of proteins and compounds and the representation of a protein–chemical pair. In our current study, we proposed the representation of a protein–chemical pair by multiplication of several kernel functions [Equation (6)]. This type of representation gave a better performance (0.820 MCC) than just concatenating feature vectors to represent a pair (0.765 MCC) (Table 1A). This result indicates the importance of considering the crossover effects between different types of feature vectors.
Table 1A also shows the validity of using non-linear SVM for the classification of binding and non-binding protein–chemical pairs. As shown in Table 1A, SVM using the RBF kernel showed the best accuracy (89.8%) when the same combination of feature vectors and the same way of representing a pair (concatenation of vectors) was used (Table 1A). The logistic regression, the ANN and SVM with the linear kernel gave the same level of prediction performances (75–78% accuracy) (Table 1A).
We introduced four types of feature mappings; C for protein description, and D, F and G for representing chemical compounds. Mapping C is derived from the frequency of subsequences and physico-chemical properties of amino acids. As shown in Table 1A, this feature mapping of proteins showed a better performance (0.765 MCC) than the commonly used dipeptide frequency (0.716 MCC) with fewer features (199 versus 400).
For chemical compound description, mapping D is based on chemical structure data, and both F and G are derived from mass spectrometry data. The use of D gave very high prediction performances, such as 94.4% accuracy (Table 1A). The combination of F and G performed slightly below D but still achieved high performances, including a 92.1% prediction accuracy and an MCC of more than 0.8 (Table 1A). Moreover, the combination of D, F and G showed the best performances in Table 1A, including 0.889 MCC.
The three mappings D, F and G are based on a common principle: substructures extracted from a chemical compound are sufficiently representative of that compound that they can be used to elucidate the binding mechanism. Though mass spectra are less processed data than chemical structures, the peaks in the mass spectra for F and G can be interpreted as substructures, and the results show that this interpretation works sufficiently well.
In comparison with D, one possible disadvantage of using a combination of F and G is the existence of synonyms, or compounds whose chemical structures are different but whose molecular weights, or m/z values in the spectra, are equivalent. This is also thought to be the reason why G showed a lower performance (0.707 MCC) than F (0.793 MCC). On the other hand, one advantage of using the mapping method based on mass spectra is the existence of intensities that reflect the physical–chemical properties of each peak. In this regard, we performed an experiment using peak existence instead of peak intensity, and found that this produced a lower degree of accuracy (see Supplementary Table 4). Hence, based upon these performance assessments, the integration of the D, F and G mappings has the capacity to compensate for the limitations that are inherent in each individual mapping method and thus produce more accurate predictions (Tables 1A and 2).
Overall, the best result found in these analyses was a 95.1% accuracy (Table 1A). These very high values indicate that an essential binding mechanism shared among protein family members can be extracted statistically by SVM from a large dataset that contains adequate feature vectors for protein–chemical pairs.
3.1.2 Prediction of binding properties: classifications of agonism and antagonism
In our current study, we represented a sample by combining feature vectors for proteins and chemical compounds, and classified protein–chemical pairs to predict interactions between them. To show the effectiveness of this representation, we conducted the following experiment.
The AR drug dataset comprises 22 agonists constituting 73 receptor–agonist pairs, and 26 antagonists for 69 receptor–antagonist pairs. To predict whether a compound acts as an agonist or an antagonist, two types of classification tasks were performed. The first of these is a classification of agonist–receptor pair and antagonist–receptor pair in which a protein–chemical pair is the input, and the second is a classification of agonist and antagonist where only the chemical compounds are used as the input.
The results of this analysis are shown in Table 1B, and indicate that, for the prediction of either agonism or antagonism of the AR by different chemical compounds, our classification of protein–chemical pairs gave a better performance (0.986 MCC) than classification of chemical compounds alone (0.748 MCC). These findings suggest the usefulness of considering protein–chemical pairs.
Table 1B also suggests that some activating and non-activating binding mechanisms can be extracted from the feature vectors of protein-chemical pairs by SVM. Moreover, this method may be applied also to the prediction of other binding properties such as affinity, where samples are classified into two classes by fixed threshold or regression methods such as support vector regression.
3.1.3 Predictions based on different regions of proteins
An AR, which is also a GPCR, consists of three regions: TMHs (transmembrane helices), ELs (extracellular loops) and CLs (cytoplasmic loops). Moreover, the majority of the small-molecule drugs that have been developed interact with the seven transmembrane-spanning domains of GPCRs (Kristiansen, 2004). In our computational analysis of the AR drug binding predictions using each region of the GPCRs, the utilization of TMHs alone in mapping C gave a better performance (93.3% accuracy) than that of the whole sequence (92.1% accuracy), ELs (88.0% accuracy) or CLs (86.8% accuracy) (Table 1C).
This result may indicate the biological relevance of this protein–chemical interaction prediction. In addition, it suggests the possibility that our novel prediction method can successfully identify protein regions that are essential for the binding of small molecules.
3.2 General binding prediction
We define the 'general binding prediction problem' as the prediction of the interactions between chemical compounds and proteins belonging to different protein families. Hence, our general binding prediction model is designed to extract some of the underlying common binding mechanisms that are shared by several binding protein families and to utilize these for general protein–chemical interaction predictions.
In our computational experiments for general predictions, the general binding models were constructed from the DrugBank dataset. The prediction performances for different numbers of negative samples within this model were evaluated as shown in Table 2. This method achieved more than 80% accuracy for most negative sample numbers (Table 2). Based upon this relatively high performance, we conclude that some general binding mechanisms that are common to a number of protein families can be successfully detected by our proposed method and that its application enables a much wider range of predictions.
Though we used random pairs of drugs and proteins as negative samples in constructing a model, the lack of reliable negative samples is always a problem when applying statistical learning methods. In our current study, it is assumed that drugs in the DrugBank dataset rarely interact with proteins other than their known targets because they are approved drugs. Moreover, to assess the tolerance of our method to a negative sample set that accidentally contains positive drug–protein pairs, we conducted an experiment in which a fraction of positive samples was intentionally labeled as negatives (pseudo-negatives). We observed that those pseudo-negatives were still predicted as positives until the number of pseudo-negatives exceeded a certain level (see Supplementary Fig. 1). Hence, our proposed method is robust to a small fraction of unknown positives among the negatives, which may occur when using approved drugs.
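The construction of pseudo-negatives can be sketched as a simple relabeling step. The fraction, the seed and the toy label vector below are arbitrary illustration values, and the function name `make_pseudo_negatives` is hypothetical.

```python
import random


def make_pseudo_negatives(labels, fraction, seed=0):
    """Relabel a fraction of positive samples (+1) as negatives (-1).

    Returns the noisy label list and the indices that were flipped, so
    that predictions for the flipped samples can be inspected afterwards.
    """
    rng = random.Random(seed)
    positive_idx = [i for i, y in enumerate(labels) if y == 1]
    n_flip = int(round(fraction * len(positive_idx)))
    flipped = sorted(rng.sample(positive_idx, n_flip))
    noisy = list(labels)
    for i in flipped:
        noisy[i] = -1
    return noisy, flipped


# Toy label vector: 20 positives followed by 30 negatives.
labels = [1] * 20 + [-1] * 30
noisy_labels, flipped = make_pseudo_negatives(labels, fraction=0.1)
```

Training on `noisy_labels` and checking how many of the `flipped` samples are still predicted as positive, for increasing values of `fraction`, mirrors the robustness experiment described above.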
3.3 Genome-wide target protein prediction
One of the advantages of our proposed method is that screening target proteins for a chemical compound can be performed on a genome-wide scale. This is because our method can be applied to all proteins whose amino acid sequences have been determined, even when 3D structural data are not yet available. Furthermore, our method can also be applied to chemical compounds that have been identified by high-throughput analysis using MS, but whose chemical structures have yet to be determined. These advantages of our novel prediction methodology may therefore facilitate the identification of unknown functions of novel chemical compounds by using their predicted target proteins as characterization profiles. Additionally, further predictions of possible adverse effects of chemical agents may be made by identifying unexpected protein targets.
We conducted genome-wide target protein predictions for MDMA from a pool of 13 487 human proteins (Table 3A,B and see Supplementary Table 5). For this purpose, we used our general binding prediction model, exploiting mapping C, F and G with 2000 artificially generated negative samples. The number of negative samples was set at 2000 as this gave the best MCC score (Table 2). MDMA, or ecstasy, is one of the best known psychoactive drugs, but is also believed to be effective in the treatment of post-traumatic stress disorder (PTSD).
MDMA was predicted to bind to 56 different proteins among the 13 487 proteins screened using our model, and the 5 proteins with the highest binding probabilities are listed in Table 3A. MDMA was correctly predicted to bind to sodium-dependent serotonin transporter (5HTT), and this binding prediction is validated by the existing evidence that MDMA stimulates serotonin secretion and exhibits psychoactivity by binding to 5HTT (Rudnick and Wall, 1992). Moreover, our specific binding prediction model, constructed from the AR drug dataset, predicted that MDMA binds to the α-1 AR family and activates its members (Table 3B). This is also biologically correct, as MDMA-induced hyperthermia is known to be caused by the activation of α-1 ARs, in conjunction with the β-3 AR (Spargue et al., 2003).
It is noteworthy that the known binding of MDMA to β-3 AR is not predicted by our method, but this may be due to the lack of positive samples containing this receptor.
Overall, we conclude that our current prediction results indicate the biological plausibility of undertaking genome-wide analyses using our proposed novel method.
3.4 Comparisons with the similarity-based search method
Sequence similarities between these predicted target proteins of MDMA were relatively low (see Supplementary Table 6). For example, 5HTT (P31645) and AR α-1A (P35348) showed only 10% sequence similarity though both were reported to interact with MDMA (Rudnick and Wall, 1992; Spargue et al., 2003). On the other hand, the similar chemical structure search of MDMA, which was conducted by the DrugBank web service (Wishart et al., 2006), showed no approved drugs that had 5HTT (P31645) as their target (see Supplementary Table 7).
These results suggest that our method can identify novel target proteins or chemical compounds that are not similar to known targets and that are not found by similarity-based search methods such as BLAST.
In research on protein family detection, it has been shown that kernel methods such as SVM can detect remote protein evolutionary and structural relationships more sensitively and more specifically than simple sequence similarity-based methods such as PSI-BLAST (Leslie et al., 2004; Liao and Noble, 2003). Therefore, we conclude that the use of the kernel method and the consideration of multiple types of interactions between proteins and chemical compounds are effective in comprehensive protein–compound interaction prediction.
3.5 Interactomical profile
By utilizing genome-wide target protein predictions, it will also be possible to classify chemical compounds according to their predicted protein targets and this profile may also be used to classify their functions.
In this context, we applied PCA to the distances between compounds in terms of the overlaps between their target proteins (Fig. 3). Based upon the PCA results shown in Figure 3, it is clear that there are boundaries separating one group of chemical compounds that includes psychoactive drugs, such as LSD, MDMA and PCP, from the other groups, which include coenzyme Q10 and flavanone, compounds that have a number of effects in the body but do not act on neural systems. In addition, strong similarities between LSD, PCP and chinoform, which has been reported to cause a serious neuropathy called SMON, are suggested by these analyses.
| 4 CONCLUSION |
|---|
In this study, we first showed the high performance of predictions of protein–chemical interactions using SVM and several types of feature vectors derived from very general data. Then, we applied our method to the genome-wide target protein prediction of several compounds to validate its biological significance.
The fact that our method achieved very high prediction performances with the most general data, i.e. amino acid sequences and chemical structures (Tables 1A and 2), suggests that comprehensive binding prediction between all the proteins and all the chemical compounds in large databases is feasible. This type of application could contribute to the repurposing of known small molecules and the elucidation of mechanisms of drug side effects.
Our method with the mass spectrometry data showed the same level of prediction performance as that using the chemical structure data (Tables 1A and 2). Mass spectrometry data have been rapidly produced by comprehensive metabolite analyses mainly to quantitate known chemical compounds. These analyses have also produced many spectra whose corresponding chemical structures are unknown. Our method could be used to predict functions of these unknown chemical compounds with the profiles of predicted target proteins (Table 3). In addition, predicted functions would be of use in deciding the priority order for determining the chemical structures of unknown spectra. Determined chemical structures, combined with mass spectra, would improve the prediction accuracy (Tables 1A and 2), and further elucidate the biological roles of chemical compounds.
Moreover, in addition to comprehensive metabolite analyses, MS methods have now been exploited to obtain high-throughput profiles of glycans from cells and tissues (An et al., 2003). This indicated a possible application of a method that incorporated such MS data to the prediction of glycosylation, or the attachment of glycans or carbohydrates to proteins. Since glycosylation is the most significant and active post-translational modification in the cell, this approach could be developed into more precise protein–chemical interaction prediction method to identify unknown functions of small molecules.
In our present report, we used EI mass spectrometry data due to data availability although EI-MS spectra have some weakness of abundance and reproducibility. However, our method is general enough to be applied to MS/MS spectra which show many fragments representing chemical substructures as EI-MS spectra do and which will be produced and accumulated rapidly in the comprehensive metabolite analyses such as CE-MS. Therefore, our approach could be one of effective ways to directly exploit mass spectrometry data that will be produced at ever increasing speed.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
This work is supported in part by the Grant Program for Bioinformatics Research and Development of the Japan Science and Technology Agency, Grant-in-Aid for Scientific Research on Priority Area No. 17018029 and Grant-in-Aid for Scientific Research (B) No. 16300095. Funding to pay the Open Access publication charges was provided by the Grant Program for Bioinformatics Research and Development of the Japan Science and Technology Agency.
| FOOTNOTES |
|---|
Associate Editor: Jonathan Wren
Received on February 26, 2007; revised on April 21, 2007; accepted on May 10, 2007
| REFERENCES |
|---|
|
|
|---|
An HJ, et al. Determination of N-glycosylation sites and site heterogeneity in glycoproteins. Anal. Chem. (2003) 75: 5628–5637.
Apweiler R, et al. UniProt: the universal protein knowledgebase. Nucleic Acids Res. (2004) 32: D115–D119.
Bhasin M, Raghava GPS. GPCRpred: an SVM-based method for prediction of families and subfamilies of G-protein coupled receptors. Nucleic Acids Res. (2004) 32: W383–W389.
Bock JR, Gough DA. Predicting protein-protein interactions from primary structure. Bioinformatics (2001) 17: 455–460.
Chang C-C, Lin C-J. LIBSVM: a library for support vector machines. (2001) Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Clark M. Generalized fragment-substructure based property prediction method. J. Chem. Inf. Model. (2005) 45: 30–38.
Cristianini N, Shawe-Taylor J. An Introduction to Support Vector Machines. (2000) Cambridge, UK: Cambridge University Press.
Firth D. Bias reduction of maximum likelihood estimates. Biometrika (1993) 80: 27–38.
Gomez SM, et al. Learning to predict protein-protein interactions. Bioinformatics (2003) 19: 1875–1881.
Hosmer DW, Lemeshow S. Applied Logistic Regression. (2000) New York: Wiley.
Jones G, et al. Development and validation of a genetic algorithm for flexible docking. J. Mol. Biol. (1997) 267: 727–748.
Klabunde T, Hessler G. Drug design strategies for targeting G protein-coupled receptors. Chem. Bio. Chem. (2002) 3: 928–944.
Kristiansen K. Molecular mechanisms of ligand binding, signaling and regulation within G-protein-coupled receptors: molecular modeling and mutagenesis approaches to receptor structures and function. Pharmacol. Ther. (2004) 103: 21–80.
Leslie CS, et al. Mismatch string kernels for discriminative protein classification. Bioinformatics (2004) 20: 467–476.
Liao L, Noble WS. Combining pairwise sequence similarity and support vector machines for detecting remote protein evolutionary and structural relationships. J. Comput. Biol. (2003) 10: 857–868.
Martin S, et al. Predicting protein-protein interactions using signature products. Bioinformatics (2005) 21: 218–226.
Merlot C, et al. Chemical substructures in drug discovery. Drug Discov. Today (2003) 8: 594–602.
Morris GM, et al. Automated docking using a Lamarckian genetic algorithm and an empirical binding free energy function. J. Comput. Chem. (1998) 19: 1639–1662.
Nagamine N, et al. Identifying cooperative transcriptional regulations using protein-protein interactions. Nucleic Acids Res. (2005) 33: 4828–4837.
Palczewski K, et al. Crystal structure of rhodopsin: a G protein-coupled receptor. Science (2000) 289: 739–745.
Platt J. Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In: Smola A, et al., eds. Advances in Large Margin Classifiers. (2000) Cambridge, MA: MIT Press, 61–74.
Ripley BD. Pattern Recognition and Neural Networks. (1996) Cambridge, UK: Cambridge University Press.
Rudnick G, Wall SC. The molecular mechanism of "ecstasy" [3,4-methylenedioxymethamphetamine, MDMA]: serotonin transporters are targets for MDMA induced serotonin release. Proc. Natl Acad. Sci. USA (1992) 89: 1817–1821.
Schwikowski B, et al. A network of protein-protein interactions in yeast. Nat. Biotechnol. (2000) 18: 1257–1261.
Shoichet BK, et al. Molecular docking using shape descriptors. J. Comput. Chem. (1992) 13: 380–397.
Soga T, et al. Simultaneous determination of anionic intermediates for Bacillus subtilis metabolic pathways by capillary electrophoresis electrospray ionization mass spectrometry. Anal. Chem. (2002) 74: 2233–2239.
Sprague JE, et al. Hypothalamic-pituitary-thyroid axis and sympathetic nervous system involvement in the hyperthermia induced by 3,4-methylene-dioxymethamphetamine (MDMA, Ecstasy). J. Pharmacol. Exp. Ther. (2003) 305: 159–166.
Swamidass SJ, et al. Kernels for small molecules and the prediction of mutagenicity, toxicity and anti-cancer activity. Bioinformatics (2005) 21: 359–368.
Teschendorff AE, et al. A variational Bayesian mixture modelling framework for cluster analysis of gene-expression data. Bioinformatics (2005) 21: 3025–3033.
Vapnik VN. Statistical Learning Theory. (1998) New York: John Wiley and Sons.
Venkatarajan MS, Braun W. New quantitative descriptors of amino acids based on multidimensional scaling of a large number of physical-chemical properties. J. Mol. Model. (2001) 7: 445–453.
Wishart DS, et al. DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res. (2006) 34: D668–D672.
Xiao X, et al. Using pseudo amino acid composition to predict protein structural classes: approached with complexity measure factor. J. Comput. Chem. (2006) 27: 478–482.
Yu C-S, et al. Prediction of protein subcellular localization. PROTEINS: Struct. Funct. Bioinform. (2006) 64: 643–651.
Zernov VV, et al. Drug discovery using support vector machines. The case studies of drug-likeness, agrochemical-likeness, and enzyme inhibition predictions. J. Chem. Inf. Comput. Sci. (2003) 43: 2048–2056.