Bioinformatics Advance Access originally published online on July 12, 2006
Bioinformatics 2006 22(17):2099-2106; doi:10.1093/bioinformatics/btl352

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Mutagenic probability estimation of chemical compounds by a novel molecular electrophilicity vector and support vector machine

Mingyue Zheng 1, Zhiguo Liu 1, Chunxia Xue 1, Weiliang Zhu 1, Kaixian Chen 1, Xiaomin Luo 1,* and Hualiang Jiang 1,2,*

1 Shanghai Institute of Materia Medica, Shanghai Institutes of Biological Sciences, Chinese Academy of Sciences, 555 Zu Chong Zhi Road, Shanghai 201203, China
2 School of Pharmacy, East-China University of Science and Technology, Shanghai 200237, China

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 5 CONCLUSIONS
 REFERENCES
 

Motivation: Mutagenicity is among the toxicological end points of highest concern. The accelerated pace of drug discovery has heightened the need for efficient prediction methods. Most currently available tools fall short of the desired accuracy and provide only a binary classification. A discriminative and informative model for mutagenicity prediction is therefore needed.

Results: Here we developed a mutagenic probability prediction model addressing the problem, based on datasets covering a large chemical space. A novel molecular electrophilicity vector (MEV) is first devised to represent the structure profile of chemical compounds. An extended support vector machine (SVM) method is then used to derive the posterior probabilistic estimation of mutagenicity from the MEVs of the training set. The results show that our model gives a better performance than TOPKAT (http://www.accelrys.com) and other previously published methods. In addition, a confidence level related to the prediction can be provided, which may help people make more flexible decisions on chemical ordering or synthesis.

Availability: The binary program (ZGTOX_1.1) based on our model and samples of input datasets on Windows PC are available at http://dddc.ac.cn/adme upon request from the authors.

Contact: hljiang{at}mail.shcnc.ac.cn; xmluo{at}mail.shcnc.ac.cn


    1 INTRODUCTION
 
The development of drugs depends on finding compounds that have beneficial effects with minimal toxicity. During the past two decades, drug discovery techniques such as combinatorial chemistry and high-throughput screening (HTS) have made substantial progress in identifying hits at an early stage. However, toxicity issues are still significant factors in late-stage drug failure (Caldwell et al., 2001). At present, drug regulatory authorities require a variety of toxicological tests for safety assessment. In addition to their financial cost and laborious procedures, these tests generally have low throughput, and hence can be used neither to assess drug toxicity at the early stage of discovery nor to detect adverse drug events prior to widespread clinical use (Johnson and Wolfgang, 2000). Recently, one area of particular interest has been the application of in silico toxicological models to supplement in vitro and in vivo testing. Previous toxicological studies have yielded a large amount of structure-activity relationship (SAR) information, and numerous software packages, including statistical tools, support the generation of molecular descriptors, fragmentation patterns and the like for the development of predictive models (Benfenati and Gini, 1997; Benigni, 2005; Fielden et al., 2002; Greene, 2002; Helma, 2005; Johnson and Wolfgang, 2000).

Mutagenicity is the ability of a compound to cause mutations in DNA, which is one of the toxicological liabilities closely evaluated in drug discovery. On one hand, the standard test for mutagenicity determinations, the Ames test, has become a regulatory requirement for drug approval. On the other hand, an increasing body of evidence has been amassed revealing a significant correlation between positive Ames test results and rodent carcinogenicity. A successful in silico model of mutagenicity thus might also serve as a predictor for rodent carcinogenicity (Kim and Margolin, 1999; Zeiger et al., 1990). To develop a predictive toxicity model, one should select a dataset that is limited to a predominating toxic mechanism. From this point of view, mutagenicity should also be easier to predict than other types of toxicity because of its relatively simple mechanisms (Snyder and Smith, 2005).

To date, several computational tools for mutagenicity estimation have been developed, including Deductive Estimation of Risk from Existing Knowledge (DEREK), Multiple Computer Automated Structure Evaluation (MCASE) and Toxicity Prediction by Komputer Assisted Technology (TOPKAT). Details about these computational systems can be found elsewhere (Greene, 2002). Although there is an optimistic expectation that these tools might someday reduce or eliminate the need for experimental testing, we are, in fact, a long way from this (Snyder and Smith, 2005). Recent comparative studies showed that all of the above-mentioned programs still have limited predictive capability, especially in terms of overall sensitivity for detecting Ames positives: only 43 to 52% of positives were correctly identified. Similarly poor sensitivity was reported for these programs in the assessment of proprietary pharmaceuticals (White et al., 2003).

In addition to these commercial software packages, numerous methods for mutagenicity prediction have been presented in the literature, which can be broadly categorized as knowledge-based or statistics-driven models. Generally, the knowledge-based approach is likely to provide a mechanistic basis, but its predefined fragments or rules are expressions of existing knowledge rather than of new knowledge. Statistics-driven methods offer the advantage of extending existing knowledge without being biased towards particular mechanisms of toxic action, but their performance is limited by the quality of the molecular descriptors, the diversity of the training and test data, and the efficiency of the statistical learning algorithm. Recently, Helma et al. (2004) demonstrated that the predictive accuracies of models using a molecular feature miner algorithm (MOLFEA) are ~10–15% higher than those using molecular properties alone. In that study, different statistical learning methods were employed, and the support vector machine (SVM) gave the best results, with a predictive accuracy of 78% in 10-fold cross-validation (CV).

Apart from these performance limitations, current in silico approaches make little attempt at probabilistic prediction. Mutagenicity prediction is usually treated as a binary classification problem, but discrete class labels alone are often insufficient in practical applications. For example, in the course of ordering a compound library or guiding organic synthesis, a positive prediction might simply result in the elimination of a promising drug candidate. A probabilistic prediction, by contrast, could not only help researchers make more flexible decisions, but also give them a measured confidence level for those decisions.

Therefore, the primary aim of this study is to present a discriminative and informative model for mutagenicity prediction. To this end, we have developed a mutagenic probability prediction model based on a novel molecular representation method and an extended SVM algorithm.

In the development of SAR models, it is essential to select the structural or chemical properties most relevant to the end point of interest. For mutagenicity, the choice of descriptors should fully account for the possible mechanisms. Different mutagenic mechanisms have been reported, which typically arise from direct chemical/DNA interaction and depend largely on electrophilicity. Recently, a QSAR study (Lewis et al., 2003) demonstrated that electrophilicity is also important for compounds with indirect mutagenic activity, which arises when non-reactive compounds are transformed into DNA-reactive metabolites. For these reasons, we designed a set of atomic indices and the molecular electrophilicity vector (MEV) to delineate the electrophilicity profile of chemicals.

A successful SAR model also relies on the efficiency of the algorithm used to formulate the final mathematical relationship. Here SVM is investigated in particular because of its good performance. Numerous recent classification studies have demonstrated that SVM gives, to various degrees, better prediction accuracy than other supervised statistical learning methods (Bock and Gough, 2001; Doniger et al., 2002; Li et al., 2005; Lo et al., 2005). In addition, an extended SVM can also produce class probability estimates, as illustrated by Wu and Lin (2004) and implemented by Chang and Lin (2001; http://www.csie.ntu.edu.tw/~cjlin/libsvm).

To adequately assess the prediction accuracy of the methods used in this study, two different evaluation methods were used. One is 10-fold CV and the other is the use of an external independent validation set, which is also predicted by employing the program TOPKAT for comparison.


    2 METHODS
 
2.1 Mutagenicity dataset
The training set used in the present study was collected from the literature (Kazius et al., 2005) and consists of 4337 chemical compounds, each annotated as mutagenic or non-mutagenic in the Salmonella typhimurium strains required for the regulatory evaluation of drugs. The categorization protocol of Kazius et al. (2005) was applied, which indicated that 55% of the dataset was mutagenic (2401 mutagens and 1936 non-mutagens). The validation dataset was collected from various public sources such as EPA (http://www.epa.gov) and NIH (http://www.nih.gov). After the removal of compounds already present in the training set, mixtures and resonance structures, 592 compounds were obtained, 54.90% of them mutagenic. Molecules in both datasets are structurally diverse, laying a solid foundation for a robust predictive model.

2.2 MEV generation
The toxicophore approach relies on identifying patterns of substructural fragments relevant to toxicity, which provides an easy avenue for incorporating experts' knowledge and experience. Following this idea, MEV is also designed to relate mutagenicity to molecular structural composition. In addition to substructural fragments, molecular composition in MEV is mainly described in terms of atomic electrophilicity, which is especially suited to the mutagenicity mechanism. The generation of MEV involves three main steps: atom typing for each molecule in the dataset; calculation of atomic descriptors for each atom type; and construction of the feature vector from the resulting atom types and properties.

To allow for portability and simple implementation, all atom types were represented as SMARTS strings (Table 1). For each atom, the type was determined by its own chemical properties and by the neighboring atoms and bonds that reflect its chemical environment. The programmable atom typer (PATTY) backtracking algorithm (Bush and Sheridan, 1993) included in OpenBabel (http://openbabel.sourceforge.net) was then employed to perform the type assignment. Using SMARTS and PATTY, flexible and efficient atom type specifications could be made in terms that are meaningful to chemists and toxicologists.


Table 1 The 52 atom typing rules and 3 toxicophore indicators, with brief description and SMARTS notations.

 
In MEV, five types of electronic parameters were calculated to characterize electrophilicity. One of the most direct electrophilicity indices is the frontier orbital electron density on an atom, which provides a useful means for the detailed characterization of donor–acceptor interactions (Prabhakar, 1991; Tuppurainen et al., 1991). Since most DNA attacks are via electrophilic species and the charge transfer is generally from the electron-rich base pairs of DNA, only the nucleophilic electron density (fN) on chemical agents is calculated.

Another important index is superdelocalizability (Fukui, 1975; Fukui et al., 1957), which reflects the ability to accept or donate electron density and can be used as an index of the reactivity of occupied and unoccupied orbitals. This parameter is useful in the characterization of soft molecular interactions (Brown and Simas, 1982; Kikuchi, 1987) and in the comparison of corresponding atoms in different molecules (Kikuchi, 1987). In the present study, the nucleophilic superdelocalizability (SN) was calculated, which describes the interactions with the nucleophilic center in the second reactant (DNA structures) (Brown and Simas, 1982).

Atomic partial charges are the driving force of electrostatic interactions between molecules. Orbital parameters such as superdelocalizability represent dynamic reactivity indices (Franke, 1984), while the electrical charges, describing isolated molecules in their ground state, can be treated as static indices. Here we calculated three types of partial charges for each atom: σ-charge (qσ), π-charge (qπ) and total charge (q). These descriptors were employed as measurements of weak intermolecular interactions with DNA.

The OBGastChrg module in OpenBabel was responsible for the assignment of qσ to a molecule according to the Gasteiger–Marsili charge model (Gasteiger and Marsili, 1980); a user-defined C++ module, OBHMO, was designed and implemented for the calculation of qπ, fN and SN on the basis of the semiempirical Hückel method (Hückel, 1931; Streitweiser, 1961). For a given atom r, qπ,r is calculated by the following function:

$$q_{\pi,r} = k_r - \sum_{i=1}^{m} n_i C_{r,i}^{2} \qquad (1)$$
where m is the number of molecular orbitals (MOs); kr and ni are the numbers of π-electrons provided by the atom r and residing in the i-th MO, respectively; and Cr,i is the linear combination of atomic orbitals (LCAO) coefficient of r in the i-th π MO. fN,r is obtained in the form

$$f_{N,r} = C_{r,\mathrm{LUMO}}^{2} \qquad (2)$$
which is also called the LUMO frontier electron density. SN,r is given by

$$S_{N,r} = 2 \sum_{j=occ+1}^{m} \frac{C_{r,j}^{2}}{E_j} \qquad (3)$$
where occ represents the number of occupied MOs and Ej is the energy of the j-th (unoccupied) MO. Parameters used to derive the Coulomb and resonance integrals were taken from the literature (Purcell and Singer, 1967).
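As an illustration of how the three Hückel-based indices can be evaluated, the following is a minimal Python sketch for an unbranched polyene, for which the Hückel problem has a closed-form solution. It is not the authors' OBHMO module. The function names, the choice of energy units (α = 0, β = −1), one π-electron per carbon, and the prefactors used for fN (squared LUMO coefficient) and SN (twice the sum over unoccupied MOs) are all assumptions made for this example.

```python
import math


def huckel_linear_polyene(n_atoms, n_pi_electrons):
    """Analytic Hueckel solution for an unbranched chain of n_atoms carbons.

    Energies are expressed relative to alpha in units of |beta|
    (alpha = 0, beta = -1), so bonding MOs are negative and antibonding
    MOs are positive. Returns (orbitals, occupations), where orbitals is a
    list of (energy, coefficients) sorted by increasing energy.
    """
    norm = math.sqrt(2.0 / (n_atoms + 1))
    orbitals = []
    for k in range(1, n_atoms + 1):
        energy = -2.0 * math.cos(k * math.pi / (n_atoms + 1))
        coeffs = [norm * math.sin(r * k * math.pi / (n_atoms + 1))
                  for r in range(1, n_atoms + 1)]
        orbitals.append((energy, coeffs))
    orbitals.sort(key=lambda orbital: orbital[0])

    # Fill the MOs from the bottom with at most two electrons each.
    occupations = []
    remaining = n_pi_electrons
    for _ in orbitals:
        electrons = min(2, remaining)
        occupations.append(electrons)
        remaining -= electrons
    return orbitals, occupations


def atomic_indices(orbitals, occupations, k_r=1):
    """Return a list of (q_pi, f_N, S_N) tuples, one per atom.

    Assumed conventions (hypothetical, for illustration only):
      q_pi = k_r - sum_i n_i * C[r][i]**2
      f_N  = C[r][LUMO]**2
      S_N  = 2 * sum over unoccupied MOs of C[r][j]**2 / E_j
    S_N is undefined if an unoccupied MO has zero energy (non-bonding MO).
    """
    n_atoms = len(orbitals)
    occ = sum(1 for electrons in occupations if electrons > 0)
    if occ >= n_atoms:
        raise ValueError("no unoccupied MO available")
    _, lumo = orbitals[occ]
    indices = []
    for r in range(n_atoms):
        q_pi = k_r - sum(n_i * coeffs[r] ** 2
                         for (_, coeffs), n_i in zip(orbitals, occupations))
        f_n = lumo[r] ** 2
        s_n = 2.0 * sum(coeffs[r] ** 2 / energy
                        for energy, coeffs in orbitals[occ:])
        indices.append((q_pi, f_n, s_n))
    return indices


# Butadiene: four carbons, four pi-electrons.
butadiene = atomic_indices(*huckel_linear_polyene(4, 4))
```

For butadiene the π-charges vanish on every atom, as expected for a neutral alternant hydrocarbon, and the LUMO density is larger on the terminal carbons than on the inner ones.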

In addition to direct or indirect reactivity toward DNA, mutagenicity can also be caused by intercalation of a compound containing a polycyclic aromatic ring system into DNA, resulting in a distortion of the DNA structure (Garrett and Grisham, 1995). Mutagenicity arising from this mechanism can be predicted by a toxicophore approach, in which only three predefined rules give accuracies ranging from 93 to 95% (Kazius et al., 2005). Therefore, three additional bits were reserved in MEV (BAY, K and POLY in Table 1) to hold the presence/absence information of these predefined substructures. In this way, validated toxicophores can be easily incorporated into our method, which thereby retains the merits of the fragment-based approach.

With the electrophilicity descriptors calculated for each atom and the toxicophore retrieval results in hand, the next step of the procedure is the composition of MEV, as summarized below. Given an input molecule M, a float array VM of length N × 5 + 3 (52 × 5 + 3 = 263) was formed by the developed routine, where N is the number of electrophilicity-related atom types. All bits of VM were initially set to zero, and every five bits were assembled into a subset corresponding to a particular atom type. All atoms of M were then sorted according to the specified atom types, so that atoms of identical type are always mapped to the same subset of the array. In the subset related to an atom type A, each bit was allocated to one type of descriptor (FA), and its final value is the sum of the FA values of all type A atoms present in M. For the bit corresponding to a predefined toxicophore B, the value FB was set to 1 if B is present in M and 0 if not. The resulting VM is the MEV for the molecule M, and the bin occupancies are the descriptor variables encoding molecular electrophilicity and substructure information. The whole process of MEV generation is shown in Figure 1, with the compound aniline as an example.
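The composition step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Atom` record, the order of the five descriptors inside each subset, the position of the three toxicophore bits at the end of the array, and the function name `build_mev` are hypothetical, and the atom typing and descriptor values are taken as already computed.

```python
from collections import namedtuple

N_ATOM_TYPES = 52
# The ordering of descriptors within a subset is an assumption.
DESCRIPTORS = ("S_N", "f_N", "q_sigma", "q_pi", "q")
TOXICOPHORES = ("BAY", "K", "POLY")

# Hypothetical record: type_index is the 0-based atom type (or None if the
# atom matches no typing rule); values maps descriptor name -> float.
Atom = namedtuple("Atom", ["type_index", "values"])


def build_mev(atoms, toxicophore_flags):
    """Compose a 263-element MEV from typed atoms and toxicophore flags."""
    width = len(DESCRIPTORS)
    mev = [0.0] * (N_ATOM_TYPES * width + len(TOXICOPHORES))

    # Atoms of the same type accumulate into the same five-slot subset.
    for atom in atoms:
        if atom.type_index is None:
            continue
        base = atom.type_index * width
        for offset, name in enumerate(DESCRIPTORS):
            mev[base + offset] += atom.values[name]

    # Presence/absence bits for the predefined toxicophores.
    tail = N_ATOM_TYPES * width
    for offset, name in enumerate(TOXICOPHORES):
        mev[tail + offset] = 1.0 if toxicophore_flags.get(name, False) else 0.0
    return mev
```

Because descriptor values are summed per atom type, the vector has a fixed length regardless of molecule size, which is what allows it to be fed directly to a kernel method.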


Fig. 1 Transformation of a chemical structure (aniline for example) to its characteristic MEV.

 
2.3 Support vector machine
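MEVs generated for the molecules in the training set were then input to an SVM routine to determine the best model. Details of the theory of SVM can be found in the literature (Burges, 1998; Vapnik, 1995). Basically, for a given dataset $x_i \in \mathbb{R}^n$ ($i = 1, \ldots, N$) with corresponding labels $y_i$ ($y_i = +1$ or $-1$, representing the two classes, mutagen versus non-mutagen in this study), SVM gives a decision function (classifier):

$$f(x) = \mathrm{sign}\big(g(x)\big), \qquad g(x) = \sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b \qquad (4)$$
where $\alpha_i$ is the coefficient to be learned, K is a kernel function and b is the bias term. The coefficients $\alpha_i$ are trained by maximizing the Lagrangian expression given below:

$$L(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \qquad (5)$$
subject to $0 \le \alpha_i \le C$ ($i = 1, \ldots, N$) and $\sum_{i=1}^{N} \alpha_i y_i = 0$.

Platt's approach was used to derive posterior probabilities for the estimated class membership $f(x_i)$ of observation $x_i$ (Platt, 1999). A sigmoid function is fitted to all estimated decision values $\hat{g}(x_i)$ to derive probabilities of the form

$$P\big(y = 1 \mid \hat{g}\big) = \frac{1}{1 + \exp(A\hat{g} + B)} \qquad (6)$$
where A and B are estimated by minimizing the negative log-likelihood of the training data:

$$\min_{A,B} \; -\sum_{i=1}^{N} \big[ t_i \log p_i + (1 - t_i) \log (1 - p_i) \big], \qquad p_i = \frac{1}{1 + \exp\big(A\hat{g}(x_i) + B\big)}$$
where $t_i$ denotes the target probability assigned to observation i according to its label $y_i$.

Labels and decision values $\hat{g}(x_i)$ are required to be independent, so a 5-fold CV was conducted to obtain the decision values.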
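To make the sigmoid fitting step concrete, the following Python sketch fits A and B by plain gradient descent on the negative log-likelihood. It is a simplified stand-in, not the LibSVM implementation: it uses hard 0/1 targets rather than smoothed targets, a fixed step size rather than a Newton-type solver, and the function names, step count and learning rate are assumptions chosen for the example. The decision values passed in are assumed to come from held-out folds.

```python
import math


def platt_probability(g, a, b):
    """P(y = +1 | g) = 1 / (1 + exp(a*g + b)), evaluated stably."""
    z = a * g + b
    if z >= 0:
        ez = math.exp(-z)
        return ez / (1.0 + ez)
    return 1.0 / (1.0 + math.exp(z))


def fit_platt(decision_values, labels, steps=2000, lr=0.1):
    """Fit the sigmoid parameters (A, B) by gradient descent.

    labels are +1 / -1; targets are 1 for +1 and 0 for -1 (an assumption;
    Platt's original method uses slightly smoothed targets).
    """
    targets = [1.0 if y > 0 else 0.0 for y in labels]
    n = len(decision_values)
    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = 0.0
        grad_b = 0.0
        for g, t in zip(decision_values, targets):
            p = platt_probability(g, a, b)
            # For z = a*g + b, the derivative of the per-sample negative
            # log-likelihood with respect to z is (t - p).
            grad_a += (t - p) * g
            grad_b += (t - p)
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


# Toy, non-separable decision values (hypothetical data).
toy_values = [-2.0, -1.5, -1.0, -0.5, 0.3, 0.5, 1.0, 1.5, 2.0, -0.2]
toy_labels = [-1, -1, -1, -1, -1, 1, 1, 1, 1, 1]
A, B = fit_platt(toy_values, toy_labels)
```

With this parameterization a well-behaved fit has A < 0, so that larger decision values map to higher mutagenic probabilities.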

The LibSVM package (version 2.81) (Chang and Lin, 2001) was used in this study. To obtain an SVM classifier with optimal performance, the penalty parameter C and the radial basis function (RBF) kernel parameter γ were tuned on the training set using the grid search strategy in LibSVM.
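A grid search of this kind can be sketched as an exhaustive loop over candidate parameter pairs. The Python example below is a hedged illustration only: the exponent ranges, the step of 0.25 and the scoring function `toy_cv_accuracy` are hypothetical stand-ins (a real run would train an SVM and return its cross-validated accuracy). The choice of a power-of-two grid is an assumption; it is at least consistent with the parameter values reported in the Discussion, 181.0193 and 0.003285, which are numerically close to 2^7.5 and 2^-8.25.

```python
import math


def log2_grid(first_exp, last_exp, step):
    """Return 2**e for e = first_exp, first_exp + step, ..., last_exp."""
    count = int(round((last_exp - first_exp) / step)) + 1
    return [2.0 ** (first_exp + k * step) for k in range(count)]


def grid_search(cv_accuracy, c_values, gamma_values):
    """Evaluate every (C, gamma) pair and return (best_score, C, gamma)."""
    best = None
    for c in c_values:
        for gamma in gamma_values:
            score = cv_accuracy(c, gamma)
            if best is None or score > best[0]:
                best = (score, c, gamma)
    return best


def toy_cv_accuracy(c, gamma):
    """Hypothetical stand-in for a cross-validated accuracy surface.

    It simply peaks at log2(C) = 7.5 and log2(gamma) = -8.25.
    """
    return -((math.log2(c) - 7.5) ** 2 + (math.log2(gamma) + 8.25) ** 2)


best_score, best_c, best_gamma = grid_search(
    toy_cv_accuracy,
    log2_grid(-5, 15, 0.25),   # candidate C values (assumed range)
    log2_grid(-15, 3, 0.25),   # candidate gamma values (assumed range)
)
```

Searching on a logarithmic scale keeps the number of evaluations manageable while covering several orders of magnitude in both parameters.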

2.4 Feature selection
An F-score based recursive feature elimination (RFE) method (Guyon et al., 2002) was employed to rank the features and select those with the highest contribution to mutagenicity. For the training MEVs $x_i$ ($i = 1, \ldots, N$), if the numbers of positive and negative instances are $N_+$ and $N_-$, respectively, then the F-score of the j-th feature is defined as follows:

$$F(j) = \frac{\left(\bar{x}_j^{(+)} - \bar{x}_j\right)^2 + \left(\bar{x}_j^{(-)} - \bar{x}_j\right)^2}{\dfrac{1}{N_+ - 1} \sum_{i=1}^{N_+} \left(x_{i,j}^{(+)} - \bar{x}_j^{(+)}\right)^2 + \dfrac{1}{N_- - 1} \sum_{i=1}^{N_-} \left(x_{i,j}^{(-)} - \bar{x}_j^{(-)}\right)^2}$$
where $\bar{x}_j$, $\bar{x}_j^{(+)}$ and $\bar{x}_j^{(-)}$ are the averages of the j-th feature over the whole, positive and negative datasets, respectively; $x_{i,j}^{(+)}$ is the j-th feature of the i-th positive instance, and $x_{i,j}^{(-)}$ is the j-th feature of the i-th negative instance. The F-score was calculated for every feature; the larger the score, the more discriminative the feature is likely to be. The change in validation accuracy (measured by 5-fold CV) was tracked while features with lower scores were gradually dropped, and the optimal subset of features is taken to be the one that gives the highest accuracy.
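The F-score ranking can be written compactly in Python. The sketch below assumes the conventional definition (squared deviations of the class means from the overall mean, divided by the sum of the within-class sample variances); the function names and the toy data are hypothetical, and each class is assumed to contain at least two instances.

```python
def f_score(positives, negatives, j):
    """F-score of feature j given lists of positive and negative vectors."""
    pos = [row[j] for row in positives]
    neg = [row[j] for row in negatives]
    mean_pos = sum(pos) / len(pos)
    mean_neg = sum(neg) / len(neg)
    mean_all = (sum(pos) + sum(neg)) / (len(pos) + len(neg))

    numerator = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    var_pos = sum((v - mean_pos) ** 2 for v in pos) / (len(pos) - 1)
    var_neg = sum((v - mean_neg) ** 2 for v in neg) / (len(neg) - 1)
    denominator = var_pos + var_neg

    if denominator == 0.0:
        # Feature is constant within both classes.
        return float("inf") if numerator > 0.0 else 0.0
    return numerator / denominator


def rank_features(positives, negatives):
    """Return feature indices sorted from highest to lowest F-score."""
    n_features = len(positives[0])
    scores = [f_score(positives, negatives, j) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


# Toy data: feature 0 is uninformative, feature 1 separates the classes.
toy_pos = [[1.0, 5.0], [2.0, 5.5], [3.0, 4.5]]
toy_neg = [[1.5, 0.0], [2.5, 0.5], [2.0, -0.5]]
ranking = rank_features(toy_pos, toy_neg)
```

In an elimination loop, the lowest-ranked features would be dropped in batches and the cross-validated accuracy re-measured after each batch.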

2.5 Performance measurement
As in the case of all discriminative methods (Baldi et al., 2000; Roulston, 2002), the performance of statistical learning methods can be measured by the numbers of true positives (true mutagens), TP; true negatives (true non-mutagens), TN; false positives (false mutagens), FP; and false negatives (false non-mutagens), FN. Sensitivity [TP/(TP + FN)] is the measure of a program's ability to correctly identify true mutagens. Specificity [TN/(TN + FP)] is the prediction accuracy for the non-mutagens. The overall concordance rate [Q = (TP + TN)/(TP + TN + FP + FN)] and the Matthews correlation coefficient [(TP × TN − FP × FN)/√((TP + FP)(TP + FN)(TN + FP)(TN + FN))] (Matthews, 1975) were used to measure the accuracies of prediction.
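These four measures follow directly from the confusion-matrix counts. The Python helper below is a small illustrative sketch; the function name, the returned dictionary keys and the example counts are hypothetical, and both classes are assumed to be non-empty.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, concordance Q and Matthews coefficient."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    q = (tp + tn) / (tp + tn + fp + fn)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the coefficient is set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity,
            "Q": q, "MCC": mcc}


# Hypothetical counts, not taken from the paper.
example = classification_metrics(tp=90, tn=80, fp=20, fn=10)
```

Unlike Q, the Matthews coefficient stays informative when the two classes are unbalanced, which is why both are usually reported together.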


    3 RESULTS
 
The overall performance of the SVM + MEV model is provided in Table 2, in comparison with that of Kazius' toxicophore model (2005). For model construction, the SVM + MEV method fits the data well, with a concordance rate of 91.86%. Sensitivity and specificity are 93.63 and 89.67%, respectively, ~10% higher than those of the toxicophore model. For validation, a prediction accuracy of 84.80% was obtained for the external test set, which is close to the experimental reproducibility of the Salmonella assay (Benigni and Giuliani, 1988). As pointed out by Kazius et al. (2005), an in silico tool is unlikely to perform better than this, as a result of the intrinsic limitations present in both the experimental data and the SAR approach. In this study, the same test set was also evaluated using TOPKAT, a statistics-driven program based on ‘electro-topological’ descriptors rather than chemical substructures. As Table 2 shows, TOPKAT correctly identifies true negatives, with a specificity of 85.10%, but is less effective at predicting true positives. The sensitivity and overall accuracy given by TOPKAT are 77.32 and 80.81%, respectively, lower than our results. Similarly poor sensitivities were also found for other computational programs such as MCASE and DEREK (Snyder and Smith, 2005). In contrast, the sensitivity and specificity of our SVM + MEV model are similarly high, indicating a well-balanced model that is capable of predicting both mutagenicity and non-mutagenicity.


Table 2 Comparison of the overall statistics of the MEV + SVM model, Kazius' (2005) toxicophore model and TOPKAT.

 
An expanded analysis of CV results for mutagenicity prediction is presented in Table 3. Compared with the 10-fold CV results of Helma et al.'s (2004) MOLFEA model and Kazius' (2006) elaborate chemical representation (ECR) method, our SVM + MEV model gives much more accurate predictions under 10-fold CV, up to 10–12% higher. It should be noted that direct comparison with results from previous studies is usually inappropriate because of differences in the datasets, molecular descriptors and classification methods used. However, it may at least provide a rough estimate of the level of accuracy of our method relative to that achieved by other data-driven models. As the CV performance of SVM + MEV is good and generally consistent with that on the training set and on the independent external test set, we may infer that this method is able to extract significant knowledge from experimental data.


Table 3 Comparison of the 10-fold CV results of the MEV+SVM model and other substructure mining methods.

 

    4 DISCUSSION
 
Being able to explain the obtained solution in terms of the features is as relevant as obtaining the best possible predictor. Here we performed an F-score analysis to interpret the feature importance. As shown in Figure 2, the F-scores of MEV features are depicted as columns and categorized according to feature type. Features from the classes of atomic partial charges constitute the largest percentage of the most discriminative descriptors (those with high F-score values). Among the 50 highest-ranked features, 72% are charge related, giving an enrichment factor (EF) of 1.21; the two orbital properties, fN and SN, together account for only 22%. For the top 10, an even more significant enrichment was observed: nine of them belong to charge categories. These findings highlight the important role of atomic partial charges in describing the intermolecular interactions with DNA. Covalent interactions between a chemical and cellular DNA can be readily recognized and anticipated to have genotoxic potential (Snyder et al., 2004). However, the inadequate coverage of non-covalent DNA interactions may lead to a high level of false negatives, contributing to the inherent lack of sensitivity of many computational tools (Snyder and Smith, 2005). Since the partial charge related descriptors are employed as a measure of non-covalent intermolecular interaction, this proposition is supported by our F-score results, which accordingly also account for the marked increase in the sensitivity of our SVM + MEV method. Another noteworthy point in Figure 2 is that the descriptor (POLY) indicating whether a molecule contains a polycyclic aromatic system ranked first. The other two descriptors (BAY and K), which detect polycyclic rings with and without a bay- or K-region, also have high F values. This set of descriptors can give high accuracies in identifying mutagens that are prone to intercalate into DNA structures (Kazius et al., 2005). Although primarily determined by electrostatic features, non-covalent interaction with DNA also involves the intercalation mechanism. These high F-scores demonstrate the feasibility and importance of the toxicophore-based descriptors in identifying mutagenicity that arises through this mechanism.
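The reported enrichment factor can be checked with a short calculation. The Python snippet below assumes (the text does not state the definition) that EF is the share of charge-related features among the top 50 divided by their share among all 263 MEV features, where three of the five descriptors per atom type are partial charges.

```python
N_ATOM_TYPES = 52
DESCRIPTORS_PER_TYPE = 5      # S_N, f_N and three partial charges
CHARGE_DESCRIPTORS = 3        # sigma-, pi- and total charge
TOXICOPHORE_BITS = 3

total_features = N_ATOM_TYPES * DESCRIPTORS_PER_TYPE + TOXICOPHORE_BITS
charge_features = N_ATOM_TYPES * CHARGE_DESCRIPTORS

share_overall = charge_features / total_features   # 156 of 263 features
share_top50 = 0.72                                  # reported in the text

# Assumed definition: ratio of the two shares.
enrichment_factor = share_top50 / share_overall
```

Under this assumption the ratio rounds to 1.21, in agreement with the value quoted above. The remaining shares are also consistent: 72% charge-related plus 22% orbital features leaves 6% of the top 50, i.e. the three toxicophore indicators.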


Fig. 2 F-score of each feature in MEV. Atomic descriptors (SN, fN, qσ, qπ, q) and toxicophore indicators (T) are depicted in different colors.

 
The original MEV comprises 263 bits. Although the input dimension is low relative to the number of samples in the training set, the risk of over-fitting still exists and may reduce the performance of the model (Kohavi and John, 1997). As shown in Figure 2, a substantial number of descriptors have low F-scores; these may increase the chance of fitting noise and were gradually removed in the RFE process. Figure 3 shows the change in concordance rate (determined by 5-fold CV) as the complexity of the model decreases. The accuracy does not start to drop until the length of MEV is reduced to 164 bits. In fact, the use of RFE-selected descriptors even slightly improved the prediction performance.


Fig. 3 The concordance rate (Q, %) of the training set (measured by 5-fold CV) and of the test set versus the number of top F-score ranked features.

 
With the top-ranked 164 features, the 5-fold CV gave an increased concordance rate of 82.43% when γ is set to 0.003285 and C to 181.0193. For the external test set, the accuracy is enhanced from 84.80 to 85.14%. These results therefore suggest that the simple F-score based RFE method can help to select molecular descriptors optimally and enable the development of a more accurate and efficient model.

In addition to a binary classification, it is also desirable for such a classifier to output a scalar value showing its belief in the prediction. For example, a compound with an estimated probability of 90% should be more likely to be a mutagen than one with a P value of 60%. The plot in Figure 4 shows the change in concordance rate (Q) against the estimated mutagenic probability (P) for all training set molecules: the predictive power of the model reaches its maxima when P approaches the two ends, 0 and 100%, and drops drastically when P approaches the middle value, 50%. This is to some extent confusing, because a P value close to 0% actually means that the probability of being a non-mutagen is ~100%. For clarity, we use P' to represent the probability of the predicted class, whether mutagen or non-mutagen, which is given by the piecewise function

$$P' = \begin{cases} P, & P \ge 50\% \\ 100\% - P, & P < 50\% \end{cases}$$


Fig. 4 The concordance rate (Q, %) of the training set versus the estimated mutagenic probability (P, %), shown in the bottom-left coordinate system; the sample number (N) versus P is shown in the bottom-right coordinate system.

 
After this transformation, it is clear that P' is well correlated with Q: the higher P' is, the more accurate the prediction. For the external test set, compounds with an estimated P' > 75% (405 compounds of the test set) could be classified with a Q of 91.85%, improving the predictive performance for this subset. Although the average accuracy is unchanged, a more rational decision can still be made when the performance level is related to a calculated P'.
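The use of P' as a confidence filter can be sketched as follows. The Python example is illustrative only: probabilities are expressed as fractions rather than percentages, labels are 1 for mutagens and 0 for non-mutagens, and the function names and toy data are hypothetical.

```python
def confidence(p):
    """P': probability of the predicted class, given mutagenic probability p."""
    return p if p >= 0.5 else 1.0 - p


def concordance_above(probabilities, labels, threshold):
    """Concordance rate among predictions whose P' exceeds the threshold.

    Returns (Q, number_of_selected_compounds); Q is None if nothing passes.
    """
    selected = [(p, y) for p, y in zip(probabilities, labels)
                if confidence(p) > threshold]
    if not selected:
        return None, 0
    correct = sum(1 for p, y in selected if (p >= 0.5) == (y == 1))
    return correct / len(selected), len(selected)


# Toy predictions (hypothetical values, not from the paper).
toy_probabilities = [0.95, 0.10, 0.60, 0.45, 0.80, 0.20]
toy_labels = [1, 0, 0, 0, 1, 1]
q_confident, n_confident = concordance_above(toy_probabilities, toy_labels, 0.75)
```

Raising the threshold trades coverage for reliability: fewer compounds receive a call, but the calls that remain are expected to be more often correct.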


    5 CONCLUSIONS
 
In this article, a mutagenicity prediction model has been developed based on a novel molecular representation method, MEV, and SVM. The MEV is devised to characterize the electrophilicity and topology of a molecule, accounting for both direct and indirect mechanisms of mutagenicity. SVM, in turn, is used as a powerful statistical learning machine to extend existing knowledge without being biased toward particular mutagenic mechanisms. The high performance of our approach suggests that the combination of MEV and SVM is feasible and effective for in silico mutagenicity modeling.


    Acknowledgments
 
The authors would like to thank Prof. Roberta Bursi and Jeroen Kazius for generously sharing their datasets and for helpful discussion. The authors gratefully acknowledge financial support from the State Key Program of Basic Research of China (Grants 2003CB114401, 2004CB518901 and 2002CB512802), the National Natural Science Foundation of China (Grants 20372069, 29725203, 20472094 and 20102007), the Basic Research Project for Talent Research Group from the Shanghai Science and Technology Commission, the Key Project from the Shanghai Science and Technology Commission (Grant 02DJ14006) and the Key Project for New Drug Research from CAS.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Dmitrij Frishman

Received on April 17, 2006; revised on June 23, 2006; accepted on June 23, 2006

    REFERENCES
 

    Baldi, P., et al. (2000) Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics, 16, 412–424.

    Benfenati, E. and Gini, G. (1997) Computational predictive programs (expert systems) in toxicology. Toxicology, 119, 213–225.

    Benigni, R. (2005) Structure-activity relationship studies of chemical mutagens and carcinogens: mechanistic investigations and prediction approaches. Chem. Rev., 105, 1767–1800.

    Benigni, R. and Giuliani, A. (1988) Computer-assisted analysis of interlaboratory Ames test variability. J. Toxicol. Environ. Health, 25, 135–148.

    Bock, J.R. and Gough, D.A. (2001) Predicting protein–protein interactions from primary structure. Bioinformatics, 17, 455–460.

    Brown, R.E. and Simas, A.M. (1982) On the applicability of CNDO indices for the prediction of chemical reactivity. Theoret. Chim. Acta, 62, 1–16.

    Burges, C.J.C. (1998) A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discov., 2, 121–167.

    Bush, B.L. and Sheridan, R.P. (1993) PATTY: a programmable atom type and language for automatic classification of atoms in molecular databases. J. Chem. Inf. Comput. Sci., 33, 756–762.

    Caldwell, G.W., et al. (2001) The new pre-preclinical paradigm: compound optimization in early and late phase drug discovery. Curr. Top. Med. Chem., 1, 353–366.

    Chang, C.C. and Lin, C.J. (2001) LIBSVM: a library for support vector machines.

    Lewis, D.F., et al. (2003) A quantitative structure-activity relationship (QSAR) study of mutagenicity in several series of organic chemicals likely to be activated by cytochrome P450 enzymes. Teratog. Carcinog. Mutagen., 23, 187–193.

    Doniger, S., et al. (2002) Predicting CNS permeability of drug molecules: comparison of neural network and support vector machine algorithms. J. Comput. Biol., 9, 849–864.

    Fielden, M.R., et al. (2002) In silico approaches to mechanistic and predictive toxicology: an introduction to bioinformatics for toxicologists. Crit. Rev. Toxicol., 32, 67–112.

    Franke, R. (1984) Theoretical Drug Design Methods. Elsevier, Amsterdam, pp. 115–123.

    Fukui, K. (1975) Theory of Orientation and Stereoselection. Springer-Verlag, New York, pp. 34–39.

    Fukui, K., et al. (1957) MO-theoretical approach to the mechanism of charge transfer in the process of aromatic substitutions. J. Chem. Phys., 27, 1247.

    Garrett, L.H. and Grisham, C.M. (1995) Biochemistry. Saunders College Publishing, Orlando, FL, pp. 225–929–932.

    Gasteiger, J. and Marsili, M. (1980) Iterative partial equalization of orbital electronegativity—a rapid access to atomic charges. Tetrahedron, 36, 3219–3228[CrossRef][ISI].

    Greene, N. (2002) Computer systems for the prediction of toxicity: an update. Adv. Drug. Deliv. Rev, . 54, 417–431[CrossRef][ISI][Medline].

    Guyon, I., et al. (2002) Gene selection for cancer classification using support vector machines. Mach. Learn, . 46, 389–422[CrossRef].

    Helma, C. (2005) In silico predictive toxicology: the state-of-the-art and strategies to predict human health effects. Curr. Opin. Drug. Discov. Devel, . 8, 27–31[ISI][Medline].

    Helma, C., et al. (2004) Data mining and machine learning techniques for the identification of mutagenicity inducing substructures and structure activity relationships of noncongeneric compounds. J. Chem. Inf. Model, . 44, 1402–1411.

    Hückel, E.Z. (1931) Quantentheoretische beitrage zum benzolproblem. I. Die elektronenkonfiguration des benzols und verwandter beziehungen. Physik, 70, 204–286[CrossRef].

    Johnson, D.E. and Wolfgang, G.H. (2000) Predicting human safety: screening and computational approaches. Drug. Discov. Today, 5, 445–454[CrossRef][ISI][Medline].

    Karelson, M., et al. (1996) Quantum-chemical descriptors in QSAR/QSPR studies. Chem. Rev, . 96, 1027–1044[CrossRef][ISI][Medline].

    Kazius, J., et al. (2005) Derivation and validation of toxicophores for mutagenicity prediction. J. Med. Chem, . 48, 312–320[CrossRef][ISI][Medline].

    Kazius, J., et al. (2006) Substructure mining using elaborate chemical representation. J. Chem. Inf. Model, . 46, 597–605[CrossRef][ISI][Medline].

    Kikuchi, O. (1987) Systematic QSAR procedures with quantum chemical descriptors. Quant. Struct.-Act. Relat, . 6, 179.

    Kim, B.S. and Margolin, B.H. (1999) Prediction of rodent carcinogenicity utilizing a battery of in vitro and in vivo genotoxicity tests. Environ. Mol. Mutagen, . 34, 297–304[CrossRef][ISI][Medline].

    Kohavi, R. and John, G.H. (1997) Wrappers for feature subset selection. Artificial Intell, . 97, 273–324[CrossRef].

    Lewis, D.F.V., et al. (2003) A quantitative structure-activity relationship (QSAR) study of mutagenicity in several series of organic chemicals likely to be activated by cytochrome P450 enzymes. Teratog. Carcin. Mutage, . 23, 187–193[CrossRef].

    Li, H., et al. (2005) Prediction of genotoxicity of chemical compounds by statistical learning methods. Chem. Res. Toxicol, 18, 1071–1080[CrossRef][ISI][Medline].

    Lo, S.L., et al. (2005) Effect of training datasets on support vector machine prediction of protein-protein interactions. Proteomics, 5, 876–884[CrossRef][ISI][Medline].

    Matthews, B.W. (1975) Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta, . 405, 442–451[Medline].

    Platt, J. (1999) Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In Smola, A.J., Bartlett, P., Schölkopf, B., Schuurmans, D. (Eds.). Advance in Large Margin Classifiers,, , Cambridge, MA MIT Press, pp. 61–74.

    Prabhakar, Y.S. (1991) Quantum QSAR of the antirhinoviral activity of 9-benzylpurines. Drug. Des. Deliv, . 7, 227–239[Medline].

    Purcell, W.P. and Singer, J.A. (1967) A brief review and table of semiempirical parameters used in the Hueckel molecular orbital method. J. Chem. Eng. Data, 12, 235–246[CrossRef].

    Roulston, J.E. (2002) Screening with tumor markers: critical issues. Mol. Biotechnol, . 20, 153–162[CrossRef][ISI][Medline].

    Snyder, R.D., et al. (2004) Assessment of the sensitivity of the computational programs DEREK, TOPKAT, and MCASE in the prediction of the genotoxicity of pharmaceutical molecules. Environ. Mol. Mutagen, 43, 143–158[CrossRef][ISI][Medline].

    Snyder, R.D. and Smith, M.D. (2005) Computational prediction of genotoxicity: room for improvement. Drug. Discov. Today, 10, 1119–1124[CrossRef][ISI][Medline].

    Streitweiser, A. Molecular Obital Theory for Organic Chemists, . (1961) , Wiley, New York.

    Tuppurainen, K., et al. (1991) About the mutagenicity of chlorine-substituted furanones and halopropenals. A QSAR study using molecular orbital indices. Mutat. Res, . 247, 97–102[ISI][Medline].

    Vapnik, V.N. The Nature of Statistics Learning, (1995) , New York Springer.

    White, A.C., et al. (2003) A multiple in silico program approach for the prediction of mutagenicity from chemical structure. Mutat. Res, . 539, 77–89.

    Wu, T.F. and Lin, C.J. (2004) Probability estimates for multi-class classification by pairwise coupling. J. Mach. Learn. Res, . 5, 975–1005.

    Zeiger, E., et al. (1990) Evaluation of four in vitro genetic toxicity tests for predicting rodent carcinogenicity: confirmation of earlier results with 41 additional chemicals. Environ. Mol. Mutagen, . 16, Suppl. 18, 1–14[ISI][Medline].

