Bioinformatics Advance Access originally published online on July 27, 2007
Bioinformatics 2007 23(18):2498-2500; doi:10.1093/bioinformatics/btm363
sMOL Explorer: an open source, web-enabled database and exploration tool for Small MOLecules datasets
Information Systems Laboratory, BIOTEC Central Research Unit, National Center for Genetic Engineering and Biotechnology (BIOTEC), Klongluang, Pathumthani, 12120, Thailand
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Summary: sMOL Explorer is a 2D ligand-based computational tool that provides three major functionalities through a Web interface: data management; information retrieval and extraction; and statistical analysis and data mining. With sMOL Explorer, users can create personal databases by adding small molecules one at a time via a drawing interface or by uploading data files from internal and external projects into the sMOL database. The database can then be browsed and queried with textual and structural similarity searches. A molecule can also be submitted as a query against external public databases including PubChem, KEGG, DrugBank and eMolecules. Moreover, users can easily access a variety of data mining tools from Weka and R packages to perform analyses including (1) finding frequent substructures, (2) clustering molecular fingerprints, (3) identifying and removing irrelevant attributes from the data and (4) building classification models of biological activity.
Availability: sMOL Explorer is an Open Source project and is freely available to all interested users at http://www.biotec.or.th/ISL/SMOL/
Contact: supawadee{at}biotec.or.th
| 1 INTRODUCTION |
|---|
To increase the success rate in the laboratory and expedite drug discovery research, databases and software tools are required for computational prescreening of compound libraries. In particular, databases of compounds known to have the desired biological activity need to be prepared in the early stages of a virtual screening project. Several online databases of known small molecules and their properties are now available, giving chemists a very large chemical space to explore; examples include the open National Cancer Institute (NCI) database, ChemBank (Strausberg and Schreiber, 2003) and PubChem (Wheeler et al., 2005). Recent efforts to develop open-source computational tools, including CDK (Steinbeck et al., 2006) and JOELib (http://joelib.sourceforge.net), enable exploratory data analysis and support fast implementation of chemoinformatics applications. In addition, many data mining tools, such as the R environment (R Development Core Team, 2004) and the WEKA library (Witten and Frank, 2005), are freely available and can be incorporated into the prescreening process, for example to find structure-activity relationships and to predict activity from structure. To exploit these databases and tools efficiently and conveniently, sMOL Explorer, an open source Web-based integrated system, has been designed and developed for managing two-dimensional (2D) structures and related property data from multiple datasets and for exploring the data through a range of data mining techniques. User management and system administration functions are also included so that sMOL Explorer can be set up in a multi-user environment.
| 2 SOFTWARE FEATURES |
|---|
sMOL Explorer, shown in Figure 1, has been developed as a collection of Java Server Pages (JSP) and Servlets running on the Apache Tomcat web server. It uses the MySQL database management system and several chemical informatics libraries such as CDK, JOELib and Open Babel for data manipulation, and connects to various data analysis and mining methods from the Weka library and the R statistical environment. After installation, only a web browser is needed to use sMOL Explorer.
2.1 Web-enabled database management
sMOL Explorer is a centralized system that allows registered users to create a database of small molecules in two ways: direct entry and data upload. In direct entry mode, users can add a small-molecule structure to the database via the web in one of the following ways:
- Draw the 2D structure of the molecule interactively or paste a SMILES string via JChemPaint (Krause et al., 2000),
- Upload a MOL file directly into the database.
sMOL Explorer automatically generates a CDK-based molecular fingerprint and the molecular weight for each molecule registered in the database. The CDK fingerprint is a binary string of 1024 bits, with each bit representing the presence or absence of a particular structural feature, based on the paths computed for the molecule. In all cases, sMOL Explorer associates a MOL file, a SMILES string and a fingerprint with each molecule.
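The idea of a path-based, hashed fingerprint can be illustrated with a minimal Python sketch. This is not the CDK implementation: the molecule representation (a list of element symbols plus index pairs for bonds), the function names, the maximum path length and the use of CRC32 as the hash are all assumptions made for illustration, and bond orders, aromaticity and hydrogens are ignored.

```python
import zlib

FP_BITS = 1024  # fingerprint length stated in the text


def enumerate_paths(atoms, bonds, max_len=4):
    """Return the set of atom-symbol paths containing up to max_len atoms.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j) atom-index pairs (bond orders ignored in this sketch)
    """
    neighbours = {i: set() for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].add(j)
        neighbours[j].add(i)

    paths = set()

    def walk(path):
        symbols = [atoms[k] for k in path]
        forward = "-".join(symbols)
        backward = "-".join(reversed(symbols))
        # A path and its reverse describe the same feature; keep one form.
        paths.add(min(forward, backward))
        if len(path) == max_len:
            return
        for nxt in neighbours[path[-1]]:
            if nxt not in path:  # simple paths only, no atom visited twice
                walk(path + [nxt])

    for start in range(len(atoms)):
        walk([start])
    return paths


def fingerprint(atoms, bonds, max_len=4, n_bits=FP_BITS):
    """Hash every path to one bit position; return the set of 'on' bits."""
    paths = enumerate_paths(atoms, bonds, max_len)
    return {zlib.crc32(p.encode()) % n_bits for p in paths}


def to_bit_string(on_bits, n_bits=FP_BITS):
    """Render the set of 'on' positions as a binary string of n_bits characters."""
    return "".join("1" if i in on_bits else "0" for i in range(n_bits))


# Heavy-atom skeleton of ethanol: C-C-O
ethanol = fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
print(len(to_bit_string(ethanol)), sorted(ethanol))
```

In this sketch every path of a substructure is also a path of the parent molecule, so the bits set for the substructure are a subset of the bits set for the parent. That property is what makes such fingerprints usable as a fast pre-filter before a full graph-based substructure match.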
Users can select or search for molecules in the sMOL Explorer database and save them in a data workspace, or upload a new dataset for analysis. Once the data have been loaded or saved in the data workspace, they can be analyzed using the algorithms in sMOL Explorer.
2.2 Structural similarity and text search
sMOL Explorer supports both structure search and text search. Structure search falls into three basic categories: exact structure, substructure and structural similarity searches. An exact structure or substructure search finds all compounds in the database that have the given structure or contain the given substructure. Both exact and substructure searches in sMOL Explorer are implemented using graph-based structure search algorithms from CDK, while the structural similarity between molecules is measured on molecular fingerprints using a user-selected similarity coefficient such as Tanimoto, Cosine or Simpson. To run a structure search in sMOL Explorer, a chemist can paste a molecular structure into JChemPaint or upload a data file to find exact matches, substructure matches or similar compounds in the database. In addition, chemists can search molecules against other publicly accessible databases, including PubChem, KEGG, DrugBank and eMolecules, via the Internet. In text search, sMOL Explorer allows users to specify search terms to find compounds whose associated information is relevant to the query text.
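The three similarity coefficients named above have standard definitions on binary fingerprints. The following Python sketch shows them under the assumption that each fingerprint is stored as the set of its "on" bit positions; the function names, the example database and the threshold are hypothetical and do not reflect the sMOL Explorer code.

```python
from math import sqrt


def tanimoto(a, b):
    """|A & B| / (|A| + |B| - |A & B|); defined here as 0.0 for two empty sets."""
    common = len(a & b)
    union = len(a) + len(b) - common
    return common / union if union else 0.0


def cosine(a, b):
    """|A & B| / sqrt(|A| * |B|)."""
    denom = sqrt(len(a) * len(b))
    return len(a & b) / denom if denom else 0.0


def simpson(a, b):
    """|A & B| / min(|A|, |B|)."""
    denom = min(len(a), len(b))
    return len(a & b) / denom if denom else 0.0


def similarity_search(query, database, coefficient=tanimoto, threshold=0.7):
    """Return (name, score) pairs scoring at least `threshold`, best first."""
    scored = [(name, coefficient(query, fp)) for name, fp in database.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical fingerprints, given as sets of "on" bit positions.
database = {
    "mol_a": {1, 2, 3, 4},
    "mol_b": {1, 2, 3, 5, 6},
    "mol_c": {7, 8},
}
query = {1, 2, 3, 4}
for name, score in similarity_search(query, database, tanimoto, threshold=0.5):
    print(name, round(score, 3))
```

The choice of coefficient matters when the two molecules differ in size: Simpson normalizes by the smaller fingerprint and therefore scores a small query contained in a large molecule highly, whereas Tanimoto penalizes the bits that are not shared.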
2.3 Clustering analysis
Clustering techniques are useful for finding structure-activity relationships, since small molecules with similar structures are likely to have similar functional properties. In addition, clustering can help to discover new groups of related molecules and to reduce the search space from large and diverse compound databases to smaller sets that are more focused on the desired biological activity. In sMOL Explorer, users can select a set of molecules, cluster them based on their molecular fingerprints and download the clustering result. Users can also interactively select a clustering algorithm and a similarity measure between molecules, and enter the other parameters required by each algorithm. At present, sMOL Explorer offers clustering algorithms with practical speed and performance, such as Minimum Entropy clustering (Li et al., 2004), Hierarchical clustering and K-Centroids Cluster analysis in R.
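As a rough illustration of fingerprint-based hierarchical clustering, the sketch below performs single-linkage agglomerative clustering on Tanimoto distances in plain Python. It is an assumption-laden toy, not the R routines that sMOL Explorer calls: the fingerprint representation (sets of "on" bits), the function names, the distance cutoff and the example data are all hypothetical.

```python
def tanimoto_distance(a, b):
    """1 - Tanimoto similarity for fingerprints given as sets of 'on' bits."""
    union = len(a | b)
    return 1.0 - (len(a & b) / union if union else 0.0)


def single_linkage(fingerprints, cutoff=0.5):
    """Merge the two closest clusters repeatedly until the smallest
    inter-cluster distance exceeds `cutoff`; return lists of molecule names."""
    clusters = [[name] for name in fingerprints]

    def linkage(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(
            tanimoto_distance(fingerprints[x], fingerprints[y])
            for x in c1
            for y in c2
        )

    while len(clusters) > 1:
        pairs = [
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        dist, i, j = min(pairs)
        if dist > cutoff:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Hypothetical fingerprints for five molecules.
fps = {
    "m1": {1, 2, 3, 4},
    "m2": {1, 2, 3, 5},
    "m3": {10, 11, 12},
    "m4": {10, 11, 13},
    "m5": {20},
}
print(single_linkage(fps, cutoff=0.5))
```

The cutoff plays the role of cutting a dendrogram at a fixed height: a lower value yields more, tighter clusters. The recomputation of all pairwise linkages in each round is acceptable for a sketch but would be too slow for large compound sets.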
2.4 Finding frequent substructure or fragments
Discovering frequent substructures in a dataset of known ligands is important for identifying which structural parts of the compounds are related to a desirable property. For example, a common activity-related structural pattern found in known molecules can subsequently be used in the rational synthesis of new compounds and in the classification of compounds into different classes. MoFa, a molecular substructure mining algorithm (Borgelt and Berthold, 2002), is integrated into sMOL Explorer because it is accurate and fast enough in searching for common substructures. MoFa was developed from the Eclat association rule mining algorithm (Zaki et al., 1997). To use MoFa in sMOL Explorer, users must specify a minimum support threshold for finding frequent substructures in the dataset. The support of a substructure is the frequency of molecules in the dataset that contain it, and the minimum support is the lowest such frequency at which a substructure is still reported. The analysis returns a list of the frequent substructures, that is, those whose support in the dataset meets the minimum support.
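The role of the minimum support threshold can be shown with a small Python sketch. It only covers the support-counting and filtering step and is not MoFa itself, which discovers the substructures by searching molecular graphs. Here each molecule is assumed to arrive as a pre-computed set of fragment labels; the labels, the function name and the inclusive threshold (support greater than or equal to the minimum) are assumptions for illustration.

```python
from collections import Counter


def frequent_fragments(molecule_fragments, min_support):
    """Return {fragment: support} for fragments whose support, i.e. the
    fraction of molecules containing them, is at least `min_support`."""
    n = len(molecule_fragments)
    if n == 0:
        return {}
    counts = Counter()
    for fragments in molecule_fragments:
        counts.update(set(fragments))  # count each fragment once per molecule
    return {
        frag: count / n
        for frag, count in counts.items()
        if count / n >= min_support
    }


# Hypothetical dataset: one set of fragment labels (SMILES-like strings) per molecule.
dataset = [
    {"c1ccccc1", "C(=O)O", "N"},
    {"c1ccccc1", "C(=O)O"},
    {"c1ccccc1", "Cl"},
    {"C(=O)O", "N"},
]
for fragment, support in sorted(frequent_fragments(dataset, 0.5).items()):
    print(fragment, support)
```

Raising the threshold shrinks the result towards the fragments shared by almost all molecules, while lowering it returns more, rarer fragments at a higher computational cost in a real miner.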
2.5 Feature selection and classification
Based on databases of known molecules, statistical-learning techniques can be applied to build predictive models for finding novel hit compounds (compounds that exhibit the desired properties) in silico or for quickly eliminating undesirable compounds. The accuracy of such a model depends on how relevant the structure descriptors, or features, are to the classification problem. Feature selection methods are therefore used to identify a set of relevant features for constructing the model. sMOL Explorer provides both feature selection and classification methods available in the WEKA library and R packages. Feature selection in sMOL Explorer can be used to remove irrelevant features from the dataset before a classifier is trained. To build a classification model, users can upload a dataset or select a set of molecules in the database for training and testing, and then choose a classification method such as NaiveBayes, C4.5 Decision Tree, Random Forest, Neural Network or SVM. The output displays the performance measures of the model in classifying the compounds and the predicted class assigned to each test compound.
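One common way to identify irrelevant attributes before training a classifier is to rank features by information gain with respect to the activity class. The Python sketch below illustrates that idea on binary fingerprint bits. It is an assumption that this particular criterion is representative; the text does not say which WEKA or R feature selection methods are exposed, and the function names, the gain threshold and the toy data are hypothetical.

```python
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(
        (labels.count(c) / n) * log2(labels.count(c) / n) for c in set(labels)
    )


def information_gain(column, labels):
    """Reduction in class entropy obtained by splitting on one feature."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(column):
        subset = [y for x, y in zip(column, labels) if x == value]
        gain -= len(subset) / n * entropy(subset)
    return gain


def select_features(rows, labels, min_gain=0.01):
    """Return (indices of features with gain above min_gain, all gains)."""
    n_features = len(rows[0])
    gains = [
        information_gain([row[j] for row in rows], labels)
        for j in range(n_features)
    ]
    kept = [j for j, g in enumerate(gains) if g > min_gain]
    return kept, gains


# Toy data: 4 molecules x 3 fingerprint bits, with activity labels.
rows = [
    [1, 0, 1],
    [1, 1, 1],
    [0, 0, 1],
    [0, 1, 1],
]
labels = ["active", "active", "inactive", "inactive"]

kept, gains = select_features(rows, labels)
reduced = [[row[j] for j in kept] for row in rows]
print(kept, gains, reduced)
```

In this toy dataset the first bit separates the classes perfectly, the second is independent of the class and the third is constant, so only the first survives. The reduced matrix is what would then be passed to a classifier for training and testing.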
2.6 File conversion and molecular descriptor computation
sMOL Explorer also integrates utilities for converting file formats, computing molecular descriptors such as molecular weight, and updating the URLs of external databases.
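Molecular weight is the simplest of these descriptors: it is the sum of the standard atomic weights of all atoms in the molecule. The Python sketch below computes it from a flat molecular formula. The function name, the small element table and the restriction to formulas without parentheses or charges are assumptions for illustration; sMOL Explorer itself derives the value through its cheminformatics libraries.

```python
import re

# Abridged standard atomic weights for a few common elements.
ATOMIC_WEIGHTS = {
    "H": 1.008,
    "C": 12.011,
    "N": 14.007,
    "O": 15.999,
    "S": 32.06,
    "Cl": 35.45,
}


def molecular_weight(formula):
    """Sum atomic weights for a flat formula such as 'C2H6O'.

    Raises ValueError for formulas this sketch cannot parse (parentheses,
    charges, ...) and KeyError for elements missing from the table.
    """
    if not re.fullmatch(r"(?:[A-Z][a-z]?\d*)+", formula):
        raise ValueError(f"unsupported formula: {formula!r}")
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_WEIGHTS[symbol] * (int(count) if count else 1)
    return total


print(round(molecular_weight("C2H6O"), 3))  # ethanol
```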
| 3 CONCLUSION |
|---|
sMOL Explorer is an open source integrated suite that provides the tools chemists need to explore and mine chemical datasets. With its flexibility in performing exploratory analysis and its user-friendly web interface, sMOL Explorer can facilitate and speed up the prescreening process for its users.
| ACKNOWLEDGEMENTS |
|---|
This work was supported by the National Center for Genetic Engineering and Biotechnology (BIOTEC), Thailand. We wish to thank Dr Duangdao Wichadakul and Dr Pattama Pittayakhajonwut for their useful feedback.
Funding to pay the Open Access publication charges was provided by the National Center for Genetic Engineering and Biotechnology (BIOTEC), Thailand.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Limsoon Wong
Received on June 26, 2007; revised on July 9, 2007; accepted on July 9, 2007
| REFERENCES |
|---|
Borgelt C, Berthold MR. Mining molecular fragments: finding relevant substructures of molecules. (2002) Proceedings of IEEE International Conference on Data Mining (ICDM 2002, Maebashi, Japan). Piscataway, NJ, USA: IEEE Press. 51–58.
Krause S, et al. JChemPaint - using the collaborative forces of the Internet to develop a free editor for 2D chemical structures. Molecules (2000) 5:93–98.
Li H, et al. Minimum entropy clustering and applications to gene expression analysis. (2004) Proceedings of IEEE Computational Systems Bioinformatics Conference. Stanford, USA: IEEE. 142–151.
R Development Core Team. R: A Language and Environment for Statistical Computing. (2004) Vienna, Austria: R Foundation for Statistical Computing.
Steinbeck C, et al. Recent developments of the chemistry development kit (CDK) – an open-source Java library for chemo- and bioinformatics. Curr. Pharm. Des (2006) 12:2111–2120.
Strausberg RL, Schreiber SL. From knowing to controlling: a path from genomics to drugs using small molecule probes. Science (2003) 300:294–295.
Wheeler DL, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res (2005) 33:D39–D45.
Witten IH, Frank E. Data Mining: Practical Machine Learning Tools and Techniques. (2005) San Francisco: Morgan Kaufmann.
Zaki M, et al. New Algorithms for fast discovery of association rules. (1997) Proceedings of 3rd International Conference on Knowledge Discovery and Data Mining (KDD97). Menlo Park, CA, USA: AAAI Press. 283–296.