Bioinformatics Advance Access originally published online on September 20, 2005
Bioinformatics 2005 21(22):4133-4139; doi:10.1093/bioinformatics/bti683
ChemDB: a public database of small molecules and related chemoinformatics resources


1Institute for Genomics and Bioinformatics, School of Information and Computer Sciences, University of California Irvine, CA, USA
2Department of Computer Science, School of Information and Computer Sciences, University of California Irvine, CA, USA
*To whom correspondence should be addressed.
ABSTRACT
Motivation: The development of chemoinformatics has been hampered by the lack of large, publicly available, comprehensive repositories of molecules, in particular of small molecules. Small molecules play a fundamental role in organic chemistry and biology. They can be used as combinatorial building blocks for chemical synthesis, as molecular probes in chemical genomics and systems biology, and for the screening and discovery of new drugs and other useful compounds.
Results: We describe ChemDB, a public database of small molecules available on the Web. ChemDB is built using the digital catalogs of over a hundred vendors and other public sources and is annotated with information derived from these sources as well as from computational methods, such as predicted solubility and three-dimensional structure. It supports multiple molecular formats and is periodically updated, automatically whenever possible. The current version of the database contains approximately 4.1 million commercially available compounds and 8.2 million counting isomers. The database includes a user-friendly graphical interface, chemical reactions capabilities, as well as unique search capabilities.
Availability: Database and datasets are available on http://cdb.ics.uci.edu
Contact: pfbaldi{at}ics.uci.edu
Supplementary information: Supplementary materials are available on http://cdb.ics.uci.edu
1 INTRODUCTION
The development of chemoinformatics has been greatly hampered by the lack of publicly available comprehensive datasets of molecules (Marris, 2005), large-scale collaborative projects to annotate these molecules, and efficient tools to rapidly sift through large chemical repositories. Suffice it to say that no repository of all known organic molecules and their properties is publicly available and downloadable on the Internet. To draw a simple analogy with bioinformatics, the chemoinformatics equivalents of GenBank and BLAST are still to be created. To begin addressing these problems, at least for organic chemistry, we describe ChemDB, a public database of over 4.1 million small molecules available on the Web.
Small molecules with at most a few dozen atoms play a fundamental role in organic chemistry and biology. They can be used as combinatorial building blocks for chemical synthesis (Schreiber, 2000; Agrafiotis et al., 2002), as molecular probes for perturbing and analyzing biological systems in chemical genomics and systems biology (Schreiber, 2003; Stockwell, 2004; Dobson, 2004) and for the screening, design and discovery of useful compounds. These include of course new drugs (Lipinski and Hopkins, 2004; Jonsdottir et al., 2005), the majority of which are small molecules. Furthermore, huge arrays of new small molecules can be produced in a relatively short period of time (Houghten, 2000; Schreiber, 2000).
As datasets of small molecules become available, it is crucial to organize these datasets in rapidly searchable databases and to develop computational methods to rapidly extract or predict useful information for each molecule, including its physical, chemical and biological properties. Conversely, large and well-annotated datasets are essential for developing statistical machine learning methods in chemoinformatics, whether supervised or unsupervised, including predictive classification, regression and clustering of small molecules and their properties (e.g. Micheli et al., 2003; Ralaivola et al., 2005). Aggregation and organization of datasets of chemical information allows for massive in silico processing that would be impractical or even impossible in a traditional experimental setting.
Consider, for instance, a classical drug discovery problem where the starting point is a protein of known structure and perhaps a corresponding ligand (Fig. 1). With a good database of small molecules, the discovery process can proceed from both ends. Starting from the protein, one can dock millions of small molecules to the protein in silico. In fact, with sufficient computing power, one ought to be able to dock all known small molecules to all proteins with known structure contained in the PDB (Berman et al., 2000). Starting from the ligand, one can search the database of small molecules for compounds that are similar to the known ligand(s), where similarity can be defined in different ways. In both approaches, additional filters can be used to eliminate molecules that are, for instance, poorly soluble, too flexible or toxic (Swamidass et al., 2005). Furthermore, in silico chemical reactions applied to the molecules in the database can further expand the space of interesting molecules being screened or designed.
Most large databases of small molecules, such as MDL's Available Chemicals Directory (ACD) or the American Chemical Society's (ACS) CAS registry, are privately owned, expensive and often available only through restricted interfaces that are not suitable for the development of statistical methods. A few datasets of small molecules, such as the NCI (National Cancer Institute) open database, are available publicly. However, in general these are limited in size, with compounds on the order of 10^3 to 10^5 (Voigt et al., 2001). Furthermore, efforts towards public databases must face fierce opposition from the ACS (Marris, 2005; Kaiser, 2005a, b). Given the importance of small molecules, ChemDB aims to address the data bottleneck in the current environment by integrating existing public datasets with datasets originating from dozens of chemical vendors. These datasets are integrated into a database containing compounds on the order of 10^7, available on the Web with a unique combination of chemoinformatics resources.
2 METHODS
2.1 Data sources, formats and size
ChemDB is a chemical information database system grown not only out of an aggregation of multiple information sources, primarily commercial vendor catalogs, but also of publicly available repositories (e.g. NCI). For sources that periodically update their data and make them available on the Internet, we automatically download the data and resynchronize the latest updates into ChemDB. For other sources that currently distribute their data only through CDs, we contact them periodically for updates. Complete information about all the vendors is available from the Supplementary materials. In total, the current database contains about 4.1 million unique compounds and 8.2 million counting isomers, aggregated from over 115 sources.
Molecules come with multiple representations and formats (Fig. 2) including one-dimensional (1D) SMILES strings (Weininger et al., 1989; James et al., 2004, http://www.daylight.com/dayhtml/doc/theory/theory.toc.html), 2D graphs of atoms and bonds, 3D atom coordinates (SDF or MOL2 files) and fingerprints (Fligner et al., 2002; Flower, 1998; Ralaivola et al., 2005), all of which are stored in ChemDB. We have developed scripts to automatically parse input data, run different tests and populate the database. To populate the database, all datasets are first converted to the SDF format because of its standardized annotation mechanism. However, conversion between several popular molecular file formats, including SMILES, MDL Mol, PDB, Tripos Sybyl mol2 and SDF, is easily accomplished using OpenEye software's OEChem toolkit (http://www.eyesopen.com), or the open-source alternative, OpenBabel. Additional curation and normalization steps are applied to the data as it is inserted. For instance, 3D structures are generally not available and are therefore predicted using the program CORINA (Sadowski et al., 1994; Gasteiger et al., 1996).
One difficult issue for chemoinformatics systems is how to handle stereochemistry. This issue is complicated by the absence of stereochemical and geometric information from most sources, which generally provide only the atom-bond connection table. ChemDB currently enumerates up to n = 16 stereoisomers for each molecule. This is a reasonable number since it allows listing all possible isomers for over 97.4% of compounds in ChemDB, i.e. those with at most 4 stereocenters and therefore at most 16 isomers (see Results). In addition, for each isomer, ChemDB generates and stores not only the stereochemistry specific connection table as an isomeric SMILES string, but also the corresponding predicted 3D coordinates as an SDF file. This solution allows us to specify which isomer is relevant when stereochemistry is known and provides a more complete picture for the user, in virtual docking and other applications. If the stereoisomer is not specified, we assume that the chemical is available as a racemic mixture. Thus, this solution provides a reasonable and effective compromise in light of limited information about molecule handedness.
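The capped enumeration described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name `enumerate_configurations` and the 'R'/'S' labels are hypothetical, each stereocenter is treated as an independent two-way choice, and redundant or geometrically impossible combinations are not filtered out. For molecules with more than four stereocenters, the sketch follows the random-sampling strategy described in the Results.

```python
import itertools
import random

MAX_ISOMERS = 16  # cap stated in the text (n = 16)


def enumerate_configurations(num_stereocenters, max_isomers=MAX_ISOMERS, seed=0):
    """Return up to max_isomers stereo configurations as tuples of 'R'/'S' labels.

    Hypothetical helper: real isomer generation also discards redundant or
    geometrically impossible combinations, which this sketch ignores.
    """
    total = 2 ** num_stereocenters
    if total <= max_isomers:
        # Few enough stereocenters: list every combination exhaustively.
        return list(itertools.product("RS", repeat=num_stereocenters))
    # Too many combinations: draw a random sample of distinct configurations.
    rng = random.Random(seed)
    chosen = set()
    while len(chosen) < max_isomers:
        chosen.add(tuple(rng.choice("RS") for _ in range(num_stereocenters)))
    return sorted(chosen)


if __name__ == "__main__":
    for k in (0, 2, 4, 6):
        print(k, len(enumerate_configurations(k)))
```

Each configuration would then be turned into an isomeric SMILES string and a predicted 3D structure by the chemistry toolkit, a step that is outside the scope of this sketch.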
2.2 Database schema
The basic database schema is relationally organized and relies upon canonical SMILES string representations for rapid indexing and enforcement of uniqueness. The relational structure allows for maintenance and querying of complex arrangements, such as the many-to-many relationship between sources and the chemicals for which they provide records. The database schema contains primary tables for sources, chemicals and molecular descriptors and annotations. It is described in detail in the Supplementary materials.
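As a rough illustration of this organization, the sketch below builds a three-table layout with Python's built-in `sqlite3` module. It is not the actual ChemDB schema, which is described in the Supplementary materials and runs on PostgreSQL; the table names, column names and the helper `add_record` are assumptions made for the example. It shows only the two ideas named in the text: uniqueness enforced on the canonical SMILES string, and a link table for the many-to-many relationship between sources and chemicals.

```python
import sqlite3

# Hypothetical, simplified schema (not the real ChemDB tables).
SCHEMA = """
CREATE TABLE chemical (
    chemical_id      INTEGER PRIMARY KEY,
    canonical_smiles TEXT NOT NULL UNIQUE   -- one row per unique structure
);
CREATE TABLE source (
    source_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL UNIQUE
);
CREATE TABLE source_chemical (              -- many-to-many link table
    source_id   INTEGER NOT NULL REFERENCES source(source_id),
    chemical_id INTEGER NOT NULL REFERENCES chemical(chemical_id),
    PRIMARY KEY (source_id, chemical_id)
);
"""


def build_demo_db():
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_record(conn, source_name, canonical_smiles):
    """Register that a source lists a chemical; duplicates are ignored."""
    conn.execute("INSERT OR IGNORE INTO source (name) VALUES (?)", (source_name,))
    conn.execute(
        "INSERT OR IGNORE INTO chemical (canonical_smiles) VALUES (?)",
        (canonical_smiles,),
    )
    conn.execute(
        "INSERT OR IGNORE INTO source_chemical "
        "SELECT s.source_id, c.chemical_id FROM source s, chemical c "
        "WHERE s.name = ? AND c.canonical_smiles = ?",
        (source_name, canonical_smiles),
    )


if __name__ == "__main__":
    conn = build_demo_db()
    add_record(conn, "VendorA", "CCO")
    add_record(conn, "VendorB", "CCO")  # same structure, second source
    print(conn.execute("SELECT COUNT(*) FROM chemical").fetchone()[0])
```

Because the same canonical SMILES string maps to a single `chemical` row, a compound offered by several vendors is stored once and linked to each of them.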
2.3 Implementation
The database is implemented using the leading open-source relational database PostgreSQL (http://www.postgresql.org). We have also built filters for conversion to Oracle and maintain an Oracle version internally for comparison purposes. Web interfaces and tools are delivered using the open-source Apache Web server. Many of the basic application tools, scripts and Web interfaces are written in Python, while computationally intensive modules are written in C or Java. Python has convenient interfaces to important packages, like the OEChem toolkit that implements several basic algorithms needed for chemical data processing, including SMARTS pattern matching and SMIRKS reaction processing. We use OEDepict and the JMol Java applet (http://jmol.sourceforge.net/) for chemical image rendering.
2.4 Molecular descriptors and example of filters
In addition to 3D structures obtained using CORINA, we compute and store several other molecular descriptors including molecular weight, number of hydrogen-bond donors, number of hydrogen-bond acceptors, octanol/water partition coefficient log P, solvation energy, number of rigid fragments, number of rotatable bonds, number of chiral centers and number of chiral double bonds. For each molecular descriptor, we include hyperlinks to the program that was used to compute it. For instance, we compute log P values for all compounds, using the XLogP program and the calculation module available from ChemAxon (http://www.chemaxon.com). Similarly, compound solvation energy is always calculated and recorded using OpenEye's ZAP module. We also store in a similar way any additional molecular descriptors found in the vendor's electronic catalogs.
The database interface allows the user to implement flexible search filters by specifying thresholds or ranges for any combination of these molecular descriptors. For example, Lipinski's rules of five (Lipinski et al., 1997) are often used as criteria for drug oral bioavailability. These rules correspond to molecular mass <500 daltons; number of hydrogen-bond donors <5; number of hydrogen-bond acceptors <10 and octanol/water partition coefficient log P (an indication of the ability of a molecule to cross biological membranes) <5. If two or more of those criteria are out of range, the compound is likely to have poor absorption or permeability. The cutoffs in the rules can be tightened or relaxed in the interface to allow for flexible searches as well as computational and experimental errors, especially in the computational determination of the partition coefficient. As an alternative, one can easily use the set of rules proposed in Veber et al. (2002). By examining oral bioavailability in rats for over 1100 drug candidates, Veber et al. concluded that only two structural variables control this crucial property: molecular flexibility, measured by the number of rotatable bonds, and polar surface area, expressed as the sum of hydrogen-bond donors and acceptors. These studies indicate that drug candidates with 10 or fewer rotatable bonds and a polar surface area ≤140 Å² (equivalent to 12 or fewer H-bond donors and acceptors) will exhibit favorable oral bioavailability.
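The two rule sets above translate directly into threshold filters. The following Python sketch is illustrative only: the `Descriptors` class, the function names and the example values are assumptions, not part of the ChemDB interface. The default cutoffs are the ones quoted in the text and can be overridden to tighten or relax a search.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Hypothetical container for a few stored molecular descriptors."""
    mol_weight: float
    hbond_donors: int
    hbond_acceptors: int
    log_p: float
    rotatable_bonds: int


def lipinski_violations(d, max_weight=500.0, max_donors=5,
                        max_acceptors=10, max_log_p=5.0):
    """Count how many of the four Lipinski criteria are out of range."""
    in_range = [
        d.mol_weight < max_weight,
        d.hbond_donors < max_donors,
        d.hbond_acceptors < max_acceptors,
        d.log_p < max_log_p,
    ]
    return sum(not ok for ok in in_range)


def passes_lipinski(d, **cutoffs):
    """Poor absorption is likely once two or more criteria are violated."""
    return lipinski_violations(d, **cutoffs) < 2


def passes_veber(d, max_rotatable=10, max_donors_plus_acceptors=12):
    """Veber-style filter using the H-bond count as a polar-surface proxy."""
    return (d.rotatable_bonds <= max_rotatable
            and d.hbond_donors + d.hbond_acceptors <= max_donors_plus_acceptors)


if __name__ == "__main__":
    # Illustrative descriptor values, not taken from ChemDB.
    small = Descriptors(180.2, 1, 4, 1.2, 3)
    print(passes_lipinski(small), passes_veber(small))
```

Passing keyword arguments such as `max_log_p=4.0` tightens a single cutoff, which mirrors the adjustable thresholds described for the interface.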
It is important to recognize, however, that the Lipinski and Veber rules are not absolute (Frimurer et al., 2000) and that oral bioavailability is only one of many potentially important criteria. New modes of drug delivery have entered clinical practice recently and will likely continue to do so in the future. Thus, even within the limited chemoinformatics goals of drug screening, the ChemDB interface provides the user with a wide array of filters and threshold values to be tailored to different problems and searches.
2.5 Vendor/source descriptors and experimental annotations
We store the name, contact information and date of the latest update for each vendor's dataset in the database. In addition, we incorporate any annotation provided in the vendor's digital catalogs. These annotations range in utility and their presence can vary greatly from vendor to vendor. Annotations provided typically include purchase price, English name, CAS registry number, experimentally determined octanol-water partition coefficient, amount available, purity, melting point, heteroatoms and net charge. For a smaller fraction of compounds, additional miscellaneous information is available from the vendors, ranging from literature references to boiling points to possible activity (e.g. Interleukin agonist). We flag any experimentally derived annotation provided by the vendors or present in some of the public datasets (e.g. NCI). Finally, we also flag FDA-approved drugs for reference.
2.6 Similarity search methods and kernels
As in the case of bioinformatics, once large repositories of small molecules are assembled the next fundamental step for chemoinformatics is the definition and fast implementation of similarity measures. This is fundamental for two reasons: (1) to enable rapid and meaningful searches through millions of records and (2) to enable supervised and unsupervised predictive methods that are based on similarity measures, from clustering to kernel methods (Schölkopf and Smola, 2002).
Similarity measures between small molecules can be defined in several different ways and by leveraging different representations. In Swamidass et al. (2005), several similarity measures are described and assessed using spectral representations and spectral kernels, i.e. similarity measures derived by comparing the occurrence of substructures, such as substrings in SMILES strings, paths in 2D atom-bond graphs and histograms of atomic distances in 3D. While these and other similarity measures are under investigation, at this point the 2D similarity measures yield the best results and are used extensively in ChemDB. These measures are based on fixed-size fingerprint vectors counting the presence (or number of occurrences) of labeled paths in a molecule (see Swamidass et al., 2005 and references therein for details). Binary fingerprint representations of typical length 512 or 1024, combined with efficient bit-wise algorithms, yield fast search algorithms.
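To make the bit-wise idea concrete, the sketch below folds a set of substructure labels into a fixed-size binary fingerprint and compares two fingerprints with the Tanimoto coefficient. Several assumptions are made for illustration: the path labels are supplied by hand rather than extracted from a molecular graph, CRC32 is an arbitrary choice of hash, and Tanimoto is one common choice of measure rather than necessarily the one used in every ChemDB search.

```python
import zlib


def fingerprint_from_features(features, num_bits=1024):
    """Fold string features (e.g. labeled paths) into a bit vector stored as an int."""
    bits = 0
    for feature in features:
        # CRC32 is used only because it is deterministic across runs.
        position = zlib.crc32(feature.encode("utf-8")) % num_bits
        bits |= 1 << position
    return bits


def popcount(x):
    """Number of bits set in a non-negative integer."""
    return bin(x).count("1")


def tanimoto(a, b):
    """Tanimoto coefficient: shared bits divided by bits set in either fingerprint."""
    union = popcount(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return popcount(a & b) / union


if __name__ == "__main__":
    # Hand-written toy path labels, not produced by a real path enumeration.
    fp_a = fingerprint_from_features(["C-C", "C-O", "C-C-O"])
    fp_b = fingerprint_from_features(["C-C", "C-N", "C-C-N"])
    print(tanimoto(fp_a, fp_b))
```

Since the comparison reduces to AND, OR and bit counting on machine words, a sequential scan over millions of fingerprints remains fast.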
2.7 Web interface
The speed with which the bit-wise algorithm can sequentially search millions of chemical fingerprints makes the ChemDB available for queries through a Web interface, where users can enter query molecule(s) in any standard molecular file format with several additional options. These novel options include the ability to search by a selection of ranges for the primary annotations (e.g. number of rotatable bonds), by substructures and superstructures with or without constraints on the presence or absence of particular groups or other features (masks) and by profile using a group of related molecules rather than a single molecule as the query. The current version of profile search maximizes the sum of the similarities to each of the query structures. Alternatives under investigation include building a profile fingerprint vector and using alternative measures, such as the maximum of the pairwise similarities between a molecule and the structures in the query set.
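The profile search described above can be sketched as a ranking over aggregated similarities. In this illustrative Python example, the function names and the toy database are hypothetical, and the Tanimoto coefficient stands in for whichever similarity measure is selected. Passing `combine=sum` corresponds to the current sum-of-similarities scoring, while `combine=max` corresponds to one of the alternatives mentioned.

```python
def popcount(x):
    """Number of bits set in a non-negative integer."""
    return bin(x).count("1")


def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints stored as ints."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0


def profile_score(candidate, query_fps, combine=sum):
    """Aggregate the similarity of one candidate to every query structure."""
    return combine(tanimoto(candidate, q) for q in query_fps)


def profile_search(database, query_fps, combine=sum, top_k=3):
    """Rank database entries (name -> fingerprint) by their profile score."""
    scored = [(profile_score(fp, query_fps, combine), name)
              for name, fp in database.items()]
    # Highest score first; ties are broken alphabetically by name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(name, score) for score, name in scored[:top_k]]


if __name__ == "__main__":
    toy_db = {"a": 0b1111, "b": 0b0011, "c": 0b1100}  # toy 4-bit fingerprints
    queries = [0b0011, 0b0111]
    print(profile_search(toy_db, queries, combine=sum))
    print(profile_search(toy_db, queries, combine=max))
```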
2.8 In silico chemical reactions: RChemDB
The repository's size can be expanded further by considering virtual compounds that can be synthesized from building-blocks in the ChemDB, which are readily available through the vendors. This can be achieved by annotating functional groups and applying in silico reactions to the current dataset. Implicit or explicit functional group annotation is derived using the SMARTS pattern method (James et al., 2004) with the OEChem implementation. The SMARTS pattern method is essentially a subgraph isomorphism algorithm applied to a molecule represented as a graph of labeled atoms and bonds. This method provides precise search results, but is computationally intensive and therefore not suitable for interactive use. In comparison, the fingerprint bit vector approach may have a slightly higher rate of false positives but is much faster and therefore more suitable for interactive use. Furthermore, unlike a simple SMARTS-based approach, which only provides a binary result (depending on whether a given substructure is present or not in a given structure), a fingerprint-based approach provides a real-valued similarity score between any two structures.
Once functional groups have been identified, combinatorial reactions that specify which groups can react are defined by the Daylight SMIRKS specification (James et al., 2004). Examples of reactions currently implemented include amide formation, Buchwald–Hartwig, cyanation of aromatic halides, Diels–Alder, ester formation, Grignard, Grubbs reaction, Heck, Hiyama, Negishi, phosphodiester formation, Sonogashira, Suzuki and Swern oxidation. RChemDB denotes the set of virtual molecules that can be generated from ChemDB by iterative applications of a library of in silico reactions. It is essential to note that as the reactions are iterated, the number of compounds grows exponentially. Thus RChemDB itself is virtual in the sense that we can generate and conduct directed searches of its compounds, but these are not stored directly into ChemDB. An example of application of RChemDB is small chemical refinement of a basic lead or scaffold, by letting the scaffold structure react with all of the very small molecules in ChemDB, e.g. with <10 atoms. These correspond to slightly <1% of ChemDB.
2.9 Additional datasets
In addition to particular subsets that can be extracted from ChemDB, we also maintain on a Web page associated with the ChemDB a list of downloadable datasets that can be used as training/validation sets in unsupervised or supervised machine learning and other computational experiments. These are also hyperlinked with the UCI Machine Learning Repository.
3 RESULTS
3.1 Statistics
ChemDB allows us to compute several useful statistics on small molecules, such as the histogram counting the number of molecules with a given number of stereocenters (Supplementary materials). A molecule with k stereocenters generally yields 2^k isomers or fewer. This is because isomers are based on stereocenters that generally have two configurations. The number of isomers can be <2^k due to geometric clashes or redundant combinations of stereocenters. The majority of chemicals in the system (2.5 million) have no stereocenters, which explains why, although CORINA is set to determine up to 16 isomers per chemical, the number of isomers is only about twice the number of unique records. In fact, 97.4% of the chemicals in ChemDB have at most 4 stereocenters, resulting in at most 16 configurations, and thus all their isomers are stored in ChemDB. Using values k = 5 or k = 6 would increase the coverage only marginally to 98.3 and 99.2%, respectively. For the small minority of compounds that have more than four stereocenters, rather than pre-computing and storing all configurations, a random sample of 16 isomers is pre-computed and stored in ChemDB. Additional isomers can be generated on a per-request basis. Finally, at the extreme end of the distribution, there are currently 27 chemicals with 50 or more stereocenters. The top four (Cyanovirin, Heptakis-(6-O-maltosyl)-β-cyclodextrin, D-Alanyl-lipoteichoic acid and Scytovirin) have 111, 105, 86 and 86 stereocenters, respectively. Enumerating all of the isomers for the first chemical alone would yield potentially 2^111 ≈ 10^33 possible configurations. It is worth noting, however, that the majority of the chemicals in this tail are natural products or natural product derivatives. Thus only one or two isomers of each chemical are likely to exist in nature and be available from vendors. ChemDB histograms for other molecular descriptors including the number of chiral bonds, chiral atoms, H-bond donors, H-bond acceptors, rotatable bonds and rigid segments per molecule are shown in Figure 3. Additional pairwise statistics, displaying for instance the weak correlation between molecular weight and (predicted) solubility, are given in the Supplementary materials.
3.2 In silico reactions
Examples of simple reaction processing capabilities implemented in ChemDB are given in Figures 4 and 5. Figure 4, derived from the ChemDB interface, shows how amino groups and carboxylic acids react to form an amide bond. Besides expanding the dataset by predicting reaction products, the reaction processing capabilities can be applied to other novel purposes. For example, they can be used as part of a screen for potential polymer components. A simple polymer screen (identifying candidates that can at least self-polymerize) is accomplished by identifying each molecule that, given a library of reactions, can iteratively react with itself and with the products of these reactions. Figure 5 shows how DNA can be rediscovered in ChemDB using a simple polymer screen.
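The logic of such a self-polymerization screen can be sketched with a deliberately crude abstraction in which a molecule is reduced to a count of its reactive functional groups and a reaction consumes one group from each partner. This is an assumption made for illustration; the actual implementation operates on full structures through SMARTS and SMIRKS patterns, and the names `react`, `can_self_polymerize` and the group labels below are hypothetical.

```python
from collections import Counter

# Toy reactions: each consumes one group from each of the two partners.
AMIDE_FORMATION = ("amine", "carboxylic_acid")
PHOSPHODIESTER_FORMATION = ("hydroxyl", "phosphate")


def react(mol_a, mol_b, reaction):
    """Return the product's functional-group counts, or None if no reaction occurs."""
    group_a, group_b = reaction
    for first, second in ((mol_a, mol_b), (mol_b, mol_a)):
        if first[group_a] > 0 and second[group_b] > 0:
            product = first + second
            product[group_a] -= 1
            product[group_b] -= 1
            return +product  # unary plus drops groups whose count fell to zero
    return None


def can_self_polymerize(monomer, reactions, rounds=3):
    """Check that the growing chain can keep reacting with fresh monomer."""
    monomer = Counter(monomer)
    chain = Counter(monomer)
    for _ in range(rounds):
        for reaction in reactions:
            product = react(chain, monomer, reaction)
            if product is not None:
                chain = product
                break
        else:
            return False  # no reaction in the library applies any more
    return True


if __name__ == "__main__":
    library = [AMIDE_FORMATION, PHOSPHODIESTER_FORMATION]
    amino_acid = {"amine": 1, "carboxylic_acid": 1}
    diamine = {"amine": 2}
    print(can_self_polymerize(amino_acid, library))
    print(can_self_polymerize(diamine, library))
```

A monomer carrying both partners of a reaction, such as an amino acid or a nucleotide-like unit with a hydroxyl and a phosphate, passes the screen, whereas a molecule carrying only one kind of group does not.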
3.3 Web interface and searches
Figure 6 depicts a composite screenshot from the ChemDB interface upon performing an integrated chemical similarity search. Shown inset on the left is the structure for a chemical known to be an inhibitor of monoacylglycerol lipase (MGL), an intracellular serine hydrolase that catalyzes the hydrolysis of 2-arachidonoylglycerol (2-AG), a primary endogenous cannabinoid in the mammalian brain. Recent studies suggest an MGL inhibitor can mediate opioid-independent stress-induced analgesia, identifying MGL as an important drug target (Hohmann et al., 2005). In an effort to find additional inhibitors in collaboration with chemists and pharmacologists (Drs Chamberlin and Piomelli), ChemDB was searched for chemicals with structural similarity to a known inhibitor. In this case similarity is computed using the Tversky measure (Tversky, 1977; Rouvray, 1992) applied to the binary fingerprints. Based upon a mechanistic understanding of the inhibitor, our collaborators provided feedback suggesting that chemicals of interest may not require complete structural similarity to the original chemical. Instead, the two structures shown in the sketcher window in the top left should be contained as substructures/functional groups of the desired structure. As shown in the figure, beyond substructures, the search can be further refined by restricting the ranges of several molecular descriptors, in this case the number of rotatable bonds, predicted XLogP and molecular weight. The particular set of values selected in this example reflects a customized combination of Lipinski's and Veber's rules. All compounds are ranked in decreasing order of similarity, but only the top three results are shown here together with their similarity score and basic information, including corresponding vendor. The interface also displays a dynamically generated histogram representing the similarity score distribution for every chemical in the database relative to the original structure on a logarithmic scale.
After human-expert examination, several top hits obtained from these searches have been ordered from the corresponding vendors and are being tested in the laboratory.
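For reference, the Tversky measure used in the search above can be written for binary fingerprints as the number of shared bits divided by the shared bits plus weighted counts of the bits unique to each fingerprint. The Python sketch below is a minimal illustration; the weights `alpha` and `beta` are free parameters, and the values used in the ChemDB search are not stated here, so the defaults are assumptions.

```python
def popcount(x):
    """Number of bits set in a non-negative integer."""
    return bin(x).count("1")


def tversky(query, target, alpha=1.0, beta=1.0):
    """Tversky similarity between two fingerprints stored as Python ints.

    alpha weights bits found only in the query, beta weights bits found
    only in the target. alpha = beta = 1 gives the Tanimoto coefficient.
    """
    common = popcount(query & target)
    only_query = popcount(query & ~target)
    only_target = popcount(target & ~query)
    denominator = common + alpha * only_query + beta * only_target
    return common / denominator if denominator else 0.0


if __name__ == "__main__":
    query, target = 0b0011, 0b0111
    print(tversky(query, target))            # symmetric (Tanimoto) weighting
    print(tversky(query, target, 1.0, 0.0))  # asymmetric weighting
```

Setting `beta` to zero ignores extra features in the target, so a target that contains every query bit scores 1.0. This asymmetric behavior is useful when the query is expected to appear as a substructure of the hits.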
4 DISCUSSION
While most commercial databases of small molecules are smaller than or comparable in size to ChemDB, there exist a few that are larger, notably the CAS registry of the ACS with the related SciFinder tool. While these commercial databases may contain useful information, they do not always provide flexible chemoinformatics tools or interfaces. For instance, even in the ACS database, queries are allowed only one compound at a time and the full database is not downloadable. As in bioinformatics with GenBank or PDB, queries performed one item at a time may be satisfactory for many users, but researchers involved in the development and application of large-scale datamining methods need full access to the entire corpus of data. Furthermore, the cost of these commercial databases is often very significant, at least from an academic standpoint.
To address the data bottleneck created in part by the ACS, public, downloadable, chemical repositories have begun to emerge. In addition to NIH's PubChem (http://pubchem.ncbi.nlm.nih.gov), examples of other public database efforts related to ChemDB include Harvard's ChemBank (Strauseberg and Schreiber, 2003), UCSF's ZINC (Irwin and Shoichet, 2005) and the European Bioinformatics Institute's ChEBI (http://www.ebi.ac.uk/chebi). While in the long run some degree of consolidation among these efforts can be expected (we are currently depositing ChemDB compounds into PubChem), in the short run a diversity of efforts with different aims and approaches allows for the exploration of different solutions and tradeoffs. Indeed, the existing databases have slightly different goals and properties, in terms of size, focus, availability and informatics algorithms for searching and other operations. At the time of this writing, for instance, PubChem and ChemBank are smaller in size (approximately 1 million compounds) with a greater emphasis on literature references (PubChem) and experimental bioactivity annotation (ChemBank). PubChem, ChemBank and ChemDB allow for searching compounds by IUPAC/English names. ChemBank, however, is not fully downloadable. ZINC is fully downloadable and perhaps closest in size and spirit to ChemDB, with a primary focus on structure download to facilitate docking. Unlike ZINC and other public repositories, ChemDB's focus goes beyond drug discovery and includes the development of new computational tools for annotating, searching and mining large repositories of chemical data. In particular, the current flexible search capabilities found in ChemDB are unique and so are its chemical reaction capabilities, among publicly available databases. A table in the Supplementary materials summarizes some of the tradeoffs between these synergistic public efforts. Such a table, however, quickly becomes outdated as all of these repositories are undergoing rapid evolution.
Chemical descriptors and annotations are clearly essential for exploring chemical space directly, as well as indirectly for the development of efficient computational annotation methods. Many molecular descriptors, such as molecular weight and number of rotatable bonds, are precisely defined and can be computed exactly. Other computational annotations, such as the degree of solubility (log P) or 3D structures, are noisy and subject to closer scrutiny. In particular, the predicted 3D structures are important but of some concern since we, and other authors, have noted that predictive methods based on 3D structure can be outperformed by methods based on 2D structure alone (Swamidass et al., 2005). While predicting the structure of small molecules is easier than predicting the structure of proteins, it is still essential to run large-scale tests to assess the quality of those predictions and whether they can be used reliably. For this purpose, we have recently acquired a license to the Cambridge Structural Database system, another commercial repository, containing the experimentally determined 3D structures of approximately 300 000 molecules to validate the quality of predicted structures.
A related important problem arises with stereochemistry. In ChemDB we have adopted the solution of storing up to 16 isomers for each compound, but we currently do not test for the relevance or synthetic feasibility of these compounds. In the future it may be possible to heuristically guess more relevant isomers by cross-referencing structures with other databases such as PubChem and the PDB and perhaps more intelligent decoding of chemical names; our database schema immediately accommodates these extensions. As far as synthetic feasibility, it is an important criterion that should also be implemented in the future. Currently, we believe the enumeration of theoretical compounds has value in and of itself, as we are not just interested in cataloging known compounds but also pushing the boundaries of knowledge towards potential compounds for a better understanding of chemical space. Even from a practical standpoint, a theoretical compound found to be of particular value in a docking study, for instance, may spur the interest of chemists towards its synthesis. Without enumeration of theoretical compounds, this particular compound would have not even been considered. Furthermore, these theoretical compounds are not random but logically derived by computational methods that stack the odds in favor of finding a reasonable synthetic pathway.
Finally, the scarcity of publicly available chemical annotation points to the need for new approaches to chemical annotation that could include the following: (1) development of automated information retrieval systems to derive annotations from chemical literature; (2) sharing of private or commercial annotation by, for instance, the ACS or large pharmaceutical companies and (3) development of collaborative, coordinated, and large-scale annotation efforts across academic centers, similar to those used in the other life sciences. Coupling public databases with public annotation efforts will lead in time to repositories that may allow predictive chemical informatics to blossom and develop tools to fully explore chemical space, from drug discovery to new materials to the origin of life.
Acknowledgments
This work was supported by an NIH Biomedical Informatics Training grant (LM-07443-01) and an NSF MRI grant (EIA-0321390) to P.B., and by the UCI Medical Scientist Training Program and a Harvey Fellowship to S.J.S. We would also like to acknowledge the OpenBabel project and OpenEye Scientific Software for their free software academic license, and Drs Chamberlin, Nowick, Piomelli and Weiss for their useful feedback. Funding to pay the Open Access publication charges for this article was provided by the Institute for Genomics and Bioinformatics at UCI.
Conflict of Interest: none declared.
FOOTNOTES
The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.
Received on July 10, 2005; revised on September 11, 2005; accepted on September 18, 2005
| REFERENCES |
|---|
Agrafiotis, D.K., et al. (2002) Combinatorial informatics in the post-genomics era. Nat. Rev. Drug Discov., 1, 337-346.
Berman, H.M., et al. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235-242.
Dobson, C.M. (2004) Chemical space and biology. Nature, 432, 824-828.
Fligner, M.A., et al. (2002) A modification of the Jaccard/Tanimoto similarity index for diverse selection of chemical compounds using binary strings. Technometrics, 44, 110.
Flower, D.R. (1998) On the properties of bit string-based measures of chemical similarity. J. Chem. Inf. Comput. Sci., 38, 378-386.
Frimurer, T.M., et al. (2000) Improving the odds in discriminating drug-like from non drug-like compounds. J. Chem. Inf. Comput. Sci., 40, 1315-1324.
Gasteiger, J., et al. (1996) Chemical information in 3D-space. J. Chem. Inf. Comput. Sci., 36, 1030-1037.
Hohmann, A.G., et al. (2005) An endocannabinoid mechanism for stress-induced analgesia. Nature, 435, 1108-1112.
Houghten, R.A. (2000) Parallel array and mixture-based synthetic combinatorial chemistry: tools for the next millennium. Ann. Rev. Pharmacol. Toxicol., 40, 273-282.
Irwin, J.J. and Shoichet, B.K. (2005) ZINC - a free database of commercially available compounds for virtual screening. J. Chem. Inf. Comput. Sci., 45, 177-182.
James, C.A., et al. (2004) Daylight Theory Manual.
Jonsdottir, S.O., et al. (2005) Prediction methods and databases within chemoinformatics: emphasis on drugs and drug candidates. Bioinformatics, 21, 2145-2160.
Kaiser, J. (2005a) Chemists want NIH to curtail database. Science, 308, 774.
Kaiser, J. (2005b) House approves 0.5% raise for NIH, comments on database. Science, 308, 1729.
Lipinski, C. and Hopkins, A. (2004) Navigating chemical space for biology and medicine. Nature, 432, 855-861.
Lipinski, C.A., et al. (1997) Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv. Drug Deliv. Rev., 23, 3-25.
Marris, E. (2005) Chemistry society goes head to head with NIH in fight over public database. Nature, 435, 718-719.
Micheli, A., Sperduti, A., Starita, A., Biancucci, A.M. (2003) A novel approach to QSPR/QSAR based on neural networks for structures. In Cartwright, H. and Sztandera, L.M. (Eds.), Soft Computing Approaches in Chemistry. Springer Verlag, Heidelberg, Germany, pp. 265-296.
Ralaivola, L., et al. (2005) Graph kernels for chemical informatics. Neural Netw., in press.
Rouvray, D. (1992) Definition and role of similarity concepts in the chemical and physical sciences. J. Chem. Inf. Comput. Sci., 32, 580-586.
Sadowski, J., et al. (1994) Comparison of automatic three-dimensional model builders using 639 X-ray structures. J. Chem. Inf. Comput. Sci., 34, 1000-1008.
Schölkopf, B. and Smola, A.J. (2002) Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press.
Schreiber, S.L. (2000) Target-oriented and diversity-oriented organic synthesis in drug discovery. Science, 287, 1964-1969.
Schreiber, S.L. (2003) The small-molecule approach to biology: chemical genetics and diversity-oriented organic synthesis make possible the systematic exploration of biology. Chem. Eng. News, 81, 51-61.
Stockwell, B.R. (2004) Exploring biology with small organic molecules. Nature, 432, 846-854.
Strausberg, R.L. and Schreiber, S.L. (2003) From knowing to controlling: a path from genomics to drugs using small molecule probes. Science, 300, 294-295.
Swamidass, S.J., et al. (2005) Kernels for small molecules and the prediction of mutagenicity, toxicity, and anti-cancer activity. Bioinformatics, 21, Suppl. 1, 359-368.
Tversky, A. (1977) Features of similarity. Psychol. Rev., 84, 327-352.
Veber, D., et al. (2002) Molecular properties that influence the oral bioavailability of drug candidates. J. Med. Chem., 45, 2615-2623.
Voigt, J.H., et al. (2001) Comparison of the NCI open database with seven large chemical structural databases. J. Chem. Inf. Comput. Sci., 41, 702-712.
Weininger, D., et al. (1989) SMILES. 2. Algorithm for generation of unique SMILES notation. J. Chem. Inf. Comput. Sci., 29, 97-101.