

Bioinformatics Advance Access originally published online on February 15, 2005
Bioinformatics 2005 21(10):2145-2160; doi:10.1093/bioinformatics/bti314

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

Prediction methods and databases within chemoinformatics: emphasis on drugs and drug candidates

Svava Ósk Jónsdóttir 1, Flemming Steen Jørgensen 2 and Søren Brunak 1,*

1Center for Biological Sequence Analysis, BioCentrum-DTU, Technical University of Denmark DK-2800 Kongens Lyngby, Denmark
2Department of Medicinal Chemistry, Danish University of Pharmaceutical Sciences Universitetsparken 2, DK-2100 Copenhagen, Denmark

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 INTRODUCTION
 DATABASES
 DESCRIPTORS USED FOR CHEMICAL...
 REDUNDANCY OF DATA SETS
 METHODS FOR CLASSIFICATION OF...
 PREDICTION OF PROPERTIES
 ROLE OF DATABASES AND...
 DATABASES AND COMPUTER PROGRAMS
 REFERENCES
 

Motivation: To gather information about available databases and chemoinformatics methods for prediction of properties relevant to the drug discovery and optimization process.

Results: We present an overview of the most important databases with 2-dimensional and 3-dimensional structural information about drugs and drug candidates, and of databases with relevant properties. Access to experimental data and numerical methods for selecting and utilizing these data is crucial for developing accurate predictive in silico models. Many interesting predictive methods for classifying the suitability of chemical compounds as potential drugs, as well as for predicting their physico-chemical and ADMET properties have been proposed in recent years. These methods are discussed, and some possible future directions in this rapidly developing field are described.

Contact: svava{at}cbs.dtu.dk


    INTRODUCTION
Chemoinformatics (Blake, 2000; Olsson and Oprea, 2001; Kubinyi, 2003) is a rapidly growing field, with a huge application potential. Chemoinformatics concerns the gathering and systematic use of chemical information, and the use of those data to predict the behavior of unknown compounds in silico.

The word ‘chemoinformatics’ is rather new, but papers that fall within this field date back to the mid-1960s, when structure–activity relationships (SAR) were proposed based on the work of Hansch and Fujita (1964) and Fujita et al. (1964) and the earlier work of Hammett and Taft. The first textbooks on chemoinformatics have been published recently (Leach and Gillet, 2003; Gasteiger and Engel, 2003; Gasteiger, 2003; Bajorath, 2004).

The related field bioinformatics (Durbin et al., 1998; Baldi and Brunak, 2001; Mount, 2001; Orengo et al., 2002; Lengauer, 2002) was fully established in the 1990s, and has become an integrated activity in most major pharmaceutical companies. The basis behind the success of bioinformatics was the access to a vast amount of experimental data, together with the structured nature of genetic information. Several authors have recently published reviews on the use of bio- and chemoinformatics methods in drug discovery processes (Stahura and Bajorath, 2002; Hopkins and Groom, 2002; Golden, 2003).

The availability of experimental data relevant to chemoinformatics modeling is, however, much more restricted. In fact, most chemical information resides in the private domain. If the success of bioinformatics is to be repeated, the availability of new experimental data will be an absolute necessity for developing efficient and robust models for prediction of various properties (Beresford et al., 2002).

Implementing, handling and searching chemical databases is a crucial aspect of chemoinformatics (Miller, 2002). Chemical database techniques and data mining methods will improve as this field evolves, partly through wider implementation of new data structures (Miled et al., 2003). Methods for full-text data mining are likely to become very powerful in the years to come, and will presumably play a highly important role in the general area of chemoinformatics. A new XML (extensible markup language) based approach for managing molecular information, ‘chemical markup language’ (CML), was proposed by Murray-Rust and Rzepa (1999, http://www.xml-cml.org). Figure 1 shows the number of references to the words ‘bioinformatics’, ‘chemoinformatics’, ‘chemogenomics’ and ‘metabonomics’ in PubMed from 1992 to 2004. The present trend for chemoinformatics resembles the trend in bioinformatics five to ten years ago. It should be mentioned that this graph is based on one database only, PubMed, and is intended to give an idea of the development in publishing frequency in these areas, not a complete overview.



 
Fig. 1 Number of papers in PubMed 1992–2004 containing the keywords ‘bioinformatics’ (top left), ‘chemoinformatics’ or ‘cheminformatics’, ‘chemogenomics’ or ‘chemical genomics’ and ‘metabonomics’ or ‘metabolomics’.

 
It is clear that the drug discovery and optimization process is undergoing very significant changes. Many more hits are found than previously, especially due to the advances gained in combinatorial chemistry (Gallop et al., 1994; Gordon et al., 1994) and high throughput screening (HTS). The approach used in drug discovery has been linear with respect to various relevant properties, but more parallel approaches are evolving, where not only the potency (activity) and selectivity of the lead is examined at an early stage, but also other key properties. Many of the compounds drawn out of combinatorial libraries may look promising at first, but they fail at later stages in the drug discovery process due to undesired properties. A compound can for example be feasible based on molecular structure, but due to aggregation, limited solubility or limited uptake in the human organism it is not useful as a drug. Many pharmaceutical companies might even be repeating the same mistakes, due to these problems. Methods for assessing these properties at a very early stage, both experimentally and computationally, are thus highly desirable. This is expected to lower the cost of drug discovery and optimization significantly, and hopefully provide an increased number of useful leads. In cases where a lead has sufficiently high activity, but various properties need to be improved, chemoinformatics methods could be used to modify substructures within the lead space with minimal effect on the activity profile (Oprea et al., 2001b).

Techniques for in vitro measurement of various properties have improved considerably, as have models used for assessing how well a given compound is absorbed from the gastrointestinal tract (GIT) into the blood stream or how readily it crosses the blood–brain barrier (BBB) (van de Waterbeemd et al., 2003). In most cases in vitro measurements are carried out using different cell models, human or animal. For measuring permeability, and thus absorption, the use of artificial membranes, called PAMPA (parallel artificial membrane permeability assay), has become a popular alternative to the CACO-2 (human colon carcinoma) cell line (van de Waterbeemd et al., 2003). The cost of each measurement using PAMPA is 1/20 of that of CACO-2, seemingly with comparable accuracy (van de Waterbeemd et al., 2001). PAMPA is, however, only useful for measuring passive permeability. For in vitro studies of liver toxicity, the rat hepatoma cell line is a well characterized and evaluated method, which is often followed up by human hepatoma cell line studies (van de Waterbeemd et al., 2001). Toxicogenomic studies using microarray expression techniques have also become increasingly important. Major concerns in toxicological studies are multiple endpoints, dose–response relationships and selection of endpoints. Other important aspects include purity of drugs, protein binding and metabolic stability.

Likewise, various computational methods are evolving rapidly at present. Computational techniques used to search through chemical libraries and databases, so-called virtual screening methods, have become increasingly popular in drug discovery (Walters et al., 1998; Böhm and Schneider, 2000). A whole range of computational techniques are used for searching for molecular similarities and dissimilarities (Sello, 1998; Willett, 2000; Bajorath, 2002; Gillet et al., 2003), for extracting information about pharmacophores (structural models of targets or binding sites) from compound libraries (Hopfinger and Duca, 2000), for prediction of properties, and for studying molecular interactions at the atomic level, among other things (Miller, 2002). Chemoinformatics is strongly linked to computational chemistry and molecular modeling (Burkert and Allinger, 1982; Jensen, 1999; Cramer, 2002). Molecular modeling methods are particularly useful for conducting conformational analysis of molecules, and for assessing the strength of intermolecular interactions.

Newly established fields like chemogenomics (or chemical genomics), metabonomics and metabolomics also play increasingly important roles in modern drug discovery and development. Chemogenomics (Browne et al., 2002) deals with interactions between chemical compounds and living systems in terms of induced genomic response. In metabonomics (Nicholson and Wilson, 2003) relatively low-molecular weight materials produced during genomic expression within a cell are studied, normally by use of 1H-NMR spectroscopy and multivariate data analysis (chemometrics) (Geladi and Kowalski, 1986a, b). It has been shown to be a useful tool for understanding drug efficacy and toxicity. Metabolomics is similar to metabonomics, but where metabonomics deals with integrated, multicellular, biological systems, metabolomics deals with simple cell systems.

This review presents an extensive and thorough overview of small molecule databases relevant to drug discovery, and methods for classifying chemical compounds as being drug-like and/or lead-like. A brief overview of various methods for predicting ADMET (absorption, distribution, metabolism, excretion and toxicity) properties, BBB penetration and key physico-chemical properties is also given, but it must be stressed that this overview is by no means complete. References are given to more detailed review papers for the various properties.

The fundamental behavior of substances is governed by intermolecular interactions at various levels. Physico-chemical properties of drug molecules are mostly governed by the interactions between the drug molecules and the surrounding aqueous environment. The potency of drugs depends on how well a given drug molecule (ligand) fits into a target, and how strong the interactions between the ligand and the target are, often studied with computational methods, e.g. docking and scoring methods. ADMET properties depend on how the drug molecule interacts with a large number of macromolecules in the human organism. In cases where a patient uses more than one drug, drug–drug interactions are also of importance, and such interactions are unfortunately often ignored. The human organism is an immensely complicated system, but modeling of properties can be accomplished by studying various subsystems and their interactions using a broad range of computational and experimental methods. Recently, Macchiarulo et al. (2004) studied cross interactions between enzymes and small molecules in a cell that are caused by similarities in the molecular structures of the metabolites (small molecules) and the flexibility in binding at the active sites of the enzymes. Based on their results they propose that HTS should not only involve a selection of small molecules, but also a panel of proteins to test for cross-reactivity.


    DATABASES
The availability of reliable experimental data and basic structural information is crucial for successful modeling work. In this section various databases relevant to drug discovery and development are discussed, including databases for available organic compounds, screening compounds, medicinal agents (drugs), as well as databases with ADMET properties and physico-chemical properties. A few protein databases are also mentioned.

An overview of the databases and key features is given in Table 1, and the type of properties provided is listed in Table 2. Most of the databases provide 1-dimensional (1D), 2-dimensional (2D) and 3-dimensional (3D) structural information (see below for further detail on molecular descriptors). The 1D coding is either given in the SD-file format from MDL Information Systems or as SMILES (simplified molecular input line entry specification) strings (Weininger, 1988, http://www.daylight.com/smiles/f_smiles.html). In cases where only 1D and 2D information is given, several programs for generating 3D structures from 2D structures are available. One should, however, bear in mind that due to conformational flexibility many 3D structures (equilibrium conformations) may exist for each molecule. Examples of 1D, 2D and 3D structures are shown in Figure 2.
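To make the 1D representation concrete, the following minimal Python sketch tallies the heavy (non-hydrogen) atoms encoded in a SMILES string. It is an illustration only: the function name `heavy_atom_counts` and the two regular expressions are assumptions introduced here, they cover just the common organic-subset and bracket-atom syntax, and they neither validate the string nor derive 2D or 3D coordinates. For real work a full chemoinformatics toolkit would be the appropriate choice.

```python
import re
from collections import Counter

# One atom token: a bracket atom, a two-letter halogen, an organic-subset
# atom, or an aromatic (lowercase) atom. Bonds, branches and ring-closure
# digits are simply skipped.
ATOM_TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")

# Element symbol inside a bracket atom, after an optional isotope number,
# e.g. "N" in [NH4+], "Na" in [Na+], "n" in [nH].
BRACKET_SYMBOL = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})")


def heavy_atom_counts(smiles: str) -> Counter:
    """Count non-hydrogen atoms per element in a SMILES string (toy parser)."""
    counts = Counter()
    for token in ATOM_TOKEN.findall(smiles):
        if token.startswith("["):
            match = BRACKET_SYMBOL.match(token)
            if match is None:
                raise ValueError(f"unrecognised bracket atom: {token}")
            symbol = match.group(1)
        else:
            symbol = token
        # Aromatic atoms are written in lowercase; normalise to element case.
        symbol = symbol.capitalize()
        if symbol != "H":
            counts[symbol] += 1
    return counts


if __name__ == "__main__":
    # Aspirin, written as a SMILES string.
    print(heavy_atom_counts("CC(=O)Oc1ccccc1C(=O)O"))
```

For the aspirin string the sketch finds nine carbons and four oxygens, consistent with the molecular formula C9H8O4. The connectivity (which atom is bonded to which) is also present in the string but is deliberately ignored here.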


 
Table 1 Overview of small molecule databases and the key information they provide

 

 
Table 2 Overview of relevant experimental physico-chemical and ADMET properties provided in small molecule databases

 


 
Fig. 2 Structural information for a molecule can be given as a 1D SMILES string, a 2D drawing or a 3D representation. The SMILES string and the 2D drawing contain information about which atoms are bound to one another, and the 3D representation shows how the atoms are located geometrically in relation to one another. The atomic coordinates used are shown behind the 3D picture.

 
General small molecule and screening compound databases
The best known database for synthetic organic and inorganic compounds is the Available Chemicals Directory (ACD),1 which currently includes approximately 300,000 unique substances. ACD is often used for representing non-drugs in model development, and although some compounds within ACD may be biologically active, the majority of compounds are not. The Chemicals Available for Purchase (CAP) database contains similar information, where the CAP Reagents database currently has about 240,000 molecular structures and CAP Complete approximately 1.6 million compounds. A new version of the Spresi database, SPRESIweb, contains more than 4.5 million molecules, including various physical properties, and 3.5 million reactions. The physical property information is, however, not well organized within the Spresi database.

A number of databases have been developed especially for screening purposes. Three main catalogs, previously being part of ACD, the SALOR (Sigma–Aldrich Library of Rare Chemicals), Maybridge and Bionet, are now available through MDL Screening Compounds Directory (MDL SCD). The SPECSnet database contains screening compounds, building blocks and natural products provided by SPECS.

Databases for medicinal agents
The three main database collections of drug molecules are the Comprehensive Medicinal Chemistry database (CMC), MDL Drug Data Report (MDDR) and the Derwent World Drug Index (WDI). CMC currently contains 8473 drug compounds, and is updated annually with compounds identified for the first time in the United States Adopted Names (USAN) list. The CMC database also includes information about drug class, and measured or estimated values for the acid–base dissociation constant (pKa) and the octanol/water partition coefficient (log P). For log P, 120 experimental and 8300 calculated records are provided, and for pKa 1200 measured records are given. The MDDR contains 132,726 drugs launched and under development, collected from the patent literature and other relevant sources. MDDR is updated monthly, which adds up to approximately 10,000 new compounds per year, and also includes information about drug class and drug activity in qualitative terms. WDI contains around 73,000 marketed drugs and drugs under development, with each record classified by drug activity, mechanism of action and treatment, among other factors. The Derwent Drug File is a highly focused database of selected journal articles and conference reports on all aspects of drug development.

There are a number of other available drug databases. The National Cancer Institute database (NCI) contains 213,628 compounds, and is made up of four publicly available NCI databases, including both the AIDS and the Cancer databases. The MedChem database consists of 48,500 compounds, 67,000 measured log P values, 13,700 measured pKa values and 19,000 pharmacological (drug) activities. According to the providers of the MedChem Bio-loom database, all available data for log P and pKa have been gathered from the literature. In addition, the database includes a sorted list containing only those measurements carried out with high quality methods. The MedChem database also contains a collection of biological and physico-chemical data intended for QSPR (quantitative structure–property relationship) modeling, and software for calculating log P values (Clog P). The WOrld of Molecular BioAcTivity database (WOMBAT) is a very comprehensive collection of biological activity data, and contains 76,165 molecules, 68,543 SMILES, 143,000 activities for 630 targets, 1230 measured log P values and 527 measured solubility values. The biological activities in this database are mainly related to receptor antagonists and enzyme inhibitors, which are ranked by target class. BIOSTER is a database of bioanalogous pairs of molecules (bioisosteres), and contains over 9500 active molecules, including drugs, agrochemicals and enzyme inhibitors. The Chapman & Hall/CRC Dictionary of Drugs contains over 41,000 drugs, together with available physico-chemical property and toxicity data.

Databases for ADMET properties of drugs
The MDL Metabolite database contains information about metabolism pathways for xenobiotic compounds (primarily medicinal agents) and their biotransformations, together with experimental data from in vivo and in vitro studies. The database includes more than 10,000 parent compounds, over 64,000 biotransformations and over 40,000 molecules (parent compounds, intermediates and final metabolites). Metabolism data are also found in the Accelrys Metabolism and Biotransformation databases. The Metabolism database covers vertebrates (animals), invertebrates and plants. This database contains 4101 parent compounds and 30,000 transformations, and is being extended to 40,000 transformations. The Accelrys Biotransformation database, on the other hand, is a stand-alone database for vertebrates only, containing 1744 unique parent compounds and 9809 transformations.

The MDL Toxicity database includes more than 158,000 chemical substances compiled from several major sources, where around 65% are drugs and drug development compounds. The database contains six categories of data, i.e. acute toxicity, mutagenicity, skin/eye irritation, tumorigenicity, reproductive effects and multiple dose effects. It also contains detailed information about how the in vivo and in vitro experiments were carried out, including species of organism or tissue studied, dose, etc. The DSSTox project (Distributed Structure-Searchable Toxicity Public Database Network) (Richard and Williams, 2002, http://www.epa.gov/nheerl/dsstox) is a forum for publishing toxicity data, which presently includes databases for carcinogenicity in various animals, as well as acute toxicity.

The ToxExpress Reference Database is a toxicogenomic database, which contains gene expression profiles from known toxicants. This database is built on in vivo and in vitro studies of exposure to toxicants, and around 110 compounds have been profiled.

Other databases that might be of interest, although primarily focused on environmental and occupational health issues, are the TSCA93 database, containing over 100,000 substances, and the TOXicology Data NETwork (TOXNET) database cluster.

Other important small molecule databases
The ACD/Labs Physico-Chemical Property databases contain a variety of physico-chemical data for organic substances, including experimental log P for 18,400 compounds, over 31,000 measured pKa values for 16,000 compounds and aqueous solubility (logS) for 5000 compounds. Another collection of experimental data is the Physical Properties Database (PHYSPROP) from Syracuse Research Corporation (SRC), which contains 13,250 measured log P values, 1652 pKa values and 6340 records for aqueous solubility. Both ACD/Labs and SRC market software for prediction of these and other properties. A third database, the AQUASOL dATAbASE, contains aqueous solubilities for almost 6000 compounds.

There are other important databases for small molecules. CrossFire Beilstein contains more than 8 million organic compounds, over 9 million reactions, a variety of properties, including various physical properties, pharmacodynamics and environmental toxicity. This database contains over 500,000 bioactive compounds. Other important collections of physical properties of organic compounds are CRC (Lide, 2003, http://www.hbcpnetbase.com) (11,000 compounds), Chapman & Hall/CRC Properties of Organic Compounds (29,000 compounds) and the Design Institute of Physical Properties Database (DIPPR) (1743 compounds).

Protein databases and ligand information
In this context, we would also like to mention a few macromolecular and crystallographic databases with information relevant to drug discovery. The Protein Data Bank (PDB) is the main source for available protein crystal structures and structural information obtained with NMR spectroscopy, with currently more than 28,000 structures, and a weekly growth of about 100 structures. The Cambridge Structural Database (CSD) is the most comprehensive collection of crystallographic data for small molecules, containing around 305,000 structures. Figure 3 shows the evolution in number of structures within PDB and CSD over the last 12 years. Although CSD has a much larger number of entries, the relative growth rate within PDB is higher, which clearly shows the increased worldwide focus on protein structure determination (structural genomics).



 
Fig. 3 Number of crystallographic structures in the CSD and the PDB in the period 1992–2004. This figure can be viewed in colour on Bioinformatics online.

 
PDB contains a number of useful sub-databases, including a Target Registration Database. Relibase (Hendlich et al., 2003; Günther et al., 2003, http://relibase.ccdc.cam.ac.uk) is a powerful data mining tool which utilizes structural information about protein–ligand complexes within PDB for comprehensive analysis of protein–ligand interactions. Relibase+ is an improved version of Relibase with a number of additional features, like protein–protein interaction searching and a crystal packing module for studying crystallographic effects around ligand binding sites. The LIGAND database (Goto et al., 1998, 2002, http://www.genome.ad.jp/ligand) is a chemical database for enzyme reactions, including information about structures of metabolites and other chemical compounds, their reactions in biological pathways, and about the nomenclature of enzymes.

Ji et al. (2003) recently published an overview of various data collections for proteins associated with drug therapeutic effects, adverse reactions and ADME available over the internet.

A number of databases containing protein–protein interaction data are available, including Database of Interacting Proteins (DIP) (Xenarios et al., 2002, http://dip.doe-mbi.ucla.edu), Biomolecular Interaction Network Database (BIND) (Bader et al., 2003, http://bind.ca), Molecular Interaction Database (MINT) (Zanzoni et al., 2002, http://cbm.bio.uniroma2.it/mint/), Comprehensive Yeast Protein Genome Database (MIPS) (Mewes et al., 2002, http://mips.gsf.de/genre/proj/yeast/), Yeast Proteome Database (YPD) (Hodges et al., 1999; Costanzo et al., 2000), the IntAct database (Hermjakob et al., 2004, http://www.ebi.ac.uk/intact/index.html) and Human Protein Reference Database (HPRD) (Peri et al., 2003, http://www.hprd.org), among others. Protein–protein interaction databases are collections of information about actual protein–protein contacts or complex associations within the proteome of specific organisms, in particular yeast, but also humans, fruit fly and other organisms.


    DESCRIPTORS USED FOR CHEMICAL STRUCTURES
In chemoinformatics the structural features of the individual molecules are described by various parameters derived from the molecular structure, the so-called descriptors (Todeschini and Consonni, 2000). The simplest types are constitutional (1D) or topological (2D) descriptors, describing the types of atoms, functional groups, or types and order of the chemical bonds within the molecules. Ghose atom types (Viswanadhan et al., 1989), CONCORD atom types, ISIS keys (MDL keys) (Durant et al., 2002) and various other atom/bond indices, discussed in the following section, are examples of such descriptors. Physico-chemical properties are often used as descriptors as well.

The ISIS keys and CONCORD atom types are often referred to as fingerprint methods, and were for example used in the training of neural networks (Frimurer et al., 2000) for compound classification. ISIS keys are a set of codes which represent the presence of certain functional groups and other structural features in the molecules, and CONCORD atom types are generated by assigning up to six atom-type codes available within the CONCORD program (Pearlman, 2000) to each atom.

In QSAR (quantitative structure–activity relationship) and QSPR modeling, constitutional and topological descriptors are combined with more complicated descriptors, representing the 3D structure of the molecules. This includes various geometrical descriptors, and most importantly a variety of electrostatic and quantum chemical descriptors. The electrostatic descriptors are parameters which depend on the charge distribution within the molecule, including the dipole moment. Examples of quantum chemically derived descriptors are various energy values like ionization energies, HOMO–LUMO gap, etc. Similarly descriptors derived from molecular mechanics can be used (Dyekjær et al., 2002). A variety of different types of descriptors used in QSAR and QSPR are discussed by Karelson (2000).

QSAR and QSPR models are empirical equations, used for estimating various properties of molecules, and have the form

$$P = a + b\,D_1 + c\,D_2 + d\,D_3 + \cdots \tag{1}$$
where P is the property of interest, a, b, c,... are regression coefficients and D1, D2, D3, ... are the descriptors.
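As an illustration of how the regression coefficients of such an empirical equation can be obtained, the sketch below fits a linear model by ordinary least squares using only the Python standard library. The presence of a constant term, the function names (`solve`, `fit_linear_qsar`, `predict`) and the descriptor values are all assumptions made for this example. Published QSAR and QSPR models typically rely on dedicated statistics packages and on careful descriptor selection and validation, none of which is shown here.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("descriptors are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_linear_qsar(descriptors, properties):
    """Least-squares coefficients [a, b, c, ...] for P = a + b*D1 + c*D2 + ..."""
    rows = [[1.0, *d] for d in descriptors]  # leading 1.0 carries the constant a
    n = len(rows[0])
    # Normal equations: (X^T X) w = X^T P
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    xtp = [sum(r[i] * p for r, p in zip(rows, properties)) for i in range(n)]
    return solve(xtx, xtp)


def predict(coefficients, descriptor_values):
    """Evaluate the fitted linear model for one compound."""
    return coefficients[0] + sum(
        w * d for w, d in zip(coefficients[1:], descriptor_values)
    )


if __name__ == "__main__":
    # Hypothetical descriptor vectors (D1, D2) and property values; not real data.
    training_descriptors = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0), (3.0, 5.0)]
    training_properties = [2.5, -0.5, 1.5, 3.5, 1.5]
    coeffs = fit_linear_qsar(training_descriptors, training_properties)
    print(coeffs, predict(coeffs, (2.0, 2.0)))
```

The normal equations are used here for brevity; with many correlated descriptors they become ill-conditioned, which is one reason why methods such as partial least squares are common in practice.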

The so-called 3D QSAR methods differ somewhat from traditional QSAR methods. The CoMFA (comparative molecular field analysis) (Cramer et al., 1988, http://www.tripos.com/sciTech/inSilicoDisc/strActRelationship/) approach, one of the most popular 3D QSAR methods, uses steric and electrostatic interaction energies between probe atoms and the molecules as descriptors. In the VolSurf method (Cruciani et al., 2000) a similar approach is used, where descriptors are generated from molecular interaction fields calculated by the GRID program (Goodford, 1985).


    REDUNDANCY OF DATA SETS
When using data extracted from large databases to train and test data-driven prediction tools, the redundancy, completeness and representativeness of the underlying data sets are extremely important. Redundancy strongly affects the evaluation of predictive performance and, most importantly, biases the method toward specific, over-represented subclasses of compounds.

Within the general bioinformatics area this is an issue that has been dealt with in great detail. It is well known that if a protein structure prediction algorithm is tested on sequences and structures which are highly similar to the sequences and structures used to train it, the predictive performance is significantly overestimated (Sander and Schneider, 1991). Therefore methods have been constructed for cleaning data sets of examples that are ‘too easy’, such that the performance of the algorithm on novel data will be estimated in a more reliable manner. One obviously needs to define the similarity measure between the data objects under consideration; for amino acid sequences one typically uses alignment techniques as the metric for quantitative, pairwise comparison (Hobohm et al., 1992). Techniques have also been developed for redundancy reduction of sequences containing functional sites, such as signal peptide cleavage sites in secreted proteins (Nielsen et al., 1996) or translation initiation sites in mRNA sequences (Pedersen and Nielsen, 1997).

In this respect chemoinformatics is still in its infancy compared to the thorough validation methods used in bioinformatics. For chemoinformatics applications, redundancy within databases can affect the learning capabilities of neural networks, decision trees and other methods, whereas redundancy between different databases generally introduces noise and reduces the discriminating power of the model.

Redundancy in this context means that the descriptor vectors of the compounds are similar, component for component. Similarity of the chemical compounds within a data set can lead to over-fitting of a model, such that predictions for compounds similar to those used in the training set are excellent, but predictions for dissimilar compounds are less accurate. There are several examples of compound clustering techniques where the Tanimoto coefficient (Patterson et al., 1996) has been used as the metric quantifying the similarity between compounds, normally with the goal of identifying leads by similarity, or of increasing the diversity of libraries used for screening (Reynolds et al., 1998; 2001; Voigt et al., 2001; Willett, 2003). The Tanimoto coefficient, T, compares binary fingerprint vectors, and is defined as,

$$T = \frac{N_{xy}}{N_x + N_y - N_{xy}} \qquad (2)$$
where Nxy is the number of 1 bits shared in the fingerprints of molecules x and y, Nx the number of 1 bits in the fingerprint of molecule x, and Ny the number of 1 bits in the fingerprint of molecule y. A Tanimoto coefficient approaching zero indicates that the compounds being compared are very different, while a Tanimoto coefficient approaching one indicates that they are very similar.
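As an illustration of these definitions, the following minimal sketch computes the Tanimoto coefficient as the number of shared 1 bits divided by the number of bits set in either fingerprint, i.e. Nxy / (Nx + Ny − Nxy). Representing a fingerprint as a Python set of ‘on’ bit positions, and returning zero for two empty fingerprints, are assumptions made here for convenience and are not taken from the cited work.

```python
def tanimoto(fp_x, fp_y):
    """Tanimoto coefficient of two binary fingerprints.

    Each fingerprint is given as a set of the positions of its 1 bits
    (an assumed representation chosen for this sketch).
    """
    n_xy = len(fp_x & fp_y)      # 1 bits shared by x and y
    n_x = len(fp_x)              # 1 bits in x
    n_y = len(fp_y)              # 1 bits in y
    denominator = n_x + n_y - n_xy
    # Two empty fingerprints have no defined similarity; 0.0 is an assumed convention.
    return n_xy / denominator if denominator else 0.0


x = {0, 3, 5, 8}
y = {0, 3, 6, 8, 9}
print(tanimoto(x, y))  # 3 shared bits, 4 + 5 - 3 = 6 bits in total: 0.5
```

Identical fingerprints give a value of one and fingerprints with no bits in common give zero, in line with the interpretation given above.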

In the drug-likeness predictor developed by Frimurer et al. (2000) the concept of redundancy reduction normally used within bioinformatics was transferred to compound data extracted from MDDR and ACD. In this case the compounds were represented by CONCORD atom types, and the Tanimoto coefficient (Patterson et al., 1996) was used to calculate the similarity between any two compounds. Using a given threshold for maximal similarity in terms of a Tanimoto coefficient of, say, 0.85, the redundancy in a data set can be reduced to a well-defined level. This type of technique also allows for the creation of common benchmark principles, and could replace frozen benchmark data sets, which often become outdated as databases grow over time. The performance of prediction techniques developed on different data sets can therefore be compared in a fair manner, despite the fact that the underlying data sets differ in size.
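The thresholding idea can be sketched as a simple greedy filter: a compound is kept only if its Tanimoto coefficient to every compound already kept does not exceed the chosen maximum. This is a minimal sketch under stated assumptions, not the procedure of Frimurer et al. (2000); the greedy order-dependent selection, the set-based fingerprints and the function names are all hypothetical choices made for illustration.

```python
def tanimoto(a, b):
    """Tanimoto coefficient for fingerprints stored as sets of 'on' bit positions."""
    shared = len(a & b)
    total = len(a) + len(b) - shared
    return shared / total if total else 0.0


def reduce_redundancy(fingerprints, threshold=0.85):
    """Return indices of a subset in which no pair is more similar than `threshold`.

    Greedy, order-dependent selection (an assumption of this sketch):
    each compound is compared only with the compounds kept so far.
    """
    kept = []
    for i, fp in enumerate(fingerprints):
        if all(tanimoto(fp, fingerprints[j]) <= threshold for j in kept):
            kept.append(i)
    return kept


# Hypothetical toy library of four fingerprints.
library = [
    set(range(0, 20)),      # compound 0
    set(range(0, 19)),      # compound 1: near-duplicate of compound 0 (T = 0.95)
    set(range(10, 30)),     # compound 2: partial overlap with compound 0 (T = 1/3)
    set(range(100, 110)),   # compound 3: unrelated
]
print(reduce_redundancy(library))  # compound 1 is removed: [0, 2, 3]
```

Because the threshold, rather than a fixed list of compounds, defines the benchmark, the same filter can be reapplied as the underlying databases grow.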

In most of the chemoinformatics applications developed so far this kind of data set cleaning has not been carried out. Instead test sets have been selected randomly from large, overall data sets, thereby introducing many cases where highly similar pairs of examples are found both in the training and test parts of the data sets. The estimation of the predictive performance is therefore much more conservative in the work of Frimurer et al., while it may be too high for some of the prediction tools where the split between training and test data has been established just by random selection.
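One way to make this effect visible is to count how many test compounds have a close neighbour in the training set. The sketch below is illustrative only: the function name `leaked_fraction`, the set-based fingerprints and the 0.85 default are assumptions made here, and none of the cited studies is known to report this particular diagnostic.

```python
def tanimoto(a, b):
    """Tanimoto coefficient for fingerprints stored as sets of 'on' bit positions."""
    shared = len(a & b)
    total = len(a) + len(b) - shared
    return shared / total if total else 0.0


def leaked_fraction(train, test, threshold=0.85):
    """Fraction of test compounds with at least one training compound above `threshold`."""
    if not test:
        return 0.0
    hits = sum(
        1 for t in test
        if any(tanimoto(t, r) > threshold for r in train)
    )
    return hits / len(test)


# Hypothetical random split in which one test compound nearly duplicates a training compound.
train = [set(range(0, 20)), set(range(50, 70))]
test = [set(range(0, 19)), set(range(200, 220))]
print(leaked_fraction(train, test))  # one of the two test compounds is 'too easy': 0.5
```

A value well above zero suggests that the reported test performance partly reflects recognition of near-duplicates rather than generalization to novel compounds.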

Note that similarity to a neural network is somewhat different from molecular similarity, because the network can correlate all the vector components with each other in a nonlinear fashion. Two very different compounds may appear very similar for a neural net in terms of functionality, while two quite similar compounds can be interpreted as having different properties. A redundant data set will typically not constrain the network weight structure as much as a nonredundant data set. Nonredundant data from complete and representative data sets will therefore in most cases lead to well-performing predictors, even if the weights/training examples ratio is larger.


    METHODS FOR CLASSIFICATION OF DRUG-LIKE STRUCTURES
Several methods have been developed to determine the suitability of compounds to be used as drugs (pharmaceuticals), based on knowledge about the molecular structure and key physical properties. Reviews about a variety of methods have been published recently (Walters and Murcko, 2002; Clark and Pickett, 2000; Muegge, 2003).

The task is to select compounds from combinatorial libraries in such a way that the probability of identifying a new drug that will make it to the market is optimized. To minimize the probability of leaving out potential drugs in the HTS, it is important to ensure adequate diversity of the molecular structures used (Gillet et al., 1999). A variety of methods for compound selection have been proposed, ranging from simple intuitive to advanced computational methods.

Intuitive and graphical methods
Many researchers have developed simple counting methods, and a highly popular method is the ‘rule of five’ proposed by Lipinski et al. (1997). Although the ‘rule of five’ originally addresses the risk of poor absorption or permeation, based on molecular weight (MW), the octanol–water partition coefficient (log P, a measure of lipophilicity) and the numbers of hydrogen bond donors (HBD) and acceptors (HBA), it has been widely used to distinguish between drug-like and non-drug-like compounds. A number of databases include Lipinski's ‘rule of five’ data for each compound, using calculated values for log P. This is the case for the SPECS, MDL SCD, ACD and ACD/Labs Physico-Chemical databases.
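Such a counting method is straightforward to express in code. The sketch below counts violations of the four limits commonly quoted from Lipinski et al. (1997), namely MW above 500, log P above 5, more than 5 hydrogen bond donors and more than 10 hydrogen bond acceptors; these numerical limits are not stated in the present text and are supplied here as an assumption, as are the class name, the field names and the example values.

```python
from dataclasses import dataclass


@dataclass
class Compound:
    name: str
    mw: float      # molecular weight
    log_p: float   # (calculated) octanol-water partition coefficient
    hbd: int       # number of hydrogen bond donors
    hba: int       # number of hydrogen bond acceptors


def rule_of_five_violations(c):
    """Count how many of the four 'rule of five' limits a compound exceeds.

    The limits are those commonly quoted from Lipinski et al. (1997);
    they are an assumption of this sketch, not taken from the text above.
    """
    checks = [
        c.mw > 500,
        c.log_p > 5,
        c.hbd > 5,
        c.hba > 10,
    ]
    return sum(checks)


# Hypothetical property values, for illustration only.
small = Compound("example A", mw=350.0, log_p=2.1, hbd=2, hba=5)
large = Compound("example B", mw=734.0, log_p=3.1, hbd=5, hba=14)
print(rule_of_five_violations(small))  # 0
print(rule_of_five_violations(large))  # MW and HBA limits exceeded: 2
```

The function returns a count rather than a drug/non-drug label, since the rule was formulated as a warning about absorption and permeation rather than as a classifier of drug-likeness.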

Frimurer et al. (2000) performed a detailed, quantitative assessment of the predictive value of Lipinski's ‘rule of five’. It was found that the correlation coefficient is close to zero, in fact slightly negative, when evaluated on a nonredundant data set. Oprea (2000) also discusses drug-related chemical databases and the performance of the ‘rule of five’. He concludes that the ‘rule of five’ does not distinguish between ‘drugs’ and ‘non-drugs’, because the distribution of the parameters used does not differ significantly between these two groups of compounds. The problem with the ‘rule of five’ is that it represents conditions which are necessary, but not sufficient, for a drug molecule. Many (in fact most) of the molecules which fulfill these conditions are not drug molecules. The work of Lipinski has, however, had a tremendous pioneering value concerning quantitative evaluation of the ‘drug-likeness’ of chemical structures.

Muegge et al. (2001) proposed a simple selection scheme for drug-likeness based on the presence of certain functional groups, pharmacophore points, in the molecule. This method and other methods using functional group filters are significantly less accurate than the various machine learning methods discussed (Table 3).


Table 3 Comparison of various methods for prediction of drug-likeness of molecules, including neural networks, functional group filters, quantitative structure-activity relationships and decision trees

 
Chemistry space, or chemography, methods are somewhat similar, and are based on the hypothesis that drug-like molecules share certain properties, and thus cluster in graphical representations. Analysis of how well molecular diversity is captured using a variety of descriptors was done by Cummins et al. (1996), Xu and Stevenson (2000) and Feher and Schmidt (2003) who examined the difference in property distribution between drugs, natural products and non-drugs. Oprea and Gottfries (2001a,b), Oprea et al. (2002) and Oprea (2003) have developed a method called ChemGPS, in which molecules are mapped into a space using principal component analysis (PCA), with a so-called drug-like chemistry space placed in the center of the graph, surrounded by a non-drug-like space. Satellites, molecules having extreme outlier values in one or more of the dimensions of interest, are intentionally placed in the non-drug-like space. Examples of satellites are molecules like iohexol, phenyladamantane, benzene and also the non-drug-like drug erythromycin. They used molecular descriptors obtained with the VolSurf (Cruciani et al., 2000) method, along with a variety of other descriptors in their work.

Brüstle et al. (2002) proposed a method for evaluating a numerical index for drug-likeness. They proposed a range of new molecular descriptors obtained from semiempirical molecular orbital (AM1) calculations, calculated principal components (PCs) based on these descriptors, and used one of the PCs as a numerical index. The authors also used the proposed descriptors to develop QSPR models for a number of physico-chemical properties, including log P and aqueous solubility.

Machine learning methods
Several prediction methods using neural networks (NN) have been developed. These include work by Ajay et al. (1998), Sadowski and Kubinyi (1998), Frimurer et al. (2000), Muegge et al. (2001), Murcia-Soler et al. (2003) and Takaoka et al. (2003). Most of these methods give relatively good predictions for drug-likeness, with the exception of the methods by Muegge et al. (2001) and Takaoka et al. (2003). As shown in Table 3, the methods differ in the type of NN architecture and descriptors used, and, most importantly, they use different data sets of drug molecules. The most significant difference concerns the data used to train and test the networks, as discussed in the section about redundancy of data sets, and to a lesser degree the choice of descriptors for identifying the chemical structures. Ford et al. (2004) have developed an NN method for predicting whether compounds are active as protein kinase ligands, where they used the drug-likeness method of Ajay et al. (1998) to remove unsuitable compounds from the training set. As in the work by Frimurer et al. (2000), they use the Tanimoto coefficient to establish a diverse training set.

Ajay et al. use the so-called ISIS keys, and various other molecular properties, including those proposed in the ‘rule of five’, as descriptors. The descriptors used by Sadowski and Kubinyi are the atom types of Ghose (Viswanadhan et al., 1989), originally developed for the prediction of log P. Frimurer et al. use the so-called CONCORD atom types as descriptors. In this work the descriptor selection procedure is also inverted, in the sense that the weights in the trained NNs are inspected after training has been completed. Thereby one may obtain a ranking of the importance of different descriptors for prediction tasks like drug-likeness. Obviously, such a ranking cannot reveal the correlations between descriptors, which may be highly important for the classification performance. The power of the NNs is that they can take such correlations into account, but these correlations can be difficult to visualize or describe in detail.

Murcia-Soler et al. (2003) assign probabilities for the ability of each molecule to act as a drug in their method. They use various topological indices as descriptors, reflecting types of atoms and chemical bonds in the molecules. Their paper contains a detailed and interesting analysis of the performance on training, test and validation data sets (Table 3). Takaoka et al. (2003) developed an NN method by assigning compound scores for drug-likeness and ease of synthesis based on chemists' intuition. Their method predicts the drug-likeness of drug molecules with 80% accuracy, but the accuracy of the prediction of non-drug-likeness is hard to assess.

Wagener and van Geerestein (2000) developed a quite successful method using decision trees. The first step in their method is similar to the method by Muegge et al. (2001), where the molecules are sorted based on which key functional groups they contain. Going through several steps, the connectivity of the atoms in each functional group is determined and atom types are assigned, using the atom types of Ghose. This basic classification is then used as an input (leaf node) in a decision tree, which is trained to determine if a compound is drug-like or non-drug-like. To increase the accuracy, a technique called ‘boosting’ can be used, where weights are assigned to each data point in the training set, and optimized in such a way that they reflect the importance of each data point. By including misclassification costs in the training of the tree, the predictions can be improved even more.
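The reweighting idea behind boosting can be sketched with an AdaBoost-style loop over one-level decision trees (‘stumps’). This is a generic textbook variant chosen for illustration and is not necessarily the boosting scheme used by Wagener and van Geerestein (2000); the binary input features, the function names and the toy data are hypothetical, and misclassification costs are not included.

```python
import math


def stump_predict(row, feature, polarity):
    """One-level tree: predict `polarity` if the binary feature is set, else the opposite."""
    return polarity if row[feature] else -polarity


def best_stump(X, y, w):
    """Return (weighted error, feature, polarity) of the best stump under weights w."""
    best = None
    for feature in range(len(X[0])):
        for polarity in (1, -1):
            err = sum(
                wi for wi, row, yi in zip(w, X, y)
                if stump_predict(row, feature, polarity) != yi
            )
            if best is None or err < best[0]:
                best = (err, feature, polarity)
    return best


def adaboost(X, y, rounds=3):
    """Train an ensemble of stumps; labels in y must be +1 or -1."""
    n = len(X)
    w = [1.0 / n] * n                          # start with equal weight on every data point
    ensemble = []
    for _ in range(rounds):
        err, feature, polarity = best_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, feature, polarity))
        # Increase the weight of misclassified points, decrease the rest, then renormalize.
        w = [
            wi * math.exp(-alpha * yi * stump_predict(row, feature, polarity))
            for wi, row, yi in zip(w, X, y)
        ]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble, w


def predict(ensemble, row):
    """Weighted vote of all stumps in the ensemble."""
    score = sum(a * stump_predict(row, f, p) for a, f, p in ensemble)
    return 1 if score >= 0 else -1


# Toy data: a compound is labelled +1 if at least two of three hypothetical
# functional-group flags are present. No single stump can reproduce this rule.
X = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
y = [1 if sum(row) >= 2 else -1 for row in X]
ensemble, weights = adaboost(X, y, rounds=3)
print(sum(predict(ensemble, row) == label for row, label in zip(X, y)), "of", len(X), "correct")
```

In each round the data points that the current stump misclassifies receive larger weights, so the next stump concentrates on them; the final weights thus indicate which training examples were hardest to classify.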

Gillet et al. (1998) proposed a method which uses a genetic algorithm (GA), using weights of various properties obtained from substructural analysis. Using the properties from ‘rule of five’ and various topological indices as descriptors, the molecules are sorted and ranked, and from these results a score is calculated. It is seen that drugs and non-drugs have different distributions of scores, with some overlap. Also the distribution varies for different types of drugs.

Anzali et al. (2001) developed a QSAR type method for predicting biological activities, with the computer system PASS (prediction of activity spectra for substances). Both data for drugs and non-drugs were used in the training of the model, and subsequently used for discriminating between drugs and non-drugs.

The NN, decision trees, GA and QSAR methods are classified as machine learning, whereas ‘rule of five’ and functional group filters are simple intuitive methods. A short overview of the reported predictive capability of the various methods is given in Table 3. It is seen that the NN methods and the decision tree methods give the best results. As discussed previously, it is difficult to compare the different NN methods in quantitative terms, due to differences in size and redundancy of the data sets used to train and test the methods.

Lead-likeness
Another possibility is to determine whether a molecule is lead-like rather than drug-like. As discussed by Hann et al. (2001) it is often not feasible to optimize the drug-like molecules by adding functional groups, as the molecules then become too large and over-functionalized. Instead one could target new leads with computerized methods, and in such a way give additional flexibility in targeting new potential drugs.

According to Oprea et al. (2001a), not much information about molecular structures of leads is available in the literature. In their paper, several lead structures are given, and an analysis of the difference between leads and drugs is presented. According to their analysis, drugs have higher molecular weight, higher lipophilicity, additional rotatable bonds, only a slightly higher number of hydrogen-bond acceptors, and the same number of hydrogen-bond donors compared to leads. As pointed out by the authors, this analysis contains too few molecules to be statistically significant, but it indicates useful trends. Similar trends are also observed by Hann et al. (2001). The design of lead-like combinatorial libraries is also discussed by Teague et al. (1999), and Oprea conducted studies of chemical space navigation of lead molecules (Oprea, 2002a) and their properties (Oprea, 2002). Various methods for ranking molecules in lead-discovery programs are discussed by Wilton et al. (2003), including trend vectors, substructural analysis, bioactive profiles and binary kernel discrimination.


    PREDICTION OF PROPERTIES
ADMET properties
ADMET stands for absorption, distribution, metabolism, excretion and toxicity of drugs in the human organism. In recent years increased attention has been given to modeling these properties, but there is still a lack of reliable models. One of the first attempts to address the modeling of one of these properties, the intestinal absorption, was the ‘rule of five’ proposed by Lipinski et al. (1997). A number of review articles on ADMET properties have been published recently (Beresford et al., 2002; Ekins et al., 2002; van de Waterbeemd and Gifford, 2003; Livingstone, 2003). Boobis et al. (2002) made an expert report on the state of models for prediction of ADMET properties. Boobis et al. (2002) and van de Waterbeemd and Gifford (2003) give a fairly detailed overview of methods and computer programs for prediction of ADMET properties.

One of the major problems concerning modeling of ADMET properties is the lack of reliable experimental data for training the models. For metabolism and toxicity, databases are available, but for the other properties, experimental information is scarce. A number of programs for modeling of ADMET properties have been developed recently, and ADMET modules have been included in some examples of molecular modeling software. Only models for absorption are discussed in greater detail here, whereas modeling of the other ADMET properties is only discussed briefly, with references to recent reviews. Figure 4 shows how ADMET properties relate to targets and diseases from a chemoinformatics perspective.



Fig. 4 Chemoinformatics integrates information on ADMET properties in the relationship between chemistry space, biological targets and diseases. This figure can be viewed in colour on Bioinformatics online.

 
Absorption
According to Beresford et al. (2002), reasonable models exist for intestinal absorption and blood–brain-barrier (BBB) penetration. It is, however, important to bear in mind that low oral bioavailability is often due not to poor overall absorption, but to limited first pass through the gut wall membrane. Recently, van de Waterbeemd et al. (2001) published a property guide for optimization of drug absorption and pharmacokinetics.

At least three commercially available computer programs have been developed for prediction of the intestinal absorption: IDEA™, GastroPlus™ and OraSpotter™. IDEA™ uses an absorption model proposed by Grass (1997), and later improved by Norris et al. (2000), and GastroPlus™ uses the ACAT (advanced compartmental absorption and transit) model proposed by Agoram et al. (2001). In the ACAT model the GI tract is divided into nine compartments, one for the stomach, seven for the small intestine and one for the colon, and the absorption process is described using a set of over 80 differential equations. The ACAT model uses compound specific parameters like permeability, solubility and dose, as well as physiological information such as species, GI transit time, food status, etc. A detailed overview of these two methods is given by Parrott and Lavé (2002), and the link between drug absorption and the permeability and the solubility is discussed by Pade and Stavhansky (1998). The OraSpotter™ program uses information about the molecular structure only, which is converted into SMILES. The descriptors are then evaluated from the SMILES.

Parrott and Lavé (2002) developed models for the intestinal absorption using IDEA™ 2.0 and GastroPlus™ 3.1.0. They used three approaches in their work: (1) prediction of absorption class using chemical structure data only; (2) predictions based upon measured solubility and predicted permeability; (3) predictions based upon measured solubility and measured Caco-2 permeability. The RMS deviations were in the range of 19–24% for the models developed, and the best model was obtained with the IDEA program, using both measured solubility and measured permeability. However, the IDEA and GastroPlus programs gave fairly similar results in this study.

Wessel et al. (1998) developed a QSPR type model, using non-linear GA/NN techniques, for the prediction of human intestinal absorption, based on Caco-2 measurements of permeability. The data set used contains measured values for 86 drug and drug-like compounds, selected from various sources. The descriptors were generated with the ADAPT (automated data analysis and pattern recognition toolkit) software and with the MOPAC program using the AM1 semi-empirical quantum chemical method, derived from 2D structures using graph theory, or based on substructure fragments. A fairly good model, with an RMS deviation of 16%, was obtained.

Palm et al. (1998) evaluated the relationship between various molecular descriptors and transport of drugs across the intestinal epithelium, also using measurements for Caco-2 cells, and Clark (1999) developed a method using the polar molecular surface area of the molecules as a descriptor. A minimalistic model based on the ‘rule of five’ was proposed by Oprea and Gottfries (1999). Zamora et al. (2001) used chemometrics methods for predicting drug permeability, and new QSAR models were proposed by Kulkarni et al. (2002). Egan and Lauri (2002) published a review on methods for prediction of passive intestinal permeability.

BBB penetration
The BBB separates the brain from the systemic blood circulation. Passage through the BBB is a necessity for orally administered drugs targeting receptors and enzymes in the central nervous system (CNS), whereas generally it is an unwanted property for peripherally acting drugs. BBB penetration is usually expressed as log BB = log (Cbrain/Cblood), i.e. the logarithm of the ratio of the concentration of the drug in the brain to that in the blood. Experimental log BB values range between –2.0 and +1.0; compounds with log BB > 0.3 are characterized as BBB penetrating, whereas compounds with log BB < –1.0 are poorly distributed to the brain.
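The definition and the two cut-offs quoted above translate directly into a small helper. In this minimal sketch the base-10 logarithm is assumed, as is conventional for log BB, and the label ‘intermediate’ for values between the two cut-offs is a name introduced here for illustration; the concentrations in the example are hypothetical.

```python
import math


def log_bb(c_brain, c_blood):
    """log BB = log10(C_brain / C_blood); both concentrations in the same units."""
    if c_brain <= 0 or c_blood <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(c_brain / c_blood)


def classify_bbb(value):
    """Apply the cut-offs quoted in the text to a log BB value."""
    if value > 0.3:
        return "penetrating"
    if value < -1.0:
        return "poorly distributed"
    return "intermediate"   # assumed label for the range not named in the text


# Hypothetical concentrations, for illustration only.
for brain, blood in [(10.0, 1.0), (1.0, 1.0), (1.0, 100.0)]:
    value = log_bb(brain, blood)
    print(f"log BB = {value:+.2f}: {classify_bbb(value)}")
```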

One of the first attempts to predict BBB penetration dates back to 1988, when Young et al. (1988) obtained a correlation between log BB and log P or derivatives of log P for a small series of histamine H2 antagonists. In the following years a number of models for BBB penetration were published, and these models were discussed in recent reviews (Ekins et al., 2000; Norinder and Haeberlein, 2002). In the following we will focus on more recently reported models. Nearly all models developed until 2000 were based on a relatively small number of compounds (57–65 compounds) with a limited structural diversity, including several compounds being far from drug-like. Kelder et al. (1999) expanded the number of available compounds by reporting log BB for an additional 45 drug-like compounds. In addition to developing models for BBB penetration, they also showed that CNS active drugs could be distinguished from non-CNS active drugs based on their polar surface area.

Jørgensen et al. (2001) presented a model based on all available BBB penetration data at that time (105 compounds). They based their model on an atom-weighted surface area approach, where the contribution of each atom to the BBB penetration depended on the atom type and its solvent accessibility. Rose et al. (2002) developed a model for the same set of compounds using electrotopological state descriptors, whereas Kaznessis et al. (2001) generated a number of physically significant descriptors from Monte Carlo simulations in water and used those in a model for BBB penetration. Hou and Xu (2002) developed a model using GA, and Feher et al. (2000) proposed a very simple model based on only three descriptors: log P, number of hydrogen bonds and polar surface area. It appears that these descriptors, or related descriptors, describe the structural requirements for BBB penetration most effectively (Norinder and Haeberlein, 2002).

It is interesting that the quantitative models for the prediction of BBB penetration, although based on very different methods, yielded comparable results. This probably reflects the relatively small number of compounds included in the modeling, the lack of structural diversity, as well as uncertainties associated with the experimental log BB values. Several drug molecules are actively transported across membranes, including the BBB, by various transporters which may include active transport to the CNS as well as efflux transporters (Bodor and Buchwald, 1999; Kusuhara and Sugiyama, 2002; Sun et al., 2002).

A number of models for the classification of compounds as CNS active or CNS inactive have been published. An NN based on 2D Unity fingerprints was developed by Keserû et al. (2000) for identification of CNS active compounds in virtual HTS. Engkvist et al. (2003) showed that a model based on substructure analysis performed as well as a more complex NN based model. Doniger et al. (2002) developed and compared an NN and a support vector machine approach. Crivori et al. (2000) published a method for the classification of compounds as CNS active or CNS inactive. The model used descriptors from 3D molecular interaction fields generated by VolSurf and multivariate statistical methods for the subsequent data analysis. Based on a test set of 110 molecules the model predicted log BB correctly with a 90% accuracy for an external set of 120 compounds. In a recent paper by Wolohan and Clark (2003), the combination of descriptors developed from interaction fields and subsequent analysis by multivariate statistics was further developed and applied not only for the prediction of BBB penetration but also for addressing the more general problem of oral bioavailability.

Other ADMET properties
Metabolism of drugs in the gut wall and in the liver is a major issue, where reactions due to cytochrome P450 liver enzymes are of particular importance (Lewis, 1996). Metabolism of xenobiotics is very complicated and thus difficult to model adequately. Nicholson and Wilson (2003) recently published a review article on xenobiotic metabolism, and de Groot and Ekins (2002) and Lewis and Dickins (2002) published reviews on metabolism through cytochrome P450 dependent reactions. Langowski and Long (2002) studied various enzyme modeling systems, databases and expert systems for the prediction of xenobiotic metabolism. They discussed three expert systems, META (Klopman et al., 1994; Talafous et al., 1994; Klopman et al., 1997), MetabolExpert and METEOR (Greene et al., 1997; Buttom et al., 2003), and two enzyme modeling systems. Enzymatic reactions are primarily studied using various molecular modeling methods, and as this is not a central chemoinformatics problem, it is outside the scope of this paper.

Toxicity of drugs is another extremely important problem, leading to a significant number of drug failures. Greene (2002) published a review paper about a number of commercial prediction systems, including DEREK (Greene et al., 1997; Judson et al., 2003), OncoLogic (Dearden et al., 1997), HazardExpert, COMPACT (Parke et al., 1990), CASE/Multi-CASE (Klopman, 1992) and TOPKAT. Katritzky et al. (2001), Espinosa et al. (2002) and Tong et al. (2002) have developed QSAR models for the prediction of toxicity.

Up to now, not much effort has been put into in silico modeling of distribution of drugs within the human organism, and almost none on excretion. Experimental quantities which measure distribution of drugs in the human body adequately are not readily available, but according to Boobis et al. (2002) log P, octanol–water distribution coefficients (log D) and in vivo pharmacokinetic data can be used as a measure of distribution. Transporters, plasma-protein binding and other aspects of distribution are discussed by van de Waterbeemd and Gifford (2003).

Physico-chemical properties
As mentioned above, physico-chemical properties of interest to drug discovery are mainly acid–base dissociation constants (pKa), aqueous solubilities, octanol–water partition coefficients (log P) and octanol–water distribution coefficients (log D). These properties are generally relatively well studied, and many predictive methods are available in the literature. Recently, van de Waterbeemd (2003) and Livingstone (2003) published reviews on physico-chemical properties relevant to the drug discovery process. Many relatively good models are available for log P, log D and pKa of drugs (van de Waterbeemd, 2003), and a large number of in silico models have been developed for the aqueous solubility of drugs and drug-like compounds, as discussed in recent reviews by Blake (2000), Huuskonen (2001) and Jorgensen and Duffy (2002). There is, however, continued interest in improved models for the solubility of drugs, due to the importance of this property. It is important to mention that the solubility relates to the solid phase and is significantly more difficult to model than the other physico-chemical properties. Difficulties in modeling properties relating to the solid phase are discussed by Dyekjær and Jónsdóttir (2003).

Some of the available models for aqueous solubility of organic compounds are based on structure information only using NN (Huuskonen et al., 1998; Huuskonen, 2000; McElroy and Jurs, 2001; Yan and Gasteiger, 2003; Taskinen and Yliruusi, 2003; Wegner and Zell, 2003) and QSPR (Katritzky et al., 1998; Yaffe et al., 2001; Zhong and Hu, 2003) methods. Other models use experimental quantities like log P, melting points, heats of fusion and vapor pressures as descriptors (Ran et al., 2001; Sangvhi et al., 2003), sometimes together with calculated structural descriptors (McFarland et al., 2001; Thompson et al., 2003), and group contribution methods have also been developed (Kühne et al., 1995; Klopman and Zhu, 2001). Traditionally solubility is calculated using equilibrium thermodynamics (Prausnitz et al., 1986). Very accurate predictions can be obtained, but the physical parameters needed are only seldom available for drugs and drug-like molecules. Jónsdóttir et al. (2002) have developed an in silico method using molecular mechanics calculations.

To date, the research efforts have mainly been focused on the aqueous solubility of drugs, and very little attention has been devoted to their solubility in buffered solutions, and the pH dependence of solubility (Avdeef et al., 2000; Sangvhi et al., 2003). For understanding dissolution of drugs in the human organism it is crucial to focus increasingly on solubility in a more realistic environment, and to acquire larger amounts of experimental data for the pH dependence of solubility.
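To indicate why the pH dependence matters, the sketch below uses the textbook Henderson–Hasselbalch approximation for a monoprotic compound, in which the total solubility equals the intrinsic solubility of the neutral form multiplied by a factor that grows as the compound ionizes. This idealized relation is an assumption introduced here and is not taken from the studies cited above; it ignores salt precipitation, aggregation and buffer effects, and the function name and numerical values are hypothetical.

```python
def solubility_at_ph(s0, pka, ph, kind="acid"):
    """Idealized total solubility of a monoprotic compound at a given pH.

    s0   : intrinsic solubility of the neutral form (any concentration unit)
    pka  : acid dissociation constant of the ionizable group
    kind : "acid" or "base"
    Henderson-Hasselbalch approximation (an assumption of this sketch).
    """
    if kind == "acid":
        return s0 * (1 + 10 ** (ph - pka))
    if kind == "base":
        return s0 * (1 + 10 ** (pka - ph))
    raise ValueError("kind must be 'acid' or 'base'")


# Hypothetical weak acid with pKa 4.0 and intrinsic solubility 1.0 (arbitrary units).
for ph in (2.0, 4.0, 6.0):
    print(ph, solubility_at_ph(1.0, pka=4.0, ph=ph))
```

Even under this simple approximation the solubility of an ionizable compound changes by orders of magnitude over the pH range encountered in the gastrointestinal tract, which is why data measured in pure water can be a poor guide to dissolution in a more realistic environment.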


    ROLE OF DATABASES AND CHEMOINFORMATICS IN FUTURE DRUG DISCOVERY
As documented in this review, chemoinformatics is a rapidly growing field which has become extremely important in pharmaceutical research in the last couple of years. The generation, storage, retrieval and utilization of data related to chemical structures has merged established disciplines and catalyzed the development of new techniques. Thus, chemoinformatics bridges established areas like molecular modeling, computational chemistry, statistics and bioinformatics, and is closely related to other emerging fields such as chemogenomics, metabonomics and pharmacogenomics.

Chemoinformatics methods are already used extensively in the drug discovery and development process by the pharmaceutical industry, and many powerful methods have been proposed. The predictive methods available vary in quality and complexity, ranging from simple rules-of-thumb to sophisticated 3D methods involving simulation of ensembles of molecules containing thousands of atoms. In the years to come, improved in silico methods for the prediction of properties based on structural information will emerge, and be used to assist in identifying more suitable hits and leads.

Structural and property databases provide the foundation of chemoinformatics, and a variety of databases containing structurally derived data for organic compounds are available. Although the number of entries in a given database is often considered the most important measure of the quality or usability of a database, there is increased awareness of the importance of the quality of the data rather than the number of entries.

The recent literature presents a number of methods for the classification of the drug-likeness of compounds based on subsets of molecules from databases representing drug-like and non-drug-like molecules. Classification methods which predict the lead-likeness of molecules, or their affinity toward specific targets, would be even more useful, and thus the generation of databases of lead-like molecules, etc., is highly desired. All such methods are limited to likely drugs, but we could eventually gather information about non-drug-like drugs, and find common features in those as well.

To ensure good predictive power of such models, the redundancy of the data sets used for training and testing the model is crucial. It is thus very important to balance the data in such a way that certain structural features are not over-represented. Researchers using many other databases, e.g. protein structure databases, have encountered a similar problem. They realized that the use of all available 3D protein structures in PDB for analyzing and extracting loops, motifs, side-chains etc. could be problematic due to the over-representation of some proteins or protein families relative to others. Accordingly, a number of data sets of non-homologous protein structures have been developed.

Also, much focus has been devoted to in silico modeling of ADMET and physico-chemical properties of drugs and drug candidates. A variety of different methods for the prediction of relevant physico-chemical properties, like log P, pKa and aqueous solubility, of organic compounds are available, and in general these methods perform satisfactorily considering the data available. In the case of the aqueous solubility, a large number of methods have been developed, and there is continued interest in this property. The effect of the pH value of the solution on solubility needs to be studied in much greater detail.

Concerning ADMET properties, a number of in silico models have been proposed for intestinal absorption and BBB penetration, some models for metabolism and toxicity and only very few models for distribution and excretion. The total amount of data available for a certain end point often limits the possibility of developing improved predictive models for ADMET properties. Permeability of the BBB is such a case, where measured values are only available for a little more than 100 compounds. The lack of structural diversity within these compounds is another limiting factor.

Properties like metabolism, oral bioavailability, etc., of drugs within the human organism involve several individual processes, and thus sub-processes and bottlenecks governing these processes need to be identified. Owing to a lack of reliable data, combined with a very complex mode of action of the processes involved, the present methods often fail. In some cases where a method yields promising results for a set of compounds, the transferability of the method to another series or class of compounds may be questionable. Within this area there is much need for developing better and more robust predictive methods, but there is also a need for determining and collecting larger amounts of experimental data for the individual processes.

Oral bioavailability is a particularly difficult property to model as it involves a huge number of processes within the human organism, and depends on all the ADMET and physico-chemical properties discussed above. The molecule has to dissolve, be absorbed into the bloodstream, transported to the target, and not metabolized on its way. Thus for an orally administered drug to reach its final destination, a whole range of properties need to be within acceptable limits, and thus a model which defines such limits would be very useful. Dynamic modeling of processes within a living cell with systems biology methods is growing rapidly, and is expected to have a huge impact on future drug design, in particular on the modeling of the oral bioavailability. As discussed by Parsons et al. (2004) chemical–genetic and genetic interaction profiles can potentially be integrated to provide information about the pathways and targets affected by bioactive compounds. Such methods could thus be very useful for identifying the mechanism of action and cellular targets of bioactive compounds. Improved understanding of how different drugs affect one another within the human organism is also of great importance, and thus much interest is presently devoted to studies of drug–drug interactions. An extremely exciting perspective is of course also to use pharmacogenomic methods for examining how individual patients respond to specific drugs, and how that depends on their genetic makeup.
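As a simple illustration of what a model defining acceptable property limits might look like, the sketch below checks a compound against the commonly quoted rule-of-five thresholds associated with Lipinski et al. (1997). That rule addresses absorption and permeability only, so it is at best a crude stand-in for the broader bioavailability model envisaged above. The class, the function names, the one-violation tolerance and the property values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Properties:
    """Calculated or measured properties of one compound (illustrative)."""
    mol_weight: float        # molecular weight, g/mol
    log_p: float             # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def rule_of_five_violations(p: Properties) -> List[str]:
    """Return the list of rule-of-five limits that the compound exceeds."""
    violations: List[str] = []
    if p.mol_weight > 500:
        violations.append("mol_weight > 500")
    if p.log_p > 5:
        violations.append("log_p > 5")
    if p.h_bond_donors > 5:
        violations.append("h_bond_donors > 5")
    if p.h_bond_acceptors > 10:
        violations.append("h_bond_acceptors > 10")
    return violations


def within_limits(p: Properties, max_violations: int = 1) -> bool:
    """Accept a compound that exceeds at most `max_violations` limits
    (the tolerance of one is an assumption of this sketch)."""
    return len(rule_of_five_violations(p)) <= max_violations


if __name__ == "__main__":
    # Hypothetical property values, not data for any real compound.
    candidate = Properties(mol_weight=650.0, log_p=6.2,
                           h_bond_donors=2, h_bond_acceptors=5)
    print(rule_of_five_violations(candidate))
    print(within_limits(candidate))
```

A filter of this kind only flags compounds whose properties fall outside fixed limits; it does not predict a bioavailability value.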

Although more data, better data and an improved understanding of the interplay between the different processes in the human organism are required, the present level of available data has already made chemoinformatics an effective tool in the drug discovery and development process.


    DATABASES AND COMPUTER PROGRAMS

The Accelrys' Biotransformation database is marketed by Accelrys, San Diego, USA at URL http://www.accelrys.com/.
The Accelrys' Metabolism database is marketed by Accelrys, San Diego, USA at URL http://www.accelrys.com/.
ACD/Labs Physico-Chemical Property databases are marketed by Advanced Chemical Development, Inc., Toronto, Canada at URL http://www.acdlabs.com/.
The AQUASOL dATAbASE of Aqueous Solubility, sixth edition, is marketed by the University of Arizona, Tucson Arizona, USA at URL http://www.pharmacy.arizona.edu/outreach/aquasol/.
The Available Chemicals Directory (ACD) is marketed by MDL Information Systems Ltd., San Leandro, USA at URL http://www.mdl.com/.
The BIOSTER database is marketed by Accelrys, San Diego, USA at URL http://www.accelrys.com/.
The Cambridge Structural Database (CSD) is maintained and distributed by the Cambridge Crystallographic Data Centre, Cambridge, UK at URL http://www.ccdc.cam.ac.uk.
CASE and MultiCASE, MultiCASE, Inc., Beachwood, OH, USA, URL http://www.multicase.com.
The Chapman & Hall/CRC Dictionary of Drugs, produced by Chapman & Hall/CRC, London, UK, is a part of Combined Chemical Dictionary (CCD) and is available online at URL http://www.chemnetbase.com/.
The Chemicals Available for Purchase (CAP) database is marketed by Accelrys, San Diego, USA at URL http://www.accelrys.com/.
The Comprehensive Medicinal Chemistry (CMC) database is marketed by MDL Information Systems Ltd., San Leandro, USA at URL http://www.mdl.com/.
CrossFire Beilstein is marketed by MDL Information Systems Ltd., San Leandro, USA at URL http://www.mdl.com/.
DEREK, LHASA Limited, Department of Chemistry, University of Leeds, Leeds, UK, URL http://www.chem.leeds.ac.uk/luk.
The Derwent World Drug Index (WDI) and Derwent Drug File are published by Thomson Derwent, London, UK at URL http://www.derwent.com/products/lr/wdi/.
The Design Institute of Physical Properties database (DIPPR) is maintained by Michigan Technological University, Houghton, USA in collaboration with AIChE. Information is found at URLs http://dippr.chem.mtu.edu and http://www.aiche.org/dippr/.
GastroPlus™ version 4.0, Simulations Plus, Inc., Lancaster, CA, USA, URL http://www.simulations-plus.com.
GRID version 21, Molecular Discovery Ltd, Ponte San Giovanni - PG, Italy, URL http://www.moldiscovery.com.
HazardExpert, CompuDrug International, Inc., Sedona, AZ, USA, URL http://www.compudrug.com.
IDEA™ version 2.2 and IDEA pkEXPRESS™, LION bioscience AG, Heidelberg, Germany, URL http://www.lionbioscience.com.
MDL Drug Data Report (MDDR) is marketed by MDL Information Systems Ltd., San Leandro, USA at URL http://www.mdl.com/.
The MDL Metabolite database is marketed by MDL Information Systems Ltd., San Leandro, USA at URL http://www.mdl.com/.
The MDL Screening Compounds Directory (formerly ACD-SC) is marketed by MDL Information Systems Ltd., San Leandro, USA at URL http://www.mdl.com/.
The MDL Toxicity database is marketed by MDL Information Systems Ltd., San Leandro, USA at URL http://www.mdl.com/.
The MedChem Bio-loom database is published by BioByte Corp., Claremont, USA at URL http://www.biobyte.com, and also distributed by Daylight Chemical Information Systems Inc., Mission Viejo, USA at URL http://www.daylight.com/.
MetabolExpert, CompuDrug International, Inc., Sedona, AZ, USA, URL http://www.compudrug.com.
METEOR, LHASA Limited, Department of Chemistry, University of Leeds, Leeds, UK, URL http://www.chem.leeds.ac.uk/luk.
The National Cancer Institute database (NCI) is a publicly available database distributed by MDL Information Systems Ltd., San Leandro, USA at URL http://www.mdl.com/ and Daylight Chemical Information Systems Inc., Mission Viejo, USA at URL http://www.daylight.com/.
OraSpotter™ version 3.0, ZyxBio, LLC, Cleveland, OH, USA, URL http://www.zyxbio.com.
The Physical Properties Database (PHYSPROP) is marketed by Syracuse Research Corporation (SRC), North Syracuse, USA at URL http://www.syrres.com/esc/.
The Properties of Organic Compounds database is produced by Chapman & Hall/CRC, London, UK, available online at URL http://www.chemnetbase.com/.
The Protein Data Bank (PDB) is operated by Rutgers, The State University of New Jersey; the San Diego Supercomputer Center at the University of California, San Diego; and the Center for Advanced Research in Biotechnology of the National Institute of Standards and Technology, and funded by NSF, NIH and the Department of Energy. It is found at URL http://www.rcsb.org/pdb/.
Protein–protein interaction databases. List of databases with links to each database at URLs http://www.cbio.mskcc.org/prl/index.php, http://proteome.wayne.edu/PIDBL.html and http://www.imb-jena.de/jcb/ppi/.
The SPECSnet database is provided by Specs, Rijswijk, The Netherlands at URL http://www.specs.net/.
The Spresi database is marketed by Infochem GmbH, München, Germany at URL http://www.infochem.de/. An older version of the database, Spresi95, is available from Daylight Chemical Information Systems Inc., Mission Viejo, USA at URL http://www.daylight.com/.
TOPKAT (TOxicity Predictions by Komputer Assisted Technology), Accelrys, San Diego, USA, URL http://www.accelrys.com.
The ToxExpress Reference Database and ToxExpress Solutions are marketed by GeneLogic Inc., Gaithersburg, USA at URL http://www.genelogic.com/.
Toxicological Data Network (TOXNET) is a collection of databases, published by various services, and made accessible free of charge over the internet by National Library of Medicine, Bethesda, USA at URL http://toxnet.nlm.nih.gov/.
The TSCA93 database is published by the US Environmental Protection Agency, and is distributed by Daylight Chemical Information Systems Inc., Mission Viejo, USA at URL http://www.daylight.com/.
VolSurf version 3.0.11, Molecular Discovery Ltd, Ponte San Giovanni - PG, Italy, URL http://www.moldiscovery.com.
The WOrld of Molecular BioActivities database (WOMBAT) is published by Sunset Molecular Discovery LLC, Santa Fe, USA at URL http://www.sunsetmolecular.com, and distributed by Daylight Chemical Information Systems Inc., Mission Viejo, USA at URL http://www.daylight.com/.


    Acknowledgments
 
The information about the databases was mostly obtained from the internet pages provided by the suppliers. We contacted Molecular Design Limited (MDL), Accelrys, Infochem and Derwent for additional information, which they kindly provided. We would also like to thank Tudor Oprea for information about the WOMBAT database and Olga Rigina for helping with information about protein–protein interaction databases.


    Footnotes
 
1 Databases and computer programs are listed in alphabetical order before the references. Some of the databases and programs are represented by an ordinary reference, and are not included in this list.

Received on February 11, 2004; revised on February 4, 2005; accepted on February 7, 2005

    REFERENCES

    Agoram, B., et al. (2001) Predicting the impact of physiological and biochemical processes on oral drug bioavailability. Adv. Drug Delivery Rev., 50, S41–S67.

    Ajay, B., et al. (1998) Can we learn to distinguish between ‘drug-like’ and ‘nondrug-like’ molecules? J. Med. Chem., 41, 3314–3324.

    Anzali, S., et al. (2001) Discrimination between drugs and nondrugs by prediction of activity spectra for substances (PASS). J. Med. Chem., 44, 2432–2437.

    Avdeef, A., et al. (2000) pH-metric solubility. 2: Correlation between the acid-base titration and the saturation shake-flask solubility pH methods. Pharm. Res., 17, 85–89.

    Bader, G.D., et al. (2003) BIND: the biomolecular interaction network database. Nucleic Acids Res., 31, 248–250.

    Bajorath, J. (2002) Chemoinformatics methods for systematic comparison of molecules from natural and synthetic sources and design of hybrid libraries. J. Comput.-Aided Mol. Des., 16, 431–439.

    Bajorath, J. (2004) Chemoinformatics: Concepts, Methods, and Applications (Methods in Molecular Biology). Totowa: Humana Press.

    Baldi, P. and Brunak, S. (2001) Bioinformatics: The Machine Learning Approach. Adaptive Computation and Machine Learning, 2nd edn. Cambridge: The MIT Press.

    Beresford, A.P., et al. (2002) The emerging importance of predictive ADME simulation in drug discovery. Drug Discov. Today, 7, 109–116.

    Blake, J.F. (2000) Chemoinformatics–predicting the physicochemical properties of ‘drug-like’ molecules. Curr. Opin. Biotech., 11, 104–107.

    Bodor, N. and Buchwald, P. (1999) Recent advances in the brain targeting of neuropharmaceuticals by chemical delivery systems. Adv. Drug Delivery Rev., 36, 229–254.

    Böhm, H.-J. and Schneider, G. (2000) Virtual Screening for Bioactive Molecules. Weinheim: Wiley-VCH.

    Boobis, A., et al. (2002) Congress Report. In silico prediction of ADME and pharmacokinetics. Report of an expert meeting organised by COST B15. Eur. J. Pharm. Sci., 17, 183–193.

    Browne, L.J., et al. (2002) Chemogenomics - pharmacology with genomics tools. Targets, 1, 59.

    Brüstle, M., et al. (2002) Descriptors, physical properties, and drug-likeness. J. Med. Chem., 45, 3345–3355.

    Burkert, U. and Allinger, N.L. (1982) Molecular Mechanics. ACS Monograph, Vol. 177. Washington, DC: American Chemical Society.

    Button, W.G., et al. (2003) Using absolute and relative reasoning in the prediction of the potential metabolism of xenobiotics. J. Chem. Inf. Comput. Sci., 43, 1371–1377.

    Clark, D.E. (1999) Rapid calculation of polar molecular surface area and its application to the prediction of transport phenomena. 1: Prediction of intestinal absorption. J. Pharm. Sci., 88, 807–814.

    Clark, D.E. and Pickett, S.D. (2000) Computational methods for the prediction of ‘drug-likeness’. Drug Discov. Today, 5, 49–58.

    Costanzo, M.C., et al. (2000) The Yeast Proteome Database (YPD) and Caenorhabditis elegans proteome database (WormPD): comprehensive resources for the organization and comparison of model organism protein information. Nucleic Acids Res., 28, 73–76.

    Cramer, C.J. (2002) Essentials of Computational Chemistry: Theories and Models. New York: John Wiley and Sons.

    Cramer, R.D., III, et al. (1988) Comparative molecular field analysis (CoMFA). 1. Effect of shape on binding of steroids to carrier proteins. J. Am. Chem. Soc., 110, 5959–5967.

    Crivori, P., et al. (2000) Predicting blood-brain barrier permeation from three-dimensional molecular structure. J. Med. Chem., 43, 2204–2216.

    Cruciani, G., et al. (2000) Molecular fields in quantitative structure-permeation relationships: the VolSurf approach. J. Mol. Struct. (Theochem), 503, 17–30.

    Cummins, D.J., et al. (1996) Molecular diversity in chemical databases: comparison of medicinal chemistry knowledge bases and database of commercially available compounds. J. Chem. Inf. Comput. Sci., 36, 750–763.

    de Groot, M.J. and Ekins, S. (2002) Pharmacophore modeling of cytochromes P450. Adv. Drug Delivery Rev., 54, 367–383.

    Dearden, J.C., et al. (1997) The development and validation of expert systems for predicting toxicity. ATLA, 25, 223–252.

    Doniger, S., et al. (2002) Predicting CNS permeability of drug molecules: comparison of neural network and support vector machine algorithms. J. Comp. Bio., 9, 849–864.

    Durant, J.L., et al. (2002) Reoptimization of MDL keys for use in drug discovery. J. Chem. Inf. Comput. Sci., 42, 1273–1280.

    Durbin, R., et al. (1998) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge: Cambridge University Press.

    Dyekjær, J.D. and Jónsdóttir, S.Ó. (2003) QSPR models based on molecular mechanics and quantum chemical calculations. 2. Thermodynamic properties of alkanes, alcohols, polyols, and ethers. Ind. Eng. Chem. Res., 42, 4241–4259.

    Dyekjær, J.D., et al. (2002) QSPR models based on molecular mechanics and quantum chemical calculations. 1. Construction of Boltzmann-averaged descriptors for alkanes, alcohols, diols, ethers and cyclic compounds. J. Mol. Model., 8, 277–289.

    Egan, W.J. and Lauri, G. (2002) Prediction of intestinal permeability. Adv. Drug Delivery Rev., 54, 273–289.

    Ekins, S., et al. (2000) Progress in predicting human ADME parameters in silico. J. Pharm. Tox. Methods, 44, 251–272.

    Ekins, S., et al. (2002) Toward a new age of virtual ADME/TOX and multidimensional drug discovery. J. Comput.-Aided Mol. Des., 16, 381–401.

    Engkvist, O., et al. (2003) Prediction of CNS activity of compound libraries using substructure analysis. J. Chem. Inf. Comput. Sci., 43, 155–160.

    Espinosa, G., et al. (2002) An integrated SOM-Fuzzy ARTMAP neural system for the evaluation of toxicity. J. Chem. Inf. Comput. Sci., 42, 343–359.

    Feher, M. and Schmidt, J.M. (2003) Property distributions: differences between drugs, natural products, and molecules from combinatorial chemistry. J. Chem. Inf. Comput. Sci., 43, 218–227.

    Feher, M., et al. (2000) A simple model for the prediction of blood-brain partitioning. Int. J. Pharm., 201, 239–247.

    Ford, M.G., et al. (2004) Selecting compounds for focused screening using linear discriminant analysis and artificial neural networks. J. Mol. Graph. Model., 22, 467–472.

    Frimurer, T.M., et al. (2000) Improving the odds in discriminating ‘drug-like’ from ‘non drug-like’ compounds. J. Chem. Inf. Comput. Sci., 40, 1315–1324.

    Fujita, T., et al. (1964) J. Am. Chem. Soc., 86, 5175–5180.

    Gallop, M.A., et al. (1994) Application of combinatorial technologies to drug discovery. 1. Background and peptide combinatorial libraries. J. Med. Chem., 37, 1233–1251.

    Gasteiger, J. (2003) Handbook of Chemoinformatics. From Data to Knowledge, vols. 1–4. Weinheim: Wiley-VCH.

    Gasteiger, J. and Engel, T. (2003) Chemoinformatics: A Textbook. Weinheim: Wiley-VCH.

    Geladi, P. and Kowalski, B.R. (1986a) An example of 2-block predictive partial least-squares regression with simulated data. Anal. Chim. Acta, 185, 19–32.

    Geladi, P. and Kowalski, B.R. (1986b) Partial least-squares regression: a tutorial. Anal. Chim. Acta, 185, 1–17.

    Gillet, V.J., et al. (1998) Identification of biological activity profiles using substructural analysis and genetic algorithms. J. Chem. Inf. Comput. Sci., 38, 165–179.

    Gillet, V.J., et al. (1999) Selecting combinatorial libraries to optimize diversity and physical properties. J. Chem. Inf. Comput. Sci., 39, 169–177.

    Gillet, V.J., et al. (2003) Similarity searching using reduced graphs. J. Chem. Inf. Comput. Sci., 43, 338–345.

    Golden, J. (2003) Towards a tractable genome: Knowledge management in drug discovery. Curr. Drug Discovery, Feb., 17–20.

    Goodford, P.J. (1985) A computational procedure for determining energetically favorable binding-sites on biologically important macromolecules. J. Med. Chem., 28, 849–857.

    Gordon, E.M., et al. (1994) Application of combinatorial technologies to drug discovery. 2. Combinatorial organic synthesis, library screening strategies, and future directions. J. Med. Chem., 37, 1385–1401.

    Goto, S., et al. (1998) LIGAND: chemical database for enzyme reactions. Bioinformatics, 14, 591–599.

    Goto, S., et al. (2002) LIGAND: database of chemical compounds and reactions in biological pathways. Nucleic Acids Res., 30, 402–404.

    Grass, G.M. (1997) Simulation models to predict oral drug absorption from in vitro data. Adv. Drug Delivery Rev., 23, 199–219.

    Greene, N. (2002) Computer systems for the prediction of toxicity: an update. Adv. Drug Delivery Rev., 54, 417–431.

    Greene, N., et al. (1997) Knowledge-based expert system for toxicity and metabolism prediction: DEREK, StAR, and METEOR. SAR QSAR Environ. Res., 10, 299–313.

    Günther, J., et al. (2003) Utilizing structural knowledge in drug design strategies: applications using Relibase. J. Mol. Biol., 326, 621–636.

    Hann, M.M., et al. (2001) Molecular complexity and its impact on the probability of finding leads for drug discovery. J. Chem. Inf. Comput. Sci., 41, 856–864.

    Hansch, C. and Fujita, T. (1964) J. Am. Chem. Soc., 86, 1616–1626.

    Hendlich, M., et al. (2003) Relibase: design and development of a database for comprehensive analysis of protein–ligand interactions. J. Mol. Biol., 326, 607–620.

    Hermjakob, H., et al. (2004) IntAct: an open source molecular interaction database. Nucleic Acids Res., 32, D452–D455.

    Hobohm, U., et al. (1992) Selection of representative protein data sets. Protein Sci., 1, 409–417.

    Hodges, P.E., et al. (1999) The yeast proteome database (YPD): a model for the organization and presentation of genome-wide functional data. Nucleic Acids Res., 27, 69–73.

    Hopfinger, A.J. and Duca, J.S. (2000) Extraction of pharmacophore information from high-throughput screens. Curr. Opin. Biotech., 11, 97–103.

    Hopkins, A.L. and Groom, C.R. (2002) The druggable genome. Nat. Rev. Drug Discovery, 1, 727–730.

    Hou, T. and Xu, X. (2002) ADME evaluation in drug discovery 1. Applications of genetic algorithms to the prediction of blood–brain partitioning of a large set of drugs. J. Mol. Model., 8, 337–349.

    Huuskonen, J. (2000) Estimation of aqueous solubility for a diverse set of organic compounds based on molecular topology. J. Chem. Inf. Comput. Sci., 40, 773–777.

    Huuskonen, J. (2001) Estimation of aqueous solubility in drug design. Comb. Chem. High. T. Scr., 4, 311–316.

    Huuskonen, J., et al. (1998) Aqueous solubility prediction of drugs based on molecular topology and neural network modeling. J. Chem. Inf. Comput. Sci., 38, 450–456.

    Jensen, F. (1999) Introduction to Computational Chemistry. New York: John Wiley and Sons.

    Ji, Z.L., et al. (2003) Internet resources for proteins associated with drug therapeutic effect, adverse reactions and ADME. Drug Discov. Today, 12, 526–529.

    Jónsdóttir, S.Ó., et al. (2002) Modeling and measurements of solid–liquid and vapor–liquid equilibria of polyols and carbohydrates in aqueous solution. Carbohydrate Res., 337, 1563–1571.

    Jørgensen, F.S., et al. (2001) Prediction of blood–brain barrier penetration. In Höltje, H.-D. and Sippl, W. (Eds.). Rational Approaches to Drug Design. Barcelona: Prous Science Press, pp. 281–285.

    Jorgensen, W.L. and Duffy, E.M. (2002) Prediction of drug solubility from structure. Adv. Drug Delivery Rev., 54, 355–366.

    Judson, P.N., et al. (2003) Using argumentation for absolute reasoning about the potential toxicity of chemicals. J. Chem. Inf. Comput. Sci., 43, 1364–1370.

    Karelson, M. (2000) Molecular Descriptors in QSAR/QSPR. John Wiley & Sons, Inc.

    Katritzky, A.R., et al. (1998) QSPR studies of vapor pressure, aqueous solubility, and the prediction of water–air partition coefficients. J. Chem. Inf. Comput. Sci., 38, 720–725.

    Katritzky, A.R., et al. (2001) Theoretical descriptors for the correlation of aquatic toxicity of environmental pollutants by quantitative structure–toxicity relationships. J. Chem. Inf. Comput. Sci., 41, 1162–1176.

    Kaznessis, Y.N., et al. (2001) Prediction of blood–brain partitioning using Monte Carlo simulations of molecules in water. J. Comput.-Aided Mol. Des., 15, 697–708.

    Kelder, J., et al. (1999) Polar molecular surface as a dominating determinant for oral absorption and brain penetration of drugs. Pharm. Res., 16, 1514–1519.

    Keserű, G.M., et al. (2000) A neural network based virtual high throughput screening test for the prediction of CNS activity. Comb. Chem. High. T. Scr., 3, 535–540.

    Klopman, G. (1992) MULTI-CASE: 1. A hierarchical computer automated structure evaluation program. Quant. Struct.-Act. Relat., 11, 176–184.

    Klopman, G., et al. (1994) META 1. A program for the evaluation of metabolic transformations of chemicals. J. Chem. Inf. Comput. Sci., 34, 1320–1325.

    Klopman, G., et al. (1997) META 3. A genetic algorithm for metabolic transform priorities optimization. J. Chem. Inf. Comput. Sci., 37, 329–334.

    Klopman, G. and Zhu, H. (2001) Estimation of the aqueous solubility of organic molecules by group contribution approach. J. Chem. Inf. Comput. Sci., 41, 439–445.

    Kubinyi, H. (2003) Drug research: myths, hype and reality. Nat. Rev. Drug Discovery, 2, 665–668.

    Kühne, R., et al. (1995) Group contribution methods to estimate water solubility of organic compounds. Chemosphere, 30, 2061–2077.

    Kulkarni, A., et al. (2002) Predicting Caco-2 permeation coefficients of organic molecules using membrane-interaction QSAR analysis. J. Chem. Inf. Comput. Sci., 42, 331–342.

    Kusuhara, H. and Sugiyama, Y. (2002) Role of transporters in the tissue-selective distribution and elimination of drugs: transporters in the liver, small intestine, brain and kidney. J. Contr. Release, 78, 43–54.

    Langowski, J. and Long, A. (2002) Computer systems for the prediction of xenobiotic metabolism. Adv. Drug Delivery Rev., 54, 407–415.

    Leach, A.R. and Gillet, V.J. (2003) An Introduction to Chemoinformatics. Dordrecht: Kluwer Academic Publishers.

    Lengauer, T. (2002) Bioinformatics – From Genomes to Drugs, Vol. 14, Methods and Principles in Medicinal Chemistry. Weinheim: Wiley-VCH.

    Lewis, D.F.V. (1996) The Cytochromes P450: Structure, Function and Mechanism. London: Taylor and Francis.

    Lewis, D.F.V. and Dickins, M. (2002) Substrate SARs in human P450s. Drug Discov. Today, 7, 918–925.

    Lide, D.R. (2003) CRC Handbook of Chemistry and Physics, 84th edn. Boca Raton, FL: CRC Press.

    Lipinski, C.A., et al. (1997) Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv. Drug Delivery Rev., 23, 3–25.

    Livingstone, D.J. (2003) Theoretical property prediction. Curr. Top. Med. Chem., 3, 1171–1192.

    Macchiarulo, A., et al. (2004) Ligand selectivity and competition between enzymes in silico. Nat. Biotechnol., 22, 1039–1045.

    McElroy, N.R. and Jurs, P.C. (2001) Prediction of aqueous solubility of heteroatom-containing organic compounds from molecular structure. J. Chem. Inf. Comput. Sci., 41, 1237–1247.

    McFarland, J.W., et al. (2001) Estimating the water solubilities of crystalline compounds from their chemical structures alone. J. Chem. Inf. Comput. Sci., 41, 1355–1359.

    Mewes, H.W., et al. (2002) MIPS: a database for genome and protein sequences. Nucleic Acids Res., 30, 31–34.

    Miled, Z.B., et al. (2003) An efficient implementation of a drug candidate database. J. Chem. Inf. Comput. Sci., 43, 25–35.

    Miller, M.A. (2002) Chemical database techniques in drug discovery. Nat. Rev. Drug Discovery, 1, 220–227.

    Mount, D.W. (2001) Bioinformatics: Sequence and Genome Analysis. Cold Spring Harbor: Cold Spring Harbor Laboratory Press.

    Muegge, I. (2003) Selection criteria for drug-like compounds. Med. Res. Rev., 23, 302–321.

    Muegge, I., et al. (2001) Simple selection criteria for drug-like chemical matter. J. Med. Chem., 44, 1841–1846.

    Murcia-Soler, M., et al. (2003) Drugs and nondrugs: an effective discrimination with topological methods and artificial neural networks. J. Chem. Inf. Comput. Sci., 43, 1688–1702.

    Murray-Rust, P. and Rzepa, H.S. (1999) Chemical Markup Language and XML Part I. Basic principles. J. Chem. Inf. Comput. Sci., 39, 928.

    Nicholson, J.K. and Wilson, I.D. (2003) Understanding ‘global’ systems biology: metabonomics and the continuum of metabolism. Nat. Rev. Drug Discov., 2, 668–677.

    Nielsen, H., et al. (1996) Defining a similarity threshold for a functional protein sequence pattern: The signal peptide cleavage site. Proteins, 24, 165–177.

    Norinder, U. and Haeberlein, M. (2002) Computational approaches to the prediction of the blood-brain distribution. Adv. Drug Delivery Rev., 54, 291–313.

    Norris, D.A., et al. (2000) Development of predictive pharmacokinetic simulation models for drug discovery. J. Contr. Release, 65, 55–62.

    Olsson, T. and Oprea, T.I. (2001) Chemoinformatics: a tool for decision-makers in drug discovery. Curr. Opin. Drug Disc. Dev., 4, 308–313.

    Oprea, T.I. (2000) Property distribution of drug-related chemical databases. J. Comput.-Aided Mol. Des., 14, 251–264.

    Oprea, T.I. (2002a) Chemical space navigation in lead discovery. Curr. Opin. Chem. Biol., 6, 384–389.

    Oprea, T.I. (2002b) Current trends in lead discovery: are we looking for the appropriate properties? J. Comput.-Aided Mol. Des., 16, 325–334.

    Oprea, T.I. (2003) Chemoinformatics and the quest for leads in drug discovery. In Gasteiger, J. (Ed.). Handbook of Chemoinformatics. From Data to Knowledge, Vol. 4. Weinheim: Wiley-VCH, pp. 1509–1531.

    Oprea, T.I. and Gottfries, J. (1999) Toward minimalistic modeling of oral drug absorption. J. Mol. Graph. Model., 17, 261–274[Medline].

    Oprea, T.I. and Gottfries, J. (2001a) CHEMGPS: a chemical space navigation tool. In Höltje, H.-D. and Sippl, W. (Eds.). Rational Approaches to Drug Design, , Barcelona Prous Science Press, pp. 437–446.

    Oprea, T.I. and Gottfries, J. (2001) Chemography: the art of navigating in chemical space. J. Comb. Chem., 3, 157–166[CrossRef][Medline].

    Oprea, T.I., et al. (2001a) Is there a difference between leads and drugs? A historical perspective. J. Chem. Inf. Comput. Sci., 41, 1308–1315[CrossRef][Web of Science][Medline].

    Oprea, T.I., et al. (2001b) Quo vadis, scoring functions? Toward an integrated pharmacokinetic and binding affinity prediction framework. In Ghose, A.K. and Viswanadhan, V.N. (Eds.). Combinatorial Library Design and Evaluation for Drug Design, , New York Marcel Dekker Inc., pp. 233–266.

    Oprea, T.I., et al. (2002) Pharmacokinetically based mapping device for chemical space navigation. J. Comb. Chem., 4, 258–266[Medline].

    Orengo, C.A., et al. Bioinformatics. Genes, Proteins and Computers, Advanced Texts, (2002) , Abingdon BIOS Scientific Publishers.

    Pade, V. and Stavhansky, S. (1998) Link between drug absorption solubility and permeability measurements in Caco-2 Cells. J. Pharm. Sci., 87, 1604–1607[Medline].

    Palm, K., et al. (1998) Evaluation of dynamic polar surface area as predictor of drug absorption: comparison with other computational and experimental predictors. J. Med. Chem., 41, 5382–5392[CrossRef][Medline].

    Parke, D.V., et al. (1990) The safety evaluation of drugs and chemicals by the use of computer optimized molecular parametric analysis of chemical toxicity (COMPACT). ATLA, 18, 91–102.

    Parrott, N. and Lavé, T. (2002) Prediction of intestinal absorption: comparative assessment of GASTROPLUS™ and IDEA™. Eur. J. Pharm. Sci., 17, 51–61[Medline].

    Parsons, A.B., et al. (2004) Integration of chemical-genetic and genetic interaction data links bioactive compounds to cellular target pathways. Nat. Biotechnol., 22, 62–69[CrossRef][Web of Science][Medline].

    Patterson, D.E., et al. (1996) Neighborhood behavior: a useful concept for validation of molecular diversity descriptors. J. Med. Chem., 39, 3049–3059[CrossRef][Web of Science][Medline].

    Pearlman, R.S. (2000) CONCORD User's Manual. Tripos Inc., St Louis, MO.

    Pedersen, A.G. and Nielsen, H. (1997) Neural network prediction of translation initiation sites in eukaryotes: perspectives for EST and Genome analysis. ISMB, 5, 226–233[Medline].

    Peri, S., et al. (2003) Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Res., 13, 2363–2371[Abstract/Free Full Text].

    Prausnitz, J.M., et al. (1986) Molecular Thermodynamics of Fluid-Phase Equilibria, 2nd edn. Prentice-Hall, NJ.

    Ran, Y., et al. (2001) Prediction of aqueous solubility of organic compounds by general solubility equation (GSE). J. Chem. Inf. Comput. Sci., 41, 1208–1217[Medline].

    Reynolds, C.H., et al. (1998) Lead discovery using stochastic cluster analysis (SCA): a new method for clustering structurally similar compounds. J. Chem. Inf. Comput. Sci., 38, 305–312.

    Reynolds, C.H., et al. (2001) Diversity and coverage of structural sublibraries selected using the SAGE and SCA algorithms. J. Chem. Inf. Comput. Sci., 41, 1470–1477[Medline].

    Richard, A.M. and Williams, C.R. (2002) Distributed structure-searchable toxicity (DSSTox) public database network: a proposal. Mutat. Res., 499, 27–52[Medline].

    Rose, K., et al. (2002) Modeling blood–brain barrier partitioning using the electrotopological state. J. Chem. Inf. Comput. Sci., 42, 651–666[Medline].

    Sadowski, J. and Kubinyi, H. (1998) A scoring scheme for discriminating between drugs and nondrugs. J. Med. Chem., 41, 3325–3329[CrossRef][Web of Science][Medline].

    Sander, C. and Schneider, R. (1991) Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins, 9, 56–68[CrossRef][Web of Science][Medline].

    Sanghvi, T., et al. (2003) Estimation of aqueous solubility by the General Solubility Equation (GSE) the easy way. QSAR Comb. Sci., 22, 258–262[CrossRef].

    Sello, G. (1998) Similarity measures: is it possible to compare dissimilar structures? J. Chem. Inf. Comput. Sci., 38, 691–701.

    Stahura, F.L. and Bajorath, J. (2002) Bio- and chemo-informatics beyond data management: crucial challenges and future opportunities. Drug Discov. Today, 7, S41–S47[Medline].

    Sun, H., et al. (2003) Drug efflux transporters in the CNS. Adv. Drug Delivery Rev., 55, 83–105[CrossRef][Medline].

    Takaoka, Y., et al. (2003) Development of a method for evaluating drug-likeness and ease of synthesis using a data set in which compounds are assigned scores based on chemists' intuition. J. Chem. Inf. Comput. Sci., 43, 1269–1275[Medline].

    Talafous, J., et al. (1994) META 2. A dictionary model of mammalian xenobiotic metabolism. J. Chem. Inf. Comput. Sci., 34, 1326–1333[CrossRef][Web of Science][Medline].

    Taskinen, J. and Yliruusi, J. (2003) Prediction of physicochemical properties based on neural networks modelling. Adv. Drug Delivery Rev., 55, 1163–1183[CrossRef][Web of Science][Medline].

    Teague, S.J., et al. (1999) The design of leadlike combinatorial libraries. Angew. Chem. Int. Ed., 38, 3743–3748[CrossRef].

    Thompson, J.D., et al. (2003) Predicting aqueous solubilities from aqueous free energies of solvation and experimental or calculated vapor pressure of pure substances. J. Chem. Phys., 119, 1661–1670.

    Todeschini, R. and Consonni, V. (2000) Handbook of Molecular Descriptors. Methods and Principles in Medicinal Chemistry, Vol. 11. Wiley-VCH, Weinheim.

    Tong, W., et al. (2002) Development of quantitative structure-activity relationships (QSARs) and their use for priority setting in the testing strategy of endocrine disruptors. Reg. Res. Persp. J., May, 1–21.

    van de Waterbeemd, H. (2003) Physico-chemical approaches to drug absorption. In van de Waterbeemd, H., Lennernäs, H. and Artursson, P. (Eds.). Drug Bioavailability: Estimation of Solubility, Permeability, Absorption and Bioavailability. Methods and Principles in Medicinal Chemistry, Vol. 18. Wiley-VCH, Weinheim, pp. 3–20.

    van de Waterbeemd, H. and Gifford, E. (2003) ADMET in silico modelling: Towards prediction paradise. Nat. Rev. Drug Discovery, 2, 192–204[CrossRef][Web of Science][Medline].

    van de Waterbeemd, H., et al. (2001) Property-based design: optimization of drug absorption and pharmacokinetics. J. Med. Chem., 44, 1313–1333[CrossRef][Medline].

    van de Waterbeemd, H., et al. (2003) Drug Bioavailability: Estimation of Solubility, Permeability, Absorption and Bioavailability. Methods and Principles in Medicinal Chemistry, Vol. 18. Wiley-VCH, Weinheim.

    Viswanadhan, V.N., et al. (1989) Atomic physicochemical parameters for three dimensional structure directed quantitative structure–activity relationships. 4. Additional parameters for hydrophobic and dispersive interactions and their application for an automated superposition of certain naturally occurring nucleoside antibiotics. J. Chem. Inf. Comput. Sci., 29, 163–172.

    Voigt, J.H., et al. (2001) Comparison of the NCI open database with seven large chemical structural databases. J. Chem. Inf. Comput. Sci., 41, 702–712[CrossRef][Web of Science][Medline].

    Wagener, M. and van Geerestein, V.J. (2000) Potential drugs and nondrugs: prediction and identification of important structural features. J. Chem. Inf. Comput. Sci., 40, 280–292[Web of Science][Medline].

    Walters, W.P. and Murcko, M.A. (2002) Prediction of ‘drug-likeness’. Adv. Drug Delivery Rev., 54, 255–271[CrossRef][Web of Science][Medline].

    Walters, W.P., et al. (1998) Virtual screening—an overview. Drug Discovery Today, 3, 160–178[CrossRef][Web of Science].

    Wegner, J.K. and Zell, A. (2003) Prediction of aqueous solubility and partition coefficient optimized by a genetic algorithm based descriptor selection method. J. Chem. Inf. Comput. Sci., 43, 1077–1084[CrossRef][Web of Science][Medline].

    Weininger, D. (1988) SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci., 28, 31–36[CrossRef][Web of Science].

    Wessel, M.D., et al. (1998) Prediction of human intestinal absorption of drug compounds from molecular structure. J. Chem. Inf. Comput. Sci., 38, 726–735[CrossRef][Medline].

    Willett, P. (2000) Chemoinformatics—similarity and diversity in chemical libraries. Curr. Opin. Biotech., 11, 85–88[Medline].

    Willett, P. (2003) Similarity-based approaches to virtual screening. Biochem. Soc. Trans., 31, 603–606[CrossRef][Medline].

    Wilton, D., et al. (2003) Comparison of ranking methods for virtual screening in lead-discovery programs. J. Chem. Inf. Comput. Sci., 43, 469–474[Medline].

    Wolohan, P.R.N. and Clark, R.D. (2003) Predicting drug pharmacokinetic properties using molecular interaction fields and SIMCA. J. Comput.-Aided Mol. Des., 17, 65–76.

    Xenarios, I., et al. (2002) DIP, the database of interacting proteins: a research tool for studying cellular networks of protein interactions. Nucl. Acids Res., 30, 303–305[Abstract/Free Full Text].

    Xu, J. and Stevenson, J. (2000) Drug-like index: a new approach to measure drug-like compounds and their diversity. J. Chem. Inf. Comput. Sci., 40, 1177–1187[Medline].

    Yaffe, D., et al. (2001) A fuzzy ARTMAP based on quantitative structure–property relationships (QSPRs) for predicting aqueous solubility of organic compounds. J. Chem. Inf. Comput. Sci., 41, 1177–1207[Medline].

    Yan, A. and Gasteiger, J. (2003) Prediction of aqueous solubility of organic compounds based on a 3D structure representation. J. Chem. Inf. Comput. Sci., 43, 429–434[Medline].

    Young, R.C., et al. (1988) Development of a new physicochemical model for brain penetration and its application to the design of centrally acting H2 receptor histamine antagonists. J. Med. Chem., 31, 656–671[CrossRef][Medline].

    Zamora, I. (2001) Prediction of oral drug permeability. In Höltje, H.-D. and Sippl, W. (Eds.). Rational Approaches to Drug Design. Prous Science Press, Barcelona, pp. 271–280.

    Zanzoni, A., et al. (2002) MINT: a molecular interaction database. FEBS Lett., 513, 135–140[CrossRef][Web of Science][Medline].

    Zhong, C.G. and Hu, Q.H. (2003) Estimation of the aqueous solubility of organic compounds using molecular connectivity indices. J. Pharm. Sci., 92, 2284–2294[Medline].

