TOC PREV NEXT INDEX







PROVIDING CHEMINFORMATICS SOLUTIONS
TO SUPPORT DRUG DISCOVERY DECISIONS

Carleton R. Sage, Kevin R. Holme, Nianish Sud and Rudy Potenzone

Lionbioscience, Inc., 9880 Campus Point Dr., San Diego, CA 92121, USA

E-Mail: rudolph.potenzone@lionbioscience.com

Received: 1st August 2002 / Published: 15th May 2003


Abstract

Drug discovery programs have had to deal with an avalanche of data coming from both the adoption of new technologies such as high throughput screening and combinatorial chemistry, as well as advances in genomics and structural genomics which have facilitated a gene family target approach to drug discovery. Although this data rich environment has been a challenge to manage, it has provided an opportunity for the development of informatics based tools and solutions to extract information from this large body of data, and convert this information into knowledge that can be used and reused for drug discovery.
In the cheminformatics field there has been considerable focus on the development of new tools to visualise and analyse the data, particularly with relation to identifying new leads, and analysing SAR for lead optimisation. While individual cheminformatics tools are critical for analysing this data, a real opportunity exists to provide solutions that synthesize results from these analyses into knowledge to support drug discovery decisions. This remains largely a "manual" activity that takes place within individual project teams.
This paper will describe some concepts and implementations of cheminformatics solutions that begin to address the need for reusable knowledge generation within drug discovery projects. The talk will address requirements for the integration of chemical and biological data as well as the integration of tools and models. The power of using predictive tools for compound design will be highlighted as well as methods to simultaneously consider multiple SAR's. We will describe how providing such solutions.


Introduction

The drug discovery process used to be less complicated. Teams of chemists and biologists (drug discovery scientists) would work together on tens to hundreds of molecules to try to specifically alter the function of their biological target. Things have changed. Because of changes in available technologies and increases in fundamental understandings of biology, the drug discovery scientist has to contend with thousands to millions of molecules interacting with potentially thousands of targets. Therefore, the modern drug discovery scientist is awash in data. However, the changes in available technologies haven't necessarily resulted in improvements in the quality of the data. As a result, we have been flung into what might be a morass of meaningless data, or discovery knowledge nirvana. How do we navigate?


Components Required to Build the Map

The first stage to approaching this information overload is by assembling the informatics components necessary for integrating all the available data. Databases are an integral component. This simple component may pose problems for some organisations since the default drug discovery data repository has become Microsoft Excel. Once the data has been arranged into databases, establishing a link between the data in the databases is an essential component. This component does not have a trivial solution, since it involves linking data of different types across different research areas. One obvious potential solution is to use the experimental assay data as the common link between the genomics/bioinformatics/proteomics data and cheminformatics data. Assuming that the data has been integrated to some extent, the next required components are ways to visualise, analyse, and interrogate the data. Furthermore, sophisticated computational approaches must also be taken in order to summarise collections of data into reusable knowledge for prospective application. Finally, means to allow `rollup' of these components for decision support, including tracking and knowledge management tools are essential for efficient. Figure 1 illustrates the Lion Bioscience architecture.
Once an integrated system of data, analysis, and decision support tools has been created, then the true power of the system can be exploited for more rational/informed decision making. These systems can be used in all phases of the drug discovery process from target selection to screening set selection to lead optimisation and lead selection for pre-clinical development.



Figure 1. Lion Bioscience integration architecture
The purpose of this paper is to illustrate the potential utility of these approaches in the post-target selection region of drug discovery, and the next two figures illustrate a framework for discussing the small molecule drug discovery process. Figure 2 defines our use of the terms "Lead Identification" and "Lead Optimisation" since these terms may have different connotations in different organisations.



Figure 2. Definitions and Assumptions for Drug Discovery project initiation.
A second concept illustrated in figure 2 are some of the operating assumptions and project criteria that must be performed in order for this integrated approach to work most efficiently. Project criteria are a key decision support / project tracking consideration. Figure 3 illustrates a version of a pre-clinical drug discovery (or lead candidate selection) workflow. It is important to note that this entire workflow sits upon the foundation of knowledge/information gleaned from other projects, and any and all data (both positive and negative) and knowledge generated for a given project in this workflow is put back into this foundational "Knowledge Base".



Figure 3. Idealised project work/information flow for Pre-clinical Drug Discovery.


Reusing and Leveraging Existing Data Through Computational Models (Hypotheses)

Leveraging data that exists at project initiation via computational means provides momentum in the rational decision making early in the drug discovery process, and continues this momentum as more data becomes available. The type, quantity, and quality of these data define the extent to which computational approaches should be used. Figure 4 describes the potential computational approaches that could be used in support of drug discovery projects given defined sets of experimental data.
In the extreme, if the project under consideration has a liganded protein structure, related protein structures and lead series data available, computational approaches including diversity, similarity, QSAR/pharmacophore models, drug-likeness models, docking, cross-target (or specificity) alert models, and project-specific ADME-models can be applied simultaneously to best leverage all existing information in the compound prioritisation and compound assessment phases of lead identification and lead optimisation (Figure 3).



Figure 4. An example of an "available data / available application" matrix


Accessing Integrated Data

Once an integrated data/query system has been created, it must be delivered to the end user in such a way that it actually enhances their work. Here the major challenge is a "people issue". The interface must be simple and familiar, and should not require too much specific additional training for the user to start using it. In our solution to this problem, we have chosen web pages as an interface since it is familiar. has simple interfaces, and is easy to access. Figure 5 represents a "front door" to the system, the place where the end user chooses what task they want to accomplish. Figure 6 demonstrates a simple interface for searching multiple databases simultaneously using an sd, molfile, or sketched molecule.



Figure 5. Simple "front page" access to a set of integrated data, models, and tools.



Figure 6. Simple interface for performing a 2D similarity search of multiple databases simultaneously.


Turning Data into Sharable Knowledge Using Computational Tools

Figure 4 shows the computational methods available given a starting set of experimental data. These approaches have been powerful in enabling complicated hypothesis-driven experimental design in drug discovery project. However, the resultant models are usually "put on the shelf' as projects are promoted or discontinued, leaving these synthesised data unused for future projects. Having these computational models available, and using them appropriately could prove very valuable in addressing specificity, ADME, and safety information of current and future projects, especially in the case of organisations pursuing the target-class approach to drug discovery. Figure 7 shows a matrix of potency/specificity receptor-relevant chemspace (Pearlman reference) models created for a collection of nuclear receptors, representing a collection of easily applicable knowledge about a large fraction of proteins in the nuclear receptor target family. Though we have presented receptor-relevant chemspace models as our in silico surrogates for potency/selectivity, any computationally derived model, from similarity clustering methods to docking may be assembled, integrated, and applied in a cross-project manner.



Figure 7. Distance matrix representing the relatedness between receptor relevant chemspace models.


Appropriately Using Data in the Form of Computational Models

Having data available in an integrated, searchable, analysable context, should be very valuable, however, more value could be added to this data by the creation of robust computational summaries (models) of the data, and applying them in appropriate ways. As an example, may varied approaches and algorithms exist that take a set of compounds and experimental activity data and derive a QSAR model that can often accurately predict the potential activity of a new compound.
It should be possible to reuse these models as a component of a knowledge environment for in silico evaluation of every compound against a surrogate for every potential experimental assay. However, most of the approaches used to build these computational models are statistical in nature, and therefore the performance of these models is only interpolative in nature, therefore, the models will likely perform poorly outside of "chemical space" (extrapolation) they were built upon. An approach to addressing this problem is illustrated in Figure 8 which shows the results from a model to predict Caco-2 effective permeability, including measures of uncertainty, using only the chemical structure of the compound as the only input. This caco-2 model was built using sophisticated statistical pattern recognition methods in which the output is a consensus prediction from 10 independent models. Each model is trained on a different representation of chemical space. To calculate the prospective measure of uncertainty (M.O.U) for the model, the upper and lower bounds of the chemical features (chemical descriptors) were determined to represent the bounds of the multidimensional chemical space upon which each model was trained. For each of the 10 independent (child) models, a new compound may have features whose values are outside of he bounds of the training set.



Figure 8. Example application of prospective measures of uncertainty in a predictive model for caco-2 effective permeability.
If the features are outside of the bounds for any feature for a child model, that model is considered to be extrapolating. Figure 7 shows an example of the consequences of using extrapolated increases, the error in prediction increases dramatically. Developing methods to assess prediction confidence for all models could aid dramatically both in their successful first-time use as well as enable appropriate reuse of the knowledge gleaned by their creation.


Simultaneous Use of Computational Models and Integrated Data for Compound Prioritisation and Assessment

During Lead Identification and Lead Optimisation, prioritisation or ranking of compounds to acquire, plate, or synthesise can be cumbersome, and is often performed without using all available information. Similarly, assessment of which compound or compound series to promote to the next phase of research should also be performed using all available information simultaneously. Figure 9 shows a simple compound prioritisation input screen for simultaneous prediction of ADME and potency/specificity properties. In this input screen the user may choose which properties to calculate, the criteria defining whether a given compound passes or fails, and the weight of that property in the calculation of a summary score for a molecule.



Figure 9. Prioritisation Input Screen. The end user selects the models to run, the criteria required to pass, and the weight each model contributes to the final score.
The summary score allows the composite ranking of all compound under consideration using the same objective evaluation criterion. Figure 10 shows the results of a prioritisation in two views. First, a compound-by-compound view, and second, a property distribution view, which would likely be useful for selecting from commercially available compound collections of in the evaluation of whether or not to synthesise a particular series of compounds versus another. By including integrated data determined experimentally in this analysis, compoung assessment can also be performed.



Figure 10. Output from compound prioritisation. Compound dependent rollup of predictions, and compound collection distributions of predictions.


Application of Integrated Approaches as a Cheminformatics Solution for Decision Support in Drug Discovery: A Hypothetical Estrogen Receptor Project

For this hypothetical example, we will assume that we have chosen Estrogen Receptor alpha (Era) as the target of our Lead Identification (LI)/Lead Optimisation (LO) project. We perform a search of the literature, and find several examples of compounds which have been shown to interact with Era (Figure 11).



Figure 11. Starting information for the hypothetical ER-a project.
In addition, since our organisation is using a target class approach to drug discovery, we also retrieve all known NR-ligand pairs from the literature as an initial knowledge environment. ER-a is an example of a data-rich project (Figure 11), and therefore almost every method available for computational model generation would be at our disposal. Once the project criteria and starting assumptions (Example shown in Figure 12) have been established, lead identification (Figure 3) can begin. A starting potential screening set is selected by a search of the known ligands against all available compounds (both virtual and existing - either in house or available from vendors - example shown in Figure 6).



Figure 12. Project criteria and use example.
This screening set is further reduced by simultaneous parameter evaluation using integrated computational tools. The project team decides where to cut off the screening collection, and these compounds are assembled, synthesised or acquired, and are run in initial potency assays.
In this scenario, all evaluations can be performed by any and all members of the project team through a web site with common default settings for project criterion. After the screening results are returned and confirmed, the project team then must assess where the project is at that point in time, and decide where to focus the available resources. In this example, three series of compounds passed the requirements necessary for lead optimisation (Figure 13), however, the project team has only enough synthetic resources to work on one series at at time, therefore, the project team must decide which series has the most potential for success.



Figure 13. Hypothetical leads to choose between for prioritisation for optimisation.
To evaluate the potential of each series, virtual libraries will be enumerated, and the resultant products will be evaluated simultaneously (Figure 9). Then, to rate the potential of the libraries in the context of the other libraries, the distributions the components are compared simultaneously between the libraries representing the three scaffolds. Shown in Figure 14, this analysis can be used as an evaluation of potential liabilities which are best compared by analysing the fraction of each library that fails the project criteria for that component.
In this analysis, series 1 has the fewest bad distribution of liabilities, and should be chosen by the project team for further research, with series 2 representing a potential backup series.



Figure 14. Simultaneous library property distribution comparison.
As the project moves closer to lead selection fpr pre-clinical development, the data integration components start to take priority over the in silico predictions. However, computational approches still have tremendous value at this stage, allowing the project team to evaluate the potential performance of candidate molecules in man. Figure 15 shows an analysis of the human absorption potential for three compounds representative of candidate series, in which the experimentally determined solubility and permeability values have been varied systematically.
As can be seen in the graphs, three classes of behaviors can be observed. In one example neither increases in solubility nor permeability can increase the absorption potential, which remains relatively low. In the second example, increases in solubility or permeabiltiy also have no effect on the human absorption potential, which is high. In the final example, increases in solubility and permeability show marked changes in the human absorption potential. The candidate molecules from the second and third examples are therefore the compound series to bring forward if all other factors besides absorption are equal. In addition, this analysis illustrates which series deserve followup study for the development of second generation compounds.



Figure 15. Use of sensitivity analysis for the evaluation of human absorption potential for lead selection.


Summary/Conclusion

So what is different about this approach? The drug discovery process is a fairly evolved one. Data has been shared and models developed for use in drug discovery process ever since affordable computers showed up in the marketplace.
However, the primary storage locale for experimental data is still Microsoft Excel, and the primary conduit of information is through individual interactions between project team members. Unfortunately, humans do not work with complete fidelity in serving as data or knowledge-sharing nodes. One could argue that modern drug discovery is using the ancient means of folklore as the method of knowledge sharing - this clearly is not sufficient given the ever exploding number of targets, hits, and interactions. This paper has described the initial steps of building a system that uses integrated databases to store all the data determined for discovery projects. It also indicates that this data is best used in synthesised from through the appropriate application of computational model building and their resultant use. Finally it illustrates the potentional power of combining the data and computational models in a system that allows the end user to simultaneously consider all available properties, and therefore make more and presumable better decisions about prioritising resources in the support of drug discovery and development.

Published in "Molecular Informatics: Confronting Complexity", Martin G. Hicks & Carsten Kettner (Eds.), Proceedings of the Beilstein-Institut Workshop, May 13th - 16th 2002, Bozen, Italy
http://www.beilstein-institut.de/bozen2002/proceedings/


TOC PREV NEXT INDEX