Astra Zeneca, CNS Discovery, 1800 Concord Pike, Wilmington 19850-543, U.S.A.
To understand what chemical genetics is and how it can add value to the drug discovery process we must first consider some of the challenges and needs in the pharmaceutical industry. The process of discovering new drugs is a highly complex, multidisciplinary activity requiring very large investments of time, intellectual capital, and money. Today the average cost of bringing a new chemical entity (NCE) to market is on the order of $900 million [1]. For every 5000 compounds synthesized only one makes it to the market. Only three of ten drugs generate revenue that meets or exceeds average R&D costs, and 70% of total returns are generated by only 20% of the products [2]. Given this gloomy backdrop, it is even more disturbing to learn that, despite the proliferation of many new technologies of great potential (and great cost!), pharmaceutical productivity levels have not increased over the last ten years (Fig. 1).
Figure 1. US Drug approvals have not increased over the last ten years.
Pharmaceutical R&D costs continue to grow exponentially, driven in part by investments in new technologies, but the return on this investment remains elusive. There are many reasons for these disturbing trends. If we consider the pharmaceutical industry as primarily a generator of knowledge (defining knowledge as compiled and interpreted information that can be acted upon) and focus on the knowledge creation process, we can shed some light on how the current situation, a productivity gap, emerged. Working harder is not likely to overcome this productivity gap to deliver more drugs. Working smarter, doing things differently, and focusing on what we actually need to deliver, i.e., knowledge, may be a new way to approach the problem. Ultimately, spanning the "knowledge gap" will lead us to the efficient exploitation of the human genome to discover new drugs to meet major medical needs.
Pharmaceutical companies create and sell knowledge, e.g., knowledge that a drug product will rid patients of the symptoms of their disease while not causing serious side effects. The resources that go into drug production pale alongside the resources needed to gain knowledge of what the drug will do when administered to a patient. In the early years of drug discovery it was often true that the literature provided a significant knowledge base for our efforts. Two approaches were taken: 1) function-based screening, where one did not know what the target was but could easily screen for small molecules that possessed the right biology [3]; and 2) "rational drug discovery", where one had knowledge of the target and its function [4]. What were needed were small molecules that would interact with the target in the right way before being optimized for in vivo activity and safety.
The existing and evolving chemistry and biology literature fuelled these efforts. It is probably also true to say that the medical problems addressed in these early days of drug discovery represented the more accessible opportunities. Often the biology was not only reasonably well understood but it was reasonably easy to study and measure. Examples of biological effects that were tackled include blood pressure, acid secretion and cytotoxicity. The situation today is very different. We now face many new targets that we know little about, and biology that is complex to study and understand. In addition to these issues, advances in our knowledge of distribution, metabolism and pharmacokinetics, as well as toxicology and pharmacogenetics, have led to the introduction of discovery processes that front-load measurement of such small molecule properties. This also raises the bar for passage of compounds through the process, making the process more difficult and slower. While this may lead to lower output of development candidates it should also lead to lower failure rates later in development, i.e., improvements in quality.
The human genome has been solved and optimistic promises have been made. It is clear that the human genome did not deliver knowledge (i.e., something immediately useful); rather, it delivered a massive amount of data. Significant advances have also been made in cell biology and "systems" biology. The relationship between genes/proteins derived from the human genome and their function as a part of a biological system constitutes the "knowledge gap," and our appreciation for the extent of this void is still emerging. The human genome is thought to consist of ca. 30,000 genes. Each gene can potentially produce several proteins via alternative splicing and post-translational modification, and every protein can potentially combine with other proteins to form many different protein complexes. Clearly, the number of different proteins and protein complexes is much larger than 30,000. To add further complexity, small molecules (that we hope will become drugs) can interact with different sites on a protein or via different mechanisms to further expand the diversity of possible outcomes from the interaction of small molecules with a protein target. We do not know what many gene products (proteins) do, either physiologically or pathologically, and we do not really know how many of these proteins can interact with small molecule ligands [5]. There are many genes about which we know nothing at all.
In summary there is clearly a vast knowledge gap between knowing a gene and knowing the function (physiology and pathology) of its protein product (Fig. 2). The enormity of this knowledge gap has been underestimated by the pharmaceutical industry.
Figure 2. The knowledge gap represents the large gap in understanding that exists between genetic information from the human genome and information regarding biological function from cell and systems biology.
In an effort to illustrate the size of the knowledge gap consider the following (admittedly approximate) analysis from the substance P area. Substance P antagonists have emerged in recent years as potential new treatments for depression although none have yet been approved for this use. Substance P has been known since 1937 and since that time (66 years!) there have been over 6500 papers published providing significant new information on substance P. Thousands of scientists have worked on generating this information over this time frame. It is sobering that our understanding of Substance P's role in depression and in other conditions is still in its infancy! No one pharmaceutical company can generate this volume of information. New, faster and more efficient methods must be developed to fill these knowledge gaps. Partnership with the academic community will become increasingly important as the number of druggable targets expands.
One critical piece of knowledge to the pharmaceutical industry relates to knowledge of a drug target and its link to a disease process. In the context of small molecule drug discovery we define target validation in a broader sense to include the knowledge of the protein target and its specific interaction with small molecules, and the consequence of this interaction for modifying a disease process. In fact drug discovery is primarily focused on the biology of a target in the presence of a drug, i.e., drug induced biology. It begins with a chemical effect - the interaction of a ligand with a protein at a specific site in a specific manner, and ends in patients' gaining benefit from taking a drug derived from the application and exploitation of this knowledge. Target validation that simply links a specific protein and its function to a disease state does not include reference to whether a small molecule can modulate the function of the protein. The protein may not therefore constitute a true "target" since it is not a "target" for a small molecule ligand and efforts to do "target validation" on such a protein will ultimately lead to a negative outcome. We can (and do) proceed to work on drug discovery before we have all the knowledge we need. The absence of this knowledge constitutes the major risk of drug discovery. One way to proceed is to focus on obtaining the most critical knowledge first. This is the knowledge that modulation of a protein target by a small molecule can ultimately lead to a clinical benefit in patients.
Target validation is the foundation of drug discovery and needs greater attention if we are to reduce the risk of failure after significant investment. Traditionally target validation has been thought of as a biology problem. Thinking in terms of what knowledge we need makes it clear that the problem does not fall neatly into any particular discipline and is better characterized as an integrated biology and chemistry problem. A schematic target validation "roadmap" is shown in Fig. 3, where the entire validation path from a chemical effect through various levels of biological effects to a clinical effect is outlined.
Figure 3. The knowledge road map of target validation beginning with a chemical effect between a small molecule and a protein and ending with a beneficial clinical effect on a human disease. Chemical genetics approaches provide some assistance in pursuing this path.
To begin with, an understanding of the function of a particular gene product in many cases can be achieved through the methods of classical genetics. However, the process can be slow and tedious. For example, developing a mouse carrying the mutation of interest could take months or years. Indeed, if the gene product is essential the organism may not survive long enough to be studied. On the other hand, the situation wherein a molecule is available that alters the function of the gene product has a number of advantages. However, it should be recognized that significant chemical effort is often required. The phenotype of interest is conditional in that it is present only when the molecule is present, allowing the study of essential gene products. It is also tunable, i.e., the intensity of the phenotype can be adjusted by controlling the concentration of the molecule.
Chemical genetics is the purposeful modulation of protein function through interaction with a small molecule. It can also be thought of as the study of chemotype, here defined analogously with phenotype as the annotated information set that describes a molecule in terms of its interactions with proteins and other macromolecules and the consequences of these interactions. The principles of chemical genetics were established in the rich history of using small molecules to explore biological function and in this sense chemical genetics is not new. What is new is the development of a systematic approach to studying biological function with small molecules - this is the emerging field of chemical genetics. Just as genetic changes can alter protein function, so can small molecule-protein interactions [6]. It is important to appreciate that by interaction of a ligand with a protein we mean interaction of a small molecule at a specific site on a protein causing a specific protein change, conformational or otherwise, ultimately leading to a specific biological effect. Small molecules can often interact with multiple sites on proteins and cause a multitude of consequences such as agonism, antagonism, partial agonism, modulation, competitive and non-competitive inhibition, etc. They can also interact at junctions between protein subunits. The sophistication of small molecule-protein interactions and their biological consequences cannot easily be reproduced by techniques such as gene knock-in/out or siRNA, where genes/proteins are simply removed or increased in concentration in a biological system. Having said that, knockout models have certainly contributed significantly to drug discovery and will continue to do so [7].
The power of chemical genetics resides in this sophistication of the small molecule-protein interaction and the precise way we can (in principle) modulate the function of a protein. As a precursor to drug discovery it serves the purpose to focus us on where small molecule drug discovery really begins - with the chemical interaction of a small molecule and a protein.
At the heart of this approach to knowledge generation in TI/TV (target identification/target validation) is the simple concept that small molecules are used to perturb biological systems. Manipulation of a biological system in a controlled manner by small molecules allows us to study these systems more systematically. In this way the detailed definition of the target - small molecule interaction, and its biological consequence, can be revealed and assessed. This knowledge can be very useful in making decisions about the viability of a drug discovery project.
Chemical starting points are needed to develop a lead series and ultimately drug development candidates. They are also needed to help define, validate and aid in the screening of a biological target, where they are often referred to as chemical tools. Finding quality chemical starting points capable of being efficiently optimized into useful tools or drugs is one of the major problems facing the pharmaceutical industry. The total number of "reasonable" drug-like molecules has been estimated [8]. The result was approximately 10^63 discrete molecules, a number so large that the synthesis of all of them is simply impossible.
Finding quality chemical tools to modulate biological systems is a difficult step and shares many of the risks associated with finding quality leads in a drug discovery program [9]. Strategies for finding small molecule leads and tools representing two poles on a continuum of approaches are illustrated by structure based design and the high-throughput screening approach. Given our focus on knowledge generation it is interesting to note that molecules at either end of this spectrum also reflect different levels of "information content". Individual molecules used in high-throughput screening teach us (if we are fortunate) about a simple IC50 or EC50. Molecules obtained via a structure-based design approach that additionally teach us how they bind to their molecular target, provide us with much more useful information especially when we consider what to do next to improve or change the biology of the molecule (Fig. 4).
Figure 4. The spectrum of approaches to finding chemical tools or leads illustrating the inverse relationship between information content and numbers of compounds needed.
Small molecules can reveal other kinds of useful information upon profiling against other targets or biological systems. While knowledge about how a molecule binds to its protein target has been exploited extensively in drug discovery, the profile of information associated with the binding at other non-target proteins has not been fully explored. While some might argue that the outcome of a successful drug discovery project is a small molecule that ONLY interacts with one protein - the target protein - the reality is somewhat different.
We have illustrated the enormity of chemistry space and the focus on biologically relevant chemistry space, but what about biology space itself? How many biologically relevant targets are there? While this number has been estimated to be around 3000 [5] it may well be much larger than this if we extrapolate from what we know about particular target classes, e.g., GPCRs, where there are many potential druggable targets and many potential pharmacologies from agonists to antagonists to modulators to inverse agonists. In a typical drug discovery programme selectivity of potential development candidates is often assessed against a panel of 50-100 biologies. Clearly this does not cover a very large fraction of available biology space. In fact many compounds originally thought to be very selective have been found later to have effects against many other targets. For example, cholesterol-lowering HMG-CoA reductase inhibitors (statins) are among the world's top-selling drugs. It has been recognized recently that statins possess additional biology, e.g., anti-inflammatory activity, that is not explained by their interaction with this enzyme. High-throughput screening of large chemical libraries has identified lovastatin (a statin) as an extracellular inhibitor of LFA-1. Lovastatin was shown to decrease LFA-1-mediated leukocyte adhesion to ICAM-1 and T-cell co-stimulation. Unexpectedly, lovastatin was found to bind to a hitherto unknown site in the LFA-1 I (inserted) domain, as documented by nuclear magnetic resonance spectroscopy and crystallography [10].
Some structural classes, e.g. benzodiazepines, are well known to exhibit diverse biology depending on the precise substituent pattern and conformation. Selective ligands with common cores have been obtained against many protein targets (Fig. 5). Such privileged structures suggest that some common structural binding motifs on proteins are reused across many different protein families [11].
Figure 5. The classic privileged structure, the benzodiazepine nucleus with small structural modification is capable of many different biologies.
It is widely accepted that few if any of the known biologically active molecules are exquisitely selective for a single biological target. This forms the basis for the discovery of new uses for existing drugs and the explanation of side effects observed for all drugs. Indeed, in a commentary on the molecular basis for the binding promiscuity of antagonist drugs LaBella [12] stated that it is unlikely that binding site dimensions, geometry, charge environments, hydrophobic surfaces and other features will ever be known to the extent that drug design technology will yield a compound with absolute specificity for one species of functional protein.
We have used a related strategy to analyse the performance of our corporate collection in high-throughput screening over the past several years (Michne, unpublished results). Our panel of proteins consists of drug targets of interest, and spans several target classes including GPCRs, several classes of enzymes, ion channels, etc. Our thesis is that a compound that exhibits biological activity in any target class is more likely to exhibit activity in another unrelated class than is a compound that has never exhibited biological activity of any kind.
We initially used a relatively small set of assays and screened compounds, and identified about 3500 compounds that were biologically active in at least one assay and met our internal criteria for molecular weight, cLogP, polar surface area, and other chemistry based filters. About 10% of these compounds were found to exhibit activity in other assays. The number of active compounds was then expanded to about 10,000, and the number of assays to 40 (Tinker, unpublished results). The hit rate of the general corporate collection was normalized to a frequency of 1, and compared to the hit rate of the 10K known biologically active set. The results are shown below (Fig. 6).
Figure 6. Observed hit rates for a biology based library on a scale where the hit rate for the general collection is normalized to a value of 1.
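The normalization behind Fig. 6 can be sketched in a few lines of Python. This is only an illustration of the arithmetic (hit rate of the biologically active set divided by the hit rate of the general collection in the same assay); the function name, the input layout and the counts are hypothetical and are not taken from our screening data.

```python
def normalized_hit_rates(focused, general):
    """Per-assay hit rate of a focused set relative to the general collection.

    Both arguments map an assay name to (number of hits, number of compounds
    tested). The general collection is thereby normalized to 1, so a value
    above 1.0 means the focused set hits more often in that assay.
    """
    ratios = {}
    for assay, (hits_f, tested_f) in focused.items():
        hits_g, tested_g = general[assay]
        if tested_f == 0 or tested_g == 0 or hits_g == 0:
            continue  # ratio is undefined for this assay, so skip it
        rate_focused = hits_f / tested_f
        rate_general = hits_g / tested_g
        ratios[assay] = rate_focused / rate_general
    return ratios


# Made-up counts, for illustration only.
focused_counts = {"assay_A": (50, 10_000), "assay_B": (5, 10_000)}
general_counts = {"assay_A": (100, 100_000), "assay_B": (200, 100_000)}
relative = normalized_hit_rates(focused_counts, general_counts)
```

With these invented counts the focused set is enriched in assay_A (ratio above 1) and depleted in assay_B (ratio below 1).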
Clearly, the hit rate exceeds that of the general collection in the majority of screens. However, recent publications sounded a cautionary note. Roche and co-workers [13] reported the development of a virtual screening method for the identification of "frequent hitters." These compounds show up as hits in many different biological assays covering a wide range of targets for two main reasons: a) the activity of the compound is not specific for the target, or b) the compound perturbs the assay or the detection method. They found that the predicted fraction of frequent hitters decreases as the drug-likeness of the database increases.
Sheridan [14] reported finding multi-activity substructures by mining databases of drug-like compounds. Shoichet and co-workers [15] described a common mechanism underlying this phenomenon. In their study they observed that several non-specific inhibitors formed aggregates 30-400 nm in diameter, and that these aggregates were likely to be responsible for the inhibition.
With these two reports in mind we returned to our corporate database and identified, again after suitable filtering, a set of 72,000 biologically active compounds. We then selected a subset of about 25,000 compounds based on the following criteria: a) compounds with confirmed activity in at least two assays; b) compounds with confirmed activity in no more than five assays; c) compounds tested in at least ten assays. We felt that this simple approach would give us a set of information rich compounds largely free of frequent hitters. Using Daylight 2D fingerprints and a Tanimoto distance of 0.3 the set consists of 9,200 clusters, of which there are almost 5,100 singletons. We propose that this richly diverse subset is an ideal starting platform for the design of screening libraries, and for the discovery of new privileged structures. Interestingly, with respect to physical properties, the subset is slightly more lipophilic, and has slightly more polar surface area than the general collection, but the distribution of molecular weights and numbers of hydrogen bond donors and acceptors is the same. We conclude that the currently accepted drug-like physical properties boundary conditions are necessary, but not sufficient to define biological activity, and that other poorly understood factors are the true drivers of such activity. We continue to explore just what those factors might be.
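The three selection criteria above, and the distance measure used for clustering, can be expressed as a short Python sketch. The record layout and all names are hypothetical; the fingerprints are represented simply as sets of on-bit indices standing in for Daylight 2D fingerprints, which this sketch does not compute, and no clustering algorithm is implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompoundRecord:
    compound_id: str
    assays_tested: int            # number of assays the compound was run in
    assays_confirmed_active: int  # number of assays with confirmed activity


def is_information_rich(rec, min_active=2, max_active=5, min_tested=10):
    """Criteria a) to c): active in 2 to 5 assays and tested in at least 10."""
    return (rec.assays_tested >= min_tested
            and min_active <= rec.assays_confirmed_active <= max_active)


def select_subset(records):
    """Keep only the records that satisfy all three criteria."""
    return [rec for rec in records if is_information_rich(rec)]


def tanimoto_distance(bits_a, bits_b):
    """1 - |A and B| / |A or B| for fingerprints given as sets of on-bits."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(bits_a & bits_b) / union


# Invented records, for illustration only.
records = [
    CompoundRecord("cmpd_1", assays_tested=12, assays_confirmed_active=3),
    CompoundRecord("cmpd_2", assays_tested=40, assays_confirmed_active=9),  # likely frequent hitter
    CompoundRecord("cmpd_3", assays_tested=6, assays_confirmed_active=2),   # tested too rarely
]
subset = select_subset(records)
```

The upper bound on the number of active assays is what removes the likely frequent hitters, while the lower bound and the minimum number of assays tested keep only compounds whose profile is reasonably well documented.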
Another approach to the generation of information rich compound sets emphasizes the importance of true integration of the key disciplines driving drug discovery. Simply put, a set of compounds designed to uncover, explore, or provide solutions to key biological, DMPK (distribution, metabolism and pharmacokinetics) and toxicological issues cannot be designed optimally by chemists alone! It is critical that expertise from other disciplines is sought so that the information being used in a medicinal chemistry design campaign is the appropriate information. An important bonus of this focus is that buy-in and teamwork are increased when a group operates in this way (see Fig. 7).
One potential downside of a design strategy that focuses on maximizing the information content revealed by a set of compounds (and often de-emphasizing synthetic accessibility) is that the diversity of structures may present synthetic challenges not easily addressed by typical MPS approaches.
Figure 7. The critical elements of a thorough design process incorporate expertise across many disciplines.
The recognition that the intersection of biology space is limited within chemistry space has encouraged the development of new strategies in organic synthesis for the discovery of biological activity.
Our own interest in this problem was the result of our work on the biology-based collections discussed above. We found that typically only half the compounds were available as solid samples for further study, and that the remainder was dropped from consideration for that reason. The efficient re-synthesis of hundreds or thousands of disparate compounds was simply not practical. Or was it? Perhaps there was an easy way to sort out multiple syntheses to common starting materials and reactions, and carry them out in parallel.
To that end, we used LeadScope software [16] as our management tool. Normally, LeadScope links chemical and biological data, allowing chemists to explore large sets of compounds by a systematic substructural analysis using a predefined set of 27,000 structural features. More importantly for our purposes, two sets can be compared with respect to these features. We chose the Available Chemicals Directory (ACD) as our second set.
We could then easily select those starting materials that would give rise to many products via different routes. We then ran as many reactions as possible using parallel synthesis methods. We have used this method for syntheses of up to four steps, and have been able to maintain a productivity level of one compound per chemist per day, 25 mg scale, purified to >85%, and characterized by LC-MS and NMR.
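The selection of starting materials that lead to many products amounts to a simple counting exercise once each target compound has been matched to candidate starting materials. The Python sketch below assumes that such matches are already available as (product, starting material) pairs; it does not reproduce the LeadScope substructure comparison, and all identifiers are hypothetical.

```python
from collections import defaultdict


def rank_starting_materials(matches):
    """Rank starting materials by the number of distinct products they reach.

    `matches` is an iterable of (product_id, starting_material_id) pairs.
    Returns a list of (starting_material_id, product_count), most useful first.
    """
    products_by_material = defaultdict(set)
    for product, material in matches:
        products_by_material[material].add(product)  # sets ignore duplicates
    return sorted(
        ((material, len(products)) for material, products in products_by_material.items()),
        key=lambda item: (-item[1], item[0]),  # count descending, then name
    )


# Invented matches, for illustration only.
matches = [("P1", "SM_a"), ("P2", "SM_a"), ("P3", "SM_b"), ("P1", "SM_a")]
ranking = rank_starting_materials(matches)
```

Starting materials at the top of the ranking are the natural candidates for parallel synthesis, since one purchase supports several products.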
We are also developing an approach to the true simultaneous synthesis of disparate core compounds. Most molecules of the size and complexity that we are interested in would likely be prepared in no more than five steps. The actual transformations are usually limited to the chemistry background and experience of the chemist(s) involved in the project. However, the routes need not be so limited. Indeed, consider the generation of tens or hundreds of routes to each compound of interest. The problem then becomes one of how to prepare the maximum number of compounds using the minimum set of common chemistries, staging the routes as necessary in order to maximize the overlap of reagents and conditions. The generation of syntheses is software based. Two or three decades ago there was a lot of effort to develop software to predict the most efficient syntheses of complex organic molecules; most have been abandoned. We chose to use the SynGen program [17] for the very reason that it usually produces several routes to a molecule, each of which begins with a commercially available starting material, and its transformations usually have a literature precedent.
Common chemistries can be grouped at three levels: a) reaction type, e.g., acylation of amines; b) reagent type, e.g., acylation of secondary amines; and c) specific reagents, e.g., acylation of diethyl amine. Each level is specifically encoded by the program, making searching, sorting, and matching fairly easy. We will not necessarily choose the shortest route to each molecule, since it is entirely possible that some longer routes would give rise to additional commonalities, thereby allowing the preparation of a larger total number of compounds. As shown in the example, a set of 27 diverse compounds was synthesized using this approach in 22 steps.
This is a significant efficiency gain over the 83 synthetic steps needed to access these compounds using a more traditional approach.
Thus greater efficiency is achieved by staging the syntheses to maximize overlap across steps. It is anticipated that larger libraries will result in even greater synthetic efficiency (Fig. 8).
Figure 8. Illustration of a synthesis matrix aligning common synthetic steps aimed at the synthesis of 27 diverse small molecules.
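The idea of choosing, for each compound, the route that shares the most chemistry with routes already selected can be illustrated with a simple greedy heuristic in Python. This is a sketch under stated assumptions, not the SynGen algorithm or the procedure used to build Fig. 8: each route is reduced to a tuple of step labels at one chosen level of grouping, the order and staging of steps are ignored, and the compounds, labels and function name are hypothetical.

```python
def choose_routes(candidate_routes):
    """Greedily pick one route per compound to keep distinct chemistries few.

    `candidate_routes` maps a compound id to a non-empty list of routes; each
    route is a tuple of step labels. At every iteration the (compound, route)
    pair that adds the fewest new labels to the shared pool is selected, with
    ties broken in favour of the shorter route.
    Returns (chosen route per compound, set of distinct step labels used).
    """
    for compound, routes in candidate_routes.items():
        if not routes:
            raise ValueError(f"no candidate route for {compound}")
    shared_steps = set()
    chosen = {}
    remaining = dict(candidate_routes)
    while remaining:
        best = None
        for compound in sorted(remaining):
            for route in remaining[compound]:
                new_steps = len(set(route) - shared_steps)
                key = (new_steps, len(route), compound, route)
                if best is None or key < best:
                    best = key
        _, _, compound, route = best
        chosen[compound] = route
        shared_steps |= set(route)
        del remaining[compound]
    return chosen, shared_steps


# Invented candidates, for illustration only.
candidates = {
    "cmpd_1": [("acylation",), ("reductive amination", "acylation")],
    "cmpd_2": [("Suzuki coupling", "acylation"), ("SNAr",)],
    "cmpd_3": [("SNAr", "acylation"), ("Suzuki coupling",)],
}
chosen_routes, distinct_steps = choose_routes(candidates)
```

In this invented case the heuristic picks the two-step route for cmpd_3 rather than its one-step alternative, because both of its steps are already in use for the other compounds. This mirrors the point made above that the shortest route to each molecule is not necessarily the one that maximizes overlap. A greedy choice of this kind is not guaranteed to find the smallest possible set of chemistries.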
Bridging the knowledge gap between the data provided by the human genome and our knowledge of biological processes and systems is a requirement for the efficient and effective exploitation of this knowledge in drug discovery. We see this knowledge gap as being best bridged by a truly interdisciplinary approach and a tight integration of chemistry, biology, thinking and experiment. Chemical genetics provides a framework for the systematic use of small molecules to perturb and thus understand biological systems. Chemical genetics thinking is already growing in its influence among chemists and biologists, and the fruits of this integrated approach to drug discovery promise to make this an exciting and rewarding area of research for the next decade.
[1] "Post-approval R&D raises total drug development costs to $897 million". (Kaitin, K. I., Ed.), Tufts Center for the Study of Drug Development Impact Report 2003, 5(3), can be found under http://csdd.tufts.edu/InfoServices/Publications.asp#ResearchBibliography.
[2] "Pharmaceutical Innovation: An analysis of leading companies and strategies": Reuters Business Insight 2002, can be found under http://www.delphipharma.com/innovation.htm.
[3] Olbe, L., Carlsson, E., Lindberg, P. (2003) Nature Rev. Drug Discov. 2:132-139.
[4] Black, J.W.(1989) Science :486.
[5] Hopkins, A.L., Groom, C.R. (2002) Nature Rev. Drug Discov. 1:727-736.
[6] a) Schreiber, S.L.(2003) C&EN 81:51-61; b) Schreiber, S.L.(1998) Bioorg. Med. Chem. 6:1127-1152; c) Stockwell, B.R. (2000) Trends Biotechnol. 18:449-455.
[7] Zambrowicz, B. P., Sands, A. T. (2003) Nature Reviews :38-51.
[8] a) Bohacek, R.S., McMartin, C., Guida, W.C. (1996) Med. Res. Rev. 16:3-50; b) Kolb, H. C., Finn, M.G., Sharpless, K.B. (2001) Angew. Chem. Int. Ed. 40:2004-2021.
[9] a) Gura, T. (2000) Nature 407:282-284; b) Waldmann, H. (2002) Angew. Chem. Int. Ed. 41:2879-2890.
[10] Weitz-Schmidt, G. (2002) Trends Pharmacol. Sci. 23:482-486.
[11] Holm, L. (1998) Curr. Opin. Chem. Biol. 8:372-379.
[12] LaBella, F.S. (1991) Biochem. Pharmacol. 42: Suppl., S1-S8.
[13] Roche, O., Schneider, P., Zuegge, J., Guba, W., Kansy, M., Alanine, A., Bleicher, K., Danel, F., Gutknecht, P., Rogers-Evans, E., Neidhart, M., Stalder, W., Dillon, H., Sjögren, M., Fotouhi, E., Gillespie, N., Goodnow, R., Harris, W., Jones, P., Taniguchi, M., Tsujii, S., von der Saal, W., Zimmerman, G., Schneider, G. (2002) J. Med. Chem. 45:137-142.
[14] Sheridan, R.P. (2003) J. Chem. Inf. Comput. Sci. 43:1037-1050.
[15] McGovern, S.L., Caselli, E., Grigorieff, N., Shoichet, B.K. (2002) J. Med. Chem. 45:1712-1722.
[16] Roberts, G., Myatt, G.J., Johnson, W.P., Cross, K.P., Blower, P.E. (2000) J. Chem. Inf. Comput. Sci. 40:1302-1314.
[17] a) Hendrickson, J.B. (1997) Knowl. Eng. Rev. 12:369-386; b) Further information and a demonstration of the program can be found under http://syngen2.chem.brandeis.edu/syngen.html.
Published in: "The Chemical Theatre of Biological Systems", Martin G. Hicks & Carsten Kettner (Eds.),
Proceedings of the Beilstein-Institut Workshop, May 24th - 28th, 2004, Bozen, Italy.