Chemical Data Analysis in the Large, May 22nd - 26th 2000, Bozen, Italy
COMPUTATION AND ANALYSIS OF LARGE CHEMISTRY DATA SETS
S. STANLEY YOUNG AND CHRIS E. KEEFER
Glaxo Wellcome Inc., Research Triangle Park, NC 27709, USA.
ABSTRACT
Very large screening data sets are becoming available; hundreds of thousands of compounds are screened against panels of biological assays. There is a need to make sense of the data: screeners need to know which compounds to screen next, and medicinal chemists need to know which series of compounds are active and what features are associated with activity. We use the statistical technique of recursive partitioning and simple molecular descriptors, atom pairs and topological torsions, to analyze these data sets based upon the 2D representation of the compounds. We use more general features and a special 3D representation of the compounds for pharmacophore identification. The benefit of this work is that we can rapidly evaluate screening data and make sound recommendations for additional screening work or how to proceed with lead optimization.
INTRODUCTION
Enormous numbers of compounds are now available for screening. Large companies will have over five hundred thousand compounds in inventory; over one million compounds are available commercially; library synthesis offers many millions of possible compounds. It is not feasible to screen all available compounds in all screens. Indeed, with the ongoing genetics efforts there will be an explosion of drug targets over the next several years, increasing the number of available screens.
METHODS
We use rather simple compound descriptors. See Figure 1 for examples of atom pairs, [1] atom triples, and topological torsions. [2] For pharmacophore identification we use standard pharmacophore features. [3]
Atom pair: N(3,0) - 7 - S(2,0)
Atom triple: N(3,0) - 7 - S(2,0), S(2,0) - 6 - N(2,0), N(3,0) - 12 - N(2,0)
Topological torsion: N(3,0)C(2,0)-1-C(3,1)O(2,0)
Figure 1. Atom pair, atom triple and topological torsion molecular descriptors.
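As an illustration of how atom-pair descriptors of this form can be enumerated, here is a minimal Python sketch. It assumes the X(n,p) notation denotes element, number of non-hydrogen neighbors, and number of pi electrons, and that the separation counts the atoms on the shortest path including both ends, as in Carhart et al. [1]. The function names and the example molecule are hypothetical and are not taken from the authors' software.

```python
from collections import Counter, deque


def bond_distances(adjacency, start):
    """Breadth-first search: number of bonds from `start` to each reachable atom."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for neighbor in adjacency[atom]:
            if neighbor not in dist:
                dist[neighbor] = dist[atom] + 1
                queue.append(neighbor)
    return dist


def atom_pairs(elements, bonds, pi_electrons):
    """Count atom-pair descriptors 'type - separation - type' for a heavy-atom graph."""
    n_atoms = len(elements)
    adjacency = {i: set() for i in range(n_atoms)}
    for a, b in bonds:
        adjacency[a].add(b)
        adjacency[b].add(a)
    # Atom type label: element(heavy-atom neighbors, pi electrons), e.g. N(3,0).
    types = [
        f"{elements[i]}({len(adjacency[i])},{pi_electrons[i]})"
        for i in range(n_atoms)
    ]
    counts = Counter()
    for i in range(n_atoms):
        dist = bond_distances(adjacency, i)
        for j in range(i + 1, n_atoms):
            if j in dist:
                # Sort the two labels so each pair has a single canonical name.
                left, right = sorted((types[i], types[j]))
                # Assumed convention: separation = atoms on the path = bonds + 1.
                counts[f"{left} - {dist[j] + 1} - {right}"] += 1
    return counts


# Hypothetical heavy-atom graph for (CH3)2N-CH2-CH2-SH; hydrogens are implicit.
elements = ["C", "N", "C", "C", "C", "S"]
bonds = [(0, 1), (1, 2), (1, 3), (3, 4), (4, 5)]
pi_electrons = [0, 0, 0, 0, 0, 0]
pairs = atom_pairs(elements, bonds, pi_electrons)
```

Each distinct key in `pairs` can then serve as a presence/absence feature for a compound, which is the form of input that the splitting procedure needs.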
Recursively splitting a data set into homogeneous subsets was first proposed by Morgan and Sonquist. [4] Statistical methods for univariate recursive partitioning are described by Hawkins and Kass, [5] Hawkins et al. [6] and Rusinko et al. [7] Basically, all potential variables are examined, and the single variable that best splits the entire data set into two daughter data sets is selected and the split made; those compounds with the feature go to the right daughter node and those without the feature go to the left (see Figure 2). Each daughter node is split in turn. Splitting stops when there are no statistically significant splits remaining. For multivariate recursive partitioning we replace the Student t-test with the Hotelling T² test. [8]
Figure 2. The data set is split using a t-test.
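The splitting procedure described above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: a fixed threshold on |t| (the hypothetical `t_threshold`) stands in for a proper significance test, the minimum node size is an assumption, and the toy data are invented. Node names append 0 for "feature absent" (left) and 1 for "feature present" (right).

```python
import math
from statistics import mean, variance


def student_t(group_a, group_b):
    """Pooled-variance two-sample Student t statistic."""
    n1, n2 = len(group_a), len(group_b)
    pooled = ((n1 - 1) * variance(group_a) + (n2 - 1) * variance(group_b)) / (n1 + n2 - 2)
    diff = mean(group_a) - mean(group_b)
    if pooled == 0.0:
        return math.copysign(math.inf, diff) if diff else 0.0
    return diff / math.sqrt(pooled * (1 / n1 + 1 / n2))


def best_split(rows, features, min_size=2):
    """Return (|t|, feature) for the presence/absence split with the largest |t|."""
    best = (0.0, None)
    for feature in features:
        present = [y for bits, y in rows if feature in bits]
        absent = [y for bits, y in rows if feature not in bits]
        if len(present) < min_size or len(absent) < min_size:
            continue  # too few compounds on one side to estimate a variance
        t = abs(student_t(present, absent))
        if t > best[0]:
            best = (t, feature)
    return best


def partition(rows, features, t_threshold=3.0, name="N"):
    """Split recursively; stop when no candidate split reaches the threshold."""
    t, feature = best_split(rows, features)
    if feature is None or t < t_threshold:
        return {"node": name, "n": len(rows), "mean": mean(y for _, y in rows)}
    left = [row for row in rows if feature not in row[0]]   # feature absent
    right = [row for row in rows if feature in row[0]]      # feature present
    remaining = [f for f in features if f != feature]
    return {
        "node": name,
        "split_on": feature,
        "left": partition(left, remaining, t_threshold, name + "0"),
        "right": partition(right, remaining, t_threshold, name + "1"),
    }


# Invented data: (set of features present, potency). Features A and B mark
# two different active classes; feature C is unrelated to activity.
rows = [
    ({"A", "C"}, 9.0), ({"A"}, 8.5), ({"A"}, 9.2), ({"A", "C"}, 8.8),
    ({"B", "C"}, 7.0), ({"B"}, 7.4), ({"B"}, 6.8), ({"B", "C"}, 7.1),
    ({"C"}, 1.0), (set(), 1.3), (set(), 0.8), ({"C"}, 1.1),
    ({"C"}, 0.9), (set(), 1.2), (set(), 1.0), ({"C"}, 1.1),
]
tree = partition(rows, ["A", "B", "C"])
```

On these toy data the root splits on feature A, the left daughter then splits on feature B, and the remaining nodes are terminal, so the two active classes end up in separate terminal nodes (N1 and N01).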
RESULTS
Recursive partitioning is capable of identifying multiple chemical classes of compounds from a data set, and is thus a method for deconvoluting mixtures. [7] Figure 3 gives a skeleton of the recursive partitioning tree. Also given are representative compounds from two of the terminal nodes. These compounds act through different mechanisms to block the MAO enzyme; see references in Rusinko et al. [7]
Figure 3. Tree and active compound classes identified.
A data set of 20989 compounds with 4 tumor responses was obtained from the NCI website, and multivariate recursive partitioning was run on it. Figure 4 gives a skeleton tree with blowups of two of the terminal nodes. Terminal node N0101 has a relatively high incidence of the first and last tumor types, Lung and Melanoma, and a relatively low incidence of the second and third tumor types, Colon and Breast. Terminal node N001 has a high incidence of the first and third tumor types. The bits in the node names denote the absence, 0, or presence, 1, of chemical features characteristic of compounds in the terminal nodes.
Figure 4. Multivariate recursive partitioning tree, NCI data.
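For reference, the standard two-sample Hotelling T² statistic that takes the place of the t-test in the multivariate case can be written as follows. The notation is an assumption introduced here, not the authors': the subscripts R and L denote the right and left daughter nodes, p is the number of responses (4 for the NCI data), the x-bars are the vectors of mean responses, and S_R, S_L are the sample covariance matrices of the two nodes.

```latex
T^2 = \frac{n_R\, n_L}{n_R + n_L}\,
      (\bar{x}_R - \bar{x}_L)^{\top} S_p^{-1} (\bar{x}_R - \bar{x}_L),
\qquad
S_p = \frac{(n_R - 1)\, S_R + (n_L - 1)\, S_L}{n_R + n_L - 2}

\frac{n_R + n_L - p - 1}{p\,(n_R + n_L - 2)}\; T^2 \;\sim\; F_{p,\; n_R + n_L - p - 1}
```

With p = 1 the statistic reduces to the square of the pooled-variance Student t statistic, so the univariate splitting rule is a special case.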
An internal data set of 1444 compounds with IC50 values for the kinase CDK2 was analyzed using typical pharmacophoric features (H-bond donor, H-bond acceptor, etc.). [3] Multiple conformations were computed and the distances between features were binned. After each split, constrained conformations were computed. In total, about 1.4M conformations were computed and the analysis took about 14 hours of CPU time. The resulting recursive partitioning tree is given in Figure 5.
Figure 5. 3D recursive partitioning tree for CDK2 data set.
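The binning of inter-feature distances over multiple conformations might look like the following Python sketch. The bin width, the feature names, the coordinates, and the rule that a descriptor is "present" if any conformation shows it are all assumptions made for illustration; the source does not specify them.

```python
import math
from itertools import combinations


def distance_bin_descriptors(conformations, bin_width=2.0):
    """Collect (feature_a, feature_b, bin_index) keys seen in any conformation.

    Each conformation is a list of (feature_type, (x, y, z)) tuples with
    coordinates in Angstroms; bin k covers [k * bin_width, (k + 1) * bin_width).
    """
    keys = set()
    for features in conformations:
        for (type_a, xyz_a), (type_b, xyz_b) in combinations(features, 2):
            first, second = sorted((type_a, type_b))  # order-independent key
            bin_index = int(math.dist(xyz_a, xyz_b) // bin_width)
            keys.add((first, second, bin_index))
    return keys


# Two hypothetical conformations of one compound with three features.
conformations = [
    [("donor", (0.0, 0.0, 0.0)), ("acceptor", (3.0, 0.0, 0.0)),
     ("hydrophobe", (0.0, 5.0, 0.0))],
    [("donor", (0.0, 0.0, 0.0)), ("acceptor", (4.5, 0.0, 0.0)),
     ("hydrophobe", (0.0, 5.0, 0.0))],
]
descriptors = distance_bin_descriptors(conformations)
```

Each key then acts as a binary feature for the compound, so the same presence/absence splitting used for the 2D descriptors can be applied to the 3D data.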
The resulting 3D pharmacophore was comparable to crystal structure results (Figure 6).
Figure 6. Node N111 in CDK2 Tree.
DISCUSSION
The key problem to be overcome in the analysis of HTS data sets is that there are likely to be multiple biological mechanisms. Some molecules may act through one mechanism and others through another. Some might bind in one orientation, others in a different orientation or even at a different location. In the case of the Abbott MAO data set, two mechanisms are known, and compounds following each mechanism are found by recursive partitioning. For a large HTS data set there are likely to be multiple mechanisms, and even for a single binding pocket different compounds might bind in different orientations.
Most statistical methods assume that there is one underlying model of a single process. If there are two processes, e.g. a regular binding site and an allosteric binding site, then the features important for one process are very unlikely to be important for the other. Most statistical methods, e.g. linear regression, will average the effect of each feature over the two processes. The results are likely to be poor and could be entirely misleading.
Recursive partitioning is a simple statistical method that can deal with multiple mechanisms. A feature is identified and the data are split based upon this feature. If the feature is important for a specific mechanism, then compounds with that feature (and binding by that mechanism) are separated out from the main body of the data. Further analysis of this subset is limited to just these compounds; other compounds in the data set have no effect on it. In this manner, multiple mechanisms can be identified.
REFERENCES AND NOTES
[1] Carhart, R. E.; Smith, D. H.; Venkataraghavan, R. J. Chem. Inf. Comput. Sci. 1985, 25, 64.
Published in "Chemical Data Analysis in the Large: The Challenge of the Automation Age", Martin G. Hicks (Ed.), Proceedings of the Beilstein-Institut Workshop, May 22nd - 26th, 2000, Bozen, Italy http://www.beilstein-institut.de/bozen2000/proceedings/ |