Proceedings of

the Beilstein Bozen Symposium

Chemical Data Analysis in the Large:

The Challenge of the Automation Age

22 – 26 May 2000 in Bozen, Italy



Managing and effectively utilizing large collections of highly diverse chemical data – especially when chemical structures are involved – is a major challenge for the chemical and pharmaceutical industries and universities, as well as for the producers of large publicly available databases. Automated techniques such as combinatorial chemistry coupled with high throughput screening result in the routine generation of enormous amounts of data. Methods of information handling such as knowledge discovery and data mining, machine learning, statistical analysis, and visualization, whose origins lie outside chemistry, are becoming increasingly applicable in the chemical sciences. The aim of this workshop was to bring together experts from chemical and non-chemical fields to discuss new and better methods for handling and analyzing large amounts of data of a chemical nature.

The remote location of Schloss Korb – set on a hillside overlooking Bozen/Bolzano – provided the ideal venue for the participants to spend time discussing issues of interest and to make contact with scientists from different disciplines. The format of these workshops, with ample time for discussion between the lectures and afterwards at lunch and dinner, provided the participants with something rarely found at larger meetings – time to think and time to talk.

Over three days we heard a series of invited talks, which covered the following areas:

  • Knowledge Discovery and Data Mining
  • Information Extraction and Text Mining
  • Data Compression and Clustering of Large Data Sets
  • Chemical Structure Representations
  • Structure Browsing and Similarity Indexes
  • Virtual Screening and Library Design
  • Property Prediction
  • Visualization of Data and Physicochemical Properties

The scientific program was compiled by Martin Hicks (Beilstein-Institut), Gerald Maggiora (Pharmacia) and Peter Willett (University of Sheffield).

The Beilstein-Institut organizes and sponsors scientific meetings, workshops and seminars, with the aim of catalyzing advances in chemical science by facilitating the interdisciplinary exchange and communication of ideas amongst the participants. We were very pleased that speakers from both inside and outside the mainstream chemical community accepted invitations to speak. The response we received both during and after this workshop clearly reflected the attractiveness of the scientific program and the format of the workshop.

We would like to thank particularly the authors who provided us with written versions of the papers that they presented. Special thanks go to all those involved with the preparation and organization of the workshop, as well as to the speakers and participants for their contributions in making this workshop a success.

Werner Brich
Martin G. Hicks


Bruce Croft

NSF Center for Intelligent Information Retrieval, Computer Science Department, University
of Massachusetts, Amherst, USA.

Much of the information in science, engineering and business has been recorded in the form of text. Traditionally, this information would appear in journals or company reports, but increasingly it can be found online in the World-Wide Web. Tools to support information access and discovery on the Internet are proliferating at an astonishing rate. Some of this development reflects real progress but there are also many exaggerated claims. The focus of this presentation will be to review the important technologies for text-based information access on the Web and to describe the progress that is being made by researchers in these areas.


Robert Gaizauskas, Kevin Humphreys and George Demetriou

Department of Computer Science, University of Sheffield, UK.

With the explosive growth of scientific literature in the area of molecular biology, the need to process and extract information automatically from on-line text sources has become increasingly important. Information extraction technology, as defined and developed through the U.S. DARPA Message Understanding Conferences (MUCs), has proved successful at extracting information primarily from news-wire texts and primarily in domains concerned with human activity. In this paper we consider the application of this technology to the extraction of information from scientific journal papers in the area of molecular biology. We describe how an information extraction system designed to participate in the MUC exercises has been modified for two bioinformatics applications: EMPathIE, concerned with enzyme and metabolic pathways; and PASTA, concerned with protein structure. The progress so far provides convincing grounds for believing that IE techniques will deliver novel and effective ways for the extraction of information from unstructured text sources in the pursuit of knowledge in the biological domain.


Valerie J. Gillet

University of Sheffield, UK.

The techniques of combinatorial chemistry and high throughput screening are in widespread use in the pharmaceutical and agrochemical industries. During the last few years, many different computational approaches have been developed to select compounds for screening and to design combinatorial libraries. The main approaches are reviewed in the first half of this paper. In the second half, we describe how the library design program SELECT has been used to demonstrate that significant improvements in diversity can be achieved by basing library design in product space rather than in reactant space. A series of experiments is reported involving two combinatorial libraries, three different descriptors and three different diversity indices. Finally, a further significant advantage of performing library design in product space is the ability to optimise multiple properties simultaneously. Thus, SELECT can be used to design libraries that are both diverse and have druglike physicochemical properties.


Fionn Murtagh

School of Computer Science, The Queen's University of Belfast, Northern Ireland.

We review the time and storage costs of search and clustering algorithms. We exemplify these with case studies in astronomy, information retrieval, visual user interfaces, chemical databases, and other areas. First we describe nearest neighbor searching, an elemental form of clustering and a basis for the clustering algorithms that follow. Next we review a number of families of clustering algorithms. Finally we discuss visual or image representations of data sets, from which a number of interesting algorithmic developments arise.
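As a baseline for such cost discussions, the elemental nearest-neighbor operation can be sketched as a brute-force scan, which costs O(n) per query; the algorithms the talk surveys improve on exactly this. The data here are purely illustrative:

```python
from math import dist

def nearest_neighbor(query, points):
    """Return the point in `points` closest to `query` (Euclidean distance)."""
    return min(points, key=lambda p: dist(query, p))

# Three toy 2-D data points
points = [(0.0, 0.0), (1.0, 1.0), (3.0, 2.0)]
print(nearest_neighbor((0.9, 1.2), points))  # (1.0, 1.0)
```

Tree-based or clustering-based index structures reduce the per-query cost below this linear scan, which is what makes large-scale search and clustering feasible.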


Holger Wallmeier

Aventis Research & Technologies GmbH & Co KG, Core Technology Area Biomathematics, Industrial Park Hoechst, Frankfurt am Main, Germany.

Today, supporting industrial research and development activities with computing and information technologies means dealing with huge amounts of data. Data management is therefore a crucial aspect of the successful application of these technologies. Various strategies are used to handle the situation, each of which has its merits depending on the type of data, the context, and the usage.

Apart from the very straightforward approach to distribute data on appropriate storage media of sufficient volume, there are three different ‘philosophies’ of data compression.

  1. Non-lossy data compression
  2. Lossy data compression
  3. Model-based data compression

Types 1 and 2 are probably the most widely used because they do not necessarily introduce a bias into the compressed data. There are a number of methods known today that are fully reversible, or at least reversible to a large extent.
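The fully reversible (non-lossy) case can be illustrated with a standard library round trip; the repetitive byte string here is only a stand-in for the kind of redundant record data that compresses well:

```python
import zlib

data = b"ATOM  " * 1000          # repetitive record-style data compresses well
packed = zlib.compress(data)

assert zlib.decompress(packed) == data  # fully reversible: no information lost
print(len(data), "->", len(packed), "bytes")
```

Lossy schemes trade this exact reversibility for higher compression ratios, which is acceptable when the discarded detail is below the precision needed by the application.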

This is different for model-based data compression. The idea is useful for data produced by dynamic, deterministic systems. What matters is the existence of a model with a well-defined data scheme and data structure. These model features can be used to condense the corresponding original data. Two examples from industrial research are presented.

The first example is the representation of computer simulations of molecular ensembles by correlation functions. The second example is the representation of microbiological studies on pathogenicity by kinetic constants. In both cases, the underlying model together with methods to generate compressed data representations allows efficient interpretation of simulations or experiments, respectively.
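The kinetic-constant idea can be sketched as follows: if a measured time series is assumed to follow a known model (here, hypothetically, first-order decay y = y0·exp(-k·t)), the whole series can be condensed to the fitted parameters. The data and model choice below are illustrative, not taken from the studies themselves:

```python
from math import exp, log

# Synthetic time series assumed to follow first-order kinetics: y = y0 * exp(-k*t)
ts = [0.0, 1.0, 2.0, 3.0, 4.0]
y0_true, k_true = 100.0, 0.5
ys = [y0_true * exp(-k_true * t) for t in ts]

# Log-linear least-squares fit: ln(y) = ln(y0) - k*t
n = len(ts)
sx, sy = sum(ts), sum(log(y) for y in ys)
sxx = sum(t * t for t in ts)
sxy = sum(t * log(y) for t, y in zip(ts, ys))
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
k_fit = -slope

print(round(k_fit, 3))  # the series is now condensed to the pair (y0, k)
```

Storing (y0, k) instead of the raw series is exactly the "model-based" condensation: the model's data scheme is what makes the original reproducible to the accuracy of the fit.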

High levels of data condensation provide a variety of opportunities to link results from research and development to auxiliary information from many different sources. Thus, powerful infrastructures for decision support can be created.


Mark Johnsonᵃ and Yong-Jin Xuᵇ

ᵃComputer-Assisted Drug Discovery, Pharmacia Inc., Kalamazoo, USA.

ᵇDiscovery Medicinal Chemistry, Pharmacia Inc., Creve Coeur, USA.

Structural browsing indices (SBIs) have been proposed as tools for organizing and exploring large sets of chemical structures in a manner complementary to that addressed by substructure and similarity-based methodologies. Molecular equivalence indices (MEQIs) comprise a special subclass of SBIs that play a central role in constructing a suite of SBIs appropriate to a variety of browsing, chemical-diversity, and SAR tasks. After presenting a general definition of a molecular equivalence index, three different ways of constructing SBIs based on MEQIs will be illustrated. The first index uniquely identifies the chemical graph of a compound and will be used to identify the sets of geometric and stereoisomers in a compound collection as well as to visually assess the overlap of two compound collections. The second index identifies a largest set of nonoverlapping functional groups of a compound and will be used to visually identify a functional-group-based receptor-relevant subspace associated with ACE inhibitors. The third index provides a hierarchical ordering of compounds whose use will be illustrated in the context of browsing structures and SAR relationships.


S. Stanley Young and Chris E. Keefer

Glaxo Wellcome Inc., Research Triangle Park, USA.

Very large screening data sets are becoming available; hundreds of thousands of compounds are screened against panels of biological assays. There is a need to make sense out of the data; screeners need to know which compounds to screen next and medicinal chemists need to know which series of compounds are active and what features are associated with activity. We use the statistical technique of recursive partitioning, together with simple molecular descriptors (atom pairs and topological torsions), to analyze these data sets based upon the 2D representation of the compounds. We use more general features and a special 3D representation of the compounds for pharmacophore identification. The benefit of this work is that we can rapidly evaluate screening data and make sound recommendations for additional screening work or how to proceed with lead optimization.
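A single step of recursive partitioning can be sketched as choosing the descriptor whose presence/absence best separates mean activity; the full method applies this split recursively to each resulting subset. The bit vectors and activities below are invented for illustration (the bits standing in for, say, atom-pair presence):

```python
def best_split(rows):
    """Pick the descriptor index whose presence/absence best separates mean activity."""
    n_feat = len(rows[0][0])

    def gap(j):
        present = [act for bits, act in rows if bits[j]]
        absent  = [act for bits, act in rows if not bits[j]]
        if not present or not absent:
            return -1.0  # degenerate split: all rows on one side
        return abs(sum(present) / len(present) - sum(absent) / len(absent))

    return max(range(n_feat), key=gap)

rows = [  # (descriptor bits, measured activity)
    ((1, 0, 1), 9.0),
    ((1, 1, 0), 8.5),
    ((0, 0, 1), 2.0),
    ((0, 1, 0), 1.5),
]
print(best_split(rows))  # descriptor 0 cleanly separates actives from inactives
```

Repeating the split on each branch yields the tree of compound classes that lets chemists read off which structural features track with activity.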


William L. Jorgensenᵃ and Erin M. Duffyᵇ

ᵃDepartment of Chemistry, Yale University, New Haven, USA.

ᵇCentral Research Division, Pfizer Inc., Groton, USA.

Monte Carlo statistical mechanics simulations have been carried out for more than 250 organic solutes in water. Physically significant descriptors such as the solvent-accessible surface area, numbers of hydrogen bonds, and indices for cohesive interactions in solids are correlated with pharmacologically important properties including the octanol/water partition coefficient (log P), aqueous solubility (log S), and brain/blood concentration ratio (log BB). The regression equations for log P and log S only require 4–5 descriptors to provide correlation coefficients, r², of 0.9 and rms errors of 0.7. The descriptors can form a basis for structural modifications to guide an analog’s properties into desired ranges. For more rapid application, a program that estimates the significant descriptors, QikProp, has been created. It can be used to predict the properties for ca. 1 compound/sec. with no loss of accuracy.
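The descriptor-to-property regressions described above can be sketched in miniature with a one-descriptor least-squares fit; the numbers below are synthetic (real models of this kind use 4–5 physically significant descriptors fit against measured data):

```python
# Illustrative least-squares fit of a property to a single descriptor.
xs = [1.0, 2.0, 3.0, 4.0]   # hypothetical descriptor values (e.g. scaled surface area)
ys = [0.9, 2.1, 2.9, 4.1]   # hypothetical measured property (e.g. log P)

n = len(xs)
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n
print(round(slope, 2), round(intercept, 2))  # 1.04 -0.1
```

Once the coefficients are fit, predicting a property for a new compound reduces to evaluating the regression equation on its descriptors, which is what makes throughput of roughly one compound per second feasible.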


Timothy Clark

Computer-Chemie-Centrum, Universität Erlangen-Nürnberg, Erlangen, Germany.

The use of semiempirical MO-theory for complete databases is demonstrated using the example of the Maybridge Chemical Company Database (53,000 compounds). 3D descriptors derived from the quantum mechanical wavefunction are used to set up QSPR models using neural nets as the interpolation technique. Techniques for cross-validation of such models and for calculating individual error estimates for each compound are discussed. The approach is illustrated for properties such as log P, the vapor pressure, aqueous solubility and boiling points. The multi-net method of estimating individual error bars appears to give a good approximation of error limits of ± one standard deviation for several datasets.



Rainer Herges and Andrea Papafilippopoulos

Institut für Organische Chemie, Technische Universität Braunschweig, Braunschweig, Germany.

We have shown that the anisotropy of the induced current density (ACID) can be interpreted as the density of the delocalized electrons in molecules. The ACID scalar field, which can be plotted as an isosurface, is a powerful and generally applicable method for investigating and visualizing delocalization and conjugative effects, e.g. stereoelectronic effects in reactions, the anomeric effect, aromaticity, homoaromaticity etc.


Jeffrey D. Saffer, Cory L. Albright, Augustin J. Calapristi, Guang Chen, Vernon L. Crow, Scott D. Decker, Kevin M. Groch, Susan L. Havre, Joel M. Malard, Tonya J. Martin, Nancy E. Miller, Philip J. Monroe, Lucy T. Nowell, Deborah A. Payne, Jorge F. Reyes Spindola, Randall E. Scarberry, Heidi J. Sofia, Lisa C. Stillwell, Gregory S. Thomas, Sarah J. Thurston, Leigh K. Williams and Sean J. Zabriskie

OmniViz, Inc., Richland, USA.

The volumes and diversity of information in the discovery, development, and business processes within the chemical and life sciences industries require new approaches for analysis. Traditional list- or spreadsheet-based methods are easily overwhelmed by large amounts of data. Furthermore, generating strong hypotheses and, just as importantly, ruling out weak ones, requires integration across different experimental and informational sources. We have developed a framework for this integration, including common conceptual data models for multiple data types and linked visualizations that provide an overview of the entire data set, a measure of how each data record is related to every other record, and an assessment of the associations within the data set.