Web Search, Filtering and Text Mining: Technology for a New Era of Information Access
Bruce Croft
NSF Center for Intelligent Information Retrieval, University of Massachusetts, Amherst, United States of America
Much of the information in science, engineering and business has been recorded in the form of text. Traditionally, this information would appear in journals or company reports, but increasingly it can be found online in the World-Wide Web. Tools to support information access and discovery on the Internet are proliferating at an astonishing rate. Some of this development reflects real progress but there are also many exaggerated claims. The focus of this presentation will be to review the important technologies for text-based information access on the Web and to describe the progress that is being made by researchers in these areas.
One of the major tools for information access is the search engine. Most search engines use information retrieval techniques to rank Web pages in presumed order of relevance based on a simple query. Compared to the bibliographic information retrieval systems of the 70’s and 80’s, the new search engines must deal with information that is much more heterogeneous, “messy”, more varied in quality and vastly more distributed or “linked”. Queries tend to be short and the potential database is very large and growing rapidly. Current solutions to these problems emphasize speed and coverage, with less importance attached to effectiveness. With the growing number of complaints about “information overload”, however, this is beginning to change. Research in this area is providing solutions in the form of new retrieval models, distributed architectures, and summarization techniques. There is also an increasing focus on different approaches to retrieval such as question answering and cross-lingual retrieval. The efforts on standardizing format and content through XML and metadata will also have a large impact on future search systems.
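As a rough illustration of what ranking pages "in presumed order of relevance" for a short query can mean, the sketch below scores a few toy pages with TF-IDF weighting. TF-IDF is a textbook scheme used here only for illustration; the abstract does not say which retrieval model any particular engine uses, and the function names and sample pages are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; real engines do far more."""
    return text.lower().split()


def rank(query, pages):
    """Return page indices sorted by descending TF-IDF score for the query."""
    docs = [Counter(tokenize(p)) for p in pages]
    n = len(docs)
    scores = []
    for i, doc in enumerate(docs):
        score = 0.0
        for term in tokenize(query):
            df = sum(1 for d in docs if term in d)  # document frequency
            if df == 0:
                continue  # term appears nowhere, contributes nothing
            idf = math.log(1 + n / df)  # rarer terms weigh more
            score += doc[term] * idf
        scores.append((score, i))
    # Highest score first; ties keep the original page order.
    return [i for _, i in sorted(scores, key=lambda s: (-s[0], s[1]))]


# Hypothetical toy "Web pages".
pages = [
    "protein structure prediction methods",
    "enzyme kinetics and protein folding of protein complexes",
    "stock market report",
]
ranking = rank("protein folding", pages)
```

The two-word query mirrors the short queries mentioned above: with so little evidence, the ranking depends heavily on how term rarity is weighted.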
Information filtering has been around for some time in the form of “current awareness” systems. A number of Web tools provide this functionality (often under the “agent” label). Collaborative filtering is a complementary technique based on matching user preferences that has become popular in e-commerce applications. It remains to be seen whether the combination of content-based and collaborative filtering will improve information access in scientific and engineering contexts.
A considerable amount of research is being carried out under the heading of text data mining. This includes a variety of techniques such as information extraction, clustering, and discovery of associations or “rules”. All of these techniques combine statistical methods with some level of linguistic analysis. Information extraction techniques are designed to extract “facts” from text. In many cases, this means very simple facts such as names of companies, people, and monetary amounts, but some applications also extract more complex information. There has also been recent work focusing on information extraction based on the structure of Web pages. Clustering is used to group related information. This technique has been well-studied in information retrieval but has recently been the subject of a number of new papers. Information extraction and clustering can be used with other techniques to discover interesting associations in text databases. The applications of this type of discovery have been mostly based on business information, but it may also be useful in scientific and engineering contexts.
Two Applications of Information Extraction to Biological Science Journal Articles: Enzyme Interactions and Protein Structures
Robert Gaizauskas
Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello Street, Sheffield S1 4DP, United Kingdom
Information Extraction (IE) may be defined as the activity of extracting details of predefined classes of entities and relationships from natural language texts and placing this information into a structured representation called a template. IE technology, as refined through the U.S. DARPA Message Understanding Conferences (MUCs), has proved successful at extracting information primarily from newswire texts and primarily in domains concerned with human activity. In this talk I discuss the application of this technology to the extraction of information from scientific journal papers in the area of molecular biology. In particular, I describe how an information extraction system designed to participate in the MUC exercises has been modified for two bioinformatics projects: EMPathIE, concerned with enzyme and metabolic pathways; and PASTA, concerned with protein structure. Progress to date provides convincing grounds for believing that IE techniques will deliver novel and effective ways for scientists to make use of the core literature which defines their disciplines.
Understanding Chemistry. A Beginner’s Guide for Computers
John Bradshaw
Daylight Chemical Information Systems Inc., Sheraton House, Castle Park, Cambridge CB3 0AX, United Kingdom
Chemistry is defined in the dictionary as “…science concerned with properties of substances and their combinations and reactions…”. However, modern chemistry handles information as much as it handles material change [1]: chemists process from the outset a piece of information and work on various electronic signals generated by the plethora of instrumentation present in today’s chemical laboratory.
Laszlo goes further,
“…as a rule samples have become miniscule, in amounts generally between micrograms and milligrams. One may safely conclude that present-day chemists ‘handle’ mostly mental representations. The chemical laboratory, more than a site of transformations of matter, has become predominately a production centre for concepts…”
Since the mid-nineteenth century, in order to understand a molecular compound and account for its physical, chemical and biological properties, it has been essential to understand its structure. Much of the instrumentation described above is used to determine structure and establish the relationship of the compound(s) under study, to the world of known structures.
Along with the development of the structural theory that the understanding of structure has required, non-textual graphical techniques have developed to represent these three-dimensional objects and the concepts surrounding them in a two-dimensional graphic space [2]. These techniques are now so entrenched that “it is difficult to imagine that we could talk, write or even think about molecular structures without recourse to them” [3]. Unfortunately there are a large number of these techniques, often developed at the behest of typesetters in publishing houses, which the chemist moves happily between, doing the mental gymnastics necessary to check for identity or similarity. Thus several differently drawn structures are all interpreted correctly by a practised chemist as being representations of the same chemical.
It is unfortunate, too, that these same graphical techniques have been emulated to allow communication of structural information to computers. Whilst it is fine for computers to be used to render structures to look like chemist-drawn ones, a process as simple as counting charges, or indeed assigning a symmetry group, to something as trifling as nitrobenzene becomes non-trivial for a computer, depending on the graphical representation. Yet it is these very simple counts and related calculations that are being used to parameterize many millions of molecules for use in HTS-QSAR studies. Unless we represent a set of molecules under study consistently before parameterizing them, any subsequent data analysis could be worthless, however sophisticated the technique.
Using techniques developed to handle the structures for the many millions of samples provided by vendors for high-throughput screening, I hope to illustrate approaches to providing normalized valence bond representations of structures and show that this is an essential prerequisite for any structure-activity oriented data analysis.
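The nitrobenzene problem can be made concrete with a small sketch. It assumes the structures arrive as SMILES strings and uses two common ways of writing the nitro group, a charge-separated form and a "pentavalent" nitrogen form. The regular expression and the string rewrite are toy stand-ins, not a real SMILES parser, and the choice of the charge-separated form as the normalized one is an assumption made only for this example.

```python
import re

# Two common ways of writing nitrobenzene as SMILES (illustrative inputs).
CHARGE_SEPARATED = "c1ccccc1[N+](=O)[O-]"
PENTAVALENT = "c1ccccc1N(=O)=O"


def count_charged_atoms(smiles):
    """Count bracket atoms that carry a charge sign (toy regex, not a parser)."""
    return len(re.findall(r"\[[^\]]*[+-][^\]]*\]", smiles))


def normalize_nitro(smiles):
    """Rewrite the pentavalent nitro pattern into the charge-separated form.

    A plain string replacement is enough for this one example; a production
    normalizer would operate on the parsed molecular graph.
    """
    return smiles.replace("N(=O)=O", "[N+](=O)[O-]")


before = (count_charged_atoms(CHARGE_SEPARATED), count_charged_atoms(PENTAVALENT))
after = (
    count_charged_atoms(normalize_nitro(CHARGE_SEPARATED)),
    count_charged_atoms(normalize_nitro(PENTAVALENT)),
)
```

Before normalization the same compound yields two different charge counts; after it, both inputs give the same count. Consistency of this kind is what any later parameterization depends on.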
References:
1. Laszlo, P.: 1998, ‘Chemical Analysis as Dematerialization’, HYLE, 4(1), 29-38
2. Luisi, P.-L.; Thomas, R.M.: 1990, ‘The Pictographic Molecular Paradigm: Pictorial Communication in the Chemical and Biological Sciences’, Naturwissenschaften, 77, 67-74
3. Francoeur, E.: 2000, ‘Beyond dematerialization and inscription’. HYLE, 6(1),
Latent Semantic Structure Indexing (LaSSI)
Richard Hull
Merck Research Laboratories, 126 E. Lincoln Ave., RY50SW-100, Rahway, NJ 07065, United States of America
A novel extension of the vector model for computing chemical similarity is described. This method uses the singular value decomposition (SVD) of a chemical descriptor/molecule matrix to create a low dimensional representation of the original descriptor space. Ranking compounds relative to a probe compound using the similarity in a reduced dimensional descriptor space versus the similarity in the original descriptor space has several advantages: latent structure matching is more robust than descriptor matching; choice of the number of singular values provides a rational way to vary the resolution of the search; probes created from more than one molecule are handled more naturally; and the reduction in the dimensionality of the chemical space increases searching speed.
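A minimal sketch of the idea follows, assuming a tiny descriptor/molecule matrix with descriptors as rows and molecules as columns. To stay dependency-free it finds the leading left singular vectors by power iteration with deflation, which is only reasonable for toy matrices; a real implementation would call a numerical SVD routine. Comparing projections onto the leading singular vectors by cosine is one common scaling choice and is not necessarily the one used in LaSSI. All names and data are hypothetical.

```python
import math
import random


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def top_left_singular_vectors(a, k, iters=300, seed=0):
    """Top-k left singular vectors of `a` via power iteration on a * a^T."""
    rng = random.Random(seed)
    n = len(a)
    # Descriptor-by-descriptor Gram matrix a * a^T.
    gram = [[dot(a[i], a[j]) for j in range(n)] for i in range(n)]
    vectors = []
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in range(n)]
        for _ in range(iters):
            w = [dot(row, v) for row in gram]
            norm = math.sqrt(dot(w, w))
            v = [x / norm for x in w]
        eigenvalue = dot(v, [dot(row, v) for row in gram])
        vectors.append(v)
        # Deflate: subtract the component just found before the next pass.
        gram = [[gram[i][j] - eigenvalue * v[i] * v[j] for j in range(n)]
                for i in range(n)]
    return vectors


def cosine(u, v):
    denom = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / denom if denom else 0.0


def rank_against_probe(a, probe, k):
    """Rank molecules (columns of `a`) by cosine to the probe in k dimensions."""
    basis = top_left_singular_vectors(a, k)

    def to_latent(x):
        return [dot(u, x) for u in basis]

    probe_k = to_latent(probe)
    molecules = list(zip(*a))
    scores = [cosine(probe_k, to_latent(m)) for m in molecules]
    order = sorted(range(len(molecules)), key=lambda j: -scores[j])
    return order, scores


# Rows: hypothetical descriptor counts; columns: molecules 0..3.
matrix = [
    [1, 1, 0, 0],  # aromatic ring
    [1, 1, 0, 0],  # carboxylic acid
    [0, 0, 1, 2],  # amine
    [0, 0, 1, 1],  # halogen
    [0, 1, 0, 0],  # hydroxyl
]
probe = [2, 1, 0, 0, 1]
order, scores = rank_against_probe(matrix, probe, k=2)
```

Changing `k` is the knob the abstract refers to: fewer singular values give a coarser, more tolerant match, while more give a finer one.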
Harvesting Structures for HTS Knowledge Mining
Susan I. Bassett
Bioreason, Inc., 150 Washington Avenue, Suite 303, Santa Fe, NM 87501, United States of America
The growing use of high-throughput screening (HTS) continues to push the need to create knowledge structures amenable to archiving and mining information not just for a single screen but also across many screens over time. Drug discovery teams want not only the information from the current screen, but also the ability to compare information from previous screens, and from related screens on the same or different libraries.
A good first step in this solution is to build automated knowledge discovery systems that allow capture of information in a structured manner. These knowledge structures can then be archived for future mining according to the needs of the scientists.
The question of what type of knowledge-based method is best for a given application is a classic problem in the science of computational intelligence. Is a neural network better than an expert system, is a frame-based or case-based system needed, or does the problem lend itself to a hybrid solution? The answer depends on what type of information or knowledge the user hopes to gain from the data, and more critically, what type of decision process is going to be used with the knowledge extracted from the data.
For example, in analyzing a large set of compounds showing activity in an initial screen, a project scientist needs to first quickly sort the compounds into classes which relate structural components to activity. Without knowing how many different mechanisms for activity are present within the data, this becomes a search first for naturally coherent families of compounds, and then for structural features that indicate either unique mechanisms or ones that are similar to those in families already analyzed.
Continuing this example, the compound below could be characterized in a number of ways, depending on which structural domain the project chemist thought was significant for activity. In fact, this same compound might be active for different reasons in different assays. The “correct” family for this compound is not clear until more analysis is done, and so a method for placing this compound in several families, based on similarities to other active compounds from this assay using a variety of structural domains is needed. A software solution embodying computational intelligence can incorporate the flexibility of a human chemist in reasoning about which family or families are natural homes for this compound. Our software places compounds in multiple families depending on different domains of similarity, called “phylogenetic-like groupings” to indicate that generational family characteristics are important.
This is just one of the many types of decisions that project chemists make in analyzing HTS data, and one that points up the need for knowledge structures that are rich enough to capture the information needed to make these decisions, as well as automated systems to capture the human reasoning process needed to use these knowledge structures in making decisions.
Enhancing rather than replacing the abilities of the human chemist, decision support systems allow access to this structured knowledge to aid in lead identification as well as lead characterization and prioritization. Cross-assay analysis and comparison of responses from different libraries as they are submitted to HTS for a target becomes possible when a suitable knowledge structure for archiving and comparing the results is used. In this talk, the use of computational intelligence and other techniques in building these systems will be presented, and the relevance to realistic, noisy HTS datasets will be illustrated.
Bioactivity Datamining
William Fisanick
Research Unit, Chemical Abstracts Service, Columbus, Ohio 43210, United States of America
Bioactivity datamining can be described as the automatic discovery of relationships in typically large databases among three components: 1) drug, 2) bioactivity and 3) drug target. In many instances, the discovery results can be used in predicting relationships among this component triad. The prediction or estimation of such relationships is becoming increasingly important with the advent of combinatorial chemistry, high-throughput screening, genomics, etc. For example, the “virtual” chemistry capabilities resulting from bioactivity datamining can significantly focus the design of a substance or set of substances that have good potential for general or specific bioactivities and thus reduce time and costs. One important and rich “knowledge” source for bioactivity datamining is the secondary journal and patent literature published by a publisher such as Chemical Abstracts Service (CAS). This paper will discuss various techniques and methods for the browsing and prediction of “drug--bioactivity--drug target” relationships using the literature knowledge base. Some estimation possibilities are a small molecule’s potential for a therapeutic category such as anticonvulsant or a mode of action such as beta-blocker; potential binding ligands for a protein sequence; and an expressed sequence tag’s (EST) potential for association with a protein or a disease. In addition to specific bioactivity estimation, schemes have been examined to estimate the “general” bioactivity potential of a substance.
Designing Combinatorial Libraries by Exploring Drug Space
Val Gillet
Department of Information Studies, The University of Sheffield, Western Bank, Sheffield S10 2TN, United Kingdom
Combinatorial chemistry is now a key technique for the discovery of novel bioactive compounds within the pharmaceutical and agrochemical industries. The technique allows the simultaneous synthesis of large sets of structurally related molecules, called combinatorial libraries. The collection of compounds that could potentially be made in a combinatorial reaction scheme using all available reactants is referred to as a virtual library. Practical limits on the automation equipment are such that virtual libraries often greatly exceed capacity and hence there is a requirement to be selective about the compounds that are actually synthesised in a combinatorial library experiment. An important concept in combinatorial library design is molecular diversity, i.e., the degree of heterogeneity, structural range or dissimilarity in a set of compounds. This concept is based on the assumption that a structurally diverse range of compounds should provide a broad coverage of bioactivity space. The different methods that have been developed for selecting diverse subsets of compounds will be reviewed. Selection strategies are often applied at the reactant level on the assumption that diverse reactants lead to diverse products; however, evidence will be provided to demonstrate that this assumption does not hold and that applying selection at the product level results in increased diversity. Finally, while diverse libraries have been shown to be effective in generating hits, there is a tendency for them to contain many compounds that have poor physicochemical properties, for example, high molecular weight, high flexibility or high lipophilicity. Thus, other factors also need to be taken into account if a library is to contain attractive drug leads. Recent efforts have been directed towards designing libraries that maintain high diversity but which also have drug-like physicochemical properties.
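One widely used family of diverse-subset methods is dissimilarity-based selection. The sketch below shows a greedy MaxMin pick using Tanimoto dissimilarity on fingerprints represented as sets of feature identifiers. The abstract does not name the methods it reviews, so this is an illustrative choice rather than the author's algorithm, and the fingerprints are invented.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of feature identifiers."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def maxmin_select(fingerprints, n_pick, first=0):
    """Greedy MaxMin: repeatedly add the compound whose nearest
    already-selected neighbour is farthest away (1 - Tanimoto)."""
    selected = [first]
    # Distance from every compound to its closest selected compound.
    min_dist = [1.0 - tanimoto(fp, fingerprints[first]) for fp in fingerprints]
    while len(selected) < min(n_pick, len(fingerprints)):
        candidate = max(
            (i for i in range(len(fingerprints)) if i not in selected),
            key=lambda i: min_dist[i],
        )
        selected.append(candidate)
        for i, fp in enumerate(fingerprints):
            d = 1.0 - tanimoto(fp, fingerprints[candidate])
            min_dist[i] = min(min_dist[i], d)
    return selected


# Hypothetical product fingerprints.
library = [
    {1, 2, 3, 4},  # compound 0
    {1, 2, 3, 5},  # compound 1, close analogue of 0
    {7, 8, 9},     # compound 2, different scaffold
    {1, 2, 7, 8},  # compound 3, hybrid
]
picked = maxmin_select(library, n_pick=2)
```

The same routine could be run on reactant fingerprints or on enumerated product fingerprints; the difference between those two choices is the subject of the comparison described above.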
Clustering in Massive Data Sets
Fionn Murtagh
School of Computer Science, The Queen's University of Belfast, Belfast BT7 1NN, Northern Ireland
We review the time and storage costs of search and clustering algorithms. We exemplify these, based on case-studies in astronomy, information retrieval, visual user interfaces, and other areas. Sections 2 to 6 relate to nearest neighbor searching, an elemental form of clustering, and a basis for clustering algorithms to follow. Sections 7 to 11 review a number of families of clustering algorithms. Sections 12 to 14 relate to visual or image representations of data sets, from which a number of interesting algorithmic developments arise.
References
Murtagh, F., Multidimensional Clustering Algorithms, Physica-Verlag, Wuerzburg, 1985.
Murtagh, F., "Foreword to the Special Issue on Clustering and Classification", The Computer Journal, 41, 517, 1998.
Murtagh, F., Starck, J.L. and Berry M., "Overcoming the curse of dimensionality in clustering by means of the wavelet transform", The Computer Journal, 2000, in press.
Starck, J.L., Murtagh, F. and Bijaoui, A., Image and Data Analysis: The Multiscale Approach, Cambridge University Press, New York, 1998.
Poincot, Ph., Lesteven, S. and Murtagh, F. "Maps of information spaces: assessments from astronomy", Journal of the American Society for Information Science, submitted, 1999.
Model-Based Data Compression: From Data Compression to Information Condensation
Holger Wallmeier
Aventis Research & Technologies, Industriepark Höchst, G 864, 65926 Frankfurt, Germany
Support of industrial research and development activities by computing and information technologies today is coupled to huge amounts of data. Therefore, data management is a crucial aspect of the successful application of information technologies. Various strategies are in use to handle the situation, each of which has its merits depending on the type of data, the context, and the usage.
Apart from the very straightforward approach of distributing data over appropriate storage media of sufficient volume, there are three different ‘philosophies’ of data compression:
1. Non-lossy data compression
2. Lossy data compression
3. Model-based data compression
Types 1 and 2 are probably the most widely used because they do not necessarily introduce a bias into the data compressed. There are a number of methods known today that are fully reversible, or at least reversible to a large extent.
This is different for model-based data compression. The idea is useful for data being produced by a dynamic, deterministic system. To the extent that the system can be modeled, the data produced can also be modeled. Two examples from industrial research are presented.
The first example is the representation of computer simulations of molecular ensembles by correlation functions. The second example is the representation of microbiological studies on pathogenicity by kinetic constants. In both cases, the underlying model together with methods to generate compressed data representations allow efficient interpretation of simulations or experiments, respectively.
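To make the idea of replacing data by model parameters concrete, the sketch below fits a first-order decay c(t) = c0 · exp(−k·t) to a time series and keeps only the two fitted constants. The first-order model, the synthetic data and the function names are assumptions for illustration; the abstract does not specify which kinetic models were used.

```python
import math


def fit_first_order(times, values):
    """Least-squares fit of ln(value) = ln(c0) - k * t; returns (c0, k)."""
    n = len(times)
    logs = [math.log(v) for v in values]
    t_mean = sum(times) / n
    y_mean = sum(logs) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(times, logs))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    return math.exp(intercept), -slope


def reconstruct(c0, k, times):
    """Regenerate the series from the stored model parameters."""
    return [c0 * math.exp(-k * t) for t in times]


times = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
# Synthetic, noise-free measurements generated from the assumed model.
measured = [100.0 * math.exp(-0.3 * t) for t in times]

c0, k = fit_first_order(times, measured)   # two numbers replace the series
restored = reconstruct(c0, k, times)
```

Unlike generic compression, the stored constants are themselves interpretable, but the reconstruction is only as good as the model: any behaviour the model cannot express is lost.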
High levels of data condensation provide a variety of opportunities to link results from research and development to auxiliary information of many different sources. Thus, powerful infrastructures for decision support can be created.
Structural Browsing Indices as SAR Analysis Tools
Yong-jin Xu and Mark Johnson
Pharmacia & Upjohn, 301 Henrietta Street, Kalamazoo, Michigan 49007-4940, United States of America
Structural browsing indices (SBIs) have been proposed as tools for organizing and exploring large, and possibly diverse, sets of chemical structures in a manner complementary to that addressed by substructure and molecular-similarity based methodologies. Molecular-equivalence indices (MEQIs) comprise a special subclass of SBIs that play a central role in constructing a suite of SBIs appropriate for a large variety of browsing, chemical-diversity, and SAR tasks. This presentation will discuss the issues surrounding the construction of SBIs in general and of MEQIs in particular. Some MEQIs likely to be of broad use will be presented along with some structural subsequences of a useful ordering of compounds we have developed. Some uses of SBIs and hierarchical orderings of compounds in the visual analysis of SAR will be discussed.
Computing Properties on Large Numbers of Molecules
S. Stanley Young, Chris Keefer, Joseph D. Simpkins
Glaxo Wellcome Inc., 5 Moore Drive, Research Triangle Park, NC 27709, United States of America
Combinatorial chemistry has greatly increased the numbers of compounds that may be synthesized for therapeutically important targets. Computational chemistry has been successfully used to select those compounds that should be made by choosing only compounds which are of some "medicinal" value, e.g., having the correct molecular properties such as logP, solubility, molecular weight, etc. Although straightforward to calculate, these molecular properties may be resource intensive when applied to large virtual libraries from which the synthesized compounds will be derived. In this case, distributed or coarse grain parallel processing of these molecular property calculations across a farm of processors can significantly increase the speed of their calculation.
The presentation will focus on a novel method for distributed processing of numerous molecular properties across a large network of NT machines. These molecular properties include, for example, cLogP, cLogS, molecular weight, bioavailability, and flexibility. The implementation of these property calculations in a distributed environment will be discussed as well as methods for selecting "medicinally important" compounds from large virtual libraries.
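Coarse-grain parallelism of this kind amounts to splitting the library into chunks and handing each chunk to an independent worker. The sketch below uses a thread pool on a single machine as a stand-in for a farm of networked machines and computes molecular weight from element counts. The chunking scheme, function names and sample molecules are assumptions; the authors' actual distribution method is not described in the abstract.

```python
from concurrent.futures import ThreadPoolExecutor

# Standard atomic masses for the few elements used in this example.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def molecular_weight(formula):
    """Molecular weight from a dict of element counts, e.g. {"H": 2, "O": 1}."""
    return sum(ATOMIC_MASS[element] * count for element, count in formula.items())


def chunked(items, size):
    """Yield successive slices of `items` with at most `size` entries."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_chunk(chunk):
    """Work unit sent to one worker: all properties for one chunk."""
    return [molecular_weight(formula) for formula in chunk]


def compute_all(formulas, chunk_size=2, workers=4):
    """Distribute chunks over workers and return results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(process_chunk, chunked(formulas, chunk_size))
        return [mw for chunk in results for mw in chunk]


# Hypothetical miniature "virtual library".
virtual_library = [
    {"H": 2, "O": 1},          # water
    {"C": 6, "H": 6},          # benzene
    {"C": 2, "H": 6, "O": 1},  # ethanol
]
weights = compute_all(virtual_library)
```

Because each molecule is independent, the chunk size only trades scheduling overhead against load balance; it does not change the results.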
Predictions of Pharmaceutically Important Properties from Simulations: Free Energies of Solvation, Partition Coefficients, Cell Permeability and Solubility
Erin M. Duffy * and William L. Jorgensen +
*Pfizer Inc., Central Research Division, Groton, CT 06340, United States of America
+Dept. of Chemistry, Yale University, 225 Prospect St., New Haven, CT 06520-8107, United States of America
Valuable descriptors for prediction of molecular properties are obtained automatically from routine Monte Carlo statistical mechanics simulations of organic molecules in water. The total solute-water electrostatic interaction energy, solvent accessible surface area, and indices of hydrogen-bond donation and acceptance are found to be particularly useful. With a data set of over 250 diverse organic molecules including 150 heterocycles and drugs, correlations for log P (octanol/water) with only 3-4 parameters are obtained that yield standard deviations, r-squared and q-squared of 0.5, 0.9 and 0.9. Fits for log P (water/gas), log P (octanol/gas) and log P (hexadecane/gas) with 80 compounds have also been obtained with 2-4 parameters and have standard deviations of 0.4 - 0.8. For aqueous solubility, log S, only four descriptors are needed to yield an r-squared and q-squared of 0.9 and rms error of 0.7 for 150 compounds including 100 drugs; an index of cohesive interactions in the solid-state has been identified. A smaller data set of Caco-2 cell permeabilities can also be fit well with 3-4 descriptors including the cohesive index.
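For readers unfamiliar with the two statistics quoted above, the sketch below computes r-squared for an ordinary least-squares fit and a leave-one-out cross-validated q-squared, using a single descriptor and invented data. Definitions of q-squared vary slightly between authors; this version, 1 − PRESS / SS_tot with the overall mean in SS_tot, is an assumption and may differ from the one used in the study.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one descriptor."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def total_sum_of_squares(ys):
    y_mean = sum(ys) / len(ys)
    return sum((y - y_mean) ** 2 for y in ys)


def r_squared(xs, ys):
    """Fraction of variance explained when fitting on all data."""
    slope, intercept = fit_line(xs, ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return 1.0 - ss_res / total_sum_of_squares(ys)


def q_squared(xs, ys):
    """Leave-one-out cross-validated analogue of r-squared."""
    press = 0.0
    for i in range(len(xs)):
        # Refit without point i, then predict it.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        press += (ys[i] - (slope * xs[i] + intercept)) ** 2
    return 1.0 - press / total_sum_of_squares(ys)


# Invented descriptor values and property values.
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
observed = [0.9, 2.1, 2.9, 4.2, 4.8, 6.1]
r2 = r_squared(descriptor, observed)
q2 = q_squared(descriptor, observed)
```

A q-squared close to r-squared, as reported above, suggests that the fit is not driven by a few influential compounds.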
The Monte Carlo simulations have been highly automated using the BOSS program such that one only has to provide a coordinate input file, e.g., mol2 or PDB, to a script. Twenty-five compounds can be processed per day on a dual-processor 500 MHz PC. Key advantages to the methodology are the small number of variables, the physical significance and interpretability of the variables, the inclusion of three-dimensional structural information and conformational sampling, and the high cross-validated predictive ability.
For more rapid application, a program QIKPROP has been created, which estimates the descriptors that have emerged as significant from the Monte Carlo simulations including the hydrogen-bond counts. The surface area components are also directly computed from an input 3D structure; however, no conformational averaging is performed. Molecular volume or SASA, the hydrophobic surface area, the counts of donated and accepted hydrogen bonds, an Onsager-like solvation term, and the cohesive index consistently emerge as the most significant descriptors. Little or no degradation in the quality of the correlations is found in comparison to the full simulation results. Property predictions can be made for 500,000 drug-like molecules per day with QIKPROP on a 500 MHz Pentium III or R10000 system. This is a powerful tool for screening and library design.
Quantum Cheminformatics – an Oxymoron?
Tim Clark
Computer-Chemie-Centrum, Friedrich-Alexander-Universität Erlangen-Nürnberg, Nägelsbachstraße 25, 91052 Erlangen, Germany
The traditional view of quantum mechanical calculations is that they are particularly CPU-time intensive and that they can only be applied to a relatively small number of compounds. However, advances in computer hard- and software have made the application of the more economical semiempirical molecular orbital methods for tens of thousands of molecules routine. In a benchmark calculation in 1998, for instance, we were able to optimise the geometries of 53,000 organic compounds from the Maybridge database in 14 hours elapsed time on a 128-processor Origin 2000 supercomputer. Today, cost-effective clusters of as few as eight Windows or Linux workstations can treat several hundred thousand molecules per month.
The data thus produced can be used for the prediction of biological activity using novel 3D-descriptors and techniques suited to high throughput virtual screening. The detailed molecular information, including electrostatic and polarisability properties, can, however, be used to predict physical properties for unknown compounds – the theme of the 1988 Beilstein Workshop. These properties include vapour pressure and boiling point, water-octanol partition coefficients, aqueous solubility, pKa’s, and melting points as well as spectroscopic properties such as 13C chemical shifts.
The descriptors used to derive such models are derived directly from the calculated wavefunctions of the molecules and often involve no other knowledge of the chemical structure. Simple feed-forward neural nets are used as interpolation devices and error estimates for individual compounds are provided.
A Widely Applicable Set of Descriptors
Paul Labute
Chemical Computing Group Inc., 1255 University Street, Suite 1600, Montreal, Quebec, Canada H3B 3X3
Introduction. The pioneering work of Hansch and Leo was an attempt to describe biological phenomena in a “language” consisting of a small set of physical molecular properties, in particular, logP (octanol/water), pKa, and molar refractivity. Early QSAR efforts centered on deriving linear regression relationships between such descriptors and biological activity. Subsequent efforts sought to increase the applicability of linear models by introducing more (and more) descriptors. This has led to an increased reliance on validation methods to identify spurious models (e.g., leave-one-out or k-fold cross-validation). Notwithstanding the use of complicated regression methods and descriptor selection procedures, the consistent production of effective QSAR models remains elusive. For this reason, we have chosen to return to the thinking of Hansch and Leo: by fixing a relatively small set of descriptors for use in many (hopefully all) situations we can, perhaps, a) reduce the problems of variable selection, and b) consistently produce meaningful QSAR models.
The idea of using a fixed collection of descriptors in QSAR is related to the definition of a “chemistry space” for use in molecular diversity studies. In such work, a compound is mapped to a k-dimensional vector which is used as a surrogate when comparing compounds. Validating a chemistry space can be difficult, especially when it is proposed for diversity analysis. Often, cluster analysis is used to justify a chemistry space: if compounds with similar biological behavior cluster together in the proposed space then it seems reasonable to conclude that the chemistry space is good. An alternative to cluster-based justification is QSAR-based justification: if a collection of descriptors can be used to construct reasonable models of many properties of interest, using many modeling techniques, it seems reasonable to conclude that the chemistry space induced by the descriptors is meaningful for diversity analysis. In the present work, we will adopt the QSAR-based justification of the chemistry space induced by the descriptors defined herein.
Methods. We will define 32 molecular descriptors, each of which is derived by summing the approximate exposed surface areas of atoms grouped according to classifications based upon logP (octanol/water), molar refractivity and partial charge.
The surface area of an atom in a molecule is the amount of surface area of that atom not contained in any other atom of the molecule. Consider a molecule of n atoms, each with van der Waals radius $R_i$, and let $B_i$ denote the set of all atoms bonded to atom i. We will neglect the effect of atoms not related by a bond and define the approximate van der Waals surface area (VSA) for atom i, denoted by $V_i$, to be
$$V_i = 4\pi R_i^2 - \pi R_i \sum_{j \in B_i} \frac{R_j^2 - (R_i - b_{ij})^2}{b_{ij}}$$
where $b_{ij}$ is the ideal bond length between atoms i and j. Thus, the VSA for each atom can be calculated from connection table information alone assuming a dictionary of van der Waals radii and ideal bond lengths.
Suppose that for each atom i in a molecule we are given a numeric property $P_i$. Our fundamental idea is to create a descriptor for a specific range [u,v) of the property values P; this descriptor will be the sum of the atomic VSA contributions of each atom i with $P_i$ in [u,v). More precisely, we define the quantity P_VSA(u,v) to be
$$\mathrm{P\_VSA}(u,v) = \sum_{i \,:\, u \le P_i < v} V_i$$
where $V_i$ is the atomic contribution of atom i to the VSA of the molecule (defined in the previous paragraph). We now define a set of n descriptors associated with the property P as follows:
$$\mathrm{P\_VSA}_k = \mathrm{P\_VSA}(a_{k-1}, a_k), \qquad k = 1, 2, \ldots, n$$
where $a_0 < a_k < a_n$ are interval boundaries such that $[a_0, a_n)$ bounds all values of $P_i$ in any molecule. Each VSA-type descriptor can be characterized as the amount of surface area with P in a certain range. If, for a given set of descriptors, the interval ranges span all values, then the sum of the descriptors will be the VSA of the molecule. Therefore, the VSA-type descriptors correspond to a subdivision of the molecular surface area.
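The definitions above can be sketched in a few lines. The code assumes the standard sphere-cap geometry for the per-atom area: each bonded neighbour j removes from the sphere of atom i a cap of area π·R_i·(R_j² − (R_i − b_ij)²)/b_ij. The clamping of the bond length, the radii, the bond length and the partial charges are hypothetical values chosen only to exercise the functions.

```python
import math


def atom_vsa(radii, bonds):
    """Approximate van der Waals surface area of each atom.

    radii: list of van der Waals radii R_i.
    bonds: dict {(i, j): b_ij}, each bond listed once.
    """
    neighbours = {i: [] for i in range(len(radii))}
    for (i, j), length in bonds.items():
        neighbours[i].append((j, length))
        neighbours[j].append((i, length))
    areas = []
    for i, r_i in enumerate(radii):
        area = 4.0 * math.pi * r_i ** 2  # isolated sphere
        for j, length in neighbours[i]:
            r_j = radii[j]
            # Assumption: clamp so the two spheres genuinely intersect.
            b = min(max(abs(r_i - r_j), length), r_i + r_j)
            # Remove the cap of sphere i that lies inside sphere j.
            area -= math.pi * r_i * (r_j ** 2 - (r_i - b) ** 2) / b
        areas.append(area)
    return areas


def p_vsa(areas, props, u, v):
    """Sum of atomic areas whose property value lies in [u, v)."""
    return sum(a for a, p in zip(areas, props) if u <= p < v)


def vsa_descriptors(areas, props, boundaries):
    """One descriptor per consecutive pair of interval boundaries."""
    return [p_vsa(areas, props, lo, hi)
            for lo, hi in zip(boundaries, boundaries[1:])]


# Hypothetical diatomic: radii and bond length in angstroms.
radii = [1.7, 1.5]
bonds = {(0, 1): 1.4}
charges = [-0.2, 0.2]  # invented partial charges as the property P

areas = atom_vsa(radii, bonds)
descriptors = vsa_descriptors(areas, charges, [-1.0, 0.0, 1.0])
```

Because the intervals cover every property value, the descriptors add up to the total approximate surface area, which is the subdivision property stated above.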
We used this methodology to construct the following 32 descriptors:
SlogP_VSAk (10) intended to capture hydrophobic and hydrophilic effects either in the receptor or on the way to the receptor. The Wildman & Crippen SlogP model was used to calculate atomic contributions to logP. Interval boundaries were assigned to produce equal frequencies of occurrence over the Maybridge database.
SMR_VSAk (8) intended to capture polarizability. The Wildman & Crippen SMR model was used to calculate atomic contributions to molar refractivity. Interval boundaries were assigned to produce equal frequencies of occurrence over the Maybridge database.
PEOE_VSAk (14) intended to capture direct electrostatic interactions. The Gasteiger & Marsili partial charge model was used to assign atomic partial charges.
Each of these descriptor sets is derived from, or related to, the Hansch and Leo descriptors with the expectation that they would be widely applicable. Taken together the VSA descriptors define, nominally, a 10+8+14=32 dimensional chemistry space.
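The equal-frequency interval boundaries used for the SlogP_VSA and SMR_VSA sets could be chosen along the following lines. This is a sketch under the assumption that "frequency of occurrence" counts atoms in a pooled list of atomic property values taken from a reference database; the function name is hypothetical, and the actual procedure may differ. It requires Python 3.9 or later for `math.nextafter`.

```python
import math


def equal_frequency_boundaries(values, n_bins):
    """Pick n_bins + 1 boundaries so each bin holds about the same number of values."""
    ordered = sorted(values)
    count = len(ordered)
    # Interior boundaries sit at the k/n_bins quantile positions.
    interior = [ordered[(k * count) // n_bins] for k in range(1, n_bins)]
    # The last interval [a_{n-1}, a_n) is half-open, so a_n must lie
    # just above the largest observed value.
    upper = math.nextafter(ordered[-1], math.inf)
    return [ordered[0]] + interior + [upper]


# Twelve illustrative property values split into three bins of four.
sample = [float(x) for x in range(12)]
print(equal_frequency_boundaries(sample, 3)[:3])  # [0.0, 4.0, 8.0]
```

Heavily repeated values can produce duplicate boundaries, so a real dataset may need the bins merged or the boundaries adjusted by hand.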
Results. The results of computational experiments with the descriptors defined above are summarized as follows.
The approximate VSA calculation is accurate to within 10% of a 3D high-density dot-based surface area calculation.
A correlation analysis over 2,000 molecules (randomly selected) revealed that the 32 VSA descriptors are weakly correlated with each other.
The 32 VSA descriptors encode many traditional descriptors (topological, atom counts, etc.) to a high degree (r-squared better than 0.9).
Reasonably good linear models of Boiling Point, Free Energy of Solvation, Vapor Pressure, Solubility in Water, Blood-Brain Barrier Permeability, activity against Thrombin, Trypsin and Factor Xa were constructed using only the 32 VSA descriptors.
Compound classification experiments revealed that the 32 VSA descriptors can distinguish among seven receptor classes.
We conclude that the new descriptors are likely to be a very good starting point for QSAR/QSPR work, and that the collection of new descriptors may be a meaningful low-dimensional chemistry space for chemical diversity, HTS data analysis and combinatorial library design.
Structure-based Methods for Virtual Screening of Drug Databases
Thomas Lengauer
Institute for Algorithms and Scientific Computing, GMD - German National Research Center for Information Technology, Sankt Augustin, Germany
and Department of Computer Science, University of Bonn, Germany
We present structure-based methods for virtual screening of drug databases.
We have developed tools that exploit knowledge of the 3D structure of the target protein's binding site, as well as tools that do not require this information. Our tools analyze molecular structures at different levels of description, achieving throughput rates ranging from just under one molecule per minute to about 50 molecules per second on a PC or workstation. Some of the tools have been adapted to screening combinatorial libraries.
Specifically, the tools are:
FlexX - docks a ligand molecule into the binding site of a protein in little more than a minute [1,2,3]. FlexX has a feature to place critical water molecules into the active site [4]. Detailed evaluations of FlexX are presented in [5,6]. A recent version of FlexX can handle combinatorial ligand libraries.
FlexE - a variant of FlexX that exploits structural information from multiple models of the binding site (in development).
FlexS - performs a comparison of two ligand molecules based on a structural superposition, in about 1.5 minutes. This program does not need structural information on the binding site [7,8]. A screening experiment with FlexS is described in [9].
Feature Trees - performs a fast comparison of two ligand molecules, in about 20 ms. The program does not need information on the binding site [10].
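To relate the per-comparison times above to the quoted throughput rates, the following sketch converts seconds per molecule into molecules per second and into total screening time. The library size is a hypothetical figure chosen only for illustration; the 20 ms and 1.5 minute timings are the approximate values given in the list.

```python
def throughput_per_second(seconds_per_molecule):
    """Molecules processed per second at a fixed cost per molecule."""
    return 1.0 / seconds_per_molecule


def screening_hours(n_molecules, seconds_per_molecule):
    """Serial wall-clock time, in hours, to process n_molecules."""
    return n_molecules * seconds_per_molecule / 3600.0


feature_trees_s = 0.020   # about 20 ms per comparison
flexs_s = 1.5 * 60        # about 1.5 minutes per superposition
library_size = 100_000    # hypothetical database size

print(throughput_per_second(feature_trees_s))           # about 50 per second
print(screening_hours(library_size, feature_trees_s))   # a little over half an hour
print(screening_hours(library_size, flexs_s))           # 2500.0 hours
```

The gap of several orders of magnitude is one reason a fast method might be used as a prefilter before a slower, more detailed one, although the text does not prescribe such a workflow.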
We discuss the methods employed by the tools and summarize our experiences with their use.
References:
[1] M. Rarey, S. Wefing, T. Lengauer, Placement of Medium-Sized Molecular Fragments into Active Sites of Proteins. Journal of Computer-Aided Molecular Design 10 (1996) 41-54.
[2] M. Rarey, B. Kramer, T. Lengauer, G. Klebe, A Fast Flexible Docking Method Using an Incremental Construction Algorithm. Journal of Molecular Biology 261,3 (1996) 470-489.
[3] M. Rarey, B. Kramer, T. Lengauer, Multiple automatic base selection: Protein-ligand docking based on incremental construction without manual intervention, Journal of Computer-Aided Molecular Design 11,4 (1997) 369-384.
[4] M. Rarey, B. Kramer, T. Lengauer, The Particle Concept: Placing Discrete Water Molecules During Protein-Ligand Docking Predictions, PROTEINS: Structure, Function, and Genetics 34 (1999) 17-28.
[5] B. Kramer, M. Rarey, T. Lengauer, CASP-2 experiences with docking flexible ligands using FlexX, PROTEINS: Structure, Function, and Genetics Suppl. 1 (1997) 221-225.
[6] B. Kramer, M. Rarey, T. Lengauer, Evaluation of the FlexX Incremental Construction Algorithm for Protein-Ligand Docking, PROTEINS: Structure, Function, and Genetics 37 (1999) 228-241.
[7] C. Lemmen, T. Lengauer, Time-Efficient Flexible Superposition of Medium-Sized Molecules, Journal of Computer-Aided Molecular Design 11,4 (1997) 357-368.
[8] C. Lemmen, T. Lengauer, G. Klebe, FlexS: A Method for Fast Flexible Ligand Superposition, Journal of Medicinal Chemistry 41,23 (1998) 4502-4520.
[9] C. Lemmen, T. Lengauer, Fragment-based Screening of Ligand Databases, Proceedings of EURO-QSAR'98 (1998), to appear.
[10] M. Rarey, J. S. Dixon, Feature Trees: A new molecular similarity measure based on tree matching, Journal of Computer-Aided Molecular Design 12 (1998) 471-490.
Mobile Electrons in Molecules: The Anisotropy of the Current-induced Density
Rainer Herges
Institut für Organische Chemie, Technische Universität Braunschweig, Hagenring 30, 38106 Braunschweig, Germany.
We have shown that the anisotropy of the induced current density (ACID) can be interpreted as the density of the delocalized electrons in molecules. The ACID scalar field, which can be plotted as an isosurface, is a powerful and generally applicable method for investigating and visualizing delocalization and conjugative effects, e.g. stereoelectronic effects in reactions, the anomeric effect, aromaticity, homoaromaticity etc.
Charting the Future of Information Visualizations: Methods, Metaphors and Mental Models
James A. Wise
Integral Visuals, Inc, 2620 Willowbrook, Richland, WA 99352, United States of America
Electronically mediated Information Visualization has completed its first decade of dizzying experimentation and growth. It is now poised to create a second New Age of Visualization for the 21st century that will rival the first, which ushered in the Renaissance. The preceding decade has seen information visualizations proceed from awkward first steps, which pastiched primitive graphics with specialized analyses, to highly interactive 3-D visualizations supported by a wide array of advanced statistical techniques. This period of technical advancement has also supported a maturation in thinking about how to construct and utilize visualizations so that they “meet the mind” of the user to co-create a mental workspace which extends perceptual and conceptual skills.
This paper examines the convergence of developments in analytical methods, visualization metaphors and cognitive science to illustrate what has worked, and why, and how certain accomplishments have fundamentally changed how information visualizations will develop from here on. These include the paradigmatic shift from information retrieval to information exploration, the evolution from sophisticated, dedicated analytical methods to computationally lean and adaptive ones, and the replacement of “black box” systems with those that put the user in the loop so as to enact, rather than receive, visualizations. Equally important, no visualization metaphor has emerged as dominant, and tenets of “ecological perception” will continue to hold in information spaces as well as in the real ones of the natural world.
The foundation has been laid not only to bring information visualizations into common, widespread use, but also to employ them to address the advancements in science and the survival issues that accompany any highly technological civilization. Continued evolution of information visualizations will not be an achievement of computer science alone, but one that utilizes and synthesizes results from many different academic disciplines.
Visualization and Integrated Data Mining of Disparate Information
Jeffrey D. Saffer
OmniViz, Inc., Richland, WA 99352, United States of America
The volumes and diversity of information in the discovery and development process within the chemical and life sciences industries require new approaches for analysis. Traditional list- or spreadsheet-based methods are easily overwhelmed by large amounts of data. Furthermore, generating strong hypotheses and, just as importantly, ruling out weak ones, requires integration across different experimental and informational sources.
To address these issues of data volume and integration, the following questions must be considered:
What types of data overviews are necessary?
How can these overviews be presented visually?
What attributes of the visualizations are necessary to enable the discovery process?
How can diverse analyses be integrated?
I will discuss how we have approached each of these major questions and how that has led to the creation of an integrated data visualization and mining software package.
The major classes of data overviews required for exploratory data analysis are (A) an overview of the underlying data, (B) an overview of how the data records are related to one another, and (C) an overview of the associations within the entire data set. In addition, for text, (D) an overview of the major themes within the data is required. For each type of visualization, a key attribute is the ability to handle large volumes of information, sufficient for current and future combinatorial chemistry and ultra-high-throughput screening methods. An example of a major class of visualization is shown below.
To enable the discovery process, each of these visualizations must provide ready access to the underlying information and appropriate analytical tools. With these, it becomes possible to explore prior hypotheses as well as the unexpected relationships often suggested by the structure of visualizations of complex data sets.
A key benefit of the visualization-based approaches is the natural extension toward integrated analysis of multiple data types. Many investigators preferentially scrutinize their own data to the exclusion of much other relevant data. This may not be from a lack of desire, but a lack of appropriate tools. I will show how we have implemented one approach for linking the diverse visualizations shown using examples from the chemical sciences.
Frontiers in Visualization for Chemistry and Biochemistry
Detlef Krömker
Johann Wolfgang Goethe-Universität, Fachbereich Informatik 20
and Fraunhofer Anwendungszentrum, Computergraphik in Chemie und Pharmazie, Varrentrappstraße 40-42, 60486 Frankfurt am Main, Germany
Advances in science have often been characterized by inventions that allow people to see old things in a new way. The telescope, microscope, and oscilloscope are obvious examples. Now such a “new” instrument is again coming into place: it is a common PC with modern graphics and visualization capabilities.
Recent developments in Computer Graphics show that visualization capability and graphics power, previously available only to very few people with high-end Computer Graphics systems, will be available on “every” desktop. Another trend shows a very high demand for good visualizations. To summarize the most important goal, it is “Using vision to think” [Card, Mackinlay, Shneiderman].
Some important questions show up: What are “good” visualizations? Can we measure this? Are there guidelines? What are the major differences between a visualization on a screen and on paper? What are good visualization environments? Which trends can be identified?
In particular, I will talk about some important principles for designing effective and efficient visualizations. Although much is known about the design of static presentations, computer visualization is a new medium involving interactive visual representations which tap human skills not only in perception but also in manipulation.
I will deal with these questions in a generic way, but most of the examples are taken from the field of chemistry and biochemistry.


