Chemical Data Analysis in the Large, May 22nd - 26th 2000, Bozen, Italy |
MODEL-BASED DATA COMPRESSION: FROM DATA COMPRESSION TO INFORMATION CONDENSATION
HOLGER WALLMEIER
Aventis Research & Technologies GmbH & Co KG, Core Technology Area Biomathematics, Industrial Park Hoechst, G 515 A, D-65926 Frankfurt am Main, Germany.
ABSTRACT
Support of industrial research and development activities by computing and information technologies today is coupled to huge amounts of data. Therefore, data management is a very crucial aspect of successful application of information technologies. Various strategies are used to handle the situation, each of which has its merits depending on the type of data, the context, and the usage.
Types 1 and 2 are probably the most widely used because they do not necessarily introduce a bias into the compressed data. There are a number of methods known today that are fully reversible, or at least reversible to a large extent. |
INTRODUCTION
Production of Data
The typical scenario of data generation starts from a device or automaton producing data (production), which will be recorded using some representation characteristic for the data and, of course, characteristic for the production itself. Based on this representation, data will be processed to extract the related information. In addition, the data may be transferred into a repository for later use. |
Figure 1. Data production and representation.
The advantage of the correspondence between original production and model production is that data scheme and data representation of the model can be used to generate a condensed representation of the original data. The information behind model data can usually be represented by few parameters.
Representation and Condensation of Data
Extracting and condensing information from data means creating a specific representation of the information. Basically, there are two different approaches to representing information. On the one hand, information can be mapped using predefined descriptor sets, thus creating specific profiles. On the other hand, information can be mapped in terms of relationships of the given object to known objects, which results, at least, in a delimiting view of the information. Genealogic aspects can be taken into account quite easily on a class and instance basis. |
AN EXAMPLE FROM MOLECULAR MODELING
Molecular modeling can be very useful to assess questions regarding, very generally speaking, stability and affinity of molecular systems. A very powerful, even though 'expensive' tool is the simulation of the dynamics of molecular systems. The underlying paradigm is based on perturbation theory. Simulations can be considered as computer experiments that allow the study of the response of a given system (molecular model) to some defined perturbation. The perturbation applied most frequently is just the kinetic energy of the N particles associated with a given temperature T according to [1]
![]() | (1) |
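The image of equation (1) is not available here. As a hedged reading, the relation between kinetic energy and temperature for N unconstrained particles in three dimensions is the standard equipartition result; that this is the exact form the author printed is an assumption, as are the symbols m_i, v_i, and k_B.

```latex
% Assumed form of Eq. (1): equipartition for N free particles in 3D
\left\langle E_{\mathrm{kin}} \right\rangle
  = \sum_{i=1}^{N} \frac{1}{2}\, m_i \left\langle v_i^{2} \right\rangle
  = \frac{3}{2}\, N k_B T
```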
Such simulations show the time evolution of the given system under the thermodynamic conditions specified and allow us to judge the stability of the given structural alignment, constitution, or conformation relative to some reference state. For this reason, simulation of molecular dynamics is a popular way of performing conformational searches, especially for large molecular systems. By extending the analysis to the various aspects of entropy, affinity can also be estimated, at least on a molecular level.
A Model for Dynamical Affinity of Molecular Systems
In practice, molecular dynamics simulations are performed by discretized integration of the respective equations of motion. [2] Since the characteristic frequencies of all relevant degrees of freedom must be resolved by the integration step-size, simulations of molecular dynamics are usually very lengthy and produce huge data sets (trajectories) if applied to large systems. |
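To make the step-size argument concrete, here is a minimal sketch of discretized integration for a single harmonic degree of freedom. It uses the velocity Verlet scheme, which is a common choice in molecular dynamics; the source does not say which integrator was used, and the function name, frequency, and step size are illustrative assumptions.

```python
import math

def velocity_verlet(z0, v0, omega, dt, n_steps):
    """Integrate z'' = -omega**2 * z and return the list of positions."""
    z, v = z0, v0
    acc = -omega**2 * z
    trajectory = [z]
    for _ in range(n_steps):
        z = z + v * dt + 0.5 * acc * dt**2      # position update
        new_acc = -omega**2 * z                 # force at the new position
        v = v + 0.5 * (acc + new_acc) * dt      # velocity update
        acc = new_acc
        trajectory.append(z)
    return trajectory

omega = 2.0 * math.pi   # hypothetical mode: one oscillation per time unit
dt = 0.01               # 100 integration steps per period
traj = velocity_verlet(1.0, 0.0, omega, dt, 100)
```

With 100 steps per period the oscillator returns very close to its starting position after one period. Every stored position is one record of the trajectory, which is why resolving the fastest motion of a large system produces so much data.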
Figure 2. Coupled oscillators.
For simplicity, we take a pair of one-dimensional, coupled, identical harmonic oscillators positioned on the z-axis. The corresponding Hamiltonian is given by |
![]() | (2) |
where m is the mass of each oscillator, K the force constant and a the coupling constant, which is a function of the distance between the equilibrium positions of the oscillators. After separation of variables one has |
![]() | (3) |
The first term represents the coherent motion of the center of gravity of the pair of oscillators and the second term the relative 'breathing' motion. Since both oscillators have a ground state frequency ω0, coupling results in a symmetric split of energy levels as shown in the following scheme (for a > 0; for a < 0 the order is reversed):
![]() |
The energy of an oscillator is |
![]() | (4) |
For small |a| one has |
![]() | (5) |
which is a typical second order, resonance-like effect. |
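The equation images are not available here, so the following is only a sketch of the standard normal-mode calculation. It assumes a bilinear coupling term a z_1 z_2; this coupling form, the normal coordinates z_±, and the momenta p_± are assumptions and may differ from the author's equations (2) to (5).

```latex
% Assumed Hamiltonian with bilinear coupling a z_1 z_2
\begin{align}
H &= \frac{p_1^2}{2m} + \frac{p_2^2}{2m}
     + \frac{K}{2}\left(z_1^2 + z_2^2\right) + a\, z_1 z_2 \\
% Normal coordinates: coherent (+) and breathing (-) motion
z_\pm &= \frac{z_1 \pm z_2}{\sqrt{2}}, \qquad
p_\pm = \frac{p_1 \pm p_2}{\sqrt{2}} \\
H &= \left[\frac{p_+^2}{2m} + \frac{K + a}{2}\, z_+^2\right]
   + \left[\frac{p_-^2}{2m} + \frac{K - a}{2}\, z_-^2\right] \\
\omega_\pm &= \sqrt{\frac{K \pm a}{m}}
            = \omega_0 \sqrt{1 \pm \frac{a}{K}}, \qquad
\omega_0 = \sqrt{\frac{K}{m}} \\
% Expansion for small |a|
\omega_\pm &\approx \omega_0\left(1 \pm \frac{a}{2K} - \frac{a^2}{8K^2}\right),
\qquad
\omega_+ + \omega_- \approx 2\,\omega_0 - \frac{\omega_0\, a^2}{4K^2}
\end{align}
```

Under these assumptions the sum of the two mode frequencies is lowered by a term of order a², whatever the sign of a, while the order of ω+ and ω- follows the sign of a.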
![]() | (6) |
The Helmholtz free energy is |
![]() | (7) |
and since |
![]() |
and |
![]() |
it is clear that |
![]() | (8) |
Therefore, in terms of thermodynamics, coupling of oscillators adds a contribution to the energy of the overall system, which is mainly an entropy effect. It should be noted that the difference of the energy levels is independent of the sign of the coupling constant, since it is proportional to a². The energetic order of ω+ and ω-, however, is a function of the sign of the coupling constant. By analogy to the analysis of the (electronic) dispersion interaction by London, [4] this contribution to the entropy of molecular complexes can be called mechanical dispersion or, because of its stabilizing effect, dynamical affinity.
Tracing Dynamical Affinity in Molecular Dynamics Simulations
In a molecular dynamics simulation dynamical affinity can be traced, mapping coherent and breathing motion by correlation functions. |
![]() |
| g | : quantity correlated, either positions (position correlation function) or velocities (velocity correlation function) |
| i, j | : centers of correlation |
| d | : correlation time |
| T-t0 | : time of measurement (simulation time) |
| i = j | : autocorrelation |
| i ≠ j | : cross-correlation |
| Gij(0) = 1 | : normalization |
For harmonic oscillators, one can define the autocorrelation functions for coherent and breathing motion |
![]() | (9) |
![]() | (10) |
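For a harmonic mode the normalized autocorrelation function is a cosine of the correlation time, so its curvature at zero correlation time gives the mode frequency. The sketch below illustrates this on synthetic data. The function names, the symmetric normalization, the two frequencies, and the construction of the coherent and breathing coordinates as (z1 ± z2)/√2 are assumptions for illustration, not the author's definitions in equations (9) and (10).

```python
import math

def autocorrelation(x, lag):
    """Normalized autocorrelation of a series at an integer lag, G(0) = 1."""
    n = len(x) - lag
    head, tail = x[:n], x[lag:]
    num = sum(a * b for a, b in zip(head, tail))
    norm = math.sqrt(sum(a * a for a in head) * sum(b * b for b in tail))
    return num / norm

def frequency_from_curvature(x, dt):
    """Angular frequency from the curvature of G at zero correlation time."""
    # G is even in the lag, so G''(0) ~ 2 * (G(dt) - G(0)) / dt**2 with G(0) = 1.
    curvature = 2.0 * (autocorrelation(x, 1) - 1.0) / dt**2
    return math.sqrt(-curvature)

# Synthetic trajectory of two coupled oscillators (hypothetical frequencies).
dt = 0.01
omega_plus, omega_minus = 2.0 * math.pi, 4.0 * math.pi
steps = range(10000)
mode_plus = [math.cos(omega_plus * k * dt) for k in steps]
mode_minus = [math.cos(omega_minus * k * dt) for k in steps]
z1 = [(p + m) / math.sqrt(2.0) for p, m in zip(mode_plus, mode_minus)]
z2 = [(p - m) / math.sqrt(2.0) for p, m in zip(mode_plus, mode_minus)]

# Coherent and breathing coordinates recovered from the two positions.
z_plus = [(a + b) / math.sqrt(2.0) for a, b in zip(z1, z2)]
z_minus = [(a - b) / math.sqrt(2.0) for a, b in zip(z1, z2)]
omega_plus_est = frequency_from_curvature(z_plus, dt)
omega_minus_est = frequency_from_curvature(z_minus, dt)
```

Twenty thousand stored positions are reduced here to two numbers, which is the kind of condensation the text describes.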
Now, it can be shown that the second derivative of these correlation functions is -ω² for zero correlation time (d = 0). This means that the whole simulation can be condensed to just two independent numbers, ω+ and ω-, and perhaps Δω = ω+ - ω-.
Streptavidin and Biotin
The example given below is a complex of two biomolecules, the protein Streptavidin and the vitamin Biotin. They form a specific complex with the largest binding constant known between biomolecules in nature. Therefore, this system is frequently used for immobilization of biomolecules. Surprisingly, experimental studies with molecules slightly different from Biotin show significant loss of stability and document the high specificity of the Biotin/Streptavidin complex. |
For example, 2-(4'-hydroxyphenylazo)-benzoic acid (HABA) also binds to streptavidin, but with a binding constant which is 9 (!) orders of magnitude lower. |
The thermochemical data measured for these complexes are given in Table 1. [5] |
| Table 1: Thermochemical data of Streptavidin complexes with Biotin and HABA. |
| Molecule | ΔG (kcal·mol⁻¹) | ΔH (kcal·mol⁻¹) | TΔS at 297 K (kcal·mol⁻¹) | K_binding (M⁻¹) | ΔG_Biotin / ΔG_HABA | TΔS_Biotin / TΔS_HABA |
| Biotin | -18.3 | -32.0 | -13.7 | 2.5 × 10^13 | 3.5 | -2.0 |
| HABA | -5.27 | 1.70 | 6.97 | 10^4 | 1 | 1 |
Apart from the remarkable values of the binding constants, it should be noted that the sign of the entropy contribution to the free binding energy changes going from Biotin to HABA. This is an indication of a change in the role of entropy. |
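The entries of Table 1 can be cross-checked with two textbook relations, ΔG = ΔH - TΔS and K = exp(-ΔG/RT). The sketch below is only a consistency check under these standard relations. The function names and the dictionary layout are illustrative, and the gas constant value is supplied here rather than taken from the source.

```python
import math

R_KCAL = 1.987204e-3   # gas constant in kcal mol^-1 K^-1 (not from the source)
T = 297.0              # temperature of the measurements in K

# dH and TdS in kcal/mol, copied from Table 1
table1 = {
    "Biotin": {"dH": -32.0, "TdS": -13.7},
    "HABA": {"dH": 1.70, "TdS": 6.97},
}

def free_energy(dH, TdS):
    """Gibbs free energy of binding from enthalpy and entropy terms."""
    return dH - TdS

def binding_constant(dG, temperature=T):
    """Equilibrium binding constant from the free energy of binding."""
    return math.exp(-dG / (R_KCAL * temperature))

dG = {name: free_energy(v["dH"], v["TdS"]) for name, v in table1.items()}
K = {name: binding_constant(value) for name, value in dG.items()}
log10_K = {name: math.log10(value) for name, value in K.items()}
```

The free energies reproduce the tabulated ΔG values, and the resulting binding constants have the same orders of magnitude as those listed in the table.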
| Table 2: Results of molecular dynamics simulations of Streptavidin complexes with Biotin and HABA. Starting from the crystal structures published, different orientations of the ligands have been studied. See text for further details. |
| System / binding mode | ω+ (GHz) | ω- (GHz) | Δω (GHz) | Type of coupling | Splitting (kcal·mol⁻¹) | TΔS at 297 K [6] (kcal·mol⁻¹) |
| 1STP [8] / 1 [9] | 5.4 | 14.6 | -9.1 | a<0 | -0.87 | |
| 1STP / 1 | 2.9 | 12.4 | -9.6 | a<0 | -0.91 | -13.70 |
| 1STP / 2 | 6.1 | 0.76 | 5.4 | a>0 | 0.51 | |
| 1STP / 3 | 13.5 | 6.9 | 6.6 | a>0 | 0.63 | |
| 1STP / 4 | 10.7 | 2.2 | 8.4 | a>0 | 0.80 | |
| 1HBA [8] / 1 | 8.7 | 4.7 | 4.0 | a>0 | 0.38 | 6.97 |
| 1HBA / 2 | 8.6 | 13.8 | -5.3 | a<0 | -0.50 | |
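The source does not state how the splitting column of Table 2 was obtained from the frequency difference. A plausible reading, offered here as an assumption, is the molar energy of one quantum, N_A·h·Δν, with Δν taken as an ordinary (not angular) frequency. The sketch below applies this conversion. The function name is illustrative and the constants are standard values, not taken from the source.

```python
PLANCK = 6.62607015e-34      # J s
AVOGADRO = 6.02214076e23     # 1/mol
J_PER_KCAL = 4184.0          # thermochemical calorie

def splitting_kcal_per_mol(delta_nu_hz):
    """Molar energy N_A * h * delta_nu in kcal/mol for a frequency difference in Hz."""
    return PLANCK * delta_nu_hz * AVOGADRO / J_PER_KCAL

# (frequency difference, tabulated splitting) pairs copied from Table 2
rows = [(-9.1, -0.87), (-9.6, -0.91), (5.4, 0.51), (6.6, 0.63),
        (8.4, 0.80), (4.0, 0.38), (-5.3, -0.50)]

# Reading the differences as THz reproduces the tabulated splittings.
as_thz = [splitting_kcal_per_mol(d * 1e12) for d, _ in rows]
# Reading them as GHz gives values a thousand times smaller.
as_ghz = [splitting_kcal_per_mol(d * 1e9) for d, _ in rows]
```

Under this assumed conversion the tabulated splittings match within rounding only if the frequency differences are read as THz. This suggests that the GHz label in the table may be a unit slip, although the source does not settle the point.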
For Biotin, four different orientations of the ligand in the binding pocket of Streptavidin have been simulated; for HABA, two (Table 3). Columns 2 and 3 of Table 2 show the frequencies of the coupled oscillator motions derived from the autocorrelation functions G+ and G-. For the crystal structure orientation of Biotin (1) the coherent motion has the lower frequency and the breathing motion is significantly faster.
| Table 3: Orientations of the ligands Biotin and HABA bound to Streptavidin. |
| Ligand orientation | Description | System |
| 1 | crystal structure | Biotin, HABA |
| 2 | upside down | Biotin, HABA |
| 3 | reversed | Biotin |
| 4 | upside down and reversed | Biotin |
The first row of Table 2 gives the results for Biotin from a simulation without water and counterions. The values do not differ very much from the results for the solvated system, which indicates the robustness of the method.
| | Experiment (297 K) | Simulation (MD, 300 K) |
| ![]() | -1.97 | -2.27 |
This underlines the role of oscillator coupling as an indicator of the stability of a given molecular alignment. At the same time it demonstrates the potential for data reduction offered by this approach.
AN EXAMPLE FROM BIOMETRY
Quite a different approach to model-based data compression is possible in the area of kinetic studies of bacterial pathogenicity. Such studies are very important in infectious disease research. In a very general view, the key issue is the interaction between pathogens and the hosts they infect. Besides the medicinal aspects of infection, pathogen-host interactions are the primary focus of target and lead compound search in the pharmaceutical industry. It is a complex phenomenon with several degrees of freedom.
Dynamics of Infectious Disease Progression
Progression of an infectious disease is, in a generalized sense, always the result of several types of growth processes, which are characteristic for different phases of disease progression. [10] If one wants to identify targets for anti-infective drugs, the early phases of disease progression are of special interest. |
Figure 3. Phase of infectious disease progression. See text for details.
The scenario described above can be summarized in terms of the categories pathogenicity, virulence, and susceptibility. Even though pathogenicity and virulence are often used synonymously in the literature, a distinction based on genomic and disease progression considerations is possible. Pathogenicity is first of all a property of a pathogen that manifests in the formation of pathogenicity factors such as toxins. [12] This, of course, depends on genotypic as well as phenotypic conditioning of the pathogen. To be specific, what matters is the type and amount of pathogenicity factors produced by the pathogen inside, or in contact with, the host. The amount of factors formed, however, also depends on the size of the pathogen population inside the host, which, in turn, depends on genotypic and phenotypic conditioning of the pathogen.
Figure 4. Pathogenicity, virulence, susceptibility, genotypic and phenotypic conditioning.
Due to host response, however, pathogen multiplication also depends on genotypic and phenotypic conditioning of the host. In principle, there are two degrees of freedom for the pathogen. These are, on the one hand, its ability to produce pathogenicity factors and, on the other hand, the size of the population of factor-producing pathogens inside the host.
Figure 5. Genetic degrees of freedom in infectious diseases.
Any research in the field of infectious diseases aimed at understanding the large variety of strategies pathogens have developed during evolution must analyze the kinetics related to the different phases. First of all, descriptors have to be identified that allow us to follow the individual processes by experimental measurements (see Table 4). |
| Table 4: Processes in disease progression |
| Phase | Type of process | Descriptors |
| Invasion | invasive pathogen count, [13] optical densities of culture media specific interactions | |
| Pathogen multiplication | pathogen count, [13] optical densities pathogen count, [13] disease marker concentration, antibody titer | |
| Toxin enrichment | toxin concentration, antibody titer, disease marker concentration pathogen count [13] | |
| Development of symptoms | disease marker concentration, antibody titer |
A key problem in handling living organisms is reproducibility. Usually, this is taken care of by running replicate experiments and forming averages. In addition, time-resolved measurements are necessary to analyze the associated kinetics. To do so, the following model assumptions are useful.
A Model for Infectious Disease Dynamics
Figure 6. Decrease of a host population after infection. The time of the population's half-life is indicated.
The normal way to measure pathogenicity starts from a set of N0 host organisms, which are infected. In the course of the experiment, decrease of the host population is measured. Typically, one obtains a sigmoid curve (Figure 6), which can be represented by the solutions of the following differential equation (DE). It is called the logistic, autocatalytic, or autokatakinetic differential equation [14] |
![]() | (11) |
describing growth processes with feedback. It is the equation of an exponential growth, which is modified by the second term in the square brackets. This second term depends on the population N at time t and constitutes the feedback. It can be agonistic (g<0), as well as antagonistic (g>0). The general form of the solution is |
![]() | (12) |
With the integration constant N0, the initial size of the population, plus the rate constant k, and the feedback constant g there are three independent parameters. The combination of N0>0 and a negative value of k describes the decrease of a population (Figure 6). |
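The images of equations (11) and (12) are not available here. The sketch below assumes the common form dN/dt = k·N·[1 - g·N] for the logistic equation, which matches the description of an exponential growth modified by a feedback term in square brackets. This form, its closed-form solution, and the parameter values are assumptions for illustration, not the author's printed equations.

```python
import math

def logistic_solution(t, n0, k, g):
    """Closed-form solution of the assumed equation dN/dt = k * N * (1 - g * N)."""
    growth = math.exp(k * t)
    return n0 * growth / (1.0 + g * n0 * (growth - 1.0))

def half_life(n0, k, g):
    """Time at which N(t) = n0 / 2; requires g * n0 < 1."""
    return math.log((1.0 - g * n0) / (2.0 - g * n0)) / k

# Hypothetical host population: N0 > 0 with a negative rate constant k.
n0, k, g = 100.0, -0.5, 0.0099
t_half = half_life(n0, k, g)
population_at_half_life = logistic_solution(t_half, n0, k, g)
```

With g·N0 close to 1 the population stays near its initial size for a while and then drops, which gives a sigmoid decrease of the kind described for Figure 6. The three numbers N0, k, and g then stand for the whole curve.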
Figure 7. Increase of a pathogen population after infection.
Applying this equation to infection experiments, one has, first of all, equation (11) for the decrease of the host population. According to the considerations outlined above, it is easy to imagine that the kinetic constant k in fact depends on the growing pathogen population P(t) and is thus a pseudo-constant. Therefore, |
![]() | (13) |
The growth of the pathogen population may either be unrestricted (free exponential growth) |
![]() | (14) |
or restricted, |
![]() | (15) |
reaching a saturation level due to host response. As mentioned above, P0 is small, and k is positive. |
![]() | (16) |
which, for example, results in the situation shown in Figure 8.
Figure 8. Decrease of host population coupled to growing pathogen population. Feedback overrides the effect of the pathogen population.
It is obvious that the effect of the growing (not constant) pathogen population can be seen as a deformation of the host population curve. The degree of deformation increases with h. There is, however, a further type of deformation of the host population curve. It comes from the feedback term and can be seen in Figures 9 and 10. This certainly reflects host conditioning. |
Figure 9. Decrease of host population coupled to growing pathogen population.
Figure 10. Decrease of host population coupled to growing pathogen population. A clear modulation of the host curve by the feedback term can be seen.
In experiments with time-resolved measurements, one usually has data reflecting the decrease of a host population that consists of test organisms such as insects, mice, rats, or nematodes [15] (Figure 11).
Figure 11. Time-resolved measurements of a C. elegans population infected by P. aeruginosa.
Traditionally, the simplest way to measure pathogenicity is to count the host population after some predefined time tscoring, which gives an ad hoc score as a percentage. |
![]() | (17) |
Unfortunately, this way of measuring pathogenicity depends very much on the choice of tscoring. Standardization is accomplished by normalization with the wild type of the pathogen: |
![]() | (18) |
Very often, time series are run until the host population has reached half its original size and |
![]() | (19) |
is taken as the measure of pathogenicity. This condenses the whole series of measurements to one single value. For a given host organism those pathogens with a low t½ are more pathogenic than those with a higher value.
Further Developments
In general, however, this is not sufficient to distinguish all possible effects that may modulate the interactions between a host and a pathogen. Coming back to the solutions of the logistic equation, the steepness of the population curve at t½ can tell a lot about pathogens, as well as hosts. [15] To improve the analysis, one has to fit solutions of the logistic equation (12) to the experimental data. This can be done, for example, using the method of Marquardt and Levenberg. [16] The set of parameters k, g, and possibly k, l, or even h allows the identification of those bacterial mutants that show extraordinary behavior. This allows scanning the genome for so-called pathogenicity and virulence genes. Furthermore, different mechanisms of infection can be distinguished.
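Before any curve fitting, the half-life t½ and the steepness of the host curve around it can be read directly from a time series. The sketch below uses linear interpolation between the two measurements that bracket half the initial population. The function name and the survival counts are hypothetical, and this simple interpolation is a stand-in, not the Marquardt-Levenberg fit mentioned in the text.

```python
def half_life_and_slope(times, counts):
    """Interpolated time at which counts fall to half of counts[0], and the local slope."""
    half = counts[0] / 2.0
    for i in range(1, len(counts)):
        if counts[i] <= half < counts[i - 1]:
            t_a, t_b = times[i - 1], times[i]
            n_a, n_b = counts[i - 1], counts[i]
            slope = (n_b - n_a) / (t_b - t_a)       # steepness near the half-life
            return t_a + (half - n_a) / slope, slope
    return None   # the population never dropped to half its initial size

# Hypothetical survival counts of a host population (time in hours).
times = [0.0, 12.0, 24.0, 36.0, 48.0, 60.0]
counts = [100.0, 96.0, 80.0, 40.0, 10.0, 2.0]
t_half, slope_at_half = half_life_and_slope(times, counts)
```

Two mutants with the same t½ but different slopes would be indistinguishable by the half-life alone, which is the motivation for fitting the full curve.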
SUMMARY
Whenever it is possible, model-based data compression serves two purposes. First of all it can be a great help to condense even huge data sets to very few numbers. Furthermore, the definition of the model necessary for compression is a very challenging step that often helps to gain deeper insights into the matter. It can reveal inconsistencies and facilitate the recognition of unknown phenomena. Together with the condensed data it offers possibilities to represent the information behind the data in a very efficient way. |
REFERENCES AND NOTES
[1] McQuarrie, D. A.; Statistical Mechanics, Harper & Row, New York, 1976.
Published in "Chemical Data Analysis in the Large: The Challenge of the Automation Age", Martin G. Hicks (Ed.), Proceedings of the Beilstein-Institut Workshop, May 22nd - 26th, 2000, Bozen, Italy http://www.beilstein-institut.de/bozen2000/proceedings/ |