Calculations and simulations: an invaluable resource

Timothy Clark

 

Beilstein Magazine 2016, 2, No. 6. doi:10.3762/bmag.6

published: 19 May 2016

What is the role of calculations, modeling and simulation in chemistry and biology? The answer to this question is time dependent and is changing ever faster as Moore’s Law continues to describe the increase in our computational power. Gordon Moore’s editorial in Electronics Magazine [1] was published the day before my 16th birthday. Contrary to what we like to believe, Moore was careful to limit his prognosis to the next decade:

The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years.


Here we are, fifty years later, and Moore’s Law still holds good, although each increase by a factor of two has typically required 18 months rather than a year. We all know Moore’s Law and cite it as we need it (if only to predict that it will all “soon come to an end”). Very few of us really appreciate the consequences of this inexorable increase in our ability to simulate chemical and biological systems. We are all familiar with speed comparisons; for instance, complete geometry optimizations that took months in the late 1970s are now done in 5-10 seconds, or could be done on a mobile phone. Such comparisons serve only to mask the real change that has taken place, above all in the past decade. It is not really important how long we wait for a simulation to finish, and we seldom, if ever, do simulations on mobile phones. The average time to complete a calculation has not changed all that much over the decades because we do better simulations. What has changed is how well we can simulate the system of interest and, above all, how close our simulation model comes to the experimental object. To put it another way, it is not important that my mobile phone has the compute power of a top-500 computer in 1996, because simulations that could have been done on a top-500 computer in 1996 are no longer relevant. We often emphasize the immense progress in hardware performance, rather than the key observation that we can do things with each new generation of machines that we simply could not do before.

What does this mean? There are two consequences that we need to consider: the accuracy of the simulations and their ability to provide information that is not available from experiments. Both, and particularly the latter, require a reassessment of the relationship between simulations and experimental science. Particularly in biomedical and biological research, experiments are necessarily limited in their ability to provide detailed information. In some areas, such as protein structure and function, our most valuable experimental techniques suffer from one or more serious limitations.

For instance, even though X-ray crystallography of proteins provides the foundation of our understanding of structural biology, X-ray structures offer at best a static picture (a snapshot) of very dynamic systems. In very few cases is the relevance of an X-ray crystal structure to the related biological situation established reliably. Strictly speaking, X-ray crystal structures describe the conformation and packing of the kinetically least soluble crystalline form of the protein in question. Increasingly, the crystallized protein itself has little strict relevance to the biological system: domains are removed to aid crystallization, or synthetic domains that promote crystallization are fused onto the protein regions of interest.

This analysis is not intended to deprecate the quality of research using X-ray crystallography, which remains one of the most sophisticated research specializations of all, but it does place its relevance for structural biology in perspective.

By now it should be clear that I intend to expound the virtues of simulation. That is nothing new; theoreticians and modelers have taunted experimentalists with the advantages of simulations since the mid-1970s. The early Beilstein Bozen Workshops provided a welcome opportunity to describe the capabilities of computational models and simulations in a broad-minded atmosphere that was seldom encountered elsewhere. Experimentalists were usually either skeptical or downright hostile – with good reason. Only in the last decade has it become possible to perform quantum mechanical (QM) calculations whose accuracy rivals experiment on molecules of several hundred atoms. Similarly, ten years ago molecular-dynamics (MD) simulations on biological systems were typically limited to tens of nanoseconds.

Neither the size of the molecules for which accurate QM structures and energies could be calculated nor the time scale of MD simulations was adequate for most systems. Accurate and reliable QM calculations are now possible for hundreds of atoms, and MD simulations of several microseconds are becoming standard. It is becoming quite well accepted that experiments that disagree with high-quality QM calculations for molecules of up to, say, one hundred atoms are likely to be in error. Examples abound in the literature. Such examples are likely to become less common because experiments designed to determine structures or energies are now usually accompanied by calculations at a level adequate to validate the experimental results.

What about MD simulations? The physical model (Hamiltonian) on which they are based determines the quality of the simulations. In current biophysical simulations, this model is almost exclusively a non-polarizable force field (i.e., one with fixed electrostatics that are not affected by the environment). Here, we need to consider the simulated system in order to judge the accuracy of the simulation. The best protein force fields are very good and unlikely to produce extreme deviations from reality. This is sadly not yet the case for nucleic-acid simulations. The approximations inherent in non-polarizable force fields are apparently adequate for the relatively non-polar proteins, but break down for the polyelectrolyte nucleic acids. This leads to deviations from the experimentally observed behavior, especially in long simulations. Nucleic-acid force fields will improve through the same type of gradual optimization process that protein force fields have undergone over the past three decades. For the current discussion, however, we will focus on proteins.
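To make the fixed-charge approximation concrete, the following minimal sketch (written in Python purely for illustration, and not taken from any production force field) evaluates the non-bonded energy of a set of atoms whose partial charges are assigned once and never respond to their surroundings; that frozen charge distribution is precisely what "non-polarizable" means.

    import numpy as np

    # Minimal sketch of the non-bonded part of a fixed-charge ("non-polarizable")
    # force field: Lennard-Jones plus Coulomb terms with partial charges that are
    # assigned once and never respond to the environment.

    COULOMB = 332.0636  # kcal mol^-1 Angstrom e^-2

    def nonbonded_energy(coords, charges, sigma, epsilon):
        """coords: (N, 3) in Angstrom; charges in e; sigma/epsilon: per-atom LJ parameters."""
        energy = 0.0
        n = len(charges)
        for i in range(n):
            for j in range(i + 1, n):
                r = np.linalg.norm(coords[i] - coords[j])
                sig = 0.5 * (sigma[i] + sigma[j])            # Lorentz-Berthelot rules
                eps = np.sqrt(epsilon[i] * epsilon[j])
                lj = 4.0 * eps * ((sig / r) ** 12 - (sig / r) ** 6)
                # The charges are fixed input parameters: this is the
                # "non-polarizable" approximation discussed above.
                coulomb = COULOMB * charges[i] * charges[j] / r
                energy += lj + coulomb
        return energy

Everything else that makes a real force field useful (bonded terms, cutoffs, periodic boundary conditions, Ewald summation) is omitted here; the point is simply that the charges are input parameters, not responses to the environment.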

The simulation time scale is of paramount importance. It is hardly surprising that the era of ten-nanosecond MD simulations had little impact on structural and mechanistic biology. Evolution has no reason to optimize processes to take place in nanoseconds, so biology takes place on a slower time scale. Simulations now routinely attain times of tens of microseconds, so that some, but not all, biological processes can be simulated directly. Slower processes can be investigated with enhanced-sampling or steered simulations, although these necessarily introduce an extra degree of approximation. However, a silent revolution has taken place. MD simulations are no longer inadequate models of real biological processes: The accuracy of current force fields and the simulation times now possible often make simulations 1:1 computational models of important biological proteins. We know this situation from, for instance, automobile crash simulations: There is no need to do the experiment. The situation in biophysics is not quite that of the automobile industry, but simulations have attained a new status in the past five years that has not yet become clear to most experimentalists or, for that matter, to those who perform simulations.

Figure (Tim Clark): Three distinct binding sites for the peptide hormone vasopressin in its vasopressin 2 receptor. The hormone is shown as atomic spheres, the receptor as a blue ribbon and the lipid bilayer in which the receptor is embedded as a grey/red/yellow surface. The three different binding sites are shown as three replicas of vasopressin: magenta for the vestibule site, green for the intermediate one and blue for the orthosteric binding site. The last of these three was known previously (but not fully characterized), whereas the first two were revealed for the first time by extensive molecular-dynamics simulations [3]. The existence of three different binding sites explains the different effects (agonist, antagonist, inverse agonist, partial agonist) of a variety of synthetic ligands. The picture summarizes simulations that required approximately 20 million CPU hours on the SuperMUC supercomputer in Munich [4] and that amount to terabytes of trajectory data.

What do simulations offer that is different to experiments? The first, and most important, difference is that simulations provide a different quantity and quality of information. Even if X-ray crystal structures were perfect representations of the biologically active system, they would only be single snapshots. MD simulations provide the dynamics of the system over microseconds in full atomistic detail. After decades of development, methods to determine free energies, for instance of binding, from simulations are now adequate to rival the best measurements. The figure, for instance, shows the results of tens of millions of CPU hours of simulation on SuperMUC. The simulations revealed three distinct binding sites, two more than the single orthosteric site known previously. The strength of the simulations is to reveal transient phenomena that cannot be found in static experiments. The drawback is that we never know whether we have sampled the system adequately: Are we looking at a rare conformation that happens to be able to survive for the length of our simulation? This ever-present danger in simulations requires constant validation by comparison with experiment. It is, however, important to compare like with like: The simulations must calculate exactly what is measured and vice versa. Simulations should never be compared with interpretations of experiments, but rather with the exact measured quantity.

The wealth of information provided by simulations also represents a new aspect of computational research. The typical situation is that simulations are carried out to answer a specific question. This means that the analysis of the trajectory (the atomic coordinates as a function of time) concentrates on answering the question posed. However, the trajectory often amounts to terabytes of information; far more than is needed to answer the original question. In other words, one trajectory can easily serve to answer a host of questions, posed either by the original investigators or others. Thus, much like the electron-density map that forms the basis of X-ray crystal structures, the simulation trajectory represents a valuable resource for further research. This is especially true as access to the most powerful hardware becomes more limited (the gap between the most powerful machines and common department or university computers is increasing rapidly).
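As an illustration of this kind of re-use, the short sketch below asks a deposited trajectory a question that the original study need not have posed, namely how compact the protein remains over the course of the simulation. The file names are placeholders, and the open MDAnalysis library is chosen simply as one example of freely available analysis software.

    import MDAnalysis as mda

    # Hypothetical re-analysis of a deposited trajectory; the file names are
    # placeholders for whatever topology/trajectory pair has been published.
    u = mda.Universe("receptor.pdb", "trajectory.xtc")
    protein = u.select_atoms("protein")

    # A question the original authors need never have asked: how compact does
    # the protein remain over the course of the simulation?
    for ts in u.trajectory[::100]:   # every 100th frame
        print(f"{ts.time:12.1f} ps   Rg = {protein.radius_of_gyration():6.2f} A")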

Similarly, it is now possible to perform semi-empirical molecular-orbital (MO) calculations on up to 100,000 atoms [2]. The wavefunction obtained in such a calculation contains all the information needed to calculate a multitude of physical or spectroscopic properties. In contrast to earlier MO programs, the interpretation of the results is now performed in a separate post-processing step based on the calculated wavefunction, which is stored in a file of several terabytes. The same is true of density-functional theory (DFT) calculations on several thousand atoms. Calculations of this magnitude are often only possible on large central compute clusters and are not accessible to all researchers. The raw information contained in the wavefunction or the electron density, however, is adequate to derive far more information than is usually reported by the original authors.
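As a schematic example of such a post-processing step (the array names below are assumptions, and how they are read from the stored wavefunction depends entirely on the program and file format), the molecular dipole moment can be reconstructed from nothing more than the one-electron density matrix, the dipole integrals and the nuclear charges and positions:

    import numpy as np

    # Schematic post-processing of a stored wavefunction: the dipole moment from
    # the one-electron density matrix P, the dipole-integral matrices (one per
    # Cartesian direction, in the same basis as P), and the nuclear charges Z
    # and positions R. These arrays are simply assumed to be available here.

    def dipole_moment(P, dipole_ints, Z, R):
        """Return the total dipole vector (electronic + nuclear) in atomic units."""
        electronic = np.array([-np.trace(P @ D) for D in dipole_ints])   # -Tr(P D)
        nuclear = (np.asarray(Z)[:, None] * np.asarray(R)).sum(axis=0)   # sum of Z_A R_A
        return electronic + nuclear

The same stored data would equally serve to derive atomic charges or molecular electrostatic potentials, none of which need have been reported in the original article.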

The situation described above differs significantly from that of a decade ago. There was little point in storing and making available the results of calculations that, ten years later, could be performed 100 times faster. Today’s massively parallel clusters, however, are producing results that will still require significant effort to reproduce in ten years’ time. More importantly, the information content of the calculations and simulations far exceeds what can be reported in a single publication. This makes the trajectories or wavefunctions a significant asset, possibly worth more than the original publication in which they are reported.

Therefore, not only is the article that reports the calculation or simulation worthy of publication and citation, but so are the raw results themselves. These deserve to be published under their own DOI and made available for subsequent analysis by authors other than those who performed the calculation. This requires standardized submission forms similar to those used when depositing X-ray structures and, above all, standard, universally readable data formats. The availability of such results not only allows more in-depth analyses of existing calculations, but also makes meta-studies possible in which, for instance, hundreds of published trajectories of MD simulations on G-protein coupled receptors could be analyzed for common features and differences.
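Purely as a hypothetical illustration of what such a standardized deposition might record (every field name below is invented; an agreed community standard would have to define the real schema), the metadata accompanying a published trajectory could look something like this:

    # Hypothetical metadata record for a deposited MD trajectory. All field
    # names are invented for illustration only.
    deposition = {
        "data_doi": "10.xxxx/placeholder",      # DOI assigned to the raw trajectory
        "system": "vasopressin 2 receptor embedded in a lipid bilayer",
        "force_field": "name and version of the published force field used",
        "md_engine": "simulation program and version",
        "ensemble": "e.g. NPT, 310 K, 1 bar",
        "total_time_us": 10.0,                  # aggregate simulated time in microseconds
        "trajectory_format": "an open, universally readable format",
        "article_doi": "DOI of the publication reporting the simulation",
    }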

This proposal entails a significant change in our publishing paradigm. We must recognize that large simulations or electronic-structure calculations contain far more information than will be published in the article in which they are reported. The results are haystacks from which usually only a single needle has been extracted and published. In the past, far shorter simulations were targeted at single features of the system being investigated. Now, the simulations are exquisitely detailed accounts of the movements of biologically relevant systems in their natural environments (such as biological membranes) over tens of microseconds.

We also need to abandon the notion that only the originator of the trajectory, electron density or wavefunction can make use of “his own” primary data. The results of cutting-edge simulations or calculations are a valuable, information-rich resource. Importantly, they are likely to remain so for at least a decade, simply because the computational facilities needed (often tens of thousands of cores) will not be available to most researchers for at least this long. Even if we assume that the price of computer hardware of a given performance is halved every year, which is not happening at the moment, the performance of a $100 million Top 500 machine of today will still cost $100,000 in ten years. For this reason, publishing the data itself should be credited highly, not just the analyses that result from it.
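The arithmetic behind this estimate is simply ten successive halvings of the price over the decade (under the deliberately generous assumption of one halving per year):

    \[
      \$100{,}000{,}000 \times \left(\tfrac{1}{2}\right)^{10}
      = \frac{\$100{,}000{,}000}{1024}
      \approx \$98{,}000 \approx \$100{,}000 .
    \]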

Currently, the originator of the data is regarded highly, while a researcher who analyzes them anew is seen as something akin to a parasite. This must change: a real paradigm change in science that will not only maximize the benefit derived from central supercomputers, but also open new research possibilities for those unable to access the top machines. After all, those who are able to carry out benchmark simulations are not necessarily those who understand their results best.

 

References

[1] Moore, G. E. Cramming more components onto integrated circuits. Electronics Magazine, 19 April 1965, pp. 114–117 (see http://www.cs.utexas.edu/~fussell/courses/cs352h/papers/moore.pdf, accessed 24 February 2016).

[2] Hennemann, M.; Clark, T. J. Mol. Model. 2014, 20, 2331.

[3] Saleh, N.; Haensele, E.; Banting, L.; Sopkova-de Oliveira Santos, J.; Whitley, D. C.; Saladino, G.; Gervasio, F. L.; Bureau, R.; Clark, T. Angew. Chem. Int. Ed. 2016, Early View. doi:10.1002/anie.201602729.

[4] SuperMUC, Leibniz Supercomputing Centre (LRZ), Munich: https://www.lrz.de/services/compute/supermuc/