Models of Convenience

Chemists generally depict molecules as Lewis structures, a concept that has carried over into the computer representations in use since the 1970s and 80s. These representations still form the basic records of most databases containing chemical information. Since the end of the 1980s, such databases have been used to construct predictive models with neural networks and other machine learning methods, as well as various group contribution methods based on some form of regression.

The scientific community is now being challenged by the hype surrounding Big Data and by new methods of machine learning and artificial intelligence (ML/AI), such as deep learning, that often run counter to the traditional validation concepts of the cheminformatics community. Chemistry is still entering the digital era, and as a discipline it is confronted with a problem not found in some of the more commonplace applications of ML/AI, for example in large e-commerce companies, in real-time analysis and simulation in the automobile industry, or in the data from large spectrometers and colliders such as the Large Hadron Collider (LHC).

The central element in chemistry, i.e. the basic identifier/model for a molecule, is no more than metadata: it records the apparent connectivity between atoms in a formalistic manner as single, double, and triple bonds. Furthermore, chemical data are not generally measured directly in real time, and the findings are often subject to human selection and interpretation based on a predetermined bonding model. The distinction between measurements and the experimentalists’ interpretation of those measurements is far less clear in chemistry than in many other branches of science and engineering. For instance, 13C chemical shifts were for years believed to be equivalent to net atomic charges, a concept without physical reality.
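To make concrete how little such a record contains, the following is a minimal sketch of a connection table in Python. The `Molecule` class and the `acetic_acid_heavy_atoms` helper are hypothetical names introduced only for illustration; they do not correspond to any particular database format or toolkit, and hydrogens are assumed to be implicit.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Molecule:
    """Hypothetical minimal connection table: element symbols plus formal bond orders."""

    atoms: list[str]
    # Maps an ordered atom-index pair (i < j) to a formal bond order of 1, 2, or 3.
    bonds: dict[tuple[int, int], int] = field(default_factory=dict)

    def add_bond(self, i: int, j: int, order: int) -> None:
        # Only the integer bond orders of the Lewis picture are accepted.
        if order not in (1, 2, 3):
            raise ValueError("Lewis-style bond order must be 1, 2, or 3")
        self.bonds[(min(i, j), max(i, j))] = order

    def valence(self, i: int) -> int:
        # Sum of formal bond orders to explicitly listed neighbors of atom i.
        return sum(order for pair, order in self.bonds.items() if i in pair)


def acetic_acid_heavy_atoms() -> Molecule:
    """Heavy-atom skeleton of acetic acid, CH3-C(=O)-OH (hydrogens implicit)."""
    mol = Molecule(atoms=["C", "C", "O", "O"])
    mol.add_bond(0, 1, 1)  # C-C single bond
    mol.add_bond(1, 2, 2)  # C=O double bond
    mol.add_bond(1, 3, 1)  # C-O single bond
    return mol
```

Nothing in this record describes geometry, conformation, or electron distribution, and a three-center bond such as the B-H-B bridge in diborane has no natural entry in a table of integer bond orders between atom pairs.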

Even worse, deviations from the model are often treated as sensational exceptions to established principles and given far more significance than they deserve. The bonding in boranes, for example, is exceptional only within the Lewis model.

Is this model sufficient for ML/AI applications or for large simulations in biology and materials science? Will its lifetime be extended thereby or curtailed? The reality of a molecule in solution or in vivo is three-dimensional and dynamic, and thus far more complicated than our simple models can accommodate; many interactions between molecules are non-equilibrium events. Nevertheless, 2D representations have been shown to give better results than 3D representations in many property prediction systems.

Why should this be the case? Are our models capturing the essential spatial information? A stylized 2D representation, developed so that chemists could exchange ideas by drawing structures on paper, effectively averages over dynamic conformations; how does this affect our ability to predict molecular properties? How close to reality do our models of molecules have to be to work effectively with contemporary and future ML/AI techniques? Are we underusing the immense power of modern hardware and software with our simple models? What can we do better using ML/AI, and what new directions in chemistry could data-driven discovery make possible? Have our predictions already reached a limit imposed by the accuracy and precision of the underlying data?