Machine learning in atomistic simulations
CECAM USI, Lugano, Switzerland
In recent decades atomistic simulations have become an important tool to complement and aid the interpretation of experimental results. Furthermore, it is now relatively routine to base simulation models on approximate versions of quantum mechanics that can give very accurate energetics and forces. However, when interpreting the results of such simulations, and when fitting more coarse-grained descriptions of the potential energy surface, we are often still reliant on chemical and physical theories obtained from the interpretation of experimental data. This can lead to a biased interpretation of simulations and to a tendency to get the right simulation results for the wrong reasons.
The power of modern computers makes it possible to develop algorithms that can "learn" from data and, in a manner of speaking, develop theories that describe it. These algorithms are beginning to be used in chemistry to understand and interpret simulation data where chemical understanding is perhaps lacking. In particular, these algorithms have been used to analyze trajectories, to develop enhanced sampling strategies and to fit coarse-grained models. In the proposed meeting we would therefore like to bring together researchers working at the cutting edge of machine learning with colleagues from the simulation communities in chemistry and physics to discuss how progress can be made in this exciting, interdisciplinary area. During the conference we would like to address three particular problems: the analysis of molecular dynamics trajectories, the use of machine learning in enhanced sampling calculations, and the use of machine learning to represent complex potential energy landscapes.
Analysis of Molecular Dynamics trajectories
Many of the molecules that appear in chemistry, biology and materials science are highly complex and involve many thousands of atoms. This raises the question of why molecules consistently behave in the same manner when they have such a large phase space to explore. The answer is that the constraints on the geometries of the molecules make most of phase space energetically inaccessible. These constraints force the system to adopt particular configurations preferentially and thus drive reactions such as the folding of proteins. Consequently, all the low energy configurations of such systems lie on a manifold that has a considerably lower dimensionality than the full 3N-dimensional phase space [1-5].
The fact that only a relatively small portion of phase space is energetically accessible is reassuring, but it does not necessarily make the understanding of complex processes, such as protein folding, any easier. After all, in any transition between two different configurations there is a collective change in the positions of many atoms. Hence, determining the vectors that span the low-energy portion of space using only chemical intuition is enormously difficult. Systematic approaches for examining the relationships between structures are therefore required, especially since extracting vectors that span this low-dimensional structure can shed light on reaction mechanisms.
It is now common to extract data from simulations using principal component analysis [1-4]. This is a rather primitive informatics tool that makes the questionable assumption that the low-dimensional part of phase space lies on a linear subspace of the full-dimensional space. Hence, a number of groups have sought better, non-linear approaches for extracting low-dimensional structures. In particular, Isomap [5,7-9] and diffusion maps [10-16] are showing promise. However, there may be other, superior algorithms in the literature on dimensionality reduction and manifold learning that address this problem better and that have simply not yet been identified, because that literature is large and, for non-specialists, difficult to penetrate.
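The limitation of the linear assumption can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the "trajectory" is synthetic (points on a circular arc, controlled by a single hidden angle, embedded in ten dimensions with a little noise), and the function name pca is our own. The data have one intrinsic degree of freedom, yet principal component analysis needs two linear components to account for the variance.

```python
import numpy as np

def pca(frames, n_components):
    """Principal component analysis of an (n_frames, n_features) array.

    Returns the projections onto the leading components, the components
    themselves, and the fraction of variance carried by every component.
    """
    centered = frames - frames.mean(axis=0)
    # Rows of vt are the principal axes; s holds the singular values.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    variance = s**2 / (len(frames) - 1)
    ratio = variance / variance.sum()
    projections = centered @ vt[:n_components].T
    return projections, vt[:n_components], ratio

rng = np.random.default_rng(0)

# Synthetic stand-in for a trajectory (an assumption, not real simulation data):
# one hidden angle moves the system along a curved, one-dimensional manifold.
theta = rng.uniform(0.0, np.pi, size=500)
arc = np.column_stack([np.cos(theta), np.sin(theta)])

# Embed the arc in 10 dimensions with an orthonormal map and add small noise.
embedding = np.linalg.qr(rng.normal(size=(10, 2)))[0]
frames = arc @ embedding.T + 0.01 * rng.normal(size=(500, 10))

projections, components, ratio = pca(frames, n_components=2)
print("variance fraction of first two components:", ratio[:2])
```

Because the manifold is curved, the first component alone leaves a substantial part of the variance unexplained, which is the kind of situation in which non-linear methods such as Isomap or diffusion maps are expected to help.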
Machine learning based enhanced sampling
For many coarse-grained systems one can obtain a thorough sampling of phase space from a single molecular dynamics simulation. However, for the most accurate atomistic simulation methods, obtaining a thorough sampling of phase space requires a heroic amount of computational time. Consequently, using unbiased molecular dynamics to examine the free energy landscapes of these systems is not feasible. There are therefore numerous methods for resolving the timescale problem of conventional molecular dynamics. Typically, these methods work either by using some form of enhanced sampling or by focusing on the transition from one local minimum to another. Other methods recognize that a small number of degrees of freedom (collective variables) accurately describe the interesting transitions, and so either raise the temperature of these degrees of freedom or introduce a bias to enhance the sampling. Clearly, for this second class of methods it would be enormously useful if we were able to extract collective variables automatically from trajectories using machine learning algorithms [19,20].
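The effect of a bias acting on a collective variable can be sketched on a grid, without running any dynamics. The double-well free energy, the 8 kT barrier, the static bias that cancels 75% of the landscape, and all names below are hypothetical choices made for illustration; they do not correspond to any particular enhanced sampling method.

```python
import numpy as np

kT = 1.0  # energies are measured in units of kT

# Hypothetical one-dimensional collective variable s with a double-well
# free energy: minima at s = -1 and s = +1, barrier of 8 kT at s = 0.
s = np.linspace(-2.0, 2.0, 401)
free_energy = 8.0 * (s**2 - 1.0) ** 2

def boltzmann(energy):
    """Normalized Boltzmann probabilities on the grid."""
    weights = np.exp(-(energy - energy.min()) / kT)
    return weights / weights.sum()

unbiased = boltzmann(free_energy)

# Assumed static bias that cancels 75% of the landscape along s.
bias = -0.75 * free_energy
biased = boltzmann(free_energy + bias)

# Reweighting: multiplying by exp(+bias / kT) removes the effect of the
# bias and recovers the unbiased distribution.
reweighted = biased * np.exp(bias / kT)
reweighted /= reweighted.sum()

barrier = np.abs(s) < 0.2
p_barrier_unbiased = unbiased[barrier].sum()
p_barrier_biased = biased[barrier].sum()
print("barrier population, unbiased:", p_barrier_unbiased)
print("barrier population, biased:  ", p_barrier_biased)
```

The bias raises the population of the barrier region by orders of magnitude, while reweighting returns the original distribution. The sketch only works because s is a good collective variable by construction, which is exactly what automated extraction from trajectories would have to provide.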
Representing Potential-Energy Surfaces for Complex Systems
An accurate description of the atomic interactions is the most basic requirement for obtaining reliable results in computer simulations of chemical processes. In principle, highly accurate ab initio methods are available, but even the most efficient approaches, such as density-functional theory, are computationally too demanding for extended simulations of systems containing thousands of atoms. A wide range of empirical potentials of varying form and complexity has been devised in recent decades to extend the length and time scales of atomistic simulations. Typically they are based on physically reasonable approximations and are thus able to capture the basic properties of the atomic interactions. Still, there are many systems that are hard or even impossible to describe with these potentials, and the accuracy is necessarily limited by the approximations employed.
In recent years, machine learning techniques, and in particular artificial neural networks (NN), have become an interesting alternative approach for constructing very accurate and efficient potentials by interpolating a set of reference energies from high-level electronic structure calculations [22,23]. NN potentials offer a number of advantages. Their very flexible functional form allows them to reproduce ab initio energies with very high accuracy. Further, their functional form is completely unbiased: no knowledge about the physical interactions in the systems of interest is needed, and no cumbersome trial-and-error construction of individual energy terms is required. Consequently, NN potentials are able to describe systems with very different types of interactions, such as small molecules, metals, semiconductors and liquids, on an equal footing. In contrast to many conventional force fields they are "reactive", i.e., they do not require the specification of atom types and bonding patterns. Finally, they are particularly suitable for complex bonding situations and coordination patterns, as found in phase transitions, processes at surfaces, or distorted molecules, e.g. transition states.
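The idea of interpolating reference energies with a flexible functional form can be sketched in a few lines. This is a deliberately simplified illustration under stated assumptions: a Morse curve for a dimer stands in for the ab initio reference data, the network has a single hidden layer with random, fixed hidden weights, and only the output layer is fitted by linear least squares. Actual NN potentials train all weights and use many-body descriptors of the atomic environments; all function and parameter names below are our own.

```python
import numpy as np

def morse_energy(r, depth=1.0, width=1.5, r_eq=1.2):
    """Stand-in for reference energies (an assumption, not ab initio data)."""
    return depth * (1.0 - np.exp(-width * (r - r_eq))) ** 2 - depth

rng = np.random.default_rng(1)
n_hidden = 30
# Random hidden-layer parameters, kept fixed (a simplification of training).
hidden_weights = rng.normal(scale=2.0, size=n_hidden)
hidden_biases = rng.normal(scale=2.0, size=n_hidden)

def design_matrix(r):
    """tanh hidden-layer activations plus a constant column for the output bias."""
    x = (r - 1.9) / 1.1  # map the distance range [0.8, 3.0] onto [-1, 1]
    hidden = np.tanh(np.outer(x, hidden_weights) + hidden_biases)
    return np.column_stack([hidden, np.ones_like(r)])

# "Reference calculations" on a coarse grid of interatomic distances.
r_train = np.linspace(0.8, 3.0, 60)
output_weights, *_ = np.linalg.lstsq(
    design_matrix(r_train), morse_energy(r_train), rcond=None
)

def predict(r):
    """Energy predicted by the fitted network at distances r."""
    return design_matrix(r) @ output_weights

# Check the interpolation at distances that were not in the training set.
r_test = np.linspace(0.82, 2.98, 173)
rmse = np.sqrt(np.mean((predict(r_test) - morse_energy(r_test)) ** 2))
print("interpolation RMSE (units of the well depth):", rmse)
```

Note that nothing in the fitted model encodes the Morse form; the same code would fit any smooth reference curve sampled on the same range, which is the sense in which the functional form is unbiased.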
Jörg Behler (Universität Göttingen) - Organiser
Gareth Tribello (Queen's University Belfast) - Organiser