Chasing CVs using Machine Learning: from methods development to biophysical applications
The dynamical behavior of molecular systems in chemistry, biophysics and materials science is typically governed by a small number of collective modes, also called collective variables (CVs) or reaction coordinates. The number of such modes is referred to as the intrinsic or effective dimension of the system.
Until a few years ago, CVs were identified using empirical approaches and chemical intuition. However, because CVs are many-body quantities, they are particularly challenging or even impossible to intuit for complex systems. Intense efforts have therefore been invested in automating the definition of CVs from molecular simulation data, using either supervised or unsupervised Machine Learning (ML) techniques [1-11]. These CVs are then used to gain new understanding of the system at hand, perform molecular design, and guide enhanced sampling [12,13]. They also provide a way to perform dimensionality reduction.
Although the details differ, most CV discovery techniques fall into two categories: those seeking high-variance CVs and those seeking slow CVs. High-variance CVs maximally preserve the configurational variance of the high-dimensional data upon projection into the low-dimensional space they span. Slow (i.e., maximally autocorrelated) CVs define a low-dimensional space that maximally preserves the long-time kinetics of the system. From a technical point of view, the methods used are either linear or nonlinear, and involve a trade-off: linear models offer more interpretable results and better-understood uncertainty quantification, whereas nonlinear models offer a more expressive approximation, typically leading to much smaller regression/classification errors.
The best-known high-variance CV estimation technique is principal component analysis (PCA). Related techniques, many of them nonlinear analogs of PCA, include kernel and nonlinear PCA, independent component analysis (ICA), multidimensional scaling, sketch map, locally linear embedding (LLE), Isomap, local tangent space alignment, semidefinite embedding/maximum variance unfolding, Laplacian and Hessian eigenmaps, diffusion maps, etc. (key references can be found in [12,13]). Specialized techniques for molecular simulations that integrate iterative high-variance CV discovery and accelerated sampling of configurational space have been developed in recent years [14-19].
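As an illustration of the high-variance idea, the following minimal sketch computes PCA through a singular value decomposition of centered data. The function name `pca` and the synthetic three-feature "trajectory" are assumptions made for this example and are not taken from any of the cited methods.

```python
import numpy as np

def pca(X, n_components):
    """Project the rows of X (frames) onto the top principal components."""
    Xc = X - X.mean(axis=0)                      # center each feature
    # Rows of Vt are the principal directions, sorted by decreasing variance
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = S[:n_components] ** 2 / (X.shape[0] - 1)
    return Xc @ components.T, components, explained_variance

rng = np.random.default_rng(0)
# Hypothetical toy data: 1000 frames, 3 features, with most of the
# variance along the direction (1, 1, 0) / sqrt(2), plus small noise
direction = np.array([1.0, 1.0, 0.0]) / np.sqrt(2.0)
latent = rng.normal(scale=3.0, size=(1000, 1))
X = latent * direction + rng.normal(scale=0.1, size=(1000, 3))

cv, components, var = pca(X, n_components=1)
```

Here `cv` holds the value of the single high-variance CV for every frame, and `components[0]` should be close to the planted direction, up to an overall sign.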
Most of the approaches proposed for the identification of slow CVs rely on the variational approach to conformational dynamics (VAC) or on the (extended) dynamic mode decomposition ((E)DMD). The inputs to these approaches can be the (Cartesian or internal) coordinates of the system itself, or average values of indicator functions as in Markov State Models. Nonlinearity can be brought into the model through the use of neural networks.
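A linear instance of the variational approach can be sketched as follows, under the assumption of reversible dynamics (which justifies the symmetrized covariance estimates): slow CVs are obtained from a generalized eigenproblem between the time-lagged and instantaneous covariance matrices. The function name `linear_vac`, the lag time and the toy two-feature process are hypothetical choices for this example.

```python
import numpy as np

def linear_vac(X, lag, n_components):
    """Slow linear CVs from the eigenproblem C_tau v = lambda C_0 v."""
    Xc = X - X.mean(axis=0)
    X0, Xt = Xc[:-lag], Xc[lag:]
    n = X0.shape[0]
    # Symmetrized estimators (assumption: reversible dynamics)
    C0 = (X0.T @ X0 + Xt.T @ Xt) / (2 * n)
    Ct = (X0.T @ Xt + Xt.T @ X0) / (2 * n)
    # Whiten with C0^{-1/2} to obtain an ordinary symmetric eigenproblem
    s, V = np.linalg.eigh(C0)
    W = V / np.sqrt(s)                   # equals V @ diag(s ** -0.5)
    evals, evecs = np.linalg.eigh(W.T @ Ct @ W)
    order = np.argsort(evals)[::-1][:n_components]
    return evals[order], W @ evecs[:, order]

rng = np.random.default_rng(1)
n_frames = 20000
# Feature 0: slow, small-amplitude AR(1) process (unit stationary variance)
slow = np.zeros(n_frames)
for t in range(1, n_frames):
    slow[t] = 0.99 * slow[t - 1] + np.sqrt(1 - 0.99 ** 2) * rng.normal()
# Feature 1: fast, large-amplitude white noise
fast = rng.normal(scale=5.0, size=n_frames)
X = np.column_stack([slow, fast])

evals, vecs = linear_vac(X, lag=10, n_components=1)
```

In this toy setting the fast feature carries most of the variance, so a high-variance method would select it, whereas the leading eigenvector in `vecs` should put almost all of its weight on the slow feature. The eigenvalue in `evals` estimates the autocorrelation of that CV at the chosen lag.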
The choice of the proper dimensionality of the CVs to be found is a delicate question, which still requires further research efforts. Some studies aim at finding or predicting the committor function between two metastable groups of conformations (such as [21-23]), which is the optimal one-dimensional CV for various purposes. For linear models such as PCA and its variations, there are clear mathematical guidelines for selecting an appropriate number of dimensions based on variance explained by the model. A similar framework for nonlinear models, based on neural networks for instance, is currently missing.
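For the linear case, one common guideline is to keep the smallest number of principal components whose cumulative explained variance exceeds a chosen threshold. The sketch below illustrates this; the 95% threshold, the function name `n_components_for_variance` and the synthetic data are assumptions for the example, not recommendations from the text.

```python
import numpy as np

def n_components_for_variance(X, threshold=0.95):
    """Smallest number of PCA components reaching the variance threshold."""
    Xc = X - X.mean(axis=0)
    S = np.linalg.svd(Xc, compute_uv=False)      # singular values, descending
    ratio = S ** 2 / np.sum(S ** 2)              # explained variance ratios
    cumulative = np.cumsum(ratio)
    # First index at which the cumulative ratio reaches the threshold
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(S)), ratio

rng = np.random.default_rng(2)
# Hypothetical data: 10 features driven by 2 latent coordinates plus noise
Q, _ = np.linalg.qr(rng.normal(size=(10, 2)))    # two orthonormal directions
latent = rng.normal(size=(2000, 2)) * np.array([3.0, 2.0])
X = latent @ Q.T + rng.normal(scale=0.1, size=(2000, 10))

k, ratio = n_components_for_variance(X, threshold=0.95)
```

With two planted latent coordinates and weak noise, `k` recovers the planted dimension. No analogous spectrum is directly available for a neural-network encoder, which is the gap noted above.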
Marc Bianciotto (Sanofi) - Organiser
Paraskevi Gkeka (Sanofi) - Organiser
Gabriel Stoltz (Ecole des Ponts) - Organiser
Carsten Hartmann (Brandenburgische Technische Universität Cottbus-Senftenberg) - Organiser
Francesco Luigi Gervasio (University College London) - Organiser