Expanding the Impact of Molecular Simulations by Integrating Machine Learning with Statistical Mechanics
Location: Grand Hotel Vesuvio, Sorrento (Italy)
Organisers
Since its first applications in the 1970s, molecular dynamics (MD) has become an invaluable tool for investigating complex biological phenomena, including protein folding, conformational changes, and ligand association and dissociation. MD simulations have advanced remarkably in recent years, owing to the exponential growth in computational power and to methodological improvements. It is now common practice to run microsecond-long MD simulations that generate massive amounts of data, often on the order of hundreds of gigabytes, which require careful processing [1]. Despite these advances, standard MD calculations still struggle to reach the timescales of many biological processes, which typically occur in the millisecond-to-second range. This limitation can be overcome by employing coarse-grained modeling [2] and enhanced sampling methods [3, 4], such as umbrella sampling and metadynamics. Enhanced sampling methods provide a detailed characterization of the free energy landscape associated with the process of interest, but their success depends on choosing appropriate reaction coordinates for the system, known as collective variables (CVs). These CVs, which are used to accelerate sampling, should capture the slowest degrees of freedom of the investigated event in order to describe the system's thermodynamic and kinetic properties correctly. Identifying appropriate CVs nevertheless remains a challenging task that typically demands considerable expertise and involves time-consuming trial-and-error procedures [5, 6]. At the same time, the rapid growth of molecular simulation data has highlighted the need for a simplified representation of the data manifold, and there has consequently been a surge of interest in algorithms capable of organizing and analyzing such data.
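As a rough illustration of how a free energy landscape relates to a CV, the sketch below estimates a one-dimensional profile from the relation F(s) = -kT ln P(s), where P(s) is the probability density of the CV value s. It assumes unbiased, equilibrium samples of the CV; reweighting of biased simulations such as umbrella sampling or metadynamics is not covered. The function name `free_energy_profile` and the synthetic double-well data are hypothetical and serve only as an example.

```python
import numpy as np

KB_KJ_PER_MOL_K = 0.0083144626  # Boltzmann constant in kJ/(mol K)


def free_energy_profile(cv_samples, temperature=300.0, bins=50):
    """Estimate F(s) = -kT ln P(s) from unbiased samples of a CV.

    Returns bin centers and free energies in kJ/mol, shifted so that
    the lowest populated bin has F = 0. Empty bins are reported as +inf.
    """
    density, edges = np.histogram(cv_samples, bins=bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    kT = KB_KJ_PER_MOL_K * temperature
    with np.errstate(divide="ignore"):  # log(0) for empty bins
        free_energy = -kT * np.log(density)
    free_energy -= free_energy[np.isfinite(free_energy)].min()
    return centers, free_energy


# Synthetic example (assumption): a CV sampled from two states near -1 and +1.
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(-1.0, 0.2, 5000),
                          rng.normal(1.0, 0.2, 5000)])
centers, profile = free_energy_profile(samples)
```

The two minima of `profile` sit near the two populated states, and the sparsely sampled region between them appears as a high (or infinite) barrier, which is the kind of region enhanced sampling is designed to reach.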
To address this need, numerous machine learning (ML) methods have been developed with the aims of defining CVs, solving dimensionality reduction problems, deploying advanced clustering schemes, and constructing thermodynamic and kinetic models [7]. These ML methods typically involve artificial or graph neural networks that take the initial dataset, comprising Cartesian coordinates and specific molecular features, and project it from a high-dimensional configuration space to a lower-dimensional space [8]. Depending on the availability and nature of the training set, ML algorithms can be broadly classified into three main groups: supervised learning, unsupervised learning, and reinforcement learning [9]. Supervised learning trains an ML algorithm on a dataset of input-output pairs so that it can predict the desired outputs for unseen inputs. Unsupervised learning, on the other hand, lets the ML algorithm extract useful information from the input values alone, with no outputs provided in the training set. In reinforcement learning, the ML model learns by continuously interacting with its environment through trial and error, without relying on any pre-existing data. While supervised and unsupervised learning have found wide application [10] in predicting molecular properties and defining CVs [11], respectively, the use of reinforcement learning in biomolecular simulations is still in its early stages. Considerable progress has been made in implementing ML algorithms in biomolecular simulations, with the aim of improving their performance in terms of both speed and accuracy [12]. However, the excitement surrounding these developments is tempered by the need to critically evaluate the accuracy, utility and applicability of the existing methods.
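The projection from a high-dimensional feature space to a low-dimensional one can be sketched with principal component analysis (PCA). PCA is a linear, unsupervised method used here as the simplest stand-in; it is not one of the neural-network approaches discussed above. The function name `pca_project` and the synthetic feature matrix are hypothetical, and in practice the rows would be simulation frames and the columns molecular features such as distances or dihedral angles.

```python
import numpy as np


def pca_project(features, n_components=2):
    """Project (n_frames, n_features) data onto its leading principal components.

    Returns the low-dimensional projection and the fraction of total
    variance explained by each retained component.
    """
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal axes, sorted by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    projection = centered @ vt[:n_components].T
    return projection, explained[:n_components]


# Synthetic example (assumption): 500 frames, 30 features, one dominant slow mode.
rng = np.random.default_rng(1)
latent = rng.normal(size=(500, 1))        # hidden slow coordinate
direction = rng.normal(size=(1, 30))      # how it maps onto the features
features = 5.0 * latent @ direction + 0.1 * rng.normal(size=(500, 30))
projection, explained = pca_project(features, n_components=2)
```

In this toy case the first component recovers the hidden coordinate up to sign and scale. Real biomolecular motions are often nonlinear, which is one motivation for the neural-network methods mentioned above.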
In this context, the present workshop serves as a timely opportunity, especially for young researchers, to engage in positive and productive brainstorming sessions focused on the new challenges posed by state-of-the-art theoretical studies that apply ML to biomolecular simulations. It aims to facilitate a critical assessment of the benefits and limitations of these approaches.
References
Vincenzo Maria D'Amore (University of Naples "Federico II") - Organiser
Marco De Vivo (Istituto Italiano di Tecnologia) - Organiser
Francesco Saverio Di Leva (University of Naples "Federico II") - Organiser
Switzerland
Vittorio Limongelli (Università della Svizzera italiana USI Lugano) - Organiser
United States
Gregory Voth (University of Chicago) - Organiser & speaker