Machine Learning Meets Statistical Mechanics: Success and Future Challenges in Biosimulations
CECAM-IT-SIMUL, Grand Hotel Vesuvio, Sorrento
Since its first applications in the 1970s, molecular dynamics (MD) has progressively become a valuable tool for investigating complex biological phenomena such as protein folding, conformational changes, and ligand association and dissociation. In recent years, the exponential increase in computational power, together with methodological improvements, has placed MD simulations in a leading position in biophysical studies. In particular, microsecond-long MD simulations are now routinely performed, producing hundreds of gigabytes of data on the investigated system that need to be properly processed. Nevertheless, standard MD calculations still rarely reach the timescale of biological processes, which typically occur in the millisecond-to-second range. This limitation can be overcome by employing coarse-grained modelling and enhanced sampling methods, or a combination of both, which allow the free energy landscape of the investigated process to be characterized. The success of enhanced sampling methods such as umbrella sampling and metadynamics, however, depends on the choice of the system’s reaction coordinates, namely the collective variables (CVs), which are used to accelerate the sampling and should capture the slowest degrees of freedom so as to provide an accurate description of the thermodynamic and kinetic properties of the system. Selecting the right CVs remains a challenging task, which generally requires both expertise on the part of the investigator and time-consuming trial-and-error procedures [5, 6].
In this scenario, the rapid growth of molecular simulation data has highlighted the need for a simplified representation of the available data manifold, leading to increased interest in algorithms capable of organizing and analysing such data. To this end, a number of machine learning (ML) methods have been developed to manage simulation data with the aim of: i) defining CVs; ii) solving dimensionality reduction problems; iii) deploying advanced clustering schemes; and iv) building thermodynamic and kinetic models. These tasks are generally achieved by feeding the initial data set (i.e. Cartesian coordinates and specific molecular features) into artificial or graph neural networks that are designed to project data from a high-dimensional to a low-dimensional configuration space. Based on the availability and nature of the training set, ML algorithms can essentially be classified into three main groups: i) supervised learning; ii) unsupervised learning; and iii) reinforcement learning. In supervised learning, an ML algorithm is trained on a data set of input-output pairs in order to predict the desired output for unseen input values. In unsupervised learning, the ML algorithm extracts useful information from the input values alone, with no specific output available in the training set. Finally, in reinforcement learning, no pre-collected data set is used to train the ML model, which instead learns by continuously interacting with its environment through a trial-and-error approach. Supervised and unsupervised learning have already found wide application [9, 10], for instance in the prediction of specific molecular properties and in the definition of CVs, respectively; the use of reinforcement learning in biomolecular simulations, on the other hand, is still in its infancy.
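As a rough illustration of projecting simulation data from a high-dimensional to a low-dimensional space, the sketch below uses principal component analysis (PCA), a linear method standing in for the neural-network approaches mentioned above. The feature matrix is synthetic, and the function name, array sizes, and the sinusoidal "slow mode" are assumptions made purely for illustration; they do not come from any specific study. Note also that PCA selects directions of largest variance, which need not coincide with the slowest degrees of freedom.

```python
import numpy as np


def pca_collective_variable(features, n_components=1):
    """Project per-frame features onto their leading principal components.

    features: array of shape (n_frames, n_features).
    Returns the projection, the principal directions, and the fraction
    of variance explained by each retained component.
    """
    centered = features - features.mean(axis=0)
    # SVD of the centered data; rows of vt are the principal directions.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    explained = singular_values**2 / np.sum(singular_values**2)
    projection = centered @ components.T
    return projection, components, explained[:n_components]


# Synthetic stand-in for MD-derived features (hypothetical data):
# one large-amplitude motion along a fixed direction plus small noise.
rng = np.random.default_rng(0)
n_frames, n_features = 500, 10
slow_mode = np.sin(np.linspace(0.0, 2.0 * np.pi, n_frames))
direction = np.zeros(n_features)
direction[0] = direction[1] = 1.0 / np.sqrt(2.0)
noise = 0.1 * rng.standard_normal((n_frames, n_features))
features = 5.0 * np.outer(slow_mode, direction) + noise

cv, components, explained = pca_collective_variable(features)
print("CV shape:", cv.shape)
print("variance explained by first component:", explained[0])
```

In this toy setting the first principal component tracks the imposed motion (up to an arbitrary sign), so the one-dimensional projection could serve as a candidate CV. For real trajectories, nonlinear or kinetics-aware methods are often preferred, since large-variance motions are not always the slow ones.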
Altogether, significant progress has been made in implementing ML algorithms in biomolecular simulations; however, the enthusiasm is tempered by open questions about the actual accuracy, utility and applicability of the methods developed so far. In this context, the present workshop is timely in giving participants, particularly young researchers, the opportunity to engage in a constructive and productive discussion of the new challenges posed by cutting-edge theoretical studies that apply ML to biomolecular simulations, with a critical evaluation of their benefits and limitations.
The main objectives of the workshop are:
- To describe innovative nonlinear dimensionality reduction methods for defining complex collective variables (CVs)
- To highlight both advantages and limitations of ML-based atomistic force fields
- To investigate the unexplored potential of machine learning algorithms for coarse-graining
- To predict ligand-protein and protein-protein interactions through ML-based methodologies
- To present machine learning models useful for extracting thermodynamic and kinetic information, such as free energies, transition rates, pathways, and time-correlation functions
The following topics will be specifically covered:
- CV definition
- Atomistic force fields
- Molecular binding
- ML-based kinetics models
The event will consist of five sessions spread over two and a half days. Five 20-minute slots will be reserved for oral contributions from non-invited speakers, to give younger scientists the opportunity to present their research.
Marco De Vivo (Istituto Italiano di Tecnologia) - Organiser
Francesco Saverio Di Leva (University of Naples Federico II) - Organiser
Vittorio Limongelli (Università della Svizzera italiana USI Lugano) - Organiser
Gregory Voth (University of Chicago) - Organiser