From Data to Dynamics: Machine Learning in Statistical Mechanics and Molecular Simulations
Location: CECAM-Lugano, Lugano, Switzerland
Since its first applications to biomolecules in the 1970s, molecular dynamics (MD) has become an indispensable computational microscope for studying complex biological systems at atomic resolution. It has enabled detailed investigations into protein folding, conformational dynamics, and ligand binding and unbinding. Over the past decade, increasing computational power has made microsecond-scale simulations routine, producing massive datasets that demand sophisticated analysis strategies [1]. Despite these advances, conventional MD simulations still face a fundamental limitation: many biologically relevant events occur over milliseconds to seconds, timescales largely inaccessible to standard MD.
To bridge this gap, researchers increasingly turn to enhanced sampling techniques, such as metadynamics and umbrella sampling [2, 3], and to coarse-grained (CG) modeling approaches [4]. These methods enable more comprehensive exploration of the system's free energy landscape, yet their success critically depends on the selection of appropriate reaction coordinates or collective variables (CVs). CVs must capture the slowest, most functionally relevant motions to accurately reflect thermodynamic and kinetic behavior. However, identifying suitable CVs remains one of the field's most challenging tasks, typically requiring domain expertise and iterative refinement [5, 6].
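As a rough illustration of what a free energy landscape along a CV means, the sketch below estimates F(s) = -kT ln P(s) from a histogram of sampled CV values. Everything in it is an assumption made for illustration: the one-dimensional double-well potential, the plain Metropolis sampler (which stands in for an MD engine and is not an enhanced sampling method), and the function names. Enhanced sampling techniques target the same quantity when direct sampling of this kind is too slow.

```python
import math
import random


def potential(x):
    """Toy double-well potential: minima at x = -1 and x = +1, barrier of height 1 at x = 0."""
    return (x * x - 1.0) ** 2


def metropolis(n_steps, kT, step, rng):
    """Sample the toy potential with plain Metropolis Monte Carlo, starting in the right-hand well."""
    x = 1.0
    samples = []
    for _ in range(n_steps):
        trial = x + rng.uniform(-step, step)
        dU = potential(trial) - potential(x)
        # Accept downhill moves always, uphill moves with Boltzmann probability.
        if dU <= 0 or rng.random() < math.exp(-dU / kT):
            x = trial
        samples.append(x)
    return samples


def free_energy_profile(samples, kT, lo, hi, n_bins):
    """Return (bin_center, F) pairs with F = -kT ln P, shifted so that the minimum is zero."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for s in samples:
        if lo <= s < hi:
            counts[min(int((s - lo) / width), n_bins - 1)] += 1
    total = sum(counts)
    profile = [
        (lo + (i + 0.5) * width, -kT * math.log(c / total))
        for i, c in enumerate(counts)
        if c > 0  # empty bins have undefined free energy and are skipped
    ]
    f_min = min(f for _, f in profile)
    return [(center, f - f_min) for center, f in profile]


rng = random.Random(42)
kT = 0.3
samples = metropolis(100_000, kT, 0.5, rng)
profile = free_energy_profile(samples, kT, -2.0, 2.0, 20)
for center, f in profile:
    print(f"{center:+.1f}  {f:.2f}")
```

Here the CV is simply the coordinate x. In a real biomolecular system, a poorly chosen CV would hide the barrier from such a profile, which is why CV selection matters so much.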
This complexity has fueled growing interest in machine learning (ML) techniques, which are now transforming how MD simulations are analyzed, interpreted, and even conducted. ML methods have been applied to automate CV discovery, perform dimensionality reduction, build thermodynamic and kinetic models, and enhance sampling efficiency [7]. These models often employ artificial neural networks or graph neural networks to map high-dimensional molecular configurations—such as Cartesian coordinates or molecular descriptors—into low-dimensional representations suitable for analysis [8].
Depending on the structure and type of data, ML algorithms can be broadly categorized into supervised, unsupervised, and reinforcement learning paradigms [9]. Supervised learning uses labeled input-output pairs to predict properties such as molecular energies or binding affinities [10], while unsupervised learning identifies latent features, such as CVs, directly from unlabeled data [11]. Reinforcement learning instead trains an agent to choose actions that maximize a cumulative reward.
A cornerstone of modern ML-driven simulation is the development of symmetry-aware molecular representations. The predictive power of ML models depends strongly on encoding physical symmetries, such as rotation and translation, directly into the model. E(3)-equivariant neural networks have emerged as powerful tools for this purpose, significantly improving data efficiency and generalization in learning potential energy surfaces [12]. Ongoing research continues to explore the optimal balance between enforcing strict symmetry and retaining model flexibility.
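The simplest form of symmetry awareness is invariance: a descriptor that does not change when the molecule is rotated or translated. The sketch below checks this numerically for interatomic distances. It is a minimal illustration under assumed names and random coordinates, not an equivariant network; equivariant models go further and ensure that vector outputs such as forces rotate together with the input.

```python
import math
import random


def pairwise_distances(coords):
    """All interatomic distances, a descriptor unchanged by rigid rotation and translation."""
    n = len(coords)
    return [math.dist(coords[i], coords[j]) for i in range(n) for j in range(i + 1, n)]


def rotate_z(coords, angle):
    """Rotate every point about the z axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in coords]


def translate(coords, shift):
    """Shift every point by the same displacement vector."""
    return [(x + shift[0], y + shift[1], z + shift[2]) for x, y, z in coords]


# Hypothetical five-atom configuration with random Cartesian coordinates.
rng = random.Random(0)
coords = [tuple(rng.uniform(-1.0, 1.0) for _ in range(3)) for _ in range(5)]

# Apply a rigid-body motion: rotation followed by translation.
moved = translate(rotate_z(coords, 0.7), (1.0, -2.0, 0.5))

d_original = pairwise_distances(coords)
d_moved = pairwise_distances(moved)
max_diff = max(abs(a - b) for a, b in zip(d_original, d_moved))

# The raw coordinates differ, but the distance descriptor agrees to rounding error.
print(coords[0] != moved[0], max_diff < 1e-9)
```

A model fed raw Cartesian coordinates would have to learn this invariance from data; building it into the representation is what improves data efficiency.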
Meanwhile, breakthroughs in structural prediction—most notably the advent of AlphaFold 3—have revolutionized how researchers obtain initial molecular configurations. AlphaFold now provides remarkably accurate models of not only proteins but also their complexes with nucleic acids, ions, and small-molecule ligands [13]. However, these are static snapshots. They cannot capture dynamic behaviors, allosteric transitions, or binding kinetics—areas where physics-based simulations remain indispensable. Initial benchmarks suggest that even state-of-the-art predictors still fall short in modeling protein dynamics and ranking ligand binding affinities, further emphasizing the role of MD [14].
To address the dimensionality and sampling bottlenecks, unsupervised ML approaches such as time-lagged autoencoders have reframed CV identification as a data-driven task. More recently, generative models—including diffusion models and variational autoencoders—have emerged as a new frontier. These models can learn the full conformational landscape of biomolecules and enable enhanced sampling, in some cases eliminating the need for predefined CVs altogether [15].
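The idea behind time-lagged methods can be illustrated with a much simpler linear stand-in: a feature that is still correlated with itself after a lag time is slow, and slow features are CV candidates. The sketch below is not a time-lagged autoencoder. It only compares the lagged autocorrelation of two synthetic features, and the AR(1) signals, lag value, and function names are all assumptions made for illustration.

```python
import random


def ar1(n, phi, rng):
    """Synthetic feature: x[t] = phi * x[t-1] + Gaussian noise. Larger phi means slower dynamics."""
    x = 0.0
    out = []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out


def lagged_autocorrelation(series, lag):
    """Normalized correlation between the mean-free series at time t and at time t + lag."""
    mean = sum(series) / len(series)
    centered = [v - mean for v in series]
    variance = sum(v * v for v in centered) / len(centered)
    n_pairs = len(centered) - lag
    covariance = sum(centered[t] * centered[t + lag] for t in range(n_pairs)) / n_pairs
    return covariance / variance


rng = random.Random(1)
features = {
    "slow": ar1(20_000, 0.99, rng),  # stands in for a slow conformational motion
    "fast": ar1(20_000, 0.50, rng),  # stands in for fast thermal fluctuation
}

lag = 10
scores = {name: lagged_autocorrelation(series, lag) for name, series in features.items()}

# Rank the features by how much memory they retain after the lag time.
best = max(scores, key=scores.get)
print(best, {name: round(score, 2) for name, score in scores.items()})
```

Time-lagged autoencoders replace this per-feature ranking with a nonlinear encoder trained to predict the configuration one lag time ahead from a low-dimensional bottleneck.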
Once accurate structural models and CVs are established, ML can significantly improve the estimation of thermodynamic and kinetic properties. In drug discovery, for instance, predicting protein–ligand binding affinity remains a central challenge. ML potentials trained on quantum mechanical data can be combined with enhanced sampling to yield highly accurate free energy landscapes and binding kinetics—results previously unattainable due to computational limitations [16]. However, challenges in data quality, model interpretability, and transferability remain critical areas of ongoing investigation [17].
Finally, ML is driving a renaissance in CG modeling. Deep neural networks can now learn many-body CG potentials directly from all-atom simulations, capturing emergent properties and enhancing transferability [18]. These models open the door to longer, larger-scale simulations with greater physical accuracy.
In this rapidly evolving context, it becomes imperative to critically assess both the promise and limitations of ML in biomolecular simulation. The excitement surrounding these developments must be tempered by careful validation and benchmarking. This workshop thus serves as a timely opportunity—especially for early-career researchers—to explore these cutting-edge methods, engage in constructive dialogue, and chart new directions in the application of machine learning to molecular dynamics and drug discovery.
Organisers
Italy
Vincenzo Maria D'Amore (University of Naples "Federico II") - Organiser
Marco De Vivo (Istituto Italiano di Tecnologia) - Organiser
Francesco Saverio Di Leva (University of Naples "Federico II") - Organiser
Switzerland
Daniele Angioletti (Università della Svizzera italiana (USI)) - Organiser
Vittorio Limongelli (Università della Svizzera italiana USI Lugano) - Organiser
United States
Gregory Voth (University of Chicago) - Organiser
