When data science meets molecular dynamics
Location: CECAM-HQ-EPFL, Lausanne, Switzerland
Organisers
Adam Hospital (IRB Barcelona) - Organiser
Modesto Orozco (IRB Barcelona) - Organiser
Erik Lindahl (Stockholm University) - Organiser
Since the pioneering work at CECAM in the seventies, Molecular Dynamics (MD) has evolved from a “proof of concept” technique used by a handful of physicists and theoretical chemists into a widely used method embedded in the core of many research areas, from chemistry and materials science to biology and biomedicine [1,2]. Improvements in force fields and the possibility of simulating realistic conditions have expanded the number of users, while advances in software and hardware have made it possible to run longer trajectories on larger systems [3-5]. The result is an exponential increase in the volume of trajectories generated and in the challenge of mining those trajectories to extract all the useful information they contain [6]. The standard paradigm in the field, inherited from the seventies, in which a single group runs a simulation and also performs a hypothesis-driven analysis, is no longer valid: current trajectories contain far more information than can be anticipated, especially when placed in the context of other simulations performed on similar systems.
The traditional way of managing trajectories, also a heritage of the seventies, is to store them on local computers while the paper is under review and to delete them a few months later, on the assumption that new trajectories could be generated in the future if required. This policy is unrealistic, as current state-of-the-art simulations requiring thousands of cores cannot be rerun without a very good reason. Furthermore, deleting trajectories hampers reproducibility, casts doubt on claims published in journals, makes trajectory reuse and meta-analyses covering hundreds of trajectories impossible, and removes any possibility of using MD as a source of data for training artificial intelligence models [7-10]. In addition, as the field increasingly adopts machine learning methods, extensive repositories of well-curated, high-quality simulations have large intrinsic value as training data.
Recent initiatives are under development to correct this situation, in which a vast amount of HPC resources is wasted and an entire field is kept in the past century. Several databases of MD simulations have been created, among them databases centered on representative protein folds [11-13], nucleic acids [14], membrane proteins [15,16], proteins relevant to the function of the central nervous system (CNS), SARS-CoV-2 simulations [17,18], small molecules [19], canonical B-DNA [20-22] and drug-protein complexes. The challenges the community faces in consolidating the transition to a new FAIR (Findable, Accessible, Interoperable, Reusable) paradigm for MD simulations include:
- How to guarantee the sustainability of these initiatives.
- How to guarantee the quality of the stored trajectories.
- How to define standards for simulations and for the stored data.
- How to define the metadata describing the simulations.
- How to define an analysis infrastructure that allows pan-analysis of macromolecular structures.
- How to obtain the commitment of key players (HPC centers, funding agencies, MD providers and MD users).
We will review the state of the art in database generation, focusing on biomacromolecular systems, discuss the new biology that is emerging from the analysis of these databases, and try to reach community consensus on how to address the challenges listed above.
References