CECAM Flagship Workshop Proposal: FAIR and TRUE Data Processing for Soft Matter Simulations
The emergence of big-data-driven techniques as a fundamental paradigm of science has forced a re-evaluation of the way that researchers manage, document, and share their data. As a result, a variety of domain-specific projects have developed innovative tools for ensuring FAIR--Findable, Accessible, Interoperable, Reusable--or TRUE--Transparent, Reproducible, Usable by others, Extensible--data management. For example, the Novel Materials Discovery (NOMAD) Laboratory is a user-driven platform for sharing and exploiting computational materials science data, with a focus on data from ab initio calculations. In practice, four main challenges, known as the four V's of big data, arise when developing data management procedures: volume (the amount of data), variety (the heterogeneity of form and meaning of data), velocity (the rate at which data may change or new data arrive), and veracity (the uncertainty of data quality). Compared with ab initio data, the data generated by soft matter simulations (e.g., atomistic molecular dynamics simulations and multiscale modeling techniques) pose a particular challenge, due primarily to issues associated with the first two V's, volume and variety.
To address reproducibility of soft matter simulations, Cummings, McCabe, and coworkers have developed the Molecular Simulation Design Framework (MoSDeF), an open-source Python software stack that enables facile use of multiple open-source molecular simulation engines, while at the same time ensuring maximum reproducibility [4,5]. This suite provides support for constructing topologies and configurations, implementing and saving force fields, and generating simulation input files for popular molecular simulation software. In this way, researchers can implement complex simulation workflows in a fully scriptable fashion that is maximally reproducible.
In the context of accessibility and data sharing, various communities have developed niche repositories and management tools. Recently, FAIRmat--a consortium of the German National Research Data Infrastructure (NFDI)--was formed to continue to raise awareness and acceptance of FAIR data practices. One of the primary tasks of FAIRmat is to extend the NOMAD infrastructure to a wide variety of materials science data, including data from soft matter simulations. Additionally, FAIRmat aims to assist the community in advancing metadata schemas and ontologies, enabling efficient exchange of FAIR research data and big-data analyses that aim to revolutionize the development of novel materials.
Proper interoperability requires some standardization of the simulation data and workflows. To tackle these challenges, it is essential that involved parties, including leading software developers outside the data management sphere, come together to set appropriate standards. Communication between individual projects will facilitate efficient development of tools and avoid duplication of work and "reinventing the wheel".
Martin Girard (Max-Planck-Institut für Polymerforschung) - Organiser
Joseph Rudzinski (Humboldt University) - Organiser
Clare McCabe (Vanderbilt University (USA)) - Organiser
Peter Cummings (Vanderbilt University) - Organiser