Machine-actionable data interoperability for the chemical sciences (MADICES 2)
Location: Zuse Institute Berlin (CECAM-DE-MMS)
Organisers
Recently established research data management (RDM) initiatives have largely succeeded in delivering tools that advance the digitalisation of the sciences, including a few open research data (ORD) repositories, several electronic lab notebooks (ELNs), workflow management systems (WFMSs), and many instrument automation platforms [1-4].
However, the RDM landscape remains fragmented, both across disciplines and across national borders. Attempts to address this fragmentation, at least in materials science and related disciplines, include:
- the OPTIMADE consortium, addressing the interoperability and accessibility of ORD repositories using a single application programming interface (API) [5];
- the ELN consortium, addressing the interoperability between different ELNs by defining a mechanism for the exchange of data files;
- the MaRDA working group on Metadata Extractors, developing a single API through which downstream tools can extract metadata and data using upstream file parsers;
- and several attempts at standardising the research data itself, either at the file format level (e.g. NeXus extensions by FAIRmat) or at the ontology level (e.g. the domain-specific BATT-INFO [6] or NFDI4cat ontologies).
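To illustrate what a "single API" means in practice for the OPTIMADE item above, the following minimal sketch builds a query URL for the `structures` endpoint using only the Python standard library. The base URL is a hypothetical placeholder, not a real provider, and the helper name `optimade_structures_url` is introduced here purely for illustration. The endpoint path, the `filter` and `page_limit` query parameters, and the filter syntax follow the OPTIMADE specification as we understand it.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit


def optimade_structures_url(base_url: str, filter_expr: str, page_limit: int = 10) -> str:
    """Build a URL for the OPTIMADE v1 `structures` endpoint (illustrative helper)."""
    # Percent-encode the filter so spaces and quotes survive transport.
    query = urlencode({"filter": filter_expr, "page_limit": page_limit}, quote_via=quote)
    return f"{base_url.rstrip('/')}/v1/structures?{query}"


# Hypothetical provider URL; any OPTIMADE-compliant database would be queried the same way.
BASE_URL = "https://example.org/optimade"
FILTER = 'elements HAS ALL "Si","O" AND nelements=2'

url = optimade_structures_url(BASE_URL, FILTER, page_limit=5)

# Round-trip check: decoding the query string recovers the original filter.
decoded = parse_qs(urlsplit(url).query)
print(url)
print(decoded["filter"][0])
```

Because every compliant repository exposes the same endpoint and filter grammar, only `BASE_URL` would need to change to send the same question to a different database.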
In this proposed workshop, we plan to focus on interoperability across the stages of such RDM pipelines. This includes both the integration of experimental and computational data and the interoperability of instrument automation platforms and WFMSs with ELNs. In particular, we wish to address the following challenges:
- Sample provenance in mixed workflows. WFMSs are tailored to track data and sample provenance for workflows that are entirely contained within the WFMS. Similarly, an ELN is usually tailored to let users enter protocols and link them with data manually. However, what happens when the sample history is a mixture of automated and manual steps? How do we avoid data duplication and yet retain the complete sample history in an ELN?
- Digital twins suitable for computation and experiment. A sample in a computational workflow is generally a well-defined, immutable, idealised system that can easily be reused in further computations. A sample in an experimental workflow, by contrast, depends strictly on its history, because many experimental techniques alter the sample state, and previous sample states are then no longer physically accessible. How does a WFMS know what kind of sample it’s handed from the ELN? How does the WFMS know to which descendant of the parent sample it’s meant to push the new results? How is this provenance graph consistently transferred between the different components of the RDM pipeline?
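One way to make the two challenges above concrete is to model each sample state as an immutable node in a provenance graph, where every manual (ELN) or automated (WFMS) step creates a new child state rather than modifying the old one. The sketch below is only an illustration under that assumption: the class names, fields, and the `"ELN"`/`"WFMS"` source labels are hypothetical and do not correspond to any existing platform or standard.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class SampleState:
    """One immutable state of a physical sample (hypothetical schema)."""
    state_id: str
    parent_id: Optional[str]  # None for the pristine sample
    step: str                 # the step that produced this state
    source: str               # "ELN" for manual steps, "WFMS" for automated ones


@dataclass
class ProvenanceGraph:
    """Stores sample states keyed by identifier and answers history queries."""
    states: Dict[str, SampleState] = field(default_factory=dict)

    def add_state(self, state: SampleState) -> None:
        # Each state is recorded once, which avoids duplicating history.
        if state.state_id in self.states:
            raise ValueError(f"state {state.state_id!r} already recorded")
        if state.parent_id is not None and state.parent_id not in self.states:
            raise KeyError(f"unknown parent {state.parent_id!r}")
        self.states[state.state_id] = state

    def history(self, state_id: str) -> List[SampleState]:
        """Return the chain of states from the pristine sample to `state_id`."""
        chain: List[SampleState] = []
        current: Optional[str] = state_id
        while current is not None:
            state = self.states[current]
            chain.append(state)
            current = state.parent_id
        return list(reversed(chain))

    def current_states(self, root_id: str) -> List[SampleState]:
        """Return descendants of `root_id` that have no children of their own."""
        parents = {s.parent_id for s in self.states.values()}
        return [
            s for s in self.states.values()
            if s.state_id not in parents and self.history(s.state_id)[0].state_id == root_id
        ]


graph = ProvenanceGraph()
graph.add_state(SampleState("s0", None, "synthesis", "ELN"))
graph.add_state(SampleState("s1", "s0", "annealing", "WFMS"))
graph.add_state(SampleState("s2", "s1", "electrochemical cycling", "ELN"))

steps = [s.step for s in graph.history("s2")]
latest = [s.state_id for s in graph.current_states("s0")]
print(steps)
print(latest)
```

In this toy model, a WFMS that receives the parent sample `s0` could call `current_states("s0")` to find the descendant that still exists physically, and attach new results there. How such a graph would actually be serialised and exchanged between an ELN and a WFMS is exactly the open question raised above.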
At the workshop, we will bring together WFMS and ELN developers, scientists, and data stewards to discuss the above issues and find cross-platform solutions. The goal of the workshop is to establish a working group focused on developing a platform-agnostic standard for such tool-to-tool interoperability.
The ultimate goal is to provide the infrastructure for future autonomous laboratories, in which ML/AI algorithms can seamlessly drive both experiments and simulations toward fully autonomous discovery and characterization. To that end, MADICES 2 will encourage the various platforms to support FAIR data sharing “by design”.
Following the virtual kick-off, pre-workshop discussions will take place in January and February and are open to all. Please see the MADICES-2024 GitHub discussions for information on how to join.
Matthew Evans (UCLouvain) - Organiser
Germany
Kevin Jablonka (Helmholtz Institute for Polymers in Energy Applications) - Organiser
Peter Kraus (Technische Universität Berlin) - Organiser
Switzerland
Edan Bainglass (Paul Scherrer Institut) - Organiser
Caterina Barillari (ETH Zurich) - Organiser