FAIR Data Management of Theoretical Spectroscopy and Green’s Function Methods
Location: CECAM-HQ-EPFL, Lausanne, Switzerland
Organisers
Big-data-driven methodologies have emerged as a fundamental paradigm of science, but they require an enormous amount of resources to achieve their promised impact. The FAIR (Findable, Accessible, Interoperable, and Reusable) data principles [1] ensure that scientific data can be shared and reused, providing an efficient route for accumulating data and taking advantage of these powerful techniques. FAIR data management allows essential knowledge to be systematically extracted from data, accelerating discoveries and innovations across various domains [2]. Furthermore, open science is essential for the verifiability and reproducibility of results and has been a topic of major discussion over the last decade. In materials science, data-driven methodologies, coupled with appropriate FAIR data management practices, are invaluable for the discovery of new materials because of the vast combinatorial space of chemical systems that emerges from the periodic table [3, 4]. Such methodologies have been successfully applied, e.g., to design and predict new materials with desired properties using ab initio ground-state simulations, i.e., data generated from Density Functional Theory (DFT) calculations [5]. However, there remains a critical gap in replicating this success in the context of other simulation frameworks.
Theoretical spectroscopy and Green’s function method simulations [6, 7], including data simulated using the GW approximation, Time-Dependent Density Functional Theory (TDDFT), the Bethe-Salpeter equation (BSE), Dynamical Mean-Field Theory (DMFT), and Korringa-Kohn-Rostoker (KKR), pose especially difficult challenges in the context of FAIR data management. These simulations not only require extensive computational resources and produce large datasets with complex associated workflows, but are also executed using a wide variety of public and in-house simulation software. At the same time, these methodologies are essential for understanding excited-state properties of complex materials; they are more accurate than DFT calculations and provide better comparisons with experimental results since they incorporate excited states and electronic correlation effects in a more consistent manner [8].
There have recently been a number of individual efforts to improve the accessibility of data produced by theoretical spectroscopy and Green’s function methods through the use of publicly accessible databases. For example, the Computational Materials Repository (CMR) [9] contains several individual databases, among which the Computational 2D Materials Database (C2DB) [10] contains GW and BSE data for a specific set of parameters and properties. The MaterialsCloud [11] database has some individual datasets published for these methodologies; however, there is no clear data structure for them. The NIST-JARVIS [12] database has a specific app for BeyondDFT simulations with DMFT data, but only for a specific simulation code. By making datasets findable, these efforts aim to avoid redundant computations and thus build upon existing work more efficiently. While these efforts represent an important step in the right direction, they fall short of fully achieving their goal due to a continued lack of consistency (i.e., interoperability) between individual databases. Moreover, these self-managed databases typically lack the ability to store the complete provenance of the simulated workflow, which is essential to ensure reproducibility.
Recently, FAIRmat [13], a consortium of the German National Research Data Infrastructure (NFDI) association, was formed to construct a scalable data infrastructure for materials science that can be easily customized for individual communities. This infrastructure consists of a primary software and repository called NOMAD [14], a free web service that enables the organization, analysis, sharing, and publishing of materials science data. One of the tasks within FAIRmat’s scope is to build support for theoretical spectroscopy and Green’s function simulations within NOMAD. Support for several of these methodologies has now been successfully built, and there already exist over 10 000 entries in the NOMAD repository containing GW [15], BSE [16], and DMFT [17] data, along with the full provenance of the corresponding complex workflows. The next step in developing a FAIR data infrastructure for these methods is to tackle the interoperability problem.
Interoperability within this domain is extremely challenging due to the heterogeneous character of theoretical spectroscopy and Green’s function simulations. Consequently, the adoption of common structures (e.g., for describing the Green’s function, the self-energy, or the dielectric function) is key to improving interoperability. To this end, various members of the community, including method developers, materials and data scientists, and data management experts, must come together to reach a consensus on specific common data structures.
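To make the idea of a common structure concrete, the following is a minimal Python sketch of what a code-agnostic container for a frequency-dependent quantity could look like. It is purely illustrative: the class names (FrequencyMesh, GreensFunctionQuantity), the field names, and the helper single_level_example are assumptions made for this example and do not correspond to the NOMAD schema or to any agreed community standard. The example fills the container with the non-interacting Green’s function of a single level, G(iω_n) = 1/(iω_n − ε), sampled on fermionic Matsubara frequencies ω_n = (2n + 1)π/β.

```python
import json
import math
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class FrequencyMesh:
    """Frequency points on which a quantity is sampled (hypothetical schema)."""
    kind: str            # e.g. "real" or "matsubara"
    points: List[float]  # frequency values
    unit: str = "eV"


@dataclass
class GreensFunctionQuantity:
    """Hypothetical code-agnostic container for a frequency-dependent quantity."""
    name: str            # e.g. "greens_function", "self_energy"
    method: str          # e.g. "GW", "DMFT", "non-interacting"
    mesh: FrequencyMesh
    real: List[float]    # real part, one value per mesh point
    imag: List[float]    # imaginary part, one value per mesh point
    provenance: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Every stored value must belong to exactly one mesh point.
        n = len(self.mesh.points)
        if len(self.real) != n or len(self.imag) != n:
            raise ValueError("real and imag must have the same length as the mesh")

    def to_json(self) -> str:
        """Serialize to a plain JSON string that any code could read."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "GreensFunctionQuantity":
        """Rebuild the container from the JSON produced by to_json."""
        raw = json.loads(text)
        raw["mesh"] = FrequencyMesh(**raw["mesh"])
        return cls(**raw)


def single_level_example(eps: float, beta: float, n_points: int) -> GreensFunctionQuantity:
    """Non-interacting G(i w_n) = 1 / (i w_n - eps) on fermionic Matsubara frequencies."""
    omegas = [(2 * n + 1) * math.pi / beta for n in range(n_points)]
    values = [1.0 / (1j * w - eps) for w in omegas]
    return GreensFunctionQuantity(
        name="greens_function",
        method="non-interacting",
        mesh=FrequencyMesh(kind="matsubara", points=omegas),
        real=[v.real for v in values],
        imag=[v.imag for v in values],
        provenance={"program": "illustrative-example"},  # placeholder metadata
    )


if __name__ == "__main__":
    g = single_level_example(eps=0.5, beta=10.0, n_points=4)
    restored = GreensFunctionQuantity.from_json(g.to_json())
    print(restored == g)
```

The design choice illustrated here is that the mesh, units, method, and provenance travel together with the numerical values, so that a record written by one code can be read and checked by another without knowledge of the original output format. A real schema would additionally need to cover orbital, spin, and momentum indices, which are left out of this sketch.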
References
Fabio Caruso (University of Kiel) - Organiser
Claudia Draxl (Humboldt-Universität zu Berlin) - Organiser
Jose M. Pizarro (Bundesanstalt für Materialforschung und -prüfung) - Organiser
Patrick Rinke (Technical University Munich) - Organiser
Joseph Rudzinski (Humboldt-Universität zu Berlin) - Organiser