Towards a Common Format for Computational Materials Science Data
- Raghunathan Ramakrishnan (Tata Institute of Fundamental Research, Centre for Interdisciplinary Sciences (TCIS), India)
- Matthias Scheffler (Fritz Haber Institute of the Max Planck Society (FHI), Berlin, Germany)
- Luca M. Ghiringhelli (Fritz Haber Institute of the Max Planck Society (FHI), Berlin, Germany)
- Christian Carbogno (Fritz Haber Institute of the Max Planck Society (FHI), Berlin, Germany)
- Micael Oliveira (Max Planck Institute for the Structure and Dynamics of Matter, Hamburg, Germany)
- Damien Caliste (Alternative Energies and Atomic Energy Commission (CEA), France)
- Georg Huhs (, Austria)
- Martin Lueders (Daresbury Laboratory, United Kingdom)
The development of modern commercial products, be it in health and environment, clean energy, heavy industry, or information and communication technology, depends strongly on the development and design of new and improved materials. However, identifying the best material, or designing a novel and improved one, for a specific application is a significant challenge. Of key importance are the characteristics of materials at the atomic and molecular levels, which determine their properties and behavior at the macroscopic scale. To aid and guide this search, computational materials science employs complex methods and computing algorithms (‘codes’) to investigate, characterize, and predict material properties. Fueled by the “Materials Genome Initiative for Global Competitiveness” [1], announced by President Obama in June 2011, these computational techniques are increasingly and successfully employed for the “high-throughput screening” of materials [2-4]. In conjunction with techniques from big-data analytics and machine learning, such an approach makes it possible to scan many thousands of compositions for the material with the best-suited properties, to predict trends, and to identify potentially (technologically) important candidates [5,6]. So far, however, the technologies and frameworks developed in this context have addressed only very specific aspects, e.g., by focusing on properties relevant to one particular application and/or by supporting only one or very few electronic structure codes.
In practice, this means that European computational materials scientists produce a huge amount of materials data on their local workstations, computer clusters, and supercomputers, using a variety of computer codes that are, for the most part, also developed by European research groups. Although extremely valuable, this information is mostly unavailable to the community, since most of the data are stored locally or even deleted right away. Even where the data are available, re-use and re-purposing are not straightforward, given that different codes often use very different file formats and conventions to store the same physical data. Enabling the sharing and comparison of such data is thus a pressing issue that needs to be addressed to advance this field, as exemplified by multiple European initiatives. For instance, the European Center of Excellence for Novel Materials Discovery (NOMAD-CoE) [7] aims at establishing a unified, code-independent data format to which the raw data calculated by different electronic structure codes can be converted, so that big-data analytics techniques can be exploited to obtain unprecedented insight from vast numbers of calculations. In a similar spirit, the Center of Excellence E-CAM, which was recently established by CECAM to build an e-infrastructure for software, training, and consultancy in simulation and modeling, is committed to actively supporting the development and adoption of software libraries and standards within the electronic structure community. One measure towards this goal is CECAM’s Electronic Structure Library (ESL) initiative [8], established in 2014.
The proposed workshop provides a unique platform to establish a common framework that supports several different electronic-structure and force-field codes and that is prepared to interface with the newly emerging field of data-driven materials discovery in the European research landscape. In this respect, a common purpose of the NOMAD-CoE and the CECAM-supported ESL is to integrate the computed results from leading electronic structure codes. Defining a common, code-independent representation for all relevant quantities (e.g., structure, energy, electronic wave functions, and trajectories of the atoms) is challenging, as the codes differ, for example, in their choice of basis sets and in their treatment of the core electrons (e.g., the use of pseudopotentials). To tackle these challenges from a technological point of view, we will build on the experience gained during previous community projects with a somewhat narrower focus but a similar philosophy. For instance, one of the most consistent and successful efforts was the development of the (NetCDF [9] based) ETSF file format [10] by the ETSF [11]. Similar standardization efforts are currently under way within the EUSpec [12] network. In this context, it is also planned to extend and modify the ETSF file format, in particular to provide greater flexibility for parallel I/O.
We thus believe that now is the right time to bring together the key players in electronic-structure and force-field code development in order to discuss and implement the aforementioned code-independent representation of materials science data. We propose to divide the workshop into two parts: a 2.5-day discussion of the file format specifications, followed by an 8.5-day coding effort for its implementation. For the discussion, we intend to invite representatives of as many community codes as possible, so that their requirements and needs can steer the discussions. In particular, we aim at having participants and speakers from diverse fields, including different types of electronic-structure theory codes (e.g., pseudopotential plane-wave codes and localized-basis codes), force-field and beyond-DFT codes, as well as high-throughput infrastructures. To stimulate the exchange of ideas, ample time will be reserved for round-table discussions.
The coding itself shall be carried out by a smaller team of experienced developers, with the goal of implementing the required I/O routines as a software library (API) in order to minimize the effort for developers to support this new format in their codes. Similarly, the very same easy-to-use API shall enable straightforward access to the stored data for post-processing, e.g., for the big-data analytics techniques described above.
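As a purely illustrative sketch of what such a shared I/O library could look like from a developer's perspective, the Python fragment below uses one record type both for writing results and for reading them back for post-processing. Everything in it is an assumption made for illustration: the names `CalculationRecord`, `from_code_output`, `write_record`, and `read_record`, the chosen fields, the unit convention (eV, angstrom), and the use of JSON as a stand-in for the actual (e.g., NetCDF-based) file format, which remains to be defined at the workshop.

```python
import json
from dataclasses import dataclass, asdict
from typing import List

# Assumed convention for this sketch: energies in eV, lengths in angstrom.
HARTREE_TO_EV = 27.211386245988  # CODATA 2018 value


@dataclass
class CalculationRecord:
    """Hypothetical code-independent container; field names are illustrative."""
    program_name: str
    species: List[str]
    positions_angstrom: List[List[float]]
    total_energy_ev: float


def from_code_output(program_name, species, positions_angstrom,
                     energy, energy_unit):
    """Convert a code-specific energy into the common unit convention."""
    factors = {"eV": 1.0, "Hartree": HARTREE_TO_EV}
    if energy_unit not in factors:
        raise ValueError(f"unsupported energy unit: {energy_unit!r}")
    return CalculationRecord(
        program_name=program_name,
        species=list(species),
        positions_angstrom=[list(map(float, p)) for p in positions_angstrom],
        total_energy_ev=float(energy) * factors[energy_unit],
    )


def write_record(record, path):
    """Serialize a record; JSON stands in for the real file format."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(asdict(record), handle, indent=2)


def read_record(path):
    """Read a record back through the same API used for writing."""
    with open(path, encoding="utf-8") as handle:
        return CalculationRecord(**json.load(handle))
```

In this sketch, a code developer would only call `from_code_output` and `write_record`, while an analysis script would only call `read_record`; neither side needs to know the conventions of the code that produced the data.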
[1] Materials Genome Initiative for Global Competitiveness (President Obama, June 2011): http://www.whitehouse.gov/sites/default/files/microsites/ostp/materials_genome_initiative-final.pdf
[2] B. C. Wood and N. Marzari, Phys. Rev. Lett. 103, 185901 (2009). DOI: 10.1103/PhysRevLett.103.185901
[3] S. Curtarolo, et al., Nat. Mat. 12, 191 (2013). DOI: 10.1038/nmat3568
[4] S. Kang, et al., Nano Lett. 14, 1016 (2014). DOI: 10.1021/nl404557w
[5] Y. Ritov, et al., Statistical Science 29, 619 (2014). DOI: 10.1214/14-sts483
[6] L. M. Ghiringhelli, et al., Phys. Rev. Lett. 114, 105503 (2015). DOI: 10.1103/PhysRevLett.114.105503
[7] European Center of Excellence for Novel Materials Discovery (NOMAD-CoE): http://nomad-coe.eu
[8] The Electronic Structure Library: http://esl.cecam.org
[9] R. K. Rew and G. P. Davis, IEEE Computer Graphics and Applications 10, 76 (1990). DOI: 10.1109/38.56302
[10] ETSF File Format Standardization Project: http://www.etsf.eu/fileformats
[11] European Theoretical Spectroscopy Facility: http://www.etsf.eu
[12] COST Action MP1306: http://euspec.eu/