Standardisation and databasing of ab-initio and classical simulations
- Simone Meloni (Sapienza University of Rome, Italy)
- Rodolphe Vuilleumier (Ecole Normale Supérieure, Paris, France)
- Teodoro Laino (IBM Research - Zurich, Switzerland)
- Ari Paavo Seitsonen (Ecole Normale Supérieure, France)
Computational science has become a crucial and integral part of science and engineering. Advanced algorithms and computer programs are tools of the trade in fields as diverse as condensed matter physics, mineralogy, molecular biology, astronomy, chemical catalysis, biochemistry, geophysics, semiconductor and nanotechnology, meteorology, climatology, aircraft design and materials science.
A common trait of some of the most exciting problems in these fields is that they can only be investigated with techniques that span different time and space scales and combine atomistic, molecular, meso, macro and continuum models. Such techniques represent a scientific challenge in themselves as, in most cases, theoretically sound interfaces between the different scales have yet to be developed. In addition to the purely scientific difficulties, there are also technical factors that slow down, and in some cases prevent, progress. The naturally interdisciplinary approach to multiscale problems often requires using different software, either in a workflow or concurrently. The lack of standardisation in the output data of different codes, however, prevents a straightforward combination of the software employed in the various steps.
Even when multiple scales do not enter the problem, the lack of standardisation jeopardises advancement. For example, in condensed matter physics or in chemistry, simulations, both classical and ab initio, produce huge amounts of data that can only be stored efficiently in binary format. Unfortunately, raw binary files are in general not portable from one computer architecture to another, for instance because of differences in byte order. As a result, it is in practice impossible for a research group to create long-term archives even of simulations produced with a single code, and results obtained from time-consuming, often expensive, calculations can only be accessed for relatively short periods.
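The portability problem can be illustrated with a minimal Python sketch. It is only an illustration of one cause, byte order, and is not taken from any particular simulation code; the value and variable names are made up for the example.

```python
import struct

value = 1.5  # e.g. one atomic coordinate (illustrative value)

# The same double-precision number written with the two possible byte orders.
little = struct.pack("<d", value)  # little-endian layout
big = struct.pack(">d", value)     # big-endian layout

# Reading big-endian bytes as if they were little-endian silently
# yields a different number instead of raising an error.
misread = struct.unpack("<d", big)[0]

# A format that declares its byte order can be read back on any machine.
roundtrip = struct.unpack(">d", big)[0]
```

A standard binary format would fix such conventions once, so that files remain readable regardless of the machine that wrote them.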
The lack of standardisation also affects the scientific community in other ways: it complicates the preservation and dissemination of valuable scientific knowledge. If a standard were available, it would be possible to archive simulations in a database for benchmarking different algorithms and for performing further analysis. Within the framework of well-regulated access to the archive, it would also be possible to make the data available to a broad community within any given field, and even to use it as a basis for training young computational scientists.
Several actions toward standardisation have been taken in the last few years by communities related to CECAM. CECAM itself has hosted workshops on this topic, for example on creating standard data and file formats for the pseudopotentials used in ab initio (DFT) calculations. In spite of these efforts, the problem outlined above has essentially not been solved yet, mainly because implementing standards in scientific codes requires human resources that are unavailable to the scientific community in general and to individual research groups in particular. In the absence of a technical workforce with specific competence in information technology (IT) for the implementation of standards and for the creation of efficient and user-friendly simulation archives, even the few successful initiatives that exist have been quite limited in scope.
The goal of this workshop is twofold. We aim (1) at defining the most suitable standard for storing the output of simulations from some of the most commonly used classical and ab initio codes in condensed phase physics and chemistry, and (2) at defining the structure and the characteristics of a simulation database for archiving such information in an efficient and easily accessible way.
With respect to the first point, we propose to promote and follow a new approach that is emerging. After a common data format is defined, the developers of the different codes will not be asked to implement the standard directly in their programs. Rather, a series of converters from native data/file formats to the standard one, and vice versa, will be created and made available on the archive. The positive consequences of this protocol are that: (1) it is considerably less intrusive in scientific codes and does not require dedicated human resources within research groups; (2) it makes it possible to involve IT specialists in the development of the conversion tools. The second point is very important, as one of the major difficulties in finding human resources is that IT specialists are usually not able to work in a context as complex as that of the codes produced by the scientific community, whereas they can easily handle the large data files that these codes produce.
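The converter approach can be sketched as a pair of small functions that sit outside the simulation code. The sketch below is only an illustration under stated assumptions: it uses the widely used XYZ text format as the "native" format, and the JSON layout with its "schema", "comment" and "atoms" fields is a hypothetical stand-in for whatever common format the workshop might define.

```python
import json


def xyz_to_standard(text):
    """Convert XYZ-format text into a record in a hypothetical common schema."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])          # first line: number of atoms
    comment = lines[1]               # second line: free-form comment
    atoms = []
    for line in lines[2:2 + n_atoms]:
        symbol, x, y, z = line.split()[:4]
        atoms.append({"symbol": symbol,
                      "position": [float(x), float(y), float(z)]})
    if len(atoms) != n_atoms:
        raise ValueError("atom count does not match the XYZ header")
    # "example-standard/0.1" is an invented schema label, not a real standard.
    return {"schema": "example-standard/0.1", "comment": comment, "atoms": atoms}


def standard_to_xyz(record):
    """Convert a record in the hypothetical schema back to XYZ-format text."""
    lines = [str(len(record["atoms"])), record["comment"]]
    for atom in record["atoms"]:
        x, y, z = atom["position"]
        lines.append(f"{atom['symbol']} {x:.6f} {y:.6f} {z:.6f}")
    return "\n".join(lines) + "\n"


# Illustrative input: a water molecule in XYZ format.
WATER_XYZ = (
    "3\n"
    "water molecule\n"
    "O 0.000000 0.000000 0.117300\n"
    "H 0.000000 0.757200 -0.469200\n"
    "H 0.000000 -0.757200 -0.469200\n"
)

record = xyz_to_standard(WATER_XYZ)   # native -> standard
archived = json.dumps(record)         # what would be stored in the archive
restored = standard_to_xyz(json.loads(archived))  # standard -> native
```

Neither function touches the simulation program itself, which is the point of the protocol: writing such a tool requires knowledge of the file layouts only, not of the scientific code that produced them.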