Big Data of Materials Science -- Critical Next Steps
CECAM-HQ-EPFL, Lausanne, Switzerland
We identify two key open problems in the field:
1) Controlling and ensuring veracity, i.e. the challenge of controlling the accuracy of the large amounts of data used for the analysis, which may derive from heterogeneous sources.
2) The role of the descriptor: the “cause (descriptor) -> property/function” relation should contain physical insight and not merely yield a good numerical approximation.
The planned workshop is intended to bring together the key scientists who have been developing and applying techniques for the production and storage of large amounts of first-principles data and/or techniques for the analysis of such Big Data. The aim of the gathering is to analyse the two problems above in depth and to identify a strategy for solving them. In particular:
1) First-principles computational materials science and engineering starts from the nuclear numbers, but several approximations are imposed when actually solving the equations. Besides numerical issues, which can and must be tested, uncertainties arise from the Born-Oppenheimer approximation, the pseudopotential approximation (if used), and, in particular, the exchange-correlation functional. So far, materials science lacks a methodology for properly assessing the accuracy and range of validity of these approximations.
Several talks will be devoted to these issues, and in the discussion sessions common strategies will be identified, with the aim of possibly building a database infrastructure with common standards that could ideally serve the whole community.
2) For many, maybe most, material functions, the “cause/descriptor (d) -> property/function (P)” relation is complex and indirect. From statistical-learning theory, it is known that inverting the d -> P mapping, i.e. identifying the cause from known data P, is an ill-posed problem, even when a one-to-one correspondence exists: a small error in the data P may suggest a very different cause d. Obviously, the nuclear numbers and stoichiometry uniquely identify the many-body Hamiltonian and its results. However, in order to establish a d -> P mapping, the question is: what is the (microscopic) mechanism behind the desired quantity? In other words, what is the best descriptor d?
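The sensitivity of the inverse problem can be illustrated with a toy calculation. The sketch below assumes a hypothetical linear forward model P = A d with a 2x2 matrix A whose rows are nearly redundant. The matrix, the data values and the helper name solve_2x2 are illustrative choices, not taken from any real material. The mapping is one-to-one (A is invertible), yet an error of 1e-4 in one entry of P moves the inferred cause from roughly (1, 1) to roughly (0, 2).

```python
def solve_2x2(a, p):
    """Solve a @ d = p for a 2x2 matrix a using Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    d0 = (p[0] * a[1][1] - a[0][1] * p[1]) / det
    d1 = (a[0][0] * p[1] - a[1][0] * p[0]) / det
    return [d0, d1]


# Hypothetical forward model (assumption): two nearly redundant measurements.
A = [[1.0, 1.0],
     [1.0, 1.0001]]

P_exact = [2.0, 2.0001]  # data generated by the cause d = (1, 1)
P_noisy = [2.0, 2.0002]  # same data with an error of 1e-4 in one entry

d_exact = solve_2x2(A, P_exact)  # approximately (1, 1)
d_noisy = solve_2x2(A, P_noisy)  # approximately (0, 2)

print("cause inferred from exact data:", d_exact)
print("cause inferred from noisy data:", d_noisy)
```

The tiny determinant of A (about 1e-4) is what amplifies the data error. In realistic settings the forward map is nonlinear and high-dimensional, but the same amplification mechanism applies.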
Identification of the cause-descriptor d is the key and difficult step, which so far has been achieved through elaborate analyses and ingenious scientific understanding. The hope is that in the future machine-learning techniques may help make the discovery of d and the P(d) relation a systematic strategy rather than a serendipitous event.
In particular, sparsity-related techniques such as compressed sensing, which has had a tremendous impact on image compression, face recognition, and magnetic resonance imaging, will be extensively discussed. These techniques could play a leading role in moving Big Data analysis from a collection of numerically accurate techniques to a (physical) model identification strategy.
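As a rough illustration of how a sparsity-promoting method can pick a small descriptor out of many candidates, the sketch below applies l1-regularised least squares (LASSO, the convex relaxation commonly used in compressed sensing) to synthetic data. Everything here is an assumption made for illustration: the ten candidate features are random numbers rather than physical quantities, the property depends on features 0 and 3 only, and the regularisation strength lam=0.1 is chosen by hand.

```python
import random


def soft_threshold(x, lam):
    """Shrink x towards zero by lam; values in [-lam, lam] become exactly 0."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Minimise (1/2n)*||y - Xw||^2 + lam*||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # residual = y - Xw, and w starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            w_new = soft_threshold(rho, lam) / col_sq[j]
            delta = w_new - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = w_new
    return w


# Synthetic data (assumption): 10 candidate features, only 0 and 3 matter.
rng = random.Random(0)
n_samples, n_features = 100, 10
X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
true_w = [0.0] * n_features
true_w[0], true_w[3] = 3.0, -2.0
y = [sum(X[i][j] * true_w[j] for j in range(n_features)) + rng.gauss(0.0, 0.01)
     for i in range(n_samples)]

w = lasso_coordinate_descent(X, y, lam=0.1)
selected = [j for j, wj in enumerate(w) if wj != 0.0]
print("selected features:", selected)
print("coefficients:", [round(w[j], 3) for j in selected])
```

The l1 penalty drives the coefficients of irrelevant candidates to exactly zero, so the surviving features form a compact candidate descriptor. The penalty also shrinks the retained coefficients slightly towards zero, which is why a refit on the selected features is often done afterwards.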
A related critical issue is that unphysical correlations, which statistical learning often finds, must be identified as non-causal and disregarded: they do not advance scientific understanding and are bound to fail in prediction. Methods for this step are not yet available for the challenges we are facing in materials science.
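How easily such unphysical correlations arise, and why they fail for predictions, can be seen in a small synthetic experiment. The sketch below is an illustration of the problem only, not a method for solving it. It assumes 1000 candidate features made of pure noise, a property that is also pure noise, and only 10 training samples. All names and sizes are hypothetical. The feature that correlates best with the property on the training samples is then checked on 1000 fresh samples.

```python
import random


def pearson(u, v):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5


def make_data(rng, n_samples, n_features):
    """Return (features, prop): independent Gaussian noise, so no causal link."""
    features = [[rng.gauss(0.0, 1.0) for _ in range(n_samples)]
                for _ in range(n_features)]
    prop = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    return features, prop


rng = random.Random(0)
n_features = 1000
train_feats, train_prop = make_data(rng, n_samples=10, n_features=n_features)
test_feats, test_prop = make_data(rng, n_samples=1000, n_features=n_features)

# Pick the candidate that correlates most strongly with the property in training.
train_corr = [abs(pearson(f, train_prop)) for f in train_feats]
best = max(range(n_features), key=lambda j: train_corr[j])

best_train = train_corr[best]
best_test = abs(pearson(test_feats[best], test_prop))
print("best feature index:", best)
print("|r| on training samples:", round(best_train, 3))
print("|r| on fresh samples:", round(best_test, 3))
```

With many candidates and few samples, a strong training correlation appears by chance alone and vanishes on new data. Spurious correlations in real materials data can be far harder to expose than in this toy case, for instance when they stem from a shared confounder.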
Luca Ghiringhelli (NOMAD Laboratory at the Fritz Haber Institute of the Max Planck Society and Humboldt University, Berlin) - Organiser & speaker
Matthias Scheffler (Fritz-Haber-Institut der Max-Planck-Gesellschaft) - Organiser
Sergey Levchenko (Skolkovo Institute of Science and Technology, Moscow, Russia) - Organiser & speaker