Big-Data driven Materials Science
CECAM-HQ-EPFL, Lausanne, Switzerland
Many, probably most, areas in the basic and applied sciences and engineering increasingly face the challenge of handling massive amounts of data, now commonly referred to as “big data”. This big-data challenge is not only about storing and processing huge amounts of data; it is also, and in particular, an opportunity for new methodology and understanding, because it opens qualitatively new routes for doing research.
The number of possible materials, including organic and inorganic materials, surfaces, interfaces, and nanostructures, as well as hybrids of these systems, is practically infinite. Fewer than 200,000 materials are “known” to exist, and for only very few of these “known” materials have the basic properties (elastic constants, plasticity, piezoelectric tensors, conductivity, etc.) been determined. When 60 commercial elements are blended together, the number of compounds to be explored is essentially infinite.
It is therefore highly likely that new materials exist with superior and so far simply unknown property profiles, which could help solve fundamental issues in the fields of energy, mobility, safety, information, and health.
There have not yet been many breakthroughs in predicting new materials. The best examples may be the works on electric breakdown and on thermoelectrics. Other works did not use analytics; instead, a specified dataset was scanned to optimize a specified quantity [3-9]. The discoveries in these studies were "limited" to finding the best (formerly unknown) material for the given quantity. In general, however, creating data on elastic constants and piezoelectric parameters is also very important, even without the analytics.
For materials science it is already clear that big data are structured in terms of materials properties and functions, and that, in the same terms, the space of all possible materials is sparsely populated. Finding this structure in the big data, e.g., when searching for efficient catalysts for methane formation, good thermal barrier coatings, shape memory alloys, or thermoelectric materials for energy harvesting from temperature gradients, may be possible even if the physical mechanisms underlying these properties and functions are not yet understood in detail. Novel big-data analytics tools, e.g., based on machine learning and in particular compressed sensing, promise to do so.
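To make the compressed-sensing idea concrete, the following is a minimal sketch, not the method of any particular study. It assumes a purely synthetic setting: a hypothetical target property depends linearly on only 3 of 100 candidate descriptors, and only 60 "materials" have been measured. An L1-penalized least-squares fit (LASSO), solved here with plain iterative soft-thresholding, is then used to pick out the few relevant descriptors. All names, sizes, and the penalty value `lam` are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink each entry of v toward zero by t (proximal map of the L1 norm)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista_lasso(X, y, lam, n_iter=2000):
    """Minimize (1/2n)||y - Xw||^2 + lam*||w||_1 by iterative soft-thresholding."""
    n, p = X.shape
    lipschitz = np.linalg.norm(X, 2) ** 2 / n  # largest eigenvalue of X^T X / n
    w = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n
        w = soft_threshold(w - grad / lipschitz, lam / lipschitz)
    return w

# Synthetic data (assumption): 60 "materials", 100 candidate descriptors,
# of which only three actually determine the target property.
rng = np.random.default_rng(0)
n_materials, n_descriptors = 60, 100
X = rng.standard_normal((n_materials, n_descriptors))
true_support = [5, 37, 81]
w_true = np.zeros(n_descriptors)
w_true[true_support] = [2.0, -3.0, 1.5]
y = X @ w_true + 0.01 * rng.standard_normal(n_materials)

w_hat = ista_lasso(X, y, lam=0.1)

# Indices of the three descriptors with the largest fitted weights.
top3 = sorted(np.argsort(np.abs(w_hat))[-3:].tolist())
print("selected descriptors:", top3)
```

The point of the sketch is only that a sparse model can be identified from fewer samples than candidate descriptors; real materials data would need physically motivated descriptors and a validated choice of the penalty.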
Finding structure in big data is just one example of a promising route in big-data-driven materials science. However, at present there is significant hype associated with the term “big data”. Often, promises are not well founded, because trustworthy big-data analytics tools, and error bars associated with these tools, are hardly established. Thus, from a science perspective, and certainly for materials science and engineering, “big-data-driven science” is a field that is only just emerging. However, there is hardly any doubt that this field will considerably affect the way science is done in the future.
We identify two outstanding challenges in "big-data-driven materials science":
1. Developing big-data analytics to find structures and causal relationships in big data of materials that are not recognizable by the "naked eye" or with standard tools.
2. Assigning error bars or uncertainty tags to the data.
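As one illustration of the second challenge, the sketch below attaches an error bar to a model prediction by bootstrap resampling. It is a generic statistical technique chosen here for illustration, not a method prescribed by the workshop, and the data are synthetic: a hypothetical property that depends linearly on a single descriptor, with added noise.

```python
import numpy as np

# Synthetic data (assumption): property = 2*x + 1 plus Gaussian noise.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 30)
y = 2.0 * x + 1.0 + 0.1 * rng.standard_normal(x.size)

def bootstrap_prediction(x, y, x_new, rng, n_boot=200):
    """Refit a straight line on resampled data; return mean and spread of predictions."""
    preds = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, x.size, size=x.size)  # sample with replacement
        slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
        preds[b] = slope * x_new + intercept
    return preds.mean(), preds.std(ddof=1)

mean, err = bootstrap_prediction(x, y, x_new=0.5, rng=rng)
print(f"prediction at x=0.5: {mean:.3f} +/- {err:.3f}")
```

The spread of the refitted predictions reflects only the sampling noise of this simple model; it does not capture systematic errors in the underlying data, which is part of what makes the challenge hard.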
The aim of the workshop is to bring together the community that develops models and methodologies for data analytics and the continuously growing part of the materials science community that applies those models and methodologies to relevant problems in the field.
The purpose of this cross-breeding is, on the one hand, to expose materials scientists to novel methods at and beyond the state of the art and, on the other hand, to stimulate the theoretical data analytics and management community with practical problems whose solution may require further advances in their disciplines.
It is also worth noting that, with respect to the kind of data that are handled, materials science presents a couple of "anomalies" compared to other disciplines where big-data analytics is routinely performed (e.g., social sciences, drug design, meteorology).
a) Predictions in materials science need to be unusually accurate in order to identify the “needle in a haystack”, i.e., for example, the 100 best materials for prescribed target properties out of a pool that contains a practically infinite number of possible chemical and/or structural compounds.
b) Thanks to modern computational methodologies, the behavior of compounds that are difficult to create or handle in the lab (e.g., because they are poisonous, radioactive, or unstable at room conditions) can be calculated, so new data can be generated on demand. In contrast, in other disciplines, all the data are typically collected before the analysis, and it is normally impossible to acquire new data under the same conditions in order to test what was found by the analysis.
These "anomalies" call for "domain-tailored" data-analytics techniques, which require cooperation between the developers of such techniques and materials scientists.
Luca Ghiringhelli (NOMAD Laboratory at the Fritz Haber Institute of the Max Planck Society and Humboldt University, Berlin) - Organiser & speaker
Matthias Scheffler (Fritz-Haber-Institut der Max-Planck-Gesellschaft) - Organiser