**If you are interested in joining this workshop, please contact the workshop organizers by following the instructions in "Apply". ****This a multi-part event and we indicate the date for the first meeting. Dates of follow ups are decided during the first event.**

**----**

E-CAM is an EINFRA project funded by H2020. Its goal is to create, develop, and sustain a European infrastructure for computational science, applied to simulation and modelling of materials and biological processes that are of industrial and societal interest. E-CAM builds upon the considerable European expertise and capability in this area.

E-CAM is organized around four scientific areas: Molecular dynamics, electronic structure, quantum dynamics and meso- and multiscale modelling, corresponding to work packages WP1-4. E-CAM gathers a number of groups with complementary expertise in the area of meso- and multiscale modeling and has also very well established contact to simulation code developers. Among the aims of the involved groups in this area is to produce a software stack by combining software modules, and to further develop existing simulation codes towards highly scalable applications on high performance computer architectures. It has been identified as a key issue that simulation codes in the field of molecular dynamics, meso- and multiscale applications should be prepared for the upcoming HPC architectures. Different approaches have been proposed by E-CAM WPs: (i) developing and optimizing highly scalable applications, running a single application on a large number of cores and (ii) developing micro-schedulers for task-farming approaches, where multiple simulations run each on smaller partitions of a large HPC system and work together on the collection of statistics or the sampling of a parameter space, for which only loosely coupled simulations would be needed. Both approaches rely on the efficient implementation of simulation codes.

Concerning strategy, most of modern parallelized (classical) particle simulation programs are based on a spatial decomposition method as an underlying parallel algorithm. In this case, different processors administrate different spatial regions of the simulation domain and keep track of those particles that are located in their respective region. Processors exchange information (i) in order to compute interactions between particles located on different processors, and (ii) to exchange particles that have moved to a region administrated by a different processor. This implies that the workload of a given processor is very much determined by its number of particles, or, more precisely, by the number of interactions that are to be evaluated within its spatial region.

Certain systems of high physical and practical interest (e.g. condensing fluids) dynamically develop into a state where the distribution of particles becomes spatially inhomogeneous. Unless special care is being taken, this results in a substantially inhomogeneous distribution of the processors' workload. Since the work usually has to be synchronized between the processors, the runtime is determined by the slowest processor (i.e. the one with highest workload). In the extreme case, this means that a large fraction of the processors is idle during these waiting times. This problem becomes particularly severe if one aims at strong scaling, where the number of processors is increased at constant problem size: Every processor administrates smaller and smaller regions and therefore inhomogeneities will become more and more pronounced. This will eventually saturate the scalability of a given problem, already at a processor number that is still so small that communication overhead remains negligible.

The solution to this problem is the inclusion of dynamic load balancing techniques. These methods redistribute the workload among the processors, by lowering the load of the most busy cores and enhancing the load of the most idle ones. Fortunately, several successful techniques are known already to put this strategy into practice (see references). Nevertheless, dynamic load balancing that is both efficient and widely applicable implies highly non-trivial coding work. Therefore it has has not yet been implemented in a number of important codes of interest to the E-CAM community, e.g. DL_Meso, DL_Poly, Espresso, Espresso++, to name a few. Other codes (e.g. LAMMPS) have implemented somewhat simpler schemes, which however might turn out to lack sufficient flexibility to accommodate all important cases. Therefore, the present proposal suggests to organize an Extended Software Development Workshop (ESDW) together with E-CAM, where code developers of CECAM community codes are invited together with E-CAM postdocs, to work on the implementation of load balancing strategies. The goal of this activity is to increase the scalability of these applications to a larger number of cores on HPC systems, for spatially inhomogeneous systems, and thus to reduce the time-to-solution of the applications.

The workshop is intended to make a major community effort in the direction of improving European simulation codes in the field of classical atomistic, mesoscopic and multiscale simulation. Various load balancing techniques will be presented, discussed and selectively implemented into codes. Sample implementations of load balancing techniques have been done for the codes IMD and MP2C. These are highly scalable particle codes, cf. e.g. http://www.fz-juelich.de/ias/jsc/EN/Expertise/High-Q-Club/_node.html. The technical task is to provide a domain decomposition with flexible adjustment of domain boarders. The basic load balancing functionality will be implemented and provided by a library, which will be accessed via interfaces from the codes.

**In order to attract as many developers of CECAM codes as possible the workshop will be split into 3 parts:**

**Part 1: preparation meeting (2 days)**

- various types of load balancing schemes will be presented conceptually and examples of implemented techniques will be shown

- code developers / owners will present their codes. Functionalities will be presented and parallel implementations are discussed in view of technical requirements for the implementation of load balancing techniques

- an interface definition for exchanging information from a simulation code to a load balancing library will be set up

- identify any HPC-related training required for subsequent meetings

**Part 2: training and implementation (1 week)**

- any required training identified in Part 1 will be provided at Juelich

- during and after the training (planned for 2-3 days), participants can start implementing a load balancing scheme into a code

- for those participants who are already on an expert level in HPC techniques, it is possible to start immediately with implementing load balancing schemes

**Part 3: implementation and benchmarking (1 week)**

- final implementation work with the goal to have at least one working implementation per code

- for successful implementations benchmarks are conducted on Juelich supercomputer facilities

It is intended that between the face-to-face parts of the workshop, postdocs and developers continue the preparation and work on the load balancing schemes, so that the meetings will be an important step to synchronise, exchange information and experience and improve the current versions of implementation.

** **