Both Markov state models and balanced model order reduction are well-established coarse-graining techniques that can handle large-scale and multiscale dynamical systems, given certain equilibrium or stochastic stability assumptions such as reversibility or the fluctuation-dissipation relation. If these assumptions are met, MD trajectory data can be used to derive coarse-grained representations of a molecular system and to quantify the associated uncertainty [10, 11]; an example is the Python-based software package pyEMMA [12]. The systematic use and incorporation of general MD data, e.g., coming from nonequilibrium, driven or open boundary MD, is a relatively new aspect that has been studied only recently [13, 14, 15].

**Model and dimension reduction**

One of the guiding principles for the analysis of molecular dynamics data is that the relevant conformational degrees of freedom can be characterised in terms of the largest spatiotemporal scales in the system, which can be identified by, e.g., principal component analysis (PCA) or one of its more sophisticated nonlinear variants (kernel PCA, diffusion maps, etc.). Finding a good set of principal components that captures the full molecular dynamics, however, is typically infeasible, and it is therefore reasonable to understand "dynamics" with respect to a set of suitable observables or collective variables. Balanced model order reduction (MOR) is a rational approximation technique that seeks an approximation of given observables as functions of external input variables (e.g. external forces or noise) by identifying a subspace of variables that are both sensitive to the inputs and strongly coupled to the observables [16]. For linear and bilinear systems, finding a reduced-order system boils down to solving a set of coupled Lyapunov matrix equations for the corresponding controllability and observability Gramians; for nonlinear systems in equilibrium, the problem can be rephrased in terms of a Monte Carlo sampling procedure [17]. Similar ideas for general nonequilibrium systems have been developed in the group of one of the workshop organizers [18]; cf. [19]. Casting these ideas into functional software requires developing modules for the statistical estimation of the empirical Gramians from MD trajectories (e.g. shrinkage and maximum likelihood covariance estimators), modules for the efficient computation of the balancing transformation, the Hankel singular values and the reduced subspace that exploit the sparsity of the Gramians, and interfaces to MD visualization software for displaying the dimension-reduced data. Coupling the MOR scheme back into the MD codes would require advanced force field interpolation methods (such as DEIM) and will not be part of the ESDW.
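
For the linear case, the route via Lyapunov equations, Gramians and Hankel singular values can be illustrated with a minimal sketch. It assumes a stable, controllable and observable system dx/dt = Ax + Bu, y = Cx with dense matrices, and uses the standard square-root variant of balanced truncation. The function name `balanced_truncation` and the toy matrices are hypothetical and chosen for illustration only; the sketch does not cover the empirical (trajectory-based) Gramian estimation or the sparsity-exploiting solvers planned for the workshop.

```python
import numpy as np
from scipy.linalg import cholesky, solve_continuous_lyapunov, svd


def balanced_truncation(A, B, C, r):
    """Reduce the stable linear system (A, B, C) to order r (illustrative sketch)."""
    # Controllability Gramian Wc solves  A Wc + Wc A^T = -B B^T.
    Wc = solve_continuous_lyapunov(A, -B @ B.T)
    # Observability Gramian Wo solves  A^T Wo + Wo A = -C^T C.
    Wo = solve_continuous_lyapunov(A.T, -C.T @ C)
    # Symmetrize to remove round-off asymmetry before factorizing.
    Wc = 0.5 * (Wc + Wc.T)
    Wo = 0.5 * (Wo + Wo.T)

    # Cholesky factors Wc = Lc Lc^T and Wo = Lo Lo^T
    # (assumes both Gramians are positive definite).
    Lc = cholesky(Wc, lower=True)
    Lo = cholesky(Wo, lower=True)

    # The singular values of Lo^T Lc are the Hankel singular values.
    U, hsv, Vt = svd(Lo.T @ Lc)

    # Truncated balancing transformation and its left inverse.
    scale = np.diag(hsv[:r] ** -0.5)
    T = Lc @ Vt[:r].T @ scale       # shape (n, r)
    Tinv = scale @ U[:, :r].T @ Lo.T  # shape (r, n), Tinv @ T = identity

    return Tinv @ A @ T, Tinv @ B, C @ T, hsv


# Toy system with two slow modes and one fast mode (assumed values).
A = np.diag([-1.0, -2.0, -50.0])
B = np.ones((3, 1))
C = np.ones((1, 3))
Ar, Br, Cr, hsv = balanced_truncation(A, B, C, r=2)
print("Hankel singular values:", hsv)
```

The decay of the Hankel singular values indicates how many balanced variables are worth keeping; discarding the states associated with small values gives the reduced subspace.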

The MOR modules that we plan to develop during the workshop will be relevant for WP3 of the E-CAM project, which involves the optimal control of open quantum systems.

**Markov state modelling of open boundary MD**

On the one hand, a Markov State Model (MSM) can be viewed as a meshfree discretization of the transfer operator associated with the equations of reversible molecular dynamics [20]. On the other hand, an MSM is a time series analysis tool that allows for extracting the essential conformation dynamics from a trajectory that has been generated by any reversible and ergodic dynamical system [21]. Both interpretations are intimately related, in that the discretization point of view provides the theoretical justification for the predictions made by an MSM on the basis of a given MD trajectory. The generalization of the MSM approach to reversible open systems, which is currently being undertaken in the Berlin node of the E-CAM project, is a natural next step for two reasons: the reversibility assumption (i.e. detailed balance) guarantees the physical interpretability of the principal MSM eigenvalues as the dominant time scales of the conformation dynamics, and the exchange of particles with a reservoir as in GC-AdResS has a natural interpretation in terms of a (reversible) stochastic transition kernel in the Markovian framework [22, 23]. As a consequence, the MSM discretisation of GC-AdResS admits a sound theoretical foundation (similar to the finite element discretisation of a reaction-diffusion equation), and the ESDW will concentrate on MSM as a time series analysis tool for open boundary MD data. The software modules to be developed for this purpose will follow the route of the traditional MSM approach (e.g. [12]) and comprise modules for rate matrix estimation and uncertainty quantification within a Bayesian framework, geometric clustering algorithms for state space discretization, and metastability feature extraction using Perron-cluster cluster analysis (PCCA) and transition path theory (TPT). The planned workshop will strongly benefit from the expertise of the research groups of Ch. Schütte (theory) and F. Noé (software), both of which are located on the FUB/ZIB campus.
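
The time series point of view can be illustrated with a minimal sketch of MSM estimation from a trajectory that has already been discretized into states (e.g. by clustering). The sketch counts transitions at a fixed lag time, symmetrizes the counts as a naive way of enforcing detailed balance, and converts the eigenvalues of the resulting transition matrix into implied time scales. The function names `estimate_reversible_msm` and `implied_timescales` and the two-state test chain are hypothetical; the symmetrization is a simple stand-in and not the Bayesian or maximum likelihood estimators envisaged for the workshop modules, nor the API of any existing package.

```python
import numpy as np


def estimate_reversible_msm(dtraj, n_states, lag):
    """Estimate a transition matrix from a discrete trajectory (illustrative sketch).

    Assumes every state is visited at least once, so that no row of the
    count matrix is empty.
    """
    dtraj = np.asarray(dtraj, dtype=int)
    counts = np.zeros((n_states, n_states))
    # Count transitions i -> j separated by `lag` steps (sliding window).
    np.add.at(counts, (dtraj[:-lag], dtraj[lag:]), 1.0)
    # Naive reversibility: symmetrizing the counts enforces detailed balance.
    counts = 0.5 * (counts + counts.T)
    row_sums = counts.sum(axis=1)
    T = counts / row_sums[:, None]
    # Stationary distribution implied by the symmetrized counts.
    pi = row_sums / row_sums.sum()
    return T, pi


def implied_timescales(T, lag):
    """Convert the eigenvalues of T into time scales t_i = -lag / ln(lambda_i)."""
    # A reversible transition matrix has real eigenvalues; drop round-off.
    eigvals = np.sort(np.linalg.eigvals(T).real)[::-1]
    # Skip the stationary eigenvalue (equal to 1) and keep 0 < lambda < 1.
    ev = eigvals[1:]
    ev = ev[(ev > 0.0) & (ev < 1.0)]
    return -lag / np.log(ev)


# Synthetic two-state chain with assumed transition probabilities.
rng = np.random.default_rng(0)
P = np.array([[0.95, 0.05],
              [0.10, 0.90]])
u = rng.random(20000)
dtraj = np.zeros(20000, dtype=int)
for t in range(1, len(dtraj)):
    dtraj[t] = 1 if u[t] < P[dtraj[t - 1], 1] else 0

T, pi = estimate_reversible_msm(dtraj, n_states=2, lag=1)
print("estimated transition matrix:\n", T)
print("implied time scales:", implied_timescales(T, lag=1))
```

In practice the estimate would be repeated for several lag times, and the implied time scales checked for convergence before the model is used for PCCA or TPT.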

Control of MSMs and postprocessing of MSMs by balanced MOR would be interesting aspects; they are, however, out of scope of the planned ESDW and will be studied at a later stage.

**Objectives**

Multiscale and multiphysics methodologies have become an integral part of molecular modelling and simulation and have been applied to various real-world problems that would have been out of reach without scale-bridging techniques [1, 2]. The research groups of the two organizers are well-known for their contributions to the multiscale modelling and simulation of molecular systems, specifically the unique combination of mathematical analysis, physical modelling and algorithmic development [3, 4, 5]. The E-CAM WP4 contribution of the Berlin node involves

(a) the development of the quantum-classical grand canonical adaptive resolution simulation scheme (GC-AdResS), using a semi-classical path integral formalism (PI/GC-AdResS),

(b) the adaptation of balanced model order reduction (MOR) for linear and bilinear control systems to driven molecular systems [6, 7], beyond the linear response regime, and

(c) the systematic incorporation of Markov State Model (MSM) approximations into the grand canonical molecular dynamics (MD) framework of GC-AdResS.

The theoretical and algorithmic framework for (a) has been described in the recent PhD thesis [8], which also documents the first implementation of PI-AdResS in GROMACS (which is not yet very efficient); cf. [9]. The implementation and documentation of interfaces to the most popular open source MD codes (GROMACS, LAMMPS, CP2K, NAMD, OpenMM), including benchmark systems, is ongoing work that will be completed by early 2017.

The next steps that will be relevant for the planned Extended Software Development Workshop (ESDW) involve WP4 subtasks (b) and (c), within a purely data-driven framework. Both topics are currently under investigation from a theoretical and modelling point of view (see details below). From the perspective of software development, the advantage of the data-driven approach is threefold: firstly, the corresponding software modules can be developed and tested independently of specific MD codes, as the algorithms use raw MD trajectories as input data; secondly, the modules can be readily used together with existing numerical linear algebra libraries and trajectory readers for various MD trajectory formats; thirdly, the modules may be useful for analysing arbitrary high-dimensional time series, not just MD data. These aspects make the development of software modules from WP4 topics (b) and (c) particularly suitable for an ESDW.

The medium-term aim beyond the planned ESDW will be to integrate the developed modules into PI/GC-AdResS, in order to use the information from the MSM- or MOR-based coarse graining to optimize the shape of fine- and coarse-grained regions in PI/GC-AdResS and to identify negligible degrees of freedom in the interface region that do not contribute to the particle exchange with the reservoir, with the ultimate goal of making multiscale simulations more efficient.

Possible Projects: (also see Section Files)

https://www.cecam.org/upload/files/file_3339.pdf