Learning the Effective Dynamics of Complex Multiscale Systems
Pantelis R. Vlachas, Georgios Arampatzis, Caroline Uhler, Petros Koumoutsakos
Pantelis R. Vlachas, Computational Science and Engineering Laboratory, ETH Zürich, CH-8092, Switzerland ([email protected])
Georgios Arampatzis, Computational Science and Engineering Laboratory, ETH Zürich, CH-8092, Switzerland ([email protected])
Caroline Uhler, Institute for Data, Systems, and Society, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, USA ([email protected])
Petros Koumoutsakos*, Computational Science and Engineering Laboratory, ETH Zürich, CH-8092, Switzerland ([email protected])

*To whom correspondence should be addressed. E-mail: [email protected]

July 3, 2020

Abstract
Simulations of complex multiscale systems are essential for science and technology, ranging from weather forecasting to aircraft design. The predictive capabilities of simulations hinge on their capacity to capture the governing system dynamics. Large-scale simulations, resolving all spatiotemporal scales, provide invaluable insight at a high computational cost. In turn, simulations using reduced-order models are affordable, but their veracity often hinges on linearisation and/or heuristics. Here we present a novel systematic framework to extract and accurately forecast the effective dynamics (LED) of complex systems with multiple spatio-temporal scales. The framework fuses advanced machine learning algorithms with equation-free approaches. It deploys autoencoders to obtain a mapping between fine- and coarse-grained representations of the system and learns to forecast the latent space dynamics using recurrent neural networks. We compare the LED framework with existing approaches on a number of benchmark problems and demonstrate a reduction in computational effort by several orders of magnitude without sacrificing the accuracy of the system.
Keywords: Multiscale modeling · Equation-free · Autoencoders
A wide range of scientific problems and engineering designs is founded on the study of complex systems with dynamics spanning multiple spatio-temporal scales. Examples include protein dynamics [1], turbulence [2], brain [3] and cancer dynamics [4], climate [5], ocean dynamics [6] and social systems [7]. Over the last fifty years, simulations have become a key component of these studies thanks to a confluence of advances in computing architectures, numerical methods and software. Large scale simulations have led to unprecedented insight, acting as in-silico microscopes [8] or telescopes to reveal the dynamics of galaxy formations [9]. At the same time, these simulations have led to the understanding that resolving the full range of spatio-temporal scales in such complex systems will remain out of reach in the foreseeable future. In recent years there have been intense efforts to develop efficient simulations that exploit the multiscale character of the systems under investigation [10, 11, 12, 13, 14, 15]. Multiscale methods rely on judicious approximations
of the interactions between processes occurring over different scales, and a number of potent frameworks have been proposed, including the equation-free framework (EFF) [13, 16, 11, 17, 18], the Heterogeneous Multiscale Method (HMM) [12, 19, 20] and the FLow AVeraged integratOR (FLAVOR) [21, 22]. In these algorithms the system dynamics are separated into fine and coarse scales, corresponding to expensive and affordable simulations respectively. Their success depends on the separation of scales inherent to the system dynamics and on their capability to capture the transfer of information between scales. Effective applications of multiscale methodologies minimize the computational effort while maximizing the accuracy of the propagated dynamics. The EFF relies on a few fine scale simulations that are used to acquire, through "restricting", information about the evolution of the coarse grained quantities of interest. In turn, various time stepping procedures are used to propagate the coarse grained dynamics. The fine scale dynamics are obtained by judiciously "lifting" the coarse scales to return to the fine scale description of the system, and the procedure repeats. When the EFF reproduces trajectories of the original system, the identified low order dynamics represent the intrinsic system dynamics, also called effective dynamics, inertial manifold [23, 24], slow collective variables [25] or reaction coordinates [26] in molecular kinetics.

While it is undisputed that the EFF, HMM, FLAVOR and related frameworks have revolutionized the field of multiscale modeling and simulation, we identify two critical issues that presently limit their potential. First, the accuracy of propagating the coarse grained dynamics hinges on the employed time integrators. More importantly, the choice of information transfer, in particular from coarse to fine scale dynamics in "lifting", greatly affects the forecasting capacity of the methods. In the present work we resolve these two critical issues through machine learning algorithms that (i) deploy state of the art recurrent neural networks (RNNs) with gating mechanisms to evolve the coarse grained (latent) dynamics and (ii) employ advanced probabilistic autoencoders (AEs) to transfer, in a systematic, data driven manner, the information between coarse and fine scale descriptions.

Over the last years, machine learning (ML) algorithms have exploited the ample availability of data and powerful computing architectures [27] to provide us with remarkable successes across scientific disciplines, from physics [28, 29, 30, 31], fluid dynamics [32, 33, 34], image and language processing [35], and mathematics [36] to medicine [37]. The particular elements of our algorithms have been employed in the modeling of dynamical systems. Autoencoders [38, 39] have been used to identify a linear latent space based on the Koopman framework [40, 41], model high dimensional fluid flows [42, 43] or sample effectively the state space in the kinetics of proteins [44, 45, 46]. Recurrent neural networks with gating mechanisms have been shown successful in capturing the coarse grained dynamics of complex systems [47, 48]. However, these previous works fail to employ one or more of the following mechanisms: consider the coarse scale dynamics, account for their non-Markovian or non-linear nature, exploit a probabilistic generative mapping from the coarse to the fine scale, and scale to high dimensional systems.
In [49], the authors identify a PDE on a coarse level using diffusion maps, Gaussian processes or neural networks, and utilize forward integration in the coarse representation to model the FitzHugh-Nagumo equation (FHN) in the equation-free formalism. We find that augmenting multiscale frameworks (including EFF, HMM, FLAVOR) with state of the art machine learning algorithms allows for evolving the coarse scale dynamics by taking into account their time history, and for providing consistent lifting and restriction operators to transfer information between fine and coarse scales. To the best of our knowledge, this is the first time that machine learning algorithms are exploited in the context of multiscale modeling. We demonstrate that the proposed LED allows for simulations of complex multiscale systems that reduce the computational cost by orders of magnitude, capturing spatiotemporal scales that would be impossible to resolve with existing computing resources.
We propose a learning framework to identify and propagate the effective dynamics of multiscale systems. The proposed learned effective dynamics (LED) allow for an accurate prediction of the evolution of the system at a significantly reduced computational cost. The LED framework is founded on the multiscale framework of Equation Free methods [11] and the related Heterogeneous Multiscale Methods (HMM) [20] and FLow AVeraged integratORs (FLAVOR) [21] methodologies. LEDs provide a unified description and augment these methods by delivering a consistent transfer of information between coarse and fine grain scales, using AEs, and a non-Markovian advancement of the nonlinear dynamics of the latent space, using RNNs.

In the following, the high dimensional state of a dynamical system is given by $s_t \in \mathbb{R}^{d_s}$, and the discrete time dynamics are given by

$s_{t+\Delta t} = F(s_t)$,   (1)
where $\Delta t$ is the sampling period and $F$ may be nonlinear, deterministic or stochastic. We assume that the dynamics of the system can be described by a few variables, on a low order manifold $z_t \in \mathbb{R}^{d_z}$, where $d_s \gg d_z$. In order to identify this manifold, we utilize an encoder $E_{w_E} : \mathbb{R}^{d_s} \to \mathbb{R}^{d_z}$, where $w_E$ are trainable parameters, projecting the high dimensional state to $z_t = E_{w_E}(s_t)$. In turn, we utilize a decoder that maps this latent representation back to the high dimensional state, i.e. $\tilde{s}_t = D_{w_D}(z_t)$. Variational AEs, which regularize the training of AEs and are described in detail in the supplementary information (SI) document, are also embedded in LED. For stochastic systems, $D_{w_D}$ is modeled with a Mixture Density (MD) decoder [50], approximating the probability distribution of the state $\tilde{s}_t \sim p(\,\cdot\,; w_{MD})$, where $w_{MD} = D_{w_D}(z_t)$ are the outputs of the decoder, parametrizing the distribution. The parameters $w_E, w_D$ are identified by maximizing the log-likelihood of the reconstruction,

$w_E, w_D = \operatorname{argmax}_{w_E, w_D} \log p(s_t; w_{MD})$,   (2)

$w_{MD} = D_{w_D}(z_t) = D_{w_D}(E_{w_E}(s_t))$.   (3)

For a deterministic decoder, this loss reduces to the classical reconstruction loss $\|s_t - D_{w_D}(E_{w_E}(s_t))\|$. Further details on the implementation of the MD decoder are provided in the SI.

Figure 1 (schematic): the microscale system (fully resolved equations/simulations) is 'RESTRICT'-ed by encoding to the latent space; the RNN is teacher-forced for $T_{warm}$ and then forecasts the latent dynamics for $T_m$, with $T_{warm} \ll T_m$; the final latent state is 'LIFT'-ed by decoding back to the high dimensional space.

Figure 1: The high dimensional dynamics $s_t$ are simulated for a short period of $T_{warm}$ time units. During this warm-up period, the state $s_t$ is passed through the encoder network. The outputs of the encoder $z_t$ are iteratively provided as the input to the RNN, allowing for the update of its hidden state $h_t$, capturing non-Markovian effects. Starting from the last latent state $z_t$, the RNN iteratively propagates the low order latent dynamics up to a total horizon of $T_m$ time units, with $T_m \gg T_{warm}$. The LED decoder maps the last latent state $z_{t+T_m}$ back to a high-dimensional representation $\tilde{s}_{t+T_m}$. Propagation in the low order space unraveled by LED can be orders of magnitude cheaper compared to evolving the high dimensional system based on first principles. As a consequence, LED can accelerate simulations, enabling the study of the dynamics on much longer time scales, and unraveling state space regions that would otherwise be intractable.

As the LED targets physical systems, we take into consideration properties such as energy conservation [51], translation invariance [52], or permutation invariance. Such invariants can be embedded in the proposed multiscale framework. In this work, we demonstrate that LED can be easily coupled with a permutation invariant layer [53] (see details in the SI), utilized later in the modeling of the dynamics of a large set of particles governed by the advection-diffusion equation.

As a non-linear propagator on the low order manifold (coarse scale), we utilize an RNN, capturing non-Markovian memory effects by keeping an internal memory state. The efficiency of RNNs in capturing non-Markovian temporal dependencies has been demonstrated in complex high-dimensional dynamical systems [48, 47], yet their capabilities in multiscale modeling have not been demonstrated before.
The RNN learns a forecasting rule

$h_t = H_{w_H}(z_t, h_{t-\Delta t})$,  $\tilde{z}_{t+\Delta t} = R_{w_R}(h_t)$,   (4)

where $h_t \in \mathbb{R}^{d_h}$ is an internal hidden memory state, and $\tilde{z}_{t+\Delta t}$ is a prediction of the latent state.
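To make the forecasting rule of Equation (4) concrete, a minimal PyTorch sketch is given below; it wraps an LSTM cell as $H_{w_H}$ and a linear readout as $R_{w_R}$. The class and variable names are illustrative, not the paper's actual implementation:

```python
import torch.nn as nn

class LatentPropagator(nn.Module):
    """Sketch of Eq. (4): h_t = H(z_t, h_{t-dt}),  z~_{t+dt} = R(h_t)."""
    def __init__(self, d_z: int, d_h: int):
        super().__init__()
        self.H = nn.LSTMCell(d_z, d_h)            # hidden-to-hidden mapping H_{w_H}
        self.R = nn.Linear(d_h, d_z, bias=False)  # linear hidden-to-output mapping R_{w_R}

    def forward(self, z_t, hidden=None):
        hidden = self.H(z_t, hidden)   # update the internal memory state (h_t, c_t)
        z_next = self.R(hidden[0])     # predict the latent state at the next time-step
        return z_next, hidden
```

During the warm-up period the inputs $z_t$ are encodings of ground-truth states (teacher forcing); in free-running mode the prediction $\tilde{z}_{t+\Delta t}$ is fed back as the next input.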
Figure 2 (schematic): micro dynamics for $T_{warm}$ → encoding → macro dynamics for $T_m$ → decoding → micro dynamics for $T_\mu$ → encoding → macro dynamics for $T_m$ → ..., with $T_{warm}, T_\mu \ll T_m$.

Figure 2: Multiscale forecasting with LED (testing phase). Starting from an initial condition in the test data, use the equations/first principles to evolve the dynamics for $T_{warm}$, project to the latent space dynamics using the autoencoder, and teacher-force the RNN to warm up the hidden state for this initial time $T_{warm}$. Next, iteratively (1) propagate the dynamics on the low-dimensional latent space with the RNN for $T_m$ time units, (2) project the latent dynamics at $t = T_m$ back to the high dimensional state, (3) starting from this high-dimensional state as an initial condition, use the equations/first principles to evolve the dynamics for $T_\mu$, and so on and so forth.

The role of the RNN is twofold. First, it updates its hidden memory state $h_t$, given the current state $z_t$ provided at the input and the hidden memory state at the previous time-step $h_{t-\Delta t}$, tracking the history of the low order state to model non-Markovian dynamics. Second, given the updated hidden state $h_t$, the RNN forecasts the latent state at the next time-step(s) $\tilde{z}_{t+\Delta t}$. $H_{w_H}$ and $R_{w_R}$ are the hidden-to-hidden and the hidden-to-output mappings, while $w_H$ and $w_R$ are the trainable parameters of the RNN. Possible implementations of $H_{w_H}$ and $R_{w_R}$ are the long short-term memory (LSTM) [54] cell or the Gated Recurrent Unit (GRU) [55], explained in the SI. The RNN is trained to minimize the forecasting loss $\|\tilde{z}_{t+\Delta t} - z_{t+\Delta t}\|$ by backpropagation through time (BPTT) [56].

Firstly, the RNN and the autoencoder, jointly referred to as learned effective dynamics (LED), are trained on data from short simulations of the fully resolved (or microscale) dynamical system. After training, LED is employed to forecast the dynamics on unseen data, by propagating the low order latent state with the RNN and avoiding the computationally expensive simulation of high-dimensional dynamics, as depicted in Figure 1.

The LED framework allows for data driven information transfer between coarse and fine scales through the AEs. Moreover, it propagates the latent space dynamics without the need to upscale back to the high-dimensional state space at every time-step. The interface with the high dimensional state space is enabled only at the time-steps and scales of interest. This is in contrast to [45, 46], and is easily adaptable to the needs of particular applications, thus augmenting the arsenal of models developed for multiscale problems. We note that, as is the case for any approximate iterative integrator (here the RNN), the initial model errors will propagate. In order to mitigate potential instabilities, we propose the multiscale forecasting scheme in Figure 2. In this way, the approximation error can be reduced at the cost of the computational complexity associated with evolving the high dimensional dynamics. We note that training of LED models is performed with the Adam stochastic optimization method [57]. All LED models are implemented in PyTorch, mapped to a single Nvidia Tesla P100 GPU and executed on the XC50 compute nodes of the Piz Daint supercomputer at the Swiss National Supercomputing Centre (CSCS).

We demonstrate the application of LED in a number of benchmark problems and compare its performance with existing state of the art algorithms.
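The multiscale forecasting scheme of Figure 2 can be sketched as the following loop; encoder, decoder, propagator, and micro_solver stand for the trained LED networks and the fine-scale solver, and all names and the exact bookkeeping are illustrative assumptions rather than the paper's API:

```python
def led_multiscale_forecast(s0, encoder, decoder, propagator, micro_solver,
                            T_warm, T_m, T_mu, T_f, dt):
    """Alternate micro-scale evolution (T_mu) with latent propagation (T_m).

    micro_solver(s, dt) advances the fine-scale state by one coarse step dt
    (internally it may take many finer steps); propagator implements Eq. (4).
    """
    s, z, hidden, t = s0, None, None, 0.0
    # Warm-up: evolve the fine scale and teacher-force the RNN hidden state.
    for _ in range(int(T_warm / dt)):
        s = micro_solver(s, dt)
        z = encoder(s)
        _, hidden = propagator(z, hidden)
        t += dt
    while t < T_f:
        for _ in range(int(T_m / dt)):           # macro: latent propagation
            z, hidden = propagator(z, hidden)
        s = decoder(z)                           # 'LIFT' to the fine scale
        for _ in range(int(T_mu / dt)):          # micro: fine-scale correction
            s = micro_solver(s, dt)
            z = encoder(s)                       # 'RESTRICT' to the latent space
            _, hidden = propagator(z, hidden)    # keep the memory state current
        t += T_m + T_mu
    return s
```

Setting T_mu = 0 recovers the fully latent ("Latent") variant, while the ratio $\rho = T_m/T_\mu$ controls the trade-off between speed-up and accuracy discussed below.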
Figure 3: Starting from different initial conditions on the test data, we propagate the dynamics with LED, restarting from each initial condition multiple times (changing the random seed in each restart), up to a final time of $T_f = 4000$. For every run, we calculate the first two moments of the distribution of the particle positions, M1 and M2; at each time-step this yields a distribution of predicted M1 and M2 values over the restarts. Three LED variants are considered: LED with $T_\mu = 0$ (iterative latent propagation), and two multiscale variants with $T_\mu = 50$, $T_m = 450$ and $T_\mu = 50$, $T_m = 250$. The warm-up time is $T_{warm} = 60$ for all variants. A, B) The evolution of the Wasserstein distance (WD) between the predicted distributions of M1/M2 and the groundtruth, averaged over the initial conditions. C, D) The Wasserstein distance in M1/M2 averaged over initial conditions and time. LED variants capture the variance (M2) accurately, with relatively low errors on M1 (compared with the domain size $L = 1$). The self-similar error is plotted for reference [58], as errors below this level are statistically insignificant. E, F) Examples of predicted particle positions at $t = 625$ from a single run starting from the same initial condition on the test data for each model. In the SI we provide additional results on the L1-histogram distance, and the evolution of the L1-histogram distance and the WD on the particle positions on single random runs, supporting the aforementioned arguments. H) The speed-up of LED compared to the micro scale solver, plotted w.r.t. $\rho$.

We apply the LED method to the simulation of the advection-diffusion equation. We model the microscale description of the Advection-Diffusion (AD) process with a system of $N = 1000$ particles on a bounded domain $\Omega = [-L/2, L/2]^{d_x}$. The particle dynamics are modeled with the stochastic differential equation (SDE)

$dx_t = u_t\,dt + \sqrt{D}\,dW_t$,   (5)

where $x_t \in \Omega$ denotes the position of the particle at time $t$, $D \in \mathbb{R}$ is the diffusion coefficient, $dW_t \in \mathbb{R}^{d_x}$ is a Wiener process, and $u_t = A\cos(\omega t) \in \mathbb{R}^{d_x}$ is a cosine advection (drift) term. In the following, we consider the three dimensional space $d_x = 3$ with fixed diffusion coefficient $D$, drift amplitude $A$, frequency $\omega$, and a domain size of $L = 1$. We report the Péclet number $Pe = 9$, quantifying the rate of advection over the rate of diffusion, i.e.

$Pe = \dfrac{L U}{D}$,   (6)

where $L = 1$ and $U = |A|$. Equation (5) is solved with explicit Euler integration, with a fine time-step $\Delta t$, fixed initial conditions $x_0$, and reflective boundary conditions ensuring that $x_t \in \Omega\ \forall t$. The positions of the particles are saved at a coarser time-step, and by starting from randomly selected initial conditions we generate three datasets, for training, validation and testing. The full state of the system is high dimensional, i.e. $s_t = [x_t^1; \ldots; x_t^N]^T \in \mathbb{R}^{N \times d_x}$. We find that the particles concentrate on a few "meta-stable" states, and jump between them, suggesting that the collective dynamics can be captured by a few
latent variables. However, it is not straightforward to determine a-priori the number of these states and the patterns of collective motion. LED unravels this information and provides a computationally efficient multiscale model to approximate the system dynamics. We employ an AE with a permutation invariant input layer (see SI) with a latent dimension $d_z = 8$, a MD decoder, and a stateful LSTM-RNN to learn and forecast the dynamics on the low-dimensional manifold. The network is fed an initial configuration of particles $s_t \in \mathbb{R}^{N \times d_x}$, compresses the configuration to the latent state $z_t \in \mathbb{R}^{d_z}$, and utilizes the RNN to forecast the latent dynamics $z_{t+1}$. From this latent space, the MD decoder of LED samples the next configuration $s_{t+1}$. After training the RNN, we evaluate the efficiency of LED in capturing the statistics of the system in Figure 3.

In the testing phase, we examine two variants of LED. The first variant does not evolve the dynamics on the particle level (we denote it as "Latent" LED); its error increases with time and exhibits the highest errors on average. This error accumulation is due to the explicit dependence of the coarse system dynamics on time, as the $\cos(\omega t)$ advection term dominates. The second variant (we denote it as "multiscale" LED) evolves the low order manifold dynamics (coarse scale) for $T_m$ time units and the particle dynamics (fine scale) for $T_\mu = 50$, iteratively correcting the statistical error. Two values for $T_m$ are considered: $T_m = 250$, leading to a relative ratio of coarse to fine simulation time of $\rho = T_m/T_\mu = 5$, and $T_m = 450$, leading to $\rho = 9$. The multiscale variant incurs additional computational cost induced by the evolution of the high dimensional state. As the multiscale ratio $\rho = T_m/T_\mu$ is increased, spending more time in the latent propagation, the errors gradually increase. We note that the propagation in the low dimensional latent space is far less computationally expensive compared to the evolution of the high dimensional dynamics. As we increase $\rho$, we can achieve greater computational savings, albeit at the cost of higher approximation error, as depicted in Figure 3. We demonstrate that LED is able to generalize to different numbers of particles, and provide additional results on the one-dimensional AD case in the SI. The effectiveness of LED depending on the diffusion coefficient $D$ is shown in Figure 5. Moreover, in Figure 4, it is shown that by clustering the latent space dynamics identified by LED, we can gain insight into the collective high dimensional dynamics of the system.

Figure 4: A) Evolution of the second PCA mode of the latent state $z_t \in \mathbb{R}^{d_z = 8}$, against the first mode. Higher color intensity denotes higher density. Six high density regions are identified. We perform spectral clustering on the PCA modes of the latent dynamics. The six identified cluster centers are marked, while color illustrates the cluster membership. The LED probabilistic decoder is employed to map each cluster center to a realization of a high-dimensional simulation state. LED effectively unravels six meta-stable states of the Advection-Diffusion equation, along with the transitions between them, representing the low order effective dynamics. B) Evolution of the third PCA mode against the second one, colored according to cluster assignment. C) Density of the particle positions from simulation plotted against the distribution of the positions predicted by LED. We remark the good agreement between the two distributions.

Figure 5: Analysis of the performance of LED for different Péclet numbers $Pe \in \{1, 10, 100\}$. To measure the errors, we perform multiple runs, starting from a random initial condition, up to final time $T_f = 3000$. A, B) WD on the distributions of the first two moments between the micro scale simulation and LED. The methods consistently exhibit lower error as the Péclet number decreases. C) The speed-up plotted w.r.t. $\rho$. Simulation of the micro dynamics for lower Péclet numbers is more expensive, due to the lower time-step necessary to resolve the diffusive dynamics. As $Pe$ decreases, the computational savings of LED increase by orders of magnitude (see the SI for further details).
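For reference, a minimal NumPy sketch of this fine-scale data generation follows; the values of D, A and omega shown are placeholders (the exact values used in the paper are not reproduced here), and only the structure — explicit Euler-Maruyama steps with a cosine drift and reflective boundaries — mirrors Equation (5):

```python
import numpy as np

def simulate_ad(n_particles=1000, d_x=3, L=1.0, T=100.0, dt=1e-3,
                D=0.1, A=(1.0, 0.5, 0.5), omega=(0.5, 1.0, 1.0), seed=0):
    """Explicit Euler-Maruyama integration of Eq. (5), dx = u_t dt + sqrt(D) dW,
    with reflective boundaries on [-L/2, L/2]^d_x. Parameter values are
    illustrative placeholders, not the paper's exact settings."""
    rng = np.random.default_rng(seed)
    A, omega = np.asarray(A), np.asarray(omega)
    x = np.zeros((n_particles, d_x))                 # illustrative initial condition x_0 = 0
    snapshots = []
    for step in range(int(T / dt)):
        u = A * np.cos(omega * (step * dt))          # cosine drift u_t = A cos(w t)
        x = x + u * dt + np.sqrt(D * dt) * rng.standard_normal(x.shape)
        # Reflect every particle back into the domain [-L/2, L/2]^d_x.
        y = np.mod(x + L / 2, 2 * L)
        x = -L / 2 + (L - np.abs(y - L))
        snapshots.append(x.copy())
    return np.array(snapshots)
```

Saving only every $k$-th snapshot then yields the coarser sampling period at which the particle positions are stored for training.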
We examine the capability of LED to capture the evolution of the FitzHugh-Nagumo [59, 60] equations (FHN). The FHN model describes the evolution of an activator density $u(x,t) = \rho_{ac}(x,t)$ and an inhibitor density $v(x,t) = \rho_{in}(x,t)$ on the domain $x \in [0, L]$:
$\dfrac{\partial u}{\partial t} = D^u \dfrac{\partial^2 u}{\partial x^2} + u - u^3 - v$,
$\dfrac{\partial v}{\partial t} = D^v \dfrac{\partial^2 v}{\partial x^2} + \epsilon (u - \alpha_1 v - \alpha_0)$.   (7)

The system evolves under two timescales, with the activator/inhibitor density acting as the "fast"/"slow" variable respectively. The bifurcation parameter $\epsilon$ controls the difference in the time-scales. We set $D^u = 1$ and $D^v = 4$ for the diffusion coefficients of the activator and the inhibitor, respectively, and select $L = 20$, with $\alpha_1 = 2$ and $\alpha_0$ a small negative constant. Equation (7) is discretized with $N = 101$ grid points and solved using the Lattice Boltzmann (LB) method [61] with a fine time-step. In the following, the mesoscopic solution obtained by LB is considered the fine-grained solution. LED operates on a coarser time-scale $\Delta t = 1$ and a coarse latent space of dimension $d_z = 8$ (increasing the latent dimension further did not lead to a lower reconstruction error of the LED autoencoder, as reported in the SI). In this case, LED does not utilize a MD decoder, as the system under study is deterministic. LED is benchmarked against equation-free variants [49] on the FHN equation in Figure 6.

LED identifies and propagates the low order intrinsic dynamics, and is able to up-scale back to the activator density, forecasting its evolution accurately while being $S = T_{fine}/T_{LED}$ times faster, where $T_{LED}$ is the average time that LED takes for one step of size $\Delta t$ time units, and $T_{fine}$ is the average time the LB solver takes for one coarse time unit $\Delta t = 1$. This speed-up can be decisive in accelerating simulations and achieving much larger time horizons. Using the multiscale propagation (iteratively alternating between the coarse-grained dynamics of LED and the solver on the fine scale), the approximation error of LED decreases, at the cost of reduced speed-up. This interplay can be seen in Figure 7. The warm-up time is $T_{warm} = 60$ for all variants.
Figure 6 (panel A, MNAD values): LED-VAE $d_z = 32$: 3.41E-03; LED-AE $d_z = 32$: 2.43E-03; LED-VAE $d_z = 8$: 3.27E-03; LED-AE $d_z = 8$: 6.13E-03; LED-VAE $d_z = 4$: 6.12E-03; LED-AE $d_z = 4$: 4.17E-03; CSPDE-GP: 1.59E-02; CSPDE-NN: 1.53E-02; CSPDE-GP-F1: 1.58E-02; CSPDE-NN-F1: 1.54E-02; CSPDE-GP-F2: 2.39E-02; CSPDE-NN-F2: 2.00E-02; CSPDE-GP-F3: 3.20E-02; CSPDE-NN-F3: 2.08E-02.

Figure 6: Comparison of LED with equation-free variants [49] that are based on the identification of PDEs on the coarse level (CSPDE), in forecasting the dynamics of the FHN equation starting from an initial condition from the test data up to final time $T_f = 451$. CSPDE variants utilize Gaussian processes (GP) or neural networks (NN), features of the fine scale dynamics obtained through diffusion maps (F1 to F3), and forward integration to propagate the coarse representation in time. A) Mean normalised absolute difference (MNAD) on the activator density, between the result of the LB simulation $u(x,t)$, considered as groundtruth, and the model forecasts $\tilde{u}$, either LED or CSPDE variants. Variants of LED with autoencoders (AE), variational autoencoders (VAE), and different sizes of the latent dimension are considered, to evaluate the sensitivity of the performance on these parameters. LED variants outperform CSPDE variants by an order of magnitude. The definition of the MNAD, and additional results on the inhibitor density, are given in the SI. B) Prediction of the dynamics of the activator density of the FHN equation on the test dataset using LED with a coarse dimension $d_z = 8$.

Figure 7: Starting from different initial conditions in the test data, the LB method is utilized to compute the FHN dynamics up to a large horizon $T_f = 8000$, substantially larger than the one covered by the training data. Using the LED framework, either with full iterative propagation on the reduced order space ($T_\mu = 0$), or alternating between macro-dynamics for $T_m$ and high-dimensional dynamics for $T_\mu = 10$, we approximate the evolution. The results for $T_\mu = 0$ are denoted with the label "Latent". The warm-up period is set to $T_{warm} = 75$. (a) The average MNAD error between the predicted and ground-truth evolution of the activator density, plotted as a function of the macro-to-micro ratio $\rho = T_m/T_\mu$. (c) The speed-up compared to the LB solver, plotted w.r.t. $\rho$. For $T_m = T_\mu = 10$ ($\rho = 1$), the MNAD is reduced compared to the iterative latent propagation; however, the speed-up is also reduced. By increasing $T_m \in \{50, 100, 200\}$, we obtain the intermediate regimes between propagation of the computationally expensive (and possibly intractable) high-dimensional system dynamics and the fully latent propagation. As we increase $T_m$ (increasing $\rho$), the speed-up increases, as we rely more and more on the reduced order dynamics, albeit at the cost of an increasing error.
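The statistical errors reported in Figures 3 and 5 are Wasserstein distances between predicted and ground-truth distributions of the moments M1 and M2. A minimal sketch using SciPy is given below; how the moments are aggregated across runs and spatial dimensions is an assumption of this sketch:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def moment_wasserstein_errors(pred, true):
    """WD between predicted and ground-truth distributions of the first two
    moments (M1, M2) of the particle positions, per time-step.

    pred, true: arrays of shape (n_runs, n_steps, N, d_x).
    """
    m1_p, m1_t = pred.mean(axis=2), true.mean(axis=2)  # M1 per run/step/dim
    m2_p, m2_t = pred.var(axis=2), true.var(axis=2)    # M2 (central) per run/step/dim
    n_steps = pred.shape[1]
    wd_m1 = np.array([wasserstein_distance(m1_p[:, t].ravel(), m1_t[:, t].ravel())
                      for t in range(n_steps)])
    wd_m2 = np.array([wasserstein_distance(m2_p[:, t].ravel(), m2_t[:, t].ravel())
                      for t in range(n_steps)])
    return wd_m1, wd_m2
```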
The Kuramoto-Sivashinsky (KS) equation is a prototypical partial differential equation (PDE) of fourth order that exhibits a very rich range of nonlinear phenomena [62]. We examine the capability of LED to learn the low order manifold of the effective dynamics in KS [63, 64]. Even though PDEs such as the KS, modeling viscous flows, are infinite dimensional, in the case of high dissipation and small spatial extent the long-term dynamics can be represented on a low dimensional inertial manifold [24, 23] that attracts all neighboring states at an exponential rate after a transient period. We consider the one dimensional KS equation given by the PDE

$\dfrac{\partial u}{\partial t} = -\nu \dfrac{\partial^4 u}{\partial x^4} - \dfrac{\partial^2 u}{\partial x^2} - u \dfrac{\partial u}{\partial x}$,   (8)

on the domain $\Omega = [0, L]$ with periodic boundary conditions $u(0,t) = u(L,t)$ and $\nu = 1$. In this work, we consider the case $L = 22$, studied extensively in [65], exhibiting a structurally stable chaotic attractor, i.e. an inertial manifold where the long-term dynamics lie. We discretize Equation (8) with a grid of size $D = 64$ and solve it using the fourth-order method for stiff PDEs introduced in [66], with a fine time-step, starting from a random initial condition and keeping every tenth datapoint, to obtain a dataset where consecutive samples are separated by a coarser $\Delta t$ in time. We use a large number of such samples for training and an equal number for validation. For testing purposes (long-term forecasting), we repeat the process with a different random seed, generating additional samples. The results of LED in KS are shown in Figure 8.

Figure 8: A) Contour plot of the KS dynamics starting from an initial condition from the test data. B) The evolution of the $d_z = 8$ dimensional latent state of LED. C) The field predicted by LED iteratively propagating the dynamics on a $d_z = 8$ dimensional latent space, after a warm-up period $T_{warm} = 60$ ($T_\mu = 0$). D) The density of values in the $u_x$-$u_{xx}$ space from the predicted trajectory closely matches the E) ground-truth, illustrating that LED is able to mimic characteristics of the dynamical system, even though it propagates coarse dynamics. F) The statistical error (WD) on the distribution of state values for different variants of LED w.r.t. $\rho$. The errors remain small, demonstrating that LED captures the long-term statistics (climate). H) The speed-up of LED w.r.t. $\rho$. Evolution of the latent state of LED ($T_\mu = 0$) is approximately two orders of magnitude cheaper than the micro scale dynamics. (Quantitative results on the power spectrum and an extensive analysis are provided in the SI.)
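For completeness, a self-contained sketch of the fine-scale KS solver follows, implementing the fourth-order ETDRK4 scheme of Kassam and Trefethen [66] for Equation (8) with $L = 22$ and 64 grid points; the step size and initial condition are illustrative choices, since the exact values appear only partially in the text:

```python
import numpy as np

def simulate_ks(N=64, L=22.0, dt=0.025, n_steps=10000, seed=0):
    """ETDRK4 integration of u_t = -u_xx - u_xxxx - u u_x (Eq. 8, nu = 1),
    with periodic boundary conditions; dt and n_steps are illustrative."""
    rng = np.random.default_rng(seed)
    k = 2 * np.pi * np.fft.fftfreq(N, d=L / N)        # wavenumbers
    Lop = k**2 - k**4                                  # linear operator in Fourier space
    E, E2 = np.exp(dt * Lop), np.exp(dt * Lop / 2)
    # Contour-integral evaluation of the ETDRK4 coefficient functions.
    M = 16
    r = np.exp(1j * np.pi * (np.arange(1, M + 1) - 0.5) / M)
    LR = dt * Lop[:, None] + r[None, :]
    Q  = dt * np.real(np.mean((np.exp(LR / 2) - 1) / LR, axis=1))
    f1 = dt * np.real(np.mean((-4 - LR + np.exp(LR) * (4 - 3 * LR + LR**2)) / LR**3, axis=1))
    f2 = dt * np.real(np.mean((2 + LR + np.exp(LR) * (-2 + LR)) / LR**3, axis=1))
    f3 = dt * np.real(np.mean((-4 - 3 * LR - LR**2 + np.exp(LR) * (4 - LR)) / LR**3, axis=1))
    g = -0.5j * k                                      # nonlinearity: -u u_x = -(u^2/2)_x
    v = np.fft.fft(0.01 * rng.standard_normal(N))      # random initial condition
    traj = []
    for _ in range(n_steps):
        Nv = g * np.fft.fft(np.real(np.fft.ifft(v))**2)
        a = E2 * v + Q * Nv
        Na = g * np.fft.fft(np.real(np.fft.ifft(a))**2)
        b = E2 * v + Q * Na
        Nb = g * np.fft.fft(np.real(np.fft.ifft(b))**2)
        c = E2 * a + Q * (2 * Nb - Nv)
        Nc = g * np.fft.fft(np.real(np.fft.ifft(c))**2)
        v = E * v + Nv * f1 + 2 * (Na + Nb) * f2 + Nc * f3
        traj.append(np.real(np.fft.ifft(v)))
    return np.array(traj)
```

Subsampling every tenth snapshot of the returned trajectory then reproduces the coarser sampling described above for the training data.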
We note that the present method is readily applicable to allproblems where Equation Free, HMM and FLAVOR methodologies have been applied.In summary, LED identifies and propagates the effective dynamics of dynamical systems with multiple spatiotemporalscales providing significant computational savings. Moreover, LED provides a systematic way of trading betweenspeed-up and accuracy for a multiscale system by switching between propagation of the latent dynamics, and evolutionof the original equations, iteratively correcting the statistical error at the cost of reduced speed-up. The presentmethodology can be deployed both in problems described by first principles as well as for problems where only data are9earning the Effective Dynamics of Complex Multiscale Systems A P
available for either the macro or the microscale description of the system. Finally, by providing a bridge between data driven and first principles models, LEDs open new horizons for the effective simulation of complex multiscale systems.
Contributions & Acknowledgments
P.K. conceived the project; P.R.V., G.A., C.U., and P.K. designed and performed research; P.R.V. and G.A. contributed new analytic tools; P.R.V., G.A., and P.K. analyzed data; and P.R.V., G.A., and P.K. wrote the paper. The authors declare no conflict of interest. We would like to thank Nikolaos Kallikounis for helpful discussions on the Lattice Boltzmann method, and Kostas Spiliotis, Joon Lee, and Yannis Kevrekidis for providing code to reproduce the data for the FHN equation. We are also thankful to the Swiss National Supercomputing Centre (CSCS) for providing the necessary computational resources under Project s929.
References

[1] Thomas CB McLeish, TL Rodgers, and Mark R Wilson. Allostery without conformation change: modelling protein dynamics at multiple scales. Physical Biology, 10(5):056004, 2013.
[2] David C Wilcox. Multiscale model for turbulent flows. AIAA Journal, 26(11):1311–1320, 1988.
[3] Salvador Dura-Bernal, Benjamin A Suter, Padraig Gleeson, Matteo Cantarelli, Adrian Quintana, Facundo Rodriguez, David J Kedziora, George L Chadderdon, Cliff C Kerr, Samuel A Neymotin, et al. NetPyNE, a tool for data-driven multiscale modeling of brain circuits. eLife, 8:e44494, 2019.
[4] Thomas S Deisboeck and Georgios S Stamatakos. Multiscale Cancer Modeling. CRC Press, 2010.
[5] National Research Council. A National Strategy for Advancing Climate Modeling. The National Academies Press, 2012.
[6] Amala Mahadevan. The impact of submesoscale physics on primary productivity of plankton. Annual Review of Marine Science, 8:161–184, 2016.
[7] Nicola Bellomo and Christian Dogbe. On the modeling of traffic and crowds: A survey of models, speculations, and perspectives. SIAM Review, 53(3):409–463, 2011.
[8] Eric H Lee, Jen Hsin, Marcos Sotomayor, Gemma Comellas, and Klaus Schulten. Discovery through the computational microscope. Structure, 17(10):1295–1306, 2009.
[9] Volker Springel, Simon DM White, Adrian Jenkins, Carlos S Frenk, Naoki Yoshida, Liang Gao, Julio Navarro, Robert Thacker, Darren Croton, John Helly, et al. Simulations of the formation, evolution and clustering of galaxies and quasars. Nature, 435(7042):629–636, 2005.
[10] Richard Car and Mark Parrinello. Unified approach for molecular dynamics and density-functional theory. Physical Review Letters, 55(22):2471, 1985.
[11] Ioannis G Kevrekidis, C William Gear, James M Hyman, Panagiotis G Kevrekidis, Olof Runborg, Constantinos Theodoropoulos, et al. Equation-free, coarse-grained multiscale computation: Enabling microscopic simulators to perform system-level analysis. Communications in Mathematical Sciences, 1(4):715–762, 2003.
[12] E Weinan, Bjorn Engquist, et al. The heterogeneous multiscale methods. Communications in Mathematical Sciences, 1(1):87–132, 2003.
[13] Ioannis G Kevrekidis, C William Gear, and Gerhard Hummer. Equation-free: The computer-aided analysis of complex multiscale systems. AIChE Journal, 50(7):1346–1355, 2004.
[14] Radek Erban, Ioannis G Kevrekidis, David Adalsteinsson, and Timothy C Elston. Gene regulatory networks: A coarse-grained, equation-free approach to multiscale computation. The Journal of Chemical Physics, 124(8):084106, 2006.
[15] Sabine Attinger and Petros D Koumoutsakos. Multiscale Modelling and Simulation. Springer, 2004.
[16] Ioannis G Kevrekidis and Giovanni Samaey. Equation-free multiscale computation: Algorithms and applications. Annual Review of Physical Chemistry, 60:321–344, 2009.
[17] Carlo R Laing, Thomas Frewen, and Ioannis G Kevrekidis. Reduced models for binocular rivalry. Journal of Computational Neuroscience, 28(3):459–476, 2010.
[18] Yohai Bar-Sinai, Stephan Hoyer, Jason Hickey, and Michael P Brenner. Learning data-driven discretizations for partial differential equations. Proceedings of the National Academy of Sciences, 116(31):15344–15349, 2019.
[19] E Weinan, Xiantao Li, and Eric Vanden-Eijnden. Some recent progress in multiscale modeling. In Sabine Attinger and Petros Koumoutsakos, editors, Multiscale Modelling and Simulation, pages 3–21. Springer Berlin Heidelberg, Berlin, Heidelberg, 2004.
[20] E Weinan, Bjorn Engquist, Xiantao Li, Weiqing Ren, and Eric Vanden-Eijnden. Heterogeneous multiscale methods: a review. Communications in Computational Physics, 2(3):367–450, 2007.
[21] Molei Tao, Houman Owhadi, and Jerrold E Marsden. Nonintrusive and structure preserving multiscale integration of stiff ODEs, SDEs, and Hamiltonian systems with hidden slow dynamics via flow averaging. Multiscale Modeling & Simulation, 8(4):1269–1324, 2010.
[22] Molei Tao, Houman Owhadi, and Jerrold E Marsden. Space-time FLAVORS: finite difference, multisymplectic, and pseudospectral integrators for multiscale PDEs. Dynamics of Partial Differential Equations, 8(1):21–45, 2011.
[23] Alec J Linot and Michael D Graham. Deep learning to discover and predict dynamics on an inertial manifold. Physical Review E, 101(6):062209, 2020.
[24] James C Robinson. Inertial manifolds for the Kuramoto-Sivashinsky equation. Physics Letters A, 184(2):190–193, 1994.
[25] Giacomo Fiorin, Michael L Klein, and Jérôme Hénin. Using collective variables to drive molecular dynamics simulations. Molecular Physics, 111(22-23):3345–3362, 2013.
[26] Robert B Best and Gerhard Hummer. Reaction coordinates and rates from transition paths. Proceedings of the National Academy of Sciences, 102(19):6732–6737, 2005.
[27] Thorsten Kurth, Sean Treichler, Joshua Romero, Mayur Mudigonda, Nathan Luehr, Everett Phillips, Ankur Mahesh, Michael Matheson, Jack Deslippe, Massimiliano Fatica, et al. Exascale deep learning for climate analytics. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 649–660. IEEE, 2018.
[28] Kathleen P Champion, Steven L Brunton, and J Nathan Kutz. Discovery of nonlinear multiscale systems: Sampling strategies and embeddings. SIAM Journal on Applied Dynamical Systems, 18(1):312–333, 2019.
[29] Zhong Yi Wan and Themistoklis P Sapsis. Machine learning the kinematics of spherical particles in fluid flows. Journal of Fluid Mechanics, 857, 2018.
[30] Guido Novati, Lakshminarayanan Mahadevan, and Petros Koumoutsakos. Controlled gliding and perching through deep-reinforcement-learning. Physical Review Fluids, 4(9):093902, 2019.
[31] Zhong Yi Wan, Pantelis Vlachas, Petros Koumoutsakos, and Themistoklis Sapsis. Data-assisted reduced-order modeling of extreme events in complex dynamical systems. PLoS ONE, 13(5):e0197704, 2018.
[32] Steven L Brunton, Bernd R Noack, and Petros Koumoutsakos. Machine learning for fluid mechanics. Annual Review of Fluid Mechanics, 52, 2019.
[33] Steven L Brunton, Joshua L Proctor, and J Nathan Kutz. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences, 113(15):3932–3937, 2016.
[34] Siddhartha Verma, Guido Novati, and Petros Koumoutsakos. Efficient collective swimming by harnessing vortices through deep reinforcement learning. Proceedings of the National Academy of Sciences, 115(23):5849–5854, 2018.
[35] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
[36] Jiequn Han, Arnulf Jentzen, and E Weinan. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34):8505–8510, 2018.
[37] Eric J Topol. High-performance medicine: the convergence of human and artificial intelligence. Nature Medicine, 25(1):44–56, 2019.
[38] Mark A Kramer. Nonlinear principal component analysis using autoassociative neural networks. AIChE Journal, 37(2):233–243, 1991.
[39] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[40] Bethany Lusch, J Nathan Kutz, and Steven L Brunton. Deep learning for universal linear embeddings of nonlinear dynamics. Nature Communications, 9(1):1–10, 2018.
[41] Samuel E Otto and Clarence W Rowley. Linearly recurrent autoencoder networks for learning dynamics. SIAM Journal on Applied Dynamical Systems, 18(1):558–593, 2019.
[42] Nicholas Geneva and Nicholas Zabaras. Modeling the dynamics of PDE systems with physics-constrained deep auto-regressive networks. Journal of Computational Physics, 403:109056, 2020.
[43] Michele Milano and Petros Koumoutsakos. Neural network modeling for near wall turbulent flow. Journal of Computational Physics, 182(1):1–26, 2002.
[44] Christoph Wehmeyer and Frank Noé. Time-lagged autoencoders: Deep learning of slow collective variables for molecular kinetics. The Journal of Chemical Physics, 148(24):241703, 2018.
[45] Carlos X Hernández, Hannah K Wayment-Steele, Mohammad M Sultan, Brooke E Husic, and Vijay S Pande. Variational encoding of complex dynamics. Physical Review E, 97(6):062412, 2018.
[46] Mohammad M Sultan, Hannah K Wayment-Steele, and Vijay S Pande. Transferable neural networks for enhanced sampling of protein dynamics. Journal of Chemical Theory and Computation, 14(4):1887–1894, 2018.
[47] Pantelis R Vlachas, Wonmin Byeon, Zhong Y Wan, Themistoklis P Sapsis, and Petros Koumoutsakos. Data-driven forecasting of high-dimensional chaotic systems with long short-term memory networks. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 474(2213):20170844, 2018.
[48] Pantelis R Vlachas, Jaideep Pathak, Brian R Hunt, Themistoklis P Sapsis, Michelle Girvan, Edward Ott, and Petros Koumoutsakos. Backpropagation algorithms and reservoir computing in recurrent neural networks for the forecasting of complex spatiotemporal dynamics. Neural Networks, 2020.
[49] Seungjoon Lee, Mahdi Kooshkbaghi, Konstantinos Spiliotis, Constantinos I Siettos, and Ioannis G Kevrekidis. Coarse-scale PDEs from fine-scale observations via machine learning. Chaos: An Interdisciplinary Journal of Nonlinear Science, 30(1):013141, 2020.
[50] Christopher M Bishop. Mixture density networks. 1994.
[51] Marios Mattheakis, P Protopapas, D Sondak, M Di Giovanni, and E Kaxiras. Physical symmetries embedded in neural networks. arXiv preprint arXiv:1904.08991, 2019.
[52] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[53] Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. Order matters: Sequence to sequence for sets. arXiv preprint arXiv:1511.06391, 2015.
[54] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, November 1997.
[55] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning, December 2014, 2014.
[56] Paul J Werbos. Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1(4):339–356, 1988.
[57] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
[58] Yang Cao and Linda Petzold. Accuracy limitations and the measurement of errors in the stochastic simulation of chemically reacting systems. Journal of Computational Physics, 212(1):6–24, 2006.
[59] Richard FitzHugh. Impulses and physiological states in theoretical models of nerve membrane. Biophysical Journal, 1(6):445, 1961.
[60] Jinichi Nagumo, Suguru Arimoto, and Shuji Yoshizawa. An active pulse transmission line simulating nerve axon. Proceedings of the IRE, 50(10):2061–2070, 1962.
[61] Iliya V Karlin, Santosh Ansumali, Christos E Frouzakis, and Shyam Sunder Chikatamarla. Elements of the lattice Boltzmann method I: Linear advection equation. Communications in Computational Physics, 1(4):616–655, 2006.
[62] Dieter Armbruster, John Guckenheimer, and Philip Holmes. Kuramoto–Sivashinsky dynamics on the center–unstable manifold. SIAM Journal on Applied Mathematics, 49(3):676–691, 1989.
[63] Yoshiki Kuramoto. Diffusion-induced chaos in reaction systems. Progress of Theoretical Physics Supplement, 64:346–367, 1978.
[64] G. I. Sivashinsky. Nonlinear analysis of hydrodynamic instability in laminar flames — I. Derivation of basic equations. Acta Astronautica, 4:1177–1206, 1977.
[65] Predrag Cvitanović, Ruslan L Davidchack, and Evangelos Siminos. On the state space geometry of the Kuramoto–Sivashinsky flow in a periodic domain. SIAM Journal on Applied Dynamical Systems, 9(1):1–33, 2010.
[66] A. Kassam and L. Trefethen. Fourth-order time-stepping for stiff PDEs. SIAM Journal on Scientific Computing, 26(4):1214–1233, 2005.
In the following, we describe the different models used in the proposed framework to learn and propagate the effective dynamics (LED) of complex multiscale systems.
Classical autoencoders are nonlinear neural networks that map an input to a low dimensional latent space and then decode it to the original dimension at the output, trained to minimize the reconstruction loss $L = \|x - \tilde{x}\|$. They were proposed in [38] as a nonlinear alternative to Principal Component Analysis (PCA). An autoencoder is depicted in Figure 9a.
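A minimal PyTorch sketch of such an autoencoder follows; the layer sizes and depth are illustrative, not the configurations used in the paper:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Encoder E: R^{d_s} -> R^{d_z} and decoder D: R^{d_z} -> R^{d_s},
    trained on the reconstruction loss ||x - x~||."""
    def __init__(self, d_s: int, d_z: int, d_hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(d_s, d_hidden), nn.Tanh(),
            nn.Linear(d_hidden, d_z),
        )
        self.decoder = nn.Sequential(
            nn.Linear(d_z, d_hidden), nn.Tanh(),
            nn.Linear(d_hidden, d_s),
        )

    def forward(self, x):
        z = self.encoder(x)         # low dimensional latent representation
        return self.decoder(z), z   # reconstruction x~ and latent state z

# Training objective (illustrative):
# x_tilde, z = model(x); loss = torch.nn.functional.mse_loss(x_tilde, x)
```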
Figure 9 (schematic): (a) Classical Autoencoder (AE): high dimensional state $x$ → encoder → low dimensional latent space $z$ → decoder → reconstruction $\tilde{x}$. (b) Variational Autoencoder (VAE): the encoder outputs $\mu_z$ and $\sigma_z$, and $z = \mu_z + \sigma_z \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$.

Figure 9: (a) A schematic diagram of a classical Autoencoder (AE). A high dimensional state $x$ is mapped to a low dimensional feature space $z$ by applying the encoder transformation through multiple fully connected layers. The low dimensional feature space $z$ is expanded to the original space by the decoder. The autoencoder is trained with the loss $L = \|x - \tilde{x}\|$, so that the input can be reconstructed as faithfully as possible at the decoder output. (b) A schematic diagram of a Variational Autoencoder (VAE). Instead of modeling the latent space deterministically, the encoder outputs a mean latent representation $\mu_z$, along with the associated uncertainty $\sigma_z$. The latent space $z$ is sampled from a normal distribution $z \sim \mathcal{N}(\,\cdot\,|\,\mu_z, \sigma_z I)$, with diagonal covariance matrix.

Research efforts on generative modeling led to the development of Variational Autoencoders (VAEs) [39]. The VAE, similar to the AE, is composed of an encoder and a decoder. The encoder neural network, instead of mapping the input $x$ deterministically to a reduced order latent space $z$, produces a distribution $q(z|x; w_q)$ over the latent representation $z$, where $w_q$ is the parametrization of the distribution given by the output of the encoder, $w_q = E_{w_E}(x)$. In most practical applications, the distribution $q(z|x; w_q)$ is modeled as a factorized Gaussian, implying that $w_q$ is composed of the mean and the diagonal elements of the covariance matrix. The decoder maps a sampled latent representation to an output $\tilde{x} = D_{w_D}(z)$. By sampling the latent distribution $q(z|x; w_q)$ for a fixed input $x$, the autoencoder can generate samples from the probability distribution over $\tilde{x}$ at the decoder output. The network is trained to maximize the log-likelihood of reproducing the input at the output, while minimizing the Kullback-Leibler divergence between the encoder distribution $q(z|x; w_q)$ and a prior distribution, e.g. $\mathcal{N}(0, I)$. VAEs essentially regularize the training of AEs by adding Gaussian noise in the latent representation. In this work, we consider a Gaussian latent distribution with diagonal covariance matrix

$q(z|x; \underbrace{\mu_z, \sigma_z}_{w_q}) = \mathcal{N}\big(z \,|\, \mu_z(x), \operatorname{diag}(\sigma_z(x))\big)$,   (9)

where the mean latent representation $\mu_z$ and the variance vector $\sigma_z$ are the outputs of the encoder neural network $E_{w_E}(x)$. The latent representation is then sampled from $z \sim \mathcal{N}(\,\cdot\,|\,\mu_z, \operatorname{diag}(\sigma_z))$. The decoder receives the sample as an input and outputs the reconstruction $\tilde{x}$. A VAE is depicted in Figure 9b.
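The sampling step and the Kullback-Leibler regularizer can be sketched as follows; parameterizing $\sigma_z$ through its logarithm is a common numerical-stability choice and an assumption here, not a detail stated in the text:

```python
import torch

def vae_latent_and_kl(mu_z, log_var_z):
    """Sample z ~ N(mu_z, diag(sigma_z)) via reparameterization and return
    the KL divergence to the standard normal prior N(0, I)."""
    eps = torch.randn_like(mu_z)                     # eps ~ N(0, I)
    z = mu_z + torch.exp(0.5 * log_var_z) * eps      # z = mu_z + sigma_z * eps
    # Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions.
    kl = 0.5 * torch.sum(torch.exp(log_var_z) + mu_z**2 - 1.0 - log_var_z, dim=-1)
    return z, kl
```

The reparameterization keeps the sampling step differentiable, so the encoder parameters can be trained by backpropagation through $z$.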
Physical systems may satisfy specific properties like energy conservation, translation invariance, permutation invariance, etc. In order to build data-driven models that accurately reproduce the statistical behavior of such systems, these properties should ideally be embedded in the model. Translation invariance is taken into account in the design of the convolution operator in convolutional neural networks (CNNs), developed to process images [52]. Other types of properties, like energy conservation and physical symmetries, are discussed in [51]. In this section, we describe how we can ensure permutation invariance to model the dynamics of particles of the same kind. This is useful in simulations of molecules, i.e. molecular dynamics, where the state of the system is described by a configuration of particles, and any permutation of these particles corresponds to the same configuration. Permutation invariance is handled here with a sum decomposition of a feature space [53]. The exact procedure is depicted in Figure 10.

Assume that the state of a dynamical system $s$ is composed of $N$ particles of the same kind, each one having specific properties or features with dimensionality $d_x$, e.g. position, velocity, etc. The features of a single particle are given by the state $x \in \mathbb{R}^{d_x}$ of the particle. As an input to the network, we provide raw data, i.e. the features of all particles, stacked together in a matrix $s \in \mathbb{R}^{N \times d_x}$. The problem we are dealing with is that a permutation of two particles represents in essence the same configuration and should be mapped to the same latent representation. This is achieved with a permutation invariant layer that first applies a nonlinear transformation $\phi : \mathbb{R}^{d_x} \to \mathbb{R}^{d_p}$, mapping each particle's features to a high dimensional latent representation of dimension $d_p$. This mapping is applied to all particles independently, leading to $N$ such latent vectors. The mean of these vectors is taken to construct the representation of the configuration. The representation $\frac{1}{N}\sum_{i=1}^{N} \phi(x_i)$ is finally fed to a final layer reducing the dimensionality to a low-order representation $z \in \mathbb{R}^{d_z}$, with $d_z \ll d_p, N$. This is achieved by the mapping $g : \mathbb{R}^{d_p} \to \mathbb{R}^{d_z}$. In this work, we utilize the permutation invariance layer in the modeling of the collective dynamics of a group of particles whose movement is governed by the advection-diffusion equation in the one and three dimensional space. Both mappings $g$ and $\phi$ are implemented with neural networks with residual connections and tanh activations.

Figure 10: Illustration of the permutation invariant encoder. The input of the network is composed of $N$ atomic states that are permutation invariant, e.g. positions $\{x_1, \ldots, x_N\}$ of $N$ particles in a particle simulation, each one with dimension $d_x$, i.e. $x_i \in \mathbb{R}^{d_x}\ \forall i \in \{1, \ldots, N\}$. A transformation $\phi(\cdot) : \mathbb{R}^{d_x} \to \mathbb{R}^{d_p}$ is applied to each atomic state separately, mapping to a high dimensional latent feature space. The mean of these latent representations of the atomic states is computed, leading to a single latent feature that is permutation invariant with respect to the input. The final layer of the encoder maps the high-dimensional feature to a low dimensional representation $z$, which is again permutation invariant with respect to the input, representing the encoding of the global state.
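A sketch of this permutation invariant encoder in PyTorch is given below; residual connections are omitted and the layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class PermutationInvariantEncoder(nn.Module):
    """z = g( mean_i phi(x_i) ): invariant to any reordering of the particles."""
    def __init__(self, d_x: int, d_p: int, d_z: int):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(d_x, d_p), nn.Tanh(),
                                 nn.Linear(d_p, d_p), nn.Tanh())
        self.g = nn.Linear(d_p, d_z)

    def forward(self, s):
        # s: (batch, N, d_x) particle features; phi acts on each particle.
        features = self.phi(s)            # (batch, N, d_p)
        pooled = features.mean(dim=1)     # mean over particles -> invariance
        return self.g(pooled)             # (batch, d_z) latent representation
```

Since the mean over particles commutes with any permutation of the rows of s, the latent code z is invariant to particle reordering by construction.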
Mixture density networks (MDNs) [50] are powerful neural networks that can model non-Gaussian, multi-modal data distributions. The outputs of MDNs are parameters of a mixture density model (a mixture of probability density functions). The most generic choice for the mixture component distribution is the Gaussian distribution. Gaussian MDNs are widely deployed in machine learning applications to model structured dynamic environments, e.g. (video) games. However, the effectiveness of MDNs in modeling physical systems remains unexplored.

In physical systems, the state may be bounded. In this case, the choice of a Gaussian MDN is problematic due to its unbounded support. To make matters worse, most applications of Gaussian MDNs for modeling random vectors do not consider the interdependence between the vector variables, i.e. the covariance matrix of the Gaussian mixture components is diagonal, in an attempt to reduce their computational complexity. Arguably, in the applications where they were successful, modeling this interdependence was not imperative. In contrast, in physical systems the variables of a state might depend very strongly on each other. In order to cope with these problems, we consider the following approach. Firstly, we model the distribution $p(v|z)$ of an auxiliary vector variable $v \in \mathbb{R}^{d_x}$, of the same dimensionality $d_x$ as the high dimensional state (input/output of the autoencoder). The distribution is modeled as a mixture of $K$ multivariate normal distributions

$p(v|z) = \sum_{k=1}^{K} \pi_k(z)\, \mathcal{N}\big(\mu_v^k(z), \Sigma_v^k(z)\big)$.   (10)
We parametrize the multivariate normal distribution in terms of a mean vector $\mu_v^k$, a positive definite covariance matrix $\Sigma_v^k$, and the mixing coefficients $\pi_k$, which are functions of $z$. The covariance matrix is parametrised by a lower-triangular matrix $L_v^k$ with positive-valued diagonal entries, such that $\Sigma_v^k = L_v^k (L_v^k)^T \in \mathbb{R}^{d_x \times d_x}$ (this triangular matrix can be recovered by Cholesky factorization of the positive definite $\Sigma_v^k$). The values of $\pi_k(z) \in \mathbb{R}$, $\mu_v^k(z) \in \mathbb{R}^{d_x}$, and the $d_x(d_x+1)/2$ entries of $L_v^k$ are given by the outputs of the decoder for all mixture components $k \in \{1, \ldots, K\}$, i.e. $w_D = D_{w_D}(z) = \{\pi_k, \mu_v^k, L_v^k\}_{k=1,\ldots,K}$. The positivity of the diagonal elements of $L_v^k$ is ensured by a softplus activation function

$f(x) = \ln(1 + \exp(x))$   (11)

in the respective outputs of the decoder. The mixing coefficients satisfy $0 \le \pi_k < 1$ and $\sum_{k=1}^{K} \pi_k = 1$. To ensure these conditions, the respective outputs of the decoder are passed through a softmax activation

$\sigma(x)_i = \dfrac{e^{x_i}}{\sum_j e^{x_j}}$.   (12)

The rest of the decoder outputs (non-diagonal elements and mean vectors) have linear activations, so there is no restriction on their sign. In total, the decoder output is composed of $K(d_x^2 - d_x)/2 + K d_x$ single valued outputs with linear activation for the non-diagonal elements of $L_v^k$ and the mean vectors $\mu_v^k$, $K d_x$ positive outputs with softplus activation for the diagonal of $L_v^k$, and $K$ outputs with softmax activation for the mixing coefficients.
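This parametrization maps naturally onto torch.distributions; the following sketch assembles $p(v|z)$ of Equation (10) from raw decoder outputs (the function and argument names are hypothetical):

```python
import torch
import torch.nn.functional as F
from torch.distributions import Categorical, MultivariateNormal, MixtureSameFamily

def mixture_from_decoder(pi_logits, mu, L_flat, d_x):
    """Build p(v|z) = sum_k pi_k N(mu_k, L_k L_k^T) from raw decoder outputs.

    pi_logits: (batch, K), mu: (batch, K, d_x),
    L_flat: (batch, K, d_x * (d_x + 1) // 2) raw Cholesky-factor entries.
    """
    tril_idx = torch.tril_indices(d_x, d_x)
    L = torch.zeros(*L_flat.shape[:-1], d_x, d_x,
                    dtype=L_flat.dtype, device=L_flat.device)
    L[..., tril_idx[0], tril_idx[1]] = L_flat
    diag = torch.arange(d_x)
    # Softplus keeps the diagonal of L positive, so Sigma = L L^T is PD.
    L[..., diag, diag] = F.softplus(L[..., diag, diag])
    mixing = Categorical(logits=pi_logits)            # softmax over pi_k
    components = MultivariateNormal(loc=mu, scale_tril=L)
    return MixtureSameFamily(mixing, components)

# Training maximizes the likelihood: loss = -mixture_from_decoder(...).log_prob(v).mean()
```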
Figure 11: A mixture density network modeling the probability density p(˜x|z), with bounded ˜x. The decoder maps the latent state z to the parameters of a mixture model on the auxiliary vector v ∈ R^{d_x}: the mixing coefficients π_k ∈ R, the mean vectors µ_v^k ∈ R^{d_x}, and lower-triangular matrices L_v^k ∈ R^{d_x × d_x} with positive-valued diagonal entries. From the latter, the covariance matrix is derived as Σ_v^k = L_v^k (L_v^k)^T, which is positive definite by construction. The mixture models the probability distribution p(v|z) of the auxiliary variable. However, the targets used to train the network in a supervised way are defined on the reconstruction ˜x. The targets are scaled to ˜x ∈ [0, 1]^{d_x} and then transformed to targets for v by inverting the bounded reconstruction mapping ˜x(v) = 1/(1 + exp(−v)). The MDN autoencoder is trained to maximize the likelihood p(v|z) of the transformed data v.

In the low-order manifold (coarse, latent state), a recurrent neural network (RNN) is utilized to capture the nonlinear, non-Markovian dynamics. The forecasting rule of the RNN is given by

    h_t = H_{w_H}(z_t, h_{t−∆t}),    ˜z_{t+∆t} = R_{w_R}(h_t),    (13)

where w_H and w_R are the trainable parameters of the network, h_t ∈ R^{d_h} is an internal hidden memory state, and ˜z_{t+∆t} is a prediction of the latent state. The RNN is trained to minimize the forecasting loss ||˜z_{t+∆t} − z_{t+∆t}||, which can be written as

    ||˜z_{t+∆t} − z_{t+∆t}|| = ||R_{w_R}(h_t) − z_{t+∆t}|| = ||R_{w_R}(H_{w_H}(z_t, h_{t−∆t})) − z_{t+∆t}||.    (14)
This leads to

    w_H, w_R = argmin_{w_H, w_R} ||R_{w_R}(H_{w_H}(z_t, h_{t−∆t})) − z_{t+∆t}||.    (15)

The RNNs are trained with backpropagation through time (BPTT) [56]. In this work, we consider two implementations of the mappings H_{w_H} and R_{w_R}: the long short-term memory (LSTM) cell [54] and the gated recurrent unit (GRU) [55]. The output mapping for both cells is given by a linear transformation, i.e.

    ˜z_{t+∆t} = W_{z,h} h_t,    (16)

where W_{z,h} ∈ R^{d_z × d_h}. As a consequence, the set of trainable weights of the hidden-to-output mapping is just one matrix, w_R = W_{z,h} ∈ R^{d_z × d_h}, for both cells. However, the architecture of the hidden-to-hidden mapping is different. In the following, we describe the implementations of both cells.

(a) GRU cell. (b) LSTM cell.
Figure 12: The information flow for a gated recurrent unit (GRU) cell and a long short-term memory (LSTM) cell. The cells employ gating mechanisms that allow forgetting and storing of information in the processing of the hidden state. Ellipses and circles denote entry-wise operations, while rectangles denote layer operations.
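In PyTorch, the latent propagator of Equations (13)-(16) can be sketched with a built-in gated cell and a linear readout (the hidden size and sequence shapes below are illustrative assumptions):

```python
import torch
import torch.nn as nn

class LatentRNN(nn.Module):
    """Sketch of the latent propagator of Equations (13)-(16): a gated
    recurrent cell H maps (z_t, h_{t-dt}) to h_t, and a linear readout
    R gives the latent prediction z_{t+dt} = W_{z,h} h_t."""
    def __init__(self, d_z: int, d_h: int = 32):
        super().__init__()
        self.cell = nn.LSTMCell(d_z, d_h)     # H_{w_H}; nn.GRUCell also fits
        self.readout = nn.Linear(d_h, d_z, bias=False)  # R_{w_R} = W_{z,h}

    def forward(self, z_seq):
        # z_seq: (T, batch, d_z); teacher forcing: the true latent state is
        # fed at every step, and the cell predicts one step ahead.
        h = c = torch.zeros(z_seq.shape[1], self.cell.hidden_size)
        preds = []
        for z_t in z_seq:
            h, c = self.cell(z_t, (h, c))
            preds.append(self.readout(h))
        return torch.stack(preds)             # (T, batch, d_z)

# Training step minimizing the forecasting loss of Equation (15):
rnn = LatentRNN(d_z=8)
z = torch.randn(41, 16, 8)                    # a latent training sequence
loss = ((rnn(z[:-1]) - z[1:]) ** 2).mean()    # ||z~_{t+dt} - z_{t+dt}||
loss.backward()                               # BPTT
```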
In the GRU cell, the functional form of the mapping h_t = H_{w_H}(z_t, h_{t−∆t}) is given by

    u_t = σ_g(W_u [h_{t−∆t}, z_t] + b_u),
    r_t = σ_g(W_r [h_{t−∆t}, z_t] + b_r),
    ˜h_t = tanh(W_h [r_t ⊙ h_{t−∆t}, z_t] + b_h),
    h_t = (1 − u_t) ⊙ h_{t−∆t} + u_t ⊙ ˜h_t,    (17)

where z_t ∈ R^{d_z} is the latent state (output of the encoder, or previous time step of the RNN) provided at the input of the RNN at time t, u_t ∈ R^{d_h} is the update gate vector, r_t ∈ R^{d_h} is the reset gate vector, ˜h_t, h_t ∈ R^{d_h} are the candidate and internal hidden memory states, W_u, W_r, W_h ∈ R^{d_h × (d_h + d_z)} are weight matrices, and b_u, b_r, b_h ∈ R^{d_h} are biases. The gating activation σ_g is a sigmoid. An illustration of the information flow in a GRU cell is given in Figure 12a.

The architecture of the LSTM cell is slightly more complicated. The LSTM possesses two hidden states, a cell state c and an internal memory state h. The hidden-to-hidden mapping

    h_t, c_t = H_{w_H}(z_t, h_{t−∆t}, c_{t−∆t})    (18)

takes the form

    g_t^f = σ_f(W_f [h_{t−∆t}, z_t] + b_f),
    g_t^i = σ_i(W_i [h_{t−∆t}, z_t] + b_i),
    ˜c_t = tanh(W_c [h_{t−∆t}, z_t] + b_c),
    c_t = g_t^f ⊙ c_{t−∆t} + g_t^i ⊙ ˜c_t,
    g_t^z = σ_h(W_h [h_{t−∆t}, z_t] + b_h),
    h_t = g_t^z ⊙ tanh(c_t),    (19)

where g_t^f, g_t^i, g_t^z ∈ R^{d_h} are the gate vector signals (forget, input, and output gates), z_t ∈ R^{d_z} is the latent input at time t, h_t ∈ R^{d_h} is the hidden state, c_t ∈ R^{d_h} is the cell state, and W_f, W_i, W_c, W_h ∈ R^{d_h × (d_h + d_z)} are weight matrices
and b_f, b_i, b_c, b_h ∈ R^{d_h} are biases. The symbol ⊙ denotes the element-wise product. The activation functions σ_f, σ_i, and σ_h are sigmoids. The dimension d_h of the hidden state (the number of hidden units) controls the capacity of the cell to encode history information. The set of trainable parameters of the recurrent mapping H_{w_H} is thus given by

    w_H = {W_f, W_i, W_c, W_h, b_f, b_i, b_c, b_h}.    (20)

An illustration of the information flow in an LSTM cell is given in Figure 12b.

In this work, we consider discrete-time autonomous dynamical systems of the form

    s_{t+∆t} = F(s_t),    (21)

where s_t ∈ R^{d_s} is the state of the dynamical system at time t, ∆t is the sampling period, and F is a function representing the discrete-time dynamics. The dynamics F may be nonlinear, deterministic, or stochastic. In our study, we are interested in systems whose state s_t is high dimensional while the intrinsic dimensionality of the dynamics is low. The complete description of high-dimensional systems with nonlinear dynamics through the governing equations is often challenging for numerous reasons: the equations F might be unknown, the dynamics might be very hard or computationally expensive to solve at an appropriate resolution, or the model errors might be large. In many cases, we are interested in more macroscopic, coarse-grained characteristics that can be resolved by the large scales. Moreover, the effective system dynamics might live on a low-dimensional manifold. Employing classical order reduction methods such as Galerkin projection can be problematic, as the truncated modes (i.e. small scales) might be relevant for the effective dynamics (e.g. back-scattering). As a consequence, their effect on the dynamic evolution of the larger scales has to be accounted for, e.g. through closure models in turbulence.

In this work, we propose to learn the effective dynamics (LED) of complex systems by coupling an autoencoder (either deterministic or variational) with an RNN that learns to forecast the latent dynamics, and a decoder (a mixture density decoder in the case of stochastic systems) that reconstructs the high-dimensional dynamics. First, the networks are trained with the Adam stochastic optimization method [57]. The trained networks can then be utilized to capture and forecast the evolution of the dynamics on unseen test data. Given a short-term evolution of the state of a first-principles simulation, composed of T_warm initial time steps {s_{t−T_warm+∆t}, . . . , s_t}, we propose to "teacher force" the LED, iteratively feeding the encoder the high-dimensional states, passing the computed latent representation z at every time step to the RNN, and propagating the hidden state h_t of the RNN up to time step t. After this short initial warm-up phase, the dynamics of the system can be propagated on this low-dimensional latent state (coarse representation, or macro dynamics) for a long time horizon T_m ≫ T_warm by utilizing the computationally cheap RNN forecasting rule, while the decoder can be utilized to map the latent space back to the high-dimensional state space whenever needed. An illustration of the proposed iterative prediction on the latent space is given in Figure 13.
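A sketch of this warm-up and iterative latent forecasting follows, reusing the LatentRNN sketch above and assuming encoder/decoder callables like those sketched earlier:

```python
import torch

def led_forecast(encoder, decoder, rnn, s_warm, T_m):
    """Sketch of LED inference (Figure 13): teacher-force the RNN on encoded
    high-dimensional states for T_warm steps, then propagate the latent state
    alone for T_m steps, decoding back to the fine scale only when needed."""
    h = c = torch.zeros(s_warm.shape[1], rnn.cell.hidden_size)
    # Warm-up: feed encoded fine-scale states, no forecasting yet.
    for s_t in s_warm:                    # s_warm: (T_warm, batch, N, d_x)
        z_t = encoder(s_t)
        h, c = rnn.cell(z_t, (h, c))
    # Iterative latent propagation (macro dynamics).
    z_t = rnn.readout(h)
    latent_traj = [z_t]
    for _ in range(T_m - 1):
        h, c = rnn.cell(z_t, (h, c))
        z_t = rnn.readout(h)
        latent_traj.append(z_t)
    return decoder(z_t), torch.stack(latent_traj)
```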
As is the case for any iterative forecasting method, initial model errors will propagate. In order to alleviate this, we propose the following multiscale scheme. Starting from an initial condition, for a warm-up period T_warm the autoencoder projects the high-dimensional state to the latent space, while the latent states are fed to the RNN to warm up its hidden state. During this period T_warm, the dynamics are evolved on the high-dimensional state, while the RNN is "teacher forced" with the respective latent states at the output of the autoencoder. No forecasting takes place in this initial period. After this initial warm-up period, we propose to iteratively (1) propagate the dynamics on the low-dimensional latent space with the RNN for some time T_m, (2) project the latent state at the end of this interval back to the high-dimensional state, and (3) starting from this high-dimensional state as an initial condition, use the equations/first principles to evolve the dynamics for T_µ, and so on. The power of the proposed data-driven approach is that it utilizes state-of-the-art deep learning methods, it does not rely on constraints on the system dynamics, it can be utilized to propagate the latent dynamics without the need to transform back to the high-dimensional state space at every time step, and it can be applied in a generic way, augmenting the arsenal of models developed for multiscale problems [19]. The multiscale switching procedure is illustrated in Figure 14.
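The alternation between macro and micro dynamics can be sketched as follows; micro_solver(s, n) is an assumed interface that evolves the fine-scale state with the first-principles model and returns the n visited states:

```python
def led_multiscale(encoder, decoder, rnn, micro_solver, s0,
                   T_warm, T_m, T_mu, n_cycles):
    """Sketch of the multiscale switching of Figure 14: alternate between
    cheap latent propagation for T_m steps and the expensive fine-scale
    solver for T_mu steps; encoder, decoder, and rnn follow the sketches
    above."""
    s = s0
    fine_states = []
    for _ in range(n_cycles):
        # Micro dynamics: evolve the fine scale and use it to warm up the RNN.
        s_warm = micro_solver(s, T_warm)        # (T_warm, batch, N, d_x)
        # Macro dynamics: cheap latent propagation for T_m steps, then decode.
        s, _ = led_forecast(encoder, decoder, rnn, s_warm, T_m)
        # Micro correction: evolve the decoded state for T_mu fine-scale steps.
        s = micro_solver(s, T_mu)[-1]
        fine_states.append(s)
    return fine_states
```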
Figure 13: An illustration of the iterative latent space propagation in LED. The high-dimensional state of the dynamical system is passed through the encoder network ("restrict") for an initial period T_warm, during which the RNN is teacher forced. The low-dimensional latent representation z at the output of the encoder is provided as input to the RNN. Starting from the last latent state (z_t in the figure), we deploy the RNN to iteratively propagate the dynamics in the latent space up to a total horizon of T_m time steps, with T_m ≫ T_warm. The decoder can be utilized to upscale ("lift") the latent state back to the high-dimensional state of the dynamical system, i.e. ˜s_{t+T_m} = D_{w_D}(z_{t+T_m}). Because propagation in the latent space can be orders of magnitude cheaper than evolving the high-dimensional dynamics based on first principles, LED can accelerate simulations of dynamical systems. As a consequence, we can achieve longer simulation times, explore the state space faster, and resolve time and spatial scales that would otherwise be intractable.
Figure 14: Multiscale forecasting with LED (testing phase). By iteratively switching between the computationally cheap propagation in the latent space for a large time horizon T_m and a computationally expensive model based on first principles for T_µ, with T_warm, T_µ ≪ T_m, we can iteratively correct the statistical error and further increase the simulation horizon, without sacrificing the performance of the method, albeit at the cost of a reduction in the overall speed-up of LED compared to solely propagating in the latent space. The speed-up is considered here with respect to propagating the computationally expensive high-dimensional dynamics at the micro scale.
Figure 15: (a) A schematic illustration of the L1 difference between the histograms of two random variables Z_1 and Z_2. (b) The Wasserstein distance between the distributions, based on the inverse cumulative distribution function (CDF).

In this section, we elaborate on the metrics used to quantify the effectiveness of the proposed approach in capturing the dynamics and the state statistics of the systems under study. To quantify the prediction performance of the method on a deterministic system, we use the mean normalized absolute difference (MNAD). The Wasserstein distance (WD) and the L1-norm histogram distance (L1-NHD) are utilized to quantify the difference between two statistical objects (either state distributions or random variables).
Assume that a model is used to predict a spatiotemporal field y(x, t) at discrete space locations x_i and times t_j. Predicted values from a model (neural network, etc.) are denoted by ˜y, while the groundtruth (simulation of the equations with a solver based on first principles) is denoted by y. The normalized absolute difference (NAD) between the model output and the groundtruth is defined as

    NAD(t_j) = (1/N_x) Σ_{i=1}^{N_x} |y(x_i, t_j) − ˜y(x_i, t_j)| / (max_{i,j} y(x_i, t_j) − min_{i,j} y(x_i, t_j)),    (22)

where N_x is the dimensionality of the discretized state x. The NAD depends on the time t_j. The mean NAD (MNAD) is given by the mean over time of the NAD score, i.e.

    MNAD = (1/N_T) Σ_{j=1}^{N_T} NAD(t_j),    (23)

where N_T is the number of timesteps considered. The MNAD is used in the deterministic FitzHugh-Nagumo equation to measure the prediction error of LED and compare it with the state of the art.

The Wasserstein distance (WD) is a metric used to quantify the difference between the distribution functions of two random variables. It is defined as the integral of the absolute difference of the inverse cumulative distribution functions (CDFs) of the random variables. Assuming two random variables Z_1 and Z_2, with CDFs τ = F_{Z_1}(z) and F_{Z_2}(z), with τ ∈ [0, 1], the Wasserstein metric is defined as

    WD(Z_1, Z_2) = ∫_0^1 |F_{Z_1}^{−1}(τ) − F_{Z_2}^{−1}(τ)| dτ.    (24)

An illustration of the WD is given in Figure 15b. In high-dimensional problems, where the random variable is multivariate (a random vector), we report the mean WD of each variable after marginalization of all others.

In order to quantify the difference between the distributions of two multivariate random variables Z_1 and Z_2, we employ, in addition to the WD, the L1-norm histogram distance (L1-NHD).
We measure this metric based on the L1 norm of the difference between the normalized histograms of the random variables, computed on the same grid. An illustration of the L1 histogram distance is given in Figure 15a. The number of bins for the computation of the histograms is selected according to the Rice rule, N_bins = ⌈2 n^{1/3}⌉, where n is the number of observations in the sample z. The WD and the L1-NHD are used to measure the difference between the distributions of the state in the advection-diffusion equation in 1-D and 3-D (over a single run). Moreover, they are used to measure the difference between the distributions of the mean and variance of the state over multiple runs.
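Under these definitions, the metrics can be sketched in a few lines of NumPy/SciPy (the sample data at the end are placeholders):

```python
import numpy as np
from scipy.stats import wasserstein_distance  # 1-D WD of Equation (24)

def mnad(y_true, y_pred):
    """MNAD of Equations (22)-(23); y_true, y_pred have shape (N_T, N_x)."""
    span = y_true.max() - y_true.min()       # max/min over space and time
    nad = np.abs(y_true - y_pred).mean(axis=1) / span  # NAD(t_j) per timestep
    return nad.mean()

def l1_nhd(z1, z2):
    """L1-norm histogram distance on a common grid, with the Rice rule."""
    n_bins = int(np.ceil(2.0 * len(z1) ** (1.0 / 3.0)))
    lo = min(z1.min(), z2.min())
    hi = max(z1.max(), z2.max())
    h1, edges = np.histogram(z1, bins=n_bins, range=(lo, hi), density=True)
    h2, _ = np.histogram(z2, bins=n_bins, range=(lo, hi), density=True)
    return np.abs(h1 - h2).sum() * (edges[1] - edges[0])

rng = np.random.default_rng(0)
z1, z2 = rng.normal(0.0, 1.0, 1000), rng.normal(0.1, 1.0, 1000)
print(wasserstein_distance(z1, z2), l1_nhd(z1, z2))
```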
We evaluate the effectiveness of LED in modeling the stochastic dynamics of the advection-diffusion equation in one dimension in Section 8.1 and in three dimensions in Section 8.2. Moreover, in Section 8.3 we benchmark LED against the equation-free approaches proposed in [49], which are based on uncovering a PDE on the coarse level using diffusion maps, Gaussian processes, or neural networks, and using forward integration in the coarse representation, to model the FitzHugh-Nagumo equation. Last but not least, in Section 8.4 we apply LED to the challenging Kuramoto-Sivashinsky equation exhibiting deterministic chaos, unraveling a low-dimensional inertial manifold. All LED models are implemented in PyTorch, mapped to a single Nvidia Tesla P100 GPU, and executed on the XC50 compute nodes of the Piz Daint supercomputer at the Swiss National Supercomputing Centre (CSCS).
In this section, we evaluate LED on a system of N point particles on a bounded domain Ω = [−L/2, L/2]^{d_x}, where d_x is the dimension of the space, whose dynamics are determined by the advection-diffusion equation, modeled with the stochastic differential equation (SDE)

    dx_t = u(t) dt + √D dW_t,    (25)

where x_t ∈ Ω denotes the position of a particle at time t, D ∈ R is the diffusion coefficient, dW_t ∈ R^{d_x} is a Wiener process, and u(t) = A cos(ωt) ∈ R^{d_x} is a cosine advection (drift) term. Equation (25) is solved with initial condition x_0 = 0 and reflective boundary conditions, ensuring that x_t ∈ Ω for all t. The system is advanced in time with explicit Euler integration. We pick a domain size of L = 1. In this section, we consider the one-dimensional case, d_x = 1. We solve for N = 1000 particles, with a fixed diffusion coefficient D, advection amplitude A = 1, and advection frequency ω. The stochastic dynamics are solved with a fine timestep δt, and the particle positions are saved at a coarser timestep ∆t. Starting from different random initial conditions, we generate three datasets: a training set, a validation set, and a test set. We trained a VAE with a permutation-invariant input layer, a latent dimension d_z = 1, and a mixture density decoder to compress the dynamics. Then, we trained the RNN of LED to forecast the one-dimensional latent dynamics. The result of an iterative prediction on the test dataset is given in Figure 16.

Figure 16: LED applied to the one-dimensional advection-diffusion equation. The dimensionality of the latent space of the VAE is d_z = 1. LED successfully uncovered the cosine dynamics in the latent space, even though the original state describing the system is s_t ∈ R^N, with N = 1000. However, due to the iterative propagation of the latent dynamics with LED, a phase lag is introduced, causing the error to increase with time. This can be seen, for example, in the error on the estimated mean position (M1) of the particles. Nevertheless, the variance is captured, along with the long-term statistical behavior.
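The fine-scale data generation for Equation (25) can be sketched with explicit Euler(-Maruyama) steps and reflective walls; the numerical values of D, ω, δt, and the step count below are illustrative placeholders, not the values used in the paper:

```python
import numpy as np

def simulate_particles(N=1000, L=1.0, D=0.1, A=1.0, omega=0.1,
                       dt=1e-4, n_steps=10_000, seed=0):
    """Minimal sketch of the micro solver for Equation (25): explicit
    Euler(-Maruyama) steps with reflective walls at +/- L/2."""
    rng = np.random.default_rng(seed)
    x = np.zeros(N)                          # initial condition x_0 = 0
    half = L / 2
    for step in range(n_steps):
        t = step * dt
        drift = A * np.cos(omega * t)        # u(t) = A cos(omega t)
        x += drift * dt + np.sqrt(D * dt) * rng.standard_normal(N)
        # Reflective boundaries: fold excursions back into the domain.
        x = np.where(x > half, 2 * half - x, x)
        x = np.where(x < -half, -2 * half - x, x)
    return x
```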
The warm-up period of LED is set to T_warm = 60. The network is fed an initial configuration of particles s_t = [x_t^1, . . . , x_t^N]^T ∈ R^N, compresses the configuration to a single number z_t ∈ R, and utilizes the RNN to forecast the latent dynamics z_{t+1}. From this latent state, the MD decoder of LED samples the next configuration s_{t+1}. LED successfully uncovered the inherent cosine dynamics of the stochastic movement of the particles, even though the dimensionality of the reduced-order space was very low (d_z = 1). We plot the L1 histogram error between the predicted and true density of particle positions, along with the Wasserstein distance and the errors on the first two moments of the particle positions, as functions of time. From the low-dimensional latent space, the decoder of LED learned to map to the positions of the N = 1000 particles, as depicted by the low error on the densities. As expected, due to the iterative propagation of the model error, the prediction diverges and there is a phase lag between the predicted and the true dynamics. This causes an increase of the error on the estimated mean position of the particles (M1) and an error on the variance (M2). Even though the error on the variance is kept low, the error on M1 increases as we iteratively forecast over a large horizon. The statistics of the system are reproduced accurately, as depicted in Figure 17, where we plot the true and predicted distributions of positions over all timesteps.

Regarding the hyperparameters of LED, we tune them based on a grid search reported in Section 8.1.2. The inputs to the network are scaled to [0, 1], while adding noise to the input did not improve performance in this case. Performance is measured in terms of the maximum data likelihood for the MD autoencoder and the minimum root mean square error for the RNN part, which are trained independently. The autoencoder with the highest likelihood in our grid search turned out to be a VAE with a permutation-invariant input layer (function φ in Figure 10) consisting of residual layers with tanh activations, mapping to a permutation-invariant feature space of dimension M = 200 with a mean feature function, followed by an additional residual autoencoder (modeling function g in Figure 10) with tanh activations, reducing the dimensionality to a latent space of dimension d_z = 1. Finally, the decoder is composed of additional residual layers and a mixture density output layer providing the parameters of the mixing coefficients, the means, and the covariance matrices of the kernels. A configuration can be sampled from this representation in a trivial way. Regarding the RNN part of LED, a stateful LSTM with one layer, trained with truncated BPTT, exhibited the lowest validation error.
Figure 17: (a) Low-order (d_z = 1) latent dynamics uncovered by LED; the plot shows the latent state z_{t+1} as a function of z_t, colored according to the density of the values. We observe that the network successfully unraveled the two quasi-stable states (particles concentrating on the left or right wall due to the cosine advection term). (b) True distribution of the particle position values, and the one predicted by LED.

In the following, we employ LED to perform multiscale forecasting, switching between propagation of the latent dynamics for T_m = 450 and evolution of the particle dynamics for T_µ = 50, leading to a multiscale ratio of ρ = T_m/T_µ = 9. The initial warm-up period of LED is T_warm = 60. The iterative propagation on the test set is plotted in Figure 18. We observe that by switching to the particle configuration and evolving the dynamics, we alleviate the iterative error propagation, at the additional cost of propagating the original system dynamics (Brownian motion of the N = 1000 particles) for a fraction of the total time. The statistics of the system are captured well, as demonstrated by the low L1 histogram error and the low errors on the mean (M1) and variance (M2) of the particle positions. For a more detailed statistical analysis of the effectiveness of the proposed approach in modeling the dynamics of the advection-diffusion system, please refer to Section 8.1.1.
Figure 18: Multiscale LED applied to the one-dimensional advection-diffusion equation, switching between propagation of the latent space dynamics for T_m = 450 and evolution of the particle dynamics for T_µ = 50. The dimensionality of the latent space of the VAE is d_z = 1. The particle dynamics correct the phase-lag error introduced by the iterative propagation in the latent space.
In this section, we perform a statistical analysis to evaluate the performance of LED in forecasting the stochastic dynamics of a group of N = 1000 particles whose dynamics are governed by the advection-diffusion equation. Starting from different initial conditions, we propagate the dynamics with LED multiple times, changing the random seed, up to a final time of T_f = 4000. Since LED samples from the MD decoder, every run results in a different evolution of the configuration, as expected. For every run, we calculate the first two moments of the configuration, M1 and M2, at each timestep. In order to evaluate the statistical error, we compare the predicted distributions of M1 and M2 with the groundtruth ones, by evaluating their L1 histogram error and WD. The evolution of these errors as a function of time is plotted in Figure 19 for four variants of LED with different T_m. The initial warm-up period of LED is T_warm = 60. Note that as we decrease T_m, i.e. the time spent evolving the latent dynamics of LED, all error metrics decrease.

Figure 19: Starting from different initial conditions, we propagate the dynamics of the advection-diffusion equation in one dimension (d_x = 1) with LED multiple times, changing the seed, up to a final time of T_f = 4000. For every run, we calculate the first two moments of the configuration, M1 and M2, at each timestep. We plot the statistical errors (L1 histogram error and Wasserstein distance) between the predicted distributions of M1/M2 and the groundtruth, as functions of time, comparing iterative latent stochastic forecasting with multiscale stochastic forecasting for T_µ = 50 and T_m ∈ {450, 250, 50}. The mean over all initial conditions is reported.

In Figure 20, we plot how the errors vary with respect to the ratio ρ = T_m/T_µ between the time spent on the macro dynamics (the LED latent state) and on the micro dynamics. The mode where LED propagates solely on its latent space is denoted as "Latent". The Latent propagation exhibits the highest errors in all metrics. However, propagation on the latent state is two orders of magnitude faster than the micro dynamics. This speed-up is reduced as we spend more time on the micro state, alternating between latent propagation and the micro solver (decreasing ρ), gaining a reduction of the errors in all metrics.
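The evaluation protocol can be sketched as follows, assuming arrays of repeated LED and groundtruth runs (the array shapes are illustrative):

```python
import numpy as np
from scipy.stats import wasserstein_distance

def moment_errors(runs_pred, runs_true):
    """Sketch of the statistical evaluation described above: runs_* hold
    particle positions of repeated runs, shape (n_runs, n_steps, N). For
    every timestep, the across-run distributions of the first two moments
    (M1, M2) of the predicted and groundtruth configurations are compared
    with the Wasserstein distance; the l1_nhd of the metrics sketch above
    can be used in the same way."""
    m1_p, m1_t = runs_pred.mean(axis=2), runs_true.mean(axis=2)  # (runs, steps)
    m2_p, m2_t = runs_pred.var(axis=2), runs_true.var(axis=2)
    n_steps = m1_p.shape[1]
    wd_m1 = np.array([wasserstein_distance(m1_p[:, t], m1_t[:, t])
                      for t in range(n_steps)])
    wd_m2 = np.array([wasserstein_distance(m2_p[:, t], m2_t[:, t])
                      for t in range(n_steps)])
    return wd_m1, wd_m2
```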
Figure 20: Starting from different initial conditions, we propagate the dynamics of the advection-diffusion equation in one dimension (d_x = 1) with LED multiple times, changing the seed, up to a final time of T_f = 4000. For every run, we calculate the first two moments of the configuration, M1 and M2, at each timestep. The bars show the mean (over time and initial conditions) statistical errors (L1 histogram error and Wasserstein distance) between the predicted distributions of M1/M2 and the groundtruth. The self-similar error [58] is plotted for reference, as errors below this level are statistically insignificant. Due to the iterative prediction in the latent space, the LED variant that does not evolve the dynamics on the particle level at all (denoted as "Latent") exhibits the highest errors in all metrics. However, the Latent LED is more than two orders of magnitude faster than the micro solver. By spending a reference time T_µ on the fine-scale dynamics (particles), we achieve a significant error reduction in all metrics, albeit at the cost of reduced speed-up. On the other hand, as the multiscale ratio ρ is increased, spending more time on latent propagation, the errors gradually increase.
We tuned the hyperparameters of LED based on a grid search, considering the values given in Table 1 for the autoencoder and in Table 2 for the RNN.

Table 1: Autoencoder hyperparameters for Advection-Diffusion in 1-D (d_x = 1)
    Number of AE layers: { }
    Size of AE layers: { }
    Activation of AE layers: tanh(·)
    Latent dimension: { }
    Residual connections: True
    Variational: True/False
    Permutation invariant layer d_p: { }
    Number of MD kernels K: { }
    Hidden units of MD decoder: { }
    Input/output data scaling: min-max in [0, 1]
    Noise level in the data: { }
    Weight decay rate: { }
    Batch size: { }
    Initial learning rate: { }

Table 2: LED-RNN hyperparameters for Advection-Diffusion in 1-D (d_x = 1)
    Number of AE layers: { }
    Size of AE layers: { }
    Activation of AE layers: tanh(·)
    Latent dimension: { }
    Residual connections: True
    Variational: True/False
    Permutation invariant layer d_p: { }
    Number of MD kernels K: { }
    Hidden units of MD decoder: { }
    Input/output data scaling: min-max in [0, 1]
    Noise level in the data: { }
    Weight decay rate: { }
    Batch size: { }
    Initial learning rate: { }
    BPTT sequence length: { }
    Hidden state propagation length: { }
    RNN cell type: {lstm, gru}
    Number of RNN layers: { }
    Size of RNN layers: { }
    Activation of RNN cell: tanh(·)
In this section, we test LED on capturing the intrinsic dynamics of a set of N = 1000 particles moving according to the advection-diffusion equation, given in Equation (25), in three-dimensional space. We consider a fixed diffusion coefficient D and three-dimensional advection parameters A and ω. We solve the stochastic equations starting from a random initial condition with a fine timestep δt. By saving the positions of the particles at a coarser timestep ∆t and starting from different random initial conditions, we generate three datasets: a training set, a validation set, and a test set. The state of the system is given by the positions of the N particles,

    s_t = \begin{bmatrix} x_{1,t}^1 & x_{1,t}^2 & x_{1,t}^3 \\ \vdots & \vdots & \vdots \\ x_{N,t}^1 & x_{N,t}^2 & x_{N,t}^3 \end{bmatrix} ∈ R^{N × d_x},    (26)

where x_{n,t}^k denotes the position of particle n at time t along the k-th axis of the three-dimensional space, with d_x = 3 and N = 1000. In this case, we pick d_z = 8. We tune the hyperparameters of LED, ending up with the following architecture: the φ function (permutation-invariant layer, Figure 10) consists of residual layers with tanh activations, the permutation-invariant feature space has dimension M = 200 with a mean feature function (f in Figure 10), and the g network is a residual network with tanh activations, reducing the dimensionality to the desired latent state of dimension d_z = 8. The decoder is variational and is composed of residual layers and a mixture density output layer providing the parameters of the mixing coefficients, the means, and the covariance matrices of the kernels. The RNN propagating the dynamics in the latent space is composed of one stateful LSTM layer and was trained with truncated BPTT. For more information on the hyperparameter tuning of the architectures, the interested reader is referred to Section 8.2.2. An example of the evolution of the latent state, the errors on the first two moments, and the Wasserstein distance between the groundtruth and predicted distributions of particle positions, in an iterative prediction on the test data, is shown in Figure 21a. The initial warm-up period of LED is set to T_warm = 60. All metrics are averaged over the three dimensions. We observe that LED captures the variance of the particle positions, but due to the iterative error propagation the error on the distribution (mean and Wasserstein distance) increases with time.

In Figures 22a to 22c we plot the evolution of the PCA modes of the latent dynamics z_t of LED. The plots are colored according to the density. As clearly depicted in Figure 22b, the latent state exhibits "meta" states, in which the coarse dynamics remain for most of the time, iteratively jumping between them. We cluster the PCA data of the latent state, identifying six metastable states, and identify the corresponding particle configuration for each state using the mixture density decoder in Figure 23. LED captures the distribution of the positions of the particles, as depicted in Figure 22d.

In the following, we utilize multiscale forecasting, switching between latent propagation for T_m = 250 and evolution of the AD dynamics for T_µ = 50. Note that the initial warm-up period of LED is T_warm = 60. The evolution of the latent state, the errors on the mean and variance of the predicted particle positions, and the Wasserstein distance of their distribution, in a single iterative forecast on the test dataset, are given in Figure 21b.
Indeed, multiscale propagation alleviates the problem of iterative error propagation, and evolving the high-dimensional AD particle dynamics for T_µ = 50 corrects the error on the statistics. In Figure 24, we plot the LED predictions of the particle configurations at selected time instants.

We perform a statistical analysis to evaluate the performance of LED in forecasting the stochastic dynamics. Starting from different initial conditions, we propagate the dynamics with LED multiple times per initial condition (different runs starting from the same initial condition), changing the seed, up to a final time of T_f = 4000. Since LED samples from the MDN at the decoder output, every run results in a different evolution of the configuration, as expected. For every run, we calculate the first two moments of the density of particle positions, M1 and M2 (averaged over the three dimensions), at each timestep. In order to evaluate the statistical error, we compare the predicted distributions of M1 and M2 with the groundtruth ones, by evaluating their L1 histogram error and WD. The evolution of these errors as a function of time is plotted in Figure 25 for variants of LED with different T_m. LED with T_µ = 0 corresponds to iterative propagation in the latent space, without any evolution of the particle dynamics; in this case the distribution of the predicted M1 eventually diverges, although the distribution of M2 is captured accurately. By spending T_m = 900 in the latent space and T_µ = 50 in the original dynamics, we achieve a reduction of the error in the distribution of M1, as denoted by the low value of the Wasserstein metric and the lower L1 histogram error. Note that as we decrease the time spent evolving the latent dynamics (T_m), the errors become smaller in all metrics, at the cost of evolving the high-dimensional particle dynamics.
(a) LED T_µ = 0. (b) LED T_m = 250, T_µ = 50, ρ = 5. (Panels show the latent state evolution and the Wasserstein distance, M1, and M2 errors over time t.)
Figure 21: (a)
LED applied to the three-dimensional advection-diffusion equation, iteratively forecasting the evolution of the particles starting from an initial condition in the test data. The initial warm-up period of LED is set to T_warm = 60. A variational autoencoder with a permutation-invariant input layer and a latent dimension of d_z = 8 is utilized to coarse-grain the high-dimensional dynamics. The decoder of LED maps from the latent space to the particle configuration using an MD decoder. We plot the evolution of the latent state in time, along with the Wasserstein distance between the predicted and groundtruth particle distributions and the absolute errors on the mean and the standard deviation of the particle distributions. LED forecasts the evolution of the particle positions with low error, even though the original state describing the configuration of the N = 1000 particles of the system is s_t ∈ R^{1000 × 3}. The network learned a d_z = 8 dimensional coarse-grained representation of this configuration. However, due to the iterative prediction with LED, the error on the predicted distribution of particles increases with time. This can be observed in the increasing error on the estimated mean position (M1) of the particles. Nevertheless, the variance is captured, along with the long-term statistical behavior. (b) Multiscale propagation in LED. To alleviate the iterative error propagation, multiscale propagation is utilized with T_m = 250, T_µ = 50, ρ = 5. Due to the iterative switching between propagation in the latent space z_t of LED for T_m and evolution of the particle dynamics describing the system state for T_µ, the effect of iterative statistical error propagation is alleviated. Indeed, the errors on the mean (M1) and standard deviation (M2) do not increase significantly with time, and the long-term statistical behavior is accurately captured.
(a)-(c) Pairs of PCA modes of the latent dynamics on the test data. (d) Target and predicted particle position densities for LED with T_m = 250, T_µ = 50, ρ = 5.
Figure 22: (a)
Plot of the evolution of the first PCA mode of the latent state z_t ∈ R^{d_z}, with d_z = 8, plotted against the second mode; every mark is a point in time of the iterative prediction, colored according to the density. (b) First PCA mode against the third. (c) Third PCA mode against the second. (d) True distribution of the particle position values, and the one predicted by LED. The density of the predicted particle positions closely matches the true one.

In Figure 26 we plot the statistical errors averaged over time, demonstrating that as the ratio ρ = T_m/T_µ between macro and micro dynamics decreases, the errors on the statistics become smaller. The LED variant that propagates solely on the latent space (T_µ = 0), denoted as "Latent", is almost two orders of magnitude faster than the micro-scale solver. As ρ decreases, the statistical error decreases, but this comes at the cost of reduced speed-up, introduced by the evolution of the high-dimensional micro dynamics for T_µ.
(a)-(c) Pairs of PCA modes of the latent dynamics colored by cluster, with the corresponding particle configurations.
Figure 23: Spectral clustering on the PCA modes of the latent space. The six clusters correspond to different metastable states; the particles transition between these states. For each cluster identified in the latent space, the corresponding particle configuration is shown.
Figure 24: Starting from an initial condition from the test dataset plotted in (a) at t = 0, we evolve the particles of the advection-diffusion equation, plotting the state of the system at two later time instants, (b) t = 500 and (c) t = 875. Two variants of LED with latent state dimension d_z = 8 are utilized to forecast the evolution of the stochastic dynamics: one with T_µ = 0 in (a)-(c), and one with T_m = 250, T_µ = 50, ρ = T_m/T_µ = 5 in (d)-(f). LED with T_µ = 0 does not switch between propagation in the latent space and evolution of the original particle dynamics in the high-dimensional state space. Due to the iterative propagation of the error, it cannot capture the state of the system at the final time t = 875, as seen in plot (c). In contrast, the iterative switching of the second variant alleviates the problem, and the statistics are captured, as seen in plot (f). Note that the initial warm-up period of LED, set to T_warm = 60, is not visible here; time t = 0 is immediately after the warm-up period, when the iterative forecasting in the latent space starts.
Figure 25: Starting from different initial conditions, we forecast the dynamics with LED, restarting from each initial condition multiple times, changing the seed each time, up to a final time of T_f = 4000. For every run, we calculate the first two moments of the configuration, M1 and M2, at each timestep. We plot the statistical errors (L1 histogram error and Wasserstein distance) between the predicted distributions of M1/M2 and the groundtruth, as functions of time, comparing iterative latent stochastic forecasting with multiscale stochastic forecasting for T_µ = 50 and T_m ∈ {450, 250}. The mean over all initial conditions is reported. We observe that by iteratively switching between propagation in the LED latent state and evolution of the particle dynamics, LED captures the statistics more accurately, at the computational cost of evolving the high-dimensional micro dynamics.
Figure 26: Starting from different initial conditions, we propagate the dynamics with LED, restarting from each initial condition multiple times, changing the seed each time, up to a final time of T_f = 4000. For every run, we calculate the first two moments of the configuration, M1 and M2, at each timestep. The bars show the mean (over time and initial conditions) statistical errors (L1 histogram error and Wasserstein distance) between the predicted distributions of M1/M2 and the groundtruth. The self-similar error [58] is plotted for reference, as errors below this level are statistically insignificant. Due to the iterative prediction in the latent space, the LED variant that does not evolve the dynamics on the particle level at all (denoted as "Latent") exhibits the highest errors in all metrics. However, the Latent LED is two orders of magnitude faster than the micro solver. By spending a reference time T_µ = 50 on the fine-scale dynamics (particles), we achieve a significant error reduction in all metrics, at the cost of reduced speed-up. As the multiscale ratio ρ = T_m/T_µ is increased, spending more time on latent propagation, the errors gradually increase.
The hyperparameters of LED are given in Table 3 for the autoencoder and in Table 4 for the RNN.

Table 3: Autoencoder hyperparameters for Advection-Diffusion in 3-D (d_x = 3)
    Number of AE layers: { }
    Size of AE layers: { }
    Activation of AE layers: tanh(·)
    Latent dimension: { }
    Residual connections: True
    Variational: True/False
    Permutation invariant layer d_p: { }
    Number of MD kernels K: { }
    Hidden units of MD decoder: { }
    Input/output data scaling: min-max in [0, 1]
    Noise level in the data: { }
    Weight decay rate: { }
    Batch size: { }
    Initial learning rate: { }

Table 4: LED-RNN hyperparameters for Advection-Diffusion in 3-D (d_x = 3)
    Number of AE layers: { }
    Size of AE layers: { }
    Activation of AE layers: tanh(·)
    Latent dimension: { }
    Residual connections: True
    Variational: True/False
    Permutation invariant layer d_p: { }
    Number of MD kernels K: { }
    Hidden units of MD decoder: { }
    Input/output data scaling: min-max in [0, 1]
    Noise level in the data: { }
    Weight decay rate: { }
    Batch size: { }
    Initial learning rate: { }
    BPTT sequence length: { }
    Hidden state propagation length: { }
    RNN cell type: {lstm, gru}
    Number of RNN layers: { }
    Size of RNN layers: { }
    Activation of RNN cell: tanh(·)
In this section, we provide additional results on the generalization of LED to a different number of particles in the simulation. Due to the permutation-invariant encoder coarse-graining the high-dimensional input of LED, we expect the network to generalize to a different number of particles, since the identified coarse representation should rely on global statistical quantities and not depend on individual positions. We utilize the network trained on configurations of N = 1000 particles to forecast the evolution of N = 400 particles evolving according to the advection-diffusion equation. The propagation of the errors is plotted in Figure 27. The high-dimensional state at three time instants, as predicted by LED propagating only the latent state and by a variant with T_m = 250, T_µ = 50, ρ = 5, is plotted in Figure 28, starting from an initial condition in the test data. The initial warm-up period of LED is set to T_warm = 60 for all variants. We observe an excellent generalization ability of the network.
(a) LED T_µ = 0. (Panels show the Wasserstein distance and the M1 and M2 errors over time t.)
(b) LED T_m = 250, T_µ = 50, ρ = 5.
Figure 27: LED trained on particle configurations with N = 1000 particles learned a d_z = 8 dimensional coarse-grained representation of the configuration. We utilize two models, one with T_µ = 0 (iterative latent propagation) and one with T_m = 250, T_µ = 50, ρ = 5 (multiscale forecasting), to forecast the evolution of a particle configuration composed of N = 400 particles, testing the generalization ability of the model. The initial warm-up period is set to T_warm = 60. We plot the latent space, the Wasserstein distance between the densities of the particle positions, and the errors on the first two moments, for both variants of LED. We observe that LED is able to successfully generalize to the case of N = 400 particles.
Figure 28: We employ a network trained with N = 1000 particles to forecast the evolution of N = 400 particles, testing the generalization ability of LED. Starting from an initial condition from the test dataset plotted in (a) at t = 0, we evolve the particles of the advection-diffusion equation, plotting the state of the system at two later time instants, (b) t = 500 and (c) t = 875. Two variants of LED with latent state dimension d_z = 8 are utilized to forecast the evolution of the stochastic dynamics: one with T_µ = 0 in (a)-(c), and one with T_m = 250, T_µ = 50, ρ = T_m/T_µ = 5 in (d)-(f). The initial warm-up period is set to T_warm = 60 for both variants. LED with T_µ = 0 does not switch between propagation in the latent space and evolution of the original particle dynamics in the high-dimensional state space. Due to the iterative propagation of the error, it cannot capture the state of the system at the final time t = 875, as seen in plot (c). In contrast, the iterative switching of the second variant alleviates the problem, and the statistics are captured, as seen in plot (f). Even though the network is trained on N = 1000 particle configurations, it is able to generalize to a different number of particles.
In the following study, we evaluate how the diffusion and advection coefficients affect the efficiency of LED. We set the advection parameters A and ω such that the amplitude of the sinusoidal advection term is |A| = 2. We utilize the dimensionless Péclet number to quantify the rate of advection relative to the rate of diffusion, i.e.

    Pe = L U / D,    (27)

where we set the characteristic length equal to the domain length L = 1, U = |A| = 2 is the amplitude of the advection, and D is the diffusion coefficient. We consider three values of the diffusion coefficient, D ∈ {0.02, 0.2, 2}, leading to Péclet numbers Pe ∈ {100, 10, 1}. In order to select the time step for each case, we take the following considerations into account. The velocity of each particle has two parts, one due to advection and one due to diffusion. The incremental steps from each part should be much smaller than the domain length, leading to U δt ≪ L and √(D δt) ≪ L. The constraints on δt are thus δt ≪ L/U and δt ≪ L²/D. As the diffusion coefficient is geometrically increased, the time step has to be geometrically decreased to achieve the same resolution with respect to the domain size L. Based on these arguments, we chose geometrically decreasing timesteps for the micro solver for the respective values of D. The LED models considered in this study are trained on data sampled with a coarse timestep ∆t = 1. The latent dimension is kept at d_z = 8, and the RNN of LED is composed of one LSTM layer. The rest of the hyperparameters are reported in Section 8.2.2. The results of the analysis are reported in Figure 29.

Figure 29: For each case (Pe ∈ {100, 10, 1}), starting from a random initial condition, we propagate the dynamics with LED, restarting multiple times with different seeds, up to a final time of T_f = 3000 (equivalent to 3000 steps, as ∆t = 1). For every run, we calculate the first two moments of the configuration, M1 and M2, at each timestep. We plot the mean (over time and initial conditions) statistical error (Wasserstein distance) between the predicted distributions of M1/M2 and the groundtruth. We observe that as the Péclet number increases, the errors on both distributions (M1 and M2) increase. In all cases, LED captures the effective dynamics of the system, as demonstrated by the low errors. Moreover, the latent propagation of LED is orders of magnitude faster than the micro-scale solver. For lower Péclet numbers, the time step needed to resolve the diffusive effects renders the micro simulator computationally expensive; thus, the computational savings from employing the latent propagation of LED are greater for lower Péclet numbers.
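The arithmetic of this time-step argument can be made concrete in a small sketch; the D values follow from the reported Péclet numbers via Pe = LU/D, and the printed quantities are only the constraints, not the timesteps chosen in the paper:

```python
def micro_constraints(L=1.0, U=2.0, D=0.02):
    """Sketch of the time-step argument above: advection requires
    dt << L/U, diffusion requires dt << L**2/D; Equation (27) gives Pe."""
    return L * U / D, L / U, L ** 2 / D

for D in (0.02, 0.2, 2.0):
    pe, dt_adv, dt_diff = micro_constraints(D=D)
    print(f"D={D}: Pe={pe:.0f}, dt << min({dt_adv}, {dt_diff})")
```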
In the following, we evaluate LED in capturing the low-dimensional intrinsic dynamics of the FitzHugh-Nagumo (FHN) model [59, 60] in one spatial dimension and compare its efficiency with the coarse-graining approach of [49]. The FitzHugh-Nagumo model describes the governing equations for the evolution of two spatial densities, an activator density u(x, t) = ρ_ac(x, t) and an inhibitor density v(x, t) = ρ_in(x, t). The governing equations read

    ∂u/∂t = D_u ∂²u/∂x² + u − u³ − v,
    ∂v/∂t = D_v ∂²v/∂x² + ε (u − α_1 v − α_0).    (28)

The two densities evolve on different timescales: the activator density is considered the "fast" variable, the inhibitor density the "slow" variable. The bifurcation parameter ε controls the difference in timescales between the two. D_u and D_v are the diffusion coefficients of the activator and the inhibitor, respectively. We set the coefficients α_1 = 2 and α_0 to a small negative value. We discretize Equation (28) on the domain x ∈ [0, L], with domain size L = 20 and N = 101 grid points; the spatial step size is thus δx = L/N ≈ 0.2. The diffusion coefficients of the activator and the inhibitor are set to D_u = 1 and D_v = 4, respectively. We solve the equations using the Lattice Boltzmann (LB) method [61]; the interested reader is referred to Section 8.3.1 for the implementation details. The mesoscopic solution obtained by LB is considered our fine-grained solution, in agreement with [49]. The time step of the Lattice Boltzmann method is δt = 0.005, and the bifurcation parameter ε is fixed to a small positive value. In agreement with [49], we gather data starting from different initial conditions. We subsample the data, keeping every 200th data point, leading to time series with 451 points separated in time by ∆t = 1. We consider 5 initial conditions for training purposes and one initial condition for testing.

In [49], a coarse-graining framework is introduced, based on the identification of PDEs on the coarse scale using either neural networks (NN) or Gaussian processes (GP). The two variants of the method are referred to as CSPDE-NN and CSPDE-GP, respectively. An improvement in the efficiency of the methods is achieved by augmenting the fine-scale observations from the LB solver with features uncovered by diffusion maps (DM). Three such features are included in [49], referred to as F1, F2, and F3 in this work. We train the LED framework with various latent state dimensions to capture the low-order intrinsic dimensionality of the dynamics.
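The macroscopic dynamics of Equation (28) can be illustrated with a plain explicit finite-difference sketch; the paper evolves the fine scale with the Lattice Boltzmann solver, and the values of eps and a0, as well as the periodic boundary handling, are illustrative assumptions:

```python
import numpy as np

def fhn_step(u, v, dx, dt, Du=1.0, Dv=4.0, eps=0.006, a0=-0.03, a1=2.0):
    """One explicit (FTCS) step of Equation (28). This is not the paper's
    Lattice Boltzmann solver; eps and a0 are placeholder values, and the
    Laplacian below assumes periodic boundaries for simplicity."""
    lap = lambda f: (np.roll(f, 1) - 2 * f + np.roll(f, -1)) / dx ** 2
    u_new = u + dt * (Du * lap(u) + u - u ** 3 - v)
    v_new = v + dt * (Dv * lap(v) + eps * (u - a1 * v - a0))
    return u_new, v_new

L, N = 20.0, 101
x = np.linspace(0.0, L, N)
u, v = np.sin(2 * np.pi * x / L), np.zeros(N)
for _ in range(1000):                 # dt respects the diffusive CFL bound
    u, v = fhn_step(u, v, dx=L / N, dt=0.001)
```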
Figure 30: (a) MSE loss on the test dataset plotted as a function of the latent state dimension. (b) Logarithm of the MSE loss on the test dataset. We observe that for both the AE and the VAE a latent dimension of d_z = 8 is enough to capture most of the information in the data and reproduce the state evolution accurately. The gain in MSE loss from using a higher latent state dimension is negligible, providing evidence that we approximately captured the dimensionality of the effective dynamics of the system.

We utilize AEs and VAEs with three layers of 100 nodes each. For more information on the tuned hyperparameters, refer to Section 8.3.2. The reconstruction error on the test dataset is plotted with respect to the latent state dimension in Figure 30. We find that an autoencoder with latent dimension d_z = 8 is able to capture the evolution of the densities, achieving a low reconstruction error. Further increasing the latent dimension does not lead to a significant reduction of the reconstruction MSE. Using the autoencoders with d_z = 8, we train the RNN. The tuning of the RNN hyperparameters is reported in Section 8.3.2. The lowest error on the validation data was achieved by an LSTM with d_h = 50 hidden nodes, trained with truncated BPTT. We compute the MNAD defined in Equation (23), where N_T = 451 and N_x = 101, the spatiotemporal field y is the result of the LB simulation, and ˜y is the forecast of the models. We consider the MNAD on both the activator and the inhibitor density on the testing data. The MNAD comparison between LED and the CSPDE framework [49] is given in Figure 31. The initial LED warm-up period is set to T_warm = 75. Plots of the prediction on the test data with LED are given in Figures 32 and 33.
Figure 31: Comparison of LED and CSPDEs [49] models on the FHN equation, on the same testing data. (a) Mean normalised absolute difference (MNAD) on the activator density u(x, t). (b) MNAD on the inhibitor density v(x, t). The LED variants (AE and VAE with d_z ∈ {4, 8, 32}) achieve MNAD values in the range 2.3–6.3 · 10⁻³, whereas the CSPDE variants of [49] (with and without the features F1–F3) lie in the range 1.5–3.3 · 10⁻². We observe that LED outperforms all CSPDEs variants.
Figure 32: Surface plots of the prediction of the dynamics of the FHN equation on the test dataset using LED. (a) Target, LED prediction, and associated absolute error on the activator density u(x, t). (b) Target, LED prediction, and associated absolute error on the inhibitor density v(x, t).
Figure 33: Contour plots of the dynamics of the FHN equation on the test dataset using LED. (a) Target, LED prediction, and associated absolute error on the activator density u(x, t). (b) Target, LED prediction, and associated absolute error on the inhibitor density v(x, t).
LED identified the intrinsic dynamics and is able to propagate them in time. This iterative propagation introduces an approximation error that accumulates over time. However, the reduced order propagation of LED can be computationally orders of magnitude less expensive than evolving the fine-grained dynamics, which in this case are modeled with LB using a small time-step. We define the speed-up as S = T_fine / T_LED, where T_LED is the computational time needed on average to propagate the dynamics for one coarse time unit using LED, while T_fine is the time needed to evolve the dynamics for one coarse time unit using the LB solver on the fine scale. In the FHN we used ∆t = 1 as the coarse time unit. The speed-up achieved with LED is approximately two orders of magnitude, albeit at the cost of the approximation error. This speed-up can be decisive in accelerating simulations and reaching much larger time-scales. By iteratively alternating between the coarse-grained dynamics of LED and the high-dimensional dynamics (propagation using the solver on the fine scale) we can reduce the approximation error, at the cost of a reduced speed-up. In the following, we fix the time spent on the fine scale at T_µ = 10. Note that the initial LED warm-up period is T_warm = 75. Starting from different initial conditions on the test data, LED is simulated up to the final time T_f = 8000, considering different values for the time T_m spent at the coarse level (propagating the latent dynamics of LED). The MNAD on the activator and inhibitor densities, as well as the achieved speed-up, are plotted as functions of the macro-to-micro ratio ρ = T_m/T_µ in Figure 34. The evolution of the NAD averaged over the initial conditions is plotted in Figure 35a for the activator density and in Figure 35b for the inhibitor. The results for T_µ = 0 are denoted with the label "Latent".
Figure 34: Starting from different initial conditions in the test data, the LB method is utilized to compute the FHN dynamics up to a large horizon T_f = 8000, approximately 18 times larger than the time horizon of the training data. The evolution of the activator and inhibitor densities obtained by the LB solver is considered the ground-truth evolution. Using the LED framework, either with full iterative propagation on the reduced order latent space (denoted as "Latent") or alternating between macro-dynamics for T_m and high-dimensional dynamics for T_µ, we approximate the evolution. We fix T_µ = 10. The MNAD error between the predicted and ground-truth evolution of the densities is plotted as a function of the macro-to-micro ratio ρ = T_m/T_µ in (a) for the activator density and in (b) for the inhibitor density. In (c) the speed-up compared to the LB solver is plotted w.r.t. ρ. In (d) the logarithm of the speed-up is plotted w.r.t. ρ.
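A schematic sketch of this macro–micro alternation; the encode/latent_step/decode routines stand in for the trained LED components and fine_step for one coarse unit of the LB solver (all four names are hypothetical placeholders):

```python
def led_multiscale_forecast(x0, T_f, T_m, T_mu, dt_coarse,
                            encode, latent_step, decode, fine_step):
    """Alternate between LED latent propagation for T_m time units and
    the fine-scale solver for T_mu time units, up to final time T_f."""
    x, t, trajectory = x0, 0.0, [x0]
    while t < T_f:
        z = encode(x)
        for _ in range(int(T_m / dt_coarse)):   # macro: latent dynamics
            z = latent_step(z)
        x = decode(z)
        for _ in range(int(T_mu / dt_coarse)):  # micro: fine-scale solver
            x = fine_step(x)
        t += T_m + T_mu
        trajectory.append(x)
    return trajectory
```

Setting T_mu = 0 recovers the fully latent ("Latent") variant, while larger ρ = T_m/T_µ trades accuracy for speed-up.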
For T_m = T_µ = 10 (ρ = 1), we observe that the MNAD is reduced compared to propagating the dynamics only on the coarse scale, for both the activator and the inhibitor densities; however, the speed-up is also reduced. By increasing T_m ∈ {50, 100, 200}, we obtain the intermediate regimes between propagation of the computationally expensive (and possibly intractable) high-dimensional system dynamics and full iterative propagation on the latent space. As we increase T_m (increase ρ), the speed-up increases, as the reduced order dynamics are used for a larger fraction of the time, albeit at the cost of an increasing error.
Figure 35: Evolution of the normalized absolute difference (NAD) as a function of time, for different values of the macro-to-micro ratio ρ (T_m ∈ {10, 50, 100, 200} with T_µ = 10, and the fully latent iterative variant). (a) Evolution of the NAD on the activator density. (b) Evolution of the NAD on the inhibitor density. The average over different initial conditions from the test data is reported.
In this section we briefly describe the Lattice Boltzmann (LB) method [61] utilized to obtain data from the FitzHugh-Nagumo model (FHN). In our study, the data from the meso-scale LB method are considered the high-dimensional fine-grained dynamics. The FHN system is described by the equations
$$\frac{\partial u}{\partial t} = D_u \frac{\partial^2 u}{\partial x^2} + u - u^3 - v, \qquad \frac{\partial v}{\partial t} = D_v \frac{\partial^2 v}{\partial x^2} + \epsilon\, (u - \alpha_1 v - \alpha_0), \qquad (29)$$
where we denote with u(x, t) = ρ_ac(x, t) the density of the activator and with v(x, t) = ρ_in(x, t) the density of the inhibitor, while the dependency of u and v on x and t is omitted from Equation (29) for brevity. D_u and D_v are the diffusion coefficients of the activator and the inhibitor, respectively. We set the coefficients α_0 = −0.03 and α_1 = 2. The bifurcation parameter ε = 0.006 controls the difference in the time-scales between the activator and the inhibitor. The domain considered in this study is x ∈ [0, L], with domain size L = 20 and N = 101 grid points. The spatial step-size is thus δx = L/N ≈ 0.198. The diffusion coefficients of the activator and the inhibitor are set to D_u = 1 and D_v = 4, respectively. The time-step of the Lattice Boltzmann method is δt = 0.005. The discrete-velocity distribution functions f_i^u and f_i^v (also called particle populations) that describe the mesoscopic LB system evolve according to
$$f_i^u(x_j + c_i \delta x, t_{k+1}) = f_i^u(x_j, t_k) + \Omega_i^u(x_j, t_k) + R_i^u(x_j, t_k), \qquad (30)$$
$$f_i^v(x_j + c_i \delta x, t_{k+1}) = f_i^v(x_j, t_k) + \Omega_i^v(x_j, t_k) + R_i^v(x_j, t_k), \qquad (31)$$
where we consider the discrete velocities c_i ∈ {−1, 0, +1} in the one-dimensional domain (D1Q3 velocity set). The indices i ∈ {−1, 0, 1} denote the particle populations of each velocity. The densities of the activator and the inhibitor are given by
$$u(x_j, t_k) = \sum_{i=-1}^{1} f_i^u(x_j, t_k), \qquad v(x_j, t_k) = \sum_{i=-1}^{1} f_i^v(x_j, t_k). \qquad (32)$$
The reaction terms are given by
$$R_i^u(x_j, t_k) = \tfrac{1}{3} \Delta t \left( u(x_j, t_k) - u^3(x_j, t_k) - v(x_j, t_k) \right), \qquad (33)$$
$$R_i^v(x_j, t_k) = \tfrac{1}{3} \Delta t\, \epsilon \left( u(x_j, t_k) - \alpha_1 v(x_j, t_k) - \alpha_0 \right). \qquad (34)$$
Following [49], the collision terms are given by the Bhatnagar-Gross-Krook (BGK) model:
$$\Omega_i^u(x_j, t_k) = -\omega^u \left( f_i^u(x_j, t_k) - f_i^{u,\mathrm{equil}}(x_j, t_k) \right), \qquad (35)$$
$$\Omega_i^v(x_j, t_k) = -\omega^v \left( f_i^v(x_j, t_k) - f_i^{v,\mathrm{equil}}(x_j, t_k) \right), \qquad (36)$$
where the equilibrium populations are set to
$$f_i^{u,\mathrm{equil}}(x_j, t_k) = \tfrac{1}{3} u(x_j, t_k), \qquad f_i^{v,\mathrm{equil}}(x_j, t_k) = \tfrac{1}{3} v(x_j, t_k), \qquad (38)$$
based on a spatially uniform local diffusion equilibrium, for which the velocity distributions are homogeneous in all velocity directions. Moreover, the BGK relaxation coefficients are given by
$$\omega^u = \frac{2}{1 + 3 D_u \Delta t / \Delta x^2}, \qquad \omega^v = \frac{2}{1 + 3 D_v \Delta t / \Delta x^2}. \qquad (39)$$
In order to generate the training, validation and testing data, we solve the dynamics up to T_f = 450 with a time-step of δt = 0.005, starting from different initial conditions. For the training data set we consider three initial conditions, two for the validation dataset and one initial condition for testing. The initial conditions are plotted in Figure 36.
For the LB method, in order to initialize the particle populations we employ an equal-weight rule according to [49]; at each grid point x_n,
$$f_{-1}^u(x_n, t = 0) = f_0^u(x_n, t = 0) = f_{+1}^u(x_n, t = 0) = \frac{u(x_n, t = 0)}{3}, \qquad (40)$$
$$f_{-1}^v(x_n, t = 0) = f_0^v(x_n, t = 0) = f_{+1}^v(x_n, t = 0) = \frac{v(x_n, t = 0)}{3}. \qquad (41)$$
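A minimal NumPy sketch of one D1Q3 update consistent with Equations (30)–(41); the initial profiles and the periodic boundary treatment are illustrative assumptions, not taken from the source:

```python
import numpy as np

# Parameters of the FHN Lattice Boltzmann (D1Q3) setup described above.
L, N = 20.0, 101
dx = L / N                          # ~0.198
dt = 0.005
Du, Dv = 1.0, 4.0
eps, a0, a1 = 0.006, -0.03, 2.0     # values as reconstructed in the text

# BGK relaxation coefficients, Equation (39).
wu = 2.0 / (1.0 + 3.0 * Du * dt / dx**2)
wv = 2.0 / (1.0 + 3.0 * Dv * dt / dx**2)

def lb_step(fu, fv):
    """One LB update, Equations (30)-(39). fu, fv have shape (3, N);
    rows correspond to the velocities c_i in {-1, 0, +1}."""
    u, v = fu.sum(axis=0), fv.sum(axis=0)        # densities, Eq. (32)
    Ru = dt / 3.0 * (u - u**3 - v)               # reaction, Eq. (33)
    Rv = dt / 3.0 * eps * (u - a1 * v - a0)      # reaction, Eq. (34)
    # BGK collisions towards the equal-weight equilibria, Eqs. (35)-(38).
    fu = fu - wu * (fu - u / 3.0) + Ru
    fv = fv - wv * (fv - v / 3.0) + Rv
    # Streaming: shift each population along its velocity (periodic here
    # for simplicity; the boundary treatment of [49] may differ).
    fu[0], fu[2] = np.roll(fu[0], -1), np.roll(fu[2], +1)
    fv[0], fv[2] = np.roll(fv[0], -1), np.roll(fv[2], +1)
    return fu, fv

# Equal-weight initialization, Equations (40)-(41), with hypothetical profiles.
x = np.linspace(0.0, L, N)
u0, v0 = np.tanh(x - L / 2), np.zeros(N)
fu = np.tile(u0 / 3.0, (3, 1))
fv = np.tile(v0 / 3.0, (3, 1))
for _ in range(200):                 # 200 LB steps = one coarse unit dt=1
    fu, fv = lb_step(fu, fv)
```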
As discussed in [49], this equal-weight choice is not in general consistent with the FHN PDE, whose dynamics are not spatially uniform and not simply diffusive. However, we expect that after an initial relaxation period, the fine scale simulation features become slaved to the local concentration field. For this reason, starting from all initial conditions, we start collecting data after an initial transient period of T = 2 time units.
Figure 36: The initial conditions for the activator density u(x, t = 0) and the inhibitor density v(x, t = 0) of the FHN model. Three runs starting from three different initial conditions are used for training the network, two for validation purposes and one for testing the predictive performance.
The hyperparameters utilized to tune the autoencoder of LED are given in Table 5. The hyperparameters for the RNN are given in Table 6.

Table 5: Autoencoder hyperparameters for FHN
  Hyperparameter                    Values
  Number of AE layers               3
  Size of AE layers                 100
  Activation of AE layers           tanh(·)
  Latent dimension                  {…} (swept over a range of values; see Figure 30)
  Residual connections              True
  Variational                       True/False
  Input/Output data scaling         N(0, 1)
  Noise level in the data           {…}
  Weight decay rate                 {…}
  Batch size                        …
  Initial learning rate             …

Table 6: LED-RNN hyperparameters for FHN
  Hyperparameter                    Values
  Number of AE layers               3
  Size of AE layers                 100
  Activation of AE layers           tanh(·)
  Latent dimension                  {…}
  Residual connections              True
  Variational                       True/False
  Input/Output data scaling         N(0, 1)
  Noise level in the data           {…}
  Weight decay rate                 {…}
  Batch size                        …
  Initial learning rate             …
  BPTT sequence length              {…}
  Hidden state propagation length   …
  RNN cell type                     {lstm, gru}
  Number of RNN layers              {…}
  Size of RNN layers                {…}
  Activation of RNN cell            tanh(·)
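The selection over Tables 5 and 6 amounts to a plain grid search; a minimal sketch, where train_and_validate is a hypothetical stand-in for the actual training pipeline and the value lists are illustrative subsets of the (partly elided) grids above:

```python
from itertools import product

grid = {
    "variational": [True, False],
    "latent_dim": [1, 2, 4, 8, 16, 32],   # illustrative sweep
    "cell": ["lstm", "gru"],
    "hidden_size": [50, 100],             # illustrative; d_h = 50 won
}

best = None
for values in product(*grid.values()):
    config = dict(zip(grid.keys(), values))
    val_loss = train_and_validate(config)  # hypothetical pipeline call
    if best is None or val_loss < best[0]:
        best = (val_loss, config)
print("best configuration:", best)
```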
The Kuramoto-Sivashinsky (KS) equation [63, 64] is a prototypical spatially extended partial differential equation (PDE) of fourth order that exhibits a very rich range of nonlinear phenomena and is studied extensively as a prototypical model of turbulence. In this section, we demonstrate that LED can be easily adapted to learn the effective dynamics on the inertial manifold in a similar fashion, without making any underlying assumptions about the nature of the PDE (translation invariance, energy conservation, etc.). Moreover, iterative forecasting of the long-term dynamics with data-driven methods such as the one proposed in [23] can lead to periodic orbits, stable attractors, or other spurious dynamics not present in the training data [47]. We demonstrate that by utilizing the multiscale approach explained in Section 6.6 we can reproduce the long-term climate and lower the errors on the distribution of the predicted state values.
In this work, we consider the one-dimensional KS equation given by the PDE
$$\frac{\partial u}{\partial t} = -\nu \frac{\partial^4 u}{\partial x^4} - \frac{\partial^2 u}{\partial x^2} - u \frac{\partial u}{\partial x}, \qquad (42)$$
on the domain Ω = [0, L] with periodic boundary conditions u(0, t) = u(L, t) and ν = 1. The dimensionless domain size L̃ = L/(2π) directly affects the dimensionality of the attractor; for large values of L̃, the attractor dimension scales linearly with L̃. In this work, we consider the case L = 22, corresponding to L̃ = L/(2π) ≈ 3.5, studied extensively in [65] and exhibiting a structurally stable chaotic attractor, i.e. an inertial manifold on which the long-term dynamics lie.
In order to spatially discretize Equation (42) we select a grid size ∆x, with D = L/∆x + 1 = 64 the number of nodes. Further, we denote with u_i = u(i∆x) the value of u at node i ∈ {0, …, D − 1}. We discretize Equation (42) and solve it using the fourth-order method for stiff PDEs introduced in [66], with a small time-step δt, starting from a random initial condition. After discarding initial transients, we subsample the data, keeping every tenth datapoint, to obtain a dataset where samples are spaced ∆t = 10 δt apart in time (the coarse time unit). We use a large set of such samples for training and an equally large set for validation. For testing purposes (long-term forecasting), we repeat the process with a different random seed, generating additional samples. The largest Lyapunov exponent is computed as Λ ≈ 0.05, leading to a Lyapunov time of T_Λ = 1/Λ ≈ 20.
We tried both VAE and AE, and we tuned the noise level, the weight decay rate, and the size and number of layers of the autoencoders. For more information on the combinations tried out, the interested reader can refer to Section 8.4.3. The MSE error on the test data is plotted in Figure 37. Using more than d_z = 8 nodes in the latent space improves the quality of the reconstruction only by a very small margin. For both AE and VAE a latent dimension of d_z = 8 is enough to capture most of the information in the data and reproduce the state evolution accurately. Note that we are not yet propagating the dynamics, as the autoencoder has only learned to encode the data; the evolution, though, can be accurately represented in this d_z = 8 dimensional reduced order space learned by the AE/VAE. Among the trained autoencoders with latent dimension d_z = 8, we pick the one with the lowest MSE error on the validation data set and couple it with an RNN forecasting the latent dynamics. We tune the hyperparameters of the RNN (see Section 8.4.3).
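For reference, a compact sketch of the data-generating micro solver described above, following the ETDRK4 scheme of [66] for the KS equation on a periodic domain; the time-step, initial condition, and step count are illustrative stand-ins for the elided values in the text:

```python
import numpy as np

L, D = 22.0, 64
dt = 0.025                                  # illustrative micro time-step
x = L * np.arange(D) / D
u = 0.1 * np.cos(2 * np.pi * x / L) * (1 + np.sin(2 * np.pi * x / L))

v = np.fft.fft(u)
k = 2 * np.pi / L * np.fft.fftfreq(D, d=1.0 / D)   # wavenumbers
Lin = k**2 - k**4                                   # linear part of Eq. (42), nu = 1

# ETDRK4 coefficients via contour-integral averaging (Kassam & Trefethen).
M = 16
r = np.exp(1j * np.pi * (np.arange(1, M + 1) - 0.5) / M)
LR = dt * Lin[:, None] + r[None, :]
E, E2 = np.exp(dt * Lin), np.exp(dt * Lin / 2)
Q  = dt * ((np.exp(LR / 2) - 1) / LR).mean(axis=1).real
f1 = dt * ((-4 - LR + np.exp(LR) * (4 - 3 * LR + LR**2)) / LR**3).mean(axis=1).real
f2 = dt * ((2 + LR + np.exp(LR) * (-2 + LR)) / LR**3).mean(axis=1).real
f3 = dt * ((-4 - 3 * LR - LR**2 + np.exp(LR) * (4 - LR)) / LR**3).mean(axis=1).real

def Nonlin(v):
    """Nonlinear term -u u_x = -(1/2)(u^2)_x in Fourier space."""
    return -0.5j * k * np.fft.fft(np.real(np.fft.ifft(v)) ** 2)

snapshots = []
for step in range(10_000):                  # illustrative number of steps
    Nv = Nonlin(v)
    a = E2 * v + Q * Nv;  Na = Nonlin(a)
    b = E2 * v + Q * Na;  Nb = Nonlin(b)
    c = E2 * a + Q * (2 * Nb - Nv)
    v = E * v + Nv * f1 + 2 * (Na + Nb) * f2 + Nonlin(c) * f3
    if step % 10 == 0:                      # keep every tenth datapoint
        snapshots.append(np.real(np.fft.ifft(v)))
```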
We found that an AE whose encoder and decoder consist of several dense layers, trained with additional noise in the data and with weight decay (the weight decay is applied only in the AE training, not in the RNN training), coupled with a GRU cell trained with BPTT, provided the smallest error on the statistics of the state evolution (the distribution of the predicted states as compared to the true ones, and their power spectrum).
Using the trained RNN we can evolve the dynamics on the reduced order space. An iterative prediction of LED (solely propagating on the latent space) is illustrated in Figure 38. The initial LED warm-up period is set to T_warm = 60. We observe visually that the predictions do not deviate from the attractor (such deviation is a classical problem in iterative forecasting [47]), and the long-term climate is reproduced. Quantitative results on the power spectrum of the predicted state evolution and on the density of the state are provided in Figure 41. LED can effectively learn and propagate the dynamics on the reduced order manifold, capturing the long-term behavior. Two variants of LED, switching between propagation in the latent state of the RNN for T_m (macro dynamics) and the micro solver for T_µ, are given in Figure 39 and Figure 40. In both cases, we visually observe that the long-term climate of the system is reproduced. In the following, we plot the NAD error, quantify the discrepancy between the distribution of the state values of the micro solver and those of the LED variants, and report the speed-up of LED compared to the micro solver in Figure 42.
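Assuming that WD denotes the one-dimensional Wasserstein distance between the distributions of state values (the metric is only abbreviated in the text), it can be estimated directly from trajectory samples:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def state_distribution_error(u_true, u_pred):
    """Wasserstein distance between the distributions of state values of
    the ground-truth and predicted spatiotemporal fields (any shape)."""
    return wasserstein_distance(u_true.ravel(), u_pred.ravel())

# Example with synthetic stand-in trajectories of shape (T, D).
rng = np.random.default_rng(0)
u_true = rng.normal(size=(1000, 64))
u_pred = u_true + 0.05 * rng.normal(size=(1000, 64))
print(state_distribution_error(u_true, u_pred))
```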
Figure 37: (a) MSE loss L = ||x − x̃||² on the test data set, plotted as a function of the latent state dimension. (b) Logarithm of the MSE loss on the test data set. For both AE and VAE a latent dimension of d_z = 8 is enough to capture most of the information in the data and reproduce the state evolution accurately. The gain in MSE loss from a higher latent state dimension is negligible, providing evidence that we have approximately captured the dimensionality of the effective dynamics of the system.
Figure 38: An iterative forecast on the test data set using LED, propagating the dynamics on a d_z = 8 dimensional latent space. LED is not switching to the micro scale (T_µ = 0). The initial LED warm-up period is T_warm = 60.
Figure 39: LED switching between propagation of the latent dynamics (macro scale with d_z = 8) for T_m = 64 and micro dynamics for T_µ = 8. The resulting ratio is ρ = 8.
Figure 40: LED switching between propagation of the latent dynamics (macro scale with d_z = 8) for T_m = 80 and micro dynamics for T_µ = 8. The resulting ratio is ρ = 10.
Figure 41: (a) The predicted power spectrum plotted against the true one. (b) The distribution of the predicted state u values plotted against the correct distribution; the two distributions match. Density of values in the u_x − u_xx space obtained from (c) the ground-truth trajectories and (d) the predicted ones. Even though LED is propagating the dynamics on a low order d_z = 8 dimensional space (after an initial warm-up period T_warm = 60), the statistics of the system, i.e. the long-term climate, are reproduced accurately, as demonstrated by the power spectrum, the state distribution of the predictions, and the u_x − u_xx density plot.
Figure 42: Results of LED on the Kuramoto-Sivashinsky equation. Starting from different initial conditions from the test data, we utilize LED with T_µ = 0, and variants switching between evolution of the dynamics based on the spectral solver for T_µ = 8 and propagation of the latent space in LED for T_m ∈ {4, 8, 16, 32, 64, 80}, corresponding to ρ ∈ {0.5, 1, 2, 4, 8, 10}. (a) Evolution of the NAD, averaged over initial conditions from the test data. (b) WD on the state distribution w.r.t. the macro-to-micro ratio ρ. (c) Speed-up w.r.t. ρ. (d) Logarithm of the speed-up w.r.t. ρ. Unless ρ drops to ρ = 0.5, no significant drop in the NAD or in the WD on the state distribution is observed, while at this multiscale ratio the speed-up is negligible. This may be attributed to the chaotic nature of the Kuramoto-Sivashinsky equation, where even a small initial error grows exponentially. Nevertheless, the NAD error remains low for at least one Lyapunov time 1/Λ for the LED variants, and the state distribution is captured, as demonstrated by the low WD errors, even though LED is utilizing a latent space of dimension d_z = 8 and is almost two orders of magnitude faster than the spectral solver (micro scale).
The dynamics of the Kuramoto-Sivashinsky equation with L = 22 mostly take place on an 8 dimensional manifold [65]. By performing Principal Component Analysis (PCA), we indeed see that most of the total energy is captured by the first 8 most energetic PCA modes. The energy distribution is shown in Figure 43. The dynamics can be represented in an 8 dimensional space in the PCA basis, while the statistics of the system are captured correctly, as shown in Figure 44, where the reconstructed evolution u_rec(t) using only the 8 most energetic PCA modes closely matches the ground-truth u(t). However, the transformation to the PCA basis is static, i.e. the PCA transformation does not take into account the dynamics on the reduced order manifold, that is, how predictable the evolution of the reduced order modes is. Indeed, utilizing the LED-VAE, the reconstruction error is smaller, as depicted in Figure 45.
Figure 43: The energy of the principal components of the Kuramoto-Sivashinsky equation with L = 22. The first 8 most energetic modes contain most of the total energy.
Figure 44: The evolution of a trajectory u(t) in the state space of the Kuramoto-Sivashinsky equation with L = 22. The reconstruction of the trajectory u_rec(t) using only the first 8 most energetic PCA modes, along with the reconstruction error |u(t) − u_rec(t)| and the evolution of these modes z(t).
Figure 45: The evolution of a trajectory u(t) in the state space of the Kuramoto-Sivashinsky equation with L = 22. The reconstruction of the trajectory u_rec(t) using a VAE that first compresses the input to a latent space of dimension d_z = 8 and then projects back to the original space. The long-term statistics can be represented in this low dimensional manifold, while the error is smaller compared to PCA.
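A minimal sketch of this PCA energy analysis on a snapshot matrix of KS states; the data-loading step is a hypothetical placeholder:

```python
import numpy as np

# U: snapshot matrix of shape (T, D), each row a KS state.
U = np.load("ks_trajectory.npy")             # hypothetical file
U = U - U.mean(axis=0)                        # center the snapshots

# PCA via SVD; squared singular values give the modal energies.
_, s, Vt = np.linalg.svd(U, full_matrices=False)
energy = s**2 / np.sum(s**2)
print("cumulative energy of first 8 modes:", energy[:8].sum())

# Static reconstruction with the leading 8 modes (cf. Figure 44).
k = 8
U_rec = (U @ Vt[:k].T) @ Vt[:k]
print("reconstruction MSE:", np.mean((U - U_rec) ** 2))
```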
In this section, we provide the results of a hyperparameter study conducted on the autoencoder for the Kuramoto-Sivashinsky equation. We trained the network with a set of samples, half of them used for training and half for validation. All hyperparameter combinations we tried are given in Table 7. We found that VAE/AE equipped with residual connections and the tanh activation function consistently provided better results in terms of the MSE loss on the test data compared to other combinations. For this reason, during the search for the optimal hyperparameters of the LED architecture, where we further need to optimize the hyperparameters of the RNNs, we restrict ourselves to the combinations reported in Table 8.

Table 7: Autoencoder hyperparameters for Kuramoto-Sivashinsky
  Hyperparameter                    Values
  Number of AE layers               {…}
  Size of AE layers                 {…}
  Activation of AE layers           selu(·), tanh(·)
  Latent dimension                  {…} (swept over a range of values; see Figure 37)
  Residual connections              True/False
  Variational                       True/False
  Input/Output data scaling         N(0, 1)
  Noise level in the data           {…}
  Weight decay rate                 {…}
  Batch size                        …
  Initial learning rate             …

Table 8: LED hyperparameters for Kuramoto-Sivashinsky
  Hyperparameter                    Values
  Number of AE layers               {…}
  Size of AE layers                 {…}
  Activation of AE layers           tanh(·)
  Latent dimension                  {8}
  Residual connections              True
  Variational                       True/False
  Input/Output data scaling         N(0, 1)
  Noise level in the data           {…}
  Weight decay rate                 {…}
  Batch size                        …
  Initial learning rate             …
  BPTT sequence length              {…}
  Hidden state propagation length   …
  RNN cell type                     {lstm, gru}
  Number of RNN layers              {…}
  Size of RNN layers                {…}
  Activation of RNN cell            tanh(·)
We have presented a novel data-driven framework for learning the effective dynamics (LED) of, and performing multiscale modeling on, both stochastic and deterministic dynamical systems, extending the equation-free formalism with state of the art deep learning methods. LED utilizes deep autoencoders to compress the high dimensional state to a few effective degrees of freedom in a latent space representation. An RNN is utilized to forecast the dynamics on this latent representation. In stochastic systems, a mixture density decoder probabilistically maps the reduced order latent space to the high dimensional state space.

In systems where evolving the high dimensional state dynamics based on first principles (solvers, equations, etc.) is computationally expensive or time consuming, LED can accelerate the simulation by propagating the latent space dynamics and upscaling to the high-dimensional system state with the decoder only when necessary. We demonstrate that LED can be orders of magnitude faster than micro scale solvers, extending the simulation times that are possible with solvers based on first principles. Moreover, we demonstrate how LED can be operated in a multiscale fashion, switching between propagation of the latent dynamics and evolution of the micro dynamics, achieving lower error at the cost of a reduced speed-up. In this way, a trade-off between accuracy and speed-up may be achieved. The efficiency of the proposed approach is evaluated on a stochastic dynamical system composed of particles following Advection-Diffusion (AD) dynamics in one and three dimensions, the FitzHugh-Nagumo model (FHN), and the chaotic Kuramoto-Sivashinsky (KS) equation.

We demonstrate that LED efficiently uncovers the metastable regions in the AD dynamics in both one and three dimensions, capturing the distribution of the positions of the particles with low error, while being two orders of magnitude faster than the micro scale solver. Moreover, by utilizing the proposed multiscale forecasting scheme, we demonstrate that the statistical error decreases the more time is spent evolving the particle dynamics and the less time is spent on the latent space of LED, albeit at the cost of lower computational savings. In addition, we demonstrated that the computational savings of LED increase for higher Péclet numbers (higher diffusion coefficient), albeit at the cost of slightly higher errors, and that LED can generalize to a different number of particles (N = 200).

We evaluated the efficiency of LED in forecasting the evolution of the densities of the activator and the inhibitor in the FHN equation, and compared it with variants of the equation-free approach based on uncovering a PDE model on the coarse level. The LED variants exhibit one order of magnitude lower mean normalised absolute difference (MNAD) on both predicted densities, while being two orders of magnitude faster than the Lattice Boltzmann method considered as the micro scale solver. Moreover, we tested the multiscale approach on the FHN, demonstrating that the error can be further reduced by utilizing the solver for some portion of the forecasting time, at the cost of a reduced speed-up, achieving a compromise between the desired accuracy and execution time.

Last but not least, LED efficiently uncovered a d_z = 8 dimensional reduced order manifold in the dynamics of the KS equation with L = 22.