[PDF] Data-driven learning for the Mori-Zwanzig formalism: a generalization of the Koopman learning framework

Abstract

A theoretical framework which unifies the conventional Mori-Zwanzig formalism and the approximate Koopman learning is presented. In this framework, the Mori-Zwanzig formalism, developed in statistical mechanics to tackle the hard problem of construction of reduced-order dynamics for high-dimensional dynamical systems, can be considered as a natural generalization of the Koopman description of the dynamical system. We next show that similar to the approximate Koopman learning methods, data-driven methods can be developed for the Mori-Zwanzig formalism with Mori's linear projection operator. We developed two algorithms to extract the key operators, the Markov and the memory kernel, using time series of a reduced set of observables in a dynamical system. We adopted the Lorenz `96 system as a test problem and solved for the operators, which exhibit complex behaviors which are unlikely to be captured by traditional modeling approaches, in Mori-Zwanzig analysis. The nontrivial Generalized Fluctuation Dissipation relationship, which relates the memory kernel with the two-time correlation statistics of the orthogonal dynamics, was numerically verified as a validation of the solved operators. We present numerical evidence that the Generalized Langevin Equation, a key construct in the Mori-Zwanzig formalism, is more advantageous in predicting the evolution of the reduced set of observables than the conventional approximate Koopman operators.

Full PDF

DData-driven learning for the Mori–Zwanzig formalism: ageneralization of the Koopman learning framework

Yen Ting Lin , Yifeng Tian , Daniel Livescu , and Marian Anghel Information Sciences Group, Computer, Computational and Statistical Sciences Division (CCS-3), Los AlamosNational Laboratory, Los Alamos, NM 87545, USA Computational Physics and Methods Group, Computer, Computational and Statistical Sciences Division (CCS-2),Los Alamos National Laboratory, Los Alamos, NM 87545, USA * Corresponding author: [email protected]

January 18, 2021

Abstract

A theoretical framework which uniﬁes the conventional Mori–Zwanzig formalismand the approximate Koopman learning is presented. In this framework, the Mori–Zwanzig formalism, developed in statistical mechanics to tackle the hard problem ofconstruction of reduced-order dynamics for high-dimensional dynamical systems, canbe considered as a natural generalization of the Koopman description of the dynamicalsystem. We next show that similar to the approximate Koopman learning methods,data-driven methods can be developed for the Mori–Zwanzig formalism with Mori’slinear projection operator. We developed two algorithms to extract the key operators,the Markov and the memory kernel, using time series of a reduced set of observables ina dynamical system. We adopted the Lorenz ‘96 system as a test problem and solvedfor the operators, which exhibit complex behaviors which are unlikely to be capturedby traditional modeling approaches, in Mori–Zwanzig analysis. The nontrivial Gener-alized Fluctuation Dissipation relationship, which relates the memory kernel with thetwo-time correlation statistics of the orthogonal dynamics, was numerically veriﬁed asa validation of the solved operators. We present numerical evidence that the Gener-alized Langevin Equation, a key construct in the Mori–Zwanzig formalism, is moreadvantageous in predicting the evolution of the reduced set of observables than theconventional approximate Koopman operators.

Keywords:

Mori–Zwanzig formalism, memory eﬀects, Generalized Langevin Equa-tions, Generalized Fluctuation-Dissipation relationship, Dynamic Mode Decomposi-tion, Extended Dynamic Mode Decomposition, approximate Koopman learning, data-driven, reduced-order dynamical system

AMS subject classiﬁcations: a r X i v : . [ c ond - m a t . s t a t - m ec h ] J a n Introduction

The Mori–Zwanzig formalism [1, 2, 3, 4] was ﬁrst developed in statistical physics for thediﬃcult task of constructing coarse-grained models from high-dimensional microscopic mod-els. The goal of model coarse-graining is to construct equations describing the evolution of asmaller set of variables which are measurable, or quantities of our interests. These quantitiesare often referred to as the relevant or resolved variables. For example, a microscopic modelcan be all-atom molecular dynamics simulation of a protein in a solvent, and a relevantvariable can be the distance between two atoms of our interests. It is desirable to constructa closed dynamical system which include only the resolved variables without the informationof other degrees of freedom. On the one hand, it is easier to perform analysis on a closedlower-dimensional system and to shed important insights on the interactions between the re-solved variables. On the other hand, it is more eﬃcient to simulate the reduced-dimensionalsystem computationally.The major challenge of model coarse-graining is that the resolved variables may be in-ﬂuenced by the unresolved one. In the above example, the distance between the two speciﬁcatoms may be inﬂuenced by nearby water molecules that are not in the set of resolvedvariables. To solve this diﬃcult closure problem, Mori [1] and Zwanzig [2] developed theprojection-based methods to express the eﬀect of the unresolved variables in terms of theresolved ones. The Generalized Langevin Equation, the main result of the Mori–Zwanzigformalism, decomposes the evolutionary equations of the resolved variables into three parts:a Markov term, which captures the interaction within the resolved variables, a memory term,which is history-dependent and captures the interactions between the resolved and the un-resolved variables, and a term representing the orthogonal dynamics, which captures theunknown initial condition of the unresolved variables. Although the Generalized Langevinequation is formally exact, it is challenging to theoretically derive the closed-form expressionof these terms in the Generalized Langevin Equation without approximations. Conven-tionally, the applications of the Mori–Zwanzig formalism rely on modeling self-consistentoperators based on the mathematical structure of the Generalized Langevin Equation [5, 6].In a seemingly unrelated research area, approximate Koopman learning methods suchas Dynamic Mode Decomposition [7, 8] and Extended Dynamic Mode Decomposition [9],have been actively developed for data-driven modeling of dynamical systems. The generalidea is that by collecting enough data of a dynamical system, possibly from a high-ﬁdelitysimulation of the microscopic system, one would be able to learn important features in thedynamics, e.g., spectral [10] and dynamic modes [7, 8, 9]. The theoretical foundation ofthese methods, the Koopman theory, is a formulation for general dynamical systems [11, 12].Instead of the typical description of a possibly nonlinear system in the physical space, inKoopman theory, the dynamics are described as a linear dynamical system of the functionsof the observables in an inﬁnite-dimensional Hilbert space. Because the interactions in thisframework are linear (but with a caveat that the space is inﬁnite dimensional), learning fromdata is a convex problem which is easier to solve, in contrast to the nonlinear regression inthe conventional physical-space picture.The major aim of this article is to bridge these two seemingly disconnected researchareas in dynamical systems: the Mori–Zwanzig formalism and the approximate Koopmanlearning. We will establish that the Mori–Zwanzig formalism with Mori’s linear projector s functionally identical to the approximate Koopman learning methods in a shared Hilbertspace . This connection helps to bring the advantage of one research area to another. On theone hand, the approximate Koopman learning methods can be generalized for data-drivenmodeling of the operators in Mori–Zwanzig formalism. As will be seen in this article, theoperators describing the Markov and memory terms in the Generalized Langevin Equationcan be numerically learned from the simulation of the microscopic system. Surprisingly, theseoperators, inferred from data, are highly nontrivial and are unlikely to be modeled accuratelywithout in-depth knowledge of the system. On the other hand, the Mori–Zwanzig formalismis a generalization of the Koopman learning methods. We will show that by including thememory kernel and the history of the resolved variables, the Generalized Langevin Equationpredicts more accurately than the Extended Dynamic Mode Decomposition.This article is organized by the following structure. In Sec. 2, we provide a gentle intro-duction to the two descriptions of a dynamical system: the description of a ﬁnite-dimensional,but possibly nonlinear dynamics in the physical space, and the Koopman description of aninﬁnite-dimensional, but linear dynamics of the observables. For completeness, we includea self-contained introduction and review of the Mori–Zwanzig formalism in Sec. 3. Our aimis to covers the necessary intuition of understanding the Mori–Zwanzig theory in Sec. 3.1and to present the main result, the Generalized Langevin Equations (GLE) in Sec. 3.2. Weprovide a geometric interpretation of GLE in Sec. 3.3 and a discussion on diﬀerent projec-tion operators in Sec. 3.4, followed by the equations describing the projected image andcorrelation matrices in Sec. 3.5 and the self-consistent generalized Fluctuation-Dissipationrelationship in Sec. 3.6. We establish that the Mori–Zwanzig formalism is a generalizationof the Extended Dynamic Mode Decomposition (EDMD) in Sec. 4. Two novel algorithms,motivated by the Extended Dynamic Mode Decomposition [9] to extract the key operatorsin the Mori–Zwanzig formalism by simulation data, are presented in Sec. 5. We perform nu-merical experiments on a Lorenz ‘96 model [13] and present the results in Sec. 6. In Sec. 6.2,we demonstrate the advantage of the Mori–Zwanzig formalism over the conventional EDMDin predicting dynamical systems into the future. Finally, we provide a discussion and futureoutlook in Sec. 7. There exist two equivalent formulations to describe a dynamical system. In the ﬁrst formu-lation [14, 15], the system is characterized by a collection of physical-space variables , oftentermed as the state of the system. For example, a physical-space variable can be one compo-nent of the position of an atom in a many-particle system, or one component of the velocityﬁeld at a speciﬁc location in a ﬂuid dynamical system. The aim of this ﬁrst formulation is todescribe the evolution of these physical-space variables. Suppose the state of the system isfully characterized by N physical-space variables φ i , i = 1 , . . . , N , and we denote the stateof the system at time t by Φ ( t ) := [ φ ( t ) , . . . , φ N ( t )] T , an N × R N physical space is described by an evolutionary equationdd t φ i ( t ) = R i ( Φ ( t ) , t ) , and Φ (0) = Φ , i = 1 , . . . N, (1)3here the ﬂow R i is a function which maps the state Φ to a real number that characterizesthe velocity of the physical-space variable φ i at time t , and Φ is the given N × R i does not explicitly depends on the time t . Thus,the evolutionary equation (1) can be written in a terse form ˙ Φ = R ( Φ ), where the ﬂow R is deﬁned as [ R ( Φ ( t )) , . . . , R N ( Φ ( t ))] T and is implicitly time-dependent. In general, theﬂow R can be nonlinear in Φ .In the second formulation proposed by Koopman [11, 12], the system is characterized bya collection of observables which are functions of the physical-space variables. For example,an observable can be a component of the total angular momentum of a subset of all atomsin a particle system, or the locally averaged density in a ﬂuid dynamical system. TheKoopmanian formulation describes how observables evolve in an inﬁnite-dimensional Hilbertspace H , which is composed of all the possible observables. The advantage of this formulationis that the evolution of the observables, which is a vector in the inﬁnite dimensional Hilbertspace H , is always linear , even for systems that are nonlinear in the physical-space picture.The disadvantage of this formulation is that the state space of the system, which consists ofall possible observables, is inﬁnite dimensional.To illustrate the diﬀerence of the formulations, we consider a one-dimensional nonlineardynamical system in the physical-space formulation: ˙ φ ( t ) = R ( φ ( t )) and φ (0) := φ , where R ( x ) := − x is a nonlinear function and φ is the initial condition. While the analyticsolution exists for this simple problem ( φ ( t ) = 1 / ( t + 1 /φ )), it is challenging to derivethe closed-form solution for general multidimensional ( N >

1) nonlinear dynamical systems.Note that φ ( t ) is a complicated nonlinear function of the initial condition φ . In the Koopmanformulation, the dynamics are characterized by observables of φ . It is suﬃcient for us toconsider a set of observables which will serve as the basis functions. For this example,we consider g k := φ k , k ∈ Z + . Other observables can be expressed as a weighted linearsuperposition of these basis functions via Taylor series expansion. In contrast to the ﬁrstformulation, Koopman’s theory describes the dynamics of the basis functions g k :dd t g k ( t ) := dd t [ g k ◦ φ ( t )] = d g k d φ · d φ d t = kφ k − ( t ) · (cid:2) − φ ( t ) (cid:3) = − kg k +1 ( t ) . (2)Throughout this article, we will use the symbol ◦ to denote the composite functions. Here,the observable functions g k are functions of the physical-space variable φ , which is a functionof the physical time t . The dynamics of g k ( t ) is always linearly dependent to g k +1 ( t ), butis not closed unless the system involves inﬁnitely many k ’s. The evolution of lower-ordernonlinearity involves higher-order nonlinearity, similar to the common phenomenon in Car-leman linearization [16, 17] and moment expansion methods [18, 19]. Nevertheless, we canchoose two simple functions g ( φ ) := φ and g − ( φ ) = φ − and their dynamics are closed:dd t (cid:20) g ( t ) g − ( t ) (cid:21) = (cid:20) (cid:21) (cid:20) g ( t ) g − ( t ) (cid:21) , and (cid:20) g (0) g − (0) (cid:21) = (cid:20) φ − (cid:21) . (3)The above linear ordinary diﬀerential equations are solved to derive the analytical solutionof g − ( φ ( t )) = t + φ − , which is then used to calculate φ ( t ). The choice of this invariantset of functions is not unique. We can even choose a smaller set which contains only one4bservable g e ( φ ) := exp( − /φ ), which satisﬁes a one-dimensional linear ordinary equation˙ g e ( t ) = − g e ( t ).The above examples illuminate two key features of the Koopman theory. First, the dy-namics of observables are always linearly dependent on other observables. Secondly, to deriveclosed-form solution in the Koopman theory is equivalent to identifying a set of observableswhose dynamics are invariant in a subspace which is linearly spanned by the set of the ob-servables. In general, it is challenging to identify the ﬁnite set of observables that closes thedynamics, and one has to resort to approximation methods to close the system. In the nextsection, we illustrate how the Mori–Zwanzig formalism leverage the projection operators toclose the dynamics. Here, we provide an intuitive derivation of the major result of the Mori–Zwanzig formalism,the Generalized Langevin Equation (GLE). More rigorous yet technically tedious derivationscan be found in references [1, 2, 3, 20].

Similar to the Koopman picture, the Mori–Zwanzig formalism aims to describe the evolutionof a ﬁnite set of M independent observables, M := { g i } Mi =1 . Note that the physical-spacevariable Φ ( t ) is a function of both time t and the initial condition Φ . As such, the observ-ables g i evaluated at the time t , g i ◦ Φ ( t ), are functions of the initial condition Φ . We denotethis function as g i ( t ), namely, g i ◦ Φ ( t ) ≡ g i ( t ) ◦ Φ . In other words, g i ( t ) are functions ofthe initial condition Φ parametrized by the physical time t . Note that g i (0) ≡ g i .We adopt a terse vector notation g M ( t ) = [ g ( t ) , . . . g M ( t )] T in the following derivation.Recall that all functions of Φ are treated as vectors in the Hilbert space H . Throughoutthis article, we will leverage the equipped inner product to deﬁne the set of functions in H which are orthogonal to M : ¯ M := { f : (cid:104) f, g (cid:105) = 0 , ∀ g ∈ M} . We use the angular brackets todenote the inner product. Because the addition and scalar multiplication of these orthogonalfunctions are still orthogonal to the g ∈ M , ¯ M is a vector space; we shall refer to ¯ M as the orthogonal space . Leveraging the inner product of the Hilbert space leads to a speciﬁc choiceof the projection operator , which is a key construct in the Mori–Zwanzig formalism. As willbee seen in Sec. 4, this speciﬁc choice of the projection operator will be consistent with theKoopman’s picture. For completeness, we discuss another commonly adopted projectionoperator which does not requires the construction of the inner-product space in Sec. 3.4.In the Hilbert space H , the evolution of g M ( t ) linearly depends on the observables in M and other observables, which can always be decomposed into a linear combination of theobservables in M and those in ¯ M . Suppose the full dynamics has ﬁnite degrees of freedom,then we can write a vector g ¯ M as a set of observables in ¯ M on which the dynamical systemdepends, and the full dynamics is closed in the space spanned by g M and g ¯ M :dd t (cid:20) g M ( t ) g ¯ M ( t ) (cid:21) = L · (cid:20) g M ( t ) g ¯ M ( t ) (cid:21) := (cid:20) L MM L M ¯ M L ¯ MM L ¯ M ¯ M (cid:21) · (cid:20) g M ( t ) g ¯ M ( t ) (cid:21) . (4)5ere, the operator L i,j , i, j ∈ (cid:8) M , ¯ M (cid:9) , quantiﬁes the eﬀect from set j to set i in the linearevolution.To obtain a closed-form evolution for our observables of interest in M , we ﬁrst solve forthe observables in set ¯ M implicitly. Treating L ¯ MM g M as an inhomogeneous driving termfor the linear system, we can formally solve the linear evolutionary equation for g ¯ M : g ¯ M ( t ) = (cid:90) t e ( t − s ) L ¯ M ¯ M · L ¯ MM · g M ( s ) d s + e t L ¯ M ¯ M · g ¯ M (0) . (5)The implicit solution of g ¯ M is in turn used to express closed evolutionary equations for theobservables in the set M .dd t g M ( t ) = L MM g M ( t ) + L M ¯ M (cid:90) t e ( t − s ) L ¯ M ¯ M · L ¯ MM · g M ( s ) d s + L M ¯ M e t L ¯ M ¯ M · g ¯ M (0) . (6)Equation (6) is almost closed in our chosen set of variables except for the last term. The ﬁrstterm is the instantaneous conﬁguration of the set of observables applied to the physical-spaceconﬁguration at time t and the second is a delayed impact of the set of observables appliedto the physical-space variables at an earlier time s < t . Both these terms depend only on ourchosen set of observables, M . However, the third term is induced by the observables in ¯ M , aset of function that is orthogonal to our chosen observables { g i } Mi =1 . This term only dependson the initial conﬁguration in the orthogonal space g ¯ M (0) propagated to the current time t by the operator L M ¯ M · e t L ¯ M ¯ M .This simple analysis illustrates the essential intuition of the Mori–Zwanzig formalism.Because we only choose to describe the evolution of a partial set of observables ( M ) of thefull dynamics, the impact from other observables in the complement set ( ¯ M ) cannot bedirectly accessed. Instead, we indirectly estimate the eﬀect L M ¯ M · g ( t ) from Eq. (5), whichcontains two parts: the impact of the those observables at an earlier time s , and the initialconditions of the orthogonal observables. The former characterizes the “echo” of the set ofour observables of interest to itself: at an earlier time s , these observables made an impactto the complement set of variables (via the operator L ¯ MM ), and such an impact propagatesin the orthogonal space ( ¯ M ) for t − s time via the operator exp (( t − s ) L ¯ M ¯ M ) before comingback to aﬀect the observables in M at time t via the operator L M ¯ M . The second part is ageneric impact from the initial conﬁgurations of the orthogonal set of observables, g ¯ M (0),which have propagated in the space ¯ M until the current time t and aﬀect observables in M . In the end, Eq. (6) tells us that the accurate evolutionary equations of the relevantobservables { g i } Mi =1 always depend on (1) their instantaneous conﬁguration, (2) their pasthistory, and (3) an external “driving force” which depends on the initial conﬁgurations inthe orthogonal space.We make a remark that the second and the third term are zero if the dynamics is closed in M , that is, if we have a complete set of observables to describe the full dynamics. The echoand the driven force exist only because we have an incomplete observation of the dynamicalsystem in H .Finally, we remark that in the above analysis, we write the evolutionary equations of a6nite set of observables in the matrix form, Eq. (4). Such an analysis for ﬁnite-dimensionalsystem may not be general enough for more general systems whose dynamics depend onuncountably many observables. For example, a general observable g λ , which is parametrizedby a continuous λ , follows an evolutionary equation ˙ g λ ( t ) = − (cid:82) ∞−∞ g λ ( t ) d λ . A more generaland rigorous method for developing Mori–Zwanzig theory for these more general dynamicalsystem can be constructed with the aid of semi-group algebra [1, 20]. Despite a moreconvoluted derivation with more complex mathematical expressions, the intuition behindthe more general derivation is identical to our insights gained in the above, stylized analysis. It is common that the subscript M is dropped for brevity. By deﬁning matrices M := L MM , K ( s ) := − L M ¯ M e s L ¯ M ¯ M · L ¯ MM , and F ( t ) := L M ¯ M e t L ¯ M ¯ M · g ¯ M (0), we arrive at a cleanexpression of the major outcome of the Mori–Zwanzig thoery, the Generalized LangevinEquation describing the evolutionary equations of the reduced-order observables:dd t g ( t ) = M · g ( t ) − (cid:90) t K ( t − s ) · g ( s ) d s + F ( t ) . (7)Here, M is an M × M matrix, and so is K ( s ) for any given real s ≥

0. The ﬁrst term on theright hand side of the GLE prescribes a linear and Markov evolution. Thus, we refer to M as the Markov transition matrix . The second term describes a cumulative delayed impactof the previous trajectory g ( s ), s < t , to the current state g ( t ). We refer to K ( t − s ) asthe memory kernel , which quantiﬁes an impact propagating from time s to t . Note that thedelayed impact only depends on the diﬀerence of the two times t − s . The ﬁnal term F ( t )is referred to as the orthogonal dynamics , because it is a linear superposition of orthogonalobservables of the initial condition Φ . Although F ( t ) is fully deterministic, it is oftenreferred to as the noise because its resemblance of the Langevin noise in the GeneralizedLangevin Equation Eq. (7).We ﬁnally remark that the operators M quantify the interactions within the group of therelevant observables { g i } Mi =1 (that is, L MM ). In contrast, the memory kernel combines theeﬀects of the rest of the interactions ( L M ¯ M , L ¯ MM , and L ¯ M ¯ M ). The GLE describe the exact evolution of g ( t ) in H . Figure 1 illustrates a schematic diagramof the dynamics of g ( t ) in the space H . Because the GLE (7) is linear, the observables g ( t ) can be decomposed into two components: g ( t ) = g (cid:107) ( t ) + g ⊥ ( t ). We deﬁne the parallelcomponent g (cid:107) ( t ) as the general solution of the linear system and satisﬁes˙ g (cid:107) ( t ) = M · g (cid:107) ( t ) − (cid:90) t K ( t − s ) · g (cid:107) ( s ) d s, (8a) g (cid:107) (0) = g (0) . (8b)Because (8) is linear, it is clear that g (cid:107) ( t ) is just linear combination of the initial observables,i.e., we can always express g (cid:107) ( t ) = (cid:80) Mi =1 α i ( t ) g i with some time-dependent coeﬃcients α i ( t ).7igure 1: Schematic diagram of the Mori–Zwanzig formalism. The Hilbert space H con-tains all possible observables which are functions of the initial conditions of the physical-space variables Φ . The subspace H g is linearly spanned by the set of selected observablesSpan( { g i } Mi =1 ). We deﬁne the functions g i ( t ) parametrized by time t to be g i ( t ) ◦ Φ := g i ◦ Φ t ,noting that Φ ( t ) are functions of the initial condition Φ . At time t = 0, the vector g (0) = [ g (0) , . . . , g M (0)] T is exactly the vector of the basis { g i } Mi =1 and thus, g (0) is in H g . As the nonlinear dynamics evolves, g ( t ) does not necessarily stays invariantly in H g ;in other words, g ( t ) is not necessarily a linear combination of { g i } Mi =1 . Mori’s projection op-erator P projects g ( t ) into H g . The image is termed as g (cid:107) ( t ), which is a linear combinationof { g i } Mi =1 and g (cid:107) ( t ) satisﬁes Eq. (8).It is useful to deﬁne a subspace H g := span { g i } Mi =1 to be the linear span of all the initialobservables. Then, for any t ≥ g (cid:107) ( t ) ∈ H g .The orthogonal component g ⊥ ( t ) is the particular solution of the linear system with thedriven force and satisﬁes˙ g ⊥ ( t ) = M · g ⊥ ( t ) − (cid:90) t K ( t − s ) · g ⊥ ( s ) d s + F ( t ) , (9a) g ⊥ (0) = 0 . (9b)Recall that the “noise” F ( t ) lives in an orthogonal space ¯ M . Then, again due to the linearityof (9), the orthogonal component g ⊥ is in the orthogonal space ¯ M .The geometric representation in Fig. 1 illustrates the closure problem in the Hilbert space:to evolve the exact g ( t ), one needs to provide the orthogonal dynamics F ( t ). Solving F ( t ) isas diﬃcult as solving the full dynamics in the physical-space picture—we would need all theinformation of the orthogonal components, and thus the Mori–Zwanzig formalism has littleto no advantage over the physical-space picture if the task is to solve for the exact g ( t ). Infact, there is no free lunch by changing the physical-space picture to the Koopman picture.To move forward, traditional analysis aim to select a set of slow-evolving observables—whichare the coarse-grained variables—and a noise model to replace the orthogonal dynamics F ( t ).The rational of this approach is that, if the selected observables are complete to describe theslow dynamics, the orthogonal dynamics shall live at a much faster timescale and a propernoise model should suﬃce to model F ( t ). The challenge is that it is not a priori knownwhat these observables are and what the corresponding noise model is, and identifying theobservables and noise model is often from educated guesses supported by domain-speciﬁc8nowledge. We have set up the overview of the Mori–Zwanzig formalism. With a speciﬁc choice of theorthogonal space deﬁned by the equipped inner product in the Hilbert space, the core ideais to decompose the exact evolution g ( t ) by a parallel image g (cid:107) ( t ) and the orthogonal one g ⊥ ( t ) in ¯ M , which depends on the choice of the inner product. Conventionally, the innerproduct between two observables, which are functions of the initial state Φ , is deﬁned as (cid:104) f, h (cid:105) := (cid:90) Ω ( f ◦ Φ ) ( h ◦ Φ ) d µ ( Φ ) , (10)where d µ is a measure whose support is the full physical space Ω. Thus, the inner productof two functions f and h is considered as the expected product with respect to the initialcondition distributed by d µ . Despite the freedom of choosing the measure d µ , convention-ally, d µ is often set by a natural measure that is speciﬁc to the dynamics. For example, itis natural to adopt the canonical equilibrium distribution (Gibbs’ measure) for equilibriumHamiltonian systems [1]. For non-equilibrium systems, we can adopt the stationary measureas d µ [4], as shown below.With a deﬁned inner product, we can construct an operator P that projects any functionof Φ , f ∈ H , onto the subspace H g spanned by a pre-selected and linearly independent set offunctions, { g i } Mi =1 . Because any function in H g is a linear superposition of g i ’s, the projectedimage can be expressed by P f := (cid:80) Mi =1 α i g i with coeﬃcients α i . Now, we decompose thefunction f by f = P f + f ⊥ where f ⊥ is orthogonal to any function in H g . Because (cid:104) f ⊥ , g j (cid:105) =0, j = 1 . . . M , we can write the inner product (cid:104) f, g j (cid:105) as: (cid:104) f, g j (cid:105) = (cid:104)P f, g j (cid:105) = M (cid:88) i =1 α i (cid:104) g i , g j (cid:105) . (11)In vector notation, g := [ g . . . g M ] T and α := [ α . . . α M ], the above equation can be ex-pressed concisely by (cid:10) f, g T (cid:11) = α · (cid:10) g , g T (cid:11) . We assume that there is no redundant observablewhich can be expressed as the linear combination of other observables and thus, (cid:10) g , g T (cid:11) isa full-rank and invertible matrix. Then, α = (cid:10) f, g T (cid:11) · (cid:10) g , g T (cid:11) − , and we obtain the ﬁnalexpression for the projection operator P f = (cid:10) f, g T (cid:11) · (cid:10) g , g T (cid:11) − · g . (12)Note that if and only if the pre-selected set of functions { g i } Mi =1 are orthonormal with respectto the deﬁned inner product, (cid:10) g , g T (cid:11) − is an identity matrix and the expression can besimpliﬁed by P f = (cid:10) f, g T (cid:11) · g . Finally, we emphasize that the projection operator cruciallydepends on the choice of the deﬁned inner product. In the rest of this article, we consider aninner product deﬁned by averaging the observables over a long trajectory of the dynamical9ystem: (cid:104) f, g (cid:105) := lim T →∞ T (cid:90) T ( f ◦ Φ ( s )) ( g ◦ Φ ( s )) d s, (13)where Φ ( t ) is the solution in the physical-space variables which satisfy Eq. (1) from anyinitial condition, assuming the choice of the initial condition does not change the long-timestatistics.Leveraging the inner-product in the Hilbert space, which was Mori’s original formulation[1], is not the only way of analysis in the Mori–Zwanzig formalism. In Zwanzig’s approach [2],the projection operator P is deﬁned by integrating the unknown degrees of freedom accordingto a speciﬁed statistical distribution often motivated by the equilibrium statistics. The mainidea of this approach is to approximate terms with orthogonal components by the expectationvalue conditioned on the resolved components. Also termed as the nonlinear projection [20],this approach [21, 20, 6] can lead to a nonlinear Markov transition and nonlinear memorykernel in the Generalized Langevin Equation. Our discussion and development below will bebased on Mori’s formulation [1], and we will not consider the nonlinear Zwanzig projection. The GLE (7) is not a closed dynamical system for the exact g ( t ) = g because it contains thegeneralized Langevin noise, F , which is not known and hard to obtain. However, because g ( t ) = g (cid:107) + g ⊥ , the projected image P g ( t ) = P g (cid:107) ( t ) + P g ⊥ ( t ) = g (cid:107) follows the closedevolutionary Eq. (8) and with an initial condition P g ( t ) = g (0). The L -norm of theresidual error of the projected image (cid:107) g − P g (cid:107) := (cid:104) g − P g , g − P g (cid:105) is minimal among allthe schemes that decompose g ( t ) into a parallel component and a non-parallel component.Thus, the projected image P g ( t ) is the best approximation to the evolution of g(t) in theparallel space, and in this sense one can conceive it as the optimal predictor of the exactevolution g ( t ) into the future ( t > F ( t ).A related way to close the dynamics is to apply (cid:10) · , g T (cid:11) to the GLE (7), resulting in theevolutionary equation for the two-time correlation function C ( t ) := (cid:10) g ( t ) , g T (cid:11) :dd t C ( t ) = M · C ( t ) − (cid:90) t K ( t − s ) C ( s ) d s, C (0) = (cid:10) g , g T (cid:11) . (14)Here, C ( t ) is the expected two-time correlation of the observables with respect to an initialcondition Φ distributed according to a prescribed d µ (cf. Eq. (13)). Multiplying C − (0) · g (0)to Eq. (14) from the right, we obtaindd t C ( t ) · C − (0) · g (0) = M · C ( t ) · C − (0) · g (0) − (cid:90) t K ( t − s ) C ( s ) · C − (0) · g (0) d s. (15)Comparing Eq. (15) to the evolutionary Eq. (8), we immediately identify the solution of the10rojected image g (cid:107) ( t ): g (cid:107) ( t ) = C ( t ) · C − (0) · g (0) . (16)Thus, the temporal correlation matrix C ( t ) encodes the information for optimally predictingthe dynamics into the future. In addition, it is straightforward to solve the orthogonalcomponent g ⊥ ( t ) as a superposition of the orthogonal driven force F ( t ): g ⊥ ( t ) = (cid:90) t C ( t − s ) · C − (0) · F ( s ) d s. (17)We illustrate the analysis in Appendix A.As will be seen in Sec. 5, Eq. (14) plays a pivotal role for data-driven learning of theMarkov transition M and memory kernel K ( s ), s ≥

0. We will also see that the solutionof the optimal prediction Eq. (16) establishes the equivalence between the Mori–Zwanzigformalism and the approximate Koopman learning algorithm in Sec. 4.

With a suitable choice of the inner product, there exists a subtle relationship—often referredto as the Generalized Fluctuation-Dissipation (GFD) relationship—between the memorykernel K and the orthogonal dynamics F : K ( s ) = (cid:10) F ( s ) , F T (0) (cid:11) C − (0) . (18)Using the intuitive notations deﬁned in Sec. 3.1, we illustrate how this subtle relationshipemerges. Explicitly, from Eq. (6), we identify the orthogonal dynamics F ( t ) = L M ¯ M e t L ¯ M ¯ M g ¯ M , (19)because g ¯ M (0) ≡ g ¯ M . Next, with the choice of the temporal averaging inner productEq. (13), we can obtain: (cid:10) F ( t ) , F T (0) (cid:11) = lim T →∞ T (cid:90) T L M ¯ M e t L ¯ M ¯ M g ¯ M ◦ Φ ( s ) · [ L M ¯ M g ¯ M ◦ Φ ( s )] T d s. (20)Induced by the dynamics Eq. (1), g ¯ M ( s ) ≡ g ¯ M ◦ Φ ( s ) and g M ( s ) ≡ g M ◦ Φ ( s ) satisfy Eq. (4),so we use the identity L M ¯ M g ¯ M ( s ) = dd s g M ( s ) − L MM g M ( s ) (21)11o replace L M ¯ M g ¯ M ( s ) in (20): (cid:10) F ( t ) , F T (0) (cid:11) = lim T →∞ T (cid:90) T L M ¯ M e t L ¯ M ¯ M g ¯ M ( s ) · (cid:20) dd s g M ( s ) − L MM g M ( s ) (cid:21) T d s = lim T →∞ T (cid:90) T L M ¯ M e t L ¯ M ¯ M g ¯ M ( s ) · dd s g T M ( s ) d s − L M ¯ M e t L ¯ M ¯ M (cid:10) g ¯ M , g T M (cid:11) L T MM = lim T →∞ T (cid:90) T L M ¯ M e t L ¯ M ¯ M g ¯ M ( s ) · dd s g T M ( s ) d s. (22)In the last line, we have used the fact that g M and g ¯ M are orthogonal with respect to theinner product by construction. Next, we perform an integration by part, assume that theboundary terms are bounded and thus converge to 0 after divided by T in the limit T → ∞ ,and use the full dynamics Eq. (4) again to obtain (cid:10) F ( t ) , F T (0) (cid:11) = − lim T →∞ T (cid:90) T L M ¯ M e t L ¯ M ¯ M dd s g ¯ M ( s ) · g T M ( s ) d s = − lim T →∞ T (cid:90) T L M ¯ M e t L ¯ M ¯ M [ L ¯ MM g M ( s ) + L ¯ M ¯ M g ¯ M ( s )] · g T M ( s ) d s = − L M ¯ M e t L ¯ M ¯ M (cid:2) L ¯ MM (cid:10) g M , g T M (cid:11) + L ¯ M ¯ M (cid:10) g ¯ M , g T M (cid:11)(cid:3) = − L M ¯ M e t L ¯ M ¯ M L ¯ MM (cid:10) g M , g T M (cid:11) . (23)Again, we used the orthogonality (cid:104) g M , g ¯ M (cid:105) = 0. Finally, the memory kernel K ( t ) := − L M ¯ M e t L ¯ M ¯ M L ¯ MM can be expressed as the two-time correlation statistics of the orthogonaldynamics F ( t ) and the autocorrelation of the observables (cid:10) g M , g T M (cid:11) : K ( t ) = (cid:10) F ( t ) , F T (0) (cid:11) · (cid:10) g M , g T M (cid:11) − . (24)The above relationship between the two-time statistics of the orthogonal dynamics F ( t )and the memory kernel K ( t ) is referred to as the Generalized Fluctuation-Dissipation rela-tionship. In the more general derivation via operator algebra, GFD holds when the evolu-tionary (Liouville) operator L = L (Φ), which is deﬁned asdd t f ◦ Φ ( t ) = L f ◦ Φ ( t ) , (25)is anti self-adjoint with respect to the chosen inner product, i.e., for any test functions f and h of the physical-space variable Φ , (cid:104) f, L h (cid:105) = − (cid:104)L f, h (cid:105) . (26)By choosing the time-averaging as the inner product (Eq. (20)), we have the advantage that12he anti self-adjoint condition (26) is always fulﬁlled: (cid:104) f, L h (cid:105) = (cid:28) f, dd t h (cid:29) = lim T →∞ T (cid:90) T f ◦ Φ ( t ) dd t [ g ◦ Φ ( t )] d t = − lim T →∞ T (cid:90) T dd t [ f ◦ Φ ( t )] g ◦ Φ ( t ) d t = − (cid:28) dd t f, h (cid:29) = − (cid:104)L f, h (cid:105) . (27)All we need are the minor conditions that the integration by part is valid, and the contribu-tion from the boundary terms is negligible.Furthermore, if the dynamical system is ergodic, the temporal average (20) is identicalto the ensemble average against the stationary distribution d µ stat ( Φ ) of the dynamics inthe physical space (cid:104) f, g (cid:105) = (cid:90) Ω ( f ◦ Φ ) ( g ◦ Φ ) ρ stat ( Φ ) d Φ , (28)where ρ stat is the stationary density function and satisﬁes e t L ∗ ρ stat = ρ stat ; here, the adjoint L ∗ deﬁnes the Perron–Frobenius operator. In the literature, it is generally presented thatthe anti self-adjointness is valid for Hamiltonian systems in a heat bath, e.g., where thereis an induced Gibbs measure [20]. Our analysis shows that GFD is generally valid, even fornon-equilibrium (not necessarily Hamiltonian) systems, so long as we choose time-averagingas the inner product and the observables along a long trajectory is bounded so that thecontribution of the boundary terms in the integration by parts are negligible.Finally, GFD can be considered as a self-consistent condition of the memory kernel K andthe residual orthogonal dynamics F . A real stochastic Langevin system with a white noise(no time correlation) would result in a K ( t ) = K δ ( t ) where K is a constant matrix and δ is the Dirac δ -distribution. The Mori–Zwanzig formalism shows that when a dynamicalsystem is not fully resolved, in general, there exists a non-zero memory kernel K and thus theself-consistent orthogonal dynamics F must be a color noise. Below in Sec. 5 we provide data-driven algorithms to extract the memory kernel and the noise directly from the measuredobservables along a long chain of full simulation. The GFD can serve as a very stringentself-consistency check for the algorithms. Even though we are interested in a continuous-time model, it is common that the observations—either empirical measurements of the physical system or the output of computer simulations—are discrete in time. It is also desirable to store the trajectories of a large set of observablessparsely sampled in time to mitigate to the storage limitation. In this case, the continuous-time Mori–Zwanzig formalism is not adequate. In this section, we provide the result ofa discretized Mori–Zwanzig formulation that is more suitable for discrete-time data. Forbrevity, we only present the result in this section and leave the tedious yet straightforwardderivation in Appendix B.We consider to observe the continuous-time system at discretization of times, t = k ∆, k ∈ Z ≥ and ∆ is not necessarily small. After integrating the GLE Eq. (7) and the evo-lutionary equations for the correlation matrix C ( t ) (Eq. (14)) and the optimal prediction P g ( t ) (Eq. (16)) at the discrete times, the snapshots of the observables of interests satisfy13ery similar mathematical structures of the continuous-time formulations. Speciﬁcally, inAppendix B.1, we establish the discrete-time GLE (cf. Eq. (7)) g (( k + 1) ∆) = k (cid:88) (cid:96) =0 Ω ( (cid:96) )∆ · g (( k − (cid:96) ) ∆) + W k , (29)where Ω ( (cid:96) )∆ ’s are M × M , ∆-dependent matrices which can be deﬁned in terms of thecontinuous-time Markov transition matrix M and memory kernel K , and W k is the discrete-time orthogonal dynamics which can be decomposed into linear functions of the continuous-time orthogonal dynamics F . Similarly, the snapshots of the correlation matrix C satisfy(cf. Eq. (14)) C (( k + 1) ∆) = k (cid:88) (cid:96) =0 Ω ( (cid:96) )∆ · C (( k − (cid:96) ) ∆) , (30)and the snapshots of the projected image P g ( t ) satisfy (cf. Eq. (16)) P g (( k + 1) ∆) = k (cid:88) (cid:96) =0 Ω ( (cid:96) )∆ · P g (( k − (cid:96) ) ∆) . (31)It is tempting to associate the operator Ω (0)∆ to the continuous-time Markov matrix Ω (0)∆ ≈ I + M ∆, where I is the M × M identity matrix, and to associate the operator Ω ( (cid:96) )∆ , (cid:96) >

1, to thecontinuous-time memory kernel by Ω ( (cid:96) )∆ ≈ ∆ K ( (cid:96) ∆). We remark that these representationsinvolving the inﬁnitesimal ∆ (cid:28) Ω ( (cid:96) )∆ in terms of the continuous-time objects ( M and K ( s )) arenot simple; see Lemma 1 in Appendix B.1. Nevertheless, the above Eqs. (29), (30), and (31)are always valid regardless of the choice of ∆. We also establish the GFD relationship inAppendix B.4 (cf. Sec. 3.6).For a generic discrete-time dynamical system, it is also possible to directly derive itsMori–Zwanzig formula (Appendix B.2 and reference [22]). We remark that while the mathe-matical expression of the results of this approach look identical to Eqs. (29)-(31), there existsa subtlety between discretizing the correlation matrix C ( t ) of a continuous-time system (Ap-pendix B.1) and the generic discrete-time correlation matrix (Appendix B.2). We discusssuch a subtlety in Appendix B.3. Furthermore, in Appendix B.4.2, we show that the GFDrelationship is valid when the discrete-time operator L d forward propagate the observables(i.e., g (( k + 1) ∆) = L d g ( k ∆)) is anti-self-adjoint with respect to the invariant measure ofthe discrete-time dynamics (cf. Sec. 3.6). The discretized formulations presented in Sec. 3.7 provide the mathematical structure formaking comparison to the approximate Koopman learning framework [7, 8, 9]. In thissection, we establish that the Mori–Zwanzig formalism with Mori’s projection operator isnot only compatible with, but also a generalization of the Koopman learning framework.14e begin with a short summary of the approximate Kooman learning methods, specif-ically, the extended dynamic mode decomposition (EDMD, [9]). EDMD takes a long tra-jectory of a set of M observables, { g i } Mi =1 , of a nonlinear dynamical system as an input.The snapshots of these descriptors on a uniformly separated temporal grid were recorded as g i ( jδ ), j = 0 . . . N −

1. Generally, δ is conceived as a small time separation. The goal ofthe EDMD is to identify the approximate Kooman operator K Koop δ —note that we use K todenote the memory kernel in the context of Mori–Zwanzig—which linearly maps the previoussnapshots to the consecutive next ones with a minimal squared residual error. Operationally,EDMD stacks up the snapshots into an array of “dependent variables” Y and “independentvariables” X : Y =  g (1 δ ) g (2 δ ) . . . g (( N − δ ) g (1 δ ) g (2 δ ) . . . g (( N − δ )... ... ... g M (1 δ ) g M (2 δ ) . . . g M (( N − δ )  , (32) X =  g (0 δ ) g (1 δ ) . . . g (( N − δ ) g (0 δ ) g (1 δ ) . . . g (( N − δ )... ... ... g M (0 δ ) g M (1 δ ) . . . g M (( N − δ )  , (33)and the M × M matrix ˆK Koop δ is the linear operator which minimizes the mean squared error ε := 1 N − N − (cid:88) j =1 M (cid:88) i =1 (cid:16) y − K Koop δ · x (cid:17) i,j . (34)When the observables form a complete set of basis functions, the dynamics is closed inthe linear spanned space H g , and ε can be minimized to zero. Nevertheless, for generalproblems, it is challenging to identify a complete set of basis functions. Consequently, thereexist non-zero residuals. The learning problem is fundamentally a linear regression problem,which is concave. Formally, the unique minimizer ˆK Koop δ conditioned on a pair of Y and X is K Koop δ = (cid:0) Y · X T (cid:1) · (cid:0) X · X T (cid:1) − , (35)when the set of the observables are linearly independent. When the set of the observablesare not linearly independent, the problem is under-determined and there exist a family ofminimizers [9]; in such a case, the inverse matrix (cid:0) X · X T (cid:1) − can be operationally carried outby the Moore–Penrose pseudoinverse [9] to identify one of the minimizers. In the analysisbelow, we assume that the set of the basis functions is carefully chosen so that they arelinearly independent.Equation (35) exhibits the exact mathematical expressions of the projected image g (cid:107) ( t )in the continuous-time Mori–Zwanzig formalism (Eq. (16)). In the limit of inﬁnitely longsnapshots separated by δ , the matrices Y · X T and X · X T are exactly C ( δ ) and C (0) wherethe inner product is deﬁned to be with respect to the distribution induced by the long15rajectories: C (0) = (cid:10) g , g T (cid:11) = lim T →∞ T (cid:90) T g ◦ Φ ( s ) · g T ◦ Φ ( s ) d s ≈ X · X T , (36a) C ( δ ) = (cid:10) g e δ L , g T (cid:11) = lim T →∞ T (cid:90) T g ◦ Φ ( s + δ ) · g T ◦ Φ ( s ) d s ≈ Y · X T . (36b)Thus, both formulations predict the exact propagator forward δ -time C ( δ ) · C − (0).It is intriguing that the Mori–Zwanzig formulation relies on the projection operator P which requires the equipped inner product of the Hilbert space, but the approximate Koop-man learning framework only relies on the mean L -norm of the error which relies on a lessstrict norm space. Nevertheless, operationally, we can use the geometric interpretation pro-vided in Sec. 3.3 to illustrate the intuition of the two formulations. The Koopman learningframework seeks a point in H g that minimizes the L -norm between the point and g ( δ ),which is generally outside H g (see Fig. 1). In contrast, the Mori–Zwanzig formalism simplyprojects g ( δ ) onto H g . These two formulations are formally identical because the projectedimage P g ( δ ) would be the unique point on H g such that the L -error of the projected im-age to g ( δ ), (cid:107) g ( δ ) − P g ( δ ) (cid:107) = (cid:104) g ( δ ) − P g ( δ ) , g ( δ ) − P g ( δ ) (cid:105) is minimized. We remark thesubtle diﬀerence between the two frameworks. In the Koopman framework, orthogonalitybetween the residual and the H g is not established—there is no notion of an inner product.In the Mori–Zwanzig formalism with the equipped inner product in H , the unique operator C ( kδ ) · C − (0), k ∈ N propagates g (0) to C ( kδ ) · C − (0) · g (0) which is always the projectedimage of g ( kδ ), and the residual g ( kδ ) − P g ( kδ ) is always orthogonal to H g , with respectto the induced inner product.By far, we have established the equivalence of the Mori–Zwanzig formalism and theapproximate Koopman learning framework. Mori–Zwanzig is more general than the existingKoopman learning methods. First, Eq. (31) prescribes the optimal prediction P g (∆) = Ω (0)∆ g (0) if g (0) was sampled from the measure d µ which was used to deﬁne the innerproduct, and the horizon ∆ does not have to be small. Our analysis in Appendix B.1 showsthat it is always possible to identify the unique (but ∆-dependent) operator Ω (0)∆ , regardlessof how large ∆ is. Interestingly, in our derivation provided in Appendix B.1, we can seethat Ω (0)∆ depends on not only the Markov transition M but also the memory kernel K ( s ), s ∈ [0 , ∆). This formally states that when the system is not fully resolved, the forwardoperator cannot be simply approximated by e t M . Instead, the linear operator Ω (0)∆ , whichcan be estimated by our proposed algorithm in Sec. 5, has an implicit memory-kernel ( K ( s ), s ∈ [0 , ∆)) dependence, and thus we cannot simply exponentiate it for predicting the systemfurther than ∆ into the future. Secondly, the discretized Generalized Langevin Equation(29) states that the prediction shall be made with the past history, when it is available andwhen the discretized memory kernel Ω ( (cid:96) )∆ , (cid:96) ≥ g (cid:107) ( j ∆) and the perpendicular component g ⊥ ( j ∆) to forward propagate the system andthus reduce the prediction error. In contrast, in approximate Koopman learning, we onlyuse the current time to predict the next time step and neglect the orthogonal contributionwhich would be estimated by the past trajectory in the context of Mori–Zwanzig formalism.16he advantage of of the Mori–Zwanzig’s history-dependent prediction will be numericallyillustrated in Sec. 6.2. Conventionally, the aim of Mori–Zwanzig analysis is to select a set of coarse-grained vari-ables to construct a parallel space H g in which the projected image P g ( t ) captures the slowmodes of the dynamics, mostly described by the Markov transition matrix M . If the parallelspace fully capture the slow modes, the orthogonal dynamics would predominantly be thefast modes of the dynamics. In such construction by time-scale separation, the orthogonaldynamics F ( t ) are usually modeled by a generic stochastic process, and the self-consistentmemory kernel can be constructed to establish the Generalized Langevin Equation (7) de-scribing the slow modes of the dynamics.The conventional approach is challenging because the selection of the observables re-quires sophisticated understanding of how to separate the modes with diﬀerent timescalesin a complex dynamical systems. Nevertheless, the GLE (7) is always mathematically cor-rect, and thus, if we have collected a large enough set of data of a dynamical system, it ispossible to numerically estimate Markov transition matrix M , memory kernel K , and theorthogonal dynamics F directly from the collected data. Here, we provide two algorithms,one for continuous-time models (Algorithm 1) and the other for discrete-time models (Algo-rithm 2) for estimating the key quantities in the Mori–Zwanzig formalism directly from longtrajectories of the observables.The procedure begins with calculating the two-time correlation functions C ( t ) of theobservables from the recorded long trajectory. Then, we exploit Eqs. (14) (for continuous-time systems) and (30) (for discrete-time systems) to estimate continuous-time M and K and various orders of discrete-time Ω ( (cid:96) ) . After solving these operators, we will use Eqs. (7)and (29) to solve for the orthogonal dynamics F (continuous-time) and W k (discrete-time).Speciﬁcally, for the continuous-time system, we ﬁrst set t = 0 in Eq. (14), leading to˙ C (0) = M · C (0) . (37)Thus, we can solve for the Markov matrix M = ˙ C (0) · C − (0). The estimation of ˙ C (0)can be made by ﬁnite-diﬀerence method, or alternatively, directly computed from the right-hand-side of the dynamical equation (1) in the microscopic simulator. Next, we diﬀerentiateEq. (14) with respect to time t once and obtain¨ C (0) = K (0) · C (0) . (38)The memory kernel evaluated at t = 0 is K (0) = ¨ C (0) · C − (0). Again, ¨ C (0) can eitherbe accessed directly in the simulator, or estimated by ﬁnite-diﬀerence data. Then, we set t = δ (cid:28) (cid:90) δ K ( δ − s ) · C ( s ) d s ≈ δ K (0) · C ( δ ) + K ( δ ) · C (0)) (39)17o solve for K ( δ ) = (cid:104) C ( δ ) − M · C ( δ ) − δ K (0) · C ( δ ) (cid:105) · C − (0). Similarly, we can recur-sively solve for K (( k + 1) δ ), k ∈ N , as functions of previously obtained K (( (cid:96) ) δ ), (cid:96) ≤ k , andthe measured correlation matricies C : K (( k + 1) δ ) =2 (cid:34) ˙ C ( kδ ) − M · C ( kδ ) δ + k (cid:88) (cid:96) =1 K ( (cid:96)δ ) · C (( k − (cid:96) ) δ ) + K (0) · C ( kδ )2 (cid:35) · C − (0) . (40)Once M and K are solved, we use Eq. (7) and the measured trajectory to solve for theorthogonal dynamics F ( t ). A detailed description of our proposed procedure is presented asAlgorithm 1.As for the discrete-time dynamics, we use Eq. (30). Setting k = 0 in (30), C (∆) = Ω (0)∆ C (0) indicates that Ω (0)∆ = C (∆) · C − (0), exactly the approximate Koopman operatorthat one would obtain by carrying out EDMD analysis (cf. Sec. 4). Then, recursively, we cansolve Ω ( k )∆ , k ∈ N in terms of the correlation matrices C and previously solved lower-orderΩ ( (cid:96) )∆ , (cid:96) < k , using Eq. (30): Ω ( k )∆ = (cid:32) C (( k + 1) ∆) − k − (cid:88) (cid:96) =0 Ω ( (cid:96) )∆ · C (( k − (cid:96) ) ∆) (cid:33) · C − (0) . (41)Once the operators Ω ( k )∆ ’s are solved, we use Eq. (29) and the measured trajectory to solve forthe discrete-time noise W . A detailed description of the procedure is presented as Algorithm2. In this section, we present the application of our proposed algorithms. Our aim is to illustratethat the usage and the capability of the algorithms to extract the Mori–Zwanzig operatorsfrom a long trajectory of simulation.Throughout this section, we adopt the Lorenz ’96 system [13] as the test problem. Thesystem consists of N physical-space variables which evolve nonlinearly bydd t φ i ( t ) = ( φ i +1 − φ i − ) φ i − − φ i + F, i = 1 . . . N, (42)with the periodic boundary condition φ − = φ N − , φ = φ N , and φ N +1 = φ . We ﬁxed themodel parameter N = 20 and F = 8, with which the system has a chaotic behavior.We deﬁne our observables to be g ( t ) := φ ( t ), g ( t ) := φ ( t ), g ( t ) := φ ( t ), and g ( t ) = φ ( t ). We also include a constant function g = 1 because the long-time average of theobservables are not zero-meaned. We remark that the results below are conditioned onthe choice of the observables we choose. We use the general-purpose integrator LSODA(by scipy.integrate.solve ivp ) to solve the evolutionary equation (42). LSODA uses18 t ( t ) t ( t ) t ( t ) t ( t ) t ( t ) t ( t ) t ( t ) t ( t ) t ( t ) t ( t ) t ( t ) t ( t ) t ( t ) t ( t ) t ( t ) t ( t ) t ( t ) t ( t ) t ( t ) t ( t ) Figure 2: The dynamics of the physical-space variables { φ i } i =1 of the Lorenz ’96 system (42).The highlighted panels are those chosen variables which serve as the observables g := φ , g := φ , g := φ , and g := φ . Because the observables are not zero-meaned, we alsoinclude a constant function g := 1adaptive time steps for controlling the error of the numerical integration, and can handlestiﬀ ODE systems. We chose a randomized initial condition, integrated the trajectory until t = 10 , and recorded the snapshot of the observables every δ = 0 .

01. The total length t = 10 was deemed suﬃcient from the convergence of the computed correlation matrix C between these chosen observables. We checked that the choice of the initial condition doesnot aﬀect the obtained two-time correlations. A short-time behavior of the dynamical systemis shown in Fig. 2, where the observables are highlighted.We use the collected snapshots of the observables to compute two-time correlation func-tion up to a lag t = 10, as shown in Fig. 3. Then, we apply Algorithm 1 to the the numericallycomputed C ( t ) to extract the continuous-time Markov matrix M =  . . . . . . − . − .

416 0 . − . .

652 0 .

372 0 . − .

795 0 . − . − .

066 0 . − .

068 0 . .

338 0 .

171 0 . − .

378 0 .  , (43)and the memory kernel K ( t ), illustrated in Fig. 4. With the the calculated kernel K , Algo-rithm 1 also quantiﬁes the orthogonal dynamics, F ( t, Φ ( s )) for t ≥ s , which allows us to19 C , C , C , C , C , C , C , C , C , C , C , C , C , C , C , C , C , C , C , C , t C , t C , t C , t C , t C , Figure 3: The two-time correlation function C ( t ) computed from snapshots along a long(10 ) trajectory C ij ( kδ ) = 10 − × (cid:80) − k(cid:96) =1 g i (( k + (cid:96) ) δ ) g j ( kδ ).calculate the right-hand side of the Generalized Fluctuation-Dissipation relationship (18).The GFD, which servers as a stringent self-consistent condition, is numerically veriﬁed andpresented in Fig. 4.We also applied the Algorithm 2 to extract the discrete-time operators Ω ( (cid:96) )∆ . We ﬁrst ﬁx∆ = 30 δ and identify the lowest order Ω (0)∆ which serves like the Markov transition matrixin the discrete-time formulation: Ω (0)∆ =  . . . . . .

827 0 . − .

062 0 . − . .

378 0 .

173 0 . − .

108 0 . . − .

015 0 .

27 0 .

284 0 . .

771 0 .

026 0 . − .

105 0 .  , (44)We present the higher-order operators, Ω ( (cid:96) )∆ in Fig. 5. Algorithm 2 also quantiﬁes the discrete-time orthogonal dynamics W k , which we used to evaluate the right-hand side of the discrete-time GFD relationship (78). Again, the stringent self-consistent GFD illustrates that ournumerical analysis quantiﬁes the operators accurately.For a comparison between the continuous-time and discrete-time operators, we compute20 K GFD K GFD K GFD K GFD K GFD K GFD K GFD K GFD K GFD K GFD K GFD K GFD K GFD K GFD K GFD K GFD K GFD K GFD K GFD K GFD t K GFD t K GFD t K GFD t K GFD t K GFD Figure 4: The memory kernel K ( t ) is illustrated as the solid line. To verify the General-ized Fluctuation-Dissipation (GFD) relationship, the right-hand-side of Eq. (18), GFD ij := (cid:2)(cid:10) F ( t ) · F T (0) (cid:11) · C − (0) (cid:3) ij , is calculated and plotted as the discrete dots, which perfectlyalign with the memory kernel K .the propagator exp(∆ × M ) using the continuous-time Markov operator M computed in 43:exp (∆ × M ) =  . . . . . .

256 0 . − .

117 0 . − . .

296 0 .

114 0 . − . − . − . − .

003 0 .

236 0 .

946 0 . .

149 0 . − . − .

111 0 .  . (45)The diﬀerence between the lowest-order Markov operators, Eqs. (44) and (45), illustratethe diﬀerence between the continuous-time and the discrete-time formulation. The discreteMarkov operator Ω (0)∆ is identical to the approximate Koopman operator if one would carryout the EDMD [9] with a time separation ∆ = 30 δ = 0 .

3. Nevertheless, as pointed out inSec. 4, although the operator Ω (0)∆ can always be estimated by data (using Algorithm 2) andis optimal to predict one-step ∆ into the future, it cannot be approximated by exponentiatingthe instantaneous Markov operator of the continuous-time formulation, exp (∆ × M ). Ouranalysis in Appendix B.1 shows that Ω (0)∆ contains not only the eﬀect of continuous-timeMarkov operator M which is the exact Koopman operator of the dynamics, but also theeﬀect of the continuous-time memory kernel K ( s ), 0 ≤ s ≤ ∆.The continuous-time and discrete-time memory kernels shown in Figs. 4 and 5 showsimilar behavior. Note that in the GLE (7), it is conventional that the memory kernel K [ ( ) ] GFD [ ( ) ] GFD [ ( ) ] GFD [ ( ) ] GFD [ ( ) ] GFD

202 ×10 [ ( ) ] GFD

20 ×10 [ ( ) ] GFD

505 ×10 [ ( ) ] GFD

05 ×10 [ ( ) ] GFD

01 ×10 [ ( ) ] GFD

50 ×10 [ ( ) ] GFD

024 ×10 [ ( ) ] GFD

10 ×10 [ ( ) ] GFD

505 ×10 [ ( ) ] GFD

024 ×10 [ ( ) ] GFD

05 ×10 [ ( ) ] GFD

10 ×10 [ ( ) ] GFD

20 ×10 [ ( ) ] GFD

50 ×10 [ ( ) ] GFD t =505 ×10 [ ( ) ] GFD t =05 ×10 [ ( ) ] GFD t =05 ×10 [ ( ) ] GFD t =02 ×10 [ ( ) ] GFD t =20 ×10 [ ( ) ] GFD Figure 5: We present the extracted higher-order discrete-time operators Ω ( (cid:96) )∆ , (cid:96) ≥ ij := (cid:2)(cid:10) W (cid:96) · W T (cid:11) · C − ( − ∆) (cid:3) ij , is calculatedand plotted as the discrete pluses, which align with the memory kernel Ω ( (cid:96) )∆ .comes with an overall negative sign in front, but in the discrete formulation (Eq. (29)),it is more natural not to put the − (cid:96) ≥ t = 0 until the current time to make prediction—the trajectory with the ﬁnite timescale is suﬃcient. Secondly, the analysis shows that thememory kernels could behave very non-trivially, and the kernels are not likely captured bysimple models. Thirdly, the discrete-time memory kernel seems to be a coarse-grained andsmoothed-out picture of the continuous-time kernel.To investigate further into the ∆-dependence smoothing of the discrete memory op-erators, we calculate Ω ( (cid:96) )∆ , (cid:96) ≥ .

01, 0 .

1, and 0 .

2. Note that the smallest∆ = 0 .

01 corresponds to the smallest time resolution δ = 0 .

01, which we used to computethe continuous-time operators. To properly scale and compare their behavior, we need toscale the discrete operators by ∆ − . The ﬁrst scaling ∆ comes from the fact that for a ﬁxed22hysical memory timescale, a larger ∆ comes with fewer snapshots in the discrete sum ofthe past history. Another way to understand this scaling is that Ω in Eq. (29) is analogousto − K ( s ) d s in Eq. (7), and the scaling comes from ∆ ∼ d s . The second scaling comesfrom the fact that the discrete operators map the system forward to ∆. We present thescaled operators, ∆ − Ω ( (cid:96) )∆ on the same ﬁgure 6, which shows that when ∆ = δ = 0 .

01, thecalculated discrete-time operator by Algorithm 2 converges to the continuous-time kernelcalculated using continuous-time Algorithm 1. The smoothing as we increase ∆ can also beobserved. ( i , j ) = (0, 0) K ij [ ( ) ] ij , =0.01 [ ( ) ] ij , =0.05 [ ( ) ] ij , =0.2 ( i , j ) = (0, 1) K ij [ ( ) ] ij , =0.01 [ ( ) ] ij , =0.05 [ ( ) ] ij , =0.2 ( i , j ) = (0, 2) K ij [ ( ) ] ij , =0.01 [ ( ) ] ij , =0.05 [ ( ) ] ij , =0.2 ( i , j ) = (0, 3) K ij [ ( ) ] ij , =0.01 [ ( ) ] ij , =0.05 [ ( ) ] ij , =0.2 ( i , j ) = (0, 4) K ij [ ( ) ] ij , =0.01 [ ( ) ] ij , =0.05 [ ( ) ] ij , =0.2

05 ×10 ( i , j ) = (1, 0)

20 ×10 ( i , j ) = (1, 1) ( i , j ) = (1, 2) ( i , j ) = (1, 3) ( i , j ) = (1, 4)

05 ×10 ( i , j ) = (2, 0) ( i , j ) = (2, 1)

20 ×10 ( i , j ) = (2, 2) ( i , j ) = (2, 3) ( i , j ) = (2, 4) ( i , j ) = (3, 0) ( i , j ) = (3, 1) ( i , j ) = (3, 2)

210 ×10 ( i , j ) = (3, 3) ( i , j ) = (3, 4) t =05 ×10 ( i , j ) = (4, 0) t =01 ( i , j ) = (4, 1) t =101 ( i , j ) = (4, 2) t =05 ( i , j ) = (4, 3) t =20 ×10 ( i , j ) = (4, 4) Figure 6: Comparisons between the discrete-time memory operators Ω ( (cid:96) )∆ , (cid:96) ≥ − K ( t ). In this section, we compare diﬀerent numerical procedures for making prediction to illus-trate the advantage of the Mori–Zwanzig formalism. Speciﬁcally, we consider the followingproblem setup. Suppose we obtain snapshots of a set of observables along a long trajectory.These snapshots, separated by a ﬁne time resolution δ , were used to compute either the ap-proximate Koopman operator (cf. Sec. 4) or the Mori–Zwanzig operators (cf. Algorithm 2).Next, suppose we were given the snapshots of the observables along a segment of trajectoryof length ζ = mδ , m ∈ N . We denote the snapshots by { g ( kδ ) } mk =1 , noting that g is an M × = nδ , n ∈ N in the future. Speciﬁcally, we are interested in comparing the predictionerrors of Koopman and discrete-time Mori–Zwanzig formulations:1. [The recursive Koopman computation.] With this approach, we use the EDMD[9] to compute the approximate Koopman operator K Koop∆ with a time separation∆ = kδ . Because the aim is to predict nδ into the future, k is restricted to the setof divisors of n . When k = 1, the learned approximate Koopman operator would beclosest to the true Markov operator in the Mori–Zwanzig formalism. When k = n ,the time separation is chosen to be exactly the predictive horizon. We then apply n/k times the learned Koopman operator to advance the last known observables to makeprediction, i.e., g (( m + n ) δ ) ≈ (cid:16) K Koop∆ (cid:17) n/k · g ( mδ ).2. [The recursive discrete-time Mori–Zwanzig computation.] With this approach,we use Algorithm 2 to compute the Mori–Zwanzig operators Ω ( (cid:96) )∆ . We are interested indiﬀerent magnitudes of ∆, noting a restriction that they must be multiples of the ﬁnesttime resolution δ . We chose ∆ = 0 .

01, 0 .

02, 0 .

05, 0 .

1, and 0 .

2. We are also interested inthe predictive error as a function of the memory length, which is restricted as multiplesof ∆. We will integrate Eq. (29) to advance g ( t ) to g ( t + ∆), noting that it is notpossible to estimate the exact noise W k . Instead of injecting artiﬁcially generatedsamples from a noise model, we set W k ’s to zero. Note that the steps needed forsuch integration is also ∆-dependent. For example, when ∆ matches the predictivehorizon (∆ = nδ ), we only need to integrate Eq. (29) once; when ∆ is the ﬁnest timeresolution (∆ = δ ), we need to interatively integrate Eq. (29) n times and accumulatethe prediction g (( m + k ) δ ), 1 ≤ k < n as past history until the horizon is met.We remark that when the memory length is set to zero in the recursive discrete-timeMori–Zwanzig approach, the method converges to the recursive Koopman approach because K Koop∆ ≡ Ω (0)∆ . Thus, the second family of numerical procedures (with diﬀerent settings of∆ and memory length) is a superset of the ﬁrst one.We adopt the L -norm as the measure to compare errors between diﬀerent methods.That is, suppose a method made a prediction g pred and suppose the ground truth is g GT ,the error is computed by ε := (cid:13)(cid:13) g pred − g GT (cid:13)(cid:13) ≡ M (cid:88) i =1 ( g pred i − g GT i ) . (46)To collect the statistics of the prediction error, we generate 2 × samples of segments, eachof which contains m = 500 snapshots, from a long trajectory. Because we are interested inout-of-sample prediction, the long trajectory used to generate test samples is diﬀerent fromthe one we used to compute the correlation matrix (and the operators). Both trajectorieswere generated by integrating the same evolutionary equation (42) with two randomizedinitial conditions; the procedure ensures a consistent inner-product space which is deﬁnedby the long-time statistics.We present the error statistics from the Mori–Zwanzig with diﬀerent memory and pre-diction horizon in Fig. 7. In Fig. 7 (a) and (b), we present the average error over 2 × .0 0.2 0.4 0.6 0.8Prediction horizon050100150 (c) K Koop K Koop n MZ0.0 0.5 1.0 1.5 2.0Memeory27.530.032.535.037.540.0 [ ] (a) Prediction horizon = 0.2 = 0.01= 0.02= 0.05= 0.10= 0.20 0.0 0.5 1.0 1.5 2.0Memeory6080100 [ ] (b) Prediction horizon = 0.8 = 0.01= 0.02= 0.05= 0.10= 0.20

Figure 7: Comparison of the prediction error of diﬀerent numerical procedure. (a) The meanerror of the Mori–Zwanzig formulation for predicting the observable nδ = 0 . nδ = 0 . nδ from (1) the approximate Koopman operator K δ , (2) the approximateKoopman operator K nδ , and (3) recursive Mori–Zwanzig computation.samples when the prediction horizon is chosen as nδ = 0 . . δ = 0 . nδ . This observation is consistent to our reasoning in Sec. 3.5,that the optimal prediction nδ into the future is C ( nδ ) · C − (0), which is the approximateKoopman operator K Koop nδ (cf. Sec. 4). Secondly, as the length of the past history is in-creased, the error decreases but levels oﬀ at a ﬁnite timescale which depends on the choiceof ∆. The improvement comes from an important fact that the given segment of trajectory g ( kδ ), k = 1 . . . m , contains information in both the parallel and orthogonal spaces; seedecomposition Eqs. (8) and (9). By integrating GLE forward in time with the segment oftrajectory, we can evolve both the parallel and orthogonal dynamics beyond t = mδ , despitethe fact that we cannot access the unknown orthogonal dynamics F ( t ≥ mδ ), which is thusset to be zero operationally for making predictions. In contrast, the Koopman approach al-ways predict from the last snapshot at t = mδ and can only optimally predict in the parallelspace by Eq. (8) and misses out more orthogonal contributions. Thirdly and interestingly,given any ﬁxed memory length, the optimal choice of the Mori–Zwanzig formulation is alsoto match the time separation ∆ to that of the prediction horizon. This observation can bereasoned because when we integrate the GLE (29) and when ∆ = nδ , we only missed out asingle contribution from the orthogonal dynamics, W , while any shorter time separationswould require missing out more orthogonal noise incurred during the intermediate steps thatmight compound before the horizon is met. Finally, because of the ﬁnite timescale of thememory kernel, we observed that the memory length does not need to be long before theaccuracy levels oﬀ. To combine the last two points, we conclude that it is not expensive toimprove the accuracy using the discrete-time Mori–Zwanzig formulation: we only need tostore a ﬁnite length of snapshots to carry out the discrete sum in (29). In Fig. 7 (c), wepresent the error as a function of the prediction horizon nδ . In addition to the average error,25e also visualize the standard deviation of the ε in 2 × sample segments. We comparethree methods: Koopman with the smallest time separation ∆ = δ , with the matched timeseparation ∆ = nδ , and discrete Mori–Zwanzig with ∆ = nδ with a ﬁxed memory length0 .

24. We conclude that, the Mori–Zwanzig formulation is consistently more accurate inmaking predictions into the future, and if one must carry out the approximate Koopmananalysis, it is best to match the time separation ∆ to that of the prediction horizon—learningan approximate Koopman operator with a short time separation ∆ and recursively applying K Koop∆

In this article, we showed that the Mori–Zwanzig formalism with Mori’s projector [1] is notonly consistent with but also a generalization of the existing Koopman learning procedures,such as the extended dynamic mode decomposition (EDMD, [9]). The propagator of theprojected image C (∆) · C − (0) for any time separation ∆ is identiﬁed as the discrete-timeapproximate Koopman operator K Koop∆ . We identify that the propagator does not onlycontain the eﬀect of the instantaneous Markov transition matrix M ≡ ˙ C (0) · C − (0), butalso the memory eﬀect from the trajectory between t = 0 to t = ∆ (Lemma 1). Such a historydependence emerged because of the dynamics is not fully resolved—that is, the dynamicsdoes not evolves invariantly in functional subspace H g spanned by the selected observables.Although the Markov transition matrix M is the instantaneous Koopman operator, the ﬁnite-time propagator is exp (∆ M ) only when the dynamical system is fully resolved. In thosepartially observed cases, there always emerge a history-dependent term and the orthogonaldynamics from the unresolved degrees of freedom of the dynamics.Motivated by the data-driven Koopman learning methods, we constructed two numericalalgorithms which extract key operators in the Mori–Zwanzig formalism, speciﬁcally, withMori’s projection operator. One of our algorithms extracts the continuous-time operators,and the other the discrete-time ones. To the lowest order, the algorithms are operationallyidentical to the EDMD and they compute the continuous-time Markov matrix M and thediscrete-time approximate Koopman operator Ω (0)∆ . The novelty of our algorithms is thatthey go beyond the lowest order and proceed with a recursive procedure which uses furthertwo-time correlations to extract the memory kernels (continuous-time K ( s ) and discrete-time Ω ( (cid:96) )∆ , (cid:96) ≥ g ⊥ ( t ), and the prediction can be improvedby incorporating the information. We remark that such a memory-dependent learning is fun-damentally diﬀerent from the recently proposed time-embedded Koopman learning methods[23, 24, 25] which include the past history in the set of the observables. These methods,motivated by the famous Takens’ embedding theorem [26], requires to expand the numberin the set of variables by h times, where h is the number of the past snapshots and h has tobe speciﬁed before the computation. Such an expansion of the set of observables results in amore expensive inversion of the larger C (0) ∈ R Mh × Mh matrix. In comparison, the memorykernel of the Mori–Zwanzig construction requires a single inversion of the C (0) ∈ R M × M andthe memory kernel is constructed by other two-time correlations C ( k ∆). In the future, itwill be our primary interest to investigate into the computational eﬃciency and predictionaccuracy of these two alternative approaches, and synergistic approaches such as solving thememory kernel by our algorithms presented in this article ﬁrst as the initial condition foroptimizing the time-embedding Koopman operators.Finally, although the mathematical construction of the Mori–Zwanzig formalism andKoopman theory is general for any set of observables, the accuracy of the prediction cruciallydepends on the selection of the observables. In practice, it is preferable to adopt a set ofobservables which invoke smaller orthogonal dynamics F ( t ). Our algorithms, which areﬁrst to our knowledge, extract the exact orthogonal dynamics from the data and thus theyprovide an opportunity for us to study the statistics from the extracted orthogonal dynamics.One possibility in the future is to treat the orthogonal dynamics, F ( t ), as another dynamicalsystem and recursively perform Mori–Zwazig learning to telescope into the residual dynamics.We propose another possibility, analogous to Gram–Schmidt process, to use the numericallyextracted F ( t ) for identifying those predominant observables orthogonal to the existing set ofobservables. We expect by including these orthogonal observables, the predictive accuracy ofthe model can be improved. We remark that it is much cheaper to extract the memory kernel K ( s ) than the orthogonal dynamics F ( t ). Since the memory kernel is related to the two-timecorrelation function of the orthogonal dynamics by the Generalized Fluctuation-Dissipationrelationship, it will be an interesting research direction to identify those properties of thememory kernel that can be directly used for selecting the observables. For example, shouldthe optimization objective be minimizing the magnitude of the memory kernel (which ∝ (cid:10) F ( t ) , F T (0) (cid:11) by Eq. (18)), or the timescale of the memory kernel as one would like toachieve in the Markov State Models? Acknowledgments

This project was supported by the Laboratory Directed Research and Development Programat Los Alamos National Laboratory (Project XWX8: Machine Learning for Turbulence).The work has been approved (LA-UR-21-20187) for unlimited distribution. YTL sincerelyappreciates many insightful discussions with Danny Perez.27 eferences [1] Mori H. Transport, collective motion, and Brownian motion. Progress of theoreticalphysics. 1965;33(3):423–455.[2] Zwanzig R. Nonlinear generalized Langevin equations. Journal of Statistical Physics.1973;9(3):215–220.[3] Zwanzig R. Nonequilibrium statistical mechanics. Oxford University Press; 2001.[4] J Evans D, P Morriss G. Statistical Mechanics of Nonequilbrium Liquids. ANU Press;2007.[5] Chen M, Li X, Liu C. Computation of the memory functions in the generalized Langevinmodels for collective dynamics of macromolecules. The Journal of Chemical Physics.2014;141(6):064112. Available from: https://doi.org/10.1063/1.4892412 .[6] Hudson T, Li XH. Coarse-Graining of Overdamped Langevin Dynamics via the Mori–Zwanzig Formalism. Multiscale Modeling & Simulation. 2020;18(2):1113–1135. Availablefrom: https://doi.org/10.1137/18M1222533 .[7] SCHMID PJ. Dynamic mode decomposition of numerical and experimental data. Journalof Fluid Mechanics. 2010;656:5–28.[8] Schmid PJ, Li L, Juniper MP, Pust O. Applications of the dynamic mode decomposition.Theoretical and Computational Fluid Dynamics. 2011;25(1-4):249–259.[9] Williams MO, Kevrekidis IG, Rowley CW. A data–driven approximation of the koop-man operator: Extending dynamic mode decomposition. Journal of Nonlinear Science.2015;25(6):1307–1346.[10] Mezic I. Spectral Properties of Dynamical Systems, Model Reduction and Decomposi-tions. Nonlinear Dynamics. 2005;41:309–325.[11] Koopman BO, Neumann Jv. Dynamical systems of continuous spectra. Proceedingsof the National Academy of Sciences. 1932;18(3):255–263. Available from: .[12] Koopman BO. Hamiltonian systems and transformation in Hilbert space. Proceedingsof the National Academy of Sciences. 1931;17(5):315–318. Available from: .[13] Lorenz EN. Predictability: A problem partly solved. In: Proc. Seminar on predictability.vol. 1; 1996. .[14] Abraham R, Marsden JE, Marsden JE. Foundations of Mechanics. vol. 36. Ben-jamin/Cummings Publishing Company Reading, Massachusetts; 1978.[15] Strogatz SH. Nonlinear Dynamics And Chaos: With Applications To Physics, Biology,Chemistry, And Engineering. 1st ed. CRC press; 2000.2816] Carleman T. Application de la th´eorie des ´equations int´egrales lin´eaires aux syst`emesd’´equations diﬀ´erentielles non lin´eaires. Acta Math. 1932;59:63–87. Available from: https://doi.org/10.1007/BF02546499 .[17] Kowalski K, Steeb WH. Nonlinear Dynamical Systems and Carleman Linearization.WORLD SCIENTIFIC; 1991. Available from: .[18] Ale A, Kirk P, Stumpf MPH. A general moment expansion method for stochastickinetic models. The Journal of Chemical Physics. 2013;138(17):174101. Available from: https://doi.org/10.1063/1.4802475 .[19] Schnoerr D, Sanguinetti G, Grima R. Comparison of diﬀerent moment-closureapproximations for stochastic chemical kinetics. The Journal of Chemical Physics.2015;143(18):185101. Available from: https://doi.org/10.1063/1.4934990 .[20] Alexandre J Chorin, Hald OH, Kupferman R, Chorin AJ, Hald OH, Kupferman R.Optimal prediction with memory. Physica D: Nonlinear Phenomena. 2002 jun;166(3-4):239–257. Available from: .[21] Chorin AJ, Hald OH, Kupferman R. Optimal prediction and the Mori–Zwanzig rep-resentation of irreversible processes. Proceedings of the National Academy of Sciences.2000;97(7):2968–2973. Available from: .[22] Lin KK, Lu F. Data-driven model reduction, Wiener projections, and the Koopman-Mori-Zwanzig formalism. Journal of Computational Physics. 2021;424:109864. Availablefrom: .[23] Brunton SL, Brunton BW, Proctor JL, Kaiser E, Kutz JN. Chaos as an intermittentlyforced linear system. Nature communications. 2017;8(1):1–9.[24] Le Clainche S, Vega JM. Higher Order Dynamic Mode Decomposition. SIAM Journalon Applied Dynamical Systems. 2017;16(2):882–925. Available from: https://doi.org/10.1137/15M1054924 .[25] Kamb M, Kaiser E, Brunton SL, Kutz JN. Time-Delay Observables for Koopman:Theory and Applications. SIAM Journal on Applied Dynamical Systems. 2020;19(2):886–917. Available from: https://doi.org/10.1137/18M1216572 .[26] Takens F. Detecting strange attractors in turbulence. In: Rand D, Young LS, editors.Dynamical Systems and Turbulence, Warwick 1980. Berlin, Heidelberg: Springer BerlinHeidelberg; 1981. p. 366–381. 29

Solution of the orthogonal component

It is straightforward to check that (17), g ⊥ ( t ) = (cid:90) t C ( t − s ) · C − (0) · F ( s ) d s (47)is the solution to (9):dd t g ⊥ ( t ) = dd t (cid:20)(cid:90) t C ( t − s ) · C − (0) · F ( s ) d s (cid:21) (48)= F ( t ) + (cid:90) t d C ( t − s )d t · C − (0) · F ( s ) d s, and by Eq. (14),dd t g ⊥ ( t ) = F ( t ) + (cid:90) t M · C ( t − s ) · C − (0) · F ( s ) d s (49) − (cid:90) t (cid:90) t − s K ( w ) · C ( t − s − w ) · C − (0) · F ( s ) d w d s. We assume that the integrand satisﬁes the conditions which allow the change of the orderof integrations:dd t g ⊥ ( t ) = F ( t ) + M · (cid:90) t C ( t − s ) · C − (0) · F ( s ) d s (50) − (cid:90) t K ( w ) (cid:20)(cid:90) t − w · C ( t − w − s ) · C − (0) · F ( s ) d s (cid:21) d w = F ( t ) + M · g ⊥ ( t ) − (cid:90) t K ( w ) g ⊥ ( t − w ) d w, which is Eq. (17). In addition, g ⊥ ( t ) = 0 satisﬁes the initial condition in (9). B Derivation of the discrete-time Mori–Zwanzig formalism from continuous-time

We present two independent derivations of the discrete-time Mori–Zwanzig formalism, as-suming the underlying process is a continuous-time and deterministic dynamics. In theﬁrst approach presented in B.1, we begin with the continuous-time equations (7), (14),and (16), solving for the quantities evaluated on a temporally evenly sampled time grid t = 0 , ∆ , . . . , and identify the recursive relationship between the solutions. In the secondapproach presented in B.2, we consider to discretize the dynamics ﬁrst by transforming thecontinuous-time operator to a abstract discrete-time map, and re-derive the Mori–Zwanzigequations (analogous to (7), (14), and (16)) to the discrete-time map. Importantly, in bothapproaches, we do not impose inﬁnitesimal constraint on ∆, and it can be any ﬁnite number.30nterestingly, the two constructs delivers the same recursive relationship that resemblesthe continuous-time Mori–Zwanzig equations. We shall refer to these relationships as thediscrete-time Mori–Zwanzig equations. However, there exists a subtle diﬀerence that thesame discrete-time Mori–Zwanzig equations link diﬀerent mathematical objects in the twoapproaches. We discuss this subtle diﬀerence between the two approaches, and a suﬃcientcondition that the two descriptions agree, in B.3. The generalized ﬂuctuation-dissipationrelationship in the discrete-time formulation is discussed in B.4. B.1 Discretization of the continuous-time Mori–Zwanzig equations

The Generalized Langevin Equation (7), the evolutaionary equations for the correlationmatrix C ( t ) (Eq. (14)) and the projected image P g ( t ) = g (cid:107) ( t ) (Eq. (16)) exhibit a similarevoluationary operator which involves the Markov matrix M and the memory kernel K ( t ).Thus, here we consider the evolutionary operator applied on a test matrix T ( t ), which is M × P where M is the number of the observables serving as our basis functions spanning H g . In the case of Eqs. (7) and (14), P = 1, and in the case of Eq. (14), P = M . Supposethe evolutionary equation of the test matrix T ( t ) satisﬁesdd t T ( t ) = M · T ( t ) − (cid:90) t K ( t − s ) · T ( s ) d s, (51)and suppose we only measure the snapshots of the observables at times on a evenly spacedgrid t = [0 , ∆ , . . . ] with a not necessary small separation ∆. We also assume that weknow the initial value T (0). The central aim of this section is to prove that the solution T ( k ∆), k ∈ Z + , can be written as T (( k + 1) ∆) = k (cid:88) (cid:96) =0 Ω ( k ) · T (( k − (cid:96) ) ∆) , (52)where Ω ( k ) are prescribed M × M matrices. We break down the proof into a few separateLemmas. Lemma 1.

Given the evolutionary equation (51) with an initial condition T (0) and thecontinuous-time kernel K ( s ) , for any t ≥ , the solution T ( t ) can be expressed as a linearoperator Ω t parametrized by the continuous-time t operated on the initial condition T (0) : T ( t ) = Ω (0) t · T (0) . (53) Proof.

We observe that the correlation matrix C ( t ) satisﬁes Eq. (14), which has a similar31orm to Eq. (51). It is easy to check that Ω t := C ( t ) · C − (0) is the solution:dd t T ( t ) = dd t (cid:2) C ( t ) · C − (0) · T (0) (cid:3) (54)= (cid:20) dd t C ( t ) (cid:21) · C − (0) · T (0)= (cid:20) M · C ( t ) − (cid:90) t K ( t − s ) · C ( s ) d s (cid:21) · C − (0) · T (0)= M · T ( t ) − (cid:90) t K ( t − s ) · T ( s ) d s. Theorem 1.

Given (1) the evolutionary equation (51) , (2) the continuous-time memorykernel K ( s ) , s ≥ , (3) a non-negative integer k ∈ Z + , and (4) snapshots of past history atdiscrete times T ( j ∆) , j = 0 , , . . . ≤ k , we can express T at a future time k ∆ + τ , < τ ≤ ∆ in terms of linear superpositions of the past snapshots: T ( τ + k ∆) = k (cid:88) (cid:96) =0 Ω ( (cid:96) ) τ · T (( k − (cid:96) ) ∆) , (55) where the higher order operators Ω ( k ) τ , k ∈ N + are recursively deﬁned by Ω ( k ) τ = Ω (0) τ + k ∆ − k − (cid:88) (cid:96) =0 Ω ( (cid:96) ) τ Ω (0)( k − (cid:96) )∆ . (56) Proof.

From Lemma 1, T ( τ + k ∆) = Ω (0) τ + k ∆ T (0) , (57)and the recursive relationship (56) states Ω (0) τ + k ∆ = k (cid:88) (cid:96) =0 Ω ( (cid:96) ) τ Ω (0)( k − (cid:96) )∆ , (58)and thus, T ( τ + k ∆) = k (cid:88) (cid:96) =0 Ω ( (cid:96) ) τ Ω (0)( k − (cid:96) )∆ T (0) = k (cid:88) (cid:96) =0 Ω ( (cid:96) ) τ T (( k − (cid:96) ) ∆) . (59) Corollary 1.

Given the operators (56) , T (( k + 1) ∆) can be expressed as linear combinationof the past snapshots T ( (cid:96) ∆) , (cid:96) = 0 , , . . . k : T (( k + 1) ∆) = k (cid:88) (cid:96) =0 Ω ( (cid:96) )∆ T (( k − (cid:96) ) ∆) . (60)32 orollary 2. Because the correlation matrix C ( t ) satisﬁes Eq. (14) which is of the form (51) ,given the operators (56) , the snapshots of the correlation matrix at discrete times satisfy C (( k + 1) ∆) = k (cid:88) (cid:96) =0 Ω ( (cid:96) )∆ C (( k − (cid:96) ) ∆) . (61) Corollary 3.

Because the projected image P g ( t ) satisﬁes Eq. (16) which is of the form (51) ,given the operators (56) , the snapshots of the projected image at discrete times satisfy P g (( k + 1) ∆) = k (cid:88) (cid:96) =0 Ω ( (cid:96) )∆ P g (( k − (cid:96) ) ∆) . (62) Theorem 2.

Discretized Generalized Langiven Equation. Given the GLE (7) and the asso-ciated operators (56) , the snapshots g ( t ) at discrete times ( k + 1)∆ , k ∈ N , satisfy g (( k + 1) ∆) = k (cid:88) (cid:96) =0 Ω ( (cid:96) )∆ g (( k − (cid:96) ) ∆) + W k . (63) where W k is the discrete-time orthogonal dynamics, which is a linear function of the orthog-onal dynamics F ( t ) , t ≤ ( k + 1)∆ and W k +1 is orthogonal to H g .Proof. As illustrated in Sec. 3.3, the solution of GLE can be written as a general solution g (cid:107) ( t ) and a particular solution g ⊥ ( t ) satisfying Eqs. (8) and (9) respectively. Note that g (cid:107) ( t )is just the projected image P g ( t ), and from Corollary 3, we conclude that g (cid:107) (( k + 1) ∆) = k (cid:88) (cid:96) =0 Ω ( (cid:96) )∆ · g (cid:107) (( k − (cid:96) ) ∆) . (64)33ecause g ( t ) = g (cid:107) ( t ) + g ⊥ ( t ), ∀ t ≥

0, at time t = ( k + 1) ∆, we can equate g (( k + 1) ∆) = g (cid:107) (( k + 1) ∆) + g ⊥ (( k + 1) ∆) (65)= k (cid:88) (cid:96) =0 Ω ( (cid:96) )∆ · g (cid:107) (( k − (cid:96) ) ∆) + g ⊥ (( k + 1) ∆)= k (cid:88) (cid:96) =0 Ω ( (cid:96) )∆ · (cid:2) g (cid:107) (( k − (cid:96) ) ∆) + g ⊥ (( k − (cid:96) ) ∆) (cid:3) + g ⊥ (( k + 1) ∆) − k (cid:88) (cid:96) =0 Ω ( (cid:96) )∆ · g ⊥ (( k − (cid:96) ) ∆)= k (cid:88) (cid:96) =0 Ω ( (cid:96) )∆ · g (( k − (cid:96) ) ∆)+ (cid:34) g ⊥ (( k + 1) ∆) − k (cid:88) (cid:96) =0 Ω ( (cid:96) )∆ · g ⊥ (( k − (cid:96) ) ∆) (cid:35) . We identify W k = g ⊥ (( k + 1) ∆) − k (cid:88) (cid:96) =0 Ω ( k − (cid:96) )∆ g ⊥ ( (cid:96) ∆) . (66)Note that W k is a linear function of the snapshots of g ⊥ , which are linear functions of F ( s ), t ≤ ( k + 1) ∆, and thus W k is orthogonal to H g . B.2 Mori–Zwanzig equations of the generic discrete-time dynamics

We present an intuitive derivation of that is analogous to Sec. 3.1, noting that a more generalderivation can be found in reference [22]. We begin with Eq. (4), and integrate the equationto the discrete time grid t = 0 , ∆ , , . . . to obtain the discrete mapping in the full Hilbertspace H : (cid:20) g M ( t + ∆) g ¯ M ( t + ∆) (cid:21) = e ∆ L · (cid:20) g M ( t ) g ¯ M ( t ) (cid:21) := (cid:20) U MM U M ¯ M U ¯ MM U ¯ M ¯ M (cid:21) · (cid:20) g M ( t ) g ¯ M ( t ) (cid:21) . (67)Given the initial condition g M (0) and g ¯ M (0), the solution of the orthogonal component atthe discrete times, g ¯ M ( k ∆), can be expressed in terms of the historical snapshots of theresolved components, g M ( (cid:96) ∆), (cid:96) = 0 , , . . . k , and the initial condition g ¯ M (0): g ¯ M (( k + 1) ∆) = k (cid:88) (cid:96) =0 U (cid:96) ¯ M ¯ M U ¯ MM g M (( k − (cid:96) ) ∆) + U k ¯ M ¯ M g ¯ M (0) , (68)34hus, g M (( k + 1) ∆) = U MM g M ( k ∆) + U M ¯ M g ¯ M ( k ∆) (69)= U MM g M ( k ∆) + U M ¯ M U k ¯ M ¯ M g ¯ M (0)+ U M ¯ M k − (cid:88) (cid:96) =0 U (cid:96) ¯ M ¯ M U ¯ MM g M (( k − (cid:96) −

1) ∆) . Now we deﬁne Λ (0)∆ := U MM , Λ ( (cid:96) )∆ := U M ¯ M U ( (cid:96) − M ¯ M U ¯ MM , (cid:96) = 1 , , . . . , and V k := U M ¯ M U k ¯ M ¯ M g ¯ M (0)and obtain the discrete-time generalized Langevin equation g (( k + 1) ∆) = k (cid:88) (cid:96) =0 Λ ( (cid:96) )∆ g (( k − (cid:96) ) ∆) + V k . (70)A more general derivation ﬁrst transforms the dynamics Eq. (1) to a discrete map of thesolutions evaluated at t = 0 , ∆ , . . . , Φ ( t + ∆) = U ∆ ( Φ ( t )) (71)Here, U ∆ is the nonlinear operator deﬁned as the solution of the continuous-time equation: U ∆ ( Φ ) := (cid:90) ∆0 R ( Φ ( t )) d t + Φ . (72)With the transformed discrete-time map, U ∆ , we apply the generic Mori–Zwanzig formula-tion for the discrete-time dynamics [22] to obtain the discrete-time GLE (70).Because the samples collected from this picture involves only discrete-time snapshotsseparated by a ﬁnite time ∆, we need to replace the inner product from averaging over acontinuous-time trajectory (Eq. (13)) by a discrete-time sum: (cid:104) f, h (cid:105) ∆ := lim N →∞ N N (cid:88) i =1 (cid:90) Ω f ( Φ ) h ( Φ ) d µ ( Φ ) . (73)We put a subscript under the inner product (cid:104)· , ·(cid:105) ∆ to denote the diﬀerence between (73) andits continuous-time counterpart (13).Finally, a similar analysis to the one we presented in Sec. 3.5 results in the discrete-time recursive relationship between the correlation matrix C ∆ ( k ∆) and the projected image P ∆ g ( k ∆), k = 0 , , . . . , C ∆ (( k + 1) ∆) = k (cid:88) (cid:96) =0 Λ ( (cid:96) )∆ C ∆ (( k − (cid:96) ) ∆) , (74) P ∆ g (( k + 1) ∆) = k (cid:88) (cid:96) =0 Λ ( (cid:96) )∆ P ∆ g (( k − (cid:96) ) ∆) . (75)35gain, we put the subscript to the correlation matrix C ∆ and the projection operator P ∆ to diﬀerentiate them from their counterparts computed with continuous-time inner product,Eq. (13). B.3 Diﬀerence between the two formulations

The discrete-time dynamics in the above two formulations coincide, and it is tempting toequate the operators Ω ( (cid:96) )∆ to Λ ( (cid:96) )∆ , and the discrete-time noise W k to V k . Nevertheless, thereis a subtle diﬀerence between these two formulation, and in general, they do not have to bethe same. The subtlety is that the inner product is deﬁned by integrating over a continuousdomain of time in the ﬁrst formulation, but by summing over a discrete domain of the timein the second derivation. The invariant measure of the former does not have to be equalto the latter. Consequently, the projection operator, which depends on the deﬁnition innerproduct, Eq. (11), does not have to be identical in these two formulation.In our proposed algorithm 2, the discrete-time operators (either Ω ( (cid:96) )∆ or Λ ( (cid:96) )∆ ) and noise(either W k or V k ) are extracted from the correlation matrices. In the ﬁrst formulation, thecorrelation matrix was computed with the inner product Eq. (13), C ( k ∆) := lim T →∞ T (cid:90) k ∆0 g ◦ Φ ( k ∆ + s ) · g T ◦ Φ ( s ) d s, (76)and in the second approach, it is computed by with the inner product Eq. (73), C ∆ ( k ∆) := lim N →∞ N N − (cid:88) i =0 g ◦ Φ (( k + i ) ∆) · g T ◦ Φ ( i ∆) . (77)Our analysis shows that a suﬃcient condition for Ω ( (cid:96) )∆ = Λ ( (cid:96) )∆ and W k = V k is C ( k ∆) = C ∆ ( k ∆). The correlation matrices computed by two approaches are not necessarily identical.A simple harmonic oscillator ˙ x = p and ˙ p = − x with a unit amplitude x ( t ) + p ( t ) = 1 canbe a counterexample. It has a period 2 π , and if we choose ∆ is 2 π/

3, these two formulationscan be quite diﬀerent: the continuous correlation matrix (Eqs. (76)) has (cid:104) x, x (cid:105) = π , but thediscrete correlation matrix ((77)) has (cid:104) x, x (cid:105) = 1 + cos (2 π/

3) + cos (4 π/

3) = 3 / x (0) = 1.Note that in prediction (cf. Sec. 6.2), the correlation matrix obtained from the ﬁrst approachcan be used to project any conﬁguration satisfying x (0) + p (0) = 1, but the correlationmatrix obtained from the second approach can be only used to a subset of possible initialconditions: x (0) ∈ { , π/ , π/ } .The subtlety between the two formulations serves as a caution to the practitioners topay attention to formulating what we aim to learn . On the one hand, the ﬁrst formulationpresented in B.1 learns the discrete-time operators Ω ( (cid:96) )∆ , which have direct connection tothe continuous-time operators because they are deﬁned in terms of the continuous-timeMarkov transition M and the memory kernel K ( t ). Despite the fact that the correlationmatrix is discretized and stored with a coarser resolution, the overall learned operators Ω ( (cid:96) )∆ serve a similar role to the continuous-time Mori–Zwanzig operators. For example, theoperators predicts the same P g ( t ) in H g . Thus, if one aims to learn about the continuous-time operators, the ﬁrst approach is more desirable. The cost of the ﬁrst approach is that36ven though the discretization ∆ can be ﬁnite, one still needs a very ﬁnely sampled temporalgrid to evaluate the continuous-time correlation matrix C ( t ), if an analytical expression of C ( t ) is not possible. We remark that an additional online computation of C in the ﬁne-scalesimulation can make computations eﬃcient. On the other hand, a discretization with aﬁnite ∆ is beneﬁcial because it provides another coarser resolution in which the correlationmatrix is stored. The second formulation presented in B.2 learns the Mori–Zwanzig operatorsof the generic discrete maps . In those cases whose inner product deﬁned by the discrete-time statistics ((77)) is not identical to the continuous-time statistics ((76)), the projectionoperator in the second formulation is not the same as the one in the continuous-formulation.Consequently, the operators would predict a diﬀerent projected image P ∆ g ( t ) in H g . Theabove simple harmonic oscillator provides an intuitive example.Finally, we point out that despite the subtle diﬀerences, if the two measures (statisticsfrom discrete-time snapshots and continuous-time dynamics) are identical, these two formu-lations converge. In this scenario, there is no diﬀerence between the formulations in thetheoretical sense. Nevertheless, computationally, the ﬁrst approach—choosing a small δ tocollect the snapshot data, computing the correlation matrix C , and then discretizing C tomultiples of δ —possesses two advantages. First, because the computation of the correlationmatrix is agnostic to the discretization parameter ∆, one does not need to re-simulate thediscrete-time simulation when we change ∆. Secondly, recall that we need to compute thecorrelation matrix from the snapshots, which were recorded from a high-ﬁdelity simulationwhich is generally computationally expensive. Thus, given a ﬁxed amount of computationalresource, there is generally an upper bound of the physical time which we are allowed tosimulate. Because the ﬁrst approach uses an equal or ﬁner temporal resolution in compari-son to the second ( δ ≤ ∆), we will collect more snapshots before the high-ﬁdelity simulationends. Thus, computationally, the convergence of the correlation matrix is generally betterdue to more samples. B.4 Generalized Fluctuation–Dissipation Relationship in the discrete-time formulations

In each of the frameworks presented in Appendices B.1 and B.2, with respect to the cor-responding inner product (see discussion in B.3), there exists a Generalized Fluctuation-Dissipation relationship between the discrete-time noise ( W k or V k ) and the discrete timeoperators ( Ω ( (cid:96) )∆ or Λ ( (cid:96) )∆ ): Ω ( k )∆ = − (cid:10) W k , W T (cid:11) · C − ( − ∆) , k ∈ N , (78) Λ ( k )∆ = − (cid:10) V k , V T (cid:11) ∆ · C − ( − ∆) , k ∈ N . (79)Here, the ﬁrst inner product in Eq. (78) is an integration against the continuous-time tra-jectory (e.g., Eq. (13)), and the second inner product in Eq. (79) is a discrete sum overthe snapshots (e.g., Eq. (73)). Note that C ( − ∆) ≡ C T (∆) and C ∆ ( − ∆) = C T ∆ (∆) bydeﬁnitions (76) and (77). In this section, we present the proof to the generalized ﬂuctuation-dissipation relationship, Eqs. (78) and (79). 37 .4.1 Discretization of the continuous-time Mori–Zwanzig framework From Eq. (66) and noting that g ⊥ (0) = 0 (because g (0) = g (cid:107) (0)), (cid:10) W k , W T (cid:11) = (cid:42) g ⊥ (( k + 1) ∆) − k (cid:88) (cid:96) =0 Ω ( k − (cid:96) )∆ g ⊥ ( (cid:96) ∆) , g T ⊥ (∆) (cid:43) (80)= (cid:10) g (( k + 1) ∆) − g (cid:107) (( k + 1) ∆) , g T (∆) − g T (cid:107) (∆) (cid:11) − (cid:42) k (cid:88) (cid:96) =0 Ω ( k − (cid:96) )∆ (cid:2) g ( (cid:96) ∆) − g (cid:107) ( (cid:96) ∆) (cid:3) , g T (∆) − g T (cid:107) (∆) (cid:43) . This is because g ⊥ ( t ) = g ( t ) − g (cid:107) . Then, we can express (cid:10) W k , W T (cid:11) in terms of pairs ofobservables, (cid:10) W k , W T (cid:11) = (cid:10) g (( k + 1) ∆) , g T (∆) (cid:11) − (cid:10) g (( k + 1) ∆) , g T (cid:107) (∆) (cid:11) − (cid:10) g (cid:107) (( k + 1) ∆) , g T (∆) (cid:11) + (cid:10) g (cid:107) (( k + 1) ∆) , g T (cid:107) (∆) (cid:11) − k (cid:88) (cid:96) =0 Ω ( k − (cid:96) )∆ (cid:10) g ( (cid:96) ∆) , g T (∆) (cid:11) + k (cid:88) (cid:96) =0 Ω ( k − (cid:96) )∆ (cid:10) g ( (cid:96) ∆) , g T (cid:107) (∆) (cid:11) + k (cid:88) (cid:96) =0 Ω ( k − (cid:96) )∆ (cid:10) g (cid:107) ( (cid:96) ∆) , g T (∆) (cid:11) − k (cid:88) (cid:96) =0 Ω ( k − (cid:96) )∆ (cid:10) g (cid:107) ( (cid:96) ∆) , g T (cid:107) (∆) (cid:11) . (81)The above expression can be simpliﬁed by (cid:10) g ( t ) , g T (cid:107) ( s ) (cid:11) = (cid:10) g (cid:107) ( t ) , g T (cid:107) ( s ) (cid:11) = (cid:10) g (cid:107) ( t ) , g T ( s ) (cid:11) , (82)which leads to (cid:10) W k , W T (cid:11) = (cid:10) g (( k + 1) ∆) , g T (∆) (cid:11) − (cid:10) g (cid:107) (( k + 1) ∆) , g T (cid:107) (∆) (cid:11) − k (cid:88) (cid:96) =0 Ω ( k − (cid:96) )∆ (cid:10) g ( (cid:96) ∆) , g T (∆) (cid:11) + k (cid:88) (cid:96) =0 Ω ( k − (cid:96) )∆ (cid:10) g (cid:107) ( (cid:96) ∆) , g T (cid:107) (∆) (cid:11) = (cid:10) g (( k + 1) ∆) , g T (∆) (cid:11) − k (cid:88) (cid:96) =0 Ω ( k − (cid:96) )∆ (cid:10) g ( (cid:96) ∆) , g T (∆) (cid:11) , (83)The last equality comes from Eq. (62), that g (cid:107) (( k + 1) ∆) − k (cid:88) (cid:96) =0 Ω ( k − (cid:96) )∆ g (cid:107) ( (cid:96) ∆) = 0 ⇒ (cid:42) g (cid:107) (( k + 1) ∆) − k (cid:88) (cid:96) =0 Ω ( k − (cid:96) )∆ g (cid:107) ( (cid:96) ∆) , g T (cid:107) (∆) (cid:43) = 0 . (84)38y deﬁnition, (cid:10) g (( k + 1) ∆) , g T (∆) (cid:11) = C ( k ∆) and (cid:10) g ( (cid:96) ∆) , g T (∆) (cid:11) = C (( (cid:96) −

1) ∆). UsingEq. (61), we can further simplify the two-time correlation of the discrete-time noise W : (cid:10) W k , W T (cid:11) = C ( k ∆) − k (cid:88) (cid:96) =0 Ω ( k − (cid:96) )∆ C (( (cid:96) −

1) ∆)= k − (cid:88) (cid:96) =0 Ω ( k − − (cid:96) )∆ C ( (cid:96) ∆) − k (cid:88) (cid:96) =0 Ω ( k − (cid:96) )∆ C (( (cid:96) −

1) ∆)= k − (cid:88) (cid:96) =0 Ω ( k − − (cid:96) )∆ C ( (cid:96) ∆) − k − (cid:88) (cid:96) (cid:48) = − Ω ( k − − (cid:96) (cid:48) )∆ C ( (cid:96) (cid:48) ∆)= − Ω ( k )∆ C (∆) , (85)and establish Eq. (78) by multiplying C − ( − ∆) to both sides of the above equation. B.4.2 Generalized Fluctuation–Dissipation Relationship in the time-discretizeddynamics

A parallel analysis to the one presented in the above section B.4 can be carried out to proveEq. (79) with the inner product (e.g., Eq. (77)), the Generalized Langevin Equation (70), andthe evolutionary equations of the correlation matrix Eq. (74) and projected image Eq. (75).An alternative derivation, parallel to the derivation in Sec. 3.6, can be carried out. We usethe intuitive notation presented in B.2. By deﬁnition, the discrete-time noise after k snap-shots, starting at the i th snapshot along the long trajectory is V k | i := U M ¯ M U k ¯ M ¯ M g ¯ M ( i ∆).Then, the two-time correlation between the noise with respect to the inner product is (cid:10) V k , V T (cid:11) = lim N →∞ N N (cid:88) i =1 U M ¯ M U k ¯ M ¯ M g ¯ M ( i ∆) g T ¯ M ( i ∆) U T M ¯ M . (86)From the discrete-time mapping Eq. (67), U M ¯ M g ¯ M ( i ∆) = g M (( i + 1) ∆) − U MM g M ( i ∆) , (87)and thus, (cid:10) V k , V T (cid:11) = lim N →∞ N N (cid:88) i =1 U M ¯ M U k ¯ M ¯ M g ¯ M ( i ∆) (cid:2) g T M (( i + 1) ∆) − g T M ( i ∆) U T MM (cid:3) = lim N →∞ N N (cid:88) i =1 U M ¯ M U k ¯ M ¯ M g ¯ M ( i ∆) g T M (( i + 1) ∆) − U M ¯ M U k ¯ M ¯ M (cid:10) g M , g T ¯ M (cid:11) U T MM = lim N →∞ N N (cid:88) i =1 U M ¯ M U k ¯ M ¯ M g ¯ M ( i ∆) g T M (( i + 1) ∆) , (88)39ecause by construction, (cid:10) g M , g T ¯ M (cid:11) = 0. Again, using Eq. (67), U ¯ M ¯ M g ¯ M ( i ∆) = g ¯ M (( i + 1) ∆) − U MM g M ( i ∆), (cid:10) V k , V T (cid:11) = lim N →∞ N N (cid:88) i =1 U M ¯ M U k − M ¯ M g ¯ M (( i + 1) ∆) g T M (( i + 1) ∆) − lim N →∞ N N (cid:88) i =1 U M ¯ M U k − M ¯ M U MM g M ( i ∆) g T M (( i + 1) ∆)= U M ¯ M U k − M ¯ M (cid:10) g ¯ M , g T M (cid:11) − lim N →∞ N N (cid:88) i =1 U M ¯ M U k − M ¯ M U MM g M ( i ∆) g T M (( i + 1) ∆)= − U M ¯ M U k − M ¯ M U MM C ∆ ( − ∆) . (89)By deﬁnition, U M ¯ M U k − M ¯ M U MM = Λ ( k )∆ , and thus, we prove Eq. (79): (cid:10) V k , V T (cid:11) = − Λ ( k )∆ C ∆ ( − ∆) ⇒ Λ ( k )∆ = (cid:10) V k , V T (cid:11) C − ( − ∆) . (90)40 lgorithm 1 Data-driven learning of the continuous-time Mori–Zwanzig operators. Thisalgorithm uses a long trajectory of the reduced-order dynamics to estimate the continuous-time Markov transition matrix M , memory kernel K ( s ), and orthogonal dynamics F ( t ) inthe Generalized Langevin Equation (7). The algorithm requires the trajectories of M a priori selected observables { g i } Mi =1 , measured at ﬁnely and evenly discretized times t = kδ , δ (cid:28) k = 0 . . . N − N (cid:29)

1) trajectory. A number 0 ≤ h < N is required tospeciﬁed the longest estimated horizon of the memory kernel K . The algorithm will deliverestimates of the (1) Markov transition matrix M , (2) memory kernel at the discretizedtime points K ( kδ ), and (3) the orthogonal dynamics F ( kδ | i ), evaluated at discrete times k = 0 . . . h −

1, conditioned on the system started at the i th snapshot (i.e., the system’sinitial condition was set at g ( iδ )). for k in { , . . . h + 1 } do C ij ( kδ ) ← N − k (cid:104)(cid:80) N − k − (cid:96) =0 g i (( k + (cid:96) ) δ ) × g j ( (cid:96)δ ) (cid:105) end forfor k in { , } do C ij ( − kδ ) ← C ji ( kδ ) end forfor k in {− , , . . . h } do ˙ C ij ( kδ ) ← δ [ C ij (( k + 1) δ ) − C ij (( k − δ )] end forfor k in { , . . . h } do ¨ C ij (0) ← δ (cid:104) ˙ C ij ( δ ) − ˙ C ij ( − δ ) (cid:105) end forM ← ˙ C (0) · C − (0) K (0) ← ¨ C (0) · C − (0) for k in { , . . . h } doK ( kδ ) ← (cid:104) ˙ C (0) − M · C ( kδ ) δ + (cid:80) k − (cid:96) =1 K ( (cid:96)δ ) · C (( k − (cid:96) ) δ ) + K (0) · C ( kδ )2 (cid:105) · C − (0) end forfor i in { , . . . N − h − } dofor k in { , . . . h } do m ← δ (cid:104) K (0) · g ( kδ ) + 2 (cid:80) k − (cid:96) =1 K ( (cid:96)δ ) · C (( k − (cid:96) ) δ ) + K ( kδ ) · g (0) (cid:105) F ( kδ | i ) ← δ [ g (( k + 1) δ ) − g ( kδ )] − M · g ( kδ ) + m end forend forreturn M , K ( kδ ), F ( (cid:96)δ | i ) 41 lgorithm 2 Data-driven learning of the discrete-time Mori–Zwanzig operators. This al-gorithm estimates the discrete-time operators Ω ( (cid:96) )∆ and the orthogonal dynamics W in thediscrete-time Generalized Langevin Equation (29) from a long trajectory of the dynami-cal simulations of the dynamical system. The algorithm requires the snapshots of a setof M a priori selected observables { g i } Mi =1 measured at evenly distributed times t = k ∆, k = 0 . . . N −

1, along a long ( N (cid:29)

1) trajectory. In contrast to Algorithm 1, the timeseparation of the snapshots ∆ does not necessarily to be small. A number 0 ≤ h < N isrequired to speciﬁed the highest order of the discrete-time operators ( Ω ( h ) ). The algorithmdelivers estimates of (1) the discrete-time operators Ω ( k ∆) and (2) the orthogonal dynamics W ( k ∆ | i ∆), k = 0 , , . . . h conditioned on the system started at the i th snapshot (i.e., thesystem’s initial condition was set as g ( i ∆)). for k in { , . . . h + 1 } do C ij ( k ∆) ← N − k (cid:104)(cid:80) N − k − (cid:96) =0 g i (( k + (cid:96) ) ∆) × g j ( (cid:96) ∆) (cid:105) end forΩ (0)∆ ← C (∆) · C − (0) for k in { , . . . h } doΩ ( k )∆ ← (cid:104) C (( k + 1) ∆) − (cid:80) k − (cid:96) =0 Ω ( (cid:96) ) · C (( k − (cid:96) ) ∆) (cid:105) · C − (0) end forfor i in { , . . . N − h − } dofor k in { , . . . h } doW ( k ∆ | i ∆) ← g ( i + k + 1) − (cid:80) k(cid:96) =0 Ω ( (cid:96) ) · g (( i + k − (cid:96) ) ∆) end forend forreturn Ω ( k )∆ , W ( k ∆ | ii