ISALT: Inference-based schemes adaptive to large time-stepping for locally Lipschitz ergodic systems
Xingjie Li, Fei Lu, and Felix X.-F. Ye
Abstract
Efficient simulation of SDEs is essential in many applications, particularly for ergodic systems that demand efficient simulation of both short-time dynamics and large-time statistics. However, locally Lipschitz SDEs often require special treatments, such as implicit schemes with small time-steps, to accurately simulate the ergodic measure. We introduce a framework to construct inference-based schemes adaptive to large time-steps (ISALT) from data, achieving a reduction in time by several orders of magnitude. The key is the statistical learning of an approximation to the infinite-dimensional discrete-time flow map. We explore the use of numerical schemes (such as the Euler-Maruyama, a hybrid RK4, and an implicit scheme) to derive informed basis functions, leading to a parameter inference problem. We introduce a scalable algorithm to estimate the parameters by least squares, and we prove the convergence of the estimators as the data size increases. We test ISALT on three non-globally Lipschitz SDEs: the 1D double-well potential, a 2D multiscale gradient system, and the 3D stochastic Lorenz equation with degenerate noise. Numerical results show that ISALT can tolerate time-step magnitudes larger than plain numerical schemes. It reaches optimal accuracy in reproducing the invariant measure when the time-step is medium-large.
Keywords: Stochastic differential equations, inference-based scheme, model reduction in time, locally Lipschitz ergodic systems, data-driven modeling
Efficient and accurate simulation of SDEs is important in many applications such as Monte Carlo sampling, data assimilation, and predictive modeling (see e.g., [KP99, MT02, EEL+07, KMK03, LL10, CLM+]). While explicit schemes are efficient, for locally Lipschitz SDEs they tend to be numerically unstable and may miss the invariant measure even for small time-steps (for example, the Euler-Maruyama scheme, because it destroys the Lyapunov structure [RT96, MSH02]), and they require special treatments such as taming schemes with small time-step sizes [HJK12, HJ15]. Implicit schemes, on the other hand, are numerically stable and can accurately simulate the invariant measure when the time-step is small. However, they are computationally inefficient due to the limited time-step size and the costly implicit step.

We introduce ISALT, a statistical learning framework to construct inference-based schemes adaptive to large time-stepping from data. When the data are generated from an implicit scheme, ISALT combines the advantages of both explicit and implicit schemes: it is as fast as an explicit scheme and as accurate as an implicit scheme in producing the invariant measure. The inference is done once and for all, and the inferred scheme can be used for general-purpose simulations, either long trajectories or ensembles of trajectories with different initial distributions.

More specifically, we consider the large time-step approximation of the ergodic SDE with additive noise
$$dX_t = f(X_t)\,dt + \sigma\,dB_t, \qquad (1.1)$$
where the drift $f:\mathbb{R}^d\to\mathbb{R}^d$ is locally Lipschitz. Here $B$ is a standard $m$-dimensional Brownian motion with $m\le d$; the diffusion matrix $\sigma\in\mathbb{R}^{d\times m}$ has linearly independent columns, and they represent a degenerate noise when $m<d$. Our goal is to design an explicit scheme with large time-stepping that can efficiently and accurately simulate both short-time dynamics and large-time statistics such as invariant measures.

We infer such explicit schemes with large time-stepping from offline data generated by an implicit scheme. Figure 1 shows a schematic plot of the procedure.
The essential task is to approximate the infinite-dimensional discrete-time flow map. A major difficulty in a statistical learning approach is the curse of dimensionality (COD) when using generic basis functions. Our key contribution is to approximate the flow map by parametrizing numerical schemes, which provide informed basis functions, thus avoiding the COD by harnessing the rich information and structure in classical numerical schemes. We also introduce a scalable algorithm to compute the maximum likelihood estimator by least squares, which converges and is asymptotically normal as the data size increases (see Theorem 3.5). Furthermore, we show that the inferred scheme, when it parametrizes an explicit scheme, has the same 1-step strong order as the explicit scheme.

[Figure 1: Schematic plot of inferring an explicit scheme with a large time-step. The implicit scheme resolves the true flow map $F(X_{t_n}, B_{[t_n,t_{n+1}]}, t_n, t_{n+1})$ with small steps; from the data $\{X_{t_n}, \Delta B_{t_n}\}$, we infer an explicit scheme using the approximate flow map $F^\delta(X_{t_n}, \Delta B_{t_n})$.]

In this study, we focus on learning approximate flow maps that use only the increments of the Brownian motion on each time interval (that is, the function $F^\delta(X_{t_n}, \Delta B_{t_n})$ in Figure 1). We explore the derivation of informed basis functions from three types of classical numerical schemes: the Euler-Maruyama (EM) [KP99], a hybrid RK4 (fourth-order Runge-Kutta) [HP06], and an implicit stochastic split backward Euler (SSBE) [MSH02]; we denote the inferred schemes by IS-EM, IS-RK4, and IS-SSBE. We test them on three non-globally Lipschitz SDEs: the 1D double-well potential, a 2D multiscale gradient system, and the 3D stochastic Lorenz equation with degenerate noise.
Numerical results show that the inferred schemes can tolerate time-steps ten to hundreds of times larger than plain numerical schemes, and they reach optimal accuracy in reproducing the invariant measure at a medium-large time-step (see Figures 2, 4, 7, and 8). Overall, IS-RK4 produces the most accurate invariant measures in all examples, particularly when the dynamics is dominated by the drift (e.g., the Lorenz system), because the plain RK4 provides a higher-order approximation to the drift.

Discretization with large time-stepping for differential equations (SDEs, ODEs, and PDEs) is a model reduction in time, part of the general problem of space-time model reduction (see e.g., [EEL+07, KMK03, LL10, MH13, CL15, LBL16, Lu20, HL20]). Since the large time-step prevents classical numerical approximations based on Taylor expansions, data-driven approaches have been the primary efforts and have witnessed many successes, including time series approaches (see e.g., [CL15, LLC17, LL21]) and deep learning methods that can efficiently solve high-dimensional PDEs and SDEs on rough space-time meshes (see e.g., [HJE18, SS18, BSHHB19, LGO20, YZK20]), to name just a few. In these approaches, the discrete-time models account for the effects of the unresolved dynamics in an averaged fashion through inference, thus leading to computationally efficient models for the effective dynamics [LL10, LLS19, CL15, LLC16]. The contribution of our ISALT is to provide a simple yet effective approach to achieve large time-stepping by combining inference with classical numerical schemes. In particular, the explicit parametric form in ISALT clearly identifies the connection between classical numerical schemes and the models inferred from data. It provides a ground for further understanding fundamental issues of data-based reduced models, such as quantification of the approximation error and optimality of the reduction in time or in space-time.

The exposition of our study proceeds as follows. We first summarize the notation in Table 1. After introducing a flow map view of numerical schemes, we introduce in Section 2 the ISALT framework, that is, the procedure and algorithm for inferring schemes adaptive to the large time-step from data. Section 3 presents the theoretical results on the convergence of the estimators. In Section 4, we test ISALT on the three typical non-globally Lipschitz SDEs. Section 5 concludes with our main findings and an outlook on future research.
Table 1: Notation

$X_t$ and $B_t$ : true state process and original stochastic force
$f(X_t)$, $\sigma\in\mathbb{R}^{d\times m}$ with $m\le d$ : locally Lipschitz drift and diffusion matrix
$dt$ : time-step generating the data
$\delta = \mathrm{Gap}\times dt$ : time-step for the inferred scheme; Gap is a positive integer
$t_i = i\delta$ : discrete time instants of the data
$\{X^{(m)}_{t_0:t_N}, B^{(m)}_{t_0:t_N}\}_{m=1}^M$ : data, $M$ independent paths of $X$ and $B$ at discrete times
$F(X_{t_i}, B_{[t_i,t_{i+1}]}, t_i, t_{i+1})$ : true flow map representing $(X_{t_{i+1}}-X_{t_i})/\delta$
$F^\delta(X_{t_n}, \Delta B_{t_n})$ : approximate flow map using only $X_{t_n}$ and $\Delta B_{t_n} = B_{t_{n+1}} - B_{t_n}$
$\widetilde F^\delta(c^\delta, X_{t_n}, \Delta B_{t_n})$ : parametric approximate flow map
$c^\delta = (c^\delta_1, \ldots, c^\delta_p) = c^\delta_{1:p}$ : parameters to be estimated for the inferred scheme
$\eta_n$ and $\sigma^\delta_\eta$ : iid $N(0, I_d)$ and a diagonal matrix, representing the regression residual
EM and IS-EM : plain Euler-Maruyama and inferred scheme (IS) parametrizing EM
HRK4 and IS-RK4 : plain hybrid RK4 and inferred scheme parametrizing RK4
SSBE and IS-SSBE : split-step stochastic backward Euler and IS parametrizing it

Throughout this study, we assume that the SDE (1.1) is ergodic. Roughly speaking, the SDE is ergodic when (i) there is a Lyapunov function $V:\mathbb{R}^d\to\mathbb{R}^+$ such that $\lim_{|x|\to\infty}V(x) = \infty$ and $\sup_{|x|>R}\mathcal{A}V(x)\to-\infty$ as $R\to\infty$, where $\mathcal{A}$ is the generator for (1.1) given by $\mathcal{A}g = \langle f, \nabla g\rangle + \frac{1}{2}\sum_{i,j=1}^d[\sigma\sigma^\top]_{i,j}\,\partial^2_{x_i x_j}g$ [MSH02]; and (ii) $X$ satisfies a minorization condition that ensures recurrence [MSH02, Kha12].

Our goal is to design a numerical scheme with a large time-step that can efficiently and accurately simulate both short-time dynamics and large-time statistics such as invariant measures. This is of particular interest for SDEs with non-globally Lipschitz drift, because explicit schemes such as Euler-Maruyama often blow up or miss the invariant measure even when they are stable [RT96, MSH02], while implicit schemes are computationally costly though accurate in large-time statistics.

We obtain explicit schemes with large time-steps through inference from offline data generated by an implicit scheme.
The key is to approximate the flow map by parametrizing numerical schemes, instead of using a generic basis, to avoid the curse of dimensionality in the statistical learning of the flow map. Toward this goal, we first introduce the view that numerical schemes are approximations of the flow map, and then we outline the framework of statistical learning of the flow map.

2.1 A flow map view of numerical schemes

A numerical scheme aims to approximate the discrete-time flow map of the stochastic process. More precisely, for a time-step $\delta > 0$, let $t_i = i\delta$ and denote by $(X_{t_i}, i\ge 0)$ the process defined in (1.1) at discrete times. Based on the Markov property of $(X_t)$, a numerical scheme approximates the flow map
$$X_{t_{i+1}} - X_{t_i} = \int_{t_i}^{t_{i+1}} f(X_s)\,ds + \int_{t_i}^{t_{i+1}} \sigma\,dB_s = \delta\,F(X_{t_i}, B_{[t_i,t_{i+1}]}, t_i, t_{i+1}) \approx \delta\,F^\delta(X_{t_i}, \Delta B_{t_i}), \qquad (2.1)$$
where $F$ is a functional depending on $X_{t_i}$, the continuous trajectory $B_{[t_i,t_{i+1}]}$, $t_i$, and $t_{i+1}$. The simplest schemes approximate the functional by a function $F^\delta(X_{t_i}, \Delta B_{t_i})$, in which one represents $B_{[t_i,t_{i+1}]}$ by its increment on the interval, $\Delta B_{t_i} = B_{t_{i+1}} - B_{t_i} \sim N(0, \delta I_d)$. Among many such schemes ([KP99, Hu96, MSH02, Mao07, LM16]), we consider three simple and representative examples: the explicit Euler-Maruyama (EM) scheme [KP99], the hybrid RK4 (HRK4) [HP06], and the split-step stochastic backward Euler (SSBE) [MSH02]:
$$\text{EM:}\quad X_{n+1} = X_n + f(X_n)\,\delta + \sigma\Delta B_n,$$
$$\text{HRK4:}\quad X_{n+1} = X_n + \phi(X_n, \sigma\Delta B_n)\,\delta + \sigma\Delta B_n,$$
$$\text{SSBE:}\quad X_{n+1} = X_\ast + \sigma\Delta B_n, \quad\text{with } X_\ast = X_n + f(X_\ast)\,\delta, \qquad (2.2)$$
where the term $\phi$ is a standard RK4 step with the stochastic force treated as a constant input:
$$\phi(X_n, \sigma\Delta B_n) := (k_1 + 2k_2 + 2k_3 + k_4)/6, \quad\text{with}$$
$$k_1 = f(X_n) + \sigma\Delta B_n/\delta,\quad k_2 = f(X_n + k_1\delta/2) + \sigma\Delta B_n/\delta,$$
$$k_3 = f(X_n + k_2\delta/2) + \sigma\Delta B_n/\delta,\quad k_4 = f(X_n + k_3\delta) + \sigma\Delta B_n/\delta.$$
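As a concrete illustration, the three one-step maps of (2.2) can be sketched in a few lines of code. This is a minimal sketch, not the paper's implementation: the double-well-type drift is a stand-in choice, the hybrid RK4 is written in the constant-forcing form (RK4 applied to $f + \sigma\Delta B/\delta$ held fixed over the step; the bookkeeping of the noise term varies by presentation), and the SSBE fixed point is solved by plain iteration.

```python
import numpy as np

def f(x):
    # Stand-in locally Lipschitz drift (double-well type); illustrative only.
    return x - x**3

def em_step(x, dB, delta, sigma=1.0):
    """Euler-Maruyama (EM): X_{n+1} = X_n + f(X_n)*delta + sigma*dB."""
    return x + f(x) * delta + sigma * dB

def hrk4_step(x, dB, delta, sigma=1.0):
    """Hybrid RK4: a standard RK4 step applied to the field f(.) + g,
    where g = sigma*dB/delta is held constant over the step."""
    g = sigma * dB / delta
    k1 = f(x) + g
    k2 = f(x + 0.5 * delta * k1) + g
    k3 = f(x + 0.5 * delta * k2) + g
    k4 = f(x + delta * k3) + g
    return x + delta * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0

def ssbe_step(x, dB, delta, sigma=1.0, n_iter=50):
    """SSBE: solve the implicit step X* = X_n + f(X*)*delta by fixed-point
    iteration (a contraction for small delta), then add the noise."""
    x_star = x
    for _ in range(n_iter):
        x_star = x + f(x_star) * delta
    return x_star + sigma * dB

rng = np.random.default_rng(0)
delta = 0.01
dB = np.sqrt(delta) * rng.standard_normal()
print(em_step(0.5, dB, delta), hrk4_step(0.5, dB, delta), ssbe_step(0.5, dB, delta))
```

For a small time-step the three maps agree up to the schemes' local errors, which is the flow map view: each is a different approximation of the same $F^\delta$.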
Correspondingly, they approximate the flow map $F(X_{t_i}, B_{[t_i,t_{i+1}]}, t_i, t_{i+1})$ by
$$\text{EM:}\quad F^\delta_{EM}(X_{t_i}, \Delta B_{t_i}) = f(X_{t_i}) + \sigma\Delta B_{t_i}/\delta,$$
$$\text{HRK4:}\quad F^\delta_{RK4}(X_{t_i}, \Delta B_{t_i}) = \phi(X_{t_i}, \sigma\Delta B_{t_i}) + \sigma\Delta B_{t_i}/\delta,$$
$$\text{SSBE:}\quad F^\delta_{SSBE}(X_{t_i}, \Delta B_{t_i}) = (X_\ast - X_{t_i})/\delta + \sigma\Delta B_{t_i}/\delta, \quad\text{with } X_\ast = X_{t_i} + f(X_\ast)\,\delta. \qquad (2.3)$$
For short-time simulation, these schemes are of strong order 1, i.e., the discrete approximation converges to the true solution trajectory-wise in probability at order $\delta$ as the time-step vanishes, since the noise is additive [Rö82, KP99, Hu96, MSH02]. For large-time simulation aiming to approximate the invariant measure, the explicit schemes can be problematic for locally Lipschitz drifts and degenerate noises; for instance, the EM scheme may destroy the Lyapunov structure and fail to be ergodic for any choice of time-step [MSH02, Lemma 6.3]. The implicit scheme SSBE, on the other hand, is ergodic and produces an accurate invariant measure when the time-step is sufficiently small [MSH02, Section 6].

In many applications, it is desirable to have an efficient numerical scheme that is accurate in both short time and large time. A drawback of an implicit scheme is its inefficiency: it has to solve a fixed-point problem in the implicit step, which is computationally costly and limits the time-step size. Taking advantage of implicit schemes, we use them to generate data and learn efficient explicit schemes with large time-steps from the data.

We infer from data an explicit scheme that is accurate in both short-time dynamics and large-time statistics. It maintains the efficiency of explicit schemes while preserving the invariant measure as implicit schemes do. The key idea is to learn an approximation of the flow map from data. To avoid the curse of dimensionality in the learning of the flow map, which is often high-dimensional and nonlinear, we derive parametric functions from the system and its numerical schemes. Roughly, the inference consists of four parts:

1. Generation of faithful data using an implicit scheme with a small time-step size;
2. Derivation of a parametric form to approximate the flow map, by extracting basis functions from the system and its numerical approximations;
3. Parameter estimation by maximum likelihood methods, which leads to a least squares problem when the parametric form is linear in the parameters;
4. Model selection, by cross-validation and convergence criteria.
Data generation
We generate faithful data, consisting of trajectories of the process at discrete times $\{t_i = i\delta\}$, by an accurate implicit scheme. That is, we first solve the system by an implicit scheme with a small time-step $\Delta t < \delta$, then we downsample the solution at the discrete times. We also save the trajectory data of the stochastic force $(B_t)$. Denote these trajectories by
$$\text{Data: } \{X^{(m)}_{t_0:t_N}, B^{(m)}_{t_0:t_N}\}_{m=1}^M, \qquad (2.4)$$
where $N$ denotes the number of observation time grids and $M$ denotes the number of independent trajectories. The initial conditions $\{X^{(m)}_{t_0}\}_{m=1}^M$ are sampled from either a long trajectory, which represents the invariant measure, or an initial distribution that helps to explore the distribution of the process.

Derivation of parametric form:
The major difficulty in the inference is the approximation of the flow map $F(X_{t_i}, B_{[t_i,t_{i+1}]}, t_i, t_{i+1})$, which is an infinite-dimensional functional. When using a non-parametric approach with generic dictionary or basis functions, one encounters the well-known curse of dimensionality (COD): the size of the dictionary or basis increases exponentially as the dimension increases. Recent efforts on overcoming the COD include selecting adaptive-to-data basis functions in a nonparametric fashion [JH20], assuming a low-dimensional interaction between the components of the state variable in the spirit of particle interactions [LZTM19], or deep learning methods that approximate high-dimensional functions through compositions of simple functions [HJE18, SS18, BSHHB19, LGO20, YZK20].

We take a semi-parametric approach: we avoid the COD by deriving parametric functions from the full system and its numerical schemes, which provide rich information about the flow map. In particular, we aim for parametric functions depending linearly on the parameters, so that the parameters can be estimated by least squares and our algorithm is scalable.

We focus on approximating the flow map $F(X_{t_i}, B_{[t_i,t_{i+1}]}, t_i, t_{i+1})$ by the simplest functions $F^\delta(X_{t_i}, \Delta B_{t_i})$, in a parametric form
$$F^\delta(c^\delta; x, \xi) = \sum_{i=1}^p c^\delta_i\,\phi_i(x, \xi), \qquad (2.5)$$
with $\xi$ having the same distribution as $\Delta B_{t_i}$. Here the $\phi_i$ are $\mathbb{R}^d$-valued basis functions to be extracted from numerical schemes (see Section 2.3), and $\{c^\delta_i\}$ are the parameters to be estimated from data. That is, with $(X_n, \xi_n)$ corresponding to $(X_{t_n}, \Delta B_{t_n})$, we infer the following scheme:
$$X_{n+1} = X_n + \delta\,F^\delta(c^\delta; X_n, \xi_n) + \delta\,\sigma^\delta_\eta\eta_n = X_n + \delta\sum_{i=1}^p c^\delta_i\,\phi_i(X_n, \xi_n) + \delta\,\sigma^\delta_\eta\eta_n, \qquad (2.6)$$
where we add $\{\sigma^\delta_\eta\eta_n\}$ to represent the residual of the regression.
For convenience, we assume that $\{\eta_n\}$ is a sequence of iid Gaussian $N(0, I_d)$ random variables, independent of $\{\xi_n\}$, and $\sigma^\delta_\eta$ is a diagonal matrix.

In the view of statistical learning, the function (2.5) approximates the flow map in the function space $\mathcal{H} = \mathrm{span}\{\phi_i(x,\xi)\}_{i=1}^p$, which is a subspace of $L^2(\mathbb{R}^d, \mu\otimes\nu)$, with $\mu$ being the invariant measure of $X$ and $\nu \sim N(0, \delta I_d)$ being the distribution of $\xi$ (which represents $\Delta B_{t_i}$). We refer to $\{\phi_i(x,\xi)\}$ as basis functions and will extract them from numerical schemes (see Section 2.3).

Here we focus on using only $\Delta B_{t_n}$, but one can use more sample points of the trajectory $B_{[t_n,t_{n+1}]}$ and extract terms from high-order approximations based on multiple stochastic integrals [Hu96]. We postpone this to future work.

Parameter estimation: We estimate the parameters by maximizing the likelihood of the model in (2.6) with the data $\{X^{(m)}_{t_0:t_N}, B^{(m)}_{t_0:t_N}\}_{m=1}^M$:
$$l(c^\delta_{1:p}) = \frac{1}{M}\sum_{m=1}^M l(X^{(m)}_{t_0:t_N}, B^{(m)}_{t_0:t_N}\mid c^\delta_{1:p}),$$
where
$$l(X_{t_0:t_N}, B_{t_0:t_N}\mid c^\delta_{1:p}) = \frac{1}{N}\sum_{k=1}^d\sum_{n=0}^{N-1}\left[-\frac{|X^k_{t_{n+1}} - X^k_{t_n} - \delta F^\delta_k(c^\delta, X_{t_n}, \Delta B_{t_n})|^2}{2\delta^2(\sigma^{k,\delta}_\eta)^2} - \frac{1}{2}\log\big(2\pi\delta^2(\sigma^{k,\delta}_\eta)^2\big)\right],$$
and $F^\delta_k$ is the $k$-th entry of the $\mathbb{R}^d$-valued function $F^\delta$ defined in (2.5):
$$F^\delta_k(c^\delta, X_{t_n}, \Delta B_{t_n}) = \sum_{i=1}^p c^\delta_{i,k}\,\phi^k_i(X_{t_n}, \Delta B_{t_n}).$$
Noticing that the likelihood function is quadratic in the parameters $\{c^\delta_{i,k}\}_{i=1}^p$, we estimate them by least squares regression:
$$\widehat{c}^{\,\delta,N,M}_{p,k} = (\bar A^{N,M}_k)^{\dagger}\,\bar b^{N,M}_k, \qquad (\widehat{\sigma}^{\,\delta,N,M}_{\eta,k})^2 = \frac{1}{N\delta^2}\sum_{n=0}^{N-1}|X^k_{t_{n+1}} - X^k_{t_n} - \delta F^\delta_k(\widehat{c}^{\,\delta,N,M}, X_{t_n}, \Delta B_{t_n})|^2, \qquad (2.7)$$
where $A^\dagger$ denotes the pseudo-inverse of $A$, and the normal matrix $\bar A^{N,M}_k$ and vector $\bar b^{N,M}_k$ are given by
$$\bar A^{N,M}_k(i,j) = \frac{1}{MN}\sum_{m=1}^M\sum_{n=0}^{N-1}\phi^k_i(X^{(m)}_{t_n}, \Delta B^{(m)}_{t_n})\,\phi^k_j(X^{(m)}_{t_n}, \Delta B^{(m)}_{t_n}), \quad i,j = 1, \ldots, p,$$
$$\bar b^{N,M}_k(i) = \frac{1}{MN}\sum_{m=1}^M\sum_{n=0}^{N-1}\frac{X^{k,(m)}_{t_{n+1}} - X^{k,(m)}_{t_n}}{\delta}\,\phi^k_i(X^{(m)}_{t_n}, \Delta B^{(m)}_{t_n}), \quad i = 1, \ldots, p. \qquad (2.8)$$
Here the $\widehat{\sigma}^{\,\delta,N,M}_{\eta,k}$, the square roots of the regression residuals, provide the diagonal entries of $\sigma^\delta_\eta$.

The above least squares regression is based on the assumption that the residual $\sigma^\delta_\eta\eta_n$ defined in (2.6) is Gaussian with uncorrelated entries. The entry-wise regression aims to reflect the difference in dynamical scales between entries. One may improve the approximation by considering correlated entries or other distributions for the residual.

Model selection:
The parametric form in Eq. (2.6) leaves many degrees of freedom underdetermined, particularly when we have multiple options for the parametric form, along with possible overfitting and redundancy among these options. We select the estimated scheme by the following criteria:

• Cross-validation: the estimated scheme should be stable and should reproduce the distribution of the process, particularly the main dynamical-statistical properties. We consider the marginal invariant densities and temporal correlations:
$$\text{Invariant density of } (X^k_t):\quad p(z)\,dz = \mathbb{E}\big[\mathbf{1}_{(z,z+dz)}(X^k_{t_n})\big] \approx \frac{1}{NM}\sum_{m,n=1}^{M,N}\mathbf{1}_{(z,z+dz)}(X^{k,(m)}_{t_n}),$$
$$\text{Temporal correlation:}\quad C_k(h) = \mathbb{E}\big[X^k_{t_n+h}X^k_{t_n}\big] \approx \frac{1}{NM}\sum_{m,n=1}^{M,N}X^{k,(m)}_{t_n+h}X^{k,(m)}_{t_n}, \qquad (2.9)$$
for $k = 1, \ldots, d$.

• Convergence of the estimators: if the model is perfect and the data are either independent trajectories or a long trajectory from an ergodic measure, the estimators should converge to the true values as the data size increases (see Theorem 3.2). While our parametric model is not perfect, the estimators should still converge as the data size increases (see Theorem 3.5), and highly oscillatory estimators indicate large misfits between the proposed model and the data.

2.3 Parametrization of numerical schemes

We derive parametric forms to approximate the flow map from numerical schemes. The numerical schemes provide informed basis functions for inference because of their error-controlled approximations to the flow map $F(X_{t_i}, B_{[t_i,t_{i+1}]}, t_i, t_{i+1})$ in (2.1). These basis functions can either be simply the terms in an explicit scheme or terms approximating an implicit scheme. One may view this approach as a parametrization of numerical schemes.

We focus on using only $\Delta B_{t_i}$, the increment of $B_{[t_i,t_{i+1}]}$, and seek parametric functions $\widetilde F^\delta(c^\delta, X_{t_i}, \Delta B_{t_i})$ (as in (2.5)) to approximate the flow map. This constraint has two advantages: first, it makes the inferred scheme computationally efficient, because the inferred scheme generates only two random numbers ($\xi_i, \eta_i$ in (2.6)) in each time step to represent the stochastic forces; second, it significantly reduces the function space of the inference, from a functional depending on the path $B_{[t_i,t_{i+1}]}$ to a function depending only on the increment. By starting from this simple setting, we hope to provide insight for the future design of schemes using multi-point noise by parametrizing high-order stochastic schemes (see e.g. [Hu96, KP99, JK+]). We consider parametric flow maps $\widetilde F^\delta(c^\delta, X_{t_i}, \Delta B_{t_i})$ derived from the EM, RK4, and SSBE schemes. The EM is an explicit one-step scheme, the RK4 is an explicit multi-step scheme, and the SSBE is an implicit one-step scheme.
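For the implicit SSBE, the parametrization below replaces the fixed-point solve by the explicit basis function of (2.11), $\phi_{SSBE}(x) = (I_d - \delta\nabla f(x))^{-1}f(x)$, which costs only one linear solve per step. A minimal sketch, with an assumed illustrative 2D drift and a hand-computed Jacobian (not one of the paper's test systems):

```python
import numpy as np

def f(x):
    # Illustrative locally Lipschitz drift in 2D; a stand-in choice.
    return np.array([x[0] - x[0]**3, -x[1] + x[0] * x[1]])

def grad_f(x):
    # Jacobian of the drift above, computed by hand.
    return np.array([[1.0 - 3.0 * x[0]**2, 0.0],
                     [x[1], -1.0 + x[0]]])

def phi_ssbe(x, delta):
    """phi_SSBE(x) = (I_d - delta * grad_f(x))^{-1} f(x): one linear solve,
    no fixed-point iteration, as in the SSBE-derived basis of (2.11)."""
    d = len(x)
    return np.linalg.solve(np.eye(d) - delta * grad_f(x), f(x))

x = np.array([0.5, -0.2])
print(phi_ssbe(x, 0.05))
```

As the time-step vanishes, $\phi_{SSBE}(x)\to f(x)$, so the basis function is consistent with the drift term of the explicit schemes.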
Linearly parametrizing them or their Ito-Taylor expansions, i.e., adding coefficients to the terms, we obtain the parametric flow maps
$$\text{EM:}\quad \widetilde F^\delta_{EM}(c^\delta; X_{t_i}, \Delta B_{t_i}) = c^\delta_1 X_{t_i} + c^\delta_2 f(X_{t_i}) + c^\delta_3\,\sigma\Delta B_{t_i}/\delta,$$
$$\text{HRK4:}\quad \widetilde F^\delta_{RK4}(c^\delta; X_{t_i}, \Delta B_{t_i}) = c^\delta_1 X_{t_i} + c^\delta_2\,\phi(X_{t_i}, \sigma\Delta B_{t_i}) + c^\delta_3\,\sigma\Delta B_{t_i}/\delta,$$
$$\text{SSBE:}\quad \widetilde F^\delta_{SSBE}(c^\delta; X_{t_i}, \Delta B_{t_i}) = c^\delta_1 X_{t_i} + c^\delta_2\,\phi_{SSBE}(X_{t_i}) + c^\delta_3\,\sigma\Delta B_{t_i}/\delta, \qquad (2.10)$$
where the function $\phi_{SSBE}$ is given by
$$\phi_{SSBE}(X_{t_i}) = (I_d - \delta\nabla f(X_{t_i}))^{-1} f(X_{t_i}). \qquad (2.11)$$
These terms are derived as follows.

• The parametric flow maps $\widetilde F^\delta_{EM}(c^\delta; X_{t_i}, \Delta B_{t_i})$ and $\widetilde F^\delta_{RK4}(c^\delta; X_{t_i}, \Delta B_{t_i})$ come simply from adding coefficients to each term in $F^\delta_{EM}$ and $F^\delta_{RK4}$ of the Euler and RK4 schemes in (2.3).

• We introduce an extra linear term $c^\delta_1 X_{t_i}$. When $f$ is nonlinear, it serves as a linear basis function, and it helps to data-adaptively adjust the linear stability of the inferred scheme.

• The parametric flow map $\widetilde F^\delta_{SSBE}(c^\delta; X_{t_i}, \Delta B_{t_i})$ comes from parametrizing the terms in an approximation of $F^\delta_{SSBE}(X_{t_i}, \Delta B_{t_i})$ in (2.3). More precisely, by the mean-value theorem, there exists a state $\widetilde X_{t_i}$ depending on $X_\ast$ and $X_{t_i}$ such that
$$f(X_\ast) = f(X_{t_i}) + \nabla f(\widetilde X_{t_i})(X_\ast - X_{t_i}) = f(X_{t_i}) + \nabla f(X_{t_i})(X_\ast - X_{t_i}) + R(X_\ast, X_{t_i}, \nabla f), \qquad (2.12)$$
where $R(X_\ast, X_{t_i}, \nabla f) = [\nabla f(\widetilde X_{t_i}) - \nabla f(X_{t_i})](X_\ast - X_{t_i})$. Then, by the definition of $X_\ast$ in the SSBE in (2.3), we have
$$X_\ast = X_{t_i} + \delta f(X_\ast) = X_{t_i} + \delta[f(X_{t_i}) + \nabla f(X_{t_i})(X_\ast - X_{t_i})] + \delta R(X_\ast, X_{t_i}, \nabla f)$$
$$\Rightarrow\quad X_\ast - X_{t_i} = (I_d - \delta\nabla f(X_{t_i}))^{-1}\delta\,[f(X_{t_i}) + R(X_\ast, X_{t_i}, \nabla f)].$$
Thus, we have
$$F^\delta_{SSBE}(X_{t_i}, \Delta B_{t_i}) = (I_d - \delta\nabla f(X_{t_i}))^{-1} f(X_{t_i}) + \sigma\Delta B_{t_i}/\delta + (I_d - \delta\nabla f(X_{t_i}))^{-1}R(X_\ast, X_{t_i}, \nabla f).$$
Assuming that $R(X_\ast, X_{t_i}, \nabla f)$ is negligible, parametrizing the other terms, and adding $c^\delta_1 X_{t_i}$, we obtain $\widetilde F^\delta_{SSBE}$ with $\phi_{SSBE}$ above.
Note that when $f$ is globally Lipschitz (so that $|\nabla f|$ is bounded), we have $\mathbb{E}[|R(X_\ast, X_{t_i}, \nabla f)|] \le C\,\mathbb{E}[|X_\ast - X_{t_i}|^2]$, i.e., $R(X_\ast, X_{t_i}, \nabla f)$ is an order smaller than $X_\ast - X_{t_i}$. However, when $f$ is non-globally Lipschitz (so that $|\nabla f|$ is unbounded), $R(X_\ast, X_{t_i}, \nabla f)$ may be non-negligible and may require additional terms to account for its effect.

Putting the parametric flow maps into the form of (2.6), the corresponding inferred schemes (IS) with the parametrized flow maps in (2.10) are
$$\text{IS-EM:}\quad (X_{t_{i+1}} - X_{t_i})/\delta = c^\delta_1 X_{t_i} + c^\delta_2 f(X_{t_i}) + c^\delta_3\,\sigma\Delta B_{t_i}/\delta + \sigma_\eta\eta_i,$$
$$\text{IS-RK4:}\quad (X_{t_{i+1}} - X_{t_i})/\delta = c^\delta_1 X_{t_i} + c^\delta_2\,\phi(X_{t_i}, \sigma\Delta B_{t_i}) + c^\delta_3\,\sigma\Delta B_{t_i}/\delta + \sigma_\eta\eta_i,$$
$$\text{IS-SSBE:}\quad (X_{t_{i+1}} - X_{t_i})/\delta = c^\delta_1 X_{t_i} + c^\delta_2\,\phi_{SSBE}(X_{t_i}) + c^\delta_3\,\sigma\Delta B_{t_i}/\delta + \sigma_\eta\eta_i. \qquad (2.13)$$
We point out that there are many other options for the parametric form. These three to-be-inferred schemes are typical: IS-EM and IS-RK4 are explicit schemes, and they improve the statistical accuracy of the plain EM or RK4 by design (see Section 3.2). IS-RK4 is based on a multi-step scheme that provides a high-order approximation of the drift, so it is likely to perform better than IS-EM when it is stable. IS-SSBE comes from an implicit scheme and is likely to inherit its stability. The following algorithm summarizes the above procedure for the inference of a scheme.
Input:
Full model; a high fidelity solver preserving the invariant measure.
Output:
Estimated parametric scheme.
1. Generate data: solve the system with the high-fidelity solver, which has a small time-step $dt$; downsample to get time series with $\delta = \mathrm{Gap}\times dt$. Denote the data, consisting of $M$ independent trajectories on $[0, N\delta]$, by $\{X^{(m)}_{t_0:t_N}, B^{(m)}_{t_0:t_N}\}_{m=1}^M$ with $t_i = i\delta$.
2. Pick a parametric form approximating the flow map (2.1), as in (2.5)-(2.6).
3. Estimate the parameters $c^\delta_{1:p}$ and $\sigma_\eta$ as in (2.7).
4. Model selection: run the inferred scheme for cross-validation, and test the consistency of the estimators.
Algorithm 1: Inference-based scheme adaptive to large time-stepping: detailed algorithm
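Algorithm 1 can be sketched end to end for a scalar SDE. This is a hedged illustration in which the drift, step sizes, and data sizes are stand-in choices, the fine-step solver is EM for brevity (the paper generates data with an implicit scheme such as SSBE), the basis is the IS-EM set $\{x, f(x), \sigma\Delta B/\delta\}$ from (2.10), and the coefficients are fit by the least squares of (2.7)-(2.8):

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: x - x**3            # stand-in locally Lipschitz drift (double-well type)
sigma, dt, Gap = 0.5, 1e-3, 20
delta = Gap * dt                  # large time-step: delta = Gap * dt

# Step 1: generate data with a fine-step solver and downsample at the coarse grid.
N = 2000
x, b = 0.1, 0.0
Xs, Bs = [x], [b]
for n in range(N * Gap):
    dW = np.sqrt(dt) * rng.standard_normal()
    x = x + f(x) * dt + sigma * dW
    b = b + dW                    # save the Brownian path along with the state
    if (n + 1) % Gap == 0:        # keep every Gap-th point
        Xs.append(x)
        Bs.append(b)
X, B = np.array(Xs), np.array(Bs)
dB = np.diff(B)                   # increments Delta B_{t_n} on the coarse grid

# Step 2: IS-EM basis from (2.10): {x, f(x), sigma*dB/delta}.
Phi = np.column_stack([X[:-1], f(X[:-1]), sigma * dB / delta])

# Step 3: least squares for (X_{n+1} - X_n)/delta ~ Phi @ c, via the normal equations.
y = np.diff(X) / delta
A = Phi.T @ Phi / len(y)          # normal matrix, as in (2.8)
bvec = Phi.T @ y / len(y)
c_hat = np.linalg.pinv(A) @ bvec  # pseudo-inverse, as recommended in the text
resid = y - Phi @ c_hat
sigma_eta = np.sqrt(np.mean(resid**2))  # residual scale (in flow-map units)

print("IS-EM coefficients:", c_hat, " residual scale:", sigma_eta)
```

Step 4 (model selection) would then simulate the inferred scheme with the fitted coefficients and compare the empirical invariant density and temporal correlations of (2.9) against the data.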
We consider the convergence of the estimators in sample size in two settings: perfect model and imperfect model. The perfect-model setting aims to validate our algorithm, in the sense that the algorithm yields consistent and asymptotically normal estimators. The imperfect-model setting is what we have in practice, and we show that our estimator converges to the (optimal) projection. In particular, we show that an inferred scheme improves the statistical accuracy of its explicit counterpart.

For simplicity of notation, we assume that $d = 1$ throughout this section, but the results also hold entry-wise for systems with $d > 1$. We denote the expectations of $\bar A^{N,M}$ and $\bar b^{N,M}$ in (2.8) by $A$ and $b$:
$$A = \mathbb{E}[\bar A^{N,M}] = \frac{1}{N}\sum_{n=0}^{N-1}\Big(\mathbb{E}\big[\langle\phi_i(X^{(m)}_{t_n}, \Delta B^{(m)}_{t_n}), \phi_j(X^{(m)}_{t_n}, \Delta B^{(m)}_{t_n})\rangle_{\mathbb{R}^d}\big]\Big)_{i,j},$$
$$b = \mathbb{E}[\bar b^{N,M}] = \frac{1}{N}\sum_{n=0}^{N-1}\Big(\mathbb{E}\Big[\Big\langle\frac{X^{(m)}_{t_{n+1}} - X^{(m)}_{t_n}}{\delta}, \phi_i(X^{(m)}_{t_n}, \Delta B^{(m)}_{t_n})\Big\rangle_{\mathbb{R}^d}\Big]\Big)_i. \qquad (3.1)$$
Here the expectation is with respect to the distribution of the process, generated by the initial distribution and the Brownian motion.

Assumption 3.1. (a) Suppose that the data $\{X^{(m)}_{t_0:t_N}, B^{(m)}_{t_0:t_N}\}_{m=1}^M$ are independent trajectories of the system (2.6), with $\{X^{(m)}_{t_0}\}_{m=1}^M$ sampled from the ergodic measure of $X$. (b) Suppose that the normal matrix $\bar A^{N,M}$ in (2.8) and its expectation in (3.1) are invertible. (c) Suppose that the flow map $F^\delta$ in (2.1) is square integrable.

Theorem 3.2 (Consistency and asymptotic normality for the perfect model). Under Assumption 3.1, the estimator in (2.7) converges to $c^\delta$ (the true parameter value) almost surely, and is asymptotically normal, when either $M\to\infty$ or $N\to\infty$:
$$\sqrt{M}\,(\widehat{c}^{\,\delta,N,M} - c^\delta) \xrightarrow{d} \mathcal{N}\Big(0, \frac{\sigma_\eta^2}{N}A^{-1}\Big), \qquad \sqrt{N}\,(\widehat{c}^{\,\delta,N,M} - c^\delta) \xrightarrow{d} \mathcal{N}\Big(0, \frac{\sigma_\eta^2}{M}A^{-1}\Big). \qquad (3.2)$$

Proof.
By the definition of $\bar b^{N,M}$ in (2.8) and the equation (2.6), we have
$$\bar b^{N,M}(i) = \frac{1}{MN}\sum_{m=1}^M\sum_{n=0}^{N-1}\Big\langle\sum_{j=1}^p c^\delta_j\,\phi_j(X^{(m)}_{t_n}, \Delta B^{(m)}_{t_n}) + \sigma_\eta\eta^{(m)}_n,\ \phi_i(X^{(m)}_{t_n}, \Delta B^{(m)}_{t_n})\Big\rangle_{\mathbb{R}^d} = \big(\bar A^{N,M}c^\delta\big)(i) + \bar S^{N,M},$$
where in the second equality we used the definition of $\bar A^{N,M}$ in (2.8), and we denote
$$\bar S^{N,M} = \frac{1}{M}\sum_{m=1}^M S^{N,(m)}, \quad\text{with } S^{N,(m)} = \frac{1}{N}\sum_{n=0}^{N-1}\langle\sigma_\eta\eta^{(m)}_n, \phi_i(X^{(m)}_{t_n}, \Delta B^{(m)}_{t_n})\rangle_{\mathbb{R}^d}.$$
Note that $\eta_n$ is standard Gaussian and is independent of $B_{t_n}$ and $X_{t_n}$. Then $S^{N,(m)}$ has mean zero, and its covariance is
$$\mathrm{Cov}(S^{N,(m)}) = \frac{\sigma_\eta^2}{N^2}\sum_{n,n'=0}^{N-1}\mathbb{E}\big[\langle\eta^{(m)}_n, \phi_i(X^{(m)}_{t_n}, \Delta B^{(m)}_{t_n})\rangle_{\mathbb{R}^d}\langle\eta^{(m)}_{n'}, \phi_j(X^{(m)}_{t_{n'}}, \Delta B^{(m)}_{t_{n'}})\rangle_{\mathbb{R}^d}\big] = \frac{\sigma_\eta^2}{N}A.$$
Thus, when $M\to\infty$, we have by the central limit theorem
$$\sqrt{M}\,\frac{1}{M}\sum_{m=1}^M S^{N,(m)} \xrightarrow{d} \mathcal{N}\Big(0, \frac{\sigma_\eta^2}{N}A\Big); \qquad (3.3)$$
furthermore, $S^{N,(m)}$ is a martingale with respect to the filtration generated by $\{X_{t_n}, B_{t_n}, \eta_n\}$, and when $N\to\infty$, we have by the martingale central limit theorem [HH14, Theorem 3.2]
$$\sqrt{N}\,\frac{1}{M}\sum_{m=1}^M S^{N,(m)} \xrightarrow{d} \mathcal{N}\Big(0, \frac{\sigma_\eta^2}{M}A\Big). \qquad (3.4)$$
We show first that, when $M\to\infty$ and for each fixed $N$, the estimator is consistent and asymptotically normal. Note that by the strong law of large numbers, $\bar A^{N,M}\to A$ and $\bar b^{N,M}\to b$ a.s. as $M\to\infty$. Thus, $(\bar A^{N,M})^{-1}\to A^{-1}$ almost surely (using the fact that $A^{-1} - B^{-1} = A^{-1}(B - A)B^{-1}$; see [LMT20, page 22]). Then, $\widehat{c}^{\,\delta,N,M} = (\bar A^{N,M})^{-1}\bar b^{N,M}\to A^{-1}b$ almost surely, i.e., the estimator is consistent. Combining (3.3) and the almost sure convergence of $(\bar A^{N,M})^{-1}$, we obtain the asymptotic normality by noticing that
$$\widehat{c}^{\,\delta,N,M} = (\bar A^{N,M})^{-1}\bar b^{N,M} = c^\delta + (\bar A^{N,M})^{-1}\bar S^{N,M}.$$
When $N\to\infty$ and $M$ is fixed, we obtain $\bar A^{N,M}\to A$ and $\bar b^{N,M}\to b$ a.s. by the ergodicity of the process. The consistency and asymptotic normality follow similarly by using (3.4).

In practice, the model in our inferred scheme is imperfect because we can rarely parametrize the flow map exactly.
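Before turning to the imperfect-model setting, Theorem 3.2 can be checked numerically in a toy perfect-model experiment: data are simulated from the model (2.6) itself with known coefficients, and the least squares estimator (2.7) recovers them as the number of trajectories $M$ grows. All concrete choices below (basis, coefficients, sizes, initial distribution) are illustrative, not taken from the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(3)
delta, N = 0.05, 50
c_true = np.array([-0.3, 1.0, 1.0])   # known parameters of the toy perfect model
f = lambda x: x - x**3                # stand-in drift
phi = lambda x, xi: np.array([x, f(x), xi / delta])  # basis; xi plays sigma*Delta B

def estimate(M, sigma_eta=0.1):
    """Simulate M trajectories of the model (2.6) and return the LS estimator (2.7)."""
    A = np.zeros((3, 3))
    bvec = np.zeros(3)
    for _ in range(M):
        x = rng.standard_normal()     # illustrative initial distribution
        for _ in range(N):
            xi = np.sqrt(delta) * rng.standard_normal()
            eta = rng.standard_normal()
            Phi = phi(x, xi)
            # Perfect model (2.6): X_{n+1} = X_n + delta*(c.Phi) + delta*sigma_eta*eta
            x_next = x + delta * (c_true @ Phi) + delta * sigma_eta * eta
            A += np.outer(Phi, Phi)
            bvec += (x_next - x) / delta * Phi
            x = x_next
    A /= M * N
    bvec /= M * N
    return np.linalg.pinv(A) @ bvec

err_small = np.linalg.norm(estimate(M=20) - c_true)
err_large = np.linalg.norm(estimate(M=500) - c_true)
print(err_small, err_large)
```

The error is expected to shrink at the Monte Carlo rate $1/\sqrt{M}$, in line with the asymptotic normality in (3.2).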
We show next that, for an imperfect proposed model, the estimator converges to the projection coefficients of the flow map onto the function space spanned by the proposed basis in the ambient $L^2$ space. Furthermore, we show that the inferred scheme improves the statistical accuracy of the explicit scheme that it parametrizes.

Assumption 3.3. (a) Suppose that the data $\{X^{(m)}_{t_0:t_N}, B^{(m)}_{t_0:t_N}\}_{m=1}^M$ are independent trajectories of the system (1.1), with $\{X^{(m)}_{t_0}\}_{m=1}^M$ sampled from the ergodic measure of $X$. (b) Suppose that the normal matrix $\bar A^{N,M}$ in (2.8) and its expectation in (3.1) are invertible. (c) Suppose that the flow map $F^\delta$ in (2.1) is square integrable.

The invertibility of the normal matrices $\bar A^{N,M}$ and $A$ is crucial for our theory, and it leads to constraints on the basis functions. In practice, we can use it to guide the selection of the basis functions, and we recommend using the pseudo-inverse and regularization when the normal matrix is close to singular.

With the notation $A$ and $b$ in (3.1), and assuming that $A$ is invertible, we define
$$c^{\delta,\mathrm{proj}} := A^{-1}b. \qquad (3.5)$$
The following lemma shows that $c^{\delta,\mathrm{proj}}$ is the vector of projection coefficients of the flow map $F^\delta$.

Lemma 3.4.
Under Assumption 3.3, the vector $c^{\delta,\mathrm{proj}}$ in (3.5) is the projection coefficient of the flow map $F^\delta$ in (2.1) onto the space $\mathrm{span}\{\phi_i\}_{i=1}^p$ in $L^2(\mathbb{R}^d\times\Omega^\delta, \mu\otimes\nu)$, with $\mu$ being the invariant measure of $X$ and $(\Omega^\delta, \mathcal{B}, \nu)$ being the canonical probability space for the Brownian motion $(B_t, t\in[0,\delta])$.

Proof. Note that $F^\delta_{t_n} = \frac{X_{t_{n+1}} - X_{t_n}}{\delta}$. Denote $F^{\delta,m}_{t_n} = \frac{X^{(m)}_{t_{n+1}} - X^{(m)}_{t_n}}{\delta}$. By the definition of $b$ in (3.1), we have
$$b(i) = \frac{1}{N}\sum_{n=0}^{N-1}\mathbb{E}\langle F^{\delta,m}_{t_n}, \phi_i(X^{(m)}_{t_n}, \Delta B^{(m)}_{t_n})\rangle_{\mathbb{R}^d} = \mathbb{E}\langle F^\delta_{t_n}, \phi_i(X_{t_n}, \Delta B_{t_n})\rangle_{\mathbb{R}^d},$$
where the second equality follows from the stationarity of $(X_{t_n}, \Delta B_{t_n})$ (and hence of $F^\delta_{t_n}$).

Denote by $c = (c_1, c_2, \ldots, c_p)^\top$ the projection coefficients of $F^\delta_{t_n}$ onto $\mathrm{span}\{\phi_i\}_{i=1}^p$, and write $F^\delta_{t_n} = \sum_{i=1}^p c_i\phi_i + F^\perp$ with $F^\perp$ satisfying $\mathbb{E}[\langle F^\perp, \phi_i\rangle_{\mathbb{R}^d}] = 0$ for each $i = 1, 2, \ldots, p$. Then
$$\mathbb{E}[\langle F^\delta_{t_n}, \phi_i\rangle_{\mathbb{R}^d}] = \sum_{j=1}^p c_j\,\mathbb{E}[\langle\phi_j, \phi_i\rangle_{\mathbb{R}^d}] = (Ac)(i).$$
Combining the above two equations, we obtain $c^{\delta,\mathrm{proj}} = A^{-1}b = c$.

We remark that because $F(X_{t_i}, B_{[t_i,t_{i+1}]}, t_i, t_{i+1})$ is a functional depending on the trajectory $B_{[t_i,t_{i+1}]}$, the function space of the projection, $L^2(\mathbb{R}^d\times\Omega^\delta, \mu\otimes\nu)$, has an infinite-dimensional state space for $(X_{t_i}, B_{[t_i,t_{i+1}]})$. When $F^\delta_{t_n}$ depends only on $(X_{t_n}, \Delta B_{t_n})$ (for instance, in the case of the perfect model discussed in the previous section), the state space becomes finite-dimensional and the function space simplifies to $L^2(\mathbb{R}^d, \mu\otimes\nu)$ with $\nu\sim N(0, \delta I_d)$.

Theorem 3.5 (Convergence of the estimator). In addition to Assumption 3.3, assume that $\mathbb{E}[|F^\delta_{t_0}|^4] < \infty$ and $\mathbb{E}[|\phi_i(X_{t_0}, \Delta B_{t_0})|^4] < \infty$ for each $i = 1, \ldots, p$. Then, we have:

• when $M\to\infty$ and $N$ is fixed, the estimator in (2.7) converges to the projection coefficients $c^{\delta,\mathrm{proj}}$ in (3.5) a.s. and is asymptotically normal:
$$\sqrt{M}\,\big(\widehat{c}^{\,\delta,N,M} - c^{\delta,proj}\big) \xrightarrow{d} \mathcal{N}\big(0,\; A^{-1} \Sigma_N (A^{-1})^\top\big), \qquad (3.6)$$
where the matrix $\Sigma_N$ is the covariance of $\widetilde{b}_{N,m}(i) = \frac{1}{N}\sum_{n=0}^{N-1} b_{n,m}$, with $b_{n,m} = \langle \frac{X^{(m)}_{t_{n+1}} - X^{(m)}_{t_n}}{\delta}, \phi_i(X^{(m)}_{t_n}, \Delta B^{(m)}_{t_n})\rangle_{\mathbb{R}^d}$;

• when $N \to \infty$ with $M$ fixed, the estimator in (2.7) also converges and is asymptotically normal:
$$\sqrt{N}\,\big(\widehat{c}^{\,\delta,N,M} - c^{\delta,proj}\big) \xrightarrow{d} \mathcal{N}\big(0,\; \tfrac{1}{M}\, A^{-1} \Sigma (A^{-1})^\top\big), \qquad (3.7)$$
with $\Sigma = \lim_{N\to\infty} N\,\Sigma_N$, provided that for each $m$, $\sum_{n} \mathbb{E}\big[b_{n,m}\, \mathbb{E}[b_{k,m} \mid \mathcal{M}_0]\big]$ converges for each $k$ and
$$\lim_{N\to\infty} \sum_{k=K}^{\infty} \mathbb{E}\big[b_{k,m}\, \mathbb{E}[b_{N,m} \mid \mathcal{M}_0]\big] = 0 \quad \text{uniformly in } K,$$
where $\mathcal{M}_0$ denotes the filtration generated by the extended stationary process up to time $t_0$.

Proof. When $M \to \infty$, by the strong Law of Large Numbers, we have $\bar{A}_{N,M} \to A$ and $\bar{b}_{N,M} \to b$ a.s. Thus, $\widehat{c}^{\,\delta,N,M} = (\bar{A}_{N,M})^{-1} \bar{b}_{N,M} \to A^{-1} b = c^{\delta,proj}$ a.s. by Lemma 3.4. To prove the asymptotic normality, note that for each $m$, the random vector $\widetilde{b}_{N,m}$ with entries
$$\widetilde{b}_{N,m}(i) = \frac{1}{N} \sum_{n=0}^{N-1} \langle F^{\delta,m}_{t_n}, \phi_i(X^{(m)}_{t_n}, \Delta B^{(m)}_{t_n})\rangle_{\mathbb{R}^d}$$
has mean $\mathbb{E}[\widetilde{b}_{N,m}] = b$ and covariance $\Sigma_N$ with entries $\Sigma_N^{i,j} = \mathbb{E}[\widetilde{b}_{N,m}(i)\, \widetilde{b}_{N,m}(j)] - b(i)\, b(j)$. Here the covariance exists because
$$\mathbb{E}[\widetilde{b}_{N,m}(i)\, \widetilde{b}_{N,m}(j)] \le \max_i \mathbb{E}[|\widetilde{b}_{N,m}(i)|^2] \le \max_i \mathbb{E}\big[|\langle F^{\delta,m}_{t_n}, \phi_i(X^{(m)}_{t_n}, \Delta B^{(m)}_{t_n})\rangle_{\mathbb{R}^d}|^2\big] \le \big(\mathbb{E}[|F^\delta_{t_0}|^4]\big)^{1/2} \max_i \big(\mathbb{E}[|\phi_i(X_{t_0}, \Delta B_{t_0})|^4]\big)^{1/2}.$$
Then, $\bar{b}_{N,M}$ is the average of $M$ iid samples $\{\widetilde{b}_{N,m}\}$, each of which has covariance $\Sigma_N$. Hence, by the central limit theorem, we have $\sqrt{M}(\bar{b}_{N,M} - b) \xrightarrow{d} \mathcal{N}(0, \Sigma_N)$. Combining this with the facts that $\bar{A}_{N,M} \to A$ a.s. and that these matrices are invertible, we obtain (3.6).

When $N \to \infty$, we obtain the convergence and asymptotic normality by ergodicity. First, by ergodicity, we have $\bar{A}_{N,M} \to A$ and $\bar{b}_{N,M} \to b$ a.s. as $N \to \infty$. Thus, $\widehat{c}^{\,\delta,N,M} = (\bar{A}_{N,M})^{-1} \bar{b}_{N,M} \to A^{-1} b = c^{\delta,proj}$ almost surely.
Next, to prove the asymptotic normality, note that by the central limit theorem for stationary processes [Hey74, Theorem 1], we have $\Sigma = \lim_{N\to\infty} N\,\Sigma_N$ and $\sqrt{N}(\bar{b}_{N,M} - b) \xrightarrow{d} \mathcal{N}(0, \tfrac{1}{M}\Sigma)$. Then we obtain (3.7) by noting that $\bar{A}_{N,M} \to A$ a.s. as above.

We show next that the parametrization by inference leads to an improvement over the explicit scheme, in the sense that the residual of the inferred scheme is no larger than that of the explicit scheme. In other words, the 1-step discretization error of the inferred scheme is no larger than that of the explicit scheme.

Theorem 3.6 (Order of residual). Assume that the parametric form in (2.6) comes from an explicit scheme, such as
IS-EM or IS-RK4 in (2.13) from its explicit scheme in (2.3). Then, under Assumption 3.3, the inferred scheme's residual is no larger than the explicit scheme's. Specifically, the residual in (2.7) satisfies
$$\mathbb{E}\big[\widehat{\sigma^2}_{N,M}\big] \le \frac{1}{d}\, \mathbb{E}\Big[\big|\tfrac{X_{t_{n+1}} - X_{t_n}}{\delta} - F^\delta(X_{t_n}, \Delta B_{t_n})\big|^2\Big], \qquad (3.8)$$
where $F^\delta(X_{t_n}, \Delta B_{t_n})$ denotes the flow map of the explicit scheme, such as $F^\delta_{EM}$ or $F^\delta_{RK4}$ in (2.3). In other words, the inferred scheme has the same 1-step strong order as the explicit scheme it parametrizes.

Proof. Write the flow map of the explicit scheme in parametric form: $F^\delta(X_{t_n}, \Delta B_{t_n}) = F^\delta(c^*, X_{t_n}, \Delta B_{t_n})$ as in (2.5). Then, since the estimator $\widehat{c}^{\,\delta,N,M}$ in (2.7) is the minimizer of the likelihood, we have
$$\widehat{\sigma^2}_{N,M} = \frac{1}{d\,\delta^2 MN} \sum_{m=1}^M \sum_{n=0}^{N-1} \big|X^{(m)}_{t_{n+1}} - X^{(m)}_{t_n} - \delta F^\delta(\widehat{c}^{\,\delta,N,M}, X^{(m)}_{t_n}, \Delta B^{(m)}_{t_n})\big|^2 \le \frac{1}{d\,\delta^2 MN} \sum_{m=1}^M \sum_{n=0}^{N-1} \big|X^{(m)}_{t_{n+1}} - X^{(m)}_{t_n} - \delta F^\delta(c^*, X^{(m)}_{t_n}, \Delta B^{(m)}_{t_n})\big|^2.$$
Since the process $(X_{t_n}, \Delta B_{t_n})_n$ is stationary, we have
$$\mathbb{E}\big[\widehat{\sigma^2}_{N,M}\big] = \frac{1}{d\,\delta^2}\, \mathbb{E}\big|X_{t_{n+1}} - X_{t_n} - \delta F^\delta(\widehat{c}^{\,\delta,N,M}, X_{t_n}, \Delta B_{t_n})\big|^2 \le \frac{1}{d}\, \mathbb{E}\big|\tfrac{X_{t_{n+1}} - X_{t_n}}{\delta} - F^\delta(c^*, X_{t_n}, \Delta B_{t_n})\big|^2. \qquad (3.9)$$
Recalling that $F^\delta(X_{t_n}, \Delta B_{t_n}) = F^\delta(c^*, X_{t_n}, \Delta B_{t_n})$, we obtain (3.8).

Remark 3.7 (Order of residual for IS-RK4 and IS-EM). Recall that both the Euler-Maruyama and the HRK4 schemes have $\mathbb{E}\big[|\tfrac{X_{t_{n+1}} - X_{t_n}}{\delta} - F^\delta(c^*, X_{t_n}, \Delta B_{t_n})|^2\big] = O(\delta)$, which follows from Itô's formula. Thus, for the inferred schemes IS-EM and IS-RK4 in (2.3), we have $\mathbb{E}[\widehat{\sigma^2}_{N,M}] = O(\delta)$. Furthermore, by the Law of Large Numbers, we have $\widehat{\sigma^2}_{N,M} \to \mathbb{E}[\widehat{\sigma^2}_{N,M}]$ a.s. when either $M \to \infty$ or $N \to \infty$. Thus, the estimator $\widehat{\sigma}_{N,M} = O(\delta^{1/2})$ a.s. for large $N$ or $M$. However, IS-SSBE's residual is not controlled by SSBE, because SSBE is not in the parametric family of IS-SSBE, and the neglected term $R$ in (2.12) may prevent the residual from decaying (see Figure 5(b)).

Remark 3.8.
The framework of inference-based schemes also applies to non-ergodic systems to achieve reduced-in-time models: the convergence of the estimators in the theorems above remains true when the sample size $M$ goes to infinity. Furthermore, one may accelerate the simulation of slowly converging ergodic systems by training the inference-based schemes iteratively in time. In this study, we focus on non-globally Lipschitz ergodic systems to highlight the ability of the inferred schemes to reproduce long-term statistics.

Remark 3.9 (Optimal reduction in time). We emphasize that our goal is to infer an explicit scheme with a relatively large time-step for efficient simulation of non-globally Lipschitz ergodic systems. Theorem 3.6 shows that the 1-step approximation error (of the flow map $F^\delta$ in (2.1)) decays as the time-step decreases. But a smaller residual due to a smaller time-step does not necessarily imply a better performance of the inferred scheme, because a large error may accumulate in the invariant measure even when the time-step is small (e.g., EM may have a wrong invariant measure). To improve the inferred scheme, we seek an optimal time-step that balances the 1-step approximation error and the error accumulated in the invariant measure. In our numerical examples in the next section, the inferred schemes perform best (in the sense of reproducing the invariant measure and the temporal correlation) when the time-step is moderately large. This is similar to parameter estimation for the homogenization of multiscale processes [PS07], where the sub-sampling rate must lie between the two characteristic time scales of the SDE. One may view the inference as an averaging process through the sampling of the invariant measure, connect the error in the invariant measure with the sampling error, and then find an optimal time-step through a trade-off between the approximation error (i.e., the numerical error) and the sampling error. We leave this as future work.
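In practice, computing the estimator amounts to assembling and solving the empirical normal equations $\bar{A}_{N,M}\, c = \bar{b}_{N,M}$. The sketch below uses hypothetical function and variable names: `basis` stands in for the informed basis functions $\phi_i$, and `X`, `dB` for the observed trajectories and Brownian increments. It averages over all steps of all $M$ trajectories and solves by least squares, using a pseudo-inverse as recommended above when the normal matrix is close to singular.

```python
import numpy as np

def infer_coefficients(X, dB, delta, basis):
    """Least-squares estimator of the coefficients in the parametric flow map.

    X     : array (M, N+1, d) -- M observed trajectories at N+1 discrete times
    dB    : array (M, N, d)   -- Brownian increments over each time-step
    delta : float             -- time-step size
    basis : list of callables phi_i(x, db) -> array (d,)
    """
    M, Np1, d = X.shape
    N = Np1 - 1
    p = len(basis)
    A = np.zeros((p, p))   # empirical normal matrix, entries averaging <phi_i, phi_j>
    b = np.zeros(p)        # empirical right-hand side, entries averaging <F, phi_i>
    for m in range(M):
        for n in range(N):
            F = (X[m, n + 1] - X[m, n]) / delta            # flow-map increment
            Phi = np.stack([phi(X[m, n], dB[m, n]) for phi in basis])
            A += Phi @ Phi.T
            b += Phi @ F
    A /= M * N
    b /= M * N
    # pseudo-inverse guards against a nearly singular normal matrix
    return np.linalg.pinv(A) @ b
```

When the flow map lies exactly in the span of the basis, the estimator recovers the coefficients exactly; otherwise it converges to the projection coefficient $c^{\delta,proj}$ as the sample size grows, as in Theorem 3.5.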
In this section, we test three benchmark examples for each inference-based scheme proposed in (2.13), using two different parametric settings, $c_0$ excluded versus $c_0$ included, so as to distinguish the contribution of the linear term parametrized by $c_0$. The three non-globally Lipschitz examples are: a 1D equation with a double-well potential; a 2D gradient system; and a 3D stochastic Lorenz system with degenerate noise.

In each example, we generate data for inference by the split-step backward Euler (SSBE) scheme with a very fine time-step $\Delta t$. We infer schemes for different time-steps $\delta = \mathrm{Gap} \times \Delta t$, with ten options for the time gap Gap.
These gaps will be used to select the optimal time gap and to demonstrate the convergence order of the residual in Theorem 3.6. The inference computations include five different options: (1) IS-EM with $c_0$ excluded; (2) IS-RK4 with $c_0$ excluded; (3) IS-RK4 with $c_0$ included; (4) IS-SSBE with $c_0$ excluded; and (5) IS-SSBE with $c_0$ included.

We assess the performance of these schemes by the accuracy of the reproduced invariant density (PDF) and auto-correlation function (ACF), which are computed empirically from a long trajectory. The accuracy of the PDF is measured by the total variation distance (TVD) to the reference PDF from data. Once we identify the best-performing scheme for each example, we fix the inference settings and present the convergence of the estimators and of the residuals with respect to the time gap as well as the number of trajectories $M$.

In summary, we find from the examples that

• The inferred schemes have significantly stronger numerical stability than the plain schemes. IS-RK4 and IS-SSBE exhibit better stability than IS-EM; in particular, they can tolerate time-steps significantly larger than the plain RK4 or SSBE. Specifically, the plain RK4 and SSBE (without inferred parameters) blow up already at small time gaps, whereas the inferred schemes remain stable when Gap is larger than 200, improving the efficiency by more than an order of magnitude. We summarize the blow-up gap for the plain versus inferred schemes in the following table.

                1D double-well    2D gradient system    3D Lorenz system
Plain RK4       Gap = 20          Gap = 20              Gap =
IS-RK4          Gap > 200         Gap > 200             Gap >
Plain SSBE      Gap = 40          Gap = 40              Gap =
IS-SSBE         Gap > 200         Gap > 200             Gap >

Table 2: Blow-up time gap for each scheme: plain versus inferred.

• The inferred schemes can reproduce the invariant measure accurately. Both IS-RK4 and IS-SSBE perform well when the stochastic force dominates the dynamics. But when the drift dominates the dynamics, as in the Lorenz system, IS-RK4 performs better than IS-SSBE, because it provides a better approximation to the drift.

• The inferred schemes reproduce the invariant density best when the time-step is medium-large, suggesting a balance between the approximation error of the flow map and the numerical error in simulating the invariant density. An a priori estimate of the optimal time gap remains an open problem.

First consider a 1D SDE with a double-well potential [MSH02]:
$$dX_t = -V'(X_t)\,dt + \sqrt{2/\beta}\, dB_t, \qquad (4.1)$$
with $V(x) = \mu (x^2 - 1)^2$. The corresponding invariant measure is $\frac{1}{Z} e^{-\beta V(x)}$, with $Z$ the normalizing constant $Z := \int_{\mathbb{R}} e^{-\beta V(x)}\, dx$; the values of $\mu$ and $\beta$ are fixed throughout.

We generate data by SSBE with a fine time-step $\Delta t$. We first simulate a long trajectory on an interval $[0, T]$ with two million time steps, which is found to be long enough to represent the invariant density (PDF). This long trajectory also provides the reference PDF and ACF, which are referred to as the true values to be approximated. Then we generate $M$ trajectories on a shorter time interval, with initial conditions sampled from the long trajectory. The data for inference are the $M$ trajectories of both the Brownian motion and the process $(X_t)$, observed at the discrete times $\{t_n = n\delta = n\,\mathrm{Gap}\times\Delta t\}$ as in (2.4).

The parameters of the schemes in (2.13) are then estimated by Algorithm 1 for each $\delta = \mathrm{Gap}\times\Delta t$. Figure 2(a) shows the TVD of the five schemes over the tested time gaps.
Note that for every scheme, the TVD first decreases and then increases, reaching its smallest value when Gap = 80. This suggests that when the gap is small, the approximation error of the flow map (recall that the data are generated by an implicit scheme while the inferred schemes are explicit) dominates the error in the invariant measure; when the gap is large, the numerical error of the inferred schemes dominates the TVD. A balance between the two errors is reached at a medium-large time-step.

We first select the scheme that reproduces the invariant density with the smallest TVD. Overall, the IS-RK4 schemes perform the best, and the inclusion of $c_0$ brings negligible improvement. Thus, we select IS-RK4 without $c_0$ to demonstrate further results.
Figure 2: Large-time statistics for the 1D double-well potential. (a) TVD between the empirical invariant densities (PDFs) of the inferred schemes and the reference PDF from data. (b) and (c): PDFs and ACFs of IS-RK4 with $c_0$ excluded, compared with the reference data.

Figure 2(b-c) show the PDFs and auto-correlation functions (ACFs) of IS-RK4 with $c_0$ excluded at three representative time gaps, Gap ∈ {10, 80, 200}, in comparison with those of the reference data and of the plain RK4 with Gap = 10. When Gap is small, that is, Gap = 10, the IS-RK4 is close to the plain RK4, and both produce PDFs and ACFs with large errors. The PDF and ACF generated by IS-RK4 with Gap = 80 are the best among all tested gaps, fitting the true PDF and ACF almost perfectly. Furthermore, when the time gap is as large as Gap = 200, the IS-RK4 still produces qualitatively correct results capturing the double-well feature of the PDF, whereas the plain RK4 scheme blows up at a much smaller gap.

Figure 3: 1D double-well potential: convergence of the estimators in IS-RK4 with $c_0$ excluded. (a) The relative error of the estimator $\widehat{c}^{\,\delta,N,M}$ converges at an order of about $(MN)^{-1/2}$, matching Theorem 3.5. (b) Left column: the coefficients depend on the time-step $\delta = \mathrm{Gap}\times\Delta t$, with $c_1$ being almost 1 and $c_2$ being close to linear in $\delta$. The error bars, too narrow to be seen, are the standard deviations of the single-trajectory estimators from the $M$-trajectory estimator. Right column: the residual decays at an order $O(\delta^{1/2})$, matching Theorem 3.6.

We also test the convergence of the estimators in sample size and their dependence on the time-step, as well as the order of the residual, aiming to confirm the theory in Section 3. Figure 3(a) shows that the relative error of $\widehat{c}^{\,\delta,N,M}$ converges at a rate of about $(MN)^{-1/2}$ as the sample size $N$ or $M$ increases. Here we take the estimator from the largest sample size as the projection coefficient and compute the relative error with respect to it. Note that the estimator of $c_1$ is close to 1 and converges at this rate, matching Theorem 3.5; the estimator of $c_2$ converges similarly. Figure 3(b) shows the dependence of the estimators on the time-step $\delta = \mathrm{Gap}\times\Delta t$.
The coefficient $c_1$ is almost 1, while $c_2$ is close to linear in $\delta$. Furthermore, the estimators from each single trajectory are close to the $M$-trajectory estimator, with small standard deviations represented by error bars that are too narrow to be seen. The residual decays at an order of about 0.4 in $\delta$, closely matching the rate 1/2 in Theorem 3.6.

We now consider a 2D dissipative gradient system [MSH02]:
$$d\mathbf{X}_t = -\nabla V(\mathbf{X}_t)\,dt + \sqrt{2/\beta}\, d\mathbf{B}_t, \qquad (4.2)$$
with $V(\mathbf{X}) = V(x_1, x_2) = \exp(\mu_1 x_1^2 + \mu_2 x_2^2)$. The corresponding invariant measure is $\frac{1}{Z} e^{-\beta V(x_1,x_2)}$, with $Z$ the normalizing constant $Z := \int_{\mathbb{R}^2} e^{-\beta V(x_1,x_2)}\, dx_1\, dx_2$. Because $\mu_1 \ll \mu_2$, $x_1$ is a slowly evolving variable compared to $x_2$, and the resulting dynamics displays a multiscale feature. Consequently, we estimate the parameters of the inferred schemes entry-wise, and we focus on the marginal invariant density of $x_1$.

We generate data by the SSBE scheme with a fine time-step $\Delta t$. The remaining settings and procedure are the same as in the 1D double-well case.

Figure 4(a) shows that the IS-RK4 and IS-SSBE schemes have comparable TVD, and they reach the minimal TVD when Gap = 120, where IS-EM blows up. They produce similar PDFs and ACFs, so we only present those of IS-SSBE with $c_0$ excluded. Figure 4(b-c) show the PDFs and ACFs at representative time gaps Gap ∈ {10, 80, 120, 200}. The findings are similar to those for the 1D double-well potential: (i) the performance of IS-SSBE first improves and then deteriorates as Gap increases; (ii) IS-SSBE can tolerate a significantly larger time-step than the plain SSBE, which blows up because the Newton-Raphson method used as the implicit solver can only tolerate a small time-step, limited by the invertibility condition (similar to (2.11)) in the Newton-Raphson iteration.
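For concreteness, the SSBE scheme used to generate the data takes an implicit drift substep, solved here by Newton-Raphson, followed by an explicit noise substep. Below is a minimal 1D sketch under assumed placeholder values ($\mu = 1$, $\beta = 10$; the tolerance and iteration cap are ad hoc), not the paper's exact implementation:

```python
import numpy as np

def ssbe_step(x, f, fprime, delta, noise):
    """One split-step backward Euler step for dX = f(X) dt + sigma dB (1D).

    Drift substep (implicit): solve x_star = x + delta * f(x_star) by
    Newton-Raphson on g(y) = y - x - delta * f(y); then add the noise
    increment explicitly.
    """
    x_star = x
    for _ in range(100):
        g = x_star - x - delta * f(x_star)
        gp = 1.0 - delta * fprime(x_star)   # g'(y); must stay away from 0
        step = g / gp
        x_star -= step
        if abs(step) < 1e-12:
            break
    return x_star + noise

# Hypothetical double-well drift f = -V' with V(x) = mu * (x^2 - 1)^2
mu, beta = 1.0, 10.0                        # placeholder parameter values
f = lambda y: -4.0 * mu * y * (y * y - 1.0)
fp = lambda y: -4.0 * mu * (3.0 * y * y - 1.0)

rng = np.random.default_rng(0)
dt, n_steps = 1e-3, 2000
x = 0.5
traj = np.empty(n_steps)
for k in range(n_steps):
    x = ssbe_step(x, f, fp, dt, np.sqrt(2.0 / beta) * rng.normal(0.0, np.sqrt(dt)))
    traj[k] = x
```

The implicit substep is what limits the tolerable time-step: when `1 - delta * fprime` approaches zero, the Newton iteration degrades, mirroring the constraint discussed above.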
(Figure 4 panels: (a) TVD, with optimal gap 120; (b) PDF of $x_1$; (c) ACF.)
Figure 4: Large-time statistics for the 2D gradient system. (a) TVD between the $x_1$-marginal invariant densities (PDFs) of the inferred schemes and the reference PDF from data. (b) and (c): PDFs and ACFs of IS-SSBE with $c_0$ excluded, compared with the reference data.

The convergence of the estimators in sample size is again roughly of order $(MN)^{-1/2}$, as shown in Figure 5(a). Figure 5(b) shows that the estimators of $c_1$ and $c_2$ depend almost linearly on $\delta$. Also, $c_1$'s single-trajectory estimators have negligible standard deviations from the $M$-trajectory estimator, while $c_2$'s estimators have a persistent, noticeable standard deviation. This suggests that IS-SSBE has large uncertainties in the stochastic-force term (recall that $c_1$ and $c_2$ are the coefficients of the scaled drift and the stochastic force; see (2.13)). In the right column, the residual of IS-SSBE changes little as $\delta$ decreases, far from a decay rate of 0.5. This does not violate Theorem 3.6, which applies to parametrizations of explicit schemes. Instead, it highlights that IS-SSBE is not a parametrization of the implicit SSBE scheme: its flow map $\widetilde{F}^\delta_{SSBE}$ has a distance to the true flow map, $\mathbb{E}\big[|\tfrac{X_{t_{n+1}}-X_{t_n}}{\delta} - \widetilde{F}^\delta_{SSBE}(c, X_{t_n}, \Delta B_{t_n})|^2\big]$, that depends little on $\delta$. Such a feature may be helpful for further efforts to improve the parametric form.

Figure 5: 2D gradient system: convergence of the estimators in IS-SSBE with $c_0$ excluded. (a) The relative error of the estimator $\widehat{c}^{\,\delta,N,M}$ converges at an order of about $(MN)^{-1/2}$, matching Theorem 3.5.
(b) Left column: the estimators of $c_1$, $c_2$ are almost linear in $\delta$. Right column: the residual changes little as $\delta$ decreases, because IS-SSBE is not a parametrization of an explicit scheme (thus Theorem 3.6 does not apply).

Figure 6 shows the convergence of the estimators for IS-RK4. Similar to the 1D case, we observe a convergence rate of $(MN)^{-1/2}$ in Figure 6(a). Also, in Figure 6(b), we observe nearly $\delta$-independent estimators and the expected decay rate $O(\delta^{1/2})$ proved in Theorem 3.6.

Consider next the 3D stochastic Lorenz system with degenerate noise [MSH02]:
$$dx_1 = \sigma(x_2 - x_1)\,dt + \sqrt{2/\beta}\, dB^1_t, \quad dx_2 = \big(x_1(\gamma - x_3) - x_2\big)\,dt + \sqrt{2/\beta}\, dB^2_t, \quad dx_3 = (x_1 x_2 - b x_3)\,dt. \qquad (4.3)$$
We set $\sigma = 10$, $\gamma = 28$, $b = 8/3$, and a fixed $\beta$. This stochastic chaotic system is exponentially ergodic with a regular invariant measure because it is dissipative and hypoelliptic.

As before, we generate data by SSBE with a fine time-step $\Delta t$ and a reference long trajectory. We consider time gaps
as in the previous examples, so that the maximal time-step is the same as before. Figure 7(a) shows the TVD of the inferred schemes. This time, the IS-RK4 schemes perform significantly better than the IS-SSBE schemes, with relatively small TVD for most time gaps. This is due to the high-order approximation of RK4 to the drift, which matters particularly when the drift dominates the dynamics (note that the state variable is at a scale of magnitude larger than the degenerate noise). The IS-RK4 with $c_0$ included performs the best, and we select it for further demonstration of results.

Figure 7(b-c) show the PDFs and ACFs at representative time gaps
Gap ∈ {20, 240, 320}. Since the plain RK4 blows up already at the smallest of these gaps, we display the results from IS-EM instead. The findings are similar to those for the 1D double-well potential: (i) the performance of IS-RK4 first improves and then deteriorates as Gap increases; (ii) IS-RK4 can tolerate a significantly larger time-step than the plain RK4.

Figure 6: 2D gradient system: convergence of the estimators in IS-RK4 with $c_0$ excluded. (a) The relative error of the estimator $\widehat{c}^{\,\delta,N,M}$ converges at an order of about $(MN)^{-1/2}$, matching Theorem 3.5. (b) Left column: the estimators of $c_1$, $c_2$ are nearly constant for all $\delta$. Right column: the residual decays at an order $O(\delta^{1/2})$, matching Theorem 3.6.

(Figure 7 panels: (a) TVD, with optimal gap 240; (b) PDF of $x_1$; (c) ACF of $x_1$.)
Figure 7: Large-time statistics of $x_1$ for the stochastic Lorenz system. (a) TVD between the $x_1$-marginal invariant densities (PDFs) of the inferred schemes and the reference PDF from data. (b) and (c): PDFs and ACFs of IS-RK4 with $c_0$ included, compared with the reference data.
Figure 8: PDF and ACF of $x_3$ in the stochastic Lorenz system. Similar to the other examples, IS-RK4 (with $c_0$ included) reproduces the PDF and the ACF best when the time-step is medium-large, while the plain RK4 and IS-EM blow up even at small time gaps.

Moreover, we also plot the PDF and ACF of $x_3$ in Figure 8. The dynamics of $x_3$ is the most challenging, because no diffusive stochastic force acts on it and its ACF is highly oscillatory. As before, the IS-RK4 reproduces the PDF and ACF well, whereas the plain RK4 and IS-EM blow up even when the time-step is small. In particular, the IS-RK4 reproduces the periodic and decaying features of $x_3$'s ACF when the time-step is medium-large, that is, Gap = 240. We expect the best performance to be achieved at a gap between 120 and 240, and we postpone the study of the optimal time gap and other improvements to future work.

The IS-RK4 has convergence results mostly as expected. Figure 9(a) shows that the estimators of $c$ for each entry of $(x_1, x_2, x_3)$ converge at an almost perfect rate $(NM)^{-1/2}$. Figure 9(b) shows that the estimators of $c_0, c_1, c_2$ vary little with $\delta$ until the largest gaps, for each entry. It also shows that the residuals of all three entries decay at a rate slightly higher than $O(\delta^{1/2})$.

Figure 9: The 3D stochastic Lorenz system: convergence of the estimators in IS-RK4 with $c_0$ included. (a) The relative error of the estimator $\widehat{c}^{\,\delta,N,M}$ converges at an order of about $(MN)^{-1/2}$, matching Theorem 3.5. (b) Left column: the estimators of $c_0, c_1, c_2$ vary little until $\delta$ becomes large; the vertical dashed line marks the optimal time gap. Right column: the residuals decay at orders slightly higher than $O(\delta^{1/2})$.

We have introduced a framework to infer schemes adaptive to large time-stepping (ISALT) from data for locally Lipschitz ergodic SDEs. We formulate it as a statistical learning problem, in which we learn an approximation to the infinite-dimensional discrete-time flow map. By deriving informed basis functions from classical numerical schemes, we obtain a low-dimensional parameter estimation problem, avoiding the curse of dimensionality in statistical learning.

Under mild conditions, we show that the estimator converges as the data size increases, and that the inferred scheme has the same 1-step strong order as the explicit scheme it parametrizes. Thus, our algorithm comes with a performance guarantee. Numerical tests on three non-globally Lipschitz examples confirm the theory. The inferred schemes can tolerate large time-steps and efficiently and accurately simulate the invariant measure.

Many fronts are left open for further investigation. (1) The optimal time-step. We have observed that the inferred schemes perform best (producing the most accurate invariant measure) when the time-step is medium-large. This observation suggests a trade-off between the 1-step approximation error of the flow map and the accumulated numerical error in the invariant measure. Similar optimality in the medium range was observed in space-time model reduction [Lu20] and in parameter estimation for multiscale diffusions [PS07]. It is crucial to have a universal a priori estimate of the optimal time-step, which could guide all data-driven model reduction approaches. (2) Multi-step noise. We focused on approximate flow maps that use only the increments of the Brownian motion.
This limits the performance of the inferred scheme, because we omit the details of the stochastic force within each step. A multi-step noise would provide the necessary information for further improvements, particularly when the noise is non-stationary [LD21]. (3) Non-ergodic systems and space-time reduction. We expect to extend the ISALT framework to simulate non-ergodic systems, and to achieve space-time reduction for high-dimensional nonlinear systems, by extracting informed basis functions from classical numerical schemes.
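As a closing illustration, once the coefficients have been estimated, applying the inferred scheme costs no more than an explicit step. The sketch below uses a simplified IS-EM-type update as a stand-in for the parametric form (2.13); the coefficients and drift here are hypothetical placeholders, not estimated values:

```python
def is_em_step(x, f, sigma, delta, dB, c0, c1, c2):
    """One step of an inferred Euler-Maruyama-type scheme:
    X_{n+1} = X_n + (c0 + c1 * f(X_n)) * delta + c2 * sigma * dB.
    With c0 = 0 and c1 = c2 = 1 it reduces to the plain EM scheme.
    """
    return x + (c0 + c1 * f(x)) * delta + c2 * sigma * dB

# sanity check: the inferred step with c0 = 0, c1 = c2 = 1 matches plain EM
f = lambda y: -y                      # hypothetical drift
x0, delta, dB = 1.0, 0.1, 0.05
em = x0 + f(x0) * delta + 1.0 * dB
assert is_em_step(x0, f, 1.0, delta, dB, 0.0, 1.0, 1.0) == em
```

The inference thus only reweights the terms of an explicit update, which is why a simulation with the inferred scheme runs at the cost of the corresponding plain explicit scheme.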
Acknowledgments
FL is grateful for support from NSF-DMS 1913243 and NSF-DMS 1821211. XL is grateful for support from NSF DMS CAREER-1847770. FY is grateful for support from an AMS-Simons travel grant. FL would like to thank Kevin Lin for helpful discussions.
References [BSHHB19] Y. Bar-Sinai, S. Hoyer, J. Hickey, and M. P. Brenner. Learning data-driven discretizations for partial differential equations.
Proceedings of the National Academy of Sciences, 116(31):15344–15349, 2019. [CL15] A. J. Chorin and F. Lu. Discrete approach to stochastic parametrization and dimension reduction in nonlinear dynamics.
Proceedings of the National Academy of Sciences, USA ,112(32):9804–9809, 2015.[CLM `
16] A. J. Chorin, F. Lu, R. M. Miller, M. Morzfeld, and X. Tu. Sampling, feasibility, and priors indata assimilation.
Discrete and Continuous Dynamical Systems A , 36(8):4227–4246, 2016.[EEL `
07] W. E, B. Engquist, X. Li, W. Ren, and E. Vanden-Eijnden. The heterogeneous multiscale method: A review. In
Communications in Computational Physics , 2007.[Hey74] C. C. Heyde. On the central limit theorem for stationary processes.
Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 30(4):315–320, 1974. [HH14] P. Hall and C. C. Heyde.
Martingale limit theory and its application . Academic press, 2014.[HJ15] M. Hutzenthaler and A. Jentzen.
Numerical Approximations of Stochastic Differential Equationswith Non-globally Lipschitz Continuous Coefficients . American Mathematical Society, 2015.[HJE18] J. Han, A. Jentzen, and W. E. Solving high-dimensional partial differential equations using deeplearning.
Proceedings of the National Academy of Sciences, USA, 115(34):8505–8510, 2018. [HJK12] M. Hutzenthaler, A. Jentzen, and P. E. Kloeden. Strong convergence of an explicit numerical method for SDEs with non-globally Lipschitz continuous coefficients.
The Annals of Applied Probability, 22:1611–1641, 2012. [HL20] T. Hudson and X. H. Li. Coarse-graining of overdamped Langevin dynamics via the Mori–Zwanzig formalism.
Multiscale Modeling & Simulation , 18(2):1113–1135, 2020.[HP06] J. Hansen and C. Penland. Efficient approximate technique for integrating stochastic differentialequations.
Monthly Weather Review , 134:3006–3014, 2006.[Hu96] Y. Hu. Strong and weak order of time discretization schemes of stochastic differential equations.In
Séminaire de Probabilités XXX, pages 218–227. Springer, 1996. [JH20] S. W. Jiang and J. Harlim. Modeling of missing dynamical systems: Deriving parametric models using a nonparametric framework.
Research in the Mathematical Sciences , 7(3):1–25, 2020.[JK `
10] A. Jentzen, P. Kloeden, et al. Taylor expansions of solutions of stochastic partial differentialequations with additive noise.
The Annals of Probability , 38(2):532–569, 2010.[Kha12] R. Khasminskii.
Stochastic Stability of Differential Equations, volume 66. Springer-Verlag Berlin Heidelberg, 2nd edition, 2012. [KMK03] B. Khouider, A. J. Majda, and M. A. Katsoulakis. Coarse-grained stochastic models for tropical convection and climate.
Proceedings of the National Academy of Sciences, USA, 100(21):11941–11946, 2003. [KP99] P. E. Kloeden and E. Platen.
Numerical Solution of Stochastic Differential Equations . Springer,Berlin, 3rd edition, 1999.[LBL16] H. Lei, N.A. Baker, and X. Li. Data-driven parameterization of the generalized Langevinequation.
Proc. Natl. Acad. Sci. USA, 113(50):14183–14188, 2016. [LD21] Y. Li and J. Duan. A data-driven approach for discovering stochastic dynamical systems with non-Gaussian Lévy noise.
Physica D: Nonlinear Phenomena, 417:132830, 2021. [LGO20] S. Liu, L. Grzelak, and C. W. Oosterlee. The seven-league scheme: Deep learning for large time step Monte Carlo simulations of stochastic differential equations. arXiv:2009.03202, 2020. [LL10] F. Legoll and T. Lelièvre. Effective dynamics using conditional expectations.
Nonlinearity, 23(9):2131, 2010. [LL21] K. K. Lin and F. Lu. Data-driven model reduction, Wiener projections, and the Koopman–Mori–Zwanzig formalism.
Journal of Computational Physics , 424:109864, 2021.[LLC16] F. Lu, K. K. Lin, and A. J. Chorin. Comparison of continuous and discrete-time data-basedmodeling for hypoelliptic systems.
Communications in Applied Mathematics and ComputationalScience , 11(2):187–216, 2016.[LLC17] F. Lu, K. K. Lin, and A. J. Chorin. Data-based stochastic model reduction for the Kuramoto–Sivashinsky equation.
Physica D: Nonlinear Phenomena , 340:46–57, 2017.[LLS19] F. Legoll, T. Leliévre, and U. Sharma. Effective dynamics for non-reversible stochastic differ-ential equations: a quantitative study.
Nonlinearity , 32(12):4779–4816, 2019.[LM16] B. Leimkuhler and C. Matthews.
Molecular Dynamics.
Springer, 2016.[LMT20] F. Lu, M. Maggioni, and S. Tang. Learning interaction kernels in stochastic systems of inter-acting particles from multiple trajectories. arXiv preprint arXiv:2007.15174 , 2020.[Lu20] F. Lu. Data-driven model reduction for stochastic Burgers equations.
Entropy , 22(12):1360,Nov 2020.[LZTM19] F. Lu, M. Zhong, S. Tang, and M. Maggioni. Nonparametric inference of interaction laws insystems of agents from trajectory data.
Proceedings of the National Academy of Sciences, USA ,116(29):14424–14433, 2019.[Mao07] X. Mao.
Stochastic differential equations and applications . Elsevier, 2007.[MH13] A. J. Majda and J. Harlim. Physics constrained nonlinear regression models for time series.
Nonlinearity , 26(1):201–217, 2013.[MSH02] J. C. Mattingly, A. M. Stuart, and D. J. Higham. Ergodicity for SDEs and approximations:locally Lipschitz vector fields and degenerate noise.
Stochastic Process. Appl. , 101:185–232,2002.[MT02] Y. Maday and G. Turinici. A parareal in time procedure for the control of partial differentialequations.
Comptes Rendus Mathematique , 335:387–392, 2002.[PS07] G. A. Pavliotis and A. M. Stuart. Parameter estimation for multiscale diffusions.
J. Statist. Phys., 127(4):741–781, 2007. [Rü82] W. Rümelin. Numerical treatment of stochastic differential equations.
SIAM Journal on Numerical Analysis, 19(3):604–613, 1982. [RT96] G. O. Roberts and R. L. Tweedie. Exponential convergence of Langevin distributions and their discrete approximations.
Bernoulli , 2(4):341–363, 1996.[SS18] J. Sirignano and K. Spiliopoulos. DGM: A deep learning algorithm for solving partial differentialequations.
Journal of Computational Physics, 375:1339–1364, 2018. [YZK20] L. Yang, D. Zhang, and G. E. Karniadakis. Physics-informed generative adversarial networks for stochastic differential equations.