Accelerating Metropolis-within-Gibbs sampler with localized computations of differential equations
Qiang Liu∗   Xin T. Tong†

February 19, 2020

∗ Department of Mathematics, National University of Singapore, Singapore, [email protected]
† Department of Mathematics, National University of Singapore, Singapore, [email protected]
Abstract
Inverse problems are ubiquitous in science and engineering, and Bayesian methodologies are often used to infer the underlying parameters. For high dimensional temporal-spatial models, classical Markov chain Monte Carlo (MCMC) methods are often slow to converge, and it is necessary to apply Metropolis-within-Gibbs (MwG) sampling on parameter blocks. However, the computation cost of each MwG iteration is typically O(n²), where n is the model dimension. This can be too expensive in practice. This paper introduces a new reduced computation method to bring down the computation cost to O(n), for the inverse initial value problem of a stochastic differential equation (SDE) with local interactions. The key observation is that each MwG proposal differs from the original iterate at only one parameter block, and this difference propagates only within a local domain in the SDE computations. Therefore we can approximate the global SDE computation with a surrogate updated only within the local domain, for reduced computation cost. Both theoretically and numerically, we show that the approximation errors can be controlled by the local domain size. We discuss how to implement the local computation scheme using Euler–Maruyama and 4th order Runge–Kutta methods. We numerically demonstrate the performance of the proposed method with the Lorenz 96 model and a linear stochastic flow model.

1 Introduction

Inverse problems are ubiquitous in various fields of science and engineering [26, 4, 13]. They concern how to infer model parameters from partial, delayed, and noisy observations. Typical examples include using measurements of seismic waves to determine the location of an earthquake epicenter, and recovering images that are as close to natural ones as possible from blurry observations [22]. The general formulation of an inverse problem can be written as

y = h(x, ζ).  (1)

Here x denotes the parameters to be estimated, y denotes the observation data, h is a physical model describing the data collection process, and ζ is the possible random factor involved.

Often, it is of interest to quantify the uncertainty of x, which can be used to infer estimation accuracy and regulate risk. The Bayesian approach is more appropriate for this purpose [13, 6, 24]. It starts by modeling x with a prior distribution p(x) and finding the observation distribution p_l(y|x) from (1). Then Bayes' formula indicates that the posterior distribution of x is given by

p(x|y) ∝ p(x) · p_l(y|x).  (2)

Markov chain Monte Carlo (MCMC) is a large class of stochastic algorithms designed to sample the posterior density iteratively [11, 15]. In each iteration, a new proposal x′ is generated from the current iterate x, which can be described by a transition density Q(x′|x). Then the Metropolis–Hastings (MH) step accepts this proposal with probability

α(x′, x) = min{1, [Q(x|x′) p(x′) p_l(y|x′)] / [Q(x′|x) p(x) p_l(y|x)]}.  (3)

Some popular choices of Q include the random walk transition in random walk Metropolis (RWM), and the Langevin dynamic transition in the Metropolis adjusted Langevin algorithm (MALA) [19, 12, 23]. While these MCMC algorithms perform well for classical problems, they become very slow when applied to modern data science problems, where the parameter dimension n := dim(x) is very large.
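For concreteness, here is a minimal sketch (not from the paper) of one MH step with a random walk proposal, written in log space to avoid underflow; `log_post` is a user-supplied unnormalized log posterior.

```python
import numpy as np

def rwm_step(x, log_post, step_size, rng):
    """One random walk Metropolis step.

    log_post(x) should return log p(x) + log p_l(y|x) up to a constant.
    Working with log densities avoids floating-point underflow when the
    dimension n is large.
    """
    x_prop = x + step_size * rng.standard_normal(x.shape)
    # Symmetric proposal, so Q(x|x') / Q(x'|x) = 1 and (3) reduces to a
    # posterior density ratio.
    log_alpha = min(0.0, log_post(x_prop) - log_post(x))
    if np.log(rng.uniform()) <= log_alpha:
        return x_prop, True
    return x, False
```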
The main issue is that the proposal x′ in RWM or MALA generally differs from x in all components, so the MH acceptance probability α(x′, x) is often of order O(e^{−‖x′−x‖²}) ≈ O(e^{−n}). This is extremely small when n is more than a few thousand. The proposals will mostly be rejected, and the MCMC is essentially stuck. One way to alleviate this issue is to choose a small step size in the proposal, so the average acceptance probability is O(1). But then consecutive MCMC iterates are close to each other, so the overall movement of the MCMC can still be slow.

The curse of dimensionality can often be lifted if there are exploitable statistical structures. Examples include conditional Gaussianity and low effective dimension. In geophysical applications, the components of x usually describe status at different locations. Because the underlying physical law is often short-ranged, faraway components of x are nearly independent. Consequently, the associated covariance matrix will have a clear banded structure. Such a phenomenon is called spatial localization, and it widely exists in problems involving vast spatial domains. In the statistics literature, such banded structure can be exploited by tapering techniques, which significantly improve covariance estimation. And in numerical weather prediction (NWP), localization techniques are designed to utilize this structure, so algorithms such as the ensemble Kalman filter can provide stable estimation for planetary models of millions of dimensions with merely 100 samples.

A recent work [20] investigates the possibility of exploiting spatial localization with MCMC. It is found that Gibbs sampling [10, 9] is a natural framework for this purpose. To do so, one first partitions the model components into m blocks with x = [x_1, ..., x_m], where each block contains only b = n/m = O(1) nearby components. When running MCMC, one generates a proposal for the k-th block by applying the Gibbs sampler on the prior. This proposal x′ differs from x only at the k-th block, which is of dimension b. Therefore, the corresponding MH acceptance probability α(x′, x) in (3) is of scale O(1) and does not degenerate even if the overall dimension n is large. A completely new iterate can be generated by repeating this procedure for all m blocks. A more detailed description of this Metropolis-within-Gibbs (MwG) sampler can be found in Section 2.2. Numerical tests and rigorous analysis in Gaussian settings have revealed that MwG has a dimension independent MCMC convergence rate when the underlying distributions are spatially localized.

1.3 Acceleration with local computation

While MwG takes only a constant number of iterations to sample the posterior distribution, the computation cost of each iterate can be expensive. This is because, when a Gibbs block proposal is processed by the MH step (3), one often needs to evaluate p_l(y|x′), which typically involves O(n) computational cost. The proposal procedure is repeated for all m blocks, so generating a new MwG iterate costs O(n) × m = O(n²). This is much more expensive than RWM and MALA iterates, which in general cost O(n) computation.

However, it is possible to reduce the cost of evaluating p_l(y|x′) to O(1). The main observation is that x′ and x differ only at one block, say the k-th block, and the computation of p_l(y|x) is done in the previous MH step. So if one can replace the x_k part in the computation of p_l(y|x) with x′_k, the value of p_l(y|x′) can be obtained cheaply. As a simple example, suppose in (1) the observation model is y = x + ξ with ξ ∼ N(0, I_n), a normal distribution with n-dimensional zero mean vector and the n × n identity matrix I_n as covariance. Then we have

−log p_l(y|x) = (1/2) Σ_{k=1}^m ‖y_k − x_k‖² + const.

Suppose this value is already available. Then, when computing −log p_l(y|x′), one only needs to update the k-th summand to ‖y_k − x′_k‖²/2, which only costs O(1) computation. This example can be easily generalized to cases where y_k relies on multiple blocks of x. See the detailed discussion of this, and of the possibility of parallelization, in [20]. A minimal sketch of this O(1) update is given below.
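The following sketch assumes the Gaussian model y = x + ξ above; the running value `neg_log_lik` and the helper name are illustrative, not from the paper.

```python
import numpy as np

def update_neg_log_lik(neg_log_lik, y_blocks, x_old_block, x_new_block, k):
    """Update -log p_l(y|x) (up to a constant) when only block k changes.

    neg_log_lik is assumed to hold 0.5 * sum_k ||y_k - x_k||^2 for the
    current iterate; only the k-th summand needs to be replaced.
    """
    old_term = 0.5 * np.sum((y_blocks[k] - x_old_block) ** 2)
    new_term = 0.5 * np.sum((y_blocks[k] - x_new_block) ** 2)
    return neg_log_lik - old_term + new_term  # O(b) work, independent of n
```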
In this paper, we explore the possibility of reducing the computational cost of MwG from O(n²) to O(n). We assume the observation model (1) is given by a high dimensional stochastic differential equation (SDE) with short-range interaction, where x = x(0) is the initial condition of the SDE, and y consists of noisy partial observations of the SDE, x(s ≤ T), in a fixed time interval [0, T]. Such an inverse initial value problem is practically important. It can be interpreted as a one-step smoothing problem in signal processing. Data assimilation problems such as NWP can also be formulated as sequential applications of it [7, 17].

Finding p_l(y|x) is equivalent to solving the SDE and finding x(s ≤ T) in the smoothing context. The standard Euler–Maruyama scheme requires a computational cost of O(n). When the Gibbs sampler proposes x′, one needs to use it as a new SDE initial condition and compute x′(s ≤ T). But because x′ differs from x only at x_k, we show in Proposition 1 that x′_i(s ≤ T) is not much different from x_i(s ≤ T) if |i − k| is larger than a radius L. Therefore it is not necessary to redo the computation in full. In other words, instead of applying Euler–Maruyama to compute x′_i(s ≤ T) for all i ∈ {1, ..., m}, we only compute locally for the ones with |i − k| ≤ L. Since the radius L can often be chosen as a constant independent of n, updating x(s ≤ T) to x′(s ≤ T) only costs O(1) computation. This procedure is repeated through all m blocks, so the overall cost of finding a new MwG iterate is O(1) × m = O(n). This achieves the aforementioned computation reduction objective. We use a-MwG to denote this accelerated version of MwG.

Since MwG only requires constantly many iterates to sample a spatially localized distribution, a-MwG is expected to solve the Bayesian inverse problem with only a computation cost of O(n). This is the optimal dimension scaling one can obtain for any numerical method. The main drawback of a-MwG is that it uses a local computation scheme, so it is subject to approximation errors. However, these errors can be controlled by choosing a large enough L. We demonstrate this through rigorous analysis and numerical tests.

1.4 Organization and Preliminaries

This paper is organized in the following way. In Section 2, we consider the inverse problem of how to infer the initial condition of an SDE with local interaction. We review the MwG sampler with a discussion of its computational complexity. We derive an accelerated algorithm, called the a-MwG sampler, in Section 3, and analyze its approximation errors. We also discuss how to implement a-MwG with Euler–Maruyama and 4th order Runge–Kutta schemes, as well as its adaptation to parallelization.
In Section 4, two numerical experiments, the Lorenz 96 model and a linearized stochastic model, are studied. The paper is concluded in Section 5. All proofs are postponed to the Appendix.

Throughout this paper, we use the following notation. When applying MwG, we partition a high dimensional vector into blocks, written as x = [x_1, ..., x_m]. For simplicity of the discussion, we assume each block has the same length b, so the total dimension is n = mb. We remark that our results generalize easily to non-constant block sizes. In practice, each block often represents information at a location on a torus, so it is natural to introduce a measure of distance between indices:

d(j_1, j_2) = min{|j_1 − j_2|, |j_1 − j_2 + m|, |j_2 − j_1 + m|},  j_1, j_2 = 1, ..., m.

When a matrix A is given, its (i, j)-th entry is denoted A_{i,j}; A^T and A^{−1} denote the transpose and the inverse of the matrix, respectively. We adopt ‖·‖ and ‖·‖_∞ to denote the l_2 and l_∞ norms of a vector: for a vector a with elements a_1, ..., a_n, ‖a‖ = (Σ_{i=1}^n a_i²)^{1/2} and ‖a‖_∞ = max{|a_i| : i = 1, ..., n}. For an m × n matrix A, the l_2 operator norm is ‖A‖ = sup{‖Av‖ : v ∈ R^n, ‖v‖ = 1}. For two m × n matrices A and B (including vectors as a special case), we say A ⪯ (⪰) B if A_{i,j} ≤ (≥) B_{i,j} entry-wise for 1 ≤ i ≤ m, 1 ≤ j ≤ n. 0_{m×n} and 1_{m×n} are the m × n matrices whose entries are all 0 and all 1, respectively. N(m, Σ) is the multivariate normal distribution with mean vector m and covariance matrix Σ. We use I_n to denote the n × n identity matrix. C is a generic positive constant that may vary from line to line.

We consider a spatio-temporal model with local interaction

d x_j(t) = f_j(t, x_{j−1}(t), x_j(t), x_{j+1}(t)) dt + σ_j(t, x_j(t)) d W_j(t), for j = 1, ..., m,
x_0(t) = x_m(t), x_{m+1}(t) = x_1(t), t ∈ [0, T],  (4)

where σ_j is a b × b matrix-valued adapted locally bounded process, and W_j(t) is a b-dimensional standard Brownian motion. We assume the following Lipschitz continuity conditions hold for the coefficient processes f_j and σ_j:

Assumption 1 Given b × 1 vectors x_1, x_2, x_3, y_1, y_2, y_3 and t ∈ [0, T], for j = 1, ..., m, there exist constants C_f > 0 and C_σ > 0 such that

‖f_j(t, x_1, x_2, x_3) − f_j(t, y_1, y_2, y_3)‖ ≤ C_f (‖x_1 − y_1‖ + ‖x_2 − y_2‖ + ‖x_3 − y_3‖),

and

‖σ_j(t, x) − σ_j(t, y)‖ ≤ C_σ ‖x − y‖.

The solution of (4) is determined once the initial condition x(0) and the realization of W = [W_1, ..., W_m] are given. We write the solution as x(t) = Φ(x(0), t, W). In the scenario where there is no stochastic forcing, namely σ_j ≡ 0_{b×b}, (4) is an ordinary differential equation (ODE). Its solution can be simply written as x(t) = Φ(x(0), t).

One key feature of (4) is that the drift term driving x_j relies only on its neighboring blocks. This describes general short-range interactions that are typical in spatio-temporal models. The formulation of (4) arises naturally if one applies a finite difference discretization to a stochastic partial differential equation, such as a reaction-diffusion equation. Details of such a derivation can be found in [5].

We assume an n′-dimensional data vector is generated from a noisy observation of (4) at time T > 0:

y = H · Φ(x, T, W) + ξ.  (5)
Here H is an n′ × n observation matrix that selects the observable components, and ξ represents the associated observation noise, distributed as N(0_{n′×1}, R).

The Bayesian inverse problem this paper tries to solve is finding the posterior distribution of the initial condition x given the data y. For simplicity, we assume the prior distribution p of x is Gaussian with mean 0_{n×1} and covariance matrix Σ_pri. The observation likelihood is given by

p_l(y|x) ∝ E exp(−½‖y − H · Φ(x, T, W)‖²_R),  (6)

where for a vector v we define ‖v‖²_R := v^T R^{−1} v. The expectation in (6) averages over all realizations of W. In computation, it can be approximated by a Monte Carlo sampled version

p̂_l(y|x) := (1/S) Σ_{c=1}^S exp(−½‖y − H · Φ(x, T, W^(c))‖²_R),  (7)

where each W^(c) is an independent Brownian motion realization. Note that (7) is an unbiased estimator of (6); using it instead of (6) in MCMC is known as the pseudo-marginal algorithm [1]. The convergence of the pseudo-marginal algorithm is largely similar to that of the standard MCMC where (6) is used directly [2].

When (4) is an ODE, (6) simplifies to

p_l(y|x) ∝ exp(−½‖y − H · Φ(x, T)‖²_R).  (8)

2.2 The Metropolis-within-Gibbs sampler

The Metropolis-within-Gibbs sampler is an MCMC algorithm. For each block, it generates a proposal by applying Gibbs sampling to the prior, and uses a Metropolis step to incorporate the data information. Specifically, it generates iterates of the form x^k = [x^k_1, ..., x^k_m] through the following steps, where we start with k = 1 and draw x^1 randomly from p:

1) Repeat steps 2)-4) for all block indices j = 1, ..., m:
2) Sample x̃_j from p(x_j ∈ · | x^k_1, ..., x^k_{j−1}, x^k_{j+1}, ..., x^k_m).
3) Let x_p = [x^k_1, ..., x^k_{j−1}, x̃_j, x^k_{j+1}, ..., x^k_m].

4) Let x^k = x_p with probability

α(x_p, x^k) = min{1, p_l(y|x_p) / p_l(y|x^k)}.  (9)

Here p_l can also be replaced by the sample version p̂_l in (7).

5) When the loop in step 1) is finished, let x^{k+1} = x^k, and increase the iteration index from k to k + 1.

In [20], it is shown that MwG has dimension independent performance if 1) p is a Gaussian distribution and its covariance or precision matrix is close to banded, and 2) each component of y depends significantly on only a few components of x. It is also discussed there how to truncate the far-off-diagonal entries of the prior covariance or precision matrix. This so-called "localization" technique can simplify the computation of the proposal probability in MwG step 2), making its cost O(1).

When applying MwG directly to the inverse problem described in Section 2, the computational cost of fully updating all blocks is O(n²). This is because in step 4) we need to evaluate p_l through either (7) or (8). This involves finding the numerical solution of (4), which requires executing the Euler–Maruyama or 4th order Runge–Kutta method. Both methods require O(n) computation when the dimension of the differential equation (4) is n. Then, because step 4) is repeated for all m blocks, the total computation cost is O(n) × m = O(n²).

In summary, although the vanilla MwG only requires O(1) iterates to converge to the posterior distribution, each individual iterate can cost O(n²) computation. This can be less appealing than standard MCMC algorithms with optimal tuning. For example, RWM can converge to the posterior distribution in O(√n) iterations, while each iterate costs O(n), so the total complexity is O(n^{3/2}). In this paper, we demonstrate how to bring down the computational cost of each MwG iterate to O(n). This leads to the optimal MCMC computational scalability. A minimal sketch of one MwG sweep follows.
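This sketch of steps 1)-5) assumes helper functions `sample_prior_conditional` and `log_lik` are available; both names are illustrative.

```python
import numpy as np

def mwg_sweep(x, log_lik, sample_prior_conditional, m, rng):
    """One full Metropolis-within-Gibbs sweep over the m blocks.

    x: list of m block vectors. log_lik(x) = log p_l(y|x); each call is
    O(n) here, which is exactly the cost Section 3 reduces to O(1).
    """
    ll_cur = log_lik(x)
    for j in range(m):
        x_prop = list(x)
        x_prop[j] = sample_prior_conditional(x, j, rng)  # Gibbs proposal from the prior
        ll_prop = log_lik(x_prop)
        if np.log(rng.uniform()) <= min(0.0, ll_prop - ll_cur):  # MH step (9)
            x, ll_cur = x_prop, ll_prop
    return x
```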
3 The accelerated MwG sampler

3.1 Local perturbations

Based on our previous discussion, the main computational cost of MwG takes place at step 4), when (9) is evaluated. We now discuss how to reduce this cost to O(1). In what follows, we fix the MCMC iteration index as o and the block index as i⋆; the same procedure applies to all iteration indices and blocks.

First, note that when executing step 4), the value of p_l(y|x^o) is already available from the previous execution of step 4). Likewise, we already have the values x^o(t ≤ T) = Φ(x^o, t ≤ T, W). It is only necessary to find p_l(y|x^p), or equivalently x^p(t ≤ T) = Φ(x^p, t ≤ T, W). Note that we often write x^o(0) as x^o, since it is the information we try to recover.

Our main observation is that x^p(0) = x^p differs from x^o(0) = x^o only at the i⋆-th block. Since components in the SDE (4) exchange information only through local interactions, the differences between x^p(t) and x^o(t) are likely to be significant only for blocks close to i⋆. In fact, we have the following proposition:

Proposition 1 Under Assumption 1 and with any given positive constant C_d, for j = 1, ..., m, we have

E[‖x^o_j(t) − x^p_j(t)‖²] ≤ ‖x^o_{i⋆} − x^p_{i⋆}‖² e^{C₁(f,σ)t} e^{−C_d·d(j,i⋆)}  (10)

with

C₁(f, σ) = (e^{C_d} + e^{−C_d} + 1)(C_f + C_σ + 1).  (11)

In particular, for any fixed small enough threshold ε and time range T, we can find a radius L such that

E[‖x^o_j(t) − x^p_j(t)‖²] ≤ ε, ∀ t ≤ T, d(i⋆, j) > L.

We define the local domain centered at index i⋆ as B_{i⋆} := {j : d(i⋆, j) ≤ L}, and use B^c_{i⋆} to denote its complement. Then x^o_j(t) is already a good approximation of x^p_j(t) for j ∈ B^c_{i⋆}. In other words, it is no longer necessary to recompute x^p_j(t) for j ∈ B^c_{i⋆}; we only need to compute x^p_j(t) for j ∈ B_{i⋆}. Since B_{i⋆} contains at most 2L + 1 elements, and L is likely to be independent of the problem dimension n, this provides us with a way to accelerate the overall computation of MwG.

3.2 The local surrogate model

Next we consider how to use the existing values of x^o(t ≤ T) to reduce the computation of x^p(t ≤ T). For this purpose, we introduce a local surrogate model given by

x^l_j(t) = x^o_j(t), for j ∈ B^c_{i⋆},
d x^l_j(t) = f_j(t, x^l_{j−1}(t), x^l_j(t), x^l_{j+1}(t)) dt + σ_j(t, x^l_j(t)) d W_j(t), for j ∈ B_{i⋆},
x^l_0(t) = x^l_m(t), x^l_{m+1}(t) = x^l_1(t).  (12)

Its initial condition is set to x^l(0) = x^p(0) = x^p. We write its solution as

x^l(t) = Φ_l(x^p, t, x^o(t ≤ T), W).

Note that the local surrogate x^l_j(t) within B_{i⋆} depends on x^o_j(t) for the boundary blocks of B^c_{i⋆}, of which the indices are usually just i⋆ + L + 1 and i⋆ − L − 1.
Such dependence can be viewed as using x^o as spatial boundary conditions for the computation of x^l.

We will use the local surrogate x^l(t ≤ T) as an approximation of x^p(t ≤ T). It can be computed cheaply because it differs from x^o(t ≤ T) only for blocks inside the local domain B_{i⋆}. The details of how to achieve this can be found in the subsequent parts. First, we investigate the approximation error through the following theorem.
Theorem 1 Under Assumption 1 and with any given positive constant C_d, for j = 1, ..., m and t ∈ [0, T], we have

E[‖x^l_j(t) − x^p_j(t)‖²] ≤ C₂(f, σ) ‖x^o_{i⋆} − x^p_{i⋆}‖² e^{2C₁(f,σ)t} e^{−C_d·(L+1)},  (13)

with C₁(f, σ) defined in (11) and

C₂(f, σ) = max{2C_f / C₁(f, σ), 1}.  (14)

In particular, for any given ε > 0, if the local domain radius L satisfies

L ≥ log(ε / (C₂(f, σ) ‖x^o_{i⋆} − x^p_{i⋆}‖²)) / (−C_d) + 2 C₁(f, σ) T / C_d,  (15)

then E[‖x^l_j(t) − x^p_j(t)‖²] ≤ ε for all t ≤ T.

Theorem 1 indicates that we can use x^l(T) to approximate x^p(T) with controlled accuracy. So α(x^p, x^o) as defined in (9) can be substituted by α(x^l, x^o). When such a scheme for calculating (9) is plugged into MwG, it leads to acceleration in terms of computation complexity. We call the result the accelerated Metropolis-within-Gibbs (a-MwG) sampler and present its pseudocode in Algorithm 1. Next, we discuss how to compute the local surrogate (12) with numerical methods, and how to implement parallelization.

3.3 Local computation with Euler–Maruyama

When the stochastic forcing is nonzero, our model (4) is a bona fide SDE. Euler–Maruyama is the standard numerical method for its computation.

In the vanilla MwG, if one applies it directly to the full model (4) to obtain x^p(T), a small step size h > 0 is chosen, and x^p(t) is approximated by x̃^p((i+1)h) if ih < t ≤ (i+1)h. The numerical solution x̃^p((i+1)h) at the grid points (i+1)h is generated by the following iteration, starting from x̃^p(0) = x^p:

x̃^p_j((i+1)h) = x̃^p_j(ih) + σ_j(ih, x̃^p_j(ih)) √h W_{i,j} + f_j(ih, x̃^p_{j−1}(ih), x̃^p_j(ih), x̃^p_{j+1}(ih)) h.  (16)

Here, the W_{i,j} are independent samples from N(0_{b×1}, I_b). One can view

W_j(kh) = Σ_{i=1}^k √h W_{i,j}

as a realization of the Brownian motion W_j in (4). As (16) is repeated for all block indices j, obtaining x̃^p(T) has an O(n) computational complexity.

When applying Euler–Maruyama to the local surrogate (12), we assume a numerical approximation of x^o(t ≤ T) is available as x̃^o(ih). This comes either from the previous a-MwG iteration or from an initial computation before the first a-MwG iteration. Then, if the proposal x^p differs from x^o at the i⋆-th block, we set the local domain as B_{i⋆} = {j : d(j, i⋆) ≤ L}, and denote its complement as B^c_{i⋆}. For blocks outside B_{i⋆}, we directly use x̃^o_j(ih) as the numerical solution; that is, we let

x̃^l_j(ih) = x̃^o_j(ih), j ∈ B^c_{i⋆}, i = 1, ..., T/h.  (17)

New computation is needed only for x̃^l_j(ih) with j ∈ B_{i⋆}, which is obtained through

x̃^l_j((i+1)h) = x̃^l_j(ih) + σ_j(ih, x̃^l_j(ih)) √h W_{i,j} + f_j(ih, x̃^l_{j−1}(ih), x̃^l_j(ih), x̃^l_{j+1}(ih)) h,  (18)

with initial condition x̃^l_j(0) = x^p_j. This procedure is how we obtain the step x^l(t, c) = Φ_l(x^p, t, x^o(t ≤ T, c), W^(c)) in a-MwG Algorithm 1; a minimal sketch is given below.
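The following sketch of the local update (17)-(18) assumes b = 1 for readability; `f` and `sigma` are the coefficient functions of (4), `W` holds the fixed N(0,1) increments W_{i,j}, and all names are illustrative.

```python
import numpy as np

def local_euler_maruyama(x_traj_old, x_p, f, sigma, h, i_star, L, W):
    """Local surrogate trajectory (12) via Euler-Maruyama, O(L) work per step.

    x_traj_old: array of shape (n_steps+1, m), the stored trajectory of x_o.
    x_p: proposed initial condition (differs from x_traj_old[0] at block i_star).
    W: array of shape (n_steps, m) of fixed standard normal increments.
    """
    n_steps, m = W.shape
    # Copy step (17); in practice only the boundary blocks i_star +/- (L+1)
    # are needed, as discussed in the text.
    x_traj = x_traj_old.copy()
    x_traj[0, :] = x_p
    local = [(i_star + d) % m for d in range(-L, L + 1)]  # the local domain B
    for i in range(n_steps):
        t = i * h
        for j in local:  # update (18) only inside B; neighbors outside B use x_o
            jm, jp = (j - 1) % m, (j + 1) % m
            x_traj[i + 1, j] = (x_traj[i, j]
                                + f(t, x_traj[i, jm], x_traj[i, j], x_traj[i, jp]) * h
                                + sigma(t, x_traj[i, j]) * np.sqrt(h) * W[i, j])
    return x_traj
```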
Algorithm 1: The accelerated Metropolis-within-Gibbs sampler

Input: K: number of iterations; b: block size; L: local domain radius; S: Brownian motion sample size.
Output: K iterates x^k with k = 1, ..., K.

% Initial computation
Sample x^o ∼ p;
for c = 1, ..., S do
    Generate a Brownian motion realization W^(c)(t ≤ T);
    Let x^o(t ≤ T, c) = Φ(x^o, t ≤ T, W^(c));  % full model computation
end
Let p̂_o = (1/S) Σ_{c=1}^S exp(−½‖y − H · x^o(T, c)‖²_R);
% MCMC loop
for k = 1, ..., K do
    % Gibbs loop
    for j = 1, ..., m do
        % Proposal step
        Sample x^p_j ∼ p(x_j ∈ · | x^o_1, ..., x^o_{j−1}, x^o_{j+1}, ..., x^o_m);
        Let x^p = [x^o_1, ..., x^o_{j−1}, x^p_j, x^o_{j+1}, ..., x^o_m];
        for c = 1, ..., S do
            Let x^l(t, c) = Φ_l(x^p, t, x^o(t ≤ T, c), W^(c)) for all t ≤ T;  % local computation
        end
        % Pseudo-marginal MH step
        Let p̂_p = (1/S) Σ_{c=1}^S exp(−½‖y − H · x^l(T, c)‖²_R);
        Let α = min{1, p̂_p / p̂_o};
        Sample u ∼ Uniform([0, 1]);
        if u ≤ α then
            Let x^o = x^p;
            Let p̂_o = p̂_p;
            for c = 1, ..., S do
                Let x^o(t ≤ T, c) = x^l(t ≤ T, c);
            end
        end
    end
    Let x^k = x^o;
end
Note that in a-MwG, the W^(c) are Brownian motion realizations fixed during the loops. So when implementing (18), one uses the fixed N(0_{b×1}, I_b) samples W^(c)_{i,j} in place of W_{i,j}, instead of generating new independent samples for W_{i,j}.

The reason we need to do the copying step (17) before the local Euler–Maruyama step is that the values of x^l_{i⋆±(L+1)}(ih) are needed in the computation of x^l_{i⋆±L}(ih) in (18). This means that in a real implementation, (17) only needs to be executed for the boundary blocks j = i⋆ ± (L + 1). This can save additional computation time and storage in practice. We choose not to formulate (17) and (18) in this fashion, simply to keep the notation clean.

Because (18) only requires execution for j ∈ B_{i⋆}, the total cost of obtaining x̃^l is only O(1), which is one order cheaper than obtaining x̃^p directly with Euler–Maruyama. This partial computation naturally introduces an error. But, similar to Theorem 1, this error can be controlled through the local domain radius L, according to the following theorem:
Theorem 2 Under Assumption 1, for any given positive constant C_d, j = 1, ..., m and i = 0, 1, ..., T/h, we have

E[‖x̃^l_j(ih) − x̃^p_j(ih)‖²] ≤ C₂(f, σ) e^{2C₁(f,σ)(1+h)ih} e^{−C_d(L+1)} ‖x^o_{i⋆} − x^p_{i⋆}‖²,  (19)

with C₁(f, σ), C₂(f, σ) defined in (11) and (14). For any given constant ε, if the local domain radius L satisfies

L ≥ log(ε / (C₂(f, σ) ‖x^o_{i⋆} − x^p_{i⋆}‖²)) / (−C_d) + 2 C₁(f, σ)(1 + h) T / C_d,  (20)

then E[‖x̃^l_j(ih) − x̃^p_j(ih)‖²] ≤ ε.

Note that as h → 0, (19) and (20) converge to their theoretical counterparts in Theorem 1.
Also note that, based on (20), the choice of L is independent of the dimension n.

3.4 Local computation with Runge–Kutta

When the stochastic forcing is zero, our model (4) is an ODE. This means we can use S = 1 in Algorithm 1, and it is not necessary to sample the Brownian motion. The 4th order Runge–Kutta method (RK4) is the standard numerical method for ODE computation.

When applying RK4 to the full model to obtain x^o(t ≤ T) as a start, one runs the following iteration:

x̃^o_j((i+1)h) = x̃^o_j(ih) + (1/6)(k^{1,o}_j(i) + 2k^{2,o}_j(i) + 2k^{3,o}_j(i) + k^{4,o}_j(i)),
k^{1,o}_j(i) = h f_j(ih, x̃^o_{j−1}(ih), x̃^o_j(ih), x̃^o_{j+1}(ih)),
k^{2,o}_j(i) = h f_j(ih + h/2, x̃^o_{j−1}(ih) + k^{1,o}_{j−1}(i)/2, x̃^o_j(ih) + k^{1,o}_j(i)/2, x̃^o_{j+1}(ih) + k^{1,o}_{j+1}(i)/2),
k^{3,o}_j(i) = h f_j(ih + h/2, x̃^o_{j−1}(ih) + k^{2,o}_{j−1}(i)/2, x̃^o_j(ih) + k^{2,o}_j(i)/2, x̃^o_{j+1}(ih) + k^{2,o}_{j+1}(i)/2),
k^{4,o}_j(i) = h f_j(ih + h, x̃^o_{j−1}(ih) + k^{3,o}_{j−1}(i), x̃^o_j(ih) + k^{3,o}_j(i), x̃^o_{j+1}(ih) + k^{3,o}_{j+1}(i)).  (21)

This step is repeated for all j = 1, ..., m. A minimal vectorized sketch of one such RK4 step is given below.
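This sketch of one step of (21) assumes b = 1 and a drift of the neighbor form f_j(t, x_{j−1}, x_j, x_{j+1}); np.roll supplies the periodic neighbors, and the function names are illustrative.

```python
import numpy as np

def rk4_step(x, t, h, f):
    """One step of (21) for the full locally interacting ODE.

    f(t, x_left, x, x_right) acts componentwise; np.roll(x, 1) gives
    x_{j-1} and np.roll(x, -1) gives x_{j+1} with periodic indexing.
    """
    def F(t, x):
        return f(t, np.roll(x, 1), x, np.roll(x, -1))
    k1 = h * F(t, x)
    k2 = h * F(t + h / 2, x + k1 / 2)
    k3 = h * F(t + h / 2, x + k2 / 2)
    k4 = h * F(t + h, x + k3)
    return x + (k1 + 2 * k2 + 2 * k3 + k4) / 6
```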
When applying RK4 to compute the local surrogate x^l(t) = Φ_l(x^p, t, x^o(t ≤ T)), we need not only the numerical approximation x̃^o(t) of x^o(t), but also the intermediate values k^{s,o}_j(i) for s = 1, 2, 3, 4. These values are directly taken as x̃^l_j and its RK4 intermediate values for j ∈ B^c_{i⋆}:

x̃^l_j(ih) = x̃^o_j(ih), k^{s,l}_j(ih) = k^{s,o}_j(ih).  (22)

New RK4 computation is only needed for blocks inside the local domain, and x̃^l_j(ih) is obtained by iterating the following step for j ∈ B_{i⋆} and i = 1, ..., T/h:

x̃^l_j((i+1)h) = x̃^l_j(ih) + (1/6)(k^{1,l}_j(i) + 2k^{2,l}_j(i) + 2k^{3,l}_j(i) + k^{4,l}_j(i)).  (23)

Here the intermediate values k^{s,l}_j are defined in the same way as k^{s,o}_j in (21), that is, every instance of x̃^o_j is replaced by x̃^l_j, and every instance of k^{s,o}_j is replaced by k^{s,l}_j. As with the local computation for the Euler–Maruyama method, we note that (22) is in reality only needed for {j : L + 1 ≤ |j − i⋆| ≤ L + 4}. To see this, note that to obtain x̃^l_j((i+1)h) we need k^{4,l}_j with j ∈ B_{i⋆}. This needs k^{3,l}_j with j ∈ {j : |j − i⋆| ≤ L + 1}. Likewise, we need k^{4−s,l}_j with j ∈ {j : |j − i⋆| ≤ L + s} for s = 2, 3, where we interpret k^{0,l}_j = x̃^l_j(ih). We can exploit this to save both computation time and storage.

By completing the procedures described above, we obtain the local surrogate x^l(t) = Φ_l(x^p, t, x^o(t ≤ T)). If x^p is accepted in the MH step, it is used as x^o in the next iteration. Since (23) is repeated only for blocks within the local domain, the overall cost is O(1).

While an approximation error analysis like Theorem 2 in principle exists, we do not provide a detailed proof. This is partly because RK4 is much more complicated than Euler–Maruyama. Moreover, being a 4th order accuracy method, the RK4 solution is very close to the exact solution of (12), so the approximation error can be learned directly from Theorem 1.

3.5 Parallelization

The a-MwG algorithm we proposed can easily be adapted for parallelization, for shorter wall-clock computation time.

First of all, in Algorithm 1, every repetition among the S Brownian motion realizations can be computed in parallel. This is the case because we assume the W^(c) are independent of each other.

Next, recall that in the local surrogate model (12), x^l_j(t) needs renewed computation only for j ∈ B_{i⋆}, which depends on x^o_j(t) only at the boundary blocks j = i⋆ ± (L + 1). Now consider another block index j⋆ such that d(i⋆, j⋆) ≥ 2L + 2. Then its local domain B_{j⋆} and boundary blocks j = j⋆ ± (L + 1) have no intersection with those of i⋆. This means that if there are two Gibbs proposals x^{p,i⋆} and x^{p,j⋆}, differing from the current iterate x^o at the i⋆-th block and the j⋆-th block respectively, then the computations of Φ_l(x^{p,i⋆}, t, x^o(t ≤ T, c), W^(c)) and Φ_l(x^{p,j⋆}, t, x^o(t ≤ T, c), W^(c)) are independent. Thus we can compute them in parallel.

In other words, when implementing a-MwG, at each Gibbs iteration we can draw several proposals in parallel. Each proposal differs from the current iterate x^o at one block index, and each pair of these block indices is a distance 2L + 2 or more apart. Then the local surrogate model (12) can be applied to each proposal in parallel, and the MH step is run independently on each block. This will no doubt save wall-clock computation time; a sketch of such a non-interacting block schedule follows.
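The following sketch groups block indices into batches whose members are pairwise at ring distance at least 2L + 2, so their local computations can run in parallel. It is an illustration of the strategy described above, not code from the paper.

```python
def parallel_batches(m, L):
    """Partition block indices 0..m-1 into batches whose members are
    pairwise at ring distance >= 2L + 2, so they can be proposed and
    updated in parallel."""
    stride = 2 * L + 2
    remaining = list(range(m))
    batches = []
    while remaining:
        batch, last = [], None
        for j in remaining:
            if last is None or j - last >= stride:
                batch.append(j)
                last = j
        # ring wrap-around: first and last member must also be far apart
        if len(batch) > 1 and (m - batch[-1]) + batch[0] < stride:
            batch.pop()
        batches.append(batch)
        remaining = [j for j in remaining if j not in batch]
    return batches
```

For example, parallel_batches(12, 1) returns [[0, 4, 8], [1, 5, 9], [2, 6, 10], [3, 7, 11]]: each batch can be processed concurrently, and a full sweep is completed batch by batch.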
Recall that our SDE model (4) finds its origin in stochastic partial differential equations (PDEs) such as the reaction-diffusion equation. With this in mind, our local computation scheme is closely connected to the PDE parallel computation scheme called "domain decomposition". Both methods partition the computation domain into smaller local domains and try to do the computation only in the local domains. The difference is that domain decomposition is mostly applied to linear PDEs, and its execution relies on manipulations of linear algebra. The local computation scheme introduced here is for inverse problems involving SDE formulations, which can be nonlinear.

4 Numerical experiments

4.1 The Lorenz 96 model

The Lorenz 96 model was introduced in [16] as a simplified description of equatorial oceanic flows. It is commonly used in the testing of high dimensional data assimilation methods (see e.g. [21, 8, 14, 3]). The component x_j(t), for j = 1, ..., n, is governed by the following ODE:

dx_j(t)/dt = −x_{j−2}(t) x_{j−1}(t) + x_{j−1}(t) x_{j+1}(t) − x_j(t) + 8, t ∈ [0, T],  (24)

where we let x_{−1}(t) = x_{n−1}(t), x_0(t) = x_n(t), x_{n+1}(t) = x_1(t). The solution of the Lorenz 96 model has an equilibrium distribution, which can be obtained by long-time simulation. We use a Gaussian approximation of it as the prior distribution, of which the mean vector and covariance matrix are obtained by the localization procedure described in [20].

We consider solving the inverse problem formulated as (5), where the underlying model (4) is given by (24) with T = 0.4.
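A minimal sketch of the Lorenz 96 right-hand side (24); it can be advanced with a standard RK4 update such as the one sketched in Section 3.4, using F(t, x) = lorenz96_rhs(x) (names illustrative).

```python
import numpy as np

def lorenz96_rhs(x, forcing=8.0):
    """Right-hand side of (24):
    dx_j/dt = -x_{j-2} x_{j-1} + x_{j-1} x_{j+1} - x_j + 8,
    with periodic indexing handled by np.roll."""
    return (-np.roll(x, 2) * np.roll(x, 1)
            + np.roll(x, 1) * np.roll(x, -1)
            - x + forcing)
```

With block size b = 2, this drift is nearest-neighbor in the blocks, matching the form (4).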
It can be verified that the spatial interaction of this model is local in the form of (4), with a minimum choice of block size b = 2. We consider observations of every other component, so H is an (n/2) × n matrix with H_{i,2i−1} = 1 for i = 1, ..., n/2 and all other entries 0. For the noise term ξ, we assume ξ ∼ N(0, I_{n/2}). The MwG sampler and the a-MwG sampler are each applied to draw K posterior samples {x^k : k = 1, ..., K}. We implement the experiments for (24) in dimensions n = 40 and n = 400. We use RK4 with time step h = 0.01 for both MwG and a-MwG.
The choices of the block size b and the local radius L play important roles in our a-MwG sampling algorithm. The length b should be larger than the bandwidth of the prior covariance; this is discussed in detail in [20]. As for L, from Theorems 1-2 we see that a smaller value leads to a larger error for the a-MwG sampler but faster computation. While Theorems 1-2 give theoretical lower bounds of L that control the error, they might be pessimistic in practice. A better way is to simulate the proposal and rejection process within a-MwG, compare it with MwG, and choose L so that the proposal acceptance rate and the ODE state at T are close for the two samplers.

Specifically, with a set of fixed parameters n, b and L, we generate one x^o from the prior distribution as in Algorithm 1. We replace the j-th block (we fix j = 1 for simplicity) of x^o by the proposal x^p_j and obtain x^p. Recall that in MwG, the ODE is computed through x^p(t ≤ T) = Φ(x^p, t ≤ T), and the acceptance probability α is given by (9). In a-MwG, we approximate the ODE solution with x^l(T) = Φ_l(x^p, t, x^o(t ≤ T)), and the acceptance rate is α′ := min{1, p̂_p/p̂_o} as in Algorithm 1. Based on these quantities, the following errors can serve as possible standards for choosing a proper L:

Err-Φ = E_{x^o∼p} ‖x^l(T) − x^p(T)‖_∞ / E_{x^o∼p} ‖x^p(T)‖_∞, and Err-α = E_{x^o∼p} |α − α′|.

We note that Err-Φ is the relative error between x^l(T) and x^p(T), while Err-α depicts their influence on the acceptance rate. We use 500 independent samples x^o from p to approximate the averages, and list the resulting errors in Table 1. From the table, we see that a choice of L = 4 (L = 2) for b = 2 (b = 4) is enough to keep the relative errors under 3%, both in the sense of the acceptance rate α and of the solution Φ. Also note that all the error terms decrease as L increases. This verifies our previous theoretical analyses.

Table 1: Err-α and Err-Φ for different block sizes b and radii L, under model (24).

           |       n=40        |       n=400
 b   L     | Err-α     Err-Φ   | Err-α     Err-Φ
 2   1     | 0.1858    0.3361  | 0.1531    0.2940
 2   2     | 0.0819    0.1855  | 0.0857    0.1486
 2   4     | 0.0134    0.0235  | 0.0043    0.0194
 4   1     | 0.0817    0.1969  | 0.0548    0.1599
 4   2     | 0.0119    0.0297  | 0.0136    0.0209

Although we have the criterion above to pick L, below we still test other choices of L for the purpose of comparison. After implementing the MwG and a-MwG sampling algorithms, the mean squared error (MSE) and mean sample variance (MSV) are calculated from the generated posterior samples:

MSE = (1/n) Σ_{j=1}^n ‖x̄_j − x_j‖², and MSV = (1/(n(K − k₀))) Σ_{k=k₀+1}^K Σ_{j=1}^n ‖x^k_j − x̄_j‖²,

where x̄ = (1/(K − k₀)) Σ_{k=k₀+1}^K x^k, and the first k₀ samples are discarded as burn-in. A sketch of these diagnostics is given below.
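In this sketch, `samples` is a (K, n) array of MCMC iterates, `x_true` the true initial condition, and `k0` the burn-in size; all names are illustrative.

```python
import numpy as np

def mcmc_diagnostics(samples, x_true, k0):
    """Compute MSE and MSV from posterior samples after discarding burn-in."""
    kept = samples[k0:]                    # samples k0+1, ..., K
    x_bar = kept.mean(axis=0)              # posterior mean estimate
    mse = np.mean((x_bar - x_true) ** 2)   # (1/n) sum_j ||x_bar_j - x_j||^2
    msv = np.mean((kept - x_bar) ** 2)     # sample variance averaged over components
    return mse, msv
```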
Besides these, the average acceptance rate (AR) and the total computation time (CT) for generating the samples are also collected. The CT is measured on a 2018 MacBook Pro with a 2.2 GHz Intel Core i7 processor. The related parameters are set as K = 100000 and k₀ = 10000, with the corresponding simulation results presented in Table 2. We see that our a-MwG algorithm samples the posterior distribution correctly, yielding MSE and MSV results similar to those of MwG, and the acceleration effect is significant. Its acceptance rate is also similar to that of the MwG sampler, in particular for larger L.

Table 2: AR, CT, MSE, MSV for 100000 samples drawn from the MwG(b) and a-MwG(b, L) samplers, where b, L are the block size and radius length.

             |             n=40                |              n=400
             | AR      CT    MSE     MSV      | AR      CT     MSE      MSV
MwG(2)       | 0.1343  2080  8.6902  9.0255   | 0.1149  27334  9.4972   10.2152
a-MwG(2,1)   | 0.1254  1440  9.5906  9.7888   | 0.1014  12798  10.4869  12.5727
a-MwG(2,2)   | 0.1310  1444  8.9625  8.3956   | 0.1090  12830  9.8387   10.5445
a-MwG(2,4)   | 0.1347  1492  9.2899  8.1608   | 0.1152  13194  9.5599   9.8135
MwG(4)       | 0.0389  1041  9.1626  8.2137   | 0.0273  13593  10.4764  9.5827
a-MwG(4,1)   | 0.0388  724   9.2161  8.3851   | 0.0283  6460   9.6107   9.9683
a-MwG(4,2)   | 0.0388  749   8.6808  8.8641   | 0.0280  6634   9.9938   9.3718

We also show samples generated by the two samplers MwG(4) and a-MwG(4,2) as a demonstration of the posterior distribution for the case n = 400 in Figure 1. As a contrast, the same number of prior samples and the true state are also shown. We see that the two sampling algorithms perform similarly. As for the computation time specifically, we document its change as the data dimension varies from 40 to 1600 for MwG(4) and a-MwG(4,2) in Figure 2. It can roughly be seen that the computation cost of a-MwG increases linearly, compared with a quadratic increase for MwG. This verifies our previous theoretical analysis.

[Figure 2: computation time versus data dimension for MwG(4) and a-MwG(4,2).]

4.2 Linear stochastic flow model

We study the following one-dimensional linearized stochastically forced dissipative advection equation from [17] (Section 6.3) and [25]:

∂f(x,t)/∂t = w ∂f(x,t)/∂x − ν f(x,t) + µ ∂²f(x,t)/∂x² + σ_x ∂W(x,t)/∂t, t ∈ [0, T],  (25)

where W(x,t) is white noise in both time and space. Applying the centered difference formula with spatial grid size l transforms (25) into a time-continuous linear stochastic system. The n-dimensional state x(t) = [x_1(t), ..., x_n(t)]^T follows the SDE below for j = 1, ..., n:

dx_j(t) = (a x_{j−1}(t) + b x_j(t) + c x_{j+1}(t)) dt + σ_x dW_j(t),
x_0(t) = x_n(t), x_{n+1}(t) = x_1(t),  (26)

with

a = µ/l² − w/(2l), b = −2µ/l² − ν, c = µ/l² + w/(2l),

where W_j(t) is a standard Brownian motion. We consider a regime with strong advection and weak damping by setting l = 0. , µ = 0. , ν = 0. , w = 2, σ_x ≡ 0. .

We follow similar steps as in the previous example to study the performance of our a-MwG sampling algorithm under this linear stochastic model. The only difference is the additional involvement of the stochastic factors W^(c) for c = 1, ..., S = 100, which are generated in advance. For simplicity, we consider a prior distribution p(x) = N(0, I_n). We set ξ ∼ 0. · N(0, I_n), T = 0. , and h = 0.01; other parameters remain the same as in the previous example, if not specified. A sketch of the pseudo-marginal likelihood estimate (7) for this model, with pregenerated Brownian increments, is given below.
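This sketch estimates log p̂_l(y|x₀) via Euler–Maruyama over the S fixed Gaussian increments; the coefficients a, b, c and observation quantities are assumed given, and all names are illustrative.

```python
import numpy as np

def log_p_hat(x0, y, H, R_inv, a, b, c, sigma_x, h, n_steps, W_all):
    """Monte Carlo estimate of log p_l(y|x0) as in (7), for model (26).

    W_all: fixed increments of shape (S, n_steps, n); reusing them across
    proposals is what makes the pseudo-marginal chain well defined.
    """
    log_terms = []
    for W in W_all:                        # loop over the S Brownian realizations
        x = x0.copy()
        for i in range(n_steps):           # Euler-Maruyama for (26)
            drift = a * np.roll(x, 1) + b * x + c * np.roll(x, -1)
            x = x + drift * h + sigma_x * np.sqrt(h) * W[i]
        r = y - H @ x
        log_terms.append(-0.5 * r @ (R_inv @ r))
    # log-mean-exp for numerical stability
    log_terms = np.array(log_terms)
    mx = log_terms.max()
    return mx + np.log(np.mean(np.exp(log_terms - mx)))
```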
One computational issue arises when we evaluate the a-MwG acceptance probability. When the associated dimension n is large, the sample likelihood p̂_l(y|x) is close to zero, so directly evaluating the acceptance rate α = min{1, p̂_p/p̂_o} in Algorithm 1 may lead to numerical singularity. This problem does not occur for the deterministic Lorenz 96 model, since p_l(y|x) needs no approximation there, and the difference of the log-likelihoods is explicitly accessible.

To resolve this issue, we note the pseudo-marginal likelihood ratio is

p̂_p / p̂_o = [Σ_{c=1}^S exp(−½‖y − H · x^l(T, c)‖²_R)] / [Σ_{c=1}^S exp(−½‖y − H · x^o(T, c)‖²_R)].  (27)

Since x^l(T, c) differs from x^o(T, c) only for blocks inside the local domain B_{i⋆}, the entries outside B_{i⋆} provide no information for the sampling of block x_{i⋆}. Thus, dropping those entries causes no significant difference in the evaluation of (27). In other words, we approximate (27) with a local version

p̂_p / p̂_o ≈ [Σ_{c=1}^S exp(−½‖y_{B′_{i⋆}} − H_{i⋆} · x^l_{B_{i⋆}}(T, c)‖²_{R_{i⋆}})] / [Σ_{c=1}^S exp(−½‖y_{B′_{i⋆}} − H_{i⋆} · x^o_{B_{i⋆}}(T, c)‖²_{R_{i⋆}})].  (28)

Here B′_{i⋆} = {i : 2i − 1 ∈ B_{i⋆}} are the observed indices in B_{i⋆}, H_{i⋆} consists of the B′_{i⋆} rows and B_{i⋆} columns of H, and R_{i⋆} consists of the B′_{i⋆} rows and columns of R. Formulation (28) avoids the computational singularity issue that may exist in (27), since the associated variables are no longer high dimensional.

Note that using (28) is equivalent to localizing the observation matrix as in [20], Section 2.2. When (4) is a linear SDE, y is a linear noisy observation of x(T, c), and the bias caused by such localization is discussed in [20], Section 2.3. We expect similar bias control is also possible for general SDEs, while the exact verification is beyond the scope of this paper. Also note that the local domain B_{i⋆} in (28) does not need to be the same as the one used for the SDE computation. For this test example, we use 20 entries centered around the perturbed block. A minimal sketch of the localized ratio (28) is given below.
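In this sketch of (28) in log form, `xl_T` and `xo_T` are the surrogate and current states at time T for the S realizations, and the index sets follow the definitions above; all names are illustrative.

```python
import numpy as np

def local_log_ratio(xl_T, xo_T, y, H, R, obs_idx, state_idx):
    """Log of (28): restrict y, H, R to the local observed/state indices.

    xl_T, xo_T: arrays of shape (S, n). obs_idx = B'_{i*}, state_idx = B_{i*}.
    """
    H_loc = H[np.ix_(obs_idx, state_idx)]
    R_inv_loc = np.linalg.inv(R[np.ix_(obs_idx, obs_idx)])
    y_loc = y[obs_idx]

    def log_sum(states):
        r = y_loc[None, :] - states[:, state_idx] @ H_loc.T
        q = -0.5 * np.einsum('si,ij,sj->s', r, R_inv_loc, r)
        m = q.max()
        return m + np.log(np.sum(np.exp(q - m)))

    return log_sum(xl_T) - log_sum(xo_T)   # log( p_hat_p / p_hat_o )
```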
Before applying the a-MwG sampler to solve the inverse problem, we use the same procedure described in the previous example to choose a parameter L of the proper scale. The only difference is that all quantities depend on realizations of the Brownian motions, and it is necessary to average over S = 100 simulations. We see from Table 3 that a minimum choice of L = 6 (L = 3) for b = 2 (b = 4) can even control the relative errors under 1 percent. Finite sample performances of the posterior distribution using the MwG and a-MwG samplers are presented in Table 4 and Figure 3, from which we can see that our sampling algorithm also works well in the stochastic setting. Figure 4 shows an evident linear relationship between the data dimension and the computation time for a-MwG, and a quadratic one for MwG. We note this is because, for each proposal, the realization of (26) is repeated for S = 100 standard Brownian motion paths W, dominating the total computation of the sampling algorithms and making the relationships clearer.

Table 3: Err-α and Err-Φ for different block sizes b and radii L, under model (26).

           |       n=40         |       n=400
 b   L     | Err-α     Err-Φ    | Err-α     Err-Φ
 2   2     | 0.0289    0.2200   | 0.0856    0.1407
 2   4     | 2.1e-04   0.0040   | 2.2e-04   0.0027
 4   1     | 0.0846    0.2189   | 0.0221    0.1604
 4   2     | 1.3e-04   0.0043   | 1.7e-04   0.0030

Table 4: AR, CT, MSE, MSV for 10000 samples drawn from the MwG(b) and a-MwG(b, L) samplers, where b, L are the block size and radius length. The exact posterior mean variance is listed at the end for comparison.

             |             n=40               |              n=400
             | AR      CT    MSE     MSV     | AR      CT     MSE     MSV
MwG(2)       | 0.1342  2418  0.5953  0.5567  | 0.1494  40617  0.5799  0.5413
a-MwG(2,2)   | 0.1280  2052  0.6382  0.5708  | 0.1388  20674  0.6025  0.5623
a-MwG(2,4)   | 0.1344  2077  0.6009  0.5302  | 0.1489  21256  0.5871  0.5505
MwG(4)       | 0.0367  1193  0.5854  0.5492  | 0.0449  21369  0.6004  0.5470
a-MwG(4,1)   | 0.0343  1007  0.5905  0.5244  | 0.0423  10511  0.6077  0.5417
a-MwG(4,2)   | 0.0353  1043  0.5903  0.5396  | 0.0451  10685  0.5907  0.5499

[Figure 3: posterior samples. Figure 4: computation time versus data dimension.]

In fact, for this linear model, the posterior distribution of x given y can be derived explicitly. To derive it, we rewrite (26) in the following vector form:

d x(t) = M x(t) dt + σ_x d W(t),  (29)

where M is an n × n matrix with

M_{1,1} = b, M_{1,2} = c, M_{1,n} = a,
M_{j,j−1} = a, M_{j,j} = b, M_{j,j+1} = c, for j = 2, ..., n − 1,
M_{n,1} = c, M_{n,n} = b, M_{n,n−1} = a,

and all other entries 0. According to Duhamel's formula, we have

x(T) = e^{MT} x(0) + e^{MT} ∫_0^T e^{−Mt} σ_x d W(t).

The observation model can therefore be written as

y = H e^{MT} x(0) + H e^{MT} ∫_0^T e^{−Mt} σ_x d W(t) + ξ.  (30)

Recall that x(0) ∼ N(0_{n×1}, Σ_pri) and ξ ∼ N(0_{n′×1}, R). It is known that, for a linear observation with Gaussian prior distribution and noise, the posterior distribution is also Gaussian, and the posterior mean m_pos and covariance matrix Σ_pos can be written as ([18]):

Σ_pos = (Σ_pri^{−1} + (H e^{MT})^T (M′ + R)^{−1} H e^{MT})^{−1},
m_pos = Σ_pos (H e^{MT})^T (M′ + R)^{−1} y,

where

M′ = H e^{MT} · ∫_0^T σ_x² (e^{−Mt})^T e^{−Mt} dt · (H e^{MT})^T.

5 Conclusion

In this paper, we discuss how to infer the initial conditions of high dimensional SDEs with local interactions, when noisy observations are available. Previous research [20] has shown that Metropolis-within-Gibbs (MwG) sampling has a dimension independent convergence rate. However, each MwG iteration requires a computation cost of O(n²), with n being the model dimension. The main contribution of this paper is a reduced computation scheme that accelerates the MwG implementation. We observe that in each updating step, the MwG proposal differs from the original iterate at only one parameter block, and the main difference it causes also takes place near that block. Solving the SDEs within a local domain near this block, instead of doing the global computation for all entries, involves only O(1) computation. Our accelerated algorithm, the a-MwG sampler, is obtained by integrating this approximate local computation into MwG sampling. The computation cost of each a-MwG iterate is O(n) in total. We derive rigorous bounds for the approximation errors and show that they can be controlled if the local domain size is chosen larger than a constant threshold. We also discuss how to implement our local computation using the Euler–Maruyama and 4th order Runge–Kutta schemes. These methods are applied in our experimental studies of the Lorenz 96 and a linear stochastic flow model. Numerical results show that our sampling algorithm can be greatly accelerated without much loss of accuracy.

Acknowledgements

This work is supported by Singapore MOE-AcRF grants R-146-000-258-114 and R-146-000-292-114.
Appendix
Auxiliary lemmas and their proofs
Lemma 1 An n × n matrix A is called circulant tridiagonal if A_{j,i} = 0 for all pairs (j, i) such that d(j, i) > 1 (recall d(j, i) = min{|j − i|, |j + n − i|, |i + n − j|}). For such a matrix, we have

|(e^A)_{j,i}| ≤ e^{−C·d(j,i)} · e^{(e^C + e^{−C} + 1)ρ}, ∀ j, i = 1, 2, ..., n,

where ρ = max_{j,i=1,2,...,n} |A_{j,i}| and C is any fixed positive constant.

Remark 3 If A is further a tridiagonal matrix, we have

|(e^A)_{j,i}| ≤ e^{−C·|j−i|} · e^{(e^C + e^{−C} + 1)ρ}, ∀ j, i = 1, 2, ..., n.
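A minimal numerical check of Lemma 1 (illustrative, not part of the proof): sample a random circulant tridiagonal A and verify the entrywise bound on e^A.

```python
import numpy as np
from scipy.linalg import expm

def check_lemma1(n=20, C=1.0, seed=0):
    """Numerically verify the entrywise bound of Lemma 1 for one random
    circulant tridiagonal matrix A."""
    rng = np.random.default_rng(seed)
    A = np.zeros((n, n))
    for j in range(n):
        for i in (j - 1, j, j + 1):
            A[j, i % n] = rng.uniform(-1, 1)
    rho = np.abs(A).max()
    diff = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
    d = np.minimum(diff, n - diff)   # ring distance d(j, i)
    bound = np.exp(-C * d) * np.exp((np.exp(C) + np.exp(-C) + 1) * rho)
    return bool(np.all(np.abs(expm(A)) <= bound))
```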
Proof: First, by mathematical induction we prove that, for any positive constant C,

|(A^m)_{j,i}| ≤ (e^C + e^{−C} + 1)^m · ρ^m · e^{−C·d(j,i)}, for m = 0, 1, ....  (31)

The conclusion holds for m = 0, since A⁰ = I_n, whose diagonal entries equal e^{−C·d(j,j)} = 1 and whose off-diagonal entries are bounded by e^{−C·d(j,i)} > 0. Suppose the result is true for m = k, namely |(A^k)_{j,i}| ≤ (e^C + e^{−C} + 1)^k ρ^k e^{−C·d(j,i)}. Then, for m = k + 1, we observe that

(A^{k+1})_{j,i} = (A^k)_{j,i−1} A_{i−1,i} + (A^k)_{j,i} A_{i,i} + (A^k)_{j,i+1} A_{i+1,i},  (32)

with (A^k)_{j,0} = (A^k)_{j,n}, (A^k)_{j,n+1} = (A^k)_{j,1}, A_{0,i} = A_{n,i}, and A_{n+1,i} = A_{1,i}. Since |A_{j,i}| ≤ ρ, together with (31) for m = k, (32) gives us

|(A^{k+1})_{j,i}| ≤ |(A^k)_{j,i−1}| |A_{i−1,i}| + |(A^k)_{j,i}| |A_{i,i}| + |(A^k)_{j,i+1}| |A_{i+1,i}|
≤ (e^C + e^{−C} + 1)^k ρ^{k+1} (e^{−C·d(j,i−1)} + e^{−C·d(j,i)} + e^{−C·d(j,i+1)}).  (33)

We claim that

e^{−C·d(j,i−1)} + e^{−C·d(j,i)} + e^{−C·d(j,i+1)} ≤ (e^C + e^{−C} + 1) e^{−C·d(j,i)},  (34)

which will be proved later. Combining it with (33), we obtain

|(A^{k+1})_{j,i}| ≤ (e^C + e^{−C} + 1)^{k+1} ρ^{k+1} e^{−C·d(j,i)},  (35)

which finishes the proof of (31). The claim (34) can be proved by a case-by-case discussion on the ring distance. If d(j, i) is smaller than the maximal possible distance, the two neighboring distances satisfy d(j, i − 1), d(j, i + 1) ∈ {d(j, i) − 1, d(j, i) + 1}, so

e^{−C·d(j,i−1)} + e^{−C·d(j,i)} + e^{−C·d(j,i+1)} ≤ e^{−C·(d(j,i)−1)} + e^{−C·d(j,i)} + e^{−C·(d(j,i)+1)} = (e^C + e^{−C} + 1) e^{−C·d(j,i)}.

If d(j, i) attains the maximal distance (k for n = 2k, or k − 1 for n = 2k − 1, k ∈ N⁺), the neighboring distances are at least d(j, i) − 1, and the bound is checked directly in the same way for n even and n odd. Finally,

|(e^A)_{j,i}| = |Σ_{m=0}^∞ (A^m)_{j,i} / m!| ≤ Σ_{m=0}^∞ |(A^m)_{j,i}| / m! ≤ Σ_{m=0}^∞ (e^C + e^{−C} + 1)^m ρ^m e^{−C·d(j,i)} / m! = e^{(e^C + e^{−C} + 1)ρ} · e^{−C·d(j,i)}.

The proof is complete. □
Lemma 2 (Gronwall’s inequality)
Given an n × n matrix M whose entries are positive, if two n × 1 vector functions x(t), b(t) satisfy

M · x(t) + b(t) ⪰ dx(t)/dt, for t ∈ [0, T],  (36)

(recall ⪰ is the entrywise inequality defined in Section 1.4), then

e^{MT} · x(0) + ∫_0^T e^{M(T−t)} · b(t) dt ⪰ x(T).  (37)
Proof: Consider the n-dimensional column vector y(t) defined by

dy(t)/dt = M · y(t) + b(t), for t ∈ [0, T],

with initial value y(0) = x(0) + ε · 1_{n×1}, where ε is a positive constant. Its solution can be written as

y(T) = e^{MT} · (x(0) + ε · 1_{n×1}) + ∫_0^T e^{M(T−t)} · b(t) dt.

Define z(t) = y(t) − x(t) for t ∈ [0, T]; then z(0) = ε · 1_{n×1} ⪰ 0_{n×1}. And according to (36), we have

dz(t)/dt = dy(t)/dt − dx(t)/dt = M · y(t) + b(t) − dx(t)/dt ⪰ M · (y(t) − x(t)) = M · z(t).

We define

τ = inf{t : there exists at least one index i such that z_i(t) < 0},

and note that τ > 0 by the continuity of z(t). According to the definition, we have z(t) ⪰ 0_{n×1} for t ∈ [0, τ). Together with the fact that all entries of M are positive, we have dz(t)/dt ⪰ M · z(t) ⪰ 0_{n×1} for t ∈ [0, τ). This implies that z(t) is entrywise nondecreasing on [0, τ), and thus z(t) ⪰ z(0) = ε · 1_{n×1} for t ∈ [0, τ); namely,

e^{Mt} · (x(0) + ε · 1_{n×1}) + ∫_0^t e^{M(t−s)} · b(s) ds − x(t) ⪰ ε · 1_{n×1}.  (38)

We next show that τ > T, by contradiction. Suppose τ ≤ T. From the definition of τ and the previous analysis, we know that z(t) ⪰ ε · 1_{n×1} for t ∈ [0, τ), and there exists at least one index, say i, such that z_i(τ) < 0. Thus lim_{t→τ−} z_i(t) − z_i(τ) ≥ ε > 0. This contradicts the continuity of z(t) on [0, T], and gives τ > T. The conclusion (37) is obtained by letting ε → 0 in (38) with t = T, which is valid since τ > T. □
Lemma 3 For two n × n matrices M and N whose entries are positive, if

M ⪯ N,  (39)

then

e^M ⪯ e^N.  (40)
Proof: By mathematical induction, we first prove

M^k ⪯ N^k, for k = 0, 1, ....  (41)

The result is true for k = 0, since M⁰ = N⁰ = I_n. Suppose the result is true for k = K, namely

M^K ⪯ N^K.  (42)

Then for k = K + 1, since

(M^{K+1})_{i,j} = Σ_{s=1}^n (M^K)_{i,s} M_{s,j}, (N^{K+1})_{i,j} = Σ_{s=1}^n (N^K)_{i,s} N_{s,j}, for i, j = 1, ..., n,

the result (M^{K+1})_{i,j} ≤ (N^{K+1})_{i,j} follows from (39) and (42). According to the definition of the matrix exponential,

e^M − e^N = Σ_{k=0}^∞ (1/k!)(M^k − N^k),

and the conclusion (40) follows from (41). □
Lemma 4 (Volterra's inequality) Given an n × n matrix M whose entries are positive, if two n × 1 vector functions x(t), b(t) satisfy

x(t) ⪯ b(t) + ∫_0^t M · x(s) ds, for t ∈ [0, T],  (43)

then

x(T) ⪯ b(T) + ∫_0^T e^{M(T−t)} · M · b(t) dt.  (44)

Proof: We define y(t) = ∫_0^t M · x(s) ds for t ∈ [0, T]; then (43) gives us

x(t) ⪯ b(t) + y(t), for t ∈ [0, T].  (45)

And we observe that

dy(t)/dt = M · x(t) ⪯ M · b(t) + M · y(t),

with y(0) = 0. According to Lemma 2, we have

y(T) ⪯ ∫_0^T e^{M(T−t)} · M · b(t) dt.  (46)

The conclusion (44) follows from (45) with t = T and (46). □

Main lemmas and their proofs
Lemma 5
Under Assumption 1, with x^o(t), x^p(t) for t ∈ [0, T] defined in Section 3.1, we have, for j = 1, ..., m,

E[‖x^o_j(t) − x^p_j(t)‖²] ≤ ‖x^o_{i⋆} − x^p_{i⋆}‖² · e^{C₁(f,σ)t} · e^{−C_d·d(j,i⋆)},  (47)

where C_d is any given positive constant and C₁(f, σ) is defined in (11).
Proof: Recall that x^o_j(t), x^p_j(t) are solutions of

d x_j(t) = f_j(t, x_{j−1}(t), x_j(t), x_{j+1}(t)) dt + σ_j(t, x_j(t)) d W_j(t), for j = 1, ..., m,
x_0(t) = x_m(t), x_{m+1}(t) = x_1(t), t ∈ [0, T],

with corresponding initial values x^o_j(0), x^p_j(0), which differ only at the i⋆-th block. Thus we have

d(x^o_j(t) − x^p_j(t)) = ∆f_j(t) dt + ∆σ_j(t) d W_j(t),

with

∆f_j(t) = f_j(t, x^o_{j−1}(t), x^o_j(t), x^o_{j+1}(t)) − f_j(t, x^p_{j−1}(t), x^p_j(t), x^p_{j+1}(t)),
∆σ_j(t) = σ_j(t, x^o_j(t)) − σ_j(t, x^p_j(t)).

According to Itô's formula,

d‖x^o_j(t) − x^p_j(t)‖² = (2(x^o_j(t) − x^p_j(t))^T ∆f_j(t) + ‖∆σ_j(t)‖²) dt + 2(x^o_j(t) − x^p_j(t))^T ∆σ_j(t) d W_j(t).  (48)

Its solution can be written as

‖x^o_j(t) − x^p_j(t)‖² − ‖x^o_j(0) − x^p_j(0)‖² = ∫_0^t (2(x^o_j(s) − x^p_j(s))^T ∆f_j(s) + ‖∆σ_j(s)‖²) ds + ∫_0^t 2(x^o_j(s) − x^p_j(s))^T ∆σ_j(s) d W_j(s).

Taking expectations on both sides, and applying Assumption 1 and the Cauchy-Schwarz inequality, we have

E[‖x^o_j(t) − x^p_j(t)‖²] − ‖x^o_j(0) − x^p_j(0)‖²
= ∫_0^t E[2(x^o_j(s) − x^p_j(s))^T ∆f_j(s) + ‖∆σ_j(s)‖²] ds
≤ ∫_0^t E[‖x^o_j(s) − x^p_j(s)‖²] ds + ∫_0^t E[‖∆f_j(s)‖² + ‖∆σ_j(s)‖²] ds
≤ C_f ∫_0^t E[‖x^o_{j−1}(s) − x^p_{j−1}(s)‖²] ds + (C_f + C_σ + 1) ∫_0^t E[‖x^o_j(s) − x^p_j(s)‖²] ds + C_f ∫_0^t E[‖x^o_{j+1}(s) − x^p_{j+1}(s)‖²] ds.  (49)

Equivalently, we can write (49) in the vector form

Δ(t) ⪯ ∫_0^t M · Δ(s) ds + Δ(0),

where Δ(t) is the m × 1 vector whose j-th entry is E[‖x^o_j(t) − x^p_j(t)‖²], and M is the m × m matrix with

M_{1,1} = C_σ + C_f + 1, M_{1,2} = M_{1,m} = C_f,
M_{j,j} = C_σ + C_f + 1, M_{j,j+1} = M_{j,j−1} = C_f, for j = 2, ..., m − 1,
M_{m,m} = C_σ + C_f + 1, M_{m,1} = M_{m,m−1} = C_f,

and all other entries 0; moreover, Δ_{i⋆}(0) = ‖x^o_{i⋆} − x^p_{i⋆}‖² and the other entries of Δ(0) are 0. According to Lemma 4, we have

Δ(t) ⪯ Δ(0) + ∫_0^t e^{M(t−s)} · M · Δ(0) ds.

We observe that, for the m-dimensional column vector M · Δ(0), its (i⋆ − 1)-th, i⋆-th, (i⋆ + 1)-th entries are C_f ‖x^o_{i⋆} − x^p_{i⋆}‖², (C_f + C_σ + 1)‖x^o_{i⋆} − x^p_{i⋆}‖², C_f ‖x^o_{i⋆} − x^p_{i⋆}‖² respectively, while the other entries are 0. Together with Lemma 1 for C = C_d, max_{j,i} |M_{j,i}| = (C_f + C_σ + 1), (34), and the definition C₁(f, σ) = (e^{C_d} + e^{−C_d} + 1)(C_f + C_σ + 1), we have

Δ_j(t) ≤ Δ_j(0) + ∫_0^t |(e^{M(t−s)})_{j,i⋆−1}| (M·Δ(0))_{i⋆−1} ds + ∫_0^t |(e^{M(t−s)})_{j,i⋆}| (M·Δ(0))_{i⋆} ds + ∫_0^t |(e^{M(t−s)})_{j,i⋆+1}| (M·Δ(0))_{i⋆+1} ds
≤ Δ_j(0) + (C_f + C_σ + 1) ‖x^o_{i⋆} − x^p_{i⋆}‖² ∫_0^t e^{C₁(f,σ)(t−s)} ds · (e^{−C_d·d(j,i⋆−1)} + e^{−C_d·d(j,i⋆)} + e^{−C_d·d(j,i⋆+1)})
≤ Δ_j(0) + ‖x^o_{i⋆} − x^p_{i⋆}‖² (e^{C₁(f,σ)t} − 1) e^{−C_d·d(j,i⋆)}
≤ ‖x^o_{i⋆} − x^p_{i⋆}‖² e^{C₁(f,σ)t} e^{−C_d·d(j,i⋆)},

where the last inequality is derived by considering two cases: when j = i⋆, Δ_j(0) = ‖x^o_{i⋆} − x^p_{i⋆}‖² and d(j, i⋆) = 0, so the result holds; when j ≠ i⋆, Δ_j(0) = 0, and the result can also be verified. Recalling that Δ_j(t) = E[‖x^o_j(t) − x^p_j(t)‖²], the proof is complete. □
Lemma 6 Under Assumption 1, with x^o(t), x^l(t), x^p(t) defined for t ∈ [0, T] in Sections 3.1 and 3.2, we have, for j = 1, ..., m,

E[‖x^l_j(t) − x^p_j(t)‖²] ≤ C₂(f, σ) ‖x^o_{i⋆} − x^p_{i⋆}‖² e^{2C₁(f,σ)t} e^{−C_d·(L+1)},  (50)

where C_d is any given positive constant and C₁(f, σ), C₂(f, σ) are defined in (11) and (14).
For simplicity, we consider L + 2 ≤ i ⋆ ≤ n − L −
1, which does not sacrifice any generality because we can always rotate the indices to make it true. So $B_{i_\star} = \{i_\star - L, \ldots, i_\star + L\}$.

Suppose $j \in B_{i_\star}^c$. Since $x^l_j(t) = x^o_j(t)$ and $d(j, i_\star) \ge L+1$, (50) follows from the conclusion of Lemma 5.

Suppose $j \in B_{i_\star}$. We observe that both $x^l_j(t)$ and $x^p_j(t)$ follow the evolutionary system
\[
dx_j(t) = f_j(t, x_{j-1}(t), x_j(t), x_{j+1}(t))\,dt + \sigma_j(t, x_j(t))\,dW_j(t), \quad t \in [0, T],
\]
with initial value $(x^o_{i_\star - L}, \ldots, x^o_{i_\star - 1}, x^p_{i_\star}, x^o_{i_\star + 1}, \ldots, x^o_{i_\star + L})^T$. But when $j \in B_{i_\star}^c$, it is restricted that $x^l_j(t) = x^o_j(t)$, where $x^o(t)$ is the solution of (4) with initial value $(x^o_1, \ldots, x^o_m)^T$. Recall that $x^p(t)$ also solves (4), but with a locally perturbed initial value, which can be written as $(x^o_1, \ldots, x^o_{i_\star - 1}, x^p_{i_\star}, x^o_{i_\star + 1}, \ldots, x^o_m)^T$.

Thus, for $i_\star - L \le j \le i_\star + L$, we have
\[
d(x^l_j(t) - x^p_j(t)) = \Delta f_j(t)\,dt + \Delta\sigma_j(t)\,dW_j(t),
\]
with
\[
\Delta f_j(t) = f_j(t, x^l_{j-1}(t), x^l_j(t), x^l_{j+1}(t)) - f_j(t, x^p_{j-1}(t), x^p_j(t), x^p_{j+1}(t)), \qquad
\Delta\sigma_j(t) = \sigma_j(t, x^l_j(t)) - \sigma_j(t, x^p_j(t)).
\]
By Ito's formula,
\[
d\|x^l_j(t) - x^p_j(t)\|^2 = \big(2(x^l_j(t) - x^p_j(t))^T \Delta f_j(t) + \|\Delta\sigma_j(t)\|^2\big)\,dt + 2(x^l_j(t) - x^p_j(t))^T \Delta\sigma_j(t)\,dW_j(t). \tag{51}
\]
The relation (51) is similar to (48); solving it and applying (49), we obtain
\[
E[\|x^l_j(t) - x^p_j(t)\|^2] - \|x^l_j(0) - x^p_j(0)\|^2
\le C_f \int_0^t E[\|x^l_{j-1}(s) - x^p_{j-1}(s)\|^2]\,ds
+ (C_f + C_\sigma + 1) \int_0^t E[\|x^l_j(s) - x^p_j(s)\|^2]\,ds
+ C_f \int_0^t E[\|x^l_{j+1}(s) - x^p_{j+1}(s)\|^2]\,ds. \tag{52}
\]
We define the $(2L+1)$-dimensional column vector
\[
\vec{\Delta}(t) = \big(E[\|x^l_{i_\star - L}(t) - x^p_{i_\star - L}(t)\|^2], \ldots, E[\|x^l_{i_\star + L}(t) - x^p_{i_\star + L}(t)\|^2]\big)^T,
\]
whose $j$-th element is $\vec{\Delta}_j(t) = E[\|x^l_{i_\star - L + j - 1}(t) - x^p_{i_\star - L + j - 1}(t)\|^2]$, and observe that $\vec{\Delta}(0) = 0_{(2L+1)\times 1}$. Since $x^l_{i_\star - L - 1}(t) = x^o_{i_\star - L - 1}(t)$ and $x^l_{i_\star + L + 1}(t) = x^o_{i_\star + L + 1}(t)$, we can write (52) in the following vector form:
\[
\vec{\Delta}(t) \preceq \int_0^t M \cdot \vec{\Delta}(s)\,ds + \vec{\delta}(t), \tag{53}
\]
where $M$ is a $(2L+1)\times(2L+1)$ tridiagonal matrix with
\[
M_{1,1} = C_f + C_\sigma + 1, \quad M_{1,2} = C_f, \qquad
M_{j,j} = C_f + C_\sigma + 1, \quad M_{j,j+1} = M_{j,j-1} = C_f \ \text{ for } 2 \le j \le 2L, \qquad
M_{2L+1,2L} = C_f, \quad M_{2L+1,2L+1} = C_f + C_\sigma + 1,
\]
and $\vec{\delta}(t)$ is a $(2L+1)$-dimensional column vector with
\[
\vec{\delta}_1(t) = C_f \int_0^t E[\|x^o_{i_\star - L - 1}(s) - x^p_{i_\star - L - 1}(s)\|^2]\,ds, \qquad
\vec{\delta}_{2L+1}(t) = C_f \int_0^t E[\|x^o_{i_\star + L + 1}(s) - x^p_{i_\star + L + 1}(s)\|^2]\,ds,
\]
while the other entries are 0. According to Lemma 5, we have
\[
\max_{j=1,\ldots,2L+1} |\vec{\delta}_j(t)| \le C_f \int_0^t \|x^o_{i_\star} - x^p_{i_\star}\|^2 e^{C_1(f,\sigma) s} e^{-C_d(L+1)}\,ds
\le \frac{C_f}{C_1(f,\sigma)} \|x^o_{i_\star} - x^p_{i_\star}\|^2 e^{C_1(f,\sigma) t} e^{-C_d(L+1)}. \tag{54}
\]
By Lemma 4, under (53) we have
\[
\vec{\Delta}(t) \preceq \vec{\delta}(t) + \int_0^t e^{M(t-s)} \cdot M \cdot \vec{\delta}(s)\,ds. \tag{55}
\]
For the $(2L+1)$-dimensional column vector $M \cdot \vec{\delta}(t)$, we observe
\[
(M \cdot \vec{\delta}(t))_1 = (C_f + C_\sigma + 1)\,\vec{\delta}_1(t), \quad (M \cdot \vec{\delta}(t))_2 = C_f\,\vec{\delta}_1(t), \qquad
(M \cdot \vec{\delta}(t))_{2L} = C_f\,\vec{\delta}_{2L+1}(t), \quad (M \cdot \vec{\delta}(t))_{2L+1} = (C_f + C_\sigma + 1)\,\vec{\delta}_{2L+1}(t),
\]
while the other entries are 0. Together with (54), we have
\[
\max_{j=1,\ldots,2L+1} |(M \cdot \vec{\delta}(t))_j| \le \frac{C_f (C_f + C_\sigma + 1)}{C_1(f,\sigma)} \|x^o_{i_\star} - x^p_{i_\star}\|^2 e^{C_1(f,\sigma) t} e^{-C_d(L+1)}. \tag{56}
\]
Since $\max_{j,i=1,\ldots,2L+1} M_{j,i} = C_f + C_\sigma + 1$, according to Lemma 3 and the conclusion of Lemma 1 with $C = C_d$, we have
\[
|(e^{M(t-s)})_{j,i}| \le |(e^{Mt})_{j,i}| \le e^{C_1(f,\sigma) t}. \tag{57}
\]
From (55), together with (54), (56) and (57), we obtain
\[
\vec{\Delta}_j(t) \le |\vec{\delta}_j(t)| + \sum_{k \in \{1,\,2,\,2L,\,2L+1\}} \int_0^t |(e^{M(t-s)})_{j,k}|\,|(M \cdot \vec{\delta}(s))_k|\,ds
\le \max_{j=1,\ldots,2L+1} |\vec{\delta}_j(t)| + 4\,e^{C_1(f,\sigma) t} \int_0^t \max_{j=1,\ldots,2L+1} |(M \cdot \vec{\delta}(s))_j|\,ds
\]
\[
\le \frac{C_f}{C_1(f,\sigma)}\,\|x^o_{i_\star} - x^p_{i_\star}\|^2\,e^{C_1(f,\sigma) t}\,e^{-C_d(L+1)} \Big(1 + 4 \int_0^t (C_f + C_\sigma + 1)\,e^{C_1(f,\sigma) s}\,ds\Big)
\le \frac{C_f}{C_1(f,\sigma)}\,\|x^o_{i_\star} - x^p_{i_\star}\|^2\,e^{C_1(f,\sigma) t}\,e^{-C_d(L+1)} \Big(1 + \frac{4\,(e^{C_1(f,\sigma) t} - 1)}{e^{C_d} + e^{-C_d} + 1}\Big)
\]
\[
\le \frac{2\,C_f}{C_1(f,\sigma)}\,\|x^o_{i_\star} - x^p_{i_\star}\|^2\,e^{2 C_1(f,\sigma) t}\,e^{-C_d(L+1)},
\]
where the last inequality uses $e^{C_d} + e^{-C_d} + 1 \ge 2$. Recalling the definition of $\vec{\Delta}_j(t)$, the proof is complete. □
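Although it is not part of the paper's algorithm, the comparison bound (55) is straightforward to evaluate numerically once the tridiagonal matrix $M$ of (53) is assembled. The sketch below does so with placeholder values of $C_f$, $C_\sigma$, $L$ and a placeholder boundary vector $\vec{\delta}(t)$, purely to illustrate the structure of the argument.

```python
import numpy as np
from scipy.linalg import expm

C_f, C_sigma, L = 1.0, 0.5, 3              # placeholder constants
n = 2 * L + 1
# Tridiagonal matrix M of (53): C_f + C_sigma + 1 on the diagonal, C_f next to it.
M = ((C_f + C_sigma + 1) * np.eye(n)
     + C_f * np.eye(n, k=1) + C_f * np.eye(n, k=-1))

def delta(t):
    """Placeholder for delta(t): only the first and last entries are nonzero."""
    d = np.zeros(n)
    d[0] = d[-1] = C_f * t * 1e-3          # stand-in for the integral bounded in (54)
    return d

def rhs_55(t, n_quad=200):
    """Right-hand side of (55): delta(t) + int_0^t exp(M(t-s)) M delta(s) ds."""
    ds = t / n_quad
    grid = (np.arange(n_quad) + 0.5) * ds  # midpoint quadrature nodes
    integral = sum(expm(M * (t - s)) @ (M @ delta(s)) for s in grid) * ds
    return delta(t) + integral

print(rhs_55(0.5))                         # entrywise upper bound for Delta(t) under (53)
```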
Lemma 7
Under the same settings as in Lemma 6, for any given $\epsilon > 0$, if the local domain radius $L$ satisfies
\[
L \ge \frac{\log\big(\epsilon / (C_2(f,\sigma)\,\|x^o_{i_\star} - x^p_{i_\star}\|^2)\big)}{-C_d} + \frac{2\,C_1(f,\sigma)}{C_d}\,T,
\]
then $E[\|x^l_j(t) - x^p_j(t)\|^2] \le \epsilon$ for all $t \le T$ and $j = 1, \ldots, m$.

Proof: According to Lemma 6, it suffices to ensure
\[
C_2(f,\sigma)\,\|x^o_{i_\star} - x^p_{i_\star}\|^2\,e^{2 C_1(f,\sigma) t}\,e^{-C_d(L+1)} \le \epsilon \quad \text{for all } t \in [0, T],
\]
which holds as long as
\[
e^{-C_d(L+1)} \le \frac{\epsilon}{C_2(f,\sigma)\,\|x^o_{i_\star} - x^p_{i_\star}\|^2\,e^{2 C_1(f,\sigma) T}}.
\]
Taking logarithms on both sides, we obtain that the above holds if
\[
L \ge \frac{\log\big(\epsilon / (C_2(f,\sigma)\,\|x^o_{i_\star} - x^p_{i_\star}\|^2)\big)}{-C_d} + \frac{2\,C_1(f,\sigma)}{C_d}\,T.
\]
The proof is finished. □
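As a small illustration of how the bound in Lemma 7 can be used, the sketch below computes a sufficient local domain radius $L$ from the quantities appearing in it. This routine is not part of the paper; the arguments `eps`, `pert_sq` (standing for $\|x^o_{i_\star} - x^p_{i_\star}\|^2$), `C1`, `C2` (for $C_1(f,\sigma)$, $C_2(f,\sigma)$), `C_d` and `T` are placeholders that would have to be supplied or estimated for a concrete model.

```python
import math

def local_radius(eps, pert_sq, C1, C2, C_d, T):
    """Sufficient local domain radius L from the Lemma 7 condition."""
    L = math.log(eps / (C2 * pert_sq)) / (-C_d) + 2.0 * C1 / C_d * T
    return max(0, math.ceil(L))  # the radius is a nonnegative integer

# Example with made-up constants; L grows only logarithmically in 1/eps.
print(local_radius(eps=1e-6, pert_sq=1.0, C1=1.0, C2=5.0, C_d=1.0, T=0.5))
```

For the discrete-time scheme, Lemma 10 gives the same expression with $C_1(f,\sigma)$ replaced by $C_1(f,\sigma)(1+h)$.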
Lemma 8
Under Assumption 1, with $\tilde{x}^o_j(ih)$, $\tilde{x}^p_j(ih)$, for $j = 1, \ldots, m$ and $i = 0, \ldots, T/h$, defined in Section 3.3, we have
\[
E[\|\tilde{x}^o_j(ih) - \tilde{x}^p_j(ih)\|^2] \le e^{-C_d \cdot d(j, i_\star)}\,e^{C_1(f,\sigma)(1+h)\,ih}\,\|x^o_{i_\star} - x^p_{i_\star}\|^2, \tag{58}
\]
where $C_d$ is any given positive constant and $C_1(f,\sigma)$ is defined in (11).

Proof:
The iterates $\tilde{x}^o_j(ih)$ are obtained by iterating
\[
\tilde{x}^o_j((i+1)h) = \tilde{x}^o_j(ih) + \sigma_j(ih, \tilde{x}^o_j(ih))\,\sqrt{h}\,W_{i,j} + f_j(ih, \tilde{x}^o_{j-1}(ih), \tilde{x}^o_j(ih), \tilde{x}^o_{j+1}(ih))\,h, \tag{59}
\]
with initial value $\tilde{x}^o_j(0) = x^o_j$. Comparing it with (16), we have
\[
\tilde{x}^o_j((i+1)h) - \tilde{x}^p_j((i+1)h) = \tilde{x}^o_j(ih) - \tilde{x}^p_j(ih) + \tilde{\Delta}\sigma_j(ih)\,\sqrt{h}\,W_{i,j} + \tilde{\Delta}f_j(ih)\,h, \tag{60}
\]
with
\[
\tilde{\Delta}\sigma_j(ih) = \sigma_j(ih, \tilde{x}^o_j(ih)) - \sigma_j(ih, \tilde{x}^p_j(ih)), \qquad
\tilde{\Delta}f_j(ih) = f_j(ih, \tilde{x}^o_{j-1}(ih), \tilde{x}^o_j(ih), \tilde{x}^o_{j+1}(ih)) - f_j(ih, \tilde{x}^p_{j-1}(ih), \tilde{x}^p_j(ih), \tilde{x}^p_{j+1}(ih)).
\]
Since $\tilde{x}^o_j(ih)$ and $\tilde{x}^p_j(ih)$ are independent of $W_{i,j}$, together with Assumption 1 and the Cauchy-Schwarz inequality, from (60) we have
\[
E[\|\tilde{x}^o_j((i+1)h) - \tilde{x}^p_j((i+1)h)\|^2]
= E[\|\tilde{x}^o_j(ih) - \tilde{x}^p_j(ih)\|^2] + E[\|\tilde{\Delta}\sigma_j(ih)\,\sqrt{h}\,W_{i,j}\|^2] + E[\|\tilde{\Delta}f_j(ih)\,h\|^2] + 2\,E[(\tilde{x}^o_j(ih) - \tilde{x}^p_j(ih))^T \tilde{\Delta}f_j(ih)]\,h
\]
\[
\le E[\|\tilde{x}^o_j(ih) - \tilde{x}^p_j(ih)\|^2] + E[\|\tilde{\Delta}\sigma_j(ih)\|^2]\,h + E[\|\tilde{\Delta}f_j(ih)\|^2]\,h^2 + \big(E[\|\tilde{x}^o_j(ih) - \tilde{x}^p_j(ih)\|^2] + E[\|\tilde{\Delta}f_j(ih)\|^2]\big)\,h
\]
\[
\le (C_f h + C_f h^2)\,E[\|\tilde{x}^o_{j-1}(ih) - \tilde{x}^p_{j-1}(ih)\|^2] + \big(1 + (C_\sigma + C_f + 1)h + C_f h^2\big)\,E[\|\tilde{x}^o_j(ih) - \tilde{x}^p_j(ih)\|^2] + (C_f h + C_f h^2)\,E[\|\tilde{x}^o_{j+1}(ih) - \tilde{x}^p_{j+1}(ih)\|^2]. \tag{61}
\]
Equivalently, we have
\[
\vec{\Delta}((i+1)h) \preceq (I_m + M) \cdot \vec{\Delta}(ih), \tag{62}
\]
where $\vec{\Delta}(ih)$ is defined to be an $m$-dimensional column vector with $j$-th entry $\vec{\Delta}_j(ih) = E[\|\tilde{x}^o_j(ih) - \tilde{x}^p_j(ih)\|^2]$ for $i = 0, \ldots, T/h$; $M$ is an $m \times m$ matrix with
\[
M_{1,1} = (C_\sigma + C_f + 1)h + C_f h^2, \quad M_{1,2} = C_f h + C_f h^2, \quad M_{1,m} = C_f h + C_f h^2,
\]
\[
M_{j,j-1} = C_f h + C_f h^2, \quad M_{j,j} = (C_\sigma + C_f + 1)h + C_f h^2, \quad M_{j,j+1} = C_f h + C_f h^2, \quad \text{for } j = 2, \ldots, m-1,
\]
\[
M_{m,1} = C_f h + C_f h^2, \quad M_{m,m-1} = C_f h + C_f h^2, \quad M_{m,m} = (C_\sigma + C_f + 1)h + C_f h^2,
\]
and the other entries are 0. After iterating (62) for $i$ times, we have
\[
\vec{\Delta}(ih) \preceq (I_m + M)^i \cdot \vec{\Delta}(0), \tag{63}
\]
and for $\vec{\Delta}(0)$, we know $\vec{\Delta}_{i_\star}(0) = \|x^o_{i_\star} - x^p_{i_\star}\|^2$ while the other entries are 0. We observe
\[
I_m + M \preceq e^M; \tag{64}
\]
together with Lemma 1, $\max_{j,k=1,\ldots,m} |(iM)_{j,k}| = \big((C_\sigma + C_f + 1) + C_f h\big)\,ih$ and (41), we have, for $j = 1, \ldots, m$,
\[
|((I_m + M)^i)_{j,i_\star}| \le |((e^M)^i)_{j,i_\star}| = |(e^{iM})_{j,i_\star}| \le e^{-C_d \cdot d(j,i_\star)}\,e^{C_1(f,\sigma)(1+h)\,ih}. \tag{65}
\]
Substituting (65) into (63) results in (58). □
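For concreteness, here is a minimal sketch of the full Euler-Maruyama recursion (59). The state is stored as a length-$m$ array with periodic neighbor indexing, and the callables `f(t, xm, x0, xp)` and `sigma(t, x0)` stand in for $f_j$ and $\sigma_j$; these names and the vectorized layout are illustrative conventions, not taken from the paper.

```python
import numpy as np

def euler_maruyama_step(x, t, h, f, sigma, rng):
    """One step of the full recursion (59) for the locally interacting system."""
    w = rng.standard_normal(x.shape)     # the noise W_{i,j}
    x_minus = np.roll(x, 1)              # x_{j-1}, periodic indexing
    x_plus = np.roll(x, -1)              # x_{j+1}, periodic indexing
    return x + f(t, x_minus, x, x_plus) * h + sigma(t, x) * np.sqrt(h) * w

# Toy usage: a nearest-neighbor linear drift and constant diffusion.
rng = np.random.default_rng(0)
x = rng.standard_normal(40)
f = lambda t, xm, x0, xp: xm - 2.0 * x0 + xp
sigma = lambda t, x0: 0.1 * np.ones_like(x0)
x_next = euler_maruyama_step(x, t=0.0, h=0.01, f=f, sigma=sigma, rng=rng)
```

Every such step touches all $m$ components; the local surrogate analyzed next avoids this by updating only the window $B_{i_\star}$.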
Lemma 9
Under Assumption 1, with $\tilde{x}^o_j(ih)$, $\tilde{x}^l_j(ih)$, $\tilde{x}^p_j(ih)$, for $j = 1, \ldots, m$ and $i = 0, \ldots, T/h$, defined in Section 3.3, we have
\[
E[\|\tilde{x}^l_j(ih) - \tilde{x}^p_j(ih)\|^2] \le C_2(f,\sigma)\,e^{2 C_1(f,\sigma)(1+h)\,ih}\,e^{-C_d(L+1)}\,\|x^o_{i_\star} - x^p_{i_\star}\|^2, \tag{66}
\]
where $C_d$ is any given positive constant, and $C_1(f,\sigma)$ and $C_2(f,\sigma)$ are defined in (11) and (14).

Proof:
Suppose $j \in B_{i_\star}^c$. Since $d(j, i_\star) \ge L+1$ and $\tilde{x}^l_j(ih) = \tilde{x}^o_j(ih)$, the result follows from Lemma 8.

Suppose $j \in B_{i_\star}$. Comparing (16) and (18), we have
\[
\tilde{x}^l_j((i+1)h) - \tilde{x}^p_j((i+1)h) = \tilde{x}^l_j(ih) - \tilde{x}^p_j(ih) + \tilde{\Delta}\sigma_j(ih)\,\sqrt{h}\,W_{i,j} + \tilde{\Delta}f_j(ih)\,h, \tag{67}
\]
with
\[
\tilde{\Delta}\sigma_j(ih) = \sigma_j(ih, \tilde{x}^l_j(ih)) - \sigma_j(ih, \tilde{x}^p_j(ih)), \qquad
\tilde{\Delta}f_j(ih) = f_j(ih, \tilde{x}^l_{j-1}(ih), \tilde{x}^l_j(ih), \tilde{x}^l_{j+1}(ih)) - f_j(ih, \tilde{x}^p_{j-1}(ih), \tilde{x}^p_j(ih), \tilde{x}^p_{j+1}(ih)).
\]
Here $\tilde{x}^l_k(ih) = \tilde{x}^o_k(ih)$ for $k \in B_{i_\star}^c$ during the evolution of $\tilde{x}^l(ih)$. Since (67) is similar to (60), according to (61) we have
\[
E[\|\tilde{x}^l_j((i+1)h) - \tilde{x}^p_j((i+1)h)\|^2]
\le (C_f h + C_f h^2)\,E[\|\tilde{x}^l_{j-1}(ih) - \tilde{x}^p_{j-1}(ih)\|^2] + \big(1 + (C_\sigma + C_f + 1)h + C_f h^2\big)\,E[\|\tilde{x}^l_j(ih) - \tilde{x}^p_j(ih)\|^2] + (C_f h + C_f h^2)\,E[\|\tilde{x}^l_{j+1}(ih) - \tilde{x}^p_{j+1}(ih)\|^2].
\]
Namely, for the $(2L+1)$-dimensional column vector $\vec{\Delta}(ih)$ whose $j$-th entry is $\vec{\Delta}_j(ih) = E[\|\tilde{x}^l_{i_\star - L + j - 1}(ih) - \tilde{x}^p_{i_\star - L + j - 1}(ih)\|^2]$ for $j = 1, \ldots, 2L+1$, we have
\[
\vec{\Delta}((i+1)h) \preceq (I_{2L+1} + M) \cdot \vec{\Delta}(ih) + \vec{\delta}(ih), \tag{68}
\]
where $M$ is a $(2L+1) \times (2L+1)$ tridiagonal matrix with
\[
M_{1,1} = (C_\sigma + C_f + 1)h + C_f h^2, \quad M_{1,2} = C_f h + C_f h^2,
\]
\[
M_{j,j-1} = C_f h + C_f h^2, \quad M_{j,j} = (C_\sigma + C_f + 1)h + C_f h^2, \quad M_{j,j+1} = C_f h + C_f h^2, \quad \text{for } 2 \le j \le 2L,
\]
\[
M_{2L+1,2L} = C_f h + C_f h^2, \quad M_{2L+1,2L+1} = (C_\sigma + C_f + 1)h + C_f h^2,
\]
and $\vec{\delta}(ih)$ is a $(2L+1)$-dimensional vector with
\[
\vec{\delta}_1(ih) = (C_f + C_f h)\,h\,E[\|\tilde{x}^o_{i_\star - L - 1}(ih) - \tilde{x}^p_{i_\star - L - 1}(ih)\|^2], \qquad
\vec{\delta}_{2L+1}(ih) = (C_f + C_f h)\,h\,E[\|\tilde{x}^o_{i_\star + L + 1}(ih) - \tilde{x}^p_{i_\star + L + 1}(ih)\|^2],
\]
while the other entries are 0. After iterating (68) for $i$ times, we obtain
\[
\vec{\Delta}(ih) \preceq \sum_{k=0}^{i-1} (I_{2L+1} + M)^{i-1-k} \cdot \vec{\delta}(kh) + (I_{2L+1} + M)^i \cdot \vec{\Delta}(0), \tag{69}
\]
and we see $\vec{\Delta}(0) = 0_{(2L+1)\times 1}$ from the definition of $\tilde{x}^l_j(ih)$ and $\tilde{x}^p_j(ih)$. According to (64) and (41), together with Lemma 1 and $\max_{j,k=1,\ldots,2L+1} |(iM)_{j,k}| = \big((C_\sigma + C_f + 1) + C_f h\big)\,ih$, we have, for $l = 1, \ldots, 2L+1$,
\[
|((I_{2L+1} + M)^{i-1-k})_{j,l}| \le |((e^M)^{i-1-k})_{j,l}| = |(e^{(i-1-k)M})_{j,l}| \le e^{-C_d \cdot d(j,l)}\,e^{C_1(f,\sigma)(1+h)(i-1-k)h} \le e^{C_1(f,\sigma)(1+h)(i-1-k)h}. \tag{70}
\]
According to Lemma 8, we have
\[
\max\{|\vec{\delta}_1(kh)|,\ |\vec{\delta}_{2L+1}(kh)|\} \le (C_f h + C_f h^2)\,e^{-C_d(L+1)}\,e^{C_1(f,\sigma)(1+h)\,kh}\,\|x^o_{i_\star} - x^p_{i_\star}\|^2. \tag{71}
\]
By applying Taylor's theorem, we have
\[
\sum_{k=0}^{i-1} e^{C_1(f,\sigma)(1+h)\,kh} = \frac{e^{C_1(f,\sigma)(1+h)\,ih} - 1}{e^{C_1(f,\sigma)(1+h)\,h} - 1} \le \frac{e^{C_1(f,\sigma)(1+h)\,ih} - 1}{C_1(f,\sigma)(1+h)\,h}. \tag{72}
\]
Substituting (70), (71), (72) into (69), we have, for $j = 1, \ldots, 2L+1$,
\[
\vec{\Delta}_j(ih) \le \sum_{k=0}^{i-1} ((I_{2L+1} + M)^{i-1-k})_{j,1}\,\vec{\delta}_1(kh) + \sum_{k=0}^{i-1} ((I_{2L+1} + M)^{i-1-k})_{j,2L+1}\,\vec{\delta}_{2L+1}(kh)
\]
\[
\le 2\,e^{C_1(f,\sigma)(1+h)\,ih}\,e^{-C_d(L+1)}\,\|x^o_{i_\star} - x^p_{i_\star}\|^2\,(C_f h + C_f h^2)\,\sum_{k=0}^{i-1} e^{C_1(f,\sigma)(1+h)\,kh}
\le \frac{2\,C_f}{C_1(f,\sigma)}\,e^{2 C_1(f,\sigma)(1+h)\,ih}\,e^{-C_d(L+1)}\,\|x^o_{i_\star} - x^p_{i_\star}\|^2.
\]
Recalling the definition of $\vec{\Delta}_j(ih)$, the proof is complete. □
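The difference between the three discrete trajectories is easy to state in code: $\tilde{x}^p$ re-integrates every component from the perturbed initial value, while the local surrogate $\tilde{x}^l$ re-integrates only the components in $B_{i_\star}$ and copies $\tilde{x}^o$ elsewhere, reusing the same noise $W_{i,j}$. The sketch below illustrates this for the Euler-Maruyama scheme; here `f` and `sigma` are evaluated componentwise, and the array conventions are again only illustrative, not taken from the paper.

```python
import numpy as np

def local_euler_maruyama(x_o_path, w_path, i_star, L, x_p_istar, h, f, sigma):
    """Local surrogate trajectory: only the window B_{i*} = {i*-L, ..., i*+L} is re-integrated.

    x_o_path  : array (N+1, m), precomputed full trajectory x^o(ih)
    w_path    : array (N, m), the noise W_{i,j} used to generate x_o_path
    x_p_istar : perturbed value replacing x^o_{i*} at time 0
    """
    N, m = w_path.shape
    window = [(i_star + k) % m for k in range(-L, L + 1)]
    x_l = x_o_path.copy()                  # outside B_{i*}, x^l(ih) = x^o(ih) by construction
    x_l[0, i_star] = x_p_istar             # locally perturbed initial value
    for i in range(N):
        t, cur = i * h, x_l[i]
        for j in window:                   # only 2L+1 components are updated per step
            jm, jp = (j - 1) % m, (j + 1) % m
            x_l[i + 1, j] = (cur[j] + f(t, cur[jm], cur[j], cur[jp]) * h
                             + sigma(t, cur[j]) * np.sqrt(h) * w_path[i, j])
    return x_l
```

Lemma 9 quantifies how far this surrogate can drift from the fully re-integrated perturbed trajectory $\tilde{x}^p$, and Lemma 10 turns that estimate into a choice of $L$.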
Lemma 10
Under the same settings as in Lemma 9, given any positive constant $\epsilon$, if
\[
L \ge \frac{\log\big(\epsilon / (C_2(f,\sigma)\,\|x^o_{i_\star} - x^p_{i_\star}\|^2)\big)}{-C_d} + \frac{2\,C_1(f,\sigma)(1+h)}{C_d}\,T,
\]
then $E[\|\tilde{x}^l_j(ih) - \tilde{x}^p_j(ih)\|^2] \le \epsilon$ for $j = 1, \ldots, m$ and $i = 0, \ldots, T/h$.

Proof:
According to Lemma 9, we only need to ensure
\[
C_2(f,\sigma)\,e^{2 C_1(f,\sigma)(1+h)\,ih}\,e^{-C_d(L+1)}\,\|x^o_{i_\star} - x^p_{i_\star}\|^2 \le \epsilon \quad \text{for } i = 0, \ldots, T/h,
\]
which holds as long as
\[
e^{-C_d(L+1)} \le \frac{\epsilon}{C_2(f,\sigma)\,e^{2 C_1(f,\sigma)(1+h) T}\,\|x^o_{i_\star} - x^p_{i_\star}\|^2}.
\]
Taking logarithms on both sides, we obtain that this holds if
\[
L \ge \frac{\log\big(\epsilon / (C_2(f,\sigma)\,\|x^o_{i_\star} - x^p_{i_\star}\|^2)\big)}{-C_d} + \frac{2\,C_1(f,\sigma)(1+h)}{C_d}\,T.
\]
The proof is complete. □

Proofs of propositions and theorems

The proof of Proposition 1:
It is a direct result of Lemma 5. □
The proof of Theorem 1:
The conclusions are direct results of Lemmas 6 and 7. □
The proof of Theorem 2:
The conclusions are direct results of Lemmas 9 and 10. □