Accelerating Metropolis-within-Gibbs sampler with localized computations of differential equations
Qiang Liu∗   Xin T. Tong†

February 19, 2020

∗ Department of Mathematics, National University of Singapore, Singapore, [email protected]
† Department of Mathematics, National University of Singapore, Singapore, [email protected]
Abstract
Inverse problems are ubiquitous in science and engineering, and Bayesian methodologies are often used to infer the underlying parameters. For high dimensional temporal-spatial models, classical Markov chain Monte Carlo (MCMC) methods are often slow to converge, and it is necessary to apply Metropolis-within-Gibbs (MwG) sampling on parameter blocks. However, the computation cost of each MwG iteration is typically O(n²), where n is the model dimension. This can be too expensive in practice. This paper introduces a new reduced computation method to bring down the computation cost to O(n), for the inverse initial value problem of a stochastic differential equation (SDE) with local interactions. The key observation is that each MwG proposal differs from the original iterate at only one parameter block, and this difference propagates only within a local domain in the SDE computations. Therefore we can approximate the global SDE computation with a surrogate updated only within the local domain, for reduced computation cost. Both theoretically and numerically, we show that the approximation errors can be controlled by the local domain size. We discuss how to implement the local computation scheme using Euler–Maruyama and 4th order Runge–Kutta methods. We numerically demonstrate the performance of the proposed method with the Lorenz 96 model and a linear stochastic flow model.

1 Introduction

Inverse problems are ubiquitous in various fields of science and engineering [26, 4, 13]. They concern how to infer model parameters from partial, delayed, and noisy observations. Typical examples include using measurements of seismic waves to determine the location of an earthquake epicenter, and recovering images that are as close to natural ones as possible from blurry observations [22]. The general formulation of an inverse problem can be written as

y = h(x, ζ).  (1)

Here x denotes the parameters to be estimated, y denotes the observation data, h is a physical model describing the data collection process, and ζ is the possible random factor involved.

Often, it is of interest to quantify the uncertainty of x, which can be used to infer estimation accuracy and regulate risk. The Bayesian approach is more appropriate for this purpose [13, 6, 24]. It starts by modeling x with a prior distribution p(x) and finding the observation distribution p_l(y|x) from (1). Then Bayes' formula indicates that the posterior distribution of x is given by

p(x|y) ∝ p(x) · p_l(y|x).  (2)

Markov chain Monte Carlo (MCMC) is a large class of stochastic algorithms designed to sample the posterior density iteratively [11, 15]. In each iteration, a new proposal x′ is generated from the current iterate x, which can be described by a transition density Q(x′|x). Then the Metropolis–Hastings (MH) step accepts this proposal with probability

α(x′, x) = min{1, [Q(x|x′) p(x′) p_l(y|x′)] / [Q(x′|x) p(x) p_l(y|x)]}.  (3)

Some popular choices of Q include the random walk transition in random walk Metropolis (RWM), and the Langevin dynamic transition in the Metropolis adjusted Langevin algorithm (MALA) [19, 12, 23]. While these MCMC algorithms perform well for classical problems, they become very slow when applied to modern data science problems, where the parameter dimension n := dim(x) is very large.
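For concreteness, here is a minimal sketch (not from the paper) of one MH step with a random walk proposal, written in log space to avoid underflow; `log_post` is a user-supplied unnormalized log posterior.

```python
import numpy as np

def rwm_step(x, log_post, step_size, rng):
    """One random walk Metropolis step.

    log_post(x) should return log p(x) + log p_l(y|x) up to a constant.
    Working with log densities avoids floating-point underflow when the
    dimension n is large.
    """
    x_prop = x + step_size * rng.standard_normal(x.shape)
    # Symmetric proposal, so Q(x|x') / Q(x'|x) = 1 and (3) reduces to a
    # posterior density ratio.
    log_alpha = min(0.0, log_post(x_prop) - log_post(x))
    if np.log(rng.uniform()) <= log_alpha:
        return x_prop, True
    return x, False
```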
The main issue is that the proposal x′ in RWM or MALA generally differs from x in all components, so the MH acceptance probability α(x′, x) is often of order O(e^{−‖x′−x‖²}) ≈ O(e^{−n}). This is extremely small when n is more than a few thousand. The proposals will mostly be rejected, and the MCMC is essentially stuck. One way to alleviate this issue is to choose a small step size in the proposal, so the average acceptance probability is O(1). But then consecutive MCMC iterates are close to each other, so the overall movement of the MCMC can still be slow.

The curse of dimensionality can often be lifted if there are exploitable statistical structures. Examples include conditional Gaussianity and low effective dimension. In geophysical applications, the components of x usually describe status at different locations. Because the underlying physical law is often short-ranged, faraway components of x are nearly independent. Consequently, the associated covariance matrix will have a clear banded structure. Such a phenomenon is called spatial localization, and it widely exists in problems involving vast spatial domains. In the statistics literature, such banded structure can be exploited by tapering techniques, which significantly improve covariance estimation. And in numerical weather prediction (NWP), localization techniques are designed to utilize this structure, so algorithms such as the ensemble Kalman filter can provide stable estimation for planetary models of millions of dimensions with merely 100 samples.

A recent work [20] investigates the possibility of exploiting spatial localization with MCMC. It is found that Gibbs sampling [10, 9] is a natural framework for this purpose. To do so, one first partitions the model components into m blocks with x = [x_1, ..., x_m], where each block contains only b = n/m = O(1) nearby components. When running MCMC, one generates a proposal for the k-th block by applying the Gibbs sampler on the prior. This proposal x′ differs from x only at the k-th block, which is of dimension b. Therefore, the corresponding MH acceptance probability α(x′, x) in (3) is of scale O(1) and does not degenerate even if the overall dimension n is large. A completely new iterate can be generated by repeating this procedure for all m blocks. A more detailed description of this Metropolis-within-Gibbs (MwG) sampler can be found in Section 2.2. Numerical tests and rigorous analysis in Gaussian settings have revealed that MwG has a dimension independent MCMC convergence rate when the underlying distributions are spatially localized.

1.3 Acceleration with local computation

While MwG takes only a constant number of iterations to sample the posterior distribution, the computation cost of each iterate can be expensive. This is because, when a Gibbs block proposal is processed by the MH step (3), one often needs to evaluate p_l(y|x′), which typically involves O(n) computational cost. The proposal procedure is repeated for all m blocks, so generating a new MwG iterate costs O(n) × m = O(n²). This is much more expensive than RWM and MALA iterates, which in general cost O(n) computation.

However, it is possible to reduce the cost of evaluating p_l(y|x′) to O(1). The main observation is that x′ and x differ only at one block, say the k-th block, and the computation of p_l(y|x) is done in the previous MH step. So if one can replace the x_k part in the computation of p_l(y|x) with x′_k, the value of p_l(y|x′) can be obtained cheaply. As a simple example, suppose in (1) the observation model is y = x + ξ with ξ ∼ N(0, I_n), a normal distribution with n-dimensional zero mean vector and the n × n identity matrix I_n as covariance. Then we have

−log p_l(y|x) = (1/2) Σ_{k=1}^m ‖y_k − x_k‖² + const.

Suppose this value is already available. Then, when computing −log p_l(y|x′), one only needs to update the k-th summand to ‖y_k − x′_k‖²/2, which only costs O(1) computation. This example can be easily generalized to cases where y_k relies on multiple blocks of x. See the detailed discussion of this, and of the possibility of parallelization, in [20]. A minimal sketch of this O(1) update is given below.
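The following sketch assumes the Gaussian model y = x + ξ above; the running value `neg_log_lik` and the helper name are illustrative, not from the paper.

```python
import numpy as np

def update_neg_log_lik(neg_log_lik, y_blocks, x_old_block, x_new_block, k):
    """Update -log p_l(y|x) (up to a constant) when only block k changes.

    neg_log_lik is assumed to hold 0.5 * sum_k ||y_k - x_k||^2 for the
    current iterate; only the k-th summand needs to be replaced.
    """
    old_term = 0.5 * np.sum((y_blocks[k] - x_old_block) ** 2)
    new_term = 0.5 * np.sum((y_blocks[k] - x_new_block) ** 2)
    return neg_log_lik - old_term + new_term  # O(b) work, independent of n
```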
In this paper, we explore the possibility of reducing the computational cost of MwG from O(n²) to O(n). We assume the observation model (1) is given by a high dimensional stochastic differential equation (SDE) with short-range interaction, where x = x(0) is the initial condition of the SDE, and y consists of noisy partial observations of the SDE, x(s ≤ T), in a fixed time interval [0, T]. Such an inverse initial value problem is practically important. It can be interpreted as a one-step smoothing problem in signal processing. Data assimilation problems such as NWP can also be formulated as sequential applications of it [7, 17].

Finding p_l(y|x) is equivalent to solving the SDE and finding x(s ≤ T) in the smoothing context. The standard Euler–Maruyama scheme requires a computational cost of O(n). When the Gibbs sampler proposes x′, one needs to use it as a new SDE initial condition and compute x′(s ≤ T). But because x′ differs from x only at x_k, we show in Proposition 1 that x′_i(s ≤ T) is not much different from x_i(s ≤ T) if |i − k| is larger than a radius L. Therefore it is not necessary to redo the computation in full. In other words, instead of applying Euler–Maruyama to compute x′_i(s ≤ T) for all i ∈ {1, ..., m}, we only compute locally for the ones with |i − k| ≤ L. Since the radius L can often be chosen as a constant independent of n, updating x(s ≤ T) to x′(s ≤ T) only costs O(1) computation. This procedure is repeated through all m blocks, so the overall cost of finding a new MwG iterate is O(1) × m = O(n). This achieves the aforementioned computation reduction objective. We use a-MwG to denote this accelerated version of MwG.

Since MwG only requires constantly many iterates to sample a spatially localized distribution, a-MwG is expected to solve the Bayesian inverse problem with only a computation cost of O(n). This is the optimal dimension scaling one can obtain for any numerical method. The main drawback of a-MwG is that it uses a local computation scheme, so it is subject to approximation errors. However, these errors can be controlled by choosing a large enough L. We demonstrate this through rigorous analysis and numerical tests.

1.4 Organization and Preliminaries

This paper is organized in the following way. In Section 2, we consider the inverse problem of how to infer the initial condition of an SDE with local interaction. We review the MwG sampler with a discussion of its computational complexity. We derive an accelerated algorithm, called the a-MwG sampler, in Section 3, and analyze its approximation errors. We also discuss how to implement a-MwG with Euler–Maruyama and 4th order Runge–Kutta schemes, as well as its adaptation to parallelization.
In Section 4, two numerical experiments, the Lorenz 96 model and a linearized stochastic model, are studied. The paper is concluded in Section 5. All proofs are postponed to the Appendix.

Throughout this paper, we use the following notation. When applying MwG, we partition a high dimensional vector into blocks, written as x = [x_1, ..., x_m]. For simplicity of the discussion, we assume each block has the same length b, so the total dimension is n = mb. We remark that our results generalize easily to non-constant block sizes. In practice, each block often represents information at a location on a torus, so it is natural to introduce a measure of distance between indices:

d(j_1, j_2) = min{|j_1 − j_2|, |j_1 − j_2 + m|, |j_2 − j_1 + m|},  j_1, j_2 = 1, ..., m.

When a matrix A is given, its (i, j)-th entry is denoted A_{i,j}; A^T and A^{−1} denote the transpose and the inverse of the matrix, respectively. We adopt ‖·‖ and ‖·‖_∞ to denote the l_2 and l_∞ norms of a vector: for a vector a with elements a_1, ..., a_n, ‖a‖ = (Σ_{i=1}^n a_i²)^{1/2} and ‖a‖_∞ = max{|a_i| : i = 1, ..., n}. For an m × n matrix A, the l_2 operator norm is ‖A‖ = sup{‖Av‖ : v ∈ R^n, ‖v‖ = 1}. For two m × n matrices A and B (including vectors as a special case), we say A ⪯ (⪰) B if A_{i,j} ≤ (≥) B_{i,j} entry-wise for 1 ≤ i ≤ m, 1 ≤ j ≤ n. 0_{m×n} and 1_{m×n} are the m × n matrices whose entries are all 0 and all 1, respectively. N(m, Σ) is the multivariate normal distribution with mean vector m and covariance matrix Σ. We use I_n to denote the n × n identity matrix. C is a generic positive constant that may vary from line to line.

We consider a spatio-temporal model with local interaction

d x_j(t) = f_j(t, x_{j−1}(t), x_j(t), x_{j+1}(t)) dt + σ_j(t, x_j(t)) d W_j(t), for j = 1, ..., m,
x_0(t) = x_m(t), x_{m+1}(t) = x_1(t), t ∈ [0, T],  (4)

where σ_j is a b × b matrix-valued adapted locally bounded process, and W_j(t) is a b-dimensional standard Brownian motion. We assume the following Lipschitz continuity conditions hold for the coefficient processes f_j and σ_j:

Assumption 1 Given b × 1 vectors x_1, x_2, x_3, y_1, y_2, y_3 and t ∈ [0, T], for j = 1, ..., m, there exist constants C_f > 0 and C_σ > 0 such that

‖f_j(t, x_1, x_2, x_3) − f_j(t, y_1, y_2, y_3)‖ ≤ C_f (‖x_1 − y_1‖ + ‖x_2 − y_2‖ + ‖x_3 − y_3‖),

and

‖σ_j(t, x) − σ_j(t, y)‖ ≤ C_σ ‖x − y‖.

The solution of (4) is determined once the initial condition x(0) and the realization of W = [W_1, ..., W_m] are given. We write the solution as x(t) = Φ(x(0), t, W). In the scenario where there is no stochastic forcing, namely σ_j ≡ 0_{b×b}, (4) is an ordinary differential equation (ODE). Its solution can be simply written as x(t) = Φ(x(0), t).

One key feature of (4) is that the drift term driving x_j relies only on its neighboring blocks. This describes general short-range interactions that are typical in spatio-temporal models. The formulation of (4) arises naturally if one applies a finite difference discretization to a stochastic partial differential equation, such as a reaction-diffusion equation. Details of such a derivation can be found in [5].

We assume an n′-dimensional data vector is generated from a noisy observation of (4) at time T > 0:

y = H · Φ(x, T, W) + ξ.  (5)
Here H is an n′ × n observation matrix that selects the observable components, and ξ represents the associated observation noise, distributed as N(0_{n′×1}, R).

The Bayesian inverse problem this paper tries to solve is finding the posterior distribution of the initial condition x given the data y. For simplicity, we assume the prior distribution p of x is Gaussian with mean 0_{n×1} and covariance matrix Σ_pri. The observation likelihood is given by

p_l(y|x) ∝ E exp(−½‖y − H · Φ(x, T, W)‖²_R),  (6)

where for a vector v we define ‖v‖²_R := v^T R^{−1} v. The expectation in (6) averages over all realizations of W. In computation, it can be approximated by a Monte Carlo sampled version

p̂_l(y|x) := (1/S) Σ_{c=1}^S exp(−½‖y − H · Φ(x, T, W^(c))‖²_R),  (7)

where each W^(c) is an independent Brownian motion realization. Note that (7) is an unbiased estimator of (6); using it instead of (6) in MCMC is known as the pseudo-marginal algorithm [1]. The convergence of the pseudo-marginal algorithm is largely similar to that of the standard MCMC where (6) is used directly [2].

When (4) is an ODE, (6) simplifies to

p_l(y|x) ∝ exp(−½‖y − H · Φ(x, T)‖²_R).  (8)

2.2 The Metropolis-within-Gibbs sampler

The Metropolis-within-Gibbs sampler is an MCMC algorithm. For each block, it generates a proposal by applying Gibbs sampling to the prior, and uses a Metropolis step to incorporate the data information. Specifically, it generates iterates of the form x^k = [x^k_1, ..., x^k_m] through the following steps, where we start with k = 1 and draw x^1 randomly from p:

1) Repeat steps 2)-4) for all block indices j = 1, ..., m:
2) Sample x̃_j from p(x_j ∈ · | x^k_1, ..., x^k_{j−1}, x^k_{j+1}, ..., x^k_m).
3) Let x_p = [x^k_1, ..., x^k_{j−1}, x̃_j, x^k_{j+1}, ..., x^k_m].

4) Let x^k = x_p with probability

α(x_p, x^k) = min{1, p_l(y|x_p) / p_l(y|x^k)}.  (9)

Here p_l can also be replaced by the sample version p̂_l in (7).

5) When the loop in step 1) is finished, let x^{k+1} = x^k, and increase the iteration index from k to k + 1.

In [20], it is shown that MwG has dimension independent performance if 1) p is a Gaussian distribution and its covariance or precision matrix is close to banded, and 2) each component of y depends significantly on only a few components of x. It is also discussed there how to truncate the far-off-diagonal entries of the prior covariance or precision matrix. This so-called "localization" technique can simplify the computation of the proposal probability in MwG step 2), making its cost O(1).

When applying MwG directly to the inverse problem described in Section 2, the computational cost of fully updating all blocks is O(n²). This is because in step 4) we need to evaluate p_l through either (7) or (8). This involves finding the numerical solution of (4), which requires executing the Euler–Maruyama or 4th order Runge–Kutta method. Both methods require O(n) computation when the dimension of the differential equation (4) is n. Then, because step 4) is repeated for all m blocks, the total computation cost is O(n) × m = O(n²).

In summary, although the vanilla MwG only requires O(1) iterates to converge to the posterior distribution, each individual iterate can cost O(n²) computation. This can be less appealing than standard MCMC algorithms with optimal tuning. For example, RWM can converge to the posterior distribution in O(√n) iterations, while each iterate costs O(n), so the total complexity is O(n^{3/2}). In this paper, we demonstrate how to bring down the computational cost of each MwG iterate to O(n). This leads to the optimal MCMC computational scalability. A minimal sketch of one MwG sweep follows.
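This sketch of steps 1)-5) assumes helper functions `sample_prior_conditional` and `log_lik` are available; both names are illustrative.

```python
import numpy as np

def mwg_sweep(x, log_lik, sample_prior_conditional, m, rng):
    """One full Metropolis-within-Gibbs sweep over the m blocks.

    x: list of m block vectors. log_lik(x) = log p_l(y|x); each call is
    O(n) here, which is exactly the cost Section 3 reduces to O(1).
    """
    ll_cur = log_lik(x)
    for j in range(m):
        x_prop = list(x)
        x_prop[j] = sample_prior_conditional(x, j, rng)  # Gibbs proposal from the prior
        ll_prop = log_lik(x_prop)
        if np.log(rng.uniform()) <= min(0.0, ll_prop - ll_cur):  # MH step (9)
            x, ll_cur = x_prop, ll_prop
    return x
```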
3 The accelerated MwG sampler

3.1 Local perturbations

Based on our previous discussion, the main computational cost of MwG takes place at step 4), when (9) is evaluated. We now discuss how to reduce this cost to O(1). In what follows, we fix the MCMC iteration index as o and the block index as i⋆; the same procedure applies to all iteration indices and blocks.

First, note that when executing step 4), the value of p_l(y|x^o) is already available from the previous execution of step 4). Likewise, we already have the values x^o(t ≤ T) = Φ(x^o, t ≤ T, W). It is only necessary to find p_l(y|x^p), or equivalently x^p(t ≤ T) = Φ(x^p, t ≤ T, W). Note that we often write x^o(0) as x^o, since it is the information we try to recover.

Our main observation is that x^p(0) = x^p differs from x^o(0) = x^o only at the i⋆-th block. Since components in the SDE (4) exchange information only through local interactions, the differences between x^p(t) and x^o(t) are likely to be significant only for blocks close to i⋆. In fact, we have the following proposition:

Proposition 1 Under Assumption 1 and with any given positive constant C_d, for j = 1, ..., m, we have

E[‖x^o_j(t) − x^p_j(t)‖²] ≤ ‖x^o_{i⋆} − x^p_{i⋆}‖² e^{C₁(f,σ)t} e^{−C_d·d(j,i⋆)}  (10)

with

C₁(f, σ) = (e^{C_d} + e^{−C_d} + 1)(C_f + C_σ + 1).  (11)

In particular, for any fixed small enough threshold ε and time range T, we can find a radius L such that

E[‖x^o_j(t) − x^p_j(t)‖²] ≤ ε, ∀ t ≤ T, d(i⋆, j) > L.

We define the local domain centered at index i⋆ as B_{i⋆} := {j : d(i⋆, j) ≤ L}, and use B^c_{i⋆} to denote its complement. Then x^o_j(t) is already a good approximation of x^p_j(t) for j ∈ B^c_{i⋆}. In other words, it is no longer necessary to recompute x^p_j(t) for j ∈ B^c_{i⋆}; we only need to compute x^p_j(t) for j ∈ B_{i⋆}. Since B_{i⋆} contains at most 2L + 1 elements, and L is likely to be independent of the problem dimension n, this provides us with a way to accelerate the overall computation of MwG.

3.2 The local surrogate model

Next we consider how to use the existing values of x^o(t ≤ T) to reduce the computation of x^p(t ≤ T). For this purpose, we introduce a local surrogate model given by

x^l_j(t) = x^o_j(t), for j ∈ B^c_{i⋆},
d x^l_j(t) = f_j(t, x^l_{j−1}(t), x^l_j(t), x^l_{j+1}(t)) dt + σ_j(t, x^l_j(t)) d W_j(t), for j ∈ B_{i⋆},
x^l_0(t) = x^l_m(t), x^l_{m+1}(t) = x^l_1(t).  (12)

Its initial condition is set to x^l(0) = x^p(0) = x^p. We write its solution as

x^l(t) = Φ_l(x^p, t, x^o(t ≤ T), W).

Note that the local surrogate x^l_j(t) within B_{i⋆} depends on x^o_j(t) for the boundary blocks of B^c_{i⋆}, of which the indices are usually just i⋆ + L + 1 and i⋆ − L − 1.
Such dependence can be viewed as using x^o as spatial boundary conditions for the computation of x^l.

We will use the local surrogate x^l(t ≤ T) as an approximation of x^p(t ≤ T). It can be computed cheaply because it differs from x^o(t ≤ T) only for blocks inside the local domain B_{i⋆}. The details of how to achieve this can be found in the subsequent parts. First, we investigate the approximation error through the following theorem.
Theorem 1 Under Assumption 1 and with any given positive constant C_d, for j = 1, ..., m and t ∈ [0, T], we have

E[‖x^l_j(t) − x^p_j(t)‖²] ≤ C₂(f, σ) ‖x^o_{i⋆} − x^p_{i⋆}‖² e^{2C₁(f,σ)t} e^{−C_d·(L+1)},  (13)

with C₁(f, σ) defined in (11) and

C₂(f, σ) = max{2C_f / C₁(f, σ), 1}.  (14)

In particular, for any given ε > 0, if the local domain radius L satisfies

L ≥ log(ε / (C₂(f, σ) ‖x^o_{i⋆} − x^p_{i⋆}‖²)) / (−C_d) + 2 C₁(f, σ) T / C_d,  (15)

then E[‖x^l_j(t) − x^p_j(t)‖²] ≤ ε for all t ≤ T.

Theorem 1 indicates that we can use x^l(T) to approximate x^p(T) with controlled accuracy. So α(x^p, x^o) as defined in (9) can be substituted by α(x^l, x^o). When such a scheme for calculating (9) is plugged into MwG, it leads to acceleration in terms of computation complexity. We call the result the accelerated Metropolis-within-Gibbs (a-MwG) sampler and present its pseudocode in Algorithm 1. Next, we discuss how to compute the local surrogate (12) with numerical methods, and how to implement parallelization.

3.3 Local computation with Euler–Maruyama

When the stochastic forcing is nonzero, our model (4) is a bona fide SDE. Euler–Maruyama is the standard numerical method for its computation.

In the vanilla MwG, if one applies it directly to the full model (4) to obtain x^p(T), a small step size h > 0 is chosen, and x^p(t) is approximated by x̃^p((i+1)h) if ih < t ≤ (i+1)h. The numerical solution x̃^p((i+1)h) at the grid points (i+1)h is generated by the following iteration, starting from x̃^p(0) = x^p:

x̃^p_j((i+1)h) = x̃^p_j(ih) + σ_j(ih, x̃^p_j(ih)) √h W_{i,j} + f_j(ih, x̃^p_{j−1}(ih), x̃^p_j(ih), x̃^p_{j+1}(ih)) h.  (16)

Here, the W_{i,j} are independent samples from N(0_{b×1}, I_b). One can view

W_j(kh) = Σ_{i=1}^k √h W_{i,j}

as a realization of the Brownian motion W_j in (4). As (16) is repeated for all block indices j, obtaining x̃^p(T) has an O(n) computational complexity.

When applying Euler–Maruyama to the local surrogate (12), we assume a numerical approximation of x^o(t ≤ T) is available as x̃^o(ih). This comes either from the previous a-MwG iteration or from an initial computation before the first a-MwG iteration. Then, if the proposal x^p differs from x^o at the i⋆-th block, we set the local domain as B_{i⋆} = {j : d(j, i⋆) ≤ L}, and denote its complement as B^c_{i⋆}. For blocks outside B_{i⋆}, we directly use x̃^o_j(ih) as the numerical solution; that is, we let

x̃^l_j(ih) = x̃^o_j(ih), j ∈ B^c_{i⋆}, i = 1, ..., T/h.  (17)

New computation is needed only for x̃^l_j(ih) with j ∈ B_{i⋆}, which is obtained through

x̃^l_j((i+1)h) = x̃^l_j(ih) + σ_j(ih, x̃^l_j(ih)) √h W_{i,j} + f_j(ih, x̃^l_{j−1}(ih), x̃^l_j(ih), x̃^l_{j+1}(ih)) h,  (18)

with initial condition x̃^l_j(0) = x^p_j. This procedure is how we obtain the step x^l(t, c) = Φ_l(x^p, t, x^o(t ≤ T, c), W^(c)) in a-MwG Algorithm 1; a minimal sketch is given below.
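The following sketch of the local update (17)-(18) assumes b = 1 for readability; `f` and `sigma` are the coefficient functions of (4), `W` holds the fixed N(0,1) increments W_{i,j}, and all names are illustrative.

```python
import numpy as np

def local_euler_maruyama(x_traj_old, x_p, f, sigma, h, i_star, L, W):
    """Local surrogate trajectory (12) via Euler-Maruyama, O(L) work per step.

    x_traj_old: array of shape (n_steps+1, m), the stored trajectory of x_o.
    x_p: proposed initial condition (differs from x_traj_old[0] at block i_star).
    W: array of shape (n_steps, m) of fixed standard normal increments.
    """
    n_steps, m = W.shape
    # Copy step (17); in practice only the boundary blocks i_star +/- (L+1)
    # are needed, as discussed in the text.
    x_traj = x_traj_old.copy()
    x_traj[0, :] = x_p
    local = [(i_star + d) % m for d in range(-L, L + 1)]  # the local domain B
    for i in range(n_steps):
        t = i * h
        for j in local:  # update (18) only inside B; neighbors outside B use x_o
            jm, jp = (j - 1) % m, (j + 1) % m
            x_traj[i + 1, j] = (x_traj[i, j]
                                + f(t, x_traj[i, jm], x_traj[i, j], x_traj[i, jp]) * h
                                + sigma(t, x_traj[i, j]) * np.sqrt(h) * W[i, j])
    return x_traj
```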
Algorithm 1: The accelerated Metropolis-within-Gibbs sampler

Input: K: number of iterations; b: block size; L: local domain radius; S: Brownian motion sample size.
Output: K iterates x^k with k = 1, ..., K.

% Initial computation
Sample x^o ∼ p;
for c = 1, ..., S do
    Generate a Brownian motion realization W^(c)(t ≤ T);
    Let x^o(t ≤ T, c) = Φ(x^o, t ≤ T, W^(c));  % full model computation
end
Let p̂_o = (1/S) Σ_{c=1}^S exp(−½‖y − H · x^o(T, c)‖²_R);
% MCMC loop
for k = 1, ..., K do
    % Gibbs loop
    for j = 1, ..., m do
        % Proposal step
        Sample x^p_j ∼ p(x_j ∈ · | x^o_1, ..., x^o_{j−1}, x^o_{j+1}, ..., x^o_m);
        Let x^p = [x^o_1, ..., x^o_{j−1}, x^p_j, x^o_{j+1}, ..., x^o_m];
        for c = 1, ..., S do
            Let x^l(t, c) = Φ_l(x^p, t, x^o(t ≤ T, c), W^(c)) for all t ≤ T;  % local computation
        end
        % Pseudo-marginal MH step
        Let p̂_p = (1/S) Σ_{c=1}^S exp(−½‖y − H · x^l(T, c)‖²_R);
        Let α = min{1, p̂_p / p̂_o};
        Sample u ∼ Uniform([0, 1]);
        if u ≤ α then
            Let x^o = x^p;
            Let p̂_o = p̂_p;
            for c = 1, ..., S do
                Let x^o(t ≤ T, c) = x^l(t ≤ T, c);
            end
        end
    end
    Let x^k = x^o;
end
Note that in a-MwG, the W^(c) are Brownian motion realizations fixed during the loops. So when implementing (18), one uses the fixed N(0_{b×1}, I_b) samples W^(c)_{i,j} in place of W_{i,j}, instead of generating new independent samples for W_{i,j}.

The reason we need to do the copying step (17) before the local Euler–Maruyama step is that the values of x^l_{i⋆±(L+1)}(ih) are needed in the computation of x^l_{i⋆±L}(ih) in (18). This means that in a real implementation, (17) only needs to be executed for the boundary blocks j = i⋆ ± (L + 1). This can save additional computation time and storage in practice. We choose not to formulate (17) and (18) in this fashion, simply to keep the notation clean.

Because (18) only requires execution for j ∈ B_{i⋆}, the total cost of obtaining x̃^l is only O(1), which is one order cheaper than obtaining x̃^p directly with Euler–Maruyama. This partial computation naturally introduces an error. But, similar to Theorem 1, this error can be controlled through the local domain radius L, according to the following theorem:
Theorem 2 Under Assumption 1, for any given positive constant C_d, j = 1, ..., m and i = 0, 1, ..., T/h, we have

E[‖x̃^l_j(ih) − x̃^p_j(ih)‖²] ≤ C₂(f, σ) e^{2C₁(f,σ)(1+h)ih} e^{−C_d(L+1)} ‖x^o_{i⋆} − x^p_{i⋆}‖²,  (19)

with C₁(f, σ), C₂(f, σ) defined in (11) and (14). For any given constant ε, if the local domain radius L satisfies

L ≥ log(ε / (C₂(f, σ) ‖x^o_{i⋆} − x^p_{i⋆}‖²)) / (−C_d) + 2 C₁(f, σ)(1 + h) T / C_d,  (20)

then E[‖x̃^l_j(ih) − x̃^p_j(ih)‖²] ≤ ε.

Note that as h → 0, (19) and (20) converge to their theoretical counterparts in Theorem 1.
Also note that, based on (20), the choice of L is independent of the dimension n.

3.4 Local computation with Runge–Kutta

When the stochastic forcing is zero, our model (4) is an ODE. This means we can use S = 1 in Algorithm 1, and it is not necessary to sample the Brownian motion. The 4th order Runge–Kutta method (RK4) is the standard numerical method for ODE computation.

When applying RK4 to the full model to obtain x^o(t ≤ T) as a start, one runs the following iteration:

x̃^o_j((i+1)h) = x̃^o_j(ih) + (1/6)(k^{1,o}_j(i) + 2k^{2,o}_j(i) + 2k^{3,o}_j(i) + k^{4,o}_j(i)),
k^{1,o}_j(i) = h f_j(ih, x̃^o_{j−1}(ih), x̃^o_j(ih), x̃^o_{j+1}(ih)),
k^{2,o}_j(i) = h f_j(ih + h/2, x̃^o_{j−1}(ih) + k^{1,o}_{j−1}(i)/2, x̃^o_j(ih) + k^{1,o}_j(i)/2, x̃^o_{j+1}(ih) + k^{1,o}_{j+1}(i)/2),
k^{3,o}_j(i) = h f_j(ih + h/2, x̃^o_{j−1}(ih) + k^{2,o}_{j−1}(i)/2, x̃^o_j(ih) + k^{2,o}_j(i)/2, x̃^o_{j+1}(ih) + k^{2,o}_{j+1}(i)/2),
k^{4,o}_j(i) = h f_j(ih + h, x̃^o_{j−1}(ih) + k^{3,o}_{j−1}(i), x̃^o_j(ih) + k^{3,o}_j(i), x̃^o_{j+1}(ih) + k^{3,o}_{j+1}(i)).  (21)

This step is repeated for all j = 1, ..., m. A minimal vectorized sketch of one such RK4 step is given below.
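This sketch of one step of (21) assumes b = 1 and a drift of the neighbor form f_j(t, x_{j−1}, x_j, x_{j+1}); np.roll supplies the periodic neighbors, and the function names are illustrative.

```python
import numpy as np

def rk4_step(x, t, h, f):
    """One step of (21) for the full locally interacting ODE.

    f(t, x_left, x, x_right) acts componentwise; np.roll(x, 1) gives
    x_{j-1} and np.roll(x, -1) gives x_{j+1} with periodic indexing.
    """
    def F(t, x):
        return f(t, np.roll(x, 1), x, np.roll(x, -1))
    k1 = h * F(t, x)
    k2 = h * F(t + h / 2, x + k1 / 2)
    k3 = h * F(t + h / 2, x + k2 / 2)
    k4 = h * F(t + h, x + k3)
    return x + (k1 + 2 * k2 + 2 * k3 + k4) / 6
```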
When applying RK4 to compute the local surrogate x^l(t) = Φ_l(x^p, t, x^o(t ≤ T)), we need not only the numerical approximation x̃^o(t) of x^o(t), but also the intermediate values k^{s,o}_j(i) for s = 1, 2, 3, 4. These values are directly taken as x̃^l_j and its RK4 intermediate values for j ∈ B^c_{i⋆}:

x̃^l_j(ih) = x̃^o_j(ih), k^{s,l}_j(ih) = k^{s,o}_j(ih).  (22)

New RK4 computation is only needed for blocks inside the local domain, and x̃^l_j(ih) is obtained by iterating the following step for j ∈ B_{i⋆} and i = 1, ..., T/h:

x̃^l_j((i+1)h) = x̃^l_j(ih) + (1/6)(k^{1,l}_j(i) + 2k^{2,l}_j(i) + 2k^{3,l}_j(i) + k^{4,l}_j(i)).  (23)

Here the intermediate values k^{s,l}_j are defined in the same way as k^{s,o}_j in (21), that is, every instance of x̃^o_j is replaced by x̃^l_j, and every instance of k^{s,o}_j is replaced by k^{s,l}_j. As with the local computation for the Euler–Maruyama method, we note that (22) is in reality only needed for {j : L + 1 ≤ |j − i⋆| ≤ L + 4}. To see this, note that to obtain x̃^l_j((i+1)h) we need k^{4,l}_j with j ∈ B_{i⋆}. This needs k^{3,l}_j with j ∈ {j : |j − i⋆| ≤ L + 1}. Likewise, we need k^{4−s,l}_j with j ∈ {j : |j − i⋆| ≤ L + s} for s = 2, 3, where we interpret k^{0,l}_j = x̃^l_j(ih). We can exploit this to save both computation time and storage.

By completing the procedures described above, we obtain the local surrogate x^l(t) = Φ_l(x^p, t, x^o(t ≤ T)). If x^p is accepted in the MH step, it is used as x^o in the next iteration. Since (23) is repeated only for blocks within the local domain, the overall cost is O(1).

While an approximation error analysis like Theorem 2 in principle exists, we do not provide a detailed proof. This is partly because RK4 is much more complicated than Euler–Maruyama. Moreover, being a 4th order accuracy method, the RK4 solution is very close to the exact solution of (12), so the approximation error can be learned directly from Theorem 1.

3.5 Parallelization

The a-MwG algorithm we proposed can easily be adapted for parallelization, for shorter wall-clock computation time.

First of all, in Algorithm 1, every repetition among the S Brownian motion realizations can be computed in parallel. This is the case because we assume the W^(c) are independent of each other.

Next, recall that in the local surrogate model (12), x^l_j(t) needs renewed computation only for j ∈ B_{i⋆}, which depends on x^o_j(t) only at the boundary blocks j = i⋆ ± (L + 1). Now consider another block index j⋆ such that d(i⋆, j⋆) ≥ 2L + 2. Then its local domain B_{j⋆} and boundary blocks j = j⋆ ± (L + 1) have no intersection with those of i⋆. This means that if there are two Gibbs proposals x^{p,i⋆} and x^{p,j⋆}, differing from the current iterate x^o at the i⋆-th block and the j⋆-th block respectively, then the computations of Φ_l(x^{p,i⋆}, t, x^o(t ≤ T, c), W^(c)) and Φ_l(x^{p,j⋆}, t, x^o(t ≤ T, c), W^(c)) are independent. Thus we can compute them in parallel.

In other words, when implementing a-MwG, at each Gibbs iteration we can draw several proposals in parallel. Each proposal differs from the current iterate x^o at one block index, and each pair of these block indices is a distance 2L + 2 or more apart. Then the local surrogate model (12) can be applied to each proposal in parallel, and the MH step is run independently on each block. This will no doubt save wall-clock computation time; a sketch of such a non-interacting block schedule follows.
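The following sketch groups block indices into batches whose members are pairwise at ring distance at least 2L + 2, so their local computations can run in parallel. It is an illustration of the strategy described above, not code from the paper.

```python
def parallel_batches(m, L):
    """Partition block indices 0..m-1 into batches whose members are
    pairwise at ring distance >= 2L + 2, so they can be proposed and
    updated in parallel."""
    stride = 2 * L + 2
    remaining = list(range(m))
    batches = []
    while remaining:
        batch, last = [], None
        for j in remaining:
            if last is None or j - last >= stride:
                batch.append(j)
                last = j
        # ring wrap-around: first and last member must also be far apart
        if len(batch) > 1 and (m - batch[-1]) + batch[0] < stride:
            batch.pop()
        batches.append(batch)
        remaining = [j for j in remaining if j not in batch]
    return batches
```

For example, parallel_batches(12, 1) returns [[0, 4, 8], [1, 5, 9], [2, 6, 10], [3, 7, 11]]: each batch can be processed concurrently, and a full sweep is completed batch by batch.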
Recall that our SDE model (4) finds its origin in stochastic partial differential equations (PDEs) such as the reaction-diffusion equation. With this in mind, our local computation scheme is closely connected to the PDE parallel computation scheme called "domain decomposition". Both methods partition the computation domain into smaller local domains and try to do the computation only in the local domains. The difference is that domain decomposition is mostly applied to linear PDEs, and its execution relies on manipulations of linear algebra. The local computation scheme introduced here is for inverse problems involving SDE formulations, which can be nonlinear.

4 Numerical experiments

4.1 The Lorenz 96 model

The Lorenz 96 model was introduced in [16] as a simplified description of equatorial oceanic flows. It is commonly used in the testing of high dimensional data assimilation methods (see e.g. [21, 8, 14, 3]). The component x_j(t), for j = 1, ..., n, is governed by the following ODE:

dx_j(t)/dt = −x_{j−2}(t) x_{j−1}(t) + x_{j−1}(t) x_{j+1}(t) − x_j(t) + 8, t ∈ [0, T],  (24)

where we let x_{−1}(t) = x_{n−1}(t), x_0(t) = x_n(t), x_{n+1}(t) = x_1(t). The solution of the Lorenz 96 model has an equilibrium distribution, which can be obtained by long-time simulation. We use a Gaussian approximation of it as the prior distribution, of which the mean vector and covariance matrix are obtained by the localization procedure described in [20].

We consider solving the inverse problem formulated as (5), where the underlying model (4) is given by (24) with T = 0.4.
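A minimal sketch of the Lorenz 96 right-hand side (24); it can be advanced with a standard RK4 update such as the one sketched in Section 3.4, using F(t, x) = lorenz96_rhs(x) (names illustrative).

```python
import numpy as np

def lorenz96_rhs(x, forcing=8.0):
    """Right-hand side of (24):
    dx_j/dt = -x_{j-2} x_{j-1} + x_{j-1} x_{j+1} - x_j + 8,
    with periodic indexing handled by np.roll."""
    return (-np.roll(x, 2) * np.roll(x, 1)
            + np.roll(x, 1) * np.roll(x, -1)
            - x + forcing)
```

With block size b = 2, this drift is nearest-neighbor in the blocks, matching the form (4).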
It can be verified that the spatial interaction of this model is local in the form of (4), with a minimum choice of block size b = 2. We consider observations of every other component, so H is an (n/2) × n matrix with H_{i,2i−1} = 1 for i = 1, ..., n/2 and all other entries 0. For the noise term ξ, we assume ξ ∼ N(0, I_{n/2}). The MwG sampler and the a-MwG sampler are each applied to draw K posterior samples {x^k : k = 1, ..., K}. We implement the experiments for (24) in dimensions n = 40 and n = 400. We use RK4 with time step h = 0.01 for both MwG and a-MwG.
The choices of the block size b and the local radius L play important roles in our a-MwG sampling algorithm. The length b should be larger than the bandwidth of the prior covariance; this is discussed in detail in [20]. As for L, from Theorems 1-2 we see that a smaller value leads to a larger error for the a-MwG sampler but faster computation. While Theorems 1-2 give theoretical lower bounds of L that control the error, they might be pessimistic in practice. A better way is to simulate the proposal and rejection process within a-MwG, compare it with MwG, and choose L so that the proposal acceptance rate and the ODE state at T are close for the two samplers.

Specifically, with a set of fixed parameters n, b and L, we generate one x^o from the prior distribution as in Algorithm 1. We replace the j-th block (we fix j = 1 for simplicity) of x^o by the proposal x^p_j and obtain x^p. Recall that in MwG, the ODE is computed through x^p(t ≤ T) = Φ(x^p, t ≤ T), and the acceptance probability α is given by (9). In a-MwG, we approximate the ODE solution with x^l(T) = Φ_l(x^p, t, x^o(t ≤ T)), and the acceptance rate is α′ := min{1, p̂_p/p̂_o} as in Algorithm 1. Based on these quantities, the following errors can serve as possible standards for choosing a proper L:

Err-Φ = E_{x^o∼p} ‖x^l(T) − x^p(T)‖_∞ / E_{x^o∼p} ‖x^p(T)‖_∞, and Err-α = E_{x^o∼p} |α − α′|.

We note that Err-Φ is the relative error between x^l(T) and x^p(T), while Err-α depicts their influence on the acceptance rate. We use 500 independent samples x^o from p to approximate the averages, and list the resulting errors in Table 1. From the table, we see that a choice of L = 4 (L = 2) for b = 2 (b = 4) is enough to keep the relative errors under 3%, both in the sense of the acceptance rate α and of the solution Φ. Also note that all the error terms decrease as L increases. This verifies our previous theoretical analyses.

Table 1: Err-α and Err-Φ for different block sizes b and radii L, under model (24).

           |       n=40        |       n=400
 b   L     | Err-α     Err-Φ   | Err-α     Err-Φ
 2   1     | 0.1858    0.3361  | 0.1531    0.2940
 2   2     | 0.0819    0.1855  | 0.0857    0.1486
 2   4     | 0.0134    0.0235  | 0.0043    0.0194
 4   1     | 0.0817    0.1969  | 0.0548    0.1599
 4   2     | 0.0119    0.0297  | 0.0136    0.0209

Although we have the criterion above to pick L, below we still test other choices of L for the purpose of comparison. After implementing the MwG and a-MwG sampling algorithms, the mean squared error (MSE) and mean sample variance (MSV) are calculated from the generated posterior samples:

MSE = (1/n) Σ_{j=1}^n ‖x̄_j − x_j‖², and MSV = (1/(n(K − k₀))) Σ_{k=k₀+1}^K Σ_{j=1}^n ‖x^k_j − x̄_j‖²,

where x̄ = (1/(K − k₀)) Σ_{k=k₀+1}^K x^k, and the first k₀ samples are discarded as burn-in. A sketch of these diagnostics is given below.
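In this sketch, `samples` is a (K, n) array of MCMC iterates, `x_true` the true initial condition, and `k0` the burn-in size; all names are illustrative.

```python
import numpy as np

def mcmc_diagnostics(samples, x_true, k0):
    """Compute MSE and MSV from posterior samples after discarding burn-in."""
    kept = samples[k0:]                    # samples k0+1, ..., K
    x_bar = kept.mean(axis=0)              # posterior mean estimate
    mse = np.mean((x_bar - x_true) ** 2)   # (1/n) sum_j ||x_bar_j - x_j||^2
    msv = np.mean((kept - x_bar) ** 2)     # sample variance averaged over components
    return mse, msv
```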
Besides these, the average acceptance rate (AR) and the total computation time (CT) for generating the samples are also collected. The CT is measured on a 2018 MacBook Pro with a 2.2 GHz Intel Core i7 processor. The related parameters are set as K = 100000 and k₀ = 10000, with the corresponding simulation results presented in Table 2. We see that our a-MwG algorithm samples the posterior distribution correctly, yielding MSE and MSV results similar to those of MwG, and the acceleration effect is significant. Its acceptance rate is also similar to that of the MwG sampler, in particular for larger L.

Table 2: AR, CT, MSE, MSV for 100000 samples drawn from the MwG(b) and a-MwG(b, L) samplers, where b, L are the block size and radius length.

             |             n=40                |              n=400
             | AR      CT    MSE     MSV      | AR      CT     MSE      MSV
MwG(2)       | 0.1343  2080  8.6902  9.0255   | 0.1149  27334  9.4972   10.2152
a-MwG(2,1)   | 0.1254  1440  9.5906  9.7888   | 0.1014  12798  10.4869  12.5727
a-MwG(2,2)   | 0.1310  1444  8.9625  8.3956   | 0.1090  12830  9.8387   10.5445
a-MwG(2,4)   | 0.1347  1492  9.2899  8.1608   | 0.1152  13194  9.5599   9.8135
MwG(4)       | 0.0389  1041  9.1626  8.2137   | 0.0273  13593  10.4764  9.5827
a-MwG(4,1)   | 0.0388  724   9.2161  8.3851   | 0.0283  6460   9.6107   9.9683
a-MwG(4,2)   | 0.0388  749   8.6808  8.8641   | 0.0280  6634   9.9938   9.3718

We also show samples generated by the two samplers MwG(4) and a-MwG(4,2) as a demonstration of the posterior distribution for the case n = 400 in Figure 1. As a contrast, the same number of prior samples and the true state are also shown. We see that the two sampling algorithms perform similarly. As for the computation time specifically, we document its change as the data dimension varies from 40 to 1600 for MwG(4) and a-MwG(4,2) in Figure 2. It can roughly be seen that the computation cost of a-MwG increases linearly, compared with a quadratic increase for MwG. This verifies our previous theoretical analysis.

[Figure 2: computation time versus data dimension for MwG(4) and a-MwG(4,2).]

4.2 Linear stochastic flow model

We study the following one-dimensional linearized stochastically forced dissipative advection equation from [17] (Section 6.3) and [25]:

∂f(x,t)/∂t = w ∂f(x,t)/∂x − ν f(x,t) + µ ∂²f(x,t)/∂x² + σ_x ∂W(x,t)/∂t, t ∈ [0, T],  (25)

where W(x,t) is white noise in both time and space. Applying the centered difference formula with spatial grid size l transforms (25) into a time-continuous linear stochastic system. The n-dimensional state x(t) = [x_1(t), ..., x_n(t)]^T follows the SDE below for j = 1, ..., n:

dx_j(t) = (a x_{j−1}(t) + b x_j(t) + c x_{j+1}(t)) dt + σ_x dW_j(t),
x_0(t) = x_n(t), x_{n+1}(t) = x_1(t),  (26)

with

a = µ/l² − w/(2l), b = −2µ/l² − ν, c = µ/l² + w/(2l),

where W_j(t) is a standard Brownian motion. We consider a regime with strong advection and weak damping by setting l = 0. , µ = 0. , ν = 0. , w = 2, σ_x ≡ 0. .

We follow similar steps as in the previous example to study the performance of our a-MwG sampling algorithm under this linear stochastic model. The only difference is the additional involvement of the stochastic factors W^(c) for c = 1, ..., S = 100, which are generated in advance. For simplicity, we consider a prior distribution p(x) = N(0, I_n). We set ξ ∼ 0. · N(0, I_n), T = 0. , and h = 0.01; other parameters remain the same as in the previous example, if not specified. A sketch of the pseudo-marginal likelihood estimate (7) for this model, with pregenerated Brownian increments, is given below.
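This sketch estimates log p̂_l(y|x₀) via Euler–Maruyama over the S fixed Gaussian increments; the coefficients a, b, c and observation quantities are assumed given, and all names are illustrative.

```python
import numpy as np

def log_p_hat(x0, y, H, R_inv, a, b, c, sigma_x, h, n_steps, W_all):
    """Monte Carlo estimate of log p_l(y|x0) as in (7), for model (26).

    W_all: fixed increments of shape (S, n_steps, n); reusing them across
    proposals is what makes the pseudo-marginal chain well defined.
    """
    log_terms = []
    for W in W_all:                        # loop over the S Brownian realizations
        x = x0.copy()
        for i in range(n_steps):           # Euler-Maruyama for (26)
            drift = a * np.roll(x, 1) + b * x + c * np.roll(x, -1)
            x = x + drift * h + sigma_x * np.sqrt(h) * W[i]
        r = y - H @ x
        log_terms.append(-0.5 * r @ (R_inv @ r))
    # log-mean-exp for numerical stability
    log_terms = np.array(log_terms)
    mx = log_terms.max()
    return mx + np.log(np.mean(np.exp(log_terms - mx)))
```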
One computational issue arises when we evaluate the a-MwG acceptance probability. When the associated dimension n is large, the sample likelihood p̂_l(y|x) is close to zero, so directly evaluating the acceptance rate α = min{1, p̂_p/p̂_o} in Algorithm 1 may lead to numerical singularity. This problem does not occur for the deterministic Lorenz 96 model, since p_l(y|x) needs no approximation there, and the difference of the log-likelihoods is explicitly accessible.

To resolve this issue, we note the pseudo-marginal likelihood ratio is

p̂_p / p̂_o = [Σ_{c=1}^S exp(−½‖y − H · x^l(T, c)‖²_R)] / [Σ_{c=1}^S exp(−½‖y − H · x^o(T, c)‖²_R)].  (27)

Since x^l(T, c) differs from x^o(T, c) only for blocks inside the local domain B_{i⋆}, the entries outside B_{i⋆} provide no information for the sampling of block x_{i⋆}. Thus, dropping those entries causes no significant difference in the evaluation of (27). In other words, we approximate (27) with a local version

p̂_p / p̂_o ≈ [Σ_{c=1}^S exp(−½‖y_{B′_{i⋆}} − H_{i⋆} · x^l_{B_{i⋆}}(T, c)‖²_{R_{i⋆}})] / [Σ_{c=1}^S exp(−½‖y_{B′_{i⋆}} − H_{i⋆} · x^o_{B_{i⋆}}(T, c)‖²_{R_{i⋆}})].  (28)

Here B′_{i⋆} = {i : 2i − 1 ∈ B_{i⋆}} are the observed indices in B_{i⋆}, H_{i⋆} consists of the B′_{i⋆} rows and B_{i⋆} columns of H, and R_{i⋆} consists of the B′_{i⋆} rows and columns of R. Formulation (28) avoids the computational singularity issue that may exist in (27), since the associated variables are no longer high dimensional.

Note that using (28) is equivalent to localizing the observation matrix as in [20], Section 2.2. When (4) is a linear SDE, y is a linear noisy observation of x(T, c), and the bias caused by such localization is discussed in [20], Section 2.3. We expect similar bias control is also possible for general SDEs, while the exact verification is beyond the scope of this paper. Also note that the local domain B_{i⋆} in (28) does not need to be the same as the one used for the SDE computation. For this test example, we use 20 entries centered around the perturbed block. A minimal sketch of the localized ratio (28) is given below.
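In this sketch of (28) in log form, `xl_T` and `xo_T` are the surrogate and current states at time T for the S realizations, and the index sets follow the definitions above; all names are illustrative.

```python
import numpy as np

def local_log_ratio(xl_T, xo_T, y, H, R, obs_idx, state_idx):
    """Log of (28): restrict y, H, R to the local observed/state indices.

    xl_T, xo_T: arrays of shape (S, n). obs_idx = B'_{i*}, state_idx = B_{i*}.
    """
    H_loc = H[np.ix_(obs_idx, state_idx)]
    R_inv_loc = np.linalg.inv(R[np.ix_(obs_idx, obs_idx)])
    y_loc = y[obs_idx]

    def log_sum(states):
        r = y_loc[None, :] - states[:, state_idx] @ H_loc.T
        q = -0.5 * np.einsum('si,ij,sj->s', r, R_inv_loc, r)
        m = q.max()
        return m + np.log(np.sum(np.exp(q - m)))

    return log_sum(xl_T) - log_sum(xo_T)   # log( p_hat_p / p_hat_o )
```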
Before applying the a-MwG sampler to solve the inverse problem, we use the same procedure described in the previous example to choose a parameter L of the proper scale. The only difference is that all quantities depend on realizations of the Brownian motions, and it is necessary to average over S = 100 simulations. We see from Table 3 that a minimum choice of L = 6 (L = 3) for b = 2 (b = 4) can even control the relative errors under 1 percent. Finite sample performances of the posterior distribution using the MwG and a-MwG samplers are presented in Table 4 and Figure 3, from which we can see that our sampling algorithm also works well in the stochastic setting. Figure 4 shows an evident linear relationship between the data dimension and the computation time for a-MwG, and a quadratic one for MwG. We note this is because, for each proposal, the realization of (26) is repeated for S = 100 standard Brownian motion paths W, dominating the total computation of the sampling algorithms and making the relationships clearer.

Table 3: Err-α and Err-Φ for different block sizes b and radii L, under model (26).

           |       n=40         |       n=400
 b   L     | Err-α     Err-Φ    | Err-α     Err-Φ
 2   2     | 0.0289    0.2200   | 0.0856    0.1407
 2   4     | 2.1e-04   0.0040   | 2.2e-04   0.0027
 4   1     | 0.0846    0.2189   | 0.0221    0.1604
 4   2     | 1.3e-04   0.0043   | 1.7e-04   0.0030

Table 4: AR, CT, MSE, MSV for 10000 samples drawn from the MwG(b) and a-MwG(b, L) samplers, where b, L are the block size and radius length. The exact posterior mean variance is listed at the end for comparison.

             |             n=40               |              n=400
             | AR      CT    MSE     MSV     | AR      CT     MSE     MSV
MwG(2)       | 0.1342  2418  0.5953  0.5567  | 0.1494  40617  0.5799  0.5413
a-MwG(2,2)   | 0.1280  2052  0.6382  0.5708  | 0.1388  20674  0.6025  0.5623
a-MwG(2,4)   | 0.1344  2077  0.6009  0.5302  | 0.1489  21256  0.5871  0.5505
MwG(4)       | 0.0367  1193  0.5854  0.5492  | 0.0449  21369  0.6004  0.5470
a-MwG(4,1)   | 0.0343  1007  0.5905  0.5244  | 0.0423  10511  0.6077  0.5417
a-MwG(4,2)   | 0.0353  1043  0.5903  0.5396  | 0.0451  10685  0.5907  0.5499

[Figure 3: posterior samples. Figure 4: computation time versus data dimension.]

In fact, for this linear model, the posterior distribution of x given y can be derived explicitly. To derive it, we rewrite (26) in the following vector form:

d x(t) = M x(t) dt + σ_x d W(t),  (29)

where M is an n × n matrix with

M_{1,1} = b, M_{1,2} = c, M_{1,n} = a,
M_{j,j−1} = a, M_{j,j} = b, M_{j,j+1} = c, for j = 2, ..., n − 1,
M_{n,1} = c, M_{n,n} = b, M_{n,n−1} = a,

and all other entries 0. According to Duhamel's formula, we have

x(T) = e^{MT} x(0) + e^{MT} ∫_0^T e^{−Mt} σ_x d W(t).

The observation model can therefore be written as

y = H e^{MT} x(0) + H e^{MT} ∫_0^T e^{−Mt} σ_x d W(t) + ξ.  (30)

Recall that x(0) ∼ N(0_{n×1}, Σ_pri) and ξ ∼ N(0_{n′×1}, R). It is known that, for a linear observation with Gaussian prior distribution and noise, the posterior distribution is also Gaussian, and the posterior mean m_pos and covariance matrix Σ_pos can be written as ([18]):

Σ_pos = (Σ_pri^{−1} + (H e^{MT})^T (M′ + R)^{−1} H e^{MT})^{−1},
m_pos = Σ_pos (H e^{MT})^T (M′ + R)^{−1} y,

where

M′ = H e^{MT} · ∫_0^T σ_x² (e^{−Mt})^T e^{−Mt} dt · (H e^{MT})^T.

5 Conclusion

In this paper, we discuss how to infer the initial conditions of high dimensional SDEs with local interactions, when noisy observations are available. Previous research [20] has shown that Metropolis-within-Gibbs (MwG) sampling has a dimension independent convergence rate. However, each MwG iteration requires a computation cost of O(n²), with n being the model dimension. The main contribution of this paper is a reduced computation scheme that accelerates the MwG implementation. We observe that in each updating step, the MwG proposal differs from the original iterate at only one parameter block, and the main difference it causes also takes place near that block. Solving the SDEs within a local domain near this block, instead of doing the global computation for all entries, involves only O(1) computation. Our accelerated algorithm, the a-MwG sampler, is obtained by integrating this approximate local computation into MwG sampling. The computation cost of each a-MwG iterate is O(n) in total. We derive rigorous bounds for the approximation errors and show that they can be controlled if the local domain size is chosen larger than a constant threshold. We also discuss how to implement our local computation using the Euler–Maruyama and 4th order Runge–Kutta schemes. These methods are applied in our experimental studies of the Lorenz 96 and a linear stochastic flow model. Numerical results show that our sampling algorithm can be greatly accelerated without much loss of accuracy.

Acknowledgements

This work is supported by Singapore MOE-AcRF grants R-146-000-258-114 and R-146-000-292-114.
Appendix
Auxiliary lemmas and their proofs
Lemma 1 An n × n matrix A is called circulant tridiagonal if A_{j,i} = 0 for all pairs (j, i) such that d(j, i) > 1 (recall d(j, i) = min{|j − i|, |j + n − i|, |i + n − j|}). For such a matrix, we have

|(e^A)_{j,i}| ≤ e^{−C·d(j,i)} · e^{(e^C + e^{−C} + 1)ρ}, ∀ j, i = 1, 2, ..., n,

where ρ = max_{j,i=1,2,...,n} |A_{j,i}| and C is any fixed positive constant.

Remark 3 If A is further a tridiagonal matrix, we have

|(e^A)_{j,i}| ≤ e^{−C·|j−i|} · e^{(e^C + e^{−C} + 1)ρ}, ∀ j, i = 1, 2, ..., n.
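A minimal numerical check of Lemma 1 (illustrative, not part of the proof): sample a random circulant tridiagonal A and verify the entrywise bound on e^A.

```python
import numpy as np
from scipy.linalg import expm

def check_lemma1(n=20, C=1.0, seed=0):
    """Numerically verify the entrywise bound of Lemma 1 for one random
    circulant tridiagonal matrix A."""
    rng = np.random.default_rng(seed)
    A = np.zeros((n, n))
    for j in range(n):
        for i in (j - 1, j, j + 1):
            A[j, i % n] = rng.uniform(-1, 1)
    rho = np.abs(A).max()
    diff = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
    d = np.minimum(diff, n - diff)   # ring distance d(j, i)
    bound = np.exp(-C * d) * np.exp((np.exp(C) + np.exp(-C) + 1) * rho)
    return bool(np.all(np.abs(expm(A)) <= bound))
```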
Proof: First, by mathematical induction we prove that, for any positive constant C,

|(A^m)_{j,i}| ≤ (e^C + e^{−C} + 1)^m · ρ^m · e^{−C·d(j,i)}, for m = 0, 1, ....  (31)

The conclusion holds for m = 0, since A⁰ = I_n, whose diagonal entries equal e^{−C·d(j,j)} = 1 and whose off-diagonal entries are bounded by e^{−C·d(j,i)} > 0. Suppose the result is true for m = k, namely |(A^k)_{j,i}| ≤ (e^C + e^{−C} + 1)^k ρ^k e^{−C·d(j,i)}. Then, for m = k + 1, we observe that

(A^{k+1})_{j,i} = (A^k)_{j,i−1} A_{i−1,i} + (A^k)_{j,i} A_{i,i} + (A^k)_{j,i+1} A_{i+1,i},  (32)

with (A^k)_{j,0} = (A^k)_{j,n}, (A^k)_{j,n+1} = (A^k)_{j,1}, A_{0,i} = A_{n,i}, and A_{n+1,i} = A_{1,i}. Since |A_{j,i}| ≤ ρ, together with (31) for m = k, (32) gives us

|(A^{k+1})_{j,i}| ≤ |(A^k)_{j,i−1}| |A_{i−1,i}| + |(A^k)_{j,i}| |A_{i,i}| + |(A^k)_{j,i+1}| |A_{i+1,i}|
≤ (e^C + e^{−C} + 1)^k ρ^{k+1} (e^{−C·d(j,i−1)} + e^{−C·d(j,i)} + e^{−C·d(j,i+1)}).  (33)

We claim that

e^{−C·d(j,i−1)} + e^{−C·d(j,i)} + e^{−C·d(j,i+1)} ≤ (e^C + e^{−C} + 1) e^{−C·d(j,i)},  (34)

which will be proved later. Combining it with (33), we obtain

|(A^{k+1})_{j,i}| ≤ (e^C + e^{−C} + 1)^{k+1} ρ^{k+1} e^{−C·d(j,i)},  (35)

which finishes the proof of (31). The claim (34) can be proved by a case-by-case discussion on the ring distance. If d(j, i) is smaller than the maximal possible distance, the two neighboring distances satisfy d(j, i − 1), d(j, i + 1) ∈ {d(j, i) − 1, d(j, i) + 1}, so

e^{−C·d(j,i−1)} + e^{−C·d(j,i)} + e^{−C·d(j,i+1)} ≤ e^{−C·(d(j,i)−1)} + e^{−C·d(j,i)} + e^{−C·(d(j,i)+1)} = (e^C + e^{−C} + 1) e^{−C·d(j,i)}.

If d(j, i) attains the maximal distance (k for n = 2k, or k − 1 for n = 2k − 1, k ∈ N⁺), the neighboring distances are at least d(j, i) − 1, and the bound is checked directly in the same way for n even and n odd. Finally,

|(e^A)_{j,i}| = |Σ_{m=0}^∞ (A^m)_{j,i} / m!| ≤ Σ_{m=0}^∞ |(A^m)_{j,i}| / m! ≤ Σ_{m=0}^∞ (e^C + e^{−C} + 1)^m ρ^m e^{−C·d(j,i)} / m! = e^{(e^C + e^{−C} + 1)ρ} · e^{−C·d(j,i)}.

The proof is complete. □
Lemma 2 (Gronwall’s inequality)
Given an n × n matrix M whose entries are positive, if two n × 1 vector functions x(t), b(t) satisfy

M · x(t) + b(t) ⪰ dx(t)/dt, for t ∈ [0, T],  (36)

(recall ⪰ is the entrywise inequality defined in Section 1.4), then

e^{MT} · x(0) + ∫_0^T e^{M(T−t)} · b(t) dt ⪰ x(T).  (37)
Proof: Consider the n-dimensional column vector y(t) defined by

dy(t)/dt = M · y(t) + b(t), for t ∈ [0, T],

with initial value y(0) = x(0) + ε · 1_{n×1}, where ε is a positive constant. Its solution can be written as

y(T) = e^{MT} · (x(0) + ε · 1_{n×1}) + ∫_0^T e^{M(T−t)} · b(t) dt.

Define z(t) = y(t) − x(t) for t ∈ [0, T]; then z(0) = ε · 1_{n×1} ⪰ 0_{n×1}. And according to (36), we have

dz(t)/dt = dy(t)/dt − dx(t)/dt = M · y(t) + b(t) − dx(t)/dt ⪰ M · (y(t) − x(t)) = M · z(t).

We define

τ = inf{t : there exists at least one index i such that z_i(t) < 0},

and note that τ > 0 by the continuity of z(t). According to the definition, we have z(t) ⪰ 0_{n×1} for t ∈ [0, τ). Together with the fact that all entries of M are positive, we have dz(t)/dt ⪰ M · z(t) ⪰ 0_{n×1} for t ∈ [0, τ). This implies that z(t) is entrywise nondecreasing on [0, τ), and thus z(t) ⪰ z(0) = ε · 1_{n×1} for t ∈ [0, τ); namely,

e^{Mt} · (x(0) + ε · 1_{n×1}) + ∫_0^t e^{M(t−s)} · b(s) ds − x(t) ⪰ ε · 1_{n×1}.  (38)

We next show that τ > T, by contradiction. Suppose τ ≤ T. From the definition of τ and the previous analysis, we know that z(t) ⪰ ε · 1_{n×1} for t ∈ [0, τ), and there exists at least one index, say i, such that z_i(τ) < 0. Thus lim_{t→τ−} z_i(t) − z_i(τ) ≥ ε > 0. This contradicts the continuity of z(t) on [0, T], and gives τ > T. The conclusion (37) is obtained by letting ε → 0 in (38) with t = T, which is valid since τ > T. □
Lemma 3 For two n × n matrices M and N whose entries are positive, if

M ⪯ N,  (39)

then

e^M ⪯ e^N.  (40)
Proof: By mathematical induction, we first prove

M^k ⪯ N^k, for k = 0, 1, ....  (41)

The result is true for k = 0, since M⁰ = N⁰ = I_n. Suppose the result is true for k = K, namely

M^K ⪯ N^K.  (42)

Then for k = K + 1, since

(M^{K+1})_{i,j} = Σ_{s=1}^n (M^K)_{i,s} M_{s,j}, (N^{K+1})_{i,j} = Σ_{s=1}^n (N^K)_{i,s} N_{s,j}, for i, j = 1, ..., n,

the result (M^{K+1})_{i,j} ≤ (N^{K+1})_{i,j} follows from (39) and (42). According to the definition of the matrix exponential,

e^M − e^N = Σ_{k=0}^∞ (1/k!)(M^k − N^k),

and the conclusion (40) follows from (41). □
Lemma 4 (Volterra's inequality) Given an n × n matrix M whose entries are positive, if two n × 1 vector functions x(t), b(t) satisfy

x(t) ⪯ b(t) + ∫_0^t M · x(s) ds, for t ∈ [0, T],  (43)

then

x(T) ⪯ b(T) + ∫_0^T e^{M(T−t)} · M · b(t) dt.  (44)

Proof: We define y(t) = ∫_0^t M · x(s) ds for t ∈ [0, T]; then (43) gives us

x(t) ⪯ b(t) + y(t), for t ∈ [0, T].  (45)

And we observe that

dy(t)/dt = M · x(t) ⪯ M · b(t) + M · y(t),

with y(0) = 0. According to Lemma 2, we have

y(T) ⪯ ∫_0^T e^{M(T−t)} · M · b(t) dt.  (46)

The conclusion (44) follows from (45) with t = T and (46). □

Main lemmas and their proofs
Lemma 5
Under Assumption 1, with x^o(t), x^p(t) for t ∈ [0, T] defined in Section 3.1, we have, for j = 1, ..., m,

E[‖x^o_j(t) − x^p_j(t)‖²] ≤ ‖x^o_{i⋆} − x^p_{i⋆}‖² · e^{C₁(f,σ)t} · e^{−C_d·d(j,i⋆)},  (47)

where C_d is any given positive constant and C₁(f, σ) is defined in (11).
Proof: Recall that x^o_j(t), x^p_j(t) are solutions of

d x_j(t) = f_j(t, x_{j−1}(t), x_j(t), x_{j+1}(t)) dt + σ_j(t, x_j(t)) d W_j(t), for j = 1, ..., m,
x_0(t) = x_m(t), x_{m+1}(t) = x_1(t), t ∈ [0, T],

with corresponding initial values x^o_j(0), x^p_j(0), which differ only at the i⋆-th block. Thus we have

d(x^o_j(t) − x^p_j(t)) = ∆f_j(t) dt + ∆σ_j(t) d W_j(t),

with

∆f_j(t) = f_j(t, x^o_{j−1}(t), x^o_j(t), x^o_{j+1}(t)) − f_j(t, x^p_{j−1}(t), x^p_j(t), x^p_{j+1}(t)),
∆σ_j(t) = σ_j(t, x^o_j(t)) − σ_j(t, x^p_j(t)).

According to Itô's formula,

d‖x^o_j(t) − x^p_j(t)‖² = (2(x^o_j(t) − x^p_j(t))^T ∆f_j(t) + ‖∆σ_j(t)‖²) dt + 2(x^o_j(t) − x^p_j(t))^T ∆σ_j(t) d W_j(t).  (48)

Its solution can be written as

‖x^o_j(t) − x^p_j(t)‖² − ‖x^o_j(0) − x^p_j(0)‖² = ∫_0^t (2(x^o_j(s) − x^p_j(s))^T ∆f_j(s) + ‖∆σ_j(s)‖²) ds + ∫_0^t 2(x^o_j(s) − x^p_j(s))^T ∆σ_j(s) d W_j(s).

Taking expectations on both sides, and applying Assumption 1 and the Cauchy-Schwarz inequality, we have

E[‖x^o_j(t) − x^p_j(t)‖²] − ‖x^o_j(0) − x^p_j(0)‖²
= ∫_0^t E[2(x^o_j(s) − x^p_j(s))^T ∆f_j(s) + ‖∆σ_j(s)‖²] ds
≤ ∫_0^t E[‖x^o_j(s) − x^p_j(s)‖²] ds + ∫_0^t E[‖∆f_j(s)‖² + ‖∆σ_j(s)‖²] ds
≤ C_f ∫_0^t E[‖x^o_{j−1}(s) − x^p_{j−1}(s)‖²] ds + (C_f + C_σ + 1) ∫_0^t E[‖x^o_j(s) − x^p_j(s)‖²] ds + C_f ∫_0^t E[‖x^o_{j+1}(s) − x^p_{j+1}(s)‖²] ds.  (49)

Equivalently, we can write (49) in the vector form

Δ(t) ⪯ ∫_0^t M · Δ(s) ds + Δ(0),

where Δ(t) is the m × 1 vector whose j-th entry is E[‖x^o_j(t) − x^p_j(t)‖²], and M is the m × m matrix with

M_{1,1} = C_σ + C_f + 1, M_{1,2} = M_{1,m} = C_f,
M_{j,j} = C_σ + C_f + 1, M_{j,j+1} = M_{j,j−1} = C_f, for j = 2, ..., m − 1,
M_{m,m} = C_σ + C_f + 1, M_{m,1} = M_{m,m−1} = C_f,

and all other entries 0; moreover, Δ_{i⋆}(0) = ‖x^o_{i⋆} − x^p_{i⋆}‖² and the other entries of Δ(0) are 0. According to Lemma 4, we have

Δ(t) ⪯ Δ(0) + ∫_0^t e^{M(t−s)} · M · Δ(0) ds.

We observe that, for the m-dimensional column vector M · Δ(0), its (i⋆ − 1)-th, i⋆-th, (i⋆ + 1)-th entries are C_f ‖x^o_{i⋆} − x^p_{i⋆}‖², (C_f + C_σ + 1)‖x^o_{i⋆} − x^p_{i⋆}‖², C_f ‖x^o_{i⋆} − x^p_{i⋆}‖² respectively, while the other entries are 0. Together with Lemma 1 for C = C_d, max_{j,i} |M_{j,i}| = (C_f + C_σ + 1), (34), and the definition C₁(f, σ) = (e^{C_d} + e^{−C_d} + 1)(C_f + C_σ + 1), we have

Δ_j(t) ≤ Δ_j(0) + ∫_0^t |(e^{M(t−s)})_{j,i⋆−1}| (M·Δ(0))_{i⋆−1} ds + ∫_0^t |(e^{M(t−s)})_{j,i⋆}| (M·Δ(0))_{i⋆} ds + ∫_0^t |(e^{M(t−s)})_{j,i⋆+1}| (M·Δ(0))_{i⋆+1} ds
≤ Δ_j(0) + (C_f + C_σ + 1) ‖x^o_{i⋆} − x^p_{i⋆}‖² ∫_0^t e^{C₁(f,σ)(t−s)} ds · (e^{−C_d·d(j,i⋆−1)} + e^{−C_d·d(j,i⋆)} + e^{−C_d·d(j,i⋆+1)})
≤ Δ_j(0) + ‖x^o_{i⋆} − x^p_{i⋆}‖² (e^{C₁(f,σ)t} − 1) e^{−C_d·d(j,i⋆)}
≤ ‖x^o_{i⋆} − x^p_{i⋆}‖² e^{C₁(f,σ)t} e^{−C_d·d(j,i⋆)},

where the last inequality is derived by considering two cases: when j = i⋆, Δ_j(0) = ‖x^o_{i⋆} − x^p_{i⋆}‖² and d(j, i⋆) = 0, so the result holds; when j ≠ i⋆, Δ_j(0) = 0, and the result can also be verified. Recalling that Δ_j(t) = E[‖x^o_j(t) − x^p_j(t)‖²], the proof is complete. □
Lemma 6 Under Assumption 1, with x^o(t), x^l(t), x^p(t) defined for t ∈ [0, T] in Sections 3.1 and 3.2, we have, for j = 1, ..., m,

E[‖x^l_j(t) − x^p_j(t)‖²] ≤ C₂(f, σ) ‖x^o_{i⋆} − x^p_{i⋆}‖² e^{2C₁(f,σ)t} e^{−C_d·(L+1)},  (50)

where C_d is any given positive constant and C₁(f, σ), C₂(f, σ) are defined in (11) and (14).
For simplicity, we consider L + 2 ≤ i ⋆ ≤ n − L −
1, which does not sacrifice any generality because we can always rotate the indices to make it true. So $B_{i_\star} = \{i_\star - L, \ldots, i_\star + L\}$.

Suppose $j \in B_{i_\star}^c$. Since $x^l_j(t) = x^o_j(t)$ and $d(j, i_\star) \ge L+1$, (50) follows from the conclusion of Lemma 5.

Suppose $j \in B_{i_\star}$. We observe that both $x^l_j(t)$ and $x^p_j(t)$ follow the evolutionary system
\[
dx_j(t) = f_j(t, x_{j-1}(t), x_j(t), x_{j+1}(t))\,dt + \sigma_j(t, x_j(t))\,dW_j(t), \quad t \in [0, T],
\]
with initial value $(x^o_{i_\star - L}, \ldots, x^o_{i_\star - 1}, x^p_{i_\star}, x^o_{i_\star + 1}, \ldots, x^o_{i_\star + L})^T$. But when $j \in B_{i_\star}^c$, it is restricted that $x^l_j(t) = x^o_j(t)$, where $x^o(t)$ is the solution of (4) with initial value $(x^o_1, \ldots, x^o_m)^T$. Recall that $x^p(t)$ also solves (4), but with a locally perturbed initial value, which can be written as $(x^o_1, \ldots, x^o_{i_\star - 1}, x^p_{i_\star}, x^o_{i_\star + 1}, \ldots, x^o_m)^T$.

Thus, for $i_\star - L \le j \le i_\star + L$, we have
\[
d(x^l_j(t) - x^p_j(t)) = \Delta f_j(t)\,dt + \Delta\sigma_j(t)\,dW_j(t),
\]
with
\[
\Delta f_j(t) = f_j(t, x^l_{j-1}(t), x^l_j(t), x^l_{j+1}(t)) - f_j(t, x^p_{j-1}(t), x^p_j(t), x^p_{j+1}(t)), \qquad
\Delta\sigma_j(t) = \sigma_j(t, x^l_j(t)) - \sigma_j(t, x^p_j(t)).
\]
By Ito's formula,
\[
d\|x^l_j(t) - x^p_j(t)\|^2 = \big(2(x^l_j(t) - x^p_j(t))^T \Delta f_j(t) + \|\Delta\sigma_j(t)\|^2\big)\,dt + 2(x^l_j(t) - x^p_j(t))^T \Delta\sigma_j(t)\,dW_j(t). \tag{51}
\]
The relation (51) is similar to (48); solving it and applying (49), we obtain
\[
E[\|x^l_j(t) - x^p_j(t)\|^2] - \|x^l_j(0) - x^p_j(0)\|^2
\le C_f \int_0^t E[\|x^l_{j-1}(s) - x^p_{j-1}(s)\|^2]\,ds
+ (C_f + C_\sigma + 1) \int_0^t E[\|x^l_j(s) - x^p_j(s)\|^2]\,ds
+ C_f \int_0^t E[\|x^l_{j+1}(s) - x^p_{j+1}(s)\|^2]\,ds. \tag{52}
\]
We define the $(2L+1)$-dimensional column vector
\[
\vec{\Delta}(t) = \big(E[\|x^l_{i_\star - L}(t) - x^p_{i_\star - L}(t)\|^2], \ldots, E[\|x^l_{i_\star + L}(t) - x^p_{i_\star + L}(t)\|^2]\big)^T,
\]
whose $j$-th element is $\vec{\Delta}_j(t) = E[\|x^l_{i_\star - L + j - 1}(t) - x^p_{i_\star - L + j - 1}(t)\|^2]$, and observe that $\vec{\Delta}(0) = 0_{(2L+1)\times 1}$. Since $x^l_{i_\star - L - 1}(t) = x^o_{i_\star - L - 1}(t)$ and $x^l_{i_\star + L + 1}(t) = x^o_{i_\star + L + 1}(t)$, we can write (52) in the following vector form:
\[
\vec{\Delta}(t) \preceq \int_0^t M \cdot \vec{\Delta}(s)\,ds + \vec{\delta}(t), \tag{53}
\]
where $M$ is a $(2L+1)\times(2L+1)$ tridiagonal matrix with
\[
M_{1,1} = C_f + C_\sigma + 1, \quad M_{1,2} = C_f, \qquad
M_{j,j} = C_f + C_\sigma + 1, \quad M_{j,j+1} = M_{j,j-1} = C_f \ \text{ for } 2 \le j \le 2L, \qquad
M_{2L+1,2L} = C_f, \quad M_{2L+1,2L+1} = C_f + C_\sigma + 1,
\]
and $\vec{\delta}(t)$ is a $(2L+1)$-dimensional column vector with
\[
\vec{\delta}_1(t) = C_f \int_0^t E[\|x^o_{i_\star - L - 1}(s) - x^p_{i_\star - L - 1}(s)\|^2]\,ds, \qquad
\vec{\delta}_{2L+1}(t) = C_f \int_0^t E[\|x^o_{i_\star + L + 1}(s) - x^p_{i_\star + L + 1}(s)\|^2]\,ds,
\]
while the other entries are 0. According to Lemma 5, we have
\[
\max_{j=1,\ldots,2L+1} |\vec{\delta}_j(t)| \le C_f \int_0^t \|x^o_{i_\star} - x^p_{i_\star}\|^2 e^{C_1(f,\sigma) s} e^{-C_d(L+1)}\,ds
\le \frac{C_f}{C_1(f,\sigma)} \|x^o_{i_\star} - x^p_{i_\star}\|^2 e^{C_1(f,\sigma) t} e^{-C_d(L+1)}. \tag{54}
\]
By Lemma 4, under (53) we have
\[
\vec{\Delta}(t) \preceq \vec{\delta}(t) + \int_0^t e^{M(t-s)} \cdot M \cdot \vec{\delta}(s)\,ds. \tag{55}
\]
For the $(2L+1)$-dimensional column vector $M \cdot \vec{\delta}(t)$, we observe
\[
(M \cdot \vec{\delta}(t))_1 = (C_f + C_\sigma + 1)\,\vec{\delta}_1(t), \quad (M \cdot \vec{\delta}(t))_2 = C_f\,\vec{\delta}_1(t), \qquad
(M \cdot \vec{\delta}(t))_{2L} = C_f\,\vec{\delta}_{2L+1}(t), \quad (M \cdot \vec{\delta}(t))_{2L+1} = (C_f + C_\sigma + 1)\,\vec{\delta}_{2L+1}(t),
\]
while the other entries are 0. Together with (54), we have
\[
\max_{j=1,\ldots,2L+1} |(M \cdot \vec{\delta}(t))_j| \le \frac{C_f (C_f + C_\sigma + 1)}{C_1(f,\sigma)} \|x^o_{i_\star} - x^p_{i_\star}\|^2 e^{C_1(f,\sigma) t} e^{-C_d(L+1)}. \tag{56}
\]
Since $\max_{j,i=1,\ldots,2L+1} M_{j,i} = C_f + C_\sigma + 1$, according to Lemma 3 and the conclusion of Lemma 1 with $C = C_d$, we have
\[
|(e^{M(t-s)})_{j,i}| \le |(e^{Mt})_{j,i}| \le e^{C_1(f,\sigma) t}. \tag{57}
\]
From (55), together with (54), (56) and (57), we obtain
\[
\vec{\Delta}_j(t) \le |\vec{\delta}_j(t)| + \sum_{k \in \{1,\,2,\,2L,\,2L+1\}} \int_0^t |(e^{M(t-s)})_{j,k}|\,|(M \cdot \vec{\delta}(s))_k|\,ds
\le \max_{j=1,\ldots,2L+1} |\vec{\delta}_j(t)| + 4\,e^{C_1(f,\sigma) t} \int_0^t \max_{j=1,\ldots,2L+1} |(M \cdot \vec{\delta}(s))_j|\,ds
\]
\[
\le \frac{C_f}{C_1(f,\sigma)}\,\|x^o_{i_\star} - x^p_{i_\star}\|^2\,e^{C_1(f,\sigma) t}\,e^{-C_d(L+1)} \Big(1 + 4 \int_0^t (C_f + C_\sigma + 1)\,e^{C_1(f,\sigma) s}\,ds\Big)
\le \frac{C_f}{C_1(f,\sigma)}\,\|x^o_{i_\star} - x^p_{i_\star}\|^2\,e^{C_1(f,\sigma) t}\,e^{-C_d(L+1)} \Big(1 + \frac{4\,(e^{C_1(f,\sigma) t} - 1)}{e^{C_d} + e^{-C_d} + 1}\Big)
\]
\[
\le \frac{2\,C_f}{C_1(f,\sigma)}\,\|x^o_{i_\star} - x^p_{i_\star}\|^2\,e^{2 C_1(f,\sigma) t}\,e^{-C_d(L+1)},
\]
where the last inequality uses $e^{C_d} + e^{-C_d} + 1 \ge 2$. Recalling the definition of $\vec{\Delta}_j(t)$, the proof is complete. □
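Although it is not part of the paper's algorithm, the comparison bound (55) is straightforward to evaluate numerically once the tridiagonal matrix $M$ of (53) is assembled. The sketch below does so with placeholder values of $C_f$, $C_\sigma$, $L$ and a placeholder boundary vector $\vec{\delta}(t)$, purely to illustrate the structure of the argument.

```python
import numpy as np
from scipy.linalg import expm

C_f, C_sigma, L = 1.0, 0.5, 3              # placeholder constants
n = 2 * L + 1
# Tridiagonal matrix M of (53): C_f + C_sigma + 1 on the diagonal, C_f next to it.
M = ((C_f + C_sigma + 1) * np.eye(n)
     + C_f * np.eye(n, k=1) + C_f * np.eye(n, k=-1))

def delta(t):
    """Placeholder for delta(t): only the first and last entries are nonzero."""
    d = np.zeros(n)
    d[0] = d[-1] = C_f * t * 1e-3          # stand-in for the integral bounded in (54)
    return d

def rhs_55(t, n_quad=200):
    """Right-hand side of (55): delta(t) + int_0^t exp(M(t-s)) M delta(s) ds."""
    ds = t / n_quad
    grid = (np.arange(n_quad) + 0.5) * ds  # midpoint quadrature nodes
    integral = sum(expm(M * (t - s)) @ (M @ delta(s)) for s in grid) * ds
    return delta(t) + integral

print(rhs_55(0.5))                         # entrywise upper bound for Delta(t) under (53)
```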
Lemma 7
Under the same settings as in Lemma 6, for any given $\epsilon > 0$, if the local domain radius $L$ satisfies
\[
L \ge \frac{\log\big(\epsilon / (C_2(f,\sigma)\,\|x^o_{i_\star} - x^p_{i_\star}\|^2)\big)}{-C_d} + \frac{2\,C_1(f,\sigma)}{C_d}\,T,
\]
then $E[\|x^l_j(t) - x^p_j(t)\|^2] \le \epsilon$ for all $t \le T$ and $j = 1, \ldots, m$.

Proof: According to Lemma 6, it suffices to ensure
\[
C_2(f,\sigma)\,\|x^o_{i_\star} - x^p_{i_\star}\|^2\,e^{2 C_1(f,\sigma) t}\,e^{-C_d(L+1)} \le \epsilon \quad \text{for all } t \in [0, T],
\]
which holds as long as
\[
e^{-C_d(L+1)} \le \frac{\epsilon}{C_2(f,\sigma)\,\|x^o_{i_\star} - x^p_{i_\star}\|^2\,e^{2 C_1(f,\sigma) T}}.
\]
Taking logarithms on both sides, we obtain that the above holds if
\[
L \ge \frac{\log\big(\epsilon / (C_2(f,\sigma)\,\|x^o_{i_\star} - x^p_{i_\star}\|^2)\big)}{-C_d} + \frac{2\,C_1(f,\sigma)}{C_d}\,T.
\]
The proof is finished. □
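As a small illustration of how the bound in Lemma 7 can be used, the sketch below computes a sufficient local domain radius $L$ from the quantities appearing in it. This routine is not part of the paper; the arguments `eps`, `pert_sq` (standing for $\|x^o_{i_\star} - x^p_{i_\star}\|^2$), `C1`, `C2` (for $C_1(f,\sigma)$, $C_2(f,\sigma)$), `C_d` and `T` are placeholders that would have to be supplied or estimated for a concrete model.

```python
import math

def local_radius(eps, pert_sq, C1, C2, C_d, T):
    """Sufficient local domain radius L from the Lemma 7 condition."""
    L = math.log(eps / (C2 * pert_sq)) / (-C_d) + 2.0 * C1 / C_d * T
    return max(0, math.ceil(L))  # the radius is a nonnegative integer

# Example with made-up constants; L grows only logarithmically in 1/eps.
print(local_radius(eps=1e-6, pert_sq=1.0, C1=1.0, C2=5.0, C_d=1.0, T=0.5))
```

For the discrete-time scheme, Lemma 10 gives the same expression with $C_1(f,\sigma)$ replaced by $C_1(f,\sigma)(1+h)$.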
Lemma 8
Under Assumption 1, with $\tilde{x}^o_j(ih)$, $\tilde{x}^p_j(ih)$, for $j = 1, \ldots, m$ and $i = 0, \ldots, T/h$, defined in Section 3.3, we have
\[
E[\|\tilde{x}^o_j(ih) - \tilde{x}^p_j(ih)\|^2] \le e^{-C_d \cdot d(j, i_\star)}\,e^{C_1(f,\sigma)(1+h)\,ih}\,\|x^o_{i_\star} - x^p_{i_\star}\|^2, \tag{58}
\]
where $C_d$ is any given positive constant and $C_1(f,\sigma)$ is defined in (11).

Proof:
The iterates $\tilde{x}^o_j(ih)$ are obtained by iterating
\[
\tilde{x}^o_j((i+1)h) = \tilde{x}^o_j(ih) + \sigma_j(ih, \tilde{x}^o_j(ih))\,\sqrt{h}\,W_{i,j} + f_j(ih, \tilde{x}^o_{j-1}(ih), \tilde{x}^o_j(ih), \tilde{x}^o_{j+1}(ih))\,h, \tag{59}
\]
with initial value $\tilde{x}^o_j(0) = x^o_j$. Comparing it with (16), we have
\[
\tilde{x}^o_j((i+1)h) - \tilde{x}^p_j((i+1)h) = \tilde{x}^o_j(ih) - \tilde{x}^p_j(ih) + \tilde{\Delta}\sigma_j(ih)\,\sqrt{h}\,W_{i,j} + \tilde{\Delta}f_j(ih)\,h, \tag{60}
\]
with
\[
\tilde{\Delta}\sigma_j(ih) = \sigma_j(ih, \tilde{x}^o_j(ih)) - \sigma_j(ih, \tilde{x}^p_j(ih)), \qquad
\tilde{\Delta}f_j(ih) = f_j(ih, \tilde{x}^o_{j-1}(ih), \tilde{x}^o_j(ih), \tilde{x}^o_{j+1}(ih)) - f_j(ih, \tilde{x}^p_{j-1}(ih), \tilde{x}^p_j(ih), \tilde{x}^p_{j+1}(ih)).
\]
Since $\tilde{x}^o_j(ih)$ and $\tilde{x}^p_j(ih)$ are independent of $W_{i,j}$, together with Assumption 1 and the Cauchy-Schwarz inequality, from (60) we have
\[
E[\|\tilde{x}^o_j((i+1)h) - \tilde{x}^p_j((i+1)h)\|^2]
= E[\|\tilde{x}^o_j(ih) - \tilde{x}^p_j(ih)\|^2] + E[\|\tilde{\Delta}\sigma_j(ih)\,\sqrt{h}\,W_{i,j}\|^2] + E[\|\tilde{\Delta}f_j(ih)\,h\|^2] + 2\,E[(\tilde{x}^o_j(ih) - \tilde{x}^p_j(ih))^T \tilde{\Delta}f_j(ih)]\,h
\]
\[
\le E[\|\tilde{x}^o_j(ih) - \tilde{x}^p_j(ih)\|^2] + E[\|\tilde{\Delta}\sigma_j(ih)\|^2]\,h + E[\|\tilde{\Delta}f_j(ih)\|^2]\,h^2 + \big(E[\|\tilde{x}^o_j(ih) - \tilde{x}^p_j(ih)\|^2] + E[\|\tilde{\Delta}f_j(ih)\|^2]\big)\,h
\]
\[
\le (C_f h + C_f h^2)\,E[\|\tilde{x}^o_{j-1}(ih) - \tilde{x}^p_{j-1}(ih)\|^2] + \big(1 + (C_\sigma + C_f + 1)h + C_f h^2\big)\,E[\|\tilde{x}^o_j(ih) - \tilde{x}^p_j(ih)\|^2] + (C_f h + C_f h^2)\,E[\|\tilde{x}^o_{j+1}(ih) - \tilde{x}^p_{j+1}(ih)\|^2]. \tag{61}
\]
Equivalently, we have
\[
\vec{\Delta}((i+1)h) \preceq (I_m + M) \cdot \vec{\Delta}(ih), \tag{62}
\]
where $\vec{\Delta}(ih)$ is defined to be an $m$-dimensional column vector with $j$-th entry $\vec{\Delta}_j(ih) = E[\|\tilde{x}^o_j(ih) - \tilde{x}^p_j(ih)\|^2]$ for $i = 0, \ldots, T/h$; $M$ is an $m \times m$ matrix with
\[
M_{1,1} = (C_\sigma + C_f + 1)h + C_f h^2, \quad M_{1,2} = C_f h + C_f h^2, \quad M_{1,m} = C_f h + C_f h^2,
\]
\[
M_{j,j-1} = C_f h + C_f h^2, \quad M_{j,j} = (C_\sigma + C_f + 1)h + C_f h^2, \quad M_{j,j+1} = C_f h + C_f h^2, \quad \text{for } j = 2, \ldots, m-1,
\]
\[
M_{m,1} = C_f h + C_f h^2, \quad M_{m,m-1} = C_f h + C_f h^2, \quad M_{m,m} = (C_\sigma + C_f + 1)h + C_f h^2,
\]
and the other entries are 0. After iterating (62) for $i$ times, we have
\[
\vec{\Delta}(ih) \preceq (I_m + M)^i \cdot \vec{\Delta}(0), \tag{63}
\]
and for $\vec{\Delta}(0)$, we know $\vec{\Delta}_{i_\star}(0) = \|x^o_{i_\star} - x^p_{i_\star}\|^2$ while the other entries are 0. We observe
\[
I_m + M \preceq e^M; \tag{64}
\]
together with Lemma 1, $\max_{j,k=1,\ldots,m} |(iM)_{j,k}| = \big((C_\sigma + C_f + 1) + C_f h\big)\,ih$ and (41), we have, for $j = 1, \ldots, m$,
\[
|((I_m + M)^i)_{j,i_\star}| \le |((e^M)^i)_{j,i_\star}| = |(e^{iM})_{j,i_\star}| \le e^{-C_d \cdot d(j,i_\star)}\,e^{C_1(f,\sigma)(1+h)\,ih}. \tag{65}
\]
Substituting (65) into (63) results in (58). □
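For concreteness, here is a minimal sketch of the full Euler-Maruyama recursion (59). The state is stored as a length-$m$ array with periodic neighbor indexing, and the callables `f(t, xm, x0, xp)` and `sigma(t, x0)` stand in for $f_j$ and $\sigma_j$; these names and the vectorized layout are illustrative conventions, not taken from the paper.

```python
import numpy as np

def euler_maruyama_step(x, t, h, f, sigma, rng):
    """One step of the full recursion (59) for the locally interacting system."""
    w = rng.standard_normal(x.shape)     # the noise W_{i,j}
    x_minus = np.roll(x, 1)              # x_{j-1}, periodic indexing
    x_plus = np.roll(x, -1)              # x_{j+1}, periodic indexing
    return x + f(t, x_minus, x, x_plus) * h + sigma(t, x) * np.sqrt(h) * w

# Toy usage: a nearest-neighbor linear drift and constant diffusion.
rng = np.random.default_rng(0)
x = rng.standard_normal(40)
f = lambda t, xm, x0, xp: xm - 2.0 * x0 + xp
sigma = lambda t, x0: 0.1 * np.ones_like(x0)
x_next = euler_maruyama_step(x, t=0.0, h=0.01, f=f, sigma=sigma, rng=rng)
```

Every such step touches all $m$ components; the local surrogate analyzed next avoids this by updating only the window $B_{i_\star}$.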
Lemma 9
Under Assumption 1, with $\tilde{x}^o_j(ih)$, $\tilde{x}^l_j(ih)$, $\tilde{x}^p_j(ih)$, for $j = 1, \ldots, m$ and $i = 0, \ldots, T/h$, defined in Section 3.3, we have
\[
E[\|\tilde{x}^l_j(ih) - \tilde{x}^p_j(ih)\|^2] \le C_2(f,\sigma)\,e^{2 C_1(f,\sigma)(1+h)\,ih}\,e^{-C_d(L+1)}\,\|x^o_{i_\star} - x^p_{i_\star}\|^2, \tag{66}
\]
where $C_d$ is any given positive constant, and $C_1(f,\sigma)$ and $C_2(f,\sigma)$ are defined in (11) and (14).

Proof:
Suppose $j \in B_{i_\star}^c$. Since $d(j, i_\star) \ge L+1$ and $\tilde{x}^l_j(ih) = \tilde{x}^o_j(ih)$, the result follows from Lemma 8.

Suppose $j \in B_{i_\star}$. Comparing (16) and (18), we have
\[
\tilde{x}^l_j((i+1)h) - \tilde{x}^p_j((i+1)h) = \tilde{x}^l_j(ih) - \tilde{x}^p_j(ih) + \tilde{\Delta}\sigma_j(ih)\,\sqrt{h}\,W_{i,j} + \tilde{\Delta}f_j(ih)\,h, \tag{67}
\]
with
\[
\tilde{\Delta}\sigma_j(ih) = \sigma_j(ih, \tilde{x}^l_j(ih)) - \sigma_j(ih, \tilde{x}^p_j(ih)), \qquad
\tilde{\Delta}f_j(ih) = f_j(ih, \tilde{x}^l_{j-1}(ih), \tilde{x}^l_j(ih), \tilde{x}^l_{j+1}(ih)) - f_j(ih, \tilde{x}^p_{j-1}(ih), \tilde{x}^p_j(ih), \tilde{x}^p_{j+1}(ih)).
\]
Here $\tilde{x}^l_k(ih) = \tilde{x}^o_k(ih)$ for $k \in B_{i_\star}^c$ during the evolution of $\tilde{x}^l(ih)$. Since (67) is similar to (60), according to (61) we have
\[
E[\|\tilde{x}^l_j((i+1)h) - \tilde{x}^p_j((i+1)h)\|^2]
\le (C_f h + C_f h^2)\,E[\|\tilde{x}^l_{j-1}(ih) - \tilde{x}^p_{j-1}(ih)\|^2] + \big(1 + (C_\sigma + C_f + 1)h + C_f h^2\big)\,E[\|\tilde{x}^l_j(ih) - \tilde{x}^p_j(ih)\|^2] + (C_f h + C_f h^2)\,E[\|\tilde{x}^l_{j+1}(ih) - \tilde{x}^p_{j+1}(ih)\|^2].
\]
Namely, for the $(2L+1)$-dimensional column vector $\vec{\Delta}(ih)$ whose $j$-th entry is $\vec{\Delta}_j(ih) = E[\|\tilde{x}^l_{i_\star - L + j - 1}(ih) - \tilde{x}^p_{i_\star - L + j - 1}(ih)\|^2]$ for $j = 1, \ldots, 2L+1$, we have
\[
\vec{\Delta}((i+1)h) \preceq (I_{2L+1} + M) \cdot \vec{\Delta}(ih) + \vec{\delta}(ih), \tag{68}
\]
where $M$ is a $(2L+1) \times (2L+1)$ tridiagonal matrix with
\[
M_{1,1} = (C_\sigma + C_f + 1)h + C_f h^2, \quad M_{1,2} = C_f h + C_f h^2,
\]
\[
M_{j,j-1} = C_f h + C_f h^2, \quad M_{j,j} = (C_\sigma + C_f + 1)h + C_f h^2, \quad M_{j,j+1} = C_f h + C_f h^2, \quad \text{for } 2 \le j \le 2L,
\]
\[
M_{2L+1,2L} = C_f h + C_f h^2, \quad M_{2L+1,2L+1} = (C_\sigma + C_f + 1)h + C_f h^2,
\]
and $\vec{\delta}(ih)$ is a $(2L+1)$-dimensional vector with
\[
\vec{\delta}_1(ih) = (C_f + C_f h)\,h\,E[\|\tilde{x}^o_{i_\star - L - 1}(ih) - \tilde{x}^p_{i_\star - L - 1}(ih)\|^2], \qquad
\vec{\delta}_{2L+1}(ih) = (C_f + C_f h)\,h\,E[\|\tilde{x}^o_{i_\star + L + 1}(ih) - \tilde{x}^p_{i_\star + L + 1}(ih)\|^2],
\]
while the other entries are 0. After iterating (68) for $i$ times, we obtain
\[
\vec{\Delta}(ih) \preceq \sum_{k=0}^{i-1} (I_{2L+1} + M)^{i-1-k} \cdot \vec{\delta}(kh) + (I_{2L+1} + M)^i \cdot \vec{\Delta}(0), \tag{69}
\]
and we see $\vec{\Delta}(0) = 0_{(2L+1)\times 1}$ from the definition of $\tilde{x}^l_j(ih)$ and $\tilde{x}^p_j(ih)$. According to (64) and (41), together with Lemma 1 and $\max_{j,k=1,\ldots,2L+1} |(iM)_{j,k}| = \big((C_\sigma + C_f + 1) + C_f h\big)\,ih$, we have, for $l = 1, \ldots, 2L+1$,
\[
|((I_{2L+1} + M)^{i-1-k})_{j,l}| \le |((e^M)^{i-1-k})_{j,l}| = |(e^{(i-1-k)M})_{j,l}| \le e^{-C_d \cdot d(j,l)}\,e^{C_1(f,\sigma)(1+h)(i-1-k)h} \le e^{C_1(f,\sigma)(1+h)(i-1-k)h}. \tag{70}
\]
According to Lemma 8, we have
\[
\max\{|\vec{\delta}_1(kh)|,\ |\vec{\delta}_{2L+1}(kh)|\} \le (C_f h + C_f h^2)\,e^{-C_d(L+1)}\,e^{C_1(f,\sigma)(1+h)\,kh}\,\|x^o_{i_\star} - x^p_{i_\star}\|^2. \tag{71}
\]
By applying Taylor's theorem, we have
\[
\sum_{k=0}^{i-1} e^{C_1(f,\sigma)(1+h)\,kh} = \frac{e^{C_1(f,\sigma)(1+h)\,ih} - 1}{e^{C_1(f,\sigma)(1+h)\,h} - 1} \le \frac{e^{C_1(f,\sigma)(1+h)\,ih} - 1}{C_1(f,\sigma)(1+h)\,h}. \tag{72}
\]
Substituting (70), (71), (72) into (69), we have, for $j = 1, \ldots, 2L+1$,
\[
\vec{\Delta}_j(ih) \le \sum_{k=0}^{i-1} ((I_{2L+1} + M)^{i-1-k})_{j,1}\,\vec{\delta}_1(kh) + \sum_{k=0}^{i-1} ((I_{2L+1} + M)^{i-1-k})_{j,2L+1}\,\vec{\delta}_{2L+1}(kh)
\]
\[
\le 2\,e^{C_1(f,\sigma)(1+h)\,ih}\,e^{-C_d(L+1)}\,\|x^o_{i_\star} - x^p_{i_\star}\|^2\,(C_f h + C_f h^2)\,\sum_{k=0}^{i-1} e^{C_1(f,\sigma)(1+h)\,kh}
\le \frac{2\,C_f}{C_1(f,\sigma)}\,e^{2 C_1(f,\sigma)(1+h)\,ih}\,e^{-C_d(L+1)}\,\|x^o_{i_\star} - x^p_{i_\star}\|^2.
\]
Recalling the definition of $\vec{\Delta}_j(ih)$, the proof is complete. □
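The difference between the three discrete trajectories is easy to state in code: $\tilde{x}^p$ re-integrates every component from the perturbed initial value, while the local surrogate $\tilde{x}^l$ re-integrates only the components in $B_{i_\star}$ and copies $\tilde{x}^o$ elsewhere, reusing the same noise $W_{i,j}$. The sketch below illustrates this for the Euler-Maruyama scheme; here `f` and `sigma` are evaluated componentwise, and the array conventions are again only illustrative, not taken from the paper.

```python
import numpy as np

def local_euler_maruyama(x_o_path, w_path, i_star, L, x_p_istar, h, f, sigma):
    """Local surrogate trajectory: only the window B_{i*} = {i*-L, ..., i*+L} is re-integrated.

    x_o_path  : array (N+1, m), precomputed full trajectory x^o(ih)
    w_path    : array (N, m), the noise W_{i,j} used to generate x_o_path
    x_p_istar : perturbed value replacing x^o_{i*} at time 0
    """
    N, m = w_path.shape
    window = [(i_star + k) % m for k in range(-L, L + 1)]
    x_l = x_o_path.copy()                  # outside B_{i*}, x^l(ih) = x^o(ih) by construction
    x_l[0, i_star] = x_p_istar             # locally perturbed initial value
    for i in range(N):
        t, cur = i * h, x_l[i]
        for j in window:                   # only 2L+1 components are updated per step
            jm, jp = (j - 1) % m, (j + 1) % m
            x_l[i + 1, j] = (cur[j] + f(t, cur[jm], cur[j], cur[jp]) * h
                             + sigma(t, cur[j]) * np.sqrt(h) * w_path[i, j])
    return x_l
```

Lemma 9 quantifies how far this surrogate can drift from the fully re-integrated perturbed trajectory $\tilde{x}^p$, and Lemma 10 turns that estimate into a choice of $L$.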
Lemma 10
Under the same settings as in Lemma 9, given any positive constant $\epsilon$, if
\[
L \ge \frac{\log\big(\epsilon / (C_2(f,\sigma)\,\|x^o_{i_\star} - x^p_{i_\star}\|^2)\big)}{-C_d} + \frac{2\,C_1(f,\sigma)(1+h)}{C_d}\,T,
\]
then $E[\|\tilde{x}^l_j(ih) - \tilde{x}^p_j(ih)\|^2] \le \epsilon$ for $j = 1, \ldots, m$ and $i = 0, \ldots, T/h$.

Proof:
According to Lemma 9, we only need to ensure
\[
C_2(f,\sigma)\,e^{2 C_1(f,\sigma)(1+h)\,ih}\,e^{-C_d(L+1)}\,\|x^o_{i_\star} - x^p_{i_\star}\|^2 \le \epsilon \quad \text{for } i = 0, \ldots, T/h,
\]
which holds as long as
\[
e^{-C_d(L+1)} \le \frac{\epsilon}{C_2(f,\sigma)\,e^{2 C_1(f,\sigma)(1+h) T}\,\|x^o_{i_\star} - x^p_{i_\star}\|^2}.
\]
Taking logarithms on both sides, we obtain that this holds if
\[
L \ge \frac{\log\big(\epsilon / (C_2(f,\sigma)\,\|x^o_{i_\star} - x^p_{i_\star}\|^2)\big)}{-C_d} + \frac{2\,C_1(f,\sigma)(1+h)}{C_d}\,T.
\]
The proof is complete. □

Proofs of propositions and theorems

The proof of Proposition 1:
It is a direct result of Lemma 5. □
The proof of Theorem 1:
The conclusions are direct results of Lemmas 6 and 7. □
The proof of Theorem 2:
The conclusions are direct results of Lemmas 9 and 10. □